MIT Art Design and Technology University
MIT School of Computing, Pune
23CSE3007 –Data Mining & Analytics
Class - T.Y. (SEM-V), AIA
Unit - II Data Preprocessing and Classification
Prof. Dr. Ranjana Kale
AY 2025-2026 SEM-V
Unit II - Syllabus
Unit II – Data Preprocessing and Classification
09 hours
Data cleaning and preprocessing techniques, Handling missing values and
outliers, normalization, data transformation and standardization: Log and
exponential transform, normal transform. Classification and prediction,
Association Rule mining, Bayesian classification: Naïve- Bayes classifier, Model
selection and model evaluation for Classifier
Data Pre-processing
Data Preprocessing Techniques
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Data Cleaning
• process of identifying and correcting errors or inconsistencies in the
dataset
• involves handling missing values, removing duplicates, and
correcting incorrect or outlier data to ensure the dataset is accurate
and reliable.
• Clean data is essential for effective analysis, as it improves the
quality of results and enhances the performance of data models.
Handling missing values and outliers
• Missing values occur when data is absent from a dataset.
• We can either ignore the rows with missing data or fill the gaps
manually, with the attribute mean, or by using the most probable
value.
• This ensures the dataset remains accurate and complete for
analysis.
To handle missing values
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant
• Use attribute mean
• Use the most probable value calculated by decision tree, Bayesian
formalism
• Use a measure of central tendency for the attribute (e.g., the mean
or median)
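A minimal Pandas/Scikit-learn sketch of these options (not part of the original notes; the column names and values are made up for illustration):

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan],
                   "income": [30000, 42000, np.nan, 58000, 61000]})

# Option 1: ignore the tuple (drop rows that contain missing values)
dropped = df.dropna()

# Option 2: fill with a measure of central tendency (here, the attribute mean)
filled = df.fillna(df.mean())

# Option 3: Scikit-learn imputer; strategy can be "mean", "median" or "most_frequent"
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
print(imputed)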
To handle Noisy Data or outliers
• Noise is a random error or variance in a measured variable.
• Noisy data and outliers are irrelevant or incorrect values that are difficult to
interpret, often caused by errors in data collection or entry. They can be handled
in several ways:
• 1. Binning Method: The data is sorted and partitioned into equal-frequency segments (bins), and each bin is
smoothed by replacing its values with the bin mean or the bin boundary values.
• 2. Regression: Data can be smoothed by fitting it to a regression function, either linear or
multiple, to predict values.
• Linear regression involves finding the “best” line to fit two attributes (or variables) so that
one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface
• 3. Outlier analysis by Clustering
Binning
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is,
the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
• Because binning methods consult the neighborhood of values, they perform local
smoothing.
• For example (Fig. 3.2 in the textbook), the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
• In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin.
• For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
• Therefore, each original value in this bin is replaced by the value 9.
• Similarly, smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries.
• Each bin value is then replaced by the closest boundary value.
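A minimal Python sketch of smoothing by bin means and by bin boundaries, using the sorted price values from the textbook example (Bin 1 = 4, 8, 15 as quoted above; the remaining values follow the same textbook figure):

# sorted data for price, partitioned into equal-frequency bins of size 3
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# smoothing by bin means: every value in a bin is replaced by the bin mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# smoothing by bin boundaries: each value is replaced by the closest boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]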
• Outlier analysis by Clustering: This
method groups similar data points
together, with outliers either being
undetected or falling outside the
clusters. These techniques help
remove noise and improve data
quality.
• Removing Duplicates:
• It involves identifying and eliminating repeated data entries to
ensure accuracy and consistency in the dataset.
• This process prevents errors and ensures reliable analysis by
keeping only unique records.
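A minimal Pandas sketch of removing duplicate records (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 2, 3],
                   "city": ["Pune", "Mumbai", "Mumbai", "Delhi"]})
deduped = df.drop_duplicates()                                          # drop fully identical rows
deduped_by_id = df.drop_duplicates(subset="customer_id", keep="first")  # one row per customer
print(deduped_by_id)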
• Data decay, also known as data degradation, refers to the gradual
deterioration of data quality over time.
• It is a common phenomenon that affects all types of data, including
customer data, marketing data, and business data
• As data ages, it can become outdated or inaccurate due to a variety of
factors, such as changes in customer behavior, business operations, or
market trends.
• For example, contact information for customers may change as they
switch jobs, or product prices may fluctuate over time.
Data Integration
• It involves merging data from various sources into a single, unified dataset.
• It can be challenging due to differences in data formats, structures, and meanings.
• Techniques like record linkage and data fusion help in combining data efficiently,
ensuring consistency and accuracy.
• Record Linkage is the process of identifying and matching records from different
datasets that refer to the same entity, even if they are represented differently. It helps in
combining data from various sources by finding corresponding records based on
common identifiers or attributes.
• Data Fusion involves combining data from multiple sources to create a more
comprehensive and accurate dataset. It integrates information that may be inconsistent or
incomplete from different sources, ensuring a unified and richer dataset for analysis.
Data Transformation
• It involves converting data into a format suitable for analysis.
• It is an important step in the data analysis process that involves the conversion,
cleaning, and organizing of data into accessible formats. It ensures that the
information is accessible, consistent, secure, and readily usable by the
intended business users.
• Common techniques include
• Normalization (normal transform), which scales data to a common range: the data values
are scaled into a specified range such as -1.0 to 1.0 or 0.0 to 1.0
• Standardization, which adjusts data to have zero mean and unit variance;
• Discretization: Converting continuous data into discrete categories for easier analysis.
Discretization techniques include binning, histogram analysis, cluster analysis, decision
tree analysis, and correlation analysis.
• Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
• Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis. For Example-The
attribute “city” can be converted to “country”.
• These techniques help prepare the data for more accurate analysis.
Data Normalization
• used in data mining to transform the values of a dataset into a
common scale.
• This is important because many machine learning algorithms are
sensitive to the scale of the input features and can produce better
results when the data is normalized.
• Normalization is used to scale the data of an attribute so that it falls
in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
• It is generally useful for classification algorithms.
Need of Data Normalization
• Normalization is generally required when attributes are measured on different
scales; otherwise an equally important attribute with values on a smaller scale
can be diluted by attributes with values on a larger scale.
• When multiple attributes have values on different scales, data mining models
built on them may perform poorly.
• The attributes are therefore normalized to bring them all onto the same scale.
Methods of Data Normalization
• 1. Min-Max Normalization-
• A linear transformation is performed on the original data. The minimum and maximum values of the
attribute are found and each value is replaced according to the following formula:
  v' = ((v - min(A)) / (max(A) - min(A))) × (new_max(A) - new_min(A)) + new_min(A)
• min(A) - the minimum value of attribute A.
• max(A) - the maximum value of attribute A.
• v' - the new value of each attribute entry.
• v - the old value of each attribute entry.
• new_max(A), new_min(A) - the maximum and minimum values of the required range
• (i.e., the boundary values of the range) respectively.
• Min-Max Normalization thus maps a value v of A to v' in the range
• [new_min(A), new_max(A)].
example
• 1000,2000,3000,9000 normalize to min:0 , max:1
• new_max(A)=1 , as given in question- max=1
• new_min(A)=0, as given in question- min=0
• max(A)=9000,
• min(A)=1000,
• v = 2000; putting all values in the formula, we get
  v' = ((2000 - 1000) / (9000 - 1000)) × (1 - 0) + 0 = 0.125
• v = 9000; putting all values in the formula, we get
  v' = ((9000 - 1000) / (9000 - 1000)) × (1 - 0) + 0 = 1
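A minimal Python sketch reproducing the worked example above (1000, 2000, 3000, 9000 scaled to [0, 1]), with Scikit-learn's MinMaxScaler shown for comparison:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([1000.0, 2000.0, 3000.0, 9000.0])
new_min, new_max = 0.0, 1.0

# direct application of the min-max formula
scaled = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min
print(scaled)                                    # [0.    0.125 0.25  1.   ]

# same result with Scikit-learn
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(values.reshape(-1, 1)).ravel())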
• 2. Normalization by Decimal Scaling
• Each value v of A is normalized by moving the decimal point: v' = v / 10^j,
where j is the smallest integer such that max(|v'|) < 1.
• For example, if the values of A range up to 917, then j = 3 and 917 is normalized to 0.917.
3. Standardization or z-Score Normalization(zero-mean)
• It is used to transform data into a standard normal distribution, ensuring that all
features are on the same scale.
• This process helps to avoid the dominance of certain features over others due to
differences in their scales, which can significantly impact the performance of
model.
In this technique, values are normalized based on the mean and standard deviation
of attribute A.
The formula used is:
  v' = (v - mean(A)) / σA
where v and v' are the old and new values of each entry, and mean(A) and σA are the
mean and standard deviation of A, respectively.
Why Z-Score Normalization is Necessary?
• Normalization is necessary when dealing with features on different scales.
Without normalization, features with larger scales can dominate those with
smaller scales, leading to biased results in machine learning models.
• Z-score normalization addresses this issue by scaling the data based on its
statistical properties, making it particularly useful for algorithms that rely
on distance calculations, such as K-nearest neighbors and clustering
algorithms
• Z-score normalization ensures that all features are treated equally,
enabling the algorithm to identify meaningful patterns and relationships
more effectively.
70, 80, 90, 100, 110 — find the Z-scores
• Mean = (70 + 80 + 90 + 100 + 110)/5 = 90
• Standard deviation = sqrt(((70-90)² + (80-90)² + (90-90)² + (100-90)² + (110-90)²) / 5)
• = sqrt(200) ≈ 14.14214
• Z-score of 80 = (80 - 90)/14.14214 = -0.70710678
• Z-scores: [-1.41421356, -0.70710678, 0, 0.70710678, 1.41421356]
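A minimal Python sketch reproducing the z-score example above (population standard deviation, i.e. dividing by n, as in the slide):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([70.0, 80.0, 90.0, 100.0, 110.0])
z = (x - x.mean()) / x.std()     # np.std uses the population formula by default
print(z)                         # [-1.41421356 -0.70710678  0.  0.70710678  1.41421356]

# StandardScaler gives the same result on a single column
print(StandardScaler().fit_transform(x.reshape(-1, 1)).ravel())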
Concept Hierarchy Generation for nominal Data
• Organizing data into a hierarchy of concepts
to provide a higher-level view for better
understanding and analysis.
• For Example-The attribute “city” can be
converted to “country”.
• attributes such as street can be generalized
to higher-level concepts, like city or country.
• Many hierarchies for nominal attributes are
implicit within the database schema and can
be automatically defined at the schema
definition level.
Importance of Data Transformation
• Data transformation is important because it improves data quality, compatibility, and utility.
The procedure is critical for companies and organizations that depend on data to make informed
decisions because it assures the data's accuracy, reliability, and accessibility across many systems and
applications.
• Improved Data Quality: Data transformation eliminates mistakes, fills in missing information, and
standardizes formats, resulting in higher-quality, more dependable, and accurate data.
• Enhanced Compatibility: By converting data into a suitable format, companies may avoid possible
compatibility difficulties when integrating data from many sources or systems.
• Simplified Data Management: Data transformation is the process of evaluating and modifying data to
maximize storage and discoverability, making it simpler to manage and maintain.
• Broader Application: Transformed data is more useable and applicable in a larger variety of scenarios,
allowing enterprises to get the most out of their data.
• Faster Queries: By standardizing data and appropriately storing it in a warehouse, query performance
and BI tools may be enhanced, resulting in less friction during analysis.
Data Transformation Techniques and Tools
• Programmatic Transformation: Automated using scripts in Python, R, or SQL.
• ETL Tools: Automate extract-transform-load processes for large-scale data (e.g., Talend,
Informatica).
• Normalization/Standardization: Use MinMaxScaler, StandardScaler from Scikit-learn.
• Categorical Encoding: one-hot encoding with get_dummies (Pandas), LabelEncoder (Scikit-learn).
• Missing-Value Imputation: fillna (Pandas); SimpleImputer (Scikit-learn) for mean/median/mode imputation.
• Feature Engineering: Create new features using apply, map, transform in Pandas.
• Aggregation/Grouping: Use groupby in Pandas for sum, mean, count, etc.
• Text Preprocessing: Tokenization, stemming, stop-word removal using NLTK and SpaCy.
• Dimensionality Reduction: Use Scikit-learn's PCA and TruncatedSVD.
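A minimal sketch of two of the tools listed above, one-hot encoding with pandas.get_dummies and aggregation with groupby (the column names are made up):

import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune", "Delhi"],
                   "sales": [120, 200, 150, 90]})

one_hot = pd.get_dummies(df, columns=["city"])               # adds city_Delhi, city_Mumbai, city_Pune
by_city = df.groupby("city")["sales"].agg(["sum", "mean", "count"])
print(one_hot)
print(by_city)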
Log transformation
• Log transformation is a way to change data that has very large
numbers, very small numbers or a skewed shape. It works by taking
the logarithm of each number in the data which helps to
“compress” the large values and spread out the small ones.
• We use log transformation when:
• Some numbers are extremely large compared to others.
• Data is not following a normal (bell-shaped) distribution.
• We want to make patterns easier to spot for analysis.
• Log transformation shrinks big numbers and stretches small
numbers so the data becomes more balanced.
• Log transformation helps by:
• Making the data more symmetrical (normal-shaped).
• Making patterns between variables clearer.
• Making predictions more stable and reliable.
• Reducing the effect of extreme values (outliers).
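A minimal NumPy sketch of a log transform on right-skewed data; log1p (log(1 + x)) is used here so that zero values do not break the transform:

import numpy as np

skewed = np.array([1, 2, 3, 5, 10, 100, 1000, 10000])
logged = np.log1p(skewed)
print(logged.round(2))     # large values are compressed, small values are spread out
# the inverse transform np.expm1(logged) recovers the original values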
Exponential transform
• a technique used to convert data values by applying an exponential
function, often to address skewness or improve model
performance.
• This transformation can make data distributions more symmetric
and linearize relationships, especially when dealing with
exponential growth or decay patterns.
• It's particularly useful when dealing with skewed data or when
certain statistical methods require data to be normally distributed.
• Y= e^x
• Benefits:
• Addressing Skewness: It can reduce the impact of outliers and
make the data distribution more symmetrical.
• Stabilizing Variance: It can make the variability of data more
consistent across different ranges.
• Linearizing Relationships: It can transform non-linear relationships
into linear ones, making them easier to model.
• When to use:
• When data exhibits a skewed distribution.
• When data has exponential growth or decay patterns.
• When statistical methods require normally distributed data.
• Example:
• If you have stock prices that tend to grow exponentially, applying a
logarithmic transformation (the inverse of the exponential transform)
can help stabilize the variance and make it easier to model trends.
Data Reduction
• It reduces the dataset's size while maintaining key information.
• This can be done through feature selection, which chooses the most
relevant features, and feature extraction, which transforms the data
into a lower-dimensional space while preserving important details.
It uses various reduction techniques such as,
1. Dimensionality Reduction (e.g., Principal Component Analysis)
2. Numerosity Reduction
3. Wavelet Transform
4. Data Compression
1. Dimensionality Reduction
• datasets with too many features can cause issues like slow computation
and overfitting.
• Dimensionality reduction helps to reduce the number of features while
retaining key information.
• Techniques like principal component analysis (PCA), singular value
decomposition (SVD) and linear discriminant analysis (LDA) convert data
into a lower-dimensional space while preserving important details.
• Example: suppose you are building a model to predict house prices with features
like bedrooms, square footage, and location. If you add too many features, such
as room condition or flooring type, the dataset becomes large and complex, and
dimensionality reduction helps simplify it.
• Feature Selection
• The most relevant features that contribute the most to the
target variable are selected and irrelevant data are removed.
• This involves selecting a subset of relevant features from the
dataset.
• Feature selection is often performed to remove irrelevant or
redundant features from the dataset.
• It can be done using various techniques such as correlation
analysis, mutual information, and principal component
analysis (PCA).
• Feature Extraction
• This involves transforming the data into a lower-dimensional
space while preserving the important information.
• Feature extraction is often used when the original features
are high-dimensional and complex.
• It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
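A minimal Scikit-learn sketch of feature extraction with PCA on synthetic data (the data and the choice of 2 components are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                               # 100 tuples, 4 attributes
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)     # make one attribute redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)            # lower-dimensional representation
print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # share of variance retained by each component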
2. Numerosity Reduction:
• Numerosity reduction is a technique used in data mining to reduce
the number of data points in a dataset while still preserving the
most important information.
• This can be beneficial in situations where the dataset is too large to
be processed efficiently, or where the dataset contains a large
amount of irrelevant or redundant data points.
Techniques for Numerosity Reduction
• Data Sampling: This technique involves selecting a subset of the data points to work
with, rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
• Clustering: This technique involves grouping similar data points together and then
representing each group by a single representative data point.
• Data Aggregation: This technique involves combining multiple data points into a single
data point by applying a summarization function.
• Data Generalization: This technique involves replacing a data point with a more general
data point that still preserves the important information.
• Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset
3. Wavelet Transforms
• The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to
a data vector X, transforms it to a numerically different vector, X’, of wavelet coefficients.
• The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X={x1,x2, … ,xn}, depicting n
measurements made on the tuple from n database attributes.
• For example, all wavelet coefficients larger than some user-specified threshold can be retained. All
other coefficients are set to 0.
• The resulting data representation is therefore very sparse, so that operations that can take
advantage of data sparsity are computationally very fast if performed in wavelet space.
• The technique also works to remove noise without smoothing out the main features of the data,
making it effective for data cleaning as well.
• Given a set of coefficients, an approximation of the original data can be constructed by applying
the inverse of the DWT used.
4. Data Compression:
• Reducing the size of data by encoding it in a more compact form, making it easier to
store and process
• Lossless Compression -
• Encoding techniques (Run Length Encoding) allow a simple and minimal data size
reduction. Lossless data compression uses algorithms to restore the precise original
data from the compressed data.
• Lossy Compression -
• Methods such as the Discrete Wavelet transform technique, PCA (principal component
analysis) are examples of this compression. For e.g., the JPEG image format is a lossy
compression, but we can find the meaning equivalent to the original image. In
lossy-data compression, the decompressed data may differ from the original data but
are useful enough to retrieve information from them.
Data Compression
[Figure: lossless compression restores the original data exactly; lossy compression yields an approximation of the original data.]
Importance of data preprocessing
• Data preprocessing is an important step in the data process.
• It refers to the cleaning, transforming, and integrating of data in order to make it ready
for analysis.
• The goal of data preprocessing is to improve the quality of the data and to make it more
suitable.
• It improves accuracy and reliability:
• Preprocessing data removes missing or inconsistent data values resulting from human
or computer error, which can improve the accuracy and quality of a dataset, making it
more reliable.
• It makes data consistent:
• When collecting data, it's possible to have data duplicates, and discarding them during
preprocessing can ensure the data values for analysis are consistent, which helps
produce accurate results.
Importance of data preprocessing
• It increases the data's algorithm readability. Preprocessing enhances the data's quality
and makes it easier for machine learning algorithms to read, use, and interpret it.
• Increasing productivity: Clean data can increase overall productivity.
• Improving customer relations: Clean data can lead to fewer errors, which can improve
customer relations.
• Enabling better understanding of the audience: Clean data can help businesses build a
better understanding of their audience.
• Helping people understand data: Data visualization helps people see, interact with, and
better understand data.
• Enabling models to extract meaningful patterns: Clean data ensures that training data
mirrors reality, which enables models to extract meaningful patterns.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data is unknown
• Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
General Approach to Classification
• Data classification is a two-step process
• 1. a learning step (where a classification model is constructed)
• 2. a classification step (where the model is used to predict class labels for
given data)
• In the first step, a classifier is built describing a predetermined set of data
classes or concepts.
• classification algorithm builds the classifier by analyzing or “learning
from” a training set made up of database tuples and their associated
class labels.
• A tuple, X, is an attribute vector, X = (x1, x2, …, xn), depicting n
measurements made on the tuple from n database attributes,
respectively, A1, A2, …, An.
• Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class label
• class label attribute is discrete-valued and unordered.
• It is categorical (or nominal) in that each value serves as a
category or class.
• The individual tuples making up the training set are referred to
as training tuples
• data tuples can be referred to as samples, examples,
instances, data points, or objects.
• This first step of the classification process can also be viewed as the
learning of a mapping or function, y = f(X), that can predict the
associated class label y of a given tuple X.
a. Learning
b. Classification
• Typical applications
• Credit/loan approval
• target marketing, performance prediction,
• manufacturing
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Prediction
• Prediction estimates something that may happen in the future.
• In prediction, we identify or predict the missing or unavailable value
for a new observation based on the previous data that we have and on
assumptions about the future.
• In prediction, the output is a continuous value
• The model used to predict the unknown value is called a predictor
• The accuracy of the predictor refers to how well a given predictor
can guess the value of the predicted attribute for new data.
What Is Association Rule Mining?
• Association rule mining:
• Association rule mining is a technique used in data mining to discover frequent
patterns, relationships or associations among a set of items in large datasets.
• It is particularly useful in market basket analysis, where the goal is to understand
customer purchasing behavior by finding associations between different products.
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
What Is Association Mining?
• Market Basket Analysis
• Frequent patterns are patterns (e.g., itemsets, subsequences, or
substructures) that appear frequently in a data set.
• Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
• it helps in data classification, clustering, and other data mining tasks.
• frequent pattern mining has become an important data mining task and a
focused theme in data mining research.
• Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data sets.
• The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business
decision-making processes such as catalog design, cross-marketing,
and customer shopping behavior analysis.
• A typical example of frequent itemset mining is market basket analysis.
• This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets”
• discovery of these associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased
together by customers.
• For instance, if customers are buying milk, how likely are they to also buy
bread on the same trip
• Suppose, as manager of an AllElectronics branch, to learn more about
the buying habits of your customers
• market basket analysis may be performed on the retail data of customer
transactions at your store.
• use these results to plan marketing or advertising strategies, or in the
design of a new catalog.
• In one strategy, items that are frequently purchased together can be
placed in proximity to further encourage the combined sale of such
items.
• If customers who purchase computers also tend to buy antivirus
software at the same time, then placing the hardware display close to
the software display may help increase the sales of both items.
• In an alternative strategy, placing hardware and software at opposite
ends of the store may entice customers who purchase such items to
pick up other items along the way
• E.g., the information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following
association rule:
  computer ⇒ antivirus_software [support = 2%, confidence = 60%]
• Rule support and confidence are two measures of rule interestingness.
• They respectively reflect the usefulness and certainty of discovered
rules. A support of 2% for the above rule means that 2% of all the transactions
under analysis show that computer and antivirus software are purchased
together.
• A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software.
• Typically, association rules are considered interesting if they satisfy both
a minimum support threshold and a minimum confidence threshold.
• These thresholds can be set by users or domain experts
Basic Concepts
• Itemset
– A collection of one or more items
– Item: A single article or product.
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)-The number of transactions containing the
itemset.
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread,Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup
threshold
Frequent Itemsets, Closed Itemsets, and Association
Rules
Confidence
• Confidence of a rule: Measures how often items in Y appear in transactions that contain X. It is an
indication of how frequently the rule X⇒Y is found to be true.
• Formula: support(X ⇒ Y) = P(X ∪ Y); confidence(X ⇒ Y) = P(Y|X) = support(X ∪ Y) / support(X)
• Lift
• Lift of a rule: Measures how much more likely Y is to be bought when X is bought compared to being
bought independently of X.
• Formula: lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y) = P(X ∪ Y) / (P(X) P(Y))
• Lift values:
• >1 : Positive association (items co-occur more often than expected).
• =1 : No association (items co-occur as expected under independence).
• <1 : Negative association (items co-occur less often than expected).
• Example: Consider a simple dataset of transactions:
  Transaction ID | Items
  1 | {Bread, Milk}
  2 | {Bread, Diapers, Beer, Eggs}
  3 | {Milk, Diapers, Beer, Coke}
  4 | {Bread, Milk, Diapers, Beer}
  5 | {Bread, Milk, Diapers, Coke}
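A minimal Python sketch computing support, confidence and lift for the rule {Milk, Diapers} ⇒ {Beer} on the five transactions above (the rule itself is chosen only for illustration):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
X, Y = {"Milk", "Diapers"}, {"Beer"}
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n   # fraction of transactions containing itemset

sup_rule = support(X | Y)              # 2/5 = 0.4
conf = sup_rule / support(X)           # 0.4 / 0.6 ≈ 0.667
lift = conf / support(Y)               # 0.667 / 0.6 ≈ 1.11  (> 1: positive association)
print(sup_rule, round(conf, 3), round(lift, 2))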
Frequent Itemsets, Closed Itemsets, and Association
Rules
• An itemset X is closed in a data set D if there exists no proper
super-itemset Y such that Y has the same support count as X in D.
• An itemset X is a closed frequent itemset in set D if X is both
closed and frequent in D.
• An itemset X is a maximal frequent itemset (or max-itemset) in a
data set D if X is frequent, and there exists no super-itemset Y such
that X ⊂ Y and Y is frequent in D.
• Let C be the set of closed frequent itemsets for a data set D
satisfying a minimum support threshold, min_sup. Let, M be the set
of maximal frequent itemsets for D satisfying min_sup.
Apriori Algorithm
1. Apriori Algorithm
• A classic algorithm for mining frequent itemsets for boolean association rules.
• Based on the principle that any subset of a frequent itemset must also be frequent.
• Steps:
• Generate candidate itemsets: Generate all possible itemsets of a given length.
• Count support: Count the support for each candidate itemset.
• Prune: Remove itemsets that do not meet the minimum support threshold.
• Repeat: Increase the length of itemsets by 1 and repeat the process until no more frequent
itemsets are found.
2. Generating Association Rules
• From the frequent itemsets identified, generate association rules.
• For each frequent itemset L, generate all non-empty subsets of L.
• For every non-empty subset s of L, output the rule s⇒(L−s) if the confidence of the rule
meets the minimum confidence threshold.
The Apriori Algorithm
• Pseudo-code:
  Ck: Candidate itemset of size k
  Lk: frequent itemset of size k
  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
Implementation of Apriori Algorithm
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
The Apriori Algorithm
• Find frequent itemsets using an iterative level-wise approach based on candidate generation.
• Input:
• D, a database of transactions;
• min sup, the minimum support count threshold.
• Output: L, frequent itemsets in D.
• Method:
(1) L1 = find_frequent_1-itemsets(D);
(2) for (k = 2; Lk-1 ≠ ∅; k++) {
(3)     Ck = apriori_gen(Lk-1);
(4)     for each transaction t ∈ D { // scan D for counts
(5)         Ct = subset(Ck, t); // get the subsets of t that are candidates
(6)         for each candidate c ∈ Ct
(7)             c.count++;
(8)     }
(9)     Lk = {c ∈ Ck | c.count ≥ min_sup} }
(10) return L = ∪k Lk;
The Apriori Algorithm
• procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
(1) for each itemset l1 ∈ Lk-1
(2)     for each itemset l2 ∈ Lk-1
(3)         if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
(4)             c = l1 ⋈ l2; // join step: generate candidates
(5)             if has_infrequent_subset(c, Lk-1) then
(6)                 delete c; // prune step: remove unfruitful candidate
(7)             else add c to Ck;
(8)         }
(9) return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets) // use prior knowledge
(1) for each (k-1)-subset s of c
(2)     if s ∉ Lk-1 then
(3)         return TRUE;
(4) return FALSE;
Mining Association Rules—an Example
Min. support 50%, Min. confidence 50%

  Transaction-id | Items bought
  10 | A, B, C
  20 | A, C
  30 | A, D
  40 | B, E, F

  Frequent pattern | Support
  {A} | 75%
  {B} | 50%
  {C} | 50%
  {A, C} | 50%

For rule A ⇒ C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.6%
The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan — C1 (candidate 1-itemsets with support counts):
  {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (frequent 1-itemsets):
  {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan — support counts:
  {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2 (frequent 2-itemsets):
  {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (candidate generated from L2): {B, C, E}
3rd scan — L3 (frequent 3-itemsets):
  {B, C, E}: 2
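A minimal pure-Python sketch (not from the slides) of the level-wise generate-and-test loop, reproducing the TDB example above with min_sup = 2; the join step here simply unions pairs of frequent (k-1)-itemsets, which yields the same candidates as the textbook join once pruning is applied:

from itertools import combinations

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def counts(candidates):
    return {c: sum(c <= t for t in D) for c in candidates}

items = {i for t in D for i in t}
L = [{c: s for c, s in counts({frozenset([i]) for i in items}).items() if s >= min_sup}]  # L1

k = 2
while L[-1]:
    prev = list(L[-1])
    # join step: combine frequent (k-1)-itemsets into k-itemset candidates
    cand = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    # prune step: every (k-1)-subset of a candidate must itself be frequent
    cand = {c for c in cand if all(frozenset(s) in L[-1] for s in combinations(c, k - 1))}
    L.append({c: s for c, s in counts(cand).items() if s >= min_sup})
    k += 1

for level, freq in enumerate(L, start=1):
    for itemset, sup in freq.items():
        print(level, sorted(itemset), sup)     # ends with level 3: ['B', 'C', 'E'] 2, as above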
Why Is Frequent Pattern or Association Mining
an Essential Task in Data Mining?
• Foundation for many essential data mining tasks
• Association, correlation, causality
• Sequential patterns, temporal or cyclic association, partial periodicity, spatial and
multimedia association
• Associative classification, cluster analysis, iceberg cube, fascicles (semantic data
compression)
• Broad applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis
• Web log (click stream) analysis, DNA sequence analysis, etc.
Generating Association Rules from Frequent Itemsets
• Once the frequent itemsets from transactions in a database D have been found,
it is straightforward to generate strong association rules from them (where
strong association rules satisfy both minimum support and minimum
confidence).
• This can be done using the equation for confidence, which we show again here for
completeness:
  confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
Content Beyond Syllabus
Decision Tree
• Decision tree induction is the learning of decision trees from class-labeled training
tuples.
• A decision tree is a flowchart-like tree structure, where each internal node (non-leaf
node) denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (or terminal node) holds a class label. The topmost node in a tree
is the root node.
• Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
• Some decision tree algorithms produce only binary trees (where each internal node
branches to exactly two other nodes), whereas others can produce nonbinary trees.
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
• A path is traced from the root to a leaf node, which holds the
class prediction for that tuple.
• Decision trees can easily be converted to classification rules.
Why are decision tree classifiers so
popular?”
• The construction of decision tree classifiers does not require any domain
knowledge or parameter setting.
• appropriate for exploratory knowledge discovery.
• Decision trees can handle multidimensional data.
• Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
• The learning and classification steps of decision tree induction are simple and
fast.
• In general, decision tree classifiers have good accuracy.
• However, successful use may depend on the data at hand.
• Decision tree induction algorithms have been used for classification in many
application areas such as medicine, manufacturing and production, financial
analysis, astronomy, and molecular biology
Decision Tree Induction: An Example
❑ Training data set: Buys_computer
❑ The data set follows an example of
Quinlan’s ID3 (Playing Tennis)
❑ Resulting tree:
Information Gain
• Let node N represent or hold the tuples of partition D.
• The attribute with the highest information gain is chosen as the splitting
attribute for node N.
• This attribute minimizes the information needed to classify the tuples in
the resulting partitions and reflects the least randomness or “impurity” in
these partitions.
• Such an approach minimizes the expected number of tests needed to
classify a given tuple and guarantees that a simple tree is found.
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by
|Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σ i=1..m pi log2(pi)
• Information needed (after using A to split D into v partitions) to classify D:
  InfoA(D) = Σ j=1..v (|Dj| / |D|) × Info(Dj)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) - InfoA(D)
• Gain(A) tells us how much would be gained by branching on A.
• It is the expected reduction in the information requirement caused by
knowing the value of A.
• attribute A with the highest information gain, Gain(A), is chosen as the
splitting attribute at node N.
• This is equivalent to saying that we want to partition on the attribute A
that would do the “best classification,” so that the amount of information
still required to finish classifying the tuples is minimal (i.e., minimum Info
A(D)).
Gain Ratio
• The information gain measure is biased toward tests with many
outcomes.
• That is, it prefers to select attributes having a large number of values.
• For example, consider an attribute that acts as a unique identifier such
as product ID. A split on product ID would result in a large number of
partitions (as many as there are values), each one containing just one
tuple.
• Because each partition is pure, the information required to classify data
set D based on this partitioning would be Info_product_ID(D)= 0.
• Therefore, the information gained by partitioning on this attribute is
maximal.
• Clearly, such a partitioning is useless for classification.
• C4.5, a successor of ID3, uses an extension to information gain known as the gain
ratio. It applies a kind of normalization to information gain using a "split
information" value, defined analogously to Info(D), as
  SplitInfoA(D) = - Σ j=1..v (|Dj| / |D|) × log2(|Dj| / |D|)
• This value represents the potential information generated by splitting the
training data set, D, into v partitions, corresponding to the v outcomes of a
test on attribute A.
• for each outcome, it considers the number of tuples having that outcome
with respect to the total number of tuples in D.
• It differs from information gain, which measures the information with
respect to classification that is acquired based on the same partitioning.
The gain ratio is defined as
  GainRatio(A) = Gain(A) / SplitInfoA(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute
• if the split information approaches 0, the ratio becomes unstable.
Gini Index
• The Gini index measures the impurity of a data partition D:
  Gini(D) = 1 - Σ i=1..m pi²
• The Gini index considers a binary split for each attribute.
• When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition.
• If a binary split on A partitions D into D1 and D2, the Gini index of D given that partitioning is
  GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
• For each attribute, each of the possible binary splits is considered.
• For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is
selected as its splitting subset.
• Continuing the textbook's income example, the Gini index values for splits on the
remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450
(for the subsets {medium, high} and {low}).
• Therefore, the best binary split for attribute income is on {low,
medium} or {high} because it minimizes the Gini index
• Evaluating age, we obtain {youth, senior} or {middle aged} as the best
split for age with a Gini index of 0.375; the attributes student and credit
rating are both binary, with Gini index values of 0.367 and 0.429,
respectively.
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples)
• Class N: buys_computer = "no" (5 tuples)
  Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
• The term (5/14) I(2, 3) means "age <= 30" has 5 out of 14
samples, with 2 yes'es and 3 no's.
Hence
  Infoage(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
  Gain(age) = Info(D) - Infoage(D) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
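A minimal Python sketch that reproduces the information-gain numbers above (the class counts are taken from the Infoage(D) expression shown above):

from math import log2

def info(counts):
    # entropy of a class distribution given as a list of class counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                                                              # 0.940
info_age = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])   # 0.694
gain_age = info_D - info_age                                                       # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))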
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural network
classifiers
• Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
• Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can
be measured
Bayes’ Theorem: Basics
• Total probability Theorem: P(B) = Σ i=1..M P(B|Ai) P(Ai)
• Bayes' Theorem: P(H|X) = P(X|H) P(H) / P(X)
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability): the probability
that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the
Bayes’ theorem
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X)
for all the k classes
• Practical difficulty: It requires initial knowledge of many probabilities, involving
significant computational cost
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem:
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
  P(X|Ci) = Π k=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,
D
| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution
with mean μ and standard deviation σ,
  g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))
and P(xk|Ci) is g(xk, μCi, σCi)
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)
Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
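A minimal Python sketch that reproduces the computation above, using the conditional probabilities already estimated from the buys_computer training data:

priors = {"yes": 9/14, "no": 5/14}
cond = {   # P(attribute value | class), read off the training data
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
X = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    likelihood = 1.0
    for value in X:
        likelihood *= cond[c][value]      # naive conditional-independence assumption
    scores[c] = likelihood * priors[c]    # P(X|Ci) * P(Ci)

print(scores)                         # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))    # 'yes'  ->  buys_computer = yes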
Naïve Bayes Classifier: An Example
• Play Tennis
The learning phase for the tennis example:
P(Play=Yes) = 9/14
P(Play=No) = 5/14
We have four predictor variables; for each one we calculate the
conditional probability table:
Outlook    | Play=Yes | Play=No
Sunny      | 2/9      | 3/5
Overcast   | 4/9      | 0/5
Rain       | 3/9      | 2/5

Temperature | Play=Yes | Play=No
Hot         | 2/9      | 2/5
Mild        | 4/9      | 2/5
Cool        | 3/9      | 1/5

Humidity | Play=Yes | Play=No
High     | 3/9      | 4/5
Normal   | 6/9      | 1/5

Wind   | Play=Yes | Play=No
Strong | 3/9      | 3/5
Weak   | 6/9      | 2/5
The test phase for the tennis example
• Test Phase
– Given a new instance of variable values,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Given calculated Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise,
the predicted prob. will be zero
• Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990),
and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their “uncorrected” counterparts
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks (Chapter 9)
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use validation test set of class-labeled tuples instead of training set when assessing
accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Evaluation of Classifiers
•Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Confusion Matrix –
• a table that compares a classification model's predicted values to the
actual values in a dataset, and is used to evaluate the model's
performance. It's also known as an error matrix.
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
Example of Confusion Matrix:
  Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
  buy_computer = yes             | 6954               | 46                | 7000
  buy_computer = no              | 412                | 2588              | 3000
  Total                          | 7366               | 2634              | 10000
• Given m classes, an entry, CMi,j in a confusion matrix indicates # of
tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
  A \ P | C  | ¬C |
  C     | TP | FN | P
  ¬C    | FP | TN | N
        | P' | N' | All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are
correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or Error rate = (FP + FN)/All
• Class Imbalance Problem: one class may be rare, e.g. fraud, or HIV-positive, so there is
a significant majority of the negative class and a minority of the positive class
• Sensitivity: True Positive recognition rate; Sensitivity = TP/P
• Specificity: True Negative recognition rate; Specificity = TN/N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are
actually positive
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 × Precision × Recall / (Precision + Recall)
• Fß: weighted measure of precision and recall; assigns ß times as much weight to recall as to precision
  Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
Classifier Evaluation Metrics: Example
  Actual Class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
  cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
  cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
  Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)
• Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
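A minimal Python sketch computing these metrics from the cancer confusion matrix above (TP = 90, FN = 210, FP = 140, TN = 9560):

TP, FN, FP, TN = 90, 210, 140, 9560

sensitivity = TP / (TP + FN)          # recall: 90/300 = 0.30
specificity = TN / (TN + FP)          # 9560/9700 ≈ 0.9856
precision = TP / (TP + FP)            # 90/230 ≈ 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ≈ 0.34
print(round(sensitivity, 4), round(specificity, 4), round(precision, 4), round(f1, 4))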
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
• At i-th iteration, use Di as test set and others as training set
• Leave-one-out: k folds where k = # of tuples, for small sized data
• *Stratified cross-validation*: folds are stratified so that class dist. in each fold is
approx. the same as that in the initial data
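A minimal Scikit-learn sketch (on synthetic data) of the holdout method and stratified 10-fold cross-validation:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Holdout: e.g. 2/3 of the data for training, 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))

# Stratified 10-fold cross-validation: class distribution preserved in each fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print("10-fold CV accuracy:", scores.mean())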
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
• Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with replacement, resulting in a training set of d
samples. The data tuples that did not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test
set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
• Repeat the sampling procedure k times; the overall accuracy of the model is
  Acc(M) = Σ i=1..k (0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set)
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The vertical axis represents the true positive rate; the horizontal axis represents the false
positive rate; the plot also shows a diagonal line
• The area under the ROC curve is a measure of the accuracy of the model; a model with perfect
accuracy will have an area of 1.0
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive
class appears at the top of the list
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less
accurate is the model
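A minimal Scikit-learn sketch (synthetic data, logistic regression as the scorer) of plotting an ROC curve and computing the area under it:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]        # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)         # false / true positive rates at each threshold
print("AUC:", roc_auc_score(y_te, scores))    # 1.0 = perfect model, 0.5 = diagonal line

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="diagonal (random guessing)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()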
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
Classification Techniques
• Decision Trees
• Rule Approaches
• Bayesian Classifiers
• Neural Networks
• Discriminant Analysis
• Support Vector Machines
• k-nearest neighbor classifiers
• Logistic regression
• Artificial Neural Networks
• Genetic Classifiers