Chapter 3: Classification and Prediction

Module 3 focuses on classification in data mining, detailing methods such as decision tree induction, Bayesian classification, and various classifiers like SVM and k-Nearest-Neighbor. It distinguishes between classification and prediction, emphasizing the importance of model construction and evaluation metrics for accuracy. The module also covers data preparation techniques and compares different attribute selection measures used in classification algorithms.


— Module 3 —

Classification

Course Outcomes
ITC602.4: Implement appropriate data mining methods such as classification, clustering, or frequent pattern mining on large data sets.
ITC602.5: Define and apply metrics to measure the performance of various data mining algorithms.
Module 3. Classification

 What is classification? What  Rule-based classification,


is prediction? SVM, Backpropagation, k-
 Issues regarding Nearest-Neighbor classifiers,

classification and prediction GA, fuzzy set

 1. Classification by decision  Prediction : structure of

tree induction regression models

 Attribute Selection measures  Simple linear regression

 Tree pruning  Accuracy and error measures

 2. Bayesian classification  Evaluating the Accuracy

 Naive Bayes classifier


03/30/21 Data Mining: Concepts and Techniques 2
Classification vs. Prediction
 Classification
 predicts categorical class labels (discrete, unordered)
 classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
Ex. Classification model to categorize bank loan applications as
either safe or risky
 Prediction
 models continuous-valued functions, i.e., predicts
unknown or missing values
Ex. Prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their
income and occupation

Typical applications
 Credit approval/ Performance approval
 Target marketing
 Medical diagnosis
 Manufacturing
 Fraud detection

Classification—A Two-Step Process
1. Learning step / training phase:
 A classifier is built describing a predetermined set of classes/concepts
 The classification algorithm builds the classifier by analyzing, or "learning from," a training set made up of database tuples (samples) and their associated class label attribute
 The class label attribute is discrete-valued and unordered
 The model is represented as classification rules, decision trees, or mathematical formulae
Process (1): Model Construction

Training data is fed to a classification algorithm, which produces the classifier (model).

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Resulting classification rule:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification—A Two-Step Process

2. Model usage: for classifying future or unknown data
 Estimate the accuracy of the model
   The known label of each test sample is compared with the model's classification result
   Accuracy rate is the percentage of test set samples that are correctly classified by the model
   The test set is independent of the training set
 If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (2): Using the Model in Prediction

The classifier is evaluated on testing data and then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
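As a rough illustration of the two-step process, the sketch below builds a classifier from the tenure training data and then classifies the unseen tuple (Jeff, Professor, 4). It assumes scikit-learn is available; the integer encoding of rank is an illustrative choice, not part of the original slides.

```python
# Minimal sketch of the two-step process (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier

# Step 1: learning / training phase.
# Rank encoded as 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor.
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]   # (rank, years)
y_train = ['no', 'yes', 'yes', 'yes', 'no', 'no']            # tenured class labels
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- classify the unseen tuple (Jeff, Professor, 4 years).
print(model.predict([[2, 4]]))   # e.g. ['yes'] if the professor rule is learned;
                                 # on such a tiny training set the tree may vary
```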
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and
handle missing values
 Relevance analysis (Attribute/feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation and reduction
 Generalize and/or normalize data

Comparing Classification and Prediction
Methods
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: how well the predictor estimates the value of the predicted attribute
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules

Module 3. Classification

 What is classification? What is prediction?
 Issues regarding classification and prediction
 1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
 2. Bayesian classification
   Naive Bayes classifier
 Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy set
 Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
 Accuracy and error measures
 Evaluating the Accuracy
Decision Tree Induction: Training
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for "buys_computer"

Root test: age?
 age <= 30   → test student?        (student = no → no;   student = yes → yes)
 age 31..40  → yes
 age > 40    → test credit_rating?  (excellent → no;      fair → yes)
Algorithm for Decision Tree
Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected
attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
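The greedy procedure above can be sketched in a few lines of plain Python. The `select_attribute` argument is a placeholder for whichever attribute selection measure is used (e.g., information gain); tuples are assumed to be dictionaries mapping attribute names to categorical values.

```python
from collections import Counter

def build_tree(tuples, labels, attributes, select_attribute):
    """Greedy top-down, divide-and-conquer decision tree induction (sketch)."""
    # Stopping condition 1: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no remaining attributes -> leaf labelled by majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = select_attribute(tuples, labels, attributes)      # heuristic attribute choice
    node = {'attribute': best, 'branches': {}}
    remaining = [a for a in attributes if a != best]
    # Partition the samples recursively on each observed value of the chosen attribute.
    # (Stopping condition 3, an empty partition, cannot arise here because we only
    #  branch on values that actually occur in the data at this node.)
    for value in {t[best] for t in tuples}:
        subset = [(t, c) for t, c in zip(tuples, labels) if t[best] == value]
        sub_tuples, sub_labels = zip(*subset)
        node['branches'][value] = build_tree(list(sub_tuples), list(sub_labels),
                                             remaining, select_attribute)
    return node
```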
Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
 Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
 Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

 Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), using the training table above

 Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

 age      p_i   n_i   I(p_i, n_i)
 <=30     2     3     0.971
 31..40   4     0     0
 >40      3     2     0.971

 Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

 \frac{5}{14} I(2,3) means "age <= 30" has 5 of the 14 samples, with 2 yes's and 3 no's.

 Hence Gain(age) = Info(D) - Info_age(D) = 0.246

 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
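The arithmetic above can be checked with a short script (plain Python, standard library only); the data list below is the 14-tuple buys_computer training table.

```python
from math import log2
from collections import Counter

def info(labels):
    """Expected information (entropy) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain from splitting on `attr`."""
    n = len(labels)
    split = Counter(r[attr] for r in rows)
    info_a = sum((cnt / n) * info([l for r, l in zip(rows, labels) if r[attr] == v])
                 for v, cnt in split.items())
    return info(labels) - info_a

# 14 training tuples: (age, income, student, credit_rating) -> buys_computer
data = [
    ('<=30','high','no','fair','no'),      ('<=30','high','no','excellent','no'),
    ('31..40','high','no','fair','yes'),   ('>40','medium','no','fair','yes'),
    ('>40','low','yes','fair','yes'),      ('>40','low','yes','excellent','no'),
    ('31..40','low','yes','excellent','yes'), ('<=30','medium','no','fair','no'),
    ('<=30','low','yes','fair','yes'),     ('>40','medium','yes','fair','yes'),
    ('<=30','medium','yes','excellent','yes'), ('31..40','medium','no','excellent','yes'),
    ('31..40','high','yes','fair','yes'),  ('>40','medium','no','excellent','no'),
]
rows = [dict(zip(('age', 'income', 'student', 'credit_rating'), t[:4])) for t in data]
labels = [t[4] for t in data]

print(round(info(labels), 3))                      # 0.940
for a in ('age', 'income', 'student', 'credit_rating'):
    # age has the highest gain (~0.247; the slide's 0.246 truncates intermediate values)
    print(a, round(gain(rows, labels, a), 3))
```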
Computing Information-Gain for
Continuous-Value Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum expected information
requirement for A is selected as the split-point for A
 Split:
 D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)

 The information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex. For income (4 low, 6 medium, 4 high tuples):
    SplitInfo_income(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
    gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute
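Continuing the previous sketch (it reuses `rows`, `labels`, `gain`, `log2`, and `Counter` from there), the gain ratio for income can be checked as follows.

```python
def split_info(rows, attr):
    """SplitInfo_A(D): potential information generated by the split itself."""
    n = len(rows)
    counts = Counter(r[attr] for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

si = split_info(rows, 'income')
print(round(si, 3), round(gain(rows, labels, 'income') / si, 3))   # 1.557 0.019
```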
Gini Index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 - \sum_{j=1}^{n} p_j^2
   where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as
    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
 Reduction in impurity:
    \Delta gini(A) = gini(D) - gini_A(D)
 The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute)
Gini Index (CART, IBM IntelligentMiner)

 Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
    gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
 Suppose the attribute income partitions D into 10 tuples in D_1: {low, medium} and 4 in D_2: {high}:
    gini_{income \in \{low,medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443
   The other binary partitions on income, {low, high} (0.458) and {medium, high} (0.450), give higher values, so the split on {low, medium} vs. {high} is the best for income since it minimizes the gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
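The Gini arithmetic can be checked the same way (again reusing `rows`, `labels`, and `Counter` from the information-gain sketch); the binary split tested is the {low, medium} vs. {high} partition of income from the example.

```python
def gini(labels):
    """Gini index of a set of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_binary_split(rows, labels, attr, subset):
    """Gini index of a binary split of D on `attr` into `subset` vs. the rest."""
    in_part  = [l for r, l in zip(rows, labels) if r[attr] in subset]
    out_part = [l for r, l in zip(rows, labels) if r[attr] not in subset]
    n = len(labels)
    return len(in_part) / n * gini(in_part) + len(out_part) / n * gini(out_part)

print(round(gini(labels), 3))                                              # 0.459
print(round(gini_binary_split(rows, labels, 'income', {'low', 'medium'}), 3))  # 0.443
```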
Comparing Attribute Selection Measures

 The three measures, in general, return good results, but:
 Information gain:
   biased towards multivalued attributes
 Gain ratio:
   tends to prefer unbalanced splits in which one partition is much smaller than the others
 Gini index:
   biased towards multivalued attributes
   has difficulty when the number of classes is large
   tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures

 CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence
 C-SEP: performs better than information gain and the gini index in certain cases
 G-statistic: has a close approximation to the χ2 distribution
 MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
   The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
 Multivariate splits (partitioning based on combinations of multiple variables)
   CART: finds multivariate splits based on a linear combination of attributes
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training
data
 Too many branches, some may reflect anomalies due to noise
or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get
a sequence of progressively pruned trees
 Use a set of data different from the training data to decide
which is the “best pruned tree”
Enhancements to Basic Decision Tree
Induction

 Allow for continuous-valued attributes


 Dynamically define new discrete-valued
attributes that partition the continuous attribute
value into a discrete set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones
that are sparsely represented
 This reduces fragmentation, repetition, and replication
Classification in Large Databases

 Classification—a classical problem extensively


studied by statisticians and machine learning
researchers
 Scalability: Classifying data sets with millions of
examples and hundreds of attributes with
reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other
classification methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
Scalable Decision Tree Induction
Methods

 SLIQ (EDBT’96 — Mehta et al.)


 Builds an index for each attribute and only class
list and the current attribute list reside in
memory
 SPRINT (VLDB’96 — J. Shafer et al.)
 Constructs an attribute list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
 Integrates tree splitting and tree pruning: stop
growing the tree earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan &
Ganti)
 Builds an AVC-list (attribute, value, class label)
 BOAT (SIGMOD'99 — Gehrke, Ganti, Ramakrishnan & Loh)
Module 3. Classification

 What is classification? What is prediction?
 Issues regarding classification and prediction
 1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
 2. Bayesian classification
   Naive Bayes classifier
 Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy set
 Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
 Accuracy and error measures
 Evaluating the Accuracy
Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured

Bayes' Theorem: Basics
 Let X be a data sample (“evidence”): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the
hypothesis holds given the observed data sample X
 P(H) (prior probability), the initial probability
 E.g., X will buy computer, regardless of age, income, …
 P(X): probability that sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis H holds
   E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayes' Theorem

 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
    P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
 Informally, this can be written as
    posterior = likelihood × prior / evidence
 Predicts that X belongs to class C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes
 Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards the Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-D attribute vector X = (x_1, x_2, …, x_n)
 Suppose there are m classes C_1, C_2, …, C_m
 Classification derives the maximum posteriori, i.e., the maximal P(C_i|X)
 This can be derived from Bayes' theorem:
    P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}
 Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized
Derivation of the Naïve Bayes Classifier

 A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \dots \times P(x_n|C_i)
 This greatly reduces the computation cost: only the class distribution needs to be counted
 If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_i,D| (the number of tuples of C_i in D)
 If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
    g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
   so that P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayesian Classifier: Training Dataset

Classes:
 C1: buys_computer = 'yes'
 C2: buys_computer = 'no'

Data sample to classify:
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An
Example
 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357

 Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)


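The hand calculation above can be reproduced with a few lines of plain Python (reusing `rows` and `labels` from the information-gain sketch; relative-frequency estimates, no smoothing).

```python
def nb_score(x, cls):
    """P(X|Ci) * P(Ci) with relative-frequency estimates (no smoothing)."""
    idx = [i for i, l in enumerate(labels) if l == cls]
    prior = len(idx) / len(labels)
    likelihood = 1.0
    for attr, value in x.items():
        match = sum(1 for i in idx if rows[i][attr] == value)
        likelihood *= match / len(idx)
    return prior * likelihood

x = {'age': '<=30', 'income': 'medium', 'student': 'yes', 'credit_rating': 'fair'}
scores = {c: nb_score(x, c) for c in ('yes', 'no')}
print(scores)                           # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))      # 'yes' -> X is classified as buys_computer = yes
```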
Avoiding the 0-Probability Problem

 Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
 Ex. Suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10)
 Use the Laplacian correction (or Laplacian estimator)
   Add 1 to each case:
     Prob(income = low) = 1/1003
     Prob(income = medium) = 991/1003
     Prob(income = high) = 11/1003
   The "corrected" probability estimates are close to their "uncorrected" counterparts
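A tiny sketch of the correction for the hypothetical 1000-tuple income example above:

```python
counts = {'low': 0, 'medium': 990, 'high': 10}        # hypothetical 1000-tuple example
total, k = sum(counts.values()), len(counts)
corrected = {v: (c + 1) / (total + k) for v, c in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003
```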
Naïve Bayesian Classifier:
Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence,
therefore loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
 Dependencies among these cannot be modeled by
Naïve Bayesian Classifier
 How to deal with these dependencies? → Bayesian belief networks
Chapter 6. Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by backpropagation
 Support Vector Machines (SVM)
 Associative classification
 Lazy learners (or learning from your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection
Module 3. Classification

 What is classification? What is prediction?
 Issues regarding classification and prediction
 1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
 2. Bayesian classification
   Naive Bayes classifier
 Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy set
 Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
 Accuracy and error measures
 Evaluating the Accuracy
What Is Prediction?
 (Numerical) prediction is similar to classification
 construct a model
 use model to predict continuous or ordered value for a given input
 Prediction is different from classification
 Classification refers to predict categorical class label
 Prediction models continuous-valued functions
 Major method for prediction: regression
 model the relationship between one or more independent or predictor
variables and a dependent or response variable
 Regression analysis
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson regression,
log-linear models, regression trees
Linear Regression

 Linear regression: involves a response variable y and a single predictor variable x:
    y = w_0 + w_1 x
   where w_0 (y-intercept) and w_1 (slope) are the regression coefficients
 Method of least squares: estimates the best-fitting straight line:
    w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
 Multiple linear regression: involves more than one predictor variable
   Training data is of the form (X_1, y_1), (X_2, y_2), …, (X_{|D|}, y_{|D|})
   Ex. For 2-D data, we may have: y = w_0 + w_1 x_1 + w_2 x_2
   Solvable by an extension of the least-squares method or by using statistical software/languages such as SPSS, STATA, SAS, S-Plus, or R
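A minimal least-squares sketch in plain Python; the x/y values are made-up illustrative numbers (years of experience vs. salary), not data from the slides.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of w0 (intercept) and w1 (slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data: years of experience vs. salary (in $1000s)
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = fit_simple_linear(xs, ys)
print(round(w0, 2), round(w1, 2))   # fitted intercept and slope
print(w0 + w1 * 10)                 # predicted salary for 10 years of experience
```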
Nonlinear Regression

 Some nonlinear models can be modeled by a polynomial function
 A polynomial regression model can be transformed into a linear regression model. For example,
    y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
   is convertible to linear form with the new variables x_2 = x^2 and x_3 = x^3:
    y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
 Other functions, such as the power function, can also be transformed to a linear model
 Some models are intractably nonlinear (e.g., a sum of exponential terms)
   It is still possible to obtain least-squares estimates through extensive calculation on more complex formulae
Other Regression-Based Models
 Generalized linear model:
 Foundation on which linear regression can be applied to
modeling categorical response variables
 Variance of y is a function of the mean value of y, not a
constant
 Logistic regression: models the prob. of some event
occurring as a linear function of a set of predictor variables
 Poisson regression: models the data that exhibit a Poisson
distribution
 Log-linear models: (for categorical data)
 Approximate discrete multidimensional prob. distributions
 Also useful for data compression and smoothing
 Regression trees and model trees
 Trees to predict continuous values rather than class labels
Regression Trees and Model
Trees
 Regression tree: proposed in CART system (Breiman et al.
1984)
 CART: Classification And Regression Trees
 Each leaf stores a continuous-valued prediction
 It is the average value of the predicted attribute for the
training tuples that reach the leaf
 Model tree: proposed by Quinlan (1992)
 Each leaf holds a regression model—a multivariate linear
equation for the predicted attribute
 A more general case than regression tree
 Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
Predictive Modeling in Multidimensional
Databases
 Predictive modeling: Predict data values or construct generalized
linear models based on the database data
 One can only predict value ranges or category distributions
 Method outline:
 Minimal generalization
 Attribute relevance analysis
 Generalized linear model construction
 Prediction
 Determine the major factors which influence the prediction
 Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgement, etc.
 Multi-level prediction: drill-down and roll-up analysis
Prediction: Numerical Data
(figure not reproduced)

Prediction: Categorical Data
(figure not reproduced)
Module 3. Classification

 What is classification? What is prediction?
 Issues regarding classification and prediction
 1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
 2. Bayesian classification
   Naive Bayes classifier
 Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy set
 Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
 Accuracy and error measures
 Evaluating the Accuracy
Classifier Accuracy Measures

 Confusion matrix (for two classes):

              Predicted C1      Predicted C2
 Actual C1    true positive     false negative
 Actual C2    false positive    true negative

 Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
 Error rate (misclassification rate) of M = 1 - acc(M)
 Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
 Alternative accuracy measures (e.g., for cancer diagnosis):
   sensitivity = t-pos / pos          /* true positive recognition rate */
   specificity = t-neg / neg          /* true negative recognition rate */
   precision = t-pos / (t-pos + f-pos)
   accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg)
 This model can also be used for cost-benefit analysis
Classifier Accuracy Measures

 classes                buy_computer = yes   buy_computer = no   total    recognition (%)
 buy_computer = yes     6954                 46                  7000     99.34
 buy_computer = no      412                  2588                3000     86.27
 total                  7366                 2634                10000    95.42
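The sketch below derives the alternative measures from the confusion matrix above, treating buy_computer = yes as the positive class.

```python
# Confusion-matrix entries from the table above (positive class: buy_computer = yes)
t_pos, f_neg = 6954, 46       # actual yes classified as yes / as no
f_pos, t_neg = 412, 2588      # actual no classified as yes / as no
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                                   # true positive recognition rate
specificity = t_neg / neg                                   # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)
accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(accuracy, 4))              # 0.9934 0.8627 0.9441 0.9542
```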
Predictor Error Measures

 Measuring predictor accuracy: measure how far off the predicted value is from the actual known value
 Loss function: measures the error between y_i and the predicted value y_i'
   Absolute error: |y_i - y_i'|
   Squared error: (y_i - y_i')^2
 Test error (generalization error): the average loss over the test set
   Mean absolute error:  \frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|
   Mean squared error:   \frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2
   Relative absolute error:  \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}
   Relative squared error:   \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
 The mean squared error exaggerates the presence of outliers
 The (square) root mean squared error is popularly used; similarly, the root relative squared error
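A small sketch computing these measures for a list of actual and predicted values (the numbers passed in at the end are made up for illustration).

```python
from math import sqrt

def error_measures(actual, predicted):
    """Mean, relative, and root error measures for a predictor."""
    d = len(actual)
    y_bar = sum(actual) / d
    mae = sum(abs(y - yp) for y, yp in zip(actual, predicted)) / d
    mse = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / d
    rae = sum(abs(y - yp) for y, yp in zip(actual, predicted)) \
          / sum(abs(y - y_bar) for y in actual)
    rse = sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) \
          / sum((y - y_bar) ** 2 for y in actual)
    return {'MAE': mae, 'MSE': mse, 'RMSE': sqrt(mse), 'RAE': rae, 'RSE': rse}

print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))
```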
Module 3. Classification

 What is classification? What is prediction?
 Issues regarding classification and prediction
 1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
 2. Bayesian classification
   Naive Bayes classifier
 Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy set
 Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
 Accuracy and error measures
 Evaluating the Accuracy
Evaluating the Accuracy of a
Classifier or Predictor (I)
 Holdout method
 Given data is randomly partitioned into two independent
sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random sampling: a variation of holdout
 Repeat holdout k times, accuracy = avg. of the
accuracies obtained

Evaluating the Accuracy of a
Classifier or Predictor (II)
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
 At i-th iteration, use Di as test set and others as training
set
 Leave-one-out: k folds where k = # of tuples, for small
sized data
 Stratified cross-validation: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data

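A minimal k-fold loop in plain Python. The `train` and `test_accuracy` arguments are placeholders for whatever classifier and accuracy measure are being evaluated; they are not defined in the slides.

```python
import random

def k_fold_accuracy(data, k, train, test_accuracy, seed=0):
    """Average accuracy over k mutually exclusive folds D_1..D_k."""
    data = list(data)
    random.Random(seed).shuffle(data)                  # random partition
    folds = [data[i::k] for i in range(k)]             # k roughly equal-sized subsets
    scores = []
    for i in range(k):
        test = folds[i]                                # D_i is the test set at iteration i
        training = [t for j, f in enumerate(folds) if j != i for t in f]
        model = train(training)                        # placeholder training step
        scores.append(test_accuracy(model, test))      # placeholder evaluation step
    return sum(scores) / k
```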
Evaluating the Accuracy of a
Classifier or Predictor (III)
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with
replacement
 i.e., each time a tuple is selected, it is equally likely to
be selected again and re-added to the training set
 Several bootstrap methods exist; a common one is the .632 bootstrap

Data Mining: Concepts and


03/30/21 Techniques 122
Evaluating the Accuracy of a Classifier or Predictor (IV)

 Several bootstrap methods exist; a common one is the .632 bootstrap
   Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368)
   Repeat the sampling procedure k times; the overall accuracy of the model is:
    acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\ set} + 0.368 \times acc(M_i)_{train\ set} \right)
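A sketch of the .632 bootstrap with placeholder `train` and `test_accuracy` functions; the k rounds are averaged here, which is the usual reading of the formula above.

```python
import random

def bootstrap632_accuracy(data, k, train, test_accuracy, seed=0):
    """Average of k bootstrap rounds, weighting test and training accuracy 0.632/0.368."""
    rng = random.Random(seed)
    data = list(data)
    d = len(data)
    total = 0.0
    for _ in range(k):
        training = [rng.choice(data) for _ in range(d)]   # sample d times with replacement
        test = [t for t in data if t not in training]     # tuples never drawn (~36.8%)
        model = train(training)                           # placeholder training step
        total += (0.632 * test_accuracy(model, test)
                  + 0.368 * test_accuracy(model, training))
    return total / k
```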
Module 3. Classification

 What is classification? What is prediction?
 Issues regarding classification and prediction
 1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
 2. Bayesian classification
   Naive Bayes classifier
 Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy set
 Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
 Accuracy and error measures
 Evaluating the Accuracy
 Summary
Summary (I)
 Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes
or to predict future data trends.
 Effective and scalable methods have been developed for decision tree induction, naïve Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, case-based reasoning, and other classification methods such as genetic algorithms, rough set and fuzzy set approaches.
 Linear, nonlinear, and generalized linear models of regression can
be used for prediction. Many nonlinear problems can be
converted to linear problems by performing transformations on
the predictor variables. Regression trees and model trees are also
used for prediction.
Summary (II)
 Stratified k-fold cross-validation is a recommended method for
accuracy estimation. Bagging and boosting can be used to
increase overall accuracy by learning and combining a series of
individual models.
 Significance tests and ROC curves are useful for model selection
 There have been numerous comparisons of the different
classification and prediction methods, and the matter remains a
research topic
 No single method has been found to be superior over all others for
all data sets
 Issues such as accuracy, training time, robustness, interpretability, and scalability must be considered and can involve trade-offs, further complicating the quest for an overall superior method
References (1)
 C. Apte and S. Weiss. Data mining with decision trees and decision rules.
Future Generation Computer Systems, 13, 1997.
 C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University
Press, 1995.
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth International Group, 1984.
 C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from
partitioned data for scaling machine learning. KDD'95.
 W. Cohen. Fast effective rule induction. ICML'95.
 G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule
groups for gene expression data. SIGMOD'05.
 A. J. Dobson. An Introduction to Generalized Linear Models. Chapman
and Hall, 1990.
 G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
References (2)
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and
Sons, 2001
 U. M. Fayyad. Branching on attribute values in decision tree generation.
AAAI’94.
 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Computer and System Sciences, 1997.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast
decision tree construction of large datasets. VLDB’98.
 J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. SIGMOD'99.
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
 D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
 M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and
decision tree induction: Efficient classification in data mining. RIDE'97.
 B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule. KDD'98.
 W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on
Multiple Class-Association Rules, ICDM'01.
References (3)
 T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification
algorithms. Machine Learning, 2000.
 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods
of Marketing Research, Blackwell Business, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for
data mining. EDBT'96.
 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
 S. K. Murthy, Automatic Construction of Decision Trees from Data: A
Multi-Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-
389, 1998
 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
 J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
 J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
References (4)
 R. Rastogi and K. Shim. Public: A decision tree classifier that integrates
building and pruning. VLDB’98.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for
data mining. VLDB’96.
 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan
Kaufmann, 1990.
 P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison
Wesley, 2005.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets,
Machine Learning, and Expert Systems. Morgan Kaufman, 1991.
 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
 X. Yin and J. Han. CPAR: Classification based on predictive association rules.
SDM'03
 H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with
hierarchical clusters. KDD'03.
