— Module 3 —
Classification
Course Outcomes
ITC602.4: Implement appropriate data mining methods such as
classification, clustering, or frequent pattern mining on large data sets.
ITC602.5: Define and apply metrics to measure the performance of
various data mining algorithms.
Module 3. Classification

What is classification? What is prediction?
Issues regarding classification and prediction
1. Classification by decision tree induction
   Attribute selection measures
   Tree pruning
2. Bayesian classification
   Naive Bayes classifier
Rule-based classification, SVM, Backpropagation, k-Nearest-Neighbor classifiers, GA, fuzzy sets
Prediction: structure of regression models
   Simple linear regression
   Multiple linear regression
Accuracy and error measures
Evaluating the accuracy
Classification vs. Prediction
Classification
predicts categorical class labels (discrete, unordered)
constructs a model based on the training set and the values (class
labels) of a classifying attribute, and uses the model to classify new
data
Ex. Classification model to categorize bank loan applications as
either safe or risky
Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Ex. Prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their
income and occupation
Typical applications
Credit approval/ Performance approval
Target marketing
Medical diagnosis
Manufacturing
Fraud detection
Classification—A Two-Step Process
1. Learning Step / Training Phase :
Classifier is built describing a predetermined set of
classes/concepts
A classification algorithm builds the classifier by analyzing
or “learning from” a training set made up of database
tuples (samples) and their associated class-label attribute
The class label attribute is discrete-valued and unordered
The model is represented as classification rules, decision
trees, or mathematical formulae
Process (1): Model Construction
Classification algorithms are applied to the training data to build the classifier (model).

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification—A Two-Step Process
2. Model usage: for classifying future or unknown data
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples
that are correctly classified by the model
Test set is independent of training set
If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
Process (2): Using the Model in Prediction

The classifier is first applied to test data (with known labels) to estimate its accuracy, and then to unseen data.

Test data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
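As a rough illustration (not part of the original slides), the learned rule from the model-construction step can be encoded directly and checked against the test data above; the function and variable names below are purely illustrative.

```python
# A minimal sketch: apply the learned rule to the test set, then to the
# unseen tuple (Jeff, Professor, 4). Names here are illustrative only.

def predict_tenured(rank: str, years: int) -> str:
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
print(f"estimated accuracy: {correct}/{len(test_set)}")

print("(Jeff, Professor, 4) ->", predict_tenured("Professor", 4))  # 'yes'
```

On this tiny test set the rule misclassifies Merlisa, so the estimated accuracy is 3/4; if that is acceptable, the rule is then applied to the unseen tuple.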
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations,
etc. with the aim of establishing the existence
of classes or clusters in the data
Data Preparation
Data cleaning
Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (Attribute/feature selection)
Remove the irrelevant or redundant attributes
Data transformation and reduction
Generalize and/or normalize data
Comparing Classification and Prediction
Methods
Accuracy
classifier accuracy: predicting class label
predictor accuracy: how well the predictor estimates the value of the
predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as
decision tree size or compactness of classification rules
Decision Tree Induction: Training
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?
  <=30:    student?         no -> no,  yes -> yes
  31..40:  yes
  >40:     credit_rating?   excellent -> no,  fair -> yes
Algorithm for Decision Tree
Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
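A minimal sketch of these formulas in plain Python (function and variable names are illustrative, not from any library), run on the age attribute of the training data above:

```python
# Entropy Info(D), partition entropy Info_A(D), and Gain(A) for a
# categorical attribute.
from collections import Counter
from math import log2

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj) for attribute attr."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# age values and buys_computer labels from the 14-tuple training table
rows = [{"age": a} for a in
        ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
         "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(round(info_gain(rows, "age", labels), 3))   # about 0.246
```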
Attribute Selection: Information Gain

Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples), using the training table shown earlier.
  Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Splitting on age:
  age      p_i  n_i  I(p_i, n_i)
  <=30     2    3    0.971
  31...40  4    0    0
  >40      3    2    0.971

  Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
  (\frac{5}{14} I(2,3) means “age <= 30” covers 5 of the 14 samples, with 2 yes’s and 3 no’s.)

Hence Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048,
so age is selected as the splitting attribute.
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
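A rough sketch of this procedure, assuming the entropy helper defined earlier; the data and function names are illustrative only:

```python
# Choose the split point for a continuous attribute: sort the values,
# try each midpoint, and keep the one with minimum expected information.
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                       # identical values give no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= split]
        right = [l for v, l in pairs if v > split]
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best   # (split point, expected information requirement)

# made-up ages and class labels, purely for illustration
print(best_split_point([25, 32, 41, 45, 28, 38],
                       ["no", "yes", "yes", "no", "no", "yes"]))
```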
Gain Ratio for Attribute Selection (C4.5)

The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex. Splitting on income (4 low, 6 medium, 4 high):
  SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557
  GainRatio(income) = 0.029 / 1.557 ≈ 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index gini(D) is defined as
  gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the Gini index of the split is defined as
  gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)
Reduction in impurity:
  \Delta gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute).
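A small sketch of these formulas in plain Python (names are illustrative), applied to the buys_computer counts used in the example that follows:

```python
# Gini index of a label list and Gini index of a binary split.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

d = ["yes"] * 9 + ["no"] * 5                   # 9 "yes", 5 "no" tuples
print(round(gini(d), 3))                       # 0.459

low_medium = ["yes"] * 7 + ["no"] * 3          # income in {low, medium}
high = ["yes"] * 2 + ["no"] * 2                # income = high
print(round(gini_split(low_medium, high), 3))  # 0.443
print(round(gini(d) - gini_split(low_medium, high), 3))  # reduction in impurity
```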
Gini Index (CART, IBM IntelligentMiner)

Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:
  gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
Suppose the attribute income partitions D into D_1 = {low, medium} (10 tuples) and D_2 = {high} (4 tuples):
  gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443
The splits {low, high}/{medium} and {medium, high}/{low} give 0.458 and 0.450, so {low, medium} (and {high}) is the best binary split on income, since it has the lowest Gini index.
Notes:
  All attributes are assumed continuous-valued
  May need other tools, e.g., clustering, to get the possible split values
  Can be modified for categorical attributes
Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
Information gain:
  biased towards multivalued attributes
Gain ratio:
  tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index:
  biased towards multivalued attributes
  has difficulty when the number of classes is large
  tends to favor tests that result in equal-sized partitions and purity in both partitions
Other Attribute Selection Measures

CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
C-SEP: performs better than information gain and Gini index in certain cases
G-statistic: has a close approximation to the χ2 distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations):
  CART: finds multivariate splits based on a linear combination of attributes
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training
data
Too many branches, some may reflect anomalies due to noise
or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get
a sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”
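As a rough illustration (not from the slides), scikit-learn's DecisionTreeClassifier, if available, exposes both styles: pre-pruning via stopping parameters such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha); the dataset and parameter values below are arbitrary, chosen only to show the comparison on a held-out set.

```python
# A hedged sketch comparing an unpruned, a prepruned, and a postpruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0).fit(X_tr, y_tr)   # prepruning
post = DecisionTreeClassifier(ccp_alpha=0.01,
                              random_state=0).fit(X_tr, y_tr)  # postpruning

for name, m in [("fully grown", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, round(m.score(X_te, y_te), 3))   # accuracy on the held-out set
```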
Enhancements to Basic Decision Tree
Induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued
attributes that partition the continuous attribute
value into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones
that are sparsely represented
This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively
studied by statisticians and machine learning
researchers
Scalability: Classifying data sets with millions of
examples and hundreds of attributes with
reasonable speed
Why decision tree induction in data mining?
  relatively faster learning speed (than other classification methods)
  convertible to simple and easy-to-understand classification rules
  can use SQL queries for accessing databases
  comparable classification accuracy with other methods
Scalable Decision Tree Induction
Methods
SLIQ (EDBT’96 — Mehta et al.)
Builds an index for each attribute and only class
list and the current attribute list reside in
memory
SPRINT (VLDB’96 — J. Shafer et al.)
Constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stop
growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan &
Ganti)
Builds an AVC-list (attribute, value, class label)
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayes’ Theorem: Basics

Let X be a data sample (“evidence”): class label is unknown
Let H be the hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability of H
  E.g., the probability that a given customer will buy a computer, regardless of age, income, …
P(X): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  E.g., given that a customer will buy a computer, the probability that the customer is aged 31..40 with medium income
Bayes’ Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
  P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
Informally, this can be written as
  posterior = likelihood × prior / evidence
Predict that X belongs to class C_i iff the probability P(C_i|X) is the highest among all P(C_k|X) for the k classes.
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.
Towards the Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, …, x_n)
Suppose there are m classes C_1, C_2, …, C_m
Classification derives the maximum posterior probability, i.e., the maximal P(C_i|X)
This can be derived from Bayes’ theorem:
  P(C_i|X) = \frac{P(X|C_i)\, P(C_i)}{P(X)}
Since P(X) is constant for all classes, only P(X|C_i)\, P(C_i) needs to be maximized.
Derivation of the Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \dots \times P(x_n|C_i)
This greatly reduces the computation cost: only the class distributions need to be counted.
If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D).
If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
  g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
Naïve Bayesian Classifier: An
Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
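A small sketch reproducing this computation in plain Python; the conditional probabilities are the counts read off the buys_computer table, and the function name is illustrative:

```python
# Naive Bayes scoring: pick the class maximizing P(X|Ci) * P(Ci),
# with P(X|Ci) factored under the class-conditional independence assumption.
def classify(x, priors, conds):
    scores = {}
    for c in priors:
        p = priors[c]
        for attr, value in x.items():
            p *= conds[c][attr][value]
        scores[c] = p
    return max(scores, key=scores.get), scores

priors = {"yes": 9 / 14, "no": 5 / 14}
conds = {
    "yes": {"age": {"<=30": 2 / 9}, "income": {"medium": 4 / 9},
            "student": {"yes": 6 / 9}, "credit": {"fair": 6 / 9}},
    "no":  {"age": {"<=30": 3 / 5}, "income": {"medium": 2 / 5},
            "student": {"yes": 1 / 5}, "credit": {"fair": 2 / 5}},
}
x = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
label, scores = classify(x, priors, conds)
print(label, {c: round(s, 3) for c, s in scores.items()})
# -> yes, with scores roughly {yes: 0.028, no: 0.007}, matching the slide
```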
Avoiding the 0-Probability
Problem
Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise, the predicted probability will be zero:
  P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10),
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their
“uncorrected” counterparts
Naïve Bayesian Classifier:
Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence,
therefore loss of accuracy
Practically, dependencies exist among variables
E.g., in hospital patient records: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by
Naïve Bayesian Classifier
How to deal with these dependencies? → Bayesian belief networks
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given input
Prediction is different from classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or predictor
variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson regression,
log-linear models, regression trees
Linear Regression

Linear regression involves a response variable y and a single predictor variable x:
  y = w_0 + w_1 x
where w_0 (y-intercept) and w_1 (slope) are the regression coefficients.
Method of least squares: estimates the best-fitting straight line
  w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}
Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X_1, y_1), (X_2, y_2), …, (X_{|D|}, y_{|D|})
  Ex. For 2-D data we may have: y = w_0 + w_1 x_1 + w_2 x_2
  Solvable by an extension of the least-squares method or with statistical software/languages such as SPSS, STATA, SAS, S-Plus, or R
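A minimal sketch of the least-squares estimates above in plain Python; the years/salary numbers are made up for illustration:

```python
# Simple linear regression by the method of least squares.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # hypothetical predictor x
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]   # hypothetical response y
w0, w1 = fit_line(years, salary)
print(f"y = {w0:.1f} + {w1:.1f} x")   # roughly y = 23.2 + 3.5 x for these data
```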
Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function.
A polynomial regression model can be transformed into a linear regression model. For example,
  y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
is convertible to linear form with the new variables x_2 = x^2 and x_3 = x^3:
  y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
Other functions, such as the power function, can also be transformed into a linear model.
Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae.
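A sketch of this transformation, assuming NumPy is available; the synthetic data and coefficients are made up for illustration:

```python
# Turn the cubic model y = w0 + w1*x + w2*x^2 + w3*x^3 into a linear
# least-squares problem by introducing the columns x^2 and x^3.
import numpy as np

x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + np.random.normal(0, 0.1, x.size)

X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # design matrix
w, *_ = np.linalg.lstsq(X, y, rcond=None)               # ordinary least squares
print(np.round(w, 2))   # approximately [1.0, 2.0, -0.5, 0.3]
```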
Other Regression-Based Models
Generalized linear model:
Foundation on which linear regression can be applied to
modeling categorical response variables
Variance of y is a function of the mean value of y, not a
constant
Logistic regression: models the prob. of some event
occurring as a linear function of a set of predictor variables
Poisson regression: models the data that exhibit a Poisson
distribution
Log-linear models: (for categorical data)
Approximate discrete multidimensional prob. distributions
Also useful for data compression and smoothing
Regression trees and model trees
Trees to predict continuous values rather than class labels
Regression Trees and Model
Trees
Regression tree: proposed in CART system (Breiman et al.
1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the
training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear
equation for the predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than
linear regression when the data are not represented well by a
simple linear model
Predictive Modeling in Multidimensional
Databases
Predictive modeling: Predict data values or construct generalized
linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Prediction: Numerical Data (figure omitted)
Prediction: Categorical Data (figure omitted)
Classifier Accuracy Measures

Confusion matrix (rows: actual class, columns: predicted class):
        C1               C2
  C1    true positive    false negative
  C2    false positive   true negative

Accuracy of a classifier M, acc(M): the percentage of test-set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
Given m classes, CM_{i,j}, an entry in the confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos / pos        /* true positive recognition rate */
  specificity = t-neg / neg        /* true negative recognition rate */
  precision = t-pos / (t-pos + f-pos)
  accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)
This model can also be used for cost-benefit analysis.
Classifier Accuracy Measures: Example

  classes                buys_computer = yes   buys_computer = no   total    recognition (%)
  buys_computer = yes    6954                  46                   7000     99.34
  buys_computer = no     412                   2588                 3000     86.27
  total                  7366                  2634                 10000    95.52
Predictor Error Measures

Measure predictor accuracy: how far off the predicted value is from the actual known value
Loss function: measures the error between y_i and the predicted value y_i'
  Absolute error: |y_i - y_i'|
  Squared error: (y_i - y_i')^2
Test error (generalization error): the average loss over the test set
  Mean absolute error: \frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|
  Mean squared error: \frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2
  Relative absolute error: \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}
  Relative squared error: \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}
The mean squared error exaggerates the presence of outliers.
The (square) root mean squared error and, similarly, the root relative squared error are popularly used.
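A sketch of these error measures for a handful of made-up predictions (the y values are illustrative only):

```python
# Predictor error measures: MAE, MSE, RMSE, relative absolute/squared error.
y = [10.0, 12.0, 15.0, 20.0]        # actual values
y_pred = [11.0, 11.5, 16.0, 18.0]   # predicted values
n = len(y)
y_bar = sum(y) / n

mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / n          # mean absolute error
mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / n        # mean squared error
rmse = mse ** 0.5                                             # root mean squared error
rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_bar) ** 2 for a in y)
print(mae, mse, rmse, rae, rse)
```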
Evaluating the Accuracy of a
Classifier or Predictor (I)
Holdout method
Given data is randomly partitioned into two independent
sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the
accuracies obtained
Evaluating the Accuracy of a
Classifier or Predictor (II)
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive
subsets D_1, …, D_k, each of approximately equal size
At i-th iteration, use Di as test set and others as training
set
Leave-one-out: k folds where k = # of tuples, for small
sized data
Stratified cross-validation: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial
data
Evaluating the Accuracy of a
Classifier or Predictor (III)
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with
replacement
i.e., each time a tuple is selected, it is equally likely to
be selected again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap.
Evaluating the Accuracy of a Classifier or Predictor (IV)

Several bootstrap methods exist; a common one is the .632 bootstrap:
  Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368).
  Repeat the sampling procedure k times; the overall accuracy of the model is
    acc(M) = \sum_{i=1}^{k} \left(0.632 \times acc(M_i)_{test\ set} + 0.368 \times acc(M_i)_{train\ set}\right)
Summary (I)
Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes
or to predict future data trends.
Effective and scalable methods have been developed for decision
tree induction, naïve Bayesian classification, Bayesian belief
network, rule-based classifier, Backpropagation, Support Vector
Machine (SVM), associative classification, nearest neighbor
classifiers, and case-based reasoning, and other classification
methods such as genetic algorithms, rough set and fuzzy set
approaches.
Linear, nonlinear, and generalized linear models of regression can
be used for prediction. Many nonlinear problems can be
converted to linear problems by performing transformations on
the predictor variables. Regression trees and model trees are also
used for prediction.
Summary (II)
Stratified k-fold cross-validation is a recommended method for
accuracy estimation. Bagging and boosting can be used to
increase overall accuracy by learning and combining a series of
individual models.
Significance tests and ROC curves are useful for model selection
There have been numerous comparisons of the different
classification and prediction methods, and the matter remains a
research topic
No single method has been found to be superior over all others for
all data sets
Issues such as accuracy, training time, robustness, interpretability,
and scalability must be considered and can involve trade-offs,
further complicating the quest for an overall superior method.
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules.
Future Generation Computer Systems, 13, 1997.
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University
Press, 1995.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth International Group, 1984.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from
partitioned data for scaling machine learning. KDD'95.
W. Cohen. Fast effective rule induction. ICML'95.
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule
groups for gene expression data. SIGMOD'05.
A. J. Dobson. An Introduction to Generalized Linear Models. Chapman
and Hall, 1990.
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
References (2)
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and
Sons, 2001
U. M. Fayyad. Branching on attribute values in decision tree generation.
AAAI’94.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. J. Computer and System Sciences, 1997.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast
decision tree construction of large datasets. VLDB’98.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision
Tree Construction. SIGMOD'99.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and
decision tree induction: Efficient classification in data mining. RIDE'97.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule. KDD'98.
W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on
Multiple Class-Association Rules, ICDM'01.
References (3)
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification
algorithms. Machine Learning, 2000.
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods
of Marketing Research, Blackwell Business, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for
data mining. EDBT'96.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A
Multi-Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-
389, 1998
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
References (4)
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates
building and pruning. VLDB’98.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for
data mining. VLDB’96.
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan
Kaufmann, 1990.
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison
Wesley, 2005.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets,
Machine Learning, and Expert Systems. Morgan Kaufman, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
X. Yin and J. Han. CPAR: Classification based on predictive association rules.
SDM'03
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with
hierarchical clusters. KDD'03.