KEMBAR78
Classification Basic Concepts, Decision Trees, and Model Evaluation | PDF | Statistical Classification | Computational Neuroscience
0% found this document useful (0 votes)
97 views67 pages

Classification Basic Concepts, Decision Trees, and Model Evaluation

The document discusses classification, which involves using a model to predict discrete class labels for samples based on their attribute values. It defines classification, outlines the stages of a classification task using a training set to build a model and test set to validate it, and provides examples of classification problems. It also introduces several common classification techniques like decision trees, logistic regression, and neural networks. Specifically, it shows an example decision tree model and how that model would be applied to samples in a test set to predict their class labels.

Uploaded by

sarvesh_mishra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
97 views67 pages

Classification Basic Concepts, Decision Trees, and Model Evaluation

The document discusses classification, which involves using a model to predict discrete class labels for samples based on their attribute values. It defines classification, outlines the stages of a classification task using a training set to build a model and test set to validate it, and provides examples of classification problems. It also introduces several common classification techniques like decision trees, logistic regression, and neural networks. Specifically, it shows an example decision tree model and how that model would be applied to samples in a test set to predict their class labels.

Uploaded by

sarvesh_mishra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 67

Classification

Basic Concepts, Decision Trees, and


Model Evaluation

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Classification definition

 Given a collection of samples (training set)


– Each sample contains a set of attributes.
– Each sample also has a discrete class label.
 Learn a model that predicts class label as a
function of the values of the attributes.
 Goal: model should assign class labels to
previously unseen samples as accurately as
possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Stages in a classification task

Tid Attrib1 Attrib2 Attrib3 Class Learning


1 Yes Large 125K No
algorithm
2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No


Induction
5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No Learn


8 No Small 85K Yes Model
9 No Medium 75K No

10 No Small 90K Yes


Model
10

Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ? Deduction


14 No Small 95K ?

15 No Large 67K ?
10

Test Set

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Examples of classification tasks

 Two classes
– Predicting tumor cells as benign or malignant
– Classifying credit card transactions
as legitimate or fraudulent

 Multiple classes
– Classifying secondary structures of
protein as alpha-helix, beta-sheet,
or random coil
– Categorizing news stories as finance,
weather, entertainment, sports, etc

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Classification techniques

 Decision trees
 Rule-based methods
 Logistic regression
 Discriminant analysis
 k-Nearest neighbor (instance-based learning)
 Naïve Bayes
 Neural networks
 Support vector machines
 Bayesian belief networks

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Example of a decision tree

splitting nodes
Tid Refund Marital Taxable
Status Income Cheat

1 Yes Single 125K No Refund


Yes No
2 No Married 100K No
3 No Single 70K No NO MarSt
4 Yes Married 120K No
Single, Divorced Married
5 No Divorced 95K Yes
6 No Married 60K No
TaxInc NO
7 Yes Divorced 220K No < 80K > 80K
8 No Single 85K Yes
NO YES
9 No Married 75K No
10 No Single 90K Yes
10

classification nodes

training data model: decision tree

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Another example of decision tree

MarSt Single,
Married Divorced
Tid Refund Marital Taxable
Status Income Cheat
NO Refund
1 Yes Single 125K No
Yes No
2 No Married 100K No
3 No Single 70K No NO TaxInc
4 Yes Married 120K No < 80K > 80K
5 No Divorced 95K Yes
NO YES
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No There can be more than one tree
10 No Single 90K Yes that fits the same data!
10

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Decision tree classification task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No

4 Yes Medium 120K No


Induction
5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No Learn


8 No Small 85K Yes Model
9 No Medium 75K No

10 No Small 90K Yes


Model
10

Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?


Deduction
14 No Small 95K ?

15 No Large 67K ?
10

Test Set

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Apply model to test data

Test data
Start from the root of tree. Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Apply model to test data

Test data
Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Apply model to test data

Test data
Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Apply model to test data

Test data
Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Apply model to test data

Test data
Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married

TaxInc NO
< 80K > 80K

NO YES

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Apply model to test data

Test data
Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married Assign Cheat to “No”

TaxInc NO
< 80K > 80K

NO YES

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Decision tree classification task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No

4 Yes Medium 120K No


Induction
5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No Learn


8 No Small 85K Yes Model
9 No Medium 75K No

10 No Small 90K Yes


Model
10

Training Set
Apply Decision
Tid Attrib1 Attrib2 Attrib3 Class
Model Tree
11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?


Deduction
14 No Small 95K ?

15 No Large 67K ?
10

Test Set

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Decision tree induction

 Many algorithms:
– Hunt’s algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


General structure of Hunt’s algorithm

Hunt’s algorithm is recursive.


Tid Refund Marital Taxable
 Status Income Cheat

 General procedure: 1 Yes Single 125K No


2 No Married 100K No
Let Dt be the set of training
3 No Single 70K No
records that reach a node t.
4 Yes Married 120K No
a) If all records in Dt belong to 5 No Divorced 95K Yes
the same class yt, then t is a 6 No Married 60K No
leaf node labeled as yt. 7 Yes Divorced 220K No

b) If Dt is an empty set, then t is 8 No Single 85K Yes


9 No Married 75K No
a leaf node labeled by the
10 No Single 90K Yes
default class, yd. 10

c) If Dt contains records that Dt


belong to more than one
class, use an attribute test to a), b), t
split the data into smaller or c)?
subsets, then apply the
procedure to each subset.

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Applying Hunt’s algorithm

Refund Refund
Don’t
Yes No Yes No
Cheat
Don’t Don’t Don’t Marital
Cheat Cheat Cheat Status
Single,
Tid Refund Marital Taxable Married
Status Income Cheat
Divorced
1 Yes Single 125K No Refund Don’t
Cheat
2 No Married 100K No Yes No Cheat
3 No Single 70K No
4 Yes Married 120K No Don’t Marital
5 No Divorced 95K Yes Cheat Status
6 No Married 60K No Single,
7 Yes Divorced 220K No
Married
Divorced
8 No Single 85K Yes
9 No Married 75K No Taxable Don’t
Cheat
10
10 No Single 90K Yes Income
< 80K >= 80K

Don’t Cheat
Cheat

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Tree induction

 Greedy strategy
– Split the records at each node based on an
attribute test that optimizes some chosen
criterion.

 Issues
– Determine how to split the records
 How to specify structure of split?
 What is best attribute / attribute value for splitting?

– Determine when to stop splitting


Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Tree induction

 Greedy strategy
– Split the records at each node based on an
attribute test that optimizes some chosen
criterion.

 Issues
– Determine how to split the records
 How to specify structure of split?
 What is best attribute / attribute value for splitting?

– Determine when to stop splitting


Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Specifying structure of split

 Depends on attribute type


– Nominal
– Ordinal
– Continuous (interval or ratio)

 Depends on number of ways to split


– Binary (two-way) split
– Multi-way split

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Splitting based on nominal attributes

 Multi-way split: Use as many partitions as distinct


values.
CarType
Family Luxury
Sports

 Binary split: Divides values into two subsets.


Need to find optimal partitioning.
CarType CarType
{Sports, OR {Family,
Luxury} {Family} Luxury} {Sports}

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Splitting based on ordinal attributes

 Multi-way split: Use as many partitions as distinct


values.
Size
Small Large
Medium

 Binary split: Divides values into two subsets.


Need to find optimal partitioning.
Size Size
{Small,
{Large}
OR {Medium,
{Small}
Medium} Large}

Size
{Small,
 What about this split? Large} {Medium}

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Splitting based on continuous attributes

 Different ways of handling


– Discretization to form an ordinal attribute
 static – discretize once at the beginning
 dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.

– Threshold decision: (A < v) or (A  v)


 consider all possible split points v and find the one
that gives the best split
 can be more compute intensive

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Splitting based on continuous attributes

 Splitting based on threshold decision

Taxable Taxable
Income Income?
> 80K?
< 10K > 80K
Yes No

[10K,25K) [25K,50K) [50K,80K)

(i) Binary split (ii) Multi-way split

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Tree induction

 Greedy strategy
– Split the records at each node based on an
attribute test that optimizes some chosen
criterion.

 Issues
– Determine how to split the records
 How to specify structure of split?
 What is best attribute / attribute value for splitting?

– Determine when to stop splitting


Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Determining the best split

Before splitting: 10 records of class 0


10 records of class 1

Own Car Student


Car? Type? ID?

Yes No Family Luxury c1 c20


c10 c11
Sports
C0: 6 C0: 4 C0: 1 C0: 8 C0: 1 C0: 1 ... C0: 1 C0: 0 ... C0: 0
C1: 4 C1: 6 C1: 3 C1: 0 C1: 7 C1: 0 C1: 0 C1: 1 C1: 1

Which attribute gives the best split?

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Determining the best split

 Greedy approach:
Nodes with homogeneous class distribution
are preferred.
 Need a measure of node impurity:

C0: 5 C0: 9
C1: 5 C1: 1

Non-homogeneous, Homogeneous,
high degree of impurity low degree of impurity

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Measures of node impurity

 Gini index

 Entropy

 Misclassification error

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Using a measure of impurity to determine best split

Before splitting: C0 N00


M0
C1 N01

Attribute A? Attribute B?

Yes No Yes No

Node N1 Node N2 Node N3 Node N4

C0 N10 C0 N20 C0 N30 C0 N40


C1 N11 C1 N21 C1 N31 C1 N41

M1 M2 M3 M4

M12 M34
Gain = M0 – M12 vs. M0 – M34
Choose attribute that maximizes gain
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Measure of impurity: Gini index

 Gini index for a given node t :

GINI (t )  1   [ p( j | t )]2
j

p( j | t ) is the relative frequency of class j at node t


– Maximum (1 – 1 / nc ) when records are equally
distributed among all classes, implying least amount
of information ( nc = number of classes ).
– Minimum ( 0.0 ) when all records belong to one class,
implying most amount of information.
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Examples of computing Gini index

GINI (t )  1   [ p( j | t )] 2

C1 0 p( C1 ) = 0 / 6 = 0 p( C2 ) = 6 / 6 = 1
C2 6 Gini = 1 – p( C1 )2 – p( C2 )2 = 1 – 0 – 1 = 0

C1 1 p( C1 ) = 1 / 6 p( C2 ) = 5 / 6
C2 5 Gini = 1 – ( 1 / 6 )2 – ( 5 / 6 )2 = 0.278

C1 2 p( C1 ) = 2 / 6 p( C2 ) = 4 / 6
C2 4 Gini = 1 – ( 2 / 6 )2 – ( 4 / 6 )2 = 0.444

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Splitting based on Gini index

 Used in CART, SLIQ, SPRINT.


 When a node t is split into k partitions (child nodes), the
quality of split is computed as,

k
ni
GINI split   GINI (i )
i 1 n

where ni = number of records at child node i


n = number of records at parent node t

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Computing Gini index: binary attributes

 Splits into two partitions


 Effect of weighting partitions: favors larger and purer
partitions
Parent
B? C1 6
Yes No C2 6
Gini = 0.500
Node N1 Node N2
Gini( N1 )
= 1 – (5/6)2 – (2/6)2 N1 N2 Gini( children )
= 0.194
C1 5 1 = 7/12 * 0.194 +
Gini( N2 ) C2 2 4 5/12 * 0.528
= 1 – (1/6)2 – (4/6)2 Gini=0.333 = 0.333
= 0.528
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Computing Gini index: categorical attributes

 For each distinct value, gather counts for each class in


the dataset
 Use the count matrix to make decisions

Multi-way split Two-way split


(find best partition of attribute values)

CarType CarType CarType


Family Sports Luxury {Sports, {Family,
{Family} {Sports}
Luxury} Luxury}
C1 1 2 1 C1 3 1 C1 2 2
C2 4 1 1 C2 2 4 C2 1 5
Gini 0.393 Gini 0.400 Gini 0.419

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Computing Gini index: continuous attributes

Tid Refund Marital Taxable


 Make binary split based on a threshold Status Income Cheat
(splitting) value of attribute
1 Yes Single 125K No
 Number of possible splitting values = 2 No Married 100K No
(number of distinct values attribute has 3 No Single 70K No
at that node) - 1 4 Yes Married 120K No
 Each splitting value v has a count 5 No Divorced 95K Yes
matrix associated with it 6 No Married 60K No

– Class counts in each of the 7 Yes Divorced 220K No

partitions, A < v and A  v 8 No Single 85K Yes


9 No Married 75K No
 Simple method to choose best v
10 No Single 90K Yes
– For each v, scan the attribute 10

values at the node to gather count Taxable


matrix, then compute its Gini index. Income
> 80K?
– Computationally inefficient!
Repetition of work.
Yes No

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Computing Gini index: continuous attributes

 For efficient computation, do following for each (continuous) attribute:


– Sort attribute values.
– Linearly scan these values, each time updating the count matrix
and computing Gini index.
– Choose split position that has minimum Gini index.

Cheat No No No Yes Yes Yes No No No No


Taxable Income

sorted values 60 70 75 85 90 95 100 120 125 220


55 65 72 80 87 92 97 110 122 172 230
split positions
<= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Comparison among splitting criteria

For a two-class problem:

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Tree induction

 Greedy strategy
– Split the records at each node based on an
attribute test that optimizes some chosen
criterion.

 Issues
– Determine how to split the records
 How to specify structure of split?
 What is best attribute / attribute value for splitting?

– Determine when to stop splitting


Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Stopping criteria for tree induction

 Stop expanding a node when all the records


belong to the same class
 Stop expanding a node when all the records have
identical (or very similar) attribute values
– No remaining basis for splitting
 Early termination

Can also prune tree post-induction

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Decision trees: decision boundary
1

0.9

0.8
x < 0.43?

0.7
Yes No
0.6
y

0.5 y < 0.47? y < 0.33?


0.4

0.3
Yes No Yes No

0.2
:4 :0 :0 :4
0.1 :0 :4 :3 :0
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

x
 Border between two neighboring regions of different classes is known
as decision boundary.
 In decision trees, decision boundary segments are always parallel to
attribute axes, because test condition involves one attribute at a time.
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Classification with decision trees

 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy comparable to other classification
techniques for many simple data sets
 Disadvantages:
– Easy to overfit
– Decision boundary restricted to being parallel
to attribute axes
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
MATLAB interlude

matlab_demo_04.m

Part A

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Producing useful models: topics

 Generalization

 Measuring classifier performance

 Overfitting, underfitting

 Validation

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Generalization

 Definition: model does a good job of correctly predicting


class labels of previously unseen samples.
 Generalization is typically evaluated using a test set of
data that was not involved in the training process.
 Evaluating generalization requires:
– Correct labels for test set are known.
– A quantitative measure (metric) of tendency for model
to predict correct labels.

NOTE: Generalization is separate from other performance


issues around models, e.g. computational efficiency,
scalability.

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Generalization of decision trees

 If you make a decision tree deep enough, it can usually


do a perfect job of predicting class labels on training set.
Is this a good thing?
NO!

 Leaf nodes do not have to be pure for a tree to generalize


well. In fact, it’s often better if they aren’t.
 Class prediction of an impure leaf node is simply the
majority class of the records in the node.
 An impure node can also be interpreted as making a
probabilistic prediction.
– Example: 7 / 10 class 1 means p( 1 ) = 0.7
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Metrics for classifier performance

 Accuracy
a = number of test samples with label correctly predicted
b = number of test samples with label incorrectly predicted
a
accuracy 
ab
example
 75 samples in test set

 correct class label predicted for 62 samples

 wrong class label predicted for 13 samples

 accuracy = 62 / 75 = 0.827

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Metrics for classifier performance

 Limitations of accuracy
– Consider a two-class problem
 number of class 1 test samples = 9990
 number of class 2 test samples = 10
– What if model predicts everything to be class 1?
 accuracy is extremely high: 9990 / 10000 = 99.9 %
 but model will never correctly predict any sample in class 2
 in this case accuracy is misleading and does not give a good
picture of model quality

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Metrics for classifier performance

 Confusion matrix
example
(continued from actual class
two slides back)
class 1 class 2

class 1 21 6
predicted
class
class 2 7 41

21  41 62
accuracy  
21  6  7  41 75

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Metrics for classifier performance

 Confusion matrix
actual class
– derived metrics
(for two classes) class 1 class 2
(negative) (positive)
class 1
21 (TN) 6 (FN)
predicted (negative)
class class 2
7 (FP) 41 (TP)
(positive)

TN: true negatives FN: false negatives

FP: false positives TP: true positives

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Metrics for classifier performance

 Confusion matrix
actual class
– derived metrics
(for two classes) class 1 class 2
(negative) (positive)
class 1
21 (TN) 6 (FN)
predicted (negative)
class class 2
7 (FP) 41 (TP)
(positive)

TP TN
sensitivit y  specificit y 
TP  FN TN  FP

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


MATLAB interlude

matlab_demo_04.m

Part B

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Underfitting and overfitting

 Fit of model to training and test sets is controlled


by:

– model capacity (  number of parameters )


 example: number of nodes in decision tree

– stage of optimization
 example: number of iterations in a gradient
descent optimization

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Underfitting and overfitting

underfitting overfitting

optimal fit

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Sources of overfitting: noise

Decision boundary distorted by noise point


Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Sources of overfitting: insufficient examples

 Lack of data points in lower half of diagram makes it


difficult to correctly predict class labels in that region.
– Insufficient training records in the region causes decision tree to
predict the test examples using other training records that are
irrelevant to the classification task.
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Occam’s Razor

 Given two models with similar generalization


errors, one should prefer the simpler model over
the more complex model.

 For complex models, there is a greater chance it


was fitted accidentally by errors in data.

 Model complexity should therefore be considered


when evaluating a model.

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Decision trees: addressing overfitting

 Pre-pruning (early stopping rules)


– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
 Stop if all instances belong to the same class
 Stop if all the attribute values are the same
– Early stopping conditions (more restrictive):
 Stop if number of instances is less than some user-specified
threshold
 Stop if class distribution of instances are independent of the
available features (e.g., using  2 test)
Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Decision trees: addressing overfitting

 Post-pruning
– Grow full decision tree
– Trim nodes of full tree in a bottom-up fashion
– If generalization error improves after trimming, replace
sub-tree by a leaf node.
– Class label of leaf node is determined from majority
class of instances in the sub-tree
– Can use various measures of generalization error for
post-pruning (see textbook)

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Example of post-pruning
Training error (before splitting) = 10/30

Class = Yes 20 Pessimistic error = (10 + 0.5)/30 = 10.5/30

Class = No 10 Training error (after splitting) = 9/30

Error = 10/30 Pessimistic error (after splitting)


= (9 + 4  0.5)/30 = 11/30
PRUNE!
A?

A1 A4
A2 A3

Class = Yes 8 Class = Yes 3 Class = Yes 4 Class = Yes 5


Class = No 4 Class = No 4 Class = No 1 Class = No 1

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


MNIST database of handwritten digits

 Gray-scale images, 28 x 28 pixels.


 10 classes, labels 0 through 9.
 Training set of 60,000 samples.
 Test set of 10,000 samples.
 Subset of a larger set available from NIST.
 Each digit size-normalized and centered in a fixed-size image.
 Good database for people who want to try machine learning
techniques on real-world data while spending minimal effort
on preprocessing and formatting.
 http://yann.lecun.com/exdb/mnist/
 We will use a subset of MNIST with 5000 training and 1000
test samples and formatted for MATLAB (mnistabridged.mat).
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
MATLAB interlude

matlab_demo_04.m

Part C

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Model validation

 Every (useful) model offers choices in one or


more of:
– model structure
 e.g. number of nodes and connections
– types and numbers of parameters
 e.g. coefficients, weights, etc.
 Furthermore, the values of most of these
parameters will be modified (optimized) during
the model training process.
 Suppose the test data somehow influences the
choice of model structure, or the optimization of
parameters …
Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›
Model validation

The one commandment of machine learning

TRAIN
on
TEST

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Model validation

Divide available labeled data into three sets:

 Training set:
– Used to drive model building and parameter optimization
 Validation set
– Used to gauge status of generalization error
– Results can be used to guide decisions during training process
 typically used mostly to optimize small number of high-level meta
parameters, e.g. regularization constants; number of gradient descent
iterations
 Test set
– Used only for final assessment of model quality, after training +
validation completely finished

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Validation strategies

 Holdout
 Cross-validation
 Leave-one-out (LOO)

 Random vs. block folds


– Use random folds if data are independent
samples from an underlying population
– Must use block folds if any there is any spatial
or temporal correlation between samples

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›


Validation strategies

 Holdout
– Pro: results in single model that can be used directly
in production
– Con: can be wasteful of data
– Con: a single static holdout partition has the potential
to be unrepresentative and statistically misleading
 Cross-validation and leave-one-out (LOO)
– Con: do not lead directly to a single production model
– Pro: use all available data for evaulation
– Pro: many partitions of data, helps average out
statistical variability

Jeff Howbert Introduction to Machine Learning Winter 2012 ‹#›

You might also like