L4 Classification
Classification Techniques
Example of a Decision Tree
Decision Tree Induction Algos
FUNDAMENTALS OF DATA MINING
Hunt’s Algorithm
Computing GINI Index
The data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
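Test Set (rows recovered from the slide; columns Tid, Attrib1, Attrib2, Attrib3, Class; "--" marks a Tid lost in extraction):
12   Yes  Medium  80K   ?
--   No   Small   95K   ?
15   No   Large   67K   ?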
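Example of a Decision Tree
Another Example of a Decision Tree
[Figure: a model is learned from the training set and applied to the test set.]
Training set rows recovered from the slide (columns Tid, Attrib1, Attrib2, Attrib3, Class):
6    No   Medium  60K   No
7    Yes  Large   220K  No
8    No   Small   85K   Yes
9    No   Medium  75K   No
10   No   Small   90K   Yes
Test set row recovered from the slide:
15   No   Large   67K   ?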
Decision tree (model):
Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
Apply Model to Test Data
Test data columns: Refund, Marital Status, Taxable Income, Cheat
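Applying the model means routing a record from the root of the tree down to a leaf. The sketch below hard-codes the Refund / MarSt / TaxInc tree from the slides as nested tests. The dictionary keys, the `classify` name, and the sample records are illustrative assumptions, and because the figure does not say which branch an income of exactly 80K takes, this sketch sends it to the upper branch.

```python
def classify(record):
    """Route one record through the example tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K goes to the upper (YES) branch.
    return "Yes" if record["TaxInc"] >= 80 else "No"


# Hypothetical records, not taken from the slides.
samples = [
    {"Refund": "No", "MarSt": "Married", "TaxInc": 80},
    {"Refund": "No", "MarSt": "Single", "TaxInc": 90},
    {"Refund": "Yes", "MarSt": "Divorced", "TaxInc": 220},
]
for rec in samples:
    print(rec, "->", classify(rec))
```

Each record follows exactly one root-to-leaf path, so the cost of a prediction is bounded by the depth of the tree.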
[Figure: the training set is used to learn a decision tree model, which is then applied to the test set.]
Test set rows recovered from the slide (columns Tid, Attrib1, Attrib2, Attrib3, Class):
11   No   Small   55K   ?
15   No   Large   67K   ?
Training data (rows recovered from the slide):
Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Figure: the tree is grown in stages: a single "Don’t Cheat" leaf; a split on Refund (Yes / No); a further split on Marital Status (Single, Divorced / Married); and a final split on Taxable Income (< 80K / >= 80K) with leaves "Don’t Cheat" and "Cheat".]

Greedy strategy.
  Split the records based on an attribute test that optimizes a certain criterion.

Issues
  Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  Determine when to stop splitting
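The greedy strategy can be sketched as a short recursive procedure: at each node, pick the attribute whose split gives the lowest weighted Gini index, and stop when the node is pure or nothing improves. This is a minimal illustration, not the course's reference implementation. It uses only rows 4 to 10 of the training data and only the two categorical attributes, so the resulting tree differs from the one in the slides. The function names, the multi-way splitting, and the stopping rule are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow(rows, attrs, target):
    """Greedily grow a tree; leaves are class labels, inner nodes are (attr, children)."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority

    def weighted_gini(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())

    best = min(attrs, key=weighted_gini)
    if weighted_gini(best) >= gini(labels):
        return majority  # assumption: stop when no split reduces impurity
    rest = [a for a in attrs if a != best]
    children = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        children[value] = grow(subset, rest, target)
    return (best, children)


# Rows 4-10 of the training data; Taxable Income is left out of this sketch.
rows = [
    {"Refund": "Yes", "MarSt": "Married", "Cheat": "No"},
    {"Refund": "No", "MarSt": "Divorced", "Cheat": "Yes"},
    {"Refund": "No", "MarSt": "Married", "Cheat": "No"},
    {"Refund": "Yes", "MarSt": "Divorced", "Cheat": "No"},
    {"Refund": "No", "MarSt": "Single", "Cheat": "Yes"},
    {"Refund": "No", "MarSt": "Married", "Cheat": "No"},
    {"Refund": "No", "MarSt": "Single", "Cheat": "Yes"},
]
tree = grow(rows, ["Refund", "MarSt"], "Cheat")
print(tree)
```

On this subset the sketch splits on marital status first and only then on refund, which shows how the chosen tree depends on the data and on the impurity criterion.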
Issues
  Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  Determine when to stop splitting

Specifying the attribute test condition
  Depends on attribute type (e.g., continuous)
  Depends on number of ways to split
    2-way split
    Multi-way split
Size: {Small, Large} vs. {Medium}
What about this split?
Splitting Based on Continuous Attributes
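Determine how to split the records
Candidate splits (class counts in each child node):
  Split 1: (C0: 6, C1: 4), (C0: 4, C1: 6)
  Split 2: (C0: 1, C1: 3), (C0: 8, C1: 0), (C0: 1, C1: 7)
  Split 3: (C0: 1, C1: 0), ..., (C0: 1, C1: 0), (C0: 0, C1: 1), ..., (C0: 0, C1: 1)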
Non-homogeneous: high degree of impurity
Homogeneous: low degree of impurity
Oracle supports both Gini and entropy; the default is Gini.
How to Find the Best Split
Before splitting: class counts C0: N00, C1: N01, with impurity M0.
Split on A? (Yes / No):
  Node N1: C0: N10, C1: N11, impurity M1
  Node N2: C0: N20, C1: N21, impurity M2
  Combined impurity of N1 and N2: M12
Split on B? (Yes / No):
  Node N3: C0: N30, C1: N31, impurity M3
  Node N4: C0: N40, C1: N41, impurity M4
  Combined impurity of N3 and N4: M34
Gain = M0 – M12 vs M0 – M34

Measure of Impurity: GINI
Gini index for a given node t:
$$\mathrm{GINI}(t) = 1 - \sum_{j} [\, p(j \mid t) \,]^2$$
(NOTE: p(j | t) is the relative frequency of class j at node t.)
Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information

Examples:
C1   C2   Gini
0    6    0.000
1    5    0.278
2    4    0.444
3    3    0.500
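The Gini definition is easy to check numerically. The sketch below computes the index from per-class record counts and evaluates it on the four example nodes with (C1, C2) counts 0/6, 1/5, 2/4, and 3/3. The function name and the handling of an empty node are assumptions made for illustration.

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: treat an empty node as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


# (C1, C2) counts of the four example nodes.
for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))
```

Rounded to three decimals, the values agree with the Gini figures listed for those nodes, and the two-class maximum of 1 - 1/2 = 0.5 is reached at the evenly split node.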
Binary Attributes: Computing GINI Index
Splits into two partitions
Effect of weighing partitions:
– Larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500
Split on B? (Yes: Node N1, No: Node N2):
      N1   N2
C1    5    1
C2    2    4
Gini(N1) = 1 – (5/7)^2 – (2/7)^2 = 0.408
Gini(N2) = 1 – (1/5)^2 – (4/5)^2 = 0.32
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.32 = 0.371

Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions

Multi-way split on CarType:
      Family  Sports  Luxury
C1    1       2       1
C2    4       1       1
Gini = 0.393

Two-way split on CarType (find best partition of values):
      {Sports, Luxury}  {Family}
C1    3                 1
C2    2                 4
Gini = 0.400

      {Sports}  {Family, Luxury}
C1    2         2
C2    1         5
Gini = 0.419
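The weighted Gini of a split can be recomputed from the count matrices. The sketch below weights each child's Gini by its share of the records and applies this to the binary split on B and to the three CarType splits. The helper names are illustrative assumptions, and the last line shows the gain relative to the parent node (C1 = 6, C2 = 6).

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0


def gini_split(children):
    """Weighted Gini of a split; children is a list of (C1, C2) count tuples."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


binary = gini_split([(5, 2), (1, 4)])            # nodes N1 and N2
multiway = gini_split([(1, 4), (2, 1), (1, 1)])  # Family, Sports, Luxury
two_way_a = gini_split([(3, 2), (1, 4)])         # {Sports, Luxury}, {Family}
two_way_b = gini_split([(2, 1), (2, 5)])         # {Sports}, {Family, Luxury}
gain = gini((6, 6)) - binary                     # parent impurity minus split impurity

for name, value in [("binary", binary), ("multiway", multiway),
                    ("two_way_a", two_way_a), ("two_way_b", two_way_b),
                    ("gain", gain)]:
    print(name, round(value, 3))
```

The split with the lowest weighted Gini, and therefore the highest gain, is the one a greedy tree builder would choose.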
Continuous Attributes: Computing Gini Index
Number of possible splitting values = number of distinct values
Each splitting value v has a count matrix associated with it
  Class counts in each of the partitions, A < v and A ≥ v
Simple method to choose best v
  Computationally inefficient! Repetition of work.
[Figure: binary split "Taxable Income > 80K?" with branches Yes / No]

Training data (rows recovered from the slide):
Tid  Refund  Marital Status  Taxable Income  Cheat
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No

Continuous Attributes: Computing Gini Index...
Linearly scan the sorted values, each time updating the count matrix and computing the Gini index
Choose the split position that has the least Gini index

Sorted values of Taxable Income with their Cheat labels:
Taxable Income  60   70   75   85   90   95   100  120  125  220
Cheat           No   No   No   Yes  Yes  Yes  No   No   No   No

Count matrix and Gini index for each candidate split position (positions 1 to 11 lie before, between, and after the sorted values; each cell gives the counts below | above the split):
Position  1      2      3      4      5      6      7      8      9      10     11
Yes       0|3    0|3    0|3    0|3    1|2    2|1    3|0    3|0    3|0    3|0    3|0
No        0|7    1|6    2|5    3|4    3|4    3|4    3|4    4|3    5|2    6|1    7|0
Gini      0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
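The linear scan can be sketched as a single pass over the sorted values that keeps running class counts instead of recounting for every candidate. The code below uses the ten Taxable Income values and Cheat labels from the sorted table. Placing each threshold at the midpoint between neighbouring values and skipping the two outermost positions are assumptions of this sketch, as are the function and variable names.

```python
def best_split(values, labels, positive="Yes"):
    """Return (gini, threshold) of the best binary split found by one sorted scan."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, y in pairs if y == positive)
    total_neg = n - total_pos
    left_pos = left_neg = 0
    best = (float("inf"), None)
    for i in range(n - 1):
        v, y = pairs[i]
        # Update the running count matrix for the left partition.
        if y == positive:
            left_pos += 1
        else:
            left_neg += 1
        nxt = pairs[i + 1][0]
        if nxt == v:
            continue  # no threshold fits between equal values
        n_left, n_right = i + 1, n - (i + 1)
        right_pos, right_neg = total_pos - left_pos, total_neg - left_neg
        g_left = 1 - (left_pos / n_left) ** 2 - (left_neg / n_left) ** 2
        g_right = 1 - (right_pos / n_right) ** 2 - (right_neg / n_right) ** 2
        g = n_left / n * g_left + n_right / n * g_right
        if g < best[0]:
            best = (g, (v + nxt) / 2)  # assumption: midpoint threshold
    return best


income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
best_gini, threshold = best_split(income, cheat)
print(round(best_gini, 3), threshold)
```

After the sort, each candidate costs constant work, so the scan is linear in the number of records, which avoids the repeated counting of the simple method.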