L4 Classification

The document discusses classification techniques in data mining, focusing on decision trees and their induction algorithms, including Hunt's Algorithm. It outlines various classification tasks, such as predicting tumor cells and classifying credit card transactions, and explains the process of building and applying decision trees using training and test sets. Additionally, it covers the criteria for splitting records based on different attribute types and the challenges involved in determining optimal splits.


FUNDAMENTALS OF DATA MINING
Lecture: Classification Techniques – Decision Tree

This Lecture…

• Classification Techniques
• Example of a Decision Tree
• Decision Tree Induction Algorithms
• Hunt's Algorithm
• Computing the GINI Index

Examples of Classification Tasks

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil

Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Classification: Definition

• Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm is run on the training set (induction) to learn a model; the model is then applied to the test set (deduction).

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes)

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Another Example of Decision Tree

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

The training set (Tid 1–10) is fed to a tree induction algorithm, which learns a model (the decision tree); the tree is then applied to the test set (Tid 11–15) to deduce their class labels.

Apply Model to Test Data

Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Apply Model to Test Data (continued)

For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K), traverse the tree one test at a time: Refund = No takes the right branch down to the MarSt node, and MarSt = Married takes the Married branch.

Apply Model to Test Data (final step)

The Married branch ends at a leaf labelled NO, so assign Cheat = "No" to the test record.

Decision Tree Classification Task

The same workflow as before: the tree induction algorithm learns the decision tree from the training set (Tid 1–10), and the learned model is applied to the test set (Tid 11–15) by deduction.
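To make the apply-model walk-through concrete, here is a minimal Python sketch of the example tree written as a plain function; the function and argument names are illustrative, not part of the slides.

```python
# Minimal sketch (not from the slides): the example decision tree as a Python function.
# Attribute names mirror the slide's columns: Refund, Marital Status, Taxable Income.

def classify_cheat(refund, marital_status, taxable_income):
    """Return the predicted Cheat label for one record."""
    if refund == "Yes":                      # left branch of the root
        return "No"
    # Refund == "No": test Marital Status next
    if marital_status in ("Single", "Divorced"):
        return "Yes" if taxable_income > 80_000 else "No"
    return "No"                              # Married leaf

# The test record from the slides: Refund = No, Married, Taxable Income = 80K
print(classify_cheat("No", "Married", 80_000))   # -> "No", i.e. assign Cheat = "No"
```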

Decision Tree Induction

• Many Algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT

General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure (see the sketch below):
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data, Tid 1–10.)
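The general procedure can be sketched in a few lines of Python. This is a rough illustration under the assumption that records are (attribute-dict, label) pairs and that a hypothetical helper choose_split picks the attribute test; neither detail comes from the slides.

```python
# Rough sketch of Hunt's algorithm; `choose_split` is a hypothetical helper that
# picks an attribute test and partitions the records (e.g. by minimising Gini).
from collections import Counter

def hunt(records, choose_split):
    """records: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                      # Dt is pure -> leaf labelled y_t
        return {"leaf": labels[0]}
    split = choose_split(records)                  # attribute test condition
    if split is None:                              # no useful split -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    test, partitions = split                       # partitions: {outcome: sub-records}
    return {"test": test,
            "children": {outcome: hunt(subset, choose_split)
                         for outcome, subset in partitions.items()}}
```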

Hunt's Algorithm (worked example)

Applied to the Refund / Marital Status / Taxable Income / Cheat training data (Tid 1–10), the tree grows in steps:

Step 1: a single leaf predicting the majority class, Don't Cheat.
Step 2: split on Refund. Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: split the Refund = No branch on Marital Status. Single, Divorced -> Cheat; Married -> Don't Cheat.
Step 4: split the Single, Divorced branch on Taxable Income. < 80K -> Don't Cheat; >= 80K -> Cheat.

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

How to Specify Test Condition?

• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous

• Depends on number of ways to split
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.
    CarType: Family / Sports / Luxury

• Binary split: Divides values into two subsets. Need to find the optimal partitioning.
    CarType: {Sports, Luxury} vs {Family}   OR   CarType: {Family, Luxury} vs {Sports}

Splitting Based on Ordinal Attributes

• Multi-way split: Use as many partitions as distinct values.
    Size: Small / Medium / Large

• Binary split: Divides values into two subsets. Need to find the optimal partitioning.
    Size: {Small, Medium} vs {Large}   OR   Size: {Small} vs {Medium, Large}

• What about this split?  Size: {Small, Large} vs {Medium}
    (It groups non-adjacent values, so it does not respect the ordering of the ordinal attribute.)
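Finding the optimal binary partitioning of a nominal attribute requires enumerating candidate subsets. The following is a small illustrative Python sketch; the helper binary_partitions is hypothetical, not from the slides. For an ordinal attribute such as Size, only the partitions that respect the value order would be kept.

```python
# Sketch: enumerate the candidate two-way splits of a nominal attribute's values.
from itertools import combinations

def binary_partitions(values):
    """Yield each unordered {left} vs {rest} partition of a set of nominal values."""
    values = sorted(values)
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if r == len(values) - r and left > right:
                continue                     # skip the mirror of an equal-size split
            yield set(left), set(right)

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# Three candidates: {Family} vs {Sports, Luxury}, {Luxury} vs {Family, Sports},
# and {Sports} vs {Family, Luxury}.
```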

Splitting Based on Continuous Attributes

• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    ◼ Static: discretize once at the beginning
    ◼ Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  – Binary Decision: (A < v) or (A >= v)
    ◼ consider all possible splits and find the best cut
    ◼ can be more compute intensive

Examples on Taxable Income:
(i) Binary split: Taxable Income > 80K?  Yes / No
(ii) Multi-way split: Taxable Income?  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
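As a small illustration of static discretization, the sketch below buckets the Taxable Income values with equal-interval and equal-frequency (percentile) edges using numpy; the choice of four bins is an assumption for the example, not from the slides.

```python
# Sketch: static discretization of a continuous attribute into ordinal buckets.
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # Taxable Income (K)

# Equal-interval bucketing: 4 bins of equal width over [min, max] (4 bins is assumed)
equal_width_edges = np.linspace(income.min(), income.max(), 5)

# Equal-frequency bucketing: edges at the quartiles (percentiles)
equal_freq_edges = np.percentile(income, [0, 25, 50, 75, 100])

print(np.digitize(income, equal_width_edges[1:-1]))  # bin index per record
print(np.digitize(income, equal_freq_edges[1:-1]))
```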

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Candidate test conditions:

Own Car?     Yes: C0: 6, C1: 4        No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7
Student ID?  c1: C0: 1, C1: 0  ...  c10: C0: 1, C1: 0    c11: C0: 0, C1: 1  ...  c20: C0: 0, C1: 1

Which test condition is the best?

How to determine the Best Split

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

    C0: 5, C1: 5   Non-homogeneous, high degree of impurity
    C0: 9, C1: 1   Homogeneous, low degree of impurity

Measures of Node Impurity

• Gini Index
• Entropy
• Misclassification error

• Oracle supports both Gini and Entropy; the default is Gini.
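For a quick comparison, the sketch below computes all three impurity measures for the two node distributions shown above; it is an illustrative example, not code from the lecture.

```python
# Sketch: Gini, entropy, and misclassification error for a node's class counts.
from math import log2

def impurities(counts):
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    gini = 1 - sum(p * p for p in probs)
    entropy = -sum(p * log2(p) for p in probs)
    error = 1 - max(probs)
    return gini, entropy, error

print(impurities([5, 5]))   # non-homogeneous: (0.5, 1.0, 0.5)       -> high impurity
print(impurities([9, 1]))   # homogeneous:     ~(0.18, 0.47, 0.1)    -> low impurity
```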

How to Find the Best Split

Before splitting, the parent node has class counts C0: N00, C1: N01 and impurity M0.

Candidate attribute A splits the records into Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21); their impurities M1 and M2 combine into M12.
Candidate attribute B splits the records into Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41); their impurities M3 and M4 combine into M34.

Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.

Measure of Impurity: GINI

• Gini Index for a given node t:

    GINI(t) = 1 − Σj [ p(j | t) ]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information

    C1: 0, C2: 6   Gini = 0.000
    C1: 1, C2: 5   Gini = 0.278
    C1: 2, C2: 4   Gini = 0.444
    C1: 3, C2: 3   Gini = 0.500

Examples for computing GINI

GINI(t) = 1 − Σj [ p(j | t) ]²

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 − (2/6)² − (4/6)² = 0.444

Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ(i=1..k) (n_i / n) GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.

Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1: 6, C2: 6, Gini = 0.500

Split on B?  (Yes -> Node N1, No -> Node N2)

        N1   N2
  C1    5    1
  C2    2    4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split (CarType):
        Family  Sports  Luxury
  C1    1       2       1
  C2    4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
        {Sports, Luxury}  {Family}            {Family, Luxury}  {Sports}
  C1    3                 1                   2                 2
  C2    2                 4                   1                 5
  Gini = 0.400                                Gini = 0.419
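The weighted computation of GINI_split can be checked directly; the short sketch below is illustrative (not from the slides) and reproduces the numbers on this slide.

```python
# Sketch: GINI_split as the record-weighted average of the children's Gini values.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: list of per-class count lists, one list per partition."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Binary split B? from the slide: N1 = (C1: 5, C2: 2), N2 = (C1: 1, C2: 4)
print(gini_split([[5, 2], [1, 4]]))          # ~0.371, vs parent Gini 0.500

# CarType count matrices from the slide
print(gini_split([[1, 4], [2, 1], [1, 1]]))  # multi-way split              ~0.393
print(gini_split([[3, 2], [1, 4]]))          # {Sports, Luxury} vs {Family} ~0.400
print(gini_split([[2, 1], [2, 5]]))          # {Family, Luxury} vs {Sports} ~0.419
```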

Continuous Attributes: Computing Gini Index

• Use Binary Decisions based on one value, e.g. Taxable Income > 80K? (Yes / No)
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A >= v
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

Continuous Attributes: Computing Gini Index...

• For efficient computation, for each attribute:
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

Taxable Income example (Cheat labels shown above the sorted values):

Cheat            No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted Values    60    70    75    85    90    95    100   120   125   220
Split Positions  55    65    72    80    87    92    97    110   122   172   230
Yes (<=, >)      0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (<=, >)      0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split position is Taxable Income <= 97 (Gini = 0.300).
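The sort-and-scan procedure can be sketched as follows. This illustrative Python example (not from the slides) uses midpoints between consecutive sorted values as candidate thresholds and reproduces the best split found above.

```python
# Sketch: find the best binary split of a continuous attribute in one linear scan.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """values: continuous attribute; labels: class per record. Returns (threshold, gini)."""
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}                       # counts for A <= v
    right = {c: labels.count(c) for c in classes}        # counts for A > v
    best = (None, float("inf"))
    for i in range(len(data) - 1):
        v, y = data[i]
        left[y] += 1                                     # move one record to the left side
        right[y] -= 1
        threshold = (v + data[i + 1][0]) / 2             # midpoint candidate split
        n_left, n_right, n = i + 1, len(data) - i - 1, len(data)
        g = (n_left / n * gini(list(left.values()))
             + n_right / n * gini(list(right.values())))
        if g < best[1]:
            best = (threshold, g)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (K), Tid 1-10
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # -> (97.5, ~0.3): split near 97 with Gini = 0.300
```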

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values

Decision Tree Based Classification

• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets

Example: C4.5

• Simple depth-first construction.
• Uses Information Gain
• Sorts continuous attributes at each node.
• Needs the entire data set to fit in memory.
• Unsuitable for large datasets.
  – Needs out-of-core sorting.

• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
