Chap3 Basic Classification

This document provides an overview of classification in data mining, detailing the definition, examples, and techniques used to build classification models. It discusses various classification methods including decision trees, rule-based methods, and ensemble classifiers, along with specific algorithms like ID3 and CART. The document emphasizes the process of learning a model from a training set and applying it to a test set for classification tasks.

Data Mining
Classification: Basic Concepts and Techniques

Lecture Notes for Chapter 3
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar

Classification: Definition

• Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    ◆ x: attribute, predictor, independent variable, input
    ◆ y: class, response, dependent variable, output

• Task:
  – Learn a model that maps each attribute set x into one of the predefined class labels y


Examples of Classification Task

Task                        | Attribute set, x                                          | Class label, y
Categorizing email messages | Features extracted from email message header and content  | spam or non-spam
Identifying tumor cells     | Features extracted from MRI scans                          | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images                   | elliptical, spiral, or irregular-shaped galaxies


General Approach for Building Classification Model

Training Set
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Induction: a learning algorithm learns a model from the training set.
Deduction: the learned model is applied to the test set to assign class labels.
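The workflow above can be sketched in a few lines of scikit-learn (not part of the slides); the categorical attributes are one-hot encoded here so the learner can split on them, and the encoding choices are illustrative only.

```python
# A minimal sketch of the induction/deduction workflow on the toy data above,
# assuming pandas and scikit-learn are available.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large", "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes; align test columns with training columns.
X_train = pd.get_dummies(train[["Attrib1", "Attrib2", "Attrib3"]])
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X_train, train["Class"])   # induction
print(model.predict(X_test))                                    # deduction
```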


Classification Techniques

Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Deep Learning
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines

Ensemble Classifiers
– Boosting, Bagging, Random Forests

Example of a Decision Tree

Training Data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower):
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes

Model: Decision Tree (splitting attributes: Home Owner, Marital Status, Annual Income)
Home Owner?
├─ Yes → NO
└─ No → Marital Status?
        ├─ Married → NO
        └─ Single, Divorced → Annual Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES


Another Example of Decision Tree

A different tree fit to the same training data:
Marital Status?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                      ├─ Yes → NO
                      └─ No → Annual Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!


Apply Model to Test Data

Test Data (Home Owner, Marital Status, Annual Income, Defaulted Borrower):
No | Married | 80K | ?

Start from the root of the tree and follow the branch that matches the test record at each node:
• Home Owner = No → go to the Marital Status node
• Marital Status = Married → reach the leaf labeled NO

Assign Defaulted Borrower = "No" to the test record.
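The traversal above can be written out directly; the following is a small hand-coded sketch of this particular tree (the function and field names are illustrative, not from the slides).

```python
# Hand-coded version of the example decision tree; incomes are in thousands.
def classify(record):
    """Walk the tree and return the predicted Defaulted Borrower label."""
    if record["HomeOwner"] == "Yes":
        return "No"
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: split on Annual Income
    return "No" if record["AnnualIncome"] < 80 else "Yes"

test_record = {"HomeOwner": "No", "MaritalStatus": "Married", "AnnualIncome": 80}
print(classify(test_record))  # -> "No"
```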


Decision Tree Classification Task

The same induction/deduction framework as before, specialized to decision trees: a tree induction algorithm learns a decision tree (the model) from the training set (Tid 1-10 shown earlier), and the tree is then applied to the test set (Tid 11-15) to deduce their class labels.


Decision Tree Induction

Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT


ID3 (Iterative Dichotomiser 3):
• Type: ID3 is primarily used for classification tasks, where the
target variable is categorical.
• Attribute Selection: Uses the Information Gain criterion to
select attributes for splitting nodes. It measures the reduction
in entropy after each split and selects the attribute with the
highest Information Gain.
• Handling Missing Values: ID3 does not handle missing
values well and can be sensitive to them.
• Tree Pruning: ID3 doesn't include a built-in mechanism for
tree pruning, which means it tends to create deeper trees that
may overfit the data.
• Outcome: The resulting tree from ID3 can be complex and
prone to overfitting, especially when dealing with noisy data.

One-R (1R):
• Type: One-R is a simple rule-based classification algorithm, not a traditional decision tree algorithm.
• Attribute Selection: It selects a single attribute (the most predictive one) and creates a rule for classification based on that attribute.
• Simplicity: One-R is known for its simplicity and interpretability, but it may not capture complex interactions in the data.
• Tree Depth: It generates a shallow tree with just one level of nodes, making it easy to understand but potentially less accurate.
• Outcome: One-R is often used as a baseline or benchmark for classification tasks due to its simplicity.
CART (Classification and Regression Trees):
• Type: CART is a versatile algorithm that can be used for both
classification and regression tasks.
• Attribute Selection: CART uses the Gini Impurity for
classification tasks and Mean Squared Error (MSE) for regression
tasks as criteria for attribute selection. It chooses the attribute
and split point that minimizes impurity or error.
• Handling Missing Values: CART can handle missing values by
creating surrogate splits, which help preserve predictive power
in the presence of missing data.
• Tree Pruning: CART includes a tree-pruning mechanism to
prevent overfitting. It grows a deep tree and then prunes it back
to an optimal size based on cross-validation or other criteria.
• Outcome: CART tends to produce binary trees, which are easier to understand and often more robust against overfitting compared to ID3.
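In practice the contrasts above can be explored with scikit-learn's DecisionTreeClassifier, which implements a CART-style learner: criterion="entropy" approximates the ID3/C4.5 information-gain criterion, criterion="gini" is the CART default, and ccp_alpha turns on cost-complexity post-pruning. A sketch on synthetic data, not taken from the slides:

```python
# Compare split criteria and pruning on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

variants = {
    "gini (CART default)":            DecisionTreeClassifier(criterion="gini", random_state=0),
    "entropy (ID3/C4.5-style gain)":  DecisionTreeClassifier(criterion="entropy", random_state=0),
    "gini + cost-complexity pruning": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
}
for name, clf in variants.items():
    clf.fit(X_tr, y_tr)
    print(f"{name}: depth={clf.get_depth()}, test accuracy={clf.score(X_te, y_te):.2f}")
```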
General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t

• General Procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The accompanying figure shows the ten-record training table of Home Owner, Marital Status, Annual Income, and Defaulted Borrower feeding a node Dt whose test is still to be determined, marked "?".)
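A compact recursive sketch of this procedure is shown below; build_tree is a hypothetical helper (not code from the text) that splits on attributes in a fixed order rather than choosing the best test at each node.

```python
# Hunt's-style recursive tree growing on categorical attributes (sketch).
from collections import Counter

def build_tree(records, label, attributes):
    labels = [r[label] for r in records]
    if len(set(labels)) == 1 or not attributes:       # pure node, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]   # leaf labeled with the majority class
    attr = attributes[0]                              # a full implementation would pick the best attribute
    children = {}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        children[value] = build_tree(subset, label, attributes[1:])
    return (attr, children)

records = [
    {"HomeOwner": "Yes", "Marital": "Single",   "Defaulted": "No"},
    {"HomeOwner": "No",  "Marital": "Married",  "Defaulted": "No"},
    {"HomeOwner": "No",  "Marital": "Single",   "Defaulted": "Yes"},
    {"HomeOwner": "No",  "Marital": "Divorced", "Defaulted": "Yes"},
]
print(build_tree(records, "Defaulted", ["HomeOwner", "Marital"]))
```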


Hunt's Algorithm

Applied to the ten-record training set (leaf counts shown as (number of No, number of Yes)):

(a) Start with a single leaf: Defaulted = No (7,3)
(b) Split on Home Owner: Yes → Defaulted = No (3,0); No → Defaulted = No (4,3)
(c) Under Home Owner = No, split on Marital Status: Single, Divorced → Defaulted = Yes (1,3); Married → Defaulted = No (3,0)
(d) Under Single, Divorced, split on Annual Income: < 80K → Defaulted = No (1,0); >= 80K → Defaulted = Yes (0,3)
Example

| Customer | Age | Purchased |
|----------|-----|-----------|
| A        | 25  | Yes       |
| B        | 35  | No        |
| C        | 30  | Yes       |
| D        | 20  | No        |
| E        | 45  | Yes       |

Step 1: Attribute Selection
• Choose the attribute that results in the best classification → Age
Step 2: Create Rules
• Divide the dataset into groups
  • Group 1: Age <= 30 → majority class: Yes (2 Yes, 1 No)
  • Group 2: Age > 30 → class: Yes (1 Yes, 1 No in this group)
Step 3: Final Rule
• Create a rule based on the attribute and majority class
  • Rule 1: If Age <= 30, predict "Yes."
  • Rule 2: If Age > 30, predict "Yes."
Step 4: Evaluation
• Apply the rules to new data for classification. For instance, a new customer with an Age of 28 matches Rule 1, so predict "Yes" for the purchase (a code sketch of these rules follows below).
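A minimal sketch of the two final rules, assuming the toy table above; note that with the data as shown the Age > 30 group is actually a 1 Yes / 1 No tie, which the slide resolves as "Yes", so both rules predict "Yes".

```python
# Apply the final rules from the example and measure training accuracy.
customers = [("A", 25, "Yes"), ("B", 35, "No"), ("C", 30, "Yes"),
             ("D", 20, "No"), ("E", 45, "Yes")]

def predict(age):
    if age <= 30:
        return "Yes"   # Rule 1
    return "Yes"       # Rule 2 (tie in this group, resolved as "Yes")

accuracy = sum(predict(age) == label for _, age, label in customers) / len(customers)
print(predict(28), accuracy)   # a 28-year-old customer -> "Yes"; training accuracy 0.6
```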


Design Issues of Decision Tree Induction

• How should training records be split?
  – Method for specifying the test condition
    ◆ depending on attribute types
  – Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
  – Stop splitting if all the records belong to the same class or have identical attribute values
  – Early termination
Methods for Expressing Test Conditions

• Depends on attribute types
  – Binary
  – Nominal
  – Ordinal
  – Continuous

• Depends on number of ways to split
  – 2-way split
  – Multi-way split


Test Condition for Nominal Attributes

• Multi-way split:
  – Use as many partitions as distinct values.
    Example: Marital Status → Single | Divorced | Married

• Binary split:
  – Divides values into two subsets, e.g.
    Marital Status → {Married} vs {Single, Divorced}, or {Single} vs {Married, Divorced}, or {Single, Married} vs {Divorced}


Test Condition for Ordinal Attributes

• Multi-way split:
  – Use as many partitions as distinct values
    Example: Shirt Size → Small | Medium | Large | Extra Large

• Binary split:
  – Divides values into two subsets
  – Must preserve the order property among attribute values
    Valid: {Small, Medium} vs {Large, Extra Large}, or {Small} vs {Medium, Large, Extra Large}
    Invalid: {Small, Large} vs {Medium, Extra Large} (this grouping violates the order property)
Test Condition for Continuous Attributes

(i) Binary split: Annual Income > 80K? → Yes / No
(ii) Multi-way split: Annual Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Splitting Based on Continuous Attributes

• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
    ◆ Static – discretize once at the beginning
    ◆ Dynamic – repeat at each node
  – Binary decision: (A < v) or (A ≥ v)
    ◆ consider all possible splits and find the best cut
    ◆ can be more compute intensive
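As a small illustration of static discretization, pandas can bucket a continuous attribute by equal-width or equal-frequency ranges (the bucket count of 4 is arbitrary here):

```python
# Equal-interval vs equal-frequency bucketing of a continuous attribute.
import pandas as pd

income = pd.Series([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # in thousands

equal_interval = pd.cut(income, bins=4)     # equal-width buckets
equal_frequency = pd.qcut(income, q=4)      # equal-frequency buckets (quartiles)
print(equal_interval.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```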
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1

Three candidate test conditions:
• Gender: Yes → C0: 6, C1: 4; No → C0: 4, C1: 6
• Car Type: Family → C0: 1, C1: 3; Sports → C0: 8, C1: 0; Luxury → C0: 1, C1: 7
• Customer ID: each of c1 … c10 → C0: 1, C1: 0; each of c11 … c20 → C0: 0, C1: 1

Which test condition is the best?
How to determine the Best Split

• Greedy approach:
  – Nodes with purer class distribution are preferred

• Need a measure of node impurity:
  C0: 5, C1: 5 → high degree of impurity
  C0: 9, C1: 1 → low degree of impurity


Measures of Node Impurity

Gini Index
A measure of the randomness or impurity of the class values at a node. Decision tree induction aims to decrease impurity from the root node (at the top of the decision tree) down to the leaf nodes.

GINI(t) = 1 − Σ_j [p(j|t)]²


Measures of Node Impurity

Entropy
Entropy is a measure of the uncertainty in the data.

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

Misclassification error

Error(t) = 1 − max_i P(i|t)


Finding the Best Split

1. Compute the impurity measure (P) before splitting
2. Compute the impurity measure (M) after splitting
   ◆ Compute the impurity measure of each child node
   ◆ M is the weighted impurity of the children
3. Choose the attribute test condition that produces the highest gain

   Gain = P − M

   or equivalently, the lowest impurity measure after splitting (M)


Finding the Best Split

Before splitting, the parent node has class counts C0: N00, C1: N01 and impurity P.

Candidate split A? (Yes/No) produces children N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with impurities M11 and M12; their weighted impurity is M1.
Candidate split B? (Yes/No) produces children N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with impurities M21 and M22; their weighted impurity is M2.

Compare Gain = P − M1 vs P − M2 and choose the split with the larger gain.
Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [p(j|t)]²

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

  – Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting information


Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [p(j|t)]²

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

  – For a 2-class problem (p, 1 − p):
    ◆ GINI = 1 − p² − (1 − p)² = 2p(1 − p)

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500


Computing Gini Index of a Single Node

GINI(t) = 1 − Σ_j [p(j|t)]²

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444
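These node computations are easy to check with a small helper function (a sketch, not from the text):

```python
# Gini index of a node from its list of class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```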


Computing Gini Index for a Collection of Nodes

• When a node p is split into k partitions (children):

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at parent node p.

• Choose the attribute that minimizes the weighted average Gini index of the children

• The Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought

Parent: C1 = 7, C2 = 5, Gini = 0.486

Split on B?: Yes → Node N1, No → Node N2

        N1   N2
C1       5    2
C2       1    4

Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444
Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
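A short sketch of the weighted (split) Gini and the resulting gain, reusing the numbers from this slide (helper names are illustrative):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    # Weighted Gini of the children, each given as a list of class counts.
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

parent = [7, 5]                 # class counts C1, C2 at the parent
children = [[5, 1], [2, 4]]     # nodes N1 and N2 after splitting on B
gain = gini(parent) - gini_split(children)
print(round(gini(parent), 3), round(gini_split(children), 3), round(gain, 3))
# 0.486 0.361 0.125
```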


Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split on CarType: Family (C1: 1, C2: 3), Sports (C1: 8, C2: 0), Luxury (C1: 1, C2: 7) → Gini = 0.163

Two-way splits (find the best partition of values):
  {Sports, Luxury} (C1: 9, C2: 7) vs {Family} (C1: 1, C2: 3) → Gini = 0.468
  {Sports} (C1: 8, C2: 0) vs {Family, Luxury} (C1: 2, C2: 10) → Gini = 0.167

Which of these is the best?


Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

(The slide shows the ten-record training table again: ID, Home Owner, Marital Status, Annual Income, Defaulted.)

Example count matrix for the split Annual Income ≤ 80K vs > 80K:
                  ≤ 80   > 80
Defaulted = Yes     0      3
Defaulted = No      3      4


Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

Cheat:            No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted Values:    60    70    75    85    90    95    100   120   125   220
Split Positions:  55    65    72    80    87    92    97    110   122   172   230
Yes (<= | >):     0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No  (<= | >):     0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini:             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split position is 97 (Annual Income <= 97), with the lowest Gini index of 0.300.
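The sort-and-scan search can be sketched as follows; for clarity this version recomputes the class counts at every candidate cut rather than updating them incrementally, and candidate cuts are taken as midpoints between adjacent sorted values.

```python
# Find the best binary split on Annual Income (values in thousands) by Gini.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def class_counts(labels):
    return [labels.count("Yes"), labels.count("No")]

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

pairs = sorted(zip(income, cheat))                                 # sort once by value
values = [v for v, _ in pairs]
cuts = [(values[i] + values[i + 1]) / 2 for i in range(len(values) - 1)]

best = None
for v in cuts:                                                     # linear scan over cuts
    left  = [c for x, c in pairs if x <= v]
    right = [c for x, c in pairs if x > v]
    w = (len(left) * gini(class_counts(left))
         + len(right) * gini(class_counts(right))) / len(pairs)
    if best is None or w < best[1]:
        best = (v, w)

print(best)   # (97.5, 0.3): the best cut separates incomes <= 97.5 from the rest
```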




Measure of Impurity: Entropy

• Entropy at a given node t:

  Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

  ◆ Maximum (log₂ nc) when records are equally distributed among all classes, implying least information
  ◆ Minimum (0.0) when all records belong to one class, implying most information

  – Entropy-based computations are quite similar to the GINI index computations
Computing Entropy of a Single Node

Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = − 0 log₂ 0 − 1 log₂ 1 = − 0 − 0 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
  Entropy = − (1/6) log₂ (1/6) − (5/6) log₂ (5/6) = 0.65

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
  Entropy = − (2/6) log₂ (2/6) − (4/6) log₂ (4/6) = 0.92
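As with the Gini index, a small helper reproduces these node values (a sketch, not from the text):

```python
# Entropy (base 2) of a node from its list of class counts.
from math import log2

def entropy(counts):
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]       # treat 0 * log(0) as 0
    return -sum(p * log2(p) for p in probs)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```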


Computing Information Gain After Splitting

• Information Gain:

  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

  Parent node p is split into k partitions; n_i is the number of records in partition i

  – Choose the split that achieves the most reduction (maximizes GAIN)

  – Used in the ID3 and C4.5 decision tree algorithms
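A brief sketch of this computation, using the Home Owner split from the earlier Hunt's algorithm example (parent counts 7 No / 3 Yes, children (3,0) and (4,3)):

```python
# Information gain of a candidate split = parent entropy - weighted child entropy.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

print(round(info_gain([7, 3], [[3, 0], [4, 3]]), 3))   # ~0.192
```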


Problem with large number of partitions

• Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure

(Same candidate test conditions as before: Gender, Car Type, and Customer ID, where each Customer ID value c1 … c20 isolates a single record.)

  – Customer ID has the highest information gain because the entropy for all of its children is zero
Gain Ratio

• Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1..k} (n_i / n) log₂ (n_i / n)

  Parent node p is split into k partitions; n_i is the number of records in partition i

  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
    ◆ Higher-entropy partitioning (large number of small partitions) is penalized!
  – Used in the C4.5 algorithm
  – Designed to overcome the disadvantage of Information Gain


Gain Ratio

• Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1..k} (n_i / n) log₂ (n_i / n)

  Parent node p is split into k partitions; n_i is the number of records in partition i

Multi-way split on CarType: Family (C1: 1, C2: 3), Sports (C1: 8, C2: 0), Luxury (C1: 1, C2: 7) → Gini = 0.163, SplitINFO = 1.52
Two-way split {Sports, Luxury} (C1: 9, C2: 7) vs {Family} (C1: 1, C2: 3) → Gini = 0.468, SplitINFO = 0.72
Two-way split {Sports} (C1: 8, C2: 0) vs {Family, Luxury} (C1: 2, C2: 10) → Gini = 0.167, SplitINFO = 0.97
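The SplitINFO values above follow directly from the partition sizes (4/8/8, 16/4, and 8/12 records out of 20); a short sketch, with the gain value in the last line purely hypothetical for illustration:

```python
# SplitINFO of a partitioning and the resulting gain ratio.
from math import log2

def split_info(sizes):
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(round(split_info([4, 8, 8]), 2))   # 1.52  (multi-way CarType split)
print(round(split_info([16, 4]), 2))     # 0.72  ({Sports, Luxury} vs {Family})
print(round(split_info([8, 12]), 2))     # 0.97  ({Sports} vs {Family, Luxury})
print(round(gain_ratio(0.19, [4, 8, 8]), 3))   # hypothetical GAIN of 0.19
```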


Measure of Impurity: Classification Error

• Classification error at a node t:

  Error(t) = 1 − max_i P(i|t)

  – Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0) when all records belong to one class, implying most interesting information


Computing Error of a Single Node

Error(t) = 1 − max_i P(i|t)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 − max(0, 1) = 1 − 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
  Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
  Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
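And the corresponding one-line helper (a sketch, mirroring the Gini and entropy helpers above):

```python
# Classification error of a node from its list of class counts.
def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

print(classification_error([0, 6]))              # 0.0
print(round(classification_error([1, 5]), 3))    # 0.167  (1/6)
print(round(classification_error([2, 4]), 3))    # 0.333  (1/3)
```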


Comparison among Impurity Measures

For a 2-class problem: (figure comparing the three impurity measures as a function of the class probability; not reproduced here)


Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?: Yes → Node N1, No → Node N2

        N1   N2
C1       3    4
C2       0    3

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same!


Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

Two candidate splits on A?:

        N1   N2               N1   N2
C1       3    4          C1    3    4
C2       0    3          C2    1    2
    Gini = 0.342             Gini = 0.416

Misclassification error for all three cases = 0.3!


Decision Tree Based Classification

• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Robust to noise (especially when methods to avoid overfitting are employed)
  – Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

• Disadvantages:
  – The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree
  – Does not take into account interactions between attributes
  – Each decision boundary involves only a single attribute
Handling interactions

+ : 1000 instances, o : 1000 instances
Entropy(X): 0.99, Entropy(Y): 0.99

(Scatter plot of the two classes over attributes X and Y; not reproduced here.)


Handling interactions

+ : 1000 instances, o : 1000 instances
Entropy(X): 0.99, Entropy(Y): 0.99, Entropy(Z): 0.98

Adding Z as a noisy attribute generated from a uniform distribution: attribute Z will be chosen for splitting!

(Scatter plots of the classes over X vs Z and Y vs Z; not reproduced here.)
Limitations of single attribute-based decision boundaries

Both the positive (+) and negative (o) classes are generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.

(Scatter plot; not reproduced here.)
