L4 Classification

The document discusses classification techniques in data mining, focusing on decision trees and their induction algorithms, including Hunt's Algorithm. It outlines various classification tasks, such as predicting tumor cells and classifying credit card transactions, and explains the process of building and applying decision trees using training and test sets. Additionally, it covers the criteria for splitting records based on different attribute types and the challenges involved in determining optimal splits.


FUNDAMENTALS OF DATA MINING
Lecture: Classification Techniques – Decision Tree

This Lecture…

• Classification Techniques
• Example of a Decision Tree
• Decision Tree Induction Algorithms
• Hunt's Algorithm
• Computing the GINI Index

Examples of Classification Tasks

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil

Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Classification: Definition

• Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm is run on the training set (induction) to learn a model; the model is then applied to the test set (deduction).

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes)

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Another Example of Decision Tree

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

The training set (Tid 1–10) is fed to a tree induction algorithm, which learns a model (the decision tree); the tree is then applied to the test set (Tid 11–15) to deduce their class labels.

Apply Model to Test Data

Start from the root of the tree.

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married          -> NO

Apply Model to Test Data (continued)

For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K), traverse the tree one test at a time: Refund = No takes the right branch down to the MarSt node, and MarSt = Married takes the Married branch.

Apply Model to Test Data (final step)

The Married branch ends at a leaf labelled NO, so assign Cheat = "No" to the test record.

Decision Tree Classification Task

The same workflow as before: the tree induction algorithm learns the decision tree from the training set (Tid 1–10), and the learned model is applied to the test set (Tid 11–15) by deduction.
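To make the apply-model walk-through concrete, here is a minimal Python sketch of the example tree written as a plain function; the function and argument names are illustrative, not part of the slides.

```python
# Minimal sketch (not from the slides): the example decision tree as a Python function.
# Attribute names mirror the slide's columns: Refund, Marital Status, Taxable Income.

def classify_cheat(refund, marital_status, taxable_income):
    """Return the predicted Cheat label for one record."""
    if refund == "Yes":                      # left branch of the root
        return "No"
    # Refund == "No": test Marital Status next
    if marital_status in ("Single", "Divorced"):
        return "Yes" if taxable_income > 80_000 else "No"
    return "No"                              # Married leaf

# The test record from the slides: Refund = No, Married, Taxable Income = 80K
print(classify_cheat("No", "Married", 80_000))   # -> "No", i.e. assign Cheat = "No"
```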

Decision Tree Induction

• Many Algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT

General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure (see the sketch below):
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data, Tid 1–10.)
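The general procedure can be sketched in a few lines of Python. This is a rough illustration under the assumption that records are (attribute-dict, label) pairs and that a hypothetical helper choose_split picks the attribute test; neither detail comes from the slides.

```python
# Rough sketch of Hunt's algorithm; `choose_split` is a hypothetical helper that
# picks an attribute test and partitions the records (e.g. by minimising Gini).
from collections import Counter

def hunt(records, choose_split):
    """records: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                      # Dt is pure -> leaf labelled y_t
        return {"leaf": labels[0]}
    split = choose_split(records)                  # attribute test condition
    if split is None:                              # no useful split -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    test, partitions = split                       # partitions: {outcome: sub-records}
    return {"test": test,
            "children": {outcome: hunt(subset, choose_split)
                         for outcome, subset in partitions.items()}}
```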

Hunt's Algorithm (worked example)

Applied to the Refund / Marital Status / Taxable Income / Cheat training data (Tid 1–10), the tree grows in steps:

Step 1: a single leaf predicting the majority class, Don't Cheat.
Step 2: split on Refund. Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: split the Refund = No branch on Marital Status. Single, Divorced -> Cheat; Married -> Don't Cheat.
Step 4: split the Single, Divorced branch on Taxable Income. < 80K -> Don't Cheat; >= 80K -> Cheat.

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

How to Specify Test Condition?

• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous

• Depends on number of ways to split
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.
    CarType: Family / Sports / Luxury

• Binary split: Divides values into two subsets. Need to find the optimal partitioning.
    CarType: {Sports, Luxury} vs {Family}   OR   CarType: {Family, Luxury} vs {Sports}

Splitting Based on Ordinal Attributes

• Multi-way split: Use as many partitions as distinct values.
    Size: Small / Medium / Large

• Binary split: Divides values into two subsets. Need to find the optimal partitioning.
    Size: {Small, Medium} vs {Large}   OR   Size: {Small} vs {Medium, Large}

• What about this split?  Size: {Small, Large} vs {Medium}
    (It groups non-adjacent values, so it does not respect the ordering of the ordinal attribute.)
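Finding the optimal binary partitioning of a nominal attribute requires enumerating candidate subsets. The following is a small illustrative Python sketch; the helper binary_partitions is hypothetical, not from the slides. For an ordinal attribute such as Size, only the partitions that respect the value order would be kept.

```python
# Sketch: enumerate the candidate two-way splits of a nominal attribute's values.
from itertools import combinations

def binary_partitions(values):
    """Yield each unordered {left} vs {rest} partition of a set of nominal values."""
    values = sorted(values)
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if r == len(values) - r and left > right:
                continue                     # skip the mirror of an equal-size split
            yield set(left), set(right)

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# Three candidates: {Family} vs {Sports, Luxury}, {Luxury} vs {Family, Sports},
# and {Sports} vs {Family, Luxury}.
```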

Splitting Based on Continuous Attributes

• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    ◼ Static: discretize once at the beginning
    ◼ Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  – Binary Decision: (A < v) or (A >= v)
    ◼ consider all possible splits and find the best cut
    ◼ can be more compute intensive

Examples on Taxable Income:
(i) Binary split: Taxable Income > 80K?  Yes / No
(ii) Multi-way split: Taxable Income?  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
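As a small illustration of static discretization, the sketch below buckets the Taxable Income values with equal-interval and equal-frequency (percentile) edges using numpy; the choice of four bins is an assumption for the example, not from the slides.

```python
# Sketch: static discretization of a continuous attribute into ordinal buckets.
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])  # Taxable Income (K)

# Equal-interval bucketing: 4 bins of equal width over [min, max] (4 bins is assumed)
equal_width_edges = np.linspace(income.min(), income.max(), 5)

# Equal-frequency bucketing: edges at the quartiles (percentiles)
equal_freq_edges = np.percentile(income, [0, 25, 50, 75, 100])

print(np.digitize(income, equal_width_edges[1:-1]))  # bin index per record
print(np.digitize(income, equal_freq_edges[1:-1]))
```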

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Candidate test conditions:

Own Car?     Yes: C0: 6, C1: 4        No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7
Student ID?  c1: C0: 1, C1: 0  ...  c10: C0: 1, C1: 0    c11: C0: 0, C1: 1  ...  c20: C0: 0, C1: 1

Which test condition is the best?

How to determine the Best Split

• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:

    C0: 5, C1: 5   Non-homogeneous, high degree of impurity
    C0: 9, C1: 1   Homogeneous, low degree of impurity

Measures of Node Impurity

• Gini Index
• Entropy
• Misclassification error

• Oracle supports both Gini and Entropy; the default is Gini.
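For a quick comparison, the sketch below computes all three impurity measures for the two node distributions shown above; it is an illustrative example, not code from the lecture.

```python
# Sketch: Gini, entropy, and misclassification error for a node's class counts.
from math import log2

def impurities(counts):
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    gini = 1 - sum(p * p for p in probs)
    entropy = -sum(p * log2(p) for p in probs)
    error = 1 - max(probs)
    return gini, entropy, error

print(impurities([5, 5]))   # non-homogeneous: (0.5, 1.0, 0.5)       -> high impurity
print(impurities([9, 1]))   # homogeneous:     ~(0.18, 0.47, 0.1)    -> low impurity
```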

How to Find the Best Split

Before splitting, the parent node has class counts C0: N00, C1: N01 and impurity M0.

Candidate attribute A splits the records into Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21); their impurities M1 and M2 combine into M12.
Candidate attribute B splits the records into Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41); their impurities M3 and M4 combine into M34.

Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.

Measure of Impurity: GINI

• Gini Index for a given node t:

    GINI(t) = 1 − Σj [ p(j | t) ]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information

    C1: 0, C2: 6   Gini = 0.000
    C1: 1, C2: 5   Gini = 0.278
    C1: 2, C2: 4   Gini = 0.444
    C1: 3, C2: 3   Gini = 0.500

Examples for computing GINI

GINI(t) = 1 − Σj [ p(j | t) ]²

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 − (2/6)² − (4/6)² = 0.444

Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ(i=1..k) (n_i / n) GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.

Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1: 6, C2: 6, Gini = 0.500

Split on B?  (Yes -> Node N1, No -> Node N2)

        N1   N2
  C1    5    1
  C2    2    4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split (CarType):
        Family  Sports  Luxury
  C1    1       2       1
  C2    4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
        {Sports, Luxury}  {Family}            {Family, Luxury}  {Sports}
  C1    3                 1                   2                 2
  C2    2                 4                   1                 5
  Gini = 0.400                                Gini = 0.419
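The weighted computation of GINI_split can be checked directly; the short sketch below is illustrative (not from the slides) and reproduces the numbers on this slide.

```python
# Sketch: GINI_split as the record-weighted average of the children's Gini values.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: list of per-class count lists, one list per partition."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Binary split B? from the slide: N1 = (C1: 5, C2: 2), N2 = (C1: 1, C2: 4)
print(gini_split([[5, 2], [1, 4]]))          # ~0.371, vs parent Gini 0.500

# CarType count matrices from the slide
print(gini_split([[1, 4], [2, 1], [1, 1]]))  # multi-way split              ~0.393
print(gini_split([[3, 2], [1, 4]]))          # {Sports, Luxury} vs {Family} ~0.400
print(gini_split([[2, 1], [2, 5]]))          # {Family, Luxury} vs {Sports} ~0.419
```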

Continuous Attributes: Computing Gini Index

• Use Binary Decisions based on one value, e.g. Taxable Income > 80K? (Yes / No)
• Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A >= v
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index
  – Computationally inefficient! Repetition of work.

Continuous Attributes: Computing Gini Index...

• For efficient computation, for each attribute:
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and computing the Gini index
  – Choose the split position that has the least Gini index

Taxable Income example (Cheat labels shown above the sorted values):

Cheat            No    No    No    Yes   Yes   Yes   No    No    No    No
Sorted Values    60    70    75    85    90    95    100   120   125   220
Split Positions  55    65    72    80    87    92    97    110   122   172   230
Yes (<=, >)      0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (<=, >)      0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split position is Taxable Income <= 97 (Gini = 0.300).
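The sort-and-scan procedure can be sketched as follows. This illustrative Python example (not from the slides) uses midpoints between consecutive sorted values as candidate thresholds and reproduces the best split found above.

```python
# Sketch: find the best binary split of a continuous attribute in one linear scan.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """values: continuous attribute; labels: class per record. Returns (threshold, gini)."""
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}                       # counts for A <= v
    right = {c: labels.count(c) for c in classes}        # counts for A > v
    best = (None, float("inf"))
    for i in range(len(data) - 1):
        v, y = data[i]
        left[y] += 1                                     # move one record to the left side
        right[y] -= 1
        threshold = (v + data[i + 1][0]) / 2             # midpoint candidate split
        n_left, n_right, n = i + 1, len(data) - i - 1, len(data)
        g = (n_left / n * gini(list(left.values()))
             + n_right / n * gini(list(right.values())))
        if g < best[1]:
            best = (threshold, g)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (K), Tid 1-10
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # -> (97.5, ~0.3): split near 97 with Gini = 0.300
```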

Tree Induction

• Greedy strategy.
  – Split the records based on an attribute test that optimizes a certain criterion.

• Issues
  – Determine how to split the records
    ◼ How to specify the attribute test condition?
    ◼ How to determine the best split?
  – Determine when to stop splitting

Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values

Decision Tree Based Classification

• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets

Example: C4.5

• Simple depth-first construction.
• Uses Information Gain
• Sorts continuous attributes at each node.
• Needs the entire data set to fit in memory.
• Unsuitable for large datasets.
  – Needs out-of-core sorting.

• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
