Decision trees are a popular supervised learning method that can be used for classification and prediction. They work by splitting a dataset into purer subsets based on the values of predictor variables. The C4.5 algorithm is commonly used to build decision trees in a top-down recursive divide-and-conquer manner by selecting the attribute that produces the highest information gain at each step. It then prunes the fully grown tree to avoid overfitting. Decision trees can be converted to classification rules for interpretation. An example decision tree was built to predict student course enrollment based on attributes like gender, income, and employment sector.
Decision Tree Approach in Data Mining
What is data mining?
The process of extracting previously unknown
and potentially useful information from large
databases.
Several data mining approaches are in common use:
Association Rules
Decision Tree
Neural Network Algorithm
Decision Tree Induction
A decision tree is a flow-chart-like tree
structure, where each internal node
denotes a test on an attribute, each
branch represents an outcome of the test,
and leaf nodes represent classes or class
distribution.
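To make this structure concrete, here is a minimal Python sketch (the class and field names are illustrative, not taken from the slides) in which internal nodes test an attribute, branches carry test outcomes, and leaves hold a class label:

# Minimal illustration of a decision tree: internal nodes test an attribute,
# branches carry the outcomes of the test, leaves hold a class label.
class DecisionNode:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at this internal node
        self.branches = branches or {}  # outcome value -> child node
        self.label = label              # class label if this node is a leaf

    def classify(self, record):
        """Follow the branch matching the record's attribute value until a leaf is reached."""
        if self.label is not None:
            return self.label
        child = self.branches[record[self.attribute]]
        return child.classify(record)

# Example: an Outlook test with a pure leaf under Overcast
# (from the weather example used later in these slides).
leaf_play = DecisionNode(label="Play")
root = DecisionNode(attribute="Outlook", branches={"Overcast": leaf_play})
print(root.classify({"Outlook": "Overcast"}))   # -> Play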
Data Mining Approach - Decision Tree
• a model that is both predictive and
descriptive
• can help identify which factors to
consider and how each factor is
associated with a business decision
• most commonly used for classification
(predicting what group a case belongs to)
• several decision tree induction
algorithms, e.g. C4.5, CART, CAL5, ID3
etc.
Algorithm for building Decision Trees
Decision trees are a popular structure for
supervised learning. They are
constructed using attributes best able to
differentiate the concepts to be learned.
A decision tree is built by initially
selecting a subset of instances from a
training set. This subset is then used by
the algorithm to construct a decision
tree. The remaining training set
instances test the accuracy of the
constructed tree.
If the decision tree classifies the
instances correctly, the procedure
terminates. If an instance is
incorrectly classified, the instance
is added to the selected subset of
training instances and a new tree is
constructed. This process
continues until a tree that correctly
classifies all non-selected instances is
created or the decision tree is built
from the entire training set.
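A minimal Python sketch of this windowing procedure. The build_tree and classify callables, the dictionary record format, and the initial window size are assumptions made for illustration; they are not defined in the slides:

# Windowing: train on a subset ("window"), add back the instances the tree
# misclassifies, retrain, and stop once all non-selected instances are
# classified correctly (or the window has grown to the whole training set).
def windowing(training_set, build_tree, classify, initial_size=20):
    window = list(training_set[:initial_size])
    rest = list(training_set[initial_size:])
    while True:
        tree = build_tree(window)
        misclassified = [r for r in rest if classify(tree, r) != r["class"]]
        if not misclassified:
            return tree                       # all remaining instances classified correctly
        # move the misclassified instances into the window and rebuild the tree
        window.extend(misclassified)
        rest = [r for r in rest if r not in misclassified]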
Entropy
(a) Information content of an outcome with probability p (0 ≤ p ≤ 1): log(1/p)
(b) Expected information contributed by that outcome: p log(1/p)
(c) Expected information over both outcomes (occurs + does not occur):
= p log(1/p) + (1-p) log(1/(1-p))
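A small Python check of the quantities above (base-2 logarithms are assumed, matching the later slides):

import math

def surprise(p):
    """(a) Information content log2(1/p) of an outcome with probability p."""
    return math.log2(1 / p)

def expected_info(p):
    """(c) Entropy of a two-outcome event: p*log2(1/p) + (1-p)*log2(1/(1-p))."""
    if p in (0.0, 1.0):
        return 0.0                      # a certain outcome carries no information
    return p * surprise(p) + (1 - p) * surprise(1 - p)

print(expected_info(0.5))   # 1.0 bit: maximum uncertainty
print(expected_info(0.9))   # ~0.47 bits: less uncertainty, less information needed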
Training Process
Block diagram of the training process:
Data Preparation Stage – a windowing process splits the sample data set into a training set and a testing set.
Tree Building Stage – the training set is used to construct the decision tree and rule set, producing a trained classifier.
Prediction Stage – the trained classifier is applied to the testing set to produce the prediction results.
Basic algorithm for inducing a decision tree
• Algorithm: Generate_decision_tree. Generate a
decision tree from the given training data.
• Input: The training samples, represented by
discrete-valued attributes; the set of candidate
attributes, attribute-list;
• Output: A decision tree
Begin
Partition(S)
  If (all records in S are of the same class, or only one record remains in S)
    then return;
  For each attribute Ai do
    evaluate splits on attribute Ai;
  Use the best split found to partition S into S1 and S2, and grow the tree
    with the two children Partition(S1) and Partition(S2);
  Repeat the partitioning for Partition(S1) and Partition(S2) until the
    tree-growing stopping criteria are met;
End;
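A minimal runnable Python sketch of this recursive partitioning. The record format, the equality-test binary splits, and the simple majority-class purity score used to pick the best split are illustrative assumptions, not C4.5's exact criterion:

# Recursive binary partitioning following the pseudocode above.
# Records are dicts with a "class" key; `attributes` lists the non-class attributes.
def partition(S, attributes):
    classes = {r["class"] for r in S}
    if len(classes) == 1 or len(S) <= 1:
        return {"leaf": S[0]["class"]}            # stopping criterion
    best = None
    for a in attributes:                          # evaluate candidate splits on each attribute
        for v in {r[a] for r in S}:
            s1 = [r for r in S if r[a] == v]
            s2 = [r for r in S if r[a] != v]
            if s1 and s2:
                score = purity(s1) + purity(s2)   # stand-in score; higher is better
                if best is None or score > best[0]:
                    best = (score, a, v, s1, s2)
    if best is None:                              # no useful split: majority-class leaf
        return {"leaf": max(classes, key=lambda c: sum(r["class"] == c for r in S))}
    _, a, v, s1, s2 = best
    return {"test": (a, v), "yes": partition(s1, attributes), "no": partition(s2, attributes)}

def purity(subset):
    """Fraction of the majority class in the subset (a simple stand-in for gain)."""
    counts = {}
    for r in subset:
        counts[r["class"]] = counts.get(r["class"], 0) + 1
    return max(counts.values()) / len(subset)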
Information Gain
Difference between the information needed for
correct classification before and after the split.
For example, before the split there are 4 possible
outcomes, which require 2 bits of information to
encode. After splitting on attribute A, each of the
two resulting branches contains only two possible
outcomes, which require 1 bit. Choosing attribute A
therefore yields an information gain of one bit.
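In symbols, the example above works out as:

\[
\mathrm{Info}_{\text{before}} = \log_2 4 = 2\ \text{bits}, \qquad
\mathrm{Info}_{\text{after}} = \log_2 2 = 1\ \text{bit}, \qquad
\mathrm{Gain}(A) = 2 - 1 = 1\ \text{bit}
\]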
Classification Rule Generation
• Generate Rules
– rewrite the tree to a collection of rules, one for each tree leaf
– e.g. Rule 1: IF ‘outlook = rain’ AND ‘windy = false’ THEN
‘play’
• Simplifying Rules
– delete any irrelevant rule condition without affecting the rule's
accuracy (a sketch of this check follows this list)
– e.g. Rule R: IF r1 AND r2 AND r3 THEN class1
– Condition: if Error Rate (R without r1) < Error Rate (R) =>
delete rule condition r1
– Resultant Rule: IF r2 AND r3 THEN class1
• Ranking Rules
– order the rules according to the error rate
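A minimal Python sketch of the condition-dropping check described under "Simplifying Rules". The rule and record representations are assumptions for illustration, and the raw training error rate is used here instead of C4.5's pessimistic error estimate:

# Greedy rule simplification: drop any condition whose removal does not
# increase the rule's error rate on the data.
def error_rate(conditions, target, records):
    """Fraction of records covered by `conditions` that are not of class `target`."""
    covered = [r for r in records if all(r.get(a) == v for a, v in conditions)]
    if not covered:
        return 0.0
    wrong = sum(1 for r in covered if r["class"] != target)
    return wrong / len(covered)

def simplify_rule(conditions, target, records):
    conditions = list(conditions)                 # list of (attribute, value) tests
    improved = True
    while improved:
        improved = False
        for cond in conditions:
            reduced = [c for c in conditions if c != cond]
            if reduced and error_rate(reduced, target, records) <= error_rate(conditions, target, records):
                conditions = reduced              # dropping `cond` does not hurt accuracy
                improved = True
                break                             # restart the scan over the shortened rule
    return conditions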
Decision Tree Rules
Because rules are often more appealing than trees,
variations of the basic tree-to-rule mapping
are usually applied. Most variations focus
on simplifying and/or eliminating existing
rules.
Example of simplifying rules – credit card data

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40-50k         no                         no                      Male     45
30-40k         yes                        no                      Female   40
40-50k         no                         no                      Male     42
30-40k         yes                        yes                     Male     43
50-60k         yes                        no                      Female   38
20-30k         no                         no                      Female   55
30-40k         yes                        yes                     Male     35
20-30k         no                         no                      Male     27
30-40k         no                         no                      Male     43
30-40k         yes                        no                      Female   41
40-50k         yes                        no                      Female   43
20-30k         yes                        no                      Male     29
50-60k         yes                        no                      Female   39
40-50k         no                         no                      Male     55
20-30k         yes                        yes                     Female   19
A rule created by following one path of the tree is:
Case 1:
If Age <= 43 & Sex = Male & Credit Card Insurance = No
Then Life Insurance Promotion = No
The conditions of this rule cover 4 of the 15 instances; 3 of the 4
covered instances are classified correctly, giving 75% accuracy.
Case 2:
If Sex = Male & Credit Card Insurance = No
Then Life Insurance Promotion = No
The conditions of this rule cover 6 instances; 5 of the 6 are classified
correctly, giving 83.3% accuracy.
Therefore, the simplified rule is both more general and more accurate
than the original rule.
C4.5 Tree Induction Algorithm
• Involves two phases for decision tree
construction
– growing tree phase
– pruning tree phase
• Growing Tree Phase
– a top-down approach that repeatedly
grows the tree; it is a specialization process
• Pruning Tree Phase
– a bottom-up approach that removes sub-
trees by replacing them with leaves; it is a generalization process
Expected information before splitting
Let S be a set consisting of s data samples. Suppose
the class label attribute has m distinct values
defining m distinct classes, Ci for i = 1,..,m. Let si be
the number of samples of S in class Ci. The
expected information needed to classify a given
sample is given by:

Info(S) = - Σ (i = 1..m) (si/s) log2(si/s)

Note that a log function to base 2 is used since the
information is encoded in bits.
Expected information after splitting
Let attribute A have v distinct values {a1, a2,…,av},
and let A be used to split S into v subsets {S1,…,Sv},
where Sj contains those samples in S that have
value aj of A. These subsets correspond to the
branches grown from the node containing S. Let sij
be the number of samples of class Ci in subset Sj.
The expected information after splitting on A is:

InfoA(S) = Σ (j = 1..v) ((s1j + … + smj)/s) · Info(Sj)

Gain(A) = Info(S) – InfoA(S)
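A small Python sketch of Info(S), InfoA(S) and Gain(A) computed from class counts; the last two lines reproduce the Outlook numbers from the worked example a couple of slides later:

import math

def info(counts):
    """Info(S) from the class counts in S: -sum (si/s) * log2(si/s)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    """InfoA(S): weighted average of Info over the subsets S1..Sv produced by A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(counts, partitions):
    return info(counts) - info_after_split(partitions)

# Check against the worked example: 9 Play / 5 Don't Play, and Outlook
# splitting the 14 cases into Sunny (2/3), Overcast (4/0) and Rain (3/2).
print(round(info([9, 5]), 3))                               # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))     # 0.246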
C4.5 Algorithm – Growing Tree Phase
Let S = any set of training cases
Let |S| = number of cases in set S
Let Freq (Ci, S) = number of cases in S that belong to
class Ci
Info(S) = average amount of information needed to
identify the class in S
Infox(S) = expected information to identify the class of a
case in S after partitioning S with the test on attribute
X
Gain (X) = information gained by partitioning S
according to the test on attribute X
C4.5 Algorithm – Growing Tree Phase (flowchart)
Starting from the data mining (training) set:
1. Select the decisive attribute for tree splitting (information gain ratio criterion):
   Info(S) = - Σ (i = 1..m) (si/s) log2(si/s)
   InfoX(S) = Σ (j = 1..v) ((s1j + … + smj)/s) · Info(Sj)
   Gain(X) = Info(S) – InfoX(S)
2. Find the splitting attribute and, for continuous attributes, the threshold value for the split.
3. Split the tree on the chosen attribute.
4. Repeat until the tree-growing termination criterion is met.
C4.5 Algorithm – Growing Tree Phase (worked example)
Let S be the training set (14 cases: 9 Play, 5 Don't Play).
Info(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.41 + 0.53 = 0.94
where log2(x) = log(x) / log(2)
InfoOutlook(S) = (5/14)·(-(2/5) log2(2/5) - (3/5) log2(3/5))
             + (4/14)·(-(4/4) log2(4/4) - (0/4) log2(0/4))
             + (5/14)·(-(3/5) log2(3/5) - (2/5) log2(2/5)) = 0.694
Gain(Outlook) = 0.94 - 0.694 = 0.246
Similarly, the computed information gain for Windy is
Gain(Windy) = Info(S) - InfoWindy(S) = 0.94 - 0.892 = 0.048
Thus, the decision tree splits on attribute Outlook, which has the
higher information gain.

        Root
          |
       Outlook
      /   |    \
  Sunny Overcast Rain
After first splitting

Sunny branch (2 Play, 3 Don't Play):
  Windy?   Class
  TRUE     Play
  TRUE     Don't Play
  FALSE    Don't Play
  FALSE    Don't Play
  FALSE    Play

Overcast branch (4 Play):
  Windy?   Class
  TRUE     Play
  FALSE    Play
  TRUE     Play
  FALSE    Play

Rain branch (3 Play, 2 Don't Play):
  Windy?   Class
  TRUE     Don't Play
  TRUE     Don't Play
  FALSE    Play
  FALSE    Play
  FALSE    Play

        Root
          |
       Outlook
      /   |    \
  Sunny Overcast Rain
Decision Tree after the growing phase

            Root
              |
           Outlook
          /   |     \
     Sunny Overcast  Rain
       |      |        |
    Windy?  Play     Windy?
    /    \  (100%)   /    \
  Play  Not Play   Play  Not Play
  (40%)  (60%)
Continuous-valued data
The input sample data may contain an attribute that
is continuous-valued rather than discrete-
valued.
For example, a person's Age is continuous-
valued.
For such a scenario, we must determine the
"best" split point for the attribute.
One simple approach is to take an average of the
continuous values as the split point.
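A minimal Python sketch of choosing such a split point: sort the values, evaluate the midpoint between each pair of adjacent distinct values by information gain, and keep the best one. The small Age/purchase sample is illustrative (loosely based on the tutorial data at the end of these slides):

import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`."""
    base = info(labels)
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                      # only split between distinct values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint of the two adjacent values
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        split_info = len(left) / len(pairs) * info(left) + len(right) / len(pairs) * info(right)
        g = base - split_info
        if g > best[1]:
            best = (threshold, g)
    return best

ages = [15, 23, 20, 18, 40, 33, 24, 25]           # illustrative Age values
buys = ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
print(best_split_point(ages, buys))                # -> (23.5, ~0.55)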
C4.5 Algorithm – Pruning Tree Phase (Error-Based Pruning)
Starting from the bottom sub-tree and working upwards:
1. Compute the error rate of the original sub-tree (E1).
2. Compute the error rate of the sub-tree replaced by a leaf (E2).
3. If E2 < E1, replace the sub-tree with the leaf.
4. Repeat until every sub-tree has been examined.

U25%(E, N) = predicted error rate
  = (the number of misclassified test cases / the total number of test cases) × 100%
where E is the number of error cases in the class and
N is the number of cases in the class.
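A minimal Python sketch of the bottom-up E1-versus-E2 comparison above. The tree representation and the use of raw error counts (rather than C4.5's confidence-based predicted error U25%) are simplifying assumptions:

from dataclasses import dataclass, field

@dataclass
class Node:
    prediction: str                       # majority class at this node
    errors: int                           # cases misclassified if this node were a leaf
    children: list = field(default_factory=list)

def subtree_errors(node: Node) -> int:
    """E1: total errors of the sub-tree as it currently stands."""
    if not node.children:
        return node.errors
    return sum(subtree_errors(c) for c in node.children)

def prune(node: Node) -> None:
    """Bottom-up: replace a sub-tree with a leaf when the leaf errs less (E2 < E1)."""
    for child in node.children:
        prune(child)
    if node.children:
        e1 = subtree_errors(node)         # error of the original sub-tree
        e2 = node.errors                  # error if the sub-tree were replaced by a leaf
        if e2 < e1:
            node.children = []            # replace the sub-tree with a leaf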
Case study of predicting student enrolment by
decision tree
• Enrolment relational schema

  Attribute        Data type
  ID               Number
  Class            Varchar
  Sex              Varchar
  Fin_Support      Varchar
  Emp_Code         Varchar
  Job_Code         Varchar
  Income           Varchar
  Qualification    Varchar
  Marital_Status   Varchar
Student Enrolment Analysis
– deduce the influencing factors associated with student course
enrolment
– enrolment data was sampled from three selected courses:
Computer Science, English Studies and Real Estate
Management
– with 100 training records and 274 testing records
– produce prediction results
– generate classification rules from the decision tree
– students' enrolment: 41 Computer Science, 46 English
Studies and 13 Real Estate Management
Growing Tree Phase
The C4.5 tree induction algorithm computes the gain ratio of all candidate data attributes.
Note: Emp_Code shows the highest information gain, and is therefore given the
top priority (the root split) in the decision tree.
Growing Tree Phase – Decision Tree
[Decision tree figure: the root splits on Employment (Emp_Code) with branches
Manufacturing, Social Work, Tourism/Hotel, Trading, Property, Construction,
Education, Engineering, Fin/Accounting, Government, Info. Technology and Others.
Second-level nodes test Job_Code, Sex, Fin_Support, Income or Qualification,
and the leaves predict Computer Science, English Studies or Real Estate
Management, mostly with 100% purity (a few leaves at 50–80%).]
Prune Tree Phase – Classification Rules
No. Rule Class
1 IF Emp_Code = “Government” AND Income = “$250,000 - $299,999” Real Estate Mgt
2 IF Emp_Code = “Tourism, Hotel” English Studies
3 IF Emp_Code = “Education” Computer Science
4 IF Emp_Code = “Others” English Studies
5 IF Emp_Code = “Government” AND Income = “$150,000 - $199,999” English Studies
6 IF Emp_Code = “Construction” AND Job_Code = “Professional, Technical” Real Estate Mgt
7 IF Emp_Code = “Manufacturing” English Studies
8 IF Emp_Code = “Trading” AND Sex = “Female” English Studies
9 IF Emp_Code = “Construction” AND Job_Code = “Executive” Real Estate Mgt
10 IF Emp_Code = “Engineering” AND Job_Code = “Sales” Computer Science
11 IF Emp_Code = “Engineering” AND Job_Code = “Professional, Technical” Real Estate Mgt
12 IF Emp_Code = “Government” AND Income = “$800,000 - $999,999” Real Estate Mgt
13 IF Emp_Code = “Info. Technology” AND Sex = “Female” English Studies
14 IF Emp_Code = “Info. Technology” AND Sex = “Male” Computer Science
15 IF Emp_Code = “Social Work” Computer Science
16 IF Emp_Code = “Fin/Accounting” Computer Science
17 IF Emp_Code = “Trading” AND Sex = “Male” Computer Science
18 IF Emp_Code = “Construction” AND Job_Code = “Clerical” English Studies
Simplify classification rules by deleting
unnecessary conditions
A condition is deleted when the increase in the pessimistic error rate caused
by its removal is minimal; in this example, the error rate with the condition
removed is 0.338.
Simplified Classification Rules
No. Rule Class
1 IF Emp_Code = “Government” AND Income = “$250,000 - $299,999” Real Estate Mgt
2 IF Emp_Code = “Tourism, Hotel” English Studies
3 IF Emp_Code = “Education” Computer Science
4 IF Emp_Code = “Others” English Studies
5 IF Emp_Code = “Manufacturing” English Studies
6 IF Emp_Code = “Trading” AND Sex = “Female” English Studies
7 IF Emp_Code = “Construction” AND Job_Code = “Executive” Real Estate Mgt
8 IF Job_Code = “Sales” Computer Science
9 IF Emp_Code = “Engineering” AND Job_Code = “Professional, Technical” Real Estate Mgt
10 IF Emp_Code = “Info. Technology” AND Sex = “Female” English Studies
11 IF Emp_Code = “Info. Technology” AND Sex = “Male” Computer Science
12 IF Emp_Code = “Social Work” Computer Science
13 IF Emp_Code = “Fin/Accounting” Computer Science
14 IF Emp_Code = “Trading” AND Sex = “Male” Computer Science
15 IF Job_Code = “Clerical” English Studies
16 IF Emp_Code = “Property” Real Estate
17 IF Emp_Code = “Government” AND Income = “$200,000 - $249,999” English Studies
Ranking Rules
After simplifying the classification rule set, the
remaining step is to rank the rules according to
their prediction reliability percentage, defined as
(1 – misclassified cases / total cases covered by the rule) × 100%
For the rule
If Employment = "Trading" and Sex = "Female"
then class = "English Studies"
the rule covers 6 cases with 0 misclassified cases.
It therefore has a reliability percentage of 100%
and is ranked first in the rule set.
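A small Python sketch of this ranking step. The first rule's counts (6 covered, 0 misclassified) come from the slide; the second rule's counts are made up purely for illustration:

rules = [
    # (rule description, covered cases, misclassified cases)
    ("IF Emp_Code = 'Trading' AND Sex = 'Female' THEN English Studies", 6, 0),
    ("IF Emp_Code = 'Tourism, Hotel' THEN English Studies", 10, 2),   # assumed counts
]

def reliability(covered, misclassified):
    """Prediction reliability percentage of a rule."""
    return (1 - misclassified / covered) * 100.0

for rule, covered, wrong in sorted(rules, key=lambda r: reliability(r[1], r[2]), reverse=True):
    print(f"{reliability(covered, wrong):5.1f}%  {rule}")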
Success-rate ranked classification rules
No. Rule Class
1 IF Emp_Code = “Trading” AND Sex = “Female” English Studies
2 IF Emp_Code = “Construction” AND Job_Code = “Executive” Real Estate Mgt
3 IF Emp_Code = “Info. Technology” AND Sex = “Male” Computer Science
4 IF Emp_Code = “Social Work” Computer Science
5 IF Emp_Code = “Government” AND Income = “$250,000 - $299,999” Real Estate Mgt
6 IF Emp_Code = “Government” AND Income = “$200,000 - $249,999” English Studies
7 IF Emp_Code = “Trading” AND Sex = “Male” Computer Science
8 IF Emp_Code = “Property” Real Estate
9 IF Job_Code = “Sales” Computer Science
10 IF Emp_Code = “Others” English Studies
11 IF Emp_Code = “Info. Technology” AND Sex = “Female” English Studies
12 IF Emp_Code = “Engineering” AND Job_Code = “Professional, Technical” Real Estate Mgt
13 IF Emp_Code = “Education” Computer Science
14 IF Emp_Code = “Manufacturing” English Studies
15 IF Emp_Code = “Tourism, Hotel” English Studies
16 IF Job_Code = “Clerical” English Studies
17 IF Emp_Code = “Fin/Accounting” Computer Science
Data Prediction Stage

Classifier                 No. of misclassified cases   Error rate (%)
Pruned Decision Tree       81                           30.7%
Classification Rule Set    90                           32.8%

Both prediction results are reasonably good. The prediction
error rate obtained is about 30%, which means nearly 70% of
unseen test cases receive an accurate prediction.
Summary
• "Employment Industry" is the most
significant factor affecting student
enrolment
• The Decision Tree classifier gives the
better prediction result
• The windowing mechanism improves
prediction accuracy
Reading Assignment
“Data Mining: Concepts and Techniques”
2nd edition, by Han and Kamber, Morgan
Kaufmann publishers, 2007, Chapter 6, pp.
291-309.
Lecture Review Question 11
(i) Explain the term “Information Gain” in
Decision Tree.
(ii) What is the termination condition of Growing
tree phase?
(iii) Given a decision tree, which of the following options do you
prefer for obtaining pruned rules, and why?
(a) Converting the decision tree to rules and then
pruning the resulting rules.
(b) Pruning the decision tree and then converting
the pruned tree to rules.
CS5483 Tutorial Question 11
Apply the C4.5 algorithm to construct a decision tree after the first split for the purchase records
in the following data, after dividing the tuples into two groups according to "age": one group is less
than 25, and the other is greater than or equal to 25. Show all the steps and calculations for the
construction.
Location   Customer Sex   Age   Purchase records
Asia       Male           15    Yes
Asia       Female         23    No
America    Female         20    No
Europe     Male           18    No
Europe     Female         10    No
Asia       Female         40    Yes
Europe     Male           33    Yes
Asia       Male           24    Yes
America    Male           25    Yes
Asia       Female         27    Yes
America    Female         15    Yes
Europe     Male           19    No
Europe     Female         33    No
Asia       Female         35    No
Europe     Male           14    Yes
Asia       Male           29    Yes
America    Male           30    No