The document describes the C4.5 algorithm for building decision trees. It begins with an overview of decision trees and the goals of minimizing tree levels and nodes. It then outlines the steps of the C4.5 algorithm: 1) choose the attribute that best differentiates the training instances, 2) create a tree node for that attribute and child links for each of its values, 3) recursively create subordinate nodes until predefined criteria are met or no attributes remain. An example applies these steps to build a decision tree that predicts customers' responses to a life insurance promotion using the attributes age, sex, income range, and credit card insurance.
Introduction by Mohd Noor Abdul Hamid; overview of learning objectives.
Explains decision tree construction, minimizing tree levels and nodes using the C4.5 algorithm.
Exercise scenario involving the BIGG Credit Card company utilizing customer data to develop a predictive model.
Details on dataset (15 instances, 10 attributes), relevant input/output variables.
C4.5 attribute selection: a Goodness Score determines the best attribute at each decision point.
Calculations of Goodness Scores for the attributes Income Range, Credit Card Insurance, and Sex.
Analysis of Age attribute, introducing methods for numerical splits, aimed at improving data generalization.
Detailed steps for determining split points and their goodness scores; focuses on the best split for Age.
Determination of Age as the chosen attribute for the decision tree's top node. Steps to create a decision tree with Age as root; classification outcomes based on further subdivisions.
Model classification performance (93% accuracy); assignment to develop an application based on the tree model.
Mohd Noor Abdul Hamid, Ph.D.
Universiti Utara Malaysia
mohdnoor@uum.edu.my
After this class, you should be able to:
Explain the C4.5 Algorithm
Use the algorithm to develop a Decision Tree
Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.
The main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization.
The C4.5 Algorithm
1. Let T be the set of training instances.
2. Choose the attribute that best differentiates the instances contained in T.
3. Create a tree node whose value is the chosen attribute. Create child links from this node, where each link represents a unique value of the chosen attribute. Use the child link values to further subdivide the instances into subclasses.
4. For each subclass, check whether the instances satisfy predefined criteria, or whether the set of remaining attribute choices for this path is null. If so (Y), specify the classification for new instances following this decision path, and end that path. If not (N), let T be the current set of subclass instances and return to step 2.
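The flowchart above maps naturally onto a short recursive program. Below is a minimal sketch in Python, assuming records stored as dictionaries, a pluggable best_attribute scoring function (the exercise below uses a Goodness Score for this), and a stop_criteria test; all names are illustrative, not from the slides.

    def build_tree(instances, attributes, target, best_attribute, stop_criteria):
        """Recursive C4.5-style tree construction (illustrative sketch)."""
        labels = [row[target] for row in instances]
        # Y branch: predefined criteria satisfied, or no attributes remain.
        if stop_criteria(labels) or not attributes:
            # Leaf: classify new instances by the majority class on this path.
            return max(set(labels), key=labels.count)
        attr = best_attribute(instances, attributes, target)   # step 2
        node = {"attribute": attr, "children": {}}             # step 3
        remaining = [a for a in attributes if a != attr]
        for value in set(row[attr] for row in instances):      # one child link per value
            subset = [row for row in instances if row[attr] == value]
            # N branch: recurse with T = the current subclass instances (step 4).
            node["children"][value] = build_tree(subset, remaining, target,
                                                 best_attribute, stop_criteria)
        return node

For example, stop_criteria could be lambda labels: len(set(labels)) == 1, which ends a path as soon as its subclass is pure.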
Exercise: The Scenario
• The BIGG Credit Card company wishes to develop a predictive model to identify customers who are likely to take advantage of the life insurance promotion, so that the promotional item can be mailed to those potential customers.
The model will be developed using the data stored in the credit card promotion database. The data contains information obtained about customers through their initial credit card application, as well as data about whether these individuals have accepted various promotional offerings sponsored by the company.
Dataset
Exercise
Step 1: Let T be the set of training instances
We follow our previous work with creditcardpromotion.xls. The dataset consists of 15 instances (observations), T, each having 10 attributes (variables); for our example, the input attributes are limited to 5. Why? Because decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.
Step 1 (continued): The variables

Independent Variables (Inputs):
  Age                    Interval   19 – 55 years
  Sex                    Nominal    Male, Female
  Income Range           Ordinal    20–30K, 30–40K, 40–50K, 50–60K
  Credit Card Insurance  Binary     Yes, No

Dependent Variable (Target / Output):
  Life Insurance Promotion   Binary   Yes, No
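For the later steps it can help to have this schema in code. A minimal sketch in Python; the names SCHEMA and TARGET are illustrative assumptions, not part of the slides.

    # Attribute schema for creditcardpromotion.xls (names are illustrative).
    SCHEMA = {
        "Age":                   ("interval", (19, 55)),
        "Sex":                   ("nominal", ["Male", "Female"]),
        "Income Range":          ("ordinal", ["20-30K", "30-40K", "40-50K", "50-60K"]),
        "Credit Card Insurance": ("binary", ["Yes", "No"]),
    }
    TARGET = ("Life Insurance Promotion", ["Yes", "No"])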
Exercise
Step 2: Choose the attribute that best differentiates the instances contained in T
C4.5 uses a measure taken from information theory to help with the attribute selection process. The idea is that, for any choice point in the tree, C4.5 selects the attribute that splits the data so as to show the largest gain in information.
We need to choose an input attribute to best differentiate the instances in T. Our choices are among:
- Income Range
- Credit Card Insurance
- Sex
- Age
Step 2 (continued)
A Goodness Score is calculated for each attribute to determine which attribute best differentiates the training instances T:

Goodness Score = (sum of the most frequently encountered class count in each branch ÷ |T|) ÷ number of branches

The numerator is the accuracy of a partial tree built on the attribute alone, so we can develop a partial tree for each attribute in order to calculate its Goodness Score.
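The Goodness Score is easy to compute directly. A minimal sketch in Python, assuming each branch of a partial tree is summarized by its (Yes, No) counts; the function name goodness_score is illustrative.

    def goodness_score(branch_counts, total):
        """(Sum of majority-class counts per branch / |T|) / number of branches.

        branch_counts: list of (yes_count, no_count) pairs, one per branch.
        total: number of training instances |T|.
        """
        majority_sum = sum(max(yes, no) for yes, no in branch_counts)
        accuracy = majority_sum / total        # accuracy of the partial tree
        return accuracy / len(branch_counts)   # divided by the number of branches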
Step 2a: Income Range
Partial tree for Income Range:
  20–30K: 2 Yes, 2 No
  30–40K: 4 Yes, 1 No
  40–50K: 1 Yes, 3 No
  50–60K: 2 Yes, 0 No
Goodness Score for Income Range:
= ((2 + 4 + 3 + 2) ÷ 15) ÷ 4
= 0.183
Step 2b: Credit Card Insurance
Partial tree for Credit Card Insurance:
  No:  6 Yes, 6 No
  Yes: 3 Yes, 0 No
Goodness Score for Credit Card Insurance:
= ((6 + 3) ÷ 15) ÷ 2
= 0.30
Step 2c: Sex
Partial tree for Sex:
  Male:   3 Yes, 5 No
  Female: 6 Yes, 1 No
Goodness Score for Sex:
= ((6 + 5) ÷ 15) ÷ 2
= 0.367
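As a sanity check, the goodness_score function sketched earlier reproduces all three scores from the partial-tree counts:

    print(goodness_score([(2, 2), (4, 1), (1, 3), (2, 0)], 15))  # Income Range -> 0.183
    print(goodness_score([(6, 6), (3, 0)], 15))                  # Credit Card Insurance -> 0.30
    print(goodness_score([(3, 5), (6, 1)], 15))                  # Sex -> 0.367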
Step 2d: Age
Age is an interval (numeric) variable, so we need to determine the best split point over its values. For this example, we opt for a binary split. Why? Because the main goal is to minimize the number of tree levels and tree nodes, maximizing data generalization.
Steps to determine the best split point for an interval (numerical) attribute:
1. Sort the Age values, paired with the target (Life Insurance Promotion):
   Age: 19 27 29 35 38 39 40 41 42 43 43 43 45 55 55
   LIP:  Y  N  Y  Y  Y  Y  Y  Y  N  Y  Y  N  N  N  N
2. A Goodness Score is computed for each possible split point.
   Split between 19 and 27 (left: 1 Yes, 0 No; right: 8 Yes, 6 No):
   Goodness Score = ((1 + 8) ÷ 15) ÷ 2 = 0.30
   Split between 27 and 29 (left: 1 Yes, 1 No; right: 8 Yes, 5 No):
   Goodness Score = ((1 + 8) ÷ 15) ÷ 2 = 0.30
   This process continues until a score for the split between 45 and 55 is obtained. The split point with the highest Goodness Score is chosen: 43.
   Split at 43 (Age ≤ 43: 9 Yes, 3 No; Age > 43: 0 Yes, 3 No):
   Goodness Score = ((9 + 3) ÷ 15) ÷ 2 = 0.40
Partial tree for Age:
  Age ≤ 43: 9 Yes, 3 No
  Age > 43: 0 Yes, 3 No
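The whole split search can be automated with the goodness_score function sketched earlier and the sorted Age/LIP pairs from the slides. A minimal sketch in Python; ties are broken in favor of the larger threshold, which matches the slides' choice of 43 (the split after 41 happens to tie at 0.40).

    ages = [19, 27, 29, 35, 38, 39, 40, 41, 42, 43, 43, 43, 45, 55, 55]
    lip  = ["Y","N","Y","Y","Y","Y","Y","Y","N","Y","Y","N","N","N","N"]

    best = None
    for threshold in sorted(set(ages))[:-1]:   # candidate split after each distinct value
        sides = ([y for a, y in zip(ages, lip) if a <= threshold],
                 [y for a, y in zip(ages, lip) if a > threshold])
        counts = [(side.count("Y"), side.count("N")) for side in sides]
        score = goodness_score(counts, len(ages))
        if best is None or score >= best[1]:   # >= breaks ties toward the larger threshold
            best = (threshold, score)

    print(best)   # -> (43, 0.4): split Age <= 43 vs Age > 43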
Step 2 summary: overall Goodness Score for each input attribute:

  Attribute               Goodness Score
  Age                     0.400
  Sex                     0.367
  Credit Card Insurance   0.300
  Income Range            0.183

Therefore the attribute Age is chosen as the top-level node.
Exercise
Step 3
• Create a tree node whose value is the chosen attribute.
• Create child links from this node, where each link represents a unique value for the chosen attribute.
For each subclass:
a. If the instances in the subclass satisfy the predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
b. If the subclass does not satisfy the predefined criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
Step 3 (continued): applying the criteria to the Age branches:
  Age ≤ 43 (9 Yes, 3 No): does not satisfy the predefined criteria. Subdivide!
  Age > 43 (0 Yes, 3 No): satisfies the predefined criteria. Classification: Life Insurance = No
Step 3 (continued): subdividing the Age ≤ 43 branch on Sex:
  Age > 43 (0 Yes, 3 No): Life Insurance = No
  Age ≤ 43 → Sex:
    Female (6 Yes, 0 No): Life Insurance = Yes
    Male (3 Yes, 3 No): Subdivide!
Step 3 (continued): subdividing the Male branch on CC Insurance completes the tree.
The Decision Tree: Life Insurance Promotion
  Age > 43 (0 Yes, 3 No): Life Insurance = No
  Age ≤ 43 → Sex:
    Female (6 Yes, 0 No): Life Insurance = Yes
    Male → CC Insurance:
      Yes (2 Yes, 0 No): Life Insurance = Yes
      No (1 Yes, 3 No): Life Insurance = No
Exercise: The Decision Tree
1. Our decision tree is able to accurately classify 14 out of 15 training instances.
2. Therefore, the accuracy of our model is 93%.
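The finished tree is compact enough to express directly as a function. A minimal sketch in Python; the record keys follow the schema above and the helper name classify is an assumption, not from the slides. Run over the 15 training records, this rule set misclassifies one instance, giving the 14/15 (93%) accuracy reported above.

    def classify(record):
        """Apply the learned Life Insurance Promotion tree to one customer record."""
        if record["Age"] > 43:
            return "No"
        if record["Sex"] == "Female":
            return "Yes"
        # Male and Age <= 43: decide on Credit Card Insurance.
        return "Yes" if record["Credit Card Insurance"] == "Yes" else "No"

    # Hypothetical usage on a single record:
    print(classify({"Age": 35, "Sex": "Male", "Credit Card Insurance": "Yes"}))  # -> Yes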
Assignment
• Based on the Decision Tree model for the Life Insurance Promotion, develop an application (program) using any tools you are familiar with.
• Submit your code and report next week!