Lecture 4: Intro to Machine Learning & Decision Trees
6 February 2024
Admin
Final Assessment Timing Survey
[Bar chart of vote counts for four timing options: Saturday 8pm - Sunday 11:59pm (original); Saturday 8pm - Sunday 9pm (sleep earlier); Sunday 8am - Sunday 11:59pm (avoid pulling an all-nighter); Sunday 8am - Sunday 9pm (avoid pulling an all-nighter and sleep earlier)]
Materials
Overview
[Course overview diagram ("You are here"): builds on CS2040S, CS1231, applied linear algebra, calculus, statistics & probability; tooling: Python, NumPy, Scikit-learn, PyTorch]
Outline
• Machine Learning
• What is ML?
• Types of Feedback
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL)
• Entropy and Information Gain
• Different types of attributes
• Pruning
• Ensemble Methods
What is Machine Learning? Applications?
[Images of example applications. Credits: Google, VentureBeat, Skyfish, Futuremind]
What is Machine Learning?
“The field of study that gives computers the ability to learn without
being explicitly programmed”
- Arthur Samuel (1959)
Why Machine Learning?
Types of Feedback
[Diagram of learning paradigms: Supervised ("teacher's feedback"), Semi/Weakly Supervised, Unsupervised ("no feedback"), and Reinforcement Learning ("trial and error": the agent takes actions and receives observations and rewards)]
Supervised Learning
Learns from being given the right answers: X → Y

Input (X) | Output (Y) | Application
Email | Spam? | Spam filtering
Audio | Text transcripts | Speech recognition
English | Chinese | Machine translation
Image | Position of cars | Self-driving car

Types:
• Regression: predict continuous output (e.g., temperature: 0.567)
• Classification: predict discrete output (e.g., animal: cat vs dog)
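To make the two types concrete, here is a minimal sketch using scikit-learn decision trees (since trees are this lecture's model); the data, model choices, and hyperparameters below are illustrative assumptions, not from the lecture.

# Regression predicts a continuous value; classification predicts a discrete label.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

rng = np.random.default_rng(0)

# Regression example (continuous output, e.g., a temperature-like value):
X_reg = rng.uniform(0, 10, size=(50, 1))
y_reg = 2.0 * X_reg[:, 0] + rng.normal(0, 0.5, size=50)
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print(reg.predict([[5.0]]))        # a continuous value (roughly 10)

# Classification example (discrete output, e.g., cat = 0 vs dog = 1):
X_clf = rng.normal(size=(50, 2))
y_clf = (X_clf[:, 0] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X_clf, y_clf)
print(clf.predict([[1.0, 0.0]]))   # a discrete label, e.g., array([1])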
Supervised Learning: Regression
Housing price prediction
[Scatter plot of house price ($M) against house size (sqft); task: predict the price of a 1150 sqft house]
Supervised Learning: Classification
• Cancer prediction: benign (0), malignant (1)
[Plot of label (0 or 1) against tumour size (cm); task: predict the label for a 5 cm tumour]
Formalism
• In supervised learning, we assume that y is generated by a true mapping function f: x → y.
• We want to find a hypothesis h: x → ŷ (from a hypothesis class H) s.t. h ≈ f, given a training set {(x_1, f(x_1)), …, (x_N, f(x_N))}.
• We use a learning algorithm to find this hypothesis.
Formalism (Illustrated)
[Diagram: the training set {(x_1, f(x_1)), …, (x_N, f(x_N))} is fed to a learning algorithm, which outputs a hypothesis h]
Performance Measure
How do we know that our hypothesis is good, i.e., h ≈ f?
1. Use theorems from statistical learning theory
2. Try the hypothesis on a new set of examples (test set)
[Diagram: the data is split into a training set (train) and a held-out test set (test)]
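A minimal sketch of option 2: hold out part of the data as a test set using scikit-learn's train_test_split (the synthetic data and the 80/20 split below are assumptions for illustration).

# Split the data into a training set and a held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(500, 2500, size=(100, 1))            # e.g., house sizes (sqft)
y = 0.2 * X[:, 0] + rng.normal(0, 20.0, size=100)    # noisy "true" outputs f(x)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0              # 80% train, 20% test
)
# Fit the hypothesis h on (X_train, y_train); estimate h ≈ f by measuring its
# error on (X_test, y_test), which the learner never saw during training.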
Regression: Error
If the output of the hypothesis is a continuous value, then we can measure its error. For an input x with a true output y, we can compute:

SquaredError = (\hat{y} - y)^2

where \hat{y} = h(x).
Regression: Mean Squared Error
For a set of N examples (x_1, y_1), …, (x_N, y_N), we can compute the average (mean) squared error as follows:

MSE = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
Regression: Mean Absolute Error
For a set of N examples (x_1, y_1), …, (x_N, y_N), we can compute the average (mean) absolute error as follows:

MAE = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|
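A minimal NumPy sketch computing both MSE and MAE exactly as defined above (the toy predictions are made up).

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])    # y_i
y_pred = np.array([2.5,  0.0, 2.0, 8.0])    # ŷ_i = h(x_i)

mse = np.mean((y_pred - y_true) ** 2)       # (1/N) Σ (ŷ_i − y_i)²   -> 0.375
mae = np.mean(np.abs(y_pred - y_true))      # (1/N) Σ |ŷ_i − y_i|    -> 0.5
print(mse, mae)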
Classification: Correctness & Accuracy
A classification is correct when the prediction ŷ equals the true label y.

[Worked example: 10 test cases with their true labels (Cancer / Benign) and predicted labels; a Cancer case predicted as Benign is a false negative (FN), and a Benign case predicted as Cancer is a false positive (FP)]

Confusion matrix (rows: true label, columns: predicted label; positive class = Cancer):
                   Predicted Cancer      Predicted Benign
True Cancer        True Positive (TP)    False Negative (FN)
True Benign        False Positive (FP)   True Negative (TN)

Precision: P = TP / (TP + FP)
• How many selected items are relevant? How precise were the positive predicted instances?
• Maximize this if a false positive (FP) is very costly, e.g., email spam, satellite launch date prediction.

Recall: R = TP / (TP + FN)
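A minimal sketch computing the metrics listed in the outline (accuracy, confusion-matrix counts, precision, recall, F1) from their standard definitions; the toy labels below are made up, with Cancer as the positive class.

import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # 1 = Cancer (positive), 0 = Benign
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0, 1])   # hypothesis outputs

tp = np.sum((y_pred == 1) & (y_true == 1))          # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))          # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))          # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))          # true negatives

accuracy  = (tp + tn) / len(y_true)                 # fraction of correct predictions
precision = tp / (tp + fp)                          # P = TP / (TP + FP)
recall    = tp / (tp + fn)                          # R = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn, accuracy, precision, recall, f1)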
Life's Difficult Decision: What to Eat?
Decide whether to wait for a table at a restaurant based on the following input attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
…
10. WaitEstimate: estimated waiting time in minutes (0-10, 10-30, 30-60, >60)
Life's Difficult Decision: What to Eat?
[Table of training examples: for each restaurant visit, the attribute values and whether we waited (T/F)]
Decision Trees
One possible representation for hypotheses
[Example decision tree for the restaurant problem]
Expressiveness
• Decision trees can express any function of the input attributes.
• Example: for Boolean functions, each row of the truth table corresponds to a path from root to leaf.

A | B | A xor B
F | F | F
F | T | T
T | F | T
T | T | F

[Equivalent tree: split on A, then on B; each leaf holds the value of A xor B]

• Trivially, there is a consistent decision tree for any training set, but it probably won't generalize to new examples.
• Prefer a more "compact" tree.
The Size of the Hypothesis Class
How many distinct decision trees are there with n Boolean attributes?
= the number of Boolean functions of n attributes
= the number of distinct truth tables with 2^n rows
= 2^(2^n)

A | B | A <op> B
F | F | T/F
F | T | T/F
T | F | T/F
T | T | T/F

For example, with n = 6 Boolean attributes there are 2^64 ≈ 1.8 × 10^19 distinct trees.
Choosing an Attribute
How do we choose an attribute?
[Two candidate splits: Patrons? with branches None / Some / Full, vs. Type? with branches French / Italian / Thai / Burger]
Ideally, we want to select an attribute that splits the examples into "all positive" or "all negative" subsets.
Entropy
• Entropy is a measure of randomness:

  I(P(v_1), \ldots, P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

• For a data set containing p positive and n negative examples:

  I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

  Entropy is at its maximum (maximum randomness) when p = n.

[Candidate splits to compare: Patrons? (None / Some / Full) vs. Type? (French / Italian / Thai / Burger)]
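A minimal sketch of the entropy formula for a set with p positive and n negative examples (the binary case of I(P(v_1), …, P(v_n))).

import math

def entropy(p, n):
    """I(p/(p+n), n/(p+n)) for p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:                     # treat 0·log2(0) as 0
            q = count / total
            result -= q * math.log2(q)
    return result

print(entropy(6, 6))   # 1.0 -> evenly split: maximum randomness / entropy
print(entropy(4, 0))   # 0.0 -> pure set: no randomness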
Information Gain
• A chosen attribute A divides the training set E into subsets E_1, …, E_v according to their values for A, where A has v distinct values.

remainder(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \, I\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

• Information gain = entropy of this node − total (remaining) entropy of the children nodes:

IG(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - remainder(A)

[Example: splitting on Patrons? divides the 12 examples into branches of size 2 (None), 4 (Some), and 6 (Full), whose entropies are weighted by 2/12, 4/12, and 6/12 in remainder(Patrons)]
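A minimal sketch of remainder(A) and the information gain, following the formulas above; the (p_i, n_i) class counts per branch are assumed toy numbers (loosely modelled on a Patrons?-style split of sizes 2, 4, 6).

import math

def entropy(p, n):
    if p == 0 or n == 0:
        return 0.0
    total = p + n
    return -(p / total) * math.log2(p / total) - (n / total) * math.log2(n / total)

def remainder(branches):
    """branches: list of (p_i, n_i) counts, one pair per value of attribute A."""
    p = sum(pi for pi, ni in branches)
    n = sum(ni for pi, ni in branches)
    return sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in branches)

def information_gain(branches):
    p = sum(pi for pi, ni in branches)
    n = sum(ni for pi, ni in branches)
    return entropy(p, n) - remainder(branches)

# Branches of size 2, 4, 6 with assumed class counts (0+,2-), (4+,0-), (2+,4-):
print(information_gain([(0, 2), (4, 0), (2, 4)]))   # ≈ 0.541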
Decision Tree Learning with Information Gain

def DTL(examples, attributes, default):
    if examples is empty: return default
    if all examples have the same classification: return that classification
    if attributes is empty: return mode(examples)
    best = choose_attribute(attributes, examples)   # attribute with the highest information gain
    tree = a new decision tree with root best
    for each value v_i of best:
        examples_i = {rows in examples with best = v_i}
        subtree = DTL(examples_i, attributes - {best}, mode(examples))
        add a branch to tree with label v_i and subtree subtree
    return tree
Walkthrough on the restaurant data:
• At each node, choose_attribute compares the information gain of the attributes not yet used on that path, e.g., IG(Alt), IG(Bar), IG(Fri), IG(Hun), … excluding IG(Pat) once Patrons has been used (and also excluding IG(Hun) one level deeper, once Hungry has been used).
• If a branch has no examples (e.g., Type = French), the recursive call returns the default, i.e., mode(examples) of the parent. When there is no mode (no majority), we select an arbitrary decision: T.
• The recursion continues until every leaf is pure, the attributes run out (return the majority class), or a branch receives no examples (return the parent's majority).
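Below is a runnable Python sketch of the DTL pseudocode above, using information gain to choose attributes. The example representation (dicts with a 'label' key), the attribute names, and the toy data are assumptions for illustration, not the lecture's dataset; for simplicity, branches are only created for attribute values that actually appear in the examples.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr):
    base = entropy([e['label'] for e in examples])
    rem = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        rem += len(subset) / len(examples) * entropy([e['label'] for e in subset])
    return base - rem

def mode(examples):
    return Counter(e['label'] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:
        return default                          # empty branch: parent's majority label
    labels = {e['label'] for e in examples}
    if len(labels) == 1:
        return labels.pop()                     # pure node
    if not attributes:
        return mode(examples)                   # no attributes left: majority label
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = dtl(subset, [a for a in attributes if a != best], mode(examples))
    return tree

# Toy usage (made-up data):
data = [
    {'patrons': 'Some', 'hungry': 'Yes', 'label': 'T'},
    {'patrons': 'Full', 'hungry': 'No',  'label': 'F'},
    {'patrons': 'None', 'hungry': 'No',  'label': 'F'},
    {'patrons': 'Full', 'hungry': 'Yes', 'label': 'T'},
]
print(dtl(data, ['patrons', 'hungry'], default='F'))   # e.g., {'hungry': {'Yes': 'T', 'No': 'F'}}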
Dealing with Attributes with Many Values
Information gain tends to select attributes with many values (e.g., dates, phone numbers) because they split the data "well": in the extreme case, each branch holds a single example, so every branch is trivially "all positive" or "all negative".
Balance the information gain against the number of branches:

GainRatio(A) = \frac{IG(A)}{SplitInformation(A)}

SplitInformation(A) = -\sum_{i=1}^{d} \frac{|E_i|}{|E|} \log_2 \frac{|E_i|}{|E|}

[Examples: splitting E evenly into two subsets E_1, E_2 gives SplitInformation = 1; splitting it evenly into three subsets E_1, E_2, E_3 gives SplitInformation ≈ 1.6]
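A minimal sketch of SplitInformation and GainRatio as defined above; the branch sizes and the IG value fed into gain_ratio are toy numbers (the IG ≈ 0.541 comes from the earlier sketch).

import math

def split_information(sizes):
    """sizes: the branch sizes |E_i| produced by splitting E on attribute A."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(info_gain, sizes):
    return info_gain / split_information(sizes)

print(split_information([6, 6]))        # 1.0   (two equal subsets)
print(split_information([4, 4, 4]))     # ≈ 1.585, the "1.6" in the example above
print(gain_ratio(0.541, [2, 4, 6]))     # ≈ 0.37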
Dealing with Attributes with Differing Costs
Certain attributes may have costs (e.g., in medical diagnosis). They may vary significantly in cost, both in monetary terms and in patient comfort. Examples: biopsy, blood test, etc.
Weight the information gain by the cost, for example:

\frac{IG^2(A)}{Cost(A)} \quad \text{or} \quad \frac{2^{IG(A)} - 1}{(Cost(A) + 1)^w}, \quad w \in [0, 1]

where w determines the importance of the cost.
Dealing with Continuous-valued Attributes
Define a discrete-valued input attribute by partitioning the values into a discrete set of intervals.
Examples:
• Estimated waiting time (minutes): 0-10, 10-30, 30-60, >60
• Age (years): 0-12, 12-25, 25-40, 40-60, 60-80, >80
• …
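A minimal sketch of this kind of discretization using NumPy (an assumption; any binning utility would do), applied to the waiting-time intervals above.

import numpy as np

wait_minutes = np.array([5, 12, 45, 90])
bins = [10, 30, 60]                            # boundaries for 0-10, 10-30, 30-60, >60
labels = ['0-10', '10-30', '30-60', '>60']

indices = np.digitize(wait_minutes, bins)      # -> array([0, 1, 2, 3])
print([labels[i] for i in indices])            # ['0-10', '10-30', '30-60', '>60']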
Dealing with Missing Values
What if some values are missing?
• Assign the most common value of the attribute
• Assign the most common value of the attribute with the same output
• Assign probability to each possible value and sample
• Drop the attribute
• Drop the rows
•…
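A minimal pandas sketch of two of the strategies above (most-common value vs. dropping rows); the column names and values are made up.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'patrons': ['Some', 'Full', np.nan, 'Full', 'None'],
    'wait':    ['T', 'F', 'T', np.nan, 'F'],
})

# 1. Assign the most common value of the attribute:
df_filled = df.fillna({'patrons': df['patrons'].mode()[0]})

# 2. ... or drop the rows that contain missing values:
df_dropped = df.dropna()

print(df_filled)
print(df_dropped)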
Overfitting
[Decision tree that keeps growing (H? → Y / N → …) until it fits the training data exactly]
The decision tree's performance is perfect on the training data, but worse on test data:
• the tree captures the training data perfectly, including the noise.
Occam's Razor
[Illustration: prefer the simpler tree that explains the data over the larger, overfitted one]
Pruning
Prevent nodes from being split even when they fail to cleanly separate the examples.

Example tree (leaves show class and example count):
Cond1?
  Yes → 25 T
  No → Cond2?
    Yes → Cond3?
      Yes → 9 F
      No → Cond4?
        Yes → 1 T
        No → 2 F
    No → 13 F
Pruning with Min-sample = 5:
• Cond4? has only 3 examples (1 T, 2 F), fewer than the minimum of 5, so it is not split; it becomes a leaf labelled F (the majority).
• The result can be simplified further: Cond3?'s subtree now contains 1 T and 11 F, so it collapses to F (majority), and then Cond2?'s subtree (1 T, 24 F) collapses to F as well.

Pruning with Max-depth = 3:
• Cond4? lies beyond the maximum depth of 3, so it is not split; it becomes a leaf labelled F (majority), and the tree simplifies in the same way.

Final pruned tree (for both criteria):
Cond1?
  Yes → T (25 T)
  No → F (majority of 1 T, 24 F)

The sample that is likely noise (the single T) is ignored!
Results in a smaller tree which may have higher accuracy.
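A minimal scikit-learn sketch of the same pre-pruning idea, using the DecisionTreeClassifier hyperparameters min_samples_split and max_depth (the synthetic data and the specific thresholds are assumptions mirroring the example above).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)   # noisy labels

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(
    min_samples_split=5,   # do not split nodes with fewer than 5 samples
    max_depth=3,           # do not grow the tree beyond depth 3
    random_state=0,
).fit(X, y)

print(unpruned.get_depth(), pruned.get_depth())   # the pruned tree is shallower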
Ensemble Methods
Coming Up Next Week
• Linear Regression
• Multiple Features
• Polynomial Regression
• Optimization Algorithms
• Gradient Descent
• Variants of Gradient Descent
• Normal Equation
To Do
• Lecture Training 4
• +100 Free EXP
• +50 Early bird bonus
• Problem Set 2
• Due Saturday, 10th February
• Problem Set 3
• Out today!
• Contest +500 EXP