
CS2109S: Introduction to AI and Machine Learning

Lecture 4:
Intro to Machine Learning & Decision Trees

6 February 2024

1
Admin

2
Final Assessment Timing Survey
[Bar chart: number of student responses for each proposed timing]
• Saturday 8pm - Sunday 11:59pm (original)
• Saturday 8pm - Sunday 9pm (sleep earlier)
• Sunday 8am - Sunday 11:59pm (avoid pulling an all-nighter)
• Sunday 8am - Sunday 9pm (avoid pulling an all-nighter and sleep earlier)

3
Materials

4
Overview
You are here

“Classical” AI Machine Learning

Search Algorithms ML Models & Techniques Miscellaneous


• Uninformed search: BFS, DFS, UCS, IDS • Decision Trees • Unsupervised Learning
• AI & Ethics
• Informed search: Greedy best-first, A* • Linear/Logistic Regression
• Local Search: hill-climbing, SA, Beam, GA • Support Vector Machines
• Adversarial search: Minimax, Alpha-beta • Neural Networks

Applied: CS2040S, CS1231; Linear Algebra, Calculus, Statistics & Probability
Python: NumPy, Scikit-learn, PyTorch

5
Outline
• Machine Learning
• What is ML?
• Types of Feedback
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL)
• Entropy and Information Gain
• Different types of attributes
• Pruning
• Ensemble Methods

6
Outline
• Machine Learning
• What is ML?
• Types of Feedback
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL)
• Entropy and Information Gain
• Different types of attributes
• Pruning
• Ensemble Methods

7
What is Machine Learning? Applications?

[Images of example ML applications. Credits: Google, VentureBeat, Skyfish, Futuremind]
8
What is Machine Learning?
“The field of study that gives computers the ability to learn without
being explicitly programmed”
- Arthur Samuel (1959)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
- Tom Mitchell

9
Why Machine Learning?
[Figure: examples of input → Function → output mappings: “What is AI?” → “AI is …”; a chess position → Bb6 (chess move) (Credit: Chess.com); “Pikachu in the Eiffel tower” → a generated image; an input → 5]

How do we write this Function?
10


Types of Feedback
[Figure: example images labeled “Apple” and “Indomie”]

• Supervised: “teacher’s feedback” (labeled examples)
• Semi/Weakly Supervised: only some labels are given
• Unsupervised: “no feedback” (no labels)
• Reinforcement: “trial and error” (the agent takes actions and receives observations and rewards)
11
Supervised Learning
Learns from being given the right answers: 𝑿 → 𝒀
Input (X) | Output (Y) | Application
Email | Spam? | Spam filtering
Audio | Text transcripts | Speech recognition
English | Chinese | Machine translation
Image | Position of cars | Self-driving car

Types:
• Regression: predict continuous output (e.g., temperature: 0.567)
• Classification: predict discrete output (e.g., animal: cat vs dog)
12
Supervised Learning: Regression
Housing price prediction

[Scatter plot: price ($M) on the y-axis vs. house size (sqft, 500–2500) on the x-axis. Given a new house of 1150 sqft, what price should we predict?]
13
Supervised Learning: Classification
• Cancer prediction: benign (0), malignant (1)

[Number line: tumor size (cm) with each training example labeled benign (0) or malignant (1). Given a new tumor of size 5 cm, which class should we predict?]

14
Formalism
• In supervised learning, we assume that y is generated by a true
mapping function f: x → y.
• We want to find a hypothesis h: x → ŷ (from a hypothesis class H)
s.t. h ≈ f, given a training set {(x_1, f(x_1)), …, (x_N, f(x_N))}.
• We use a learning algorithm to find this hypothesis.

15
Formalism (Illustrated)

Training Set {(x_1, f(x_1)), …, (x_N, f(x_N))}
↓
Learning Algorithm A (searches the Hypothesis Class H)
↓
Hypothesis h: for a new input x, output ŷ = h(x) ≈ f(x)

16
Outline
• Machine Learning
• What is ML?
• Types of Feedback
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL)
• Entropy and Information Gain
• Different types of attributes
• Pruning
• Ensemble Methods

17
Performance Measure
How do we know that our hypothesis is good? i.e., ℎ ≈ 𝑓
1. Use theorems from statistical learning theory
2. Try the hypothesis on a new set of examples (test set)

[Diagram: the data is split into a training set, used to train the hypothesis, and a test set, used to test it on unseen examples.]
18
Regression: Error
If the output of the hypothesis is a continuous value, then we can
measure its error. For an input 𝑥 with a true output 𝑦, we can
compute:

Absolute Error = |ŷ − y|

Squared Error = (ŷ − y)²

where ŷ = h(x).

19
Regression: Mean Squared Error
For a set of N examples (x_1, y_1), …, (x_N, y_N), we can compute the
average (mean) squared error as follows.

MSE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²

where ŷ_i = h(x_i) and y_i = f(x_i).

20
Regression: Mean Absolute Error
For a set of N examples (x_1, y_1), …, (x_N, y_N), we can compute the
average (mean) absolute error as follows.

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|

where ŷ_i = h(x_i) and y_i = f(x_i).
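A minimal NumPy sketch of these two metrics (the function names and toy numbers are my own, purely for illustration):

import numpy as np

def mse(y_hat, y):
    """Mean squared error: average of (y_hat_i - y_i)^2."""
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

def mae(y_hat, y):
    """Mean absolute error: average of |y_hat_i - y_i|."""
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return np.mean(np.abs(y_hat - y))

# Toy example: three predictions vs. three true values
y_true = [300, 480, 210]
y_pred = [280, 500, 250]
print(mse(y_pred, y_true))  # (400 + 400 + 1600) / 3 = 800.0
print(mae(y_pred, y_true))  # (20 + 20 + 40) / 3 ≈ 26.67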

21
Classification: Correctness & Accuracy
Classification is correct when the prediction ŷ = y (the true label).

For a set of N examples (x_1, y_1), …, (x_N, y_N), we can compute the
average correctness (accuracy) as follows.

Accuracy = (1/N) Σ_{i=1}^{N} 1[ŷ_i = y_i]

where ŷ_i = h(x_i), y_i = f(x_i), and 1[·] is 1 if the condition holds and 0 otherwise.


22
Classification: Confusion Matrix
Accuracy = (TP + TN) / (TP + FN + FP + TN)

Confusion matrix (predicted label vs. actual label):
• True Positive (TP): predicted positive, actually positive
• False Positive (FP): predicted positive, actually negative (Type I error)
• False Negative (FN): predicted negative, actually positive (Type II error)
• True Negative (TN): predicted negative, actually negative

Example (10 instances, with Cancer as the positive class):
Inst. | Actual y | Predicted ŷ
1 | Cancer | Cancer
2 | Cancer | Cancer
3 | Cancer | Benign
4 | Cancer | Benign
5 | Cancer | Benign
6 | Benign | Benign
7 | Benign | Benign
8 | Benign | Benign
9 | Benign | Benign
10 | Benign | Cancer
This gives TP = 2, FN = 3, FP = 1, TN = 4.

Precision: P = TP / (TP + FP)
• How many selected items are relevant? How precise were the positive predicted instances?
• Maximize this if a false positive (FP) is very costly, e.g., email spam, satellite launch date prediction.

Recall: R = TP / (TP + FN)
• How many relevant items are selected? How many positive instances can be recalled (predicted)?
• Maximize this if a false negative (FN) is very dangerous, e.g., cancer prediction but not music recommendation.

Can we combine the best of both metrics? The F1 score:
F1 = 2 / (1/P + 1/R)
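A plain-Python sketch that reproduces the slide’s 10-instance example (the 1/0 label encoding, with cancer as the positive class, is my own choice):

# Actual vs. predicted labels from the slide (1 = cancer, 0 = benign)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 2
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 4

accuracy  = (tp + tn) / (tp + fn + fp + tn)   # 0.6
precision = tp / (tp + fp)                    # 2/3 ≈ 0.667
recall    = tp / (tp + fn)                    # 2/5 = 0.4
f1        = 2 / (1 / precision + 1 / recall)  # 0.5
print(accuracy, precision, recall, f1)

(scikit-learn’s accuracy_score, precision_score, recall_score, and f1_score compute the same quantities.)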
23
Outline
• Machine Learning
• What is ML?
• Types of Feedback
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL)
• Entropy and Information Gain
• Different types of attributes
• Pruning
• Ensemble Methods

24
Life’s Difficult Decision: What to Eat?
Decide whether to wait for a table at a restaurant based on the following input
attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?

3. Fri/Sat: is today Friday or Saturday?

4. Hungry: are we hungry?

5. Patrons: number of people in the restaurant (None, Some, Full)

6. Price: price range ($, $$, $$$)

7. Raining: is it raining outside?

8. Reservation: have we made a reservation?

9. Type: kind of restaurant (French, Italian, Thai, Burger)

10. WaitEstimate: estimated waiting time in minutes (0-10, 10-30, 30-60, >60)

25
Life’s Difficult Decision: What to Eat?

You are hungry, it’s Friday, raining, …; Should you wait?

26
Decision Trees
One possible representation
for hypotheses

27
Expressiveness
• Decision trees can express any function of the input attributes.
• Example: Boolean functions, each row → path from root to leaf
A | B | A xor B
F | F | F
F | T | T
T | F | T
T | T | F

[Decision tree for A xor B: the root tests A, each branch then tests B, and each leaf gives the value of A xor B for that row.]

• Trivially, there is a consistent decision tree for any training set, but it
probably won’t generalize to new examples.
Prefer a more “compact” tree
28
The Size of the Hypothesis Class
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)
A | B | A <op> B
F | F | T/F
F | T | T/F
T | F | T/F
T | T | T/F

With 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees


29
Decision Tree Learning
Greedy, top-down, recursive algorithm.

def DTL(examples, attributes, default):
    if examples is empty: return default
    if examples have the same classification:
        return classification
    if attributes is empty:
        return mode(examples)
    best = choose_attribute(attributes, examples)   # Need to define this!
    tree = a new decision tree with root best
    for each value v_i of best:
        examples_i = {rows in examples with best = v_i}
        subtree = DTL(examples_i, attributes – best, mode(examples))
        add a branch to tree with label v_i and subtree subtree
    return tree
30
Choosing an Attribute
How do we choose an attribute?

Patrons? (branches: None, Some, Full)    vs.    Type? (branches: French, Italian, Thai, Burger)

Ideally, we want to select an attribute that splits the examples into subsets that are “all positive” or “all negative”.
31
Entropy
• Entropy is a measure of randomness:
  I(P(v_1), …, P(v_n)) = − Σ_{i=1}^{n} P(v_i) log₂ P(v_i)
• For a data set containing p positive and n negative examples:
  I(p/(p+n), n/(p+n)) = − (p/(p+n)) log₂ (p/(p+n)) − (n/(p+n)) log₂ (n/(p+n))
  Entropy is maximized (maximum randomness, maximum entropy) when the positive and negative examples are evenly split.

[Candidate splits, as before: Patrons? (None, Some, Full) and Type? (French, Italian, Thai, Burger)]
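A small Python sketch of this entropy formula (it is reused by the information-gain sketch on the next slide):

import math

def entropy(probs):
    """I(P(v_1), ..., P(v_n)) = -sum_i P(v_i) * log2 P(v_i), treating 0*log 0 as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def entropy_pn(p, n):
    """Entropy of a data set with p positive and n negative examples."""
    return entropy([p / (p + n), n / (p + n)])

print(entropy_pn(6, 6))  # 1.0 bit: evenly split, maximum randomness
print(entropy_pn(6, 0))  # 0.0 bits: all positive, no randomness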

32
Information Gain
• A chosen attribute A divides the training set E into subsets E_1, …, E_v according to their values for A, where A has v distinct values.

  remainder(A) = Σ_{i=1}^{v} ((p_i + n_i)/(p + n)) · I(p_i/(p_i + n_i), n_i/(p_i + n_i))

• Information Gain (IG), or reduction in entropy, is the entropy of this node minus the total (weighted) entropy of its children:

  IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)

[Example on the 12 restaurant examples: Patrons? splits them into branches with 2/12, 4/12, and 6/12 of the examples (None, Some, Full); Type? splits them into branches with 2/12, 2/12, 4/12, and 4/12 (French, Italian, Thai, Burger).]
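A sketch of remainder, information gain, and the resulting choose_attribute, building on the entropy and dtl sketches above (binary labels assumed, with truthy labels treated as positive):

def remainder(attribute, examples):
    """Weighted entropy of the subsets produced by splitting on `attribute`."""
    total = len(examples)
    rem = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == v]
        p = sum(1 for label in subset if label)   # positives in this branch
        n = len(subset) - p
        rem += (len(subset) / total) * entropy_pn(p, n)
    return rem

def information_gain(attribute, examples):
    p = sum(1 for _, label in examples if label)
    n = len(examples) - p
    return entropy_pn(p, n) - remainder(attribute, examples)

def choose_attribute(attributes, examples):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(a, examples))

On the 12 restaurant examples this gives IG(Patrons) ≈ 0.541 bits, matching the calculation on the next slide.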
33
Decision Tree Learning
with Information Gain
(Same DTL pseudocode as before; choose_attribute picks the attribute with the highest information gain.)

Worked example (12 restaurant examples, 6 positive and 6 negative):
• Patrons? splits them (2, 4, 6) into (None, Some, Full) with labels 2F | 4T | 2T, 4F:
  IG(Patrons) = 1 − [ (2/12)·I(0, 1) + (4/12)·I(1, 0) + (6/12)·I(2/6, 4/6) ] = 0.541 bits
• Type? splits them (2, 2, 4, 4) into (French, Italian, Thai, Burger) with labels 1T, 1F | 1T, 1F | 2T, 2F | 2T, 2F.
• Compare IG(Alt), IG(Bar), IG(Fri), IG(Hun), …; Patrons? gives the highest gain, so it becomes the root.
34


Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
???
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree

35
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
???
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
IG(Alt), IG(Bar), IG(Fri), IG(Hun), … except IG(Pat)
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree

36
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
???
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree

37
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
???
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
IG(Alt), IG(Bar), IG(Type), … except IG(Pat), IG(Hun)
add a branch to tree with label 𝑣𝑖 and subtree subtree

38
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree ???
There are no examples for “French”.
Since there is no mode (majority),
we select an arbitrary decision: “T”.

39
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree ???

IG(Alt), IG(Fri), … except IG(Pat), IG(Hun), IG(Type)

40
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree

41
Decision Tree Learning
with Information Gain
def DTL(examples, attributes, default):
if examples is empty: return default
if examples have the same classification:
return classification
if attributes is empty:
return mode(examples)
best = choose_attribute(attributes, examples)
tree = a new decision tree with root best
for each value 𝑣𝑖 of best:
𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 = {rows in examples with best = 𝑣𝑖 }
subtree = DTL(𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠𝑖 , attributes – best, mode(examples))
add a branch to tree with label 𝑣𝑖 and subtree subtree

42
Dealing with Attributes with Many Values
Information gain will select attributes with many values (e.g., dates, phone
numbers) because they split the data “well”. In the extreme case, each branch
will have a single example, so it is “all positive” or “all negative”.

Balance the information gain with the number of branches:

GainRatio(A) = IG(A) / SplitInformation(A)

SplitInformation(A) = − Σ_{i=1}^{d} (|E_i|/|E|) log₂ (|E_i|/|E|)

[Example: splitting E into two equal subsets E_1, E_2 gives SplitInformation = 1; splitting it into three subsets E_1, E_2, E_3 gives SplitInformation = 1.6.]
43
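A sketch of these two quantities, reusing Counter, entropy, and information_gain from the sketches above:

def split_information(attribute, examples):
    """Entropy of the split itself; large when an attribute has many small branches."""
    total = len(examples)
    sizes = Counter(x[attribute] for x, _ in examples).values()
    return entropy([s / total for s in sizes])

def gain_ratio(attribute, examples):
    si = split_information(attribute, examples)
    return information_gain(attribute, examples) / si if si > 0 else 0.0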
Dealing with Attributes with Differing Costs
Certain attributes may have costs (e.g., in medical diagnosis). They
may vary significantly in their costs, both in terms of monetary cost and
patient comfort. Examples: biopsy, blood test, etc.

Make the decision tree use low-cost attributes where possible, using a
Cost-Normalized-Gain:

IG(A)² / Cost(A)    or    (2^IG(A) − 1) / (Cost(A) + 1)^w,  where w ∈ [0, 1]
determines the importance of cost.
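A small sketch of both cost-normalized variants, reusing information_gain from above (the choice of w and the cost values would be problem-specific):

def cost_normalized_gain(attribute, examples, cost, w=0.5):
    """Two cost-sensitive alternatives to plain information gain; w in [0, 1]."""
    ig = information_gain(attribute, examples)
    return ig ** 2 / cost, (2 ** ig - 1) / (cost + 1) ** w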
44
Dealing with Continuous-valued Attributes
Define a discrete-valued input attribute to partition the values into a
discrete set of intervals.

Examples:
• Estimated waiting time (minutes): 0-10, 10-30, 30-60, >60
• Age (year): 0-12, 12-25, 25-40, 40-60, 60-80, >80
•…
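A small pandas sketch of this kind of binning (the column name and values are made up; the bin edges are the slide’s waiting-time intervals):

import pandas as pd

wait_minutes = pd.Series([5, 12, 45, 75, 30])
# Partition the continuous values into the intervals 0-10, 10-30, 30-60, >60
wait_bucket = pd.cut(wait_minutes, bins=[0, 10, 30, 60, float("inf")],
                     labels=["0-10", "10-30", "30-60", ">60"])
print(wait_bucket.tolist())  # ['0-10', '10-30', '30-60', '>60', '10-30']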

45
Dealing with Missing Values
What if some values are missing?
• Assign the most common value of the attribute
• Assign the most common value of the attribute with the same output
• Assign probability to each possible value and sample
• Drop the attribute
• Drop the rows
•…
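A pandas sketch of two of these strategies (the DataFrame and column are illustrative):

import pandas as pd

df = pd.DataFrame({"Patrons": ["Some", None, "Full", "Full", None]})

# Assign the most common value of the attribute
df["Patrons_filled"] = df["Patrons"].fillna(df["Patrons"].mode()[0])  # fills with "Full"

# Or drop the rows with missing values instead
df_dropped = df.dropna(subset=["Patrons"])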

46
Overfitting
Decision tree performance is perfect on the training data, but worse on the test data:
• the tree captures the training data perfectly, including the noise.

Hungry? | Moon aligned with the sun? | Pimple? | Eat?
Yes | Yes | Yes | Yes
Yes | No | Yes | No
Yes | No | No | Yes
No | Yes | No | No
No | No | No | No

[Two trees fit this data: a small tree that tests only Hungry? (H?), and a larger tree that also tests Moon (M?) and Pimple (P?) so that every row, including the noisy one, is classified perfectly.]
47
Occam’s Razor

• Prefer short/simple hypotheses


• In favor:
• Short/simple hypothesis that fits the data is unlikely to be coincidence
• Long/complex hypothesis that fits the data may be coincidence
• Against:
• Many ways to define small sets of hypotheses (e.g., trees with a prime number
of nodes that use attributes beginning with “Z”)
• Different hypotheses representations may be used instead

48
Pruning
Prevent nodes from being split even when the split fails to cleanly separate the examples.

Example tree (counts show how many training examples reach each leaf):
• Cond1? — Yes: leaf T (25); No: Cond2?
• Cond2? — Yes: Cond3?; No: leaf F (13)
• Cond3? — Yes: leaf F (9); No: Cond4?
• Cond4? — Yes: leaf T (1); No: leaf F (2)

Min-sample = 5: Cond4? covers only 3 examples (1 T, 2 F), so it is not split and becomes a leaf F (majority). Both children of Cond3? are then F, so it simplifies to a leaf F (majority of 1 T, 11 F), and likewise Cond2? simplifies to F (majority of 1 T, 24 F). The pruned tree is simply:
• Cond1? — Yes: leaf T (25); No: leaf F (majority of 1 T, 24 F)

Max-depth = 3: the tree is not grown beyond depth 3, so Cond4? is again replaced by the leaf F (majority), and the tree simplifies in the same way.

The sample that is likely noise (1 T) is ignored! This results in a smaller tree, which may have higher accuracy.
57
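In scikit-learn these two pruning knobs correspond (under my reading of the slide) to min_samples_split and max_depth; a minimal sketch on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# "Min-sample = 5": do not split a node that has fewer than 5 examples.
# "Max-depth = 3": do not grow the tree beyond depth 3.
clf = DecisionTreeClassifier(min_samples_split=5, max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.get_depth(), clf.tree_.node_count)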
Ensemble Methods

• Bootstrap Aggregation (Bagging)
• Random Forests
• Boosting
  • AdaBoost, XGBoost
58
Image Credit: https://www.researchgate.net/figure/Illustrations-of-A-bagging-and-B-boosting-ensemble-algorithms_fig4_334404567
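A brief scikit-learn sketch of these ensemble families (dataset and hyperparameters are arbitrary placeholders; XGBoost itself lives in the separate xgboost package):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: train trees on bootstrap resamples of the data and aggregate their votes
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Random forest: bagging plus a random subset of features at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: fit weak learners sequentially, focusing on previously misclassified examples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for model in (bagging, forest, boosting):
    print(type(model).__name__, model.fit(X, y).score(X, y))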
Summary
• Machine Learning
• What is ML? – a machine that learns from data
• Types of Feedback: supervised, unsupervised, semi-supervised, reinforcement
• Supervised Learning
• Performance Measure
• Regression: mean squared error, mean absolute error
• Classification: correctness, accuracy, confusion matrix, precision, recall, F1
• Decision Trees
• Decision Tree Learning (DTL): greedy, top-down, recursive algorithm
• Entropy and Information Gain
• Different types of attributes: many values, differing costs, missing values
• Pruning: min-sample, max-depth
• Ensemble Methods: bagging, boosting

59
Coming Up Next Week
• Linear Regression
• Multiple Features
• Polynomial Regression
• Optimization Algorithms
• Gradient Descent
• Variants of Gradient Descent
• Normal Equation

60
To Do
• Lecture Training 4
• +100 Free EXP
• +50 Early bird bonus
• Problem Set 2
• Due Saturday, 10th February
• Problem Set 3
• Out today!
• Contest +500 EXP

61
