Lecture 3b: Decision Trees (Part 1)
Machine Learning for Language Technology 2015
http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm
Decision Trees (Part 1)
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2015
Outline
• Greediness
• Divide and Conquer
• Inductive Bias of the Decision Tree
• Loss function
• Expected loss
• Empirical error
• Induction
Learning: Generalization Ability
• Predicting the future based on the past
Predict whether a student will like a course
Training Data
That is, ...
• Questions = Features
• Answers = Feature Values
• Ratings = Class Labels
• An example is a set of feature values.
• Training data is a set of examples associated
with class labels (a small encoding in code follows below).
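For concreteness, here is one way such training data could be encoded in Python. This is only an illustrative sketch: the feature names (easy, ai, morning) and the ratings are made up, not taken from the slides.

```python
# Each example is a set of feature values (answers to the questions),
# paired with a class label (the rating).
training_data = [
    ({'easy': True,  'ai': True,  'morning': False}, 'like'),
    ({'easy': False, 'ai': True,  'morning': False}, 'like'),
    ({'easy': True,  'ai': False, 'morning': True},  'nah'),
    ({'easy': False, 'ai': False, 'morning': False}, 'nah'),
]
```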
”Greedy model”: the most useful feature
– Histograms (of the class labels, for each value of a feature)
– Root node (the most useful feature becomes the root; see the sketch below)
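A minimal sketch of this greedy first step, reusing the toy encoding idea from above (trimmed to two features; all names are illustrative):

```python
from collections import Counter

training_data = [  # (feature values, rating)
    ({'easy': True,  'ai': True},  'like'),
    ({'easy': True,  'ai': False}, 'nah'),
    ({'easy': False, 'ai': True},  'like'),
    ({'easy': False, 'ai': False}, 'nah'),
]

def usefulness(feature):
    # Histogram the class labels among the YES answers and the NO answers.
    yes = Counter(y for x, y in training_data if x[feature])
    no = Counter(y for x, y in training_data if not x[feature])
    # Predict the majority label on each side; count how many come out right.
    return max(yes.values(), default=0) + max(no.values(), default=0)

# The most useful feature becomes the root node.
root = max(['easy', 'ai'], key=usefulness)
print(root)  # 'ai': splitting on it predicts every rating correctly
```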
Divide & Conquer
• Divide:
– Partition the data into 2 parts:
• YES part vs NO part
• Conquer:
– Recurse and run the Divide routine on each part (see the sketch below)
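A minimal sketch of the whole Divide & Conquer routine. The function names, the greedy score, and the tuple representation of trees are my own illustration, not from the slides:

```python
from collections import Counter

def majority(labels):
    """The most common class label."""
    return Counter(labels).most_common(1)[0][0]

def train(data, features):
    """data: list of (feature-dict, label) pairs; features: names to try."""
    labels = [y for _, y in data]
    # Stop when querying more features is useless: all labels agree,
    # or there is nothing left to ask.
    if len(set(labels)) == 1 or not features:
        return ('leaf', majority(labels))
    def score(f):
        # Labels guessed correctly when predicting the majority on each side.
        yes = Counter(y for x, y in data if x[f])
        no = Counter(y for x, y in data if not x[f])
        return max(yes.values(), default=0) + max(no.values(), default=0)
    best = max(features, key=score)
    # Divide: partition the data into the YES part and the NO part.
    yes_part = [(x, y) for x, y in data if x[best]]
    no_part = [(x, y) for x, y in data if not x[best]]
    if not yes_part or not no_part:  # the split separates nothing
        return ('leaf', majority(labels))
    rest = [f for f in features if f != best]
    # Conquer: recurse on each part.
    return ('node', best, train(yes_part, rest), train(no_part, rest))

def predict(tree, x):
    """Walk the tree, answering each feature test, until a leaf is reached."""
    if tree[0] == 'leaf':
        return tree[1]
    _, f, yes_branch, no_branch = tree
    return predict(yes_branch, x) if x[f] else predict(no_branch, x)
```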
The end of the cycle
• The recursion ends when it becomes useless to query on
additional features
Decision tree: Inductive Bias
• The goal of the decision tree learning model
is:
– to figure out what questions to ask
– in what order
– what answer to predict once you have asked
enough questions
– The inductive bias of decision trees: the things
that we want to learn to predict are more like the
root node and less like the other branch nodes.
Informal Definition
• A decision tree is:
– a flow-chart-like structure, where
• each internal (non-leaf) node denotes a test on an
attribute,
• each branch represents the outcome of a test, and
• each leaf (or terminal) node holds a class label.
• The topmost node in a tree is the root node (a concrete example follows below).
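Using the tuple convention from the training sketch above, the flow-chart structure can be written down directly; this tree is a made-up illustration:

```python
# Internal nodes test an attribute; branches are the YES/NO outcomes;
# leaves hold class labels. The topmost node ('ai') is the root.
tree = ('node', 'ai',           # root: test the feature 'ai'
        ('leaf', 'like'),       # YES branch: a leaf with a class label
        ('node', 'easy',        # NO branch: another attribute test
         ('leaf', 'like'),      #   easy? YES: predict 'like'
         ('leaf', 'nah')))      #   easy? NO: predict 'nah'
```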
Formalising the learning problem:
1) The loss function
• The loss function $\ell(y, \hat{y})$ says how bad it is to predict $\hat{y}$ when the true answer is $y$ (two common choices are sketched below).
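Two standard instantiations, as a sketch (the lecture names the loss function only abstractly; these concrete choices are common defaults, not taken from the slides):

```python
def zero_one_loss(y, y_hat):
    """Classification: 1 if the prediction is wrong, 0 if it is right."""
    return 0 if y == y_hat else 1

def squared_loss(y, y_hat):
    """Regression: penalizes large mistakes much more than small ones."""
    return (y - y_hat) ** 2
```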
Formalising the learning problem:
2) Data Generating Distribution
$\mathcal{D}(x, y)$ — the probability that the (unknown) data generating distribution $\mathcal{D}$ produces the input/output pair $(x, y)$
Expected Loss
• The expected loss combines the two ingredients just defined:
1. The loss function
2. The data generating distribution
Formulae: Expected Value

$$\epsilon \;\triangleq\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(y, f(x))\big] \;=\; \sum_{(x,y)} \mathcal{D}(x, y)\,\ell(y, f(x))$$

How to read:
• $\epsilon$ = epsilon
• $\triangleq$ = equal by definition to (or: is defined as)
• $\mathbb{E}$ = blackboard-bold E
• $(x,y)$ = sub: the pair (x, y)
• $\mathcal{D}$ = over: script D
• $\ell(y, f(x))$ = l of the pair y, f of x
• The right-hand side: sum, over all the pairs (x, y) in script D,
of D of x and y times l of y and f of x
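When the distribution is known and finite, the expected loss is exactly this weighted sum, as in the following sketch (the distribution, the classifier, and the loss are all made up for illustration):

```python
# D(x, y): probability of each (input, true label) pair.
D = {
    (('ai', True),  'like'): 0.4,
    (('ai', True),  'nah'):  0.1,
    (('ai', False), 'like'): 0.1,
    (('ai', False), 'nah'):  0.4,
}

def f(x):
    """A classifier that predicts 'like' exactly for AI courses."""
    return 'like' if x[1] else 'nah'

def loss(y, y_hat):
    return 0 if y == y_hat else 1  # zero-one loss

# epsilon = sum over all pairs of D(x, y) * l(y, f(x))
expected_loss = sum(p * loss(y, f(x)) for (x, y), p in D.items())
print(expected_loss)  # 0.2: f errs exactly on the 0.1 + 0.1 mass
```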
Training Error
• The training error is the average error over the
training data:

$$\hat{\epsilon} \;\triangleq\; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(x_n))$$

• How to read: the training error epsilon-hat is
equal by definition to 1 over N of the sum, from
n = 1 to capital N, of l of y_n and f of x_n.
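A one-function sketch of this average (data is a list of (features, label) pairs as in the earlier examples; the zero-one default is my choice):

```python
def training_error(data, f, loss=lambda y, y_hat: int(y != y_hat)):
    """(1/N) * sum over n of loss(y_n, f(x_n)) on the training data."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)
```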
Empirical Error
• Alpaydin (2010: 24): the empirical error is the
proportion of training instances where the
predictions of h (the hypothesis = the
informed guess) do not match the required
values given in X (the training set). The error
of the hypothesis h given the training set X
is:

$$E(h \mid \mathcal{X}) \;=\; \frac{1}{N} \sum_{t=1}^{N} \mathbf{1}\big(h(x^t) \neq r^t\big)$$

where $\mathbf{1}(\cdot)$ is 1 when its argument is true and 0 otherwise, and $r^t$ is the required value for instance $x^t$.
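Under zero-one loss this coincides with the training error above; written with Alpaydin's h and r^t (hypothesis and required value), as a sketch:

```python
def empirical_error(h, X):
    """X: list of (x, r) pairs. The proportion of instances where h(x) != r."""
    return sum(h(x) != r for x, r in X) / len(X)
```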
Induction
Given:
• a loss function $\ell$
and
• a sample of data drawn from some unknown distribution $\mathcal{D}$
• you must compute a function $f$ that has low
expected error $\epsilon$ over $\mathcal{D}$ with respect to $\ell$.
Quiz 1: Training error
• How would you define the training error on a
dataset?
1. Training error is the average loss over the
training sample
2. Training error is the expected prediction error
over an independent test sample
3. None of the above
Quiz 2: Distributions
What kind of distribution is D
in the formula above?
1. Normal
2. Unknown
3. None of the above
Quiz 3: Loss function
• How would you define a loss function?
1. The loss function L(actual value, predicted value)
characterizes how bad predictions are
2. The loss function is an unknown distribution
3. Both definitions are incorrect.
The End
