Supervised Learning Setup
Course 4232: Machine Learning
Dept. of Computer Science
Faculty of Science and Technology
Lecturer No: Week No: 2 Semester:
Instructor: Prof. Dr. Kamruddin Nur (kamruddin@aiub.edu)
Supervised Learning
Training experience: a set of labeled examples of the form
x = ( x1, x2, …, xn, y )
where xj are values for input variables and y is the output
This implies the existence of a “teacher” who knows the right answers
What to learn: A function f : X1 × X2 × … × Xn → Y , which maps
the input variables into the output domain
Goal: minimize the error (loss function) on the training examples
Supervised Learning Problem
Given a data set D ⊂ X1 × X2 × … × Xn × Y, find a function
h : X1 × X2 × … × Xn → Y
such that h(x) is a good predictor for the value of y
h is called a hypothesis
Example: Suppose we have a dataset D with:
Input features (X₁, X₂, …, Xₙ): These are the variables used to make predictions.
(Example: In house price prediction, X₁ = size, X₂ = location, etc.)
Output/target (Y): This is what you want to predict.
(Example: House price, whether an email is spam, etc.)
Goal: Find a function h (called a hypothesis) that takes X₁, X₂, …, Xₙ as input and predicts Y
accurately.
What is h? (The Hypothesis)
h is just a rule or model that makes predictions.
(Example: A linear equation like h(X) = 2X + 3 could predict house prices.)
"Good predictor" means h(X) should be close to the true Y (measured by accuracy, error, etc.).
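As a minimal sketch of what "close to the true Y" can mean, the snippet below evaluates the example hypothesis h(X) = 2X + 3 on a few labeled pairs. The data values are made up for illustration, and mean absolute error is only one possible choice of measure.

```python
def h(x):
    """Example hypothesis from the slide: h(X) = 2X + 3."""
    return 2 * x + 3

# Made-up (x, y) pairs, for illustration only.
examples = [(1.0, 5.5), (2.0, 6.5), (3.0, 9.0)]

# Absolute error |h(x) - y| for every labeled example.
errors = [abs(h(x) - y) for x, y in examples]
mean_abs_error = sum(errors) / len(errors)
print(errors, mean_abs_error)
```

A smaller average error means h is a better predictor on these examples; it says nothing yet about unseen data.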
Supervised Learning Problem
If Y is the set of real numbers, this problem is a regression problem
If Y is a finite discrete set, this problem is called classification
Case 1: Regression (Y is a real number)
Y is continuous (any numeric value).
(Example: Predicting house prices, temperature, etc.)
Example Hypothesis:
h(size, location) = 50,000 + 200*(size) + 10,000*(location_rating)
Case 2: Classification (Y is discrete)
Y is a category (like labels).
(Example: Spam/Not spam, Cat/Dog/Bird, etc.)
Binary Classification (2 classes):
Y = {0, 1} or {Spam, Not Spam}.
(Example: h(email_text) = "Spam" or "Not Spam".)
Multiclass Classification (>2 classes):
Y = {Cat, Dog, Bird, ...}.
(Example: h(image) = "Cat".)
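The two cases can be sketched side by side. The regression hypothesis below is the one given on the slide; the spam classifier is a toy rule whose keyword list is a made-up assumption, included only to show that a classifier returns a label rather than a number.

```python
def h_price(size, location_rating):
    # Regression hypothesis from the slide: the output is a real number.
    return 50_000 + 200 * size + 10_000 * location_rating

def h_spam(email_text, keywords=("free", "winner")):
    # Toy binary classifier; the keyword rule is a hypothetical assumption.
    text = email_text.lower()
    return "Spam" if any(k in text for k in keywords) else "Not Spam"

price = h_price(size=1000, location_rating=3)
label = h_spam("You are a WINNER, claim your free prize")
print(price, label)
```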
Supervised Learning Steps
Decide what the training examples are
Data collection
Feature extraction or selection:
Discriminative features
Relevant and insensitive to noise
Input space X, output space Y, and feature vectors
Choose a model, i.e. a representation for h
(or, equivalently, the hypothesis class H = {h1, …, hr})
Choose an error function to define the best hypothesis
Choose a learning algorithm: regression or classification method
Training
Evaluation = testing
Example: Which Model or Hypothesis Space H?
• Training examples:
ei = <xi, yi> for i = 1, …, 10
Linear Hypothesis
Sum-of-Squares Error function
(or Mean Squared Error, MSE)
We need an error function that measures the difference between the predictions and the true answers.
How do we quantify the prediction errors?
Purpose:
• Quantifies Model Accuracy: Measures how well the hypothesis h_w fits the training data.
• Optimization Goal: Find the parameters w that minimize J(w).
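The formula itself is not reproduced in this text, so the sketch below assumes the common definitions J(w) = 1/2 Σ (h_w(x_i) − y_i)² for the sum-of-squares error and the mean of the squared residuals for MSE, with a one-variable linear hypothesis. The data and the function names are illustrative assumptions.

```python
import numpy as np

def predict(w, x):
    # Assumed linear hypothesis h_w(x) = w[0] + w[1] * x.
    return w[0] + w[1] * x

def sum_of_squares(w, x, y):
    # J(w) = 1/2 * sum_i (h_w(x_i) - y_i)^2; the 1/2 factor is a convention.
    residuals = predict(w, x) - y
    return 0.5 * np.sum(residuals ** 2)

def mse(w, x, y):
    # Mean squared error: average of the squared residuals.
    return np.mean((predict(w, x) - y) ** 2)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])          # made-up data lying on y = 1 + 2x

print(sum_of_squares(np.array([1.0, 2.0]), x, y))   # perfect fit
print(sum_of_squares(np.array([0.0, 2.0]), x, y))   # every residual is -1
```

Both functions are minimized by the same w; they differ only by a constant factor.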
Example – Minimizing J(w)
Least Mean Squares (LMS)
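The slide body is not available here, so the following is only a sketch of the standard LMS (Widrow-Hoff) rule, which updates the weights after each example in the direction that reduces the squared error: w ← w + α (y_i − h_w(x_i)) x_i. The learning rate, epoch count, and data are assumptions chosen for illustration.

```python
import numpy as np

def lms(X, y, alpha=0.05, epochs=500):
    # X is assumed to already contain a leading column of 1's (bias term).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - x_i @ w        # prediction error on one example
            w += alpha * error * x_i     # LMS update for that example
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])       # made-up data on y = 1 + 2x
w = lms(X, y)
print(w)
```

With a learning rate that is too large the updates can diverge, so α usually needs tuning for each data set.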
Limitation of MSE
Mean Absolute Error (MAE)
Huber Loss and RMSE
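A minimal sketch comparing the error measures named on these slides, using their usual textbook definitions. The threshold delta = 1.0 and the data, which include one deliberate outlier, are assumptions for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error.
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean squared error.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for residuals up to delta, linear beyond it.
    r = np.abs(y_true - y_pred)
    loss = np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
    return np.mean(loss)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])   # last prediction is an outlier

print(mae(y_true, y_pred), rmse(y_true, y_pred), huber(y_true, y_pred))
```

The single outlier inflates RMSE much more than MAE or the Huber loss, which is the usual argument for the latter two when the data are noisy.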
Some Linear Algebra
Some Linear Algebra …
Some Linear Algebra - The Solution!
Example of Linear Regression - Data Matrices
$X^T X$
$X^T Y$
Solving for w – Regression Curve
Dr. M M Manjurul Islam
Linear Regression - Summary
The optimal solution can be computed in polynomial time in the size of the
data set.
Too simple for most real-valued problems
The solution is $w = (X^T X)^{-1} X^T Y$, where
X is the data matrix, augmented with a column of 1’s
Y is the column vector of target outputs
A very rare case in which an analytical exact solution is possible
Nice math, closed-form formula, unique global optimum
Problems arise when $X^T X$ does not have an inverse
Possible solutions to this:
1. Include high-order terms in hw
2. Transform the input X to some other space X’, and apply linear regression on X’
3. Use a different but more powerful hypothesis representation
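The closed-form solution and the first suggestion in the list (adding high-order terms) can be sketched together. The function name, the polynomial feature construction via `np.vander`, and the data are illustrative assumptions; the linear system is solved directly instead of forming the inverse, which is numerically safer.

```python
import numpy as np

def fit_normal_equations(x, y, degree=1):
    # Data matrix with columns 1, x, x^2, ..., x^degree (column of 1's = bias).
    X = np.vander(x, degree + 1, increasing=True)
    # Solve (X^T X) w = X^T y rather than computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2          # made-up noiseless quadratic data

w_lin = fit_normal_equations(x, y, degree=1)    # too simple for this data
w_quad = fit_normal_equations(x, y, degree=2)   # recovers the coefficients
print(w_lin, w_quad)
```

When $X^T X$ is singular, `np.linalg.solve` raises an error; `np.linalg.lstsq(X, y, rcond=None)` is one way to obtain a least-squares solution in that case.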
Is Linear Regression enough?
Generalization Ability vs Overfitting
A very important issue for any machine learning algorithm.
Can your algorithm predict the correct target y for any unseen x?
A hypothesis may predict perfectly for all known x's but not for unseen x's
This is called overfitting
Each hypothesis h has an unknown true error on the universe: JU(h)
But we only measured the empirical error on the training set: JD(h)
Let h1 and h2 be two hypotheses compared on training set D, such that
we obtained the result JD(h1) < JD(h2)
If h2 is “truly” better, that is JU(h2) < JU(h1)
Then your algorithm is overfitting, and won’t generalize to unseen data
We are not interested in memorizing the training set
In our examples, the hypotheses with the highest degree d overfit (i.e. memorize) the data
We need methods to overcome overfitting.
Overfitting
We have overfitting when hypothesis h is more complex than the data
Complexity of h = number of parameters in h
Number of weight parameters in our example increases with degree d
Overfitting = low error on training data but high error on unseen data
Assume D is drawn from some unknown probability distribution
Given the universe U of data, we want to learn a hypothesis h from the
training set 𝐷 ⊂ 𝑈 minimizing the error on unseen data 𝑈 ∖ 𝐷.
Every h has a true error JU(h) on U, which is the expected error when the
data is drawn from the distribution
We can only measure the empirical error JD(h) on D; we do not have U
Then… How can we estimate the error JU(h) from D?
Apply a cross-validation method on D
Determining the hypothesis h that generalizes best is called model selection.
Avoiding Overfitting
• Red curve = Test set
• Blue curve = Training set
• What is the best h?
• Find the degree d
• Such that JT(h), the error on the test set (red curve), is minimal
• Training error decreases with complexity of h;
degree d in our example
• Testing error decreases initially then increases
• We need three disjoint subsets T, V, U of D
• Learn a potential h using the training set T
• Estimate the error of h using the validation set V
• Report an unbiased error estimate for h using the test set U
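The behaviour of the two curves can be reproduced with a small experiment. This is a sketch under assumptions: the "true" function, the noise level, the sample sizes, and the range of degrees are all made up, and `np.polyfit` stands in for whatever training routine is used in the course.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed "true" function generating the data.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 10)
x_held = np.linspace(0.05, 0.95, 10)            # held-out points
y_train = f(x_train) + rng.normal(0, 0.2, 10)
y_held = f(x_held) + rng.normal(0, 0.2, 10)

def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)

train_err, held_err = [], []
for d in range(0, 8):
    w = np.polyfit(x_train, y_train, d)         # least-squares fit of degree d
    train_err.append(mse(w, x_train, y_train))
    held_err.append(mse(w, x_held, y_held))

best_d = int(np.argmin(held_err))               # degree with lowest held-out error
print(train_err, held_err, best_d)
```

The training error can only go down as d grows, because each degree contains the lower ones as special cases; the held-out error has no such guarantee.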
Cross-Validation
General procedure for estimating the true error of a learner.
Randomly partition the data into three subsets:
1. Training Set T: used only to find the parameters of classifier, e.g. w.
2. Validation Set V: used to find the correct hypothesis class, e.g. d.
3. Test Set U: used to estimate the true error of your algorithm
These three sets do not intersect, i.e. they are disjoint
Repeat cross-validation many times
Results are averaged to give true error estimate.
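A random partition into three disjoint index sets can be sketched as follows. The function name and the 50/25/25 proportions are assumptions for illustration; any proportions that leave enough data in each set would do.

```python
import numpy as np

def split_three_way(m, frac_train=0.5, frac_val=0.25, seed=0):
    # Returns disjoint index arrays for T, V and U over m examples.
    idx = np.random.default_rng(seed).permutation(m)
    n_t = int(frac_train * m)
    n_v = int(frac_val * m)
    return idx[:n_t], idx[n_t:n_t + n_v], idx[n_t + n_v:]

T, V, U = split_three_way(20)
print(len(T), len(V), len(U))
```

Calling the function with different seeds gives the repeated random partitions whose results are averaged.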
Cross-Validation and Model Selection
How to find the best degree d which fits the data D the best?
Randomly partition the available data D into three disjoint sets;
training set T, validation set V, and test set U, then:
1. Cross-validation: For each degree d, perform a cross-validation
method using the T and V sets to evaluate the goodness of d.
Some cross-validation techniques to be discussed later
2. Model Selection: Given the best d found in step 1, find $h_{w,d}$
using the T and V sets and report the prediction error of $h_{w,d}$
using the test set U
Some model selection approaches to be discussed later.
The prediction error on U is an unbiased estimate of the true error
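The two steps can be sketched end to end. This assumes polynomial hypotheses fitted with `np.polyfit`, a simple hold-out validation in step 1, and made-up quadratic data; the set sizes and the degree range are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 1 - 2 * x + 3 * x ** 2 + rng.normal(0, 0.1, 60)   # made-up quadratic data

idx = rng.permutation(60)
T, V, U = idx[:30], idx[30:45], idx[45:]               # disjoint T, V, U

def mse(w, i):
    # Mean squared error of polynomial w on the examples indexed by i.
    return np.mean((np.polyval(w, x[i]) - y[i]) ** 2)

# Step 1: score each degree d by training on T and validating on V.
val_err = {d: mse(np.polyfit(x[T], y[T], d), V) for d in range(1, 7)}
best_d = min(val_err, key=val_err.get)

# Step 2: refit the chosen degree on T and V together, then test once on U.
TV = np.concatenate([T, V])
w_best = np.polyfit(x[TV], y[TV], best_d)
test_err = mse(w_best, U)
print(best_d, test_err)
```

U is touched exactly once, after d has been fixed, which is what keeps the reported error unbiased.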
Leave-One-Out Cross-Validation
For each degree d do:
1. for i ← 1 to m do:
1. Validation set V_i ← {e_i = (x_i, y_i)} ; leave the i-th sample out
2. Training set T_i ← D \ V_i
3. w_{d,i} ← Train(T_i, d) ; optimal w_{d,i} using training set T_i
4. J(d, i) ← Test(V_i) ; validation error of w_{d,i} on x_i
; J(d, i) is an unbiased estimate of the true prediction error
2. Average validation error: $J(d) \leftarrow \frac{1}{m} \sum_{i=1}^{m} J(d, i)$
d* ← arg min_d J(d) ; select the degree d with the lowest average error
; J(d*) is not an unbiased estimate since all data is used to find it.
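A possible Python rendering of the procedure, assuming polynomial hypotheses, squared error as the Test function, and `np.polyfit` as the Train function. The data and the degree range are made up for illustration.

```python
import numpy as np

def loocv_error(x, y, d):
    m = len(x)
    errs = []
    for i in range(m):
        train = np.arange(m) != i                        # T_i = D \ {e_i}
        w = np.polyfit(x[train], y[train], d)            # w_{d,i} = Train(T_i, d)
        errs.append((np.polyval(w, x[i]) - y[i]) ** 2)   # J(d, i) on the left-out sample
    return np.mean(errs)                                 # J(d)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = 1 + 2 * x - 3 * x ** 2 + rng.normal(0, 0.1, 12)      # made-up data

J = {d: loocv_error(x, y, d) for d in range(1, 6)}
d_star = min(J, key=J.get)                               # degree with lowest J(d)
print(J, d_star)
```

The cost is m training runs per degree, which is why leave-one-out is mostly used when m is small.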
Example: Estimating True Error for d = 1
Example: Estimation results for all d
Optimal choice is d = 2
Overfitting for d > 2
Very high validation error for d = 8 and 9
Model Selection
J(d*) is not unbiased since it was obtained using all m sample data
We chose the hypothesis class d* based on $J(d) = \frac{1}{m} \sum_{i=1}^{m} J(d, i)$
We want both a hypothesis class and an unbiased true error estimate
If we want to compare different learning algorithms (or different
hypotheses), an independent test set U is required in order to
decide on the best algorithm or the best hypothesis
In our case, we are trying to decide which regression model to
use, d=1, or d=2, or …, or d=11?
And, which has the best unbiased true error estimate
k-Fold Cross-Validation
Partition D into k disjoint subsets of the same size and the same
distribution, P_1, P_2, …, P_k
For each degree d do:
for i ← 1 to k do:
1. Validation set V_i ← P_i ; leave P_i out for validation
2. Training set T_i ← D \ V_i
3. w_{d,i} ← Train(T_i, d) ; train on T_i
4. J(d, i) ← Test(V_i) ; compute validation error on V_i
Average validation error: $J(d) \leftarrow \frac{1}{k} \sum_{i=1}^{k} J(d, i)$
d* ← arg min_d J(d) ; return optimal degree d
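A sketch of k-fold cross-validation for choosing the degree, under the same assumptions as the rest of these examples: polynomial hypotheses fitted with `np.polyfit`, mean squared error as the validation error, and made-up data. The folds are formed by a random permutation and averaged over the k folds.

```python
import numpy as np

def k_fold_error(x, y, d, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)                   # P_1, ..., P_k
    errs = []
    for i in range(k):
        val = folds[i]                               # V_i = P_i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train], y[train], d)        # train on T_i
        errs.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))
    return np.mean(errs)                             # average over the k folds

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1 + 2 * x - 3 * x ** 2 + rng.normal(0, 0.1, 20)  # made-up data

J = {d: k_fold_error(x, y, d) for d in range(1, 6)}
d_star = min(J, key=J.get)
print(J, d_star)
```

Setting k equal to the number of examples turns this into leave-one-out cross-validation.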
Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a
family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2 : engine power
Training set X
$X = \{x^t, r^t\}_{t=1}^{N}$
$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}$
$x = [x_1, x_2]^T$
Class C
$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
Hypothesis class H
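$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$
Error of h on the training set X
$E(h \mid X) = \sum_{t=1}^{N} 1(h(x^t) \neq r^t)$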
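A minimal sketch of a rectangle hypothesis for the family-car class and its empirical error, which counts the training examples that h misclassifies. The bounds, the helper names, and the five training examples are made up for illustration.

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    # h(x) = 1 if price is in [p1, p2] and engine power is in [e1, e2], else 0.
    def h(x):
        price, power = x
        return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0
    return h

def empirical_error(h, X):
    # Number of examples (x, r) for which h(x) differs from the label r.
    return sum(1 for x, r in X if h(x) != r)

# Made-up training set of ((price, engine power), label) pairs.
X = [((15, 100), 1), ((18, 120), 1), ((40, 300), 0), ((5, 60), 0), ((17, 250), 0)]

h_tight = make_rectangle_hypothesis(10, 20, 80, 150)
h_loose = make_rectangle_hypothesis(10, 20, 80, 300)
print(empirical_error(h_tight, X), empirical_error(h_loose, X))
```

The looser rectangle wrongly accepts the (17, 250) example, so its error on this set is 1 while the tighter one has error 0.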
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
Any h ∈ H between S and G is consistent;
together these hypotheses make up the version space
(Mitchell, 1997)
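For axis-aligned rectangles, the most specific hypothesis S is the tightest rectangle that still contains every positive example. The sketch below computes only S (not G) on made-up data and then checks that S is consistent, i.e. that it contains no negative example; the function name and data are assumptions.

```python
def most_specific_rectangle(X):
    # S: tightest axis-aligned rectangle enclosing every positive example.
    positives = [x for x, r in X if r == 1]
    prices = [p for p, _ in positives]
    powers = [e for _, e in positives]
    return min(prices), max(prices), min(powers), max(powers)

# Made-up training set of ((price, engine power), label) pairs.
X = [((15, 100), 1), ((18, 120), 1), ((40, 300), 0), ((5, 60), 0), ((17, 250), 0)]

p1, p2, e1, e2 = most_specific_rectangle(X)

# S is consistent if no negative example falls inside the rectangle.
consistent = all(
    not (p1 <= p <= p2 and e1 <= e <= e2) for (p, e), r in X if r == 0
)
print((p1, p2, e1, e2), consistent)
```

G would be obtained the opposite way, by growing the rectangle as far as possible without including a negative example.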
Margin
Choose h with largest margin
Noise and Model Complexity
Use the simpler one because it is
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Generalizes better (lower variance - Occam's razor)
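Multiple Classes, $C_i$, $i = 1, \ldots, K$
$X = \{x^t, r^t\}_{t=1}^{N}$
$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$
Train hypotheses $h_i(x)$, $i = 1, \ldots, K$:
$h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$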
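The label encoding above, one 0/1 target per class, can be sketched as follows. Class indices starting at 0, the function name, and the four examples are assumptions for illustration; column i of the result is the target vector on which hypothesis h_i would be trained.

```python
import numpy as np

def one_vs_all_labels(classes, K):
    # r[t, i] = 1 if example t belongs to class C_i, else 0.
    r = np.zeros((len(classes), K), dtype=int)
    r[np.arange(len(classes)), classes] = 1
    return r

classes = np.array([0, 2, 1, 2])      # made-up class indices for 4 examples
r = one_vs_all_labels(classes, K=3)
print(r)
```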
Regression
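$X = \{x^t, r^t\}_{t=1}^{N}$, with $r^t \in \mathbb{R}$
$r^t = f(x^t) + \varepsilon$
$E(g \mid X) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - g(x^t) \right]^2$
$g(x) = w_1 x + w_0$
$g(x) = w_2 x^2 + w_1 x + w_0$
$E(w_1, w_0 \mid X) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - (w_1 x^t + w_0) \right]^2$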
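For the linear model, the error E(w1, w0 | X) has a well-known closed-form minimizer obtained by setting both partial derivatives to zero. The sketch below uses that standard result; the function names and the data are illustrative assumptions.

```python
import numpy as np

def fit_line(x, r):
    # Minimizes E(w1, w0 | X) = (1/N) * sum_t (r^t - (w1 * x^t + w0))^2.
    x_bar, r_bar = x.mean(), r.mean()
    w1 = np.sum((x - x_bar) * (r - r_bar)) / np.sum((x - x_bar) ** 2)
    w0 = r_bar - w1 * x_bar
    return w1, w0

def error(w1, w0, x, r):
    # Mean squared error of the line g(x) = w1 * x + w0.
    return np.mean((r - (w1 * x + w0)) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.0, 3.0, 5.0, 7.0])     # made-up data on r = 2x + 1
w1, w0 = fit_line(x, r)
print(w1, w0, error(w1, w0, x, r))
```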
Cross-Validation
To estimate generalization error, we need data unseen during
training. We split the data as
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Resampling when there is little data
Textbook/ Reference Materials
Introduction to Machine Learning by Ethem Alpaydin
Machine Learning: An Algorithmic Perspective by Stephen Marsland
Pattern Recognition and Machine Learning by Christopher M. Bishop