Machine learning: lecture 3
Tommi S. Jaakkola
MIT AI Lab
Topics
• Linear regression
– overfitting, cross-validation
• Additive models
– polynomial regression, other basis functions
• Statistical view of regression
– noise model
– likelihood, maximum likelihood estimation
– limitations
Review: generalization
• The “generalization” error
$E_{(x,y)\sim P}\,(y - \hat{w}_0 - \hat{w}_1 x)^2$
is a sum of two terms:
1. error of the best predictor in the class
$E_{(x,y)\sim P}\,(y - w_0^* - w_1^* x)^2 = \min_{w_0, w_1} E_{(x,y)\sim P}\,(y - w_0 - w_1 x)^2$
2. and how well we approximate the best linear predictor
based on a limited training set
$E_{(x,y)\sim P}\left\{ (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \right\}^2$
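A brief sketch (not from the slide) of why the two terms add with no cross term, treating $\hat{w}_0, \hat{w}_1$ as fixed given the training set: writing $y - \hat{w}_0 - \hat{w}_1 x = (y - w_0^* - w_1^* x) + \big[(w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x)\big]$ and expanding the square gives the two terms above plus the cross term

$2\, E_{(x,y)\sim P}\Big[ (y - w_0^* - w_1^* x)\,\big( (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \big) \Big]$

which vanishes: the second factor is an affine function of $x$, and the normal equations defining $(w_0^*, w_1^*)$ make the residual $y - w_0^* - w_1^* x$ uncorrelated under $P$ with $1$ and $x$, hence with any affine function of $x$.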
Overfitting
• With too few training examples our linear regression model
may achieve zero training error but nevertheless has a large
generalization error
[Figure: linear fit to a small training set, x ∈ [−2, 2]]
When the training error no longer bears any relation to the
generalization error, the model overfits the data
Cross-validation
• Cross-validation allows us to estimate the generalization error
on the basis of the training set alone
For example, the leave-one-out cross-validation error is given by

$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{w}_0^{-i} + \hat{w}_1^{-i} x_i) \right)^2$

where $(\hat{w}_0^{-i}, \hat{w}_1^{-i})$ are the least squares estimates computed
without the $i$th training example.
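This definition translates directly into a short computation; below is a minimal sketch for simple linear regression (the function name and synthetic data are illustrative, not from the lecture):

```python
# A minimal sketch of leave-one-out cross-validation for simple linear
# regression; the function name and synthetic data are illustrative.
import numpy as np

def loo_cv_error(x, y):
    """Leave-one-out CV estimate of the squared prediction error of y ~ w0 + w1*x."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                          # drop the i-th example
        X = np.column_stack([np.ones(n - 1), x[mask]])    # design matrix [1, x]
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)   # least squares fit without i
        pred = w[0] + w[1] * x[i]                         # predict the held-out point
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=10)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=10)
print(loo_cv_error(x, y))
```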
Extensions of linear regression: additive models
• Our previous results generalize to models that are linear in
the parameters w, not necessarily in the inputs x
1. Simple linear prediction $f: \mathbb{R} \rightarrow \mathbb{R}$
$f(x; w) = w_0 + w_1 x$
2. $m$th order polynomial prediction $f: \mathbb{R} \rightarrow \mathbb{R}$
$f(x; w) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$
3. Multi-dimensional linear prediction $f: \mathbb{R}^d \rightarrow \mathbb{R}$
$f(x; w) = w_0 + w_1 x_1 + \ldots + w_{d-1} x_{d-1} + w_d x_d$
where $x = [x_1 \ldots x_{d-1}\, x_d]^T$, $d = \dim(x)$
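Each of these cases reduces to ordinary least squares once the inputs are expanded into a feature (design) matrix; the sketch below shows the polynomial case (function names and synthetic data are illustrative, not from the lecture):

```python
# A minimal sketch: m-th order polynomial regression as linear least squares
# on the expanded features [1, x, x^2, ..., x^m]; names and data are illustrative.
import numpy as np

def fit_polynomial(x, y, m):
    """Return w = (w0, ..., wm) minimizing the sum of squared errors."""
    X = np.column_stack([x ** k for k in range(m + 1)])   # design matrix
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_polynomial(x, w):
    X = np.column_stack([x ** k for k in range(len(w))])
    return X @ w

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=20)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0.0, 0.5, size=20)
w = fit_polynomial(x, y, m=3)
print(predict_polynomial(x, w)[:3])
```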
Polynomial regression: example
[Figure: least squares polynomial fits of degree 1, 3, 5, and 7 to the same training data, x ∈ [−2, 2]]
Polynomial regression: example cont’d
[Figure: the same fits with their leave-one-out cross-validation errors: degree = 1, CV = 1.1; degree = 3, CV = 2.6; degree = 5, CV = 44.2; degree = 7, CV = 482.0]
Additive models cont’d
• More generally, predictions are based on a linear combination
of basis functions (features) $\{\phi_1(x), \ldots, \phi_m(x)\}$, where each
$\phi_i(x): \mathbb{R}^d \rightarrow \mathbb{R}$, and
$f(x; w) = w_0 + w_1 \phi_1(x) + \ldots + w_{m-1} \phi_{m-1}(x) + w_m \phi_m(x)$
• For example:
If $\phi_i(x) = x^i$, $i = 1, \ldots, m$, then
$f(x; w) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$
If $m = d$, $\phi_i(x) = x_i$, $i = 1, \ldots, d$, then
$f(x; w) = w_0 + w_1 x_1 + \ldots + w_{d-1} x_{d-1} + w_d x_d$
Additive models cont’d
• Example: it is often useful to find "prototypical" input
vectors $\mu_1, \ldots, \mu_m$ that exemplify different "contexts" for
prediction
We can define basis functions (one for each prototype) that
measure how close the input vector $x$ is to the prototype:
$\phi_k(x) = \exp\left\{ -\frac{1}{2} \| x - \mu_k \|^2 \right\}$
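These prototype features plug into the same least-squares machinery as before; the following is a minimal sketch (the prototype locations, data, and function names are illustrative, not from the lecture):

```python
# A minimal sketch of Gaussian basis functions centered at prototype vectors
# mu_1, ..., mu_m; the prototypes and data below are illustrative.
import numpy as np

def rbf_features(X, prototypes):
    """phi_k(x) = exp(-0.5 * ||x - mu_k||^2) for every row x of X and every mu_k."""
    sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))              # 1-d inputs as a column
prototypes = np.array([[-1.0], [0.0], [1.0]])     # three illustrative prototypes
Phi = np.column_stack([np.ones(len(X)), rbf_features(X, prototypes)])
y = np.sin(2.0 * X[:, 0]) + rng.normal(0.0, 0.1, size=50)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # same least squares fit as before
```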
Additive models cont’d
• The basis functions can capture various (e.g., qualitative)
properties of the inputs.
For example: we can try to rate companies based on text
descriptions
x = text document (string of words)
$\phi_i(x) = \begin{cases} 1 & \text{if word } i \text{ appears in the document} \\ 0 & \text{otherwise} \end{cases}$
$f(x; w) = w_0 + \sum_{i \in \text{words}} w_i \phi_i(x)$
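A minimal sketch of these binary word-indicator features and the resulting additive score (the vocabulary and weights are made up for illustration):

```python
# A minimal sketch of binary word-indicator features and the additive score
# f(x; w) = w0 + sum_i w_i * phi_i(x); the vocabulary and weights are made up.
vocabulary = ["profit", "loss", "growth", "lawsuit"]

def features(document):
    """phi_i(x) = 1 if word i appears in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1.0 if word in words else 0.0 for word in vocabulary]

def score(document, w0, w):
    return w0 + sum(wi * fi for wi, fi in zip(w, features(document)))

w0, w = 0.0, [1.0, -1.0, 0.5, -2.0]   # illustrative weights
print(score("strong growth and record profit this quarter", w0, w))
```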
Additive models cont’d
• Graphical representation of additive models (cf. neural
networks):
[Diagram: inputs $x_1, x_2, \ldots$ feed into basis functions $\phi_1(x), \ldots, \phi_m(x)$, whose outputs are combined with weights $w_1, \ldots, w_m$ (plus a bias $w_0$ attached to a constant input 1) to produce $f(x; w)$]
Statistical view of linear regression
• A statistical regression model
Observed output = function + noise
$y = f(x; w) + \epsilon$
where, e.g., $\epsilon \sim N(0, \sigma^2)$.
• Whatever we cannot capture with our chosen family of
functions will be interpreted as noise
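As a quick illustration of this generative view, data can be simulated from the model; a minimal sketch with an illustrative linear $f$ and noise level:

```python
# A minimal sketch: sampling (x, y) pairs from y = f(x; w) + eps with
# eps ~ N(0, sigma^2); the linear f and noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
w0, w1, sigma = 1.0, 0.5, 0.3
x = rng.uniform(-2, 2, size=100)
eps = rng.normal(0.0, sigma, size=100)   # Gaussian noise
y = w0 + w1 * x + eps                    # observed outputs
```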
Statistical view of linear regression
• Our function f (x; w) here is trying to capture the mean of
the observations y given a specific input x:
$E\{\, y \mid x \,\} = f(x; w)$
The expectation is taken with respect to P that governs the
underlying (and typically unknown) relation between x and
y.
Statistical view of linear regression
• According to our statistical model
$y = f(x; w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
the outputs y given x are normally distributed with mean
$f(x; w)$ and variance $\sigma^2$:
$P(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - f(x; w))^2 \right\}$
• As a result we can also measure the uncertainty in the
predictions (through the variance $\sigma^2$), not just the mean
• Loss function? Estimation?
Maximum likelihood estimation
• Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we find the
parameters $w$ that maximize the likelihood of the observed
outputs

$L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$
[Figure: a candidate line that fits the training data poorly]
Why is this a bad fit according to the likelihood criterion?
Maximum likelihood estimation
Likelihood of the observed outputs:

$L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

• It is often easier (and equivalent) to try to maximize the
log-likelihood:

$l(D; w, \sigma^2) = \log L(D; w, \sigma^2) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w, \sigma^2)$
$= \sum_{i=1}^{n} \left[ -\frac{1}{2\sigma^2} (y_i - f(x_i; w))^2 - \log \sqrt{2\pi\sigma^2} \right]$
$= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 - \frac{n}{2} \log(2\pi\sigma^2)$
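The final expression is easy to evaluate for any fitted model; a minimal sketch for the simple linear case (names and data are illustrative, not from the lecture):

```python
# A minimal sketch: evaluating the Gaussian log-likelihood of a fitted simple
# linear model; the data and variable names are illustrative.
import numpy as np

def log_likelihood(x, y, w0, w1, sigma2):
    residuals = y - (w0 + w1 * x)
    n = len(y)
    return -np.sum(residuals ** 2) / (2.0 * sigma2) - 0.5 * n * np.log(2.0 * np.pi * sigma2)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, size=30)
X = np.column_stack([np.ones(30), x])
(w0_hat, w1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares = ML fit of w
print(log_likelihood(x, y, w0_hat, w1_hat, sigma2=0.09))
```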
Maximum likelihood estimation cont’d
• The noise distribution and the loss-function are intricately
related
$\mathrm{Loss}(y, f(x; w)) = -\log P(y \mid x, w, \sigma^2) + \mathrm{const.}$
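Spelling this relation out for the Gaussian noise model above (a brief sketch):

$-\log P(y \mid x, w, \sigma^2) = \frac{1}{2\sigma^2}\,(y - f(x; w))^2 + \frac{1}{2}\log(2\pi\sigma^2)$

so, for a fixed $\sigma^2$, the negative log-likelihood is the squared-error loss up to a positive scale factor $1/(2\sigma^2)$ and an additive constant; a different noise distribution would induce a different loss function in the same way.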
Maximum likelihood estimation cont’d
• The likelihood of the observed outputs

$L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

provides a general measure of how well the model fits the data.
On the basis of this measure, we can estimate the noise
variance $\sigma^2$ as well as the weights $w$.
Can we find a rationale for what the “optimal” noise variance
should be?
Maximum likelihood estimation cont’d
• To estimate the parameters $w$ and $\sigma^2$ quantitatively, we
maximize the log-likelihood with respect to all the parameters:

$\frac{\partial}{\partial w}\, l(D; w, \sigma^2) = 0, \qquad \frac{\partial}{\partial \sigma^2}\, l(D; w, \sigma^2) = 0$

The resulting noise variance $\hat{\sigma}^2$ is given by

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2$
where ŵ is the same ML estimate of w as before.
Interpretation: this is the mean squared prediction error (on
the training set) of the best linear predictor.
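In code this is simply the mean squared training residual of the fitted model; a minimal sketch (the function name is illustrative):

```python
# A minimal sketch: the ML noise variance is the mean squared training
# residual of the fitted model; the function name is illustrative.
import numpy as np

def sigma2_mle(x, y, w0_hat, w1_hat):
    residuals = y - (w0_hat + w1_hat * x)
    return np.mean(residuals ** 2)
```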
Brief derivation
Consider the log-likelihood evaluated at ŵ
$l(D; \hat{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2 - \frac{n}{2} \log(2\pi\sigma^2)$
(need to justify first that we can simply substitute in the ML
solution ŵ rather than perform joint maximization)
Now,
$\frac{\partial}{\partial \sigma^2}\, l(D; \hat{w}, \sigma^2) = \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2 - \frac{n}{2\sigma^2} = 0$
and we get the solution by multiplying both sides by $2\sigma^4/n$.
Cross-validation and log-likelihood
Leave-one-out cross-validated log-likelihood:
$\mathrm{CV} = \sum_{i=1}^{n} \log P(y_i \mid x_i, \hat{w}^{-i}, (\hat{\sigma}^2)^{-i})$

where $\hat{w}^{-i}$ and $(\hat{\sigma}^2)^{-i}$ are maximum likelihood estimates
computed without the $i$th training example $(x_i, y_i)$.
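A minimal sketch that combines the leave-one-out loop with the Gaussian log-likelihood (names are illustrative, not lecture code):

```python
# A minimal sketch of the leave-one-out cross-validated log-likelihood for
# simple linear regression; names are illustrative, not lecture code.
import numpy as np

def loo_cv_log_likelihood(x, y):
    n = len(x)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        X = np.column_stack([np.ones(n - 1), x[mask]])
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)   # ML (least squares) fit of w
        sigma2 = np.mean((y[mask] - X @ w) ** 2)          # ML noise variance without i
        mean_i = w[0] + w[1] * x[i]
        total += -0.5 * (y[i] - mean_i) ** 2 / sigma2 \
                 - 0.5 * np.log(2.0 * np.pi * sigma2)
    return total
```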
Some limitations
• The simple statistical model
$y = f(x; w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
is not always appropriate or useful.
Example: noise may not be Gaussian
[Figure: example data (left) and the corresponding non-Gaussian noise distribution (right)]
Limitations cont’d
• It may not even be possible (or at all useful) to model the
data with
$y = f(x; w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
no matter how flexible the function class $f(\cdot\,; w)$, $w \in W$, is.
Example:
[Figure: example data that cannot be captured by $y = f(x; w) + \epsilon$ for any $f$]
(note: this is NOT a limitation of conditional models $P(y \mid x, w)$
more generally)