Machine learning: lecture 3

Tommi S. Jaakkola
MIT AI Lab
Topics
• Linear regression
– overfitting, cross-validation
• Additive models
– polynomial regression, other basis functions
• Statistical view of regression
– noise model
– likelihood, maximum likelihood estimation
– limitations



Review: generalization
• The “generalization” error

  $E_{(x,y)\sim P}\,\big(y - \hat{w}_0 - \hat{w}_1 x\big)^2$

  is a sum of two terms:


1. error of the best predictor in the class

   $E_{(x,y)\sim P}\,\big(y - w_0^* - w_1^* x\big)^2 \;=\; \min_{w_0, w_1}\, E_{(x,y)\sim P}\,\big(y - w_0 - w_1 x\big)^2$

2. and how well we approximate the best linear predictor based on a limited training set

   $E_{(x,y)\sim P}\,\big\{ (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \big\}^2$
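Both terms can be estimated by simulation. The sketch below is not from the lecture; it assumes a synthetic distribution P (a nonlinear mean plus Gaussian noise) and uses numpy purely for illustration.

```python
# Monte Carlo sketch (assumed synthetic P, not from the lecture):
# estimate (1) the error of the best linear predictor and
# (2) the excess error of a fit from a small training set.
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n):
    # assumed ground truth: nonlinear mean plus Gaussian noise
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)
    return x, y

# best linear predictor in the class, approximated on a very large sample
x_big, y_big = sample_xy(200_000)
w1_star, w0_star = np.polyfit(x_big, y_big, 1)

# least squares estimates (w0_hat, w1_hat) from a small training set
x_tr, y_tr = sample_xy(10)
w1_hat, w0_hat = np.polyfit(x_tr, y_tr, 1)

# evaluate both terms on fresh samples from P
x_te, y_te = sample_xy(200_000)
best_error = np.mean((y_te - w0_star - w1_star * x_te) ** 2)
excess = np.mean(((w0_star + w1_star * x_te) - (w0_hat + w1_hat * x_te)) ** 2)
print("error of best linear predictor:", best_error)
print("excess error from limited training data:", excess)
```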



Overfitting
• With too few training examples our linear regression model may achieve zero training error but nevertheless have a large generalization error
  [Figure: a linear fit to a small training set (points marked “x”); the fit attains zero training error; x ∈ [−2, 2], y ∈ [−6, 6]]

  When the training error no longer bears any relation to the generalization error, the model overfits the data.



Cross-validation
• Cross-validation allows us to estimate generalization error on
the basis of only the training set
For example, the leave-one-out cross-validation error is given
by
  $\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \big( \hat{w}_0^{-i} + \hat{w}_1^{-i} x_i \big) \Big)^2$

  where $(\hat{w}_0^{-i}, \hat{w}_1^{-i})$ are the least squares estimates computed without the $i$th training example.
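A minimal sketch of this computation, assuming numpy arrays x and y and plain least squares (the helper name is illustrative, not from the lecture):

```python
import numpy as np

def loo_cv_error(x, y):
    """Leave-one-out CV error for the simple linear model y ~ w0 + w1*x."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                   # hold out the i-th example
        w1, w0 = np.polyfit(x[mask], y[mask], 1)   # least squares without it
        errors.append((y[i] - (w0 + w1 * x[i])) ** 2)
    return np.mean(errors)
```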



Extensions of linear regression: additive models
• Our previous results generalize to models that are linear in
the parameters w, not necessarily in the inputs x
1. Simple linear prediction $f : \mathbb{R} \to \mathbb{R}$

   $f(x; w) = w_0 + w_1 x$

2. $m$th order polynomial prediction $f : \mathbb{R} \to \mathbb{R}$

   $f(x; w) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$

3. Multi-dimensional linear prediction $f : \mathbb{R}^d \to \mathbb{R}$

   $f(x; w) = w_0 + w_1 x_1 + \ldots + w_{d-1} x_{d-1} + w_d x_d$

   where $x = [x_1 \ \ldots \ x_{d-1} \ x_d]^T$, $d = \dim(x)$
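In all three cases the least squares solution can be computed once a feature (design) matrix is built. A sketch under assumed numpy conventions (function names are illustrative, not from the lecture):

```python
import numpy as np

def fit_polynomial(x, y, m):
    # columns 1, x, x^2, ..., x^m  ->  weights [w0, w1, ..., wm]
    Phi = np.vander(x, N=m + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def fit_multidim(X, y):
    # X is n x d; prepend a column of ones to absorb the bias w0
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w                                        # [w0, w1, ..., wd]
```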



Polynomial regression: example
[Figure: least squares polynomial fits of degree 1, 3, 5, and 7 to the same data; x ∈ [−2, 2]]



Polynomial regression: example cont’d
[Figure: the same polynomial fits with their leave-one-out cross-validation errors: degree = 1, CV = 1.1; degree = 3, CV = 2.6; degree = 5, CV = 44.2; degree = 7, CV = 482.0]
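Numbers of this kind can be reproduced with a short loop. The sketch below assumes numpy training arrays x, y and reuses numpy's polynomial least squares fit (names are illustrative, not the lecture's code):

```python
import numpy as np

def loo_cv_poly(x, y, degree):
    """Leave-one-out CV error of a polynomial fit of the given degree."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        coeffs = np.polyfit(x[mask], y[mask], degree)   # highest power first
        errs.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errs)

# for degree in (1, 3, 5, 7):
#     print(degree, loo_cv_poly(x, y, degree))
```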



Additive models cont’d
• More generally, predictions are based on a linear combination
  of basis functions (features) $\{\phi_1(x), \ldots, \phi_m(x)\}$, where each
  $\phi_i : \mathbb{R}^d \to \mathbb{R}$, and

  $f(x; w) = w_0 + w_1 \phi_1(x) + \ldots + w_{m-1} \phi_{m-1}(x) + w_m \phi_m(x)$

• For example:

  If $\phi_i(x) = x^i$, $i = 1, \ldots, m$, then

  $f(x; w) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$

  If $m = d$ and $\phi_i(x) = x_i$, $i = 1, \ldots, d$, then

  $f(x; w) = w_0 + w_1 x_1 + \ldots + w_{d-1} x_{d-1} + w_d x_d$



Additive models cont’d
• Example: it is often useful to find “prototypical” input
  vectors $\mu_1, \ldots, \mu_m$ that exemplify different “contexts” for
  prediction

  We can define basis functions (one for each prototype) that
  measure how close the input vector $x$ is to the prototype:

  $\phi_k(x) = \exp\big\{ -\tfrac{1}{2}\, \| x - \mu_k \|^2 \big\}$

  [Figure: illustration; x ∈ [−2, 2], vertical axis ∈ [−3, 3]]
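A sketch of these radial basis features in numpy (array shapes and names are assumptions, not from the lecture):

```python
import numpy as np

def rbf_features(X, prototypes):
    """phi_k(x) = exp(-||x - mu_k||^2 / 2) for every input row and prototype."""
    # X: n x d inputs, prototypes: m x d; returns an n x m feature matrix
    diffs = X[:, None, :] - prototypes[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists)

def predict(X, prototypes, w):
    # additive model f(x; w) = w0 + sum_k wk * phi_k(x)
    Phi = rbf_features(X, prototypes)
    return w[0] + Phi @ w[1:]
```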



Additive models cont’d
• The basis functions can capture various (e.g., qualitative)
properties of the inputs.
For example: we can try to rate companies based on text
descriptions

x = text document (string of words)



  $\phi_i(x) = \begin{cases} 1 & \text{if word } i \text{ appears in the document} \\ 0 & \text{otherwise} \end{cases}$

  $f(x; w) = w_0 + \sum_{i \in \text{words}} w_i\, \phi_i(x)$
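A toy sketch of such word-presence features; the vocabulary and weights below are made up purely for illustration:

```python
def word_features(document, vocabulary):
    """Binary indicator phi_i(x): 1 if word i appears in the document, else 0."""
    words = set(document.lower().split())
    return [1.0 if word in words else 0.0 for word in vocabulary]

def score(document, vocabulary, w0, weights):
    # additive model f(x; w) = w0 + sum_i wi * phi_i(x)
    phi = word_features(document, vocabulary)
    return w0 + sum(wi * fi for wi, fi in zip(weights, phi))

# toy rating of a company description
vocab = ["profit", "growth", "lawsuit"]
print(score("strong growth and record profit this year", vocab,
            w0=0.0, weights=[1.0, 0.5, -2.0]))        # -> 1.5
```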



Additive models cont’d
• Graphical representation of additive models (cf. neural
networks):
  [Figure: network diagram with inputs x_1, x_2, ..., basis-function units φ_1(x), ..., φ_m(x), an output unit computing f(x; w) from weights w_1, ..., w_m, and a constant input 1 carrying the bias weight w_0]



Statistical view of linear regression
• A statistical regression model

  Observed output = function + noise

  $y = f(x; w) + \epsilon$

  where, e.g., $\epsilon \sim N(0, \sigma^2)$.


• Whatever we cannot capture with our chosen family of
functions will be interpreted as noise



Statistical view of linear regression
• Our function f (x; w) here is trying to capture the mean of
the observations y given a specific input x:

E{ y | x } = f (x; w)

  The expectation is taken with respect to P that governs the
  underlying (and typically unknown) relation between x and y.

  [Figure: data (x, y) with x ∈ [−2, 2], y ∈ [−5, 5]]



Statistical view of linear regression
• According to our statistical model

  $y = f(x; w) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$

  the outputs y given x are normally distributed with mean
  f(x; w) and variance σ²:

  $P(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\Big\{ -\frac{1}{2\sigma^2}\, \big(y - f(x; w)\big)^2 \Big\}$

• As a result we can also measure the uncertainty in the predictions (through variance σ²), not just the mean
• Loss function? Estimation?
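To make the uncertainty point concrete, here is a small sketch assuming an already fitted model w_hat and noise variance sigma2_hat (all names are illustrative, not from the lecture): the Gaussian noise model gives a full predictive density and a rough predictive interval, not just a point prediction.

```python
import numpy as np

def predictive_density(y, x, f, w_hat, sigma2_hat):
    """P(y | x, w_hat, sigma2_hat) under the Gaussian noise model."""
    mean = f(x, w_hat)
    return np.exp(-(y - mean) ** 2 / (2 * sigma2_hat)) / np.sqrt(2 * np.pi * sigma2_hat)

def predictive_interval(x, f, w_hat, sigma2_hat, num_sd=2.0):
    """Mean prediction plus/minus num_sd noise standard deviations."""
    mean = f(x, w_hat)
    half_width = num_sd * np.sqrt(sigma2_hat)
    return mean - half_width, mean + half_width
```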



Maximum likelihood estimation
• Given observations D = {(x1, y1), . . . , (xn, yn)} we find the
parameters w that maximize the likelihood of the observed
outputs
  $L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

  [Figure: a candidate linear fit to the data; x ∈ [−2, 2]]

Why is this a bad fit according to the likelihood criterion?



Maximum likelihood estimation
Likelihood of the observed outputs:
  $L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

• It is often easier (and equivalent) to try to maximize the
  log-likelihood:

  $\ell(D; w, \sigma^2) = \log L(D; w, \sigma^2) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w, \sigma^2)$

  $\qquad = \sum_{i=1}^{n} \Big( -\frac{1}{2\sigma^2}\, (y_i - f(x_i; w))^2 - \log \sqrt{2\pi\sigma^2} \Big)$

  $\qquad = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 \;-\; \frac{n}{2}\, \log(2\pi\sigma^2)$



Maximum likelihood estimation cont’d
• The noise distribution and the loss-function are intricately
related

Loss(y, f (x; w)) = − log P (y|x, w, σ 2) + const.
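For the Gaussian noise model above, writing out the negative log-density (a small worked step added here for clarity, not in the original slide) makes the connection explicit: for fixed σ² the induced loss is squared error, up to scaling and an additive constant.

```latex
-\log P(y \mid x, w, \sigma^2)
  = \frac{1}{2\sigma^2}\,\bigl(y - f(x; w)\bigr)^2
  + \frac{1}{2}\,\log\!\left(2\pi\sigma^2\right)
```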



Maximum likelihood estimation cont’d
• The likelihood of the observed outputs
  $L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

  provides a general measure of how the model fits the data.
  On the basis of this measure, we can estimate the noise
  variance σ² as well as the weights w.

  Can we find a rationale for what the “optimal” noise variance
  should be?



Maximum likelihood estimation cont’d
• To estimate the parameters w and σ 2 quantitatively, we
maximize the log-likelihood with respect to all the parameters


  $\frac{\partial}{\partial w}\, \ell(D; w, \sigma^2) = 0, \qquad \frac{\partial}{\partial \sigma^2}\, \ell(D; w, \sigma^2) = 0$

  The resulting noise variance $\hat{\sigma}^2$ is given by

  $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i; \hat{w}) \big)^2$
where ŵ is the same ML estimate of w as before.
Interpretation: this is the mean squared prediction error (on
the training set) of the best linear predictor.
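A compact sketch of both estimates, assuming a design matrix Phi whose first column is all ones (names and conventions are assumptions, not the lecture's code):

```python
import numpy as np

def ml_estimates(Phi, y):
    """ML estimates under the Gaussian noise model: least squares w_hat,
    and sigma2_hat equal to the mean squared training residual."""
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residuals = y - Phi @ w_hat
    sigma2_hat = np.mean(residuals ** 2)
    return w_hat, sigma2_hat
```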



Brief derivation
Consider the log-likelihood evaluated at ŵ:

  $\ell(D; \hat{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2 - \frac{n}{2}\, \log(2\pi\sigma^2)$

(need to justify first that we can simply substitute in the ML
solution ŵ rather than perform joint maximization)

Now,

  $\frac{\partial}{\partial \sigma^2}\, \ell(D; \hat{w}, \sigma^2) = \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2 - \frac{n}{2\sigma^2} = 0$

and we get the solution by multiplying both sides by $2\sigma^4/n$, which gives $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2$.



Cross-validation and log-likelihood
Leave-one-out cross-validated log-likelihood:
  $\mathrm{CV} = \sum_{i=1}^{n} \log P\big( y_i \mid x_i, \hat{w}^{-i}, (\hat{\sigma}^2)^{-i} \big)$

  where $\hat{w}^{-i}$ and $(\hat{\sigma}^2)^{-i}$ are the maximum likelihood estimates computed without the $i$th training example $(x_i, y_i)$.
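A sketch of this criterion under the same assumed design-matrix conventions as above (not from the lecture):

```python
import numpy as np

def loo_cv_loglik(Phi, y):
    """Leave-one-out cross-validated log-likelihood under Gaussian noise."""
    n = len(y)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        w_i, *_ = np.linalg.lstsq(Phi[mask], y[mask], rcond=None)   # w_hat^{-i}
        sigma2_i = np.mean((y[mask] - Phi[mask] @ w_i) ** 2)        # (sigma_hat^2)^{-i}
        resid = y[i] - Phi[i] @ w_i
        total += -0.5 * resid ** 2 / sigma2_i - 0.5 * np.log(2 * np.pi * sigma2_i)
    return total
```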



Some limitations
• The simple statistical model

  $y = f(x; w) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$

is not always appropriate or useful.


Example: noise may not be Gaussian
  [Figure: left, data (x, y) with x ∈ [−2, 2], y ∈ [−5, 5]; right, a non-Gaussian noise distribution over [−5, 5]]



Limitations cont’d
• It may not even be possible (or at all useful) to model the
data with
  $y = f(x; w) + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$
no matter how flexible the function class f (·; w), w ∈ W is.
Example:
  [Figure: a data set (x, y), with x and y in [−1.5, 1.5], that cannot be captured by any single function of x plus Gaussian noise]

  (note: this is NOT a limitation of conditional models P(y|x, w) more generally)
