Machine learning: lecture 3
Tommi S. Jaakkola
MIT AI Lab
Topics
• Linear regression
– overfitting, cross-validation
• Additive models
– polynomial regression, other basis functions
• Statistical view of regression
– noise model
– likelihood, maximum likelihood estimation
– limitations
Review: generalization
• The “generalization” error
$E_{(x,y)\sim P}\,(y - \hat{w}_0 - \hat{w}_1 x)^2$
is a sum of two terms:
1. error of the best predictor in the class
$E_{(x,y)\sim P}\,(y - w_0^* - w_1^* x)^2 = \min_{w_0, w_1} E_{(x,y)\sim P}\,(y - w_0 - w_1 x)^2$
2. and how well we approximate the best linear predictor
based on a limited training set
$E_{(x,y)\sim P}\left\{ (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \right\}^2$
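A brief sketch (not from the slide) of why the two terms add with no cross term, treating $\hat{w}_0, \hat{w}_1$ as fixed given the training set: writing $y - \hat{w}_0 - \hat{w}_1 x = (y - w_0^* - w_1^* x) + \big[(w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x)\big]$ and expanding the square gives the two terms above plus the cross term

$2\, E_{(x,y)\sim P}\Big[ (y - w_0^* - w_1^* x)\,\big( (w_0^* + w_1^* x) - (\hat{w}_0 + \hat{w}_1 x) \big) \Big]$

which vanishes: the second factor is an affine function of $x$, and the normal equations defining $(w_0^*, w_1^*)$ make the residual $y - w_0^* - w_1^* x$ uncorrelated under $P$ with $1$ and $x$, hence with any affine function of $x$.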
Overfitting
• With too few training examples our linear regression model
may achieve zero training error but nevertheless has a large
generalization error
[Figure: linear fit to a small training set, x ∈ [−2, 2]]
When the training error no longer bears any relation to the
generalization error, the model overfits the data
Cross-validation
• Cross-validation allows us to estimate the generalization error
on the basis of the training set alone
For example, the leave-one-out cross-validation error is given by

$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{w}_0^{-i} + \hat{w}_1^{-i} x_i) \right)^2$

where $(\hat{w}_0^{-i}, \hat{w}_1^{-i})$ are the least squares estimates computed
without the $i$th training example.
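This definition translates directly into a short computation; below is a minimal sketch for simple linear regression (the function name and synthetic data are illustrative, not from the lecture):

```python
# A minimal sketch of leave-one-out cross-validation for simple linear
# regression; the function name and synthetic data are illustrative.
import numpy as np

def loo_cv_error(x, y):
    """Leave-one-out CV estimate of the squared prediction error of y ~ w0 + w1*x."""
    n = len(x)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                          # drop the i-th example
        X = np.column_stack([np.ones(n - 1), x[mask]])    # design matrix [1, x]
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)   # least squares fit without i
        pred = w[0] + w[1] * x[i]                         # predict the held-out point
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=10)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=10)
print(loo_cv_error(x, y))
```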
Extensions of linear regression: additive models
• Our previous results generalize to models that are linear in
the parameters w, not necessarily in the inputs x
1. Simple linear prediction $f: \mathbb{R} \rightarrow \mathbb{R}$
$f(x; w) = w_0 + w_1 x$
2. $m$th order polynomial prediction $f: \mathbb{R} \rightarrow \mathbb{R}$
$f(x; w) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$
3. Multi-dimensional linear prediction $f: \mathbb{R}^d \rightarrow \mathbb{R}$
$f(x; w) = w_0 + w_1 x_1 + \ldots + w_{d-1} x_{d-1} + w_d x_d$
where $x = [x_1 \ldots x_{d-1}\, x_d]^T$, $d = \dim(x)$
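Each of these cases reduces to ordinary least squares once the inputs are expanded into a feature (design) matrix; the sketch below shows the polynomial case (function names and synthetic data are illustrative, not from the lecture):

```python
# A minimal sketch: m-th order polynomial regression as linear least squares
# on the expanded features [1, x, x^2, ..., x^m]; names and data are illustrative.
import numpy as np

def fit_polynomial(x, y, m):
    """Return w = (w0, ..., wm) minimizing the sum of squared errors."""
    X = np.column_stack([x ** k for k in range(m + 1)])   # design matrix
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_polynomial(x, w):
    X = np.column_stack([x ** k for k in range(len(w))])
    return X @ w

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=20)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0.0, 0.5, size=20)
w = fit_polynomial(x, y, m=3)
print(predict_polynomial(x, w)[:3])
```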
Polynomial regression: example
[Figure: least squares polynomial fits of degree 1, 3, 5, and 7 to the same training data, x ∈ [−2, 2]]
Polynomial regression: example cont’d
[Figure: the same fits with their leave-one-out cross-validation errors: degree = 1, CV = 1.1; degree = 3, CV = 2.6; degree = 5, CV = 44.2; degree = 7, CV = 482.0]
Additive models cont’d
• More generally, predictions are based on a linear combination
of basis functions (features) $\{\phi_1(x), \ldots, \phi_m(x)\}$, where each
$\phi_i(x): \mathbb{R}^d \rightarrow \mathbb{R}$, and
$f(x; w) = w_0 + w_1 \phi_1(x) + \ldots + w_{m-1} \phi_{m-1}(x) + w_m \phi_m(x)$
• For example:
If $\phi_i(x) = x^i$, $i = 1, \ldots, m$, then
$f(x; w) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$
If $m = d$, $\phi_i(x) = x_i$, $i = 1, \ldots, d$, then
$f(x; w) = w_0 + w_1 x_1 + \ldots + w_{d-1} x_{d-1} + w_d x_d$
Additive models cont’d
• Example: it is often useful to find "prototypical" input
vectors $\mu_1, \ldots, \mu_m$ that exemplify different "contexts" for
prediction
We can define basis functions (one for each prototype) that
measure how close the input vector $x$ is to the prototype:
$\phi_k(x) = \exp\left\{ -\frac{1}{2} \| x - \mu_k \|^2 \right\}$
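These prototype features plug into the same least-squares machinery as before; the following is a minimal sketch (the prototype locations, data, and function names are illustrative, not from the lecture):

```python
# A minimal sketch of Gaussian basis functions centered at prototype vectors
# mu_1, ..., mu_m; the prototypes and data below are illustrative.
import numpy as np

def rbf_features(X, prototypes):
    """phi_k(x) = exp(-0.5 * ||x - mu_k||^2) for every row x of X and every mu_k."""
    sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))              # 1-d inputs as a column
prototypes = np.array([[-1.0], [0.0], [1.0]])     # three illustrative prototypes
Phi = np.column_stack([np.ones(len(X)), rbf_features(X, prototypes)])
y = np.sin(2.0 * X[:, 0]) + rng.normal(0.0, 0.1, size=50)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # same least squares fit as before
```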
Additive models cont’d
• The basis functions can capture various (e.g., qualitative)
properties of the inputs.
For example: we can try to rate companies based on text
descriptions
x = text document (string of words)
$\phi_i(x) = \begin{cases} 1 & \text{if word } i \text{ appears in the document} \\ 0 & \text{otherwise} \end{cases}$
$f(x; w) = w_0 + \sum_{i \in \text{words}} w_i \phi_i(x)$
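A minimal sketch of these binary word-indicator features and the resulting additive score (the vocabulary and weights are made up for illustration):

```python
# A minimal sketch of binary word-indicator features and the additive score
# f(x; w) = w0 + sum_i w_i * phi_i(x); the vocabulary and weights are made up.
vocabulary = ["profit", "loss", "growth", "lawsuit"]

def features(document):
    """phi_i(x) = 1 if word i appears in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1.0 if word in words else 0.0 for word in vocabulary]

def score(document, w0, w):
    return w0 + sum(wi * fi for wi, fi in zip(w, features(document)))

w0, w = 0.0, [1.0, -1.0, 0.5, -2.0]   # illustrative weights
print(score("strong growth and record profit this quarter", w0, w))
```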
Additive models cont’d
• Graphical representation of additive models (cf. neural
networks):
[Diagram: inputs $x_1, x_2, \ldots$ feed into basis functions $\phi_1(x), \ldots, \phi_m(x)$, whose outputs are combined with weights $w_1, \ldots, w_m$ (plus a bias $w_0$ attached to a constant input 1) to produce $f(x; w)$]
Statistical view of linear regression
• A statistical regression model
Observed output = function + noise
$y = f(x; w) + \epsilon$
where, e.g., $\epsilon \sim N(0, \sigma^2)$.
• Whatever we cannot capture with our chosen family of
functions will be interpreted as noise
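As a quick illustration of this generative view, data can be simulated from the model; a minimal sketch with an illustrative linear $f$ and noise level:

```python
# A minimal sketch: sampling (x, y) pairs from y = f(x; w) + eps with
# eps ~ N(0, sigma^2); the linear f and noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
w0, w1, sigma = 1.0, 0.5, 0.3
x = rng.uniform(-2, 2, size=100)
eps = rng.normal(0.0, sigma, size=100)   # Gaussian noise
y = w0 + w1 * x + eps                    # observed outputs
```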
Statistical view of linear regression
• Our function f (x; w) here is trying to capture the mean of
the observations y given a specific input x:
$E\{\, y \mid x \,\} = f(x; w)$
The expectation is taken with respect to P that governs the
underlying (and typically unknown) relation between x and
y.
Statistical view of linear regression
• According to our statistical model
$y = f(x; w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
the outputs y given x are normally distributed with mean
$f(x; w)$ and variance $\sigma^2$:
$P(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - f(x; w))^2 \right\}$
• As a result we can also measure the uncertainty in the
predictions (through the variance $\sigma^2$), not just the mean
• Loss function? Estimation?
Maximum likelihood estimation
• Given observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we find the
parameters $w$ that maximize the likelihood of the observed
outputs

$L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$
[Figure: a candidate line that fits the training data poorly]
Why is this a bad fit according to the likelihood criterion?
Maximum likelihood estimation
Likelihood of the observed outputs:

$L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

• It is often easier (and equivalent) to try to maximize the
log-likelihood:

$l(D; w, \sigma^2) = \log L(D; w, \sigma^2) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w, \sigma^2)$
$= \sum_{i=1}^{n} \left[ -\frac{1}{2\sigma^2} (y_i - f(x_i; w))^2 - \log \sqrt{2\pi\sigma^2} \right]$
$= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 - \frac{n}{2} \log(2\pi\sigma^2)$
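The final expression is easy to evaluate for any fitted model; a minimal sketch for the simple linear case (names and data are illustrative, not from the lecture):

```python
# A minimal sketch: evaluating the Gaussian log-likelihood of a fitted simple
# linear model; the data and variable names are illustrative.
import numpy as np

def log_likelihood(x, y, w0, w1, sigma2):
    residuals = y - (w0 + w1 * x)
    n = len(y)
    return -np.sum(residuals ** 2) / (2.0 * sigma2) - 0.5 * n * np.log(2.0 * np.pi * sigma2)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.3, size=30)
X = np.column_stack([np.ones(30), x])
(w0_hat, w1_hat), *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares = ML fit of w
print(log_likelihood(x, y, w0_hat, w1_hat, sigma2=0.09))
```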
Maximum likelihood estimation cont’d
• The noise distribution and the loss-function are intricately
related
$\mathrm{Loss}(y, f(x; w)) = -\log P(y \mid x, w, \sigma^2) + \mathrm{const.}$
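Spelling this relation out for the Gaussian noise model above (a brief sketch):

$-\log P(y \mid x, w, \sigma^2) = \frac{1}{2\sigma^2}\,(y - f(x; w))^2 + \frac{1}{2}\log(2\pi\sigma^2)$

so, for a fixed $\sigma^2$, the negative log-likelihood is the squared-error loss up to a positive scale factor $1/(2\sigma^2)$ and an additive constant; a different noise distribution would induce a different loss function in the same way.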
Maximum likelihood estimation cont’d
• The likelihood of the observed outputs

$L(D; w, \sigma^2) = \prod_{i=1}^{n} P(y_i \mid x_i, w, \sigma^2)$

provides a general measure of how well the model fits the data.
On the basis of this measure, we can estimate the noise
variance $\sigma^2$ as well as the weights $w$.
Can we find a rationale for what the “optimal” noise variance
should be?
Maximum likelihood estimation cont’d
• To estimate the parameters $w$ and $\sigma^2$ quantitatively, we
maximize the log-likelihood with respect to all the parameters:

$\frac{\partial}{\partial w}\, l(D; w, \sigma^2) = 0, \qquad \frac{\partial}{\partial \sigma^2}\, l(D; w, \sigma^2) = 0$

The resulting noise variance $\hat{\sigma}^2$ is given by

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2$
where ŵ is the same ML estimate of w as before.
Interpretation: this is the mean squared prediction error (on
the training set) of the best linear predictor.
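In code this is simply the mean squared training residual of the fitted model; a minimal sketch (the function name is illustrative):

```python
# A minimal sketch: the ML noise variance is the mean squared training
# residual of the fitted model; the function name is illustrative.
import numpy as np

def sigma2_mle(x, y, w0_hat, w1_hat):
    residuals = y - (w0_hat + w1_hat * x)
    return np.mean(residuals ** 2)
```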
Brief derivation
Consider the log-likelihood evaluated at ŵ
$l(D; \hat{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2 - \frac{n}{2} \log(2\pi\sigma^2)$
(need to justify first that we can simply substitute in the ML
solution ŵ rather than perform joint maximization)
Now,
$\frac{\partial}{\partial \sigma^2}\, l(D; \hat{w}, \sigma^2) = \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - f(x_i; \hat{w}))^2 - \frac{n}{2\sigma^2} = 0$
and we get the solution by multiplying both sides by $2\sigma^4/n$.
Cross-validation and log-likelihood
Leave-one-out cross-validated log-likelihood:
$\mathrm{CV} = \sum_{i=1}^{n} \log P(y_i \mid x_i, \hat{w}^{-i}, (\hat{\sigma}^2)^{-i})$

where $\hat{w}^{-i}$ and $(\hat{\sigma}^2)^{-i}$ are maximum likelihood estimates
computed without the $i$th training example $(x_i, y_i)$.
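A minimal sketch that combines the leave-one-out loop with the Gaussian log-likelihood (names are illustrative, not lecture code):

```python
# A minimal sketch of the leave-one-out cross-validated log-likelihood for
# simple linear regression; names are illustrative, not lecture code.
import numpy as np

def loo_cv_log_likelihood(x, y):
    n = len(x)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        X = np.column_stack([np.ones(n - 1), x[mask]])
        w, *_ = np.linalg.lstsq(X, y[mask], rcond=None)   # ML (least squares) fit of w
        sigma2 = np.mean((y[mask] - X @ w) ** 2)          # ML noise variance without i
        mean_i = w[0] + w[1] * x[i]
        total += -0.5 * (y[i] - mean_i) ** 2 / sigma2 \
                 - 0.5 * np.log(2.0 * np.pi * sigma2)
    return total
```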
Some limitations
• The simple statistical model
$y = f(x; w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
is not always appropriate or useful.
Example: noise may not be Gaussian
[Figure: example data (left) and the corresponding non-Gaussian noise distribution (right)]
Limitations cont’d
• It may not even be possible (or at all useful) to model the
data with
$y = f(x; w) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$
no matter how flexible the function class $f(\cdot\,; w)$, $w \in W$, is.
Example:
[Figure: example data that cannot be captured by $y = f(x; w) + \epsilon$ for any $f$]
(note: this is NOT a limitation of conditional models $P(y \mid x, w)$
more generally)