Notes Stat Learning
Notes Stat Learning
Mohammad Safdari
Contents
2 Linear Regression 5
2.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Least Squares Method . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Unbiased Estimators . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 t-statistic and Hypothesis tests . . . . . . . . . . . . . . . . . 9
2.1.4 R2 Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Estimating the Coefficients . . . . . . . . . . . . . . . . . . . 11
2.2.2 Standard Errors of the Coefficient Estimates . . . . . . . . . . 13
2.2.3 Estimating the Irreducible Error . . . . . . . . . . . . . . . . 16
2.2.4 F-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.5 F-Statistic versus t-statistic . . . . . . . . . . . . . . . . . . . 19
2.2.6 R2 Statistic in Multiple Linear Regression . . . . . . . . . . . 20
2.2.7 Confidence and Prediction Intervals . . . . . . . . . . . . . . 22
2.3 Further Topics in Linear Regression . . . . . . . . . . . . . . . . . . . 24
2.3.1 Polynomial Regression and Basis Functions . . . . . . . . . . 24
2.3.2 Qualitative Predictors and Dummy Variables . . . . . . . . . 25
2.3.3 Heteroscedasticity and Weighted Least Squares . . . . . . . . 26
2.3.4 Outliers and Studentized residuals . . . . . . . . . . . . . . . 27
2.3.5 Leverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.6 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . 31
i
3 Classification 33
3.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . 34
5 Nonlinear Methods 56
5.1 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Local Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
ii
Chapter 1
Introduction to Statistical
Learning
Y = f (X) + ,
(x1 , y1 ), . . . , (xn , yn )
for X, Y . Then we can employ this sample, which is also called the training data,
to estimate f using a learning method. Broadly speaking, there are two types
of learning methods for estimation of f . In parametric methods we first assume
that f has a given form with some unknown parameters; then we use the data to
estimate those parameters. For example, we may assume that f is a linear function;
then we can try to estimate the coefficients of that linear function using the data.
1
In contrast, in non-parametric methods we do not assume a predetermined form
for f .
The above process of estimating f is known as supervised learning, since
we have both the response Y and the predictor X. In supervised learning, when
the response Y is a quantitative variable (i.e. it takes numerical values), we are
dealing with a regression problem. And when the response Y is a qualitative or
categorical variable (i.e. it takes finitely many discrete values, in other words, its
values belongs to one of finitely many classes or categories), we are dealing with
a classification problem.
Sometimes there is no response variable at all, and we have only observed the
input variables. In these settings we may for example be interested in detecting
patterns in the data. This is known as unsupervised learning.
Ŷ = fˆ(X).
Then the difference between Y, Ŷ is called the error. And the difference between
f, fˆ is called reducible error, because it can be potentially improved. The other
part of the error which is due to the presence of the random noise , and cannot
be reduced by our choice of the learning method, is called the irreducible error.
We have
(x1 , y1 ), . . . , (xn , yn )
2
for X, Y , where xi s can be vector too. Suppose we use this (training) data set
in a learning method to find the estimate fˆ for f . Then the (training) Mean
Squared Error (MSE) is
n
1X
MSE := (yi − fˆ(xi ))2 .
n
i=1
However, we are interested in the test MSE, i.e. the MSE of the model when it is
applied to new test data.
Let us further assume that f (·) = F (·, θ0 ) belongs to the class of functions
F = {F (·, θ) : θ ∈ I},
which is parametrized by θ in the parameter set I. Then we use a training data set
to predict fˆ(·) = F (·, θ̂). Here θ̂ is the random variable which gives the parameter
to obtain the prediction fˆ when we use a particular training data set as input in
our learning method. We assume that θ̂, are independent, since they are functions
of two sources of randomness; is random due to unmeasured or unmeasurable
variables, but θ̂ is random due to random selection of a training data set.
To simplify the notation we set µ = E[Ŷ ] = E[fˆ(X)], i.e. µ is the average value
of Ŷ at X for different training data sets used as input in our learning method.
Then the expected test MSE (i.e. the average of MSE w.r.t. all possible training
data sets used as input in our learning method evaluated (tested) at X) can be
computed as
2
E[(Y − Ŷ )2 ] = E f (X) + − fˆ(X)
Here variance is the variance of the value of Ŷ at X, and bias is the difference
between the true value f (X) with the mean value of Ŷ at X. Hence we have shown
that the reducible error is the sum of the variance and the square of bias.
3
In other words, suppose we have a large collection of training data sets, and we
use each one of them to calculate an estimate Ŷ for Y , using our learning method.
Then this collection of estimates has a mean and a variance. The estimates are
distributed around their mean, and their variability is quantified by their variance.
The distance of the mean of all the estimates with the true value of f (X) (which is
the expected value of Y given X, since E[] = 0) is called the bias, and the variance
of the estimates is still called the variance. Therefore bias measures the average
distance that estimates have with f (X). And variance measures how much the
estimates are dispersed.
Suppose we have predicted ŷ0 = j. Then the expected test error (ETE) is
K
X
ETE = E[I(y0 6= ŷ0 )] = I(k 6= ŷ0 )P(y0 = k)
k=1
K
X
= I(k 6= j)P(Y = k|X = x0 )
k=1
X
= P(Y = k|X = x0 ) = 1 − P(Y = j|X = x0 ).
k6=j
Thus to minimize ETE we have to set ŷ0 = j such that P(Y = j|X = x0 ) has the
highest value among P(Y = k|X = x0 ) for k = 1, . . . , K. This is called the Bayes
classifier. The Bayes decision boundary is the boundary between the regions
{x0 : ŷ0 = j} in the X-space, for different j. This boundary also determines the
Bayes classifier.
4
Chapter 2
Linear Regression
Y = β0 + β1 X1 + · · · + βp Xp + .
Y = β0 + β1 X + .
The case of more than one predictor will be dealt with in multiple linear regres-
sion.
(x1 , y1 ), . . . , (xn , yn )
for X, Y . Let ŷi = β̂0 + β̂1 xi be the predicted value for Y when X = xi . Then
ei = yi − ŷi is called the ith residual. The residual sum of squares (RSS) is
n
X
RSS = e21 + ··· + e2n = (yi − β̂0 − β̂1 xi )2 .
i=1
5
To find the minimum of RSS we consider it as a function of β̂0 , β̂1 , and differ-
entiate with respect to them. We get
n
−1 ∂ X
0= RSS = (yi − β̂0 − β̂1 xi ) = n(ȳ − β̂0 − β̂1 x̄),
2 ∂ β̂0
i=1
Pn Pn
where x̄ = n1 i=1 xi and ȳ = 1
n i=1 yi are the sample means of xi , yi , respec-
tively. Hence
β̂0 + β̂1 x̄ = ȳ.
Therefore
n
−1 ∂ X
0= RSS = xi (yi − β̂0 − β̂1 xi )
2 ∂ β̂1
i=1
n
X n
X
= xi yi − nβ̂0 x̄ − β̂1 x2i
i=1 i=1
n
X n
X
= xi yi − n(ȳ − β̂1 x̄)x̄ − β̂1 x2i
i=1 i=1
n
X n
X
= (xi yi − x̄ȳ) − β̂1 (x2i − x̄2 )
i=1 i=1
n
X
= (xi − x̄)(yi − ȳ) + x̄yi + xi ȳ − 2x̄ȳ
i=1
n
X
(xi − x̄)2 + 2x̄xi − 2x̄2
− β̂1
i=1
n
X n
X
= (xi − x̄)(yi − ȳ) − β̂1 (xi − x̄)2 .
i=1 i=1
Thus Pn
(x − x̄)(yi − ȳ) qxy
Pn i
β̂1 = i=1 2
= 2,
i=1 (xi − x̄) sx
1 Pn
where qxy = n−1 i=1 (xi − x̄)(yi − ȳ) is the sample covariance of xi , yi , and
1 n
s2x = n−1 x̄)2 is the sample variance of xi . We assume that s2x is
P
(x
i=1 i −
nonzero; otherwise xi s are constant.
Note that the second derivative of RSS is
2 2n 2nx̄
D RSS = .
2nx̄ 2 x2i
P
6
So D2 RSS is positive definite, since its trace and determinant are positive:
X X
det(D2 RSS) = 4n x2i − nx̄2 = 4n (xi − x̄)2 = 4n(n − 1)s2x .
Also note that RSS goes to infinity as β̂0 , β̂1 go to infinity. Thus the above values
for β̂0 , β̂1 give the unique global minimum of RSS.
7
Now we get
hX i hX i
E (xi − x̄)2 = E (xi − µ + µ − x̄)2
i≤n i≤n
hX i
=E (xi − µ)2 + 2(xi − µ)(µ − x̄) + (µ − x̄)2
i≤n
hX X i
=E (xi − µ)2 + 2(µ − x̄) (xi − µ) + n(µ − x̄)2
i≤n i≤n
hX i
=E (xi − µ)2 + 2(µ − x̄)n(x̄ − µ) + n(µ − x̄)2
i≤n
hX i
=E (xi − µ)2 − n(µ − x̄)2
i≤n
X
= E[(xi − µ)2 ] − nE[(µ − x̄)2 ]
i≤n
σ2
= nσ 2 − n = (n − 1)σ 2 .
n
Therefore E[s2 ] = σ 2 , as desired.
On the other hand, notice that in general s is √ not anpunbiased estimator of the
standard deviation σ, since in general E[s] = E[ s2 ] 6= E[s2 ] = σ.
Next let us show that β̂0 , β̂1 are unbiased estimators of β0 , β1 , respectively. First
note that we have yi = β0 + β1 xi + i , where i , xi are independent, and E[i ] = 0
(we also assume that xi , xj , i , j are all independent for i 6= j). Hence we have
h Pn (x − x̄)(y − ȳ) i
E[β̂1 ] = E Pn i
i=1 i
2
(x
i=1 i − x̄)
h Pn (x − x̄)(β + β x + − β − β x̄ − ¯) i
i=1 i 0 1 i i 0 1
=E P n 2
(x
i=1 i − x̄)
h β Pn (x − x̄)2 + Pn (x − x̄)( − ¯) i
1 i=1 i P i=1 i i
=E n 2
i=1 (x i − x̄)
n
X h (xi − x̄) i
= β1 + E Pn (
2 i
− ¯)
i=1 i=1 (xi − x̄)
n
X h (xi − x̄) i
= β1 + E Pn 2
E[i − ¯] = β1 + 0 = β1 .
i=1 (xi − x̄)
i=1
8
Therefore
Pn
h (x − x̄)(yi − ȳ) i
E[β̂0 ] = E[ȳ − β̂1 x̄] = E ȳ − i=1 Pn i 2
x̄
i=1 (xi − x̄)
β1 ni=1 (xi − x̄)2 + ni=1 (xi − x̄)(i − ¯) i
h P P
= E ȳ − Pn 2
x̄
i=1 (xi − x̄)
n h (x − x̄)x̄ i
i
X
= E[ȳ − β1 x̄] − E Pn 2 i
( −
¯)
i=1 i=1 (xi − x̄)
n h (x − x̄)x̄ i
i
X
= µy − β 1 µx − E Pn 2
E[i − ¯]
i=1i=1 (xi − x̄)
= β0 − 0 = β0 .
H0 : β1 = 0.
The idea of the hypothesis test is that if the data shows that the null hypothesis
results in a statistically improbable conclusion, then we can reject the null hypoth-
esis, and accept its negation, which is called the alternative hypothesis. Thus
hypothesis test can be considered a statistical form of reductio ad absurdum. In
our case, the alternative hypothesis is that there is some relation between X, Y , or
more formally
Ha : β1 6= 0.
To perform hypothesis test for the null hypothesis H0 : β1 = 0 we can use the
t-statistic
β̂1 β̂1 − 0
t= = ,
SE(β̂1 ) SE(β̂1 )
which measures the distance of β̂1 to 0 in the units of the standard error of β̂1 .
The distribution of the t-statistic can be computed, and is called t-distribution.
9
The t-distribution is similar to the normal distribution, and is concentrated around
zero. It can be shown that if H0 is true then it is unlikely that the t-statistic has
a large value. So if we observe a large t-statistic, then we can reject H0 . To make
this reasoning quantitative, let T be a random variable with t-distribution. Then
the p-value of the observed t-statistic is
p = P(|T | ≥ |t| | H0 ),
i.e. the p-value is the probability of observing t or less probable values than t,
provided that H0 is true. Therefore if the p-value of t is small enough, we can
conclude that the null hypothesis H0 implies an improbable result, so it can be
rejected.
2.1.4 R2 Statistic
The R2 statistic, also called the coefficient of determination, is
TSS − RSS RSS
R2 = =1− ,
TSS TSS
where TSS = (yi − ȳ)2 is the total sum of squares. Intuitively, TSS is the
P
total variation in the observed values of Y , and RSS is the variation in Y after we
perform the regression. Thus we can regard RSS as the variation in Y which is
unexplained by the linear regression. Hence TSS − RSS is the amount of variation
in Y which is explained by the linear regression, and therefore R2 is the proportion
of the variance in the response which is explained by the linear regression. So if
R2 is close to 1, the linear model is a good approximation of the true relationship
between X, Y . There is an alternative way to see this fact, which we present in the
next paragraph.
Recall that the sample correlation is
Pn
qxy (xi − x̄)(yi − ȳ)
Cor(X, Y ) = rxy = = pPn i=1 pPn .
sx sy i=1 (xi − x̄)
2
i=1 (yi − ȳ)
2
We have
TSS − RSS
R2 =
TSS
Pn 2
Pn 2
i=1 (yi − ȳ)P− i=1 (yi − β̂0 − β̂1 xi )
= n 2
i=1 (yi − ȳ)
Pn
(yi − ȳ)2 − ni=1 (yi − ȳ + ȳ − β̂0 − β̂1 xi )2
P
= i=1 Pn 2
i=1 (yi − ȳ)
10
Pn 2 2
i=1 (yi − ȳ) − (yi − ȳ + β̂1 x̄ − β̂1 xi )
= Pn 2
(ȳ = β̂0 + β̂1 x̄)
i=1 (yi − ȳ)
Pn 2 2
i=1 2(yi − ȳ)β̂1 (xi − x̄) − β̂1 (xi − x̄)
= Pn 2
i=1 (yi − ȳ)
2β̂1 i=1 (yi − ȳ)(xi − x̄) − β̂12 ni=1 (xi − x̄)2
Pn P
= Pn 2
i=1 (yi − ȳ)
2
qxy
q
2β̂1 qxy − β̂12 s2x 2 sxy
2 qxy − s2
s4x x
2
qxy
= = x
= = Cor2 (X, Y ).
s2y s2y s2x s2y
Y = β0 + β1 X1 + · · · + βp Xp + .
Let β̂0 , β̂1 , . . . , β̂p be estimates for the above coefficients. Then our prediction for
Y is given by the formula
We want to find β̂0 , β̂1 , . . . , β̂p that minimize the sum of squared residuals
n
X n
X
RSS = (yi − ŷi )2 = (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip )2 ,
i=1 i=1
11
We want RSS to have its minimum at β̂. Let γ ∈ Rp+1 be an arbitrary vector.
Then the directional derivative of RSS at β̂ in the direction of γ must be zero.
Hence we must have
d
0= RSS(β̂ + tγ)
dt t=0
d
= (y − Xβ̂ − tXγ)T (y − Xβ̂ − tXγ)
dt t=0
d d
= (y − Xβ̂ − tXγ)T (y − Xβ̂) + (y − Xβ̂)T (y − Xβ̂ − tXγ)
dt t=0 dt t=0
= (−Xγ)T (y − Xβ̂) + (y − Xβ̂)T (−Xγ)
T
= −γ T XT y + γ T XT Xβ̂ − yT Xγ + β̂ XT Xγ
= −2γ T XT y + 2γ T XT Xβ̂ (since aT = a for a 1 × 1 matrix)
= 2γ T − XT y + XT Xβ̂ .
Therefore
XT Xβ̂ = XT y,
since γ is arbitrary, and XT y − XT Xβ̂ is also an element of Rp+1 . If XT X is
invertible, which is the case if the columns of X are linearly independent, then we
have
β̂ = (XT X)−1 XT y.
Let us present another way for finding β̂. Note that in general for every γ ∈
Rp+1 we have
In other words, y − Xβ̂ must be orthogonal to the image of the linear map · 7→ X·,
which is a linear subspace of Rn . For this to happen, Xβ̂ must be the orthogonal
projection of y onto this subspace; and in this case we also have
ky − Xβ̂k ≤ ky − Xγk,
for every γ, which is the desired property of β̂. We should also note that there is
always a β̂ that satisfies the above inequality, and Xβ̂ is uniquely determined as
the orthogonal projection of y; but when XT X is non-invertible, β̂ is not unique.
In addition, note that we actually have
d
Dγ RSS(β̂) = RSS(β̂ + tγ) = 2γ T XT y − XT Xβ̂
dt t=0
T
= 2 yT X − β̂ XT X γ.
12
T
Hence DRSS(β̂) = 2 yT X − β̂ XT X . Therefore
D2 RSS(β̂) = 2XT X
is positive definite. Also note that RSS goes to infinity as β̂ goes to infinity. Thus
the above value for β̂ gives the unique global minimum of RSS.
It is also easy to see that
n
∂ X
0= RSS = −2 (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip )
∂ β̂0 i=1
= −2n(ȳ − β̂0 − β̂1 x̄1 − · · · − β̂p x̄p ).
Hence ŷ¯ = β̂0 + β̂1 x̄1 + · · · + β̂p x̄p = ȳ. In other words, the linear function given by
the least squares method passes through the sample means of the data.
y = Xβ + ,
where = [1 , . . . , n ]T is a vector of i.i.d random variables with 0 mean and variance
σ2 . Thus its covariance matrix is
where I is the identity matrix (note that all covariances are 0 due to independence).
We also assume that X, are independent. Now we have
Hence
13
Therefore β̂ is an unbiased estimator of β. On the other hand, the covariance
matrix of β̂ is
provided that X is not random (or we compute the conditional expectation under
the condition that X is constant).
Suppose p = 1. Then we have (for simplicity, we write xi instead of xi1 )
1 x1
1 1 . . . 1 1 x2
P
T n P x2i .
X X= .. .. = P
x1 x2 . . . x n . . xi xi
1 xn
Hence
P 2 P
T −1 1 xi − xi
(X X) = P 2 P
n xi − ( xi )2 − xi
P
n
(xi − x̄)2 + nx̄2 −nx̄
P
1
= .
n (xi − x̄)2 −nx̄
P
n
14
is given by any of the following two formulas
−1
(A − BD−1 C)−1 −A−1 B(D − CA−1 B)−1
A B
=
C D −D−1 C(A − BD−1 C)−1 (D − CA−1 B)−1
−1
A + A−1 B(D − CA−1 B)−1 CA−1 −A−1 B(D − CA−1 B)−1
= ,
−(D − CA−1 B)−1 CA−1 (D − CA−1 B)−1
provided that all the inverses exist and all the products are defined. (For the
proof we can just multiply the proposed inverses and the original block matrix.)
The matrices (A − BD−1 C), (D − CA−1 B) are called the Schur complements of
D, A, respectively. Note that we also have
A−1 B
A B A 0 I
= .
C D C I 0 D − CA−1 B
Therefore the determinant of the block matrix equals det(A) det(D−CA−1 B). So if
A and the block matrix are invertible then the Schur complement of A is invertible
too.
Now in general we can write the matrix X as X = 1 X̃ , where
T
1 = 1 1 . . . 1 ∈ Rn . Hence
T 1T n 1T X̃ n nx̄T
X X= 1 X̃ = = ,
X̃T X̃T 1 X̃T X̃ nx̄ X̃T X̃
T
where x̄ = x̄1 . . . x̄p . Thus
n−1 + n−1 nx̄T (X̃T X̃ − nx̄x̄T )−1 nx̄n−1 −n−1 nx̄T (X̃T X̃ − nx̄x̄T )−1
(XT X)−1 =
−(X̃T X̃ − nx̄x̄T )−1 nx̄n−1 (X̃T X̃ − nx̄x̄T )−1
−1
n (n − 1) + x̄T Q−1 x̄ −x̄T Q−1
1
= ,
n−1 −Q−1 x̄ Q−1
Therefore
h1 1 i 1
SE(β̂0 )2 = σ2 + x̄T Q−1 x̄ , SE(β̂j )2 = σ 2 (Q−1 )jj .
n n−1 n−1
15
2.2.3 Estimating the Irreducible Error
Let us first compute E[RSS]. We have
E[RSS] = E[(Xβ̂ − y)T (Xβ̂ − y)]
= E[(Xβ̂ − Xβ − )T (Xβ̂ − Xβ − )]
= E (β̂ − β)T XT X(β̂ − β) − T X(β̂ − β) − (β̂ − β)T XT + T
= E (XT X)−1 XT XT X (XT X)−1 XT
T
16
2.2.4 F-Statistic
Suppose Y = β0 + β1 X1 + · · · + βp Xp + . To test the null hypothesis
H0 : β1 = β2 = · · · = βp = 0,
H0 : βp−q+1 = βp−q+2 = · · · = βp = 0,
Z = β0 + β1 X1 + · · · + βp Xp .
σY2 = σZ+
2
= σZ2 + 2Cov(Z, ) + σ2 = σZ2 + σ2 .
17
Hence we also have
E[TSS] = E[(n − 1)s2y ] = (n − 1)σY2 = (n − 1)(σZ2 + σ2 ) ≥ (n − 1)σ2 .
In particular, when q = p and H0 is true, we have E[(TSS − RSS)/p] = σ2 , since Z
is constant in this case; otherwise we have E[(TSS − RSS)/p] ≥ σ2 . Therefore when
q = p and H0 is true we expect the F-statistic be close to 1. Otherwise we expect
F to be greater than 1. In general, under suitable assumptions, the distribution
of F is known, and is called F-distribution. Using F-distribution we can assign
p-value to F , and perform hypothesis testing.
Let us inspect the F-statistic more closely. Let ŷ and ỹ be the estimates of
y using the larger and the smaller models respectively. We know that ỹ is the
orthogonal projection of y on the subspace W which is spanned by the vectors
1, x1 , . . . , xp−q , and ŷ is the orthogonal projection of y on the subspace V which is
spanned by
1, x1 , . . . , xp−q , . . . , xp .
Obviously we have ŷ ∈ V and ỹ ∈ W ⊂ V . Also note that due to the properties of
orthogonal projections we have
y − ỹ ∈ W ⊥ , y − ŷ ∈ V ⊥ .
Therefore we get
RSS0 = ky − ỹk2 = ky − ŷ + ŷ − ỹk2
= ky − ŷk2 + kŷ − ỹk2 (since y − ŷ ⊥ ŷ − ỹ)
2
= RSS + kŷ − ỹk .
An immediate consequence of the above equation is that RSS decreases as we add
more predictors to the model. Intuitively, this happens because RSS measures the
prediction error using the training data, and with more predictors we have more
degrees of freedom, so the training error becomes smaller.
Let us find a geometric description of ŷ − ỹ. Let U = V ∩ W ⊥ be the orthogonal
complement of W in V . Then note that
y − (ŷ − ỹ) = (y − ŷ) + ỹ ∈ U ⊥ ,
since y − ŷ ∈ V ⊥ ⊂ U ⊥ , and ỹ ∈ W = (W ⊥ )⊥ ⊂ U ⊥ . Hence ŷ − ỹ must be the
orthogonal projection of y onto the subspace U . Therefore we have the following
formula for the F-statistic
kŷ − ỹk2 /q n − p − 1 kPU yk2
F = =
ky − ŷk2 /(n − p − 1) q ky − PV yk2
n − p − 1 kPV ∩W ⊥ yk2
= ,
q kPV ⊥ yk2
where P denotes the orthogonal projection operator on a subspace.
18
2.2.5 F-Statistic versus t-statistic
For each predictor we have a t-statistic. It can be shown that the square of this
t-statistic is the F-statistic corresponding to omitting that variable from the model,
i.e. when q = 1. To see this, note that we have
X = Xp̃ xp = 1 x1 . . . xp ,
T T
∈ Rn . Let
where xj = x1j ... xnj , and 1 = 1 1 . . . 1
Then ŷ = Hy and ỹ = Hp̃ y are respectively the predictions of Y based on all the
predictors and the first p − 1 predictors. Let Vj be the linear subspace spanned by
the first j + 1 columns of X. We have Vj ⊂ Vj+1 , and since we assume that the
columns of X are linearly independent, we have dim Vj = j + 1. We also know that
the predictions based on least squares method actually give us the orthogonal pro-
jections on the subspace spanned by the columns. Hence H, Hp̃ are the projection
(hat) matrices on the subspaces Vp , Vp−1 , respectively.
Let x̃p = Hp̃ xp be the projection of xp on Vp−1 , i.e. the prediction of Xp using
the other predictors. Then we have
RSS0 = ky − ỹk2 = ky − ŷk2 + β̂p2 kxp − x̃p k2 = RSS + β̂p2 kxp − x̃p k2 ,
19
Thus by using Schur complements we obtain
" #
T −1 ∗ ∗
(X X) = −1
∗ xTp xp − xTp Xp̃ (XTp̃ Xp̃ )−1 XTp̃ xp
∗ ∗
= −1 .
∗ xTp xp − xTp Hp̃ xp
Hence we have
−1
SE(β̂p )2 = σ2 xTp xp − xTp Hp̃ xp
−1 −1
= σ2 xTp xp − xTp x̃p = σ2 xTp (xp − x̃p )
−1 σ2
= σ2 kxp − x̃p k2 + x̃Tp (xp − x̃p ) = ,
kxp − x̃p k2
since xp − x̃p is orthogonal to Vp−1 , which contains x̃p . (Note that this provides a
new formula, and interpretation, for SE(β̂p )2 .)
Finally we get
as desired. The σ̂2 is an estimate of σ2 , which here we substituted RSE for it.
The problem with individual t-statistics is that when there are many predictors,
some of their associated p-values can be small due to chance. And these randomly
small p-values do not imply that there is a relationship between the response and
their corresponding predictors. Therefore we use the F-statistic to see if there
is a relation between the response and predictors in multiple regression. Notice
that in the joint distribution of all the t-statistics, the event of some of their p-
values being small is quite likely! So the smallness of some of the p-values is not an
improbable event, and based on it we cannot reject the null hypothesis. In contrast,
the smallness of the p-value of the F-statistic is an improbable event, and based on
it we can reject H0 .
20
orthogonal to the image of · 7→ X·, so it is in particular orthogonal to
β̂0 − ȳ 1 x11 x12 . . . x1p β̂0 − ȳ ŷ1 − ȳ
β̂1 ŷ2 − ȳ
β̂1 1 x21 x22 . . . x2p
X . = . .. .. = .. .
.. ..
. ..
. . . . . .
β̂p 1 xn1 xn2 . . . xnp β̂p ŷn − ȳ
Hence the inner product of y − ŷ and the above vector is zero, i.e.
n
X
(yi − ŷi )(ŷi − ȳ) = 0.
i=1
Thus we have
n
X n
X
TSS − RSS = (yi − ȳ)2 − (yi − ŷi )2
i=1 i=1
Xn Xn n
X
= (yi − ȳ)2 − (yi − ŷi )2 − (yi − ŷi )(ŷi − ȳ)
i=1 i=1 i=1
Xn Xn
= (yi − ȳ)2 − (yi − ŷi )(yi − ŷi + ŷi − ȳ)
i=1 i=1
Xn Xn
= (yi − ȳ)2 − (yi − ŷi )(yi − ȳ)
i=1 i=1
n
X n
X
= (yi − ȳ)(yi − ȳ − yi + ŷi ) = (yi − ȳ)(ŷi − ȳ).
i=1 i=1
Now suppose z is the prediction of Y using some other linear model. Then we
have z = Xγ for
some vector γ. Similarly
T to the above, we can see that y − ŷ is
orthogonal to z1 − z̄ . . . zn − z̄ . Hence their inner product is zero too, i.e.
n
X
(yi − ŷi )(zi − z̄) = 0.
i=1
Therefore we have
n
X
(n − 1)qyz = (yi − ȳ)(zi − z̄)
i=1
Xn
= (yi − ŷi + ŷi − ȳ)(zi − z̄)
i=1
21
n
X n
X
= (yi − ŷi )(zi − z̄) + (ŷi − ȳ)(zi − z̄)
i=1 i=1
n
X
=0+ ¯ i − z̄)
(ŷi − ŷ)(z (since ŷ¯ = ȳ)
i=1
= (n − 1)qŷz .
Hence we get
Pn
TSS − RSS (y − ȳ)(ŷi − ȳ)
2
R = = Pn i
i=1
2
TSS i=1 (yi − ȳ)
2 qy2ŷ
qyŷ qyŷ sŷ qyŷ qyŷ
= 2 = 2 2 = 2 2 = 2 2 = Cor2 (Y, Ŷ ),
sy sy sŷ sy sŷ sy sŷ
as desired.
On the other hand, we have
Note that in the above inequality we used the facts that Cor(Ŷ , Z) ≤ 1, and
As a side note, observe that if we use the inequality Cor(Y, Ŷ ) ≤ 1 in the above
equation we obtain
sŷ ≤ sy .
In other words, we have also shown that the correlation of Y, Ŷ cannot be negative,
and the variance of Ŷ cannot be larger than the variance of Y .
22
This relation allowed us to compute the expected value and the covariance matrix
of β̂, yielding the formulas
This variance can be used to compute confidence intervals, which are estimates
of how much xT0 β (which is the expected value of y0 given x0 , since E[0 ] = 0)
differs from the predicted value ŷ0 .
Next let us find the expected value and variance of y0 − ŷ0 . We have
The above variance can be used to compute prediction intervals, which are esti-
mates of how much y0 differs from the predicted value ŷ0 .
23
Note that prediction intervals provide upper and lower estimates for a single
value of response, i.e. y0 . In contrast, confidence intervals provide such estimates
for the average value of response. The reason is that the expected value of 0 is
zero, thus its effects are expected to diminish in average, leaving us with xT0 β as
the average value of response.
Also note that both prediction and confidence intervals are centered around the
predicted value ŷ0 , but the prediction interval is larger, since the variance of y0 − ŷ0
is larger than the variance of xT0 β − ŷ0 . Another way of looking at this is that the
variance of xT0 β − ŷ0 consists of the reducible error, while the variance of y0 − ŷ0
consists of both reducible and irreducible errors.
X1 = X, X2 = X 2 , ... Xm = X m ,
Y = β0 + β1 X1 + β2 X2 + · · · + βm Xm +
= β0 + β1 X + β2 X 2 + · · · + βm X m + ,
by the usual method of least squares in multiple linear regression. This is known
as polynomial regression.
When we initially have more than one features, say X1 , . . . , Xp , we can use new
features which are monomial functions of several variables. In particular, we can
also use interaction terms of the form X1 X2 as new features.
We can use the above idea to fit functions other than polynomials too. In
general, we can use the new features
24
2.3.2 Qualitative Predictors and Dummy Variables
Let X be a quantitative feature whose range contains the values c1 , . . . , ck . A
particular choice of basis functions are piecewise constant functions:
since X belongs to exactly one the intervals. Hence there is a linear dependence
relation between the dummy variables and the constant function 1. So if we use the
intercept term in the regression, which is itself a linear combination of the constant
function 1, we have to put aside one of the dummy variables, and work with the
rest of them. The choice of the dropped dummy variable is arbitrary due to the
above linear dependence relation.
The idea of dummy variables is particularly helpful when X is a qualitative
feature with k levels c1 , . . . , ck (note that c1 , . . . , ck are not necessarily numbers
here). In this case the dummy variables become
C1 (X) = I(X = c1 ),
..
.
Ck−1 (X) = I(X = ck−1 ),
Ck (X) = I(X = ck ).
These new variables are quantitative variables, and can be used in an ordinary least
squares regression. So we can deal with the case where some of the features are
qualitative variables.
Note that each dummy variable has only two levels, and we use k two level
dummy variables to quantify the qualitative variable X, and we do not use a single
k-level dummy variable. The reason is that if we want to use a dummy variable with
25
more than two levels, we have to code the levels of X by numbers. And in doing so
we will impose a linear order on the levels of X, together with a notion of distance
between those levels. However, in general, the levels of a qualitative variable are
not ordered linearly, nor there is a well-defined notion of distance between them.
Hence we use several two-level dummy variables instead. In this way, we avoid
imposing any order on the levels. Also, we do not assign any notion of distance
between different levels, since the 0/1 coding of belonging or not belonging to a
class merely expresses that classes are different.
where wi = σ12 . Let W be the diagonal matrix with diagonal entries wi . Then the
i
above weighted sum of squares can be written as
(y − Xβ̂)T W(y − Xβ̂).
By differentiating this expression with respect to β̂, similarly to Section 2.2.1, we
can easily show that the minimizer is given by
XT WXβ̂ = XT Wy,
or equivalently by
β̂ = (XT WX)−1 XT Wy,
when XT WX is invertible.
26
2.3.4 Outliers and Studentized residuals
Outliers are data points which are far from the rest of the data. At an outlier the
response has an unusual value. The predictors may have unusual values at a point
too. In this section and the next one we consider statistics that can help us detect
outliers and points with outlying predictor values.
Recall that H = X(XT X)−1 XT is the projection (hat) matrix on the image
of the linear map defined by X. Then we have ŷ = Hy. Now note that
HX = X(XT X)−1 XT X = X(XT X)−1 (XT X) = X.
Therefore (I − H)X = X − HX = X − X = 0. Hence we get
y − ŷ = y − Hy = (I − H)y = (I − H)(Xβ + ) = (I − H).
In addition, note that we have
(I − H)2 = I − IH − HI + H2 = I − 2H + H = I − H,
since H2 = H (because H is a projection matrix). In fact, it can be shown that
I − H is the projection matrix on the orthogonal complement of the image of X;
so its square must be equal to itself. This also implies that (I − H)X must be zero,
because the columns of X belong to its image, and the projection of the image of
X onto its orthogonal complement is zero.
Now let us compute the covariance matrix of y − ŷ. First note that
E[y − ŷ] = E[Xβ + − Xβ̂] = Xβ + 0 − X E[β̂] = Xβ − Xβ = 0.
Thus we have
E (y − ŷ)(y − ŷ)T = E (I − H)((I − H))T
= E (I − H)T (I − H)T
= (I − H)E[T ](I − H)T (since H is constant)
T
= (I − H)E[ ](I − H) (since H is symmetric)
= (I − H)σ2 I(I − H) = σ2 (I − H)2 = σ2 (I − H).
Therefore we have Var(ei ) = Var(yi − ŷi ) = (σ2 (I − H))ii = σ2 (1 − hi ), where hi is
the ith diagonal entry of H, and is known as the leverage (discussed below). This
enables us to compute the standard error of the residual ei :
p
SE(ei ) = σ 1 − hi .
The studentized residual ti is defined as the ratio of the residual ei to its standard
error, i.e.
ei
ti = √ .
σ 1 − hi
27
Note that when the leverage hi is close to 1 (which is its maximum as we will see
below), or the residual ei is large, then the studentized residual ti will be large. An
observation with a large studentized residual can be an outlier.
Also note that the variance of ei = yi − ŷi cannot be computed similarly
to the variance of y0 − ŷ0 (which is computed in the previous subsection). Be-
cause unlike 0 , the error term i is not independent from the random vector
= [1 , . . . , i , . . . , n ]T .
2.3.5 Leverage
The leverage statistic hi is the ith diagonal entry of the projection matrix H =
X(XT X)−1 XT . By using the formula for (XT X)−1 we obtain
−1
n (n − 1) + x̄T Q−1 x̄ −x̄T Q−1
1
H= X XT ,
n−1 −Q−1 x̄ Q−1
where Q is the matrix ofTsample covariances, and x̄ is the vector of sample means.
Let xi = xi1 . . . xip be the vector of ith observation. Then the ith row of X
is 1 xTi . Hence we get
n−1 (n − 1) + x̄T Q−1 x̄ −x̄T Q−1
1 1
hi = 1 xTi
n−1 −Q−1 x̄ Q−1 xi
n−1 (n − 1) + x̄T Q−1 (x̄ − xi )
1 T
= 1 xi
n−1 −Q−1 (x̄ − xi )
1 1
= + (x̄ − xi )T Q−1 (x̄ − xi ).
n n−1
Now note that Q is positive definite since as we have seen before, Q is invertible,
and satisfies
(n − 1)Q = (X̃ − x̄1T )T (X̃ − x̄1T ),
where X = 1 X̃ . Therefore Q−1 is also positive definite. Thus the nonnegative
number
(x̄ − xi )T Q−1 (x̄ − xi )
measures the square of the distance of the ith observation from the mean, with
respect to the inner product induced by Q−1 . This distance is known as the
Mahalanobis distance, and intuitively, it measures the distance between a point
xi and the mean x̄ in the units of standard deviations. Hence an observation with
high leverage has outlying predictor values. In the simple linear regression the co-
variance matrix Q has only one entry, i.e. the sample variance; hence the leverage
statistic is given by
1 1 (xi − x̄)2 1 (xi − x̄)2
hi = + = + n .
s2x 2
P
n n−1 n j=1 (xj − x̄)
28
Another property of hi , which motivates the name leverage, is that
∂ ŷi
hi = ,
∂yi
since we have ŷ = Hy. So at points with high leverage, small errors in measurement
of y can result in significant errors in the predicted value ŷ. This can also be seen
from the fact that the variance of ŷi is a multiple of hi . To show this remember
that ŷ = Xβ̂. Hence
E[ŷ] = E[Xβ̂] = X E[β̂] = Xβ.
Therefore the covariance matrix of ŷ is
E (ŷ − Xβ)(ŷ − Xβ)T = E (Xβ̂ − Xβ)(Xβ̂ − Xβ)T
= E X(β̂ − β)(X(β̂ − β))T
= E X(β̂ − β)(β̂ − β)T XT
= X E (β̂ − β)(β̂ − β)T XT (since X is constant)
= X σ2 (XT X)−1 XT = σ2 H.
So we get Var(ŷi ) = (σ2 H)ii = σ2 hi . Notice that this variance cannot be computed
similarly to the case of ŷ0 (which is computed in a previous subsection), because as
we noted before, unlike 0 , the error term i is not independent from the random
vector = [1 , . . . , i , . . . , n ]T .
The average leverage for all the observations is always equal to (p+1)/n, because
X
hi = tr(H) = tr X(XT X)−1 XT = tr (XT X)−1 XT X = tr(I) = p + 1.
Furthermore, the leverage statistic is always between 1/n and 1. We have already
seen that hi ≥ 1/n. To show the other bound note that we have HT = H = H2 ,
since H is the matrix of an orthogonal projection. Hence
X X X
hi = Hii = (H2 )ii = Hij Hji = (Hij )2 = h2i + (Hij )2 ;
j j j6=i
2.3.6 Collinearity
Suppose that we can estimate the predictor Xj using the other predictors quite
accurately. This phenomenon is known as (multi)collinearity. When collinearity
is present, our uncertainty in our estimate of the corresponding regression coefficient
will be high, since we know that
σ2
SE(β̂j )2 = , (∗)
kxj − x̃j k2
29
where x̃j is the prediction of xj based on the other predictors. As a result, the
corresponding t-statistic will be small, and we may fail to reject the null hypothesis
H0 : βj = 0.
To assess the severity of multicollinearity we can use the variance inflation
factor (VIF), which is the ratio of the variance of β̂j in the full model with all the
predictors to its variance in the model with only one predictor Xj , i.e.
SE(β̂j )2
VIF(β̂j ) = .
SE(β̂j, simple )2
In other words, VIF measures the change in our uncertainty in β̂j when we extend
the model with only one predictor Xj by adding the other predictors.
2
Let RX be the R2 statistic from regressing Xj onto all the other predictors.
j |X−j
Let RSSj|−j , TSSj|−j be respectively the RSS and TSS of this regression. Also let
T T
x̃j = x̃1j . . . x̃nj be the prediction of xj = x1j . . . xnj based on the
other predictors. Then we have
1 1 TSSj|−j
2 = RSSj|−j
=
1 − RX j |X−j
RSSj|−j
TSSj|−j
(xij − x̄j )2 2
P P
i i (xij − x̄j )
=P 2
=
i (xij − x̃ij ) kxj − x̃j k2
σ2 /kxj − x̃j k2 SE(β̂j )2
= = = VIF(β̂j ).
σ2 / i (xij − x̄j )2
P
SE(β̂j, simple )2
As a result we can see that the smallest possible value for VIF is 1, since RX 2
j |X−j
is between zero and 1 (remember that RX 2 is equal to the square of correlation
j |X−j
of Xj and its prediction X̃j ). When VIF is 1 we must have RX 2 = 0; thus the
j |X−j
other predictors provide no information about Xj , and there is no collinearity. On
the other hand, if RX 2 is close to 1 then there is considerable collinearity, and
j |X−j
VIF has a large value.
We can also see that the smallest possible value for VIF is 1 by noting that
SE(β̂j )2 increases as we add more predictors to the model. The reason is that our
prediction of xj based on the other predictors becomes more accurate when we
add more predictors to the model, because our prediction of xj is the orthogonal
projection of xj onto the subspace spanned by our observations of the other pre-
dictors. Hence as we add more predictors to the model, this subspace becomes
larger. And therefore the projection of xj becomes closer to xj , since we know that
the orthogonal projection of a vector like xj onto a subspace is the closest vector
in that subspace to the given vector xj ; and if we enlarge the subspace the new
30
closest vector to xj cannot be farther away from xj than the old closest vector.
Therefore by formula (∗) the SE(β̂j )2 increases as we add more predictors to the
model. (Note that in the case of simple linear regression, the prediction of xj us-
ing no other predictor is x̄j 1; and plugging this value in the formula (∗) gives us
β̂j, simple .)
instead of the likelihood function, since it is easier to work with. Note that the
logarithm is a strictly increasing function, so maximizing the log-likelihood results
in the same estimate.
Let us see an example of maximum likelihood estimation. Suppose f is normal,
i.e.
1 1
f (x; µ, σ 2 ) = √ exp − 2 (x − µ)2 .
2πσ 2 2σ
31
(Note that here θ is the vector (µ, σ 2 ). Also note that here we abuse the notation
and represent both θ and θ0 by (µ, σ 2 ).) Then the log-likelihood becomes
X
log L(µ, σ 2 ) = log f (xi ; µ, σ 2 )
i≤n
X 1 1
log(2πσ 2 ) − 2 (xi − µ)2
= −
2 2σ
i≤n
n 1 X
= − log(2πσ 2 ) − 2 (xi − µ)2 .
2 2σ
i≤n
(Note that σ 2 is the variable which we differentiated with respect to, not σ!) Hence
we have
n 1 X 1X
2
= 4 (xi − x̄)2 =⇒ σ̂ 2 = (xi − x̄)2 .
2σ̂ 2σ̂ n
i≤n i≤n
32
Chapter 3
Classification
eβ0 +β1 X 1
p(X) = β +β X
= −(β
.
1+e 0 1 1 + e 0 +β1 X)
It is easy to see that
p(X)
= eβ0 +β1 X .
1 − p(X)
The ratio p(X)/[1 − p(X)] is called the odds. Note that since p(X) is between 0
and 1, the odds is between 0 and ∞. Also note that the odds is a strictly increasing
function of the probability p(X). Now by taking logarithm we get
p(X)
log = β0 + β1 X.
1 − p(X)
The logarithm of the odds is called the log-odds or logit. Hence in logistic regres-
sion we assume that the logit is a linear function of X. Note that logit is also a
strictly increasing function of p(X).
To estimate the coefficients β0 , β1 we use the method of maximum likelihood.
In the logistic regression model, we are interested in the conditional probability
measure P(Y |X). (Comparing with the notation of the previous subsection, here
θ is the vector (β0 , β1 ), and the random vectors (Y1 , X1 ), . . . , (Yn , Xn ) form our
33
sample.) We want to find the maximum likelihood estimates β̂0 , β̂1 for β0 , β1 . First
note that P(Y |X = xi ) is a Bernoulli random variable, so we have
f (yi , xi ; θ) = P(Y = yi |X = xi )
(
p(xi ) if yi = 1
= = p(xi )yi (1 − p(xi ))1−yi .
1 − p(xi ) if yi = 0
Now we can compute the log-likelihood function and differentiate it to find its point
of maximum (β̂0 , β̂1 ). However there is no closed formula for β̂0 , β̂1 , and we have
to estimate them using numerical methods.
Note that when X is a continuous random variable we need to work with probability
density functions instead of probabilities.
If we further assume that fk is normal, i.e. fk (x) = σ√12π exp − 2σ1 2 (x − µk )2 ,
we get
πk
exp − 2σ1 2 (x − µk )2
√
σ 2π
pk (x) = PK π 1
.
√j 2
j=1 σ 2π exp − 2σ 2 (x − µj )
34
To simplify the notation we denote the denominator of pk (x) by g(x). Then we
have
√ 1
log pk (x) = log(πk ) − log(σ 2π) − 2 (x − µk )2 − log g(x)
2σ
√ 1 µk µ2
= log(πk ) − log(σ 2π) − 2 x2 + 2 x − k2 − log g(x)
2σ σ 2σ
µk µ2k √ 1 2
= 2 x − 2 + log(πk ) − log(σ 2π) − 2 x − log g(x).
σ 2σ 2σ
Hence the value of k that maximizes pk (x) is the value of k that maximizes
µk µ2k
δk (x) = x − + log(πk ).
σ2 2σ 2
So this function can be used to determine the Bayes classifier.
Let x1 , y1 , . . . , xn , yn be a set of observations. Let nk be the number of obser-
vations for which yi = k. Consider the estimates
nk
π̂k = ,
n
1 X
µ̂k = xi ,
nk
yi =k
K
1 X 1 X
σ̂ 2 = PK (nk − 1) (xi − µ̂k )2
k=1 (nk − 1) k=1 nk − 1
yi =k
K
1 X X
= (xi − µ̂k )2 .
n−K
k=1 yi =k
Note that σ̂ 2 is a weighted average of the sample variances for each of the K classes.
Also note that the n − K in the denominator makes σ̂ 2 an unbiased estimator of σ 2 ,
1 P
since each nk −1 yi =k (xi − µ̂k )2 is an unbiased estimator of σ 2 , and the expectation
of a weighted average is equal to the weighted average of the expectations. If we
use the above estimates we obtain the following estimate for δk (x):
µ̂k µ̂2k
δ̂k (x) = x − + log(π̂k ).
σ̂ 2 2σ̂ 2
This function is called the discriminant function. Note that the discriminant
function is a linear function in x. Similarly to our use of δk in the Bayes classifier,
we can use δ̂k for classification. This method is known as the linear discriminant
analysis (LDA).
35
Next let us assume that we have more than one predictor, and fk is multivariate
normal, i.e.
1 1
T −1
fk (x) = exp − (x − µ k ) Σk (x − µ k ) ,
(2π)p/2 |Σk |1/2 2
where |Σk | is the determinant of Σk . Note that we allow the covariance matrix Σk
to depend on k. Hence we have
πk |Σk |−1/2 exp − 21 (x − µk )T Σ−1
k (x − µk )
pk (x) = PK .
−1/2 exp − 1 (x − µ )T Σ−1 (x − µ )
j=1 πj |Σj | 2 j j j
which are simply the sample (co)variances. Therefore we can estimate δk (x) by the
discriminant function
1 −1 −1 1 −1 1
δ̂k (x) = − xT Σ̂k x + xT Σ̂k µ̂k − µ̂Tk Σ̂k µ̂k − log |Σ̂k | + log(π̂k ).
2 2 2
Note that this time the discriminant function is a quadratic function in x. If we use
δ̂k for classification, as we used δk in the Bayes classifier, we obtain the quadratic
discriminant analysis (QDA) method.
If we assume that the covariance matrix does not depend on k, the above for-
mulas for QDA will reduce to the corresponding formulas for LDA with more that
one predictor. Especially, the linear discriminant function is given by
−1 1 −1
δ̂k (x) = xT Σ̂ µ̂k − µ̂Tk Σ̂ µ̂k + log(π̂k ),
2
36
where the entries of Σ̂ are weighted averages of the sample (co)variances of each of
the classes
K
1 X 1 X
Σ̂jl = PK (nk − 1) (xi )j − (µ̂k )j (xi )l − (µ̂k )l
k=1 (nk − 1) k=1 nk − 1
yi =k
K
1 X X
= (xi )j − (µ̂k )j (xi )l − (µ̂k )l .
n−K
k=1 yi =k
Let us also mention that when Y only has two levels, the log-odds in the LDA
framework is linear in x. To see this note that
π1 exp − 21 (x − µ1 )T Σ−1 (x − µ1 )
p1 (x)
=
π2 exp − 21 (x − µ2 )T Σ−1 (x − µ2 )
1 − p1 (x)
π1 1
= exp − (x − µ1 )T Σ−1 (x − µ1 ) − (x − µ2 )T Σ−1 (x − µ2 )
π2 2
π1 1 1
= exp (µ1 − µ2 )T Σ−1 x − µT1 Σ−1 µ1 + µT2 Σ−1 µ2 .
π2 2 2
Therefore
p (x) π1 1 T −1 1
1
log = log − µ1 Σ µ1 + µT2 Σ−1 µ2 + (µ1 − µ2 )T Σ−1 x,
1 − p1 (x) π2 2 2
as desired.
37
Chapter 4
4.1 Cross-Validation
In order to estimate the test error rate of a learning method we can randomly split
the data into two parts, and use one part for training the method, and the other
part for testing the obtained model. This is the validation set approach. The
idea of cross-validation (CV) is to also switch the roles of the two parts of data,
and use the second part for training the method, and the first part for testing or
validating the model. Then we have two estimates of the test error rate, and we
can use their average as a more stable estimate of the test error rate.
More generally, in the k-fold cross-validation we randomly split the data into
k parts. Then we fit the method k-times, and each time we use one of the k parts
as test data, and the remaining k − 1 parts as training data. Then we get k test
mean squared errors MSE1 , . . . , MSEk , and we can use their average to obtain the
k-fold CV estimate of the test error:
k
1X
CV(k) = MSEi .
k
i=1
38
We repeat this process n times for i = 1, 2, . . . , n. Then the LOOCV estimate for
the test MSE is
n
1X
CV(n) = MSEi .
n
i=1
In other words, LOOCV is the n-fold CV for a data set with n data points.
When the learning method is the least squares linear regression we also have
n
1 X yi − ŷi 2
CV(n) = ,
n 1 − hi
i=1
where ŷi is the prediction of the response at xi when we fit the model with all
the observations, and hi is the ith leverage of that fit. In particular note that in
this case, to compute CV(n) , we only need to fit the model once. However, for an
arbitrary learning method there is no similar formula, and we have to fit the model
n times to compute CV(n) .
To prove the above relation, it suffices to show that
yi − ŷi
yi − ỹi = .
1 − hi
For simplicity we assume that i = n. Let X be the matrix of observations of
predictors, and y the vector of observations of the response. Then we have
X̃ y
X= T , y = ñ ,
xn yn
T
where xn = 1 xn1 . . . xnp , X̃ is the matrix consisting of the first n − 1 rows
of X, and yñ is the vector consisting of the first n − 1 components of y. Let ỹ, ŷ
be the predictions of y based on n − 1 and n observations respectively. We have
X̃(XT X)−1 XT y
ŷñ T −1 T X̃ T −1 T
ŷ = = X(X X) X y = T (X X) X y = T T −1 T .
ŷn xn xn (X X) X y
h i y T
T ñ
We also have XT y = X̃ xn = X̃ yñ + yn xn . Hence we get
yn
" T T
#
ỹñ X̃(X̃ X̃)−1 X̃ yñ
ỹ = = T T
ỹn xTn (X̃ X̃)−1 X̃ yñ
T T T
= X(X̃ X̃)−1 X̃ yñ = X(X̃ X̃)−1 (XT y − yn xn ). (∗)
T
Therefore we only need to compute (X̃ X̃)−1 . However note that
i
X̃
h T
T T
X X = X̃ xn = X̃ X̃ + xn xTn .
xTn
39
T
So XT X − X̃ X̃ is equal to the rank-one matrix xn xTn . This enables us to compute
T
the of inverse of X̃ X̃.
More generally, suppose A, B are invertible matrices whose difference is a rank-
one matrix, i.e. A − B = uv T , where u, v are vectors. Then we have
−1
B −1 = (A − uv T )−1 = A(I − A−1 uv T ) = (I − A−1 uv T )−1 A−1 .
But note that (A−1 uv T )2 = A−1 uv T A−1 uv T = h(A−1 uv T ), where the scalar h =
v T A−1 u. Thus by induction we obtain (A−1 uv T )m = hm−1 (A−1 uv T ). Hence the
formal power series becomes
1
(I − A−1 uv T )−1 = I + A−1 uv T (1 + h + h2 + · · · ) = I + A−1 uv T ,
1−h
provided that |h| < 1. It is easy to see that the above matrix is in fact the inverse
of I − A−1 uv T :
1
(I − A−1 uv T ) I + A−1 uv T = I − A−1 uv T
1−h
1 1
+ A−1 uv T − (A−1 uv T )2
1−h 1−h
1 h
= I + A−1 uv T − 1 + − = I.
1−h 1−h
Note that the above computation actually shows that if A is invertible and h 6= 1,
then B is also invertible, and we have
1
B −1 = (I − A−1 uv T )−1 A−1 = A−1 + A−1 uv T A−1 .
1−h
Also notice that B −1 is obtained from A−1 by adding a rank-one matrix to it.
Now in our original problem, we have h = xTn (XT X)−1 xn . However note that
the projection matrix of the least square fit with all the observations is
X̃
h i
H = X(X X) X = T (XT X)−1 X̃ xn
−1 T
T T
xn
" T
#
X̃(XT X)−1 X̃ X̃(XT X)−1 xn
= T .
xTn (XT X)−1 X̃ xTn (XT X)−1 xn
40
T
Hence if the leverage hn 6= 1 then X̃ X̃ is invertible too, and we have
T 1
(X̃ X̃)−1 = (XT X)−1 + (XT X)−1 xn xTn (XT X)−1 .
1 − hn
If we use this formula in the equation (∗), we get
T
ỹ = X(X̃ X̃)−1 (XT y − yn xn )
h 1 i
= X (XT X)−1 + (XT X)−1 xn xTn (XT X)−1 (XT y − yn xn )
1 − hn
1
= X(XT X)−1 XT y + X(XT X)−1 xn xTn (XT X)−1 XT y
1 − hn
yn
− yn X(XT X)−1 xn − X(XT X)−1 xn xTn (XT X)−1 xn
1 − hn
ŷn yn hn ŷn − yn
= ŷ + − yn − X(XT X)−1 xn = ŷ + X(XT X)−1 xn .
1 − hn 1 − hn 1 − hn
If we subtract both sides of the above equality from y, and look at the nth compo-
nent of the resulting equation, we obtain
ŷn − yn T T −1
yn − ỹn = yn − ŷn − x (X X) xn
1 − hn n
ŷn − yn −hn yn − ŷn
= yn − ŷn − hn = yn − ŷn 1 − = ,
1 − hn 1 − hn 1 − hn
as desired. As a consequence we get
n−1
|yn − ŷn | ≤ |yn − ỹn | ≤ |yn − ỹn |,
n
since 0 ≤ 1 − hn ≤ 1 − n1 . In other words, adding an observation at some point
improves the linear least squares fit prediction at that point.
41
y − ŷ 2
n n
= xTn (XT X)−1 XT X(XT X)−1 xn
1 − hn
y − ŷ 2
n n (yn − ŷn )2 hn hn
= xTn (XT X)−1 xn = = σ 2 t2 ,
1 − hn (1 − hn )2 1 − hn n
√en n −ŷn
y√
where tn = σ 1−hn
= σ 1−hn
is the studentized residual. The quantity
is called the Cook’s distance of the nth observation (p is the number of predictors).
It measures the effect of removal of the nth observation on the predictions of the
model. We can similarly compute the Cook’s distance Di of the other observations.
Note that the Cook’s distance is an increasing function of both the studentized
residual and the leverage.
(x1 , y1 ), . . . , (xn , yn )
Therefore we can use P̂ to estimate any quantity that we would be able to compute
using P. In particular, we can use P̂ to obtain new samples from the population.
These new samples can for example be used to estimate test error rates, or the
standard error of a coefficient in a parametric learning method.
Now note that when we form a new sample, the P̂-probability of a new data
point (x, y) being in this new sample is zero; so our new sample will only contain
some of the points (xi , yi ), and under P̂ these points are all equally likely to appear
in the new sample. In addition, note that when we form this new sample using P̂,
in each step we randomly choose one of the points, say (xj , yj ), with probability
1
n . But in the next step we also do the same thing, and we might again randomly
choose the same data point (xj , yj ) as in the previous step. Therefore forming a
42
new sample using the probability distribution P̂ is the same as sampling from the
data set
(x1 , y1 ), . . . , (xn , yn )
with replacement. These new samples are called bootstrap samples.
y = Xβ + ,
β̂ = β + (XT X)−1 XT .
Now let us make another set of observations in the population at the same predic-
tor values. These new observations will have different response values due to the
random noise . We denote these new response values by y0 . Then we have
y0 = Xβ + 0 ,
where 0 = [01 , . . . , 0n ]T is independent from . Since the predicted value is deter-
mined by the value of predictors, the prediction of the linear model at both sets of
observations X, y and X, y0 is the same, and is given by ŷ = Xβ̂.
Let us compute the expected prediction error of the model at X, y0 . Recall that
for a random variable like Z we have Var[Z] = E[Z 2 ] − (E[Z])2 . Hence we have
2
E[(yi0 − ŷi )2 ] = Var[yi0 − ŷi ] + E[yi0 − ŷi ]
2
= Var[yi0 ] + Var[ŷi ] − 2Cov[yi0 , ŷi ] + E[yi0 ] − E[ŷi ] .
However note that yi0 = xTi β + 0i and yi = xTi β + i (where xTi is the ith row of
X) are identically distributed and independent. Because i , 0i are independent and
have the same distribution as , and xi , β are constants. In addition, for similar
reasons yi0 and
ŷi = xTi β̂ = xTi β + xTi (XT X)−1 XT
43
are also independent. Thus Cov[yi0 , ŷi ] = 0. Therefore we get
2
E[(yi0 − ŷi )2 ] = Var[yi0 ] + Var[ŷi ] + E[yi0 ] − E[ŷi ]
2
= Var[yi ] + Var[ŷi ] + E[yi ] − E[ŷi ]
2
= Var[yi ] + Var[ŷi ] − 2Cov[yi , ŷi ] + 2Cov[yi , ŷi ] + E[yi − ŷi ]
2
= Var[yi − ŷi ] + E[yi − ŷi ] + 2Cov[yi , ŷi ]
= E[(yi − ŷi )2 ] + 2Cov[yi , ŷi ].
Note that n1 i≤n (yi − ŷi )2 = n1 RSS is the training MSE of the model.
P
On the other hand we have
Therefore
1
RSS + 2dσ̂ 2
Cp =
n
44
is an unbiased estimator of the expected prediction MSE at X, y0 . Note that if we
add more predictors to the model, RSS will decrease, but 2dσ̂ 2 increases. Hence
unlike the training MSE, Cp does not necessarily decrease as we add more predictors
to the model. This property of Cp , and the fact that it is an estimator of the
prediction MSE, makes it a suitable statistic for comparing models with different
numbers of predictors.
Another quantity for comparing models with different numbers of predictors is
the adjusted R2 , defined by
2 RSS/(n − p − 1)
Radj =1− .
TSS/(n − 1)
Equivalently we have
2 n−1
1 − Radj = (1 − R2 ) .
n−p−1
n−1
The factor n−p−1 is included in order to adjust the effect of having more predictors
on the R2 statistic. When we add a predictor to the model, we know that RSS
decreases. However, n − p − 1 decreases too. So Radj 2 does not necessarily increase,
2
in contrast to R .
Now note that when we change the number of predictors, TSS/(n − 1) does not
change. Hence the change of Radj 2 is only due to the change in RSS/(n − p − 1). Let
2 RSS/(n − d)
Radj =1−
TSS/(n − 1)
(RSS + dσ̂ 2 )/n Cp − nd σ̂ 2
=1− =1− .
TSS/(n − 1) TSS/(n − 1)
45
4.4 Regularization
Suppose we have a set of observed data X, y, and we want to use this data to find
the estimate ŷ = Xβ̂ for y. In linear regression we find β̂ by minimizing the RSS.
The idea of regularization is to constrain β̂ so that it cannot take large values,
and as a result its variance decreases. More precisely, instead of minimizing the
RSS we minimize
RSS(β̃) ≤ RSS(β̂).
On the other hand we know that RSS(β̂) + λJ(β̂) ≤ RSS(β̃) + λJ(β̃). Combining
these two inequalities we get
46
d
= (y − Xβ̂ − tXγ)T (y − Xβ̂)
dt t=0
d d
+ (y − Xβ̂)T (y − Xβ̂ − tXγ) + λ J(β̂ + tγ)
dt t=0 dt t=0
= (−Xγ)T (y − Xβ̂) + (y − Xβ̂)T (−Xγ) + λDγ J(β̂)
T
= −γ T XT y + γ T XT Xβ̂ − yT Xγ + β̂ XT Xγ + λγ T DJ(β̂)
= −2γ T XT y + 2γ T XT Xβ̂ + λγ T DJ(β̂) (since aT = a for a 1 × 1 matrix)
1
= 2γ T − XT y + XT Xβ̂ + λDJ(β̂) .
2
Therefore we get
1
XT Xβ̂ + λDJ(β̂) = XT y, (∗)
2
since γ is arbitrary.
T
Now suppose J(β̂) = β̂12 +· · ·+ β̂p2 = β̂ I˜β̂, where I˜ is the diagonal matrix whose
diagonal entries are all 1 except the first diagonal entry which is 0. This method
of regularization is known as ridge regression, and its corresponding coefficient
R
is denoted by β̂ λ . In this case the equation (∗) becomes
R R
˜ β̂ R
XT y = XT Xβ̂ λ + λI˜β̂ λ = (XT X + λI) λ.
Note that det(XT X + λI)˜ is a polynomial in λ; so it has finitely many roots. Thus
the matrix X X + λI˜ is invertible for almost all values of λ (even when p > n), and
T
Hence when we look for the minimizer of RSS + λJ, in addition to the points where
the derivative vanishes, we also have to check the points of nondifferentiability,
i.e. the points which have some zero components. The possibility that these points
with some zero components can be the minimizer implies that the lasso can perform
feature selection too.
47
4.4.2 Regularization from Bayesian viewpoint
Suppose the probability measure P describes the distribution of some quantity
X among a population. We assume that P belongs to a parametric family of
probability measures Pθ , but we do not know which value of θ gives us P. In
classical statistics X is random, but θ is a fixed unknown parameter (which we try
to estimate). In contrast, in Bayesian statistics both X, θ are considered to be
random variables. Let f (x|θ) denote the probability density (mass) function of Pθ ,
which is the distribution of X given θ. As before, f (x|θ) is called the likelihood
function. Let π(θ) be the prior density of θ, i.e. the density of θ before we know
anything about X. Then the joint probability density of X, θ is π(θ)f (x|θ) (note
that unlike X, we use θ to denote both a random variable and its values). Hence
the marginal density of X (before knowing θ) is
Z
f (x) := π(ϑ)f (x|ϑ) dϑ,
Θ
where Θ is the set of all possible values of θ. Thus by Bayes’ theorem, the posterior
density of θ, given the observation X = x, is
π(θ)f (x|θ)
π(θ|x) = .
f (x)
where the constant of proportionality, i.e. f (x), can depend on x but not on θ. In
other words we have
posterior ∝ prior × likelihood.
The posterior mode θ̂ is the value of θ that maximizes the posterior π(θ|x),
i.e. θ̂ is the most likely value of θ, given the observed data x. We have
48
Where the last equality holds since f (x) does not depend on θ. In comparison,
recall that arg maxθ log f (x|θ) is the maximum likelihood estimator of θ.
Now let us consider the regularization in linear regression. Suppose
Y = β1 X1 + · · · + βp Xp + ,
where is a normally distributed noise with zero mean and variance σ 2 . (For
simplicity we assume that β0 = 0, and all variables have zero mean.) Suppose we
have n observations of Y, X1 , . . . , Xp , which satisfy
yi = β1 xi1 + · · · + βp xip + i .
And i s are i.i.d. random variables with the same distribution as . In this case θ
is the vector β = (β1 , . . . , βp ) (and yi plays the role of x in the above discussion,
P
while we treat xi1 , . . . , xip as constants). If xij s and β j s are fixed, then βj xij is
2
P
constant; so yi is normally distributed with mean βj xij , and variance σ , i.e. we
have (y − P β x )2
1 i j ij
f (yi |β) = √ exp − 2
.
2πσ 2 σ
(Be careful not to confuse the mathematical constant π with the prior or posterior
densities!) Then since yi s are independent we have
1 (y − P β x )2
i j ij
Y
f (y|β) = f (y1 , . . . , yn |β) = √ exp − 2
.
2πσ 2 σ
i≤n
Hence
n 1 X X 2
log f (y|β) = − log(2πσ 2 ) − 2 yi − βj xij .
2 σ
i≤n j≥1
1
Now let us assume that π(β) = j≥1 √ 2 exp − βj /τ 2 , i.e. we assume that
2
Q
2πτ
π(βj ) is normal with zero mean and variance τ 2 , and βj s are independent. Then
we get
p 1 X 2
log π(β) = − log(2πτ 2 ) − 2 βj .
2 τ
j≥1
Therefore we have
β̂ = arg max log π(β) + log f (y|β)
β
1 X X 2 1 X 2
= arg max − 2 yi − βj xij − 2 βj
β σ τ
X X 2 σ 2 X 2
= arg min yi − βj xij + 2 βj
β τ
X
= arg min RSS(β) + λ βj2 .
β
49
Note that in the second and third equalities above we are using the fact that σ, τ
are (positive) constants, and therefore canceling them or factoring them out does
not alter the extremum point. Thus we have shown that when the prior density
of βj s is normal with zero mean and variance τ 2 , the posterior mode β̂ is also the
2 2
Q to 1λ = σ /τ .
ridge regression coefficient corresponding
Next let us assume that π(β) = j≥1 2b exp − |βj |/b , i.e. we assume that
π(βj ) is Laplace distribution with zero mean and scale parameter b, and βj s are
independent. Then we get
1X
log π(β) = −p log(2b) − |βj |.
b
j≥1
Therefore we have
β̂ = arg max log π(β) + log f (y|β)
β
1 X X 2 1 X
= arg max − 2 yi − βj xij − |βj |
β σ b
X X 2 σ 2 X
= arg min yi − βj xij + |βj |
β b
X
= arg min RSS(β) + λ |βj | .
β
Thus we have shown that when the prior density of βj s is Laplace distribution with
zero mean and scale parameter b, the posterior mode β̂ is also the lasso coefficient
corresponding to λ = σ 2 /b.
50
close to being a constant. Now note that if φj1 s are allowed to be arbitrarily large,
then the variance of z1 can be made arbitrarily large as well. Thus we require that
φ211 + · · · + φ2p1 = kφ1 k2 = 1, i.e. we only care about the direction determined by
φ1 , and not its magnitude.
If we write the above relation in coordinates we have
\[ z_{i1} = \sum_{j=1}^{p}\varphi_{j1} x_{ij}. \]
Hence $\bar z_1 = \sum_{j=1}^{p}\varphi_{j1}\bar x_j = 0$. Therefore the variance of the zi1's can be computed as follows:
\[ (n-1)\,s_{z_1}^2 = \sum_{i\le n}(z_{i1} - \bar z_1)^2 = \sum_{i\le n} z_{i1}^2 = \|z_1\|^2 = z_1^T z_1 = (X\varphi_1)^T X\varphi_1 = \varphi_1^T X^T X\varphi_1. \]
But from linear algebra we know that the maximum of $\varphi_1^T X^T X\varphi_1$ among the vectors φ1 with ‖φ1‖ = 1 is achieved when φ1 is an eigenvector of the symmetric matrix XᵀX corresponding to its largest eigenvalue, which we call λ1, i.e. XᵀXφ1 = λ1φ1. Hence $(n-1)s_{z_1}^2 = \varphi_1^T X^T X\varphi_1 = \lambda_1\varphi_1^T\varphi_1 = \lambda_1$, i.e. the sample variance of z1 is λ1/(n − 1). Note that XᵀX is a positive semidefinite matrix, so its eigenvalues are nonnegative. And recall that by definition, the eigenvalues of
XT X are the squares of the singular values of X. The variable z1 is called the first
principal component of x1 , . . . , xp , and φ1 is called the first principal component
loading vector. Also, z11 , . . . , zn1 are known as the scores of the first principal
component.
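As a concrete illustration (a minimal sketch with simulated data, not from the notes), the first loading vector, the scores, and their sample variance can be computed from the eigendecomposition of XᵀX, assuming the columns of X have already been centered:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 100, 5
    X = rng.normal(size=(n, p))
    X = X - X.mean(axis=0)                      # center the columns, so the scores have mean 0

    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh returns eigenvalues in ascending order
    phi1 = eigvecs[:, -1]                       # loading vector for the largest eigenvalue
    lambda1 = eigvals[-1]

    z1 = X @ phi1                               # scores of the first principal component
    print(np.allclose(z1.var(ddof=1), lambda1 / (n - 1)))   # sample variance = lambda1/(n-1)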
The second principal component of x1 , . . . , xp is a linear combination
which has the largest variance among those linear combinations of x1 , . . . , xp that
are uncorrelated with z1 . Note that the sample mean of z2 , like all other linear
combinations of x1, . . . , xp, is zero. Thus the covariance of z1, z2 vanishes exactly when their inner product does, because
\[ (n-1)\,q_{z_1 z_2} = \sum_{i\le n}(z_{i1} - \bar z_1)(z_{i2} - \bar z_2) = \sum_{i\le n} z_{i1} z_{i2} = z_1^T z_2. \]
Hence z2 has the largest variance among those linear combinations of x1 , . . . , xp
which are orthogonal to z1 . Thus to obtain φ2 we need to maximize φT XT Xφ
among all vectors φ with length one such that Xφ is orthogonal to Xφ1 = z1 . But
in this case we have $z_1^T X\varphi = \varphi_1^T X^T X\varphi = \lambda_1\varphi_1^T\varphi$; so Xφ is orthogonal to z1 precisely when φ is orthogonal to φ1 (assuming λ1 ≠ 0). Hence φ2 is an eigenvector of XᵀX corresponding to its second largest eigenvalue λ2, and z2 = Xφ2.
We can continue the above process for k ≤ p. Therefore, by construction
φ1 , . . . , φp is an orthonormal set of eigenvectors of XT X corresponding to the eigen-
values λ1 ≥ λ2 ≥ · · · ≥ λp . In addition, the principal components z1 , . . . , zp also
form an orthogonal set, which implies they are uncorrelated, since their sample
means are zero. Furthermore for every k we have
\[ (n-1)\,s_{z_k}^2 = \sum_{i\le n}(z_{ik} - \bar z_k)^2 = \sum_{i\le n} z_{ik}^2 = \|z_k\|^2 = \varphi_k^T X^T X\varphi_k = \lambda_k. \]
Now suppose W is another m-dimensional subspace of $\mathbb{R}^p$. Let ψ1, . . . , ψm be an orthonormal basis for W. Then the orthogonal projection of xi on W is
\[ w_i = (x_i^T\psi_1)\psi_1 + \cdots + (x_i^T\psi_m)\psi_m. \]
Thus we can similarly compute the sum of the squared distance of the observed
data from the subspace W :
\begin{align*}
\sum_{i\le n}\|x_i - w_i\|^2 &= \sum_{i\le n}\|x_i\|^2 - \sum_{i\le n}\|w_i\|^2 \\
&= \sum_{i\le n}\|x_i\|^2 - \sum_{i\le n}\sum_{k\le m}(x_i^T\psi_k)^2 \\
&= \sum_{i\le n}\|x_i\|^2 - \sum_{k\le m}\sum_{i\le n}(x_i^T\psi_k)^2 = \sum_{i\le n}\|x_i\|^2 - \sum_{k\le m}\|X\psi_k\|^2.
\end{align*}
So minimizing this sum of squared distances over m-dimensional subspaces W amounts to maximizing $\sum_{k\le m}\|X\psi_k\|^2$. Now note that
\[ \|X\psi_1\|^2 = \psi_1^T X^T X\psi_1 \le \max_{\|\varphi\|=1}\varphi^T X^T X\varphi = \lambda_1. \]
Or equivalently, we can say that z1 has the largest variance among all linear combinations Xφ with ‖φ‖ = 1. For general k ≤ m note that we have
\[ \psi_k = (\psi_k^T\varphi_1)\varphi_1 + \cdots + (\psi_k^T\varphi_p)\varphi_p, \]
and hence $X\psi_k = \sum_{l\le p}(\psi_k^T\varphi_l)X\varphi_l = \sum_{l\le p}(\psi_k^T\varphi_l)z_l$. Since the zl's are orthogonal and ‖zl‖² = λl, we get
\[ \sum_{k\le m}\|X\psi_k\|^2 = \sum_{k\le m}\sum_{l\le p}\lambda_l(\psi_k^T\varphi_l)^2 = \sum_{l\le p}\lambda_l\sum_{k\le m}(\psi_k^T\varphi_l)^2 = \sum_{l\le p} a_l\lambda_l, \]
where $a_l := \sum_{k\le m}(\psi_k^T\varphi_l)^2 \ge 0$. We also know that
\[ a_l = \sum_{k\le m}(\psi_k^T\varphi_l)^2 = \sum_{k\le m}(\varphi_l^T\psi_k)^2 \le \|\varphi_l\|^2 = 1, \]
since the left-hand side is the squared norm of the orthogonal projection of φl onto W.
Chapter 5
Nonlinear Methods
5.1 Splines
We have seen in Section 2.3.1 that we can apply nonlinear basis functions, like
polynomials, to the features to obtain new features. Then we can use these new
features in linear regression. These new features can also be used in other learning
methods such as logistic regression and LDA. One particularly useful set of basis
functions is the set of piecewise polynomial functions. For example, using a feature
X we can fit separate polynomials over different subintervals of its range, split at some chosen points c1, . . . , ck called knots. To make the resulting piecewise polynomial fit continuous and smooth at the knots, we need to impose constraints on its coefficients, and it is difficult to directly impose such constraints while we are fitting the model. The idea for avoiding this difficulty is
to impose the constraints on the basis functions. Then we can use those new basis
functions which satisfy the constraints to fit the model. Let us explain this point
further.
Suppose we want to model the relationship between Y, X by a degree m piece-
wise polynomial which has continuous derivatives up to order m − 1, and has knots
at c1 , . . . , ck . Then we can write
\[ Y = \beta_0 + \beta_1 X + \cdots + \beta_m X^m + \beta_{m+1}(X - c_1)^m_+ + \cdots + \beta_{m+k}(X - c_k)^m_+ + \varepsilon, \]
where
\[ (X - c_j)^m_+ = \begin{cases} 0 & X < c_j, \\ (X - c_j)^m & X \ge c_j. \end{cases} \]
Note that each of the basis functions $1, x, x^2, \ldots, x^m, (x - c_j)^m_+$ is a (piecewise) polynomial of degree at most m, and they all have everywhere continuous derivatives
up to order m − 1. Hence the function
\[ \beta_0 + \beta_1 X + \cdots + \beta_m X^m + \beta_{m+1}(X - c_1)^m_+ + \cdots + \beta_{m+k}(X - c_k)^m_+ \]
is a piecewise polynomial of degree at most m with knots at c1, . . . , ck and continuous derivatives up to order m − 1, so we can estimate β0, . . . , βm+k by ordinary least squares using these basis functions.
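For instance, the design matrix for this basis can be assembled as follows (a minimal sketch, not from the notes; the knot locations and the grid of x values are arbitrary). The resulting matrix can then be passed to any least squares routine:

    import numpy as np

    def truncated_power_basis(x, knots, degree=3):
        """Columns: 1, x, ..., x^degree, (x - c_1)_+^degree, ..., (x - c_k)_+^degree."""
        x = np.asarray(x, dtype=float)
        cols = [x**d for d in range(degree + 1)]
        cols += [np.where(x >= c, (x - c)**degree, 0.0) for c in knots]
        return np.column_stack(cols)

    x = np.linspace(0.0, 10.0, 200)
    B = truncated_power_basis(x, knots=[2.5, 5.0, 7.5])    # shape (200, 4 + 3)
    # A spline fit of y on x is then the ordinary least squares fit of y on B,
    # e.g. np.linalg.lstsq(B, y, rcond=None).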
Consider now the case of a natural cubic spline, i.e. a cubic spline (m = 3) which is required to be linear on the intervals (−∞, c1) and (ck, ∞). When $x < c_1$ we have $(x - c_j)^3_+ = 0$. So the linearity constraint on (−∞, c1) implies that β2 = β3 = 0. The linearity constraint on (ck, ∞) will yield relationships between β4, . . . , βk+3, which after simplifications imply that the fitted function can be written as a linear combination of the k basis functions N1, . . . , Nk, where
\[ N_{j+2}(x) = \frac{(x - c_j)^3_+ - (x - c_k)^3_+}{c_k - c_j} - \frac{(x - c_{k-1})^3_+ - (x - c_k)^3_+}{c_k - c_{k-1}}. \]
To simplify the notation we also set N1(x) = 1 and N2(x) = x. Although the computations for deriving the above relation are cumbersome, it can readily be
seen that each Nj is a piecewise polynomial of degree at most 3 with continuous
derivatives up to order 2, which is also linear on the intervals (−∞, c1 ) and (ck , ∞).
Because for $x < c_1$ we have $(x - c_j)^3_+ = 0$ for each j; and for $x > c_k$ we have $(x - c_j)^3_+ = (x - c_j)^3$ for each j, in which case the cubic and quadratic terms in the two difference quotients defining N_{j+2} cancel, so each Nj is again linear there.

5.2 Smoothing Splines

Given observations (x1, y1), . . . , (xn, yn), consider the functional
\[ F[g] = \sum_{i=1}^{n}\bigl(y_i - g(x_i)\bigr)^2 + \lambda\int\bigl(g''(x)\bigr)^2\,dx, \]
which is defined for a sufficiently smooth function g. (The integral can be taken over (−∞, ∞) or a bounded interval containing the range of values of X, but its bounds are not important in what follows.) Also, λ is a fixed tuning parameter. We will show that the functional F has a minimizer, which is known as a smoothing spline.
In fact, we will show that the functional F achieves its minimum in the space
of natural cubic splines with knots at x1 , . . . , xn . Hence smoothing splines are
natural cubic splines; however, their coefficients are different from the coefficients
of a natural cubic spline which minimizes RSS. Let g be an arbitrary function in
the domain of the functional. We will show that there is a natural cubic spline h
such that F [h] ≤ F [g]. Hence if F has a minimizer, it must be a natural cubic
spline. Then we show that F actually achieves its minimum in the space of natural
cubic splines, as desired.
Let N1 , . . . , Nn be the basis for the space of natural cubic splines with knots at
x1 , . . . , xn . We assume that x1 , . . . , xn are distinct points. To simplify the notation
we also assume that x1 , . . . , xn are ordered in an increasing way. Let N = [Nj (xi )]
be the n × n matrix resulting from applying this basis to the observed predictor
values. Remember that N1(x) = 1 and N2(x) = x. The other entries of N are
\begin{align*}
N_{j+2}(x_i) &= \frac{(x_i - x_j)^3_+ - (x_i - x_n)^3_+}{x_n - x_j} - \frac{(x_i - x_{n-1})^3_+ - (x_i - x_n)^3_+}{x_n - x_{n-1}} \\
&= \frac{(x_i - x_j)^3_+}{x_n - x_j} - \frac{(x_i - x_{n-1})^3_+}{x_n - x_{n-1}} \qquad (\text{since } x_i \le x_n) \\
&= \begin{cases} 0 & \text{if } i \le j, \\ \dfrac{(x_i - x_j)^3}{x_n - x_j} & \text{if } j < i < n, \\ (x_n - x_j)^2 - (x_n - x_{n-1})^2 & \text{if } i = n. \end{cases}
\end{align*}
Note that we have Nj+2(xi) ≠ 0 for i > j. The matrix N is almost lower triangular,
and it can be easily seen that it has at least n − 1 linearly independent columns.
But to show that N is invertible we have to prove that all its columns are linearly
independent.
Suppose there is a linear dependence relation between the columns of N. Then
there are coefficients a1 , . . . , an such that
\[ a_1 N_1(x_i) + \cdots + a_n N_n(x_i) = 0 \quad\text{for every } i \le n. \]
Set h̃ := a1N1 + · · · + anNn; then h̃ is a natural cubic spline which vanishes at each of the points x1, . . . , xn,
since h̃'' is a nonzero linear function over each of these intervals. But by our assumption, between the l − k + 1 roots xk, xk+1, . . . , xl of h̃, h̃'' must have at least l − k − 1 distinct roots. This contradiction implies that h̃'' must be identically zero everywhere, which implies that the coefficients a1, . . . , an are all zero. So the
columns of N are linearly independent, and hence N is an invertible matrix.
Now let $g = (g(x_1), \ldots, g(x_n))^T$. Then there is a vector β such that Nβ = g.
Let
h(x) = β1 N1 (x) + · · · + βn Nn (x).
Then h is a natural cubic spline that satisfies h(xi) = g(xi) for every i. Therefore $\sum_{i=1}^{n}(y_i - g(x_i))^2 = \sum_{i=1}^{n}(y_i - h(x_i))^2$. Hence we have
\[ F[g] - F[h] = \lambda\int\bigl((g'')^2 - (h'')^2\bigr)\,dx. \]
Thus to conclude F[h] ≤ F[g] we only need to show that $\int (h'')^2\,dx \le \int (g'')^2\,dx$. Since h is a natural cubic spline, h'' vanishes outside [x1, xn]; so integrating by parts twice over each interval between consecutive knots gives
\[ \int (g'' - h'')\,h''\,dx = \sum_{i=1}^{n-1}\Bigl(\bigl[(g' - h')\,h''\bigr]_{x_i}^{x_{i+1}} - \bigl[(g - h)\,h'''\bigr]_{x_i}^{x_{i+1}} + \int_{x_i}^{x_{i+1}} (g - h)\,h^{(4)}\,dx\Bigr). \]
Now note that the integrals containing $h^{(4)}$ vanish since h is a piecewise cubic polynomial. We also have g(xi) = h(xi) for every i; so the sum in the above formula which contains h''' vanishes too. In addition, the sum containing h'' is a telescoping sum; hence it equals $(g'(x_n) - h'(x_n))h''(x_n) - (g'(x_1) - h'(x_1))h''(x_1)$.
But $h''(x_n) = h''(x_1) = 0$, because h is a natural spline, so it is linear on the intervals (−∞, x1) and (xn, ∞). Therefore all the terms in the above equation vanish. Thus we have $\int (g'' - h'')\,h''\,dx = 0$. Hence we get
\[ \int (h'')^2\,dx = \int h''\,g''\,dx \le \frac{1}{2}\int\bigl((h'')^2 + (g'')^2\bigr)\,dx = \frac{1}{2}\int (h'')^2\,dx + \frac{1}{2}\int (g'')^2\,dx, \]
which implies that $\int (h'')^2\,dx \le \int (g'')^2\,dx$, as desired.
Now let us compute F[h]. We have
\begin{align*}
F[h] &= \sum_{i=1}^{n}\bigl(y_i - h(x_i)\bigr)^2 + \lambda\int\bigl(h''(x)\bigr)^2\,dx \\
&= \sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{n} N_j(x_i)\beta_j\Bigr)^2 + \lambda\int\Bigl(\sum_{j=1}^{n} N_j''(x)\beta_j\Bigr)^2\,dx \\
&= \sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{n} N_j(x_i)\beta_j\Bigr)^2 + \lambda\sum_{j,k}\beta_k\beta_j\int N_j''(x)\,N_k''(x)\,dx \\
&= \|y - N\beta\|^2 + \lambda\,\beta^T\Omega\beta,
\end{align*}
where Ω is the n × n matrix with entries $\Omega_{jk} = \int N_j''(x)\,N_k''(x)\,dx$. Therefore to minimize F over the space of natural cubic splines we need to find the β that minimizes $\|y - N\beta\|^2 + \lambda\beta^T\Omega\beta$. We can differentiate this expression with respect to β, and similarly to the computations in Section 4.4.1 for ridge regression, we obtain that the minimizer β̂ satisfies
\[ (N^T N + \lambda\Omega)\hat\beta = N^T y. \]
And when the matrix $N^T N + \lambda\Omega$ is invertible, the vector
\[ \hat\beta = (N^T N + \lambda\Omega)^{-1} N^T y \]
provides the minimizer ĥ(x) = β̂1 N1(x) + · · · + β̂n Nn(x) for the functional F.
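In code, once the basis matrix N and the penalty matrix Ω have been computed (constructing them is not shown here), finding the coefficients is a single penalized least squares solve. A minimal sketch, assuming N, Omega, and y are given as NumPy arrays:

    import numpy as np

    def smoothing_spline_coefficients(N, Omega, y, lam):
        """Solve (N^T N + lam * Omega) beta = N^T y."""
        return np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)

    # Usage, with precomputed N (n x n basis matrix), Omega, responses y, and tuning lam:
    # beta_hat = smoothing_spline_coefficients(N, Omega, y, lam=0.5)
    # The fitted smoothing spline is then h_hat(x) = sum_j beta_hat[j] * N_j(x).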
5.3 Local Regression

In local regression we estimate f(x) at a target point x by a weighted least squares fit that gives more weight to the observations whose predictor values are close to x; for instance, for a local linear fit we choose β̂0(x), β̂1(x) to minimize
\[ \sum_{i=1}^{n} w_i(x)\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2, \]
where wi(x) = K(xi, x) are the weights. As you see, we can use linear or polynomial
functions in this method. Note that the weights depend on x; so the above fitting
procedure will only give us f̂(x) for this particular value of x. And if we want to estimate f̂(x0) for another value x0, we need to compute the weights wi(x0) = K(xi, x0), and fit the weighted least squares method again.
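A minimal sketch of this procedure (not from the notes; the Gaussian kernel and the bandwidth are arbitrary illustrative choices), refitting a weighted linear least squares at each target point:

    import numpy as np

    def local_linear_fit(x0, x, y, bandwidth=1.0):
        """Estimate f(x0) by weighted least squares with weights w_i = K(x_i, x0)."""
        w = np.exp(-(x - x0)**2 / (2 * bandwidth**2))   # Gaussian kernel weights
        A = np.column_stack([np.ones_like(x), x])       # design matrix for a linear fit
        W = np.diag(w)
        beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
        return beta[0] + beta[1] * x0                   # fitted value at x0

    rng = np.random.default_rng(4)
    x = np.sort(rng.uniform(0.0, 10.0, 200))
    y = np.sin(x) + rng.normal(scale=0.3, size=200)
    f_hat = np.array([local_linear_fit(x0, x, y, bandwidth=0.7) for x0 in x])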