
Notes on Statistical Learning

Mohammad Safdari
Contents

1 Introduction to Statistical Learning 1


1.1 Populations, Samples, and Learning methods . . . . . . . . . . . . . 1
1.2 Measuring the Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Reducible and Irreducible Errors . . . . . . . . . . . . . . . . 2
1.2.2 Bias-Variance Decomposition . . . . . . . . . . . . . . . . . . 2
1.2.3 Error rates in Classification . . . . . . . . . . . . . . . . . . . 4

2 Linear Regression 5
2.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Least Squares Method . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Unbiased Estimators . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.3 t-statistic and Hypothesis tests . . . . . . . . . . . . . . . . . 9
2.1.4 R2 Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Estimating the Coefficients . . . . . . . . . . . . . . . . . . . 11
2.2.2 Standard Errors of the Coefficient Estimates . . . . . . . . . . 13
2.2.3 Estimating the Irreducible Error . . . . . . . . . . . . . . . . 16
2.2.4 F-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.5 F-Statistic versus t-statistic . . . . . . . . . . . . . . . . . . . 19
2.2.6 R2 Statistic in Multiple Linear Regression . . . . . . . . . . . 20
2.2.7 Confidence and Prediction Intervals . . . . . . . . . . . . . . 22
2.3 Further Topics in Linear Regression . . . . . . . . . . . . . . . . . . . 24
2.3.1 Polynomial Regression and Basis Functions . . . . . . . . . . 24
2.3.2 Qualitative Predictors and Dummy Variables . . . . . . . . . 25
2.3.3 Heteroscedasticity and Weighted Least Squares . . . . . . . . 26
2.3.4 Outliers and Studentized residuals . . . . . . . . . . . . . . . 27
2.3.5 Leverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.6 Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . 31

3 Classification 33
3.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . 34

4 Model Selection and Regularization 38


4.1 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Leave-One-Out Cross-Validation . . . . . . . . . . . . . . . . 38
4.1.2 Cook’s Distance . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Cp Statistic and Adjusted R2 . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1 Ridge Regression and Lasso . . . . . . . . . . . . . . . . . . . 46
4.4.2 Regularization from Bayesian viewpoint . . . . . . . . . . . . 48
4.5 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . 50

5 Nonlinear Methods 56
5.1 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Smoothing Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Local Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Chapter 1

Introduction to Statistical
Learning

1.1 Populations, Samples, and Learning methods


Suppose the probability measure P describes the distribution of some quantities or
properties among a population. Let Y, X be two of these quantities or properties.
Mathematically, Y, X are random variables. They can also be vector-valued. There
are many situations in which we are interested in predicting the value of Y based on
the observed value of X, or understanding the relationship between Y, X. Suppose

Y = f(X) + ε,

where ε is random noise with E[ε] = 0. We usually assume that Y is scalar-valued,


but X = (X1 , . . . , Xp ) can be vector-valued. In this setting, the components of X
are called (input) variables, or predictors, or features. And Y is called the
output variable, or response.
To understand the relationship between Y, X, and to perform predictions, we
need to estimate the function f . In order to estimate f , we need a sample from the
population, which means we need a set of observations of the variables of interest
X, Y . Suppose we have a data set consisting of n observations

(x1 , y1 ), . . . , (xn , yn )

for X, Y . Then we can employ this sample, which is also called the training data,
to estimate f using a learning method. Broadly speaking, there are two types
of learning methods for estimation of f . In parametric methods we first assume
that f has a given form with some unknown parameters; then we use the data to
estimate those parameters. For example, we may assume that f is a linear function;
then we can try to estimate the coefficients of that linear function using the data.

In contrast, in non-parametric methods we do not assume a predetermined form
for f .
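
As a concrete illustration of the two approaches (a hypothetical sketch: the true f, the noise level, the sample size, and the choice of k are all assumptions made only for this example), the snippet below fits the same synthetic data once with a parametric method, a straight line with two parameters, and once with a simple non-parametric method, a k-nearest-neighbour average.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)   # Y = f(X) + eps with f = sin (assumed)

# Parametric: assume f is linear and estimate its two coefficients.
beta1, beta0 = np.polyfit(x, y, deg=1)

# Non-parametric: estimate f(x0) by averaging the y-values of the k nearest x's.
def knn_predict(x0, k=15):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

x0 = 1.0
print("linear fit at x0:", beta0 + beta1 * x0)
print("k-NN fit at x0:  ", knn_predict(x0))
print("true f(x0):      ", np.sin(x0))
```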
The above process of estimating f is known as supervised learning, since
we have both the response Y and the predictor X. In supervised learning, when
the response Y is a quantitative variable (i.e. it takes numerical values), we are
dealing with a regression problem. And when the response Y is a qualitative or
categorical variable (i.e. it takes finitely many discrete values, in other words, its
values belongs to one of finitely many classes or categories), we are dealing with
a classification problem.
Sometimes there is no response variable at all, and we have only observed the
input variables. In these settings we may for example be interested in detecting
patterns in the data. This is known as unsupervised learning.

1.2 Measuring the Error


1.2.1 Reducible and Irreducible Errors
Suppose that we have estimated the following form for f :

Ŷ = fˆ(X).

Then the difference between Y and Ŷ is called the error. And the difference between
f and f̂ is called the reducible error, because it can potentially be reduced. The other
part of the error, which is due to the presence of the random noise ε and cannot
be reduced by our choice of the learning method, is called the irreducible error.
We have

E[(Y − Ŷ)²] = E[(f(X) + ε − f̂(X))²]
            = E[(f(X) − f̂(X))² + 2ε(f(X) − f̂(X)) + ε²]
            = E[(f(X) − f̂(X))²] + 2E[ε(f(X) − f̂(X))] + E[ε²]
            = (f(X) − f̂(X))² + 2(f(X) − f̂(X))E[ε] + E[ε²]
            = (f(X) − f̂(X))² + 0 + E[(ε − 0)²]
            = (f(X) − f̂(X))² + Var[ε],

where the first term is the reducible error and the second term is the irreducible error.

1.2.2 Bias-Variance Decomposition


Suppose we have a data set consisting of n observations

(x1 , y1 ), . . . , (xn , yn )

for X, Y, where the xi can be vectors too. Suppose we use this (training) data set
in a learning method to find the estimate fˆ for f . Then the (training) Mean
Squared Error (MSE) is
MSE := (1/n) Σ_{i=1}^n (yi − f̂(xi))².

However, we are interested in the test MSE, i.e. the MSE of the model when it is
applied to new test data.
Let us further assume that f (·) = F (·, θ0 ) belongs to the class of functions
F = {F (·, θ) : θ ∈ I},
which is parametrized by θ in the parameter set I. Then we use a training data set
to produce the estimate f̂(·) = F(·, θ̂). Here θ̂ is the random variable which gives the
parameter of the prediction f̂ when we use a particular training data set as input to
our learning method. We assume that θ̂ and ε are independent, since they come from
two different sources of randomness: ε is random due to unmeasured or unmeasurable
variables, while θ̂ is random due to the random selection of a training data set.
To simplify the notation we set µ = E[Ŷ ] = E[fˆ(X)], i.e. µ is the average value
of Ŷ at X for different training data sets used as input in our learning method.
Then the expected test MSE (i.e. the average of MSE w.r.t. all possible training
data sets used as input in our learning method evaluated (tested) at X) can be
computed as
E[(Y − Ŷ)²] = E[(f(X) + ε − f̂(X))²]
            = E[(f(X) − f̂(X))² + 2ε(f(X) − f̂(X)) + ε²]
            = E[(f(X) − f̂(X))²] + 2E[ε(f(X) − f̂(X))] + E[ε²]
            = E[(f(X) − f̂(X))²] + 2E[f(X) − f̂(X)]E[ε] + E[ε²]
            = E[(f(X) − µ + µ − f̂(X))²] + 0 + E[(ε − 0)²]
            = E[(f(X) − µ)²] + E[(µ − f̂(X))²]
              + 2E[(f(X) − µ)(µ − f̂(X))] + Var[ε]
            = (f(X) − µ)² + E[(µ − f̂(X))²]
              + 2(f(X) − µ)E[µ − f̂(X)] + Var[ε]
            = (f(X) − µ)²  +  E[(µ − f̂(X))²]  +  Var[ε] .
               (Bias²)         (Variance)        (Irreducible Error)

Here the variance is the variance of the value of Ŷ at X, and the bias is the difference
between the true value f(X) and the mean value of Ŷ at X. Hence we have shown
that the reducible error is the sum of the variance and the square of the bias.

In other words, suppose we have a large collection of training data sets, and we
use each one of them to calculate an estimate Ŷ for Y , using our learning method.
Then this collection of estimates has a mean and a variance. The estimates are
distributed around their mean, and their variability is quantified by their variance.
The distance of the mean of all the estimates from the true value of f(X) (which is
the expected value of Y given X, since E[ε] = 0) is called the bias, and the variance
of the estimates is still called the variance. Therefore the bias measures the average
distance of the estimates from f(X), and the variance measures how much the
estimates are dispersed.
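
The decomposition can be checked numerically. The following sketch is illustrative only: the true f, the noise level σε, the sample size, and the choice of a linear fit as the learning method are assumptions of the example. It repeatedly draws training sets, fits each one, and compares the Monte Carlo estimate of the expected test MSE at a fixed point x0 with bias² + variance + Var[ε].

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)          # assumed true regression function
sigma_eps = 0.3                  # assumed noise standard deviation
n, reps, x0 = 50, 5000, 1.0      # training size, number of training sets, test point

preds = np.empty(reps)
sq_err = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-3, 3, n)
    y = f(x) + rng.normal(scale=sigma_eps, size=n)
    b1, b0 = np.polyfit(x, y, deg=1)             # the learning method: a linear fit
    preds[r] = b0 + b1 * x0
    y0 = f(x0) + rng.normal(scale=sigma_eps)     # a fresh test response at x0
    sq_err[r] = (y0 - preds[r]) ** 2

bias2 = (f(x0) - preds.mean()) ** 2
variance = preds.var()
print("expected test MSE:        ", sq_err.mean())
print("bias^2 + variance + noise:", bias2 + variance + sigma_eps ** 2)
```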

1.2.3 Error rates in Classification


Suppose Y is a qualitative variable, and its classes are represented by {1, 2, . . . , K}.
Let yi be the ith observed value, and ŷi be the predicted value for it. Then the
(training) error rate is:
(1/n) Σ_{i=1}^n I(yi ≠ ŷi),

where the indicator variable I = 0 if yi = ŷi, and I = 1 if yi ≠ ŷi. Let y0 be a new
observed value, and ŷ0 be its predicted value. Then the test error is

E[I(y0 ≠ ŷ0)].

Suppose we have predicted ŷ0 = j. Then the expected test error (ETE) is

ETE = E[I(y0 ≠ ŷ0)] = Σ_{k=1}^K I(k ≠ ŷ0) P(y0 = k)
                    = Σ_{k=1}^K I(k ≠ j) P(Y = k | X = x0)
                    = Σ_{k≠j} P(Y = k | X = x0) = 1 − P(Y = j | X = x0).

Thus to minimize ETE we have to set ŷ0 = j such that P(Y = j|X = x0 ) has the
highest value among P(Y = k|X = x0 ) for k = 1, . . . , K. This is called the Bayes
classifier. The Bayes decision boundary is the boundary between the regions
{x0 : ŷ0 = j} in the X-space, for different j. This boundary also determines the
Bayes classifier.
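
When the class priors and class-conditional densities are known, the Bayes classifier can be evaluated directly. Below is a minimal sketch under an assumed two-class setup with Gaussian class densities (the priors, means, and variances are made up for illustration).

```python
import numpy as np
from scipy.stats import norm

# Assumed toy setup: two classes with known priors and known class densities.
priors = {1: 0.6, 2: 0.4}
densities = {1: norm(loc=-1.0, scale=1.0), 2: norm(loc=+1.5, scale=1.0)}

def bayes_classifier(x0):
    # Pick the class k maximizing P(Y = k | X = x0), i.e. prior_k * f_k(x0).
    posts = {k: priors[k] * densities[k].pdf(x0) for k in priors}
    return max(posts, key=posts.get)

print(bayes_classifier(0.0))   # class assigned to a new observation x0 = 0
```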

Chapter 2

Linear Regression

Linear regression is a parametric learning method in which we assume that f is


a linear function, i.e. we assume that

Y = β0 + β1X1 + · · · + βpXp + ε.

Then we need to estimate the coefficients β0 , . . . , βp . In simple linear regression


we assume that p = 1, i.e. we have

Y = β0 + β1X + ε.

The case of more than one predictor will be dealt with in multiple linear regres-
sion.

2.1 Simple Linear Regression


2.1.1 Least Squares Method
Suppose our data set consists of n observations

(x1 , y1 ), . . . , (xn , yn )

for X, Y . Let ŷi = β̂0 + β̂1 xi be the predicted value for Y when X = xi . Then
ei = yi − ŷi is called the ith residual. The residual sum of squares (RSS) is
RSS = e1² + · · · + en² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)².

In the least squares method we choose β̂0 , β̂1 to minimize RSS.

To find the minimum of RSS we consider it as a function of β̂0 , β̂1 , and differ-
entiate with respect to them. We get
0 = −(1/2) ∂RSS/∂β̂0 = Σ_{i=1}^n (yi − β̂0 − β̂1 xi) = n(ȳ − β̂0 − β̂1 x̄),

where x̄ = (1/n) Σ_{i=1}^n xi and ȳ = (1/n) Σ_{i=1}^n yi are the sample means of the xi and yi,
respectively. Hence

β̂0 + β̂1 x̄ = ȳ.
Therefore

0 = −(1/2) ∂RSS/∂β̂1 = Σ_{i=1}^n xi (yi − β̂0 − β̂1 xi)
  = Σ_{i=1}^n xi yi − nβ̂0 x̄ − β̂1 Σ_{i=1}^n xi²
  = Σ_{i=1}^n xi yi − n(ȳ − β̂1 x̄)x̄ − β̂1 Σ_{i=1}^n xi²
  = Σ_{i=1}^n (xi yi − x̄ȳ) − β̂1 Σ_{i=1}^n (xi² − x̄²)
  = Σ_{i=1}^n [ (xi − x̄)(yi − ȳ) + x̄yi + xi ȳ − 2x̄ȳ ] − β̂1 Σ_{i=1}^n [ (xi − x̄)² + 2x̄xi − 2x̄² ]
  = Σ_{i=1}^n (xi − x̄)(yi − ȳ) − β̂1 Σ_{i=1}^n (xi − x̄)².

Thus

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = qxy / sx²,

where qxy = (1/(n−1)) Σ_{i=1}^n (xi − x̄)(yi − ȳ) is the sample covariance of the xi, yi, and
sx² = (1/(n−1)) Σ_{i=1}^n (xi − x̄)² is the sample variance of the xi. We assume that sx² is
nonzero; otherwise the xi are constant.
Note that the second derivative of RSS is

D²RSS = [ 2n     2nx̄     ]
        [ 2nx̄    2 Σ xi²  ].

So D²RSS is positive definite, since its trace and determinant are positive:

det(D²RSS) = 4n ( Σ xi² − nx̄² ) = 4n Σ (xi − x̄)² = 4n(n − 1)sx².

Also note that RSS goes to infinity as β̂0, β̂1 go to infinity. Thus the above values
for β̂0, β̂1 give the unique global minimum of RSS.
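
A quick numerical check of these closed-form expressions (a sketch on synthetic data with assumed true coefficients β0 = 2, β1 = 3):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

qxy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance q_xy
sx2 = np.var(x, ddof=1)                   # sample variance s_x^2
beta1_hat = qxy / sx2
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)
print(np.polyfit(x, y, deg=1))            # library check: returns [beta1_hat, beta0_hat]
```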

2.1.2 Unbiased Estimators


The reason that we use n − 1 in the definition of s2 instead of n, is that we want s2
to be an unbiased estimator of (population) variance σ 2 , i.e. the expected value
of s2 be equal to σ 2 . To see this suppose x1 , . . . , xn are independent and identically
distributed (i.i.d.) random variables with mean µ and variance σ 2 . First note that
E[x̄] = E[(1/n) Σ_{i≤n} xi] = (1/n) Σ_{i≤n} E[xi] = (1/n) · nµ = µ.

Therefore the sample mean x̄ is an unbiased estimator of the (population) mean µ.


We also have

E[(µ − x̄)²] = (1/n²) E[(nµ − Σ_{i≤n} xi)²] = (1/n²) E[(Σ_{i≤n} (µ − xi))²]
            = (1/n²) E[ Σ_{i≤n} (µ − xi)² + Σ_{i≠j} (µ − xi)(µ − xj) ]
            = (1/n²) Σ_{i≤n} E[(µ − xi)²] + (1/n²) Σ_{i≠j} E[(µ − xi)(µ − xj)]
            = nσ²/n² + (1/n²) Σ_{i≠j} E[µ − xi] E[µ − xj]        (xi, xj are independent)
            = σ²/n + 0 = σ²/n.

Note that here we have shown that

Var(x̄) = E[(x̄ − µ)²] = σ²/n.
Now we get

E[Σ_{i≤n} (xi − x̄)²] = E[Σ_{i≤n} (xi − µ + µ − x̄)²]
  = E[ Σ_{i≤n} ( (xi − µ)² + 2(xi − µ)(µ − x̄) + (µ − x̄)² ) ]
  = E[ Σ_{i≤n} (xi − µ)² + 2(µ − x̄) Σ_{i≤n} (xi − µ) + n(µ − x̄)² ]
  = E[ Σ_{i≤n} (xi − µ)² + 2(µ − x̄)n(x̄ − µ) + n(µ − x̄)² ]
  = E[ Σ_{i≤n} (xi − µ)² − n(µ − x̄)² ]
  = Σ_{i≤n} E[(xi − µ)²] − n E[(µ − x̄)²]
  = nσ² − n · σ²/n = (n − 1)σ².

Therefore E[s²] = σ², as desired.
On the other hand, notice that in general s is not an unbiased estimator of the
standard deviation σ, since in general E[s] = E[√(s²)] ≠ √(E[s²]) = σ.
Next let us show that β̂0, β̂1 are unbiased estimators of β0, β1, respectively. First
note that we have yi = β0 + β1 xi + εi, where εi, xi are independent, and E[εi] = 0
(we also assume that xi, xj, εi, εj are all independent for i ≠ j). Hence we have
E[β̂1] = E[ Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² ]
      = E[ Σ_{i=1}^n (xi − x̄)(β0 + β1 xi + εi − β0 − β1 x̄ − ε̄) / Σ_{i=1}^n (xi − x̄)² ]
      = E[ ( β1 Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (xi − x̄)(εi − ε̄) ) / Σ_{i=1}^n (xi − x̄)² ]
      = β1 + Σ_{i=1}^n E[ (xi − x̄)(εi − ε̄) / Σ_{j=1}^n (xj − x̄)² ]
      = β1 + Σ_{i=1}^n E[ (xi − x̄) / Σ_{j=1}^n (xj − x̄)² ] E[εi − ε̄] = β1 + 0 = β1.

Therefore

E[β̂0] = E[ȳ − β̂1 x̄] = E[ ȳ − x̄ Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² ]
      = E[ ȳ − x̄ ( β1 Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (xi − x̄)(εi − ε̄) ) / Σ_{i=1}^n (xi − x̄)² ]
      = E[ȳ − β1 x̄] − Σ_{i=1}^n E[ (xi − x̄) x̄ (εi − ε̄) / Σ_{j=1}^n (xj − x̄)² ]
      = µy − β1 µx − Σ_{i=1}^n E[ (xi − x̄) x̄ / Σ_{j=1}^n (xj − x̄)² ] E[εi − ε̄]
      = β0 − 0 = β0.

2.1.3 t-statistic and Hypothesis tests


In the last section we computed the expected values of β̂0 , β̂1 . We can also compute
the variances of β̂0 , β̂1 similarly, but it is easier to compute them using matrix
algebra, and we will do so in Section 2.2.2. The square root of the variance of an
estimate is called its standard error. Let us denote the standard errors of β̂0 , β̂1
by SE(β̂0 ), SE(β̂1 ) respectively.
Now suppose we want to see whether there really is a relationship between X, Y .
In other words, we want to see whether the observed data provides statistically
significant evidence to support that X, Y are related. To do this we can perform a
hypothesis test. We start with a null hypothesis. Here our null hypothesis is
that there is no relation between X, Y , which mathematically can be expressed as

H0 : β1 = 0.

The idea of the hypothesis test is that if the data shows that the null hypothesis
results in a statistically improbable conclusion, then we can reject the null hypoth-
esis, and accept its negation, which is called the alternative hypothesis. Thus
hypothesis test can be considered a statistical form of reductio ad absurdum. In
our case, the alternative hypothesis is that there is some relation between X, Y , or
more formally
Ha : β1 ≠ 0.
To perform hypothesis test for the null hypothesis H0 : β1 = 0 we can use the
t-statistic

t = β̂1 / SE(β̂1) = (β̂1 − 0) / SE(β̂1),

which measures the distance of β̂1 from 0 in units of the standard error of β̂1.
The distribution of the t-statistic can be computed, and is called t-distribution.

The t-distribution is similar to the normal distribution, and is concentrated around
zero. It can be shown that if H0 is true then it is unlikely that the t-statistic has
a large value. So if we observe a large t-statistic, then we can reject H0 . To make
this reasoning quantitative, let T be a random variable with t-distribution. Then
the p-value of the observed t-statistic is

p = P(|T | ≥ |t| | H0 ),

i.e. the p-value is the probability of observing t, or values even less probable than t,
provided that H0 is true. Therefore if the p-value of t is small enough, we can
conclude that the null hypothesis H0 implies an improbable result, so it can be
rejected.

2.1.4 R2 Statistic
The R2 statistic, also called the coefficient of determination, is
R² = (TSS − RSS)/TSS = 1 − RSS/TSS,

where TSS = Σ (yi − ȳ)² is the total sum of squares. Intuitively, TSS is the
total variation in the observed values of Y , and RSS is the variation in Y after we
perform the regression. Thus we can regard RSS as the variation in Y which is
unexplained by the linear regression. Hence TSS − RSS is the amount of variation
in Y which is explained by the linear regression, and therefore R2 is the proportion
of the variance in the response which is explained by the linear regression. So if
R2 is close to 1, the linear model is a good approximation of the true relationship
between X, Y . There is an alternative way to see this fact, which we present in the
next paragraph.
Recall that the sample correlation is

Cor(X, Y) = rxy = qxy / (sx sy) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / ( √(Σ_{i=1}^n (xi − x̄)²) √(Σ_{i=1}^n (yi − ȳ)²) ).

We have

R² = (TSS − RSS)/TSS
   = [ Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n (yi − β̂0 − β̂1 xi)² ] / Σ_{i=1}^n (yi − ȳ)²
   = [ Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n (yi − ȳ + ȳ − β̂0 − β̂1 xi)² ] / Σ_{i=1}^n (yi − ȳ)²
   = Σ_{i=1}^n [ (yi − ȳ)² − (yi − ȳ + β̂1 x̄ − β̂1 xi)² ] / Σ_{i=1}^n (yi − ȳ)²        (ȳ = β̂0 + β̂1 x̄)
   = Σ_{i=1}^n [ 2(yi − ȳ)β̂1(xi − x̄) − β̂1²(xi − x̄)² ] / Σ_{i=1}^n (yi − ȳ)²
   = [ 2β̂1 Σ_{i=1}^n (yi − ȳ)(xi − x̄) − β̂1² Σ_{i=1}^n (xi − x̄)² ] / Σ_{i=1}^n (yi − ȳ)²
   = ( 2β̂1 qxy − β̂1² sx² ) / sy²
   = ( 2 qxy²/sx² − (qxy²/sx⁴) sx² ) / sy²
   = qxy² / (sx² sy²) = Cor²(X, Y).

It is easy to see that Cor(Ŷ, Y) = Cor(β̂0 + β̂1X, Y) = Cor(X, Y). Hence we also
have R² = Cor²(Ŷ, Y).

2.2 Multiple Linear Regression


2.2.1 Estimating the Coefficients
Suppose that we have

Y = β0 + β1X1 + · · · + βpXp + ε.

Let β̂0 , β̂1 , . . . , β̂p be estimates for the above coefficients. Then our prediction for
Y is given by the formula

ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp .

We want to find β̂0 , β̂1 , . . . , β̂p that minimize the sum of squared residuals
RSS = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip)²,

i.e. we want to use the method of least squares. Set

X = [ 1  x11  x12  . . .  x1p ]      β̂ = [ β̂0 ]      y = [ y1 ]      ŷ = [ ŷ1 ]
    [ 1  x21  x22  . . .  x2p ]           [ β̂1 ]          [ y2 ]          [ ŷ2 ]
    [ ⋮    ⋮    ⋮           ⋮  ]           [ ⋮  ]          [ ⋮  ]          [ ⋮  ]
    [ 1  xn1  xn2  . . .  xnp ]           [ β̂p ]          [ yn ]          [ ŷn ].

Then we have ŷ = Xβ̂, and

RSS = RSS(β̂) = ‖y − ŷ‖² = ‖y − Xβ̂‖² = (y − Xβ̂)ᵀ(y − Xβ̂).

We want RSS to have its minimum at β̂. Let γ ∈ Rp+1 be an arbitrary vector.
Then the directional derivative of RSS at β̂ in the direction of γ must be zero.
Hence we must have
0 = (d/dt)|_{t=0} RSS(β̂ + tγ)
  = (d/dt)|_{t=0} (y − Xβ̂ − tXγ)ᵀ(y − Xβ̂ − tXγ)
  = [(d/dt)|_{t=0} (y − Xβ̂ − tXγ)]ᵀ(y − Xβ̂) + (y − Xβ̂)ᵀ[(d/dt)|_{t=0} (y − Xβ̂ − tXγ)]
  = (−Xγ)ᵀ(y − Xβ̂) + (y − Xβ̂)ᵀ(−Xγ)
  = −γᵀXᵀy + γᵀXᵀXβ̂ − yᵀXγ + β̂ᵀXᵀXγ
  = −2γᵀXᵀy + 2γᵀXᵀXβ̂        (since aᵀ = a for a 1 × 1 matrix)
  = 2γᵀ( −Xᵀy + XᵀXβ̂ ).
Therefore
XT Xβ̂ = XT y,
since γ is arbitrary, and XT y − XT Xβ̂ is also an element of Rp+1 . If XT X is
invertible, which is the case if the columns of X are linearly independent, then we
have
β̂ = (XT X)−1 XT y.
Let us present another way for finding β̂. Note that in general for every γ ∈
Rp+1 we have

0 = ⟨0, γ⟩ = ⟨Xᵀ(y − Xβ̂), γ⟩ = ⟨y − Xβ̂, Xγ⟩.

In other words, y − Xβ̂ must be orthogonal to the image of the linear map γ ↦ Xγ,
which is a linear subspace of Rn . For this to happen, Xβ̂ must be the orthogonal
projection of y onto this subspace; and in this case we also have

‖y − Xβ̂‖ ≤ ‖y − Xγ‖,

for every γ, which is the desired property of β̂. We should also note that there is
always a β̂ that satisfies the above inequality, and Xβ̂ is uniquely determined as
the orthogonal projection of y; but when XT X is non-invertible, β̂ is not unique.
In addition, note that we actually have

Dγ RSS(β̂) = (d/dt)|_{t=0} RSS(β̂ + tγ) = 2γᵀ( XᵀXβ̂ − Xᵀy ) = 2( β̂ᵀXᵀX − yᵀX )γ.

Hence DRSS(β̂) = 2( β̂ᵀXᵀX − yᵀX ). Therefore

D2 RSS(β̂) = 2XT X

is positive definite. Also note that RSS goes to infinity as β̂ goes to infinity. Thus
the above value for β̂ gives the unique global minimum of RSS.
It is also easy to see that
0 = ∂RSS/∂β̂0 = −2 Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip)
  = −2n(ȳ − β̂0 − β̂1 x̄1 − · · · − β̂p x̄p).

Hence the mean of the fitted values equals β̂0 + β̂1 x̄1 + · · · + β̂p x̄p = ȳ. In other words, the linear function given by
the least squares method passes through the sample means of the data.
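
A minimal sketch of multiple linear regression by solving the normal equations XᵀXβ̂ = Xᵀy on synthetic data (the number of predictors, the true coefficients, and the noise level are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])          # assumed [beta0, beta1, beta2, beta3]
X = np.column_stack([np.ones(n), X_raw])             # design matrix with intercept column
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X^T X beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)

# np.linalg.lstsq minimizes ||y - X beta||^2 directly and agrees with the above.
print(np.linalg.lstsq(X, y, rcond=None)[0])
```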

2.2.2 Standard Errors of the Coefficient Estimates


Suppose the true relationship between X, y is

y = Xβ + ε,

where ε = [ε1, . . . , εn]ᵀ is a vector of i.i.d. random variables with 0 mean and variance
σε². Thus its covariance matrix is

E[εεᵀ] = [ E[ε1²]     E[ε1 ε2]   · · ·   E[ε1 εn] ]
         [ E[ε1 ε2]   E[ε2²]     · · ·   E[ε2 εn] ]   =   σε² I,
         [    ⋮           ⋮        ⋱         ⋮    ]
         [ E[ε1 εn]   E[ε2 εn]   · · ·   E[εn²]   ]

where I is the identity matrix (note that all covariances are 0 due to independence).
We also assume that X and ε are independent. Now we have

β̂ = (XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹Xᵀ(Xβ + ε)
  = (XᵀX)⁻¹XᵀXβ + (XᵀX)⁻¹Xᵀε = β + (XᵀX)⁻¹Xᵀε.

Hence

E[β̂] = E[β + (XᵀX)⁻¹Xᵀε] = β + E[(XᵀX)⁻¹Xᵀ] E[ε] = β + 0 = β.
Therefore β̂ is an unbiased estimator of β. On the other hand, the covariance
matrix of β̂ is

E[(β̂ − β)(β̂ − β)ᵀ] = E[((XᵀX)⁻¹Xᵀy − β)((XᵀX)⁻¹Xᵀy − β)ᵀ]
                    = E[((XᵀX)⁻¹Xᵀε)((XᵀX)⁻¹Xᵀε)ᵀ]
                    = (XᵀX)⁻¹Xᵀ E[εεᵀ] X(XᵀX)⁻¹
                    = (XᵀX)⁻¹Xᵀ σε² I X(XᵀX)⁻¹
                    = σε² (XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = σε² (XᵀX)⁻¹,

provided that X is not random (or we compute the conditional expectation under
the condition that X is constant).
Suppose p = 1. Then we have (for simplicity, we write xi instead of xi1)

XᵀX = [ 1    1    . . .   1  ] [ 1   x1 ]   =   [ n        Σ xi  ]
      [ x1   x2   . . .   xn ] [ 1   x2 ]       [ Σ xi     Σ xi² ].
                               [ ⋮    ⋮  ]
                               [ 1   xn ]

Hence

(XᵀX)⁻¹ = ( 1 / (n Σ xi² − (Σ xi)²) ) [  Σ xi²   −Σ xi ]
                                       [ −Σ xi      n   ]

        = ( 1 / (n Σ (xi − x̄)²) ) [ Σ (xi − x̄)² + nx̄²   −nx̄ ]
                                   [ −nx̄                   n  ].

Note that here we have used the formula


Σ (xi − x̄)² = Σ xi² − (1/n)(Σ xi)²,

which is a manifestation of the formula E[(Z − E[Z])²] = E[Z²] − (E[Z])², for a
random variable Z. Hence we obtain the following formulas for the standard
errors of β̂0, β̂1:

SE(β̂0)² = Var(β̂0) = σε² [ 1/n + x̄² / Σ_{i=1}^n (xi − x̄)² ],

SE(β̂1)² = Var(β̂1) = σε² / Σ_{i=1}^n (xi − x̄)².

To compute the standard errors of the coefficients when p is arbitrary we need


some tools from linear algebra. It is easy to see that the inverse of a block matrix

is given by any of the following two formulas

[ A   B ]⁻¹  =  [ (A − BD⁻¹C)⁻¹                  −A⁻¹B(D − CA⁻¹B)⁻¹ ]
[ C   D ]       [ −D⁻¹C(A − BD⁻¹C)⁻¹             (D − CA⁻¹B)⁻¹      ]

             =  [ A⁻¹ + A⁻¹B(D − CA⁻¹B)⁻¹CA⁻¹    −A⁻¹B(D − CA⁻¹B)⁻¹ ]
                [ −(D − CA⁻¹B)⁻¹CA⁻¹             (D − CA⁻¹B)⁻¹      ],

provided that all the inverses exist and all the products are defined. (For the
proof we can just multiply the proposed inverses and the original block matrix.)
The matrices (A − BD⁻¹C), (D − CA⁻¹B) are called the Schur complements of
D, A, respectively. Note that we also have

[ A   B ]  =  [ A   0 ] [ I    A⁻¹B       ]
[ C   D ]     [ C   I ] [ 0    D − CA⁻¹B  ].

Therefore the determinant of the block matrix equals det(A) det(D−CA−1 B). So if
A and the block matrix are invertible then the Schur complement of A is invertible
too.
Now in general we can write the matrix X as X = [1  X̃], where
1 = [1  1  . . .  1]ᵀ ∈ Rⁿ. Hence

XᵀX = [ 1ᵀ ] [1  X̃]  =  [ n      1ᵀX̃ ]  =  [ n     nx̄ᵀ ]
      [ X̃ᵀ ]             [ X̃ᵀ1   X̃ᵀX̃ ]     [ nx̄    X̃ᵀX̃ ],

where x̄ = [x̄1  . . .  x̄p]ᵀ. Thus

(XᵀX)⁻¹ = [ n⁻¹ + n⁻¹nx̄ᵀ(X̃ᵀX̃ − nx̄x̄ᵀ)⁻¹nx̄n⁻¹     −n⁻¹nx̄ᵀ(X̃ᵀX̃ − nx̄x̄ᵀ)⁻¹ ]
          [ −(X̃ᵀX̃ − nx̄x̄ᵀ)⁻¹nx̄n⁻¹                 (X̃ᵀX̃ − nx̄x̄ᵀ)⁻¹        ]

        = (1/(n − 1)) [ n⁻¹(n − 1) + x̄ᵀQ⁻¹x̄    −x̄ᵀQ⁻¹ ]
                       [ −Q⁻¹x̄                   Q⁻¹    ],

where Q = (n − 1)⁻¹(X̃ᵀX̃ − nx̄x̄ᵀ) is the matrix of sample covariances, since

(X̃ᵀX̃ − nx̄x̄ᵀ)ij = Σ_{k=1}^n xki xkj − nx̄i x̄j = Σ_{k=1}^n (xki − x̄i)(xkj − x̄j).

Therefore

SE(β̂0)² = σε² [ 1/n + (1/(n − 1)) x̄ᵀQ⁻¹x̄ ],        SE(β̂j)² = (σε²/(n − 1)) (Q⁻¹)jj.

2.2.3 Estimating the Irreducible Error
Let us first compute E[RSS]. We have

E[RSS] = E[(Xβ̂ − y)ᵀ(Xβ̂ − y)]
       = E[(Xβ̂ − Xβ − ε)ᵀ(Xβ̂ − Xβ − ε)]
       = E[(β̂ − β)ᵀXᵀX(β̂ − β) − εᵀX(β̂ − β) − (β̂ − β)ᵀXᵀε + εᵀε]
       = E[((XᵀX)⁻¹Xᵀε)ᵀ XᵀX ((XᵀX)⁻¹Xᵀε)
           − εᵀX((XᵀX)⁻¹Xᵀε) − ((XᵀX)⁻¹Xᵀε)ᵀXᵀε + εᵀε]
       = E[−εᵀX(XᵀX)⁻¹Xᵀε + εᵀε].

Now note that for every vector v the scalar vᵀv equals the trace of the matrix vvᵀ.
Hence in particular we have

E[εᵀε] = E[tr(εεᵀ)] = tr(E[εεᵀ]) = tr(σε² I) = nσε².

Note that trace and E commute, since trace is linear.
On the other hand, we know that (XᵀX)⁻¹ is a symmetric positive definite
matrix. Thus we have (XᵀX)⁻¹ = S² for some symmetric positive definite matrix
S. Now note that SXᵀε is a (p + 1)-dimensional random vector. The covariance
matrix of this vector is equal to

E[SXᵀεεᵀXS] = SXᵀ E[εεᵀ] XS        (since X, S are constant)
            = SXᵀ σε² I XS = σε² SXᵀXS = σε² S(S²)⁻¹S = σε² I.

Note that here I is the (p + 1) × (p + 1) identity matrix. Now we have

E[εᵀX(XᵀX)⁻¹Xᵀε] = E[εᵀXSSXᵀε]
                 = E[(εᵀXS)(SXᵀε)]
                 = E[(SXᵀε)ᵀ(SXᵀε)]
                 = E[tr((SXᵀε)(SXᵀε)ᵀ)]
                 = E[tr(SXᵀεεᵀXS)]
                 = tr(E[SXᵀεεᵀXS]) = tr(σε² I) = (p + 1)σε².

Hence we get

E[RSS] = E[−εᵀX(XᵀX)⁻¹Xᵀε + εᵀε] = (n − p − 1)σε².

Thus the square of the residual standard error

RSE² = RSS / (n − p − 1)

is an unbiased estimator of σε².
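
The sketch below estimates σε² by RSE² and then the standard errors of the coefficient estimates from the diagonal of RSE²(XᵀX)⁻¹; the data and the true noise level are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])
sigma_eps = 0.7                                        # assumed noise level
y = X @ beta + rng.normal(scale=sigma_eps, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_hat) ** 2)
rse2 = rss / (n - p - 1)                               # unbiased estimate of sigma_eps^2

# Estimated covariance matrix of beta_hat and the standard errors of its entries.
cov_beta_hat = rse2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta_hat))
print("RSE^2:", rse2, "(true sigma_eps^2 =", sigma_eps**2, ")")
print("SE(beta_hat):", se)
```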

2.2.4 F-Statistic
Suppose Y = β0 + β1X1 + · · · + βpXp + ε. To test the null hypothesis

H0 : β1 = β2 = · · · = βp = 0,

we use the F-statistic

F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ],

where TSS = Σ (yi − ȳ)² and RSS = Σ (yi − ŷi)².
More generally, if we want to test the null hypothesis

H0 : βp−q+1 = βp−q+2 = · · · = βp = 0,

we use the F-statistic

F = [ (RSS0 − RSS)/q ] / [ RSS/(n − p − 1) ].
Here RSS0 is the residual sum of squares when we fit a second model using the
first p − q variables, i.e. RSS0 corresponds to a smaller model with predictors
X1 , . . . , Xp−q ; and RSS corresponds to the larger original model with all the pre-
dictors X1 , . . . , Xp−q , . . . , Xp . (Note that TSS is the RSS of a fitted model which
has no predictors, and has only the intercept term. So the previous formula for F
is a special case of the above formula.)
The F statistic is a normalized measure of change in residual sum of squares
(which can be thought of as the total error of prediction in the linear model) per
added predictor, when we extend the model with predictors X1 , . . . , Xp−q by adding
the predictors Xp−q+1 , . . . , Xp ; and we have computed this measure (the F statistic)
in the units of our estimate of σ2 , which is itself a measure of the noise in the
response Y .
When RSS0 − RSS is small (which hints that H0 might be true), the smaller
model with p − q features explains the response nearly as well as the larger model
with p features. Therefore it might be the case that adding the last q variables to
the model does not improve our estimation of Y significantly; so we cannot reject
the null hypothesis H0 .
We have seen that E[RSS/(n − p − 1)] = σε². Let

Z = β0 + β1X1 + · · · + βpXp.

Then Z and ε are uncorrelated, since

Cov(Z, ε) = Cov(β0, ε) + β1 Cov(X1, ε) + · · · + βp Cov(Xp, ε) = 0,

because Xi and ε are independent for each i, and β0 is a constant. Now we have

σY² = σ²_{Z+ε} = σZ² + 2Cov(Z, ε) + σε² = σZ² + σε².

Hence we also have
E[TSS] = E[(n − 1)sy²] = (n − 1)σY² = (n − 1)(σZ² + σε²) ≥ (n − 1)σε².

In particular, when q = p and H0 is true, we have E[(TSS − RSS)/p] = σε², since Z
is constant in this case; otherwise we have E[(TSS − RSS)/p] ≥ σε². Therefore when
q = p and H0 is true we expect the F-statistic to be close to 1. Otherwise we expect
F to be greater than 1. In general, under suitable assumptions, the distribution
of F is known, and is called the F-distribution. Using the F-distribution we can assign
a p-value to F, and perform hypothesis testing.
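
Here is a small sketch of the nested-model F-test on synthetic data (the dimensions, the coefficients, and the fact that H0 happens to be true are assumptions of the example); SciPy is used only for the tail probability of the F-distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, q = 100, 4, 2                     # test H0: the last q of the p slopes are zero
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.0])     # H0 is actually true here
y = X @ beta + rng.normal(size=n)

def rss(design):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ coef) ** 2)

rss_full = rss(X)
rss_small = rss(X[:, : 1 + p - q])      # smaller model without the last q predictors

F = ((rss_small - rss_full) / q) / (rss_full / (n - p - 1))
p_value = stats.f.sf(F, q, n - p - 1)   # right-tail probability of the F-distribution
print(F, p_value)
```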
Let us inspect the F-statistic more closely. Let ŷ and ỹ be the estimates of
y using the larger and the smaller models respectively. We know that ỹ is the
orthogonal projection of y on the subspace W which is spanned by the vectors
1, x1 , . . . , xp−q , and ŷ is the orthogonal projection of y on the subspace V which is
spanned by
1, x1 , . . . , xp−q , . . . , xp .
Obviously we have ŷ ∈ V and ỹ ∈ W ⊂ V . Also note that due to the properties of
orthogonal projections we have
y − ỹ ∈ W ⊥ , y − ŷ ∈ V ⊥ .
Therefore we get
RSS0 = ‖y − ỹ‖² = ‖y − ŷ + ŷ − ỹ‖²
     = ‖y − ŷ‖² + ‖ŷ − ỹ‖²        (since y − ŷ ⊥ ŷ − ỹ)
     = RSS + ‖ŷ − ỹ‖².
An immediate consequence of the above equation is that RSS decreases as we add
more predictors to the model. Intuitively, this happens because RSS measures the
prediction error using the training data, and with more predictors we have more
degrees of freedom, so the training error becomes smaller.
Let us find a geometric description of ŷ − ỹ. Let U = V ∩ W ⊥ be the orthogonal
complement of W in V . Then note that
y − (ŷ − ỹ) = (y − ŷ) + ỹ ∈ U ⊥ ,
since y − ŷ ∈ V ⊥ ⊂ U ⊥ , and ỹ ∈ W = (W ⊥ )⊥ ⊂ U ⊥ . Hence ŷ − ỹ must be the
orthogonal projection of y onto the subspace U . Therefore we have the following
formula for the F-statistic:

F = ( ‖ŷ − ỹ‖²/q ) / ( ‖y − ŷ‖²/(n − p − 1) ) = ((n − p − 1)/q) · ‖P_U y‖² / ‖y − P_V y‖²
  = ((n − p − 1)/q) · ‖P_{V∩W⊥} y‖² / ‖P_{V⊥} y‖²,
where P denotes the orthogonal projection operator on a subspace.

2.2.5 F-Statistic versus t-statistic
For each predictor we have a t-statistic. It can be shown that the square of this
t-statistic is the F-statistic corresponding to omitting that variable from the model,
i.e. when q = 1. To see this, note that we have

X = [Xp̃  xp] = [1  x1  . . .  xp],

where xj = [x1j  . . .  xnj]ᵀ, and 1 = [1  1  . . .  1]ᵀ ∈ Rⁿ. Let

H = X(XᵀX)⁻¹Xᵀ,        Hp̃ = Xp̃(Xp̃ᵀXp̃)⁻¹Xp̃ᵀ.

Then ŷ = Hy and ỹ = Hp̃ y are respectively the predictions of Y based on all the
predictors and the first p − 1 predictors. Let Vj be the linear subspace spanned by
the first j + 1 columns of X. We have Vj ⊂ Vj+1 , and since we assume that the
columns of X are linearly independent, we have dim Vj = j + 1. We also know that
the predictions based on least squares method actually give us the orthogonal pro-
jections on the subspace spanned by the columns. Hence H, Hp̃ are the projection
(hat) matrices on the subspaces Vp , Vp−1 , respectively.
Let x̃p = Hp̃ xp be the projection of xp on Vp−1 , i.e. the prediction of Xp using
the other predictors. Then we have

ŷ = Hy = Xβ̂ = β̂0 1 + β̂1 x1 + · · · + β̂p−1 xp−1 + β̂p xp


= β̂0 1 + β̂1 x1 + · · · + β̂p−1 xp−1 + β̂p x̃p + β̂p (xp − x̃p ).

But note that xp − x̃p is orthogonal to Vp−1 , and

v = β̂0 1 + β̂1 x1 + · · · + β̂p−1 xp−1 + β̂p x̃p ∈ Vp−1 .

Now we have y − v = y − ŷ + β̂p (xp − x̃p ). But y − ŷ is orthogonal to Vp ⊃ Vp−1 ,


and xp − x̃p is orthogonal to Vp−1 ; so y − v is orthogonal to Vp−1 . Hence v must
be the orthogonal projection of y on Vp−1 , i.e. v = ỹ = Hp̃ y. Thus we have

RSS0 = ‖y − ỹ‖² = ‖y − ŷ‖² + β̂p² ‖xp − x̃p‖² = RSS + β̂p² ‖xp − x̃p‖²,

since xp − x̃p belongs to Vp , and y − ŷ is orthogonal to Vp . Again, we see that RSS


must decrease (or remain constant) when we add new predictors.
On the other hand we have

XᵀX = [ Xp̃ᵀ ] [Xp̃  xp]  =  [ Xp̃ᵀXp̃    Xp̃ᵀxp ]
      [ xpᵀ ]               [ xpᵀXp̃    xpᵀxp ].

Thus by using Schur complements we obtain

(XᵀX)⁻¹ = [ ∗    ∗                                       ]
          [ ∗    (xpᵀxp − xpᵀXp̃(Xp̃ᵀXp̃)⁻¹Xp̃ᵀxp)⁻¹         ]

        = [ ∗    ∗                       ]
          [ ∗    (xpᵀxp − xpᵀHp̃xp)⁻¹     ].

Hence we have

SE(β̂p)² = σε² (xpᵀxp − xpᵀHp̃xp)⁻¹
        = σε² (xpᵀxp − xpᵀx̃p)⁻¹ = σε² (xpᵀ(xp − x̃p))⁻¹
        = σε² (‖xp − x̃p‖² + x̃pᵀ(xp − x̃p))⁻¹ = σε² / ‖xp − x̃p‖²,

since xp − x̃p is orthogonal to Vp−1 , which contains x̃p . (Note that this provides a
new formula, and interpretation, for SE(β̂p )2 .)
Finally we get

t² = β̂p² / ŜE(β̂p)² = β̂p² ‖xp − x̃p‖² / σ̂ε² = [ (RSS0 − RSS)/1 ] / [ RSS/(n − p − 1) ] = F,

as desired. Here σ̂ε² is an estimate of σε², for which we have substituted RSE².
The problem with individual t-statistics is that when there are many predictors,
some of their associated p-values can be small due to chance. And these randomly
small p-values do not imply that there is a relationship between the response and
their corresponding predictors. Therefore we use the F-statistic to see if there
is a relation between the response and predictors in multiple regression. Notice
that in the joint distribution of all the t-statistics, the event of some of their p-
values being small is quite likely! So the smallness of some of the p-values is not an
improbable event, and based on it we cannot reject the null hypothesis. In contrast,
the smallness of the p-value of the F-statistic is an improbable event, and based on
it we can reject H0 .

2.2.6 R2 Statistic in Multiple Linear Regression


In multiple linear regression R2 is equal to Cor2 (Y, Ŷ ), i.e. the square of the corre-
lation between the response and its predicted value in the linear model. In addition,
the least squares linear model maximizes this correlation (not its square) among
all possible linear models. Let us prove these claims. First note that y − ŷ is

orthogonal to the image of the linear map γ ↦ Xγ, so it is in particular orthogonal to

X [ β̂0 − ȳ ]    [ 1  x11  x12  . . .  x1p ] [ β̂0 − ȳ ]    [ ŷ1 − ȳ ]
  [ β̂1     ]  = [ 1  x21  x22  . . .  x2p ] [ β̂1     ]  = [ ŷ2 − ȳ ]
  [ ⋮      ]    [ ⋮    ⋮    ⋮          ⋮  ] [ ⋮      ]    [ ⋮      ]
  [ β̂p     ]    [ 1  xn1  xn2  . . .  xnp ] [ β̂p     ]    [ ŷn − ȳ ].

Hence the inner product of y − ŷ and the above vector is zero, i.e.

Σ_{i=1}^n (yi − ŷi)(ŷi − ȳ) = 0.

Thus we have

TSS − RSS = Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n (yi − ŷi)²
          = Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n (yi − ŷi)² − Σ_{i=1}^n (yi − ŷi)(ŷi − ȳ)
          = Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n (yi − ŷi)(yi − ŷi + ŷi − ȳ)
          = Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n (yi − ŷi)(yi − ȳ)
          = Σ_{i=1}^n (yi − ȳ)(yi − ȳ − yi + ŷi) = Σ_{i=1}^n (yi − ȳ)(ŷi − ȳ).

Now suppose z is the prediction of Y using some other linear model. Then we
have z = Xγ for some vector γ. Similarly to the above, we can see that y − ŷ is
orthogonal to [z1 − z̄  . . .  zn − z̄]ᵀ. Hence their inner product is zero too, i.e.

Σ_{i=1}^n (yi − ŷi)(zi − z̄) = 0.

Therefore we have

(n − 1)qyz = Σ_{i=1}^n (yi − ȳ)(zi − z̄)
           = Σ_{i=1}^n (yi − ŷi + ŷi − ȳ)(zi − z̄)
           = Σ_{i=1}^n (yi − ŷi)(zi − z̄) + Σ_{i=1}^n (ŷi − ȳ)(zi − z̄)
           = 0 + Σ_{i=1}^n (ŷi − ȳ)(zi − z̄)        (since the mean of the ŷi equals ȳ)
           = (n − 1)qŷz.

So qyz = qŷz. In particular for z = ŷ we have

qyŷ = qŷŷ = sŷ².

Hence we get

R² = (TSS − RSS)/TSS = Σ_{i=1}^n (yi − ȳ)(ŷi − ȳ) / Σ_{i=1}^n (yi − ȳ)²
   = qyŷ / sy² = qyŷ sŷ² / (sy² sŷ²) = qyŷ qyŷ / (sy² sŷ²) = qyŷ² / (sy² sŷ²) = Cor²(Y, Ŷ),

as desired.
On the other hand, we have

Cor(Y, Z) = qyz / (sy sz) = qŷz / (sy sz) = sŷ² qŷz / (sŷ² sy sz) = ( qŷy / (sy sŷ) ) ( qŷz / (sŷ sz) )
          = Cor(Y, Ŷ) Cor(Ŷ, Z) ≤ Cor(Y, Ŷ) · 1 = Cor(Y, Ŷ).

Note that in the above inequality we used the facts that Cor(Ŷ, Z) ≤ 1, and

Cor(Y, Ŷ) = qyŷ / (sy sŷ) = sŷ² / (sy sŷ) = sŷ / sy ≥ 0.

As a side note, observe that if we use the inequality Cor(Y, Ŷ ) ≤ 1 in the above
equation we obtain
sŷ ≤ sy .
In other words, we have also shown that the correlation of Y, Ŷ cannot be negative,
and the variance of Ŷ cannot be larger than the variance of Y .

2.2.7 Confidence and Prediction Intervals


Remember that we have

β̂ = β + (XᵀX)⁻¹Xᵀε.

This relation allowed us to compute the expected value and the covariance matrix
of β̂, yielding the formulas

E[β̂] = β,        E[(β̂ − β)(β̂ − β)ᵀ] = σε² (XᵀX)⁻¹.

Now suppose that based on β̂ we make a prediction at a new observation of predictors

x0ᵀ = [1  x01  x02  . . .  x0p],

to obtain the estimate ŷ0 = x0ᵀβ̂ for the response value y0 = x0ᵀβ + ε0. Then the
expected value and variance of ŷ0 are

E[ŷ0] = E[x0ᵀβ̂] = x0ᵀ E[β̂] = x0ᵀβ,

Var[ŷ0] = E[(ŷ0 − E[ŷ0])²] = E[(x0ᵀβ̂ − x0ᵀβ)²]
        = E[(x0ᵀ(β̂ − β))(x0ᵀ(β̂ − β))]
        = E[(x0ᵀ(β̂ − β))(x0ᵀ(β̂ − β))ᵀ]        (since aᵀ = a for a scalar)
        = E[x0ᵀ(β̂ − β)(β̂ − β)ᵀx0]
        = x0ᵀ E[(β̂ − β)(β̂ − β)ᵀ] x0 = σε² x0ᵀ(XᵀX)⁻¹x0.

Therefore the expected value and variance of x0ᵀβ − ŷ0 are

E[x0ᵀβ − ŷ0] = E[x0ᵀβ] − E[ŷ0] = 0,        (since β is constant)

Var[x0ᵀβ − ŷ0] = E[(x0ᵀβ − ŷ0)²]
              = E[(ŷ0 − E[ŷ0])²] = Var[ŷ0] = σε² x0ᵀ(XᵀX)⁻¹x0.

This variance can be used to compute confidence intervals, which are estimates
of how much x0ᵀβ (which is the expected value of y0 given x0, since E[ε0] = 0)
differs from the predicted value ŷ0.
Next let us find the expected value and variance of y0 − ŷ0. We have

E[y0 − ŷ0] = E[x0ᵀβ + ε0 − ŷ0] = x0ᵀβ + E[ε0] − E[ŷ0] = x0ᵀβ + 0 − x0ᵀβ = 0.

Note that ε0 and β̂ are independent, because ε0 and ε are independent. Hence we get

Var[y0 − ŷ0] = E[(y0 − ŷ0)²] = E[(x0ᵀβ + ε0 − x0ᵀβ̂)²]
            = E[(x0ᵀ(β̂ − β))²] − 2E[x0ᵀ(β̂ − β)ε0] + E[ε0²]
            = σε² x0ᵀ(XᵀX)⁻¹x0 − 2x0ᵀ E[β̂ − β] E[ε0] + Var[ε0]
            = σε² x0ᵀ(XᵀX)⁻¹x0 − 0 + σε² = σε² ( x0ᵀ(XᵀX)⁻¹x0 + 1 ).

The above variance can be used to compute prediction intervals, which are esti-
mates of how much y0 differs from the predicted value ŷ0 .

Note that prediction intervals provide upper and lower estimates for a single
value of response, i.e. y0 . In contrast, confidence intervals provide such estimates
for the average value of response. The reason is that the expected value of 0 is
zero, thus its effects are expected to diminish in average, leaving us with xT0 β as
the average value of response.
Also note that both prediction and confidence intervals are centered around the
predicted value ŷ0 , but the prediction interval is larger, since the variance of y0 − ŷ0
is larger than the variance of xT0 β − ŷ0 . Another way of looking at this is that the
variance of xT0 β − ŷ0 consists of the reducible error, while the variance of y0 − ŷ0
consists of both reducible and irreducible errors.
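
The two variances above can be turned into intervals. The sketch below does this for a simple synthetic example, using t-quantiles with n − p − 1 degrees of freedom, which is the usual recipe under the additional assumption of normally distributed errors (an assumption not derived in the text above).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
dof = n - X.shape[1]                                   # n - p - 1
rse2 = np.sum((y - X @ beta_hat) ** 2) / dof           # estimate of sigma_eps^2

x0 = np.array([1.0, 0.5])                              # new observation (intercept first)
y0_hat = x0 @ beta_hat
h0 = x0 @ np.linalg.inv(X.T @ X) @ x0                  # x0^T (X^T X)^{-1} x0
t = stats.t.ppf(0.975, dof)                            # two-sided 95% quantile

conf = (y0_hat - t * np.sqrt(rse2 * h0),       y0_hat + t * np.sqrt(rse2 * h0))
pred = (y0_hat - t * np.sqrt(rse2 * (h0 + 1)), y0_hat + t * np.sqrt(rse2 * (h0 + 1)))
print("95% confidence interval:", conf)
print("95% prediction interval:", pred)
```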

2.3 Further Topics in Linear Regression


2.3.1 Polynomial Regression and Basis Functions
We can use the method of least squares to fit nonlinear functions instead of linear
functions. For example, we can fit polynomial functions to the data. To this
end, we still use the linear regression approach, but we use new features which are
themselves nonlinear functions of the original features. For example, instead of just
using a feature X, we can use the features

X1 = X,   X2 = X²,   . . . ,   Xm = X^m,

to find a degree m polynomial that minimizes the sum of squared residuals. In


other words, we can estimate the coefficients in a relation of the form

Y = β0 + β1X1 + β2X2 + · · · + βmXm + ε
  = β0 + β1X + β2X² + · · · + βmX^m + ε,

by the usual method of least squares in multiple linear regression. This is known
as polynomial regression.
When we initially have more than one feature, say X1, . . . , Xp, we can use new
features which are monomial functions of several variables. In particular, we can
also use interaction terms of the form X1 X2 as new features.
We can use the above idea to fit functions other than polynomials too. In
general, we can use the new features

b1 (X), b2 (X), . . . , bm (X)

in a multiple linear regression, where b1 , . . . , bm are given functions. (Note that X


can itself be a vector of several features, in which case b1 , . . . , bm will be functions
of several variables.) The functions b1 , . . . , bm are called basis functions.
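
As a small illustration (the data and the degree are arbitrary assumptions), polynomial regression is just ordinary least squares applied to a design matrix whose columns are the basis functions 1, x, x², . . . :

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 150)
y = 1 - x + 0.5 * x**3 + rng.normal(scale=0.3, size=150)   # assumed cubic truth

# Polynomial regression = linear regression on the basis functions 1, x, x^2, x^3.
degree = 3
B = np.vander(x, degree + 1, increasing=True)    # columns: x^0, x^1, ..., x^degree
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
print(coef)                                      # estimates of beta_0, ..., beta_3
```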

2.3.2 Qualitative Predictors and Dummy Variables
Let X be a quantitative feature whose range contains the values c1 , . . . , ck . A
particular choice of basis functions are piecewise constant functions:

C0 (X) = I(X < c1 ),


C1 (X) = I(c1 ≤ X < c2 ),
..
.
Ck−1 (X) = I(ck−1 ≤ X < ck ),
Ck (X) = I(X ≥ ck ),

where I is the indicator function whose value is 1 if the inside condition is


satisfied, and 0 otherwise. The new features C0 (X), . . . , Ck (X) are also called
dummy variables. Note that a linear combination of C0 (X), . . . , Ck (X) is a
piecewise constant function, i.e. it is constant on each of the intervals [cj , cj+1 ).
Note that we always have

C0 (X) + C1 (X) + · · · + Ck (X) = 1,

since X belongs to exactly one of the intervals. Hence there is a linear dependence
relation between the dummy variables and the constant function 1. So if we use the
intercept term in the regression, which is itself a linear combination of the constant
function 1, we have to put aside one of the dummy variables, and work with the
rest of them. The choice of the dropped dummy variable is arbitrary due to the
above linear dependence relation.
The idea of dummy variables is particularly helpful when X is a qualitative
feature with k levels c1 , . . . , ck (note that c1 , . . . , ck are not necessarily numbers
here). In this case the dummy variables become

C1 (X) = I(X = c1 ),
..
.
Ck−1 (X) = I(X = ck−1 ),
Ck (X) = I(X = ck ).

These new variables are quantitative variables, and can be used in an ordinary least
squares regression. So we can deal with the case where some of the features are
qualitative variables.
Note that each dummy variable has only two levels, and we use k two level
dummy variables to quantify the qualitative variable X, and we do not use a single
k-level dummy variable. The reason is that if we want to use a dummy variable with

more than two levels, we have to code the levels of X by numbers. And in doing so
we will impose a linear order on the levels of X, together with a notion of distance
between those levels. However, in general, the levels of a qualitative variable are
not ordered linearly, nor is there a well-defined notion of distance between them.
Hence we use several two-level dummy variables instead. In this way, we avoid
imposing any order on the levels. Also, we do not assign any notion of distance
between different levels, since the 0/1 coding of belonging or not belonging to a
class merely expresses that classes are different.

2.3.3 Heteroscedasticity and Weighted Least Squares


So far we have assumed that in the relation Y = f(X) + ε, the error term ε does not
depend on X. Note that the variance of Y under the condition that X is constant
is
Var(Y) = Var(f(X) + ε) = Var(ε).
Hence if ε does not depend on X then Var(Y) does not depend on the value of X
either. However, in general, this assumption does not hold, and the variance of Y
can depend on the value of X. This phenomenon is known as heteroscedasticity.
In the presence of heteroscedasticity (and under the assumption that f is linear),
the observed data will satisfy
yi = β0 + β1 xi1 + · · · + βp xip + εi,
in which the εi are independent random variables with zero mean, whose variances σi²
depend on i, unlike before.
To estimate the coefficients β0 , . . . , βp under the above assumptions we can use
the weighted least squares method. In this method, instead of minimizing RSS,
we minimize a weighted version of it:
Σ_{i=1}^n wi (yi − ŷi)² = Σ_{i=1}^n wi (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip)²,

where wi = 1/σi². Let W be the diagonal matrix with diagonal entries wi. Then the
above weighted sum of squares can be written as
(y − Xβ̂)T W(y − Xβ̂).
By differentiating this expression with respect to β̂, similarly to Section 2.2.1, we
can easily show that the minimizer is given by
XT WXβ̂ = XT Wy,
or equivalently by
β̂ = (XT WX)−1 XT Wy,
when XT WX is invertible.
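
A minimal sketch of weighted least squares on synthetic heteroscedastic data, where the σi are assumed known so that the weights wi = 1/σi² can be formed directly:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.uniform(1, 5, n)
sigma_i = 0.2 * x                                  # assumed: noise grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma_i)      # heteroscedastic responses

X = np.column_stack([np.ones(n), x])
W = np.diag(1.0 / sigma_i**2)                      # weights w_i = 1 / sigma_i^2

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print("OLS:", beta_ols)
print("WLS:", beta_wls)
```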

2.3.4 Outliers and Studentized residuals
Outliers are data points which are far from the rest of the data. At an outlier the
response has an unusual value. The predictors may have unusual values at a point
too. In this section and the next one we consider statistics that can help us detect
outliers and points with outlying predictor values.
Recall that H = X(XT X)−1 XT is the projection (hat) matrix on the image
of the linear map defined by X. Then we have ŷ = Hy. Now note that
HX = X(XT X)−1 XT X = X(XT X)−1 (XT X) = X.
Therefore (I − H)X = X − HX = X − X = 0. Hence we get

y − ŷ = y − Hy = (I − H)y = (I − H)(Xβ + ε) = (I − H)ε.
In addition, note that we have
(I − H)2 = I − IH − HI + H2 = I − 2H + H = I − H,
since H2 = H (because H is a projection matrix). In fact, it can be shown that
I − H is the projection matrix on the orthogonal complement of the image of X;
so its square must be equal to itself. This also implies that (I − H)X must be zero,
because the columns of X belong to its image, and the projection of the image of
X onto its orthogonal complement is zero.
Now let us compute the covariance matrix of y − ŷ. First note that
E[y − ŷ] = E[Xβ + ε − Xβ̂] = Xβ + 0 − X E[β̂] = Xβ − Xβ = 0.

Thus we have

E[(y − ŷ)(y − ŷ)ᵀ] = E[((I − H)ε)((I − H)ε)ᵀ]
                   = E[(I − H)εεᵀ(I − H)ᵀ]
                   = (I − H) E[εεᵀ] (I − H)ᵀ        (since H is constant)
                   = (I − H) E[εεᵀ] (I − H)          (since H is symmetric)
                   = (I − H) σε² I (I − H) = σε² (I − H)² = σε² (I − H).

Therefore we have Var(ei) = Var(yi − ŷi) = (σε²(I − H))ii = σε²(1 − hi), where hi is
the ith diagonal entry of H, and is known as the leverage (discussed below). This
enables us to compute the standard error of the residual ei:

SE(ei) = σε √(1 − hi).
The studentized residual ti is defined as the ratio of the residual ei to its standard
error, i.e.
ti = ei / ( σε √(1 − hi) ).

Note that when the leverage hi is close to 1 (which is its maximum as we will see
below), or the residual ei is large, then the studentized residual ti will be large. An
observation with a large studentized residual can be an outlier.
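
The leverages and studentized residuals can be computed directly from the hat matrix. In the sketch below (synthetic data, with an outlier injected on purpose), RSE is plugged in as the estimate of σε:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 5.0                                            # make the first point an outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T                   # hat matrix
h = np.diag(H)                                         # leverages
e = y - H @ y                                          # residuals
sigma_hat = np.sqrt(np.sum(e**2) / (n - X.shape[1]))   # RSE as the estimate of sigma_eps
t = e / (sigma_hat * np.sqrt(1 - h))                   # studentized residuals
i = np.argmax(np.abs(t))
print("largest |t| at index", i, "with value", t[i])
```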
Also note that the variance of ei = yi − ŷi cannot be computed similarly
to the variance of y0 − ŷ0 (which is computed in the previous subsection). Be-
cause unlike ε0, the error term εi is not independent of the random vector
ε = [ε1, . . . , εi, . . . , εn]ᵀ.

2.3.5 Leverage
The leverage statistic hi is the ith diagonal entry of the projection matrix H =
X(XᵀX)⁻¹Xᵀ. By using the formula for (XᵀX)⁻¹ we obtain

H = (1/(n − 1)) X [ n⁻¹(n − 1) + x̄ᵀQ⁻¹x̄    −x̄ᵀQ⁻¹ ] Xᵀ,
                   [ −Q⁻¹x̄                   Q⁻¹    ]

where Q is the matrix of sample covariances, and x̄ is the vector of sample means.
Let xi = [xi1  . . .  xip]ᵀ be the vector of the ith observation. Then the ith row of X
is [1  xiᵀ]. Hence we get

hi = (1/(n − 1)) [1  xiᵀ] [ n⁻¹(n − 1) + x̄ᵀQ⁻¹x̄    −x̄ᵀQ⁻¹ ] [ 1  ]
                           [ −Q⁻¹x̄                   Q⁻¹    ] [ xi ]

   = (1/(n − 1)) [1  xiᵀ] [ n⁻¹(n − 1) + x̄ᵀQ⁻¹(x̄ − xi) ]
                           [ −Q⁻¹(x̄ − xi)               ]

   = 1/n + (1/(n − 1)) (x̄ − xi)ᵀQ⁻¹(x̄ − xi).
Now note that Q is positive definite since as we have seen before, Q is invertible,
and satisfies
(n − 1)Q = (X̃ − 1x̄ᵀ)ᵀ(X̃ − 1x̄ᵀ),

where X = [1  X̃]. Therefore Q⁻¹ is also positive definite. Thus the nonnegative
number

(x̄ − xi)ᵀQ⁻¹(x̄ − xi)
(x̄ − xi )T Q−1 (x̄ − xi )
measures the square of the distance of the ith observation from the mean, with
respect to the inner product induced by Q−1 . This distance is known as the
Mahalanobis distance, and intuitively, it measures the distance between a point
xi and the mean x̄ in the units of standard deviations. Hence an observation with
high leverage has outlying predictor values. In the simple linear regression the co-
variance matrix Q has only one entry, i.e. the sample variance; hence the leverage
statistic is given by
1 1 (xi − x̄)2 1 (xi − x̄)2
hi = + = + n .
s2x 2
P
n n−1 n j=1 (xj − x̄)

Another property of hi, which motivates the name leverage, is that

hi = ∂ŷi / ∂yi,
since we have ŷ = Hy. So at points with high leverage, small errors in measurement
of y can result in significant errors in the predicted value ŷ. This can also be seen
from the fact that the variance of ŷi is a multiple of hi . To show this remember
that ŷ = Xβ̂. Hence
E[ŷ] = E[Xβ̂] = X E[β̂] = Xβ.
Therefore the covariance matrix of ŷ is

E[(ŷ − Xβ)(ŷ − Xβ)ᵀ] = E[(Xβ̂ − Xβ)(Xβ̂ − Xβ)ᵀ]
                      = E[X(β̂ − β)(X(β̂ − β))ᵀ]
                      = E[X(β̂ − β)(β̂ − β)ᵀXᵀ]
                      = X E[(β̂ − β)(β̂ − β)ᵀ] Xᵀ        (since X is constant)
                      = X σε² (XᵀX)⁻¹ Xᵀ = σε² H.
So we get Var(ŷi) = (σε² H)ii = σε² hi. Notice that this variance cannot be computed
similarly to the case of ŷ0 (which is computed in a previous subsection), because as
we noted before, unlike ε0, the error term εi is not independent of the random
vector ε = [ε1, . . . , εi, . . . , εn]ᵀ.
The average leverage for all the observations is always equal to (p+1)/n, because
Σ hi = tr(H) = tr(X(XᵀX)⁻¹Xᵀ) = tr((XᵀX)⁻¹XᵀX) = tr(I) = p + 1.

Furthermore, the leverage statistic is always between 1/n and 1. We have already
seen that hi ≥ 1/n. To show the other bound note that we have Hᵀ = H = H²,
since H is the matrix of an orthogonal projection. Hence

hi = Hii = (H²)ii = Σ_j Hij Hji = Σ_j (Hij)² = hi² + Σ_{j≠i} (Hij)²;

so hi² ≤ hi, which implies hi ≤ 1.
so h2i ≤ hi , which implies hi ≤ 1.

2.3.6 Collinearity
Suppose that we can estimate the predictor Xj using the other predictors quite
accurately. This phenomenon is known as (multi)collinearity. When collinearity
is present, our uncertainty in our estimate of the corresponding regression coefficient
will be high, since we know that

SE(β̂j)² = σε² / ‖xj − x̃j‖²,        (∗)

where x̃j is the prediction of xj based on the other predictors. As a result, the
corresponding t-statistic will be small, and we may fail to reject the null hypothesis
H0 : βj = 0.
To assess the severity of multicollinearity we can use the variance inflation
factor (VIF), which is the ratio of the variance of β̂j in the full model with all the
predictors to its variance in the model with only one predictor Xj , i.e.

SE(β̂j )2
VIF(β̂j ) = .
SE(β̂j, simple )2

In other words, VIF measures the change in our uncertainty in β̂j when we extend
the model with only one predictor Xj by adding the other predictors.
Let R²_{Xj|X−j} be the R² statistic from regressing Xj onto all the other predictors.
Let RSSj|−j, TSSj|−j be respectively the RSS and TSS of this regression. Also let
x̃j = [x̃1j  . . .  x̃nj]ᵀ be the prediction of xj = [x1j  . . .  xnj]ᵀ based on the
other predictors. Then we have

1 / (1 − R²_{Xj|X−j}) = 1 / (RSSj|−j / TSSj|−j) = TSSj|−j / RSSj|−j
                      = Σ_i (xij − x̄j)² / Σ_i (xij − x̃ij)² = Σ_i (xij − x̄j)² / ‖xj − x̃j‖²
                      = ( σε² / ‖xj − x̃j‖² ) / ( σε² / Σ_i (xij − x̄j)² )
                      = SE(β̂j)² / SE(β̂j, simple)² = VIF(β̂j).

As a result we can see that the smallest possible value for VIF is 1, since R²_{Xj|X−j}
is between zero and 1 (remember that R²_{Xj|X−j} is equal to the square of the correlation
of Xj and its prediction X̃j). When VIF is 1 we must have R²_{Xj|X−j} = 0; thus the
other predictors provide no information about Xj, and there is no collinearity. On
the other hand, if R²_{Xj|X−j} is close to 1 then there is considerable collinearity, and
VIF has a large value.
We can also see that the smallest possible value for VIF is 1 by noting that
SE(β̂j )2 increases as we add more predictors to the model. The reason is that our
prediction of xj based on the other predictors becomes more accurate when we
add more predictors to the model, because our prediction of xj is the orthogonal
projection of xj onto the subspace spanned by our observations of the other pre-
dictors. Hence as we add more predictors to the model, this subspace becomes
larger. And therefore the projection of xj becomes closer to xj , since we know that
the orthogonal projection of a vector like xj onto a subspace is the closest vector
in that subspace to the given vector xj ; and if we enlarge the subspace the new

closest vector to xj cannot be farther away from xj than the old closest vector.
Therefore by formula (∗) the SE(β̂j )2 increases as we add more predictors to the
model. (Note that in the case of simple linear regression, the prediction of xj us-
ing no other predictor is x̄j 1; and plugging this value in the formula (∗) gives us
β̂j, simple .)
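
VIF can be computed exactly as above: regress each predictor on the others and take TSS/RSS = 1/(1 − R²). The following sketch uses synthetic data in which two predictors are deliberately made nearly collinear:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)        # assumed: x2 nearly collinear with x1
x3 = rng.normal(size=n)
Xfull = np.column_stack([x1, x2, x3])

def vif(j):
    # Regress predictor j on the other predictors (plus an intercept); VIF = TSS/RSS.
    target = Xfull[:, j]
    others = np.column_stack([np.ones(n), np.delete(Xfull, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    rss = np.sum((target - others @ coef) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    return tss / rss

print([round(vif(j), 1) for j in range(3)])   # large VIFs for the collinear pair
```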

2.3.7 Maximum Likelihood Estimation


Suppose the probability measure P describes the distribution of some quantities
among a population. We assume that P belongs to a parametric family of prob-
ability measures Pθ , but we do not know which value of θ gives us P. In other
words, P = Pθ0 for some unknown θ0 . Let f (x; θ) denote the probability density
(mass) function of Pθ . Suppose X1 , . . . , Xn are a random sample from this distri-
bution, and x1 , . . . , xn are their observed values. (Note that Xi s can be random
vectors, and so xi s can be vectors too.) Then, the likelihood function of θ is
the joint probability density (mass) function of X1 , . . . , Xn at x1 , . . . , xn , which, if
X1 , . . . , Xn are independent, is equal to
L(θ) = L(θ | x1, . . . , xn) = Π_{i≤n} f(xi; θ).

Note that the likelihood function is a function of θ, and x1 , . . . , xn are considered


as (fixed) parameters here.
Now remember that we do not know the value of θ0 , and we would like to
estimate it by using the training data x1 , . . . , xn . The maximum likelihood
estimate for θ0 is a value θ̂0 such that L(θ̂0 ) ≥ L(θ) for all θ, i.e. a point of absolute
maximum of L. In other words, θ̂0 is the value of θ that maximizes the probability
of observing x1 , . . . , xn (or the probability of observing values close to x1 , . . . , xn ,
in the case of continuous probability measures). In practice we usually maximize
the log-likelihood function
log L(θ) = Σ_{i≤n} log f(xi; θ)

instead of the likelihood function, since it is easier to work with. Note that the
logarithm is a strictly increasing function, so maximizing the log-likelihood results
in the same estimate.
Let us see an example of maximum likelihood estimation. Suppose f is normal,
i.e.
f(x; µ, σ²) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ).

(Note that here θ is the vector (µ, σ 2 ). Also note that here we abuse the notation
and represent both θ and θ0 by (µ, σ 2 ).) Then the log-likelihood becomes
log L(µ, σ²) = Σ_{i≤n} log f(xi; µ, σ²)
             = Σ_{i≤n} [ −(1/2) log(2πσ²) − (1/(2σ²))(xi − µ)² ]
             = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i≤n} (xi − µ)².

At the point of maximum of log L we must have


0 = (∂ log L/∂µ)(µ̂, σ̂²) = (1/σ̂²) Σ_{i≤n} (xi − µ̂) = (1/σ̂²)(nx̄ − nµ̂).

Therefore the maximum likelihood estimate of µ is µ̂ = x̄. In other words, the


maximum likelihood estimate of the population mean of a normally distributed
variable is the sample mean. Similarly we have
0 = (∂ log L/∂σ²)(µ̂, σ̂²) = −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i≤n} (xi − µ̂)² = −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i≤n} (xi − x̄)².

(Note that σ² is the variable which we differentiate with respect to, not σ!) Hence
we have

n/(2σ̂²) = (1/(2σ̂⁴)) Σ_{i≤n} (xi − x̄)²    ⟹    σ̂² = (1/n) Σ_{i≤n} (xi − x̄)².

Chapter 3

Classification

3.1 Logistic Regression


Suppose that the response Y is a qualitative variable with two levels, which we
denote by 0 and 1. Usually in classification, instead of the value of Y , we try to
estimate the probability
p(X) = P(Y = 1|X).
In logistic regression, we assume that p(X) is given by the logistic function

p(X) = e^{β0+β1X} / (1 + e^{β0+β1X}) = 1 / (1 + e^{−(β0+β1X)}).
It is easy to see that
p(X) / (1 − p(X)) = e^{β0+β1X}.
The ratio p(X)/[1 − p(X)] is called the odds. Note that since p(X) is between 0
and 1, the odds is between 0 and ∞. Also note that the odds is a strictly increasing
function of the probability p(X). Now by taking logarithm we get

log( p(X) / (1 − p(X)) ) = β0 + β1X.

The logarithm of the odds is called the log-odds or logit. Hence in logistic regres-
sion we assume that the logit is a linear function of X. Note that logit is also a
strictly increasing function of p(X).
To estimate the coefficients β0 , β1 we use the method of maximum likelihood.
In the logistic regression model, we are interested in the conditional probability
measure P(Y |X). (Comparing with the notation of the previous subsection, here
θ is the vector (β0 , β1 ), and the random vectors (Y1 , X1 ), . . . , (Yn , Xn ) form our

sample.) We want to find the maximum likelihood estimates β̂0 , β̂1 for β0 , β1 . First
note that P(Y |X = xi ) is a Bernoulli random variable, so we have

f(yᵢ, xᵢ; θ) = P(Y = yᵢ | X = xᵢ) = p(xᵢ) if yᵢ = 1, and 1 − p(xᵢ) if yᵢ = 0; in both cases this equals p(xᵢ)^(yᵢ) (1 − p(xᵢ))^(1−yᵢ).
Hence the likelihood function is


L(β₀, β₁) = ∏_i p(xᵢ)^(yᵢ) (1 − p(xᵢ))^(1−yᵢ)
          = ∏_{yᵢ=1} p(xᵢ) · ∏_{yⱼ=0} (1 − p(xⱼ))
          = ∏_{yᵢ=1} e^(β₀+β₁xᵢ)/(1 + e^(β₀+β₁xᵢ)) · ∏_{yⱼ=0} 1/(1 + e^(β₀+β₁xⱼ)).

Now we can compute the log-likelihood function and differentiate it to find its point
of maximum (β̂0 , β̂1 ). However there is no closed formula for β̂0 , β̂1 , and we have
to estimate them using numerical methods.
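A minimal sketch of one such numerical method, gradient ascent on the log-likelihood (assuming NumPy; the simulated data and the learning-rate and iteration settings are only for illustration):

    import numpy as np

    def fit_logistic(x, y, lr=0.5, n_iter=20000):
        # Gradient ascent on the log-likelihood of the logistic model.
        beta0, beta1 = 0.0, 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # p(x_i)
            # gradient of the log-likelihood: sum_i (y_i - p_i) * (1, x_i)
            beta0 += lr * np.mean(y - p)
            beta1 += lr * np.mean((y - p) * x)
        return beta0, beta1

    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x))))
    print(fit_logistic(x, y))   # roughly (-1, 2)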

3.2 Linear Discriminant Analysis


Suppose the response Y is a qualitative variable with levels {1, 2, . . . , K}. Let
πk = P(Y = k) be the prior probability of Y = k, and fk (x) = P(X = x|Y = k) be
the density function of X over {Y = k}. Then by Bayes’ theorem the posterior
probability of Y given X is equal to

p_k(x) = P(Y = k | X = x)
       = P(Y = k) P(X = x | Y = k) / P(X = x)
       = P(Y = k) P(X = x | Y = k) / ∑_{j=1}^K P(Y = j) P(X = x | Y = j)  =  π_k f_k(x) / ∑_{j=1}^K π_j f_j(x).

Note that when X is a continuous random variable we need to work with probability
density functions instead of probabilities.
If we further assume that f_k is normal, i.e. f_k(x) = (1/(σ√(2π))) exp(−(x − µ_k)²/(2σ²)), we get

p_k(x) = [ (π_k/(σ√(2π))) exp(−(x − µ_k)²/(2σ²)) ] / [ ∑_{j=1}^K (π_j/(σ√(2π))) exp(−(x − µ_j)²/(2σ²)) ].

To simplify the notation we denote the denominator of p_k(x) by g(x). Then we have

log p_k(x) = log(π_k) − log(σ√(2π)) − (x − µ_k)²/(2σ²) − log g(x)
           = log(π_k) − log(σ√(2π)) − x²/(2σ²) + (µ_k/σ²) x − µ_k²/(2σ²) − log g(x)
           = (µ_k/σ²) x − µ_k²/(2σ²) + log(π_k) − log(σ√(2π)) − x²/(2σ²) − log g(x).

Hence the value of k that maximizes p_k(x) is the value of k that maximizes

δ_k(x) = (µ_k/σ²) x − µ_k²/(2σ²) + log(π_k).
So this function can be used to determine the Bayes classifier.
Let x₁, y₁, ..., xₙ, yₙ be a set of observations. Let n_k be the number of observations for which yᵢ = k. Consider the estimates

π̂_k = n_k / n,
µ̂_k = (1/n_k) ∑_{yᵢ=k} xᵢ,
σ̂² = (1/∑_{k=1}^K (n_k − 1)) ∑_{k=1}^K (n_k − 1) [ (1/(n_k − 1)) ∑_{yᵢ=k} (xᵢ − µ̂_k)² ]
    = (1/(n − K)) ∑_{k=1}^K ∑_{yᵢ=k} (xᵢ − µ̂_k)².

Note that σ̂² is a weighted average of the sample variances for each of the K classes. Also note that the n − K in the denominator makes σ̂² an unbiased estimator of σ², since each (1/(n_k − 1)) ∑_{yᵢ=k} (xᵢ − µ̂_k)² is an unbiased estimator of σ², and the expectation of a weighted average is equal to the weighted average of the expectations. If we
use the above estimates we obtain the following estimate for δk (x):

δ̂_k(x) = (µ̂_k/σ̂²) x − µ̂_k²/(2σ̂²) + log(π̂_k).
This function is called the discriminant function. Note that the discriminant
function is a linear function in x. Similarly to our use of δk in the Bayes classifier,
we can use δ̂k for classification. This method is known as the linear discriminant
analysis (LDA).
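A minimal sketch of these estimates and of classification by the discriminant function, for a single predictor (assuming NumPy, and that the classes are coded 0, 1, ..., K − 1; the function names are only illustrative):

    import numpy as np

    def lda_fit(x, y, K):
        # Priors, class means and pooled variance estimate for one predictor.
        n = len(x)
        pi_hat = np.array([np.mean(y == k) for k in range(K)])
        mu_hat = np.array([x[y == k].mean() for k in range(K)])
        sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)
        return pi_hat, mu_hat, sigma2_hat

    def lda_classify(x0, pi_hat, mu_hat, sigma2_hat):
        # Evaluate all discriminant functions at x0 and pick the largest one.
        delta = mu_hat / sigma2_hat * x0 - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)
        return int(np.argmax(delta))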

Next let us assume that we have more than one predictor, and f_k is multivariate normal, i.e.

f_k(x) = (1/((2π)^(p/2) |Σ_k|^(1/2))) exp(−(1/2)(x − µ_k)ᵀ Σ_k⁻¹ (x − µ_k)),

where |Σ_k| is the determinant of Σ_k. Note that we allow the covariance matrix Σ_k to depend on k. Hence we have

p_k(x) = [ π_k |Σ_k|^(−1/2) exp(−(1/2)(x − µ_k)ᵀ Σ_k⁻¹ (x − µ_k)) ] / [ ∑_{j=1}^K π_j |Σ_j|^(−1/2) exp(−(1/2)(x − µ_j)ᵀ Σ_j⁻¹ (x − µ_j)) ].

As before, let us denote the denominator of p_k(x) by g(x). Then we get

log p_k(x) = log(π_k) − (1/2) log|Σ_k| − (1/2)(x − µ_k)ᵀ Σ_k⁻¹ (x − µ_k) − log g(x)
           = −(1/2) xᵀ Σ_k⁻¹ x + xᵀ Σ_k⁻¹ µ_k − (1/2) µ_kᵀ Σ_k⁻¹ µ_k − (1/2) log|Σ_k| + log(π_k) − log g(x).

(Note that we have used the fact that Σ_k⁻¹ is symmetric.) Hence the value of k that maximizes p_k(x) is the value of k that maximizes

δ_k(x) = −(1/2) xᵀ Σ_k⁻¹ x + xᵀ Σ_k⁻¹ µ_k − (1/2) µ_kᵀ Σ_k⁻¹ µ_k − (1/2) log|Σ_k| + log(π_k).
2 2 2
So again δ_k can be used to determine the Bayes classifier. We can estimate π_k, µ_k as before, and for the entries of Σ_k we have the estimate

(Σ̂_k)_{jl} = (1/(n_k − 1)) ∑_{yᵢ=k} ((xᵢ)_j − (µ̂_k)_j)((xᵢ)_l − (µ̂_k)_l),

which are simply the sample (co)variances. Therefore we can estimate δ_k(x) by the discriminant function

δ̂_k(x) = −(1/2) xᵀ Σ̂_k⁻¹ x + xᵀ Σ̂_k⁻¹ µ̂_k − (1/2) µ̂_kᵀ Σ̂_k⁻¹ µ̂_k − (1/2) log|Σ̂_k| + log(π̂_k).
Note that this time the discriminant function is a quadratic function in x. If we use
δ̂k for classification, as we used δk in the Bayes classifier, we obtain the quadratic
discriminant analysis (QDA) method.
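A minimal sketch of the estimated QDA discriminant functions with several predictors (assuming NumPy, at least two predictors, and that the data has already been grouped by class; the helper name is only illustrative):

    import numpy as np

    def qda_classify(x0, X_by_class, priors):
        # X_by_class: list of (n_k x p) arrays, one per class; priors: estimated pi_k's.
        deltas = []
        for Xk, pi_k in zip(X_by_class, priors):
            mu_k = Xk.mean(axis=0)
            Sigma_k = np.cov(Xk, rowvar=False)       # sample covariance (divides by n_k - 1)
            Sigma_inv = np.linalg.inv(Sigma_k)
            deltas.append(-0.5 * x0 @ Sigma_inv @ x0
                          + x0 @ Sigma_inv @ mu_k
                          - 0.5 * mu_k @ Sigma_inv @ mu_k
                          - 0.5 * np.log(np.linalg.det(Sigma_k))
                          + np.log(pi_k))
        return int(np.argmax(deltas))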
If we assume that the covariance matrix does not depend on k, the above formulas for QDA reduce to the corresponding formulas for LDA with more than one predictor. In particular, the linear discriminant function is given by

δ̂_k(x) = xᵀ Σ̂⁻¹ µ̂_k − (1/2) µ̂_kᵀ Σ̂⁻¹ µ̂_k + log(π̂_k),

where the entries of Σ̂ are weighted averages of the sample (co)variances of each of the classes:

Σ̂_{jl} = (1/∑_{k=1}^K (n_k − 1)) ∑_{k=1}^K (n_k − 1) [ (1/(n_k − 1)) ∑_{yᵢ=k} ((xᵢ)_j − (µ̂_k)_j)((xᵢ)_l − (µ̂_k)_l) ]
       = (1/(n − K)) ∑_{k=1}^K ∑_{yᵢ=k} ((xᵢ)_j − (µ̂_k)_j)((xᵢ)_l − (µ̂_k)_l).

Let us also mention that when Y only has two levels, the log-odds in the LDA framework is linear in x. To see this note that

p₁(x)/(1 − p₁(x)) = [ π₁ exp(−(1/2)(x − µ₁)ᵀ Σ⁻¹ (x − µ₁)) ] / [ π₂ exp(−(1/2)(x − µ₂)ᵀ Σ⁻¹ (x − µ₂)) ]
                  = (π₁/π₂) exp(−(1/2)[ (x − µ₁)ᵀ Σ⁻¹ (x − µ₁) − (x − µ₂)ᵀ Σ⁻¹ (x − µ₂) ])
                  = (π₁/π₂) exp((µ₁ − µ₂)ᵀ Σ⁻¹ x − (1/2) µ₁ᵀ Σ⁻¹ µ₁ + (1/2) µ₂ᵀ Σ⁻¹ µ₂).

Therefore

log[ p₁(x)/(1 − p₁(x)) ] = log(π₁/π₂) − (1/2) µ₁ᵀ Σ⁻¹ µ₁ + (1/2) µ₂ᵀ Σ⁻¹ µ₂ + (µ₁ − µ₂)ᵀ Σ⁻¹ x,

as desired.

Chapter 4

Model Selection and


Regularization

4.1 Cross-Validation
In order to estimate the test error rate of a learning method we can randomly split
the data into two parts, and use one part for training the method, and the other
part for testing the obtained model. This is the validation set approach. The
idea of cross-validation (CV) is to also switch the roles of the two parts of data,
and use the second part for training the method, and the first part for testing or
validating the model. Then we have two estimates of the test error rate, and we
can use their average as a more stable estimate of the test error rate.
More generally, in the k-fold cross-validation we randomly split the data into
k parts. Then we fit the method k-times, and each time we use one of the k parts
as test data, and the remaining k − 1 parts as training data. Then we get k test
mean squared errors MSE1 , . . . , MSEk , and we can use their average to obtain the
k-fold CV estimate of the test error:
CV(k) = (1/k) ∑_{i=1}^k MSEᵢ.
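A minimal sketch of this procedure for a generic learning method (assuming NumPy; `fit` and `predict` are placeholder callables standing for whatever method is being evaluated):

    import numpy as np

    def k_fold_cv_mse(fit, predict, X, y, k=5, seed=0):
        # fit(X_train, y_train) -> model;  predict(model, X_test) -> predictions.
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)
        mses = []
        for j in range(k):
            test = folds[j]
            train = np.concatenate([folds[m] for m in range(k) if m != j])
            model = fit(X[train], y[train])
            mses.append(np.mean((y[test] - predict(model, X[test])) ** 2))
        return np.mean(mses)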

4.1.1 Leave-One-Out Cross-Validation


In the Leave-one-out cross-validation (LOOCV) we split the data into a vali-
dation set containing only one observation xi , yi , and a training set containing the
n − 1 remaining observations. Then we fit the learning method with the training
set, and at xi we compute the estimate ỹi for the response. The test mean squared
error (MSE) of this fit is
MSEi = (yi − ỹi )2 .

We repeat this process n times for i = 1, 2, . . . , n. Then the LOOCV estimate for
the test MSE is
CV(n) = (1/n) ∑_{i=1}^n MSEᵢ.
In other words, LOOCV is the n-fold CV for a data set with n data points.
When the learning method is the least squares linear regression we also have

CV(n) = (1/n) ∑_{i=1}^n ( (yᵢ − ŷᵢ) / (1 − hᵢ) )²,

where ŷi is the prediction of the response at xi when we fit the model with all
the observations, and hi is the ith leverage of that fit. In particular note that in
this case, to compute CV(n) , we only need to fit the model once. However, for an
arbitrary learning method there is no similar formula, and we have to fit the model
n times to compute CV(n) .
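A minimal sketch of this shortcut for least squares (assuming NumPy, and that X already contains the column of 1s):

    import numpy as np

    def loocv_linear_regression(X, y):
        H = X @ np.linalg.inv(X.T @ X) @ X.T      # projection ("hat") matrix
        h = np.diag(H)                            # leverages h_i
        return np.mean(((y - H @ y) / (1 - h)) ** 2)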
To prove the above relation, it suffices to show that

yᵢ − ỹᵢ = (yᵢ − ŷᵢ) / (1 − hᵢ).
For simplicity we assume that i = n. Let X be the matrix of observations of predictors, and y the vector of observations of the response. Then we have

X = [ X̃ ; xₙᵀ ],    y = [ yñ ; yₙ ]

(here [A ; B] denotes stacking the blocks A and B vertically), where xₙ = [1, x_{n1}, ..., x_{np}]ᵀ, X̃ is the matrix consisting of the first n − 1 rows of X, and yñ is the vector consisting of the first n − 1 components of y. Let ỹ, ŷ be the predictions of y based on n − 1 and n observations respectively. We have

ŷ = [ ŷñ ; ŷₙ ] = X(XᵀX)⁻¹Xᵀy = [ X̃ ; xₙᵀ ](XᵀX)⁻¹Xᵀy = [ X̃(XᵀX)⁻¹Xᵀy ; xₙᵀ(XᵀX)⁻¹Xᵀy ].

We also have Xᵀy = [ X̃ᵀ  xₙ ][ yñ ; yₙ ] = X̃ᵀyñ + yₙxₙ. Hence we get

ỹ = [ ỹñ ; ỹₙ ] = [ X̃(X̃ᵀX̃)⁻¹X̃ᵀyñ ; xₙᵀ(X̃ᵀX̃)⁻¹X̃ᵀyñ ] = X(X̃ᵀX̃)⁻¹X̃ᵀyñ = X(X̃ᵀX̃)⁻¹(Xᵀy − yₙxₙ).    (∗)
Therefore we only need to compute (X̃ᵀX̃)⁻¹. However note that

XᵀX = [ X̃ᵀ  xₙ ][ X̃ ; xₙᵀ ] = X̃ᵀX̃ + xₙxₙᵀ.

So XᵀX − X̃ᵀX̃ is equal to the rank-one matrix xₙxₙᵀ. This enables us to compute the inverse of X̃ᵀX̃.
More generally, suppose A, B are invertible matrices whose difference is a rank-one matrix, i.e. A − B = uvᵀ, where u, v are vectors. Then we have

B⁻¹ = (A − uvᵀ)⁻¹ = (A(I − A⁻¹uvᵀ))⁻¹ = (I − A⁻¹uvᵀ)⁻¹A⁻¹.

If we formally expand the power series of (I − A−1 uv T )−1 we get

(I − A−1 uv T )−1 = I + A−1 uv T + (A−1 uv T )2 + · · ·

But note that (A−1 uv T )2 = A−1 uv T A−1 uv T = h(A−1 uv T ), where the scalar h =
v T A−1 u. Thus by induction we obtain (A−1 uv T )m = hm−1 (A−1 uv T ). Hence the
formal power series becomes
(I − A⁻¹uvᵀ)⁻¹ = I + A⁻¹uvᵀ (1 + h + h² + ···) = I + (1/(1 − h)) A⁻¹uvᵀ,
provided that |h| < 1. It is easy to see that the above matrix is in fact the inverse
of I − A−1 uv T :
(I − A⁻¹uvᵀ)(I + (1/(1 − h)) A⁻¹uvᵀ) = I − A⁻¹uvᵀ + (1/(1 − h)) A⁻¹uvᵀ − (1/(1 − h)) (A⁻¹uvᵀ)²
                                     = I + A⁻¹uvᵀ (−1 + 1/(1 − h) − h/(1 − h)) = I.
Note that the above computation actually shows that if A is invertible and h ≠ 1, then B is also invertible, and we have

B⁻¹ = (I − A⁻¹uvᵀ)⁻¹A⁻¹ = A⁻¹ + (1/(1 − h)) A⁻¹uvᵀA⁻¹.
Also notice that B −1 is obtained from A−1 by adding a rank-one matrix to it.
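A quick numerical check of this rank-one update formula, usually known as the Sherman–Morrison formula (a sketch assuming NumPy; the random matrices are only for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(5, 5)) + 5 * np.eye(5)   # a generic invertible matrix
    u = rng.normal(size=(5, 1))
    v = rng.normal(size=(5, 1))

    B = A - u @ v.T
    A_inv = np.linalg.inv(A)
    h = float(v.T @ A_inv @ u)
    B_inv = A_inv + (A_inv @ u @ v.T @ A_inv) / (1 - h)   # the rank-one update formula

    print(np.allclose(B_inv, np.linalg.inv(B)))   # True (as long as h != 1)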
Now in our original problem, we have h = xₙᵀ(XᵀX)⁻¹xₙ. However note that the projection matrix of the least squares fit with all the observations is

H = X(XᵀX)⁻¹Xᵀ = [ X̃ ; xₙᵀ ](XᵀX)⁻¹[ X̃ᵀ  xₙ ] = [ X̃(XᵀX)⁻¹X̃ᵀ    X̃(XᵀX)⁻¹xₙ
                                                   xₙᵀ(XᵀX)⁻¹X̃ᵀ   xₙᵀ(XᵀX)⁻¹xₙ ].

Therefore the leverage of the nth observation in this fit is

hn = Hn,n = xTn (XT X)−1 xn = h.

Hence if the leverage hₙ ≠ 1, then X̃ᵀX̃ is invertible too, and we have

(X̃ᵀX̃)⁻¹ = (XᵀX)⁻¹ + (1/(1 − hₙ)) (XᵀX)⁻¹xₙxₙᵀ(XᵀX)⁻¹.
If we use this formula in the equation (∗), we get

ỹ = X(X̃ᵀX̃)⁻¹(Xᵀy − yₙxₙ)
  = X[ (XᵀX)⁻¹ + (1/(1 − hₙ)) (XᵀX)⁻¹xₙxₙᵀ(XᵀX)⁻¹ ](Xᵀy − yₙxₙ)
  = X(XᵀX)⁻¹Xᵀy + (1/(1 − hₙ)) X(XᵀX)⁻¹xₙ (xₙᵀ(XᵀX)⁻¹Xᵀy)
    − yₙ X(XᵀX)⁻¹xₙ − (yₙ/(1 − hₙ)) X(XᵀX)⁻¹xₙ (xₙᵀ(XᵀX)⁻¹xₙ)
  = ŷ + ( ŷₙ/(1 − hₙ) − yₙ − yₙhₙ/(1 − hₙ) ) X(XᵀX)⁻¹xₙ = ŷ + ( (ŷₙ − yₙ)/(1 − hₙ) ) X(XᵀX)⁻¹xₙ.
If we subtract both sides of the above equality from y, and look at the nth component of the resulting equation, we obtain

yₙ − ỹₙ = yₙ − ŷₙ − ( (ŷₙ − yₙ)/(1 − hₙ) ) xₙᵀ(XᵀX)⁻¹xₙ
        = yₙ − ŷₙ − ( (ŷₙ − yₙ)/(1 − hₙ) ) hₙ = (yₙ − ŷₙ)( 1 + hₙ/(1 − hₙ) ) = (yₙ − ŷₙ)/(1 − hₙ),
as desired. As a consequence we get

|yₙ − ŷₙ| ≤ ((n − 1)/n) |yₙ − ỹₙ| ≤ |yₙ − ỹₙ|,

since 0 ≤ 1 − hₙ ≤ 1 − 1/n. In other words, adding an observation at some point
improves the linear least squares fit prediction at that point.

4.1.2 Cook’s Distance


Let us further explore the above relation between ỹ, ŷ. We have already shown that

ŷ − ỹ = ( (yₙ − ŷₙ)/(1 − hₙ) ) X(XᵀX)⁻¹xₙ.
Hence

‖ŷ − ỹ‖² = (ŷ − ỹ)ᵀ(ŷ − ỹ)
         = ( (yₙ − ŷₙ)/(1 − hₙ) )² (X(XᵀX)⁻¹xₙ)ᵀ X(XᵀX)⁻¹xₙ
         = ( (yₙ − ŷₙ)/(1 − hₙ) )² xₙᵀ(XᵀX)⁻¹XᵀX(XᵀX)⁻¹xₙ
         = ( (yₙ − ŷₙ)/(1 − hₙ) )² xₙᵀ(XᵀX)⁻¹xₙ = (yₙ − ŷₙ)² hₙ/(1 − hₙ)² = σ_ε² tₙ² hₙ/(1 − hₙ),

where tₙ = eₙ/(σ_ε√(1 − hₙ)) = (yₙ − ŷₙ)/(σ_ε√(1 − hₙ)) is the studentized residual. The quantity

Dₙ = ‖ŷ − ỹ‖²/((p + 1)σ_ε²) = eₙ² hₙ/((p + 1)σ_ε²(1 − hₙ)²) = (tₙ²/(p + 1)) · (hₙ/(1 − hₙ))

is called the Cook’s distance of the nth observation (p is the number of predictors).
It measures the effect of removal of the nth observation on the predictions of the
model. We can similarly compute the Cook’s distance Di of the other observations.
Note that the Cook’s distance is an increasing function of both the studentized
residual and the leverage.
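A minimal sketch of computing the Cook's distances of all observations from a single least squares fit (assuming NumPy, that X contains the column of 1s, and that σ_ε² is estimated by the squared RSE):

    import numpy as np

    def cooks_distances(X, y):
        n, d = X.shape                        # d = p + 1 coefficients
        H = X @ np.linalg.inv(X.T @ X) @ X.T
        h = np.diag(H)
        e = y - H @ y                         # residuals
        sigma2 = (e @ e) / (n - d)            # squared RSE, estimating sigma_eps^2
        t2 = e**2 / (sigma2 * (1 - h))        # squared studentized residuals
        return (t2 / d) * (h / (1 - h))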

4.2 The Bootstrap


Suppose the probability distribution P describes a population. Also suppose that

(x1 , y1 ), . . . , (xn , yn )

is a sample from this population. Then we have the following estimate of P:



P̂(x, y) = { 1/n   if (x, y) = (x₁, y₁),
            1/n   if (x, y) = (x₂, y₂),
            ...
            1/n   if (x, y) = (xₙ, yₙ),
            0     otherwise.

Therefore we can use P̂ to estimate any quantity that we would be able to compute
using P. In particular, we can use P̂ to obtain new samples from the population.
These new samples can for example be used to estimate test error rates, or the
standard error of a coefficient in a parametric learning method.
Now note that when we form a new sample, the P̂-probability of drawing a data point (x, y) outside the original data set is zero; so the new sample will only contain some of the points (xᵢ, yᵢ), and under P̂ these points are all equally likely to appear
in the new sample. In addition, note that when we form this new sample using P̂,
in each step we randomly choose one of the points, say (xⱼ, yⱼ), with probability 1/n. But in the next step we also do the same thing, and we might again randomly
choose the same data point (xj , yj ) as in the previous step. Therefore forming a

new sample using the probability distribution P̂ is the same as sampling from the
data set
(x1 , y1 ), . . . , (xn , yn )
with replacement. These new samples are called bootstrap samples.
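A minimal sketch of using bootstrap samples to estimate the standard error of a statistic (assuming NumPy; the choice of statistic is arbitrary):

    import numpy as np

    def bootstrap_se(x, statistic, B=1000, seed=0):
        rng = np.random.default_rng(seed)
        n = len(x)
        stats = np.empty(B)
        for b in range(B):
            sample = x[rng.integers(0, n, size=n)]   # draw n points with replacement
            stats[b] = statistic(sample)
        return stats.std(ddof=1)   # standard deviation of the bootstrap replicates

    # example: bootstrap_se(x, np.median)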

4.3 Cp Statistic and Adjusted R2


Suppose we have used the observations X, y to train a linear regression model. Also
suppose that the relation between X, y in the population is linear, i.e.

y = Xβ + ε,

where ε = [ε₁, ..., εₙ]ᵀ is a vector of i.i.d. random variables. Remember that the coefficients of the linear model satisfy

β̂ = β + (XᵀX)⁻¹Xᵀε.

Now let us make another set of observations in the population at the same predictor values. These new observations will have different response values due to the random noise ε. We denote these new response values by y′. Then we have

y′ = Xβ + ε′,

where ε′ = [ε′₁, ..., ε′ₙ]ᵀ is independent from ε. Since the predicted value is determined by the value of the predictors, the prediction of the linear model at both sets of observations X, y and X, y′ is the same, and is given by ŷ = Xβ̂.
Let us compute the expected prediction error of the model at X, y′. Recall that for a random variable Z we have Var[Z] = E[Z²] − (E[Z])². Hence we have

E[(y′ᵢ − ŷᵢ)²] = Var[y′ᵢ − ŷᵢ] + (E[y′ᵢ − ŷᵢ])²
             = Var[y′ᵢ] + Var[ŷᵢ] − 2Cov[y′ᵢ, ŷᵢ] + (E[y′ᵢ] − E[ŷᵢ])².

However note that y′ᵢ = xᵢᵀβ + ε′ᵢ and yᵢ = xᵢᵀβ + εᵢ (where xᵢᵀ is the ith row of X) are identically distributed and independent, because εᵢ, ε′ᵢ are independent and have the same distribution as ε, and xᵢ, β are constants. In addition, for similar reasons y′ᵢ and

ŷᵢ = xᵢᵀβ̂ = xᵢᵀβ + xᵢᵀ(XᵀX)⁻¹Xᵀε
are also independent. Thus Cov[y′ᵢ, ŷᵢ] = 0. Therefore we get

E[(y′ᵢ − ŷᵢ)²] = Var[y′ᵢ] + Var[ŷᵢ] + (E[y′ᵢ] − E[ŷᵢ])²
             = Var[yᵢ] + Var[ŷᵢ] + (E[yᵢ] − E[ŷᵢ])²
             = Var[yᵢ] + Var[ŷᵢ] − 2Cov[yᵢ, ŷᵢ] + 2Cov[yᵢ, ŷᵢ] + (E[yᵢ − ŷᵢ])²
             = Var[yᵢ − ŷᵢ] + (E[yᵢ − ŷᵢ])² + 2Cov[yᵢ, ŷᵢ]
             = E[(yᵢ − ŷᵢ)²] + 2Cov[yᵢ, ŷᵢ].

Hence the expected prediction MSE at X, y′ is

E[ (1/n) ∑_{i≤n} (y′ᵢ − ŷᵢ)² ] = E[ (1/n) ∑_{i≤n} (yᵢ − ŷᵢ)² ] + (2/n) ∑_{i≤n} Cov[yᵢ, ŷᵢ].

Note that (1/n) ∑_{i≤n} (yᵢ − ŷᵢ)² = (1/n) RSS is the training MSE of the model.
P
On the other hand we have

E[y] = E[Xβ + ] = Xβ + E[] = Xβ, E[ŷ] = E[Xβ̂] = X E[β̂] = Xβ.

Therefore the covariance matrix of y, ŷ is


   
E (y − Xβ)(ŷ − Xβ)T = E (Xβ̂ − Xβ)T
= E (X(β̂ − β))T = E  X(XT X)−1 XT 
   T 

= E (H)T = E T H = E T H = σ2 I H = σ2 H,


     

where H is the projection matrix. So the expected prediction MSE at X, y′ is

E[ (1/n) ∑_{i≤n} (y′ᵢ − ŷᵢ)² ] = E[ (1/n) ∑_{i≤n} (yᵢ − ŷᵢ)² ] + (2/n) ∑_{i≤n} Cov[yᵢ, ŷᵢ]
                             = E[ (1/n) RSS ] + (2/n) σ_ε² ∑_{i≤n} H_{ii}
                             = E[ (1/n) RSS ] + (2/n) σ_ε² tr(H) = E[ (1/n) RSS ] + (2/n) σ_ε² d,

where d is the number of coefficients in the linear model (i.e. d = p + 1, where p is the number of predictors). Now let σ̂² be an unbiased estimator of σ_ε² (for example we can use RSE²). Then we have

E[ (1/n) ∑_{i≤n} (y′ᵢ − ŷᵢ)² ] = E[ (1/n) RSS ] + (2d/n) σ_ε² = E[ (1/n) RSS + (2d/n) σ̂² ].

Therefore

Cp = (1/n)(RSS + 2dσ̂²)

is an unbiased estimator of the expected prediction MSE at X, y′. Note that if we
add more predictors to the model, RSS will decrease, but 2dσ̂ 2 increases. Hence
unlike the training MSE, Cp does not necessarily decrease as we add more predictors
to the model. This property of Cp , and the fact that it is an estimator of the
prediction MSE, makes it a suitable statistic for comparing models with different
numbers of predictors.
Another quantity for comparing models with different numbers of predictors is
the adjusted R2 , defined by

R²_adj = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)].

Equivalently we have

1 − R²_adj = (1 − R²) (n − 1)/(n − p − 1).

The factor (n − 1)/(n − p − 1) is included in order to adjust for the effect of having more predictors on the R² statistic. When we add a predictor to the model, we know that RSS decreases. However, n − p − 1 decreases too. So R²_adj does not necessarily increase, in contrast to R².
Now note that when we change the number of predictors, TSS/(n − 1) does not change. Hence the change of R²_adj is only due to the change in RSS/(n − p − 1). Let us denote the number of coefficients by d, i.e. d = p + 1. Then for some unknown λ we have

RSS/(n − d) = (1/n)(RSS + λ).
Let us compute λ. We have

λ = n·RSS/(n − d) − RSS = RSS (n/(n − d) − 1) = RSS·d/(n − d) = dσ̂²,

where σ̂² = RSS/(n − d) is the squared RSE, which we know is an unbiased estimator of σ_ε². Therefore we get

R²_adj = 1 − [RSS/(n − d)] / [TSS/(n − 1)]
       = 1 − [(RSS + dσ̂²)/n] / [TSS/(n − 1)] = 1 − [Cp − (d/n)σ̂²] / [TSS/(n − 1)].

But Cp is an unbiased estimator of the expected prediction MSE. Thus Cp − (d/n)σ̂² tends to underestimate the expected prediction MSE. Hence R²_adj tends to overestimate the prediction accuracy of the model. However, as we have said, it can be used to compare models with different numbers of predictors.
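A minimal sketch of computing both quantities for a least squares fit (assuming NumPy and that X contains the column of 1s):

    import numpy as np

    def cp_and_adjusted_r2(X, y):
        n, d = X.shape                                   # d = p + 1
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        rss = np.sum((y - X @ beta_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        sigma2_hat = rss / (n - d)                       # squared RSE
        cp = (rss + 2 * d * sigma2_hat) / n
        r2_adj = 1 - (rss / (n - d)) / (tss / (n - 1))
        return cp, r2_adj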

4.4 Regularization
Suppose we have a set of observed data X, y, and we want to use this data to find
the estimate ŷ = Xβ̂ for y. In linear regression we find β̂ by minimizing the RSS.
The idea of regularization is to constrain β̂ so that it cannot take large values,
and as a result its variance decreases. More precisely, instead of minimizing the
RSS we minimize

RSS(β̂) + λJ(β̂) = ‖y − Xβ̂‖² + λJ(β̂),

where J : Rp+1 → R is a given function, and λ is a given nonnegative number. In


the process of minimization, the penalty term λJ(β̂) forces β̂ to stay closer to the
origin compared to the case where there is no penalty.
First let us show that minimizing RSS+λJ over Rp+1 is equivalent to minimizing
RSS over the set {J ≤ s} for some s ≥ 0. Let β̂ be the minimizer of RSS + λJ over
Rp+1 . Set s := J(β̂). Let β̃ be the minimizer of RSS over the set {J ≤ s}. Note
that β̂ ∈ {J ≤ s}. Hence we must have

RSS(β̃) ≤ RSS(β̂).

On the other hand we know that RSS(β̂) + λJ(β̂) ≤ RSS(β̃) + λJ(β̃). Combining
these two inequalities we get

RSS(β̂) + λJ(β̂) ≤ RSS(β̃) + λJ(β̃) ≤ RSS(β̂) + λJ(β̃)


≤ RSS(β̂) + λs = RSS(β̂) + λJ(β̂).

Therefore we have RSS(β̂) + λJ(β̂) = RSS(β̃) + λJ(β̃). Thus if we further assume


that RSS+λJ has a unique minimizer (which for example is the case when RSS+λJ
is strictly convex), we obtain β̂ = β̃, as desired.

4.4.1 Ridge Regression and Lasso


If we assume that J is differentiable, we can try to find the minimizer β̂ by differ-
entiation. Let γ ∈ Rp+1 be an arbitrary vector. Then the directional derivative of
RSS + λJ at β̂ in the direction of γ must be zero. Hence we must have
0 = d/dt|_{t=0} [ RSS(β̂ + tγ) + λJ(β̂ + tγ) ]
  = d/dt|_{t=0} [ (y − Xβ̂ − tXγ)ᵀ(y − Xβ̂ − tXγ) ] + λ d/dt|_{t=0} J(β̂ + tγ)
  = d/dt|_{t=0} [ (y − Xβ̂ − tXγ)ᵀ(y − Xβ̂) ] + d/dt|_{t=0} [ (y − Xβ̂)ᵀ(y − Xβ̂ − tXγ) ] + λ d/dt|_{t=0} J(β̂ + tγ)
  = (−Xγ)ᵀ(y − Xβ̂) + (y − Xβ̂)ᵀ(−Xγ) + λ D_γ J(β̂)
  = −γᵀXᵀy + γᵀXᵀXβ̂ − yᵀXγ + β̂ᵀXᵀXγ + λγᵀDJ(β̂)
  = −2γᵀXᵀy + 2γᵀXᵀXβ̂ + λγᵀDJ(β̂)        (since aᵀ = a for a 1 × 1 matrix)
  = 2γᵀ( −Xᵀy + XᵀXβ̂ + (λ/2)DJ(β̂) ).

Therefore we get

XᵀXβ̂ + (λ/2)DJ(β̂) = Xᵀy,    (∗)
since γ is arbitrary.
Now suppose J(β̂) = β̂₁² + ··· + β̂ₚ² = β̂ᵀĨβ̂, where Ĩ is the diagonal matrix whose diagonal entries are all 1 except the first diagonal entry, which is 0. This method of regularization is known as ridge regression, and its corresponding coefficient is denoted by β̂ᴿ_λ. In this case the equation (∗) becomes

Xᵀy = XᵀXβ̂ᴿ_λ + λĨβ̂ᴿ_λ = (XᵀX + λĨ)β̂ᴿ_λ.

Note that det(XᵀX + λĨ) is a polynomial in λ; so it has finitely many roots. Thus the matrix XᵀX + λĨ is invertible for almost all values of λ (even when p > n), and in this case we have

β̂ᴿ_λ = (XᵀX + λĨ)⁻¹Xᵀy.

As a result, note that we can find β̂ᴿ_λ for all such values of λ simultaneously, since the entries of (XᵀX + λĨ)⁻¹ are rational functions of λ, and can be calculated by standard algorithms for solving linear systems.
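A minimal sketch of this closed-form ridge solution (assuming NumPy, and that the first column of X is the column of 1s, which is left unpenalized):

    import numpy as np

    def ridge_fit(X, y, lam):
        d = X.shape[1]
        I_tilde = np.eye(d)
        I_tilde[0, 0] = 0.0       # the intercept is not penalized
        return np.linalg.solve(X.T @ X + lam * I_tilde, X.T @ y)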
Next suppose J(β̂) = |β̂1 | + · · · + |β̂p |. This method of regularization is known
as the lasso. Note that in this case J is not always differentiable; we have

D_j J = sign(β̂_j) if β̂_j ≠ 0, and does not exist if β̂_j = 0; that is, D_j J equals 1 if β̂_j > 0, equals −1 if β̂_j < 0, and does not exist if β̂_j = 0.

Hence when we look for the minimizer of RSS + λJ, in addition to the points where
the derivative vanishes, we also have to check the points of nondifferentiability,
i.e. the points which have some zero components. The possibility that these points
with some zero components can be the minimizer implies that the lasso can perform
feature selection too.
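The text above only characterizes the lasso minimizer; one common way to actually compute it is coordinate descent with soft-thresholding, sketched below (assuming NumPy and centered predictors and response, so that no intercept is needed; this particular algorithm is not derived in these notes):

    import numpy as np

    def soft_threshold(a, t):
        return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_iter=200):
        # Minimize ||y - X beta||^2 + lam * sum_j |beta_j| for centered data.
        n, p = X.shape
        beta = np.zeros(p)
        z = (X**2).sum(axis=0)                          # x_j^T x_j
        for _ in range(n_iter):
            for j in range(p):
                r = y - X @ beta + X[:, j] * beta[j]    # partial residual without x_j
                beta[j] = soft_threshold(X[:, j] @ r, lam / 2) / z[j]
        return beta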

4.4.2 Regularization from Bayesian viewpoint
Suppose the probability measure P describes the distribution of some quantity
X among a population. We assume that P belongs to a parametric family of
probability measures Pθ , but we do not know which value of θ gives us P. In
classical statistics X is random, but θ is a fixed unknown parameter (which we try
to estimate). In contrast, in Bayesian statistics both X, θ are considered to be
random variables. Let f (x|θ) denote the probability density (mass) function of Pθ ,
which is the distribution of X given θ. As before, f (x|θ) is called the likelihood
function. Let π(θ) be the prior density of θ, i.e. the density of θ before we know
anything about X. Then the joint probability density of X, θ is π(θ)f (x|θ) (note
that unlike X, we use θ to denote both a random variable and its values). Hence
the marginal density of X (before knowing θ) is
f(x) := ∫_Θ π(ϑ) f(x|ϑ) dϑ,

where Θ is the set of all possible values of θ. Thus by Bayes’ theorem, the posterior
density of θ, given the observation X = x, is

π(θ|x) = π(θ) f(x|θ) / f(x).

This relation is usually written as

π(θ|x) ∝ π(θ)f (x|θ),

where the constant of proportionality, i.e. f (x), can depend on x but not on θ. In
other words we have
posterior ∝ prior × likelihood.
The posterior mode θ̂ is the value of θ that maximizes the posterior π(θ|x),
i.e. θ̂ is the most likely value of θ, given the observed data x. We have

θ̂ = arg max_θ π(θ|x) = arg max_θ log π(θ|x),

since log is an increasing function. Hence

θ̂ = arg max_θ log π(θ|x) = arg max_θ log[ π(θ)f(x|θ)/f(x) ]
  = arg max_θ [ log π(θ) + log f(x|θ) − log f(x) ]
  = arg max_θ [ log π(θ) + log f(x|θ) ],

where the last equality holds since f(x) does not depend on θ. In comparison, recall that arg max_θ log f(x|θ) is the maximum likelihood estimator of θ.
Now let us consider the regularization in linear regression. Suppose

Y = β₁X₁ + ··· + βₚXₚ + ε,

where ε is a normally distributed noise with zero mean and variance σ². (For simplicity we assume that β₀ = 0, and all variables have zero mean.) Suppose we have n observations of Y, X₁, ..., Xₚ, which satisfy

yᵢ = β₁xᵢ₁ + ··· + βₚxᵢₚ + εᵢ,

where the εᵢs are i.i.d. random variables with the same distribution as ε. In this case θ is the vector β = (β₁, ..., βₚ) (and yᵢ plays the role of x in the above discussion, while we treat xᵢ₁, ..., xᵢₚ as constants). If the xᵢⱼs and βⱼs are fixed, then ∑ⱼ βⱼxᵢⱼ is constant; so yᵢ is normally distributed with mean ∑ⱼ βⱼxᵢⱼ and variance σ², i.e. we have

f(yᵢ|β) = (1/√(2πσ²)) exp(−(yᵢ − ∑ⱼ βⱼxᵢⱼ)²/(2σ²)).
(Be careful not to confuse the mathematical constant π with the prior or posterior densities!) Then since the yᵢs are independent we have

f(y|β) = f(y₁, ..., yₙ|β) = ∏_{i≤n} (1/√(2πσ²)) exp(−(yᵢ − ∑ⱼ βⱼxᵢⱼ)²/(2σ²)).

Hence

log f(y|β) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i≤n} (yᵢ − ∑_{j≥1} βⱼxᵢⱼ)².
Now let us assume that π(β) = ∏_{j≥1} (1/√(2πτ²)) exp(−βⱼ²/(2τ²)), i.e. we assume that π(βⱼ) is normal with zero mean and variance τ², and the βⱼs are independent. Then we get

log π(β) = −(p/2) log(2πτ²) − (1/(2τ²)) ∑_{j≥1} βⱼ².

Therefore we have

β̂ = arg max_β [ log π(β) + log f(y|β) ]
  = arg max_β [ −(1/σ²) ∑ᵢ (yᵢ − ∑ⱼ βⱼxᵢⱼ)² − (1/τ²) ∑ⱼ βⱼ² ]
  = arg min_β [ ∑ᵢ (yᵢ − ∑ⱼ βⱼxᵢⱼ)² + (σ²/τ²) ∑ⱼ βⱼ² ]
  = arg min_β [ RSS(β) + λ ∑ⱼ βⱼ² ].

Note that in the second and third equalities above we are using the fact that σ, τ
are (positive) constants, and therefore canceling them or factoring them out does
not alter the extremum point. Thus we have shown that when the prior density
of the βⱼs is normal with zero mean and variance τ², the posterior mode β̂ is also the ridge regression coefficient corresponding to λ = σ²/τ².
Next let us assume that π(β) = ∏_{j≥1} (1/(2b)) exp(−|βⱼ|/b), i.e. we assume that π(βⱼ) is the Laplace distribution with zero mean and scale parameter b, and the βⱼs are independent. Then we get

log π(β) = −p log(2b) − (1/b) ∑_{j≥1} |βⱼ|.

Therefore we have

β̂ = arg max_β [ log π(β) + log f(y|β) ]
  = arg max_β [ −(1/σ²) ∑ᵢ (yᵢ − ∑ⱼ βⱼxᵢⱼ)² − (1/b) ∑ⱼ |βⱼ| ]
  = arg min_β [ ∑ᵢ (yᵢ − ∑ⱼ βⱼxᵢⱼ)² + (σ²/b) ∑ⱼ |βⱼ| ]
  = arg min_β [ RSS(β) + λ ∑ⱼ |βⱼ| ].

Thus we have shown that when the prior density of the βⱼs is the Laplace distribution with
zero mean and scale parameter b, the posterior mode β̂ is also the lasso coefficient
corresponding to λ = σ 2 /b.

4.5 Principal Components Analysis


 
Let X = [x₁ ... xₚ] be the matrix of n observations of X₁, ..., Xₚ (note that we do not include the column of 1s). We assume that the observations are shifted to have zero mean, i.e. x̄ⱼ = 0 for each j. Suppose we want to find a linear combination of x₁, ..., xₚ which has the largest variance, i.e. we want to determine the coefficients φ₁ = [φ₁₁, ..., φₚ₁]ᵀ so that

z1 = φ11 x1 + · · · + φp1 xp = Xφ1

has the maximum variance among all linear combinations of x1 , . . . , xp . In other


words, we want to find the direction φ1 in the feature space along which the data has
the largest variance. Informally, the direction with the greatest variance contains a
large amount of the information in the data, because from a statistical point of view
there is little information in a variable with small variance, since that variable is

close to being a constant. Now note that if φj1 s are allowed to be arbitrarily large,
then the variance of z1 can be made arbitrarily large as well. Thus we require that
φ211 + · · · + φ2p1 = kφ1 k2 = 1, i.e. we only care about the direction determined by
φ1 , and not its magnitude.
If we write the above relation in coordinates we have
p
X
zi1 = φj1 xij .
j=1
Pp
Hence z̄1 = j=1 φj1 x̄j = 0. Therefore the variance of zi1 s can be computed as
follows
X X
(n − 1)s2z1 = (zi1 − z̄1 )2 = 2
zi1 = kz1 k2
i≤n i≤n
= zT1 z1 = (Xφ1 )T Xφ1 = φT1 XT Xφ1 .

But from linear algebra we know that the maximum of φ₁ᵀXᵀXφ₁ among the vectors φ₁ with ‖φ₁‖ = 1 is achieved when φ₁ is an eigenvector of the symmetric matrix XᵀX corresponding to its largest eigenvalue, which we call λ₁, i.e.

φ₁ = arg max_{‖φ‖=1} φᵀXᵀXφ,    λ₁ = max_{‖φ‖=1} φᵀXᵀXφ    =⇒    XᵀXφ₁ = λ₁φ₁.

(The proof is presented below.) As a result we have

(n − 1)s²_{z₁} = ‖z₁‖² = φ₁ᵀXᵀXφ₁ = λ₁φ₁ᵀφ₁ = λ₁‖φ₁‖² = λ₁,

i.e. the sample variance of z₁ is λ₁/(n − 1). Note that XᵀX is a positive semidefinite matrix, so its eigenvalues are nonnegative. And recall that by definition, the eigenvalues of
XT X are the squares of the singular values of X. The variable z1 is called the first
principal component of x1 , . . . , xp , and φ1 is called the first principal component
loading vector. Also, z11 , . . . , zn1 are known as the scores of the first principal
component.
The second principal component of x1 , . . . , xp is a linear combination

z2 = φ12 x1 + · · · + φp2 xp = Xφ2

which has the largest variance among those linear combinations of x1 , . . . , xp that
are uncorrelated with z1 . Note that the sample mean of z2 , like all other linear
combinations of x1 , . . . , xp , is zero. Thus the covariance of z1 , z2 is equal to their
inner product, because
X X
(n − 1)qz1 z2 = (zi1 − z̄1 )(zi2 − z̄2 ) = zi1 zi2 = zT1 z2 .
i≤n i≤n

Hence z2 has the largest variance among those linear combinations of x1 , . . . , xp
which are orthogonal to z1 . Thus to obtain φ2 we need to maximize φT XT Xφ
among all vectors φ with length one such that Xφ is orthogonal to Xφ1 = z1 . But
in this case we have

0 = (Xφ)T Xφ1 = φT XT Xφ1 = λ1 φT φ1 .

Since we assume that XT X is invertible, λ1 is nonzero; so we must have φT φ1 = 0,


i.e. φ, φ1 must be orthogonal. Therefore φ2 maximizes φT XT Xφ among all vectors
φ with length one which are orthogonal to φ1 . However, from linear algebra we
know this happens when (and only when) φ2 is an eigenvector of XT X correspond-
ing to its second largest eigenvalue, which we call λ2 . (The proof is presented below.
Also note that λ2 can be equal to λ1 too.)
We can similarly define the kth principal component zk = Xφk as the
linear combination of x1 , . . . , xp which has the largest variance among those linear
combinations of x1 , . . . , xp that are uncorrelated with z1 , . . . , zk−1 . Then similarly
to the above, we can easily show that this is equivalent to φk being the maximizer of
φT XT Xφ among all vectors φ with length one which are orthogonal to φ1 , . . . , φk−1 .
And again we know this happens when (and only when) φk is an eigenvector of
XT X corresponding to its kth largest eigenvalue, which we call λk . The reason is
that the symmetric matrix XT X has an orthonormal basis of eigenvectors, which
we denote by φ1 , . . . , φp , and we assume they are ordered to make the sequence of
eigenvalues decreasing. Thus if φ is orthogonal to φ1 , . . . , φk−1 , it must be a linear
combination of φk , . . . , φp . Hence we have

φ = (φT φk )φk + · · · + (φT φp )φp ,

since φk , . . . , φp is an orthonormal set. Therefore



φᵀXᵀXφ = φᵀXᵀX[ (φᵀφ_k)φ_k + ··· + (φᵀφ_p)φ_p ]
       = φᵀ[ (φᵀφ_k)XᵀXφ_k + ··· + (φᵀφ_p)XᵀXφ_p ]
       = φᵀ[ (φᵀφ_k)λ_kφ_k + ··· + (φᵀφ_p)λ_pφ_p ]
       = λ_k(φᵀφ_k)² + ··· + λ_p(φᵀφ_p)²
       ≤ λ_k[ (φᵀφ_k)² + ··· + (φᵀφ_p)² ] = λ_k‖φ‖² = λ_k,


since λk is the largest eigenvalue among λk , . . . , λp . Furthermore note that the


maximum of φT XT Xφ is achieved when φ = φk , because

φTk XT Xφk = φTk (λk φk ) = λk φTk φk = λk kφk k2 = λk .

Thus φk is the desired maximizer.

We can continue the above process for k ≤ p. Therefore, by construction
φ1 , . . . , φp is an orthonormal set of eigenvectors of XT X corresponding to the eigen-
values λ1 ≥ λ2 ≥ · · · ≥ λp . In addition, the principal components z1 , . . . , zp also
form an orthogonal set, which implies they are uncorrelated, since their sample
means are zero. Furthermore for every k we have
(n − 1)s²_{z_k} = ∑_{i≤n} (z_{ik} − z̄_k)² = ∑_{i≤n} z_{ik}² = ‖z_k‖² = z_kᵀz_k = (Xφ_k)ᵀXφ_k = φ_kᵀXᵀXφ_k = λ_kφ_kᵀφ_k = λ_k‖φ_k‖² = λ_k;

so s²_{z_k} = λ_k/(n − 1).
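A minimal sketch of computing the loading vectors, scores and component variances from the eigendecomposition of XᵀX (assuming NumPy; the function name is only illustrative):

    import numpy as np

    def principal_components(X):
        X = X - X.mean(axis=0)                      # center the columns
        eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues in increasing order
        order = np.argsort(eigvals)[::-1]           # reorder so lambda_1 >= lambda_2 >= ...
        lam = eigvals[order]
        Phi = eigvecs[:, order]                     # columns are the loading vectors phi_k
        Z = X @ Phi                                 # columns are the scores z_k
        return Phi, Z, lam / (X.shape[0] - 1)       # loadings, scores, component variances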


Another interesting property is that for any m ≤ p, φ1 , . . . , φm span the closest
m-dimensional linear subspace to the data. To explain this let us denote the rows
of the matrix X by x1 , . . . , xn , i.e. we have
X = [ x₁ᵀ ; x₂ᵀ ; ... ; xₙᵀ ]    (the rows of X stacked vertically).

Also note that xᵢ ∈ Rᵖ is the vector of the ith observation. In addition we have

z_k = Xφ_k = [ x₁ᵀφ_k ; ... ; xₙᵀφ_k ].
Thus zik = xTi φk , i.e. the ith component of zk is the inner product of xi and φk .
On the other hand, φ1 , . . . , φm form an orthonormal set. Therefore the orthogonal
projection of xi on the subspace V = span(φ1 , . . . , φm ) is given by
v i = (xTi φ1 )φ1 + · · · + (xTi φm )φm .
Furthermore note that, due to the properties of orthogonal projection, vᵢ and xᵢ − vᵢ are orthogonal; so by the Pythagorean theorem we have ‖xᵢ‖² = ‖vᵢ‖² + ‖xᵢ − vᵢ‖². Hence the sum of the squared distances of the observed data from the subspace V is

∑_{i≤n} ‖xᵢ − vᵢ‖² = ∑_{i≤n} ‖xᵢ‖² − ∑_{i≤n} ‖vᵢ‖²
                  = ∑_{i≤n} ‖xᵢ‖² − ∑_{i≤n} ∑_{k≤m} (xᵢᵀφ_k)²        (since the φ_k s are orthonormal)
                  = ∑_{i≤n} ‖xᵢ‖² − ∑_{k≤m} ∑_{i≤n} (xᵢᵀφ_k)²
                  = ∑_{i≤n} ‖xᵢ‖² − ∑_{k≤m} ‖z_k‖² = ∑_{i≤n} ‖xᵢ‖² − ∑_{k≤m} λ_k.

Now suppose W is another m-dimensional subspace of Rᵖ. Let ψ₁, ..., ψ_m be an orthonormal basis for W. Then the orthogonal projection of xᵢ on W is

wᵢ = (xᵢᵀψ₁)ψ₁ + ··· + (xᵢᵀψ_m)ψ_m.

Thus we can similarly compute the sum of the squared distances of the observed data from the subspace W:

∑_{i≤n} ‖xᵢ − wᵢ‖² = ∑_{i≤n} ‖xᵢ‖² − ∑_{i≤n} ‖wᵢ‖²
                  = ∑_{i≤n} ‖xᵢ‖² − ∑_{i≤n} ∑_{k≤m} (xᵢᵀψ_k)²
                  = ∑_{i≤n} ‖xᵢ‖² − ∑_{k≤m} ∑_{i≤n} (xᵢᵀψ_k)² = ∑_{i≤n} ‖xᵢ‖² − ∑_{k≤m} ‖Xψ_k‖²,
i≤n k≤m i≤n i≤n k≤m

because the xᵢᵀψ_k s are the components of the vector Xψ_k. Therefore to minimize the distance of the observed data to the subspace, i.e. ∑_{i≤n} ‖xᵢ − wᵢ‖², we need to maximize ∑_{k≤m} ‖Xψ_k‖². Hence to show that V is the closest m-dimensional linear subspace to the data we have to show that

∑_{k≤m} ‖Xψ_k‖² ≤ ∑_{k≤m} ‖z_k‖² = ∑_{k≤m} λ_k.

When k = 1 this is easy, since we have

‖Xψ₁‖² = ψ₁ᵀXᵀXψ₁ ≤ max_{‖φ‖=1} φᵀXᵀXφ = λ₁.

Or equivalently, we can say that z₁ has the largest variance among all linear combinations Xφ with ‖φ‖ = 1. For general k note that we have

ψ_k = (ψ_kᵀφ₁)φ₁ + ··· + (ψ_kᵀφ_p)φ_p,

since φ₁, ..., φ_p is an orthonormal basis for Rᵖ. Hence

‖Xψ_k‖² = ψ_kᵀXᵀXψ_k = ψ_kᵀXᵀX[ (ψ_kᵀφ₁)φ₁ + ··· + (ψ_kᵀφ_p)φ_p ]
        = ψ_kᵀ[ (ψ_kᵀφ₁)XᵀXφ₁ + ··· + (ψ_kᵀφ_p)XᵀXφ_p ]
        = ψ_kᵀ[ (ψ_kᵀφ₁)λ₁φ₁ + ··· + (ψ_kᵀφ_p)λ_pφ_p ]
        = λ₁(ψ_kᵀφ₁)² + ··· + λ_p(ψ_kᵀφ_p)².

Thus we get

∑_{k≤m} ‖Xψ_k‖² = ∑_{k≤m} ∑_{l≤p} λ_l(ψ_kᵀφ_l)² = ∑_{l≤p} λ_l ∑_{k≤m} (ψ_kᵀφ_l)² = ∑_{l≤p} a_l λ_l,

where a_l := ∑_{k≤m} (ψ_kᵀφ_l)² ≥ 0. We also know that

a_l = ∑_{k≤m} (ψ_kᵀφ_l)² = ∑_{k≤m} (φ_lᵀψ_k)² ≤ ‖φ_l‖² = 1,

since the φ_lᵀψ_k s are the coefficients of the representation of φ_l in an orthonormal basis containing ψ₁, ..., ψ_m. In addition note that

∑_{l≤p} a_l = ∑_{l≤p} ∑_{k≤m} (ψ_kᵀφ_l)² = ∑_{k≤m} ∑_{l≤p} (ψ_kᵀφ_l)² = ∑_{k≤m} ‖ψ_k‖² = ∑_{k≤m} 1 = m.

Therefore ∑_{l≤p} a_l λ_l is at most ∑_{l≤m} λ_l, since the λ_l s are ordered in a decreasing way. Hence we have shown that ∑_{k≤m} ‖Xψ_k‖² ≤ ∑_{l≤m} λ_l, as desired.

Chapter 5

Nonlinear Methods

5.1 Splines
We have seen in Section 2.3.1 that we can apply nonlinear basis functions, like
polynomials, to the features to obtain new features. Then we can use these new
features in linear regression. These new features can also be used in other learning
methods such as logistic regression and LDA. One particularly useful set of basis
functions is the set of piecewise polynomial functions. For example, using a feature
X we can construct the features

a₀(X) = I(X < c),  a₁(X) = I(X < c)X,  ...,  a_m(X) = I(X < c)X^m,
b₀(X) = I(X ≥ c),  b₁(X) = I(X ≥ c)X,  ...,  b_m(X) = I(X ≥ c)X^m,

where I is the indicator function. A linear combination of aj , bj is a piecewise


degree m polynomial of X whose coefficients can change at the knot c.
Using these new nonlinear functions of X can result in much more flexible mod-
els. However a drawback of using arbitrary piecewise polynomials is that they can
be discontinuous. And we would like to obtain the coefficients of aj , bj in a way that
the resulting piecewise polynomial function, and possibly some of its derivatives,
are continuous everywhere, especially at c. But we find the coefficients of aj , bj
by least squares method, and it seems hard to impose the additional continuity

constraints while we are fitting the model. The idea for avoiding this difficulty is
to impose the constraints on the basis functions. Then we can use those new basis
functions which satisfy the constraints to fit the model. Let us explain this point
further.
Suppose we want to model the relationship between Y, X by a degree m piece-
wise polynomial which has continuous derivatives up to order m − 1, and has knots
at c1 , . . . , ck . Then we can write

Y = β₀ + β₁X + ··· + β_mX^m + β_{m+1}(X − c₁)₊^m + ··· + β_{m+k}(X − c_k)₊^m + ε,

where

(X − c_j)₊^m = { 0            if X < c_j,
                 (X − c_j)^m  if X ≥ c_j.
Note that each of the basis functions 1, x, x², ..., x^m, (x − c_j)₊^m is a (piecewise) polynomial of degree at most m, and they all have everywhere continuous derivatives up to order m − 1. Hence the function

β₀ + β₁X + ··· + β_mX^m + β_{m+1}(X − c₁)₊^m + ··· + β_{m+k}(X − c_k)₊^m

is a piecewise degree m polynomial of X, which automatically has continuous deriva-


tives up to order m−1. Functions of this form are called degree m splines. Now we
can estimate the coefficients βj by our learning method, for example least squares,
and there is no need to impose the constraints during estimation, since the con-
straints are automatically satisfied.
We can also use the above idea to impose more constraints, which are sometimes
useful. For example, in natural cubic splines, in which m = 3, we also require
the function to be linear in the first and last intervals, i.e. (−∞, c1 ) and (ck , ∞).
Consider a cubic spline

f (x) = β0 + β1 x + β2 x2 + β3 x3 + β4 (x − c1 )3+ + · · · + βk+3 (x − ck )3+ .

When x < c1 we have (x−cj )3+ = 0. So the linearity constraint on (−∞, c1 ) implies
that β2 = β3 = 0. The linearity constraint on (ck , ∞) will yield relationships
between β4 , . . . , βk+3 , which after simplifications imply that

f (x) = α1 + α2 x + α3 N3 (x) + · · · + αk Nk (x),

where
N_{j+2}(x) = [ (x − c_j)₊³ − (x − c_k)₊³ ]/(c_k − c_j) − [ (x − c_{k−1})₊³ − (x − c_k)₊³ ]/(c_k − c_{k−1}).
To simplify the notation we also use N₁(x) = 1 and N₂(x) = x. Although the computations for deriving the above relation are cumbersome, it can readily be seen that each Nⱼ is a piecewise polynomial of degree at most 3 with continuous
derivatives up to order 2, which is also linear on the intervals (−∞, c1 ) and (ck , ∞).
Because for x < c1 we have (x − cj )3+ = 0 for each j, and for x > ck we have

(x − cj )3+ = (x − cj )3 = x3 − 3cj x2 + 3c2j x − c3j ;

so the coefficients of x3 , x2 vanish in Nj (x).
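A minimal sketch of evaluating the natural cubic spline basis N₁, ..., N_k at a set of points (assuming NumPy; the helper d corresponds to the truncated-power ratios appearing in the formula above):

    import numpy as np

    def natural_cubic_basis(x, knots):
        # Evaluate N_1, ..., N_k at the points x, for knots c_1 < ... < c_k.
        x = np.asarray(x, dtype=float)
        c = np.asarray(knots, dtype=float)
        k = len(c)

        def d(j):
            # d_j(x) = [(x - c_j)_+^3 - (x - c_k)_+^3] / (c_k - c_j), 0-indexed
            return (np.maximum(x - c[j], 0) ** 3 - np.maximum(x - c[-1], 0) ** 3) / (c[-1] - c[j])

        N = np.empty((len(x), k))
        N[:, 0] = 1.0
        N[:, 1] = x
        for j in range(k - 2):
            N[:, j + 2] = d(j) - d(k - 2)
        return N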

5.2 Smoothing Splines


Suppose we have a feature X with observations x1 , . . . , xn , and a response Y =
f (X)+ with corresponding observations y1 , . . . , yn . In order to estimate f we have
so far assumed that f has a given form with some parameters; then we estimated
those parameters. For example, we may assume that f is a linear spline with knots
at c1 , . . . , ck , i.e. a linear combination of 1, x, (x − c1 )+ , . . . , (x − ck )+ , and then we
can estimate the coefficients of that linear combination by minimizing the residual
sum of squares.
Alternatively, we can look for the estimate of f in a large class of functions by
a different minimization method. Consider the functional
F[g] = ∑_{i=1}^n (yᵢ − g(xᵢ))² + λ ∫ (g″(x))² dx,

which is defined for a sufficiently smooth function g. (The integral can be taken
over (−∞, ∞) or a bounded interval containing the range of values of X, but its
bounds are not important in what follows.) Also, λ is a fixed tuning parameter. We
will show that the functional F has a minimizer, which is known as a smoothing spline.
In fact, we will show that the functional F achieves its minimum in the space
of natural cubic splines with knots at x1 , . . . , xn . Hence smoothing splines are
natural cubic splines; however, their coefficients are different from the coefficients
of a natural cubic spline which minimizes RSS. Let g be an arbitrary function in
the domain of the functional. We will show that there is a natural cubic spline h
such that F [h] ≤ F [g]. Hence if F has a minimizer, it must be a natural cubic
spline. Then we show that F actually achieves its minimum in the space of natural
cubic splines, as desired.
Let N1 , . . . , Nn be the basis for the space of natural cubic splines with knots at
x1 , . . . , xn . We assume that x1 , . . . , xn are distinct points. To simplify the notation
we also assume that x1 , . . . , xn are ordered in an increasing way. Let N = [Nj (xi )]
be the n × n matrix resulting from applying this basis to the observed predictor

values. Remember that N1 (x) = 1 and N2 (x) = x. The other entries of N are
N_{j+2}(xᵢ) = [ (xᵢ − xⱼ)₊³ − (xᵢ − xₙ)₊³ ]/(xₙ − xⱼ) − [ (xᵢ − x_{n−1})₊³ − (xᵢ − xₙ)₊³ ]/(xₙ − x_{n−1})
            = (xᵢ − xⱼ)₊³/(xₙ − xⱼ) − (xᵢ − x_{n−1})₊³/(xₙ − x_{n−1})        (since xᵢ ≤ xₙ)
            = { 0                              if i ≤ j,
                (xᵢ − xⱼ)³/(xₙ − xⱼ)           if j < i < n,
                (xₙ − xⱼ)² − (xₙ − x_{n−1})²   if i = n.

Note that we have N_{j+2}(xᵢ) ≠ 0 for i > j. The matrix N is almost lower triangular,
and it can be easily seen that it has at least n − 1 linearly independent columns.
But to show that N is invertible we have to prove that all its columns are linearly
independent.
Suppose there is a linear dependence relation between the columns of N. Then
there are coefficients a1 , . . . , an such that

a1 N1 (xi ) + · · · + an Nn (xi ) = 0

for every i ≤ n. Let h̃(x) := a1 N1 (x) + · · · + an Nn (x). Then h̃ is a piecewise


cubic polynomial function whose first and second derivatives are continuous. Also,
for every i ≤ n we have h̃(xi ) = 0; so h̃ has n distinct roots. But this implies
that h̃0 has n − 1 distinct roots, one between each pair xi , xi+1 , due to the mean
value theorem. And therefore h̃00 has n − 2 distinct roots, one between each pair of
consecutive roots of h̃0 .
However, note that h̃00 is a linear function in each interval [xi , xi+1 ]. In addition,
h̃00 = 0 on the intervals (−∞, x1 ] and [xn , ∞), since h̃ is linear on these intervals.
Suppose h̃00 is not identically zero on [xk , xl ], and it vanishes over some intervals
before xk and after xl . Then h̃00 cannot have any roots in (xk , xk+1 ] and [xl−1 , xl ),
since over these intervals h̃00 is a nonzero linear function that vanishes at one of the
endpoints. Furthermore, h̃00 has at most one root in each of the l − k − 2 intervals

(xk+1 , xk+2 ], (xk+2 , xk+3 ], . . . , (xl−2 , xl−1 ),

since h̃00 is a nonzero linear function over each of these intervals. But by our
assumption, between the l − k + 1 roots xk , xk+1 , . . . , xl of h̃, h̃00 must have at least
l − k − 1 distinct roots. This contradiction implies that h̃00 must be identically
zero everywhere, which implies that the coefficients a1 , . . . , an are all zero. So the
columns of N are linearly independent, and hence N is an invertible matrix.
Now let g = [g(x₁), ..., g(xₙ)]ᵀ. Then there is a vector β such that Nβ = g.
Let
h(x) = β1 N1 (x) + · · · + βn Nn (x).

Then h is a natural cubic spline that satisfies h(xᵢ) = g(xᵢ) for every i. Therefore ∑_{i=1}^n (yᵢ − g(xᵢ))² = ∑_{i=1}^n (yᵢ − h(xᵢ))². Hence we have

F[g] − F[h] = λ ∫ ((g″)² − (h″)²) dx.

Thus to conclude F[h] ≤ F[g] we only need to show that ∫ (h″)² dx ≤ ∫ (g″)² dx.
To prove this, we integrate by parts twice. (Note that we have to decompose the domain of integration before integrating by parts, because h‴ and h⁽⁴⁾ do not exist everywhere.) We obtain
∫ (g″ − h″)h″ dx = ∫_{x₁}^{xₙ} (g″ − h″)h″ dx        (since h″ = 0 outside [x₁, xₙ])
                 = ∑_{i=2}^n ∫_{x_{i−1}}^{x_i} (g″ − h″)h″ dx
                 = −∑_{i=2}^n ∫_{x_{i−1}}^{x_i} (g′ − h′)h‴ dx + ∑_{i=2}^n [ (g′(x_i) − h′(x_i))h″(x_i) − (g′(x_{i−1}) − h′(x_{i−1}))h″(x_{i−1}) ]
                 = ∑_{i=2}^n ∫_{x_{i−1}}^{x_i} (g − h)h⁽⁴⁾ dx − ∑_{i=2}^n [ (g(x_i) − h(x_i))h‴(x_i) − (g(x_{i−1}) − h(x_{i−1}))h‴(x_{i−1}) ]
                   + ∑_{i=2}^n [ (g′(x_i) − h′(x_i))h″(x_i) − (g′(x_{i−1}) − h′(x_{i−1}))h″(x_{i−1}) ].


Now note that the integrals containing h(4) vanish since h is a piecewise cubic
polynomial. We also have g(xi ) = h(xi ) for every i; so the n-term sum in the above
formula which contains h000 vanishes too. In addition, the n-term sum containing h00
is a telescoping sum; hence it equals (g 0 (xn )−h0 (xn ))h00 (xn )−(g 0 (x1 )−h0 (x1 ))h00 (x1 ).
But h00 (xn ) = h00 (x1 ) = 0, because h is a natural spline, so it is linear on the intervals
(−∞, x1 )R and (xn , ∞). Therefore all the terms in the above equation vanish. Thus
we have (g 00 − h00 )h00 dx = 0. Hence we get
Z Z Z Z Z
00 2 00 00 1 00 2 00 2 1 00 2 1
(h ) dx = h g dx ≤ ((h ) + (g ) )dx = (h ) dx + (g 00 )2 dx.
2 2 2

Therefore we obtain ∫ (h″)² dx ≤ ∫ (g″)² dx, as desired.

Now let us compute F [h]. We have
F[h] = ∑_{i=1}^n (yᵢ − h(xᵢ))² + λ ∫ (h″(x))² dx
     = ∑_{i=1}^n (yᵢ − ∑_{j=1}^n Nⱼ(xᵢ)βⱼ)² + λ ∫ (∑_{j=1}^n Nⱼ″(x)βⱼ)² dx
     = ∑_{i=1}^n (yᵢ − ∑_{j=1}^n Nⱼ(xᵢ)βⱼ)² + λ ∑_{j,k} βₖβⱼ ∫ Nⱼ″(x)Nₖ″(x) dx
     = ‖y − Nβ‖² + λβᵀΩβ,
where Ω is the n × n matrix with entries Ω_{jk} = ∫ Nⱼ″(x)Nₖ″(x) dx. Therefore to minimize F over the space of natural cubic splines we need to find β that minimizes ‖y − Nβ‖² + λβᵀΩβ. We can differentiate this expression with respect to β, and
ky − Nβk2 + λβ T Ωβ. We can differentiate this expression with respect to β, and
similarly to the computations in Section 4.4.1 for ridge regression, we obtain that
the minimizer β̂ satisfies
(NT N + λΩ)β̂ = NT y.
And when the matrix NT N + λΩ is invertible, the vector
β̂ = (NT N + λΩ)−1 NT y
provides the minimizer ĥ(x) = β̂1 N1 (x) + · · · + β̂n Nn (x) for the functional F .
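A minimal sketch of this final computation (assuming NumPy, and that the basis matrix N and the penalty matrix Ω have already been computed, e.g. by numerical integration of the products of second derivatives):

    import numpy as np

    def smoothing_spline_fit(N, Omega, y, lam):
        # Solve (N^T N + lam * Omega) beta = N^T y and return beta and the fitted values.
        beta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
        return beta_hat, N @ beta_hat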

5.3 Local Regression


Suppose we have a feature X with observations x1 , . . . , xn . In order to estimate
f (x) using local regression we first need to choose a kernel K(xi , x) which assigns
a weight to the observation xi with respect to x. The kernel is chosen so that it
assigns a low (or zero) weight to an observation xi which is far from x, and a higher
weight to observations close to x. Usually there is a parameter in the kernel K
that controls the width of the region in which K assigns a higher weight, and this
parameter should be fixed beforehand. Then we use these weights in weighted least
squares method, described in Section 2.3.3, to fit the model by minimizing
∑_{i=1}^n wᵢ(x)(yᵢ − ŷᵢ)² = ∑_{i=1}^n wᵢ(x)(yᵢ − β̂₀ − β̂₁xᵢ − ··· − β̂_mxᵢ^m)²,

where wi (x) = K(xi , x) are the weights. As you see, we can use linear or polynomial
functions in this method. Note that the weights depend on x; so the above fitting procedure will only give us f̂(x) for this particular value of x. And if we want to estimate f̂(x₀) for another value x₀, we need to compute the weights wᵢ(x₀) = K(xᵢ, x₀), and fit the weighted least squares method again.
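A minimal sketch of a local linear fit at a single point x₀ with a Gaussian kernel (assuming NumPy; the kernel and bandwidth are only illustrative choices):

    import numpy as np

    def local_linear_fit(x, y, x0, bandwidth=1.0):
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)    # Gaussian kernel weights K(x_i, x0)
        X = np.column_stack([np.ones_like(x), x])
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted least squares
        return beta[0] + beta[1] * x0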
