21 Nonparametric Regression
1 Introduction
So far we have assumed a linear regression model, i.e., that the regression function is linear in a particular set of covariates/features. Although we have allowed interactions and higher order polynomial terms to be included in the set of covariates, so far the particular terms must be chosen by the user, and inference etc. requires the resulting model to be correct (unless we are taking the projection approach). Now we will drop the linearity assumption and only assume that the regression function is smooth, and we will discuss more “data-driven” methods for capturing nonlinearity.
Estimating the regression function m without linearity assumptions (e.g., only assuming smoothness) is called nonparametric regression or smoothing. Without loss of generality, we can always write

Y = m(X) + ε

where E(ε | X) = 0. This follows since, if we define ε = Y − m(X), then E(ε | X) = E(Y | X) − m(X) = 0. We also define

σ²(x) = Var[Y | X = x] = E[ε² | X = x].
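To fix ideas, here is a small simulated example of this setup (a sketch, not from the original code): a nonlinear regression function m and a non-constant conditional variance σ²(x).

### simulate from Y = m(X) + eps with E(eps | X) = 0 and non-constant sigma(x)
set.seed(1)
n = 500
m = function(x) 3*sin(8*x)            # true regression function
sigma = function(x) .1 + .4*x         # conditional standard deviation sigma(x)
x = runif(n)
y = m(x) + sigma(x)*rnorm(n)
plot(x, y, col = "gray")
curve(m, add = TRUE, lwd = 3, col = "blue")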
Example 1 Figure 1 shows data from a study on bone mineral density. The plots show the relative
change in bone density over two consecutive visits, for men and women. The smooth estimates of
the regression functions suggest that a growth spurt occurs two years earlier for females. In this
example, Y is change in bone mineral density and X is age.
[Figure 1: Relative change in bone mineral density versus age, with smooth estimates of the regression functions, for females and males.]
2 The Bias–Variance Tradeoff

To evaluate an estimator m̂ we use the prediction risk

R = E[(Y − m̂(X))²],

where (X, Y) denotes a new observation. The expectation is over the training data (X1, Y1), . . . , (Xn, Yn) and the new observation (X, Y). Let us define m̄(x) = E[m̂(x)]. Then

Y − m̂(X) = [Y − m(X)] + [m(X) − m̄(X)] + [m̄(X) − m̂(X)] = A + B + C.

Hence,

[Y − m̂(X)]² = A² + B² + C² + 2AB + 2AC + 2BC.

Now we take the expected value. The first term is

E[A²] = E[Y − m(X)]² = ∫ σ²(x) p(x) dx,

where p denotes the density of X; this is called the unavoidable error. Even if we knew the regression function m(x), we would still incur this error when doing prediction. Next we have

E[B²] = E[m(X) − m̄(X)]² = ∫ b²(x) p(x) dx

where

b(x) = E[m̂(x)] − m(x)

is the bias. Next we note that

E[C²] = E[m̄(X) − m̂(X)]² = ∫ E[m̂(x) − m̄(x)]² p(x) dx = ∫ v(x) p(x) dx

where v(x) = Var[m̂(x)] is the variance. The cross terms all have expectation 0, since E[Y | X = x] = m(x) and the new observation is independent of the training data. You should check that you can show this. To summarize:

prediction risk = ∫ σ²(x) p(x) dx + ∫ b²(x) p(x) dx + ∫ v(x) p(x) dx.

The last two terms make up the mean integrated squared error (MISE),

MISE = ∫ E[m̂(x) − m(x)]² p(x) dx = ∫ b²(x) p(x) dx + ∫ v(x) p(x) dx,

which differs from the prediction risk only by the unavoidable error.
Hence, whether we use prediction error or MISE, we have to control both the bias and the variance of m̂(x). The key is to make sure that neither one gets too large. The estimator m̂ typically involves smoothing the data in some way. The main challenge is to determine how much smoothing to do. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and variance.
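The tradeoff is easy to see in a small simulation. The sketch below (not part of the original code) approximates the squared bias and the variance of a simple local average at the point x0 = 0.5, once with a small and once with a large amount of smoothing.

### Monte Carlo sketch of the bias-variance tradeoff for a local average at x0 = 0.5.
### For each h, the estimate is the average of the Y_i with |X_i - x0| <= h.
set.seed(1)
m = function(x) 3*sin(8*x)
x0 = .5; n = 200; nsim = 1000
for(h in c(.05, .5)){
  est = replicate(nsim, {
    x = runif(n)
    y = m(x) + rnorm(n)                  # noise standard deviation 1
    mean(y[abs(x - x0) <= h])            # local average around x0
  })
  cat("h =", h,
      " squared bias =", round((mean(est) - m(x0))^2, 3),
      " variance =", round(var(est), 3), "\n")
}
### small h (undersmoothing): tiny bias, larger variance;
### large h (oversmoothing): large bias, smaller variance.

The regressogram and kernel estimators discussed below show exactly this behavior as the number of bins or the bandwidth varies.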
3 The Regressogram
We will begin by assuming there is only one covariate X. For simplicity, assume that 0 ≤ X ≤ 1. The simplest nonparametric estimator of m is the regressogram. Let k be an integer and let h = 1/k. Divide [0, 1] into k bins of width h:

B1 = [0, h], B2 = [h, 2h], B3 = [2h, 3h], . . . , Bk = [(k − 1)h, 1].

Let nj denote the number of observations in bin Bj. In other words, nj = Σi I(Xi ∈ Bj) where I(Xi ∈ Bj) = 1 if Xi ∈ Bj and I(Xi ∈ Bj) = 0 if Xi ∉ Bj.

We then define Ȳj to be the average of the Yi's in Bj:

Ȳj = (1/nj) Σ_{i : Xi ∈ Bj} Yi.

The regressogram is the piecewise constant estimator m̂(x) = Ȳj for x ∈ Bj. Here is an R function that computes it:
regressogram = function(x,y,left,right,k,plotit,xlab="",ylab="",sub=""){
  ### assumes the data are on the interval [left,right]
  n = length(x)
  B = seq(left,right,length=k+1)
  WhichBin = findInterval(x,B)
  N = tabulate(WhichBin,nbins=k)
  m.hat = rep(0,k)
  for(j in 1:k){
    if(N[j]>0) m.hat[j] = mean(y[WhichBin == j])
  }
  if(plotit==TRUE){
    a = min(c(y,m.hat))
    b = max(c(y,m.hat))
    plot(B,c(m.hat,m.hat[k]),lwd=3,type="s",
         xlab=xlab,ylab=ylab,ylim=c(a,b),col="blue",sub=sub)
    points(x,y)
  }
  return(list(bins=B,m.hat=m.hat))
}
pdf("regressogram.pdf")
par(mfrow=c(2,2))
### simulated example
n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
plot(x,y,pch=20)
out = regressogram(x,y,left=0,right=1,k=5,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=10,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=20,plotit=TRUE)
dev.off()
### Bone mineral density versus age for men and women
pdf("bmd.pdf")
par(mfrow=c(2,2))
library(ElemStatLearn)
attach(bone)
### plot the regressogram for each gender with k = 10 and k = 20 bins
### (a sketch; assumes bone has columns age, gender, spnbmd)
for(g in c("male","female")) for(kk in c(10,20))
  regressogram(age[gender==g],spnbmd[gender==g],9,26,kk,TRUE,"Age","Density",g)
dev.off()
From Figures 2 and 3 you can see two things. First, we need a way to choose k. (The answer will be cross-validation.) But also, m̂(x) is very unsmooth. To get a smoother estimator, we will use kernel smoothing.

4 Kernel Regression

If we had many values of Yi with the same value of x, then we could take the average of these Yi's to estimate E[Y |X = x]. But usually every Xi is different. Perhaps we can take the average of the Yi's whose Xi is close to x. This is the idea of kernel smoothing.
To estimate m(x) we will take a local average of the Yi's. In other words, we average all Yi such that |Xi − x| ≤ h, where h is some small number called the bandwidth. Let I = {i : |Xi − x| ≤ h} denote the set of observations that are close to x and let k = #I be the number of such observations. We can write the estimator as

m̂(x) = (1/k) Σ_{i ∈ I} Yi.

Equivalently,

m̂(x) = Σi Yi K((x − Xi)/h) / Σi K((x − Xi)/h)

where

K(z) = 1 if |z| ≤ 1  and  K(z) = 0 if |z| > 1.
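A direct implementation of this local average takes only a few lines of R (a sketch, not part of the original code):

### local average of the Y_i with |X_i - x| <= h, using the kernel K defined above
boxcar.smoother = function(x, y, grid, h){
  sapply(grid, function(u){
    I = abs(x - u) <= h           # observations within h of the point u
    if(!any(I)) return(NA)        # no neighbours: leave the estimate undefined
    mean(y[I])
  })
}
set.seed(1)
x = runif(100)
y = 3*sin(8*x) + rnorm(100, 0, .3)
grid = seq(0, 1, length = 200)
plot(x, y)
lines(grid, boxcar.smoother(x, y, grid, h = .05), lwd = 3, col = "blue")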
[Figure 2: Simulated data and regressogram estimates with k = 5, k = 10 and k = 20 bins.]
Figure 3: Bone density example. Men and women. Left plots use k = 10. Right plots use k = 20.
The function K is called the boxcar kernel.
As you can see in Figure 4, this gives much smoother estimates than the regressogram. But we can
improve this even further by replacing K with a smoother function.
This leads us to the following definition. A one-dimensional smoothing kernel is any smooth function K such that K(x) ≥ 0 and

∫ K(x) dx = 1,   ∫ x K(x) dx = 0   and   σ²_K ≡ ∫ x² K(x) dx > 0.    (4)
Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by

m̂(x) ≡ m̂h(x) = Σi Yi K((x − Xi)/h) / Σi K((x − Xi)/h) = Σj Yj ℓj(x)    (5)

where

ℓj(x) = K((x − Xj)/h) / Σs K((x − Xs)/h).    (6)

Notice that ℓj(x) ≥ 0 and, for every x, Σj ℓj(x) = 1. Thus m̂(x) is a smooth, local average of the Yi's.
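For concreteness, here is a small sketch (not part of the original code) of the weights ℓj(x) from (6) for a Gaussian kernel, evaluated at a single point x:

### the weights l_j(x), and the resulting estimate, at x = 0.5
nw.weights = function(x, X, h){
  w = dnorm((x - X)/h)       # K((x - X_j)/h), up to a constant that cancels
  w/sum(w)                   # l_j(x), j = 1, ..., n
}
set.seed(1)
X = runif(10)
Y = 3*sin(8*X) + rnorm(10, 0, .3)
l = nw.weights(.5, X, h = .1)
sum(l)                       # the weights sum to 1
sum(l*Y)                     # the Nadaraya-Watson estimate m.hat(0.5)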
It can be shown that there is an optimal kernel called the Epanechnikov kernel. But the choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. A common choice for the kernel is the normal density,

K(z) ∝ e^{−z²/2}.

This does not mean that we are assuming that the data are Normal. We are just using this as a convenient way to define the smoothing weights.
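One way to see the insensitivity to K is to fit the same data with two different kernels and overlay the fits. The sketch below (not part of the original code) uses the Gaussian kernel and the Epanechnikov kernel, the latter written up to a constant that cancels in the estimator.

### Nadaraya-Watson fits with two different kernels and the same bandwidth
nw = function(x, y, grid, h, K){
  sapply(grid, function(u){
    w = K((u - x)/h)
    sum(w*y)/sum(w)
  })
}
gauss = function(z) exp(-z^2/2)
epan  = function(z) pmax(1 - z^2, 0)
set.seed(1)
x = runif(200)
y = 3*sin(8*x) + rnorm(200, 0, .3)
grid = seq(0, 1, length = 100)
plot(x, y, col = "gray")
lines(grid, nw(x, y, grid, h = .05, K = gauss), lwd = 3, col = "blue")
lines(grid, nw(x, y, grid, h = .05, K = epan),  lwd = 3, col = "red", lty = 2)
### the two fitted curves are usually very similar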
What does matter much more is the choice of bandwidth h, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. A useful result is the following: under standard smoothness conditions, the risk of the kernel estimator behaves like

c1 h^4 + c2/(n h)

for constants c1 and c2 that depend on the unknown function m (as well as on σ², the density of X and the kernel K). Minimizing over h gives the optimal bandwidth

hn = ( c2 / (4 c1 n) )^{1/5},

which is of order n^{−1/5}; with this choice the risk is of order n^{−4/5}.
Figure 5 shows the kernel estimator m̂h(x) for 4 different choices of h using a Gaussian kernel.
As you can see, we get much nicer estimates. But we still need a way to choose the smoothing
parameter h. Before we discuss that, let’s first discuss linear smoothers.
[Figure 4: Boxcar kernel estimates for the female bone density data.]
[Figure 5: Kernel estimates m̂h for four different choices of the bandwidth h, using a Gaussian kernel (female bone density data).]
5 The Kernel Estimator is a Linear Smoother
Let Ŷ = (Ŷ1, . . . , Ŷn), Y = (Y1, . . . , Yn) and let L be the n × n matrix with entries Lij = ℓj(Xi). We then see that

Ŷ = LY.    (7)

This looks a lot like the equation Ŷ = HY from linear regression. The matrix L is called the smoothing matrix. It is like the hat matrix, but it is not a projection matrix. We call

ν = tr(L) = Σi Lii    (8)

the effective degrees of freedom. As the bandwidth h gets smaller, ν gets larger. In other words, small bandwidths correspond to more complex models.

The equation Ŷ = LY means that we can write each Ŷi as a linear combination of the Yi's. Because of this, we say that kernel regression is a linear smoother. This does not mean that m̂(x) is linear in x. It just means that each Ŷi is a linear combination of the Yi's.

There are many other linear smoothers besides kernel regression estimators, but we will focus on the kernel estimator for now. Remember: a linear smoother does not mean linear regression.
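Here is a quick sketch (not part of the original code) that builds the full smoothing matrix L for the Gaussian-kernel estimator and computes ν for several bandwidths.

### build L, with L[i,j] = l_j(X_i), and compute nu = tr(L)
smoothing.matrix = function(x, h){
  n = length(x)
  L = matrix(0, n, n)
  for(i in 1:n){
    w = dnorm(x[i], x, h)      # Gaussian kernel weights centered at x[i]
    L[i,] = w/sum(w)           # row i holds the weights l_j(x_i)
  }
  return(L)
}
set.seed(1)
x = runif(100)
y = 3*sin(8*x) + rnorm(100, 0, .3)
for(h in c(.02, .05, .2)){
  L = smoothing.matrix(x, h)
  cat("h =", h, " effective degrees of freedom =", round(sum(diag(L)), 1), "\n")
}
### smaller bandwidths give larger nu (more complex fits); the fitted values are L %*% y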
6 Choosing h by Cross-Validation
We will estimate the prediction risk using leave-one-out cross-validation. Then we choose the bandwidth h by minimizing the cross-validation score. Let m̂h^(−i) be the kernel estimator (with bandwidth h) obtained after leaving out (Xi, Yi). The cross-validation score is

CV(h) = (1/n) Σi (Yi − m̂h^(−i)(Xi))².

It turns out that there is a handy short cut formula for CV (similar to the one for linear models), which is

CV(h) = (1/n) Σi ( (Yi − m̂h(Xi)) / (1 − Lii) )².    (9)
Some people approximate the formula above by replacing each Lii with the average value (1/n) Σi Lii = (1/n) tr(L) = ν/n. If we replace Lii with ν/n we get the generalized cross-validation score

GCV(h) = (1/(1 − ν/n)²) (1/n) Σi (Yi − m̂h(Xi))².    (10)
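The short cut can be checked numerically. The sketch below (not part of the original code) computes the leave-one-out score directly and via formula (9) for a Gaussian kernel; the two values agree.

### direct leave-one-out cross-validation versus the short cut formula (9)
set.seed(1)
n = 50
x = runif(n)
y = 3*sin(8*x) + rnorm(n, 0, .3)
h = .05
cv.direct = 0
for(i in 1:n){
  w = dnorm(x[i], x[-i], h)               # kernel weights from all points except i
  m.minus.i = sum(w*y[-i])/sum(w)         # leave-one-out fit at x_i
  cv.direct = cv.direct + (y[i] - m.minus.i)^2/n
}
W = outer(x, x, function(a, b) dnorm(a, b, h))
L = W/rowSums(W)                          # smoothing matrix
m.hat = as.vector(L %*% y)
cv.shortcut = mean(((y - m.hat)/(1 - diag(L)))^2)
c(direct = cv.direct, shortcut = cv.shortcut)   # identical up to rounding error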
Figure 6 shows the cross-validation score and the generalized cross-validation score. I plotted them versus h and then versus the effective degrees of freedom ν. The bottom left plot is the kernel regression estimator using the bandwidth that minimizes the cross-validation score.
kernel = function(x,y,grid,h){
  ### kernel regression estimator at a grid of values
  ### one dimension only
  ### return m.hat(u) for all u in grid
  n = length(x)
  k = length(grid)
  m.hat = rep(0,k)
  for(i in 1:k){
    w = dnorm(grid[i],x,h)
    m.hat[i] = sum(y*w)/sum(w)
  }
  return(m.hat)
}
kernel.fitted = function(x,y,h){
  ### kernel regression
  ### fitted values and diagonal of smoothing matrix
  n = length(x)
  m.hat = rep(0,n)
  S = rep(0,n)
  for(i in 1:n){
    w = dnorm(x[i],x,h)
    w = w/sum(w)
    m.hat[i] = sum(y*w)
    S[i] = w[i]
  }
  return(list(fitted=m.hat,S=S))
}
Figure 6: Cross-validation (black) and generalized cross-validation (red dotted line) versus h and versus effective degrees of freedom. The bottom left plot is the kernel regression estimator using the bandwidth that minimizes the cross-validation score.
CV = function(x,y,H){
  ### H is a vector of bandwidths
  ### Compute cross-validation and generalized cross-validation
  n = length(x)
  k = length(H)
  cv = rep(0,k)
  nu = rep(0,k)
  gcv = rep(0,k)
  for(i in 1:k){
    tmp = kernel.fitted(x,y,H[i])
    cv[i] = mean(((y - tmp$fitted)/(1-tmp$S))^2)
    nu[i] = sum(tmp$S)
    gcv[i] = mean((y - tmp$fitted)^2)/(1-nu[i]/n)^2
  }
  ### nu is the effective degrees of freedom
  return(list(cv=cv,gcv=gcv,nu=nu))
}
pdf("crossval.pdf")
par(mfrow=c(2,2))
bone = read.table("BoneDensity.txt",header=TRUE)
attach(bone)
H = seq(.1,5,length=20)
out = CV(age.female,density.female,H)
plot(H,out$cv,type="l",lwd=3,xlab="Bandwidth",ylab="Cross-validation Score")
lines(H,out$gcv,lty=2,col="red",lwd=3)
plot(out$nu,out$cv,type="l",lwd=3,
xlab="Effective Degrees of Freedom",ylab="Cross-validation Score")
lines(out$nu,out$gcv,lty=2,col="red",lwd=3)
j = which.min(out$cv)
h = H[j]
grid = seq(min(age.female),max(age.female),length=100)
m.hat = kernel(age.female,density.female,grid,h)
plot(age.female,density.female,xlab="Age",ylab="Density")
lines(grid,m.hat,lwd=3,col="blue")
dev.off()
7 Analysis of the Kernel Estimator
What do we learn from this? First, the optimal bandwidth gets smaller as the sample size increases. Second, the MISE goes to 0 as n gets larger. But it goes to 0 more slowly than for the estimators you are used to. For example, if you estimate the mean µ with the sample mean Ȳ then E(Ȳ − µ)² = σ²/n. Roughly speaking, the mean squared error of parametric estimators is something like 1/n, but kernel estimators (and other nonparametric estimators) behave like (1/n)^{4/5}, which goes to 0 more slowly. Slower convergence is the price of being nonparametric and of making fewer assumptions (note that the price of making strong parametric assumptions is a non-trivial bias that does not even shrink with the sample size).

The formula we derived for hn is interesting for understanding the theoretical behavior of m̂. But we can't use that formula in practice because those constants involve quantities that depend on the unknown function m. So we use cross-validation to choose h in practice.
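A rough simulation illustrates the slow rate. The sketch below (not from the original notes) estimates the mean squared error of the kernel estimator at a single point, scaling the bandwidth as n^{−1/5}; the constant 0.2 in the bandwidth is an arbitrary choice.

### Monte Carlo estimate of the MSE of the kernel estimator at x0 = 0.5,
### with bandwidth h = 0.2 * n^(-1/5), for increasing sample sizes
set.seed(1)
m = function(x) 3*sin(8*x)
x0 = .5
mse = function(n, nsim = 500){
  h = .2*n^(-1/5)
  err = replicate(nsim, {
    x = runif(n)
    y = m(x) + rnorm(n, 0, .3)
    w = dnorm((x0 - x)/h)
    sum(w*y)/sum(w) - m(x0)
  })
  mean(err^2)
}
ns = c(100, 400, 1600)
cbind(n = ns, mse = sapply(ns, mse), rate = ns^(-4/5))
### each time n increases by a factor of 4, both the estimated MSE and the reference
### rate n^(-4/5) drop by roughly a factor of 3, not by the factor of 4 that a
### parametric 1/n rate would give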
8 Estimating σ²(x)

We have not assumed that the variance is constant. We will only assume that

σ²(x) = Var[Y |X = x]    (13)

is a smooth function of x. We can estimate σ²(x) as follows:

1. Get m̂(x).

2. Let Zi = (Yi − m̂(Xi))² = ei², the squared residuals.

3. Do nonparametric regression of the Zi versus the Xi to get σ̂²(x).
9 Example
> ### setup (a sketch: the female subset of the ElemStatLearn bone data,
> ### bandwidth h = 2 chosen arbitrarily)
> library(ElemStatLearn)
> attach(bone)
> I = (gender == "female")
> x = age[I]
> h = 2
> grid = seq(min(x),max(x),length=100)
> pdf("variance.pdf")
> par(mfrow=c(2,2))
> plot(x,spnbmd[I],xlab="Age",ylab="Bone Density")
> y = spnbmd[I]
> out = kernel(x,y,grid,h)
> lines(grid,out,lwd=3,col="blue")
> fit = kernel(x,y,x,h)
> residuals = y - fit
> plot(x,residuals,xlab="Age",ylab="Residuals",pch=19)
> abline(h=0,lwd=3,col="red")
>
> z = residuals^2
> plot(x,z,pch=19,xlab="Age",ylab="Squared Residuals")
> out = kernel(x,z,grid,h)
> lines(grid,out,lwd=3,col="blue")
> dev.off()
The plots are in Figure 7. Do you like the result? How should we fix things?
Figure 7: Top left: data and kernel regression estimator. Top right: residuals. Bottom left: estimate of σ²(x).