
Nonparametric Regression

1 Introduction

So far we have assumed that

Y = m(X) + ε,   where m(x) = β0 + β1 x1 + · · · + βd xd ,

i.e., the regression function is linear in a particular set of covariates/features. Although we have
allowed interactions and higher-order polynomial terms to be included among the covariates, so far
the particular terms must be chosen by the user, and inference requires the resulting model to be
correct (unless we are taking the projection approach). Now we will drop the linearity assumption,
assume only that the regression function is smooth, and discuss more “data-driven” methods for
capturing nonlinearity.

Estimating the regression function m without linearity assumptions (e.g., only assuming smoothness) is called nonparametric regression or smoothing. Without loss of generality, we can always write

Y = m(X) + ε

where E(ε | X) = 0. This follows since ε = Y − m(X) and

E(ε | X) = E(Y | X) − m(X) = 0.

We also define

σ²(x) = Var[Y | X = x] = E[ε² | X = x].

We will start with the case where X ∈ R.

Example 1 Figure 1 shows data from a study on bone mineral density. The plots show the relative
change in bone density over two consecutive visits, for men and women. The smooth estimates of
the regression functions suggest that a growth spurt occurs two years earlier for females. In this
example, Y is change in bone mineral density and X is age.

2 The Bias–Variance Tradeoff

If we have an estimate m̂(x) of m(x), we need to see how accurate the estimator is. Note that m̂(x) is a random function. We will consider two measures of accuracy.

The first is the prediction risk or prediction error which, as before, is

R = R(m, m̂) = E[(Y − m̂(X))²]    (1)

Figure 1: Bone Mineral Density Data (relative change in BMD versus age; females and males).

where (X, Y) denotes a new observation. The expectation is over the training data (X1, Y1), . . . , (Xn, Yn) and the new observation (X, Y). Let us define m̄(x) = E[m̂(x)]. Then

Y − m̂(X) = [Y − m(X)] + [m(X) − m̄(X)] + [m̄(X) − m̂(X)] = A + B + C.

Hence,

[Y − m̂(X)]² = A² + B² + C² + 2AB + 2AC + 2BC.
Now we take the expected value. The first term,

τ² = E[A²] = E[(Y − m(X))²],

is called the unavoidable error. Even if we knew the regression function m(x), we would still incur this error when doing prediction. Next we have

E[B²] = E[m(X) − m̄(X)]² = ∫ b²(x) p(x) dx

where

b(x) = E[m̂(x)] − m(x)

is the bias. Next we note that

E[C²] = E[m̄(X) − m̂(X)]² = ∫ E[m̂(x) − m̄(x)]² p(x) dx = ∫ v(x) p(x) dx

where v(x) = Var[m̂(x)] is the variance. The rest of the terms have mean 0. For example,

E[AB] = E([Y − m(X)][m(X) − m̄(X)]) = ∫ E[Y − m(x) | X = x] [m(x) − m̄(x)] p(x) dx = ∫ [m(x) − m(x)] [m(x) − m̄(x)] p(x) dx = 0

since E[Y | X = x] = m(x). You should check that the other cross terms are also 0. To summarize:

Theorem 2 We have the following bias–variance decomposition:

R(m, m̂) = E[(Y − m̂(X))²] = τ² + ∫ b²(x) p(x) dx + ∫ v(x) p(x) dx.

Our second way to measure the quality of the estimator m̂(x) is to define the mean integrated squared error

MISE = E ∫ [m̂(x) − m(x)]² p(x) dx.

Using very similar calculations we can show that

MISE = E ∫ [m̂(x) − m(x)]² p(x) dx = ∫ b²(x) p(x) dx + ∫ v(x) p(x) dx.    (2)

Hence, whether we use prediction error or MISE, we have to control both the bias and the variance of m̂(x). The key is to make sure that neither one gets too large. The estimator m̂ typically involves smoothing the data in some way. The main challenge is to determine how much smoothing to do. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and variance.
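To see the tradeoff concretely, here is a small simulation sketch (not part of the original notes) that approximates the squared bias and the variance, at a single point x0, of the kernel estimator introduced in Section 4, for a small and a large amount of smoothing. The regression function, noise level, sample size, and the two bandwidths are arbitrary illustrative choices.

### Monte Carlo sketch of the bias-variance tradeoff at a single point x0.
### Illustrative choices (not from the notes): m(x) = 3*sin(8*x), n = 100, sigma = 0.3.
m.true = function(x) 3*sin(8*x)
nw = function(x0,x,y,h){            ### Gaussian-kernel estimate of m(x0)
    w = dnorm(x0,x,h)
    sum(w*y)/sum(w)
}
x0 = 0.5
n = 100
nsim = 1000
for(h in c(0.01,0.3)){              ### undersmoothing vs oversmoothing
    est = rep(0,nsim)
    for(s in 1:nsim){
        x = runif(n)
        y = m.true(x) + rnorm(n,0,.3)
        est[s] = nw(x0,x,y,h)
    }
    cat("h =",h,"  bias^2 =",(mean(est)-m.true(x0))^2,"  variance =",var(est),"\n")
}

With the small bandwidth the variance dominates; with the large bandwidth the squared bias dominates, mirroring the tradeoff described above.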

3 The Regressogram

We will begin by assuming there is only one covariate X. For simplicity, assume that 0 ≤ X ≤ 1. The simplest nonparametric estimator of m is the regressogram. Let k be an integer and let h = 1/k. Divide [0, 1] into k bins of width h:

B1 = [0, h], B2 = [h, 2h], B3 = [2h, 3h], . . . , Bk = [(k − 1)h, 1].

Let nj denote the number of observations in bin Bj. In other words, nj = Σi I(Xi ∈ Bj), where I(Xi ∈ Bj) = 1 if Xi ∈ Bj and I(Xi ∈ Bj) = 0 if Xi ∉ Bj.

We then define Ȳj to be the average of the Yi's in Bj:

Ȳj = (1/nj) Σ_{Xi ∈ Bj} Yi.

Finally, we define m̂(x) = Ȳj for all x ∈ Bj. We can write this as

m̂(x) = Σ_{j=1}^{k} Ȳj I(x ∈ Bj).

Here are some examples.

regressogram = function(x,y,left,right,k,plotit,xlab="",ylab="",sub=""){
    ### assumes the data are on the interval [left,right]
    n = length(x)
    B = seq(left,right,length=k+1)                  ### bin breakpoints
    WhichBin = findInterval(x,B,all.inside=TRUE)    ### bin index of each x; puts x == right into bin k
    N = tabulate(WhichBin,nbins=k)                  ### counts per bin; nbins=k keeps length k even if trailing bins are empty
    m.hat = rep(0,k)
    for(j in 1:k){
        if(N[j]>0) m.hat[j] = mean(y[WhichBin == j])   ### bin average of the y's
    }
    if(plotit==TRUE){
        a = min(c(y,m.hat))
        b = max(c(y,m.hat))
        plot(B,c(m.hat,m.hat[k]),lwd=3,type="s",
             xlab=xlab,ylab=ylab,ylim=c(a,b),col="blue",sub=sub)
        points(x,y)
    }
    return(list(bins=B,m.hat=m.hat))
}

pdf("regressogram.pdf")
par(mfrow=c(2,2))
### simulated example
n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
plot(x,y,pch=20)
out = regressogram(x,y,left=0,right=1,k=5,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=10,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=20,plotit=TRUE)
dev.off()

### Bone mineral density versus age for men and women
pdf("bmd.pdf")
par(mfrow=c(2,2))
library(ElemStatLearn)
attach(bone)

age.male = age[gender == "male"]
density.male = spnbmd[gender == "male"]
out = regressogram(age.male,density.male,left=9,right=26,k=10,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Male")
out = regressogram(age.male,density.male,left=9,right=26,k=20,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Male")

age.female = age[gender == "female"]
density.female = spnbmd[gender == "female"]
out = regressogram(age.female,density.female,left=9,right=26,k=10,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Female")
out = regressogram(age.female,density.female,left=9,right=26,k=20,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Female")
dev.off()

From Figures 2 and 3 you can see two things. First, we need a way to choose k. (The answer will
be cross-validation.) But also, m̂(x) is very unsmooth. To get a smoother estimator, we will use
kernel smoothing.

4 The Kernel Estimator

If we had many values of Yi with the same value of x, then we could take the average of these Yi's to estimate E[Y | X = x]. But usually every Xi is different. Perhaps we can take the average of the Yi's whose Xi is close to x. This is the idea of kernel smoothing.

To estimate m(x) we will take a local average of the Yi's. In other words, we average all Yi such that |Xi − x| ≤ h, where h is some small number called the bandwidth. Let I = {i : |Xi − x| ≤ h} denote all observations that are close to x. We can write the estimator as

m̂(x) = (1/k) Σ_{i ∈ I} Yi

where k is the number of points in I. We can re-write this estimator as

m̂(x) = [ Σ_{i=1}^{n} K((x − Xi)/h) Yi ] / [ Σ_{i=1}^{n} K((x − Xi)/h) ]    (3)

where

K(z) = 1 if |z| ≤ 1,   and   K(z) = 0 if |z| > 1.
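Here is a small sketch (not code from the original notes) showing how the estimator in (3) can be computed directly as a local average; the simulated model matches the regressogram example above, and the grid, bandwidth, and function name local.average are arbitrary illustrative choices.

### Local-average (boxcar) kernel estimator -- a minimal sketch.
### For each grid point, average the y's whose x lies within h of it.
local.average = function(x,y,grid,h){
    m.hat = rep(NA,length(grid))
    for(i in 1:length(grid)){
        I = which(abs(x - grid[i]) <= h)
        if(length(I) > 0) m.hat[i] = mean(y[I])
    }
    return(m.hat)
}

n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
grid = seq(0,1,length=100)
plot(x,y)
lines(grid,local.average(x,y,grid,h=0.1),lwd=3,col="blue")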

Figure 2: Simulated data. Various choices of k.

Figure 3: Bone density example. Men and women. Left plots use k = 10. Right plots use k = 20.

The function K is called the boxcar kernel.

As you can see in Figure 4, this gives much smoother estimates than the regressogram. But we can
improve this even further by replacing K with a smoother function.

This leads us to the following definition. A one-dimensional smoothing kernel is any smooth function K such that K(x) ≥ 0 and

∫ K(x) dx = 1,   ∫ x K(x) dx = 0   and   σ²_K ≡ ∫ x² K(x) dx > 0.    (4)

Note that the boxcar kernel does satisfy these conditions.

Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by

m̂(x) ≡ m̂h(x) = [ Σ_{i=1}^{n} Yi K((x − Xi)/h) ] / [ Σ_{i=1}^{n} K((x − Xi)/h) ] = Σ_{j=1}^{n} Yj ℓj(x)    (5)

where

ℓj(x) = K((x − Xj)/h) / Σ_s K((x − Xs)/h).    (6)

Notice that ℓj(x) ≥ 0 and, for every x, Σj ℓj(x) = 1. Thus m̂(x) is a smooth, local average of the Yi's.

It can be shown that there is an optimal kernel called the Epanechnikov kernel. But the choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. A common choice for the kernel is the normal density:

K(z) ∝ exp(−z²/2).

This does not mean that we are assuming that the data are Normal. We are just using this as a convenient way to define the smoothing weights.
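As a quick illustration of this insensitivity (a sketch that is not part of the original notes; the simulated data, bandwidth, and the names nw.fit, K.gauss, K.epan are arbitrary choices), we can compare Nadaraya–Watson fits computed with a Gaussian and an Epanechnikov kernel. The bandwidths are not calibrated across kernels, so this is only a rough comparison, but the two curves are typically very similar.

### Nadaraya-Watson fits under two different kernels -- a sketch.
nw.fit = function(x,y,grid,h,K){
    m.hat = rep(0,length(grid))
    for(i in 1:length(grid)){
        w = K((grid[i] - x)/h)
        m.hat[i] = sum(w*y)/sum(w)
    }
    return(m.hat)
}
K.gauss = function(z) exp(-z^2/2)        ### Gaussian kernel (up to a constant)
K.epan = function(z) pmax(1 - z^2, 0)    ### Epanechnikov kernel (up to a constant)

n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
grid = seq(0,1,length=100)
plot(x,y)
lines(grid,nw.fit(x,y,grid,h=0.1,K.gauss),lwd=3,col="blue")
lines(grid,nw.fit(x,y,grid,h=0.1,K.epan),lwd=3,col="red",lty=2)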

What does matter much more is the choice of the bandwidth h, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. A useful result is the following:

Lemma 3 Suppose that, as n → ∞, we have h → 0 and nh → ∞. Then, under some conditions, m̂h(x) → m(x) in probability.

Figure 5 shows the kernel estimator m̂h(x) for 4 different choices of h using a Gaussian kernel. As you can see, we get much nicer estimates. But we still need a way to choose the smoothing parameter h. Before we discuss that, let's first discuss linear smoothers.

Figure 4: Boxcar kernel estimators for various choices of h.

Figure 5: Kernel estimators for various choices of h.

5 The Kernel Estimator is a Linear Smoother

The kernel estimator m̂(x) is defined at each x. As we did for linear regression, we can also compute the estimator at the original data points Xi. Again, we call Ŷi = m̂(Xi) the fitted values. We know that m̂h(x) = Σj Yj ℓj(x). When x = Xi, we have

Ŷi = m̂h(Xi) = Σj Yj ℓj(Xi).

Let Ŷ = (Ŷ1, . . . , Ŷn), Y = (Y1, . . . , Yn) and let L be the n × n matrix with entries Lij = ℓj(Xi). We then see that

Ŷ = LY.    (7)

This looks a lot like the equation Ŷ = HY from linear regression. The matrix L is called the smoothing matrix. It is like the hat matrix, but it is not a projection matrix. We call

ν = tr(L) = Σi Lii    (8)

the effective degrees of freedom. As the bandwidth h gets smaller, ν gets larger. In other words, small bandwidths correspond to more complex models.

The equation Ŷ = LY means that we can write each Ŷi as a linear combination of the Yi's. Because of this, we say that kernel regression is a linear smoother. This does not mean that m̂(x) is linear. It just means that each Ŷi is a linear combination of the Yi's.

There are many other linear smoothers besides kernel regression estimators, but we will focus on the kernel estimator for now. Remember: a linear smoother does not mean linear regression.
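To make the smoothing matrix concrete, here is a small sketch (not from the original notes) that builds L for a Gaussian kernel, checks that its rows sum to 1, computes the fitted values Ŷ = LY, and reports ν = tr(L); the simulated data, bandwidths, and the name smoothing.matrix are arbitrary choices.

### Construct the smoothing matrix L for a Gaussian kernel -- a sketch.
### Row i holds the weights l_j(X_i), so Y.hat = L %*% Y and nu = tr(L).
smoothing.matrix = function(x,h){
    n = length(x)
    L = matrix(0,n,n)
    for(i in 1:n){
        w = dnorm(x[i],x,h)
        L[i,] = w/sum(w)
    }
    return(L)
}

n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
L = smoothing.matrix(x,h=0.05)
range(rowSums(L))                      ### every row sums to 1
y.hat = as.vector(L %*% y)             ### fitted values Y.hat = L Y
sum(diag(L))                           ### effective degrees of freedom nu
sum(diag(smoothing.matrix(x,h=0.5)))   ### a larger bandwidth gives a smaller nu

The diagonal entries Lii are exactly what the kernel.fitted function in the next section stores in S.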

6 Choosing h by Cross-Validation

We will estimate the prediction risk using leave-one-out cross-validation. Then we choose the bandwidth h by minimizing the cross-validation score. Let m̂h^(−i) be the kernel estimator (with bandwidth h) obtained after leaving out (Xi, Yi). The cross-validation score is

CV(h) = (1/n) Σi (Yi − m̂h^(−i)(Xi))².

It turns out that there is a handy shortcut formula for CV (similar to the one for linear models), namely

CV(h) = (1/n) Σi [ (Yi − m̂h(Xi)) / (1 − Lii) ]².    (9)

Our strategy is to fit m̂h for many values of h. We then compute CV(h) using (9). Then we find ĥ to minimize CV(h). The bandwidth ĥ is the one we use.

Some people approximate the formula above by replacing each Lii with the average value (1/n) Σi Lii = (1/n) tr(L) = ν/n. If we replace Lii with ν/n we get this formula:

GCV(h) = [1/(1 − ν/n)²] (1/n) Σi (Yi − m̂h(Xi))²    (10)

which is called generalized cross-validation.

Figure 6 shows the cross-validation score and the generalized cross validation score. I plotted it
versus h and then I plotted it versus the effective degrees of freedom ν. The bottom left plot is the
kernel regression estimator using the bandwidth that minimizes the cross-validation score.

Here is the code:

kernel = function(x,y,grid,h){
    ### kernel regression estimator at a grid of values
    ### one dimension only
    ### return m.hat(u) for all u in grid
    n = length(x)
    k = length(grid)
    m.hat = rep(0,k)
    for(i in 1:k){
        w = dnorm(grid[i],x,h)        ### Gaussian kernel weights
        m.hat[i] = sum(y*w)/sum(w)
    }
    return(m.hat)
}

kernel.fitted = function(x,y,h){
    ### kernel regression
    ### fitted values and diagonal of smoothing matrix
    n = length(x)
    m.hat = rep(0,n)
    S = rep(0,n)
    for(i in 1:n){
        w = dnorm(x[i],x,h)
        w = w/sum(w)
        m.hat[i] = sum(y*w)
        S[i] = w[i]                   ### S[i] = L_ii
    }
    return(list(fitted=m.hat,S=S))
}

Figure 6: Cross validation (black) and generalized cross validation (red dotted line) versus h and versus effective degrees of freedom. The bottom left plot is the kernel regression estimator using the bandwidth that minimizes the cross-validation score.

CV = function(x,y,H){
    ### H is a vector of bandwidths
    ### Compute cross-validation and generalized cross validation
    n = length(x)
    k = length(H)
    cv = rep(0,k)
    nu = rep(0,k)
    gcv = rep(0,k)
    for(i in 1:k){
        tmp = kernel.fitted(x,y,H[i])
        cv[i] = mean(((y - tmp$fitted)/(1-tmp$S))^2)
        nu[i] = sum(tmp$S)
        gcv[i] = mean((y - tmp$fitted)^2)/(1-nu[i]/n)^2
    }
    ### nu is the effective degrees of freedom
    return(list(cv=cv,gcv=gcv,nu=nu))
}

pdf("crossval.pdf")
par(mfrow=c(2,2))

bone = read.table("BoneDensity.txt",header=TRUE)
attach(bone)

age.female = age[gender == "female"]
density.female = spnbmd[gender == "female"]

H = seq(.1,5,length=20)
out = CV(age.female,density.female,H)
plot(H,out$cv,type="l",lwd=3,xlab="Bandwidth",ylab="Cross-validation Score")
lines(H,out$gcv,lty=2,col="red",lwd=3)
plot(out$nu,out$cv,type="l",lwd=3,
xlab="Effective Degrees of Freedom",ylab="Cross-validation Score")
lines(out$nu,out$gcv,lty=2,col="red",lwd=3)

j = which.min(out$cv)
h = H[j]
grid = seq(min(age.female),max(age.female),length=100)
m.hat = kernel(age.female,density.female,grid,h)
plot(age.female,density.female,xlab="Age",ylab="Density")
lines(grid,m.hat,lwd=3,col="blue")

dev.off()

NOTE: The residuals are defined as before, namely, ei = Yi − m̂(Xi). You should plot the residuals as you did for linear regression.

7 Analysis of the Kernel Estimator

Recall that the mean integrated squared error is

MISE = ∫ bn²(x) p(x) dx + ∫ vn(x) p(x) dx.

It can be shown (under certain conditions) that, given X1, . . . , Xn,

b(x) ≈ (h² µ2(K)/2) [ m''(x) + 2 m'(x) f'(x)/f(x) ]    (11)

and

v(x) ≈ (R(K)/(n h f(x))) σ²(x)    (12)

where f(x) is the density of X, µ2(K) = ∫ x² K(x) dx, R(K) = ∫ K²(x) dx and σ²(x) = Var(Y | X = x). Integrating the squared bias (of order h⁴) and the variance (of order 1/(nh)) over x, we can say that

MISE ≈ C1 h⁴ + C2/(nh).

If we minimize this expression over h we find that the best h is

hn = (C3/n)^{1/5}

where C3 = C2/(4C1). If we insert this hn into the MISE we find that

MISE ≈ (C4/n)^{4/5}.
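To spell out the minimization step (a standard calculation that the notes leave implicit): setting the derivative of C1 h⁴ + C2/(nh) with respect to h equal to zero gives

4 C1 h³ − C2/(n h²) = 0,   i.e.,   h⁵ = C2/(4 C1 n),

which is exactly hn = (C3/n)^{1/5} with C3 = C2/(4C1). Substituting hn back in, both terms are of order n^{−4/5}, which gives the (C4/n)^{4/5} rate above.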

What do we learn from this? First, the optimal bandwidth gets smaller as the sample size increases. Second, the MISE goes to 0 as n gets larger, but it goes to 0 more slowly than for the estimators you are used to. For example, if you estimate the mean µ with the sample mean Ȳ then E(Ȳ − µ)² = σ²/n. Roughly speaking, the mean squared error of parametric estimators is something like 1/n, but kernel estimators (and other nonparametric estimators) behave like (1/n)^{4/5}, which goes to 0 more slowly.

Slower convergence is the price of being nonparametric and making fewer assumptions. (Note that the price of making strong parametric assumptions is a non-trivial bias that does not shrink with sample size.)

The formula we derived for hn is useful for understanding the theoretical behavior of m̂, but we cannot use it in practice because the constants involve quantities that depend on the unknown function m. So, in practice, we use cross-validation to choose h.

8 Estimating σ²(x)

We have not assumed that the variance is constant. We will only assume that

σ²(x) = Var[Y | X = x]    (13)

is a smooth function of x. We can estimate σ²(x) as follows (a short sketch in code follows the list):

1. Get m̂(x).

2. Let Zi = (Yi − m̂(Xi))² = ei².

3. Do a nonparametric regression of the Zi versus the Xi to get σ̂²(x).
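Here is a minimal sketch of these three steps (not from the original notes), reusing the kernel and kernel.fitted functions defined earlier; the bandwidth h = 0.8 is an arbitrary illustrative choice and would normally be chosen by cross-validation. The next section carries this out on the bone density data.

### Sketch: estimate sigma^2(x) by smoothing the squared residuals.
### Assumes x and y hold the data and that kernel() and kernel.fitted() are defined as above.
h = 0.8                           ### illustrative bandwidth
fit = kernel.fitted(x,y,h)        ### step 1: m.hat at the data points
z = (y - fit$fitted)^2            ### step 2: squared residuals
grid = seq(min(x),max(x),length=100)
sigma2.hat = kernel(x,z,grid,h)   ### step 3: smooth z versus x
plot(x,z,xlab="x",ylab="Squared Residuals")
lines(grid,sigma2.hat,lwd=3,col="blue")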

9 Example

Let’s return to the bone mineral density data.

> kernel = function(x,y,grid,h){
+ ### kernel regression estimator at a grid of values
+ ### one dimension only
+ ### return m.hat(u) for all u in grid
+ n = length(x)
+ k = length(grid)
+ m.hat = rep(0,k)
+ for(i in 1:k){
+ w = dnorm(grid[i],x,h)
+ m.hat[i] = sum(y*w)/sum(w)
+ }
+ return(m.hat)
+ }
>
> pdf("Bone-Density.pdf")
> par(mfrow=c(2,2))
> df = read.table("bone.txt",header=TRUE)
> names(df)
[1] "idnum" "age" "gender" "spnbmd"
> attach(df)
> n = nrow(df)
> is.factor(gender)
[1] TRUE
> unique(gender)
[1] male female
Levels: female male
> I = (1:n)[gender == "female"]
> plot(age[I],spnbmd[I],xlab="Age",ylab="Bone Density",pch=19)
>
> h = 0.8
> grid = seq(9,26,length=50)
> x = age[I]

> y = spnbmd[I]
> out = kernel(x,y,grid,h)
> lines(grid,out,lwd=3,col="blue")
> fit = kernel(x,y,x,h)
> residuals = y - fit
> plot(x,residuals,xlab="Age",ylab="Residuals",pch=19)
> abline(h=0,lwd=3,col="red")
>
> z = residuals^2
> plot(x,z,pch=19,xlab="Age",ylab="Squared Residuals")
> out = kernel(x,z,grid,h)
> lines(grid,out,lwd=3,col="blue")
> dev.off()

The plots are in Figure 7. Do you like the result? How should we fix things?

Figure 7: Top left: data and kernel regression estimator. Top right: residuals. Bottom left: estimate of σ²(x).
