
Nonparametric Regression

1 Introduction

So far we have assumed that

Y = m(X) + ε,   where m(x) = β0 + β1 x1 + · · · + βd xd ,

i.e., the regression function is linear in a particular set of covariates/features. Although we have
allowed interactions and higher-order polynomial terms to be included among the covariates, so far
the particular terms must be chosen by the user, and inference requires the resulting model to be
correct (unless we are taking the projection approach). Now we will drop the linearity assumption,
assume only that the regression function is smooth, and discuss more “data-driven” methods for
capturing nonlinearity.

Estimating the regression function m without linearity assumptions (e.g., only assuming smoothness) is called nonparametric regression or smoothing. Without loss of generality, we can always write

Y = m(X) + ε

where E(ε | X) = 0. This follows since ε = Y − m(X) and

E(ε | X) = E(Y | X) − m(X) = 0.

We also define

σ²(x) = Var[Y | X = x] = E[ε² | X = x].

We will start with the case where X ∈ R.

Example 1 Figure 1 shows data from a study on bone mineral density. The plots show the relative
change in bone density over two consecutive visits, for men and women. The smooth estimates of
the regression functions suggest that a growth spurt occurs two years earlier for females. In this
example, Y is change in bone mineral density and X is age.

2 The Bias–Variance Tradeoff

If we have an estimate m̂(x) of m(x), we need to see how accurate the estimator is. Note that m̂(x) is a random function. We will consider two measures of accuracy.

The first is the prediction risk or prediction error which, as before, is

R = R(m, m̂) = E[(Y − m̂(X))²]    (1)

Figure 1: Bone Mineral Density Data (relative change in BMD versus age; females and males).

where (X, Y) denotes a new observation. The expectation is over the training data (X1, Y1), . . . , (Xn, Yn) and the new observation (X, Y). Let us define m̄(x) = E[m̂(x)]. Then

Y − m̂(X) = [Y − m(X)] + [m(X) − m̄(X)] + [m̄(X) − m̂(X)] = A + B + C.

Hence,

[Y − m̂(X)]² = A² + B² + C² + 2AB + 2AC + 2BC.
Now we take the expected value. The first term,

τ² = E[A²] = E[(Y − m(X))²],

is called the unavoidable error. Even if we knew the regression function m(x), we would still incur this error when doing prediction. Next we have

E[B²] = E[m(X) − m̄(X)]² = ∫ b²(x) p(x) dx

where

b(x) = E[m̂(x)] − m(x)

is the bias. Next we note that

E[C²] = E[m̄(X) − m̂(X)]² = ∫ E[m̂(x) − m̄(x)]² p(x) dx = ∫ v(x) p(x) dx

where v(x) = Var[m̂(x)] is the variance. The rest of the terms have mean 0. For example,

E[AB] = E([Y − m(X)][m(X) − m̄(X)]) = ∫ E[Y − m(x) | X = x] [m(x) − m̄(x)] p(x) dx = ∫ [m(x) − m(x)] [m(x) − m̄(x)] p(x) dx = 0

since E[Y | X = x] = m(x). You should check that the other cross terms are also 0. To summarize:

Theorem 2 We have the following bias–variance decomposition:

R(m, m̂) = E[(Y − m̂(X))²] = τ² + ∫ b²(x) p(x) dx + ∫ v(x) p(x) dx.

Our second way to measure the quality of the estimator m̂(x) is to define the mean integrated squared error

MISE = E ∫ [m̂(x) − m(x)]² p(x) dx.

Using very similar calculations we can show that

MISE = E ∫ [m̂(x) − m(x)]² p(x) dx = ∫ b²(x) p(x) dx + ∫ v(x) p(x) dx.    (2)

Hence, whether we use prediction error or MISE, we have to control both the bias and the variance of m̂(x). The key is to make sure that neither one gets too large. The estimator m̂ typically involves smoothing the data in some way. The main challenge is to determine how much smoothing to do. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true. This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and variance.
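To see the tradeoff concretely, here is a small simulation sketch (not part of the original notes) that approximates the squared bias and the variance, at a single point x0, of the kernel estimator introduced in Section 4, for a small and a large amount of smoothing. The regression function, noise level, sample size, and the two bandwidths are arbitrary illustrative choices.

### Monte Carlo sketch of the bias-variance tradeoff at a single point x0.
### Illustrative choices (not from the notes): m(x) = 3*sin(8*x), n = 100, sigma = 0.3.
m.true = function(x) 3*sin(8*x)
nw = function(x0,x,y,h){            ### Gaussian-kernel estimate of m(x0)
    w = dnorm(x0,x,h)
    sum(w*y)/sum(w)
}
x0 = 0.5
n = 100
nsim = 1000
for(h in c(0.01,0.3)){              ### undersmoothing vs oversmoothing
    est = rep(0,nsim)
    for(s in 1:nsim){
        x = runif(n)
        y = m.true(x) + rnorm(n,0,.3)
        est[s] = nw(x0,x,y,h)
    }
    cat("h =",h,"  bias^2 =",(mean(est)-m.true(x0))^2,"  variance =",var(est),"\n")
}

With the small bandwidth the variance dominates; with the large bandwidth the squared bias dominates, mirroring the tradeoff described above.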

3 The Regressogram

We will begin by assuming there is only one covariate X. For simplicity, assume that 0 ≤ X ≤ 1. The simplest nonparametric estimator of m is the regressogram. Let k be an integer and let h = 1/k. Divide [0, 1] into k bins of width h:

B1 = [0, h], B2 = [h, 2h], B3 = [2h, 3h], . . . , Bk = [(k − 1)h, 1].

Let nj denote the number of observations in bin Bj. In other words, nj = Σi I(Xi ∈ Bj), where I(Xi ∈ Bj) = 1 if Xi ∈ Bj and I(Xi ∈ Bj) = 0 if Xi ∉ Bj.

We then define Ȳj to be the average of the Yi's in Bj:

Ȳj = (1/nj) Σ_{Xi ∈ Bj} Yi.

Finally, we define m̂(x) = Ȳj for all x ∈ Bj. We can write this as

m̂(x) = Σ_{j=1}^{k} Ȳj I(x ∈ Bj).

Here are some examples.

regressogram = function(x,y,left,right,k,plotit,xlab="",ylab="",sub=""){
    ### assumes the data are on the interval [left,right]
    n = length(x)
    B = seq(left,right,length=k+1)                  ### bin breakpoints
    WhichBin = findInterval(x,B,all.inside=TRUE)    ### bin index of each x; puts x == right into bin k
    N = tabulate(WhichBin,nbins=k)                  ### counts per bin; nbins=k keeps length k even if trailing bins are empty
    m.hat = rep(0,k)
    for(j in 1:k){
        if(N[j]>0) m.hat[j] = mean(y[WhichBin == j])   ### bin average of the y's
    }
    if(plotit==TRUE){
        a = min(c(y,m.hat))
        b = max(c(y,m.hat))
        plot(B,c(m.hat,m.hat[k]),lwd=3,type="s",
             xlab=xlab,ylab=ylab,ylim=c(a,b),col="blue",sub=sub)
        points(x,y)
    }
    return(list(bins=B,m.hat=m.hat))
}

pdf("regressogram.pdf")
par(mfrow=c(2,2))
### simulated example
n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
plot(x,y,pch=20)
out = regressogram(x,y,left=0,right=1,k=5,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=10,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=20,plotit=TRUE)
dev.off()

### Bone mineral density versus age for men and women
pdf("bmd.pdf")
par(mfrow=c(2,2))
library(ElemStatLearn)
attach(bone)

age.male = age[gender == "male"]
density.male = spnbmd[gender == "male"]
out = regressogram(age.male,density.male,left=9,right=26,k=10,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Male")
out = regressogram(age.male,density.male,left=9,right=26,k=20,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Male")

age.female = age[gender == "female"]
density.female = spnbmd[gender == "female"]
out = regressogram(age.female,density.female,left=9,right=26,k=10,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Female")
out = regressogram(age.female,density.female,left=9,right=26,k=20,plotit=TRUE,
    xlab="Age",ylab="Density",sub="Female")
dev.off()

From Figures 2 and 3 you can see two things. First, we need a way to choose k. (The answer will
be cross-validation.) But also, m̂(x) is very unsmooth. To get a smoother estimator, we will use
kernel smoothing.

4 The Kernel Estimator

If we had many values of Yi with the same value of x, then we could take the average of these Yi's to estimate E[Y | X = x]. But usually every Xi is different. Perhaps we can take the average of the Yi's whose Xi is close to x. This is the idea of kernel smoothing.

To estimate m(x) we will take a local average of the Yi's. In other words, we average all Yi such that |Xi − x| ≤ h, where h is some small number called the bandwidth. Let I = {i : |Xi − x| ≤ h} denote all observations that are close to x. We can write the estimator as

m̂(x) = (1/k) Σ_{i ∈ I} Yi

where k is the number of points in I. We can re-write this estimator as

m̂(x) = [ Σ_{i=1}^{n} K((x − Xi)/h) Yi ] / [ Σ_{i=1}^{n} K((x − Xi)/h) ]    (3)

where

K(z) = 1 if |z| ≤ 1,   and   K(z) = 0 if |z| > 1.
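Here is a small sketch (not code from the original notes) showing how the estimator in (3) can be computed directly as a local average; the simulated model matches the regressogram example above, and the grid, bandwidth, and function name local.average are arbitrary illustrative choices.

### Local-average (boxcar) kernel estimator -- a minimal sketch.
### For each grid point, average the y's whose x lies within h of it.
local.average = function(x,y,grid,h){
    m.hat = rep(NA,length(grid))
    for(i in 1:length(grid)){
        I = which(abs(x - grid[i]) <= h)
        if(length(I) > 0) m.hat[i] = mean(y[I])
    }
    return(m.hat)
}

n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
grid = seq(0,1,length=100)
plot(x,y)
lines(grid,local.average(x,y,grid,h=0.1),lwd=3,col="blue")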

Figure 2: Simulated data. Various choices of k.

Figure 3: Bone density example. Men and women. Left plots use k = 10. Right plots use k = 20.

The function K is called the boxcar kernel.

As you can see in Figure 4, this gives much smoother estimates than the regressogram. But we can
improve this even further by replacing K with a smoother function.

This leads us to the following definition. A one-dimensional smoothing kernel is any smooth function K such that K(x) ≥ 0 and

∫ K(x) dx = 1,   ∫ x K(x) dx = 0   and   σ²_K ≡ ∫ x² K(x) dx > 0.    (4)

Note that the boxcar kernel does satisfy these conditions.

Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by

m̂(x) ≡ m̂h(x) = [ Σ_{i=1}^{n} Yi K((x − Xi)/h) ] / [ Σ_{i=1}^{n} K((x − Xi)/h) ] = Σ_{j=1}^{n} Yj ℓj(x)    (5)

where

ℓj(x) = K((x − Xj)/h) / Σ_s K((x − Xs)/h).    (6)

Notice that ℓj(x) ≥ 0 and, for every x, Σj ℓj(x) = 1. Thus m̂(x) is a smooth, local average of the Yi's.

It can be shown that there is an optimal kernel called the Epanechnikov kernel. But the choice of kernel K is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. A common choice for the kernel is the normal density:

K(z) ∝ exp(−z²/2).

This does not mean that we are assuming that the data are Normal. We are just using this as a convenient way to define the smoothing weights.
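As a quick illustration of this insensitivity (a sketch that is not part of the original notes; the simulated data, bandwidth, and the names nw.fit, K.gauss, K.epan are arbitrary choices), we can compare Nadaraya–Watson fits computed with a Gaussian and an Epanechnikov kernel. The bandwidths are not calibrated across kernels, so this is only a rough comparison, but the two curves are typically very similar.

### Nadaraya-Watson fits under two different kernels -- a sketch.
nw.fit = function(x,y,grid,h,K){
    m.hat = rep(0,length(grid))
    for(i in 1:length(grid)){
        w = K((grid[i] - x)/h)
        m.hat[i] = sum(w*y)/sum(w)
    }
    return(m.hat)
}
K.gauss = function(z) exp(-z^2/2)        ### Gaussian kernel (up to a constant)
K.epan = function(z) pmax(1 - z^2, 0)    ### Epanechnikov kernel (up to a constant)

n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
grid = seq(0,1,length=100)
plot(x,y)
lines(grid,nw.fit(x,y,grid,h=0.1,K.gauss),lwd=3,col="blue")
lines(grid,nw.fit(x,y,grid,h=0.1,K.epan),lwd=3,col="red",lty=2)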

What does matter much more is the choice of the bandwidth h, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. A useful result is the following:

Lemma 3 Suppose that, as n → ∞, we have h → 0 and nh → ∞. Then, under some conditions, m̂h(x) → m(x) in probability.

Figure 5 shows the kernel estimator m̂h(x) for 4 different choices of h using a Gaussian kernel. As you can see, we get much nicer estimates. But we still need a way to choose the smoothing parameter h. Before we discuss that, let's first discuss linear smoothers.

Figure 4: Boxcar kernel estimators for various choices of h.

Figure 5: Kernel estimators for various choices of h.

5 The Kernel Estimator is a Linear Smoother

The kernel estimator m̂(x) is defined at each x. As we did for linear regression, we can also compute the estimator at the original data points Xi. Again, we call Ŷi = m̂(Xi) the fitted values. We know that m̂h(x) = Σj Yj ℓj(x). When x = Xi, we have

Ŷi = m̂h(Xi) = Σj Yj ℓj(Xi).

Let Ŷ = (Ŷ1, . . . , Ŷn), Y = (Y1, . . . , Yn) and let L be the n × n matrix with entries Lij = ℓj(Xi). We then see that

Ŷ = LY.    (7)

This looks a lot like the equation Ŷ = HY from linear regression. The matrix L is called the smoothing matrix. It is like the hat matrix, but it is not a projection matrix. We call

ν = tr(L) = Σi Lii    (8)

the effective degrees of freedom. As the bandwidth h gets smaller, ν gets larger. In other words, small bandwidths correspond to more complex models.

The equation Ŷ = LY means that we can write each Ŷi as a linear combination of the Yi's. Because of this, we say that kernel regression is a linear smoother. This does not mean that m̂(x) is linear. It just means that each Ŷi is a linear combination of the Yi's.

There are many other linear smoothers besides kernel regression estimators, but we will focus on the kernel estimator for now. Remember: a linear smoother does not mean linear regression.
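To make the smoothing matrix concrete, here is a small sketch (not from the original notes) that builds L for a Gaussian kernel, checks that its rows sum to 1, computes the fitted values Ŷ = LY, and reports ν = tr(L); the simulated data, bandwidths, and the name smoothing.matrix are arbitrary choices.

### Construct the smoothing matrix L for a Gaussian kernel -- a sketch.
### Row i holds the weights l_j(X_i), so Y.hat = L %*% Y and nu = tr(L).
smoothing.matrix = function(x,h){
    n = length(x)
    L = matrix(0,n,n)
    for(i in 1:n){
        w = dnorm(x[i],x,h)
        L[i,] = w/sum(w)
    }
    return(L)
}

n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
L = smoothing.matrix(x,h=0.05)
range(rowSums(L))                      ### every row sums to 1
y.hat = as.vector(L %*% y)             ### fitted values Y.hat = L Y
sum(diag(L))                           ### effective degrees of freedom nu
sum(diag(smoothing.matrix(x,h=0.5)))   ### a larger bandwidth gives a smaller nu

The diagonal entries Lii are exactly what the kernel.fitted function in the next section stores in S.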

6 Choosing h by Cross-Validation

We will estimate the prediction risk using leave-one-out cross-validation. Then we choose the bandwidth h by minimizing the cross-validation score. Let m̂h^(−i) be the kernel estimator (with bandwidth h) obtained after leaving out (Xi, Yi). The cross-validation score is

CV(h) = (1/n) Σi (Yi − m̂h^(−i)(Xi))².

It turns out that there is a handy shortcut formula for CV (similar to the one for linear models), namely

CV(h) = (1/n) Σi [ (Yi − m̂h(Xi)) / (1 − Lii) ]².    (9)

Our strategy is to fit m̂h for many values of h. We then compute CV(h) using (9). Then we find ĥ to minimize CV(h). The bandwidth ĥ is the one we use.

Some people approximate the formula above by replacing each Lii with the average value (1/n) Σi Lii = (1/n) tr(L) = ν/n. If we replace Lii with ν/n we get this formula:

GCV(h) = [1/(1 − ν/n)²] (1/n) Σi (Yi − m̂h(Xi))²    (10)

which is called generalized cross-validation.

Figure 6 shows the cross-validation score and the generalized cross validation score. I plotted it
versus h and then I plotted it versus the effective degrees of freedom ν. The bottom left plot is the
kernel regression estimator using the bandwidth that minimizes the cross-validation score.

Here is the code:

kernel = function(x,y,grid,h){
    ### kernel regression estimator at a grid of values
    ### one dimension only
    ### return m.hat(u) for all u in grid
    n = length(x)
    k = length(grid)
    m.hat = rep(0,k)
    for(i in 1:k){
        w = dnorm(grid[i],x,h)        ### Gaussian kernel weights
        m.hat[i] = sum(y*w)/sum(w)
    }
    return(m.hat)
}

kernel.fitted = function(x,y,h){
    ### kernel regression
    ### fitted values and diagonal of smoothing matrix
    n = length(x)
    m.hat = rep(0,n)
    S = rep(0,n)
    for(i in 1:n){
        w = dnorm(x[i],x,h)
        w = w/sum(w)
        m.hat[i] = sum(y*w)
        S[i] = w[i]                   ### S[i] = L_ii
    }
    return(list(fitted=m.hat,S=S))
}

Figure 6: Cross validation (black) and generalized cross validation (red dotted line) versus h and versus effective degrees of freedom. The bottom left plot is the kernel regression estimator using the bandwidth that minimizes the cross-validation score.

CV = function(x,y,H){
    ### H is a vector of bandwidths
    ### Compute cross-validation and generalized cross validation
    n = length(x)
    k = length(H)
    cv = rep(0,k)
    nu = rep(0,k)
    gcv = rep(0,k)
    for(i in 1:k){
        tmp = kernel.fitted(x,y,H[i])
        cv[i] = mean(((y - tmp$fitted)/(1-tmp$S))^2)
        nu[i] = sum(tmp$S)
        gcv[i] = mean((y - tmp$fitted)^2)/(1-nu[i]/n)^2
    }
    ### nu is the effective degrees of freedom
    return(list(cv=cv,gcv=gcv,nu=nu))
}

pdf("crossval.pdf")
par(mfrow=c(2,2))

bone = read.table("BoneDensity.txt",header=TRUE)
attach(bone)

age.female = age[gender == "female"]
density.female = spnbmd[gender == "female"]

H = seq(.1,5,length=20)
out = CV(age.female,density.female,H)
plot(H,out$cv,type="l",lwd=3,xlab="Bandwidth",ylab="Cross-validation Score")
lines(H,out$gcv,lty=2,col="red",lwd=3)
plot(out$nu,out$cv,type="l",lwd=3,
xlab="Effective Degrees of Freedom",ylab="Cross-validation Score")
lines(out$nu,out$gcv,lty=2,col="red",lwd=3)

j = which.min(out$cv)
h = H[j]
grid = seq(min(age.female),max(age.female),length=100)
m.hat = kernel(age.female,density.female,grid,h)
plot(age.female,density.female,xlab="Age",ylab="Density")
lines(grid,m.hat,lwd=3,col="blue")

dev.off()

NOTE: The residuals are defined as before, namely, ei = Yi − m̂(Xi). You should plot the residuals as you did for linear regression.

7 Analysis of the Kernel Estimator

Recall that the mean integrated squared error is

MISE = ∫ bn²(x) p(x) dx + ∫ vn(x) p(x) dx.

It can be shown (under certain conditions) that, given X1, . . . , Xn,

b(x) ≈ (h² µ2(K)/2) [ m''(x) + 2 m'(x) f'(x)/f(x) ]    (11)

and

v(x) ≈ (R(K)/(n h f(x))) σ²(x)    (12)

where f(x) is the density of X, µ2(K) = ∫ x² K(x) dx, R(K) = ∫ K²(x) dx and σ²(x) = Var(Y | X = x). Integrating the squared bias (of order h⁴) and the variance (of order 1/(nh)) over x, we can say that

MISE ≈ C1 h⁴ + C2/(nh).

If we minimize this expression over h we find that the best h is

hn = (C3/n)^{1/5}

where C3 = C2/(4C1). If we insert this hn into the MISE we find that

MISE ≈ (C4/n)^{4/5}.
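To spell out the minimization step (a standard calculation that the notes leave implicit): setting the derivative of C1 h⁴ + C2/(nh) with respect to h equal to zero gives

4 C1 h³ − C2/(n h²) = 0,   i.e.,   h⁵ = C2/(4 C1 n),

which is exactly hn = (C3/n)^{1/5} with C3 = C2/(4C1). Substituting hn back in, both terms are of order n^{−4/5}, which gives the (C4/n)^{4/5} rate above.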

What do we learn from this? First, the optimal bandwidth gets smaller as the sample size increases. Second, the MISE goes to 0 as n gets larger, but it goes to 0 more slowly than for the estimators you are used to. For example, if you estimate the mean µ with the sample mean Ȳ then E(Ȳ − µ)² = σ²/n. Roughly speaking, the mean squared error of parametric estimators is something like 1/n, but kernel estimators (and other nonparametric estimators) behave like (1/n)^{4/5}, which goes to 0 more slowly.

Slower convergence is the price of being nonparametric and making fewer assumptions. (Note that the price of making strong parametric assumptions is a non-trivial bias that does not shrink with sample size.)

The formula we derived for hn is useful for understanding the theoretical behavior of m̂, but we cannot use it in practice because the constants involve quantities that depend on the unknown function m. So, in practice, we use cross-validation to choose h.

8 Estimating σ²(x)

We have not assumed that the variance is constant. We will only assume that

σ²(x) = Var[Y | X = x]    (13)

is a smooth function of x. We can estimate σ²(x) as follows (a short sketch in code follows the list):

1. Get m̂(x).

2. Let Zi = (Yi − m̂(Xi))² = ei².

3. Do a nonparametric regression of the Zi versus the Xi to get σ̂²(x).
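Here is a minimal sketch of these three steps (not from the original notes), reusing the kernel and kernel.fitted functions defined earlier; the bandwidth h = 0.8 is an arbitrary illustrative choice and would normally be chosen by cross-validation. The next section carries this out on the bone density data.

### Sketch: estimate sigma^2(x) by smoothing the squared residuals.
### Assumes x and y hold the data and that kernel() and kernel.fitted() are defined as above.
h = 0.8                           ### illustrative bandwidth
fit = kernel.fitted(x,y,h)        ### step 1: m.hat at the data points
z = (y - fit$fitted)^2            ### step 2: squared residuals
grid = seq(min(x),max(x),length=100)
sigma2.hat = kernel(x,z,grid,h)   ### step 3: smooth z versus x
plot(x,z,xlab="x",ylab="Squared Residuals")
lines(grid,sigma2.hat,lwd=3,col="blue")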

9 Example

Let’s return to the bone mineral density data.

> kernel = function(x,y,grid,h){
+ ### kernel regression estimator at a grid of values
+ ### one dimension only
+ ### return m.hat(u) for all u in grid
+ n = length(x)
+ k = length(grid)
+ m.hat = rep(0,k)
+ for(i in 1:k){
+ w = dnorm(grid[i],x,h)
+ m.hat[i] = sum(y*w)/sum(w)
+ }
+ return(m.hat)
+ }
>
> pdf("Bone-Density.pdf")
> par(mfrow=c(2,2))
> df = read.table("bone.txt",header=TRUE)
> names(df)
[1] "idnum" "age" "gender" "spnbmd"
> attach(df)
> n = nrow(df)
> is.factor(gender)
[1] TRUE
> unique(gender)
[1] male female
Levels: female male
> I = (1:n)[gender == "female"]
> plot(age[I],spnbmd[I],xlab="Age",ylab="Bone Density",pch=19)
>
> h = 0.8
> grid = seq(9,26,length=50)
> x = age[I]

> y = spnbmd[I]
> out = kernel(x,y,grid,h)
> lines(grid,out,lwd=3,col="blue")
> fit = kernel(x,y,x,h)
> residuals = y - fit
> plot(x,residuals,xlab="Age",ylab="Residuals",pch=19)
> abline(h=0,lwd=3,col="red")
>
> z = residuals^2
> plot(x,z,pch=19,xlab="Age",ylab="Squared Residuals")
> out = kernel(x,z,grid,h)
> lines(grid,out,lwd=3,col="blue")
> dev.off()

The plots are in Figure 7. Do you like the result? How should we fix things?

Figure 7: Top left: data and kernel regression estimator. Top right: residuals. Bottom left: estimate of σ²(x).
