k-nearest-neighbors regression
Here's a basic method to start us off: k-nearest-neighbors regression. We fix an integer $k \geq 1$ and define
$$
\hat{m}(x) = \frac{1}{k} \sum_{i \in N_k(x)} Y_i, \qquad (4)
$$
where $N_k(x)$ contains the indices of the $k$ closest points among $X_1, \ldots, X_n$ to $x$.
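To make (4) concrete, here is a minimal brute-force sketch in Python (the function name knn_regress and the toy data below are illustrative choices, not anything prescribed by the text):
\begin{verbatim}
import numpy as np

def knn_regress(x, X, Y, k):
    """k-nearest-neighbors estimate m_hat(x) as in (4).

    X: (n, d) array of inputs X_1, ..., X_n
    Y: (n,)   array of responses Y_1, ..., Y_n
    x: (d,)   query point
    k: number of neighbors
    """
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]         # indices in N_k(x)
    return Y[nearest].mean()                # average the k responses

# Toy usage: noisy observations of a sine curve on [0, 1]
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_regress(np.array([0.25]), X, Y, k=10))   # roughly sin(pi/2) = 1
\end{verbatim}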
This is not at all a bad estimator, and you will find it used in lots of applications, in many cases probably because of its simplicity. By varying the number of neighbors $k$, we can achieve a wide range of flexibility in the estimated function $\hat{m}$, with small $k$ corresponding to a more flexible fit, and large $k$ less flexible.
But it does have its limitations, an apparent one being that the fitted function $\hat{m}$ essentially always looks jagged, especially for small or moderate $k$. Why is this? It helps to write
$$
\hat{m}(x) = \sum_{i=1}^n w_i(x) Y_i, \qquad (5)
$$
where the weights $w_i(x)$, $i = 1, \ldots, n$ are defined as
$$
w_i(x) = \begin{cases} 1/k & \text{if } X_i \text{ is one of the } k \text{ nearest points to } x, \\ 0 & \text{else.} \end{cases}
$$
Note that $w_i(x)$ is discontinuous as a function of $x$, and therefore so is $\hat{m}(x)$.
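As a toy numerical illustration of this discontinuity (the training points and responses below are made up for the example): with $k = 2$, an arbitrarily small move of $x$ across the point where the neighbor set $N_k(x)$ changes produces a jump in $\hat{m}(x)$.
\begin{verbatim}
import numpy as np

# Four training points on the line, with a step in the responses
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.0, 0.0, 10.0, 10.0])

def knn_1d(x, k=2):
    nearest = np.argsort(np.abs(X - x))[:k]   # the k nearest training points
    return Y[nearest].mean()

# Just below x = 1 the two nearest points are {0.0, 1.0}; just above, {1.0, 2.0}
print(knn_1d(0.99))   # 0.0
print(knn_1d(1.01))   # 5.0 -- a tiny move in x changes the fit by 5
\end{verbatim}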
The representation (5) also reveals that the k-nearest-neighbors estimate is in a class of estimates we call linear smoothers, i.e., writing $Y = (Y_1, \ldots, Y_n) \in \mathbb{R}^n$, the vector of fitted values
$$
\hat{\mu} = \big(\hat{m}(X_1), \ldots, \hat{m}(X_n)\big) \in \mathbb{R}^n
$$
can simply be expressed as $\hat{\mu} = SY$, for a matrix $S \in \mathbb{R}^{n \times n}$ that depends on the inputs (and on $k$) but not on $Y$. (To be clear, this means that for fixed inputs $X_1, \ldots, X_n$, the vector of fitted values $\hat{\mu}$ is a linear function of $Y$; it does not mean that $\hat{m}(x)$ need behave linearly as a function of $x$.) This class is quite large, and contains many popular estimators, as we'll see in the coming sections.
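To see the linear-smoother structure explicitly, here is a sketch (the helper name knn_smoother_matrix is an illustrative choice) that builds the matrix $S$ for k-nearest-neighbors: row $i$ of $S$ holds the weights $w_1(X_i), \ldots, w_n(X_i)$ from (5), so that $\hat{\mu} = SY$.
\begin{verbatim}
import numpy as np

def knn_smoother_matrix(X, k):
    """n x n matrix S with S[i, j] = w_j(X_i): 1/k on the k nearest
    training points to X_i, and 0 elsewhere."""
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        S[i, np.argsort(dists)[:k]] = 1.0 / k
    return S

# For fixed inputs, the fitted values are a linear function of Y
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
Y = rng.standard_normal(50)
S = knn_smoother_matrix(X, k=5)
mu_hat = S @ Y          # equals (m_hat(X_1), ..., m_hat(X_n))
print(S.sum(axis=1))    # every row sums to 1
\end{verbatim}
Note that $S$ is built from the inputs alone; this is exactly the sense in which the fit is linear in $Y$.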
The k-nearest-neighbors estimator is universally consistent, which means $\mathbb{E}\|\hat{m} - m_0\|_2^2 \to 0$ as $n \to \infty$, with no assumptions other than $\mathbb{E}(Y^2) < \infty$, provided that we take $k = k_n$ such that $k_n \to \infty$ and $k_n / n \to 0$; e.g., $k = \sqrt{n}$ will do. See Chapter 6.2 of Györfi et al. (2002).
Furthermore, assuming the underlying regression function $m_0$ is Lipschitz continuous, the k-nearest-neighbors estimate with $k \asymp n^{2/(2+d)}$ satisfies
$$
\mathbb{E}\|\hat{m} - m_0\|_2^2 \lesssim n^{-2/(2+d)}. \qquad (6)
$$
See Chapter 6.3 of Györfi et al. (2002). Later, we will see that this is optimal.
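The rate in (6) can be checked roughly by simulation. The sketch below is only illustrative: the particular Lipschitz function $m_0$, the noise level, and the sample sizes are assumptions made for the example, not taken from the text. It fits k-nearest-neighbors with $k \approx n^{2/(2+d)}$ for increasing $n$ and compares the empirical test error to $n^{-2/(2+d)}$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d = 2
m0 = lambda x: np.abs(x).sum(axis=-1)   # a Lipschitz function on [0, 1]^d

def knn_predict(X, Y, Xtest, k):
    # Brute-force k-NN regression at each test point
    dists = np.linalg.norm(Xtest[:, None, :] - X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return Y[nearest].mean(axis=1)

for n in [500, 2000, 8000]:
    k = max(1, round(n ** (2 / (2 + d))))   # k ~ n^{2/(2+d)}
    X = rng.uniform(size=(n, d))
    Y = m0(X) + 0.5 * rng.standard_normal(n)
    Xtest = rng.uniform(size=(500, d))
    err = np.mean((knn_predict(X, Y, Xtest, k) - m0(Xtest)) ** 2)
    print(n, k, err, n ** (-2 / (2 + d)))   # err should shrink at roughly this rate
\end{verbatim}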
Proof sketch: assume that $\mathrm{Var}(Y \mid X = x) = \sigma^2$, a constant, for simplicity, and fix (condition on) the training points. Using the bias-variance tradeoff,
$$
\begin{aligned}
\mathbb{E}\Big[\big(\hat{m}(x) - m_0(x)\big)^2\Big]
&= \underbrace{\big(\mathbb{E}[\hat{m}(x)] - m_0(x)\big)^2}_{\mathrm{Bias}^2(\hat{m}(x))}
 + \underbrace{\mathbb{E}\Big[\big(\hat{m}(x) - \mathbb{E}[\hat{m}(x)]\big)^2\Big]}_{\mathrm{Var}(\hat{m}(x))} \\
&= \bigg(\frac{1}{k} \sum_{i \in N_k(x)} \big(m_0(X_i) - m_0(x)\big)\bigg)^2 + \frac{\sigma^2}{k} \\
&\leq \bigg(\frac{L}{k} \sum_{i \in N_k(x)} \|X_i - x\|_2\bigg)^2 + \frac{\sigma^2}{k}.
\end{aligned}
$$
In the last line we used the Lipschitz property $|m_0(x) - m_0(z)| \leq L \|x - z\|_2$, for some constant $L > 0$. Now for ``most'' of the points we'll have $\|X_i - x\|_2 \leq C (k/n)^{1/d}$, for a constant $C > 0$. (Think of having input points $X_i$, $i = 1, \ldots, n$ spaced equally over (say) $[0,1]^d$.) Then our bias-variance upper bound becomes
$$
(CL)^2 \Big(\frac{k}{n}\Big)^{2/d} + \frac{\sigma^2}{k}.
$$
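A quick heuristic (not part of the formal argument) for where the $(k/n)^{1/d}$ spacing comes from, under the roughly-uniform picture above: a ball around $x$ containing about $k$ of the $n$ points must cover about a $k/n$ fraction of the volume of $[0,1]^d$, so its radius $r$ satisfies
$$
c_d\, r^d \approx \frac{k}{n}
\qquad \Longrightarrow \qquad
r \approx c_d^{-1/d} \Big(\frac{k}{n}\Big)^{1/d},
$$
where $c_d$ denotes the volume of the unit ball in $\mathbb{R}^d$ (a constant that gets absorbed into $C$).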
We can minimize this by balancing the two terms so that they are equal, giving $k^{1+2/d} \asymp n^{2/d}$, i.e., $k \asymp n^{2/(2+d)}$. Plugging this choice back in gives the error bound of $n^{-2/(2+d)}$, as claimed.
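Spelling out the balancing step (dropping the constants $C$, $L$, $\sigma$, which do not affect the rate):
$$
\Big(\frac{k}{n}\Big)^{2/d} \asymp \frac{1}{k}
\quad \Longleftrightarrow \quad
k^{1 + 2/d} \asymp n^{2/d}
\quad \Longleftrightarrow \quad
k \asymp n^{\frac{2/d}{1 + 2/d}} = n^{2/(2+d)},
$$
and with this choice both terms are of order $1/k \asymp n^{-2/(2+d)}$.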