k-nearest-neighbors regression

Here's a basic method to start us off: k-nearest-neighbors regression. We fix an integer $k \ge 1$ and define
$$\hat{m}(x) = \frac{1}{k} \sum_{i \in N_k(x)} Y_i, \qquad (4)$$
where $N_k(x)$ contains the indices of the $k$ closest points of $X_1, \dots, X_n$ to $x$.

This is not at all a bad estimator, and you will find it used in lots of applications, in many cases probably because of its simplicity. By varying the number of neighbors $k$, we can achieve a wide range of flexibility in the estimated function $\hat{m}$, with small $k$ corresponding to a more flexible fit, and large $k$ a less flexible one.
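
As a concrete illustration (my own sketch, not part of the original notes), here is a minimal NumPy implementation of the estimator in (4); the function name knn_regress and the toy data are assumptions made for the example.

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """k-nearest-neighbors estimate m_hat(x): average the Y values of the
    k training points X_i closest to x in Euclidean distance."""
    dists = np.linalg.norm(X - x, axis=1)      # distances ||X_i - x||_2
    neighbors = np.argsort(dists)[:k]          # indices in N_k(x)
    return Y[neighbors].mean()                 # (1/k) * sum of Y_i over N_k(x)

# Toy example: noisy observations of m_0(x) = sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(100)

x0 = np.array([0.5])
print(knn_regress(x0, X, Y, k=3))    # small k: more flexible, jumpier fit
print(knn_regress(x0, X, Y, k=30))   # large k: smoother, less flexible fit
```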
But it does have its limitations, an apparent one being that the fitted function $\hat{m}$ essentially always looks jagged, especially for small or moderate $k$. Why is this? It helps to write
$$\hat{m}(x) = \sum_{i=1}^n w_i(x) Y_i, \qquad (5)$$
where the weights $w_i(x)$, $i = 1, \dots, n$ are defined as
$$w_i(x) = \begin{cases} 1/k & \text{if } X_i \text{ is one of the } k \text{ nearest points to } x \\ 0 & \text{else.} \end{cases}$$
Note that $w_i(x)$ is discontinuous as a function of $x$, and therefore so is $\hat{m}(x)$.
The representation (5) also reveals that the k-nearest-neighbors estimate is in a class of estimates we call linear smoothers, i.e., writing $Y = (Y_1, \dots, Y_n) \in \mathbb{R}^n$, the vector of fitted values
$$\hat{\mu} = \big(\hat{m}(X_1), \dots, \hat{m}(X_n)\big) \in \mathbb{R}^n$$
can simply be expressed as $\hat{\mu} = SY$. (To be clear, this means that for fixed inputs $X_1, \dots, X_n$, the vector of fitted values $\hat{\mu}$ is a linear function of $Y$; it does not mean that $\hat{m}(x)$ need behave linearly as a function of $x$.) This class is quite large, and contains many popular estimators, as we'll see in the coming sections.
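
To make the linear-smoother point concrete, here is a small sketch (again my own addition, with a hypothetical helper knn_smoother_matrix) that builds the $n \times n$ matrix $S$ whose rows hold the weights $w_i(X_j)$ from (5), so that the fitted values are $\hat{\mu} = SY$.

```python
import numpy as np

def knn_smoother_matrix(X, k):
    """Row j holds the weights w_i(X_j) from (5): 1/k on the k nearest
    training points to X_j, 0 elsewhere, so that mu_hat = S @ Y."""
    n = X.shape[0]
    S = np.zeros((n, n))
    for j in range(n):
        dists = np.linalg.norm(X - X[j], axis=1)
        neighbors = np.argsort(dists)[:k]
        S[j, neighbors] = 1.0 / k
    return S

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 2))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(50)

S = knn_smoother_matrix(X, k=5)
mu_hat = S @ Y   # vector of fitted values (m_hat(X_1), ..., m_hat(X_n))
# Each row of S sums to 1, and mu_hat is linear in Y for these fixed inputs.
print(np.allclose(S.sum(axis=1), 1.0), mu_hat[:3])
```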
The k-nearest-neighbors estimator is universally consistent, which means $\mathbb{E}\|\hat{m} - m_0\|_2^2 \to 0$ as $n \to \infty$, with no assumptions other than $\mathbb{E}(Y^2) < \infty$, provided that we take $k = k_n$ such that $k_n \to \infty$ and $k_n / n \to 0$; e.g., $k = \sqrt{n}$ will do. See Chapter 6.2 of Györfi et al. (2002).
Furthermore, assuming the underlying regression function $m_0$ is Lipschitz continuous, the k-nearest-neighbors estimate with $k \asymp n^{2/(2+d)}$ satisfies
$$\mathbb{E}\|\hat{m} - m_0\|_2^2 \lesssim n^{-2/(2+d)}. \qquad (6)$$
See Chapter 6.3 of Györfi et al. (2002). Later, we will see that this is optimal.
Proof sketch: assume that $\mathrm{Var}(Y \mid X = x) = \sigma^2$, a constant, for simplicity, and fix (condition on) the training points. Using the bias-variance tradeoff,
$$\mathbb{E}\big[\big(\hat{m}(x) - m_0(x)\big)^2\big] = \underbrace{\big(\mathbb{E}[\hat{m}(x)] - m_0(x)\big)^2}_{\mathrm{Bias}^2(\hat{m}(x))} + \underbrace{\mathbb{E}\big[\big(\hat{m}(x) - \mathbb{E}[\hat{m}(x)]\big)^2\big]}_{\mathrm{Var}(\hat{m}(x))}$$
$$= \bigg(\frac{1}{k} \sum_{i \in N_k(x)} \big(m_0(X_i) - m_0(x)\big)\bigg)^2 + \frac{\sigma^2}{k}$$
$$\le \bigg(\frac{L}{k} \sum_{i \in N_k(x)} \|X_i - x\|_2\bigg)^2 + \frac{\sigma^2}{k}.$$
In the last line we used the Lipschitz property $|m_0(x) - m_0(z)| \le L \|x - z\|_2$, for some constant $L > 0$. Now for "most" of the points we'll have $\|X_i - x\|_2 \le C (k/n)^{1/d}$, for a constant $C > 0$. (Think of having input points $X_i$, $i = 1, \dots, n$ spaced equally over (say) $[0,1]^d$.) Then our bias-variance upper bound becomes
$$(CL)^2 \Big(\frac{k}{n}\Big)^{2/d} + \frac{\sigma^2}{k}.$$
We can minimize this by balancing the two terms so that they are equal, giving $k^{1+2/d} \asymp n^{2/d}$, i.e., $k \asymp n^{2/(2+d)}$ as claimed. Plugging this in gives the error bound of $n^{-2/(2+d)}$, as claimed.
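
As a quick numerical sanity check on the balancing step (my own addition, with arbitrary constants $C = L = \sigma = 1$), the sketch below minimizes the bound $(CL)^2 (k/n)^{2/d} + \sigma^2 / k$ over $k$ and compares the minimizer against the $n^{2/(2+d)}$ scaling.

```python
import numpy as np

def bound(k, n, d, C=1.0, L=1.0, sigma=1.0):
    """Bias-variance upper bound (CL)^2 (k/n)^(2/d) + sigma^2 / k."""
    return (C * L) ** 2 * (k / n) ** (2 / d) + sigma ** 2 / k

n, d = 10_000, 3
ks = np.arange(1, n + 1)
k_best = ks[np.argmin(bound(ks, n, d))]   # numerical minimizer of the bound
k_theory = n ** (2 / (2 + d))             # k ~ n^{2/(2+d)} from balancing the terms

print(k_best, round(k_theory, 1))
print(round(bound(k_best, n, d), 4), round(n ** (-2 / (2 + d)), 4))
# Up to constants, the minimizing k tracks n^{2/(2+d)}, and the minimized
# bound scales like n^{-2/(2+d)}, matching (6).
```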
