KEMBAR78
KNN Cookbook | PDF | Learning | Statistical Theory
0% found this document useful (0 votes)
86 views8 pages

KNN Cookbook

Chapter 15 discusses the K-Nearest Neighbors (KNN) classifier, a simple yet widely used method in supervised machine learning that predicts the class of an observation based on its nearest neighbors. It covers finding nearest neighbors using scikit-learn, creating a KNN classifier, selecting the best neighborhood size, and using a radius-based nearest neighbor classifier. The chapter emphasizes the importance of standardizing features and the impact of the parameter 'k' on model performance.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
86 views8 pages

KNN Cookbook

Chapter 15 discusses the K-Nearest Neighbors (KNN) classifier, a simple yet widely used method in supervised machine learning that predicts the class of an observation based on its nearest neighbors. It covers finding nearest neighbors using scikit-learn, creating a KNN classifier, selecting the best neighborhood size, and using a radius-based nearest neighbor classifier. The chapter emphasizes the importance of standardizing features and the impact of the parameter 'k' on model performance.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

CHAPTER 15

K-Nearest Neighbors

15.0 Introduction
The K-Nearest Neighbors classifier (KNN) is one of the simplest yet most commonly
used classifiers in supervised machine learning. KNN is often considered a lazy
learner; it doesn’t technically train a model to make predictions. Instead an observa‐
tion is predicted to be the class of that of the largest proportion of the k nearest obser‐
vations. For example, if an observation with an unknown class is surrounded by an
observation of class 1, then the observation is classified as class 1. In this chapter we
will explore how to use scikit-learn to create and use a KNN classifier.

15.1 Finding an Observation’s Nearest Neighbors


Problem
You need to find an observation’s k nearest observations (neighbors).

Solution
Use scikit-learn’s NearestNeighbors:
# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load data
iris = datasets.load_iris()
features = iris.data

# Create standardizer

251
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Two nearest neighbors


nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# Create an observation
new_observation = [ 1, 1, 1, 1]

# Find distances and indices of the observation's nearest neighbors


distances, indices = nearest_neighbors.kneighbors([new_observation])

# View the nearest neighbors


features_standardized[indices]
array([[[ 1.03800476, 0.56925129, 1.10395287, 1.1850097 ],
[ 0.79566902, 0.33784833, 0.76275864, 1.05353673]]])

Discussion
In our solution we used the dataset of Iris flowers. We created an observation,
new_observation, with some values and then found the two observations that are
closest to our observation. indices contains the locations of the observations in our
dataset that are closest, so X[indices] displays the values of those observations.
Intuitively, distance can be thought of as a measure of similarity, so the two closest
observations are the two flowers most similar to the flower we created.
How do we measure distance? scikit-learn offers a wide variety of distance metrics, d,
including Euclidean:

deuclidean = ∑ni = 1 xi − yi 2

and Manhattan distance:

n
dmanhattan = ∑ xi − y i
i=1

By default, NearestNeighbors uses Minkowski distance:

n 1/ p
dminkowski = ∑
i=1
xi − yi p

252 | Chapter 15: K-Nearest Neighbors


where xi and yi are the two observations we are calculating the distance between.
Minkowski includes a hyperparameter, p, where p = 1 is Manhattan distance and p =
2 is Euclidean distance, and so on. By default in scikit-learn p = 2.
We can set the distance metric using the metric parameter:
# Find two nearest neighbors based on euclidean distance
nearestneighbors_euclidean = NearestNeighbors(
n_neighbors=2, metric='euclidean').fit(features_standardized)

The distance variable we created contains the actual distance measurement to each
of the two nearest neighbors:
# View distances
distances
array([[ 0.48168828, 0.73440155]])

In addition, we can use kneighbors_graph to create a matrix indicating each observa‐


tion’s nearest neighbors:
# Find each observation's three nearest neighbors
# based on euclidean distance (including itself)
nearestneighbors_euclidean = NearestNeighbors(
n_neighbors=3, metric="euclidean").fit(features_standardized)

# List of lists indicating each observation's 3 nearest neighbors


# (including itself)
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(
features_standardized).toarray()

# Remove 1's marking an observation is a nearest neighbor to itself


for i, x in enumerate(nearest_neighbors_with_self):
x[i] = 0

# View first observation's two nearest neighbors


nearest_neighbors_with_self[0]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0.])
When we are finding nearest neighbors or using any learning algorithm based on dis‐
tance, it is important to transform features so that they are on the same scale. The

15.1 Finding an Observation’s Nearest Neighbors | 253


reason is because the distance metrics treat all features as if they were on the same
scale, but if one feature is in millions of dollars and a second feature is in percentages,
the distance calculated will be biased toward the former. In our solution we addressed
this potential issue by standardizing the features using StandardScaler.

15.2 Creating a K-Nearest Neighbor Classifier


Problem
Given an observation of unknown class, you need to predict its class based on the
class of its neighbors.

Solution
If the dataset is not very large, use KNeighborsClassifier:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
X_std = standardizer.fit_transform(X)

# Train a KNN classifier with 5 neighbors


knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

# Create two observations


new_observations = [[ 0.75, 0.75, 0.75, 0.75],
[ 1, 1, 1, 1]]

# Predict the class of two observations


knn.predict(new_observations)
array([1, 2])

Discussion
In KNN, given an observation, xu, with an unknown target class, the algorithm first
identifies the k closest observations (sometimes called xu’s neighborhood) based on
some distance metric (e.g., Euclidean distance), then these k observations “vote”

254 | Chapter 15: K-Nearest Neighbors


based on their class, and the class that wins the vote is xu’s predicted class. More for‐
mally, the probability xu is some class j is:

1
ki∑
I yi = j
∈ν

where ν is the k observation in xu’s neighborhood, yi is the class of the ith observation,
and I is an indicator function (i.e., 1 is true, 0 otherwise). In scikit-learn we can see
these probabilities using predict_proba:
# View probability each observation is one of three classes
knn.predict_proba(new_observations)
array([[ 0. , 0.6, 0.4],
[ 0. , 0. , 1. ]])
The class with the highest probability becomes the predicted class. For example, in
the preceding output, the first observation should be class 1 (Pr = 0.6) while the sec‐
ond observation should be class 2 (Pr = 1), and this is just what we see:
knn.predict(new_observations)
array([1, 2])

KNeighborsClassifier contains a number of important parameters to consider.


First, metric sets the distance metric used (discussed more in Recipe 14.1). Second,
n_jobs determines how many of the computer’s cores to use. Because making a pre‐
diction requires calculating the distance from a point to every single point in the data,
using multiple cores is highly recommended. Third, algorithm sets the method used
to calculate the nearest neighbors. While there are real differences in the algorithms,
by default KNeighborsClassifier attempts to auto-select the best algorithm so you
often don’t need to worry about this parameter. Fourth, by default KNeighborsClassi
fier works how we described previously, with each observation in the neighborhood
getting one vote; however, if we set the weights parameter to distance, the closer
observations’ votes are weighted more than observations farther away. Intuitively this
make sense, since more similar neighbors might tell us more about an observation’s
class than others.
Finally, as discussed in Recipe 14.1, because distance calculations treat all features as if
they are on the same scale, it is important to standardize the features prior to using a
KNN classifier.

15.2 Creating a K-Nearest Neighbor Classifier | 255


15.3 Identifying the Best Neighborhood Size
Problem
You want to select the best value for k in a k-nearest neighbors classifier.

Solution
Use model selection techniques like GridSearchCV:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Create a KNN classifier


knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# Create a pipeline
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Create space of candidate values


search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Create grid search


classifier = GridSearchCV(
pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)

Discussion
The size of k has real implications in KNN classifiers. In machine learning we are try‐
ing to find a balance between bias and variance, and in few places is that as explicit as
the value of k. If k = n where n is the number of observations, then we have high bias
but low variance. If k = 1, we will have low bias but high variance. The best model will
come from finding the value of k that balances this bias-variance trade-off. In our sol‐
ution, we used GridSearchCV to conduct five-fold cross-validation on KNN classifiers

256 | Chapter 15: K-Nearest Neighbors


with different values of k. When that is completed, we can see the k that produces the
best model:
# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]
6

15.4 Creating a Radius-Based Nearest Neighbor Classifier


Problem
Given an observation of unknown class, you need to predict its class based on the
class of all observations within a certain distance.

Solution
Use RadiusNeighborsClassifier:
# Load libraries
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer
standardizer = StandardScaler()

# Standardize features
features_standardized = standardizer.fit_transform(features)

# Train a radius neighbors classifier


rnn = RadiusNeighborsClassifier(
radius=.5, n_jobs=-1).fit(features_standardized, target)

# Create two observations


new_observations = [[ 1, 1, 1, 1]]

# Predict the class of two observations


rnn.predict(new_observations)
array([2])

Discussion
In KNN classification, an observation’s class is predicted from the classes of its k
neighbors. A less common technique is classification in a radius-based (RNN) classi‐

15.4 Creating a Radius-Based Nearest Neighbor Classifier | 257


fier, where an observation’s class is predicted from the classes of all observations
within a given radius r. In scikit-learn RadiusNeighborsClassifier is very similar to
KNeighborsClassifier, with the exception of two parameters. First, in RadiusNeigh
borsClassifier we need to specify the radius of the fixed area used to determine if
an observation is a neighbor using radius. Unless there is some substantive reason
for setting radius to some value, it is best to treat it like any other hyperparameter
and tune it during model selection. The second useful parameter is outlier_label,
which indicates what label to give an observation that has no observations within the
radius—which itself can often be a useful tool for identifying outliers.

258 | Chapter 15: K-Nearest Neighbors

You might also like