BU MET CS-677: Data Science With Python, v.2.0
kNN - Nearest Neighbors Classification
General Idea
• points in the same class are usually "neighbors"
• assign a class based on the majority vote of the k nearest neighbors
• need a distance measure (e.g. Euclidean)
• need to choose k - the number of neighbors
• note: with two classes, k must be odd to guarantee a simple majority
• a minimal from-scratch sketch is shown below
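A minimal from-scratch sketch of these bullets (illustrative only; the function and toy data below are made up for this note, and the course examples use sklearn's KNeighborsClassifier instead):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy usage: two "green" points near the origin, two "red" points far away
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array(["green", "green", "red", "red"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # green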
Example of kNN
[Figure: scatter plot of green and red labeled points in the X-Y plane, together with two unlabeled query points A and B]
• what labels for A and B?
Assigning a Label for A
[Figure: point A and its nearest labeled neighbors in the X-Y plane]

point   k   neighbors              majority
        1   x1                     green
A       3   x1, x2, x3             red
        5   x1, x2, x3, x4, x5     green
Assigning a Label for B
[Figure: point B and its nearest labeled neighbors in the X-Y plane]

point   k   neighbors              majority
        1   x2                     red
B       3   x2, x3, x5             red
        5   x1, x2, x3, x4, x5     green
How to Choose k
[Figure: scatter plot of green and red labeled points in the X-Y plane with query points A and B]

point   k   neighbors              majority
        1   x1                     green
A       3   x1, x2, x3             red
        5   x1, x2, x3, x4, x5     green
        1   x2                     red
B       3   x2, x3, x5             red
        5   x1, x2, x3, x4, x5     green
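The label clearly depends on k. A short sketch that reproduces the rows for A with sklearn, assuming A is the query point (3, 2) and using the six labeled points from the Python illustration on the next slide:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# the six labeled points used in the Python illustration that follows
X = np.array([[1, 2], [6, 4], [7, 5], [10, -1], [10, 2], [15, 2]])
y = np.array(["green", "red", "red", "green", "green", "red"])
A = np.array([[3, 2]])  # assumed coordinates of query point A

for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict(A)[0])
# prints: 1 green / 3 red / 5 green, matching the table above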
Illustration in Python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6],
     "Label": ["green", "red", "red",
               "green", "green", "red"],
     "X": [1, 6, 7, 10, 10, 15],
     "Y": [2, 4, 5, -1, 2, 2]},
    columns=["id", "Label", "X", "Y"])

X = data[["X", "Y"]].values
Y = data[["Label"]].values

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

new_instance = np.array([[3, 2]])
prediction = knn_classifier.predict(new_instance)

ipdb> prediction[0]
'red'
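To see which training rows produced that vote, the fitted classifier's kneighbors method returns the distances and row indices of the k nearest points (a small add-on to the block above, not part of the original slide):

distances, indices = knn_classifier.kneighbors(np.array([[3, 2]]))
print(indices[0])                          # row indices of the 3 nearest points
print(data["Label"].values[indices[0]])    # the labels that were voted on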
A Numerical Example
object Height Weight Foot Label
xi (H) (W) (F) (L)
x1 5.00 100 6 green
x2 5.50 150 8 green
x3 5.33 130 7 green
x4 5.75 150 9 green
x5 6.00 180 13 red
x6 5.92 190 11 red
x7 5.58 170 12 red
x8 5.92 165 10 red
• note the very different scales of Height, Weight and Foot
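Because Weight varies over tens of units while Height varies over fractions of a unit, raw Euclidean distances would be driven almost entirely by Weight. Standardizing each feature to zero mean and unit variance (what StandardScaler does in the code that follows) puts the columns on a comparable footing; a quick sketch of the z-score computation:

import numpy as np

heights = np.array([5.00, 5.50, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92])
weights = np.array([100, 150, 130, 150, 180, 190, 170, 165], dtype=float)

# z-score: subtract the column mean, divide by the column standard deviation
z_heights = (heights - heights.mean()) / heights.std()
z_weights = (weights - weights.mean()) / weights.std()
print(z_heights.round(2), z_weights.round(2))  # both now have mean 0 and std 1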
What is the Label?
[Figure: 3D scatter plot of the eight labeled points over Height, Weight, and Foot]

(H=6, W=160, F=10) ↦ ?
kNN in Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data[["Label"]].values

# standardize each feature to zero mean and unit variance
scaler = StandardScaler().fit(X)
X = scaler.transform(X)

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

# the new instance must be scaled with the same fitted scaler
new_instance = np.array([[6, 160, 10]])
new_instance_scaled = scaler.transform(new_instance)
prediction = knn_classifier.predict(new_instance_scaled)

ipdb> prediction[0]
'red'
Result Without Scaling
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data[["Label"]].values

# no scaling this time: raw feature values are used directly
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

new_instance = np.array([[6, 160, 10]])
prediction = knn_classifier.predict(new_instance)

ipdb> prediction[0]
'red'
Why Scaling?
[Figure: 3D scatter plot of the eight points over the raw (unscaled) Height, Weight, and Foot axes]
• (Euclidean) distances d(·) are dominated by the dimension with the largest scale (Weight)
Effect of Scaling
[Figure: the same eight points after standardization; Height, Weight, and Foot all span roughly -2 to 2]
• without scaling: d(x7, x8) < d(x4, x8)
• with scaling: d(x7, x8) > d(x4, x8)
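A quick numerical check of these two inequalities, re-computing both distances before and after StandardScaler for x4, x7, and x8 from the table above:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[5.00, 100, 6], [5.50, 150, 8], [5.33, 130, 7], [5.75, 150, 9],
              [6.00, 180, 13], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10]])
Xs = StandardScaler().fit_transform(X)

d = lambda a, b: np.linalg.norm(a - b)
# rows 3, 6, 7 are x4, x7, x8 (0-based indexing)
print(d(X[6], X[7]), '<', d(X[3], X[7]))      # unscaled: d(x7,x8) < d(x4,x8)
print(d(Xs[6], Xs[7]), '>', d(Xs[3], Xs[7]))  # scaled:   d(x7,x8) > d(x4,x8)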
Calculating k
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

data = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6, 7, 8],
     "Label": ["green", "green", "green", "green",
               "red", "red", "red", "red"],
     "Height": [5, 5.5, 5.33, 5.75, 6.00, 5.92, 5.58, 5.92],
     "Weight": [100, 150, 130, 150, 180, 190, 170, 165],
     "Foot": [6, 8, 7, 9, 13, 11, 12, 10]},
    columns=["id", "Height", "Weight", "Foot", "Label"])

X = data[["Height", "Weight", "Foot"]].values
Y = data[["Label"]].values

scaler = StandardScaler().fit(X)
X = scaler.transform(X)

# hold out half of the (tiny) dataset to estimate the error rate for each k
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
    test_size=0.5, random_state=0)

error_rate = []
for k in [1, 3]:
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, Y_train)
    pred_k = knn_classifier.predict(X_test)
    error_rate.append(np.mean(pred_k != Y_test))
ipdb> error_rate
[0.5, 0.5]
Calculating k for IRIS
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

url = r'https://archive.ics.uci.edu/ml/' + \
      r'machine-learning-databases/iris/iris.data'
iris_feature_names = ['sepal-length', 'sepal-width',
                      'petal-length', 'petal-width']
data = pd.read_csv(url, names=['sepal-length', 'sepal-width',
                               'petal-length', 'petal-width', 'Class'])

# keep only two of the three Iris classes
class_labels = ['Iris-versicolor', 'Iris-virginica']
data = data[data['Class'].isin(class_labels)]

X = data[iris_feature_names].values
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

le = LabelEncoder()
Y = le.fit_transform(data['Class'].values)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5,
                                                    random_state=3)
Calculating k for IRIS (cont'd)
error_rate = []
for k in range(1, 21, 2):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, Y_train)
    pred_k = knn_classifier.predict(X_test)
    error_rate.append(np.mean(pred_k != Y_test))

# plot the test error rate as a function of k
from matplotlib.ticker import MaxNLocator
plt.figure(figsize=(10, 4))
ax = plt.gca()
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.plot(range(1, 21, 2), error_rate, color='red', linestyle='dashed',
         marker='o', markerfacecolor='black', markersize=10)
plt.title('Error Rate vs. k for Iris Subset')
plt.xlabel('number of neighbors: k')
plt.ylabel('Error Rate')
plt.show()
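To read a good k off this curve programmatically, one minimal option (using error_rate and the same k grid as above) is:

k_values = list(range(1, 21, 2))
best_k = k_values[int(np.argmin(error_rate))]
print('k with the lowest test error rate:', best_k)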
Calculating k for IRIS
[Figure: "Error Rate vs. k: Iris-versicolor and Iris-virginica" - the test error rate (roughly 0.02 to 0.06) plotted against the number of neighbors k, for k = 1 to 19]
k for IRIS
[Figure: 3D scatter plot of Iris-setosa, Iris-versicolor, and Iris-virginica over sepal-length, sepal-width, and petal-length; setosa is clearly separated from the other two classes]
[Figure: "Error Rate vs. k: Iris-setosa and Iris-virginica" - the error rate stays at 0 for all k from 1 to 19]
A Categorical Dataset
Day Weather Temperature Wind Play
1 sunny hot low no
2 rainy mild high yes
3 sunny cold low yes
4 rainy cold high no
5 sunny cold high yes
6 overcast mild low yes
7 sunny hot low yes
8 overcast hot high yes
9 rainy hot high no
10 rainy mild low yes
• what label for x* = (sunny, cold, low)?
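Categorical attributes have no natural numeric distance, so the code on the next slide one-hot encodes each column; under that encoding, the squared Euclidean distance between two days equals twice the number of attributes on which they disagree, so kNN effectively counts mismatches. A small sketch of that idea (the mini-frame below uses only days 1-3 for brevity):

import pandas as pd

days = pd.DataFrame({
    'Weather':     ['sunny', 'rainy', 'sunny'],
    'Temperature': ['hot', 'mild', 'cold'],
    'Wind':        ['low', 'high', 'low']})   # days 1-3 from the table above
x_star = pd.Series({'Weather': 'sunny', 'Temperature': 'cold', 'Wind': 'low'})

# number of attributes on which each day differs from x*
mismatches = (days != x_star).sum(axis=1)
print(mismatches.tolist())   # [1, 3, 0] -> day 3 matches x* exactly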
Python Code
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame(
    {'Day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'Weather': ['sunny', 'rainy', 'sunny', 'rainy', 'sunny', 'overcast',
                 'sunny', 'overcast', 'rainy', 'rainy'],
     'Temperature': ['hot', 'mild', 'cold', 'cold', 'cold', 'mild',
                     'hot', 'hot', 'hot', 'mild'],
     'Wind': ['low', 'high', 'low', 'high', 'high', 'low', 'low',
              'high', 'high', 'low'],
     'Play': ['no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes',
              'yes', 'no', 'yes']},
    columns=['Day', 'Weather', 'Temperature', 'Wind', 'Play'])

# one-hot encode the three categorical features
input_data = data[['Weather', 'Temperature', 'Wind']]
dummies = [pd.get_dummies(data[c]) for c in input_data.columns]
binary_data = pd.concat(dummies, axis=1)
X = binary_data[0:10].values

le = LabelEncoder()
Y = le.fit_transform(data['Play'].values)

knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X, Y)

# x* = (sunny, cold, low) in the one-hot column order:
# (overcast, rainy, sunny, cold, hot, mild, high, low)
new_instance = np.array([[0, 0, 1, 1, 0, 0, 0, 1]])
prediction = knn_classifier.predict(new_instance)
ipdb> prediction[0]
1
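The value 1 is the encoded class; LabelEncoder's inverse_transform maps it back to the original label (a small add-on, not on the original slide):

print(le.inverse_transform(prediction)[0])   # 'yes' -> play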
kNN: IRIS
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

url = r'https://archive.ics.uci.edu/ml/' + \
      r'machine-learning-databases/iris/iris.data'
iris_feature_names = ['sepal-length', 'sepal-width',
                      'petal-length', 'petal-width']
data = pd.read_csv(url, names=['sepal-length', 'sepal-width',
                               'petal-length', 'petal-width', 'Class'])

# keep only two of the three Iris classes
class_labels = ['Iris-versicolor', 'Iris-virginica']
data = data[data['Class'].isin(class_labels)]

X = data[iris_feature_names].values
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

le = LabelEncoder()
Y = le.fit_transform(data['Class'].values)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
    test_size=0.5, random_state=3)

knn_classifier = KNeighborsClassifier(n_neighbors=15)
knn_classifier.fit(X_train, Y_train)
prediction = knn_classifier.predict(X_test)
error_rate = np.mean(prediction != Y_test)
ipdb> error_rate
0.06
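The same number can be obtained from sklearn.metrics, since the error rate is one minus the accuracy (optional add-on, not on the original slide):

from sklearn.metrics import accuracy_score
print(1 - accuracy_score(Y_test, prediction))   # same value as error_rate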
Concepts Check:
(a) distances and neighbors
(b) nearest neighbor intuition
(c) need for scaling
(d) how to choose k
(e) analyzing categorical data