Mod04 K Nearest Neighbor
Supervised vs. Unsupervised Methods
• Data mining methods are categorized as either
Unsupervised or Supervised
• Unsupervised Methods
– A target variable is not specified
– Instead, the algorithm searches for patterns and structure among
the variables
– Clustering is the most common unsupervised method
– For example, political consultants analyze voter clusters in
congressional districts that may be responsive to their particular
candidate
– Important variables such as gender, age, income, and race are
input to the clustering algorithm
– Voter profiles for fund-raising and advertising are created
Supervised vs. Unsupervised Methods (cont’d)
• Supervised Methods
– A target variable is specified
– The algorithm “learns” from the examples by determining which
values of the predictor variables are associated with different
values of the target variable
– For example, the regression methods discussed in Chapter 4 are
supervised. The observed values of the response (target)
variable are read by the least-squares algorithm, while it
attempts to minimize the prediction error
– All classification methods in Chapters 5 – 7 are supervised methods, including Decision Trees, Neural Networks, and k-Nearest Neighbors
Methodology for Supervised Modeling
• Supervised data mining methods use Training, Test, and
Validation data sets as part of the model building and
evaluation process
• Training
– The Training Set includes records with predictor variables and
pre-classified values for the target variable
– This is the initial stage where a provisional data mining model is
built using the training set
– The model “learns” from the examples in the training set
– What happens if the model blindly applies all patterns learned
from the training set to future data?
Methodology for Supervised Modeling (cont’d)
– For example, suppose every customer named “David” in a training set happens to be in the high-income bracket
– A data mining model that “memorizes” this idiosyncrasy in the
training set is actually overfitting the data
– Most likely we would not want our model to apply this rule to
future or unseen data
– Therefore, the next step in the process is to examine the
performance of the provisional data model using a different set
of data
Methodology for Supervised Modeling (cont’d)
• Testing
– The Test Set is a “holdout” set of data independent from the
training set that was used to build the provisional data model
– The true values of the target variable in the test set are hidden
temporarily from the provisional data model
– The provisional data model simply classifies the records in the
test set according to the rules and patterns it learned from the
records in the training set
– The performance of the provisional data model is evaluated by
comparing its classifications against the actual values of the
target variable
– The provisional data model is then adjusted in an effort to
minimize the error rate on the test set
Methodology for Supervised Modeling (cont’d)
• Validation
– Next, the adjusted data model is applied to another set of data
called the Validation Set
– The validation set is another “holdout” set of data independent
of the training and test sets
– The performance of the adjusted data model is evaluated
against the validation set
– If required, the adjusted data model is modified to minimize the
error rate on the validation set
– Estimates of the model’s performance on future, unseen data are computed by applying evaluative measures to the results obtained when classifying the validation set (a simple splitting sketch follows below)
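To make the three-way split concrete, here is a minimal sketch in Python; it is my illustration, not the text’s procedure, and the 60/20/20 fractions and function name are assumptions:

```python
# A minimal sketch of a train/test/validation split. The 60/20/20
# fractions are illustrative assumptions, not a prescribed ratio.
import random

def train_test_validation_split(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle the records and partition them into three disjoint sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return (shuffled[:n_train],                  # training set
            shuffled[n_train:n_train + n_test],  # test set
            shuffled[n_train + n_test:])         # validation set
```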
Data Mining View
• Traditional Analytics:
– A set of techniques exists that can be applied to different problems
k-Nearest Neighbor Algorithm
– The k-Nearest Neighbor algorithm is an example of instance-based learning, in which the training set records are first stored
– Next, a new, unclassified record is classified by comparing it to the records in the training set it is most similar to
– k-Nearest Neighbor is used most often for classification, although it is also applicable to estimation and prediction tasks
• Example: Patient 1
– Recall from Chapter 1 that we were interested in classifying the
type of drug a patient should be prescribed
– The training set consists of 200 patients with Na/K ratio, age,
and drug attributes
– Our task is to classify the type of drug a new patient should be prescribed, given that the patient is 40 years old and has a Na/K ratio of 29
k-Nearest Neighbor Algorithm (cont’d)
– This scatter plot of Na/K against Age shows the records in the
training set that patients 1, 2, and 3 are most similar to
– A “drug” overlay is shown where Light points = drug Y, Medium
points = drug A or X, and Dark points = drug B or C
k-Nearest Neighbor Algorithm (cont’d)
[Figure: close-up of Patient 2 (age 17, Na/K ratio 12.5) with its three nearest training points, labeled A, B, and C]
k-Nearest Neighbor Algorithm (cont’d)
– With k = 1, Patient 2 is closest to the Dark point and is therefore classified as drug B or C
k-Nearest Neighbor Algorithm (cont’d)
– However, with k = 3, voting determines that two of the three closest points to Patient 2 are Medium
– Therefore, Patient 2 is classified as drug A or X
– Note that the classification of Patient 2 differs depending on the value chosen for k
• Example: Patient 3
– Patient 3 is 47 years old and has a Na/K ratio of 13.5. A close-up shows Patient 3 in the center, along with the three closest training data points
k-Nearest Neighbor Algorithm (cont’d)
– With k = 1, Patient 3 is closest to the Dark point, based on a
distance measure
– Therefore, Patient 3 is classified as drug B or C
– With k = 2 or k = 3, voting does not help, since each of the three nearest training points has a different target value
• Considerations when using k-Nearest Neighbor
– How many neighbors should be used? k = ?
– How is the distance between points measured?
– How is the information from two or more neighbors combined
when making a classification decision?
– Should all points be weighted equally, or should some points have more influence? (These choices are made concrete in the sketch below)
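The following Python is a minimal illustration of the basic algorithm, not the text’s implementation: it uses Euclidean distance and unweighted majority voting, the toy (age, Na/K) records are hypothetical, and attribute scaling is deliberately omitted here (it is addressed in the Distance Function section below):

```python
# A minimal k-NN sketch: Euclidean distance plus unweighted majority voting.
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, new_record, k=3):
    """training: list of (features, label) pairs; returns the majority
    label among the k records nearest to new_record."""
    nearest = sorted(training, key=lambda rec: euclidean(rec[0], new_record))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (age, Na/K) records in the spirit of the drug example:
train = [((40, 29), "Y"), ((45, 12), "A/X"), ((47, 13.5), "B/C")]
print(knn_classify(train, (41, 28), k=1))  # -> "Y"
```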
Distance Function
• Example
– Suppose Patient A is 20 years old and has a Na/K ratio of 12, and Patient B is 30 years old and has a Na/K ratio of 8
– What is the Euclidean distance between these instances?
Distance Function (cont’d)
– The Euclidean distance between records $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_m)$ is
$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
– For Patients A and B:
$$d(A, B) = \sqrt{(20 - 30)^2 + (12 - 8)^2} = \sqrt{116} \approx 10.77$$
Distance Function (cont’d)
– An attribute with a large range can overwhelm the others. For example, a difference of 10,000 in one attribute (say, income in dollars) alongside a difference of 10 in another (say, age in years) gives
$$\sqrt{10{,}000^2 + 10^2} \approx 10{,}000$$
so the large-range attribute alone determines the distance. With both differences on comparable scales, each contributes:
$$\sqrt{10^2 + 10^2} \approx 14.14$$
Distance Function (cont’d)
– For this reason, numeric attributes are usually normalized before distances are computed
– Min-max normalization: $X^* = \dfrac{X - \min(X)}{\max(X) - \min(X)}$
– Z-score standardization: $X^* = \dfrac{X - \text{mean}(X)}{\text{SD}(X)}$
Distance Function (cont’d)
– For categorical attributes, the “different from” function compares field values:
$$\text{different}(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$$
Distance Function (cont’d)
– Applying both normalizations to Age (min = 10, range = 50, mean = 45, SD = 15):

Patient   Age   Min-max normalized Age   Z-score standardized Age   Gender
B         20    (20 − 10)/50 = 0.2       (20 − 45)/15 = −1.67       Male
C         50    (50 − 10)/50 = 0.8       (50 − 45)/15 = 0.33        Female
Distance Function (cont’d)
– Using the raw ages together with the different() function for Gender, Patient A (age 50, male) lies at the following distances from B and C:
$$d(A, B) = \sqrt{(50 - 20)^2 + 0^2} = 30$$
$$d(A, C) = \sqrt{(50 - 50)^2 + 1^2} = 1$$
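A short sketch may help tie these pieces together. The following Python combines min-max normalization for numeric fields with the different() function for categorical fields; the dict-based record layout and function names are my assumptions:

```python
# A hedged sketch of a mixed-type distance: numeric fields are min-max
# normalized, categorical fields use the different() function.
import math

def mixed_distance(x, y, numeric_ranges):
    """x, y: dicts of field -> value; numeric_ranges: field -> (min, range)."""
    total = 0.0
    for field in x:
        if field in numeric_ranges:
            lo, rng = numeric_ranges[field]
            diff = (x[field] - lo) / rng - (y[field] - lo) / rng
        else:
            diff = 0.0 if x[field] == y[field] else 1.0  # different()
        total += diff ** 2
    return math.sqrt(total)

A = {"age": 50, "gender": "Male"}
C = {"age": 50, "gender": "Female"}
print(mixed_distance(A, C, {"age": (10, 50)}))  # -> 1.0 (only gender differs)
```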
Combination Function
– The combination function determines how the target values of the k nearest neighbors are combined into a single classification
– The simplest approach is unweighted voting: each of the k nearest neighbors casts one equal vote, and the majority classification wins
Weighted Voting
• Weighted Voting
– In this case, the closer the neighbor, the more influence it has in
the classification decision
– This method assumes that a closer neighbor is more similar, and therefore its vote should be weighted more heavily than that of a more distant neighbor
– The weight of a particular record is inversely proportional to its distance from the unclassified record
– A “tie” is unlikely to occur using this approach
Weighted Voting (cont’d)
• Example
– Again, recall that we classified a new patient, 17 years old with a Na/K ratio of 12.5, using k = 3
– We determined, using unweighted voting, that two of the three closest points were Medium and the third was Dark
– However, the Dark point is the one most similar to the new patient
– Now we reclassify the new patient using a weighted voting scheme, with the following distances to the three nearest records:

Record   Drug overlay   Distance to new patient
A        Dark           0.004393
B        Medium         0.058893
C        Medium         0.051022
Weighted Voting (cont’d)
$$\text{votes(Dark)} = \frac{1}{d(\text{new}, A)^2} = \frac{1}{0.004393^2} \approx 51{,}818$$
Weighted Voting (cont’d)
$$\text{votes(Medium)} = \frac{1}{d(\text{new}, B)^2} + \frac{1}{d(\text{new}, C)^2} = \frac{1}{0.058893^2} + \frac{1}{0.051022^2} \approx 672$$
– Since the Dark point receives roughly 51,818 votes to the Medium points’ 672, weighted voting classifies the new patient as drug B or C, reversing the unweighted k = 3 result
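The weighted vote is easy to verify with a few lines of Python. This is a minimal sketch of inverse squared-distance voting using the distances from the example; it is an illustration, not code from the text:

```python
# Inverse squared-distance weighted voting over the three neighbors above.
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (label, distance) pairs; weight = 1/distance**2."""
    votes = defaultdict(float)
    for label, distance in neighbors:
        votes[label] += 1.0 / distance ** 2
    return max(votes, key=votes.get), dict(votes)

neighbors = [("Dark", 0.004393), ("Medium", 0.058893), ("Medium", 0.051022)]
winner, tallies = weighted_vote(neighbors)
print(winner, tallies)  # Dark wins, roughly 51,818 votes to 672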
Quantifying Attribute Relevance: Stretching the Axes (cont’d)
– Perhaps we should consider restricting the algorithm to using only the most important fields for classification
– However, rather than making this determination a priori, we can instead make attributes more or less important by weighting them
– This is accomplished using cross-validation or by applying domain knowledge and expertise
• Stretching the Axes
– Stretching the axes means finding a coefficient $z_j$ by which to multiply the $j$th axis; larger values of $z_j$ are associated with the more important variable axes
• Cross-validation
– Cross-validation selects a random subset of data from the
training set and determines the set of z1, z2, ..., zm that minimize
the classification error on the test set
Quantifying Attribute Relevance: Stretching the Axes (cont’d)
– Repeating the process leads to a more accurate set of values for
z1, z2, ..., zm
• Domain Expertise
– Alternately, we may call upon domain experts to recommend
values for z1, z2, ..., zm
– Using either approach, the k-Nearest Neighbor algorithm may be made more precise (a rough selection sketch follows below)
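As a rough illustration of selecting the z values on held-out data, the sketch below tries every combination of candidate weights and keeps the one with the lowest held-out error. This is my assumption of one way to do it, not the book’s algorithm; `classify` is assumed to be any k-NN classifier that accepts axis weights (such as the stretched-distance sketch in the next example):

```python
# Exhaustive search over a small candidate grid of axis weights,
# scored on a held-out subset of the training data.
import itertools

def choose_axis_weights(train, held_out, candidates, classify):
    """classify(train, record, z) -> predicted label; returns the z
    tuple with the fewest misclassifications on held_out."""
    m = len(train[0][0])                      # number of predictor axes
    best_z, best_err = None, float("inf")
    for z in itertools.product(candidates, repeat=m):
        err = sum(1 for x, label in held_out if classify(train, x, z) != label)
        if err < best_err:
            best_z, best_err = z, err
    return best_z
```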
• Example
– Suppose that the Na/K ratio was determined to be three times more important than the Age attribute for performing drug classification
Quantifying Attribute Relevance: Stretching the Axes (cont’d)
– The distances from records A, B, and C to the new record are then calculated with the axes stretched by $z_{Na/K} = 3$ and $z_{Age} = 1$
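One natural reading of “multiply the jth axis by z_j” gives the stretched distance sketched below; the record coordinates in the usage line are hypothetical, and the weights (3, 1) mirror the example above:

```python
# A stretched-axes Euclidean distance: axis j is multiplied by z[j]
# before the usual distance is taken.
import math

def stretched_distance(x, y, z):
    """Euclidean distance after scaling axis j by z[j]."""
    return math.sqrt(sum((zj * (xj - yj)) ** 2 for xj, yj, zj in zip(x, y, z)))

# Records as (Na/K ratio, age), with the Na/K axis stretched by 3:
print(stretched_distance((12.5, 17), (12.0, 20), z=(3, 1)))
```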
Database Considerations
– Instance-based learning methods benefit from having access to
learning examples composed of many attribute value
combinations
– The data set should be balanced to include a sufficient number
of records with common, as well as less-common, classifications
– One approach to balancing the data set is to reduce the
proportion of records with more common classifications
– Restrictions on main memory space may limit the size of the
training set used
– The training set may be reduced to include only those records
that occur near a classification “boundary”
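As one possible reading of the balancing suggestion above, the following sketch down-samples over-represented classifications; the per-class cap and function name are assumptions of mine, not the text’s recipe:

```python
# One balancing tactic: down-sample over-represented target classes.
import random
from collections import defaultdict

def downsample_majority(records, label_of, max_per_class, seed=0):
    """Keep at most max_per_class records for each target value."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    balanced = []
    for recs in by_class.values():
        rng.shuffle(recs)
        balanced.extend(recs[:max_per_class])
    return balanced
```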
k-Nearest Neighbor Algorithm for Estimation and Prediction (cont’d)
– Assume the target variable, systolic blood pressure (BP), has range 80 and minimum 90
– We also stretch the axes for the Na/K ratio to reflect its importance in estimating BP. In addition, we use the inverse squares of the distances for the weights
– The weighted estimate is
$$\hat{y}_{\text{new}} = \frac{\sum_i w_i y_i}{\sum_i w_i}, \qquad \text{where } w_i = \frac{1}{d(\text{new}, x_i)^2} \text{ for the existing records } x_1, x_2, \ldots, x_k$$
– The estimated systolic blood pressure for the new record is:
$$\hat{y}_{\text{new}} = \frac{\dfrac{120}{0.009305^2} + \dfrac{122}{0.17643^2} + \dfrac{130}{0.09756^2}}{\dfrac{1}{0.009305^2} + \dfrac{1}{0.17643^2} + \dfrac{1}{0.09756^2}} \approx 120.0954$$
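The arithmetic above can be checked with a short sketch; the BP values and distances are taken directly from the worked example:

```python
# Checking the worked estimate with the example's values and distances.
def weighted_estimate(values, distances):
    """Inverse squared-distance weighted mean of the neighbors' targets."""
    weights = [1.0 / d ** 2 for d in distances]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

bp = [120, 122, 130]
dist = [0.009305, 0.17643, 0.09756]
print(round(weighted_estimate(bp, dist), 4))  # -> approximately 120.0954
```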
Choosing k (cont’d)
– Choosing k involves a tradeoff: a small k lets noisy or idiosyncratic records sway the classification, while a large k smooths over local structure in the data
– Text:
• Chapter 3: Exploring Categorical Variables
• Chapter 4: Statistical approaches to estimation and prediction (confidence interval estimation)
• Chapter 5: Entire chapter
– SAS and R
– Next topic
• k-Means (Chapter 8)