Unsupervised Learning: Partitioning Methods
CS 822 Data Mining
Anis ur Rahman
Department of Computing
NUST-SEECS
Islamabad
December 3, 2018
Roadmap
1 Basic Concepts
2 K-Means
3 K-Medoids
Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
Principle
Maximizing intra-class similarity & minimizing inter-class similarity
Typical Applications
WWW, Social networks, Marketing, Biology, Library, etc.
Partitioning Methods
Given
A data set of n objects
k: the number of clusters to form
Organize the objects into k partitions (k ≤ n) where each partition
represents a cluster
The clusters are formed to optimize an objective partitioning
criterion
Objects within a cluster are similar
Objects of different clusters are dissimilar
Lazy learning vs. Eager learning
1 Eager learning
Given a training set, constructs a classification model before
receiving new (e.g., test) data to classify
e.g. decision tree induction, Bayesian classification, rule-based
classification
2 Lazy learning
Simply stores the training data (or does only minor processing) and
waits until it is given a new instance
Lazy learners take less time in training but more time in predicting
e.g., k-nearest-neighbor classifiers, case-based reasoning classifiers
Example Problem: Face Recognition
We have a database of (say) 1 million face images
We are given a new image and want to find the most similar
images in the database
Represent faces by (relatively) invariant values, e.g., ratio of nose
width to eye width
Each image represented by a large number of numerical features
Problem: given the features of a new face, find those in the DB that
are close in at least ¾ (say) of the features
Lazy Learning
Typical approaches of lazy learning:
1 k-nearest neighbor approach
Instances represented as points in a Euclidean space.
2 Case-based reasoning
Uses symbolic representations and knowledge-based inference
3 Locally weighted regression
Constructs local approximation
Roadmap
1 Basic Concepts
2 K-Means
3 K-Medoids
K-Means
Goal
Create 3 clusters (partitions)
1 Choose 3 objects (cluster centroids)
Assign each object to the closest
centroid to form clusters
3 Update cluster centroids
K-Means
Goal
Create 3 clusters (partitions)
4 Recompute clusters
5 If the centroids are stable, then stop
K-Means Algorithm
Input
K : the number of clusters
D : a data set containing n objects
Output: A set of k clusters
Method (a code sketch follows the steps below):
1 Arbitrarily choose k objects from D as the initial cluster centers
2 Repeat
3 Reassign each object to the most similar cluster based on the mean
value of the objects in the cluster
4 Update the cluster means
5 Until no change
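A minimal NumPy sketch of this procedure, for illustration only; the function name k_means, the fixed iteration cap, and the np.allclose stopping test are assumptions rather than part of the original listing.

import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Cluster the rows of D (an n x d array) into k partitions."""
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)]    # step 1: arbitrary initial centers
    for _ in range(max_iter):                                 # step 2: repeat
        # step 3: (re)assign each object to the most similar (closest) center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: update the cluster means (keep the old center if a cluster went empty)
        new_centers = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):                 # step 5: until no change
            break
        centers = new_centers
    return labels, centers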
K-Means Properties
The algorithm attempts to determine k partitions that minimize the
square-error function
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2
E: the sum of the squared error for all objects in the dataset
p: the data point in the space representing an object
m_i: the mean of cluster C_i
It works well when the clusters are compact clouds that are rather
well separated from one another
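As a companion to the sketch above, the criterion E can be written as a single function; the name sse and the argument layout are assumptions, with D, labels and centers as returned by the k_means sketch.

import numpy as np

def sse(D, labels, centers):
    """E: sum of squared Euclidean distances from each object to its cluster mean."""
    return sum(np.sum((D[labels == i] - c) ** 2) for i, c in enumerate(centers))

This is the quantity that the reassignment and update steps never increase from one iteration to the next.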
K-Means: Advantages
K-means is relatively scalable and efficient in processing large
data sets
The computational complexity of the algorithm is O (nkt)
n: the total number of objects
k : the number of clusters
t: the number of iterations
Normally: k << n and t << n
K-Means: Disadvantages
Can be applied only when the mean of a cluster is defined
Users need to specify k
K-means is not suitable for discovering clusters with nonconvex
shapes or clusters of very different size
It is sensitive to noise and outlier data points (can influence the
mean value)
K-Means demo
Variations of K-Means
A few variants of k-means exist, which differ in
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
How can we change K-Means to deal with categorical data?
Handling categorical data: k-modes (Huang’98)
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data can be handled by combining both ideas (the k-prototypes method); a brief sketch of the k-modes ingredients follows
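A rough sketch of the two k-modes ingredients listed above, a simple-matching dissimilarity and a frequency-based mode update; the helper names are assumptions and this is not Huang's original code.

from collections import Counter

def matching_dissimilarity(x, y):
    """Simple matching: count the attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(objects):
    """Frequency-based update: per-attribute most frequent value within one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))

For example, cluster_mode([('red', 'S'), ('red', 'M'), ('blue', 'M')]) returns ('red', 'M').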
k-Nearest-Neighbor Classifiers
Nearest-neighbor classifiers were first described in the early 1950s
It has since been widely used in the area of pattern recognition
The training instances are described by n attributes
Each instance represents a point in an n-dimensional space
A k-nearest-neighbor classifier searches the pattern space for the
k training instances that are closest to the unknown instance
k-Nearest-Neighbor Classifiers
Example:
We are interested in classifying the type of drug a patient should
be prescribed
Based on the age of the patient and the patient’s
sodium/potassium ratio (Na/K)
Dataset includes 200 patients
Scatter plot
Close-up of neighbors to new patient 2
Main questions:
How many neighbors should we consider? That is, what is k?
How do we measure distance?
Should all points be weighted equally, or should some points have
more influence than others?
k-Nearest-Neighbor Classifiers
The nearest neighbor can be defined in terms of Euclidean
distance, dist(X1 , X2 )
The Euclidean distance between two points or instances, say,
X1 = (x11 , x12 , · · · , x1n ) and X2 = (x21 , x22 , · · · , x2n ), is:
dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}
Nominal attributes: distance either 0 or 1
Refer to cluster analysis for more distance metrics
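A direct transcription of this distance, together with the 0/1 rule for nominal attributes; the function names are illustrative assumptions.

import math

def euclidean(x1, x2):
    """dist(X1, X2) = sqrt of the sum over i of (x1i - x2i)^2, for numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def nominal_difference(a, b):
    """Nominal attributes: the per-attribute distance is 0 if the values match, else 1."""
    return 0 if a == b else 1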
k-Nearest-Neighbor Classifiers
Typically, we normalize the values of each attribute in advance.
This helps prevent attributes with initially large ranges (such as
income) from outweighing attributes with initially smaller ranges
(such as binary attributes).
Min-max normalization: v' = (v - min_A) / (max_A - min_A)
so that all attribute values lie between 0 and 1
For more information on normalization methods refer to data
preprocessing section
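The normalization itself is a one-line computation. A hedged sketch, where min_A and max_A are the observed minimum and maximum of attribute A (a constant attribute, with max_A equal to min_A, would need special handling):

def min_max(v, min_A, max_A):
    """Map v into [0, 1]: v' = (v - min_A) / (max_A - min_A)."""
    return (v - min_A) / (max_A - min_A)

For example, a value of 30 in an observed range of 18 to 66 maps to (30 - 18) / (66 - 18) = 0.25.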
k-Nearest-Neighbor Classifiers
For k-nearest-neighbor classification, the unknown instance is
assigned the most common class among its k nearest neighbors
When k = 1, the unknown instance is assigned the class of the
training instance that is closest to it in pattern space
Nearest-neighbor classifiers can also be used for prediction, that
is, to return a real-valued prediction for a given unknown instance
In this case, the classifier returns the average value of the
real-valued labels associated with the k nearest neighbors of the
unknown instance
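Both uses fit in a few lines: majority voting for classification and averaging for numeric prediction. A sketch that assumes numeric feature vectors, Euclidean distance via math.dist (Python 3.8+), and illustrative function names.

import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (numeric feature vector, class label); majority vote of the k nearest."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_predict(train, query, k):
    """Same neighbor search, but averages real-valued labels instead of voting."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    return sum(value for _, value in neighbors) / k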
k-Nearest-Neighbor Classifiers
Distances for categorical attributes:
A simple method is to compare the corresponding value of the
attribute in instance X1 with that in instance X2
If the two are identical (e.g., instances X1 and X2 both have the
color blue), then the difference between the two is taken as 0,
otherwise 1
Other methods may incorporate more sophisticated schemes for
differential grading (e.g., where a larger difference score is assigned,
say, for blue and white than for blue and black)
k-Nearest-Neighbor Classifiers
Handling missing values:
In general, if the value of a given attribute A is missing in instance
X1 and/or in instance X2 , we assume the maximum possible
difference
For categorical attributes, we take the difference value to be 1 if
either one or both of the corresponding values of A are missing
If A is numeric and missing from both instances X1 and X2 , then
the difference is also taken to be 1
If only one value is missing and the other (which we'll call v') is
present and normalized, then we can take the difference to be either
|1 − v'| or |0 − v'|, whichever is greater
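The categorical 0/1 rule and these missing-value conventions can be folded into one per-attribute difference. A sketch in which None marks a missing value and numeric attributes are assumed to be min-max normalized already; the function name is an assumption.

def attribute_difference(a, b, categorical=False):
    """Per-attribute difference between two instances, None marking a missing value."""
    if a is None and b is None:
        return 1.0                        # both missing: assume the maximum difference
    if a is None or b is None:
        if categorical:
            return 1.0                    # one missing categorical value: difference 1
        v = b if a is None else a         # the present, already normalized value v'
        return max(abs(1 - v), abs(0 - v))   # whichever of |1 - v'|, |0 - v'| is greater
    if categorical:
        return 0.0 if a == b else 1.0     # identical categorical values differ by 0, else 1
    return abs(a - b)                     # both numeric values present (assumed normalized)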
k-Nearest-Neighbor Classifiers
Determining a good value for k:
k can be determined experimentally.
Starting with k = 1, we use a test set to estimate the error rate of
the classifier.
This process can be repeated each time by incrementing k to allow
for one more neighbor.
The k value that gives the minimum error rate may be selected.
In general, the larger the number of training instances is, the
larger the value of k will be
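A short sketch of that experiment; it assumes a held-out test set of (instance, label) pairs, any k-NN classifier with the signature classify(train, query, k) such as the one sketched earlier, and an illustrative candidate range of odd k values.

def choose_k(classify, train, test, k_values=range(1, 21, 2)):
    """Return the k with the lowest error rate on the held-out test set."""
    def error_rate(k):
        wrong = sum(classify(train, x, k) != y for x, y in test)
        return wrong / len(test)
    return min(k_values, key=error_rate)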
Finding nearest neighbors efficiently
Simplest way of finding nearest neighbor: linear scan of the data
Classification takes time proportional to the product of the number
of instances in training and test sets
Nearest-neighbor search can be done more efficiently using
appropriate methods
kD-trees (k-dimensional trees) represent training data in a tree
structure
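As an illustration of tree-based search, a brief example with SciPy's kd-tree implementation (scipy.spatial.cKDTree); the data below are random placeholders, not the slides' example.

import numpy as np
from scipy.spatial import cKDTree

train = np.random.rand(1000, 4)      # e.g., 1000 training instances with 4 numeric attributes
tree = cKDTree(train)                # build the kd-tree once over the training data
query = np.random.rand(4)
dists, idx = tree.query(query, k=5)  # distances and indices of the 5 nearest training instances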
Roadmap
1 Basic Concepts
2 K-Means
3 K-Medoids
K-Medoids Method
Minimize the sensitivity of k-means to outliers
Pick actual objects to represent clusters instead of mean values
Each remaining object is clustered with the representative object
(medoid) to which it is the most similar
The algorithm minimizes the sum of the dissimilarities between
each object and its corresponding representative object
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|
E: the sum of absolute error for all objects in the data set
p: the data point in the space representing an object
o_i: the representative object (medoid) of cluster C_i
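A literal rendering of this criterion as a function; the Manhattan distance is chosen here because it is the dissimilarity used in the worked example that follows, and the names are assumptions.

def manhattan(p, q):
    """L1 (Manhattan) dissimilarity between two numeric tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

def absolute_error(objects, labels, medoids):
    """E: sum of dissimilarities from each object to the medoid of its cluster."""
    return sum(manhattan(p, medoids[labels[i]]) for i, p in enumerate(objects))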
K-Medoids: The idea
Initial representatives are chosen randomly
The iterative process of replacing representative objects with
non-representative objects continues as long as the quality of the
clustering is improved
For each representative object O
For each non-representative object R, swap O and R
Choose the configuration with the lowest cost
The cost function is the difference in absolute error when a current
representative object is replaced by a non-representative object
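A compact sketch of that cost, computed by re-clustering around the trial set of medoids; the Manhattan dissimilarity and the function names are assumptions made for this illustration.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_error(objects, medoids):
    """Assign every object to its nearest medoid and sum the dissimilarities."""
    return sum(min(manhattan(p, m) for m in medoids) for p in objects)

def swap_cost(objects, medoids, out_medoid, candidate):
    """S: change in total absolute error if out_medoid is replaced by candidate."""
    trial = [candidate if m == out_medoid else m for m in medoids]
    return total_error(objects, trial) - total_error(objects, medoids)

A positive S means the swap worsens the clustering and should be rejected; a negative S means the candidate is a better representative.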
Data Objects
Object  A1  A2
O1       2   6
O2       3   4
O3       3   8
O4       4   7
O5       6   2
O6       6   4
O7       7   3
O8       7   4
O9       8   5
O10      7   6
Goal: create two clusters
Choose randomly two medoids:
O2 = (3, 4)
O8 = (7, 4)
Assign each object to the closest representative object (medoid)
Using the L1 metric (Manhattan distance), we form the following clusters:
Cluster1 = {O1, O2, O3, O4}
Cluster2 = {O5, O6, O7, O8, O9, O10}
Compute the absolute error criterion for the set of medoids (O2, O8):
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|
  = |O1 - O2| + |O3 - O2| + |O4 - O2| + |O5 - O8| + |O6 - O8| + |O7 - O8| + |O9 - O8| + |O10 - O8|
  = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Choose a random non-representative object, O7
Swap O8 and O7
Compute the absolute error criterion for the set of medoids (O2, O7)
Compute the cost function:
S = absolute error for (O2, O7) − absolute error for (O2, O8)
S = 22 − 20 = 2
S > 0 ⇒ it is a bad idea to replace O8 by O7
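These numbers can be checked mechanically. A self-contained snippet over the ten objects from the table above (0-based indices, so O2, O7 and O8 sit at positions 1, 6 and 7; the helper names are assumptions):

D = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
     (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]       # O1 .. O10

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))    # Manhattan distance

def E(medoid_indices):
    """Absolute error: each object joins the cluster of its nearest medoid."""
    return sum(min(l1(p, D[m]) for m in medoid_indices) for p in D)

print(E([1, 7]))               # medoids (O2, O8): 20
print(E([1, 6]))               # medoids (O2, O7): 22
print(E([1, 6]) - E([1, 7]))   # S = 2 > 0, so the swap of O8 for O7 is rejected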
In this example, changing the medoid of cluster 2 did not change the
assignments of objects to clusters.
What are the possible cases when we replace a medoid
by another object?
K-Medoids
First case
Currently P assigned to A
The assignment of P to A does not
change
Second case
Currently P assigned to B
P is reassigned to A
K-Medoids
Third case
Currently P assigned to B
P is reassigned to the new B
Fourth case
Currently P assigned to A
P is reassigned to the new B
PAM: Partitioning Around Medoids
Input
K : the number of clusters
D : a data set containing n objects
Output: A set of k clusters
Method (a code sketch follows the steps below):
1 Arbitrarily choose k objects from D as representative objects (seeds)
2 Repeat
3 Assign each remaining object to the cluster with the nearest
representative object
4 For each representative object Oj
5 Randomly select a non-representative object Orandom
6 Compute the total cost S of swapping representative object Oj with
Orandom
7 if S < 0 then replace Oj with Orandom
8 Until no change
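A compact sketch of this loop for small numeric data sets; the fixed iteration budget (instead of a strict "until no change" test), the Manhattan dissimilarity, and the names are assumptions, and cluster assignment is implicit in the error computation.

import random

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def pam(D, k, iterations=1000, seed=0):
    """k-medoids along the lines of the listing above; D is a list of numeric tuples."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(D)), k)                  # step 1: arbitrary seed objects
    error = lambda meds: sum(min(l1(p, D[m]) for m in meds) for p in D)
    for _ in range(iterations):                             # steps 2-8
        j = rng.choice(medoids)                             # step 4: a representative object Oj
        o = rng.choice([i for i in range(len(D)) if i not in medoids])   # step 5: Orandom
        trial = [o if m == j else m for m in medoids]
        if error(trial) - error(medoids) < 0:               # steps 6-7: swap only if S < 0
            medoids = trial
    assignment = [min(medoids, key=lambda m: l1(p, D[m])) for p in D]    # step 3
    return medoids, assignment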
PAM Properties
The complexity of each iteration is O(k(n − k)^2)
For large values of n and k , such computation becomes very costly
Advantages
K-Medoids method is more robust than k-Means in the presence of
noise and outliers
Disadvantages
K-Medoids is more costly than k-Means
Like k-means, k-medoids requires the user to specify k
It does not scale well for large data sets
References
J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier
Inc. (2006). (Chapter 6)
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques, 2nd Edition, Elsevier Inc., 2005. (Chapter 6)