Lecture #7
Unsupervised Learning
(Clustering)
What is Cluster Analysis?
Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
Applications of Cluster Analysis
Understanding
– Group students who succeed and fail in the same exercises
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)
What is not Cluster Analysis?
Supervised classification
– Uses class label information
Simple segmentation
– Dividing students into different registration groups
alphabetically, by last name
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters
Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly
one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical
tree
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center
point)
Each point is assigned to the cluster with the closest
centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
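The basic algorithm can be sketched in a few lines (a minimal illustration in Python/NumPy; the function and variable names are my own, not from the lecture, and an empty cluster simply keeps its previous centre):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch: assign points to the nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    # 1. Select K data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # 2. Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it
        #    (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for i in range(k):
            if (labels == i).any():
                new_centroids[i] = points[labels == i].mean(axis=0)
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Steps 2–4 repeat until the assignments stabilize; only K and a distance measure (Euclidean here) need to be chosen in advance.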
K-means Clustering: Example
[Figure sequence: original points → original points with initial centres → clusters after iteration 1 → new centres recomputed → clusters and new centres after iterations 2 and 3 → final clusters and centres]
Two different K-means Clusterings
[Figure: the same original points clustered two ways; the left panel shows an optimal clustering, the right panel a sub-optimal clustering (axes x, y)]
Property of K-means
The Sum of Squared Error (SSE) decreases (or stays the same) after each iteration.
The final SSE is not necessarily the global optimum.

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

where m_i is the centroid of cluster C_i.
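This property is easy to check empirically (a sketch assuming NumPy; the helper names are illustrative, not from the lecture): record the SSE at every assignment step and verify that the sequence never increases.

```python
import numpy as np

def sse(points, labels, centroids):
    # SSE = sum over clusters i of sum over x in C_i of dist(m_i, x)^2
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

def kmeans_sse_trace(points, k, n_iter=20, seed=0):
    """Run K-means and record the SSE at every iteration."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    trace = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid per point.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        trace.append(sse(points, labels, centroids))
        # Update step: move each centroid to the mean of its assigned points
        # (empty clusters keep their centroid).
        for i in range(k):
            if (labels == i).any():
                centroids[i] = points[labels == i].mean(axis=0)
    return trace
```

Each assignment step can only lower (or keep) the SSE for the current centroids, and each centroid update minimizes the SSE for the current assignment, so the trace is non-increasing; it can still settle on a local minimum.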
Advantages of K-means
– Efficient
– Can be computed in a distributed way
– Easy to apply
Limitations of K-means
– How to determine the best K?
– May give a sub-optimal solution
K-means has problems when clusters have differing
– Sizes
– Densities
– Non-globular shapes
K-means is sensitive to outliers.
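A common mitigation for the sub-optimal-solution limitation (not covered in the slides; sketched here with an illustrative, self-contained mini K-means) is to restart the algorithm from several random initial centres and keep the run with the lowest SSE:

```python
import numpy as np

def one_kmeans(points, k, seed, n_iter=50):
    """One illustrative K-means run; returns (SSE, labels, centroids)."""
    rng = np.random.default_rng(seed)
    c = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        labels = np.linalg.norm(points[:, None, :] - c[None, :, :],
                                axis=2).argmin(axis=1)
        for i in range(k):
            if (labels == i).any():
                c[i] = points[labels == i].mean(axis=0)
    return float(((points - c[labels]) ** 2).sum()), labels, c

def best_of(points, k, restarts=10):
    """Keep the restart with the lowest SSE to reduce the risk of a bad local minimum."""
    return min((one_kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda run: run[0])
```

Restarts address only the initialization problem; differing sizes, densities, non-globular shapes, and outliers still require other methods (e.g., density-based clustering).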