CSE464 & CSE468 : Image
Processing and Pattern
Recognition
Clustering
by:
Hossam El Din Hassan Abd El Munim
حسام الدين حسن عبد المنعم
Computer & Systems Engineering Dept.,
Ain Shams University,
1 El-Sarayat Street, Abbassia, Cairo 11517
Today
▪ New Topic: Unsupervised Learning
▪ Supervised vs. unsupervised learning
▪ Unsupervised learning
▪ Nonparametric unsupervised learning = clustering
▪ Proximity Measures
▪ Criterion Functions
▪ Flat Clustering
▪ k-means
▪ Hierarchical Clustering
▪ Divisive
▪ Agglomerative
Supervised vs. Unsupervised Learning
▪ Up to now we considered the supervised learning scenario, where we are given
1. samples x1,…, xn
2. class labels for all samples x1,…, xn
▪ This is also called learning with a teacher, since the correct answer (the true class) is provided
▪ In the next few lectures we consider the unsupervised learning scenario, where we are given only
1. samples x1,…, xn
▪ This is also called learning without a teacher, since the correct answer is not provided
▪ We do not split the data into training and test sets
Unsupervised Learning
▪ Data is not labeled
1. Parametric approach (a lot is known: "easier")
▪ assume a parametric distribution of the data
▪ estimate the parameters of this distribution
▪ still much "harder" than the supervised case
2. Nonparametric approach (little is known: "harder")
▪ group the data into clusters; each cluster (hopefully) says something about the categories (classes) present in the data
Why Unsupervised Learning?
▪ Unsupervised learning is harder
▪ How do we know if the results are meaningful? No correct answer (labels) is available.
▪ Let the expert look at the results (external evaluation)
▪ Define an objective function on clustering (internal evaluation)
▪ We nevertheless need it because
1. Labeling large datasets is very costly (speech recognition)
▪ sometimes can label only a few examples by hand
2. May have no idea what/how many classes there are (data
mining)
3. May want to use clustering to gain some insight into the
structure of the data before designing a classifier
▪ Clustering as data description
Clustering
▪ Seek “natural” clusters in the data
▪ What is a good clustering?
▪ internal (within the cluster) distances should be small
▪ external (between-cluster) distances should be large
▪ Clustering is a way to discover new
categories (classes)
What we Need for Clustering
1. Proximity measure, either
▪ similarity measure s(xi,xk): large if xi,xk are similar
▪ dissimilarity(or distance) measure d(xi,xk): small if xi,xk are similar
2. Criterion function to evaluate a clustering
3. Algorithm to compute clustering
▪ For example, by optimizing the criterion function
How Many Clusters?
3 clusters or 2 clusters?
▪ Possible approaches
1. fix the number of clusters to k
2. find the best clustering according to the criterion
function (number of clusters may vary)
Proximity Measures
▪ A good proximity measure is VERY application dependent
▪ Clusters should be invariant under the transformations "natural" to the problem
▪ For example, for object recognition the measure should be invariant to rotation (a rotated object should be at distance 0 from the original)
▪ For character recognition, there should be no invariance to rotation (a rotated '9' would become a '6')
Distance (dissimilarity) Measures
▪ Euclidean distance
$d(\mathbf{x}_i,\mathbf{x}_j)=\left(\sum_{k=1}^{d}(x_{ik}-x_{jk})^2\right)^{1/2}$
▪ translation invariant
▪ Manhattan (city block) distance
$d(\mathbf{x}_i,\mathbf{x}_j)=\sum_{k=1}^{d}|x_{ik}-x_{jk}|$
▪ approximation to Euclidean distance,
cheaper to compute
▪ Chebyshev distance
$d(\mathbf{x}_i,\mathbf{x}_j)=\max_{1\le k\le d}|x_{ik}-x_{jk}|$
▪ approximation to Euclidean distance,
cheapest to compute
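As a quick illustration (not part of the slides), a minimal NumPy sketch of the three distance measures; the function names are our own:

```python
import numpy as np

def euclidean(xi, xj):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    # sum of absolute coordinate differences (city block)
    return np.sum(np.abs(xi - xj))

def chebyshev(xi, xj):
    # largest absolute coordinate difference
    return np.max(np.abs(xi - xj))

xi, xj = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(xi, xj), manhattan(xi, xj), chebyshev(xi, xj))  # 5.0 7.0 4.0
```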
Similarity Measures
▪ Cosine similarity:
$s(\mathbf{x}_i,\mathbf{x}_j)=\dfrac{\mathbf{x}_i^T\mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}$
▪ the smaller the angle, the larger the
similarity
▪ scale invariant measure
▪ popular in text retrieval
▪ Correlation coefficient
▪ popular in image processing
$s(\mathbf{x}_i,\mathbf{x}_j)=\dfrac{\sum_{k=1}^{d}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\left[\sum_{k=1}^{d}(x_{ik}-\bar{x}_i)^2\;\sum_{k=1}^{d}(x_{jk}-\bar{x}_j)^2\right]^{1/2}}$
where $\bar{x}_i$ and $\bar{x}_j$ are the means of the components of $\mathbf{x}_i$ and $\mathbf{x}_j$
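A minimal NumPy sketch of both similarity measures (illustrative, not from the slides; function names are our own):

```python
import numpy as np

def cosine_similarity(xi, xj):
    # dot product normalized by the vector lengths; scale invariant
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

def correlation_coefficient(xi, xj):
    # center each vector by its own mean, then take the cosine of the centered vectors
    ci, cj = xi - xi.mean(), xj - xj.mean()
    return ci @ cj / np.sqrt(np.sum(ci ** 2) * np.sum(cj ** 2))
```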
Feature Scale
Simplest Clustering Algorithm
• Two issues
1. How to measure similarity between samples?
2. How to evaluate a partitioning?
• If distance is a good measure of dissimilarity, the distance between samples in the same cluster must be smaller than the distance between samples in different clusters
• Two samples belong to the same cluster if the distance between them is less than a threshold d0 (a sketch of this rule follows below)
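As an illustration of this rule (our own sketch; the function name and the connected-components interpretation are assumptions): link any two samples closer than d0, and let each linked group be one cluster.

```python
import numpy as np

def threshold_clustering(X, d0):
    """Group samples whose pairwise distance is below d0 (single-linkage style)."""
    n = len(X)
    labels = -np.ones(n, dtype=int)   # -1 means "not yet assigned"
    cluster = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        # start a new cluster at sample i and grow it by adding close samples
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            dists = np.linalg.norm(X - X[p], axis=1)
            for q in np.where((dists < d0) & (labels < 0))[0]:
                labels[q] = cluster
                stack.append(q)
        cluster += 1
    return labels
```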
Scaling Axis
Criterion Functions for Clustering
▪ Have samples x1,…,xn
▪ Suppose partitioned samples into c subsets D1,…,Dc
▪ There are approximately $c^n/c!$ distinct partitions
▪ Can define a criterion function J(D1,…,Dc) which
measures the quality of a partitioning D1,…,Dc
▪ Then the clustering problem becomes a well-defined problem
▪ the optimal clustering is the partition which optimizes the
criterion function
SSE Criterion Function
▪ Let $n_i$ be the number of samples in $D_i$, and define the mean of the samples in $D_i$ as
$\boldsymbol{\mu}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}$
▪ Then the sum-of-squared-errors criterion function (to minimize) is:
$J_{SSE} = \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$
▪ Note that the number of clusters, c, is fixed
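A small illustrative helper (our own, not from the slides) that evaluates $J_{SSE}$ for a given partition, with the clusters described by a label vector:

```python
import numpy as np

def sse(X, labels):
    """Sum of squared distances of each sample to its cluster mean (J_SSE)."""
    total = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu = cluster.mean(axis=0)              # cluster mean
        total += np.sum((cluster - mu) ** 2)   # squared distances to the mean
    return total
```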
SSE Criterion Function
$J_{SSE} = \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$
▪ SSE criterion appropriate when data forms compact
clouds that are relatively well separated
▪ SSE criterion favors equally sized clusters, and may
not be appropriate when “natural” groupings have
very different sizes
K-means Clustering
▪ We now consider an example of an iterative optimization algorithm for the special case of the $J_{SSE}$ objective function
$J_{SSE} = \sum_{i=1}^{k}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$
▪ for a different objective function, we need a different optimization algorithm, of course
▪ Fix number of clusters to k (c = k)
▪ k-means is probably the most famous clustering algorithm
▪ it has a smart way of moving from current partitioning to the next one
K-means Clustering (example with k = 3)
1. Initialize
▪ pick k cluster centers arbitrarily
▪ assign each example to the closest center
2. Compute the sample mean of each cluster
3. Reassign all samples to the closest mean
4. If the cluster assignments changed in step 3, go to step 2
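For concreteness, a compact NumPy sketch of these four steps (our own illustration; initializing the centers with k random samples is an assumption, not something the slide prescribes):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k cluster centers arbitrarily (here: k distinct random samples)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # steps 1/3: assign every sample to the closest center (mean)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 4: stop when no assignment changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 2: recompute the sample mean of each cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# usage: three Gaussian blobs in 2-D
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, k=3)
```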
K-means Clustering (Means Derivation)
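A brief sketch of the derivation (standard calculus, not reproduced verbatim from the slide): for a fixed assignment of samples to clusters, setting the gradient of $J_{SSE}$ with respect to each center to zero shows that the optimal center is the cluster mean,

$\dfrac{\partial}{\partial\boldsymbol{\mu}_i}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 \;=\; -2\sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i) \;=\; 0 \;\;\Longrightarrow\;\; \boldsymbol{\mu}_i=\dfrac{1}{n_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}$

which is exactly the sample-mean update used in step 2 of the algorithm.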
K-means Clustering
▪ In practice, k-means clustering usually performs well
▪ It is very efficient
▪ Its solution can be used as a starting point for other clustering algorithms
▪ Hundreds of papers on variants and improvements of k-means clustering are still published every year
Clustering Criteria Again
• Scatter criteria
• Scatter matrices used in multiple discriminant analysis,
i.e., the within-scatter matrix SW and the between-
scatter matrix SB
$S_T = S_B + S_W$
• Note:
• ST does not depend on partitioning
• In contrast, SB and SW depend on partitioning
• Two approaches:
• minimize the within-cluster scatter
• maximize the between-cluster scatter
Pattern Classification, Chapter 10
The trace (sum of diagonal elements) is the
simplest scalar measure of the scatter matrix
$\operatorname{tr}[S_W] = \sum_{i=1}^{c}\operatorname{tr}[S_i] = \sum_{i=1}^{c}\sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^2 = J_e$
• proportional to the sum of the variances in the
coordinate directions
• This is the sum-of-squared-error criterion, Jsse.
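To make the connection concrete, a small NumPy sketch (our own) that computes $S_W$ and $S_B$ for a labeled partition; one can check numerically that $\operatorname{tr}[S_W]$ equals $J_e$ and that $S_W + S_B$ equals the total scatter $S_T$:

```python
import numpy as np

def scatter_matrices(X, labels):
    m = X.mean(axis=0)                      # total mean vector
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)                # cluster mean
        SW += (Xc - mc).T @ (Xc - mc)       # within-cluster scatter
        SB += len(Xc) * np.outer(mc - m, mc - m)  # between-cluster scatter
    return SW, SB
```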
• As tr[ST] = tr[SW] + tr[SB] and tr[ST] is independent of the partitioning, tr[SB] yields no new criterion: minimizing tr[SW] is the same as maximizing tr[SB]
• In particular, minimizing the within-cluster criterion Jsse = tr[SW] is equivalent to maximizing the between-cluster criterion
$\operatorname{tr}[S_B] = \sum_{i=1}^{c} n_i\,\|\mathbf{m}_i - \mathbf{m}\|^2$
where m is the total mean vector:
$\mathbf{m} = \frac{1}{n}\sum_{\mathbf{x}\in D}\mathbf{x} = \frac{1}{n}\sum_{i=1}^{c} n_i\,\mathbf{m}_i$
Iterative optimization
• Clustering is a discrete optimization problem
• A finite data set has only a finite number of partitions
• What is the cost of exhaustive search? Approximately $c^n/c!$ partitions for c clusters (for c = 5 and n = 100 this is already more than $10^{67}$). Not a good idea
• Typically iterative optimization is used:
• start from a reasonable initial partition
• redistribute samples so as to minimize the criterion function
• this guarantees local, not global, optimization
• consider an iterative procedure to minimize the
sum-of-squared-error criterion Je
$J_e = \sum_{i=1}^{c} J_i, \qquad \text{where } J_i = \sum_{\mathbf{x}\in D_i}\|\mathbf{x}-\mathbf{m}_i\|^2$
is the effective error per cluster.
• Moving a sample $\hat{\mathbf{x}}$ from cluster Di to Dj changes the errors in the two clusters by:
$J_j^* = J_j + \frac{n_j}{n_j+1}\,\|\hat{\mathbf{x}}-\mathbf{m}_j\|^2, \qquad J_i^* = J_i - \frac{n_i}{n_i-1}\,\|\hat{\mathbf{x}}-\mathbf{m}_i\|^2$
• Hence, the transfer is advantageous if the decrease in Ji is larger than the increase in Jj, i.e., if
$\frac{n_i}{n_i-1}\,\|\hat{\mathbf{x}}-\mathbf{m}_i\|^2 > \frac{n_j}{n_j+1}\,\|\hat{\mathbf{x}}-\mathbf{m}_j\|^2$
• Alg. 3 is a sequential version of the k-means algorithm
• Alg. 3 updates the means each time a sample is reclassified
• k-means waits until all n samples have been reclassified before updating
• Alg. 3 can get trapped in local minima
• The result depends on the order of the samples
• However, it is at least a stepwise optimal procedure, and it can easily be modified for problems in which samples are acquired sequentially and clustering must be done on-line
• Alg. 2 is fuzzy k-means clustering
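The slides reference Algorithms 2 and 3 from the textbook without listing them here; as an illustration only, a hedged sketch of a sequential (online) pass in the spirit described above, using the transfer test from the previous slide (all names and the in-place update scheme are our own choices, not the textbook's exact Algorithm 3):

```python
import numpy as np

def sequential_kmeans_pass(X, labels, means, counts):
    """One pass over the data, moving each sample if the transfer lowers Je.

    labels (int array), means (float array, shape k x d) and counts (int array)
    describe the current partition and are updated in place.
    """
    for idx, x in enumerate(X):
        i = labels[idx]
        if counts[i] <= 1:
            continue  # never empty a cluster
        # decrease in Ji if x leaves cluster i
        gain = counts[i] / (counts[i] - 1) * np.sum((x - means[i]) ** 2)
        # increase in Jj for every other candidate cluster j
        costs = np.array([counts[j] / (counts[j] + 1) * np.sum((x - means[j]) ** 2)
                          for j in range(len(means))])
        costs[i] = np.inf
        j = costs.argmin()
        if costs[j] < gain:
            # move x from cluster i to cluster j, updating the means incrementally
            means[i] = (counts[i] * means[i] - x) / (counts[i] - 1)
            means[j] = (counts[j] * means[j] + x) / (counts[j] + 1)
            counts[i] -= 1
            counts[j] += 1
            labels[idx] = j
```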
Selecting the Number of
Clusters (Elbow Method)
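The slide itself shows only a plot of the criterion versus k; as a hedged sketch of the usual procedure, run k-means for a range of k, plot $J_{SSE}(k)$, and choose k at the bend ("elbow") of the curve. The example below reuses the kmeans and sse helpers defined earlier (our own names):

```python
import matplotlib.pyplot as plt

# X: data matrix of shape (n, d), e.g., the toy blobs from the k-means sketch above
ks = range(1, 11)
errors = []
for k in ks:
    labels, centers = kmeans(X, k)
    errors.append(sse(X, labels))

plt.plot(list(ks), errors, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('J_SSE')
plt.show()  # choose k at the "elbow", where adding clusters stops paying off
```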
Hierarchical Clustering
• Many times, clusters are not disjoint, but a cluster
may have subclusters, in turn having sub-
subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters
• The first is a partition into n clusters, each one containing exactly one sample
• The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th, in which there is only one cluster containing all of the samples
• At level k in the sequence, c = n-k+1
• Given any two samples x and x’, they will be grouped
together at some level, and if they are grouped a level k,
they remain grouped for all higher levels
• Hierarchical clustering tree representation called
dendrogram
• Are groupings natural or forced? Check the similarity values
• If the similarity values are evenly distributed across levels, there is no justification for grouping
• Another representation is based on sets, e.g., Venn diagrams
• Hierarchical clustering can be divided into agglomerative and divisive approaches
• Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters
• Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters
Agglomerative hierarchical clustering
• The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than the mean or a representative vector of each cluster
• At any level, the distance between nearest clusters
can provide the dissimilarity value for that level
• To find the nearest clusters, one can use
$d_{\min}(D_i,D_j) = \min_{\mathbf{x}\in D_i,\,\mathbf{x}'\in D_j}\|\mathbf{x}-\mathbf{x}'\|$
$d_{\max}(D_i,D_j) = \max_{\mathbf{x}\in D_i,\,\mathbf{x}'\in D_j}\|\mathbf{x}-\mathbf{x}'\|$
$d_{avg}(D_i,D_j) = \frac{1}{n_i n_j}\sum_{\mathbf{x}\in D_i}\sum_{\mathbf{x}'\in D_j}\|\mathbf{x}-\mathbf{x}'\|$
$d_{mean}(D_i,D_j) = \|\mathbf{m}_i-\mathbf{m}_j\|$
which behave quite similarly if the clusters are hyperspherical and well separated.
• The computational complexity is $O(cn^2d^2)$, with $n \gg c$
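A minimal NumPy sketch of the four inter-cluster distances (illustrative; the function names are our own):

```python
import numpy as np

def pairwise(Di, Dj):
    # all pairwise Euclidean distances between the two clusters
    return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)

def d_min(Di, Dj):  return pairwise(Di, Dj).min()    # single linkage
def d_max(Di, Dj):  return pairwise(Di, Dj).max()    # complete linkage
def d_avg(Di, Dj):  return pairwise(Di, Dj).mean()   # average linkage
def d_mean(Di, Dj): return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))
```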
• Nearest-neighbor algorithm (single linkage)
• dmin is used as the distance measure in the agglomerative procedure
• The farthest-neighbor algorithm (complete linkage)
• dmax is used as the distance measure in the agglomerative procedure
• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters
• All procedures involving minima or maxima are sensitive to outliers; the use of dmean or davg is a natural compromise
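As a practical pointer (not from the slides), SciPy provides these linkage strategies directly; a minimal usage sketch, assuming a sample matrix X of shape (n, d):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.randn(30, 2)                        # toy data: n = 30 samples, d = 2
Z = linkage(X, method='single')                   # 'complete', 'average', 'centroid' also available
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
dendrogram(Z)                                     # tree representation of the hierarchy
plt.show()
```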
Applications of Clustering
▪ Image segmentation
▪ Find interesting "objects" in images to focus attention on
From: Image Segmentation by Nested Cuts, O. Veksler, CVPR2000
Applications of Clustering
▪ Image Database Organization
▪ for efficient search
Applications of Clustering
▪ Data Mining
▪ Technology watch
▪ The Derwent Database contains all patents filed worldwide in the last 10 years
▪ Searching by keywords leads to thousands of
documents
▪ Find clusters in the database to see if there are any emerging technologies and what the competition is up to
▪ Marketing
▪ Customer database
▪ Find clusters of customers and tailor marketing
schemes to them
Applications of Clustering
▪ Profiling Web Users
▪ Use web access logs to generate a feature vector for
each user
▪ Cluster users based on their feature vectors
▪ Identify common goals for users
▪ Shopping
▪ Job Seekers
▪ Product Seekers
▪ Tutorials Seekers
▪ Can use clustering results to improve web content and design
Summary
▪ Clustering (nonparametric unsupervised learning)
is useful for discovering inherent structure in data
▪ Clustering is immensely useful in different fields
▪ Clustering comes naturally to humans (in up to 3
dimensions), but not so to computers
▪ It is very easy to design a clustering algorithm, but
it is very hard to say if it does anything good
▪ General-purpose clustering does not exist; for best results, clustering should be tuned to the application at hand