KEMBAR78
U L D R: Nsupervised Earning and Imensionality Eduction | PDF | Cluster Analysis | Principal Component Analysis
0% found this document useful (0 votes)
162 views58 pages

U L D R: Nsupervised Earning and Imensionality Eduction

This document discusses unsupervised learning techniques including clustering and dimensionality reduction. It introduces k-means clustering, describing the algorithm, initialization of centroids, and choosing the number of clusters k. It also covers principal component analysis (PCA) as a method for dimensionality reduction.

Uploaded by

SanaullahSunny
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
162 views58 pages

U L D R: Nsupervised Earning and Imensionality Eduction

This document discusses unsupervised learning techniques including clustering and dimensionality reduction. It introduces k-means clustering, describing the algorithm, initialization of centroids, and choosing the number of clusters k. It also covers principal component analysis (PCA) as a method for dimensionality reduction.

Uploaded by

SanaullahSunny
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 58

UNSUPERVISED LEARNING AND

DIMENSIONALITY REDUCTION

DR. M M MANJURUL ISLAM


Contents
1. Unsupervised Learning (Clustering Problem)
2. K-means clustering algorithm
3. Centroid initialization and choosing the number of K
4. Dimensionality Reduction
5. Principal Component Analysis
1
Unsupervised Learning
Introduction

4
1 Unsupervised Learning (Clustering Problem)

❖ Introduction into unsupervised learning


• Supervised learning example: Classification Problem

x1

x2

(1) (1) (2) (2) (3) (3) ( m)


Given a training set: {( x , y ),( x , y ),( x , y ),...,( x , y( m) )}
5
1 Unsupervised Learning (Clustering Problem)

❖ Introduction into unsupervised learning


• Unsupervised learning example: Clustering Problem

x1
Cluster 1

Cluster 2

x2

(1) (2) (3) ( m)


Given a training set: {x , x , x ,..., x }
6
1 Unsupervised Learning (Clustering Problem)

❖ Introduction into unsupervised learning


• Applications of clustering

Market Segmentation Organize computing clusters

Astronomical data analysis 7


2
K-means Clustering
Algorithm
2 K-means clustering algorithm

❖ K-means algorithm
• Inputs:
─ K (number of clusters)

─ Training set {𝑥 (1) , 𝑥 (2) , 𝑥 (3) , . . . , 𝑥 (𝑚) }, 𝑥 (𝑖) ∈ ℝ𝑛

• Algorithm flow:
𝑛
Randomly initialize K cluster centroids 𝜇1 , 𝜇2 , . . . , 𝜇𝐾 ∈ ℝ
Repeat {
for i = 1 to m
c (i ) := index (from 1 to K) of cluster centroid closest to x (i )
for k = 1 to K
k := average (mean) of points assigned to cluster k
}

9
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

10
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

11
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

12
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

13
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

14
2 K-means clustering algorithm

❖ K-means algorithm
• Execution of algorithm

15
2 K-means clustering algorithm

❖ K-means algorithm objective function

c (i ) =index of cluster (1,2,…,K) to which example x (i ) is currently assigned


k =cluster centroid k ( k  n
)
c (i ) =cluster centroid to which example x (i ) has been assigned

1 m (i )
, 1 ,...,  K ) =  x − c( i )
2
(1) (m)
J (c ,..., c
m i =1

min
(1) (m)
J ( c (1)
,..., c (m)
, 1 ,...,  K )
c ,..., c
1 ,...,  K

16
Strengths of k-means

• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are small. k-means is considered a
linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that: it terminates at a local optimum if SSE is
used. The global optimum is hard to find due to
complexity.

17
Weaknesses of k-means
• The algorithm is only applicable if the mean
is defined.
– For categorical data, k-mode - the centroid is
represented by most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different
values.
18
Weaknesses of k-means: Problems with outliers

19
Weaknesses of k-means: To deal with outliers

• One method is to remove some data points in the


clustering process that are much further away from
the centroids than other data points.
– To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
• Another method is to perform random sampling.
Since in sampling we only choose a small subset of
the data points, the chance of selecting an outlier is
very small.
– Assign the rest of the data points to the clusters by
distance or similarity comparison, or classification

20
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

21
Weaknesses of k-means (cont …)

• If we use different seeds: good results


There are some met
hods to help choo
se good seeds

22
Weaknesses of k-means (cont …)

• The k-means algorithm is not suitable for discoveri


ng clusters that are not hyper-ellipsoids (or hyper-
spheres).

23
Common ways to represent clusters

• Use the centroid of each cluster to


represent the cluster.
– compute the radius and
– standard deviation of the cluster to
determine its spread in each dimension

– The centroid representation alone works well


if the clusters are of the hyper-spherical
shape.
– If clusters are elongated or are of other
shapes, centroids are not sufficient
24
3
Centroid initialization and
Choosing the number of K
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Number of clusters K<m
• Randomly pick K training examples
• Set 1 ,...,  K equal to these K examples

26
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Problem

Ideal case of random initialization Bad examples of random initialization

27
3 Centroid initialization and choosing the number of K

❖ Random initialization of centroids for K-means algorithm


• Solution

For i =1 to 100 {
Randomly initialize K-means.
Run K-means. Get c (1) ,..., c( m ) , 1 ,...,  K
Compute cost function:
J (c (1) ,..., c( m ) , 1 ,...,  K )
}

Pick clustering with the lowest cost J (c (1) ,..., c( m ) , 1 ,...,  K )

28
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters


• What is the right number of clusters? How to choose it?

29
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters K


• Elbow method

30
3 Centroid initialization and choosing the number of K

❖ Choosing the number of clusters K


• Evaluate K-means based on a metric for how well it performs for the particular
purpose

Rubbing Increasing
Feature 1 and Feature 2
construct Health Index
for rubbing prognosis Feature 1

- No rubbing
- Slight rubbing
- Intensive rubbing

Feature 2
31
4
Dimensionality Reduction
4 Dimensionality Reduction

❖ Motivation. Data Compression and Visualization.


• Reduce data from 2D to 1D
Mapping
x2
𝑥 (1) ∈ ℝ2 → 𝑧 (1) ∈ ℝ

𝑥 (2) ∈ ℝ2 → 𝑧 (2) ∈ ℝ

(1) 𝑥 (𝑚) ∈ ℝ2 → 𝑧 (𝑚) ∈ ℝ


x

x (2)

(1) (2) x1
z z

33
EMBEDDED
4 Dimensionality Reduction SYSTEM
LABORATORY

❖ Motivation. Data Compression and Visualization.


• Reduce data from 3D to 2D

z2

x3

x1 x2
z1
x (i )  3
z (i )  2

34
5
Principal Component
Analysis
Principal Component Analysis

36
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) problem formulation


x2

x1

Reduce from n-dimension to k-dimension: Find k vectors u (1) , u (2) ,..., u ( k ) onto which
project the data, so as to minimize the projection error. 37
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm


• Data Preprocessing

Given a training data set: x , x ,..., x


(1) (2) ( m)

Preprocessing (feature scaling/mean normalization):


1 m (i )
 j =  xj
m i =1
Replace each x j with x j −  j .
(i )

If different features on different scales, scale features to have comparable


range of values:
x j (i ) −  j
x j (i ) =
max(x j ) − min(x j )

38
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm

Reduce data from n-dimensions to k-dimensions


Compute ‘covariance matrix:

1 n
 =  ( x ( i ) )( x ( i ) )T
m i =1

Compute “eigenvectors” of covariance matrix:

[U,S,V] = svd(Sigma);

Here, we obtain:
| | |
𝑈 = 𝑢(1) 𝑢(2) 𝑢(𝑛) ∈ ℝ𝑛×𝑛
| | |

References: J.Shlens, ‘A Tutorial on Principal Component Analysis: Derivation, Discussion and


Singular Value Decomposition’, 2003 39
5 Principal Component Analysis

❖ Principal Component Analysis (PCA) algorithm


• Linear transformation of the original data

On previous step, we obtained eigenvectors from singular value decomposition of


covariance matrix:
| | |
𝑈 = 𝑢(1) 𝑢(2) 𝑢 (𝑛) ∈ ℝ𝑛×𝑛
| | |

k
Now, we can transform our original data with it is own feature dimension into new
feature space, based on principal components (𝑥 ∈ ℝ𝑛 → 𝑧 ∈ ℝ𝑘 ) :

 | | |   − (u (1) ) −
 
z (i ) = u (1) u (2) u ( k )  x ( i ) =  − (u (2) ) −  x(i )
 | | |   − (u ( k ) ) −  nx1

nxk kxn
Ureduce
kx1 40
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• MATLAB implementation

After mean normalization (ensure every feature has zero mean) and optionally
feature scaling:
1 n
 =  ( x ( i ) )( x ( i ) )T
m i =1
[U,S,V] = svd(Sigma);

Ureduce = U(:,1:k);

Z = Ureduce’*x;

*Notice:
In MATLAB there are new functions, which you can use directly to obtain principal
components for your data without the steps mentioned above.
Please, have a look on functions called: pca (if you apply PCA directly to data) and pcacov
(if you want to extract principal components from covariance matrix). The results should be
similar. 41
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• Reconstruction of data from its compressed representation

x2 x2

(1) x1 (1)
xapprox x1
x
x (2) (2)
xapprox
z = U  reduce x 𝑧 ∈ ℝ → 𝑥 ∈ ℝ2
z (1) (𝑖)
𝑥𝑎𝑝𝑝𝑟𝑜𝑥 = 𝑈𝑟𝑒𝑑𝑢𝑐𝑒 𝑧 (𝑖)
z1 nxk kx1

nx1 42
5 Principal Component Analysis

❖ Choosing the number of principal components

1 m

2
Average squared projection error: x ( i ) − xapprox
(i )

m i =1

1 m
Total variation in the data:  (i ) 2
x
m i =1

Typically, choose k to be smallest value so that:

1

m 2
i =1
x (i )
− x (i )
approx
m  0.01
1
 i =1 (i ) 2
m
x
m

“99% of variance is retained”

43
5 Principal Component Analysis

❖ Choosing the number of principal components

Two ways to check the retained variance of data:


[U,S,V] = svd(Sigma)
Algorithm:
 S11 ... ... ... ... 
Try PCA with k= 1  ... S 22 ... ... ... 
Compute: 
S =  ... ... S33 ... ... 
U reduce , z (1) , z (2) ,..., z ( m ) , xapprox
(1) ( m)
,..., xapprox  
 ... ... ... ... ... 
 ... ... ... ... S nn 
Check if: For given k:
1

m 2


(i ) (i )
i =1
x x approx
k
Sii
m  0.01 1− i =1
 0.01

n
1
 i =1 x (i ) 2
m
i =1
Sii
m
If ‘yes’ – stop algorithm, if ‘no’ –change Pick smallest value of k for which “99%
the value k+=1 and run algorithm again of variance retained
44
5 Principal Component Analysis

❖ Advices for applying PCA


• Supervised learning speedup

( x (1) , y (1) ), ( x (2) , y (2) ),..., ( x ( m ) , y ( m ) )


Extract inputs:
Unlabelled dataset: 𝑥 (1) , 𝑥 (2) , . . . , 𝑥 (𝑚) ∈ ℝ10000
PCA
𝑧 (1) , 𝑧 (2) , . . . , 𝑧 (m) ∈ ℝ100

New training set:


( z (1) , y (1) ), ( z (2) , y (2) ),..., ( z ( m ) , y ( m ) )
Note: Mapping x ( i ) → z (i) should be defined by running PCA only on the training set.
This mapping can be applied as well to the examples xcv (i )
and xtest
(i )
in the cross valida
tion and test sets.

45
5 Principal Component Analysis

❖ Advices for applying PCA


• Compression
─ Reduce memory/disk needed to store data
─ Speed up learning algorithm

• Visualisation

• Before implementing PCA, first try running whatever you want to do with the
original/raw data. Only if that does not do what you want, then implement PCA
and consider using another feature space dimension.

46
Principal Component Analysis

47
How to choose a clustering algorithm

• Clustering research has a long history. A vast colle


ction of algorithms are available.
– We only introduced several main algorithms.
• Choosing the “best” algorithm is a challenge.
– Every algorithm has limitations and works well with certa
in data distributions.
– It is very hard, if not impossible, to know what distributi
on the application data follow. The data may not fully fo
llow any “ideal” structure or distribution required by the
algorithms.
– One also needs to decide how to standardize the data, t
o choose a suitable distance function and to select other
parameter values.

48
Choose a clustering algorithm (cont …)

• Due to these complexities, the common practice is


to
– run several algorithms using different distance functions
and parameter settings, and
– then carefully analyze and compare the results.
• The interpretation of the results must be based on
insight into the meaning of the original data toget
her with knowledge of the algorithms used.
• Clustering is highly application dependent and to
certain extent subjective (personal preferences).

49
Cluster Evaluation: hard problem

• The quality of a clustering is very hard to


evaluate because
– We do not know the correct clusters
• Some methods are used:
– User inspection
• Study centroids, and spreads
• Rules from a decision tree.
• For text documents, one can read some documents
in clusters.

50
Cluster evaluation: ground truth

• We use some labeled data (for classification)


• Assumption: Each class is a cluster.
• After clustering, a confusion matrix is constructed. From the
matrix, we compute various measurements, entropy, purity,
precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The
clustering method produces k clusters, which divides D
into k disjoint subsets, D1, D2, …, Dk.

51
Evaluation measures: Entropy

52
Evaluation measures: purity

53
A remark about ground truth evaluation

• Commonly used to compare different clustering


algorithms.
• A real-life data set for clustering has no class
labels.
– Thus although an algorithm may perform very well on
some labeled data sets, no guarantee that it will perform
well on the actual application data at hand.
• The fact that it performs well on some label data
sets does give us some confidence of the quality
of the algorithm.
• This evaluation method is said to be based on
external data or information.
54
Evaluation based on internal information

• Intra-cluster cohesion (compactness):


– Cohesion measures how near the data points in a
cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used
measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids
should be far away from one another.
• In most applications, expert judgments are still the key.

55
Indirect evaluation

• In some applications, clustering is not the primary


task, but used to help perform another task.
• We can use the performance on the primary task
to compare clustering methods.
• For instance, in an application, the primary task is
to provide recommendations on book purchasing
to online shoppers.
– If we can cluster books according to their features, we
might be able to provide better recommendations.
– We can evaluate different clustering algorithms based on
how well they help with the recommendation task.
– Here, we assume that the recommendation can be
reliably evaluated.

56
Summary

• Clustering is has along history and still active


– There are a huge number of clustering algorithms
– More are still coming every year.
• We only introduced several main algorithms. There
are many others, e.g.,
– density based algorithm, sub-space clustering, scale-up
methods, neural networks based methods, fuzzy clustering,
co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application dependent and to
some extent subjective.

57
58

You might also like