Unsupervised Learning

By
Dr. Abhilasha Chaudhuri
Assistant Professor
Department of Computer Science and Engineering,
Sardar Vallabhbhai National Institute of Technology, Surat
What is unsupervised learning?
■ Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
■ Typical applications
■ As a stand-alone tool to get insight into the data distribution
■ As a preprocessing step for other algorithms
■ Cluster: a collection of data objects
■ similar (or related) to one another within the same group
■ dissimilar (or unrelated) to the objects in other groups
■ Cluster analysis (or clustering, data segmentation, …)
■ Finding similarities between data according to the characteristics found
in the data, and grouping similar data objects into clusters
Applications of cluster analysis
• Outlier detection: Outliers are often viewed as those “far away”
from any cluster
• Hypothesis generation and testing
■ Data reduction
■ Summarization: preprocessing for regression, PCA, classification, and association analysis
■ Compression: image processing (vector quantization)
■ Information retrieval: document clustering


• Land use: Identification of areas of similar land use in an earth
observation database
Applications of cluster analysis
■ Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
■ City-planning: Identifying groups of houses according to their
house type, value, and geographical location
■ Earth-quake studies: Observed earth quake epicenters should
be clustered along continent faults
■ Climate: understanding the earth's climate; finding patterns in
atmospheric and ocean data
Clustering is an optimization problem
• variability(c) = Σ_{e ∈ c} distance(mean(c), e)²
• dissimilarity(C) = Σ_{c ∈ C} variability(c)
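A minimal NumPy sketch of these two quantities, assuming each cluster is given as an array of points (the function names follow the slide; the example data is illustrative):

import numpy as np

def variability(cluster):
    # sum of squared distances of each point e in the cluster to the cluster mean
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

def dissimilarity(clusters):
    # total objective: sum of the variabilities of all clusters in the clustering C
    return sum(variability(c) for c in clusters)

# Example: two small 2-D clusters
C = [np.array([[1.0, 2.0], [2.0, 1.0]]),
     np.array([[8.0, 9.0], [9.0, 8.0], [10.0, 10.0]])]
print(dissimilarity(C))

A "good" clustering is one with low total dissimilarity for a given number of clusters.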
Major Clustering Approaches
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects) using some criterion
■ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
■ Density-based approach:
■ Based on connectivity and density functions
■ Typical methods: DBSCAN, OPTICS, DenClue
Hierarchical Clustering
1. Start by assigning each item to a cluster, so that if you have N items,
you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into
a single cluster
3. Continue the process until all items are clustered into a single
cluster of size N.

How will we measure closeness / distance between two clusters?
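A minimal Python sketch of this bottom-up (agglomerative) procedure, assuming a cluster_distance function that implements one of the linkage metrics discussed below (all names here are illustrative):

def agglomerative_clustering(points, cluster_distance, target_k=1):
    # Step 1: every item starts in its own cluster
    clusters = [[p] for p in points]
    # Steps 2-3: repeatedly merge the closest pair until target_k clusters remain
    while len(clusters) > target_k:
        (i, j) = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters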


Dendrogram: Shows How Clusters are Merged

Decomposes the data objects into several levels of nested partitionings
(a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at
the desired level; each connected component then forms a cluster.
Linkage Metrics
• Single-linkage: the distance between two clusters is the shortest distance
from any member of one cluster to any member of the other cluster
• Complete-linkage: the distance between two clusters is the greatest distance
from any member of one cluster to any member of the other cluster
• Average-linkage: the distance between two clusters is the average distance
from any member of one cluster to any member of the other cluster
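Each of these definitions can be written as a small function that plugs into the agglomerative sketch above (assuming points are coordinate tuples and a Euclidean point-to-point distance; illustrative only):

import math

def d(p, q):                   # Euclidean distance between two points
    return math.dist(p, q)

def single_linkage(c1, c2):    # shortest distance over all pairs of members
    return min(d(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):  # greatest distance over all pairs of members
    return max(d(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):   # average distance over all pairs of members
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))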
Example of Hierarchical Clustering
• Suppose we have the aerial distance between some pair of cities
SURAT PUNE MUMBAI RAIPUR UDAIPUR NAGPUR
SURAT 0 206 963 1949 3095 2979
PUNE 0 802 1771 2934 2815
MUMBAI 0 966 2142 2013
RAIPUR 0 1235 1307
UDAIPUR 0 808
NAGPUR 0

{SURAT} {PUNE} {MUMBAI} {RAIPUR} {UDAIPUR} {NAGPUR}


{SURAT, PUNE} {MUMBAI} {RAIPUR} {UDAIPUR} {NAGPUR}
{SURAT,PUNE, MUMBAI} {RAIPUR} {UDAIPUR} {NAGPUR}
{SURAT,PUNE, MUMBAI} {RAIPUR} {UDAIPUR, NAGPUR}

{SURAT, PUNE, MUMBAI, RAIPUR} {UDAIPUR, NAGPUR}   (single linkage)
or
{SURAT, PUNE, MUMBAI} {RAIPUR, UDAIPUR, NAGPUR}   (complete linkage)
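The same merge sequence can be checked with SciPy's hierarchical-clustering routines by feeding them the distance table above (a sketch; linkage and dendrogram are standard SciPy functions, the rest is illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

cities = ["SURAT", "PUNE", "MUMBAI", "RAIPUR", "UDAIPUR", "NAGPUR"]
D = np.array([
    [0,    206,  963,  1949, 3095, 2979],
    [206,  0,    802,  1771, 2934, 2815],
    [963,  802,  0,    966,  2142, 2013],
    [1949, 1771, 966,  0,    1235, 1307],
    [3095, 2934, 2142, 1235, 0,    808],
    [2979, 2815, 2013, 1307, 808,  0],
])
# linkage() expects a condensed distance matrix; method can be "single", "complete" or "average"
Z_single = linkage(squareform(D), method="single")
Z_complete = linkage(squareform(D), method="complete")
print(Z_single)                       # each row records one merge and its distance
# dendrogram(Z_single, labels=cities) would draw the tree described earlier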
K-means Algorithm
• Randomly choose k examples as the initial centroids
• While true:
• Create k clusters by assigning each example to the closest centroid
• Compute k new centroids by averaging examples in each cluster
• If the centroids don’t change:
• Break
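A compact NumPy version of this loop (a sketch, not a production implementation; ties and empty clusters are ignored for simplicity):

import numpy as np

def k_means(points, k, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # randomly choose k examples as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # create k clusters by assigning each example to the closest centroid
        labels = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # compute k new centroids by averaging the examples in each cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids don't change: stop
            return labels, centroids
        centroids = new_centroids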
Complexities
• What is the complexity of the k-means algorithm?
• O(k · n) distance computations per iteration (times the number of iterations and the number of features)

• What is the complexity of hierarchical (agglomerative) clustering?
• O(n²) space for the pairwise distance matrix; a naive implementation does roughly O(n²) work per merge, i.e. O(n³) overall
An Example of k-means clustering
[Figures: K = 4; the initial centroids, followed by the cluster assignments after iterations 1 through 4]
An Example of K-Means Clustering
• Apply K(=2)-Means algorithm over the data (185, 72), (170, 56), (168, 60),
(179,68), (182,72), (188,77) up to two iterations and show the clusters. Initially
choose first two objects as initial centroids.

• Centroid for first cluster c1 = (185, 72)


Centroid for second cluster c2 = (170, 56)
• Iteration 1: calculate the distance from each object to both centroids using the Euclidean distance measure and assign each object to the nearest centroid.
• Iteration 2: recompute the centroid of each cluster, then recalculate the distances and reassign the objects.
• [The distance tables and the clusters after each iteration appeared as figures in the original slides.]
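The two iterations can be reproduced with a few lines of NumPy (a sketch of the same computation; the intermediate distance tables appeared in the original slides):

import numpy as np

data = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = data[:2].copy()          # initial centroids: (185, 72) and (170, 56)

for it in range(2):
    # assign each object to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute the centroids as the mean of each cluster
    centroids = np.array([data[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {it + 1}: clusters={labels.tolist()}, centroids={centroids.round(2).tolist()}")

With these points, cluster 1 becomes {(185, 72), (179, 68), (182, 72), (188, 77)} and cluster 2 becomes {(170, 56), (168, 60)}, with centroids (183.5, 72.25) and (169, 58) after the first update; the assignment no longer changes in the second iteration.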
Issues with k-means
• Choosing the “wrong” k can lead to strange results
• Consider k=3
• Results depend upon the initial centroids
How to choose the value of k
• A priori knowledge about application domain
• There are five different types of fabrics: k=5
• There are two types of students: k=2
• Search for a good k
• Try different values of k and evaluate quality of results
• Run hierarchical clustering on a subset of the data to get an idea of the underlying
relationships
• High quality clusters have:
■ high intra-class similarity: cohesive within clusters
■ low inter-class similarity: distinctive between clusters
How to choose the value of k
• Empirical method
• # of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200 gives k = 10
• Elbow method
• The elbow method is a graphical method for finding the optimal K
value in a k-means clustering algorithm. The elbow graph shows the
within-cluster-sum-of-square (WCSS) values on the y-axis
corresponding to the different values of K (on the x-axis). The optimal
K value is the point at which the graph forms an elbow.
• Use the turning point in the curve of sum of within cluster variance
w.r.t the # of clusters
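A sketch of the elbow method with scikit-learn, assuming X is a 2-D NumPy array of data points; KMeans.inertia_ is the within-cluster sum of squares (WCSS):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)       # within-cluster sum of squares for this k
# plot range(1, 11) against wcss and pick the k at the "elbow" of the curve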
How to choose the value of k
[Figure: elbow plot of WCSS against the number of clusters k]
How to choose the value of k
• Cross validation method
• Divide a given data set into m parts
• Use m – 1 parts to obtain a clustering model
• Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and the
closest centroids to measure how well the model fits the test set
• For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
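A rough sketch of this m-fold procedure (illustrative only; X is assumed to be a NumPy array, and the quality measure follows the description above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_score(X, k, m=5):
    # average test-set sum of squared distances to the nearest trained centroid
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True, random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        d = np.linalg.norm(X[test_idx][:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        total += (d.min(axis=1) ** 2).sum()
    return total / m

# compare cv_score(X, k) for several k and keep the k that fits the held-out data best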
Unlucky initial centroids
[Figures: an unlucky random initialization and the resulting clustering after convergence]
Mitigating dependence on initial centroids
• Try multiple sets of randomly chosen centroids
• Select “best” results

best = kMeans(points)
for t in range(numTrials):
    C = kMeans(points)
    if dissimilarity(C) < dissimilarity(best):
        best = C
return best
An Example
• Many patients with 4 features each
• Heart rate in beats per minute
• Number of past heart attacks
• Age
• ST elevation (binary)
• Outcome (death) based on features
• Probabilistic, not deterministic
• E.g., older people with multiple heart attacks are at higher risk
• Cluster and examine purity of cluster relative to outcomes
Distance Metrics
• Euclidean
• Manhattan
• Minkowski
• Mahalanobis
• Hamming
Euclidean distance
• For two points X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) in an
n-dimensional space, the Euclidean distance is calculated as:

D_euclidean(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )

In vector format, with X and Y written as column vectors:

D_euclidean(X, Y)² = (X − Y) · (X − Y) = (X − Y)ᵀ (X − Y)
Euclidean distance

• When to use it:


• When you have continuous numerical data.
• When the data dimensions are of a similar scale or have been standardized.
• In algorithms like K-Means clustering and as the default for KNN.
• Limitations:
• Scale Variance: Euclidean distance is sensitive to the scale of the features. If one
feature has a much larger range of values than others, it will dominate the distance
calculation. Therefore, feature scaling (e.g., standardization or normalization) is often
a necessary preprocessing step.
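A small check in NumPy that the sum-of-squares form and the vector (norm) form agree (illustrative values):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 6.0, 3.0])
d1 = np.sqrt(np.sum((X - Y) ** 2))      # sum-of-squares form
d2 = np.linalg.norm(X - Y)              # vector (norm) form
print(d1, d2)                           # both print 5.0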
Manhattan Distance
• Also known as City Block distance, the Manhattan distance is the sum of the absolute differences
of their Cartesian coordinates. It measures the distance a taxi would have to travel along a grid of
streets to get from point X to point Y.

D_manhattan(X, Y) = Σ_{i=1}^{n} |x_i − y_i|

• When to use it:


• When dealing with high-dimensional data, as it is less affected by the curse of dimensionality compared
to Euclidean distance.
• For datasets with categorical features that have been one-hot encoded.
• When the individual feature differences are more important than their squared differences.
• Limitations:
• It can be less intuitive than Euclidean distance in some contexts as it doesn't represent the shortest
path.
• Like Euclidean distance, it is also sensitive to the scale of the features.
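The same pair of vectors under the Manhattan metric (illustrative):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 6.0, 3.0])
print(np.sum(np.abs(X - Y)))    # 3 + 4 + 0 = 7.0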
Minkowski Distance
• Minkowski distance is a generalized metric that can represent both
Euclidean and Manhattan distances by varying a parameter, 'p’.

D_minkowski(X, Y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)

• When p = 1, it becomes the Manhattan distance.


• When p = 2, it becomes the Euclidean distance.
• As p approaches infinity, it becomes the Chebyshev distance.
• When to use it:
• When you want to experiment with different distance metrics by tuning the
parameter 'p'.
• It offers flexibility in defining the distance based on the problem's characteristics.
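A small check that the parameter p interpolates between the two previous metrics (illustrative):

import numpy as np

def minkowski(X, Y, p):
    return np.sum(np.abs(X - Y) ** p) ** (1.0 / p)

X, Y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(minkowski(X, Y, 1))   # 7.0, the Manhattan distance
print(minkowski(X, Y, 2))   # 5.0, the Euclidean distance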
Mahalanobis Distance
• For column vectors X = (x1, x2, ..., xn)ᵀ and Y = (y1, y2, ..., yn)ᵀ:

D_mahalanobis(X, Y) = sqrt( (X − Y)ᵀ S⁻¹ (X − Y) )

Here S is the covariance matrix of the data.
If there is no correlation between any two dimensions in 2-dimensional data
(i.e., the data is spread uniformly in the space) and each dimension has unit
variance (σ² = 1), then S is the identity matrix:

S = [[1, 0], [0, 1]]

In that case the Mahalanobis distance is essentially the Euclidean distance. But
whenever there is some correlation between the dimensions, the covariance term
takes care of it and gives us a better measure of distance.
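A sketch in NumPy, estimating S from a data matrix whose rows are observations (the data here is randomly generated for illustration):

import numpy as np

data = np.random.default_rng(0).multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=500)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix of the data

def mahalanobis(X, Y, S_inv):
    diff = X - Y
    return float(np.sqrt(diff @ S_inv @ diff))

print(mahalanobis(data[0], data[1], S_inv))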
Hamming Distance
• The Hamming distance is used for comparing two binary strings of equal length. It
is the number of positions at which the corresponding symbols are different.
• For two strings x and y of equal length n, the Hamming distance is given by:

• D_hamming(x, y) = Σ_{i=1}^{n} [x_i ≠ y_i], i.e., the number of positions at which x and y differ

• When to use it:


• For comparing categorical variables (e.g., one-hot encoded vectors).
• In information theory for error detection and correction in digital communication.
• Limitations:
• It is only defined for strings or vectors of the same length.
• It is most suitable for binary or categorical data, not continuous numerical data.
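A short example on two binary vectors of equal length (illustrative):

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])
print(int(np.sum(x != y)))    # 2 positions differ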
Density-Based Clustering: DBSCAN Example
[Figure: example of density-based clusters, shown in the original slides]
Density-Based Clustering: Basic Concepts
■ Two parameters:
■ Eps: maximum radius of the neighbourhood
■ MinPts: minimum number of points in an Eps-neighbourhood of that point
■ N_Eps(q): {p belongs to D | dist(p, q) ≤ Eps}
■ Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
■ p belongs to N_Eps(q)
■ core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
Density-Reachable and Density-Connected
■ Density-reachable:
■ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that pi+1 is directly density-reachable from pi
■ Density-connected:
■ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figures: a chain of points from q to p illustrating density-reachability; p and q both density-reachable from a point o]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
■ Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border and outlier points for Eps = 1 cm and MinPts = 5]
DBSCAN: Sensitive to Parameters

DBSCAN online demo:
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
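A sketch of how the two parameters are passed to scikit-learn's DBSCAN; note that its min_samples parameter counts the point itself, so the MinPts convention used on these slides corresponds roughly to min_samples = MinPts + 1 (the data below is a placeholder):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))     # placeholder data
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_      # cluster index per point, -1 marks noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
# try different eps / min_samples values: the number of clusters and noise points changes noticeably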
DBSCAN: Example
minPts = 3 and epsilon = 1.9
[Figures: step-by-step growth of the clusters on a small example dataset, shown across several slides]
• Example: Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2), G(7, 2) and H(8, 4), Find the core points and outliers using
DBSCAN. Take Eps = 2.5 and MinPts = 3

• Step 1: To find the core points, outliers and clusters using DBSCAN we first need to calculate the distance between all pairs of the given data points. Let us use the Euclidean distance measure for this calculation.
• [The resulting distance matrix was shown as a table in the original slides, with every entry whose distance ≤ Eps (i.e. 2.5) marked in red.]
• Step 2: Now, find all the data points that lie in the Eps-neighborhood of each data point, i.e., put every point whose distance is ≤ 2.5 into the neighborhood set of that data point.
• N(A) = {B}                  (only B is within 2.5 of A)
  N(B) = {A, C}               (A and C are within 2.5 of B)
  N(C) = {B, D}               (B and D are within 2.5 of C)
  N(D) = {C, E, F, G, H}      (C, E, F, G and H are within 2.5 of D)
  N(E) = {D, F, G, H}         (D, F, G and H are within 2.5 of E)
  N(F) = {D, E, G}            (D, E and G are within 2.5 of F)
  N(G) = {D, E, F, H}         (D, E, F and H are within 2.5 of G)
  N(H) = {D, E, G}            (D, E and G are within 2.5 of H)
• Here, data points A, B and C have fewer than MinPts (i.e. 3) neighbors, so they cannot be considered core points. Since they belong to the neighborhood of other data points, there are no outliers in the given set of data points.
• Data points D, E, F, G and H have at least MinPts (i.e. 3) neighbors and hence are the core data points.
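The neighborhood sets and core points above can be reproduced with a few lines of NumPy (a sketch of the same calculation):

import numpy as np

pts = {"A": (3, 7), "B": (4, 6), "C": (5, 5), "D": (6, 4),
       "E": (7, 3), "F": (6, 2), "G": (7, 2), "H": (8, 4)}
eps, min_pts = 2.5, 3
names = list(pts)
P = np.array([pts[n] for n in names], dtype=float)
dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)   # pairwise Euclidean distances

for i, n in enumerate(names):
    neigh = [names[j] for j in range(len(names)) if j != i and dist[i, j] <= eps]
    print(n, "neighbours:", neigh, "core" if len(neigh) >= min_pts else "not core")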

Measuring Clustering Quality
■ 2 kinds of measures: External, internal
■ External: supervised, employ criteria not inherent to the
dataset
■ Compare a clustering against prior or expert-specified
knowledge (i.e., the ground truth) using certain clustering
quality measure
■ Internal: unsupervised, criteria derived from data itself
■ Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient

Measuring Clustering Quality
■ Silhouette coefficient:
• For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, : : : ,Ck.
• For each object o ∈ D, we calculate a(o) as the average distance between o and all
other objects in the cluster to which o belongs. (cohesion)
• Similarly, b(o) is the minimum average distance from o to all clusters to which o does
not belong. (separation)
• The silhouette coefficient of o is then defined as
  s(o) = (b(o) − a(o)) / max(a(o), b(o))
• The value of the silhouette coefficient is between -1 and 1. The value of a(o) reflects
the compactness of the cluster to which o belongs. The smaller the value, the more
compact the cluster.
• The value of b(o) captures the degree to which o is separated from other clusters. The
larger b(o) is, the more separated o is from other clusters.
Measuring Clustering Quality

• Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o
is far away from other clusters, which is the preferable case.
• However, when the silhouette coefficient value is negative (i.e., b(o) < a(o)), this means that, in expectation,
o is closer to the objects in another cluster than to the objects in the same cluster as o. This is a bad
situation and should be avoided.
• Silhouette plot: [shown as a figure in the original slides]
Measuring Clustering Quality

• Interpretation of the silhouette plot: [shown as a figure in the original slides]
Example

• Consider a scenario with 3 clusters, where each cluster has 3 two-dimensional points
• Cluster 1:
Point A1: (2, 5)
Point A2: (3, 4)
Point A3: (4, 6)

Cluster 2:
Point B1: (8, 3)
Point B2: (9, 2)
Point B3: (10, 5)

Cluster 3:
Point C1: (6, 10)
Point C2: (7, 8)
Point C3: (8, 9)
• calculate the silhouette coefficient for Point A1

• Step 1: Calculate Cohesion for Point A1
• Distance from Point A1 to Point A2: sqrt((2–3)² + (5–4)²) = sqrt(2)
• Distance from Point A1 to Point A3: sqrt((2–4)² + (5–6)²) = sqrt(5)
• Average cohesion for Point A1 = (sqrt(2) + sqrt(5)) / 2 ≈ 1.825
• Step 2: Calculate Separation for Point A1
• Distance from Point A1 to Point B1: sqrt((2–8)² + (5–3)²) = sqrt(40)
• Distance from Point A1 to Point B2: sqrt((2–9)² + (5–2)²) = sqrt(58)
• Distance from Point A1 to Point B3: sqrt((2–10)² + (5–5)²) = sqrt(64)
• Average separation for Point A1 = (sqrt(40) + sqrt(58) + sqrt(64)) / 3 ≈ 7.313
• Distance from Point A1 to Point C1: sqrt((2-6)² + (5-10)²) = sqrt(41)
• Distance from Point A1 to Point C2: sqrt((2-7)² + (5-8)²) = sqrt(34)
• Distance from Point A1 to Point C3: sqrt((2-8)² + (5-9)²) = sqrt(52)
• Average separation for Point A1 = (sqrt(41) + sqrt(34) + sqrt(52)) / 3 ≈ 6.482
• b(A1) is the smaller of the two averages, i.e. ≈ 6.482, since cluster 3 is the nearest other cluster

• Step 3: Calculate Silhouette Coefficient for Point A1
• silhouette coefficient = (separation − cohesion) / max(separation, cohesion), where the separation is b(A1), the average distance to the nearest other cluster
• silhouette coefficient = (6.482 − 1.825) / max(6.482, 1.825) ≈ 0.718
• Step 4: Average Silhouette Coefficient
• Calculate the average silhouette coefficient across all data points to
obtain the overall silhouette score for the clustering result.
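The same calculation can be cross-checked with scikit-learn, which computes a(o) and b(o) for every point (a sketch using the nine points above):

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[2, 5], [3, 4], [4, 6],      # cluster 1: A1, A2, A3
              [8, 3], [9, 2], [10, 5],     # cluster 2: B1, B2, B3
              [6, 10], [7, 8], [8, 9]])    # cluster 3: C1, C2, C3
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
print(silhouette_samples(X, labels)[0])    # silhouette coefficient of A1, ≈ 0.72
print(silhouette_score(X, labels))         # average over all points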

Gaussian Mixture Model
Gaussian Mixture Model using Expectation
Maximization technique
• Consider a Gaussian Distribution
• Univariate density
• Multivariate density
• Maximum Likelihood Estimation technique
• Take the log of the Gaussian likelihood, differentiate it, and set the derivative to zero
Gaussian Distribution
• To understand the GMM first we have to understand the Gaussian
distribution

• To describe this bell curve / Gaussian distribution we need two quantities:
the mean (µ) and the variance (σ²)
Gaussian Distribution
• Once you have these values of µ and σ you can pick any point x and get the
likelihood of x coming from this Gaussian distribution.

• You can also have the bell curve in more than one dimension
Gaussian Distribution
• In this case the mean will be represented by a vector, with one mean for each dimension:
  Mean µ ➔ [µ1, µ2]
• And sigma will be represented by a covariance matrix containing the pairwise
variances/covariances of the dimensions:
  Variance σ² ➔ covariance matrix Σ = [[V11, V12], [V21, V22]]
• And simple multiplication will be replaced by matrix multiplication
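A sketch of evaluating the likelihood of a point under a 1-D and a 2-D Gaussian with SciPy (the parameter values are placeholders):

from scipy.stats import norm, multivariate_normal

# univariate: likelihood of x = 1.0 under a Gaussian with mean 0 and variance 1
print(norm(loc=0.0, scale=1.0).pdf(1.0))

# multivariate: a mean vector and a covariance matrix instead of scalar mean and variance
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 2.0]])
print(mvn.pdf([1.0, -0.5]))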
Gaussian Mixture model using Expectation
Maximization
• Just like any clustering model, it is iterative and it has two main steps
• Assignment
• Update
• In the initialization phase we randomly assign our gaussians for these data
points, just like centroids in the k-means
• Then calculate the likelihood of each data point for coming from each of
these gaussians.
• In the update phase, we convert these likelihoods to weights by dividing each
likelihood vector by its sum.
• Then, based on these weights, we adjust the mu and sigma of our Gaussians,
which in turn adjusts the position of the Gaussians.
Assignment and Update
[Figures: alternating assignment and update steps of the EM iterations, shown across several slides]
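In practice this EM loop is available as scikit-learn's GaussianMixture; a sketch on randomly generated placeholder data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),      # placeholder data from two blobs
               rng.normal([5, 5], 1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.means_)                    # fitted mean vector of each Gaussian
print(gmm.covariances_.shape)        # one covariance matrix per component
print(gmm.predict_proba(X[:3]))      # soft assignment weights (the "responsibilities")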
