Unsupervised Learning
Motivation
The goal of clustering is to group data points that are close (or similar) to each other and to identify such groupings (or clusters) in an unsupervised manner.
How do we define similarity?
How many iterations are needed to check cluster quality?
Introduction
Supervised learning: discover patterns in the data with a known target (class) or label.
These patterns are then utilized to predict the values of the target attribute in future data instances.
Examples?
Unsupervised learning: the data have no target attribute.
We want to explore the data to find some intrinsic structure in it.
Can we perform regression here?
Examples?
Cluster
A cluster is represented by a single point, known as the centroid (or cluster center) of the cluster.
The centroid is computed as the mean of all data points in the cluster:
C_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i
The cluster boundary is decided by the farthest data point in the cluster.
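As a quick illustration of the centroid definition, here is a minimal NumPy sketch (the points are made up for illustration):

```python
import numpy as np

# Hypothetical 2-D points assigned to one cluster
cluster_points = np.array([[2.0, 10.0], [4.0, 9.0], [5.0, 8.0]])

# Centroid = mean of all points in the cluster, computed per coordinate
centroid = cluster_points.mean(axis=0)
print(centroid)  # [3.6667 9.0]
```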
Applications
Example 1: group people of similar sizes together to make “small”, “medium”, and “large” T-shirts.
  Tailor-made for each person: too expensive.
  One-size-fits-all: does not fit all.
Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.
Example 3: given a collection of text documents, we want to organize them according to their content similarities, to produce a topic hierarchy.
Types of clustering
Clustering: the task of grouping a set of data points such that data points in the same group are more similar to each other than to data points in other groups (a group is known as a cluster).
It groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
Types:
1. Exclusive Clustering: K-means
2. Overlapping Clustering: Fuzzy C-means
3. Hierarchical Clustering: Agglomerative clustering, divisive
clustering
4. Probabilistic Clustering: Mixture of Gaussian models
1. Exclusive clustering: K-means
Basic idea: randomly initialize the k cluster centers and iterate between the assignment and update steps below.
1. Randomly initialize the cluster centers c1, ..., cK.
2. Given the cluster centers, determine the points in each cluster: for each point p, find the closest ci and put p into cluster i.
3. Given the points in each cluster, solve for ci: set ci to be the mean of the points in cluster i.
4. If the ci have changed, repeat from Step 2.
K-means contd..
Algorithm
Begin
  initialize n, c, μ1, μ2, …, μc (randomly selected)
  do
    classify the n samples according to the nearest μi
    recompute μi
  until no change in μi
  return μ1, μ2, …, μc
End
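The pseudocode above can be turned into a minimal NumPy sketch; the function and variable names below are my own, and Euclidean distance is assumed for the "nearest" test:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```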
K-means example
(Step-by-step example figures omitted; these slides are taken from Andrew Moore.)
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
• Ρ(a, b) = |x2 – x1| + |y2 – y1|
• Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution
Iteration-01:
We calculate the distance of each point from each of the three cluster centers, using the given distance function.

Calculating the distance between A1(2, 10) and C1(2, 10):
Ρ(A1, C1) = |2 – 2| + |10 – 10| = 0

Calculating the distance between A1(2, 10) and C2(5, 8):
Ρ(A1, C2) = |5 – 2| + |8 – 10| = 3 + 2 = 5

Calculating the distance between A1(2, 10) and C3(1, 2):
Ρ(A1, C3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
In a similar manner, we calculate the distance of the other points from each of the three cluster centers.
From here, the new clusters are obtained as follows.

Given Point   Distance from C1(2, 10)   Distance from C2(5, 8)   Distance from C3(1, 2)   Belongs to
A1(2, 10)     0                         5                        9                        C1
A2(2, 5)      5                         6                        4                        C3
A3(8, 4)      12                        7                        9                        C2
A4(5, 8)      5                         0                        10                       C2
A5(7, 5)      10                        5                        9                        C2
A6(6, 4)      10                        5                        7                        C2
A7(1, 2)      9                         10                       0                        C3
A8(4, 9)      3                         2                        10                       C2

Cluster-01 contains: A1(2, 10)
Cluster-02 contains: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03 contains: A2(2, 5), A7(1, 2)
Now,
• We re-compute the cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
We have only one point A1(2, 10) in Cluster-01.
So, cluster center remains the same.
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-01.
Iteration-02:
We calculate the distance of each point from each of the center of the three clusters.
The distance is calculated by using the given distance function.
The following shows the calculation of the distance between point A1(2, 10) and each of the three cluster centers.

Calculating the distance between A1(2, 10) and C1(2, 10):
Ρ(A1, C1) = |2 – 2| + |10 – 10| = 0

Calculating the distance between A1(2, 10) and C2(6, 6):
Ρ(A1, C2) = |6 – 2| + |6 – 10| = 4 + 4 = 8

Calculating the distance between A1(2, 10) and C3(1.5, 3.5):
Ρ(A1, C3) = |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7
In a similar manner, we calculate the distance of the other points from each of the three cluster centers.
The new clusters are obtained as follows.

Given Point   Distance from C1(2, 10)   Distance from C2(6, 6)   Distance from C3(1.5, 3.5)   Belongs to
A1(2, 10)     0                         8                        7                            C1
A2(2, 5)      5                         5                        2                            C3
A3(8, 4)      12                        4                        7                            C2
A4(5, 8)      5                         3                        8                            C2
A5(7, 5)      10                        2                        7                            C2
A6(6, 4)      10                        2                        5                            C2
A7(1, 2)      9                         9                        2                            C3
A8(4, 9)      3                         5                        8                            C1

Cluster-01 contains: A1(2, 10), A8(4, 9)
Cluster-02 contains: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)
Cluster-03 contains: A2(2, 5), A7(1, 2)
Now,
We re-compute the cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)

For Cluster-02:
Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)

For Cluster-03:
Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

After the second iteration, the centers of the three clusters are:
C1(3, 9.5), C2(6.5, 5.25), C3(1.5, 3.5)
This completes Iteration-02.
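The two iterations above can be reproduced with a short script that reuses the Manhattan distance from the problem statement and the initial centers A1, A4, A7:

```python
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]   # initial centers: A1, A4, A7

def manhattan(a, b):
    # the distance function given in the problem statement
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for it in range(2):                    # two iterations, as in the example
    clusters = [[], [], []]
    for p in points.values():          # assign each point to its nearest center
        idx = min(range(3), key=lambda j: manhattan(p, centers[j]))
        clusters[idx].append(p)
    # recompute each center as the mean of the points in its cluster
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
    print("after iteration", it + 1, ":", centers)

# after iteration 1: (2.0, 10.0), (6.0, 6.0), (1.5, 3.5)
# after iteration 2: (3.0, 9.5), (6.5, 5.25), (1.5, 3.5)
```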
Exercise: use the K-Means algorithm to create two clusters.
Assume A(2, 2) and C(1, 1) are the centers of the two clusters.
Distance formula: sqrt[(x2 – x1)² + (y2 – y1)²]
K-means: pros and cons
Pros
  Simple, fast to compute
  Converges to a local minimum of the within-cluster squared error
Cons
  How to set k?
  Sensitive to the initial centers
  Sensitive to outliers
  Detects only spherical clusters
  Assumes that means can be computed
2. Fuzzy C-Means Clustering
One data point may belong to two or more clusters with different membership degrees.
Objective function:
J_m = Σ_{i=1..n} Σ_{j=1..c} (u_ij)^m ‖x_i − c_j‖² , where 1 ≤ m < ∞
It is an extension of k-means.
Fuzzy c-means algorithm
Let x_i be a vector of values for data point g_i.
1. Initialize the membership matrix U(0) = [u_ij] for data points g_i and clusters cl_j at random.
2. At the k-th step, compute the fuzzy centroids C(k) = [c_j] for j = 1, ..., nc, where nc is the number of clusters, using

   c_j = ( Σ_{i=1..n} (u_ij)^m x_i ) / ( Σ_{i=1..n} (u_ij)^m )

   where m is the fuzzy parameter and n is the number of data points.
Fuzzy c-means algorithm
3. Update the fuzzy membership U(k) = [u_ij], using

   u_ij = 1 / Σ_{l=1..nc} ( ‖x_i − c_j‖ / ‖x_i − c_l‖ )^(2 / (m−1))

4. If ‖U(k) − U(k−1)‖ < ε, then STOP; otherwise return to Step 2.
5. Determine the membership cutoff:
   For each data point g_i, assign g_i to cluster cl_j if u_ij of U(k) exceeds the chosen membership cutoff value.
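A compact NumPy sketch of these update rules (variable names are my own; the membership update is written with the exponent 2/(m−1), the usual form equivalent to the one above):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means: X is (n, d), c the number of clusters, m > 1 the fuzzifier."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)        # random initial memberships, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        # c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # u_ij = 1 / sum_l (||x_i - c_j|| / ||x_i - c_l||)^(2/(m-1))
        new_U = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.linalg.norm(new_U - U) < eps:  # stop when memberships no longer change
            return centers, new_U
        U = new_U
    return centers, U
```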
Fuzzy c-means
Pros:
Allows a data point to be in multiple clusters
A more natural representation of the behavior of genes, since genes are usually involved in multiple functions
Cons:
Need to define c (k in K-means), the number of clusters
Need to determine membership cutoff value
Clusters are sensitive to initial assignment of centroids
Fuzzy c-means is not a deterministic algorithm
3. Hierarchical Clustering
Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled data points into clusters; it is also known as hierarchical cluster analysis (HCA).
It produces a nested sequence of clusters, a tree, also called a dendrogram.
Types of hierarchical clustering
Agglomerative (bottom up) clustering: builds the dendrogram (tree) from the bottom level, and
• merges the most similar (or nearest) pair of clusters
• stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top down) clustering: starts with all data points in one cluster, the root.
• Splits the root into a set of child clusters; each child cluster is recursively divided further.
• Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
(This slide is taken from Bing Liu, UIC.)
Agglomerative clustering
• Agglomerative clustering is a type of data clustering method used in
unsupervised learning. It is an iterative process that groups similar
objects into clusters based on some measure of similarity.
• Agglomerative clustering uses a bottom-up approach for dividing data
points into clusters. It starts with individual data points and merges
them into larger clusters until all of the objects are clustered together.
How does Agglomerative Hierarchical Clustering work?
The agglomerative hierarchical clustering algorithm follows a systematic process to build the cluster hierarchy. Below is a step-by-step explanation (a short SciPy sketch follows the list):
1. Initialize clusters: treat each data point as a separate cluster. For a dataset with n points, there are initially n clusters.
2. Compute distances: calculate the distance (dissimilarity) between all pairs of clusters using a chosen distance metric, such as Euclidean or Manhattan distance.
3. Merge closest clusters: identify the pair of clusters with the smallest distance and merge them into a single cluster, reducing the total number of clusters by one.
4. Update the distance matrix: recalculate distances between the new cluster and all remaining clusters, using a linkage method to define how distances are computed between clusters.
5. Repeat: continue merging clusters and updating the distance matrix until all data points are combined into a single cluster.
6. Construct the dendrogram: represent the merging process as a dendrogram, where the x-axis lists the data points and the y-axis indicates the distance at which clusters are merged.
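A minimal sketch of these steps using SciPy's hierarchical-clustering routines; the toy data points are made up, and the `method` argument selects how inter-cluster distance is measured (single, complete, average, centroid):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical 2-D data: two small groups and one distant point
X = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [25, 80]], dtype=float)

# Steps 1-5: linkage() computes pairwise distances and repeatedly merges
# the closest pair of clusters, recording each merge in Z.
Z = linkage(X, method="single", metric="euclidean")

# Step 6: the dendrogram shows the merge heights
dendrogram(Z)
plt.show()

# Cut the tree to obtain flat cluster labels, e.g. two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```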
• Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
• Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster is left.
• Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.
Measure for the distance between two clusters
• Single Linkage: It is the Shortest Distance between the closest points
of the clusters.
• Complete Linkage: It is the farthest distance between the two points
of two different clusters. It is one of the popular linkage methods as it
forms tighter clusters than single-linkage.
• Average Linkage: the linkage method in which the distance between every pair of points (one point from each cluster) is summed and then divided by the total number of such pairs to obtain the average distance between the two clusters. It is also one of the most popular linkage methods.
• Centroid Linkage: It is the linkage method in which the distance
between the centroid of the clusters is calculated.
Solved Example (Single Linkage)
Find the pair of points with the minimum distance, i.e., 42 and 43.
Construct a new matrix for the newly formed cluster and again find the minimum distance between the points.
Keep reconstructing the matrix with each new cluster, finding the minimum distance and re-forming the clusters.
The final clusters are (42, 43) and (25, 27, 22, 18).
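The single-linkage result can be reproduced with SciPy. Since the original distance matrices are not reproduced here, I assume the six items are the one-dimensional values 18, 22, 25, 27, 42, 43:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed 1-D items (only the values 18, 22, 25, 27, 42, 43 appear in the example)
values = np.array([[42], [43], [25], [27], [22], [18]], dtype=float)

Z = linkage(values, method="single")           # single linkage = nearest-point distance
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
for v, lab in zip(values.ravel(), labels):
    print(int(v), "-> cluster", lab)
# 42 and 43 end up in one cluster; 25, 27, 22, 18 in the other
```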
Average Linkage
(Worked example figure omitted.)
Hierarchical clustering
Pros
  Dendrograms are great for visualization
  Provides hierarchical relations between clusters
  Shown to be able to capture concentric clusters
Cons
  Not easy to define levels for clusters
  Experiments have shown that other clustering techniques can outperform hierarchical clustering
Density-Based Clustering
• Density-based clustering is based on the idea that clusters are regions of high
density separated by regions of low density.
• The algorithm works by first identifying "core" data points, which are data
points that have a minimum number of neighbors within a specified distance.
These core data points form the center of a cluster.
• Next, the algorithm identifies "border" data points, which are data points that
are not core data points but have at least one core data point as a neighbor.
• Finally, the algorithm identifies "noise" data points, which are data points that
are not core data points or border data points.
DBSCAN Clustering
• DBSCAN, which stands for Density-Based Spatial Clustering of Applications with
Noise, is a powerful clustering algorithm that groups points that are closely
packed together in data space.
• DBSCAN is a density-based clustering algorithm that groups data points that
are closely packed together and marks outliers as noise based on their density
in the feature space. It identifies clusters as dense regions in the data space,
separated by areas of lower density.
• Unlike some other clustering algorithms, DBSCAN doesn't require specifying
the number of clusters beforehand, making it particularly useful for
exploratory data analysis.
DBSCAN revolves around three key
concepts:
1.Core Points: These are points that have at
least a minimum number of other points
(MinPts) within a specified distance (ε or
epsilon).
2.Border Points: These are points that are
within the ε distance of a core point but
don't have MinPts neighbors themselves.
3.Noise Points: These are points that are
neither core points nor border points.
They're not close enough to any cluster to
be included.
DBSCAN uses two main parameters:
• ε (epsilon): The maximum distance
between two points for them to be
considered as neighbors.
• MinPts: The minimum number of points
required to form a dense region.
How Does DBSCAN Work?
DBSCAN operates by examining the neighborhood of each point in the dataset. The algorithm follows a step-by-
step process to identify clusters based on the density of data points:
1.Parameter Selection
• Choose ε (epsilon): The maximum distance between two points for them to be considered as neighbors.
• Choose MinPts: The minimum number of points required to form a dense region.
2.Select a Starting Point
• The algorithm starts with an arbitrary unvisited point in the dataset.
3.Examine the Neighborhood
• It retrieves all points within the ε distance of the starting point.
• If the number of neighboring points is less than MinPts, the point is labeled as noise (for now).
• If there are at least MinPts points within ε distance, the point is marked as a core point, and a new cluster is formed.
4.Expand the Cluster
• All the neighbors of the core point are added to the cluster.
• For each of these neighbors:
• If it's a core point, its neighbors are added to the cluster recursively.
• If it's not a core point, it's marked as a border point, and the expansion stops.
5.Repeat the Process
• The algorithm moves to the next unvisited point in the dataset.
• Steps 3-4 are repeated until all points have been visited.
6.Finalize Clusters
• After all points have been processed, the algorithm identifies all clusters.
• Points initially labeled as noise might now be border points if they're within ε distance of a core point.
7. Handling Noise
• Points that remain unassigned to any cluster after all points have been processed stay labeled as noise (outliers).
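scikit-learn provides a DBSCAN implementation that follows these steps; a minimal usage sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data: two dense blobs plus one far-away point
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [8, 8], [8.1, 7.9], [7.9, 8.2],
              [25, 25]], dtype=float)

# eps = neighborhood radius (epsilon), min_samples = MinPts
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

print(db.labels_)                 # cluster index per point; -1 marks noise
print(db.core_sample_indices_)    # indices of the core points
```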
Solved Example
Using the given parameters, form clusters from points P1 – P12, where epsilon ≤ 1.9.
Determine the status of each point as Core, Border, or Noise: for a given point, if there are at least minPts points within epsilon, mark that point as Core; otherwise mark it as Noise. If a point classified as Noise falls within the neighborhood of a Core point, update its status to Border.
The final clusters formed are shown in the accompanying figure (omitted).
What is dimensionality reduction?
• Dimensionality reduction in machine learning is the process of
reducing the number of features or variables in a dataset while
retaining as much of the original information as possible. In other
words, it is a way of simplifying the data by reducing its complexity.
• The need for dimensionality reduction arises when a dataset has a
large number of features or variables. Having too many features can
lead to overfitting and increase the complexity of the model.
Why Dimensionality Reduction is Important
Dimensionality reduction brings many advantages to your machine
learning data, including:
• Fewer features mean less complexity
• You need less storage space because there is less data
• Fewer features require less computation time
• Model accuracy can improve due to less misleading data
• Algorithms train faster thanks to less data
• Reducing the data set’s feature dimensions helps visualize the data faster
• It removes noise and redundant features
Dimensionality Reduction
Techniques
• Dimensionality reduction techniques can be broadly
divided into two categories:
• Feature Selection
• Feature Extraction
Feature Selection
It involves selecting a subset of relevant features from the original feature
set to reduce the feature space while improving the model’s performance
by reducing computational power. When a dataset has too many features,
some may be irrelevant or add noise which can slow down training
process and reduce accuracy but it helps in building simpler, faster and
more accurate models and also helps in reducing overfitting.
There are various algorithms used for feature selection; they are grouped into three main categories (a small example follows the list):
1.Filter Methods
2.Wrapper Methods
3.Embedded Methods
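As a small illustration of a filter method, the sketch below uses scikit-learn's SelectKBest; the dataset and the choice k = 2 are arbitrary examples of mine, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target (ANOVA F-test)
# and keep the k best-scoring features, independently of any model.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)       # (150, 4) -> (150, 2)
print(selector.get_support(indices=True))   # indices of the kept features
```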
Feature Extraction
• Feature extraction is a machine learning technique that
reduces the number of resources required for processing
while retaining significant or relevant information.
• In other words, feature extraction entails constructing new features that retain the key information from the original data but in a more efficient form, transforming raw data into a set of numerical features that a computer program can easily understand and use.
• Dimensionality reduction approaches include Principal
Component Analysis (PCA), Linear Discriminant Analysis
(LDA), and t-SNE.
Principal Component
Analysis(PCA)
• PCA is a statistical technique introduced by
mathematician Karl Pearson in 1901. It works by
transforming high-dimensional data into a lower-
dimensional space while maximizing the variance
(or spread) of the data in the new space. This helps
preserve the most important patterns and relationships
in the data.
Steps
• Standardize the data − PCA requires that the data be standardized to have
zero mean and unit variance.
• Compute the covariance matrix − PCA computes the covariance matrix of
the standardized data.
• Compute the eigenvectors and eigenvalues of the covariance
matrix − PCA then computes the eigenvectors and eigenvalues of the
covariance matrix.
• Select the principal components − PCA selects the principal components
based on their corresponding eigenvalues, which indicate the amount of
variation in the data explained by each component.
• Project the data onto the new feature space − PCA projects the data
onto the new feature space defined by the selected principal components.
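A minimal NumPy sketch of the five steps above (function and variable names are my own; scikit-learn's PCA class wraps an equivalent procedure):

```python
import numpy as np

def pca(X, n_components):
    """PCA via the covariance matrix, following the steps above."""
    # 1. Standardize: zero mean, unit variance per feature
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data
    cov = np.cov(Xs, rowvar=False)
    # 3. Eigenvectors and eigenvalues of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Select the components with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Project the data onto the new feature space
    return Xs @ components

X = np.random.default_rng(0).normal(size=(100, 5))   # hypothetical data
print(pca(X, 2).shape)                               # (100, 2)
```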
Algorithm and solved example (figures omitted).
Advantages of PCA
Following are the advantages of using Principal Component Analysis −
• Reduces dimensionality − PCA is particularly useful for high-dimensional datasets because it can reduce the number
of features while retaining most of the original variability in the data.
• Removes correlated features − PCA can identify and remove correlated features, which can help improve the
performance of machine learning models.
• Improves interpretability − The reduced number of features can make it easier to interpret and understand the data.
• Reduces overfitting − By reducing the dimensionality of the data, PCA can reduce overfitting and improve the
generalizability of machine learning models.
• Speeds up computation − With fewer features, the computation required to train machine learning models is faster.
Disadvantages of PCA
Following are the disadvantages of using Principal Component Analysis −
• Information loss − PCA reduces the dimensionality of the data by projecting it onto a lower-dimensional space, which
may lead to some loss of information.
• Can be sensitive to outliers − PCA can be sensitive to outliers, which can have a significant impact on the resulting
principal components.
• Interpretability may be reduced − Although PCA can improve interpretability by reducing the number of features, the
resulting principal components may be more difficult to interpret than the original features.
• Assumes linearity − PCA assumes that the relationships between the features are linear, which may not always be the
case.
• Requires standardization − PCA requires that the data be standardized, which may not always be possible or
appropriate.
What is t-SNE ?
• t-SNE (t-distributed Stochastic Neighbor Embedding) is an unsupervised non-linear
dimensionality reduction technique for data exploration and visualizing high-
dimensional data. Non-linear dimensionality reduction means that the algorithm allows
us to separate data that cannot be separated by a straight line.
t-SNE vs PCA
Both t-SNE and PCA are dimensional reduction techniques with different mechanisms
that work best with different types of data.
PCA (Principal Component Analysis) is a linear technique that works best with data that has a linear structure. It seeks to identify the underlying principal components in the data by projecting onto lower dimensions, maximizing variance, and preserving large pairwise distances.
• But, t-SNE is a nonlinear technique that focuses on preserving the pairwise similarities
between data points in a lower-dimensional space. t-SNE is concerned with preserving
small pairwise distances whereas, PCA focuses on maintaining large pairwise distances
to maximize variance.
How t-SNE Works
The t-SNE algorithm finds the similarity measure between pairs of instances in
higher and lower dimensional space. After that, it tries to optimize two
similarity measures. It does all of that in three steps.
1.t-SNE models a point being selected as a neighbor of another point in both
higher and lower dimensions. It starts by calculating a pairwise similarity
between all data points in the high-dimensional space using a Gaussian kernel.
The points far apart have a lower probability of being picked than the points
close together.
2.The algorithm then tries to map higher-dimensional data points onto lower-
dimensional space while preserving the pairwise similarities.
3. This is achieved by minimizing the divergence (the Kullback–Leibler divergence) between the original high-dimensional and the lower-dimensional probability distributions. The algorithm uses gradient descent to minimize the divergence, and the lower-dimensional embedding is optimized to a stable state.
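A minimal usage sketch with scikit-learn's TSNE; the dataset and parameter values such as perplexity=30 are illustrative choices of mine, not prescribed by the text:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# Map the 64-D points to 2-D while preserving local pairwise similarities
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X.shape, "->", X_2d.shape)      # (1797, 64) -> (1797, 2)
```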