Chapter 3
Unsupervised Learning
Introduction to Unsupervised Learning
Types of Unsupervised Learning
Clustering Algorithms Such as K-Means Clustering
Evaluation Metrics in Unsupervised Learning
Challenges in Unsupervised Learning
Applications of Unsupervised Learning
Compiled by: Wogayehu A.
Introduction to Unsupervised Learning
• Unsupervised learning is a branch of machine learning that deals with
unlabeled data.
• Unlike supervised learning, where the data is labeled with a specific
category or outcome, unsupervised learning algorithms are tasked
with finding patterns and relationships within the data without any
prior knowledge of the data's meaning.
• Unsupervised machine learning algorithms find hidden patterns in
data without any human intervention, i.e., we don't give the expected
output to our model.
• The model is trained on input values only and discovers the
groups or patterns on its own.
• Unsupervised Learning finds patterns or structures in data without
labeled outputs.
• Goal: Discover underlying structure (e.g., clusters, associations, low-
dimensional representations).
Introduction to Unsupervised Learning
• How unsupervised learning works, in general:
Unsupervised learning works by analyzing unlabeled data to identify
patterns and relationships.
The data is not labeled with any predefined categories or outcomes,
so the algorithm must find these patterns and relationships on its
own.
This can be a challenging task, but it can also be very rewarding, as it
can reveal insights into the data that would not be apparent from a
labeled dataset.
• The input to unsupervised learning models is as follows:
Unstructured data: May contain noisy (meaningless) data, missing values, or
unknown data. Unstructured data is often more powerful for unsupervised learning
because it carries more hidden patterns (e.g., in text, images, audio), but it also
requires more preprocessing and computational resources.
Unlabeled data: Contains values only for the input parameters; there is no
target value (output). It is easier to collect than the labeled data required by the
supervised approach.
Introduction to Unsupervised Learning
Key Characteristics of Unsupervised Learning:
o No Labeled Data: This is the most defining feature. The input data
doesn't have predefined categories or target outputs.
o Discovery of Hidden Structures: The primary goal is to uncover inherent
patterns, groupings, and relationships within the data that might not be
immediately obvious.
o No Feedback: The algorithm doesn't receive feedback on the correctness
of its "predictions" during training. It learns by exploring the data and
discovering patterns on its own.
o Data Exploration and Insight Generation: It's incredibly useful for
exploring new datasets, understanding their underlying organization,
and generating insights when you don't have a clear idea of what you're
looking for.
o Feature Learning: Algorithms can automatically learn relevant features
or representations from raw data, which can be useful for further
analysis or modeling.
Types of Unsupervised Learning
• Unsupervised learning primarily involves three types of tasks:-
Clustering---This involves grouping similar data points together into
"clusters" based on similarities or patterns. Data points within the same
cluster are more similar to each other than to those in other clusters.
Association Rule Learning---This aims to discover interesting relationships or
"rules" between variables in large datasets; it identifies items that
frequently occur together.
Dimensionality Reduction---This technique reduces the number of features
(or dimensions) in a dataset while retaining as much relevant information
as possible. This is useful for:
Simplifying Data: Making high-dimensional data easier to visualize
and understand.
Improving Model Performance: Reducing noise and preventing
overfitting in other machine learning models.
Feature Extraction: Creating new, more meaningful features from
existing ones.
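As a concrete illustration of dimensionality reduction, here is a minimal sketch using PCA from scikit-learn; the 5-feature synthetic dataset is purely illustrative and not from the slides.

```python
# Minimal dimensionality-reduction sketch using PCA (scikit-learn).
# The 5-feature synthetic data below is purely illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

pca = PCA(n_components=2)              # keep 2 dimensions
X_reduced = pca.fit_transform(X)       # project onto the top-2 components

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # information retained per component
```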
Types of Unsupervised Learning
• Unsupervised learning primarily involves three types of tasks:-
• Clustering:- Similar to classification but without predefined classes.
Clustering aims to group similar data points based on their features.
• Association Rule Learning: Identifies relationships or associations
between variables in large datasets.
• Dimensionality reduction:- Simplifies datasets by reducing the
number of features while retaining essential information (autoencoders
are one neural-network approach to this task).
Clustering
• Clustering is a technique used to group similar items or data points
together based on certain characteristics or features.
• Clustering can help to identify data points that lie far away from the
rest of the dataset (outliers), as well as variations within a dataset.
• It is an unsupervised machine learning algorithm that organizes and
classifies different objects, data points, or observations into groups
or clusters based on similarities or patterns.
• Examples of clusters can include genres of music, different groups of
users, key segments of a market segmentation, types of network
traffic on a server cluster, friend groups in a social network, or many
other kinds of categories.
• The process of clustering can use just one feature of the data or it
can use all of the features present in the data.
Clustering
• Clustering is a tool in data science for data analysis and machine
learning to group similar data points together into “clusters.”
• The goal of clustering is to find patterns in data and group similar
data points together, while separating dissimilar data points into
different groups.
• For example, let’s say you have a customer dataset containing their
age, income, and location. You could use clustering to group together
customers who are similar in terms of age and income, and separate
out customers who are very different in terms of these
characteristics. This might be useful for a business that wants to
target marketing campaigns to specific customer segments.
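As an illustration of this customer example, below is a minimal sketch using scikit-learn's KMeans on made-up (age, income) values; the scaling step is an added assumption, since the two features live on very different numeric ranges.

```python
# Illustrative customer-segmentation sketch with K-Means (scikit-learn).
# The (age, income) values are made up for demonstration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

customers = np.array([
    [25, 30_000], [27, 32_000], [45, 90_000],
    [48, 95_000], [33, 60_000], [35, 58_000],
])

X = StandardScaler().fit_transform(customers)   # put features on one scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)        # segment assigned to each customer
```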
Clustering
• Example: Patient Segmentation for Lung Cancer Diagnosis
• Objective: To group patients with similar clinical and diagnostic
patterns related to lung cancer, helping doctors in early detection,
risk profiling, and treatment planning, even when clear labels
(diagnosed vs not) are not available.
• Input Data (Patient Features): Collected from CT scans, medical history, and tests:
o Age
o Smoking history (pack-years)
o Cough frequency
o Presence of chest pain
o Nodule size from CT scan
o Shortness of breath score
o Lung function test results (e.g., FEV1)
o Family history of cancer
o Blood oxygen level
Clustering
• Task: Clustering with K-Means or Hierarchical Clustering
• Group patients into clusters based on similarity in features.
• Possible Clustering Output:
Cluster | Characteristics | Interpretation
Cluster 1 | Older patients, heavy smokers, large nodules, low oxygen levels | High risk – likely lung cancer
Cluster 2 | Moderate age, moderate smoking, medium-size nodules, some symptoms | Medium risk – needs monitoring
Cluster 3 | Younger, non-smokers, no nodules, high lung function | Low risk – routine check-up only
• How It Helps:
o Doctors can focus diagnostic tests on high-risk clusters.
o Hospitals can prioritize resources (CT scans, specialist referrals) for critical groups.
o Early intervention may increase survival rates.
Types of Clustering
There are many different algorithms that can be used for clustering, such as
k-means clustering and hierarchical clustering. The most common clustering
types are:-
Hierarchical clustering, sometimes called connectivity-based clustering,
groups data points together based on the proximity and connectivity of their
attributes.
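A minimal sketch of connectivity-based clustering, using scikit-learn's AgglomerativeClustering on toy 2-D points; the data and parameter choices are illustrative only.

```python
# Illustrative hierarchical (connectivity-based) clustering sketch
# using scikit-learn's AgglomerativeClustering on toy 2-D points.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# linkage="ward" merges the pair of clusters that least increases variance
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(agg.labels_)   # two well-separated groups, e.g. [1 1 1 0 0 0]
```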
K-Means Clustering-Partition Clustering
• What is K-Means?
• K-Means is one of the most popular clustering algorithms in unsupervised
machine learning. It is used to group similar data points into clusters based
on their features.
• It is a partition-based clustering algorithm that divides a dataset into K
distinct non-overlapping clusters, where each data point belongs to the
cluster with the nearest mean (called the centroid).
• K-Means works iteratively to minimize the distance between data points
and their respective cluster centers (centroids).
• Objects are classified into a predefined number of groups.
o K = the number of clusters you want the algorithm to find in your data.
o You choose the value of K before the algorithm runs, based on domain knowledge or heuristics such as the elbow method (described later).
o For example, if K = 3, the algorithm will try to group the data into 3 clusters.
o "Means" refers to the centroid (average position) of a cluster.
K-Means Clustering-Distance Measure
• K-Means clustering uses distance to determine the similarity
between data points and cluster centroids. The most common
distance measure used is Euclidean Distance.
• Distance Measure will determine the similarity between two
elements and it will influence the shape of the clusters.
• Euclidean distance is the straight-line distance between two points in space.
• Types of distance Measure
o Euclidean distance measure
the ordinary straight-line distance between two
points in Euclidean space. For two points
$A(x_1, y_1)$ and $B(x_2, y_2)$:
$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
K-Means Clustering-Distance Measure
• Euclidean Distance is the default and most widely used distance
metric in K-Means.
• Euclidean distance is used to assign points to the nearest centroid.
• New centroids are computed using the mean of all points in a cluster.
• The process repeats until the cluster assignments don’t change.
k-Means Clustering
• This algorithm partitions the dataset into a set number (k) of clusters.
• It randomly initializes k centroids and assigns each data point to the
nearest centroid. For example: If you want 3 clusters (k = 3), you
randomly place 3 centroids.
• The centroids are updated iteratively until the clusters are stable.
K-Means Clustering-Distance Measure
• Manhattan Distance measure
• Manhattan Distance measures the distance between two points
along axes at right angles:
$d(A, B) = |x_2 - x_1| + |y_2 - y_1|$
• Squared Euclidean distance measure
• Squared Euclidean Distance measures the square of the straight-line (Euclidean)
distance between two points:
$d^2(A, B) = (x_2 - x_1)^2 + (y_2 - y_1)^2$
It avoids taking the square root, making it faster to
compute (especially in machine learning algorithms like K-Means).
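The three measures can be sketched in a few lines of Python; the helper names below are ours, not from the slides, and the points generalize to any dimension.

```python
# Sketches of the three distance measures described above, for 2-D points.
import math

def euclidean(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Distance along axes at right angles (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def squared_euclidean(a, b):
    """Euclidean distance without the square root: cheaper to compute."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

A, B = (1, 2), (4, 6)
print(euclidean(A, B))          # 5.0
print(manhattan(A, B))          # 7
print(squared_euclidean(A, B))  # 25
```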
How does the K-Means algorithm work?
K-Means Clustering: Steps
1. Start: The algorithm begins.
2. Number of Clusters K: The first step is to define the
number of clusters, denoted by 'K', that the data
will be grouped into. This value is determined by
the user or through some heuristic methods.
3. Centroid: Initial K centroids are chosen. These are
essentially the initial central points for each of the
K clusters. The choice of initial centroids can be
random or based on some specific strategy.
4. Distance Objects to Centroids: Each data point
(object) in the dataset is assigned to the nearest
centroid. The distance is typically calculated using
metrics like Euclidean distance.
5. Grouping based on minimum Distance: After
calculating distances, each data point is assigned to
the cluster whose centroid is closest to it. This
forms initial groupings.
K-Means Clustering: Steps
6. Centroid has Converged? (Decision Point): This is the
core of the iterative process.
False: If the centroids have not converged (meaning their
positions are still changing significantly from one iteration to the
next), the process loops back to the "Centroid" step. In this
iteration, new centroids are calculated as the mean of all data
points currently assigned to that cluster. Then, steps 4 and 5 are
repeated with these new centroids.
True: If the centroids have converged (meaning their positions are
no longer changing significantly, or the assignments of data points
to clusters have stabilized), the algorithm proceeds to the "End"
step.
7. End: The algorithm terminates, and the final clusters
and their centroids are produced. In essence, the K-Means
algorithm works by iteratively performing two main steps:
• Assignment Step: Assigning each data point to its closest
centroid.
• Update Step: Recalculating the centroids based on the mean
of the data points assigned to each cluster.
This process continues until the cluster assignments no longer change
significantly, or a predefined number of iterations is reached, indicating
that the algorithm has converged to a stable clustering solution.
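Below is a from-scratch NumPy sketch of this assignment/update loop. The initialization strategy (random points drawn from the data) and the toy dataset are illustrative assumptions, and empty-cluster handling is omitted for brevity.

```python
# From-scratch sketch of the K-Means assignment/update loop.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 3: init
    for _ in range(max_iters):
        # Assignment step (steps 4-5): each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # step 6: converged?
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0],
              [9.0, 9.5], [1.0, 0.5], [8.5, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels)      # two clear groups
print(centroids)
```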
How to decide the number of clusters?
Elbow Method (most common): Run K-Means clustering
on the dataset for a range of values of K (the number of clusters).
• Plot how the total error (within-cluster sum of squares, or WSS) decreases as K
increases.
• Pick the K at the "elbow" point, where adding more clusters gives diminishing
returns.
• The sum of squared error is defined as the sum of the squared distances between
each member of a cluster and its centroid:
$WSS = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - c_i\|^2$
where $c_i$ is the centroid of cluster $C_i$.
In the example plot (shown in the original slide), the WSS values change very slowly
after K = 2, so that elbow-point value is taken as the final number of clusters.
• Silhouette Score, Gap Statistic, and Domain Knowledge are other methods to decide K. A sketch of the elbow method follows.
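A minimal elbow-method sketch, assuming scikit-learn's KMeans (its inertia_ attribute is the WSS) and synthetic data with three natural groups.

```python
# Elbow-method sketch: run K-Means for several values of K and record
# the within-cluster sum of squares (exposed by scikit-learn as inertia_).
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

ks = range(1, 9)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WSS (inertia)")
plt.show()   # look for the 'elbow' where the curve flattens
```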
Example Problem using Euclidean Distance
[The worked example applies K-Means with K = 3 to eight data points (A1, A2, A3, B1, B2, B3, C1, C2); the given coordinates, initial centroids, and distance calculations appear as figures in the original slides.]
Once the distance calculation is finished, we need to assign each of the
data points to one of the 3 clusters.
We assign based on the minimum distance to the centroids.
Example Problem
Now each data point is assigned to its respective cluster. Once you initially assign data
points to clusters in algorithms like K-Means, it is mandatory to compute new centroids
and reassign points until convergence. This is a core part of the K-Means algorithm.
Now compute the mean of each cluster to get the new centroids (the resulting values
are shown in the original slide), and make these centroids the current centroids.
Example Problem
Now we need to consider the new centroids as the current centroids and compute the
distances again, then make the assignments based on the minimum distance from
each centroid. Since data point C2 was grouped once in cluster 2 and then later in
cluster 1, the clustering has not converged (C2 moved from one cluster to another);
therefore we need to compute new centroids again, and these become the current
centroids.
Example Problem
Again, data point B1 was grouped once in cluster 2 and then later in cluster 1, so
the clustering has still not converged (B1 moved from one cluster to another);
we need to compute new centroids again because it has not yet converged, and the
new clusters become the current clusters.
Example Problem
When we look at the previous assignment and the current assignment, both are
the same. This means all the data points have converged to these new clusters,
and this is the final step (based on Euclidean distance) because all the data points
have converged.
Finally, we write down the final clusters: data points A1, B1, and C2
are in cluster 1; A3, B2, and B3 are in cluster 2; and A2 and C1 are in cluster 3.
Exercise with 1D clustering
Assign each of the 1D data points (given in the original slide) to one of two clusters, C1 and C2.
Exercise with 1D clustering
Euclidean distance in 1D reduces to the absolute difference: $d(x, c) = |x - c|$.
Exercise with 1D clustering
After the 1st iteration we get these groups;
then find the new centroids of the new
clusters for comparison purposes.
Exercise with 1D clustering
Once we get the new centroids, they become the current centroids for the 2nd
iteration, and we compute the distances again.
Exercise with 1D clustering
After we group the data points based on the minimum distances, we have
to compare the previous cluster assignment with the current one.
If there are variations between the two, the clustering has not converged yet, and we
need to compute new centroids until it converges.
Exercise with 1D clustering
Checking the clusters again, they have still not converged, so we need
to compute new centroids. Move the new cluster assignment to the current
cluster assignment and compute the distances again.
Exercise with 1D clustering
Now all the data points have converged because there is no variation between the
previous assignments and the current assignments, and we write down the
final clusters as in the example above. A runnable 1D sketch follows.
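Since the original data points appear only in the slide figures, here is a comparable 1D sketch with made-up values, taking the first two points as initial centroids.

```python
# 1-D K-Means with two clusters; the data values here are illustrative
# because the original exercise points appear only in the slide figures.
import numpy as np

points = np.array([2.0, 4.0, 10.0, 12.0, 3.0, 20.0, 30.0, 11.0, 25.0])
c1, c2 = points[0], points[1]          # initial centroids (first two points)

while True:
    # In 1-D, Euclidean distance is just the absolute difference.
    labels = np.where(np.abs(points - c1) <= np.abs(points - c2), 1, 2)
    new_c1 = points[labels == 1].mean()
    new_c2 = points[labels == 2].mean()
    if new_c1 == c1 and new_c2 == c2:  # converged: assignments are stable
        break
    c1, c2 = new_c1, new_c2

print("Cluster C1:", points[labels == 1], "centroid:", c1)
print("Cluster C2:", points[labels == 2], "centroid:", c2)
```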
Pros and Cons: K-Means Clustering
Pros:
o Simple and understandable
o Items are automatically assigned to clusters
Cons:
o Must define the number of clusters in advance
o Hard clustering: each point belongs to exactly one cluster
o Need to select the initial centroids for each of the clusters
o Unable to handle noisy data and outliers
Fuzzy C-Means Clustering
• Fuzzy C-Means is an extension of K-Means, the popular
simple clustering technique.
• Fuzzy clustering (also referred to as soft clustering) is a form
of clustering in which each point can belong to more than one
cluster.
• Fuzzy C-Means is an unsupervised clustering algorithm like K-
Means, but with one big difference:
Instead of assigning each data point to one single cluster, Fuzzy C-Means
allows each point to belong to multiple clusters with varying degrees of
membership.
o "Fuzzy" means soft or partial membership.
o C is the number of clusters, just like K in K-Means.
o So “Fuzzy C-Means” = Fuzzy Clustering with C Clusters.
Fuzzy C-Means Clustering
• Fuzzy C-Means (FCM) is a soft clustering algorithm that allows data
points to belong to multiple clusters with varying degrees of
membership, unlike "hard" clustering methods like K-means, where
each data point belongs exclusively to one cluster.
• The core idea of FCM is to minimize an objective function that
represents the sum of squared errors between data points and
cluster centers, weighted by their membership degrees.
• This iterative optimization process moves the cluster centers to the
"right" location within the dataset.
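A minimal from-scratch sketch of the standard FCM update rules for the objective $J = \sum_i \sum_j u_{ij}^m \|x_i - v_j\|^2$ (fuzzifier m = 2 here); the toy data and function name are illustrative assumptions, not from the slides.

```python
# Minimal Fuzzy C-Means sketch: alternate between updating cluster
# centers (membership-weighted means) and memberships u (soft labels).
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, max_iters=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(max_iters):
        um = u ** m                              # fuzzified memberships
        centers = (um.T @ X) / um.sum(axis=0)[:, None]   # weighted means
        # distances from every point to every center (epsilon avoids /0)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        inv = d ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)     # standard FCM update
        if np.abs(new_u - u).max() < tol:        # memberships stabilized
            break
        u = new_u
    return u, centers

# Toy usage: two tight groups; memberships should be near 0/1 per group.
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
u, centers = fuzzy_c_means(X, c=2)
print(np.round(u, 2))    # soft memberships: each row sums to 1
```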
Evaluation Metrics in Unsupervised Learning
• For Clustering:
Silhouette Coefficient
Davies-Bouldin Index
Adjusted Rand Index
Confusion Matrix (when ground truth is available)
• For Dimensionality Reduction:
Reconstruction Error
Visualization
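A sketch of the clustering metrics above using scikit-learn; the toy data and the "true" labels used for the Adjusted Rand Index are hypothetical.

```python
# Clustering-metric sketch via scikit-learn on toy 2-D data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score)

X = np.array([[1, 1], [1.2, 0.8], [8, 8], [8.2, 7.9], [0.9, 1.1], [7.8, 8.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # closer to 1 is better
print(davies_bouldin_score(X, labels))    # lower is better

true_labels = [0, 0, 1, 1, 0, 1]          # hypothetical ground truth
print(adjusted_rand_score(true_labels, labels))  # 1.0 = perfect agreement
```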
Challenges of Unsupervised Learning
• No Ground Truth Labels
• Interpretability of Clusters
• Choice of Distance Metrics
• Scalability
Applications of Unsupervised Learning
• Customer Segmentation
• Recommender Systems
• Anomaly Detection (e.g., Fraud Detection)
• Document and Text Clustering
Applications ……
Recommendation Systems: Suggesting products or content
based on the behaviors of similar users.
Anomaly/Outlier Detection: Identifying unusual data points
that deviate significantly from the norm (e.g., fraud
detection, network intrusion).
Natural Language Processing (NLP): Categorizing text, topic
modeling, and understanding relationships between
words.
Bioinformatics: Grouping genes or proteins with similar
functions.
Data Preprocessing: Cleaning and preparing data for other
machine learning tasks.
Thank You !!!