CHAPTER 10
Unsupervised Learning 1
“PREDICTING THE FUTURE ISN’T MAGIC, IT’S ARTIFICIAL INTELLIGENCE.”
~ Dave Waters
Unsupervised Learning 2
Do you have labelled data? NO
Do you want to group the data? YES
Then look for clustering algorithms.
Do you want to group the data? NO
Then look for dimensionality reduction algorithms.
Unsupervised Learning 3
Data: we have input data x, but no labels y!
Goal: learn some underlying hidden structure (interesting patterns or properties) of the data
Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.
Unsupervised Learning 4
Introduction
Second major problem type
In unsupervised learning, we get unlabeled data
One way of handling such data is to cluster it into groups (clustering algorithm)
Example of a clustering algorithm
Unsupervised Learning 5
Examples of applications:
Organize computer clusters: identify potential weak spots or distribute workload effectively
Social network analysis
Customer data
Identify unknown subtypes of disease among tissue samples
Astronomical data analysis
These algorithms often give surprisingly good results
Unsupervised Learning 6
Google News
Groups news stories into cohesive groups
Document clustering: identify sets of documents about the same topic
Used in many other problems as well
Genomics
Microarray data
Have a group of individuals
For each individual, measure the expression of genes
Run a clustering algorithm to group individuals into types of people
Challenges of Unsupervised Learning 7
Exploratory data analysis — goal is not always
clearly defined
Difficult to assess performance — “right answer”
unknown
Working with high-dimensional data
Clustering 8
A set of methods for finding subgroups within the dataset
Observations should share common characteristics within the same group, but differ across groups
Groupings are determined from attributes of the data itself; this differs from classification, where the groups are predefined by labels.
Clustering 9
[Figure: scatter plot of observations grouped into Cluster A, Cluster B, and Cluster C]
Types of Clustering 10
Centroid-based clustering
Hierarchical clustering
Model-based clustering
However, these categories overlap; for example, k-medoids is both a centroid-based and a model-based clustering technique.
Clustering can also be categorised as hard vs. soft/fuzzy clustering.
Centroid-based clustering 11
Organizes data into clusters by minimizing the distance between data points and the centroid (center) of their assigned cluster.
The most common algorithm is k-means.
The most common algorithm is k-means.
K-means 12
K-means is a type of unsupervised learning, which is
used when you have unlabeled data (i.e., data
without defined categories or groups).
The goal of this algorithm is to find groups in the
data, with the number of groups represented by the
variable K.
K-means 13
The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
Data points are clustered based on feature similarity.
The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to
label new data
Labels for the training data (each data point is assigned
to a single cluster)
K-means algorithm 14
It works by choosing the number of clusters k
The k centroids are initialized randomly
The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster.
The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
K-means algorithm 15
Partition data into K sets
Initialize: choose K centres (at random)
Repeat:
1. Assign points to the nearest centre
2. New centre = mean of points assigned to it
Until no change
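A minimal NumPy sketch of this loop (illustrative only; the random initialization from data points and the convergence check are assumptions, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize: choose K centres at random from the data points
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: new centre = mean of the points assigned to it
        # (assumes no cluster ends up empty)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):   # stop when nothing changes
            break
        centres = new_centres
    return centres, labels
```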
Example 16
Pros and Cons 17
Sensitive to the predefined k
Sensitive to outliers
Determining the number of clusters k 18
Elbow method
One method is to try many different values of k and plot the average distance of data points from their respective centroids (the average within-cluster sum of squares) as a function of k.
Notice how increasing k from 1 to 2 drastically lowers the average distance of each data point to its respective centroid, but as you continue increasing k the improvements begin to decrease asymptotically.
The ideal k value is found at the elbow of such a plot.
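A sketch of producing such an elbow plot with scikit-learn's KMeans (the data X and the range of k values are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder data; use your own feature matrix

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```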
Strengths of k-means 19
Simple: easy to understand and to implement
Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
Since both k and t are small, k-means is considered
a linear algorithm.
Weaknesses of k-means 20
The algorithm is only applicable if the mean is defined.
The user needs to specify k.
The algorithm is sensitive to outliers
Outliers are data points that are very far away from other data points.
Outliers could be errors in the data recording or some
special data points with very different values.
The algorithm is sensitive to initial seeds.
The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
Hierarchical Clustering 21
Creates a tree-like structure (dendrogram) of
clusters, either by merging smaller clusters
(agglomerative) or splitting larger ones (divisive)
The most common one is agglomerative clustering (bottom-up), where each data point starts as its own cluster and clusters are merged iteratively
Need to specify the number of clusters (or a cut height) to extract a flat clustering from the dendrogram
Example of Calculation: Agglomerative Clustering 22
Dataset: customer spending habits (annual income vs. spending score)

Customer   Annual Income (k$)   Spending Score (1-100)
A          15                   39
B          15                   81
C          16                   6
D          16                   77
E          17                   40
Example of Calculation: Agglomerative Clustering 23
Step 1: Compute pairwise distances
Use Euclidean distance to measure similarity
E.g.: Distance(A, B) = √((15-15)² + (81-39)²) = 42.00
Repeat for all pairs:

      A       B      C      D
B     42.00
C     33.02   75.0
D     38.01   4.1    71.0
E     2.24    41.0   34.0   37.0
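A short SciPy sketch that reproduces this distance matrix (the array layout and rounding are assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Columns: annual income (k$), spending score (1-100)
customers = ["A", "B", "C", "D", "E"]
X = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40]])

D = squareform(pdist(X, metric="euclidean"))  # full 5x5 distance matrix
print(np.round(D, 2))                         # e.g. D[0, 4] is the A-E distance, about 2.24
```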
Example of Calculation: Agglomerative Clustering 24
Step 2: Merge the closest clusters
Initial clusters: {A}, {B}, {C}, {D}, {E}
First merge
E.g.: The two closest points (A and E, with distance 2.24) → new cluster {A, E}.
Example of Calculation: Agglomerative Clustering 25
Step 3: Update distances
Use a linkage criterion to measure cluster distances:
Single linkage: distance between the closest points of two clusters.
E.g.: Distance({A, E}, C) = min[d(A,C), d(E,C)] = min[33.02, 34.0] ≈ 33
Complete linkage: distance between the farthest points.
Average linkage: average distance between all pairs.
Example of Calculation: Agglomerative Clustering 26
Step 4: Repeat merging
Merge the next closest clusters (e.g., {B} and {D} with distance 4.1).
Continue iteratively merging the two closest clusters at each step until only one remains.
If you set the number of clusters to 2, the algorithm stops once it reaches two clusters. Otherwise, it keeps combining clusters until only one remains.
How to decide the cut?
Domain knowledge (e.g., there should be two customer segments)
Statistical metrics (e.g., a large gap in the merging distances)
Example of Calculation: Agglomerative Clustering 27
Example of iteration:

Step   Clusters Before Merging     Action                                Clusters After Merging
1      {A}, {B}, {C}, {D}, {E}     Merge closest ({A} and {E})           {B}, {C}, {D}, {A,E}
2      {B}, {C}, {D}, {A,E}        Merge next closest ({B} and {D})      {C}, {A,E}, {B,D}
3      {C}, {A,E}, {B,D}           Merge next closest ({C} and {A,E})    {A,C,E}, {B,D}   <- cut here for two clusters
4      {A,C,E}, {B,D}              Merge last two clusters               {A,B,C,D,E} (single root)
Example of Calculation: Agglomerative Clustering 28
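A sketch that reproduces this example with SciPy's hierarchical clustering (single linkage, cut at two clusters); the dendrogram plot is optional:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

customers = ["A", "B", "C", "D", "E"]
X = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40]])

Z = linkage(X, method="single", metric="euclidean")   # merge history
labels = fcluster(Z, t=2, criterion="maxclust")       # cut the tree into 2 clusters
print(dict(zip(customers, labels)))                   # A, C, E in one cluster; B, D in the other

dendrogram(Z, labels=customers)                       # visualise the merges
plt.show()
```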
Model-based Clustering 29
Assumes the data is generated from a probabilistic model
These are typically soft clustering techniques
E.g.:
Gaussian distribution (univariate)
Multivariate Gaussians
Gaussian Distribution 30
Gaussian / normal distribution / bell curve / probability density function (pdf). The function that describes the normal distribution is:
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
where \mu is the mean and \sigma^2 is the variance.
Multivariate Gaussians 31
For d dimensions, the Gaussian distribution of a vector x is defined by:
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)
where \mu is the mean and \Sigma is the covariance matrix of the Gaussian. The covariance matrix captures correlations between features.
Expectation-Maximization (EM) Algorithm 32
A general framework for fitting probabilistic models
Commonly used in the Gaussian Mixture Model (GMM) to optimize parameters and infer soft clusters
E-step: compute soft assignments of the points, using the current parameters
M-step: update the parameters based on the E-step
Expectation-Maximization (EM) Algorithm 33
Probabilistic assignment to clusters (Expectation step)
After initializing K random Gaussian models, we can calculate our expectation of \gamma_{ik}, the probability that x_i belongs to the k-th cluster, for k = 1 to K, given the Gaussian parameters (i.e., mean \mu_k, covariance \Sigma_k, and weight \pi_k):
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
The denominator normalizes the probabilities to sum to 1 across all clusters.
Expectation-Maximization (EM) Algorithm 34
Probabilistic assignment to clusters (Expectation step)
As mentioned earlier, we don't know the true probabilistic cluster assignments for x_i, so we'll start with a guess and iteratively refine it.
We can calculate the probability of x_i belonging to cluster k using the probability density function of the Gaussian distribution.
Expectation-Maximization (EM) Algorithm 35
Reformulating the Gaussian models (Maximization step)
In other words, update the Gaussian parameters (i.e., mean \mu_k, covariance \Sigma_k, and weight \pi_k) using these probabilities (from the expectation step).
We'll then recalculate our Gaussian models, leveraging the weights we found in the expectation step.
The expectation, \gamma_{ik}, represents the likelihood that the i-th observation belongs to cluster k.
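For reference, the standard GMM M-step updates implied by this description are (textbook form with N observations; not copied from the slides):

```latex
N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top
```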
Expectation-Maximization (EM) Algorithm 36
Defining a stopping criterion
With k-means clustering, we iteratively recalculated means and reassigned observations until convergence, when observations stopped moving between clusters.
However, because we're now dealing with soft clustering and continuous probabilities, we can't rely on this kind of convergence.
Instead, we'll set a stopping criterion to end the iterative cycle once the observation probabilities stop changing by more than some threshold.
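A sketch of the full EM loop using scikit-learn's GaussianMixture (the synthetic data, number of components, and tolerance are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder data: two Gaussian blobs in 2-D
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([5, 5], 1.5, size=(100, 2))])

# EM under the hood: E-step (soft assignments) and M-step (parameter updates),
# repeated until the improvement in log-likelihood drops below `tol`.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      tol=1e-3, max_iter=200, init_params="kmeans",
                      random_state=0).fit(X)

resp = gmm.predict_proba(X)        # soft cluster assignments (responsibilities)
print(gmm.weights_, gmm.means_)    # learned mixture weights and component means
```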
Practicalities 37
1. Initialization:
- EM is an iterative algorithm which is very sensitive
to initial conditions:
Start from trash → end up with trash
- Usually, we use k-means to get a good initialization.
2. Number of Gaussians: the choice of k
3. EM can converge to a local rather than a global optimum of the likelihood
Soft Clustering: Gaussian Mixture Model (GMM) 38
Gaussian mixture models are a probabilistic model for representing
normally distributed subpopulations within an overall population.
Mixture models in general don't require knowing which
subpopulation a data point belongs to, allowing the model to learn
the subpopulations automatically.
Formally a mixture model corresponds to the mixture distribution
that represents the probability distribution of observations in the
overall population.
Since subpopulation assignment is not known, this constitutes a
form of unsupervised learning.
When to use GMM 39
Assumes Gaussian clusters
Handles ellipsoidal clusters and correlations between features
Slow due to covariance calculations
Hard vs. Soft/Fuzzy Clustering 40
Hard Clustering:
Each point belongs to exactly one cluster.
Example:
k-means assigns a point to the nearest centroid.
k-medoids groups data points into distinct clusters by identifying representative data points; it selects real data points (medoids) as cluster centers instead of means.
Pros: Simple, deterministic.
Cons: Unsuitable for overlapping clusters.
Hard vs. soft assignments 41
In K-means, there is a hard assignment of vectors to a cluster
However, for vectors near the boundary this may be a poor
representation
Instead, we can consider a soft assignment, where the strength of the assignment depends on distance.
To address these problems, use soft clustering, e.g. the Gaussian Mixture Model.
Soft clustering is a form of clustering where observations may
belong to multiple clusters.
Soft Clustering: Fuzzy C-means 42
Essentially, the process goes as follows:
1. Identify the number of clusters you'd like to split the dataset
into.
2. Define each cluster by generating a Gaussian model.
3. For every observation, calculate the probability that it
belongs to each cluster (e.g. Observation 23 has a 21%
chance that it belongs to Cluster A, a 0.1% chance that it
belongs to Cluster B, a 48% chance of Cluster C, ... and so
forth).
4. Using the above probabilities, recalculate the Gaussian
models.
5. Repeat until observations converge on their assignments.
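For comparison, a minimal sketch of the classical fuzzy c-means updates (membership exponent m, random initial memberships); note that this standard formulation uses distances to centres rather than the Gaussian models described in the steps above:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iters=100, tol=1e-5, seed=0):
    """Classical fuzzy c-means: returns centres and an (n, c) membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                    # memberships sum to 1 per point
    for _ in range(max_iters):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]   # fuzzy-weighted means
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U_new = dist ** (-2.0 / (m - 1.0))               # closer centres get larger membership
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:                # stop when memberships settle
            return centres, U_new
        U = U_new
    return centres, U
```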
When to use Fuzzy C Means 43
When there are no distributional assumptions
Works well with spherical clusters
Faster computation
Probability distribution – Anomaly 44
The chance of obtaining a data value (or range of values)
The normal (Gaussian) distribution is the probability
distribution most commonly used to model data, and it is
useful for anomaly detection
It can identify anomalies as low probability events
Can you see the anomaly? If so, what is its value?
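A minimal sketch of this idea, fitting a single Gaussian to the data and flagging low-probability values (the data and the threshold are placeholders):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=200), 95.0)  # one injected anomaly

mu, sigma = data.mean(), data.std()               # fit a Gaussian to the data
densities = norm.pdf(data, loc=mu, scale=sigma)   # probability density of each value

threshold = 1e-4                                  # placeholder: tune on held-out data
anomalies = data[densities < threshold]           # low-probability events
print(anomalies)                                  # expected to flag the injected value 95.0
```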
Anomaly Detection 45
Data that differs a lot from the rest.
An anomaly is an observation which deviates so
much from other observations as to arouse
suspicions that it was generated by a different
mechanism. (Hawkins 1980)
Also called “abnormality” or “deviant.”
“Outlier” is also used as a synonym, but here
we will use a more precise definition.
Anomalies are a subset of outliers (Aggarwal
2013)
All observations = normal data + outliers
Types of anomalies 46
Point anomalies: an individual data point seems
strange when compared with the rest of the data.
Example: an unusually large credit card purchase
Contextual anomalies: the data seems strange in a
specific context, but not otherwise. Example: a US
credit card holder makes a purchase in Japan
Collective anomalies: a collection of data points seems strange when compared with the entire dataset, although each point may be OK. Example: ten consecutive credit card purchases for a sandwich at hourly intervals
Applications and use cases 47
Fraud detection in credit card purchases
Intrusion detection in computer networks
Fault detection in mechanical equipment
Earthquake warning
Automated surveillance
Monitoring gene expression for cancer classification
Detect fake social media accounts
and many more
Anomaly detection: the fundamental idea 48
The approach used by almost all anomaly detection
algorithms
Create a model for what normal data should look like
Calculate a score for each data point that measures how
far from normal it is
If the score is above a previously specified threshold, classify the point as an anomaly
Devising an appropriate model and score is essential
Note: “normal” is used here in the sense of “typical” or
“usual,” which may or may not be related to the normal
distribution.
Anomaly detection: modelling the data 49
The approach you take depends on what you know
If you have examples of normal or anomalous data, you
can use this information to find anomalies
Supervised anomaly detection
If you don’t have any prior information about normal or
anomalous data, you have to use a different approach
Unsupervised anomaly detection
Requires probability and statistics to look for
anomalies
K-means and anomaly detection 50
Anomaly scoring
Use distances to score anomalies
– distance from cluster centroid
– distance from cluster edge
– distances to each cluster
choice of k affects the results
– sometimes an external constraint/domain knowledge helps
make the choice
Initial choice of centroids can also affect results
– To mitigate this problem, average over multiple runs
While k-means always converges in a finite number of steps, the
final clustering is not necessarily optimal. Non-optimality will
affect anomaly detection
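A sketch of centroid-distance anomaly scoring with scikit-learn's KMeans (the data, k, and the percentile threshold are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(6, 1, size=(200, 2)),
               [[3.0, 12.0]]])                      # one injected anomaly

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Score = distance from each point to its assigned cluster centroid
centroids = km.cluster_centers_[km.labels_]
scores = np.linalg.norm(X - centroids, axis=1)

threshold = np.percentile(scores, 99)               # placeholder threshold
print(np.where(scores > threshold)[0])              # indices flagged as anomalies
```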