Lect 10 - Unsupervised Learning

The document discusses unsupervised learning, focusing on clustering algorithms and their applications, such as grouping data without labels to identify patterns. It covers various clustering methods, including k-means and hierarchical clustering, along with their strengths and weaknesses. Additionally, it introduces model-based clustering techniques like Gaussian Mixture Models and the Expectation-Maximization algorithm for probabilistic assignments.


Unsupervised Learning 1
CHAPTER 10
“PREDICTING THE FUTURE ISN’T MAGIC, IT’S ARTIFICIAL INTELLIGENCE.”
~DAVE WATERS
Unsupervised Learning 2

 Do you have labelled data? → NO
 Do you want to group the data? → YES
   Then look for clustering algorithms.
 Do you want to group the data? → NO
   Then look for dimensionality reduction algorithms.
Unsupervised Learning 3

 Data: we have input data x, but no labels y!
 Goal: Learn some underlying hidden
structure (interesting pattern or properties)
of the data
 Examples: Clustering, dimensionality
reduction, feature learning, density
estimation, etc.
Unsupervised Learning: Introduction 4
 Second major problem type
 In unsupervised learning, we get unlabeled data
 One way of handling such data is to cluster it into groups (clustering algorithm)
 Example of clustering algorithm
Unsupervised learning 5

 Examples of applications:
 Organize computer clusters
   Identify potential weak spots or distribute workload effectively
 Social network analysis
   Customer data
 Identify unknown subtypes of disease among tissue samples
 Astronomical data analysis
   Algorithms give amazing results
Unsupervised learning 6
 Google News
   Groups news stories into cohesive groups
 Document clustering: identify sets of documents about the same topic
 Used in many other problems as well
   Genomics
   Microarray data
   Have a group of individuals
   For each individual, measure the expression of genes
   Run the algorithm to cluster individuals into types of people
Challenges of Unsupervised Learning 7
 Exploratory data analysis — goal is not always
clearly defined
 Difficult to assess performance — “right answer”
unknown
 Working with high-dimensional data
Clustering 8

 A set of methods for finding subgroups within the dataset
 Observations should share common characteristics within the same group, but differ across groups
 Groupings are determined from attributes of the data itself, which differs from classification.
Clustering 9

[Figure: data points grouped into Cluster A, Cluster B, and Cluster C]
Types of Clustering 10

 Centroid-based clustering
 Hierarchical clustering
 Model-based clustering
 However, these categories overlap; for example, k-medoids is both a centroid-based and a model-based clustering technique.
 Clustering can also be categorised as hard vs. soft/fuzzy clustering.
Centroid-based clustering 11

 Organizes data into clusters by minimizing the distance between data points and the centroid (center) of their assigned cluster.
 The most common algorithm is k-means.
K-means 12

 K-means is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups).
 The goal of this algorithm is to find groups in the
data, with the number of groups represented by the
variable K.
K-means 13

 The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
 Data points are clustered based on feature similarity.
 The results of the K-means clustering algorithm are:
 The centroids of the K clusters, which can be used to
label new data
 Labels for the training data (each data point is assigned
to a single cluster)
K-means algorithm 14
 It works by choosing the number of clusters k
 k centroids is initialized randomly
 The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is
assigned to its nearest centroid, based on the squared Euclidean distance.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean
of all data points assigned to that centroid's cluster.
 The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
K-means algorithm 15
 Partition data into K sets
 Initialize: choose K centres (at random)
 Repeat:
   1. Assign points to the nearest centre
   2. New centre = mean of points assigned to it
 Until no change
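As a concrete reference, here is a minimal NumPy sketch of these two steps. The function name, random initialization, and stopping test are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize: choose k centres at random from the data points
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # 1. Assignment step: each point goes to its nearest centre (squared Euclidean distance)
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stopping criterion: no point changed cluster
        labels = new_labels
        # 2. Update step: new centre = mean of the points assigned to it
        # (assumes no cluster ends up empty; production code should handle that case)
        centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centres, labels

# Example usage on toy 2-D data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centres, labels = kmeans(X, k=2)
```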
Example 16
Pros and Cons 17

 Sensitive to the predefined k
 Sensitive to outliers
Determining the number of clusters k 18
 Elbow method
 One method would be to try many
different values of k and plot the
average distance of data points from
their respective centroid (average
within-cluster sum of squares) as a
function of k.
 Notice how increasing the k value from
1 to 2 drastically lowers the average
distance of each data point to its
respective centroid, but as you continue
increasing k the improvements begin to
decrease asymptotically.
 The ideal k value is found at the elbow
of such a plot.
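A minimal sketch of the elbow method using scikit-learn's KMeans; the toy data and the range of k values are illustrative assumptions. The `inertia_` attribute is the within-cluster sum of squares for the fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: (n_samples, n_features) data matrix; toy data here for illustration
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(ks, wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Average within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```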
Strengths of k-means 19

 Simple: easy to understand and to implement


 Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
 Since both k and t are small, k-means is considered
a linear algorithm.
Weaknesses of k-means 20

 The algorithm is only applicable if the mean is defined.


 The user needs to specify k.
 The algorithm is sensitive to outliers
 Outliers are data points that are very far away from other data points.
 Outliers could be errors in the data recording or some
special data points with very different values.
 The algorithm is sensitive to initial seeds.
 The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
Hierarchical Clustering 21

 Creates a tree-like structure (dendrogram) of clusters, either by merging smaller clusters (agglomerative) or splitting larger ones (divisive)
 Most common one is agglomerative clustering
(bottom up) where each data point starts as its own
cluster and merges iteratively
 Need to specify the number of clusters
Example of Calculation: Agglomerative Clustering 22

 Dataset: Customer spending habits (annual income vs spending score)

Customer | Annual Income (k$) | Spending Score (1-100)
A        | 15                 | 39
B        | 15                 | 81
C        | 16                 | 6
D        | 16                 | 77
E        | 17                 | 40
Example of Calculation: Agglomerative Clustering 23

 Step 1: Compute pairwise distances
 Use Euclidean distance to measure similarity
 Repeat for all pairs
 E.g.: Distance(A,B) = √((15 − 15)² + (39 − 81)²) = 42.00

    A      B      C      D
B   42.00
C   33.02  75.0
D   38.01  4.1    71.0
E   2.24   41.0   34.0   37.01
Example of Calculation: Agglomerative Clustering 24

 Step 2: Merge closest clusters
 Initial clusters: {A}, {B}, {C}, {D}, {E}
 First merge
 E.g.: The two closest points (A and E with distance 2.24) → New cluster {A, E}.
Example of Calculation: Agglomerative Clustering 25

 Step 3: Update distances
 Use linkage criteria to measure cluster distances:
 Single linkage: Distance between the closest points of two clusters.
   E.g.: Distance({C, E}, A) = min[d(C,A), d(E,A)] ≈ min[33, 2] = 2
 Complete linkage: Distance between the farthest points.
 Average linkage: Average distance between all pairs.
Example of Calculation: Agglomerative Clustering 26

 Step 4: Repeat merging
 Merge the next closest clusters (e.g., {B} and {D} with distance 4.1).
 Continue iteratively merging the two closest clusters at each step until only one remains
 Assume that you set the number of clusters = 2; then the clustering algorithm will stop when it reaches two clusters. Otherwise, it will keep combining clusters until only one remains
 How to decide the cut?
 Domain knowledge (e.g.: there should be two customer segments)
 Statistical metrics (e.g.: gap between merging distances)
Example of Calculation: Agglomerative Clustering 27

 Example of iteration:

Step | Clusters Before Merging  | Action                                    | Clusters After Merging
1    | {A}, {B}, {C}, {D}, {E}  | Merge closest (e.g., {A} and {E})         | {B}, {C}, {D}, {A,E}
2    | {B}, {C}, {D}, {A,E}     | Merge next closest (e.g., {B} and {D})    | {C}, {A,E}, {B,D}
3    | {C}, {A,E}, {B,D}        | Merge next closest (e.g., {C} and {A,E})  | {C,A,E}, {B,D}  ← Cut here for two clusters
4    | {C,A,E}, {B,D}           | Merge the last two clusters               | {A,B,C,D,E} (single root)
Example of Calculation: Agglomerative Clustering 28
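The same calculation can be reproduced with SciPy's hierarchical clustering utilities. This is a sketch; the customer labels, single linkage, and the cut at two clusters mirror the worked example above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Customer data from the worked example: (annual income k$, spending score)
labels = ["A", "B", "C", "D", "E"]
X = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40]])

# Agglomerative clustering with single linkage on Euclidean distances
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram so that two clusters remain
assignments = fcluster(Z, t=2, criterion="maxclust")
for name, c in zip(labels, assignments):
    print(name, "-> cluster", c)

# Optional: plot the dendrogram (requires matplotlib)
# dendrogram(Z, labels=labels); plt.show()
```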
Model-based Clustering 29

 Assumes data is generated from a probabilistic model
 They are all soft clustering techniques
 E.g.:
   Gaussian distribution (univariate)
   Multivariate Gaussians
Gaussian Distribution 30
 Gaussian / normal distribution / bell curve /
probability density function (pdf). The function
that describes the normal distribution is the
following:
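For reference, the standard univariate Gaussian pdf with mean μ and variance σ² is:

$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$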
Multivariate Gaussians 31
 For d dimensions, the Gaussian distribution of a vector x is defined by the density shown below,
where μ is the mean vector and Σ is the covariance matrix of the Gaussian. The covariance matrix
captures correlations between features.
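In standard notation, that d-dimensional Gaussian density is:

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) $$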
Expectation-Maximization (EM) Algorithm 32

 A general framework for fitting probabilistic models
 Commonly integrated into the Gaussian Mixture Model (GMM) to optimize parameters and infer soft clusters
 E-step: Compute soft assignments of the points, using the current parameters
 M-step: Update the parameters based on the E-step
Expectation-Maximization (EM) Algorithm 33

 Probabilistic assignment to clusters (Expectation Step)
 After initializing k random Gaussian models, we can calculate our expectation γ_ik, the
probability that observation x_i belongs to the k-th cluster, for k = 1 to K, given the Gaussian
parameters (i.e. mean μ_k, covariance Σ_k and weight π_k)
 The denominator normalizes the probabilities to sum to 1 across all clusters
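In standard GMM notation, the responsibility computed in this E-step is:

$$ \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} $$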
Expectation-Maximization (EM) Algorithm 34

 Probabilistic assignment to clusters (Expectation Step)
 As mentioned earlier, we don't know the true probabilistic cluster assignments for x_i, so we'll
start with a guess and iteratively refine it.
 We can calculate the probability of x_i belonging to cluster k using the probability density
function of a Gaussian distribution.
Expectation-Maximization (EM) Algorithm 35

 Reformulating the Gaussian models (Maximization Step)
 In other words, update the Gaussian parameters (i.e. mean μ_k, covariance Σ_k and weight π_k)
using these probabilities (from the expectation step)
 We'll then recalculate our Gaussian models leveraging the weights we found in the
expectation step.
 The expectation, γ_ik, represents the likelihood that the i-th observation belongs to cluster k.
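The corresponding standard M-step updates, in the same notation, are:

$$ N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{n} \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{n} $$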
Expectation-Maximization (EM) Algorithm 36

 Defining a stopping criterion
 With k-means clustering, we iteratively recalculated means and reassigned observations until
convergence, when observations stopped moving between clusters.
 However, because we're now dealing with soft clustering with continuous probabilities, we can't
rely on this kind of convergence.
 Instead, we'll set a stopping criterion to end the iterative cycle once the observation
probabilities stop changing by more than some threshold.
Practicalities 37

1. Initialization:
- EM is an iterative algorithm which is very sensitive
to initial conditions:
Start from trash → end up with trash
- Usually, we use k-means to get a good initialization.
2. Number of Gaussians: Choice of k
3. Can converge to a local rather than a global optimum
Soft Clustering: Gaussian Mixture Model (GMM) 38
 Gaussian mixture models are a probabilistic model for representing
normally distributed subpopulations within an overall population.
 Mixture models in general don't require knowing which
subpopulation a data point belongs to, allowing the model to learn
the subpopulations automatically.
 Formally a mixture model corresponds to the mixture distribution
that represents the probability distribution of observations in the
overall population.
 Since subpopulation assignment is not known, this constitutes a
form of unsupervised learning.
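A minimal sketch using scikit-learn's GaussianMixture, which runs EM internally. The toy two-subpopulation data is an illustrative assumption; predict_proba returns the soft cluster assignments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two normally distributed subpopulations
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([4, 4], 1.5, size=(100, 2))])

# Fit a 2-component GMM; EM is initialized with k-means by default
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignment: most likely component
soft_labels = gmm.predict_proba(X)  # soft assignment: probability of each component
print(gmm.means_)                   # estimated component means
print(soft_labels[:3])
```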
When to use GMM 39

 Assumes Gaussian clusters
 Handles ellipsoidal clusters and feature correlations
 Slower due to covariance calculations
Hard vs. Soft/Fuzzy Clustering 40

 Hard Clustering:
 Each point belongs to exactly one cluster.
 Example:
 k-means assigns a point to the nearest centroid.
 k-medoids aims to group data points into distinct clusters by
identifying representative data points
 selects real data points (medoids) as cluster centers instead of
means
 Pros: Simple, deterministic.
 Cons: Unsuitable for overlapping clusters.
Hard vs. soft assignments 41
In K-means, there is a hard assignment of vectors to a cluster
 However, for vectors near the boundary this may be a poor
representation
 Instead, can consider a soft-assignment, where the strength of the
assignment depends on distance.
 To address these problems, use soft clustering - Gaussian
Mixture Model.
 Soft clustering is a form of clustering where observations may
belong to multiple clusters.
Soft clustering: Fuzzy C-means 42

 Essentially, the process goes as follows:


1. Identify the number of clusters you'd like to split the dataset
into.
2. Define each cluster by generating a Gaussian model.
3. For every observation, calculate the probability that it
belongs to each cluster (e.g. Observation 23 has a 21%
chance that it belongs to Cluster A, a 0.1% chance that it
belongs to Cluster B, a 48% chance of Cluster C, ... and so
forth).
4. Using the above probabilities, recalculate the Gaussian
models.
5. Repeat until observations converge on their assignments.
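Fuzzy c-means itself uses distance-based memberships rather than Gaussian densities. Below is a minimal from-scratch sketch of the usual centre/membership updates; the function name, fuzzifier m, random initialization, and tolerance are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iters=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch: c clusters, fuzzifier m > 1."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Initialize a random membership matrix U whose rows sum to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iters):
        # Update cluster centres as membership-weighted means
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # Update memberships from distances to the centres: u_ik ∝ d_ik^(-2/(m-1))
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        new_U = 1.0 / (d ** (2 / (m - 1)))
        new_U /= new_U.sum(axis=1, keepdims=True)
        shift = np.abs(new_U - U).max()
        U = new_U
        if shift < tol:  # stop when memberships barely change
            break
    return centres, U
```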
When to use Fuzzy C Means 43

 When there are no distributional assumptions
 Works well with spherical clusters
 Faster computation
Probability distribution – Anomaly 44
 The chance of obtaining a data value (or range of values)
 The normal (Gaussian) distribution is the probability
distribution most commonly used to model data, and it is
useful for anomaly detection
 It can identify anomalies as low probability events

Can you see the anomaly? If so, what is its value?
Anomaly Detection 45

 Data that differs a lot from the rest.


 An anomaly is an observation which deviates so
much from other observations as to arouse
suspicions that it was generated by a different
mechanism. (Hawkins 1980)
 Also called “abnormality” or “deviant.”
 “Outlier” is also used as a synonym, but here
we will use a more precise definition.
 Anomalies are a subset of outliers (Aggarwal
2013)
 All observations = normal data + outliers
Types of anomalies 46

 Point anomalies: an individual data point seems


strange when compared with the rest of the data.
Example: an unusually large credit card purchase
 Contextual anomalies: the data seems strange in a
specific context, but not otherwise. Example: a US
credit card holder makes a purchase in Japan
 Collective anomalies: a collection of data points
seems strange when compared with entire dataset,
although each point may be OK. Example: ten
consecutive credit card purchases for a sandwich at
hourly intervals
Applications and use cases 47

 Fraud detection in credit card purchases


 Intrusion detection in computer networks
 Fault detection in mechanical equipment
 Earthquake warning
 Automated surveillance
 Monitoring gene expression for cancer classification
 Detect fake social media accounts
 and many more
Anomaly detection: the fundamental idea 48
 The approach used by almost all anomaly detection
algorithms
 Create a model for what normal data should look like
 Calculate a score for each data point that measures how
far from normal it is
 If the score is above a previously specified threshold, classify the point as an anomaly
 Devising an appropriate model and score is essential
 Note: “normal” is used here in the sense of “typical” or
“usual,” which may or may not be related to the normal
distribution.
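A minimal sketch of this idea using a Gaussian model of "normal" (typical) data; the synthetic training data and the 4-sigma threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Model: fit a Gaussian to training data assumed to be typical
train = np.random.normal(loc=50.0, scale=5.0, size=1000)
mu, sigma = train.mean(), train.std()

def anomaly_score(x):
    """Score = negative log-likelihood under the fitted Gaussian; higher means more anomalous."""
    return -norm.logpdf(x, loc=mu, scale=sigma)

threshold = anomaly_score(mu + 4 * sigma)  # illustrative choice: flag points beyond ~4 sigma
new_points = np.array([49.0, 52.0, 90.0])
for x in new_points:
    flag = "ANOMALY" if anomaly_score(x) > threshold else "normal"
    print(x, round(float(anomaly_score(x)), 2), flag)
```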
Anomaly detection: modelling the data 49
The approach you take depends on what you know
 If you have examples of normal or anomalous data, you
can use this information to find anomalies
 Supervised anomaly detection
 If you don’t have any prior information about normal or
anomalous data, you have to use a different approach
 Unsupervised anomaly detection
 Requires probability and statistics to look for
anomalies
K-means and anomaly detection 50
Anomaly scoring
 Use distances to score anomalies
– distance from cluster centroid
– distance from cluster edge
– distances to each cluster
 Choice of k affects the results
– sometimes an external constraint/domain knowledge helps
make the choice
 Initial choice of centroids can also affect results
– To mitigate this problem, average over multiple runs
 While k-means always converges in a finite number of steps, the
final clustering is not necessarily optimal. Non-optimality will
affect anomaly detection
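A minimal sketch of distance-to-centroid scoring with scikit-learn's KMeans; the toy data, the choice of k = 3, and the 95th-percentile threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Mostly "normal" data plus a few far-away points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8, 8], [-7, 9]]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Score each point by its distance to its assigned cluster centroid
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag the most distant points as anomalies (illustrative 95th-percentile threshold)
threshold = np.percentile(dists, 95)
anomalies = X[dists > threshold]
print(len(anomalies), "points flagged as anomalies")
```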
