CS 60050
Machine Learning
Clustering
Some material borrowed from course materials of Andrew Ng and Jing Gao
Unsupervised learning
• Given a set of unlabeled data points / items
• Find patterns or structure in the data
• Clustering: automatically group the data points / items
into groups of ‘similar’ or ‘related’ points
• Main challenges
– How to measure similarity?
– What is the ideal number of clusters? A few larger clusters, or
many smaller clusters?
Motivations for Clustering
• Understanding the data better
– Grouping Web search results into clusters, each of which
captures a particular aspect of the query
– Segmenting the market or customers of a service
• As precursor for some other application
– Summarization and data compression
– Recommendation
Different types of clustering
• Partitional
– Divide set of items into non-overlapping subsets
– Each item will be member of one subset
• Overlapping
– Divide set of items into potentially overlapping subsets
– Each item can simultaneously belong to multiple subsets
Different types of clustering
• Fuzzy
– Every item belongs to every cluster with a membership
weight between 0 (absolutely does not belong) and 1
(absolutely belongs)
– Usual constraint: sum of weights for each individual item
should be 1
– Convert to partitional clustering: assign every item to that
cluster for which its membership weight is highest
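A minimal sketch of this conversion, assuming a NumPy array W of shape (n_items, n_clusters) holding hypothetical membership weights (rows summing to 1):

import numpy as np

# Hypothetical fuzzy membership weights for 3 items and 3 clusters
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.4, 0.5, 0.1]])

hard_labels = W.argmax(axis=1)   # cluster with the highest membership weight per item
print(hard_labels)               # [0 2 1]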
Different types of clustering
• Hierarchical
– Set of nested clusters, where one larger cluster can contain
smaller clusters
– Organized as a tree (dendrogram): leaf nodes are singleton
clusters containing individual items, each intermediate
node is union of its children sub-clusters
– A sequence of partitional clusterings – cut the dendrogram
at a certain level to get a partitional clustering
An example dendrogram
Different types of clustering
• Complete vs. partial
– A complete clustering assigns every item to one or more
clusters
– A partial clustering may not assign some items to any
cluster (e.g., outliers, items that are not sufficiently similar
to any other item)
Types of clustering methods
• Prototype-based
– Each cluster defined by a prototype (centroid or medoid),
i.e., the most representative point in the cluster
– A cluster is the set of items that are each closer (more
similar) to this cluster's prototype than to the prototype of
any other cluster
– Example method: K-means
Types of clustering methods
• Density-based
– Assumes items distributed in a space where ‘similar’ items
are placed close to each other (e.g., feature space)
– A cluster is a dense region of items that is surrounded by a
region of low density
– Example method: DBSCAN
Types of clustering methods
• Graph-based
– Assumes items represented as a graph/network where
items are nodes, and ‘similar’ items are linked via edges
– A cluster is a group of nodes having more and/or better
connections among its members than between its
members and the rest of the network
– Also called ‘community structure’ in networks
– Example method: Algorithm by Girvan and Newman
We are applying clustering
in this lecture itself.
How?
K-means clustering
K-means
• Prototype-based, partitioning technique
• Finds a user-specified number of clusters (K)
• Each cluster represented by its centroid item
• There are extensions where the number of clusters is not
needed as input
K-means algorithm
Randomly initialize K cluster centroids μ1, …, μK
Repeat {
  Cluster assignment step:
    for i = 1 to m
      c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
  Move centroid step:
    for k = 1 to K
      μk := average (mean) of points assigned to cluster k
}
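A minimal NumPy sketch of the two steps above (a hypothetical kmeans function, assuming an (m, d) data matrix X; real implementations add convergence checks and better initialization such as k-means++):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize the K centroids by picking K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: mean of the points assigned to each cluster
        # (keep the old centroid if a cluster ends up empty)
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids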
Optimization in K-means
• Consider data points in Euclidean space
• A measure of cluster quality: Sum of Squared Error (SSE)
– Error of each data point: Euclidean distance of the point to its
closest centroid
– SSE: total sum of the squared error for each point
– Will be minimized if the centroid of a cluster is the mean of all
data points in that cluster
• The steps of K-means minimize SSE (finding a local minimum)
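Written out (with C_k denoting cluster k, \mu_k its centroid, and dist the Euclidean distance), the objective is

SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \operatorname{dist}(x, \mu_k)^2 = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

For a fixed assignment, setting \partial \,SSE / \partial \mu_k = 0 gives \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x, i.e., the mean of the cluster's points, which is why the centroid update step cannot increase SSE.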
Choosing value of K
• Based on domain knowledge about suitable number of
clusters for a particular problem domain
• Alternatively, based on some measure of cluster quality, e.g.,
run K-means for different values of K and choose the value
beyond which SSE stops decreasing substantially (SSE itself
keeps decreasing as K grows, so the raw minimum alone is
not informative)
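A hedged sketch with scikit-learn (assuming the data sit in a NumPy array X): run K-means over a range of K and record the SSE, then look for the "elbow" in the resulting curve.

from sklearn.cluster import KMeans

sse = {}
for K in range(2, 11):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    sse[K] = km.inertia_   # sum of squared distances of points to their closest centroid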
Choosing initial centroids
• Can be selected randomly, but can lead to poor clustering
• Perform multiple runs, each with a different set of randomly
chosen initial centroids, and select that configuration that
yields minimum SSE
• Use domain knowledge to choose centroids, e.g., while
clustering search results, select one search result relevant to
each aspect of the query
Importance of choosing initial centroids well
[Figure: the same set of original points clustered by K-means from two different random initializations; one run converges to the optimal clustering, the other to a sub-optimal clustering]
Similarity/closeness between items
• Measure of similarity/closeness between items depends on
the problem domain
• Will be performed many times over the course of the
algorithm, hence needs to be efficient
• Examples
– Points in Euclidean space → Euclidean distance
– Text documents → cosine similarity between term-vectors
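As a small illustration (plain NumPy; the document vectors would typically be TF-IDF term-vectors):

import numpy as np

def euclidean_distance(a, b):
    # Distance between two points in Euclidean space
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Cosine of the angle between two term-vectors (1 = identical direction)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))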
Reducing SSE with post-processing
• Finding more clusters will reduce SSE, but sometimes we want
to improve SSE without increasing clusters
• K-means finds a local minimum; look for another “nearby”
clustering with lower SSE (if one exists)
Reducing SSE with post-processing
• Techniques used
– Splitting a cluster, e.g., the cluster with highest SSE, or the
cluster with highest standard deviation of a chosen feature
– Merging two clusters, e.g., the clusters with the closest
centroids
Known problem of K-means
• Sensitive to outliers that can change the distribution of
the clusters
– A solution: K-medoids: instead of taking the mean value of
the points in a cluster, use the medoid, i.e., the most
centrally located point in the cluster
• Detected clusters are usually globular (spherical) in
shape; problems in detecting arbitrary-shaped clusters
Hierarchical clustering
Hierarchical clustering
• Bottom-up or Agglomerative clustering
– Start considering each data point as a singleton cluster
– Successively merge the most similar (closest) pair of clusters
– Until all points have been merged into a single cluster
• Top-down or Divisive clustering
– Start with all data points in a single cluster
– Iteratively split clusters into smaller sub-clusters if the
similarity between two sub-parts is low
Both Divisive and Agglomerative clustering can
be represented as a Dendrogram
Basic agglomerative hierarchical
clustering algorithm
• Start with each item in a singleton cluster
• Compute the proximity/similarity matrix between clusters
• Repeat
– Merge the closest/most similar two clusters
– Update the proximity matrix to reflect proximity between
the new cluster and the other clusters
• Until only one cluster remains
Proximity/similarity between clusters
• MIN similarity between two clusters: Proximity (similarity)
between the closest (most similar) two points, one from each
cluster (minimum pairwise distance)
• MAX similarity between two clusters: Proximity between
the farthest two points, one from each cluster (maximum
pairwise distance)
• Group average similarity: average pairwise proximity of all
pairs of points, one from each cluster
Types of hierarchical clustering
• Complete linkage
– Cluster proximity = distance between the farthest pair of
points, one from each cluster (MAX)
– Merge in each step the two clusters with the smallest such
maximum distance
• Single linkage
– Cluster proximity = distance between the closest pair of
points, one from each cluster (MIN)
– Merge in each step the two clusters with the smallest such
minimum distance
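A hedged sketch with SciPy (assuming an (n, d) NumPy array X; Euclidean distances by default): 'single', 'complete', and 'average' correspond to the MIN, MAX, and group-average definitions above, and cutting the resulting dendrogram yields a flat partitional clustering.

from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

Z = linkage(X, method='single')                   # or 'complete', 'average'
labels = fcluster(Z, t=4, criterion='maxclust')   # cut the dendrogram into 4 clusters
# dendrogram(Z) draws the tree (requires matplotlib)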
A divisive graph-based
clustering algorithm
A graph-based hierarchical clustering
algorithm
• A cluster is a group of nodes having more and/or better
connections among its members than between its
members and the rest of the network
• Cluster in graphs/networks: also called community
structure
• Algorithm by Girvan and Newman: Community
structure in social and biological networks, PNAS 2002
Girvan-Newman algorithm
• Focus on edges / links that are most ‘between’ clusters
• Edge betweenness of an edge e : the fraction of shortest
paths, over all pairs of nodes, that pass through e
Girvan-Newman algorithm
• Edges between clusters/communities are likely to have
high betweenness centrality
• Progressively remove edges having high betweenness
centrality, to separate clusters from one another
Girvan-Newman algorithm
Girvan-Newman algorithm
1. Compute betweenness centrality for all edges
2. Remove the edge with highest betweenness centrality
3. Re-compute betweenness centrality for all edges affected by
the removal
4. Repeat steps 2 and 3 until no edges remain
Results in a hierarchical clustering tree (dendrogram)
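A minimal sketch of one divisive step, assuming an undirected NetworkX graph G (NetworkX also provides a ready-made girvan_newman generator in networkx.algorithms.community):

import networkx as nx

def girvan_newman_step(G):
    # Remove highest-betweenness edges until the graph splits into more components
    H = G.copy()
    target = nx.number_connected_components(G) + 1
    while H.number_of_edges() > 0 and nx.number_connected_components(H) < target:
        betweenness = nx.edge_betweenness_centrality(H)   # steps 1 / 3
        edge = max(betweenness, key=betweenness.get)      # edge with highest centrality
        H.remove_edge(*edge)                              # step 2
    return list(nx.connected_components(H))

For brevity the sketch recomputes betweenness for all edges rather than only the affected ones; repeating the step (or iterating NetworkX's girvan_newman generator) produces the full dendrogram of splits.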
Density-based clustering
Density based clustering methods
• Locates regions of high density that are separated
from one another by regions of low density
DBSCAN
• DBSCAN: Density Based Spatial Clustering of
Applications with Noise
– Proposed by Ester et al. in SIGKDD 1996
– First algorithm for detecting density-based clusters
• Advantages (e.g., over K-means)
– Can detect clusters of arbitrary shapes (while clusters
detected by K-means are usually globular)
– Robust to outliers
DBSCAN: intuition
• For any point in a cluster, the local point density
around that point has to exceed some threshold
• The set of points in one cluster is spatially connected
• Local point density at a point p defined by two
parameters
• ε : radius for the neighborhood of point p:
Nε (p) := {q in data set | dist(p, q) ≤ ε}
• MinPts : minimum number of points in the given
neighborhood Nε (p)
Neighborhood of a point
• ε-Neighborhood of a point p : Points within a radius
of ε from the point p
• “High density”: if ε-Neighborhood of a point contains
at least MinPts number of points
[Figure: the ε-Neighborhood of p contains at least MinPts = 4 points, so the density of p is “high”; the ε-Neighborhood of q contains fewer, so the density of q is “low”]
Divide points into three types
• Core point: a point that has at least MinPts points within its
ε-Neighborhood (points that are in the interior of a cluster)
• Border point: a point that has fewer than MinPts points within
its ε-Neighborhood (not a core point), but falls within the
ε-Neighborhood of a core point
• Outlier point: any point that is neither a core point nor a
border point
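A minimal NumPy sketch of this three-way classification (a hypothetical classify_points helper; assumes Euclidean distance and counts a point inside its own ε-Neighborhood):

import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps                      # ε-Neighborhood membership matrix
    is_core = neighbors.sum(axis=1) >= min_pts    # at least MinPts points in the neighborhood
    # Border: not core, but inside the ε-Neighborhood of some core point
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    is_outlier = ~is_core & ~is_border
    return is_core, is_border, is_outlier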
Density-Reachability
• Directly density-reachable: A point q is directly
density-reachable from object p if p is a core point
and q is in p’s ε-neighborhood.
[Figure: q is directly density-reachable from p, but p is not directly density-reachable from q (MinPts = 4)]
Density-reachability is not symmetric
Density-Reachability
• Density-reachability can be direct or indirect
– Point p is directly density-reachable from p2
– p2 is directly density-reachable from p1
– p1 is directly density-reachable from q
– p ← p2 ← p1 ← q form a chain
[Figure, MinPts = 7: a chain of points q, p1, p2, p in which p is (indirectly) density-reachable from q, but q is not density-reachable from p]
DBSCAN algorithm
Input: The data set D
Parameters: ε, MinPts
for each point p in D
if p is a core point and not processed then
C = {all points density-reachable from p}
mark all points in C as processed
report C as a cluster
else
mark p as outlier (border points are re-assigned when a core point that reaches them is processed)
end if
end for
Understanding the algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. ε and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
(each point marked as either core or border or outlier)
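In practice a ready-made implementation is typically used; a minimal scikit-learn sketch (assuming a data matrix X; eps and min_samples play the roles of ε and MinPts, with the point itself counted among its neighbors):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
labels = db.labels_                   # cluster index per point; -1 marks noise/outliers
core_idx = db.core_sample_indices_    # indices of the core points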
When DBSCAN works well
[Figure: the original points and the clusters detected by DBSCAN]
• Resistant to noise / outliers (note: partial clustering)
• Can handle clusters of different shapes and sizes
• Number of clusters identified automatically
When DBSCAN does not work well
• Cannot identify clusters of widely varying densities (a single
setting of ε and MinPts does not fit all of them)
• Sensitive to the parameters ε and MinPts
Resources on Clustering (free on web)
• Data Clustering: Algorithms and Applications
– Book by Charu Aggarwal, Chandan Reddy
• Community Detection in graphs
– Survey paper by Santo Fortunato