Clustering
Introduction
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering is unsupervised classification:
no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
– As a preprocessing step for other algorithms
Examples of Clustering
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Financial applications: We might wish to find clusters of
companies that have similar financial performance
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• Medical Application: We might wish to find clusters of
patients with similar symptoms.
• Document retrieval: We might wish to find clusters of
documents with related content
Good Clustering
• A good clustering method will produce clusters with
– High intra-class similarity
– Low inter-class similarity
• Precise definition of clustering quality is difficult
– Application-dependent
– Ultimately subjective
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion. Examples: k-means, k-medoids
• Hierarchical: Create a hierarchical decomposition of the set
of objects using some criterion. Example: agglomerative clustering
• Model-based: Hypothesize a model for each cluster and
find the best fit of the models to the data. Example: Expectation
Maximization (EM)
• Density-based: Guided by connectivity and density
functions. Examples: DBSCAN, OPTICS, DenClue
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented
by the center of the cluster
– k-medoids or PAM (Partitioning Around Medoids) (Kaufman
& Rousseeuw, 1987): Each cluster is represented by one
of the objects in the cluster
K-Means Clustering: The Concept of the Center (Centroid)
Assuming that we are using Euclidean distance or something
similar as a measure, we can define the centroid of a cluster
to be the point for which each attribute value is the average
of the values of the corresponding attribute for all the points
in the cluster.
For example, the centroid of four points with six attributes is the
point whose six values are the attribute-wise averages of those points.
K-Means Clustering: The Concept of the Center (Centroid)
• The centroid of a cluster will sometimes be one of the
points in the cluster, but frequently, as in the above
example, it will be an ‘imaginary’ point, not part of the
cluster itself, which we can take as marking its center.
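As a rough illustration (the four points below are hypothetical values, not the ones from the original slide's table), the centroid can be computed as the attribute-wise mean:

```python
import numpy as np

# Hypothetical cluster of four points, each with six attributes.
cluster = np.array([
    [8.0, 7.2, 0.3, 23.1, 11.1, 6.5],
    [2.0, 3.4, 0.8, 24.2, 18.3, 4.7],
    [4.2, 4.1, 0.1, 20.7, 12.9, 6.1],
    [6.1, 5.0, 0.4, 22.0, 17.0, 5.9],
])

# The centroid is the column-wise (attribute-wise) mean: one value per attribute.
centroid = cluster.mean(axis=0)
print(centroid)  # a six-value point that is usually not one of the cluster's own objects
```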
K-Means Clustering
• Given k, the k-means algorithm consists of four steps:
– Select initial centroids at random.
– Assign each object to the cluster with the nearest
centroid.
– Compute each centroid as the mean of the objects
assigned to it.
– Repeat the previous two steps until nothing changes (see the sketch below).
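A minimal sketch of these four steps (hypothetical code, assuming the objects are rows of a NumPy array and ignoring the empty-cluster corner case):

```python
import numpy as np

def kmeans(points, k, seed=0):
    """Minimal k-means sketch: random initial centroids, then assign/recompute."""
    rng = np.random.default_rng(seed)
    # Step 1: select k of the objects at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the objects assigned to it.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

# Example use on 16 hypothetical 2-D objects:
# labels, centroids = kmeans(np.random.default_rng(1).random((16, 2)), k=3)
```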
K-Means Clustering
Distance Measurement
The distance between an object and a centroid is typically measured
with Euclidean distance: for two points p and q with m attributes,
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pm - qm)^2)
Example: K-Means Clustering
• We will illustrate the
k-means algorithm
by using it to cluster
the 16 objects with
two attributes x and
y, as shown.
• These points are shown in a
two-dimensional plane on the
next slide.
Example: K-Means Clustering
Example: K-Means Clustering
Three of the points shown in the
table have been surrounded by
small circles. We will assume that
we have chosen k = 3 and that these
three points have been selected to
be the locations of the initial three
centroids. This initial (fairly
arbitrary) choice is shown in Figure
on the previous slide.
Example: K-Means Clustering
Example: K-Means Clustering
The columns headed d1,
d2 and d3 in this table
show the Euclidean
distance of each of the 16
points from the three
centroids.
The column headed
‘cluster’ indicates the
centroid closest to each
point and thus the cluster
to which it should be
assigned.
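A rough sketch of how such a distance table can be computed (the coordinates below are hypothetical placeholders; the real 16 objects and initial centroids are those of the worked example):

```python
import numpy as np

# Hypothetical objects and three initial centroids (illustrative values only).
points = np.array([[1.0, 2.0], [8.0, 9.0], [2.5, 1.5], [7.5, 8.0]])
centroids = np.array([[1.5, 2.0], [8.0, 8.5], [5.0, 5.0]])

# d1, d2, d3: Euclidean distance from each object to each of the three centroids.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# 'cluster' column: the nearest centroid for each object (1-based, as in the table).
cluster = dists.argmin(axis=1) + 1

for p, d, c in zip(points, dists, cluster):
    print(p, d.round(2), c)
```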
Example: K-Means Clustering
The resulting clusters are shown in the figure. The centroids used so far
are actual objects within the clusters, so the previous centroids were not
true centroids of the clusters formed.
Example: K-Means Clustering
We next calculate the centroids of the three clusters using the
x and y values of the objects currently assigned to each one.
The three centroids have all been moved by the assignment
process, but the movement of the third one is appreciably less
than for the other two.
Example: K-Means Clustering
The next step is to reassign the 16 objects to the three clusters
by determining which centroid is closest to each one. This
gives the revised set of clusters as shown below. However, the
new centroids are not real objects (not actual data points). The object
at (8.3, 6.9) has moved from cluster 2 to cluster 1.
Example: K-Means Clustering
We next recalculate the positions of the three centroids, giving
the set of new centroids, as shown below. The first two
centroids have moved a little, but the third has not moved at
all.
Example: K-Means Clustering
The next step is to assign the 16 objects to clusters again.
These are the same clusters as before. Their centroids will be
the same as those from which the clusters were generated.
Hence the stopping criterion has been met.
Example: K-Means Clustering
The three clusters for the initially (randomly) chosen three
centroids have now been formed.
It is now clear that the formation of the clusters depends heavily
on the number of centroids (k) as well as on their initial
placement.
Choosing the best possible k
k-means has no built-in preference for the right number of
clusters; the following are some of the common ways k can be
selected.
1. Domain knowledge – If the problem requires or prefers a
certain number or range of clusters, that can be used to
select k. For instance, a business may prefer three customer
segments (H/M/L).
2. Rule of thumb – A very rough rule of thumb is k ≈ sqrt(n/2),
where n is the number of data points, but in reality this is
rarely useful. For the 16-object example above, this rule gives
k ≈ sqrt(16/2) ≈ 3.
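A tiny check of that arithmetic (assuming the sqrt(n/2) rule stated above):

```python
import math

n = 16                       # number of objects in the worked example
k = round(math.sqrt(n / 2))  # sqrt(16/2) is about 2.83, so the rule suggests k = 3
print(k)
```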
Choosing the best possible k
3. Cluster Quality using Silhouette Coefficient
The silhouette coefficient is a measure of the compactness and
separation of the clusters. It increases as the quality of the
clusters increases; it is large for compact clusters that are far
from each other and small for large, overlapping clusters.
The silhouette coefficient is calculated per instance; for a set of
instances, it is calculated as the mean of the individual samples'
scores. The silhouette coefficient for an instance is calculated
with the following equation:
Si = (bi - ai) / max(ai, bi)
where ai and bi are defined on the next slide.
Silhouette Coefficient
For the ith object in a cluster A, calculate its average distance
to all other objects in its cluster. This gives us ai.
For the ith object in cluster A and every other cluster B, calculate
the mean (average) distance from the ith object to all the objects
in cluster B. Find the minimum such value over all other clusters;
call this bi.
The silhouette coefficient for the ith object is then
Si = (bi - ai) / max(ai, bi)
The average of all the silhouette coefficients gives the quality
of the clustering.
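As a rough illustration of computing the mean silhouette coefficient for a clustering (a sketch assuming scikit-learn is available; the data below are hypothetical synthetic points):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D data drawn around three centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

# Cluster with k-means and report the mean silhouette coefficient
# (ranges from -1 to 1; higher means more compact, better-separated clusters).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))
```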
Choosing the best possible k
Elbow-Method (using Within Cluster Sum of Squares)
1. Run the clustering algorithm (e.g., k-means) for different
values of k, for instance varying k from 1 to 10.
2. For each k, calculate the total within-cluster sum of squares (WCSS).
3. Plot the curve of WCSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered
an indicator of the appropriate number of clusters (see the sketch below).
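A rough sketch of these steps (assuming scikit-learn and matplotlib, with hypothetical synthetic data; scikit-learn exposes the WCSS of a fitted model as inertia_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical 2-D data drawn around three centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.6, size=(30, 2)) for loc in ([0, 0], [6, 1], [3, 7])])

# Steps 1-2: run k-means for k = 1..10 and record the WCSS for each k.
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Steps 3-4: plot WCSS against k and look for the bend ("elbow").
plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```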
Advantages of K-Means Clustering
k-means clustering is popular and widely adopted due to
its simplicity and ease of implementation.
It is also efficient: its time complexity is O(ikn), where n is the
number of data points, k is the number of clusters, and i is the
number of iterations.
Disadvantages of K-Means Clustering
The value of k is always a user input.
The algorithm is applicable only when a mean can be computed;
for categorical data the centroids have to be replaced by
something like the most frequent values (modes).
The clusters identified are very sensitive to the initially
chosen centers.
k-means is very sensitive to outliers.
K-means variations
• K-medoids – instead of the mean, each cluster is represented
by a medoid, an actual object near the middle of the cluster;
the motivation is the same as preferring the median to the mean:
– Mean of 1, 3, 5, 7, 9 is 5
– Mean of 1, 3, 5, 7, 1009 is 205
– Median of 1, 3, 5, 7, 1009 is 5
– Median advantage: not affected by extreme
values
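A quick check of those numbers:

```python
import numpy as np

values = np.array([1, 3, 5, 7, 1009])
print(np.mean(values))    # 205.0 -- dragged upward by the extreme value
print(np.median(values))  # 5.0   -- unaffected by the extreme value
```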
k-Medoids (PAM) Algorithm
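A rough PAM-style sketch (a simplified greedy-swap version for illustration, not necessarily PAM's exact original procedure): each cluster is represented by a medoid, and a medoid is swapped with a non-medoid whenever the swap lowers the total distance of all objects to their nearest medoid.

```python
import numpy as np

def pam(points, k, seed=0):
    """Simplified PAM-style k-medoids: greedily swap medoids while the
    total distance of objects to their nearest medoid keeps decreasing."""
    rng = np.random.default_rng(seed)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    def cost(medoids):
        # Total distance of every object to its nearest medoid.
        return dist[:, medoids].min(axis=1).sum()

    # Start from k randomly chosen objects as the medoids.
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    labels = dist[:, medoids].argmin(axis=1)
    return labels, points[medoids]
```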
Problem with PAM
• PAM is more robust than k-means in the presence of
noise and outliers
• PAM works efficiently for small data sets but does not
scale well to large data sets
Sampling-based method:
CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications)
• CLARA (Kaufman and Rousseeuw, 1990)
• It draws multiple samples of the data set, applies PAM on
each sample, and returns the best clustering as the output
(a sketch of the sampling loop follows the list below)
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
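A rough sketch of that sampling loop (hypothetical; it reuses the pam() sketch shown with the k-medoids slide above and scores each candidate set of medoids against the whole data set):

```python
import numpy as np

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples and keep the
    medoids that give the lowest total cost on the WHOLE data set."""
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for s in range(n_samples):
        idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
        _, medoids = pam(points[idx], k, seed=seed + s)  # pam() as sketched earlier
        # Score the candidate medoids against every object, not just the sample.
        cost = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = np.linalg.norm(points[:, None, :] - best_medoids[None, :, :], axis=2).argmin(axis=1)
    return labels, best_medoids
```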