MACHINE LEARNING
Machine learning is an application
of artificial intelligence (AI) that
gives systems the ability to
automatically learn and improve
from experience without being
explicitly programmed.
Types of machine learning
• K-Means Clustering
• Gaussian Mixture Models
• Dirichlet Process
K-Means Clustering
Clustering:
• is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets
(clusters), so that the data in each subset share some
common trait.
K-means Clustering
Types of Clustering
1. Hierarchical
2. Partitional
3. Density-Based Clustering
4. Fuzzy Logic Clustering
K-means Clustering
• a clustering algorithm in which the K clusters are based on the
closeness of data points to a reference point (the centroid of a cluster).
• It clusters n objects into k partitions based on their attributes, where k < n.
Terminology
Centroid
• A reference point of a given cluster. Centroids are used to label new data,
i.e. to determine which cluster a new data point belongs to.
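The idea above can be sketched with scikit-learn (assumed available here; the toy data and k=2 are illustrative, not from the slides). Note how the fitted centroids are used to label a new point, as the slide describes:

```python
# Minimal K-means sketch: cluster 4 points into k=2 partitions,
# then use the centroids to label a new data point.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # centroid (reference point) of each cluster

# Centroids label new data: the new point joins the nearest cluster
print(km.predict([[0.9, 2.1]]))
```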
K-means Clustering
How it works
The algorithm performs two major steps:
1. Data assignment
• Select centroids:
  • Random selection
  • Random generation
  • K-means++
• Assign data points to centroids based on distance:
  • Euclidean distance
  • Manhattan distance
  • Hamming distance
  • Inner product space
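The assignment step above can be sketched in plain numpy (the function names here are illustrative, not from the slides); each point is assigned to the index of its nearest centroid, under a chosen distance:

```python
# Data-assignment step (step 1): assign each point to the nearest
# centroid, shown here with Euclidean and Manhattan distances.
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def assign(points, centroids, dist=euclidean):
    # index of the closest centroid for every point
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.5, 0.5], [8.0, 8.0]])
print(assign(points, centroids))  # → [0, 0, 1]
```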
K-means Clustering (How it works)
2. Centroid update step
• Centroids are recomputed
• based on the mean of all data points assigned to each cluster (in step 1)
• Steps 1 and 2 are run iteratively until:
• Centroids don't change, i.e. distances stay the same and data points do not change clusters
• Some maximum number of iterations is reached
• Some other condition is fulfilled (e.g. a minimum distance is achieved)
NOTE
The centroids do not change after the 2nd
iteration, so we stop
updating the centroids.
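The two steps, iterated until the centroids stop changing or a maximum number of iterations is reached, can be sketched as follows (a minimal illustration, not an optimized implementation; it assumes no cluster goes empty, which holds for this toy data):

```python
# K-means loop: step 1 (assignment) and step 2 (centroid update),
# repeated until centroids no longer change or max_iter is hit.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random selection of initial centroids from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 1: assign each point to its nearest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # centroids unchanged -> stop
            break
        centroids = new
    return labels, centroids

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
```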
Remember!
Our goal is to identify the optimum value of K.
K-means Clustering (How to pick the optimum k)
• Minimize the within-cluster sum of squares (tighter clusters) and increase the
between-cluster sum of squares:

WCSS = Σⱼ Σ_{xₙ ∈ Sⱼ} ‖xₙ − µⱼ‖²

Where
• Sⱼ is a specific cluster among the K clusters.
• xₙ is a data point within cluster Sⱼ.
• µⱼ is the centroid of cluster Sⱼ.
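The within-cluster sum of squares is a direct translation of the definition above; a small sketch (the data here is illustrative):

```python
# WCSS: sum over clusters S_j of ||x_n - mu_j||^2 for each x_n in S_j.
import numpy as np

def wcss(X, labels, centroids):
    return sum(np.sum((X[labels == j] - mu) ** 2)
               for j, mu in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [4.0, 4.0]])
print(wcss(X, labels, centroids))  # → 2.0
```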
There are 2 common methods:
1. The Elbow Method
I. Calculate the within-cluster sum of squares for each candidate k.
II. Plot the sum of squares against the number of clusters, k.
III. Observe the change in the sum of squares to select the optimum k.
K-means Clustering (How to pick the optimum k)
[Plot: sum of squares vs. k] Observe the graph at k=3 (the elbow):
the sum of squares does not reduce significantly for k>3.
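The elbow method can be sketched with scikit-learn (assumed available): fit K-means for a range of k and inspect the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The three-blob data here is synthetic, chosen so the elbow lands at k=3:

```python
# Elbow method: WCSS (inertia) drops sharply until the true number of
# clusters, then flattens out.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs -> the elbow should appear at k=3
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, w in zip(range(1, 7), inertias):
    print(k, round(w, 1))
# the drop from k=2 to k=3 is large; beyond k=3 it is marginal
```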
A limitation of the elbow method is that the elbow might not be well
defined.
This can be overcome using the silhouette method.
2. Silhouette method
• The silhouette value is a measure of how similar an object is to its
own cluster (cohesion) compared to other clusters (separation).
• It ranges from −1 to +1: +1 shows that the point is very close to
its own cluster, while −1 shows that the point is very similar to the
neighboring cluster.
K-means Clustering (How to pick the optimum k)
• The silhouette value s(i) of a point i is mathematically defined as

s(i) = (b(i) − a(i)) / max(a(i), b(i))

Where:
• b(i) is the mean distance of point i with respect to the points in its neighboring
cluster.
• a(i) is the mean distance of point i with respect to the points in its own cluster.
K-means Clustering (How to pick the optimum k)
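Picking k by the silhouette method can be sketched with scikit-learn's `silhouette_score` (assumed available; the three-blob data is synthetic): the k with the highest mean silhouette value wins.

```python
# Silhouette method: compute the mean silhouette value for each k
# and pick the k that maximizes it.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # three well-separated blobs -> expect k=3
```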
K-means Clustering (Advantages)
• It is guaranteed to converge
• Easily scales to large datasets
• Has a linear time complexity O(tkn)
• t – number of iterations
• k – number of clusters
• n – number of data points
K-means Clustering (Limitations)
• k is chosen manually
• Clusters are typically dependent on initial centroids
• Outliers can drastically affect centroids
• Can converge to unrealistic clusters (a local optimum)
• Organization/order of data may have an impact on results
• Sensitive to scale