K-means Clustering Using Machine Learning
MACHINE LEARNING
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Unsupervised machine learning methods
• K-Means Clustering
• Gaussian Mixture Models
• Dirichlet Process
K-Means Clustering
Clustering:
• the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait.
K-means Clustering
Types of Clustering
1. Hierarchical
2. Partitional
3. Density-based Clustering
4. Fuzzy Clustering
K-means Clustering
• A clustering algorithm in which the K clusters are formed by the closeness of data points to a reference point (the centroid of a cluster).
• It clusters n objects, based on their attributes, into k partitions, where k < n.
Terminology
Centroid
• The reference point of a given cluster. Centroids are used to label new data, i.e. to determine which cluster a new data point belongs to.
K-means Clustering
How it works
The algorithm performs two major steps:
1. Data Assignment
• Selection of centroids
• Random selection
• Random generation
• K-means++
• Assignment of data points to centroids based on a distance measure, e.g.
• Euclidean distance
• Manhattan distance
• Hamming distance
• Inner product space
The two most common measures are defined below, followed by a code sketch.
Manhattan distance: $d(x, y) = \sum_{i=1}^{m} \lvert x_i - y_i \rvert$
Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
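A minimal sketch of the data-assignment step in Python with NumPy (the function names are illustrative, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    # square root of the summed squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # summed absolute coordinate differences
    return np.sum(np.abs(x - y))

def assign(points, centroids, dist=euclidean):
    # label each point with the index of its nearest centroid
    return np.array([np.argmin([dist(p, c) for c in centroids])
                     for p in points])
```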
Illustration
Objective: To create 2 clusters from the set of numbers 1 to 10 (K = 2, n = 10).

Centroid Initialization
The two initial centroids are chosen as 2 and 8.

Data Assignment
Each point n is assigned to the nearer centroid, using its distance to 2 and to 8 (the tie at n = 5 falls to the first centroid):

n    |n − 2|   |n − 8|   assigned to
1    1         7         2
2    0         6         2
3    1         5         2
4    2         4         2
5    3         3         2
6    4         2         8
7    5         1         8
8    6         0         8
9    7         1         8
10   8         2         8
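This table can be reproduced in a few lines (a sketch; np.argmin returns the first minimum, which is why the tie at n = 5 goes to centroid 2):

```python
import numpy as np

points = np.arange(1, 11)         # the data set: 1..10
centroids = np.array([2, 8])      # the initial centroids

# distance of every point to every centroid (absolute difference in 1-D)
dists = np.abs(points[:, None] - centroids[None, :])
labels = dists.argmin(axis=1)     # 0 -> centroid 2, 1 -> centroid 8

for n, (d2, d8), lab in zip(points, dists, labels):
    print(n, d2, d8, centroids[lab])
```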
K-means Clustering (How it works)
2. Centroid Update
• The centroids are recomputed as the mean of all data points assigned to each cluster in step 1.
• Steps 1 and 2 are run iteratively (a full loop is sketched below) until:
• The centroids no longer change, i.e. the distances stay the same and no data point changes cluster;
• Some maximum number of iterations is reached; or
• Some other stopping condition is fulfilled (e.g. a minimum distance is achieved).
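Putting the two steps together, a minimal 1-D k-means loop might look like this (a sketch under the slides' setup; it assumes no cluster ever goes empty):

```python
import numpy as np

def kmeans_1d(points, centroids, max_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid
        labels = np.abs(points[:, None] - centroids[None, :]).argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean()
                                  for j in range(len(centroids))])
        if np.array_equal(new_centroids, centroids):
            break                 # converged: the centroids no longer change
        centroids = new_centroids
    return centroids, labels

# The slides' example: points 1..10, initial centroids 2 and 8
centroids, labels = kmeans_1d(np.arange(1, 11), [2, 8])
print(centroids)                  # [3. 8.] after two iterations
```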
Back in the illustration:

Step 2 (Centroid Update)
Each centroid becomes the mean of the points assigned to it:
• Mean(1, 2, 3, 4, 5) = 3
• Mean(6, 7, 8, 9, 10) = 8
With the new centroids 3 and 8, reassigning every point gives:

n    |n − 3|   |n − 8|   assigned to
1    2         7         3
2    1         6         3
3    0         5         3
4    1         4         3
5    2         3         3
6    3         2         8
7    4         1         8
8    5         0         8
9    6         1         8
10   7         2         8
Step 3 (Centroid Update)
Recomputing the means over the same clusters gives:
• Mean(1, 2, 3, 4, 5) = 3
• Mean(6, 7, 8, 9, 10) = 8
The distance table and the centroids (3 and 8) are identical to those of step 2.
NOTE
The centroids do not change after the 2nd iteration, so we stop updating them.
Remember!
Our goal is to identify the optimum value of K.
K-means Clustering (How to pick the optimum k)
• Minimize the within-cluster sum of squares (tighten the clusters) and increase the between-cluster sum of squares:

$\text{WCSS} = \sum_{j=1}^{K} \sum_{x_n \in S_j} \lVert x_n - \mu_j \rVert^2$

Where:
• $S_j$ is a specific cluster among the K clusters.
• $x_n$ is a data point within cluster $S_j$.
• $\mu_j$ is the centroid of cluster $S_j$.
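As a sketch, this quantity can be computed directly from the assignments (names are illustrative):

```python
import numpy as np

def wcss(points, centroids, labels):
    # sum over clusters of the squared distances to each cluster's centroid
    return sum(np.sum((points[labels == j] - mu) ** 2)
               for j, mu in enumerate(centroids))
```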
There are 2 common methods:
1. The Elbow Method
I. Calculate the within-cluster sum of squares for each candidate number of clusters, k.
II. Plot the sum of squares against k.
III. Observe the change in the sum of squares to select the optimum k.
In the plot of the sum of squares against k, the elbow occurs at k = 3: the sum of squares does not decrease significantly for k > 3.
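A sketch of the elbow method with scikit-learn and matplotlib (X is placeholder data; KMeans exposes the within-cluster sum of squares as inertia_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)            # placeholder data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(ks, wcss, marker="o")        # look for the "elbow" in this curve
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```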
A limitation of the elbow method is that the elbow might not be well defined. This can be overcome using the silhouette method.
2. Silhouette method
• The silhouette value is a measure of how similar an object is to its
own cluster (cohesion) compared to other clusters (separation).
• It ranges from −1 to +1: a value near +1 shows that the point is very close to its own cluster, while a value near −1 shows that the point is more similar to the neighboring cluster.
• The silhouette value s(i) of a point i is defined as

$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$

Where:
• b(i) is the mean distance from point i to the points in its neighboring cluster (the nearest cluster it does not belong to).
• a(i) is the mean distance from point i to the other points in its own cluster.
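A sketch of silhouette-based selection with scikit-learn (X is placeholder data; silhouette_score averages s(i) over all points, so higher is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)            # placeholder data

for k in range(2, 11):                # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
# pick the k with the highest average silhouette score
```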
K-means Clustering (Advantages)
• It is guaranteed to converge
• It scales easily to large datasets
• It has a linear time complexity, O(tkn), where:
• t – number of iterations
• k – number of clusters
• n – number of data points
K-means Clustering (Limitations)
• k must be chosen manually
• The clusters typically depend on the initial centroids
• Outliers can drastically affect the centroids
• It can converge to a local optimum, giving unrealistic clusters
• The organization/order of the data may have an impact on the results
• It is sensitive to the scale of the features (see the sketch below)
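Because of that scale sensitivity, features are commonly standardized before clustering; a sketch with scikit-learn (X is placeholder data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.rand(200, 2) * [1, 1000]   # placeholder: features on very different scales

# standardize each feature to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```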
Applications
• Pattern recognition
• Classification analysis
• Image processing
• Machine vision