MACHINE LEARNING
Machine learning is an application
of artificial intelligence (AI) that
gives systems the ability to
automatically learn and improve
from experience without being
explicitly programmed.
Types of machine learning
• K-Means Clustering
• Gaussian Mixture Models
• Dirichlet Process
K-Means Clustering
Clustering:
• is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets
(clusters), so that the data in each subset share some
common trait.
K-means Clustering
Types of Clustering
1. Hierarchical
2. Partitional
3. Density-Based Clustering
4. Fuzzy Logic Clustering
K-means Clustering
• a clustering algorithm in which the K clusters are based on the
closeness of data points to a reference point (the centroid of a cluster).
• It clusters n objects into k partitions based on their attributes, where k < n.
Terminology
Centroid
• A reference point of a given cluster. Centroids are used to label new data,
i.e. to determine which cluster a new data point belongs to.
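The idea above can be sketched with scikit-learn (assumed available here; the toy data and k=2 are illustrative, not from the slides). Note how the fitted centroids are used to label a new point, as the slide describes:

```python
# Minimal K-means sketch: cluster 4 points into k=2 partitions,
# then use the centroids to label a new data point.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # centroid (reference point) of each cluster

# Centroids label new data: the new point joins the nearest cluster
print(km.predict([[0.9, 2.1]]))
```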
K-means Clustering
How it works
The algorithm performs two major steps:
1. Data assignment
• Select centroids:
  • Random selection
  • Random generation
  • K-means++
• Assign data points to centroids based on distance:
  • Euclidean distance
  • Manhattan distance
  • Hamming distance
  • Inner product space
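The assignment step above can be sketched in plain numpy (the function names here are illustrative, not from the slides); each point is assigned to the index of its nearest centroid, under a chosen distance:

```python
# Data-assignment step (step 1): assign each point to the nearest
# centroid, shown here with Euclidean and Manhattan distances.
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def assign(points, centroids, dist=euclidean):
    # index of the closest centroid for every point
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.5, 0.5], [8.0, 8.0]])
print(assign(points, centroids))  # → [0, 0, 1]
```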
K-means Clustering (How it works)
2. Centroid update step
• Centroids are recomputed
• based on the mean of all data points assigned to each cluster (in step 1)
• Steps 1 and 2 are run iteratively until:
• Centroids don't change, i.e. distances stay the same and data points do not change clusters
• Some maximum number of iterations is reached
• Some other condition is fulfilled (e.g. a minimum distance is achieved)
NOTE
The centroids do not change after the 2nd
iteration, so we stop
updating the centroids.
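The two steps, iterated until the centroids stop changing or a maximum number of iterations is reached, can be sketched as follows (a minimal illustration, not an optimized implementation; it assumes no cluster goes empty, which holds for this toy data):

```python
# K-means loop: step 1 (assignment) and step 2 (centroid update),
# repeated until centroids no longer change or max_iter is hit.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random selection of initial centroids from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 1: assign each point to its nearest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # centroids unchanged -> stop
            break
        centroids = new
    return labels, centroids

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
```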
Remember!
Our goal is to identify the optimum value of K.
K-means Clustering (How to pick the optimum k)
• Minimize the within-cluster sum of squares (tighter clusters) and increase the
between-cluster sum of squares:

WCSS = Σⱼ Σ_{xₙ ∈ Sⱼ} ‖xₙ − µⱼ‖²

Where
• Sⱼ is a specific cluster among the K clusters.
• xₙ is a data point within cluster Sⱼ.
• µⱼ is the centroid of cluster Sⱼ.
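The within-cluster sum of squares is a direct translation of the definition above; a small sketch (the data here is illustrative):

```python
# WCSS: sum over clusters S_j of ||x_n - mu_j||^2 for each x_n in S_j.
import numpy as np

def wcss(X, labels, centroids):
    return sum(np.sum((X[labels == j] - mu) ** 2)
               for j, mu in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[0.0, 1.0], [4.0, 4.0]])
print(wcss(X, labels, centroids))  # → 2.0
```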
There are 2 common methods:
1. The Elbow Method
I. Calculate the within-cluster sum of squares for each candidate k.
II. Plot the sum of squares against the number of clusters, k.
III. Observe the change in the sum of squares to select the optimum k.
K-means Clustering (How to pick the optimum k)
[Plot: sum of squares vs. k] Observe the graph at k=3 (the elbow):
the sum of squares does not reduce significantly for k>3.
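The elbow method can be sketched with scikit-learn (assumed available): fit K-means for a range of k and inspect the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The three-blob data here is synthetic, chosen so the elbow lands at k=3:

```python
# Elbow method: WCSS (inertia) drops sharply until the true number of
# clusters, then flattens out.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs -> the elbow should appear at k=3
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, w in zip(range(1, 7), inertias):
    print(k, round(w, 1))
# the drop from k=2 to k=3 is large; beyond k=3 it is marginal
```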
A limitation of the elbow method is that the elbow might not be well
defined.
This can be overcome using the silhouette method.
2. Silhouette method
• The silhouette value is a measure of how similar an object is to its
own cluster (cohesion) compared to other clusters (separation).
• It ranges from −1 to +1: +1 shows that the point is very close to
its own cluster, while −1 shows that the point is very similar to the
neighboring cluster.
K-means Clustering (How to pick the optimum k)
• The silhouette value s(i) of a point i is mathematically defined as

s(i) = (b(i) − a(i)) / max(a(i), b(i))

Where:
• b(i) is the mean distance of point i with respect to the points in its neighboring
cluster.
• a(i) is the mean distance of point i with respect to the points in its own cluster.
K-means Clustering (How to pick the optimum k)
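Picking k by the silhouette method can be sketched with scikit-learn's `silhouette_score` (assumed available; the three-blob data is synthetic): the k with the highest mean silhouette value wins.

```python
# Silhouette method: compute the mean silhouette value for each k
# and pick the k that maximizes it.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # three well-separated blobs -> expect k=3
```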
K-means Clustering (Advantages)
• It is guaranteed to converge
• Easily scales to large datasets
• Has a linear time complexity O(tkn)
• t – number of iterations
• k – number of clusters
• n – number of data points
K-means Clustering (Limitations)
• k is chosen manually
• Clusters are typically dependent on initial centroids
• Outliers can drastically affect centroids
• Can converge to unrealistic clusters (a local optimum)
• Organization/order of data may have an impact on results
• Sensitive to scale