Clustering in ML
1. Machine Learning & Data Mining
Machine learning focuses on prediction, based on known properties learned from the training data.
Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge
Discovery in Databases.
Data mining uses many machine learning methods, but often with a slightly different goal in mind.
Machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner
accuracy.
2. What is Cluster Analysis?
Cluster analysis is a class of techniques that are used to classify objects or cases into relative groups called clusters. Cluster analysis is also
called classification analysis or numerical taxonomy. In cluster analysis, there is no prior information about the group or cluster membership
for any of the objects.
The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data, not defined a priori, such that objects in a
given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. Finding groups of objects
such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
3. Applications of Cluster Analysis
Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base. And they can characterize their customer groups based on
the purchasing patterns.
Understanding
Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price
fluctuations.
Summarization
Reduce the size of large data sets
4. Growth of Machine Learning
Machine learning is preferred approach to Speech recognition, Natural language processing, Computer vision, Medical outcomes analysis,
Robot control, Computational biology.
This trend is accelerating Improved machine learning algorithms, Improved data capture, networking, faster computers, Software too complex
to write by hand, New sensors / IO devices, Demand for self-customization to user, environment. It turns out to be difficult to extract
knowledge from human expertsfailure of expert systems in the 1980’s.
5. Data Mining / KDD
“KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
Applications:
Retail: Market basket analysis, Customer relationship management (CRM)
Finance: Credit scoring, fraud detection
Manufacturing: Optimization, troubleshooting
Medicine: Medical diagnosis
Telecommunications: Quality of service optimization
Bioinformatics: Motifs, alignment
6. Clustering Methodology
Page - 1 of 7
Clustering in ML
7. Types of Clustering
A clustering is a set of clusters. There are different types of clustering methods in use and some important Clusterings are:
Partitioning clustering
Hierarchical clustering
Fuzzy clustering
Density-based clustering, and
Distribution Model-based clustering.
Partitioning clustering
Partitioning Clustering is a type of clustering technique, that divides the data set into a set number of groups. A division data objects into non-
overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering
It is a type of clustering technique, that divides that data set into a number of clusters, where the user doesn’t specify the number of clusters
to be generated before training the model. A set of nested clusters organized as a hierarchical tree
Density-Based Clustering
In this clustering, technique clusters will be formed by the segregation of various density regions based on different densities in the data plot.
Density-Based Spatial Clustering and Application with Noise (DBSCAN) is the most used algorithm in this type of technique.
The main idea behind this algorithm is there should be a minimum number of points contained in the neighborhood of a given radius for each
point in the cluster.
Distribution Model-Based Clustering
In this type of clustering, technique clusters are formed by identifying by the probability of all the data points in the cluster come from the
same distribution (Normal, Gaussian). The most popular algorithm in this type of technique is Expectation-Maximization (EM) clustering
using Gaussian Mixture Models (GMM).
8. Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to multiple clusters.
Can represent multiple classes or ‘border’ points
Fuzzy versus non-fuzzy
In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
Weights must sum to 1
Probabilistic clustering has similar characteristics
Partial versus complete
In some cases, we only want to cluster some of the data
Page - 2 of 7
Clustering in ML
Heterogeneous versus homogeneous
Cluster of widely different sizes, shapes, and densities.
9. Clustering Algorithms
1. K-means and its variants
2. Hierarchical clustering
3. Density-based clustering
10. K-means Clustering
K-means clustering is a type of unsupervised learning, which is used when there have unlabeled data (i.e., data without defined categories or
groups). The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. K-Means
clustering is one of the most widely used algorithms. It partitions the data points into k clusters based upon the distance metric used for the
clustering. The value of ‘k’ is to be defined by the user. The distance is calculated between the data points and the centroids of the clusters.
The data point which is closest to the centroid of the cluster gets assigned to that cluster. After an iteration, it computes the centroids of those
clusters again and the process continues until a pre-defined number of iterations are completed or when the centroids of the clusters do not
change after an iteration.
It is a very computationally expensive algorithm as it computes the distance of every data point with the centroids of all the clusters at each
iteration. This makes it difficult for implementing the same for huge data sets.
K-means will converge for common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few points change clusters.
Complexity is O ( n * K * I * d )
Here, n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
11. Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
For each point, the error is the distance to the nearest cluster
To get SSE, we square these errors and sum them.
x is a data point in cluster Ci and 𝑚i is the representativepoint for cluster Ci
can show that mi corresponds to the center (mean) of the cluster
Given two clusters, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
Page - 3 of 7
Clustering in ML
12. Issues and Limitations for K-means
Advantages:
1. Fast, robust and easier to understand.
2. Relatively efficient: O (tknd), where n is # objects, k is # clusters, d is # dimension of each object, and t is # iterations. Normally,
k, t, d << n.
3. Gives best result when data set are distinct or well separated from each other.
Disadvantages or Limitations:
1. The learning algorithm requires apriori specification of the number of cluster centers.
2. The use of Exclusive Assignment - If there are two highly overlapping data then k-means will not be able to resolve that there are
two clusters.
3. The learning algorithm is not invariant to non-linear transformations i.e. with different representation of data we get different results
(data represented in form of Cartesian co-ordinates and polar co-ordinates will give different results).
4. Euclidean distance measures can unequally weight underlying factors.
5. The learning algorithm provides the local optima of the squared error function.
6. Randomly choosing of the cluster center cannot lead us to the fruitful result. Pl. refer Fig.
7. Applicable only when mean is defined i.e. fails for categorical data.
8. Unable to handle noisy data and outliers.
9. Algorithm fails for non-linear data set.
13. Problems with Selecting Initial Points.
If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small. Chance is relatively small when K is large.
If clusters are the same size, n, then
For example, if K = 10, then probability = 10!/1010 = 0.00036
Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t. Consider an example of five pairs of
clusters
14. Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Sample and use hierarchical clustering to determine initial centroids
Select more than k initial centroids and then select among these initial centroids
– Select most widely separated
Postprocessing
Bisecting K-means
– Not as susceptible to initialization issues
15. Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits.
16. Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the dendogram at the proper level
They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Page - 4 of 7
Clustering in ML
17. Types of Hierarchical Clustering
Two main types of hierarchical clustering, Agglomerative and Divisive
1. Agglomerative:
– Start with the points as individual clusters
– At each step, merge the closest pair of clusters until only one cluster (or k
– clusters) left
2. Divisive:
– Start with one, all-inclusive cluster
– At each step, split a cluster until each cluster contains a point (or there are k clusters)
18. How to Define Inter-Cluster Similarity
Min
Max
Group Average
Distance Between Centroids
19. Cluster Similarity
Similarity is an amount that reflects the strength of relationship between two data items, it represents how similar two data patterns are.
Clustering is done based on a similarity measure to group similar data objects together.
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the proximity graph.
Hierarchical Clustering: MIN
Strength – Can handle non-elliptical shapes.
Limitations – Sensitive to noise and outliers.
Page - 5 of 7
Clustering in ML
Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
– Determined by all pairs of points in the two clusters
Hierarchical Clustering: MAX
Strength – Less susceptible to noise and outliers
Limitations – Tends to break large clusters, Biased towards globular clusters.
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters.
Need to use average connectivity for scalability since total proximity favors large clusters
Hierarchical Clustering: Group Average
Hierarchical Clustering: Group Average
Compromise between Single and Complete Link
Strengths – Less susceptible to noise and outliers
Limitations – Biased towards globular clusters
Page - 6 of 7
Clustering in ML
20. Strength and Weakness of different types of clustering.
Algorithm Advantages Disadvantages
1. Severe effectiveness degradation in high dimensional
spaces.
2. Poor cluster descriptors.
1. Relatively Scalable and Simple. 3. Reliance on the user to specify the number of clusters in
Partitioning
2. Suitable for Datasets with compact spherical advance.
Clustering
clusters that are well-separated. 4. High sensitivity to initialization phase, noise and
outliers.
5. Inability to deal with non-convex clusters of varying
size and density.
1. Inability to make corrections once the splitting/merging
1. No need to define number of clusters in advance. decision is made.
2. Calculates a whole hierarchy of clusters. 2. Lack of interpretability regarding the cluster
3. Good result visualizations Joints into the methods. descriptors.
Hierarchical
4. Uses dendrogram for graphical representation. 3. Vagueness of termination criterion.
Clustering
5. Embedded flexibility regarding the level of 4. Prohibitively expensive for high dimensional and
granularity. massive datasets.
6. Point linkages, e.g. Taxonomy trees can be solved. 5. Severe effectiveness degradation in high dimensional
spaces.
21. Limitation of Clustering and how it can be overcome or solved?
There are several things to be aware of when conducting cluster analysis:
1. The different methods of clustering usually give very different results. This occurs because of the different criterion for merging
clusters (including cases). It is important to think carefully about which method is best for what you are interested in looking at.
2. With the exception of simple linkage, the results will be affected by the way in which the variables are ordered.
3. The analysis is not stable when cases are dropped: this occurs because selection of a case (or merger of clusters) depends on
similarity of one case to the cluster. Dropping one case can drastically affect the course in which the analysis progresses.
4. The hierarchical nature of the analysis means that early ‘bad judgements’ cannot be rectified.
There are a variety of tools and strategies to mitigate the limitations that simplify the process of extracting and analyzing clustered data. The
k-means clustering approach is a portioning-based solution that requires networks to assign objects to one and only one cluster. This eliminates
the concern that a single object may bias analysis by appearing in multiple data sets.
Page - 7 of 7