Clustering in Machine Learning
In the real world, not every dataset we work with has a target variable. Have you ever wondered how
Netflix groups similar movies together or how Amazon organizes its vast product catalog? These
are real-world applications of clustering. Such data cannot be analyzed using supervised
learning algorithms.
When the goal is to group similar data points in a dataset, we use cluster analysis.
What is Clustering?
The task of grouping data points based on their similarity with each other is called Clustering
or Cluster Analysis. It falls under the branch of unsupervised learning, which
aims to gain insights from unlabeled data points.
Suppose you have a dataset of customers' shopping habits. Clustering can help you group
customers with similar purchasing behaviors, which can then be used for targeted
marketing, product recommendations, or customer segmentation.
For example, in the graph given below, we can clearly see three circular clusters forming
on the basis of distance.
The clusters formed need not be circular in shape; their shape can be arbitrary, and there
are many algorithms that work well at detecting arbitrarily shaped clusters.
For example, in the graph given below, we can see that the clusters formed are not circular in shape.
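One well-known algorithm of this kind is DBSCAN, which grows clusters outward from dense "core" points, so a cluster's shape follows the data's density rather than a fixed geometry. Below is a minimal pure-Python sketch of the idea (the parameter names and toy data are illustrative assumptions, not a production implementation):

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: points with at least min_pts neighbors
    within radius eps are 'core' points; clusters grow through them."""
    labels = [None] * len(points)   # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid     # border point absorbed into the cluster
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is also a core point: expand through it
                queue.extend(jn)
        cid += 1
    return labels

# Toy data: a long thin chain of points plus a compact blob far away.
curve = [(float(x), 0.0) for x in range(10)]
blob = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0)]
labels = dbscan(curve + blob, eps=1.5, min_pts=2)
```

Here the elongated chain stays one cluster because density, not distance to a single center, decides membership.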
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely
or not at all. For example, let's say there are 4 data points and we have to cluster them into 2
clusters. Each data point will belong to either cluster 1 or cluster 2.
Data Point | Cluster
A | C1
B | C2
C | C2
D | C1
Soft Clustering: In this type of clustering, instead of assigning each data point to exactly one
cluster, a probability (or likelihood) of the point belonging to each cluster is evaluated. For
example, let's say there are 4 data points and we have to cluster them into 2 clusters. We
evaluate the probability of each data point belonging to both clusters, and this probability
is calculated for all data points.
Data Point | Probability of C1 | Probability of C2
A | 0.91 | 0.09
B | 0.3 | 0.7
C | 0.17 | 0.83
D | 1 | 0
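The two assignment styles can be sketched in code. In the sketch below the 2-D point coordinates, cluster centers, and inverse-distance weighting are illustrative assumptions (soft-clustering algorithms such as Gaussian mixtures use model likelihoods instead):

```python
import math

# Hypothetical 2-D coordinates for points A..D and two cluster centers.
points = {"A": (1.0, 1.0), "B": (4.0, 5.0), "C": (5.0, 5.0), "D": (0.0, 1.0)}
centers = {"C1": (0.5, 1.0), "C2": (4.5, 5.0)}

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Hard clustering: each point goes entirely to its nearest center.
hard = {name: min(centers, key=lambda c: dist(p, centers[c]))
        for name, p in points.items()}

# Soft clustering: turn distances into membership probabilities that sum to 1
# (inverse-distance weighting here, purely for illustration).
soft = {}
for name, p in points.items():
    w = {c: 1.0 / (dist(p, centers[c]) + 1e-9) for c in centers}
    total = sum(w.values())
    soft[name] = {c: w[c] / total for c in centers}

print(hard)       # {'A': 'C1', 'B': 'C2', 'C': 'C2', 'D': 'C1'}
print(soft["A"])  # A's membership is split across C1 and C2
```

Note that the hard assignment collapses each row of the soft table to whichever cluster has the higher probability.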
Uses of Clustering
Before we cover the types of clustering algorithms, let's go through their main use cases.
Clustering algorithms are mainly used for:
Market Segmentation: Businesses use clustering to group their customers and use
targeted advertisements to attract a larger audience.
Market Basket Analysis: Shop owners analyze their sales and figure out which items are
frequently bought together by customers.
Social Network Analysis: Social media sites use your data to understand your browsing
behavior and provide you with targeted friend recommendations or content
recommendations.
Medical Imaging: Doctors use Clustering to find out diseased areas in diagnostic images
like X-rays.
Anomaly Detection: Clustering can be used to find outliers in a real-time data stream or
to flag fraudulent transactions.
Simplify working with large datasets: Each cluster is given a cluster ID after clustering
is complete, so an entire feature set can be reduced to its cluster ID.
Clustering is effective when it can represent a complicated case with a simple
cluster ID; by the same principle, clustering can make complex datasets simpler.
Centroid-based Clustering (Partitioning methods)
Centroid-based clustering organizes data points around central vectors (centroids) that represent
clusters. Each data point belongs to the cluster with the nearest centroid. Generally, the similarity
measures chosen for these algorithms are Euclidean distance, Manhattan distance or Minkowski
distance.
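All three of these measures are special cases of the Minkowski distance of order r, where r = 1 gives Manhattan distance and r = 2 gives Euclidean distance. A quick sketch:

```python
def minkowski(p, q, r):
    """Minkowski distance of order r between two points;
    r=1 is Manhattan distance, r=2 is Euclidean distance."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

p, q = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(p, q, 1)  # |1-4| + |2-6| = 7.0
euclidean = minkowski(p, q, 2)  # sqrt(3**2 + 4**2) = 5.0
```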
The dataset is separated into a predetermined number of clusters, each represented by a vector
of values (the centroid). Each input data point is compared against these centroid vectors and
joins the cluster whose centroid it is closest to.
The major drawback of centroid-based algorithms is that the number of clusters, "k," must be
established, either intuitively or scientifically (for example, using the Elbow Method), before the
algorithm starts allocating data points. Despite this limitation, it remains the
most popular type of clustering due to its simplicity and efficiency. Popular algorithms
of centroid-based clustering are:
K-means clustering
K-medoids clustering
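The "assign to nearest centroid, then recompute centroids" loop at the heart of K-means can be sketched in a few lines of pure Python. The toy data and random initialization below are illustrative assumptions; real libraries add smarter initialization (e.g. k-means++) and vectorized math:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means sketch: alternate nearest-centroid assignment
    and centroid-mean updates until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return centroids, clusters

# Two well-separated toy groups; K-means should recover their means.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

K-medoids follows the same loop but restricts each center to be an actual data point (a medoid), which makes it more robust to outliers.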