Hierarchical Clustering
Learning Objectives
• Understand the difference between supervised and unsupervised
approaches, and design a model without labeled training data.
• Implement clustering techniques on real-world data.
Learning Outcomes
• Understand how unsupervised learning differs from supervised learning.
• Learn the elements of clustering.
• Analyze clustering problems.
• Implement clustering on real-world data.
Introduction
Suppose the Chief Marketing Officer of a company tells you: “Help me understand our
customers better so that we can market our products to them in a better manner!”
An analyst's first reaction is often: where do I even start?
This is a common reaction when you come across an unsupervised learning
problem for the first time.
You are not looking for specific insights about a phenomenon; you are looking for
structures within the data, without tying them to a specific outcome.
Clustering-Unsupervised learning
Clustering is the task of dividing a population or set of data points into groups
such that data points in the same group are more similar to one another than to
data points in other groups. In simple words, the aim is to find groups with
similar traits and assign them to clusters.
Types of Clustering models
• Connectivity based
• Centroid based
• Distribution based
• Density based
Types of clustering models
• Connectivity models:
• As the name suggests, these models are based on the notion that data points closer together in the
data space are more similar to each other than data points lying farther apart. These
models can follow two approaches.
• In the first approach (agglomerative), every data point starts as its own cluster, and the closest
clusters are merged step by step.
• In the second approach (divisive), all data points start in a single cluster, which is then split
step by step into smaller clusters.
• The choice of distance function is subjective. These models are very easy to interpret but lack
scalability for handling big datasets. Examples of these models are the hierarchical clustering
algorithm and its variants.
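The agglomerative approach described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and the function name `agglomerative`, the `n_clusters` stopping rule, and the sample points are hypothetical choices made for this example.

```python
import math

def agglomerative(points, n_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # clusters hold point indices
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
clusters = agglomerative(points, n_clusters=2)
print(clusters)
```

Each pass compares every pair of clusters, which is one way to see why connectivity models scale poorly to large datasets.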
Types of clustering algorithms
• Centroid models:
• These are iterative clustering algorithms in which similarity is defined by how close a
data point is to the centroid of a cluster. K-Means is a popular algorithm that falls
into this category.
• In these models, the number of clusters has to be specified beforehand, which makes
it important to have prior knowledge of the dataset. These models run iteratively and converge to a
local optimum, which is not guaranteed to be the global one.
• Distribution models:
• These clustering models are based on how probable it is that all data points in a cluster
belong to the same distribution (for example, a normal, i.e. Gaussian, distribution).
• These models often suffer from overfitting. A popular example is the Expectation-Maximization (EM)
algorithm, which is commonly used to fit mixtures of multivariate normal distributions.
• Density models:
• These models search the data space for regions of differing data-point density. They
isolate the different density regions and assign the data points within each region to the same
cluster.
• Popular examples of density models are DBSCAN and OPTICS.
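To make the density-based idea concrete, here is a minimal sketch of DBSCAN in plain Python. The parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense region) are the algorithm's usual parameters; the specific values, the helper `region_query`, and the sample points are assumptions chosen only for illustration. A real project would more likely rely on a library implementation.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep growing
                queue.extend(j_neighbors)
    return labels

# Hypothetical toy data: two dense groups plus one isolated point.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1),
          (4.5, 15.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

Unlike the centroid models, this sketch never receives the number of clusters as input; it follows from `eps` and `min_pts`, and points in sparse regions are labeled as noise.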
Difference between clustering and classification
– Classification and clustering are two methods of pattern identification used in machine
learning.
– Classification assigns objects to predefined classes, while clustering groups objects
by the similarities it finds between them.
– Clustering belongs to unsupervised learning: we only have a set of unlabeled input
data, from which we must extract information without knowing in advance what the
output will be.
– Classification, on the other hand, belongs to supervised learning: the input data is
labeled, and we know the possible outputs of the algorithm.
– Clustering example: Netflix recommendation systems. There are about 2,000 clusters or
communities that have common audiovisual tastes.
– Classification example: fraud detection. Organizations can classify transactions as
legitimate or fraudulent using historical data on customer behavior.
Clustering - Overview
• Not well defined at times:
• How do we really know how many clusters should exist in
the above example?
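One common heuristic for this question, not covered in these slides, is to run K-Means for several values of k and compare the within-cluster sum of squares (WCSS): the value of k after which WCSS stops dropping sharply is a reasonable candidate. The sketch below assumes Euclidean distance and a deterministic farthest-point initialization (a simplification chosen so the example is reproducible); the function names and sample points are hypothetical.

```python
import math

def farthest_point_init(points, k):
    """Pick the first point, then repeatedly the point farthest from all chosen ones."""
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(tuple(nxt))
    return centroids

def kmeans(points, k, n_iter=100):
    """K-Means (Lloyd's algorithm): alternate assignment and centroid update."""
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Hypothetical toy data: three visibly separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1),
          (1.0, 8.0), (1.1, 8.2), (0.8, 7.9)]

scores = {}
for k in range(1, 5):
    labels_k, centroids_k = kmeans(points, k)
    scores[k] = wcss(points, labels_k, centroids_k)
    print(k, round(scores[k], 3))
```

On this toy data the WCSS falls steeply up to k = 3 and only marginally afterwards, which matches the three groups the points were drawn from. On real data the bend is often less clear, so the choice of k remains partly a judgment call.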
Applications of Clustering
• Clustering has a large number of applications spread across various domains. Some of
the most popular applications of clustering are:
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection