ASSIGNMENT 4: Unsupervised Learning
Made by: Preyanshi
Enrollment No.: 226140307031
Supervised learning vs.
unsupervised learning
• Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
These patterns are then utilized to predict the values of the target
attribute in future data instances.
• Unsupervised learning: The data have no target attribute.
We want to explore the data to find some intrinsic structures in them.
Clustering
• Clustering is a technique for finding similarity groups
in data, called clusters. I.e.,
it groups data instances that are similar to (near) each
other in one cluster and data instances that are very
different (far away) from each other into different clusters.
• Clustering is often called an unsupervised
learning task because no class values denoting an a
priori grouping of the data instances are given,
as they are in supervised learning.
• Due to historical reasons, clustering is often
considered synonymous with unsupervised learning.
In fact, association rule mining is also unsupervised.
• This chapter focuses on clustering.
An illustration
• The data set has three natural groups of data
points, i.e., 3 natural clusters.
CS583, Bing Liu, UIC
What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes together to make “small”,
“medium” and “large” T-shirts.
Tailor-made for each person: too expensive
One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers according to their
similarities
To do targeted marketing.
What is clustering for?
(cont…)
• Example 3: Given a collection of text documents, we want to
organize them according to their content similarities,
to produce a topic hierarchy.
• In fact, clustering is one of the most utilized data mining
techniques.
It has a long history and is used in almost every field, e.g., medicine,
psychology, botany, sociology, biology, archeology, marketing, insurance,
libraries, etc.
In recent years, due to the rapid increase of online documents, text
clustering has become important.
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is
the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.
K-means algorithm
• Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids
(cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
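The four steps above can be sketched in plain Python. This is a minimal illustration, not the slides' own code: the toy data, the iteration cap, and seeding via `random.sample` are assumed choices.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                      # step 1: random seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # step 2: closest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [                                  # step 3: recompute means
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                     # step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, clusters

# Usage on two well-separated groups of 2-D points:
pts = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
cents, cls = kmeans(pts, 2)
```

Convergence is checked here as "centroids did not move"; real implementations often also stop when the change falls below a tolerance.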
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm
due to its simplicity and efficiency;
other clustering algorithms have their own lists of weaknesses.
• No clear evidence that any other clustering algorithm performs
better in general
although they may be more suitable for some specific types of data or
applications.
• Comparing different clustering algorithms is a difficult task. No one
knows the correct clusters!
Common ways to represent
clusters
• Use the centroid of each cluster to represent the cluster.
Compute the radius and
standard deviation of the cluster to determine its spread in each
dimension.
The centroid representation alone works well if the clusters are of
hyper-spherical shape.
If clusters are elongated or of other shapes, centroids alone are not
sufficient.
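As a sketch of this representation, the centroid, the radius (distance to the farthest member), and the per-dimension standard deviation of a cluster can be computed as follows; the sample cluster is made up for illustration.

```python
import math

def cluster_summary(points):
    n = len(points)
    centroid = tuple(sum(dim) / n for dim in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    radius = max(dists)                                  # farthest member from centroid
    stdevs = tuple(                                      # spread along each dimension
        math.sqrt(sum((x - c) ** 2 for x in dim) / n)
        for dim, c in zip(zip(*points), centroid)
    )
    return centroid, radius, stdevs

# A square of four points centered at (1, 1):
c, r, s = cluster_summary([(0, 0), (2, 0), (0, 2), (2, 2)])
```

For this symmetric cluster the centroid is (1.0, 1.0) and both per-dimension standard deviations are equal, which is the hyper-spherical case where the centroid representation works well.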
Hierarchical Clustering
• Produce a nested sequence of clusters, a
tree, also called a dendrogram.
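A minimal bottom-up (agglomerative) sketch of how such a nested sequence is built: start from singleton clusters and repeatedly merge the closest pair, recording each merge as the dendrogram would. Single-linkage distance is an assumed choice here, since the slide does not fix a linkage criterion.

```python
import math

def single_linkage(ca, cb):
    # distance between clusters = closest pair of points (single linkage)
    return min(math.dist(a, b) for a in ca for b in cb)

def agglomerative(points):
    clusters = [[p] for p in points]
    merges = []                                   # nested merge sequence
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))  # record this level of the tree
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Two nearby points merge first, then the far one joins:
merges = agglomerative([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)])
```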
Using classification model
• All the data points in a cluster are regarded as having the same
class label, e.g., the cluster ID.
Run a supervised learning algorithm on the data to find a
classification model.
DBSCAN Application
• Real-Time Problem: Anomaly Detection in
Credit Card Transactions
• Objective: Detect fraudulent credit card
transactions.
• Dataset: Transaction records including amount,
location, and time.
• Process:
• Apply DBSCAN to cluster normal transactions while
identifying outliers.
• DBSCAN is effective because it does not assume
spherical clusters and can detect outliers.
• Result: Detect anomalies that may indicate
fraudulent activity.
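To make the process concrete, here is a compact self-contained DBSCAN sketch in plain Python. The 2-D toy points and the eps / min_pts values are illustrative stand-ins for real transaction features (amount, location, time), not from the slides; points labelled -1 are the density outliers that would be flagged as candidate anomalies.

```python
import math

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)                 # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:                   # not a core point:
            labels[i] = -1                        # tentatively noise (outlier)
            continue
        labels[i] = cluster                       # grow a new cluster from this core
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:                   # noise reachable from a core point
                labels[j] = cluster               # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:                # expand only through core points
                queue.extend(jn)
        cluster += 1
    return labels

# Two dense groups of "normal" transactions plus one isolated outlier:
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)          # the isolated point gets -1
```

Because cluster membership depends only on local density, the two groups need not be spherical, and the isolated point is reported as noise rather than forced into a cluster.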
Apriori Algorithm
Application
• Real-Time Problem: Optimizing Product
Placement in Retail
• Objective: Identify frequently purchased items
together to improve store layout and product
recommendations.
• Dataset: Transaction data from a large retail store.
• Process:
• Apply the Apriori algorithm to find association rules
between products (e.g., milk and bread are often bought
together).
• Set a minimum support and confidence to filter the rules.
• Result: Store layouts are redesigned to place
frequently bought-together items closer, boosting sales
by cross-promoting products.
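A small Apriori-style sketch of the process above: find itemsets meeting a minimum support level-by-level (using the downward-closure property to build candidates), then derive association rules meeting a minimum confidence. The toy basket transactions and thresholds are illustrative, not from the slides.

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items if support(s) >= min_support}
    while level:
        frequent.update({s: support(s) for s in level})
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        level = {a | b for a in level for b in level
                 if len(a | b) == len(a) + 1 and support(a | b) >= min_support}
    return frequent

def rules(frequent, min_conf):
    out = []
    for s, sup in frequent.items():
        for r in range(1, len(s)):
            for lhs in map(frozenset, combinations(s, r)):
                conf = sup / frequent[lhs]       # support(s) / support(lhs)
                if conf >= min_conf:
                    out.append((set(lhs), set(s - lhs), conf))
    return out

# Toy baskets: milk and bread are often bought together.
tx = [{"milk", "bread"}, {"milk", "bread", "butter"},
      {"bread", "butter"}, {"milk", "bread"}]
freq = apriori(tx, min_support=0.5)
assoc = rules(freq, min_conf=0.8)                # e.g. {milk} -> {bread}
```

Every subset of a frequent itemset is itself frequent, which is why `frequent[lhs]` in the rule step is always defined.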
Conclusion and Key
Takeaways
• Unsupervised Learning is powerful for uncovering
hidden patterns in unlabeled data.
• Real-Time Applications:
• Customer segmentation (K-Means)
• Anomaly detection (DBSCAN)
• Market basket analysis (Apriori)
• Case Study: Retail industry benefits from association
rule mining to improve sales and customer
experience.
Summary
• Clustering has a long history and is still an active area.
There are a huge number of clustering algorithms,
and more are still coming every year.
• We only introduced several main algorithms. There
are many others, e.g.,
density based algorithm, sub-space clustering, scale-up
methods, neural networks based methods, fuzzy clustering,
co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application dependent and to
some extent subjective.
•Thank You!