Unsupervised Learning:
Unsupervised learning is a type of machine learning in which a model is trained on an unlabelled dataset and learns patterns directly from the data.
The algorithm identifies patterns, relationships, and
structures within the data without explicit guidance.
Examples include clustering similar customers
together or reducing the dimensionality of data.
Clustering is one of the most common techniques in
unsupervised learning. It involves grouping similar
data points together based on certain characteristics
or features.
What is Clustering?
Clustering is the task of dividing a set of data points
into groups (called clusters) such that points in the
same group are more similar to each other than to
those in other groups.
No labels: Unlike supervised learning, clustering does
not rely on labeled data.
Similarity-based: Groups are formed based on some
notion of similarity (e.g., distance, density, or
distribution).
Common Clustering Algorithms
K-Means
Key Idea: Assigns data into K clusters by minimizing intra-cluster variance.
Use Case Examples: Customer segmentation, image compression.
Hierarchical
Key Idea: Builds a tree of clusters (agglomerative or divisive).
Use Case Examples: Document or gene taxonomy.
K-Means Clustering Algorithm:
K-Means Clustering is an unsupervised
machine learning algorithm used to group
data into k clusters, where each data point
belongs to the cluster with the nearest mean.
It's widely used for pattern recognition,
image compression, customer segmentation,
etc.
How K-Means Works
1. Choose the number of clusters (k).
2. Randomly initialize k centroids
(cluster centers).
3. Assign each point to the nearest
centroid (cluster).
4. Update centroids by calculating the
mean of all points in each cluster.
5. Repeat steps 3 and 4 until:
o Cluster assignments no longer change,
or
o A maximum number of iterations is
reached.
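
As a rough illustration of these steps, here is a minimal NumPy sketch of the K-Means loop (a simplified, from-scratch version for teaching purposes, not the scikit-learn implementation; it ignores edge cases such as empty clusters):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: update each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage with six 2D points and k = 2
labels, centroids = kmeans(np.array([[1, 2], [1, 4], [1, 0],
                                     [10, 2], [10, 4], [10, 0]], dtype=float), k=2)
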
Example:
Imagine clustering customers based on
spending habits (features: income, spending
score):
Input data:
(40k, Low), (60k, Medium), (80k, High), ...
You set k = 3.
Algorithm groups customers into 3
clusters based on similarity.
Limitations
Requires choosing k in advance.
Sensitive to initial centroid positions.
Assumes spherical clusters (not great for
irregular shapes).
Affected by outliers.
Problem:
Cluster the following set of 2D data points
into 2 clusters using K-Means.
Points: (1,2), (1,4), (1,0),
(10,2), (10,4), (10,0)
Step-by-Step Code Example (Python)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 1: Define the data
X = np.array([
    [1, 2],
    [1, 4],
    [1, 0],
    [10, 2],
    [10, 4],
    [10, 0]
])

# Step 2: Create and fit the KMeans model
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)

# Step 3: Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Step 4: Plot the data points and centroids
colors = ['red', 'blue']
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[labels[i]], s=100)

# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, c='black')
plt.title('K-Means Clustering Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)
plt.show()
Explanation:
The algorithm groups the points into 2
clusters: one around (1, y) and another
around (10, y).
It uses Euclidean distance to assign each
point to the nearest cluster center.
The centroids (black X marks) are updated
until convergence.
Hierarchical Clustering:
Hierarchical clustering is an unsupervised machine learning method used to group data points into clusters based on their similarity. Unlike algorithms such as K-Means, hierarchical clustering builds a hierarchy (tree) of clusters.
Types of Hierarchical Clustering
1. Agglomerative (Bottom-Up)
o Starts with each data point as its own
cluster.
o Iteratively merges the two closest
clusters.
o Continues until all points are in one
single cluster (or desired number of
clusters).
2. Divisive (Top-Down)
o Starts with all data points in one
cluster.
o Recursively splits clusters into smaller
ones.
o Continues until each point is its own
cluster (or desired structure is
reached).
How It Works (Agglomerative Example)
1. Compute a distance matrix between
all pairs of points.
2. Use a linkage method to define inter-
cluster distances:
o Single linkage: Minimum distance between points in the two clusters.
o Complete linkage: Maximum distance between points in the two clusters.
o Average linkage: Average distance between all pairs of points across the two clusters.
o Ward's method: Minimizes the increase in within-cluster variance after merging.
3. Merge the two closest clusters.
4. Update the distance matrix.
5. Repeat until all points are in a single
cluster.
Dendrogram
A dendrogram is a tree-like diagram that
shows the merging/splitting of clusters at
different distance thresholds. It’s commonly
used to:
Visualize the clustering process.
Decide the number of clusters by "cutting"
the tree at a certain height.
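
As an illustration, a short sketch using SciPy's linkage and dendrogram functions (applied here, as an assumption, to the same six 2D points used in the K-Means example) could look like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Same six 2D points as in the K-Means example
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Agglomerative clustering with Ward's linkage (minimizes within-cluster variance)
Z = linkage(X, method='ward')

# The dendrogram shows the merges; "cutting" it at a chosen height gives the clusters
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point index')
plt.ylabel('Distance')
plt.show()
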
Dimensionality Reduction:
Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA),
t-Distributed Stochastic Neighbor Embedding
(t-SNE).
Dimensionality reduction is a critical
technique in data science and machine
learning for simplifying high-dimensional data
while preserving important structures. Below
is an overview of three commonly used
dimensionality reduction techniques: PCA,
LDA, and t-SNE.
Principal Component Analysis (PCA):
Purpose: Reduce dimensionality while
preserving as much variance in the data as
possible.
Key Features:
Unsupervised method (doesn’t use class
labels).
Projects data onto a new coordinate
system formed by principal components,
which are directions of maximum
variance.
Each principal component is orthogonal
(uncorrelated) to the others.
How it Works:
1. Standardize the data.
2. Compute the covariance matrix.
3. Calculate eigenvalues and
eigenvectors of the covariance matrix.
4. Sort eigenvectors by decreasing
eigenvalues.
5. Select the top k eigenvectors to form
the reduced feature space.
Use Case: Preprocessing, visualization,
compression, noise reduction.
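
For instance, a brief scikit-learn sketch (using the Iris dataset purely as an illustrative choice) that standardizes the data and keeps the top two principal components:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: standardize the data (Iris: 150 samples, 4 features)
X = load_iris().data
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA finds the directions of maximum variance and
# projects the data onto the top k = 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)               # (150, 2)
print(pca.explained_variance_ratio_) # share of variance kept by each component
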
Linear Discriminant Analysis (LDA):
Purpose: Reduce dimensionality while
preserving class-discriminative
information.
Key Features:
Supervised method (uses class labels).
Seeks to maximize the between-class
variance and minimize the within-class
variance.
Projects data in a way that the classes are
most separable.
How it Works:
1. Compute the mean vectors for each
class and the overall mean.
2. Compute the within-class scatter
matrix and between-class scatter matrix.
3. Solve the generalized eigenvalue
problem to find the linear discriminants.
4. Select the top k discriminants to
transform the data.
Use Case: Classification problems, face
recognition, preprocessing for supervised
learning.
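
A minimal scikit-learn sketch (again using the Iris dataset as an illustrative assumption); note that, unlike PCA, the class labels are required:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA is supervised: it needs the features X and the class labels y
X, y = load_iris(return_X_y=True)

# With 3 classes, at most (3 - 1) = 2 linear discriminants are available
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)  # (150, 2): data projected onto the two discriminants
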
t-Distributed Stochastic Neighbor Embedding (t-SNE):
Purpose: Visualize high-dimensional data
in 2D or 3D by preserving local structures
(i.e., neighborhoods).
Key Features:
Non-linear, unsupervised method.
Excellent for visualizing clusters and
structures in the data.
Converts high-dimensional Euclidean
distances into conditional probabilities.
How it Works:
1. Calculates pairwise similarities in high-
dimensional space (using Gaussian
distributions).
2. Maps similarities to low-dimensional
space (using t-distribution).
3. Minimizes the Kullback-Leibler
divergence between the two distributions
using gradient descent.
Use Case: Visualization of high-
dimensional data such as image
embeddings, word vectors, or gene
expression profiles.
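
A short scikit-learn sketch (using the digits dataset and a perplexity of 30 as illustrative assumptions) that maps 64-dimensional images down to 2D for plotting:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# High-dimensional data: 8x8 digit images flattened to 64 features each
X, y = load_digits(return_X_y=True)

# Map to 2D; perplexity roughly controls the size of the neighborhoods preserved
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10)
plt.title('t-SNE projection of the digits dataset')
plt.show()
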
Comparison Table
Feature                    | PCA               | LDA                       | t-SNE
Type                       | Unsupervised      | Supervised                | Unsupervised
Goal                       | Maximize variance | Maximize class separation | Preserve local structure
Linear/Non-linear          | Linear            | Linear                    | Non-linear
Interpretability           | High              | Medium                    | Low
Computational Cost         | Low               | Medium                    | High
Suitable for Visualization | Moderate          | Moderate                  | Excellent (2D/3D)
Association Rule Learning:
Apriori algorithm:
The Apriori algorithm is a classic
algorithm used in data mining for
association rule learning.
It helps identify frequent item sets in a
dataset (typically market basket data)
and derive association rules from them.
What is it used for?
Apriori is commonly used to find:
Items frequently bought together
(e.g., "people who buy bread often
buy butter")
Patterns in transaction databases
Recommendations in retail, e-
commerce, etc.
Key Concepts:
1. Itemset
A group of one or more items.
Example: {milk, bread}.
2. Support
The frequency or proportion of
transactions that contain a certain
itemset.
Support(A) = (Transactions containing A) / (Total transactions)
3. Confidence
The likelihood of item Y being bought
when X is bought.
Confidence(X→Y)=Support(X∪Y)/Support(X)
4. Lift
Measures how much more often X and Y
occur together than expected if they were
independent.
Lift(X→Y)=Confidence(X→Y)/Support(Y)
How Apriori Works
1. Set a minimum support threshold.
2. Generate candidate itemsets of size 1 and keep only those that meet the support threshold.
3. Generate itemsets of size k+1 from
those of size k (this is the "apriori
property": all subsets of a frequent item
set must also be frequent).
4. Repeat until no more itemsets meet
the minimum support.
5. Generate association rules from the
frequent itemsets based on confidence
and lift.
Example
Assume 5 transactions:
Transaction ID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Let’s say:
Minimum support = 0.6 (i.e., appears in at least 3 of 5 transactions)
Minimum confidence = 0.7
You’ll find frequent itemsets like:
{Bread, Milk} (support = 0.6)
{Diaper, Beer} (support = 0.6)
From these, you can generate rules like:
Diaper → Beer (confidence = 0.75)
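
A compact sketch of this example using the third-party mlxtend library (an assumption; any Apriori implementation would do), with the five transactions above, minimum support 0.6, and minimum confidence 0.7:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# The five transactions from the table above
transactions = [
    ['Bread', 'Milk'],
    ['Bread', 'Diaper', 'Beer', 'Eggs'],
    ['Milk', 'Diaper', 'Beer', 'Coke'],
    ['Bread', 'Milk', 'Diaper', 'Beer'],
    ['Bread', 'Milk', 'Diaper', 'Coke'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets at minimum support 0.6, then rules at minimum confidence 0.7
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)

print(frequent_itemsets)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
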
Market Basket Analysis (MBA)
is a data mining technique used to
discover associations or relationships
between items that customers purchase
together. It's commonly used in retail, e-
commerce, and recommendation systems
to optimize sales, marketing, and product
placement.
Key Concept:
Association Rule Learning
Market Basket Analysis is typically
powered by association rule learning,
using algorithms such as:
Apriori Algorithm
FP-Growth Algorithm (Frequent Pattern
Growth)
Eclat Algorithm
Basic Terminology
Let’s say a customer’s shopping basket
has:
{Bread, Milk, Butter}
Here’s how we analyze relationships:
Itemset: A collection of one or more items
(e.g., {Bread, Milk})
Support: How frequently an itemset
appears in all transactions
Support(X) = (Transactions containing X) / (Total transactions)
Confidence: Likelihood that item Y is also
bought when item X is bought
Confidence(X→Y)=Support(X∪Y)/Support(X)
Lift: Strength of a rule over the random co-
occurrence of X and Y
Lift(X→Y)=Confidence(X→Y)/Support(Y)
Lift > 1: Positive correlation
Lift = 1: No correlation
Lift < 1: Negative correlation
Example
If 100 customers shopped:
30 bought Milk
20 bought Milk and Bread
Then:
Support(Milk) = 30%
Support(Milk ∩ Bread) = 20%
Confidence(Milk → Bread) = 20% / 30% =
66.7%
If Support(Bread) = 40%, then
Lift = 66.7% / 40% = 1.67
This means customers who buy Milk are 1.67 times more likely to also buy Bread than would be expected by chance.
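
The same arithmetic in a few lines of Python (the 40% support for Bread is the assumed figure from above):

# Figures from the example above (fractions of 100 customers)
support_milk = 0.30            # 30 customers bought Milk
support_milk_and_bread = 0.20  # 20 bought both Milk and Bread
support_bread = 0.40           # assumed support of Bread

confidence = support_milk_and_bread / support_milk  # 0.667
lift = confidence / support_bread                   # about 1.67

print(f"Confidence(Milk -> Bread) = {confidence:.3f}")
print(f"Lift(Milk -> Bread)       = {lift:.2f}")
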
Evaluation Metrics for Association Rules:
Evaluation metrics are used to measure the quality and usefulness of the rules generated by algorithms like Apriori, FP-Growth, etc.
These metrics help determine which rules
are interesting, reliable, and potentially
actionable.
Here are the key evaluation metrics:
1. Support
Definition: The proportion of transactions
in the dataset that contain the itemset.
Formula:
Support(A⇒B) = (Number of transactions containing A∪B) / (Total number of transactions)
Interpretation: Indicates how frequently the rule
occurs in the dataset.
2. Confidence
Definition: The proportion of transactions
containing A that also contain B.
Formula:
Confidence(A⇒B)= Support(A∪B)/ Support(A)
Interpretation: Measures the reliability of the
rule.
3. Lift
Definition: The ratio of the observed support of A
→ B to that expected if A and B were
independent.
Formula:
Lift(A⇒B) = Support(A∪B) / (Support(A)×Support(B))
Interpretation:
> 1: Positive correlation (A and B occur together
more than expected).
= 1: No correlation (A and B are independent).
< 1: Negative correlation.
4. Leverage
Definition: The difference between the observed
and expected frequency of A and B occurring
together.
Formula:
Leverage(A⇒B)=Support(A∪B)
−Support(A)×Support(B)
Interpretation: Measures the added value of the
rule over random chance.
5. Conviction
Definition: The ratio of the expected frequency of A occurring without B (if A and B were independent) to the observed frequency of A occurring without B.
Formula:
Conviction(A⇒B) = (1 − Support(B)) / (1 − Confidence(A⇒B))
Interpretation: A conviction > 1 suggests the rule
has predictive power.
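
A small helper function (a sketch, with hypothetical names) that computes these five metrics from the itemset supports, checked against the Diaper → Beer rule from the earlier example:

def rule_metrics(support_a, support_b, support_ab):
    """Standard association-rule metrics from itemset supports (as fractions)."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is undefined (infinite) when confidence = 1
    conviction = (1 - support_b) / (1 - confidence) if confidence < 1 else float('inf')
    return {'support': support_ab, 'confidence': confidence,
            'lift': lift, 'leverage': leverage, 'conviction': conviction}

# Diaper -> Beer from the Apriori example:
# Support(Diaper) = 4/5, Support(Beer) = 3/5, Support(Diaper ∪ Beer) = 3/5
print(rule_metrics(support_a=0.8, support_b=0.6, support_ab=0.6))
# confidence = 0.75, lift = 1.25, leverage = 0.12, conviction = 1.6
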
Optional / Additional Metrics