KEMBAR78
Unit5 CSM ML | PDF | Cluster Analysis | Matrix (Mathematics)
0% found this document useful (0 votes)
36 views32 pages

Unit5 CSM ML

The document provides an overview of clustering techniques in data mining and machine learning, explaining hard and soft clustering, as well as various methods such as hierarchical, partitioning, and density-based clustering. It details algorithms like K-Means and DBSCAN, their processes, advantages, and limitations, along with the concept of matrix factorization for enhancing clustering performance. Additionally, it highlights the applications of clustering in diverse fields including document clustering, bioinformatics, and customer segmentation.

Uploaded by

girik3828
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
36 views32 pages

Unit5 CSM ML

The document provides an overview of clustering techniques in data mining and machine learning, explaining hard and soft clustering, as well as various methods such as hierarchical, partitioning, and density-based clustering. It details algorithms like K-Means and DBSCAN, their processes, advantages, and limitations, along with the concept of matrix factorization for enhancing clustering performance. Additionally, it highlights the applications of clustering in diverse fields including document clustering, bioinformatics, and customer segmentation.

Uploaded by

girik3828
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 32

Unit 5

Clustering:
Clustering is like sorting a bunch of similar items into different groups based on their
characteristics. In data mining and machine learning, it’s a powerful technique used to group
similar data points together, making it easier to find patterns or understand large datasets.
Essentially, clustering helps identify natural groupings in your data.

There are two common types of clustering:

Broadly speaking, there are 2 types of clustering that can be performed to group similar
data points:

 Hard Clustering: In this type of clustering, each data point belongs to a cluster
completely or not. For example, Let's say there are 4 data point and we have to cluster
them into 2 clusters. So each data point will either belong to cluster 1 or cluster 2.
Data Points Clusters

A C1

B C2

C C2

D C1

 Soft Clustering: In this type of clustering, instead of assigning each data point into a
separate cluster, a probability or likelihood of that point being that cluster is evaluated.
For example, Let's say there are 4 data point and we have to cluster them into 2 clusters.
So we will be evaluating a probability of a data point belonging to both clusters. This
probability is calculated for all data points.

Data Points Probability of C1 Probability of C2

A 0.91 0.09

B 0.3 0.7

C 0.17 0.83

D 1 0

Types of Clustering Methods:


Clustering is a type of unsupervised learning wherein data points are grouped into different
sets based on their degree of similarity.

The various types of clustering are:

 Hierarchical clustering

 Partitioning clustering

 Density-based Clustering

Hierarchical clustering is further subdivided into:

 Agglomerative clustering

 Divisive clustering

Partitioning clustering is further subdivided into:

 K-Means clustering

 Fuzzy C-Means clustering

Density-based Clustering

 DBSCAN
At the surface level, clustering helps in the analysis of unstructured data. Graphing,
the shortest distance, and the density of the data points are a few of the elements that
influence cluster formation. Clustering is the process of determining how related the
objects are based on a metric called the similarity measure.
Similarity metrics are easier to locate in smaller sets of features and harder as the
number of features increases. Depending on the type of clustering algorithm being
utilized, several techniques are employed to group the data from the datasets. In this part,
the clustering techniques are described. Various types of clustering algorithms are:

1. Centroid-based Clustering (Partitioning methods)


2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)

Partitional Clustering(K MEANS CLUSTERING)

Partitional clustering divides data into distinct non-overlapping subsets or clusters. The most
common algorithm is K-Means, which assigns each data point to the nearest cluster center,
iteratively updating centers to minimize intra-cluster variance.

1. Choose the number of clusters k

The first step in k-means is to pick the number of clusters, k.

2. Select k random points from the data as centroids

Next, we randomly select the centroid for each cluster. Let’s say we want to have 2 clusters,
so k is equal to 2 here. We then randomly select the centroid:

Here, the red and green circles represent the centroid for these clusters.

3. Assign all the points to the closest cluster centroid


Once we have initialized the centroids, we assign each point to the closest cluster centroid:

Here you can see that the points closer to the red point are assigned to the red cluster,
whereas the points closer to the green point are assigned to the green cluster.

4. Recompute the centroids of newly formed clusters

Now, once we have assigned all of the points to either cluster, the next step is to compute the
centroids of newly formed clusters:

Here, the red and green crosses are the new centroids.

5. Repeat steps 3 and 4

We then repeat steps 3 and 4:

The step of computing the centroid and assigning all the points to the cluster based on their
distance from the centroid is a single iteration. But wait – when should we stop this process?
It can’t run till eternity, right?

Stopping Criteria for K-Means Clustering

There are essentially three stopping criteria that can be adopted to stop the K-means
algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached

We can stop the algorithm if the centroids of newly formed clusters are not changing. Even
after multiple iterations, if we are getting the same centroids for all the clusters, we can say
that the algorithm is not learning any new pattern, and it is a sign to stop the training.

EXAMPLE:

Solution:
Iteratively assign data points to the closest cluster and recalculate cluster centers until
convergence.
Step 1 . Assign data points to the closest cluster.
Hierarchical Clustering

Hierarchical clustering is a technique used to group similar data points together based on
their similarity creating a hierarchy or tree-like structure. The key idea is to begin with each
data point as its own separate cluster and then progressively merge or split them based on
their similarity.

Hierarchical clustering builds a multilevel hierarchy of clusters by either merging smaller


clusters (agglomerative) or splitting larger ones (divisive). The results are often visualized
using a dendrogram, illustrating the nested grouping of patterns.

Dendogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group, and as you move
up, similar groups are combined. The lower the merge point, the more similar the groups are.
It helps you see how things are grouped step by step.
The working of the dendrogram can be explained using the below diagram:
In this image, on the left side, there are five points labeled P, Q, R, S, and T. These
represent individual data points that are being clustered. On the right side, there’s
a dendrogram, which shows how these points are grouped together step by step.
 At the bottom of the dendrogram, the points P, Q, R, S, and T are all separate.
 As you move up, the closest points are merged into a single group.
 The lines connecting the points show how they are progressively merged based on
similarity.
 The height at which they are connected shows how similar the points are to each other;
the shorter the line, the more similar they are
Types of Hierarchical Clustering
Now that we understand the basics of hierarchical clustering, let's explore the two main
types of hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
1.Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
Unlike flat clustering hierarchical clustering provides a structured way to group data. This
clustering algorithm does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a single cluster that
contains all data.
2.Hierarchical Divisive clustering
It is also known as a top-down approach. This algorithm also does not require to prespecify
the number of clusters. Top-down clustering requires a method for splitting a cluster that
contains the whole data and proceeds by splitting clusters recursively until individual data
have been split into singleton clusters.

Workflow for Hierarchical Divisive clustering :


1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to separate
the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until every data
point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.
Advantages:

 No predefined clusters: Hierarchical clustering does not require specifying the number
of clusters in advance.
 Dendrogram for visualization: It allows the creation of dendrograms, providing a
visual representation of clusters at various levels.
 Cluster hierarchy: The hierarchical nature helps reveal nested clusters.
Limitations:

 Computational complexity: Hierarchical clustering can be computationally expensive


for large datasets.
 Inflexibility: Once a merge or split is made, it cannot be undone.

Density-Based Clustering

Density-based clustering algorithms, like DBSCAN, identify clusters as areas of high density
separated by areas of low density. They are effective in discovering clusters of arbitrary
shape and in identifying noise or outliers in the data.
DBSCAN is a density-based clustering algorithm that groups data points that are
closely packed together and marks outliers as noise based on their density in the
feature space.
It identifies clusters as dense regions in the data space separated by areas of lower density.
Unlike K-Means or hierarchical clustering which assumes clusters are compact and
spherical, DBSCAN perform well in handling real-world data irregularities such as:
 Arbitrary-Shaped Clusters: Clusters can take any shape not just circular or convex.
 Noise and Outliers: It effectively identifies and handles noise points without assigning
them to any cluster.
K-Means and Hierarchical handling compact, spherical clusters with varying noise
tolerance while DBSCAN manages arbitrary-shaped clusters and noise handling.

Key Parameters in DBSCAN

1. eps: This defines the radius of the neighborhood around a data point. If the distance
between two points is less than or equal to eps they are considered neighbors. A common
method to determine eps is by analyzing the k-distance graph. Choosing the right eps is
important:
 If eps is too small most points will be classified as noise.
 If eps is too large clusters may merge and the algorithm may fail to distinguish between
them.
1. MinPts: This is the minimum number of points required within the eps radius to
form a dense region. A general rule of thumb is to set MinPts >= D+1 where D is
the number of dimensions in the dataset.

How Does DBSCAN Work?

DBSCAN works by categorizing data points into three types:


1. Core points which have a sufficient number of neighbors within a specified radius
(eplison)
2. Border points which are near core points but lack enough neighbors to be core points
themselves
3. Noise points which do not belong to any cluster.
By iteratively expanding clusters from core points and connecting density-reachable points,
DBSCAN forms clusters without relying on rigid assumptions about their shape or size.
Steps in the DBSCAN Algorithm

1. Identify Core Points: For each point in the dataset count the number of points within
its eps neighborhood. If the count meets or exceeds MinPts mark the point as a core
point.
2. Form Clusters: For each core point that is not already assigned to a cluster create a
new cluster. Recursively find all density-connected points i.e points within
the eps radius of the core point and add them to the cluster.
3. Density Connectivity: Two points a and b are density-connected if there exists a chain
of points where each point is within the eps radius of the next and at least one point in
the chain is a core point. This chaining process ensures that all points in a cluster are
connected through a series of dense regions.
4. Label Noise Points: After processing all points any point that does not belong to a
cluster is labeled as noise.

Partitioning of Data

 Data is divided into distinct groups or clusters.


 Methods:
o Hard Partitioning: Each point belongs to exactly one cluster.
o Soft Partitioning: A point can belong to multiple clusters with a degree of
membership.

Hard Partitioning:

In hard partitioning, each data point belongs to exactly one cluster. There is a clear and
exclusive assignment—no overlaps, no ambiguity.

Key Characteristics:

 Each data point is assigned only one cluster label.


 The partitioning is binary (0 or 1 membership).
 Simpler to understand and implement.
 Used in traditional clustering algorithms like K-Means, Agglomerative Clustering.

Advantages:

 Simplicity: Easy to implement and interpret.


 Efficiency: Requires less computational complexity than soft partitioning.
 Works well when cluster boundaries are well defined.

Soft Partitioning:
In soft partitioning, a data point can belong to multiple clusters with a degree of
membership (between 0 and 1). The sum of the memberships across all clusters is usually 1.

Key Characteristics:

 Each data point has partial memberships to all clusters.


 Membership values express how strongly a point belongs to each cluster.
 Used in algorithms like Fuzzy C-Means, EM (Expectation-Maximization), GMM
(Gaussian Mixture Models).

Advantages:

 Handles uncertainty and overlapping clusters better.


 More flexible and realistic for real-world problems (e.g., customer segmentation,
medical data).
 Suitable for soft boundaries and fuzzy data.

Matrix Factorization in Clustering


Matrix Factorization is a technique used to reduce high-dimensional data into a lower-dimensional
latent space, which can then be used to improve clustering performance.

 Common in collaborative filtering, topic modeling.


 Example: Non-negative Matrix Factorization (NMF) is used to extract features for
clustering.

It is the process of decomposing a matrix into a product of two (or more) smaller
matrices.

The core idea is to represent your data matrix X (e.g., rows are data points,
columns are features) as a product of two lower-rank matrices, W and H:

Here:
 X is the original data matrix (e.g., m×n, where m is the number of data points and n is
the number of features).
 W (e.g., m×k) can be interpreted as the basis matrix or cluster centroid matrix.
Each column of W can represent a latent feature or a cluster prototype.
 H (e.g., n×k) is the coefficient matrix or cluster indicator matrix. Each row of H (or
sometimes a transformed version of it) indicates the degree to which a data point
belongs to each of the k clusters. The non-negativity constraint (especially in NMF)
allows for a "part-based" representation, where parts add up to form the whole, which
is highly interpretable for clustering.

Role in Clustering

 Helps extract latent features or patterns in the data.


 Enables clustering in reduced space, which is faster and often more accurate.
 Can uncover hidden structure in data (especially sparse or noisy data).

Popular Matrix Factorization Techniques for Clustering

While various MF techniques exist, Non-negative Matrix Factorization (NMF) is


particularly prominent in clustering due to its inherent interpretability:

 Non-negative Matrix Factorization (NMF): This technique specifically constrains


the elements of W and H to be non-negative. This constraint leads to a "parts-based"
representation, where the latent factors (columns of W) combine additively to
reconstruct the original data. In clustering, this translates to clusters being formed by
meaningful combinations of features, and data points belonging to clusters based on
their positive contributions.

Other MF techniques like Singular Value Decomposition (SVD) can also be used, often
followed by another clustering step (e.g., K-means on the reduced dimensions). However,
SVD doesn't impose non-negativity, which can make the interpretation of the factors less
direct for clustering.

Applications of Matrix Factorization in Clustering

MF for clustering finds applications in diverse domains:

 Document Clustering: Identifying topics or themes within a collection of documents.


 Image Clustering: Grouping similar images based on their underlying visual
features.
 Bioinformatics: Clustering gene expression data to identify functional groups of
genes or patient subgroups.
 Recommender Systems: While primarily used for recommendations, the latent
factors learned through MF can implicitly cluster users with similar preferences or
items with similar characteristics.
 Social Network Analysis: Discovering communities or groups within a network.
 Customer Segmentation: Grouping customers with similar purchasing behaviors or
demographics.

Advantages of Matrix Factorization in Clustering


 Discovering Latent Structures: Effectively uncovers hidden patterns and
relationships in data that traditional methods might miss.
 Interpretability: Especially with NMF, the non-negative factors can be more easily
interpreted as meaningful "parts" or cluster characteristics.
 Handles High-Dimensional and Sparse Data: Well-suited for datasets with many
features and/or missing values (common in areas like recommender systems).
 Reduced Computational Complexity: By reducing dimensionality, it can make
subsequent clustering steps or analyses more efficient.
 Flexibility: Various regularization terms and constraints can be added to the MF
objective function to tailor it for specific clustering tasks (e.g., enforcing sparsity for
clearer cluster assignments, or incorporating graph regularization for manifold
learning).

Disadvantages and Limitations

 Choosing the Number of Latent Factors (k): Determining the optimal number of
clusters (k) is often a challenge, similar to other clustering algorithms.
 Computational Cost: For very large datasets, the factorization process can still be
computationally expensive, though often more scalable than some high-dimensional
clustering methods.
 Non-convex Optimization (for NMF): NMF's optimization problem is non-convex,
meaning there's no guarantee of finding the global minimum. The results can be
sensitive to initialization.
 Cold-Start Problem (in recommender systems): For entirely new users or items
with no historical data, MF models struggle to generate embeddings or assign them to
clusters.
 Sensitivity to Noise and Outliers: The performance can be affected by significant
noise or outliers in the data.

Fuzzy C-Means (FCM) clustering

Fuzzy C-Means (FCM) clustering is a soft clustering algorithm that allows data points to
belong to multiple clusters simultaneously, with varying degrees of membership. Unlike
"hard" clustering methods like K-means, where each data point is assigned to exactly one
cluster, FCM provides a more nuanced representation of data, especially when clusters
overlap or when a data point naturally exhibits characteristics of more than one group.

How Fuzzy C-Means Clustering Works

The core idea of FCM is to minimize an objective function that measures the weighted sum
of squared distances between data points and cluster centroids. The "fuzzy" aspect comes
from the membership values, which are real numbers between 0 and 1 (inclusive), indicating
the degree to which a data point belongs to a particular cluster. The sum of membership
values for any given data point across all clusters must equal 1.

Here's a step-by-step breakdown of the FCM algorithm:

1. Initialization:
o Choose the number of clusters (C): This is a user-defined parameter, similar
to k in K-means.
o Initialize cluster centroids: Randomly select C data points from the dataset
as initial cluster centers, or randomly assign initial membership values to data
points.
o Choose the fuzziness parameter (m): This parameter controls the degree of
fuzziness or overlap between clusters. m>1 (typically m=2).
 If m→1, FCM approaches hard clustering (like K-means).
 If m→∞, the cluster centers move towards the center of the entire
dataset, and memberships tend to be equal for all clusters.

2. Iterative Optimization: The algorithm iteratively updates the cluster centroids and
membership values until convergence (i.e., when the change in membership values or
the objective function falls below a certain threshold, or a maximum number of
iterations is reached).
o Calculate/Update Membership Values (μij): For each data point xj and each
cluster i, the membership value μij is calculated based on the distance between
xj and the cluster centroid ci. The closer xj is to ci, the higher its membership
degree in cluster i. The formula is:

Calculate/Update Cluster Centroids (ci): After updating the membership values, the
cluster centroids are recalculated using the weighted mean of all data points, where the
weights are the fuzzy membership values:
3.Convergence: The iterations continue until the change in the membership matrix (or
the objective function value) between consecutive iterations is less than a predefined
threshold, or a maximum number of iterations is reached.

Objective Function

FCM aims to minimize the following objective function:

Minimizing this function effectively finds cluster centers that are central to the data points,
with the influence of each data point on a cluster center being proportional to its fuzzy
membership in that cluster.

Advantages of Fuzzy C-Means Clustering

 Soft Clustering: Allows data points to belong to multiple clusters, providing a more realistic
representation of overlapping clusters or ambiguous data points.
 Flexibility: Can handle situations where cluster boundaries are not clear-cut.
 Interpretability: The membership values offer insight into the degree of association between
data points and clusters.
 Robustness to Noise (to some extent): Fuzzy memberships can provide some resilience to
noisy data points compared to hard clustering.

Disadvantages and Limitations


 Choosing the Number of Clusters (C): Like K-means, determining the optimal number of
clusters is still a challenge and often requires domain knowledge or external validation
metrics.
 Sensitivity to Initialization: The algorithm can converge to a local optimum, and the results
can be sensitive to the initial choice of cluster centroids.
 Choice of Fuzziness Parameter (m): Selecting an appropriate value for m can be subjective
and impact the clustering results. A typical value is m=2, but it might need tuning.
 Computational Cost: For very large datasets, the iterative nature and calculations involving
all data points for every cluster can be computationally intensive.
 Sensitive to Outliers: While more robust than K-means, extreme outliers can still
disproportionately influence cluster centroids and membership values.
 Scalability Issues: May not scale well to extremely high-dimensional data without
dimensionality reduction preprocessing.

Applications of Fuzzy C-Means Clustering

FCM is widely applied in various fields, especially where data exhibits ambiguity or overlap:

 Image Segmentation: Dividing images into regions based on pixel intensity, color, or texture,
where pixels might belong to multiple regions (e.g., boundaries).
 Medical Image Analysis: Identifying different tissue types in MRI or CT scans, where
boundaries between tissues might be fuzzy.
 Pattern Recognition: Classifying patterns where there's inherent uncertainty or overlap
between categories.
 Bioinformatics: Clustering gene expression data, where genes might be involved in multiple
biological pathways.
 Customer Segmentation: Grouping customers with overlapping interests or purchasing
behaviors for targeted marketing.
 Geographical Information Systems (GIS): Analyzing spatial data to identify regions with
mixed characteristics.
 Fault Diagnosis: Identifying system faults based on sensor data, where a fault might have
multiple contributing factors.

Rough Clustering
Rough Clustering is an approach to clustering that leverages Rough Set Theory (RST).
Unlike traditional clustering methods that aim to assign each data point definitively to a
single cluster (hard clustering) or to multiple clusters with degrees of membership (fuzzy
clustering), Rough Clustering addresses the inherent vagueness and uncertainty in data by
defining clusters using lower and upper approximations.

What is Rough Set Theory (RST)?

At its core, Rough Set Theory, introduced by Zdzisław Pawlak in 1982, is a mathematical
framework for dealing with imprecise, uncertain, or incomplete information. It operates
on the principle of indiscernibility, meaning that objects that have the same attribute values
are considered indiscernible from each other given the available information.

Key concepts in RST for clustering:


 Information System: Data is represented as an information system, which is a table
where rows are objects (data points) and columns are attributes (features).
 Indiscernibility Relation: Objects are considered indiscernible if they cannot be
distinguished based on the available attributes. This relation partitions the universe of
objects into equivalence classes (elementary sets). These elementary sets are the basic
"atoms of knowledge."
 Approximations of a Set: For any given set (which can be a potential cluster), RST
defines two approximations:
o Lower Approximation (or Positive Region): This consists of all objects that
are definitely members of the set, based on the current knowledge. These are
the elementary sets entirely contained within the target set.
o Upper Approximation (or Possible Region): This consists of all objects that
possibly belong to the set. This includes all elementary sets that have at least
one object in common with the target set.
 Boundary Region: The difference between the upper and lower approximations.
Objects in the boundary region cannot be definitively classified as belonging or not
belonging to the set based on the current knowledge. These are the "rough" elements.

How Rough Set Theory Applies to Clustering

In Rough Clustering, the notion of a "cluster" is defined by these approximations:

 Lower Approximation of a Cluster: Contains objects that unambiguously belong


to that cluster. These objects share similar characteristics so strongly that they are
certainly part of that group.
 Upper Approximation of a Cluster: Contains objects that possibly belong to that
cluster. This includes the definitely belonging objects, plus those that share some
similarities but might also share characteristics with other clusters, thus existing in a
"boundary" region.
 Boundary Region of a Cluster: Objects in this region cannot be precisely assigned
to a single cluster due to their ambiguous characteristics. They might share features
with multiple clusters.

The process of Rough Clustering often involves:

1. Attribute Discretization: For continuous attributes, discretization might be


performed to create discrete intervals, as rough set theory often works well with
categorical or discrete data.
2. Indiscernibility Relation Construction: Based on the attributes, indiscernibility
classes (elementary sets) are formed.
3. Cluster Formation (Iterative or Direct):
o Some algorithms might start with initial "seed" objects or groups and then
iteratively expand the lower and upper approximations of clusters based on
indiscernibility.
o Other methods might use a decision-theoretic rough set approach where
clusters are defined as equivalence classes of objects with similar decision
attributes.
4. Handling Uncertainty: The key strength is explicitly modeling the uncertainty in
cluster assignments through the boundary regions. Data points in the boundary region
are not forced into a single cluster, acknowledging their inherent ambiguity.
Types of Rough Clustering Algorithms

Several algorithms have been developed based on rough set theory for clustering:

 Rough K-means: An extension of K-means where clusters are represented by rough


sets (lower and upper approximations) instead of crisp assignments.
 Rough Hierarchical Clustering: Building a hierarchy of clusters where each cluster
node is defined by rough approximations.
 Rough Density-Based Clustering: Incorporating rough set concepts into density-
based methods to handle noisy data and uncertain cluster boundaries.
 Attribute Reduction for Clustering: RST is also widely used for feature selection
or attribute reduction before traditional clustering. By identifying redundant or
irrelevant attributes using reducts (minimal subsets of attributes that preserve the
original discernibility), RST can simplify the data and improve the performance of
other clustering algorithms.

Advantages of Rough Clustering

 Handles Uncertainty Natively: Its primary strength is its ability to directly deal with
vagueness, imprecision, and incompleteness in data without requiring probability
distributions or membership functions (like fuzzy sets).
 No Prior Assumptions on Data Distribution: Unlike some statistical clustering
methods, RST does not make assumptions about the underlying data distribution.
 Identifies Redundant Features: RST's core concept of reducts makes it effective for
feature selection, which can simplify the clustering problem and improve efficiency.
 Interpretable Results: The lower and upper approximations provide a clear and
interpretable way to understand the certainty (or uncertainty) of an object's belonging
to a cluster.
 Works with Categorical Data: RST is particularly well-suited for categorical
(nominal) data, where traditional distance metrics might be problematic.

Disadvantages and Limitations

 Sensitivity to Discretization: If continuous attributes are present, the choice of


discretization method can significantly impact the results.
 Computational Complexity: Constructing indiscernibility relations and computing
approximations can be computationally intensive for very large datasets, especially
those with many attributes.
 Choice of Parameters: Like other clustering methods, selecting the number of
clusters or specific thresholds for defining approximations can be challenging.
 Scalability for High-Dimensional Data: While good for categorical data, direct
application to very high-dimensional numerical data might still face challenges
without prior dimensionality reduction.
 Interpretation of Boundary Region: While an advantage, the objects in the
boundary region might still require further analysis or domain expert intervention to
make definitive decisions.
Rough K-Means clustering

Rough K-Means clustering is a variation of the traditional K-Means algorithm that


incorporates principles from Rough Set Theory (RST) to handle the inherent vagueness and
uncertainty in data. While standard K-Means assigns each data point to exactly one cluster,
Rough K-Means allows for ambiguous assignments by defining clusters using lower and
upper approximations.

How it Differs from Standard K-Means

The fundamental difference lies in how data points are assigned to clusters and how cluster
centroids are updated:

 Standard K-Means (Hard Clustering):


o Each data point xj is assigned to a single cluster Ci if Ci's centroid is the
closest to xj.
o The objective is to minimize the sum of squared distances between data points
and their assigned cluster centroids.
o Centroids are updated as the mean of all points crisply assigned to that cluster.

 Rough K-Means (Uncertainty-Aware Clustering):


o Instead of a single assignment, each cluster Ci is represented by a lower
approximation (Ci) and an upper approximation (Ci).
o Lower Approximation (Ci): Contains data points that definitively belong to
cluster Ci. These points are significantly closer to Ci's centroid than to any
other cluster's centroid, passing a certain "certainty" threshold.
o Upper Approximation (Ci): Contains data points that possibly belong to
cluster Ci. This includes all points in the lower approximation, plus points that
are somewhat close to Ci's centroid but also share proximity with other cluster
centroids (the "boundary" region).
o Boundary Region: The set of points in the upper approximation but not in the
lower approximation (Ci∖Ci). These are the ambiguous data points.
o The objective function is typically modified to incorporate weighted
contributions from both lower and upper approximations when calculating
centroids.

Rough K-Means Algorithm Steps

The iterative process of Rough K-Means generally follows these steps:

1. Initialization:
o Choose the number of clusters, K.
o Randomly select K initial cluster centroids (similar to standard K-Means).
o Define two weighting parameters, wL and wU, for the lower and upper
approximations, respectively. Typically, wL≥wU, and wL+wU=1. These
weights reflect the importance given to points in the lower vs. upper
approximations during centroid calculation.
2. Assignment Step (Approximation Formation): For each data point xj:
o Calculate the distance from xj to all K cluster centroids (d(xj,ci)).
o Identify the two closest centroids, cp and cq (where cp is the closest and cq is
the second closest).
o Assign to Lower Approximation:
 If the distance to the closest centroid (d(xj,cp)) is significantly smaller
than the distance to the second closest centroid (d(xj,cq)) by a certain
ratio or threshold (e.g., d(xj,cp)/d(xj,cq)<threshold), then xj is assigned
to the lower approximation of cluster Cp (Cp). These are the "certain"
members.
o Assign to Upper Approximation:
 If xj is not in the lower approximation of any cluster (i.e., it's
ambiguous), it is assigned to the upper approximations of all clusters
whose centroids are "close enough" (e.g., within a certain range or a
certain ratio of distances). More commonly, an object xj is added to the
upper approximation Ci if ci is among the "few" closest centroids to xj.
In many formulations, Ci simply includes all points that are not
definitely in the lower approximation of another cluster.

3. Centroid Update Step: Recalculate the cluster centroids based on the new
assignments, typically using a weighted average:

This formula gives more weight (wL) to the points in the lower approximation
(definitely belonging) and less weight (wU) to the points in the boundary region
(possibly belonging).

4. Convergence Check: Repeat steps 2 and 3 until the cluster centroids no longer
change significantly, or a maximum number of iterations is reached.

Advantages of Rough K-Means

 Handles Uncertainty: Explicitly addresses and models vagueness and uncertainty in


data, which is crucial for real-world datasets where clear-cut cluster boundaries may
not exist.
 More Realistic Cluster Representation: Provides a more nuanced view of clusters
by defining both definite members (lower approximation) and possible members
(upper approximation).
 Robust to Noise/Outliers: By assigning ambiguous points to a boundary region or
distributing their influence through the upper approximation, it can be more robust to
noise and outliers compared to hard clustering, which might force outliers into
inappropriate clusters.
 Flexibility: The weighting parameters (wL,wU) and the assignment thresholds offer
flexibility to fine-tune the degree of "roughness."
Disadvantages and Limitations

 Increased Complexity: More complex to implement and understand than standard K-


Means due to the introduction of lower/upper approximations and weighting.
 Parameter Sensitivity: Requires careful selection of additional parameters (e.g., wL
,wU, and the thresholds for lower/upper approximation assignments). The choice of
these parameters can significantly impact the clustering results.
 Computational Cost: Can be more computationally intensive than standard K-
Means, especially in the assignment step, due to the need to determine proximity to
multiple centroids and manage approximations.
 Still Requires K: Similar to K-Means, the number of clusters K still needs to be
predefined, which can be a challenge.
 Interpretation of Boundary: While an advantage, the interpretation of the boundary
region might still require domain expertise, as these points are inherently ambiguous.

Gaussian Mixture Models (GMM)

Gaussian Mixture Model is a probabilistic model that assumes all data points are
generated from a mixture of several Gaussian distributions with unknown parameters.
Unlike hard clustering methods like K-Means that assigns each data point to one cluster
based on the closest centroid which often doesn't align with the complexity of real-world
data. GMMs perform soft clustering meaning each data point belongs to multiple
clusters with certain probabilities. Each Gaussian in the mixture is defined by:
 Mean (μ): The center of the distribution.
 Covariance (Σ): Describes the spread and orientation.
 Mixing coefficient (π): Represents the proportion of each Gaussian in the mixture.

Expectation Maximization (EM)

The EM algorithm is used to find maximum likelihood estimates of parameters in statistical


models with latent variables.

To fit a Gaussian Mixture Model to the data we use the Expectation-Maximization


(EM) algorithm which is an iterative method that optimize the parameters of the Gaussian
distributions like mean, covariance and mixing coefficients. It works in two main steps:
 Expectation Step (E-step): In this step the algorithm calculates the probability that each
data point belongs to each cluster based on the current parameter estimates (mean,
covariance, mixing coefficients).
 Maximization Step (M-step): After estimating the probabilities the algorithm updates
the parameters (mean, covariance and mixing coefficients) to better fit the data.
Expectation-Maximization in EM Algorithm
These two steps are repeated until the model converges meaning the parameters no longer
change significantly between iterations. Here’s a simple breakdown of the GMM process:

1. Initialization: Start with initial guesses for the means, covariances and mixing
coefficients of each Gaussian distribution.
2. E-step: For each data point, calculate the probability of it belonging to each Gaussian
distribution (cluster).
3. M-step: Update the parameters (means, covariances, mixing coefficients) using the
probabilities calculated in the E-step.
4. Repeat: Continue alternating between the E-step and M-step until the log-likelihood of
the data (a measure of how well the model fits the data) converges.
Formula:
The E-step computes the probabilities that each data point belongs to each Gaussian while
the M-step updates the parameters μk, Σk and πk based on these probabilities.

Advantages of EM algorithm
 Always improves results – With each step, the algorithm improves the likelihood
(chances) of finding a good solution.
 Simple to implement – The two steps (E-step and M-step) are often easy to code for
many problems.
 Quick math solutions – In many cases, the M-step has a direct mathematical solution
(closed-form), making it efficient
Disadvantages of EM algorithm
 Takes time to finish: It converges slowly meaning it may take many iterations to reach
the best solution.
 Gets stuck in local best: Instead of finding the absolute best solution, it might settle for
a "good enough" one.
 Needs extra probabilities: Unlike some optimization methods that only need forward
probability, EM requires both forward and backward probabilities making it slightly
more complex.

Spectral Clustering

Spectral Clustering is a powerful and flexible unsupervised learning algorithm that uses
the eigenvalues (spectrum) of a similarity matrix to reduce dimensions before applying a
standard clustering algorithm like K-Means.

It is especially effective for:

 Non-convex cluster shapes


 Manifold-based data
 Graphs or network-like data

Steps in Spectral Clustering

1. Construct a Similarity Matrix (S):

Measures similarity between data points (e.g., using Gaussian RBF kernel).

2. Build the Laplacian Matrix (L):


3. Compute Eigenvectors:
o Compute the first k eigenvectors of the Laplacian (smallest eigenvalues).
o These eigenvectors form a reduced representation of your data.
4. Cluster the Eigenvectors:
o Apply K-Means (or another algorithm) on the reduced k-dimensional data.

Advantages

 Works with non-spherical, complex-shaped clusters


 Can handle graph-based data (e.g., social networks)
 No assumption of convex boundaries

Disadvantages

 Computationally expensive (due to eigen-decomposition)


 Choice of similarity function and σ is critical
 Doesn’t scale well to very large datasets

UNIT-I

Introduction to Reinforcement Learning

Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents
can learn to make decisions through trial and error to maximize cumulative rewards. RL
allows machines to learn by interacting with an environment and receiving feedback based
on their actions. This feedback comes in the form of rewards or penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or
decision-maker) interacts with an environment to achieve a goal. The agent performs
actions and receives feedback to optimize its decision-making over time.
 Agent: The decision-maker that performs actions.

 Environment: The world or system in which the agent operates.


 State: The situation or condition the agent is currently in.
 Action: The possible moves or decisions the agent can make.
 Reward: The feedback or result from the environment based on the agent’s action.
How Reinforcement Learning Works?
The RL process involves an agent performing actions in an environment, receiving rewards
or penalties based on those actions, and adjusting its behavior accordingly. This loop helps
the agent improve its decision-making over time to maximize the cumulative reward.
Here’s a breakdown of RL components:
 Policy: A strategy that the agent uses to determine the next action based on the current
state.
 Reward Function: A function that provides feedback on the actions taken, guiding the
agent towards its goal.
 Value Function: Estimates the future cumulative rewards the agent will receive from a
given state.
 Model of the Environment: A representation of the environment that predicts future
states and rewards, aiding in planning.
Reinforcement Learning Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while avoiding fire hazards. The
goal is to find the optimal path with the least number of hazards while maximizing the
reward:

The robot learns by exploring different paths in the maze. By trying various moves, it
evaluates the rewards and penalties for each path. Over time, the robot determines the best
route by selecting the actions that lead to the highest cumulative reward.

The robot's learning process can be summarized as follows:


1. Exploration: The robot starts by exploring all possible paths in the maze, taking
different actions at each step (e.g., move left, right, up, or down).
2. Feedback: After each move, the robot receives feedback from the environment:
 A positive reward for moving closer to the diamond.
 A penalty for moving into a fire hazard.
3. Adjusting Behavior: Based on this feedback, the robot adjusts its behavior to
maximize the cumulative reward, favoring paths that avoid hazards and bring it closer
to the diamond.
4. Optimal Path: Eventually, the robot discovers the optimal path with the least number
of hazards and the highest reward by selecting the right actions based on past
experiences.

Types of Reinforcements in RL

1. Positive Reinforcement
Positive Reinforcement is defined as when an event, occurs due to a particular behavior,
increases the strength and the frequency of the behavior. In other words, it has a positive
effect on behavior.
 Advantages: Maximizes performance, helps sustain change over time.
 Disadvantages: Overuse can lead to excess states that may reduce effectiveness.
2. Negative Reinforcement
Negative Reinforcement is defined as strengthening of behavior because a negative
condition is stopped or avoided.
 Advantages: Increases behavior frequency, ensures a minimum performance standard.
 Disadvantages: It may only encourage just enough action to avoid penalties.

RL Framework and Temporal Difference (TD) Learning

A Reinforcement Learning framework refers to the formal structure that defines how an
RL problem is modeled and solved. The RL framework consists of states, actions, rewards,
and policies. The most common framework used is the Markov Decision Process (MDP).

The agent takes actions, observes the environment's state, and receives a reward. The goal is
to learn a policy that maximizes the expected return over time.

RL framework:
1. Agent: The decision-maker that interacts with the environment.
2. Environment: The context in which the agent operates, providing feedback and rewards.
3. Actions: The choices the agent can make within the environment.
4. States: The current situation or condition of the environment at a given time.
5. Rewards: Feedback provided by the environment to the agent, indicating the quality of the
action taken.

The RL Loop:
1. The agent observes the current state of the environment.
2. The agent takes an action based on its current policy.
3. The environment transitions to a new state based on the action.
4. The agent receives a reward from the environment.
5. The agent updates its policy based on the experience, aiming to maximize the expected
cumulative reward.
In essence, the RL framework provides a structured approach for designing algorithms that
can learn to make optimal decisions in complex environments through trial and error and
feedback.

TD Learning is a combination of Monte Carlo ideas and dynamic programming, allowing


learning directly from raw experience without a model of the environment's dynamics.

Solution Methods & Applications in RL

Solution methods in RL include value-based methods (like Q-learning), policy-based


methods, and actor-critic methods. Applications span robotics, game playing,
recommendation systems, and more.

Solution Methods in RL

RL solution methods are generally categorized into:

A. Value-Based Methods

These methods estimate value functions (how good a state or action is).

 Key Idea: Learn the expected reward of being in a state (or performing an action in a
state).

Common Algorithms:

 Q-Learning (off-policy)
 SARSA (on-policy)
 Deep Q-Networks (DQN) – uses neural networks to approximate Q-values

B. Policy-Based Methods

Directly learn the policy (mapping from states to actions), without estimating value functions.

 Key Idea: Optimize the policy using gradients of expected rewards.

Common Algorithms:

 REINFORCE
 Policy Gradient
 Actor-Critic Methods

C. Model-Based Methods
Learn a model of the environment (i.e., transition probabilities and rewards) and use it to plan
actions.

 Key Idea: Simulate outcomes and plan ahead (like in classical planning).

Examples:

 Dyna-Q
 Monte Carlo Tree Search (MCTS)

D. Deep Reinforcement Learning (Deep RL)

Uses deep neural networks to approximate value functions, policies, or models.

Algorithms:

 DQN (Deep Q-Network)


 DDPG (Deep Deterministic Policy Gradient)
 PPO (Proximal Policy Optimization)
 A3C (Asynchronous Advantage Actor-Critic)

Applications of Reinforcement Learning

A. Robotics

 Robot motion control


 Arm manipulation
 Drone flight

B. Healthcare

 Treatment planning (e.g., cancer therapy scheduling)


 Personalized medicine
 Clinical trial optimization

C. Gaming & Simulations

 Atari and 3D games (e.g., AlphaGo, AlphaZero)


 Self-playing agents in strategy games

D. Autonomous Driving

 Decision making in complex environments


 Route planning
 Lane merging, braking, acceleration

E. Finance

 Portfolio optimization
 Trading strategy development
 Risk management

F. Recommendation Systems

 Dynamic content ranking


 Personalized suggestions based on long-term user engagement

G. Energy Systems

 Smart grid optimization


 Dynamic pricing
 Load balancing

Multi-class Classification

Multi-class classification involves categorizing instances into one of three or more classes.
Techniques include extending binary classifiers using strategies like one-vs-rest or one-vs-
one, and using algorithms inherently capable of multi-class classification, such as decision
trees and neural networks.

You might also like