[UNSUPERVISED LEARNING] Unit2
Introduction to Unsupervised Learning:
Unsupervised learning is a machine learning technique in which we provide unlabeled dataset to the
machine. Instead, models itself find the hidden patterns and insights from the given data. Unsupervised
learning is the training of a machine using information that is neither classified nor labeled and allowing
the algorithm to act on that information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences without any prior training of
data. In Unsupervised learning, we have the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of dataset, group that data according to
similarities, and represent that dataset in a compressed format.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images, containing
both dogs and cats. The model has never seen an image of a dog or cat before, and it has no pre-existing
categories for these animals. Now we have to use unsupervised learning to identify the dogs and cats
image.
For instance, suppose it is given an image having both dogs and cats which it has never seen.
Thus the machine has no idea about the features of dogs and cats so we can’t categorize it as dogs and
cats. But it can categorize them according to their similarities, patterns, and differences, i.e., we can
easily categorize the images into two parts. The first may contain all pictures having dogs in them and
the second part may contain all pictures having cats in them. Here the machine didn’t learn anything
before. It allows the model to work on its own to discover patterns and information that was previously
undetected. It mainly deals with unlabelled data.
The Machine Learning algorithm's operation is shown in the following block diagram(fig 1.1):
Fig. 1.1 Diagram of Unsupervised Machine Learning.
DCST(3rd year, 6th sem) Page 1
[UNSUPERVISED LEARNING] Unit2
Types of Unsupervised Machine Learning:
Unsupervised machine learning can be classified into two types:
1. Clustering.
2. Association.
1. Clustering:
Clustering is a method of grouping the objects into clusters such that objects with most similarities
remains into a group and has less or no similarities with the objects of another group. Clustering
algorithms work by iteratively moving data points closer to their cluster centers and further
away from data points in other clusters. Cluster analysis finds the commonalities between the
data objects and categorizes them as per the presence and absence of those commonalities. A
block diagram of clustering is shown below(Figure 1.2):
Fig. 1.2 Diagram of Clustering.
2. Association:
An association rule is an unsupervised learning method which is used for finding the relationships
between variables in the large database. It determines the set of items that occurs together in the
dataset. Association rule makes marketing strategy more effective. This rule shows how
frequently a itemset occurs in a transaction. A typical example is a Market Based Analysis.
Market Based Analysis is one of the key techniques used by large relations to show associations
between items. It allows retailers to identify relationships between the items that people buy
together frequently. Given a set of transactions, we can find rules that will predict the
occurrence of an item based on the occurrences of other items in the transaction.
DCST(3rd year, 6th sem) Page 2
[UNSUPERVISED LEARNING] Unit2
K-means Clustering:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process. K-Means performs the division of objects into clusters that share similarities and
are dissimilar to the objects belonging to another cluster.
The K-means clustering is simple and most commonly used clustering algorithm. The K-means
clustering algorithm starts with a some randomly selected data point called centroids. These
centroids points are used as the beginning points of every cluster. Then repeated calculations are
used to optimize the positions of the centroids.
The K-means clustering stops optimizing clusters in two cases:
i. If centroids have stabilized.
ii. If defined number of iterations achieved.
Fig. 1.3 K-means Clustering.
Steps of K-means Clustering:
Step1. Select the number of clusters(K) to be generated by this algorithm.
Step2. Select k random points from the data as centroids.
Step3. Calculate the Euclidean distance. And assign the points to the nearest centriods.
Thus k groups are created.
Step4. Recompute the centroids of newly formed clusters.
Step5. Again reassign the whole data points based on this new centroid. Then repeat
step4 until the position of centriod does not change.
Step6: The model is ready.
Application of K-means Clustering:
1. Identification of cancer cells: The clustering algorithm are widely used for
identification of cancer cells. It divides cancer data and non-cancer data set
into different groups.
2. In search engines on internet: The search engines like Google, Yahoo work on
the base of clustering technique. The search results appear which are based
on closest to the search query.
DCST(3rd year, 6th sem) Page 3
[UNSUPERVISED LEARNING] Unit2
3. Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
4. In biology: It is used in biology to classify and group different species of plants
and animals using image recognition technique.
Advantage of K-means clustering:
1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large
datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with a large number of data points
and can be easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be
used with different distance metrics and initialization methods.
Disadvantage of K-means clustering:
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of
centroids and can converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be
specified before running the algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant
impact on the resulting clusters.
Kernel K-means:
Kernel k-means have been used to identify clusters that are non-linearly separable in input space. Here
we replace Euclidean distance used in k-means by the kernel functions.
Dimensionality Reduction Technique:
Dimensionality reduction refers to the techniques for reducing the number of input variables in the
training data. By reducing input variable we can build sample machine learning models. When dealing
with high dimensional data, it is useful to reduce the dimensions of data. By projecting to a lower
dimensional space we can capture the essence to data. This is called dimensionality reduction.
Principal Component Analysis or PCA is a dimensionality reduction technique.
Principal Component Analysis(PCA):
DCST(3rd year, 6th sem) Page 4
[UNSUPERVISED LEARNING] Unit2
Principal Component Analysis (PCA) is an unsupervised learning algorithm which is used for dimensionality
reduction in machine learning. PCA is a very popular tool which is used for exploratory data analysis and
predictive modeling. PCA finds most significant features in a large data set and makes the data easy for
plotting in 2D and 3D.
Fig. 1.4 Principal components PC1 and PC2 orthogonal to each other.
In Fig. 1.4., there are several points plotted on a 2D plane. There are two principal components
namely PC-1 and PC-2. The PC1 component shows maximum variance in the data. PC-2 is
orthogonal to PC-1.
Basic mathematical and statistical concepts for PCA:
1. Principal components: The principal components are the straight lines that capture most of
the variance of data. They have magnitude and direction.
2. Dimensionality of data: It is defined as number of variables present in a large data set.
3. Correlation: It shows how strongly two variables are related to each other.
4. Orthogonality: Two variables are said to be orthogonal to each other if they are at right
angle(90 degree) to each other in n-dimensional space.
5. Eigen vectors: The Eigen values and Eigen vectors are defined only for square matrix.
6. Covariance matrix: A matrix containing the covariance between the pair of variables is called
the covariance matrix.
Steps of PCA Algorithm:
Step1: Getting the data set: First of all, we need a data set. Divide this data set into two parts X
and Y. Where X= training set and Y= validation set.
Step2: Represent data into a structure: Represent the two dimensional matrix of independent
variable X. Here each row shows data item, and each column shows features. The number of
columns is called the dimension of dataset.
Step3: Standardise the data(Normalization): The features with high variance are more important
as compared to the features with lower variance.
DCST(3rd year, 6th sem) Page 5
[UNSUPERVISED LEARNING] Unit2
Step4: Calculate the covariance of matrix: Take the matrix, namely Z and transpose it. After
transpose , multiply the matrix(Z) with its transpose. The resulting matrix will be covariance of
matrix(Z).
Step5: Calculate the Eigen values and Eigen vectors: Now, calculate the Eigen values and Eigen
vectors of covariance of matrix(Z). The Eigen vectors of the covariance matrix are the directions
of axes with high information. The coefficients of these Eigen vectors are called eigen values.
Step6: Sort the Eigen vectors: Sort the eigen vectors and arrange in decreasing order.
Simultaneously, sort the eigen vectors in matrix P of eigen values. The resultant matrix is denoted
by P*.
Step7: Calculate new features or principal components: Multiply P* matrix to Z matrix. In
resultant matrix Z*, each observation is linear combination of original features.
Step8: Remove unimportant features from new data set: In this new data set, keep only important
features and remove less important features.
Applications of PCA Algorithm:
1. Image processing.
2. Movie recommendation system.
3. Optimizing power allocation in communication channels.
4. Data mining, finance and psychology etc.
Advantage of PCA algorithm:
1. Correlated features are removed : Running algorithms on large datasets with all of the features
will lower the speed of your method and make it difficult to show the many characteristics in
any type of graph. Finding connections in hundreds of characteristics manually is virtually
difficult, tedious, and time-consuming because principal components are independent of one
another, and associated characteristics are removed.
2. Enhances the performance of the algorithm: With so many characteristics, your algorithm’s
performance will surge greatly. The principal component analysis is a popular method for
speeding up your machine learning algorithm by removing associated variables that don’t help
decision-making. With fewer features, the training time of the algorithms decreases
considerably. So, if the input dimensions are too large, utilizing PCA to speed up the method is
a viable option.
3. Enhanced Visualization: PCA converts high-dimensional data to low-dimensional data that can
be readily viewed. PCA produces large variance, which helps visualization. PCA is based on linear
algebra, which is computationally simple for computers to solve. It accelerates other machine
learning methods, allowing them to converge quicker when trained on main components rather
than the original dataset.
DCST(3rd year, 6th sem) Page 6
[UNSUPERVISED LEARNING] Unit2
Disadvantage of PCA algorithm:
1. The major components are difficult to comprehend:
After applying principal component analysis to the dataset, your original features will be transformed into
Principal Components. Original features are more legible and interpretable than Principal Components.
Even the most basic invariance could not be caught by the PCA unless the training data clearly stated it.
For example, after computing the main components, it is difficult to determine which characteristics in
the dataset are the most significant.
2. Data normalization is required:
Before applying principal component analysis, you must normalize your data; otherwise, PCA will be
difficult to discover optimal principal components. The scale has an effect on PCA, thus you must scale
the features in your data before using PCA. As a result, principal components will be skewed toward
characteristics with large variance, resulting in incorrect findings.
3. Loss of information:
PCA accounts for the greatest amount of variation across data characteristics. If the number of Principal
Components is not carefully chosen, it may miss certain information in contrast to the real list of
characteristics. Although dimensionality reduction is beneficial, it has a cost. Loss of information is an
inevitable component of principal component analysis. Managing the trade-off between dimensionality
reduction and information loss is, regrettably, an unavoidable tradeoff when employing PCA.
Kernel PCA (KPCA):
While traditional PCA is highly effective for linear data transformations, it may not capture the
underlying structure of complex, nonlinear datasets. To handle this problem, we introduce kernel
principal component analysis (KPCA). KPCA relies on the intuition that many data sets that are not
linearly separable in their current dimension can be linearly separable by projecting them into a higher-
dimension space.
Kernel Principal Component Analysis (KPCA) is a technique used in machine learning for nonlinear
dimensionality reduction. It is an extension of the classical Principal Component Analysis (PCA)
algorithm, which is a linear method that identifies the most significant features or components of a
dataset. KPCA applies a nonlinear mapping function to the data before applying PCA, allowing it to
capture more complex and nonlinear relationships between the data points.
In KPCA, a kernel function is used to map the input data to a high-dimensional feature space, where
the nonlinear relationships between the data points can be more easily captured by linear methods
such as PCA.
Advantages of Kernel KPCA:
Some of the advantages of kernel PCA are:
▪ Higher-dimensional transformation – by mapping data into a higher-dimensional space,
kernel PCA can create a more expressive representation, potentially leading to better
separation of classes or clusters
▪ Nonlinear transformation – it has the ability to capture complex and nonlinear
relationships
DCST(3rd year, 6th sem) Page 7
[UNSUPERVISED LEARNING] Unit2
▪ Flexibility – by capturing nonlinear patterns, it’s more flexible and adaptable to various
data types. Thus, kernel PCA is used for many domains, including image recognition and
speech processing.
Disadvantages of KPCA:
▪ Choosing an appropriate kernel function and its parameters can be challenging and may
require expert knowledge or extensive experimentation.
▪ Kernel PCA can be computationally expensive, especially for large datasets, as it requires
the computation of the kernel matrix for all pairs of data points.
▪ It may not always be easy to interpret the results of kernel PCA, as the transformed data
may not have a clear interpretation in the original feature space.
▪ Kernel PCA is not suitable for datasets with many missing values or outliers, as it assumes
a complete and consistent dataset.
Recommendation System:
Mainly there are two types of recommendation system:
1. Content-based filtering: This approach recommends items by analyzing the past
preferences of a particular user. Content-Based Filtering is used to produce items
recommendation based on items’ characteristics.
Fig. 1.5 Content based filtering.
2. Collaborative filtering : Collaborative filtering is the choices of similar users to suggest
relevant items. Collaborative filtering is to discover the similarities on the user’s past behavior
and make predictions to the user based on a similar preference with other users. This model is then
used to predict items that the user may have an interest in.
DCST(3rd year, 6th sem) Page 8
[UNSUPERVISED LEARNING] Unit2
Fig. 1.6 Collaborative filtering.
Matrix Factorization and Matrix Completion:
Matrix factorization :
Matrix factorization is a way to generate latent features when multiplying two different kinds of entities.
Collaborative filtering is the application of matrix factorization to identify the relationship between items’
and users’ entities. With the input of users’ ratings on the shop items, we would like to predict how the
users would rate the items so the users can get the recommendation based on the prediction.
Assume we have a table of 5 users and 5 movies, and the ratings are integers ranking from 1 to 5, the matrix
is provided by the table below.
Movie1 Movie2 Movie3 Movie4 Movie5
User1 5 4 2 1
User2 1 5 3
User3 1 4 4 1
User4 2 2
User5 3 1 1
Since not every user gives ratings to all the movies, there are many missing values in the matrix and it
results in a sparse matrix. Hence, the null values not given by the users would be filled with 0 such that the
filled values are provided for the multiplication. For example, two users give high ratings to a certain move
when the movie is acted by their favorite actor and actress or the movie genre is an action one, etc. From
the table above, we can find that the user1 and user3 both give high ratings to move2 and movie3. Hence,
from the matrix factorization, we are able to discover these latent features to give a prediction on a rating
with respect to the similarity in user’s preferences and interactions.
Given a scenario, user 4 did not give a rating to the movie 4. We would like to know if user 4 would like
movie 4. The method is to discover other users with similar preferences of user 4 by taking the ratings given
by users of similar preferences to the movie 4 and predict whether the user 4 would like the movie 4 or
not.
Matrix Completion:
Matrix Completion is a method for recovering lost information in a data matrix, such as ratings,
preferences, or measurements. . It originates from machine learning and usually deals with highly sparse
matrices. Missing or unknown data is estimated using the low-rank matrix of the known data. It has
applications in recommender systems, computer vision, and natural language processing. However, in
DCST(3rd year, 6th sem) Page 9
[UNSUPERVISED LEARNING] Unit2
many cases, the data matrix is not completely random, but has some underlying structure or pattern that
reflects prior knowledge or constraints. For example, the ratings may be influenced by user profiles, the
images may have low rank or sparsity, or the words may belong to certain topics.
Different types method of matrix completion:
1.Sparsity and regularization:
Another way to incorporate prior knowledge or constraints into matrix completion is to exploit the
sparsity or structure of the data matrix or its factors. For example, if the matrix represents ratings or
preferences, it may have many zeros or near-zeros, indicating that most users have not rated most items
or have low interest in them. In this case, adding a sparsity-inducing penalty or regularization term to the
objective function can help reduce overfitting and improve accuracy. Similarly, if the matrix factors have
some known structure, such as groupings, correlations, or hierarchies, they can be encoded as constraints
or regularization terms in the optimization problem.
2.Rank minimization:
One common approach to matrix completion is to assume that the data matrix has a low rank, meaning
that it can be written as a product of two smaller matrices. This reduces the number of parameters to
estimate and imposes a smoothness or simplicity assumption on the data. The problem of finding the
lowest rank matrix that matches the observed entries can be formulated as a rank minimization problem,
which is NP-hard. However, under certain conditions, it can be relaxed to a convex problem known as
nuclear norm minimization, which can be solved efficiently using semi-definite programming or iterative
methods.
3.Side information and features:
Sometimes, the data matrix is not the only source of information available for matrix completion. There
may be additional features that describe the rows or columns of the matrix, such as user profiles, item
descriptions, or contextual variables. These features can provide useful hints or clues for filling in the
missing entries, especially when the data matrix is sparse or noisy. One way to incorporate such side
information into matrix completion is to use it to construct similarity or affinity matrices that capture the
relationships between the rows or columns of the data matrix. Then, these similarity matrices can be used
to weight or constrain the matrix completion problem, or to augment the matrix factors with feature
embeddings.
4.Probabilistic and Bayesian methods:
A different perspective on matrix completion is to treat it as a probabilistic inference problem,
where the goal is to estimate the posterior distribution of the data matrix given the observed entries and
some prior assumptions. This allows for quantifying the uncertainty and variability of the missing entries,
as well as incorporating prior knowledge or constraints in a principled way
Previous year questions:
1. What is clustering?
Clustering is a method of grouping the objects into clusters such that objects with most similarities
remains into a group and has less or no similarities with the objects of another group. Clustering
DCST(3rd year, 6th sem) Page 10
[UNSUPERVISED LEARNING] Unit2
algorithms work by iteratively moving data points closer to their cluster centers and further
away from data points in other clusters. Cluster analysis finds the commonalities between the
data objects and categorizes them as per the presence and absence of those commonalities.
Example: Let's understand the clustering technique with the real-world example of Mall: When
we visit any shopping mall, we can observe that the things with similar usage are grouped
together. Such as the t-shirts are grouped in one section, similarly, at fruits sections, apples,
bananas, Mangoes, etc., are grouped in separate sections, so that means same types of things are
kept in one group or place. The clustering technique also works in the same way.
2. Write application of clustering?
i. Identification of cancer cells: The clustering algorithm are widely used for
identification of cancer cells. It divides cancer data and non-cancer data set
into different groups.
ii. In search engines on internet: The search engines like Google, Yahoo work on
the base of clustering technique. The search results appear which are based
on closest to the search query.
iii. Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
iv. In biology: It is used in biology to classify and group different species of plants
and animals using image recognition technique.
3. Briefly discuss K-means clustering?
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process. K-Means performs the division of objects into clusters that share similarities and
are dissimilar to the objects belonging to another cluster.
The K-means clustering is simple and most commonly used clustering algorithm. The K-means
clustering algorithm starts with a some randomly selected data point called centroids. These
centroids points are used as the beginning points of every cluster. Then repeated calculations are
used to optimize the positions of the centroids.
The K-means clustering stops optimizing clusters in two cases:
i. If centroids have stabilized.
ii. If defined number of iterations achieved.
Steps of K-means Clustering:
Step1. Select the number of clusters(K) to be generated by this algorithm.
Step2. Select k random points from the data as centroids.
Step3. Calculate the Euclidean distance. And assign the points to the nearest centriods.
Thus k groups are created.
Step4. Recompute the centroids of newly formed clusters.
Step5. Again reassign the whole data points based on this new centroid. Then repeat
step4 until the position of centriod does not change.
DCST(3rd year, 6th sem) Page 11
[UNSUPERVISED LEARNING] Unit2
Step6: The model is ready.
Advantage of K-means clustering:
1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large
datasets with high dimensionality.
Disadvantage of K-means clustering:
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of
centroids and can converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be
specified before running the algorithm, which can be challenging in some
applications.
4.What is dimensionality reduction?
Dimensionality reduction refers to the techniques for reducing the number of input variables in the
training data. By reducing input variable we can build sample machine learning models. It is a way of
converting the higher dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information. When dealing with high dimensional data, it is useful to reduce the dimensions of data. By
projecting to a lower dimensional space we can capture the essence to data. This is called dimensionality
reduction. It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data visualization, noise
reduction, cluster analysis, etc.
There are two components of dimensionality reduction:
a) Feature selection: In this, we try to find a subset of the original set of variables, or features, to
get a smaller subset which can be used to model the problem. It usually involves three ways:
i. Filter
ii. Wrapper
iii. Embedded
b) Feature extraction: This reduces the data in a high dimensional space to a lower dimension
space, i.e. a space with lesser no. of dimensions.
5.What is PCA?
Principal Component Analysis (PCA) is an unsupervised learning algorithm which is used for dimensionality
reduction in machine learning. PCA is a very popular tool which is used for exploratory data analysis and
predictive modeling. PCA finds most significant features in a large data set and makes the data easy for
plotting in 2D and 3D.
DCST(3rd year, 6th sem) Page 12
[UNSUPERVISED LEARNING] Unit2
Fig. 1 Principal components PC1 and PC2 orthogonal to each other.
In Fig. 1, there are several points plotted on a 2D plane. There are two principal components namely PC-
1 and PC-2. The PC1 component shows maximum variance in the data. PC-2 is orthogonal to PC-1.
Steps of PCA Algorithm:
Step1: Getting the data set: First of all, we need a data set. Divide this data set into two parts X
and Y. Where X= training set and Y= validation set.
Step2: Represent data into a structure: Represent the two dimensional matrix of independent
variable X. Here each row shows data item, and each column shows features. The number of
columns is called the dimension of dataset.
Step3: Standardise the data(Normalization): The features with high variance are more important
as compared to the features with lower variance.
Step4: Calculate the covariance of matrix: Take the matrix, namely Z and transpose it. After
transpose , multiply the matrix(Z) with its transpose. The resulting matrix will be covariance of
matrix(Z).
Step5: Calculate the Eigen values and Eigen vectors: Now, calculate the Eigen values and Eigen
vectors of covariance of matrix(Z). The Eigen vectors of the covariance matrix are the directions
of axes with high information. The coefficients of these Eigen vectors are called eigen values.
Step6: Sort the Eigen vectors: Sort the eigen vectors and arrange in decreasing order.
Simultaneously, sort the eigen vectors in matrix P of eigen values. The resultant matrix is denoted
by P*.
Step7: Calculate new features or principal components: Multiply P* matrix to Z matrix. In
resultant matrix Z*, each observation is linear combination of original features.
Step8: Remove unimportant features from new data set: In this new data set, keep only important
features and remove less important features.
DCST(3rd year, 6th sem) Page 13
[UNSUPERVISED LEARNING] Unit2
Questions:
1. What is Unsupervised Machine learning? Discuss different types of unsupervised machine
learning.
2. Write down advantage and disadvantage of K-means clustering.
3. Write down advantage and disadvantage of PCA.
4. What is matrix factorization and matrix completion.
5. Explain Kernel PCA.
DCST(3rd year, 6th sem) Page 14