Unit - 4 (ML)
Clustering:
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping data points into different clusters, each
consisting of similar data points. Objects with possible similarities remain in a group that
has few or no similarities with another group."
It works by finding similar patterns in the unlabelled dataset, such as shape, size, color,
or behavior, and dividing the data according to the presence or absence of those patterns.
After clustering is applied, each cluster or group is given a cluster ID. An ML system can
use this ID to simplify the processing of large and complex datasets.
The clustering technique is widely used in various tasks. Some of the most common uses of this
technique are:
o Market Segmentation
Apart from these general uses, Amazon uses it in its recommendation system to provide
recommendations based on past product searches. Netflix also uses this technique to recommend
movies and web series to its users based on their watch history.
The diagram below explains the working of the clustering algorithm. We can see that the different
fruits are divided into several groups with similar properties.
The clustering methods are broadly divided into Hard clustering (each data point belongs to only
one group) and Soft clustering (a data point can also belong to more than one group). There are
also various other approaches to clustering. Below are the main clustering methods used in
Machine Learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
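The difference between hard and soft clustering described above can be sketched with scikit-learn. This is only an illustration: the synthetic two-blob data, the choice of `KMeans` as the hard method and `GaussianMixture` as the soft method, and all parameter values are assumptions, not part of the notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs (illustrative data, not from the notes)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2)),
])

# Hard clustering: every point receives exactly one cluster ID
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: every point receives a membership probability per cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)  # shape (200, 2), each row sums to 1

print(hard_labels[:5])
print(soft_memberships[:5].round(3))
```

Each row of `soft_memberships` gives the degree to which a point belongs to each cluster, while `hard_labels` commits every point to a single cluster.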
Applications of Clustering
Below are some commonly known applications of the clustering technique in Machine Learning:
o In Identification of Cancer Cells: Clustering algorithms are widely used for the
identification of cancerous cells. They divide the cancerous and non-cancerous data into
different groups.
o In Search Engines: Search engines also work on the clustering technique. The search
results appear based on the objects closest to the search query. This is done by grouping
similar data objects in one group that is far from the other dissimilar objects. The accuracy
of a query result depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment customers based
on their choices and preferences.
o In Biology: It is used in biology to classify different species of plants and
animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use
in a GIS database. This can be very useful for finding the purpose for which a particular
piece of land is most suitable.
K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning and data science. In this topic, we will learn what the K-means clustering
algorithm is and how it works, along with a Python implementation of K-means clustering.
K-Means Clustering groups an unlabeled dataset into different clusters. Here K defines the number
of pre-defined clusters that need to be created in the process: if K=2, there will be two
clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such
a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in the unlabeled dataset on its own, without the need for any labeled
training data.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data points and the centroids
of their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and
repeats the process until it finds the best clusters. The value of K should be predetermined in
this algorithm.
Hence each cluster contains data points with some commonalities and lies away from the other
clusters.
The diagram below explains the working of the K-means clustering algorithm:
How does the K-Means Algorithm Work?
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the mean of the points in each cluster and place the new centroid of that
cluster there.
Step-5: Repeat the third step, that is, reassign each data point to the new closest centroid of
each cluster.
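The steps above can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function names `kmeans` and `assign`, the stopping rule (stop when the centroids no longer move), the choice of initial centroids from the data points, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Return, for each point, the index of its closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K distinct data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its closest centroid
        labels = assign(X, centroids)
        # Move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids), centroids

# Illustrative data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

On data like this the two centroids should settle near the centres of the two blobs; on harder data the result can depend on the random initial centroids.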
limits of k means:
o The elbow method, used to choose K, executes K-means clustering on a given dataset for
different values of K (ranging from 1 to 10).
o For each value of K, it calculates the WCSS (within-cluster sum of squares) value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The sharp point of bend in the plot, where the curve looks like an arm, is considered
the best value of K.
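The procedure in the list above can be sketched with scikit-learn, whose `KMeans` exposes the within-cluster sum of squares as the `inertia_` attribute. The synthetic three-blob dataset from `make_blobs` and the parameter values are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with three clusters (an assumption, not from the notes)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this K

for k, value in zip(range(1, 11), wcss):
    print(k, round(value, 1))
```

Plotting `wcss` against K (for example with matplotlib) gives the curve in which the bend is read off as the suggested number of clusters.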
Image Segmentation
o A digital image is made up of various components that need to be “analysed” (using that
word for simplicity's sake), and the analysis performed on such components can reveal a lot
of hidden information. This information can help us address a plethora of business
problems, which is one of the many end goals linked with image processing.
o Image Segmentation is the process by which a digital image is partitioned into various
subgroups (of pixels) called Image Objects, which reduces the complexity of the image
and thus makes analysing the image simpler.
o We use various image segmentation algorithms to split and group certain sets of pixels
together from the image. By doing so, we are actually assigning labels to pixels, and the
pixels with the same label fall under a category in which they have something in common.
Using these labels, we can specify boundaries, draw lines, and separate the most important
objects in an image from the rest. In the example below, starting from the main image on the
left, we try to extract the major components, e.g. chair, table, etc., and hence all the
chairs are colored uniformly (semantic segmentation). In the next result we have detected
instances, which correspond to individual objects, and hence all the chairs have different
colors (instance segmentation).
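One simple way to assign a label to every pixel is to cluster the pixel colours with K-means. The sketch below is an assumption-laden illustration, not the method used to produce the chair example: it builds a small synthetic image with a reddish left half and a bluish right half and groups its pixels into two segments.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
h, w = 40, 60

# Synthetic RGB image: reddish left half, bluish right half, plus a little noise
image = np.zeros((h, w, 3))
image[:, : w // 2] = [0.9, 0.1, 0.1]
image[:, w // 2 :] = [0.1, 0.1, 0.9]
image = np.clip(image + rng.normal(scale=0.05, size=image.shape), 0.0, 1.0)

# Treat every pixel as a 3-D colour vector and cluster the colours
pixels = image.reshape(-1, 3)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)

# Reshape the per-pixel labels back into the image layout
segmentation = labels.reshape(h, w)
print(np.unique(segmentation, return_counts=True))
```

Here `segmentation` holds one label per pixel, so pixels with the same label form one segment. Colour clustering ignores object identity, so it does not distinguish individual instances.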
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and
formatted data. Before doing any operation with data, it is necessary to clean it and put
it in a formatted form. For this, we use data preprocessing.
Why do we need data preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data preprocessing is required
for cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the model.
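As a small illustration of cleaning data with missing values, the sketch below fills the gaps and rescales the features with scikit-learn. The toy array (age and salary columns), the mean-imputation strategy, and the use of standardization are assumptions for the example; real projects choose these steps to suit the data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy raw data with missing values (columns: age, salary); illustrative only
X_raw = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [47.0, np.nan],
    [35.0, 58000.0],
])

# Replace each missing value with its column mean, then standardize each column
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_clean = preprocess.fit_transform(X_raw)

print(X_clean.round(3))
```

After this step `X_clean` has no missing values, and each column has zero mean and unit variance.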
Semi-Supervised Learning:
Semi-Supervised learning is a type of Machine Learning that represents the intermediate
ground between Supervised and Unsupervised learning algorithms. It uses a combination of
labeled and unlabeled datasets during the training period.
To work with the unlabeled dataset, there must be a relationship between the objects. To
capture this, semi-supervised learning uses any of the following assumptions:
o Continuity assumption- As per the continuity assumption, objects near each other tend to
share the same group or label. This assumption is also used in supervised learning, where
the datasets are separated by decision boundaries. In semi-supervised learning, the
smoothness assumption is added so that the decision boundaries lie in low-density regions.
o Cluster assumption- In this assumption, data are divided into different discrete clusters.
Further, the points in the same cluster share the output label.
o Manifold assumption- This assumption helps to use distances and densities: the
data lie on a manifold of fewer dimensions than the input space. The high-dimensional data
are created by a process that has fewer degrees of freedom and may be hard to model
directly. (This assumption becomes practical when the data are high-dimensional.)
Working of Semi-Supervised Learning
Semi-supervised learning uses pseudo labeling to train the model with less labeled training data
than supervised learning. The process can combine various neural network models and training
methods. The whole working of semi-supervised learning is explained in the points below:
o First, the model is trained on the small amount of labeled training data, as in
supervised learning. The training continues until the model gives accurate results.
o In the next step, the trained model predicts pseudo labels for the unlabeled dataset; these
labels may not be accurate.
o Now, the labels from the labeled training data and the pseudo labels are linked together.
o The inputs of the labeled training data and the unlabeled training data are also linked.
o In the end, the model is trained again on the new combined input, as in the first step.
This is intended to reduce errors and improve the accuracy of the model.
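The pseudo-labeling steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the synthetic dataset from `make_classification`, the 50/450 split into labeled and unlabeled parts, and the use of `LogisticRegression` as the model are all choices made for the example, and no accuracy gain is implied for any particular dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; only the first 50 labels are treated as known
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]
X_unlabeled = X[50:]  # the true labels of these rows are treated as unknown

# Train on the small labeled set
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Predict pseudo labels for the unlabeled data (these may be wrong)
pseudo_labels = model.predict(X_unlabeled)

# Link the inputs and the (true + pseudo) labels together
X_combined = np.vstack([X_labeled, X_unlabeled])
y_combined = np.concatenate([y_labeled, pseudo_labels])

# Retrain on the combined data
final_model = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)
print(final_model.score(X_labeled, y_labeled))
```

In practice, implementations often keep only the pseudo labels predicted with high confidence before retraining.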
DBSCAN
Density-based clustering is one of the most popular unsupervised learning approaches used in
model building and machine learning. Data points in the low-density regions that separate
clusters are considered noise. The surroundings of a given object within a radius ε are known
as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a
minimum number, MinPts, of objects, then it is called a core object.
Density-based clustering uses two parameters:
Eps: Eps is the radius ε of the neighborhood around a point.
MinPts: MinPts is the minimum number of points in the Eps-neighborhood of a point.
Directly density reachable:
A point i is directly density reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k) and k is a core object, i.e. |NEps(k)| >= MinPts.
Density reachable:
A point i is density reachable from a point j with respect to Eps and MinPts if there is a
chain of points p_1, ..., p_n with p_1 = j and p_n = i such that p_(m+1) is directly density
reachable from p_m.
Density connected:
A point i is density connected to a point j with respect to Eps and MinPts if there is a point o
such that both i and j are density reachable from o with respect to Eps and MinPts.
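The roles of Eps and MinPts can be seen with scikit-learn's `DBSCAN`, where they correspond to the `eps` and `min_samples` parameters. The synthetic data (two dense blobs plus one isolated point) and the parameter values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs and one isolated point lying between them
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(50, 2))
outlier = np.array([[2.5, 2.5]])
X = np.vstack([cluster_a, cluster_b, outlier])

# eps is the neighborhood radius (Eps); min_samples plays the role of MinPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                      # cluster ID per point, -1 marks noise
core_idx = db.core_sample_indices_       # indices of the core objects

print(sorted(set(labels.tolist())))
print(len(core_idx))
```

The isolated point has no neighbors within `eps`, so it is neither a core object nor density reachable from one, and it is labeled as noise.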
Gaussian Discriminant Analysis
Two types of Supervised Learning algorithms are used in Machine Learning for
classification: Discriminative and Generative Learning Algorithms.
Logistic Regression, the Perceptron, and similar methods are examples of
Discriminative Learning Algorithms. These algorithms attempt to determine a boundary between
classes in the learning process. A Discriminative Learning Algorithm might be used to solve a
classification problem that determines whether a patient has malaria. A new example is then
classified according to which side of the boundary it falls on. Such algorithms model P(y|X),
i.e., given a feature set X, the probability of it belonging to the class "y".
Generative Learning Algorithms, on the other hand, take a different approach. They try to
capture each class distribution separately rather than finding a boundary between classes. A
Generative Learning Algorithm, in the malaria example, will examine the distributions of
infected and healthy patients separately. It will then attempt to learn each distribution's
features individually. When a new example is presented, it is compared to both distributions,
and the class that it most closely resembles is assigned. Such algorithms model P(X|y)
together with P(y); here, P(y) is known as the class prior.
Generative learning algorithms make their predictions using Bayes' theorem.
Using only the values of P(X|y) and P(y) for each class, we can determine
P(y|X) = P(X|y) P(y) / P(X), i.e., given the characteristics of a sample, how likely it is to
belong to class "y".
Gaussian Discriminant Analysis is a Generative Learning Algorithm that aims to determine the
distribution of every class. It fits a Gaussian distribution to each category of data
separately. The likelihood of an example under a class is very high if the example is close
to the centre of the contour corresponding to that class, and it diminishes as we move away
from the centre of the contour.
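The idea of fitting one Gaussian per class and classifying with Bayes' theorem can be sketched in NumPy. This is an illustrative sketch, not a definitive implementation: the function names are hypothetical, the data are synthetic, and each class is given its own covariance matrix (some presentations of Gaussian Discriminant Analysis share a single covariance across classes instead).

```python
import numpy as np

def fit_gda(X, y):
    """Estimate the class prior P(y) and a Gaussian (mean, covariance) per class."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    covs = np.array([np.cov(X[y == c], rowvar=False) for c in classes])
    return classes, priors, means, covs

def log_gaussian(X, mean, cov):
    """Log density of a multivariate Gaussian, i.e. log P(X|y), for each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def predict_gda(X, classes, priors, means, covs):
    """Pick the class maximizing log P(X|y) + log P(y), proportional to P(y|X)."""
    scores = np.column_stack([
        np.log(p) + log_gaussian(X, m, c)
        for p, m, c in zip(priors, means, covs)
    ])
    return classes[scores.argmax(axis=1)]

# Illustrative data: two Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2)),
])
y = np.array([0] * 200 + [1] * 200)

params = fit_gda(X, y)
accuracy = np.mean(predict_gda(X, *params) == y)
print(accuracy)
```

The denominator P(X) of Bayes' theorem is the same for every class, so it can be dropped when only the most likely class is needed.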
Below are images that illustrate the differences between Discriminative as well as Generative
Learning Algorithms.
It is often necessary to reduce the number of features, which can be done with
dimensionality reduction.
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work
on it. Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending on the method used.
The prime linear method, called Principal Component Analysis, or PCA, is discussed below.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that when data in a
higher-dimensional space is mapped to a lower-dimensional space, the variance of the
data in the lower-dimensional space should be maximal.
Principal Component Analysis is an unsupervised learning algorithm that is used for
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the
help of an orthogonal transformation. These new transformed features are called the
Principal Components. It is one of the popular tools used for exploratory data
analysis and predictive modeling. It is a technique for drawing strong patterns from the given
dataset by reducing its dimensionality while retaining as much of the variance as possible.
PCA generally tries to find the lower-dimensional surface to project the high-dimensional
data.
PCA works by considering the variance of each attribute, because high variance indicates
a good split between the classes, and hence it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and
optimizing the power allocation in various communication channels. It is a feature
extraction technique, so it retains the important variables and drops the least important
ones.
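The mapping from correlated features to uncorrelated principal components can be sketched with an eigendecomposition of the covariance matrix. This is a minimal NumPy sketch on synthetic data; the function name `pca` and the data are assumptions for illustration, and library implementations typically use the SVD instead.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (covariance eigenvectors)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]       # orthogonal directions
    explained_ratio = eigvals / eigvals.sum()    # share of variance per component
    return X_centered @ components, explained_ratio

# Illustrative data: the second feature is almost a multiple of the first
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])

Z, explained = pca(X, n_components=2)
print(explained.round(4))
```

Because the first two features are strongly correlated, nearly all of the variance falls on the first principal component, and the projected features in `Z` are uncorrelated with each other.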
The PCA algorithm is based on some mathematical concepts such as:
in PCA algorithm:
Origin of Scikit-Learn
It was originally called scikits.learn and was initially developed by David Cournapeau as a
Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux,
Alexandre Gramfort, and Vincent Michel, from INRIA (French Institute for Research in
Computer Science and Automation), took the project to another level and made the first public
release (v0.1 beta) on 1st Feb. 2010.
Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like
Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are part of scikit-learn.
Unsupervised Learning algorithms − It also has all the popular unsupervised
learning algorithms, from clustering, factor analysis, and PCA (Principal Component Analysis)
to unsupervised neural networks.
Clustering − This model is used for grouping unlabeled data.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
Dimensionality Reduction − It is used for reducing the number of attributes in data which
can be further used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, they are used for combining the predictions of
multiple supervised models.
Feature extraction − It is used to extract features from data to define the attributes in
image and text data.
Feature selection − It is used to identify useful attributes to create supervised models.
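Several of the features listed above can be combined in a few lines. The sketch below chains scaling, dimensionality reduction, and a supervised model, then evaluates the result with cross-validation. The Iris dataset, the choice of `SVC`, and the parameter values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scale the features, reduce them to 2 dimensions, then fit a supervised model
model = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())

# 5-fold cross validation estimates the accuracy on unseen data
scores = cross_val_score(model, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

Putting the preprocessing steps inside the pipeline means they are refitted on each training fold, so no information from the held-out fold leaks into training.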
KERNEL PCA:
PCA is a linear method. That is, it works well only on datasets that are linearly separable.
It does an excellent job for such datasets. But if we apply it to non-linear
datasets, the result may not be the optimal dimensionality reduction. Kernel
PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where
it is linearly separable. This is similar to the idea behind Support Vector Machines.
There are various kernel functions, such as linear, polynomial, and Gaussian.
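Example: Applying kernel PCA to this dataset with an RBF kernel and a gamma value of 15.
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

# X (features) and y (class labels) are the dataset referred to above
kpca = KernelPCA(kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

# Plot the first two kernel principal components, colored by class
plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()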
In the kernel space the two classes are linearly separable.
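A self-contained way to check this claim is to compare how well a linear classifier separates the classes after ordinary PCA and after kernel PCA. The sketch below is an illustration under assumptions: it uses the synthetic `make_circles` dataset rather than the dataset of the example above, a gamma value of 10, and `LogisticRegression` as the linear classifier.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Training accuracy of a linear classifier on each representation
acc_pca = LogisticRegression().fit(X_pca, y).score(X_pca, y)
acc_kpca = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)

print("linear PCA:", acc_pca)
print("kernel PCA:", acc_kpca)
```

Linear PCA only rotates the circles, so a linear boundary cannot separate them, whereas the kernel PCA projection is expected to be much easier for the linear classifier.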
Neither SVD nor NIPALS is very efficient when the number of rows in the dataset is very large
(e.g. hundreds of thousands of rows or even more). Such datasets can easily arise, for
example, with hyperspectral images. Direct use of the traditional algorithms on such datasets
often leads to a lack of memory and long computation times.
One solution here is to use probabilistic algorithms, which reduce the number of
values needed for estimation of the principal components. Starting from version 0.9.0, one
such probabilistic approach is also implemented
in mdatools. The original idea can be found in this paper and some examples on using the
approach for PCA analysis of hyperspectral