IMPORTANT MACHINE LEARNING ALGORITHMS
This section gives an overall view of the important machine learning algorithms. These algorithms are discussed in detail in the subsequent chapters of this book. The algorithms presented were selected based on their popularity as well as their performance. Let us discuss some of these algorithms now.
Supervised Algorithms
Supervised algorithms include classification algorithms and regression algorithms.
Classification algorithms classify an unknown instance by assigning a label to it. Some of the important classification algorithms are listed below.
Decision Tree Algorithm
The decision tree (DT) algorithm was developed by J. Ross Quinlan; the first algorithm was called the Iterative Dichotomiser 3 (ID3) algorithm. The algorithm takes a training dataset as input and produces a decision tree. Each branch in the tree represents a possible outcome of a decision based on some condition. A decision tree is a simple representation for classifying instances.
A decision tree (DT) consists of nodes and edges. The nodes may be internal (known as non-leaf) or external (known as leaf). DT is one of the most important classification algorithms, and its reasoning is close to human thinking. Each internal node tests an attribute, and the branches leaving it represent the possible values or outcomes of that attribute. Each leaf of the decision tree is marked with a class or a probability distribution. Classification rules can be obtained by tracing a path from the root to a leaf. The variations of this tree algorithm are ID3, C4.5, and CART.
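As a concrete illustration, the following is a minimal sketch of training a decision tree with scikit-learn's DecisionTreeClassifier (a CART implementation); the iris dataset and the depth limit are assumptions made for this example.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Load a small labelled dataset: 150 flower samples with 4 attributes each.
    data = load_iris()
    X, y = data.data, data.target

    # Grow a CART-style tree; the depth cap keeps the printed rules short.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Classify an unknown instance by tracing it from the root to a leaf.
    print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))

    # Each root-to-leaf path printed below is one classification rule.
    print(export_text(tree, feature_names=list(data.feature_names)))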
Advantages:
Useful for problems whose classification boundaries are linearly separable
Fast
Accurate
Understandable and easily interpretable
Rules can be generated from the decision tree
Disadvantage:
Computationally expensive when there are a lot of uncorrelated attributes
Random Forest Algorithm
This algorithm was designed by Tin Kam Ho in 1995 and later extended by Leo Breiman and Adele Cutler. The approach creates a group of decision trees by randomizing the selection of features. The outputs of the individual trees are then combined, by pooling their results or by taking a majority vote, to get the final solution.
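The sketch below, again using scikit-learn (an assumed library choice, since the text does not prescribe one), shows the randomized ensemble of trees combined by majority vote; n_jobs=-1 illustrates that the trees can be grown in parallel.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    # Build 100 trees, each trained on a bootstrap sample and considering
    # a random subset of features at every split; n_jobs=-1 grows the
    # trees in parallel on all available cores.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    n_jobs=-1, random_state=0)
    forest.fit(X, y)

    # The final prediction pools the individual trees by majority vote.
    print(forest.predict([[5.1, 3.5, 1.4, 0.2]]))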
Advantages:
Provides high classification accuracy
Can be parallelized
Disadvantage:
The theoretical analysis of this algorithm is difficult
The algorithm has been used in products such as DeepSpeech.
Support Vector Machines
The Support Vector Machines (SVM) algorithm was developed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. The concept of kernels was developed by Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik in 1992.
SVM is a binary classifier that uses the decision boundary with maximal margin to assign new examples to one of two categories. SVM is a classification method that, with kernels, is also good at classifying samples from problems that are not linearly separable.
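As a hedged sketch (the dataset and kernel choice are assumptions for illustration), the following uses scikit-learn's SVC with an RBF kernel to separate a dataset that is not linearly separable.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # A toy two-class dataset whose classes cannot be split by a straight line.
    X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

    # The RBF kernel implicitly maps the samples to a higher-dimensional
    # space, where a maximal-margin separating boundary can be found.
    svm = SVC(kernel="rbf", C=1.0)
    svm.fit(X, y)

    print(svm.predict([[0.5, 0.25]]))    # assign a new example to a category
    print(svm.support_vectors_.shape)    # the samples that define the margin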
Advantages:
Good generalization
Flexibility
Robustness
Disadvantages:
Slow
High algorithmic complexity
Naïve Bayes
This is a family of algorithms that includes the Bayesian network as well. Naive Bayes treats all features as independent of each other. The algorithm uses Bayes' theorem, which was proposed by Reverend Thomas Bayes, for classification. It computes the posterior probability of each class from the prior probability of the class and the likelihood of the observed features. This algorithm is most useful when the features really are independent of each other.
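A minimal sketch with scikit-learn's GaussianNB (the library and dataset are illustrative assumptions): the classifier estimates a prior for each class and a per-feature likelihood, then combines them through Bayes' theorem.

    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)

    # Fit class priors and per-feature Gaussian likelihoods from the data.
    nb = GaussianNB()
    nb.fit(X, y)

    # predict_proba returns the posterior probability of each class,
    # computed as prior x likelihood and then normalized.
    print(nb.predict_proba([[5.1, 3.5, 1.4, 0.2]]))
    print(nb.predict([[5.1, 3.5, 1.4, 0.2]]))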
Advantages:
Fast
Easily understandable
Robust
Disadvantages:
The independence assumption rarely holds exactly in practice
Feature values never seen during training receive zero probability unless smoothing is applied
Markov and Hidden Markov Models
Markov models are probabilistic sequence models in which the system makes the Markov assumption. Imagine one records a sequence of climate conditions for 20 days: the climate is rainy, sunny, rainy, sunny, sunny, and so on. This is called sequence data. The aim of this problem is to answer questions such as: will it be rainy or sunny tomorrow?
The Markov assumption means one need not consider the entire sequence data. Instead, the present day depends only on the previous data. This implies that today's climate is based on yesterday's climate.
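The climate example can be turned into a small first-order Markov chain. The sketch below (the 20-day sequence is invented for illustration) estimates transition probabilities from the data and predicts tomorrow from today alone.

    from collections import Counter

    # An invented 20-day sequence of daily climate observations.
    days = ["rainy", "sunny", "rainy", "sunny", "sunny", "sunny", "rainy",
            "rainy", "sunny", "sunny", "rainy", "sunny", "sunny", "sunny",
            "rainy", "sunny", "rainy", "rainy", "sunny", "sunny"]

    # Count transitions between consecutive days; under the Markov
    # assumption, tomorrow depends only on today.
    transitions = Counter(zip(days, days[1:]))
    totals = Counter(days[:-1])

    def p_next(today, tomorrow):
        # Estimated probability of tomorrow's climate given today's.
        return transitions[(today, tomorrow)] / totals[today]

    today = days[-1]
    print("P(rainy | %s) = %.2f" % (today, p_next(today, "rainy")))
    print("P(sunny | %s) = %.2f" % (today, p_next(today, "sunny")))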
The Markov model, in the form of the Markov chain, was proposed by Andrey Markov in 1906. Later, the Hidden Markov Model (HMM) was developed by L. E. Baum and colleagues in the 1960s. In an HMM, the states themselves are not fully observable; only outputs that depend on the states can be observed.
Markov models and HMMs can be used for prediction. Speech recognition is one application where these models are in great demand.
Artificial Neural Networks
The artificial neural network (ANN) is modelled on the human brain as several interconnected neurons. These networks receive a set of inputs, pass them through activation functions, and the resulting activations of further neurons form a model. These models can be trained and later used to classify unknown test inputs. ANNs can be used for classification, regression, and clustering.
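The sketch below uses scikit-learn's MLPClassifier as one concrete (assumed) realization: a small feed-forward network of interconnected neurons with a nonlinear activation, trained on labelled data and then applied to an unseen input.

    from sklearn.datasets import load_iris
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)

    # One hidden layer of 10 neurons; each neuron applies the ReLU
    # activation function to a weighted sum of its inputs.
    net = MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                        max_iter=2000, random_state=0)
    net.fit(X, y)    # training adjusts the connection weights

    # Apply the trained model to an unknown test input.
    print(net.predict([[5.1, 3.5, 1.4, 0.2]]))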
Advantages:
Good for classification tasks and applications such as chatbots
Simple to implement
Effective
Disadvantages:
Longer training time, and large amounts of data are required for higher quality results
Feed forward networks are classifiers. Networks like Self-Organizing Maps (SOM) can be used for clustering. These algorithms are discussed in Chapter 10 of this book.
Deep networks are an extension of neural networks. Any neural network that has more than two hidden layers is called a deep network. Deep networks are discussed in Chapter 16 of this book.
Some of the classic applications of deep neural networks are face recognition, image recognition,
recommendation systems, and driverless cars.
Regression Algorithms
Linear regression is used to model the linear relationship between dependent and independent
variables. Linear regression is used when the response involves continuous variables.
Linear regression originated from the method of least squares, which was proposed by Legendre in 1805 and by Gauss in 1809. Francis Galton coined the term 'regression'.
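A minimal least-squares sketch in NumPy (the data points are invented for illustration): the closed-form solution fits a line minimizing the sum of squared residuals between the continuous response and the independent variable.

    import numpy as np

    # Invented observations of an independent variable x and a
    # continuous response y.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Design matrix [x, 1] so the model is y = slope * x + intercept.
    A = np.column_stack([x, np.ones_like(x)])

    # Least squares: minimize ||A @ [slope, intercept] - y||^2.
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    print("y = %.2f * x + %.2f" % (slope, intercept))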
Types of regression algorithms:
Polynomial regression
Multiple regression
Logistic regression
Ridge/ Lasso/ Elastic net regression
Disadvantages:
Large datasets are required to uncover reliable relationships
The algorithm also assumes that the variables are independent of each other, which rarely holds in practice
Unsupervised Algorithms
Some of the important unsupervised algorithms are listed below.
k-means Algorithm
It is a non-hierarchical clustering algorithm used to group objects that are similar to each other; this grouping is called cluster analysis. It was developed by James MacQueen in 1967. Here, k is the number of clusters the user wants. The algorithm randomly selects k points in the dataset as initial centroids and maps every sample to the closest cluster by computing the distance between the sample and each cluster centroid. This is an iterative algorithm: the centroids are recomputed and the samples reassigned until the cluster assignments no longer change.
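A minimal sketch with scikit-learn's KMeans (the library and the toy data are assumptions): k is fixed by the user, and the algorithm alternates between assigning samples to the nearest centroid and recomputing the centroids.

    import numpy as np
    from sklearn.cluster import KMeans

    # Two invented blobs of two-dimensional points.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)),
                   rng.normal(5, 0.5, (50, 2))])

    # The user chooses k; here k = 2 clusters.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    km.fit(X)    # iterates assignment and centroid update until stable

    print(km.cluster_centers_)    # final centroids
    print(km.labels_[:10])        # cluster assigned to each sample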
Advantages:
Fast
Generated clusters tend to be spherical, and hence it is easy to learn the underlying structure of the data. For irregularly shaped clusters, such learning is difficult.
Easily understandable
Easily interpretable
Disadvantages:
The algorithm is sensitive to outliers and noise.
Different initial points yield different results
Choosing the value of 'k' in advance is difficult
Principal Component Analysis
Principal component analysis (PCA) is a dimensionality reduction algorithm. Some features contribute more to classification than others. For example, a mole on a face can help face detection more than common features like the nose. In simple words, the features should be relevant.
The idea of PCA, or the KL transform, is to transform a given set of measurements into a new set of features that exhibit high information-packing properties. This leads to a reduced and compact set of features. Basically, this elimination of unnecessary features is made possible because of the information redundancies present. The resulting compact representation is of reduced dimension.
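A minimal sketch with scikit-learn's PCA (an assumed implementation choice): the four iris measurements are transformed into two new features that pack most of the information, giving the compact, reduced-dimension representation described above.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project the 4 original measurements onto 2 principal components.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                  # (150, 2): compact representation
    print(pca.explained_variance_ratio_)    # information packed per component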
Apriori
This is a class of algorithms that uses unsupervised learning. It is used for association mining: the algorithm extracts association rules from the frequent itemsets that are present in the data.
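A minimal sketch using the third-party mlxtend library (an assumption; the toy basket data is also invented): frequent itemsets are mined first, and association rules are then extracted from them.

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded market-basket data: each row is one transaction.
    df = pd.DataFrame(
        [[True, True, False], [True, True, True],
         [False, True, True], [True, False, True]],
        columns=["bread", "milk", "butter"])

    # Keep itemsets appearing in at least 50% of the transactions.
    itemsets = apriori(df, min_support=0.5, use_colnames=True)

    # Extract association rules with at least 70% confidence.
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])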
Semi Supervised Algorithms
There are circumstances where the dataset has a huge collection of unlabelled data and only a few labelled samples. Labelling is a costly process and is difficult for humans to perform at scale.
Pseudo-labelling is a semi-supervised algorithm that makes use of the unlabelled data by assigning each unlabelled sample a pseudo-label. The labelled and pseudo-labelled datasets can then be combined to train a classifier.
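A minimal pseudo-labelling sketch (the synthetic data, the confidence threshold, and the choice of logistic regression are all assumptions for illustration):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: pretend only the first 20 samples are labelled.
    X, y = make_classification(n_samples=500, random_state=0)
    X_lab, y_lab, X_unlab = X[:20], y[:20], X[20:]

    # Step 1: train an initial classifier on the few labelled samples.
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

    # Step 2: pseudo-label the unlabelled samples the model is confident
    # about (posterior probability above an assumed 0.95 threshold).
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.95
    pseudo_y = clf.classes_[proba.argmax(axis=1)][confident]

    # Step 3: retrain on the labelled and pseudo-labelled data combined.
    X_all = np.vstack([X_lab, X_unlab[confident]])
    y_all = np.concatenate([y_lab, pseudo_y])
    clf = LogisticRegression(max_iter=1000).fit(X_all, y_all)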
Reinforcement Algorithms
Q-Learning is known as an off-policy method. What is a Q-value? A Q-value is a numerical value assigned to a state-action pair; it is the value of the action that is performed at state 's'. The Q-value captures the immediate reward together with the discounted rewards that are yet to come; their sum is known as the total return. The Q-learning algorithm constructs a table, and then updates the Q-values of the table based on the starting state, the action, the reward, and the new state. The algorithm, say in a maze, simulates many paths, estimates the value of reaching the target state, and keeps updating these estimates in the table. The next action is the one whose cell has the highest Q-value for the current state. Finally, the table guides the agent to navigate.
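A minimal tabular Q-learning sketch on an invented five-cell corridor (the environment, learning rate, and discount factor are assumptions): the table is updated from (state, action, reward, new state) tuples, and acting randomly while learning the values of the greedy policy is exactly what makes Q-learning off-policy.

    import numpy as np

    n_states, n_actions = 5, 2      # corridor cells; actions: 0=left, 1=right
    GOAL = n_states - 1             # a reward is given at the rightmost cell
    alpha, gamma = 0.5, 0.9         # learning rate and discount factor
    rng = np.random.default_rng(0)

    Q = np.zeros((n_states, n_actions))   # the Q-table: one value per (s, a)

    for _ in range(500):                  # simulate many paths to the goal
        s = 0
        while s != GOAL:
            a = rng.integers(n_actions)   # explore with a random action
            s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
            r = 1.0 if s_next == GOAL else 0.0

            # Update from (state, action, reward, new state): immediate
            # reward plus the discounted best future Q-value.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next

    # The greedy policy follows the highest Q-value in each state's row.
    print(Q.round(2))
    print(["left" if q_left > q_right else "right" for q_left, q_right in Q])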