Machine Learning
Machine Learning uses algorithms that take input data, learn from that data, and
make sense of it.
Definition – the application and science of algorithms that make sense of data.
The purpose of Machine Learning is to automate automation, for example, teaching
computers to program themselves.
Traditional V/S Machine Learning Computation
Traditional programming refers to any manually created program that takes input
data and runs on a computer to produce the output. In Machine Learning
programming, the input and output data drive the system: the program (model) is learned from the data.
How Does a Machine Learn?
A machine learns through various algorithms that try to best understand the given
data. These algorithms are largely based on statistical approaches.
The algorithm maps the relationship between input and output data. The aim is for
the algorithm to have a high success rate rather than perfect precision.
Success Rate – the rate of producing values near the actual values.
The relationships can be linear or nonlinear. These learned relationships enable the
model to output results for new instances based on previously learned ones.
Difference B/w Machine Learning Algorithms
Different machine learning algorithms differ in:
1. The model they employ for the target function
2. Loss Measure (Difference b/w predicted and actual output)
3. Optimization Procedure
Requirements for Designing and Implementing ML Program
We need to understand the following five points before designing and implementing
an ML program:
1. Context – How is ML going to help us solve our problem?
2. Data – What data do we need? What do we do in case of missing data?
How much data do we need for training?
3. Modelling – Which model should be used? The accuracy of the model
on training and testing data needs to be monitored closely. It should be
checked whether the model gives reliable performance on the testing
data; sometimes models work well on training data but not on testing data.
4. Production – How can you take your model to production scale? Is
it online or offline?
5. Error Handling – What happens if our model breaks? How can we
improve it?
Types of ML
There are majorly 3 types of ML-
1. Supervised Learning – Here the machine learns from labelled data. For
example, in spam mail detection the system is first made to learn from labelled
data; in this case the labels on the training data are "spam" and "not spam".
2. Unsupervised Learning – Here the machine is responsible for discovering
patterns in the input data itself. The input data is unlabeled. It is used, for
example, in clustering similar documents based on their text.
3. Reinforcement Learning – This learning is based on a feedback- or
reward-based structure. Here the system learns from feedback from the
user. For example, on some e-commerce website you get advertisements
for a shoe and a phone. Say you clicked on the phone advertisement; it is
then very common to observe that you get more phone advertisements.
Such systems generally work in real-time, user-interactive settings.
The ML system is driven by data. This data is mostly stored in text files. The
most common format for such files is CSV (Comma-Separated Values). In
CSV files, data is stored in the form of a table: rows are separated by a new
line and column values are separated by commas. In the case of ML,
input data is of two types, labelled and unlabeled.
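As a minimal sketch of loading such a CSV file for ML, pandas can be used. The file name "emails.csv" and its "label" column are hypothetical names chosen purely for illustration:

```python
# Minimal sketch: load a CSV of labelled data for ML.
# "emails.csv" and its "label" column are hypothetical names for illustration.
import pandas as pd

df = pd.read_csv("emails.csv")      # rows = samples, columns = features
X = df.drop(columns=["label"])      # the input features
y = df["label"]                     # the labels, e.g. "spam" / "not spam"
print(X.shape, y.value_counts())
```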
Unlabeled data consists of samples of natural or human-created artifacts that
you can obtain relatively easily from the world. Some examples of unlabeled
data might include photos, audio recordings, videos, news articles, tweets, x-
rays (if you were working on a medical application), etc. There is no
"explanation" for each piece of unlabeled data -- it just contains the data, and
nothing else.
Labeled data typically takes a set of unlabeled data and augments each piece
of that unlabeled data with some sort of meaningful "tag," "label," or "class"
that is somehow informative or desirable to know. For example, labels for the
above types of unlabeled data might be whether this photo contains a horse or
a cow, which words were uttered in this audio recording, what type of action is
being performed in this video, what the topic of this news article is, what the
overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.
Labels for data are often obtained by asking humans to make judgments
about a given piece of unlabeled data (e.g., "Does this photo contain a horse
or a cow?") and are significantly more expensive to obtain than the raw
unlabeled data.
After obtaining a labeled dataset, machine learning models can be applied to
the data so that new unlabeled data can be presented to the model and a
likely label can be guessed or predicted for that piece of unlabeled data.
There are many active areas of research in machine learning that are aimed
at integrating unlabeled and labeled data to build better and more accurate
models of the world. Semi-supervised learning attempts to combine
unlabeled and labeled data (or, more generally, sets of unlabeled data where
only some data points have labels) into integrated models. Deep neural
networks and feature learning are areas of research that attempt to build
models of the unlabeled data alone, and then apply information from the
labels to the interesting parts of the models.
Comparison B/W Supervised, Unsupervised and Reinforcement

Parameter | Supervised | Unsupervised | Reinforcement
Data | Labelled | Unlabeled | Unlabeled
Input Data | Input and output | Input only | Input and output (output received over time, differs with users)
Use | Make predictions | Analysis | Both
Scale of Use | Offline and online | Offline | Offline and online (majorly online)
Accuracy | High | Low | Low
Applications | Optical character recognition, spam detection, pattern recognition, speech recognition | Image recognition for classification | Self-driving cars, natural language processing
Unsupervised Learning
In unsupervised learning, there is no such supervisor and we only have input
data. The aim is to find the regularities in the input. There is a structure to the
input space such that certain patterns occur more often than others, and we
want to see what generally happens and what does not. In statistics, this is
called density estimation.
Example: Finding customer segments
Clustering is an unsupervised technique where the goal is to find natural
groups or clusters in a feature space and interpret the input data. There are
many different clustering algorithms. One common approach is to divide the
data points in a way that each data point falls into a group that is similar to
other data points in the same group based on a predefined similarity or
distance metric in the feature space.
Clustering is commonly used for determining customer segments in marketing
data. Being able to determine different segments of customers helps
marketing teams approach these customer segments in unique ways. (Think
of features like gender, location, age, education, income bracket, and so on.)
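As a small sketch of how such customer segmentation might look in code, k-means from scikit-learn can be used; the customer feature values below (age, income) are invented purely for illustration:

```python
# A small sketch of clustering customers into segments with k-means.
# The feature values (age, annual income) are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [22, 25_000], [25, 27_000], [47, 90_000],
    [52, 110_000], [46, 87_000], [23, 30_000],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment assigned to each customer
print(kmeans.cluster_centers_)  # the "average" customer in each segment
```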
Classification
During training, a classification algorithm will be given data points with an
assigned category. The job of a classification algorithm is to then take an input
value and assign it a class, or category, that it fits into based on the training
data provided.
The most common example of classification is determining if an email is spam
or not. With two classes to choose from (spam, or not spam), this problem is
called a binary classification problem. The algorithm will be given training data
with emails that are both spam and not spam. The model will find the features
within the data that correlate to either class and create the mapping function
mentioned earlier: Y=f(x). Then, when provided with an unseen email, the
model will use this function to determine whether or not the email is spam.
There are perhaps four main types of classification tasks that you may
encounter; they are:
1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
4. Imbalanced Classification
Binary Classification refers to those classification tasks that have two class
labels.
Ex-
1. Email spam detection (spam or not).
2. Churn prediction (churn or not).
3. Conversion prediction (buy or not).
Multi-Class Classification refers to those classification tasks that have more
than two mutually exclusive class labels.
Ex-
1. Face classification.
2. Plant species classification.
3. Optical character recognition.
Multi-Label Classification refers to those classification tasks where two or more
class labels may be assigned to each example.
1. Movie genre classification
2. Classification of text on many bases, such as politics, religion, finance, etc.
Imbalanced Classification refers to classification tasks where the number of
examples in each class is unequally distributed.
• Typically, imbalanced classification tasks are binary classification tasks
where many examples in the training dataset belong to the normal class and a
minority of examples belong to the abnormal class.
• Examples include:
Fraud detection.
Outlier detection.
Medical diagnostic tests.
Difference between multi-class classification & multi-label
classification is that in multi-class problems the classes are mutually
exclusive, whereas for multi-label problems each label represents a different
classification task, but the tasks are somehow related
Classification problems can be solved with several algorithms. Whichever
algorithm you choose to use depends on the data and the situation. Here are
a few popular classification algorithms:
Decision Trees
K-Nearest Neighbor
Random Forest
Linear Classifiers
Support Vector Machines
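As a minimal sketch of binary classification with one of the algorithms listed above (a decision tree), the tiny feature matrix below, e.g. [number of links, count of the word "free"], is invented for illustration:

```python
# Minimal binary-classification sketch (spam vs. not spam).
# Features and labels are invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier

X_train = [[8, 5], [7, 6], [0, 1], [1, 0], [9, 4], [0, 0]]
y_train = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[6, 3], [0, 1]]))   # predict labels for unseen emails
```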
Steps of Implementing ML-
1 - Data Collection
The quantity and quality of your data dictate how accurate the model will be
The outcome of this step is generally a representation of data (Guo
simplifies to specifying a table) which we will use for training
Using pre-collected data, by way of datasets from Kaggle, UCI, etc., still
fits into this step
2 - Data Preparation
Wrangle data and prepare it for training
Clean that which may require it (remove duplicates, correct errors, deal
with missing values, normalization, data type conversions, etc.)
Randomize data, which erases the effects of the particular order in
which we collected and/or otherwise prepared our data
Visualize data to help detect relevant relationships between variables or
class imbalances (bias alert!), or perform other exploratory analysis
Split into training and evaluation sets
3 - Choose a Model
Different algorithms are for different tasks; choose the right one
4 - Train the Model
The goal of training is to answer a question or make a prediction
correctly as often as possible
Linear regression example: algorithm would need to learn values
for m (or W) and b (x is input, y is output)
Each iteration of process is a training step
5 - Evaluate the Model
Uses some metric or combination of metrics to "measure" objective
performance of model
Test the model against previously unseen data
This unseen data is meant to be somewhat representative of model
performance in the real world, but still helps tune the model (as
opposed to test data, which does not)
Good train/eval split? 80/20, 70/30, or similar, depending on domain,
data availability, dataset particulars, etc.
6 - Parameter Tuning
This step refers to hyperparameter tuning, which is an "artform" as
opposed to a science
Tune model parameters for improved performance
Simple model hyperparameters may include: number of training steps,
learning rate, initialization values and distribution, etc.
7 - Make Predictions
Further (test set) data which have, until this point, been withheld
from the model (and for which class labels are known) are used to test
the model, giving a better approximation of how the model will perform in the
real world
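The steps above can be sketched end to end with scikit-learn. The following minimal example uses the built-in iris dataset and an arbitrarily chosen k-nearest-neighbours model; it is illustrative rather than prescriptive:

```python
# Compact sketch of steps 2-7: split the data, train a model,
# evaluate it on held-out data, and make predictions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 2: randomize and split into training and evaluation sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Steps 3-4: choose a model and train it
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Step 5: evaluate on previously unseen data
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: hyperparameter tuning would mean retrying with, e.g., a different k
# Step 7: predict for a new, unlabeled sample (feature values are illustrative)
print("prediction:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```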
Well-defined Problem
Problem Solving by Search
An important aspect of intelligence is goal-based problem solving. The
solution of many problems (e.g. noughts and crosses, timetabling, chess) can
be described by finding a sequence of actions that lead to a desirable goal.
Each action changes the state and the aim is to find the sequence of actions
and states that lead from the initial (start) state to a final (goal) state. A well-
defined problem can be described by:
Initial state
Operator or successor function - for any state x returns s(x), the set of states
reachable from x with one action
State space - all states reachable from initial by any sequence of actions
Path - sequence through state space
Path cost - function that assigns a cost to a path. Cost of a path is the sum of
costs of individual actions along the path
Goal test - test to determine if at goal state
WELL-POSED LEARNING PROBLEMS
Definition: A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E. To have a well-
defined learning problem, three features need to be identified:
1. The class of tasks
2. The measure of performance to be improved
3. The source of training experience
Example 1
Checkers game: A computer program that learns to play checkers might
improve its performance as measured by its ability to win at the class of tasks
involving playing checkers games, through experience obtained by playing
games against itself.
A checkers learning problem:
Task T: playing checkers.
Performance measure P: percent of games won against opponents.
Training experience E: playing practice games against itself.
Example 2
A handwriting recognition learning problem:
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified
Training experience E: a database of handwritten words with given
classifications
Example 3
A self-driving learning problem:
Task T: driving on public routes using vision sensors
Performance measure P: average distance travelled before an error (as
judged by human overseer)
Training experience E: a sequence of images and steering commands
recorded while observing a human driver
Regression
Regression is a predictive statistical process where the model attempts to find
the important relationship between dependent and independent variables. The
goal of a regression algorithm is to predict a continuous number such as
sales, income, and test scores. The equation for basic linear regression can
be written as:
y = w[1]*x[1] + w[2]*x[2] + ... + w[n]*x[n] + b
Where the x[i] are the features of the data and the w[i] and b are parameters
which are developed during training. For simple linear regression models with
only one feature in the data, the formula looks like this:
y = w*x + b
Where w is the slope, x is the single feature and b is the y-intercept. Familiar?
For simple regression problems such as this, the model's predictions are
represented by the line of best fit. For models using two features, a plane is
used; for a model using more than two features, a hyperplane is used.
Imagine we want to determine a student's test grade based on how many
hours they studied in the week of the test. Suppose we plot the data with a
line of best fit.
There is a clear positive correlation between hours studied (independent
variable) and the student's final test score (dependent variable). A line of best
fit can be drawn through the data points to show the model's predictions when
given a new input. Say we wanted to know how well a student would do with
five hours of studying; we can use the line of best fit to predict the test score
based on other students' performances.
There are many different types of regression algorithms. The three most
common are listed below:
Linear Regression
Logistic Regression
Polynomial Regression
Simple Regression Example
We can then place our line of best fit onto the plot along with all of the data
points.
In middle school, we all learned that the equation of a line is y = mx + b.
We can now create a function called "predict" that multiplies the slope (w)
by the new input (x) and adds the intercept (b) to return an output value.
After creating the function, we can predict the output values
when x = 3 and when x = -1.5.
Predict y For 3: 194.7953505092923
Predict y For -1.5: -100.16735028915441
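A minimal sketch of such a "predict" function is shown below. The slope and intercept values are placeholders, chosen only to be roughly consistent with the predictions printed above; in practice they would be learned from the training data:

```python
# Minimal sketch of the "predict" function described above.
# w and b are assumed placeholder values, not learned parameters.
w = 65.55   # slope of the line of best fit (assumed)
b = -1.85   # y-intercept of the line of best fit (assumed)

def predict(x):
    """Return the value on the line of best fit for input x (y = w*x + b)."""
    return w * x + b

print("Predict y For 3:", predict(3))
print("Predict y For -1.5:", predict(-1.5))
```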
Now let’s plot the original data points with the line of best fit. We can then add
the new points that we predicted (colored red). As expected, they fall on the
line of best fit.
Reference - https://towardsdatascience.com/a-brief-introduction-to-supervised-learning-54a3e3932590
Classification V/S Regression
Parameter | Classification | Regression
Task | Predicting a discrete class label | Predicting a continuous quantity
Data | Data is labelled into one, two or more classes | No class labels
Variables | A classification problem with two classes is called Binary Classification; if there are more labels, it is called Multi-Class Classification | A regression problem with multiple input variables is called a Multivariable Regression problem
Application | Detection of spam mails | Predicting the price of stocks over a period of time
Overfitting:
It is a modeling error that occurs when a function is too closely fit to a limited
set of data points. It refers to a model that models the training data too well.
Example:
How to prevent Overfitting?
Remove unwanted features as overfitting generally happens when there
are many features in dataset.
Train the model with more data. This won't work every time, but it can help
the algorithm detect the signal better.
Cross validation. In this we divide training set into two parts. We use
one part for training, and the remaining part is called the validation set
and is used to test the generalization ability. Example: K-fold cross
validation.
K-fold Cross Validation:
In this method the dataset X is divided randomly into K equal-sized parts Xi, where i =
1, 2, ..., K. The model is then trained and validated K times, each time holding out a
different part as the validation set and training on the remaining K-1 parts.
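A short illustrative sketch of K-fold cross validation with scikit-learn (here K = 5, using a built-in dataset and an arbitrarily chosen model):

```python
# Sketch of K-fold cross validation: the dataset is split into K parts and the
# model is trained/validated K times, each fold acting once as validation set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # K = 5 folds
print(scores)          # validation accuracy on each fold
print(scores.mean())   # averaged estimate of generalization ability
```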
Underfitting:
Underfitting refers to a model that can neither model the training data nor
generalize to new data. An underfit machine learning model is not a suitable
model and will be obvious as it will have poor performance on the training
data.
Example:
How to prevent Underfitting?
Increase the size or number of parameters in the model.
Increase the complexity or type of the model.
Increase the training time until cost function is minimized.
Bias-Variance Relation with Overfitting and Underfitting:
A model with low variance and low bias is the ideal model.
A model with low bias and high variance is a model with
overfitting.
A model with high bias and low variance is a model with
underfitting.
A model with high bias and high variance is the worst model; it
produces the greatest possible prediction error.
Comparison Between Data Science and Machine
Learning
The below table describes the basic differences between Data Science and
ML:
Data Science | Machine Learning
It deals with understanding and finding hidden patterns or useful insights from the data, which helps in taking smarter business decisions. | It is a subfield of data science that enables the machine to learn from past data and experiences automatically.
It is used for discovering insights from the data. | It is used for making predictions and classifying the result for new data points.
It is a broad term that includes various steps to create a model for a given problem and deploy the model. | It is used in the data modeling step of the data science process.
A data scientist needs to have skills to use big data tools like Hadoop, Hive and Pig, statistics, and programming in Python, R, or Scala. | A machine learning engineer needs skills such as computer science fundamentals, programming skills in Python or R, and statistics and probability concepts.
It can work with raw, structured, and unstructured data. | It mostly requires structured data to work on.
Data scientists spend lots of time handling the data, cleansing it, and understanding its patterns. | ML engineers spend a lot of time managing the complexities that occur during the implementation of algorithms and the mathematical concepts behind them.
Linear Regression vs Logistic Regression
Linear Regression and Logistic Regression are two famous machine
learning algorithms that come under the supervised learning technique.
Since both algorithms are supervised in nature, they use labeled datasets
to make predictions. The main difference between them is how they are
used: Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems. A description
of both algorithms is given below, along with a difference table.
Linear Regression:
o Linear Regression is one of the simplest machine learning algorithms; it
comes under the supervised learning technique and is used for solving
regression problems.
o It is used for predicting the continuous dependent variable with the help of
independent variables.
o The goal of the Linear regression is to find the best fit line that can
accurately predict the output for the continuous dependent variable.
o If a single independent variable is used for prediction, it is called Simple
Linear Regression; if there is more than one independent variable, the
regression is called Multiple Linear Regression.
o By finding the best fit line, the algorithm establishes the relationship between
the dependent variable and the independent variables, and this relationship
should be linear in nature.
o The output for Linear regression should only be the continuous values
such as price, age, salary, etc. The relationship between the dependent
variable and independent variable can be shown in below image:
In the above image the dependent variable (salary) is on the y-axis and the
independent variable (experience) is on the x-axis. The regression line can be
written as:
y = a0 + a1*x + ε
Where a0 and a1 are the coefficients and ε is the error term.
Logistic Regression:
o Logistic regression is one of the most popular machine learning algorithms
that come under supervised learning techniques.
o It can be used for Classification as well as for Regression problems, but
mainly used for Classification problems.
o Logistic regression is used to predict the categorical dependent variable
with the help of independent variables.
o The output of a logistic regression model is a probability that lies between 0
and 1.
o Logistic regression can be used where the probability of one of two
classes is required, such as whether it will rain today or not: either 0 or 1,
true or false, etc.
o Logistic regression is based on the concept of Maximum Likelihood
estimation. According to this estimation, the observed data should be
most probable.
o In logistic regression, we pass the weighted sum of inputs through an
activation function that maps values to the range between 0 and 1. This
activation function is known as the sigmoid function, and the curve obtained
is called the sigmoid curve or S-curve.
o The equation for logistic regression is:
log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bn*xn, which is equivalent to
p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))
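A small sketch of how the sigmoid maps a weighted sum of inputs to a value between 0 and 1; the coefficients used here are invented for illustration:

```python
# Sketch: the sigmoid turns a weighted sum of inputs into a probability.
# The feature values, coefficients and intercept are assumed for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 1.0])        # input features
b = np.array([0.5, -0.25])      # coefficients (assumed)
b0 = 0.1                        # intercept (assumed)

p = sigmoid(b0 + np.dot(b, x))  # probability of the positive class
print(p, "-> class", int(p >= 0.5))
```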
Linear Regression | Logistic Regression
Linear regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
Linear regression is used for solving regression problems. | Logistic regression is used for solving classification problems.
In linear regression, we predict the value of continuous variables. | In logistic regression, we predict the values of categorical variables.
In linear regression, we find the best fit line, by which we can easily predict the output. | In logistic regression, we find the S-curve by which we can classify the samples.
The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
The output of linear regression must be a continuous value, such as price, age, etc. | The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
In linear regression, the relationship between the dependent variable and the independent variables must be linear. | In logistic regression, a linear relationship between the dependent and independent variables is not required.
In linear regression, there may be collinearity between the independent variables. | In logistic regression, there should not be collinearity between the independent variables.
Bayesian Learning
Bayes' theorem is used to calculate the probability of an event based on its
association with another event. In general, Bayes' theorem relates an event
(E) to a hypothesis (H) through the probability of E given H.
Bayes' theorem is a mathematical equation used in probability and statistics
to calculate conditional probability.
Example:
• Suppose that we are interested in diagnosing cancer in patients who visit
a chest clinic: • Let A represent the event "Person has cancer"
• Let B represent the event "Person is a smoker"
• Let the probability of the prior event P(A)=0.1 on the basis of past data
(10% of patients entering the clinic turn out to have cancer).
• Let the probability of P(B) =0.5 by considering the percentage of patients
who smoke.
• We want to compute the probability that a person who smokes has cancer,
i.e. the probability of the posterior event P(A|B).
• Let P(B|A) represent the proportion of smokers among those who are
already diagnosed with cancer; P(B|A) = 0.8.
• Using Bayes' rule to compute the probability that a person who smokes has
cancer, the posterior probability P(A|B):
• P(A|B) = (0.8 * 0.1)/0.5 = 0.16
• Thus, in the light of the evidence that the person is a smoker, we revise our
prior probability from 0.1 to a posterior probability of 0.16. This is a
significant increase, but it is still unlikely that the person has cancer.
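The same calculation, written directly in code:

```python
# The cancer/smoker example above, computed from Bayes' rule:
# P(A|B) = P(B|A) * P(A) / P(B).
p_a = 0.1          # prior: P(person has cancer)
p_b = 0.5          # P(person is a smoker)
p_b_given_a = 0.8  # P(smoker | cancer)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.16, the posterior probability
```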
Function approximation
Machine learning, specifically supervised learning, can be described as the
desire to use available data to learn a function that best maps inputs to
outputs. Technically, this is a problem called function approximation, where
we are approximating an unknown target function (that we assume exists)
that can best map inputs to outputs on all possible observations from the
problem domain.
Learning
Learning literally means understanding and cognition about abstract
concepts. Learning for a machine learning algorithm involves exploring
hypothesis space toward the best or a good enough hypothesis that can give
best approximation of the target function.
Concept Learning
The problem of automatically inferring the general definition of some concept,
given examples as positive or negative training examples of the concept. This
task is commonly referred to as concept learning, or approximating a
boolean-valued function from examples.
Define (i) Prior Probability (ii) Conditional Probability (iii) Posterior
Probability
Prior probability, in Bayesian statistical inference, is the probability of an
event before new data is collected. This is the best rational assessment of
the probability of an outcome based on the current knowledge before an
experiment is performed.
Posterior Probability: The prior probability of an event will be revised as new
data or information becomes available, to produce a more accurate measure
of a potential outcome. That revised probability becomes the posterior
probability and is calculated using Bayes' theorem.
Conditional probability is defined as the likelihood of an event or outcome
occurring, based on the occurrence of a previous event or outcome.
Conditional probability is calculated by multiplying the probability of the
preceding event by the updated probability of the succeeding, or conditional,
event.
Conditional Probability Formula
P(B|A) = P(A and B) / P(A) which you can also rewrite as: P(B|A) = P(A∩B) /
P(A)
Example: Suppose a student is applying for admission to a university and
hopes to receive an academic scholarship. The school to which they are
applying accepts 100 of every 1,000 applicants (10%) and awards academic
scholarships to 10 of every 500 students who are accepted (2%). Of the
scholarship recipients, 50% of them also receive university stipends
for books, meals, and housing. For our ambitious student, the chance of
being accepted and then receiving a scholarship is 0.2% (0.1 x 0.02). The
chance of being accepted, receiving the scholarship, and then also receiving
a stipend for books, etc. is 0.1% (0.1 x 0.02 x 0.5).
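The same chained conditional probabilities, computed directly:

```python
# The admission example above as a chain of conditional probabilities.
p_accepted = 0.10                      # 100 of every 1,000 applicants
p_scholarship_given_accepted = 0.02    # 10 of every 500 accepted students
p_stipend_given_scholarship = 0.50

p_accept_and_scholarship = p_accepted * p_scholarship_given_accepted
p_all_three = p_accept_and_scholarship * p_stipend_given_scholarship
print(p_accept_and_scholarship)   # 0.002 -> 0.2%
print(p_all_three)                # 0.001 -> 0.1%
```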
Support Vector Machine
Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems. However, primarily, it is used for Classification problems in
Machine Learning.
The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and
hence the algorithm is termed Support Vector Machine. Consider the below
diagram in which there are two different categories that are classified
using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in
the KNN classifier. Suppose we see a strange cat that also has some
features of dogs, so if we want a model that can accurately identify
whether it is a cat or dog, so such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and
dogs so that it can learn about different features of cats and dogs, and
then we test it with this strange creature. So as support vector creates a
decision boundary between these two data (cat and dog) and choose
extreme cases (support vectors), it will see the extreme case of cat and
dog. On the basis of the support vectors, it will classify it as a cat.
Consider the below diagram:
SVM algorithm can be used for Face detection, image classification,
text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which
means that if a dataset can be classified into two classes by using a
single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable
data, which means that if a dataset cannot be classified by using a
straight line, then such data is termed non-linear data, and the
classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find out the
best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the
dataset, which means if there are 2 features (as shown in image), then
hyperplane will be a straight line. And if there are 3 features, then
hyperplane will be a 2-dimension plane.
We always create a hyperplane that has a maximum margin, which means
the maximum distance between the data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and
which affect the position of the hyperplane are termed as Support Vector.
Since these vectors support the hyperplane, hence called a Support
vector.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and blue),
and the dataset has two features x1 and x2. We want a classifier that can
classify the pair(x1, x2) of coordinates in either green or blue. Consider
the below image:
So as it is 2-d space so by just using a straight line, we can easily separate
these two classes. But there can be multiple lines that can separate these
classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary;
this best boundary or region is called as a hyperplane. SVM algorithm
finds the closest point of the lines from both the classes. These points are
called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal
hyperplane.
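A tiny sketch of a linear SVM separating two classes of 2-D points; the points and tags below are invented for illustration:

```python
# Sketch: fit a linear SVM on two illustrative groups of 2-D points.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]   # features x1, x2
y = [0, 0, 0, 1, 1, 1]                                 # the two tags/classes

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)          # the extreme points defining the margin
print(clf.predict([[3, 2], [7, 6]])) # classify new points
```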
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data, we cannot draw a single straight line.
Consider the below image:
So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as below
image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-
axis. If we convert it back to 2-D space with z = 1, it becomes:
Hence we get a circle of radius 1 in the case of non-linear data.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel,
is a popular kernel function used in machine learning, particularly in SVMs
(Support Vector Machines). It is a nonlinear kernel function that maps the
input data into a higher-dimensional feature space using a Gaussian
function.
The Gaussian kernel can be defined as:
K(x, y) = exp(-gamma * ||x - y||^2)
Where x and y are the input feature vectors, gamma is a parameter that
controls the width of the Gaussian function, and ||x - y||^2 is the squared
Euclidean distance between the input vectors.
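A direct computation of the kernel value for two feature vectors, following the formula above; the vectors and the gamma value are illustrative:

```python
# Sketch: evaluate the RBF (Gaussian) kernel K(x, y) = exp(-gamma * ||x - y||^2).
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Similarity between x and y under the Gaussian kernel."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf_kernel(x, y))

# In an SVM this kernel is selected with, e.g., sklearn.svm.SVC(kernel="rbf", gamma=0.5)
```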
When using a Gaussian kernel in an SVM, the decision boundary is a
nonlinear surface that can capture complex nonlinear relationships
between the input features. The width of the Gaussian function, controlled
by the gamma parameter, determines the degree of nonlinearity in the
decision boundary.
One advantage of the Gaussian kernel is its ability to capture complex
relationships in the data without the need for explicit feature engineering.
However, the choice of the gamma parameter can be challenging, as a
smaller value may result in underfitting, while a larger value may result in
overfitting.
Disadvantages of SVM
o Training can be slow and memory-intensive on very large datasets.
o It does not perform well when the classes overlap heavily or the data is very noisy.
o Choosing a suitable kernel function and tuning its parameters (such as gamma and C) can be difficult.
o SVM does not directly provide probability estimates.
Decision Tree
A decision tree is a map of the possible outcomes of a series of related
choices. It allows an individual or organization to weigh possible actions
against one another based on their costs, probabilities, and benefits.
Why they are used???
They are used to drive informal discussion or to map out an algorithm that
predicts the best choice mathematically.
Decision Tree starts with a single node, which branches into possible
outcomes. Each of these outcomes leads to additional nodes, which branch
off into other possibilities.
Example:
In this example the decision tree starts with a node that asks whether the test case
has feathers or not. The tree then divides into two branches on the basis of
that decision. The tree finally predicts the best class for the test case on the
basis of the choices made.
Elements of Decision Tree:
Nodes in the classification tree are identified by the feature names of the
given data. Example: Feathers, Fly and Fins in the above image.
Branches in the tree are identified by the values of the features. Example:
True and False in the above image.
Leaf nodes are identified by the class labels. Example: Hawk,
Penguin, Dolphin and Bear in the above image.
Expressiveness:
Decision Trees can be expressed as a Boolean function of the input attributes.
We can take nodes as inputs for example A and B. Branches will represent 0
and 1 or True and False and leaf node can represent the final result.
Example:
1. XOR Gate
We can represent more than two inputs in a similar way, and we can represent
different gates using this approach.
1. AND Gate
In this way we can represent different Boolean expressions in Decision Tree
format.
Number of Distinct Decision Trees:
The number of distinct decision trees tells how many distinct decision trees are
possible with n Boolean attributes. It can be expressed as:
Number of distinct decision trees = number of distinct truth tables with 2^n rows = 2^(2^n)
Example:
With 6 Boolean attributes there are 2^(2^6) = 2^64 = 18,446,744,073,709,551,616 trees.
ID3 Algorithm:
ID3 stands for Iterative Dichotomiser 3. This algorithm repeatedly divides
features into two or more groups at each step. It was invented by Ross
Quinlan. It uses a top- down greedy approach to build a decision tree. It is
used for classification problems with categorical features only.
How it works?
ID3 uses Gain, or Information Gain, to find the best feature. First it calculates
the Information Gain of each feature. It then selects the feature with the highest
Information Gain, makes it a decision tree node, and splits the dataset into
subsets on that feature's values. If all rows in a subset belong to the same class,
the current node is made a leaf node with that class as its label. This is repeated
for the remaining features until we run out of features or all branches end in leaf nodes.
Bias:
In simple words, bias relates to the accuracy of our predictions. It measures how far
off the predictions are from the correct values in general, if we rebuild the model
multiple times on different training datasets.
There are 2 types of bias in machine learning:
Confirmation bias:
Occurs when we try to prove a predetermined assumption.
Selection bias:
Occurs when data is selected subjectively.
Variance:
In simple words, variance is the difference between many models' predictions. It
measures the consistency of the model's prediction for a particular
example if we retrain the model multiple times; high variance means the model is
sensitive to the randomness in the training data.
Example:
Suppose red spot is our expected value then:
Information Gain:
It is the difference between entropy before split and average entropy after split
of the dataset based on given attribute values.
Information Gain = Entropy (parent node) – [Avg Entropy (children)]
Example:
Here is a dataset that shows the chances of a golf match to happen. We have
created its decision tree.
We then split this dataset on the basis of different attributes. The entropy for
each branch is calculated. Then it is added proportionally, to get total entropy
for the split.
Here, Information Gain or Gain is the decrease in entropy after a dataset is
split on an attribute.
Example:
For attribute Outlook:
E (Play Golf, Outlook) = p*E (3, 2) + q*E (4, 0) + r*E (2, 3)
(here p, q and r are the probability of day is judged to be Sunny, Overcast and
Rainy respectively)
(E (3, 2) = Entropy (3, 2) and so on)
= (5/14) *0.971 + (4/14) *0 +(5/14) *0.971
= 0.693
Then,
Information Gain = Entropy (Play Golf) – E (Play Golf, Outlook)
= 0.940 – 0.693 = 0.247
Similarly, we will calculate Information Gain or Gain for different attributes.
Entropy:
Entropy measures the impurity in dataset. It is a measure of the randomness
in the information being processed. The higher the entropy, the harder it is to
draw any conclusions from that information. Low entropy indicates that the
dataset has homogenous (of same type) elements and vice versa.
In case of Binary classes, the value of entropy ranges between 0 to 1. For n
classes, value of entropy ranges from 0 to log2(n).
Calculating Entropy:
Entropy is calculated as:
Entropy = -p*log2(p) - q*log2(q)
Here, p and q are the probability of success and failure respectively in that
node.
Example:
Suppose there are 14 cases out of which in 5 cases we can’t play Golf and in
9 cases we can play Golf, then:
p = 5/14 = 0.36
q = 9/14 = 0.64
Entropy (Play Golf) = Entropy (5, 9)
= Entropy (0.36, 0.64)
= - (0.36*log2(0.36)) – (0.64*log2(0.64))
= 0.94
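The entropy and information-gain numbers from the golf example above (9 "play" days and 5 "don't play" days, split by Outlook) can be reproduced with a few lines of code:

```python
# Reproduce the entropy and information-gain values from the golf example.
from math import log2

def entropy(p, q):
    """Entropy of a node with class counts p and q."""
    total = p + q
    result = 0.0
    for count in (p, q):
        if count:
            frac = count / total
            result -= frac * log2(frac)
    return result

parent = entropy(9, 5)                                    # ~0.940
outlook = (5/14) * entropy(3, 2) + (4/14) * entropy(4, 0) + (5/14) * entropy(2, 3)  # ~0.693
print(parent, outlook, parent - outlook)                  # information gain ~0.247
```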
INSTANCE-BASED LEARNING
The Machine Learning systems which are categorized as instance-based
learning are the systems that learn the training examples by heart and then
generalizes to new instances based on some similarity measure. It is called
instance-based because it builds the hypotheses from the training instances.
It is also known as memory-based learning or lazy-learning (because they
delay processing until a new instance must be classified). The time
complexity of this algorithm depends upon the size of the training data. Each
time a new query is encountered, the previously stored data is examined
and a target function value is assigned to the new instance.
The worst-case time complexity of this algorithm is O (n), where n is the
number of training instances. For example, If we were to create a spam filter
with an instance-based learning algorithm, instead of just flagging emails that
are already marked as spam emails, our spam filter would be programmed to
also flag emails that are very similar to them. This requires a measure of
resemblance between two emails. A similarity measure between two emails
could be the same sender or the repetitive use of the same keywords or
something else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can
be made to the target function.
2. This algorithm can adapt easily to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query
involves starting the identification of a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning
K-Nearest Neighbor(KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and
available cases and puts the new case into the category that is most similar
to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears it can be
easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from
the training set immediately instead it stores the dataset and at the time
of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a
cat and a dog, and we want to know whether it is a cat or a dog. For this
identification we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are
similar to the cat and dog images and, based on the most similar features,
will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and
we have a new data point x1, so this data point will lie in which of these
categories. To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we have
already studied in geometry. It can be calculated as:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distance we got the nearest neighbors, as
three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this
new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the
K-NN algorithm:
o There is no particular way to determine the best value for "K", so we
need to try some values to find the best out of them. The most
preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to
the effects of outliers in the model.
o Large values for K are good, but it may find some difficulties.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o Always needs to determine the value of K which may be complex
some time.
o The computation cost is high because of calculating the distance
between the data points for all the training samples.
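A short sketch of the K-NN steps above with scikit-learn; the 2-D points and category tags are invented for illustration (Category A = 0, Category B = 1):

```python
# Sketch of K-NN: store the training data, then classify a new point
# by a vote among its K = 5 nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 5], [6, 7], [2, 2], [7, 7]]
y_train = [0, 0, 0, 1, 1, 1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=5)   # Step 1: choose K
knn.fit(X_train, y_train)                   # "training" just stores the data
print(knn.predict([[3, 3]]))                # Steps 2-5: vote among neighbours
```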
Locally Weighted Linear Regression
Within the field of machine learning and regression analysis, Locally
Weighted Linear Regression (LWLR) emerges as a notable approach that
bolsters predictive accuracy through the integration of local adaptation. In
contrast to conventional linear regression models, which presume a
universal correlation among variables, LWLR acknowledges the
significance of localized patterns and relationships present in the data. In
the subsequent discourse, we embark on an exploration of the
fundamental principles, diverse applications, and inherent advantages
offered by Locally Weighted Linear Regression. Our aim is to shed light on
its exceptional capacity to amplify predictive prowess and furnish intricate
understandings of intricate datasets.
Fundamentally, LWLR manifests as a non-parametric regression algorithm
that discerns the connection between a dependent variable and several
independent variables.
Notably, LWLR's distinctiveness emanates from its dynamic adaptability,
which empowers it to bestow distinct weights upon individual data points
contingent on their proximity to the target point under prediction. In
essence, this algorithm accords greater significance to proximate data
points, deeming them as more influential contributors in the prediction
process.
Principles of Locally Weighted Linear Regression
LWLR functions on the premise that the association between the
dependent and independent variables adheres to linearity; however, this
relationship is allowed to exhibit variability across distinct sections within
the dataset. This is achieved by employing an individual linear regression
model for each prediction, employing a weighted least squares technique.
The determination of weights is carried out through a kernel function,
which bestows elevated weights upon data points in close proximity to the
target point and diminishes the weights for those that are farther away.
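A minimal sketch of this idea with NumPy, under the assumption of a single input feature and a Gaussian kernel with an arbitrarily chosen bandwidth tau; it illustrates the weighted least-squares fit rather than a production implementation:

```python
# Sketch of locally weighted linear regression: for a query point, weight each
# training example by a Gaussian kernel of its distance to the query and fit a
# weighted least-squares line.
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query using Gaussian-weighted least squares (1-D X)."""
    Xb = np.column_stack([np.ones(len(X)), X])          # add intercept column
    xq = np.array([1.0, x_query])
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))  # kernel weights
    W = np.diag(w)
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return xq @ theta

# Illustrative 1-D data following a curve that a single global line fits poorly
X = np.linspace(0, 6, 30)
y = np.sin(X) + 0.1 * np.random.randn(30)
print(lwlr_predict(3.0, X, y))
```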
Applications of Locally Weighted Linear Regression
o Time Series Analysis: LWLR is particularly useful in time series analysis,
where the relationship between variables may change over time. By
adapting to the local patterns and trends, LWLR can capture the dynamics
of time-varying data and make accurate predictions.
o Anomaly Detection: LWLR can be employed for anomaly detection in
various domains, such as fraud detection or network intrusion detection.
By identifying deviations from the expected patterns in a localized
manner, LWLR helps detect abnormal behavior that may go unnoticed
using traditional regression models.
o Robotics and Control Systems: In robotics and control systems, LWLR
can be utilized to model and predict the behavior of complex systems. By
adapting to local conditions and variations, LWLR enables precise control
and decision-making in dynamic environments.
Benefits of Locally Weighted Linear Regression
o Improved Predictive Accuracy: By considering local patterns and
relationships, LWLR can capture subtle nuances in the data that might be
overlooked by global regression models. This results in more accurate
predictions and better model performance.
o Flexibility and Adaptability: LWLR can adapt to different regions of the
dataset, making it suitable for complex and non-linear relationships. It
offers flexibility in capturing local variations, allowing for more nuanced
analysis and insights.
o Interpretable Results: Despite its adaptive nature, LWLR still provides
interpretable results. The localized models offer insights into the
relationships between variables within specific regions of the data, aiding
in the understanding of complex phenomena.
Radial Basis Function Networks (RBFNs)
RBFNs are specific types of neural networks that follow a feed-forward
approach and make use of radial functions as activation functions. They
consist of three layers namely the input layer, hidden
layer, and output layer which are mostly used for time-series
prediction, regression testing, and classification.
RBFNs do these tasks by measuring the similarities present in the training
data set. They usually have an input vector that feeds these data into the
input layer thereby confirming the identification and rolling out results by
comparing previous data sets. Precisely, the input layer has neurons that
are sensitive to these data and the nodes in the layer are efficient in
classifying the class of data. Neurons are originally present in the hidden
layer though they work in close integration with the input layer. The
hidden layer contains Gaussian transfer functions that are inversely
proportional to the distance of the output from the neuron's center. The
output layer has linear combinations of the radial-based data where the
Gaussian functions are passed in the neuron as parameters and the output is
generated. Consider the given image below to understand the process
thoroughly.
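A rough sketch of this structure: a hidden layer of Gaussian units centred on a few prototype points, followed by a linear output layer fitted by least squares. The centres, width and data are chosen arbitrarily for illustration:

```python
# Sketch of an RBF network: Gaussian hidden units + linear output layer.
import numpy as np

def gaussian_layer(X, centres, gamma=1.0):
    """Hidden-layer activations: one Gaussian unit per centre."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])   # illustrative inputs
y = np.sin(X).ravel()                               # illustrative targets
centres = np.array([[0.5], [2.0], [3.5]])           # assumed prototype centres

H = gaussian_layer(X, centres)                      # hidden-layer outputs
w, *_ = np.linalg.lstsq(H, y, rcond=None)           # linear output weights
print(gaussian_layer(np.array([[2.5]]), centres) @ w)   # prediction at x = 2.5
```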
Case Based Reasoning
Introduction
Case-Based Reasoning in machine learning is an AI technique that is used to solve problems
based on past experiences. The technique is derived from human problem-solving
approaches, where people often rely on their past experiences to make decisions in new
situations. CBR is a type of machine learning that utilizes a database of previously solved
problems or cases to solve new problems. CBR is based on the idea that similar problems can
have similar solutions, and it uses this similarity to find solutions to new problems.
What is Case-Based Reasoning?
In Case-Based Reasoning in machine learning, a problem is solved by retrieving similar past
cases and adapting them to the current situation. The key terms present in CBR are:
Case: A case is a problem that has been previously solved and stored in the database.
Similarity: The similarity measure is used to determine the degree of resemblance
between past cases and the current situation.
Adaptation: Adaptation is the process of modifying a retrieved past case to fit the
current situation.
Process in Case-Based Reasoning
The CBR process typically involves four main steps: retrieve, reuse, revise, and retain.
Retrieve: The first step in the CBR process is to retrieve relevant cases from a case
library. This involves searching through the library to find cases that are similar to the
current problem. The goal is to identify cases that are as close to the current problem
as possible, as these are the most likely to provide useful information. In some cases,
the retrieval step may involve the use of keyword searches or other forms of data
mining to identify relevant cases.
Reuse: Once relevant cases have been retrieved, the next step is to reuse them to
solve the current problem. This involves adapting the solutions used in past cases to
fit the current problem. The goal is to find a solution that is similar enough to the past
cases to be effective, but also different enough to address the unique aspects of the
current problem. This step may involve selecting one or more past cases to use as a
starting point for the solution, or it may involve combining elements from multiple
past cases to create a new solution.
Revise: After a solution has been developed using past cases, the next step is to revise
it to better fit the current problem. This may involve modifying the solution based on
feedback from the user or on new information that has become available. The goal is
to refine the solution to make it as effective as possible for the current problem. In
some cases, the revision step may involve the use of machine learning algorithms to
optimize the solution.
Retain: The final step in the CBR process is to retain the newly developed solution
for future use. This involves adding the new case to the case library so that it can be
used in the retrieval step for future problems. The goal is to continually improve the
quality of the case library and the effectiveness of the CBR process over time. The
retention step may also involve the use of knowledge management tools to help
organize and maintain the case library.
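A toy sketch of the retrieve / reuse / revise / retain cycle, where the cases, the similarity measure and the adaptation rule are all invented purely for illustration:

```python
# Toy CBR sketch: retrieve the most similar past case, adapt its solution,
# and retain the new case in the library.
import math

case_library = [
    ({"rooms": 3, "area": 90}, 250_000),    # past cases: house features -> price
    ({"rooms": 5, "area": 180}, 480_000),
]

def distance(a, b):
    return math.dist([a["rooms"], a["area"]], [b["rooms"], b["area"]])

def solve(problem):
    case, solution = min(case_library, key=lambda c: distance(c[0], problem))  # retrieve
    adapted = solution * problem["area"] / case["area"]                        # reuse/revise
    case_library.append((problem, adapted))                                    # retain
    return adapted

print(solve({"rooms": 4, "area": 120}))
```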
Artificial Neural Network
The term "Artificial neural network" refers to a biologically inspired sub-
field of artificial intelligence modeled after the brain. An artificial neural
network is a computational network based on biological neural networks,
mirroring the structure of the human brain.
What is Artificial Neural Network?
The term "Artificial Neural Network" is derived from Biological neural
networks that develop the structure of a human brain. Similar to the
human brain that has neurons interconnected to one another, artificial
neural networks also have neurons that are interconnected to one another
in various layers of the networks. These neurons are known as nodes.
The given figure illustrates the typical diagram of Biological
Neural Network.
The typical Artificial Neural Network looks something like the
given figure.
Dendrites from Biological Neural Network represent inputs in Artificial
Neural Networks, cell nucleus represents Nodes, synapse represents
Weights, and Axon represents Output.
Relationship between Biological neural network and artificial neural
network:
Biological Neural Network | Artificial Neural Network
Dendrites | Inputs
Cell nucleus | Nodes
Synapse | Weights
Axon | Output
An artificial neural network, in the field of artificial intelligence, attempts to
mimic the network of neurons that makes up the human brain so that
computers can understand things and make decisions in a human-like
manner. An artificial neural network is designed by programming computers
to behave simply like interconnected brain cells.
The human brain contains on the order of 100 billion neurons, and each neuron
connects to somewhere between 1,000 and 100,000 others. In the human brain,
data is stored in a distributed manner, and we can extract more than one piece
of this data when necessary from our memory in parallel. We can say that the
human brain is made up of incredibly amazing parallel processors.
We can understand the artificial neural network with the example of a digital
logic gate that takes an input and gives an output. Consider an "OR" gate,
which takes two inputs. If one or both inputs are "On," the output is "On." If
both inputs are "Off," the output is "Off." Here the output depends only on the
input. Our brain does not perform the same task: the output-to-input
relationship keeps changing because the neurons in our brain are "learning."
The architecture of an artificial neural network:
To understand the architecture of an artificial neural network, we first have
to understand what a neural network consists of: a large number of artificial
neurons, termed units, arranged in a sequence of layers. Let us look at the
various types of layers available in an artificial neural network.
Artificial Neural Network primarily consists of three layers:
Input Layer:
As the name suggests, it accepts inputs in several different formats
provided by the programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the
calculations needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations in the hidden layers, which
finally results in the output conveyed through this layer.
The artificial neural network takes the inputs, computes the weighted sum of
the inputs, and adds a bias. This computation is represented in the form of a
transfer function.
The weighted total is then passed as input to an activation function, which
produces the output. Activation functions decide whether a node should fire or
not; only the nodes that fire contribute to the output layer. There are
distinct activation functions available, chosen according to the sort of task
we are performing.
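Below is a minimal sketch in Python (assuming NumPy is available; the input values, weights, and bias are illustrative) of this computation: the weighted sum of the inputs plus a bias is passed through a step activation function that decides whether the node fires.

```python
import numpy as np

def step(z):
    """Step activation: the node 'fires' (outputs 1) only if z is positive."""
    return 1 if z > 0 else 0

inputs = np.array([0.5, 0.3, 0.2])     # inputs provided to the node (illustrative)
weights = np.array([0.4, 0.7, -0.2])   # one weight per input (illustrative)
bias = 0.1

weighted_sum = np.dot(inputs, weights) + bias   # transfer function: sum(w*x) + b
output = step(weighted_sum)                     # activation decides whether the node fires
print(weighted_sum, output)                     # 0.47 -> node fires, output 1
```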
Advantages of Artificial Neural Network (ANN)
Parallel processing capability:
Thanks to their numerical structure, artificial neural networks can perform
more than one task simultaneously.
Storing data on the entire network:
Unlike traditional programming, where data is stored in a database, the data
used by an ANN is stored across the whole network. The disappearance of a few
pieces of data in one place does not prevent the network from working.
Capability to work with incomplete knowledge:
After training, an ANN may produce output even with incomplete data. The loss
of performance depends on the significance of the missing data.
Having a memory distribution:
For an ANN to be able to adapt, it is important to determine suitable examples
and to train the network toward the desired output by showing these examples
to it. The success of the network is directly proportional to the chosen
instances; if the problem is not presented to the network in all its aspects,
it can produce false output.
Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating
output, and this feature makes the network fault-tolerant.
Disadvantages of Artificial Neural Network:
Assurance of proper network structure:
There is no particular guideline for determining the structure of artificial
neural networks. The appropriate network structure is arrived at through
experience and trial and error.
Unrecognized behavior of the network:
This is the most significant issue with ANNs. When an ANN produces a solution,
it does not provide insight into why and how it arrived at it, which decreases
trust in the network.
Hardware dependence:
Artificial neural networks, by their structure, need processors with parallel
processing power. Their realization is therefore dependent on suitable
hardware.
Difficulty of showing the issue to the network:
ANNs can work only with numerical data, so problems must be converted into
numerical values before being introduced to the ANN. The representation chosen
here directly impacts the performance of the network and relies on the user's
abilities.
The duration of the network is unknown:
Training is stopped once the error is reduced to a specific value, but this
value does not guarantee optimal results.
Artificial neural networks, which stepped into the scientific world in the mid-20th century, are
developing rapidly. We have now looked at the advantages of artificial neural networks and the issues
encountered in the course of their use. It should not be overlooked that the disadvantages of ANNs,
a flourishing branch of science, are being eliminated one by one while their advantages grow day by
day, which means that artificial neural networks will progressively become an irreplaceable part of
our lives.
How do artificial neural networks work?
An artificial neural network can best be represented as a weighted directed
graph, where the artificial neurons form the nodes and the connections between
neuron outputs and neuron inputs can be viewed as directed edges with weights.
The network receives the input signal from an external source in the form of a
pattern or image, represented as a vector. Each of the n inputs is denoted
mathematically as x(n).
Each input is then multiplied by its corresponding weight (these weights are
the details the artificial neural network uses to solve a specific problem).
In general terms, these weights represent the strength of the interconnection
between neurons inside the network. All the weighted inputs are then summed
inside the computing unit.
If the weighted sum is equal to zero, a bias is added to make the output
non-zero, or otherwise to scale up the system's response. The bias acts like an
additional input fixed at 1, with its own weight. The total of the weighted
inputs can lie anywhere in the range from 0 to positive infinity, so to keep
the response within the limits of the desired value, a maximum value is
benchmarked and the total of the weighted inputs is passed through the
activation function.
The activation function refers to the set of transfer functions used to
achieve the desired output. There are different kinds of activation functions,
primarily either linear or non-linear. Some of the commonly used activation
functions are the binary, linear, and tan hyperbolic sigmoidal activation
functions. Let us take a look at each of them in detail:
Binary:
In the binary activation function, the output is either a 1 or a 0. To
accomplish this, a threshold value is set up: if the net weighted input of the
neuron exceeds the threshold, the activation function returns 1; otherwise it
returns 0.
Sigmoidal Hyperbolic:
The Sigmoidal Hyperbola function is generally seen as an "S" shaped
curve. Here the tan hyperbolic function is used to approximate output
from the actual net input. The function is defined as:
F(x) = (1/1 + exp(-????x))
Where ???? is considered the Steepness parameter.
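As a small illustration (a sketch assuming NumPy; the threshold and β values are arbitrary), the binary and sigmoidal activation functions described above can be written as:

```python
import numpy as np

def binary(x, threshold=0.0):
    """Binary activation: 1 if the net input exceeds the threshold, else 0."""
    return np.where(x > threshold, 1, 0)

def sigmoid(x, beta=1.0):
    """Sigmoidal activation F(x) = 1 / (1 + exp(-beta*x)); beta is the steepness."""
    return 1.0 / (1.0 + np.exp(-beta * x))

x = np.linspace(-5, 5, 5)
print(binary(x))            # 0 below the threshold, 1 above it
print(sigmoid(x, beta=2))   # smooth "S"-shaped curve; larger beta -> steeper
```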
Types of Artificial Neural Network:
There are various types of artificial neural networks (ANNs), which, modeled on
how the human brain's neurons and networks function, perform tasks in a similar
way. The majority of artificial neural networks have some similarities with
their more complex biological counterpart and are very effective at their
intended tasks, for example segmentation or classification.
Feedback ANN:
In this type of ANN, the output is fed back into the network to internally
arrive at the best-evolved result. According to the University of
Massachusetts Lowell Centre for Atmospheric Research, feedback networks feed
information back into themselves and are well suited to solving optimization
problems. Internal system error corrections utilize feedback ANNs.
Feed-Forward ANN:
A feed-forward network is a basic neural network comprising an input layer, an
output layer, and at least one hidden layer of neurons. By assessing its output
against its input, the strength of the network can be judged based on the group
behavior of the associated neurons, and the output is decided. The primary
advantage of this network is that it learns to evaluate and recognize input
patterns.
Perceptron in Machine Learning
In Machine Learning and Artificial Intelligence, the Perceptron is one of the
most commonly encountered terms. It is a first step in learning Machine
Learning and Deep Learning technologies and consists of a set of weights,
input values or scores, and a threshold. The Perceptron is a building
block of an Artificial Neural Network. The Perceptron was invented in the
mid-20th century by Frank Rosenblatt for performing certain calculations to
detect capabilities in input data or business intelligence. The Perceptron is
a linear Machine Learning algorithm used for the supervised learning of
various binary classifiers. The algorithm enables neurons to learn elements
and process them one by one during training. In this tutorial, "Perceptron in
Machine Learning," we will discuss the Perceptron and its basic functions in
brief. Let's start with a basic introduction to the Perceptron.
What is the Perceptron model in Machine Learning?
The Perceptron is a Machine Learning algorithm for the supervised learning
of various binary classification tasks. A Perceptron can also be
understood as an artificial neuron, or neural network unit, that
helps detect certain computations on input data in business
intelligence.
The Perceptron model is also regarded as one of the simplest and best types of
artificial neural networks. It is a supervised learning algorithm for binary
classifiers. Hence, we can consider it a single-layer neural network with four
main parameters, i.e., input values, weights and bias, net sum, and an
activation function.
What is Binary classifier in Machine Learning?
In Machine Learning, binary classifiers are functions that decide whether an
input, represented as a vector of numbers, belongs to a specific class.
Binary classifiers can be considered linear classifiers. In simple words, a
binary classifier is a classification algorithm whose prediction is based
on a linear predictor function combining weights with a feature vector.
Basic Components of Perceptron
Mr. Frank Rosenblatt invented the perceptron model as a binary classifier
which contains three main components. These are as follows:
o Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data
into the system for further processing. Each input node contains a real
numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units
and is another important parameter of the Perceptron's components. A weight is
directly proportional to the strength of the associated input neuron in
deciding the output. The bias can be thought of as the intercept in a linear
equation.
o Activation Function:
These are the final and important components that help to determine
whether the neuron will fire or not. Activation Function can be considered
primarily as a step function.
Types of Activation functions:
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem
statement and the desired outputs. The activation function used in a
perceptron model (e.g., sign, step, or sigmoid) may differ depending on
whether the learning process is slow or suffers from vanishing or exploding
gradients.
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural
network that consists of four main parameters named input values (Input
nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and
their weights, then adds these values together to create the weighted
sum. Then this weighted sum is applied to the activation function 'f' to
obtain the desired output. This activation function is also known as
the step function and is represented by 'f'.
This step function or Activation function plays a vital role in ensuring that
output is mapped between required values (0,1) or (-1,1). It is important
to note that the weight of input is indicative of the strength of a node.
Similarly, an input's bias value gives the ability to shift the activation
function curve up or down.
Perceptron model works in two important steps as follows:
Step-1
In the first step first, multiply all input values with corresponding weight
values and then add them to determine the weighted sum.
Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called bias 'b' to this weighted sum to improve the
model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-
mentioned weighted sum, which gives us output either in binary form or a
continuous value as follows:
Y = f(∑wi*xi + b)
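A minimal sketch of these two steps in Python (assuming NumPy; the example weights and bias are illustrative and happen to implement an OR gate):

```python
import numpy as np

def perceptron_output(x, w, b):
    """Step-1: weighted sum plus bias; Step-2: step activation f gives Y."""
    weighted_sum = np.dot(w, x) + b        # Step-1: sum(wi*xi) + b
    return 1 if weighted_sum > 0 else 0    # Step-2: Y = f(weighted sum)

w = np.array([0.6, 0.6])   # weights (illustrative)
b = -0.5                   # bias (illustrative)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(np.array(x, dtype=float), w, b))  # behaves like OR
```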
Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These
are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
Single Layer Perceptron Model:
This is one of the simplest types of artificial neural networks (ANNs). A
single-layer perceptron model consists of a feed-forward network and includes
a threshold transfer function. The main objective of the single-layer
perceptron model is to analyze linearly separable objects with binary
outcomes.
In a single-layer perceptron model, the algorithm does not use any recorded
data, so it begins with randomly allocated weight parameters. It then sums up
all the weighted inputs, and if the total is more than a pre-determined
threshold value, the model is activated and shows the output value as +1.
If the outcome matches the pre-determined threshold value, the performance of
the model is regarded as satisfactory, and no change to the weights is
demanded. However, this model shows some discrepancies when multiple weighted
input values are fed into it. Hence, to obtain the desired output and minimize
errors, some changes to the weights may be necessary.
"Single-layer perceptron can learn only linearly separable patterns."
Multi-Layered Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model has the
same basic structure but a greater number of hidden layers. The multi-layer
perceptron model is trained with the Backpropagation algorithm, which executes
in two stages as follows:
o Forward Stage: Activations propagate from the input layer through the hidden
layers and terminate at the output layer.
o Backward Stage: Weight and bias values are modified as per the model's
requirement. The error between the actual output and the desired output is
propagated backward, starting at the output layer and ending at the input
layer.
Hence, a multi-layered perceptron model can be considered as multiple layers
of artificial neurons in which the activation function does not remain linear,
unlike in a single-layer perceptron model. Instead, the activation function
can be sigmoid, TanH, ReLU, etc.
A multi-layer perceptron model has greater processing power and can
process linear and non-linear patterns. Further, it can also implement logic
gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
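As an illustration of this last point (a sketch assuming scikit-learn is available; the hidden-layer size, solver, and seed are arbitrary choices), a small multi-layer perceptron can learn the XOR function, which a single-layer perceptron cannot:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR truth table: not linearly separable, so a single-layer perceptron fails on it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# One hidden layer with a non-linear activation lets the network learn XOR.
# Convergence depends on initialization; a different random_state may be needed.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))   # expected: [0 1 1 0]
```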
Advantages of Multi-Layer Perceptron:
o A multi-layered perceptron model can be used to solve complex non-linear
problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.
Disadvantages of Multi-Layer Perceptron:
o In Multi-layer perceptron, computations are difficult and time-consuming.
o In multi-layer Perceptron, it is difficult to predict how much the dependent
variable affects each independent variable.
o The model functioning depends on the quality of the training.
Perceptron Function
The perceptron function 'f(x)' is obtained by multiplying the input 'x' by the
learned weight coefficient 'w', adding the bias 'b', and applying a threshold.
Mathematically, we can express it as follows:
f(x)=1; if w.x+b>0
otherwise, f(x)=0
o 'w' represents real-valued weights vector
o 'b' represents the bias
o 'x' represents a vector of input x values.
Characteristics of Perceptron
The perceptron model has the following characteristics.
1. Perceptron is a machine learning algorithm for supervised learning of
binary classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is
made whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weighted
sum is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between
the two linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it
must have an output signal; otherwise, no output will be shown.
Limitations of Perceptron Model
A perceptron model has limitations as follows:
o The output of a perceptron can only be a binary number (0 or 1) due to
the hard limit transfer function.
o Perceptron can only be used to classify linearly separable sets of input
vectors. If the input vectors are not linearly separable, it is not easy to
classify them properly.
Gradient descent and the Delta rule are both concepts related to
optimization algorithms, particularly in the context of training artificial
neural networks. Let's explore each of these concepts individually:
1. Gradient Descent:
Gradient descent is an iterative optimization algorithm used to find
the minimum of a function. In the context of machine learning, this
function is often the cost or loss function, which measures the
difference between the predicted output of a model and the actual
target values.
The basic idea is to iteratively move towards the minimum of the
function by taking steps proportional to the negative of the gradient
of the function at the current point.
The formula for updating the parameters θ in the direction of the
negative gradient (∇J) is given by:
θ := θ − η⋅∇J(θ)
Here, η is the learning rate, a hyperparameter that determines the
size of the steps taken during each iteration.
2. Delta Rule (Widrow-Hoff Rule):
The Delta rule is a specific application of gradient descent to update
the weights of connections in a neural network during the training
process. It is also known as the Widrow-Hoff learning rule.
In the context of a single-layer perceptron (a simple neural network
with one layer of weights), the Delta rule can be expressed as:
Δwij = η⋅(dj − yj)⋅xi
Here,
Δwij is the change in the weight connecting the ith input
neuron to the jth output neuron.
η is the learning rate.
dj is the target output of the jth output neuron for the current
training example.
yj is the actual output of the jth output neuron.
xi is the input from the ith input neuron.
The weights are updated for each connection in the network based on the
difference between the predicted output and the target output, with the
learning rate determining the step size.
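The following is a minimal sketch (assuming NumPy; the toy dataset, targets, and learning rate are illustrative) of the Delta rule applied to a single linear output neuron. Each weight update is exactly a gradient descent step on the squared error, which ties the two concepts above together.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 2.])          # toy targets: simply x1 + x2
w = rng.random(2)                       # randomly initialized weights
b = 0.0
eta = 0.1                               # learning rate

for epoch in range(100):
    for x, target in zip(X, d):
        y = np.dot(w, x) + b            # actual output of the linear neuron
        error = target - y              # (d - y)
        w += eta * error * x            # Delta rule: delta_w = eta * (d - y) * x
        b += eta * error                # bias treated as a weight with input 1

print(w, b)   # should approach w = [1, 1], b = 0
```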
In summary, gradient descent is a general optimization algorithm used to
minimize a function, while the Delta rule is a specific application of
gradient descent used for updating weights in a neural network during
training. Both play crucial roles in the iterative process of adjusting model
parameters to improve performance.
Introduction to Deep Learning:
Deep learning is a subset of machine learning that focuses on neural
networks with multiple layers, known as deep neural networks. The "deep"
in deep learning refers to the presence of multiple layers through which
data passes, enabling the network to automatically learn hierarchical
representations of features from the input data. Deep learning has
demonstrated remarkable success in various tasks, including image and
speech recognition, natural language processing, and playing strategic
games.
Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) is a type of Deep Learning neural
network architecture commonly used in Computer Vision. Computer vision is
a field of Artificial Intelligence that enables a computer to understand and
interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really
well. Neural networks are used on various kinds of data, such as images, audio,
and text, and different types of neural networks are used for different
purposes: for example, for predicting a sequence of words we use Recurrent
Neural Networks (more precisely, an LSTM), while for image classification we
use Convolutional Neural Networks. In this section, we build up the basic
building blocks of a CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The
number of neurons in this layer is equal to the total number of features in
our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden
layer. There can be many hidden layers depending on our model and
data size. Each hidden layer can have different numbers of neurons
which are generally greater than the number of features.
The output of each layer is computed by matrix-multiplying the output of the
previous layer with that layer's learnable weights, adding learnable biases,
and then applying an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic
function like sigmoid or softmax which converts the output of each class
into the probability score of each class.
Feeding the data into the model and obtaining the output from each layer as
described above is called the feedforward pass. We then calculate the error
using an error function; some common error functions are cross-entropy and
squared loss. The error function measures how well the network is performing.
After that, we backpropagate through the model by calculating the derivatives.
This step, called backpropagation, is used to minimize the loss.
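The following is a small illustrative sketch (assuming NumPy; the layer sizes, random weights, and target class are arbitrary) of one feedforward pass and the error computation just described; backpropagation would then compute the gradients of this loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)                            # input features (4 of them)
W1, b1 = rng.random((5, 4)), np.zeros(5)     # hidden layer: 5 neurons
W2, b2 = rng.random((3, 5)), np.zeros(3)     # output layer: 3 classes

h = np.maximum(0, W1 @ x + b1)               # hidden layer: weights*input + bias, ReLU
logits = W2 @ h + b2
probs = np.exp(logits) / np.exp(logits).sum()   # softmax: class probability scores

target = 1                                   # true class index (illustrative)
loss = -np.log(probs[target])                # cross-entropy error for this example
print(loss)
# Backpropagation would now compute d(loss)/dW2 and d(loss)/dW1 via the chain
# rule and update the weights with gradient descent.
```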
Convolution Neural Network
A Convolutional Neural Network (CNN) is an extended version of the artificial
neural network (ANN), used predominantly to extract features from grid-like
matrix datasets, for example visual datasets such as images or videos, where
data patterns play an extensive role.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer,
Convolutional layer, Pooling layer, and fully connected layers.
Simple CNN architecture
The Convolutional layer applies filters to the input image to extract features,
the Pooling layer downsamples the image to reduce computation, and the
fully connected layer makes the final prediction. The network learns the
optimal filters through backpropagation and gradient descent.
How Convolutional Layers works
Convolutional Neural Networks, or convnets, are neural networks that share
their parameters. Imagine you have an image. It can be represented as a cuboid
having width and height (the dimensions of the image) and depth (the channels,
since images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural
network, called a filter or kernel, on it, with say K outputs, and representing
them vertically. Now slide that neural network across the whole image; as a
result, we get another image with a different width, height, and depth.
Instead of just the R, G, and B channels, we now have more channels but less
width and height. This operation is called convolution. If the patch size were
the same as that of the image, it would be a regular neural network. Because
of this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having
small widths and heights and the same depth as that of input volume (3 if
the input layer is image input).
For example, if we have to run a convolution on an image of dimensions
34x34x3, the possible filter sizes are a×a×3, where 'a' can be 3, 5, or 7,
but smaller than the image dimensions.
During the forward pass, we slide each filter across the whole input volume
step by step, where each step is called a stride (which can have a value of 2,
3, or even 4 for high-dimensional images), and compute the dot product between
the kernel weights and the patch from the input volume.
As we slide our filters, we get a 2-D output for each filter; stacking them
together gives an output volume with a depth equal to the number of filters.
The network learns all the filters.
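As a concrete sketch of the arithmetic above (assuming NumPy; the random image and filters are placeholders), sliding twelve 3×3×3 filters over a 34×34×3 input with stride 1 yields a 32×32×12 output volume:

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one kernel over an (H, W, C) image and return a 2-D feature map."""
    kh, kw, _ = kernel.shape
    H, W, _ = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out

image = np.random.rand(34, 34, 3)                    # 34x34x3 input, as in the example
filters = [np.random.rand(3, 3, 3) for _ in range(12)]   # twelve 3x3x3 filters
feature_maps = np.stack([conv2d_single(image, k) for k in filters], axis=-1)
print(feature_maps.shape)   # (32, 32, 12) -- depth equals the number of filters
```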
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a
convnet. A convnet is a sequence of layers, and every layer transforms one
volume into another through a differentiable function.
Types of layers:
Let's take an example by running a convnet on an image of dimension 32 x
32 x 3.
Input Layers: It's the layer in which we give input to our model. In a CNN,
the input is generally an image or a sequence of images. This layer holds the
raw input image with width 32, height 32, and depth 3.
Convolutional Layers: This layer extracts features from the input dataset. It
applies a set of learnable filters, known as kernels, to the input images. The
filters/kernels are small matrices, usually of shape 2×2, 3×3, or 5×5. Each
kernel slides over the input image data and computes the dot product between
the kernel weights and the corresponding input image patch. The output of this
layer is referred to as feature maps. If we use a total of 12 filters for this
layer, we get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the
preceding layer, activation layers add nonlinearity to the network. An
element-wise activation function is applied to the output of the convolution
layer. Some common activation functions are ReLU (max(0, x)), Tanh, and Leaky
ReLU. The volume remains unchanged, so the output volume still has dimensions
32 x 32 x 12.
Pooling layer: This layer is periodically inserted in the convnet, and its
main function is to reduce the size of the volume, which makes the computation
faster, reduces memory usage, and also helps prevent overfitting. Two common
types of pooling layers are max pooling and average pooling.
If we use max pooling with 2 x 2 filters and stride 2, the resultant volume
will be of dimension 16x16x12.
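A small sketch (assuming NumPy; the random volume is a placeholder) of 2×2 max pooling with stride 2, which reduces the 32×32×12 volume above to 16×16×12:

```python
import numpy as np

def max_pool(feature_maps, size=2, stride=2):
    """2x2 max pooling with stride 2 over an (H, W, D) volume."""
    H, W, D = feature_maps.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_maps[i*stride:i*stride+size, j*stride:j*stride+size, :]
            out[i, j, :] = window.max(axis=(0, 1))   # keep the max in each channel
    return out

volume = np.random.rand(32, 32, 12)
print(max_pool(volume).shape)   # (16, 16, 12)
```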
Flattening: The resulting feature maps are flattened into a one-dimensional
vector after the convolution and pooling layers so they can be passed into a
fully connected layer for classification or regression.
Fully Connected Layers: It takes the input from the previous layer and
computes the final classification or regression task.
Output Layer: For classification tasks, the output from the fully connected
layers is fed into a logistic function, such as sigmoid or softmax, which
converts the output for each class into a probability score.
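As a hedged illustration (assuming TensorFlow/Keras is available; the filter count and layer sizes simply mirror the 32x32x3 walk-through above and are not a recommended design), the layer sequence just described could be written as:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                 # raw 32x32 RGB input image
    layers.Conv2D(12, (3, 3), padding='same',
                  activation='relu'),                # 12 filters -> 32x32x12 feature maps
    layers.MaxPooling2D(pool_size=(2, 2)),           # downsample -> 16x16x12
    layers.Flatten(),                                # flatten to a 1-D vector
    layers.Dense(10, activation='softmax'),          # class probability scores
])
model.summary()
```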
Training Convolutional Neural Network
Training a Convolutional Neural Network (CNN) involves the process of adjusting the
weights and biases of the network to minimize the difference between its predictions
and the actual target values. The training process is typically done using a dataset
with known input-output pairs, and it can be summarized in the following steps:
1. Data Preparation:
o Input Data: Gather a dataset containing input data (e.g., images) and their
corresponding target labels. Split the dataset into training and validation sets.
o Data Augmentation: Augment the training data by applying random
transformations (e.g., rotations, flips, zooms) to increase the diversity of the
training set.
2. Network Architecture:
o Define the Architecture: Design the architecture of the CNN, including the
number of convolutional layers, activation functions, pooling layers, and fully
connected layers.
o Initialize Weights and Biases: Initialize the weights and biases of the network
randomly. Common initialization methods include Xavier/Glorot initialization.
3. Forward Propagation:
o Input Forward Pass: Pass a batch of training data through the network to
compute the predicted output.
o Loss Calculation: Compare the predicted output with the actual target values
using a loss function (e.g., mean squared error, cross-entropy).
4. Backpropagation:
o Compute Gradients: Calculate the gradient of the loss with respect to the
weights and biases using the chain rule of calculus. This is done through
backpropagation.
o Update Weights and Biases: Adjust the weights and biases in the direction that
minimizes the loss. Common optimization algorithms include stochastic gradient
descent (SGD), Adam, and RMSprop.
5. Iterative Optimization:
o Repeat: Iterate over steps 3 and 4 for multiple epochs. An epoch is one
complete pass through the entire training dataset.
o Validation: Periodically evaluate the model on a separate validation set to
monitor its generalization performance. This helps prevent overfitting.
6. Hyperparameter Tuning:
o Learning Rate: Adjust the learning rate, a hyperparameter that controls the size
of the steps taken during optimization. Learning rates that are too high can lead to
divergence, while rates that are too low may result in slow convergence.
o Batch Size: Experiment with different batch sizes. Smaller batches introduce
more noise but may converge faster.
o Regularization: Consider adding regularization techniques (e.g., dropout) to
prevent overfitting.
7. Model Evaluation:
o Test Set Evaluation: Once training is complete, evaluate the model on a
separate test set to assess its performance on unseen data.
8. Fine-Tuning:
o Transfer Learning: For some tasks, consider using pre-trained models and fine-
tune them on your specific dataset.
9. Deployment:
o Deploy the Model: Once satisfied with the model's performance, deploy it for
making predictions on new, unseen data.
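To tie these steps together, here is a minimal, hedged sketch (assuming TensorFlow/Keras is installed; CIFAR-10 is used only because it provides 32x32x3 images, and the epoch count, batch size, and optimizer are placeholders to tune, not recommendations):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Step 1: data preparation -- CIFAR-10 gives 32x32x3 images with 10 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

# Step 2: network architecture (the small CNN sketched earlier).
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(12, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])

# Steps 3-4: loss function and optimizer; the forward pass, backpropagation,
# and weight updates are handled internally by fit().
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Steps 5-6: iterate for several epochs, holding out part of the training data
# for validation; batch size and learning rate are tunable hyperparameters.
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)

# Step 7: evaluate on the unseen test set.
test_loss, test_acc = model.evaluate(x_test, y_test)
print(test_acc)
```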