“NAAC” Accredited & ISO 9001:2015 Certified Institute
VIDYA VIKAS PRATISHTHAN
V.V.P.INSTITUTE OF ENGINEERING & TECHNOLOGY,SOLAPUR
72/2 B, Pratapnagar, Soregaon-Dongaon Road, Soregaon, Solapur - 413004
Phone :( 0217)-6054555, 6452555, Email: vvpiet@rediffmail.com,
Website:www.vvpengineering.org
Department of Computer Science &
Engineering
TY (CSE)
Machine Learning
Prof. S.L.Birajdar
Unit 1
Introduction to Machine Learning
Contents
1.1 Basic Definition
1.2 Types of learning
1.3 Hypothesis Space and Inductive Bias
1.4 Evaluation
1.5 Linear Regression
1.6 Decision Trees
1.7 Overfitting
1.1 Definition of Machine Learning
"Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed."
OR
"A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E."
1.2 Types of Machine learning
Machine learning is sub-categorized to three types:
Fig. Types of Machine Learning
a) Supervised Learning:
Supervised machine learning algorithms are designed to learn by example. The name
“supervised” learning originates from the idea that training this type of algorithm is like
having a teacher supervise the whole process.
When training a supervised learning algorithm, the training data will consist of inputs paired
with the correct outputs. During training, the algorithm will search for patterns in the data
that correlate with the desired outputs. After training, a supervised learning algorithm will
take in new unseen inputs and will determine which label the new inputs will be classified as
based on prior training data.
Fig. Supervised Learning.
The objective of a supervised learning model is to predict the correct label for newly
presented input data. At its most basic form, a supervised learning algorithm can be written
simply as:
Where Y is the predicted output that is determined by a mapping function that assigns a class
to an input value x. The function used to connect input features to a predicted output is
created by the machine learning model during training.
Types of Supervised learning:
Supervised learning can be split into two subcategories: Classification and regression.
1. Classification :
Fig. Classification.
During training, a classification algorithm will be given data points with an assigned
category. The job of a classification algorithm is to then take an input value and assign it a
class, or category, that it fits into based on the training data provided.
The most common example of classification is determining if an email is spam or not. With
two classes to choose from (spam, or not spam), this problem is called a binary classification
problem. The algorithm will be given training data with emails that are both spam and not
spam. The model will find the features within the data that correlate to either class and create
the mapping function mentioned earlier: Y=f(x). Then, when provided with an unseen email,
the model will use this function to determine whether or not the email is spam.
Classification problems can be solved with a numerous amount of algorithms. Whichever
algorithm you choose to use depends on the data and the situation.
Here are a few popular classification algorithms:
• Linear Classifiers
• Support Vector Machines
• Decision Trees
• K-Nearest Neighbor
• Random Forest
2. Regression :
Regression is a predictive statistical process where the model attempts to find the important
relationship between dependent and independent variables. The goal of a regression
algorithm is to predict a continuous number such as sales, income, and test scores. The
equation for basic linear regression can be written as so:
Where x[i] is the feature(s) for the data and where w[i] and b are parameters which are
developed during training. For simple linear regression models with only one feature in the
data, the formula looks like this:
Where w is the slope, x is the single feature and b is the y-intercept. Familiar? For simple
regression problems such as this, the models predictions are represented by the line of best
fit. For models using two features, the plane will be used. Finally, for a model using more
than two features, a hyperplane will be used. Imagine we want to determine a student’s test
grade based on how many hours they studied the week of the test. Let's say the plotted data
with a line of best fit looks like this
Fig. Regression.
There is a clear positive correlation between hours studied (independent variable) and the
student’s final test score (dependent variable). A line of best fit can be drawn through the
data points to show the models predictions when given a new input. Say we wanted to know
how well a student would do with five hours of studying. We can use the line of best fit to
predict the test score based on other student’s performances.
b) Unsupervised Learning:
Unsupervised learning is a machine learning technique, where you do not need to supervise
the model. Instead, you need to allow the model to work on its own to discover information.
It mainly deals with the unlabelled data. Unsupervised learning algorithms allows you to
perform more complex processing tasks compared to supervised learning. Although,
unsupervised learning can be more unpredictable compared with other natural learning
methods.
Fig. Unsupervised Learning.
The model learns through observation and finds structures in the data. Once the model is
given a dataset, it automatically finds patterns and relationships in the dataset by creating
clusters in it. What it cannot do is add labels to the cluster , like it cannot say this a group of
apples or mangoes, but it will separate all the apples from mangoes.(give eg. of circle and
square.)
c) Reinforcement Learning:
It is the ability of an agent to interact with the environment and find out what is the best
outcome. It follows the concept of hit and trial method. The agent is rewarded or penalized
with a point for a correct or a wrong answer, and on the basis of the positive reward points
gained the model trains itself. And once trained it gets ready to predict the new data
presented to it.
Fig. Reinforcement Learning
Supervised learning where a student is under the supervision of a teacher at both home and
school, Unsupervised learning where a student has to figure out a concept himself and Semi-
Supervised learning where a teacher teaches a few concepts in class and gives questions as
homework which are based on similar concepts.
1.3 Hypothesis Space and Inductive Bias:
a) Hypothesis Space:-The space of all hypothesis that can, in principle, be output by a
learning algorithm. Each setting of the parameters in the machine is a different hypothesis
about the function that maps input vectors to output vectors.
Hypothesis space is a set of valid hypothesis, i.e. all possible functions. If Hypothesis space
is H & hypothesis h, here H belongs to h which means H is a valid hypothesis whereas h is
all hypothesis.
In the training set given the particular data points, the learning algorithm will come up with
one of the hypothesis in the hypothesis space which hypothesis it comes up with will depend
on the data, and it also will depend on what type of restrictions or biases that we have
imposed.
Instance space:
Instance space is the set of all possible objects that can be described by the features.
Terminology:
Example (x,y) : Instance x with label y=f(x)
Training Data s : Collection of examples observed by learning algorithm.
Instance Space x : Set of all possible objects describable by features.
Concept c : Subset of objects from X (c is unknown).
Target Function f : Maps each instance x ε X to target label y ε Y
Classifier:
Hypothesis h : Function that approximates f.
Hypothesis Space H : Set of functions we allow for approximating f.
The set of hypotheses that can be produced, can be restricted further by specifying a language bias.
Input : Training set S ⊆ X
Output : A hypothesis h ⊆ H
b) Inductive Bias:- The inductive bias (also known as learning bias) of a learning
algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it
has not encountered. In machine learning, one aims to construct algorithms that are able
to learn to predict a certain target output. To achieve this, the learning algorithm is presented
some training examples that demonstrate the intended relation of input and output values.
Then the learner is supposed to approximate the correct output, even for examples that have
not been shown during training. Without any additional assumptions, this problem cannot be
solved exactly since unseen situations might have an arbitrary output value. The kind of
necessary assumptions about the nature of the target function are subsumed in the
phrase inductive bias.
1.3.1 Find-S Algorithm:
In order to understand Find-S algorithm, you need to have a basic idea of the following
concepts as well:
1. Concept Learning
2. General Hypothesis
3. Specific Hypothesis
1. Concept Learning:
Most of human learning is based on past instances or experiences. For example, we are able
to identify any type of vehicle based on a certain set of features like make, model, etc., that
are defined over a large set of features.
These special features differentiate the set of cars, trucks, etc from the larger set of vehicles.
These features that define the set of cars, trucks, etc are known as concepts.
Similar to this, machines can also learn from concepts to identify whether an object belongs
to a specific category or not. Any algorithm that supports concept learning requires the
following:
• Training Data
• Target Concept
• Actual Data Objects
2. General Hypothesis:
Hypothesis, in general, is an explanation for something. The general hypothesis basically
states the general relationship between the major variables. For example, a general
hypothesis for ordering food would be I want a burger.
G = { ‘?’, ‘?’, ‘?’, …..’?’}
3. Specific Hypothesis:
The specific hypothesis fills in all the important details about the variables given in the
general hypothesis. The more specific details into the example given above would be I want a
cheeseburger with a chicken pepperoni filling with a lot of lettuce.
S = {‘Φ’,’Φ’,’Φ’, ……,’Φ’}
The Find-S algorithm follows the steps written below:
1. The process starts with initializing ‘h’ with the most specific hypothesis, generally, it is
the first positive example in the data set.
2. We check for each positive example. If the example is negative, we will move on to the
next example but if it is a positive example we will consider it for the next step.
3. We will check if each attribute in the example is equal to the hypothesis value.
4. If the value matches, then no changes are made.
5. If the value does not match, the value is changed to ‘?’.
6. We do this until we reach the last positive example in the data set.
Fig. Flowchart of Find S algorithm.
Implementation of Find-S Algorithm
To understand the implementation, let us try to implement it to a smaller data set with a
bunch of examples to decide if a person wants to go for a walk.
The concept of this particular problem will be on what days does a person likes to go on
walk.
Fig. Dataset
So now, the general hypothesis is:
h0 = {‘Morning’, ‘Sunny’, ‘Warm’, ‘Yes’, ‘Mild’, ‘Strong’}
This is our general hypothesis, and now we will consider each example one by one, but only
the positive examples.
h1= {‘Morning’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}
h2 = {‘?’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}
We replaced all the different values in the general hypothesis to get a resultant hypothesis.
Limitations of Find-S Algorithm:
There are a few limitations of the Find-S algorithm listed down below:
1. There is no way to determine if the hypothesis is consistent throughout the data.
2. Inconsistent training sets can actually mislead the Find-S algorithm, since it ignores the
negative examples.
3. Find-S algorithm does not provide a backtracking technique to determine the best
possible changes that could be done to improve the resulting hypothesis.
1.3.2 Candidate Elimination Algorithm
The candidate-Elimination algorithm computes the version space containing all (and only
those) hypotheses from H that are consistent with an observed sequence of training
examples.
➢ Candidate Elimination Algorithm performs a bidirectional search in the hypothesis
space.
➢ It maintains a set, S,of most specific hypotheses that are consistent with the training
data and a set ,G, of most general hypotheses that are consistent with the training
data.
➢ These two set forms two boundaries on a version space. It consider both positive and
negative examples. We have both hypothesis i.e most general and most specific.
➢ For a positive example :
We tend to generalize specific hypothesis.
➢ For a negative example :
Let’s take an example to understand the Candidate Elimination Algorithm.
Sky Temp Humid Wind Water Forecast Enjoy Sport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Fig. Dataset
S: {<ɸ,ɸ,ɸ,ɸ,ɸ,ɸ>}
G: {<?,?,?,?,?,?>}
x = <Sunny, Warm, Normal, Strong, Warm, Same>
S:{ <Sunny , Warm , Normal , Strong , Warm , Same>}
G: {<?,?,?,?,?,?>}
x = <Sunny, Warm, High, Strong, Warm, Same>
S:{ <Sunny , Warm , ? , Strong , Warm , Same>}
G: {<?,?,?,?,?,?>}
x = <Rainy, Cold, High, Strong, Warm, Change>
S:{ <Sunny , Warm , ? , Strong , Warm , Same>}
G: {<Sunny,?,?,?,?,?>,<?,Warm,?,?,?,?>,<?,?,?,?,?,Same>}
x = <Sunny, Warm, High, Strong, Cool, Change>
S:{ <Sunny , Warm , ? , Strong , ? , ?>}
G: {<Sunny,?,?,?,?,?>,<?,Warm,?,?,?,?>}
1.4 Model Evaluation Techniques:
Model evaluation aims to estimate the generalization accuracy of a model on future
(unseen/out-of-sample) data. Methods for evaluating a model’s performance are divided into
2 categories: namely, holdout and Cross-validation. Both methods use a test set (i.e. data not
seen by the model) to evaluate model performance.
a) Holdout: The purpose of holdout evaluation is to test a model on different data than it
was trained on. This provides an unbiased estimate of learning performance.In this method,
the dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build predictive models.
2. Validation set is a subset of the dataset used to assess the performance of the model
built in the training phase. It provides a test platform for fine-tuning a model’s
parameters and selecting the best performing model. Not all modeling algorithms need
a validation set.
3. Test set, or unseen data, is a subset of the dataset used to assess the likely future
performance of a model. If a model fits to the training set much better than it fits the
test set, Overfitting is probably the cause.
The holdout approach is useful because of its speed, simplicity, and flexibility. However, this
technique is often associated with high variability since differences in the training and test
dataset can result in meaningful differences in the estimate of accuracy.
b) Cross-Validation: Cross-validation is a technique in which we train our model using the subset
of the data-set and then evaluate using the complementary subset of the data-set.
Methods of Cross Validation:
1. Validation
In this method, we perform training on the 50% of the given data-set and rest 50% is used
for the training purpose. The major drawback of this method is that we perform training
on the 50% of the dataset, it may possible that the remaining 50% of the data contains
some important information which we are leaving while training our model i.e. higher
bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole data-set but leaves only one data-point
of the available data-set and then iterates for each data-point. It has some advantages as
well as disadvantages. An advantage of using this method is that we make use of all data
points and hence it is low bias.
The major drawback of this method is that it leads to higher variation in the testing model
as we are testing against one data point. If the data point is an outlier it can lead to higher
variation. Another drawback is it takes a lot of execution time as it iterates over ‘the
number of data points’ times.
3. K-Fold Cross Validation
In this method, we split the data-set into k number of subsets(known as folds) then we
perform training on the all the subsets but leave one(k-1) subset for the evaluation of the
trained model. In this method, we iterate k times with a different subset reserved for
testing purpose each time.
Example:
The diagram below shows an example of the training subsets and evaluation subsets
generated in k-fold cross-validation. Here, we have total 25 instances. and [1-5 and 10-
25] training), and so on.
Fig. K-Fold Cross Validation
In first iteration we use the first 20 percent of data for evaluation, and the remaining 80
percent for training([1-5] testing and [5-25] training) while in the second iteration we use
the second subset of 20 percent for evaluation, and the remaining three subsets of the data
for training([5-10] testing.
1.5 Linear regression:
Linear Regression is a supervised machine learning algorithm where the predicted output is
continuous and has a constant slope. It’s used to predict values within a continuous range,
(e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog).
Linear regression is a linear model that assumes a linear relationship between the input
variables (x) and the single output variable (y). More specifically, that y can be calculated
from a linear combination of the input variables (x).
When there is a single input variable (x), the method is referred to as simple linear
regression.
When there are multiple input variables, the method is known as multiple linear regression.
a) Simple linear regression:
In simple linear regression, we establish a relationship between target variable and input
variables by fitting a line, known as the regression line.
In general, a line can be represented by linear equation y = m * X + b. Where, y is the
dependent variable, X is the independent variable, m is the slope, b is the intercept.
In machine learning, we rewrite our equation as y(x) = w0+ w1 * x where w’s are the
parameters of the model, x is the input, and y is the target variable.
b) Multiple linear regression:
The above equation can be used when we have one input variable (also called feature).
However, in general, we usually deal with datasets which have multiple input variables.
The case when we have more than one feature is known as multiple linear regression, or
simply, linear regression. We can generalize our previous equation for simple linear
regression to multiple linear regression:
y(x) = w0 + w1x1 + w2x2 +…+ wnxn
In the case of multiple linear regression, instead of our prediction being a line in 2-
dimensional space, it is a hyper plane in n-dimensional space.
Linear Regression Terminologies:
The following terminologies are important to be familiar with before moving on to the linear
regression algorithm.
a) Cost Function:
The best fit line can be based on the linear equation given below.
• The dependent variable that is to be predicted is denoted by Y.
• A line that touches the y-axis is denoted by the intercept b0.
Fig. Linear Equation
• b1 is the slope of the line, x represents the independent variables that determine
the prediction of Y.
• The error in the resultant prediction is denoted by e.
The cost function provides the best possible values for b0 and b1 to make the best fit line for
the data points. We do it by converting this problem into a minimization problem to get the
best values for b0 and b1. The error is minimized in this problem between the actual value and
the predicted value.
We choose the function above to minimize the error. We square the error difference and sum
the error over all data points, the division between the total number of data points. Then, the
produced value provides the averaged square error over all data points.
It is also known as MSE(Mean Squared Error), and we change the values of b0 and b1 so that
the MSE value is settled at the minimum.
b) Gradient Descent:
The next important terminology to understand linear regression is gradient descent. It is a
method of updating b0 and b1 values to reduce the MSE. The idea behind this is to keep
iterating the b0 and b1 values until we reduce the MSE to the minimum.
To update b0 and b1, we take gradients from the cost function. To find these gradients, we
take partial derivatives with respect to b0 and b1. These partial derivatives are the gradients
and are used to update the values of b0 and b1.
Fig.learning rate
A smaller learning rate takes closer to the minimum, but it takes more time and in case of a
larger learning rate. The time taken is sooner but there is a chance to overshoot the minimum
value. Now that we are through with the terminologies in linear regression, let us take a look
at a few advantages and disadvantages of linear regression for machine learning.
Advantages of Linear Regression:
1. Linear regression performs exceptionally well for linearly separable data
2. Easier to implement, interpret and efficient to train
3. It handles overfitting pretty well using dimensionally reduction techniques,
regularization, and cross-validation
4. One more advantage is the extrapolation beyond a specific data set
Disadvantages of Linear Regression:
1. The assumption of linearity between dependent and independent variables
2. It is often quite prone to noise and overfitting
3. Linear regression is quite sensitive to outliers
4. It is prone to multicollinearity
1.6 Decision trees:
A decision tree is a flowchart-like structure in which each internal node represents a test on a
feature (e.g. whether a coin flip comes up heads or tails) , each leaf node represents a class
label (decision taken after computing all features) and branches represent conjunctions of
features that lead to those class labels. The paths from root to leaf represent classification
rules. Below diagram illustrate the basic flow of decision tree for decision making with labels
(Rain(Yes), No Rain(No)).
Fig. Decision tree
Decision trees are constructed via an algorithmic approach that identifies ways to split a data
set based on different conditions. It is one of the most widely used and practical methods for
supervised learning. Decision Trees are a non-parametric supervised learning method used
for both classification and regression tasks.
Steps for making decision tree:
• Get list of rows (dataset) which are taken into consideration for making decision tree
(recursively at each node).
• Calculate uncertainty of our dataset or Gini impurity or how much our data is mixed
up etc.
• Generate list of all question which needs to be asked at that node.
• Partition rows into True rows and False rows based on each question asked.
• Calculate information gain based on gini impurity and partition of data from previous
step.
• Update highest information gain based on each question asked.
• Update best question based on information gain (higher information gain).
• Divide the node on best question. Repeat again from step 1 again until we get pure node
(leaf nodes).
1.7 Underfitting & Overfitting in machine learning:
a) Underfitting:
A statistical model or a machine learning algorithm is said to have underfitting when it
cannot capture the underlying trend of the data. Underfitting destroys the accuracy of our
machine learning model. Its occurrence simply means that our model or the algorithm
does not fit the data well enough. It usually happens when we have less data to build an
accurate model and also when we try to build a linear model with a non-linear data. In
such cases the rules of the machine learning model are too easy and flexible to be applied
on such a minimal data and therefore the model will probably make a lot of wrong
predictions. Underfitting can be avoided by using more data and also reducing the
features by feature selection.
Techniques to reduce underfitting:
1. Increase model complexity
2. Increase number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.
b) Overfitting:
A statistical model is said to be overfitted, when we train it with a lot of data . When a
model gets trained with so much of data, it starts learning from the noise and inaccurate
data entries in our data set. Then the model does not categorize the data correctly,
because of too much of details and noise. The causes of overfitting are the non-
parametric and non-linear methods because these types of machine learning algorithms
have more freedom in building the model based on the dataset and therefore they can
really build unrealistic models. A solution to avoid overfitting is using a linear algorithm
if we have linear data or using the parameters like the maximal depth if we are using
decision trees.
Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (have an eye over the loss over the training period
as soon as loss begins to increase stop training).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
Fig. Underfitting Appropriate fitting and Overfitting
c) Good Fit in a Statistical Model:
Ideally, the case when the model makes the predictions with 0 error, is said to have a good
fit on the data. This situation is achievable at a spot between overfitting and underfitting. In
order to understand it we will have to look at the performance of our model with the passage
of time, while it is learning from training dataset.
With the passage of time, our model will keep on learning and thus the error for the model on
the training and testing data will keep on decreasing. If it will learn for too long, the model
will become more prone to overfitting due to the presence of noise and less useful details.
Hence the performance of our model will decrease. In order to get a good fit, we will stop at
a point just before where the error starts increasing. At this point the model is said to have
good skills on training datasets as well as our unseen testing dataset.
Important Questions
Q.1] What is machine learning? Explain different types of machine learning.
Q.2] Explain types of supervised learning.
Q.3] Explain hypothesis space and inductive bias.
Q.4] Explain Find S algorithm.
Q.5] Explain Candidate Elimination algorithm.
Q.6] Explain cross-validation.
Q.7] Explain Linear regression.
Q.8] Explain Decision trees.
Q.9] Explain underfitting & overfitting in machine learning.