UNIT-IV: Model Validation in
Classification
By
Dr.V.Srilakshmi
Associate
Professor,
CSE, GRIET
Content
► Model Validation in Classification: Cross Validation - Holdout Method,
K-Fold, Stratified KFold, Leave-One-Out Cross Validation.
► Bias-Variance tradeoff, Regularization, Overfitting, Underfitting.
► Ensemble Methods: Boosting, Bagging, Random Forest.
Bias and Variance in Machine Learning
► Machine learning is a branch of Artificial Intelligence, which allows machines to perform data
analysis and make predictions.
► However, if the machine learning model is not accurate, it makes prediction errors, and these
prediction errors are usually known as Bias and Variance. In machine learning, these errors will
always be present, as there is always a slight difference between the model's predictions and the
actual values.
► The main aim of ML/data science analysts is to reduce these errors in order to get more accurate
results. In this topic, we are going to discuss bias and variance, the Bias-Variance trade-off,
Underfitting and Overfitting. But before starting, let's first understand what errors in Machine
Learning are.
Bias and Variance in Machine Learning
► The model accuracy depends on the prediction errors, i.e., Bias and Variance, that will always be
associated with any machine learning model.
► There will always be a slight difference between what our model predicts and the actual values. These
differences are called errors.
► The goal of an analyst is not to eliminate errors but to reduce them; there is always a limit to how
low the errors can be made.
► In Machine Learning, error is used to see how accurately our model can predict both on the data it uses to
learn and on new, unseen data. Based on this error, we choose the machine learning model which
performs best for a particular dataset.
► There are two main types of errors present in any machine learning model. They are Reducible Errors
and Irreducible Errors.
❖ Irreducible errors are errors which will always be present in a machine learning model, because
of unknown variables, and whose values cannot be reduced.
❖ Reducible errors: These errors can be reduced to improve the model accuracy. Reducible errors can
further be divided into two types: Bias and Variance.
Fig: Errors in Machine Learning
Bias:
□ Bias is the difference between our model's predictions and the actual values. It reflects the simplifying
assumptions that our model makes about the data in order to be able to predict new data.
□ When the bias is high, the assumptions made by our model are too basic, and the model cannot capture the
important features of our data. This means that our model has not captured the patterns in the training data
and hence cannot perform well on the testing data either. If this is the case, the model cannot perform well on
new data and cannot be sent into production.
Fig 2: Bias
► This instance, where the model cannot find patterns in our training set and hence
fails for both seen and unseen data, is called Underfitting.
► The below figure shows an example of Underfitting. As we can see, the model has
found no patterns in our data and the line of best fit is a straight line that does not
pass through any of the data points. The model has failed to train properly on the
data given and cannot predict new data either.
Figure 3: Underfitting
Variance:
► Variance is the very opposite of Bias. During training, the model is allowed to 'see' the data
a certain number of times to find patterns in it. If it does not work on the data for long
enough, it will not find patterns, and bias occurs. On the other hand, if our model is allowed
to view the data too many times, it will learn very well for only that data. It will capture
most patterns in the data, but it will also learn from the unnecessary data present, i.e., from
the noise.
► We can define variance as the model’s sensitivity to fluctuations in the data. Our model
may learn from noise. This will cause our model to consider trivial features as important.
Figure 4: Example of Variance
► In the above figure, we can see that our model has learned extremely well for our training
data, which has taught it to identify cats. But when given new data, such as the picture of
a fox, our model predicts it as a cat, as that is what it has learned. When the Variance is high,
our model captures all the features of the data given to it, including the noise, tunes itself to the
data, and predicts it very well; but when given new data, it cannot predict well because it is too
specific to the training data.
► Hence, our model will perform really well on the training data and get high accuracy, but will fail
to perform on new, unseen data. New data may not have the exact same features, and the
model won't be able to predict it very well. This is called Overfitting.
► Fig: Over-fitted model, showing model performance on (a) training data and (b) new data
Different Combinations of Bias-Variance
► There are four possible combinations of bias and variances, which are represented by the below diagram:
1. Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine
learning model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions
are inconsistent but accurate on average. This case occurs when the model learns
with a large number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not
learn well from the training dataset or uses very few parameters. It leads
to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also
inaccurate on average.
Bias-Variance Tradeoff:
► If the model is very simple with fewer parameters, it may have low variance and
high bias. Whereas, if the model has a large number of parameters, it will have
high variance and low bias. So, it is required to make a balance between bias and
variance errors, and this balance between the bias error and variance error is
known as the Bias-Variance trade-off.
► An optimized model will be sensitive to the patterns in our data, but at the
same time will be able to generalize to new data. In this, both the bias and
variance should be low so as to prevent overfitting and underfitting.
► For an accurate prediction of the model, algorithms need a low variance
and low bias. But this is not possible because bias and variance are
related to each other:
► If we decrease the variance, it will increase the bias.
► If we decrease the bias, it will increase the variance.
► Bias-Variance trade-off is a central issue in supervised learning. Ideally, we need a model that
accurately captures the regularities in training data and simultaneously generalizes well with the
unseen dataset. Unfortunately, it is typically not possible to do both at once: a high-variance
algorithm may perform well on the training data, but it may overfit to noisy data,
whereas a high-bias algorithm generates a much simpler model that may not even capture important
regularities in the data. So, we need to find a sweet spot between bias and variance to make an
optimal model.
► Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between
bias and variance errors.
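► The trade-off can also be seen empirically. Below is a minimal illustrative sketch (not from the original slides), assuming scikit-learn and a synthetic noisy sine dataset: a degree-1 polynomial underfits (high bias), while a degree-15 polynomial overfits (high variance).

```python
# Minimal sketch of the bias-variance trade-off on synthetic data (illustrative only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # degree 1: both errors high (underfitting); degree 15: low train error but
    # noticeably higher test error (overfitting); degree 3: the sweet spot.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```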
► Regularization:
► Why Regularization?
► Sometimes our Machine learning model performs well on the training data but does not
perform well on the unseen or test data, i.e., it has learned the noise in the training data and is not able to
predict the output or target column for the unseen data; such a model is called an overfitted model.
► By noise, we mean the data points in the dataset which don't really represent the true properties
of the data, but are present only due to random chance.
□ What is Regularization?
▪ It is one of the most important concepts of machine learning. This technique prevents the model from
overfitting by adding extra information to it.
▪ It is a form of regression that shrinks the coefficient estimates towards zero. In other words, this
technique discourages learning a more complex or flexible model, in order to avoid the
problem of overfitting.
▪ Now, let's understand how the flexibility of a model is represented.
▪ For regression problems, an increase in the flexibility of a model is reflected in an increase in its coefficients,
which are calculated from the regression line.
▪ In simple words, "In the Regularization technique, we reduce the magnitude of the coefficients of the independent
variables while keeping the same number of variables". It maintains accuracy as well as the generalization of the model.
► Techniques of Regularization
► Mainly, there are two types of regularization techniques, which are given below:
• Ridge Regression (L2 Regularization)
• Lasso Regression (L1 Regularization)
Ridge Regression
► Ridge regression is one of the types of linear regression in which we introduce a small amount of
bias, known as Ridge regression penalty so that we can get better long-term predictions.
► A regression model that uses the L2 regularization technique is called Ridge regression. Ridge
regression adds the “squared magnitude” of the coefficient as a penalty term to the loss function(L).
► In this technique, the cost function is altered by adding the penalty term (shrinkage term), which
multiplies the lambda with the squared weight of each individual feature. Therefore, the
optimization function(cost function) becomes:
Cost function = Σ_{i=1}^{n} (y_i − ŷ_i)² + λ Σ_{j=1}^{m} (w_j)²
where,
n – Number of Examples, m – Number of Features
y_i – Actual Target Value, ŷ_i – Predicted Target Value
Usage of Ridge Regression:
• When we have the independent variables which are having high collinearity (problem of
multicollinearity) between them, at that time general linear or polynomial regression will fail so to
solve such problems, Ridge regression can be used.
• If we have more parameters than the samples, then Ridge regression helps to solve the problems.
► Limitation of Ridge Regression:
• Not helps in Feature Selection: It decreases the complexity of a model but does not reduce
the number of independent variables since it never leads to a coefficient being zero rather only
minimizes it. Hence, this technique is not good for feature selection.
• Model Interpretability: Its disadvantage is model interpretability since it will shrink the
coefficients for least important predictors, very close to zero but it will never make them exactly
zero. In other words, the final model will include all the independent variables, also known as
predictors.
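► A minimal sketch of Ridge regression with scikit-learn (the synthetic dataset and alpha value are illustrative, not from the slides); alpha plays the role of the penalty parameter λ, and the coefficients are shrunk but never set exactly to zero.

```python
# Minimal Ridge (L2) regression sketch; alpha is the penalty strength (lambda).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # larger alpha => stronger shrinkage

# Coefficients are shrunk towards zero, but (unlike Lasso) none become exactly zero.
print("largest OLS coefficient   :", abs(ols.coef_).max())
print("largest Ridge coefficient :", abs(ridge.coef_).max())
print("zero coefficients in Ridge:", (ridge.coef_ == 0).sum())
```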
Lasso Regression
► Lasso regression is another variant of the regularization technique used to reduce the complexity of
the model. It stands for Least Absolute and Selection Operator.
► It is similar to the Ridge Regression except that the penalty term includes the absolute weights
instead of a square of weights.
► A regression model which uses the L1 Regularization technique is called LASSO(Least Absolute
Shrinkage and Selection Operator) regression. Lasso Regression adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function(L). Lasso regression also helps
us achieve feature selection by penalizing the weights to approximately equal to zero if that feature
does not serve any purpose in the model.
The cost function becomes:
Cost function = Σ_{i=1}^{n} (y_i − ŷ_i)² + λ Σ_{j=1}^{m} |w_j|
where,
n – Number of Examples, m – Number of Features
y_i – Actual Target Value, ŷ_i – Predicted Target Value
In this technique, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly
equal to zero which means there is a complete removal of some of the features for model evaluation
when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs
Feature selection and is said to yield sparse models.
► Limitation of Lasso Regression:
• Problems with some types of Dataset: If the number of predictors is greater than the number of data
points, Lasso will pick at most n predictors (where n is the number of data points) as non-zero, even if all predictors are relevant.
• Multicollinearity Problem: If there are two or more highly collinear variables then LASSO
regression selects one of them randomly which is not good for the interpretation of our model.
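► A minimal sketch of Lasso regression with scikit-learn (synthetic data and alpha chosen only for illustration), showing the feature-selection effect: many coefficients are driven exactly to zero.

```python
# Minimal Lasso (L1) regression sketch showing sparsity / feature selection.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 features, of which only 5 actually carry signal
X, y = make_regression(n_samples=80, n_features=100, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

# Coefficients for uninformative features are set exactly to zero,
# i.e. those features are effectively removed from the model.
print("non-zero coefficients:", (lasso.coef_ != 0).sum(), "out of", X.shape[1])
```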
► Key Differences between Ridge and Lasso Regression
► Ridge regression helps us to reduce only the overfitting in the model while keeping all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients whereas
Lasso regression helps in reducing the problem of overfitting in the model as well as automatic
feature selection.
► Lasso Regression tends to make coefficients to absolute zero whereas Ridge regression never sets
the value of coefficient to absolute zero.
► Regularization:
► For regression models – we use L1 and L2 regularization.
► For decision trees and others – we use bagging, boosting, etc.
► For neural networks – we use Dropout, Data Augmentation and Early
Stopping.
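► A hedged sketch (assuming TensorFlow/Keras and synthetic data, neither of which is part of the slides) of the neural-network regularizers mentioned above: a Dropout layer plus an EarlyStopping callback.

```python
# Minimal sketch of Dropout + Early Stopping as regularization in a neural network.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),          # randomly drops 50% of units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training once the validation loss stops improving.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                        restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stop], verbose=0)
```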
Cross Validation
► Suppose you build a machine learning model to solve a problem, and you have trained the model on a given dataset.
When you check the accuracy of the model on the training data, it is close to 95%. Does this mean that your model
has trained very well, and it is the best model because of the high accuracy?
► No, it's not! Because your model is trained on the given data, it knows the data well, has captured even the minute
variations (noise), and fits the given data very well. If you expose the model to completely new,
unseen data, it might not predict with the same accuracy and it might fail to generalize over the new data. This
problem is called over-fitting.
► Sometimes the model doesn’t train well on the training set as it’s not able to find patterns. In this case, it wouldn’t
perform well on the test set as well. This problem is called Under-fitting.
► To overcome over-fitting problems, we use a technique called Cross-Validation.
► Cross-Validation is a resampling technique with the fundamental idea of splitting the dataset into 2 parts – training
data and test data. The training data is used to train the model, and the unseen test data is used for prediction. If the model
performs well on the test data and gives good accuracy, it means the model hasn't overfitted the training data and
can be used for prediction.
Types of Cross Validation in Machine Learning
1. Hold Out method
2. K-Fold
3. Stratified K-Fold
4. Leave-One-Out Cross Validation.
1. Hold Out method:
This is the simplest evaluation method and is widely used in Machine Learning projects. Here the
entire dataset(population) is divided into 2 sets – train set and test set. The data can be divided into
70-30 or 60-40, 75-25 or 80-20, or even 50-50 depending on the use case. As a rule, the proportion
of training data has to be larger than the test data.
► Example – Emails in our inbox are classified as spam or not spam.
► Pros
• This method is entirely data-independent.
• This method requires only one execution, which results in cheaper computing costs.
► Cons
• Due to the lower amount of data, the performance is more variable.
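► A minimal sketch of the hold-out method with scikit-learn (the Iris dataset, 80-20 split and logistic regression are illustrative choices, not from the slides):

```python
# Minimal hold-out split: 80% for training, 20% held out for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)      # single train/test split

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```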
► 2. K-Fold Cross Validation: In this resampling technique, the whole data is divided into k sets of almost
equal size. The first set is selected as the test set and the model is trained on the remaining k-1 sets. The
test error rate is then calculated by evaluating the fitted model on the test set.
► In the second iteration, the 2nd set is selected as the test set and the remaining k-1 sets are used to train
the model, and the error is calculated again. This process continues for all the k sets.
► Limitation: Overfitting
► Pros
• This will aid in resolving the computing power issue.
• Models may be unaffected by the presence of an outlier in the data.
• It assists us in overcoming the issue of unpredictability.
► Cons
• Incorrectly balanced data sets will affect our model.
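► A minimal sketch of k-fold cross-validation (k = 5) in scikit-learn; the dataset and classifier are illustrative. Each fold serves once as the test set while the remaining k-1 folds train the model:

```python
# Minimal 5-fold cross-validation sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)      # k = 5 folds

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```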
3. Stratified k-fold cross-validation
► This technique is similar to k-fold cross-validation, with some small changes. This approach works on the
concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good
representative of the complete dataset. It is one of the best approaches to deal with bias and variance.
► It can be understood with an example of housing prices, where the price of some houses can be much
higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.
► With this technique we can overcome the problem of overfitting.
► The K-Fold Cross Validation approach will not operate as expected for an imbalanced dataset. When a
dataset is imbalanced, a modest modification to the K-Fold cross-validation procedure is required to ensure
that each fold has nearly the same proportion of samples from each output class as the complete dataset.
Stratified K-Fold Cross Validation involves using such strata in K-Fold Cross-Validation.
► Pros
• It may enhance many models through hyper-parameter adjustment.
• Assists us in comparing models.
• It contributes to the reduction of both bias and variance.
► Cons
• Execution is expensive.
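► A minimal sketch of stratified k-fold cross-validation in scikit-learn (illustrative dataset and model); each fold keeps approximately the same class proportions as the complete dataset:

```python
# Minimal stratified 5-fold cross-validation sketch.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # class counts in every test fold mirror the overall 1:1:1 class balance of Iris
    print("fold", fold, "test-fold class counts:", np.bincount(y[test_idx]))

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("mean accuracy:", scores.mean())
```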
4. Leave One Out Cross-Validation
► In this method, we divide the data into train and test sets – but with a twist. Instead of dividing the data into
2 subsets, we select a single observation as test data, and everything else is labeled as training data and the
model is trained. Now the 2nd observation is selected as test data and the model is trained on the remaining
data.
► One of the biggest drawbacks of this type is that, in each iteration, a major part of the data sample is used for
training the model while only a single data point is used to evaluate it, so the model has to be fitted as many
times as there are observations. Therefore, this type is often considered to be an expensive method.
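► A minimal sketch of leave-one-out cross-validation in scikit-learn (illustrative dataset and model); the model is fitted n times, each time testing on a single held-out observation, which is why the method is expensive:

```python
# Minimal leave-one-out cross-validation (LOOCV) sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print("number of fits:", len(scores))       # one fit per sample (150 for Iris)
print("LOOCV accuracy:", scores.mean())
```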
Ensemble Methods
Ensemble learning is a machine learning technique that combines several base models in order to
produce one optimal predictive model. A single algorithm may not make the perfect prediction for a given
data set. Machine learning algorithms have their limitations, and producing a model with high accuracy is
challenging. If we build and combine multiple models, we have the chance to boost the overall accuracy.
Simple Ensemble Techniques:
1. Max Voting
2. Averaging
3. Weighted Averaging
1. Max Voting:
► The max voting method is generally used for classification problems. In this technique, multiple
models are used to make predictions for each data point. The predictions by each model are considered
as a ‘vote’. The predictions which we get from the majority of the models are used as the final
prediction.
► For example, suppose you asked 5 of your colleagues to rate your movie (out of 5); assume three
of them rated it as 4 while two of them gave it a 5. Since the majority gave a rating of 4, the final
rating will be taken as 4. You can consider this as taking the mode of all the predictions.
Ensemble Methods
► The result of max voting would be something like this:

| Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating |
|---|---|---|---|---|---|
| 5 | 4 | 5 | 4 | 4 | 4 |
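► A minimal sketch of max (hard) voting with scikit-learn's VotingClassifier (dataset and base models are illustrative): the majority class among the three base models becomes the final prediction.

```python
# Minimal hard (max) voting ensemble sketch.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="hard")                        # hard voting = majority class label wins
vote.fit(X_train, y_train)
print("max-voting accuracy:", vote.score(X_test, y_test))
```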
► 2. Averaging:
► Similar to the max voting technique, multiple predictions are made for each data point in
averaging. In this method, we take an average of predictions from all the models and use it to
make the final prediction. Averaging can be used for making predictions in regression problems
or while calculating probabilities for classification problems.
► For example, in the below case, the averaging method would take the average of all the values.
► i.e. (5+4+5+4+4)/5 = 4.4
| Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating |
|---|---|---|---|---|---|
| 5 | 4 | 5 | 4 | 4 | 4.4 |
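► A minimal sketch of averaging for classification (illustrative dataset and models): the predicted class probabilities of two models are averaged, and the class with the highest averaged probability is chosen.

```python
# Minimal averaging-ensemble sketch using predicted probabilities.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

m1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
m2 = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

avg_proba = (m1.predict_proba(X_test) + m2.predict_proba(X_test)) / 2
pred = avg_proba.argmax(axis=1)           # class with highest averaged probability
print("averaging accuracy:", (pred == y_test).mean())
```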
Ensemble Methods
3. Weighted Average:
► This is an extension of the averaging method. All models are assigned different weights defining
the importance of each model for prediction. For instance, if two of your colleagues are critics,
while others have no prior experience in this field, then the answers by these two friends are
given more importance as compared to the other people.
► The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
|        | Colleague 1 | Colleague 2 | Colleague 3 | Colleague 4 | Colleague 5 | Final rating |
|--------|---|---|---|---|---|---|
| weight | 0.23 | 0.23 | 0.18 | 0.18 | 0.18 | |
| rating | 5 | 4 | 5 | 4 | 4 | 4.41 |
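► The weighted-average calculation in the table above can be reproduced with numpy (a small illustrative sketch):

```python
# Weighted average of the colleagues' ratings from the table above.
import numpy as np

ratings = np.array([5, 4, 5, 4, 4])
weights = np.array([0.23, 0.23, 0.18, 0.18, 0.18])    # weights sum to 1

final_rating = np.average(ratings, weights=weights)   # (5*0.23 + 4*0.23 + ...)
print(round(final_rating, 2))                         # 4.41
```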
Ensemble Methods
1. Bagging (Bootstrap Aggregation)
► Bagging is an ensemble method that involves training multiple models
independently on random subsets of the data, and aggregating their predictions
through voting or averaging. Bagging reduces variance and minimizes
overfitting. One example of such a technique is the random forest algorithm.
► This technique is based on bootstrap sampling. Bootstrapping
creates multiple subsets of the original training data by sampling with replacement; replacement
enables the duplication of sample instances within a subset. Each subset has the same
size and can be used to train models in parallel.
Ensemble Methods: Bagging
1. Multiple subsets are created from the original dataset, selecting observations with replacement.
2. A base model (weak model) is created on each of these subsets.
3. The models run in parallel and are independent of each other.
► The final predictions are determined by combining the predictions from all the models.
Ensemble Methods: Boosting
2. Boosting:
► Boosting is an ensemble technique in which models are trained sequentially, and each new model tries to
correct the errors made by the previous ones. The weak learners are then combined into a single strong
learner, which mainly helps to reduce bias.
Ensemble Methods: Boosting
The following are the steps in the boosting algorithm:
Initialise weights: At the start of the process, each training example is given equal weight.
Train a weak learner: The weighted training data is used to train a weak learner. A weak learner
is a simple model that outperforms random guessing only marginally. A decision tree with a few
levels, for example, can be employed as a weak learner.
Error calculation: The error of the weak learner on the training data is computed. The weighted
sum of misclassified cases constitutes the error.
Update weights: Weights are updated according to the error rate on the training
examples. Misclassified examples are given higher weights, whereas correctly classified
examples are given lower weights.
Repeat: Steps 2–4 are repeated several times. A new weak learner is trained on the updated
weights of the training examples in each cycle.
Combine weak learners: The final model is made up of all of the weak learners that were trained
in the preceding steps. The accuracy of each weak learner is weighted, and the final prediction is
based on the weighted total of the weak learners.
Predict: The final model is used to predict the class labels of new instances.
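► A minimal from-scratch sketch of the boosting loop described in the steps above (an AdaBoost-style weight update; the synthetic dataset, decision-stump weak learner and number of rounds are illustrative assumptions):

```python
# From-scratch sketch of the boosting steps: initialise weights, train weak
# learners on re-weighted data, up-weight misclassified points, combine learners.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y_pm = np.where(y == 1, 1, -1)            # labels in {-1, +1} for the update rule

n = len(y)
w = np.full(n, 1.0 / n)                   # step 1: equal initial weights
learners, alphas = [], []

for _ in range(10):                       # step 5: repeat steps 2-4
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)      # step 2: weak learner on weighted data
    pred = np.where(stump.predict(X) == 1, 1, -1)
    err = np.sum(w[pred != y_pm])         # step 3: weighted error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    w *= np.exp(-alpha * y_pm * pred)     # step 4: raise weights of mistakes
    w /= w.sum()
    learners.append(stump)
    alphas.append(alpha)

# step 6: weighted vote of all weak learners
agg = sum(a * np.where(m.predict(X) == 1, 1, -1) for a, m in zip(alphas, learners))
final_pred = np.where(agg >= 0, 1, 0)
print("training accuracy:", (final_pred == y).mean())     # step 7: predict
```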
Ensemble Methods
Algorithms based on Bagging and Boosting:
► Bagging and Boosting are two of the most commonly used techniques in
machine learning. The following are the algorithms:
► Bagging algorithms:
∙ Bagging meta-estimator
∙ Random forest
► Boosting algorithms:
∙ AdaBoost(Adaptive Boosting)
∙ GBM(Gradient Boosting Machines)
∙ XGBoost(Extreme Gradient Boosting)
Ensemble Methods
► Bagging meta-estimator:
► Bagging meta-estimator is an ensembling algorithm that can be used for both
classification (BaggingClassifier) and regression (BaggingRegressor) problems.
It follows the typical bagging technique to make predictions. Following are the
steps for the bagging meta-estimator algorithm:
1. Random subsets are created from the original dataset (Bootstrapping).
2. The subset of the dataset includes all features.
3. A user-specified base estimator is fitted on each of these smaller sets.
4. Predictions from each model are combined to get the final result.
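► A minimal sketch of the bagging meta-estimator in scikit-learn (BaggingClassifier with decision trees; the dataset and number of estimators are illustrative): each of the 10 trees is trained on a bootstrap sample and their predictions are combined.

```python
# Minimal BaggingClassifier sketch: 10 decision trees on bootstrap samples.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag = BaggingClassifier(DecisionTreeClassifier(),   # user-specified base estimator
                        n_estimators=10,            # number of bootstrap subsets / models
                        random_state=42)            # bootstrap sampling is the default
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))
```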
Ensemble Methods
► Random Forest :
► Random Forest is another ensemble machine learning algorithm that follows the bagging technique. It is an
extension of the bagging estimator algorithm. The base estimators in random forest are decision trees.
Unlike bagging meta estimator, random forest randomly selects a set of features which are used to decide
the best split at each node of the decision tree.
► Steps:-
1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random set of features are considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by averaging the predictions from all decision trees.
► Note: The decision trees in a random forest can be built on a subset of data and features. In particular, the
sklearn random forest model gives each tree access to all features, and a random subset of features is
selected for splitting at each node.
► To sum up, Random forest randomly selects data points and features, and builds multiple trees (Forest) .
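► A minimal sketch of a random forest in scikit-learn (dataset and hyper-parameters are illustrative); max_features controls the random subset of features considered at each split:

```python
# Minimal RandomForestClassifier sketch.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100,      # number of bootstrapped trees
                            max_features="sqrt",   # features considered at each split
                            random_state=42)
rf.fit(X_train, y_train)
print("random forest accuracy:", rf.score(X_test, y_test))
```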
Ensemble Methods: Boosting, Bagging, Random Forest:
Ensemble Methods
► AdaBoost
► Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used for modelling.
Multiple sequential models are created, each correcting the errors from the last model. AdaBoost assigns weights to the
observations which are incorrectly predicted and the subsequent model works to predict these values correctly.
► Below are the steps for performing the AdaBoost algorithm:
1. Initially, all observations in the dataset are given equal weights.
2. A model is built on a subset of data.
3. Using this model, predictions are made on the whole dataset.
4. Errors are calculated by comparing the predictions and actual values.
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
6. Weights can be determined using the error value. For instance, the higher the error, the more weight is assigned to the
observation.
7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is
reached.
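► A minimal sketch of AdaBoost in scikit-learn (dataset and hyper-parameters are illustrative); shallow decision trees are the default weak learners, trained sequentially on re-weighted samples:

```python
# Minimal AdaBoostClassifier sketch.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada = AdaBoostClassifier(n_estimators=50,      # maximum number of weak learners
                         learning_rate=1.0,
                         random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```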