Data Science concepts
Overfitting & Underfitting
         Aitor Larrinoa
         January 2025
Contents
1 Introduction
2 What are overfitting and underfitting?
3 Example
4 Avoid overfitting and underfitting
  4.1   Parametric models
  4.2   Tree based algorithms
  4.3   Neural networks
1    Introduction
Our principal goal when we train an ML model is to get good results. Thus, the better the
model's metric, the better its performance. However, is this entirely true? We have
to be cautious, because our main goal should instead be good generalization.
A model is said to generalize well when it can handle new, unseen input data effectively.
However, finding a balance between fitting the training data and performing well on new data
is not straightforward and can lead to two common problems: overfitting and underfitting.
In this post we will dive into the concepts of overfitting and underfitting: we will understand
why they happen, work through a practical example, and see what strategies we can use to
avoid them.
2    What are overfitting and underfitting?
Overfitting and underfitting are two of the biggest problems when dealing with ML models.
Just like human beings, learning machines must be able to generalize concepts. Suppose
that we see a Labrador Retriever for the first time in our lives, and someone tells us, “That
is a dog.” Later, we are shown a Poodle and asked, “Is that a dog?” We might say, “No,” as
it looks nothing like what we previously learned. Now imagine someone shows us a book with
pictures of 10 different dog breeds. When we see a breed we are unfamiliar with, we will be able
to recognize it as a dog because of the characteristics observed in the various dogs depicted in
the photos.
The goal is to ensure that the model can generalize a concept so that, when presented with a
new, unfamiliar dataset, it can still provide a reliable result. Before diving into overfitting
and underfitting, the following concepts must be understood:
Definition 2.1. Bias is the difference between the model’s average prediction and the correct
value it aims to predict.
Definition 2.2. Variance is the variability of the model’s prediction for a given data point,
that is, how much the prediction spreads when the model is trained on different training sets.
So now, what is overfitting? What is underfitting?
    • Overfitting: The model will only adjust to learn the specific cases it is taught (training
      set) and will be unable to recognize new input data (test set).
    • Underfitting: Underfitting, in contrast to overfitting, occurs when the algorithm fails to
      extract meaningful patterns from the data and is unable to generalize the knowledge.
In other words, underfitting occurs when the model is too simple, resulting in high bias and
an inability to capture the true patterns in the data, whereas overfitting happens when the
model is too complex, leading to high variance and poor generalization to unseen data.
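For squared-error loss, the link between these two failure modes and the definitions above is usually written as the bias-variance decomposition. The notation below is an assumption introduced here, not taken from the post: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat{f} is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Read this way, an underfitted model is dominated by the first term and an overfitted model by the second, while the third term cannot be reduced by any choice of model.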
The following figure shows visual examples of overfitting and underfitting for classification
and regression tasks:
                            Figure 1: Overfitting and underfitting
3    Example
We will create an example in order to see the relevance of overfitting and underfitting more
easily.
Let’s suppose we have a dataset where y is a function of x and their relationship is
given by the following equation:
                                            y = x^2
If we consider a linear regression model, for example y = β0 + β1 · x, the error will be high
because a straight line cannot capture the curvature of this relationship. This
is underfitting and can be seen in the following plot:
                                Figure 2: Underfitting example
Clearly, the line does not fit our data points well. In fact, the evaluation metric confirms
the poor performance of the model. This is a clear case of underfitting.
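A minimal sketch of this underfitting case, using only NumPy. The interval [-3, 3], the 50 noise-free points, and the choice of R^2 as the metric are assumptions made for illustration; the linked notebook may use a different setup.

```python
import numpy as np

# Assumed setup: 50 evenly spaced points on [-3, 3] with y = x^2 (no noise).
x = np.linspace(-3.0, 3.0, 50)
y = x ** 2


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Fit y = beta0 + beta1 * x by least squares (np.polyfit returns the
# highest-degree coefficient first).
beta1, beta0 = np.polyfit(x, y, deg=1)
y_line = beta0 + beta1 * x

# For comparison, a degree-2 polynomial can represent y = x^2 exactly.
coeffs_quad = np.polyfit(x, y, deg=2)
y_quad = np.polyval(coeffs_quad, x)

r2_line = r2_score(y, y_line)
r2_quad = r2_score(y, y_quad)
print(f"R^2 straight line: {r2_line:.3f}")
print(f"R^2 quadratic:     {r2_quad:.3f}")
```

On this symmetric interval the least-squares line is essentially flat, so its R^2 is close to 0, while the quadratic fit reaches an R^2 of essentially 1.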
However, if we consider a polynomial regression with a high degree, let’s say for example 22,
we will obtain a model that fits the training data extremely well but is not capable of
generalizing its predictions.
                                 Figure 3: Overfitting example
As said before, the model performs extremely well on training data. As a result, if we now
consider a new data point that differs slightly from the training data, the error will be quite
high due to overfitting. Thus, our main goal when training a model should be
generalization.
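The gap between training and test error can be sketched as follows. The sample sizes, the Gaussian noise level, and the random seed are assumptions chosen for illustration; only the target y = x^2 and the degree 22 come from the text above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice


def make_data(n):
    """Sample n noisy points from y = x^2 on [-3, 3] (noise level assumed)."""
    x = rng.uniform(-3.0, 3.0, size=n)
    y = x ** 2 + rng.normal(scale=1.0, size=n)
    return x, y


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

# Fit a well-matched model (degree 2) and an overly flexible one (degree 22)
# on the same training set, then compare training and test error.
results = {}
for degree in (2, 22):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = mse(y_train, model(x_train))
    test_mse = mse(y_test, model(x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The degree-22 polynomial should show the lower training error of the two but the higher test error, which is the signature of overfitting described above.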
These examples can be found in my GitHub profile:
https://github.com/aitorlarrinoa/data-science-concepts/blob/main/notebooks/01_overfitting_underfitting.ipynb
4    Avoid overfitting and underfitting
As seen, overfitting and underfitting can cause serious problems when creating a machine
learning model. Thus, we need to control them. We are going to talk about different generic
considerations we can take in order to have underfitting and overfitting under control:
    • More data. Training with few data points can cause overfitting.
    • Reduce model complexity. Less is more: a very complex model can lead to
      overfitting.
    • Feature engineering. Poor feature engineering can lead to underfitting, which is why
      this is one of the most important considerations in a data science project.
    • Cross-validation. Techniques like k-fold cross-validation help evaluate the model’s
      performance on unseen data, making it easier to detect overfitting or underfitting
      during training.
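As a rough illustration of the cross-validation point, the sketch below implements k-fold cross-validation by hand with NumPy and uses it to compare polynomial degrees on data generated from y = x^2. The dataset, the number of folds (5), and the candidate degrees are assumptions made for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

# Assumed data: 60 noisy samples of y = x^2 on [-3, 3].
x = rng.uniform(-3.0, 3.0, size=60)
y = x ** 2 + rng.normal(scale=1.0, size=60)

# Shuffle the indices once and split them into 5 folds, so that every
# candidate model is evaluated on exactly the same splits.
folds = np.array_split(rng.permutation(len(x)), 5)


def cross_val_mse(degree):
    """Average validation MSE of a polynomial of the given degree."""
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th one, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
        residuals = y[val_idx] - model(x[val_idx])
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))


cv_scores = {degree: cross_val_mse(degree) for degree in (1, 2, 22)}
for degree, score in cv_scores.items():
    print(f"degree {degree:2d}: cross-validated MSE = {score:.3f}")
```

A degree that is too low (1) and a degree that is too high (22) should both score worse than degree 2 here, so the validation error exposes underfitting and overfitting alike.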
In fact, the way to avoid overfitting and underfitting depends on the model we are dealing
with. Let’s dive into different types of models and how we can deal with
these problems in each case:
4.1   Parametric models
Parametric models, such as linear regression and logistic regression, assume a fixed functional
form with a finite number of parameters. Here are some approaches to controlling
underfitting and overfitting in these models:
   • Regularization: Techniques like ridge or lasso regression add a penalty on the size of the
     model coefficients, reducing overfitting.
   • Feature selection: Choose only the most relevant features for the model. Reducing
     irrelevant or highly correlated features can improve generalization.
   • Polynomial features: For models that underfit, consider adding polynomial or
     interaction terms to capture nonlinear relationships in the data. However, ensure the
     degree is not too high to avoid overfitting.
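To make the regularization and polynomial-feature points concrete, here is a small sketch of ridge regression in closed form on degree-10 polynomial features. The data, the degree, and the penalty strength alpha = 1.0 are assumptions for illustration, and the helper fit_ridge is a hypothetical name, not a library function. Lasso has no closed form and is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

# Assumed data: 30 noisy samples of y = x^2 on [-1, 1].
x = rng.uniform(-1.0, 1.0, size=30)
y = x ** 2 + rng.normal(scale=0.1, size=30)

# Design matrix with columns 1, x, x^2, ..., x^10 (deliberately too flexible).
X = np.vander(x, N=11, increasing=True)


def fit_ridge(X, y, alpha):
    """Closed-form ridge: solve (X^T X + alpha * P) w = X^T y.

    P is the identity with a zero in the intercept position, so the
    intercept is not penalized. alpha = 0 gives ordinary least squares.
    """
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + alpha * penalty, X.T @ y)


w_ols = fit_ridge(X, y, alpha=0.0)
w_ridge = fit_ridge(X, y, alpha=1.0)

# Compare the size of the non-intercept coefficients.
norm_ols = float(np.linalg.norm(w_ols[1:]))
norm_ridge = float(np.linalg.norm(w_ridge[1:]))
print(f"coefficient norm without penalty: {norm_ols:.3f}")
print(f"coefficient norm with ridge:      {norm_ridge:.3f}")
```

The penalty shrinks the coefficients, which limits how wildly the high-degree terms can bend the curve between training points. In practice alpha would be chosen by cross-validation rather than fixed by hand.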
4.2   Tree based algorithms
Decision trees, random forests, XGBoost, etc. are examples of tree based algorithms. These
types of algorithms tend to overfit. Let’s see what can be done with these types of models:
   • Ensemble methods: Models like random forests and gradient boosting combine multiple
     trees to improve generalization. Use techniques like bagging (random forests) or
     boosting to balance bias and variance.
   • Hyperparameters: Hyperparameters such as max_depth, min_samples_split, and
     min_samples_leaf limit how far a tree can grow and can help us with overfitting and
     underfitting.
   • Feature importance: Selecting the features that contribute most to the model’s predictions
     can help those predictions generalize and avoid overfitting.
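The effect of these hyperparameters can be sketched with scikit-learn's DecisionTreeRegressor, assuming scikit-learn is installed. The dataset and the specific values (max_depth=4, min_samples_split=10, min_samples_leaf=5) are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice


def make_data(n):
    """Sample n noisy points from y = x^2 on [-3, 3] (noise level assumed)."""
    x = rng.uniform(-3.0, 3.0, size=(n, 1))
    y = x[:, 0] ** 2 + rng.normal(scale=1.0, size=n)
    return x, y


x_train, y_train = make_data(200)
x_test, y_test = make_data(500)

# A tree with default settings keeps splitting until it memorizes the data.
unconstrained = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

# Limiting depth and leaf size forces the tree to learn coarser patterns.
constrained = DecisionTreeRegressor(
    max_depth=4,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
).fit(x_train, y_train)

# score() returns R^2; the train/test gap is a rough overfitting indicator.
gaps = {}
for name, tree in (("unconstrained", unconstrained), ("constrained", constrained)):
    train_r2 = tree.score(x_train, y_train)
    test_r2 = tree.score(x_test, y_test)
    gaps[name] = train_r2 - test_r2
    print(f"{name:13s}: depth = {tree.get_depth()}, "
          f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The unconstrained tree should fit the training set almost perfectly and show a wider gap between training and test scores than the constrained one.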
4.3   Neural networks
Neural networks are among the most complex models within machine learning and AI. Thus,
these models are prone to overfitting when the architecture is too complex or the dataset is
small. Consider the following tips:
   • Dropout: Randomly deactivate a proportion of neurons during training to improve
     generalization.
   • Early stopping: Stop the training process once the validation loss stops improving,
     preventing the network from overfitting.
   • Regularization: Apply regularization to penalize large weights.
   • Data augmentation: Normally used when dealing with images. The idea is to artificially
     increase the size of the training dataset by applying transformations such as rotations,
     flips, or noise to the input data. Thus, we will get more data for free.
   • Architecture tuning: Reduce the number of layers or neurons if the network is too large
     for the dataset. This is the principal approach when we have an overfitted neural
     network.
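Two of the tips above, dropout and early stopping, can be sketched without a deep learning framework. The functions dropout and early_stopping_epoch are hypothetical helpers written for this illustration; real frameworks ship their own versions, and the patience value and the validation losses below are made up.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training.

    Surviving units are scaled by 1 / (1 - rate) so that the expected
    activation is unchanged; at inference time the input is returned as is.
    """
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train until the end


rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
hidden = np.ones((1000, 100))   # stand-in for a layer's activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(f"fraction of units dropped: {np.mean(dropped == 0.0):.3f}")

# Made-up validation losses: best value at epoch 2, then no improvement.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"training stops at epoch {stop_epoch}")
```

With these losses the best value occurs at epoch 2, and after three epochs without improvement training stops at epoch 5; in practice the weights from the best epoch are usually restored.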