Loss Functions in Machine Learning
Submitted by:
          Sumaira Rasool (Ph.D Scholar)
          Department of Computer Science
          University of Peshawar.
Supervised by:
             Dr. Muhammad Naeem
Outline
• Introduction to loss functions
• Categories of loss functions
  – Regression losses
  – Classification losses
                      prepared by Sumaira Rasool   2
Introduction to Loss Functions
• Machines learn by means of a loss function.
• A loss function measures how well a prediction
  model predicts the expected outcome.
• It is a method of evaluating how well a specific
  algorithm models the given data.
• If the predictions deviate too much from the actual
  results, the loss function returns a very large
  number.
Categories of loss functions
• Regression Losses
  – Regression deals with predicting a continuous value.
      » For example, given the floor area, number of rooms, and size of rooms,
        predict the price of the room.
• Classification Losses
  – In classification, we try to predict an output from a finite set of
    categorical values.
      » For example, given a large data set of images of handwritten digits,
        categorize each image as one of the digits 0–9.
            Regression Losses
Loss functions for regression include:
• Mean Square Error (MSE) / Quadratic loss
• Mean Absolute Error (MAE)
• Mean Square Percentage Error (MSPE)
• Mean Squared Logarithmic Error (MSLE)
Mean Square Error (MSE) / Quadratic loss
• Mean Square Error (MSE) is the most
  commonly used regression loss function.
• MSE is the average of the squared differences
  between the target values and the predicted
  values.
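The definition above can be sketched in a few lines of plain Python. The function name and the sample numbers are illustrative assumptions, not part of the original slides.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical targets and predictions, chosen only for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the mean is 5/3.
mse_value = mean_squared_error(y_true, y_pred)
print(mse_value)
```

Because each error is squared, the prediction that is off by 2 contributes four times as much as the one that is off by 1.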
Example: Regression Analysis
• A technique concerned with predicting some
  variables by knowing others.
• The process of predicting variable Y using
  variable X.
• Calculates the “best-fit” line for a certain set of
  data.
• The regression line makes the sum of the
  squares of the residuals smaller than for any
  other line. Regression minimizes residuals.
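As a minimal sketch of the “best-fit” idea, the closed-form least-squares line for one predictor can be computed directly. The line is written as Y = bX + a; the data points and helper names below are hypothetical and serve only to show that moving the fitted line in any direction increases the sum of squared residuals.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least-squares line Y = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return b, a

def sum_squared_residuals(xs, ys, b, a):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly following Y = 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b, a = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b, a)

# Any other line gives a larger sum of squared residuals.
steeper = sum_squared_residuals(xs, ys, b + 0.1, a)
shifted = sum_squared_residuals(xs, ys, b, a - 0.5)
print(b, a, best, steeper, shifted)
```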
Linear Equations
Y = bX + a
b = slope (change in Y / change in X)
a = Y-intercept
Hours studying and grades
Regressing grades on hours
[Figure: Linear regression of final grade in course on number of hours spent studying. Fitted line: Final grade in course = 59.95 + 3.17 * study; R-Square = 0.88]
Predicted final grade in class =
59.95 + 3.17 * (number of hours you study per week)
Predict the final grade of…
  • Someone who studies for 12 hours:
      – Final grade = 59.95 + (3.17 * 12)
      – Final grade = 97.99
  • Someone who studies for 1 hour:
      – Final grade = 59.95 + (3.17 * 1)
      – Final grade = 63.12
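The two predictions can be reproduced with a short function that applies the fitted line. The intercept 59.95 and slope 3.17 come from the regression above; the function and constant names are illustrative.

```python
INTERCEPT = 59.95  # a, the Y-intercept of the fitted line
SLOPE = 3.17       # b, grade points gained per hour of study

def predict_final_grade(hours_of_study):
    """Apply the fitted line: grade = 59.95 + 3.17 * hours."""
    return INTERCEPT + SLOPE * hours_of_study

grade_12_hours = predict_final_grade(12)
grade_1_hour = predict_final_grade(1)
print(round(grade_12_hours, 2), round(grade_1_hour, 2))
```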
• Gradient descent is a general algorithm for
  minimizing a function, in this case the Mean
  Squared Error cost function.
• However, the squared loss tends to
  penalize outliers heavily, which can lead to slower
  convergence.
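A minimal sketch of gradient descent on the MSE cost for the line Y = bX + a follows. The learning rate, step count, starting point, and data are assumptions chosen so that this small example converges; they are not values from the slides.

```python
def gradient_descent(xs, ys, learning_rate=0.02, steps=5000):
    """Minimize the MSE of Y = bX + a by repeated gradient steps."""
    b, a = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current line.
        errors = [(b * x + a) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to b and a.
        grad_b = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_a = (2.0 / n) * sum(errors)
        # Move against the gradient.
        b -= learning_rate * grad_b
        a -= learning_rate * grad_a
    return b, a

# Hypothetical data, roughly following Y = 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b_gd, a_gd = gradient_descent(xs, ys)
print(b_gd, a_gd)
```

For this data the iterates approach the closed-form least-squares solution (slope about 1.99, intercept about 0.05). A learning rate that is too large would make the updates diverge instead.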
Mean Absolute Error (MAE)
• Mean Absolute Error (MAE) is the average of the
  absolute differences between the target and
  predicted values.
• MAE is more robust to
  outliers because it does not
  square the errors. However, its
  derivative is not continuous
  (it jumps at zero error), which
  makes it less efficient to optimize
  with gradient-based methods.
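The robustness claim can be illustrated by adding a single outlier to an otherwise well-predicted data set and comparing how MAE and MSE react. All values below are hypothetical.

```python
def mae(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_pred = [11.0, 11.0, 12.0, 12.0]
y_clean = [10.0, 12.0, 11.0, 13.0]    # every prediction is off by exactly 1
y_outlier = [10.0, 12.0, 11.0, 53.0]  # last target is an outlier (error of 41)

mae_clean, mse_clean = mae(y_clean, y_pred), mse(y_clean, y_pred)
mae_out, mse_out = mae(y_outlier, y_pred), mse(y_outlier, y_pred)

# MAE grows from 1 to 11, while MSE grows from 1 to 421.
print(mae_clean, mae_out, mse_clean, mse_out)
```

One outlier multiplies the MAE by 11 but the MSE by 421, which is why a model trained on MSE is pulled much harder toward extreme points.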
Mean Square Percentage Error (MSPE)
• Also called a weighted version of MSE.
• In MSPE, the squared difference is divided by the
  squared target value, which gives the relative error.
  This division by the squared target can also be
  read as adding a weight to MSE.
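A minimal sketch of MSPE under that reading: each squared error is divided by the squared target before averaging. Some definitions also multiply the result by 100 to express it as a percentage; that scaling is omitted here, and the sample values are hypothetical.

```python
def mspe(y_true, y_pred):
    """Mean of squared relative errors; targets must be non-zero."""
    return sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Plain mean squared error, for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Both predictions are 10% too high, but on very different scales.
y_true = [100.0, 10.0]
y_pred = [110.0, 11.0]

mspe_value = mspe(y_true, y_pred)  # each term is 0.1 ** 2
mse_value = mse(y_true, y_pred)    # dominated by the larger target
print(mspe_value, mse_value)
```

MSE is driven almost entirely by the larger target, while MSPE weights both points equally because their relative errors are the same.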
Mean Squared Logarithmic Error (MSLE)
• MSLE is, as the name suggests, a variation of
  the Mean Squared Error.
• The loss is the mean, over the observed data, of the
  squared differences between the log-transformed
  true and predicted values. This
  loss can be interpreted as a measure of the
  ratio between the actual and predicted values.
• Formula:-
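A common definition of MSLE, assumed here because it matches what Keras's mean_squared_logarithmic_error computes, adds 1 to both values before taking the logarithm: MSLE = (1/n) Σ (log(y_i + 1) − log(ŷ_i + 1))². The sketch below implements that definition for non-negative values; the function name and sample values are illustrative.

```python
import math

def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + y) of targets and predictions."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

# The same 10% overestimate at two very different scales (hypothetical values).
small_scale = msle([10.0], [11.0])
large_scale = msle([10000.0], [11000.0])
print(small_scale, large_scale)
```

The two losses are close to each other even though the absolute errors are 1 and 1000, because the logarithm turns the comparison into one of ratios.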
MSLE
• Because of the logarithm, MSLE only cares
  about the relative difference between the true and the
  predicted value, or in other words, the
  percentage difference between them. This means
  that MSLE treats small differences between small
  true and predicted values approximately the same as
  big differences between large true and predicted values.
• It can be used when you don’t want to penalize huge
  differences when both values are huge numbers.
• It can also be used when you want to penalize
  underestimates more than overestimates.
Example
(Pi = predicted value, Ai = actual value; with a single sample, RMSLE = |ln(Pi) − ln(Ai)|)
• Case a): Pi = 600, Ai = 1000
   RMSE = 400, RMSLE = 0.5108
• Case b): Pi = 1400, Ai = 1000
   RMSE = 400, RMSLE = 0.3365
• The absolute difference between actual and
  predicted is the same in both
  cases. RMSE treats them equally, whereas
  RMSLE penalizes the underestimate more
  than the overestimate.
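The figures in this example can be checked with a short script. It assumes the natural logarithm with no +1 offset, which is what reproduces the quoted values; the function names are illustrative.

```python
import math

def rmse_single(predicted, actual):
    """RMSE for a single sample reduces to the absolute error."""
    return math.sqrt((predicted - actual) ** 2)

def rmsle_single(predicted, actual):
    """RMSLE for a single sample, using plain natural logs (assumption)."""
    return math.sqrt((math.log(predicted) - math.log(actual)) ** 2)

# Case a: underestimate. Case b: overestimate by the same absolute amount.
rmse_under, rmsle_under = rmse_single(600, 1000), rmsle_single(600, 1000)
rmse_over, rmsle_over = rmse_single(1400, 1000), rmsle_single(1400, 1000)

print(rmse_under, round(rmsle_under, 4))
print(rmse_over, round(rmsle_over, 4))
```

The underestimate is penalized more because 600/1000 is further from 1 on a log scale than 1400/1000 is.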
              Classification Losses
• Loss functions for classification include:
   – Log loss/Binary Cross Entropy Loss
   – Negative Log Likelihood
   – Hinge Loss
   Log loss/Binary Cross Entropy
• The log loss score is a kind of penalty for the
  classification. Log loss penalizes very bad
  predictions heavily (expect a higher score).
• Minimizing log loss usually improves accuracy, although
  the two are not equivalent: log loss also depends on
  how confident each prediction is.
• Log loss returns high values for bad predictions
  and low values for good predictions.
Log loss/Binary cross entropy
• The goal of our machine learning models is
  to minimize this value.
• A perfect model would have a log loss of 0.
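• Formula:
  LogLoss = −(1/N) Σ [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ],
  where y_i is the true label (0 or 1) and p_i is the predicted
  probability that y_i = 1.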
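A minimal sketch of binary cross entropy, averaging −[y log(p) + (1 − y) log(1 − p)] over the samples, where y is the 0/1 label and p the predicted probability of class 1. The clipping constant eps is an assumption added to avoid log(0); the labels and probabilities are hypothetical.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Average log loss over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])     # confident and right
bad = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])      # confident and wrong
perfect = binary_cross_entropy(labels, [1.0, 0.0, 1.0, 0.0])  # effectively 0
print(good, bad, perfect)
```

Good predictions give a low value, confidently wrong predictions give a high value, and a perfect model gives a loss of (numerically almost) 0.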
Negative Log-Likelihood (NLL)
• This is a widely used loss function in neural networks.
• It measures how well a classifier's predicted probabilities
  match the true classes.
• It is used when the model outputs a probability for
  each class rather than just the most likely class.
• In practice, the softmax function is used in tandem
  with the negative log-likelihood (NLL). This loss
  function is very interesting if we interpret it in relation
  to the behavior of softmax.
• Formula:
  L(y) = −log(y), where y is the predicted probability of the correct class.
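A minimal sketch of softmax followed by the negative log-likelihood of the correct class. The raw scores (logits) and function names are hypothetical; subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(probs, true_class):
    """L(y) = -log(y), with y the probability assigned to the correct class."""
    return -math.log(probs[true_class])

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)

loss_if_class_0 = negative_log_likelihood(probs, 0)  # model's top class is correct
loss_if_class_2 = negative_log_likelihood(probs, 2)  # correct class got low probability
print(probs, loss_if_class_0, loss_if_class_2)
```

The loss is small when the correct class receives a high probability and grows quickly as that probability approaches 0.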
Negative Log-Likelihood (NLL)
• When training a model, we try to find the minimum of a
  loss function with respect to a set of parameters (in a neural
  network, these are the weights and biases). We can
  interpret the loss as the “unhappiness” of the network
  with respect to its parameters. The higher the loss, the
  higher the unhappiness, and we don’t want that. We
  actually want to make our model happy.
• So if we are using the negative log-likelihood as our
  loss function, the question is when it
  becomes unhappy and when it becomes happy.
In the figure shown below, the loss approaches infinity
as the input approaches 0, and the loss reaches 0 when the input is 1.
                   Hinge Loss
• Hinge loss is used for maximum-margin classification,
  most notably for support vector machines.
• The margin is the distance from the separating line to
  the closest points of each class.
• A good margin is one where this separation is large
  for both classes. The images below give a visual
  example of a good and a bad margin. A good margin
  allows the points to stay in their respective classes
  without crossing into the other class.
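Hinge Loss
• Formula:
  SVMLoss = max(0, 1 − y·f(x)), where y is the true label (−1 or +1)
  and f(x) is the raw output of the classifier.
• Although it is not differentiable everywhere (there is a kink at
  y·f(x) = 1), it is a convex function, which makes it easy to work
  with the usual convex optimizers used in the machine learning domain.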
• Correctly classified points lying outside the
  margin boundaries of the support vectors are
  not penalized, whereas points within the
  margin boundaries or on the wrong side of the
  hyperplane are penalized linearly, in proportion
  to their distance from the correct
  boundary.
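The three situations described above can be sketched with the hinge formula max(0, 1 − y·f(x)), assuming labels of −1 or +1 and a raw classifier score f(x). The scores below are hypothetical.

```python
def hinge_loss(y, score):
    """max(0, 1 - y * f(x)) for a label y in {-1, +1} and raw score f(x)."""
    return max(0.0, 1.0 - y * score)

# All three points belong to the positive class (y = +1).
outside_margin = hinge_loss(+1, 2.5)  # correct and beyond the margin: no penalty
inside_margin = hinge_loss(+1, 0.5)   # correct side, but within the margin
wrong_side = hinge_loss(+1, -1.0)     # on the wrong side of the hyperplane

print(outside_margin, inside_margin, wrong_side)
```

The penalty is zero once y·f(x) reaches 1 and grows linearly as the point moves further toward or past the decision boundary.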
[Figure: Optimal hyperplane]
Loss functions in Keras
1) mean_squared_error
  keras.losses.mean_squared_error(y_true, y_pred)
2) mean_absolute_error
  keras.losses.mean_absolute_error(y_true, y_pred)
3) mean_absolute_percentage_error
  keras.losses.mean_absolute_percentage_error(y_true, y_pred)
4) mean_squared_logarithmic_error
  keras.losses.mean_squared_logarithmic_error(y_true, y_pred)
5) hinge
  keras.losses.hinge(y_true, y_pred)
6) categorical_crossentropy
  keras.losses.categorical_crossentropy(y_true, y_pred)
7) binary_crossentropy
  keras.losses.binary_crossentropy(y_true, y_pred)