Machine Learning Essentials
My github: https://github.com/robinyUArizona
(6) False Negative Rate: $\frac{FN}{TP + FN} = 1 - \text{Recall}$. Fraction of positives wrongly classified as negative. Probability: $P[D = 0 \mid Y = 1]$
As we can see, the training curve looks OK, but the validation curve moves noisily around the training curve. It could be that the validation data is scarce and not very representative of the training data, so the model struggles to model these examples.

(1) Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$ → Ratio of correct predictions over total predictions. Estimate of $P[D = Y]$, the probability that the decision equals the outcome.
(2) Recall or Sensitivity or True Positive Rate: $\frac{TP}{TP + FN}$. Completeness of the model. → Out of all actual positive (1) values, how often the classifier is correct. Probability: $P[D = 1 \mid Y = 1]$
Example: "Fraudulent transaction detector" or "Cancer detection" → +ve (1) is "fraud": optimize for sensitivity because false positives (FP: normal transactions that are flagged as possible fraud) are more acceptable than false negatives (FN: fraudulent transactions that are not detected).

Note: We can think of the ROC plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).
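As a rough sketch of the plot described in the note above, the true positive rate (y-axis) and false positive rate (x-axis) can be computed by sweeping a decision threshold over the model scores. The function names `roc_points` and `auc_trapezoid` and the toy labels and scores are illustrative assumptions, not part of the original notes; the area is approximated with the trapezoid rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # An example is predicted positive when its score is >= threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (made up for illustration): 3 positives, 3 negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(round(auc, 4))
```

This quadratic-time sweep is only meant to make the axes concrete; it assumes both classes are present so neither rate divides by zero.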
Here, the validation loss is much better than the training loss, which suggests the validation dataset is easier to predict than the training dataset. One explanation is that the validation data is scarce but widely represented by the training dataset, so the model performs extremely well on these few examples.

Model Evaluation

Classification Problems

Confusion Matrix
• The data gives us outcomes ("truth") (y|Y)
• The model makes decisions (d|D) (saving ŷ scores)
Then, we compare decisions (d) to outcomes (y)
Type I error: The null hypothesis H0 is rejected when it is true.
Type II error: The null hypothesis H0 is not rejected when it is false.
→ False positive (Type I error): incorrectly decide yes
→ False negative (Type II error): incorrectly decide no

(3) Precision: $\frac{TP}{TP + FP}$. Exactness of the model. → Out of all predicted positive (1) values, how often the classifier is correct. Probability: $P[Y = 1 \mid D = 1]$. If our model says positive, how likely it is to be correct in that judgement.
Example: "Spam Filter": +ve (1) class is spam → optimize for precision or specificity because false negatives (FN: spam goes to the inbox) are more acceptable than false positives (FP: non-spam is caught by the spam filter).
Example: "Hotel booking cancelled": +ve (1) class is isCancelled → optimize for precision or specificity because false negatives (FN: isCancelled labeled as "not cancelled", 0) are more acceptable than false positives (FP: isnotCancelled labeled as "cancelled", 1).

(4) F1-Score $= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ → False positives (FP) and false negatives (FN) are equally important.

(10) AUC: Area Under the ROC Curve. The points on an ROC curve can be computed with an efficient, sorting-based algorithm. AUC ranges in value from 0 to 1 and measures how likely the model is to differentiate positives from negatives (perfect AUC = 1, baseline = 0.5).
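The confusion-matrix metrics in this section (accuracy, recall, precision, F1 and false negative rate) can be sketched directly from the four counts. The function name `classification_metrics` and the example counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute basic metrics from confusion-matrix counts (assumes non-zero denominators)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # sensitivity / true positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    fnr = fn / (tp + fn)             # false negative rate = 1 - recall
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "fnr": fnr,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Returning a dictionary keeps the metrics together so they can be compared side by side when deciding whether to favor precision or recall.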
How to choose the threshold for logistic regression? The choice of threshold depends on the relative importance of TPR and FPR for the classification problem. For example, suppose you are building a model to predict customer churn. False negatives (not identifying customers who will churn) might lead to loss of revenue, making TPR crucial. In contrast, falsely predicting churn (false positives) could lead to unnecessary retention efforts, making FPR important. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

Precision-Recall curve: Focuses on the correct prediction of the minority class, useful when data is imbalanced. Plot precision against recall at different thresholds.

Variance, $R^2$ and the Sum of Squares
The total sum of squares: $SS_{total} = \sum_i (y_i - \bar{y})^2$
This scales with variance: $\mathrm{var}(Y) = \frac{1}{n} \sum_i (y_i - \bar{y})^2$
The regression sum of squares: $SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2$ → $n \mathrm{Var}(\text{predictions})$
The residual sum of squares (squared error): $SS_{resid} = \sum_i (y_i - \hat{y}_i)^2$ → $n \mathrm{Var}(\epsilon)$
Note: $\bar{\epsilon} = 0$, $E[\hat{y}] = \bar{y}$
$$SS_{total} = SS_{reg} + SS_{resid}$$
$$R^2 = 1 - \frac{SS_{resid}}{SS_{total}} = \frac{SS_{reg}}{SS_{total}} = \frac{n \mathrm{Var}(\text{Preds})}{n \mathrm{Var}(Y)} = \frac{\mathrm{Var}(\text{Preds})}{\mathrm{Var}(Y)}$$
The fraction of variance explained!

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, the optimizer may settle in a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
→ Time Complexity: $O(kn^2)$ → n is the number of data points.
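Procedure:
   1. Initialization: $\theta = 0$ (coefficients set to 0 or random values)
   2. Calculate cost: $J(\theta)$ = evaluate $f(\text{coefficients})$
   3. Gradient of cost: $\frac{\partial}{\partial \theta_j} J(\theta)$, which tells us the uphill direction
   4. Update coefficients: $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$, so we go downhill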
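The four steps of the procedure can be sketched for a one-feature linear model with MSE as the cost. This is a minimal illustration under assumptions not in the notes: the function name `gradient_descent`, the learning rate `alpha=0.05`, the iteration count and the toy data are all hypothetical choices.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Batch gradient descent for y ≈ theta0 + theta1 * x, minimizing MSE."""
    theta0, theta1 = 0.0, 0.0                      # step 1: initialize coefficients to 0
    n = len(xs)
    history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / n      # step 2: cost J(theta) = MSE
        history.append(cost)
        grad0 = 2 * sum(errors) / n                # step 3: partial derivatives of J
        grad1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        theta0 -= alpha * grad0                    # step 4: move against the gradient
        theta1 -= alpha * grad1
    return theta0, theta1, history


# Toy data generated from y = 1 + 2x (no noise), for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1, history = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
print(history[0], history[-1])
```

The `history` list records the cost at each iteration, which can be plotted to judge whether the learning rate is too large (cost diverges) or too small (cost decreases slowly).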
Tips:
   • Change the learning rate α ("size of jump" at each iteration)
   • Plot cost vs. time to assess learning rate performance
   • Rescale the input variables
   • Reduce passes through the training set with SGD
   • Average over 10 or more updates to observe the learning trend while using SGD

Batch Gradient Descent sums or averages the cost over all the observations.
Stochastic Gradient Descent applies the parameter update for each observation.
→ Time Complexity: $O(km^2)$ → m is the sample of data selected randomly from the entire data of size n

Optimization
Almost every machine learning method has an optimization algorithm at its core.
→ Hypothesis: The hypothesis is denoted $h_\theta$ and is the model that we choose. For a given input $x^{(i)}$, the model prediction output is $h_\theta(x^{(i)})$.
→ Loss function: $L : (z, y) \in \mathbb{R} \times Y \mapsto L(z, y) \in \mathbb{R}$ takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The loss function computes the distance or difference between the current output z of the algorithm and the expected output y. The common loss functions are summed up in the table below:

   Least squared error        Logistic loss             Hinge loss
   $\frac{1}{2}(y - z)^2$     $\log(1 + \exp(-yz))$     $\max(0, 1 - yz)$
   Linear Regression          Logistic Regression       SVM

→ Cost function: The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
$$J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$$

   2. Root Mean Squared Error: $RMSE = \sqrt{\frac{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}{N}}$
   3. Mean Absolute Error: $MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$
   4. Sum of Squared Error: $SSE = \sum_i (y_i - \hat{y}_i)^2$
   5. Total Sum of Squares: $SST = \sum_i (y_i - \bar{y})^2$
   6. $R^2$ (coefficient of determination): $R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})}$, equivalently $R^2 = 1 - \frac{SSE}{SST}$
        The proportion of explained y-variability. Negative $R^2$ means the model is worse than just predicting the mean. $R^2$ is not valid for nonlinear models as $SS_{reg} + SS_{resid} \neq SST$.
   7. Adjusted $R^2$: $R_a^2 = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2)$
        which changes only when predictors (features) affect $R^2$ above what would be expected by chance
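The regression metrics listed above (RMSE, MAE, SSE, SST, $R^2$ and adjusted $R^2$) can be sketched in a few lines. The function name `regression_metrics`, the argument `k` (number of predictors) and the toy values are hypothetical; the sketch assumes the targets are not all identical (so SST is non-zero) and that n > k + 1.

```python
import math


def regression_metrics(y_true, y_pred, k):
    """Compute common regression metrics; k is the number of predictors."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # sum of squared errors
    sst = sum((y - mean_y) ** 2 for y in y_true)              # total sum of squares
    rmse = math.sqrt(sse / n)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    r2 = 1 - sse / sst
    adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
    return {"sse": sse, "sst": sst, "rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}


# Made-up targets and predictions for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.5]

m = regression_metrics(y_true, y_pred, k=1)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because adjusted $R^2$ multiplies the unexplained fraction by (n − 1)/(n − k − 1), it is never larger than $R^2$ when k ≥ 1 and $R^2$ ≤ 1.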
Ordinary Least Squares
Least Squares Regression
We fit linear models:
$$\hat{y} = \beta_0 + \sum_j \beta_j x_j$$
Here, $\beta_j$ is the j-th coefficient and $x_j$ is the j-th feature.
Ordinary Least Squares: find $\vec{\beta}$ that minimizes squared error:
$$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$$

Terms in $P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$:
    • $y_i \in \{0, 1\}$ is outcome
    • $\hat{y}_i^{y_i}$ is $\hat{y}_i$ if $y_i = 1$, and 1 if $y_i = 0$; multiplicative if

Conditioning on Parameters
Fuller definition: condition on parameters $\vec{\beta}$ and write function:
$$P(Y = 1 \mid x, \vec{\beta}) = \hat{y} = m(x, \vec{\beta}) = \mathrm{logistic}(\ldots)$$
Likelihood Function
Given data $y = \langle y_1, \ldots, y_n \rangle$, $x = \langle x_1, \ldots, x_n \rangle$ and parameters $\vec{\beta}$:
$$\mathrm{Likelihood}(y, x, \vec{\beta}) = P(y, x \mid \vec{\beta}) \propto P(y \mid x, \vec{\beta}) = \prod_i P(y_i \mid x_i, \vec{\beta})$$

Linear Algorithms
Regression
→ Regression predicts (or estimates) a continuous variable
Dependent variable Y, independent variable(s) X
→ compute estimate $\hat{y} \approx y$
$$\hat{y}_i = \beta_0 + \beta_1 x_i$$
$$y_i = \hat{y}_i + \epsilon_i$$
Here, $\beta_0$ is the intercept, $\beta_1$ is the slope and $\epsilon$ is the residual. The goal is to learn $\beta_0$, $\beta_1$ to minimize $\sum \epsilon_i^2$ (least squares)
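For the one-feature model $\hat{y}_i = \beta_0 + \beta_1 x_i$, the least-squares coefficients have a well-known closed form: the slope is the ratio of the x-y co-deviation sum to the x deviation sum, and the intercept makes the line pass through the means. The sketch below uses that closed form; the function name `fit_simple_ols` and the toy data are illustrative assumptions, and the x values are assumed not to be all equal.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y ≈ beta0 + beta1 * x (assumes xs are not all equal)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                   # slope
    beta0 = mean_y - beta1 * mean_x     # intercept: line passes through (mean_x, mean_y)
    return beta0, beta1


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

beta0, beta1 = fit_simple_ols(xs, ys)
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
print(round(beta0, 4), round(beta1, 4))
print(round(sum(residuals), 10))
```

The residuals of a least-squares fit with an intercept sum to zero, which is a quick sanity check on the fitted coefficients.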
Goal: least-squares solution to $X \vec{\beta} = y$
Solution: solve the normal equations:
$$X^T X \vec{\beta} = X^T y \rightarrow \vec{\beta} = (X^T X)^{-1} X^T y$$
$$L(\vec{\beta} \mid X, y)$$
General Optimization:
   1. Understand data (features and outcome variables)
   2. Define loss (or gain/utility) function
   3. Define predictive model
   4. Search for parameters that minimize loss function
Augmented Loss
We can add more things to the loss function
    • Penalize model complexity
    • Penalize "strong" beliefs
         – Requires predictive utility to overcome them
→ Least squares generalizes into minimizing loss functions.
→ This is the heart of machine learning, particularly supervised learning.

Linearity: A linear equation of k + 1 variables is of the form:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
It is the sum of scalar multiples of the individual variables: a line!
→ Linear models are remarkably capable of transforming many non-linear problems into linear.

Linear Regression
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$
$$\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, \ldots, n$$
Here, n is the total number of observations, $y_i$ is the dependent variable, and $x_{ij}$ is the explanatory variable for the j-th feature of the i-th observation. $\beta_0$ is the intercept, usually called the bias coefficient.
Assumptions:
→ Linear models make four key assumptions necessary for inferential validity.
• Linearity: outcome y and predictor X have a linear relationship.
• Independence: observations are independent of each other
- Independent variables (features) are not highly correlated with each other → Low multicollinearity
• Normal errors: residuals are normally distributed; check with Q-Q plots. Violation means the line (in Q-Q plots) still fits but p-values and CIs are unreliable
• Equal variance: residuals have constant variance (called homoskedasticity; violation is heteroskedasticity); check a scatterplot or regplot of residuals vs. fitted values. Violation means the model is failing to capture a systematic effect.
→ These violations are a problem only for inference, not for prediction

The likelihood proportionality looks weird at first:
$$P(y, x \mid \vec{\beta}) \propto P(y \mid x, \vec{\beta})$$
$$P(y, x \mid \vec{\beta}) = P(y \mid x, \vec{\beta}) P(x \mid \vec{\beta})$$
But x is independent of the parameters, so $P(x \mid \vec{\beta}) = P(x)$. And x is fixed, so P(x) is an (unknown) constant.
$$\mathrm{LogLik}(y, x, \vec{\beta}) = \log \mathrm{Likelihood}(y, x, \vec{\beta})$$
$$= \log \left[ P(x) \prod_i P(y_i \mid x_i, \vec{\beta}) \right]$$
$$= \log P(x) + \sum_i \log P(y_i \mid x_i, \vec{\beta})$$
Maximum Likelihood Estimator
$$\arg\max_{\vec{\beta}} \sum_i \log P(y_i \mid x_i, \vec{\beta})$$
$$P(Y = y_i \mid X = x_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$
$$\log P(Y = y_i \mid X = x_i) = y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)$$
Model log likelihood is the sum over training data. Applicable to any model where $\hat{y} = P(Y = 1 \mid x)$
Likelihood and Posterior
$$P(\theta \mid y) = \frac{P(y \mid \theta) P(\theta)}{P(y)}$$
→ Logistic regression is trained by maximizing the log likelihood of the training data given the model
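The per-observation log probability $y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)$, summed over the training data, gives the model log likelihood that logistic regression maximizes. A minimal sketch for a one-feature model follows; the function names `logistic` and `log_likelihood`, the coefficient values and the toy data are hypothetical and chosen only to show that a better-fitting slope yields a higher log likelihood.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0, beta1, xs, ys):
    """Sum of y*log(y_hat) + (1 - y)*log(1 - y_hat) over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = logistic(beta0 + beta1 * x)   # model's estimate of P(Y = 1 | X = x)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return total


# Made-up data for illustration only: larger x tends to go with y = 1.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 0, 1]

ll_flat = log_likelihood(0.0, 0.0, xs, ys)    # every y_hat is 0.5
ll_slope = log_likelihood(0.0, 1.0, xs, ys)   # positive slope follows the trend
print(round(ll_flat, 4), round(ll_slope, 4))
```

A maximum likelihood fit would search over `beta0` and `beta1` for the pair with the largest value of this function, for example with gradient ascent.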
Maximum Likelihood Estimation
MLE is used to find the estimators that minimized the likelihood
function: L(θ|x) = fθ (x) density function of the data distribution
Log Likelihood
Logistic Regression:
$$P(Y = 1 \mid X = x) = \hat{y} = \operatorname{logistic}\Big(\beta_0 + \sum_j \beta_j x_j\Big)$$
The model computes the probability of yes.
Probability of Observed
What if we want $P(Y = y_i)$, regardless of whether $y_i$ is 1 or 0?
$$P(Y = y_i \mid X = x_i) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$$
    • $\hat{y}_i$ is the model's estimate of $P(Y = 1 \mid X = x_i)$
Variance Inflation Factor: Measures the severity of multicollinearity
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is found by regressing $X_i$ against all other variables (a common VIF cutoff is 10)
Learning: Estimating the coefficients β from the training data, either iteratively with Gradient Descent or in closed form with Ordinary Least Squares.
Ordinary Least Squares - where we find $\vec{\beta}$ that minimizes squared error:
$$\arg\min_{\vec{\beta}} \sum_i (y_i - \hat{y}_i)^2$$
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents
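The least-squares fit and the Variance Inflation Factor described above can be illustrated with a minimal sketch. It assumes a single predictor (so the OLS solution has a simple closed form) and, for VIF, exactly two predictors; the function names `ols_fit`, `r_squared` and `vif` are hypothetical and chosen only for this example.

```python
from statistics import mean


def ols_fit(xs, ys):
    """Closed-form OLS for one predictor: minimizes sum((y - (b0 + b1 * x)) ** 2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


def r_squared(xs, ys):
    """R^2 of the one-predictor OLS fit of ys on xs."""
    b0, b1 = ols_fit(xs, ys)
    y_bar = mean(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def vif(x_i, x_other):
    """VIF of x_i when x_other is the only other predictor: 1 / (1 - R_i^2).

    Assumption: two predictors only. Perfectly collinear inputs give
    R_i^2 = 1 and a division by zero.
    """
    return 1.0 / (1.0 - r_squared(x_other, x_i))


b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])   # data generated from y = 1 + 2x
vif_uncorrelated = vif([1, 2, 3, 4], [1, -1, -1, 1])
vif_correlated = vif([1, 2, 3, 4], [1, 2, 3, 5])
```

With more than two predictors, $R_i^2$ would come from a multiple regression of $X_i$ on all the others, which this sketch does not cover.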
→ The dimension of the hyperplane of the regression is its complexity.
Variations: There are extensions of Linear Regression training called regularization methods, that aim to reduce the complexity of the models or to address over-fitting in ML. The regularizer is not dependent on the data.
→ In relation to the bias-variance trade-off, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias.
→ Standardize numeric variables when using regularization to ensure that 0 is a neutral value, so a low coefficient means "little effect when deviating from average". Values, and therefore coefficients, are then on the same scale (# of standard deviations), so weight is distributed properly between them.
→ Multicollinearity → correlated predictors. Problem: Which coefficient gets the common effect? To solve: add a regularization term to the loss.
    • Ridge Regression (L2 regularization): where OLS is modified to also minimize the sum of the squared coefficients
$$\sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$
       → Prevents the weights from getting too large (L2 norm). If lambda is very large then it will add too much weight to the penalty and it will lead to under-fit.
$$\lambda \propto \frac{1}{\text{model variance}}$$
    • Lasso Regression (L1 regularization): where OLS is modified to also minimize the sum of the absolute values of the coefficients
$$\sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
       where p is the no. of features (or dimensions), λ ≥ 0 is a tuning parameter to be determined.
       → Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. If lambda is very large, it will make the coefficients zero and hence it will under-fit.
→ L2 is less likely to shrink coefficients to 0. Therefore L1 regularization leads to sparser models.
Data preparation:
- Transform data for linear relationship (ex: log transform for exponential relationship)
- Remove noise such as outliers
Logistic Regression
Log-Odds and Logistics
Odds
The probability of success P(S): 0 ≤ p ≤ 1
→ The odds of success are defined as the ratio of the probability of success over the probability of failure.
The odds of success:
$$\mathrm{Odds}(S) = \frac{P(S)}{P(S^c)} = \frac{P(S)}{1 - P(S)}$$
Ex: Odds(failure) = x → means x:1 against success
Log Odds or logit →
$$\log \mathrm{Odds}(A) = \log \frac{P(A)}{1 - P(A)} = \log P(A) - \log (1 - P(A))$$
Logistic: The inverse of the logit ($\operatorname{logit}^{-1}$):
$$\operatorname{logistic}(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}$$
[Figure: sigmoid or logistic curve.]
→ Odds are another way of representing probabilities.
→ The logistic and logit functions convert between probabilities and log-odds.
Generalized Linear Models (GLMs):
$$\hat{y}_i = g^{-1}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) = g^{-1}\Big(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\Big)$$
Here, g is a link function
• Counts: Poisson regression, log link func
• Binary: Logistic regression, logit link func and $g^{-1}$ is logistic func
→ In logistic regression, a linear output is converted into a probability between 0 and 1 using the sigmoid or logistic function. It is the go-to for binary classification.
$$P(y_i = 1 \mid X) = \hat{y}_i = \operatorname{logistic}\Big(\beta_0 + \sum_j \beta_j x_{ij}\Big), \qquad \operatorname{logistic}(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}$$
The representation below is an equation with binary output, which actually models the probability of default class:
$$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i}} = p(y = 1 \mid X)$$
→ predict value close to 1 for default class and close to 0 for the other class.
Assumptions:
- Linear relationship between X and log-odds of Y
- Observations must be independent of each other
- Low multicollinearity
Note: Coefficients are linearly related to the log-odds, such that a one unit increase in $x_1$ multiplies the odds by $e^{\beta_1}$.
Learning: Learning the logistic regression coefficients is done by:
→ Minimizing the logistic loss function (with labels $y_i \in \{-1, +1\}$)
$$\arg\min_{\vec{\beta}} \sum_i \log\big(1 + \exp(-y_i \vec{\beta} \cdot x_i)\big)$$
→ Maximizing the log likelihood of the training data given the model (with labels $y_i \in \{0, 1\}$)
$$\arg\max_{\vec{\beta}} \sum_i \log P(y_i \mid x_i, \vec{\beta})$$
$$P(Y = y_i \mid X = x_i) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$$
$$\log P(Y = y_i \mid X = x_i) = y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)$$
Model log likelihood is sum over training data. Applicable to any model where $\hat{y} = P(Y = 1 \mid x)$
Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering simplicity
+ Possibility to change cutoff for precision/recall tradeoff
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile
How to choose threshold for the logistic regression? The choice of a threshold depends on the relative importance of TPR and FPR in the classification problem. For example, if your classifier will decide which criminal suspects will receive a death sentence, false positives are very bad (innocents will be killed!). Thus you would choose a threshold that yields a low FPR while keeping a reasonable TPR (so you actually catch some true criminals). If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR−FPR.
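The logistic function and the "maximize TPR−FPR" threshold rule mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `logistic`, `tpr_fpr` and `best_threshold` are hypothetical, labels are 0/1, both classes are present, and candidate thresholds are limited to the observed probabilities.

```python
import math


def logistic(x):
    """Numerically stable logistic: 1 / (1 + e^-x) = e^x / (e^x + 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (e + 1.0)


def tpr_fpr(y_true, probs, threshold):
    """TPR and FPR when predicting 1 for prob >= threshold (both classes must occur)."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < threshold)
    return tp / (tp + fn), fp / (fp + tn)


def best_threshold(y_true, probs):
    """Pick the observed probability that maximizes TPR - FPR."""
    def score(t):
        tpr, fpr = tpr_fpr(y_true, probs, t)
        return tpr - fpr
    return max(sorted(set(probs)), key=score)


# Hypothetical linear outputs (beta_0 + sum_j beta_j * x_j) for four examples.
linear_outputs = [-2.0, -0.5, 0.5, 2.0]
labels = [0, 0, 1, 1]
probs = [logistic(s) for s in linear_outputs]
threshold = best_threshold(labels, probs)
```

When false positives are much more costly than false negatives (as in the example above), the score could instead weight FPR more heavily; equal weighting is just the default the text describes.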
Likelihood and Posterior
$$P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}$$
•   $P(\theta)$ is the prior
•   $P(y \mid \theta)$ is the likelihood: how likely is the data given params θ
•   $P(y) = \int P(y \mid \theta) P(\theta)\, d\theta$ is a scaling factor (constant for fixed y)
•   $P(\theta \mid y)$ is the posterior.
• We're maximizing likelihood (ML estimator)
• Can also maximize posterior (MAP estimator)
   • When prior is constant, they're the same
   • With lots of data, they're almost the same
Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation: LDA representation consists of statistical properties calculated for each class: means and the covariance matrix:
$$\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i \qquad \sigma^2 = \frac{1}{n - K} \sum_{i=1}^{n} (x_i - \mu_{y_i})^2$$
where $n_k$ is the number of instances in class k and K is the number of classes.
LDA assumes Gaussian data and attributes of same $\sigma^2$. Predictions are made using Bayes Theorem:
$$P(y = k \mid X = x) = \frac{P(k) \times P(x \mid k)}{\sum_{l=1}^{K} P(l) \times P(x \mid l)}$$
to obtain a discriminant function (latent variable) for each class k, estimating $P(x \mid k)$ with a Gaussian distribution:
$$D_k(x) = x \times \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))$$
The class with largest discriminant value is the output class.
Variations:
   1. Quadratic DA: Each class uses its own variance estimate
   2. Regularized DA: Regularization into the variance estimate.
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to µ = 0, σ = 1 to have same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent variables as new variables
Usecase example:
- Prediction of customer churn
Nonlinear Algorithms
All Nonlinear Algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require any shape of distribution.
Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data d, assuming that the features of each data point are all independent.
Representation: The representation is based on Bayes Theorem:
$$P(Y \mid d) = \frac{P(d \mid Y) \times P(Y)}{P(d)}$$
$$\max (P(Y \mid d)) = \max (P(d \mid Y) \times P(Y))$$
here, the denominator is not kept as it is only for normalization.
Learning: Training is fast because only probabilities need to be calculated:
$$P(Y) = \frac{\text{instances}_Y}{\text{all instances}} \qquad P(x \mid Y) = \frac{\text{count}(x \wedge Y)}{\text{instances}_Y}$$
Variations: Gaussian Naive Bayes can extend to numerical attributes by assuming a Gaussian distribution. Instead of P(x|h), the mean µ(x) and standard deviation σ are calculated with P(h) during learning, and MAP for prediction is calculated using the Gaussian PDF
$$f(x \mid \mu(x), \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
$$\mu(x) = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu(x))^2}$$
Data preparation:
- Change numerical inputs to categorical (binning) or near Gaussian inputs (remove outliers, log & boxcox transform)
- Other distributions can be used instead of Gaussian
- Log-transform of the probabilities can avoid underflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast because the calculations are simple
+ If the naive assumption holds, it can converge quicker than other models. Can be used on smaller training data.
+ Good for variables with few categories
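The Gaussian Naive Bayes variation described above (class priors plus a per-class mean and variance for each feature, with prediction by the largest posterior) can be sketched in a few lines. The names `fit_gaussian_nb`, `log_gaussian_pdf` and `predict`, and the small variance floor `eps`, are assumptions made for this illustration; log-probabilities are summed instead of multiplying raw probabilities.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate P(Y), and per-class mean/variance of each feature (1/n variance)."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps is an assumed floor that avoids zero variance for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian_pdf(x, mean, var):
    """Log of the Gaussian PDF with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def predict(model, row):
    """Return the class maximizing log P(Y) + sum_j log P(x_j | Y)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian_pdf(x, m, v) for x, m, v in zip(row, means, variances))
    return max(model, key=log_posterior)


# Tiny made-up dataset: two well-separated classes with two features each.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
label = predict(model, [1.1, 2.0])
```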
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique
Support Vector Machines
SVM is a go-to for high performance with little tuning. It places the decision boundary using the extreme points of each class (the points closest to the other class).
In SVM, a hyperplane (or decision boundary: $w^T x - b = 0$) is selected to separate the points in the input variables space by their class, with the largest margin. The closest datapoints (defining the margin) are called the support vectors.
→ The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.
The first term is the regularization term, which is a technique to avoid overfitting by penalizing large coefficients in the solution vector. The second term, hinge loss, is to penalize misclassifications. It measures the error due to misclassification (or data points being closer to the classification boundary than the margin). The λ is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that the $x_i$ lies on the correct side of the margin.
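The paragraph above refers to a two-term objective (regularization plus hinge loss) that is not written out here. As a hedged sketch only, the code below assumes one common soft-margin form, $\lambda \lVert w \rVert^2 + \frac{1}{n} \sum_i \max(0,\, 1 - y_i (w^T x_i - b))$ with labels in {−1, +1}; the exact scaling used by the source is not known, and the function name `svm_objective` is hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def svm_objective(w, b, X, y, lam):
    """Assumed soft-margin objective: lam * ||w||^2 + mean hinge loss.

    The decision boundary is w.x - b = 0 and labels y_i are -1 or +1.
    """
    regularization = lam * dot(w, w)                       # first term
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) - b))      # second term
                for xi, yi in zip(X, y)) / len(X)
    return regularization + hinge


# Both points lie outside the margin, so only the regularization term remains.
separated = svm_objective([1.0, 0.0], 0.0, [[2.0, 0.0], [-2.0, 0.0]], [1, -1], lam=0.5)
# A correctly classified point inside the margin still incurs hinge loss.
inside_margin = svm_objective([1.0, 0.0], 0.0, [[0.5, 0.0]], [1], lam=0.0)
```

A larger `lam` favors a wider margin (smaller weights), while a smaller `lam` puts more emphasis on keeping every $x_i$ on the correct side of the margin.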
→ Kernel: A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called "generalized dot product". The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data to linearly separable ones in a higher dimension.
Given a feature mapping ϕ, we define the kernel K as follows:
$$K(x, z) = \phi(x)^T \phi(z)$$
In practice, the kernel K defined by
$$K(x, z) = \exp\Big(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\Big)$$
is called the Gaussian kernel and is commonly used.
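The two definitions above can be checked numerically with a short sketch. The Gaussian kernel follows the formula given; the degree-2 polynomial kernel and its explicit feature map ϕ for 2-D inputs are an added illustration (not from the text) showing that a kernel value equals a dot product in a higher-dimensional feature space. All function names are hypothetical.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2, computed in input space."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map phi for 2-D input such that phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, z)                                # no explicit mapping
via_features = dot(poly2_features(x), poly2_features(z))      # same value via phi
similarity = gaussian_kernel(x, z, sigma=2.0)
```

The kernel version never builds ϕ(x), which is the point of the kernel trick: for the Gaussian kernel the corresponding feature space is infinite-dimensional, so only the kernel form is practical.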
DBSCAN
→ Two parameters: ε-distance, minimum points
→ Three classifications of points:
   • Core: has at least minimum points within ε-distance, including itself
   • Border: has fewer than minimum points within ε-distance but can be reached by clusters.
   • Outlier: point that cannot be reached by a cluster
Procedure:
   1. Pick a random point that has not been assigned to a cluster or designated as an Outlier. Determine if it is a Core Point. If not, label the point as Outlier.
   2. Once a Core Point has been found, add all directly reachable points to its cluster. Then do neighbor jumps to each reachable point and add them to the cluster. If an Outlier has been added, label it as a Border Point.
   3. Repeat these steps until all points are assigned to a cluster or labeled as Outlier.
K-means procedure:
   1. Divide data into K clusters or groups.
   2. Randomly select a centroid for each of these K clusters.
   3. Assign data points to their closest cluster centroid according to Euclidean / Squared Euclidean / Manhattan / Cosine distance.
   4. Calculate the centroids of the newly formed clusters.
   5. Repeat steps 3 and 4 until the same centroids are assigned to each cluster (convergence).
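The five centroid-based steps listed above (choose K, pick initial centroids, assign points, recompute centroids, repeat until the centroids stop changing) can be sketched as follows. This is a minimal illustration assuming Euclidean distance and points given as tuples; the function name `kmeans` and the `init`, `max_iter` and `seed` parameters are hypothetical conveniences, not part of the text.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Centroid-based clustering of tuples into k groups (Euclidean distance)."""
    rng = random.Random(seed)
    # Step 2: random initial centroids, unless explicit ones are supplied.
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, init=[(0, 0), (10, 10)])
```

Passing `init` makes the run reproducible; with random initialization the result can depend on the starting centroids, so several restarts are often used in practice.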