ML Notes
Feature Scaling
Feature scaling is a crucial preprocessing step in many machine learning workflows. It ensures that numerical features
contribute equally to model performance and prevents features with larger magnitudes from dominating those with smaller
ones. Below are the commonly used feature scaling techniques:
1. Standardization
• Formula: x′ = (x − µ) / σ, where µ is the mean and σ is the standard deviation of the feature.
• This method transforms the data to have a mean of 0 and a standard deviation of 1.
• It is particularly effective for algorithms that assume normally distributed features, such as linear regression, logistic
regression, and principal component analysis (PCA).
2. Normalization
• Min-Max Scaling:
– Formula: x′ = (x − min(x)) / (max(x) − min(x))
– This technique scales the data to a fixed range, typically [0, 1].
– It is suitable when the minimum and maximum values of the feature are known and the data is bounded.
• Max-Abs Scaling:
– Formula: x′ = x / max(|x|)
– This method scales each feature by its maximum absolute value, resulting in values between -1 and 1.
– It is especially effective for sparse data representations.
• Robust Scaling:
– Formula: x′ = (x − median(x)) / IQR, where IQR is the interquartile range.
– This method is less sensitive to outliers as it uses median and IQR instead of mean and standard deviation.
– It is preferred when the dataset contains significant outliers.
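The four scalers above are available in scikit-learn; a minimal sketch on made-up values (note how the outlier distorts min-max but barely affects robust scaling):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler

# One feature with a clear outlier (last value), shaped (n_samples, 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)     # mean 0, std 1
X_minmax = MinMaxScaler().fit_transform(X)    # squashed into [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)    # divided by max(|x|) -> [-1, 1]
X_robust = RobustScaler().fit_transform(X)    # (x - median) / IQR, outlier-resistant
```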
1. Ordinal Encoding
This technique is used when categorical features have an inherent order or ranking. Each category is assigned an integer
value reflecting its relative position.
Mathematical Transformations
Mathematical transformations are applied to features to stabilize variance, normalize distributions, and improve the perfor-
mance of machine learning algorithms.
1. Function Transformations
These transformations apply mathematical functions directly to the feature values to address skewness and non-linearity.
• Reciprocal Transformation: y = 1/x
Effective for reducing the impact of large values.
• Square Transformation: y = x²
Useful for handling left-skewed data by stretching values on the right.
• Logarithmic Transformation: y = log(x + 1) or y = log(x)
Appropriate for compressing right-skewed data and reducing the influence of large values.
• Square Root Transformation: y = √x
Less aggressive than log transformation, often used for count data.
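The four function transformations can be sketched directly in NumPy on illustrative right-skewed values:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # right-skewed values

recip = 1.0 / x        # reciprocal: large values shrink the most
square = x ** 2        # stretches the right tail (useful for left-skewed data)
log_t = np.log1p(x)    # log(x + 1): compresses the right tail
sqrt_t = np.sqrt(x)    # milder compression than log
```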
2. Power Transformations
Power transforms are parametric methods that aim to make data more Gaussian-like.
• Box-Cox Transformation:
– Defined as:
x′ = (x^λ − 1) / λ, if λ ≠ 0
x′ = log(x), if λ = 0
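A sketch using scipy.stats.boxcox, which also estimates λ by maximum likelihood; Box-Cox requires strictly positive input, and the lognormal sample here is illustrative:

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strongly right-skewed, positive

# Returns the transformed data and the lambda fitted by maximum likelihood;
# since x is lognormal, the fitted lambda should be near 0 (i.e., a log transform)
x_bc, lam = boxcox(x)
```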
3. Quantile Transformation
Quantile transformation maps the original data to a specified distribution using its empirical cumulative distribution function
(CDF).
• Overview: Converts features into a uniform or normal distribution by ranking the values.
• Steps Involved:
– Uniform Mapping:
yᵢ = rᵢ = i / n
– Normal Mapping:
yᵢ = Φ⁻¹(rᵢ)
where Φ⁻¹ is the inverse cumulative distribution function (CDF) of the standard normal distribution.
4. Handle Ties: For duplicate values, assign the average of their ranks to ensure consistency.
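scikit-learn packages these steps as QuantileTransformer; a sketch on an illustrative skewed feature:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 1))  # skewed feature

# Map through the empirical CDF to a uniform, then to a normal distribution
qt_uniform = QuantileTransformer(output_distribution="uniform", n_quantiles=1000)
qt_normal = QuantileTransformer(output_distribution="normal", n_quantiles=1000)

X_uni = qt_uniform.fit_transform(X)   # values in [0, 1]
X_norm = qt_normal.fit_transform(X)   # approximately standard normal
```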
1. Discretization / Binning
Discretization transforms continuous numerical features into categorical values by grouping them into discrete intervals or
bins.
• Unsupervised Binning: Bin boundaries are determined without using information from the target variable.
– Uniform Binning (Equal Width): Divides the range of the variable into k equal-width intervals.
– Quantile Binning (Equal Frequency): Ensures each bin has approximately the same number of observations.
– K-means Binning: Uses the K-means clustering algorithm to determine bin centers based on feature distribution.
• Supervised Binning: Bin boundaries are chosen based on the relationship between the feature and the target variable
(e.g., maximizing information gain or minimizing impurity).
• Custom Binning: Bin intervals are manually defined using domain expertise or data-specific knowledge.
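The three unsupervised binning strategies map directly onto scikit-learn's KBinsDiscretizer; a sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]])

# Unsupervised binning: equal-width, equal-frequency, and k-means strategies
uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
quantile = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
kmeans = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")

bins_uniform = uniform.fit_transform(X).ravel()    # equal-width intervals
bins_quantile = quantile.fit_transform(X).ravel()  # ~2 observations per bin
bins_kmeans = kmeans.fit_transform(X).ravel()      # bins from 1-D k-means centers
```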
2. Binarization
Binarization converts continuous numerical values into binary values (0 or 1) based on a predefined threshold.
• Often used to simplify features or prepare them for algorithms that expect binary input.
Outlier Detection Techniques
1. Mean and Standard Deviation Method (For Normally Distributed Data)
• Lower Bound: µ − 3σ
• Upper Bound: µ + 3σ
• Any value outside this range is considered an outlier.
2. Interquartile Range (IQR) Method (For Skewed Data)
• Lower Bound: Q1 − 1.5 × IQR
• Upper Bound: Q3 + 1.5 × IQR, where IQR = Q3 − Q1
• Values outside this range are considered outliers.
3. Percentile Method
• Outliers are defined based on extreme percentiles.
• Common thresholds: below the 1st percentile or above the 99th percentile.
• Useful for non-normal or highly skewed distributions.
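The IQR method can be sketched in a few lines of NumPy on illustrative data:

```python
import numpy as np

data = np.array([12, 13, 14, 15, 15, 16, 17, 18, 95])  # 95 is a suspect point

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the standard 1.5*IQR fences

outliers = data[(data < lower) | (data > upper)]
```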
Handling missing values is a critical preprocessing step. Strategies depend on the type, amount, and randomness of the
missing data.
Removing
Rows with missing values can be removed under specific conditions to avoid introducing bias or information loss.
• Missing Completely at Random (MCAR): Ensure that the missing values are random and not dependent on any
feature.
• Low Missing Percentage: Preferably less than 5% of the dataset.
• Distribution Preservation: Verify that the removal does not significantly affect the data distribution:
– The probability density function (PDF) of numerical columns should remain similar.
– The distribution of categorical values should also remain consistent.
KNN Imputer
Row   feature1   feature2   feature3
0     1.0        2.0        1.5
1     2.0        NaN        2.5
2     NaN        6.0        3.5
3     4.0        8.0        NaN
4     5.0        10.0       5.5

Distance(A, B) = √( Σᵢ (featureᵢ^(A) − featureᵢ^(B))² )
where the sum runs over the features observed in both rows.
Step 3: Apply Uniform or Distance Method for Imputation
• Uniform Method: Uses the average of the feature values of the K nearest neighbors.
• Distance Method: Uses the weighted average where weights are inverse distances:
Imputed Value = Σ (1/distᵢ) · feature valueᵢ / Σ (1/distᵢ)
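The table above can be imputed with scikit-learn's KNNImputer, which uses a NaN-aware Euclidean distance; a sketch:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, 1.5],
    [2.0, np.nan, 2.5],
    [np.nan, 6.0, 3.5],
    [4.0, 8.0, np.nan],
    [5.0, 10.0, 5.5],
])

# weights="uniform" averages the neighbors; weights="distance" uses 1/dist weights
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)
```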
Iterative Imputation
Iterative imputation is an advanced technique used to handle missing values in a dataset. Unlike simple imputation methods,
such as filling missing values with the mean, iterative imputation attempts to predict missing values based on other features
in the dataset using a model.
• Initial Imputation: Start by filling missing values (NaN) with the mean or median of the respective feature to provide
a starting point for the imputation process.
• Choose a column X containing a NaN value to impute.
• Train a model (e.g., linear regression, decision tree, etc.) where:
– The rows in which X is observed form the training set, with the remaining columns serving as input features.
– The target variable is column X itself, so the model learns to predict X from the other features.
• Predict the missing value: Once the model is trained, feed it the other-column values of the rows where X is missing to predict the missing entries.
• Repeat the process: Repeat this process for all missing values in the dataset.
• Update the missing values: After imputing all the missing values in the dataset, update the missing values with
the predicted values from the model.
• Iterative updates: Instead of using the mean for imputation in the next iteration, use the newly predicted values as
the starting point for the next iteration of imputation.
• Convergence criteria: Repeat this process for several iterations or until the imputed values stabilize, meaning the
changes between iterations become minimal.
Handling Imbalanced Data
1. Undersampling and Oversampling
Undersampling removes samples from the majority class, while oversampling adds samples to the minority class. SMOTE (Synthetic Minority Oversampling Technique) is a widely used oversampling method:
• For each minority class observation, find its k nearest minority-class neighbors and pick one at random.
• Interpolate between the original sample and its neighbor, creating a new synthetic sample that lies in between.
• Repeat the process to create multiple synthetic samples for each minority class observation.
• Combine the original dataset with the synthetic samples to form a balanced dataset. This helps ensure that the model is not biased towards the majority class.
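A minimal SMOTE-style interpolation hand-rolled in NumPy for illustration (the imbalanced-learn library provides a production implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
minority = rng.normal(loc=5.0, scale=1.0, size=(10, 2))  # small minority class

def smote_like(samples, n_new, k=3, rng=rng):
    """Create n_new synthetic points by interpolating each sample toward a random neighbor."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(samples))
        d = np.linalg.norm(samples - samples[i], axis=1)  # distances to all points
        neighbors = np.argsort(d)[1:k + 1]                # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                                # position along the segment
        synthetic.append(samples[i] + gap * (samples[j] - samples[i]))
    return np.array(synthetic)

new_points = smote_like(minority, n_new=20)
balanced = np.vstack([minority, new_points])
```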
3. Ensemble Methods
CLASSIFICATION METRICS
Metric — Formula / Definition — Interpretation

• Accuracy = (TP + TN) / (TP + TN + FP + FN). Proportion of total predictions that were correct. Can be misleading on imbalanced datasets. Greater values indicate better performance.
• Precision = TP / (TP + FP) = True Positive / Predicted Positive. Proportion of positive predictions that were truly positive. Greater values indicate better performance.
• Recall (Sensitivity) = TP / (TP + FN) = True Positive / Actual Positive. Proportion of actual positive instances correctly identified. Greater values indicate better performance.
• Specificity (True Negative Rate) = TN / (TN + FP) = True Negative / Actual Negative. Proportion of actual negative instances correctly identified. Greater values indicate better performance.
• Balanced Accuracy = (Recall + Specificity) / 2 = ½ · TP/(TP + FN) + ½ · TN/(TN + FP). Average of recall for each class. Useful for imbalanced datasets as it balances TPR and TNR. Greater values indicate better performance.
• MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Measures the correlation between the observed and predicted binary classifications. Ranges from −1 (perfect inverse prediction) to +1 (perfect prediction); 0 indicates a random prediction. Useful for imbalanced data. Higher values indicate better performance.
• Cohen's Kappa (κ) = (Po − Pe) / (1 − Pe). Measures agreement between predicted and actual labels, correcting for chance agreement. Po is observed agreement (Accuracy), Pe is expected agreement by chance. Ranges from −1 to +1; 0 means chance agreement, +1 is perfect agreement. Higher values indicate better performance.
• F1-score = 2 · (Precision · Recall) / (Precision + Recall). Harmonic mean of precision and recall. Balances both equally (β = 1). Greater values indicate better performance.
• Fβ-score = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall). Weighted harmonic mean of precision and recall. β weights recall relative to precision: β > 1 favors recall, β < 1 favors precision. Greater values indicate better performance.
• Log Loss = −(1/n) Σ_{i=1}^{n} [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]. Penalizes confident incorrect probabilistic predictions. Lower values indicate better performance.
• Area Under ROC (AUC): Area under the ROC curve (TPR vs FPR). Represents the model's ability to distinguish between positive and negative classes. Greater values (closer to 1) indicate better performance.
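Most of these metrics are one call away in scikit-learn; a sketch on small made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, cohen_kappa_score, log_loss)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the probabilities at 0.5

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
ll = log_loss(y_true, y_prob)          # uses the raw probabilities, not labels
```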
ROC Curves
When evaluating machine learning classification models, especially binary classifiers, simply looking at accuracy can be
misleading. The Receiver Operating Characteristic (ROC) curve is a powerful graphical tool that provides a more
comprehensive view of a model’s performance across all possible classification thresholds.
An ROC curve plots two key metrics against each other:
1. True Positive Rate (TPR) (Sensitivity or Recall): The proportion of actual positive instances correctly identified.
TPR = TP / (TP + FN) = TP / P
2. False Positive Rate (FPR): The proportion of actual negative instances incorrectly identified as positive.
FPR = FP / (FP + TN) = FP / N
Where:
• TP: True Positive (Actual is Positive, Predicted is Positive)
• FP: False Positive (Actual is Negative, Predicted is Positive)
• FN: False Negative (Actual is Positive, Predicted is Negative)
• TN: True Negative (Actual is Negative, Predicted is Negative)
Sorted by predicted probability (descending); P = 5 positives, N = 5 negatives:

Instance ID   Actual Label   Predicted Prob   TP   FP   TPR (TP/P)   FPR (FP/N)   ROC Point
—             —              1.0 (threshold)  0    0    0.0          0.0          (0.0, 0.0)
1             1              0.9              1    0    0.2          0.0          (0.0, 0.2)
2             0              0.8              1    1    0.2          0.2          (0.2, 0.2)
3             1              0.75             2    1    0.4          0.2          (0.2, 0.4)
4             1              0.7              3    1    0.6          0.2          (0.2, 0.6)
5             0              0.6              3    2    0.6          0.4          (0.4, 0.6)
6             1              0.55             4    2    0.8          0.4          (0.4, 0.8)
7             0              0.4              4    3    0.8          0.6          (0.6, 0.8)
8             1              0.3              5    3    1.0          0.6          (0.6, 1.0)
9             0              0.2              5    4    1.0          0.8          (0.8, 1.0)
10            0              0.1              5    5    1.0          1.0          (1.0, 1.0)
The ROC curve is plotted using the (FPR, TPR) pairs from the last column of the table above.
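The same curve and area can be reproduced with scikit-learn (note that roc_curve drops collinear intermediate points by default, so it may return fewer points than the table):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])   # actual labels from the table
y_prob = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
```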
Interpreting the ROC Curve
• The x-axis (FPR) represents the rate of false alarms: at a given threshold, the percentage of all negative instances incorrectly flagged as positive.
• The y-axis (TPR) represents the rate of correct positive predictions: at a given threshold, the percentage of all positive instances correctly identified.
• The ideal point is (0,1) - 0% false positives and 100% true positives.
• The diagonal line from (0,0) to (1,1) represents a random classifier. A good model’s curve bows towards the top-left
corner, staying as far from the diagonal as possible.
For our example points, the AUC can be calculated by summing the areas of the trapezoids between consecutive points
(FPRi , TPRi ) and (FPRi+1 , TPRi+1 ):
AUC = Σᵢ (FPRᵢ₊₁ − FPRᵢ) × (TPRᵢ + TPRᵢ₊₁) / 2
Calculating the AUC for our example points gives an AUC of 0.72. This indicates the model performs better than random.
Cohen’s Kappa
In evaluating classification models, while accuracy measures the overall proportion of correct predictions, it doesn’t account
for correct predictions that could happen purely by chance. This is where Cohen’s Kappa (κ) provides a more robust and
interpretable evaluation.
Formula
Cohen’s Kappa (κ) is defined as:
κ = (Po − Pe) / (1 − Pe)
Where:
• Po is the observed agreement (accuracy):
Po = Σₖ nₖₖ / N
• Pe is the expected agreement by chance:
Pe = Σₖ (nₖ. · n.ₖ) / N²
Here, nₖₖ is the number of correct predictions for class k, nₖ. is the total actual instances of class k, n.ₖ is the total predicted instances of class k, and N is the total number of instances.
• +1: Perfect agreement
• 0: Agreement by chance
• −1: Complete disagreement
Interpretation (Landis & Koch, 1977): κ < 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect agreement.
Multiclass Classification Example
Let’s consider a 3-class problem (A, B, C) with the following confusion matrix for 120 instances:
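The confusion matrix itself did not survive extraction, so the sketch below computes Po, Pe, and κ from a hypothetical 3-class matrix with 120 instances:

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual A, B, C; columns = predicted)
cm = np.array([
    [35,  5,  0],
    [ 4, 30,  6],
    [ 1,  9, 30],
])
N = cm.sum()                    # 120 total instances

po = np.trace(cm) / N           # observed agreement (accuracy)
row = cm.sum(axis=1)            # actual counts n_k.
col = cm.sum(axis=0)            # predicted counts n_.k
pe = np.sum(row * col) / N**2   # expected agreement by chance
kappa = (po - pe) / (1 - pe)
```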
REGRESSION METRICS
Metric — Formula / Definition — Interpretation

• Mean Absolute Error (MAE) = (1/n) Σ_{i=1}^{n} |yᵢ − ŷᵢ|. Average absolute difference between predicted and actual values. Measures average error magnitude. Lower values indicate better performance.
• Mean Squared Error (MSE) = (1/n) Σ_{i=1}^{n} (yᵢ − ŷᵢ)². Average squared difference between predicted and actual values. Penalizes larger errors more heavily than MAE. Lower values indicate better performance.
• Root Mean Squared Error (RMSE) = √( (1/n) Σ_{i=1}^{n} (yᵢ − ŷᵢ)² ). Square root of MSE. In the same units as the response variable. Measures average error magnitude. Lower values indicate better performance.
• R-squared (R²) = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)². Proportion of the variance in the dependent variable that is predictable from the independent variable(s). Ranges from 0 to 1. Higher values (closer to 1) indicate better fit.
• Adjusted R-squared = 1 − (1 − R²)(n − 1) / (n − p − 1). Modified R-squared that adjusts for the number of predictors (p) and observations (n). Penalizes the inclusion of irrelevant predictors. Useful for model comparison. Higher values indicate better performance.
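A sketch computing all five on a small made-up example (adjusted R² is written out by hand, since scikit-learn does not provide it directly):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

n, p = len(y_true), 1                             # observations and predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # adjusted R-squared by hand
```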
DIMENSION REDUCTION
Principal Component Analysis (PCA)
• Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into
a smaller set of new features called principal components. These components are orthogonal (uncorrelated) and capture
the maximum variance present in the data. The goal of PCA is to reduce the number of features while retaining the
most significant information, making the dataset easier to analyze, visualize, and model.
• The key idea behind PCA is to project the data onto directions that maximize the variance, thus helping uncover
hidden patterns and reducing noise in the data.
1. Mean-Center the Data:
• The first step is to center the data by subtracting the mean of each feature (column). This ensures that the data has zero mean and allows PCA to focus on the variance rather than the absolute values of the features.
• Mathematically, for a feature fᵢ, subtract its mean, i.e., fᵢ′ = fᵢ − µ_fᵢ, where µ_fᵢ is the mean of feature fᵢ.
2. Compute the Covariance Matrix:
• The covariance matrix C captures the relationships between the features. If X is the matrix of mean-centered
data, the covariance matrix is computed as:
C = XᵀX / (n − 1)
where n is the number of samples and X T is the transpose of the mean-centered data matrix.
• Each element in the covariance matrix represents how two features vary together. Diagonal elements represent
the variance of each feature, and off-diagonal elements show the covariance between pairs of features.
        f1             f2             ...   fm
f1   cov(f1, f1)   cov(f1, f2)   ...   cov(f1, fm)
f2   cov(f2, f1)   cov(f2, f2)   ...   cov(f2, fm)
...
fm   cov(fm, f1)   cov(fm, f2)   ...   cov(fm, fm)
3. Calculate Eigenvalues and Eigenvectors:
• Eigenvalues represent the magnitude of variance explained by each principal component, while eigenvectors rep-
resent the directions (or axes) along which the data varies the most.
• Solve for the eigenvalues (λ) and eigenvectors (v) of the covariance matrix C using the equation:
Cv = λv
where v is the eigenvector and λ is the eigenvalue. The eigenvectors form the basis of the new feature space.
4. Select Principal Components:
• Sort the eigenvalues in descending order to identify the most important principal components that explain the
most variance.
• Choose the top p eigenvectors corresponding to the largest eigenvalues. These eigenvectors represent the directions
that capture the most significant variance in the data.
• Construct a matrix V containing the selected p eigenvectors. The dimensions of V will be m × p, where m is the
number of features, and p is the number of principal components chosen.
5. Project Data onto New Subspace:
• To reduce the dimensionality of the data, multiply the mean-centered data matrix X by the matrix V containing
the selected eigenvectors:
Xnew = X · V
• This transformation projects the data onto the new subspace defined by the top p principal components. The
resulting data matrix Xnew will have n rows (same number of samples) and p columns (number of selected principal
components).
6. Reconstruction (Optional):
• In some cases, you might want to reconstruct the original data using the reduced-dimensional data. This is done
by multiplying the reduced data matrix by the transpose of the eigenvector matrix:
Xreconstructed = Xnew · V T
• The reconstructed data will be an approximation of the original data, with the error proportional to the variance
lost by discarding less significant components.
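The six steps above can be sketched directly in NumPy on illustrative correlated data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: second feature is a noisy multiple of the first
f1 = rng.normal(size=200)
X = np.column_stack([f1, 2.0 * f1 + rng.normal(scale=0.1, size=200)])

# 1. Mean-center
Xc = X - X.mean(axis=0)
# 2. Covariance matrix
C = Xc.T @ Xc / (len(Xc) - 1)
# 3. Eigen-decomposition (eigh is appropriate: C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort descending and keep the top p components
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
V = eigvecs[:, :1]              # p = 1
# 5. Project onto the new subspace
X_new = Xc @ V
# 6. Reconstruct (approximation of the centered data)
X_rec = X_new @ V.T
```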
Linear Discriminant Analysis (LDA)
Given a dataset with c classes, let:
• X = [x1 , x2 , . . . , xn ] represent the dataset with n samples.
• Each sample xi ∈ Rd is a d-dimensional feature vector.
• Let Xj be the subset of samples belonging to class j, which has nj samples.
The main goal of LDA is to find a projection that maximizes the separability between different classes while minimizing
the variance within each class.
• Within-Class Scatter Matrix: captures the spread of each class around its own mean:
S_W = Σ_{j=1}^{c} Σ_{x∈Xⱼ} (x − mⱼ)(x − mⱼ)ᵀ
where mⱼ is the mean of class j.
• Between-Class Scatter Matrix: captures the spread of the class means around the overall mean:
S_B = Σ_{j=1}^{c} nⱼ (mⱼ − m)(mⱼ − m)ᵀ
where nⱼ is the number of samples in class j and m is the overall mean of the dataset.
4. Optimization Objective:
• The goal is to find a projection vector w that maximizes the ratio of between-class variance to within-class variance:
J(w) = (wᵀ S_B w) / (wᵀ S_W w)
This ratio is maximized when the classes are as far apart as possible while keeping the variance within each class
as small as possible.
5. Eigenvalue Problem:
SB w = λSW w
where SW and SB are the within-class and between-class scatter matrices, respectively, and λ is the eigenvalue.
• Compute the eigenvalues and eigenvectors of the matrix S_W⁻¹ S_B.
• Sort the eigenvectors in descending order based on their corresponding eigenvalues.
• Select the top k eigenvectors (where k ≤ c − 1) to form the transformation matrix W, where each eigenvector
corresponds to a principal axis in the new subspace.
Key Points
• Assumptions: LDA assumes that each class follows a Gaussian distribution with identical covariance matrices across all classes.
• LDA is sensitive to outliers and to imbalances in the class distribution.
• LDA aims to maximize the distance between class means and minimize the variance within each class, thereby improving
class separability.
• Computationally Efficient: LDA is generally faster and computationally simpler than non-linear dimensionality
reduction methods like t-SNE or UMAP because it is based on linear transformations.
• LDA vs. PCA:
– PCA: A data-driven, unsupervised technique that focuses on maximizing the total variance in the data, irrespec-
tive of class labels.
– LDA: A supervised method that uses class labels to maximize the separability between classes, thereby focusing
on finding dimensions that best distinguish between classes.
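A sketch with scikit-learn's LinearDiscriminantAnalysis on synthetic data; with c = 3 classes, LDA can project onto at most c − 1 = 2 dimensions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)  # k <= c - 1 = 2
X_lda = lda.fit_transform(X, y)                   # supervised projection
```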
Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a powerful matrix factorization technique used in linear algebra, data analysis, and
machine learning. It decomposes a matrix into three simpler matrices and is widely used for dimensionality reduction, noise
reduction, and feature extraction.
Given a real matrix A ∈ R^{m×n}, the SVD is defined as:
A = U Σ Vᵀ
where:
• U ∈ R^{m×m} is an orthogonal matrix whose columns are the left singular vectors,
• Σ ∈ R^{m×n} is a diagonal matrix with the non-negative singular values σ₁ ≥ σ₂ ≥ · · · ≥ σᵣ > 0 on the diagonal, where r = rank(A),
• V ∈ R^{n×n} is an orthogonal matrix whose columns are the right singular vectors (Vᵀ is its transpose).
Truncating to the top k singular values gives a rank-k approximation and a reduced representation:
Aₖ = Uₖ Σₖ Vₖᵀ
A′ = A Vₖ = Uₖ Σₖ ∈ R^{m×k}
• The matrix A′ captures the essential structure of the original data in a lower-dimensional space of dimension k, preserving the most significant singular values and their directions.
1. Compute AAT :
• Calculate the matrix product AAT ∈ Rm×m .
• Find its eigenvalues λ1 , λ2 , . . . , λm and the corresponding eigenvectors u1 , u2 , . . . , um .
• The eigenvectors form the columns of U , and the eigenvalues are used to compute singular values.
2. Compute AT A:
• Compute AT A ∈ Rn×n .
• Find its eigenvalues and corresponding eigenvectors v1 , v2 , . . . , vn .
• The eigenvectors form the columns of V .
• The singular values σᵢ are the square roots of the non-zero eigenvalues:
σᵢ = √λᵢ, for i = 1, . . . , r
where r = rank(A), and the rest of Σ is padded with zeros.
7. Transpose V :
• The matrix V T is simply the transpose of V .
A = U ΣV T
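The decomposition and its rank-k truncation can be sketched with NumPy on an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # s holds singular values, descending

# Rebuild A = U Sigma V^T by embedding s into the m x n matrix Sigma
Sigma = np.zeros((6, 4))
Sigma[:4, :4] = np.diag(s)
A_rebuilt = U @ Sigma @ Vt

# Rank-k truncation and the reduced representation A' = A V_k = U_k Sigma_k
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
A_reduced = A @ Vt[:k, :].T                      # shape (6, k)
```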
MACHINE LEARNING MODELS
Prediction
The predicted values Ŷ are obtained using the linear model:
Ŷ = Xβ
Y = Xβ + ϵ
where X is the design matrix of inputs, β is the coefficient vector, and ϵ is the error term.
Our goal is to find β that minimizes the Residual Sum of Squares (RSS): L(β) = ∥Y − Xβ∥²
This is equivalent to minimizing the squared Euclidean norm of the residuals.
Step 1: Expand the loss function
Using the identity ∥a∥2 = a⊤ a:
L(β) = YᵀY − YᵀXβ − βᵀXᵀY + βᵀXᵀXβ
= YᵀY − 2βᵀXᵀY + βᵀXᵀXβ
Step 2: Differentiate and set the gradient to zero
−2XᵀY + 2XᵀXβ = 0
XᵀXβ = XᵀY
β = (XᵀX)⁻¹XᵀY
Note: If X ⊤ X is not invertible (i.e., singular), we may use regularization (like Ridge Regression) or the Moore-Penrose
pseudoinverse instead.
An alternative approach is to use gradient descent to iteratively update the coefficients. For n epochs, the update rule
is:
∇β L = dL/dβ = −2XᵀY + 2XᵀXβ
β⁽ⁿ⁾ = β⁽ⁿ⁻¹⁾ − (α / m) · ∇β L
where α is the learning rate and m is the number of observations (used for averaging the gradient).
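Both solutions can be sketched in NumPy on synthetic data; gradient descent converges to the same coefficients as the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # intercept + one feature
true_beta = np.array([2.0, 3.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=m)

# Closed form: beta = (X^T X)^-1 X^T Y (solve is more stable than inverting)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on the same loss
beta_gd = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    grad = -2 * X.T @ Y + 2 * X.T @ X @ beta_gd
    beta_gd -= (alpha / m) * grad
```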
Polynomial Regression
To model non-linear relationships, polynomial features can be introduced. For example, given three input features x,
y, and z, a second-degree polynomial expansion includes:
x2 , y 2 , z 2 , xy, xz, yz
These new features are added to the design matrix, and then standard linear regression (e.g., OLS or gradient descent)
is applied on the transformed feature space.
Ridge Regression (L2 Regularization)
The Ridge loss adds an L2 penalty on the coefficients:
L(β) = Σ_{i=1}^{m} (yᵢ − ŷᵢ)² + λ Σ_{j=1}^{n} βⱼ²
Where:
• L(β) is the loss function, comprising the residual sum of squares and the regularization term.
• λ is the regularization parameter that controls the magnitude of the penalty.
The gradient of the loss function with respect to the coefficients is given by:
∂L/∂β = −2Xᵀ(Y − Xβ) + 2λβ
Here, XᵀX is the Gram matrix of the input features, and λI is the regularization matrix. The matrix I is the identity with its first diagonal entry set to 0, so that the intercept term is not penalized:

I = [ 0  0  ...  0
      0  1  ...  0
      ...
      0  0  ...  1 ]
β = (X ⊤ X + λI)−1 X ⊤ Y
Alternatively, using gradient descent, the update rule for the coefficients is:
β⁽ᵗ⁺¹⁾ = β⁽ᵗ⁾ − α · (∂L/∂β)|_{β=β⁽ᵗ⁾}
Where α is the learning rate and β (t) represents the coefficient vector at the t-th iteration.
For a simple linear regression model, the slope m under Ridge regularization is computed as:
m = Σ(yᵢ − ȳ)(xᵢ − x̄) / ( Σ(xᵢ − x̄)² + λ )
Note: As λ increases, the magnitude of the coefficients decreases, but they never reach zero. Therefore, Ridge regression
is not typically used for feature selection.
Lasso Regression (L1 Regularization)
The Lasso loss adds an L1 penalty on the coefficients:
L(β) = Σ_{i=1}^{m} (yᵢ − ŷᵢ)² + λ Σ_{j=1}^{n} |βⱼ|
For the slope m in the case of simple linear regression, Lasso regression modifies the computation as follows:
• For m > 0:
m = ( Σ(yᵢ − ȳ)(xᵢ − x̄) − λ ) / Σ(xᵢ − x̄)²
• m = 0 whenever |Σ(yᵢ − ȳ)(xᵢ − x̄)| ≤ λ: the penalty absorbs the entire correlation, setting the coefficient exactly to zero.
• For m < 0:
m = ( Σ(yᵢ − ȳ)(xᵢ − x̄) + λ ) / Σ(xᵢ − x̄)²
Lasso regression is particularly effective when you wish to eliminate some features entirely by setting their coefficients
to zero, especially when λ is large.
Note: Lasso is often used for feature selection because, as λ increases, it forces some coefficients to become exactly
zero, thereby removing those features from the model.
Elastic Net
Elastic Net combines both penalties:
L(β) = Σ_{i=1}^{m} (yᵢ − ŷᵢ)² + λ₁ Σ_{j=1}^{n} |βⱼ| + λ₂ Σ_{j=1}^{n} βⱼ²
Where:
• λ1 controls the strength of the L1 penalty (Lasso).
• λ2 controls the strength of the L2 penalty (Ridge).
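A sketch comparing the three penalties in scikit-learn on synthetic data where only two of ten features are informative (note scikit-learn uses `alpha` for the regularization strength):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta = np.array([5.0, -3.0] + [0.0] * 8)   # only 2 informative features
y = X @ beta + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=10.0).fit(X, y)        # shrinks coefficients, never exactly to 0
lasso = Lasso(alpha=1.0).fit(X, y)         # drives irrelevant coefficients to exactly 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of both penalties
```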
Logistic & Multinomial Logistic Regression
Logistic Regression
1. Perceptron Trick (Motivation)
The perceptron algorithm is an early classification method that updates weights when it misclassifies data. The idea is
to nudge the decision boundary to better separate classes.
Explanation: The perceptron algorithm updates its weights only when a prediction is wrong, encouraging the model
to correct its decision boundary. However, it doesn’t give us probabilistic predictions, and it only works well when the
data is linearly separable. This leads us to the need for a more refined model—Logistic Regression.
Notation:
• xi : ith input row vector
• W : weight vector (n × 1)
• yi : true label, ŷi : predicted label
• η: learning rate
2. Sigmoid Function
σ(z) = 1 / (1 + e⁻ᶻ)
The sigmoid function maps any real number into the range (0, 1). It’s the core of logistic regression.
Explanation: Logistic regression uses the sigmoid function to estimate the probability that a sample belongs to a
particular class. Unlike the hard decision rule in perceptron, the sigmoid gives smooth output, making it differentiable
and suitable for optimization using gradient-based methods.
3. Maximum Likelihood & Batch Training
Logistic regression fits weights using maximum likelihood estimation, which minimizes the following binary cross-entropy
loss:
L = −(1/m) Σ_{i=1}^{m} [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
Explanation: Instead of updating weights one sample at a time, we can use all training samples together in batch
mode. This is done by minimizing the log-loss, also known as binary cross-entropy. The loss penalizes incorrect
predictions more heavily as the confidence increases in the wrong direction.
Matrix Formulation
Ŷ = σ(XW)
L = −(1/m) [ Yᵀ log(Ŷ) + (1 − Y)ᵀ log(1 − Ŷ) ]
Explanation: In matrix form, the loss function and prediction become compact and computationally efficient. This
helps implement batch gradient descent easily using matrix operations.
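The matrix formulation translates almost line for line into NumPy; a sketch of batch gradient descent on labels drawn from a known logistic model:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # intercept + one feature
true_w = np.array([-1.0, 2.5])
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(m) < p).astype(float)  # labels drawn from the model

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Batch gradient descent on mean binary cross-entropy
W = np.zeros(2)
alpha = 0.5
for _ in range(5000):
    y_hat = sigmoid(X @ W)
    grad = X.T @ (y_hat - y) / m       # gradient of the mean log-loss
    W -= alpha * grad
```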
Method One: One-vs-Rest Approach (Simplified View)
• Apply one-hot encoding to convert categorical labels into a binary matrix format.
• Train a separate binary logistic regression model for each class (i.e., one model per column of the one-hot
encoded labels).
• Each model predicts the probability of its respective class vs all others.
• The class with the highest predicted probability is chosen during inference.
• Here, yₖ⁽ⁱ⁾ = 1 if the ith example belongs to class k, and 0 otherwise (from one-hot encoding).
• ŷₖ⁽ⁱ⁾ is the predicted probability for class k for the ith example.
Softmax Function
σ(z)ⱼ = e^{zⱼ} / Σ_{k=1}^{K} e^{zₖ}, for j = 1, . . . , K
Important Points
• Commonly used for multi-class classification tasks (e.g., digit recognition, sentiment with 3+ categories).
• We can apply regularization to prevent overfitting:
– L1 Regularization (Lasso): Encourages sparsity in weights.
– L2 Regularization (Ridge): Encourages smaller weights overall.
– Elastic Net: A mix of both L1 and L2.
• Optimization is usually done via batch or stochastic gradient descent.
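The softmax function is a few lines of NumPy; subtracting the maximum logit before exponentiating is the standard numerical-stability trick and leaves the result mathematically unchanged:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the ratio is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # a valid probability distribution over K classes
```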
K-Nearest Neighbors (KNN)
Based on the concept: “You are the average of the five people you spend the most time with”
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for both classification and
regression tasks. It predicts outcomes based on the similarity of data points in the feature space.
Algorithm Steps
Step 1: Normalize the Data
• Scale features appropriately (e.g., min-max normalization):
x′ = (x − min(X)) / (max(X) − min(X))
Step 2: Find the K nearest neighbors of the query point (e.g., by Euclidean distance) and aggregate their targets:
ŷ = aggregate({yᵢ}_{i=1}^{K})
where aggregate is the majority vote for classification or the mean for regression.
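A sketch with scikit-learn on a tiny made-up two-class set; the features sit on very different scales, which is why normalization comes first:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Two clusters; second feature is ~100x the scale of the first
X = np.array([[1.0, 100.0], [1.2, 110.0], [1.1, 105.0],
              [5.0, 500.0], [5.2, 520.0], [5.1, 510.0]])
y = np.array([0, 0, 0, 1, 1, 1])

scaler = MinMaxScaler()
Xs = scaler.fit_transform(X)              # Step 1: normalize

knn = KNeighborsClassifier(n_neighbors=3).fit(Xs, y)
query = scaler.transform([[1.15, 108.0]])  # close to the class-0 cluster
pred = knn.predict(query)                  # Step 2: majority vote of 3 neighbors
```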
Naive Bayes Classifier
Bayes’ Theorem
P(A|B) = P(B|A) P(A) / P(B)
Classification Formulation
For features A, B, C and class Cₖ:
P(Cₖ | A, B, C) ∝ P(A | Cₖ) · P(B | Cₖ) · P(C | Cₖ) · P(Cₖ)
(the naive assumption treats the features as conditionally independent given the class).
Implementation Steps
i. Calculate the mean µ_Cₖ and variance σ²_Cₖ of each feature for every class.
ii. Use the Gaussian formula to compute the likelihood for each feature.
iii. Multiply all feature likelihoods with the class prior to get posterior probabilities.
iv. Choose the class with the highest posterior probability.
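These four steps are exactly what scikit-learn's GaussianNB does; a sketch on two illustrative Gaussian classes:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two Gaussian classes with well-separated means
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X1 = rng.normal(loc=4.0, scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

nb = GaussianNB().fit(X, y)  # learns per-class feature means and variances
probs = nb.predict_proba([[0.1, -0.2], [4.2, 3.9]])  # posterior per class
```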
CART - Classification and Regression Trees
Pseudo Code
(a) Begin with the training dataset containing feature variables and corresponding target values.
(b) Determine the best feature to split the data on using a criterion like Gini impurity or entropy.
(c) Split the dataset into subsets based on this feature. Each split defines a node in the decision tree.
(d) Recursively repeat steps 2 and 3 for each subset until a stopping criterion is met (e.g., maximum depth, pure leaf,
or minimum samples).
Advantages
• Fast inference: prediction cost is logarithmic in the number of training samples.
• Easy to interpret and visualize.
Disadvantages
• Prone to overfitting, especially with deep trees.
• Sensitive to imbalanced datasets and small variations in data.
Gini Impurity
G = 1 − Σ_{i=1}^{c} pᵢ²
Information Gain
Information Gain = Impurity(Parent) − Σ_{i=1}^{k} (wᵢ / W) · Impurity(Childᵢ)
= Entropy or Gini(Parent) − Weighted Average of Child Impurities
Algorithm
(a) Compute impurity (Entropy or Gini) of the parent node.
(b) For each feature:
• Split the dataset based on feature values.
• Compute impurity of child nodes and their weighted average.
• Calculate the information gain.
(c) Select the feature with the highest information gain to split.
(d) Stop when:
• All samples belong to one class (pure node, entropy = 0), or
• Maximum depth or minimum node size is reached.
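The Gini impurity and information-gain formulas can be sketched in a few lines of NumPy; a perfect split of a balanced node yields the maximum gain of 0.5:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)  # G = 1 - sum(p_i^2)

def information_gain(parent, children):
    n = len(parent)
    # Weighted average of child impurities, weights w_i / W = |child| / |parent|
    weighted = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A perfect split separates the classes entirely
left, right = np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1])
ig = information_gain(parent, [left, right])
```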
Regression Trees
Algorithm
• Compute the standard deviation of the target variable in the parent node.
• For each feature (categorical or numerical), calculate the Information Gain:
– Categorical Features:
∗ Split the dataset based on the unique values (labels) of the categorical feature.
∗ Compute the standard deviation of the target variable for each group (child node).
– Numerical Features:
∗ Sort the data based on the numerical feature.
∗ Consider possible split points between consecutive values.
∗ For each split, divide the dataset into two groups.
∗ Compute the standard deviation of the target variable in each group (child node).
– Compute the weighted standard deviation of the child nodes.
– Calculate Information Gain as the reduction in standard deviation:
Information Gain = SD(Parent) − Weighted SD(Children)
• Select the feature and split that results in the highest Information Gain.
• Recursively repeat the above steps for each child node until a stopping criterion is met (e.g., minimum samples
per node or maximum tree depth).
• At each leaf node, output the mean of the target variable values within that node.
Feature Importance(i) = Σ_{t ∈ nodes using i} (Nₜ / N) · [ impurityₜ − (Nₜₗ / Nₜ) · impurityₜₗ − (Nₜᵣ / Nₜ) · impurityₜᵣ ]
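A sketch of a regression tree in scikit-learn on synthetic data; `feature_importances_` implements the impurity-decrease formula above (normalized to sum to 1), and only the first feature actually drives the target here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
importances = tree.feature_importances_  # impurity-decrease based, sums to 1
leaf_pred = tree.predict(X[:5])          # each prediction is a leaf's target mean
```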
Ensemble Learning
Ensemble learning is based on the concept of the wisdom of the crowd, where combining predictions from multiple
models increases the overall accuracy and robustness of the system. It helps convert a low-bias, high-variance model
into a low-bias, low-variance model by distributing the impact of random errors and outliers across multiple learners.
• Voting Ensemble: In this method, multiple diverse models (e.g., SVM, Decision Tree, Logistic Regression) make
predictions independently.
– For classification, the final output is determined by majority voting or by averaging class probabilities (soft
voting).
– For regression, the mean of the individual model predictions is used as the final result.
• Stacking: Stacking combines different models by using their predictions as input features for a final meta-model
(e.g., KNN or Logistic Regression).
– Base learners (e.g., SVM, Decision Tree, Logistic Regression) are first trained on the dataset.
– Their predictions are then used as input to train a meta-model that learns to optimally combine these outputs.
• Bagging (Bootstrap Aggregating): Bagging involves training multiple instances of the same model on different
random subsets (with replacement) of the training data.
– The final prediction is obtained through majority voting (classification) or averaging (regression).
– Bagging helps reduce variance and is effective for high-variance, low-bias models like decision trees.
• Boosting: Boosting is a sequential ensemble technique where each model tries to correct the errors of its predecessor.
– Models are trained one after the other.
– Each subsequent model focuses more on the data points that were misclassified by the previous models.
– The final prediction is typically a weighted sum of individual model outputs.
Types of Voting:
• Hard Voting: The class predicted by the majority of the models (i.e., mode of predictions) is selected.
• Soft Voting: The class with the highest average predicted probability (i.e., mean of probabilities) is selected.
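The difference between hard and soft voting can be seen on a single sample (the probabilities below are made-up numbers for illustration):

```python
import numpy as np

# Predicted class probabilities for one sample from three models (hypothetical values)
probs = np.array([
    [0.90, 0.10],   # model 1: strongly class 0
    [0.40, 0.60],   # model 2: weakly class 1
    [0.45, 0.55],   # model 3: weakly class 1
])

hard_votes = probs.argmax(axis=1)             # per-model predictions: [0, 1, 1]
hard_pred = np.bincount(hard_votes).argmax()  # majority vote -> class 1
soft_pred = probs.mean(axis=0).argmax()       # mean probs ~[0.583, 0.417] -> class 0
print(hard_pred, soft_pred)  # 1 0
```

The two schemes can disagree: soft voting lets one confident model outweigh two uncertain ones, which is why it often works better when the base models produce calibrated probabilities.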
Stacking
3. Multi-Layered Stacking
Multi-layered stacking extends the idea by creating a hierarchy of models across multiple layers; the output of
each layer is used as the input features for the next layer.
This approach enables models to learn from previous model predictions, enhancing overall performance—especially
when using a diverse set of model types.
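A toy two-layer stacking sketch in NumPy, using polynomial fits as stand-in base learners and least squares as the meta-model (everything here is illustrative, not a production setup):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 0.5 * x ** 2          # assumed toy regression target

# Layer 1: two hypothetical base learners, a degree-1 and a degree-2 polynomial fit
base_preds = np.column_stack([
    np.polyval(np.polyfit(x, y, 1), x),
    np.polyval(np.polyfit(x, y, 2), x),
])

# Layer 2: linear meta-model fit on the base learners' predictions (least squares)
coef, *_ = np.linalg.lstsq(base_preds, y, rcond=None)
stacked = base_preds @ coef

mse = lambda p: np.mean((y - p) ** 2)
print(mse(stacked) <= mse(base_preds[:, 0]))  # True: the blend is at least as good
```

In practice the meta-model should be trained on out-of-fold predictions of the base learners, not on in-sample predictions as above, to avoid leaking training labels into the second layer.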
Bagging Techniques
Bagging (Bootstrap Aggregating) is an ensemble method that improves model stability and accuracy by training multiple
models on different subsets of the data and combining their predictions. Below are some of the common techniques
used in bagging:
• Row Sampling with Replacement: This is the classic bagging technique where each individual model is trained
on a bootstrap sample, meaning data points are selected randomly with replacement. Some data points may appear
multiple times in the sample, while others may not be included at all.
• Pasting: Similar to row sampling, but in this case, data is sampled without replacement. Each model is trained on
a unique subset of the data, ensuring that every data point is used exactly once.
• Random Subspaces: This method focuses on column sampling (i.e., selecting a subset of features). Models are
trained on different subsets of features, either with or without replacement, allowing the ensemble to capture different
aspects of the data.
• Random Patches: A combination of both row and column sampling. In this technique, both subsets of rows (data
points) and columns (features) are selected, with or without replacement. This enables the ensemble to capture
diverse patterns from both the data and the feature set.
Out-of-Bag (OOB) Error: In bagging, approximately 37% of the samples are not selected for training any given
model. These samples, called out-of-bag (OOB) samples, can be used as a validation set to evaluate the model’s
performance. This internal validation provides a cost-effective way of assessing the model’s accuracy without the need
for a separate test set.
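The ≈37% figure follows from the bootstrap: a given row is missed by one sample with probability (1 − 1/n)ⁿ → e⁻¹ ≈ 0.368. A quick simulation of one bootstrap sample confirms it:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# One bootstrap sample: draw n row indices with replacement
sample = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(sample)) / n
print(round(oob_fraction, 3))  # ~0.368, i.e. about 37% of rows are out-of-bag
```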
AdaBoost Algorithm
1. Initialize Weights
For a dataset with n samples, initialize the weight for each sample equally:
\[ w_i = \frac{1}{n}, \quad \text{for all } i = 1, 2, \ldots, n \]
This ensures that all samples have equal influence initially.
2. Train a Weak Learner
Train a weak learner (e.g., a decision stump) on the training data using the current sample weights.
3. Make Predictions
Use the trained weak learner to predict the labels for all training samples. Denote these predictions as ŷi .
4. Compute the Weighted Error
Compute the weighted error rate of the weak learner:
\[ \epsilon = \sum_{i=1}^{n} w_i \cdot I(\hat{y}_i \neq y_i) \]
where I(·) is the indicator function that equals 1 when the prediction is incorrect and 0 otherwise.
5. Compute the Learner Weight
\[ \alpha = \frac{1}{2} \ln\left(\frac{1 - \epsilon}{\epsilon}\right) \]
6. Update Sample Weights
With labels encoded as ±1, the exponent is +α for mistakes and −α for correct predictions:
\[ w_i^{\text{new}} = w_i \cdot e^{-\alpha \, y_i \hat{y}_i} \]
This formulation increases weights for misclassified samples and decreases them for correctly classified ones.
7. Normalize Weights
Normalize the updated weights so that they sum to 1:
\[ w_i^{\text{norm}} = \frac{w_i^{\text{new}}}{\sum_{j=1}^{n} w_j^{\text{new}}} \]
9. Repeat
Repeat steps 2 through 7 for a specified number of iterations or until convergence. Note: Do not reinitialize weights
to 1/n in each iteration; the updated weights are carried forward to the next round.
The final ensemble prediction is a weighted vote of all weak learners:
\[ H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \]
where:
• ht (x) is the prediction of the t-th weak learner,
• αt is the weight of the t-th weak learner,
• H(x) is the final ensemble prediction.
Summary: AdaBoost sequentially trains weak learners on reweighted data, focusing more on previously misclassified
samples. It then combines them into a strong learner using a weighted vote, significantly boosting performance.
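One full AdaBoost round (steps 1 through 7) on a tiny made-up sample, with labels encoded as ±1 so the weight update can be written as a single exponential:

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])        # true labels in {-1, +1}
y_hat = np.array([1, -1, -1, -1, 1])   # hypothetical weak-learner predictions

# Step 1: uniform initial weights
w = np.full(len(y), 1 / len(y))

# Step 4: weighted error (indicator of misclassification)
eps = np.sum(w * (y_hat != y))         # 0.2 here: one mistake out of five

# Step 5: learner weight
alpha = 0.5 * np.log((1 - eps) / eps)

# Step 6: up-weight mistakes, down-weight correct predictions
w_new = w * np.exp(-alpha * y * y_hat)

# Step 7: normalize so the weights sum to 1
w_norm = w_new / w_new.sum()
print(w_norm.round(3))  # the misclassified sample (index 1) now carries weight 0.5
```

After normalization the single misclassified point holds half of the total weight, so the next weak learner is strongly pushed to get it right.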
Gradient Boosting
The final prediction is a weighted sum of all models:
1: Input: training data (X, y), number of rounds n, learning rate η
2: Initialize F0 (x) ← mean(y) (for squared loss)
3: modelsList ← [F0 ]
4: F (x) ← F0 (x)
5: for i ← 1 to n do
6:    Compute residuals: ri ← y − F (X)
7:    Fit a weak learner hi to the residuals ri
8:    Update F (x) ← F (x) + η · hi (x)
9:    modelsList.append(hi )
10: end for
11: Output: Final prediction function:
\[ \hat{y}(x) = F_0(x) + \eta \sum_{i=1}^{n} h_i(x) \]
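A from-scratch sketch of this loop, using a hand-rolled regression stump as the weak learner (toy data; the helper is my own, not from sklearn or XGBoost):

```python
import numpy as np

def fit_stump(x, r):
    # Best single-threshold regression stump on residuals r (brute-force search)
    best = None
    for thr in np.unique(x)[:-1]:
        left, right = r[x <= thr], r[x > thr]
        pred = np.where(x <= thr, left.mean(), right.mean())
        sse = np.sum((r - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, lv, rv = best
    return lambda z, thr=thr, lv=lv, rv=rv: np.where(z <= thr, lv, rv)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x)

eta = 0.1
F = np.full_like(y, y.mean())   # F0(x) = mean(y)
for _ in range(100):            # boosting rounds
    r = y - F                   # residuals (negative gradient of squared loss)
    h = fit_stump(x, r)         # fit weak learner to residuals
    F = F + eta * h(x)          # shrunken additive update

mse0 = np.mean((y - y.mean()) ** 2)
mse = np.mean((y - F) ** 2)
print(mse < mse0)  # True: the ensemble beats the constant model
```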
Gradient Boosting for Classification
1: Input: training data (X, y) with y ∈ {0, 1}, number of rounds n, learning rate η
2: Initialize F0 with the log-odds of the positive class
3: modelsList ← [F0 ]
4: for i ← 1 to n do
5:    Compute predicted probabilities:
\[ \hat{p}_k = \frac{1}{1 + e^{-F_{i-1}(x_k)}}, \quad \text{for all } x_k \in X \]
6:    Compute residuals (negative gradient): rk ← yk − p̂k
7:    Fit a weak learner hi to the residuals rk
8:    Update Fi (x) ← Fi−1 (x) + η · hi (x)
9:    modelsList.append(hi )
10: end for
11: Output: Final prediction:
\[ F(x) = F_0 + \eta \sum_{i=1}^{n} h_i(x), \qquad \hat{p}(x) = \frac{1}{1 + e^{-F(x)}}, \qquad \hat{y} = \begin{cases} 1 & \text{if } \hat{p}(x) \ge 0.5 \\ 0 & \text{otherwise} \end{cases} \]
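The probability and residual computations (the body of one boosting round) for three hypothetical samples:

```python
import numpy as np

F = np.array([-2.0, 0.0, 1.5])   # current raw scores (log-odds), made-up values
y = np.array([0.0, 1.0, 1.0])    # true labels

p = 1 / (1 + np.exp(-F))         # predicted probabilities via the sigmoid
r = y - p                        # residuals = negative gradient of log loss
print(p.round(3))                # [0.119 0.5   0.818]
print(r.round(3))                # [-0.119  0.5   0.182]
```

The next weak learner is fit to r: confidently correct samples contribute residuals near 0, while uncertain or wrong samples dominate the fit.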
Algorithm Steps
1: Initialize $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$   ▷ e.g., for squared loss: $F_0(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$
2: for m = 1 to M do
3:    Compute pseudo-residuals:
\[ r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} \]
4:    Fit a weak learner hm (x) to the pseudo-residuals rim
5:    Update the model: Fm (x) ← Fm−1 (x) + η · hm (x)
6: end for
7: Output: Final model FM (x)
Key Components
• Gradient Computation: Uses first-order Taylor approximation (gradient descent in function space)
• Weak Learner: Usually regression trees with limited depth (e.g., depth 3–6)
• Learning Rate η: Shrinks contribution of each tree; common values: 0.01 to 0.1
• Regularization: Achieved by early stopping, limiting tree depth, subsampling, and η
XGBoost (Extreme Gradient Boosting)
Key Advantages of XGBoost
• Flexibility
– Cross-Platform Support: Runs on Windows, Linux, and macOS
– Multi-Language APIs: Available in Python, R, Java, Scala, and C++
– Integration Support: Compatible with scikit-learn, Dask, Spark
– Wide Application: Supports regression, classification, time series, and ranking tasks
• Speed (Often 10× faster than traditional gradient boosting)
– Parallelized Tree Construction: Controlled via n_jobs (in the scikit-learn API) or nthread (in core XGBoost)
– Columnar Data Storage: Enables cache-aware access patterns
– Cache Optimization: Minimizes memory overhead
– Out-of-Core Learning: Handles datasets larger than memory using tree_method="hist"
– Distributed Training: Supports large-scale training across clusters
– GPU Acceleration: Leverages GPUs via device="cuda"
• Performance
– Regularization: Incorporates both L1 (Lasso) and L2 (Ridge) penalties
– Robust to Missing Data: Automatically learns optimal default directions
– Sparse Data Handling: Efficient processing of sparse input matrices
– Efficient Split Finding:
∗ Uses weighted quantile sketch for approximate split finding
∗ Employs binned feature values to speed up computation
– Tree Pruning: Performs depth-first expansion with backward pruning to eliminate splits with negative gain
– Regularized Objective: XGBoost minimizes a data-fit loss plus a complexity penalty:
\[ \mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \]
– The first term measures how well the model fits the data using a loss function l(·) (e.g., squared error for
regression, log loss for classification).
– The second term Ω(fk ) adds a penalty for model complexity:
∗ T is the number of leaves (terminal nodes) in the tree.
∗ w represents the vector of weights (predictions) in each leaf.
∗ γ and λ are regularization parameters that help prevent overfitting.
• Approximate Learning Algorithm (Histogram-based):
– Instead of evaluating every possible split, XGBoost uses an approximate method for faster computation.
– Continuous features are grouped into discrete bins using quantiles (this forms a histogram).
– For each bin, it calculates the sum of gradients and second-order gradients (Hessians).
– This significantly speeds up the split finding process, especially on large datasets.
– This method is selected with tree_method="hist" and is suitable for both CPU and GPU training.
• Sparsity-Aware Split Finding:
– XGBoost is optimized to handle datasets with missing or sparse values.
– During training, it automatically learns the best way to assign missing values (i.e., go left or right in a tree node).
– This avoids the need for manual imputation and improves performance on real-world datasets with incomplete
entries.
– It also efficiently handles features with many zero entries (common in one-hot encoded or bag-of-words data).
XGBoost for Regression
The initial prediction is the mean of the target:
prediction ← mean(y)
Each subsequent tree is fit to the residuals, and its shrunken output is added to the running prediction. For probability outputs (the classification case), the accumulated log-odds across all trees are passed through the sigmoid:
\[ \text{Final prob} = \frac{1}{1 + e^{-\text{Total log-odds}}} \]
XGBoost Mathematics
1. Objective Function
XGBoost minimizes the following regularized objective function at each boosting iteration t:
\[ \mathcal{L}^{(t)} = \sum_{i=1}^{n} L\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \]
2. Taylor Approximation
Using a second-order Taylor expansion around $\hat{y}_i^{(t-1)}$:
\[ \mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) \]
• $g_i = \dfrac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}$
• $h_i = \dfrac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$
Explanation: This approximation simplifies optimization using gradients (gi ) and Hessians (hi ).
3. Leaf-wise Reformulation
Writing the tree as T leaves with weights wj and sample index sets Ij , the objective becomes:
\[ \mathcal{L}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T \]
where:
\[ G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i \]
Explanation: The loss is grouped by leaves, allowing us to find the optimal wj for each leaf independently.
4. Optimal Leaf Weight
Setting the derivative with respect to wj to zero gives:
\[ w_j^* = -\frac{G_j}{H_j + \lambda} \]
Explanation: This gives the best score for each leaf to minimize the total objective.
5. Optimal Objective Value After Adding Tree
Substitute wj∗ into the objective:
\[ \mathcal{L}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \]
Explanation: This score helps evaluate split quality during tree construction.
6. Derivatives Summary
\[ g_i = \frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i} \quad \text{(gradient)}, \qquad h_i = \frac{\partial^2 L(y_i, \hat{y}_i)}{\partial \hat{y}_i^2} \quad \text{(Hessian)} \]
Explanation: These derivatives enable second-order optimization, improving training speed and accuracy.
For squared error loss (regression), these become:
\[ g_i = \hat{y}_i - y_i, \qquad h_i = 1 \]
\[ w_j^* = \frac{\sum_{i \in I_j} (y_i - \hat{y}_i)}{N_j + \lambda}, \qquad \mathcal{L}_j = -\frac{1}{2} \cdot \frac{\left( \sum_{i \in I_j} (y_i - \hat{y}_i) \right)^2}{N_j + \lambda} \]
where Nj is the number of samples in leaf j.
Explanation: For regression, gradient is the error, and the Hessian is constant, simplifying computation.
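A quick numerical sanity check of the squared-loss leaf weight (the leaf values below are made up):

```python
import numpy as np

# One leaf under squared loss: g_i = yhat_i - y_i, h_i = 1
y = np.array([3.0, 5.0, 4.0])    # targets of the samples in this leaf
y_hat = np.zeros(3)              # current predictions entering the leaf
lam = 1.0

G = np.sum(y_hat - y)            # sum of gradients: -12
H = len(y)                       # sum of Hessians = N_j = 3
w_star = -G / (H + lam)          # optimal leaf weight: 12 / 4 = 3.0

# Check: w_star minimizes the per-leaf objective G*w + 0.5*(H + lam)*w**2
ws = np.linspace(-10, 10, 2001)
obj = G * ws + 0.5 * (H + lam) * ws ** 2
print(w_star, ws[np.argmin(obj)])
```

Note that without regularization (λ = 0) the optimal weight would simply be the mean residual of the leaf; λ shrinks it toward zero.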