ML Notes

DATA PREPROCESSING

Feature Scaling
Feature scaling is a crucial preprocessing step in many machine learning workflows. It ensures that numerical features
contribute equally to model performance and prevents features with larger magnitudes from dominating those with smaller
ones. Below are the commonly used feature scaling techniques:

1. Standardization
• Formula: x′ = (x − µ)/σ, where µ is the mean and σ is the standard deviation of the feature.
• This method transforms the data to have a mean of 0 and a standard deviation of 1.
• It is particularly effective for algorithms that assume normally distributed features, such as linear regression, logistic
regression, and principal component analysis (PCA).

2. Normalization
• Min-Max Scaling:
– Formula: x′ = (x − min(x))/(max(x) − min(x))
– This technique scales the data to a fixed range, typically [0, 1].
– It is suitable when the feature's minimum and maximum are known and the data is bounded.
• Max-Abs Scaling:
– Formula: x′ = x / max(|x|)
– This method scales each feature by its maximum absolute value, resulting in values between -1 and 1.
– It is especially effective for sparse data representations.
• Robust Scaling:
– Formula: x′ = (x − median(x))/IQR, where IQR is the interquartile range.
– This method is less sensitive to outliers as it uses median and IQR instead of mean and standard deviation.
– It is preferred when the dataset contains significant outliers.
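All four scalers above are available in scikit-learn; a minimal sketch (the toy array is illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # toy feature with one large outlier

for scaler in (StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler()):
    # fit_transform learns the statistics (mean/min/max/median, ...) and applies them
    print(type(scaler).__name__, scaler.fit_transform(X).ravel())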

Encoding Categorical Data


Encoding categorical variables is essential to convert non-numeric data into a numerical format that can be interpreted by
machine learning algorithms. The two primary techniques are described below:

1. Ordinal Encoding
This technique is used when categorical features have an inherent order or ranking. Each category is assigned an integer
value reflecting its relative position.

Original Category    Encoded Value
Poor                 0
Good                 1
Excellent            2

2. Nominal Encoding (One-Hot Encoding)


This method is applied when categories do not have any inherent order. Each category is transformed into a binary vector
where only one bit is set to 1, indicating the presence of that category.

Pet Type    Dog    Cat    Rabbit
Dog         1      0      0
Cat         0      1      0
Rabbit      0      0      1
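A sketch with scikit-learn's encoders (the category lists come from the tables above; the sparse_output keyword assumes scikit-learn ≥ 1.2):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

quality = np.array([["Poor"], ["Good"], ["Excellent"]])
# Pass the explicit order so the integers reflect the ranking, not alphabetical order
ordinal = OrdinalEncoder(categories=[["Poor", "Good", "Excellent"]])
print(ordinal.fit_transform(quality).ravel())  # [0. 1. 2.]

pets = np.array([["Dog"], ["Cat"], ["Rabbit"]])
one_hot = OneHotEncoder(sparse_output=False)   # dense output, one binary column per category
print(one_hot.fit_transform(pets))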

Mathematical Transformations
Mathematical transformations are applied to features to stabilize variance, normalize distributions, and improve the perfor-
mance of machine learning algorithms.

1. Function Transformations
These transformations apply mathematical functions directly to the feature values to address skewness and non-linearity.

• Reciprocal Transformation: y = 1/x
Effective for reducing the impact of large values.
• Square Transformation: y = x²
Useful for handling left-skewed data by stretching values on the right.
• Logarithmic Transformation: y = log(x + 1) or y = log(x)
Appropriate for compressing right-skewed data and reducing the influence of large values.
• Square Root Transformation: y = √x
Less aggressive than the log transformation, often used for count data.

2. Power Transformations
Power transforms are parametric methods that aim to make data more Gaussian-like.

• Box-Cox Transformation:

– Defined as:

x′ = (x^λ − 1)/λ,   if λ ≠ 0
x′ = log(x),        if λ = 0

– Note: Applicable only to strictly positive data.


• Yeo-Johnson Transformation:
– Defined as:

y′ = ((y + 1)^λ − 1)/λ,               if λ ≠ 0, y ≥ 0
y′ = log(y + 1),                      if λ = 0, y ≥ 0
y′ = −((−y + 1)^(2−λ) − 1)/(2 − λ),   if λ ≠ 2, y < 0
y′ = −log(−y + 1),                    if λ = 2, y < 0

– Note: Suitable for both positive and negative values.
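Both transforms are implemented in scikit-learn's PowerTransformer, which also estimates λ by maximum likelihood; a brief sketch on right-skewed data:

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(500, 1))  # right-skewed, strictly positive

for method in ("box-cox", "yeo-johnson"):
    pt = PowerTransformer(method=method, standardize=True)
    pt.fit_transform(X)
    print(method, "fitted lambda:", round(float(pt.lambdas_[0]), 3))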

3. Quantile Transformation
Quantile transformation maps the original data to a specified distribution using its empirical cumulative distribution function
(CDF).

• Overview: Converts features into a uniform or normal distribution by ranking the values.
• Steps Involved:

1. Sort the Data: Arrange values in ascending order.


2. Compute Rank: For each data point x_i, calculate the rank as r_i = i/n, where i is the index in the sorted list and n is the total number of samples.
3. Map to Target Distribution:
– Uniform Mapping: y_i = r_i = i/n
– Normal Mapping: y_i = Φ⁻¹(r_i), where Φ⁻¹ is the inverse cumulative distribution function (CDF) of the standard normal distribution.
4. Handle Ties: For duplicate values, assign the average of their ranks to ensure consistency.
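A minimal numpy/scipy sketch of these steps; dividing the ranks by n + 1 rather than n is an assumption here, made to keep Φ⁻¹ finite at the maximum (scikit-learn's QuantileTransformer adds interpolation on top of the same idea):

import numpy as np
from scipy.stats import norm, rankdata

x = np.array([3.1, 0.2, 5.7, 0.2, 9.4])

# Ranks with ties averaged (step 4); scale into (0, 1)
r = rankdata(x, method="average") / (len(x) + 1)

uniform_mapped = r            # uniform output: the rank itself
normal_mapped = norm.ppf(r)   # normal output: inverse CDF of the rank
print(uniform_mapped.round(3), normal_mapped.round(3))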

Encoding Numerical Features


Encoding numerical features involves transforming continuous or discrete numeric values into formats that are more suitable
for machine learning algorithms.

1. Discretization / Binning
Discretization transforms continuous numerical features into categorical values by grouping them into discrete intervals or
bins.

• Unsupervised Binning: Bin boundaries are determined without using information from the target variable.
– Uniform Binning (Equal Width): Divides the range of the variable into k equal-width intervals.
– Quantile Binning (Equal Frequency): Ensures each bin has approximately the same number of observations.
– K-means Binning: Uses the K-means clustering algorithm to determine bin centers based on feature distribution.
• Supervised Binning: Bin boundaries are chosen based on the relationship between the feature and the target variable
(e.g., maximizing information gain or minimizing impurity).
• Custom Binning: Bin intervals are manually defined using domain expertise or data-specific knowledge.

2. Binarization
Binarization converts continuous numerical values into binary values (0 or 1) based on a predefined threshold.

• Binary Encoding Rule:

y_i = 0, if x_i ≤ threshold
y_i = 1, if x_i > threshold

• Often used to simplify features or prepare them for algorithms that expect binary input.
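scikit-learn covers both operations; a short sketch (the bin count and threshold are arbitrary choices):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

X = np.array([[1.0], [7.0], [8.0], [25.0], [40.0]])

# strategy is "uniform" (equal width), "quantile" (equal frequency) or "kmeans"
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
print(disc.fit_transform(X).ravel())

# Everything above the threshold becomes 1, the rest 0
print(Binarizer(threshold=10.0).fit_transform(X).ravel())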

Outliers
Outlier Detection Techniques
1. Mean and Standard Deviation Method (For Normally Distributed Data)

• Lower Bound: µ − 3σ
• Upper Bound: µ + 3σ
• Any value outside this range is considered an outlier.
2. Interquartile Range (IQR) Method (For Skewed Data)

• Q1 = 25th percentile, Q3 = 75th percentile


• IQR = Q3 − Q1
• Lower Bound: Q1 − 1.5 × IQR
• Upper Bound: Q3 + 1.5 × IQR

• Values outside this range are considered outliers.
3. Percentile Method
• Outliers are defined based on extreme percentiles.
• Common thresholds: below the 1st percentile or above the 99th percentile.
• Useful for non-normal or highly skewed distributions.

Outlier Handling Techniques


1. Trimming (Removing Outliers)

• Outliers are removed from the dataset.


• Useful when outliers are likely due to data entry errors or noise.
2. Capping (Winsorizing)
• Outliers are replaced with the nearest boundary value (e.g., 1st or 99th percentile).
• Helps retain all data points while limiting the influence of extreme values.
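A numpy sketch of IQR-based detection followed by trimming and capping:

import numpy as np

x = np.array([12.0, 14.0, 15.0, 14.5, 13.8, 90.0])  # 90.0 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]    # detection
trimmed = x[(x >= lower) & (x <= upper)]   # trimming: drop the outliers
capped = np.clip(x, lower, upper)          # capping: pull them to the bounds
print(outliers, trimmed, capped)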

Handling Missing Values

Figure 1: Handling Missing Values

Handling missing values is a critical preprocessing step. Strategies depend on the type, amount, and randomness of the
missing data.

Removing
Rows with missing values can be removed under specific conditions to avoid introducing bias or information loss.
• Missing Completely at Random (MCAR): Ensure that the missing values are random and not dependent on any
feature.
• Low Missing Percentage: Preferably less than 5% of the dataset.
• Distribution Preservation: Verify that the removal does not significantly affect the data distribution:
– The probability density function (PDF) of numerical columns should remain similar.
– The distribution of categorical values should also remain consistent.

KNN Imputer
Row feature1 feature2 feature3
0 1.0 2.0 1.5
1 2.0 NaN 2.5
2 NaN 6.0 3.5
3 4.0 8.0 NaN
4 5.0 10.0 5.5

Step 1: Calculate Squared Euclidean Distance


We compute the squared Euclidean distance for each pair of rows using only the available (non-missing) values.

Distance(A, B) = Σ_i ( feature_i^(A) − feature_i^(B) )²

Row 0 Row 1 Row 2 Row 3 Row 4


Row 0 0 2 20 45 96
Row 1 2 0 1 34.25 18
Row 2 20 1 0 4 20
Row 3 45 34.25 4 0 5
Row 4 96 18 20 5 0

Step 2: Adjust Distance Using Missing Values


We modify the squared Euclidean distance to account for missing values using the formula:
Adjusted Distance = √( Squared Distance × (Total Columns / Columns Used) )
where:
• Total Columns = 3 (feature1, feature2, feature3)
• Columns Used = Number of non-missing values used in distance computation.
For example, Rows 0 and 1 share two non-missing columns, so their adjusted distance is √(2 × 3/2) ≈ 1.732. Applying the same adjustment to every pair of rows gives:

Row 0 Row 1 Row 2 Row 3 Row 4


Row 0 0.000 1.732 5.477 9.486 16.970
Row 1 1.732 0.000 1.732 10.143 5.196
Row 2 5.477 1.732 0.000 3.464 5.477
Row 3 9.486 10.143 3.464 0.000 2.738
Row 4 16.970 5.196 5.477 2.738 0.000

Step 3: Apply Uniform or Distance Method for Imputation
• Uniform Method: Uses the average of the feature values of the K nearest neighbors.
• Distance Method: Uses the weighted average where weights are inverse distances:
Imputed Value = Σ( (1/dist) × feature value ) / Σ( 1/dist )

Row   feature1   feature2   feature3
0     1.0        2.0        1.5
1     2.0        (2.0/1.732 + 6.0/5.196 + 8.0/10.143 + 10.0/16.970) / (1/1.732 + 1/5.196 + 1/10.143 + 1/16.970)        2.5
2     (1.0/5.477 + 2.0/1.732 + 4.0/3.464 + 5.0/5.477) / (1/5.477 + 1/1.732 + 1/3.464 + 1/5.477)        6.0        3.5
3     4.0        8.0        (1.5/9.486 + 2.5/10.143 + 3.5/3.464 + 5.5/2.738) / (1/9.486 + 1/10.143 + 1/3.464 + 1/2.738)
4     5.0        10.0       5.5
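scikit-learn's KNNImputer follows the same nan-aware distance scheme; a sketch on the table above (its results may differ slightly from the hand calculation):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, 1.5],
              [2.0, np.nan, 2.5],
              [np.nan, 6.0, 3.5],
              [4.0, 8.0, np.nan],
              [5.0, 10.0, 5.5]])

# weights="distance" gives the inverse-distance averaging shown above;
# weights="uniform" would take a plain mean of the neighbors instead
imputer = KNNImputer(n_neighbors=4, weights="distance")
print(imputer.fit_transform(X).round(3))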

Iterative Imputation
Iterative imputation is an advanced technique used to handle missing values in a dataset. Unlike simple imputation methods,
such as filling missing values with the mean, iterative imputation attempts to predict missing values based on other features
in the dataset using a model.

• Initial Imputation: Start by filling missing values (NaN) with the mean or median of the respective feature to provide
a starting point for the imputation process.
• Choose a NaN value (X) to impute from the dataset.
• Train a model (e.g., linear regression, decision tree, etc.) where:
– The features (inputs) for training are the other columns of the dataset, restricted to the rows where the column containing X is observed.
– The target variable (the value we want to predict) is the column that contains the missing value X.
• Predict the missing value: Once the model is trained, feed it the feature values from the row with the missing entry to predict X.
• Repeat the process: Repeat this process for all missing values in the dataset.
• Update the missing values: After imputing all the missing values in the dataset, update the missing values with
the predicted values from the model.
• Iterative updates: Instead of using the mean for imputation in the next iteration, use the newly predicted values as
the starting point for the next iteration of imputation.
• Convergence criteria: Repeat this process for several iterations or until the imputed values stabilize, meaning the
changes between iterations become minimal.
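This procedure is what scikit-learn's (still experimental) IterativeImputer implements; a minimal sketch:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, 1.5],
              [2.0, np.nan, 2.5],
              [np.nan, 6.0, 3.5],
              [4.0, 8.0, np.nan],
              [5.0, 10.0, 5.5]])

# Default estimator is BayesianRidge; max_iter bounds the refinement rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X).round(3))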

Handling Imbalanced Data
1. Undersampling and Oversampling

2. SMOTE (Synthetic Minority Oversampling Technique)


• Train a k-NN model on minority class observations:
– Identify the k nearest neighbors for each minority class sample (commonly k = 5). These neighbors are used
to generate synthetic examples that lie within the same feature space as the original data points.
• Create synthetic data:
1. Select 1 example randomly from the minority class. This serves as the base for creating synthetic examples.
2. Select one neighbor randomly from its k-nearest neighbors. This step ensures that the generated samples are
close to the original data points in the feature space.
3. Extract a random number α between 0 and 1 for interpolation. This value controls how much the synthetic
sample will deviate from the original sample.
4. Generate the synthetic sample using the formula:

Synthetic sample = Original sample + α × (Neighbor − Original sample)

This step interpolates between the original sample and its neighbor, creating a new sample that lies in between.
• Repeat the process to create multiple synthetic samples for each minority class observation.
• Combine the original dataset with synthetic samples to form a balanced dataset. This helps ensure that the
model is not biased towards the majority class.
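The imbalanced-learn package (an assumption here; it is a separate install alongside scikit-learn) provides this algorithm directly:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# k_neighbors=5 matches the common choice described above
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))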

3. Ensemble Methods

4. Giving Weights to Different Classes


• Class weights are applied to penalize misclassifications of the minority class more heavily. The greater the class imbalance, the higher the weight assigned to the minority class, making its errors count more during training.
• This can be especially helpful in algorithms like Logistic Regression or Support Vector Machines, where the decision
boundary is influenced by the weights given to each class.
• While applying class weights can help in improving model performance, it may require fine-tuning to balance the
performance of both classes without overfitting to the minority class.

CLASSIFICATION METRICS
Metric — Formula / Definition — Interpretation

• Accuracy: (TP + TN) / (TP + TN + FP + FN). Proportion of total predictions that were correct. Can be misleading on imbalanced datasets. Greater values indicate better performance.
• Precision: TP / (TP + FP) = True Positive / Predicted Positive. Proportion of positive predictions that were truly positive. Greater values indicate better performance.
• Recall (Sensitivity): TP / (TP + FN) = True Positive / Actual Positive. Proportion of actual positive instances correctly identified. Greater values indicate better performance.
• Specificity (True Negative Rate): TN / (TN + FP) = True Negative / Actual Negative. Proportion of actual negative instances correctly identified. Greater values indicate better performance.
• Balanced Accuracy: (Recall + Specificity) / 2 = ½ ( TP/(TP + FN) + TN/(TN + FP) ). Average of recall for each class. Useful for imbalanced datasets as it balances TPR and TNR. Greater values indicate better performance.
• MCC: (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Measures the correlation between the observed and predicted binary classifications. Ranges from −1 (perfect inverse prediction) to +1 (perfect prediction); 0 indicates a random prediction. Useful for imbalanced data. Higher values indicate better performance.
• Cohen’s Kappa (κ): κ = (P_o − P_e) / (1 − P_e). Measures agreement between predicted and actual labels, correcting for chance agreement. P_o is the observed agreement (accuracy), P_e is the expected agreement by chance. Ranges from −1 to +1; 0 means chance agreement, +1 is perfect agreement. Higher values indicate better performance.
• F1-score: 2 · Precision · Recall / (Precision + Recall). Harmonic mean of precision and recall; balances both equally (β = 1). Greater values indicate better performance.
• Fβ-score: (1 + β²) · Precision · Recall / (β² · Precision + Recall). Weighted harmonic mean of precision and recall. β weights recall relative to precision: β > 1 favors recall, β < 1 favors precision. Greater values indicate better performance.
• Log Loss: −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]. Penalizes confident incorrect probabilistic predictions. Lower values indicate better performance.
• Area Under ROC (AUC): area under the ROC curve (TPR vs FPR). Represents the model’s ability to distinguish between positive and negative classes. Greater values (closer to 1) indicate better performance.
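Most of these metrics are one-liners in scikit-learn; a sketch on dummy predictions:

from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             matthews_corrcoef, cohen_kappa_score, log_loss, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0]                       # hard labels
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.9, 0.1, 0.2]   # probabilities

print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("f1       ", f1_score(y_true, y_pred))
print("mcc      ", matthews_corrcoef(y_true, y_pred))
print("kappa    ", cohen_kappa_score(y_true, y_pred))
print("log loss ", log_loss(y_true, y_prob))
print("auc      ", roc_auc_score(y_true, y_prob))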

ROC Curves
When evaluating machine learning classification models, especially binary classifiers, simply looking at accuracy can be
misleading. The Receiver Operating Characteristic (ROC) curve is a powerful graphical tool that provides a more
comprehensive view of a model’s performance across all possible classification thresholds.
An ROC curve plots two key metrics against each other:

1. True Positive Rate (TPR) (Sensitivity or Recall): The proportion of actual positive instances correctly identified.

TPR = TP / (TP + FN) = TP / P

2. False Positive Rate (FPR): The proportion of actual negative instances incorrectly identified as positive.

FPR = FP / (FP + TN) = FP / N

Where:
• TP: True Positive (Actual is Positive, Predicted is Positive)
• FP: False Positive (Actual is Negative, Predicted is Positive)

• FN: False Negative (Actual is Positive, Predicted is Negative)


• TN: True Negative (Actual is Negative, Predicted is Negative)
• N: Total Negative
• P: Total Positive

How the ROC Curve is Generated and Calculated: An Example


Most classification models output a probability or score for an instance belonging to the positive class. A decision threshold
is used to convert this score into a binary prediction. The ROC curve is generated by varying this threshold from very high
to very low. For each potential threshold, the TPR and FPR are calculated, and these (FPR, TPR) pairs are plotted.
Let’s illustrate with a simple example. Suppose we have 10 instances, 5 actual positives (Spam=1) and 5 actual negatives
(Not Spam=0). We have the model’s predicted probability for each. To generate the ROC curve points, we sort the instances
by their predicted probability (highest first) and then calculate the TP, FP, TPR, and FPR values for thresholds that
incrementally include one or more additional instances as ’predicted positive’. Note that the total number of actual positives,
P=5 and negatives, N=5.
The table below shows the sorted instances and the step-by-step calculation of the ROC points as the effective threshold
is lowered, focusing on TP and FP which directly determine TPR and FPR. Total Actual Positives (P) = 5, Total Actual
Negatives (N) = 5.

Sorted Instance ID   Actual Label   Predicted Prob   TP   FP   TPR (TP/P)   FPR (FP/N)   ROC Point
1.0 0 0 0 0 (0.0, 0.0)
1 1 0.9 1 0 0.2 0 (0.0, 0.2)
2 0 0.8 1 1 0.2 0.2 (0.2, 0.2)
3 1 0.75 2 1 0.4 0.2 (0.2, 0.4)
4 1 0.7 3 1 0.6 0.2 (0.2, 0.6)
5 0 0.6 3 2 0.6 0.4 (0.4, 0.6)
6 1 0.55 4 2 0.8 0.4 (0.4, 0.8)
7 0 0.4 4 3 0.8 0.6 (0.6, 0.8)
8 1 0.3 5 3 1.0 0.6 (0.6, 1.0)
9 0 0.2 5 4 1.0 0.8 (0.8, 1.0)
10 0 0.1 5 5 1.0 1.0 (1.0, 1.0)

The ROC curve is plotted using the (FPR, TPR) pairs from the last column of the table above.

Interpreting the ROC Curve
• The x-axis (FPR) represents the rate of false alarms: at a given threshold, the fraction of all actual negatives that have been incorrectly flagged as positive.
• The y-axis (TPR) represents the rate of correct positive predictions: at a given threshold, the fraction of all actual positives that have been correctly identified.
• The ideal point is (0,1) - 0% false positives and 100% true positives.
• The diagonal line from (0,0) to (1,1) represents a random classifier. A good model’s curve bows towards the top-left
corner, staying as far from the diagonal as possible.

Area Under the Curve (AUC)


The Area Under the ROC Curve (AUC) is a single number that summarizes the overall performance depicted by the
curve.

• AUC ranges from 0 to 1.


• An AUC of 1 indicates a perfect classifier.

• An AUC of 0.5 indicates a random classifier.


• An AUC less than 0.5 indicates performance worse than random.
• Higher AUC values generally indicate better model performance across various thresholds.

For our example points, the AUC can be calculated by summing the areas of the trapezoids between consecutive points
(FPRi , TPRi ) and (FPRi+1 , TPRi+1 ):
AUC = Σ_i (FPR_{i+1} − FPR_i) × (TPR_i + TPR_{i+1}) / 2

Calculating the AUC for our example points gives approximately 0.72, which indicates the model performs better than random.
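A quick numpy check of the trapezoidal sum over the points in the table (np.trapz performs the same integration):

import numpy as np

fpr = np.array([0.0, 0.0, 0.2, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 1.0])
tpr = np.array([0.0, 0.2, 0.2, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0, 1.0, 1.0])

# Trapezoidal rule: sum of ΔFPR × average TPR over consecutive points
print(np.trapz(tpr, fpr))  # 0.72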

Cohen’s Kappa
In evaluating classification models, while accuracy measures the overall proportion of correct predictions, it doesn’t account
for correct predictions that could happen purely by chance. This is where Cohen’s Kappa (κ) provides a more robust and
interpretable evaluation.

What is Cohen’s Kappa?


Cohen’s Kappa (κ) is a statistical measure used to evaluate the level of agreement between two raters (or between a model’s predictions and the ground truth), correcting for the agreement that occurs randomly.

Why Cohen’s Kappa? (Beyond Accuracy)


Accuracy is the ratio of correct predictions to total predictions. However, it can be misleading in imbalanced datasets. For
example, if 95% of instances belong to class A, a model predicting only class A will still achieve 95% accuracy without
meaningful classification.
Cohen’s Kappa adjusts for chance agreement, offering a more reliable measure when class imbalance or random agreement
is a concern.

Formula
Cohen’s Kappa (κ) is defined as:

κ = (P_o − P_e) / (1 − P_e)
Where:
• P_o is the observed agreement (accuracy):

P_o = (Σ_k n_kk) / N

• P_e is the expected agreement by chance:

P_e = (Σ_k n_k· × n_·k) / N²

Here, n_kk is the number of correct predictions for class k, n_k· is the total actual instances of class k, n_·k is the total predicted instances of class k, and N is the total number of instances.

Interpretation Guidelines for κ


Cohen’s κ ranges from -1 to 1:
• +1: Perfect agreement

• 0: Agreement by chance
• -1: Complete disagreement
Interpretation (Landis & Koch, 1977):

κ Value Agreement Level


<0 Less than chance
0.01–0.20 Slight
0.21–0.40 Fair
0.41–0.60 Moderate
0.61–0.80 Substantial
0.81–1.00 Almost perfect

Multiclass Classification Example
Let’s consider a 3-class problem (A, B, C) with the following confusion matrix for 120 instances:

Predicted A Predicted B Predicted C Total


Actual A 40 5 5 50
Actual B 3 32 5 40
Actual C 2 3 25 30
Total 45 40 35 120

Total instances N = 120.


Step 1: Calculate Observed Agreement (Po )
Po is the sum of the diagonal elements divided by the total instances:
P_o = (40 + 32 + 25) / 120 ≈ 0.8083

Step 2: Calculate Expected Agreement by Chance (Pe )


P_e is the sum, over each class, of the product of the actual total for that class and the predicted total for that class, divided by N²:

P_e = ((50 × 45) + (40 × 40) + (30 × 35)) / (120 × 120) ≈ 0.3403

Step 3: Calculate Kappa (κ)

κ = (P_o − P_e) / (1 − P_e) = (0.8083 − 0.3403) / (1 − 0.3403) = 0.4680 / 0.6597 ≈ 0.7095
According to the interpretation scale, a κ of ≈ 0.71 indicates Substantial agreement.
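The same number falls out of scikit-learn if the confusion matrix is expanded back into label vectors; a verification sketch:

import numpy as np
from sklearn.metrics import cohen_kappa_score

conf = np.array([[40, 5, 5],    # rows = actual A, B, C
                 [3, 32, 5],    # cols = predicted A, B, C
                 [2, 3, 25]])

y_true, y_pred = [], []
for actual in range(3):
    for pred in range(3):
        y_true += [actual] * int(conf[actual, pred])
        y_pred += [pred] * int(conf[actual, pred])

print(round(cohen_kappa_score(y_true, y_pred), 4))  # ≈ 0.7095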

REGRESSION METRICS
Metric — Formula / Definition — Interpretation

• Mean Absolute Error (MAE): (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|. Average absolute difference between predicted and actual values; measures average error magnitude. Lower values indicate better performance.
• Mean Squared Error (MSE): (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)². Average squared difference between predicted and actual values; penalizes larger errors more heavily than MAE. Lower values indicate better performance.
• Root Mean Squared Error (RMSE): √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ). Square root of MSE, in the same units as the response variable. Lower values indicate better performance.
• R-squared (R²): 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)². Proportion of the variance in the dependent variable that is predictable from the independent variable(s). Ranges from 0 to 1; higher values (closer to 1) indicate better fit.
• Adjusted R-squared: 1 − (1 − R²)(n − 1) / (n − p − 1). Modified R² that adjusts for the number of predictors (p) and observations (n); penalizes the inclusion of irrelevant predictors. Useful for model comparison. Higher values indicate better performance.

DIMENSION REDUCTION
Principal Component Analysis (PCA)
• Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into
a smaller set of new features called principal components. These components are orthogonal (uncorrelated) and capture
the maximum variance present in the data. The goal of PCA is to reduce the number of features while retaining the
most significant information, making the dataset easier to analyze, visualize, and model.
• The key idea behind PCA is to project the data onto directions that maximize the variance, thus helping uncover
hidden patterns and reducing noise in the data.

Steps for Principal Component Analysis (PCA)


1. Mean Centering the Data:

• The first step is to center the data by subtracting the mean of each feature (column). This step ensures that the
data has zero mean and allows PCA to focus on the variance rather than the absolute values of the features.
• Mathematically, for a feature fi , subtract its mean, i.e., fi′ = fi − µfi , where µfi is the mean of feature fi .
2. Compute the Covariance Matrix:

• The covariance matrix C captures the relationships between the features. If X is the matrix of mean-centered
data, the covariance matrix is computed as:
C = (1 / (n − 1)) XᵀX

where n is the number of samples and X T is the transpose of the mean-centered data matrix.
• Each element in the covariance matrix represents how two features vary together. Diagonal elements represent
the variance of each feature, and off-diagonal elements show the covariance between pairs of features.
      f1            f2            . . .   fm
f1    cov(f1, f1)   cov(f1, f2)   . . .   cov(f1, fm)
f2    cov(f2, f1)   cov(f2, f2)   . . .   cov(f2, fm)
...   ...           ...           . . .   ...
fm    cov(fm, f1)   cov(fm, f2)   . . .   cov(fm, fm)
3. Calculate Eigenvalues and Eigenvectors:
• Eigenvalues represent the magnitude of variance explained by each principal component, while eigenvectors rep-
resent the directions (or axes) along which the data varies the most.
• Solve for the eigenvalues (λ) and eigenvectors (v) of the covariance matrix C using the equation:

Cv = λv

where v is the eigenvector and λ is the eigenvalue. The eigenvectors form the basis of the new feature space.
4. Select Principal Components:

• Sort the eigenvalues in descending order to identify the most important principal components that explain the
most variance.
• Choose the top p eigenvectors corresponding to the largest eigenvalues. These eigenvectors represent the directions
that capture the most significant variance in the data.
• Construct a matrix V containing the selected p eigenvectors. The dimensions of V will be m × p, where m is the
number of features, and p is the number of principal components chosen.
5. Project Data onto New Subspace:

• To reduce the dimensionality of the data, multiply the mean-centered data matrix X by the matrix V containing
the selected eigenvectors:

Xnew = X · V

• This transformation projects the data onto the new subspace defined by the top p principal components. The
resulting data matrix Xnew will have n rows (same number of samples) and p columns (number of selected principal
components).

6. Reconstruction (Optional):
• In some cases, you might want to reconstruct the original data using the reduced-dimensional data. This is done
by multiplying the reduced data matrix by the transpose of the eigenvector matrix:

Xreconstructed = Xnew · V T

• The reconstructed data will be an approximation of the original data, with the error proportional to the variance
lost by discarding less significant components.
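The whole procedure in a few lines of numpy, following the steps above (p = 2 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 samples, 5 features

Xc = X - X.mean(axis=0)                  # step 1: mean centering
C = (Xc.T @ Xc) / (len(Xc) - 1)          # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # step 3: eigen-decomposition (C is symmetric)

order = np.argsort(eigvals)[::-1]        # step 4: sort by descending eigenvalue
p = 2
V = eigvecs[:, order[:p]]                # top-p principal directions

X_new = Xc @ V                           # step 5: project onto the subspace
X_rec = X_new @ V.T + X.mean(axis=0)     # step 6 (optional): approximate reconstruction
print(X_new.shape, eigvals[order[:p]].round(3))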

Linear Discriminant Analysis (LDA)
Given a dataset with c classes, let:
• X = [x1 , x2 , . . . , xn ] represent the dataset with n samples.
• Each sample xi ∈ Rd is a d-dimensional feature vector.
• Let Xj be the subset of samples belonging to class j, which has nj samples.

The main goal of LDA is to find a projection that maximizes the separability between different classes while minimizing
the variance within each class.

Steps in Linear Discriminant Analysis (LDA)


1. Class Mean Vector:
• For each class j, calculate the mean of the samples:
m_j = (1 / n_j) Σ_{x ∈ X_j} x

where nj is the number of samples in class j.


2. Within-Class Scatter Matrix:

• This matrix captures the spread of each class around its own mean:
S_W = Σ_{j=1}^{c} Σ_{x ∈ X_j} (x − m_j)(x − m_j)ᵀ

where x represents a sample and mj is the class mean.


3. Between-Class Scatter Matrix:
• This matrix quantifies how far apart the class means are from the overall mean m of the dataset:
S_B = Σ_{j=1}^{c} n_j (m_j − m)(m_j − m)ᵀ

where nj is the number of samples in class j and m is the overall mean of the dataset.
4. Optimization Objective:

• The goal is to find a projection vector w that maximizes the ratio of between-class variance to within-class variance:

J(w) = (wᵀ S_B w) / (wᵀ S_W w)
This ratio is maximized when the classes are as far apart as possible while keeping the variance within each class
as small as possible.
5. Eigenvalue Problem:

• Solve the generalized eigenvalue problem to find the projection vector:

SB w = λSW w

where SW and SB are the within-class and between-class scatter matrices, respectively, and λ is the eigenvalue.
• Compute the eigenvalues and eigenvectors of the matrix S_W⁻¹ S_B.
• Sort the eigenvectors in descending order based on their corresponding eigenvalues.
• Select the top k eigenvectors (where k ≤ c − 1) to form the transformation matrix W, where each eigenvector
corresponds to a principal axis in the new subspace.

Key Points
• Assumptions: LDA assumes that:
– Each class follows a Gaussian distribution with identical covariance matrices across all classes.
– The data is normally distributed, and the classes have the same covariance.
– LDA is sensitive to outliers and imbalances in the class distribution.
• LDA aims to maximize the distance between class means and minimize the variance within each class, thereby improving
class separability.
• Computationally Efficient: LDA is generally faster and computationally simpler than non-linear dimensionality
reduction methods like t-SNE or UMAP because it is based on linear transformations.
• LDA vs. PCA:
– PCA: A data-driven, unsupervised technique that focuses on maximizing the total variance in the data, irrespec-
tive of class labels.
– LDA: A supervised method that uses class labels to maximize the separability between classes, thereby focusing
on finding dimensions that best distinguish between classes.
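A compact numpy/scipy sketch of the scatter matrices and the generalized eigenproblem (the two-class toy data is illustrative):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

m = X.mean(axis=0)
Sw = np.zeros((3, 3))   # within-class scatter
Sb = np.zeros((3, 3))   # between-class scatter
for j in np.unique(y):
    Xj = X[y == j]
    mj = Xj.mean(axis=0)
    Sw += (Xj - mj).T @ (Xj - mj)
    Sb += len(Xj) * np.outer(mj - m, mj - m)

# Generalized eigenproblem S_B w = λ S_W w (eigh returns ascending eigenvalues)
eigvals, eigvecs = eigh(Sb, Sw)
W = eigvecs[:, ::-1][:, :1]   # top k = c − 1 = 1 direction
print((X @ W).shape, eigvals[::-1][:1].round(3))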

Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a powerful matrix factorization technique used in linear algebra, data analysis, and
machine learning. It decomposes a matrix into three simpler matrices and is widely used for dimensionality reduction, noise
reduction, and feature extraction.
Given a real matrix A ∈ Rm×n , the SVD is defined as:

A = U ΣV T

where:
• U ∈ Rm×m is an orthogonal matrix whose columns are the left singular vectors,
• Σ ∈ Rm×n is a diagonal matrix with non-negative real numbers on the diagonal (the singular values),
• V ∈ Rn×n is an orthogonal matrix whose columns are the right singular vectors.

Step-by-Step Dimensionality Reduction Using SVD


1. Perform SVD on A:
• Decompose the matrix A into its singular value decomposition:

A = U ΣV T

• Here:
– U ∈ Rm×m contains the left singular vectors as columns.
– Σ ∈ Rm×n is a diagonal matrix with singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0, where r = rank(A).
– V T ∈ Rn×n is the transpose of the matrix of right singular vectors.

2. Truncate U , Σ, and V T to top-k components:


• To reduce the matrix A to a lower-dimensional approximation using the top k singular values:
– Let Uk ∈ Rm×k be the first k columns of U ,
– Let Σk ∈ Rk×k be the top-left k × k block of Σ,
– Let VkT ∈ Rk×n be the first k rows of V T .
• The rank-k approximation of A is:

Ak = Uk Σk VkT

3. Compute the Reduced Representation A′ :


• Often, a reduced feature representation is desired:

A′ = AVk = Uk Σk ∈ Rm×k

• The matrix A′ captures the essential structure of the original data in a lower-dimensional space of dimension k,
preserving the most significant singular values and their directions.
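The truncation in numpy (full_matrices=False gives the compact decomposition; k = 2 is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U · diag(s) · Vt

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]            # rank-k approximation of A
A_reduced = A @ Vt[:k].T                           # = U_k Σ_k, the k-dimensional representation
print(Ak.shape, A_reduced.shape, s[:k].round(3))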

Singular Value Decomposition (SVD) Calculation Steps


To compute SVD manually for a matrix A ∈ Rm×n , follow the steps below:

1. Compute AAT :
• Calculate the matrix product AAT ∈ Rm×m .
• Find its eigenvalues λ1 , λ2 , . . . , λm and the corresponding eigenvectors u1 , u2 , . . . , um .
• The eigenvectors form the columns of U , and the eigenvalues are used to compute singular values.

2. Compute AT A:
• Compute AT A ∈ Rn×n .

• Find its eigenvalues and corresponding eigenvectors v1 , v2 , . . . , vn .
• The eigenvectors form the columns of V .

3. Normalize the eigenvectors:


• Ensure that each eigenvector has unit length to construct orthogonal matrices U and V .

4. Sort eigenvalues and corresponding vectors:


• Sort the eigenvalues of AAT and AT A in descending order.
• Reorder the corresponding eigenvectors to match the order of the sorted eigenvalues.

5. Compute Singular Values:

• The singular values σi are the square roots of the non-zero eigenvalues:
σ_i = √λ_i

6. Form the Diagonal Matrix Σ:

• Construct Σ ∈ R^{m×n} with the singular values on the leading diagonal and zeros elsewhere:

Σ = diag(σ1, σ2, . . . , σr), padded with zero rows/columns to size m × n,

where r = rank(A).

7. Transpose V :
• The matrix V T is simply the transpose of V .

8. Combine to Form the SVD:

• The original matrix A can now be written as:

A = U ΣV T

MACHINE LEARNING MODELS

Linear, Ridge, Lasso & Elastic Regression


Linear Regression
• X: Design matrix of input features with dimensions 1000 × 10 (including a column of ones for the intercept and
9 predictor variables).
• Y : Column vector of observed outcomes, with dimensions 1000 × 1.
• β: Column vector of regression coefficients, with dimensions 10 × 1.
• Ŷ : Column vector of predicted values, with dimensions 1000 × 1.

Prediction
The predicted values Ŷ are obtained using the linear model:

Ŷ = Xβ

1. Closed-Form Solution (Ordinary Least Squares)

We consider a linear regression model:

Y = Xβ + ϵ

where:

• Y ∈ Rn×1 is the response/output vector,


• X ∈ Rn×p is the feature matrix (including a column of 1s for the intercept, if applicable),
• β ∈ Rp×1 is the vector of coefficients to estimate,
• ϵ is the error term (noise).

Our goal is to find β that minimizes the Residual Sum of Squares (RSS): L(β) = ∥Y − Xβ∥²
This is equivalent to minimizing the squared Euclidean norm of the residuals.

Step 1: Expand the loss function
Using the identity ∥a∥² = a⊤a:

L(β) = (Y − Xβ)⊤ (Y − Xβ)

Now expand using the distributive property of matrix multiplication:

L(β) = Y ⊤ Y − Y ⊤ Xβ − β ⊤ X ⊤ Y + β ⊤ X ⊤ Xβ
= Y ⊤ Y − 2β ⊤ X ⊤ Y + β ⊤ X ⊤ Xβ

(Note: Y ⊤ Xβ and β ⊤ X ⊤ Y are scalars and equal; we combine them.)

Step 2: Take the derivative of L(β) w.r.t. β


Using standard matrix calculus rules: ∂/∂β (β⊤Aβ) = 2Aβ when A is symmetric, and ∂/∂β (b⊤β) = b.

We get: ∂L/∂β = −2X⊤Y + 2X⊤Xβ

Step 3: Set the derivative to zero for minimization

−2X ⊤ Y + 2X ⊤ Xβ = 0
X ⊤ Xβ = X ⊤ Y

Step 4: Solve for β


Assuming that X ⊤ X is invertible:

β = (X ⊤ X)−1 X ⊤ Y

Note: If X ⊤ X is not invertible (i.e., singular), we may use regularization (like Ridge Regression) or the Moore-Penrose
pseudoinverse instead.

2. Iterative Solution (Gradient Descent)

An alternative approach is to use gradient descent to iteratively update the coefficients. For n epochs, the update rule
is:

∇_β L = dL/dβ = −2X⊤Y + 2X⊤Xβ

β^(n) = β^(n−1) − (α / 1000) · ∇_β L

where α is the learning rate and 1000 is the number of observations (used for averaging the gradient).
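Both solutions in a few lines of numpy (the synthetic data is illustrative; the two estimates agree closely):

import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column of 1s
beta_true = rng.normal(size=p)
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form: β = (XᵀX)⁻¹ XᵀY, via a linear solve rather than an explicit inverse
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on the averaged gradient
beta_gd = np.zeros(p)
alpha = 0.1
for _ in range(2000):
    beta_gd -= alpha * (-2 * X.T @ (Y - X @ beta_gd)) / n

print(np.abs(beta_ols - beta_gd).max())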

Polynomial Regression
To model non-linear relationships, polynomial features can be introduced. For example, given three input features x,
y, and z, a second-degree polynomial expansion includes:

x², y², z², xy, xz, yz

These new features are added to the design matrix, and then standard linear regression (e.g., OLS or gradient descent)
is applied on the transformed feature space.

Ridge, Lasso and Elastic Net Regression

Ridge Regression (L2 Regularization)


Ridge regression is a regularized version of linear regression that adds an L2 penalty to the loss function, thereby
discouraging large coefficients and reducing model complexity:

L(β) = Σ_{i=1}^{m} (y_i − ŷ_i)² + λ Σ_{j=1}^{n} β_j²

Where:

• L(β) is the loss function, comprising the residual sum of squares and the regularization term.
• λ is the regularization parameter that controls the magnitude of the penalty.

The gradient of the loss function with respect to the coefficients is given by:

∂L/∂β = −2X⊤(Y − Xβ) + 2λβ

Here, X⊤X is the Gram matrix of the input features, and λI is the regularization matrix. I is an identity matrix whose first diagonal entry is set to 0, so that the intercept term is not penalized:

I = [ 0  0  ...  0
      0  1  ...  0
      .  .   .   .
      0  0  ...  1 ]

The closed-form solution for the Ridge regression coefficients is:

β = (X ⊤ X + λI)−1 X ⊤ Y

Alternatively, using gradient descent, the update rule for the coefficients is:

β^(t+1) = β^(t) − α · ∂L/∂β |_{β = β^(t)}

Where α is the learning rate and β (t) represents the coefficient vector at the t-th iteration.
For a simple linear regression model, the slope m under Ridge regularization is computed as:

m = Σ(y_i − ȳ)(x_i − x̄) / ( Σ(x_i − x̄)² + λ )

Note: As λ increases, the magnitude of the coefficients decreases, but they never reach zero. Therefore, Ridge regression
is not typically used for feature selection.

Lasso Regression (L1 Regularization)


Lasso regression introduces an L1 penalty to the loss function, which encourages sparsity in the coefficient estimates,
effectively performing feature selection:

L(β) = Σ_{i=1}^{m} (y_i − ŷ_i)² + λ Σ_{j=1}^{n} |β_j|

For the slope m in the case of simple linear regression, Lasso regression modifies the computation as follows:

• For m > 0:

m = ( Σ(y_i − ȳ)(x_i − x̄) − λ ) / Σ(x_i − x̄)²

• For m = 0: the slope is set to exactly zero when the unpenalized numerator is small, i.e., when |Σ(y_i − ȳ)(x_i − x̄)| ≤ λ.
• For m < 0:

m = ( Σ(y_i − ȳ)(x_i − x̄) + λ ) / Σ(x_i − x̄)²

Lasso regression is particularly effective when you wish to eliminate some features entirely by setting their coefficients
to zero, especially when λ is large.
Note: Lasso is often used for feature selection because, as λ increases, it forces some coefficients to become exactly
zero, thereby removing those features from the model.

Elastic Net Regression


Elastic Net regression combines both L1 (Lasso) and L2 (Ridge) regularization penalties, offering a balance between
the two methods. This is useful when there are multiple correlated features or when the number of predictors exceeds
the number of observations.
The Elastic Net loss function is:

L(β) = Σ_{i=1}^{m} (y_i − ŷ_i)² + λ₁ Σ_{j=1}^{n} |β_j| + λ₂ Σ_{j=1}^{n} β_j²

Where:

• λ1 controls the strength of the L1 penalty (Lasso).
• λ2 controls the strength of the L2 penalty (Ridge).

• When λ1 = 0 and λ2 > 0, the model reduces to Ridge regression.


• When λ2 = 0 and λ1 > 0, the model reduces to Lasso regression.
• Elastic Net is effective when there are many correlated predictors, as it encourages grouped selection of correlated
features.

Properties of Ridge and Lasso Regression


(a) Effect of increasing λ on coefficients:
• Ridge Regression: As λ increases, the coefficients decrease toward zero but never exactly reach zero. This
leads to a reduction in the magnitude of the coefficients, but they are never completely eliminated. Thus, for
Ridge regression, m ≈ 0, but m ̸= 0.
• Lasso Regression: As λ increases, the coefficients can shrink to exactly zero, which leads to feature selection.
Therefore, for Lasso regression, m = 0 for some coefficients.
(b) Bias-Variance Tradeoff: as λ increases, model flexibility drops, so bias increases while variance decreases.

(c) Impact on Loss Function:


• As λ increases, the minimum of the loss function increases. This reflects the fact that we are penalizing the
coefficients more heavily, leading to a tradeoff between fitting the data and keeping the coefficients small.
• As λ increases, the coefficients (m) get closer to zero, thus reducing the model’s complexity.
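All three penalties are available in scikit-learn, where alpha plays the role of λ; a sketch showing the shrinkage behaviour:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

for alpha in (0.1, 10.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # Ridge coefficients shrink but stay nonzero; Lasso zeroes some out entirely
    print(alpha, round(float(np.abs(ridge.coef_).min()), 4),
          int((lasso.coef_ == 0).sum()), "Lasso coefficients exactly zero")

# l1_ratio mixes the penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)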

Logistic & Multinomial Logistic Regression
Logistic Regression
1. Perceptron Trick (Motivation)
The perceptron algorithm is an early classification method that updates weights when it misclassifies data. The idea is
to nudge the decision boundary to better separate classes.

• If a point is misclassified, we move the decision boundary toward it.


• Limitation: The perceptron stops once it finds a separating hyperplane, not necessarily the best one.

Explanation: The perceptron algorithm updates its weights only when a prediction is wrong, encouraging the model
to correct its decision boundary. However, it doesn’t give us probabilistic predictions, and it only works well when the
data is linearly separable. This leads us to the need for a more refined model—Logistic Regression.

Algorithm 1 Perceptron Learning


Require: Input data X ∈ Rm×n , labels Y ∈ {±1}m , learning rate η
Ensure: Trained weight vector W
1: Initialize W randomly
2: for t = 1 to 1000 do
3: Select random index i
4: Predict: ŷi = sign(xTi W )
5: Update: W = W + η(yi − ŷi )xi
6: end for

Notation:
• xi : ith input row vector
• W : weight vector (n × 1)
• yi : true label, ŷi : predicted label
• η: learning rate

2. Sigmoid Function and Stochastic Logistic Regression


The perceptron is replaced by the sigmoid function for probabilistic interpretation and smooth optimization:

σ(z) = 1 / (1 + e^{−z})

The sigmoid function maps any real number into the range (0, 1). It’s the core of logistic regression.

Explanation: Logistic regression uses the sigmoid function to estimate the probability that a sample belongs to a
particular class. Unlike the hard decision rule in perceptron, the sigmoid gives smooth output, making it differentiable
and suitable for optimization using gradient-based methods.

Algorithm 2 Stochastic Logistic Regression


Require: X ∈ Rm×n , Y ∈ {0, 1}m , learning rate η
Ensure: Weights W
1: Initialize W with small values
2: for iteration = 1 to 1000 do
3: Randomly pick (xi , yi )
4: Predict: ŷi = σ(xTi W )
5: Update: W = W + η(yi − ŷi )xi
6: end for

3. Maximum Likelihood & Batch Training
Logistic regression fits weights using maximum likelihood estimation, which minimizes the following binary cross-entropy
loss:

L = −(1/m) Σ_{i=1}^{m} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

Explanation: Instead of updating weights one sample at a time, we can use all training samples together in batch
mode. This is done by minimizing the log-loss, also known as binary cross-entropy. The loss penalizes incorrect
predictions more heavily as the confidence increases in the wrong direction.

Matrix Formulation

Ŷ = σ(XW)

L = −(1/m) [ Yᵀ log(Ŷ) + (1 − Y)ᵀ log(1 − Ŷ) ]
Explanation: In matrix form, the loss function and prediction become compact and computationally efficient. This
helps implement batch gradient descent easily using matrix operations.

Gradient Descent Update Rule


W ← W + η · (1/m) Xᵀ(Y − Ŷ)
Explanation: The gradient is calculated from the derivative of the loss with respect to the weights. This gradient
points in the direction to increase likelihood (or decrease loss), and we move the weights slightly in this direction.

Algorithm 3 Batch Gradient Descent for Logistic Regression


Require: Input matrix X, label vector Y , learning rate η
Ensure: Trained weight vector W
1: Initialize W with small values
2: for epoch = 1 to 10 do
3: Predict: Ŷ = σ(XW )
4: Compute gradient: ∇ = (1/m) Xᵀ(Y − Ŷ)
5: Update: W ← W + η · ∇
6: end for
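Algorithm 3 translated directly into numpy (synthetic data; more epochs than the pseudocode's 10, so convergence is visible):

import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
W_true = np.array([0.5, 2.0, -1.0])
Y = (1 / (1 + np.exp(-X @ W_true)) > rng.uniform(size=m)).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W = np.zeros(n)
eta = 0.5
for epoch in range(1000):
    Y_hat = sigmoid(X @ W)          # predict
    grad = X.T @ (Y - Y_hat) / m    # gradient step from the cross-entropy loss
    W = W + eta * grad              # update
print(W.round(2))                   # roughly recovers W_true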

Softmax Regression / Multinomial Logistic Regression


Softmax regression generalizes logistic regression to multi-class classification problems. Instead of modeling a binary
outcome, it predicts the probability distribution over K classes.

Method One: One-vs-Rest Approach (Simplified View)
• Apply one-hot encoding to convert categorical labels into a binary matrix format.

• Train a separate binary logistic regression model for each class (i.e., one model per column of the one-hot
encoded labels).
• Each model predicts the probability of its respective class vs all others.
• The class with the highest predicted probability is chosen during inference.

Method Two: Softmax-Based Training (True Multinomial Logistic Regression)


In this method, we train a single model that jointly predicts all classes using the softmax function. This avoids the
independence assumption of Method One.

Loss Function (Cross-Entropy for Multiple Classes)


Loss = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} y_k^{(i)} · log(ŷ_k^{(i)})

• Here, y_k^{(i)} = 1 if the ith example belongs to class k, and 0 otherwise (from one-hot encoding).
• ŷ_k^{(i)} is the predicted probability for class k for the ith example.

Softmax Function
σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k},   for j = 1, . . . , K

• Converts raw scores (logits) into probabilities.


• Ensures output values are between 0 and 1 and sum to 1 across all classes.

Important Points
• Commonly used for multi-class classification tasks (e.g., digit recognition, sentiment with 3+ categories).
• We can apply regularization to prevent overfitting:
– L1 Regularization (Lasso): Encourages sparsity in weights.
– L2 Regularization (Ridge): Encourages smaller weights overall.
– Elastic Net: A mix of both L1 and L2.
• Optimization is usually done via batch or stochastic gradient descent.
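A small numpy sketch of the softmax and the multi-class cross-entropy (subtracting the row maximum is a standard numerical-stability trick):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stability: exp of large logits overflows
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.2]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])

probs = softmax(logits)                                    # rows sum to 1
loss = -np.mean(np.sum(y_onehot * np.log(probs), axis=1))  # cross-entropy
print(probs.round(3), round(float(loss), 4))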

K-Nearest Neighbors (KNN)
Based on the concept: “You are the average of the five people you spend the most time with”
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for both classification and
regression tasks. It predicts outcomes based on the similarity of data points in the feature space.

Algorithm Steps
Step 1: Normalize the Data
• Scale features appropriately (e.g., min-max normalization):

x′ = (x − min(X)) / (max(X) − min(X))

Step 2: Calculate Distances


• Euclidean distance:
d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )

• Other options: Manhattan, Minkowski, or Cosine similarity


Step 3: Identify K-Nearest Neighbors
• Select top K points with smallest distances
• Optimal K is typically chosen via cross-validation
Step 4: Make Prediction
• Classification: Majority vote

ŷ = mode({y_i}_{i=1}^{K})

• Regression: Weighted average


ŷ = Σ_{i=1}^{K} w_i y_i / Σ_{i=1}^{K} w_i,   where w_i = 1 / d(x, x_i)
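A compact numpy implementation of the regression variant (classification would replace the weighted average with a majority vote):

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Step 2: Euclidean distances from the query to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 4: inverse-distance weighted average (epsilon avoids division by zero)
    w = 1.0 / (dists[nearest] + 1e-12)
    return np.sum(w * y_train[nearest]) / np.sum(w)

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, (200, 2))
y_train = X_train.sum(axis=1)                                # simple target: x1 + x2
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # ≈ 1.0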

Naive Bayes Classifier
Bayes’ Theorem
P(A|B) = P(B|A) P(A) / P(B)

Classification Formulation
For features A, B, C and class Ck :

P(C_k | A, B, C) = P(A|C_k) P(B|C_k) P(C|C_k) P(C_k) / P(A, B, C)

Using the Naive assumption (feature independence):


ŷ = argmax_{C_k} P(C_k) ∏_{i=1}^{n} P(x_i | C_k)

Example Calculation

Toss Venue Outlook Result


Won Mumbai Overcast Won
Lost Chennai Sunny Won
.. .. .. ..
. . . .
Won Mumbai Sunny Won

For P (Won|Lost, Mumbai, Sunny):


∝ P(Lost|Won) · P(Mumbai|Won) · P(Sunny|Won) · P(Won) = (1/5) × (2/5) × (4/5) × (5/9) ≈ 0.0356

Numerical Feature Handling


For continuous data, Gaussian Naive Bayes assumes the likelihood of the features is Gaussian:
P(x | C_k) = (1 / √(2π σ²_{C_k})) · exp( −(x − µ_{C_k})² / (2σ²_{C_k}) )

Implementation Steps
2
i. Calculate the mean µCk and variance σC k
of each feature for every class.
ii. Use the Gaussian formula to compute the likelihood for each feature.
iii. Multiply all feature likelihoods with the class prior to get posterior probabilities.
iv. Choose the class with the highest posterior probability.
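These steps map directly onto scikit-learn's GaussianNB; a sketch:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes with different feature means; steps i-ii happen inside fit()
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)            # estimates per-class mean and variance
print(model.theta_.round(2))              # fitted class means
print(model.predict_proba([[1.5, 1.5]]))  # posterior probabilities (steps iii-iv)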

CART - Classification and Regression Trees
Pseudo Code
(a) Begin with the training dataset containing feature variables and corresponding target values.
(b) Determine the best feature to split the data on using a criterion like Gini impurity or entropy.
(c) Split the dataset into subsets based on this feature. Each split defines a node in the decision tree.
(d) Recursively repeat steps 2 and 3 for each subset until a stopping criterion is met (e.g., maximum depth, pure leaf,
or minimum samples).

Advantages
• Fast inference: prediction cost is logarithmic in the number of training samples.
• Easy to interpret and visualize.

Disadvantages
• Prone to overfitting, especially with deep trees.
• Sensitive to imbalanced datasets and small variations in data.

Decision Tree Concepts


Entropy (Information Gain)
Entropy(X) = − Σ_{i=1}^{c} p_i log₂(p_i)

where pi is the proportion of samples belonging to class i.

• For a binary classification problem:


– Minimum Entropy = 0 (pure node)
– Maximum Entropy = 1 (uniform distribution)
• For multiclass problems, the maximum entropy > 1.
• Both log2 and loge (natural log) are valid; base 2 is commonly used in decision trees.

Gini Impurity
G = 1 − Σ_{i=1}^{c} p_i²

• Gini tends to produce more balanced trees than entropy does.


• Lower Gini implies purer node.

Information Gain

Information Gain = Impurity(Parent) − Σ_{i=1}^{k} (w_i / W) · Impurity(Child_i)
                 = Entropy or Gini (Parent) − Weighted Average of Child Impurities

Algorithm
(a) Compute impurity (Entropy or Gini) of the parent node.
(b) For each feature:
• Split the dataset based on feature values.
• Compute impurity of child nodes and their weighted average.
• Calculate the information gain.
(c) Select the feature with the highest information gain to split.
(d) Stop when:
• All samples belong to one class (pure node, entropy = 0), or
• Maximum depth or minimum node size is reached.

Calculating Information Gain for Numerical Features


• Sort the dataset based on the values of the numerical feature.
• Consider potential split points between each pair of consecutive values.
• For each split point:
– Divide the dataset into two subsets: values less than or equal to the split, and values greater.
– Compute the Information Gain using Entropy or Gini impurity.
• Select the split point that yields the maximum Information Gain.
• The corresponding Information Gain is assigned to that numerical feature.
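A small numpy sketch of both impurity measures and the gain of a candidate split:

import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def information_gain(parent, children, impurity=entropy):
    # Gain = parent impurity minus the size-weighted child impurities
    weighted = sum(len(c) / len(parent) * impurity(c) for c in children)
    return impurity(parent) - weighted

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = y[:3], y[3:]   # a candidate split that yields pure children
print(round(float(information_gain(y, [left, right])), 3))  # 0.954, the maximum possible here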

Regression Trees
Algorithm
• Compute the standard deviation of the target variable in the parent node.
• For each feature (categorical or numerical), calculate the Information Gain:
– Categorical Features:
∗ Split the dataset based on the unique values (labels) of the categorical feature.
∗ Compute the standard deviation of the target variable for each group (child node).
– Numerical Features:
∗ Sort the data based on the numerical feature.
∗ Consider possible split points between consecutive values.
∗ For each split, divide the dataset into two groups.
∗ Compute the standard deviation of the target variable in each group (child node).
– Compute the weighted standard deviation of the child nodes.
– Calculate Information Gain as:

Information Gain = Standard Deviation (Parent) − Weighted Standard Deviation (Children)

• Select the feature and split that results in the highest Information Gain.
• Recursively repeat the above steps for each child node until a stopping criterion is met (e.g., minimum samples
per node or maximum tree depth).
• At each leaf node, output the mean of the target variable values within that node.

Feature Importance for Decision Tree-like Algorithms


To compute the importance of a feature i, sum the decrease in impurity for all nodes t where the feature i is used to
split:

 
Feature Importance(i) = Σ_{t ∈ nodes using i} (N_t / N) · [ impurity_t − (N_{t_l} / N_t) · impurity_{t_l} − (N_{t_r} / N_t) · impurity_{t_r} ]

• Feature Importance(i): Importance score for feature i.


• Nt : Number of samples at node t.
• N : Total number of samples in the dataset.
• impurityt : Impurity at node t (e.g., Gini, Entropy, Standard Deviation).
• Ntl : Number of samples in the left child node.
• Ntr : Number of samples in the right child node.
• impuritytl : Impurity of the left child node.
• impuritytr : Impurity of the right child node.

Ensemble Learning
Ensemble learning is based on the concept of the wisdom of the crowd, where combining predictions from multiple
models increases the overall accuracy and robustness of the system. It helps convert a low-bias, high-variance model
into a low-bias, low-variance model by distributing the impact of random errors and outliers across multiple learners.

• Voting Ensemble: In this method, multiple diverse models (e.g., SVM, Decision Tree, Logistic Regression) make
predictions independently.
– For classification, the final output is determined by majority voting or by averaging class probabilities (soft
voting).
– For regression, the mean of the individual model predictions is used as the final result.
• Stacking: Stacking combines different models by using their predictions as input features for a final meta-model
(e.g., KNN or Logistic Regression).
– Base learners (e.g., SVM, Decision Tree, Logistic Regression) are first trained on the dataset.
– Their predictions are then used as input to train a meta-model that learns to optimally combine these outputs.
• Bagging (Bootstrap Aggregating): Bagging involves training multiple instances of the same model on different
random subsets (with replacement) of the training data.
– The final prediction is obtained through majority voting (classification) or averaging (regression).
– Bagging helps reduce variance and is effective for high-variance, low-bias models like decision trees.
• Boosting: Boosting is a sequential ensemble technique where each model tries to correct the errors of its predecessor.
– Models are trained one after the other.
– Each subsequent model focuses more on the data points that were misclassified by the previous models.
– The final prediction is typically a weighted sum of individual model outputs.

Voting Ensemble Method


Assume we have three independent models, each with an individual accuracy of 0.7.

Ensemble Accuracy Calculation (Majority Voting):


We calculate the probability that the ensemble gives the correct result when at least 2 out of 3 models are correct:

Final Accuracy = (0.7)³ + 3 × (0.7)² × 0.3 = 0.343 + 0.441 = 0.784

Types of Voting:
• Hard Voting: The class predicted by the majority of the models (i.e., mode of predictions) is selected.
• Soft Voting: The class with the highest average predicted probability (i.e., mean of probabilities) is selected.

Stacking

1. Hold-Out Method (Blending)


The hold-out method, also known as blending, is a basic form of stacking. In this method, the training dataset is split
into two parts:
• Base Model Training: The first part is used to train the base models.
• Meta-Model Training: The trained base models make predictions on the second part of the data. These predictions
serve as input features to the meta-model (often a simple model like linear or logistic regression).
• The meta-model is then trained on these predictions to learn how to best combine the base model outputs for final
prediction.

2. K-Fold Approach (Stacking)


The K-Fold stacking method improves upon blending by using cross-validation to better utilize the training data and
reduce overfitting:
• Data Splitting: Divide the training data into K folds.
• Base Model Training: For each base model type, train K different models. In each iteration, use K−1 folds for
training and the remaining fold for validation. This ensures each data point is used for validation once.
• Out-of-Fold Predictions: Collect the out-of-fold predictions made by each of the K trained models. These
predictions form the new feature set for the meta-model.
• Meta-Model Training: Train the meta-model on the out-of-fold predictions from all base models.
• Final Model Retraining: After training the meta-model, retrain each base model type on the entire training
dataset to be used during final inference.

3. Multi-Layered Stacking
Multi-layered stacking extends the idea by creating a hierarchy of models across multiple layers, where the output of each layer is used as input for the next.

This approach enables models to learn from previous model predictions, enhancing overall performance—especially
when using a diverse set of model types.

Bagging Techniques
Bagging (Bootstrap Aggregating) is an ensemble method that improves model stability and accuracy by training multiple
models on different subsets of the data and combining their predictions. Below are some of the common techniques
used in bagging:

• Row Sampling with Replacement: This is the classic bagging technique where each individual model is trained
on a bootstrap sample, meaning data points are selected randomly with replacement. Some data points may appear
multiple times in the sample, while others may not be included at all.
• Pasting: Similar to row sampling, but the data is sampled without replacement, so each data point appears at most once in any given model's subset.
• Random Subspaces: This method focuses on column sampling (i.e., selecting a subset of features). Models are
trained on different subsets of features, either with or without replacement, allowing the ensemble to capture different
aspects of the data.
• Random Patches: A combination of both row and column sampling. In this technique, both subsets of rows (data
points) and columns (features) are selected, with or without replacement. This enables the ensemble to capture
diverse patterns from both the data and the feature set.

Out-of-Bag (OOB) Error: In bagging, roughly 37% of the samples are not selected for training any given model, since the probability that a particular sample is never drawn in n draws with replacement is (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368. These samples, called out-of-bag (OOB) samples, can be used as a validation set to evaluate the model's performance. This internal validation provides a cost-effective way of assessing the model's accuracy without the need for a separate test set.
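A minimal scikit-learn sketch of bagging with OOB evaluation (assuming scikit-learn ≥ 1.2, where the base-learner argument is named estimator; the hyperparameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # row sampling with replacement (set False for pasting)
    oob_score=True,   # evaluate each tree on its out-of-bag samples
    random_state=0,
)
bag.fit(X, y)
print(bag.oob_score_)  # OOB accuracy: a free validation estimate

Setting max_features < 1.0 gives random subspaces, and combining it with row sampling gives random patches.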

AdaBoost Algorithm
1. Initialize Weights
For a dataset with n samples, initialize the weight for each sample equally:
w_i = 1/n,  for all i = 1, 2, . . . , n
This ensures that all samples have equal influence initially.
[Worked-example table with columns: x, y, w_i (each w_i initialized to 1/n)]

2. Train a Weak Learner


Train a weak learner (typically a decision stump, i.e., a tree of depth 1) on the training data using the current sample
weights. A weak learner is expected to perform only slightly better than random guessing.

3. Make Predictions
Use the trained weak learner to predict the labels for all training samples. Denote these predictions as ŷi .
[Worked-example table with columns: x, y, w_i, ŷ]

4. Compute Weighted Error


Calculate the weighted error ϵ of the weak learner:
ϵ = Σ_{i=1}^{n} w_i · I(ŷ_i ≠ y_i)

where I(·) is the indicator function that equals 1 when the prediction is incorrect and 0 otherwise.
[Worked-example table with columns: x, y, w_i, ŷ, ϵ]

5. Compute Learner Weight (α)


Calculate the weight α for the weak learner, which reflects its contribution to the final model:
 
α = (1/2) · log( (1 − ϵ) / ϵ )

A lower error ϵ results in a higher α, indicating a more reliable learner.

6. Update Sample Weights


Update the weights of each sample depending on whether it was correctly or incorrectly classified:

w_i^new = w_i · e^(−α · y_i · ŷ_i)

This formulation, which assumes labels y_i ∈ {−1, +1}, increases the weights of misclassified samples and decreases those of correctly classified ones.
[Worked-example table with columns: x, y, w_i, ŷ, ϵ, w_i^new]

7. Normalize Weights
Normalize the updated weights so that they sum to 1:
w_i^norm = w_i^new / Σ_{j=1}^{n} w_j^new

[Worked-example table with columns: x, y, w_i, ŷ, ϵ, w_i^new, w_i^norm]

8. Optional: Create Sampling Bins


You may optionally build bins for sampling based on the normalized weights. Each sample is assigned a probability
interval:

bin_i = (a_i, b_i) = (b_{i−1}, b_{i−1} + w_i^norm)

[Worked-example table with columns: x, y, w_i, ŷ, ϵ, w_i^new, w_i^norm, bin_i = (a_i, b_i)]

9. Repeat
Repeat steps 2 through 7 for a specified number of iterations or until convergence. Note: do not reinitialize the weights to 1/n in each iteration; the updated weights are carried forward to the next round.

10. Final Prediction


After T rounds, the final prediction is a weighted majority vote (or sum) of all weak learners:
H(x) = sign( Σ_{t=1}^{T} α_t · h_t(x) )

where:
• h_t(x) is the prediction of the t-th weak learner,
• α_t is the weight of the t-th weak learner,
• H(x) is the final ensemble prediction.

Summary: AdaBoost sequentially trains weak learners on reweighted data, focusing more on previously misclassified
samples. It then combines them into a strong learner using a weighted vote, significantly boosting performance.
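The steps above can be condensed into a short sketch. This is a minimal illustration assuming binary labels encoded as −1/+1 and decision stumps from scikit-learn; the helper names adaboost_fit and adaboost_predict are made up for this example:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(y)
    w = np.full(n, 1 / n)                       # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # step 2: weak learner on weighted data
        y_hat = stump.predict(X)                # step 3: predictions
        eps = np.sum(w * (y_hat != y))          # step 4: weighted error
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))  # step 5: learner weight
        w = w * np.exp(-alpha * y * y_hat)      # step 6: reweight samples
        w = w / w.sum()                         # step 7: normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)                      # step 10: weighted majority vote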

Gradient Boosting
The final prediction is a weighted sum of all models:

Result = Model_0(X) + η · Model_1(X) + η · Model_2(X) + · · · + η · Model_n(X)

Gradient Boosting for Regression

Algorithm 4 Gradient Boosting for Regression

1: Input: Training data (X, y), number of estimators n, learning rate η
2: Initialize model with constant prediction: F_0(x) = ȳ = (1/n) · Σ_{i=1}^{n} y_i
3: modelsList ← [F_0]
4: F(x) ← F_0(x)
5: for i ← 1 to n do
6:   Compute residuals: r_i = y_i − F(x_i), for all x_i ∈ X
7:   Fit regression tree h_i(x) to predict r_i from x_i
8:   Update model: F(x) ← F(x) + η · h_i(x)
9:   modelsList.append(h_i)
10: end for
11: Output: Final prediction function: ŷ(x) = F_0(x) + η · Σ_{i=1}^{n} h_i(x)
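A minimal Python sketch of Algorithm 4, assuming NumPy arrays and scikit-learn regression trees; the function names fit_gb and predict_gb are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gb(X, y, n_estimators=100, eta=0.1, max_depth=3):
    f0 = y.mean()                        # constant initial prediction F_0
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        r = y - F                        # residuals = negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        F += eta * tree.predict(X)       # shrunken additive update
        trees.append(tree)
    return f0, trees

def predict_gb(X, f0, trees, eta=0.1):
    return f0 + eta * sum(t.predict(X) for t in trees)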

Gradient Boosting for Classification

Algorithm 5 Gradient Boosting for Binary Classification

1: Input: Training data (X, y), learning rate η, number of estimators n
2: Compute initial log-odds: F_0 = ln( Σ y / Σ (1 − y) )
3: modelsList ← [F_0]
4: for i ← 1 to n do
5:   Compute predicted probabilities: p̂_k = 1 / (1 + e^(−F_{i−1}(x_k))), for all x_k ∈ X
6:   Compute residuals (negative gradient): r_k = y_k − p̂_k, for all x_k ∈ X
7:   Fit a regression tree h_i(x) to predict r_k from x_k
8:   for each leaf j in h_i with region R_j do
9:     Compute leaf output: γ_j = ( Σ_{x_k ∈ R_j} r_k ) / ( Σ_{x_k ∈ R_j} p̂_k (1 − p̂_k) )
10:    Assign γ_j to all x_k ∈ R_j
11:  end for
12:  Update prediction: F_i(x) = F_{i−1}(x) + η · h_i(x)
13:  modelsList.append(h_i)
14: end for
15: Output: Final prediction:

F(x) = F_0 + η · Σ_{i=1}^{n} h_i(x)

p̂(x) = 1 / (1 + e^(−F(x))),   ŷ = 1 if p̂(x) ≥ 0.5, 0 otherwise

Gradient Boosting Algorithm (General Formulation)


Given:
• Dataset: {(x_i, y_i)}_{i=1}^{n}
• Differentiable loss function L(y, F(x)) (e.g., squared error: L(y, F(x)) = ½ (y − F(x))²)
• Number of boosting iterations: M
• Learning rate: η
Goal: Build an additive model F_M(x) minimizing the empirical risk:

F_M(x) = Σ_{m=0}^{M} η · h_m(x), where the h_m(x) are weak learners

Algorithm Steps

Algorithm 6 Gradient Boosting (General Form)

1: Initialize model: F_0(x) = argmin_γ Σ_{i=1}^{n} L(y_i, γ)
   ▷ e.g., for squared loss: F_0(x) = (1/n) · Σ_{i=1}^{n} y_i
2: for m = 1 to M do
3:   Compute pseudo-residuals: r_{im} = −[ ∂L(y_i, F(x_i)) / ∂F(x_i) ]_{F = F_{m−1}}
     ▷ For squared loss: r_{im} = y_i − F_{m−1}(x_i)
4:   Fit a regression tree h_m(x) to the training data {(x_i, r_{im})}_{i=1}^{n}
5:   for each terminal region R_{jm} of h_m do
6:     Compute region-specific update: γ_{jm} = argmin_γ Σ_{x_i ∈ R_{jm}} L(y_i, F_{m−1}(x_i) + γ)
       ▷ For squared loss: γ_{jm} = mean{ r_{im} | x_i ∈ R_{jm} }
7:   end for
8:   Update model: F_m(x) = F_{m−1}(x) + η · Σ_{j=1}^{J_m} γ_{jm} · 1{x ∈ R_{jm}}
9: end for
10: Output: Final model F_M(x)

Key Components
• Gradient Computation: Uses first-order Taylor approximation (gradient descent in function space)
• Weak Learner: Usually regression trees with limited depth (e.g., depth 3–6)
• Learning Rate η: Shrinks contribution of each tree; common values: 0.01 to 0.1
• Regularization: Achieved by early stopping, limiting tree depth, subsampling, and η
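For reference, here is how these knobs map onto scikit-learn's GradientBoostingRegressor; the particular values are illustrative, not recommendations:

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=500,         # M boosting iterations
    learning_rate=0.05,       # η: shrinks each tree's contribution
    max_depth=3,              # shallow trees as weak learners
    subsample=0.8,            # row subsampling for regularization
    n_iter_no_change=10,      # early stopping on an internal validation split
    validation_fraction=0.1,
)
# gbr.fit(X, y)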

XGBoost (Extreme Gradient Boosting)
Key Advantages of XGBoost
• Flexibility
– Cross-Platform Support: Runs on Windows, Linux, and macOS
– Multi-Language APIs: Available in Python, R, Java, Scala, and C++
– Integration Support: Compatible with scikit-learn, Dask, Spark
– Wide Application: Supports regression, classification, time series, and ranking tasks
• Speed (Often 10× faster than traditional gradient boosting)
– Parallelized Tree Construction: Controlled via n_jobs (in the scikit-learn API) or nthread (in core XGBoost)
– Columnar Data Storage: Enables cache-aware access patterns
– Cache Optimization: Minimizes memory overhead
– Out-of-Core Learning: Handles datasets larger than memory using tree_method = "hist"
– Distributed Training: Supports large-scale training across clusters
– GPU Acceleration: Leverages GPUs via device = "cuda"
• Performance
– Regularization: Incorporates both L1 (Lasso) and L2 (Ridge) penalties
– Robust to Missing Data: Automatically learns optimal default directions
– Sparse Data Handling: Efficient processing of sparse input matrices
– Efficient Split Finding:
∗ Uses weighted quantile sketch for approximate split finding
∗ Employs binned feature values to speed up computation
– Tree Pruning: Performs depth-first expansion with backward pruning to eliminate splits with negative gain
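A minimal usage sketch touching several of these options, assuming the xgboost Python package (the device argument requires XGBoost ≥ 2.0); the hyperparameter values are illustrative:

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    reg_alpha=0.0,        # L1 (Lasso) regularization
    reg_lambda=1.0,       # L2 (Ridge) regularization
    tree_method="hist",   # histogram-based approximate split finding
    device="cuda",        # GPU acceleration; use "cpu" if no GPU is available
    n_jobs=-1,            # parallel tree construction
)
# model.fit(X, y)  # missing values in X are handled natively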

Technical Highlights (Simplified and Expanded)


• Regularized Learning Objective:
L(φ) = Σ_{i=1}^{n} l(ŷ_i, y_i) + Σ_{k=1}^{K} Ω(f_k),  where Ω(f_k) = γT + ½ · λ · ∥w∥²

– The first term measures how well the model fits the data using a loss function l(·) (e.g., squared error for
regression, log loss for classification).
– The second term Ω(fk ) adds a penalty for model complexity:
∗ T is the number of leaves (terminal nodes) in the tree.
∗ w represents the vector of weights (predictions) in each leaf.
∗ γ and λ are regularization parameters that help prevent overfitting.
• Approximate Learning Algorithm (Histogram-based):
– Instead of evaluating every possible split, XGBoost uses an approximate method for faster computation.
– Continuous features are grouped into discrete bins using quantiles (this forms a histogram).
– For each bin, it calculates the sum of gradients and second-order gradients (Hessians).
– This significantly speeds up the split finding process, especially on large datasets.
– The method is selected with tree_method = "hist" and is suitable for both CPU and GPU training.
• Sparsity-Aware Split Finding:
– XGBoost is optimized to handle datasets with missing or sparse values.
– During training, it automatically learns the best way to assign missing values (i.e., go left or right in a tree node).
– This avoids the need for manual imputation and improves performance on real-world datasets with incomplete
entries.
– It also efficiently handles features with many zero entries (common in one-hot encoded or bag-of-words data).

XGBoost for Regression

Algorithm 7 XGBoost Regression Algorithm

1: Initialize: prediction ← mean(y); modelsList ← [mean(y)]
2: for i ← 1 to n_estimators do
3:   Calculate residuals: residual ← y − prediction
4:   Build decision tree:
5:   for each possible split do
6:     Compute Similarity Score: SS = (Σ residuals)² / (count(residuals) + λ)
7:     Compute Gain: Gain = SS_left + SS_right − SS_parent
8:     Select the split with maximum Gain
9:   end for
10:  Set node output: output = (Σ residuals) / (count(residuals) + λ)
11:  Update prediction: prediction ← prediction + η · tree_predict(X)
12:  modelsList.append(tree)
13: end for
14: Final prediction:

result = modelsList[0] + η · Σ_{i=1}^{n_estimators} modelsList[i](X)
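A small numeric sanity check of the similarity-score and gain rules above, using made-up residuals and λ = 1:

import numpy as np

lam = 1.0

def similarity(res):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return res.sum() ** 2 / (len(res) + lam)

residuals = np.array([-10.0, 7.0, 8.0, -7.5])
left, right = residuals[:1], residuals[1:]      # one candidate split
gain = similarity(left) + similarity(right) - similarity(residuals)
print(gain)  # positive gain means the split improves on the parent node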

XGBoost for Classification

Algorithm 8 XGBoost Classification Algorithm

1: Initialize log-odds: log_odds ← ln( Count of Ones / Count of Zeros ); modelsList ← [log_odds]
2: for i ← 1 to n_estimators do
3:   Calculate probability: prob ← 1 / (1 + e^(−log_odds))
4:   Compute residuals: residual ← y − prob
5:   Build decision tree:
6:   for each possible split do
7:     Compute Similarity Score: SS = (Σ residuals)² / ( Σ [prob × (1 − prob)] + λ )
8:     Compute Gain: Gain = SS_left + SS_right − SS_parent
9:     Select the split with maximum Gain
10:  end for
11:  Set node output (a log-odds contribution): output = (Σ residuals) / ( Σ [prob × (1 − prob)] + λ )
12:  modelsList.append(tree)
13:  Update log-odds: log_odds ← log_odds + η · output
14: end for
15: Final prediction:

Total log-odds = modelsList[0] + η · Σ_{i=1}^{n_estimators} modelsList[i](X)

Final prob = 1 / (1 + e^(−Total log-odds))

XGBoost Mathematics
1. Objective Function
XGBoost minimizes the following regularized objective function at each boosting iteration t:
L^(t) = Σ_{i=1}^{n} L( y_i, ŷ_i^(t−1) + f_t(x_i) ) + Ω(f_t)

• L: loss function (e.g., squared error or logistic loss)
• ŷ_i^(t−1): prediction from the previous iterations
• f_t: the function (tree) added at iteration t
• Ω(f_t): regularization term to penalize complexity
Regularization term:
Ω(f) = γT + ½ · λ · ∥w∥²
• T : number of leaves in the tree
• w: vector of output scores on leaves
• γ, λ: regularization hyperparameters
Explanation: This formulation balances model fit (through the loss L) and complexity (through Ω) to avoid overfitting.

2. Taylor Approximation
Using a second-order Taylor expansion around ŷ_i^(t−1):

L^(t) ≈ Σ_{i=1}^{n} [ g_i · f_t(x_i) + ½ · h_i · f_t(x_i)² ] + Ω(f_t)

• g_i = ∂L(y_i, ŷ_i) / ∂ŷ_i
• h_i = ∂²L(y_i, ŷ_i) / ∂ŷ_i²

Explanation: This approximation simplifies optimization using gradients (gi ) and Hessians (hi ).

3. Tree Structure Reformulation


Assume f_t(x_i) maps x_i to leaf j with output score w_j. Let I_j be the set of samples in leaf j. Then:

L^(t) = Σ_{j=1}^{T} [ G_j · w_j + ½ · (H_j + λ) · w_j² ] + γT

where:

G_j = Σ_{i ∈ I_j} g_i,   H_j = Σ_{i ∈ I_j} h_i

Explanation: The loss is grouped by leaves, allowing us to find optimal wj for each leaf independently.

4. Optimal Weight for a Leaf


Set the derivative of the loss with respect to w_j to zero:

∂L / ∂w_j = G_j + (H_j + λ) · w_j = 0   ⇒   w_j* = −G_j / (H_j + λ)

Explanation: This gives the best score for each leaf to minimize the total objective.

5. Optimal Objective Value After Adding Tree
Substitute w_j* into the objective:

L^(t) = −½ · Σ_{j=1}^{T} G_j² / (H_j + λ) + γT

Explanation: This score helps evaluate split quality during tree construction.
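Although not written out above, the same per-leaf score yields the gain rule used to compare a parent node against a candidate split into left (L) and right (R) children, as in the original XGBoost formulation:

Gain = ½ · [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ] − γ

A split is retained only when this gain is positive, which is the basis of the backward-pruning behavior described in the XGBoost overview.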

6. Derivatives Summary

g_i = ∂L(y_i, ŷ_i) / ∂ŷ_i   (gradient)
h_i = ∂²L(y_i, ŷ_i) / ∂ŷ_i²   (Hessian)

Explanation: These derivatives enable second-order optimization, improving training speed and accuracy.

7. Special Case: Squared Error Loss (Regression)


For L(y_i, ŷ_i) = ½ · (y_i − ŷ_i)²:

g_i = ŷ_i − y_i,   h_i = 1

w_j* = ( Σ_{i ∈ I_j} (y_i − ŷ_i) ) / (N_j + λ),   L_j = ( Σ_{i ∈ I_j} (y_i − ŷ_i) )² / (N_j + λ)

where N_j is the number of samples in leaf j.
Explanation: For regression, gradient is the error, and the Hessian is constant, simplifying computation.

