Part 1: Foundational ML Concepts 🧠
Types of ML Techniques
● Supervised Learning: This is like learning with a teacher. The algorithm is trained on a
labeled dataset, which means each data point is tagged with the correct output. The
goal is to learn a mapping function that can predict the output for new, unseen data.
○ Example: Predicting house prices (regression) based on features like area and
location, or classifying emails as spam or not spam (classification).
● Unsupervised Learning: This is like learning without a teacher. The algorithm is given
unlabeled data and tries to find patterns and structure on its own.
○ Example: Grouping customers into different segments based on their purchasing
behavior (clustering), or reducing the number of features in a dataset
(dimensionality reduction).
● Semi-Supervised Learning: A middle ground between supervised and unsupervised
learning. It uses a small amount of labeled data and a large amount of unlabeled data.
This is useful when labeling data is expensive and time-consuming.
● Reinforcement Learning: This is about learning to make decisions. An agent learns by
interacting with an environment. It receives rewards for good actions and penalties for
bad ones. The goal is to learn a policy that maximizes the cumulative reward.
○ Example: Training a program to play a game like chess or controlling a robot to
perform a task.
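To make the supervised/unsupervised distinction concrete, here is a minimal scikit-learn sketch; the tiny toy dataset and the choice of LogisticRegression and KMeans are illustrative assumptions, not part of the notes above. The classifier needs labels, while the clustering algorithm finds groups on its own.
```python
# Minimal sketch: supervised learning uses labels, unsupervised learning does not.
# The toy data and model choices are purely illustrative.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: every row of X comes with a label in y.
X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5, 1.5]]))       # predicted class for an unseen point

# Unsupervised: the same X with no labels; the algorithm finds structure itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                      # cluster assignment for each data point
```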
Key Learning Paradigms
● Agent: The learner or decision-maker in reinforcement learning.
● Self-supervised Learning: A type of unsupervised learning where the supervision
signal is generated from the input data itself. For example, a model learns to predict the
next word in a sentence by looking at the previous words.
● Active Learning: The algorithm can interactively query a user (or another information
source) to label new data points. It tries to select the most informative data points to
query, to learn more efficiently.
● Passive Learning: The standard paradigm where the algorithm is given a fixed dataset
and learns from it passively, without any ability to influence which data it sees.
Eager vs. Lazy Learners
● Eager Learners: These algorithms build a classification model from the training data
before receiving any test data. They spend more time on training but are fast during
prediction.
○ Examples: Linear Regression, Decision Trees, SVM.
● Lazy Learners: These algorithms defer the learning process until it's time to make a
prediction. They simply store the training data. They are fast to train but can be slow to
predict.
○ Example: K-Nearest Neighbors (KNN), where the algorithm looks for the 'k'
closest training examples to make a prediction.
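A minimal KNN sketch, assuming scikit-learn and an invented one-feature dataset, showing what "lazy" means in practice: fit() essentially just stores the data, and the neighbour search happens at prediction time.
```python
# Lazy learning with KNN: fit() memorizes the training data; the real work
# (finding the k nearest neighbours and voting) happens at predict() time.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1], [2], [3], [10], [11], [12]]   # hypothetical 1-D feature
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)        # effectively just stores X_train and y_train
print(knn.predict([[2.5]]))      # neighbours 1, 2, 3 vote -> class 0
```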
Hypothesis, Bias, and Learning Types
● Hypothesis (h) and Hypothesis Space (H): A hypothesis is a specific function that the
learning algorithm picks to best approximate the true target function. The hypothesis
space is the set of all possible hypotheses the algorithm can choose from. For linear
regression, the hypothesis space is the set of all possible linear equations.
● Inductive Learning: This is the core of most ML. It involves generalizing from specific
examples to create a general rule. The conclusions are probable, not guaranteed.
● Inductive Bias: The set of assumptions a learner uses to make predictions on unseen
data. Without some bias, an algorithm cannot generalize beyond the training data. A
common bias is assuming a linear relationship in linear regression.
● Deductive Learning: This involves moving from a general rule to a specific conclusion.
It's about logical deduction and is less common in machine learning, which typically
deals with uncertain data.
Parametric vs. Non-parametric Algorithms
● Parametric Algorithms: These algorithms have a fixed number of parameters,
regardless of the amount of training data. They make strong assumptions about the form
of the function they are trying to learn (e.g., linear).
○ Pros: Fast, require less data.
○ Cons: Limited complexity, can lead to underfitting if assumptions are wrong.
○ Examples: Linear Regression, Logistic Regression.
● Non-parametric Algorithms: These algorithms do not make strong assumptions about
the form of the target function. The number of parameters often grows with the training
data.
○ Pros: Flexible, can fit a wide range of functions.
○ Cons: Require more data, slower, prone to overfitting.
○ Examples: K-Nearest Neighbors, Decision Trees.
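A rough side-by-side sketch of the two families, using synthetic 1-D data (the data-generating function and model choices are assumptions for illustration): the linear model always has exactly two parameters, while the KNN regressor keeps every training point around.
```python
# Parametric (fixed parameter count) vs non-parametric (complexity grows with data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)   # nonlinear target + noise

lin = LinearRegression().fit(X, y)                  # always just slope + intercept
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)  # stores all 50 training points

print(lin.coef_, lin.intercept_)   # two numbers, no matter how much data we add
print(knn.predict([[5.0]]))        # prediction built from the 5 nearest stored points
```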
Overfitting and Underfitting
This is a central challenge in machine learning, related to the bias-variance tradeoff.
● Underfitting: The model is too simple to capture the underlying patterns in the data. It
performs poorly on both the training and test sets. It has high bias.
● Overfitting: The model learns the training data too well, including the noise. It performs
very well on the training set but poorly on the test set. It has high variance.
● Good Fit: The model captures the underlying pattern and generalizes well to new data.
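One way to see the tradeoff is to fit polynomials of increasing degree to noisy synthetic data and compare training and test error; the degrees, noise level, and split below are arbitrary illustrative choices, not prescriptions.
```python
# Under- vs overfitting: a degree-1 polynomial underfits a sine curve, while a
# degree-15 polynomial chases the noise and does worse on the test set.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```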
Sample OLS Question
Q: Given the data points (1, 2), (2, 4), (3, 5), find the OLS regression line ŷ = β₀ + β₁x.
A:
1. Calculate the means: x̄ = (1+2+3)/3 = 2, ȳ = (2+4+5)/3 = 11/3.
2. Calculate β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²:
○ Numerator: (1−2)(2−11/3) + (2−2)(4−11/3) + (3−2)(5−11/3) = (−1)(−5/3) + 0 + (1)(4/3) = 5/3 + 4/3 = 9/3 = 3.
○ Denominator: (1−2)² + (2−2)² + (3−2)² = 1 + 0 + 1 = 2.
○ β₁ = 3/2 = 1.5.
3. Calculate β₀:
○ β₀ = ȳ − β₁x̄ = 11/3 − (1.5)(2) = 11/3 − 3 = 2/3 ≈ 0.67.
4. Result: The regression line is ŷ = 0.67 + 1.5x.
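The hand calculation can be checked in a few lines of NumPy (a sanity-check sketch, not part of the original exercise):
```python
# Verify the worked OLS example: beta1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2).
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)   # ≈ 0.667 and 1.5, matching the hand calculation
```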
Stochastic Gradient Descent (SGD)
● Batch Gradient Descent: Computes the gradient using the entire training set at each
step. Can be very slow for large datasets.
● Stochastic Gradient Descent (SGD): Computes the gradient using a single training
example at each step. Much faster and can help escape local minima, but the updates
are noisy.
● Mini-Batch Gradient Descent: A compromise that updates the parameters using a
small batch of training examples. This is the most common approach.
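A from-scratch mini-batch gradient descent sketch for simple linear regression, to show the update rule; the learning rate, batch size, epoch count, and synthetic data are arbitrary assumptions for illustration.
```python
# Mini-batch gradient descent for y ≈ w*x + b, minimizing mean squared error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=1000)   # true w=3, b=2

w, b = 0.0, 0.0
lr, batch_size = 0.01, 32

for epoch in range(50):
    idx = rng.permutation(len(X))                 # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch].ravel(), y[batch]
        err = w * xb + b - yb
        # Gradients of the mini-batch MSE with respect to w and b
        w -= lr * 2 * np.mean(err * xb)
        b -= lr * 2 * np.mean(err)

print(w, b)   # should end up close to the true values 3 and 2
```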
Correlation and Multicollinearity
Correlation Analysis for Multicollinearity
Correlation measures the statistical relationship or association between two variables: how strongly, and in which direction, one variable tends to change as the other changes.
Multicollinearity is a phenomenon that occurs in regression models when two or more
independent variables (predictors) are highly correlated with each other. This means one
predictor can be linearly predicted from the others with a substantial degree of accuracy.
Think of it like having two different witnesses in a trial who tell the exact same story. Hearing the
story a second time doesn't add much new information and can make it difficult to know which
witness is more important. Similarly, in a model, multicollinearity makes it hard to determine the
individual effect of each correlated predictor on the outcome variable.
Why is it a problem? 🧐
● Unstable Coefficients: The coefficient estimates for the correlated variables can
change erratically in response to small changes in the model or the data.
● Difficult Interpretation: It becomes challenging to determine the individual contribution
of each predictor. You can't say "a one-unit increase in X1 is associated with a β1 increase
in Y, holding the other predictors constant," because you can't change X1 without also
changing its correlated counterpart, X2.
● Inflated Standard Errors: This makes the coefficients seem statistically insignificant
when they might actually be important.
How to detect it?
1. Correlation Matrix: Calculate the correlation coefficient between every pair of
independent variables. A common rule of thumb is that a correlation coefficient of 0.7 or
higher (or lower than -0.7) indicates potential multicollinearity.
2. Heatmap Visualization: A heatmap provides a clear visual representation of the
correlation matrix, making it easy to spot highly correlated pairs.
3. Variance Inflation Factor (VIF): For each predictor, regress it on all the other
predictors and compute VIF = 1 / (1 − R²). The VIF quantifies how much the variance of that
predictor's coefficient estimate is inflated by multicollinearity.
Interpreting VIF:
● VIF = 1: No correlation. This is the baseline.
● 1 < VIF < 5: Moderate correlation. This is often acceptable.
● VIF > 5 or 10: High correlation and a cause for concern. It indicates that the model's
coefficients are poorly estimated.
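A small sketch of both detection routes, assuming pandas and statsmodels and an invented three-feature dataset in which area and rooms are deliberately collinear:
```python
# Detecting multicollinearity with a correlation matrix and VIF on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
area = rng.normal(100, 20, 200)
rooms = area / 25 + rng.normal(0, 0.3, 200)   # strongly tied to area
age = rng.normal(30, 10, 200)                 # independent of both
X = pd.DataFrame({"area": area, "rooms": rooms, "age": age})

print(X.corr())   # area and rooms should show a correlation near 1

# VIF per predictor; a constant column is added so the R² used inside is centered.
X_const = sm.add_constant(X)
for i, col in enumerate(X_const.columns[1:], start=1):
    print(col, variance_inflation_factor(X_const.values, i))
```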
Logistic Regression
Logistic Regression Overview
Despite its name, Logistic Regression is a fundamental algorithm for binary classification,
not regression. It's used to predict a categorical outcome that has two possible values, such as
Yes/No, True/False, or 1/0.
For example, you could use logistic regression to predict:
● Whether an email is spam (1) or not spam (0).
● Whether a customer will churn (Yes) or not (No).
The core idea is to model the probability that a given input belongs to a particular class.
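A minimal sketch (with an invented one-feature spam example) of what the model actually does: a linear score is passed through the sigmoid to get a probability, which is then thresholded into a class.
```python
# Logistic regression: probability = sigmoid(w·x + b), then threshold at 0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One hypothetical feature: number of "suspicious" words in an email; 1 = spam.
X = np.array([[0], [1], [2], [3], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[4]]))   # [P(not spam), P(spam)] for a borderline email
print(clf.predict([[4]]))         # class label after thresholding at 0.5

# The same probability computed by hand from the learned weights.
w, b = clf.coef_[0, 0], clf.intercept_[0]
print(1 / (1 + np.exp(-(w * 4 + b))))
```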
Model Validation and Evaluation
Splitting the Data
To evaluate a model's performance on unseen data, we must first split our dataset.
● Hold-Out Method: This is the simplest strategy. The dataset is split into two parts: a
training set (e.g., 80%) and a testing set (e.g., 20%). The model is built on the training
set and evaluated on the testing set.
○ Drawback: The performance metric can be highly dependent on which data
points end up in the training vs. testing set.
● Stratified Partition: This is an improved version of the hold-out method, crucial for
classification. It ensures that the proportion of different classes is the same in both the
training and testing sets as it is in the original dataset. This prevents a situation where,
for example, the testing set has no examples of a minority class.
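A quick sketch of a stratified hold-out split with scikit-learn's train_test_split; the imbalanced toy labels are invented to show that the minority class ends up in both splits.
```python
# Stratified hold-out split: class proportions are preserved in train and test.
from sklearn.model_selection import train_test_split

X = list(range(20))
y = [0] * 16 + [1] * 4            # imbalanced: 80% class 0, 20% class 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(sum(y_tr), sum(y_te))       # 3 and 1: the minority class appears in both splits
```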
Cross-Validation (CV)
Cross-validation is a more robust technique that gives a more reliable estimate of model
performance.
● k-Fold Cross-Validation:
1. The dataset is randomly split into 'k' equal-sized subsets, called folds.
2. The model is trained and tested 'k' times.
3. In each iteration, one fold is held out as the test set, and the remaining k-1 folds
are used for training.
4. The final performance is the average of the performance scores from all 'k'
iterations. Common choices for 'k' are 5 or 10.
● Leave-One-Out Cross-Validation (LOOCV): This is an extreme case of k-Fold CV
where k equals the number of data points (n). In each iteration, one single data point is
used to test the model, and the rest (n−1) are used to train it. It's computationally
expensive but provides a very thorough validation.
● Stratified k-Fold: This combines k-Fold CV with stratification. When creating the folds, it
ensures that each fold is representative of the overall class distribution. This is the
recommended standard for most classification problems.
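A sketch of stratified 5-fold cross-validation on a synthetic, imbalanced classification problem; the dataset generator and the logistic-regression model are placeholder assumptions.
```python
# Stratified k-fold CV: each fold keeps the overall class distribution.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_classes=2, weights=[0.8, 0.2],
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged estimate of model performance
```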
Metrics for Multiclass Classifiers
When you have more than two classes (e.g., classifying images as Cat, Dog, or Bird), you need
to average the binary metrics.
● Macro-Averaging: Calculate the metric (e.g., precision) independently for each class
and then take the unweighted average. It treats all classes equally.
● Micro-Averaging: Aggregate the counts of TPs, FPs, and FNs across all classes and
then calculate the metric once. It gives more weight to the bigger classes. For
Micro-Averaging, Precision = Recall = F1 Score = Overall Accuracy.
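A toy illustration, with invented 3-class label vectors of unequal sizes, of how the two averages differ and of the fact that micro-averaged precision equals overall accuracy:
```python
# Macro- vs micro-averaged precision on an imbalanced 3-class toy problem.
from sklearn.metrics import precision_score, accuracy_score

y_true = ["cat"] * 6 + ["dog"] * 3 + ["bird"] * 1
y_pred = ["cat"] * 5 + ["dog"] + ["dog"] * 2 + ["cat"] + ["bird"]

print(precision_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
print(precision_score(y_true, y_pred, average="micro"))  # equals overall accuracy
print(accuracy_score(y_true, y_pred))
```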
Decision Trees
Decision Tree Overview
A Decision Tree is a supervised learning algorithm that works like a flowchart. It splits the data
into smaller and smaller subsets based on a series of questions about the features, eventually
arriving at a decision or a class label in the leaf nodes.
Key Terminology:
● Root Node: The starting node, which represents the entire dataset.
● Decision Node: An internal node that tests a feature and splits the data into branches
based on the outcome.
● Leaf Node: A terminal node that represents a final classification (decision).
● Homogeneous (Pure) Node: A node containing data points from only one class.
● Heterogeneous (Impure) Node: A node containing a mix of data points from multiple
classes.
The algorithm's goal is to create splits that make the resulting child nodes as pure as possible.
Weather Forecast Decision Tree Example 🌳
Goal: Decide whether to Play or Don't Play tennis based on the weather. Features:
Outlook, Humidity, Wind.
Here's how the tree might be built:
1. Root Node: The tree starts with the entire dataset, which is impure (contains both Play
and Don't Play examples).
2. First Split: The algorithm calculates which feature (Outlook, Humidity, or Wind) will
best split the data into purer subsets. Let's say it's Outlook. This becomes the root
node.
3. Branches: The Outlook node splits into three branches: Sunny, Overcast, and
Rain.
4. Subsequent Splits:
○ The data that goes down the Overcast branch might all be Play. This branch
ends in a pure leaf node labeled Play.
○ The data in the Sunny branch is still mixed. The algorithm now looks at the
remaining features (Humidity, Wind) to find the best feature to split this Sunny
subset. Let's say Humidity is best.
○ This creates two new branches under Sunny: High (which leads to a Don't
Play leaf node) and Normal (which leads to a Play leaf node).
○ This process continues recursively until all branches end in pure leaf nodes or a
stopping condition is met.
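A hypothetical scikit-learn version of this example; the small weather table below is invented to match the narrative, and the categorical features are one-hot encoded because DecisionTreeClassifier expects numeric input.
```python
# Fit a decision tree on a made-up play-tennis table and print its splits.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast", "Sunny", "Rain"],
    "Humidity": ["High", "High", "High", "Normal", "High", "Normal", "Normal", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong", "Strong", "Weak", "Weak"],
    "Play":     ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
})

X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])   # one-hot encode categories
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))     # text view of the learned splits
```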