Unit 2- Classification
Definition
• In machine learning, classification refers to a predictive modeling
problem where a class label is predicted for a given example of input
data.
• Classification is the task of approximating a mapping function (f) from
input variables (X) to discrete output variables (Y).
• It belongs to supervised machine learning, in which target labels are
provided along with the input data set.
• E.g., spam detection in emails is a binary classification problem.
Types of Learners in Classification
• a. Lazy Learners:
• These learners store the training data and wait until test data appears.
Classification is done only after the test data arrives, so they spend less time
on training but more time on predicting. Examples of lazy learners are k-nearest
neighbour and case-based reasoning.
• b. Eager Learners
• In contrast to lazy learners, eager learners construct a classification model
from the training data before any test data arrives. They spend more time on
training but less time on predicting. Examples of eager learners are Decision
Trees, Naïve Bayes and Artificial Neural Networks (ANN).
Types of Classifications
• The algorithm which implements the classification on a dataset is
known as a classifier.
• Binary Classifier: If the classification problem has only two possible
outcomes, it is called a Binary Classifier. Examples: YES or NO,
MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two
outcomes, it is called a Multi-class Classifier. Examples: classification
of types of crops, classification of types of music.
ASSESSING CLASSIFICATION PERFORMANCE
• A confusion matrix (also called a contingency table) is a tool used to evaluate the
performance of a classification model.
• It is structured as:
• Rows = Actual values (from the dataset)
• Columns = Predicted values (from the model)
              Predicted +ve          Predicted -ve          Row Total
Actual +ve    True Positives (TP)    False Negatives (FN)   Positives
Actual -ve    False Positives (FP)   True Negatives (TN)    Negatives
Column Total  Pred. +ve Total        Pred. -ve Total        Total Instances

Definitions:
• True Positives (TP): actual +ve and predicted +ve
• False Negatives (FN): actual +ve but predicted -ve
• False Positives (FP): actual -ve but predicted +ve
• True Negatives (TN): actual -ve and predicted -ve
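A minimal Python sketch (not part of the original slides; the label vectors below are invented) showing how the four counts can be tallied from actual and predicted labels, with the positive class encoded as 1:

```python
# Hypothetical example labels; positive class = 1, negative class = 0.
actual    = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # actual +ve, predicted +ve
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # actual +ve, predicted -ve
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # actual -ve, predicted +ve
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # actual -ve, predicted -ve

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```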
Performance Metrics
• Eg. 1
We have a true positive rate of 60%, a true negative rate of 80%, a false negative
rate of 40% and a false positive rate of 20%.
Eg. 2. Out of 100 instances, 75 are actual positives (60 predicted correctly) and
25 are actual negatives (15 predicted correctly):
•TPR = 60 / 75 = 0.80
•TNR = 15 / 25 = 0.60
•Accuracy = (60 + 15) / 100 = 0.75
Weighted Accuracy
Accuracy = pos · TPR + neg · TNR
Where
• pos = fraction of actual positives = 75 / 100 = 0.75
• neg = 1 - pos = 0.25
• Accuracy = 0.75 · 0.80 + 0.25 · 0.60 = 0.75
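As an illustration, the sketch below reproduces the numbers of Eg. 2 in Python; the count assignment TP = 60, FN = 15, TN = 15, FP = 10 is one breakdown consistent with those rates (it is assumed, not stated explicitly in the slides):

```python
# One count breakdown consistent with Eg. 2.
tp, fn, tn, fp = 60, 15, 15, 10
total = tp + fn + tn + fp                    # 100 instances

tpr = tp / (tp + fn)                         # 60 / 75 = 0.80
tnr = tn / (tn + fp)                         # 15 / 25 = 0.60
accuracy = (tp + tn) / total                 # 0.75

pos = (tp + fn) / total                      # fraction of actual positives = 0.75
neg = 1 - pos                                # 0.25
weighted_accuracy = pos * tpr + neg * tnr    # 0.75*0.80 + 0.25*0.60 = 0.75

print(tpr, tnr, accuracy, weighted_accuracy)
```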
This shows that:
• When class distribution is imbalanced, the more frequent class has more
influence on accuracy.
• A model can show high accuracy by focusing on the majority class but may
fail on the minority class.
CLASS PROBABILITY ESTIMATION
• A probability estimator is a type of classifier that doesn't just say "yes" or "no"
(like "spam" or "not spam") — instead, it gives a probability score.
For example:
Instead of saying “this is spam,” it says, “There’s an 80% chance this is spam.”
It does this by outputting a vector of probabilities, one for each class (in binary
classification we need just one, e.g., p̂(x) = 0.80 for the positive class, such as spam).
In real-world data, we don’t know the true probabilities. We only know whether an
instance is positive or not. So we estimate probabilities using patterns seen in the
data.
Two Extreme Scenarios for Estimating
Probabilities
1. Every Instance is the Same as Every Other
• That means we consider all instances similar, and so we always
return the overall proportion of positives in the dataset.
• Example: If 30% of emails are spam, we predict p̂(x) = 0.30 for all
emails, no matter their content.
2. Only Identical Instances Are Similar
• We only consider an instance similar if it’s exactly the same as
another.
• In this case, if we have seen the instance before, we know its true label, so we
predict p̂(x) = 1 if it was positive and p̂(x) = 0 if it was negative.
• But we cannot generalize this to new data.
Feature Tree: A Balance Between the Two
• The tree splits data based on features (like presence of the words
“Bonus” or “lottery”).
• At each leaf, we assign a probability: the proportion of positive training
instances that end up in that leaf.
• This means that if two instances end up in the same leaf, they are considered
similar and receive the same probability estimate.
ASSESSING CLASS PROBABILITY ESTIMATES
In machine learning classification, assessing class probability
estimates is essential when you want more than just the predicted
label — you want to know how confident the model is about each
prediction.
To evaluate the accuracy of predicted probabilities from a classifier
(especially in a multi-class setting), we use the squared error (SE) and the
mean squared error (MSE), which in forecasting is also known as the Brier score.
Probability Vector Representation
• The true class of an instance can be written as a vector with a 1 for the actual
class and 0 for every other class, while the model outputs a vector of estimated
probabilities.
Handling the Error
• Squared Error and Mean Squared Error: the squared error (SE) of an instance is the
sum of squared differences between the predicted probability vector and this true
0/1 vector, and the MSE averages the SE over all instances.
Three-Class Task-CASE 1
The first model gets punished more because, although mostly right, it isn’t quite sure of it
Three-Class Task-CASE 2
The second model gets punished more for not just being wrong, but being presumptuous.
SE in Tree Leaves
• Leaf 1: 20(0.33−1)^2 + 40(0.33−0)^2 = 13.33
• Leaf 2: 10(0.67−1)^2 + 5(0.67−0)^2 = 3.33
• Leaf 3: 20(0.80−1)^2 + 5(0.80−0)^2 = 4.00
• MSE = (SE1 + SE2 + SE3) / 100 = 0.21
Can We Improve the Probabilities?
• Trying new values (e.g., 0.4 or 0.2) increases error
• SE1 = 13.6 (for 0.4), SE1 = 14.4 (for 0.2)
• Best prediction = observed proportion of positives in leaf
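A small Python sketch that recomputes the leaf squared errors and the MSE (Brier score) from the counts used above (20/40, 10/5 and 20/5 positives/negatives per leaf):

```python
# (positives, negatives, predicted probability p_hat) for each leaf, as in the example above.
leaves = [(20, 40, 0.33), (10, 5, 0.67), (20, 5, 0.80)]

total = sum(p + n for p, n, _ in leaves)      # 100 instances in total
# Squared error of a leaf: each positive contributes (p_hat - 1)^2, each negative (p_hat - 0)^2.
se = [p * (p_hat - 1) ** 2 + n * (p_hat - 0) ** 2 for p, n, p_hat in leaves]
mse = sum(se) / total

print([round(v, 2) for v in se])  # roughly [13.33, 3.33, 4.0]
print(round(mse, 2))              # roughly 0.21
```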
Optimum- Empirical Class Frequency
•Squared Error helps measure confidence and correctness
•MSE (Brier Score) penalizes overconfidence and uncertainty
•Optimal prediction = empirical class frequency
MULTI-CLASS PROBABILITY ESTIMATION
Challenges of Probability Estimation
•Real class probabilities are not observable in training data
•We only see (x, y) pairs, not full distributions
• Probabilistic classifiers must therefore estimate these probabilities from the
observed labels
• A classifier can still be accurate even if its probability estimates are poor,
as long as it predicts the highest-probability class correctly
• Only when the estimation error alters the ranking of classes does classification fail
• Hence: a good ranking matters more than exact probability values for classification
accuracy
REGRESSION
• Regression is a supervised learning technique used to model the
relationship between a target (dependent) variable and one or more
independent (predictor) variables.
• It is primarily applied for prediction, forecasting, time series
modeling, and determining cause-and-effect relationships between
variables.
• Regression techniques vary depending on:
• The number of independent variables involved.
• The type of relationship (linear or nonlinear) between the
independent and dependent variables.
• By analyzing these relationships, regression helps in identifying
correlation between variables and enables the prediction of
continuous output values based on input features.
• What is the purpose of a regression model?
• Regression analysis can be used to do one of two things: predict the value
of the dependent variable when you know something about the independent
variables, or predict how the independent variables will affect the
dependent variable.
Types of Regression
Types of Regression Analysis
• 1. Linear Regression
• Purpose: Predicts a continuous outcome based on the linear relationship
between dependent (Y) and independent (X) variables.
• Equation: Y=mx+c
• Can be simple (one X) or multiple (many Xs).
• Uses a straight line (best-fit line).
• 2. Logistic Regression
• Used when the dependent variable is binary (e.g., 0/1, yes/no).
• Uses a sigmoid curve to map predictions to probabilities between 0 and 1.
• Common in classification tasks.
• 3. Polynomial Regression
• Models a non-linear relationship using polynomial terms (e.g., x^2,x^3).
• Best-fit line is curved.
• Useful when data shows a non-linear trend.
• 4. Ridge Regression
• Used when there is multicollinearity (high correlation among predictors).
• Adds L2 regularization (lambda) to shrink coefficients and reduce
overfitting.
• Improves model stability.
• 5. Lasso Regression
• Adds L1 regularization to shrink some coefficients to zero.
• Helps with feature selection by eliminating less important variables.
• 6. Quantile Regression
• Predicts quantiles (e.g., median) instead of the mean.
• Useful when:
• Data has outliers
• Linear regression assumptions are violated
• 7. Bayesian Linear Regression
• Uses Bayes’ theorem to estimate the distribution of coefficients.
• Provides a probabilistic model.
• Often more stable than standard linear regression.
• 8. Principal Components Regression (PCR)
• Uses Principal Component Analysis (PCA) to reduce predictors before
regression.
• Handles multicollinearity by transforming input features.
• Like ridge regression, it reduces overfitting via bias-variance tradeoff.
• 9. Partial Least Squares Regression (PLSR)
• Useful when there are many predictors with multicollinearity.
• Reduces variables to a smaller set while considering both predictor and
response variance.
• Faster and more efficient in high-dimensional data.
• 10. Elastic Net Regression
• Combines Ridge (L2) and Lasso (L1) penalties.
• Good for:
• High-dimensional datasets
• Highly correlated predictors
• Offers balanced regularization.
Summary of regression types (type, key features, best used when):
1. Linear Regression: assumes a straight-line relationship between X and Y; simple (one independent variable) or multiple (two or more independent variables). Best used when the data has a linear trend.
2. Logistic Regression: predicts binary outcomes using a sigmoid curve. Best used when the target variable is categorical (e.g., pass/fail).
3. Polynomial Regression: models non-linear relationships with polynomial terms. Best used when the data shows a curved pattern.
4. Ridge Regression: adds L2 regularization to reduce overfitting and handle multicollinearity. Best used when predictors are highly correlated.
5. Lasso Regression: adds L1 regularization and can shrink some coefficients to zero. Best used when you want to select important features.
6. Quantile Regression: predicts quantiles (e.g., the median) instead of the mean. Best used when the data has outliers or does not meet linear regression assumptions.
7. Bayesian Linear Regression: uses Bayes' theorem and returns a distribution over coefficients. Best used when a probabilistic interpretation and stability are needed.
8. Principal Components Regression (PCR): uses PCA to reduce input features before regression. Best used when there is high multicollinearity or many predictors.
9. Partial Least Squares Regression (PLSR): reduces predictors while maximizing covariance with the target. Best used when there are many highly correlated predictors with few observations.
10. Elastic Net Regression: combines Ridge + Lasso regularization. Best used when predictors are numerous and correlated.
Simple Linear Regression
• Simple Linear Regression is a basic type of regression analysis where there is only one
independent variable (x) and one dependent variable (y).
• The relationship between these variables is assumed to be linear.
• It uses the slope-intercept form of a straight line to make predictions:
y=mx+c
x: Independent variable (input)
y: Dependent variable (predicted output)
m: Slope of the line
c: Y-intercept
• The goal of the linear regression algorithm is to find the optimal values for m and c so that the
line best fits the observed data points.
• The red line in the graph (often shown in visual examples) is known as the best fit
line. It minimizes the difference between the predicted values and the actual
values, often using a method like least squares.
• This model is widely used for predictive analysis when the relationship between
the input and output is expected to be linear.
• The line can be modelled by the linear equation:
y = a_0 + a_1 * x # Linear Equation
• The goal of the linear regression algorithm is to
find the best values for a_0 and a_1.
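As a hedged illustration (the data values below are invented, not from the slides), the ordinary least-squares estimates of a_0 and a_1 can be computed directly:

```python
import numpy as np

# Invented example data points (x, y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Ordinary least-squares estimates for y = a_0 + a_1 * x.
a_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_0 = y.mean() - a_1 * x.mean()

print(f"best-fit line: y = {a_0:.3f} + {a_1:.3f} * x")
```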
Cost Function
• The cost function helps determine the optimal values of the regression
coefficients (a0 and a1) that define the best-fit line for the given data.
• These values make sure the line fits the data as closely as possible.
• To do this, we measure the error between:
• what our model predicts (ŷ), and
• what the actual values are (y).
• This transforms the problem into a minimization problem.
Mean Squared Error (MSE) – The Cost Function
• The difference between the predicted values and the ground truth measures
the error.
• We square each error, sum over all data points, and divide by the total
number of data points.
• This gives the average squared error over all the data points, so this cost
function is also known as the Mean Squared Error (MSE) function.
• By changing a0 and a1, we try to minimize this MSE value. The lower
the MSE, the better the model.
• The equation y = a_0 + a_1 * x is also called the hypothesis function,
and the cost function helps us check how well it works.
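A minimal sketch of the MSE cost for the hypothesis y_hat = a_0 + a_1 * x, using the same kind of invented data as above:

```python
import numpy as np

def mse_cost(a_0, a_1, x, y):
    """Average squared error of the hypothesis y_hat = a_0 + a_1 * x."""
    y_hat = a_0 + a_1 * x
    return np.mean((y_hat - y) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(mse_cost(0.0, 2.0, x, y))   # cost for one particular choice of a_0, a_1
```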
GRADIENT DESCENT
• Gradient descent is a method of updating a_0 and a_1 to reduce the
cost function (MSE).
• The idea is that we start with some random values for a_0 and a_1 and then
change these values iteratively to reduce the cost.
• Gradient descent tells us how to change the values.
• It minimizes the MSE by calculating the gradient of the cost function and
stepping the parameters in the direction that decreases it.
• To draw an analogy, imagine a pit in the shape of a U. You are standing at the
topmost point of the pit and your objective is to reach the bottom.
• There is a catch: you can only take a discrete number of steps to reach the bottom.
• If you take small steps, you will eventually reach the bottom of the pit, but it
will take longer. If you take larger steps each time, you will get there sooner,
but you might overshoot the bottom and not land exactly at it.
• In the gradient descent algorithm, the size of each step you take is the learning rate.
A small learning rate means slow but steady progress. A large learning rate might
speed things up but can make the model unstable or miss the minimum.
• The goal is to reach the lowest point (minimum cost) — just like getting to the
bottom of the pit.
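The sketch below shows one possible implementation of this update loop for simple linear regression; the data, learning rate, and iteration count are illustrative choices, not values from the slides:

```python
import numpy as np

# Invented data; learning rate and iteration count are illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a_0, a_1 = 0.0, 0.0            # start from arbitrary values
learning_rate = 0.01
n = len(x)

for _ in range(5000):
    y_hat = a_0 + a_1 * x
    # Gradients of the MSE cost with respect to a_0 and a_1.
    grad_a0 = (2.0 / n) * np.sum(y_hat - y)
    grad_a1 = (2.0 / n) * np.sum((y_hat - y) * x)
    # Step in the direction that decreases the cost.
    a_0 -= learning_rate * grad_a0
    a_1 -= learning_rate * grad_a1

print(f"a_0 = {a_0:.3f}, a_1 = {a_1:.3f}")
```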
ASSESSING PERFORMANCE OF REGRESSION
1. Mean Absolute Error (MAE)
2. Root Mean Squared Error (RMSE)
3. R-squared (R2)
4. Residual Standard Error (RSE)
1. Mean Absolute Error
• The Mean Absolute Error (MAE), like the RMSE, measures the prediction
error. Mathematically, it is the average absolute difference between
observed and predicted outcomes:
MAE = mean(abs(observeds - predicteds)).
• MAE is less sensitive to outliers than RMSE.
2. Root Mean Squared Error (RMSE)
• It measures the average error made by the model in predicting the outcome
for an observation.
• Mathematically, the RMSE is the square root of the mean squared error (MSE),
which is the average squared difference between the observed outcome values
and the values predicted by the model.
• So, MSE = mean((observeds - predicteds)^2) and
RMSE = sqrt(MSE).
• The lower the RMSE, the better the model.
3. R squared
• R-squared (R²) is the proportion of variation in the outcome that is
explained by the predictor variables.
• In multiple regression models, R² corresponds to the squared correlation
between the observed outcome values and the values predicted by the model.
• The higher the R-squared, the better the model.
4. Residual Standard Error (RSE)
• Also known as the model sigma, the RSE is a variant of the RMSE adjusted
for the number of predictors in the model.
• The lower the RSE, the better the model. In practice, the difference
between RMSE and RSE is very small, particularly for large
multivariate data.
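A short Python sketch computing the four metrics on hypothetical observed and predicted values (p denotes the assumed number of predictors):

```python
import numpy as np

# Hypothetical observed and predicted values; p is the assumed number of predictors.
observed  = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
predicted = np.array([2.8, 5.4, 7.0, 9.5, 10.6])
n, p = len(observed), 1

mae  = np.mean(np.abs(observed - predicted))            # Mean Absolute Error
rmse = np.sqrt(np.mean((observed - predicted) ** 2))    # Root Mean Squared Error
ss_res = np.sum((observed - predicted) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r2   = 1 - ss_res / ss_tot                              # R-squared
rse  = np.sqrt(ss_res / (n - p - 1))                    # Residual Standard Error

print(mae, rmse, r2, rse)
```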
OVERFITTING
• Overfitting occurs when a regression model captures noise or random
fluctuations in the training data, rather than the true underlying
relationship between variables.
• This problem occurs when the model is too complex.
Why It's a Problem?
1. Misleading Statistics:
• R² becomes artificially high — the model looks like it explains a lot of
variance, but only for the training data.
• Regression coefficients reflect random variation rather than real effects.
• P-values may appear significant when they aren’t — leading to false
conclusions.
2. Poor Generalization:
• The model performs poorly on new or unseen data because it’s tailored to the
specific quirks of the training dataset.
Catalysts for Overfitting
• The more parameters (e.g., means) you try to estimate from a fixed
sample size, the less reliable each estimate becomes.
• If you add too many terms with too few data points, the model will
fit noise — not signal.
• As the number of means increases (1 → 2 → 3...) while the total sample size
stays the same (20):
• estimates become more variable,
• they are less likely to be replicated in a new sample, and
• overall precision drops.
Causes of Overfitting
• Too many model terms relative to data size
• Small sample size
• Inclusion of irrelevant features
• High model flexibility (e.g., deep neural nets, high-degree
polynomials)
• Multicollinearity (predictors are correlated)
Applying These Concepts to Overfitting
Regression Models:
• You have:
• Fixed number of observations (say, 30 data points)
• You try to fit a model with 6 parameters (variables + interactions +
polynomial terms)
• 30 observations / 6 terms = 5 observations per term → TOO LOW!
• Each estimate is based on too little data → unstable & erratic.
The model might fit the training data well but won’t generalize to
new data.
This is overfitting.
How to Avoid Overfitting Models:
• Plan your model before collecting data — so that your sample size is
large enough to support the complexity of the model.
How to Detect Overfit Models:
• What is Predicted R-squared?
• Predicted R² is a cross-validation metric used in regression to evaluate
how well your model predicts new (unseen) data.
• Unlike normal R² (which measures how well the model fits existing
data), predicted R² tells you:
• Will this model perform well on future data?
• Or is it just memorizing the current data (i.e., overfitting)?
How Predicted R² Works?
• Statistical software automates this cross-validation as follows:
• Remove one data point from the dataset.
• Fit the regression model to the remaining data.
• Predict the removed data point using that model.
• Measure how close the prediction is to the actual value.
• Repeat for every point in the dataset.
• This is a form of Leave-One-Out Cross-Validation (LOOCV).
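A minimal sketch of this leave-one-out procedure for simple linear regression, on invented data; PRESS denotes the sum of squared leave-one-out residuals:

```python
import numpy as np

# Invented data for a simple linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

press = 0.0                                            # sum of squared leave-one-out residuals
for i in range(len(x)):
    mask = np.arange(len(x)) != i                      # remove one data point
    slope, intercept = np.polyfit(x[mask], y[mask], 1) # fit on the remaining data
    y_pred = intercept + slope * x[i]                  # predict the removed point
    press += (y[i] - y_pred) ** 2                      # measure how close the prediction is

ss_tot = np.sum((y - y.mean()) ** 2)
predicted_r2 = 1 - press / ss_tot
print(predicted_r2)
```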
POLYNOMIAL REGRESSION
• Polynomial Regression is a regression algorithm that models the relationship
between a dependent(y) and one or more independent variables(x) as nth degree
polynomial.
• Unlike Simple Linear Regression, it can capture non-linear relationships in the
data.
• The Polynomial Regression equation is given below:
y = b0 + b1*x + b2*x^2 + b3*x^3 + ... + bn*x^n
• This makes Polynomial Regression ideal for data with curved patterns that
cannot be effectively modeled with a straight line.
• It is also called a special case of Multiple Linear Regression in ML, because we
add polynomial terms to the Multiple Linear Regression equation to convert it into
Polynomial Regression.
• The dataset used in Polynomial regression for training is of non-linear
nature.
• Hence,"In Polynomial regression, the original features are
converted into Polynomial features of required degree (2,3,..,n) and
then modeled using a linear model."
Why Use Polynomial Regression?
• Polynomial Regression is necessary when the data points are
arranged in a non-linear fashion.
• Applying a linear model to such data often results in high error
rates and inaccurate predictions.
• In contrast, Polynomial Regression provides a curved model that
better fits the data, minimizing loss function and enhancing accuracy.
• It is particularly useful for scenarios like price
prediction and weather forecasting, where relationships between
variables are complex.
Need for Polynomial Regression
• In the figure referred to above, the dataset is arranged non-linearly. If we try
to cover it with a linear model, we can clearly see that it hardly covers any data
points. On the other hand, a curve covers most of the data points, and that curve
comes from the Polynomial model.
• Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.
Equation of the Polynomial Regression Model
• Simple Linear Regression: y = b0 + b1*x
• Multiple Linear Regression: y = b0 + b1*x1 + b2*x2 + ... + bn*xn
• Polynomial Regression: y = b0 + b1*x + b2*x^2 + ... + bn*x^n
• When we compare these three equations, we can clearly see that all three are
polynomial equations that differ only in the degree of the variables. The Simple
and Multiple Linear equations are polynomial equations of degree one, and the
Polynomial Regression equation is a linear equation with terms up to the nth degree.
• So if we add higher-degree terms to our linear equation, it is converted into a
Polynomial Regression equation.
Implementation of Polynomial Regression Using Python
• A Human Resource (HR) company is in the process of hiring a new candidate. The
candidate claims that their previous annual salary was 160K. To verify this claim,
the HR team needs to determine if the candidate is
being truthful or exaggerating. However, they only have access to a dataset from
the candidate’s previous company, which details the salaries associated with the
top 10 job positions, along with their respective levels. Upon analyzing this
dataset, it becomes evident that there is a non-linear relationship between
the job position levels and their corresponding salaries. The objective is to build
a regression model capable of detecting any discrepancies in the candidate’s
claim. This "Bluff Detection" model will help the HR team make an informed
hiring decision, ensuring they bring on board a candidate who is honest about
their past compensation.
Why Polynomial Regression?
• Non-linear Relationships:
• Salary often doesn't have a simple linear relationship with position
level. Polynomial regression can capture these non-linear trends, making it
more suitable than linear regression.
• Flexibility:
• Polynomial regression can fit various curve shapes, allowing for a more
accurate representation of the data.
• Overfitting:
• It's important to choose the right degree for the polynomial to avoid
overfitting (where the model performs well on the training data but poorly
on new data).
• Here's how it works:
• 1. Data Collection:
• Gather data on employee salaries and their corresponding position levels from
the candidate's previous company.
• 2. Model Training:
• Train a polynomial regression model on this data. The model will learn the
relationship between position levels and salaries.
• 3. Prediction:
• Input the candidate's position level into the trained model to predict the
corresponding salary.
• 4. Bluff Detection:
• Compare the predicted salary with the candidate's stated salary. If the predicted
salary is significantly different from the stated salary, it could indicate a bluff.
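A possible sketch of this workflow using scikit-learn; the level/salary values, the polynomial degree, and the candidate's level are illustrative assumptions, not the dataset from the slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative level/salary data for the top 10 positions (values assumed).
levels   = np.arange(1, 11).reshape(-1, 1)
salaries = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000]) * 1000.0

# Convert the original feature into polynomial features, then fit a linear model.
poly = PolynomialFeatures(degree=4)        # degree is a tuning choice; too high risks overfitting
X_poly = poly.fit_transform(levels)
model = LinearRegression().fit(X_poly, salaries)

claimed_level = 6.5                        # candidate's stated position level (assumed)
predicted_salary = model.predict(poly.transform([[claimed_level]]))[0]
print(f"Predicted salary at level {claimed_level}: {predicted_salary:,.0f}")
# Compare predicted_salary with the claimed 160K to judge whether the claim is plausible.
```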
• Reference: https://tutorialforbeginner.com/polynomial-regression-in-machine-learning
THEORY OF GENERALIZATION
What is Hypothesis?
• A hypothesis is an explanation for something.
• It is a provisional idea, an educated guess that requires some evaluation.
It is not a fact and needs to be tested through experiments, analysis, or data.
• A good hypothesis is testable; it can be either true or false.
• In science, a hypothesis must be falsifiable, meaning that there exists
a test whose outcome could mean that the hypothesis is not true.
• The hypothesis must also be framed before the outcome of the test is
known (to avoid bias).
What is Hypothesis Testing?
• Statistical hypothesis testing is a formal process used to:
• Evaluate a claim or assumption (hypothesis) about a population based
on sample data.
• Help determine whether an observed effect (like a difference between
means) is likely to be real or just due to random chance.
Critical Value / Effect
• The effect refers to something we observe in the data — like a
difference between means, a correlation, or an association.
• We then calculate a test statistic (e.g., t, z, F value) that tells us how
extreme the observed effect is under a given hypothesis.
Likelihood Under Null Hypothesis
“...how likely it is to observe the effect if a relationship does not exist.”
• This is done under the null hypothesis, which assumes no effect or
no difference.
• We calculate a p-value, which tells us:
• "If the null hypothesis is true, how likely are we to see an effect as
large as (or larger than) the one we observed?"
• If the likelihood is very small, it suggests that the effect is
probably real.
Null Hypothesis (H₀)
“One hypothesis is that there is no difference between the population
means...”
• The null hypothesis (H₀) is a default position — it assumes no
difference, no effect, or no relationship.
• For example: "The mean score of group A is the same as group B."
Reject or Fail to Reject
• We never say “accept” the null hypothesis because:
• We're working with probabilities, not certainties.
• Even if a p-value is high, it doesn’t prove the null is true — it only
means we don’t have enough evidence to reject it.
REGULARIZATION THEORY
• Regularization is a strategy used in machine learning to improve a
model’s ability to generalize — that is, perform well on unseen data
— even if it means increasing error on the training data.
• Any modification made to a learning algorithm to reduce
generalization error (i.e., error on the test set), even if it increases
training error, is called regularization.
• It aims to fit the training data well enough without overfitting.
• Regularization improves generalization and makes models simpler
and more robust.
How Regularization Works?
• Regularization works by adding constraints or penalties to the model.
• There are two main ways:
• Hard Constraints:
• Put strict rules or limits on the model parameters (usually the weights in a model).
• Example: restricting the values of the weights so that they must stay below a certain number,
e.g. requiring ||w||^2 ≤ c for some constant c.
• Soft Constraints (Penalties):
• Instead of forcing limits, soft constraints use penalties in the loss function.
• These discourage the model from becoming too complex but still allow it if
necessary.
• The strength of this penalty is controlled by a regularization parameter (like λ).
• We modify the loss function:
Total Loss = Prediction Loss + λ · Penalty
Where:
Prediction Loss = error on the training data (e.g., MSE)
Penalty = regularization term
λ = controls how much the penalty affects the total loss
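A small sketch of this modified loss in Python; the weights, data, and λ value are hypothetical:

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Prediction loss (MSE) plus lambda times an L1 or L2 penalty on the weights."""
    prediction_loss = np.mean((X @ w - y) ** 2)
    if penalty == "l1":
        reg = np.sum(np.abs(w))     # L1: sum of absolute weights
    else:
        reg = np.sum(w ** 2)        # L2: sum of squared weights
    return prediction_loss + lam * reg

# Hypothetical data and weights.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([5.0, 4.0, 11.0])
w = np.array([1.0, 2.0])
print(regularized_loss(w, X, y, lam=0.1, penalty="l2"))
```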
Types of Regularization
• 1. L1 Regularization (Lasso Regression)
• Penalty = sum of the absolute values of the weights, added to the loss function.
• Tends to make some weights exactly zero, and is therefore useful for feature
selection: it keeps only the most important features.
• E.g., imagine a model using 50 features when only 5 really matter. L1
regularization helps the model automatically remove the unnecessary 45
by setting their weights to 0.
2. L2 Regularization (Ridge Regression)
• Penalty = sum of the squares of the weights, added to the loss function.
• Encourages small weight values, but not necessarily zero.
• Helps keep the model simple and smooth.
• E.g., instead of removing features, L2 just shrinks the weights of the less
important ones; they still contribute a bit.
3. Elastic Net
• Combines L1 and L2 regularization.
• Useful when there are many correlated features.
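For illustration, the scikit-learn estimators Ridge, Lasso and ElasticNet implement these penalties (alpha plays the role of λ); the data below are synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data: 5 features, but only the first two actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
# Lasso typically drives the three irrelevant coefficients to exactly zero,
# while Ridge only shrinks them towards zero.
```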
Benefits of Regularization
•Prevents overfitting
•Improves generalization
•Helps select important features (L1)
•Keeps model complexity low
VC (VAPNIK-CHERVONENKIS)
DIMENSIONS
• The Vapnik-Chervonenkis (VC) dimension is a key concept in machine
learning that measures the capacity or complexity of a model, which helps
us understand how well it can fit different data sets.
• It was introduced by Vladimir Vapnik and Alexey Chervonenkis in the
1970s and has become a fundamental concept in statistical learning
theory.
• It essentially quantifies how well a model can fit different datasets by
counting the maximum number of points that the model can
perfectly classify in all possible ways (shatter).
Bounds of VC - Dimension
• The VC dimension provides both upper and lower bounds on the number of
training examples required to achieve a given level of accuracy.
• Both bounds grow roughly linearly with the VC dimension (the upper bound
carries additional logarithmic factors), so higher-capacity hypothesis
classes need more training data.
Definition:
• The VC dimension of a hypothesis class H, denoted dVC(H), is the largest
number m such that there exists a set of m points that can be shattered by
H.
• Shattering: H can correctly classify all 2^m possible labelings of those m
points.
• What is Shattering?
• A hypothesis class is said to “shatter” a set of data points if, no matter how
you label those points (e.g., assign them as positive or negative), the
hypothesis class has a function that can correctly classify them.
• Eg:
• A straight line (linear classifier) in 2D space has a VC dimension of 3. It can
shatter any set of 3 points that are not collinear, but no set of 4 points in the
plane can be shattered (for example, the XOR-style labeling of 4 points cannot be
separated by a single line).
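The brute-force check below (an illustrative sketch, not from the slides) verifies this by testing whether every labeling of a point set is linearly separable, using a linear SVM as the separating line:

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC

def is_shattered(points):
    """Return True if every +/- labeling of the points is linearly separable."""
    for labels in product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue                     # one-class labelings are trivially handled
        clf = SVC(kernel="linear", C=1e6).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False                 # some labeling cannot be realized by a line
    return True

three_points = np.array([[0, 0], [1, 0], [0, 1]])          # not collinear
four_points  = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])  # contains the XOR labeling
print(is_shattered(three_points))   # True: 3 points can be shattered
print(is_shattered(four_points))    # False: no line shatters 4 points, so dVC = 3
```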
Why VC Dimension Matters in ML:
• Generalization: Models with too high a VC dimension may overfit
(perform well on training data but poorly on unseen data).
• Simplicity: A model with a lower VC dimension may underfit
(fail to capture the patterns in data).
Applications of VC - Dimension
• The VC dimension has a wide range of applications in machine
learning and statistics. For example, it is used to analyze the
complexity of neural networks, support vector machines, and
decision trees. The VC dimension can also be used to design new
learning algorithms that are robust to noise and can generalize well to
unseen data.
• The VC dimension can be extended to more complex learning
scenarios, such as multiclass classification and regression. The
concept of the VC dimension can also be applied to other areas of
computer science, such as computational geometry and graph theory.