Unit - 2

Regression Analysis in Machine Learning: Introduction and Terminologies
Regression analysis is a fundamental technique in machine learning and statistics used to
model the relationship between a dependent variable (also called the target or output) and one
or more independent variables (also called features or predictors). The goal of regression is to
predict a continuous value, making it one of the most widely used approaches for supervised
learning tasks like predicting prices, demand, temperature, etc.

Key Terminologies in Regression


1. Dependent Variable (Target Variable)

The variable we aim to predict or explain. It is continuous in regression problems.

Example: In house price prediction, the house price is the dependent variable.

2. Independent Variables (Predictors/Features)

The input variables used to predict the dependent variable. They can be continuous or
categorical.

Example: The size of the house, number of rooms, location, etc., are independent
variables when predicting house prices.

3. Regression Coefficients

These are the values that represent the relationship between each independent
variable and the dependent variable. In a simple linear regression model, these are the
slope values that define how much the dependent variable changes for each unit
change in the independent variable.

4. Intercept (Constant)

The value of the dependent variable when all the independent variables are zero. It is
the point at which the regression line crosses the y-axis.

5. Residual (Error Term)

The difference between the observed actual values and the values predicted by the
regression model. It represents the unexplained variation in the dependent variable.

Residual = y_observed - y_predicted

6. Line of Best Fit (Regression Line)

In linear regression, the regression line represents the predicted values for the
dependent variable based on the independent variables. The goal is to find the line that
minimizes the residuals.

7. Multicollinearity

A situation where two or more independent variables are highly correlated, making it
difficult for the regression model to separate their individual effects on the dependent
variable.

8. Overfitting and Underfitting

Overfitting: When the model learns not only the underlying patterns in the data but also
the noise, leading to poor generalization to new data.

Underfitting: When the model is too simple to capture the underlying patterns, resulting
in poor performance on both training and test data.

Types of Regression in Machine Learning


There are various types of regression techniques used depending on the nature of the data
and the problem. Below are some of the most commonly used types:

7. Stepwise Regression
Stepwise regression is a technique that involves adding or removing predictors based on their
statistical significance. The process can either:

Forward Selection: Start with no variables and add them one by one based on their
contribution to the model.

Backward Elimination: Start with all variables and remove the least significant ones step by
step.

This technique is particularly useful when trying to build a simpler, more interpretable model by
selecting only significant features.
Applications: Useful in feature selection when you have a large number of variables, like in
econometrics or healthcare.
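
As an illustration, forward selection can be sketched in Python using p-values from statsmodels as the inclusion criterion. This is a minimal sketch, not a full implementation: it assumes a pandas DataFrame X of candidate predictors and a target series y (both hypothetical names).

```python
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: at each step add the predictor with the
    smallest p-value, stopping when no remaining predictor is significant."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        pvalues = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvalues[col] = model.pvalues[col]
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] < alpha:        # keep it only if statistically significant
            selected.append(best)
            remaining.remove(best)
        else:
            break
    return selected
```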

8. Quantile Regression
Unlike ordinary regression, which estimates the conditional mean of the dependent variable,
quantile regression estimates the conditional median (or other quantiles) of the dependent
variable. This allows us to model different quantiles (such as the 25th, 50th, or 90th percentile)
of the target variable, providing a more comprehensive understanding of the relationship
between variables.
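
As a brief, hedged illustration, the statsmodels library offers a QuantReg estimator; the sketch below fits the 25th, 50th, and 90th percentile lines on a synthetic, heteroscedastic dataset (variable names are illustrative).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1 + 0.3 * x, 200)      # noise grows with x

X = sm.add_constant(x)
for q in (0.25, 0.50, 0.90):
    res = sm.QuantReg(y, X).fit(q=q)                # conditional quantile fit
    print(f"q={q}: intercept={res.params[0]:.2f}, slope={res.params[1]:.2f}")
```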

Logistic Regression
Logistic regression is a statistical method used for binary classification problems where the
output variable is categorical and takes one of two possible values. Unlike linear regression,
which predicts a continuous output, logistic regression predicts the probability that a given
input point belongs to a particular class. The model is based on the logistic function, also
known as the sigmoid function, which maps any real-valued number into the range of 0 to 1.
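
A minimal NumPy sketch of the sigmoid mapping and the resulting classification (the coefficients b0 and b1 are made-up values for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.2                     # hypothetical intercept and coefficient
x = np.array([1.0, 3.0, 5.0])
p = sigmoid(b0 + b1 * x)               # predicted probabilities P(y=1 | x)
y_hat = (p >= 0.5).astype(int)         # classify with a 0.5 decision boundary
print(p, y_hat)
```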

Key Concepts and Terminologies

1. Binary Classification

Logistic regression is primarily used for binary classification tasks, where the output
can be either 0 or 1, such as spam vs. non-spam or disease vs. no disease.

5. Decision Boundary

The decision boundary is the threshold that determines the classification of the data
points. For binary classification, if the predicted probability P(y=1|x) is greater than
0.5, the output is classified as 1; otherwise, it is classified as 0.

Assumptions of Logistic Regression

1. Binary Outcome

Logistic regression assumes that the dependent variable is binary (0 or 1).

2. Independence of Observations
The observations must be independent of each other. Logistic regression does not handle
correlated observations well.

3. Linearity of Logits
It assumes a linear relationship between the independent variables and the log-odds of the
dependent variable. While the actual relationship may not be linear, the logit transformation
allows modeling this relationship linearly.

4. No Multicollinearity
Logistic regression assumes that the independent variables are not highly correlated with
each other. Multicollinearity can distort the estimates of the coefficients.

5. Large Sample Size


Logistic regression performs better with larger datasets, as small sample sizes may lead to
overfitting and unreliable coefficient estimates.

Advantages of Logistic Regression


1. Interpretability: The model provides easily interpretable coefficients, indicating the effect
of each feature on the probability of the target variable.

2. Less Complex: It is computationally less intensive compared to more complex models like
neural networks, making it efficient for large datasets.

3. Good Performance: Performs well for linearly separable data and can be extended to
handle non-linear relationships through feature transformations.

4. Probabilistic Output: Provides predicted probabilities, allowing for threshold adjustments


depending on the business requirements or costs associated with false positives and false
negatives.

Disadvantages of Logistic Regression


1. Linear Decision Boundary: Logistic regression is limited to linear decision boundaries. It
may struggle with complex relationships unless the features are transformed appropriately.

2. Sensitivity to Outliers: Outliers can significantly affect the coefficients and predictions in
logistic regression, leading to biased results.

3. Multicollinearity Issues: High correlation between independent variables can lead to


unstable coefficient estimates and difficulty in interpreting the effect of individual features.

4. Limited to Binary Outcomes: While extensions like multinomial logistic regression exist for
multi-class classification, logistic regression is primarily designed for binary outcomes.

Extensions of Logistic Regression


1. Multinomial Logistic Regression: Extends logistic regression to handle multi-class
classification problems, where the dependent variable can take on more than two
categories.

2. Ordinal Logistic Regression: Used when the dependent variable is ordinal (i.e., has a
meaningful order but no consistent interval). It models the probability of being in a
particular category or below.

3. Regularized Logistic Regression: Includes L1 (lasso) and L2 (ridge) penalties to prevent


overfitting and improve model generalization.
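
A short usage sketch of regularized logistic regression with scikit-learn, assuming the library is available and using a synthetic dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# L2 (ridge) penalty is the scikit-learn default; C is the inverse regularization strength
l2_model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# L1 (lasso) penalty needs a compatible solver such as liblinear or saga
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print(l2_model.coef_)
print(l1_model.coef_)   # L1 tends to shrink some coefficients exactly to zero
```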

Applications of Logistic Regression


Logistic regression is widely used across various fields due to its effectiveness and
interpretability:

1. Healthcare: Predicting the presence or absence of diseases based on patient


characteristics (e.g., heart disease prediction).

2. Finance: Credit scoring and risk assessment to determine the likelihood of loan default.

3. Marketing: Customer segmentation and predicting customer churn based on behavior.

4. Social Sciences: Analyzing survey data to understand factors influencing voter behavior or
public opinion.

5. Spam Detection: Classifying emails as spam or non-spam based on their content.

Conclusion
Logistic regression is a powerful and widely used classification algorithm in machine learning.
Its simplicity, interpretability, and effectiveness make it a go-to choice for binary classification
problems. Understanding its underlying mechanics, assumptions, and extensions is crucial for
applying it effectively in various applications. While it has limitations, logistic regression serves
as a solid foundation for many more complex classification techniques in machine learning.

Simple Linear Regression: Introduction, Assumptions, and Model Building
1. Introduction to Simple Linear Regression

Simple Linear Regression is a statistical method used to model the linear relationship between
two variables:

A dependent variable (Y), also called the response variable

An independent variable (X), also called the predictor or explanatory variable

The goal is to find the best-fitting straight line through the data points, known as the regression
line.

Mathematical Representation
The simple linear regression model is represented as:
Y = β₀ + β₁X + ε

Where:

Y is the dependent variable

X is the independent variable

β₀ is the y-intercept (the value of Y when X = 0)

β₁ is the slope of the line (the change in Y for a unit change in X)

ε is the error term (the difference between the predicted and actual Y values)

2. Assumptions of Simple Linear Regression


For the simple linear regression model to be valid and to produce reliable results, several
assumptions must be met:

1. Linearity: The relationship between X and Y should be linear.

Visualization: Scatter plot of Y vs. X should show a linear trend.

Test: Residual plot should show no pattern.

2. Independence: The observations should be independent of each other.

Often ensured by study design.

Test: Durbin-Watson test for autocorrelation.

3. Homoscedasticity: The variance of residuals should be constant across all levels of X.

Visualization: Residual plot should show constant spread.

Test: Breusch-Pagan test or White test.

4. Normality: The residuals should be normally distributed.

Visualization: Q-Q plot of residuals should follow a straight line.

Test: Shapiro-Wilk test or Kolmogorov-Smirnov test.

5. No or little multicollinearity: Not applicable in simple linear regression, but important in
multiple regression.

6. No influential outliers: Outliers can significantly affect the regression line.

Visualization: Scatter plot, residual plot.

Measure: Cook's distance.

3. Simple Linear Regression Model Building


The process of building a simple linear regression model involves several steps:

Step 1: Data Collection and Preparation


1. Collect relevant data for both X and Y variables.

2. Ensure data quality (handle missing values, outliers).

3. Visualize the data using a scatter plot to check for linearity.

Step 2: Estimating Model Parameters


The most common method for estimating β₀ and β₁ is Ordinary Least Squares (OLS). OLS
minimizes the sum of squared residuals.

Formulas for OLS estimators:


β₁ = Σ((xᵢ - x̄ )(yᵢ - ȳ)) / Σ((xᵢ - x̄ )²)
β₀ = ȳ - β₁x̄
Where x̄ and ȳ are the means of X and Y respectively.
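
These formulas translate directly into code. A minimal sketch with made-up data points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope
beta0 = y_bar - beta1 * x_bar                                          # intercept
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
```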

Step 3: Assessing Model Fit


1. R-squared (R²): Coefficient of determination

R² = 1 - (SSR / SST)

Where SSR is the sum of squared residuals and SST is the total sum of squares

R² ranges from 0 to 1, with 1 indicating a perfect fit

2. Adjusted R-squared: Adjusts R² for the number of predictors in the model

Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]

Where n is the number of observations and k is the number of predictors

3. Standard Error of the Regression (S): Measures the average distance between the
observed values and the regression line

S = √(SSR / (n - 2))
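
The same small example can be extended to compute these fit statistics from the residuals (here k = 1, since simple linear regression has a single predictor):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

# OLS estimates, as in Step 2
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
SSR = np.sum((y - y_hat) ** 2)               # sum of squared residuals
SST = np.sum((y - y.mean()) ** 2)            # total sum of squares

n, k = len(y), 1
r2 = 1 - SSR / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
s = np.sqrt(SSR / (n - 2))                   # standard error of the regression
print(f"R² = {r2:.4f}, adjusted R² = {adj_r2:.4f}, S = {s:.4f}")
```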

Step 4: Hypothesis Testing
1. t-test for individual coefficients:

H₀: β₁ = 0 (no linear relationship)

H₁: β₁ ≠ 0 (there is a linear relationship)

t-statistic = (β₁ - 0) / SE(β₁)

Where SE(β₁) is the standard error of β₁

2. F-test for overall model significance:

H₀: β₁ = 0 (model is not significant)

H₁: β₁ ≠ 0 (model is significant)

F-statistic = MSR / MSE

Where MSR is mean square regression and MSE is mean square error

Step 5: Model Diagnostics


1. Residual analysis:

Plot residuals vs. fitted values to check homoscedasticity and linearity

Plot residuals vs. predictor to check for patterns

Normal Q-Q plot of residuals to check normality

2. Influence diagnostics:

Cook's distance to identify influential points

Leverage to identify high-leverage points

Step 6: Model Interpretation and Use


1. Interpret coefficients:

β₀: Expected value of Y when X = 0

β₁: Expected change in Y for a one-unit increase in X

2. Prediction:

Point prediction: Ŷ = β₀ + β₁X

Interval prediction: Calculate prediction intervals

3. Validate assumptions and refine if necessary

Step 7: Model Validation


1. Cross-validation: e.g., k-fold cross-validation

2. Test on a separate dataset if available

Remember that while simple linear regression is a powerful tool, it's limited to modeling linear
relationships between two variables. For more complex relationships or multiple predictors,
consider polynomial regression or multiple linear regression.

Ordinary Least Squares Estimation, Properties of Estimators, Interval Estimation, and Residuals in Linear Regression
1. Ordinary Least Squares (OLS) Estimation
OLS is a method for estimating the unknown parameters in a linear regression model. It
minimizes the sum of the squares of the differences between the observed dependent variable
and the predicted dependent variable.

Mathematical Formulation
For the simple linear regression model Y = β₀ + β₁X + ε, OLS minimizes:
S(β₀, β₁) = Σ(yᵢ - (β₀ + β₁xᵢ))²
where (xᵢ, yᵢ) are the observed data points.

Derivation of OLS Estimators


To find the minimum, we differentiate S with respect to β₀ and β₁ and set the derivatives to
zero:
∂S/∂β₀ = -2Σ(yᵢ - (β₀ + β₁xᵢ)) = 0
∂S/∂β₁ = -2Σ(xᵢ(yᵢ - (β₀ + β₁xᵢ))) = 0
Solving these equations leads to the OLS estimators:
β̂₁ = Σ((xᵢ - x̄ )(yᵢ - ȳ)) / Σ((xᵢ - x̄ )²)
β̂₀ = ȳ - β̂₁x̄
where x̄ and ȳ are the sample means of x and y respectively.
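
In practice these estimators are rarely computed by hand. As a hedged cross-check, the statsmodels OLS routine recovers the same values on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)   # true β₀ = 3, β₁ = 2

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.params)       # estimated β̂₀ and β̂₁
print(results.summary())    # standard errors, t-tests, R², and more
```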

2. Properties of the Least-Squares Estimators


Under the Gauss-Markov assumptions, OLS estimators have several important properties:

1. Unbiasedness: E(β̂₀) = β₀ and E(β̂₁) = β₁

The expected values of the estimators equal the true parameter values.

2. Best Linear Unbiased Estimators (BLUE):

Among all linear unbiased estimators, OLS estimators have the smallest variance.

3. Consistency: As sample size increases, β̂₀ and β̂₁ converge in probability to β₀ and β₁.

4. Efficiency: Under the additional assumption of normally distributed errors, the OLS estimators achieve the Cramér-Rao lower bound.

Sampling Distributions of the Estimators


Under normality assumption of errors:

1. β̂₁ ~ N(β₁, σ²/Σ((xᵢ - x̄ )²))

2. β̂₀ ~ N(β₀, σ²(1/n + x̄ ²/Σ((xᵢ - x̄ )²)))

where σ² is the variance of the error term.

3. Properties of the Fitted Regression Model


1. The regression line always passes through the point (x̄ , ȳ).

2. The sum of the residuals is zero: Σ(yᵢ - ŷᵢ) = 0

3. The sum of the observed y values equals the sum of the fitted y values: Σyᵢ = Σŷᵢ

4. The sum of the cross-products of the x values and the residuals is zero: Σ(xᵢ(yᵢ - ŷᵢ)) = 0

4. Interval Estimation in Simple Linear Regression


Interval estimation provides a range of plausible values for the population parameters or future
observations.
Confidence Interval for β₁
A (1-α)100% confidence interval for β₁ is given by:
β̂₁ ± t(α/2, n-2) * SE(β̂₁)

where:

t(α/2, n-2) is the t-value with n-2 degrees of freedom

SE(β̂₁) = s / √(Σ((xᵢ - x̄ )²))

s is the estimated standard error of the regression

Confidence Interval for the Mean Response


For a given x₀, a (1-α)100% confidence interval for the mean response E(Y|x₀) is:
(β̂₀ + β̂₁x₀) ± t(α/2, n-2) * s * √(1/n + (x₀ - x̄ )² / Σ((xᵢ - x̄ )²))
Prediction Interval for a Future Observation
For a future observation at x₀, a (1-α)100% prediction interval is:

(β̂₀ + β̂₁x₀) ± t(α/2, n-2) * s * √(1 + 1/n + (x₀ - x̄ )² / Σ((xᵢ - x̄ )²))


Note that the prediction interval is wider than the confidence interval due to the additional
uncertainty of individual observations.
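
A minimal sketch of these three intervals, using scipy for the t quantile and synthetic data (all names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 2, 50)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))     # estimated standard error of regression

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

# 95% confidence interval for β₁
se_b1 = s / np.sqrt(Sxx)
print("CI for slope:", (b1 - t_crit * se_b1, b1 + t_crit * se_b1))

# Confidence interval for the mean response and prediction interval at x₀ = 5
x0 = 5.0
y0 = b0 + b1 * x0
se_mean = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / Sxx)
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / Sxx)   # wider: extra 1 inside the root
print("CI for mean response:", (y0 - t_crit * se_mean, y0 + t_crit * se_mean))
print("Prediction interval:", (y0 - t_crit * se_pred, y0 + t_crit * se_pred))
```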

5. Residuals
Residuals are the differences between the observed values of the dependent variable and the
predicted values:
eᵢ = yᵢ - ŷᵢ

Properties of Residuals
1. The sum of residuals is zero: Σeᵢ = 0

2. The residuals are uncorrelated with the predictor variable: Σ(xᵢeᵢ) = 0

3. The residuals are uncorrelated with the fitted values: Σ(ŷᵢeᵢ) = 0

Standardized Residuals
Standardized residuals are useful for identifying outliers:
rᵢ = eᵢ / (s * √(1 - hᵢᵢ))
where hᵢᵢ is the i-th diagonal element of the hat matrix H = X(X'X)⁻¹X'.
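
A short NumPy sketch of the hat matrix, leverages, and standardized residuals for a simple linear regression on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept column
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix H = X(X'X)⁻¹X'
h = np.diag(H)                                 # leverages hᵢᵢ

e = y - H @ y                                  # residuals, since ŷ = Hy
n, p = X.shape
s = np.sqrt(np.sum(e ** 2) / (n - p))          # residual standard error
r = e / (s * np.sqrt(1 - h))                   # standardized residuals
print(r[:5])
```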

Residual Analysis
Residual analysis is crucial for checking model assumptions:

1. Linearity: Plot residuals vs. fitted values or predictor variable

2. Homoscedasticity: Check for constant spread in residual plot

3. Normality: Q-Q plot of residuals

4. Independence: Plot residuals vs. order of observations

Measures of Influence
1. Leverage: hᵢᵢ (diagonal elements of hat matrix)

2. Cook's Distance: Measures the influence of each observation on the fitted values

Dᵢ = (rᵢ² / p) * (hᵢᵢ / (1 - hᵢᵢ)), where rᵢ is the standardized residual defined above


where p is the number of parameters in the model.
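
Continuing in the same spirit, Cook's distance can be computed from the leverages and standardized residuals; the cutoff used below (4/n) is only a common rule of thumb, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                 # leverages
e = y - H @ y                                  # residuals
n, p = X.shape
s2 = np.sum(e ** 2) / (n - p)
r = e / np.sqrt(s2 * (1 - h))                  # standardized residuals

cooks_d = (r ** 2 / p) * (h / (1 - h))         # Cook's distance
print(np.where(cooks_d > 4 / n)[0])            # indices of potentially influential points
```
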
Understanding these concepts is crucial for properly interpreting and validating linear
regression models. They provide the foundation for assessing the reliability of your model and
making informed decisions based on your analysis.

Multiple Linear Regression: Model, Assumptions, Output Interpretation, and Model Fit Assessment

1. Multiple Linear Regression Model

Multiple Linear Regression (MLR) is an extension of simple linear regression. It models the
linear relationship between a dependent variable and two or more independent variables.

Mathematical Representation
The general form of the multiple linear regression model is:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Where:

Y is the dependent variable

X₁, X₂, ..., Xₖ are the independent variables

β₀ is the y-intercept (the value of Y when all X's are 0)

β₁, β₂, ..., βₖ are the partial regression coefficients

ε is the error term

In matrix notation:
Y = Xβ + ε
Where:

Y is an n×1 vector of dependent variable observations

X is an n×(k+1) matrix of independent variables (including a column of 1's for the intercept)

β is a (k+1)×1 vector of regression coefficients

ε is an n×1 vector of error terms
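
In this matrix form, the OLS solution is β̂ = (XᵀX)⁻¹XᵀY, the multivariate analogue of the simple-regression estimators. A hedged NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 3
predictors = rng.normal(size=(n, k))                 # three independent variables
beta_true = np.array([1.0, 2.0, -1.5, 0.5])          # intercept plus three slopes
X = np.column_stack([np.ones(n), predictors])        # add the column of 1's for the intercept
y = X @ beta_true + rng.normal(0, 1, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # solves (XᵀX)β̂ = Xᵀy
print(beta_hat)                                      # should be close to beta_true
```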

2. Assumptions of Multiple Linear Regression


For the multiple linear regression model to be valid and produce reliable results, several
assumptions must be met:

1. Linearity: The relationship between the dependent variable and each independent variable
should be linear.

Visualization: Partial regression plots

Test: Ramsey RESET test

2. Independence: The observations should be independent of each other.

Often ensured by study design

Test: Durbin-Watson test for autocorrelation

3. Homoscedasticity: The variance of residuals should be constant across all levels of the
independent variables.

Visualization: Residual plot against fitted values

Test: Breusch-Pagan test or White test

4. Normality: The residuals should be normally distributed.

Visualization: Q-Q plot of residuals

Test: Shapiro-Wilk test or Kolmogorov-Smirnov test

5. No Multicollinearity: The independent variables should not be highly correlated with each
other.

Measure: Variance Inflation Factor (VIF)

Rule of thumb: VIF > 10 indicates problematic multicollinearity (a short VIF computation sketch follows this list)

6. No influential outliers: Outliers can significantly affect the regression results.

Visualization: Leverage vs. Standardized Residual plot

Measure: Cook's distance
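
A brief sketch of the VIF check from assumption 5, using the variance_inflation_factor helper in statsmodels on a deliberately collinear synthetic design matrix:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=300)     # nearly a copy of x1

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["const", "x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(X, i), 2))
# VIF values well above 10 for x1 and x3 would flag problematic multicollinearity
```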

3. Interpreting Multiple Linear Regression Output


R-Square (R²)
Definition: The proportion of variance in the dependent variable that is predictable from the
independent variable(s).

Formula: R² = 1 - (SSR / SST)


Where SSR is the sum of squared residuals and SST is the total sum of squares

Interpretation: R² ranges from 0 to 1. An R² of 0.7 means that 70% of the variance in Y is


predictable from X.

Caution: R² always increases when adding more variables, even if they're not meaningful.

Adjusted R-Square
Definition: A modified version of R² that adjusts for the number of predictors in the model.

Formula: Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]


Where n is the number of observations and k is the number of predictors

Interpretation: Useful for comparing models with different numbers of predictors.

Standard Error of the Regression (S)


Definition: The average distance between the observed values and the regression line.

Formula: S = √(SSR / (n - k - 1))

Interpretation: Smaller values indicate better fit. Used in calculating confidence intervals
and prediction intervals.

F-statistic and Significance F


F-statistic: Measures the overall significance of the model.

Formula: F = (R² / k) / ((1 - R²) / (n - k - 1))

Significance F: The p-value associated with the F-statistic.

Interpretation: A small p-value (typically < 0.05) indicates that the model is statistically
significant.

Coefficient P-values
Definition: The p-value for each coefficient tests the null hypothesis that the coefficient is
equal to zero.

Interpretation: A small p-value (typically < 0.05) indicates that the variable is statistically
significant in the model.

Coefficients
Interpretation: Each coefficient represents the change in Y for a one-unit change in X,
holding all other variables constant.

Standardized Coefficients: Allow comparison of the relative importance of predictors


measured on different scales.

4. Assessing the Fit of Multiple Linear Regression Model


R-squared (R²)
Strengths:

Easy to interpret

Provides a measure of the overall fit of the model

Limitations:

Always increases with additional predictors

Doesn't account for model complexity

Adjusted R-squared
Strengths:

Adjusts for the number of predictors in the model

Useful for comparing models with different numbers of predictors

Limitations:

Still doesn't fully account for overfitting

Standard Error of the Regression (S)


Strengths:

Measured in the same units as the dependent variable

Used in calculating prediction intervals

Limitations:

Affected by outliers

Additional Methods for Assessing Model Fit


1. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

Balance model fit with complexity

Lower values indicate better fit

2. Cross-Validation

Assesses how well the model generalizes to new data

Common methods: k-fold cross-validation, leave-one-out cross-validation

3. Residual Analysis

Residual plots: Check for patterns that violate assumptions

Normality of residuals: Q-Q plot

4. Prediction Error

Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)

Mean Absolute Error (MAE)

5. Multicollinearity Diagnostics

Variance Inflation Factor (VIF)

Condition number

6. Influence Diagnostics

Cook's distance

DFBETAS

7. Partial F-tests

Compare nested models to assess the contribution of a subset of predictors

Remember that assessing model fit is not just about maximizing R² or minimizing error. It's
about finding a balance between model complexity and predictive power, while ensuring that
the model meets the necessary assumptions and is interpretable in the context of your
research question or business problem.
In practice, it's often beneficial to use a combination of these methods to get a comprehensive
understanding of your model's performance and limitations.
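
As one concrete example of combining these checks, the sketch below cross-validates a multiple linear regression with scikit-learn and reports the cross-validated RMSE (synthetic data; the scoring name follows the scikit-learn convention of negated errors):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)                 # convert negated MSE back to RMSE per fold
print(rmse.mean(), rmse.std())          # average out-of-sample error and its spread
```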

Feature Selection and Dimensionality Reduction: PCA, LDA, ICA


Feature selection and dimensionality reduction are crucial techniques in machine learning and
data analysis. They help in reducing the number of input variables, improving model
performance, reducing overfitting, and facilitating data visualization.

Feature Selection
Feature selection is the process of selecting a subset of relevant features for use in model
construction.
Types of Feature Selection:

1. Filter Methods: Select features based on statistical measures

2. Wrapper Methods: Use a predictive model to score feature subsets

3. Embedded Methods: Perform feature selection as part of the model construction process

Dimensionality Reduction
Dimensionality reduction transforms the data from a high-dimensional space to a lower-
dimensional space, retaining most of the relevant information.

Common Techniques:

1. Principal Component Analysis (PCA)

2. Linear Discriminant Analysis (LDA)

3. Independent Component Analysis (ICA)

In machine learning and statistics, dimensionality reduction is the process of reducing the
number of random variables under consideration by obtaining a set of principal variables. It
can be divided into feature selection and feature extraction.

Principal Component Analysis (PCA)

If you have worked with a lot of variables before, you know this can present problems. Do you
understand the relationship between each variable? Do you have so many variables that you
are in danger of overfitting your model to your data, or that you might be violating the
assumptions of whichever modeling tactic you're using?

You might ask the question, "How do I take all of the variables I've collected and focus on only
a few of them?" In technical terms, you want to reduce the dimension of your feature space. By
reducing the dimension of your feature space, you have fewer relationships between variables
to consider and you are less likely to overfit your model.

Somewhat unsurprisingly, reducing the dimension of the feature space is called
"dimensionality reduction." There are many ways to achieve dimensionality reduction, but most
of the techniques fall into one of two classes:

· Feature Elimination

· Feature Extraction

Feature Elimination: We reduce the feature space by eliminating features. The advantages of
feature elimination are simplicity and keeping the remaining features interpretable. However,
we also eliminate any benefits the dropped variables would have brought.

Feature Extraction: PCA is a technique for feature extraction. It combines our input variables
in a specific way; we can then drop the "least important" new variables while still retaining the
most valuable parts of all of the original variables.
When should I use PCA?

1. Do you want to reduce the number of variables, but are not able to identify variables to
completely remove from consideration?

2. Do you want to ensure your variables are independent of one another?

3. Are you comfortable making your independent variables less interpretable?

How does Principal Component Analysis (PCA) work?

We are going to calculate a matrix that summarizes how our variables all relate to one another.
We'll then break this matrix down into two separate components: direction and magnitude. We
can then understand the directions of our data and their magnitudes.

A scatter plot of such data typically shows two main directions: one along which the points are
most spread out, and another roughly perpendicular to it. The first of these is the most
important one; given how the dots are arranged, it is the direction that a line of best fit to the
data would follow.

We then transform our original data to align with these important directions, so that the x- and
y-axes become the new directions. The transformed data are the same points as before, simply
expressed in the new coordinate system.

The PCA procedure can be summarized in the following steps:


1. Calculate the covariance matrix of the data points.

2. Calculate its eigenvectors and corresponding eigenvalues.

3. Sort the eigenvectors by their eigenvalues in decreasing order.

4. Choose the first k eigenvectors; these will form the new k dimensions.

5. Transform the original n-dimensional data points into the k dimensions.
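
These five steps map directly onto NumPy operations. A minimal sketch on synthetic data (the variables are standardized first, since PCA is sensitive to the relative scaling of the original variables):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))                  # 100 samples with 4 features
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each variable

cov = np.cov(X, rowvar=False)                  # 1. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # 2. eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]              # 3. sort by eigenvalue, decreasing
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                             # 4. keep the first k eigenvectors
X_reduced = X @ W                              # 5. project onto the k new dimensions
print(X_reduced.shape)                         # (100, 2)
```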

Advantages of PCA:
Reduces overfitting by reducing the number of features

Improves visualization by reducing dimensions to 2D or 3D

Identifies patterns in data by revealing the internal structure

Limitations of PCA:
Assumes linearity

Sensitive to the relative scaling of original variables

May not be suitable when the variance in noise is larger than the variance in signal

Linear Discriminant Analysis (LDA)


LDA is a supervised method used for dimensionality reduction and classification. It projects the
data onto a lower-dimensional space while maximizing the separability between classes.

https://youtu.be/azXCzI57Yfc?si=Wg1man-owgnlCn8B
LDA is a type of linear combination, a mathematical process that uses various data items and
applies a function to that set to separately analyze multiple classes of objects or items.
Following Fisher's linear discriminant, linear discriminant analysis can be useful in areas like
image recognition and predictive analysis in marketing.
The fundamental idea of linear combinations goes back as far as the 1960s with the Altman Z-
scores for bankruptcy and other predictive constructs. LDA helps in predictive analysis for more
than two classes, where logistic regression is not sufficient. Linear discriminant analysis takes
the mean value for each class and considers the variance in order to make predictions,
assuming a Gaussian distribution.

LDA works by maximizing the component axes for class separation.

How does Linear Discriminant Analysis (LDA) work?
The general steps for performing a Linear Discriminant Analysis are:
1. Compute the d-dimensional mean vectors for the different classes from the dataset.

2. Compute the scatter matrices (the between-class and within-class scatter matrices).

3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices, sort the
eigenvectors by decreasing eigenvalue, and choose the k eigenvectors with the largest
eigenvalues to form a d × k matrix W (where every column represents an eigenvector).

4. Use the d × k eigenvector matrix W to transform the samples onto the new subspace.

This can be summarized by the matrix multiplication Y = X × W, where X is an n × d matrix
representing the n samples and Y is the resulting n × k matrix of samples transformed into the
new subspace.
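
These steps are implemented, for example, in scikit-learn's LinearDiscriminantAnalysis. A hedged usage sketch on the Iris dataset (4 features, 3 classes, so at most C - 1 = 2 discriminant axes):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # project while maximizing class separability
print(X_lda.shape)                       # (150, 2)
print(lda.score(X, y))                   # LDA can also be used directly as a classifier
```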

Advantages of LDA:
Works well for multi-class classification problems

Can be used for both dimensionality reduction and classification

Maximizes class separability

Limitations of LDA:
Assumes normal distribution of data with equal covariance for each class

Can overfit when the number of samples per class is small

Limited to C-1 features, where C is the number of classes

Independent Component Analysis (ICA)
ICA is a computational method for separating a multivariate signal into additive
subcomponents, assuming the subcomponents are non-Gaussian signals and statistically
independent from each other.

Independent Component Analysis (ICA) is a statistical and computational technique used in


machine learning to separate a multivariate signal into its independent non-Gaussian
components. The goal of ICA is to find a linear transformation of the data such that the
transformed data is as close to being statistically independent as possible.

The heart of ICA lies in the principle of statistical independence. ICA identifies components
within mixed signals that are statistically independent of each other.

Statistical Independence Concept:

In probability theory, two random variables X and Y are statistically independent if the joint
probability distribution of the pair is equal to the product of their individual probability
distributions, i.e. P(X, Y) = P(X)·P(Y). This means that knowing the outcome of one variable
does not change the probability of the other outcome.

Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are statistically
independent of each other.

2. The second assumption is that each source signal exhibits a non-Gaussian distribution.

Mathematical Representation of Independent Component Analysis

The observed random vector is X = (x₁, x₂, ..., xₘ)ᵀ, representing the observed data with m
components. The hidden components are represented by the random vector S = (s₁, s₂, ..., sₙ)ᵀ,
where n is the number of hidden sources.

Linear Static Transformation

The observed data X is transformed into hidden components S using a linear static
transformation represented by the matrix W:

S = W X

Here, W is the transformation matrix. The goal is to transform the observed data X in such a
way that the resulting hidden components are as independent as possible. The independence
is measured by some objective function; the task is to find the optimal transformation matrix W
that maximizes the independence of the hidden components.

Advantages of Independent Component Analysis (ICA):


ICA is a powerful tool for separating mixed signals into their independent components.
This is useful in a variety of applications, such as signal processing, image analysis, and
data compression.

ICA is a non-parametric approach, which means that it does not require assumptions
about the underlying probability distribution of the data.

ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data
is not available.

ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.

Disadvantages of Independent Component Analysis (ICA):


ICA assumes that the underlying sources are non-Gaussian, which may not always be true.
If the underlying sources are Gaussian, ICA may not be effective.

ICA assumes that the sources are mixed linearly, which may not always be the case. If the
sources are mixed nonlinearly, ICA may not be effective.

ICA can be computationally expensive, especially for large datasets. This can make it
difficult to apply ICA to real-world problems.

ICA can suffer from convergence issues, which means that it may not always be able to
find a solution. This can be a problem for complex datasets with many sources.

Cocktail Party Problem
Consider the Cocktail Party Problem, or Blind Source Separation problem, to understand the
kind of problem that is solved by independent component analysis.

Problem: To extract independent sources’ signals from a mixed signal composed of the signals
from those sources.

Given: Mixed signal from five different independent sources.


Aim: To decompose the mixed signal into independent sources:

Source 1

Source 2

Source 3

Source 4

Source 5

Solution: Independent Component Analysis

Here, there is a party going on in a room full of people. There are 'n' speakers in that room,
and they are speaking simultaneously at the party. In the same room, there are also 'n'
microphones placed at different distances from the speakers, which are recording the 'n'
speakers' voice signals. Hence, the number of speakers is equal to the number of microphones
in the room.

Now, using these microphones' recordings, we want to separate all the 'n' speakers' voice
signals in the room, given that each microphone recorded the voice signals coming from each
speaker at a different intensity due to the difference in distances between them.
Decomposing the mixed signal of each microphone’s recording into an independent source’s
speech signal can be done by using the machine learning technique, independent component
analysis.

Here, X1, X2, …, Xn are the original signals present in the mixed signal, and Y1, Y2, …, Yn are
the new features: independent components that are independent of each other.
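
A hedged sketch of this blind source separation using scikit-learn's FastICA, with two synthetic sources standing in for the speakers (the five-source case works the same way; the mixing matrix below is made up):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                          # speaker 1: sinusoidal "voice"
s2 = np.sign(np.sin(3 * t))                 # speaker 2: square-wave "voice"
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5],                   # mixing weights, e.g. distances to microphones
              [0.5, 2.0]])
X = S @ A.T                                 # microphone recordings (mixed signals)

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                # recovered independent components
print(S_est.shape)                          # (2000, 2); order and scale are not identifiable
```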

Limitations of ICA:
Assumes statistical independence of sources, which may not always hold

Cannot determine the order and scale of independent components

May converge to local optima

Comparison of PCA, LDA, and ICA


1. Supervision:

PCA: Unsupervised

LDA: Supervised

ICA: Unsupervised

2. Optimization Criterion:

PCA: Maximizes variance

LDA: Maximizes class separability

ICA: Maximizes statistical independence

3. Assumptions:

PCA: Assumes linearity and orthogonality of components

LDA: Assumes normal distribution with equal covariance for each class

ICA: Assumes non-Gaussian distribution and statistical independence of sources

4. Applications:

PCA: General-purpose dimensionality reduction, data compression

LDA: Classification problems, especially multi-class

ICA: Blind source separation, feature extraction in non-Gaussian data
