1. Dependent Variable (Target Variable)
The output variable that the model aims to predict.
Example: In house price prediction, the house price is the dependent variable.
2. Independent Variables (Predictors)
The input variables used to predict the dependent variable. They can be continuous or categorical.
Example: The size of the house, number of rooms, location, etc., are independent variables when predicting house prices.
3. Regression Coefficients
These are the values that represent the relationship between each independent
variable and the dependent variable. In a simple linear regression model, these are the
slope values that define how much the dependent variable changes for each unit
change in the independent variable.
4. Intercept (Constant)
The value of the dependent variable when all the independent variables are zero. It is
the point at which the regression line crosses the y-axis.
5. Residuals (Errors)
The difference between the observed actual values and the values predicted by the regression model. They represent the unexplained variation in the dependent variable.
6. Line of Best Fit (Regression Line)
In linear regression, the regression line represents the predicted values for the
dependent variable based on the independent variables. The goal is to find the line that
minimizes the residuals.
9. Multicollinearity
A situation where two or more independent variables are highly correlated, making it
difficult for the regression model to separate their individual effects on the dependent
variable.
Overfitting: When the model learns not only the underlying patterns in the data but also
the noise, leading to poor generalization to new data.
Underfitting: When the model is too simple to capture the underlying patterns, resulting
in poor performance on both training and test data.
7. Stepwise Regression
Stepwise regression is a technique that involves adding or removing predictors based on their
statistical significance. The process can either:
Forward Selection: Start with no variables and add them one by one based on their
contribution to the model.
Backward Elimination: Start with all variables and remove the least significant ones step by
step.
This technique is particularly useful when trying to build a simpler, more interpretable model by
selecting only significant features.
Applications: Useful in feature selection when you have a large number of variables, like in
econometrics or healthcare.
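For illustration, a minimal sketch of backward elimination based on p-values, assuming a pandas DataFrame X of candidate predictors, a target y, statsmodels, and an illustrative 0.05 significance threshold:

```python
# A minimal sketch of backward elimination using p-values (statsmodels assumed).
# X is a pandas DataFrame of candidate predictors, y is the target; the 0.05
# threshold is an illustrative choice.
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop("const")   # p-value of each remaining predictor
        worst = pvalues.idxmax()                # least significant predictor
        if pvalues[worst] > alpha:
            features.remove(worst)              # drop it and refit
        else:
            break                               # every remaining predictor is significant
    return features
```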
8. Quantile Regression
Unlike ordinary regression, which estimates the conditional mean of the dependent variable,
quantile regression estimates the conditional median (or other quantiles) of the dependent
variable. This allows us to model different quantiles (such as the 25th, 50th, or 90th percentile)
of the target variable, providing a more comprehensive understanding of the relationship
between variables.
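As a rough illustration, a minimal sketch using scikit-learn's QuantileRegressor (available in scikit-learn >= 1.0); the synthetic heteroscedastic data and the chosen quantiles are assumptions made for the example:

```python
# A minimal sketch of quantile regression; the data and quantiles are illustrative.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + rng.normal(scale=X.ravel())   # noise grows with X (heteroscedastic)

# Fit separate lines for the 25th, 50th, and 90th percentiles of y given X
for q in (0.25, 0.50, 0.90):
    model = QuantileRegressor(quantile=q, alpha=0.0, solver="highs").fit(X, y)
    print(f"q={q}: intercept={model.intercept_:.2f}, slope={model.coef_[0]:.2f}")
```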
Logistic Regression
Logistic regression is a statistical method used for binary classification problems where the
output variable is categorical and takes one of two possible values. Unlike linear regression,
which predicts a continuous output, logistic regression predicts the probability that a given
input point belongs to a particular class. The model is based on the logistic function, also
known as the sigmoid function, which maps any real-valued number into the range of 0 to 1.
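For concreteness, a minimal sketch of the sigmoid function itself:

```python
# A minimal sketch of the sigmoid (logistic) function that logistic regression
# applies to the linear combination of inputs.
import numpy as np

def sigmoid(z):
    """Map any real-valued number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the usual decision threshold
print(sigmoid(4.0))   # close to 1
print(sigmoid(-4.0))  # close to 0
```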
1. Binary Classification
Logistic regression is primarily used for binary classification tasks, where the output
can be either 0 or 1, such as spam vs. non-spam or disease vs. no disease.
5. Decision Boundary
The decision boundary is the threshold that determines the classification of the data
points. For binary classification, if the predicted probability P(y = 1 | x) is greater than 0.5, the output is classified as 1; otherwise, it is classified as 0.
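A minimal sketch of this threshold using scikit-learn's LogisticRegression; the toy one-feature dataset is purely illustrative:

```python
# A minimal sketch of the 0.5 decision threshold with LogisticRegression.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(np.array([[5.0]]))[0, 1]   # P(y = 1 | x = 5)
label = int(p > 0.5)                             # classify as 1 only if the probability exceeds 0.5
print(p, label)
```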
Assumptions of Logistic Regression
1. Binary Outcome
The dependent variable must be binary, taking only one of two possible values (e.g., 0 or 1).
2. Independence of Observations
The observations must be independent of each other. Logistic regression does not handle
correlated observations well.
3. Linearity of Logits
It assumes a linear relationship between the independent variables and the log-odds of the
dependent variable. While the actual relationship may not be linear, the logit transformation
allows modeling this relationship linearly.
4. No Multicollinearity
Logistic regression assumes that the independent variables are not highly correlated with
each other. Multicollinearity can distort the estimates of the coefficients.
2. Less Complex: It is computationally less intensive compared to more complex models like
neural networks, making it efficient for large datasets.
3. Good Performance: Performs well for linearly separable data and can be extended to
handle non-linear relationships through feature transformations.
2. Sensitivity to Outliers: Outliers can significantly affect the coefficients and predictions in
logistic regression, leading to biased results.
4. Binary Classification: While extensions like multinomial logistic regression exist for multi-
class classification, logistic regression is primarily designed for binary outcomes.
2. Ordinal Logistic Regression: Used when the dependent variable is ordinal (i.e., has a
meaningful order but no consistent interval). It models the probability of being in a
particular category or below.
2. Finance: Credit scoring and risk assessment to determine the likelihood of loan default.
4. Social Sciences: Analyzing survey data to understand factors influencing voter behavior or
public opinion.
Conclusion
Logistic regression is a powerful and widely used classification algorithm in machine learning.
Its simplicity, interpretability, and effectiveness make it a go-to choice for binary classification
problems. Understanding its underlying mechanics, assumptions, and extensions is crucial for
applying it effectively in various applications. While it has limitations, logistic regression serves
as a solid foundation for many more complex classification techniques in machine learning.
Simple Linear Regression is a statistical method used to model the linear relationship between two variables: a single independent variable (X) and a dependent variable (Y).
The goal is to find the best-fitting straight line through the data points, known as the regression
line.
Mathematical Representation
The simple linear regression model is represented as:
Y = β₀ + β₁X + ε
Where:
Y is the dependent variable
X is the independent variable
β₀ is the intercept and β₁ is the slope coefficient
ε is the error term (the difference between the predicted and actual Y values)
5. No or little multicollinearity: Not applicable in simple linear regression, but important in
multiple regression.
2. Coefficient of Determination (R²): Measures the proportion of variance in Y explained by the model
R² = 1 - (SSR / SST)
Where SSR is the sum of squared residuals and SST is the total sum of squares
3. Standard Error of the Regression (S): Measures the average distance between the
observed values and the regression line
S = √(SSR / (n - 2))
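A minimal sketch that applies these formulas to a small made-up dataset:

```python
# A minimal sketch: fit a simple linear regression by least squares and compute
# R² and the standard error S using the formulas above; the data is illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope estimate
b0 = y.mean() - b1 * x.mean()                                               # intercept estimate
y_hat = b0 + b1 * x

ssr = np.sum((y - y_hat) ** 2)        # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - ssr / sst                    # R² = 1 - (SSR / SST)
s = np.sqrt(ssr / (len(x) - 2))       # S = √(SSR / (n - 2))
print(b0, b1, r2, s)
```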
Step 4: Hypothesis Testing
1. t-test for individual coefficients: t = β̂₁ / SE(β̂₁), used to test whether the slope is significantly different from zero
2. F-test for overall model significance: F = MSR / MSE
Where MSR is mean square regression and MSE is mean square error
2. Influence diagnostics:
2. Prediction:
2. Test on a separate dataset if available
Remember that while simple linear regression is a powerful tool, it's limited to modeling linear
relationships between two variables. For more complex relationships or multiple predictors,
consider polynomial regression or multiple linear regression.
Mathematical Formulation
For the simple linear regression model Y = β₀ + β₁X + ε, OLS minimizes:
S(β₀, β₁) = Σ(yᵢ - (β₀ + β₁xᵢ))²
where (xᵢ, yᵢ) are the observed data points.
1. Unbiasedness: The expected values of the estimators equal the true parameter values.
2. Best Linear Unbiased Estimator (Gauss-Markov theorem): Among all linear unbiased estimators, OLS estimators have the smallest variance.
3. Consistency: As sample size increases, β̂₀ and β̂₁ converge in probability to β₀ and β₁.
4. Efficiency: OLS estimators achieve the Cramér-Rao lower bound.
3. The sum of the observed y values equals the sum of the fitted y values: Σyᵢ = Σŷᵢ
4. The sum of the cross-products of the x values and the residuals is zero: Σ(xᵢ(yᵢ - ŷᵢ)) = 0
where:
5. Residuals
Residuals are the differences between the observed values of the dependent variable and the
predicted values:
eᵢ = yᵢ - ŷᵢ
Properties of Residuals
1. The sum of residuals is zero: Σeᵢ = 0
Standardized Residuals
Standardized residuals are useful for identifying outliers:
rᵢ = eᵢ / (s * √(1 - hᵢᵢ))
where hᵢᵢ is the i-th diagonal element of the hat matrix H = X(X'X)⁻¹X'.
Residual Analysis
Residual analysis is crucial for checking model assumptions:
Measures of Influence
1. Leverage: hᵢᵢ (diagonal elements of hat matrix)
2. Cook's Distance: Measures the influence of each observation on the fitted values
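A minimal sketch computing the hat matrix, standardized residuals, and Cook's distance for a small made-up dataset (the last point is given high leverage on purpose):

```python
# A minimal sketch of residual diagnostics: leverages, standardized residuals,
# and Cook's distance for a small illustrative regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 12.5])

X = np.column_stack([np.ones_like(x), x])        # design matrix with an intercept column
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS estimates
e = y - X @ beta                                 # residuals e_i = y_i - ŷ_i
H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix H = X(X'X)⁻¹X'
h = np.diag(H)                                   # leverages h_ii
p = X.shape[1]
s2 = e @ e / (len(x) - p)                        # estimate of the residual variance
r = e / np.sqrt(s2 * (1 - h))                    # standardized residuals
cooks_d = (r ** 2) * h / (p * (1 - h))           # Cook's distance per observation
print(h, r, cooks_d)
```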
Multiple Linear Regression (MLR) is an extension of simple linear regression. It models the
linear relationship between a dependent variable and two or more independent variables.
Mathematical Representation
The general form of the multiple linear regression model is:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Where:
Y is the dependent variable
X₁, X₂, ..., Xₖ are the independent variables
β₀ is the intercept and β₁, ..., βₖ are the regression coefficients
ε is the error term
In matrix notation:
Y = Xβ + ε
Where:
Y is an n×1 vector of observations of the dependent variable
X is an n×(k+1) matrix of independent variables (including a column of 1's for the intercept)
β is a (k+1)×1 vector of regression coefficients
ε is an n×1 vector of error terms
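A minimal sketch of the matrix-form OLS solution β̂ = (X'X)⁻¹X'Y on synthetic two-predictor data (the true coefficients 3, 2, and -1.5 are chosen purely for illustration):

```python
# A minimal sketch of matrix-form OLS for multiple linear regression.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 3 + 2 * X1 - 1.5 * X2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X1, X2])     # include a column of 1's for the intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # solve (X'X)β = X'Y instead of forming the inverse
print(beta_hat)                               # roughly [3, 2, -1.5]
```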
1. Linearity: The relationship between the dependent variable and each independent variable
should be linear.
3. Homoscedasticity: The variance of residuals should be constant across all levels of the
independent variables.
Visualization: Residual plot against fitted values
5. No Multicollinearity: The independent variables should not be highly correlated with each
other.
Caution: R² always increases when adding more variables, even if they're not meaningful.
Adjusted R-Square
Definition: A modified version of R² that adjusts for the number of predictors in the model.
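One common form of the adjustment, where n is the number of observations and k the number of predictors, is:
Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)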
Standard Error of the Regression
Interpretation: Smaller values indicate better fit. Used in calculating confidence intervals and prediction intervals.
Overall Model P-value (F-test)
Interpretation: A small p-value (typically < 0.05) indicates that the model is statistically significant.
Coefficient P-values
Definition: The p-value for each coefficient tests the null hypothesis that the coefficient is
equal to zero.
Interpretation: A small p-value (typically < 0.05) indicates that the variable is statistically
significant in the model.
Coefficients
Interpretation: Each coefficient represents the change in Y for a one-unit change in X,
holding all other variables constant.
Easy to interpret
Limitations:
Adjusted R-squared
Strengths:
Useful for comparing models with different numbers of predictors
Limitations:
Limitations:
Affected by outliers
2. Cross-Validation
3. Residual Analysis
4. Prediction Error
5. Multicollinearity Diagnostics
Condition number
6. Influence Diagnostics
Cook's distance
DFBETAS
7. Partial F-tests
Compare nested models to assess the contribution of a subset of predictors
Remember that assessing model fit is not just about maximizing R² or minimizing error. It's
about finding a balance between model complexity and predictive power, while ensuring that
the model meets the necessary assumptions and is interpretable in the context of your
research question or business problem.
In practice, it's often beneficial to use a combination of these methods to get a comprehensive
understanding of your model's performance and limitations.
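For illustration, a minimal sketch combining two of these checks, cross-validation and VIF-based multicollinearity diagnostics; it assumes a pandas DataFrame X of predictors, a target y, and that scikit-learn and statsmodels are available:

```python
# A minimal sketch: 5-fold cross-validated R² plus variance inflation factors.
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

def assess_fit(X: pd.DataFrame, y):
    # Cross-validated R²: how well the model generalizes to held-out data
    cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

    # VIF per predictor: values above roughly 5-10 suggest multicollinearity
    X_const = sm.add_constant(X)
    vif = {col: variance_inflation_factor(X_const.values, i)
           for i, col in enumerate(X_const.columns) if col != "const"}
    return cv_r2, vif
```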
Feature Selection
Feature selection is the process of selecting a subset of relevant features for use in model
construction.
Types of Feature Selection:
1. Filter Methods: Select features using statistical measures (e.g., correlation with the target), independent of any learning algorithm
2. Wrapper Methods: Evaluate subsets of features by training and scoring a model on each candidate subset
3. Embedded Methods: Perform feature selection as part of the model construction process
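A minimal sketch of the three flavours with scikit-learn; the particular estimators and the number of features to keep are illustrative choices, and each selector is used via .fit_transform(X, y):

```python
# A minimal sketch of filter, wrapper, and embedded feature selection.
from sklearn.feature_selection import SelectKBest, f_regression, RFE, SelectFromModel
from sklearn.linear_model import LinearRegression, Lasso

filter_sel = SelectKBest(score_func=f_regression, k=5)          # filter: univariate statistical scores
wrapper_sel = RFE(LinearRegression(), n_features_to_select=5)   # wrapper: recursive feature elimination
embedded_sel = SelectFromModel(Lasso(alpha=0.1))                # embedded: L1 penalty zeroes out coefficients
```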
Dimensionality Reduction
Dimensionality reduction transforms the data from a high-dimensional space to a lower-
dimensional space, retaining most of the relevant information.
Common Techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Independent Component Analysis (ICA), each discussed below.
In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
Have you ever worked with a dataset with so many variables that you are in danger of overfitting your model to your data, or that you might be violating the assumptions of whichever modeling tactic you're using?
You might ask the question, "How do I take all of the variables I've collected and focus on only a few of them?" In technical terms, you want to "reduce the dimension of your feature space." By reducing the dimension of your feature space, you have fewer relationships between variables to consider, and you are less likely to overfit your model.
· Feature Elimination
· Feature extraction
Feature Elimination: We reduce the feature space by eliminating features. The advantages of feature elimination include its simplicity and that it keeps the remaining variables interpretable. However, by dropping variables we also eliminate any benefits those dropped variables would bring.
Feature Extraction: PCA is a technique for feature extraction. It combines our input variables in a specific way, after which we can drop the "least important" new variables while still retaining the most valuable parts of all of the original variables.
When should I use PCA?
1. Do you want to reduce the number of variables, but are not able to identify variables to completely remove from consideration?
2. Do you want to ensure your variables are independent of one another?
3. Are you comfortable making your independent variables less interpretable?
The picture above displays the two main directions in this data: the red direction and the green direction. The red direction is the more important one. We'll get into why this is the case later, but given how the dots are arranged, can you see why the red direction looks more important than the green direction? (Hint: what would fitting a line of best fit to this data look like?)
In the next picture, we transform our original data to align with these important directions. The figure shows the same exact data as above, but transformed so that the x- and y-axes are now the red and green directions.
Advantages of PCA:
Reduces overfitting by reducing the number of features
Limitations of PCA:
Assumes linearity
May not be suitable when the variance in noise is larger than the variance in signal
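A minimal sketch of PCA as feature extraction with scikit-learn, using the iris dataset purely for illustration:

```python
# A minimal sketch of PCA: standardize, then keep the two most important directions.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep the two most important directions
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)           # share of variance captured by each component
```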
https://youtu.be/azXCzI57Yfc?si=Wg1man-owgnlCn8B
LDA is a type of linear combination, a mathematical process using various data items and applying a function to that set to separately analyze multiple classes of objects or items. Following Fisher's linear discriminant, linear discriminant analysis can be useful in areas like image recognition and predictive analysis in marketing.
The fundamental idea of linear combinations goes back as far as the 1960s with the Altman Z-scores for bankruptcy and other predictive constructs. Now LDA helps in predictive analysis for more than two classes, when logistic regression is not sufficient. Linear discriminant analysis takes the mean value for each class and considers the variance in order to make predictions, assuming a Gaussian distribution.
How does Linear Discriminant Analysis (LDA) work?
The general steps for performing a Linear Discriminant Analysis are:
1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and within-class scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W (where every column represents an eigenvector).
5. Use this d × k eigenvector matrix W to transform the samples onto the new subspace.
Advantages of LDA:
Works well for multi-class classification problems
Limitations of LDA:
Assumes normal distribution of data with equal covariance for each class
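A minimal sketch of LDA for supervised dimensionality reduction with scikit-learn, again using the iris dataset (3 classes) purely for illustration:

```python
# A minimal sketch of LDA: project onto at most (number of classes - 1) directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_lda = lda.fit_transform(X, y)                    # uses the class labels, unlike PCA
print(lda.explained_variance_ratio_)
```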
Independent Component Analysis (ICA)
ICA is a computational method for separating a multivariate signal into additive
subcomponents, assuming the subcomponents are non-Gaussian signals and statistically
independent from each other.
The heart of ICA lies in the principle of statistical independence. ICA identifies components within mixed signals that are statistically independent of each other.
Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are statistically
independent of each other.
2. The second assumption is that each source signal exhibits non-Gaussian distributions.
Mathematically, ICA starts from an observed random vector x = (x₁, x₂, …, xₘ)ᵀ, representing the observed data with m components. The hidden components are represented by the random vector s = (s₁, s₂, …, sₙ)ᵀ, where n is the number of hidden sources.
The goal is to transform the observed data x in a way that the resulting hidden components are independent. The independence is measured by some function F(s₁, …, sₙ). The task is to find the optimal transformation matrix W (so that s = Wx) that maximizes the independence of the hidden components.
ICA is a non-parametric approach, which means that it does not require assumptions
about the underlying probability distribution of the data.
ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data
is not available.
ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.
ICA assumes that the sources are mixed linearly, which may not always be the case. If the
sources are mixed nonlinearly, ICA may not be effective.
ICA can be computationally expensive, especially for large datasets. This can make it
difficult to apply ICA to real-world problems.
ICA can suffer from convergence issues, which means that it may not always be able to
find a solution. This can be a problem for complex datasets with many sources.
Cocktail Party Problem
Consider Cocktail Party Problem or Blind Source Separation problem to understand the
problem which is solved by independent component analysis.
Problem: To extract independent sources’ signals from a mixed signal composed of the signals
from those sources.
(Figure: Source 1 through Source 5, independent source signals that are mixed together and recorded by the microphones)
Here, a party is going on in a room full of people. There are 'n' speakers in that room, and they are speaking simultaneously at the party. In the same room, there are also 'n' microphones placed at different distances from the speakers, which are recording the 'n' speakers' voice signals. Hence, the number of speakers is equal to the number of microphones in the room.
Now, using these microphones' recordings, we want to separate all the 'n' speakers' voice signals in the room, given that each microphone recorded the voice signals coming from each speaker at a different intensity due to the difference in distances between them.
Decomposing the mixed signal of each microphone’s recording into an independent source’s
speech signal can be done by using the machine learning technique, independent component
analysis.
Here, X1, X2, …, Xn are the original signals present in the mixed signal, and Y1, Y2, …, Yn are the new features: independent components that are independent of each other.
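A minimal sketch of this blind source separation with FastICA from scikit-learn; the two toy source signals and the mixing matrix are assumptions made for the example:

```python
# A minimal sketch of blind source separation: mix two toy signals, then recover
# them as independent components with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # "speaker" 1: a sinusoid
s2 = np.sign(np.sin(3 * t))              # "speaker" 2: a square wave
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5],                # mixing matrix: what the microphones record
              [0.5, 1.0]])
X = S @ A.T                              # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # recovered independent components
print(S_est.shape)                       # (2000, 2)
```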
Limitations of ICA:
Assumes statistical independence of sources, which may not always hold
1. Learning Type:
PCA: Unsupervised
LDA: Supervised
ICA: Unsupervised
2. Optimization Criterion:
PCA: Maximizes the variance captured by the components
LDA: Maximizes the separation between classes
ICA: Maximizes the statistical independence of the components
3. Assumptions:
PCA: Assumes linear relationships between variables
LDA: Assumes normal distribution with equal covariance for each class
ICA: Assumes the sources are non-Gaussian and statistically independent
4. Applications:
PCA: General-purpose dimensionality reduction, visualization, and reducing overfitting
LDA: Multi-class classification problems (e.g., image recognition, predictive analysis in marketing)
ICA: Blind source separation (e.g., the cocktail party problem)