A8751 – Optimization Techniques in
Machine Learning
Course Overview:
Students will be able to understand and analyze how to deal with changing data. They will also be able to identify and interpret potential unintended effects in their projects, and to understand and define procedures to operationalize and maintain their applied machine learning models.
Edited by Mr. S. Srinivas Reddy, Asst. Professor
Module 2:
Linear Regression as an Optimization Problem
Based on Mathematics for Machine Learning by
Deisenroth et al.
Chapters Referenced:
Chapter 3 (Linear Models) & Chapter 7 (Probability &
Bayesian Models)
Syllabus
Module 1: Model Fitting and Error Measurement
Optimization Using Gradient Descent, Constrained Optimization and Lagrange Multipliers, Convex Optimization,
Data, Models, and Learning, Empirical Risk Minimization, Parameter Estimation, Probabilistic Modelling and
Inference Directed Graphical Models.
Module 2: Linear Regression as an Optimization Problem
Problem Formulation, Parameter Estimation, Bayesian Linear Regression, Maximum Likelihood as Orthogonal
Projection
Module 3: Dimensionality Reduction and Optimization
Problem Setting, Maximum Variance Perspective, Projection Perspective, Eigenvector Computation and Low-Rank
Approximations, PCA in High Dimensions, Key Steps of PCA in Practice, Latent Variable Perspective
Course Outcomes
A8751.1. Understand the fundamentals of model fitting, empirical risk minimization, and
optimization techniques including gradient descent and Lagrange multipliers.
A8751.2. Formulate linear regression as an optimization problem and apply parameter
estimation techniques including Bayesian and Maximum Likelihood methods.
A8751.3. Apply dimensionality reduction techniques such as PCA using optimization-based
approaches and understand the mathematical foundations of eigenvector computation.
A8751.4. Analyze unsupervised learning problems using Gaussian Mixture Models and the
Expectation Maximization algorithm for parameter estimation.
A8751.5. Evaluate and implement large-margin classifiers including Support Vector
Machines using primal and dual optimization frameworks and kernel methods.
TOPICS TO BE DISCUSSED ARE AS FOLLOWS:
Problem Formulation
Parameter Estimation
Bayesian Linear Regression
Maximum Likelihood as Orthogonal Projection
In the upcoming lecture, CSD Sec A/B/C students will learn to:
Understand how prediction problems can be
modeled using linear functions
Formulate the linear regression model
Define an optimization objective for learning
model parameters
Linear Model Representation
We assume that our model makes predictions using a linear equation.
Main Equation (Vector Form): ŷ = θᵀx, where x is the input feature vector and θ is the parameter vector. For a single input feature this reduces to ŷ = θ₀ + θ₁x.
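As a minimal sketch (assuming a NumPy setup and illustrative parameter values), the vector-form prediction is just a dot product between the parameters and an input that carries a leading 1 for the intercept:

```python
import numpy as np

# Illustrative values only: theta = [theta_0, theta_1], x = [1, x_1]
theta = np.array([1.0, 1.0])   # intercept and slope (assumed for the example)
x = np.array([1.0, 2.0])       # leading 1 multiplies the intercept theta_0

y_hat = theta @ x              # y_hat = theta^T x = theta_0 + theta_1 * x_1
print(y_hat)                   # 3.0
```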
Parameter Estimation
📚 Mathematics for Machine Learning by
Deisenroth et al.
🔖 Based on Chapter 3, Section 3.2 – Least
Squares Estimation
Simple Numerical Example
Let's take 2 data points for ease:

| x | y |
| --- | --- |
| 1 | 2 |
| 2 | 3 |

Assume the model to fit is a line: ŷ = θ₀ + θ₁x
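A quick check of this example with NumPy's least-squares solver (a sketch, not part of the original slide): with only two points the fitted line passes through both, giving θ₀ = 1 and θ₁ = 1.

```python
import numpy as np

# Design matrix with a column of ones for the intercept theta_0
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 3.0])

# Least-squares fit of y ≈ X theta
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # approximately [1. 1.], i.e. y_hat = 1 + 1*x
```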
What is Maximum Likelihood Estimation (MLE)?
Maximum Likelihood Estimation (MLE) is a statistical
method for estimating parameters of a model by
maximizing the probability (likelihood) of observing
the given data under that model.
In the context of linear regression, the goal is to
estimate the parameter vector θ that makes the
observed data most probable.
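Under the usual assumption of Gaussian noise with variance σ² (the same likelihood used later in the Bayesian section), maximizing the likelihood is equivalent to minimizing the sum of squared errors; a sketch of that standard argument:

```latex
% Likelihood of N independent observations under Gaussian noise
p(\mathbf{y}\mid \mathbf{X},\boldsymbol{\theta})
  = \prod_{n=1}^{N}\mathcal{N}\!\bigl(y_n \mid \mathbf{x}_n^\top\boldsymbol{\theta},\,\sigma^2\bigr)

% Negative log-likelihood: terms independent of theta are collected in "const."
-\log p(\mathbf{y}\mid \mathbf{X},\boldsymbol{\theta})
  = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\bigl(y_n-\mathbf{x}_n^\top\boldsymbol{\theta}\bigr)^2 + \text{const.}

% Setting the gradient with respect to theta to zero gives the ML (least-squares) estimate
\boldsymbol{\theta}_{\mathrm{ML}} = \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}\mathbf{X}^\top\mathbf{y}
```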
Overfitting in Linear Regression
When you use too many features (especially ones that are
not informative), linear regression starts to memorize
noise in the data, not just patterns.
Why Overfitting Happens:
When features ≥ samples,
the model has too much flexibility.
It can achieve zero training error, but will perform poorly
on new data (poor generalization).
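A minimal illustration of this on synthetic, purely random data (all values assumed for the sketch): with as many features as samples, ordinary least squares reaches near-zero training error yet fails on new data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 10           # features == samples: maximal flexibility

X_train = rng.normal(size=(n_samples, n_features))
y_train = rng.normal(size=n_samples)      # pure noise: there is no pattern to learn

theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

X_test = rng.normal(size=(100, n_features))
y_test = rng.normal(size=100)

train_err = np.mean((X_train @ theta - y_train) ** 2)
test_err = np.mean((X_test @ theta - y_test) ** 2)
print(train_err, test_err)   # training error ~0, test error much larger
```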
Remedies for Overfitting:
1. Feature Selection:
Remove non-informative or noisy features.
Objective: Eliminate irrelevant or redundant input variables that do not contribute meaningfully to the output.
2. Regularization:
Penalize large coefficients to prevent the model from becoming overly sensitive to specific features.
Add penalty terms such as L2 (Ridge) or L1 (Lasso).
3. Bayesian Linear Regression:
Introduce prior distributions over the parameters to control model complexity.
4. Model Selection:
Use cross-validation to select the right model complexity.
2. Regularization in Linear Regression
(to be discussed further in OTML)
Regularization is used to prevent overfitting in linear
regression by penalizing large model coefficients.
This helps keep the
model simpler and more generalizable.
Why Regularization?
When the number of features is large or the features
are highly correlated*, the parameter estimates θ can
become unstable and the model may overfit the
training data.
Regularization adds a penalty term to the objective
function to control model complexity.
(*Highly correlated, in the context of linear regression, refers to the situation where two or more input variables (features) carry similar or redundant information.)
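As a sketch of L2 (Ridge) regularization, the penalized objective ||y − Xθ||² + λ||θ||² has the closed-form solution θ = (XᵀX + λI)⁻¹Xᵀy; the penalty weight lam below is an assumed illustrative value, and in practice the intercept is often left unpenalized.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: minimizes ||y - X theta||^2 + lam * ||theta||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Tiny example reusing the earlier two-point data
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 3.0])
print(ridge_fit(X, y, lam=0.1))   # parameter norm is pulled below the unregularized fit [1, 1]
```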
Key Assumptions of Linear Regression

| Assumption (property) | Simple meaning | Why it matters |
| --- | --- | --- |
| Linearity | Straight-line relationship | The relationship between the predictors X and the output y must be linear. In Bayesian regression, the model is still linear in the parameters, but uncertainty is added. |
| Independence | Data points don't affect each other | Each observation (data point) must be independent of the others. This avoids hidden biases or autocorrelation. |
| Equal Variance (Homoscedasticity) | Error is the same across all inputs | The variance of the errors should remain constant across all input values. This ensures fair treatment across all levels of input. |
| Normal Errors | Prediction mistakes follow a bell curve | Residuals (errors) are normally distributed. Important for inference: p-values, confidence intervals, etc. |
| No Multicollinearity | Inputs are not too similar | Predictors should not be too correlated with each other. In Bayesian regression, multicollinearity still increases posterior uncertainty. |
| Correct Model | No missing or extra variables | All relevant variables are included and irrelevant ones excluded. Missing key variables leads to bias; extra ones add noise. |
| No Endogeneity | Inputs and errors are separate | Input variables should not be correlated with the error term. Helps in producing trustworthy estimates of θ. |
If Variance Is Equal (Homoscedasticity)

| House | x: Size (sq ft) | y: Price (in lakhs) | ŷ: Predicted Price | Error e = y − ŷ |
| --- | --- | --- | --- | --- |
| A | 1000 | 50 | 52 | −2 |
| B | 1500 | 70 | 72 | −2 |
| C | 2000 | 90 | 92 | −2 |
If Variance Is Not Equal (Heteroscedasticity)

| House | x | y | ŷ | e |
| --- | --- | --- | --- | --- |
| A | 1000 | 50 | 52 | −2 |
| B | 1500 | 70 | 74 | −4 |
| C | 2000 | 90 | 95 | −5 |
If variance is not constant:
• Standard errors of coefficients become incorrect
• Confidence intervals and hypothesis tests become unreliable
What Is Multicollinearity?
• It means two or more predictors (input variables) are highly correlated.
• That is, they carry similar information.
• This makes it difficult to tell which variable is responsible for the effect on the target y.
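A minimal sketch of this effect on synthetic data (all values assumed): when a second feature is almost a copy of the first, the individual coefficients swing wildly under tiny perturbations of the targets, even though their sum, and hence the predictions, stays close to the true combined effect of 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)            # x2 carries almost the same information as x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)     # the true combined effect is 3

X = np.column_stack([x1, x2])

for noise_scale in (0.0, 0.01):                # refit after a tiny perturbation of y
    y_perturbed = y + noise_scale * rng.normal(size=n)
    theta, *_ = np.linalg.lstsq(X, y_perturbed, rcond=None)
    print(theta, "sum:", theta.sum())          # individual weights vary a lot; their sum stays near 3
```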
Packing for a Trip
Scenario:
You're going on a 5-day trip and need to pack smartly because your
suitcase has a weight limit (just like a regularization constraint).
Lasso (L1): Pick a Few Essential Items
You say: "I'll take only the most important things — 2 pairs of jeans,
2 shirts, and skip formal shoes, books, gym gear..."
You pack fewer items, but each one is useful.
Your suitcase contains zero of many items, just as many coefficients become exactly zero.
Result: Simpler, lighter suitcase. Fewer items, more space — like
feature selection.
Ridge (L2): Take Everything, But Lighter
You say: "I want a little of everything, but will reduce
the size —
travel-sized shampoo, thin t-shirts, foldable shoes..."
You don't skip anything, but minimize everything.
Result: Everything fits, but in a compressed
form — like shrinking coefficients.
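A minimal sketch of this contrast using scikit-learn (assumed to be available) on synthetic data where only two of five features matter: Lasso tends to set the useless coefficients exactly to zero, while Ridge keeps all of them but shrinks their size.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_)   # irrelevant coefficients driven to exactly 0
print("Ridge coefficients:", ridge.coef_)   # all coefficients kept, but shrunk
```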
Bayesian Linear Regression – Theory and Interpretation
In Bayesian Linear Regression, we make two main assumptions:
(a) Likelihood (how the data is generated)
We assume that each output yₙ is generated as yₙ = θᵀxₙ + εₙ with Gaussian noise εₙ ~ N(0, σ²), i.e. p(yₙ | xₙ, θ) = N(yₙ | θᵀxₙ, σ²).
(b) Prior on the parameters
We don't fix θ; instead we assume a prior distribution over it: p(θ) = N(θ | m₀, S₀).
Goal: Compute the posterior mean μ and posterior covariance Σ of θ given the observed data.
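A minimal sketch of the posterior computation, assuming a zero-mean isotropic prior (m₀ = 0, S₀ = α²I) and a known noise variance σ²; under these assumptions the closed-form result is Σ = (S₀⁻¹ + σ⁻²XᵀX)⁻¹ and μ = σ⁻²ΣXᵀy, and the numerical values below are illustrative only.

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2=0.25, alpha2=1.0):
    """Posterior N(mu, Sigma) over theta for a N(0, alpha2*I) prior and Gaussian noise."""
    d = X.shape[1]
    S0_inv = np.eye(d) / alpha2                        # prior precision S0^{-1}
    Sigma = np.linalg.inv(S0_inv + X.T @ X / sigma2)   # posterior covariance
    mu = Sigma @ (X.T @ y / sigma2)                    # posterior mean (prior mean is zero)
    return mu, Sigma

# Reusing the two-point example: design matrix with a bias column
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 3.0])
mu, Sigma = bayes_linreg_posterior(X, y)
print(mu)      # close to the ML solution [1, 1], but regularized by the prior
print(Sigma)   # remaining uncertainty about theta
```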
Maximum Likelihood as Orthogonal Projection
1. Background:
In Linear Regression, we aim to fit a line (or hyperplane) that best
represents the data.
Let:
X = matrix of input features (with each row as a data point)
y = vector of observed outputs (target values)
θ = parameter vector (weights of the model)
Maximum Likelihood (ML) Estimation as
Orthogonal Projection in the subject Optimization
Techniques in Machine Learning (OTML) arises
from a need to
connect statistical learning with geometric
intuition and optimization theory.
Why Do We Study Maximum Likelihood as Orthogonal Projection in
OTML?
Limitation of Purely Statistical Interpretation
•Maximum Likelihood Estimation (MLE) is traditionally taught as a
statistical method to estimate parameters that maximize the likelihood
of observing data.
However, this viewpoint:
• Lacks geometric intuition.
• Makes it hard for students to visualize the optimization process.
To overcome this, ML estimation in linear regression is interpreted as a
geometric projection—specifically, orthogonal projection of observed
outputs onto the column space of the input matrix.
Limitations that Force This Study

| Limitation in Standard ML Estimation | How Orthogonal Projection Helps |
| --- | --- |
| Hard to visualize likelihood maximization | Provides a geometric interpretation |
| Abstract algebra in cost minimization | Links to a physical projection of the data |
| Failure of least squares in high dimensions | Shows where the projection breaks down and needs regularization |
| Disconnection between vector calculus and learning | Bridges linear algebra, calculus, and ML optimization |
| Confusion about residuals and optimality | Projection shows residuals are orthogonal, satisfying optimality |
• Step 1: Matrix representation of the model, y ≈ Xθ
• Step 2: Normal equation setup, XᵀXθ = Xᵀy
• Step 3: Compute the parameter, θ = (XᵀX)⁻¹Xᵀy
• Step 4: Predict the output vector, ŷ = Xθ
• Step 5: Calculate the residual vector, e = y − ŷ
• Step 6: Orthogonality check, Xᵀe = 0 (the residual is orthogonal to the column space of X); a worked sketch of these steps follows below
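A minimal sketch of these six steps in NumPy, extending the earlier two-point example with an assumed third point (3, 5) so that the residual is nonzero; the final line confirms that the residual is (numerically) orthogonal to the columns of X.

```python
import numpy as np

# Step 1: matrix representation (bias column plus feature column)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])

# Steps 2-3: set up and solve the normal equations X^T X theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Step 4: predicted output vector = orthogonal projection of y onto the column space of X
y_hat = X @ theta

# Step 5: residual vector
residual = y - y_hat

# Step 6: orthogonality check: X^T e should be (numerically) zero
print(theta)            # approximately [0.33, 1.5]
print(residual)         # nonzero residuals
print(X.T @ residual)   # approximately [0. 0.]
```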