A8751 – Optimization Techniques in
Machine Learning
Course Overview:
Students will be able to understand and analyze how to deal with changing data, identify and interpret potential unintended effects in their projects, and understand and define procedures to operationalize and maintain their applied machine learning models.
Edited by Mr. S. Srinivas Reddy, Asst. Professor
Module 3:
Dimensionality Reduction and Optimization
Based on Mathematics for Machine Learning by
Deisenroth et al.
Chapters Referenced:
Chapter 10 (Dimensionality Reduction with PCA) of Mathematics for Machine Learning
Module 3: Dimensionality Reduction and Optimization
Problem Setting, Maximum Variance Perspective,
Projection Perspective, Eigenvector Computation and
Low-Rank Approximations, PCA in High Dimensions, Key
Steps of PCA in Practice, Latent Variable Perspective
Motivation & Intuition
"Imagine we collect data on 5 characteristics of students — height,
weight, exam score, attendance, and class participation. This is a 5-
dimensional dataset. Visualizing and analyzing it is hard. But what if
we could summarize most of this information in just 2 numbers — and
still capture nearly all the differences between students?“
This is exactly what PCA (Principal Component Analysis) does.
Objective : Reduce dimensions but retain maximum useful
information.
What is the Problem Setting in PCA?
Full form: Principal Component Analysis.
Type: Linear dimensionality reduction technique.
Objective: Transform high-dimensional data into a lower-dimensional space while retaining as much information (variance) as possible.
Why do we need PCA?
Real-world datasets often have many features (e.g., 100s or 1000s).
Many of these features are correlated or redundant.
Working with all features leads to:
High computation cost.
Difficulty in visualization.
Overfitting due to noisy or irrelevant features.
PCA helps by:
Finding new uncorrelated variables (principal components).
Keeping only the most important components (those with
highest variance).
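As a quick illustration of keeping only the highest-variance components, here is a minimal Python sketch using scikit-learn; the synthetic student-style data and all variable names are made up for this example, not taken from the course material.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 students, 5 correlated features
# (e.g., height, weight, exam score, attendance, participation).
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))                       # two underlying traits
data = factors @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

# Keep only the 2 principal components with the highest variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(reduced.shape)                    # (100, 2): 5 features summarized by 2 numbers
print(pca.explained_variance_ratio_)    # fraction of total variance each component keeps
```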
Why do we need this?
•High-dimensional data often lies on a
low-dimensional subspace.
•Many features are redundant or correlated.
•Working in lower dimensions reduces storage and computation, and improves visualization.
Goal of PCA:
To find new directions (called principal components) along which the data varies the most.
These directions are orthogonal (perpendicular) to each other.
•Why do this?
• Reduce dimensionality (2D → 1D or higher → lower).
• Remove redundancy between correlated features.
• Focus on most informative directions.
Principal Components
• The directions we find are the eigenvectors of the covariance matrix.
• The importance of each direction is given by the eigenvalues (they tell
how much variance lies along that direction).
Eigenvectors and Eigenvalues
Eigenvectors: Directions (axes) along which the data shows
maximum variance.
Eigenvalues: Amount of variance captured along each eigenvector.
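A small NumPy sketch of this relationship, using the same 2×2 covariance matrix that appears in the worked example later in this module; the code itself is ours, not from the textbook.

```python
import numpy as np

# Covariance matrix of two correlated features (same as the worked example below).
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# For a symmetric matrix, eigh returns eigenvalues in ascending order
# and the matching orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(S)

print(eigenvalues)          # [1. 3.]  -> variance captured along each direction
print(eigenvectors[:, -1])  # eigenvector with the largest eigenvalue = PC1 direction
```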
Key Idea
PCA transforms original correlated features into new
uncorrelated features (principal components).
The first principal component = eigenvector with largest
eigenvalue (maximum variance).
Real-World Examples
•Image Compression: Reduce a 784-pixel digit image to about 50 principal components while preserving the shape of the digit (MNIST dataset).
•Face Recognition: PCA generates “eigenfaces” for efficient storage and
recognition.
•Finance: Reduce correlated stock indicators into a few principal factors.
•Weather Data: Temperature and humidity projected into one dimension
for seasonal trend analysis.
How does PCA work (Conceptual)?
1. Data as points in high-dimensional space
   Example: Each 28×28 pixel image = a point in 784-D space.
2. Variance as Information
   Directions where data varies most = most informative.
3. Principal Components
   • First principal component (PC1): Direction of maximum variance.
   • Second principal component (PC2): Next orthogonal direction of maximum variance.
   • And so on.
4. Projection
   • Project original data onto the first few components.
   • New representation = lower dimension but preserves most variance.
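The four steps above can be sketched directly in NumPy. This is a minimal from-scratch illustration, not the textbook's code; the function name and variables are our own.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (N x D) onto the top-k principal components."""
    # 1. Data as points in D-dimensional space: center them at the mean.
    mean = X.mean(axis=0)
    Xc = X - mean

    # 2. Variance as information: compute the covariance matrix of the features.
    S = np.cov(Xc, rowvar=False)

    # 3. Principal components: eigenvectors of S, ordered by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(S)        # ascending order
    order = np.argsort(eigvals)[::-1]
    B = eigvecs[:, order[:k]]                   # D x k basis of the subspace

    # 4. Projection: coordinates of each point in the k-dimensional subspace.
    return Xc @ B, B, mean

# Example: reduce 5-D synthetic data to 2-D.
X = np.random.default_rng(1).normal(size=(200, 5))
Z, B, mean = pca_project(X, k=2)
print(Z.shape)   # (200, 2)
```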
Worked Example: PCA with a Covariance Matrix
Problem: Two features are measured; each has variance 2 and their covariance is 1. Find the principal components and the variance explained by each.
Solution:
Step 1: Interpret the Covariance Matrix
Diagonal elements (2, 2): Variance of each feature = 2 → both features vary equally.
Off-diagonal element (1): Covariance = 1 → positive correlation: when one feature increases, the other tends to increase as well.
Step 2: Find Principal Components (Systematic Approach)
Step 2.1: Write the Covariance Matrix
Given covariance matrix: Σ = [[2, 1], [1, 2]] (variances on the diagonal, covariance off the diagonal).
Step 2.2: Compute Eigenvalues
Step 2.3: Compute Eigenvectors
Step 2.4: Order Eigenvalues and Select Principal Components
Step 2.5: Variance Explained
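A reconstruction of Steps 2.2–2.5 for the covariance matrix above (our own working, consistent with the stated values):

```latex
% Step 2.2: eigenvalues of \Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\det(\Sigma - \lambda I) = (2-\lambda)^2 - 1 = 0
\;\Rightarrow\; \lambda_1 = 3, \quad \lambda_2 = 1

% Step 2.3: eigenvectors
(\Sigma - 3I)v = 0 \;\Rightarrow\; v_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
(\Sigma - I)v = 0 \;\Rightarrow\; v_2 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}

% Step 2.4: order eigenvalues: \lambda_1 = 3 > \lambda_2 = 1, so PC1 = v_1 and PC2 = v_2.

% Step 2.5: variance explained by PC1
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{3}{3 + 1} = 75\%
```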
Maximum Variance Perspective of PCA
What is the idea?
We have high-dimensional data (e.g., 2D or 3D) and want to reduce it to
fewer dimensions (e.g., 1D) while keeping maximum information.
Information = spread of data = variance.
So, choose a line (direction) where the variance of projected data is
maximum.
1. What is spread of data?
Spread = how far data points are from the center (mean).
If data points are close to the mean → small spread.
If data points are far from the mean → large spread.
Mathematically, spread is measured by distance from the mean.
2. Measuring distance: deviation from the mean
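In symbols (a short summary using standard definitions, with x_n a data point and \bar{x} the mean):

```latex
% Deviation of a data point x_n from the mean:
d_n = x_n - \bar{x}, \qquad \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n

% Spread (variance) = average squared deviation from the mean:
\operatorname{Var} = \frac{1}{N}\sum_{n=1}^{N} \lVert x_n - \bar{x} \rVert^{2}
```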
Maximum Variance Perspective of PCA: step-by-step
• PCA looks for a line (direction) along which the data points are spread out the most.
• This line is called the first principal component (PC1).
• By projecting data onto this line, we keep most of the important information but in fewer dimensions.
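The derivation behind this, in outline (assuming the data have been centered; this follows the standard maximum-variance argument in Deisenroth et al., Chapter 10):

```latex
% Variance of the data projected onto a unit vector b, where S is the data covariance matrix:
V(b) = \frac{1}{N}\sum_{n=1}^{N} (b^{\top} x_n)^{2} = b^{\top} S b, \qquad \lVert b \rVert = 1

% Maximize with a Lagrange multiplier \lambda for the unit-norm constraint:
\mathcal{L}(b, \lambda) = b^{\top} S b + \lambda (1 - b^{\top} b)
\;\Rightarrow\; S b = \lambda b

% So the optimal direction b is an eigenvector of S, the attained variance is
% b^{\top} S b = \lambda, and the best choice is the eigenvector with the
% largest eigenvalue: the first principal component PC1.
```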
Projection Perspective:
Instead of thinking about “maximum spread,” projection perspective
looks at it like this:
“If we drop the data onto a line, which line gives the least
reconstruction error?”
In other words, we want the line where points stay closest to their
original positions after projection and reconstruction.
There are two complementary ways to understand PCA mathematically:
1. Maximum Variance Perspective (already studied):
   • Finds directions (principal components) with maximum variance.
   • Equivalent to finding eigenvectors of the covariance matrix.
2. Projection Perspective (to study now):
   • Minimizes reconstruction error when projecting data onto a subspace.
   • Equivalent mathematically to the variance perspective but focuses on error minimization.
Motivation
•In real-world applications, we often project high-dimensional data onto
fewer dimensions (like a plane or line).
•The question: How do we choose this projection to minimize the
information lost?
Maximum Variance Perspective says: “Pick the direction with maximum
spread.”
Projection Perspective says: “Pick the subspace that gives the smallest
reconstruction error after projection.”
Both lead to the same principal components but are derived differently:
•Variance View: Maximize variance of projected data.
•Projection View: Minimize error of reconstructing original data from
projection.
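Why the two views agree, in one line (assuming centered data and a projection onto the top-k principal directions; the λ_i are the eigenvalues of the covariance matrix S):

```latex
\underbrace{\operatorname{tr}(S) = \sum_{i=1}^{D} \lambda_i}_{\text{total variance}}
= \underbrace{\sum_{i=1}^{k} \lambda_i}_{\text{variance kept by the projection}}
+ \underbrace{\sum_{i=k+1}^{D} \lambda_i}_{\text{average squared reconstruction error}}

% The total is fixed, so maximizing the kept variance (variance view) is the
% same as minimizing the reconstruction error (projection view): both yield
% the same principal components.
```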
How It Works
1. Choose a line (direction) → candidate principal component.
2. Project data points onto this line (like casting shadows).
3. Reconstruct back (lift shadows back to 2D).
4. Measure error: distance between each original point and its reconstructed point.
5. Find the direction that gives the smallest total error.
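A small NumPy sketch of this search over candidate directions; the toy 2-D data and the function name are our own, chosen only to illustrate the procedure (the minimum-error direction it finds is PC1):

```python
import numpy as np

def reconstruction_error(X, b):
    """Average squared error after projecting centered 2-D data X onto the
    unit direction b and reconstructing it back."""
    b = b / np.linalg.norm(b)
    Z = X @ b                   # 1-D coordinates along b (the "shadows")
    X_rec = np.outer(Z, b)      # lift the shadows back to 2-D
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))

# Correlated 2-D toy data, centered at the origin.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], size=500)
X -= X.mean(axis=0)

# Try several candidate directions; the smallest error marks PC1.
for angle in np.linspace(0, np.pi, 8, endpoint=False):
    b = np.array([np.cos(angle), np.sin(angle)])
    print(f"angle {np.degrees(angle):6.1f} deg  error {reconstruction_error(X, b):.3f}")
```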
Interpretation:
•When we reduce dimensions (2D → 1D), we lose some information.
•PCA keeps the direction of maximum variance (PC1) and ignores the second direction (PC2).
•The error represents information lost in the ignored direction.
•Error = 0.0665 (very small).
•Meaning: Projection onto PC1 retains almost all the information (≈95.9% of the variance is kept).
•The lost variance (≈4%) is small, so 1D is a good approximation.
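These percentages come from the eigenvalues of this example's covariance matrix (the exact eigenvalues are not repeated here); the general relationship is:

```latex
% Fraction of variance kept by PC1 and fraction lost with PC2:
\text{kept} = \frac{\lambda_1}{\lambda_1 + \lambda_2} \approx 95.9\%,
\qquad
\text{lost} = \frac{\lambda_2}{\lambda_1 + \lambda_2} \approx 4\%

% If the reported error 0.0665 is the average squared reconstruction error,
% it equals the discarded eigenvalue, i.e. \lambda_2 \approx 0.0665.
```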