CS303: Mathematical Foundations for AI
Nonlinear Dimensionality Reduction
23 Jan 2025
Recap
• Recap
▶ Principal Component Analysis (PCA)
▶ Linear Discriminant Analysis
• kernel PCA
• Multidimensional Scaling (MDS)
• Isometric Mapping (ISOMAP)
1 / 28
References
• kernel PCA
▶ PCA and Fisher’s Discriminant Analysis – Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
▶ kernel PCA
▶ kernel Matrix
• MDS
▶ Video Lecture
▶ Slides
• ISOMAP
▶ Original Paper
▶ Video Lecture
▶ Slides
2 / 28
Manifold
3 / 28
Manifold
Locally Euclidean (flat)
4 / 28
PCA on Swiss Roll Dataset
PCA Fails
4 / 28
Recall PCA
• X is the n × d data matrix
• Center the data (i.e., subtract the column means) to obtain X̄
• X̄ᵀX̄ is the covariance matrix (d × d), up to a factor of 1/n
• Compute the top k eigenvectors v1, …, vk of X̄ᵀX̄
• The new data X′ = X̄ [v1 … vk], which is n × k
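To make these steps concrete, here is a minimal NumPy sketch of the recipe (the function name, the random example data, and the use of np.linalg.eigh are illustrative choices, not the course notebook):

    import numpy as np

    def pca(X, k):
        """Project the n x d data matrix X onto its top-k principal components."""
        Xbar = X - X.mean(axis=0)              # center the data
        C = Xbar.T @ Xbar                      # d x d (covariance up to a 1/n factor)
        eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
        V = eigvecs[:, ::-1][:, :k]            # top-k eigenvectors v1, ..., vk
        return Xbar @ V                        # n x k projected data

    X = np.random.randn(100, 5)
    X_proj = pca(X, k=2)
    print(X_proj.shape)                        # (100, 2)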
5 / 28
Recall PCA for Wide Matrices
We have X̄ with n ≪ d
• We compute X̄X̄ᵀ, which is n × n; let K = X̄X̄ᵀ
• Compute the top k eigenvectors of X̄X̄ᵀ, denoted u1, …, uk
• The normalized eigenvectors vi of X̄ᵀX̄ are
$$ v_i = \frac{1}{\sqrt{\lambda_i}} \bar{X}^T u_i $$
Proof.
$$ \bar{X}\bar{X}^T u_i = \lambda_i u_i $$
Multiplying both sides on the left by X̄ᵀ:
$$ \bar{X}^T \bar{X}\,(\bar{X}^T u_i) = \lambda_i\,(\bar{X}^T u_i) $$
so X̄ᵀui is an eigenvector of X̄ᵀX̄ with eigenvalue λi. Normalizing,
$$ v_i = \frac{\bar{X}^T u_i}{\lVert \bar{X}^T u_i \rVert}, \qquad \lVert \bar{X}^T u_i \rVert^2 = u_i^T \bar{X}\bar{X}^T u_i = \lambda_i. $$
6 / 28
Recall PCA for Wide Matrices
We have X̄ with n ≪ d
• We compute X̄X̄ᵀ; let K = X̄X̄ᵀ
• Compute the top k eigenvectors of X̄X̄ᵀ, denoted u1, …, uk
• The eigenvectors vi of X̄ᵀX̄ are
$$ v_i = \frac{1}{\sqrt{\lambda_i}} \bar{X}^T u_i $$
• The new data X′, i.e., n × k, is given by
$$ X' = \bar{X}\,[v_1 \cdots v_k] = \bar{X}\bar{X}^T [u_1 \cdots u_k]\,\Lambda^{-1/2} = \bar{K}\,[u_1 \cdots u_k]\,\Lambda^{-1/2} $$
(since X̄ is already centered, K = X̄X̄ᵀ is itself the centered Gram matrix K̄)
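A quick numerical sanity check of this equivalence (a sketch only; the matrix sizes and random data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 20, 500, 3                       # wide matrix: n << d
    X = rng.standard_normal((n, d))
    Xbar = X - X.mean(axis=0)

    K = Xbar @ Xbar.T                          # n x n Gram matrix
    lam, U = np.linalg.eigh(K)
    lam, U = lam[::-1][:k], U[:, ::-1][:, :k]  # top-k eigenpairs of K

    V = Xbar.T @ U / np.sqrt(lam)              # v_i = Xbar^T u_i / sqrt(lambda_i), d x k

    # v_i are unit-norm eigenvectors of Xbar^T Xbar with the same eigenvalues
    print(np.allclose((Xbar.T @ Xbar) @ V, V * lam))    # True
    print(np.allclose(np.linalg.norm(V, axis=0), 1.0))  # True

    # The projection Xbar V equals K U Lambda^{-1/2}
    print(np.allclose(Xbar @ V, K @ U / np.sqrt(lam)))  # True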
6 / 28
kernel PCA
The projected data X′ (n × k) is given by
$$ X' = \bar{K}\,[u_1 \cdots u_k]\,\Lambda^{-1/2} $$
where the ui are the eigenvectors of K̄.
Note: we do not need to know X at all; all we need is the kernel matrix K (the inner products between all pairs of data points).
“kernel trick”
7 / 28
kernel PCA
8 / 28
kernel PCA
• Project the data x to a higher-dimensional space ϕ(x)
• To reduce dimensions, we are going to “higher dimensions”?
• Note that we will never need ϕ(x) itself (a row vector), just as we did not need X but only K̄
• All we need is the following n × n matrix
$$ K = \phi(X)\phi(X)^T = \begin{pmatrix} \phi(x_1)\phi(x_1)^T & \cdots & \phi(x_1)\phi(x_n)^T \\ \vdots & \ddots & \vdots \\ \phi(x_n)\phi(x_1)^T & \cdots & \phi(x_n)\phi(x_n)^T \end{pmatrix} $$
• Note that the above matrix is symmetric and positive semi-definite
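As a small (hypothetical) illustration of this point: for the quadratic kernel K(x, y) = (xᵀy)², a 2-D point has the explicit feature map ϕ(x) = (x₁², √2·x₁x₂, x₂²), and both routes give the same n × n matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 2))            # n = 6 points in 2-D

    # Explicit feature map for the quadratic kernel (x^T y)^2
    def phi(x):
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    Phi = np.array([phi(x) for x in X])        # n x 3 matrix of mapped points
    K_explicit = Phi @ Phi.T                   # phi(x_i) phi(x_j)^T for all pairs
    K_trick = (X @ X.T) ** 2                   # same matrix, without computing phi

    print(np.allclose(K_explicit, K_trick))    # True
    print(np.allclose(K_trick, K_trick.T))     # symmetric
    print(np.all(np.linalg.eigvalsh(K_trick) >= -1e-10))  # positive semi-definite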
9 / 28
kernel PCA
Given X,
• We need K̄
• Compute the top k eigenvectors of K̄, u1, …, uk
• Given an input xi, the output with reduced dimensions is given by yi = (yi1, …, yik), where
$$ y_{i1} = \sum_{j=1}^{n} \bar{K}(x_i, x_j)\,\frac{u_{1j}}{\sqrt{\lambda_1}}, \quad \ldots, \quad y_{ik} = \sum_{j=1}^{n} \bar{K}(x_i, x_j)\,\frac{u_{kj}}{\sqrt{\lambda_k}} $$
In matrix form:
$$ Y = \bar{K}\,[u_1 \cdots u_k]\,\Lambda^{-1/2} $$
10 / 28
kernel PCA
• How do we get K̄ from K? That is, the kernel matrix of the mean-centered feature space.
• How do we choose K? Which higher-dimensional space?
11 / 28
Obtaining K̄
Centering in Feature Space
$$ \mu_\phi = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i); \qquad \phi'(x_i) = \phi(x_i) - \mu_\phi, \quad \forall i \in \{1, 2, \ldots, n\}. $$
The centered kernel matrix K̄ is computed as:
$$ \bar{K}_{ij} = \langle \phi'(x_i), \phi'(x_j) \rangle. $$
Expanding ϕ′(xi) and ϕ′(xj): substitute ϕ′(xi) = ϕ(xi) − µϕ to get
$$ \bar{K}_{ij} = \langle \phi(x_i) - \mu_\phi,\; \phi(x_j) - \mu_\phi \rangle. $$
12 / 28
Obtaining K̄
Expand the inner product:
$$ \bar{K}_{ij} = \langle \phi(x_i), \phi(x_j) \rangle - \langle \phi(x_i), \mu_\phi \rangle - \langle \mu_\phi, \phi(x_j) \rangle + \langle \mu_\phi, \mu_\phi \rangle. $$
• First term: $\langle \phi(x_i), \phi(x_j) \rangle = K_{ij}$
• Second term: $\langle \phi(x_i), \mu_\phi \rangle = \frac{1}{n} \sum_{k=1}^{n} \langle \phi(x_i), \phi(x_k) \rangle = \frac{1}{n} \sum_{k=1}^{n} K_{ik}$
• Third term: $\langle \mu_\phi, \phi(x_j) \rangle = \frac{1}{n} \sum_{k=1}^{n} \langle \phi(x_k), \phi(x_j) \rangle = \frac{1}{n} \sum_{k=1}^{n} K_{kj}$
• Fourth term: $\langle \mu_\phi, \mu_\phi \rangle = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} \langle \phi(x_k), \phi(x_l) \rangle = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} K_{kl}$
12 / 28
Obtaining K̄
Combine these results:
$$ \bar{K}_{ij} = K_{ij} - \frac{1}{n} \sum_{k=1}^{n} K_{ik} - \frac{1}{n} \sum_{k=1}^{n} K_{kj} + \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} K_{kl}. $$
In matrix form:
$$ \bar{K} = K - \frac{1}{n} K \mathbf{1}_n - \frac{1}{n} \mathbf{1}_n K + \frac{1}{n^2} \mathbf{1}_n K \mathbf{1}_n, $$
where $\mathbf{1}_n$ is an n × n matrix with all entries equal to 1.
Simplify further using $H = I_n - \frac{1}{n} \mathbf{1}_n$ (the centering matrix):
$$ \bar{K} = H K H. $$
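A short sketch of this centering step (function names are illustrative, not from the course notebook):

    import numpy as np

    def center_kernel(K):
        """Return Kbar = H K H, where H = I - (1/n) * ones matrix."""
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
        return H @ K @ H

    # Equivalent elementwise form: Kij - row mean - column mean + grand mean
    def center_kernel_elementwise(K):
        row = K.mean(axis=1, keepdims=True)
        col = K.mean(axis=0, keepdims=True)
        return K - row - col + K.mean()

    K = np.random.rand(5, 5)
    K = K @ K.T                                # make it a valid (PSD) kernel matrix
    print(np.allclose(center_kernel(K), center_kernel_elementwise(K)))  # True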
12 / 28
Choice of K
• ϕ(X)ϕ(X)ᵀ leads to a symmetric and positive semidefinite K
• Any K which is symmetric and positive semidefinite has a corresponding ϕ(X) (Mercer’s theorem)
A popular choice is the Radial Basis Function (RBF) kernel,
$$ K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right). $$
13 / 28
Choice of K
Which ϕ leads to the RBF kernel?
13 / 28
Implementation
kernelPCA code
Summary: given a Gram matrix, i.e., K or XXᵀ (linear kernel), we can compute the data in a lower dimension without access to X
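The linked notebook is not reproduced here; below is a minimal sketch of kernel PCA with an RBF kernel, along the lines of the algorithm above (the dataset, σ = 2.0, and k = 2 are illustrative choices):

    import numpy as np

    def rbf_kernel(X, sigma=1.0):
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2 * sigma ** 2))

    def kernel_pca(K, k):
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        Kbar = H @ K @ H                           # centered kernel matrix
        lam, U = np.linalg.eigh(Kbar)
        lam, U = lam[::-1][:k], U[:, ::-1][:, :k]  # top-k eigenpairs
        return Kbar @ U / np.sqrt(lam)             # Y = Kbar U Lambda^{-1/2}

    X = np.random.randn(200, 10)
    Y = kernel_pca(rbf_kernel(X, sigma=2.0), k=2)
    print(Y.shape)                                 # (200, 2)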
14 / 28
Unrolling Swiss Roll
15 / 28
Multidimensional Scaling
16 / 28
Multidimensional Scaling (MDS)
• We have X in n × d (High dimension)
• We need to go to Y in n × k (Low dimension)
$$ \min_{Y,\ \mathrm{rank}(Y) \le k} \lVert D_X - D_Y \rVert_F^2 $$
• DX (ij): distance between xi and x j
• DY (ij): distance between yi and y j
“Preserving the distances after transformation”
17 / 28
Classical Multidimensional Scaling
The given distances are Euclidean
• $D_X(ij) = d_{ij}^2 = \lVert x_i - x_j \rVert_2^2$
• $d_{ij}^2 = \lVert x_i - x_j \rVert^2 = \lVert x_i \rVert^2 + \lVert x_j \rVert^2 - 2 x_i x_j^T$ (Eq. 1)
• Can we get the kernel (Gram matrix) from the distances?
▶ Non-unique
▶ Assume translational invariance ∑i xi = 0
• From the Gram matrix we can then obtain the optimal low-rank representation
18 / 28
Classical MDS
Solve the following 3 equations for xi xjᵀ (using ∑_i x_i = 0):
Summing Eq. 1 over j (the cross term vanishes since ∑_j x_j = 0):
$$ \sum_j d_{ij}^2 = n \lVert x_i \rVert^2 + \sum_j \lVert x_j \rVert^2, \qquad \lVert x_i \rVert^2 = \frac{1}{n}\left( \sum_j d_{ij}^2 - \sum_j \lVert x_j \rVert^2 \right) = \frac{1}{n} \sum_j d_{ij}^2 - \frac{1}{2n^2} \sum_{ij} d_{ij}^2 $$
Summing Eq. 1 over i:
$$ \sum_i d_{ij}^2 = \sum_i \lVert x_i \rVert^2 + n \lVert x_j \rVert^2, \qquad \lVert x_j \rVert^2 = \frac{1}{n} \sum_i d_{ij}^2 - \frac{1}{2n^2} \sum_{ij} d_{ij}^2 $$
Summing over both i and j:
$$ \sum_i \sum_j d_{ij}^2 = 2n \sum_k \lVert x_k \rVert^2 $$
Substituting the above in Eq. 1, we have
$$ x_i x_j^T = \frac{1}{2} \left( \lVert x_i \rVert^2 + \lVert x_j \rVert^2 - d_{ij}^2 \right) $$
19 / 28
Classical MDS
Using these, the centered Gram matrix K (with Kij = xi xjᵀ) is:
$$ K_{ij} = -\frac{1}{2} \left( d_{ij}^2 - \frac{1}{n} \sum_i d_{ij}^2 - \frac{1}{n} \sum_j d_{ij}^2 + \frac{1}{n^2} \sum_{ij} d_{ij}^2 \right). $$
Let $\mathbf{1}_n$ be an n × n matrix of ones. In matrix form:
$$ K = -\frac{1}{2} H D_X H, $$
where $H = I_n - \frac{1}{n} \mathbf{1}_n$ is the centering matrix
20 / 28
Classical MDS
Complete Algorithm
• We are given DX
• Compute $K = -\frac{1}{2} H D_X H$
• Find the top k eigenvectors $V = [v_1 \cdots v_k]$ of K
• The new data Y would be
$$ Y = K V \Lambda^{-1/2} = V \Lambda^{1/2} \qquad \left( \text{since } K \frac{v_i}{\sqrt{\lambda_i}} = \lambda_i \frac{v_i}{\sqrt{\lambda_i}} = \sqrt{\lambda_i}\, v_i \right) $$
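A minimal sketch of the complete algorithm (assuming, as defined earlier, that DX holds squared Euclidean distances; the sanity check at the end is illustrative):

    import numpy as np

    def classical_mds(D_sq, k):
        """D_sq: n x n matrix of squared pairwise distances; returns n x k embedding."""
        n = D_sq.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        K = -0.5 * H @ D_sq @ H                      # K = -1/2 H D_X H
        lam, V = np.linalg.eigh(K)
        lam, V = lam[::-1][:k], V[:, ::-1][:, :k]    # top-k eigenpairs
        return V * np.sqrt(lam)                      # Y = V Lambda^{1/2}

    # Sanity check: recover a 2-D configuration from its own distances
    X = np.random.randn(50, 2)
    D_sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Y = classical_mds(D_sq, k=2)
    # Y matches X up to rotation/reflection and translation; distances agree:
    D_sq_Y = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    print(np.allclose(D_sq, D_sq_Y))                 # True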
21 / 28
Metric MDS
For Swiss roll data, the Euclidean distance is not a good choice!
• D is any general distance metric
▶ d( x, x ) = 0
▶ Non-negative: d( x, y) ≥ 0
▶ Symmetry: d( x, y) = d(y, x )
▶ Triangle Inequality: d( x, z) ≤ d( x, y) + d(y, z)
• Examples?
22 / 28
Metric MDS
• A non-Euclidean D will not recover the exact embedding :(
• How good is the embedding found?
$$ \mathrm{stress}(y) = \frac{\sum_{ij} \left( d_{ij} - \lVert y_i - y_j \rVert \right)^2}{\sum_{ij} d_{ij}^2} $$
• Find the y’s that “minimize stress” :)
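A small sketch of this stress computation (using the normalized form written above; variable names and the random example are illustrative):

    import numpy as np

    def stress(D, Y):
        """D: n x n target distances; Y: n x k embedding. Lower is better."""
        D_Y = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)  # pairwise ||y_i - y_j||
        return np.sum((D - D_Y) ** 2) / np.sum(D ** 2)

    # Example: stress of a random 2-D embedding of random 5-D points
    X = np.random.randn(30, 5)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    print(stress(D, np.random.randn(30, 2)))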
23 / 28
ISOMAP
What D should we use for a Swiss-roll-like dataset?
Geodesic distance
24 / 28
ISOMAP
Given the data X, which lies in a high-dimensional space but on a low-dimensional manifold:
• Is the manifold known?
• Then how do we compute the geodesic distance?
25 / 28
ISOMAP
Exploit the “locally Euclidean” property
• Given a point x
• Find the neighbors of x (ϵ-ball)
• Build the nearest-neighbor graph with edge weights given by the Euclidean distances
26 / 28
ISOMAP
How do we find the distance DX(ij)?
• DX(ij) is the shortest-path distance between i and j on the nearest-neighbor graph (Dijkstra’s algorithm)
27 / 28
ISOMAP
ISOMAP Algorithm
• Given: pairwise distances dij between high-dimensional input points xi, xj
• Compute the nearest-neighbour graph G using an ϵ-ball
• Compute DX from G (shortest paths)
• Apply MDS on DX to obtain the low-dimensional Y
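A compact sketch of these steps (using a k-nearest-neighbour graph in place of the ϵ-ball, and SciPy’s shortest_path routine for Dijkstra; all parameter choices are illustrative, and the neighbourhood graph is assumed to be connected):

    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from scipy.spatial.distance import pdist, squareform

    def isomap(X, n_neighbors=10, k=2):
        D = squareform(pdist(X))                          # pairwise Euclidean distances
        # Keep only edges to the n_neighbors closest points (0 = no edge)
        G = np.zeros_like(D)
        idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
        for i, nbrs in enumerate(idx):
            G[i, nbrs] = D[i, nbrs]
        G = np.maximum(G, G.T)                            # symmetrize the graph
        # Geodesic distances via Dijkstra; assumes the graph is connected
        D_geo = shortest_path(G, method="D", directed=False)
        # Classical MDS on squared geodesic distances
        n = D_geo.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        K = -0.5 * H @ (D_geo ** 2) @ H
        lam, V = np.linalg.eigh(K)
        lam, V = lam[::-1][:k], V[:, ::-1][:, :k]
        return V * np.sqrt(lam)

    X = np.random.randn(300, 3)
    Y = isomap(X, n_neighbors=10, k=2)
    print(Y.shape)                                        # (300, 2)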
27 / 28
Implementation ISOMAP
Notebook
28 / 28