Linear Factor Models
Sargur N. Srihari
srihari@cedar.buffalo.edu
Topics in Linear Factor Models
1. Definition of Linear Factor Models
2. Related methods
   1. Principal Components Analysis
   2. Factor Analysis
3. Linear Factor Models generalize the above
4. Independent Component Analysis (ICA)
5. Slow Feature Analysis
6. Sparse Coding
7. Manifold Interpretation of PCA
Deep Learning is about Models
• Many research frontiers of deep learning involve building a
  probabilistic model of the input, pmodel(x)
• Such a model can be used with probabilistic inference to predict
  any of the variables given any other variables
Simplest model with latent variables
• Much of deep learning aims to construct pmodel(x)
  – Useful for predicting some variables given other variables
• With latent variables, pmodel(x) = E_h pmodel(x | h)
  – Latent variables provide another means of representing the data
  – Representations based on latent variables can obtain all the
    advantages of feedforward and recurrent networks
• Linear factor models are the simplest models with latent variables
Models with Latent Variables
• Much of deep learning involves building a
probabilistic model of input pmodel(x)
– From which we can infer any other variables
• Many models also have latent variables, h
  – We can write pmodel(x) = E_h pmodel(x | h), since
      p(x) = Σ_h p(x, h) = Σ_h p(x | h) p(h) = E_h p(x | h)
– These latent variables provide another means of
representing the data
• Distributed representations based on latent
variables can have all the advantages of
representation learning with deep feed-forward
and recurrent networks
Linear factor models
• Linear factor models are the simplest probabilistic models with
  latent variables
  – They are used as building blocks for:
    • Mixture models
    • Deep probabilistic models
• They are basic approaches to building generative models, which
  deeper models then extend
• A linear factor model is defined by a stochastic linear decoder
  function that generates x by adding noise to a linear transformation
  of h, i.e.,
      x = Wh + b + noise
Linear Factor Model Definition
• A linear factor model describes a data-generating process as follows
  – First we sample the explanatory factors h from a distribution
    h ~ p(h), where p(h) is a factorial distribution
      p(h) = Π_i p(h_i)
    so that it is easy to sample from
  – Next we sample the real-valued observable variables given the
    factors:
      x = Wh + b + noise
    where the noise is typically Gaussian and diagonal (independent
    across dimensions)
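
As a concrete illustration, here is a minimal numpy sketch of this two-step
ancestral sampling process (the dimensions, the Laplace choice for the
factorial prior p(h), and the noise scale are arbitrary assumptions for
illustration):

import numpy as np

rng = np.random.default_rng(0)
n, k, N = 4, 2, 1000            # observed dim, latent dim, number of samples
W = rng.normal(size=(n, k))     # factor loading matrix
b = rng.normal(size=n)          # offset
sigma = 0.1                     # scale of the diagonal Gaussian noise

# Step 1: sample the factors from a factorial prior (independent Laplace factors here)
h = rng.laplace(size=(N, k))

# Step 2: sample the observations, x = Wh + b + noise
x = h @ W.T + b + sigma * rng.normal(size=(N, n))
print(x.shape)                  # (1000, 4)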
Graphical Representation of Linear Factor Model
    h ~ p(h)  with  p(h) = Π_i p(h_i)
    x = Wh + b + noise
(Figure: directed graphical model in which the observed x is produced
from the factors h, plus noise.)
Special cases of Linear Factor Model
    h ~ p(h)  with  p(h) = Π_i p(h_i)
    x = Wh + b + noise
• Special cases of the above equations are:
  1. Probabilistic PCA
  2. Factor analysis
  3. Other linear factor models
• They differ in the choices made for the noise distribution and for
  the prior over the latent variables h before observing x
  – Factor Analysis and Probabilistic PCA are shown next
Factor Analysis
    h ~ p(h)  with  p(h) = Π_i p(h_i)
    x = Wh + b + noise
• The prior p(h) is a unit-variance Gaussian: h ~ N(h; 0, I)
• The x_i are conditionally independent given h
  – The noise is Gaussian with diagonal covariance matrix ψ = diag(σ²),
    where σ² = [σ₁², …, σ_n²] is a vector of per-dimension variances
  – The latent variables capture the dependencies between the x_i
• It can be shown that x is multivariate Gaussian:
      x ~ N(x; b, WWᵀ + ψ)
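
A quick numpy check (a sketch with arbitrary small dimensions) that data
generated by this process does have marginal covariance WWᵀ + ψ:

import numpy as np

rng = np.random.default_rng(0)
n, k, N = 5, 2, 200_000
W = rng.normal(size=(n, k))
b = rng.normal(size=n)
psi = np.diag(rng.uniform(0.1, 0.5, size=n))    # diagonal noise covariance

h = rng.normal(size=(N, k))                     # h ~ N(0, I)
noise = rng.multivariate_normal(np.zeros(n), psi, size=N)
x = h @ W.T + b + noise                         # x = Wh + b + noise

# empirical covariance should match W W^T + psi up to sampling error
print(np.allclose(np.cov(x, rowvar=False), W @ W.T + psi, atol=0.05))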
Probabilistic PCA
    h ~ p(h)  with  p(h) = Π_i p(h_i)
    x = Wh + b + noise
• A slightly modified factor analysis model
• Assume equal conditional variances: σ² = σ₁² = … = σ_n²
  – Thus x ~ N(x; b, WWᵀ + σ²I), or equivalently x = Wh + b + σz,
    where z ~ N(z; 0, I) is Gaussian noise
  – Iterative EM can be used to estimate W and σ²
  – This takes advantage of the observation that most variations in
    the data are captured by the latent variables h, up to a small
    residual reconstruction error σ²
• Probabilistic PCA becomes PCA as σ → 0
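
A minimal numpy sketch of that EM procedure, following the standard
Tipping and Bishop updates (the function name ppca_em, the random
initialization, and the fixed iteration count are illustrative assumptions,
not from the slides):

import numpy as np

def ppca_em(X, k, n_iter=200):
    """EM for probabilistic PCA: estimates b, W (d x k) and the noise variance sigma^2."""
    N, d = X.shape
    b = X.mean(axis=0)
    Xc = X - b
    rng = np.random.default_rng(0)
    W, sigma2 = rng.normal(size=(d, k)), 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of h given each x
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(k))
        Eh = Xc @ W @ Minv                               # E[h | x], one row per example
        sum_Ehh = N * sigma2 * Minv + Eh.T @ Eh          # sum over examples of E[h h^T | x]
        # M-step: update W and sigma^2
        W = (Xc.T @ Eh) @ np.linalg.inv(sum_Ehh)
        sigma2 = (np.sum(Xc**2) - 2 * np.sum(Eh * (Xc @ W))
                  + np.trace(sum_Ehh @ W.T @ W)) / (N * d)
    return b, W, sigma2

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
b_hat, W_hat, s2_hat = ppca_em(X, k=2)
print(s2_hat)    # should be close to the generating noise variance 0.1**2 = 0.01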
PCA (Principal Components Analysis)
https://medium.com/@mallrishabh52/principal-components-analysis-7f6ff559cd83
PCA
Principal components capture the most variation in a dataset.
PCA deals with the curse of dimensionality by capturing the essence
of the data in a few principal components.
PC1 must convey the maximum variation among the data points and
contain minimum error. PC2 is the second line that meets PC1
perpendicularly at the center of the cloud, and describes the second
most variation in the data.
PCA Algorithm (Linear Algebra)
• Given {x(1), …, x(m)} in R^n, represent each point using R^l, l < n
  – For each point x(i), find a code vector c(i) in R^l
• Find an encoder f(x) = c and a decoder such that x ≈ g(f(x))
  – One decoding function is g(c) = Dc, where D is a matrix with l
    mutually orthogonal, unit-norm columns, chosen to minimize the
    distance between x and its reconstruction r(x) = g(f(x)) = DDᵀx:
      c* = argmin_c ||x − g(c)||₂²
      D* = argmin_D sqrt( Σ_{i,j} ( x_j(i) − r(x(i))_j )² )
• Solution for D
  – D is given by the l eigenvectors of XᵀX corresponding to the
    largest eigenvalues, where X ∈ R^{m×n} is the design matrix
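
A numpy sketch of this recipe (assuming X has already been centered; the
variable names and dimensions are illustrative):

import numpy as np

def pca_encode_decode(X, l):
    """PCA using the top-l eigenvectors of X^T X as the columns of D."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    D = eigvecs[:, -l:]                          # n x l, eigenvectors of the largest eigenvalues
    C = X @ D                                    # codes: c = f(x) = D^T x
    R = C @ D.T                                  # reconstructions: r(x) = D D^T x
    return D, C, R

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                              # center the data
D, C, R = pca_encode_decode(X, l=2)
print(np.sum((X - R) ** 2))                      # total squared reconstruction error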
Confirmatory Factor Analysis
Factor analysis is a technique for identifying which underlying factors
are measured by a (much larger) number of observed variables.
Such “underlying factors” are difficult to measure directly, e.g., IQ,
depression or extraversion.
Confirmatory factor analysis tests a researcher’s hypothesis, e.g.: if
questions 1, 2 and 3 all measure numeric IQ, then the Pearson
correlations among these items should be substantial, because
respondents with high numeric IQ will typically score high on all 3
questions.
Exploratory factor analysis starts with no clue as to which, or even
how many, factors are represented by the data.
Exploratory Factor Analysis
• Psychologist’s hypothesis: there are two kinds (k = 2) of latent
  intelligence
  – Verbal (factor F1) and mathematical (factor F2)
• Evidence for the hypothesis is sought in the examination scores (x)
  from p = 6 academic fields (e.g., astronomy) of n = 1000 students
  – Observable variables x1, …, x6 with means μ1, …, μ6:
      x_i − μ_i = l_i1 F1 + l_i2 F2 + ε_i ,  i = 1, …, 6
    where the l_ij are the loadings
    • In matrix form x − μ = LF + ε, where x is p×n, L is p×k, and F is k×n
  – The values of L, μ, and the variances of the errors ε must be
    estimated from the data x and F (the assumption about the levels
    of the factors is fixed for a given F)
  – Solution: for astronomy, average student aptitude is 10F1 + 6F2
    (a code sketch of this setup follows)
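
A sketch of this setup with scikit-learn's FactorAnalysis on synthetic data
(the loading values, noise scale, and the offset of 50 are made-up numbers
for illustration; only the 10F1 + 6F2 astronomy loading echoes the slide):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_students = 1000
F = rng.normal(size=(n_students, 2))             # latent abilities: F1 (verbal), F2 (math)
L_true = np.array([[10, 6],                      # astronomy, as in the slide's example
                   [8, 2], [7, 3],               # two more verbal-heavy fields
                   [2, 9], [1, 8], [3, 7]])      # three math-heavy fields
scores = F @ L_true.T + 50.0 + rng.normal(scale=2.0, size=(n_students, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(scores)
print(fa.components_)        # estimated loadings L (recovered only up to rotation and sign)
print(fa.noise_variance_)    # estimated per-subject error variances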
Factor Analysis
• A method to describe variability among observed, correlated
  variables in terms of a smaller number of latent variables called
  factors
– E.g., variations in six observed variables mainly
reflect the variations in two latent variables
• Observed variables modeled as linear
combinations of latent factors, plus error terms
– Factor analysis aims to find independent latent
variables
PCA vs. Factor Analysis
• Both are data reduction techniques
• Both involve choosing components or factors
• Fundamental difference between them:
– PCA is a linear combination of variables
– Factor Analysis is a measurement model of a latent
variable
• PCA is a more basic version of Factor Analysis
PCA vs Factor Analysis
• Principal Components Analysis
  – Creates index variables from a larger set of measured variables
  – Four measured variables y are combined into a single component c
  – Model set up as:  c = w1*y1 + w2*y2 + w3*y3 + w4*y4
• Factor Analysis
  – A model for measuring an unobservable latent variable
  – F, the latent factor, causes the responses on the four measured
    y variables; the u’s are the variance in each y that is unexplained
    by the factor
  – Model set up as regression equations:
      y1 = b1*F + u1
      y2 = b2*F + u2
      y3 = b3*F + u3
      y4 = b4*F + u4
https://www.theanalysisfactor.com/
the-fundamental-difference-between-principal-component-analysis-and-factor-analysis/
Independent Component Analysis
• An approach to modeling linear factors
• It seeks to separate an observed signal into the underlying
  independent signals that are scaled and added together to form the
  observed data
Examples of ICA
1. Extracting source from noisy signal
(Figure: a mixed signal and the true source extracted from it.)
2. Cocktail party problem: speech signals of people
talking simultaneously are separated
ICA requires independent signals
• Signals are intended to be fully independent
rather than merely decorrelated from each other
– Independence is stronger than zero covariance
• Example: zero covariance does not imply independence
  – Sample x uniformly from [−1, 1]
  – Let s be 1 with probability 0.5, otherwise s = −1
  – Let y = sx
    • Clearly x and y are not independent, since y is generated from x
    • But x and y have zero covariance
An ICA model
• The prior p(h) is fixed ahead of time
• The model deterministically generates x = Wh
  – A nonlinear change of variables is used to determine p(x)
• Learning proceeds using maximum likelihood
• By choosing an independent p(h), we can recover underlying factors
  that are as close as possible to independent
  – This approach is often used to recover low-level signals that
    have been mixed together (a sketch follows)
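
A sketch of such signal recovery using FastICA from scikit-learn (FastICA
is one particular ICA estimator, not the maximum-likelihood variant
described above; the sources are made up, and the mixing weights are
borrowed from the next slide):

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))                    # source A: square wave
s2 = np.sin(5 * t)                             # source B: sinusoid
S = np.c_[s1, s2]

A = np.array([[1.0, -2.0],                     # mixture 1: A - 2B
              [1.7, 3.4]])                     # mixture 2: 1.7A + 3.4B
X = S @ A.T                                    # observed sensor signals

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
for i in range(2):                             # each recovered component matches one source
    print([round(abs(np.corrcoef(S_hat[:, i], s)[0, 1]), 2) for s in (s1, s2)])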
ICA signal separation
• Each example is one moment in time
• Each xi is a sensor observation of mixed signals
• Each hi is one estimate of the original signals
(Figure: independent sources A and B; observed mixtures, top: A − 2B,
bottom: 1.7A + 3.4B; and the recovered signals.)
Choice of p(h) in ICA
• All variants of ICA require p(h) to be non-Gaussian
  – This is because if p(h) is an independent prior with Gaussian
    components, then W is not identifiable
• This is different from probabilistic PCA and factor analysis,
  where p(h) is Gaussian
• A typical choice is p(h_i) = (d/dh_i) σ(h_i), the derivative of the
  logistic sigmoid
  – Such densities have larger peaks near 0 than the Gaussian does,
    so ICA can be seen as learning sparse features
Generalization of ICA
• PCA generalizes to nonlinear autoencoders
• ICA generalizes to a nonlinear generative
model
– Use a nonlinear f to generate observed data
Slow Feature Analysis
• It is a Linear factor model
• Uses information from time signals to learn
invariant features
• Motivation: Slowness principle
– Important characteristics change slowly compared
to individual measurements that make up a scene
• Computer vision example shown next
SFA in computer vision
• Individual pixels can change very rapidly
• Ex: zebra moves from right to left
– Pixels change rapidly from black to white to black
– Feature indicating whether zebra is in image
changes slowly
• Regularize model to learn features that change
slowly with time
29
Deep Learning Srihari
Slowness Principle
• The slowness principle can be applied to any model trained with
  gradient descent
• It is introduced by adding a term to the cost function of the form
      λ Σ_t L( f(x(t)), f(x(t+1)) )
  – where f is the feature extractor to be regularized
  – λ is the strength of the slowness regularization term
  – L is a loss function measuring the distance between f(x(t)) and
    f(x(t+1))
    • A common choice for L is the mean squared difference (see the
      sketch below)
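
A minimal numpy sketch of this regularizer with the mean-squared-difference
choice of L (the function name and the toy feature trajectories are
illustrative assumptions):

import numpy as np

def slowness_penalty(features, lam=1.0):
    """lam * sum over t of ||f(x(t+1)) - f(x(t))||^2 for a sequence of feature vectors."""
    diffs = features[1:] - features[:-1]         # consecutive differences along time
    return lam * np.sum(diffs ** 2)

t = np.linspace(0, 1, 200)[:, None]
slow_features = np.sin(2 * np.pi * t)            # slowly varying feature trajectory
fast_features = np.sin(40 * np.pi * t)           # rapidly varying feature trajectory
print(slowness_penalty(slow_features) < slowness_penalty(fast_features))   # True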
Sparse Coding
• A linear factor model
• It has been studied extensively as an unsupervised feature learning
  and feature extraction mechanism
• Terminology
  – Strictly, “sparse coding” refers to inferring the values of h in
    the model
  – “Sparse modeling” refers to the process of designing and learning
    the model
  – But the term sparse coding is often used to refer to both
Sparse Coding Definition
• Sparse coding uses a linear decoder plus noise, as in other linear
  factor models:
      p(x | h) = N(x; Wh + b, (1/β) I)
• The prior p(h) is chosen to be sharply peaked near 0, e.g., a
  factorized Laplace prior
      p(h_i) = Laplace(h_i; 0, 2/λ) = (λ/4) exp(−(λ/2) |h_i|)
• Inferring the code h for a given x is then MAP inference:
      h* = argmax_h p(h | x) = argmin_h  λ ||h||₁ + β ||x − Wh||₂²
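
As an illustration of sparse coding in the inference sense, here is a small
ISTA sketch that infers h by minimizing (1/2)||x − Wh||₂² + λ||h||₁ (ISTA
itself, the random dictionary, and all the numbers below are assumptions
for illustration, not part of the slides):

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, W, lam=0.1, n_iter=1000):
    """Infer a sparse code h minimizing 0.5*||x - W h||^2 + lam*||h||_1 via ISTA."""
    step = 1.0 / np.linalg.norm(W, 2) ** 2       # step size from the Lipschitz constant
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ h - x)                 # gradient of the quadratic term
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))                    # overcomplete dictionary
h_true = np.zeros(50)
h_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = W @ h_true + 0.01 * rng.normal(size=20)
h_hat = sparse_code_ista(x, W)
print(np.nonzero(np.abs(h_hat) > 0.1)[0])        # should roughly recover indices [3, 17, 41]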
Manifold Interpretation of PCA
• Linear factor models including PCA and factor
analysis can be interpreted as learning a
manifold
• Probabilistic PCA learns a thin, pancake-shaped region of high
  probability
• Illustrated next
Flat Gaussian near a low-dimensional manifold
(Figure: shows the upper half of the “pancake” above the manifold
plane, which passes through its middle.)
The variance orthogonal to the manifold is small (it can be regarded
as noise), while the other variances are large (they correspond to
signal).
Generality of the Interpretation
• The manifold interpretation applies not just to PCA but to any
  linear autoencoder that learns matrices W and V with the goal of
  making the reconstruction of x lie as close to x as possible
• Let the encoder be h = f(x) = Wᵀ(x − μ)
  – The encoder computes a low-dimensional representation h
• In the autoencoder view, the decoder computes the reconstruction
  x̂ = g(h) = b + Vh
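
A small numpy sketch of such an encoder/decoder pair, taking both W and V
to be the top principal directions and b = μ (the synthetic data and the
choice of 3 components are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10)) \
    + 0.01 * rng.normal(size=(500, 10))          # approximately 3-dimensional data in R^10
mu = X.mean(axis=0)

_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
D = Vt[:3].T                                     # 10 x 3: top-3 principal directions

f = lambda x: D.T @ (x - mu)                     # encoder:  h = W^T (x - mu), with W = D
g = lambda h: mu + D @ h                         # decoder:  x_hat = b + V h, with b = mu, V = D
x = X[0]
print(np.linalg.norm(x - g(f(x))))               # small reconstruction error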
Summary of Linear Factor Models
• Linear factor models are
– The simplest generative models
– Simplest models that learn a representation of data
• Analogy between linear classifiers and linear factor models:
  1. Linear classifier/regression models are extended to deep
     feedforward networks
  2. Linear factor models are extended to autoencoder networks and
     deep probabilistic models
     – These perform the same tasks but with a much more powerful and
       flexible model family
Distribution of stars: Galaxy M31 in Andromeda
M31 is 2.1 million light years away and heading on a collision course
with the Milky Way. They should collide in about 4 billion years. We
won’t feel much from the mash-up, as there is so much empty space
between the stars of both galaxies that few, if any, will notice.
The 3-dimensional data largely lies on a 2-dimensional plane.
Both PCA and Factor Analysis aim to find that plane, using different
approaches.
(Photo of M31 in the Andromeda constellation, courtesy of Michael Caliguri.)