MV - Principal Components Using SAS
Principal Components Analysis
A. The Basic Principle
We wish to explain/summarize the underlying variance-
covariance structure of a large set of variables through a
few linear combinations of these variables. The
objectives of principal components analysis are
- data reduction
- interpretation
- regression analysis
- cluster analysis
B. Population Principal Components
Suppose we have a population measured on p random
variables X1,…,Xp. Note that these random variables
represent the p-axes of the Cartesian coordinate system in
which the population resides. Our goal is to develop a
new set of p axes (linear combinations of the original p
axes) in the directions of greatest variability:
[Figure: rotation of the original axes X1 and X2 into new axes lying in the directions of greatest variability]

Collect the original variables in the random vector

   X = (X1, X2, …, Xp)'

and consider linear combinations Yi = ai'X = ai1X1 + ai2X2 + … + aipXp. Then

   Var(Yi) = ai'Σai,  i = 1, …, p

   Cov(Yi, Yk) = ai'Σak,  i, k = 1, …, p
The principal components are those uncorrelated linear
combinations Y1,…,Yp whose variances are as large as
possible.
Thus the first principal component is the linear
combination of maximum variance, i.e., we wish to solve
the nonlinear optimization problem
   max  a1'Σa1
    a1

   s.t. a1'a1 = 1

(the quadratic forms are the source of the nonlinearity; the constraint restricts
attention to coefficient vectors of unit length)
The second principal component is the linear
combination of maximum variance that is uncorrelated
with the first principal component, i.e., we wish to solve
the nonlinear optimization problem
   max  a2'Σa2
    a2

   s.t. a2'a2 = 1

        a1'Σa2 = 0

(the second constraint restricts the covariance between Y1 and Y2 to zero)
The third principal component is the solution to the
nonlinear optimization problem
   max  a3'Σa3
    a3

   s.t. a3'a3 = 1

        a1'Σa3 = 0

        a2'Σa3 = 0

(the last two constraints restrict the covariances with Y1 and Y2 to zero)
Generally, the ith principal component is the linear
combination of maximum variance that is uncorrelated
with all previous principal components, i.e., we wish to
solve the nonlinear optimization problem
   max  ai'Σai
    ai

   s.t. ai'ai = 1

        ak'Σai = 0,  k < i
We can show that, for random vector X with covariance matrix Σ and
eigenvalue-eigenvector pairs (λ1, e1), …, (λp, ep) where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0,
the ith principal component is given by

   Yi = ei'X = ei1X1 + ei2X2 + … + eipXp,  i = 1, …, p

and the correlation between the ith principal component Yi and the kth original
variable Xk is

   ρ(Yi, Xk) = eik√λi / √σkk
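In practice these eigenvalue-eigenvector pairs are found numerically. The
following SAS/IML sketch (not the only way to do this) computes them for the
3×3 covariance matrix of the worked example that appears later in these notes,
and then applies the correlation formula above in a simple loop:

   proc iml;
      /* covariance matrix from the worked example later in these notes */
      Sigma = {2.000000000 3.333333333 1.333333333,
               3.333333333 8.000000000 4.666666667,
               1.333333333 4.666666667 7.000000000};

      /* eigenvalues (descending) and eigenvectors (columns of E) */
      call eigen(lambda, E, Sigma);
      print lambda E;

      /* rho(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(sigma_kk) */
      p   = nrow(Sigma);
      sd  = sqrt(vecdiag(Sigma));
      rho = j(p, p, .);
      do i = 1 to p;
         do k = 1 to p;
            rho[i, k] = E[k, i] * sqrt(lambda[i]) / sd[k];
         end;
      end;
      print rho;
   quit;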
Example: consider the following population of four observations on X1, X2, and
X3 (this example is analyzed in detail below):

X1 X2 X3
1.0 6.0 9.0
4.0 12.0 10.0
3.0 12.0 15.0
4.0 10.0 12.0
Note that

   λ2 / (λ1 + λ2 + λ3) = 2.5344988/12.75 = 0.198784220

   λ3 / (λ1 + λ2 + λ3) = 0.3009542/12.75 = 0.023604251

(the total population variance λ1 + λ2 + λ3 = 12.75 is the trace of the
population covariance matrix computed with divisor n = 4).
X1 X2 X3
Y1 0.6109350 0.3853264 0.3678510
Y2 0.4404973 0.1275510 -0.2342330
Y3 0.3152572 -0.0438293 0.0172243
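These population quantities can be obtained in SAS. The sketch below is one way
to do it, assuming the four observations are entered into a data set named pop
(a name chosen here for illustration); because the four observations are treated
as the entire population, VARDEF=N is used so that the covariance matrix should
be computed with divisor n rather than the SAS default n - 1:

   data pop;
      input x1 x2 x3;
      label x1 = 'Random Variable 1'
            x2 = 'Random Variable 2'
            x3 = 'Random Variable 3';
      datalines;
   1.0  6.0  9.0
   4.0 12.0 10.0
   3.0 12.0 15.0
   4.0 10.0 12.0
   ;
   run;

   /* principal components of the population covariance matrix (divisor n) */
   proc princomp data=pop cov vardef=n out=popscores;
      var x1 x2 x3;
   run;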
The eigenvalue-eigenvector pairs also have a geometric interpretation: the set
of points x satisfying

   (x - μ)'Σ⁻¹(x - μ) = c²

is an ellipsoid whose axes have half-lengths c√λi and lie in the directions of
the eigenvectors ei, i = 1, …, p. To see this, take μ = 0 for convenience:

   c² = x'Σ⁻¹x = (1/λ1)(e1'x)² + … + (1/λp)(ep'x)²

   c² = (1/λ1)y1² + … + (1/λp)yp²

which defines an ellipsoid (note that λi > 0 for all i) in a coordinate system
with axes y1,…,yp lying in the directions of e1,…,ep, respectively.
The major axis lies in the direction determined by the eigenvector e1 associated
with the largest eigenvalue λ1; the remaining minor axes lie in the directions
determined by the other eigenvectors.
Example: For the principal components derived from the
following population of four observations made on three
random variables X1, X2, and X3:
X1 X2 X3
1.0 6.0 9.0
4.0 12.0 10.0
3.0 12.0 15.0
4.0 10.0 12.0
   μ = (3.0, 10.0, 11.5)'
[Figure: the mean μ plotted as a point in the (X1, X2, X3) coordinate system]
…then use the first eigenvector to find a second point on
the first principal axis:
[Figures: the first principal axis Y1 drawn through the plotted point in the (X1, X2, X3) coordinate system, then the axis Y3 added]

The new principal axes Y1, Y2, and Y3 thus correspond to a rotation of the
original axes X1, X2, and X3 and a translation in p = 3 dimensions.

[Figure: all three principal axes Y1, Y2, Y3 shown together with the original axes X1, X2, X3]
Note that we can also construct principal components for
the standardized variables Zi:
   Zi = (Xi - μi)/√σii,  i = 1, …, p

which in matrix notation is

   Z = (V^1/2)⁻¹ (X - μ)

where V^1/2 is the diagonal standard deviation matrix. Obviously

   E(Z) = 0

   Cov(Z) = (V^1/2)⁻¹ Σ (V^1/2)⁻¹ = ρ

This suggests that the principal components for the standardized variables Zi
may be obtained from the eigenvectors of the correlation matrix ρ! The
operations are analogous to those used in conjunction with the covariance
matrix:

   Yi = ei'Z = ei'(V^1/2)⁻¹ (X - μ),  i = 1, …, p
Note again that the principal components are not unique
if some eigenvalues are equal.
We can also show for random vector Z with covariance matrix ρ and
eigenvalue-eigenvector pairs (λ1, e1), …, (λp, ep) where λ1 ≥ λ2 ≥ … ≥ λp,

   Var(Z1) + … + Var(Zp) = λ1 + … + λp = Var(Y1) + … + Var(Yp) = p
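A quick SAS/IML check of this result for the example correlation matrix (the
entries below are copied from the SAS output shown later in these notes):

   proc iml;
      rho = {1.0000 0.8333 0.3563,
             0.8333 1.0000 0.6236,
             0.3563 0.6236 1.0000};
      call eigen(lambda, E, rho);
      print lambda (lambda[+]);   /* the eigenvalues sum to the trace, p = 3 */
   quit;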
Example (continued): for the same four observations, now analyzed through the
correlation matrix:

X1 X2 X3
1.0 6.0 9.0
4.0 12.0 10.0
3.0 12.0 15.0
4.0 10.0 12.0
Note that

   λ2 / (λ1 + λ2 + λ3) = 0.6226418/3 = 0.207547267

   λ3 / (λ1 + λ2 + λ3) = 0.1624235/3 = 0.054141167

(for the correlation matrix the total variance is p = 3).
Z1 Z2 Z3
Y1 0.8697035 0.944420 0.7527427
Y2 -0.4299875 -0.122290 0.6502288
Y3 0.2423354 -0.305150 0.1028629
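Output of the form shown below can be produced by a principal components
analysis of the correlation matrix. A sketch of the corresponding calls, reusing
the hypothetical pop data set from the earlier sketch (PROC PRINCOMP analyzes
the correlation matrix by default; the OUT= data set of scores Prin1-Prin3 is
then passed to PROC CORR for comparison with the original variables):

   /* principal components of the correlation matrix (PROC PRINCOMP default) */
   proc princomp data=pop out=zscores;
      var x1 x2 x3;
   run;

   /* correlations between the component scores and the original variables */
   proc corr data=zscores;
      var prin1 prin2 prin3 x1 x2 x3;
   run;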
Simple Statistics
x1 x2 x3
Mean 3.000000000 10.00000000 11.50000000
StD 1.414213562 2.82842712 2.64575131
Correlation Matrix
x1 x2 x3
x1 Random Variable 1 1.0000 0.8333 0.3563
x2 Random Variable 2 0.8333 1.0000 0.6236
x3 Random Variable 3 0.3563 0.6236 1.0000
Eigenvectors
Prin1 Prin2 Prin3
x1 Random Variable 1 0.581128 -0.562643 0.587982
x2 Random Variable 2 0.645363 -0.121542 -0.754145
x3 Random Variable 3 0.495779 0.817717 0.292477
SAS output for Correlation Matrix – Original Random
Variables vs. Principal Components:
The CORR Procedure
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
Prin1 4 0 1.49314 0 -2.20299 1.11219
Prin2 4 0 0.81371 0 -0.94739 0.99579
Prin3 4 0 0.32928 0 -0.28331 0.47104
x1 4 3.00000 1.41421 12.00000 1.00000 4.00000
x2 4 10.00000 2.82843 40.00000 6.00000 12.00000
x3 4 11.50000 2.64575 46.00000 9.00000 15.00000
SAS output for Factor Analysis
The FACTOR Procedure
Initial Factor Method: Principal Components
Factor Pattern

                              Factor1
x1   Random Variable 1        0.86770
x2   Random Variable 2        0.96362
x3   Random Variable 3        0.74027

(Pearson correlation coefficients for the first principal component with the
three original variables X1, X2, and X3)

                              Factor1
                            2.2294570

(the first eigenvalue λ1)
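A sketch of a PROC FACTOR call that yields this kind of factor pattern (the
NFACTORS=1 option is an assumption; with these data the default retention
criterion would also keep a single factor, since only the first eigenvalue
exceeds 1):

   proc factor data=pop method=principal nfactors=1;
      var x1 x2 x3;
   run;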
SAS output for Principal Components Analysis – Covariance Matrix:
Simple Statistics
x1 x2 x3
Mean 3.000000000 10.00000000 11.50000000
StD 1.414213562 2.82842712 2.64575131
Covariance Matrix
x1 x2 x3
x1 Random Variable 1 2.000000000 3.333333333 1.333333333
x2 Random Variable 2 3.333333333 8.000000000 4.666666667
x3 Random Variable 3 1.333333333 4.666666667 7.000000000
Total Variance 17
Eigenvectors
Prin1 Prin2 Prin3
x1 Random Variable 1 0.291038 0.415039 0.861998
x2 Random Variable 2 0.734249 0.480716 -.479364
x3 Random Variable 3 0.613331 -.772434 0.164835
SAS output for Correlation Matrix – Original Random
Variables vs. Principal Components:
The CORR Procedure
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
Prin1 4 0 3.63585 0 -5.05240 3.61516
Prin2 4 0 1.83830 0 -1.74209 2.53512
Prin3 4 0 0.63346 0 -0.38181 0.94442
x1 4 3.00000 1.41421 12.00000 1.00000 4.00000
x2 4 10.00000 2.82843 40.00000 6.00000 12.00000
x3 4 11.50000 2.64575 46.00000 9.00000 15.00000
SAS output for Factor Analysis
The FACTOR Procedure
Initial Factor Method: Principal Components
Final Communality Estimates and Variable Weights
Total Communality: Weighted = 13.219396 (the first eigenvalue λ1)  Unweighted = 2.161121

       Communality      Weight
x1     0.55986257       2.00000000
x2     0.89085847       8.00000000
x3     0.71040045       7.00000000
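A sketch of the analogous PROC FACTOR call for the covariance matrix, again on
the hypothetical pop data set (the COV option requests factoring of the
covariance matrix; note that the variable weights in the output above match the
variances 2, 8, and 7):

   proc factor data=pop method=principal cov nfactors=1;
      var x1 x2 x3;
   run;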
Covariance matrices with special structures yield particularly interesting
principal components (a quick SAS/IML check of both cases appears after this
list):
- Diagonal covariance matrices – suppose Σ is the diagonal matrix with σ11,
  σ22, …, σpp on the diagonal and zeros in every off-diagonal position. Since
  the eigenvector ei has a value of 1 in the ith position and 0 in all other
  positions, we have

     Σei = σii ei

  so (σii, ei) is the ith eigenvalue-eigenvector pair. Thus the linear
  combination

     Yi = ei'X = Xi

  demonstrates that the set of principal components and the original set of
  (uncorrelated) random variables are the same!
Note that this result is also true if we work with the
correlation matrix.
- Constant variances and covariances – suppose Σ is the patterned
  (equicorrelation) matrix with common variance σ² in every diagonal position
  and common covariance ρσ² in every off-diagonal position, i.e., Σ = σ²ρ,
  where the correlation matrix ρ has 1 on the diagonal and ρ in every
  off-diagonal position. It can be shown that the largest eigenvalue of ρ is
  λ1 = 1 + (p - 1)ρ, with eigenvector e1 = (1/√p, 1/√p, …, 1/√p)', and that the
  remaining p - 1 eigenvalues all equal 1 - ρ. So when ρ > 0, the first
  principal component is proportional to the average of the (standardized)
  variables and accounts for the proportion [1 + (p - 1)ρ]/p of the total
  variance.
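A quick SAS/IML check of both special cases, using small matrices chosen here
only for illustration:

   proc iml;
      /* diagonal covariance matrix: eigenvectors are coordinate unit vectors */
      Sigma = diag({4 9 25});
      call eigen(lambda, E, Sigma);
      print lambda E;     /* eigenvalues are the variances (sorted), columns of E are unit vectors */

      /* equicorrelation matrix with p = 3 and rho = 0.5 */
      rho = {1.0 0.5 0.5,
             0.5 1.0 0.5,
             0.5 0.5 1.0};
      call eigen(lambda2, E2, rho);
      print lambda2 E2;   /* lambda1 = 1 + (p-1)*rho = 2; remaining eigenvalues = 1 - rho = 0.5 */
   quit;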
C. Sample Principal Components

In practice the population covariance matrix Σ is unknown, so we work with the
sample covariance matrix S computed from n observations on X1,…,Xp and consider
linear combinations ai'x with sample variances and covariances

   Var(ai'x) = ai'Sai,  i = 1, …, p

   Cov(ai'x, ak'x) = ai'Sak,  i, k = 1, …, p

The sample principal components are those uncorrelated linear combinations
ŷ1,…,ŷp whose variances are as large as possible.
Thus the first principal component is the linear
combination of maximum sample variance, i.e., we wish
to solve the nonlinear optimization problem
   max  a1'Sa1
    a1

   s.t. a1'a1 = 1

(the quadratic forms are the source of the nonlinearity; the constraint restricts
attention to coefficient vectors of unit length)
The second principal component is the linear
combination of maximum sample variance that is
uncorrelated with the first principal component, i.e., we
wish to solve the nonlinear optimization problem
   max  a2'Sa2
    a2

   s.t. a2'a2 = 1

        a1'Sa2 = 0

(the second constraint restricts the sample covariance between ŷ1 and ŷ2 to zero)
The third principal component is the solution to the
nonlinear optimization problem
   max  a3'Sa3
    a3

   s.t. a3'a3 = 1

        a1'Sa3 = 0

        a2'Sa3 = 0

(the last two constraints restrict the sample covariances with ŷ1 and ŷ2 to zero)
Generally, the ith principal component is the linear
combination of maximum sample variance that is
uncorrelated with all previous principal components, i.e.,
we wish to solve the nonlinear optimization problem
   max  ai'Sai
    ai

   s.t. ai'ai = 1

        ak'Sai = 0,  k < i
We can show that, for a random sample with sample covariance matrix S and
eigenvalue-eigenvector pairs (λ̂1, ê1), …, (λ̂p, êp) where λ̂1 ≥ λ̂2 ≥ … ≥ λ̂p ≥ 0,
the ith sample principal component is given by

   ŷi = êi'x = êi1x1 + êi2x2 + … + êipxp,  i = 1, …, p
Note that the principal components are not unique if
some eigenvalues are equal.
We can also show for a random sample with sample covariance matrix S and
eigenvalue-eigenvector pairs (λ̂1, ê1), …, (λ̂p, êp) where λ̂1 ≥ λ̂2 ≥ … ≥ λ̂p,

   s11 + … + spp = λ̂1 + … + λ̂p = Var(ŷ1) + … + Var(ŷp)
The sample correlation between the ith sample principal component ŷi and the
kth original variable xk is

   r(ŷi, xk) = êik√λ̂i / √skk

Note also that each sample principal component has sample mean zero when the
scores are computed from the centered observations:

   (1/n) Σ_{j=1}^n êi'(xj - x̄) = êi' [(1/n) Σ_{j=1}^n (xj - x̄)] = êi'·0 = 0
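This mean-zero property is visible in the PROC CORR simple statistics shown in
the SAS output in these notes (the means of Prin1-Prin3 are 0); it can also be
checked directly by running PROC MEANS on the score data set produced by PROC
PRINCOMP. A sketch, reusing the hypothetical pop data set from earlier:

   proc princomp data=pop cov out=scores noprint;
      var x1 x2 x3;
   run;

   /* each principal component score variable has sample mean 0 */
   proc means data=scores mean std;
      var prin1 prin2 prin3;
   run;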
Example: Suppose we have the following sample of four
observations made on three random variables X1, X2, and
X3:
X1 X2 X3
1.0 6.0 9.0
4.0 12.0 10.0
3.0 12.0 15.0
4.0 10.0 12.0
Note that

   λ̂2 / (λ̂1 + λ̂2 + λ̂3) = 3.37916/17.0 = 0.198774404

   λ̂3 / (λ̂1 + λ̂2 + λ̂3) = 0.40140/17.0 = 0.023611782
X1 X2 X3
Y1 0.529016 0.333704 0.318576
Y2 0.381553 0.110454 -0.202839
Y3 0.273055 -0.037965 0.014927
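The SAS output below corresponds to a covariance-matrix principal components
analysis of this sample. A sketch of the call, reusing the hypothetical pop
data set (here the SAS default divisor n - 1 is appropriate, since the four
observations are now treated as a sample):

   /* principal components of the sample covariance matrix (default divisor n - 1) */
   proc princomp data=pop cov out=scores;
      var x1 x2 x3;
   run;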
Simple Statistics
x1 x2 x3
Mean 3.000000000 10.00000000 11.50000000
StD 1.414213562 2.82842712 2.64575131
Covariance Matrix
x1 x2 x3
x1 Random Variable 1 2.000000000 3.333333333 1.333333333
x2 Random Variable 2 3.333333333 8.000000000 4.666666667
x3 Random Variable 3 1.333333333 4.666666667 7.000000000
Total Variance 17
Eigenvectors
Prin1 Prin2 Prin3
x1 Random Variable 1 0.291038 0.415039 0.861998
x2 Random Variable 2 0.734249 0.480716 -0.479364
x3 Random Variable 3 0.613331 -0.772434 0.164835