ROUGH DRAFT - USE AT OWN RISK: suggestions kbaker@ling.osu.edu
7.1 Example of Full Singular Value Decomposition
SVD is based on a theorem from linear algebra which says that a rectangular matrix $A$ can
be broken down into the product of three matrices: an orthogonal matrix $U$, a diagonal
matrix $S$, and the transpose of an orthogonal matrix $V$. The theorem is usually presented
something like this:
\[
A_{mn} = U_{mm} S_{mn} V_{nn}^T
\]
where $U^T U = I$ and $V^T V = I$; the columns of $U$ are orthonormal eigenvectors of $AA^T$, the
columns of $V$ are orthonormal eigenvectors of $A^T A$, and $S$ is a diagonal matrix containing
the square roots of the eigenvalues of $AA^T$ or $A^T A$ in descending order.
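One way to see where the eigenvectors of $AA^T$ and $A^T A$ come from is to substitute the
factorization into those two products. The derivation below is a sketch that uses only the
properties stated above ($U^T U = I$ and $V^T V = I$):

```latex
\begin{aligned}
AA^T   &= (U S V^T)(U S V^T)^T = U S V^T V S^T U^T = U (S S^T) U^T, \\
A^T A  &= (U S V^T)^T (U S V^T) = V S^T U^T U S V^T = V (S^T S) V^T .
\end{aligned}
```

Because $S S^T$ and $S^T S$ are diagonal with the squared diagonal entries of $S$ on their
diagonals, the columns of $U$ and $V$ are eigenvectors of $AA^T$ and $A^T A$, and the entries of
$S$ are the square roots of the corresponding eigenvalues.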
The following example merely applies this definition to a small matrix in order to compute
its SVD. In the next section, I attempt to interpret the application of SVD to document
classification.
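Start with the matrix
\[
A = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}
\]
In order to find $U$, we have to start with $AA^T$. The transpose of $A$ is
\[
A^T = \begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix}
\]
so
\[
AA^T = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}
\begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} 11 & 1 \\ 1 & 11 \end{bmatrix}
\]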
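The product above is easy to check by machine. The following is a minimal pure-Python
sketch; the helper names `transpose` and `matmul` are illustrative choices, not something
the text defines. It also computes $A^T A$, which is the matrix the definition associates
with $V$.

```python
# Pure-Python sketch; transpose() and matmul() are illustrative helpers.

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

A = [[3, 1, 1],
     [-1, 3, 1]]

AAT = matmul(A, transpose(A))   # 2 x 2 matrix used to find U
ATA = matmul(transpose(A), A)   # 3 x 3 matrix used to find V

print(AAT)  # [[11, 1], [1, 11]]
print(ATA)  # [[10, 0, 2], [0, 10, 4], [2, 4, 2]]
```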
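Next, we have to find the eigenvalues and corresponding eigenvectors of $AA^T$. We know that
eigenvectors are defined by the equation $A\vec{v} = \lambda\vec{v}$, and applying this to $AA^T$ gives us
\[
\begin{bmatrix} 11 & 1 \\ 1 & 11 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \lambda \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\]
We rewrite this as the set of equations
\[
\begin{aligned}
11x_1 + x_2 &= \lambda x_1 \\
x_1 + 11x_2 &= \lambda x_2
\end{aligned}
\]
and rearrange to get
\[
\begin{aligned}
(11 - \lambda)x_1 + x_2 &= 0 \\
x_1 + (11 - \lambda)x_2 &= 0
\end{aligned}
\]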
Solve for $\lambda$ by setting the determinant of the coefficient matrix to zero,
\[
\begin{vmatrix} (11 - \lambda) & 1 \\ 1 & (11 - \lambda) \end{vmatrix} = 0
\]
which works out as
\[
\begin{aligned}
(11 - \lambda)(11 - \lambda) - 1 \cdot 1 &= 0 \\
(\lambda - 10)(\lambda - 12) &= 0 \\
\lambda = 10, \quad \lambda &= 12
\end{aligned}
\]
to give us our two eigenvalues $\lambda = 10$, $\lambda = 12$. Plugging $\lambda$ back in to the original equations
gives us our eigenvectors. For $\lambda = 10$ we get
\[
\begin{aligned}
(11 - 10)x_1 + x_2 &= 0 \\
x_1 &= -x_2
\end{aligned}
\]
which is true for lots of values, so we'll pick $x_1 = 1$ and $x_2 = -1$ since those are small and
easier to work with. Thus, we have the eigenvector $[1, -1]$ corresponding to the eigenvalue
$\lambda = 10$. For $\lambda = 12$ we have
\[
\begin{aligned}
(11 - 12)x_1 + x_2 &= 0 \\
x_1 &= x_2
\end{aligned}
\]
and for the same reason as before we'll take $x_1 = 1$ and $x_2 = 1$. Now, for $\lambda = 12$ we have the
eigenvector $[1, 1]$. These eigenvectors become column vectors in a matrix ordered by the size
of the corresponding eigenvalue. In other words, the eigenvector of the largest eigenvalue
is column one, the eigenvector of the next largest eigenvalue is column two, and so forth
and so on until we have the eigenvector of the smallest eigenvalue as the last column of our
matrix. In the matrix below, the eigenvector for $\lambda = 12$ is column one, and the eigenvector
for $\lambda = 10$ is column two.
\[
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\]
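The eigenvalue calculation can be reproduced with a few lines of Python. This is a sketch
under the assumption that the matrix is 2 x 2 with real eigenvalues; `eigenvalues_2x2` and
`matvec` are hypothetical helper names introduced only for this illustration.

```python
import math

def eigenvalues_2x2(m):
    """Roots of lambda^2 - (a + d) * lambda + (a * d - b * c) = 0."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = math.sqrt(trace * trace - 4 * det)  # assumes real eigenvalues
    return (trace + disc) / 2, (trace - disc) / 2  # largest first

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

AAT = [[11, 1],
       [1, 11]]

lam1, lam2 = eigenvalues_2x2(AAT)
print(lam1, lam2)  # 12.0 10.0

# Check the eigenvectors chosen in the text: M v should equal lambda * v.
print(matvec(AAT, [1, 1]))   # [12, 12], which is 12 * [1, 1]
print(matvec(AAT, [1, -1]))  # [10, -10], which is 10 * [1, -1]
```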
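Finally, we have to convert this matrix into an orthogonal matrix which we do by applying
the Gram-Schmidt orthonormalization process to the column vectors. Begin by normalizing
$\vec{v_1}$.
\[
\vec{u_1} = \frac{\vec{v_1}}{|\vec{v_1}|} = \frac{[1, 1]}{\sqrt{1^2 + 1^2}} = \frac{[1, 1]}{\sqrt{2}}
= \left[\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right]
\]
Compute
\[
\begin{aligned}
\vec{w_2} &= \vec{v_2} - (\vec{u_1} \cdot \vec{v_2})\,\vec{u_1} \\
&= [1, -1] - \left(\left[\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right] \cdot [1, -1]\right)
\left[\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right] \\
&= [1, -1] - 0 \cdot \left[\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right] = [1, -1] - [0, 0] = [1, -1]
\end{aligned}
\]
and normalize
\[
\vec{u_2} = \frac{\vec{w_2}}{|\vec{w_2}|} = \left[\frac{1}{\sqrt{2}}, \frac{-1}{\sqrt{2}}\right]
\]
to give
\[
U = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \end{bmatrix}
\]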
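The Gram-Schmidt steps above can be written as a short routine. The sketch below assumes
the input vectors are linearly independent and follows the same formula as the text
(subtract the projection onto each earlier $\vec{u_i}$, then normalize); the function
names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors, in order."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coeff = dot(u, v)  # projection coefficient u . v
            w = [wi - coeff * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis

# Eigenvectors of A A^T, ordered by decreasing eigenvalue (12, then 10).
u1, u2 = gram_schmidt([[1, 1], [1, -1]])

# The orthonormal vectors become the columns of U.
U = [[u1[0], u2[0]],
     [u1[1], u2[1]]]
print(U)
```

Each entry of `U` has magnitude $1/\sqrt{2} \approx 0.7071$, with a negative sign only in
the lower right corner.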
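The calculation of $V$ is similar. $V$ is based on $A^T A$, so we have
\[
A^T A = \begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}
= \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}
\]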
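Find the eigenvalues of $A^T A$ by
\[
\begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \lambda \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
\]
which represents the system of equations
\[
\begin{aligned}
10x_1 + 2x_3 &= \lambda x_1 \\
10x_2 + 4x_3 &= \lambda x_2 \\
2x_1 + 4x_2 + 2x_3 &= \lambda x_3
\end{aligned}
\]
which we rewrite as
\[
\begin{aligned}
(10 - \lambda)x_1 + 2x_3 &= 0 \\
(10 - \lambda)x_2 + 4x_3 &= 0 \\
2x_1 + 4x_2 + (2 - \lambda)x_3 &= 0
\end{aligned}
\]
which are solved by setting
\[
\begin{vmatrix} (10 - \lambda) & 0 & 2 \\ 0 & (10 - \lambda) & 4 \\ 2 & 4 & (2 - \lambda) \end{vmatrix} = 0
\]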
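This works out as
\[
\begin{aligned}
(10 - \lambda) \begin{vmatrix} (10 - \lambda) & 4 \\ 4 & (2 - \lambda) \end{vmatrix}
+ 2 \begin{vmatrix} 0 & (10 - \lambda) \\ 2 & 4 \end{vmatrix} &= \\
(10 - \lambda)[(10 - \lambda)(2 - \lambda) - 16] + 2[0 - (20 - 2\lambda)] &= \\
-\lambda(\lambda - 10)(\lambda - 12) &= 0,
\end{aligned}
\]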
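so $\lambda = 0$, $\lambda = 10$, $\lambda = 12$ are the eigenvalues for $A^T A$. Substituting $\lambda$ back into the original
equations to find corresponding eigenvectors yields for $\lambda = 12$
\[
\begin{aligned}
(10 - 12)x_1 + 2x_3 = -2x_1 + 2x_3 &= 0 \\
x_1 = 1, \quad x_3 &= 1 \\
(10 - 12)x_2 + 4x_3 = -2x_2 + 4x_3 &= 0 \\
x_2 &= 2x_3 \\
x_2 &= 2
\end{aligned}
\]
So for $\lambda = 12$, $\vec{v_1} = [1, 2, 1]$. For $\lambda = 10$ we have
\[
\begin{aligned}
(10 - 10)x_1 + 2x_3 = 2x_3 &= 0 \\
x_3 &= 0 \\
2x_1 + 4x_2 &= 0 \\
x_1 &= -2x_2 \\
x_1 = 2, \quad x_2 &= -1
\end{aligned}
\]
which means for $\lambda = 10$, $\vec{v_2} = [2, -1, 0]$. For $\lambda = 0$ we have
\[
\begin{aligned}
10x_1 + 2x_3 &= 0 \\
x_3 &= -5 \\
10x_2 + 4x_3 = 10x_2 - 20 &= 0 \\
x_2 &= 2 \\
2x_1 + 4x_2 + 2x_3 = 2x_1 + 8 - 10 &= 0 \\
x_1 &= 1
\end{aligned}
\]
which means for $\lambda = 0$, $\vec{v_3} = [1, 2, -5]$. Order $\vec{v_1}$, $\vec{v_2}$, and $\vec{v_3}$ as column vectors in a matrix
according to the size of the eigenvalue to get
\[
\begin{bmatrix} 1 & 2 & 1 \\ 2 & -1 & 2 \\ 1 & 0 & -5 \end{bmatrix}
\]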
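and use the Gram-Schmidt orthonormalization process to convert that to an orthonormal
matrix.
\[
\begin{aligned}
\vec{u_1} &= \frac{\vec{v_1}}{|\vec{v_1}|} = \left[\frac{1}{\sqrt{6}}, \frac{2}{\sqrt{6}}, \frac{1}{\sqrt{6}}\right] \\
\vec{w_2} &= \vec{v_2} - (\vec{u_1} \cdot \vec{v_2})\,\vec{u_1} = [2, -1, 0] \\
\vec{u_2} &= \frac{\vec{w_2}}{|\vec{w_2}|} = \left[\frac{2}{\sqrt{5}}, \frac{-1}{\sqrt{5}}, 0\right] \\
\vec{w_3} &= \vec{v_3} - (\vec{u_1} \cdot \vec{v_3})\,\vec{u_1} - (\vec{u_2} \cdot \vec{v_3})\,\vec{u_2} = [1, 2, -5] \\
\vec{u_3} &= \frac{\vec{w_3}}{|\vec{w_3}|} = \left[\frac{1}{\sqrt{30}}, \frac{2}{\sqrt{30}}, \frac{-5}{\sqrt{30}}\right]
\end{aligned}
\]
Both dot products in the expression for $\vec{w_3}$ are zero, so $\vec{w_3} = \vec{v_3}$. All this to give us
\[
V = \begin{bmatrix}
\frac{1}{\sqrt{6}} & \frac{2}{\sqrt{5}} & \frac{1}{\sqrt{30}} \\
\frac{2}{\sqrt{6}} & \frac{-1}{\sqrt{5}} & \frac{2}{\sqrt{30}} \\
\frac{1}{\sqrt{6}} & 0 & \frac{-5}{\sqrt{30}}
\end{bmatrix}
\]
when we really want its transpose
\[
V^T = \begin{bmatrix}
\frac{1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{6}} \\
\frac{2}{\sqrt{5}} & \frac{-1}{\sqrt{5}} & 0 \\
\frac{1}{\sqrt{30}} & \frac{2}{\sqrt{30}} & \frac{-5}{\sqrt{30}}
\end{bmatrix}
\]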
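The eigenpairs of $A^T A$ and the resulting $V^T$ can be checked numerically. The sketch
below is pure Python with illustrative helper names; it confirms that each vector satisfies
$(A^T A)\vec{v} = \lambda\vec{v}$ and that the three eigenvectors are already mutually
orthogonal, so in this example Gram-Schmidt only has to normalize them.

```python
import math

ATA = [[10, 0, 2],
       [0, 10, 4],
       [2, 4, 2]]

# Eigenpairs derived in the text: (eigenvalue, unnormalized eigenvector).
pairs = [(12, [1, 2, 1]), (10, [2, -1, 0]), (0, [1, 2, -5])]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

for lam, v in pairs:
    # (A^T A) v should equal lam * v for a true eigenpair.
    print(lam, matvec(ATA, v) == [lam * x for x in v])

vs = [v for _, v in pairs]
# Pairwise dot products of the eigenvectors.
print(dot(vs[0], vs[1]), dot(vs[0], vs[2]), dot(vs[1], vs[2]))  # 0 0 0

# Rows of V^T are the normalized eigenvectors.
VT = [[x / math.sqrt(dot(v, v)) for x in v] for v in vs]
for row in VT:
    print([round(x, 4) for x in row])
```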
For $S$ we take the square roots of the non-zero eigenvalues and populate the diagonal with
them, putting the largest in $s_{11}$, the next largest in $s_{22}$ and so on until the smallest value
ends up in $s_{mm}$. The non-zero eigenvalues of $AA^T$ and $A^T A$ are always the same, so that's why
it doesn't matter which one we take them from. Because we are doing full SVD, instead of
reduced SVD (next section), we have to add a zero column vector to $S$ so that it is of the
proper dimensions to allow multiplication between $U$ and $V^T$. The diagonal entries in $S$ are
the singular values of $A$, the columns in $U$ are called left singular vectors, and the columns
in $V$ are called right singular vectors.
\[
S = \begin{bmatrix} \sqrt{12} & 0 & 0 \\ 0 & \sqrt{10} & 0 \end{bmatrix}
\]
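A short argument, added here as a sketch, shows why the non-zero eigenvalues of the two
products coincide. Suppose $\vec{v}$ is an eigenvector of $A^T A$ with eigenvalue
$\lambda \neq 0$. Multiplying both sides by $A$ gives

```latex
A^T A \vec{v} = \lambda \vec{v}
\quad\Longrightarrow\quad
AA^T (A\vec{v}) = A (A^T A \vec{v}) = \lambda (A\vec{v}).
```

Here $A\vec{v} \neq 0$, because $A\vec{v} = 0$ would force $\lambda\vec{v} = A^T A\vec{v} = 0$. So
$A\vec{v}$ is an eigenvector of $AA^T$ with the same eigenvalue, and the same argument with
the roles of $A$ and $A^T$ swapped gives the converse. In the example, both products have
the eigenvalues 12 and 10, and the larger matrix $A^T A$ has the additional eigenvalue 0.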
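Now we have all the pieces of the puzzle
\[
\begin{aligned}
A_{mn} = U_{mm} S_{mn} V_{nn}^T
&= \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{2}} \end{bmatrix}
\begin{bmatrix} \sqrt{12} & 0 & 0 \\ 0 & \sqrt{10} & 0 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{6}} \\
\frac{2}{\sqrt{5}} & \frac{-1}{\sqrt{5}} & 0 \\
\frac{1}{\sqrt{30}} & \frac{2}{\sqrt{30}} & \frac{-5}{\sqrt{30}}
\end{bmatrix} \\
&= \begin{bmatrix} \frac{\sqrt{12}}{\sqrt{2}} & \frac{\sqrt{10}}{\sqrt{2}} & 0 \\ \frac{\sqrt{12}}{\sqrt{2}} & \frac{-\sqrt{10}}{\sqrt{2}} & 0 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{6}} \\
\frac{2}{\sqrt{5}} & \frac{-1}{\sqrt{5}} & 0 \\
\frac{1}{\sqrt{30}} & \frac{2}{\sqrt{30}} & \frac{-5}{\sqrt{30}}
\end{bmatrix}
= \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}
\end{aligned}
\]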
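As a final check, the three factors can be multiplied numerically. The sketch below is pure
Python with an illustrative `matmul` helper; it rebuilds $A$ from $U$, $S$, and $V^T$ and
rounds away floating-point noise.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

r2, r5, r6, r30 = (math.sqrt(n) for n in (2, 5, 6, 30))

U = [[1 / r2, 1 / r2],
     [1 / r2, -1 / r2]]
S = [[math.sqrt(12), 0, 0],
     [0, math.sqrt(10), 0]]      # zero column added for the full SVD
VT = [[1 / r6, 2 / r6, 1 / r6],
      [2 / r5, -1 / r5, 0],
      [1 / r30, 2 / r30, -5 / r30]]

A = matmul(matmul(U, S), VT)
# Up to floating-point rounding this is [[3, 1, 1], [-1, 3, 1]].
print([[round(x, 10) for x in row] for row in A])
```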
7.2 Example of Reduced Singular Value Decomposition
Reduced singular value decomposition is the mathematical technique underlying a type of
document retrieval and word similarity method variously called Latent Semantic Indexing
or Latent Semantic Analysis. The insight underlying the use of SVD for these tasks is
that it takes the original data, usually consisting of some variant of a word × document
matrix, and breaks it down into linearly independent components. These components are
in some sense an abstraction away from the noisy correlations found in the original data
to sets of values that best approximate the underlying structure of the dataset along each
dimension independently. Because the majority of those components are very small, they
can be ignored, resulting in an approximation of the data that contains substantially fewer
dimensions than the original. SVD has the added benefit that in the process of dimensionality
reduction, the representations of items that share substructure become more similar to each
other, and items that were dissimilar to begin with may become more dissimilar as well. In
practical terms, this means that documents about a particular topic become more similar
even if the exact same words don't appear in all of them.
As we've already seen, SVD starts with a matrix, so we'll take the following word ×
document matrix as the starting point of the next example.
\[
A = \begin{bmatrix}
2 & 0 & 8 & 6 & 0 \\
1 & 6 & 0 & 1 & 7 \\
5 & 0 & 7 & 4 & 0 \\
7 & 0 & 8 & 5 & 0 \\
0 & 10 & 0 & 0 & 7
\end{bmatrix}
\]
Remember that to compute the SVD of a matrix $A$ we want the product of three matrices
such that
\[
A = U S V^T
\]
where $U$ and $V$ are orthogonal (their columns are orthonormal) and $S$ is diagonal. The column
vectors of $U$ are taken from the orthonormal eigenvectors of $AA^T$, and ordered left to right
from largest corresponding eigenvalue to the least. Notice that
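\[
AA^T =
\begin{bmatrix}
2 & 0 & 8 & 6 & 0 \\
1 & 6 & 0 & 1 & 7 \\
5 & 0 & 7 & 4 & 0 \\
7 & 0 & 8 & 5 & 0 \\
0 & 10 & 0 & 0 & 7
\end{bmatrix}
\begin{bmatrix}
2 & 1 & 5 & 7 & 0 \\
0 & 6 & 0 & 0 & 10 \\
8 & 0 & 7 & 8 & 0 \\
6 & 1 & 4 & 5 & 0 \\
0 & 7 & 0 & 0 & 7
\end{bmatrix}
=
\begin{bmatrix}
104 & 8 & 90 & 108 & 0 \\
8 & 87 & 9 & 12 & 109 \\
90 & 9 & 90 & 111 & 0 \\
108 & 12 & 111 & 138 & 0 \\
0 & 109 & 0 & 0 & 149
\end{bmatrix}
\]
is a matrix whose values are the dot products of all pairs of term vectors, so it is a kind of
dispersion matrix of terms throughout all the documents. The eigenvalues of $AA^T$ (the squares
of the singular values of $A$) are
\[
\lambda = 321.07, \quad \lambda = 230.17, \quad \lambda = 12.70, \quad \lambda = 3.94, \quad \lambda = 0.12
\]
which are used to compute and order the corresponding orthonormal singular vectors of $U$.
\[
U = \begin{bmatrix}
-0.54 & 0.07 & 0.82 & -0.11 & 0.12 \\
-0.10 & -0.59 & -0.11 & -0.79 & -0.06 \\
-0.53 & 0.06 & -0.21 & 0.12 & -0.81 \\
-0.65 & 0.07 & -0.51 & 0.06 & 0.56 \\
-0.06 & -0.80 & 0.09 & 0.59 & 0.04
\end{bmatrix}
\]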
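The text does not say how these eigenvalues were computed. One simple way to approximate
the largest one, together with the first column of $U$, is power iteration; the sketch below
is an illustration of that technique in pure Python, not the method used to produce the
numbers above. The result should be close to the 321.07 reported in the text, and its
square root close to the first singular value.

```python
import math

AAT = [[104, 8, 90, 108, 0],
       [8, 87, 9, 12, 109],
       [90, 9, 90, 111, 0],
       [108, 12, 111, 138, 0],
       [0, 109, 0, 0, 149]]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def power_iteration(m, steps=500):
    """Approximate the dominant eigenpair of a symmetric matrix."""
    v = [1.0] * len(m)
    for _ in range(steps):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v (v has unit length).
    lam = sum(x * y for x, y in zip(v, matvec(m, v)))
    return lam, v

lam, u1 = power_iteration(AAT)
print(round(lam, 2), round(math.sqrt(lam), 2))
print([round(x, 2) for x in u1])
```

Starting from an all-positive vector, the iteration returns the all-positive version of
the first column of $U$. An eigenvector is only determined up to sign, so this agrees with
the all-negative column shown above.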
This essentially gives a matrix in which words are represented as row vectors containing
linearly independent components. Some word co-occurrence patterns in these documents are
indicated by the signs of the coefficients in $U$. For example, the signs in the first column
vector are all negative, indicating the general co-occurrence of words and documents. There
are two groups visible in the second column vector of $U$: car and wheel have negative
coefficients, while doctor, nurse, and hospital are all positive, indicating a grouping in which
wheel only co-occurs with car. The third dimension indicates a grouping in which car, nurse,
and hospital occur only with each other. The fourth dimension points out a pattern in
which nurse and hospital occur in the absence of wheel, and the fifth dimension indicates a
grouping in which doctor and hospital occur in the absence of wheel.
Computing $V^T$ is similar. Its rows are the orthonormal eigenvectors of $A^T A$, arranged top
to bottom from largest corresponding eigenvalue to the least. We have
\[
A^T A = \begin{bmatrix}
79 & 6 & 107 & 68 & 7 \\
6 & 136 & 0 & 6 & 112 \\
107 & 0 & 177 & 116 & 0 \\
68 & 6 & 116 & 78 & 7 \\
7 & 112 & 0 & 7 & 98
\end{bmatrix}
\]
which contains the dot products of all pairs of document vectors. Applying the Gram-Schmidt
orthonormalization process to its eigenvectors gives the columns of $V$, and taking the
transpose yields
\[
V^T = \begin{bmatrix}
-0.46 & -0.07 & -0.74 & -0.48 & -0.07 \\
0.02 & -0.76 & 0.10 & 0.03 & -0.64 \\
-0.87 & 0.06 & 0.28 & 0.40 & -0.04 \\
-0.00 & 0.60 & 0.22 & -0.33 & -0.69 \\
0.17 & 0.23 & -0.56 & 0.70 & -0.32
\end{bmatrix}
\]
$S$ contains the singular values, which are the square roots of the eigenvalues above, ordered
from greatest to least along its diagonal. These values indicate the variance of the linearly
independent components along each dimension. In order to illustrate the effect of
dimensionality reduction on this data set, we'll restrict $S$ to the first three singular values
to get
\[
S = \begin{bmatrix} 17.92 & 0 & 0 \\ 0 & 15.17 & 0 \\ 0 & 0 & 3.56 \end{bmatrix}
\]
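In order for the matrix multiplication to go through, we have to eliminate the corresponding
column vectors of $U$ and corresponding row vectors of $V^T$ to give us an approximation of $A$
using 3 dimensions instead of the original 5. The result looks like this.
\[
\hat{A} =
\begin{bmatrix}
-0.54 & 0.07 & 0.82 \\
-0.10 & -0.59 & -0.11 \\
-0.53 & 0.06 & -0.21 \\
-0.65 & 0.07 & -0.51 \\
-0.06 & -0.80 & 0.09
\end{bmatrix}
\begin{bmatrix} 17.92 & 0 & 0 \\ 0 & 15.17 & 0 \\ 0 & 0 & 3.56 \end{bmatrix}
\begin{bmatrix}
-0.46 & -0.07 & -0.74 & -0.48 & -0.07 \\
0.02 & -0.76 & 0.10 & 0.03 & -0.64 \\
-0.87 & 0.06 & 0.28 & 0.40 & -0.04
\end{bmatrix}
\approx
\begin{bmatrix}
1.99 & 0.12 & 8.07 & 5.90 & -0.14 \\
1.00 & 6.95 & 0.33 & 0.50 & 5.91 \\
5.05 & -0.08 & 6.79 & 4.28 & 0.07 \\
6.97 & -0.12 & 8.08 & 4.90 & 0.14 \\
0.00 & 9.29 & -0.25 & 0.38 & 7.81
\end{bmatrix}
\]
Every entry of $\hat{A}$ lies within about 1.1 of the corresponding entry of $A$, because the two
discarded singular values (1.98 and 0.35) are small.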
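The three-dimensional approximation can be multiplied out directly. The sketch below is
pure Python and makes two assumptions that should be read with care: the signs of the
entries are written out explicitly, and the rows of `V3T` are the three leading right
singular vectors of $A$ (the eigenvectors of $A^T A$ for the three largest eigenvalues).
Because every factor is rounded to two decimals, the product only matches the exact
rank-3 approximation to roughly one decimal place.

```python
# First three columns of U (rows are words), rounded to two decimals.
U3 = [[-0.54, 0.07, 0.82],
      [-0.10, -0.59, -0.11],
      [-0.53, 0.06, -0.21],
      [-0.65, 0.07, -0.51],
      [-0.06, -0.80, 0.09]]
S3 = [17.92, 15.17, 3.56]  # diagonal of the truncated S
# Three leading right singular vectors as rows (columns are documents).
V3T = [[-0.46, -0.07, -0.74, -0.48, -0.07],
       [0.02, -0.76, 0.10, 0.03, -0.64],
       [-0.87, 0.06, 0.28, 0.40, -0.04]]

A = [[2, 0, 8, 6, 0],
     [1, 6, 0, 1, 7],
     [5, 0, 7, 4, 0],
     [7, 0, 8, 5, 0],
     [0, 10, 0, 0, 7]]

# A_hat[i][j] = sum over k of U3[i][k] * S3[k] * V3T[k][j]
A_hat = [[sum(U3[i][k] * S3[k] * V3T[k][j] for k in range(3))
          for j in range(5)] for i in range(5)]

max_err = max(abs(A_hat[i][j] - A[i][j]) for i in range(5) for j in range(5))
for row in A_hat:
    print([round(x, 2) for x in row])
print("largest absolute difference from A:", round(max_err, 2))
```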
In practice, however, the purpose is not to actually reconstruct the original matrix but
to use the reduced dimensionality representation to identify similar words and documents.
Documents are now represented by row vectors in $V$, and document similarity is obtained
by comparing rows in the matrix $VS$ (note that documents are represented as row vectors
because we are working with $V$, not $V^T$). Words are represented by row vectors in $U$, and
word similarity can be measured by computing row similarity in $US$.
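The text does not fix a particular similarity measure for comparing rows. The sketch below
assumes cosine similarity, a common choice, and applies it to the rows of the reduced
$VS$. The entries of `V3` are the first three coordinates of each document vector, rounded
to two decimals with signs written out; `cosine` is an illustrative helper.

```python
import math

S3 = [17.92, 15.17, 3.56]
# Rows are documents: the first three columns of V, rounded to two decimals.
V3 = [[-0.46, 0.02, -0.87],
      [-0.07, -0.76, 0.06],
      [-0.74, 0.10, 0.28],
      [-0.48, 0.03, 0.40],
      [-0.07, -0.64, -0.04]]

# Row i of V S: scale each coordinate by its singular value.
docs = [[x * s for x, s in zip(row, S3)] for row in V3]

def cosine(a, b):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_2_5 = cosine(docs[1], docs[4])  # documents 2 and 5
sim_1_2 = cosine(docs[0], docs[1])  # documents 1 and 2
print(round(sim_2_5, 2), round(sim_1_2, 2))
```

Documents 2 and 5 come out nearly parallel, while documents 1 and 2 are close to
orthogonal, which matches the two groups of columns visible in $A$.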
Earlier I mentioned that in the process of dimensionality reduction, SVD makes similar
items appear more similar, and unlike items more unlike. This can be explained by looking
at the vectors in the reduced versions of $U$ and $V$ above. We know that the vectors contain
components ordered from most to least amount of variation accounted for in the original
data. By deleting elements representing dimensions which do not exhibit meaningful
variation, we effectively eliminate noise in the representation of word vectors. Now the word
vectors are shorter, and contain only the elements that account for the most significant
correlations among words in the original dataset. The deleted elements had the effect of diluting
these main correlations by introducing potential similarity along dimensions of questionable
significance.
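The effect can be illustrated with the word vectors from this example. The sketch below
assumes cosine similarity as the measure and compares rows 1 and 3 of $US$ (the words whose
rows in $A$ are $[2, 0, 8, 6, 0]$ and $[5, 0, 7, 4, 0]$) when 5, 3, and 2 dimensions are
kept. The entries of `U` and `SIGMA` are the two-decimal values from the text with signs
written out, so the printed similarities are approximate.

```python
import math

U = [[-0.54, 0.07, 0.82, -0.11, 0.12],
     [-0.10, -0.59, -0.11, -0.79, -0.06],
     [-0.53, 0.06, -0.21, 0.12, -0.81],
     [-0.65, 0.07, -0.51, 0.06, 0.56],
     [-0.06, -0.80, 0.09, 0.59, 0.04]]
# Square roots of the eigenvalues 321.07, 230.17, 12.70, 3.94, 0.12.
SIGMA = [17.92, 15.17, 3.56, 1.98, 0.35]

def word_vector(i, k):
    """Row i of U S, truncated to the first k dimensions."""
    return [U[i][j] * SIGMA[j] for j in range(k)]

def cosine(a, b):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Similarity between the first and third word for each number of dimensions.
sims = {k: cosine(word_vector(0, k), word_vector(2, k)) for k in (5, 3, 2)}
for k in (5, 3, 2):
    print(k, round(sims[k], 3))
```

With these rounded values the similarity rises as dimensions are dropped, and is very
close to 1 when only two dimensions are kept.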