Lecture 8: Dimensionality Reduction
Iain Styles
5 November 2019
Reducing the Dimensionality of MNIST
One problem with the analysis we have done so far is that we have
considered points that are distributed throughout a hypervolume.
"Real" data does not behave in this way; it tends to lie on some
low-dimensional subspace. Trivially, in MNIST, some pixels have
the same value in all images, as shown in Figure 1, which shows
the mean and standard deviation of 1000 MNIST images, together
with a mask that shows which pixels vary, and which do not. 175
of the 784 pixels in the image, which correspond to dimensions in
the underlying vector space, have zero variance. In this case they
also have zero mean, but that is less important. This means that
we could, trivially, reduce the dimensionality of the data to 609, by
just retaining those pixels which vary and therefore contain useful
information.
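As a concrete (if minimal) illustration, the following numpy sketch computes the per-pixel statistics described above. The array `images` and its shape are assumptions standing in for however the MNIST data has actually been loaded; the random array is only a placeholder.

```python
import numpy as np

# Minimal sketch: `images` is assumed to be a (1000, 784) array of flattened
# 28x28 MNIST images. The random array below is just a stand-in for real data.
images = np.random.rand(1000, 784)

pixel_mean = images.mean(axis=0)   # per-pixel mean (Figure 1, left)
pixel_var = images.var(axis=0)     # per-pixel variance

varying = pixel_var > 0            # mask of pixels that actually vary
print("zero-variance pixels:", np.sum(~varying))
print("retained dimensions: ", np.sum(varying))

# Discard the uninformative pixels: a trivial dimensionality reduction
reduced = images[:, varying]

# Cumulative variance of the pixels, ordered from most to least variable
# (this is the quantity plotted in Figure 2)
cum_var = np.cumsum(np.sort(pixel_var)[::-1])
cum_var /= cum_var[-1]
```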
Figure 1: Image statistics from 1000 MNIST images. Left: the pixel mean. Centre: the pixel standard deviation. Right: black pixels are those with zero variance.

In Figure 2 we show the cumulative sum of the pixel variances, and see that nearly all of the variance is from 500 pixels, and around 75% of it is from only half of the pixels, suggesting that
further reductions in dimensionality may be possible. In fact, by
looking at each pixel in isolation, we are only scratching the surface
of what is possible in terms of reducing the dimension of the data
in a meaningful way. One easy-to-picture reason why we might be
able to reduce the dimensionality further is that correlations be-
tween pixels mean that the value of one pixel can be predicted from
the value of another, but more complex situations are possible. For
instance, imagine that you have a picture or a poster that is rolled
up. When unrolled, points on the poster can be described quite
adequately in a 2d coordinate system. However, when the poster
is rolled up, it becomes a three-dimensional object that requires a
3d coordinate system. If we could find a transformation that could
"unroll" the poster, then we could work with it in 2d coordinates.
This argument can be extended to higher dimensions, and the task
of dimensionality reduction is to find the transformations that allow us to represent data using their intrinsic dimensionality: the number of meaningful degrees of freedom.

Figure 2: Cumulative variance as a function of dimensionality. The pixels were ordered by variance and the cumulative sum was calculated. This shows that around 75% of the variance in the data is from fewer than half of the pixels.

To understand what this means in terms of MNIST, let us consider what the underlying degrees of freedom are.

• We have ten digit classes.
• Members of each class may vary in width and height (although this is somewhat mitigated by the alignment process).
• Each character class will have some shape variation – for example, whether ones have a serif and/or a base (e.g. 1 vs 1).

An order-of-magnitude estimate suggests that there may be a few tens of intrinsic degrees of freedom in the data. The question now is, how do we find a transformation that extracts them?
Dimensionality Reduction
We will limit our discussion to linear methods for dimensionality reduction; that is, methods that apply a single global linear transformation (rotation, scale, shear, etc.) to the data. To understand what this might do, imagine a sheet of paper embedded in 3d. If we can rotate and shift the sheet so that it aligns with the 3d coordinate system (x, y, z), then we can discard the coordinate that aligns with the thinnest dimension of the sheet of paper (its thickness), and just retain those coordinates that describe the main variations
in the sheet: we can reduce the problem from 3d to 2d. We will be
trying to do the same here: to shift and rotate the data to find a
coordinate system that describes the main variation whilst preserv-
ing the internal structure of the data. Our hope is that by reducing
the dimensionality of the data in a way that preserves its internal
structure, we can mitigate some of the issues that we see in high
dimensional spaces.
There are many ways in which the dimensionality of a dataset
can be reduced; we will focus on the random projection.
This is an attractive method because it is computationally very
simple and low-cost, but there is some elegant theory behind it that
provides much insight into the problems that high dimensionality
causes.

Random Projections

We have already seen random projections in action, and have seen, experimentally, how much benefit they can bring to a high-dimensional learning problem. The central idea is exceptionally simple. Given an original dataset containing N samples in M-dimensional space (for MNIST, M = 784 pixels and N = 60,000 training images), arranged as an M × N matrix X (one sample per column):
• Generate K random vectors with M components randomly sampled from N(0, 1). Arrange these as a matrix R of size M × K, with one vector per column.
• Normalise the columns of R so each has unit length.
• Compute X′ = RᵀX.
• The columns of X′ contain the samples projected (Figure 3) into the lower-dimensional space defined by the K random vectors: X′ is a K × N matrix holding the N samples in K-dimensional space, so the dimensionality has been reduced from M to K.

Figure 3: Projection of a point P onto two unit random vectors r1 and r2. Noting that r1 and r2 are unit vectors, the projection is given by the dot product P · r = |P| cos θ.
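A minimal numpy sketch of these steps is given below. The shapes follow the M × N convention above; the data matrix is a random stand-in rather than real MNIST pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, K = 784, 1000, 40          # original dimension, number of samples, target dimension
X = rng.random((M, N))           # stand-in data: one M-dimensional sample per column

# K random vectors with M components, sampled from N(0, 1), one per column
R = rng.standard_normal((M, K))
# Normalise each column of R to unit length
R /= np.linalg.norm(R, axis=0)

# Project: X' = R^T X has shape (K, N) -- the N samples in K-dimensional space
X_prime = R.T @ X
print(X_prime.shape)             # (40, 1000)
```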
We saw, in k-nn classification, that this works. But why does it work? The key idea is that whilst distances in high-dimensional spaces are of limited use, if we could map the points into some low-dimensional space (where distances are more useful) in a way that preserves the local structure of the data, then we can render distances useful again.
Is it possible to find such a map? It turns out that random projections, subject to a few constraints, can be shown to generate just such a mapping. The key to this is a mathematical result known as the Johnson-Lindenstrauss lemma¹. This is a statement that points in a high-dimensional space can be mapped onto a low-dimensional space in such a way that relative distances between points are preserved, whilst the absolute distances are reduced.

¹ Johnson, William B., and Joram Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space." Contemporary Mathematics 26 (1984): 189–206.

The lemma states that:
Given a set X of N data points in ℝ^M (M-dimensional real space), there is a linear map f : ℝ^M → ℝ^K that maps X onto a random lower-dimensional space ℝ^K. The map obeys

$$(1 - \varepsilon)\,\lVert x_1 - x_2 \rVert^2 \;\leq\; \lVert f(x_1) - f(x_2) \rVert^2 \;\leq\; (1 + \varepsilon)\,\lVert x_1 - x_2 \rVert^2 \qquad (1)$$

for all x1, x2 ∈ X, for 0 < ε < 1 and K > 8 ln(N)/ε². Here ‖x1 − x2‖² is the squared distance between two points before the mapping, and ‖f(x1) − f(x2)‖² is the squared distance between them afterwards. It has also been proven that this holds when the function f is a random, orthonormal linear transformation². A proof of this result is far beyond the scope of this course. Let us try to unpick its implications.

² Frankl, Peter, and Hiroshi Maehara. "The Johnson-Lindenstrauss lemma and the sphericity of some graphs." Journal of Combinatorial Theory, Series B 44.3 (1988): 355–362.

Firstly, we need to be clear about the central result here: Johnson-Lindenstrauss and its extension by Frankl and Maehara together state that a "random, orthonormal linear transformation" can be used to project a set of points onto a lower-dimensional subspace whilst preserving the distances between them to within some bounds controlled by ε: distances between points are scaled by a factor between 1 − ε and 1 + ε. Noting the condition that K > 8 ln(N)/ε², this means that smaller values of ε (better distance preservation) require larger values of K.
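To get a feel for what this condition implies, the short calculation below (an illustrative sketch, not part of the lecture's own code) evaluates the bound K > 8 ln(N)/ε² for an MNIST-sized dataset.

```python
import numpy as np

# The condition K > 8 ln(N) / eps^2 from the lemma, for N = 60,000 points
N = 60_000
for eps in (0.1, 0.25, 0.5):
    K_min = 8 * np.log(N) / eps**2
    print(f"eps = {eps:.2f}  ->  K > {K_min:.0f}")
# Tighter distance preservation (smaller eps) demands a much larger projected
# dimension K; looser preservation can be had with comparatively few dimensions.
```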
One now comes across yet another remarkable property of high-dimensional spaces. The requirement that the random vectors be orthonormal – r_i · r_j = 1 if i = j and 0 otherwise – seems like it could be rather onerous. There are well-established methods for doing this (such as Gram-Schmidt orthogonalisation), but they carry a computational cost. Fortunately, high dimensionality saves us, and it turns out that an explicit orthogonalisation step is often not needed, because in very high-dimensional spaces, the probability that two random vectors will be nearly orthogonal tends towards 1! We demonstrate this numerically in Figure 4, noting that because x · y = |x||y| cos θ, x · y = 0 means that cos θ = 0, which corresponds to an angle of 90° and therefore means that x and y are orthogonal. These results demonstrate that as the dimensionality increases, the probability that two random vectors are far from orthogonal tends to zero. This is again a consequence of the dimensionality, and it can be shown that the number of sets of nearly orthogonal vectors is exponential in the dimensionality³. If the dimensionality of our problem is high enough, random vectors are self-orthogonalising!

³ Kainen, Paul C.; Kůrková, Vĕra (1993), "Quasiorthogonal dimension of Euclidean spaces", Applied Mathematics Letters, 6 (3): 7–10, doi:10.1016/0893-9659(93)90023-G, MR 1347278.
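A small numerical sketch of this self-orthogonalising behaviour, in the spirit of the experiment behind Figure 4, is given below; it is a rough stand-in for the lecture's own experiment rather than a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample pairs of random vectors, normalise them to unit length, and look at
# how their dot products (i.e. cos(theta)) spread around zero as d grows.
for d in (5, 20, 100, 1000):
    u = rng.standard_normal((10_000, d))
    v = rng.standard_normal((10_000, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    dots = np.sum(u * v, axis=1)
    print(f"d = {d:4d}   std of r1 . r2 = {dots.std():.3f}")
# The spread shrinks roughly like 1/sqrt(d): in high dimensions almost every
# pair of random vectors is very nearly orthogonal.
```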
Figure 4: P(r_i · r_j | i ≠ j) from 100,000 normalised random vectors in spaces of dimension d = 5, 10, 20, 50, 100, 200, 500 and 1000.

Finally, let us investigate the effect that random projection has on
the pairwise distances. We compute all pairwise distances on the
data projected onto forty random vectors, with distances computed
in the lower-dimensional space. The results are shown in Figure 5.
The mean/median of the distribution has been reduced to ≈ 580
from ≈ 2300, and the standard deviation to ≈ 100 from ≈ 300. This
does not seem entirely consistent with the Johnson-Lindenstrauss
lemma, which stated that distances would be preserved. Why is
this?
Although the main implications of Johnson-Lindenstrauss seem clear, there is a subtlety. One would expect dimensionality reduction to reduce the distances between points, not to preserve them, but J-L seems to say the opposite. However, it is perhaps more helpful to rewrite the theorem as

$$1 - \varepsilon \;\leq\; \frac{\lVert f(x_1) - f(x_2) \rVert^2}{\lVert x_1 - x_2 \rVert^2} \;\leq\; 1 + \varepsilon \qquad (2)$$

which allows us to see that J-L is in fact a theorem about the relative distances between points: it states that it is the relative distances that are preserved, whilst the absolute distances become reduced due to dimensional scaling. This is what makes it such a powerful result for machine learning: we can use random projections to reduce the dimensionality and overcome the "curse" at a very low computational cost, whilst preserving the structure (relative distances) within the data. This is why we see such a dramatic performance improvement in k-nn classification on MNIST digits.
Figure 5: Distribution of pairwise distances between 1000 examples from the MNIST test set and 1000 examples from the training set, following projection onto forty random vectors in pixel space.
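Finally, a sketch of this last experiment is shown below, with a random stand-in for the MNIST pixel data: it checks that the ratio of distances after and before projection is tightly concentrated, even though the absolute distances shrink.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, K = 784, 200, 40
X = rng.random((M, N))                 # stand-in for MNIST pixel vectors (columns)

# Random projection onto K unit vectors, as before
R = rng.standard_normal((M, K))
R /= np.linalg.norm(R, axis=0)
Xp = R.T @ X

def pairwise_distances(A):
    """Euclidean distances between all pairs of columns of A."""
    sq = np.sum(A**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * A.T @ A
    return np.sqrt(np.maximum(d2, 0.0))

D_high = pairwise_distances(X)
D_low = pairwise_distances(Xp)

offdiag = ~np.eye(N, dtype=bool)
ratio = D_low[offdiag] / D_high[offdiag]
print("mean distance before projection:", D_high[offdiag].mean())
print("mean distance after projection: ", D_low[offdiag].mean())
print("std of (after / before) ratio:  ", ratio.std())
# Absolute distances fall, but the after/before ratio is tightly concentrated:
# relative distances -- the structure of the data -- are preserved.
```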