CS294A Lecture notes
Andrew Ng
Sparse autoencoder
1     Introduction
Supervised learning is one of the most powerful tools of AI, and has led to
automatic zip code recognition, speech recognition, self-driving cars, and a
continually improving understanding of the human genome. Despite its sig-
nificant successes, supervised learning today is still severely limited. Specifi-
cally, most applications of it still require that we manually specify the input
features x given to the algorithm. Once a good feature representation is
given, a supervised learning algorithm can do well. But in such domains as
computer vision, audio processing, and natural language processing, there’re
now hundreds or perhaps thousands of researchers who’ve spent years of their
lives slowly and laboriously hand-engineering vision, audio or text features.
While much of this feature-engineering work is extremely clever, one has to
wonder if we can do better. Certainly this labor-intensive hand-engineering
approach does not scale well to new problems; further, ideally we’d like to
have algorithms that can automatically learn even better feature representa-
tions than the hand-engineered ones.
    These notes describe the sparse autoencoder learning algorithm, which
is one approach to automatically learn features from unlabeled data. In some
domains, such as computer vision, this approach is not by itself competitive
with the best hand-engineered features, but the features it can learn do
turn out to be useful for a range of problems (including ones in audio, text,
etc). Further, there’re more sophisticated versions of the sparse autoencoder
(not described in these notes, but that you’ll hear more about later in the
class) that do surprisingly well, and in some cases are competitive with or
sometimes even better than some of the hand-engineered representations.
                                       1
   This set of notes is organized as follows. We will first describe feedforward
neural networks and the backpropagation algorithm for supervised learning.
Then, we show how this is used to construct an autoencoder, which is an
unsupervised learning algorithm, and finally how we can build on this to
derive a sparse autoencoder. Because these notes are fairly notation-heavy,
the last page also contains a summary of the symbols used.
2    Neural networks
Consider a supervised learning problem where we have access to labeled train-
ing examples (x(i) , y (i) ). Neural networks give a way of defining a complex,
non-linear form of hypotheses hW,b (x), with parameters W, b that we can fit
to our data.
    To describe neural networks, we use the following diagram to denote a
single “neuron”:
     This “neuron” is a computational unit that takes as inputPx1 , x2 , x3 (and
a +1 intercept term), and outputs hw,b (x) = f (wT x) = f ( 3i=1 wi xi + b),
where f : R 7→ R is called the activation function. One possible choice for
f (·) is the sigmoid function f (z) = 1/(1 + exp(−z)); in that case, our single
neuron corresponds exactly to the input-output mapping defined by logistic
regression. In these notes, however, we’ll use a different activation function,
the hyperbolic tangent, or tanh, function:
                                             ez − e−z
                         f (z) = tanh(z) =            ,                     (1)
                                             ez + e−z
Here’s a plot of the tanh(z) function:
                                         2
    The tanh(z) function is a rescaled version of the sigmoid, and its output
range is [−1, 1] instead of [0, 1]. Our description of neural networks will use
this activation function.
    Note that unlike CS221 and (parts of) CS229, we are not using the con-
vention here of x0 = 1. Instead, the intercept term is handled separated by
the parameter b.
    Finally, one identity that’ll be useful later: If f (z) = tanh(z), then its
derivative is given by f 0 (z) = 1 − (f (z))2 . (Derive this yourself using the
definition of tanh(z) given in Equation 1.)
2.1    Neural network formulation
A neural network is put together by hooking together many of our simple
“neurons,” so that the output of a neuron can be the input of another. For
example, here is a small neural network:
                                      3
    In this figure, we have used circles to also denote the inputs to the net-
work. The circles labeled “+1” are called bias units, and correspond to the
intercept term. The leftmost layer of the network is called the input layer,
and the rightmost layer the output layer (which, in this example, has only
one node). The middle layer of nodes is called the hidden layer, because
its values are not observed in the training set. We also say that our example
neural network has 3 input units (not counting the bias unit), 3 hidden
units, and 1 output unit.
    We will let nl denote the number of layers in our network; thus nl = 3
in our example. We label layer l as Ll , so layer L1 is the input layer, and
layer Lnl the output layer. Our neural network has parameters (W, b) =
                                                 (l)
(W (1) , b(1) , W (2) , b(2) ), where we write Wij to denote the parameter (or weight)
associated with the connection between unit j in layer l, and unit i in layer
                                                      (l)
l+1. (Note the order of the indices.) Also, bi is the bias associated with unit
i in layer l+1. Thus, in our example, we have W (1) ∈ R3×3 , and W (2) ∈ R1×3 .
Note that bias units don’t have inputs or connections going into them, since
they always output the value +1. We also let sl denote the number of nodes
in layer l (not counting the bias unit).
                            (l)
    We will write ai to denote the activation (meaning output value) of
                                                     (1)
unit i in layer l. For l = 1, we also use ai = xi to denote the i-th input.
Given a fixed setting of the parameters W, b, our neural network defines a
hypothesis hW,b (x) that outputs a real number. Specifically, the computation
that this neural network represents is given by:
                  (2)               (1)     (1)        (1)       (1)
                 a1     = f (W11 x1 + W12 x2 + W13 x3 + b1 )                         (2)
                  (2)               (1)     (1)        (1)       (1)
                 a2     = f (W21 x1 + W22 x2 + W23 x3 + b2 )                         (3)
                  (2)            (1)       (1)      (1)     (1)
                 a3     =   f (W31 x1 + W32 x2 + W33 x3 + b3 )                       (4)
                             (3)       (2)      (2)      (2)    (2)
            hW,b (x) =      a1 = f (W11 a1 + W12 a2 + W13 a3 + b1 )                  (5)
                              (l)
In the sequel, we also let zi denote the total weighted sum of inputs to unit
                                              (2)        (1)     (1)
i in layer l, including the bias term (e.g., zi = nj=1 Wij xj + bi ), so that
                                                  P
 (l)       (l)
ai = f (zi ).
    Note that this easily lends itself to a more compact notation. Specifically,
if we extend the activation function f (·) to apply to vectors in an element-
wise fashion (i.e., f ([z1 , z2 , z3 ]) = [tanh(z1 ), tanh(z2 ), tanh(z3 )]), then we can
                                           4
write Equations (2-5) more compactly as:
                              z (2)   =   W (1) x + b(1)
                              a(2)    =   f (z (2) )
                              z (3)   =   W (2) a(2) + b(2)
                          hW,b (x)    =   a(3) = f (z (3) )
More generally, recalling that we also use a(1) = x to also denote the values
from the input layer, then given layer l’s activations a(l) , we can compute
layer l + 1’s activations a(l+1) as:
                           z (l+1) = W (l) a(l) + b(l)                       (6)
                           a(l+1) = f (z (l+1) )                             (7)
By organizing our parameters in matrices and using matrix-vector operations,
we can take advantage of fast linear algebra routines to quickly perform
calculations in our network.
    We have so far focused on one example neural network, but one can
also build neural networks with other architectures (meaning patterns of
connectivity between neurons), including ones with multiple hidden layers.
The most common choice is a nl -layered network where layer 1 is the input
layer, layer nl is the output layer, and each layer l is densely connected to
layer l + 1. In this setting, to compute the output of the network, we can
successively compute all the activations in layer L2 , then layer L3 , and so on,
up to layer Lnl , using Equations (6-7). This is one example of a feedforward
neural network, since the connectivity graph does not have any directed loops
or cycles.
    Neural networks can also have multiple output units. For example, here
is a network with two hidden layers layers L2 and L3 and two output units
in layer L4 :
                                          5
    To train this network, we would need training examples (x(i) , y (i) ) where
y ∈ R2 . This sort of network is useful if there’re multiple outputs that
 (i)
you’re interested in predicting. (For example, in a medical diagnosis applica-
tion, the vector x might give the input features of a patient, and the different
outputs yi ’s might indicate presence or absence of different diseases.)
2.2      Backpropagation algorithm
We will train our neural network using stochastic gradient descent. For much
of CS221 and CS229, we considered a setting in which we have a fixed train-
ing set {(x(1) , y (1) ), . . . , (x(m) , y (m) )}, and we ran either batch or stochastic
gradient descent on that fixed training set. In these notes, will take an online
learning view, in which we imagine that our algorithm has access to an un-
ending sequence of training examples {(x(1) , y (1) ), (x(2) , y (2) ), (x(3) , y (3) ), . . .}.
In practice, if we have only a finite training set, then we can form such a
sequence by repeatedly visiting our fixed training set, so that the examples in
the sequence will repeat. But even in this case, the online learning view will
make some of our algorithms easier to describe. In this setting, stochastic
gradient descent will proceed as follows:
       For i = 1, 2, 3, . . .
            Get next training example (x(i) , y (i) ).
                          (l)         (l)        ∂               (i) (i)
            Update Wjk := Wjk − α                  (l) J(W, b; x    ,y )
                                                ∂Wjk
                        (l)     (l)          ∂              (i) (i)
                       bj := bj − α           (l) J(W, b; x    ,y )
                                            ∂bj
Here, α is the learning rate parameter, and J(W, b) = J(W, b; x, y) is a cost
function defined with respect to a single training example. (When there is no
risk of ambiguity, we drop the dependence of J on the training example x, y,
and simply write J(W, b)). If the training examples are drawn IID from some
training distribution D, we can think of this algorithm as trying to minimize
                                  E(x,y)∼D [J(W, b; x, y)] .
Alternatively, if our sequence of examples is obtained by repeating some
fixed, finite training set {(x(1) , y (1) ), . . . , (x(m) , y (m) )}, then this algorithm is
standard stochastic gradient descent for minimizing
                                            m
                                       1 X
                                             J(W, b; x, y).
                                       m i=1
                                                   6
       To train our neural network, we will use the cost function:
                                                nl −1 X  sl+1 
                                                      sl X
                           1               2 λX                 (l)
                                                                     2
           J(W, b; x, y) =   khW,b (x) − yk −                   Wji
                           2                  2 l=1 i=1 j=1
The first term is a sum-of-squares error term; the second is a regularization
term (also called a weight decay term) that tends to decrease the mag-
nitude of the weights, and helps prevent overfitting.1 The weight decay
parameter λ controls the relative importance of the two terms.
    This cost function above is often used both for classification and for re-
gression problems. For classification, we let y = +1 or −1 represent the
two class labels (recall that the tanh(z) activation function outputs values
in [−1, 1], so we use +1/-1 valued outputs instead of 0/1). For regression
problems, we first scale our outputs to ensure that they lie in the [−1, 1]
range.
    Our goal is to minimize E(x,y) [J(W, b; x, y)] as a function of W and b. To
                                                                    (l)          (l)
train our neural network, we will initialize each parameter Wij and each bi
to a small random value near zero (say according to a N (0, 2 ) distribution
for some small , say 0.01), and then apply stochastic gradient descent. Since
J(W, b; x, y) is a non-convex function, gradient descent is susceptible to local
optima; however, in practice gradient descent usually works fairly well. Also,
in neural network training, stochastic gradient descent is almost always used
rather than batch gradient descent. Finally, note that it is important to
initialize the parameters randomly, rather than to all 0’s. If all the parameters
start off at identical values, then all the hidden layer units will end up learning
                                                          (1)
the same function of the input (more formally, Wij will be the same for
                             (2)     (2)
all values of i, so that a1 = a2 = . . . for any input x). The random
initialization serves the purpose of symmetry breaking.
    We now describe the backpropagation algorithm, which gives an effi-
cient way to compute the partial derivatives we need in order to perform
stochastic gradient descent. The intuition behind the algorithm is as follows.
Given a training example (x, y), we will first run a “forward pass” to compute
all the activations throughout the network, including the output value of the
hypothesis hW,b (x). Then, for each node i in layer l, we would like to compute
   1                                                         (l)
     Usually weight decay is not applied to the bias terms bi , as reflected in our definition
for J(W, b; x, y). Applying weight decay to the bias units usually makes only a small
different to the final network, however. If you took CS229, you may also recognize weight
decay this as essentially a variant of the Bayesian regularization method you saw there,
where we placed a Gaussian prior on the parameters and did MAP (instead of maximum
likelihood) estimation.
                                              7
                      (l)
an “error term” δi that measures how much that node was “responsible”
for any errors in our output. For an output node, we can directly measure
the difference between the network’s activation and the true target value,
                        (n )
and use that to define δi l (where layer nl is the output layer). How about
                                              (l)
hidden units? For those, we will compute δi based on a weighted average
                                            (l)
of the error terms of the nodes that uses ai as an input. In detail, here is
the backpropagation algorithm:
    1. Perform a feedforward pass, computing the activations for layers L2 ,
    L3 , and so on up to the output layer Lnl .
    2. For each output unit i in layer nl (the output layer), set
              (nl )             ∂      1                           (n )          (n )
             δi       =     (n )
                                         ky − hW,b (x)k2 = −(yi − ai l ) · f 0 (zi l )
                          ∂zi l        2
    3. For l = nl − 1, nl − 2, nl − 3, . . . , 2
         For each node i in layer l, set
                                                      sl+1
                                                                              !
                                        (l)
                                                      X           (l) (l+1)            (l)
                                       δi =                  Wji δj               f 0 (zi )
                                                      j=1
                                              (l)           (l)
    4. Update each weight Wij and bj according to:
                                                        
                    (l)       (l)     (l) (l+1)      (l)
                  Wij := Wij − α aj δi          + λWij
                                 (l)            (l)          (l+1)
                                bi     := bi − αδi                   .
   Although we have not proved it here, it turns out that
                          ∂                                       (l) (l+1)             (l)
                         (l)
                                    J(W, b; x, y) = aj δi                     + λWij ,
                      ∂Wij
                            ∂                                     (l+1)
                         (l)
                                    J(W, b; x, y) = δi                    .
                       ∂bi
Thus, this algorithm is exactly implementing stochastic gradient descent.
Finally, we can also re-write the algorithm using matrix-vectorial notation.
We will use “•” to denote the element-wise product operator (denoted “.*”
in Matlab or Octave, and also called the Hadamard product), so that if
                                                        8
a = b • c, then ai = bi ci . Similar to how we extended the definition of
f (·) to apply element-wise to vectors, we also do the same for f 0 (·) (so that
f 0 ([z1 , z2 , z3 ]) = [ ∂z∂ 1 tanh(z1 ), ∂z∂ 2 tanh(z2 ), ∂z∂ 3 tanh(z3 )]). The algorithm can
then be written:
      1. Perform a feedforward pass, computing the activations for layers L2 ,
      L3 , up to the output layer Lnl , using Equations (6-7).
      2. For the output layer (layer nl ), set
                                 δ (nl ) = −(y − a(nl ) ) • f 0 (z (n) )
      3. For l = nl − 1, nl − 2, nl − 3, . . . , 2
           Set
                                   δ (l) = (W (l) )T δ (l+1) • f 0 (z (l) )
                                                            
      4. Update the parameters according to:
                        W (l) := W (l) − α δ (l+1) (a(l) )T + λW (l)
                                                                                                        b(l) := b(l) − αδ (l+1) .
Implementation note 1: In steps 2 and 3 above, we need to compute
      (l)
f 0 (zi ) for each value of i. Assuming f (z) is the tanh activation function,
                            (l)
we would already have ai stored away from the forward pass through the
network. Thus, using the expression that we worked out earlier for f 0 (z), we
                           (l)        (l)
can compute this as f 0 (zi ) = 1 − (ai )2 .
Implementation note 2: Backpropagation is a notoriously difficult algo-
rithm to debug and get right, especially since many subtly buggy implemen-
tations of it—for example, one that has an off-by-one error in the indices
and that thus only trains some of the layers of weights, or an implemen-
tation that omits the bias term, etc.—will manage to learn something that
can look surprisingly reasonable (while performing less well than a correct
implementation). Thus, even with a buggy implementation, it may not at
all be apparent that anything is amiss. So, when implementing backpropa-
gation, do read and re-read your code to check it carefully. Some people also
numerically check their computation of the derivatives; if you know how to
do this, it’s worth considering too. (Feel free to ask us if you want to learn
more about this.)
                                                9
3     Autoencoders and sparsity
So far, we have described the application of neural networks to supervised
learning, in which we are have labeled training examples. Now suppose we
have only unlabeled training examples set {x(1) , x(2) , x(3) , . . .}, where x(i) ∈
Rn . An autoencoder neural network is an unsupervised learning algorithm
that applies back propagation, setting the target values to be equal to the
inputs. I.e., it uses y (i) = x(i) .
    Here is an autoencoder:
    The autoencoder tries to learn a function hW,b (x) ≈ x. In other words, it
is trying to learn an approximation to the identity function, so as to output
x̂ that is similar to x. The identity function seems a particularly trivial
function to be trying to learn; but by placing constraints on the network,
such as by limiting the number of hidden units, we can discover interesting
structure about the data. As a concrete example, suppose the inputs x are
the pixel intensity values from a 10 × 10 image (100 pixels) so n = 100,
and there are s2 = 50 hidden units in layer L2 . Note that we also have
y ∈ R100 . Since there are only 50 hidden units, the network is forced to
learn a compressed representation of the input. I.e., given only the vector of
hidden unit activations a(2) ∈ R50 , it must try to reconstruct the 100-pixel
                                        10
input x. If the input were completely random—say, each xi comes from an
IID Gaussian independent of the other features—then this compression task
would be very difficult. But if there is structure in the data, for example, if
some of the input features are correlated, then this algorithm will be able to
discover some of those correlations.2
      Our argument above relied on the number of hidden units s2 being small.
But even when the number of hidden units is large (perhaps even greater
than the number of input pixels), we can still discover interesting structure,
by imposing other constraints on the network. In particular, if we impose
a sparsity constraint on the hidden units, then the autoencoder will still
discover interesting structure in the data, even if the number of hidden units
is large.
      Informally, we will think of a neuron as being “active” (or as “firing”)
if its output value is close to 1, or as being “inactive” if its output value is
close to -1. We would like to constrain the neurons to be inactive most of
the time.3
      We will do this in an online learning fashion. More formally, we again
imagine that our algorithm has access to an unending sequence of training
examples {x(1) , x(2) , x(3) , . . .} drawn IID from some distribution D. Also, let
  (2)
ai as usual denote the activation of hidden unit i in the autoencoder. We
would like to (approximately) enforce the constraint that
                                           h i
                                              (2)
                                       Ex∼D ai = ρ,
where ρ is our sparsity parameter, typically a value slightly above -1.0
(say, ρ ≈ −0.9). In other words, we would like the expected activation of
each hidden neuron i to be close to -0.9 (say). To satisfy this expectation
constraint, the hidden unit’s activations must mostly be near -1.
    Our algorithm for (approximately) enforcing the expectation constraint
will have two major components:
                            h i First, for each hidden unit i, we will keep
                               (2)
a running estimate of Ex∼D ai . Second, after each iteration of stochastic
gradient descent, we will slowly adjust that unit’s parameters to make this
expected value closer to ρ.
   2
     In fact, this simple autoencoder often ends up learning a low-dimensional representa-
tion very similar to PCA’s.
   3
     The term “sparsity” comes from an alternative formulation of these ideas using net-
works with a sigmoid activation function f , so that the activations are between 0 or 1
(rather than -1 and 1). In this case, “sparsity” refers to most of the activations being near
0.
                                             11
   In each iteration of gradient descent, when we see each training input x
                                               (2)
we will compute the hidden units’
                             h i activations ai for each i. We will keep a
                               (2)
running estimate ρ̂i of Ex∼D ai    by updating:
                                                  (2)
                          ρ̂i := 0.999ρ̂i + 0.001ai .
(Or, in vector notation, ρ̂ := 0.999ρ̂ + 0.001a(2) .) Here, the “0.999” (and
“0.001”) is a parameter of the algorithm, and there is a wide range of
values will that will work fine. This particular choice causes ρ̂i to be an
exponentially-decayed weighted average of about the last 1000 observed val-
         (2)
ues of ai . Our running estimates ρ̂i ’s can be initialized to 0 at the start of
the algorithm.
    The second part of the algorithm modifies the parameters so as to try to
satisfy the expectation constraint. If ρ̂i > ρ, then we would like hidden unit
i to become less active, or equivalently, for its activations to become closer
to -1. Recall that unit i’s activation is
                                     n
                                                       !
                          (2)
                                    X      (1)     (1)
                        ai = f         Wij xj + bi       ,                   (8)
                                   j=1
       (1)
where bi is the bias term. Thus, we can make unit i less active by decreasing
 (1)
bi . Similarly, if ρ̂i < ρ, then we would like unit i’s activations to become
                                        (1)
larger, which we can do by increasing bi . Finally, the further ρi is from ρ,
                                                              (1)
the more aggressively we might want to decrease or increase bi so as to drive
the expectation towards ρ. Concretely, we can use the following learning rule:
                            (1)    (1)
                           bi := bi − αβ(ρ̂i − ρ)                           (9)
where β is an additional learning rate parameter.
    To summarize, in order to learn a sparse autoencoder using online learn-
ing, upon getting an example x, we will (i) Run a forward pass on our network
on input x, to compute all units’ activations; (ii) Perform one step of stochas-
tic gradient descent using backpropagation; (iii) Perform the updates given
in Equations (8-9).
4    Visualization
Having trained a (sparse) autoencoder, we would now like to visualize the
function learned by the algorithm, to try to understand what it has learned.
                                         12
Consider the case of training an autoencoder on 10 × 10 images, so that
n = 100. Each hidden unit i computes a function of the input:
                                100
                                                  !
                       (2)
                                X     (1)     (1)
                      ai = f        Wij xj + bi     .
                                       j=1
We will visualize the function computed by hidden unit i—which depends on
                   (1)
the parameters Wij (ignoring the bias term for now) using a 2D image. In
                         (1)
particular, we think of ai as some non-linear feature of the input x. We ask:
                                    (1)
What input image x would cause ai to be maximally activated? For this
question to have a non-trivial answer, we must impose some constraints      on
x. If we suppose that the input is norm constrained by ||x||2 = 100     2
                                                                P
                                                                      x
                                                                   i=1 i  ≤  1,
then one can show (try doing this yourself) that the input which maximally
activates hidden unit i is given by setting pixel xj (for all 100 pixels, j =
1, . . . , 100) to
                                          (1)
                                       Wij
                             x j = qP                .
                                      100      (1) 2
                                      j=1 (W  ij  )
By displaying the image formed by these pixel intensity values, we can begin
to understand what feature hidden unit i is looking for.
    If we have an autoencoder with 100 hidden units (say), then we our
visualization will have 100 such images—one per hidden unit. By examining
these 100 images, we can try to understand what the ensemble of hidden
units is learning.
    When we do this for a sparse autoencoder (trained with 100 hidden units
on 10x10 pixel inputs4 ) we get the following result:
   4
     The results below were obtained by training on whitened natural images. Whitening
is a preprocessing step which removes redundancy in the input, by causing adjacent pixels
to become less correlated.
                                             13
    Each square in the figure above shows the (norm bounded) input image x
that maximally actives one of 100 hidden units. We see that the different hid-
den units have learned to detect edges at different positions and orientations
in the image.
    These features are, not surprisingly, useful for such tasks as object recog-
nition and other vision tasks. When applied to other input domains (such
as audio), this algorithm also learns useful representations/features for those
domains too.
                                      14
5   Summary of notation
x           Input features for a training example, x ∈ Rn .
y           Output/target values. Here, y can be vector valued. In the case
            of an autoencoder, y = x.
    (i) (i)
(x , y ) The i-th training example
hW,b (x)    Output of our hypothesis on input x, using parameters W, b.
            This should be a vector of the same dimension as the target
            value y.
     (l)
Wij         The parameter associated with the connection between unit j
            in layer l, and unit i in layer l + 1.
  (l)
bi          The bias term associated with unit i in layer l + 1. Can also
            be thought of as the parameter associated with the connection
            between the bias unit in layer l and unit i in layer l + 1.
  (l)
ai          Activation (output) of unit i in layer l of the network. In addi-
                                                                   (1)
            tion, since layer L1 is the input layer, we also have ai = xi .
f (·)       The activation function. Throughout these notes, we used
            f (z) = tanh(z).
  (l)                                                                    (l)
zi          Total weighted sum of inputs to unit i in layer l. Thus, ai =
                (l)
            f (zi ).
α           Learning rate parameter
sl          Number of units in layer l (not counting the bias unit).
nl          Number layers in the network. Layer L1 is usually the input
            layer, and layer Lnl the output layer.
λ           Weight decay parameter
x̂          For an autoencoder, its output; i.e., its reconstruction of the
            input x. Same meaning as hW,b (x).
ρ           Sparsity parameter, which specifies our desired level of sparsity
ρ̂i         Our running estimate of the expected activation of unit i (in the
            sparse autoencoder).
β           Learning rate parameter for algorithm trying to (approximately)
            satisfy the sparsity constraint.
                                    15