AUTOENCODERS
Dr. Asifullah Khan, DCS
Autoencoders
An unsupervised learning technique
Leverage neural networks for the task of representation learning.
A neural network architecture where a bottleneck is imposed in the network, which forces
a compressed knowledge representation of the original input.
If the input features are independent of one another then this compression and subsequent reconstruction is a
very difficult task.
If some sort of structure exists in the data (i.e., correlations between input features), this structure can be
learned and consequently leveraged when forcing the input through the network's bottleneck.
Autoencoders
“An autoencoder is a neural network that is trained to attempt to copy its input to its output.”
— Page 502, Deep Learning, 2016.
A type of neural network where the output layer has the same dimensionality as the input layer.
The aim of an autoencoder is to learn a lower-dimensional representation (encoding) for higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important parts of the input image.
Properties of Autoencoders
Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of
important properties:
Data-specific:
◦ only able to meaningfully compress data similar to what they have been trained on.
Lossy:
◦ The output will not be exactly the same as the input; it will be a close but degraded representation.
Unsupervised:
◦ don’t need explicit labels to train on.
◦ more precisely they are self-supervised because they generate their own labels from the training data.
Architecture of Autoencoders
Autoencoders consist of 3 parts:
1. Encoder: compresses the input into an encoded representation that is typically several orders of magnitude smaller than the input data.
2. Bottleneck: the most important and the smallest part.
Restricts the flow of information from the encoder to the decoder
Helps to form a knowledge representation of the input
Prevents the neural network from memorizing the input and overfitting on the data
Rule of thumb: the smaller the bottleneck, the lower the risk of overfitting
3. Decoder: "decompresses" the knowledge representation and reconstructs the data from its encoded form.
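To make the three parts concrete, below is a minimal sketch in PyTorch (not taken from these slides); the input size of 784 and code size of 32 are illustrative assumptions.

```python
# Minimal sketch of encoder -> bottleneck -> decoder (sizes are assumptions).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: compresses the input down to the bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),                 # bottleneck: smallest layer
        )
        # Decoder: reconstructs the input from the encoded representation.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),  # outputs in [0, 1]
        )

    def forward(self, x):
        code = self.encoder(x)        # encoded representation
        return self.decoder(code)     # reconstruction x_hat
```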
Architecture of Autoencoders
Dataset: unlabeled data, framed as a supervised learning problem.
Task: output x̂, a reconstruction of the original input x.
Training of network: minimizing the reconstruction error, L(x, x̂), which measures the difference between the original input and the consequent reconstruction.
Key attribute: The bottleneck
Without the presence of an information bottleneck, the network could easily learn to simply memorize the input values by passing them along through the network.
Example
How to train autoencoders?
Set 4 hyperparameters before training an autoencoder:
1. Code size (size of the bottleneck): a smaller code size results in more compression. This can also act as a regularization term.
2. Number of layers: a higher depth increases model complexity; a lower depth is faster to process.
3. Number of nodes per layer: the architecture discussed is a stacked autoencoder, as the layers are stacked one after another. Stacked autoencoders look like a "sandwich": the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder.
4. Reconstruction loss: either mean squared error (MSE) or binary cross-entropy. If the input values are in the range [0, 1], cross-entropy is typically used; otherwise, mean squared error. With image data, the most popular reconstruction losses are MSE loss and L1 loss.
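As an illustration only (not the lecture's code), the sketch below makes these four hyperparameters explicit for a small stacked autoencoder trained with an MSE reconstruction loss; all sizes, the learning rate, and the dummy data are assumptions.

```python
# Hyperparameters of a stacked ("sandwich") autoencoder, made explicit.
import torch
import torch.nn as nn

code_size = 32              # 1. size of the bottleneck
hidden_sizes = [256, 64]    # 2./3. number of layers and nodes per layer (encoder side)
input_dim = 784             # assumed input dimensionality

encoder = nn.Sequential(
    nn.Linear(input_dim, hidden_sizes[0]), nn.ReLU(),
    nn.Linear(hidden_sizes[0], hidden_sizes[1]), nn.ReLU(),
    nn.Linear(hidden_sizes[1], code_size),
)
decoder = nn.Sequential(                       # mirrors the encoder
    nn.Linear(code_size, hidden_sizes[1]), nn.ReLU(),
    nn.Linear(hidden_sizes[1], hidden_sizes[0]), nn.ReLU(),
    nn.Linear(hidden_sizes[0], input_dim), nn.Sigmoid(),
)
model = nn.Sequential(encoder, decoder)

criterion = nn.MSELoss()   # 4. reconstruction loss (nn.BCELoss() for [0, 1] inputs)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, input_dim)        # dummy batch standing in for real data
for epoch in range(5):
    x_hat = model(x)
    loss = criterion(x_hat, x)       # compare the reconstruction to the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```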
How to train autoencoders?
Increasing these hyperparameters lets the autoencoder learn more complex codings.
If the autoencoder becomes too powerful, it will simply learn to copy its inputs to the output without learning any meaningful representation, i.e. it will
"mimic the identity function"
Such an autoencoder will reconstruct the training data perfectly, but it will be overfitting without being able to generalize to new instances.
Ideal Autoencoder
The ideal autoencoder model balances the following:
Sensitive enough to the inputs to accurately build a reconstruction.
Insensitive enough to the inputs that the model doesn't simply memorize or overfit
the training data.
This trade-off forces the model to maintain only the variations in the data required to reconstruct
the input without holding on to redundancies within the input.
Trade-off: Constructing a loss function
A loss function of the form:
L(x, x̂) + regularizer
where:
◦ the reconstruction loss term L(x, x̂) encourages the model to be sensitive to the inputs, and
◦ the regularizer, an added term, discourages memorization/overfitting.
Types of Autoencoders
Undercomplete Autoencoders
Sparse Autoencoders
Denoising Autoencoders
Contractive Autoencoders
Undercomplete Autoencoders
• A "sandwich" architecture.
• Deliberately keep the code size small, i.e. constrain the number of nodes present in the hidden layer(s) of the network, limiting the amount of information that can flow through the network.
Undercomplete Autoencoders
It won’t be able to directly copy its inputs to the output
Forced to learn intelligent features.
How?
◦ By penalizing the network as per the reconstruction error.
If the input data has a pattern, for example the digit “1” usually contains a somewhat straight line
and the digit “0” is circular, it will learn this fact and encode it in a more compact form.
If the input data was completely random without any internal correlation or dependency, then an
undercomplete autoencoder won’t be able to recover it perfectly.
This encoding will learn and describe latent attributes of the input data.
Undercomplete Autoencoders vs PCA
For dimensionality reduction, PCA (Principal Component Analysis) finds a lower-dimensional hyperplane that represents the higher-dimensional data as faithfully as possible. PCA can only model linear relationships.
Autoencoders are capable of learning nonlinear manifolds (a manifold
is defined in simple terms as a continuous, non-intersecting surface).
If we remove all non-linear activations from an undercomplete autoencoder and use only linear layers, we reduce the undercomplete autoencoder to something that works on an equal footing with PCA.
Torus (a nonlinear manifold)
Undercomplete Autoencoders vs PCA
In vanilla autoencoders, i.e. autoencoders with a single hidden layer, it's common to use linear
activations for both the hidden and output layers.
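For illustration, a purely linear undercomplete autoencoder might look like the sketch below (sizes are assumptions); trained with MSE, it can only recover the same k-dimensional linear subspace that PCA finds with k components.

```python
# Sketch: an undercomplete autoencoder built only from linear layers.
import torch
import torch.nn as nn

input_dim, k = 50, 5                          # illustrative dimensions
linear_ae = nn.Sequential(
    nn.Linear(input_dim, k, bias=False),      # linear "encoder" (no activation)
    nn.Linear(k, input_dim, bias=False),      # linear "decoder"
)
x_hat = linear_ae(torch.rand(8, input_dim))   # reconstruction of a dummy batch
# Inserting a nonlinearity (e.g. nn.ReLU) between the layers is what lets an
# autoencoder model curved manifolds that PCA's hyperplane cannot capture.
```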
Sparse autoencoders
Use regularization to force autoencoders to learn useful features
How?
Construct the loss function such that we penalize activations within a hidden layer.
This penalty, called the sparsity penalty, prevents the neural network from activating too many neurons and serves as a regularizer.
A different approach towards regularization, as we normally regularize the weights of a network,
not the activations.
Sparse autoencoders
Allow the network to sensitize individual hidden layer nodes toward specific attributes of the input data, i.e. have nodes in hidden layers dedicated to finding specific features.
Limits the network's capacity to memorize the input data without limiting the network's capability to extract features from the data.
This allows us to consider the latent state representation and regularization of the
network separately.
This method works even if the code size is large, since only a small subset of the nodes will be
active at any time.
Sparse autoencoders
The individual nodes of a trained model which activate are data-dependent: different inputs will result in activations of different nodes through the network.
Sparse autoencoders: Sparsity constraint
Two main ways to impose a sparsity constraint, both of which involve measuring the hidden layer activations for each training batch and adding a term to the loss function to penalize excessive activations:
L1 regularization
KL-Divergence
L1 Regularization: add a term to the loss function that penalizes the absolute value of the vector of activations a in layer h for observation i, scaled by a tuning parameter λ:
L(x, x̂) + λ Σ_i |a_i^(h)|
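A possible PyTorch sketch of this L1 penalty is shown below (the layer sizes, λ, and the dummy data are assumptions); note that the penalty is applied to the hidden activations, not to the weights.

```python
# Sketch of an L1 sparsity penalty on the hidden activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

lam = 1e-4                                   # tuning parameter lambda (assumed)
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())

x = torch.rand(16, 784)                      # dummy batch
a_h = encoder(x)                             # hidden layer activations a^(h)
x_hat = decoder(a_h)

recon_loss = F.mse_loss(x_hat, x)
l1_penalty = lam * a_h.abs().sum()           # lambda * sum_i |a_i^(h)|
loss = recon_loss + l1_penalty               # penalizes activations, not weights
```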
Sparse autoencoders: Sparsity constraint
KL-Divergence: In essence, KL-divergence is a measure of the difference between two probability distributions. We define a sparsity parameter ρ, the desired average activation of a neuron, and measure ρ̂_j, the actual average activation of neuron j over a collection of samples. This expectation can be calculated as
ρ̂_j = (1/m) Σ_i a_j^(h)(x_i)
where the subscript j denotes the specific neuron in layer h, summing the activations over m training observations denoted individually as x_i.
Considering the ideal distribution of activations as a Bernoulli distribution with mean ρ, we include a KL divergence term in the loss to reduce the difference between the current distribution of the activations and the ideal (Bernoulli) distribution:
L(x, x̂) + Σ_j KL(ρ ‖ ρ̂_j),  where  KL(ρ ‖ ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j))
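The sketch below illustrates one way this penalty could be computed in PyTorch (ρ, the layer sizes, and the dummy data are assumptions; in practice the KL term is often weighted by a coefficient).

```python
# Sketch of a KL-divergence sparsity penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

rho = 0.05                                   # target sparsity (Bernoulli mean), assumed
encoder = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid())  # activations in (0, 1)
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())

x = torch.rand(16, 784)                      # dummy batch of m = 16 observations
a_h = encoder(x)
x_hat = decoder(a_h)

rho_hat = a_h.mean(dim=0)                    # rho_hat_j = (1/m) sum_i a_j^(h)(x_i)
kl = (rho * torch.log(rho / rho_hat)
      + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
loss = F.mse_loss(x_hat, x) + kl             # reconstruction loss + KL penalty
```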
Denoising autoencoders
Remove noise from an image.
Unlike a standard autoencoder, it does not use its (noisy) input image as the ground truth.
Denoising autoencoders
Random noise is added to the inputs and the network is trained to recover the original noise-free data.
The autoencoder can't simply copy the input to its output because the input also contains random noise.
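A minimal sketch of this training step (the noise level, sizes, optimizer settings, and dummy data are assumptions): the network sees a corrupted input but is scored against the clean one.

```python
# Sketch of one denoising-autoencoder training step.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(16, 784)                                       # noise-free batch
x_noisy = (x_clean + 0.3 * torch.randn_like(x_clean)).clamp(0, 1)   # corrupted input

x_hat = model(x_noisy)                       # reconstruct from the noisy version
loss = F.mse_loss(x_hat, x_clean)            # the target is the clean data
optimizer.zero_grad()
loss.backward()
optimizer.step()
```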
Contractive Autoencoders
Robust to small changes in the training dataset.
Works on the basis that similar inputs should have similar encodings and a similar latent space
representation i.e. latent space should not vary by a huge amount for minor variations in the input.
Requires the derivative of the hidden layer activations to be small with respect to the input.
Essentially forcing the model to learn how to contract a neighborhood of inputs into a smaller
neighborhood of outputs.
Contractive Autoencoders
Quite similar to a denoising autoencoder as
"denoising autoencoders make the reconstruction function (i.e. decoder) resist small but finite-
sized perturbations of the input, while contractive autoencoders make the feature extraction
function (ie. encoder) resist infinitesimal perturbations of the input.“
Contractive Autoencoders
A loss term which penalizes large derivatives of the hidden layer activations with respect to the input training examples:
it penalizes instances where a small change in the input leads to a large change in the encoding space.
The regularization loss term is the squared Frobenius norm ‖·‖_F of the Jacobian matrix J of the hidden layer activations with respect to the input observations.
Loss function:
L(x, x̂) + λ ‖∇_x a^(h)(x)‖²_F
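As an illustration, for a single sigmoid encoder layer the squared Frobenius norm of the Jacobian has a simple closed form, used in the sketch below (λ, all sizes, and the dummy data are assumptions).

```python
# Sketch of a contractive penalty for one sigmoid encoder layer h = sigmoid(Wx + b).
# For that layer, ||dh/dx||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
import torch
import torch.nn as nn
import torch.nn.functional as F

lam = 1e-4                                        # penalty weight lambda (assumed)
W = nn.Parameter(torch.randn(128, 784) * 0.01)    # encoder weights
b = nn.Parameter(torch.zeros(128))
decoder = nn.Linear(128, 784)

x = torch.rand(16, 784)                           # dummy batch
h = torch.sigmoid(x @ W.t() + b)                  # hidden activations a^(h)
x_hat = torch.sigmoid(decoder(h))

dh = (h * (1 - h)) ** 2                           # (h_j (1 - h_j))^2, shape (batch, 128)
w_sq = (W ** 2).sum(dim=1)                        # sum_i W_ji^2, shape (128,)
jacobian_frob_sq = (dh * w_sq).sum()              # squared Frobenius norm over the batch

loss = F.mse_loss(x_hat, x) + lam * jacobian_frob_sq
```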
Thank You