
DEEP LEARNING-MODULE 3

Dr. Neethu Anna Sabu


Given an input value z = −1.2, compute the output of the following activation functions:
1. Sigmoid
2. Tanh
3. ReLU
4. Leaky ReLU (slope = 0.01)

For the same input z = −1.2, calculate the gradient (derivative with respect to the input) of each of these activation functions.
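One way to verify these values is in code; a minimal sketch, assuming NumPy and the standard definitions of each activation:

    import numpy as np

    z = -1.2

    sig = 1.0 / (1.0 + np.exp(-z))        # sigmoid(z)      ~ 0.2315
    d_sig = sig * (1.0 - sig)             # sigmoid'(z)     ~ 0.178

    th = np.tanh(z)                       # tanh(z)         ~ -0.8337
    d_th = 1.0 - th ** 2                  # tanh'(z)        ~ 0.305

    relu = max(0.0, z)                    # ReLU(z)          = 0.0
    d_relu = 1.0 if z > 0 else 0.0        # ReLU'(z)         = 0 for z < 0

    alpha = 0.01                          # Leaky ReLU slope
    lrelu = z if z > 0 else alpha * z     # LeakyReLU(z)     = -0.012
    d_lrelu = 1.0 if z > 0 else alpha     # LeakyReLU'(z)    = 0.01

    print(sig, d_sig, th, d_th, relu, d_relu, lrelu, d_lrelu)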
AUTOENCODERS
• Unsupervised representation learning: neural networks discover meaningful patterns or features in unlabeled data.
• Imposes a bottleneck in the network: the bottleneck is a hidden layer in the middle of the network that has fewer neurons than the input or output layers.
• This forces the network to compress the input data into a lower-dimensional representation.
• The bottleneck forces a compressed knowledge representation of the input.
• An autoencoder is a neural network that is trained to attempt to copy its input to its output.
• The network may be viewed as consisting of two parts:
  • an encoder function h = f(x), and
  • a decoder that produces a reconstruction r = g(h).

• An autoencoder is a type of neural network architecture designed to efficiently compress (encode) input data down to its essential features, then reconstruct (decode) the original input from this compressed representation.
• Autoencoders should not copy the input perfectly.
• They are restricted by design to copy only approximately.
• By doing so, they learn useful properties of the data.
• Modern autoencoders often use stochastic mappings.
• Autoencoders were traditionally used for dimensionality reduction as well as feature learning.
• An autoencoder is a mathematical model that trains on unlabeled, unclassified data; it maps the input data to a compressed feature representation and then reconstructs the input data from that representation.
ENCODER
• This part of the network compresses the input.
• It reduces the dimensionality using hidden layers.
• It learns important features and discards irrelevant details.

BOTTLENECK
• The compressed representation (latent space).
• It contains only the most important features needed to represent the input.

DECODER
• Reconstructs the original input.

Structure of an autoencoder (figure)
Components of the Autoencoder
1. Input (x)
• Raw data (e.g., an image, a signal, etc.)
2. Encoder Function (f)
• Maps the input x to a compressed representation h: h = f(x)
• This layer often includes nonlinear transformations (e.g., ReLU, sigmoid).
3. Hidden Layer / Code (h)
• The bottleneck layer: a lower-dimensional representation of the input.
• Forces the network to learn the most important features in a compact form.
4. Decoder Function (g)
• Reconstructs the input from the hidden code: r = g(h)
5. Reconstructed Output (r)
• The network's attempt to recreate the input x from the compressed code.
The autoencoder is trained to minimize the reconstruction error between the input x and the output r.

Common loss functions include:

• Mean Squared Error (MSE): L(x, r) = (1/n) Σ_i (x_i − r_i)^2
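A minimal sketch tying the pieces x, f, h, g, r, and the MSE loss together (assuming PyTorch; the layer sizes and data are illustrative placeholders, not from the slides):

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())      # f: x -> h (32-unit bottleneck)
    decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())   # g: h -> r

    x = torch.rand(16, 784)                 # a batch of inputs (placeholder data)
    h = encoder(x)                          # h = f(x), the bottleneck code
    r = decoder(h)                          # r = g(h), the reconstruction
    loss = nn.functional.mse_loss(r, x)     # reconstruction error to minimize
    loss.backward()                         # gradients for training by backpropagation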
APPLICATIONS
1. Dimensionality Reduction: the process of converting data with a large number of dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely.
2. Image Denoising: a noisy image can be given as input to the autoencoder and a de-noised image produced as output. The autoencoder tries to de-noise the image by learning the latent features of the image and using them to reconstruct an image without noise. The reconstruction error can be calculated as a measure of the distance between the pixel values of the output image and the ground-truth image.
3. Feature Extraction: once the model is fit on the training dataset, the reconstruction (decoding) part of the model can be discarded and the model up to the bottleneck can be used (only the encoding part is required). The output of the model at the bottleneck is a fixed-length vector that provides a compressed representation of the input data.
4. Data Compression: the process of reducing the number of bits needed to represent data. Compressing data can save storage capacity, speed up file transfer, and decrease costs for storage hardware and network bandwidth. Autoencoders are able to generate reduced representations of input data.
5. Removing Watermarks from Images
• Drawbacks of Autoencoders:
1. An autoencoder learns to capture as much information as possible rather than as much relevant information as possible.
2. Training an autoencoder requires a lot of data, processing time, hyperparameter tuning, and model validation before even starting to build the real model.
3. Because it is trained with back-propagation using a loss metric, crucial information may be lost during reconstruction of the input.

ASSUMPTIONS
• A high degree of correlation exists in the data.
• For uncorrelated data, the input features are independent, so compression and subsequent reconstruction would be difficult.
• There is nothing to combine or merge without losing information.
• If you reduce dimensionality, you are throwing away important parts, not just noise or overlapping data.
• Every feature contains unique, essential information.
AUTOENCODER
• The reconstruction error should be minimal.
• Number of neurons in the hidden layer < number of neurons in the input layer.
• The dimensionality of x and x^ should be the same.

Expectations
• Sensitive enough to the input for accurate reconstruction: x and x^ should be (nearly) the same.
• The AE should reconstruct the input accurately, but this could be achieved by the identity function; that is NOT OUR AIM, because then the network just memorizes the input rather than extracting patterns or learning compressed features.
• The AIM is that the input data be represented in a compressed domain.
• Insensitive enough that it does not memorize or overfit the training data, i.e., it does not learn the identity function; an overfit AE fails to reconstruct or encode new/unseen inputs accurately, having learned noise and specific details rather than the underlying structure.
• These two requirements are CONFLICTING.
They are balanced by designing a loss function that addresses both.
• The loss function of an autoencoder is therefore composed of two different parts.
• Reconstruction loss: measures how close the reconstructed output X^ is to the original input X.
• Regularization term: adds a penalty to the loss function to
  • prevent overfitting,
  • enforce constraints (e.g., sparsity, smoothness),
  • encourage robust representations.
• The first part is the loss function proper (e.g., mean squared error) measuring the difference between the input data and the output data, while the second term acts as a regularizer that prevents the autoencoder from overfitting.
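In symbols (a generic form; the exact penalty Ω and its weight depend on the autoencoder variant):

    L_total(X, X^) = ||X − X^||^2 + λ·Ω(h)

where Ω(h) is the regularization penalty on the hidden code (or on the encoder/decoder) and λ sets its relative weight against the reconstruction term.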
• PCA (Principal Component Analysis) in deep learning is a dimensionality
reduction technique used as a preprocessing step before training deep neural
networks.
• PCA transforms high-dimensional data into a lower-dimensional space by
identifying the directions (principal components) that capture the most variance in
the data.
• This compressed representation can be fed into deep learning models to reduce
complexity, speed up training, and avoid overfitting.
• PCA is like a linear autoencoder with no activation functions (only linear transformations) and just one hidden layer.
• But autoencoders are more powerful, especially when data relationships are nonlinear.
• PCA is a linear dimensionality reduction technique, not a neural network. But a linear autoencoder (which is a neural network) can behave like PCA.
Linear autoencoder
• A Linear Autoencoder is a type of autoencoder where all layers use only linear transformations (i.e., no activation functions like ReLU, Sigmoid, or Tanh are used).
• It cannot capture non-linear patterns.
• Even without non-linearities, a linear autoencoder can learn a compressed representation of the input data through linear transformations.
• It is mathematically similar to Principal Component Analysis (PCA).
• Encoder: projects the input into a lower-dimensional space (linear projection).
• Decoder: maps back to the original space (linear reconstruction).
• Loss function: Mean Squared Error (MSE) between the input X and the output X^.
• When:
  • the autoencoder has a single hidden layer,
  • uses MSE loss,
  • has no activation functions, and
  • the weights are tied (i.e., the decoder weights are the transpose of the encoder weights),
• then it performs exactly like PCA, learning the same principal subspace (lower-dimensional space).

• X: the input dataset.
• n: number of data samples (rows).
• d: number of features (input dimensions), so each data sample is a d-dimensional vector.
• h: latent representation (code) of the input.
• W1: encoder weight matrix.
• X^: reconstructed version of the input.
• W2: decoder weight matrix.
• A linear autoencoder with one hidden layer, tied weights, and MSE loss will learn the same subspace as Principal Component Analysis (PCA).
• This is why linear autoencoders are often considered a neural version of PCA.
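A minimal NumPy sketch of this correspondence (the data, n, d, and k are illustrative assumptions): the SVD gives the PCA subspace, and projecting with it is exactly the forward pass of a tied-weight linear autoencoder.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 500, 10, 3
    X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated features
    X = X - X.mean(axis=0)                                   # center the data, as PCA does

    # PCA: the top-k right singular vectors of X are the principal directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W1 = Vt[:k].T                      # d x k encoder weights; decoder is tied, W2 = W1.T

    H = X @ W1                         # h = X W1, linear projection (no activation)
    X_hat = H @ W1.T                   # X^ = h W2 = h W1.T, linear reconstruction
    mse = np.mean((X - X_hat) ** 2)    # the MSE this W1 achieves is the minimum over
    print(mse)                         # all rank-k tied linear encoder/decoder pairs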
Undercomplete autoencoders

By forcing the model to pass the input through a lower-dimensional space, it must learn to compress the most important features of the data. This prevents it from simply copying the input (i.e., identity mapping). It is like PCA, but it can model non-linear structures and learns compressed, informative features.
• Autoencoders whose code dimension is less than the input dimension are called undercomplete autoencoders.
• By penalizing the network according to the reconstruction error, the model can learn and capture the most salient features.
• Special case: if the encoder and decoder are linear and the loss function is mean squared error, the model reduces to Principal Component Analysis.
• Neural networks are capable of learning non-linear functions, and an autoencoder such as this one can be thought of as a non-linear PCA.
• AE training: minimize the loss function.
• The Mean Squared Error (MSE) loss function is commonly used in autoencoders.
• There are several ways to design autoencoders to copy only approximately; the system then learns useful properties of the data.
• One way makes the dimension of h smaller than the dimension of x.
• Undercomplete: h has a smaller dimension than x.
• Overcomplete: h has a greater dimension than x.
• Principal Component Analysis (PCA): an undercomplete autoencoder with a linear decoder and MSE loss function learns the same subspace as PCA.
• Nonlinear encoder/decoder functions yield more powerful nonlinear generalizations of PCA.
AVOIDING A TRIVIAL IDENTITY
• Undercomplete autoencoders
  • h has lower dimension than x
  • f or g has low capacity (e.g., linear g)
  • Must discard some information in h
• Overcomplete autoencoders
  • h has higher dimension than x
  • Must be regularized
AUTOENCODER
• A linear transformation followed by a non-linearity.
• The hidden representation captures everything it needs in order to reconstruct x_i.
• After passing through the bottleneck layer, the reconstructed output should equal the input.
• Analogy with PCA (a linear transformation): the two are similar in that both aim to reduce the dimensionality of data and learn compressed representations.
Case 1: Undercomplete autoencoder
• Analogy with PCA: h has all the important characteristics needed to reconstruct the data, so the input can be reconstructed from the bottleneck layer.
Case 2: Overcomplete autoencoder
• Example: 10 bits of input stored in 16 bits (everything in the original is retained).
• Dimension of h > dimension of x_i.
• All the information is available in the bottleneck layer.
Example 01
• The output is binary (no real numbers). What is an appropriate output function for the decoder?
Example 02
• A linear output unit can produce any real number.
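A hedged sketch of this choice (assuming PyTorch; the sigmoid/binary-cross-entropy and linear/MSE pairings are standard options, and the sizes and data are placeholders):

    import torch
    import torch.nn as nn

    h = torch.rand(8, 32)                                     # bottleneck codes (placeholder)

    # Binary outputs: sigmoid decoder output with binary cross-entropy.
    binary_decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())
    x_binary = torch.randint(0, 2, (8, 784)).float()
    loss_binary = nn.functional.binary_cross_entropy(binary_decoder(h), x_binary)

    # Real-valued outputs: linear decoder output (any real number) with MSE.
    real_decoder = nn.Linear(32, 784)
    x_real = torch.randn(8, 784)
    loss_real = nn.functional.mse_loss(real_decoder(h), x_real)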
• Reference: https://apxml.com/courses/autoencoders-representation-learning/chapter-3-regularized-autoencoders/overfitting-in-autoencoders
Regularization in AE
• Overcomplete autoencoders must be regularized because without regularization,
they risk learning a trivial identity function, defeating the purpose of learning
meaningful representations.
• The autoencoder has more capacity than needed.
• It can easily memorize the input by learning the identity function: X^ = X.
• This leads to poor generalization and no meaningful feature learning.
Regularization adds constraints or penalties that make this easy identity mapping undesirable or infeasible.
Regularization is most often applied to the encoder, because the encoder determines the quality of the latent representation, which is critical to meaningful learning.
However, decoder regularization is also important to:
• prevent overfitting during reconstruction,
• ensure smooth mappings from code to output,
• stabilize training.
Regularized autoencoders
• A similar problem occurs if the hidden code is allowed to have dimension equal to
the input, and in the overcomplete case in which the hidden code has dimension
greater than the input.
• In these cases, even a linear encoder and a linear decoder can learn to copy the
input to the output without learning anything useful about the data distribution
• With proper regularization, autoencoders of any architecture can be trained effectively by selecting an appropriate code dimensionality and network capacity based on the complexity of the data distribution (a regularized autoencoder).
• Allow the overcomplete case, but regularize.
• To prevent the autoencoder from merely copying the input to the output, the loss function should incorporate additional terms that encourage desirable properties such as sparsity in the representation, minimal sensitivity to input changes (small derivatives), and robustness to noise or missing data.
SPARSE AUTOENCODER
• A modification of an overcomplete autoencoder.
• A sparse autoencoder is a model that has been regularized to respond to unique statistical features by enforcing sparsity on the activations within the hidden (bottleneck) layer.
• The core idea is that for any given input sample, only a small subset of neurons in the hidden layer should be significantly active (i.e., have activation values that are not zero or near zero).
• A sparse autoencoder is therefore forced to selectively activate regions of the network depending on the input data.
• This eliminates the network's capacity to memorize the features of the input data, and since some regions are activated while others are not, the network learns the useful information and features.
• Reference: https://apxml.com/courses/autoencoders-representation-learning/chapter-3-regularized-autoencoders/sparse-autoencoders-regularization
• A sparsity constraint is placed on the hidden layer, encouraging most activations to be zero.
• It can be trained using backpropagation like other autoencoders, but with an added penalty term in the loss function.
• Regularization acts on the activations of the hidden-layer nodes.
How is regularization done? Through a sparsity constraint.
Sparsity helps the autoencoder to:
• learn more interpretable features,
• avoid simply learning the identity function,
• encourage disentangled representations, similar to those found in biological systems.
It is particularly useful when the hidden layer has more neurons than the input layer (i.e., an overcomplete autoencoder), which would otherwise just memorize the input.
• Assume a non-linear activation such as sigmoid.
• Active and non-active neurons correspond to outputs of 1 and 0.
• Compute the average activation at the j-th node in the hidden layer.
• m is the number of input vectors; x_i is any input.
• Assume a sparsity parameter such as 0.02 or 0.002, or any smaller value.
The regularization term can be the KL divergence (a measure of the difference) between two distributions, one defined by ρ and one by ρ^.
W is the weight matrix; β sets the relative weight between the data-loss component and the regularization component.
KL Divergence
• Defining a target sparsity parameter, ρ, which represents the desired
average activation level for each hidden neuron (e.g., ρ=0.05 meaning
we want each neuron to be active, on average, only 5% of the time).

We compute the actual average activation of the j-th hidden neuron, ρ^_j, over a set of m training samples:

ρ^_j = (1/m) Σ_{i=1..m} a_j(x_i)

where a_j(x_i) is the activation of hidden unit j for input x_i. We then use the Kullback–Leibler (KL) divergence to measure the difference between the desired average activation ρ and the observed average activation ρ^_j:

KL(ρ || ρ^_j) = ρ log(ρ / ρ^_j) + (1 − ρ) log((1 − ρ) / (1 − ρ^_j))
• KL divergence term acts as a penalty. It is minimized (equals zero)
when ρ^j=ρ, and increases as ρ^j deviates from ρ.
• We sum this penalty over all s hidden units and add it to the reconstruction loss, weighted by another hyperparameter β: J_sparse = J_reconstruction + β Σ_{j=1..s} KL(ρ || ρ^_j).

Minimizing this total loss encourages the network to achieve accurate reconstruction while ensuring
that the average activation of each hidden neuron ρ^j stays close to the target sparsity level ρ
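A minimal NumPy sketch of this penalty (the function name and the ρ = 0.05 and β = 3.0 values are illustrative choices, not from the slides):

    import numpy as np

    def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
        """KL-divergence sparsity penalty for a sparse autoencoder.

        hidden_activations: array of shape (m, s) with sigmoid activations of
        the s hidden units over m training samples.
        rho: target average activation per hidden unit.
        beta: weight of the sparsity term in the total loss.
        """
        rho_hat = hidden_activations.mean(axis=0)        # average activation per unit, shape (s,)
        rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)       # avoid log(0)
        kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
        return beta * kl.sum()

    # total_loss = reconstruction_mse + kl_sparsity_penalty(h)   # added to the reconstruction loss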
L1 Regularization for Sparsity

• Refer
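An alternative, sketched here under illustrative names and λ value, is to add an L1 penalty λ·Σ_j |h_j| on the hidden activations, which also pushes most activations toward zero:

    import numpy as np

    def l1_sparsity_penalty(hidden_activations, lam=1e-3):
        """L1 penalty on hidden activations; drives most activations toward zero."""
        return lam * np.abs(hidden_activations).sum()

    # total_loss = reconstruction_mse + l1_sparsity_penalty(h)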
DENOISING AUTOENCODER
• Autoencoders are neural networks commonly used for feature selection and extraction.
• However, when there are more nodes in the hidden layer than there are inputs, the network risks learning the so-called "identity function" (also called the "null function"), meaning that the output equals the input, rendering the autoencoder useless.
• This type of Autoencoder is an alternative to the concept of regular
Autoencoder which is prone to a high risk of overfitting.
• In the case of a Denoising Autoencoder, the data is partially corrupted
by noises added to the input vector in a stochastic manner.
• Then, the model is trained to predict the original, uncorrupted data
point as its output.
• Your friend has learned the manifold from seeing thousands of faces.
• When you hand them a smudged picture, they mentally “snap” it back
to the nearest real-face pattern they know, then redraw it cleanly.
• The denoising autoencoder does exactly this with images, except
mathematically — learning to map noisy data back to the clean data
manifold.
(Figure: noisy data vs. original data)
• The DAE does not memorize the original data; it learns a vector field that maps inputs onto a low-dimensional manifold.
• Noise is added to the input data X in a stochastic manner to get corrupted/noisy data X~ (X tilde).
• The encoder computes f(X~), the decoder computes g(f(X~)), and the reconstruction is X^ = g(f(X~)) = r(X~).
• Loss function to minimize: the squared error between the reconstruction X^ and the clean input X.
• The network learns a vector field that maps an input point to a point on the lower-dimensional manifold.
• manifold is a lower-dimensional surface embedded in a higher-dimensional space,
where the true data points lie.
• In a Denoising Autoencoder, manifold learning means learning a mapping that
takes corrupted data off the manifold and pulls it back to the high-density
data manifold, capturing the structure of the true data distribution.
• A DAE learns the manifold by training on noisy inputs x~ to reconstruct the clean data x.
• The noise forces the model to focus on the robust, underlying patterns (the
manifold) rather than memorizing noisy details.
• The encoder maps the noisy input to a point on the manifold (latent
representation), and the decoder reconstructs the clean data from this point.
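A minimal PyTorch sketch of this training setup (architecture sizes, noise level, and data are illustrative placeholders): the input is corrupted stochastically, but the loss compares the reconstruction with the clean input.

    import torch
    import torch.nn as nn

    class DenoisingAE(nn.Module):
        def __init__(self, n_in=784, n_hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

        def forward(self, x_noisy):
            h = self.encoder(x_noisy)        # h = f(x~)
            return self.decoder(h)           # x^ = g(f(x~))

    model = DenoisingAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    x = torch.rand(32, 784)                  # clean batch (placeholder data)
    x_noisy = x + 0.2 * torch.randn_like(x)  # stochastic corruption x~
    opt.zero_grad()
    x_hat = model(x_noisy)
    loss = loss_fn(x_hat, x)                 # compare reconstruction with the CLEAN input
    loss.backward()
    opt.step()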
Manifold learning: the manifold interpretation of a Denoising Autoencoder
• The diagram shows how a denoising autoencoder learns to
map noisy inputs back to the clean data manifold.
• The blue curve represents the manifold where real, clean
data points (red X’s) lie.
• When noise is added to a clean sample X, it is displaced to
a corrupted version X~ (brown circle region), which lies
off the manifold.
• The DAE, through its reconstruction function r(x), learns a
vector field r(x)−x (green arrows) that points from noisy
inputs toward the nearest point on the manifold.
• This process effectively "pulls" corrupted samples back to high-density regions of the data distribution, i.e., moves them towards the manifold.
• MANIFOLD LEARNING: capturing the manifold's structure and making the learned features more robust to noise.
• Manifold learning has mostly focused on unsupervised learning procedures that attempt to capture these manifolds.
• Autoencoders exploit the idea that data concentrates around a low-
dimensional manifold or a small set of such manifolds
• Prevents memorizing: the AE learns a vector field that helps it move any point in space to the nearest point on the manifold.
• All autoencoder training procedures involve a compromise between two
forces:
Learning a representation h of a training example x such that x can
be approximately recovered from h through a decoder.
Satisfying the constraint or regularization penalty.
This can be an architectural constraint that limits the capacity of the autoencoder, or it can be a regularization term added to the reconstruction cost.
Vector Field Learned by a Denoising
Autoencoder
Example of Denoising AE
Computational graph for the cost function of a
denoising autoencoder.
• The clean data x is corrupted to x~, encoded
(by encoder f) into features h (compressed
features), and then decoded (by decoder g)
back to reconstruct x.
• The loss measures how well the
reconstructed output matches the original
clean data, encouraging the network to
learn a mapping from noisy inputs back to
the data manifold.
Merits
• Learns more robust filters.
• Prevents the network from learning a simple identity function.
• Decreases the risk of overfitting that can be problematic with a regular AE.
CONTRACTIVE AUTOENCODERS
• The contractive autoencoder (CAE) uses a regularizer to make the derivatives of f(x) as small as possible.
• The name "contractive" arises from the way the CAE warps space: the input neighborhood is contracted to a smaller output neighborhood.
• The CAE is contractive only locally.
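This penalty is the squared Frobenius norm of the Jacobian of the encoder, Ω(h) = λ·Σ_ij (∂h_j/∂x_i)^2. A minimal NumPy sketch for a single sigmoid hidden layer (the function name and λ value are illustrative):

    import numpy as np

    def contractive_penalty(h, W, lam=1e-4):
        """Frobenius-norm-of-Jacobian penalty for a single sigmoid hidden layer.

        h: (batch, s) sigmoid activations of the hidden layer, h = sigmoid(x @ W.T + b)
        W: (s, d) encoder weight matrix
        lam: penalty weight (illustrative value)
        For sigmoid units, dh_j/dx_i = h_j (1 - h_j) W_ji, so
        ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
        """
        dh = (h * (1 - h)) ** 2                      # (batch, s)
        w_sq = (W ** 2).sum(axis=1)                  # (s,)
        return lam * (dh * w_sq).sum(axis=1).mean()  # average penalty over the batch

    # total_loss = reconstruction_mse + contractive_penalty(h, W)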
Regularization
• In the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)).
• In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input.
• The goal of the CAE is to learn the manifold structure of the data.
• One practical issue with the CAE regularization criterion is that although it is cheap to compute in the case of a single-hidden-layer autoencoder, it becomes much more expensive in the case of deeper autoencoders.
• Another practical issue is that the contraction penalty can give useless results if we do not impose some sort of scale on the decoder.
APPLICATIONS
