
Deep Learning

Ian Goodfellow, along with co-authors Yoshua Bengio and Aaron Courville, published the book
"Deep Learning" in 2016. This book provides a comprehensive overview of deep learning's
development, theory, and applications, and it outlines several historical trends that shaped the field.

Here are some key historical trends in deep learning that are discussed in the book:

1. Early Neural Networks (1950s–1980s)

• Perceptron (1950s-1960s): Frank Rosenblatt's perceptron is an early type of neural
network based on the idea of a simple linear classifier. While it sparked initial interest, the
limitations of perceptrons—especially the inability to solve non-linearly separable
problems—were highlighted by Minsky and Papert in 1969. This led to a decline in interest
in neural networks for some time.
• Multilayer Networks and Backpropagation (1980s): In the 1980s, researchers
rediscovered the idea of multilayer neural networks (also known as deep networks). The
development of backpropagation by David Rumelhart, Geoffrey Hinton, and others
allowed for training deep networks by efficiently computing gradients and adjusting
weights, which reignited interest in neural networks.

2. Challenges and Setbacks (1990s–2000s)

• Difficulty of Training Deep Networks: During the 1990s, despite backpropagation's
success, training deep neural networks remained challenging due to issues such as
vanishing gradients, overfitting, and limited computational resources. As a result, simpler
methods like support vector machines (SVMs) and decision trees became more popular.
• The AI Winter: Several setbacks, including limited progress in neural networks and
inflated expectations of AI capabilities, led to periods of reduced funding and interest in
AI research, known as the "AI winters."

3. Breakthroughs in Deep Learning (2006 and Beyond)

• Unsupervised Pretraining (2006): A key turning point occurred when Geoffrey Hinton
and his collaborators introduced unsupervised pretraining, where deep networks were
initialized by training each layer as a restricted Boltzmann machine. This technique helped
overcome some difficulties in training deep networks by providing better initial weights.
• ReLU Activation Function (2010s): The rectified linear unit (ReLU) was introduced as
an activation function that improved the efficiency of training deep networks by addressing
the vanishing gradient problem and enabling deeper models.
• Convolutional Neural Networks (CNNs): CNNs, initially proposed by Yann LeCun in
the 1990s, became prominent again in the 2010s due to their success in image recognition
tasks. The breakthrough came when AlexNet, a deep CNN architecture, won the ImageNet
competition in 2012, demonstrating the power of deep learning in computer vision.
• GPU Acceleration: Advances in hardware, particularly the use of graphical processing
units (GPUs) for accelerating deep learning computations, allowed for the training of larger
and deeper networks in practical timeframes.

4. Deep Learning Renaissance (2010s)

• Deep Learning for Speech Recognition and NLP: Deep learning began to outperform
traditional methods in fields beyond computer vision, including speech recognition and
natural language processing (NLP). Techniques
like recurrent neural networks (RNNs) and long short-term memory (LSTM) networks
became crucial for handling sequential data.
• Generative Models: Techniques like variational autoencoders (VAEs) and generative
adversarial networks (GANs), introduced by Ian Goodfellow in 2014, marked a major
advance in generating realistic data, including images, audio, and text.
• End-to-End Learning: A major trend in the 2010s was the shift toward end-to-end
learning, where entire systems are trained in a unified way rather than manually designing
intermediate features. This trend was particularly influential in tasks like image
classification, speech recognition, and machine translation.

5. Ongoing Trends and Future Directions

• Scalability and Large Models: One of the key themes in deep learning's history is the
importance of scale. As datasets, computational resources, and models became larger,
performance improved dramatically. This trend continues with the development of even
larger architectures like GPT (generative pre-trained transformers) and BERT.
• Transfer Learning: Another ongoing trend is the success of transfer learning, where
models pre-trained on large datasets are fine-tuned for specific tasks with smaller datasets.
This technique became increasingly popular in NLP and computer vision.
• Reinforcement Learning: Reinforcement learning (RL), combined with deep learning
(referred to as deep reinforcement learning), gained significant attention, particularly due
to its success in training agents to master complex tasks such as playing Atari video games
and board games like Go and chess (e.g., AlphaGo and AlphaZero).

Deep learning model

A deep learning model refers to a neural network with multiple layers of interconnected nodes
(neurons) that are used to automatically learn patterns from data. The "deep" in deep learning
comes from the use of multiple layers (often called hidden layers) between the input and output
layers, which allows the model to capture complex patterns and abstractions. Deep learning models
are typically designed to handle large-scale, high-dimensional data and have become state-of-the-
art solutions in many domains such as computer vision, natural language processing, speech
recognition, and more.

Here’s an overview of the key components and types of deep learning models:
1. Key Components of a Deep Learning Model

• Neurons (Nodes): The fundamental units that make up the layers of the network. Each
neuron receives inputs, applies a weight and bias, processes the result through an activation
function, and passes the output to the next layer.
• Layers:
o Input Layer: The first layer, where data enters the model.
o Hidden Layers: Layers between the input and output layers that perform
transformations of the data. These are responsible for learning the features of the
input data.
o Output Layer: The final layer, which produces the output (predictions) of the
model.
• Weights and Biases: Weights control the strength of the connections between neurons,
while biases allow the model to fit data better by shifting the activation function.
• Activation Function: The function that determines if a neuron should be activated or not.
Common activation functions include:
o ReLU (Rectified Linear Unit): Helps with non-linearities and speeds up learning
by avoiding vanishing gradient problems.
o Sigmoid/Tanh: Used in older networks but less common today due to their
tendency to cause vanishing gradients.
o Softmax: Often used in the output layer for multi-class classification problems.
• Loss Function: The function that measures how far the model’s predictions are from the
actual labels. Common loss functions include:
o Cross-Entropy Loss: For classification problems.
o Mean Squared Error (MSE): For regression problems.
• Optimization Algorithm: Algorithms like stochastic gradient descent (SGD) and Adam
are used to adjust the model’s weights and biases in order to minimize the loss function.
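
To see how these components fit together, here is a minimal sketch (assuming PyTorch is
available); the layer sizes, learning rate, and random data are placeholders for illustration, not
values taken from the text above.

# A small network combining layers, a ReLU activation, a cross-entropy loss, and the Adam optimizer.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),   # input layer -> hidden layer (weights and biases)
    nn.ReLU(),           # activation function
    nn.Linear(64, 3),    # hidden layer -> output layer (3 classes)
)

loss_fn = nn.CrossEntropyLoss()   # applies softmax internally for multi-class classification
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimization algorithm

x = torch.randn(8, 20)          # a batch of 8 examples with 20 input features each
y = torch.randint(0, 3, (8,))   # integer class labels
loss = loss_fn(model(x), y)     # forward pass followed by loss calculation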

2. Types of Deep Learning Models

a) Feedforward Neural Networks (FNN)

• Also called Multilayer Perceptrons (MLP), FNNs are the simplest type of deep learning
model where the information moves in one direction—from the input layer to the output
layer through hidden layers.
• Used for tasks like image classification, tabular data analysis, and simple predictive
modeling.

b) Convolutional Neural Networks (CNN)

• CNNs are specifically designed for processing grid-like data such as images. They use
convolutional layers to scan input data and detect spatial hierarchies of patterns.
• Key components include:
o Convolutional Layers: Extract features from input data (e.g., edges in images).
o Pooling Layers: Reduce the spatial size of the representation, often using max
pooling.
o Fully Connected Layers: Connect all neurons, typically used at the end for
classification or regression.
• Applications: Image recognition, object detection, video analysis, and more.

c) Recurrent Neural Networks (RNN)

• RNNs are designed for sequence data, such as time series or text. They have a memory
aspect because they retain information from previous inputs, making them suitable for
sequential tasks.
• Key RNN variants include:
o LSTM (Long Short-Term Memory): Addresses the vanishing gradient problem
and helps RNNs capture long-term dependencies in sequences.
o GRU (Gated Recurrent Unit): A simpler alternative to LSTM, often faster and
more efficient.
• Applications: Speech recognition, language modeling, machine translation, time series
forecasting.

d) Transformers

• Transformers are highly parallelizable models designed for sequential tasks, especially in
natural language processing (NLP). Unlike RNNs, they do not process data sequentially
but instead use an attention mechanism that allows them to focus on different parts of the
input sequence.
• Key features:
o Self-Attention Mechanism: Helps the model weigh the importance of different
words in a sentence, making it easier to learn relationships between distant parts of
the input.
o Positional Encoding: Adds information about the position of each word in the
sequence since transformers process all words simultaneously.
• Applications: Language translation, text summarization, and models like GPT, BERT, and
T5.
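
As a small sketch of the self-attention idea described above (not a full transformer), the function
below computes scaled dot-product attention with NumPy; the sequence length and embedding size
are arbitrary assumptions.

# Scaled dot-product self-attention on a toy sequence (NumPy only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how strongly each position attends to every other position
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                       # weighted sum of the value vectors

seq_len, d_model = 5, 16                 # 5 tokens with 16-dimensional embeddings (assumed sizes)
X = np.random.randn(seq_len, d_model)    # token embeddings; positional encoding omitted for brevity
out = scaled_dot_product_attention(X, X, X)   # self-attention: queries, keys, and values all come from X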

e) Autoencoders

• Autoencoders are unsupervised learning models used for tasks like dimensionality
reduction, denoising, and anomaly detection. They consist of:
o Encoder: Maps the input into a compressed, lower-dimensional latent space.
o Decoder: Reconstructs the original input from the compressed latent
representation.
• Applications: Image denoising, data compression, generative tasks.

f) Generative Adversarial Networks (GANs)

• GANs are used for generating new data (e.g., images, videos) that resembles a given
dataset. They consist of two neural networks:
o Generator: Tries to generate fake data that resembles the real data.
o Discriminator: Tries to distinguish between real and fake data.
• These networks are trained together in a game-theoretic setup where the generator tries to
fool the discriminator.
• Applications: Image synthesis, art generation, style transfer, and more.

g) Deep Reinforcement Learning (DRL)

• DRL is a combination of deep learning and reinforcement learning, where an agent learns
to make decisions by interacting with an environment. It uses deep neural networks to
approximate value functions or policies.
• Applications: Game playing (e.g., AlphaGo), robotics, self-driving cars.

3. Training a Deep Learning Model

Training a deep learning model involves:

• Forward Pass: Input data is passed through the network layer by layer to generate
predictions.
• Loss Calculation: The difference between predictions and actual labels is calculated using
a loss function.
• Backward Pass (Backpropagation): Gradients of the loss with respect to the model’s
weights are calculated using backpropagation, and the weights are updated accordingly to
minimize the loss.
• Optimization: The optimizer (e.g., SGD, Adam) adjusts weights to converge the model
towards an optimal solution.
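
The loop below is a minimal sketch of these four steps for one small regression model, assuming
PyTorch; the data is random and only serves to show the flow of a training iteration.

# One training loop: forward pass, loss calculation, backpropagation, weight update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(16, 10), torch.randn(16, 1)   # a dummy mini-batch

for epoch in range(5):
    predictions = model(x)            # forward pass
    loss = loss_fn(predictions, y)    # loss calculation
    optimizer.zero_grad()             # clear gradients from the previous iteration
    loss.backward()                   # backward pass (backpropagation)
    optimizer.step()                  # optimizer adjusts the weights to reduce the loss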

4. Popular Deep Learning Frameworks

• TensorFlow: Developed by Google, it's one of the most popular frameworks for building
deep learning models. It supports a wide range of applications and tools for both research
and production.
• PyTorch: Developed by Facebook, it's highly favored in the research community due to
its ease of use and flexibility.
• Keras: A high-level API built on top of TensorFlow, designed for quick experimentation
with deep learning models.
• MXNet, Caffe, Theano: Other deep learning libraries used in various applications.

5. Applications of Deep Learning Models

• Image Recognition (e.g., facial recognition, object detection).


• Natural Language Processing (e.g., language translation, text summarization).
• Speech Recognition (e.g., voice assistants, transcription).
• Autonomous Driving (e.g., lane detection, obstacle avoidance).
• Healthcare (e.g., medical image analysis, drug discovery).

The book also introduces the key concepts of machine learning tasks, experience, and performance,
which are fundamental to understanding how machine learning and deep learning systems work.
Here’s an overview of how these concepts are explained:

1. Task

• A task refers to what the machine learning system is supposed to accomplish. This could
be any problem that a learning algorithm is designed to solve.
• In deep learning, tasks can vary widely and include:
o Classification: Assigning labels to data points (e.g., classifying images of cats and
dogs).
o Regression: Predicting continuous values (e.g., predicting house prices).
o Clustering: Grouping similar data points together (e.g., clustering customer data
for market segmentation).
o Generative tasks: Learning to generate new data that resembles the training data
(e.g., generating realistic images using GANs).
o Reinforcement learning tasks: Making sequential decisions to maximize
cumulative rewards (e.g., playing games like Go or Atari).
• Defining the task is the first step in designing a machine learning system, as it determines
the nature of the input, output, and how success is measured.

2. Experience

• Experience refers to the data the system learns from. In machine learning, experience
typically comes in the form of a dataset.
• The system improves its performance by learning from this dataset. For example:
o In supervised learning, the experience includes labeled data (e.g., images labeled
as cats or dogs).
o In unsupervised learning, the experience consists of unlabeled data, and the
system must discover hidden patterns within the data (e.g., clustering or
dimensionality reduction).
o In reinforcement learning, experience comes from interactions with an
environment where the system receives feedback in the form of rewards or
penalties.
• The quality, quantity, and diversity of the data (experience) play a crucial role in how well
the system learns. More data often leads to better performance, although it must be relevant
and representative of the task at hand.

3. Performance

• Performance refers to how well the system completes the task based on the experience
(data) it has received. It is typically measured using some form of evaluation metric.
o For classification tasks, common performance metrics include accuracy,
precision, recall, F1-score, and area under the ROC curve.
o For regression tasks, metrics such as mean squared error (MSE) and mean
absolute error (MAE) are commonly used.
o In reinforcement learning, performance might be measured by the total reward
accumulated over time.
• The book emphasizes that performance can be divided into two aspects:
o Training performance: How well the model performs on the training data.
o Generalization performance: How well the model performs on new, unseen data
(test or validation data).
• Overfitting is a major concern in deep learning, where the model performs well on the
training data but poorly on new data due to learning too many specific patterns that do not
generalize.

Key Concepts in Context

In the book, the relationship between task, experience, and performance is central to understanding
the machine learning pipeline:

1. Defining the Task: The task defines the goal of the learning system and shapes what data
(experience) is needed.
2. Gathering Experience: The quality of the experience (data) is critical for effective
learning. A diverse, well-labeled, and representative dataset is key to success.
3. Evaluating Performance: Performance metrics help assess how well the model is
achieving the task based on the data it has learned from. Good generalization is the ultimate
goal of any machine learning system.

In sum, these three pillars—task, experience, and performance—guide the design, implementation,
and evaluation of machine learning and deep learning systems.

Linear regression is a fundamental statistical method used to model the relationship between a
dependent variable y and one or more independent variables x. In its simplest form, simple linear
regression involves one independent variable and is represented by the following equation:

y=w0+w1x+ϵ

Where:

• y is the dependent variable (target).


• x is the independent variable (input feature).
• w0 is the intercept (the value of y when x=0).
• w1 is the slope (the change in y for a one-unit change in x).
• ϵ is the error term or residual (the difference between the actual and predicted values).

Example:

Let’s say we want to predict a student's final exam score (y) based on the number of hours they
studied (x).

Here’s a small dataset:


Hours Studied (x) Exam Score (y)

1 50

2 55

3 65

4 70

5 80

Step 1: Formula for Linear Regression

The general form of the linear regression equation is:

y^=w0+w1x

Where:

• y^ is the predicted exam score based on the model.
• x is the number of hours studied.

Step 2: Formulas for the Slope and Intercept

The least-squares estimates of the slope w1 and the intercept w0 are:

w1 = [n∑xiyi − (∑xi)(∑yi)] / [n∑xi² − (∑xi)²]

w0 = (∑yi − w1∑xi) / n

Where:

• n is the number of data points.
• xi and yi are the individual data points for x and y.

Step 3: Applying the Formulas

Let’s calculate the coefficients using the dataset above.

• Sum of x: ∑xi = 1+2+3+4+5 = 15
• Sum of y: ∑yi = 50+55+65+70+80 = 320
• Sum of products: ∑xiyi = (1×50)+(2×55)+(3×65)+(4×70)+(5×80) = 50+110+195+280+400 = 1035
• Sum of squares: ∑xi² = 1²+2²+3²+4²+5² = 55
• Number of data points: n = 5

Now, calculate the slope w1:

w1 = [(5×1035) − (15×320)] / [(5×55) − 15²] = (5175 − 4800) / (275 − 225) = 375 / 50 = 7.5

Next, calculate the intercept w0:

w0 = [320 − (7.5×15)] / 5 = (320 − 112.5) / 5 = 207.5 / 5 = 41.5

Step 4: Linear Regression Equation

Now that we have the slope w1 = 7.5 and intercept w0 = 41.5, the linear regression equation is:

y^ = 41.5 + 7.5x

Step 5: Prediction

Using the regression equation, we can predict the exam score for any number of study hours. For
example, if a student studies for 3 hours (x=3):

y^ = 41.5 + 7.5(3) = 41.5 + 22.5 = 64

So, the predicted exam score for 3 hours of study is 64.

Conclusion:

Linear regression provides a simple and effective way to model the relationship between two
variables, allowing us to make predictions based on the fitted line. The formulas for the slope and
intercept define the line that minimizes the sum of squared prediction errors.
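
To double-check the arithmetic in the worked example, here is a short sketch using NumPy and the
least-squares formulas from Step 2; it reproduces the slope 7.5, the intercept 41.5, and the
prediction of 64 for three hours of study.

# Recomputing the slope, intercept, and prediction for the study-hours dataset.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)        # hours studied
y = np.array([50, 55, 65, 70, 80], dtype=float)   # exam scores
n = len(x)

w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / n

print(w1, w0)        # 7.5 41.5
print(w0 + w1 * 3)   # 64.0, the predicted score for 3 hours of study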

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for both
classification and regression tasks, though it is mostly used for classification. The main objective
of SVM is to find the optimal hyperplane that best separates the data into different classes. For
binary classification, this means finding the hyperplane that maximizes the margin between the
two classes.

Key Concepts in Support Vector Machine (SVM)

1. Hyperplane: In an SVM, a hyperplane is a decision boundary that separates the data into
different classes. In a 2D space, a hyperplane is a line, while in higher dimensions, it is a
plane or a more complex geometric structure.
2. Margin: The margin is the distance between the hyperplane and the closest data points
from each class. SVM aims to maximize this margin to ensure that the classification is as
robust as possible. The data points closest to the hyperplane are called support vectors, as
they are crucial in defining the hyperplane.
3. Support Vectors: Support vectors are the data points that lie closest to the hyperplane and
influence its orientation and position. Only these points are necessary to define the decision
boundary, while the others are ignored.
4. Linear vs Non-Linear Classification:
o Linear SVM: This is used when the data is linearly separable, meaning that it can be
perfectly separated by a straight line (in 2D) or a plane (in higher dimensions).
o Non-Linear SVM: When the data is not linearly separable, SVM can still be used by
transforming the data into a higher dimension using a technique called the kernel trick,
which allows SVM to fit a non-linear boundary.
5. Kernel Trick: The kernel trick is a method of implicitly mapping the input features into
higher-dimensional space to handle non-linear classification. Some common kernels
include:
o Linear Kernel: Used when the data is linearly separable.
o Polynomial Kernel: Handles non-linear relationships.
o Radial Basis Function (RBF) Kernel: A popular kernel for non-linear problems, often called
the Gaussian kernel.
Example of SVM

Let’s go through a simple example of SVM for binary classification.

Problem: Predict whether a person will purchase a product based on their age and salary.

Dataset:
Age (years) Salary ($1000) Purchase (y)

22 40 0

25 45 0

30 50 0

35 55 1

40 60 1

45 65 1

The target variable y takes values 0 (did not purchase) or 1 (purchased). Our goal is to find a
hyperplane that separates the people who purchased from those who did not based on their age and
salary.

Step-by-Step Process of SVM

1. Plotting the Data:

First, we plot the data points in 2D space, where the x-axis represents age and the y-axis represents
salary. The points are labeled based on whether the person purchased (1) or did not purchase (0).

2. Finding the Optimal Hyperplane:

If the data is linearly separable, the SVM will find a hyperplane that maximizes the margin between
the two classes. In this case, the hyperplane might be a straight line separating the two groups of
people: those who purchased and those who did not.

3. Support Vectors:

The data points that are closest to the hyperplane on either side are called the support vectors.
These are the most important data points because they determine the position and orientation of
the hyperplane. For instance, in this case, the individuals with ages 30, 35, and 40 could be support
vectors.
4. Decision Boundary:

Once the optimal hyperplane is found, it serves as the decision boundary. Any new data point can
be classified as 0 or 1 based on which side of the hyperplane it falls. For example:

• If a new person aged 32 with a salary of $52,000 comes in, the model will predict whether this
person will purchase based on which side of the hyperplane their data point falls.

5. Margin:

The distance between the hyperplane and the nearest points (support vectors) on either side is the
margin. The larger the margin, the better the generalization of the classifier.

Mathematically, the linear decision boundary is:

w0 + w1·(Age) + w2·(Salary) = 0

Where:

• w0 is the bias term.
• w1 and w2 are the weights for the features age and salary, respectively.

6. Kernel Trick for Non-Linear Data:

If the data is not linearly separable (for example, if the purchase behavior depends on a non-linear
combination of age and salary), SVM can use a kernel to map the data into a higher-dimensional
space where it becomes linearly separable.

For instance, using an RBF kernel would allow the SVM to classify data that has a circular decision
boundary in 2D.
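
As an illustration of this workflow, the sketch below fits an SVM with an RBF kernel to the small
purchase dataset above, assuming scikit-learn is available; the feature scaling and the C and gamma
settings are illustrative assumptions rather than tuned values.

# Fitting an RBF-kernel SVM to the age/salary purchase data and classifying a new person.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[22, 40], [25, 45], [30, 50], [35, 55], [40, 60], [45, 65]])  # age, salary ($1000)
y = np.array([0, 0, 0, 1, 1, 1])                                            # purchased or not

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)

print(model.predict([[32, 52]]))           # predicted class for a 32-year-old earning $52,000
print(model.named_steps["svc"].support_)   # indices of the training points used as support vectors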

Advantages of SVM

1. Effective in High Dimensions: SVM works well in cases where the number of features is larger
than the number of samples, which is common in text classification or gene expression data.
2. Robust with Margin Maximization: SVM focuses on the support vectors and tries to maximize the
margin, making it a robust classifier.
3. Versatile with Kernels: By using different kernel functions, SVM can handle both linear and non-
linear classification problems.

Disadvantages of SVM

1. Computational Complexity: For large datasets, SVM can be computationally expensive, especially
with non-linear kernels like RBF.
2. Choosing the Right Kernel: The performance of an SVM heavily depends on choosing the correct
kernel and tuning its parameters.
3. Interpretability: SVMs can be harder to interpret compared to simpler algorithms like logistic
regression.

The Gaussian Kernel, also known as the Radial Basis Function (RBF) Kernel, is one of the
most commonly used kernel functions in Support Vector Machines (SVMs), especially for non-
linear classification tasks. It transforms the data into a higher-dimensional space to make it possible
to find a linear decision boundary in that space.
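
Concretely, the Gaussian (RBF) kernel measures similarity as K(x, x′) = exp(−‖x − x′‖² / (2σ²)),
often written with γ = 1/(2σ²). The tiny sketch below, assuming NumPy, computes this similarity for
two points; the γ value is an arbitrary assumption.

# Gaussian (RBF) kernel between two feature vectors.
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))   # exp(-gamma * squared Euclidean distance)

print(rbf_kernel([22, 40], [25, 45]))   # similarity between two (age, salary) points
print(rbf_kernel([22, 40], [22, 40]))   # identical points have similarity 1.0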

Principal Component Analysis (PCA)

PCA is used to simplify data while retaining its most important features. It works by finding the
directions (called principal components) along which the variance of the data is maximized.
These directions are orthogonal to each other, and the first principal component captures the
greatest variance, the second captures the second most, and so on.

Mathematical Definition

The main idea behind PCA is to find a new set of axes, called principal components, that are
linear combinations of the original variables. These new axes point in the direction of maximum
variance in the data.

1. Data Centering: The first step in PCA is to center the data by subtracting the mean of
each variable so that the data has zero mean:
2. Covariance Matrix: The covariance matrix captures the relationships between the
variables.
3. Eigenvalue Decomposition: PCA finds the eigenvectors (principal components) and
eigenvalues of the covariance matrix. The eigenvectors represent the directions of
maximum variance, and the eigenvalues correspond to the magnitude of the variance in
those directions.
4. Principal Components: The eigenvectors are sorted in decreasing order of their
corresponding eigenvalues. The first eigenvector represents the direction with the greatest
variance, the second the next greatest, and so on. These eigenvectors are the principal
components.
5. Projection: The data is then projected onto the principal components, reducing the
dimensionality of the dataset while retaining the most important information:

Objective of PCA

The objective of PCA is to reduce the dimensionality of the data by retaining only the components
that capture the most variance. This can be beneficial for visualization, reducing computational
costs, and avoiding the curse of dimensionality.

Example of PCA
Let’s say we have a dataset with two variables: height and weight of individuals. The data might
look like this:

Height (cm) Weight (kg)


160 55
165 58
170 63
175 70
180 75

Now, suppose we want to reduce this 2D data to a 1D dataset while preserving as much information
as possible.

1. Centering the Data: Subtract the mean of height and weight from the respective variables
to center the data.
2. Covariance Matrix: Compute the covariance matrix to understand how the two variables
are related.
3. Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix
to find the principal components. In this case, the first principal component might capture
most of the variance in height and weight, while the second principal component will
capture less variance.
4. Projection: Project the data onto the first principal component to reduce the dimensionality
from 2D to 1D.
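
The steps above can be carried out directly with NumPy; the sketch below is a minimal version for
the height/weight data, assuming nothing beyond NumPy is available.

# PCA by hand: center, covariance, eigen-decomposition, projection to 1D.
import numpy as np

X = np.array([[160, 55], [165, 58], [170, 63], [175, 70], [180, 75]], dtype=float)

X_centered = X - X.mean(axis=0)            # 1. center the data
cov = np.cov(X_centered, rowvar=False)     # 2. covariance matrix of height and weight
eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigenvalues and eigenvectors (ascending order)

order = np.argsort(eigvals)[::-1]          # 4. sort components by explained variance
components = eigvecs[:, order]

Z = X_centered @ components[:, :1]         # 5. project onto the first principal component (2D -> 1D)
print(Z.ravel())                           # the 1D coordinates of the five individuals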

Why PCA Works

• Maximizing Variance: PCA works because it captures the directions of maximum
variance. By projecting data onto these directions, it retains the most significant patterns in
the data, which helps reduce dimensionality without losing too much information.
• Orthogonal Components: The principal components are orthogonal to each other,
ensuring that there is no redundant information.

Applications of PCA

1. Dimensionality Reduction: PCA is widely used to reduce the number of features in
datasets with many variables, which makes further analysis easier and faster.
2. Data Visualization: PCA is commonly used to reduce high-dimensional data to 2 or 3
dimensions for visualization purposes, making it easier to spot patterns or clusters in the
data.
3. Noise Filtering: PCA can be used to filter out noise from data by keeping only the principal
components that capture significant variance, while discarding those that represent noise.
4. Compression: In image and signal processing, PCA is often used to compress data by
reducing dimensionality while retaining the most important information.

PCA in the Context of Deep Learning


Although PCA is not typically used directly in deep learning models, it is a crucial concept for
understanding the general principles of dimensionality reduction and feature extraction. PCA can
also serve as a preprocessing step in some deep learning applications, particularly in cases where
reducing input dimensionality is important for improving training efficiency and reducing
overfitting.

Adversarial training is a robust regularization technique that aims to improve the resilience of
neural networks against adversarial examples—perturbations in input data that are carefully
designed to cause the model to make mistakes. Here are some key adversarial training methods
and techniques that serve as regularization for models, helping them generalize better and resist
malicious perturbations.

1. Basic Adversarial Training

• Concept: The model is trained on adversarial examples generated on-the-fly, along with clean
examples, to improve robustness.
• Implementation: Adds adversarially perturbed inputs (usually based on gradients) to the
training data.
• Effect: The model learns to correctly classify both clean and adversarially perturbed inputs,
improving its robustness against adversarial attacks.

2. Projected Gradient Descent (PGD) Adversarial Training

• Concept: Generates adversarial examples using Projected Gradient Descent (PGD), a more
iterative and powerful attack compared to single-step attacks.
• Implementation: In each training iteration, generates a PGD attack to create adversarial
examples, then trains the model on these examples.
• Effect: Increases the model’s robustness, especially against stronger attacks. PGD training is
often regarded as a "universal" defense, providing strong regularization against a wide range of
adversarial methods.

3. Fast Gradient Sign Method (FGSM) Adversarial Training

• Concept: A simpler approach that uses a single gradient step to create adversarial examples.
• Implementation: Perturbs each input x slightly in the direction that maximally increases the
model’s loss, defined by: xadv = x + ϵ·sign(∇xL(θ, x, y))
• Effect: FGSM training can improve robustness against fast, single-step attacks, though it may be
less effective against multi-step attacks like PGD.
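
The sketch below shows one FGSM adversarial-training step, assuming PyTorch; the model, the random
batch, and the ε value are placeholders for illustration, not a recommended recipe.

# One step of FGSM adversarial training: craft perturbed inputs, then train on clean + adversarial data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
epsilon = 0.1   # perturbation budget (assumed value)

x = torch.randn(16, 20)
y = torch.randint(0, 2, (16,))

# Generate adversarial examples: xadv = x + epsilon * sign(grad_x L(theta, x, y))
x.requires_grad_(True)
clean_loss = loss_fn(model(x), y)
grad = torch.autograd.grad(clean_loss, x)[0]
x_adv = (x + epsilon * grad.sign()).detach()

# Train on both clean and adversarial inputs
optimizer.zero_grad()
total_loss = loss_fn(model(x.detach()), y) + loss_fn(model(x_adv), y)
total_loss.backward()
optimizer.step()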

4. Adversarial Logit Pairing (ALP)

• Concept: Encourages the logits (pre-softmax outputs) of clean and adversarial examples to be
similar.
• Implementation: Adds a regularization term to the loss function that penalizes differences
between the logits for clean and adversarial examples.
• Effect: By minimizing the distance between the logits, ALP reduces the effect of adversarial
perturbations on model output, making the model more robust.

5. Virtual Adversarial Training (VAT)

• Concept: A semi-supervised regularization method that introduces perturbations in the
directions that most change the model’s output distribution.
• Implementation: Calculates perturbations that maximize the KL-divergence between the
outputs of clean and perturbed inputs.
• Effect: Encourages smoothness in the model's output distribution around each input, increasing
robustness even for inputs without ground-truth labels.

6. TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)

• Concept: Balances the trade-off between clean accuracy and adversarial robustness.
• Implementation: Adds a regularization term that penalizes the Kullback-Leibler (KL) divergence
between the predictions for clean and adversarial inputs.
• Effect: Allows a balance between robustness and accuracy, providing more control over the
model’s resilience to adversarial attacks while maintaining accuracy on clean data.

7. Adversarial Data Augmentation

• Concept: Enhances the training dataset by adding pre-generated adversarial examples.


• Implementation: Uses methods like PGD or FGSM to generate adversarial examples and adds
these to the training dataset.
• Effect: By diversifying the dataset, the model gains exposure to a broader range of inputs,
making it better at generalizing and less sensitive to perturbations.

8. Label Smoothing with Adversarial Training

• Concept: Aims to smooth the model's confidence in its predictions, reducing vulnerability to
adversarial perturbations.
• Implementation: Modifies the one-hot label vectors so that the model does not output overly
confident predictions (e.g., converting a label from 1 to 0.9).
• Effect: Reduces overconfidence in predictions, which can make adversarial attacks less effective
and help improve robustness.
9. Adversarial Weight Perturbation (AWP)

• Concept: Regularizes the model by perturbing the weights, rather than the input data, during
adversarial training.
• Implementation: Optimizes the model's performance against worst-case perturbations in the
weight space.
• Effect: Improves generalization by making the model’s learned features robust to small changes
in the weight parameters, adding another layer of resistance against adversarial inputs.

10. Feature Scattering-Based Adversarial Training

• Concept: Instead of perturbing input data directly, this method perturbs feature representations
within the model.
• Implementation: Creates perturbations in the hidden layers of the network and trains the
model to classify them correctly.
• Effect: Makes the model more robust to attacks by encouraging consistent internal
representations, thus stabilizing its performance under perturbations.

These adversarial regularization methods make neural networks not only more robust to attacks
but also often improve the overall generalization on unseen data. They are commonly used in
applications where security and reliability are essential, such as in finance, autonomous driving,
and cybersecurity.

Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Tangent Distance, Tangent Propagation (Tangent Prop), and Manifold Tangent Classifier are
techniques used in machine learning, specifically in fields like image recognition, to improve the
generalization and robustness of models. These methods leverage the concept of small
transformations or variations (tangents) of the input data to build more invariant and robust
representations, which helps models handle complex data variations better.

Here's a breakdown of each:

1. Tangent Distance

• Concept: Tangent distance is a metric that measures the similarity between two points by
considering small transformations or variations (tangents) around each point. This method
assumes that certain transformations, like small rotations or shifts, don’t change the identity of
an object in images, so it incorporates these transformations into the distance calculation.
• Technique: Instead of calculating the standard Euclidean distance between two points, tangent
distance considers the minimum distance between affine spaces generated by transformations
of each point. This makes it ideal for tasks like digit or object recognition, where the identity
remains the same under small variations.
• Application: Tangent distance is used in image recognition tasks, particularly for handwritten
character recognition, where it can help the model recognize the same character in different
orientations or shapes.

2. Tangent Propagation (Tangent Prop)

• Concept: Tangent Propagation (or Tangent Prop) is a regularization method for training neural
networks to learn invariances to small transformations of the input data. By penalizing the
model when its output changes significantly in response to known small transformations,
Tangent Prop encourages the model to learn a more stable and invariant representation.
• Technique: During training, a tangent vector that represents a small transformation (e.g.,
rotation, scaling, translation) is applied to the input. The model is trained to ensure that its
output doesn’t change significantly with these small input changes, thereby introducing
invariance to these transformations. The cost function includes a term that penalizes the
network for being overly sensitive to these variations.
• Application: Tangent Prop is used in scenarios where model invariance to specific
transformations is desirable, such as image and speech recognition. It helps the model learn to
ignore minor, non-informative variations in the input data.

3. Manifold Tangent Classifier

• Concept: The Manifold Tangent Classifier builds on the idea of Tangent Prop by focusing on
training a model to be invariant to perturbations along the manifold (or subspace) of valid data
transformations. The manifold is a lower-dimensional representation of valid data points,
capturing essential transformations like rotations and translations.
• Technique: This approach combines autoencoders with tangent propagation. The autoencoder
is trained to capture the underlying manifold of the data, and Tangent Prop is used to enforce
invariance to directions along this manifold. The goal is for the classifier to recognize data
variations along the manifold while being invariant to irrelevant noise.
• Application: The Manifold Tangent Classifier is particularly useful in high-dimensional data
settings like images, where data points lie on or near a lower-dimensional manifold. It’s
commonly applied in visual recognition tasks to improve robustness to transformations and
noise.

Summary of Differences and Connections

• Tangent Distance uses distance measurements that are invariant to small, predefined
transformations of the data, making it particularly suited for similarity-based methods like
nearest neighbors.
• Tangent Propagation applies transformation-based regularization to neural networks, guiding
them to be invariant to small changes.
• Manifold Tangent Classifier extends these ideas by incorporating manifold learning through
autoencoders to capture valid transformations of the data and improve invariance in
classification.

In essence, all three methods aim to make models more resilient to variations, with each
technique focusing on different aspects of transformation invariance. They have found
applications in computer vision and pattern recognition tasks, where recognizing the same object
under different transformations is crucial.

Figure 7.6: Dropout trains an ensemble consisting of all sub-networks that can be
constructed by removing non-output units from an underlying base network. Here, we
begin with a base network with two visible units and two hidden units. There are sixteen
possible subsets of these four units. We show all sixteen subnetworks that may be formed by
dropping out different subsets of units from the original network. In this small example, a large
proportion of the resulting networks have no input units or no path connecting the input to the
output. This problem becomes insignificant for networks with wider layers, where the probability
of dropping all possible paths from inputs to outputs becomes smaller.
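
In practice, dropout is applied as a layer that randomly zeroes units during training; the snippet
below is a small sketch assuming PyTorch, mirroring the two-input, two-hidden-unit example from the
figure.

# Dropout in practice: each forward pass during training samples a random sub-network.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(2, 2),     # two input units feeding two hidden units
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is zeroed with probability 0.5 while training
    nn.Linear(2, 1),     # output unit (output units are never dropped)
)

net.train()                        # dropout active; surviving activations are rescaled by 1/(1-p)
print(net(torch.randn(4, 2)))
net.eval()                         # dropout disabled at test time, approximating the ensemble average
print(net(torch.randn(4, 2)))
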
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an ‘8’ detector
on the dataset depicted above, containing an ‘8’, a ‘6’ and a ‘9’. Suppose we make two
different resampled datasets. The bagging training procedure is to construct each of these
datasets by sampling with replacement. The first dataset omits the ‘9’ and repeats the ‘8’. On this
dataset, the detector learns that a loop on top of the digit corresponds to an ‘8’. On the second
dataset, we repeat the ‘9’ and omit the ‘6’. In this case, the detector learns that a loop on the
bottom of the digit corresponds to an ‘8’. Each of these individual classification rules is brittle,
but if we average their output then the detector is robust, achieving maximal confidence only
when both loops of the ‘8’ are present.

Bagging is a method that allows the same kind of model, training algorithm and objective
function to be reused several times. Specifically, bagging
involves constructing k different datasets. Each dataset has the same number of
examples as the original dataset, but each dataset is constructed by sampling with
replacement from the original dataset. This means that, with high probability, each
dataset is missing some of the examples from the original dataset and also contains
several duplicate examples (on average around 2/3 of the examples from the original
dataset are found in the resulting training set, if it has the same size as the original). Model
i is then trained on dataset i. The differences between which examples are included in
each dataset result in
differences between the trained models. See Fig. 7.5 for an example.
Neural networks reach a wide enough variety of solution points that they can
often benefit from model averaging even if all of the models are trained on the same
dataset. Differences in random initialization, random selection of minibatches,
differences in hyperparameters, or different outcomes of non-deterministic
implementations of neural networks are often enough to cause different members of the
ensemble to make partially independent errors.
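
A small sketch of the bagging procedure described above, assuming NumPy; the dataset size, the
number of models k, and the train_model placeholder are assumptions for illustration. It also shows
that each bootstrap sample contains roughly two thirds of the unique original examples.

# Bagging: build k bootstrap datasets by sampling with replacement, then average the models' outputs.
import numpy as np

rng = np.random.default_rng(0)
n_examples, k = 1000, 5

bootstrap_indices = [rng.integers(0, n_examples, size=n_examples) for _ in range(k)]

for idx in bootstrap_indices:
    # Fraction of distinct original examples in this resampled dataset (about 0.63 on average).
    print(len(np.unique(idx)) / n_examples)

# models = [train_model(data[idx]) for idx in bootstrap_indices]            # hypothetical training step
# ensemble_output = np.mean([m.predict(x_new) for m in models], axis=0)     # average the ensemble's predictions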
