
Deep Learning

Concepts & Algorithms


Lu Lu

Department of Chemical and Biomolecular Engineering


Penn Institute for Computational Science
University of Pennsylvania

Tianyuan Mathematical Center in Southeast China


Dec 8, 2021
Deep Learning for Science and Engineering Teaching Kit

Deep Learning for Scientists and Engineers
Instructors: George Em Karniadakis, Khemraj Shukla, Lu Lu
Teaching Assistants: Vivek Oommen and Aniruddha Bora

(To be released in 2022)


https://www.import.io/post/history-of-deep-learning/
History of Deep Learning
❑ Artificial intelligence (AI) > Machine Learning (ML) > Deep Learning > Scientific Machine Learning (SciML).
❑ The expression “Deep Learning” was (probably) first used by Igor Aizenberg and colleagues around 2000.
❑ 1960s: Shallow Neural Networks.
❑ 1982: Hopfield Network – A Recurrent NN.
❑ 1988-89: Learning by backpropagation, Rumelhart, Hinton & Williams; hand-written text, LeCun.
❑ 1993: NVIDIA was founded; its GeForce was marketed as the first GPU.
❑ 1990s: Unsupervised Deep Learning.
❑ 1993: A Recurrent NN with 1,000 layers (Jürgen Schmidhuber).
❑ 1994: NN for solving PDEs, Dissanayake & Phan-Thien.
❑ 1998: Gradient-based learning, LeCun.
❑ 1998: ANN for solving ODEs & PDEs, Lagaris, Likas & Fotiadis.
❑ 1990-2000: Supervised Deep Learning.
❑ 2006: A fast learning algorithm for deep belief nets, Hinton.
❑ 2006-present: Modern Deep Learning.
❑ 2009: ImageNet: A large-scale hierarchical image database (Fei-Fei Li).
❑ 2010: GPUs are only up to 14 times faster than CPUs (Intel).
❑ 2010: Tackling the vanishing/exploding gradients: Glorot & Bengio.
❑ 2012: AlexNet – Convolutional NN (CNN) – Alex Krizhevsky.
❑ 2014: Generative Adversarial Networks (GANs) – Ian Goodfellow.
❑ 2015: Batch normalization, Ioffe & Szegedy.
❑ 2017: PINNs: Physics-Informed Neural Networks (Raissi, Perdikaris, Karniadakis).
❑ 2019: Scientific Machine Learning (ICERM workshop Jan. 2019; DOE report, Feb 2019).
❑ 2019: DeepONet – Operator regression (Lu, Jin, Karniadakis).

[Figure: nested circles — Artificial Intelligence ⊃ Machine Learning ⊃ Neural Networks ⊃ Scientific Machine Learning.]
Basic Research Needs Workshop for Scientific Machine Learning
Core Technologies for Artificial Intelligence, DOE ASCR Report, Feb 2019
❑ Scientific machine learning (SciML) is a core component of artificial intelligence (AI) and a computational
technology that can be trained, with scientific data, to augment or automate human skills.

❑ SciML must achieve the same level of scientific rigor expected of established methods deployed in science and applied mathematics. Basic requirements include validation and limits on inputs and context implicit in such validations, as well as verification of the basic algorithms to ensure they are capable of delivering known prototypical solutions.

❑ Can SciML achieve Robustness?

• NSF/ICERM Workshop on SciML, January 28-30, 2019 (Organizers: J. Hesthaven & G.E. Karniadakis)
• https://icerm.brown.edu/events/ht19-1-sml/
Fundamental Questions

Johns Hopkins
Turbulence Database

Workflow in a Neural Network
[Figure: a fully connected network with input x, hidden layers of σ units, output y, and data y*; the forward pass maps x to y, and the backward pass propagates the error Δy back through the layers.]

• Input layer (layer 0): the input x
• Hidden layers (layers 1 and 2): σ units
• Output layer (layer 3): the prediction y, compared with the data y*
• Forward pass: compute y from x
• Backward pass: propagate the error Δy back through the layers
A Neural Network for Regression
❑ Define the affine transformation in the ℓ-th layer:

    T_ℓ(x) = W_ℓ x + b_ℓ

❑ Activation function σ
   Popular choices: σ(x) = max(0, x) (Rectified Linear Unit, ReLU).

❑ The hidden layers of a feedforward neural network:

    x ↦ (σ ∘ T_{L−1}) ∘ ⋯ ∘ (σ ∘ T₁)(x),

   where ∘ denotes composition of functions.

❑ For regression, a DNN is typically of the form:

    N(x; θ) = T_L ∘ (σ ∘ T_{L−1}) ∘ ⋯ ∘ (σ ∘ T₁)(x)

   (number of hidden layers: L − 1; input dimension d; hidden-layer width N)

❑ Network parameters: θ = {W_ℓ, b_ℓ, ℓ = 1, …, L}
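A minimal sketch of such a feedforward regression network in PyTorch; the layer sizes and the use of Tanh instead of ReLU here are illustrative assumptions, not the slide's exact configuration:

import torch
import torch.nn as nn

class MLP(nn.Module):
    """n_hidden hidden layers of width `width`, followed by a final affine map T_L."""
    def __init__(self, d_in=2, width=20, n_hidden=3, d_out=1):
        super().__init__()
        layers = [nn.Linear(d_in, width), nn.Tanh()]
        for _ in range(n_hidden - 1):
            layers += [nn.Linear(width, width), nn.Tanh()]
        layers += [nn.Linear(width, d_out)]   # output layer: no activation for regression
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
y = model(torch.rand(8, 2))   # batch of 8 inputs of dimension d = 2; output shape (8, 1)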
A Neural Network for Classification
❑ For classification, define the softmax function for K classes:

    softmax(x)ᵢ = exp(xᵢ) / Σ_{j=1}^{K} exp(xⱼ),   i = 1, …, K

• Converts any vector to a probability vector (non-negative entries that sum to 1).

❑ The DNN is typically of the form

    N(x; θ) = softmax ∘ T_L ∘ (σ ∘ T_{L−1}) ∘ ⋯ ∘ (σ ∘ T₁)(x)
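For instance, a quick numerical check (the logit values are illustrative) that softmax maps arbitrary outputs to a probability vector:

import torch

logits = torch.tensor([2.0, 1.0, -1.0])   # raw network outputs (assumed values)
probs = torch.softmax(logits, dim=0)      # ≈ [0.705, 0.259, 0.035]
print(probs, probs.sum())                 # entries are positive and sum to 1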
Building Different NNs: ResNet

❑ Residual network (ResNet)

❑ Replace the plain layer map x ↦ F(x) with x ↦ F(x) + x

❑ The added map x ↦ x is the identity function (skip connection)

❑ A skip connection may also skip multiple layers.

[Figure: a block of layers F(x) with a skip connection adding the identity to its output.]
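A minimal residual block sketch in PyTorch; the width and the two-layer body are illustrative assumptions:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is a small fully connected sub-network."""
    def __init__(self, width=20):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)   # skip connection adds the identity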
Universal Function Approximation (single layer)
Definition. We say that σ is discriminatory if, for a measure μ ∈ M(Iₙ),

    ∫_{Iₙ} σ(yᵀx + θ) dμ(x) = 0   for all y ∈ ℝⁿ and θ ∈ ℝ

implies that μ = 0. (The space of finite, signed regular Borel measures on Iₙ = [0, 1]ⁿ is denoted by M(Iₙ).)

Definition. We say that σ is sigmoidal if σ(t) → 1 as t → +∞ and σ(t) → 0 as t → −∞.

Theorem 1. Let σ be any continuous discriminatory function. Then finite sums of the form

    G(x) = Σ_{j=1}^{N} αⱼ σ(yⱼᵀ x + θⱼ)

are dense in C(Iₙ). In other words, given any f ∈ C(Iₙ) and ε > 0, there is a sum G(x) of the above form for which

    |G(x) − f(x)| < ε   for all x ∈ Iₙ.

➢ Note: For a fixed number of terms N, the set of all such functions G does not form a vector space, since it is not closed under addition.

G. Cybenko, "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals and Systems, 2(4), 303-314, 1989.
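A numerical illustration of Theorem 1: a single hidden layer of sigmoids, i.e., exactly a finite sum G(x) = Σ αⱼ σ(yⱼ x + θⱼ), fit to a smooth target. The target sin(2πx), N = 20 terms, and the optimizer settings are illustrative assumptions:

import math
import torch

# Fit f(x) = sin(2*pi*x) on [0, 1] with a sum of N = 20 sigmoid terms.
x = torch.linspace(0, 1, 200).reshape(-1, 1)
f = torch.sin(2 * math.pi * x)

G = torch.nn.Sequential(
    torch.nn.Linear(1, 20),              # inner affine maps y_j * x + theta_j
    torch.nn.Sigmoid(),                  # sigmoidal sigma
    torch.nn.Linear(20, 1, bias=False),  # outer coefficients alpha_j
)
opt = torch.optim.Adam(G.parameters(), lr=0.01)
for _ in range(5000):
    opt.zero_grad()
    loss = torch.mean((G(x) - f) ** 2)
    loss.backward()
    opt.step()
print(loss.item())   # small: the finite sum G(x) fits f closely on the grid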
Universal Functional Approximation (single layer)

Theorem (Chen and Chen, 1993):

Suppose that U is a compact set in C[a, b], f is a continuous functional defined on U, and σ is a bounded sigmoidal function. Then for any ε > 0, there exist points a ≤ x₁ < x₂ < ⋯ < x_m ≤ b, a positive integer N, and constants cᵢ, θᵢ, ξᵢⱼ (i = 1, …, N, j = 1, …, m) such that

    | f(u) − Σ_{i=1}^{N} cᵢ σ( Σ_{j=1}^{m} ξᵢⱼ u(xⱼ) + θᵢ ) | < ε

holds for all u ∈ U.

T.P. Chen and H. Chen, "Approximations of continuous functionals by neural networks with application to dynamic systems", IEEE Transactions on Neural Networks, 4(6), 910-918, 1993.
Adaptive Basis Viewpoint

We consider a family of neural networks consisting of L − 1 hidden layers of width N, composed with a final linear layer, admitting the representation

    u_θ(x) = Σ_{i=1}^{N} cᵢ φᵢ(x; θ_h),

where c = (c₁, …, c_N) and θ_h are the parameters corresponding to the final linear layer and the hidden layers, respectively. We interpret θ as the concatenation of c and θ_h.

(number of hidden layers: L − 1; input dimension d; hidden-layer width N)

This viewpoint makes it clear that θ_h parameterizes the basis functions φᵢ (like the mesh and shape functions in FEM), while the cᵢ are just coefficients for these basis functions.
Shallow networks vs Deep networks
❑ Universal approximators: both shallow and deep networks can approximate continuous functions to any accuracy, but the required sizes differ.
❑ Shallow networks: the width must be allowed to grow arbitrarily.
❑ Deep networks: a fixed width of the order of the input dimension d suffices (for ReLU NNs) [Hanin & Sellke, 2017].

❑ From the approximation point of view: deep networks perform better than shallow ones of comparable size [Mhaskar, 1996].
❑ The number of neurons required to reach a given error grows with the dimension and decreases with the smoothness of the target function [Mhaskar, 1996].
❑ e.g., a 3-layer NN with 10 neurons per layer may be better than a 1-layer NN with 30 neurons.

❑ For compositional functions, deep networks can avoid the curse of dimensionality from which shallow networks suffer [Mhaskar & Poggio, 2016].

❑ There exist functions expressible by a small 2-hidden-layer NN which cannot be approximated by any shallow NN with the same accuracy, unless its width is exponential in the dimension [Eldan & Shamir, 2016].
❑ The number of neurons needed by a shallow NN to approximate a function is exponentially larger than the number of neurons needed by a deep NN for a given accuracy level [Liang & Srikant, 2017; Yarotsky, 2017].
Loss Functions

❑ To learn the network parameters θ of a model ŷ(x; θ)

❑ Given a dataset {(xᵢ, yᵢ)}, i = 1, …, n

❑ Mean Squared Error (MSE) loss:

    L(θ) = (1/n) Σ_{i=1}^{n} ‖ŷ(xᵢ; θ) − yᵢ‖²

❑ In general, let F be a linear/nonlinear operator and use the loss

    L(θ) = (1/n) Σ_{i=1}^{n} ‖F[ŷ](xᵢ) − F[y](xᵢ)‖²

• MSE is the special case with F being the identity.
• The PINN loss uses the PDE residual as the operator.
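A minimal sketch of these two loss choices in PyTorch; the 1-D Poisson residual u_xx − f used below is an illustrative assumption, not the slide's example:

import torch

def mse_loss(model, x, y):
    """Plain MSE: the operator F is the identity."""
    return torch.mean((model(x) - y) ** 2)

def pde_residual_loss(model, x, f):
    """Operator loss with F = PDE residual, here u_xx - f(x) = 0 (assumed 1-D Poisson)."""
    x = x.requires_grad_(True)
    u = model(x)
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return torch.mean((u_xx - f(x)) ** 2)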
Activation Functions: Conventional and Parameterized*

Tanh:        σ(x) = tanh(x)
Sigmoid:     σ(x) = 1 / (1 + e⁻ˣ)
ReLU:        σ(x) = max(0, x)
Leaky ReLU:  σ(x) = max(αx, x), with a small fixed α (e.g., 0.01)
ELU:         σ(x) = x for x ≥ 0, and α(eˣ − 1) for x < 0
Swish:       σ(x) = x · sigmoid(x)

* In the parameterized activation functions, consider σ(a x), where a is a trainable parameter.

[Figure: plots of the conventional activation functions and their parameterized counterparts.]
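A minimal sketch of such a parameterized (adaptive) activation in PyTorch; the initial value a = 1 and the choice of tanh are illustrative assumptions:

import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    """Parameterized activation sigma(a*x) with a trainable slope a."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))   # trained along with the weights and biases

    def forward(self, x):
        return torch.tanh(self.a * x)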
Differentiation: Four ways but only one counts: Automatic Differentiation (AD)
• Hand-coded analytical derivative

• Lots of human labor

• Error prone

• Numerical approximations, e.g., finite difference

• Two function evaluations (forward pass) per partial derivative

• Truncation errors

• Symbolic differentiation (used in software programs such as Mathematica, Maxima, Maple, and Python library SymPy)

• Chain rule

• Expression swell: Easily produce exponentially large symbolic representations

• Automatic differentiation (AD; also called algorithmic differentiation)

• Symbolic differentiation simplified by numerical evaluation of intermediate sub-expressions

• Does not provide a general analytical expression for the derivative

• But only the value of the derivative for a specific input 𝑥


Automatic Differentiation
❑ Exploits the fact that all computations are compositions of a small set of elementary expressions with known
derivatives
❑ Employs the chain rule to combine these elementary derivatives of the constituent expressions.

❑ Two ways to compute first-order derivative:


❑ Forward mode AD (details not discussed)
❑ Cost scales linearly w.r.t. the input dimension *
❑ Cost is constant w.r.t. the output dimension
❑ Reverse mode AD
❑ Cost is constant w.r.t. the input dimension
❑ Cost scales linearly w.r.t. the output dimension

❑ In deep learning, backpropagation == Reverse mode AD


❑ The input dimension of the loss function is # of parameters, e.g., millions +
❑ The output dimension is 1: the loss value

❑ High-order derivatives:

❑ Nested-derivative approach: Apply first-order AD repeatedly
❑ Cost scales exponentially in the order of differentiation
❑ What we will use in this class, because of the simplicity of implementation
❑ More efficient approaches exist, such as Taylor-mode AD (high-order chain rule)
❑ Not supported in TensorFlow/PyTorch yet
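A minimal sketch of reverse-mode AD and of the nested-derivative approach for a second derivative, using torch.autograd; the function x³ and the point x = 2 are illustrative assumptions:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3                                   # forward pass builds the computational graph

# Reverse mode (backpropagation): dy/dx at x = 2, i.e., 3*x^2 = 12
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Nested first-order AD for the second derivative: d2y/dx2 = 6*x = 12
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)
print(dy_dx.item(), d2y_dx2.item())          # 12.0, 12.0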
Backpropagation
❑ We apply the chain rule recursively to implement backprop
❑ Use computational graphs to accomplish backprop

❑ Example: a small computational graph built from + and * nodes

[Figure: the forward pass evaluates the graph from inputs to output; the backward pass traverses it in reverse, multiplying the local derivatives.]

By the chain rule (for the node d = a · c, with c = b + const in the example):

    ∂d/∂b = (∂d/∂c)(∂c/∂b) = a · 1 = a
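As a quick check of this hand computation, reverse-mode AD reproduces ∂d/∂b = a; the numeric values a = 3 and b = 2 below are illustrative assumptions:

import torch

a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)

c = b + 1.0      # local derivative dc/db = 1
d = a * c        # local derivative dd/dc = a

d.backward()     # reverse mode: accumulate gradients into a.grad and b.grad
print(b.grad)    # tensor(3.) == a, matching dd/db = a * 1
print(a.grad)    # tensor(3.) == c = b + 1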
Backpropagation

Example: A NN with one hidden neuron

Forward Pass Backward Pass

Lu et al., DeepXDE: A deep learning library for solving differential equations. SIAM Review, 2021.
Gradient Descent (GD)

▪ Some GD pathologies in non-convex loss landscapes

[Figure: a non-convex loss landscape, with the global minima marked.]

TensorFlow: tf.keras.optimizers.SGD(learning_rate=0.01)
PyTorch:    torch.optim.SGD(params, lr=0.01)
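The update these SGD calls perform is θ ← θ − η ∇_θ L(θ); a minimal hand-rolled sketch of one such step, assuming the gradients have already been populated by a backward pass:

import torch

def sgd_step(params, lr=0.01):
    """One plain gradient-descent update: theta <- theta - lr * grad."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad     # descend along the negative gradient
                p.grad.zero_()       # reset for the next backward pass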
Effect of Learning Rate
❑ In linear regression we have convexity (hence a global minimum), but we should still scale all features for faster convergence.

[Figure: loss plots for learning a noisy sine function with a fully connected neural network — learning rate too small, learning rate too large, and a good choice; convergence depends strongly on the learning rate.]

❑ An effective strategy is to use a variable/decaying learning rate
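One common way to implement such a decaying learning rate, sketched on a toy problem; the sine-fitting setup and the step/decay values are illustrative assumptions:

import torch

# Toy setup (illustrative): fit y = sin(x) with a tiny network.
model = torch.nn.Sequential(torch.nn.Linear(1, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
x = torch.linspace(0, 3.14, 100).reshape(-1, 1)
y = torch.sin(x)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Multiply the learning rate by 0.5 every 1000 steps (illustrative schedule).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

for step in range(5000):
    optimizer.zero_grad()
    loss = torch.mean((model(x) - y) ** 2)
    loss.backward()
    optimizer.step()
    scheduler.step()   # decay the learning rate according to the schedule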


Underfitting vs. Overfitting

❑ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set (low-capacity models).

❑ Overfitting occurs when the gap between the training error and test error is too high (high-capacity models).

Overfitting example:

❑ The neural network is forced to overfit by considering only 10 training points.

❑ The predicted function passes through all 10 training points, making the training loss (nearly) zero.

❑ The model fails to learn the underlying function.
Vanishing and Exploding Gradients

❑ Different layers may learn at hugely different rates: for most NN architectures the gradients become smaller
and smaller during backpropagation, leaving the weights of the lower layers nearly unchanged (vanishing gradients). In
recurrent NNs, in addition, the gradients may explode.

❑ Exploding gradient: multiplying 100 Gaussian random matrices (all linear layers) — see the sketch below.

❑ This was the main obstacle in training DNNs until the early 2000s.

❑ 2010: Breakthrough paper of Xavier Glorot & Yoshua Bengio, “Understanding the difficulty of training deep
feedforward neural networks”, Proc. 13th Int. Conf. on AI and Statistics, pp. 249-256.

❑ The main reasons were the then-popular sigmoid activation function and the normally distributed initial
weights. The variance of the layer outputs increases monotonically from input to output, and the activation
function then saturates at 0 and 1 in the deep layers. Note that the mean of this activation function is 0.5.
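A quick numerical illustration of the 100-random-matrix experiment mentioned above; the 100×100 matrix size and unit-variance Gaussian entries are illustrative assumptions:

import torch

torch.manual_seed(0)
x = torch.randn(100, dtype=torch.float64)
for _ in range(100):
    W = torch.randn(100, 100, dtype=torch.float64)   # one Gaussian random "linear layer"
    x = W @ x
print(x.norm())   # astronomically large (~1e101): the exploding regime
# Scaling each W by 1/sqrt(100) would instead keep the norm roughly constant.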
Xavier and He - Weight Initializations

❑ Variance of output of each layer = Variance of inputs to that layer.

❑ Gradients should have equal variance before and after flowing through a layer in the reverse
direction (fan-in/fan-out) – this led to Xavier (or Glorot) initializations.

❑ He initialization is similar, but designed for ReLU activations (He Normal).

Glorot (Xavier) Normal: W ~ N(0, 2 / (fan_in + fan_out))
He Normal:              W ~ N(0, 2 / fan_in)

PyTorch:
    w = torch.empty(5, 5)
    torch.nn.init.xavier_normal_(w)     # Glorot initialization
    torch.nn.init.kaiming_normal_(w)    # He initialization

TensorFlow:
    init = tf.initializers.GlorotNormal()   # or tf.initializers.HeNormal()
    w = init(shape=(4, 4))

K. He et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, IEEE International Conference on Computer Vision, pp. 1026-1034, 2015.
Data Normalization (Ref: https://zaffnet.github.io/batch-normalization)

In 1998, Yann LeCun, in his paper Efficient BackProp, highlighted the importance of normalizing the inputs.
Preprocessing of the inputs using normalization is a standard machine learning procedure and is known to
speed up convergence. Normalization is done to achieve the following objectives (a minimal sketch follows the list):

❑ The average of each input variable (or feature) over the training set is close to zero (Mean subtraction).

❑ The covariances of the features are about the same (Scaling).

❑ Decorrelate the features (Whitening – not required for CNNs).
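A minimal sketch of the first two steps (mean subtraction and scaling), computed from the training set only; the whitening step is omitted here:

import torch

def normalize(x_train, x_test):
    """Standardize features using statistics of the training set only."""
    mean = x_train.mean(dim=0, keepdim=True)
    std = x_train.std(dim=0, keepdim=True) + 1e-8   # avoid division by zero
    return (x_train - mean) / std, (x_test - mean) / std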


Data Normalization - Example

[Figure: training results without data normalization vs. with data normalization.]
An Overview of Gradient Descent Optimization Algorithms
https://ruder.io/optimizing-gradient-descent/
https://arxiv.org/abs/1609.04747:
▪ This post explores how many of the most popular gradient-based optimization algorithms actually work.

A. This movie shows the behaviour of the algorithms at a saddle point. Notice that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.

B. In this movie, we see their behavior on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness by looking ahead, and heads to the minimum.

➢ These two animations (Images credit: Alec Radford) provide some intuitions towards the optimization behavior of most of the presented optimization methods.
What Optimizer to Use?
Adam Optimizer: Adaptive moment based optimizer

❑ Adam (adaptive moment estimation) is a hybrid method that combines the ideas of momentum optimization and RMSProp.

❑ Similar to momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like
RMSProp, it keeps track of an exponentially decaying average of past squared gradients.

❑ Steps 1, 2, and 5 in algorithm (below) reveal Adam’s close similarity to both momentum optimization and RMSProp.
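The algorithm referred to above is the standard Adam update; for reference (textbook form, with gradient g = ∇_θ L(θ) at iteration t and element-wise operations):

1. m ← β₁ m + (1 − β₁) g            (momentum-like first moment)
2. s ← β₂ s + (1 − β₂) g ⊙ g        (RMSProp-like second moment)
3. m̂ ← m / (1 − β₁ᵗ)                (bias correction)
4. ŝ ← s / (1 − β₂ᵗ)                (bias correction)
5. θ ← θ − η m̂ / (√ŝ + ε)           (parameter update)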

PyTorch:
    # Import Optimizers bundle
    opt_class = torch.optim
    # Adam optimizer
    opt = opt_class.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999))

TensorFlow:
    # Import Optimizers bundle
    opt_class = tf.keras.optimizers
    # Adam optimizer
    opt = opt_class.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)

What Optimizer to Use?
Adam Optimizer: Adaptive moment based optimizer

❑ Steps 3 and 4 can be explained as follows: since m and s are initialized at 0, they will be biased toward 0 at the beginning of
training, so these two steps will help boost m and s at the beginning of training.

❑ The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often
initialized to 0.999. The smoothing term ε is usually initialized to a small number such as 10⁻⁷.

❑ Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate hyperparameter η. We can
often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.


What Optimizer to Use?
L-BFGS Optimizer

▪ Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 503-528, 1989.

❑ BFGS is the most popular of all quasi-Newton methods and has O(n²) storage complexity, where n is the number of parameters.

❑ L-BFGS (limited-memory BFGS) does not store the approximate inverse Hessian explicitly; instead it stores the previous m update pairs and computes the search direction directly from this data. L-BFGS has a storage complexity of O(mn).

❑ L-BFGS implementation is not straightforward in PyTorch and TF2. A detailed implementation will be discussed in Lecture
4, but here we provide a simple API for both.

The L-BFGS implementation in TensorFlow is provided through the TensorFlow Probability package.

PyTorch:
    torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None,
                      tolerance_grad=1e-07, tolerance_change=1e-09,
                      history_size=100, line_search_fn=None)

TensorFlow Probability:
    tfp.optimizer.lbfgs_minimize(f,
                                 initial_position=self.get_weights(),
                                 num_correction_pairs=50,
                                 max_iterations=2000)
Loss Regularizers

Loss Regularizers:

Collapse of Deep and Narrow ReLU Neural Networks

L. Lu, Y. Shin, Y. Su, & G. E. Karniadakis. Dying ReLU and initialization: Theory and numerical examples. Communications in Computational Physics, 28(5), 1671–1706, 2020.
ReLU Collapse: One-Dimensional Examples

ReLU Collapse: One-Dimensional Examples

ReLU Collapse: Two-Dimensional Example

Does the Loss Type Matter?

Theoretical Analysis - I

Theoretical Analysis - I

Theoretical Analysis - II

Theoretical Analysis - III

Theoretical versus Numerical Results

Theoretical versus Numerical Results

Motivation and need for other neural networks
Cause: Diverse data formats and learning tasks
Rescue: Different neural network architectures

• Tabular data: Breast Cancer Wisconsin (Prognostic) Data Set
• Image data: velocity and pressure fields for fluid flow
• Time series data: Contiguous U.S. Average Temperature

Convolutional Neural Network (CNN)

[Figure: a convolutional layer applied to an input volume (width × height × depth) with 8 filters of size 2×2.]
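A minimal sketch of such a convolutional layer in PyTorch; the single input channel and 2×2 kernels mirror the figure, while the image size is an illustrative assumption:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=2)   # 8 filters of size 2x2
x = torch.rand(1, 1, 28, 28)   # (batch, channels, height, width); size assumed
y = conv(x)
print(y.shape)                 # torch.Size([1, 8, 27, 27])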
FNN vs CNN
[Figure: side-by-side comparison of a fully connected NN (FNN) and a convolutional NN (CNN).]
CNN kernel vs finite difference stencil
❑ A partial derivative on a grid can be approximated by a finite difference stencil, which acts exactly like a convolution kernel.

❑ 2D Laplacian (Δ): the standard 5-point stencil corresponds to the finite difference kernel

        (1/h²) ·  [ 0   1   0
                    1  -4   1
                    0   1   0 ]

❑ A learned CNN kernel can recover (or generalize) such finite difference stencils.
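A minimal sketch of this correspondence: applying the 5-point Laplacian stencil as a fixed convolution kernel (grid spacing h = 1 and the 32×32 grid are illustrative assumptions):

import torch
import torch.nn.functional as F

# 5-point Laplacian stencil as a 1x1x3x3 convolution kernel (h = 1)
kernel = torch.tensor([[0., 1., 0.],
                       [1., -4., 1.],
                       [0., 1., 0.]]).reshape(1, 1, 3, 3)

u = torch.rand(1, 1, 32, 32)     # a field sampled on a 32x32 grid (assumed size)
lap_u = F.conv2d(u, kernel)      # discrete Laplacian at the interior grid points
print(lap_u.shape)               # torch.Size([1, 1, 30, 30])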
Resources
• Deep Learning. Ian Goodfellow, Yoshua Bengio, & Aaron Courville.
MIT Press, 2016. http://www.deeplearningbook.org
• Dive into Deep Learning. https://d2l.ai
• https://github.com/lululxvi/tutorials
