Foundations of Deep Learning
Julio Cesar Mendoza Bobadilla
Institute of Computing
University of Campinas
January 11, 2019
Definitions
Representation Learning. "A set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for some task" [1]
Deep Learning. "Representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level into a higher, slightly more abstract one" [1]
[1] LeCun et al., Deep learning, 2015
Traditional Machine Learning Pipeline
- Traditional machine learning algorithms are based on high-level attributes or features.
- This requires feature engineering.
Representation Learning Pipeline
- Learn features and classifier directly from raw data.
- Also known as end-to-end learning.
Deep learning
- Learning a hierarchy of representations that build on each other, from simple to complex.
Multilayer Perceptron
- Pre-activation for layer k (with h^(0)(x) = x):
  a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)
- Hidden layer activation:
  h^(k)(x) = g(a^(k)(x))
- Output layer activation:
  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
[Figure: network diagram with inputs x1, x2, x3, weights W^(1), W^(2), W^(3), biases b^(1), b^(2), b^(3), and hidden activations h^(1)(x), h^(2)(x)]
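The layer equations above can be sketched in NumPy. This is a minimal illustration only: the layer sizes, random weights, and function names are arbitrary choices, not part of the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())  # shift by max(a) for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """MLP forward pass: h^(0) = x, a^(k) = b^(k) + W^(k) h^(k-1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(b + W @ h)       # hidden layers: h^(k) = g(a^(k))
    W, b = weights[-1], biases[-1]
    return softmax(b + W @ h)        # output layer: o(a^(L+1)) = f(x)

# Tiny network: 3 inputs -> 4 hidden units -> 2 classes (random weights)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
p = forward(np.array([1.0, -1.0, 0.5]), weights, biases)
print(p)  # class probabilities, summing to one
```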
Activation function
Sigmoid activation function
- Squashes pre-activations between 0 and 1
- Always positive
- Bounded
- Strictly increasing
g(a) = 1 / (1 + exp(−a))
[Figure: sigmoid curve on the interval −4 to 4]
Activation function
Rectified linear (ReLU) activation function
- Always non-negative
- Not upper bounded
- Monotonically increasing (constant at 0 for a ≤ 0, so not strictly)
- Tends to give neurons with sparse activities
g(a) = max(0, a)
[Figure: ReLU curve on the interval −4 to 4]
Activation function
Softmax activation function
softmax(a)_c = exp(a_c) / Σ_{i=1}^{C} exp(a_i)
- Used for multiclass classification.
- Each activation is the conditional probability of class c: f(x)_c = p(y = c|x)
o(a) = softmax(a) = [ exp(a_1)/Σ_{i=1}^{C} exp(a_i), ..., exp(a_C)/Σ_{i=1}^{C} exp(a_i) ]^T
- Strictly positive
- Sums to one
- The predicted class is the one with the highest estimated probability.
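A minimal sketch of the softmax, with the standard max-subtraction trick (softmax(a) = softmax(a − c) for any constant c, so subtracting max(a) avoids overflow). The input values are arbitrary:

```python
import numpy as np

def softmax(a):
    # naive exp(a) would overflow for large pre-activations;
    # shifting by max(a) leaves the result unchanged
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])
p = softmax(a)
print(p)            # strictly positive, sums to one
print(np.argmax(p)) # predicted class: highest estimated probability
```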
Forward propagation
- Can be represented as an acyclic flow graph.
- The output of each box can be calculated given its parents. [2]
[2] Hugo Larochelle, Neural Networks, 2017
Optimization
We want to find the x ∈ θ such that f(x) is minimum.
- Suppose we start at point u. Which is the best direction to search for a point v with f(v) < f(u)?
- If the slope at u is negative, f decreases to the right: go right! If the slope is positive, go left!
- General principle of gradient descent optimization:
  v = u − α (∂f/∂u)
  where α is the learning rate.
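The update rule v = u − α (∂f/∂u) can be sketched on a simple one-dimensional function; the quadratic, step size, and step count here are illustrative choices:

```python
# Gradient descent on f(u) = (u - 3)^2, whose derivative is f'(u) = 2(u - 3).
def grad_descent(u, alpha=0.1, steps=100):
    for _ in range(steps):
        u = u - alpha * 2.0 * (u - 3.0)   # v = u - alpha * df/du
    return u

u_star = grad_descent(10.0)
print(u_star)  # converges toward the minimum at u = 3
```

Note the sign: when the slope is positive the step is negative (move left), and vice versa.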
Empirical risk minimization
- Framework for designing learning algorithms:
  argmin_θ (1/T) Σ_t L(f(x^(t); θ), y^(t)) + λ Ω(θ)
  where L is the loss function and Ω(θ) is the regularizer.
- We should optimize the classification error, but it is not smooth.
- The loss function is a smooth surrogate for what we truly should optimize.
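The objective above is just an average of per-example losses plus a weighted regularizer; a minimal sketch (the per-example loss values, θ, and λ here are made up for illustration):

```python
import numpy as np

def empirical_risk(losses, theta, lam):
    """(1/T) * sum_t L_t + lambda * Omega(theta), with Omega = sum of squared parameters."""
    return np.mean(losses) + lam * np.sum(theta ** 2)

losses = np.array([0.2, 0.5, 0.1])   # per-example loss values L(f(x^(t)), y^(t))
theta = np.array([1.0, -2.0])
risk = empirical_risk(losses, theta, lam=0.01)
print(risk)
```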
Stochastic Gradient Descent
- Performs updates after each training example.
procedure StochasticGradientDescent
    require: network parameters θ = (W^(1), b^(1), ..., W^(L+1), b^(L+1))
    initialize θ
    for N epochs do
        for each training sample (x^(t), y^(t)) do
            ∆ = −∇_θ L(f(x^(t); θ), y^(t)) − λ∇_θ Ω(θ)
            θ = θ + α∆
- To apply the SGD algorithm we require:
  - The loss function L(f(x^(t); θ), y^(t))
  - The regularizer Ω(θ)
  - A procedure to compute the parameter gradients ∇_θ L(f(x^(t); θ), y^(t)) and λ∇_θ Ω(θ)
  - A procedure to initialize the parameters
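The procedure above can be sketched for the simplest case, L2-regularized logistic regression (a one-layer network); the toy dataset, learning rate, and epoch count are arbitrary:

```python
import numpy as np

def sgd_logistic(X, y, alpha=0.5, lam=0.01, epochs=100, seed=0):
    """Per-example SGD: Delta = -grad(L) - lam * grad(Omega); theta += alpha * Delta."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # initialize parameters
    b = 0.0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):               # one update per training example
            p = 1.0 / (1.0 + np.exp(-(w @ x_t + b)))   # f(x^(t); theta)
            delta_w = -((p - y_t) * x_t) - lam * 2.0 * w
            delta_b = -(p - y_t)
            w += alpha * delta_w
            b += alpha * delta_b
    return w, b

# Toy data: the label equals the first coordinate
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])
w, b = sgd_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)
```

The loop instantiates all four requirements from the slide: a loss (cross-entropy), a regularizer (L2), gradients of both, and an initialization.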
Loss function
- On classification problems, neural networks estimate:
  f(x)_c = p(y = c|x)
- The cross-entropy loss minimizes the negative log-likelihood:
  L(f(x), y) = − Σ_c 1_{(y=c)} log f(x)_c    (1)
- The log is used for numerical stability and mathematical simplicity.
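Since the indicator 1_{(y=c)} selects a single term, equation (1) reduces to −log f(x)_y; a minimal sketch with made-up probabilities:

```python
import numpy as np

def cross_entropy(f_x, y):
    """L(f(x), y) = -sum_c 1(y=c) log f(x)_c = -log f(x)_y for class label y."""
    return -np.log(f_x[y])

f_x = np.array([0.7, 0.2, 0.1])       # network output: class probabilities
loss_correct = cross_entropy(f_x, 0)  # confident and correct -> small loss
loss_wrong = cross_entropy(f_x, 2)    # low probability on true class -> large loss
print(loss_correct, loss_wrong)
```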
Computing gradients
Performed using the backpropagation algorithm.
- Use the chain rule to efficiently compute gradients from top to bottom.
- The ▷ annotations show the worked case of a softmax output layer with cross-entropy loss.
procedure Backpropagation
    Gradient of the loss on the output layer:
        g ← ∇_{f(x)} J = ∇_{f(x)} L(f(x), y)    ▷ ∂J/∂f_i = −y_i / f_i
    for k = L + 1, L, ..., 1 do
        Gradient on the pre-activation (element-wise product with the activation derivative):
            g ← ∇_{a^(k)(x)} J = g ⊙ g′(a^(k)(x))    ▷ ∂J/∂a_i = Σ_j (∂J/∂f_j)(∂f_j/∂a_i) = f_i − y_i
        Gradient on the bias of layer k:
            ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)    ▷ ∂J/∂b_i = Σ_j (∂J/∂a_j)(∂a_j/∂b_i) = f_i − y_i
        Gradient on the weights of layer k:
            ∇_{W^(k)} J = g h^(k−1)(x)^T + λ∇_{W^(k)} Ω(θ)    ▷ ∂J/∂W_ij = Σ_k (∂J/∂a_k)(∂a_k/∂W_ij) = (f_j − y_j) h_i
        Propagate the gradients to the activations one level below:
            g ← ∇_{h^(k−1)(x)} J = W^(k)T g    ▷ ∂J/∂h_i = Σ_j (∂J/∂a_j)(∂a_j/∂h_i) = Σ_j W_ij (f_j − y_j)
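Backpropagation can be sketched concretely for one sigmoid hidden layer with a softmax output and cross-entropy loss, checked against a finite difference. The network sizes and random data are arbitrary, and the regularizer is omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    a1 = b1 + W1 @ x
    h1 = sigmoid(a1)
    a2 = b2 + W2 @ h1
    e = np.exp(a2 - a2.max())
    return a1, h1, e / e.sum()          # f(x) = softmax(a2)

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of J = -log f(x)_y via the chain rule."""
    a1, h1, f = forward(x, W1, b1, W2, b2)
    g = f.copy(); g[y] -= 1.0           # output pre-activation gradient: f - y
    dW2, db2 = np.outer(g, h1), g       # grad on W^(2), b^(2)
    g = W2.T @ g                        # propagate to h^(1)
    g = g * h1 * (1 - h1)               # grad on a^(1): elementwise sigmoid'
    dW1, db1 = np.outer(g, x), g
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
dW1, db1, dW2, db2 = backprop(x, y, W1, b1, W2, b2)

# Finite-difference check on one weight
eps = 1e-6
loss = lambda W: -np.log(forward(x, W, b1, W2, b2)[2][y])
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(W1p) - loss(W1)) / eps
print(numeric, dW1[0, 0])  # the two estimates should agree closely
```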
Backward propagation
- Each node computes the gradient of the loss with respect to its own value.
- We first calculate the gradients of the upper nodes.
- This enables automatic differentiation.
Activation function
Sigmoid activation function
- Partial derivative:
  g′(a) = g(a)(1 − g(a))
g(a) = 1 / (1 + exp(−a))
[Figure: sigmoid curve on the interval −4 to 4]
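The identity g′(a) = g(a)(1 − g(a)) can be verified numerically with a central difference (the evaluation point is arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
analytic = sigmoid(a) * (1 - sigmoid(a))            # g'(a) = g(a)(1 - g(a))
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```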
Activation function
Rectified linear activation function
- Partial derivative:
  g′(a) = 1_{a>0}
g(a) = max(0, a)
[Figure: ReLU curve on the interval −4 to 4]
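The indicator derivative 1_{a>0} can be checked the same way, at points away from the kink at a = 0 (where g is not differentiable); the test points are arbitrary:

```python
relu = lambda a: max(0.0, a)             # g(a) = max(0, a)

def relu_grad(a):
    return 1.0 if a > 0 else 0.0         # g'(a) = 1 if a > 0 else 0

eps = 1e-6
for a in (-2.0, 3.0):                    # away from the kink at a = 0
    numeric = (relu(a + eps) - relu(a - eps)) / (2 * eps)
    print(a, relu_grad(a), numeric)
```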
Regularization
- L2 regularization:
  Ω(θ) = Σ_k Σ_i Σ_j (W^(k)_{i,j})²
- Gradient:
  ∇_{W^(k)} Ω(θ) = 2W^(k)
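A minimal check that the gradient of the squared-weight sum is indeed 2W (the matrix values are made up):

```python
import numpy as np

def omega(Ws):
    """L2 regularizer: sum of squared weights over all weight matrices."""
    return sum(np.sum(W ** 2) for W in Ws)

W = np.array([[1.0, -2.0], [0.5, 3.0]])
grad = 2 * W                             # analytic gradient: 2 W^(k)
eps = 1e-6
Wp = W.copy(); Wp[0, 1] += eps
numeric = (omega([Wp]) - omega([W])) / eps
print(grad[0, 1], numeric)  # the two values should agree closely
```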
Applications
- Develop statistical models that can discover underlying structure, semantic relations, and constraints from data.
- Applications:
  - Speech Recognition
  - Computer Vision
  - Recommendation Systems
  - Language Understanding
  - Robotics