Foundations of Deep Learning
Julio Cesar Mendoza Bobadilla
Institute of Computing
University of Campinas
January 11, 2019
Definitions
Representation Learning. "A set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for some task" [1]
Deep Learning. "Representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level into a higher, slightly more abstract one" [1]
[1] LeCun et al., Deep learning, 2015
Traditional Machine Learning Pipeline
- Traditional machine learning algorithms are based on high-level attributes or features.
- This requires feature engineering.
Representation Learning Pipeline
- Learn features and classifier directly from raw data.
- Also known as end-to-end learning.
Deep learning
- Learning a hierarchy of representations that build on each other, from simple to complex.
Multilayer Perceptron
- Pre-activation for layer k (with h^(0)(x) = x):
  a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)
- Hidden layer activation:
  h^(k)(x) = g(a^(k)(x))
- Output layer activation:
  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
[Figure: network diagram with inputs x1, x2, x3, weights W^(1), W^(2), W^(3), biases b^(1), b^(2), b^(3), and hidden activations h^(1)(x), h^(2)(x)]
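The layer equations above can be sketched in NumPy. This is a minimal illustration only: the layer sizes, random weights, and function names are arbitrary choices, not part of the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())  # shift by max(a) for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """MLP forward pass: h^(0) = x, a^(k) = b^(k) + W^(k) h^(k-1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(b + W @ h)       # hidden layers: h^(k) = g(a^(k))
    W, b = weights[-1], biases[-1]
    return softmax(b + W @ h)        # output layer: o(a^(L+1)) = f(x)

# Tiny network: 3 inputs -> 4 hidden units -> 2 classes (random weights)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
p = forward(np.array([1.0, -1.0, 0.5]), weights, biases)
print(p)  # class probabilities, summing to one
```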
Activation function
Sigmoid activation function
- Squashes pre-activations between 0 and 1
- Always positive
- Bounded
- Strictly increasing
g(a) = 1 / (1 + exp(−a))
[Figure: sigmoid curve on the interval −4 to 4]
Activation function
Rectified linear (ReLU) activation function
- Always non-negative
- Not upper bounded
- Monotonically increasing (constant at 0 for a ≤ 0, so not strictly)
- Tends to give neurons with sparse activities
g(a) = max(0, a)
[Figure: ReLU curve on the interval −4 to 4]
Activation function
Softmax activation function
softmax(a)_c = exp(a_c) / Σ_{i=1}^{C} exp(a_i)
- Used for multiclass classification.
- Each activation is the conditional probability of class c: f(x)_c = p(y = c|x)
o(a) = softmax(a) = [ exp(a_1)/Σ_{i=1}^{C} exp(a_i), ..., exp(a_C)/Σ_{i=1}^{C} exp(a_i) ]^T
- Strictly positive
- Sums to one
- The predicted class is the one with the highest estimated probability.
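A minimal sketch of the softmax, with the standard max-subtraction trick (softmax(a) = softmax(a − c) for any constant c, so subtracting max(a) avoids overflow). The input values are arbitrary:

```python
import numpy as np

def softmax(a):
    # naive exp(a) would overflow for large pre-activations;
    # shifting by max(a) leaves the result unchanged
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])
p = softmax(a)
print(p)            # strictly positive, sums to one
print(np.argmax(p)) # predicted class: highest estimated probability
```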
Forward propagation
- Can be represented as an acyclic flow graph.
- The output of each box can be calculated given its parents. [2]
[2] Hugo Larochelle, Neural Networks, 2017
Optimization
We want to find the x ∈ θ such that f(x) is minimum.
- Suppose we start at point u. Which is the best direction to search for a point v with f(v) < f(u)?
- If the slope at u is negative, f decreases to the right: go right! If the slope is positive, go left!
- General principle of gradient descent optimization:
  v = u − α (∂f/∂u)
  where α is the learning rate.
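The update rule v = u − α (∂f/∂u) can be sketched on a simple one-dimensional function; the quadratic, step size, and step count here are illustrative choices:

```python
# Gradient descent on f(u) = (u - 3)^2, whose derivative is f'(u) = 2(u - 3).
def grad_descent(u, alpha=0.1, steps=100):
    for _ in range(steps):
        u = u - alpha * 2.0 * (u - 3.0)   # v = u - alpha * df/du
    return u

u_star = grad_descent(10.0)
print(u_star)  # converges toward the minimum at u = 3
```

Note the sign: when the slope is positive the step is negative (move left), and vice versa.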
Empirical risk minimization
- Framework for designing learning algorithms:
  argmin_θ (1/T) Σ_t L(f(x^(t); θ), y^(t)) + λ Ω(θ)
  where L is the loss function and Ω(θ) is the regularizer.
- We should optimize the classification error, but it is not smooth.
- The loss function is a smooth surrogate for what we truly should optimize.
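The objective above is just an average of per-example losses plus a weighted regularizer; a minimal sketch (the per-example loss values, θ, and λ here are made up for illustration):

```python
import numpy as np

def empirical_risk(losses, theta, lam):
    """(1/T) * sum_t L_t + lambda * Omega(theta), with Omega = sum of squared parameters."""
    return np.mean(losses) + lam * np.sum(theta ** 2)

losses = np.array([0.2, 0.5, 0.1])   # per-example loss values L(f(x^(t)), y^(t))
theta = np.array([1.0, -2.0])
risk = empirical_risk(losses, theta, lam=0.01)
print(risk)
```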
Stochastic Gradient Descent
- Performs updates after each training example.
procedure StochasticGradientDescent
    require: network parameters θ = (W^(1), b^(1), ..., W^(L+1), b^(L+1))
    initialize θ
    for N epochs do
        for each training sample (x^(t), y^(t)) do
            ∆ = −∇_θ L(f(x^(t); θ), y^(t)) − λ∇_θ Ω(θ)
            θ = θ + α∆
- To apply the SGD algorithm we require:
  - The loss function L(f(x^(t); θ), y^(t))
  - The regularizer Ω(θ)
  - A procedure to compute the parameter gradients ∇_θ L(f(x^(t); θ), y^(t)) and λ∇_θ Ω(θ)
  - A procedure to initialize the parameters
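The procedure above can be sketched for the simplest case, L2-regularized logistic regression (a one-layer network); the toy dataset, learning rate, and epoch count are arbitrary:

```python
import numpy as np

def sgd_logistic(X, y, alpha=0.5, lam=0.01, epochs=100, seed=0):
    """Per-example SGD: Delta = -grad(L) - lam * grad(Omega); theta += alpha * Delta."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # initialize parameters
    b = 0.0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):               # one update per training example
            p = 1.0 / (1.0 + np.exp(-(w @ x_t + b)))   # f(x^(t); theta)
            delta_w = -((p - y_t) * x_t) - lam * 2.0 * w
            delta_b = -(p - y_t)
            w += alpha * delta_w
            b += alpha * delta_b
    return w, b

# Toy data: the label equals the first coordinate
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])
w, b = sgd_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)
```

The loop instantiates all four requirements from the slide: a loss (cross-entropy), a regularizer (L2), gradients of both, and an initialization.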
Loss function
- On classification problems, neural networks estimate:
  f(x)_c = p(y = c|x)
- The cross-entropy loss minimizes the negative log-likelihood:
  L(f(x), y) = − Σ_c 1_{(y=c)} log f(x)_c    (1)
- The log is used for numerical stability and mathematical simplicity.
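Since the indicator 1_{(y=c)} selects a single term, equation (1) reduces to −log f(x)_y; a minimal sketch with made-up probabilities:

```python
import numpy as np

def cross_entropy(f_x, y):
    """L(f(x), y) = -sum_c 1(y=c) log f(x)_c = -log f(x)_y for class label y."""
    return -np.log(f_x[y])

f_x = np.array([0.7, 0.2, 0.1])       # network output: class probabilities
loss_correct = cross_entropy(f_x, 0)  # confident and correct -> small loss
loss_wrong = cross_entropy(f_x, 2)    # low probability on true class -> large loss
print(loss_correct, loss_wrong)
```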
Computing gradients
Performed using the backpropagation algorithm.
- Use the chain rule to efficiently compute gradients from top to bottom.
- The ▷ annotations show the worked case of a softmax output layer with cross-entropy loss.
procedure Backpropagation
    Gradient of the loss on the output layer:
        g ← ∇_{f(x)} J = ∇_{f(x)} L(f(x), y)    ▷ ∂J/∂f_i = −y_i / f_i
    for k = L + 1, L, ..., 1 do
        Gradient on the pre-activation (element-wise product with the activation derivative):
            g ← ∇_{a^(k)(x)} J = g ⊙ g′(a^(k)(x))    ▷ ∂J/∂a_i = Σ_j (∂J/∂f_j)(∂f_j/∂a_i) = f_i − y_i
        Gradient on the bias of layer k:
            ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)    ▷ ∂J/∂b_i = Σ_j (∂J/∂a_j)(∂a_j/∂b_i) = f_i − y_i
        Gradient on the weights of layer k:
            ∇_{W^(k)} J = g h^(k−1)(x)^T + λ∇_{W^(k)} Ω(θ)    ▷ ∂J/∂W_ij = Σ_k (∂J/∂a_k)(∂a_k/∂W_ij) = (f_j − y_j) h_i
        Propagate the gradients to the activations one level below:
            g ← ∇_{h^(k−1)(x)} J = W^(k)T g    ▷ ∂J/∂h_i = Σ_j (∂J/∂a_j)(∂a_j/∂h_i) = Σ_j W_ij (f_j − y_j)
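Backpropagation can be sketched concretely for one sigmoid hidden layer with a softmax output and cross-entropy loss, checked against a finite difference. The network sizes and random data are arbitrary, and the regularizer is omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    a1 = b1 + W1 @ x
    h1 = sigmoid(a1)
    a2 = b2 + W2 @ h1
    e = np.exp(a2 - a2.max())
    return a1, h1, e / e.sum()          # f(x) = softmax(a2)

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of J = -log f(x)_y via the chain rule."""
    a1, h1, f = forward(x, W1, b1, W2, b2)
    g = f.copy(); g[y] -= 1.0           # output pre-activation gradient: f - y
    dW2, db2 = np.outer(g, h1), g       # grad on W^(2), b^(2)
    g = W2.T @ g                        # propagate to h^(1)
    g = g * h1 * (1 - h1)               # grad on a^(1): elementwise sigmoid'
    dW1, db1 = np.outer(g, x), g
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
dW1, db1, dW2, db2 = backprop(x, y, W1, b1, W2, b2)

# Finite-difference check on one weight
eps = 1e-6
loss = lambda W: -np.log(forward(x, W, b1, W2, b2)[2][y])
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(W1p) - loss(W1)) / eps
print(numeric, dW1[0, 0])  # the two estimates should agree closely
```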
Backward propagation
- Each node computes the gradient of the loss with respect to its own value.
- We first calculate the gradients of the upper nodes.
- This enables automatic differentiation.
Activation function
Sigmoid activation function
- Partial derivative:
  g′(a) = g(a)(1 − g(a))
g(a) = 1 / (1 + exp(−a))
[Figure: sigmoid curve on the interval −4 to 4]
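The identity g′(a) = g(a)(1 − g(a)) can be verified numerically with a central difference (the evaluation point is arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, eps = 0.7, 1e-6
analytic = sigmoid(a) * (1 - sigmoid(a))            # g'(a) = g(a)(1 - g(a))
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```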
Activation function
Rectified linear activation function
- Partial derivative:
  g′(a) = 1_{a>0}
g(a) = max(0, a)
[Figure: ReLU curve on the interval −4 to 4]
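The indicator derivative 1_{a>0} can be checked the same way, at points away from the kink at a = 0 (where g is not differentiable); the test points are arbitrary:

```python
relu = lambda a: max(0.0, a)             # g(a) = max(0, a)

def relu_grad(a):
    return 1.0 if a > 0 else 0.0         # g'(a) = 1 if a > 0 else 0

eps = 1e-6
for a in (-2.0, 3.0):                    # away from the kink at a = 0
    numeric = (relu(a + eps) - relu(a - eps)) / (2 * eps)
    print(a, relu_grad(a), numeric)
```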
Regularization
- L2 regularization:
  Ω(θ) = Σ_k Σ_i Σ_j (W^(k)_{i,j})²
- Gradient:
  ∇_{W^(k)} Ω(θ) = 2W^(k)
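A minimal check that the gradient of the squared-weight sum is indeed 2W (the matrix values are made up):

```python
import numpy as np

def omega(Ws):
    """L2 regularizer: sum of squared weights over all weight matrices."""
    return sum(np.sum(W ** 2) for W in Ws)

W = np.array([[1.0, -2.0], [0.5, 3.0]])
grad = 2 * W                             # analytic gradient: 2 W^(k)
eps = 1e-6
Wp = W.copy(); Wp[0, 1] += eps
numeric = (omega([Wp]) - omega([W])) / eps
print(grad[0, 1], numeric)  # the two values should agree closely
```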
Applications
- Develop statistical models that can discover underlying structure, semantic relations, and constraints from data.
- Applications:
  - Speech Recognition
  - Computer Vision
  - Recommendation Systems
  - Language Understanding
  - Robotics