Artificial Neural Networks 3:
Deep Learning
Course: Computational Intelligence (TI2736-A)
Lecturer: Thomas Moerland
Recap: Machine Learning
= Function approximation
Today: focus on parametric, supervised learning
y = f(x;θ)
Three ingredients:
1. The parametric model: maps the input x to a prediction ŷ, using parameters θ
2. The loss L: compares the prediction ŷ with the target y (= when does the model do well)
3. Optimization: goal = tune the parameters θ to minimize the loss
The input x and the target y are values given by the data; the parameters θ are the values to learn.
Content for today
1. The Feedforward Network
a. Artificial Neural Network (ANN): A Parametric Model 1
b. Loss Functions 2
c. Numerical Optimization 3
2. Advanced Neural Network Architectures
a. Convolutional Neural Network (CNN)
b. Recurrent Neural Network (RNN)
3. Deep learning
1. The Feedforward Network
ANN: A Parametric Model
Artificial Neural Network (ANN)
= stacked sequence of non-linear regressions ("fully connected layers")
[Figure: network with input x1, x2, a hidden layer h1, h2, h3 and prediction ŷ]
Artificial Neural Network structure
Prediction per layer:
$h^{(l)} = g(W^{(l)} h^{(l-1)} + b^{(l)})$
- $l$: layer number
- $h^{(l-1)}$: input (vector)
- $h^{(l)}$: output (vector)
- $W^{(l)}$: weight (matrix)
- $b^{(l)}$: bias (vector)
- $g$: non-linear function (applied element-wise to a vector)
Q: How many parameters does this network (2 inputs, 3 hidden units, 1 output) have?
A: 13
First layer: 6 weights + 3 biases
Second layer: 3 weights + 1 bias
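The parameter count above can be checked with a short sketch. The helper name `count_parameters` and the list-of-layer-sizes convention are assumptions made for this illustration; they do not come from the slides.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first,
    e.g. [2, 3, 1] for 2 inputs, 3 hidden units and 1 output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out  # one weight per connection
        biases = n_out          # one bias per unit in the receiving layer
        total += weights + biases
    return total


print(count_parameters([2, 3, 1]))  # 6 + 3 + 3 + 1 = 13
```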
Activation Functions
Q: Why not stack multiple linear layers?
A: Composition of linear transformations is still linear.
Activation function = non-linear transformation
Activation Functions
1980-2010: Sigmoid & tanh. Problems: they saturate (on both sides) and make it hard to pass the input through unchanged
2010-now: ReLU & ELU (partially linear functions): the gradient flows more easily
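A minimal NumPy sketch of the points above. It defines the named activation functions, shows that the sigmoid gradient becomes tiny for large inputs (saturation), and checks that two stacked linear layers collapse into one. The random parameter values and the names `W_merged` and `b_merged` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # close to 0 for large |z|: the sigmoid saturates

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # expm1 on the clipped negative part avoids overflow for large positive z
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x = rng.normal(size=2)

# Two stacked linear layers ...
stacked = W2 @ (W1 @ x + b1) + b2
# ... equal a single linear layer with merged parameters.
W_merged, b_merged = W2 @ W1, W2 @ b1 + b2
single = W_merged @ x + b_merged
linear_collapses = np.allclose(stacked, single)

# With a non-linearity in between, the network is no longer a single linear map.
nonlinear = W2 @ relu(W1 @ x + b1) + b2
```

The tanh function is available directly as `np.tanh`.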
ANN: Layer Stacking
Idea:
Repeatedly apply such a parametrized layer, feeding the output of each layer into the next one:
$\hat{y} = W^{(2)} h^{(1)} + b^{(2)}$, with $h^{(1)} = g(W^{(1)} x + b^{(1)})$
or, when fully written out for the network with one hidden layer:
$\hat{y} = W^{(2)} g(W^{(1)} x + b^{(1)}) + b^{(2)}$
Note: In the last layer we do not apply a standard non-linearity g(). More about this in the loss function part.
[Figure: network with input x1, x2, a hidden layer h1, h2, h3 and prediction ŷ]
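The stacking idea can be sketched as a loop over layers. The function name `forward`, the `(W, b)` list layout and the parameter values are hypothetical choices for this example; only the structure (non-linearity after every layer except the last) follows the slide.

```python
import numpy as np

def forward(x, layers, g):
    """Apply fully connected layers in sequence.

    layers is a list of (W, b) pairs. The non-linearity g is applied after
    every layer except the last one, which stays linear.
    """
    h = x
    for W, b in layers[:-1]:
        h = g(W @ h + b)
    W_out, b_out = layers[-1]
    return W_out @ h + b_out

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical parameter values for a network with 2 inputs, 3 hidden units, 1 output.
layers = [
    (np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.zeros(3)),
    (np.array([[1.0, 1.0, 1.0]]), np.array([0.5])),
]
y_hat = forward(np.array([1.0, -2.0]), layers, relu)
```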
B. Loss function
General idea:
1. Specify error measure between ŷ (prediction) and y (true data target)
2. Minimize that quantity over the entire dataset
Two important considerations:
1. Type of y variable (regression vs classification)
2. Deterministic versus probabilistic loss
B. Loss function
1. Regression versus classification (= type of target variable (y))
Target type (y) | Name | Prediction | Network output
Continuous | Regression | Number on real line | Direct prediction (1 head) or parameters of a continuous probability distribution
Discrete | Classification | Class label out of a set | Usually one network head per class

Canonical example: regression with the Mean-Squared Error (MSE)
$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
(sum over the whole dataset of the squared error between the prediction $\hat{y}_i$ and the true label $y_i$)
Q: Why the square of the error?
A: It penalizes both positive and negative errors, and it has an easier derivative than the absolute error.
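As a small illustration of the MSE, the sketch below computes the loss and its derivative with respect to the predictions. The function names and the example numbers are assumptions for this example.

```python
import numpy as np

def mse(y_hat, y):
    """Mean-squared error over a dataset of predictions and targets."""
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

def mse_grad(y_hat, y):
    """Derivative of the MSE with respect to each prediction."""
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return 2.0 * (y_hat - y) / y_hat.size

# Errors of +2 and -2 are penalized equally by the square.
loss = mse([8.0, 5.0], [6.0, 7.0])
grad = mse_grad([8.0, 5.0], [6.0, 7.0])
```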
B. Loss function
2. Deterministic versus probabilistic loss
Main idea of probabilistic loss: The network predicts the parameters of a probability
distribution out of which the observed y would be sampled, instead of predicting y directly.
For example: the network has two outputs, μ and σ, and ŷ ~ N(.|μ,σ)
[Figure: network with input x1, x2, a hidden layer h1, h2, h3 and outputs μ, σ]
Benefits:
1. Model stochastic output & sensor noise
2. It directly gives a loss function:
'Maximum likelihood estimation' = learn a model that gives maximum probability to
the observed data
See lecture notes for details (also for classification case)
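A minimal sketch of maximum likelihood for the Gaussian example: minimizing the negative log-likelihood is the same as maximizing the probability of the observed data. The function name `gaussian_nll` and the example values are assumptions. The check at the end shows that, with σ fixed to 1, this loss equals half the mean-squared error plus a constant.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of targets y under N(mu, sigma^2), averaged."""
    y, mu, sigma = (np.asarray(a, dtype=float) for a in (y, mu, sigma))
    return np.mean(
        0.5 * np.log(2.0 * np.pi * sigma**2) + (y - mu) ** 2 / (2.0 * sigma**2)
    )

y = np.array([6.0, 7.0])    # observed targets
mu = np.array([8.0, 5.0])   # predicted means
nll = gaussian_nll(y, mu, sigma=1.0)

# With sigma fixed to 1 the loss is half the mean-squared error plus a constant.
half_mse_plus_const = 0.5 * np.mean((mu - y) ** 2) + 0.5 * np.log(2.0 * np.pi)
```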
C. Numerical optimization
Gradient Descent
Update rule:
$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, where $\eta$ is the learning rate
Non-Convex Objective Function
NN objective/cost = non-convex
Importance of learning rate
Learning rate = crucial
Too small: no progress
Too large: unstable
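The effect of the learning rate can be sketched on a toy objective. Note the assumption: the objective θ² used here is convex and one-dimensional, unlike a neural network loss; it is only meant to show the three regimes.

```python
def gradient_descent(grad, theta0, learning_rate, steps):
    """Repeat the update theta <- theta - learning_rate * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

def grad(theta):
    return 2.0 * theta  # gradient of the toy objective theta**2 (minimum at 0)

too_small = gradient_descent(grad, 1.0, learning_rate=1e-4, steps=20)   # barely moves
reasonable = gradient_descent(grad, 1.0, learning_rate=0.1, steps=20)   # approaches 0
too_large = gradient_descent(grad, 1.0, learning_rate=1.1, steps=20)    # diverges
```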
Gradient Descent for Neural Networks
Two issues around the same problem:
How do we get the gradients in feasible computational time?
1. Datasets are usually large:
Solution: stochastic gradient descent (SGD)
2. Networks are usually large:
Solution: backpropagation ('backprop')
Stochastic Gradient Descent
True gradient is a sum over the entire dataset:
$\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)$, where $L_i$ is the loss on datapoint i
Dataset size (N) may be millions.
Solution: approximate the gradient with a sample from the dataset
(= a 'minibatch' per parameter update):
$\nabla_\theta L(\theta) \approx \frac{1}{m} \sum_{j=1}^{m} \nabla_\theta L_{i_j}(\theta)$
Minibatch size (usually m=32 or m=64) stays fixed when the dataset grows!
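A minimal sketch of minibatch SGD, assuming a one-parameter linear model ŷ = w·x and synthetic data generated for this example (neither comes from the slides). Each update uses only m datapoints, regardless of N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 3x + noise.
N = 10_000
x = rng.normal(size=N)
y = 3.0 * x + 0.1 * rng.normal(size=N)

def grad_w(w, x_batch, y_batch):
    """Gradient of the mean squared error of the model y_hat = w * x."""
    return np.mean(2.0 * (w * x_batch - y_batch) * x_batch)

w, learning_rate, m = 0.0, 0.05, 32
for step in range(500):
    idx = rng.integers(0, N, size=m)  # sample a minibatch of m datapoints
    w -= learning_rate * grad_w(w, x[idx], y[idx])

# The full gradient touches all N datapoints; the minibatch gradient only m.
full_gradient = grad_w(w, x, y)
```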
Backpropagation
First: How do we get the gradient anyway?
Required: Chain Rule of Calculus
Example:
z = f(x), h = g(z) --> h = g(f(x))
x --f()--> z --g()--> h
How do we get dh/dx?
dh/dx = dh/dz · dz/dx
Chain rule = multiply the gradients of the subfunctions
(generalizes to the case where x, z and h are vectors; this needs partial derivatives, see lecture notes)
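The chain rule can be verified numerically. The concrete choices f(x) = 3x + 1 and g(z) = tanh(z) are assumptions for this sketch; the analytic derivative is compared against a finite-difference estimate.

```python
import math

def f(x):
    return 3.0 * x + 1.0      # z = f(x), so dz/dx = 3

def g(z):
    return math.tanh(z)       # h = g(z), so dh/dz = 1 - tanh(z)**2

def dh_dx(x):
    """Chain rule: multiply the gradients of the subfunctions."""
    z = f(x)
    dh_dz = 1.0 - math.tanh(z) ** 2
    dz_dx = 3.0
    return dh_dz * dz_dx

# Finite-difference check of the analytic derivative.
x, eps = 0.2, 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2.0 * eps)
analytic = dh_dx(x)
```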
Class example: NN gradients
Network: h = g(w1x1+b1), ŷ = w3h+b2, L = (ŷ - y)^2
Q: To update weight w1 we need dL/dw1. Give dL/dw1 (symbolic).
A: dL/dw1 = dL/dŷ · dŷ/dh · dh/dw1
Q: Can you further write out dh/dw1? (think about the non-linearity)
A: With z = w1x1+b1 and h = g(z): dh/dw1 = dh/dz · dz/dw1 = g'(z) · x1
Q: Now our input x is actually a vector of length 2, so h = g(w1x1+w2x2+b1). Can you give dL/dw1 and dL/dw2?
A: dL/dw1 = dL/dŷ · dŷ/dh · dh/dz · x1
dL/dw2 = dL/dŷ · dŷ/dh · dh/dz · x2
A large part of the gradient is the same (= key idea of backpropagation)
Backpropagation
Main idea:
- Efficiently store gradients and re-use them by walking backwards
through the network.
Class example: One full learning loop
Network: h = g(w1x1+w2x2+b1), ŷ = w3h+b2, L = (ŷ - y)^2
Let's assume some data and initialize parameters:
x1 = 2, x2 = -1, y = 6
w1 = 1.5, w2 = 2, w3 = 2.5, b1 = 3, b2 = -2
g(z) = ReLU = max(0,z)
Q: Compute ŷ (forward pass)
A: z = (2*1.5) + (-1*2) + 3 = 4
h = max(0,4) = 4
ŷ = (2.5*4) - 2 = 8
Q: We assume the squared loss L = (ŷ - y)^2. Compute the loss for this datapoint.
A: L = (8 - 6)^2 = 4
Q: Let's backpropagate. Calculate dL/dw3.
A: dL/dw3 = dL/dŷ · dŷ/dw3 = 2(ŷ - y) · h = 2*(8 - 6)*4 = 16
Q: Now for dL/dw1 and dL/db1
A: First through the top layer: dL/dh = dL/dŷ · dŷ/dh = 2(ŷ - y) · w3 = 4*2.5 = 10
Then through the ReLU (z = 4 > 0, so dh/dz = 1): dL/dz = 10*1 = 10
Re-use previous gradient: dL/dw1 = dL/dz · x1 = 10*2 = 20 and dL/db1 = dL/dz · 1 = 10
Q: So dL/dw3=16, dL/dw1=20 and dL/db1=10. Update parameters, take learning rate 0.01.
A: w1 = 1.5 - 0.01*20 = 1.3
b1 = 3 - 0.01*10 = 2.9
w3 = 2.5 - 0.01*16 = 2.34
(Note: normally we update all parameters, i.e. w2 and b2 as well)
Q: So we have updated the parameters. Did our prediction get better?
A: z = (2*1.3) + (-1*2) + 2.9 = 3.5
h = max(0,3.5) = 3.5
ŷ = (2.34*3.5) - 2 = 6.19
Yes, we got much closer!
(8 → 6.19, while true y is 6)
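The learning loop of this class example can be reproduced in a few lines of plain Python. The data and initial parameters are those given in the example; the function names and the dictionary layout are assumptions for this sketch. The backward pass computes each upstream gradient once and re-uses it.

```python
def relu(z):
    return max(0.0, z)

def forward(p, x1, x2):
    z = p["w1"] * x1 + p["w2"] * x2 + p["b1"]
    h = relu(z)
    y_hat = p["w3"] * h + p["b2"]
    return z, h, y_hat

def backward(p, x1, x2, y, z, h, y_hat):
    """Walk backwards through the network, re-using upstream gradients."""
    dL_dyhat = 2.0 * (y_hat - y)
    dL_dh = dL_dyhat * p["w3"]
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)  # derivative of ReLU
    return {
        "w3": dL_dyhat * h, "b2": dL_dyhat,
        "w1": dL_dz * x1, "w2": dL_dz * x2, "b1": dL_dz,
    }

params = {"w1": 1.5, "w2": 2.0, "w3": 2.5, "b1": 3.0, "b2": -2.0}
x1, x2, y, learning_rate = 2.0, -1.0, 6.0, 0.01

z, h, y_hat = forward(params, x1, x2)            # z = 4, h = 4, y_hat = 8
grads = backward(params, x1, x2, y, z, h, y_hat)

# As in the class example, update only w1, b1 and w3.
for name in ("w1", "b1", "w3"):
    params[name] -= learning_rate * grads[name]

_, _, y_hat_new = forward(params, x1, x2)        # approximately 6.19
```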
Summary: You just manually trained a neural network (one learning loop)
1. The parametric model: parameters θ map the input x to a prediction ŷ
2. The loss L: compares the prediction ŷ with the target y
3. Goal = tune the parameters θ to minimize the loss (= optimization)
Break
2. Advanced Neural Network Architectures
Advanced neural network architectures
1. Convolutional Neural Network (CNN)
= 'the NN solution to space'
2. Recurrent Neural Network (RNN)
= 'the NN solution to time/sequence'
Convolutional Neural Network (CNN)
Problem:
For high-dimensional input (e.g. images) fully connected layers have way too many
parameters/connections.
Solution:
Convolutions. Useful for data with grid-like structure, especially 2D/3D (computer
vision), where subpatterns re-appear throughout the grid.
Underlying ideas:
1. Local connectivity: connect input only locally through small kernel
2. Parameter sharing: re-use (move) the kernel along the grid/image/video
Convolutional Neural Network (CNN)
[Figure: a kernel moves along the input to produce the output]
- Apart from that, a convolutional layer is similar to a fully connected layer: take a linear combination with the (kernel) weights, then apply a non-linearity.
- But we preserve the grid (2D/3D) structure into the next layer.
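A naive sketch of a single convolutional layer, assuming a single-channel 2D input, stride 1 and no padding (these choices, the function name `conv2d` and the kernel values are assumptions for illustration). The same kernel is applied at every position, and the output is again a 2D grid.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (no kernel flip, stride 1)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # local connectivity
            out[i, j] = np.sum(patch * kernel)  # same kernel at every position
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])    # hypothetical 2x2 kernel

# Linear combination with the kernel weights, then a non-linearity (ReLU).
feature_map = np.maximum(0.0, conv2d(image, kernel))
```

This layer has only 4 weights, whereas a fully connected layer from 16 inputs to 9 outputs would need 144.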
Convolutional Neural Network (CNN)
Stacking layers = Hierarchy
Note: The higher-up in the hierarchy, the wider the 'receptive field' in the original image.
Convolutional Neural Network (CNN)
Visualizing the Hierarchy
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. 2013.
Convolutional Neural Network (CNN)
Convolution (& Pooling) = effectively a very strong prior on a fully connected layer:
- remove many weights (force to 0)
- tie the values of some others (parameter sharing)
Q: Can you think of an example in which convolution would not work?
A: When there is no spatio-temporal (i.e. grid-like) structure in the data.
For example, if x contains patient information (age, gender, medication, etc.), then it
does not make sense to move a window along it (there is no repeating structure).
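The view of convolution as a strong prior on a fully connected layer can be made concrete in 1D. In this sketch (input, kernel values and sizes are assumptions for illustration), sliding a kernel of length 3 over an input of length 5 gives the same result as a fully connected layer whose weight matrix has many entries forced to 0 and the rest tied to the kernel values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([0.5, -1.0, 2.0])          # hypothetical kernel weights

# Slide the kernel along the input (no flip, stride 1).
conv_out = np.array([k @ x[i:i + 3] for i in range(3)])

# The same computation as a fully connected layer: each row holds the kernel,
# shifted by one position, and zeros everywhere else.
W = np.zeros((3, 5))
for i in range(3):
    W[i, i:i + 3] = k
fc_out = W @ x
```

Of the 15 entries in `W`, 6 are forced to 0 and the other 9 share only 3 distinct values.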
Recurrent Neural Network (RNN)
For sequential/temporal data
(text, video, audio, most real-world data is a sequence/stream)
Feed information of the previous timestep into the next timestep.
[Figure: the recurrent network and its graph rolled out over time]
RNN Training
Key idea:
- Recurrent connection between timesteps at the hidden level
- Parameter sharing (again): the recurrent parameters are the same at every
timestep.
But: How to train it?
Backpropagation Through Time (BPTT)
Feed in the entire sequence, then backpropagate the loss through the recurrence
(until the beginning)
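A minimal sketch of backpropagation through time, assuming a scalar RNN with a tanh non-linearity and a squared loss on the final hidden state only. The model, loss, names and numbers are assumptions for illustration. The same parameters w and u are used at every timestep, so their gradients accumulate over time; the result is checked against a finite-difference estimate.

```python
import numpy as np

def rnn_forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), starting from h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs, y):
    """Gradients of L = (h_T - y)^2 with respect to the shared w and u."""
    hs = rnn_forward(w, u, xs)
    dL_dh = 2.0 * (hs[-1] - y)
    dw = du = 0.0
    for t in range(len(xs), 0, -1):          # walk back until the beginning
        dL_da = dL_dh * (1.0 - hs[t] ** 2)   # through tanh at timestep t
        dw += dL_da * hs[t - 1]              # same w at every timestep
        du += dL_da * xs[t - 1]              # same u at every timestep
        dL_dh = dL_da * w                    # pass the gradient on to h_{t-1}
    return dw, du

xs, y, w, u = [0.5, -1.0, 0.3], 0.2, 0.8, 1.2
dw, du = bptt(w, u, xs, y)

# Finite-difference check of both gradients.
eps = 1e-6
def loss(w_, u_):
    return (rnn_forward(w_, u_, xs)[-1] - y) ** 2
dw_numeric = (loss(w + eps, u) - loss(w - eps, u)) / (2.0 * eps)
du_numeric = (loss(w, u + eps) - loss(w, u - eps)) / (2.0 * eps)
```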
RNN architecture variants
[Figure: RNN architecture variants next to a feedforward network, with example uses: action classification, translation]
3. Deep Learning
Deep Learning
White box = hand designed
Grey box = learned
'End-to-end learning'
Deep Learning
“We have never seen machine learning or artificial intelligence technologies so
quickly make an impact in industry.”
-- Kai Yu, Baidu
Deep learning =
stacking many neural network layers & training them end-to-end
(i.e. already discussed)
I. Illustration: Computer Vision
ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
ImageNet dataset: 1.2 million pictures over 1000 classes.
(x → y)
Year | Model name | Error rate | Details
Before 2012 | | 25.7% |
2012 | AlexNet | 15.4% | 7 layers, GPUs, ReLU activation
2013 | ZF Net | 11.2% | Visualization by deconvolution
2014 | VGG Net | 7.3% | Deep (19 layers)
2014 | GoogLeNet | 6.7% | Very deep (100 layers), Inception module
2015 | ResNet | 3.4% | Residual connections
Human | | 5~10% |
II. History of Neural Networks
III. The Benefit of Depth
Depth is beneficial beyond just giving more parameters
IV. Combining Layers
CNN + RNN
Karpathy A. Fei-Fei L. Deep Visual-Semantic Alignments for Generating Image Descriptions. 2015.
V. Deep Learning Research
Autoencoders Adversarial Training Neural Turing Machines
Deep Generative Models Deep Reinforcement Learning
VI. The other pillars of deep learning
(Apart from the algorithms/math discussed in this lecture)
1. Data
2. Computation
3. Software
Reading Material
1. Lecture Notes
2. Deep learning book (Free PDF: http://www.deeplearningbook.org/)
Read:
Fully connected layers: 6.0-6.1, 6.3
Loss functions: 6.2
Numerical Optimization: 4.0-4.3, 5.9, 6.5.1-4, 8.1-8.1.1
CNN: 9.0-9.4
RNN: 10.0, 10.1, 10.2.0, 10.2.2
Deep learning: 1.0 (+ figure 1.5)