WBL Deep Learning: Week 1
Beate Sick, Oliver Dürr
Week 1: Introduction and technicalities
Zürich, 9/7/2020
1
Literature
• Probabilistic Deep Learning (Manning, in production)
– Our probabilistic take
– https://www.manning.com/books/probabilistic-deep-learning?a_aid=probabilistic_deep_learning&a_bid=78e55885
• Deep Learning Book (DL-Book) http://www.deeplearningbook.org/.
This is a quite comprehensive book which goes far beyond the
scope of this course.
• Courses
– Convolutional Neural Networks for Visual Recognition http://cs231n.stanford.edu
– Martin Görner (very practical)
• https://cloud.google.com/blog/products/gcp/learn-tensorflow-and-deep-learning-without-a-phd
2
Introduction to Deep Learning
– what's the hype about?
3
AI, Machine Learning, Deep Learning
Slide credit: https://www.datasciencecentral.com/profiles/blogs/artificial-intelligence-vs-machine-learning-vs-deep-learning
4
Machine Perception
Kaggle dog vs cat competition
• Computers have long been quite bad at perceptual tasks that are easy for humans:
– Images
– Text
– Sound
• A Kaggle contest in 2012
What happened to solve this problem?
5
Deep Learning Success Story: ImageNet 2012, 2013, 2014, 2015
• ImageNet: 1000 classes, about 1 million samples
• Human performance: 5% misclassification
• 2012: first CNN (A. Krizhevsky)
• 2013: only one non-CNN approach among the top entries
• 2014: GoogLeNet, 6.7%
• 2015: it gets tougher (and then things really took off)
– 4.95% Microsoft (Feb 6, surpassing human performance of 5.1%)
– 4.8% Google (Feb 11), further improved to 3.6% (Dec)?
– 4.58% Baidu (May 11, banned due to too many submissions)
– 3.57% Microsoft (ResNet, winner 2015)
6
Figure: https://medium.com/global-silicon-valley/machine-learning-yesterday-today-tomorrow-3d3023c7b519
Deep Learning successes
• With DL it took approx. 3 years to solve object detection and other computer vision tasks
• Further examples
Images from cs229n 7
What is new in the deep learning approach?
Traditional ML:
Extract handcrafted features & use these features to train / fit a model (e.g. SVM, RF), then use the fitted model to perform classification/prediction.
Deep learning (end-to-end approach)
Deep neural networks start with raw data and learn during training/fitting to extract
appropriate hierarchical features and to use them for classification/prediction.
Low-level feature Mid-level feature High-level feature
NVIDIA course 8
Focus of these lectures:
Probabilistic Viewpoint
9
Probabilistic vs deterministic models
Figure: deterministic vs. probabilistic models for “regression” and “classification”; a probabilistic model outputs a conditional probability distribution (CPD) p(y|x).
10
Guiding Theme of the course
• We treat DL models as probabilistic models, as a continuation of GLMs (logistic regression, ...) for the CPD p(y|x)
• The models are fitted to training data with maximum likelihood (or Bayes)
Special networks for x: vector → FCNN, image → CNN, text → CNN/RNN
Special heads for y: classes, regression
11
Topics
• Day 1
– Introduction to DL
– Fully connected neural networks (fcNN)
– Introduction to TensorFlow and Keras
• Day 2
– Convolutional Neural Networks (CNN) for image data
– Classification and Regression with fcNN and CNNs
• Day 3
– Probabilistic DL
– Extending the GLM with DL for scalar features and image data
• Day 4
– Extending deep GLMs by deep transformation models
– Deep interpretable ordinal regression models
12
Fully Connected Neural Networks
FCNN
13
The Single Cell: Biological Motivation
Neural networks are loosely inspired by how the brain works
14
An artificial neuron
Figure: an artificial neuron. The input vector (x1, x2, ..., plus a constant 1) is combined with the weights w1, w2, ... and the bias b into z = b + w1·x1 + w2·x2 + ...; the output y is obtained from z. Different non-linear transformations (activation functions) are used to get from z to the output y.
The sigmoid activation:
y = sigmoid(z) = 1 / (1 + e^(-z))
The sigmoid ensures a number between 0 and 1, which can be interpreted as a probability.
Question: What is this model called in statistics?
Toy Task
• Task: tell fake from real banknotes (see exercise 01_nb_ch02_01)
• Banknotes are described by two features (x1, x2) extracted from an image
Figure: the banknotes in the (x1, x2) feature plane.
16
Exercise: Part 1
Figure: a single-neuron network with weights w1, w2 and bias b.
Model: The above network models the probability p1 that a given banknote is fake.
TASK
The weights (determined by a training procedure later) are given by w1 = 0.3, w2 = 0.1, and b = 1.0.
The probability can be calculated from z using the function sigmoid(z).
1. What is the probability (rough estimate) that a banknote characterized by x1 = 1 and x2 = 2.2 is fake?
17
GPUs love Vectors
"
!!
!"
𝑤!
𝑝! = sigmoid 𝑥! 𝑥" ⋅ 𝑤 +𝑏
"
DL: better to have column vectors
18
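To make this concrete, here is a minimal R sketch of the vectorized single neuron; the sigmoid helper and the example values for x, w and b are made-up illustration values (not the exercise values):

# Sigmoid activation: maps any real z to a number in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

x <- matrix(c(2.0, -1.5), nrow = 1)   # input as a row vector (features x1, x2)
w <- matrix(c(0.5, -0.2), ncol = 1)   # weights as a column vector (w1, w2)
b <- 0.1                              # bias

p1 <- sigmoid(x %*% w + b)            # p1 = sigmoid(x . w + b)
p1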
Result*
Figure: the fitted model in the (x1, x2) feature plane.
General rule: networks without a hidden layer have a linear decision boundary.
*Details of training later 19
20
Introducing hidden layer
Notation: W_{from,to} for the weights and b (e.g. b^1_2) for the biases.
We stack single neurons in layers. We use the output of a neuron as input to the next neuron.
Reminder (no hidden layer): p1 = sigmoid( (x1 x2) · (w1, w2)ᵀ + b )
Figure: network with an input layer (x1, x2), a hidden layer (h1, h2, ..., hn), and an output p1.
21
Introducing hidden layer
For a single hidden neuron, e.g. h2:
h2 = sigmoid( (x1 x2) · (W^1_{1,2}, W^1_{2,2})ᵀ + b^1_2 )
General:
h_j = sigmoid( Σ_i x_i · W^1_{i,j} + b^1_j )
Matrix notation (later we drop the vector arrows); note the column vectors:
h = sigmoid( x · W^1 + b^1 )
Complete network:
p1 = sigmoid( h · W^2 + b^2_1 )
Code:
h = sigmoid(x %*% W1 + b1)
p1 = sigmoid(h %*% W2 + b2)
Figure: input layer (x1, x2), hidden layer (h1, ..., hn), output p1.
22
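A minimal, self-contained R sketch of this forward pass with one hidden layer; the layer size (3 hidden neurons) and the random weights are assumptions for illustration:

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
x  <- matrix(c(2.0, -1.5), nrow = 1)            # one input with two features
W1 <- matrix(rnorm(2 * 3), nrow = 2, ncol = 3)  # weights input -> 3 hidden neurons
b1 <- rep(0, 3)                                 # biases of the hidden neurons
W2 <- matrix(rnorm(3 * 1), nrow = 3, ncol = 1)  # weights hidden -> output
b2 <- 0                                         # bias of the output neuron

h  <- sigmoid(x %*% W1 + b1)   # hidden activations (1 x 3)
p1 <- sigmoid(h %*% W2 + b2)   # output probability (1 x 1)
p1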
The benefit of hidden layers
ℎ"
ℎ#
Bias ! #"
#"
ℎ$
23
Increasing number of neurons in the hidden layer
A network with one hidden layer is a universal function approximator!
http://cs231n.github.io/neural-networks-1/ 24
DL use many hidden layers
• Empirical observation: having more than one hidden layer improves generalization (without overfitting)
• Not completely understood why. Some intuition:
– Multiple layers allow hierarchical features
– With the same number of weights, deeper networks are more flexible
– Hierarchical processing is also observed in brains
25
DL vs Machine Learning Meme
https://www.reddit.com/r/ProgrammerHumor/comments/8c1i45/stack_more_layers/ 26
Experiment yourself, play at home
http://playground.tensorflow.org
Lets you explore the effect of hidden layers
27
Structure of the network
In code:
## Solution 2 hidden layers
hidden_1=sigmoid(X %*% W1 + b1)
hidden_2=sigmoid(hidden_1 %*% W2 + b2)
res = sigmoid(hidden_2 %*% W3 + b3)
In math (f = sigmoid) and with b1 = b2 = b3 = 0:
p = f( f( f( x · W^1 ) · W^2 ) · W^3 )
Looks a bit like onions, matryoshkas (Russian dolls) or Lego bricks.
28
Using Networks for Classification
29
So far: Logistic Regression / Binary Classification
p1 = sigmoid( (x1 x2) · (w1, w2)ᵀ + b )
• The network outputs the probability for one class (logistic regression / logistic regression with hidden layers).
– In the probabilistic framework: the parameter of a Bernoulli, Y|x ~ Bern(p1(x))
• What to do with more than one class?
30
Classification: Softmax Activation
p0, p1, ..., p9 are the probabilities for the classes 0 to 9.
Incoming to the last layer: z_i, i = 0, ..., 9.
p_i = e^{z_i} / Σ_{j=0}^{9} e^{z_j}
The exponential makes the outcome positive; the sum in the denominator ensures that the p_i sum up to one.
This activation is called softmax.
The network outputs the probabilities for the classes.
In the probabilistic framework: parameter vector p of a multinomial, Y|x ∼ Mn(p(x)).
31
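A minimal R sketch of the softmax activation; the 10 example logits are arbitrary illustration values:

# Softmax: exponentiate (makes every entry positive), then normalize so the entries sum to 1
# (for numerical stability one usually subtracts max(z) first; omitted here for clarity)
softmax <- function(z) exp(z) / sum(exp(z))

z <- c(2.0, 1.0, 0.1, -1.2, 0.5, 0.0, -0.3, 1.5, 0.7, -2.0)   # 10 logits for the classes 0..9
p <- softmax(z)
round(p, 3)
sum(p)   # = 1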
Training NN
32
Training
Input x^(i) is passed through a neural network with many weights W; true class y^(i) vs. suggested class (shown is the most likely class):
Tiger → Seal 👎
Tiger → Tiger 👍
Seahorse → Seahorse 👍
...
Typically 1 million training examples.
Training principle: the weights are tuned so that a loss function gets minimized:
loss = loss( y^(i), x^(i), W )
33
Loss for classification (‘categorical cross-entropy’)
p0, p1, ..., p9 are the probabilities for the classes 0 to 9.
Definition (negative log-likelihood NLL / categorical cross-entropy):
The loss l_i of a single training example x^(i) with true label y^(i) is
l_i = − log p_model( y^(i) | x^(i) )
Notation: if the true label is 2, then p_model( y^(i) | x^(i) ) = p_2
• Perfect, i.e. the model predicts the class of training example y^(i) with probability 1 ⇒ l_i = 0
• Worst, i.e. it predicts class y^(i) with probability 0 ⇒ l_i = ∞
For more than one example, just average: loss = (1/N) Σ_i l_i
34
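A minimal R sketch of this loss; the predicted probabilities (generated with a softmax over random logits) and the true labels are made-up illustration values:

softmax <- function(z) exp(z) / sum(exp(z))

set.seed(1)
# Predicted class probabilities for 3 training examples (rows) and 10 classes (columns for 0..9)
probs  <- t(apply(matrix(rnorm(3 * 10), nrow = 3), 1, softmax))
y_true <- c(3, 7, 0)                       # true labels of the 3 examples (classes 0..9)

# NLL of each example: minus the log of the probability assigned to the true class
# (+1 because R indexes columns from 1 while the classes run from 0)
l_i  <- -log(probs[cbind(1:3, y_true + 1)])
loss <- mean(l_i)                          # average over all examples
l_i
loss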
Training / Gradient Descent
35
Optimization in DL
The parameters of the network are the weights.
• DL: many parameters
– Optimization of the loss by simple gradient descent
• Algorithm: Stochastic Gradient Descent (SGD)*
– Take a random batch of training examples (y^(i), x^(i)), i = 1, ..., B
– Calculate the loss of that batch, loss( y^(i), x^(i), W )
– Tune the weights so that the loss gets minimized a bit (gradient descent)
– Repeat
Modern networks have billions (10^9) of weights.
Record 2020: 175 · 10^9 (GPT-3)
*aka minibatch gradient descent. 36
Idea of gradient descent
• Shown: the loss function for a single parameter a
Figure: loss as a function of the parameter a.
• Imagine you are a blindfolded wanderer and only know the loss and the slope at your position. How do you reach the minimum?
– Take a large step if the slope is steep (you are far away from the minimum)
• The slope of the loss function is given by the gradient (this is a local quantity)
• Iterative update of the parameter:
– a_{t+1} = a_t − ε · grad_a(loss)
37
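A minimal R sketch of this iterative update for a single parameter a; the toy loss function, its gradient, the learning rate and the start value are assumptions for illustration:

# Toy loss with minimum at a = 1.5 and its derivative (the "slope")
loss   <- function(a) (a - 1.5)^2
grad_a <- function(a) 2 * (a - 1.5)

eps <- 0.1    # learning rate (step size)
a   <- -1.0   # start value of the parameter

for (t in 1:50) {
  a <- a - eps * grad_a(a)   # a_{t+1} = a_t - eps * grad_a(loss)
}
a   # close to the minimum at 1.5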
Proper learning rate (Important parameter for DL)
See: https://developers.google.com/machine-learning/crash-course/fitter/graph
Chapter 3: Probabilistic Deep Learning Book 38
In two dimensions
The gradient is perpendicular to the contour lines.
Figure: contour lines of the loss in the (w1, w2) plane.
w_i(t) = w_i(t−1) − ε(t) · ∂L(w)/∂w_i |_{w = w(t−1)}
39
Summary: Simple Network no hidden layer
Figure: the input x (an image with k × k = 28 × 28 pixels) is flattened to a vector with k² elements; the score or logit z is mapped by the softmax p = S(z) to class probabilities, which are compared to the 1-hot labels y.
Logits: x_(1,k²) · W_(k²,10) + b_(1,10) = z_(1,10)
Softmax: p̂_k = e^{z_k} / Σ_j e^{z_j}
Loss of a single training example (1-hot labels y):
l_i( y^(i), x^(i), W ) = − Σ_{k=0}^{9} y_k · log p_k
Loss of a mini-batch:
L( y^(i), x^(i), W ) = mean( l_i )
Take a step in the direction of the descending gradient of L(w1, w2) (the gradient is oriented orthogonal to the contour lines):
w_i(t) = w_i(t−1) − ε(t) · ∂L(w)/∂w_i |_{w = w(t−1)}
40
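A minimal R sketch that puts these pieces together for one mini-batch: logits, softmax, cross-entropy loss and one gradient step. The data is random and the dimensions are small illustration values; the gradient formulas use the standard result (not derived on the slide) that for softmax plus cross-entropy the gradient of the loss w.r.t. z is p − y:

set.seed(1)
B <- 4; D <- 9; K <- 10                # batch size, input dimension (k^2), number of classes
x <- matrix(rnorm(B * D), nrow = B)    # mini-batch of flattened inputs (B x k^2)
y <- diag(K)[sample(1:K, B, replace = TRUE), ]   # 1-hot labels (B x 10)

W <- matrix(0, nrow = D, ncol = K)     # weights (k^2 x 10), initialized at 0
b <- rep(0, K)                         # biases (10)

softmax_rows <- function(z) {          # softmax applied to every row of z
  e <- exp(z - apply(z, 1, max))       # subtract the row maximum for numerical stability
  e / rowSums(e)
}

# Forward pass
z <- x %*% W + matrix(b, nrow = B, ncol = K, byrow = TRUE)   # logits (B x 10)
p <- softmax_rows(z)                                         # class probabilities (B x 10)
L <- mean(-rowSums(y * log(p)))                              # mini-batch loss

# Gradients (standard result for softmax + cross-entropy): dL/dz = (p - y) / B
gz <- (p - y) / B
gW <- t(x) %*% gz
gb <- colSums(gz)

# One gradient-descent step
eps <- 0.5
W <- W - eps * gW
b <- b - eps * gb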
The miracle of gradient descent in DL
The loss surface in DL is not convex, but SGD magically also works for non-convex problems.
Modern deep learning: no distinction between the network (model) and the training (SGD).
Chapter 3: Probabilistic Deep Learning Book 41
Backpropagation
• We need to calculate the derivative of the loss function loss( {X_i, Y_i}, W ) w.r.t. all weights W
• Efficient way: backpropagation (chain rule)
– Forward pass: propagate the training example through the network
• Gives the output for the current configuration of the network
– Backward pass: propagate the gradients backward through the network
• With the chain rule, all gradients can be calculated in a single flow from the “end”
Figure: forward pass / backward pass through the network.
For more see e.g. chapter 3 in Probabilistic deep learning 42
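A minimal R sketch of the forward and backward pass for a single sigmoid neuron with the NLL loss (one training example; all values are arbitrary illustration values; the backward pass uses the standard chain-rule result that for sigmoid plus NLL dl/dz = p − y):

sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(2.0, -1.5)   # one training example with two features
y <- 1              # true label
w <- c(0.5, -0.2)   # current weights
b <- 0.1            # current bias

# Forward pass: from the input to the loss
z <- sum(w * x) + b
p <- sigmoid(z)
l <- -(y * log(p) + (1 - y) * log(1 - p))   # NLL of a Bernoulli outcome

# Backward pass: chain rule, from the loss back to the weights
dl_dz <- p - y       # derivative of the NLL w.r.t. z (sigmoid + NLL)
dl_dw <- dl_dz * x   # chain rule: dz/dw_i = x_i
dl_db <- dl_dz       # chain rule: dz/db = 1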
Typical Training Curve / ReLU
Motivation (figure source: AlexNet, Krizhevsky et al. 2012): green: sigmoid; red: ReLU, faster convergence.
Epochs: “each training example is used once”
43
Deep Learning Frameworks
44
Recap: The first network
• The input: e.g. intensity values of pixels of an image
• (Almost) no pre-processing
• Output: the probability that the image belongs to a certain class
• Information is processed layer by layer from building blocks
• Arrows are weights (these need to be learned during training)
• Training requires the gradients of the loss w.r.t. the weights
45
Deep Learning Frameworks (common)
• Computation needs to be done on GPUs or specialized hardware (compute performance)
• The data structures are multidimensional arrays (tensors), which are manipulated
• Automatic calculation of the gradients
– Static: a computational graph is built first (see chapter 3 in Probabilistic Deep Learning)
– Dynamic: reverse-mode auto diff
• In this course: TensorFlow with Keras
46
Typical Tensors in Deep Learning
• The input can be understood as a vector
• A mini-batch of size 64 of input vectors can be understood as a tensor of order 2: (index in batch, x_j)
• The weights going from e.g. layer L1 to layer L2 can be written as a matrix (often called W)
• A mini-batch of size 64 images with 256 × 256 pixels and 3 color channels can be understood as a tensor of order 4.
47
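A minimal R sketch of these tensor shapes using plain R arrays; the sizes 64, 784, 256 and 3 come from the slide, the layer size 100 for the weight matrix is an assumption:

x_batch   <- array(0, dim = c(64, 784))           # mini-batch of input vectors: tensor of order 2
W         <- array(0, dim = c(784, 100))          # weights from layer L1 to layer L2: a matrix
img_batch <- array(0, dim = c(64, 256, 256, 3))   # mini-batch of RGB images: tensor of order 4
dim(img_batch)                                    # 64 256 256 3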
Introduction to Keras
48
Keras Workflow
• Define the network (layer-wise)
• Add loss and optimization method
• Fit the network to the training data
• Evaluate the network on test data
• Use in production
49
A first run through
50
Define the network
Annotations to the code on the slide:
– Number of neurons in the first hidden layer
– Dimension of the input, here vector size 784
– Alternative version w/o pipe
– The input shape needs to be defined only at the beginning; alternative: input_dim = 784
51
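A minimal R keras sketch of what this step typically looks like; the hidden-layer size of 100 neurons is an assumption, the input dimension 784 is from the slide:

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 100, activation = 'sigmoid', input_shape = c(784)) %>%  # first hidden layer
  layer_dense(units = 10, activation = 'softmax')                             # output: 10 class probabilities

# Alternative version without the pipe (the model is modified in place):
# model <- keras_model_sequential()
# layer_dense(model, units = 100, activation = 'sigmoid', input_dim = 784)
# layer_dense(model, units = 10, activation = 'softmax')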
Compile the network
52
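A minimal R keras sketch of the compile step; the choice of optimizer and metric is an assumption, the categorical cross-entropy loss matches the loss introduced above:

model %>% compile(
  loss      = 'categorical_crossentropy',  # the NLL for multi-class problems
  optimizer = optimizer_sgd(),             # stochastic gradient descent (default learning rate)
  metrics   = c('accuracy')
)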
Fit the network
20% of the data is not used for fitting the weights (validation split).
53
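A minimal R keras sketch of the fit step; x_train / y_train, the number of epochs and the batch size are assumptions, and validation_split = 0.2 corresponds to the 20% of the data that is not used for fitting the weights:

history <- model %>% fit(
  x_train, y_train,           # training inputs and 1-hot labels
  epochs           = 10,
  batch_size       = 128,
  validation_split = 0.2      # 20% held out to monitor the fit
)
plot(history)                 # training and validation curves over the epochs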
Evaluate the network
54
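A minimal R keras sketch of the evaluate step; x_test and y_test stand for the held-out test data:

model %>% evaluate(x_test, y_test)   # loss and accuracy on the test set

# Predictions on new data:
# model %>% predict(x_new)           # class probabilities for each example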
More layers
• Dropout
– keras.layers.Dropout
• Convolutional (see lecture on CNN)
– keras.layers.Conv2D
– keras.layers.Conv1D
• Pooling (see lecture on CNN)
– keras.layers.MaxPooling2D
• Recurrent (not in course)
– keras.layers.SimpleRNNCell
– keras.layers.GRU
– keras.layers.LSTM
55
How to use TF and Keras in the course
• Use Google Colab
– Free resource with preinstalled packages and (a kind of) Jupyter notebooks
– Usually for Python
– To start an R notebook (not possible via the GUI):
• https://colab.research.google.com/notebook#create=true&language=r
• Use RStudio
– Installation is a bit tedious, especially for tfprobability
– You might be lucky, though
• Exercises for today
Play around with the code, answer the questions, and ask questions if you have any
– Check installation / colab
https://colab.research.google.com/drive/13scWAt7B3y2KxYOdyWR1XHoSTC-H1DxN?usp=sharing
– Banknote example:
https://colab.research.google.com/drive/1_kWrocpNxlzYYySIi__55ucwtuvgAflv?usp=sharing
– MNIST with simple FCNN
https://colab.research.google.com/drive/1GTfFpUlMJoIiU08KU268ktCM6TGbfDR2?usp=sharing 56