Neural networks for classification and regression
Outline
▪ Why deep learning?
▪ Neural networks
• activation functions
▪ Training
• Backpropagation
• Stochastic gradient descent
▪ Quiz Review and answer to some questions from exercises/problem sets
Introduction · Linear regression · Logistic regression
Feature engineering · Data statistics · Naive Bayes
KNN · Clustering · Dimensionality reduction
Neural networks · Convolutional neural networks · Decision trees
Background for Neural networks
Review of supervised learning
▪ Feature vectors, independent variables: $x^i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
▪ Labels, dependent variables, target, outcome: $y^i \in \mathbb{R}$ (regression) or $y^i \in \{1, 2, \ldots, k\}$ (classification)
▪ Training data/set/examples: $S = \{(x^i, y^i)\}$ for supervised learning, $\{x^i\}$ for unsupervised learning
▪ Sample, sample point, data point: one pair $(x^i, y^i) \in S$
Why deep learning?
Logistic regression review
Logistic regression for d-dimensional data:
▪ Input: $x \in \mathbb{R}^d$
▪ Weights: $w \in \mathbb{R}^d$, bias: $b$
▪ Logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
▪ Output: $\hat{y} = \sigma(w^T x + b)$

[Diagram: inputs $x_1, \ldots, x_d$ are weighted by $w_1, \ldots, w_d$, combined with the bias, and passed through the activation function $\sigma$ to produce the output $\hat{y}$]
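A minimal NumPy sketch of this computation, with arbitrary illustrative values for the input, weights, and bias:

import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: a 3-dimensional input, weights, and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

# Logistic regression prediction: y_hat = sigma(w^T x + b)
y_hat = sigmoid(w @ x + b)
print(y_hat)  # a probability in (0, 1)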
Why deep learning?
Limitations of logistic regression
Issue: Logistic regression performs badly on non-
linearly separable data
Potential fix: Use feature engineering to make data
linearly separable, then use logistic regression
However:
▪ Features that linearly separate the data can be hard to find manually, especially in high dimensions
Why deep learning?
From logistic regression to neural networks
Neural networks have been successful in learning complex, non-linear functions
Why deep learning?
New way to approach ML
Before deep learning:
Data Features Model
Hand-design the features
Deep learning:
Data Features Model
Deep neural networks derive
useful features from the data!
Neural networks
Neural networks
Representation
[Two example networks; the weights sit on the connections between layers]
Left: "2-layer Neural Net" or "1-hidden-layer Neural Net"
Right: "3-layer Neural Net" or "2-hidden-layer Neural Net"
“Fully-connected” layers
Each neuron of a layer is connected
to all neurons of the following layer
Neural networks
Inside a neuron
g = Activation function
Connections to biological neurons
Source: towardsdatascience.com
Applications - nowadays everywhere!
Applications - examples in IGM
Neural networks
Representation

Notation: $a_i^{[l]}$ denotes node $i$ in layer $l$.

Input: $x = [x_1, x_2, x_3]^T$, shape (3, 1)

Weight vector for the first node of the first layer:
$w_1^{[1]} = [w_{1,1}^{[1]}, w_{1,2}^{[1]}, w_{1,3}^{[1]}]^T$, shape (3, 1)

Each node of the first layer computes a pre-activation and then applies the activation function, giving a nonlinear transformation of $x$:
$z_1^{[1]} = w_1^{[1]T} x + b_1^{[1]}, \quad a_1^{[1]} = g^{[1]}(z_1^{[1]})$
$z_2^{[1]} = w_2^{[1]T} x + b_2^{[1]}, \quad a_2^{[1]} = g^{[1]}(z_2^{[1]})$
$z_3^{[1]} = w_3^{[1]T} x + b_3^{[1]}, \quad a_3^{[1]} = g^{[1]}(z_3^{[1]})$
$z_4^{[1]} = w_4^{[1]T} x + b_4^{[1]}, \quad a_4^{[1]} = g^{[1]}(z_4^{[1]})$
Neural networks
Representation

Vector notation: stack the weight vectors as columns and the biases as a vector,
$W^{[1]} = [w_1^{[1]} \; w_2^{[1]} \; w_3^{[1]} \; w_4^{[1]}]$, shape (3, 4), $\quad b^{[1]} = [b_1^{[1]}, b_2^{[1]}, b_3^{[1]}, b_4^{[1]}]^T$, shape (4, 1)

so that the whole layer is computed at once:
$z^{[1]} = W^{[1]T} x + b^{[1]}$, then apply the activation: $a^{[1]} = g^{[1]}(z^{[1]})$
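A minimal NumPy sketch of this vectorized layer computation, with random values for $W^{[1]}$ and $b^{[1]}$, the (3, 4) shape from the slide, and tanh chosen as an illustrative activation:

import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((3, 1))    # input, shape (3, 1)
W1 = rng.standard_normal((3, 4))   # W^[1], shape (3, 4): one column per node
b1 = rng.standard_normal((4, 1))   # b^[1], shape (4, 1)

z1 = W1.T @ x + b1                 # z^[1] = W^[1]T x + b^[1], shape (4, 1)
a1 = np.tanh(z1)                   # a^[1] = g^[1](z^[1]), here with g = tanh
print(z1.shape, a1.shape)          # (4, 1) (4, 1)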
Activation functions
NN - Activation Function
Introduction
$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$

Q: What happens if we remove the activations?

$\hat{y} = W^{[2]T} (W^{[1]T} x + b^{[1]}) + b^{[2]}$
$\hat{y} = W^{[2]T} W^{[1]T} x + W^{[2]T} b^{[1]} + b^{[2]}$

Define $W'^T = W^{[2]T} W^{[1]T}$ and $b' = W^{[2]T} b^{[1]} + b^{[2]}$:
$\hat{y} = W'^T x + b'$
A: We end up with a linear classifier!
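A small NumPy check of this collapse, with arbitrary random parameters and shapes: applying the two weight-and-bias maps in sequence gives exactly the same output as the single combined linear map.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 2)), rng.standard_normal((2, 1))

# Two linear layers with no activation in between
y_two_layers = W2.T @ (W1.T @ x + b1) + b2

# Equivalent single linear layer
W_prime_T = W2.T @ W1.T
b_prime = W2.T @ b1 + b2
y_one_layer = W_prime_T @ x + b_prime

print(np.allclose(y_two_layers, y_one_layer))  # True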
NN - Activation functions
Introduction
To model a nonlinear problem:
▪ Pass the output of each neuron through a nonlinear function, called an activation function
▪ Connection to neuron firing in the brain
Some well-known activation functions:
▪ Sigmoid
▪ Tanh
▪ ReLU
NN - Activation functions
Overview
Sigmoid (σ):
▪ Squashes the input into the [0, 1] range
▪ Approximately nullifies the gradient for "large" positive or negative inputs -> vanishing gradient problem
▪ Rarely used, except for the final layer of a binary classification network

Tanh:
▪ Squashes the input into the [-1, 1] range
▪ Like sigmoid, nullifies the gradient for "large" positive or negative inputs
▪ Zero-centered, preferable over sigmoid as an activation
▪ Rarely used in practice (ReLU is more popular)
NN - Activation functions
Overview
Rectified Linear Unit (ReLU):
▪ Easily computed, simple gradient
▪ Greatly accelerates convergence of gradient descent
▪ Saturates in only one direction, so it suffers less from the vanishing gradient problem
▪ Commonly used in practice

Leaky ReLU:
▪ Attempts to fix the "dying ReLU" problem by having a small negative slope for x < 0
▪ Leaky ReLU and other ReLU variants (ELU, SELU, GELU, Swish, etc.) are sometimes used over ReLU
NN - Activation functions
Derivatives
Sigmoid:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$

Tanh:
$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$

Rectified Linear Unit (ReLU):
$\mathrm{ReLU}(x) = \max(0, x)$, i.e. $0$ for $x \le 0$ and $x$ for $x > 0$
$\frac{d}{dx}\mathrm{ReLU}(x) = 0$ for $x < 0$ and $1$ for $x > 0$

Note: The derivative of ReLU is undefined at $x = 0$. By convention, it is set to 0.
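The same functions and derivatives written out in NumPy (with the x = 0 convention for ReLU handled as stated above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 0 for x < 0, 1 for x > 0; set to 0 at x = 0 by convention
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid_grad(x), tanh_grad(x), relu_grad(x))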
Training neural nets
Neural networks
Training
Forward pass of a 2-layer NN (for a single example):

$z^{[1]} = W^{[1]T} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]T} a^{[1]} + b^{[2]}$
$\hat{y} = a^{[2]} = g^{[2]}(z^{[2]})$

In one equation:
$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$
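A minimal NumPy sketch of this forward pass with random parameters, assuming tanh in the hidden layer and a sigmoid output (the slide does not fix the activations, so these are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 1))

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))  # layer 1
W2, b2 = rng.standard_normal((4, 1)), rng.standard_normal((1, 1))  # layer 2

z1 = W1.T @ x + b1          # z^[1] = W^[1]T x + b^[1]
a1 = np.tanh(z1)            # a^[1] = g^[1](z^[1])
z2 = W2.T @ a1 + b2         # z^[2] = W^[2]T a^[1] + b^[2]
y_hat = sigmoid(z2)         # y_hat = g^[2](z^[2])
print(y_hat.shape)          # (1, 1)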
Neural networks
Training
Forward pass of a 2-layer NN (for a single sample):

$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$

To train, we need a loss function: $L(\hat{y}, y)$

Using that loss function, we want to update $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$ using gradient descent.
Loss function
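The loss itself is not spelled out in the extracted slide; as hedged examples, binary cross-entropy (for classification with a sigmoid output) and squared error (for regression) are two standard choices for $L(\hat{y}, y)$:

import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """L(y_hat, y) = -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def squared_error(y_hat, y):
    """L(y_hat, y) = (y_hat - y)^2, a common regression loss."""
    return (y_hat - y) ** 2

print(binary_cross_entropy(0.9, 1.0), squared_error(2.5, 3.0))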
Gradient descent

Need to compute: $\frac{\partial L}{\partial W^{[i]}}$ and $\frac{\partial L}{\partial b^{[i]}}$
=> the gradient of the loss with respect to the weights

Once the gradients are computed, update the weights with:
▪ $W^{[i]} := W^{[i]} - \alpha \frac{\partial L}{\partial W^{[i]}}$
▪ $b^{[i]} := b^{[i]} - \alpha \frac{\partial L}{\partial b^{[i]}}$
where $\alpha$ is the learning rate
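The update itself is one line per parameter; a minimal sketch, assuming the gradients dW and db have already been computed (by backpropagation, covered next):

import numpy as np

def gradient_step(W, b, dW, db, alpha=0.01):
    """One gradient descent update: W := W - alpha*dL/dW, b := b - alpha*dL/db."""
    return W - alpha * dW, b - alpha * db

# Illustrative usage with arbitrary arrays
W, b = np.zeros((3, 4)), np.zeros((4, 1))
dW, db = np.ones((3, 4)), np.ones((4, 1))
W, b = gradient_step(W, b, dW, db, alpha=0.1)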
Neural networks
Forward / Backward pass
Forward pass: Compute the output of a neural network for a given input
Backward pass: Compute the derivatives of the loss with respect to the network parameters
During training, you need both the forward pass and the backward pass.
During inference, you only need the forward pass.
Inference: the process of using a trained machine learning model for prediction
Computing gradients
Backpropagation
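The backpropagation slides here are figure-based; below is a minimal NumPy sketch of the backward pass for the 2-layer network used in this lecture, assuming (as illustrative choices) a tanh hidden layer, a sigmoid output, and the binary cross-entropy loss, for which $\partial L / \partial z^{[2]} = \hat{y} - y$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x, y = rng.standard_normal((3, 1)), 1.0

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 1)), rng.standard_normal((1, 1))

# Forward pass (keep intermediates for the backward pass)
z1 = W1.T @ x + b1
a1 = np.tanh(z1)
z2 = W2.T @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass: chain rule, layer by layer
dz2 = y_hat - y                 # dL/dz^[2] for sigmoid output + binary cross-entropy
dW2 = a1 @ dz2.T                # dL/dW^[2], shape (4, 1)
db2 = dz2                       # dL/db^[2]
da1 = W2 @ dz2                  # dL/da^[1]
dz1 = da1 * (1.0 - a1 ** 2)     # dL/dz^[1], using tanh'(z) = 1 - tanh(z)^2
dW1 = x @ dz1.T                 # dL/dW^[1], shape (3, 4)
db1 = dz1                       # dL/db^[1]

# One gradient descent step
alpha = 0.1
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2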
Neural networks
Forward pass

Forward pass of this 2-layer NN:

$z^{[1]} = W^{[1]T} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]T} a^{[1]} + b^{[2]}$
$a^{[2]} = g^{[2]}(z^{[2]})$
$\hat{y} = a^{[2]}$

Rewriting it in one equation:

$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$
Stochastic gradient descent
Mini-batch stochastic gradient descent
Problems with training
Recap on training a neural network
Loop:
1. Sample a batch of data
2. Forward pass to get the loss
3. Backward pass to calculate gradient
4. Update parameters using the gradient
▪ The forward pass computes the result of an operation and saves any intermediates needed for gradient computation in memory
▪ Backward pass applies the chain rule to compute the gradient of the loss
function with respect to the inputs
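A minimal PyTorch sketch of this loop; the model, loss, optimizer, and random data below are illustrative placeholders, not from the exercises:

import torch
import torch.nn as nn

# Illustrative placeholders: a tiny model, loss, optimizer, and random data
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))

for step in range(100):
    # 1. Sample a batch of data
    idx = torch.randint(0, X.shape[0], (32,))
    xb, yb = X[idx], y[idx]
    # 2. Forward pass to get the loss
    loss = loss_fn(model(xb), yb)
    # 3. Backward pass to calculate the gradient
    optimizer.zero_grad()
    loss.backward()
    # 4. Update parameters using the gradient
    optimizer.step()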
Deep learning frameworks
Overview
Deep learning frameworks are used to efficiently define and train neural networks
• Support for many types of layers, activations, loss functions, optimizers, …
• Backpropagation computed automatically (e.g. loss.backward() in PyTorch)
• GPU support for faster training
Most popular frameworks today:
• PyTorch (https://pytorch.org)
• TensorFlow (https://www.tensorflow.org/)
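A tiny illustration of the automatic backpropagation mentioned above: after loss.backward(), autograd has filled in w.grad with $\partial L / \partial w$ (the tensors and loss here are arbitrary).

import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
y = torch.tensor(1.0)

y_hat = torch.sigmoid(w @ x)          # forward pass
loss = (y_hat - y) ** 2               # a simple squared-error loss
loss.backward()                       # backward pass, computed automatically
print(w.grad)                         # dL/dw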
Deep learning frameworks
Implementing a simple neural network in PyTorch
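The code from this slide is not reproduced in the extracted text; the following is a minimal sketch of what such a network can look like in PyTorch, with illustrative layer sizes (784 inputs, 64 hidden units, 10 classes):

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    """A 1-hidden-layer fully-connected network (illustrative sizes)."""

    def __init__(self, in_dim=784, hidden_dim=64, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # W^[1], b^[1]
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # W^[2], b^[2]

    def forward(self, x):
        a1 = torch.relu(self.fc1(x))   # a^[1] = g^[1](z^[1])
        return self.fc2(a1)            # class scores (logits)

model = SimpleNet()
scores = model(torch.randn(8, 784))   # forward pass on a batch of 8 inputs
print(scores.shape)                   # torch.Size([8, 10])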
Extensions of feedforward neural networks
▪ Convolutional neural networks (next lecture)
▪ Recurrent neural networks (relevant for control systems)
Python exercises
▪ You will create a neural network for hand-written digit classification
▪ Training data is based on the MNIST dataset: an online dataset of 70,000 images of hand-written digits
▪ Training neural networks is time (and energy) consuming
▪ Each image has 28 x 28 = 784 pixels
▪ We will use Google Colab as it provides access to faster hardware:
• GPU (graphics processing units)
• TPU (tensor processing units)
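A small sketch of how the GPU is typically picked up from PyTorch in Colab (after selecting a GPU runtime); TPU use needs extra setup that is not shown here.

import torch

# Use the GPU if Colab has one attached, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Move the model and each batch of data to the chosen device before training
# model = model.to(device)
# xb, yb = xb.to(device), yb.to(device)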
Brief review of last lecture
PCA
k-means
Questions from problem set
Data covariance matrix