Introduction to neural networks
Logistics
• Check Canvas (announcements)
• Lecture slides in Canvas Files.
• Lab 1 is out. Check it early if you are worried about prerequisites.
• Office hours info TBA.
Overview of today’s lecture
• Perceptron.
• Neural networks.
• Training perceptrons.
• Gradient descent.
• Backpropagation.
• Stochastic gradient descent.
Perceptron
1950s Age of the Perceptron
1957 The Perceptron (Rosenblatt)
1969 Perceptrons (Minsky, Papert)
1980s Age of the Neural Network
1986 Backpropagation (Rumelhart, Hinton, Williams)
1990s Age of the Graphical Model
2000s Age of the Support Vector Machine
2010s Age of the Deep Network
deep learning = known algorithms + computing power + big data
The Perceptron
[Diagram: inputs, weights, weighted sum, sign function (Heaviside step function), output y]
The Perceptron (with bias)
[Diagram: same, with a bias term b attached to a constant input 1]
The Perceptron
[Diagram: same, with the bias term suppressed (less clutter)]
Aside: Inspiration from Biology
Neural nets/perceptrons are loosely inspired by
biology.
But they certainly are not a model of how the brain
works, or even how neurons work.
The Perceptron
[Diagram: inputs, weights, bias, weighted sum, activation function (e.g., step, sigmoid, tanh, ReLU), output]
Another way to draw it…
[Diagram: (1) combine the sum and activation function into a single node (e.g., sigmoid of the weighted sum); (2) suppress the bias term (less clutter)]
Programming the 'forward pass'
Activation function (sigmoid, logistic function)
float f(float a)
{
    return 1.0f / (1.0f + exp(-a));
}

Weighted sum of inputs (helper used below)
float dot(vector<float> x, vector<float> w)
{
    float a = 0.0f;
    for (size_t i = 0; i < x.size(); i++)
        a += x[i] * w[i];
    return a;
}

Perceptron function (logistic regression)
float perceptron(vector<float> x, vector<float> w)
{
    float a = dot(x, w);  // weighted sum (bias suppressed)
    return f(a);          // activation gives the output
}
Connect a bunch of perceptrons together …
Neural Network: a collection of connected perceptrons
How many perceptrons in this neural network?
[Diagram builds up from ‘one perceptron’ to ‘six perceptrons’]
Some terminology…
• ‘input’ layer
• ‘hidden’ layer
• ‘output’ layer
…also called a Multi-layer Perceptron (MLP)
The hidden and output layers here are ‘fully connected layers’: all pairwise neurons between consecutive layers are connected.
How many neurons (perceptrons)? 4 + 2 = 6
How many weights (edges)? (3 × 4) + (4 × 2) = 20
How many learnable parameters total? 20 + 4 + 2 = 26 (weights plus bias terms)
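This counting generalizes to any fully connected stack. A minimal sketch, assuming the network is described as a list of layer sizes (the helper name is illustrative, not from the slides):

```cpp
#include <vector>
#include <cstddef>

// Learnable parameters of a fully connected (MLP) network:
// each layer of size n_out fed by a layer of size n_in contributes
// n_in * n_out weights plus n_out bias terms.
int count_parameters(const std::vector<int>& layer_sizes)
{
    int total = 0;
    for (std::size_t i = 1; i < layer_sizes.size(); ++i)
        total += layer_sizes[i - 1] * layer_sizes[i]  // weights (edges)
               + layer_sizes[i];                      // bias terms
    return total;
}
```

For the network above, count_parameters({3, 4, 2}) gives (3·4 + 4) + (4·2 + 2) = 26, matching the slide.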
performance usually tops out at 2-3 layers,
deeper networks don’t really improve performance...
...with the exception of convolutional networks for images
Training perceptrons
world’s smallest perceptron!
y = wx
(a.k.a. line equation, linear regression)
Learning a Perceptron
Given a set of samples and a Perceptron
Estimate the parameters of the Perceptron
Given training data:
x      y
10     10.1
2      1.9
3.5    3.4
1      1.1
What do you think the weight parameter is? (here y ≈ x, so w ≈ 1)
This is not so obvious as the network gets more complicated, so we use …
An Incremental Learning Strategy
(gradient descent)
Given several examples and a perceptron:
Modify the weight w (the perceptron parameter) so that the perceptron output ŷ gets ‘closer’ to the true label y.
But what does ‘closer’ mean?
Before diving into gradient descent, we need to understand …
Loss Function
defines what it means to be close to the true solution.
YOU get to choose the loss function!
(some are better than others depending on what you want to do)
L1 Loss • L2 Loss • Zero-One Loss • Hinge Loss
[Plots of the four loss functions]
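The slide names the losses without formulas; written out with true label $y$ and prediction $\hat{y}$, the standard definitions (assumed here, since the slide formulas did not survive extraction) are:

```latex
\begin{aligned}
\text{L1 loss:} \quad & L(y, \hat{y}) = |y - \hat{y}| \\
\text{L2 loss:} \quad & L(y, \hat{y}) = (y - \hat{y})^2 \\
\text{Zero-one loss:} \quad & L(y, \hat{y}) = \mathbb{1}[\, y \neq \hat{y} \,] \\
\text{Hinge loss:} \quad & L(y, \hat{y}) = \max(0,\, 1 - y\hat{y}), \quad y \in \{-1, +1\}
\end{aligned}
```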
Learning Strategy
(gradient descent)
Given several examples and a perceptron, modify the weight w so that the perceptron output ŷ gets ‘closer’ to the true label y.
Gradient descent
Two ways to think about the gradient:
1. Slope of a function: the gradient describes the slope of the function around a point.
2. Knobs on a machine (input → output): the gradient describes how each ‘knob’ (parameter) affects the output; a small change in a parameter changes the output in proportion to the corresponding partial derivative.
Gradient descent:
Given a fixed point on a function, move in the direction opposite of the gradient.
update rule: w ← w − η ∂L/∂w  (η is the step size)
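The update rule can be sketched on a toy problem; here a one-dimensional example (the objective L(w) = (w − 3)² and the step size are illustrative assumptions, not from the slides):

```cpp
// One-dimensional gradient descent: starting from w0, repeatedly
// apply the update rule w <- w - eta * dL/dw.
// Illustrative objective: L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
float gradient_descent(float w0, float eta, int steps)
{
    float w = w0;
    for (int i = 0; i < steps; ++i) {
        float grad = 2.0f * (w - 3.0f); // gradient of (w - 3)^2
        w -= eta * grad;                // move opposite the gradient
    }
    return w;
}
```

Starting at w = 0 with η = 0.1, the iterates approach the minimizer w = 3.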
Backpropagation
Training the world’s smallest perceptron, ŷ = wx.
This is just gradient descent; that means the update term should be the gradient of the loss function:
L = ½(y − ŷ)²   (the ½ is just shorthand: it cancels the 2 when differentiating)
Now where does this come from? Compute the derivative:
∂L/∂w = −(y − ŷ) ∂ŷ/∂w = −(y − ŷ)x
That means the weight update for gradient descent is:
w ← w + η(y − ŷ)x   (move in the direction of the negative gradient)
Gradient Descent (world’s smallest perceptron)
For each sample
1. Predict
a. Forward pass
b. Compute Loss
2. Update
a. Back Propagation
b. Gradient update
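The loop above can be sketched for the world’s smallest perceptron ŷ = wx on the training table from earlier; the L2 loss, learning rate, and epoch count are assumptions for illustration:

```cpp
#include <vector>
#include <utility>

// Train y_hat = w * x by gradient descent with L2 loss L = 0.5*(y - y_hat)^2.
// Since dL/dw = -(y - y_hat) * x, the update is w <- w + eta * (y - y_hat) * x.
float train_smallest_perceptron(const std::vector<std::pair<float, float>>& data,
                                float eta, int epochs)
{
    float w = 0.0f; // arbitrary starting weight
    for (int e = 0; e < epochs; ++e) {
        for (const auto& [x, y] : data) {
            float y_hat = w * x;             // 1a. forward pass
            float grad  = -(y - y_hat) * x;  // 2a. backprop (one derivative)
            w -= eta * grad;                 // 2b. gradient update
        }
    }
    return w;
}
```

On the table {(10, 10.1), (2, 1.9), (3.5, 3.4), (1, 1.1)} this converges near w ≈ 1, the answer we eyeballed earlier.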
world’s (second) smallest perceptron!
ŷ = w₁x₁ + w₂x₂
a function of two parameters!
Gradient Descent
For each sample:
1. Predict
   a. Forward pass
   b. Compute loss
2. Update
   a. Back propagation (we just need to compute partial derivatives for this network)
   b. Gradient update
Derivative computation
ŷ = w₁x₁ + w₂x₂
∂L/∂w₁ = −(y − ŷ)x₁    ∂L/∂w₂ = −(y − ŷ)x₂
Gradient update:
w₁ ← w₁ + η(y − ŷ)x₁    w₂ ← w₂ + η(y − ŷ)x₂
Gradient Descent
For each sample:
1. Predict
   a. Forward pass
   b. Compute loss (side computation to track loss; not needed for backprop)
2. Update
   a. Back propagation (two lines now, one per parameter)
   b. Gradient update (η is an adjustable step size)
We haven’t seen a lot of ‘propagation’ yet
because our perceptrons only had one layer…
multi-layer perceptron
a function of FOUR parameters and FOUR layers!
[Diagram: input layer 1 → hidden layer 2 → hidden layer 3 → output layer 4; each connection applies a weight, and each unit computes a sum followed by an activation]
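A minimal sketch of this forward pass, assuming the network is a chain of single-neuron layers (one weight per connection, sigmoid activations); the exact wiring of the slide’s diagram is an assumption here:

```cpp
#include <cmath>
#include <vector>

// Sigmoid activation, as used in the lecture.
float sigmoid(float a)
{
    return 1.0f / (1.0f + std::exp(-a));
}

// Forward pass through a chain of single-neuron layers:
// each layer multiplies by its weight (the sum), then applies
// the activation. One weight per layer transition is assumed.
float forward_chain(float x, const std::vector<float>& weights)
{
    float h = x;
    for (float w : weights)
        h = sigmoid(w * h); // sum, then activation
    return h;
}
```

With no weights the input passes through unchanged; each added weight nests one more activation, which is exactly why the whole network can be written as one long equation.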
The entire network can be written out as one long equation: the output ŷ is a nested composition of weights and activation functions applied to the input.
We need to train the network: what is known? What is unknown?
Known: the inputs and true labels (the training data), and the activation functions.
Unknown: the weights.
Learning an MLP
Given a set of samples and an MLP,
estimate the parameters of the MLP.
Gradient Descent
For each random sample:
1. Predict
   a. Forward pass
   b. Compute loss
2. Update
   a. Back propagation → a vector of parameter partial derivatives
   b. Gradient update → a vector of parameter update equations
So we need to compute the partial derivatives ∂L/∂wᵢ.
Remember: a partial derivative describes how the loss (at the loss layer) changes as one weight changes.
So, how do you compute it?
The Chain Rule
According to the chain rule:
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂a)(∂a/∂w)
Intuitively, the effect of weight w on the loss L passes through the rest of the network: L depends on the output ŷ, ŷ depends on the sum a, and a depends on w. Chain Rule!
∂L/∂ŷ = −(y − ŷ)  — just the partial derivative of the L2 loss.
Let’s use a Sigmoid function: then ∂ŷ/∂a = f(a)(1 − f(a)).
The factor ∂L/∂ŷ was already computed — re-use (propagate)!
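These chain-rule factors can be sanity-checked numerically; a sketch for a single sigmoid unit with L2 loss, compared against a finite-difference gradient (the test values are arbitrary):

```cpp
#include <cmath>

float sigmoid(float a) { return 1.0f / (1.0f + std::exp(-a)); }

// Analytic gradient of L = 0.5*(y - f(w*x))^2 with respect to w,
// built from the three chain-rule factors:
// dL/dw = (dL/dy_hat) * (dy_hat/da) * (da/dw)
//       = -(y - y_hat) * f(a)*(1 - f(a)) * x
float analytic_grad(float w, float x, float y)
{
    float a = w * x;
    float y_hat = sigmoid(a);
    return -(y - y_hat) * y_hat * (1.0f - y_hat) * x;
}

// Finite-difference approximation of the same gradient.
float numeric_grad(float w, float x, float y)
{
    const float eps = 1e-3f;
    auto loss = [&](float ww) {
        float d = y - sigmoid(ww * x);
        return 0.5f * d * d;
    };
    return (loss(w + eps) - loss(w - eps)) / (2.0f * eps);
}
```

If the two disagree, one of the chain-rule factors was derived wrong; this check is a standard way to debug a backprop implementation.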
The Chain Rule
a.k.a. backpropagation
The chain rule says: the loss depends on the output, the output depends on each layer’s activation, each activation depends on that layer’s sum and weight, and so on back through the network. Working backwards layer by layer, each partial derivative re-uses factors already computed for the layers after it — re-use (propagate)!
Gradient Descent
For each example sample:
1. Predict
   a. Forward pass
   b. Compute loss
2. Update
   a. Back propagation → a vector of parameter partial derivatives
   b. Gradient update → a vector of parameter update equations (step in the negative gradient direction)
Stochastic gradient descent
What we are truly minimizing:
$\min_\theta \sum_{i=1}^{N} L(y_i, f_{\mathrm{MLP}}(x_i))$
The gradient is:
$\sum_{i=1}^{N} \frac{\partial L(y_i, f_{\mathrm{MLP}}(x_i))}{\partial \theta}$
What we use for the gradient update is:
$\frac{\partial L(y_i, f_{\mathrm{MLP}}(x_i))}{\partial \theta}$ for some $i$
Stochastic Gradient Descent
For each example sample:
1. Predict
   a. Forward pass
   b. Compute loss
2. Update
   a. Back propagation → a vector of parameter partial derivatives
   b. Gradient update → a vector of parameter update equations
How do we select which sample?
• Select randomly!
Do we need to use only one sample?
• You can use a minibatch of size B < N.
Why not do gradient descent with all samples?
• Bad convergence.
Do I lose anything by using stochastic GD?
• Same convergence guarantees and complexity!
• Better generalization.
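The random-minibatch idea can be sketched as an index-sampling step; the function name and the shuffle-based approach are illustrative assumptions:

```cpp
#include <vector>
#include <random>
#include <numeric>
#include <algorithm>

// Pick a random minibatch of B indices out of N samples (without
// replacement). Each SGD update then uses only these B samples'
// gradients instead of summing over all N.
std::vector<int> sample_minibatch(int N, int B, std::mt19937& rng)
{
    std::vector<int> idx(N);
    std::iota(idx.begin(), idx.end(), 0);   // 0, 1, ..., N-1
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(B);                          // keep the first B
    return idx;
}
```

With B = 1 this is plain stochastic gradient descent; with B = N it degenerates to full-batch gradient descent.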
References
Basic reading: No standard textbooks yet! Some good resources:
• https://sites.google.com/site/deeplearningsummerschool/
• http://www.deeplearningbook.org/
• http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf