Neural networks for classification and regression
Outline
▪ Why deep learning?
▪ Neural networks
• activation functions
▪ Training
• Backpropagation
• Stochastic gradient descent
▪ Quiz Review and answer to some questions from exercises/problem sets
Introduction · Linear regression · Logistic regression
Feature engineering · Data statistics · Naive Bayes
KNN · Clustering · Dimensionality reduction
Neural networks · Convolutional neural networks · Decision trees
Background for Neural networks
Review of supervised learning
▪ Feature vectors, independent variables: $x^i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
▪ Labels, dependent variables, target, outcome: $y^i \in \mathbb{R}$ (regression) or $y^i \in \{1, 2, \ldots, k\}$ (classification)
▪ Training data/set/examples: $S = \{(x^i, y^i)\}$ for supervised learning, $\{x^i\}$ for unsupervised learning
▪ Sample, sample point, data point: one pair $(x^i, y^i) \in S$
Why deep learning?
Logistic regression review
Logistic regression for d-dimensional data:
▪ Input: $x \in \mathbb{R}^d$
▪ Weights: $w \in \mathbb{R}^d$, bias: $b$
▪ Logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
▪ Output: $\hat{y} = \sigma(w^T x + b)$

[Diagram: inputs $x_1, \ldots, x_d$ are weighted by $w_1, \ldots, w_d$, combined with the bias, and passed through the activation function $\sigma$ to produce the output $\hat{y}$]
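A minimal NumPy sketch of this computation, with arbitrary illustrative values for the input, weights, and bias:

import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: a 3-dimensional input, weights, and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

# Logistic regression prediction: y_hat = sigma(w^T x + b)
y_hat = sigmoid(w @ x + b)
print(y_hat)  # a probability in (0, 1)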
Why deep learning?
Limitations of logistic regression
Issue: Logistic regression performs badly on non-
linearly separable data
Potential fix: Use feature engineering to make data
linearly separable, then use logistic regression
However:
▪ Features that linearly separate the data can be hard to find manually, especially in high dimensions
Why deep learning?
From logistic regression to neural networks
Neural networks have been successful in learning complex, non-linear functions
Why deep learning?
New way to approach ML
Before deep learning:
Data Features Model
Hand-design the features
Deep learning:
Data Features Model
Deep neural networks derive
useful features from the data!
Neural networks
Neural networks
Representation
[Two example networks; the weights sit on the connections between layers]
Left: "2-layer Neural Net" or "1-hidden-layer Neural Net"
Right: "3-layer Neural Net" or "2-hidden-layer Neural Net"
“Fully-connected” layers
Each neuron of a layer is connected
to all neurons of the following layer
Neural networks
Inside a neuron
g = Activation function
Connections to biological neurons
Source: towardsdatascience.com
Applications - nowadays everywhere!
Applications - examples in IGM
Neural networks
Representation

Notation: $a_i^{[l]}$ denotes node $i$ in layer $l$.

Input: $x = [x_1, x_2, x_3]^T$, shape (3, 1)

Weight vector for the first node of the first layer:
$w_1^{[1]} = [w_{1,1}^{[1]}, w_{1,2}^{[1]}, w_{1,3}^{[1]}]^T$, shape (3, 1)

Each node of the first layer computes a pre-activation and then applies the activation function, giving a nonlinear transformation of $x$:
$z_1^{[1]} = w_1^{[1]T} x + b_1^{[1]}, \quad a_1^{[1]} = g^{[1]}(z_1^{[1]})$
$z_2^{[1]} = w_2^{[1]T} x + b_2^{[1]}, \quad a_2^{[1]} = g^{[1]}(z_2^{[1]})$
$z_3^{[1]} = w_3^{[1]T} x + b_3^{[1]}, \quad a_3^{[1]} = g^{[1]}(z_3^{[1]})$
$z_4^{[1]} = w_4^{[1]T} x + b_4^{[1]}, \quad a_4^{[1]} = g^{[1]}(z_4^{[1]})$
Neural networks
Representation

Vector notation: stack the weight vectors as columns and the biases as a vector,
$W^{[1]} = [w_1^{[1]} \; w_2^{[1]} \; w_3^{[1]} \; w_4^{[1]}]$, shape (3, 4), $\quad b^{[1]} = [b_1^{[1]}, b_2^{[1]}, b_3^{[1]}, b_4^{[1]}]^T$, shape (4, 1)

so that the whole layer is computed at once:
$z^{[1]} = W^{[1]T} x + b^{[1]}$, then apply the activation: $a^{[1]} = g^{[1]}(z^{[1]})$
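A minimal NumPy sketch of this vectorized layer computation, with random values for $W^{[1]}$ and $b^{[1]}$, the (3, 4) shape from the slide, and tanh chosen as an illustrative activation:

import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((3, 1))    # input, shape (3, 1)
W1 = rng.standard_normal((3, 4))   # W^[1], shape (3, 4): one column per node
b1 = rng.standard_normal((4, 1))   # b^[1], shape (4, 1)

z1 = W1.T @ x + b1                 # z^[1] = W^[1]T x + b^[1], shape (4, 1)
a1 = np.tanh(z1)                   # a^[1] = g^[1](z^[1]), here with g = tanh
print(z1.shape, a1.shape)          # (4, 1) (4, 1)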
Activation functions
NN - Activation Function
Introduction
$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$

Q: What happens if we remove the activations?

$\hat{y} = W^{[2]T} (W^{[1]T} x + b^{[1]}) + b^{[2]}$
$\hat{y} = W^{[2]T} W^{[1]T} x + W^{[2]T} b^{[1]} + b^{[2]}$

Define $W'^T = W^{[2]T} W^{[1]T}$ and $b' = W^{[2]T} b^{[1]} + b^{[2]}$:
$\hat{y} = W'^T x + b'$
A: We end up with a linear classifier!
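A small NumPy check of this collapse, with arbitrary random parameters and shapes: applying the two weight-and-bias maps in sequence gives exactly the same output as the single combined linear map.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 1))

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 2)), rng.standard_normal((2, 1))

# Two linear layers with no activation in between
y_two_layers = W2.T @ (W1.T @ x + b1) + b2

# Equivalent single linear layer
W_prime_T = W2.T @ W1.T
b_prime = W2.T @ b1 + b2
y_one_layer = W_prime_T @ x + b_prime

print(np.allclose(y_two_layers, y_one_layer))  # True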
NN - Activation functions
Introduction
To model a nonlinear problem:
▪ Pass the output of each neuron through a nonlinear function, called an activation function
▪ Connection to neuron firing in the brain
Some well-known activation functions:
▪ Sigmoid
▪ Tanh
▪ ReLU
NN - Activation functions
Overview
Sigmoid (σ):
▪ Squashes the input into the [0, 1] range
▪ Approximately nullifies the gradient for "large" positive or negative inputs -> vanishing gradient problem
▪ Rarely used, except for the final layer of a binary classification network

Tanh:
▪ Squashes the input into the [-1, 1] range
▪ Like sigmoid, nullifies the gradient for "large" positive or negative inputs
▪ Zero-centered, preferable over sigmoid as an activation
▪ Rarely used in practice (ReLU is more popular)
NN - Activation functions
Overview
Rectified Linear Unit (ReLU):
▪ Easily computed, simple gradient
▪ Greatly accelerates convergence of gradient descent
▪ Saturates in only one direction, so it suffers less from the vanishing gradient problem
▪ Commonly used in practice

Leaky ReLU:
▪ Attempts to fix the "dying ReLU" problem by having a small negative slope for x < 0
▪ Leaky ReLU and other ReLU variants (ELU, SELU, GELU, Swish, etc.) are sometimes used over ReLU
NN - Activation functions
Derivatives
Sigmoid:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$

Tanh:
$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$

Rectified Linear Unit (ReLU):
$\mathrm{ReLU}(x) = \max(0, x)$, i.e. $0$ for $x \le 0$ and $x$ for $x > 0$
$\frac{d}{dx}\mathrm{ReLU}(x) = 0$ for $x < 0$ and $1$ for $x > 0$

Note: The derivative of ReLU is undefined at $x = 0$. By convention, it is set to 0.
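The same functions and derivatives written out in NumPy (with the x = 0 convention for ReLU handled as stated above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 0 for x < 0, 1 for x > 0; set to 0 at x = 0 by convention
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid_grad(x), tanh_grad(x), relu_grad(x))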
Training neural nets
Neural networks
Training
Forward pass of a 2-layer NN (for a single example):

$z^{[1]} = W^{[1]T} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]T} a^{[1]} + b^{[2]}$
$\hat{y} = a^{[2]} = g^{[2]}(z^{[2]})$

In one equation:
$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$
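A minimal NumPy sketch of this forward pass with random parameters, assuming tanh in the hidden layer and a sigmoid output (the slide does not fix the activations, so these are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 1))

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))  # layer 1
W2, b2 = rng.standard_normal((4, 1)), rng.standard_normal((1, 1))  # layer 2

z1 = W1.T @ x + b1          # z^[1] = W^[1]T x + b^[1]
a1 = np.tanh(z1)            # a^[1] = g^[1](z^[1])
z2 = W2.T @ a1 + b2         # z^[2] = W^[2]T a^[1] + b^[2]
y_hat = sigmoid(z2)         # y_hat = g^[2](z^[2])
print(y_hat.shape)          # (1, 1)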
Neural networks
Training
Forward pass of a 2-layer NN (for a single sample):

$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$

To train, we need a loss function: $L(\hat{y}, y)$

Using that loss function, we want to update $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$ using gradient descent.
Loss function
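The loss itself is not spelled out in the extracted slide; as hedged examples, binary cross-entropy (for classification with a sigmoid output) and squared error (for regression) are two standard choices for $L(\hat{y}, y)$:

import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """L(y_hat, y) = -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def squared_error(y_hat, y):
    """L(y_hat, y) = (y_hat - y)^2, a common regression loss."""
    return (y_hat - y) ** 2

print(binary_cross_entropy(0.9, 1.0), squared_error(2.5, 3.0))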
Gradient descent

Need to compute: $\frac{\partial L}{\partial W^{[i]}}$ and $\frac{\partial L}{\partial b^{[i]}}$
=> the gradient of the loss with respect to the weights

Once the gradients are computed, update the weights with:
▪ $W^{[i]} := W^{[i]} - \alpha \frac{\partial L}{\partial W^{[i]}}$
▪ $b^{[i]} := b^{[i]} - \alpha \frac{\partial L}{\partial b^{[i]}}$
where $\alpha$ is the learning rate
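The update itself is one line per parameter; a minimal sketch, assuming the gradients dW and db have already been computed (by backpropagation, covered next):

import numpy as np

def gradient_step(W, b, dW, db, alpha=0.01):
    """One gradient descent update: W := W - alpha*dL/dW, b := b - alpha*dL/db."""
    return W - alpha * dW, b - alpha * db

# Illustrative usage with arbitrary arrays
W, b = np.zeros((3, 4)), np.zeros((4, 1))
dW, db = np.ones((3, 4)), np.ones((4, 1))
W, b = gradient_step(W, b, dW, db, alpha=0.1)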
Neural networks
Forward / Backward pass
Forward pass: Compute the output of a neural network for a given input
Backward pass: Compute the derivatives of the loss with respect to the network parameters
During training, you need both the forward pass and the backward pass.
During inference, you only need the forward pass.
Inference: the process of using a trained machine learning model for prediction
Computing gradients
Backpropagation
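The backpropagation slides here are figure-based; below is a minimal NumPy sketch of the backward pass for the 2-layer network used in this lecture, assuming (as illustrative choices) a tanh hidden layer, a sigmoid output, and the binary cross-entropy loss, for which $\partial L / \partial z^{[2]} = \hat{y} - y$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x, y = rng.standard_normal((3, 1)), 1.0

W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 1)), rng.standard_normal((1, 1))

# Forward pass (keep intermediates for the backward pass)
z1 = W1.T @ x + b1
a1 = np.tanh(z1)
z2 = W2.T @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass: chain rule, layer by layer
dz2 = y_hat - y                 # dL/dz^[2] for sigmoid output + binary cross-entropy
dW2 = a1 @ dz2.T                # dL/dW^[2], shape (4, 1)
db2 = dz2                       # dL/db^[2]
da1 = W2 @ dz2                  # dL/da^[1]
dz1 = da1 * (1.0 - a1 ** 2)     # dL/dz^[1], using tanh'(z) = 1 - tanh(z)^2
dW1 = x @ dz1.T                 # dL/dW^[1], shape (3, 4)
db1 = dz1                       # dL/db^[1]

# One gradient descent step
alpha = 0.1
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2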
Neural networks
Forward pass

Forward pass of this 2-layer NN:

$z^{[1]} = W^{[1]T} x + b^{[1]}$
$a^{[1]} = g^{[1]}(z^{[1]})$
$z^{[2]} = W^{[2]T} a^{[1]} + b^{[2]}$
$a^{[2]} = g^{[2]}(z^{[2]})$
$\hat{y} = a^{[2]}$

Rewriting it in one equation:

$\hat{y} = g^{[2]}(W^{[2]T} g^{[1]}(W^{[1]T} x + b^{[1]}) + b^{[2]})$
Stochastic gradient descent
Mini-batch stochastic gradient descent
Problems with training
Recap on training a neural network
Loop:
1. Sample a batch of data
2. Forward pass to get the loss
3. Backward pass to calculate gradient
4. Update parameters using the gradient
▪ The forward pass computes the result of an operation and saves any intermediates needed for gradient computation in memory
▪ Backward pass applies the chain rule to compute the gradient of the loss
function with respect to the inputs
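A minimal PyTorch sketch of this loop; the model, loss, optimizer, and random data below are illustrative placeholders, not from the exercises:

import torch
import torch.nn as nn

# Illustrative placeholders: a tiny model, loss, optimizer, and random data
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))

for step in range(100):
    # 1. Sample a batch of data
    idx = torch.randint(0, X.shape[0], (32,))
    xb, yb = X[idx], y[idx]
    # 2. Forward pass to get the loss
    loss = loss_fn(model(xb), yb)
    # 3. Backward pass to calculate the gradient
    optimizer.zero_grad()
    loss.backward()
    # 4. Update parameters using the gradient
    optimizer.step()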
Deep learning frameworks
Overview
Deep learning frameworks are used to efficiently define and train neural networks
• Support for many types of layers, activations, loss functions, optimizers, …
• Backpropagation computed automatically (e.g. loss.backward() in PyTorch)
• GPU support for faster training
Most popular frameworks today:
• PyTorch (https://pytorch.org)
• TensorFlow (https://www.tensorflow.org/)
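A tiny illustration of the automatic backpropagation mentioned above: after loss.backward(), autograd has filled in w.grad with $\partial L / \partial w$ (the tensors and loss here are arbitrary).

import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
y = torch.tensor(1.0)

y_hat = torch.sigmoid(w @ x)          # forward pass
loss = (y_hat - y) ** 2               # a simple squared-error loss
loss.backward()                       # backward pass, computed automatically
print(w.grad)                         # dL/dw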
Deep learning frameworks
Implementing a simple neural network in PyTorch
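The code from this slide is not reproduced in the extracted text; the following is a minimal sketch of what such a network can look like in PyTorch, with illustrative layer sizes (784 inputs, 64 hidden units, 10 classes):

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    """A 1-hidden-layer fully-connected network (illustrative sizes)."""

    def __init__(self, in_dim=784, hidden_dim=64, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # W^[1], b^[1]
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # W^[2], b^[2]

    def forward(self, x):
        a1 = torch.relu(self.fc1(x))   # a^[1] = g^[1](z^[1])
        return self.fc2(a1)            # class scores (logits)

model = SimpleNet()
scores = model(torch.randn(8, 784))   # forward pass on a batch of 8 inputs
print(scores.shape)                   # torch.Size([8, 10])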
Extensions of feedforward neural networks
▪ Convolutional neural networks (next lecture)
▪ Recurrent neural networks (relevant for control systems)
Python exercises
▪ You will create a neural network for hand-written digit classification
▪ Training data is based on the MNIST dataset: an online dataset of 70,000 images of hand-written digits
▪ Training neural networks is time (and energy) consuming
▪ Each image has 28 x 28 = 784 pixels
▪ We will use Google Colab as it provides access to faster hardware:
• GPU (graphics processing units)
• TPU (tensor processing units)
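A small sketch of how the GPU is typically picked up from PyTorch in Colab (after selecting a GPU runtime); TPU use needs extra setup that is not shown here.

import torch

# Use the GPU if Colab has one attached, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Move the model and each batch of data to the chosen device before training
# model = model.to(device)
# xb, yb = xb.to(device), yb.to(device)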
Brief review of last lecture
PCA
k-means
Questions from problem set
Data covariance matrix