
Deep Learning

Deep Learning

Why do you want to know about deep learning?


Motivation
• It works!
• State of the art in machine learning
• Google, Facebook, Twitter, and Microsoft are all using it.
• It is fun!
• Need to know what you are doing to do it well.
Google Trends

https://www.google.com/trends/explore#q=deep%20learning%2C%20%2Fm%2F0hc2f&cmpt=q&tz=Etc%2FGMT%2B5
Scene recognition

http://places.csail.mit.edu/demo.html
Google Brain - 2012

16 000 Cores
What it learned

http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html?pagewanted=all
Google DeepMind

https://www.youtube.com/watch?v=V1eYniJ0Rnk
What is different?
• We have seen ML methods:
– SVM, decision trees, boosting, random forests
• We needed to hand-design the input features
• The ML algorithm learns the decision boundary
Feature Design

[Figure: input image patch, then hand-designed features, then a yes/no classification]
Learned Feature Hierarchy

[Figure: input image patch, then low-level features, medium-level features, high-level features, then a yes/no classification; figure from Honglak Lee]
Scaling with Data Size

[Plot: performance vs. amount of data; deep learning keeps improving with more data while most ML algorithms plateau; after Andrew Ng]
Deep Learning Techniques
• Artificial neural network
– Introduced in the 60s
• Convolutional neural network
– Introduced in the 80s
• Recurrent neural network
– Introduced in the 80s
What Changed since the 80s?
• MLPs and CNNs have been around since the 80s and earlier
• Why did people even bother with SVMs, boosting and co?
• And why do we still care about those methods?
Brain or Rocket

https://www.youtube.com/watch?v=EczYSl-ei9g
What Changed – Computational Power
What Changed – Data Size
I don’t Have a Cluster at Home

http://www.geforce.com/whats-new/articles/introducing-nvidia-geforce-gtx-titan-z
Deep Learning

What is deep learning?


Perceptron

[Figure: inputs x1, x2, x3 with weights w1, w2, w3, plus a constant input of -1 weighted by the bias b, feeding a single output unit]

Output: s(b + w^T x)
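As a concrete illustration, here is a minimal NumPy sketch of this forward pass, assuming a step activation; the weights and inputs are made up and not from the slides:

import numpy as np

def step(z):
    # hard threshold: 1 if z >= 0, else 0
    return 1.0 if z >= 0 else 0.0

def perceptron(x, w, b):
    # output of the perceptron: s(b + w^T x)
    return step(b + w @ x)

# illustrative inputs and weights
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.4, -0.3, 0.8])
b = 0.1
print(perceptron(x, w, b))   # prints 1.0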
Separating Hyperplane
• x: data point
• y: label
• w: weight vector
• b: bias

[Figure: a separating hyperplane with normal vector w and offset b]
Side Note: Step vs Sigmoid Activation

s(x) = 1 / (1 + e^(-c x))
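A small sketch of the two activations; c is the steepness constant from the formula above, and the values used here are only examples:

import numpy as np

def step(x):
    # hard threshold at zero
    return (x >= 0).astype(float)

def sigmoid(x, c=1.0):
    # smooth, differentiable version of the step: s(x) = 1 / (1 + exp(-c x))
    return 1.0 / (1.0 + np.exp(-c * x))

xs = np.linspace(-5.0, 5.0, 5)
print(step(xs))            # [0. 0. 1. 1. 1.]
print(sigmoid(xs, c=2.0))  # values rising smoothly from ~0 to ~1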
The XOR Problem

[Figure: the four XOR points plotted over x1 and x2; the two classes cannot be separated by a single line]

The XOR Problem

[Figure: the same data with an additional coordinate x3, in which the classes become separable]
Perceptron

[Figure: a single perceptron gives one linear decision boundary w^T x = 0 in the (x1, x2) plane]
Multi-Perceptron

Stack several perceptrons into one weight matrix:

W = [ w11 w12
      w21 w22
      w31 w32 ]

Each row w_i defines its own boundary w_i^T x = 0 in the (x1, x2) plane.

Wx = ?

[Figure: the three lines w_1^T x = 0, w_2^T x = 0, w_3^T x = 0 in the (x1, x2) plane]
XOR Problem

W x_a = [+, -]
W x_b = [+, +]
W x_c = [-, -]
W x_d = [+, -]

[Figure: the boundaries w_1^T x = 0 and w_2^T x = 0 in the (x1, x2) plane, with the four points a, b, c, d in the resulting regions]
XOR Problem

W x_a = [+, -]
W x_b = [+, +]
W x_c = [-, -]
W x_d = [+, -]

[Figure: the same points in the hidden space (h1, h2): b, c, and the merged point a,d]

Since a and d share the same sign pattern, they collapse onto one point in hidden space, and the problem becomes linearly separable.
Multi-Layer Perceptron

[Figure: MLP with input x and weight matrix W; the hidden layer computes s(b^(1) + W^(1) x)]

http://deeplearning.net/tutorial/mlp.html
Multi-Layer Perceptron

f(x) = G(b^(2) + W^(2) s(b^(1) + W^(1) x))

G: logistic function, softmax for multiclass

http://deeplearning.net/tutorial/mlp.html
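A minimal NumPy sketch of this two-layer forward pass, assuming a sigmoid hidden layer and a softmax output; the layer sizes and random values are purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    # f(x) = G(b^(2) + W^(2) s(b^(1) + W^(1) x)), with s = sigmoid and G = softmax
    h = sigmoid(b1 + W1 @ x)
    return softmax(b2 + W2 @ h)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # 4 input features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)     # 8 hidden units
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)     # 3 output classes
print(mlp_forward(x, W1, b1, W2, b2))             # 3 probabilities summing to 1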
Yes No Classification

[Figure: feature hierarchy: low-level features, then medium-level features, then high-level features, then a yes/no classification]
Autoencoder
• This is what Google used for their Google Brain
• Basically just an MLP
• Output size is equal to input size
• Popular for pre-training a network on unlabeled data
Autoencoder

[Figure: encoder with weights W maps the input to a code; decoder with tied weights W^T reconstructs the input]
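A minimal sketch of such an encoder/decoder pair with tied weights, assuming sigmoid activations; the input and code sizes here are arbitrary examples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W, b_enc, b_dec):
    code = sigmoid(W @ x + b_enc)        # encoder: compress the input
    recon = sigmoid(W.T @ code + b_dec)  # decoder with tied weights W^T
    return code, recon

rng = np.random.default_rng(0)
x = rng.random(64)                         # e.g. a flattened 8x8 image patch
W = rng.normal(scale=0.1, size=(16, 64))   # 16-dimensional code
code, recon = autoencoder_forward(x, W, np.zeros(16), np.zeros(64))
print(((x - recon) ** 2).mean())           # reconstruction error (before any training)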
Deep Autoencoder
• Reconstruct image from learned low-dimensional code
• Weights are tied
• Learned features are often useful for classification
• Can add noise to the input image to prevent overfitting

Salakhutdinov & Hinton, NIPS 2007

From MLP to CNN
• So far no notion of neighborhood
• Invariant to permutation of the input
• A lot of data is structured:
– Images
– Speech
– …
• Convolutional neural networks preserve neighborhood

http://www.amolgmahurkar.com/classifySTLusingCNN.html
Convolution
Convolutional Network

http://parse.ele.tue.nl/cluster/2/CNNArchitecture.jpg
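To make the convolution operation concrete, here is a plain NumPy sketch of a single 2D "valid" convolution (strictly a cross-correlation, as in CNN layers) with one filter, no padding, and stride 1; the image and filter are made up, and real libraries use far faster implementations:

import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over the image and take a dot product at each position
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple hand-made filter
print(conv2d_valid(image, vertical_edge))          # 3x3 feature map

The same small filter is applied at every position, which is exactly the weight tying and neighborhood preservation listed as CNN advantages below.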
CNN Advantages
• neighborhood preserved
• translation invariant
• tied weights
DNNs are hard to train
• backpropagation – gradient descent
• many local minima
• prone to overfitting
• many parameters to tune
• SLOW
Stochastic Gradient Descent

https://www.youtube.com/watch?v=HvLJUsEc6dw
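A minimal stochastic gradient descent loop, shown here on a toy linear least-squares problem with synthetic data; the learning rate and epoch count are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)       # noisy targets

w = np.zeros(5)
lr = 0.02                                          # learning rate
for epoch in range(30):
    for i in rng.permutation(len(X)):              # visit samples in random order
        xi, yi = X[i], y[i]
        grad = 2.0 * (xi @ w - yi) * xi            # gradient of the squared error on one sample
        w -= lr * grad                             # stochastic gradient descent update
print(w)                                           # close to true_w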
Development
• Computers got faster!
• Data got bigger.
• Initialization got better.
2006 Breakthrough
• Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed
• Unsupervised feature learners:
– RBMs
– Auto-encoder variants
– Sparse coding variants
Unsupervised Pretraining

http://jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
Dropout

[Figure: a multi-layer network with input x and weights W in which some units, marked X, are randomly dropped]

• Helps with overfitting (see the sketch below)
• Typically used with random initialization
• Training is slower than without dropout
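A sketch of the training-time mechanism, using the "inverted dropout" convention of rescaling the surviving units; the drop probability is just an example:

import numpy as np

def dropout(activations, p_drop, rng):
    # zero each unit with probability p_drop; rescale survivors so the expected value is unchanged
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = rng.random(10)                       # some hidden-layer activations
print(dropout(h, p_drop=0.5, rng=rng))   # roughly half the units are zeroed
# at test time dropout is switched off and the full activations are used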
Deep Learning for Sequences
• MLPs and CNNs have fixed input size
• How would you handle sequences?
• Example: Complete a sentence
– …
– are …
– How are …

Slide from G. Hinton
Meaning of Life

https://www.youtube.com/watch?v=vShMxxqtDDs
Slide from G. Hinton
Recurrent Neural Network

http://blog.josephwilk.net/ruby/recurrent-neural-networks-in-ruby.html
Recurrent Neural Network

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
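A minimal sketch of a vanilla recurrent step: the same weights are applied at every time step, so the hidden state can summarize a sequence of any length. Sizes and values are illustrative:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(16, 8))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(16, 16))   # recurrent hidden-to-hidden weights
b_h = np.zeros(16)

h = np.zeros(16)
sequence = rng.normal(size=(5, 8))            # a sequence of 5 input vectors
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)     # the final h depends on the whole sequence
print(h)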
Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus
International Conference on Learning Representations (2014)
http://www.datascienceassn.org/sites/default/files/Intriguing%20Properties%20of%20Neural%20Networks_0.pdf
Libraries
• Theano
• Torch
• Caffe

• TensorFlow
• …
Theano
• Full disclosure: My favorite
• Python
• Transparent GPU integration
• Symbolic graphs
• Auto-gradient
• Low level – in a good way!
• If you want high-level on top:
– Pylearn2
– Keras, Lasagne, Blocks
–…
Torch
• Lua (and no Python interface)
• Very fast convolutions
• Used by Google DeepMind, Facebook AI, IBM
• Layer instead of graph based

https://en.wikipedia.org/wiki/Torch_(machine_learning)
Caffe
• C++ based
• Higher abstraction than Theano or Torch
• Good for training standard models
• Model zoo for pre-trained models
TensorFlow
• Symbolic graph and auto-gradient
• Python interface
• Visualization tools
• Some performance issues regarding speed and
memory

https://github.com/soumith/convnet-benchmarks/issues/66
Tips and Tricks
Number of Layers / Size of Layers
• If data is unlimited, larger and deeper should be better
• Larger networks can overfit more easily
• Take computational cost into account
Learning Rate
• One of the most important parameters
• If the network diverges, the learning rate is most probably too large
• Smaller usually works better
• Can slowly decay over time
• Can have one learning rate per layer

Other tips for SGD:
http://leon.bottou.org/publications/pdf/tricks-2012.pdf
Momentum
• Helps to escape local minima
• Crucial to achieve high performance (see the sketch below)

More about Momentum:
http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf
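A sketch of the classical momentum update on top of plain gradient descent; the toy quadratic loss and the coefficients are only for illustration:

import numpy as np

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # keep a velocity that accumulates past gradients (mu is the momentum coefficient)
    v = mu * v - lr * grad
    w = w + v
    return w, v

# toy loss L(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
print(w)   # approaches the minimum at zero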
Convergence
• Monitor validation error
• Stop when it doesn't improve within n iterations (see the sketch below)
• If the learning rate decays, you might want to adjust the number of iterations
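A sketch of this stopping rule; train_one_epoch and validate are placeholder callables (not from the slides) standing in for a real training loop and validation-error computation:

def train_with_early_stopping(train_one_epoch, validate, patience=10, max_epochs=1000):
    # stop when the validation error has not improved for `patience` consecutive epochs
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        error = validate()                 # current validation error
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                          # stopped improving (or started to overfit)
    return best_error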
Initialization of W
• Need randomization to break symmetry
• Bad initializations are untrainable
• Most heuristics depend on the number of input (and output) units (see the sketch below)
• Sometimes W is rescaled during training
– Weight decay (L2 regularization)
– Normalization
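One widely used heuristic of this kind is Glorot/Xavier-style uniform initialization, which scales the initial range by the number of input and output units; a sketch, with arbitrary layer sizes:

import numpy as np

def glorot_uniform(n_in, n_out, rng):
    # W ~ Uniform(-limit, limit) with limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = glorot_uniform(n_in=784, n_out=256, rng=rng)
print(W.shape, W.std())   # random (breaks symmetry) and scaled to the layer size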
Data Augmentation
• Exploit invariances of the data
• Rotation, translation
• Nonlinear transformation
• Adding Noise

http://en.wikipedia.org/wiki/MNIST_database
Data Normalization
• We have seen std and mean normalization (see the sketch below)
• Whitening
– Neighboring pixels are often redundant
– Removes correlation between features

More about preprocessing:
http://deeplearning.stanford.edu/wiki/index.php/Data_Preprocessing
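A sketch of per-feature mean/std normalization; the statistics are computed on the training set and then reused for any new data (the data here is synthetic):

import numpy as np

def fit_normalizer(X_train):
    # per-feature statistics, computed on the training data only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8   # avoid division by zero
    return mean, std

def normalize(X, mean, std):
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
mean, std = fit_normalizer(X_train)
X_norm = normalize(X_train, mean, std)
print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))   # approximately 0 and 1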
Non-Linear Activation Function
• Sigmoid
– Traditional choice

• Tanh
– Symmetric around the origin
– Better gradient propagation than Sigmoid

• Rectified Linear
– max(x,0)
– State of the art
– Good gradient propagation
– Can “die”
L1 and L2 Regularization
• Most pictures of nice filters involve some regularization
• L2 regularization corresponds to weight decay (see the sketch below)
• L2 and early stopping have similar effects
• L1 leads to sparsity
• Might not be needed anymore (more data, dropout)
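A sketch of how the two penalties enter a gradient step; the data-loss gradient is left as a placeholder and the penalty strengths are arbitrary:

import numpy as np

def regularized_step(w, data_grad, lr=0.01, l1=0.0, l2=0.0):
    # gradient of: data_loss + l1 * ||w||_1 + (l2 / 2) * ||w||^2
    grad = data_grad + l1 * np.sign(w) + l2 * w
    return w - lr * grad   # the l2 term shrinks w a little every step, i.e. weight decay

w = np.array([0.5, -0.2, 0.0, 1.5])
data_grad = np.zeros_like(w)              # pretend the data gradient is zero
for _ in range(200):
    w = regularized_step(w, data_grad, l1=0.001, l2=0.1)
print(w)   # all weights have shrunk toward zero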
Monitoring Training
• Monitor training and validation performance
• Can monitor hidden units
• Good: Uncorrelated and high variance
Further Resources
• More about theory:
– Yoshua Bengio's book: http://www.iro.umontreal.ca/~bengioy/dlbook/
– Deep learning reading list: http://deeplearning.net/reading-list/
• More about Theano:
– http://deeplearning.net/software/theano/
– http://deeplearning.net/tutorial/
