Deep Learning
Lecture (1)
19.10.22 You Sung Min
Bengio, Yoshua, Ian Goodfellow, and Aaron Courville. Deep Learning. Vol. 1. MIT Press, 2017.
0. Introduction
1. Why neural networks?
1. What is a neural network?
2. Universal approximation theorem
3. Why deep neural network?
2. How the network learns
1. Gradient descent
2. Backpropagation
3. Modern deep learning
1. Convolutional neural network
2. Recurrent neural network
Contents
• Example of a deep learning model
Introduction
Image source: Zeiler & Fergus, 2014
• Artificial intelligence
Introduction
• History of deep learning
Introduction
[Timeline: biological learning (1943), perceptron (1958), stochastic gradient descent (1960), neocognitron (1980), backpropagation and distributed representation (1986), LSTM (1997), deep learning (2006)]
• History of deep learning
  • Size of dataset
Introduction
• History of deep learning
  • Connections per neuron
Introduction
[Figure label: 10 = GoogLeNet (2014)]
• History of deep learning
  • Number of neurons
Introduction
[Figure labels: 1 = perceptron, 20 = GoogLeNet]
• Structure of the perceptron (developed in the 1950s)
Why neural networks?
Binary inputs x_j, weights ω_1, ω_2, ω_3, and threshold T:
$$\text{output} = \begin{cases} 0 & \text{if } \sum_j \omega_j x_j \le T \\ 1 & \text{if } \sum_j \omega_j x_j > T \end{cases}$$
or, equivalently, the output is decided by whether $\sum_j \omega_j x_j - T \le 0$ or $\sum_j \omega_j x_j - T > 0$.
With $z = \sum_j \omega_j x_j + b$, the output is $y = \varphi(z)$, where $\varphi$ is called the activation function.
Output of a single neuron: $y = \varphi\left(\sum_j \omega_j x_j + b\right)$
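As a concrete illustration (a minimal NumPy sketch, not part of the original slides; the weights, bias, and inputs below are arbitrary example values), the single-neuron rule above can be written directly:

```python
import numpy as np

def perceptron(x, w, b):
    """Single neuron: y = phi(sum_j w_j * x_j + b) with a step activation."""
    z = np.dot(w, x) + b          # z = sum_j w_j x_j + b, with b playing the role of -T
    return 1 if z > 0 else 0      # step activation: fire only if sum_j w_j x_j > T

# Example: two binary inputs, arbitrary weights, threshold T = 1 (so b = -1)
x = np.array([1, 0])
w = np.array([0.7, 0.4])
print(perceptron(x, w, b=-1.0))   # 0, because 0.7 <= T
```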
• Multilayer perceptron (MLP)
Why neural networks?
[Figure: network with inputs x_1, x_2, …, x_i, two hidden layers, and a single output y^3; weights ω_i^1, ω_i^2, ω_i^3]
First hidden layer: $y_j^1 = \varphi\left(\sum_i \omega_i^1 x_i + b_j^1\right)$
Second hidden layer: $y_j^2 = \varphi\left(\sum_i \omega_i^2 y_i^1 + b_j^2\right)$
Output: $y^3 = \varphi\left(\sum_i \omega_i^3 y_i^2 + b_j^3\right)$
Output of the network:
$$F(x) = \varphi\left(\sum_i \omega_i^3\, \varphi\left(\sum_i \omega_i^2\, \varphi\left(\sum_i \omega_i^1 x_i + b_j^1\right) + b_j^2\right) + b_j^3\right)$$
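A minimal sketch of the composed function F(x) above in NumPy; the layer sizes, the sigmoid activation, and the random weights are illustrative assumptions rather than anything prescribed by the slides:

```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))                 # sigmoid activation

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)       # first hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)       # second hidden layer: 4 -> 4 units
w3, b3 = rng.normal(size=4), 0.0                    # output layer: 4 -> 1 unit

def F(x):
    y1 = phi(W1 @ x + b1)                           # y^1 = phi(W^1 x + b^1)
    y2 = phi(W2 @ y1 + b2)                          # y^2 = phi(W^2 y^1 + b^2)
    return phi(w3 @ y2 + b3)                        # y^3 = phi(w^3 . y^2 + b^3)

print(F(np.array([1.0, 0.5, -0.2])))
```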
• Universal approximation theorem
⇒ On any compact subset of ℝ^n, any continuous function f can be approximated by a feedforward neural network that has at least one hidden layer
⇒ A neural network with a single hidden layer can approximate any continuous multivariate function to any desired accuracy
Why neural networks?
$$F(x) = \sum_{i=1}^{N} v_i\, \varphi\left(W_i^{T} x + b_i\right), \quad \text{where } \varphi : \mathbb{R} \to \mathbb{R} \text{ is a nonconstant, bounded, continuous function}$$
$$\left|F(x) - f(x)\right| < \epsilon \quad \text{for all } x \text{ in the subset of } \mathbb{R}^{M}$$
• Universal approximation theorem
⇒ Regardless of what function we are trying to learn, a sufficiently large MLP is able to represent that function.
However, it is not guaranteed that the training algorithm can learn that function:
1. The optimization algorithm may fail to find the right parameters (weights)
2. The training algorithm might choose the wrong function due to overfitting (it fails to generalize)
: There is no universal procedure to train and generalize a function (no free lunch theorem; Wolpert, 1996)
Why neural networks?
• Universal approximation theorem
⇒ A feedforward network with a single hidden layer is sufficient to represent any function, but the layer may need to be infeasibly large and may fail to learn and generalize correctly
• Why deep neural networks?
In many cases, a deeper model can reduce both the required number of units (neurons) and the generalization error
Why neural networks?
• Why deep neural networks?
  • Effect of depth (Goodfellow et al., 2014)
  • Street View House Numbers (SVHN) database
Why neural networks?
[Figure: test accuracy vs. number of layers (depth)]
Goodfellow, Ian J., et al. "Multi-digit number recognition from Street View imagery using deep convolutional neural networks." arXiv preprint arXiv:1312.6082 (2013).
• Why deep neural networks?
  • Curse of dimensionality (→ a statistical challenge)
  • Let the dimension of the data space be d
  • Required number of samples for inference: n
  • Generally, in practical tasks, d ≫ 10³ (see the sketch below)
Why neural networks?
Image source: Nicolas Chapados
[Figure: grids for d = 10, d = 10², and d = 10³ require n₁, n₂, and n₃ samples, with n₁ < n₂ ≪ n₃]
• Why deep neural networks?
  • Local constancy prior (smoothness prior)
  • For an input sample x and a small change ε, a well-trained function f* should satisfy
Why neural networks?
$$f^*(x) \approx f^*(x + \epsilon)$$
• Why deep neural networks?
  • Local constancy prior (smoothness prior)
    • For models built from local kernels centered at the samples, O(k) samples are required to distinguish O(k) regions
    • Deep learning instead spans the data with subspaces (distributed representation)
    • The data is assumed to be generated by a composition of factors (or features), potentially at multiple levels of a hierarchy
Why neural networks?
[Figure: Voronoi diagram (nearest neighborhood)]
• Why deep neural networks?
  • Manifold hypothesis
    • Manifold: a connected set of points that can be approximated well by considering only a small number of degrees of freedom (dimensions) embedded in a higher-dimensional space
Why neural networks?
• Why deep neural networks?
  • Manifold hypothesis
    • Real-world data (sound, images, text, etc.) are highly concentrated
Why neural networks?
[Figure: random samples in the image space]
• Why deep neural networks?
  • Manifold hypothesis
    • Even though the data space is ℝⁿ, we do not have to consider the whole space
    • We may consider only the neighborhood of the observed samples along some manifolds
    • Transformations may exist along the manifold (for example, intensity changes in images)
    • Manifolds related to human faces and those related to cats may differ
Why neural networks?
• Why deep neural networks?
  • Manifold hypothesis
Why neural networks?
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
• Why deep neural networks?
  • Nonlinear transform by learning
    • Linear model: a linear combination of the input X
      ⇒ a linear model with a nonlinear transform φ(X) as its input
    • Finding an optimal φ(X)
      • Previously: human knowledge-based transforms (i.e., handcrafted features)
      • Deep learning: the transform is learned inside the network
    • y = f(x; θ, ω) = φ(x; θ)^T ω (see the sketch below)
Why neural networks?
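The last bullet can be read as a linear model ω applied to a learned feature map φ(x; θ). A hedged sketch with a one-hidden-layer φ (the shapes, the tanh nonlinearity, and the random parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_W, theta_b = rng.normal(size=(8, 2)), np.zeros(8)  # parameters theta of the feature map phi(x; theta)
omega = rng.normal(size=8)                               # linear weights omega on top of the features

def phi(x):
    return np.tanh(theta_W @ x + theta_b)                # learned nonlinear transform

def f(x):
    return phi(x) @ omega                                # y = phi(x; theta)^T omega

print(f(np.array([0.3, -1.2])))
```

With handcrafted features, φ would be fixed in advance; in deep learning, θ is trained jointly with ω by gradient descent.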
• Why deep neural networks?
Why neural networks?
A hidden layer: y = f(x; θ, ω) = φ(x; θ)^T ω
• Why deep neural networks?
  • Summary
    • Curse of dimensionality
    • Local constancy prior
    • Manifold hypothesis
    • Nonlinear transform by learning
  • The dimension of the data space can be reduced to subsets of a manifold
  • The number of decision regions can be spanned by the subspaces as a composition of factors
Why neural networks?
• Learning of the network
  • Goal: approximate a function f*
    • Classifier: y = f*(x), where y_i belongs to a finite set
    • Regression: y = f*(x), where y_i ∈ ℝ^d
  • A network defines a mapping y = f(x; θ) and learns the parameters θ that best approximate the function f*
  • Due to the non-linearity, global optimization algorithms (such as convex optimization) are not suitable for deep learning → instead, iteratively minimize a cost function C via
    • Gradient descent
    • Backpropagation
How the network learns
• Learning of the network
  • Gradient descent
How the network learns
[Figure: gradient descent on f₁: ℝ → ℝ and on f₂: ℝⁿ → ℝ]
• Learning of the network
  • Directional derivative of f in the direction of a unit vector u:
$$\left.\frac{\partial}{\partial \alpha} f(v + \alpha u)\right|_{\alpha = 0} = u^{T} \nabla_v f(v)$$
  • Minimizing this over u amounts to minimizing cos θ between u and the gradient, so the steepest decrease is in the direction opposite to the gradient
  • Moving toward the negative gradient decreases f (see the sketch below):
How the network learns
$$v' = v - \eta \nabla_v f(v) \qquad (\eta : \text{learning rate})$$
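A minimal sketch of the update rule v' = v − η∇_v f(v) on an illustrative convex function (the function, starting point, and learning rate are assumptions):

```python
import numpy as np

def f(v):
    return np.sum(v ** 2)          # illustrative function f(v) = ||v||^2

def grad_f(v):
    return 2 * v                   # its gradient, nabla_v f(v)

v = np.array([3.0, -2.0])
eta = 0.1                          # learning rate
for _ in range(50):
    v = v - eta * grad_f(v)        # v' = v - eta * nabla_v f(v)
print(v, f(v))                     # v approaches the minimizer at the origin
```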
• Learning of the network
  • Backpropagation
How the network learns
For x → y = g(x) → z = f(g(x)) = f(y), the chain rule gives
$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$
[Figure: error backpropagation path from z back through y to x]
• Learning of the network
  • Backpropagation
    • For x ∈ ℝᵐ, y ∈ ℝⁿ, and g: ℝᵐ → ℝⁿ, f: ℝⁿ → ℝ:
$$\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}, \qquad \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}, \qquad \nabla_x z = \left(\frac{\partial y}{\partial x}\right)^{T} \nabla_y z,$$
where ∂y/∂x is the n × m Jacobian matrix of g.
    • From gradient descent (see the sketch below):
$$x' = x - \eta \left(\frac{\partial y}{\partial x}\right)^{T} \nabla_y z, \qquad \theta' = \theta - \eta \left(\frac{\partial y}{\partial \theta}\right)^{T} \nabla_y z$$
How the network learns
• Learning of the network
  • Universal approximation theorem
  • Gradient descent & backpropagation
  • Practical reasons for failure
    • Optimization
      • Optimizer (SGD, AdaGrad, RMSProp, Adam, etc.)
      • Weight initialization
    • Regularization
      • Parameter norm penalty (L1, L2)
      • Augmentation / noise injection (weight noise, label smoothing)
      • Multitask learning
      • Parameter sharing (CNN) ← domain-specific prior
      • Ensemble / dropout
      • Adversarial training
How the network learns
• Convolutional neural network
  • Convolution vs. cross-correlation
    • Convolution:
$$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n) = (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n)$$
    • Cross-correlation:
$$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)\, K(m, n)$$
Modern deep learning
Most CNNs actually use cross-correlation, not convolution (see the sketch below)
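A direct (unoptimized) sketch of the two formulas for a 2-D input I and kernel K, restricted to the 'valid' output region; the toy image and kernel are assumptions:

```python
import numpy as np

def cross_correlation2d(I, K):
    """S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), 'valid' region only."""
    kh, kw = K.shape
    H, W = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolution2d(I, K):
    """Valid convolution: cross-correlation with the kernel flipped in both axes."""
    return cross_correlation2d(I, np.flip(K))

I = np.arange(16, dtype=float).reshape(4, 4)     # toy 4x4 "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])          # toy 2x2 kernel
print(cross_correlation2d(I, K))
print(convolution2d(I, K))                       # differs unless K is symmetric under flipping
```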
• Convolutional neural network
  • Significant characteristics of a CNN
    • Sparse interactions
    • Parameter sharing
    • Equivariant representation
  • Sparse interactions
    • Kernel size ≪ input size (e.g., a 128-by-128 image and a 3-by-3 kernel)
    • For m inputs and n outputs, a fully connected network needs O(m × n) connections, while a CNN needs O(k × n), where k is the number of connections per output
    • In practice, k is several orders of magnitude smaller than m (see the sketch below)
Modern deep learning
[Figure: receptive fields in a CNN vs. a fully connected network]
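A back-of-the-envelope sketch of the O(m × n) versus O(k × n) connection counts; the concrete sizes below are illustrative assumptions:

```python
m = 128 * 128     # number of input units (a 128-by-128 image)
n = 128 * 128     # number of output units (same spatial size, one channel)
k = 3 * 3         # connections per output unit for a 3-by-3 kernel

print("fully connected:", m * n)   # O(m x n): 268,435,456 connections
print("convolutional:  ", k * n)   # O(k x n): 147,456 connections
```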
• Convolutional neural network
  • Parameter sharing
    • Only one set of parameters (the kernel) is learned and reused at every location (see the vertical-edge sketch below)
    • This reduces the required amount of memory
Modern deep learning
[Figure: vertical-edge detection, CNN vs. fully connected network; annotations: computation about 4 billion times more efficient, memory storage 178,640 for matrix multiplication]
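The vertical-edge figure can be reproduced with a single shared pair of weights (a sketch under assumed data: a toy image whose right half is bright and a [-1, 1] kernel reused at every position):

```python
import numpy as np

image = np.zeros((5, 6))
image[:, 3:] = 1.0                  # bright right half -> one vertical edge
kernel = np.array([-1.0, 1.0])      # the same two parameters are shared across all locations

edges = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        edges[i, j] = kernel[0] * image[i, j] + kernel[1] * image[i, j + 1]
print(edges)                        # nonzero only in the column where the intensity changes
```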
• Convolutional neural network
  • Equivariant representation (translation equivariance)
    • A translation of the input → the same translation of the output
Modern deep learning
[Figure: the location of the output feature related to the cat moves with the cat]
• Convolutional neural network
  • Pooling (translation invariance)
    • Useful for tasks that care more about whether some feature exists than about exactly where it is (see the sketch below)
Modern deep learning
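A minimal sketch of 2×2 max pooling with stride 2 (the feature map and the shift are assumptions), showing that a small translation of the input feature leaves the pooled output essentially unchanged:

```python
import numpy as np

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling, stride 2 (assumes even height and width)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

feat = np.zeros((4, 4))
feat[1, 2] = 1.0                        # a feature detected at one location
shifted = np.roll(feat, 1, axis=1)      # the same feature shifted right by one pixel
print(max_pool2x2(feat))
print(max_pool2x2(shifted))             # identical pooled output: the small shift is absorbed
```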
• Convolutional neural network
  • Prior belief behind convolution and pooling
    • The function the layer should learn contains only local interactions and is equivariant to translation
    • The function the layer learns must be invariant to small translations
    • Cf. the Inception module (Szegedy et al., 2015) and the capsule network (Hinton, 2017)
Modern deep learning
• Convolutional neural network
  • Historical meaning of CNNs
    • Since AlexNet won the ImageNet challenge (2012)
Modern deep learning
• Convolutional neural network
  • Historical meaning of CNNs
    • Among the first deep networks to be trained successfully with backpropagation and to work well in practice
    • The reason for this success is not entirely clear
    • Computational efficiency may have allowed more experiments for tuning the implementation and hyperparameters
    • CNNs achieved state-of-the-art results on data with a clear grid-structured topology (such as images)
Modern deep learning
End
Q & A

Editor's Notes

  • #10: A simple model to emulate a single neuron. A perceptron takes binary inputs (x_1, x_2, x_3, …) and produces a single binary output (0 or 1).
  • #32: By Cmglee - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=20206883
  • #35: Image source: https://www.cc.gatech.edu/~san37/post/dlhc-cnn/
  • #36: Image source: https://www.cc.gatech.edu/~san37/post/dlhc-cnn/
  • #37: Image source: https://www.cc.gatech.edu/~san37/post/dlhc-cnn/
  • #38: Image source: https://www.topbots.com/14-design-patterns-improve-convolutional-neural-network-cnn-architecture/
  • #40: Computing the outputs of a 13-layer convolutional neural network requires about 30 billion operations.