Convolutional 3D Neural Network (C3D)
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive
into deep learning. arXiv preprint arXiv:2106.11342.
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• Global Average Pooling is a pooling operation designed to replace fully
connected layers in classical CNNs. The idea is to generate one feature map
for each corresponding category of the classification task in the last
mlpconv layer. Instead of adding fully connected layers on top of the feature
maps, we take the average of each feature map, and the resulting vector is
fed directly into the softmax layer.
• One advantage of global average pooling over the fully connected layers is
that it is more native to the convolution structure, enforcing
correspondences between feature maps and categories. The feature maps can
thus be easily interpreted as category confidence maps. Another advantage is
that there are no parameters to optimize in the global average pooling, so
overfitting is avoided at this layer. Furthermore, global average pooling sums
out the spatial information, so it is more robust to spatial translations of
the input (see the sketch below).
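A minimal PyTorch sketch of this head (the layer sizes and the 10-class setup are illustrative assumptions, not from the slides): the last conv layer emits one feature map per category, global average pooling collapses each map to a single score with no extra parameters, and the resulting vector goes straight into softmax.

import torch
import torch.nn as nn

num_classes = 10                        # hypothetical number of categories
# Last conv ("mlpconv") layer produces one feature map per category.
head = nn.Conv2d(in_channels=128, out_channels=num_classes, kernel_size=1)

x = torch.randn(4, 128, 7, 7)           # batch of 4 feature maps (assumed shape)
class_maps = head(x)                    # (4, num_classes, 7, 7): category confidence maps
scores = class_maps.mean(dim=(2, 3))    # global average pooling: (4, num_classes), no parameters
probs = scores.softmax(dim=1)           # fed directly into the softmax layer
print(probs.shape)                      # torch.Size([4, 10])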
An example (figure): a model likely to overfit the data.
Underfitting and Overfitting
(Figure: training and test error versus the complexity of the used model, with underfitting and overfitting regimes; here the complexity of a decision tree := number of nodes.)
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, test errors are large although training errors are small.
How Overfitting affects Prediction
(Figure: predictive error on training and test data versus model complexity; the test error is lowest in an ideal range of model complexity, with underfitting to its left and overfitting to its right.)
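The U-shaped test-error curve above can be reproduced with a small NumPy experiment (purely illustrative; the data and polynomial models are made up): fit polynomials of increasing degree to noisy samples and compare training and held-out error.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)               # noisy target
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]  # small training set, larger test set

for degree in (1, 3, 9, 12):                             # model complexity = polynomial degree
    coeffs = np.polyfit(x_tr, y_tr, degree)
    err = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    print(f"degree {degree:2d}: train error {err(x_tr, y_tr):.3f}  test error {err(x_te, y_te):.3f}")
# Low degrees underfit (both errors high); high degrees drive the training error
# down while the test error rises again -- the ideal range lies in between.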
Bias and Variance
• In statistics and machine learning, the bias–variance tradeoff (or dilemma)
is the problem of simultaneously minimizing two sources of error that
prevent supervised learning algorithms from generalizing beyond their
training set:
• The bias is error from erroneous assumptions in the learning algorithm.
High bias can cause an algorithm to miss the relevant relations between
features and target outputs (underfitting), i.e., the model class does not
contain the solution.
• The variance is error from sensitivity to small fluctuations in the training
set. High variance can cause overfitting: modeling the random noise in the
training data rather than the intended outputs, i.e., the model is too flexible
and also learns the noise; this is overfitting.
Bias and Variance
• Ensemble methods
• Combine learners to reduce variance
from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.
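A tiny NumPy illustration of the variance-reduction argument (not from the cited talk; the numbers are made up): averaging several independent, unbiased but noisy learners yields an estimate that fluctuates far less than any single learner.

import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
# 10 "learners", each an unbiased but noisy estimator, evaluated over 5000 trials.
single = true_value + rng.normal(0, 1.0, size=(5000, 10))
ensemble = single.mean(axis=1)                  # combine the learners by averaging

print("variance of one learner :", single[:, 0].var())   # ~1.0
print("variance of the ensemble:", ensemble.var())        # ~1/10 when errors are independent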
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition (pp. 770-778).
If the identity mapping f(x)=x is the desired underlying mapping, the residual
mapping amounts to g(x)=0 and it is thus easier to learn: we only need to push
the weights and biases of the upper weight layer (e.g., fully connected layer and
convolutional layer) within the dotted-line box to zero.
Residual Blocks
$a^{[l]} \to a^{[l+1]} \to a^{[l+2]}$, with a skip connection carrying $a^{[l]}$ around the second layer:
$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}$ ("linear"),  $a^{[l+1]} = g(z^{[l+1]})$ ("relu")
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}$ ("linear"),  $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$ ("relu on output plus input")
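The same two equations as a minimal PyTorch sketch (a fully connected residual block; the dimension 64 and the batch size are placeholders):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)   # W[l+1], b[l+1]
        self.layer2 = nn.Linear(dim, dim)   # W[l+2], b[l+2]
        self.g = nn.ReLU()

    def forward(self, a_l):
        a_l1 = self.g(self.layer1(a_l))     # a[l+1] = g(z[l+1])
        z_l2 = self.layer2(a_l1)            # z[l+2]
        return self.g(z_l2 + a_l)           # a[l+2] = g(z[l+2] + a[l]): the skip connection

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)      # torch.Size([8, 64])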
ResNet Architecture

Full ResNet architecture:
- Stack residual blocks; each residual block computes F(x) + x (identity skip connection, relu after the addition).
- Every residual block has two 3x3 conv layers.
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension), e.g., going from 3x3 conv, 64 filters to 3x3 conv, 128 filters, /2.
- Additional conv layer at the beginning (7x7 conv, 64, /2, followed by pooling).
- No FC layers at the end (only an FC 1000 layer to output classes); a global average pooling layer sits after the last conv layer, before the FC 1000 and softmax.
- Total depths of 34, 50, 101, or 152 layers for ImageNet.
(Figure: the full stack from the input through stages of 3x3 convs with 64, 128, ..., 512 filters, ending in global average pooling, FC 1000, and softmax. A code sketch of this stacking follows below.)
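As referenced above, here is a sketch of how such a stack could be written in PyTorch, under the assumptions in the bullets (two 3x3 convs per block, filters doubled and spatial size halved at stage boundaries, a 1x1 projection on the shortcut when shapes change, 7x7 stem, global average pooling, FC 1000). It is illustrative, not the reference implementation; the ResNet-18-style block counts are an assumption.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with a skip connection; stride 2 in the first conv downsamples."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        # 1x1 conv on the shortcut when the shape changes (assumed projection shortcut).
        self.proj = (nn.Identity() if stride == 1 and c_in == c_out
                     else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.proj(x))       # F(x) + x

def stage(c_in, c_out, num_blocks, stride):
    blocks = [BasicBlock(c_in, c_out, stride)]
    blocks += [BasicBlock(c_out, c_out) for _ in range(num_blocks - 1)]
    return nn.Sequential(*blocks)

# ResNet-18-like stack: 7x7 stem, four stages, global average pooling, FC to 1000 classes.
net = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(3, stride=2, padding=1),
    stage(64, 64, 2, 1), stage(64, 128, 2, 2), stage(128, 256, 2, 2), stage(256, 512, 2, 2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1000),
)
print(net(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 1000])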
ResNet Architecture
For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet):
- 28x28x256 input
- 1x1 conv, 64 filters, to project down to 28x28x64
- 3x3 conv, 64 filters: the 3x3 conv operates over only 64 feature maps
- 1x1 conv, 256 filters, projects back to 256 feature maps (28x28x256 output)
(A code sketch of this bottleneck follows below.)
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into deep learning. arXiv preprint arXiv:2106.11342.
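A sketch of the bottleneck pattern in PyTorch, matching the shapes above (1x1 down to 64 channels, 3x3 over 64 channels, 1x1 back to 256); batch norm/ReLU placement is simplified and the skip connection is only noted in a comment.

import torch
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),             # 28x28x256 -> 28x28x64
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),   # 3x3 conv operates over only 64 feature maps
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),             # project back to 256 feature maps
)

x = torch.randn(1, 256, 28, 28)
print(bottleneck(x).shape)                          # torch.Size([1, 256, 28, 28])
# In the full block this output is added to x through the skip connection.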
Residual Blocks (skip connections)
Training ResNet in practice
• Batch Normalization after every CONV layer.
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9).
• Learning rate: 0.1, divided by 10 when the validation error plateaus.
• Mini-batch size 256.
• Weight decay of 1e-4 (the value used by He et al., 2015).
• No dropout used.
(See the sketch of this setup below.)
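These hyperparameters translate roughly into the following PyTorch setup (a sketch; the placeholder `net` stands in for the ResNet model defined elsewhere).

import torch
import torch.nn as nn

net = nn.Linear(10, 2)   # placeholder standing in for the ResNet model (assumed defined elsewhere)

# SGD + momentum 0.9, learning rate 0.1, weight decay as listed above
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 when the validation error plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# In the training loop, after each validation pass:
#   scheduler.step(validation_error)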
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)
Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top-5 error)
- 2-3 weeks of training on an 8-GPU machine
- At runtime: faster than a VGGNet! (even though it has 8x more layers)
(slide from Kaiming He's recent presentation)
Case Study: ResNet [He et al., 2015]
(Figure: from the 224x224x3 input, the early layers reduce the spatial dimension to only 56x56.)
Comparing complexity...
- Inception-v4: ResNet + Inception!
- VGG: highest memory, most operations.
- GoogLeNet: most efficient.
- AlexNet: smaller compute, still memory heavy, lower accuracy.
- ResNet: moderate efficiency depending on the model, highest accuracy.
An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
We can take some inspiration from the Inception block of Fig. 8.4.1 which has information flowing through the
block in separate groups. Applying the idea of multiple independent groups to the ResNet block of Fig.
8.6.3 led to the design of ResNeXt (Xie et al., 2017). Different from the smorgasbord of transformations in
Inception, ResNeXt adopts the same transformation in all branches, thus minimizing the need for manual
tuning of each branch.
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive
into deep learning. arXiv preprint arXiv:2106.11342.
• Breaking up a convolution from $c_i$ to $c_o$ channels into $g$ groups of
size $c_i/g$, generating $g$ outputs of size $c_o/g$, is called, quite fittingly, a grouped
convolution. The computational cost (proportionally) is reduced
from $O(c_i \cdot c_o)$ to $O(g \cdot (c_i/g) \cdot (c_o/g)) = O(c_i \cdot c_o / g)$, i.e., it is $g$ times faster. Even better,
the number of parameters needed to generate the output is also reduced from
a $c_i \times c_o$ matrix to $g$ smaller matrices of size $(c_i/g) \times (c_o/g)$, again a $g$-fold
reduction. In what follows we assume that both $c_i$ and $c_o$ are divisible by $g$ (see the check below).
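A quick check of the g-fold parameter reduction using the groups argument of PyTorch's Conv2d (the channel counts and group number are illustrative assumptions):

import torch.nn as nn

c_i, c_o, g = 256, 256, 32                       # assumed channel counts and number of groups
dense = nn.Conv2d(c_i, c_o, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(c_i, c_o, kernel_size=3, padding=1, groups=g, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))      # 589824 = c_i * c_o * 3 * 3
print(count(grouped))    # 18432  = g * (c_i/g) * (c_o/g) * 3 * 3, i.e. g times fewer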
• The only challenge in this design is that no information is exchanged between
the $g$ groups. The ResNeXt block amends this in two ways: the grouped
convolution with a $3 \times 3$ kernel is sandwiched between two $1 \times 1$ convolutions,
and the second of these serves double duty in changing the number of channels back. The
benefit is that we only pay the $O(c \cdot b)$ cost for the $1 \times 1$ kernels and can make do with
an $O(b^2/g)$ cost for the $3 \times 3$ kernels, where $b$ is the bottleneck width. Similar to the residual block implementation,
the residual connection is replaced (thus generalized) by a $1 \times 1$ convolution. A sketch of such a block follows below.
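A minimal PyTorch sketch of such a block (the channel width, bottleneck width b = 128, and g = 32 groups are illustrative; since the input and output shapes match here, an identity skip is used instead of the 1x1-convolution shortcut):

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, channels=256, bottleneck=128, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            # grouped 3x3 conv: each of the `groups` branches sees bottleneck/groups channels
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, channels, 1, bias=False), nn.BatchNorm2d(channels),  # back to `channels`
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)          # the 1x1 convs mix information across groups

block = ResNeXtBlock()
print(block(torch.randn(1, 256, 14, 14)).shape)      # torch.Size([1, 256, 14, 14])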
“You need a lot of data if you want to train/use CNNs”
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016