Lecture 5 - CNNs For Detection and Segmentation
Bernt Schiele
https://cms.sic.saarland/hlcvss24/
[Figure: image classification. A 4096-dimensional feature vector is mapped by a fully-connected layer (4096 to 1000) to class scores, e.g. Cat: 0.9, Dog: 0.05, Car: 0.01, ...]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 6 May 10, 2018
[Figure: overview of recognition tasks, grouped by what is labeled: no objects, just pixels; single object; multiple objects.]
[Figure: semantic segmentation. Every pixel gets a category label: Sky, Trees, Cat, Cow, Grass.]
Don’t differentiate instances, only care about pixels.
[Figure: patches are extracted from the full image and each patch is classified, e.g. Cow, Cow, Grass.]
Problem: Very inefficient! Shared features between overlapping patches are not reused.
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Input: 3 x H x W
Convolutions: D x H x W
Scores: C x H x W
Predictions: H x W
Problem: convolutions at original image resolution will be very expensive.
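To make the shapes above concrete, here is a minimal sketch in plain Python (nested lists instead of tensors) of how C x H x W scores could be turned into an H x W prediction map by taking the argmax over the classes at every pixel. The function name `predict_labels` and the toy scores are illustrative assumptions, not part of the lecture.

```python
def predict_labels(scores):
    """Turn C x H x W class scores into an H x W map of class indices."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    labels = []
    for i in range(height):
        row = []
        for j in range(width):
            # argmax over the class axis for pixel (i, j)
            best = max(range(num_classes), key=lambda c: scores[c][i][j])
            row.append(best)
        labels.append(row)
    return labels


# Toy example: C = 2 classes, H = 2, W = 3 (values are made up).
scores = [
    [[0.9, 0.2, 0.1], [0.8, 0.3, 0.4]],  # scores for class 0
    [[0.1, 0.8, 0.9], [0.2, 0.7, 0.6]],  # scores for class 1
]
labels = predict_labels(scores)
```

The result has one class index per pixel, so the class axis disappears and only H x W remains.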
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/8 x W/8
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
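Nearest neighbor unpooling (input 2 x 2, output 4 x 4):
Input:
1 2
3 4
Output:
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
Bed of nails unpooling (input 2 x 2, output 4 x 4):
Input:
1 2
3 4
Output:
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0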
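The two fixed (non-learned) upsampling rules illustrated by the 2 x 2 input [[1, 2], [3, 4]] can be sketched in plain Python as follows. The function names are hypothetical and the factor of 2 is an assumption that matches the 2 x 2 to 4 x 4 example.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Copy every input value into a factor x factor block of the output."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def bed_of_nails_unpool(x, factor=2):
    """Put every input value in the top-left corner of its block, zeros elsewhere."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * factor) for _ in range(h * factor)]
    for i in range(h):
        for j in range(w):
            out[i * factor][j * factor] = x[i][j]
    return out


x = [[1, 2], [3, 4]]
nn = nearest_neighbor_unpool(x)
nails = bed_of_nails_unpool(x)
```

Neither rule has parameters, so nothing is learned here; they only change the resolution.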
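[Figure: max pooling in a downsampling layer and the matching unpooling in an upsampling layer, with the rest of the network in between; output positions that receive no value are filled with zeros.]
Corresponding pairs of downsampling and upsampling layers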
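One way to pair a downsampling layer with its upsampling layer is to let max pooling remember where each maximum came from and to write values back to exactly those positions when unpooling. The sketch below assumes 2 x 2 windows with stride 2; the function names and the toy values are illustrative and not taken from a particular library.

```python
def max_pool_with_indices(x, size=2):
    """Max pooling that also records the position of every maximum."""
    pooled, indices = [], []
    for i in range(0, len(x), size):
        pooled_row, index_row = [], []
        for j in range(0, len(x[0]), size):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(size) for dj in range(size)]
            value, position = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(y, indices, out_h, out_w):
    """Write each value of y to its remembered position; all other entries stay 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for y_row, index_row in zip(y, indices):
        for value, (i, j) in zip(y_row, index_row):
            out[i][j] = value
    return out


features = [[1, 2, 6, 3],
            [3, 5, 2, 1],
            [1, 2, 2, 1],
            [7, 3, 4, 8]]
pooled, indices = max_pool_with_indices(features)

# Later in the network, a 2 x 2 map is unpooled with the remembered positions.
unpooled = max_unpool([[1, 2], [3, 4]], indices, out_h=4, out_w=4)
```

Compared with the fixed rules, this keeps the spatial layout that was lost by pooling, at the cost of storing the indices.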
Convolution: dot product between filter and input.
Input: 4 x 4, Output: 4 x 4
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication. Convolution transpose multiplies by the transpose of the same matrix.
1D convolution kernel (x, y, z) expanded into a matrix, stride 1:
[ x y z 0 0 0 ]
[ 0 x y z 0 0 ]
[ 0 0 x y z 0 ]
[ 0 0 0 x y z ]
With stride 2, only every second row remains:
[ x y z 0 0 0 ]
[ 0 0 x y z 0 ]
slide credit: Fei-Fei, Justin Johnson, Serena Yeung
High Level Computer Vision | Bernt Schiele 16
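The matrix view can be checked numerically with a small sketch. It assumes a kernel of size 3 acting on an input of length 4 that is zero-padded by 1 on each side (so the matrix has 6 columns); the helper names and the kernel values 1, 2, 3 standing in for (x, y, z) are illustrative.

```python
def conv_matrix(kernel, input_len, stride=1, pad=1):
    """Expand a 1D kernel into the matrix that acts on the zero-padded input."""
    k = len(kernel)
    padded_len = input_len + 2 * pad
    rows = []
    for start in range(0, padded_len - k + 1, stride):
        row = [0] * padded_len
        row[start:start + k] = kernel
        rows.append(row)
    return rows


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


kernel = [1, 2, 3]                                # stands in for (x, y, z)
X1 = conv_matrix(kernel, input_len=4, stride=1)   # 4 x 6 matrix
X2 = conv_matrix(kernel, input_len=4, stride=2)   # 2 x 6 matrix

a_padded = [0, 1, 1, 1, 1, 0]                     # input (1, 1, 1, 1) with padding
down = matvec(X2, a_padded)                       # strided convolution: 6 -> 2 values
up = matvec(transpose(X2), [1, 1])                # transpose: 2 -> 6 values
```

With stride 2 the matrix maps 6 padded values to 2, and its transpose maps 2 values back to 6, which is why the transposed operation can serve as a learnable upsampling step.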
Learns Segmentation
“Contraction” Phase
- Increases field of view
- Loses spatial information
“Expansion” Phase
- Creates high-resolution mapping
• Contraction Phase
‣ Reduces the spatial dimensions, but increases the “what.”
• Expansion Phase
‣ Recovers object details and the spatial dimensions, which is the “where.”
• Concatenating feature maps from the Contraction phase helps the Expansion phase recover the “where” information.
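The concatenation step can be sketched as joining two stacks of feature maps along the channel axis, which only works when their spatial sizes agree. The snippet below uses nested lists in channels x H x W layout; `concat_channels` and the toy values are assumptions made for illustration.

```python
def concat_channels(expansion_maps, contraction_maps):
    """Concatenate two lists of H x W feature maps along the channel axis."""
    h, w = len(expansion_maps[0]), len(expansion_maps[0][0])
    for fmap in contraction_maps:
        if len(fmap) != h or len(fmap[0]) != w:
            raise ValueError("spatial sizes must match before concatenation")
    return expansion_maps + contraction_maps


# Toy feature maps: 2 channels from the expansion phase, 1 from the contraction phase.
up = [[[0.0, 0.1], [0.2, 0.3]],
      [[1.0, 1.1], [1.2, 1.3]]]
skip = [[[5.0, 5.1], [5.2, 5.3]]]
merged = concat_channels(up, skip)   # 3 channels, each 2 x 2
```

The channel counts add up while H and W stay the same, so later layers see both the upsampled features and the higher-resolution features from the contraction phase.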
[Figure: class scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...) from a fully-connected layer (4096 to 1000), trained with a softmax loss, plus an additional loss term.]
Object Detection as Regression?
Each image needs a different number of outputs!
DOG: (x, y, w, h)
DOG: (x, y, w, h)
CAT: (x, y, w, h)
3 boxes x 4 coordinates = 12 numbers
DUCK: (x, y, w, h)
DUCK: (x, y, w, h)
….
Object Detection as Classification: Sliding Window
Each crop of the image is classified as one of the object classes or as background:
Crop 1: Dog? NO, Cat? NO, Background? YES
Crop 2: Dog? YES, Cat? NO, Background? NO
Crop 3: Dog? YES, Cat? NO, Background? NO
Crop 4: Dog? NO, Cat? YES, Background? NO
Crop 5: Dog? NO, Cat? YES, Background? NO
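A rough sketch of the sliding-window procedure shows how quickly the number of crops grows, which hints at why a classifier per crop becomes costly. The image size, window sizes and stride below are made-up values, and both function names are hypothetical.

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride):
    """Yield (top, left, height, width) for every window position."""
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield (top, left, win_h, win_w)


def count_crops(img_h, img_w, window_sizes, stride):
    """Count all crops over several window sizes (scales and aspect ratios)."""
    return sum(1 for (win_h, win_w) in window_sizes
               for _ in sliding_windows(img_h, img_w, win_h, win_w, stride))


# Assumed setup: a 480 x 640 image, four window shapes, stride of 8 pixels.
sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]
n_crops = count_crops(480, 640, sizes, stride=8)

# Tiny case that is easy to check by hand: a 4 x 4 image with 2 x 2 windows.
tiny = list(sliding_windows(4, 4, 2, 2, stride=2))
```

Every one of these crops would need its own Dog / Cat / Background decision, and real detectors have to cover many more scales and positions than this toy setup.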
Region Proposals / e.g. Selective Search
Region Proposal Step
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fast R-CNN
Problem: Runtime dominated by region proposals!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
Faster R-CNN: Make CNN do proposals!
Insert Region Proposal Network (RPN) to predict proposals from features.
[Slide list, partially recovered: "... classes)", "4. Final box coordinates"]
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Region Proposal Network (RPN)
• components
‣ 3x3 sliding window at each location (convolutions)
‣ k anchor boxes, e.g. k=9:
- 3 scales x 3 aspect ratios
‣ for each box
- class score (here 2-class softmax) for (any) object present or not
- 4 coordinates for bounding box
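The anchor idea (k = 9 boxes from 3 scales x 3 aspect ratios at each sliding-window location) can be sketched as follows. The concrete scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are assumptions borrowed from the defaults reported in the Faster R-CNN paper, and `make_anchors` is a hypothetical helper rather than a library function.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:              # r is height / width
            w = s / math.sqrt(r)      # chosen so that w * h equals s * s
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=300, cy=200)
k = len(anchors)

# Per location, the RPN predicts for each anchor a 2-class score
# (object present or not) and 4 box coordinates.
outputs_per_location = {"objectness_scores": 2 * k, "box_coordinates": 4 * k}
```

With k = 9 this gives 18 objectness scores and 36 box coordinates at every location of the feature map.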
[Figure: Mask R-CNN. Mask output: C x 14 x 14]
He et al, “Mask R-CNN”, arXiv 2017
YOLO (CVPR 2016)
YOLO v3 (arXiv 2018), with the Darknet-53 backbone
YOLO v5 (GitHub 2020)
Object Detection: Impact of Deep Learning