Lecture 5 - CNNs For Detection and Segmentation


High Level Computer Vision

Object Detection and Segmentation


@ May 22, 2024

Bernt Schiele

https://cms.sic.saarland/hlcvss24/

Max Planck Institute for Informatics & Saarland University,


Saarland Informatics Campus Saarbrücken
So far: Image Classification
[Figure: a CNN maps the image (CC0 public domain) to a 4096-dimensional vector; a fully-connected layer (4096 to 1000) turns it into class scores, e.g. Cat: 0.9, Dog: 0.05, Car: 0.01, ...]

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 11 - 6 May 10, 2018

slide credit: Fei-Fei, Justin Johnson, Serena Yeung


High Level Computer Vision | Bernt Schiele 2
Other Computer Vision Tasks
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Classification + Localization: CAT (single object)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain
Semantic Segmentation
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels
[Figure: two example images (CC0 public domain) with per-pixel labels Sky, Trees, Cat, Grass and Sky, Trees, Cow, Grass]
Semantic Segmentation Idea: Sliding Window
Extract a patch from the full image and classify its center pixel with a CNN (e.g. Cow, Cow, Grass)
Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Input: 3 x H x W -> Conv -> Conv -> Conv -> Conv (convolutions: D x H x W) -> Scores: C x H x W -> argmax -> Predictions: H x W
Problem: convolutions at original image resolution will be very expensive ...
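The last step of the fully convolutional pipeline (scores C x H x W, argmax, predictions H x W) can be sketched in NumPy. This is only an illustration: the sizes and the random `scores` array are assumptions standing in for a real network output.

```python
import numpy as np

# Hypothetical sizes: C classes, H x W image.
C, H, W = 4, 3, 5
rng = np.random.default_rng(0)

# Stand-in for the per-pixel class scores produced by the last conv layer.
scores = rng.normal(size=(C, H, W))

# argmax over the class axis gives one class index per pixel.
predictions = scores.argmax(axis=0)
print(predictions.shape)
```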
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/8 x W/8
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Semantic Segmentation Idea: Fully Convolutional
Downsampling: Pooling, strided convolution
Upsampling: ???
In-Network Upsampling: "Unpooling"
Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
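Both unpooling variants are easy to reproduce on the 2 x 2 example. The following is a minimal NumPy sketch; the function names are made up for this illustration.

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    # Copy every input value into a factor x factor block.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    # Put every input value in the top-left corner of its block, zeros elsewhere.
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
nn_out = nearest_neighbor_unpool(x)
nails_out = bed_of_nails_unpool(x)
print(nn_out)
print(nails_out)
```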
In-Network Upsampling: "Max Unpooling"
Max Pooling: Remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8

… Rest of the network …

Max Unpooling: Use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4

Corresponding pairs of downsampling and upsampling layers
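The pairing of max pooling and max unpooling can be sketched as follows. It is a plain NumPy illustration with explicit loops and hypothetical function names, reproducing the 4 x 4 example rather than showing any particular framework's implementation.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling that also remembers where each maximum was."""
    H, W = x.shape
    pooled = np.zeros((H // 2, W // 2), dtype=x.dtype)
    positions = np.zeros((H // 2, W // 2, 2), dtype=int)
    for i in range(H // 2):
        for j in range(W // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            r, c = np.unravel_index(window.argmax(), window.shape)
            pooled[i, j] = window[r, c]
            positions[i, j] = (2 * i + r, 2 * j + c)
    return pooled, positions

def max_unpool_2x2(y, positions, out_shape):
    """Write each value back to the remembered position, zeros elsewhere."""
    out = np.zeros(out_shape, dtype=y.dtype)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            r, c = positions[i, j]
            out[r, c] = y[i, j]
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, positions = max_pool_2x2(x)

# Stand-in for the 2 x 2 activations arriving from the rest of the network.
later = np.array([[1, 2],
                  [3, 4]])
unpooled = max_unpool_2x2(later, positions, x.shape)
print(pooled)
print(unpooled)
```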
Learnable Upsampling: Transpose Convolution
Recall: Typical 3 x 3 convolution, stride 1 pad 1
Dot product between filter and input
Input: 4 x 4, Output: 4 x 4
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Dot product between filter and input
Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output
Input: 4 x 4, Output: 2 x 2
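The input and output sizes quoted for the two recalled convolutions follow from the standard output-size formula, floor((n + 2 * pad - kernel) / stride) + 1. A small sketch (the helper name is hypothetical):

```python
def conv_output_size(n, kernel, stride, pad):
    # Number of filter positions along one spatial dimension.
    return (n + 2 * pad - kernel) // stride + 1

size_stride1 = conv_output_size(4, kernel=3, stride=1, pad=1)
size_stride2 = conv_output_size(4, kernel=3, stride=2, pad=1)
print(size_stride1, size_stride2)
```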
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Input: 2 x 2, Output: 4 x 4
Other names:
- Deconvolution (bad)
- Upconvolution
- Fractionally strided convolution
- Backward strided convolution
Learnable Upsampling: 1D Example
Input: (a, b)
Filter: (x, y, z)
Output: (ax, ay, az + bx, by, bz)
Output contains copies of the filter weighted by the input, summing where they overlap in the output
Need to crop one pixel from output to make output exactly 2x input
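The 1D example can be reproduced numerically. In this sketch the concrete values for a, b, x, y, z are arbitrary choices for illustration, and no cropping is applied, so the output has 5 entries rather than 4.

```python
import numpy as np

def transpose_conv1d(inp, filt, stride=2):
    # Each input value pastes a scaled copy of the filter into the output,
    # shifted by `stride`; overlapping contributions are summed.
    out = np.zeros(stride * (len(inp) - 1) + len(filt))
    for i, v in enumerate(inp):
        out[i * stride:i * stride + len(filt)] += v * filt
    return out

a, b = 2.0, 3.0               # illustrative input values
x, y, z = 1.0, 10.0, 100.0    # illustrative filter values
out = transpose_conv1d(np.array([a, b]), np.array([x, y, z]))
print(out)                    # entries are ax, ay, az + bx, by, bz
```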
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication.
Example: 1D conv, kernel size=3, stride=1, padding=1. The 1D convolution kernel (x, y, z) is expanded into a matrix that multiplies the (padded) 1D input (e.g. image):

X = \begin{bmatrix}
x & y & z & 0 & 0 & 0 \\
0 & x & y & z & 0 & 0 \\
0 & 0 & x & y & z & 0 \\
0 & 0 & 0 & x & y & z
\end{bmatrix}

Convolution transpose multiplies by the transpose of the same matrix.
When stride=1, convolution transpose is just a regular convolution (with different padding rules)
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication.
Example: 1D conv, kernel size=3, stride=2, padding=1:

X = \begin{bmatrix}
x & y & z & 0 & 0 & 0 \\
0 & 0 & x & y & z & 0
\end{bmatrix}

Convolution transpose multiplies by the transpose of the same matrix.
When stride>1, convolution transpose is no longer a normal convolution!
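The matrix view can be checked numerically. The sketch below builds the kernel matrix for stride 1 and stride 2 and applies it and its transpose. The helper `conv_matrix` and the numeric kernel values standing in for (x, y, z) are assumptions made for this illustration.

```python
import numpy as np

def conv_matrix(kernel, n_in, stride=1, pad=1):
    # One row per output position; each row holds the kernel shifted by `stride`.
    k = len(kernel)
    n_out = (n_in + 2 * pad - k) // stride + 1
    X = np.zeros((n_out, n_in + 2 * pad))
    for row in range(n_out):
        X[row, row * stride:row * stride + k] = kernel
    return X

kernel = np.array([1.0, 2.0, 3.0])            # stands in for (x, y, z)
signal = np.array([1.0, 2.0, 3.0, 4.0])       # 1D input
padded = np.pad(signal, 1)                    # zero padding of 1 on each side

X1 = conv_matrix(kernel, n_in=4, stride=1)    # 4 x 6 matrix
X2 = conv_matrix(kernel, n_in=4, stride=2)    # 2 x 6 matrix

conv_out = X1 @ padded        # same as sliding the kernel over the padded input
down = X2 @ padded            # strided convolution: 4 inputs -> 2 outputs
up = X2.T @ down              # transpose convolution: back to the padded length
print(X2)
print(conv_out, down.shape, up.shape)
```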
Semantic Segmentation Idea: Fully Convolutional
Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transpose convolution
What does a U-Net do?

Learns segmentation: Input Image -> Output Segmentation Map


U-Net Architecture

[Figure: U-Net architecture, Ronneberger et al. (2015)]

“Contraction” Phase
- Increases field of view
- Loses spatial information

“Expansion” Phase
- Creates high-resolution mapping
- Concatenates with high-resolution feature maps from the Contraction Phase


U-Net Summary

• Contraction Phase
‣ Reduces the spatial dimensions, but increases the “what.”

• Expansion Phase
‣ Recovers object details and the spatial dimensions, which is the “where.”

• Concatenating feature maps from the Contraction phase helps the Expansion phase recover the “where” information.
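The concatenation step amounts to stacking two feature maps of equal spatial size along the channel axis. A minimal sketch, assuming hypothetical channel counts and resolutions (none of these numbers come from the slides):

```python
import numpy as np

# Hypothetical shapes in channels x height x width layout.
encoder_features = np.ones((64, 32, 32))    # high-res map saved in the contraction phase
decoder_features = np.zeros((64, 32, 32))   # upsampled map in the expansion phase

# Skip connection: stack both maps along the channel axis.
merged = np.concatenate([encoder_features, decoder_features], axis=0)
print(merged.shape)
```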


Classification + Localization
Treat localization as a regression problem!
Image (CC0 public domain) -> CNN -> Vector: 4096
- Fully Connected: 4096 to 1000 -> Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...) -> Softmax Loss against correct label: Cat
- Fully Connected: 4096 to 4 -> Box Coordinates (x, y, w, h) -> L2 Loss against correct box: (x’, y’, w’, h’)
Multitask Loss: Softmax Loss + L2 Loss
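The multitask loss (softmax loss on the class scores plus L2 loss on the box) can be written down in a few lines. This is a sketch for a single image; the `box_weight` parameter and all numeric values are assumptions added for illustration, since the slide simply adds the two losses.

```python
import numpy as np

def softmax_loss(scores, label):
    # Negative log-probability of the correct class (numerically stabilised).
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def l2_loss(box, target):
    # Squared distance between predicted and correct (x, y, w, h).
    return float(np.sum((box - target) ** 2))

def multitask_loss(scores, label, box, target, box_weight=1.0):
    # box_weight is a hypothetical balancing factor, not taken from the slides.
    return softmax_loss(scores, label) + box_weight * l2_loss(box, target)

scores = np.array([2.0, 0.5, -1.0])           # illustrative scores for 3 classes
box = np.array([10.0, 20.0, 50.0, 40.0])      # predicted (x, y, w, h)
target = np.array([12.0, 18.0, 50.0, 44.0])   # correct (x', y', w', h')
loss = multitask_loss(scores, 0, box, target)
print(loss)
```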
Classification + Localization
The CNN is often pretrained on ImageNet (transfer learning)
Object Detection as Regression?

Each image needs a different number of outputs!
[Figure: three example images]
Image 1: CAT: (x, y, w, h) -> 4 numbers
Image 2: DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) -> 12 numbers
Image 3: DUCK: (x, y, w, h), DUCK: (x, y, w, h), .... -> Many numbers!
Object Detection as Classification: Sliding Window

Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Example crops:
- Dog? NO, Cat? NO, Background? YES
- Dog? YES, Cat? NO, Background? NO
- Dog? YES, Cat? NO, Background? NO
- Dog? NO, Cat? YES, Background? NO
- Dog? NO, Cat? YES, Background? NO
Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!
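To get a feeling for why sliding windows are expensive, one can count the crops for a few window shapes. The image size, window shapes, and stride below are made-up values for illustration only.

```python
def count_windows(img_h, img_w, win_h, win_w, stride):
    # Number of positions at which a win_h x win_w window fits into the image.
    if win_h > img_h or win_w > img_w:
        return 0
    return ((img_h - win_h) // stride + 1) * ((img_w - win_w) // stride + 1)

# Hypothetical setup: 480 x 640 image, four window shapes, stride 8.
window_shapes = [(64, 64), (128, 128), (128, 64), (64, 128)]
total = sum(count_windows(480, 640, h, w, stride=8) for h, w in window_shapes)
print(total)  # every one of these crops would need its own CNN forward pass
```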
Region Proposals / e.g. Selective Search

● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Region Proposal Step

Selective Search: Motivation

• Many approaches (at the time) use exhaustive search:
‣ visit every location in an image
‣ problem: this is computationally expensive, so:
- number of possible locations should be small
-> number of grid locations & aspect ratio(s) need to be small
- evaluation cost per location should be low
-> simple features / classifiers

• to go beyond this, we should aim for something more “sophisticated”


Selective Search: Main Design Criteria



Selective Search: How to Obtain High Recall?



Selective Search: Method
• compute similarity measure between all adjacent region pairs a and b (e.g.) as:
S(a, b) = \alpha \, S_{size}(a, b) + \beta \, S_{color}(a, b)
‣ with
S_{size}(a, b) = 1 - \frac{size(a) + size(b)}{size(image)}
encourages small regions to merge early
‣ and
S_{color}(a, b) = \sum_{k=1}^{n} \min(a_k, b_k)
a_k, b_k are color histograms, encouraging “similar (color)” regions to merge
‣ for slightly more elaborate similarities see their IJCV paper
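The region similarity can be sketched directly from its two terms: a size term that is large for small regions and a histogram-intersection color term. The function names, the weights `alpha` and `beta`, and the example numbers are assumptions for this illustration.

```python
import numpy as np

def s_size(size_a, size_b, size_image):
    # Close to 1 when both regions are small relative to the image.
    return 1.0 - (size_a + size_b) / size_image

def s_color(hist_a, hist_b):
    # Histogram intersection: sum over bins of the smaller of the two entries.
    return float(np.minimum(hist_a, hist_b).sum())

def similarity(size_a, size_b, size_image, hist_a, hist_b, alpha=1.0, beta=1.0):
    # alpha and beta are hypothetical weights on the two terms.
    return alpha * s_size(size_a, size_b, size_image) + beta * s_color(hist_a, hist_b)

hist_a = np.array([0.5, 0.5, 0.0])     # illustrative normalized color histograms
hist_b = np.array([0.25, 0.25, 0.5])
sim = similarity(10, 30, 100, hist_a, hist_b)
print(sim)
```

In a greedy grouping loop, the adjacent pair with the highest similarity would be merged first, so small and similarly colored regions are combined early.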


Selective Search: High Recall Revisited



Selective Search: Evaluation of Object Hypotheses



R-CNN — Region Based CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.


R-CNN — Region Based CNN: Problems
• Ad hoc training objectives
‣ Fine-tune network with softmax classifier (log loss)
‣ Train post-hoc linear SVMs (hinge loss)
‣ Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
‣ 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
‣ Fixed by SPP-net [He et al. ECCV14]

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.


R-CNN vs. SPP vs. Fast-RCNN
Problem: Runtime dominated by region proposals!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Faster R-CNN: Make CNN do proposals!
Insert Region Proposal Network (RPN) to predict proposals from features

Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
Region Proposal Network (RPN)

• components
‣ 3x3 sliding window
‣ 256-dimensional vector for each location (convolutions)
‣ k anchor boxes, e.g. k=9:
- 3 scales x 3 aspect ratios
‣ for each box
- class score (here 2-class softmax) for (any) object present or not
- 4 coordinates for bounding box
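The anchor bookkeeping (k = 3 scales x 3 aspect ratios, with 2 class scores and 4 coordinates per anchor) can be sketched as below. The scale and ratio values, the feature-map size, and the helper name are illustrative assumptions, not values from the slides.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # One (width, height) pair per scale/ratio combination; ratio = height / width.
    # Each anchor keeps an area of roughly scale * scale.
    anchors = []
    for s in scales:
        for r in ratios:
            anchors.append((s / np.sqrt(r), s * np.sqrt(r)))
    return np.array(anchors)

anchors = make_anchors()
k = len(anchors)                      # 9 anchors per sliding-window location

feat_h, feat_w = 38, 50               # hypothetical feature-map size
num_scores = feat_h * feat_w * k * 2  # object / not object per anchor
num_coords = feat_h * feat_w * k * 4  # box regression per anchor
print(k, num_scores, num_coords)
```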


Object Detection vs. Instance Segmentation
Mask R-CNN = Faster R-CNN + Segmentation Output for each ROI
CNN -> RoI Align -> 256 x 14 x 14
- Classification Scores: C
- Box coordinates (per class): 4 * C
- Mask branch: 256 x 14 x 14 -> Conv -> 256 x 14 x 14 -> Conv -> C x 14 x 14 (predict a mask for each of C classes)
He et al, “Mask R-CNN”, arXiv 2017
Mask R-CNN: Very Good Results
He et al, “Mask R-CNN”, arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
Mask R-CNN: Also does Pose
He et al, “Mask R-CNN”, arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
Detection without Proposals: YOLO (alternative: SSD)
Go from input image to tensor of scores with one big convolutional network!

Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell (here B = 3)

Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)

Output: 7 x 7 x (5 * B + C)

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
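The shape of the output tensor, 7 x 7 x (5 * B + C), can be unpacked as follows. The class count C = 20 and the channel layout (box numbers first, class scores last) are assumptions for this sketch; real implementations may order the channels differently.

```python
import numpy as np

S, B = 7, 3      # grid size and number of base boxes per cell
C = 20           # hypothetical number of classes

# Stand-in for the network output: one vector of 5 * B + C numbers per grid cell.
output = np.zeros((S, S, 5 * B + C))

# Assumed layout: first 5 * B channels are (dx, dy, dh, dw, confidence) per base box.
boxes = output[..., :5 * B].reshape(S, S, B, 5)
class_scores = output[..., 5 * B:]
print(output.shape, boxes.shape, class_scores.shape)
```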
Detection without Proposals: YOLO

slide credit: https://youtu.be/YmMZkCstui0



YOLO Family

YOLO (CVPR 2016)
YOLO v2 / YOLO9000 (CVPR 2017): Darknet-19; anchor boxes, batch normalization, …; joint training: both detection and classification
YOLO v3 (arXiv 2018): Darknet-53
YOLO v4 (arXiv 2020)
YOLO v5 (GitHub 2020)
Object Detection: Impact of Deep Learning
Figure copyright Ross Girshick, 2015. Reproduced with permission.