Image Segmentation
How many zebras?
From Sandlot Science
Why is context important?
What is this?
slide by Takeo Kanade
Why is this a car?
…because it’s on the road!
Why is this a road?
Context is very important!
Same problem in real scenes
From images to objects
What defines an object?
• Subjective problem, but has been well-studied
Extracting objects
How could we do this automatically (or at least
semi-automatically)?
Semi-automatic binary segmentation
Simplifying the user interaction
Grabcut [Rother et al., SIGGRAPH 2004]
Auto segmentation: toy example
[Figure: input image and its histogram (pixel count vs. intensity), with three peaks labeled black pixels, gray pixels, and white pixels]
• These intensities define the three groups.
• We could label every pixel in the image according to
which of these primary intensities it is.
• i.e., segment the image based on the intensity feature.
• But … image isn’t quite so simple …
Source: K. Grauman
[Figure: input image and its histogram (pixel count vs. intensity)]
• Now how to determine the three main intensities that
define our groups?
• We need to cluster.
Source: K. Grauman
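One way to find the three main intensities is sketched below: a minimal 1D k-means over pixel intensities using NumPy. The slides only say "cluster", so the choice of k-means, the function name `kmeans_intensity`, the quantile initialization, and the synthetic toy image are all illustrative assumptions.

```python
import numpy as np

def kmeans_intensity(image, k=3, iters=20):
    """Cluster pixel intensities into k groups with plain 1D k-means (sketch)."""
    pixels = image.reshape(-1).astype(float)
    # Assumption: start the centers at evenly spaced quantiles of the intensities.
    centers = np.quantile(pixels, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its pixels (leave it alone if empty).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean()
    labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
    return labels.reshape(image.shape), centers

# Synthetic toy image: noisy black, gray, and white pixels.
rng = np.random.default_rng(0)
toy = np.concatenate([
    rng.normal(10, 3, 100),
    rng.normal(128, 3, 100),
    rng.normal(245, 3, 100),
]).reshape(10, 30)
labels, centers = kmeans_intensity(toy)
```

Here `labels` is the segmentation (one group index per pixel) and `centers` holds the recovered primary intensities.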
Deep Learning
Semantic Segmentation: GRASS, CAT, TREE, SKY (pixel-level)
Classification + Localization: CAT (single object)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
Segmentation+Classification
Slide by: Justin Johnson
Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 11 - 13, May 10, 2017
Semantic Segmentation
Label each pixel in the
image with a category
label
[Figure: two per-pixel labeled images; labels: Sky, Trees, Cat, Grass and Sky, Trees, Cow, Grass]
Don’t differentiate instances,
only care about pixels
Semantic Segmentation Idea:
Fully Convolutional
Design a network as a bunch of convolutional layers
to make predictions for pixels all at once!
Input: 3 x H x W -> Convolutions (Conv, Conv, Conv, Conv) -> Scores: C x H x W -> argmax -> Predictions: H x W
Each channel is a class: C channels -> C classes
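The final argmax step can be sketched in NumPy as follows. The random `scores` array stands in for the network output, and the values of `C`, `H`, and `W` are hypothetical.

```python
import numpy as np

C, H, W = 4, 2, 3  # hypothetical number of classes and image size
rng = np.random.default_rng(0)
scores = rng.normal(size=(C, H, W))  # stand-in for the network's C x H x W scores

# Per-pixel prediction: index of the highest-scoring class channel.
predictions = np.argmax(scores, axis=0)  # shape H x W
```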
Semantic Segmentation Idea:
Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: pooling, strided convolution
Upsampling: ???
Input: 3 x H x W -> High-res: D1 x H/2 x W/2 -> Med-res: D2 x H/4 x W/4 -> Low-res: D3 x H/4 x W/4 -> Med-res: D2 x H/4 x W/4 -> High-res: D1 x H/2 x W/2 -> Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
In-Network upsampling: “Unpooling”
Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
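Both unpooling variants can be sketched in a few lines of NumPy. The function names are illustrative, and the sketch assumes a single-channel 2D input.

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    # Repeat each value over a factor x factor block.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    # Put each value in the top-left corner of its block; the rest stays zero.
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
nn = nearest_neighbor_unpool(x)
bon = bed_of_nails_unpool(x)
```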
In-Network upsampling: “Max Unpooling”
Max Pooling: remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8
… Rest of the network …
Max Unpooling: use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4
Corresponding pairs of downsampling and upsampling layers
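A minimal NumPy sketch of this pairing, using the numbers from the slide. The helper names and the dictionary used to remember the max positions are illustrative choices, and the sketch assumes a single-channel input with even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2; also records where each max came from."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2), dtype=x.dtype)
    positions = {}
    for i in range(h // 2):
        for j in range(w // 2):
            block = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i, j] = block[r, c]
            positions[(i, j)] = (2 * i + int(r), 2 * j + int(c))
    return pooled, positions

def max_unpool_2x2(y, positions, out_shape):
    """Write each input value back at the remembered max position; zeros elsewhere."""
    out = np.zeros(out_shape, dtype=y.dtype)
    for (i, j), (r, c) in positions.items():
        out[r, c] = y[i, j]
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, positions = max_pool_2x2(x)

y = np.array([[1, 2],
              [3, 4]])  # stand-in for the output of the rest of the network
unpooled = max_unpool_2x2(y, positions, x.shape)
```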
Learnable Upsampling
3 x 3 transpose convolution, stride 2 pad 1
Input gives weight for filter
Filter moves 2 pixels in the output for every one pixel in the input
Sum where output overlaps
Input: 2 x 2 Output: 4 x 4
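As a rough check on the sizes, the sketch below uses the output-size relation that PyTorch documents for `ConvTranspose2d` (with dilation 1). Mapping the slide's figure onto this convention is an assumption: under it, a 3 x 3 kernel with stride 2 and pad 1 turns a 2 x 2 input into 3 x 3, and the 4 x 4 output shown corresponds to one extra row and column of output padding.

```python
def transpose_conv_output_size(in_size, kernel, stride, pad, output_padding=0):
    # Size relation for a transpose convolution with dilation 1
    # (the convention documented for PyTorch's ConvTranspose2d).
    return (in_size - 1) * stride - 2 * pad + kernel + output_padding

size_plain = transpose_conv_output_size(2, kernel=3, stride=2, pad=1)
size_padded = transpose_conv_output_size(2, kernel=3, stride=2, pad=1, output_padding=1)
```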
Transpose Convolution: 1D Example
Input: (a, b)
Filter: (x, y, z)
Output: (ax, ay, az + bx, by, bz)
Output contains copies of the filter weighted by the input, summing where they overlap in the output
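The 1D example can be reproduced with a short NumPy sketch. The function name is illustrative, and stride 2 with no padding or cropping is assumed, since that is what yields five outputs with the overlap in the middle.

```python
import numpy as np

def transpose_conv1d(inputs, filt, stride=2):
    """1D transpose convolution without padding (illustrative sketch)."""
    filt = np.asarray(filt, dtype=float)
    n, k = len(inputs), len(filt)
    out = np.zeros((n - 1) * stride + k)
    for i, value in enumerate(inputs):
        # Add a copy of the filter scaled by this input value; overlaps sum.
        out[i * stride:i * stride + k] += value * filt
    return out

a, b = 2.0, 3.0
x, y, z = 1.0, 10.0, 100.0
out = transpose_conv1d([a, b], [x, y, z])
```

With these numbers the middle entry is `a * z + b * x`, the one position where the two scaled copies of the filter overlap.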
Adapted from Justin Johnson
Object Detection as Regression?
CAT: (x, y, w, h) -> 4 numbers
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) -> 12 numbers
DUCK: (x, y, w, h), DUCK: (x, y, w, h), …. -> many numbers!
Each image needs a different number of outputs!
Object Detection as Classification:
Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Dog? NO
Cat? NO
Background? YES
Object Detection as Classification:
Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Dog? YES
Cat? NO
Background? NO
Object Detection as Classification:
Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background
Dog? NO
Cat? YES
Background? NO
Object Detection as Classification:
Sliding Window
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
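To get a feel for the cost, the sketch below counts how many square crops a sliding window would produce. The image size, window sizes, and stride are hypothetical values chosen for illustration, and a real detector would also vary aspect ratio, which multiplies the count further.

```python
def count_windows(height, width, window_sizes, stride):
    """Count square sliding-window crops over an image (illustrative sketch)."""
    total = 0
    for size in window_sizes:
        if size > height or size > width:
            continue  # this window does not fit in the image
        rows = (height - size) // stride + 1
        cols = (width - size) // stride + 1
        total += rows * cols
    return total

# Hypothetical setup: 480 x 640 image, three window sizes, stride of 16 pixels.
n_crops = count_windows(480, 640, window_sizes=[64, 128, 256], stride=16)
```

Each of those crops would need its own CNN forward pass.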
Region Proposals
● Find image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region
proposals in a few seconds on CPU
Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Alexe et al., CVPR 2010
R-CNN
[Figure: R-CNN pipeline diagram; labels: Conv Net, Conv Net, Conv Net]
Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”, CVPR 2014
Detection without Proposals:
YOLO
Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3
Within each grid cell:
• Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
• Predict scores for each of C classes (including background as a class)
Output: 7 x 7 x (5 * B + C)
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
This parameterization fixes the output
size
• Each cell predicts:
- For each bounding box:
- 4 coordinates (x, y, w, h)
- 1 confidence value
- Some number of class
probabilities
• For Pascal VOC:
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
• 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
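The output size for Pascal VOC can be checked with a few lines of NumPy. The ordering of values inside each cell (boxes first, then class probabilities) is a hypothetical layout used only for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20           # grid size, boxes per cell, classes (Pascal VOC)
values_per_cell = B * 5 + C  # 4 coordinates + 1 confidence per box, plus class probabilities
output = np.zeros((S, S, values_per_cell))

# Hypothetical layout: the B boxes first, then the C class probabilities.
cell = output[0, 0]
boxes = cell[:B * 5].reshape(B, 5)  # rows of (x, y, w, h, confidence)
class_probs = cell[B * 5:]
```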
Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Split the image into a grid
Each cell predicts boxes and confidences:
P(Object)
Each cell also predicts a probability
P(Class | Object)
[Figure: class map over the grid; labels: Bicycle, Car, Dog, Dining Table]
Combine the box and class predictions
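A minimal sketch of the combination step: each box's confidence P(Object) is multiplied by the cell's class probabilities P(Class | Object) to give a class-specific score per box. The random arrays stand in for network outputs, and the array names are illustrative.

```python
import numpy as np

S, B, C = 7, 2, 20
rng = np.random.default_rng(0)
box_confidence = rng.random((S, S, B))  # stand-in for P(Object), one per box
class_probs = rng.random((S, S, C))
class_probs /= class_probs.sum(axis=-1, keepdims=True)  # P(Class | Object), one set per cell

# Class-specific score for every box: P(Class | Object) * P(Object).
scores = box_confidence[..., :, None] * class_probs[..., None, :]  # S x S x B x C
best_class = scores.argmax(axis=-1)  # most likely class for each box
best_score = scores.max(axis=-1)     # score of that class
```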
Finally do non-maximum suppression and
threshold detections
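The sketch below shows one common greedy form of non-maximum suppression combined with a score threshold. The box format (x1, y1, x2, y2), both threshold values, and the example boxes are assumptions made for illustration. Thresholding is done first here, which gives the same result as thresholding after suppression because boxes are processed in descending score order.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union of one box with an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, score_threshold=0.3, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = [i for i in np.argsort(-scores) if scores[i] >= score_threshold]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        if not order:
            break
        overlaps = iou(boxes[best], boxes[order])
        order = [i for i, o in zip(order, overlaps) if o <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],    # overlaps the first box heavily
                  [20, 20, 30, 30],  # separate object
                  [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.1])
kept = nms(boxes, scores)
```

In this toy case the second box is suppressed by the first, and the last box falls below the score threshold.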
It also generalizes well to new domains