Machine Perception
Carl Vondrick
Measurement:
173 redness,
101 greenness,
68 blueness
Perception:
Tomato or Face
Cambrian Explosion
Time
Cambrian Explosion
"The Cambrian Explosion
is triggered by the sudden
evolution of vision,” which
set off an evolutionary
arms race where animals
either evolved or died.
— Andrew Parker
Slide credit: Fei-Fei Li
Evolution of
Biological Eye
Planarian (worm) eyes distinguish light direction
Illumination
“Neither Autopilot nor the driver noticed the white
side of the tractor trailer against a brightly lit sky, so
the brake was not applied.” — Tesla Company Blog
Slide credit: S. Ullman
Occlusion
René Magritte, 1957
Class Variation
Slide credit: Antonio Torralba
Clutter and Camouflage
Color
Motion
Slide credit: S. Lazebnik
Ill-posed Problem
A quick experiment
Animals or Not?
You will see a mask, then image, then mask.
What do you see?
Slide credit: Jia Deng
150 ms!!
Thorpe, et al. Nature, 1996
Slide credit: Fei-Fei Li
Why not build a brain?
About one third of the brain is devoted to visual processing
Do we have the hardware?
10^11 parallel neurons
10^8 serial transistors
Adelson Illusion
Illusory Motion
Scale Ambiguity
The Ames Room
The Ames Room
(Effect used in
Lord of the Rings)
Heider-Simmel Illusion
What objects are here?
Slide credit: Rob Fergus and Antonio Torralba
Context
Slide credit: Rob Fergus and Antonio Torralba
Agenda
A. Deep Learning Crash Course
B. Visual Tracking
C. Action Recognition
D. Sound Analysis
Artificial Neural Networks: Architectures
xi ∈ ℝ^(H×1), Wi ∈ ℝ^(D×H), bi ∈ ℝ^(D×1)
xi+1 = f(Wi xi + bi)
"3-layer Neural Net", or "2-hidden-layer Neural Net"
"Fully-connected" layers
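In NumPy, the recurrence xi+1 = f(Wi xi + bi) for a 2-hidden-layer fully-connected net looks like this (a minimal sketch: the layer widths, random weights, and the choice of ReLU as f are illustrative assumptions, not the lecture's exact setup):

```python
import numpy as np

def layer(x, W, b):
    """One fully-connected layer: f(W x + b), with ReLU as the nonlinearity f."""
    return np.maximum(0, W @ x + b)

rng = np.random.default_rng(0)
H, D1, D2, C = 4, 8, 8, 3                      # input dim, hidden widths, classes

# Weights Wi ∈ R^(D×H) and biases bi ∈ R^(D×1), as in the slide's notation
W1, b1 = rng.normal(size=(D1, H)), np.zeros((D1, 1))
W2, b2 = rng.normal(size=(D2, D1)), np.zeros((D2, 1))
W3, b3 = rng.normal(size=(C, D2)), np.zeros((C, 1))

x1 = rng.normal(size=(H, 1))                   # input x1 ∈ R^(H×1)
x2 = layer(x1, W1, b1)                         # first hidden layer
x3 = layer(x2, W2, b2)                         # second hidden layer
scores = W3 @ x3 + b3                          # output layer, no nonlinearity
print(scores.shape)                            # (3, 1)
```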
Bio/Artificial Neurons Inspiration
sigmoid activation
function
Slide credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson
Bio/Artificial Networks Inspiration
Slide credit: Yann LeCun
Loss Functions
xi: input (image)    yi: target (labels)    θ: parameters
f(xi; θ): prediction    ℒ: loss function
The objective of learning:  min_θ Σi ℒ(f(xi; θ), yi)
Common Loss Functions
Squared error:  ℒ(x, y) = ‖x − y‖₂²
Hinge loss:  ℒ(x, y) = max(0, 1 − x·y)
Cross entropy:  ℒ(x, y) = −Σi yi log xi
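These three losses can be written directly in NumPy (a minimal sketch; the example vectors are illustrative):

```python
import numpy as np

def squared_error(x, y):
    # L(x, y) = ||x - y||_2^2
    return np.sum((x - y) ** 2)

def hinge(x, y):
    # L(x, y) = max(0, 1 - x·y), for labels y ∈ {-1, +1}
    return max(0.0, 1.0 - float(np.dot(x, y)))

def cross_entropy(x, y):
    # L(x, y) = -Σ_i y_i log x_i, with x a vector of predicted probabilities
    return -np.sum(y * np.log(x))

x = np.array([0.7, 0.2, 0.1])   # predicted class probabilities
y = np.array([1.0, 0.0, 0.0])   # one-hot target
print(cross_entropy(x, y))      # -log 0.7 ≈ 0.357
```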
Loss Surface
[Plot: loss ℒ as a function of the parameters θ; the local slope is ∂ℒ/∂θ]
Gradient Descent
α: learning rate
θ ← θ − α ∂ℒ/∂θ
[Plot: successive update steps descending the loss surface ℒ(θ)]
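The update θ ← θ − α ∂ℒ/∂θ can be sketched on a toy one-dimensional loss (the quadratic loss, learning rate, and iteration count here are illustrative assumptions):

```python
# Gradient descent on a toy 1-D loss L(θ) = (θ - 3)^2, whose gradient is
# dL/dθ = 2(θ - 3). Update rule from the slides: θ ← θ - α dL/dθ.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0      # initial parameter
alpha = 0.1      # learning rate
for _ in range(100):
    theta -= alpha * grad(theta)
print(round(theta, 4))   # 3.0, the minimizer of L
```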
Unlabeled Visual Data
Agenda
A. Deep Learning Crash Course
B. Visual Tracking
C. Action Recognition
D. Sound Analysis
Object Tracking
Time
What color is that pixel?
Time
Temporal Coherence of Color
RGB
Color
Channels
Quantized
Color
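Quantizing continuous RGB values into a discrete color vocabulary can be sketched with uniform per-channel binning (an illustrative stand-in: the published work builds its vocabulary by clustering colors rather than binning them uniformly):

```python
import numpy as np

def quantize(rgb, bins=4):
    """Map each RGB pixel to one of bins**3 discrete color classes by
    uniform per-channel binning. (The paper clusters colors instead;
    uniform bins are just an easy stand-in for illustration.)"""
    idx = (rgb.astype(np.int64) * bins) // 256          # per-channel bin in [0, bins)
    r, g, b = idx[..., 0], idx[..., 1], idx[..., 2]
    return (r * bins + g) * bins + b                    # single class id per pixel

img = np.array([[[173, 101, 68]]], dtype=np.uint8)      # the opening slide's pixel
print(quantize(img))                                    # [[37]]
```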
Obvious exceptions…
Edward Adelson, 1995
Self-supervised Tracking
Reference Frame Gray-scale Video
Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018
Ansel Adams, Yosemite Valley Bridge
Result of [Zhang et al., ECCV 2016]
What color
is this?
Where to
copy color?
Robust to
outliers
Where to
copy color?
Robust to
occlusion
Input Frame
Colorize by Pointing
Reference Frame Input Frame
Reference Colors Target Colors
Reference Frame Input Frame
fi Aij
fj
Reference Colors Target Colors
min_f ℒ( cj , Σi Aij ci )   where   Aij = exp(fiᵀ fj) / Σk exp(fkᵀ fj)
Reference Frame Input Frame
fi Aij
fj
Reference Colors Target Colors
min_f ℒ( cj , ĉj )   where   ĉj = Σi Aij ci   and   Aij = exp(fiᵀ fj) / Σk exp(fkᵀ fj)
Reference Frame Input Frame
fi Aij
fj
ci
Reference Colors Target Colors
min_f ℒ( cj , Σi Aij ci )   where   Aij = exp(fiᵀ fj) / Σk exp(fkᵀ fj)
Reference Frame Input Frame
fi Aij
fj
ci cj
Reference Colors Target Colors
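The attention weights Aij and the color-copy step above can be sketched in NumPy (the embedding dimension, pixel counts, and random features are illustrative; softmax uses the usual max-subtraction for numerical stability):

```python
import numpy as np

def softmax_cols(z):
    """Column-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
D, N, M = 16, 50, 40                      # embedding dim, #reference, #target pixels
f_ref = rng.normal(size=(D, N))           # embeddings f_i of reference pixels
f_tgt = rng.normal(size=(D, M))           # embeddings f_j of target pixels
c_ref = np.eye(3)[rng.integers(0, 3, N)]  # (N, 3) one-hot reference colors c_i

# A_ij = exp(f_i^T f_j) / Σ_k exp(f_k^T f_j): attention over reference pixels
A = softmax_cols(f_ref.T @ f_tgt)         # (N, M), each column sums to 1
c_pred = A.T @ c_ref                      # ĉ_j = Σ_i A_ij c_i, shape (M, 3)
assert np.allclose(c_pred.sum(axis=1), 1.0)   # predictions remain distributions
```

The same pointer works for any per-pixel label: swap the reference colors c_i for a segmentation mask or skeleton keypoints, which is exactly why tracking emerges later in the talk.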
Video Colorization
Reference Frame    Gray-scale Video    Predicted Color    Ground Truth
Tracking Emerges!
Reference Frame Input Frame
Tracking Emerges!
Reference Frame Input Frame
Reference Mask Predicted Mask
min_f ℒ( cj , ĉj )   where   ĉj = Σi Aij ci   and   Aij = exp(fiᵀ fj) / Σk exp(fkᵀ fj)
Reference Frame Input Frame
Aij
fi
fj
ci
ĉj
Reference Mask Predicted Mask
Segment Tracking Results
Only the first frame is given. Colors indicate different instances.
Segment Tracking Results
Only the first frame is given. Colors indicate different instances.
Pose Tracking Results
Only the skeleton in the first frame is given.
Tracking Performance
[Plot: Average Performance (Segment Overlap), 0 to 80, vs. Frame Number (2 to 64), comparing Identity, Optic Flow, and Colorization]
Tracking Performance
[Plot: Average Performance (Segment Overlap), 0 to 50, broken down by video attribute (Scale Variation, Shape Complexity, Appearance Change, Heterogeneous Object, Out-of-view, Interacting Objects, Motion Blur, Occlusion, Fast Motion, Dynamic Background, Low Resolution, Deformation, Edge Ambiguity, Out-of-Plane Rotation, Background Clutter, Camera Shake), comparing Identity, Optic Flow, and Colorization]
Visualizing Embeddings
Project embedding to 3 dimensions and visualize as RGB
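The projection-to-RGB trick can be sketched with PCA via SVD (a minimal sketch: the embedding map here is random, and PCA is one reasonable choice of 3-D projection, not necessarily the exact one used in the paper):

```python
import numpy as np

def embedding_to_rgb(emb):
    """Project D-dim per-pixel embeddings onto their top-3 principal
    components and rescale each component to [0, 255] for display."""
    H, W, D = emb.shape
    flat = emb.reshape(-1, D)
    flat = flat - flat.mean(axis=0)                  # center before PCA
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ Vt[:3].T                           # (H*W, 3) projection
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / (hi - lo + 1e-8) * 255       # rescale per channel
    return rgb.reshape(H, W, 3).astype(np.uint8)

emb = np.random.default_rng(0).normal(size=(4, 4, 64))   # dummy embedding map
print(embedding_to_rgb(emb).shape)                       # (4, 4, 3)
```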
[Rows: Original Video; Embedding Visualization]
Agenda
A. Deep Learning Crash Course
B. Visual Tracking
C. Action Recognition
D. Sound Analysis
What are they doing?
Barker and Wright, 1954
Action Recognition
Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014
Felix Warneken, Max Planck Institute
The Oops! dataset
Oops! Predicting Unintentional Action
CVPR 2020
oops.cs.columbia.edu
Perceptual Clues
1) Predictability
Ranzato 2014, Han 2019, …
2) Temporal Order
Misra 2016, Wei 2018, …
3) Video speed as
self-supervised clue
Speed of Action
Alters Perceptual Judgement
Visualizing Features
Fit linear model to classify
intentionality
[Scatter: features projected to 2-D, with intentional (+) and unintentional (−) examples separated by the linear model]
Agenda
A. Deep Learning Crash Course
B. Visual Tracking
C. Action Recognition
D. Sound Analysis
Learning to Hear
Objects
f(xs; ω)
Sound
Aytar, Vondrick, Torralba. NIPS 2016
Natural Synchronization
min_ω Σi D_KL( F(xi; Ω) ‖ f(xi; ω) )    (e.g., vision network predicts "Lion")
Sound network f(xs; ω)    Vision network F(xv; Ω)
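One term of the teacher-student objective above can be sketched for a single example (the probability vectors here are illustrative; the real model trains this loss over millions of video frames):

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """D_KL(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# Teacher: vision network F(x; Ω) labels a video frame from its pixels.
# Student: sound network f(x; ω) must produce the same distribution from audio.
teacher = np.array([0.80, 0.15, 0.05])   # e.g. high probability on "lion"
student = np.array([0.50, 0.30, 0.20])   # sound network's current guess
loss = kl(teacher, student)              # one term of Σ_i D_KL(F(x_i) || f(x_i))
print(loss > 0.0)                        # True: the distributions still disagree
```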
SoundNet
Waveform → Convolutional Neural Network → Categories
Sound Recognition
Classifying sounds in ESC-50

Method             Accuracy
Chance                   2%
SVM-MFCC                39%
Random Forest           44%
CNN, Piczak 2015        64%
SoundNet                74%  (10% gain)
Human Consistency       81%
Vision vs Sound
Low-dimensional embeddings via t-SNE (van der Maaten and Hinton, 2008)
Vision Sound
Sensor Power Consumption
Camera Microphone
~1 watt ~1 milliwatt
Which objects make which sounds?
Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.
The sound of the clicked object
Audiovisual Grounding
Which regions are
making which sounds?
Audiovisual Grounding
Sounds like a good idea
• Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
• Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon. Learning to Localize Sound Source in Visual Scenes
• Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
• Relja Arandjelovic, Andrew Zisserman. Objects that Sound
• Ruohan Gao, Rogerio Feris, Kristen Grauman. Learning to Separate Object Sounds by Watching Unlabeled Video
• Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman. The Conversation: Deep Audio-Visual Speech Enhancement
Collect unlabeled videos
Mix Sound Tracks
How to recover originals?
Audio-only:
• Ill-posed
• Permutation problem
Vision can help
Video Analysis
Network
Audio Synthesizer
Network
Sound of target video
Audio Analysis
Network
Audiovisual Model
Video Analysis Network: CNN → max pool → K vision channels
Audio Analysis Network: sound spectrogram (STFT) → U-Net → K audio channels s1 … sK
Audio Synthesizer Network: combines vision and audio channels → sound of target video
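The synthesizer step can be sketched as a vision-weighted sum of the audio channels (a heavily simplified sketch: the actual synthesizer is a learned network, and the channel counts, random values, and normalization here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, F, T = 4, 64, 32                     # channels, frequency bins, time frames
s = rng.random(size=(K, F, T))          # K audio channels (spectrogram components)
v = rng.random(size=(K,))               # K vision channels at one chosen pixel
v = v / v.sum()                         # normalize the per-pixel weights (assumed)

# Weighted sum of audio channels: the spectrogram attributed to the object
# appearing at that pixel. An inverse STFT would recover its waveform.
pred_spec = np.tensordot(v, s, axes=1)  # shape (F, T)
print(pred_spec.shape)                  # (64, 32)
```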
Original Audio
What does this sound like?
What does this sound like?
What regions are making sound?
Original Video
Estimated Volume
What sounds are they making?
Original Video
Embedding (projected and visualized as color)
Adjusting Volume
What does it learn?
Layer 1
What does it learn?
Layer 5
Smacking-like
Layer 5
Chime-like
What does it learn?
Layer 7
Scuba-like
Layer 7
Parents-like