The document discusses machine perception, focusing on the evolution of visual processing in animals, the development of artificial neural networks, and the integration of sound and vision in understanding actions and objects. It highlights various visual and auditory recognition techniques, including deep learning applications for object tracking, action recognition, and sound classification. The content is supported by experimental results and references to significant studies in the field.


Machine Perception

Carl Vondrick
Measurement:
173 redness,
101 greenness,
68 blueness

Perception:
Tomato or Face
Cambrian Explosion

[Figure: timeline of the Cambrian Explosion]

"The Cambrian Explosion


is triggered by the sudden
evolution of vision,” which
set off an evolutionary
arms race where animals
either evolved or died.
— Andrew Parker

Slide credit: Fei-Fei Li


Evolution of the Biological Eye
Planarian (worm) eyes distinguish light direction
Illumination

"Neither Autopilot nor the driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied." — Tesla Company Blog
Slide credit: S. Ullman
Occlusion

René Magritte, 1957


Class Variation

Slide credit: Antonio Torralba


Clutter and Camouflage
Color
Motion

Slide credit: S. Lazebnik


Ill-posed Problem
A quick experiment
Animals or Not?

You will see a mask, then image, then mask.

What do you see?

Slide credit: Jia Deng


150 ms!!
Thorpe, et al. Nature, 1996



Why not build a brain?
About 1/3rd of the brain is devoted to visual processing
Do we have the hardware?

~10^11 parallel neurons
~10^8 serial transistors
Adelson Illusion
Illusory Motion
Scale Ambiguity
The Ames Room
(Effect used in Lord of the Rings)
Heider-Simmel Illusion
What objects are here?

Slide credit: Rob Fergus and Antonio Torralba


Context

Slide credit: Rob Fergus and Antonio Torralba


Agenda

A. Deep Learning Crash Course

B. Visual Tracking

C. Action Recognition

D. Sound Analysis
Artificial Neural Networks: Architectures

x_i ∈ ℝ^(H×1),  W_i ∈ ℝ^(D×H),  b_i ∈ ℝ^(D×1)

x_{i+1} = f(W_i x_i + b_i)

"3-layer Neural Net", or "2-hidden-layer Neural Net"
"Fully-connected" layers
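
As an illustrative sketch (not from the slides), the layer recurrence x_{i+1} = f(W_i x_i + b_i) maps directly to a few lines of NumPy; using ReLU for the activation f is an assumed choice:

import numpy as np

def relu(z):
    # Elementwise nonlinearity f
    return np.maximum(0.0, z)

def forward(x, layers):
    # x: (H, 1) input column vector
    # layers: list of (W, b) pairs with W of shape (D, H) and b of shape (D, 1)
    for W, b in layers:
        x = relu(W @ x + b)  # x_{i+1} = f(W_i x_i + b_i)
    return x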
Bio/Artificial Neurons Inspiration

sigmoid activation function
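
A one-line sketch of the sigmoid in NumPy:

import numpy as np

def sigmoid(z):
    # Maps any real input into (0, 1), loosely analogous to a neuron's firing rate
    return 1.0 / (1.0 + np.exp(-z))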

Slide credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson
Bio/Artificial Networks Inspiration

Slide credit: Yann LeCun
Loss Functions

x_i: Input (image)      θ: Parameters
y_i: Target (labels)    f(x_i; θ): Prediction
ℒ: Loss Function

The objective of learning:  min_θ Σ_i ℒ(f(x_i; θ), y_i)
Common Loss Functions

Squared error:  ℒ(x, y) = ‖x − y‖₂²
Hinge loss:     ℒ(x, y) = max(0, 1 − x·y)
Cross entropy:  ℒ(x, y) = −Σ_i y_i log x_i
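
A sketch of the three losses in NumPy; the shapes and label conventions below are illustrative assumptions:

import numpy as np

def squared_error(x, y):
    # ||x - y||^2 for prediction x and target y
    return np.sum((x - y) ** 2)

def hinge(x, y):
    # x: raw score, y: label in {-1, +1}
    return max(0.0, 1.0 - x * y)

def cross_entropy(x, y):
    # x: predicted class probabilities, y: one-hot target
    return -np.sum(y * np.log(x))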
Loss Surface

[Figure: loss ℒ as a function of the parameters θ, with the gradient ∂ℒ/∂θ at the current point]
Gradient Descent

Repeatedly step the parameters against the gradient:
θ ← θ − α ∂ℒ/∂θ, where α is the learning rate

[Figure: successive gradient steps descending the loss surface ℒ(θ)]
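
A minimal sketch of the update θ ← θ − α ∂ℒ/∂θ on a toy quadratic loss; the loss, learning rate, and step count here are illustrative choices:

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
alpha = 0.1   # learning rate
for _ in range(100):
    theta = theta - alpha * grad(theta)   # theta <- theta - alpha * dL/dtheta

print(theta)  # approaches the minimizer theta = 3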
Unlabeled Visual Data
Agenda

A. Deep Learning Crash Course

B. Visual Tracking

C. Action Recognition

D. Sound Analysis
Object Tracking
What color is that pixel?

[Figure: video frames over time]
Temporal Coherence of Color

RGB Color Channels → Quantized Color
Obvious exceptions…
Edward Adelson, 1995
Self-supervised Tracking

Reference Frame Gray-scale Video

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


Ansel Adams, Yosemite Valley Bridge
Result of [Zhang et al., ECCV 2016]
What color is this?
Where to copy color? Robust to outliers.
Where to copy color? Robust to occlusion.
Input Frame

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


Colorize by Pointing
Reference Frame Input Frame

Reference Colors Target Colors

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


Each pixel gets an embedding (f_i in the reference frame, f_j in the input frame); attention weights A_ij point from target pixels back to reference pixels and copy the reference colors c_i into predicted target colors ĉ_j:

min_f Σ_j ℒ(ĉ_j, c_j),  where ĉ_j = Σ_i A_ij c_i  and  A_ij = exp(f_iᵀ f_j) / Σ_k exp(f_kᵀ f_j)

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018
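
A sketch of the pointing mechanism under assumed array shapes (this is not the authors' released code): softmax attention over reference pixels copies reference colors to each target pixel:

import numpy as np

def predict_colors(f_ref, f_tgt, c_ref):
    # f_ref: (N, D) reference-pixel embeddings f_i
    # f_tgt: (M, D) target-pixel embeddings f_j
    # c_ref: (N, C) reference colors c_i (e.g., one-hot quantized colors)
    logits = f_tgt @ f_ref.T                     # (M, N): f_i^T f_j
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # A_ij, each row sums to 1
    return A @ c_ref                             # (M, C): c_hat_j = sum_i A_ij c_i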


Video Colorization
Reference Frame | Gray-scale Video | Predicted Color | Ground Truth

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018

Tracking Emerges!
Reference Frame Input Frame

Reference Mask Predicted Mask

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


min_f Σ_j ℒ(ĉ_j, c_j),  where ĉ_j = Σ_i A_ij c_i  and  A_ij = exp(f_iᵀ f_j) / Σ_k exp(f_kᵀ f_j)

At test time, the reference "colors" c_i can be any per-pixel labels, such as a segmentation mask: the same attention weights A_ij propagate the reference mask into a predicted mask.

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


Segment Tracking Results
Only the first frame is given. Colors indicate different instances.

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018



Pose Tracking Results
Only the skeleton in the first frame is given.

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


Tracking Performance

[Figure: Average Performance (Segment Overlap) vs. frame number (2 to 64) for Identity, Optic Flow, and Colorization]

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018

Tracking Performance

[Figure: Average Performance (Segment Overlap, 0 to 50) by video attribute (Scale Variation, Shape Complexity, Appearance Change, Heterogeneous Object, Out-of-view, Interacting Objects, Motion Blur, Occlusion, Fast Motion, Dynamic Background, Low Resolution, Deformation, Edge Ambiguity, Out-of-Plane Rotation, Background Clutter, Camera Shake) for Identity, Optic Flow, and Colorization]

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018


Visualizing Embeddings
Project the embedding to 3 dimensions and visualize it as RGB, as sketched below.

[Figure: original video (top) and embedding visualization (bottom)]

Vondrick, Shrivastava, Fathi, Guadarrama, Murphy. ECCV 2018
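
One way to do the projection, sketched here with PCA via SVD (the choice of PCA is an assumption; any 3-D reduction would do):

import numpy as np

def embedding_to_rgb(emb):
    # emb: (H, W, D) per-pixel embeddings
    H, W, D = emb.shape
    flat = emb.reshape(-1, D)
    flat = flat - flat.mean(axis=0)                  # center the features
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ Vt[:3].T                           # top-3 principal components
    proj = (proj - proj.min(axis=0)) / (np.ptp(proj, axis=0) + 1e-8)
    return proj.reshape(H, W, 3)                     # values in [0, 1], viewable as RGB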


Agenda

A. Deep Learning Crash Course

B. Visual Tracking

C. Action Recognition

D. Sound Analysis
What are they doing?

Barker and Wright, 1954


Action Recognition

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

Felix Warneken, Max Planck Institute
The Oops! dataset

Oops! Predicting Unintentional Action
CVPR 2020
oops.cs.columbia.edu
Perceptual Clues

1) Predictability (Ranzato 2014, Han 2019, …)
2) Temporal Order (Misra 2016, Wei 2018, …)
3) Video speed as a self-supervised clue

Speed of Action Alters Perceptual Judgement
Visualizing Features
Fit a linear model to classify intentionality, as sketched below.

[Figure: feature embedding with a linear boundary separating intentional (+) and unintentional (−) clips]
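
A sketch of such a linear probe; the features and labels below are random placeholders standing in for real video embeddings and intentionality labels:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(200, 512)      # placeholder video features
y = np.random.randint(0, 2, 200)   # placeholder labels: 1 = intentional, 0 = not

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))           # training accuracy of the linear model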
Agenda

A. Deep Learning Crash Course

B. Visual Tracking

C. Action Recognition

D. Sound Analysis
Learning to Hear Objects

f(x_s; ω)

Sound

Aytar, Vondrick, Torralba. NIPS 2016
Natural Synchronization

min_f Σ_i D_KL(F(x_i) ‖ f(x_i))

Sound network: f(x_s; ω)    Vision network: F(x_v; Ω)    (example predicted category: Lion)

Aytar, Vondrick, Torralba. NIPS 2016
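
A sketch of this teacher-student objective for a single clip; the shapes and the three categories are illustrative:

import numpy as np

def kl_divergence(p, q, eps=1e-8):
    # D_KL(p || q) between teacher distribution p and student distribution q
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

p_vision = np.array([0.7, 0.2, 0.1])     # teacher F(x_v): e.g. P(lion), P(car), P(water)
q_sound  = np.array([0.5, 0.3, 0.2])     # student f(x_s) computed from the raw waveform
loss = kl_divergence(p_vision, q_sound)  # minimized over the sound network's weights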
SoundNet

Waveform → Convolutional Neural Network → Categories

Aytar, Vondrick, Torralba. NIPS 2016

Sound Recognition
Classifying sounds in ESC-50

Method              Accuracy
Chance              2%
SVM-MFCC            39%
Random Forest       44%
CNN, Piczak 2015    64%
SoundNet            74%  (10% gain)
Human Consistency   81%

Aytar, Vondrick, Torralba. NIPS 2016
Vision vs Sound
Low-dimensional embeddings via t-SNE (van der Maaten and Hinton, 2008)

Vision Sound

Aytar, Vondrick, Torralba. NIPS 2016


Sensor Power Consumption

Camera: ~1 watt
Microphone: ~1 milliwatt

Aytar, Vondrick, Torralba. NIPS 2016


Which objects make which sounds?

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.


The sound of the clicked object

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.

Audiovisual Grounding

Which regions are making which sounds?

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.

Sounds like a good idea

• Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
• Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon. Learning to Localize Sound Source in Visual Scenes
• Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
• Relja Arandjelovic, Andrew Zisserman. Objects that Sound
• Ruohan Gao, Rogerio Feris, Kristen Grauman. Learning to Separate Object Sounds by Watching Unlabeled Video
• Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman. The Conversation: Deep Audio-Visual Speech Enhancement
Collect unlabeled videos

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.


Mix Sound Tracks

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.


How to recover originals?
Audio-only:
• Ill-posed
• Permutation problem

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.
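
A sketch of how mix-and-separate training pairs are built (the naive STFT and the shapes are illustrative assumptions): mixing two tracks yields free supervision, because the unmixed originals are the targets:

import numpy as np

def stft_magnitude(wav, n_fft=1024, hop=256):
    # Naive magnitude spectrogram; practical pipelines use a library STFT
    frames = [wav[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wav) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

wav_a = np.random.randn(16000)       # placeholder 1-second clips
wav_b = np.random.randn(16000)
mixture = wav_a + wav_b              # the only audio the model is given
spec_mix = stft_magnitude(mixture)   # network input
spec_a = stft_magnitude(wav_a)       # recovery target for video A
spec_b = stft_magnitude(wav_b)       # recovery target for video B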


Vision can help

A Video Analysis Network and an Audio Analysis Network feed an Audio Synthesizer Network, which produces the sound of the target video.

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.


Audiovisual Model

Video Analysis Network: CNN → Max Pool → K vision channels
Audio Analysis Network: sound spectrogram (STFT) → U-Net → K audio channels (s_1, s_2, …, s_K)
Audio Synthesizer Network: combines the vision and audio channels → sound of the target video

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.
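
A sketch of the synthesizer step under an assumed linear design: a pixel's K vision-channel activations weight the K audio channels to give that pixel's sound:

import numpy as np

def pixel_sound(vision_feat, audio_channels):
    # vision_feat: (K,) activations for one pixel
    # audio_channels: (K, freq, time) spectrogram components s_1 .. s_K
    return np.tensordot(vision_feat, audio_channels, axes=1)  # (freq, time)

K, F, T = 16, 256, 100
s = np.random.randn(K, F, T)   # placeholder audio channels
v = np.random.rand(K)          # placeholder vision activations at one pixel
spec = pixel_sound(v, s)       # estimated spectrogram at that pixel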


Original Audio

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.


What does this sound like?

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.




What regions are making sound?
Original Video

Estimated Volume
Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.
What sounds are they making?
Original Video

Embedding (projected and visualized as color)


Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.
Adjusting Volume

Zhao, Gan, Rouditchenko, Vondrick, McDermott, Torralba. ECCV 2018.


What does it learn?

[Figure: SoundNet architecture, waveform → categories, with hidden units visualized layer by layer]

Layer 1

Layer 5: Smacking-like
Layer 5: Chime-like

Layer 7: Scuba-like
Layer 7: Parents-like

Aytar, Vondrick, Torralba. NIPS 2016
