Computer Vision Sample
Fundamentals of
Computer Vision
George Kudrayvtsev
george.ok@pm.me
This is a sample of the full guide that shows the full table of
contents; it includes the introductory chapters, the chapter
on Tracking, and a bonus appendix on the Human Vision
System.
0 Preface 9
0.1 How to Read This Guide . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1 Introduction 12
1.1 Comparing Disciplines . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Computational Photography . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 Dual Photography . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Edge Detection 31
3.1 The Importance of Edges . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Gradient Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Finite Differences . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 The Discrete Gradient . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Dimension Extension Detection . . . . . . . . . . . . . . . . . . . . . 35
3.4 From Gradients to Edges . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Canny Edge Operator . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 2nd Order Gaussian in 2D . . . . . . . . . . . . . . . . . . . . 38
4 Hough Transform 39
4.1 Line Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Polar Representation of Lines . . . . . . . . . . . . . . . . . . 42
4.1.4 Hough Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.5 Handling Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.6 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Finding Circles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.1 Hough Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 Frequency Analysis 52
5.1 Basis Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.1 Limitations and Discretization . . . . . . . . . . . . . . . . . . 56
5.2.2 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3 Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3.1 Antialiasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Blending 62
6.1 Crossfading and Feathering . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Image Pyramids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 Pyramid Blending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4 Poisson Blending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4.1 Flashback: Recovering Functions . . . . . . . . . . . . . . . . 67
6.4.2 Recovering Complicated Functions . . . . . . . . . . . . . . . 68
6.4.3 Extending to 2D . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.4.4 Je suis une poisson. . . . . . . . . . . . . . . . . . . . . . . . . 71
7.5.3 Total Rigid Transformation . . . . . . . . . . . . . . . . . . . 100
7.5.4 The Duality of Space . . . . . . . . . . . . . . . . . . . . . . . 100
7.5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.6 Intrinsic Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . 101
7.6.1 Real Intrinsic Parameters . . . . . . . . . . . . . . . . . . . . 101
7.7 Total Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . 102
7.8 Calibrating Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.8.1 Method 1: Singular Value Decomposition . . . . . . . . . . . . 105
7.8.2 Method 2: Inhomogeneous Solution . . . . . . . . . . . . . . . 106
7.8.3 Advantages and Disadvantages . . . . . . . . . . . . . . . . . 107
7.8.4 Geometric Error . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.9 Using the Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.9.1 Where’s Waldo the Camera? . . . . . . . . . . . . . . . . . . . 109
7.10 Calibrating Cameras: Redux . . . . . . . . . . . . . . . . . . . . . . . 110
9.3.3 RANSAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
11 Motion 167
11.1 Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
11.1.1 Lucas-Kanade Flow . . . . . . . . . . . . . . . . . . . . . . . . 171
11.1.2 Applying Lucas-Kanade: Frame Interpolation . . . . . . . . . 175
11.2 Motion Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
11.2.1 Known Motion Geometry . . . . . . . . . . . . . . . . . . . . 178
11.2.2 Geometric Motion Constraints . . . . . . . . . . . . . . . . . . 178
11.2.3 Layered Motion . . . . . . . . . . . . . . . . . . . . . . . . . . 179
12 Tracking 181
12.1 Modeling Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
12.1.1 Tracking as Inference . . . . . . . . . . . . . . . . . . . . . . . 183
12.1.2 Tracking as Induction . . . . . . . . . . . . . . . . . . . . . . 184
12.1.3 Making Predictions . . . . . . . . . . . . . . . . . . . . . . . . 184
12.1.4 Making Corrections . . . . . . . . . . . . . . . . . . . . . . . . 185
12.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
12.2 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
12.2.1 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
12.2.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
12.2.3 Kalman in Action . . . . . . . . . . . . . . . . . . . . . . . . . 187
12.2.4 N-dimensional Kalman Filter . . . . . . . . . . . . . . . . . . 189
12.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
12.3 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
12.3.1 Bayes Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
12.3.2 Practical Considerations . . . . . . . . . . . . . . . . . . . . . 196
12.4 Real Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
12.4.1 Tracking Contours . . . . . . . . . . . . . . . . . . . . . . . . 198
12.4.2 Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
12.5 Mean-Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
12.5.1 Similarity Functions . . . . . . . . . . . . . . . . . . . . . . . 201
12.5.2 Kernel Choices . . . . . . . . . . . . . . . . . . . . . . . . . . 202
12.5.3 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
12.6 Issues in Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
13 Recognition 206
13.1 Generative Supervised Classification . . . . . . . . . . . . . . . . . . . 210
13.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . 212
13.2.1 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . 216
13.2.2 Face Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
13.2.3 Eigenfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
13.2.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
13.3 Incremental Visual Learning . . . . . . . . . . . . . . . . . . . . . . . 223
13.3.1 Forming Our Model . . . . . . . . . . . . . . . . . . . . . . . . 224
13.3.2 All Together Now . . . . . . . . . . . . . . . . . . . . . . . . . 226
13.4 Discriminative Supervised Classification . . . . . . . . . . . . . . . . 227
13.4.1 Discriminative Classifier Architecture . . . . . . . . . . . . . . 228
13.4.2 Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . 230
13.4.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
13.5 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 235
13.5.1 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 235
13.5.2 Support Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 236
13.5.3 Extending SVMs . . . . . . . . . . . . . . . . . . . . . . . . . 238
13.5.4 SVMs for Recognition . . . . . . . . . . . . . . . . . . . . . . 242
13.6 Visual Bags of Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
15 Segmentation 270
15.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
15.1.1 Pros and Cons . . . . . . . . . . . . . . . . . . . . . . . . . . 273
15.2 Mean-Shift, Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
15.2.1 Pros and Cons . . . . . . . . . . . . . . . . . . . . . . . . . . 278
15.3 Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
15.4 Graph Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15.4.1 Pros and Cons . . . . . . . . . . . . . . . . . . . . . . . . . . 282
List of Algorithms
Preface
I read that Teddy Roosevelt once said, “Do what you can with
what you have where you are.” Of course, I doubt he was in the
tub when he said that.
— Bill Watterson, Calvin and Hobbes
Computer Vision
If you are here for computer vision, the “reading list” is pretty straightforward: you
can (almost) just read this guide from front to back. You can skip over chapter 6, but
I recommend reading about Image Pyramids because they will come up later in the
chapters on Tracking and Motion. You can also skip the discussion of High Dynamic
Range images and the beginning of the chapter on Video Analysis.
Computational Photography
If you are here for computational photography, the content is a little scattered across
chapters. I recommend the following reading order:
• Start with chapters 1–3 as usual, because they provide the fundamentals;
• skip ahead to chapter 7 to learn a little about cameras, but stop once you hit
CHAPTER 0: Preface
Robotics
You may have stumbled upon this guide when studying ideas in robotics. Though I
recommend starting with my brief notes on robotics, the chapter on Tracking covers
both Kalman filters and particle filters which are fundamental algorithms in robotics.
0.2 Notation
Before we begin to dive into all things computer vision, here are a few things I do in
this notebook to elaborate on concepts:
• An item that is highlighted like this is a “term;” this is some vocabulary or
identifying word/phrase that will be used and repeated regularly in subsequent
sections. I try to cross-reference these any time they come up again to link back
to their first defined usage; most mentions are available in the Index.
• An item that is highlighted like this is a “mathematical property;” such
properties are often used in subsequent sections and their understanding is
assumed there.
• An item in a maroon box, like. . .
1
This bit needs to be expanded to specifically discuss panorama creation more.
Kudrayvtsev 10
COMPUTATIONAL PERCEPTION
I am also starting to include margin notes (like the one here, which just links to my
homepage) that link to the source of the content so that you can easily explore the
concepts further. A lot of it will be Udacity content; if you don’t have access to that,
unfortunately you’ll have to stick with these notes. Other times, though, the notes will
link to the original papers being described; keep an eye out for them!
Introduction
This set of notes is an effort to unify the curriculum across both computer vision
and computational photography, cross-pollinating unique ideas from each
discipline and consolidating concepts that are shared between the two. Before
we begin, it’s important to talk about their differences.
Computer vision is concerned with all of these except display. It is the discipline that
focuses on looking at 2D images from the real world and inferring information from
them, which includes things like geometry as well as a more “high-level” understanding
of things like recognizing objects.
On the other hand, computer graphics is the reverse process. It takes information
about the scene in 3D (given things like how objects look, or how they behave under
various lighting conditions) and creates beautiful 2D images from those descriptions.
Image processing is an underlying discipline for both computer vision and computa-
tional photography that is concerned with treating images as functions, doing things
like blurring or enhancing them. An understanding of optics and sensors is also criti-
cal in all of these disciplines because it provides us with a fundamental understanding
of how the 3D world is transformed into 2D (and vice-versa).
The goal of computer vision is to create programs that can interpret and analyse
images, providing the program with the meaning behind the image. This may involve
concepts such as object recognition as well as action recognition for images in motion
(colloquially, “videos”).
Computer vision is a hard problem because it involves much more complex analysis
relative to image processing. For example, observe the following set of images:
CHAPTER 1: Introduction
The two checkered squares A and B have the same color intensity, but our brain
interprets them differently without the connecting band due to the shadow. Shad-
ows are actually quite important to human vision. Our brains rely on shadows to
create depth information and track motion based on shadow movements to resolve
ambiguities.
A computer can easily figure out that the intensities of the squares are equal, but it’s
much harder for it to “see” the illusion like we do. Computer vision involves viewing
the image as a whole and gaining a semantic understanding of its content, rather than
just processing things pixel-by-pixel.
$$\underbrace{\text{photo}}_{\text{light}}\;\underbrace{\text{graphy}}_{\text{drawing}}$$
Computational photography is what arises when you bring the power of modern
computers to digital cameras. The complex algorithms that run on your phone when
you take a photo (faking depth-of-field, snapping HDR scenes, crafting panoramas
in real time, etc.) are all part of the computational photography sphere. In essence,
we combine modern principles in computing, digital sensors, optics, actuators, lights,
and more to “escape” the limitations of traditional film photography.
The pipeline that takes light from a scene in the real world and outputs something
the user can show their friends is quite complex:
The beauty is that computation can be embedded at every level of this pipeline to
support photography. We’ll be referring to ways to computationally impact each of
these elements throughout this guide.
1
You can read the original paper from 2007 here.
in our setup, allowing us to control exactly what the camera captures from the scene.
If we tightly couple the modulation of the lighting and camera, we can see the effect
that various changes will have on our resulting photo:
Helmholtz reciprocity essentially states that for a given ray of light, where it
comes from and who observes it are interchangeable: light behaves the same way
“forward” from a light source and “backwards” from an observer.²
For our case, this means that by opening up certain areas of the projector’s modulator,
we can observe and record their effect on the image. We create an association between
the “open cell” and the “lit image,” and given enough data points we can actually
recreate the scene from the projector’s perspective (see Figure 1.3).
Figure 1.3: On the left is the “primal” image captured by the camera under
full illumination of the scene by the projector. On the right is a recreation of
the scene from a number of controlled illuminations from the perspective of the
projector.
This is an excellent example of how we can use computing to create a novel image
that would be impossible to capture with traditional film.
2
Feel free to check out the Wikipedia page for a more thorough explanation, and a later chapter
on BRDFs touches on this idea as well.
Basic Image Manipulation
Digital images are just arrays of numbers that represent intensities of color
arranged in a rectangle. Computationally, we typically just treat them like a
matrix.
This segues perfectly into the notion of treating images simply as (discrete) functions:
mappings of (x, y) coordinates to intensity values.
CHAPTER 2: Basic Image Manipulation
Addition Adding two images will result in a blend between the two. As we
discussed, though, intensities have a range [min, max]; thus, adding is often performed
as an average of the two images instead, so as not to lose intensities when their sum
exceeds max:
$$I_{\text{added}} = \frac{I_a}{2} + \frac{I_b}{2}$$
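As a minimal sketch of this averaged addition, assuming 8-bit grayscale images stored as nested Python lists (the function name `add_images` and the sample values are purely illustrative):

```python
def add_images(img_a, img_b):
    """Blend two equally sized grayscale images as I_a / 2 + I_b / 2."""
    return [
        [a / 2 + b / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


img_a = [[200, 255], [0, 100]]
img_b = [[100, 255], [50, 100]]
blended = add_images(img_a, img_b)

# A plain sum would give 300 and 510 in the top row, exceeding max = 255;
# the average always stays within the original range.
print(blended)
```

Halving each image before adding (rather than adding and then halving) means no intermediate value ever leaves the valid range, which matters when the pixels are stored in a fixed-width integer type.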
Subtraction In contrast, subtracting two images will give the difference between
the two. A smaller intensity indicates more similarity between the two source images
at that pixel. Note that the order of the operands matters, though the results are
negations of each other:
$$I_a - I_b = -(I_b - I_a)$$
Often, we simply care about the absolute difference between the images. Because we
are often operating in a discrete space that will truncate negative values to zero (for
example, when operating on images represented as uint8), we can use a special
formulation to get this difference:
$$I_{\text{diff}} = (I_a - I_b) + (I_b - I_a)$$
With truncation, at most one of the two terms is nonzero at any given pixel, so their
sum is exactly the absolute difference.
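Here is a small sketch of that formulation, under the assumption that subtraction truncates (saturates) at zero. The helper names are hypothetical. Note that this is an assumption about the arithmetic: NumPy's `uint8` subtraction, for instance, wraps around instead of truncating, so there you would need a saturating subtraction (or a wider type) to get the same behavior.

```python
def saturating_sub(a, b):
    """Subtract b from a, truncating negative results to 0 (unsigned pixels)."""
    return max(a - b, 0)


def abs_diff(img_a, img_b):
    """Absolute difference as (I_a - I_b) + (I_b - I_a) with truncated subtraction."""
    return [
        [saturating_sub(a, b) + saturating_sub(b, a) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


img_a = [[10, 200], [50, 0]]
img_b = [[30, 150], [50, 255]]
diff = abs_diff(img_a, img_b)
print(diff)
```

Because one of the two truncated terms is always zero, the result does not depend on which image is passed first.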
Noise
A common function that is added to a single image is a noise function. One of these
is called the Gaussian noise function: it adds a variation in intensity drawn from a
Gaussian (normal) distribution. We basically add a random intensity value to every
pixel of an image. Here we see Gaussian noise added to a classic example image¹ used
in computer vision.
1
of a model named Lena, actually, who is posing for the centerfold of an issue of Playboy. . .
“overflowing” (or underflowing) noise addition that results in pixel values outside of
that range!
To clarify, consider a single pixel in the range [0, 255] and the intensity 200. If the
noise function added 60 intensity, how would you derive the original value of 200 from
the new value of 255 even if you did know that 60 was added? The original value
could be anywhere in the range [195, 255]!
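To make the information loss concrete, here is a tiny sketch (the function name and defaults are illustrative) that adds a known amount of noise, clips to [0, 255], and then lists every original intensity that would have produced the same observation:

```python
def add_noise_clipped(value, noise, lo=0, hi=255):
    """Add noise to a pixel and clip the result to [lo, hi]."""
    return min(max(value + noise, lo), hi)


noise = 60
observed = add_noise_clipped(200, noise)

# Every original intensity that yields the same clipped observation:
candidates = [v for v in range(256) if add_noise_clipped(v, noise) == observed]
print(observed, candidates[0], candidates[-1])
```

Even though we know exactly how much noise was added, the clipped value maps back to a whole range of possible originals, so the operation cannot be undone.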
Averages in 2D
Extending a moving average to 2 dimensions is relatively straightforward. You take
the average of a range of values in both directions. For example, in a 100×100 image,
you may want to take an average over a moving 5×5 square. So, disregarding edge
values, the value of the pixel at (2, 2) would be:
In other words, with our square (typically referred to as a kernel or window) ex-
tending k = 2 pixels in both directions, we can derive the formula for correlation
Handling Borders Notice that we conveniently placed the red averaging window
in Figure 2.2 so that it fell completely within the bounds of the image. What would
we do along the top row, for example? We’ll discuss this in a little bit, in Boundary
Issues.
Results We’ve succeeded in removing all noise from an input image. Unfortunately,
this throws out the baby with the bathwater if our goal was just to remove some extra
Gaussian noise (like the speckles in Lena, above), but we’ve coincidentally discovered
a different, interesting effect: blurring images.
Now, what kind of filtering function should we apply to an image of such a single
bright pixel to produce such a blurry spot? Well, a function that looked like that
blurry spot would probably work best: higher values in the middle that fall off (or
attenuate) to the edges. This is a Gaussian filter, which is an application of the
2D Gaussian function:
$$h(u, v) = \underbrace{\frac{1}{2\pi\sigma^2}}_{\text{normalization coefficient}} e^{-\frac{u^2 + v^2}{2\sigma^2}} \tag{2.3}$$
In such a filter, the nearest neighboring pixels have the most influence. This is much
like the weighted moving average presented in (2.2), but with weights that better
represent “nearness.” Such weights are “circularly symmetric,” which mathematically
are said to be isotropic; thus, this is the isotropic Gaussian filter. Note the nor-
malization coefficient: this value affects the brightness of the blur, not the blurring
itself.
Gaussian Parameters
The Gaussian filter is a mathematical operation that does not care about pixels. Its
only parameter is the standard deviation σ (the variance is σ²), which represents the
“amount of smoothing” that the filter performs. Of course, when dealing with images,
we need to apply the filter to a particular range of pixels; this is called the kernel.
Now, it’s critical to note that modifying the size of the kernel is not the same thing as
modifying the variance. They are related, though. The kernel has to be “big enough”
to fairly represent the variance and let it perform a smoother blurring.
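The relationship between kernel size and σ can be checked numerically. The sketch below assumes the standard normalized 2D Gaussian (with 2σ² in the denominator of the exponent) sampled on an odd-sized integer grid; `gaussian_kernel` and `kernel_sum` are illustrative names, and the sizes are arbitrary choices.

```python
import math


def gaussian_kernel(size, sigma):
    """Sample a 2D isotropic Gaussian on an odd size x size grid centered at 0."""
    k = size // 2
    return [
        [
            math.exp(-(u * u + v * v) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            for v in range(-k, k + 1)
        ]
        for u in range(-k, k + 1)
    ]


def kernel_sum(kernel):
    """Total weight captured by the kernel (1.0 would be the whole Gaussian)."""
    return sum(sum(row) for row in kernel)


sigma = 2.0
small = gaussian_kernel(3, sigma)    # too small: cuts off most of the bell
large = gaussian_kernel(13, sigma)   # spans about 3 sigma on each side

print(round(kernel_sum(small), 3), round(kernel_sum(large), 3))
```

With the same σ, the small kernel captures only a fraction of the Gaussian's total weight, while the larger one captures nearly all of it. A common rule of thumb is to let the kernel reach roughly 3σ from the center; in practice the sampled weights are often rescaled to sum to 1 so that the blur does not change overall brightness.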
Shift Invariance The property of shift invariance states that an operator behaves
the same everywhere. In other words, the output depends on the pattern of the image
neighborhood, rather than the position of the neighborhood. An operator must give
the same result on a pixel regardless of where that pixel (and its neighbors) is located
to maintain shift invariance.
2.3.1 Impulses
An impulse function in the discrete world is a very easy function (or signal) to
understand: its value is 1 at a single location and 0 everywhere else. In the continuous
world, an impulse is an idealized function which is very narrow, very tall, and has a
unit area (i.e. an area of 1). In the limit, it has zero width and infinite height; its
integral is 1.
2.3.2 Convolution
Let’s revisit the cross-correlation equation from Computing Averages:
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i + u, j + v] \tag{2.2}$$
and see what happens when we treat it as a system H and apply impulses. We begin
with an impulse signal F (an image), and an arbitrary kernel H:
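$$F(x, y) = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \qquad H(u, v) = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}$$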
What is the result of filtering the impulse signal with the kernel? In other words,
what is G(x, y) = F (x, y) ⊗ H(u, v)? As we can see in Figure 2.3, the resulting image
is a flipped version (in both directions) of the filter H.
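This flipping is easy to verify directly. The sketch below implements the cross-correlation sum from (2.2) with out-of-bounds pixels treated as zero (an assumption made here for simplicity) and uses the numbers 1 through 9 as stand-ins for the kernel entries a through i:

```python
def correlate(F, H):
    """Cross-correlation G[i][j] = sum over u, v of H[u][v] * F[i+u][j+v],
    treating out-of-bounds pixels of F as 0."""
    k = len(H) // 2
    rows, cols = len(F), len(F[0])
    G = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for u in range(-k, k + 1):
                for v in range(-k, k + 1):
                    y, x = i + u, j + v
                    if 0 <= y < rows and 0 <= x < cols:
                        total += H[u + k][v + k] * F[y][x]
            G[i][j] = total
    return G


# 5x5 impulse image and a 3x3 kernel with distinct entries (1..9 stand in for a..i).
F = [[0] * 5 for _ in range(5)]
F[2][2] = 1
H = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

G = correlate(F, H)
for row in G:
    print(row)
```

The center of the output holds the kernel rotated by 180 degrees: the entry standing in for i lands in the top-left and the entry for a in the bottom-right.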
We introduce the concept of a convolution operator to account for this “flipping.”
Figure 2.3: (a) The result (right) of applying the filter H (in red) on F at (1, 1) (in
blue). (b) The result (right) of subsequently applying the filter H on F at (2, 1) (in
blue); the kernel covers a new area in red.
Applying the filter at every remaining position gives:
$$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & i & h & g & 0 \\ 0 & f & e & d & 0 \\ 0 & c & b & a & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
This filter flips both dimensions. Convolution filters must be shift invariant.
Properties
Because convolution (and correlation) is both linear- and shift-invariant, it maintains
some useful properties:
• Commutative: F ⊛ G = G ⊛ F
• Associative: (F ⊛ G) ⊛ H = F ⊛ (G ⊛ H)
• Identity: Given the “unit impulse” e = [. . . , 0, 1, 0, . . .], then f ⊛ e = f
• Differentiation: $\frac{\partial}{\partial x}(f \circledast g) = \frac{\partial f}{\partial x} \circledast g$. This property will be useful later, in
Handling Noise, when we find gradients for edge detection.
Computational Complexity
If an image is N × N and a filter is W × W, how many multiplications are necessary
to compute their convolution (N ⊛ W)?
Well, a single application of the filter requires W² multiplications, and the filter must
be applied for every pixel, so N² times. Thus, it requires N²W² multiplications,
which can grow to be fairly large.
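Separability There is room for optimization here for certain filters. If the filter is
separable, meaning you can get the kernel H by convolving a single column vector
with a single row vector, as in the example:
$$H = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \circledast \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}$$
then we can use the associative property to remove a lot of multiplications. The
result, G, can be simplified:
$$\begin{aligned} G &= H \circledast F \\ &= (C \circledast R) \circledast F \\ &= C \circledast (R \circledast F) \end{aligned}$$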
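The savings can be counted explicitly. The following sketch filters a small test image once with the full 3×3 kernel and once with the row vector followed by the column vector, counting multiplications along the way. The helper `correlate2d`, the zero padding, and the test image are all assumptions made for illustration; because these particular kernels are symmetric, correlation and convolution give the same answer here.

```python
def correlate2d(F, H):
    """Filter image F with kernel H (zero padding) and count multiplications.
    For the symmetric kernels used below, correlation equals convolution."""
    kr, kc = len(H) // 2, len(H[0]) // 2
    rows, cols = len(F), len(F[0])
    G = [[0] * cols for _ in range(rows)]
    mults = 0
    for i in range(rows):
        for j in range(cols):
            for u in range(len(H)):
                for v in range(len(H[0])):
                    y, x = i + u - kr, j + v - kc
                    pixel = F[y][x] if 0 <= y < rows and 0 <= x < cols else 0
                    G[i][j] += H[u][v] * pixel
                    mults += 1
    return G, mults


C = [[1], [2], [1]]                          # column vector
R = [[1, 2, 1]]                              # row vector
H = [[c[0] * r for r in R[0]] for c in C]    # outer product: the full 3x3 kernel

N = 6
F = [[(3 * i + 5 * j) % 7 for j in range(N)] for i in range(N)]

G_full, mults_full = correlate2d(F, H)       # one pass with the W x W kernel
G_rows, mults_r = correlate2d(F, R)          # first pass: row vector
G_sep, mults_c = correlate2d(G_rows, C)      # second pass: column vector

print(G_full == G_sep, mults_full, mults_r + mults_c)
```

For an N × N image and a W × W separable kernel, the two 1D passes need 2WN² multiplications instead of W²N², and the gap widens quickly as W grows.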
2
Consider, for example, the offerings of OpenCV for extending the borders of images.
• Clipping: This method simply treats the nonexistent pixels as black. Images
filtered with this method result in a black border bleeding into their edges.
Such an effect is very noticeable, but may be desirable. It’s similar to the
artistic “vignette” effect.
• Wrapping: This method uses the opposite edge of the image to continue the
edge. It was intended for periodic functions and is useful for seamless images,
but looks noticeably bad for non-seamless images. The colors from the opposite
edge unnaturally bleed into the border.
• Extending: This method copies a chunk of the edge to fit the filter kernel. It
provides good results that don’t have noticeable artifacts like the previous two.
• Reflecting: This method copies a chunk of the edge like the previous method,
but it mirrors the edge like a reflection. It often results in slightly more natural-
looking results, but the differences are largely imperceptible.
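The four methods above can be sketched on a single row of pixels. This is only an illustration: `pad_row` and its mode names are hypothetical, "extending" is interpreted here as repeating the edge pixel, and the padding width `k` is assumed to be no larger than the row itself. Libraries such as OpenCV and NumPy offer comparable border modes under their own names.

```python
def pad_row(row, k, mode):
    """Pad a 1D list of pixels by k values on each side using the given method."""
    n = len(row)
    if mode == "clip":        # treat missing pixels as black
        left, right = [0] * k, [0] * k
    elif mode == "wrap":      # continue with the opposite edge
        left, right = row[n - k:], row[:k]
    elif mode == "extend":    # repeat the edge pixel
        left, right = [row[0]] * k, [row[-1]] * k
    elif mode == "reflect":   # mirror the pixels nearest the edge
        left, right = row[:k][::-1], row[n - k:][::-1]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + row + right


row = [2, 5, 6, 4, 9]
for mode in ("clip", "wrap", "extend", "reflect"):
    print(mode, pad_row(row, 2, mode))
```

A kernel extending k pixels in each direction needs k padded values per side, so the same function would be applied to every row and then to every column of the image.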
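For example, an image (left) and the same image padded by one pixel along its top and
left edges by copying the edge values (right):
$$\begin{bmatrix} 2 & 5 & 6 & 4 & 9 & 6 & 4 \\ 10 & 7 & 7 & 10 & 6 & 8 & 10 \\ 1 & 3 & 8 & 5 & 1 & 10 & 7 \\ 2 & 10 & 8 & 2 & 10 & 5 & 4 \\ 5 & 4 & 3 & 7 & 4 & 7 & 5 \\ 6 & 7 & 7 & 3 & 9 & 3 & 9 \\ 6 & 5 & 7 & 4 & 7 & 1 & 3 \end{bmatrix} \qquad \begin{bmatrix} 2 & 2 & 5 & 6 & 4 & 9 & 6 & 4 \\ 2 & 2 & 5 & 6 & 4 & 9 & 6 & 4 \\ 10 & 10 & 7 & 7 & 10 & 6 & 8 & 10 \\ 1 & 1 & 3 & 8 & 5 & 1 & 10 & 7 \\ 2 & 2 & 10 & 8 & 2 & 10 & 5 & 4 \\ 5 & 5 & 4 & 3 & 7 & 4 & 7 & 5 \\ 6 & 6 & 7 & 7 & 3 & 9 & 3 & 9 \\ 6 & 6 & 5 & 7 & 4 & 7 & 1 & 3 \end{bmatrix}$$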
Note that the larger your kernel, the more padding necessary, and as a result, the
more information is “lost” (or, more accurately, “made up”) in the resulting value at
that edge.
where the 90 is clearly an instance of “salt” noise sprinkled into the image.
Finding the median:
10 15 20 23 27 30 31 33 90
results in replacing the center point with intensity = 27, which is much better
than a weighted box filter (as in (2.2)) which could have resulted in an intensity
of 61.³
An interesting benefit of this filter is that any new pixel value was already
present locally, which means new pixels never have any “weird” values.
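The arithmetic of this example can be reproduced in a few lines. The 3×3 layout below is assumed (only the nine values and the outlier in the center are given), and the weighted filter uses 1/9 for each neighbor and 4/9 for the center:

```python
import statistics

# 3x3 neighborhood from the example; the layout is assumed, with the
# "salt" outlier (90) in the center.
window = [
    [10, 15, 20],
    [23, 90, 27],
    [33, 31, 30],
]
values = [v for row in window for v in row]

median_value = statistics.median(values)

# Weighted average with 1/9 on the eight neighbors and 4/9 on the center.
center = window[1][1]
neighbors_sum = sum(values) - center
weighted_value = neighbors_sum / 9 + 4 * center / 9

print(median_value, weighted_value)
```

The median ignores the outlier entirely and returns a value that already exists in the neighborhood, whereas the weighted average is dragged far toward the outlier.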
3
. . . if choosing 1/9 for non-center pixels and 4/9 for the center.
In Adobe Photoshop and other editing software, the “unsharp mask” tool
actually sharpens the image. Why?
In the days of actual film, when photos had negatives and were developed in
dark rooms, someone came up with a clever technique. If light were shone
on a negative that was covered by wax paper, the result was a negative of
the negative that was blurrier than its original. If you then developed this
negative of a negative layered on top of the original negative, you would get
a sharper version of the resulting image!
This is a chemical replication of the exact same filtering mechanism as the
one we described in the sharpening filter, above! We had our original (the
negative) and were subtracting (because it was the negative of the negative)
a blurrier version of the negative. Hence, again, the result was a sharper
developed image.
This blurrier double-negative was called the “unsharp mask,” hence the his-
torical name for the editing tool.
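The same idea can be sketched digitally in one dimension. This is a rough illustration rather than the exact filter from these notes: it assumes a 3-tap box blur with repeated edge samples and the common formulation original + amount × (original − blurred); the function names and the `amount` parameter are hypothetical.

```python
def box_blur(signal):
    """3-tap moving average; edge samples are repeated at the borders."""
    padded = [signal[0]] + signal + [signal[-1]]
    return [sum(padded[i:i + 3]) / 3 for i in range(len(signal))]


def unsharp_mask(signal, amount=1.0):
    """Add back the detail that blurring removed:
    original + amount * (original - blurred)."""
    blurred = box_blur(signal)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]


edge = [10, 10, 10, 40, 40, 40]       # a step edge
sharpened = unsharp_mask(edge)
print(sharpened)
```

Flat regions are untouched, while the samples on either side of the edge are pushed apart, which is what makes the edge look crisper. A real implementation would also clip the result to the valid intensity range.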
Filter Normalization Recall how the filters we are working with are linear opera-
tors. Well, if we correlated an image (again, see (2.2)), and multiplied that correlation
filter by some constant, then the resulting image would be scaled by that same con-
stant. This makes it tricky to compare filters: if we were to compare filtered image 1
against filtered image 2 to see how much the source images “respond” to their filters,
we would need to make sure that both filters operate on a similar scale. Otherwise,
outputs may differ greatly but not reflect an accurate comparison.
This topic is called normalized correlation, and we’ll discuss it further later. To
summarize things for now, suffice to say that “normalization” means that the standard
deviation of our filters will be consistent. For example, we may say that all of our
filters will ensure σ = 1. Not only that, but we also need to normalize the image as
we move the kernel across it. Consider two images with the same filter applied, but
one image is just a scaled version of the other. We should get the same result, but
only if we ensure that the standard deviation within the kernel is also consistent (or
σ = 1).
Again, we’ll discuss implementing this later, but for now assume that all of our future
Figure 2.7: A signal and a filter which is part of the signal (specifically, from
x ∈ [5, 10]).
Where’s Waldo?
Suppose we have an image from a Where’s Waldo? book, and we’ve extracted the
“solution” image of Waldo from it, like so:
If we perform a correlation between this template image as our filter and the image
it came from, we will get a correlation map whose maximum tells us where Waldo is!
Figure 2.8: The original image (left) and the correlation map between it and
the template filter from above, with brightness corresponding to similarity.
See that tiny bright spot around the center of the top half of the correlation map in
Figure 2.8? That’s Waldo’s location in the original image!
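The same search is easy to sketch in one dimension, in the spirit of the signal-and-filter setup of Figure 2.7. The code below is an illustrative take on normalized correlation, not necessarily the formulation developed later in these notes: both the template and each window of the signal are shifted to zero mean and scaled to unit standard deviation before being compared. The signal values and function names are made up for the example.

```python
import math


def normalized(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def match_template(signal, template):
    """Return the normalized correlation score at each offset of the signal."""
    t = normalized(template)
    scores = []
    for start in range(len(signal) - len(template) + 1):
        window = normalized(signal[start:start + len(template)])
        scores.append(sum(a * b for a, b in zip(window, t)) / len(t))
    return scores


signal = [0, 1, 0, 2, 1, 3, 7, 4, 8, 2, 5, 1, 0, 2, 1, 0, 3, 1, 2, 0]
template = signal[5:10]          # the "Waldo" cut out of the signal itself
scores = match_template(signal, template)
best = max(range(len(scores)), key=lambda i: scores[i])
print(best, round(scores[best], 3))
```

The score peaks at the offset the template was cut from, just like the bright spot in the correlation map. With this normalization every score lies between -1 and 1, so scores from different templates or differently exposed images can be compared on the same scale.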
Applications
What’s template matching useful for? Can we apply it to other problems? What
about using it to match shapes, lines, or faces? We must keep in mind the limitations
of this rudimentary matching technique. Template matching relies on having a near-
perfect representation of the target to use as a filter. Using it to match lines – which
vary in size, scale, direction, etc. – is unreliable. Similarly, faces may be rotated,
scaled, or have varying features. There are much better options for this kind of
matching that we’ll discuss later. Specific targets, like icons on a computer or
words in a specific font, are viable applications of template matching.
Tracking
CHAPTER 12: Tracking
those objects appear again (called disocclusions), it’s hard to reconcile and
reassociate things. Is this an object we lost previously, or a new object to track?
We somewhat incorporated dynamics when we discussed using Motion Models with
Lucas-Kanade: we expected points to move along an affine transformation. We could
improve this further by combining this with Feature Recognition, identifying good
features to track and likewise fitting motion models to them (Shi & Tomasi ‘94). This
is good, but only gets us so far.
Instead, we will focus on tracking with dynamics, which approaches the problem
differently: given a model of expected motion, we should be able to predict the next
frame without actually seeing it. We can then use that frame to adjust the dynamics
model accordingly. This integration of dynamics is the differentiator between feature
detection and tracking: in detection, we detect objects independently in each frame,
whereas with tracking we predict where objects will be in the next frame using an
estimated dynamic model.
The benefit of this approach is that the trajectory model restricts the necessary search
space for the object, and it also improves estimates, since the smoothness of the
expected trajectory model reduces the effect of measurement noise.
As usual, we need to make some fundamental assumptions to simplify our model and
construct a mathematical framework for continuous motion. In essence, we’ll be ex-
pecting small, gradual change in pose between the camera and the scene. Specifically:
• Unlike small children, who have no concept of object permanence, we assume
that objects do not spontaneously appear or disappear in different places in the
scene.
• Similarly, we assume that the camera does not move instantaneously to a new
viewpoint, which would cause a massive perceived shift in scene dynamics.
Feature tracking is a multidisciplinary problem that isn’t exclusive to computer vision.
There are elements of engineering, physics, and robotics at play. Thus, we need to
take a detour into state dynamics and estimation in order to model the dynamics of
an image sequence.
COMPUTATIONAL PERCEPTION
Our correction, then, is an updated estimate of the state after introducing a new
observation $Z_t = z_t$:
\[ \Pr[X_t \mid z_0, \ldots, z_{t-1}, z_t] \]
We can say that tracking is the process of propagating the posterior distribution
of state given measurements across time. We will again make some assumptions to
simplify the probability distributions:
• We will assume that we live in a Markovian world in that only the immediate
past matters with regard to the actual hidden state:
This is called the observation model, and much like the small motion con-
straint in Lucas-Kanade, this is the most suspect assumption. Thankfully, we
[Figure: a hidden Markov model in which each hidden state X1, X2, ..., Xn produces an observation Z1, Z2, ..., Zn.]
¹ Specifically, the law of total probability states that if we have a joint set A ∩ B and we know
all of the probabilities in B, we can get $\Pr[A]$ if we sum over all of the probabilities in B.
Formally, $\Pr[A] = \sum_n \Pr[A, B_n] = \sum_n \Pr[A \mid B_n] \Pr[B_n]$. For the latter equivalence, recall
that $\Pr[U, V] = \Pr[U \mid V] \Pr[V]$; this is the conditioning property.
In our working example, $X_t$ is part of the same probability space as $X_{t-1}$ (and all of the $X_i$s that
came before it), so we can apply the law, using the integral instead of the sum.
To explain this equation in English, what we’re saying is that the likelihood of being
at a particular spot (this is Xt ) depends on the probability of being at that spot
given that we were at some previous spot weighted by the probability of that previous
spot actually happening (our corrected estimate for Xt−1 ). Summing over all of the
possible “previous spots” (that is, the integral over Xt−1 ) gives us the marginalized
distribution of Xt .
As we’ll see, the scary-looking denominator is just a normalization factor that ensures
the probabilities sum to 1, so we’ll never really need to worry about it explicitly.
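To make the predict-then-correct cycle concrete, here is a minimal sketch of the same two steps over a discrete state space, where the integral over the previous state becomes a sum (a matrix-vector product) and the denominator becomes a division by the total. The three-cell world, the `transition` matrix, and the `likelihood` values are made-up numbers for illustration; they are assumptions, not values from the guide.

```python
import numpy as np

def predict(belief, transition):
    """Law of total probability: sum over every possible previous state.

    transition[i, j] = Pr[X_t = i | X_{t-1} = j], so each column sums to 1.
    """
    return transition @ belief

def correct(predicted, likelihood):
    """Weight the prediction by the measurement likelihood, then normalize.

    The division is the normalization factor that makes the result sum to 1.
    """
    unnormalized = likelihood * predicted
    return unnormalized / unnormalized.sum()

# Hypothetical 3-cell circular world: the object tends to move one cell over.
transition = np.array([
    [0.2, 0.0, 0.8],
    [0.8, 0.2, 0.0],
    [0.0, 0.8, 0.2],
])
belief = np.array([1.0, 0.0, 0.0])        # we are certain we start in cell 0
likelihood = np.array([0.1, 0.7, 0.2])    # Pr[z | X = i] for the observed z

predicted = predict(belief, transition)
corrected = correct(predicted, likelihood)
```

Here the prediction spreads the belief according to the motion model, and the correction sharpens it toward the cell that best explains the measurement.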
12.1.5 Summary
We’ve developed the probabilistic model for predicting and subsequently correcting
our state based on some observations. Now we can dive into actual analytical models
that apply these mathematics and do tracking.
Dynamics Model
More formally, a linear dynamics model says that the state at some time, $x_t$, depends
on the previous state $x_{t-1}$ undergoing some linear transformation $D_t$, plus some level
of Gaussian process noise $\Sigma_{d_t}$ which represents uncertainty in how “linear” our model
truly is.⁴ Mathematically, we have our normal distribution function $\mathcal{N}$,⁵ and so
\[ x_t \sim \mathcal{N}(D_t x_{t-1}, \Sigma_{d_t}) \]
Notice the subscript on Dt and dt : with this, we indicate that the transformation
itself may change over time. Perhaps, for example, the object is moving with some
velocity, and then starts rotating. In our examples, though, these terms will stay
constant.
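For example, suppose we’re tracking position and velocity as part of our state:
\[ x_t = \begin{bmatrix} p_x & p_y \\ v_x & v_y \end{bmatrix}_t \]
and our “linear transformation” is the simple fact that the new position is
the old position plus the velocity; so, given the transformation matrix
\[ D = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \]
our dynamics model indicates:
\[ D x_t = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} p_x & p_y \\ v_x & v_y \end{bmatrix}_t = \begin{bmatrix} p_x + v_x & p_y + v_y \\ v_x & v_y \end{bmatrix} \]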
Meaning the new state after a prediction based on the dynamics model would just be
a noisy version of that:
xt+1 = Dxt + noise
Essentially, D is crafted to fit the specific Kalman filter for your scenario. Formulating
the matrix is just a matter of reading off the coefficients in the linear equation(s); in
this case:
\[ D = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \mapsto \begin{cases} p' = 1 \cdot p + 1 \cdot v \\ v' = 0 \cdot p + 1 \cdot v \end{cases} \]
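As a quick numeric check of the dynamics model above, the following sketch applies D to a position/velocity state. The state values, the helper name `predict_state`, and the `process_std` parameter are hypothetical choices for illustration.

```python
import numpy as np

# Rows are (position, velocity); columns are the x and y components.
D = np.array([[1.0, 1.0],
              [0.0, 1.0]])
x_t = np.array([[2.0, 3.0],     # p_x, p_y
                [0.5, -1.0]])   # v_x, v_y

def predict_state(D, x, process_std=0.0, rng=None):
    """Apply the linear dynamics, optionally adding Gaussian process noise."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, process_std, size=x.shape) if process_std > 0 else 0.0
    return D @ x + noise

# Noise-free prediction: new position = old position + velocity.
x_next = predict_state(D, x_t)
```

With `process_std` left at zero this reproduces the matrix product exactly; a positive value gives the "noisy version" of the prediction.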
⁴ This is “capital sigma,” not a summation. It symbolizes the variance (std. dev squared): $\Sigma = \sigma^2$.
⁵ $\mathcal{N}(\mu, \sigma^2)$ is a compact way to specify a multivariate Gaussian (since our state can have n variables).
The notation $x \sim \mathcal{N}(\mu, \sigma^2)$ means x is “distributed as” or “sampled from” a Gaussian centered at
$\mu$ with variance $\sigma^2$.
Measurement Model
We also have a linear measurement model describing our observations of the world.
Specifically (and unsurprisingly), the model says that the measurement $z_t$ is the state
$x_t$ linearly transformed by $M_t$, plus some level of Gaussian measurement noise:
\[ z_t \sim \mathcal{N}(M_t x_t, \Sigma_{m_t}) \]
M is also sometimes called the extraction matrix or the measurement function
because its purpose is to extract the measurable component from the state
vector.
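For example, if we’re storing position and velocity in our state $x_t$, but our measurement
can only provide us with position, then our extraction matrix would be $\begin{bmatrix} 1 & 0 \end{bmatrix}$,
since:
\[ M_t x_t = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} p_x & p_y \\ v_x & v_y \end{bmatrix}_t = \begin{bmatrix} p_x & p_y \end{bmatrix}_t \]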
Now note that this probably seems a little strange. Our measurement is based on
the existing state? Well, there is going to be a difference between what we “actually
observe” (the real measurement zt from our sensors) versus how our model thinks
things behave. In a perfect world, the sensor measurement matches our predicted
position exactly, but the error between these (zt − Mxt ) will dictate how we adjust
our state.
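The extraction and the resulting error term can be sketched in a few lines. The predicted state and the sensor reading `z` below are invented numbers, used only to show the shapes involved.

```python
import numpy as np

M = np.array([[1.0, 0.0]])           # extraction matrix: keep position, drop velocity
x_pred = np.array([[2.5, 2.0],       # predicted p_x, p_y
                   [0.5, -1.0]])     # predicted v_x, v_y
z = np.array([[2.8, 1.9]])           # hypothetical sensor reading of (p_x, p_y)

expected = M @ x_pred                # what the model expects the sensor to report
residual = z - expected              # the error (z_t - M x_t) that drives the correction
```

A residual of zero would mean the sensor agreed with the prediction exactly; anything else is the signal the correction step uses.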
12.2.2 Notation
Before we continue, we need to introduce some standard notation to track things.
In our predicted state, $\Pr[X_t \mid z_0, \ldots, z_{t-1}]$, we say that the mean and standard
deviation of the resulting Gaussian distribution are $\mu_t^-$ and $\sigma_t^-$. For our corrected state,
$\Pr[X_t \mid z_0, \ldots, z_{t-1}, z_t]$ (notice the introduction of the latest measurement, $z_t$), we
similarly say that the mean and standard deviation are $\mu_t^+$ and $\sigma_t^+$.
Again, at time t,
• $\mu_t^-$, $\sigma_t^-$: state mean and standard deviation after only prediction.
• $\mu_t^+$, $\sigma_t^+$: state mean and standard deviation after prediction and measurement correction.
Prediction
Let’s suppose our dynamics model defines a state X as being just a scaled version of
the previous state plus some uncertainty:
\[ X_t \sim \mathcal{N}(d X_{t-1}, \sigma_d^2) \tag{12.2} \]
Correction
Suppose our mapping of states to measurements similarly relies on a constant, m:
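\[ z_t \sim \mathcal{N}(m X_t, \sigma_m^2) \]
The Kalman filter defines our new Gaussian (the simplified Equation 12.1) as another
adjustment:
\[ \mu_t^+ = \frac{\mu_t^- \sigma_m^2 + m z_t (\sigma_t^-)^2}{\sigma_m^2 + m^2 (\sigma_t^-)^2} \qquad (\sigma_t^+)^2 = \frac{\sigma_m^2 (\sigma_t^-)^2}{\sigma_m^2 + m^2 (\sigma_t^-)^2} \]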
Intuition
Let’s unpack that monstrosity of an update to get an intuitive understanding of what
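this new mean, $\mu_t^+$, really is. Let’s first divide the numerator and denominator by $m^2$ to “unsimplify.”
We get this mess:
\[ \mu_t^+ = \frac{\mu_t^- \dfrac{\sigma_m^2}{m^2} + \dfrac{z_t}{m} (\sigma_t^-)^2}{\dfrac{\sigma_m^2}{m^2} + (\sigma_t^-)^2} \tag{12.3} \]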
In blue we have our prediction of $X_t$ (the term $\mu_t^-$); it’s weighted by the variance of $X_t$ computed
from the measurement (in red, $\sigma_m^2 / m^2$). Then, in orange, we have our measurement guess of
$X_t$ (the term $z_t / m$), weighted by the variance of the prediction (in green, $(\sigma_t^-)^2$).
Notice that all of this is divided by the sum of the weights (in red and green): this
is just a weighted average of our prediction and our measurement guess
based on variances!
This gives us an important insight that applies to the Kalman filter regardless of the
dimensionality we’re working with. Specifically, our corrected distribution for Xt is
a weighted average of the prediction (i.e. based on all prior measurements except zt )
and the measurement guess (i.e. with zt incorporated).
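Let’s take the equation from (12.3) and substitute a for the measurement variance
and b for the prediction variance. We get:
\[ \mu_t^+ = \frac{a \mu_t^- + b \dfrac{z_t}{m}}{a + b} \]
We can do some manipulation (add $b\mu_t^- - b\mu_t^-$ to the top and factor) to get:
\begin{align*}
\mu_t^+ &= \frac{(a + b)\mu_t^- + b\left(\dfrac{z_t}{m} - \mu_t^-\right)}{a + b} \\
&= \mu_t^- + \frac{b}{a + b}\left(\frac{z_t}{m} - \mu_t^-\right) \\
\mu_t^+ &= \mu_t^- + k\left(\frac{z_t}{m} - \mu_t^-\right)
\end{align*}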
Where k is known as the Kalman gain. What does this expression tell us? Well,
the new mean µ+t is the old predicted mean plus a weighted “residual”: the difference
between the measurement and the prediction. In other words, it’s adjusted based on
how wrong the prediction was!
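The equivalence between the weighted-average form and the gain form is easy to check numerically. This is a small sketch with invented numbers; the function names are hypothetical, and the variable names a, b, and k follow the substitution used in the derivation.

```python
def correct_weighted_average(mu_pred, var_pred, z, m, var_meas):
    """Corrected mean as a variance-weighted average of prediction and measurement guess."""
    a = var_meas / m**2       # variance of the measurement guess z / m
    b = var_pred              # variance of the prediction
    return (a * mu_pred + b * (z / m)) / (a + b)

def correct_with_gain(mu_pred, var_pred, z, m, var_meas):
    """The same correction written as prediction + gain * residual."""
    a = var_meas / m**2
    b = var_pred
    k = b / (a + b)           # Kalman gain
    return mu_pred + k * (z / m - mu_pred)

# Hypothetical numbers: predicted mean 10 (variance 4); measurement 24 with m = 2 (variance 4).
mu_a = correct_weighted_average(10.0, 4.0, 24.0, 2.0, 4.0)
mu_b = correct_with_gain(10.0, 4.0, 24.0, 2.0, 4.0)
```

Both calls land on the same corrected mean, which sits between the prediction (10) and the measurement guess (24 / 2 = 12) and closer to whichever has the smaller variance.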
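Predict:
\begin{align*}
x_t^- &= D_t x_{t-1}^+ \\
\Sigma_t^- &= D_t \Sigma_{t-1}^+ D_t^T + \Sigma_{d_t}
\end{align*}
Correct:
\begin{align*}
K_t &= \Sigma_t^- M_t^T \left(M_t \Sigma_t^- M_t^T + \Sigma_{m_t}\right)^{-1} \\
x_t^+ &= x_t^- + K_t \left(z_t - M_t x_t^-\right) \\
\Sigma_t^+ &= \left(I - K_t M_t\right) \Sigma_t^-
\end{align*}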
We now have a Kalman gain matrix, Kt . As our estimate covariance approaches zero
(i.e. confidence in our prediction grows), the residual gets less weight from the gain
matrix. Similarly, if our measurement covariance approaches zero (i.e. confidence in
our measurement grows), the residual gets more weight.
Let’s look at this math a little more thoroughly because it’s a lot to take in. . . (Kalman ‘60; Udacity ‘12a; Udacity ‘12b)
The prediction step should look somewhat familiar: $x_t^-$ is the result of applying the
dynamics model to the state, and its covariance matrix (the uncertainty) is likewise
adjusted by the process noise.
The correction step is hairier, but is just an adjustment of our simple equations to N
dimensions. First, recall from above (12.3) the Kalman gain:
\[ k = \frac{b}{a + b} = \frac{(\sigma_t^-)^2}{\dfrac{\sigma_m^2}{m^2} + (\sigma_t^-)^2} \]
Similarly here, recall that the closest equivalent to division in linear algebra is
multiplication by the inverse. Also, note that given an n-dimensional state vector:
• $\Sigma_t^-$ is its n × n covariance matrix, and
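Now we can see the state update as an adjustment based on the gain and the residual
error between the prediction and the measurement:
\[ x_t^+ = x_t^- + K_t \left(z_t - M_t x_t^-\right) \]
The covariance is adjusted by the same gain:
\[ \Sigma_t^+ = \left(I - K_t M_t\right) \Sigma_t^- \]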
All of this feels hand-wavy for a very good reason: I barely understand it. Feel free
to read the seminal paper on Kalman filters for more confusing math.
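For readers who find code easier to follow than matrix algebra, here is a minimal NumPy sketch of one predict/correct cycle using the matrix equations from this section. The function names, the one-dimensional position/velocity state, and all noise values are assumptions chosen for illustration; this is not a tuned or production filter.

```python
import numpy as np

def kalman_predict(x_prev, cov_prev, D, process_cov):
    """Prediction: push the state and its covariance through the dynamics."""
    x_pred = D @ x_prev
    cov_pred = D @ cov_prev @ D.T + process_cov
    return x_pred, cov_pred

def kalman_correct(x_pred, cov_pred, z, M, meas_cov):
    """Correction: blend in the measurement z using the Kalman gain."""
    K = cov_pred @ M.T @ np.linalg.inv(M @ cov_pred @ M.T + meas_cov)
    x_new = x_pred + K @ (z - M @ x_pred)               # prediction + gain * residual
    cov_new = (np.eye(len(x_pred)) - K @ M) @ cov_pred
    return x_new, cov_new, K

# Hypothetical setup: state is [position, velocity], and we only measure position.
D = np.array([[1.0, 1.0], [0.0, 1.0]])
M = np.array([[1.0, 0.0]])
process_cov = 0.01 * np.eye(2)
meas_cov = np.array([[1.0]])

x0, cov0 = np.array([0.0, 1.0]), np.eye(2)
x_pred, cov_pred = kalman_predict(x0, cov0, D, process_cov)
x_new, cov_new, K = kalman_correct(x_pred, cov_pred, np.array([1.2]), M, meas_cov)
```

The corrected position ends up between the predicted position (1.0) and the measured one (1.2), and the position variance in `cov_new` is smaller than in `cov_pred`.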
12.2.5 Summary
The Kalman filter is an effective tracking method due to its simplicity, efficiency,
and compactness. Of course, it does impose some fairly strict requirements and has
significant pitfalls for that same reason. The fact that the tracking state is always
represented by a Gaussian creates some huge limitations: such a unimodal distribution
means we only really have one true hypothesis for where the object is. If the object
does not strictly adhere to our linear model, things fall apart rather quickly.
We know that a fundamental concept in probability is that as we get more information,
certainty increases. This is why the Kalman filter works: with each new measured
observation, we can derive a more confident estimate for the new state. Unfortunately,
though, “always being more certain” doesn’t hold the same way in the real world as it
does in the Kalman filter. We’ve seen that the variance decreases with each correction
step, narrowing the Gaussian. Does that always hold, intuitively? We may be more
sure about the distribution, but not necessarily the variance within that distribution.
Consider the following extreme case that demonstrates the pitfalls of the Kalman
filter.
In Figure 12.2, we have our prior distribution and our measurement. Intuitively,
where should the corrected distribution go? When the measurement and prediction
are far apart, we would think that we can’t trust either of them very much. We can
count on the truth being between them, sure, and that it’s probably closer to the
measurement. Beyond that, though, we can’t be sure. We wouldn’t have a very high
peak and our variance may not change much. In contrast, as we see in Figure 12.2,
Kalman is very confident about its corrected prediction.
This is one of its pitfalls.
[Figure 12.2 plot: three Gaussians over x, labeled evidence, posterior, and prior.]
Figure 12.2: One of the flaws of the Kalman model is that it is always more confident in its
distribution, resulting in a tighter Gaussian. In this figure, the red Gaussian is what the Kalman
filter calculates, whereas the blue-green Gaussian may be a more accurate representation of our
intuitive confidence about the truth. As you can see, Kalman is way more confident than we
are; its certainty can only grow after a measurement.
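One way to see why the Kalman filter cannot become less certain is that, in the one-dimensional correction step, the corrected variance is computed without ever looking at the measurement value itself. The sketch below assumes the scalar model from earlier (measurement constant m and measurement variance); the function name and the numbers are hypothetical.

```python
def corrected_variance(var_pred, m, var_meas):
    """Corrected variance from the 1D correction step; note that z never appears."""
    return (var_meas * var_pred) / (var_meas + m**2 * var_pred)

# Hypothetical values: a vague prediction (variance 4) and a sharper sensor (variance 1).
var_post = corrected_variance(var_pred=4.0, m=1.0, var_meas=1.0)
```

However far the measurement lands from the prediction, the result is the same, and it is never larger than the predicted variance.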
Another downside of the Kalman filter is this restriction to linear models for dynam-
ics. There are extensions that alleviate this problem called extended Kalman filters
We’ll also be introducing the notion of perturbation into our dynamics model.
Previously, we had a linear dynamics model that only consisted of our predictions
based on previous state. Perturbation – also called control – allows us to modify
the dynamics by some known model. By convention, perturbation is an input to our
model using the parameter u.
xt+1 ∼ N (Dxt , Σt ) + u
Note: We (always, unlike lecture) use ut for inputs occurring between the state xt−1 and xt .
• We additionally need the sensor model. This gives us the likelihood of our
measurements given some object location: Pr [z | X]. In other words, how likely
are our measurements given that we’re at a location X? It is not a distribution
of possible object locations based on a sensor reading.
• Finally, we need our stream of observations, z, and our known action data, u:
data = {u1 , z2 , . . . , ut , zt }
Given these quantities, what we want is the estimate of X at time t, just like before;
this is the posterior of the state, or belief :
Bel(xt ) = Pr [xt | u1 , z2 , . . . , ut , zt ]
In English, the probability of the current measurement, given all of the past states,
measurements, and inputs only actually depends on the current state. This is some-
times called sensor independence. Second: the probability of the current state –
again given all of the goodies from the past – actually only depends on the previous
state and the current input. This Markovian assumption is akin to the independence
assumption in the dynamics model from before.
As a reminder, Bayes’ rule (described more in footnote 2) can also be viewed as a
proportionality (η is the normalization factor that ensures the probabilities sum to
one):
Pr [x | z] = ηPr [z | x] Pr [x] (12.4)
⁶ The notation $n_{a:b}$ represents a range; it’s shorthand for $n_a, n_{a+1}, \ldots, n_b$.
[Figure: graphical model relating the inputs u_{t−1}, u_t, u_{t+1} and the measurements z_{t−1}, z_t, z_{t+1}.]
\[ \Pr[x \mid z] \propto \Pr[z \mid x] \Pr[x] \tag{12.5} \]
With that, we can apply our given values and manipulate our belief function to get
something more useful. Graphically, what we’re doing is shown in Figure 12.4 (and
again, more visually, in Figure 12.5), but mathematically:
\begin{align*}
Bel(x_t) &= \Pr[x_t \mid u_1, z_2, \ldots, u_t, z_t] \\
&= \eta \Pr[z_t \mid x_t, u_1, z_2, \ldots, u_t] \Pr[x_t \mid u_1, z_2, \ldots, u_t] && \text{(Bayes’ Rule)}
\end{align*}
This results in our final, beautiful recursive relationship between the previous belief
and the next belief based on the sensor likelihood:
\[ Bel(x_t) = \eta \Pr[z_t \mid x_t] \int \Pr[x_t \mid x_{t-1}, u_t] \cdot Bel(x_{t-1}) \, dx_{t-1} \tag{12.6} \]
We can see that there is an inductive relationship between beliefs. The green and
blue sections correspond to the calculations we did with the Kalman filter: we need
to first find the prediction distribution before our latest measurement. Then, we fold
in the actual measurement, which is described by the sensor likelihood model from
before.
With the mathematics out of the way, we can focus on the basic particle filtering algo-
rithm. It’s formalized in algorithm 12.1 and demonstrated graphically in Figure 12.5,
but let’s walk through the process informally.
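As a rough companion to the recursive belief equation, here is a sketch of one update of a basic (bootstrap-style) particle filter: sample from the previous belief, push the samples through the dynamics, then reweight by the sensor likelihood. It is not a transcription of algorithm 12.1; the 1D `motion_model`, the Gaussian `sensor_likelihood`, and every number below are assumptions made for illustration.

```python
import numpy as np

def particle_filter_step(particles, weights, u, z, motion_model, sensor_likelihood, rng):
    """One update of a basic particle filter."""
    n = len(particles)
    # Sample from the previous belief in proportion to the particle weights.
    idx = rng.choice(n, size=n, p=weights)
    # Predict: draw from Pr[x_t | x_{t-1}, u_t] via the noisy dynamics.
    predicted = motion_model(particles[idx], u, rng)
    # Correct: weight by the sensor likelihood Pr[z_t | x_t], then normalize (eta).
    new_weights = sensor_likelihood(z, predicted)
    return predicted, new_weights / new_weights.sum()

# Hypothetical 1D models, chosen only for illustration.
def motion_model(x, u, rng):
    return x + u + rng.normal(0.0, 0.1, size=x.shape)

def sensor_likelihood(z, x):
    return np.exp(-0.5 * ((z - x) / 0.5) ** 2)

rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 10.0, size=2000)            # no idea where the object is
weights = np.full(len(particles), 1.0 / len(particles))
particles, weights = particle_filter_step(
    particles, weights, u=0.0, z=5.0,
    motion_model=motion_model, sensor_likelihood=sensor_likelihood, rng=rng)
estimate = float(np.sum(weights * particles))            # weighted mean of the belief
```

After a single measurement near 5, the weighted mean of the particles moves close to 5 even though the initial particles were spread uniformly.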
Sampling Method
We need a lot of particles to sample the underlying distribution with relative accuracy.
Every timestep, we need to generate a completely new set of samples after working
all of our new information into our estimated distribution. As such, the efficiency or
algorithmic complexity of our sampling method is very important.
We can view the most straightforward sampling method as a direction on a roulette
wheel, as in Figure 12.6a. Our list of weights covers a particular range, and we choose
a value in that range. To figure out which weight that value refers to, we’d need to
perform a binary search. This gives a total O(n log n) runtime. Ideally, though,
sampling runtime should grow linearly with the number of samples!
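The roulette-wheel approach can be sketched with a cumulative sum and a binary search per draw. This assumes the weights are already normalized; the function name and example weights are hypothetical.

```python
import numpy as np

def roulette_resample(weights, rng):
    """Draw n indices; each draw is a binary search into the cumulative weights."""
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                       # guard against floating-point round-off
    draws = rng.random(len(weights))           # n independent spins in [0, 1)
    # side="right" maps a draw to the first bin whose upper edge exceeds it,
    # so zero-weight particles are never selected.
    return np.searchsorted(cumulative, draws, side="right")

rng = np.random.default_rng(0)
weights = np.array([0.0, 0.1, 0.2, 0.7])
indices = roulette_resample(weights, rng)
```

Each of the n spins costs O(log n) for the search, which is where the O(n log n) total comes from.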
As a clever optimization, we can use the systematic resampling algorithm (also called
stochastic universal sampling), described formally in algorithm 12.2. Instead of
viewing the weights as a roulette wheel, we view them as a wagon wheel. We plop down
our “spokes” at a random orientation, as in Figure 12.6b. The spokes are 1/n distance
apart and determining their weights is just a matter of traversing the distance between
each spoke, achieving O(n) linear time for sampling!
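A sketch of the wagon-wheel idea follows: one random offset places all n spokes, and a single forward pass over the cumulative weights assigns each spoke to a particle. It is written from the description above rather than copied from algorithm 12.2, and it assumes normalized weights; the names and example weights are hypothetical.

```python
import numpy as np

def systematic_resample(weights, rng):
    """Resample n indices in O(n) using evenly spaced spokes with one random offset."""
    n = len(weights)
    spokes = (rng.random() + np.arange(n)) / n     # spokes are 1/n apart
    cumulative = np.cumsum(weights)
    indices = np.empty(n, dtype=int)
    i = 0
    for j, spoke in enumerate(spokes):
        # Spokes and cumulative weights are both sorted, so i only moves forward.
        while i < n - 1 and cumulative[i] <= spoke:
            i += 1
        indices[j] = i
    return indices

rng = np.random.default_rng(0)
weights = np.array([0.0, 0.1, 0.2, 0.7])
indices = systematic_resample(weights, rng)
```

Because the inner pointer never moves backward, the whole pass touches each weight at most once. With these weights, the heaviest particle (index 3) is selected two or three times out of four.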
Figure 12.6: Two methods of sampling from our set of weighted particles, with
12.6b being the more efficient method.
Sampling Frequency
We can add another optimization to lower the frequency of sampling. Intuitively,
when would we even want to resample? Probably when the estimated distribution
has changed significantly from our initial set of samples. Specifically, we can resample
only when there is a significant variance in the particle weights; otherwise, we
can just reuse the samples.
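One common way to quantify "significant variance in the particle weights" is the effective sample size, which equals n when all weights are equal and drops toward 1 as a few particles dominate. The guide does not name a specific test, so both this measure and the 0.5 threshold below are assumptions.

```python
import numpy as np

def effective_sample_size(weights):
    """1 / sum(w_i^2) for normalized weights: n when uniform, about 1 when degenerate."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def should_resample(weights, threshold_ratio=0.5):
    """Resample only when the effective sample size falls below a fraction of n."""
    return effective_sample_size(weights) < threshold_ratio * len(weights)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]
decisions = (should_resample(uniform), should_resample(skewed))
```

With uniform weights the samples are reused; with one dominant particle the filter resamples.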
corresponding to the new object. To correct for this, we can apply some randomly
distributed particles every step in order to catch any outliers.
. . . which looks an awful lot like a Gaussian; it’s proportional to the distance to the nearest
strong edge. We can then use this Gaussian as our sensor model and track hands reliably.
Figure 12.8: A contour and its normals. High-contrast features (i.e. edges) are sought out
along these normals.
Figure 12.9: Using edge detection and contours to track hand movement.
12.5 Mean-Shift
The mean-shift algorithm tries to find the modes of a probability distribution;
this distribution is often represented discretely by a number of samples as we’ve seen.
Visually, the algorithm looks something like Figure 12.11 below.
Figure 12.11: Performing mean-shift 2 times to find the area in the distribution
with the most density (an approximation of its mode).
This visual example hand-waves away a few things, such as what shape defines the
region of interest (here it’s a circle) and how big it should be, but it gets the point
across. At each step (from blue to red to finally cyan), we calculate the mean, or
center of mass of the region of interest. This results in a mean-shift vector from
the region’s center to the center of mass, which we follow to draw a new region of
interest, repeating this process until the mean-shift vector gets arbitrarily small.
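The iteration just described can be sketched for a set of 2D samples with a circular region of interest. The radius, tolerance, starting point, and synthetic cluster are all assumptions for illustration.

```python
import numpy as np

def mean_shift(points, start, radius, tol=1e-3, max_iter=100):
    """Follow the mean-shift vector until it becomes smaller than tol."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Region of interest: all samples within `radius` of the current center.
        inside = points[np.linalg.norm(points - center, axis=1) <= radius]
        if len(inside) == 0:
            break
        new_center = inside.mean(axis=0)     # center of mass of the region
        shift = new_center - center          # the mean-shift vector
        center = new_center
        if np.linalg.norm(shift) < tol:
            break
    return center

# Hypothetical samples: one cluster around (5, 5); we start off to the side.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(500, 2))
mode = mean_shift(samples, start=[4.0, 4.0], radius=1.5)
```

Starting off-center, the region drifts toward the densest part of the samples and stops near the cluster center.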
So how does this relate to tracking?
Well, our methodology is pretty similar to before. We start with a pre-defined model
in the first frame. As before, this can be expressed in a variety of ways, but it may
be easiest to imagine it as an image patch and a location. In the following frame, we
search for a region that most closely matches that model within some neighborhood
based on some similarity function. Then, the new maximum becomes the starting
point for the next frame.
What truly makes this mean-shift tracking is the model and similarity functions that
we use. In mean-shift, we use a feature space which is the quantized color space.
This means we create a histogram of the RGB values based on some discretization
of each channel (for example, 2 bits for each channel gives 4 levels per channel and thus a 4³ = 64-bin histogram).
Our model is then this histogram interpreted as a probability distribution function;
this is the region we are going to track.
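A quantized color histogram of this kind can be sketched as follows, keeping the top 2 bits of each 8-bit channel (4 levels per channel, so 4 × 4 × 4 = 64 bins). The function name, the uint8 RGB layout, and the all-red test patch are assumptions.

```python
import numpy as np

def color_histogram(patch, bits_per_channel=2):
    """Normalized histogram of an (H, W, 3) uint8 RGB patch over quantized colors."""
    levels = 1 << bits_per_channel
    # Keep only the top bits of every channel, then flatten to a list of pixels.
    quantized = (np.asarray(patch, dtype=np.uint8) >> (8 - bits_per_channel))
    quantized = quantized.astype(np.intp).reshape(-1, 3)
    # Combine the three quantized channels into a single bin index.
    bin_index = (quantized[:, 0] * levels + quantized[:, 1]) * levels + quantized[:, 2]
    hist = np.bincount(bin_index, minlength=levels ** 3).astype(float)
    return hist / hist.sum()                 # interpret as a probability distribution

patch = np.zeros((4, 4, 3), dtype=np.uint8)
patch[..., 0] = 255                           # a hypothetical pure-red patch
q = color_histogram(patch)
```

For the pure-red patch, all of the probability mass falls into a single bin, and the histogram sums to 1 as the model requires.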
Let’s work through the math. We start with a target model with some histogram
centered at 0. It’s represented by q and contains m bins; since we are interpreting it
as a probability distribution, it also needs to be normalized (sum to 1):
\[ q = \{q_u\}_{u \in [1..m]} \qquad \sum_{u=1}^{m} q_u = 1 \]
We also have some target candidate centered at the point y with its own color
distribution, and $p_u$ is now a function of y:
\[ p(y) = \{p_u(y)\}_{u \in [1..m]} \qquad \sum_{u=1}^{m} p_u = 1 \]
We need a similarity function f (y) to compute the difference between these two
distributions now; maximizing this function will render the “best” candidate location:
f (y) = f [q, p(y)].
\[ p'(y) = \left(\sqrt{p_1(y)}, \sqrt{p_2(y)}, \ldots, \sqrt{p_m(y)}\right) \]
Then, the Bhattacharyya relationship is defined as the sum of the products of these
new distributions:
\[ f(y) = \sum_{u=1}^{m} p'_u(y) \, q'_u \tag{12.7} \]
Well, isn’t the sum of element-wise products the definition of the vector dot product?
We can thus also express this as:
\[ f(y) = p'(y)^T q' = \|p'(y)\| \, \|q'\| \cos\theta_y \]
But since by design these vectors are magnitude 1 (remember, we are treating them
as probability distributions), the Bhattacharyya coefficient essentially uses the cos θy
between these two vectors as a similarity comparison value.
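In code, the Bhattacharyya coefficient is a dot product of the element-wise square roots of the two histograms. The example histograms below are invented to show the extremes.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    p_prime = np.sqrt(np.asarray(p, dtype=float))
    q_prime = np.sqrt(np.asarray(q, dtype=float))
    return float(np.dot(p_prime, q_prime))

q = np.array([0.5, 0.5, 0.0, 0.0])                      # hypothetical target model
same = bhattacharyya(q, q)                               # identical histograms
disjoint = bhattacharyya(q, [0.0, 0.0, 0.5, 0.5])        # no overlapping bins
partial = bhattacharyya(q, [0.25, 0.25, 0.25, 0.25])     # some overlap
```

Identical histograms score 1 (the angle between the vectors is zero), histograms with no bins in common score 0, and partial overlap falls in between.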
Ideally, we’d use something with some better mathematical properties. Let’s use
something that’s differentiable, isotropic, and monotonically decreasing. Does that
sound like anyone we’ve gotten to know really well over the last 202 pages?
That’s right, it’s the Gaussian. Here, it’s expressed with a constant falloff, but we
can, as we know, also have a “scale factor” σ to control that:
\[ K_U(x) = c \cdot \exp\left(-\frac{1}{2} \|x\|^2\right) \]
The most important property of a Gaussian kernel over a uniform kernel is that
it’s differentiable. The spread of the Gaussian means that new points introduced
to the kernel as we slide it along the image have a very small weight that slowly
increases; similarly, points in the center of the kernel have a nearly constant weight that
slowly decreases. We would see the most weight change along the slope of the bell
curve.
We can leverage the Gaussian’s differentiability and use its gradient to see how the
overall similarity function changes as we move. With the gradient, we can actually
optimally “hill climb” the similarity function and find its local maximum rather than
blindly searching the neighborhood.
This is the big idea in mean-shift tracking: the similarity function helps us determine
the new frame’s center of mass, and the search space is reduced by following the
kernel’s gradient along the similarity function.
12.5.3 Disadvantages
Much like the Kalman Filter from before, the biggest downside of using mean-shift
as an exclusive tracking mechanism is that it operates on a single hypothesis of the
“best next point.”
A convenient way to get around this problem while still leveraging the power of
mean-shift tracking is to use it as the sensor model in a particle filter; we treat the
mean-shift tracking algorithm as a measurement likelihood (from before, Pr [z | X]).
The sensor model is much more finicky. We do need some sense of absolute truth
to rely on (even if it’s noisy). This could be the reliability of a sonar sensor for
distance, a preconfigured camera distance and depth, or other reliable truths.
Prediction vs. Correction Remember the fundamental trade-off in our Kalman
Filter: we needed to decide on the relative level of noise in the measurement
(correction) vs. the noise in the process (prediction). If one is too strong, we
will ignore the other. Getting this balance right unfortunately just requires
a bit of magic and guesswork based on any existing data.
Data Association We often aren’t tracking just one thing in a scene, and it’s often
not a simple scene. How do we know, then, which measurements are associated
with which objects? And how do we know which measurements are the result
of visual clutter? The camouflage techniques we see in nature (and warfare) are
designed to intentionally introduce this kind of clutter so that even our vision
systems have trouble with detection and tracking. Thus, we need to reliably
associate relevant data with the state.
The simple strategy is to only pay attention to measurements that are closest
to the prediction. Recall when tracking hand contours (see Figure A.3a) we
relied on the “nearest high-contrast features,” as if we knew those were truly
the ones we were looking for.
There is a more sophisticated approach, though, which relies on keeping multi-
ple hypotheses. We can even use particle filtering for this: each particle becomes
a hypothesis about the state. Over time, it becomes clear which particles correspond
to clutter and which correspond to our object of interest, and we
can even determine when new objects have emerged and begin to track those
independently.
Drift As errors in each component accumulate and compound over time, we run the
risk of drift in our tracking.
One method to alleviate this problem is to update our models over time. For
example, we could introduce an α factor that incorporates a blending of our
“best match” over time with a simple linear interpolation:
Model(t) = αBest(t) + (1 − α)Model(t − 1) (12.8)
There are still risks with this adaptive tracking method: if we blend too much
noise into our sensor model, we’ll eventually be tracking something completely
unrelated to the original template.
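The model update in Equation 12.8 is a one-line blend. In this sketch the model is a small histogram and α = 0.25; both are arbitrary choices for illustration.

```python
import numpy as np

def update_model(model, best_match, alpha=0.1):
    """Blend the newest best match into the running model (Equation 12.8)."""
    model = np.asarray(model, dtype=float)
    best_match = np.asarray(best_match, dtype=float)
    return alpha * best_match + (1.0 - alpha) * model

model = np.array([1.0, 0.0])      # hypothetical 2-bin histogram model
best = np.array([0.0, 1.0])       # best match found in the current frame
model = update_model(model, best, alpha=0.25)
```

A small α keeps the model close to the original template and resists drift, while a large α adapts quickly but risks absorbing noise.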
That ends our discussion of tracking. The notion we introduced of tracking state over
time comes up in computer vision a lot. This isn’t image processing: things change
often!
We introduced probabilistic models to solve this problem. Kalman Filters and Mean-
Shift were methods that rendered a single hypothesis for the next best state, while
Particle Filters maintained multiple hypotheses and converged on a state over time.
Index of Terms
A
affine transformation
attenuate
B
Bhattacharyya coefficient
box filter
BRDF
C
center of mass
convolution
correlation filter, non-uniform weights
correlation filter, uniform weights
cross-convolution filter
cross-correlation
D
dense correspondence search
depth-of-field
disocclusion
dual photography
dynamics model
E
edge-preserving filter
extraction matrix
G
Gaussian filter
Gaussian function
Gaussian noise function
H
HDR
Helmholtz reciprocity
I
impulse function
impulse response
intensity
interpolation
K
Kalman filter
Kalman gain
kernel
L
Lucas-Kanade method
Lucas-Kanade method, hierarchical
M
Markov model, hidden
mean-shift algorithm
mean-shift vector
measurement function
median filter
moving average
N
noise function
normalized correlation
O
observation model
occlusion
optic flow
P
panoramas
particle filtering
particle filters