A seminar on
SCALE INVARIANT FEATURE TRANSFROM
Submitted by:
Hemanthkumar T (1SI09LSP04), MTech (Signal Processing)
Siddaganga Institute of Technology, Tumkur
Email id: hemanthsit.dsp@gmail.com
Abstract The cost of extracting these features is
minimized by taking a cascade filtering approach, in
The method for extracting distinctive which the more expensive operations are applied only
invariant features from images that can be used to at locations that pass an initial test.
perform reliable matching between different views of Following are the major stages of
an object or scene. The features are invariant to computation used to generate the set of image
image scale and rotation, and are shown to provide features:
robust matching across a substantial range of affine 1.1. Scale-space extrema detection:
distortion, change in 3D viewpoint, addition of noise, The first stage of computation searches over
and change in illumination. The features are highly all scales and image locations. It is implemented
distinctive, in the sense that a single feature can be efficiently by using a difference-of-Gaussian function
correctly matched with high probability against a to identify potential interest points that are invariant
large database of features from many images. This to scale and orientation.
paper also describes an approach to using these 1.2. Keypoint localization:
features for object recognition. The recognition At each candidate location, a detailed model
proceeds by matching individual features to a is fit to determine location and scale. Keypoints are
database of features from known objects using a fast selected based on measures of their stability.
nearest-neighbour algorithm, and finally performing 1.3. Orientation assignment:
verification through least squares solution for One or more orientations are assigned to
consistent pose parameters. This approach to each keypoint location based on local image gradient
recognition can robustly identify objects among directions. All future operations are performed on
clutter and occlusion while achieving near real time image data that has been transformed relative to the
performance. assigned orientation, scale, and location for each
feature, thereby providing invariance to these
transformations.
1.4. Keypoint descriptor:
1. Introduction The local image gradients are measured at
the selected scale in the region around each keypoint.
Image matching is a fundamental aspect of These are transformed into a representation that
many problems in computer vision, including object allows for significant levels of local shape distortion
or scene recognition, solving for 3D structure from and change in illumination.
multiple images, stereo correspondence, and motion This approach has been named the Scale
tracking. This paper describes image features that Invariant Feature Transform (SIFT), as it transforms
have many properties that make them suitable for image data into scale-invariant coordinates relative to
matching differing images of an object or scene. The local features.
features are invariant to image scaling and rotation, An important aspect of this approach is that
and partially invariant to change in illumination and it generates large numbers of features that densely
3D camera viewpoint. They are well localized in both cover the image over the full range of scales and
the spatial and frequency domains, reducing the locations. A typical image of size 500x500 pixels will
probability of disruption by occlusion, clutter, or give rise to about 2000 stable features (although this
noise. Large numbers of features can be extracted number depends on both image content and choices
from typical images with efficient algorithms. In for various parameters). The quantity of features is
addition, the features are highly distinctive, which particularly important for object recognition, where
allows a single feature to be correctly matched with the ability to detect small objects in cluttered
high probability against a large database of features, backgrounds requires that at least 3 features be
providing a basis for object and scene recognition. correctly matched from each object for reliable
identification.
1
For image matching and recognition, SIFT It has been shown by Koenderink (1984)
features are first extracted from a set of reference and Lindeberg (1994) that under a variety of
images and stored in a database. A new image is reasonable assumptions the only possible scale-space
matched by individually comparing each feature from kernel is the Gaussian function. Therefore, the scale
the new image to this previous database and finding space of an image is defined as a function, L(x, y,ϭ),
candidate matching that is produced from the convolution of a variable-
features based on Euclidean distance of their feature scale Gaussian, G(x, y, ϭ), with an input image, I(x,
vectors. This paper will discuss fast nearest- y):
neighbour algorithms that can perform this
computation rapidly against large databases.
The keypoint descriptors are highly L ( x , y , σ )=G(x , y , σ )∗I ( x , y )………….
distinctive, which allows a single feature to find its
(1)
correct match with good probability in a large
database of features. However, in a cluttered image,
many features from the background will not have any where ∗ is the convolution operation in x and y, and
correct match in the database, giving rise to many
false matches in addition to the correct ones. The 1 2 2 2
correct matches can be filtered from the full set of G ( x , y , σ )= 2
e−(x + y )/ 2 σ ……………(2)
matches by identifying subsets of keypoints that
2π σ
agree on the object and its location, scale, and
orientation in the new image. The probability that To efficiently detect stable keypoint
several features will agree on these parameters by locations in scale space, we have proposed (Lowe,
chance is much lower than the probability that any 1999) using scale-space extrema in the difference-of-
individual feature match will be in error. The Gaussian function convolved with the image, D(x, y,
determination of these consistent clusters can be _), which can be computed from the difference of
performed rapidly by using an efficient hash table two nearby scales separated by a constant
implementation of the generalized Hough transform. multiplicative factor k:
Each cluster of 3 or more features that agree
on an object and its pose is then subject to further D ( x , y , σ )=( G ( x , y , kσ ) −G ( x , y , σ ) )∗I ( x , y )
detailed verification. First, a least-squared estimate is
made for an affine approximation to the object pose.
¿ L ( x , y ,kσ )−L ( x , y , σ )…….(3)
Any other image features consistent with this pose
are identified, and outliers are discarded. Finally, a There are a number of reasons for choosing
detailed computation is made of the probability that a this function. First, it is a particularly efficient
particular set of features indicates the presence of an function to compute, as the smoothed images, L, need
object, given the accuracy of fit and number of to be computed in any case for scale space feature
probable false matches. Object matches that pass all description, and D can therefore be computed by
these tests can be identified as correct with high simple image subtraction.
confidence.
2. Detection of scale-space extrema
As described in the introduction, we will
detect keypoints using a cascade filtering approach
that uses efficient algorithms to identify candidate
locations that are then examined in further detail. The
first stage of keypoint detection is to identify
locations and scales that can be repeatably assigned
under differing views of the same object. Detecting
locations that are invariant to scale change of the
image can be accomplished by searching for stable
features across all possible scales, using a continuous
function of scale known as scale space (Witkin,
1983).
2
Figure 1: For each octave of scale space, the initial
image is repeatedly convolved with Gaussians to
produce the set of scale space images shown on the
left. Adjacent Gaussian images are subtracted to
produce the difference-of-Gaussian images on the
right. After each octave, the Gaussian image is down-
sampled by a factor of 2, and the process repeated.
In addition, the difference-of-Gaussian
function provides a close approximation to the scale-
normalized Laplacian of Gaussian, ϭ2∇2G, as studied
by Lindeberg (1994). Lindeberg showed that the
normalization of the Laplacian with the factor ϭ 2 is
required for true scale invariance. In detailed
experimental comparisons, Mikolajczyk (2002) found
that the maxima and minima of ϭ 2∇2G produce the Figure 2: Maxima and minima of the difference-of-
most stable image features compared to a range of Gaussian images are detected by comparing a pixel
other possible image functions, such as the gradient, (marked with X) to its 26 neighbours in 3x3 regions
Hessian, or Harris corner function. at the current and adjacent scales (marked with
The relationship between D and ϭ 2∇2G can circles).
be understood from the heat diffusion equation
(parameterized in terms of ϭ rather than the more An efficient approach to construction of
usual t = ϭ2): D(x,y,ϭ) is shown in Figure 1. The initial image is
incrementally convolved with Gaussians to produce
∂G images separated by a constant factor k in scale
=σ ∇2 G ……………… space, shown stacked in the left column. We choose
∂σ to divide each octave of scale space (i.e., doubling of
.(4) ϭ) into an integer number, s, of intervals, so k = 2 1/s.
We must produce s + 3 images in the stack of blurred
From this, we see that ∇2G can be computed from the images for each octave, so that final extrema
finite difference approximation to ∂G/∂ϭ, using the detection covers a complete octave. Adjacent image
difference of nearby scales at kϭ and ϭ: scales are subtracted to produce the difference-of-
Gaussian images shown on the right. Once a
∂G G ( x , y , kσ )−G ( x , y , σ ) complete octave has been processed, we resample the
σ ∇ 2 G= ≈ …… Gaussian image that has twice the initial value of ϭ (it
∂σ kσ−σ
will be 2 images from the top of the stack) by taking
……(5) every second pixel in each row and column. The
accuracy of sampling relative to ϭ is no different than
and therefore, for the start of the previous octave, while
computation is greatly reduced.
G ( x , y , kσ )−G ( x , y , σ ) ≈(k −1)σ 2 ∇ 2 G ..(6)
3. Accurate keypoint localization
This shows that when the difference-of-Gaussian
Once a keypoint candidate has been found
function has scales differing by a constant factor it
by comparing a pixel to its neighbors, the next step is
already incorporates the ϭ2 scale normalization
to perform a detailed fit to the nearby data for
required for the scale-invariant Laplacian. The factor
location, scale, and ratio of principal curvatures. This
(k−1) in the equation is a constant over all scales and
information allows points to be rejected that have low
therefore does not influence extrema location. The
contrast (and are therefore sensitive to noise) or are
approximation error will go to zero as k goes to 1, but
poorly localized along an edge.
in practice we have found that the approximation has
The initial implementation of this approach (Lowe,
almost no impact on the stability of extrema detection
1999) simply located keypoints at the location and
or localization for even significant differences in
scale of the central sample point. However, recently
scale, such as k = √2.
Brown has developed a method (Brown and Lowe,
2002) for fitting a 3D quadratic function to the local
sample points to determine the interpolated location
3
of the maximum, and his experiments showed that resulting 3x3 linear system can be solved with
this provides a substantial improvement to matching minimal cost. If the offset ˆx is larger than 0.5 in any
and stability. His approach uses the Taylor expansion dimension, then it means that the extremum lies
(up to the quadratic terms) of the scale-space closer to a different sample point. In this case, the
function, D(x, y, ϭ), shifted so that the origin is at the sample point is changed and the interpolation
sample point: performed instead about that point. The final offset ˆx
is added to the location of its sample point to get the
^ ∂ DT 1 T ∂2 D interpolated estimate for the location of the
D X =D+
( ) X+ X 2
X ………… extremum.
∂X 2 ∂X The function value at the extremum, D(ˆx),
(7) is useful for rejecting unstable extrema with low
contrast. This can be obtained by substituting
where D and its derivatives are evaluated at the equation (3) into (2), giving
sample point and x = (x, y, ϭ)T is the offset from this
T
point. The location of the extremum, ˆx, is ∂D ^
determined by taking the derivative of this function D(^
X ) =D+ X …………….(9)
∂X
with respect to x and setting it to zero, giving
For the experiments in this paper, all extrema with a
^ −∂2 D −1 ∂ D value of |D(ˆx)| less than 0.03 were discarded (as
X= ………….…(8)
∂ X2 ∂ X before, we assume image pixel values in the range
[0,1]).
Figure 3 shows the effects of keypoint
selection on a natural image. In order to avoid too
much clutter, a low-resolution 233 by 189 pixel
image is used and keypoints are shown as vectors
giving the location, scale, and orientation of each
keypoint (orientation assignment is described below).
Figure 3 (a) shows the original image, which is
shown at reduced contrast behind the subsequent
figures. Figure 3 (b) shows the 832 keypoints at all
detected maxima and minima of the difference-of-
Gaussian function, while (c) shows the 729 keypoints
that remain following removal of those with a value
of |D(x)| less than 0.03. Part (d) will be explained in
the following section.
4. Orientation assignment
By assigning a consistent orientation to each
keypoint based on local image properties, the
Figure 3: This figure shows the stages of keypoint keypoint descriptor can be represented relative to this
selection. (a) The 233x189 pixel original image. (b) orientation and therefore achieve invariance to image
The initial 832 keypoints locations at maxima and rotation. This approach contrasts with the orientation
minima of the difference-of-Gaussian function. invariant descriptors of Schmid andMohr (1997), in
Keypoints are displayed as vectors indicating scale, which each image property is based on a rotationally
orientation, and location. (c) After applying a invariant measure. The disadvantage of that approach
threshold on minimum contrast, 729 keypoints is that it limits the descriptors that can be used and
remain. (d) The final 536 keypoints that remain discards image information by not requiring all
following an additional threshold on ratio of principal measures to be based on a consistent rotation.
curvatures. Following experimentation with a number of
approaches to assigning a local orientation, the
As suggested by Brown, the Hessian and following approach was found to give the most stable
derivative of D are approximated by using results. The scale of the keypoint is used to select the
differences of neighboring sample points. The Gaussian smoothed image, L, with the closest scale,
4
so that all computations are performed in a scale-
invariant manner. For each image sample, L(x, y), at
this scale, the gradient magnitude, m(x, y), and
orientation, Ө(x, y), is precomputed using pixel
differences:
2 2
√
m ( x , y )= ( L ( x +1 , y )−L ( x−1 , y ) ) + ( L ( x , y +1 )−L ( x , y−1 ) )
θ ( x , y )=tan −1 ((L ( x , y+ 1 )−L ( x , y−1 ) )/ ( L ( x +1 , y ) −L ( x−1 , y ) ))
……
(10)
An orientation histogram is formed from the gradient
orientations of sample points within a region around
the keypoint. The orientation histogram has 36 bins Figure 4: The top line in the graph shows the percent
covering the 360 degree range of orientations. Each of keypoint locations and scales that are repeatably
sample added to the histogram is weighted by its detected as a function of pixel noise. The second line
gradient magnitude and by a Gaussian-weighted shows the repeatability after also requiring agreement
circular window with a ϭ that is 1.5 times that of the in orientation. The bottom line shows the final
scale of the keypoint. percent of descriptors correctly matched to a large
Peaks in the orientation histogram database.
correspond to dominant directions of local gradients.
The highest peak in the histogram is detected, and 5. The local image descriptor
then any other local peak that is within 80% of the
highest peak is used to also create a keypoint with The previous operations have assigned an
that orientation. Therefore, for locations with image location, scale, and orientation to each
multiple peaks of similar magnitude, there will be keypoint. These parameters impose a repeatable local
multiple keypoints created at the same location and 2D coordinate system in which to describe the local
scale but different orientations. Only about 15% of image region, and therefore provide invariance to
points are assigned multiple orientations, but these these parameters. The next step is to compute a
contribute significantly to the stability of matching. descriptor for the local image region that is highly
Finally, a parabola is fit to the 3 histogram values distinctive yet is as invariant as possible to remaining
closest to each peak to interpolate the peak position variations, such as change in illumination or 3D
for better accuracy. viewpoint.
Figure 6 shows the experimental stability of One obvious approach would be to sample the local
location, scale, and orientation assignment under image intensities around the keypoint at the
differing amounts of image noise. As before the appropriate scale, and to match these using a
images are rotated and scaled by random amounts. normalized correlation measure. However, simple
The top line shows the stability of keypoint location correlation of image patches is highly sensitive to
and scale assignment. The second line shows the changes that cause misregistration of samples, such
stability of matching when the orientation assignment as affine or 3D viewpoint change or non-rigid
is also required to be within 15 degrees. As shown by deformations. A better approach has been
the gap between the top two lines, the orientation demonstrated by Edelman, Intrator, and Poggio
assignment remains accurate 95% of the time even (1997). Their proposed representation was based
after addition of ±10% pixel noise (equivalent to a upon a model of biological vision, in particular of
camera providing less than 3 bits of precision). The complex neurons in primary visual cortex. These
measured variance of orientation for the correct complex neurons respond to a gradient at a particular
matches is about 2.5 degrees, rising to 3.9 degrees for orientation and spatial frequency, but the location of
10% noise. The bottom line in Figure 4 shows the the gradient on the retina is allowed to shift over a
final accuracy of correctly matching a keypoint small receptive field rather than being precisely
descriptor to a database of 40,000 keypoints (to be localized. Edelman et al. hypothesized that the
discussed below). As this graph shows, the SIFT function of these complex neurons was to allow for
features are resistant to even large amounts of pixel matching and recognition of 3D objects from a range
noise, and the major cause of error is the initial of viewpoints. They have performed detailed
location and scale detection. experiments using 3D computer models of object and
5
animal shapes which show that matching gradients the left side of Figure 5, although, of course, the
while weight falls off smoothly. The purpose of this
allowing for shifts in their position results in much Gaussian window is to avoid sudden changes in the
better classification under 3D rotation. For example, descriptor with small changes in the position of the
recognition accuracy for 3D objects rotated in depth window, and to give less emphasis to gradients that
by 20 degrees increased from 35% for correlation of are far from the centre of the descriptor, as these are
gradients to 94% using the complex cell model. Our most affected by misregistration errors.
implementation described below was inspired by this The keypoint descriptor is shown on the
idea, but allows for positional shift using a different right side of Figure 5. It allows for significant shift in
computational mechanism. gradient positions by creating orientation histograms
over 4x4 sample regions. The figure shows eight
directions for each orientation histogram, with the
length of each arrow
corresponding to the magnitude of that histogram
entry. A gradient sample on the left can shift up to 4
sample positions while still contributing to the same
histogram on the right, thereby achieving the
objective of allowing for larger local positional shifts.
It is important to avoid all boundary affects
in which the descriptor abruptly changes as a sample
shifts smoothly from being within one histogram to
another or from one orientation to another. Therefore,
Figure 5: keypoint descriptor is created by first trilinear interpolation is used to distribute the value of
computing the gradient magnitude and orientation at each gradient sample into adjacent histogram bins. In
each image sample point in a region around the other words, each entry into a bin is multiplied by a
keypoint location, as shown on the left. These are weight of 1−d for each dimension, where d is the
weighted by a Gaussian window, indicated by the distance of the sample from the central value of the
overlaid circle. These samples are then accumulated bin as measured in units of the histogram bin spacing.
into orientation histograms summarizing the contents The descriptor is formed from a vector
over 4x4 subregions, as shown on the right, with the containing the values of all the orientation histogram
length of each arrow corresponding to the sum of the entries, corresponding to the lengths of the arrows on
gradient magnitudes near that direction within the the right side of Figure 5. The figure shows a 2x2
region. This figure shows a 2x2 descriptor array array of orientation histograms, whereas our
computed from an 8x8 set of samples, whereas the experiments below show that the best results are
experiments in this paper use 4x4 descriptors achieved with a 4x4 array of histograms with 8
computed from a 16x16 sample array. orientation bins in each. Therefore, the experiments
in this paper use a 4x4x8 = 128 element feature
5.1 Descriptor representation vector for each keypoint.
Finally, the feature vector is modified to
Figure 5 illustrates the computation of the reduce the effects of illumination change. First, the
keypoint descriptor. First the image gradient vector is normalized to unit length. A change in
magnitudes and orientations are sampled around the image contrast in which each pixel value is
keypoint location, using the scale of the keypoint to multiplied by a constant will multiply gradients by
select the level of Gaussian blur for the image. In the same constant, so this contrast
order to achieve orientation invariance, the change will be cancelled by vector normalization. A
coordinates of the descriptor and the gradient brightness change in which a constant is added to
orientations are rotated relative to the keypoint each image pixel will not affect the gradient values,
orientation. For efficiency, the gradients are as they are computed from pixel differences.
precomputed for all levels of the pyramid as Therefore, the descriptor is invariant to affine
described in Section 5. These are illustrated with changes in illumination.
small arrows at each sample location on the left side However, non-linear illumination changes can also
of Figure 5. occur due to camera saturation or due to illumination
A Gaussian weighting function with ϭ equal changes that affect 3D surfaces with differing
to one half the width of the descriptor window is used orientations by different amounts. These effects can
to assign a weight to the magnitude of each sample cause a large change in relative magnitudes for some
point. This is illustrated with a circular window on gradients, but are less likely to affect the gradient
6
orientations. Therefore, we reduce the influence of
large gradient magnitudes by thresholding the values
in the unit feature vector to each be no larger than
0.2, and then renormalizing to unit length. This
means that matching the magnitudes for large
gradients is no longer as important, and that the
distribution of orientations has greater emphasis. The
value of 0.2 was determined experimentally using
images containing differing illuminations for the
same 3D objects.
6. Matching to large databases
An important remaining issue for measuring
the distinctiveness of features is how the reliability of
matching varies as a function of the number of
features in the database being matched. Most of the
examples in this paper are generated using a database Figure 6: The dashed line shows the percent of
of 32 images keypoints correctly matched to a database as a
with about 40,000 keypoints. Figure 6 shows how function of database size (using a logarithmic scale).
the matching reliability varies as a function of The solid line shows the percent of keypoints
database size. This figure was generated using a assigned the correct location, scale, and orientation.
larger database of 112 images, with a viewpoint Images had random scale and rotation changes, an
depth rotation of 30 degrees and 2% image noise in affine transform of 30 degrees, and image noise of
addition to the usual random image rotation and scale 2% added prior to matching.
change.
The dashed line shows the portion of image 7. Conclusions
features for which the nearest neighbour in the
database was the correct match, as a function of The SIFT keypoints described in this paper
database size shown on a logarithmic scale. The are particularly useful due to their distinctiveness,
leftmost point is matching against features from only which enables the correct match for a keypoint to be
a single image while the rightmost point is selecting selected from a large database of other keypoints.
matches from a database of all features from the 112 This distinctiveness is achieved by assembling a
images. It can be seen that matching reliability does high-dimensional vector representing the image
decrease as a function of the number of distractors, gradients within a local region of the image. The
yet all indications are that many correct matches will keypoints have been shown to be invariant to image
continue to be found out to very large database sizes. rotation and scale and robust across a substantial
The solid line is the percentage of keypoints range of affine distortion, addition of noise, and
that were identified at the correct matching location change in illumination. Large numbers of keypoints
and orientation in the transformed image, so it is only can be extracted from typical images, which leads to
these points that have any chance of having matching robustness in extracting small objects among clutter.
descriptors in the database. The reason this line is flat The fact that keypoints are detected over a complete
is range of scales means that small local features are
that the test was run over the full database for each available for matching small and highly occluded
value, while only varying the portion of the database objects, while large keypoints perform well for
used for distractors. It is of interest that the gap images subject to noise and blur. Their computation
between the two lines is small, indicating that is efficient, so that several thousand keypoints can be
matching failures are due more to issues with initial extracted from a typical image with near real-time
feature localization and orientation assignment than performance on standard PC hardware.
to problems with feature distinctiveness, even out to This paper has also presented methods for
large database sizes. using the keypoints for object recognition. The
approach we have described uses approximate
nearest-neighbour lookup, a Hough transform for
identifying clusters that agree on object pose, least-
squares pose determination, and final verification.
7
Other potential applications include view matching
for 3D reconstruction, motion tracking and
segmentation, robot localization, image panorama
assembly, epipolar calibration, and any others that
require identification of matching locations between
images.
There are many directions for further
research in deriving invariant and distinctive image
features. Systematic testing is needed on data sets
with full 3D viewpoint and illumination changes. The
features described in this paper use only a
monochrome intensity image, so further
distinctiveness could be derived from including
illumination-invariant colour descriptors (Funt and
Finlayson, 1995; Brown and Lowe, 2002). Similarly,
local texture measures appear to play an important
role in human vision and could be incorporated into
feature descriptors in a more general form than the
single spatial frequency used by the current
descriptors. An attractive aspect of the invariant local
feature approach to matching is that there is no need
to select just one feature type, and the best results are
likely to be obtained by using many different
features, all of which can contribute useful matches
and improve overall robustness.
Another direction for future research will be
to individually learn features that are suited to
recognizing particular objects categories. This will be
particularly important for generic object classes that
must cover a broad range of possible appearances.
The research of Weber, Welling, and Perona (2000)
and Fergus, Perona, and Zisserman (2003) has shown
the potential of this approach by learning small sets
of local features that are suited to recognizing generic
classes of objects. In the long term, feature sets are
likely to contain both prior and learned features that
will be used according to the amount of training data
that has been available for various object classes.
8. References
[1] David G. Lowe, "Distinctive image features from
scale-invariant keypoints," International Journal of
Computer Vision, 60, 2, pp. 91-110, 2004.
[2] K. Mikolajezk and C. Schmid. An affine invariant
interest point detector, In European conference on
computer vision. Vol. 1 128-142, 2002.