UNIT 5
Motion models
• Before we register and align images, we need to establish the mathematical relationships that map
pixel coordinates from one image to another.
• A variety of such parametric motion models are possible, from simple 2D transforms, to planar perspective
models, 3D camera rotations, lens distortions, and mapping to non-planar (e.g., cylindrical) surfaces.
Planar perspective motion
• The simplest motion model to use when aligning images is to translate and rotate them in 2D (Figure 9.2a).
• This is the same kind of motion that you would use if you had overlapping photographic prints.
• It is also the kind of technique favored by David Hockney to create the collages that he calls joiners.
• Creating such collages, which show visible seams and inconsistencies that add to the artistic effect, is popular on websites such as Flickr, where they more commonly go under the name panography.
Figure 9.2: Two-dimensional motion models and how they can be used for image stitching.
• Translation and rotation are also usually adequate
motion models to compensate for small camera
motions in applications such as photo and video
stabilization and merging.
• We saw how the mapping between two cameras viewing a common plane can be described using a 3×3 homography.
• Consider the matrix $\tilde{M}_{10}$ that arises when mapping a pixel in one image to a 3D point and then back onto a second image:
$$\tilde{x}_1 \sim \tilde{P}_1 \tilde{P}_0^{-1} \tilde{x}_0 = \tilde{M}_{10} \tilde{x}_0.$$
• When the last row of the $P_0$ matrix is replaced with a plane equation and points are assumed to lie on this plane, i.e., their disparity is $d_0 = 0$, we can ignore the last column of $\tilde{M}_{10}$ and also its last row, since we do not care about the final z-buffer depth.
• The resulting homography matrix $\tilde{H}_{10}$ (the upper-left 3×3 sub-matrix of $\tilde{M}_{10}$) describes the mapping between pixels in the two images:
$$\tilde{x}_1 \sim \tilde{H}_{10} \tilde{x}_0.$$
• This observation formed the basis of some of the earliest automated image stitching algorithms.
• Because reliable feature matching techniques had not yet been developed, these algorithms used direct pixel value matching, i.e., direct parametric motion estimation.
• More recent stitching algorithms first extract features and then match them up, often using robust
techniques such as RANSAC to compute a good set of inliers.
• The final computation of the homography (9.2), i.e., the solution of the least-squares fitting problem given pairs of corresponding features, uses iterative least squares; a minimal sketch of this feature-based pipeline follows.
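• The sketch below shows this pipeline using OpenCV, which provides both feature matching and RANSAC-based homography estimation; the image filenames and the reprojection threshold are illustrative assumptions, not values from the text.

```python
# A minimal sketch of feature-based homography estimation, assuming OpenCV
# (cv2) and two overlapping images; "img0.jpg" / "img1.jpg" are hypothetical.
import cv2
import numpy as np

img0 = cv2.imread("img0.jpg", cv2.IMREAD_GRAYSCALE)
img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)

# Extract SIFT features and match descriptors between the two images.
sift = cv2.SIFT_create()
kp0, des0 = sift.detectAndCompute(img0, None)
kp1, des1 = sift.detectAndCompute(img1, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des0, des1)

pts0 = np.float32([kp0[m.queryIdx].pt for m in matches])
pts1 = np.float32([kp1[m.trainIdx].pt for m in matches])

# RANSAC selects a good set of inliers; the homography itself is then
# refined by (iterative) least squares over those inliers.
H, inlier_mask = cv2.findHomography(pts0, pts1, cv2.RANSAC, 3.0)
```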
• Application: Whiteboard and document scanning
Rotational panoramas
• The most typical case for panoramic image stitching is when the camera undergoes a pure rotation.
• Think of standing at the rim of the Grand Canyon.
• Relative to the distant geometry in the scene, as you snap away, the camera is undergoing a pure rotation,
which is equivalent to assuming that all points are extremely far from the camera, i.e., on the plane at
infinity.
• Setting $t_0 = t_1 = 0$, we get the simplified 3×3 homography
$$\tilde{H}_{10} = K_1 R_1 R_0^{-1} K_0^{-1} = K_1 R_{10} K_0^{-1},$$
• where $K_k = \mathrm{diag}(f_k, f_k, 1)$ is the simplified camera intrinsic matrix, assuming that $c_x = c_y = 0$, i.e., we are indexing the pixels starting from the optical center (Szeliski 1996).
• This can also be re-written as
$$\begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix} \sim \begin{bmatrix} f_1 & & \\ & f_1 & \\ & & 1 \end{bmatrix} R_{10} \begin{bmatrix} f_0 & & \\ & f_0 & \\ & & 1 \end{bmatrix}^{-1} \begin{bmatrix} x_0 \\ y_0 \\ 1 \end{bmatrix},$$
• which reveals the simplicity of the mapping equations and makes all the motion parameters explicit (a short numerical sketch appears after Figure 9.4 below).
• Instead of the general eight-parameter homography relating a pair of images, we get the three-, four-, or five-parameter 3D rotation motion models corresponding to the cases where the focal length f is known, fixed, or variable.
• Estimating the 3D rotation matrix (and, optionally, focal length) associated with each image is intrinsically more stable than estimating a homography with a full eight degrees of freedom, which makes this the method of choice for large-scale image stitching algorithms.
• Figure 9.4 shows the alignment of four images under the 3D rotation
motion model.
Figure 9.4: Four images taken with a hand-held camera registered using a 3D rotation motion model (Szeliski and Shum 1997) © 1997 ACM. Notice how the homographies, rather than being arbitrary, have a well-defined keystone shape whose width increases away from the origin.
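• To make the rotational model concrete, the following numpy sketch builds the homography $\tilde{H}_{10} = K_1 R_{10} K_0^{-1}$ from an assumed relative rotation and focal lengths; the sample values are illustrative only.

```python
# A minimal sketch of the rotational homography H10 = K1 R10 K0^{-1},
# assuming known focal lengths f0, f1 (in pixels) and a relative rotation R10.
import numpy as np

def rotational_homography(f0, f1, R10):
    """Map pixels (indexed from the optical center) in image 0 to image 1."""
    K0 = np.diag([f0, f0, 1.0])   # simplified intrinsics, cx = cy = 0
    K1 = np.diag([f1, f1, 1.0])
    return K1 @ R10 @ np.linalg.inv(K0)

# Example: a 5-degree pan about the vertical (y) axis with f0 = f1 = 500.
t = np.radians(5.0)
R10 = np.array([[np.cos(t), 0, np.sin(t)],
                [0,         1, 0        ],
                [-np.sin(t), 0, np.cos(t)]])
x1 = rotational_homography(500.0, 500.0, R10) @ np.array([100.0, 50.0, 1.0])
x1 /= x1[2]                       # perspective divide back to pixel coordinates
```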
Gap closing
• These techniques can be used to estimate a series of rotation matrices and focal lengths, which can be chained together to create large panoramas.
• Unfortunately, because of accumulated errors, this approach will rarely produce a closed 360° panorama. Instead, there will invariably be either a gap or an overlap.
• We can solve this problem by matching the first image in the sequence with the last one.
• The difference between the two rotation matrix estimates associated with the repeated first image indicates the amount of misregistration.
• This error can be distributed evenly across the whole sequence by taking the quotient of the two
quaternions associated with these rotations and dividing this “error quaternion” by the number of images
in the sequence (assuming relatively constant inter-frame rotations).
• We can also update the estimated focal length based on the amount of misregistration.
• To do this, we first convert the error quaternion into a gap angle, $\theta_g$, and then update the focal length using the equation
$$f' = f\,(1 - \theta_g / 360^\circ).$$
• Figure 9.5a shows the end of a registered image sequence and the first image. There is a big gap between the last image and the first, which are in fact the same image.
• Figure 9.5b shows the registration after closing the gap with the correct focal length (f = 468).
• Notice that both mosaics show very little visual misregistration (except at the gap), yet Figure 9.5a has been computed using a focal length that has a 9% error.
• Related approaches have been developed by Hartley (1994b), McMillan and Bishop (1995), Stein (1995),
and Kang and Weiss (1997) to solve the focal length estimation problem using pure panning motion and
cylindrical images.
Figure 9.5: Gap closing (Szeliski and Shum 1997) © 1997 ACM: (a) a gap is visible when the focal length is wrong (f = 510); (b) no gap is visible for the correct focal length (f = 468).
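• A hedged sketch of this gap-closing step is given below, using scipy's Rotation class; the interpolation scheme (applying a growing fraction of the inverse error to successive frames) is one reasonable reading of "distributing the error evenly", not the authors' exact implementation.

```python
# Gap closing sketch: rotations[0] and rotations[-1] are the two estimates of
# the repeated first image, so ideally rotations[-1] == rotations[0].
import numpy as np
from scipy.spatial.transform import Rotation

def close_gap(rotations, f):
    n = len(rotations) - 1
    # "Error rotation" between the two estimates of the repeated first image.
    error = rotations[0].inv() * rotations[-1]
    theta_g = np.degrees(error.magnitude())        # gap angle in degrees
    # Distribute the error: frame k receives k/n of the inverse error.
    err_rotvec = error.as_rotvec()
    for k in range(n + 1):
        rotations[k] = rotations[k] * Rotation.from_rotvec(-(k / n) * err_rotvec)
    # Update the focal length from the gap angle, per the equation above.
    f_new = f * (1.0 - theta_g / 360.0)
    return rotations, f_new
```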
Cylindrical and spherical coordinates
• An alternative to using homographies or 3D motions to align images is to first warp the images into
cylindrical coordinates and then use a pure translational model to align them (Chen 1995; Szeliski 1996).
• Unfortunately, this only works if the images are all taken with a level camera or with a known tilt angle.
• Assume that the camera is in its canonical position, i.e., its rotation matrix is the identity, R = I, so that
the optical axis is aligned with the z axis and the y axis is aligned vertically.
• Cylindrical image stitching algorithms are most often used when the camera is known to be level and only rotating around its vertical axis.
• Under these conditions, images at different rotations are related by a pure horizontal translation.
• This makes it attractive as an initial class project in an introductory computer vision course, since the full
complexity of the perspective alignment algorithm can be avoided.
• Figure 9.8 shows how two cylindrically warped images from a leveled rotational panorama are related by a pure translation (the warping equations are sketched below).
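• A minimal sketch of the cylindrical warp follows, assuming the focal length f is known in pixels and that (x, y) are measured from the image center; these are the standard mapping equations, $x' = f\theta$ and $y' = fh$.

```python
# Cylindrical warping sketch: map image coordinates (x, y), measured from the
# optical center, onto a cylinder of radius f. Under a level, purely panning
# camera, two warped images are then related by a pure horizontal translation.
import numpy as np

def cylindrical_coords(x, y, f):
    theta = np.arctan2(x, f)           # panning angle around the cylinder
    h = y / np.sqrt(x**2 + f**2)       # height on the cylinder surface
    return f * theta, f * h            # scaled so central pixels stay unscaled
```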
• Professional panoramic photographers often use pan-tilt heads that make it easy to control the tilt and to
stop at specific detents in the rotation angle.
• Motorized rotation heads are also sometimes used for the acquisition of larger panoramas.
Figure 9.8: A cylindrical panorama (Szeliski and Shum 1997) © 1997 ACM: (a) two cylindrically warped images related by a horizontal translation; (b) part of a cylindrical panorama composited from a sequence of images.
• Not only do such heads ensure a uniform coverage of the visual field with a desired amount of image overlap but they also make it possible to stitch the images using cylindrical or spherical coordinates and pure translations.
• In this case, pixel coordinates $(x, y, f)$ must first be rotated using the known tilt and panning angles before being projected into cylindrical or spherical coordinates (Chen 1995); a short sketch of this step follows the figure caption below.
Figure 9.9: A spherical panorama constructed from 54 photographs (Szeliski and Shum 1997).
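• A sketch of this rotate-then-project step, assuming the pan-tilt rotation R is known:

```python
# Spherical projection sketch: rotate the viewing ray (x, y, f) by a known
# pan-tilt rotation R, then map it to (scaled) longitude/latitude coordinates.
import numpy as np

def spherical_coords(x, y, f, R):
    ray = R @ np.array([x, y, f], dtype=float)            # rotated viewing ray
    theta = np.arctan2(ray[0], ray[2])                    # longitude
    phi = np.arctan2(ray[1], np.hypot(ray[0], ray[2]))    # latitude
    return f * theta, f * phi
```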
• Having a roughly known panning angle also makes it easier to compute the alignment, since the rough
relative positioning of all the input images is known ahead of time, enabling a reduced search range for
alignment.
• Figure 9.9 shows a full 3D rotational panorama unwrapped onto the surface of a sphere.
• One final coordinate mapping worth mentioning is the polar mapping, where the north pole lies along the optical axis rather than the vertical axis.
Bundle Adjustment
• One way to register many images is to add new images to the panorama one at a time, aligning the most
recent image with the previous ones already in the collection and discovering, if necessary, which images it
overlaps.
• In the case of 360° panoramas, accumulated error may lead to the presence of a gap (or excessive overlap) between the two ends of the panorama, which can be fixed by stretching the alignment of all the images using a process called gap closing.
• However, a better alternative is to simultaneously align all the images using a least-squares framework to
correctly distribute any mis-registration errors.
• The process of simultaneously adjusting pose parameters for a large collection of overlapping
images is called bundle adjustment in the photogrammetry community.
• In computer vision, it was first applied to the general structure from motion problem and then later
specialized for panoramic image stitching.
• In the rotational panorama setting, each feature $i$ corresponds to a ray $p_i$, which projects into frame $j$ as $\tilde{x}_{ij} \sim K_j R_j p_i$, where $K_j = \mathrm{diag}(f_j, f_j, 1)$ is the simplified form of the calibration matrix.
• The motion mapping a point $x_{ij}$ from frame $j$ into a point $x_{ik}$ in frame $k$ is similarly given by
$$\tilde{x}_{ik} \sim \tilde{H}_{kj}\tilde{x}_{ij} = K_k R_k R_j^{-1} K_j^{-1} \tilde{x}_{ij}.$$
• An all-pairs 2D error can then be minimized over all the camera rotations and focal lengths,
$$E_{\text{all-pairs-2D}} = \sum_i \sum_{jk} \left\| \tilde{x}_{ik}(\hat{x}_{ij}) - \hat{x}_{ik} \right\|^2,$$
• where
– the $\tilde{x}_{ik}$ function is the predicted location of feature $i$ in frame $k$ given by (9.27),
– $\hat{x}_{ij}$ is the observed location, and the "2D" in the subscript indicates that an image-plane error is being minimized.
• Note that since $\tilde{x}_{ik}$ depends on the observed value $\hat{x}_{ij}$, we have an errors-in-variables problem, which in principle requires more sophisticated techniques than least squares to solve.
• However, in practice, if we have enough features, we can directly minimize the above quantity using regular non-linear least squares and obtain an accurate multi-frame alignment (a sketch follows at the end of this subsection).
• While this approach works well in practice, it suffers from two potential disadvantages.
• First, since a summation is taken over all pairs with corresponding features, features that are observed
many times are overweighted in the final solution.
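• The sketch below casts this all-pairs minimization as a regular non-linear least squares problem with scipy; the parameterization (per-camera rotation vectors plus focal lengths) and the `pairs` data structure are assumptions for illustration.

```python
# Pairwise bundle adjustment sketch: each camera j has a rotation vector
# rvecs[j] and focal length focals[j]; `pairs` lists (j, k, xj, xk) matched
# feature locations between frames j and k (pixels from the optical center).
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, pairs, n_cams):
    rvecs = params[:3 * n_cams].reshape(n_cams, 3)
    focals = params[3 * n_cams:]
    res = []
    for j, k, xj, xk in pairs:
        Kj = np.diag([focals[j], focals[j], 1.0])
        Kk = np.diag([focals[k], focals[k], 1.0])
        Rj = Rotation.from_rotvec(rvecs[j]).as_matrix()
        Rk = Rotation.from_rotvec(rvecs[k]).as_matrix()
        H = Kk @ Rk @ Rj.T @ np.linalg.inv(Kj)      # frame-j-to-frame-k mapping
        p = H @ np.array([xj[0], xj[1], 1.0])
        res.extend(p[:2] / p[2] - np.asarray(xk))   # 2D image-plane error
    return np.asarray(res)

# Typical usage (x0 stacks initial rotation vectors and focal lengths):
# result = least_squares(residuals, x0, args=(pairs, n_cams))
```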
Parallax removal
• Once we have optimized the global orientations and focal lengths of our cameras, we may find that the
images are still not perfectly aligned,
• i.e., the resulting stitched image looks blurry or ghosted in some places.
• This can be caused by a variety of factors, including unmodeled radial distortion, 3D parallax (failure to
rotate the camera around its optical center), small scene motions such as waving tree branches, and large-
scale scene motions such as people moving in and out of pictures.
• Each of these problems can be treated with a different approach.
• Radial distortion can be estimated (potentially ahead of time) using one of the techniques discussed earlier.
• For example, the plumb-line method adjusts radial distortion parameters until slightly curved lines become straight, while mosaic-based approaches adjust them until mis-registration is reduced in image overlap areas.
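• For reference, a minimal sketch of the low-order radial distortion model that such methods typically adjust (the quartic polynomial form is a common choice, assumed here):

```python
# Radial distortion sketch: normalized image coordinates are scaled by
# (1 + k1*r^2 + k2*r^4); plumb-line methods search for k1, k2 that make
# slightly curved images of straight lines come out straight.
import numpy as np

def radial_distort(xn, yn, k1, k2):
    r2 = xn**2 + yn**2
    scale = 1.0 + k1 * r2 + k2 * r2**2
    return xn * scale, yn * scale
```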
• To get a reasonably dense set of features to interpolate, Shum and Szeliski (2000) place a feature point at the center of each patch (the patch size controls the smoothness in the local alignment stage), rather than relying on features extracted using an interest operator.
• An alternative approach to motion-based de-ghosting was proposed by Kang, Uyttendaele, Winder et al.
(2003), who estimate dense optical flow between each input image and a central reference image.
• The accuracy of the flow vector is checked using a photo-consistency measure before a given warped pixel
is considered valid and is used to compute a high dynamic range radiance estimate, which is the goal of
their overall algorithm.
• The requirement for a reference image makes their approach less applicable to general image mosaicing,
although an extension to this case could certainly be envisaged.
Recognizing panoramas
• The final piece needed to perform fully automated image stitching is a technique to recognize which images
go together, which Brown and Lowe (2007) call recognizing panoramas.
• If the user takes images in sequence so that each image overlaps its predecessor and also specifies the first
and last images to be stitched, bundle adjustment combined with the process of topology inference can be
used to automatically assemble a panorama (Sawhney and Kumar 1999).
• However, users often jump around when taking panoramas,
• e.g., they may start a new row on top of a previous one, jump back to take a repeat shot, or create 360° panoramas where end-to-end overlaps need to be discovered.
• Furthermore, the ability to discover multiple panoramas taken by a user over an extended period can be a
big convenience.
• To recognize panoramas, Brown and Lowe (2007) first find all pairwise image overlaps using a feature-based method and then find connected components in the overlap graph to "recognize" individual panoramas (Figure 9.11); a sketch of this grouping step appears below.
• The feature-based matching stage first extracts scale invariant feature transform (SIFT) feature locations and feature descriptors (Lowe 2004) from all the input images and places them in an indexing structure.
Figure 9.11: Recognizing panoramas (Brown, Szeliski, and Winder 2005), figures courtesy of Matthew Brown: (a) input images with pairwise matches; (b) images grouped into connected components (panoramas); (c) individual panoramas registered and blended into stitched composites.
• For each image pair under consideration, the
nearest matching neighbor is found for each
feature in the first image, using the indexing
structure to rapidly find candidates and then
comparing feature descriptors to find the best
match.
• RANSAC is used to find a set of inlier matches; pairs of
matches are used to hypothesize similarity motion models
that are then used to count the number of inliers.
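• Once pairwise matches have been verified this way, grouping images into panoramas reduces to finding connected components, as in the self-contained union-find sketch below (the details are an illustration, not Brown and Lowe's exact code).

```python
# Connected-components sketch: verified_pairs lists (i, j) image pairs that
# passed the RANSAC inlier test; each resulting group is one panorama.
def find_panoramas(n_images, verified_pairs):
    parent = list(range(n_images))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for i, j in verified_pairs:
        parent[find(i)] = find(j)           # union the two components

    groups = {}
    for i in range(n_images):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())            # one list of image ids per panorama
```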
• In practice, the most difficult part of getting a fully
automated stitching algorithm to work is deciding which
pairs of images correspond to the same parts of the scene.
• Repeated structures such as windows (Figure 9.12) can lead
to false matches when using a feature-based approach.
• One way to mitigate this problem is to perform a direct
pixel-based comparison between the registered images to
determine if they are different views of the same scene.
• Unfortunately, this heuristic may fail if there are moving objects in the scene.
Figure 9.12: Matching errors (Brown, Szeliski, and Winder 2004): accidental matching of several features can lead to matches between pairs of images that do not actually overlap. Validation of image matches by direct pixel error comparison can fail when the scene contains moving objects (Uyttendaele, Eden, and Szeliski 2001) © 2001 IEEE.
Compositing
• Once we have registered all of the input images
with respect to each other, we need to decide how to
produce the final stitched mosaic image.
• This involves selecting a final compositing surface
(flat, cylindrical, spherical, etc.) and view (reference
image).
• It also involves selecting which pixels contribute to the final composite and how to optimally blend
these pixels to minimize visible seams, blur, and ghosting.
Choosing a compositing surface
• This step is also known as compositing surface parameterization.
• The first choice to be made is how to represent the final image.
• If only a few images are stitched together, a natural approach is to select one of the images as the
reference and to then warp all the other images into its reference coordinate system.
• The resulting composite is sometimes called a flat panorama, since the projection onto the final surface is still a perspective projection, and hence straight lines remain straight.
• Cartographers have also developed several alternative methods for representing the globe.
• The choice of parameterization is somewhat application dependent and involves a tradeoff between keeping
the local appearance undistorted (e.g., keeping straight lines straight) and providing a reasonably uniform
sampling of the environment.
• Automatically making this selection and smoothly transitioning between representations based on the
extent of the panorama is an active area of current research.
View selection
• Once we have chosen the output parameterization, we still need to determine which part of the scene will
be centered in the final view.
• As mentioned above, for a flat composite, we can choose one of the images as a reference.
• Often, a reasonable choice is the one that is geometrically most central.
• For example, for rotational panoramas represented as a collection of 3D rotation matrices, we can choose the image whose z-axis is closest to the average z-axis (assuming a reasonable field of view), as in the sketch below.
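• A minimal sketch of this selection rule, assuming each camera's rotation matrix maps world to camera coordinates:

```python
# Reference-view selection sketch: pick the image whose optical (z) axis is
# closest to the average z axis, i.e., has the largest dot product with it.
import numpy as np

def pick_reference(rotations):
    z_axes = np.array([R.T @ np.array([0.0, 0.0, 1.0]) for R in rotations])
    mean_z = z_axes.mean(axis=0)
    mean_z /= np.linalg.norm(mean_z)
    return int(np.argmax(z_axes @ mean_z))  # smallest angle to the mean axis
```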
Coordinate transformations
• After selecting the parameterization and reference view, we still need to compute the mappings between the input and output pixel coordinates.
• If the final compositing surface is flat (e.g., a single plane or the face of a cube map) and the input images have no radial distortion, the coordinate transformation is a simple homography.
• This kind of warping can be performed in graphics hardware by appropriately setting texture mapping
coordinates and rendering a single quadrilateral.
• If the final composite surface has some other analytic form (e.g., cylindrical or spherical), we need to convert every pixel in the final panorama into a viewing ray (3D point) and then map it back into each image according to the projection (and optionally radial distortion) equations.
• This process can be made more efficient by precomputing some lookup tables, e.g., the partial trigonometric functions needed to map cylindrical or spherical coordinates to 3D coordinates or the radial distortion field at each pixel; a sketch of this inverse mapping follows.
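• A hedged sketch of this inverse mapping for a cylindrical composite is given below; radial distortion is omitted, and the intrinsics K and rotation R of the source image are assumed known.

```python
# Inverse mapping sketch: convert an output (cylindrical) pixel into a viewing
# ray, then project it into a source image; in a real implementation the
# trigonometric terms would be precomputed into lookup tables per scanline.
import numpy as np

def cylinder_to_image(xc, yc, f_pano, K, R):
    theta = xc / f_pano
    h = yc / f_pano
    ray = np.array([np.sin(theta), h, np.cos(theta)])   # ray on the cylinder
    p = K @ (R @ ray)                                   # project into the image
    return p[:2] / p[2]                                 # source pixel coordinates
```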