THE GEOMETRY OF MULTIPLE VIEWS
Introduction:
In computer vision, analyzing multiple views of the same scene provides crucial information about the
3D structure of the scene and the relative motion between the camera and the scene.
The geometric relationship between multiple images taken from different viewpoints is a key area of
study.
Two-View Geometry:
Epipolar Geometry:
Definition:
Epipolar geometry describes the intrinsic projective geometry between two views of a scene.
It is the basis for many stereo vision algorithms and involves concepts like the epipolar plane, epipolar
lines, and epipoles.
Epipolar Plane:
The plane that contains the two camera centers and a given 3D point in the scene.
Epipolar Line:
The intersection of the epipolar plane with the image plane.
For a point in one image, its corresponding point in the other image must lie on the corresponding
epipolar line.
Epipole:
The point where the line connecting the two camera centers (the baseline) intersects the image plane; all epipolar lines in an image pass through its epipole.
Fundamental Matrix (F):
The fundamental matrix relates corresponding points in stereo images.
It encodes the epipolar geometry between two views through the constraint x'^T F x = 0, where x and x' are corresponding points (in homogeneous pixel coordinates) in the two images.
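A minimal sketch of estimating F in OpenCV, assuming pts1 and pts2 already hold matched pixel coordinates (the values below are purely illustrative); RANSAC needs at least eight correspondences:

import numpy as np
import cv2

# Assumed matched pixel coordinates, one row per correspondence (illustrative values)
pts1 = np.float32([[100, 120], [250, 90], [310, 200], [80, 310],
                   [400, 250], [150, 400], [350, 60], [220, 330]])
pts2 = np.float32([[110, 118], [262, 92], [322, 205], [88, 305],
                   [415, 248], [158, 395], [360, 64], [230, 328]])

# Estimate the fundamental matrix with RANSAC to reject outlier matches
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)

# Epipolar constraint check: x'^T F x should be close to 0 for inlier matches
x = np.append(pts1[0], 1.0)    # homogeneous point in image 1
xp = np.append(pts2[0], 1.0)   # corresponding homogeneous point in image 2
print("x'^T F x =", xp @ F @ x)

# Epipolar lines in image 2 associated with the points in image 1
lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F)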
Essential Matrix (E):
Definition:
The essential matrix is a special case of the fundamental matrix when the cameras are calibrated.
Computation:
The essential matrix can be computed from the fundamental matrix using the camera intrinsic parameters: E = K'^T F K, where K and K' are the intrinsic parameter matrices of the two cameras.
Decomposition:
The essential matrix can be decomposed to obtain the relative rotation and translation between the two
camera views.
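A hedged sketch of the calibrated case: E can be formed from F via E = K'^T F K, or estimated directly from matched points, and then decomposed into a relative rotation and translation. The intrinsic matrix and point correspondences below are assumed example values; cv2.recoverPose resolves the fourfold decomposition ambiguity by checking which solution places points in front of both cameras.

import numpy as np
import cv2

# Assumed intrinsics shared by both cameras (example values)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])

# Assumed matched pixel coordinates (illustrative values)
pts1 = np.float32([[100, 120], [250, 90], [310, 200], [80, 310],
                   [400, 250], [150, 400], [350, 60], [220, 330]])
pts2 = np.float32([[110, 118], [262, 92], [322, 205], [88, 305],
                   [415, 248], [158, 395], [360, 64], [230, 328]])

# Option 1: from a previously estimated fundamental matrix F:  E = K.T @ F @ K
# Option 2: estimate E directly from the calibrated correspondences
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)

# Decompose E into the relative pose; recoverPose picks the physically valid (R, t)
_, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K)
print("Relative rotation:\n", R)
print("Translation direction (up to scale):\n", t.ravel())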
Camera Calibration:
Intrinsic Parameters:
Intrinsic parameters define the internal characteristics of the camera, such as focal length, principal
point, and skew.
Extrinsic Parameters:
Extrinsic parameters define the camera's position and orientation in the world coordinate system.
Calibration Process:
The process of determining the intrinsic and extrinsic parameters of the camera.
Typically involves taking images of a known calibration object (e.g., a checkerboard) and applying
techniques like Zhang's method to estimate the parameters.
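A condensed sketch of checkerboard calibration with OpenCV's calibrateCamera (which implements a Zhang-style method); the image directory, board dimensions, and square size are assumptions.

import glob
import numpy as np
import cv2

board_cols, board_rows = 9, 6   # assumed number of inner corners on the checkerboard
square_size = 0.025             # assumed square size in metres

# 3D coordinates of the board corners in the board's own frame (Z = 0 plane)
objp = np.zeros((board_rows * board_cols, 3), np.float32)
objp[:, :2] = np.mgrid[0:board_cols, 0:board_rows].T.reshape(-1, 2) * square_size

objpoints, imgpoints = [], []
for fname in glob.glob("calib_images/*.png"):   # assumed image directory
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, (board_cols, board_rows))
    if found:
        # Refine corner locations to sub-pixel accuracy
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        objpoints.append(objp)
        imgpoints.append(corners)

# Intrinsics K, distortion coefficients, and per-image extrinsics (rvecs, tvecs)
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)
print("Reprojection RMS error:", rms, "\nIntrinsic matrix:\n", K)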
Rectification:
Definition:
Rectification is the process of transforming stereo images so that corresponding points are aligned
horizontally.
Importance:
Rectification simplifies the stereo matching problem by reducing the search for corresponding points
to a single dimension (horizontal line).
Techniques:
Several techniques are used to rectify images, often involving homography transformations or epipolar
line correction.
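A sketch of stereo rectification in OpenCV, assuming calibrated intrinsics and a known relative pose between the cameras (all placeholder values below are assumptions): the rectifying rotations and new projection matrices are computed once and then applied to each image with a pixel remap.

import numpy as np
import cv2

# Assumed calibration results (placeholder values for illustration)
K1 = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])
dist1 = np.zeros(5)
K2 = K1.copy()
dist2 = np.zeros(5)
R = np.eye(3)                          # relative rotation between the cameras
T = np.array([[-0.1], [0.0], [0.0]])   # relative translation (10 cm horizontal baseline)
w, h = 640, 480
left_raw = np.zeros((h, w, 3), np.uint8)    # stand-ins for the captured stereo pair
right_raw = np.zeros((h, w, 3), np.uint8)

# Rectifying rotations (R1, R2) and new projection matrices (P1, P2)
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, dist1, K2, dist2, (w, h), R, T)

# Pixel remapping tables that implement the rectifying transforms
map1x, map1y = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, (w, h), cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, (w, h), cv2.CV_32FC1)

# After remapping, corresponding points share the same image row
left_rect = cv2.remap(left_raw, map1x, map1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_raw, map2x, map2y, cv2.INTER_LINEAR)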
Stereo Vision:
Definition:
Stereo vision involves extracting 3D information from two or more images taken from different
viewpoints.
By analyzing the disparity between corresponding points in the images, depth information can be
recovered.
Disparity Map:
The disparity map is a representation of the difference in pixel positions between corresponding points
in the stereo images.
The disparity is inversely proportional to the depth of the points in the scene.
Depth Estimation:
Using the disparity map, the depth Z can be estimated as Z = (f · B) / d, where f is the focal length, B is the baseline (the distance between the camera centers), and d is the disparity.
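A sketch of computing a disparity map with semi-global block matching and converting it to depth via Z = f · B / d; the image files, focal length, and baseline are assumptions, and OpenCV returns disparities as fixed-point values scaled by 16.

import numpy as np
import cv2

# Assumed rectified grayscale stereo pair
left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities must be a multiple of 16
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # undo the x16 scaling

# Depth from disparity: Z = f * B / d (assumed focal length in pixels, baseline in metres)
f_px = 800.0
baseline_m = 0.10
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f_px * baseline_m / disparity[valid]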
Stereopsis and Reconstruction:
3D Reconstruction:
Stereopsis refers to the process of reconstructing the 3D structure of a scene from two or more 2D
images.
The depth information is recovered by triangulating the corresponding points from the different views.
Triangulation:
Triangulation is the process of determining the 3D coordinates of a point by intersecting the lines of
sight from multiple camera viewpoints.
Reconstruction Pipeline:
The general pipeline for 3D reconstruction includes:
Feature detection and matching.
Estimation of camera parameters.
Computation of the 3D coordinates using triangulation.
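A small sketch of the triangulation step with OpenCV: given the two camera projection matrices and matched image points, cv2.triangulatePoints returns homogeneous 3D coordinates that are then dehomogenised. The projection matrices and point pairs below are illustrative assumptions (the second camera is assumed translated 0.1 m along the x axis).

import numpy as np
import cv2

K = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])

# Projection matrices P = K [R | t] for the two (assumed) camera poses
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# Matched image points as 2xN arrays (x coordinates in row 0, y coordinates in row 1)
pts1 = np.array([[320.0, 480.0], [440.0, 160.0]])
pts2 = np.array([[280.0, 400.0], [440.0, 160.0]])

# Linear triangulation returns homogeneous 4xN points
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
X = (X_h[:3] / X_h[3]).T   # dehomogenise to Nx3 Euclidean coordinates
print("Triangulated 3D points:\n", X)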
Human Stereopsis:
Biological Inspiration:
Human stereopsis is the biological equivalent of computer stereo vision, where the brain combines the
images from the two eyes to perceive depth.
The disparity between the two retinal images is used by the brain to infer the relative depth of objects.
Binocular Fusion:
Definition:
Binocular fusion is the process by which the brain combines the two slightly different images from the
eyes into a single 3D perception.
Binocular Disparity:
The small difference between the images seen by the left and right eyes is known as binocular disparity.
The brain uses this disparity to compute the depth of objects.
Using More Cameras:
Multi-view Stereo:
In some cases, using more than two cameras provides additional views that can improve the accuracy
and robustness of 3D reconstruction.
Multi-view stereo techniques extend the principles of stereo vision to multiple cameras.
Applications:
Multi-view stereo is widely used in applications like 3D modeling, virtual reality, and autonomous
navigation.
SEGMENTATION BY CLUSTERING
Segmentation:
Definition:
Segmentation is the process of partitioning an image into distinct regions, typically corresponding to
different objects or surfaces.
Goal:
The goal of segmentation is to simplify the representation of an image, making it easier to analyze and
understand.
Human Vision: Grouping and Gestalt:
Gestalt Principles:
Gestalt psychology suggests that the human visual system tends to group elements based on certain
principles, such as proximity, similarity, and continuity.
These principles are often used in computer vision to design algorithms that mimic human perception
for image segmentation.
Segmentation Techniques:
Clustering-Based Segmentation:
Clustering is a common approach to segmentation, where pixels are grouped based on their similarity
in terms of color, intensity, texture, or spatial location.
K-means Clustering:
An unsupervised algorithm that partitions the image into k clusters, where each pixel is assigned to the cluster with the nearest centroid.
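A minimal k-means segmentation sketch using OpenCV: pixels are clustered by colour and each pixel is repainted with its cluster centroid. The image path and the value of k are assumptions.

import numpy as np
import cv2

img = cv2.imread("scene.png")                     # assumed input image (BGR)
pixels = img.reshape(-1, 3).astype(np.float32)    # one row per pixel, features = colour

k = 4                                             # assumed number of clusters
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Replace each pixel with its cluster centroid to visualise the segmentation
segmented = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
cv2.imwrite("segmented_kmeans.png", segmented)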
Mean Shift Clustering:
A non-parametric clustering technique that iteratively shifts each point toward the mean of the points in its neighborhood until it converges on a local density maximum (mode); pixels that converge to the same mode form one cluster.
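OpenCV's pyrMeanShiftFiltering gives one sketch of mean-shift-style segmentation: each pixel is driven toward a mode in the joint spatial-colour space, so regions converge to flat colours. The image path and the window radii are assumed values.

import cv2

img = cv2.imread("scene.png")   # assumed input image (8-bit BGR)

# sp = spatial window radius, sr = colour window radius (assumed values)
shifted = cv2.pyrMeanShiftFiltering(img, sp=21, sr=30)
cv2.imwrite("segmented_meanshift.png", shifted)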
Graph-Based Segmentation:
Involves representing the image as a graph, where pixels are nodes, and edges represent the similarity
between pixels.
Segmentation is performed by finding cuts in the graph that minimize the similarity (total edge weight) between different regions while keeping the similarity within each region high, as in normalized cuts.
Spectral Clustering:
Uses the eigenvectors of a similarity (affinity) matrix, or of its graph Laplacian, to embed the data in a lower-dimensional space, then applies a standard clustering algorithm such as k-means in that space.
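A hedged sketch of graph-based spectral segmentation with scikit-learn: the image is converted into a pixel-affinity graph and partitioned using the eigenvectors of its Laplacian. The small synthetic image and the number of regions are assumptions, since the affinity matrix becomes very large for full-resolution images.

import numpy as np
from sklearn.feature_extraction.image import img_to_graph
from sklearn.cluster import spectral_clustering

# Small synthetic grayscale image: a bright square on a dark background
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
img += 0.05 * np.random.default_rng(0).standard_normal(img.shape)

# Build a pixel-adjacency graph whose edge weights reflect intensity similarity
graph = img_to_graph(img)
graph.data = np.exp(-graph.data ** 2 / (2 * 0.1 ** 2))  # Gaussian affinity on intensity differences

# Partition the graph using the eigenvectors of its Laplacian (2 assumed regions)
labels = spectral_clustering(graph, n_clusters=2, eigen_solver="arpack", random_state=0)
segmentation = labels.reshape(img.shape)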
Applications:
Shot Boundary Detection:
Segmentation is used to detect transitions between different shots in a video by identifying significant
changes in the scene content.
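One simple, hedged sketch of shot-boundary detection: compare colour histograms of consecutive frames and flag a boundary when their similarity drops sharply. The video filename and the threshold are assumptions.

import cv2

cap = cv2.VideoCapture("movie.mp4")   # assumed input video
prev_hist = None
threshold = 0.5                       # assumed correlation-drop threshold
frame_idx = 0
boundaries = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist)
    if prev_hist is not None:
        # Correlation near 1 means similar frames; a sharp drop suggests a cut
        similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
        if similarity < threshold:
            boundaries.append(frame_idx)
    prev_hist = hist
    frame_idx += 1

cap.release()
print("Detected shot boundaries at frames:", boundaries)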
Background Subtraction:
A common technique in video analysis, where the goal is to separate moving objects (foreground) from
the static background.
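A brief sketch of background subtraction on a video stream with OpenCV's MOG2 mixture-of-Gaussians model; the video filename and parameter values are assumptions.

import cv2

cap = cv2.VideoCapture("traffic.mp4")   # assumed input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Foreground mask: moving pixels are white, static background is black
    fg_mask = subtractor.apply(frame)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(30) & 0xFF == 27:    # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()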
Image Segmentation by Clustering Pixels:
Pixels are grouped into clusters based on their attributes (color, texture) to segment the image into
meaningful regions.
Segmentation by Graph-Theoretic Clustering:
This method involves constructing a graph based on pixel similarity and partitioning the graph to
segment the image.