Comprehensive Notes on Panorama Stitching
Introduction: Fundamentals of Panorama Stitching
Definition: Panorama stitching is the process of merging multiple
overlapping images into a single wide-angle composite image, creating a
seamless view larger than what any single image can capture.
Purpose: Overlap between images provides redundancy: features that
appear in multiple images can be aligned, allowing the software to infer
the spatial relationships between the images.
1. Theoretical Foundation: Image Formation and Capture
1.1 Image Representation
Definition: An image is a 2D array (matrix) of pixel values
Properties:
o Color images contain three channels (RGB)
o Each channel typically uses 8 bits (0-255)
o Each pixel represents a vector [R,G,B]
o Notation: I_i(x,y) denotes intensity at coordinates (x,y) in image
I_i
1.2 Camera Model
Pinhole Camera Model: Maps 3D world points (X,Y,Z) to 2D image
coordinates (x,y)
Mathematical Representation:
s[x;y;1] = K·[R|t]·[X;Y;Z;1]
Where:
o K: intrinsic matrix (focal length, optical center)
o R: rotation matrix
o t: translation vector
o s: scale factor (due to homogeneous coordinates)
Theoretical Significance: This model explains why images from different
viewpoints can be geometrically related through transformations.
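Illustration: a minimal NumPy sketch of this projection. The intrinsics, rotation,
and translation below are made-up values chosen only to show the mechanics, not
parameters from these notes.
```python
import numpy as np

# Assumed, purely illustrative camera parameters:
# focal length 800 px, optical center (320, 240), identity rotation.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])

X = np.array([[1.0], [0.5], [4.0], [1.0]])  # homogeneous 3D point [X;Y;Z;1]

P = K @ np.hstack([R, t])                   # 3x4 projection matrix K·[R|t]
x_h = P @ X                                 # s·[x;y;1]
x, y = (x_h[:2] / x_h[2]).ravel()           # divide out the scale factor s
print(f"({x:.1f}, {y:.1f})")                # -> (540.0, 340.0)
```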
2. Acquisition Process: Image Set Capture
2.1 Capture Requirements
Minimum Overlap: Approximately 30% between adjacent images
Configuration Example: Three overlapping photos:
[I₁] —— [I₂] —— [I₃]
Best Practices:
o Maintain consistent camera settings (exposure, white balance)
o Ideally rotate camera around optical center
o Minimize parallax effects by avoiding translation
2.2 Equipment Considerations
Tripod recommended for stability
Panoramic head to minimize parallax
Consistent lighting conditions
3. Feature Detection
3.1 Concept and Purpose
Objective: Detect distinct, repeatable visual patterns (corners, edges,
blobs) across images
Mathematical Representation: For each image I_i, detect features
K_i = {k_1, k_2, …, k_m}
Example: In image I₁, features might include k_1 = (x=105, y=212),
k_2 = (x=400, y=118), etc.
3.2 Detection Algorithms
3.2.1 Harris Corner Detector
Principle: Finds corners where intensity gradients change sharply in
both directions
Mathematical Foundation: Based on eigenvalues of the structure
tensor:
M = ∑ w(x,y) [I_x²     I_x·I_y]
             [I_x·I_y  I_y²   ]
Properties: Rotation invariant but not scale invariant
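A one-call sketch using OpenCV's implementation (cv2.cornerHarris); the filename
and the thresholds are placeholders.
```python
import cv2
import numpy as np

img = cv2.imread("I1.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Harris response from the structure tensor M, computed over a 2x2 window
# with 3x3 Sobel gradients and the usual Harris constant k = 0.04
response = cv2.cornerHarris(img, blockSize=2, ksize=3, k=0.04)
corners = np.argwhere(response > 0.01 * response.max())  # (y, x) positions
```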
3.2.2 SIFT (Scale-Invariant Feature Transform)
Principle: Detects extrema in scale-space (Difference of Gaussian
pyramids)
Properties:
o Scale and rotation invariant
o Partially invariant to illumination changes
o Resistant to viewpoint changes
Process:
1. Scale-space extrema detection
2. Keypoint localization
3. Orientation assignment
4. Keypoint descriptor generation (128-dimensional vector)
3.2.3 ORB (Oriented FAST and Rotated BRIEF)
Principle: Combines FAST keypoint detector with BRIEF descriptors
Properties:
o Computationally efficient
o Binary descriptors (faster matching)
o Rotation invariant
3.3 Output Format
For each image I_i:
Keypoint set: K_i = {k_1, k_2, …, k_m}
Descriptor set: D_i = {d_1, d_2, …, d_m}, where d_j ∈ ℝ^128 for SIFT
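The same sets in code: a short OpenCV sketch that produces K_i and D_i for one
image (the filename is a placeholder).
```python
import cv2

img = cv2.imread("I1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filename

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# keypoints   -> K_i: list of cv2.KeyPoint, coordinates in kp.pt
# descriptors -> D_i: m x 128 float32 array (one 128-D vector per keypoint)

orb = cv2.ORB_create()                             # faster alternative
kp_orb, des_orb = orb.detectAndCompute(img, None)  # m x 32 uint8 (binary)
```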
4. Feature Matching
4.1 Objective and Process
Purpose: Pair keypoints from two images that represent the same
physical point in the scene
Input:
o Descriptors D_i = {d_1, …, d_m} from I_i
o Descriptors D_j = {d’_1, …, d’_n} from I_j
4.2 Matching Algorithms
4.2.1 Nearest Neighbor Matching
Principle: For each d ∈ D_i, find d’ ∈ D_j such that ||d - d’||₂ is
minimized
Implementation: k-d trees or approximate nearest neighbor methods
4.2.2 Match Filtering Techniques
Lowe’s Ratio Test:
Accept match if: ||d - d'₁||/||d - d'₂|| < 0.75
Where d’₁ and d’₂ are the closest and second-closest descriptors
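Putting 4.2.1 and 4.2.2 together, a sketch with OpenCV's brute-force matcher;
cv2.FlannBasedMatcher would be the approximate-nearest-neighbor drop-in for large
descriptor sets. Filenames are placeholders.
```python
import cv2

img1 = cv2.imread("I1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filenames
img2 = cv2.imread("I2.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)        # Euclidean distance, as SIFT expects
knn = bf.knnMatch(des1, des2, k=2)     # 2 nearest neighbours per descriptor

# Lowe's ratio test: keep a match only when the best neighbour is clearly
# better than the runner-up
good = [m for m, n in knn if m.distance < 0.75 * n.distance]
```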
4.3 Example Output
Matched keypoints between I₁ and I₂:
(k₁⁽¹⁾, k₄⁽²⁾)
(k₂⁽¹⁾, k₅⁽²⁾)
etc.
5. Homography Estimation
5.1 Homography Definition
Concept: A projective transformation H ∈ ℝ³ˣ³ that maps points from
one plane to another
Mathematical Expression (where ~ denotes equality up to scale in
homogeneous coordinates):
[x']   [h₁₁ h₁₂ h₁₃] [x]
[y'] ~ [h₂₁ h₂₂ h₂₃]·[y]
[1 ]   [h₃₁ h₃₂ h₃₃] [1]
Applicability:
o Valid when scene is planar
o Valid when camera rotates around a single center
o Approximation for small depth variations
5.2 Direct Linear Transform (DLT)
5.2.1 Problem Formulation
Requirement: Minimum 4 point correspondences to solve for 8
degrees of freedom
Cross-product Form: For each match (x,y)↔(x’,y’), we get two linear
equations:
h₁₁x + h₁₂y + h₁₃ - x'(h₃₁x + h₃₂y + h₃₃) = 0
h₂₁x + h₂₂y + h₂₃ - y'(h₃₁x + h₃₂y + h₃₃) = 0
5.2.2 Matrix Construction
For each correspondence point, generate two rows of matrix A
Example for point #1: (x,y)=(105,212), (x’,y’)=(110,215):
[-105, -212, -1, 0, 0, 0, 105×110, 212×110, 110]
[0, 0, 0, -105, -212, -1, 105×215, 212×215, 215]
Complete matrix A is 8×9 (4 points × 2 equations)
5.2.3 Solution via SVD
Formulation: Ah = 0
Solve via Singular Value Decomposition (SVD)
Take h as the right singular vector corresponding to the smallest
singular value (the last column of V, equivalently the last row of Vᵀ)
Normalize h so h₃₃ = 1
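A from-scratch sketch of the DLT, fed with the four correspondences from
Section 12; assuming the H₁₂ shown in 5.4 really was computed from these points,
this should reproduce it up to rounding.
```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H from >= 4 correspondences: two rows of A per match,
    then solve Ah = 0 via the SVD."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([-x, -y, -1,  0,  0,  0, x * xp, y * xp, xp])
        rows.append([ 0,  0,  0, -x, -y, -1, x * yp, y * yp, yp])
    A = np.array(rows)            # 2n x 9 (8 x 9 for four points)
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]                    # singular vector of the smallest
                                  # singular value (last column of V)
    H = h.reshape(3, 3)
    return H / H[2, 2]            # normalize so h33 = 1

src = [(105, 212), (400, 118), (300, 300), (150, 400)]
dst = [(110, 215), (405, 120), (303, 302), (152, 405)]
print(dlt_homography(src, dst))
```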
5.3 RANSAC for Robust Estimation
Purpose: Filter outliers from feature matches
Algorithm:
1. Randomly sample 4 matches
2. Compute H using DLT
3. Count inliers (where ||p’ - Hp|| < ε)
4. Repeat N times, keep H with most inliers
5. Recompute H using all inliers
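In practice this loop is rarely hand-rolled: OpenCV's findHomography bundles DLT
and RANSAC in one call. A sketch continuing from the matching example in
Section 4 (kp1, kp2, good):
```python
import cv2
import numpy as np

# Matched coordinates, taken from the ratio-test survivors of Section 4
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# Internally: sample minimal 4-match sets, fit H via DLT, count inliers
# with reprojection error under 3 px, keep the best model and refit
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 3.0)
```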
5.4 Example Homography Result
H₁₂ = [1.056283 -0.014629 4.436752]
[0.019237 1.053948 -6.068625]
[0.000113 0.000040 1.000000]
6. Image Warping
6.1 Warping Concept
Definition: Applying H to remap pixels from one image to another
coordinate system
Process: For every pixel (x’,y’) in destination:
1. Compute (x,y) = H⁻¹(x’,y’)
2. Get pixel value from source image at (x,y)
3. Interpolate if (x,y) is non-integer
6.2 Coordinate Transformation
Mathematical Expression: (x,y,1)ᵀ ~ H₁₂⁻¹·(x’,y’,1)ᵀ
Example:
o Output pixel: (x’,y’) = (200, 150)
o Source coordinate: (x,y) ≈ (192.4494, 148.5133)
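A two-line check of these numbers, using the H₁₂ printed in Section 5.4 (small
deviations come from the rounded matrix entries):
```python
import numpy as np

H12 = np.array([[1.056283, -0.014629,  4.436752],
                [0.019237,  1.053948, -6.068625],
                [0.000113,  0.000040,  1.000000]])

p = np.linalg.inv(H12) @ np.array([200.0, 150.0, 1.0])  # (x', y') = (200, 150)
x, y = p[:2] / p[2]                # dehomogenize: divide by the third entry
print(x, y)                        # ~ (192.45, 148.51)
```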
6.3 Interpolation Methods
6.3.1 Nearest Neighbor Interpolation
Principle: Use value of nearest pixel
Properties: Fast but produces jagged edges
6.3.2 Bilinear Interpolation
Principle: Weighted average of 4 surrounding pixels
Example Calculation:
I₂(x,y) = (1-Δx)(1-Δy)·I₀₀ + Δx(1-Δy)·I₁₀
+ (1-Δx)Δy·I₀₁ + ΔxΔy·I₁₁
o For point (192.4494, 148.5133):
x₀ = 192, x₁ = 193, y₀ = 148, y₁ = 149
Δx = 0.4494, Δy = 0.5133
I₀₀ = 100, I₁₀ = 102, I₀₁ = 105, I₁₁ = 107
Result ≈ 103.4651
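The same computation as a small function; the synthetic patch below just plants
the four intensities from the example so the call reproduces the worked result.
```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at non-integer (x, y) as a weighted mean of 4 neighbours."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    I00, I10 = img[y0, x0],     img[y0, x0 + 1]
    I01, I11 = img[y0 + 1, x0], img[y0 + 1, x0 + 1]
    return ((1 - dx) * (1 - dy) * I00 + dx * (1 - dy) * I10
            + (1 - dx) * dy * I01 + dx * dy * I11)

img = np.zeros((200, 200))
img[148, 192], img[148, 193] = 100, 102   # I00, I10  (rows are y, cols are x)
img[149, 192], img[149, 193] = 105, 107   # I01, I11
print(bilinear(img, 192.4494, 148.5133))  # ~103.4651
```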
6.3.3 Bicubic Interpolation
Principle: Uses 16 surrounding pixels for smoother results
Properties: Higher quality but more computationally intensive
7. Image Blending
7.1 Blending Purpose
Objective: Eliminate seams and brightness differences in overlapping
regions
Challenges: Exposure differences, vignetting, parallax effects
7.2 Blending Techniques
7.2.1 Feathering (Linear Blending)
Principle: Weighted average, with the weight determined by position
within the overlap
Mathematical Expression: I_final = (1-α)·I₁ + α·I₂, where α ramps from 0
to 1 across the overlap (I₁ dominates at its own edge, I₂ at the other)
Example:
o Overlap from x’ = 400 to x’ = 600
o Weight α = (x’-400)/(600-400)
o For pixel at x’ = 450: α = 50/200 = 0.25
o I₁(450,y’) = 120, I₂_warp(450,y’) = 130
o I_blend = 0.75·120 + 0.25·130 = 122.5
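A sketch of the feather for single-channel images, using the example's overlap
bounds; the constant-valued test images at the end reproduce the 122.5 result.
```python
import numpy as np

def feather_blend(I1, I2_warp, x_start=400, x_end=600):
    """Linear blend of two grayscale images over [x_start, x_end):
    left of the overlap the output is I1, right of it I2_warp."""
    out = I1.astype(np.float64).copy()
    out[:, x_end:] = I2_warp[:, x_end:]
    alpha = (np.arange(x_start, x_end) - x_start) / (x_end - x_start)
    out[:, x_start:x_end] = ((1 - alpha) * I1[:, x_start:x_end]
                             + alpha * I2_warp[:, x_start:x_end])
    return out

I1 = np.full((10, 700), 120.0)
I2w = np.full((10, 700), 130.0)
print(feather_blend(I1, I2w)[0, 450])   # 0.75*120 + 0.25*130 = 122.5
```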
7.2.2 Pyramid Blending
Principle: Multi-resolution approach for seamless integration
Process:
1. Construct Laplacian pyramids for each image
2. Blend each level with Gaussian mask
3. Reconstruct final image
Properties: Better handling of high-frequency details
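A compact sketch under simplifying assumptions: grayscale float32 inputs whose
dimensions are divisible by 2**levels, and a mask that is 1.0 where the first
image should dominate.
```python
import cv2
import numpy as np

def pyramid_blend(img1, img2, mask, levels=4):
    """Laplacian-pyramid blend of two grayscale float32 images."""
    g1, g2, gm = [img1], [img2], [mask]
    for _ in range(levels):                    # Gaussian pyramids
        g1.append(cv2.pyrDown(g1[-1]))
        g2.append(cv2.pyrDown(g2[-1]))
        gm.append(cv2.pyrDown(gm[-1]))
    size = lambda a: a.shape[1::-1]            # (width, height) for pyrUp
    # Laplacian pyramids: per-level detail plus the coarsest Gaussian level
    l1 = [g1[i] - cv2.pyrUp(g1[i + 1], dstsize=size(g1[i]))
          for i in range(levels)] + [g1[-1]]
    l2 = [g2[i] - cv2.pyrUp(g2[i + 1], dstsize=size(g2[i]))
          for i in range(levels)] + [g2[-1]]
    # Blend each level with the Gaussian mask at matching resolution
    bands = [gm[i] * l1[i] + (1 - gm[i]) * l2[i] for i in range(levels + 1)]
    out = bands[-1]                            # collapse, coarse to fine
    for i in range(levels - 1, -1, -1):
        out = cv2.pyrUp(out, dstsize=size(bands[i])) + bands[i]
    return out
```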
7.2.3 Multiband Blending
Principle: Different frequency bands are blended differently
Properties: Superior results for complex seams
7.3 Advanced Blending Methods
7.3.1 Gradient Domain Fusion (Poisson Blending)
Principle: Match gradient fields rather than direct pixel values
Process: Solve Poisson equation to find seamless transition
7.3.2 Histogram Matching
Principle: Normalize color intensity distributions between images
Process: Transform histogram of one image to match another
7.3.3 Gain Compensation
Principle: Global adjustment of image brightness
Process: Optimize gain factors for each image
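Of these, gradient-domain fusion has a direct OpenCV entry point,
cv2.seamlessClone. A sketch, assuming the warped patch is smaller than the base
image and should land at its center; the filenames are placeholders.
```python
import cv2
import numpy as np

patch = cv2.imread("I2_warp_patch.jpg")   # placeholder filenames
base = cv2.imread("I1.jpg")

mask = 255 * np.ones(patch.shape[:2], np.uint8)    # clone the whole patch
center = (base.shape[1] // 2, base.shape[0] // 2)  # placement in base

# Solves the Poisson equation: the patch keeps its gradients while its
# boundary is forced to match the base, so the seam disappears
out = cv2.seamlessClone(patch, base, mask, center, cv2.NORMAL_CLONE)
```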
8. Multiple Image Stitching
8.1 Sequential Stitching
Process:
1. Detect & match keypoints between blended result and I₃
2. Compute homography H_{(1+2),3} via DLT+RANSAC
3. Warp I₃ into existing canvas
4. Blend using feathering or multiband pyramid
8.2 Global Alignment
Principle: Jointly optimize all homographies to minimize global error
Methods:
o Bundle adjustment
o Global homography estimation
o Graph-based optimization
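OpenCV's high-level stitcher performs this global pipeline (matching,
bundle-adjusted alignment, warping, seam finding, multiband blending) in one
call; a sketch with placeholder filenames:
```python
import cv2

imgs = [cv2.imread(f) for f in ("I1.jpg", "I2.jpg", "I3.jpg")]  # placeholders

stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, pano = stitcher.stitch(imgs)
if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", pano)
```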
9. Final Processing and Output
9.1 Cropping
Purpose: Remove empty (black) regions
Process:
1. Compute convex hull of all warped pixel footprints
2. Take axis-aligned bounding box: [x_min, x_max] × [y_min, y_max]
3. Crop to that box
Example:
o Warped extents: x ∈ [-10, 1010], y ∈ [5, 605]
o Crop to: x ∈ [0, 1000], y ∈ [5, 605]
o Result: 1000×600 final panorama
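A minimal crop, simplified to the bounding box of all non-black pixels (which
equals the bounding box of their convex hull):
```python
import numpy as np

def crop_to_content(pano, thresh=0):
    """Crop the axis-aligned bounding box of non-black pixels."""
    mask = pano.max(axis=2) > thresh if pano.ndim == 3 else pano > thresh
    ys, xs = np.where(mask)
    return pano[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```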
9.2 Color Adjustment
Techniques:
o White balance correction
o Contrast stretching
o Tone mapping
9.3 Post-Processing Enhancements
Options:
o Sharpening
o Noise reduction
o Vignette correction
10. Advanced Extensions
10.1 Bundle Adjustment
Principle: Joint optimization of camera poses and homographies
Process: Minimize reprojection error across all images simultaneously
10.2 3D Structure from Motion
Principle: Infer scene depth before stitching
Advantage: Better handling of parallax effects
10.3 Real-time Stitching Applications
Purpose: 360° video and VR content creation
Challenges: Computational efficiency, continuous alignment
11. Variable Catalog: Key Components
Symbol    Description                       Mathematical Domain
I_i       Input image                       2D matrix
K_i       Keypoints in image i              Set of 2D coordinates
D_i       Descriptors                       Set of n-dimensional vectors
M_{ij}    Matches between images i and j    Set of coordinate pairs
H_{ij}    Homography mapping image j to i   3×3 matrix
W_i       Warped image                      2D matrix
I_final   Final panorama                    2D matrix
12. Concrete Implementation Example
12.1 Setup and Data
Correspondences between I₁ and I₂:
Index   I₁ (x, y)     I₂ (x’, y’)
1       (105, 212)    (110, 215)
2       (400, 118)    (405, 120)
3       (300, 300)    (303, 302)
4       (150, 400)    (152, 405)
12.2 Implementation Steps
1. Feature Detection
o Detect keypoints in each image using SIFT
o Example: K₁ = {(105, 212), (400, 118), (300, 300)}
2. Feature Matching
o Match features between images
o Example: M₁₂ = {((105,212), (110,215)), ((400,118), (405,120))}
3. Homography Computation
o Construct matrix A from correspondences
o Solve via SVD to obtain H₁₂
4. Image Warping
o Apply H₁₂⁻¹ to map coordinates
o Use bilinear interpolation for non-integer coordinates
5. Image Blending
o Apply feathering in overlap regions
o Example: For x’ = 450, α = 0.25, resulting value = 122.5
6. Final Processing
o Crop to content bounding box
o Apply color adjustments if needed
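The whole pairwise pipeline as one sketch (steps 1-6 above). Filenames are
placeholders; for brevity the overlap is overwritten rather than feathered, and
the blending of Section 7.2 would slot in where the comment indicates.
```python
import cv2
import numpy as np

def stitch_pair(img1, img2):
    """SIFT features -> ratio-test matches -> RANSAC homography ->
    warp img2 into img1's frame -> composite -> crop black borders."""
    gray1, gray2 = (cv2.cvtColor(i, cv2.COLOR_BGR2GRAY) for i in (img1, img2))
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(gray1, None)
    kp2, des2 = sift.detectAndCompute(gray2, None)

    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des2, des1, k=2)
    good = [m for m, n in knn if m.distance < 0.75 * n.distance]

    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # maps I2 -> I1 frame

    h1, w1 = img1.shape[:2]
    h2, w2 = img2.shape[:2]
    canvas = cv2.warpPerspective(img2, H, (w1 + w2, max(h1, h2)))
    canvas[:h1, :w1] = img1     # feathering/multiband blending would go here

    ys, xs = np.where(canvas.max(axis=2) > 0)             # crop black borders
    return canvas[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

pano = stitch_pair(cv2.imread("I1.jpg"), cv2.imread("I2.jpg"))  # placeholders
cv2.imwrite("pano_12.jpg", pano)
```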