
Computer Vision

Credits: Prof. Stefano Ghidoni

Author
Francesco Gazzola

2021/2022

Any unauthorized distribution (digitally or on paper) is expressly prohibited without prior authorization from the author of these notes.
Contents

1 C++ introduction
1.1 Toolchain
1.1.1 Preprocessor
1.1.2 Compiler
1.1.3 Linker
1.1.4 File Extensions
1.1.5 Example Toolchain with static library
1.2 Static and dynamic libraries
1.2.1 Example static library
1.2.2 Advantage of dynamic libraries over static
1.3 Standard C++ Libraries
1.3.1 Templates
1.4 OpenCV library
1.4.1 Installation on windows
1.4.2 HelloWorld
What is cv::?
1.4.3 Compiling HelloWorld
1.4.4 Standard Directories
1.5 CMake
1.5.1 User perspective
Example under Linux
1.5.2 Developer perspective
CMakeList.txt for OpenCV
1.6 Pointers
1.6.1 Reference
1.6.2 Comparison pointers and references
1.7 Cast
1.7.1 Void* pointers
1.8 Callbacks
1.9 Header guards
1.10 Lab feedbacks

2 Basic data manipulation
2.1 The structure of an image
2.1.1 Color image
2.2 cv::Mat class
2.2.1 Constructors
2.2.2 Inspectors
2.2.3 Data access
2.3 The Vec class

3 Image coding
3.1 Analog and Digital converter overview
3.1.1 ADC for images
3.2 Image coordinate system
3.3 Digital color image
3.3.1 Access image pixels
3.4 Digital color video
3.5 Spatial & gray level resolution
3.5.1 Gray-level/intensity resolution
3.5.2 Contrast
3.5.3 Spatial resolution

4 Image processing
4.1 Geometric transformation
4.1.1 Homogeneous coordinates
Conversion to homogeneous coordinates
Conversion from homogeneous coordinates
Example of translation in homogeneous coordinates and normal coordinates
4.1.2 Affine transformation
Example
Examples of affine transformations
4.1.3 Forward vs backward mapping
4.1.4 cv::warpAffine
DoF : Degree of freedom
4.2 Single-Pixel operations
4.2.1 Image negatives
4.2.2 Log transformations
4.2.3 Power-law (or gamma) transformations
4.3 Contrast stretching and thresholding
4.3.1 Intensity slicing
4.3.2 Bit plane slicing
4.4 Histogram
4.4.1 Histogram equalization
4.4.2 Histogram in OpenCV
4.4.3 Histogram specification
Example
Comparison Histogram equalization and histogram specification
4.4.4 Local histogram equalization

5 Spatial filtering
5.1 Local operation
5.2 Spatial Linear filtering
Correlation vs Convolution
5.2.1 Filters and brightness
5.2.2 Smoothing filters
Average filter
Gaussian filter
5.2.3 Separable filter kernels
5.2.4 Sharpening (aka derivative) filters
First derivative operator
Second derivative operator
5.2.5 Combination of multiple techniques
5.3 Non-linear filtering & Restoration filters
5.3.1 Image restoration overview
Noise models
Salt and Pepper (or impulse) noise
5.3.2 Averaging filters
5.3.3 Median filters
5.3.4 Bilateral filter
Bilateral vs Median
5.3.5 Max and min filters
5.3.6 Alpha-trimmed mean filters
5.3.7 Adaptive filters
Adaptive median filters
5.4 Filter for coloured images
6 Filtering in frequency
6.1 Fourier Transform applied to image
6.1.1 Example of Rect-sinc transform
6.1.2 Spectrum and phase for DFT
6.1.3 Spectrum centering
6.1.4 Rotation and translation
6.2 Sampling
6.2.1 Aliasing on image
6.3 Moiré effect
6.4 Filters in frequency
6.5 Lowpass filters - smoothing
6.5.1 Ideal Low pass filter - ILPF
Ringing effect
6.5.2 Butterworth lowpass filter - BLPF
6.5.3 Gaussian low pass filter - GLPF
6.5.4 Sum up Low pass filtering
6.6 Highpass filters - sharpening
6.6.1 Ideal Highpass
6.6.2 Butterworth Highpass
6.6.3 Gaussian Highpass
6.6.4 Relation frequency and spatial filtering
6.7 Selective filters
6.7.1 Bandreject and bandpass filters
6.7.2 Notch filter

7 Mid level
7.1 Edge detection
7.1.1 Derivative recall
7.1.2 Types of edges
7.1.3 Derivatives and noise
7.1.4 Gradient edge detector (1st derivative)
Gradient magnitude and angles example applying the Sobel mask
Gradient masks
7.1.5 Laplacian edge detector (2nd derivative)
7.2 Edge detection algorithm
7.2.1 Example of application of the edge detector algo
7.3 The Canny edge detector
7.3.1 Smoothing
7.3.2 Gradient computation
7.3.3 Edge quantization
7.3.4 Non-maxima suppression
7.3.5 Hysteresis thresholding
7.3.6 Example of Canny algo
7.3.7 Gaussian kernel size σ and threshold relation
7.4 Edge detector comparison
7.5 Line detection
7.5.1 Naive approach
7.5.2 Hough transform
Normal representation
7.5.3 Canny edge detector Vs Hough transform line detector
7.5.4 Generalized Hough transform - Circle detection

8 Morphological operators
8.1 Erosion
8.2 Dilation
8.3 Opening and closing combination
8.3.1 Opening
8.3.2 Closing
8.3.3 Comparison opening vs erosion and closing vs dilation
8.3.4 More combination

9 Segmentation
9.0.1 Application
9.1 Segmentation by thresholding
9.1.1 Factors influencing the selection of the right threshold
9.2 Otsu's optimal threshold
9.2.1 Smoothing
9.2.2 Otsu + edge detection
9.2.3 Variable thresholding (Adaptive thresholding)
9.2.4 Multiple thresholding
9.2.5 What is Percentile?
9.3 Region growing methods
9.3.1 Connectivity
9.3.2 Region growing algorithm
9.3.3 Example
9.3.4 Growing regions vs thresholding-based segmentation
9.4 Segmentation using Watersheds
9.4.1 Watershed on gradients
9.4.2 Over-segmentation and markers
Marker selection
Example of markers
9.4.3 OpenCV - watershed(img, markers)

10 Segmentation by clustering
10.1 Feature vector image representation
10.2 Clustering techniques
10.3 k-means
10.3.1 Lloyd's algorithm (aka k-means algorithm)
10.3.2 Initialization
10.3.3 Pro and cons
10.3.4 K-means for segmenting images
10.4 Beyond k-means
10.4.1 Density function estimation
10.5 Mean shift
10.5.1 Key idea for estimating the density function (PDF)
10.5.2 Density gradient function estimation
10.5.3 Optimization technique
10.5.4 Mean shift clustering segmentation
10.5.5 Segmentation examples using mean shift
10.5.6 Mean shift pros and cons
10.6 Segmentation seen as a per-pixel labelling task (MARKOV RANDOM FIELD)
10.6.1 Label set
10.6.2 Energy function
10.6.3 Data term definition examples
Data term - example 1
Data term - example 2
10.6.4 Smoothness term definition examples
Smoothness term - example 1
Smoothness term - example 2
10.7 Belief Propagation (BP) algorithm
10.7.1 Message
10.7.2 BP initialization
10.7.3 BP time update and termination
10.7.4 Algo output
10.7.5 BP for segmentation
11 Active contours
11.1 Snake
11.1.1 Initialization process
11.1.2 Contour move
11.2 Scissors
11.3 Level set

12 Feature detection and matching
12.1 Introducing features
12.2 Feature pipeline sum-up
12.2.1 Set of points
12.2.2 Descriptors
12.2.3 Feature matching
12.3 Feature applications
12.3.1 Matching application
12.3.2 Tracking application
12.4 Feature (aka keypoint) detection
12.5 Harris corners detector
12.5.1 Autocorrelation function
12.5.2 Keypoints selection criteria
12.5.3 Invariance properties
12.6 Other detectors
12.6.1 SUSAN AND USAN
12.7 Blob features detection
12.7.1 MSER
12.8 Object detection in different scale space
12.8.1 Scale space implementation
12.8.2 Edge detection in scale space
12.9 The SIFT feature
12.9.1 Introduction
12.9.2 SIFT algorithm
12.9.3 Scale-space organization
What is k?
DoG
12.9.4 Keypoint localization
12.9.5 What is the best value for s?
12.9.6 Keypoint orientation
12.9.7 Descriptor calculation
Description refinements
12.9.8 SIFT sum up
12.9.9 SIFT examples
12.9.10 SIFT robustness
Robustness to noise
Robustness to affine transform
12.9.11 Questions from slides
12.10 Overview of other SIFT-based features
12.10.1 Principal Component Analysis
PCA recall
PCA-SIFT
12.10.2 SURF features
12.10.3 GLOH
12.11 Other feature approaches
12.11.1 Shape context
12.11.2 LBP
12.11.3 BRIEF
12.11.4 ORB
12.11.5 Comparison
12.12 Feature matching
12.12.1 Matching strategy
12.12.2 Matching performance

13 Face detection
13.1 Viola & Jones face detector overview
13.1.1 Haar features
13.1.2 Weak learners
13.1.3 Boosting
13.1.4 Cascading
13.1.5 Viola & Jones examples

14 The camera
14.1 The pinhole camera model
14.1.1 Camera obscura
14.1.2 Modelling a camera
14.2 Projective geometry
14.2.1 Mapping from 3D to 2D
Projection matrix (3D to 2D)
14.2.2 Mapping projected point from 2D to pixel coordinate
14.2.3 Mapping from 3D to pixel coordinates
Projection matrix (3D to pixel coordinates)
14.2.4 Rototranslation - Mapping from 3D world to 3D camera
Figuring out rototranslation
14.2.5 Projection recap
14.3 Inverse projection
14.4 The thin lens model
14.5 Non ideal (real) lenses
14.5.1 Distortion
Radial distortion
Tangential distortion
14.5.2 Chromatic aberration
14.5.3 Distortion and camera lens
14.6 Field of view
14.7 Real camera overview
14.7.1 Perspective vs viewpoint
14.7.2 Sensors overview
14.7.3 Exposure
Aperture
Shutter
14.7.4 ISO
14.8 Camera calibration
14.8.1 Why calibration?
14.8.2 Calibration process
14.8.3 Reprojection error
14.8.4 Sum-up

15 Image resampling
15.1 Interpolation techniques
15.1.1 Nearest neighbor interpolation
15.1.2 Bilinear interpolation
15.1.3 Bicubic interpolation
15.1.4 Interpolation comparison

16 High-level vision
16.1 Object recognition
16.1.1 Image classification
16.1.2 Object detection (or localization)
16.1.3 Object detection vs image classification
16.1.4 Semantic segmentation
16.1.5 Semantic segmentation vs object detection
16.1.6 Modelling variability
16.1.7 History of recognition
16.2 Template matching
16.2.1 How to apply a template?
16.2.2 Correlation-based rigid template matching
16.2.3 Generalized Hough Transform rigid template matching
16.2.4 Template matching weak points and how to cope with
16.3 Bag of words

17 Introduction to Deep Learning
17.1 Machine Learning
17.1.1 ML framework
17.1.2 Linear model and Kernel
17.2 Deep Learning basics
17.2.1 Neurons
17.2.2 Activation functions
Sigmoid
Tanh
ReLU
17.2.3 Overview of a fully connected NN layer and output layer
17.2.4 Fully connected DNN
17.2.5 Training
Overfitting
Pre-processing - Data augmentation
17.2.6 Output layer: Classification vs Regression
17.3 Convolutional neural networks - CNNs
17.4 Example 1: MNIST
17.5 Example 2: CIFAR
17.6 Keras
17.6.1 Getting started
17.7 CNN in Keras
17.7.1 Convolutional layer in Keras - Conv2D
17.7.2 Pooling layer in Keras - MaxPooling2D
17.7.3 Dense layer in Keras - Dense
17.7.4 Softmax layer in Keras
17.8 Compiling a model in Keras
17.8.1 Optimizer
17.8.2 Loss function
17.8.3 Metric
17.8.4 Example MNIST
17.9 Colab

18 Notes
18.1 HSV
18.2 Contours

Chapter 1

C++ introduction

C++ is an efficient, compiled, general-purpose programming language. It is considered a low-level programming language since it doesn't have automatic memory management (such as a garbage collector).

1.1 Toolchain

C++ is a compiled programming language. This means that to get a program to run (hence the executable file) we must first translate it from the human-readable form to something a machine can understand. This is achieved by means of a toolchain.
A toolchain is a set of programming tools used to create a software product. In general, the tools forming a toolchain are executed consecutively, so the output of one tool becomes the input of the next one, but the term is also used when referring to a set of tools that are not necessarily executed consecutively.
A simple software development toolchain may consist of a preprocessor, a compiler and a linker (which transform the source code into an executable program), libraries (which provide interfaces to the operating system), and a debugger (which is used to test and debug the created programs).

1.1.1 Preprocessor

It is a simple tool that performs textual substitutions based on the directives (which are statements starting with the symbol # in the source code file). The input is source code and the output is still source code.
If there are no directives involved in the program, the preprocessor is not required.

Figure 1.1 Toolchain

1.1.2 Compiler

It takes the source code as input and produces the object code (also called machine code) as output.

1.1.3 Linker

It is a program that links the object code of the libraries used in the program together with the object code of the program itself, producing the executable.

Figure 1.2 Linking process

1.1.4 File Extensions

In C++ the extensions are:

• .cpp or .h : source code

• .obj : object code in Windows (in Linux it is .o)

1.1.5 Example Toolchain with static library


// file name: my_func.cpp
int f(int i)
{
    return i + 2;
}

// file name: my_func.h
int f(int i);

// main.cpp
#include "my_func.h" // declaration of f, so the compiler knows its signature
int main(void)
{
    int i = 0;
    i = f(i); // function call
    return 0;
}

Figure 1.3 From source to executable with static library

1.2 Static and dynamic libraries

A software project without any 'main()' function is a LIBRARY. A software library is a set of functions.
There are two kinds of libraries in programming:

• STATIC. The previous example shows the use of a static library: the object code of the library is merged with the object code of the source.

• DYNAMIC. A dynamic library is instead linked at load time (when the program is loaded in memory). On Windows they are called DLLs (Dynamic Link Library) and they are located through the environment variable 'path', while on Linux they are called SO (Shared Object).

We use different command-line options to tell the linker whether the libraries are static or dynamic.
Moreover, note that libraries usually come already compiled (hence we only have their object code, ready to be merged or linked with the object code of the source). Open-source software usually also offers the original source code of the libraries involved.

Figure 1.4 Static and dynamic libraries

1.2.1 Example static library

We don't compile the standard iostream library: it comes already compiled.

Figure 1.5 Static and dynamic libraries

1.2.2 Advantage of dynamic libraries over static

The advantage comes from the scenario in which multiple programs use the same library. If the library is static, each program will have a copy of it merged into its own executable. If it is dynamic, a single copy serves all the programs: we keep it in memory. So dynamic libraries allow us to save space: one installation serves all.

Figure 1.6 Static and dynamic libraries

1.3 Standard C++ Libraries

They are libraries that provide support for basic operations such as I/O streams and so on.

1.3.1 Templates

A special section of the standard libraries is the Standard Template Library (STL). It makes use of the template concept, allowing us to generate functions starting from a single template definition. A template is a way of defining generic functions which can take a generic data type as input.

Figure 1.7 Template example for a simple function returning the input.

The STL also provides the STL containers, which are templates defining data structures: vector, list and so on.

Figure 1.8 Example STL container
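
A minimal sketch of the two ideas above (a function template returning its input, as in Figure 1.7, and an STL container, as in Figure 1.8); this is an illustrative rewrite, not the code from the figures:

#include <iostream>
#include <string>
#include <vector>

// function template: the compiler generates one function per type T it is used with
template <typename T>
T identity(T x)
{
    return x; // simply returns the input, whatever its type
}

int main()
{
    std::vector<int> v = {1, 2, 3};          // STL container holding ints
    v.push_back(identity(4));                // template instantiated with T = int
    for (int x : v)
        std::cout << x << " ";
    std::cout << identity(std::string("hello")) << std::endl; // T = std::string
    return 0;
}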

1.4 OpenCV library

1.4.1 Installation on windows

We use CMake for building the library. After building, the library shall be installed.

1.4.2 HelloWorld
// OpenImg.cpp
#include <opencv2/highgui.hpp>
int main(int argc, char **argv)
{
    cv::Mat img = cv::imread(argv[1]);
    cv::namedWindow("Example 1");
    cv::imshow("Example 1", img);
    cv::waitKey(0);
    return 0;
}

• #include pulls in the OpenCV header. It is managed by the preprocessor.

• argc = argument counter: it indicates the number of command-line arguments.

• argv = argument vector: it contains the input from the command line. argv[0] is the name of the executable in the command line.

• imread reads an image from file and allocates the memory for it.

• Mat is a data structure that stores the pixels and manages the memory. img is an object of class cv::Mat.

• A window is a graphical element of the graphical user interface. cv::namedWindow creates a window called "Example 1". We then use "Example 1" to refer to that window.

• imshow draws an image on a window.

• waitKey synchronizes the drawing: without waitKey the window never appears, since nothing triggers the redraw. The argument 0 means wait forever until a key is pressed.

What is cv::?

It is the OpenCV namespace.

A namespace is a mechanism for wrapping names into a single scope; hence cv:: refers to everything belonging to OpenCV.
Namespaces are used to separate the symbols related to a given library: the namespace becomes part of the name of the class/function. This is useful for resolving ambiguity when many libraries use the same class name. For instance, if many libraries define a Mat class, cv::Mat lets us distinguish it from mylib::Mat.
If there is no ambiguity about which library we are referring to, we can use 'using' and drop cv:: from the names.

Figure 1.9 ’Using’ example

IMPORTANT: we cannot use 'using' in a header file, since any file that includes that header would be affected and its behaviour might change.
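
A minimal sketch of the 'using' declaration in a source (non-header) file, assuming OpenCV is installed; cv::Mat is used only as an example:

#include <opencv2/core.hpp>

using cv::Mat;          // bring in a single name from the cv namespace
// using namespace cv;  // or bring in the whole namespace (never in a header file)

int main()
{
    Mat img(10, 10, CV_8U); // no cv:: prefix needed for Mat anymore
    return 0;
}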

1.4.3 Compiling HelloWorld

For compiling in the T.2020 VM we need to use the command:


g++ OpenImg.cpp -o OpenImg -I/usr/local/include/opencv4 -lopencv_highgui
-lopencv_core -lopencv_imgcodecs

Let’s take a look:

• g++ : calls the preprocessor, compiler and linker

• OpenImg.cpp : Name of the input (source file)

• -o OpenImg : Name of the output (executable file)

• -I/usr/local/include/opencv4 : specifies the directory where g++ can find the header files (note: -I is an uppercase 'i').

• -lopencv_highgui -lopencv_core -lopencv_imgcodecs : are the libraries to


be linked. -l is a lowercase ’L’.

1.4.4 Standard Directories

g++ searches some standard directories when looking up the files. These default directories are located in:

• For the Header files:

– /usr/include
– /usr/local/include

• For the libraries files:

– /usr/lib
– /usr/local/lib

If we use directories different from the standard ones, we need to specify the full path, as done for the header files above. We can do the same for the libraries, but with the option flag -L (uppercase), which adds a library search directory, while -l (lowercase 'L') names the library to be linked; -l alone is enough only when the library is placed inside a default folder.

Generally speaking, for specifying files and directories that are in a non-default location, we use the following command-line options (an example follows the list below):

• -I (uppercase ’i’): for specifying a header file location

• -L : for specifying a library file location

• -l (lowercase ’L’): for specifying a library file to be linked
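
For instance, a hypothetical library libmylib installed under /opt/mylib (path and name chosen only for illustration) could be used as follows:

g++ main.cpp -o main -I/opt/mylib/include -L/opt/mylib/lib -lmylib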

1.5 CMake

A software project is usually composed of several source files (.h, .hcc, .cpp) and this makes the compilation process not so straightforward. For instance, we can have different groups of files (libraries and source files) that generate different executables within the same project, and hence we need to compile together only the files related to each executable. We therefore need some method for describing the structure of the project, so we know which files need to be compiled together. In general, we can describe a project by using one of the following tools:

• Command line: the commands can be very long and hard to interpret, since we need to run a different command line to compile each executable.

• IDE: convenient but platform dependent.

• Build system: it is independent of compiler and platform. It is the solution we use in this course, through CMake.

CMake is a meta-build system: a tool for handling the compilation process. It is independent of compiler and platform, and it is used in combination with the local build system (make, Ninja and so on).
CMake is a universal language for describing projects: the project description is contained in a text file called CMakeLists.txt.

Figure 1.10 Cmake process

1.5.1 User perspective

User workflow:

1. Call CMake (GUI/command line)

2. Select the directory containing the source

3. Select the directory where the compiler output shall go (Standard: [project
root]/build)

4. Configure

5. (possible) module enable/disable

6. Generate. The generate action creates the build files to be provided to the
local build system.

Figure 1.11 Cmake process

Example under Linux


mkdir build
cd build     # prepare a separate folder (optional) for the build files generated by CMake
cmake ..     # generate the build files specific for the machine
make         # compile everything

1.5.2 Developer perspective

Developer workflow:

1. Create CMakeLists.txt, manually or automatically through an IDE, and place it in the root of the project. In it we can indicate:

• The minimum version of CMake: it is a good idea to require the oldest CMake version that is sufficient, so everyone can use it. Since we use only some basic concepts, we require 3.2.
• The name of the project
• The list of libraries needed by the project
• The targets for creating the executable and the libraries. The creation of the executable corresponds to
g++ -o tool main.cpp another_file.cpp

Figure 1.12 CMakeLists.txt

CMakeList.txt for OpenCV


cmake_minimum_required(VERSION 2.8)
project(test1)
# Look for OpenCVConfig.cmake (it defines some OpenCV-related variables)
find_package(OpenCV REQUIRED)
# tell the compiler to look for the headers in the folders defined by the variable
include_directories(${OpenCV_INCLUDE_DIRS})
# tell the compiler to build an executable named test1 from the source file test1.cpp
add_executable(${PROJECT_NAME} src/test1.cpp)
# tell the linker to link the test1 executable with the libraries defined by the variable
target_link_libraries(${PROJECT_NAME} ${OpenCV_LIBS})

1.6 Pointers

Pointers are variables that store memory addresses. They point to a given type: the pointer knows that at the address it stores there is a variable of that specific type.
We use the symbol & to extract the address of a variable and the dereference operator * to access its content.

Figure 1.13 The pointer p is linked to the address of the variable i.

// Pointers example
int i = 10;
int *p;
p = &i;
*p = 15;

1.6.1 Reference

References are an alternative to pointers: the syntax is different but the concept is similar.
A reference variable is an alias, that is, another name for an already existing variable. A reference, like a pointer, is also implemented by storing the address of an object. The main difference from pointers is that a reference cannot be assigned a NULL value.
A reference can be thought of as a constant pointer (not to be confused with a pointer to a constant value!) with automatic dereferencing, i.e. the compiler will apply the * operator for you.

// References example
int i = 10;
int &r = i; // reference definition: r becomes another name for i, whose value is 10
r = 15;     // this assigns the value 15 also to the variable i

1.6.2 Comparison pointers and references

We use pointers and references for passing arguments by reference to a function (the function changes the value of the variable in memory).
NOTE: cout << i prints the value of the variable i on the command-line interface.

Figure 1.14 Pointers and references. OUT indicates what cout << i prints on the command-line interface

1.7 Cast

Cast is a method for changing the type of a variable.

// cast example
int i = 10;
float f;
f = static_cast<float>(i);

1.7.1 Void* pointers

Void* pointers are generic pointers. They cannot be dereferenced directly, since the compiler doesn't know how to interpret their content. Hence, to dereference them we need to cast them to a specific type first.

// void* pointer example
#include <iostream>
int main()
{
    int a = 10;
    void *ptr = &a;
    std::cout << *(int *)ptr; // cast back to int* before dereferencing
    return 0;
}

1.8 Callbacks

A callback is any executable code (or a function) that is passed as an argument to other code (or a function), which is expected to call back (execute) the argument at a given time.
This is used, for instance, for handling mouse events such as clicks.
The OpenCV GUI can handle callbacks.
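
A minimal sketch of an OpenCV mouse callback (the window name and the image are only illustrative; the callback signature is the one expected by cv::setMouseCallback):

#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>

// called by the OpenCV GUI every time a mouse event occurs on the window
void onMouse(int event, int x, int y, int flags, void *userdata)
{
    if (event == cv::EVENT_LBUTTONDOWN) // left button click
        std::cout << "Click at (" << x << ", " << y << ")" << std::endl;
}

int main()
{
    cv::Mat img(400, 400, CV_8UC3, cv::Scalar(50, 50, 50)); // plain gray image
    cv::namedWindow("Example");
    cv::setMouseCallback("Example", onMouse, nullptr);      // register the callback
    cv::imshow("Example", img);
    cv::waitKey(0);
    return 0;
}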

1.9 Header guards

In the C and C++ programming languages, an #include guard, sometimes called a macro guard, header guard or file guard, is a particular construct used to avoid the problem of double inclusion when dealing with the include directive. The files included in this regard are generally header files, which typically contain declarations of functions and classes or structs. If certain C or C++ language constructs are defined twice, the resulting translation carried out by the preprocessor is invalid. #include guards prevent this erroneous construct from arising via the double-inclusion mechanism.
Another construct to combat double inclusion is #pragma once, which is non-standard and used mostly on Windows; Visual Studio might set it by default.
The working principle is: the preprocessor checks whether a given constant is already defined (meaning the header file has already been included); if not, it defines it and includes the file, otherwise it skips the inclusion.
IT IS MANDATORY TO USE THEM, and they are used only in header files (CPP FILES SHALL NEVER BE INCLUDED)!

Figure 1.15 Header guards example
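
A minimal header guard sketch for the my_func.h file used earlier (the macro name MY_FUNC_H is just a convention):

// my_func.h
#ifndef MY_FUNC_H      // skip everything below if this header was already included
#define MY_FUNC_H

int f(int i);          // declarations only

#endif // MY_FUNC_H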

1.10 Lab feedbacks

• Do not implement functions that do too many tasks. Split the tasks into different functions instead. Each function shall do a specific task, so that it can be reused.

• When many parts are repeated, it is a hint that we should write a function for them.

• Input arguments passed by reference to a function that shall not be modified must be declared as const.

• Remember header guards.

• Use good-looking spacing between operators and expressions. Use indentation for enhancing clarity.

• Use a structured folder layout for the project: an "include" folder for the headers, an "image" folder for the images.

• Related function declarations shall be grouped into a single header file. We don't need to create a file for each function, and hence a header for each file.

• A header file contains only declarations, not the definitions of the functions, which should be in a cpp file.

• Cpp files SHALL NEVER BE INCLUDED. If you need to, it means that a header file is missing.

• We include the OpenCV headers (for cv::Mat, for example) wherever we use them.

• Avoid absolute paths.

• Global variables shall be avoided, to allow a better understanding of the flow of the program (above all when the global variable is edited by different functions).

Chapter 2

Basic data manipulation

2.1 The structure of an image

An image is a set of pixels.
The natural representation of an image is a matrix, where each cell represents a pixel.

Figure 2.1 matrix representation of an image

In particular, each pixel is represented by a number that can be integer or float:

• Float in [0, 1]: 0 is black and 1 is white. Hence, float is used for black/white images.
NOTE: "black and white" COMMONLY refers to a grayscale image; here we use float for images that have EXACTLY the two colours, white or black.

• Unsigned char: [0, 255]

• Unsigned 16-bit integer: [0, 65535]

The number of bits we use for representing each pixel is called DEPTH.

2.1.1 Color image

For a coloured image we use the RGB representation: each colour component, called a CHANNEL, is represented as a grayscale value. Hence, we have 3 channels. The depth applies to each channel.

Figure 2.2 Coloured image

NOTE: 4 channels are also possible: the 4th is usually used for transparency.

2.2 cv::Mat class

2.2.1 Constructors
cv::Mat(nrows, ncols, type[, fillValue])
ex: cv::Mat(10, 10, CV_8U)

In the example above, we can see that the constructor of Mat lets us define an image with nrows rows, ncols columns and a type.
Regarding the type, any single-channel array should belong to one of the following data types:

• CV_8U - 8 bit unsigned integer

• CV_8S - 8 bit signed integer

• CV_16U - 16 bit unsigned integer

• CV_16S - 16 bit signed integer

• CV_32S - 32 bit signed integer

• CV_32F - 32 bit floating point number

• CV_64F - 64 bit floating point number

From <https://www.opencv-srf.com/2017/11/opencv-cpp-api.html> TAKE A LOOK


FOR SOME EXAMPLES.

Generally the type is defined as CV_<depth>[C<# channels>]. If we omit the
# channels, by default it will be C1.

Figure 2.3 CV_8UC3, 3 channels of 8 bits unsigned integer
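
A minimal sketch of the optional fill value, using cv::Scalar (sizes and values are arbitrary):

#include <opencv2/core.hpp>

int main()
{
    cv::Mat gray(10, 10, CV_8U, cv::Scalar(128));        // single channel, all pixels 128
    cv::Mat bgr(10, 10, CV_8UC3, cv::Scalar(255, 0, 0)); // 3 channels, all pixels blue (BGR order)
    return 0;
}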

2.2.2 Inspectors
int Mat::channels() const; // # channels
int Mat::depth() const;    // matrix element depth for each channel
int Mat::type() const;     // matrix element type

const means that the function won't change the object's internal state.

2.2.3 Data access

We use the Mat::at template function.


template <typename T>
T& Mat::at(int i, int j);

Figure 2.4 Example ’at’ function
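
A minimal sketch of reading and writing a single grayscale pixel with 'at' (the template type must match the matrix type; unsigned char corresponds to CV_8U):

#include <opencv2/core.hpp>

int main()
{
    cv::Mat img(100, 100, CV_8U, cv::Scalar(0));
    unsigned char v = img.at<unsigned char>(10, 20); // read the pixel at row 10, column 20
    img.at<unsigned char>(10, 20) = 255;             // write the same pixel
    (void)v;                                         // silence the unused-variable warning
    return 0;
}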

2.3 The Vec class

Vec is an OpenCV class for dealing with tuples.


Elements are accessed using the operator [ ].

Figure 2.5 Example Vec class

There are some shortcuts for defining Vec objects, such as cv::Vec3b (three unsigned chars) and cv::Vec3f (three floats).
We use Vec to describe the triplet of a colour pixel. Indeed, a colour matrix has 3 channels (RGB), hence a pixel is a tuple of three elements. The following code is an example of a coloured image managed through the Vec class.
// ...
int main(void)
{
    Mat img(200, 200, CV_8UC3);
    for (int i = 0; i < img.rows; ++i)
    {
        for (int j = 0; j < img.cols; ++j)
        {
            img.at<Vec3b>(i, j)[0] = 0;
            img.at<Vec3b>(i, j)[1] = 0;
            img.at<Vec3b>(i, j)[2] = 255;
        }
    }
    // visualization
    return 0;
}

Chapter 3

Image coding

The digital acquisition of an image is done by an image sensor, which performs some operations such as the AD conversion. The main operations involved in the AD conversion are the sampling and the quantization of the continuous image.

Figure 3.1 Image sensing pipeline of a camera sensor ( taken from the book )

3.1 Analog and Digital converter overview

The ADC performs two operations: sampling/hold and quantization/encoding.
Sampling/hold refers to digitizing the coordinates: we convert an analog signal into a discrete signal by considering equally spaced values of the input signal (we take one value at fixed time intervals, which defines the sampling frequency). In particular, we sample one value at a time, then we hold it while sampling the next value. Since we hold all the values, at the end of this stage we have a discrete-time signal. Sample and hold can be implemented as a single device.
Quantization instead means digitizing the amplitude of the signal: we map a number belonging to an infinite set to a number belonging to a finite set (in other words, we convert a value into a discrete value). This conversion is done by associating the input value with the closest value defined by the quantization. Then, we encode each quantized value as a binary value.

Figure 3.2 Analog to digital converter scheme

Figure 3.3 Example of ADC

The quantization and encoding process is based on the number of bits used for the encoding (technically called the RESOLUTION of the ADC), which defines the quantization levels. The number of quantization levels (hence the number of values in the finite set) is equal to 2^m, where m is the number of bits used in the encoding. For instance, with m = 8 bits we get 2^8 = 256 levels.

Figure 3.4 Different bits encoding

3.1.1 ADC for images

From a digital signal processing point of view, an image is represented as a two-dimensional function f(x, y), whose value represents the energy of the light incident on the surface of the sensor.
The sampling for images is done in a different domain. The 2 main steps of the ADC, sampling and quantization, are carried out as follows:

Figure 3.5 ADC simple scheme of an image

1. SAMPLING: we need to sample along the two spatial dimensions (x, y).

f(m, n) = f(m · Δx, n · Δy)

where f(m, n) is the sampled function and Δx, Δy are the sampling periods.

2. QUANTIZATION: we quantize the amplitude.

f_quantized = Q[f(x, y)]

where Q is the function performing the quantization. In figure 3.6 the quantization levels represent a grey scale.

We hence start at the top of the continuous image and carry out the ADC conversion procedure downward, line by line, thus producing a digital image.

Figure 3.6 Sampling of an image along restricted along a single line y

Figure 3.7 Sampling of an image. The image on the image plane is the continuous
image appearing on the surface of the sensor, which will be digitized from it.

Figure 3.8 Sampling of a grey scale image. Sampling is done over n1 and n2, quantization over x(n1, n2). We have two representations here, one 3D and one 2D. Moreover, we notice that 0 is dark and higher values are brighter grey.

3.2 Image coordinate system

There are different ways for naming the axes of an image.

Figure 3.9 Different coordinates convention

In this course we use the first convention reported in the first row of the table.
Note that it is different from our intuitive way of referring to a matrix: we here
use the convention (column, row) instead of (row, column). It depends on the
library anyway. OpenCV uses this unnatural representation.
In particular, in the course we use the notation (n1,n2) seen previously for the
grey scale image.

3.3 Digital color image

We have a triple for each point. The overlap of them forms the image.

Figure 3.10 Digital color image

3.3.1 Access image pixels

The 'at' function can be used for both reading and writing.


For color images we use cv::Vec3b:
cv::Vec3b current_color = image_out.at<cv::Vec3b>(r, c);
image.at<cv::Vec3b>(r, c) = cv::Vec3b(37, 201, 92);

For accessing an element of a triple, we use:


current_color[0] = 22;

3.4 Digital color video

In a video we have a new variable with respect to those used for representing an image: we use k to represent the dependency on time.
Hence, a pixel is represented by the triple x[n1, n2, k]. We don't use 't' instead of 'k' because 't' usually refers to continuous time, while our camera records a certain number of frames per second: K is the number of frames and k the frame index.

Figure 3.11 Digital color video

3.5 Spatial & gray level resolution

3.5.1 Gray-level/ intensity resolution

Intensity resolution is defined as the smallest detectable change in the gray level. This is the resolution of the ADC, and hence it refers to the number of bits used for the quantization levels representing the various shades of gray. In other words, it is the number of bits per pixel.

Figure 3.12 Gray level example keeping the spatial resolution constant. For each
level, we are changing the quantization Q function.

3.5.2 Contrast

Contrast is defined as the difference between the highest and lowest intensity level
in the image.

3.5.3 Spatial resolution

Spatial resolution is defined as the smallest detectable detail in the image. What is a detail? What is detectable? A detail is a geometric element. Indeed, when the image becomes smaller and smaller, some points (details) of the image disappear, hence they are no longer detectable.
Spatial resolution can also be expressed as the number of pixels per unit distance. This last definition is used particularly for printing, where the dpi (dots per inch) unit is used.
NOTE: a measure of spatial resolution must be stated with respect to a spatial unit (as dpi uses the inch) to be meaningful. The size of the image by itself doesn't tell us much: for instance, a resolution of 1024x1024 pixels is not a meaningful statement without the spatial dimensions encompassed by the image.

Figure 3.13 Example 1 of spatial resolution. As shown, the smaller the spatial resolution, the smaller the image. Below 128 we lose a lot of details.
The picture below shows the same images as above, zoomed out. Looking at it we can see that we are reducing either the number of pixels of the sensor or the number of pixels of the image. This is why it is important to state the spatial dimensions.

Chapter 4

Image processing

There are many different ways of transforming an image: geometric transformations (scaling, rotation and so on), single-pixel operations (histogram operations and so on), and local operations such as linear filters.

4.1 Geometric transformation

A geometric transform is a modification of the spatial relationship among pixels.
Considering pixels as points (x, y), a transformation is performed through these two steps:

1. IMAGE MAPPING. We transform the coordinates, mapping each pixel to its new position:

(x', y') = T{(x, y)}

2. IMAGE RESAMPLING. This is required because the transformed positions do not always fall exactly on the pixel grid.

The basic geometric transformations in 2D are shown in the figure below.

Figure 4.1 Overview of the planar transformations

Since an image is a matrix, we use matrices to represent transformations. The table below shows the hierarchy of the transformations. By hierarchy we mean that each transformation preserves the properties listed in the rows below it. For instance, a similarity preserves not only angles but also parallelism and straight lines.

Figure 4.2 Planar transformations hierarchy

Figure 4.3 Example of application of translation

4.1.1 Homogeneous coordinates

Points in 2D can be expressed in homogeneous coordinates: it is often convenient to express points not as Euclidean points but as homogeneous points.
This is a mathematical trick to express transformations in a more compact way, consisting only of matrix multiplications. Indeed, we just need to multiply the original point by a non-zero constant and add a new dimension.

Conversion to homogeneous coordinates

A 2D point (x, y) becomes (kx, ky, k) for any constant k ≠ 0; typically k = 1, giving (x, y, 1).

Conversion from homogeneous coordinates

A homogeneous point (x, y, w) with w ≠ 0 maps back to the 2D point (x/w, y/w).

Example of translation in homogeneous coordinates and normal coordinates

Figure 4.4 Example of translation in homogeneous coordinates and normal coordinates.

4.1.2 Affine transformation

This transformation preserves the collinearity of the points, that is, points lying on the same line remain aligned on a line, and it preserves the ratios of distances along the line. Formally: given three points p1, p2 and p3 lying on the same line, the quantity |p2 − p1| / |p3 − p2| = k is constant.

Figure 4.5 Collinearity of the points

Example

Let's consider an affine transformation followed by a translation. We express it in normal and in homogeneous coordinates. Note that in homogeneous coordinates we combine multiple operations into a single matrix multiplication.
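
As an illustrative worked example (not the figure from the original notes): a rotation by an angle θ followed by a translation by (t_x, t_y) combines into a single 3×3 matrix in homogeneous coordinates:

\begin{equation*}
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}
=
\underbrace{\begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix}}_{\text{translation}}
\underbrace{\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{\text{rotation}}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
\end{equation*}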

Examples of affine transformations

4.1.3 Forward vs backward mapping

Given a transformation specified by a formula T and a source image (x, y), we


can compute the coordinates of the pixels in the new images (x0 , y 0 ) via two
techniques:

• FORWARD MAPPING. It is the naive implementation. For each (x, y) we
compute the corresponding (x', y'):

$(x', y') = T\{(x, y)\}$

This can lead to multiple points mapping to the same (x', y') and to missing
pixels, i.e. holes in the image: for instance, if the first pixel is mapped to 1
and the second to 10, we miss all the pixels between 1 and 10. Moreover,
(x', y') can be non-integer, which requires a resampling (*).
It is not used much.

• BACKWARD MAPPING. For each (x', y') we compute the corresponding
(x, y):

$(x, y) = T^{-1}\{(x', y')\}$

This does not lead to multiple points mapping to the same (x, y) and in
general there are no missing pixels. Moreover, (x, y) can be non-integer and
requires a resampling (*).
It is far better than forward mapping.

* The matrix multiplication seen before for the transformations can produce
non-integer points which do not lie on the pixel grid, hence they do not represent
valid pixels. We handle this issue by resampling the image.

Figure 4.6 Forward (on the left) vs Backward (on the right) mapping

4.1.4 cv::warpAffine

cv::warpAffine is a function that performs the affine transformation using the
inverse (backward) mapping.
void cv::warpAffine(
    cv::InputArray src,                           // input image
    cv::OutputArray dst,                          // output image
    cv::InputArray M,                             // 2x3 transform matrix
    cv::Size dsize,                               // destination image size
    int flags = cv::INTER_LINEAR,                 // interpolation, inverse map flag
    int borderMode = cv::BORDER_CONSTANT,         // handling of missing pixels
    const cv::Scalar& borderValue = cv::Scalar()  // constant borders
);

- TAKE A LOOK AT THE ONLINE TUTORIAL ON OPENCV SITE ABOUT


THE AFFINE TRANSFORMATION :
https://docs.opencv.org/3.4/d4/d61/tutorial_warp_affine.html
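As a usage reference, here is a minimal sketch (the file name, rotation angle and scale are hypothetical) that builds a 2x3 matrix with cv::getRotationMatrix2D and applies it with cv::warpAffine:

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat src = cv::imread("input.jpg");               // hypothetical input image
    if (src.empty()) return -1;

    // 2x3 affine matrix: rotation of 30 degrees around the image center, scale 1.0
    cv::Point2f center(src.cols / 2.0f, src.rows / 2.0f);
    cv::Mat M = cv::getRotationMatrix2D(center, 30.0, 1.0);

    cv::Mat dst;
    cv::warpAffine(src, dst, M, src.size());              // backward mapping, bilinear interpolation
    cv::imwrite("rotated.jpg", dst);
    return 0;
}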

DoF : Degree of freedom

The number of degrees of freedom is the number of independent directions along
which we can move the points defining the transformation. For instance, in the
image below each point can move along the x and y axes (two degrees per point);
with three points this gives 6 DoF in total.

Figure 4.7 DoF

4.2 Single-Pixel operations

Consider a grayscale image with L gray levels. Single-pixel operations or transforms
(also called intensity transforms) are functions that change the gray levels of an
image.

Figure 4.8 Single pixel operations. I(x, y) is a function representing the gray level
of the image. T (.) is the function performing the grey level change.

The three most used intensity transformation functions are the negative, logarithmic
and power-law transformations. (In this course we discuss only the most popular
transformations; in an industrial environment there are many more to know.)

Figure 4.9 Curves of the basic transformation functions. The inverse diagonal
represents the negative function.

4.2.1 Image negatives

The negative of an image with intensity levels in the range [0, L - 1] is obtained
by using the negative transformation function:

s=L−1−r

where s is the new pixel value and r the original pixel value.
This processing swaps dark and light: looking at the curve, we can see that the
lower levels (dark pixels) are mapped to higher levels (light pixels) and vice versa
for the higher levels.

Figure 4.10 Negative image example

4.2.2 Log transformations

The log transformation function is:

s = c log(1 + r)

where c is a constant and r >= 0.


This transformation makes all pixels lighter: dark pixels are mapped to light pixels
and light pixels are mapped to lighter pixels. This is useful for highlighting some
details hidden in the dark regions.

Figure 4.11 Example of log transformation

4.2.3 Power-law ( or gamma) transformations

The power-law (gamma) transformation function is:

$s = c\,r^{\gamma}$

where c and γ are positive constants.


This transformation behaves like the log transformation, with the difference that
it is tunable: we can choose how the gray-level intensities are remapped (contrast
manipulation). In particular, in figure 4.12 we can see that with γ = 1 the curve
reduces to the identity transformation, while γ > 1 produces the opposite effect of
γ < 1.

Figure 4.12 Gamma transformation curves

In the figure 4.13 we can see the effect produced by γ < 1. The pic at the top
left corner is the original image; the one at the top right corner is transformed
by applying γ = 0.6, the one at the bottom left corner has γ = 0.4, and the
last one γ = 0.3. We can see that as γ decreases some details become more
visible. However, with γ = 0.3 the contrast is reduced and the image starts to have
a 'washed-out' appearance.

Figure 4.13 Gamma transformation example γ < 1

In the figure 4.14 we can see the effect produced by γ > 1. The pic at the top
left corner is the original image; the one at the top right corner is transformed by
applying γ = 3, the one at the bottom left corner has γ = 4, and the last one
γ = 5. We can see that as γ increases some details become more visible (and the
contrast gets higher).

Figure 4.14 Gamma transformation example γ > 1
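A minimal sketch of the power-law transformation implemented with a lookup table (the file name and γ = 0.4 are hypothetical choices; the constant c is chosen so that the output stays in [0, 255]):

#include <opencv2/opencv.hpp>
#include <cmath>

int main()
{
    const double gamma = 0.4;                              // hypothetical value; gamma < 1 brightens
    cv::Mat lut(1, 256, CV_8U);
    for (int r = 0; r < 256; ++r)
        lut.at<uchar>(r) = cv::saturate_cast<uchar>(std::pow(r / 255.0, gamma) * 255.0);

    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);   // hypothetical input
    cv::Mat out;
    cv::LUT(img, lut, out);                                // s = c * r^gamma applied per pixel
    cv::imwrite("gamma.png", out);
    return 0;
}

The same lookup-table approach works for the negative (lut[r] = 255 - r) and for the log transformation.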

4.3 Contrast stretching and thresholding

Single-pixel operations can be used for enhancing contrast (which, we recall, is
the difference between the highest and lowest intensity levels in the image) and for
applying a threshold to an image.
For such operations we use piecewise linear transformation functions. For instance,
in figure 4.15 we enhance the contrast of a low-contrast image. A common
piecewise function for such an operation is defined starting from two points (r1 , s1 )
and (r2 , s2 ).

Figure 4.15 Contrast stretching example

A piecewise function for thresholding is instead defined setting the points (r1 , 0)
and (r2 , L − 1) where r1 = r2 .

Figure 4.16 Thresholding example

4.3.1 Intensity slicing

We can edit the contrast of an image only for a subset of pixels (hence of gray levels).
To achieve this, we use piecewise functions that convert specific gray levels to
other ones.

Figure 4.17 Intensity slicing

4.3.2 Bit plane slicing

Instead of highlighting a subset of the gray levels of an image, we can highlight


a subset of the bits encoding the gray levels of the image. This is useful for
analysing an image and seeing how much each bit plane contributes. Another
useful application is compression: we can remove the bit planes that do not
contribute much to the final image.
How do we highlight a subset of the bits encoding the gray levels of the image? We
can decompose a grayscale image into 8 bit planes: a 256-level grayscale image
is encoded on 8 bits, so we slice the 8 bits over 8 planes, where the first
plane contains the most significant bit of each pixel and the last one the least
significant bit.

Figure 4.18 bit planes

In particular, the slicing operation consists in putting the n-th bit of the binary
representation of each pixel into the n-th plane at the corresponding pixel
position (see figure 4.19). We hence obtain a binary image for each bit plane.
To show a binary image graphically, we just need to display black for 0 and
white for 1. We can obtain it from the original image by thresholding: for instance,
in the plane of the most significant bit, at a generic pixel position we get 1 if
the pixel value is greater than or equal to 128 and 0 otherwise; we can hence apply a
threshold that maps to 0 all the pixel values between 0 and 127, and maps to 1
the values between 128 and 255.

Figure 4.19 Slicing operations

For constructing the final image, or for displaying the combination of only some planes,
we multiply each binary pixel of the i-th plane by the weight of the bit that plane
carries (in the 3-bit example of figure 4.19 the planes carry the weights 4, 2 and 1);
we do this for all the planes we would like to combine, and eventually we sum the
values obtained for the different planes.
For instance, in figure 4.19, when combining all the planes the second pixel in the
first column of the final image is computed as follows (and similarly for all the
other pixels):

1 · 4 + 0 · 2 + 1 · 1 = 5

while the same pixel computed by considering only the first two planes is:

1 · 4 + 0 · 2 = 4

Figure 4.20 bit planes slicing example
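A minimal OpenCV sketch of bit-plane slicing and of the reconstruction from the two most significant planes (the file names are hypothetical); each extracted plane is thresholded to 0/255 only for display:

#include <opencv2/opencv.hpp>
#include <string>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // extract the 8 bit planes (b = 7 is the most significant bit)
    for (int b = 7; b >= 0; --b)
    {
        cv::Mat plane, display;
        cv::bitwise_and(img, cv::Scalar(1 << b), plane);             // keep only bit b
        cv::threshold(plane, display, 0, 255, cv::THRESH_BINARY);    // black/white for display
        cv::imwrite("plane_" + std::to_string(b) + ".png", display);
    }

    // reconstruction using only the two most significant planes
    cv::Mat rec = cv::Mat::zeros(img.size(), CV_8UC1);
    for (int b = 7; b >= 6; --b)
    {
        cv::Mat bit;
        cv::bitwise_and(img, cv::Scalar(1 << b), bit);               // bit b already carries its weight 2^b
        cv::add(rec, bit, rec);
    }
    cv::imwrite("reconstructed_2planes.png", rec);
    return 0;
}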

4.4 Histogram

Let $r_k$, k = 0, 1, 2, ..., L-1, denote the intensities of an L-level digital image f(x, y).
The histogram of f is defined as:

$h(r_k) = n_k$

where $n_k$ is the number of pixels in the image f having intensity $r_k$.
Each subdivision of the intensity scale is called a histogram bin, and its value
indicates the number of pixels having intensity $r_k$.

Figure 4.21 histogram bins

Figure 4.22 histogram examples

The normalized histogram is defined as:

$p(r_k) = \frac{h(r_k)}{MN} = \frac{n_k}{MN}$

where M and N are the number of image rows and columns respectively (hence
MN is the total number of pixels in the image). In this course we use the term
histogram to refer to the normalized histogram.
The sum of $p(r_k)$ over all k is 1: this lets us treat it as a probability density function
(PDF).

4.4.1 Histogram equalization

To equalize a histogram we need an equalization function T() that maps each
input gray level r to a new gray level s = T(r). Equalization tries to flatten the
histogram as much as possible; in practice, histograms are never perfectly flat.
We need to define T(r). Such a function needs to be:

• a monotonically increasing function (if we use the inverse transformation
$T^{-1}(s)$, then T(r) needs to be strictly monotonically increasing so that
the inverse is unique)

• $0 \le T(r) \le L - 1$ for the domain $0 \le r \le L - 1$

• continuous and differentiable

T (r) is then defined as the CDF (Cumulative distribution function) of the random
variable r.
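In the discrete case this leads, up to rounding, to the standard equalization mapping (a reference formulation, consistent with defining T(r) as the CDF), where $p_r(r_j)$ is the normalized histogram of the input image:

$s_k = T(r_k) = (L - 1) \sum_{j=0}^{k} p_r(r_j), \qquad k = 0, 1, \ldots, L - 1$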

Figure 4.23 Continuous and discrete T (r)

Figure 4.24 Monotonic and strict monotonic increasing function

Figure 4.25 Equalization result

Figure 4.26 Equalization example of an image with only 8 gray levels. Given the
histogram identified by the table at the left top corner, we use the function in the
middle (si ) and we obtain the equalized histogram whose values are reported in
the table at the top right corner.
The starting histogram, the equalization function and the equalized histogram are
displayed at the bottom in order.

Figure 4.27 Example of equalization of an image with 256 gray levels. We can see
the equalization functions T used per each image 1,2,3,4

4.4.2 Histogram in OpenCV

In OpenCV the function returning the histogram of an image is the cv::calcHist


function (https://docs.opencv.org/4.5.3/d8/dbc/tutorial_histogram_calculation.html).
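A minimal sketch of computing (and normalizing) the 256-bin histogram of a grayscale image with cv::calcHist (file name hypothetical); cv::equalizeHist is also shown, since it performs the global equalization discussed above:

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    int histSize = 256;                          // one bin per gray level
    float range[] = {0, 256};                    // upper bound is exclusive
    const float* ranges[] = {range};
    cv::Mat hist;
    cv::calcHist(&img, 1, 0, cv::Mat(), hist, 1, &histSize, ranges);
    hist /= (double)(img.rows * img.cols);       // normalized histogram p(r_k) = n_k / MN

    cv::Mat equalized;
    cv::equalizeHist(img, equalized);            // global histogram equalization
    cv::imwrite("equalized.png", equalized);
    return 0;
}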

4.4.3 Histogram specification

Sometimes we would like the processed image to have a specific histogram that we
choose. This method is called histogram specification.
Given the input gray levels r and the specified PDF $p_z(z)$ that we wish the
output image to have, we can compute the new gray levels z, distributed according
to $p_z(z)$, with the following procedure:

1. We compute the histogram pr (r) of the input image.

2. We equalize pr (r) and round the resulting values sk to the closest integer
in the range [0, L-1]

3. We normalize pz (z). We indicate the normalized histogram with G(zq ) for


q = 0, 1, ..L − 1. We round all values G(zq ) to the closest integer in the
range [0,L-1] and store them in a table.

4. We compute the inverse transformation $z_q = G^{-1}(s_k)$ for $s_k = 0, 1, ..., L - 1$,
thus creating a map from s to z.
In practice we do not compute the inverse explicitly but exploit the $G(z_q)$
computed before: for each equalized value $s_k$, k = 0, 1, ..., L-1, we use the
stored values of G to find the value $z_q$ such that $G(z_q)$ is the closest integer
to $s_k$.

5. Using the map from s to z we create the histogram-specified image: we replace
every pixel value $s_k$ in the histogram-equalized image with the value $z_q = G^{-1}(s_k)$.

Figure 4.28 Schema of process for computing the histogram specification

Example

First step we compute pr (r) and we normalize it and round the values.

Figure 4.29 First step

Second step, given a specific PDF pz (z), we normalize it and round the values.

Figure 4.30 Second step

Now we need to create the map from s to z using $z_q = G^{-1}(s_k)$. Then we use
this map on the histogram-equalized image to get the histogram-specified
image. For instance, $s_0 = 1$ is mapped to $z_3$ since $G(z_3) = 1$. This means that
every pixel whose value is 1 in the histogram-equalized image maps to a
pixel value of 3 in the histogram-specified image.

Figure 4.31 Map from s to z

Note that the result does not match the specified histogram exactly, but the
trend of the specified histogram, i.e. moving the intensities toward the high end of
the intensity range, is achieved.

Figure 4.32 The histogram-specified image does not exactly match the specified histogram

Comparison Histogram equalization and histogram specification

Equalization is not always a good strategy, as shown in the following example.
In the example below we might expect that some details in the dark area become
more visible by applying the equalization. However, this does not happen: the
equalization produces a washed-out image. This always happens when the
number of pixels with low gray levels is high. In this scenario specification performs
better.

Figure 4.33 Input image and its histogram

Figure 4.34 Equalization and specification histogram

4.4.4 Local histogram equalization

What we have seen so far was global: it was computed on the entire image. We can
however enhance detail over small areas of an image (hence locally). The procedure
works as follows:

• We define a neighbourhood (for instance a 3x3 submatrix) and we move
its center pixel by pixel in a horizontal or vertical direction.

• At each location we compute the histogram equalization or specification of
the pixels in the neighbourhood (the submatrix).

• We move the center of the neighbourhood to an adjacent pixel and reiterate
the procedure.
NOTE: by moving the center of the neighbourhood we might overlap
previously computed pixels and hence update their values.

Figure 4.35 Local histogram example
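A minimal sketch of the sliding-window procedure described above, assuming an 8-bit grayscale input and a hypothetical 15x15 neighbourhood; each center pixel is remapped with the equalization function (the local CDF) computed on its neighbourhood only:

#include <opencv2/opencv.hpp>

cv::Mat localEqualize(const cv::Mat& src, int win = 15)
{
    CV_Assert(src.type() == CV_8UC1 && win % 2 == 1);
    const int r = win / 2;
    cv::Mat padded, dst(src.size(), CV_8UC1);
    cv::copyMakeBorder(src, padded, r, r, r, r, cv::BORDER_REFLECT);

    for (int y = 0; y < src.rows; ++y)
        for (int x = 0; x < src.cols; ++x)
        {
            cv::Mat nb = padded(cv::Rect(x, y, win, win));   // neighbourhood centered on (x, y)
            int hist[256] = {0};
            for (int i = 0; i < nb.rows; ++i)
                for (int j = 0; j < nb.cols; ++j)
                    ++hist[nb.at<uchar>(i, j)];
            // local CDF evaluated at the original value of the center pixel
            int cum = 0, v = src.at<uchar>(y, x);
            for (int k = 0; k <= v; ++k) cum += hist[k];
            dst.at<uchar>(y, x) = cv::saturate_cast<uchar>(255.0 * cum / (win * win));
        }
    return dst;
}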

Chapter 5

Spatial filtering

Filters are generally used for enhancement and removal of unnecessary components
from an image. Spatial filtering defines a local operation.

5.1 Local operation

By local operations we mean functions that edit a pixel of an image based on the
value of the pixel itself and the values of its neighbouring pixels. Local operations
are defined by a filter, aka kernel, which defines the neighbourhood involved
in the computation. Each neighbourhood pixel is associated with a weight.

Figure 5.1 Schematic of a local operation. The new value of the red pixel (s) is
computed as a function of the original value of the red pixel and of its neighbours (the
set of green highlighted pixels). The rightmost image shows the filter with
the weights associated with each neighbour pixel.

Local operations are performed in the spatial domain of the image, that is, the
space containing the pixels. For this reason we refer to local operations as
spatial filtering, and the kernel/filter is called a spatial filter.
Based on how the filtering process is performed on the image, the filter can be
linear or non-linear. For instance, if the filter evaluates a correlation/convolution,
it is linear; if it evaluates the min/max, it is non-linear.

Figure 5.2 Spatial filter

5.2 Spatial Linear filtering

Spatial linear filtering means that each output pixel is a linear combination of
input pixels, expressed as a weighted summation. This operation is directly
obtained by computing the correlation (we can also call it convolution)
of each pixel of the image with the weights of the filter.
Supposing a filter of dimensions m x n, with m = 2a + 1 and n = 2b + 1, the correlation
operation is defined by:

Figure 5.3

In practice, convolution means superimposing the filter on each pixel location; the
new pixel (i.e. the middle one) is then computed as the sum of the products of the
pixel values with the corresponding filter weights.

Figure 5.4 Correlation/convolutional filter
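In OpenCV, linear spatial filtering with an arbitrary kernel is available through cv::filter2D (which computes the correlation); a minimal sketch with a 3x3 normalized box kernel (file name hypothetical):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // 3x3 averaging kernel: all weights equal and summing to 1 (brightness preserved)
    cv::Mat kernel = cv::Mat::ones(3, 3, CV_32F) / 9.0f;

    cv::Mat filtered;
    cv::filter2D(img, filtered, -1, kernel);   // -1: output has the same depth as the input
    cv::imwrite("filtered.png", filtered);
    return 0;
}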

How do we compute the convolution with an even-sized or non-square kernel? We
pad it in order to make it odd-sized, placing the non-zero values in the right positions,
considering that the middle point is the one that will be replaced.
For computing the correlation at a pixel placed at the border of the image, we
instead pad the image with 0's along the adjacent sides.

Figure 5.5 Example of padding. The two kernels are the derivative kernels; in
particular the 2x2 derivative kernel is called the Roberts filter

Figure 5.6 Correlation/convolutional filter

Correlation vs Convolution

In the CV context convolution and correlation are often used as synonyms: filters
are usually symmetric, and this makes convolution and correlation equal.
The filters obtained by applying convolution are called convolutional filters.

Figure 5.7 Comparison of the correlation and convolution formulas

5.2.1 Filters and brightness

The filter weights can change the image brightness. In particular, the brightness
is unchanged if $\sum_i w_i = 1$, where $w_i$ are the weights of the convolutional filter.
If the sum is less than 1 the brightness is decreased, and if it is greater than 1 the
brightness is increased.

5.2.2 Smoothing filters

Smoothing (also called averaging) spatial filters are used to reduce sharp transitions
in intensity, i.e. to blur, which results in denoising. Indeed, noise often
appears as sharp transitions in intensity. They are also used for reducing irrelevant detail
in an image (by irrelevant we mean small pixel regions). Smoothing is often
referred to as lowpass filtering, a term borrowed from frequency-domain processing.
The size of the filter can be increased: the larger the filter size, the stronger the
smoothing (blurring).
There are two types of linear smoothing filters:

• Average filter

• Gaussian filter

Average filter

An average filter consists of a simple normalized box filter. It simply takes the
average of all the pixels under the kernel area and replaces the central element with it.
A box kernel assigns the same weight to all the pixels.
In section 5.3.2, figure 5.28 reports some common averaging filters.

Gaussian filter

It is a special case of averaging filter. A Gaussian kernel is an average kernel
whose weights follow a 2D Gaussian distribution (a bell-shaped surface when plotted
in 3D), i.e. they decrease as a function of the distance from the kernel center; the
kernel is normalized by dividing its coefficients by their sum. If we look at it from
the top, the maximum is at the middle pixel and the weights decrease moving toward
the edges, thus penalizing the farther points. Hence a Gaussian filter takes the
neighbourhood around the pixel and computes its Gaussian-weighted average.
The weights of an m x n Gaussian kernel are computed by the formula:

$w(r) = G(r) = K e^{-\frac{r^2}{2\sigma^2}}$

where K is a constant and r is the Euclidean distance between the middle point p
and a neighbour pixel q, $r = \|p - q\|$. Hence the new value of the middle point p
computed on a window S would be:

$I_{new}(p) = \sum_{q \in S} G(\|p - q\|)\, I(q)$

Figure 5.8 Left picture: kernel illustration. Right picture: distances from the
middle pixel for different odd sized kernel.

Gaussian filtering effects vary based on the size of the kernel and on the standard
deviation: the bigger the kernel size, the stronger the smoothing; the bigger the
standard deviation, the stronger the smoothing. Anyway, we do not need very large
Gaussian kernels: the values of a Gaussian function at a large distance from the
mean are small enough to be ignored, hence nothing is gained by using very large
Gaussian kernels.
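A minimal sketch of the two linear smoothing filters in OpenCV (file name, kernel size and σ are hypothetical choices):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    cv::Mat boxSmoothed, gaussSmoothed;
    cv::blur(img, boxSmoothed, cv::Size(5, 5));                // normalized box (average) filter
    cv::GaussianBlur(img, gaussSmoothed, cv::Size(5, 5), 1.5); // Gaussian kernel, sigma = 1.5

    cv::imwrite("box.png", boxSmoothed);
    cv::imwrite("gauss.png", gaussSmoothed);
    return 0;
}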

Figure 5.9 Two averaging filters. The one at the left is a normalized box filter.
The right kernel is a Gaussian kernel.

Figure 5.10 Gaussian kernel.

Figure 5.11 Averaging filters example. Note how the blurring increases as the
window size increases.

Figure 5.12 Gaussian smoothing example. Note how the blurring increases as σ
increases.

5.2.3 Separable filter kernels

A spatial filter kernel is a matrix; when it has rank 1 it can be expressed as the
outer product of two vectors.
For instance, a square filter of size n x n can be separated into two filters: one of size
n x 1 and one of size 1 x n.
This operation lets us speed up the computation: the cost per pixel drops from
n² to 2n multiplications.

Figure 5.13 Example of filter separated in a column vector and a row vector.
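For example, the 5x5 box kernel can be applied as two 1D passes with cv::sepFilter2D; a minimal sketch (file name and kernel size hypothetical):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // 1x5 and 5x1 averaging vectors: their outer product is the 5x5 box kernel
    cv::Mat kx = cv::Mat::ones(1, 5, CV_32F) / 5.0f;
    cv::Mat ky = cv::Mat::ones(5, 1, CV_32F) / 5.0f;

    cv::Mat out;
    cv::sepFilter2D(img, out, -1, kx, ky);   // 2*5 multiplications per pixel instead of 25
    cv::imwrite("separable.png", out);
    return 0;
}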

5.2.4 Sharpening (aka derivative) filters

Sharpening filters produce the opposite result of averaging filters: they enhance
the transitions in intensity, i.e. the edges. Sharpening is often referred to as
highpass filtering: high frequencies (which are responsible for fine details) are
passed, while low frequencies are attenuated or rejected.
Derivative filters can implement a first- or second-derivative operator. We recall the
mathematical definitions of the first and second derivative operators.

A first order derivative operator :

• is zero in flat segments

• is non zero on the onset of a step/ramp

• is non zero along ramps

A second order derivative operator :

• is zero in flat segments

• is non zero on the onset and at the end of a step/ramp

• is zero along ramps of constant slope

Figure 5.14 Example of computation of first and second order derivative on pixels.
The scan line represents an image in 1D.

First derivative operator

First derivatives in image processing are implemented as the magnitude of the
gradient (denoted M(x,y); it is also called the gradient image or simply the gradient,
and it has the same size as the original image). The computation of the gradient
is done in the following steps:

• we compute the gradient for each pixel, i.e. the two partial derivatives.
This is achieved by convolving two kernels (one kernel for computing $f_x$
and one for computing $f_y$) with the image. Hence, we obtain two
images representing the two partial derivatives.

• we compute the magnitude image from the previous two images according
to one of the two magnitude formulas: usually the squares and square root of the
exact formula $M(x, y) = \sqrt{f_x^2(x, y) + f_y^2(x, y)}$ are approximated by the
absolute values, $M(x, y) \approx |f_x(x, y)| + |f_y(x, y)|$. This is done for the sake
of the computational cost.

The kernels used for computing the partial derivatives are built either from
the direct formulas (reported in figure 5.15 with the corresponding kernels) or
by means of different formulations that imply larger filters, such as
the Roberts (2x2) and Sobel (3x3) kernels (figure 5.18).

Figure 5.15 Filters of the direct application: 2 elements for both the vector con-
stituting the gradient

Figure 5.16 Summary of the first-order derivative. Isotropic means invariant
to rotations.
Note that the square root and the modulus are not linear operators.

3x3 filters are the most commonly used, since the direct-application filters
(whose sizes are 2x1 and 1x2) are missing a centered element that defines a
center of spatial symmetry. Moreover, larger filters allow a better noise reduction.
Larger filters come from two different formulations, called Roberts and
Sobel. Roberts considers two pixels for computing each gradient component,
while Sobel takes into account 6 pixels and gives more importance
to the pixels closest to the center (they are multiplied by 2). The Roberts and Sobel
filters are square filters and are represented in figure 5.18 with their formulations.
The sum of the elements of the Sobel filters is 0, hence the brightness of the filtered
image is darker than that of the original one.

Figure 5.17 Sobel gradient computed from the left-most figure.

Figure 5.18 Roberts (2x2 matrices) and Sobel (3x3 matrices) filters. z5 indicates
a generic pixel at position (x,y) of the image f(x,y)
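A minimal sketch of computing the gradient magnitude with the built-in Sobel kernels, using the absolute-value approximation of the magnitude (file name hypothetical):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    cv::Mat gx, gy, ax, ay, mag;
    cv::Sobel(img, gx, CV_16S, 1, 0, 3);        // df/dx with the 3x3 Sobel kernel
    cv::Sobel(img, gy, CV_16S, 0, 1, 3);        // df/dy
    cv::convertScaleAbs(gx, ax);                // |gx|, back to 8 bit
    cv::convertScaleAbs(gy, ay);                // |gy|
    cv::addWeighted(ax, 0.5, ay, 0.5, 0, mag);  // approximate magnitude (scaled |gx| + |gy|)
    cv::imwrite("gradient.png", mag);
    return 0;
}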

Second derivative operator

The second derivative operator is the Laplacian. Given an image f (x, y), the
Laplacian is defined as (we write f instead of f (x, y)):

$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4 f(x, y) \quad (5.1)$

where the partial second-order derivatives are:

$\frac{\partial^2 f}{\partial x^2} = f(x+1, y) + f(x-1, y) - 2 f(x, y) \quad (5.2)$

$\frac{\partial^2 f}{\partial y^2} = f(x, y+1) + f(x, y-1) - 2 f(x, y) \quad (5.3)$

The mathematical definition of the Laplacian is isotropic. However, the kernel
representing it is isotropic only for rotations of 90 degrees. We can extend the isotropy
of the kernel to 45 degrees by extending the Laplacian formulation: the extended
version is obtained by adding the diagonal neighbours to formula 5.1, each diagonal
pair contributing an extra $-2 f(x, y)$ term. Hence, we end up with $-8 f(x, y)$ in
formula 5.1. See figure 5.20.

Figure 5.19 Summary of the second-order derivative. Note that the Laplacian is a
linear operator

Laplacian is used for sharpening images. Given an image f (x, y), the sharpened
image g(x, y) is obtained by the following equation involving the second derivative:

g(x, y) = f (x, y) + c[∇2 f (x, y)]

where c = −1 if the center coefficient of the laplacian filter is negative (we subtract
the laplacian image); otherwise c = 1.
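A minimal sketch of this sharpening step with cv::Laplacian (file name hypothetical; with the default 3x3 aperture the center coefficient of the kernel is negative, so c = -1 and the Laplacian image is subtracted):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    cv::Mat f, lap, sharp;
    img.convertTo(f, CV_32F);
    cv::Laplacian(f, lap, CV_32F);      // Laplacian image
    sharp = f - lap;                    // g = f + c * Laplacian, with c = -1
    sharp.convertTo(sharp, CV_8U);      // saturate back to [0, 255]
    cv::imwrite("sharpened.png", sharp);
    return 0;
}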

Figure 5.20

Figure 5.21 Example laplacian computation. Top-left image is the original image.
The top-right image is the Laplacian image obtained with the first filter in the
figure 5.20. The bottom left image is the sharpened image using c = -1. The
bottom right image is the sharpened image using c = 1.

5.2.5 Combination of multiple techniques

Figure 5.22 Combination of multiple techniques

5.3 Non-linear filtering & Restoration filters

Non-linear filters are mainly used as restoration filters for image restoration, i.e. for
denoising. However, smoothing filters can also be employed for noise reduction.
A corrupted image may be restored using:

• A smoothing linear filter:

– Averaging filter
– Gaussian filter

• A non-linear filter:

– Median filter
– Bilateral filter
– Min or max filter (for salt and pepper noise)
– Alpha-Trimmed Mean Filter
– Adaptive filters

5.3.1 Image restoration overview

Image restoration is the operation of reducing the noise affecting an image.
Indeed, every electrical signal is affected by noise. In particular, we can think of
Gaussian noise, which is the typical noise affecting electrical signals.

$f_{acquired}(x, y) = f(x, y) + \eta(x, y)$

Figure 5.23 Classic model of noise affecting signal acquisition. The noise here is
represented as an image of the same size as the input image.

What causes noise in an image? A common example of noisy images are images
taken at night: because of the lack of light the signal is weak and it gets amplified
by the ISO setting. Moreover, a sensor of small area catches a small amount of
light compared to a larger sensor, hence the noise will be more evident.

In order to restore the original signal, we exploit a restoration filter which provides
the best estimate of the original image. To apply a proper restoration filter we
would need to estimate the noise and then remove it. In practice, algorithm
designers do not look at the image and try to estimate the noise: they directly use
Gaussian filters for noise reduction, since Gaussian noise is the most common, as
said before.

Noise models

There are different shapes of noise, and they are represented as PDFs. For simulation
purposes we can generate a noise image as a matrix of random intensity values
drawn from a specified probability density function (this approach is not
used for salt-and-pepper noise, though).

Figure 5.24 Most common PDFs in image processing applications. The axes
represent intensity values. Rayleigh is pronounced 'ray-lee'.

We know that for a Gaussian random variable the probability that the values of z are
in the range $\bar{z} \pm \sigma$ is approximately 0.68, while the probability that the values of
z are in the range $\bar{z} \pm 2\sigma$ is about 0.95.
Let's analyse how the different noise models affect the test pattern shown in
figure 5.25.

Figure 5.25 Test pattern

Figure 5.26 Different noise shapes applied to the test pattern. The gamma noise
causes a darker image since the gamma noise PDF has peaks closer to the dark
intensity.

Figure 5.27 Different noise shapes applied to the test pattern

Note that the histograms resulting from adding the noise to the image have the
same shape as the PDF of the noise. Indeed, the (normalized) histogram is by definition a PDF.

Salt and Pepper (or impulse) noise

Salt-and-pepper noise resembles salt and pepper granules distributed randomly over
the image, hence the name of this type of noise. Pepper means some pixels are
shifted towards darker values, salt the opposite.
We indicate with P, and call it noise density or just spatial density, the probability
that a pixel is corrupted by salt or pepper noise; it is defined as $P = P_s + P_p$,
where $P_s$ and $P_p$ are the probabilities that a pixel is corrupted by salt noise and pepper
noise respectively. For instance, if an image is corrupted by salt-and-pepper noise
with $P_s = P_p = 0.25$, it means that a quarter of the pixels are turned white and a
quarter black, i.e. half of the pixels are completely out of their original values. An
example of salt-and-pepper noise is shown in figure 5.27 (i).
As a rule of thumb, a low (hence good) spatial density means $P_s$ and $P_p$ less than 0.2.

5.3.2 Averaging filters

Given an image f, we replace the intensity value of each pixel with the value
g(x, y) computed by means of one of the m x n averaging kernels reported in figure
5.28.

Figure 5.28 Averaging (or mean) filters overview

Figure 5.29 Arithmetic and geometric mean filters applied

5.3.3 Median filters

The median filter replaces the value of a pixel with the median value of the pixel's
neighbourhood, defined by the kernel size. How is the median defined? The
median is the middle number in a sorted (ascending or descending) list of numbers.
An n x n median filter takes the n x n area of the image overlapping the filter,
computes the median of the values of the pixels in that area, and replaces the
centre pixel with the computed median.
Median filters discard the highest and lowest peaks that might come from
salt-and-pepper noise, while an averaging filter considers those values when
computing the new pixel intensity.

Figure 5.30 Median computation. The values in the median filter correspond to
those of the original image. The median is 7 and it replaces the value 2.
Note that the value 92 is due to salt-and-pepper noise with probability
0.1. Salt-and-pepper noise causes a lot of problems to averaging filters: 92 is
rejected by the median but not by the average.

If we apply the same median filter multiple times, it does not remove the information
content of the image (in the images below we do not see many differences among
the results of repeated applications of the median filter). This is an interesting property,
since it does not hold for Gaussian smoothing: applying it several times
progressively destroys the image.
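A minimal sketch with cv::medianBlur (file name hypothetical; the kernel size must be odd):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat noisy = cv::imread("noisy.png", cv::IMREAD_GRAYSCALE);
    cv::Mat denoised;
    cv::medianBlur(noisy, denoised, 3);   // 3x3 median filter, effective on salt-and-pepper noise
    cv::imwrite("median.png", denoised);
    return 0;
}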

Figure 5.31 Median filters example about passing the same median filter more
than once

5.3.4 Bilateral filter

The bilateral filter is a non-linear spatial filter, highly effective in noise removal while
keeping edges sharp. The output image results smoothed everywhere but at
the borders. It is more computationally demanding than the other filters.
This is due to the fact that bilateral filters exploit two Gaussian kernels:

• a Gaussian kernel function of space (spatial kernel). This is the definition


of Gaussian kernel seen previously: it takes the neighbourhood around the
pixel and finds its Gaussian weighted average.

• a Gaussian kernel function of pixel intensity difference (range kernel).


Why do we need this? The disadvantage of the Gaussian kernel in space
is that it doesn’t consider whether pixels have almost the same intensity and
hence whether a pixel is an edge pixel or not. So it blurs the edges also,
which we would like to avoid. For achieving it, we need this second different
Gaussian kernel.

Thus, the Gaussian function of space makes sure that only nearby pixels are
considered for blurring, while the Gaussian function of intensity difference makes
sure that only those pixels with similar intensities to the central pixel are considered
for blurring. So it preserves the edges since pixels at edges will have large intensity
variation. The bilateral filter is described by the following equation:

Figure 5.32 Bilateral kernel. p is the position of the middle pixel with intensity
I(p), while q is a neighbour pixel with intensity I(q). In (p) would be the new
intensity value of the middle pixel.

Figure 5.33 Gaussian domain and range filters.
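A minimal sketch with cv::bilateralFilter (the file name, the diameter and the two sigmas are hypothetical values; sigmaColor corresponds to the range σr and sigmaSpace to the spatial σs):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    cv::Mat out;
    // d = 9: neighbourhood diameter; sigmaColor = 50 (range kernel); sigmaSpace = 5 (spatial kernel)
    cv::bilateralFilter(img, out, 9, 50, 5);
    cv::imwrite("bilateral.png", out);
    return 0;
}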

Figure 5.34 The bilateral filter smooths an input image while preserving its edges.
Each pixel is replaced by a weighted average of its neighbours. Each neighbour
is weighted by a spatial component that penalizes distant pixels and a range
component that penalizes pixels with a different intensity. The combination of both
components ensures that only nearby similar pixels contribute to the final result.
The weights are represented for the central pixel (under the arrow).

Figure 5.35 If we increase σr too much, we end up with just a normal
Gaussian filter effect.
Changing σs preserves the edges but makes some regions smoother.

Bilateral vs Median

The bilateral filter reduces more noise than the median filter. If we increase the
median kernel size to remove more noise, we also start to lose the edges.

Figure 5.36

5.3.5 Max and min filters

Max and min filters are non-linear filters used to manage salt-and-pepper noise.
They simply select the min or max pixel in the neighbourhood:

• Max filter: highlights salt noise and removes pepper noise. The max
filter selects the maximum-intensity pixel in the neighbourhood.

• Min filter: highlights pepper noise and removes salt noise. The min
filter selects the minimum-intensity pixel in the neighbourhood.
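In OpenCV the min and max filters can be obtained with morphological erosion and dilation, which for grayscale images compute the local minimum and maximum respectively; a minimal sketch with a 3x3 neighbourhood (file name hypothetical):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("noisy.png", cv::IMREAD_GRAYSCALE);
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));

    cv::Mat minFiltered, maxFiltered;
    cv::erode(img, minFiltered, kernel);    // local minimum: removes salt noise
    cv::dilate(img, maxFiltered, kernel);   // local maximum: removes pepper noise
    cv::imwrite("min.png", minFiltered);
    cv::imwrite("max.png", maxFiltered);
    return 0;
}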

Figure 5.37 Max and min filters example

Figure 5.38 Max and min filter example

5.3.6 Alpha-trimmed mean filters

Alpha-trimmed mean filters are non-linear filters. Basically, it is a mean filter
expressed by a different formulation, called the alpha-trimmed mean.

Figure 5.39 Alpha-trimmed mean formula

The variable d is used to trim the d/2 lowest and the d/2 highest intensity values in
the neighbourhood. If d = 0 the filter reduces to the arithmetic mean filter; if
d = mn − 1 the filter reduces to the median filter.
It can be used to eliminate spikes.

Figure 5.40 Alpha-trimmed mean example. The value 2 of the centered pixel is
replaced with the intensity value 7 ( round up of 6.8)

Figure 5.41 Comparison of filters. The 5x5 Alpha-trimmed mean filter is computed
with d=5

5.3.7 Adaptive filters

Adaptive filters are non-linear filters. They tune their effect based on the local image
variance (indicated as $\sigma_L^2$) of the neighbourhood centered on
the pixel (x, y). In particular, the effect of such a filter depends on the comparison
between the local image variance $\sigma_L^2$ and the variance of the noise corrupting the
image, indicated as $\sigma_\eta^2$, which is assumed known a priori. We distinguish the following
cases for the comparison (a formula sketch is given after the list):

• $\sigma_\eta^2 \ll \sigma_L^2$: the filter is weak. The new pixel value is close to the old noisy
value, so the new image resembles the original one. In particular, in this case
we have a high local variance, which is typically associated with edges, which
are hence preserved.

• $\sigma_\eta^2 \approx \sigma_L^2$: the filter effect is strong, and in this case it returns the arithmetic
mean value of the pixels in the neighbourhood. Such a filter is hence called an
ADAPTIVE MEAN FILTER.
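A reference sketch of the filter's formula, assuming the classical adaptive local noise-reduction formulation is the one intended here:

$\hat{f}(x, y) = g(x, y) - \frac{\sigma_\eta^2}{\sigma_L^2}\left[ g(x, y) - m_L \right]$

where $g(x, y)$ is the noisy pixel value and $m_L$ the local mean of the neighbourhood. When $\sigma_\eta^2 \ll \sigma_L^2$ the correction term is small and the pixel is left almost unchanged (edges preserved); when $\sigma_\eta^2 \approx \sigma_L^2$ the output approaches the local mean $m_L$, i.e. the arithmetic mean of the neighbourhood.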

Figure 5.42 Adaptive mean filter. In this case $\sigma_\eta^2 = 1000 \approx \sigma_L^2$. We note that
image d is much sharper than the other ones.

Adaptive median filters

Median filters perform well with a low spatial density ($P_s$ and $P_p$ less than 0.2);
for higher spatial densities the adaptive median filter performs better.
The adaptive median filter is trickier than the adaptive mean filter, since the size
of the neighbourhood is changed at run time depending on certain conditions
(more info at p. 338 of the Digital Image Processing book).

Figure 5.43 Adaptive median filter vs median filter

Usually adaptive filters perform better than the corresponding non-adaptive
filters. However, there are also cases where adaptive filters are slightly worse:
when that happens, it affects only some tiny part of the image, while the overall
image still looks good. Adaptive filters have in any case a higher computational cost
because of their higher complexity.

5.4 Filters for colour images

We do not simply filter the single RGB channels; there are other colour encodings
better suited for this. We did not go into details in the course.

Chapter 6

Filtering in frequency

6.1 Fourier Transform applied to image

The Fourier transform decomposes a function into a sum of sinusoidal functions of
different frequencies.
Images have some differences from commonly used signals: images have 2
dimensions and only positive values of x and y.

Figure 6.1 General Fourier transforms example

Figure 6.2 Fourier transforms and antitransforms in continuous and discrete do-
main

6.1.1 Example of Rect-sinc transform

Figure 6.3 2D transform of a 2D rect

Figure 6.4 3D view of the transform of a 2D rect. Note that we denote the axes in the
frequency domain by u and v.

6.1.2 Spectrum and phase for DFT

The DFT of an image can be expressed in polar form. This representation
allows us to distinguish the magnitude, which is called the spectrum, from the phase.

Figure 6.5 Polar form. R and I indicates the real and imaginary part.

Spectrum and phase encode information. The phase encodes more significant
information than the spectrum: the phase carries the shape features of the image, while
the spectrum carries the intensity. In the figure we show the reconstruction of the image
(inverse transform) using only the phase (hence setting the spectrum to 1), using
only the spectrum (hence setting the phase to 0), and reconstructing the image
combining its phase or spectrum with those of the rect.

Figure 6.6 Phase and spectrum reconstruction
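A minimal OpenCV sketch of computing the DFT of an image and separating the spectrum (magnitude) from the phase (file name hypothetical; the optimal-size padding of the OpenCV DFT tutorial is omitted for brevity):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    cv::Mat planes[] = {cv::Mat_<float>(img), cv::Mat::zeros(img.size(), CV_32F)};
    cv::Mat complexImg;
    cv::merge(planes, 2, complexImg);            // real and imaginary channels
    cv::dft(complexImg, complexImg);

    cv::split(complexImg, planes);               // planes[0] = Re, planes[1] = Im
    cv::Mat mag, ph;
    cv::magnitude(planes[0], planes[1], mag);    // spectrum
    cv::phase(planes[0], planes[1], ph);         // phase
    cv::log(mag + cv::Scalar::all(1), mag);      // log scale, usually used to display the spectrum
    return 0;
}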

6.1.3 Spectrum centering

A shift is typically needed to align the Fourier spectrum with the image representation
space. We recall that, with our convention, the origin in both the spatial and the
frequency domain of an image is at the top-left corner. Shifting the
Fourier transform moves the center of the transformed image.

Figure 6.7 Spectrum shifting example in 1D and 2D.

Figure 6.8 Example showing why the spectrum shifting is needed

6.1.4 Rotation and translation

Translation does not affect the Fourier spectrum (only the phase changes), while rotation rotates the transform by the same angle.

Figure 6.9 Rotation and translation operations

6.2 Sampling

Continuous functions have to be converted into a sequence of discrete values


before they can be processed by a computer.
The image f(x,y) is uniformly sampled over an orthogonal grid with spacing ∆X
and ∆Y .
Sampling is performed based on the Nyquist–Shannon sampling theorem, which
defines the criteria for avoiding aliasing. Supposing, as in the 1D case, that the
signal is band-limited, the Nyquist–Shannon condition for an image is:

$F_X = \frac{1}{\Delta X} > 2 u_{max}$

$F_Y = \frac{1}{\Delta Y} > 2 v_{max}$
where umax and vmax are the maximum spatial frequencies of the signal along X
and Y respectively. In particular, u and v are the name of the axes in the frequency
domain (Figure 6.10).

Figure 6.10 Over and under sampling. Ideally we should get the first picture. The
second one presents aliasing.

6.2.1 Aliasing on image

We present some examples of aliasing on images. Aliasing occurs at high
frequencies: pixels vary a lot over a short space.

Figure 6.11 Here we show a sampling grid partially overlapping three 2-D signals
(regions of an image) of low, mid, and high spatial frequencies. Note that the level
of spatial “detail” in the regions is proportional to frequency (i.e., higher-frequency
signals contain more bars). As expected, all three digitized regions exhibit alias-
ing to some degree, but the effects are dramatically different, worsening as the
discrepancy between detail (frequency) and sampling rate increases.

We can mitigate aliasing by applying a lowpass (LPF) filter, i.e. a smoothing filter,
before sampling. In figure 6.12 the middle image is a subsampled version of the
first one; we can see aliasing by looking at the trousers. The last one instead
does not present aliasing: it has been obtained by applying a 5x5 averaging filter
before sampling.

Figure 6.12 Aliasing example. The aliasing is present on the trousers and scarf in
the second image. The last one is the result of applying a LPF filter.

In figure 6.13 we show instead that aliasing can create results that are visually
misleading: the image seems fine but actually contains aliasing. The figure
shows the results obtained by sampling checkerboards of different
square sizes with an imaging system of 96x96 pixels of unit width and
height. We notice that sampling a checkerboard whose squares are smaller than 1
pixel results in undersampling, which in turn results in aliasing.

Figure 6.13 Example of aliasing that misleads

6.3 Moiré effect

The Moiré effect is a visual phenomenon produced by superimposing gratings of similar
spacing. No aliasing is involved, even if the result looks similar.

Figure 6.14 Moiré effect with lined and dotted grating

Figure 6.15 Moiré effect second example

6.4 Filters in frequency

Filters can be defined in frequency. What are the advantages of filtering in
frequency? Working in the frequency domain, it is easier to remove the noise from the
signal/image.
Filtering techniques in the frequency domain are based on modifying the Fourier
transform to achieve a specific objective, and then computing the inverse DFT to
get back to the spatial domain.
Several filters described here perform operations that are similar to the filters in
the spatial domain: smoothing and sharpening/edge highlighting. What are the
differences with respect to the filters in the spatial domain? Spatial filtering (i.e.
convolving an image with a kernel) is equivalent to multiplying the Fourier
transform of the image by the Fourier transform of the kernel. In that sense,
filtering by convolving in the spatial domain is equivalent to filtering by multiplying
in the frequency domain.

Convolving directly in the spatial domain can be computationally costly:
O(NS), with S the cardinality of the support of the kernel and N the number of pixels
of the filtered image. Using the FFT and then multiplying in frequency costs
O(N log N), which is usually cheaper unless the support of the kernel is very
small.

Figure 6.16 Frequency filtering

6.5 Lowpass filters - smoothing

6.5.1 Ideal Low pass filter - ILPF

The ideal lowpass filter passes without attenuation all frequencies within a circle of
radius D0 from the center and cuts off all the frequencies outside the circle. The
zero frequency is in the middle of the figure.

Figure 6.17 Low pass transfer function

where D(u,v) represents the distance between a point (u,v) in the frequency
domain and the center of the PxQ frequency filter. D0 is called the cutoff frequency.

Figure 6.18 Ideal Low pass

Ringing effect

The drawback of the ideal lowpass filter is a ringing effect that occurs
along the edges of the filtered spatial-domain image. This happens in two scenarios:

• We can observe it in the space domain, when we compute the IDFT of a


low pass filtered image.

Figure 6.19 Ringing in the spatial domain

• We can observe it in the frequency domain when we pad in the frequency
domain. The padding of a frequency-domain transfer function is done in the
following steps:

1. starting from a frequency-domain transfer function, we compute its
IDFT and move to the spatial domain.
2. we pad the spatial-domain transfer function.
3. we compute the DFT of the padded function and go back to the frequency
domain. The discontinuities in the padded function cause ringing in
its frequency domain.

Figure 6.20 Ringing in the frequency domain after padding

Figure 6.21 Ringing example. The ringing effect is evident from image c to e. The
ringing becomes finer in texture as the amount of high frequency content removed
decreases.

6.5.2 Butterworth lowpass filter - BLPF

The Butterworth lowpass filter is tuned by means of the parameter n (the order of
the filter). Large values of n make this filter similar to the ideal lowpass filter.

Figure 6.22 Butterworth LP formula

Figure 6.23 Butterworth LP

The ringing phenomenon is not present for low values of n, and it becomes more
evident as we increase it. For values around 20 the BLPF is practically equal to the ILPF.

Figure 6.24 Ringing effect grows with the order of the filter

Figure 6.25 Butterworth ringing effect. The order of the filter is fixed to 2 while
the cutoff frequency is changing.

6.5.3 Gaussian low pass filter - GLPF

It causes no ringing. Its transfer function is:

$H(u, v) = e^{-D^2(u, v) / (2 D_0^2)}$

Figure 6.26 Gaussian LP filter

It can be used for :

• Filling gaps

Figure 6.27 filling gap example

Figure 6.28 filling gap example2

• Beautification

Figure 6.29 Beautification example

• reducing noise

Figure 6.30 Reducing noise example

6.5.4 Sum up Low pass filtering

Figure 6.31 Sum up LP filter

6.6 Highpass filters - sharpening

A highpass filter can be obtained by the corresponding LP filter as:

HHP (u, v) = 1 − HLP (u, v)

Features of LP and HP filters such as ringing are similar for a given type of filter.

Figure 6.32 Highpass filters formula

Figure 6.33 Highpass filters

Ringing is present in the spatial domain as in lowpass filtering.

Figure 6.34 Ringing effect on spatial domain

6.6.1 Ideal Highpass

Figure 6.35 Ideal Highpass

6.6.2 Butterworth Highpass

Figure 6.36 Butterworth Highpass

6.6.3 Gaussian Highpass

Figure 6.37 Gaussian Highpass

6.6.4 Relation frequency and spatial filtering

The corresponding spatial filter can be obtained by computing the inverse transform
of the frequency-domain transfer function.

Figure 6.38 Gaussian spatial and frequency filtering example

6.7 Selective filters

Noise can affect specific portions of the frequency domain. We use selective filters to process

• only a specific band of frequencies. We refer to these filters as band filters:


bandreject filter if the frequencies in the band are filtered out; bandpass if
they are passed.

• or only a small regions of the frequency rectangle. We refer to these filters


as notch filters. If the frequencies in the notch are rejected we call it notch
reject, otherwise notch pass filter.

Ideal, Butterworth and Gaussian definitions exist for all the selective filters.

6.7.1 Bandreject and bandpass filters

A bandpass filter transfer function is obtained from a bandreject function as:

HBP (u, v) = 1 − HBR (u, v)

Figure 6.39 Formulas of bandreject filters

Figure 6.40 Bandreject filter transfer functions.

Figure 6.41 Bandreject filters for removing noise

6.7.2 Notch filter

A notch filter rejects (or passes) frequencies in a predefined neighborhood of the


frequency rectangle.

Figure 6.42 Notch filters

Figure 6.43 Notch filter example. The Butterworth notch filter transfer function
formula applied to the image is reported.

Chapter 7

Mid level

Computer vision systems often start with an image enhancement module, which
corresponds to low-level processing and which we discussed in the previous chapters.
We now move to mid-level algorithms, which are based on more complex models:
line and edge detection.

Figure 7.1 Computer vision system

7.1 Edge detection

We have already seen the derivative filters, which highlight edge pixels. They
can be considered edge detectors even if the filtered images lack the concept
of an edge: they just map an input to an output. We want to build something that
embodies the concept of an edge. Edges are caused by a variety of factors:

Figure 7.2 Edge factors

7.1.1 Derivative recall

Edge and line detection are based on the derivatives: the gradient (in particular we
recall the Sobel filter) and the Laplacian.

Figure 7.3 Derivatives recall

The first-order derivative highlights thick edges, while the second-order derivative
highlights thin edges.
Moreover, looking at the derivative graph, we can see that the second derivative
is positive at the beginning of the ramp, negative at the end of the ramp, and zero at
points along the ramp and in constant regions. The sign of the derivative
indicates the transition from light to dark. We can conclude that the magnitude
of the first derivative can be used to detect the presence of an edge at a point in
an image: it is 0 where the intensity is constant and constant where there is a ramp.
The sign of the second derivative can instead be used to determine whether an edge
pixel lies on the dark or the light side of an edge, and its zero crossing can be used for
locating the center of thick edges.

Figure 7.4 Derivatives recall

7.1.2 Types of edges

By looking at its intensity profile, we can distinguish three types of edges:

Figure 7.5 Types of edge

7.1.3 Derivatives and noise

Derivatives amplify noise. In particular the 2nd derivative is noisier than the 1st.

Figure 7.6 Derivative and noise relations

7.1.4 Gradient edge detector (1st derivative)

Let us recall the gradient.
The gradient is the vector of the partial derivatives, pointing towards the maximum
rate of change of f at (x, y): it points towards where the function increases,
hence in our case from the dark region (min value 0) to the light region (max value 255),
as shown in figure 7.9. Its magnitude is called the edge strength, while the
gradient direction tells us the direction of the border: the border direction
is orthogonal to the gradient direction, as shown in the figure. Moreover, the
phase is measured counterclockwise with respect to the x axis. Note that the
magnitude and the phase are images of the same size as the input image.

Figure 7.7 Discrete gradient and its mask recall

Figure 7.8 Gradient phase and magnitude

Figure 7.9 Gradient and edge direction. Note the edge direction is orthogonal to
the gradient direction.

Gradient magnitude and angle example applying the Sobel mask

Note that for the computation of the gradient magnitude, the formulation involving
the absolute values is usually exploited, since the one involving the square root is
more computationally demanding.

Figure 7.10 Gradient magnitude computed using the Sobel mask

Figure 7.11 Gradient phase. In the sky we get noise because there is no variation,
so the values of gx and gy are essentially random.

Gradient masks

We can use the following masks for detecting/highlighting the edges along the four
main directions: all four Sobel (or Prewitt) kernels are convolved with the input
image, which highlights all the edges along those 4 directions.
The Sobel kernel is the most used since it provides good noise reduction.
The Prewitt kernel is a form of the Sobel filter where the central elements are not
multiplied by 2.
We can also decide to convolve the image with only one or some of the filters in order
to highlight edges only along specific directions. It is important to mention
that using a kernel with a specific orientation does not mean that edges with a different
orientation are not highlighted at all: they are just weakly highlighted compared to
those oriented in the direction of the kernel (see figure 7.13).

Figure 7.12 Gradient masks

We recall that the individual first-order derivative kernels are not isotropic (invariant
to rotation); the Prewitt and Sobel filters are isotropic only
for the vertical and horizontal directions. This means that they exhibit the strongest
response predominantly for vertical and horizontal edges, but they can still be used
for detecting diagonal edges.
The book also reports the Kirsch kernels, which are useful for highlighting
edges along 8 directions: a single kernel mask is rotated in 45 degree
increments through all 8 compass directions: N, NW, W, SW, S, SE, E, and NE.

Figure 7.13 Result of detecting diagonal edges by applying the ±45 degree Sobel kernels

7.1.5 Laplacian edge detector (2nd derivative)

We remind the Laplacian:

$g(x, y) = \nabla^2 f(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4 f(x, y)$

where

$\frac{\partial^2 f}{\partial x^2} = f(x+1, y) + f(x-1, y) - 2 f(x, y)$

$\frac{\partial^2 f}{\partial y^2} = f(x, y+1) + f(x, y-1) - 2 f(x, y)$
The Laplacian edge detector is used together with thresholding: we threshold
the Laplacian image g(x, y) (after smoothing it with a smoothing kernel).
Moreover, Laplacian kernels are usually isotropic: their response is invariant to
rotation, that is, independent of the orientation of the edges. Sometimes, however,
we would like to use the Laplacian anisotropically in order to detect edges with a
specific direction: the preferred direction of each kernel is weighted with a larger
coefficient, such as 2. The Laplacian kernels used for edge detection are
shown in figure 7.15.
We actually use the second-order derivative more for line detection than for edge
detection, since it has a stronger response to fine details (and to noise)
such as thin lines. For edge detection we prefer the first-order derivative, since
it produces thicker lines, hence edges.

Figure 7.14 Example of isotropic laplacian kernel

Figure 7.15 Anisotropic Laplacian kernels

Figure 7.16 Example of application of a 45 degree anisotropic Laplacian kernel

Moreover the Laplacian filter produces the double-line effect: the detected edges
are represented with two lines.

Figure 7.17 Double-line effect example

7.2 Edge detection algorithm

Derivative filters provide a set of tools for detecting edges along the x, y and diagonal
directions, together with their direction. Such tools are the building blocks of mid-level edge
detection algorithms: a combination of several filtering and processing steps in a given
order.
Hence, to find edges we combine filters to create an edge detection algorithm.
We follow a specific order of steps:

1. We apply first a smoothing filter for noise removal: this leads to a more
selective edge detection.

2. We compute the gradient using the derivative kernel(s) along the orienta-
tions of the edges we would like to highlight. We prefer first order derivative
kernels since they produce thicker lines than the second order derivatives.

3. We threshold the gradient with a specific threshold T, keeping only the pixels
with ∇f > T. Thresholding selects only the strongest edges, leading to a much
sharper result.

It follows an example of such practise.

7.2.1 Example of application of the the edge detector algo

Figure 7.18 Smoothing and gradient computation (steps 1 and 2)

Figure 7.19 Thresholding. Notice that smoothing the image prior to the gradient
computation allows to maintain more connectivity of the edges (look at the edge
of the roof).

7.3 The Canny edge detector

The Canny algorithm (1986) is a famous edge detector addressing the following
targets:

• low error rate: all the edges should be found

• well-localized edge points: the edges found should be as close to the true
edges as possible

• single edge point response: it should return only one point for each true
edge point

The steps of the algorithm are:

1. smoothing with a Gaussian filter

2. gradient computation (magnitude and phase)

3. quantize the gradient angles

4. non-maxima suppression

5. hysteresis thresholding
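Before detailing each step, note that OpenCV bundles steps 2-5 in cv::Canny; a minimal sketch of the whole pipeline (file name, kernel size and thresholds are hypothetical; the Gaussian smoothing of step 1 is applied explicitly):

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    cv::Mat smoothed, edges;
    cv::GaussianBlur(img, smoothed, cv::Size(5, 5), 1.4);  // step 1: noise reduction
    cv::Canny(smoothed, edges, 50, 150);                   // hysteresis thresholds T_L = 50, T_H = 150
    cv::imwrite("edges.png", edges);
    return 0;
}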

7.3.1 Smoothing

Smoothing is often employed before evaluating edges for reducing noise.

Figure 7.20 Result of smoothing

7.3.2 Gradient computation

Edges are calculated using vertical, horizontal and diagonal masks. Edge direction
is also computed.

Figure 7.21 Result of computing the gradient. We highlight the angle of the
orientation of the mask that provides that contribution in finding the edge.

7.3.3 Edge quantization

Edges are oriented and we group them based on their orientation. In particular,
we group them into 4 groups representing the 4 main orientations: horizontal,
vertical, +45 and -45 degrees. We remind that we look at the edge normal to get the
edge orientation, which is orthogonal to the edge normal. For instance, if the edge
normal has a direction in the range from -22.5 to 22.5 degrees or from -157.5 to 157.5 degrees,
we flag the edge as a horizontal edge.
This step is needed in order to carry out the next step.

Figure 7.22 Edge quantization example

7.3.4 Non-maxima suppression

Non-maxima suppression reduces the edge thickness in order to accurate locate


the border. For instance if the border is 10px thick how can we do where the
border is actually?
The process goes over the edge and select the strongest point, which it is used
for locating the edge. In particular let d1, d2, d3 and d4 denote the four basic

119
edge orientation for 3x3 region: horizontal, vertical, -45 and 45; the suppression
process is :

• find the direction dk closest to the edge normal

• if the magnitude of the center pixel is less than that of at least one of its
neighbors along dk , set the gradient at the center position to 0 (suppressed);
otherwise do nothing.

After this step, across the thickness of each edge only the strongest pixels survive
(are not set to 0), so the edge becomes one pixel thick.

7.3.5 Hysteresis thresholding

We threshold the output of the previous step. This step is meant to reduce false
edge points:

• keep strong edges

• keep weak edges only if connected with strong edges

• reject isolated weak edges

By strong and weak edges we refer to their strength, which is defined by the gradient
magnitude.
The threshold is a double threshold: a low threshold TL and a high threshold TH .
We can visualize the thresholding as creating two images: IH , which keeps only the
pixels of the input that are ≥ TH ; and IL , which keeps the pixels ≥ TL . Hence, all
nonzero pixels in IH are also contained in IL , so we eliminate from IL
all the nonzero pixels of IH :

IL (x, y) = IL (x, y) − IH (x, y)

Then, since in IH we might have some pixel gaps (a border with missing pixels),
we use the following procedure for filling up those gaps and so forming the edge
pixels:

• For every pixel in the image IH : we mark as valid edge pixels all neighbor
pixels in the image IL .

Figure 7.23 Initial point with TH = 1 (in IH survive only 2 and 3) and TL = 1

Figure 7.24 After applying step 1: we add to IH all the pixels in IL in the
neighborhood of the pixels of IH

• In IL we set to 0 all the pixels not marked as valid edge pixels in IH

Figure 7.25 After applying step 2

• And we add the remaining non zero pixels in IL to IH

7.3.6 Example of canny algo

Figure 7.26 Canny algo example
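The whole detector is also available as a single OpenCV call, which bundles gradient
computation, non-maxima suppression and hysteresis thresholding; a minimal sketch
(file name and the two hysteresis thresholds are arbitrary example values):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // Smoothing is applied explicitly here (step 1 of the algorithm).
    cv::Mat blurred, edges;
    cv::GaussianBlur(img, blurred, cv::Size(5, 5), 1.4);

    // The two values are the hysteresis thresholds TL and TH.
    cv::Canny(blurred, edges, 50, 150);

    cv::imwrite("canny.png", edges);
    return 0;
}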

7.3.7 Gaussian kernel size σ and threshold relation

• Large σ means large-scale edges

• Small σ means finer details

Figure 7.27 Canny algo output for different value of σ

The value of σ should be calibrated based on the threshold: the σ and the threshold
should grow inversely; if one is high, the other one should be low.

Figure 7.28 σ and threshold relation

7.4 Edge detector comparison

We present a comparison among a gradient image, the Canny edge detector and the
Marr-Hildreth edge detector to highlight that an edge detector is more than a derivative
filter.
We can see how clean the output of the Canny edge detector is.

Figure 7.29 Example comparison of edge detectors

Figure 7.30 Example 2 comparison of edge detectors

7.5 Line detection

7.5.1 Naive approach

A possible approach is the following, but it takes large computation time (O(n3 )).

1. Compute the edges image

2. For each couple of edge points, we evaluate the line passing through them

3. For each line, we count how many edge points are laying on the considered
line (or we can also consider how many edge points are close to that line so
that to have some degree of tolerance)

7.5.2 Hough transform

It is the most brilliant approach.


The idea is to consider the generic line equation passing through a point (x,y)
which is a pixel:
y = ax + b

Then we move from the xy-plane to the ab-plane, which is called the parameter
space since ’a’ represents now our unknown parameter:

b = −xa + y

We hence consider, for each point in the xy-plane, the group of lines passing through
it. The line passing through two points is then given by the intersection of two
lines, one from each of the two groups of lines of the corresponding points. In the
ab-plane the same line is given by the point (a,b) where two such lines intersect:
given two points (x1 , y1 ) and (x2 , y2 ), the line passing
through them can be expressed as y1 = a0 x1 + b0 or y2 = a0 x2 + b0 , where (a0 , b0 ) is the
point in the ab-plane where a line of the group of lines of (x1 , y1 ) intersects a line
of the group of lines of (x2 , y2 ).

Figure 7.31 Intersection of two lines in the xy-plane and the corresponding result
in the ab-plane.
We highlighted in green the vertical line in the group of lines of the two points.

Normal representation

However, in the ab-space we cannot represent the vertical lines of the xy-space,
since vertical lines have an infinite angular coefficient ’a’. To overcome
this issue, we move to a different representation of a line called the normal repre-
sentation, which lets us map all the lines passing through a point (x,y) into sinusoidal
curves. Then, the points (ρ0 , θ0 ) where the sinusoidal curves intersect correspond
to the lines passing through two (x,y) points.
The normal representation of a line is:

xcosθ + ysinθ = ρ (7.1)


y = (−cosθ/sinθ) x + (ρ/sinθ) (7.2)

Figure 7.32 Meaning of the parameters θ and ρ of the normal representation of a
generic line

Figure 7.33 Example of the normal representation of a group of lines passing
through a point (x,y). For each point (x,y) we map the lines passing through it
using the equation 7.1. For the point number 1, which corresponds to (0,0) we
can see we obtain a horizontal line since 0 ∗ cosθ + 0 ∗ sinθ = 0 ∀θ. The points
A and B mark the intersections of the curves. Point B marks the intersection
of the curves corresponding to points 2, 3 and 4 in the xy image plane: hence
the points 2, 3 and 4 lie on a straight line oriented at 45° at distance ρ = 71 from
the origin. Point A instead indicates points lying on a straight line passing
through the origin (ρ = 0) and oriented at 45°.

Figure 7.34 Example of normal representation of an edged image. Each white
pixel is represented as a sinusoidal curve. Where the sinusoidal curves intersect,
the corresponding image points are aligned, hence there is a line. In the example
we have 8 bright intersections.

The ρθ parameter space is not infinite. We quantize it along ρ and θ (hence
ρ and θ assume only a discrete set of values) and divide the ρθ plane into cells, called
ACCUMULATOR CELLS, which are initially set to 0. Then, for every point in the
xy-plane, we let θj assume all the allowed values along the θ axis and evaluate the
corresponding ρ according to formula 7.1. The resulting ρ values are rounded off to
the nearest allowed cell value ρi along the ρ axis. For every computed pair (ρi , θj ),
we set A(i, j) = A(i, j) + 1, where A(i, j) is the value of the cell at position (i,j).
Hence, at the end of the procedure a value A(i, j) = k indicates that k points in the
xy-plane lie on the line xcosθj + ysinθj = ρi .
The number of cells in the ρθ plane defines the accuracy.

Figure 7.35 Quantized ρθ plane.

Figure 7.36 Quantized ρθ plane and illustration of the procedure for finding the
number of points that lie on a line.
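The accumulator-based line detection just described is available in OpenCV as
cv::HoughLines; a minimal sketch (file name, Canny thresholds and the vote
threshold are illustrative):

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

int main() {
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // Hough works on a binary edge image, so we run Canny first.
    cv::Mat edges;
    cv::Canny(img, edges, 50, 150);

    // Accumulator resolution: 1 pixel for rho, 1 degree for theta.
    // 150 is the minimum accumulator value (votes) to accept a line.
    std::vector<cv::Vec2f> lines;   // each element is (rho, theta)
    cv::HoughLines(edges, lines, 1, CV_PI / 180, 150);

    // Draw the detected lines by converting (rho, theta) back to two points.
    cv::Mat color;
    cv::cvtColor(img, color, cv::COLOR_GRAY2BGR);
    for (const cv::Vec2f& l : lines) {
        float rho = l[0], theta = l[1];
        double a = std::cos(theta), b = std::sin(theta);
        cv::Point2d p0(a * rho, b * rho);   // point of the line closest to the origin
        cv::Point pt1(cvRound(p0.x + 1000 * (-b)), cvRound(p0.y + 1000 * a));
        cv::Point pt2(cvRound(p0.x - 1000 * (-b)), cvRound(p0.y - 1000 * a));
        cv::line(color, pt1, pt2, cv::Scalar(0, 0, 255), 2);
    }
    cv::imwrite("lines.png", color);
    return 0;
}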

7.5.3 Canny edge detector vs Hough transform line detector

In the following example we can see that the Canny edge detector outputs a lot
of lines.

Figure 7.37 Canny edge detector vs Hough transform line detector. We highlighted
the lines of interest by the blue boxes in the ρθ plane: the boxes have been placed
by considering the cells with the highest count corresponding to θ = ±90, since
the road to highlight has long lines oriented at 90 degrees (there is no distinction
between +90 and −90, which is why we consider both).

7.5.4 Generalized Hough transform - Circle detection

The Hough transform works for more complex shapes, not only for lines. Generally
we can extend it to any function of the form:

g(v, c) = 0

where v is a vector of coordinates and c a vector of coefficients.


For instance, for detecting circles, the function will be:

(x − c1 )2 + (y − c2 )2 = c3 (7.3)

where (c1 , c2 ) is the center of the circle and c3 the radius.


As we can see, we have three parameters c1 , c2 , c3 and hence the parameter space
is three-dimensional. In this case the accumulator cells are cubes located by the
coordinates (i,j,k).
For the sake of simplicity we report two examples where we set the radius to
r, which represents the radius of the circle passing through all the points in the
image. In this way, we can move the problem from 3D to 2D.
In the first example, for each of the 4 points (x,y), the Hough transform procedure
defines a circle centered at (x,y) with the known radius r according to formula
7.3. The intersection point of all such circles in the c1 c2 parameter space
corresponds to the center of the circle with radius r: in particular
the intersection point corresponds to the accumulator cell with the highest value.

Figure 7.38 Circle 2D detection using hough transform. We fixed the radius and
found the circle passing through the 4 points. The color bar indicates the value
of the accumulator cells.

Figure 7.39 Multiple circles 2D detection using Hough transform. We fixed the
radius. Note that we want to find 3 circles and so we look for the three accumulator
cells with the highest value in the parameter space.
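OpenCV provides the circle version of the transform as cv::HoughCircles; a minimal
sketch (the file name and all parameter values are illustrative and must be tuned per
image):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("coins.png", cv::IMREAD_GRAYSCALE);  // placeholder name

    // Smoothing reduces spurious accumulator peaks.
    cv::Mat blurred;
    cv::medianBlur(img, blurred, 5);

    // Each result is (c1, c2, c3) = (center x, center y, radius).
    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(blurred, circles, cv::HOUGH_GRADIENT, 1,
                     blurred.rows / 8, 100, 30, 10, 100);

    cv::Mat color;
    cv::cvtColor(img, color, cv::COLOR_GRAY2BGR);
    for (const cv::Vec3f& c : circles)
        cv::circle(color, cv::Point(cvRound(c[0]), cvRound(c[1])),
                   cvRound(c[2]), cv::Scalar(0, 255, 0), 2);

    cv::imwrite("circles.png", color);
    return 0;
}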

Chapter 8

Morphological operators

Morphological operators are low-level operators that operate on the shapes found
in an image. They generally work on binary images but can be extended to grayscale
images as well.
The morphological operations are based on mathematical set theory (meaning
we use set-theory notation for defining the morphological operators).
A set is made of tuples:

• In a binary image, the tuples are pairs representing the coordinates of the
pixels of a foreground object in the image: the set is hence made of all the
white pixels representing that object.

• In a gray image, the tuples are triples (i,j,k) where (i,j) are the coordinates of
a pixel of a foreground object and k is its intensity value.

The two main morphological operators are erosion and dilation and their combi-
nation: mainly opening and closing.

8.1 Erosion

Consider two sets A and B, where A represents an object and B is another
object, called the structuring element. The erosion operation is defined as:

A ⊖ B = {z | (B)z ⊆ A}

where z are foreground pixel locations (value 1 for binary images). In words: translate
B to each foreground pixel z, and keep z iff the whole structuring element is fully included
in A.

Figure 8.1 Some structuring elements

Erosion is applied for:

• Thinning object (like eroding its edges)

Figure 8.2 Thinning object example. We can see how the final shape is similar to
the initial one but thinner.

• separate weakly connected objects

Figure 8.3 Separation of weakly connected components example

The shape of the structuring element determines in which direction the erosion
operates.

Figure 8.4 Example illustrating different results for different structuring elements.

8.2 Dilation

Consider two sets A and B, where A represents an object and B is another
object, called the structuring element. The dilation operation is defined as:

A ⊕ B = {z | (B)z ∩ A ≠ ∅}

where z are foreground pixel locations (value 1 for binary images). In words: translate
B to each foreground pixel z, and keep z iff at least one pixel of the structuring element
overlaps with A.
Dilation produces results opposite to erosion. It is applied for:

• thickening image border

Figure 8.5 Thickening example for different structuring elements

• merging close, unconnected components

Figure 8.6 Example of merging unconnected components
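In OpenCV the two operators are cv::erode and cv::dilate; a minimal sketch (file
names and the structuring element size are arbitrary):

#include <opencv2/opencv.hpp>

int main() {
    // Placeholder name for a binary (black/white) input image A.
    cv::Mat A = cv::imread("binary.png", cv::IMREAD_GRAYSCALE);

    // Structuring element B: a 3x3 square of 1's (other shapes are
    // cv::MORPH_ELLIPSE and cv::MORPH_CROSS).
    cv::Mat B = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));

    cv::Mat eroded, dilated;
    cv::erode(A, eroded, B);    // thins objects, separates weakly connected parts
    cv::dilate(A, dilated, B);  // thickens objects, merges close components

    cv::imwrite("eroded.png", eroded);
    cv::imwrite("dilated.png", dilated);
    return 0;
}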

8.3 Opening and closing combination

8.3.1 Opening

Opening is a combination of erosion and dilation defined as:

A ◦ B = (A ⊖ B) ⊕ B

It causes:

• contour smoothing

• elimination of thin protrusions without reducing the element size.

8.3.2 Closing

Closing is a combination of erosion and dilation defined as the opposite of opening:

A • B = (A ⊕ B) ⊖ B

It causes:

• fusion of narrow breaks without increasing the element size
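Both combinations are exposed by OpenCV through cv::morphologyEx; a short sketch
(file name and kernel size are placeholders):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat A = cv::imread("binary.png", cv::IMREAD_GRAYSCALE);  // placeholder
    cv::Mat B = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));

    cv::Mat opened, closed;
    // Opening = erosion followed by dilation: removes thin protrusions and
    // small isolated spots without shrinking the main objects.
    cv::morphologyEx(A, opened, cv::MORPH_OPEN, B);
    // Closing = dilation followed by erosion: fuses narrow breaks and fills
    // small holes without enlarging the main objects.
    cv::morphologyEx(A, closed, cv::MORPH_CLOSE, B);

    cv::imwrite("opened.png", opened);
    cv::imwrite("closed.png", closed);
    return 0;
}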

8.3.3 Comparison opening vs erosion and closing vs dilation

Figure 8.7 Comparison opening vs erosion and closing vs dilation

8.3.4 More combinations

We can create more combinations using erosion and dilation. Complex combinations
may effectively remove noise.

Figure 8.8 Complex combination

Chapter 9

Segmentation

To segment an image means to partition it into regions corresponding to
different categories (also called classes). Hence, given an image R, we partition
R into n subregions R1 , R2 , ..., Rn such that:

• R1 ∪ R2 ∪ ... ∪ Rn = R

• Ri ∩ Rj = ∅ for all i, j with i ≠ j

For segmenting an image we follow two main criteria:

• Similarity between pixels in the same region

• Discontinuity between pixels in different regions

Looking at the figure, every colour represents a category.

Figure 9.1 Segmentation example

Segmentation result depends on how many categories we choose. In the example
below, we can see that starting from the same image, we can recognize only the
person or the person, the face, their hats and their jackets. For the penguin
example, we can distinguish only the penguin or the penguin and the floor.

Figure 9.2 Segmentation example with different categories

There are a lot of segmentation techniques:

• Segmentation by thresholding (histogram-based)

• Region growing methods

• Watershed transformation

• Clustering-based methods

• Model-based segmentation

• Edge-based methods

• Graph partitioning methods

• Multi-scale segmentation

• Many others…

We cover only the first 4/5. The techniques differ from each other for how we
define similarity and the geometry used for defining regions.

9.0.1 Application

Figure 9.3 Segmentation applications

9.1 Segmentation by thresholding

One way to extract the objects from the background is to select a threshold T on
the histogram, such that any point of the image at which f (x, y) > T corresponds
to the pixels of the object; and f (x, y) ≤ T corresponds to the background. In
other words the segmented image g(x,y) is obtained as:

 1

f (x, y) > T
g(x, y) =
 0 f (x, y) ≤ T

where we can use any intensity pixel value at place of 1 and 0.


The threshold can be :

• GLOBAL: when T is a constant applicable over the entire image f(x,y).
A single global threshold distinguishes at most two classes; with two thresholds
(dual thresholding) we can distinguish up to three:

g(x, y) = a if f(x, y) > T2 (point belongs to an object); b if T1 < f(x, y) ≤ T2
(point belongs to another object); c if f(x, y) ≤ T1 (point belongs to the background)

where a, b, c are arbitrary intensity values. Segmentation problems requiring more
than two thresholds are difficult to solve and we use other methods such as region
growing.

• LOCAL: when T is not constant but the value of T at any point (x,y) in an
image depends on properties of a neighborhood of (x,y) such as the average
intensity of the pixels in the neighborhood.

Figure 9.4 Single and dual thresholding on the histogram

Figure 9.5 Example of global thresholding
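In OpenCV a global threshold is applied with cv::threshold; a minimal sketch (the
file name and T = 127 are arbitrary example values):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat f = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // Global thresholding: g(x,y) = 255 if f(x,y) > T, 0 otherwise.
    cv::Mat g;
    cv::threshold(f, g, 127, 255, cv::THRESH_BINARY);

    cv::imwrite("segmented.png", g);
    return 0;
}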

9.1.1 Factors influencing the selection of the right threshold

The success of segmentation by thresholding is affected by the following
factors, which make it difficult to distinguish the modes in the histogram (the modes
are the intensity values that occur most often and they are identified by
the peaks):

• Distance between the peaks: the further apart they are, the easier it is to
separate the modes.

• Image noise: the modes are broaden as noise increases

• Relative size of the objects compared to the background: small objects have
a really low impact on the histogram.

• Non-uniform illumination of the object. The same effect can be caused also
by the reflectance properties of the object.

Figure 9.6 Noise affecting thresholding

Figure 9.7 Illumination affecting thresholding. We multiply the image by a
graduated filter to simulate bad illumination of the object.

Figure 9.8 Small objects have really low impact on the histogram.

9.2 Otsu’s optimal threshold

It locates the optimal value of the threshold automatically based on the histogram
of the image.
It is a global thresholding method based on the histogram. It assumes that two
classes are created by thresholding. The Otsu algorithm determines the optimal
threshold by minimizing intra-class intensity variance, or equivalently, by maxi-
mizing inter-class variance. What does it mean intra and inter class variance?
Thinking in terms of international and intranational: the first is ‘across many
nations’, and the second ‘within one nation’. Otsu’s goal is to maximise the inter-
class differences, and minimise the intra-class differences. That is, the pixels in
each class must be as similar as possible, and those in different groups must be
as different as possible.
The steps of the algo are:

1. We compute the normalized histogram. Recall: the normalized histogram
of an MxN image has components pi = ni /(M N ), where ni denotes the number
of pixels with intensity value i (0 ≤ i ≤ L − 1, with L the number of
intensity levels) and pi is the probability that a pixel has intensity i. It
follows that:

Σ_{i=0}^{L−1} pi = 1

2. We set a threshold T(k) = k, 0 ≤ k ≤ L − 1, for thresholding the input image
into two classes: c1 if f (x, y) ≤ k and c2 otherwise.

3. We compute the probability that a pixel belongs to (i.e. is thresholded into) c1
and c2 , for each k = 0, 1, ..., L−1:

P1 (k) = Σ_{i=0}^{k} pi (9.1)

P2 (k) = 1 − P1 (k) (9.2)

4. We compute the image global mean, i.e. the average intensity of the entire
image, mg :

mg = Σ_{i=0}^{L−1} i pi

5. We compute the cumulative mean up to level k, m(k):

m(k) = Σ_{i=0}^{k} i pi

6. We compute the pixel average value in classes 1 and 2, for k = 0, 1, ..., L−1:

m1 (k) = (1/P1 (k)) Σ_{i=0}^{k} i pi (9.3)

m2 (k) = (1/P2 (k)) Σ_{i=k+1}^{L−1} i pi (9.4)

7. We compute the global variance of the entire image:

σG² = Σ_{i=0}^{L−1} (i − mg )² pi

8. We define the inter-class variance (we omit the dependency on k), for k = 0, 1, ..., L−1:

σB²(k) = P1 (m1 − mg )² + P2 (m2 − mg )² (9.5)
       = P1 P2 (m1 − m2 )² (9.6)
       = (mg P1 − m)² / (P1 (1 − P1 )) (9.7)

9. We define the ’quality’ of the threshold as:

η = σB²(k) / σG²

We find the optimal threshold by maximizing η, that is, finding:

k∗ s.t. σB²(k∗) = maxk σB²(k)

10. we segment the image using k ∗ as threshold

This procedure maximizes the inter-class variance of the data and at the same time
minimizes the intra-class variance. Indeed, the global variance can be expressed
as sum of the two variances, which lets us express the inter variance in terms of
intra variance. The intra variance is defined as :

σin²(k) = P1 (k) σ1² + P2 (k) σ2²

and we can express the global variance as:

σG² = σin²(k) + σB²(k)

hence:

σB²(k) = σG² − σin²(k)

Figure 9.9 Maximizing the inter variance means minimize the intra variance. The
goal here is to separate the text from the background.

Figure 9.10 Otsu example. We report it compared to the k-means segmentation
technique, which is covered in another section. We can see that Otsu is more accurate
in separating the two classes.

Figure 9.11 Otsu example2
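OpenCV computes the Otsu threshold inside cv::threshold when the THRESH_OTSU
flag is set; a minimal sketch (the file name is a placeholder):

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::Mat f = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // With THRESH_OTSU the threshold argument (0 here) is ignored: the
    // optimal k* is computed from the histogram and is also returned.
    cv::Mat g;
    double kStar = cv::threshold(f, g, 0, 255,
                                 cv::THRESH_BINARY | cv::THRESH_OTSU);
    std::cout << "Otsu threshold k* = " << kStar << std::endl;

    cv::imwrite("otsu.png", g);
    return 0;
}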

9.2.1 Smoothing

Smoothing the image can help Otsu’s method. Indeed, smoothing prior to
thresholding improves the shape of the histogram, making the valley between the
modes easier to separate.

Figure 9.12 Smoothing effect

9.2.2 Otsu + edge detection

If the object to separate from the background is too small, it will not have
enough impact on the histogram to produce a clear valley between the modes.
In this case, we can exploit edge detection to improve the histogram and
separate the modes. The idea is to consider only the pixels that lie on or near the
edges.

Figure 9.13 Small object problem

The steps to use for combining edge detection and Otsu are:

• Compute the edge image of the input image as either the magnitude of the
gradient or the Laplacian

• Specify a threshold value T. It is usually expressed as a percentile on the
histogram of the edge image and it is usually high, greater than the 90th
percentile (e.g. the 99.7th percentile). In this way we remove the weak edges.

• Threshold the edge image using T to produce a binary (black 0 and white
1) image g(x,y).

• We mask the input image with the thresholded image g(x,y). In this way
we take only the pixels that lie on or near the edges. This means we
multiply g(x,y) by f(x,y), which results in:

z(x, y) = g(x, y) ∗ f (x, y) = f (x, y) if g(x, y) = 1, and 0 if g(x, y) = 0 (9.8)

• We compute the histogram of z(x,y) and compute the optimal threshold T∗
using Otsu.

• We threshold the input image f(x,y) using T∗
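A possible sketch of this edge-guided Otsu procedure in OpenCV (the 99.7 percentile
follows the notes; file name and the way the percentile is extracted are our own
assumptions):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat f = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

    // Edge image: gradient magnitude from Sobel derivatives.
    cv::Mat gx, gy, mag;
    cv::Sobel(f, gx, CV_32F, 1, 0);
    cv::Sobel(f, gy, CV_32F, 0, 1);
    cv::magnitude(gx, gy, mag);

    // Threshold T chosen as the 99.7th percentile of the gradient values.
    cv::Mat sorted = mag.reshape(1, 1).clone();
    cv::sort(sorted, sorted, cv::SORT_EVERY_ROW | cv::SORT_ASCENDING);
    float T = sorted.at<float>(0, static_cast<int>(0.997 * (sorted.cols - 1)));

    // Binary mask g(x,y): 255 on strong edges, 0 elsewhere.
    cv::Mat g = mag > T;

    // Otsu computed only on the pixels selected by g(x,y), i.e. on the
    // nonzero part of z(x,y) = g(x,y) * f(x,y).
    std::vector<uchar> vals;
    for (int r = 0; r < f.rows; ++r)
        for (int c = 0; c < f.cols; ++c)
            if (g.at<uchar>(r, c)) vals.push_back(f.at<uchar>(r, c));
    cv::Mat valsMat(vals, true), dummy;
    double Tstar = cv::threshold(valsMat, dummy, 0, 255,
                                 cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Final segmentation of the whole input image with T*.
    cv::Mat segmented;
    cv::threshold(f, segmented, Tstar, 255, cv::THRESH_BINARY);
    cv::imwrite("segmented.png", segmented);
    return 0;
}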

Figure 9.14 Small object segmented using edge detection + Otsu

9.2.3 Variable thresholding (Adaptive thresholding)

If global thresholding fails we can opt for variable thresholding. It consists in
computing a threshold at every point (x,y) in the image based on one or more
specified properties of a neighborhood of (x,y), such as the mean and the standard
deviation. Common forms of variable thresholding are:

Txy = aσxy + bmxy (9.9)


Txy = aσxy + bmG (9.10)

where mG is the global image mean. This method is useful for example when the
illumination is not uniform.
In the example below, in order to handle non-uniform illumination or reflectance, the
image is partitioned and the thresholding is operated on each partition. In each
partition, the illumination and reflectance are assumed uniform.
When T depends on the spatial coordinates (x,y) themselves, variable thresholding
is often referred to as dynamic or adaptive thresholding.
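OpenCV offers a neighborhood-based variant through cv::adaptiveThreshold; a
minimal sketch (file name, block size and the constant C are illustrative):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat f = cv::imread("unevenly_lit.png", cv::IMREAD_GRAYSCALE);  // placeholder

    // The threshold at (x,y) is the mean of a 15x15 neighborhood minus a
    // constant C = 5 (roughly Txy = m_xy - C); ADAPTIVE_THRESH_GAUSSIAN_C
    // would use a Gaussian-weighted mean instead.
    cv::Mat g;
    cv::adaptiveThreshold(f, g, 255, cv::ADAPTIVE_THRESH_MEAN_C,
                          cv::THRESH_BINARY, 15, 5);

    cv::imwrite("adaptive.png", g);
    return 0;
}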

Figure 9.15 Image partitioning based thresholding. The iterative global algorithm
is the k-means.

9.2.4 Multiple thresholding

We can extend Otsu’s method to an arbitrary number of thresholds. In the case
of K classes c1 , c2 , ..., cK the inter-class variance is:

σB² = Σ_{k=1}^{K} Pk (mk − mG )²

where

Pk = Σ_{i∈ck} pi (9.11)

mk = (1/Pk ) Σ_{i∈ck} i pi (9.12)

The K classes are separated by K−1 thresholds, whose optimal values k1∗ , k2∗ , ..., kK−1∗ satisfy:

σB²(k1∗ , k2∗ , ..., kK−1∗ ) = max_{k1 ,k2 ,...,kK−1} σB²(k1 , k2 , ..., kK−1 )

Figure 9.16 Multiple thresholding

9.2.5 What is Percentile?

It is a statistical score. The i-th percentile is the smallest number that is greater than
i% of the numbers in a set. If n is the 99th percentile of an image, it means that
99% of the pixels have intensity below n.
In the example below, the histogram represents how many people have the age
reported on the x-axis. The 14th percentile is 35: 14% of all the people (the total
number of people is given by summing the heights of the bins = 34) are younger than 35.
Indeed, 14% of 34 = about 5 people, and 5/34 * 100 = 14.7059.
The 95th percentile is computed as follows: 95% of 34 = 32.3 people. Hence, to
find the corresponding x value, we sum the bins as long as we do not exceed
32.3: 5+7+8+5+2+3 = 30 people, which corresponds to age 55. Hence, 30/34
* 100 = 88.2353, i.e. the 88th percentile (88% of people are younger than 55). Since
the last bin, whose cumulative sum is 33, corresponds to the 97th percentile (33/34
* 100 = 97.0588), the 95th percentile corresponds to an age between 50 and 54.

Figure 9.17 Percentile example. The numbers above the bins are the percentiles
corresponding to that value on the x-axis. The values inside the bins are their heights
(y value).

9.3 Region growing methods

The objective of segmentation is to partition an image into regions. Region growing
is a segmentation technique that groups pixels or subregions into larger regions
based on predefined criteria.
The idea is to start with a set of ’seed’ points and, from these points, grow regions
by appending to each seed the neighboring pixels that satisfy some properties
expressed as a predicate Q (usually regarding connectivity and intensity value).

9.3.1 Connectivity

One property we usually want to check is 8-connectivity.

What is a connected set? Let S represent a subset of pixels in an image. Two
pixels p and q are said to be connected in S if there exists a path between them
consisting entirely of pixels in S. For any pixel p in S, the set of pixels that are
connected to it in S is called a connected component of S.
8-connectivity means we define connectivity using a structuring element of size 3x3
whose elements are all 1’s. An 8-connected pixel is one that shares an edge or a
vertex with the middle pixel defined by the structuring element.

9.3.2 Region growing algorithm

Let f(x,y) be the input image and let S(x,y) be the image called SEED ARRAY
containing 1’s at the location of seed points and 0’s elsewhere.
The steps are:

1. Determining the seed points: we threshold f(x,y) in order to find all the
connected components and then we erode each component to a single
pixel. These single pixels will be the seed points constituting S(x,y).

2. We define a predicate Q and MIGHT create an image fQ (x, y) to evaluate
the criterion, s.t.

fQ (x, y) = 1 if Q(x, y) is true, and 0 if Q(x, y) is false (9.13)

3. We create an image fG by appending to each seed point in S the 1’s in fQ that
are 8-connected to that seed point.

4. We label each connected component in fG with a different region label
(segmentation with threshold). This is the final result.
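A simplified sketch of these steps in OpenCV (the predicate here compares each pixel
against a single reference seed value, which is an assumption of this sketch; function
and parameter names are ours):

#include <opencv2/opencv.hpp>
#include <vector>

// f is the grayscale input, S the seed array (nonzero at seed points),
// Q is |f(x,y) - seedValue| <= T.
cv::Mat regionGrow(const cv::Mat& f, const cv::Mat& S, double seedValue, double T) {
    // Step 2: predicate image fQ(x,y) = 255 where the predicate holds.
    cv::Mat diff, fQ;
    cv::absdiff(f, cv::Scalar(seedValue), diff);
    fQ = diff <= T;

    // Step 3: keep only the 8-connected components of fQ that contain a seed.
    cv::Mat labels;
    int n = cv::connectedComponents(fQ, labels, 8, CV_32S);
    std::vector<bool> keep(n, false);
    for (int r = 0; r < S.rows; ++r)
        for (int c = 0; c < S.cols; ++c)
            if (S.at<uchar>(r, c)) keep[labels.at<int>(r, c)] = true;
    keep[0] = false;  // label 0 is the background of fQ

    // Step 4: output image with the grown regions (the per-region labels in
    // 'labels' could be kept for a full labelling).
    cv::Mat fG = cv::Mat::zeros(f.size(), CV_8U);
    for (int r = 0; r < f.rows; ++r)
        for (int c = 0; c < f.cols; ++c)
            if (keep[labels.at<int>(r, c)]) fG.at<uchar>(r, c) = 255;
    return fG;
}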

9.3.3 Example

We consider an X-ray image of welding and we would like to find the defects on
it. We apply the algo:

1. Step 1: we determine the seed points by noticing that the defects are the
brightest regions in the image. We then extract the seed points by thresholding
the image using the 99.9th percentile of its intensity values: this leads to
T = 254, which in turn implies that the seed points will have value 255.
Then we morphologically erode each connected component to one pixel.

2. Step 2, 3 and 4: We append to each seed all the pixels that

• 8-connected to that seed


• similar to the seed. Given the seed (xs , ys ) and a threshold T we
express the similarity through the predicate:

Q(x, y) = true if |f (x, y) − f (xs , ys )| ≤ T, and false otherwise (9.14)

Note that we can also specify a dual thresholding technique for the
similarity. In the example we try both the single and the dual thresholding
technique, using T1 = 68 and T2 = 126 for the dual thresholding,
while we used only T1 for the single thresholding. We can see that we
cannot accomplish the goal using dual thresholding in this example.
The T values correspond closely to the valleys of the histogram.
The final result does not coincide with the thresholded image, since the
good region of the weld (the white one) fails the predicate, and the
background points are excluded too since they are not 8-connected to the
seeds.

Figure 9.18 Growing regions technique example

9.3.4 Growing regions vs thresholding-based segmentation

Step 4 of the region growing technique expresses the difference with respect to
thresholding-based segmentation. The region growing method labels every connected
component, hence in the example below we can distinguish defect 1, defect 2 and so on,
since the growing method returns a list of pixels for each labelled connected component.
The thresholding technique instead would not differentiate the different defects,
but would return just two classes: pixels below T and pixels above T. If we wanted
a list of pixels for every defect from this method, we would need to scan the
two-class image and check the connectivity.

9.4 Segmentation using Watersheds

Watershed segmentation is a region-based technique. It is a segmentation algorithm
that provides connected segmentation boundaries (connected paths/lines) and
connected components.
The basic idea is that a grayscale image can be interpreted as a topographic
surface: we map the intensity values into height values.

Figure 9.19 Topographic interpretation of an image

In such topographic representation we distinguish three types of points:

• local minima

• Points at which a drop of water would fall to a single minima. The points
satisfying this condition are called CATCHMENT BASIN or WATERSHED
of that minima.

• Points at which a drop of water could fall into two (or more) different min-
ima. The points satisfying this condition are called WATERSHED LINES.

The goal of the watershed segmentation is to find the watershed lines. To find them
we imagine these lines as dams, and we locate them by flooding the basins. Suppose
that a hole is punched in each regional minimum and that the entire catchment basin
is flooded from below by letting water rise through the holes at a uniform rate. When
two catchment basins are about to merge (water overflows from one basin), we build a
dam between them, which is taller than any pixel in the image (it then has the highest
pixel value). The points of the dams define the watershed lines, which are the desired
segmentation boundaries.
Morphologically this procedure consists in sequentially computing dilation operations
on the connected components, and the watershed lines are created as soon as two
connected components merge together.

Figure 9.20 Dam construction example. Note that the two little ’isles’ are not
selected, since a drop of water over there would fall into a single minimum, and
hence the ’isle’ points are part of the catchment basin.

Figure 9.21 Dam construction example with dilation

9.4.1 Watershed on gradients

In practice we apply the watershed segmentation to the gradient of the image
rather than to the image itself. The goal of applying watershed on the gradient is to
extract uniform regions. This is particularly useful when the gradient is small, since
it produces thick edges, which can be interpreted as a kind of topographic
representation of the image. The watershed lines are thin lines over these
edges. The catchment basins instead correspond to the homogeneous graylevel
regions.

Figure 9.22 Watershed applied on gradient

9.4.2 Over-segmentation and markers

The direct application of such a segmentation technique often leads to over-segmentation
due to noise and irregularities of the image gradient. Over-segmentation can make
the results useless.

Figure 9.23 Over-segmentation

For overcoming such issue we use markers, which drive the segmentation. Markers
are hence the set of pixels from where the flooding will start ( the seed points ).
Intuitively, think of markers as the pixels which we are sure of belonging to the
objects present in the image. Generally, the number of markers is equal to the
number of objects (classes) + 1 (for the background). Indeed, we define two kinds of
markers:

• internal markers: which are associated with the objects of interest and hence
individuate local minima

• external markers: which are associated with the background

The marker selection procedure is defined through two steps:

1. Preprocessing: usually it consists in smoothing the image.

2. Definition of a set of criteria that markers must satisfy. In practice the
markers can either be explicitly defined by the user or be automatically
determined using morphological operators or other methods such as
thresholding.

All the markers are represented with different pixel values (colors). Most of the
pixels in the marker image are initially 0 (black) which indicates an unknown
region and the watershed algorithm assigns each of these 0 pixels a class out
of the number of classes (n-1 for objects and 1 background). Intuitively, using
these markers we are telling the watershed algorithm to group points like these
together. So, the concept is simple. Start growing these markers (labeled with
different colors(labels)). When they are about to meet, construct barriers at those
locations. These barriers give us segmentation results.

Marker selection

When creating the marker image, we need to define its pixel as follows:

• 0: unknown (pixels we are not sure about; there might actually be no
unknown pixels). The watershed algorithm then assigns each of these
pixels a class (background or object).

• 1: background

• >1: our objects.

For finding the objects, and hence the pixels to set to a value > 1, we usually compute
their Euclidean distance from the nearest zero (black) pixel. We can accomplish
this by using the distanceTransform function, thresholding its output in order to
extract the furthest and nearest points, and then filling the area bounded by the
contours of the objects we found (these are the basins). All the other pixels of the
marker image are instead set to 1 if they represent the background.
Then the watershed algorithm uses the marker to group similar points together
and it returns an image whose pixels have been set as follows:

• -1: the edges between objects (watershed lines)

• 0: the background (as intended by watershed)

• 1: the background as we defined

• >1: our objects

This procedure is applied also in the following example.

Example of markers

In the following example the criteria for defining the internal markers are:

• region that is surrounded by lighter points

• the points in the region form a connected component

• all the points in such region have the same intensity value

Then we smooth the image and apply the watershed algorithm with the restriction
that the flooding can start only from the internal markers. This procedure defines the
watershed lines, which are the external markers.
The external markers partition the image into regions, each containing an internal
marker and part of the background. Hence, we then need to segment each region
into two classes: background and object. For doing so we can use any segmentation
technique. In the example here we use the watershed segmentation.

Figure 9.24 Example of markers

9.4.3 OpenCV - watershed(img, markers)

OpenCV implements a marker-based watershed algorithm, watershed(img, markers),
where the markers are provided by the caller.
The guide is taken from https://docs.opencv.org/4.x/d2/dbd/tutorial_distance_transform.html.
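A condensed sketch of that tutorial's pipeline (the file name and the 0.4 peak threshold
are illustrative; the background marker handling of the full tutorial is omitted here):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat src = cv::imread("cards.png");          // placeholder name
    cv::Mat gray, bw;
    cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
    cv::threshold(gray, bw, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Distance of every foreground pixel from the nearest zero (black) pixel.
    cv::Mat dist;
    cv::distanceTransform(bw, dist, cv::DIST_L2, 3);
    cv::normalize(dist, dist, 0, 1.0, cv::NORM_MINMAX);

    // Peaks of the distance map become the internal markers (one per object).
    cv::Mat peaks;
    cv::threshold(dist, peaks, 0.4, 1.0, cv::THRESH_BINARY);
    peaks.convertTo(peaks, CV_8U);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(peaks, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    // Marker image: 0 = unknown, labels 1..N = objects.
    cv::Mat markers = cv::Mat::zeros(dist.size(), CV_32S);
    for (size_t i = 0; i < contours.size(); ++i)
        cv::drawContours(markers, contours, static_cast<int>(i),
                         cv::Scalar(static_cast<int>(i) + 1), -1);

    // Watershed fills the unknown pixels and sets -1 on the watershed lines.
    cv::watershed(src, markers);
    return 0;
}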

Chapter 10

Segmentation by clustering

Clustering means to group a collection of heterogeneous elements into sets (also
called clusters) of similar elements. In computer vision we use clustering techniques
for segmenting images.

10.1 Feature vector image representation

For applying any clustering technique we need to represent each pixel with a
feature vector, which can be multi-dimensional, e.g. for colour images it can
be the values of the 3 channels R, G and B. The feature vectors thus form a
feature space, and a single vector contains all the measurements that may
be relevant to describe a pixel: spatial position (coordinates), intensity/brightness
(grayscale images), color information (RGB/YUV/CIELAB) and so on.
A clustering algorithm for segmenting images uses this representation for grouping
similar pixels.

10.2 Clustering techniques

Clustering techniques often evaluate how similar two pixels are based on their
feature vector. This task is carried out using a distance function to compare the
vectors.
Some typical distance functions are the following where D is the dimension of the
feature vector.

Two basic approaches are used to cluster:

• Divisive clustering

– We consider the entire dataset as a cluster


– We split recursively each cluster to yield a good clustering. We need
hence a cluster quality measurement.

• Agglomerative clustering

– We consider each single pixel as a cluster


– We merge recursively each cluster to yield a good clustering. We need
hence a cluster quality measurement.

There are different clustering techniques :

• K-means

• mean shift

• spectral clustering

• hierarchical clustering and so on

10.3 k-means

It is a simple clustering algorithm based on a fixed number of clusters k provided
to the algorithm (the fact that we have to set k is good if we have an idea about how
we wish the data to be grouped, otherwise it is a bad approach). The k-means
algorithm operates on the FEATURE SPACE (the representation of the image
where each pixel is a vector) and after the process each FEATURE VECTOR is
associated with one of the k clusters.

The k clusters are disjoint sets C1 , C2 , .., CK and each cluster Ci has a centroid
µi .
The k-means objective function measures the distance between each data point
and the centroid of its cluster. The goal is to minimize the error made by approx-
imating the points with the center of the cluster it belongs to.
 
min Σ_{i=1}^{k} Σ_{x∈Ci} d(x, µi ) (10.1)

where d() is an appropriate distance

Figure 10.1 (a) is the original image, (b) is the feature space and (c) is the feature
space after applying k-means

10.3.1 Lioyd’s algorithm (aka k-means algorithm)

A heuristic procedure to accomplish kmeans is the following. This can be applied


to vectors containing any set of features.

1. Select k random centroids

2. Associate each point to the closest centroid :

Ci = {x : i = argminj d(x, µj )} for i = 1, ... k (10.2)

3. Compute the new centroids (center of mass of the associated points):

µi = (1/|Ci |) Σ_{x∈Ci} x

4. Repeat steps 2 and 3 until the centroids no longer move significantly or we reach
a maximum number of steps.
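OpenCV ships an implementation of this procedure, cv::kmeans; a minimal sketch for
color-based segmentation (file name, k and the termination criteria are arbitrary
choices):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.png");       // BGR image, placeholder name
    const int k = 4;                              // number of clusters

    // Build the feature space: one row per pixel, features = (B, G, R).
    cv::Mat samples(img.rows * img.cols, 3, CV_32F);
    for (int r = 0; r < img.rows; ++r)
        for (int c = 0; c < img.cols; ++c) {
            cv::Vec3b px = img.at<cv::Vec3b>(r, c);
            for (int ch = 0; ch < 3; ++ch)
                samples.at<float>(r * img.cols + c, ch) = px[ch];
        }

    // cv::kmeans runs Lloyd's iterations; KMEANS_PP_CENTERS is an alternative
    // to a purely random (Forgy-like) initialization.
    cv::Mat labels, centers;
    cv::kmeans(samples, k, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 10, 1.0),
               3, cv::KMEANS_PP_CENTERS, centers);

    // Replace every pixel with the centroid of its cluster.
    cv::Mat segmented(img.size(), img.type());
    for (int r = 0; r < img.rows; ++r)
        for (int c = 0; c < img.cols; ++c) {
            int idx = labels.at<int>(r * img.cols + c, 0);
            segmented.at<cv::Vec3b>(r, c) = cv::Vec3b(
                static_cast<uchar>(centers.at<float>(idx, 0)),
                static_cast<uchar>(centers.at<float>(idx, 1)),
                static_cast<uchar>(centers.at<float>(idx, 2)));
        }
    cv::imwrite("kmeans.png", segmented);
    return 0;
}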

10.3.2 Initialization

Cluster centroids need to be initialized and the final solution depends on the
initialization.
We can opt for two initialization methods:

1. FORGY METHOD. We choose k points randomly among the data points.

2. RANDOM PARTITION. We assign random points of the dataset to the k
clusters. Then we compute the centroids with the previous formula.

10.3.3 Pro and cons

Pro:

• light and simple

• computational complexity can be reduced using heuristics

• fast to converge

Cons:

• Optimality is not guaranteed

• solution found depends on the initialization.

• the number of clusters k needs to be known in advance

• forces spherical symmetry of clusters

10.3.4 K-means for segmenting images

K-means image segmentation can be done on the histogram (faster) or the pixel
vectors (better result).
The distance measures can be:

• Intensity level difference (grayscale)

• Color channel difference

• combinations of position, color, texture, descriptor and so on.

Figure 10.2 Example of k-means segmentation using grayscale intensity level dif-
ference and color channel difference.

Figure 10.3 Example of k-means segmentation using color channel difference with
different number of clusters.

Figure 10.4 We can segment only some objects in the image, not necessarily connected.
However, some clusters are meaningless, as shown in the last image.

Figure 10.5 Example of k-means segmentation using color channel difference with
different number of clusters.

Figure 10.6 Example of 20-means segmentation using color channel difference +


position (recall we represent a pixel as a vector). As shown this improves the
object separation: the two cucumbers are separated.

10.4 Beyond k-means

K-means is a good choice but it suffers from some drawbacks:

• it is a parametric method: k needs to be provided

• it is non-optimal method

• it forces spherical symmetry (see the example in figure 10.1 (c))

How could we derive a deeper analysis on the dataset? An idea is:

• create a density function

• look for the modes (that is the major peaks) of the density function

• identify regions of the input space that climb to the same peak: such regions
belong to the same region/cluster.

This leads to an intuition about how the density function could be used for seg-
mentation.

10.4.1 Density function estimation

The problem is that we have pixels and not a distribution, so we need to find a
function that is related to the pixel distribution. In order to estimate the density
function starting from a given set of pixels, we use the non-parametric approach
called Parzen window technique, aka kernel density estimation. It basically convolves
the data with a given kernel of radius r. Performing this convolution, we
are smoothing the data, which results in a PDF.

Figure 10.7 Example of density estimation using a gaussian kernel. The six in-
dividual kernels are the red dashed curves, the kernel density estimate the blue
curves. The data points are the rug plot on the horizontal axis.

After getting the PDF, it is important to find the peaks of the distribution, which
can be achieved by using mean shift.

The kernel is defined as:

K(x) = ck k(|x/r|²) = ck k(|x|²/r²)

where x is an n-dimensional feature vector, K is the kernel and k() is a one-dimensional
profile generating the kernel. The kernel integrates to 1 and has radius ’r’.
We can use different 1 dimensional profiles generating the kernel:

The last one is the most used. The Epanechnikov kernel has a parabolic
shape.
Hence we compute the convolution with a given kernel of width r, which is expressed
as the sum of a translated kernel for each sample in the dataset:

f (x) = (1/(N r^n)) Σ_{i=1}^{N} K(x − xi ) = (ck /(N r^n)) Σ_{i=1}^{N} k(|x − xi |²/r²)

where f(x) is the estimated PDF, xi are the input samples, k(.) is the kernel profile
(Parzen window) and n is the dimension of the x vector. We note that the
kernel is defined everywhere and centered in xi .

Figure 10.8 Density function estimation example. We start from the set of sample
and get the PDF. We used the mean shift for seeking the peaks.

We note that this procedure is quite computationally complex. It is possible but
might be inefficient.
CAN WE DO BETTER? Mean shift.

10.5 Mean shift

10.5.1 Key idea for estimating the density function (PDF)

Mean shift is a non parametric technique for finding stationary points such as peaks
in a density function (we are connecting data processing with the computer vision.
That is kinda magic). The key idea is to find the peaks in the high dimensional
data distribution WITHOUT computing the distribution function explicitly.
Avoiding the explicit computation of the distribution function is possible because mean shift
assumes that the input data points are sampled from an underlying density function
(PDF): in other words, we consider the data points themselves to represent the PDF.
(Indeed, if we needed to estimate the PDF parametrically, e.g. as a mixture of Gaussians

f (x) = Σj cj e^(−(x−µj )²/(2σj²))

we would have to estimate the parameters cj , σj , µj .)


Hence, for estimating the PDF we look at the data point density (here comes
the non-parametric estimation): the higher the density of the data, the higher the
probability.

Figure 10.9 Estimation of the PDF from data

10.5.2 Density gradient function estimation

We can avoid estimating the PDF and estimate its gradient instead. In this
way, mean shift becomes a steepest-ascent method: the modes are found by following
the direction indicated by the derivative. Indeed, we are interested in finding the
modes and hence we just need to look for where the maximum is, without considering
the shape of the PDF.

Figure 10.10 Density estimation schema

Let’s compute the gradient of the kernel density estimation:

∇f (x) = (1/(N r^n)) Σ_{i=1}^{N} ∇K(x − xi )

recalling:

K(x − xi ) = ck k(|x − xi |²/r²)

then:

∇K(x − xi ) = ck k′(|x − xi |²/r²) ∇(|x − xi |²/r²) = ck k′(|x − xi |²/r²) (2/r²)(x − xi ) (10.3)

Now let’s define:

yi = (1/r²)|x − xi |

and

g(a) = −k′(a)

Then the derivative can be written as:

Figure 10.11 Density gradient estimation

Let’s analyse the last formulation in detail:

Figure 10.12 Density gradient estimation

We can rewrite the previous equation as:

∇fk (x) = A · fg (x) · mg (x)

hence we can obtain the mean shift vector:

mg (x) = ∇fk (x) / (A · fg (x))

Geometrically, the mean shift vector starts at x and points towards the direction
to follow to find the mode.

Figure 10.13 Mean shift vector

The iterative procedure to find the modes using the mean shift vector is:

1. Compute the mean shift vector m(x)

2. move the kernel window by m(x)

3. stop when the gradient is close to 0

4. prune the modes by perturbation

The mode pruning is needed to remove the saddle points, which have zero gradient.
The saddle points are unstable locations and a small perturbation causes the process
to move away from them.

Below we report an example of this procedure. We note that the mean
shift vector magnitude depends on the gradient. As a result, the steps become smaller
towards a mode, and hence convergence depends on an appropriate step size.

Figure 10.14 mean shift vector magnitude about the previous example

Figure 10.15 Example of mean shift procedure

10.5.3 Optimization technique

For optimizing the procedure we

• divide the space into windows

• run the procedure in parallel

• then, all the points that were traversed by the window belong to the mode
where the window ends. Hence, a mode identifies a cluster.

Figure 10.16 Optimization

We note that the window trajectories follow the steepest ascent directions.

Figure 10.17 windows trajectories

10.5.4 Mean shift clustering segmentation

We use the procedure described in the previous subsection.
From the example below we can see that, unlike k-means, mean shift doesn’t force
clusters to be spherical.
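OpenCV does not expose a full mean shift segmenter, but cv::pyrMeanShiftFiltering
performs the mode-seeking step in the joint spatial/color domain (each pixel is replaced
by the mode its window converges to); a connected-components pass on the filtered
result can then produce the clusters. A minimal sketch (file name and window radii
are illustrative):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.png");   // BGR image, placeholder name

    // sp is the spatial window radius (r_s), sr the color window radius (r_r).
    cv::Mat shifted;
    cv::pyrMeanShiftFiltering(img, shifted, 20, 40, 2);

    cv::imwrite("meanshift.png", shifted);
    return 0;
}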

Figure 10.18 Mean shift clustering image example.


a - is the input image
b - is the pixel plotted in L*U*V space
c - L*u space distribution
d - Clusters after 159 means shift procedures.
e - Corresponding trajectories with peaks marked as red dots

Mean shift can effectively separate complex modal structures.

Figure 10.19 Example of complex structures

It allows discontinuities to be preserved. For achieving this we exploit the feature
vector representation of a pixel and we write the kernel as composed of a spatial
and a color component:

K(x) = ck ks (xs /rs ) kr (xr /rr )

where xs and xr are respectively the spatial and the color part of the feature
vector; and actually ks = kr (we used this notation to make the two
components evident).

Figure 10.20 Example of discontinuities preserved. We can note that the grass
region presents different clusters although it is always grass. Also the tripod keeps
its discontinuities: we can distinguish the parts constituting it.

Figure 10.21 Example of discontinuities preserved as the window size changes.
hs , hr in the figure correspond to rs , rr , the window sizes, in our notation.

Figure 10.22 Example of discontinuities preserved.

10.5.5 Segmentation examples using mean shift

10.5.6 Mean shift pros and cons

Pros:

• Does not assume clusters to be spherical

• only one single parameter, i.e. the window size r.

• Contrary to k-means, we don’t need to choose the number of clusters ourselves.

• finds a variable number of modes

• robust to outliers

Cons:

• Depends on window size. It is not always easy to find the best value and
an inappropriate choice can cause modes to be merged.

• Computationally expensive

• Does not scale well with dimension of feature space

10.6 Segmentation seen as a per-pixel labelling task (MARKOV RANDOM FIELD)

The idea is:

• define a set of labels

• define an energy function (similar to a cost function)

• assign a label to each pixel in order to minimize the energy

10.6.1 Label set

The set of labels is a discrete (or continuous) and ordered set.

L = l1 , l2 , ..., lm

with |L|=m.
Labels are associated with sites. A site is a pixel location. We define with Ω the
set of pixels.
Given a pixel location p, its label h is assigned by means of a function:

h = fp = f (p)

10.6.2 Energy function

The energy function is composed of two terms:

1. A DATA TERM depending on specific image features (the image feature vector)
at the pixel location p:

Edata (p, fp )

It evaluates how well fp fits pixel location p and defines a match criterion
based on the image feature vector.

2. A SMOOTHING/CONTINUITY/NEIGHBORHOOD TERM.
It depends on the labels in neighboring pixels. The neighborhood of p is
defined as A(p) and is commonly formed by 4 neighbors.
Σ_{q∈A(p)} Esmooth (fp , fq )

We want to find a balance between the two terms for detecting strong boundaries.
A smoothing term too weak leads to fragmentation.
The energy function evaluated over the whole image is defined as:
 
E(f ) = Σ_{p∈Ω} [ Edata (p, fp ) + Σ_{q∈A(p)} Esmooth (fp , fq ) ]

This model is called MARKOV RANDOM FIELD (MRF).


The MRF considers interactions among pixels in the neighborhood (by means of the
smoothness term). Interactions are modelled considering a graph (this is the
pixel representation used by the MRF; we don’t go into the details). The MRF uses
an undirected graph (hence bi-directional arcs) where the arcs represent the links
of mutual influence: how a pixel is influenced by its neighbors.
The MRF is used to formulate a minimization problem: in our case we would like to
select the labelling that minimizes the energy function.
The data and smoothing term are defined based on the specific application. Let’s
see different examples about how we can define them.

10.6.3 Data term definition examples

Data term - example 1

This procedure is called RANDOM INITIAL SEEDS.

• We select m random pixels in the image, one per label (recall that m is
the size of the label set).

• We calculate the mean µ and the standard deviation σ in a neighborhood
of each of the m selected pixels.

• create a feature vector for each random pixel selected: xi = [Ii , µi , σi ]T for
1 ≤ i ≤ m. In this way we obtain m seeds for the m classes.

• We evaluate Edata as the squared L2 norm between the actual pixel and the
reference of that class:

Edata (p, li ) = |xi − xp |₂² = Σ_{k=1}^{n} (xi,k − xp,k )²

(each label li is associated to a feature vector xi of n elements)


We can use the L1 norm or other metrics.

Data term - example 2

We can define the data term as the PEAKS in the feature space: we select m
local peaks in the feature space by finding them using the Mean shift algorithm.

10.6.4 Smoothness term definition examples

Smoothness term - example 1

We can define it by the POTTS MODEL. Considering two general labels l and h,
if they are equal we set the smoothness term to 0, otherwise to a constant c.

Esmooth (l − h) = Esmooth (a) = 0 if a = 0, and c otherwise (10.4)

Smoothness term - example 2

We can use a linear smoothness cost:

Esmooth (l − h) = Esmooth (a) = b ∗ |a| (10.5)

where b is a constant.
A truncated version is also available.
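A small sketch of how the energy E(f) of a given labelling could be evaluated, using a
squared intensity difference as data term and the Potts model as smoothness term
(the function name, the 'reference' intensities and the use of a 4-neighborhood are
assumptions of this sketch, not part of the notes):

#include <opencv2/opencv.hpp>
#include <vector>

// img: grayscale input; labels: f_p per pixel (CV_32S);
// reference: one reference intensity per label; c: Potts constant.
double mrfEnergy(const cv::Mat& img, const cv::Mat& labels,
                 const std::vector<double>& reference, double c) {
    double E = 0.0;
    for (int y = 0; y < img.rows; ++y)
        for (int x = 0; x < img.cols; ++x) {
            int fp = labels.at<int>(y, x);

            // Data term: how well label fp fits the observed pixel.
            double d = img.at<uchar>(y, x) - reference[fp];
            E += d * d;

            // Smoothness term (Potts): cost c for each 4-neighbor with a
            // different label, 0 otherwise.
            const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || ny < 0 || nx >= img.cols || ny >= img.rows) continue;
                if (labels.at<int>(ny, nx) != fp) E += c;
            }
        }
    return E;
}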

10.7 Belief Propagation (BP) algorithm

The minimization problem defined by the MRF can be solved using the Belief
Propagation (BP) algorithm. Such an algorithm is based on message passing between
neighboring pixels; the messages convey information about labels and related costs. The
algo runs several iterations and messages are delivered at each iteration.
Consider the message sent by pixel p to pixel q. It contains

• local information at p

• info provided by p’s neighbors in previous iterations

At each iteration every pixel p in q’s neighborhood passes its message to q.

Throughout the process each node acts as q in iteration i (it receives info from
the neighborhood pixels) and as p in iteration i+1 (it sends info to another pixel).

10.7.1 Message

A message is a function mapping the ordered discrete set of labels L (of size m)
into an array of length m having, for each position, the message value.
Such function is defined as :
m^t_{p→q}(l)

that indicates the message value for label l ∈ L sent from p to q at time t.

HOW ARE THE MESSAGE VALUES CALCULATED?


The analytical formulation defining the message value is:

How to read it? The message from p to q at time t, for label l, is obtained by
minimizing, over all possible labels h at p, the sum of: the data cost at p of label h,
the smoothness cost if q takes label l and p takes h, and the costs for label h at p
reported by p’s neighbours (excluding q) at the previous iteration.
More detailed interpretation:

• Node q inquires all its neighbours (such as p) asking their opinion about
selecting label l at q.

• Node p gets the question – to reply, it scans all possible h and tells q about
the lowest possible cost when q takes l.

The reply of node p depends on:

• The data term at p for label h (to be minimized)

• The smoothing term if the labels l and h are different

• The information brought by the neighbors of p (excluding q) in the previous


iteration (time t-1)

• Node q is excluded from the computation (It cannot influence the opinion
of p about q itself).

So far we have discussed the message from p to q. Then, we need to extend this
process also to every neighbor of q. At q we need to take into account also the
local data term and hence the total cost at q is (the smoothness costs are not
appearing because they are already embedded in mtp−
→q (l)):
X
Edata (q, l) + mtp−
→q (l)
p∈A(q)

Hence, these costs take into account both the neighborhood of the sender and
that of the receiver node.
These costs are the core of the Belief Propagation algorithm. The steps of this
algorithm are:

1. Initialization

2. Time update: we have just seen this so far.

3. Termination

10.7.2 BP initialization

We need to initialize the messages and the related costs at time 0. Note that at
time 0 we lack the history from previous iterations.

Messages: m^0_{p→q}(l) = min_{h∈L} (Edata (p, h) + Esmooth (h − l)) (10.6)

Related costs: Edata (q, l) + Σ_{p∈A(q)} m^0_{p→q}(l) (10.7)

10.7.3 BP time update and termination

Time updates are controlled by the equations previously detailed.
Termination is often based on a pre-defined number of iterations. Other options are
possible, such as observing the mean of all changes. There is no guarantee of
convergence, however, so we often settle for the simple criterion of counting the
iterations.

10.7.4 Algo output

Upon termination (at time t0 ) the final label at each pixel location is determined
as

j = argmin_{1≤i≤m} c_i^{t0}

If more than one minimum is found, we can for instance favour the labels assigned
to adjacent locations.
So we end up having a label assigned to each pixel.

10.7.5 BP for segmentation

BP can be used for image segmentation. Each label found by running the BP
algo defines one or more segments: the pixels with the same label belong to the
same segment.
BP is often applied at multiple resolutions (this technique is called pyramidal BP).

In this example we can separate the people from the background.

Figure 10.23 Example 1 - BP segmentation

Figure 10.24 Example 2 - BP segmentation

Chapter 11

Active contours

An active contour is a moving contour that automatically tries to delineate an
object outline. A contour is a curve corresponding to the object boundaries and
it has a different conceptual meaning from edges. An edge is just a set of pixels
resulting from some filtering operation such as the Canny algorithm. It corresponds
to a transition of intensity, and edges can be open or closed. Contours instead are
lines not necessarily linked to intensity transitions; they are always closed and are
described by a parametrized function. Contours might still highlight edges if they are
attracted by edges, but in general a contour can move according to other criteria.
Active contours are widely used in applications like object tracking, shape recognition,
segmentation and edge detection.

11.1 Snake

A first approach to locate such boundary curves in images is called Snakes. The snake
approach mathematically defines an active contour as a 2D energy-minimizing
spline curve that evolves (moves) towards image features such as strong edges and
lines. We recall that a spline is a mathematical tool for defining a curve piecewise.
The snake method consists in the following steps:

1. Initialization process

2. The curve moves itself to try to fit the contour of the object

3. The contour stops moving when most of its points are edge pixels.

11.1.1 Initialization process

The initialization process consists in placing a spline curve close to the desired
contour. This initialization step can be performed manually by the user through
a user interface, or automatically. A good initialization is critical.

11.1.2 Contour move

After initialization, the curve we drew starts deforming itself to try to fit the
contour of the object as best as it can. What makes the contour move is an
energy balance. This energy is made of two components: an internal energy
that forces the snake to stay in shape and so controls the deformation of the
snake, and an external energy that forces the snake to be attracted by some
image elements such as the edges. Having defined a spline curve, hence a snake, as
v(s) = (x(s), y(s)), we define the energies as follows:

Figure 11.1 Internal energy. The parameters α(s) and β(s) defines the stretch
and the curvature of the snake pointwise. For instance setting β(s) = 0 at some
points allows the snake to develop a corner.

Figure 11.2 External energy

Looking at the external energy we can see that it depends on some terms referring
to elements of the image: lines, edges and terminations. Based on how we
adjust the weights, the snake behaves differently. The edge term is defined to be
proportional either to the image gradients, or to a smoothed version of the image
Laplacian:

Figure 11.3 Left: edge energy term proportional to the image gradients; Right:
edge energy term proportional to a smoothed version of the image Laplacian. Gσ
is a Gaussian with standard deviation σ

The line term is expressed in terms of the intensity of the image: Eline = I(x, y).
Then, depending on the sign of the weight wline , the snake will be attracted either
to light lines or to dark lines.
The Eterm term deals with guiding the snake towards the terminations of line
segments and towards corners.

Figure 11.4 Eterm definition

However, in practice most systems only consider the edge term in the definition of the external energy.

Figure 11.5 Snake example

Figure 11.6 Snake example using only Eterm and Eedge

Figure 11.7 Snake example used for motion tracking

Figure 11.8 Snake example for pedestrian detection. In this application the shape of the pedestrians could be extracted instead of drawing rectangles around them. If we change the energy balance we can change the behaviour of the snake and make it go inside the concave region of the legs.

Figure 11.9 Snake example from the paper ’Active contour model by combining
edge and region information discrete dynamic systems’ by Zhifeng Wang and Yan-
Jun Liu (https://journals.sagepub.com/doi/pdf/10.1177/1687814017692947)

11.2 Scissors

The snake method moves the curve only after we initialize a boundary of interest. A different approach is Scissors. Scissors optimize the contour in real time as the user is drawing.

Figure 11.10 Scissor example

11.3 Level set

– NOT COVERED BY THE COURSE –


A limitation of active contours based on parametric curves of the form v(s) such
as snakes is that it is challenging to change the topology of the curve as it evolves.
Furthermore, if the shape changes dramatically, curve reparameterization may also
be required. An alternative representation for such closed contours is to use a level
set.

Figure 11.11 Level set example used for segmentation

Chapter 12

Feature detection and matching

12.1 Introducing features

Detecting and matching salient elements of the image is a key task in computer
vision. Such elements are called features. A feature is a ”meaningful” part of the
image.
Features have two main components:

• Feature detection: finding a stable (easily detectable) characteristic point of the image.

• Feature description: after finding good points, we need to measure (get a description of) the surrounding area.

Hence, when dealing with features the input is an image and the desired output is a set of points plus a description of their surrounding region (aka signature, fingerprint, ...).

Figure 12.1 Feature detection example

12.2 Feature pipeline sum-up

Figure 12.2

12.2.1 Set of points

The points we have to find shall be :

• stable and repeatable

• invariant to transformations e.g. rotations

• insensitive to illumination changes

• accurate: it is attached to a specific element of the image.

A way to measure stability is to consider a repeatability score on a pair of images: whatever detector algorithm is used to find the good points, we apply it to two images and the score is defined as the ratio between the number of point-to-point correspondences that can be established and the minimum number of keypoints in the two images.

Figure 12.3 Stability measure example

12.2.2 Descriptors

The descriptor is a vector representation of the local patch (that is, the surrounding area of a keypoint). Keep in mind that we don't choose the pixels of the surrounding area: this is up to the designer, hence we need to consult the manual to know how the descriptor is implemented.
The descriptor is based on:

• color

• texture

• orientation

• etc.

The descriptor shall be:

• robust to occlusion and clutter

• robust to noise, blur, compression, discretization

• discriminative: it should be different for each feature

• stable over changes in viewing angle and illumination

• computationally efficient (indeed there are many features per image).

12.2.3 Feature matching

Once we have the keypoints and the descriptors, one operation we would like to do is to match the features. Matching means:

• evaluate features in the two images. Hence, find the keypoints and the
descriptor for each of them.

• find similar features looking at their descriptor. We use a distance metric
for stating how similar the two features are based on their descriptor.

Figure 12.4 Matching example. The green lines indicate good feature matches,
while red lines wrong matches.

12.3 Feature applications

Several CV systems are based on features, such as:

• motion detection

• stitching: create a panoramic view from multiple images (landscape images)

• 3D reconstruction

12.3.1 Matching application

• Scene reconstruction & structure from motion (SFM)

Figure 12.5 Scene reconstruction & structure from motion (SFM). The number of triangles represents the number of photos used for the reconstruction and, in particular, the orientation of each triangle indicates the angle at which the image was shot.

• Stitching

Figure 12.6 Stitching

• Query by example, place recognition

Figure 12.7

• Instance matching/object localization

Figure 12.8 Instance matching/object localization

• Object detection

Figure 12.9

12.3.2 Tracking application

• Follow patterns in video flows

Figure 12.10

12.4 Feature (aka keypoint) detection

We now focus on the feature detection (called also as keypoints detection), which
is the first step of the feature pipeline.

Figure 12.11

What are good points? The easiest keypoints to detect are the corner points, because they are fixed at one position. Indeed, edge points can slide along the edge direction, while points in a uniform region can move everywhere.

Figure 12.12 Good points are the corner ones.

The salient points (aka keypoints) can be detected using many different criteria
and different algorithms exist. The most famous algorithm is the Harris-Stephens
corner detector.

12.5 Harris corners detector

The intuition behind the algorithm is to consider a patch in an image and a shifted version of the patch. Then:

• in a uniform region the two patches will be similar

• for salient points instead the two patches will be different. In particular,
corners produce a large difference if the patch is moved.

So the pseudocode can be written as:

1. Consider a patch whose center is located at position (xi, yi).

2. Consider a patch of the same size but displaced by (∆x, ∆y). We do this without making any assumptions: a keypoint can be matched in any portion of the image.

3. We measure the similarity of the two patches by means of the autocorrelation function

4. We use the autocorrelation to compute a score which classifies the patch as flat, edge, or corner

12.5.1 Autocorrelation function

The autocorrelation function is computed over all the pixels (s, t) in the patch:

E(∆x, ∆y) = Σ_s Σ_t w(x_s, y_t) [I(x_s + ∆x, y_t + ∆y) − I(x_s, y_t)]²

where:

• I(xs , yt ) is a patch of the image I

• I(xs + ∆x, yt + ∆y) is a patch of the same size as I(xs , yt ) but shifted by
(∆x, ∆y)

• (s,t) are all the pixels in the patch

• w(xs , yt ) are some weights

We can approximate E using the Taylor series. The accuracy of this approximation increases as ∆x and ∆y become smaller and smaller.

I(xs + ∆x, yt + ∆y) ≈ I(xs , yt ) + Ix (s, t)∆x + Iy (s, t)∆y

where the partial derivatives are computed for the points in the patch:

I_x = ∂I/∂x (x_s, y_t)
I_y = ∂I/∂y (x_s, y_t)
Hence, substituting it into the autocorrelation function and neglecting the weights yields:

E(∆x, ∆y) = Σ_s Σ_t [I_x(s, t)∆x + I_y(s, t)∆y]²

We can write it in matrix form:

The matrix A describes how the region changes for a small displacement. This
matrix is REAL and SYMMETRIC. Under these conditions, the eigenvectors are
ORTHOGONAL and point to the directions of max data spread (where the data
are the points of the partial derivative evaluated in each point of the patch
(Ix (s, t), Iy (s, t))). The corresponding eigenvalues are proportional to the amount
of data spread in the direction of the eigenvectors.

Figure 12.13 Eigenvectors

Thus, we see that the eigenvalues of the matrix formed from derivatives in the
image patch can be used to differentiate between the three scenarios of interest:

• two small eigenvalues imply the presence of uniform region

• one large and one small eigenvalue imply the presence of a vertical or hori-
zontal boundary, hence an edge.

• two large eigenvalues imply the presence of a corner or (unfortunately) an isolated bright spot.

Figure 12.14 λ are the eigenvalues. The green region represents the values of the eigenvalues identifying a corner, the grey region identifies a uniform region, and the orange one identifies an edge.

Figure 12.15 The three scenarios differentiated based on the eigenvalues. fx, fy in our notation are Ix, Iy and λ are the eigenvalues.

The weights of the autocorrelation function are convolved with the autocorrelation
matrix :

Figure 12.16

and the autocorrelation function can be written as:

E(∆x, ∆y) = [∆x, ∆y] A [∆x, ∆y]T

The weights can be implemented in two different ways:

• w(xs , yt ) = 1 if the point xs , yt is inside the patch, and 0 elsewhere

• Gaussian: it gives more emphasis to changes around the center:

w(x_s, y_t) = e^(−(s² + t²) / (2σ²))

12.5.2 Keypoints selection criteria

Given a matrix A and its eigenvalues/eigenvectors, how is a keypoint selected? Basically we compute a score R whose value is used to classify the patch as flat, edge, or corner. Typically R is used with a threshold T: we have a corner only if R > T. There are several options presented in the literature for computing such a score R:

• minimum eigenvalue (proposed by Shi and Tomasi) i.e R = min(λ0 , λ1 )

• R = det(A) - αtrace(A)2 = λ0 λ1 − α(λ0 + λ1 )2 [Harris]

• R = λ0 − αλ1 [Triggs]

• R = det(A)/trace(A) = λ0·λ1 / (λ0 + λ1) [Brown, Szeliski, Winder]
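As a rough illustration of how such a score is used in practice, OpenCV exposes the Harris response through cv::cornerHarris (which computes R = det(A) − α·trace(A)²). The sketch below is only a minimal example, not the course's reference code; the input file name, the 3x3 patch and Sobel sizes, α = 0.04 and the 1% threshold are arbitrary assumptions.

#include <opencv2/opencv.hpp>

int main() {
    // Hypothetical input image, loaded as a single-channel grey scale image
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return -1;

    cv::Mat R;
    // blockSize = patch size, ksize = Sobel aperture used for Ix and Iy, k = alpha
    cv::cornerHarris(img, R, 3, 3, 0.04);

    // Keep only the patches whose score R is above a threshold T
    double maxR;
    cv::minMaxLoc(R, nullptr, &maxR);
    cv::Mat corners = R > 0.01 * maxR;   // binary mask of corner pixels

    cv::imshow("Harris corners", corners);
    cv::waitKey(0);
    return 0;
}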

12.5.3 Invariance properties

The Harris corners detector is:

• invariant to brightness offset:

I(x, y) → I(x, y) + c

• invariant to shift and rotations (corners maintain their shape)

Figure 12.17

• NOT invariant to scaling

Figure 12.18

Figure 12.19 Harris corner example1

Figure 12.20 Harris corner example2

12.6 Other detectors

Several other corner detectors exist beyond the Harris detector.

12.6.1 SUSAN AND USAN

The SUSAN detector can detect both corners and edges. It is interesting because:

• it analyzes a circular window around the point

• no derivatives are involved

• it considers both edges and corners in the detection

• it is robust to noise

SUSAN is based on the USAN (Univalue Segment Assimilating Nucleus) concept. We compare the intensity of the central point of the mask (the nucleus) with the intensity of each pixel in the mask, and we return the portion of the mask whose intensity is similar to the nucleus (within a given threshold). This portion of the mask is called USAN and its pixels are marked with an equal value (hence the name Univalue).

Figure 12.21 USAN example. In the rightmost image, the USANs are the white regions inside the masks.

Based on the area of the USAN we distinguish corners, edges and uniform regions:

Figure 12.22 USAN

SUSAN stands for SMALLEST USAN: hence we take into account only the USANs
that have the smallest region (which indicates an edge).

Figure 12.23 USAN edge detector example

12.7 Blob features detection

Features can also focus on blobs instead of points. A blob is a region whose properties are such that:

• properties are different from surrounding regions

• properties are (approximately) constant inside the region

12.7.1 MSER

Reference paper: J. Matas, O. Chum, M. Urban, T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions", Image and Vision Computing, 2004.
MSER (Maximally Stable Extremal Regions) is a blob feature where a blob is defined by a connected area characterized by almost uniform intensity and surrounded by a contrasting background.
The algorithm in a nutshell:

• apply a series of thresholds (e.g one for each grey level)

• compute the connected binary regions of each thresholded image.

• compute some statistics for each region e.g. area, convexity, circularity, and
so on.

• analyze how persistent each blob is. Under the hood, MSER is monitoring
the changes to each of the thresholded images and is looking for regions of
the thresholded images that maintain unchanged properties over a large set
of the possible threshold values.

’Extremal’ refers to the property that all pixels inside the MSER have either higher
(bright extremal regions) or lower (dark extremal regions) intensity than all the
pixels on its outer boundary.
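For reference, a minimal sketch of how MSER regions can be extracted with OpenCV's cv::MSER class follows; the input file name and the default detector parameters are assumptions, and drawing the bounding boxes is only for visualization.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);  // hypothetical input
    if (img.empty()) return -1;

    cv::Ptr<cv::MSER> mser = cv::MSER::create();     // default thresholds and area limits
    std::vector<std::vector<cv::Point>> regions;     // pixels belonging to each stable region
    std::vector<cv::Rect> boxes;                     // bounding box of each region
    mser->detectRegions(img, regions, boxes);

    cv::Mat vis;
    cv::cvtColor(img, vis, cv::COLOR_GRAY2BGR);
    for (const cv::Rect& r : boxes)
        cv::rectangle(vis, r, cv::Scalar(0, 255, 0));  // one rectangle per detected blob
    cv::imshow("MSER", vis);
    cv::waitKey(0);
    return 0;
}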

Figure 12.24 Multi thresholding example

12.8 Object detection in different scale space

We would like to analyze the image at multiple scales. To achieve this we use the SCALE SPACE as a tool. The scale space transforms the original image f (x, y) into a family of derived images f (x, y, t). These images are identified by the parameter t, which indicates the location of the image in the scale space. Looking at the pyramid representation in figure 12.25, t represents the level in the pyramid and the original image is at the base.
For increasing t, the details are gradually suppressed (reducing the scale, we reduce the resolution, which means the information content of the image). Why should we aim at suppressing image details? By removing the details we remove the weak elements and select the strongest elements of the image.

Figure 12.25 Pyramid scale space representation

12.8.1 Scale space implementation

The scale space should have the following requirements:

• linearity

• spatial shift invariance

• no new local extrema shall be created

• scale invariance

The literature demonstrated that Gaussian smoothing meets such criteria [ T.
Lindeberg, ”Scale-space: A framework for handling image structures at multiple
scales"]. The key element is the N-dimensional Gaussian kernel g : R^N × R^+ → R:

g(x, t) = (1 / (2πσ²)^(N/2)) · e^(−(x1² + ... + xN²) / (2t))

where the parameter t is the scale parameter and not the variance.
Thus the scale space is represented as a convolution with a Gaussian kernel. In a continuous domain this is expressed as (the discrete domain is left as an exercise):

L(x, t) = ∫_{ξ∈R^N} f(x − ξ) g(ξ, t) dξ

where g(.) is the gaussian kernel.


Scale space has a relation with biological vision: photoreceptors in the retina can be modeled as a superposition of DoG (Difference of Gaussians) [Young 1987].

Figure 12.26 Gaussian scale space for t=0 (original image), t=4, t=64, t=256.

Figure 12.27 No local minima are created in the scale space.

12.8.2 Edge detection in scale space

We can detect edges using the scale space. We shall be able to

• select the best scale (hence the value of t) for each edge

• combine the edges found at multiple scales (for their best value of t).

We need to define a measure of edge strength to identify the best scale value t. A simple idea is to normalize the magnitude of the edge by t.
Strong edges can be found at all scales. At fine scales edges are accurate but many spurious edges are found. At coarse scales edges are inaccurate but only strong edges are found. The combination of the detections leads to the best results.

Figure 12.28 Edge detection example

Defining a metric for edge strength lets us find the strongest edges (found at all
scales.)

Figure 12.29 Edge detection example with edge metric

Figure 12.30 Edge detection example with edge metric

12.9 The SIFT feature

12.9.1 Introduction

The Scale Invariant Feature Transform offers a very reliable keypoint detector and
descriptor. It is widely used. When images are similar in nature (same scale,
similar orientation, etc), corner detection and MSERs are suitable as whole image
features. However, in the presence of variables such as scale changes, rotation,
changes in illumination, and changes in viewpoint, we are forced to use methods
like SIFT. SIFT features (called keypoints) are invariant to image scale and rota-
tion, and are robust across a range of affine distortions, changes in 3-D viewpoint,
noise, and changes of illumination. The input to SIFT is an image. Its output is an
n-dimensional feature vector whose elements are the invariant feature descriptors.
The key idea is based on mapping the image content into local feature coordi-
nates. This makes the feature detection invariant to translation, rotation, scale
and other transforms.

Figure 12.31 Image content mapped into local feature coordinates

The strong points of SIFT are:

• The SIFT features are local and so robust to occlusions.

• It is distinctive, so it is capable of distinguishing objects in large databases.

• It provides a dense set of features that allows detecting small objects.

• It is efficient.

Some applications are:

• matching and registration

• object recognition

• query by example

12.9.2 SIFT algorithm

The SIFT algorithm covers the first two blocks of the feature pipeline: feature detection and description.

Figure 12.32

For describing the algorithm we are going to describe 4 elements that characterise
it:

• scale-space organization

• keypoint localization (extrema detection)

• orientation measurement

• descriptor calculation

12.9.3 Scale-space organization

We discuss here how scale invariance is achieved in SIFT.
SIFT makes use of the scale space, which is organized in octaves. An octave is represented as a progressive Gaussian smoothing having standard deviations σ, kσ, k²σ, k³σ, ... up to a given number of smoothing levels. Then, the image is downsampled and smoothed again with a sigma that is double that of the previous octave, σ_next octave = 2σ_prev octave. Hence, the finer details get lost as the images go up both in scale and in octave.

This procedure is applied to each octave. That is, the first image of the new
octave is formed by:

1. downsampling the original image enough times to achieve half the size of
the image in the previous octave

2. gaussian smoothing the downsampled image with a new standard deviation


that is twice the standard deviation of the previous octave. This results in
reducing the scale as explained previously. In particular, this is obtained not
by using different kernel size but simply by applying the same smoothing
stack filtering on the downsampled original image of the previous octave.
Indeed, downsampling will reduce the number of pixels, so we don't need to increase the size of the filter since the number of pixels the kernel covers is lower at each stage. The downsampling allows us to save a lot of computational time.

3. get the rest of the images in the new octave by Gaussian smoothing the downsampled image with standard deviations equal to those of the previous octave: kσ, k²σ, k³σ, ... Hence, within one octave, intermediate scales are used (that is, Gaussian smoothing).

We represent it in figure 12.33.
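The octave construction just described can be sketched in a few lines of C++/OpenCV. This is a simplified illustration of the idea (each level is blurred directly from the octave base, and the next octave simply halves the resolution), not the exact SIFT implementation; the values nOctaves = 4, s = 3 and σ = 1.6 are assumptions.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Build a SIFT-like Gaussian scale space: nOctaves octaves, s intervals per octave.
std::vector<std::vector<cv::Mat>> buildScaleSpace(const cv::Mat& gray,
                                                  int nOctaves = 4, int s = 3,
                                                  double sigma = 1.6) {
    const double k = std::pow(2.0, 1.0 / s);           // k such that k^s * sigma = 2 * sigma
    std::vector<std::vector<cv::Mat>> pyramid;
    cv::Mat base = gray.clone();

    for (int o = 0; o < nOctaves; ++o) {
        std::vector<cv::Mat> octave;
        for (int i = 0; i <= s; ++i) {
            cv::Mat level;
            double sig = sigma * std::pow(k, i);        // sigma, k*sigma, k^2*sigma, ...
            cv::GaussianBlur(base, level, cv::Size(), sig, sig);
            octave.push_back(level);
        }
        pyramid.push_back(octave);
        // Next octave: halve the resolution of the current octave base
        cv::resize(base, base, cv::Size(), 0.5, 0.5, cv::INTER_NEAREST);
    }
    return pyramid;
}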

What is k?

Within one octave we have s intervals, which are the gaps between one image
and the next one. We can define the number of images in an octave based on
the number of intervals: in an octave there are s + 1 images. For each image
withing an octave we use different standard deviation: starting from the bottom
image, the standard deviations are σ, kσ, k 2 σ, ..., k s σ. In particular the last image
of one octave (with standard deviation k s σ) has the same smoothing as the first
in the next octave, but we recall the second one is downsampled so the result is
different. Hence, the constraint yields:

k^s σ = 2σ  →  k = 2^(1/s)

For instance, for s = 2 (3 images) the standard deviations will be σ, √2·σ, 2σ.

Figure 12.33 Scale space organization

DoG

SIFT makes use of the difference of Gaussians of two adjacent scale-space images in an octave. This procedure anyway requires two additional images in the scale space to process the first and last image.
In image 12.34, s = 2 and hence the main images are 3: the ones at scales 2, 3 and 4. Hence, what was said before holds if we neglect the 1st and 5th image. Indeed, we can see that the standard deviation of the first image of the second octave is equal to that of the 4th image of the 1st octave.

Figure 12.34 scale space sift for s=2

All the layers are images. Let’s take a look at one of them.

Figure 12.35 Output images

They look like edge images. Why? Subtracting consecutive layers means evaluating the difference between two images smoothed with the same kind of smoothing (i.e. Gaussian smoothing) but different smoothing intensity. Such an operation is actually achieved in one shot by applying the Difference of Gaussian (DoG) kernel, which is represented by the function D(x, y, σ). This kernel brings up the edges because it approximates the edge detection filter LoG (Laplacian of Gaussian), which is exploited in the Marr-Hildreth edge detector algorithm (it simply convolves the image with the LoG filter). Given the Gaussian filter G(x, y):

G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))

the LoG filter computes its Laplacian, ∇²G(x, y). Hence, filtering with the LoG filter corresponds to:

• filter the input using G(x, y)

• compute the Laplacian of the resulting image

As told before, the LoG is usually approximated by the DoG, which computes the following formula:

D(x, y) = (1 / (2πσ1²)) · e^(−(x² + y²) / (2σ1²)) − (1 / (2πσ2²)) · e^(−(x² + y²) / (2σ2²))

where σ1 > σ2.
Hence, in SIFT the scale space plus the subtraction of consecutive layers implements the DoG filter.

Figure 12.36 DoG and LoG

Figure 12.37 LoG kernel

12.9.4 Keypoint localization

The DoG explained before is the very first step to localize the keypoints. Indeed, after that process, SIFT searches for the maxima and minima of the DoG, which represent the initial keypoints that will be refined later.
The figure below shows the procedure used by SIFT to find extrema in a D(x, y, σ) image. At each location (shown in black) in a D(x, y, σ) image, the value of the pixel at that location is compared to the values of its 8 neighbors in the current image and its 9 neighbors in each of the images above and below. Then, the point is selected as an extremum (maximum or minimum) if its value is larger than the values of all its neighbors, or smaller than all of them. NOTE that no extrema can be detected in the first and last scale of an octave because they have, respectively, no lower and no upper scale image of the same size.
This makes SIFT features scale independent.

Once we have detected the candidate keypoints as local extrema, we refine and validate them in two ways:

1. We use interpolation to achieve sub-pixel accuracy. This is achieved by re-representing each D(x, y, σ) as a Taylor series expansion up to the quadratic term. (We didn't go into the details of this but more info can be found at page 890 in the book Digital Image Processing, 4th edition.) This is done because when we sample a digital image we can lose the actual position of the extrema. Hence this procedure allows us to get closer to the extrema by looking for the extrema in the interpolated function.

2. We re-examine each pixel of the D(x, y, σ) by using the Hessian to discard all the keypoints that have low strength (hence low gradient) or that represent edge points. Hence, we require that both the eigenvalues of H are high, which means that both curvatures are high and hence that the point is a corner. (We didn't go into the details of this but more info can be found at page 890 in the book Digital Image Processing, 4th edition.)

Figure 12.38 SIFT keypoints after search refinement

Figure 12.39 Output of SIFT: it is a set of keypoints with associated scales. The
size of the circle represents the level of the scale where the keypoint was found.

12.9.5 What is the best value for s?

We consider the repeatability of the feature, that is how well the features are
repeated over the same element in different scale images of an octave. From the
experiments we can see that the repeatability has a pick for s = 3. Hence this is
the best value to use for s.

Figure 12.40

12.9.6 Keypoint orientation

At this point we have computed keypoints that SIFT considers stable and we have achieved scale independence. Now we have to achieve invariance to image orientation.
Each keypoint previously computed comes with its scale. For each keypoint we consider its Gaussian smoothed image representing the scale and we name it L. Then we compute the gradient magnitude and orientation using L:

M(x, y) = sqrt[ (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² ]     (12.1)

Θ(x, y) = tan⁻¹[ (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) ]     (12.2)

To achieve rotation invariance, the dominant direction shall be measured. This is done by considering the histogram of gradient orientations over a neighborhood of the keypoint. The histogram consists of 36 bins of 10° each, covering the whole range of orientations.
Then we look for the peak in the histogram, which represents the dominant orientation associated to the keypoint. Then we fit a parabola to the 3 bins closest to the peak to refine the peak position.
We also consider the secondary peaks: the ones whose value is > 80% of the highest peak. Each secondary peak duplicates the keypoint: a new keypoint with the same scale but a different orientation is created. This takes into account that one keypoint can have more than one dominant orientation.

Figure 12.41 Histogram creation

Figure 12.42 Keypoint detection example. The arrows represent the keypoint
orientation.

12.9.7 Descriptor calculation

The procedures discussed up to this point are used for assigning an image location,
scale, and orientation to each keypoint, thus providing invariance to these three
variables. The next step is to compute a descriptor for a local region around each
keypoint that is highly distinctive and at the same time as invariant as possible
to changes in scale, orientation, illumination, and image viewpoint. The idea is
to be able to use these descriptors to identify matches (similarities) between local
regions in two or more images.
The idea used by SIFT is to evaluate a descriptor for each keypoint in the corre-
sponding image L considering coordinates of pixels that are rotated based on the
measured keypoint orientation.

The full descriptor calculation process is the following. We consider a neighbor-
hood 16x16 of the keypoint, and for each pixel inside this region we compute the
gradient magnitude and direction. Each gradient magnitude point is weighted
(multiplied) with a gaussian function centered on the keypoint: recalling the bell-
shaped surface of the gaussian, the weights decrease as we move away from the
center, so the points around the keypoint have high weights.
Then we group all the pixels of the 16x16 neighborhood region into 4x4 pixel groups, thus obtaining 16 sub-regions. For each sub-region, we quantize all the gradient orientations into eight possible directions differing by 45 degrees and we evaluate a histogram: the histogram has 8 bins, one for each possible direction. After applying this procedure to all the sub-regions, we get a descriptor which is a 128-dimensional vector (this is the data structure used by SIFT, but for simplicity we can imagine it as a 4x4 matrix, each cell containing a histogram: indeed 8 directions x 4 rows x 4 columns = 128 elements).

Figure 12.43 Descriptor procedure

Description refinements

In order to enhance invariance to illumination, the feature vector is normalized in two stages:

1. First we normalize it to unit length by dividing each component by the vector norm. This ensures good invariance to uniform illumination: a brightness change caused by a constant being added to each pixel will not affect the gradient values because they are computed from pixel differences. Therefore, the descriptor is invariant to affine changes in illumination.

2. Non-uniform illumination (resulting for instance from camera saturation) causes large variations in the magnitudes of some of the gradients but it doesn't affect the gradient orientation. To reduce the influence of large gradient magnitudes, SIFT thresholds the values of the normalized feature vector so that all components are below the experimentally determined value of 0.2. After thresholding, we further normalize the feature vector to unit length.

12.9.8 SIFT sum up

Summing up the steps:

1. create the scale space

2. evaluate the initial keypoints

3. refine the keypoints by interpolation

4. reject unsuitable keypoints by means of the hessian matrix

5. compute the keypoint orientation

6. compute the keypoint descriptor
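In practice the whole pipeline above is available off the shelf. A minimal sketch using OpenCV is reported below (assuming OpenCV ≥ 4.4, where cv::SIFT is part of the main features2d module); the input file name is an assumption.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);  // hypothetical input
    if (img.empty()) return -1;

    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> keypoints;   // location, scale and orientation of each feature
    cv::Mat descriptors;                   // one 128-dimensional row per keypoint
    sift->detectAndCompute(img, cv::noArray(), keypoints, descriptors);

    cv::Mat vis;
    cv::drawKeypoints(img, keypoints, vis, cv::Scalar::all(-1),
                      cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS);  // circles show scale/orientation
    cv::imshow("SIFT keypoints", vis);
    cv::waitKey(0);
    return 0;
}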

12.9.9 SIFT examples

NOTE that every library displays the features in a different way.

Figure 12.44 SIFT using VLFeat library

Figure 12.45 SIFT using VLFeat library

Figure 12.46 SIFT using OpenCV library

12.9.10 SIFT robustness

The robustness of the SIFT features to noise factor was accurately measured.

Robustness to noise

SIFT proves to be robust to noisy, randomly-rotated images. From the graph in figure 12.47 we can see that the capability of the system to correctly match the features on the randomly-rotated noisy images decreases as the noise increases, but it remains high.

Figure 12.47 Robustness to noise

Robustness to affine transform

SIFT proves to be robust to noisy, randomly-rotated images at an increasing viewpoint angle (which represents an affine transformation of the image). From the graph in figure 12.48 we can see that the capability of the system to correctly match the features on the randomly-rotated, 2%-noise images decreases as the affine transformation (the angle at which the photo has been taken) increases, but it remains high.

Figure 12.48 Robustness to affine transform

The number of features discovered is quite high also in the case of 2% noise and a 30° viewpoint change.

Figure 12.49 Robustness to noise

12.9.11 Questions from slides

Why is SIFT invariant to scale? Why is SIFT invariant to illumination changes? Why is SIFT invariant to orientation?

12.10 Overview of other SIFT-based features

SIFT provides very good results but there exist other techniques.
The main ones that share the same idea as SIFT are:

• PCA-SIFT

• SURF

• GLOH

12.10.1 Principal Component Analysis

PCA recall

PCA is a dimensionality reduction technique: basically we use PCA to reduce the size of the features. Reducing the dimensions can be useful to:

• work on more compact representations

• reduce computational workload

• highlight the most important information hiding the details

PCA achieves a reduction of the size by performing an orthogonal projection of the data onto a lower-dimensional linear space that:

• maximizes the variance of projected data (purple line)

• minimizes mean squared distance between the original data point and the
projection (sum of blue lines). Hence this represents the error coming from
projecting the data.

So PCA provides us with a lower dimensional space that preserves as much infor-
mation as possible.
PCA is based on the eigenvalues decomposition of the covariance matrix. The
largest eigenvectors, called principal components, are the basis of the new space.
The first principal component is the eigenvector pointing in the direction of largest
variance. Each other principal component :

• is orthogonal to the previous ones

• points in the direction of largest variance of the residual subspace

The number of components (number of eigenvectors) we choose corresponds to the new dimension of the data.

Figure 12.50 Examples of reduction from n to 2 dimensions

PCA-SIFT

The steps are:

1. Steps 1-3: are the same as SIFT.

2. Step 4: it is different from step 4 of the SIFT procedure. Indeed, for computing the descriptor we consider a 41x41 patch at the given scale, centered at the keypoint and normalized to a canonical direction [more details in Yan Ke and Rahul Sukthankar, "PCA-SIFT: A More Distinctive Representation for Local Image Descriptors", Computer Vision and Pattern Recognition, 2004].

3. Then, while in SIFT we computed the weighted histograms, here we don't. Instead, we concatenate the horizontal and vertical gradients of the region highlighted by the 41x41 patch (which becomes 39x39 because of border effects) into a long vector of size 2 directions x 39 x 39 = 3042.

4. we normalize the long gradient vector to unit norm

5. reduce the vector using PCA, e.g. from 3042 to 36 (a very strong reduction).

12.10.2 SURF features

[Reference paper: Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, “SURF:
Speeded Up Robust Features”, European Computer Vision Conference (ECCV),
2006 ]
SURF stands for Speeded Up Robust Features. The idea is to make the computation faster using an approximation of the Hessian matrix and descriptors based on integral images. The integral image IP(x, y) has the same size as the original image, and its pixels are not grey scale intensities: each pixel represents the sum of all the pixels in the rectangle between the origin (0,0) and that pixel:

IP(x, y) = Σ_{i=0}^{x−1} Σ_{j=0}^{y−1} I(i, j)

Given an integral image we can evaluate the sum of all the pixels in any given
area by just summing up the vertex points:

S =A+D−B−C

Figure 12.51 Integral image
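A minimal sketch of this trick using OpenCV's cv::integral follows; note that OpenCV pads the integral image with one extra row and column of zeros, which is why the four corner look-ups below use the rectangle bounds directly. The input file name and the example rectangle are assumptions.

#include <opencv2/opencv.hpp>
#include <iostream>

// Sum of the pixels inside rect r using the integral image: S = A + D - B - C
double boxSum(const cv::Mat& integral, const cv::Rect& r) {
    double A = integral.at<double>(r.y, r.x);
    double B = integral.at<double>(r.y, r.x + r.width);
    double C = integral.at<double>(r.y + r.height, r.x);
    double D = integral.at<double>(r.y + r.height, r.x + r.width);
    return A + D - B - C;
}

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);  // hypothetical input
    if (img.empty()) return -1;

    cv::Mat integral;
    cv::integral(img, integral, CV_64F);   // integral image is (rows+1) x (cols+1)

    double s = boxSum(integral, cv::Rect(10, 10, 50, 30));  // arbitrary example box
    std::cout << "sum of the 50x30 box = " << s << "\n";
    return 0;
}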

This makes the computation of a box filter fast. Box filters are composed of black
and white rectangles, where white is a positive weight, while black is a negative
weight. A box filter has a behaviour similar to the Laplacian. For this reason SURF uses this filter to approximate the Hessian matrix (while SIFT uses the DoG).

Figure 12.52 Box filters

In SURF the scale space is different from the one in SIFT. In SIFT the images are
smoothed and subsampled to move along the pyramid. In SURF instead we use
box filters of increasing size. Indeed, the computation time of box filters does not
depend on the filter size thanks to the integral image.
Regarding orientation, it is evaluated in a different way than in SIFT. We consider a neighborhood of size 6σ (recall that σ is the scale at which the point was detected) and evaluate the gradients in x and y of the highlighted region using the Haar wavelet filters.

Figure 12.53 Orientation computation

For the descriptor, we consider the 16 regions as in SIFT but instead of creating
the histograms, for each region we:

• sum the response for dx and dy

• sum the modules for dx and dy

Then we obtain for each region a vector made of 4 elements (hence 4x16 = 64 features):

[ Σ dx , Σ dy , Σ |dx| , Σ |dy| ]

We normalize it.

Figure 12.54 Descriptor computation

Regarding the performance, SURF:

• Faster than SIFT

• less robust to illumination and viewpoint changes

• overall, both perform well

(reference paper: K. Mikolajczyk and C. Schmid,”A Performance Evaluation of


Local Descriptors”, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, vol. 27, no. 10, pp. 1615-1630, 2005)

Figure 12.55 Surf features detection example

Figure 12.56 Surf features matching example

12.10.3 GLOH

GLOH stands for Gradient Location-Orientation Histogram. It is similar to SIFT but the main difference is the 4th step: GLOH computes the SIFT descriptor using a log-polar location grid (so the concept is similar, the difference is how we group the pixels).
Then we compute the histogram for each sector; overall the histogram has (1+8+8)x16 = 272 orientation bins:

• 3 bins in radial direction

• 8 bins in angular direction (the central bin is not subdivided)

• 16 possible orientations

Then we use PCA to reduce the dimension to 128.

Figure 12.57 GLOH

12.11 Other feature approaches

Other approaches not based on SIFT are:

• Shape context

• LBP

• BRIEF

• ORB

For all of them we are not diving into the details.

12.11.1 Shape context

It is not so popular. The descriptors are computed based on a 3D histogram of edge point locations and orientations. Basically we extract the edges (often using the Canny detector), hence the shape, and we quantize the location of the edge points using the log-polar coordinate system. Hence each histogram represents the location, which consists of 60 combinations (hence 60 bins):

• 5 possible distance

• 12 possible orientation

Hence 5x12 = 60 possible locations. So we consider a region around the keypoint based on the log-polar coordinate system and construct a histogram: we overlap the log-polar diagram and count how many pixels fall inside each sector. Formally, for each pixel pi on the shape, we compute the histogram hi of the relative coordinates of the remaining pixels:

hi(k) = #{ q ≠ pi : (q − pi) ∈ bin(k) }

From this it follows that the descriptor is translation invariant. Scale invariance is achieved by normalizing all radial distances by the mean distance between point pairs. It is also robust to deformations, noise and outliers. However it is NOT rotation invariant!

Figure 12.58 Shape context. The right-hand image (d), (e), (f) represent the
log-polar histograms of the single keypoint highlighted in the image (a) and (b).
We can see that (d), (e) represent the histogram for the same keypoint and they
are similar even if their position in the image is different. (c) Diagram of log-polar
histogram bins highlighting the center part. In total the diagram identifies 60
sectors.

Figure 12.59 Shape matching using shape contexts. Despite the images being different, they are matched as similar because of their shape. The numbers below each image represent its distance from the query image.

12.11.2 LBP

It was proposed for texture recognition in 1994. The LBP method compares the central pixel with its 8 surrounding neighbors and builds a binary sequence from the signs of the differences: following the pixels along the circle, if a pixel is greater than the center pixel we assign 1, otherwise 0. Then we compute the histogram of such values. We might normalize the histogram as a final step, but this is optional.
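A minimal C++ sketch of this basic 3x3 LBP is reported below; it assumes an 8-bit single-channel image, uses the >= comparison against the center, and simply skips the border pixels.

#include <opencv2/opencv.hpp>
#include <array>

// Basic 3x3 LBP: compare each pixel with its 8 neighbours, build an 8-bit code,
// and accumulate a 256-bin histogram of the codes.
std::array<int, 256> lbpHistogram(const cv::Mat& gray) {
    std::array<int, 256> hist{};
    // Neighbours visited clockwise starting from the top-left pixel
    const int dy[8] = {-1, -1, -1, 0, 1, 1,  1,  0};
    const int dx[8] = {-1,  0,  1, 1, 1, 0, -1, -1};
    for (int y = 1; y < gray.rows - 1; ++y) {
        for (int x = 1; x < gray.cols - 1; ++x) {
            uchar c = gray.at<uchar>(y, x);
            unsigned code = 0;
            for (int k = 0; k < 8; ++k)
                code |= (gray.at<uchar>(y + dy[k], x + dx[k]) >= c) << k;  // 1 if neighbour >= center
            ++hist[code];
        }
    }
    return hist;
}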

Figure 12.60 LBP procedure

The procedure can be applied to subregions of the image instead of to whole image:
we divide the image into subwindows, run the procedure previously described and
then we concatenate the histograms.

Figure 12.61 Image subdivision

12.11.3 BRIEF

[reference paper: M Calonder, V Lepetit, C Strecha, P Fua, ”Brief: Binary robust


independent elementary features”, ECCV 2010]
Binary Robust Independent Elementary Features is based on:

• gaussian smoothing

• comparing pairs of pixels inside a window: we set 1 if the values of the pixels satisfy a given inequality, otherwise 0

• concatenating the comparison results of all the pixel pairs into a binary vector (at matching time such binary descriptors are compared using the Hamming distance)

HOMEWORK: read documentation: https://medium.com/data-breach/introduction-to-brief-binary-robust-indep

How do we sample the pixel couples to compare? we can opt for different solutions,
but the best option for most cases is the isotropic gaussian distribution.

BRIEF is robust to viewpoint changes and occlusions, but it is not rotation or scale invariant (ORB, below, adds rotation invariance). Anyway, some different objects can be matched: even if they appear different to us, if they share some common features they will look equal to a computer.

Figure 12.62 BRIEF couples selection options and performance

Figure 12.63 Distribution of Hamming distances for matching pairs of points (thin blue lines) and for non-matching (wrongly matched) pairs (thick red lines) related to the wall dataset. We can use this graph to determine whether the images are the same or not.

12.11.4 ORB

[reference paper Rublee et al., ”ORB: An efficient alternative to SIFT or SURF”,


ICCV 2011]
Oriented FAST and Rotated BRIEF. It uses the FAST corner detector (not covered by this course) and adds rotation invariance to BRIEF. Orientation assignment is based on the intensity centroid with respect to the central pixel.
HOMEWORK: read documentation: https://medium.com/data-breach/introduction-to-orb-oriented-fast-and-ro

Figure 12.64 ORB
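A minimal sketch of ORB keypoint detection and binary descriptor extraction with OpenCV's cv::ORB follows; the input file name and the 500-keypoint limit are assumptions.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);  // hypothetical input
    if (img.empty()) return -1;

    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);      // keep up to 500 keypoints
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                              // binary: 32 bytes (256 bits) per keypoint
    orb->detectAndCompute(img, cv::noArray(), keypoints, descriptors);

    cv::Mat vis;
    cv::drawKeypoints(img, keypoints, vis);
    cv::imshow("ORB keypoints", vis);
    cv::waitKey(0);
    return 0;
}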

12.11.5 Comparison

How many features are matched correctly? And how many wrongly? The number of features found is different for every feature method.

12.12 Feature matching

We move to feature matching.

Figure 12.65 ORB

Features found on images are often matched: that is actually why we find them. There are different matching strategies we can opt for. The matching strategy answers the question 'What features do we compare?'. Moreover, there is also a match evaluation to assess the good matches: we assign a number to each match, which usually corresponds to the Euclidean distance or the Hamming distance.

12.12.1 Matching strategy

To understand which matching strategy is good for us we should consider the application. For instance, when detecting an object in clutter we cannot make any assumption on the position of the object, while in a panorama we can assume that the matching features will be very close to each other.
There are different strategies. Some of them:

• Strategy 1: we set a maximum distance and match against all features within a geometric distance. Basically, given a keypoint in an image, we move to the same position in the second image, we draw a circular area centered at that location, and we match the keypoint only with the keypoints within the circle.
Drawback: it is difficult to set the right radius (maximum distance) and we might lose some matches because of it.
An example is the Brute Force (BF) matcher in OpenCV: it takes one keypoint in the first image and matches it with all the keypoints in the second image using some distance calculation; the closest one is matched. This method is on one hand reliable and simple (if a match exists, we find it), but on the other hand it is computationally expensive.

• Strategy 2: Nearest Neighbors (NN). Given a keypoint, we consider only the nearest (aka most similar) neighbour in the feature space. We also use a threshold setting the maximum distance within which to look, since the closest keypoint can actually be very far.

• Strategy 3: Nearest Neighbors Distance Ratio (NNDR).
Given a keypoint, we consider the distance d1 to the nearest keypoint and the distance d2 to the second nearest keypoint. Then we consider their ratio, which tells us the quality of the match (see the sketch below):

NNDR = d1 / d2
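A minimal sketch of the NNDR strategy using OpenCV's brute-force matcher follows: for every descriptor we retrieve the two nearest neighbours and keep the match only if d1/d2 is below a threshold. The 0.8 ratio and the use of the Euclidean norm (appropriate for SIFT-like descriptors) are assumptions.

#include <opencv2/opencv.hpp>
#include <vector>

// Match the descriptors of two images with the NNDR (ratio) test.
std::vector<cv::DMatch> ratioMatch(const cv::Mat& desc1, const cv::Mat& desc2,
                                   float maxRatio = 0.8f) {
    cv::BFMatcher matcher(cv::NORM_L2);               // Euclidean distance
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);           // 2 nearest neighbours per feature

    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < maxRatio * m[1].distance)  // d1/d2 < threshold
            good.push_back(m[0]);
    return good;
}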

12.12.2 Matching performance

After choosing the matching strategy, we would like to measure the performance
of our choice. We can use the following indicators:

• TRUE POSITIVES (TP): number of correct matches

• TRUE NEGATIVES (TN): number of correct non-matches

• FALSE POSITIVES (FP): number of non-matches that were wrongly matched

• FALSE NEGATIVES (FN): number of matches that were wrongly missed

• precision = TP / (TP + FP)

• recall = TP / (TP + FN)

Figure 12.66 Matching performance of different strategy matching given some


dataset.
From Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of lo-
cal descriptors. IEEE transactions on pattern analysis and machine intelligence,
27(10), 1615-1630.

Chapter 13

Face detection

Faces are a very important element to detect for various purposes.
The problem of face detection is quite complex in computer vision. A straightforward solution is the sliding window approach, which consists in sliding a window over the image and analysing, for every region, whether it contains a face or not. This approach is computationally demanding for different reasons:

• there is a huge number of pixels to analyse: we also need to consider multiple scales, which are obtained using different window sizes.

• faces are actually an unlikely event.

• faces can appear in different poses: from the side or frontal.

A face detector should have the following characteristics:

• fast processing of non-face candidates

• a very low false positive rate (< 10^−6) is required.

13.1 Viola & Jones face detector overview

It is a widely used approach for fast object detection (also people detection). It is particularly effective in face detection (it meets the requirements listed previously).
Its key elements are:

• Haar features. A fast feature evaluation based on the integral image.

• weak learners working on Haar features

• boosting

• Cascading for organizing the classifiers and quickly rejecting non-matching windows

We are going to discuss these characteristics now.

13.1.1 Haar features

Haar features are combinations of rectangular black and white filters. For each feature we sum up all the pixels overlapping the black area and all the ones overlapping the white one, then we subtract the two sums. Hence, a Haar feature is defined as:

f(x) = Σ_i p_black(i) − Σ_i p_white(i)

We have 2-, 3- and 4-rectangle features: we end up having a huge number of features.

Figure 13.1 Haar feature. The number of pixels in the black and white area are
the same. (c) is misleading here.

Figure 13.2 Haar feature

They are very coarse features but they are very sensitive to:

• edges

• bars

• other simple structures

They are also very computationally efficient: a large number of features can be
computed efficiently, and this compensates for the coarseness.
How many features? Let's consider a 24x24 patch (this is not a constant anyway). This can generate about 160k features! This number of features is so huge that we cannot evaluate all of them in a feasible way. To overcome this issue Viola and Jones came up with a new method, based on weak learners.

13.1.2 Weak learners

We can define a weak learner as a simple classifier that works on the number evaluated by a Haar feature fj(x). Basically we set a threshold on each single feature fj(x); in this way we can separate positive and negative examples better than a random guess. Formally, the weak learner will output:

Figure 13.3 Weak learner

where pj is a parity term: since it appears on both sides it may look removable, but it is used for deciding the direction of the inequality (if it is -1, the inequality is reversed). The weak learners are trainable: the parameters they can learn are pj (hence the direction of the inequality) and the threshold.
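A minimal C++ sketch of such a weak learner is reported below: it is just a thresholded decision on a single Haar feature value, with the parity deciding the direction of the inequality (the feature value itself would come from the integral image; names and types are only illustrative).

// Weak learner: threshold on one Haar feature value f_j(x)
struct WeakLearner {
    double threshold;   // learned threshold theta_j
    int    parity;      // +1 or -1: decides the direction of the inequality

    // Returns 1 (face) if p_j * f < p_j * theta_j, otherwise 0 (non-face)
    int classify(double featureValue) const {
        return (parity * featureValue < parity * threshold) ? 1 : 0;
    }
};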

13.1.3 Boosting

Boosting allows us to build a strong classifier by combining several weak classifiers. We obtain it as a weighted sum of the simple weak learners (we neglect the details). The weights are selected depending on the classifier accuracies.

h(x) = sign( Σ_{j=0}^{m−1} aj hj(x) )     (13.1)

In order to train this model, we use the AdaBoost algorithm. Given a dataset of faces and non-faces, we start with equal weights, and then for each round of boosting:

• Evaluate each rectangle filter on each example

• select the best threshold for each filter

• select the best filter/threshold combination

• reweight the examples (we increase the weights of the examples that were wrongly classified)

So at each round, the best weak classifier is found taking into account the effect of the previously selected classifiers.

Figure 13.4 Boosting. Red are faces and blue no faces.

The computational complexity is O(MNK), where M = rounds, N = examples and K = features.

13.1.4 Cascading

If we stopped at boosting, for the classifier to be reliable we would have to evaluate a large number of weak learners. To reject the non-matching windows in a fast way we implement cascading.
As seen before, the classifier is a combination of weak learners. Cascading means rearranging all these weak learners into a cascade to make the classifier more efficient. Thus, basically we organize the weak learners into stages applied in sequence, and this sequence is called the cascade. Each stage contains one or more weak learners and acts as a filter: the first stage that discards a sample prevents the subsequent stages from running. This implies that a false negative causes a failure, while false positives are acceptable at a high rate.
The first stages should be as fast as possible. Moreover, the stages are progressively more complex and have lower false positive rates.

Figure 13.5 Cascading

So now we can train several classifiers with an increasing number of features until the target detection rate is reached. The training is performed by re-weighting the training examples after each stage, giving a higher weight to samples wrongly classified in previous stages.
The cascaded classifier reaches an accuracy similar to the monolithic classifier (obtained with the boosting step), but cascading is about 10x faster.

Figure 13.6 Accuracy of cascaded vs monolithic classifier

The structure used by Viola and Jones uses 32 stages and a total of 4297 features.

Figure 13.7 Viola & Jones cascading

They used the following training data:

• 5k faces: all frontal rescaled to 24x24 pixels

• 10k non-faces

The data had the following variation factors: individuals, illumination and pose.
We can also use Viola and Jones for face tracking.
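The Viola & Jones cascade is available in OpenCV through cv::CascadeClassifier, which loads a pre-trained cascade from an XML file. The sketch below is a minimal usage example; the cascade file path, the input image name and the detectMultiScale parameters are assumptions to adapt to your installation.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Path to a pre-trained frontal-face cascade (adjust to your OpenCV install)
    cv::CascadeClassifier face_cascade;
    if (!face_cascade.load("haarcascade_frontalface_default.xml")) return -1;

    cv::Mat img = cv::imread("people.jpg");           // hypothetical input
    if (img.empty()) return -1;

    cv::Mat gray;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> faces;
    // scaleFactor = 1.1, minNeighbors = 3: the window is swept at multiple scales
    face_cascade.detectMultiScale(gray, faces, 1.1, 3);

    for (const cv::Rect& f : faces)
        cv::rectangle(img, f, cv::Scalar(0, 0, 255), 2);
    cv::imshow("faces", img);
    cv::waitKey(0);
    return 0;
}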

13.1.5 Viola & Jones examples

Figure 13.8 Profile faces were not in the training set, so the Viola & Jones method failed to detect them.

Figure 13.9

Figure 13.10 False positives and false negatives

Chapter 14

The camera

What are the main elements of a camera? Shutter, sensor and camera lens.

14.1 The pinhole camera model

This was the first approach used in photography and it describes the fundamentals of how a camera creates an image by looking at the real world.

How do cameras create an image?


Different light sources light up the object which reflects them in any directions
and hence some of these light rays hit the sensor. We can hence consider each
little part of the object as a light source.

Figure 14.1

Is there any problem in this configuration?
Yes, the problem is that a single pixel of the sensor gets information from many parts of the object: there is a superposition of light rays on a single pixel of the sensor. We can overcome this issue by placing between source and sensor a screen with a very tiny hole, hence a PINHOLE (not a rectangular slit as might appear from the image below, but a hole). In this way we get a one-to-one relation between the pixels on the sensor and the parts of the object (one ray per point of the object reaches the sensor).

Figure 14.2 Configuration with pinhole

The screen hence allows a selection of the light rays that depends on the size of the hole. This model is called the PINHOLE CAMERA MODEL.

Figure 14.3 PINHOLE CAMERA MODEL

The smaller the pinhole the sharper the image will be.

Figure 14.4 pinhole size

14.1.1 Camera obscura

The pinhole model was used long ago. Initially it was implemented to build the camera obscura, which was a room with a pinhole that let artists paint, on a screen, what was projected onto that same screen by the light coming through the pinhole.
A camera obscura can be seen in Padua at the Museo del Precinema in Prato della Valle.

Figure 14.5 Camera obscura

14.1.2 Modelling a camera

Let's try to model a real camera using the pinhole model. We hence use a pinhole camera and a coordinate system. Note that in the modelling we neglect some effects.

Figure 14.6 Modelling a camera by the pinhole model

The main elements of such camera are :

• The optical center which identifies the location of the pinhole.

• The image plane which is the screen

• The optical axis, which is a line perpendicular to the image plane and passing through the optical center. This is a real element that we consider also in a real camera.

• The principal point, which is the point where the image plane and the optical axis intersect.

• The focal length, which is the distance between the optical center and the image plane.

• A reference system. There are actually several reference systems. One describes the camera (for this system we use upper case X, Y, Z): the origin represents the optical center and the reference system represents where the camera is pointing (the Z axis in particular). The other reference system instead describes the image plane (identified with lower case x, y).
The orientation of the reference systems is given by the right-hand rule.

Figure 14.7 Reference systems and right-hand rule

For instance, consider when we find a keypoint, an edge, a line or a blob: in which reference system is the output of the algorithm given? Did we use any other reference system? We actually use a different reference system whose origin is not the center of the image plane: we set as origin pixel (0,0) the top-left corner of the image, which is not the center of the image plane in the pinhole model.

14.2 Projective geometry

14.2.1 Mapping from 3D to 2D

We need to describe the geometry of projection quantitatively. This is needed to be able to convert the location of a 3D point in the world seen by the camera into its corresponding 2D point location on the image plane. This 2D location is not expressed in pixels yet.
Let’s consider a 3D point P and its projection p on the 2D image plane.

Figure 14.8

Now let's add a new plane in front of the optical center that is parallel to the image plane and at the same distance f from the optical center. From now on we work on this plane: it is easier to consider this plane instead of the image plane since the new plane avoids the upside-down effect.

Figure 14.9

Thus we remove the original image plane and start using only the plane added previously. We use this plane for deriving the geometric description of the projection from 3D to 2D.

Hence looking at a pinhole camera, we need to consider only the part in the yellow
box represented in the figure 14.11. For all the discussion we consider the pinhole
camera model as camera.

Figure 14.10

Figure 14.11

We start our geometric analysis from the 'similar triangles rule', which states that the two triangles (the pink and the green zone) have the same angles:

Yp / yp = Zp / f

where f is the focal distance.

Figure 14.12 Similar triangle rule

The equality of the similar triangles rule holds also for the other coordinate, leading to the following PROJECTION EQUATIONS from 3D to 2D:

xp = f · Xp / Zp     (14.1)
yp = f · Yp / Zp     (14.2)
tan(α) = Yp / Zp     (14.3)
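A minimal C++ sketch of these projection equations follows (all quantities in metres, with the point expressed in the camera reference system; the function and type names are only illustrative).

#include <stdexcept>

struct Point2 { double x, y; };

// Pinhole projection of a 3D point (Xp, Yp, Zp) onto the image plane at focal distance f
Point2 project(double f, double Xp, double Yp, double Zp) {
    if (Zp <= 0.0) throw std::invalid_argument("point must be in front of the camera");
    // x_p = f * X_p / Z_p,  y_p = f * Y_p / Z_p
    return { f * Xp / Zp, f * Yp / Zp };
}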

Figure 14.13 Different way to see the ’similar triangle rule’

Anyway, by projecting the 3D points onto the 2D surface we lose information about the distance: this results in multiple 3D points being projected onto the same image point.

Figure 14.14

Projection matrix (3D to 2D)

We can rearrange the previous equations into a matrix form using the homogeneous
coordinates.

Figure 14.15 Recall: transforming to/from homogeneous coordinates

Figure 14.16 P is the projection matrix. The vector M represents the original image, while the vector m the virtual one: the virtual image is equal to the original one up to a scale factor.

Thus a camera is a device that converts the 3D world into a 2D image and this
operation can be described by the equation m̃ ≈ P M̃ . Thus the projection matrix
describes how the 3D world is mapped onto the image plane. In particular, for
f=1, we get the essential perspective projection P which represents the core of
the projection process:

P = | 1 0 0 0 |
    | 0 1 0 0 |  = [I|0]     (14.4)
    | 0 0 1 0 |

Now reflect:
– What measurement unit is used for distances in the 3D world?
– What measurement unit is used for distances on the projection plane?
– What measurement unit do we commonly use for distances in a digital image?
In the 3D world and on the projection plane we use meters, while in a digital image we use pixels. For the projection we initially use meters, but then we need to translate them into pixels.

14.2.2 Mapping projected point from 2D to pixel coordinate

After projecting, we need to map the projected points on the image plane into pixels. Thus we need to move to the image pixel reference system: map from (x, y) to (u, v). The transformation is defined considering the top-left corner point of the plane (x, y) (which, we recall, is the plane we added previously to avoid the upside-down effect of the image) as the origin of the new reference system (u, v), and considering also the coordinates of the principal point (which, we recall, is the point where the image plane and the optical axis intersect), which we call (u0, v0).

Figure 14.17 Mapping from (x,y) to (u,v)

Figure 14.18 mapping from (x,y) to (u,v) example

The mapping from (x, y) to (u, v) performs a conversion from metric distances to pixels using the pixel width w and height h. Width and height are used to define the conversion factors, which are defined as:

ku = 1/w     (14.5)
kv = 1/h     (14.6)

Note that pixels may not be square, so we need to look at the camera specification to get the pixel size.
The mapping from (x, y) to (u, v) is then obtained by translation and scaling:

u = u0 + xp · ku     (14.7)
v = v0 + yp · kv     (14.8)

14.2.3 Mapping from 3D to pixel coordinates

We can combine the two mappings: from 3D to the 2D image plane, and then from the image plane to pixels. Hence, we can map from 3D to pixel coordinates directly. We achieve this by substituting the projection equations (14.1)-(14.2) into the equations mapping the image plane coordinates (x, y) to the pixel coordinates (u, v):

u = u0 + xp · ku = u0 + f · ku · Xp / Zp = u0 + fu · Xp / Zp     (14.9)
v = v0 + yp · kv = v0 + f · kv · Yp / Zp = v0 + fv · Yp / Zp     (14.10)

Projection matrix (3D to pixel coordinates)

The projection matrix is now expressed as

P = \begin{bmatrix} f_u & 0 & u_0 & 0 \\ 0 & f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = K [I|0]    (14.11)

where K is the camera matrix. The previous relation m̃ ≃ P M̃ still holds; what
has changed is only the formulation of P: we moved from

P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = [I|0]
\quad \text{to} \quad
P = \begin{bmatrix} f_u & 0 & u_0 & 0 \\ 0 & f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = K [I|0]    (14.12)
The camera matrix K is defined as:

K = \begin{bmatrix} f_u & 0 & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}    (14.13)

The matrix K depends on the INTRINSIC PARAMETERS: ku , kv , u0 , v0 , f (in-
deed fu , fv embed three parameters). These parameters define the projection
characteristics of the camera. Hence, this matrix is unique to a specific camera
and it can be used to remove distortion due to the lenses of a specific camera.
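
As a sketch of how K acts, the following C++ fragment (with made-up intrinsic values) projects a 3D point given in camera coordinates to pixel coordinates via m̃ ≃ K [I|0] M̃ and the homogeneous-to-Cartesian conversion:

```cpp
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    // Hypothetical intrinsic parameters (values are just for illustration).
    double fu = 800.0, fv = 800.0;   // focal length expressed in pixels
    double u0 = 320.0, v0 = 240.0;   // principal point

    cv::Matx33d K(fu, 0, u0,
                  0, fv, v0,
                  0,  0,  1);

    // 3D point in camera coordinates (meters); [I|0] just drops the 4th coordinate.
    cv::Vec3d P(0.5, 0.2, 2.0);
    cv::Vec3d m = K * P;             // homogeneous pixel coordinates

    // Convert from homogeneous to pixel coordinates.
    double u = m[0] / m[2];
    double v = m[1] / m[2];
    std::cout << "Pixel coordinates: (" << u << ", " << v << ")\n";
    return 0;
}
```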

14.2.4 Rototranslation - Mapping from 3D world to 3D camera

The camera can move and hence it is meaningful to have both a 3D world reference
system and a 3D camera reference system. This means that the camera provides
some observation given its own point of view and we need to convert the 3D points
of the world into the camera viewpoint. This is achieved by using 3D rotation
and translation to align the axes of the two reference systems. A rototranslation
in 3D in homogeneous coordinates is expressed as:

T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}    (14.14)

and the correspondence becomes m̃ ' P T M̃ .


The rototranslation matrix T involves the EXTRINSIC PARAMETERS that define
the relation between camera and world:

• 3 for translations

• 3 for rotations

Hence, every time we rotate or move our camera, we change these extrinsic parameters.
Moreover, the x,y plane of the camera reference system is the image plane and the
camera z axis indicates the direction of the camera.
REMEMBER: extrinsic parameters are about the camera pose, intrinsic parameters
are about the projection inside the camera.
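
When both intrinsic and extrinsic parameters are available, OpenCV can apply the whole chain m̃ ≃ P T M̃ for us with cv::projectPoints; the parameter values in this sketch are purely illustrative:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical intrinsics and extrinsics, just for illustration.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                           0, 800, 240,
                                           0,   0,   1);
    cv::Mat rvec = (cv::Mat_<double>(3, 1) << 0, 0.1, 0); // rotation (Rodrigues vector)
    cv::Mat tvec = (cv::Mat_<double>(3, 1) << 0, 0, 1.5); // translation (meters)
    cv::Mat dist = cv::Mat::zeros(5, 1, CV_64F);          // no distortion here

    // A 3D point expressed in the world reference system.
    std::vector<cv::Point3d> world = { {0.2, -0.1, 3.0} };
    std::vector<cv::Point2d> pixels;

    // cv::projectPoints applies the rototranslation and then K for us.
    cv::projectPoints(world, rvec, tvec, K, dist, pixels);
    std::cout << "Projected pixel: " << pixels[0] << std::endl;
    return 0;
}
```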

Figure 14.19

Figure 14.20 The left image represents the camera viewpoint, where the projection is
computed using m̃ ≃ P T M̃. The right image instead represents the world viewpoint.

Figuring out rototranslation

–> EXTRA CONTENT NOT SEEN DURING THE COURSE reported for a better
understanding.
The rototranslation is defined as P_c = R(P_w − C) = R P_w − RC = R P_w + t (with t = −RC),
which can be written in matrix form using homogeneous coordinates:
        
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} =
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 1 & -c_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix} =
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}    (14.15)

where (U, V, W) are the coordinates of the point in the world reference system and
(X, Y, Z) its coordinates in the camera reference system.

Figure 14.21 Alignment of the world and camera reference systems with respect
to formula 14.15. C stands for camera and it is the centre of the pinhole. Pw and
Pc are the same physical point described in two different coordinate systems.

For a better understanding of rototranslation we report an example where only


translation is applied.

Figure 14.22 Example of a simple stereo system involving only translation since
the axis are already aligned in the same direction.

Let's now focus on rotation, forgetting for a while the translation vector (let's
suppose the two systems overlap and hence no translation is needed).
Our transformation would then be P_c = R P_w:
    
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} =
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}    (14.16)

Let’s consider the case where the world x axis (1,0,0) corresponds to camera axis
(a,b,c). In this case we can immediately write down the first column of R.
         
a r11 r12 r13 0 1 a r11 r12 r13 0 1
         
 b  r21 r22 r23 0 0  b  r r r 0 0
         
 =    →   =  21 22 23    (14.17)
         
 c  r31 r32 r33 0 0
 c  r31 r32 r33 0 0
   
     
         
1 0 0 0 1 1 1 0 0 0 1 1

Likewise for the world Y and Z axes:

Figure 14.23

However, sometimes it is easier to specify what the camera X, Y or Z axis is in world
coordinates. In that case we rearrange the equation as follows:

P_C = R P_W \;\rightarrow\; R^{-1} P_C = P_W \;\rightarrow\; R^T P_C = P_W

\begin{bmatrix} r_{11} & r_{21} & r_{31} & 0 \\ r_{12} & r_{22} & r_{32} & 0 \\ r_{13} & r_{23} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} =
\begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix}    (14.18)

Let's consider the scenario where the camera X axis (1,0,0) corresponds to the
world axis (a,b,c). In this case we can immediately write down the first column
of R^T, which is the first row of R.
         
\begin{bmatrix} r_{11} & r_{21} & r_{31} & 0 \\ r_{12} & r_{22} & r_{32} & 0 \\ r_{13} & r_{23} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} =
\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix}
\;\rightarrow\;
\begin{bmatrix} a & r_{21} & r_{31} & 0 \\ b & r_{22} & r_{32} & 0 \\ c & r_{23} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} =
\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix}    (14.19)

and likewise with camera Y axis and camera Z axis.

Figure 14.24

Figure 14.25 Example of rotation of two coordinate system with respect to the
world reference system

14.2.5 Projection recap

The whole projection process involves:

• 4 reference systems

• 3 transformations

Figure 14.26 Recap of the different reference systems involved in the projection
process.

Figure 14.27 Projection recap. The object in the world is projected onto the image
plane, which has its own reference system. Then the image plane is quantized into
the pixel frame (the pixel frame actually has the same size as the image plane; in
the figure it is shown smaller just for graphical reasons). Finally there is the camera
frame: if we move the camera, the reference system of the camera frame changes.
A movement of the camera or of the object can be observed in the image plane
or pixel frame.

14.3 Inverse projection

The projection process is described as

m̃ ' P T M̃

It evaluates the projected point m̃ given the 3D object M̃ .


The inverse process would be useful, for instance for self-driving cars, but it is not
possible to achieve exactly. Indeed, two elements are not invertible:

• Projection from 3D to 2D.

• Pixel quantization

Anyway we can invert the projection if:

• we accept as a result the direction of the object, but not its 3D position:
indeed we cannot recover the depth along z.

• We neglect the quantization effect: the pixel location and the projected
point are considered the same. This is acceptable for high-resolution sensors.
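
Under these two assumptions (direction only, quantization neglected), inverting the intrinsic mapping gives the viewing ray of a pixel; here is a minimal C++ sketch with a hypothetical camera matrix:

```cpp
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    // Hypothetical camera matrix (illustrative values only).
    cv::Matx33d K(800, 0, 320,
                  0, 800, 240,
                  0,   0,   1);

    // A pixel we want to back-project (quantization neglected).
    cv::Vec3d pixel(400.0, 300.0, 1.0);   // homogeneous pixel coordinates

    // Inverting the intrinsic mapping gives the viewing direction,
    // known only up to the unknown depth Z.
    cv::Vec3d ray = K.inv() * pixel;
    std::cout << "Viewing ray direction (up to scale): " << ray << std::endl;
    return 0;
}
```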

14.4 The thin lens model

We discuss what happens if we add a lens.


In the pinhole model the image is always in focus because only a few rays hit the
sensor. However, this is also a major issue for this model, since few rays mean
little light intensity. We must find a trade-off between sharp images and light
intensity.
To achieve an image that is both bright and sharp we need to consider a more
complex system that does not need a pinhole. Basically, we take the pinhole model
and add a lens in place of the pinhole. This system is called the thin lens model.
The lens has:

• the main axis lying on the optical axis (hence the optical camera axis is
perpendicular to the lens plane).

• the center placed where the pinhole is

• a thin width so that we can neglect the lens width and consider it as a plane
where every deviation occurs.

The lens deviates light and focuses the rays in a single point. In particular:

• the rays passing by the center are not deviated

• the rays parallel to the lens axis are deviated through the focal point (we
recall every lens has two focal points)

Figure 14.28 Lens deviates light and focuses the rays

Figure 14.29 the rays passing by the center are not deviated while the rays parallel
to the lens axis are deviated through the focal point

The relationship between the object distance from the lens and the image distance
from the lens is expressed by the THIN LENS EQUATION:

\frac{1}{d_0} + \frac{1}{d_1} = \frac{1}{f}

Figure 14.30 Thin lens equation

The focal length is a built-in property of the lens and represents the distance at
which the image of parallel rays appears: it indicates the distance at which the
parallel rays intersect.
The distance between sensor and lens determines which points are in focus. A lens
can focus only objects at a specific distance; everything outside that distance is not
as sharp as the items at that distance, and those points generate a circle of
confusion.

Figure 14.31 Circle of confusion

Changing the distance between lens and sensor changes which points are in focus,
but it does not cancel out the circle of confusion.

Figure 14.32 Changing the distance between lens and sensor

For reducing the circle of confusion we can add a barrier, which works like a
pinhole.

Figure 14.33 we add a barrier

Figure 14.34 Change in focus

Note that the focal length has a different meaning in the two models we saw:

• Thin lens model: f is the distance at which parallel rays intersect

• pinhole camera model: f is the distance between pinhole and sensor.

Figure 14.35 focal length comparison

14.5 Non ideal (real) lenses

The lens used so far is thin and ideal. However, real lenses are affected by additional
effects: distortion, chromatic aberration and other minor effects.

14.5.1 Distortion

Lens manufacturers keep distortion under control. A distortion-free lens is often
complex to design and to build.

Figure 14.36 Non distortion lens

Distortion is a deviation from the ideal behaviour described so far. Such deviation
follows a pattern that can be radial and/or tangential.

Radial distortion

In a radial distortion the entity of the distortion depends on the distance of the
distorted point from the image center.

Figure 14.37 Radial distortion

Radial distortion causes straight lines to appear curved. It becomes larger the
farther points are from the center of the image. In particular, the farther points
can be radially distorted in two ways: either by pincushion distortion or by barrel
distortion.

Figure 14.38 Pincushion and barrel distortion

Figure 14.39 Pincushion and barrel distortion

The radial distortion can be analysed by means of distortion patterns, which can
be either experimentally measured, or analytically described if the structure of the
lens is known.
Camera models commonly consider a polynomial approximation for the radial
distortion. Given the distorted point (x,y), in OpenCV the correction (x_corr, y_corr)
is modelled as:

x_{corr} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)    (14.20)

y_{corr} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)    (14.21)

Figure 14.40 Radial distortion: the arrows represent the displacement of the point

Tangential distortion

Tangential distortion is not so common and hence it is usually negligible. It can
happen, for instance, if the sensor is attached with cheap glue: tangential distortion
is caused by a non-ideal alignment between lens and sensor. The distortion looks
similar to a 'perspective' effect on the sensor.

Figure 14.41 Tangential distortion

The tangential distortion is often modelled as:

x_{corr} = x + \left[ 2 p_1 x y + p_2 (r^2 + 2x^2) \right]    (14.22)

y_{corr} = y + \left[ p_1 (r^2 + 2y^2) + 2 p_2 x y \right]    (14.23)

14.5.2 Chromatic aberration

The refractive index depends on the wavelength of the rays. This causes a different
bending of the rays based on their wavelength and a modification of the lens focal
length. Thus each color ray bends differently and intersects at a different position.
This results in color fringes near the image edges.

Figure 14.42 Chromatic aberration

Figure 14.43 Chromatic aberration examples

14.5.3 Distortion and camera lens

A thin lens camera model can be coupled with distortion estimation.


For projection we use a non-distorting thin lens camera model and then we apply
the distortion.

For inverse projection, we first compensate the distortion and then we use a non-
distorting thin lens camera model.
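
In OpenCV this two-step view corresponds, for inverse projection, to undistorting the image first; a minimal sketch using cv::undistort, with made-up camera matrix and distortion coefficients (ordered k1, k2, p1, p2, k3) and a hypothetical input file:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    // Hypothetical intrinsics and distortion coefficients (k1, k2, p1, p2, k3),
    // e.g. obtained from a previous calibration.
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                           0, 800, 240,
                                           0,   0,   1);
    cv::Mat dist = (cv::Mat_<double>(5, 1) << -0.2, 0.05, 0.001, 0.001, 0.0);

    cv::Mat img = cv::imread("distorted.png");  // hypothetical input image
    if (img.empty()) return -1;

    // Compensate radial and tangential distortion before any measurement.
    cv::Mat undistorted;
    cv::undistort(img, undistorted, K, dist);
    cv::imwrite("undistorted.png", undistorted);
    return 0;
}
```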

14.6 Field of view

One parameter of a camera is the field of view. It is the angle perceived by the
camera.
Previously we considered a 3D point at angle α. The angle α, which represents the
angle at which the point is seen by the camera, can be at most 1/2 of the Field of
View (FoV).

Figure 14.44 α angle at which we considered previously the point P

FoV depends on:

• the sensor size (d)

• the focal length (f)

\varphi = \arctan\left(\frac{d}{2f}\right)

Figure 14.45 FoV

In a camera we cannot change the sensor size, but we can change the focal length
by zooming in and out (so we are changing an intrinsic parameter).

Figure 14.46 FoV: we change f by zooming in and out

Figure 14.47 FoV vs focal length

14.7 Real camera overview

A real camera can be considered as a dimensionality reduction machine from the
3D world to a 2D image. However, this mapping leads to a loss of information: we
lose angles, distances and parallelism, while straight lines remain straight.

Figure 14.48 Dimensionality reduction machine

Figure 14.49 Parallelism is lost

Figure 14.50 Our brain is used to the effect of projection. In this example the two
lines have equal length, but we perceive them differently.

Figure 14.51 Length can’t be trusted. Indeed everything is mapped according to
a scale factor. In the figure b and c have the same length in a 3D world but not
in the 2D image.

We remind that we use lenses over pinhole to gather more light and focus.

Figure 14.52 Lens vs pinhole

14.7.1 Perspective vs viewpoint

Focal length changes both the subject size and the perspective (e.g background).

Figure 14.53 Perspective. The left image is taken with the camera in the position
at the bottom (middle figure), while the right image with the camera closer. In
particular the left photo was taken with a high focal distance and high aperture.

The telephoto lens makes it easier to select background (a small change in view-
point is a big change in background).

Figure 14.54 telephoto lens

Telephoto lenses are also used to obtain the Vertigo effect (or dolly zoom). This
is an effect that lets us perceive a static scene as if it were in motion. The effect
is obtained by simultaneously zooming out and moving the camera forward, or
zooming in and moving the camera backward.

14.7.2 Sensors overview

A sensor deals with converting photons to electrons. There are two main types of
sensor: CCD and CMOS.
CCD and CMOS do not differentiate on wavelength/photon energy: they are
essentially greyscale sensors. To get colors we need to opt for one of the following
configurations:

• 3 chip color: we split the incoming light and separate the R, G and B components
so that they hit the corresponding filter+sensor.

• single chip color: a single imager with filters. These sensors are based on the
Bayer pattern: over every pixel there is a tiny filter that selects only one of
the R, G and B colors.

• chip penetration depending on wavelength, such as the Foveon sensor. Right
now it is still a research topic and not many cameras use it.

Figure 14.55 3 chip sensor

Figure 14.56 single chip color with Bayer pattern

Figure 14.57 Direct image Foveon sensor

14.7.3 Exposure

The main aspect about taking photo is the exposure which basically defines the
amount of incoming light. The exposure is regulated by two parameters:

• aperture

• shutter speed

Figure 14.58 Aperture

Figure 14.59 Shutter speed

Aperture

The aperture is the diameter of the lens opening and it is expressed as a fraction
of the focal length (the f-number), for example:
– f/2.0 on a 50mm lens: aperture = 25mm
– f/2.0 on a 100mm lens: aperture = 50mm
The lower the f-stop, the larger the aperture.
Why do we express the aperture in terms of focal length? Because the amount of
light reaching the sensor depends on the focal length as well: when we increase the
focal length, the gathered light decreases (doubling the focal length divides it by 4),
so we need to tune the aperture accordingly to compensate for the reduced amount
of light.

Figure 14.60 When we double the focal length, the amount of light gathered is
divided by 4, so we need to enlarge the aperture to compensate.

The choice of the aperture determines the depth of field and the diffraction effect.

Diffraction effect
Usually we should opt for a small aperture if we neglect the exposure time.
However, a too small aperture leads to the diffraction effect, which blurs the
image.

Figure 14.61 Diffraction effect. With a 0.35mm aperture the image is in focus,
but for smaller apertures the image gets blurred.

Depth of field
A smaller aperture increases the range in which the object is approximately
in focus. The depth of field is the depth in which we can neglect the presence
of the circle of confusion (figure 14.62).

Figure 14.62 Depth of field

A portrait lens provides a large aperture for getting a small DOF. Every lens
can achieve a large DOF, but not every lens can achieve a small one.

Shutter

A picture is acquired when the shutter opens and closes, exposing the sensor
to the incoming light. The exposure time is expressed in seconds, usually as a
fraction. There is a reciprocity between shutter and aperture: given an amount
of light, we can get the same exposure by doubling the exposure time and
halving the aperture.

Figure 14.63 Several pairs of shutter speed/aperture

The choice of the shutter speed determines the freeze motion vs motion blur:

• Longer exposure time: more light, more motion blur

• shorter exposure time: less light, freeze motion

Figure 14.64 Shutter speed

14.7.4 ISO

ISO defines the sensitivity of the sensor: it is useful to prevent the sensor from
getting saturated (saturated pixels appear just white).

14.8 Camera calibration

Calibration is the process of estimating the camera parameters. Intrinsic parameters
are specific to a camera; they include information like the focal length (fx, fy)
and the optical center (cx, cy), which are necessary to create the camera
matrix K. Extrinsic parameters instead correspond to the rotation and translation
vectors that map the coordinates of a 3D point from the world coordinate system
to the camera coordinate system.
We can calibrate the camera considering only the intrinsic parameters or the
intrinsic + extrinsic ones. Indeed calibration is needed to measure the projec-
tion characteristics of a camera and hence we measure at least the intrinsic
parameters.
In the calibration process the reference with the world coordinate system is
not needed, but we cannot neglect the 3D reference system of the object:
the calibration process links 3D object and projection.
Then we can use another dedicated process to evaluate the camera position
and orientation in world coordinates: once we calibrate the camera we link
it with the real world.
The camera orientation is expressed by means of three angles: yaw, pitch
and roll.

Figure 14.65 camera orientation

14.8.1 Why calibration?

We can decide to calibrate the camera for several reasons: removing the
distortion, estimating the 3D world from camera motion, evaluating depth and
so on.

Figure 14.66 Some calibration application

There are also other applications: we can detect the position of an object,
for instance using the ground plane as reference. Indeed, cameras are not only
used for understanding the environment: they can also be used as a measurement
tool, e.g. for measuring the size or distance of something.

Figure 14.67 calibration measurement application

14.8.2 Calibration process

Let’s develop a first approach to calibration. We should find a method for


evaluating the matrix K by observing how rays are projected.

Several methods exist for calibrating a camera. The general process consists
of the following steps:

• take an object of known shape and appearance. The calibration pattern


can be any object of known shape such as a checkerboard, but the
object shall be found in the image automatically: the shape should be
easily recognizable. For instance we can use a checkerboard such that
the points used for calibration are the square corners.

Figure 14.68 projected checkerboard image with the reference system of the object

• take pictures of the object. We collect N images of the pattern. For
each image we list the 3D points of the pattern, whose real positions we know
(in our example they are the corner positions of the checkerboard).
How many images are needed to calibrate a camera? Each view
of a planar object can be represented by a homography, and a homography
is fully described using only four points per view; this means that only
4 points are free and hence that we have 8 constraints (x and y) per
view. Neglecting distortion, we have 4 or 5 intrinsic parameters and 6
extrinsic parameters. This means that we need at least 2 views, since
there are 8 constraints per view. Anyway, in order to get a good
estimate we need many more views, such as 10 or more.
Note that with too few or bad images we can end up with crazy results, since
we would be fitting a polynomial that has no physical meaning.

Figure 14.69 Typical calibration image set

• analyse the projection process. The process consists in the following.

– We observe how the 3D points given in the pattern, which are
called object points, are projected onto the 2D image plane, where they
are called image points. Thus we compute the coordinates of the image
points.
– We then initialize the calibration parameters to default values: we group
the intrinsic parameters into θ_K and the distortion coefficients into θ_d,
and initialize them (e.g. distortion to 0):

For K: f_u, f_v, u_0, v_0 → θ_K    (14.24)
For distortion: k_1, k_2, k_3, p_1, p_2 → θ_d    (14.25)

We initialize also the extrinsic parameters to default values.

– Eventually we solve the non-linear least squares problem:

\min_{\theta_K, \theta_d} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| K [I|0]\, T_i \tilde{P}_{i,j} - \tilde{p}_{i,j} \right|^2    (14.26)

where N = #images, M = #points per image, \tilde{P}_{i,j} are the 3D points of
image i and \tilde{p}_{i,j} the corresponding 2D image points. Basically we would
like to minimize the projection error, computed as the difference between the
theoretical projection of the 3D points (K [I|0] T_i \tilde{P}_{i,j}) and the
actual projection (\tilde{p}_{i,j}).
This minimization problem is tricky to solve given the huge number
of parameters to tune. We would like to improve the time performance
by coming up with a good parameter initialization, so that
we can reach the best solution sooner. A good initial guess can
be found by exploiting the homography.
We use a homography to map the object to the image plane, but
this is possible only if the object is planar and if we neglect distortion.
The homography simplifies the mathematical description. More

details about homography can be found at ’Learning OpenCV 3’
at page 664 or at the course ’3D data processing’ held by prof. A.
Pretto.

Figure 14.70 homography
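
The whole pipeline described above is implemented in OpenCV by cv::findChessboardCorners and cv::calibrateCamera; the sketch below assumes a set of hypothetical checkerboard images (file names, board size and square size are made up):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
#include <vector>
#include <string>

int main() {
    // Checkerboard with 9x6 inner corners and 25 mm squares (illustrative values).
    cv::Size board(9, 6);
    float square = 0.025f;

    // 3D object points: the same planar grid for every view (Z = 0).
    std::vector<cv::Point3f> grid;
    for (int r = 0; r < board.height; ++r)
        for (int c = 0; c < board.width; ++c)
            grid.emplace_back(c * square, r * square, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    cv::Size imageSize;

    // Hypothetical list of calibration images.
    std::vector<std::string> files = {"calib_00.png", "calib_01.png" /* ... */};
    for (const auto& f : files) {
        cv::Mat img = cv::imread(f, cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        imageSize = img.size();

        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, board, corners)) {
            // Refine the detected corners to sub-pixel accuracy.
            cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                             cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.001));
            imagePoints.push_back(corners);
            objectPoints.push_back(grid);
        }
    }

    // Estimate K, the distortion coefficients and one (R, t) per view.
    cv::Mat K, dist;
    std::vector<cv::Mat> rvecs, tvecs;
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, dist, rvecs, tvecs);
    std::cout << "RMS reprojection error: " << rms << "\nK = " << K << std::endl;
    return 0;
}
```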

14.8.3 Reprojection error

After calibrating the camera we compute the reprojection error. The reprojection
error estimates how well the projected 3D points recreate the points' true
projections, and thus it provides a quantitative measure of accuracy. We hence
calculate the absolute norm between what we got with our transformation
and the real position. To find the average error, we calculate the arithmetical
mean of the errors calculated for all the calibration images.

Figure 14.71 The yellow arrows indicate the reprojection error
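
A minimal sketch of this computation, reusing the quantities returned by cv::calibrateCamera (object/image points, per-view rvecs/tvecs, K and distortion; the helper name is hypothetical):

```cpp
#include <opencv2/calib3d.hpp>
#include <cmath>
#include <vector>

// RMS reprojection error over all calibration views.
double rmsReprojectionError(const std::vector<std::vector<cv::Point3f>>& objectPoints,
                            const std::vector<std::vector<cv::Point2f>>& imagePoints,
                            const std::vector<cv::Mat>& rvecs,
                            const std::vector<cv::Mat>& tvecs,
                            const cv::Mat& K, const cv::Mat& dist) {
    double sumSq = 0.0;
    size_t n = 0;
    for (size_t i = 0; i < objectPoints.size(); ++i) {
        std::vector<cv::Point2f> reprojected;
        // Theoretical projection of the known 3D points with the estimated parameters.
        cv::projectPoints(objectPoints[i], rvecs[i], tvecs[i], K, dist, reprojected);
        double err = cv::norm(imagePoints[i], reprojected, cv::NORM_L2);
        sumSq += err * err;
        n += objectPoints[i].size();
    }
    return std::sqrt(sumSq / n);   // error in pixels
}
```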

14.8.4 Sum - up

To find the intrinsic and extrinsic parameters, we must provide some sample
images of a well defined pattern (e.g. a chess board). We find some specific
points of which we already know the relative positions (e.g. square corners
in the chess board). We know the coordinates of these points in real world
space and we know the coordinates in the image, so we can solve for the
distortion coefficients. For better results, we need at least 10 test patterns.
NOTE from lab 8: R and t are unique for each image taken, so t is a vector
containing one translation vector per image, and R contains one rotation matrix
or rotation vector per image (by using the Rodrigues formula we can pass from
rotation matrix to rotation vector and vice versa). The 3x3 camera matrix and the
distortion vector are instead unique to each camera and the same for all images.

Chapter 15

Image resampling

We discuss how to change the image resolution. The image resolution is changed
by resampling the image. The resampling can be performed with different
techniques depending on the final result we aim at. We use:
• interpolation: for enlarging the image

• decimation for shrinking the image

• resampling to a different lattice e.g. after a geometric transform.

15.1 Interpolation techniques

15.1.1 Nearest neighbor interpolation

For enlarging the image we can fill in the added pixels with the color value of
the closest pixel. This technique has a fast execution but poor results due to the
pixelization effect.

Figure 15.1 Nearest neighbor interpolation

Figure 15.2 Nearest neighbor interpolation example

15.1.2 Bilinear interpolation

The new pixel values are taken as the weighted average of the closest 4 samples.
The coefficient weights depend on the distance to the samples:

p(x, y) = (1 − a)(1 − b)I1 + a(1 − b)I2 + (1 − a)bI3 + abI4

This technique assures a good balance between speed and accuracy.

Figure 15.3 Bilinear interpolation

15.1.3 Bicubic interpolation

The new pixels values are taken by considering the 16 closest samples. This
is performed by using a 3-degree polynomials.

3 X
X 3
p(x, y) = ai,j xi y j
i=0 j=0

This technique is slow but results in good accuracy.

Figure 15.4 Bicubic interpolation

15.1.4 Interpolation comparison

Figure 15.5 Interpolation comparison
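
In OpenCV the three interpolation schemes correspond to flags of cv::resize; a minimal sketch (the input file name is hypothetical):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::Mat img = cv::imread("input.png");   // hypothetical input image
    if (img.empty()) return -1;

    cv::Mat nn, lin, cub;
    // Enlarge by 3x with the three interpolation schemes discussed above.
    cv::resize(img, nn,  cv::Size(), 3, 3, cv::INTER_NEAREST);
    cv::resize(img, lin, cv::Size(), 3, 3, cv::INTER_LINEAR);
    cv::resize(img, cub, cv::Size(), 3, 3, cv::INTER_CUBIC);

    // For shrinking (decimation), INTER_AREA is usually the recommended choice.
    cv::Mat small;
    cv::resize(img, small, cv::Size(), 0.5, 0.5, cv::INTER_AREA);

    cv::imwrite("nearest.png", nn);
    cv::imwrite("bilinear.png", lin);
    cv::imwrite("bicubic.png", cub);
    return 0;
}
```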

Chapter 16

High-level vision

16.1 Object recognition

Object recognition is a general term to describe a collection of related com-


puter vision tasks that involve identifying objects in images or videos. We
are going to describe some possible high-level vision tasks.

16.1.1 Image classification

One of the possible tasks is image classification. Image classification is the


task of associating one or more categories to a given image.

Figure 16.1 image classification example

16.1.2 Object detection (or localization)

Another possible task is object localization. This problem deals with both
classifying an object and localizing it. The position of the object is highlighted
by using a bounding box, which contains with high probability the
required object. Moreover, to each bounding box it is associated a category.
Very often when we use object detection we are looking for only one object.

Figure 16.2 Object detection

16.1.3 Object detection vs image classification

Figure 16.3 Object detection vs image classification

16.1.4 Semantic segmentation

A semantic meaning is given to each pixel based on a classification algorithm.

Figure 16.4 Semantic segmentation

16.1.5 Semantic segmentation vs object detection

In picture 16.5, instance segmentation is actually semantic segmentation, with the
difference that the pixels are not only classified but they are also blurred. Instance
segmentation is what we get when we start from object detection and we want to
define the contour of the subject.

Figure 16.5

16.1.6 Modelling variability

The two tasks discussed so far shall cope with:

• different camera positions

• perspective deformations

• illumination changes

• intra-class variations

Figure 16.6 Example of different camera positions. We would like to recognize it
as a car despite the different point of view.

Figure 16.7 Example of perspective deformations. We would like to recognize it


as a building despite the different perspective.

Figure 16.8 Example of illumination changes. We would like to recognize it as a


face despite the different illumination.

Figure 16.9 Example of intra-class variations. An object can be represented with
different shapes. We would like to recognize it as a chair despite their different
shapes.

16.1.7 History of recognition

Figure 16.10

The focus of this course about object recognition will be on :

• boosting for face detection (Viola and Jones)

• template matching

• intro to bag of features model

• overview of deep learning methods

16.2 Template matching

Template matching is a rough approach: a very simple method. A template
can be:

• something fashioned, shaped, or designed to serve as a model

• something formed after a model

• an example

Then we detect a given object if we find an instance of the template in the
image. A similarity measure is hence needed; we can use correlation for
comparing the template and an instance.

Figure 16.11 Template matching

Template matching can be used for different applications.

Figure 16.12 Template matching applications

There are some challenges we need to cope with when using this method.
These difficulties come from the fact that it is difficult to find a universal
template. Indeed, every object can be affected by deformation, different
viewpoints, affine transforms (scale, rotations, translations, and so on), noise
and different illumination. Anyway, we can overcome some of these issues by
using different templates.

Figure 16.13 Human bodies can have different shapes and so we should use dif-
ferent templates.

16.2.1 How to apply a template?

Given an image and a template, how can we evaluate the match? We could
simply take the difference between the image and the template. However, such an
approach does not always provide reliable results: in the figure below we can see
that the template turns out to be more similar to a completely different face.

Figure 16.14 Differencing approach

There are actually far better approaches. In this course we do not cover
all of them but we focus only on rigid template matching.

Figure 16.15 Approaches

16.2.2 Correlation-based rigid template matching

Basically we have a template T constituted by a small image representing a
rigid object. Correlation-based rigid template matching uses a sliding window
approach: we place T in every possible position across the image, without
performing rotation and scaling. We compare the pixel values, the features,
the edges or the gradient orientations, and we then use a similarity metric.
Different similarity metrics are possible.

Figure 16.16
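
A minimal OpenCV sketch of correlation-based rigid template matching, using a normalized correlation coefficient as the similarity metric (file names are hypothetical):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::Mat img  = cv::imread("scene.png",    cv::IMREAD_GRAYSCALE);  // hypothetical files
    cv::Mat tmpl = cv::imread("template.png", cv::IMREAD_GRAYSCALE);
    if (img.empty() || tmpl.empty()) return -1;

    // Slide the template over the image and score every position with the
    // normalized correlation coefficient (robust to uniform illumination changes).
    cv::Mat scores;
    cv::matchTemplate(img, tmpl, scores, cv::TM_CCOEFF_NORMED);

    // The best match is the location with the highest score.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(scores, &minVal, &maxVal, &minLoc, &maxLoc);

    // Draw the detected instance as a bounding box.
    cv::Mat vis = cv::imread("scene.png");
    cv::rectangle(vis, maxLoc, maxLoc + cv::Point(tmpl.cols, tmpl.rows),
                  cv::Scalar(0, 0, 255), 2);
    cv::imwrite("match.png", vis);
    return 0;
}
```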

16.2.3 Generalized Hough Transform rigid template matching

The Hough transform approach works for more complex shapes and doesn't
require sliding the template over the image. However, the parameter space might
have high dimensionality.
The choice of which approach to use, whether Hough transform or correlation,
depends on how complicated the shape is (then the parameter space has high
dimensionality) and how large the image is (then the sliding window is
computationally expensive).
Moreover, the Hough transform approach is quite robust to illumination.

16.2.4 Template matching weak points and how to cope with

We saw previously the challenges that we need to face when implementing


template matching. For dealing with illumination changes, we can opt for
one of the following:

1. using edge maps instead of images

2. using ZNCC: subtract the uniform illumination component

For dealing with scale changes we can opt for one of the following:

1. matching with several scaled versions of the template

2. generalized Hough transform

3. work in scale space.

For dealing with rotation we can opt for one of the following:

1. matching with several rotated versions of the template

2. generalized Hough transform

16.3 Bag of words

The idea is to try to decompose a complex object (the pattern) into a composition
of multiple elements (which can also be called features). Then we look
for those elements in the image. This technique is designed to be invariant
to several factors, mainly viewpoint and deformations.

Figure 16.17 Bag of words

Figure 16.18 Bag of words example. We decompose different objects.

We evaluate the contribution of every word, and the object corresponds to the
word that gave the largest contribution. Hence we can use bag of words for
classifying images.

Figure 16.19 Evaluation of different features

This approach can be implemented in a smart way by means of features.


Thus for instance we can develop the following approach:

1. Extract features, i.e. keypoints and descriptors. At the end of this step we
get the feature vectors.

Figure 16.20 Feature extraction

2. Then we cluster the feature space: each descriptor is assigned to a cluster.

Figure 16.21 Feature clustering

3. Codebook generation: we select the centroid point for each cluster.


These points represent the words of the bag.

Figure 16.22 Codebook generation

Then, given the codebook and an image to analyse, we use a classifier such as an
SVM to classify the image based on the evaluated words.

Chapter 17

Introduction to Deep Learning

17.1 Machine Learning

ML is a field of computer science that investigates how it is possible to


learn from data. A ML algorithm deals with creating a model by looking
at some labelled data. Then the goal is to use the model for doing some
predictions: we pass some labelled data without their labels and then we use
the corresponding labels to measure the accuracy.

Figure 17.1 image classification example

17.1.1 ML framework

A ML algorithm takes as input a dataset of labelled data. The goal is then to find
the model f* that best approximates the unknown function f. Using ML means
finding the optimal parameters for f*. We also need to choose the specific type of
the model f*, such as a DL network, a NN, an SVM and so on.
In this course we deal only with supervised learning.

17.1.2 Linear model and Kernel

A simple case is represented by the linear model

f (xi ) = W xi

where W is the vector of the parameters that the ML algorithm needs to find.
However, this model cannot describe non-linear phenomena. We can overcome this
issue by using a kernel function. A kernel function maps the data into a different
space; we can then use it to linearize a non-linear phenomenon. However, the kernel
is often application-dependent and hence hard to generalize.

Figure 17.2 Kernel lets us map a non linear distribution of data to a linear one.

If we have a computer vision problem, which representation of data


do we provide to a ML algorithm? In computer vision we deal with
images which are represented as a matrix. For inputting an image to a ML
algorithm, we need to use a different representation. Indeed, learning is not
effective when the input space has too many dimensions ( where too many
depends on the ML method actually ). Learning is no magic, but it is similar
to an optimization problem. An idea is to reshape the image to a vector.

17.2 Deep Learning basics

By moving to deep learning we no longer have to worry about how to represent
images: DL also learns the data representation. In DL, data is represented by
means of a hierarchy of multiple layers, which represent multiple levels of
representation. However, a DL network needs a huge amount of data to be
trained.

Take into account that DL is not always the best approach for image classification:
sometimes using a ML technique leads to better results.

Figure 17.3 ML and DL for image classification

The advantages of using DL are:

• learned features are adapted to the input data. Hence compared to ML


we don’t need to extract features manually which might be incomplete
since performed by humans.

• learned representation work for both supervised and non supervised


applications

• data are represented and classified into one single network. This is
defined as end to end learning.

A Deep Neural Network is a composition of several simple functions, called
layers. Each layer is made of a set of neurons. A DNN includes non-linear
elements (the activation functions): if the layers were all linear, they would
collapse into a single layer. Thus the non-linear elements force a hierarchical
topology made of subsequent stages.

Figure 17.4 DNN

17.2.1 Neurons

The inputs to the neurons of one layer are the outputs of the neurons in the
preceding layer. This mechanism is called a feedforward network.
The output of a neuron is the output of an activation function applied to a
weighted sum of the inputs plus a bias. There exist different types of
activation functions.

Figure 17.5 Model of an artificial neuron. The l denotes a layer in the network.
The formula highlights the output of a neurons at a given layer l. The weights
and bias are tuned through the training process.
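
To make the formula concrete, here is a tiny C++ sketch of a single neuron: a weighted sum of the inputs plus a bias, passed through a sigmoid activation (all values are made up):

```cpp
#include <vector>
#include <cmath>
#include <numeric>
#include <iostream>

// Sigmoid activation function.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Single neuron: weighted sum of the inputs plus a bias, then the activation.
double neuron(const std::vector<double>& x,
              const std::vector<double>& w, double bias) {
    double z = std::inner_product(x.begin(), x.end(), w.begin(), bias);
    return sigmoid(z);
}

int main() {
    std::vector<double> x = {0.5, -1.2, 3.0};   // outputs of the previous layer
    std::vector<double> w = {0.8,  0.1, -0.4};  // trained weights (illustrative)
    std::cout << neuron(x, w, 0.2) << std::endl;
    return 0;
}
```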

17.2.2 Activation functions

Sigmoid

It represents the natural firing of a neuron. It maps the input into a value
within the range [0,1]. This results in smoothing the input. However the
sigmoid neurons saturate and kill gradients.

Figure 17.6 Sigmoid function

Tanh

The tahn function maps: Rn ← [−1, 1]. It presents the same saturation
problem seen for the sigmoid function. Indeed the tanh represents a scaled
sigmoid:
tanh(x) = 2sigm(2x) − 1

Figure 17.7 Tanh function

ReLU

ReLU is the most popular activation function. It is implements the mapping


Rn ← R+
n
:
Relu(x) = max(0, x)

It is characterized by a lighter computation and faster training thanks to its


non saturating nature: relu prevents the gradient vanishing problem.

Figure 17.8 ReLU function

17.2.3 Overview of a fully connected NN layer and output layer

Figure 17.9 A NN layer and the output layer

17.2.4 Fully connected DNN

In a fully connected network each neuron of a layer is connected to all the


neurons of the next layer.
A deep fully connected network is a network composed of several layers which
results in a high number of connections.
A typical usage of a deep neural network is classification: the number of output
neurons represents the number of classes, and the right class is selected by taking
the neuron that outputs the highest value.

17.2.5 Training

The training stage involves:

1. Select a random subset of the training data (mini-batch): a small batch
of images speeds up the computation of the gradient

2. the definition of a cost function

3. feedforward pass: we map the input vector x to the corresponding


output.

4. we measure the error between the desired and actual values using the
cost function.

5. we backpropagate the errors

6. we update the weights

Figure 17.10 Training process

Overfitting

Overfitting means that the model perfectly fits the training data and hence
cannot generalize, leading to errors on test data.
Techniques for overcoming overfitting are:

• Dropout: randomly drop units during training

• L2 weight decay: penalization of big weights by an adjustable decay


value.

• Early stopping: we use the validation error to decide when training should
be stopped. Training is stopped when we have not observed any improvement
for n epochs.

Pre-processing - Data augmentation

We can use some preprocessing techniques for having better performance.


The preprocessing aims at adding some variance to the data such as noise,
rotation, flip, hue, color, saturation and so on. This is often called data
augmentation.

17.2.6 Output layer: Classification vs Regression

The output layer depends on the classification problem we are solving. If, for
instance, we have two classes, several representations are possible: either one
value describing the class, or two values describing the score of each class.
This can be generalized to N classes.
We can use deep learning both for classification problems and for regression
problems. A regression problem means estimating a continuous value, such as the
height of a person.

Figure 17.11 Classification vs Regression

Figure 17.12 Classification vs Regression

17.3 Convolutional neural networks - CNNs

A CNN is a specific type of DNN used for dealing with images. It learns the
structure of an image: an image is a 2D matrix and the interesting infor-
mation an image brings is local, i.e the spatial neighbourhood is important.
By spatial neighbourhood we mean that two pixels close to each other are likely
related to the same object, background or context. A CNN makes use of some
convolutional layers (not all layers are convolutional) to take advantage of this
local property of the image.
The keypoints of a CNN are the following:

• for each input image we consider the receptive field, i.e. a group of pixels
close to each other, and analyse them together inside the convolutional
layer. This represents a local processing that takes into account the spatial
neighbourhood property of an image explained before.

• the convolutional layers are partially connected.

Figure 17.13 Partial vs fully connected layers. The partial layer is showed in the
left picture.

• in the convolutional layers the weights are shared, i.e. every convolutional
unit shares the same parameters even if it operates on a different input
receptive field. This reduces the number of weights to be trained and hence
increases the performance.

Figure 17.14 1D convolution example highlighting the fact that the weight are
shared

• a CNN can perform multiple feature maps: each group of pixels (re-
ceptive fields) can be evaluated in different ways by means of different
kernels. In particular every feature map has its own set of shared
weights.

How do we compute different feature maps? In a practical scenario, we
choose the number of feature maps and the different sizes of the
convolutional kernels.

Figure 17.15 1D example of different feature maps for the same input

• a CNN also exploits some pooling layers. This kind of layer reduces the
size of the feature maps produced by the convolutional layers, thus performing
a dimensionality reduction. Pooling layers usually consist of max or average
kernels; max pooling is the most employed. Pooling enhances the performance
and makes the NN more reliable, since it selects the most relevant element
even though the resolution has been decreased. Moreover, it adds deformation
invariance.
Pooling performs a feature map subsampling. This operation is defined by 2
parameters:

– pooling area: defines the number of pixels to put together


– pooling stride: defines the distance between the start of a pool
and the start of the next one.

Usually stride and area are equal, so that the areas of different pools do not
overlap; rarely are they different.

Figure 17.16 Pooling example with pool area and pool stride equals to 2.

• after the sequence of convolutional and pooling layers, we find a small fully
connected NN.

Hence the framework of a CNN is: convolutional layer (local processing) -> dimensionality
reduction -> convolutional layer -> dimensionality reduction -> small fully
connected NN.
We use this framework for increasing the performance. Indeed, we could also use
a single feature map, but it has been proved that the previous configuration
increases the performance.

Figure 17.17 CNN architecture

Figure 17.18 Example of convolution + pooling. The right picture is the visual
representation of a feature map.

Figure 17.19 Convolutional layer example. Weights are the red numbers and they
are superimposed over the image.

17.4 Example1: MNIST

We solve the MNIST problem: the MNIST dataset contains handwritten digits.
The dataset is formed by 60000 training samples + 10000 testing samples.
All the images are 28x28 grayscale images. The MNIST problem is the 'Hello
world' of deep learning.
For solving the MNIST problem of recognizing handwritten digits we:

1. choose the network structure: the topology determines the number of


trainable parameters

2. train the network

3. test the network

Figure 17.20 CNN used for solving the MNIST problem and the number of pa-
rameters involved

Figure 17.21 Illustration of the convolutional and pooling layer in the MNIST
problem

Figure 17.22 Illustration of the weights of first and second layer of the MNIST
CNN

Figure 17.23 Visual summary of an input image propagating through the CNN.

Figure 17.24 Visual testing and training accuracy and error.

17.5 Example2: CIFAR

Dataset:
• 60k images (50k training, 10k testing), 32×32 pixels
• 10 classes, 6k images per class
• High-quality images, clearly separable classes
• Color images: 3 input channels

Figure 17.25 Visual testing and training accuracy and error + CIFAR dataset

Figure 17.26 Visual summary of an input image propagating through the CNN.

17.6 Keras

Keras is a Python deep learning library supporting GPU and CPU. It represents
a simple interface to TensorFlow, Theano and CNTK.

Figure 17.27 Keras structure

17.6.1 Getting started

1. we create a sequential model. The model is the network representation.


We can create any network different from the sequential model by using
the model class API.

Figure 17.28

2. we create a dense layer. Dense is a fully connected network.

Figure 17.29

3. we compile the model. Compiling means setting the parameters for the
training process.

Figure 17.30

4. Then we trigger the training process

Figure 17.31

5. Evaluate the performance

Figure 17.32

6. Generate prediction

Figure 17.33

17.7 CNN in keras

The typical layers of CNN are:

• convolutional + RELU

• pooling

• fully connected

• softmax activation

Figure 17.34 Typical CNN

17.7.1 Convolutional layer in Keras - Conv2D

We usually use 2D convolutional layers for spatial convolution. It is important
to note that we can also create 1D convolutional layers for temporal convolution.
The creation of a convolutional layer involves a long argument list; anyway, we
just need to specify the arguments we use, and all the rest are initialized to their
default values.

Figure 17.35 2D convolutional layer definition with all their possible arguments
initialized at their default value

Figure 17.36 2D convolutional layer code implementation

The data structure used as input/output for Conv2D is called a tensor, i.e. a
typed multidimensional array. We usually do not need to manage it directly.

17.7.2 Pooling layer in Keras - MaxPooling2D

Like Conv2D, it takes a long list of arguments, but we need to declare only those
we need.

Figure 17.37 maxpooling2D

17.7.3 Dense layer in Keras - Dense

Dense layer is a fully connected layer.

Figure 17.38 Dense layer

17.7.4 Softmax layer in Keras

Activations are standalone layers embedded as forward layers.
The softmax activation is often placed at the end of the last fully connected layer.

Figure 17.39 Softmax layer

17.8 Compiling a model in Keras

Compiling a model means configuring the learning process. A model can be


compiled by providing a:

• an optimizer

• a loss function

• a list of metrics

Figure 17.40 Compiling a model

17.8.1 Optimizer

Several optimizers are provided.

Figure 17.41 List of optimizers

17.8.2 Loss function

The loss function is also called objective function or optimization score function.
The loss function measures the distance between the actual and the desired output,
in order to drive the training process.
The type of loss function depends on the task. Good starting points for the
different tasks are:

• classification: categorical/binary cross entropy

• regression: mean square error

Figure 17.42 List of loss functions

17.8.3 Metric

A metric function is similar to a loss function, except that the results from
evaluating a metric are not used when training the model: they are only used to
judge the performance of the model.

Figure 17.43 List of metrics

17.8.4 Example MNIST

Example is available at : https://colab.research.google.com/github/AviatorMoser/


keras-mnist-tutorial/blob/master/MNIST%20in%20Keras.ipynb

17.9 Colab

Colab is similar to Jupyter. It is an online programming tool.

Chapter 18

Notes

18.1 HSV

In the HSV color representation, the hue range is 0-360, but in OpenCV it is
divided by 2, i.e. 0-180. This is because the HSV image has to fit into three 8-bit
channels. S and V are instead in the range (0,255).

Figure 18.1 hsv hue
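
A minimal OpenCV sketch of converting to HSV and thresholding on hue, keeping in mind the 0-180 hue range (the file name and thresholds are illustrative):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::Mat bgr = cv::imread("input.png");   // hypothetical input image
    if (bgr.empty()) return -1;

    // Convert to HSV: in OpenCV H is in [0, 180), S and V in [0, 255].
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

    // Example: select reddish pixels (hue close to 0; illustrative thresholds).
    cv::Mat mask;
    cv::inRange(hsv, cv::Scalar(0, 100, 100), cv::Scalar(10, 255, 255), mask);
    cv::imwrite("red_mask.png", mask);
    return 0;
}
```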

18.2 Contours

A contour is a curve delimiting an area of pixels having the same color or intensity.
Contours can be found using the function cv::findContours(), which outputs a
vector of point lists identifying the contours. It can store all the (x,y) coordinates
of the boundary of a shape or just the fundamental ones, which lets us save memory.
Contours can be drawn by means of the function cv::drawContours(), which either
draws the contours with a specific thickness and colour, or fills in the area bounded
by the contours.

Figure 18.2 contours example with all the (x,y) coordinates of the boundary of a
shape and only some
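
A minimal sketch of the two functions in use (the input file name and the threshold are hypothetical):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("shapes.png", cv::IMREAD_GRAYSCALE);  // hypothetical input
    if (img.empty()) return -1;

    // Binarize the image, then extract the external contours.
    cv::Mat bin;
    cv::threshold(img, bin, 128, 255, cv::THRESH_BINARY);

    std::vector<std::vector<cv::Point>> contours;
    // CHAIN_APPROX_SIMPLE keeps only the fundamental points (saves memory);
    // CHAIN_APPROX_NONE would store every boundary pixel.
    cv::findContours(bin, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    // Draw all contours in green with a 2-pixel thickness (use cv::FILLED to fill).
    cv::Mat vis;
    cv::cvtColor(img, vis, cv::COLOR_GRAY2BGR);
    cv::drawContours(vis, contours, -1, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("contours.png", vis);
    return 0;
}
```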

Any unauthorized distribution (digitally or on paper) is expressly prohibited
without prior authorization from the author of these notes
