CV Notes
Author
Francesco Gazzola
2021/2022
1 C++ introduction 12
1.1 Toolchain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.1 Preprocessor . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.2 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.3 Linker . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.4 File Extensions . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.5 Example Toolchain with static library . . . . . . . . . . . 14
1.2 Static and dynamic libraries . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Example static library . . . . . . . . . . . . . . . . . . . . 15
1.2.2 Advantage of dynamic libraries over static . . . . . . . . . 15
1.3 Standard C++ Libraries . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 Templates . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 OpenCV library . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.1 Installation on windows . . . . . . . . . . . . . . . . . . . 17
1.4.2 HelloWorld . . . . . . . . . . . . . . . . . . . . . . . . . . 17
What is cv::? . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.3 Compiling HelloWorld . . . . . . . . . . . . . . . . . . . . 18
1.4.4 Standard Directories . . . . . . . . . . . . . . . . . . . . . 19
1.5 CMake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5.1 User perspective . . . . . . . . . . . . . . . . . . . . . . . 20
Example under Linux . . . . . . . . . . . . . . . . . . . . 21
1.5.2 Developer perspective . . . . . . . . . . . . . . . . . . . . 21
CMakeList.txt for OpenCV . . . . . . . . . . . . . . . . . 22
1.6 Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.1 Reference . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.6.2 Comparison pointers and references . . . . . . . . . . . . 23
1.7 Cast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.1 Void* pointers . . . . . . . . . . . . . . . . . . . . . . . . 24
1.8 Callbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.9 Header guards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.10 Lab feedbacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Image coding 32
3.1 Analog and Digital converter overview . . . . . . . . . . . . . . . 32
3.1.1 ADC for images . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Image coordinate system . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Digital color image . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Access image pixels . . . . . . . . . . . . . . . . . . . . . 36
3.4 Digital color video . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Spatial & gray level resolution . . . . . . . . . . . . . . . . . . . 37
3.5.1 Gray-level/ intensity resolution . . . . . . . . . . . . . . . 37
3.5.2 Contrast . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.3 Spatial resolution . . . . . . . . . . . . . . . . . . . . . . 38
4 Image processing 39
4.1 Geometric transformation . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Homogeneous coordinates . . . . . . . . . . . . . . . . . . 40
Conversion to homogeneous coordinates . . . . . . . . . . 40
Conversion from homogeneous coordinates . . . . . . . . . 41
Example of translation in homogeneous and normal coordinates . . . . . 41
4.1.2 Affine transformation . . . . . . . . . . . . . . . . . . . . 41
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Examples of affine transformations . . . . . . . . . . . . . 42
4.1.3 Forward vs backward mapping . . . . . . . . . . . . . . . 43
4.1.4 cv::warpAffine . . . . . . . . . . . . . . . . . . . . . . . . 44
DoF : Degree of freedom . . . . . . . . . . . . . . . . . . 44
4.2 Single-Pixel operations . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Image negatives . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Log transformations . . . . . . . . . . . . . . . . . . . . . 46
4.2.3 Power-law ( or gamma) transformations . . . . . . . . . . 46
4.3 Contrast stretching and thresholding . . . . . . . . . . . . . . . . 49
4.3.1 Intensity slicing . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.2 Bit plane slicing . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.1 Histogram equalization . . . . . . . . . . . . . . . . . . . 53
4.4.2 Histogram in OpenCV . . . . . . . . . . . . . . . . . . . . 56
4.4.3 Histogram specification . . . . . . . . . . . . . . . . . . . 56
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Comparison Histogram equalization and histogram specification . . . . . 59
4.4.4 Local histogram equalization . . . . . . . . . . . . . . . . 59
5 Spatial filtering 61
5.1 Local operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Spatial Linear filtering . . . . . . . . . . . . . . . . . . . . . . . . 62
Correlation vs Convolution . . . . . . . . . . . . . . . . . 63
5.2.1 Filters and brightness . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Smoothing filters . . . . . . . . . . . . . . . . . . . . . . 64
Average filter . . . . . . . . . . . . . . . . . . . . . . . . 64
Gaussian filter . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.3 Separable filter kernels . . . . . . . . . . . . . . . . . . . 67
5.2.4 Sharpening (aka derivative) filters . . . . . . . . . . . . . 67
First derivative operator . . . . . . . . . . . . . . . . . . . 68
Second derivative operator . . . . . . . . . . . . . . . . . 70
5.2.5 Combination of multiple technique . . . . . . . . . . . . . 72
5.3 Non-linear filtering & Restoration filters . . . . . . . . . . . . . . 73
5.3.1 Image restoration overview . . . . . . . . . . . . . . . . . 73
Noise models . . . . . . . . . . . . . . . . . . . . . . . . 74
Salt and Pepper (or impulse) noise . . . . . . . . . . . . . 76
5.3.2 Averaging filters . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.3 Median filters . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.4 Bilateral filter . . . . . . . . . . . . . . . . . . . . . . . . 79
Bilateral vs Median . . . . . . . . . . . . . . . . . . . . . 81
5.3.5 Max and min filters . . . . . . . . . . . . . . . . . . . . . 81
5.3.6 Alpha-trimmed mean filters . . . . . . . . . . . . . . . . . 82
5.3.7 Adaptive filters . . . . . . . . . . . . . . . . . . . . . . . 84
Adaptive median filters . . . . . . . . . . . . . . . . . . . 84
5.4 Filter for coloured images . . . . . . . . . . . . . . . . . . . . . . 85
6 Filtering in frequency 86
6.1 Fourier Transform applied to image . . . . . . . . . . . . . . . . . 86
6.1.1 Example of Rect-sinc transform . . . . . . . . . . . . . . . 87
6.1.2 Spectrum and phase for DFT . . . . . . . . . . . . . . . . 87
6.1.3 Spectrum centering . . . . . . . . . . . . . . . . . . . . . 88
6.1.4 Rotation and translation . . . . . . . . . . . . . . . . . . 90
6.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Aliasing on image . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Moiré effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 Filters in frequency . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Lowpass filters - smoothing . . . . . . . . . . . . . . . . . . . . . 94
6.5.1 Ideal Low pass filter - ILPF . . . . . . . . . . . . . . . . . 94
Ringing effect . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5.2 Butterworth lowpass filter - BLPF . . . . . . . . . . . . . 96
6.5.3 Gaussian low pass filter - GLPF . . . . . . . . . . . . . . . 98
6.5.4 Sum up Low pass filtering . . . . . . . . . . . . . . . . . . 100
6.6 Highpass filters - sharpening . . . . . . . . . . . . . . . . . . . . 101
6.6.1 Ideal Highpass . . . . . . . . . . . . . . . . . . . . . . . . 102
6.6.2 Butterworth Highpass . . . . . . . . . . . . . . . . . . . . 103
6.6.3 Gaussian Highpass . . . . . . . . . . . . . . . . . . . . . . 103
6.6.4 Relation frequency and spatial filtering . . . . . . . . . . . 104
6.7 Selective filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.7.1 Bandreject and bandpass filters . . . . . . . . . . . . . . . 105
6.7.2 Notch filter . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3.2 Gradient computation . . . . . . . . . . . . . . . . . . . . 118
7.3.3 Edge quantization . . . . . . . . . . . . . . . . . . . . . . 119
7.3.4 Non-maxima suppression . . . . . . . . . . . . . . . . . . 119
7.3.5 Hysteresis thresholding . . . . . . . . . . . . . . . . . . . 120
7.3.6 Example of canny algo . . . . . . . . . . . . . . . . . . . 122
7.3.7 Gaussian kernel size σ and threshold relation . . . . . . . 122
7.4 Edge detector comparison . . . . . . . . . . . . . . . . . . . . . . 123
7.5 Line detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5.1 Naive approach . . . . . . . . . . . . . . . . . . . . . . . 125
7.5.2 Hough transform . . . . . . . . . . . . . . . . . . . . . . 125
Normal representation . . . . . . . . . . . . . . . . . . . . 126
7.5.3 Canny edge detector Vs Hough transform line detector . . 129
7.5.4 Generalized Hough transform - Circle detection . . . . . . 130
9 Segmentation 139
9.0.1 Application . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.1 Segmentation by thresholding . . . . . . . . . . . . . . . . . . . 141
9.1.1 Factors influencing the selection of the right threshold . . 142
9.2 Otsu’s optimal threshold . . . . . . . . . . . . . . . . . . . . . . 144
9.2.1 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.2.2 Otsu + edge detection . . . . . . . . . . . . . . . . . . . 147
9.2.3 Variable thresholding (Adaptive thresholding) . . . . . . . 149
9.2.4 Multiple thresholding . . . . . . . . . . . . . . . . . . . . 150
9.2.5 What is Percentile? . . . . . . . . . . . . . . . . . . . . . 151
9.3 Region growing methods . . . . . . . . . . . . . . . . . . . . . . 152
9.3.1 Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.3.2 Region growing algorithm . . . . . . . . . . . . . . . . . . 152
9.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.3.4 Growing regions vs thresholding-based segmentation . . . 154
9.4 Segmentation using Watersheds . . . . . . . . . . . . . . . . . . 155
9.4.1 Watershed on gradients . . . . . . . . . . . . . . . . . . . 157
9.4.2 Over-segmentation and markers . . . . . . . . . . . . . . 158
Marker selection . . . . . . . . . . . . . . . . . . . . . . . 159
Example of markers . . . . . . . . . . . . . . . . . . . . . 160
9.4.3 OpenCV - watershed(img, markers) . . . . . . . . . . . . 160
11 Active contours 191
11.1 Snake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
11.1.1 Initialization process . . . . . . . . . . . . . . . . . . . . . 191
11.1.2 Contour move . . . . . . . . . . . . . . . . . . . . . . . . 192
11.2 Scissors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
11.3 Level set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
12.9.8 SIFT sum up . . . . . . . . . . . . . . . . . . . . . . . . 228
12.9.9 SIFT examples . . . . . . . . . . . . . . . . . . . . . . . . 228
12.9.10 SIFT robustness . . . . . . . . . . . . . . . . . . . . . . . 229
Robustness to noise . . . . . . . . . . . . . . . . . . . . . 230
Robustness to affine transform . . . . . . . . . . . . . . . 230
12.9.11 Questions from slides . . . . . . . . . . . . . . . . . . . . 231
12.10 Overview of other SIFT-based features . . . . . . . . . . . . . . 232
12.10.1 Principal Component Analysis . . . . . . . . . . . . . . . 232
PCA recall . . . . . . . . . . . . . . . . . . . . . . . . . . 232
PCA-SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.10.2 SURF features . . . . . . . . . . . . . . . . . . . . . . . . 234
12.10.3 GLOH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.11 Other feature approaches . . . . . . . . . . . . . . . . . . . . . 237
12.11.1 Shape context . . . . . . . . . . . . . . . . . . . . . . . . 238
12.11.2 LBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
12.11.3 BRIEF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.11.4 ORB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.11.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 242
12.12 Feature matching . . . . . . . . . . . . . . . . . . . . . . . . . 242
12.12.1 Matching strategy . . . . . . . . . . . . . . . . . . . . . . 243
12.12.2 Matching performance . . . . . . . . . . . . . . . . . . . 243
Projection matrix (3D to pixel coordinates) . . . . . . . . 261
14.2.4 Rotatranslation - Mapping from 3D world to 3D camera . 262
Figuring out rototranslation . . . . . . . . . . . . . . . . . 263
14.2.5 Projection recap . . . . . . . . . . . . . . . . . . . . . . . 267
14.3 Inverse projection . . . . . . . . . . . . . . . . . . . . . . . . . . 268
14.4 The thin lens model . . . . . . . . . . . . . . . . . . . . . . . . . 268
14.5 Non ideal (real) lenses . . . . . . . . . . . . . . . . . . . . . . . . 272
14.5.1 Distortion . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Radial distortion . . . . . . . . . . . . . . . . . . . . . . . 272
Tangential distortion . . . . . . . . . . . . . . . . . . . . 274
14.5.2 Chromatic aberration . . . . . . . . . . . . . . . . . . . . 275
14.5.3 Distortion and camera lens . . . . . . . . . . . . . . . . . 275
14.6 Field of view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
14.7 Real camera overview . . . . . . . . . . . . . . . . . . . . . . . . 277
14.7.1 Perspective vs viewpoint . . . . . . . . . . . . . . . . . . 280
14.7.2 Sensors overview . . . . . . . . . . . . . . . . . . . . . . 281
14.7.3 Exposure . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Aperture . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Shutter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
14.7.4 ISO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
14.8 Camera calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 286
14.8.1 Why calibration? . . . . . . . . . . . . . . . . . . . . . . 286
14.8.2 Calibration process . . . . . . . . . . . . . . . . . . . . . 287
14.8.3 Reprojection error . . . . . . . . . . . . . . . . . . . . . . 290
14.8.4 Sum - up . . . . . . . . . . . . . . . . . . . . . . . . . . 291
16.1.5 Semantic segmentation vs object detection . . . . . . . . 297
16.1.6 Modelling variability . . . . . . . . . . . . . . . . . . . . . 297
16.1.7 History of recognition . . . . . . . . . . . . . . . . . . . . 299
16.2 Template matching . . . . . . . . . . . . . . . . . . . . . . . . . 299
16.2.1 How to apply a template? . . . . . . . . . . . . . . . . . . 301
16.2.2 Correlation-based rigid template matching . . . . . . . . . 302
16.2.3 Generalized Hough Transform rigid template matching . . 303
16.2.4 Template matching weak points and how to cope with . . 303
16.3 Bag of words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
17.8.3 Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
17.8.4 Example MNIST . . . . . . . . . . . . . . . . . . . . . . . 329
17.9 Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
18 Notes 330
18.1 HSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
18.2 Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Chapter 1
C++ introduction
1.1 Toolchain
1.1.1 Preprocessor
Figure 1.1 Toolchain
1.1.2 Compiler
It takes the source code as input and produces the object code (also called machine code) as output; note that the object code is not yet an executable, it still needs to be linked.
1.1.3 Linker
It is a program that links the object code of the libraries used by the program together with the object code of the program itself, producing the final executable.
• .obj object code in Windows (in Linux it is .o)
// file name: my_func.h
int f(int i);
// main.cpp
#include "my_func.h"
int main(void)
{
    int i = 0;
    i = f(i); // Function call
    return 0;
}
• STATIC. The previous example shows the use of a static library: the object
code of the library is merged with the object code of the source.
We use different command-line options to tell the linker whether the libraries are static or dynamic.
Moreover, note that libraries usually come already compiled (hence we only have their object code, ready to be merged or linked with the object code of the source). Open-source software, on the other hand, usually also offers the source code of the libraries involved.
The advantage comes from the scenario in which multiple programs use the same library. If the library is static, each program will have its own copy of the library merged into its object code. If it is dynamic, a single copy serves all the programs: we keep it in memory. Dynamic libraries therefore allow us to save space: one installation serves all.
Figure 1.6 Static and dynamic libraries
They are libraries that provide support for basic operations such as I/O streams and so on.
1.3.1 Templates
A special part of the standard libraries is the Standard Template Library (STL). It makes use of the template concept, allowing us to generate functions starting from a template. A template is a way of defining generic functions that can take inputs of a generic data type.
Figure 1.7 Template example for a simple function returning the input.
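The figure is not reproduced here; a minimal sketch of such a template (the function name is an illustrative choice):

// Template example: a generic function returning its input
template <typename T>
T identity(T input)
{
    return input; // works for int, float, std::string, ...
}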
The STL also provides the STL containers, which are templates defining data structures: vector, list and so on.
Figure 1.8 Example STL container
We use CMake for building the library. After building, the library shall be installed.
1.4.2 HelloWorld
// OpenImg.cpp
#include <opencv2/highgui.hpp>
int main(int argc, char **argv)
{
    cv::Mat img = cv::imread(argv[1]);
    cv::namedWindow("Example 1");
    cv::imshow("Example 1", img);
    cv::waitKey(0);
    return 0;
}
• Mat is a data structure that stores the pixels and manages the memory; img is an object of class cv::Mat
• Imshow draws an image on a window
What is cv::?
cv is the namespace that contains all the OpenCV classes and functions; the scope resolution operator :: selects names from that namespace.
• OpenImg.cpp : Name of the input (source file)
g++ considers standard directory for looking up for the files. These default
directories are located in :
– /usr/include
– /usr/local/include
– /usr/lib
– /usr/local/lib
If we use directories different from the standard ones, we need to specify the full path of the file, as done for the header file above. For the libraries we use the option flag -L (capital L) to add a non-default directory to the library search path, while -l (lower case L) names the library to link; hence -l alone is enough only for libraries placed inside a default folder.
1.5 CMake
A software project is usually composed of several source files (.h, .hpp, .cpp) and this makes the compilation process not so straightforward. For instance, we can have different groups of files (libraries and source files) that generate different executables within the same project, and hence we need to compile together only the files related to each executable. We therefore need some way to describe the structure of the project, so that we know which files need to be compiled together. In general, we can describe a project by using one of the following tools:
• Command line: the commands can become very long and hard to interpret, since we need to run a different command line for compiling each executable.
User workflow:
3. Select the directory where the compiler output shall go (Standard: [project
root]/build)
4. Configure
6. Generate. The generate action creates the build files to be provided to the
local build system.
Developer workflow:
Figure 1.12 CMakeLists.txt
1.6 Pointers
Pointers are variables that store memory addresses. They point to a given type.
Hence, at a given destination the pointer knows that there there is stored a variable
of a specific type.
We use the symbols & for extracting the address of the variable and the dereference
operator * for extracting the content.
Figure 1.13 The pointer p is linked to the address of the variable i.
// Pointers example
int i = 10;
int *p;
p = &i;  // p now stores the address of i
*p = 15; // writes through the pointer: i becomes 15
1.6.1 Reference
References are an alternative to pointers: the syntax is different but the concept is similar.
A reference variable is an alias, that is, another name for an already existing variable. A reference, like a pointer, is also implemented by storing the address of an object. The main difference from pointers is that a reference cannot be assigned a NULL value.
A reference can be thought of as a constant pointer (not to be confused with a pointer to a constant value!) with automatic dereferencing, i.e. the compiler applies the * operator for you.
// References example
int i = 10;
int &r = i; // reference definition: r is an alias for i
r = 15;     // this assigns the value 15 also to the variable i
Figure 1.14 Pointers and references. OUT indicates what cout << i prints on the command line interface
1.7 Cast
// cast example
int i = 10;
float f;
f = static_cast<float>(i);
A void* pointer is a generic pointer. It cannot be dereferenced directly, since the compiler doesn't know how to interpret its content. Hence, before dereferencing it we need to cast it to a specific type.
// void* pointer example
#include <iostream>
int main()
{
    int a = 10;
    void *ptr = &a;
    std::cout << *(int *)ptr; // cast back to int* before dereferencing
    return 0;
}
1.8 Callbacks
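A callback is a function we hand over to other code (here, to the OpenCV library), which calls it back when a given event occurs. A minimal sketch, assuming the usual cv::setMouseCallback mechanism (the window name and the use of userdata are illustrative, not from the original notes); the void* userdata parameter is where the generic-pointer cast of the previous section comes into play:

// Callback sketch (illustrative)
#include <opencv2/highgui.hpp>
#include <iostream>

void onMouse(int event, int x, int y, int flags, void *userdata)
{
    if (event == cv::EVENT_LBUTTONDOWN)
    {
        // cast the generic pointer back to its real type
        cv::Mat *img = static_cast<cv::Mat *>(userdata);
        std::cout << "Clicked (" << x << ", " << y << ") on a "
                  << img->cols << "x" << img->rows << " image" << std::endl;
    }
}

int main(int argc, char **argv)
{
    cv::Mat img = cv::imread(argv[1]);
    cv::namedWindow("Example");
    cv::setMouseCallback("Example", onMouse, &img); // register the callback
    cv::imshow("Example", img);
    cv::waitKey(0);
    return 0;
}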
Figure 1.15 Header guards example
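Since the figure is not reproduced here, a minimal header-guard sketch (file and macro names are illustrative):

// my_func.h -- header guard example
#ifndef MY_FUNC_H
#define MY_FUNC_H

int f(int i); // declarations only: definitions live in a .cpp file

#endif // MY_FUNC_H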
• Do not implement functions that do too many tasks. Split the tasks into different functions instead. Each function shall do one specific task, so that it can be reused.
• Use a folder structure for the project: an "include" folder for the headers, an "image" folder for the images.
• Related function declarations shall be grouped into a single header file. We don't need to create a file for each function, and hence a header for each file.
• A header file contains only declarations! Not the definitions of the functions, which should be in a .cpp file.
• .cpp files SHALL NEVER BE INCLUDED. If you need to, it means that a header file is missing.
• We include the OpenCV headers (for cv::Mat, for example) wherever we use them.
Chapter 2
• Float in [0,1]: 0 is black and 1 is white. Hence, float is used for black/white images.
NOTE: "black and white" COMMONLY refers to a grayscale image; here we use float for images that have EXACTLY two colours, white or black.
The number of bits we use for representing each pixel is called DEPTH.
2.1.1 Color image
For a coloured image we use the RGB representation: each colour component, called a CHANNEL, is represented as a grayscale value. Hence, we have 3 channels. The depth applies to each channel.
2.2.1 Constructors
cv::Mat(nrows, ncols, type[, fillValue])
ex: cv::Mat(10, 10, CV_8U)
In the example above, we can see that the constructor of Mat lets us define an image with nrows rows, ncols columns and a given type.
Regarding the type, any single-channel array should belong to one of the following data types:
Generally the type is defined as CV_<depth>[C<# channels>]. If we omit the
# channels, by default it will be C1.
2.2.2 Inspectors
int Mat::channels() const; // # channels
int Mat::depth() const;    // matrix element depth for each channel
int Mat::type() const;     // matrix element type
const means that the function won't change the internal state of the object.
2.3 The Vec class
We use Vec to describe the triplet of colour values of a pixel. Indeed, a colour matrix has 3 channels (RGB); hence, a pixel is a tuple of three elements. The following code is an example of a coloured image accessed through the Vec class.
// ...
int main(void)
{
    Mat img(200, 200, CV_8UC3);
    for (int i = 0; i < img.rows; ++i)
    {
        for (int j = 0; j < img.cols; ++j)
        {
            // channel order is B, G, R: this paints the whole image red
            img.at<Vec3b>(i, j)[0] = 0;
            img.at<Vec3b>(i, j)[1] = 0;
            img.at<Vec3b>(i, j)[2] = 255;
        }
    }
    // visualization
    return 0;
}
Chapter 3
Image coding
Figure 3.1 Image sensing pipeline of a camera sensor ( taken from the book )
done by associating the digital input value to the closest value defined by the quantization. Then, we encode each quantized value to a binary value.
The quantization and encoding process is based on the number of bits used for the encoding (technically called the RESOLUTION of the ADC), which defines the quantization levels. The number of quantization levels (hence the number of values in the finite set) is equal to 2^m, where m is the number of bits used in the encoding.
3.1.1 ADC for images
where Q is the function dealing with the quantization. In figure 3.6 the quantization levels represent a grey scale.
We then start at the top of the continuous image and carry out the ADC conversion procedure downward, line by line, thus producing a digital image.
Figure 3.7 Sampling of an image. The image on the image plane is the continuous image appearing on the surface of the sensor, from which the digital image is obtained.
Figure 3.8 Sampling of a grey scale image. Sampling is done over n1 and n2, quantization over x(n1, n2). We have two representations here, one 3D and one 2D. Moreover, we notice that 0 is dark and higher values are brighter grey.
In this course we use the first convention reported in the first row of the table.
Note that it is different from our intuitive way of referring to a matrix: we here
use the convention (column, row) instead of (row, column). It depends on the
library anyway. OpenCV uses this unnatural representation.
In particular, in the course we use the notation (n1,n2) seen previously for the
grey scale image.
We have a triple of colour values for each point; the overlap of the three channels forms the image.
In a video we have a new variable with respect to those used for representing an image: we use k to represent the dependency on time. Hence, a pixel is represented by the triple x[n1, n2, k]. We don't use 't' instead of 'k' because 't' usually refers to continuous time, whereas our camera records a certain number of frames per second: K is the number of frames.
Figure 3.11 Digital color video
Intensity resolution is defined as the smallest detectable change in gray level. This is the resolution of the ADC, and hence it refers to the number of bits used for the quantization levels representing the various shades of gray. In other words, it is the number of bits per pixel.
Figure 3.12 Gray level example keeping the spatial resolution constant. For each
level, we are changing the quantization Q function.
3.5.2 Contrast
Contrast is defined as the difference between the highest and lowest intensity level
in the image.
3.5.3 Spatial resolution
Spatial resolution is defined as the smallest detectable detail in the image. What is a detail? What does detectable mean? A detail is some geometric element: when the image becomes smaller and smaller, some points (details) of the image disappear, hence they are no longer detectable.
Spatial resolution can also be expressed as the number of pixels per unit distance. This last definition is used particularly for printing: for instance, printing uses the dpi (dots per inch) unit of measure.
NOTE: Measure of spatial resolution must be stated with respect to a spatial unit
(such as dpi uses inch) to be meaningful. The size of the image itself doesn’t
tell us much. For instance, an image of resolution 1024x1024 pixels is not a
meaningful statement without stating the spatial dimensions encompassed by the
image.
Figure 3.13 Example 1 of spatial resolution. As shown, the smaller the spatial resolution, the smaller the image. Under 128 we lose a lot of detail.
The picture below shows the same images as above, zoomed out. Looking at it we can see that we are reducing either the number of pixels of the sensor or the number of pixels of the image. This is why it is important to state the spatial dimensions.
Chapter 4
Image processing
that each transformation preserves the properties listed in the rows below it. For
instance, similarity preserves not only angles but also parallelism and straight lines.
Conversion from homogeneous coordinates
This transformation preserves the collinearity of the points, that is, the alignment of points lying on the same line, and the ratios of their distances along the line. Formally: given points p1, p2 and p3 lying on the same line, the ratio |p2 − p1| / |p3 − p2| = k is preserved.
Example
Examples of affine transformations
4.1.3 Forward vs backward mapping
Forward mapping: each input pixel (x, y) is mapped to an output position (x', y'). This can lead to multiple input points landing on the same output (x', y') and to missing pixels, i.e. holes in the image: for instance, if the first pixel is mapped to 1 and the second to 10, we miss all the pixels between 1 and 10. Moreover, (x', y') can be non-integer, which requires a resampling (*). It is not used much.
Backward mapping: each output pixel (x', y') is mapped back to an input position (x, y). This does not lead to multiple points on the same mapping (x, y) and in general there are no missing pixels. Moreover, (x, y) can be non-integer, which requires a resampling (*). It is far better than forward mapping.
* The matrix multiplication seen before for the transformations can end up at non-integer points, which do not lie on the pixel grid and hence do not represent actual pixels. We handle this issue by resampling/interpolating the image.
Figure 4.6 Forward (on the left) vs Backward (on the right) mapping
4.1.4 cv::warpAffine
The degrees of freedom are the number of independent parameters of the transformation. For instance, in the image below each point gives two constraints: its movement along the x and y axes. With three points there are in total 6 DoF, which is exactly the number of parameters of an affine transformation.
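A minimal sketch of how cv::warpAffine can be used (the rotation parameters are illustrative):

#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>

int main(int argc, char **argv)
{
    cv::Mat src = cv::imread(argv[1]);
    // 2x3 affine matrix: here a rotation of 30 degrees around the image centre
    cv::Point2f center(src.cols / 2.0f, src.rows / 2.0f);
    cv::Mat M = cv::getRotationMatrix2D(center, 30.0, 1.0);
    cv::Mat dst;
    // warpAffine performs the (backward) mapping with interpolation
    cv::warpAffine(src, dst, M, src.size());
    cv::imshow("warped", dst);
    cv::waitKey(0);
    return 0;
}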
4.2 Single-Pixel operations
Consider a gray scale image with L gray levels. Single-pixel operations, or transforms (also called intensity transforms), are functions that change the gray levels of an image.
Figure 4.8 Single pixel operations. I(x, y) is a function representing the gray level
of the image. T (.) is the function performing the grey level change.
The three most used intensity transformation functions are the negative, logarithmic and power-law transformations. (In this course we discuss only the most popular transformations; in an industrial environment there are many more to know.)
Figure 4.9 Curves of all the basics transformation functions. The inverse diagonal
represents the negative function.
The negative of an image with intensity levels in the range [0, L - 1] is obtained
by using the negative transformation function:
s=L−1−r
where s is the new pixel value and r the original pixel value.
This processing switches dark and light (looking at the curve, we can see that the lower levels (dark pixels) are mapped to higher levels (light pixels) and vice versa for the higher levels).
s = c log(1 + r)
s = c r^γ
it is tunable: we can choose the intensity of the gray levels (contrast manipulation).
In particular, at figure 4.12 we can see that with γ = 1 the curve reduces to
the identity transformation, while γ > 1 produces the opposite effects as those
generated by γ < 1.
In the figure 4.13 we can see the effect produced by γ < 1. The pic at the top
left corner is the original image; the one at the top right corner is transformed
by applying γ = 0.6, the one at the bottom left corner has γ = 0.4, and the
last one γ = 0.3. We can see that as γ decreases some details become more
visible. However, with γ = 0.3 the contrast reduces and the image starts to have a 'washed-out' appearance.
Figure 4.13 Gamma transformation example γ < 1
In the figure 4.14 we can see the effect produced by γ > 1. The pic at the top
left corner is the original image; the one at the top right corner is transformed by
applying γ = 3, the one at the bottom left corner has γ = 4, and the last one
γ = 5. We can see that as γ increases some details become more visible (and the
contrast gets higher).
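A small sketch of how a power-law transformation can be applied in OpenCV (the lookup-table approach and the γ value are illustrative):

#include <opencv2/core.hpp>
#include <cmath>

// Apply s = c * r^gamma (with c chosen so that the output stays in [0, 255])
// to an 8-bit grayscale image, using a lookup table.
cv::Mat gammaTransform(const cv::Mat &src, double gamma)
{
    cv::Mat lut(1, 256, CV_8U);
    for (int r = 0; r < 256; ++r)
    {
        // normalize to [0,1], apply the power law, rescale to [0,255]
        lut.at<uchar>(r) = cv::saturate_cast<uchar>(255.0 * std::pow(r / 255.0, gamma));
    }
    cv::Mat dst;
    cv::LUT(src, lut, dst); // single-pixel operation applied to every pixel
    return dst;
}

For instance, gammaTransform(img, 0.4) brightens the dark regions, as in figure 4.13.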
4.3 Contrast stretching and thresholding
Single-pixel operations can be used to enhance the contrast (which, we recall, is the difference between the highest and lowest intensity level in the image) and to apply a threshold to an image.
For such operations we use piecewise linear transformation functions. For instance, in figure 4.15 we enhance the contrast of a low-contrast image. A common piecewise function for this operation is defined starting from two points (r1, s1) and (r2, s2).
A piecewise function for thresholding is instead defined by setting the points (r1, 0) and (r2, L − 1) with r1 = r2.
We can also change the contrast of an image only for a subset of pixels (hence of gray levels). To achieve this, we use piecewise functions that convert specific gray levels into other ones.
Figure 4.17 Intensity slicing
In particular, the slicing operation consists in putting the n-th bit of the binary pixel representation into the n-th plane at the corresponding pixel position (see figure 4.19). We hence obtain a binary image for each bit-plane.
To show a binary image graphically, we just need to display black for 0 and white for 1. To achieve this we start from the original image and apply a threshold: for instance, in the 8th plane at a generic pixel position we get 1 if the pixel value is equal to or greater than 128, and 0 otherwise; we can hence set a threshold that maps to 0 all the pixel values between 0 and 127, and maps to 1 the values between 128 and 255.
To reconstruct the final image, or to display the combination of only some planes, we multiply the pixels of the i-th plane by the corresponding power of two 2^i, do this for all the planes we want to add, and finally sum the values obtained for the different planes.
For instance, in figure 4.19, combining all the planes together, the second pixel in the first column of the final image would be computed as follows (and similarly for all the other pixels):
1 ∗ 4 + 0 ∗ 2 + 1 ∗ 1 = 5
while computing the same pixel by considering only the first 2 planes:
1 ∗ 4 + 0 ∗ 2 = 4
Figure 4.20 bit planes slicing example
4.4 Histogram
Let rk, k = 0, 1, 2, ..., L−1 denote the intensities of an L-level digital image f(x, y). For each level k, the histogram of f is defined as:
h(rk) = nk
where nk is the number of pixels with intensity rk.
Figure 4.22 histogram examples
p(rk) = h(rk)/(MN) = nk/(MN)
where M and N are the number of image rows and columns respectively (hence
MN is the total number of pixels in the image). In this course we use the term
histogram for referring to normalized histogram.
The sum of p(rk ) is 1: this lets us consider it as a Probability density function
(PDF).
T (r) is then defined as the CDF (Cumulative distribution function) of the random
variable r.
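For reference, the discrete form of the equalization transform (the standard textbook formula, shown in the figures but not reproduced in this text) is:

s_k = T(r_k) = (L - 1) \sum_{j=0}^{k} p_r(r_j),  for k = 0, 1, ..., L - 1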
Figure 4.23 Continuous and discrete T (r)
Figure 4.26 Equalization example of an image with only 8 gray levels. Given the
histogram identified by the table at the left top corner, we use the function in the
middle (si ) and we obtain the equalized histogram whose values are reported in
the table at the top right corner.
The starting histogram, the equalization function and the equalized histogram are
displayed at the bottom in order.
Figure 4.27 Example of equalization of an image with 256 gray levels. We can see
the equalization functions T used per each image 1,2,3,4
4.4.2 Histogram in OpenCV
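The code is not reproduced in the original notes; the sketch below shows the typical OpenCV calls (the parameter values are illustrative):

#include <opencv2/imgproc.hpp>

// Compute the (non-normalized) histogram of an 8-bit grayscale image
// and equalize the image.
void histogramDemo(const cv::Mat &gray, cv::Mat &hist, cv::Mat &equalized)
{
    int histSize = 256;             // number of bins: one per gray level
    float range[] = {0.0f, 256.0f}; // intensity range [0, 256)
    const float *ranges[] = {range};
    int channels[] = {0};
    cv::calcHist(&gray, 1, channels, cv::Mat(), hist, 1, &histSize, ranges);

    // Histogram equalization (internally uses the CDF of the normalized histogram)
    cv::equalizeHist(gray, equalized);
}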
Sometimes we would like to specify a specific histogram that we wish the pro-
cessed image to have. This method is called histogram specification.
Given the input gray levels r and the specified PDF pz (z) that we wish the out-
put image to have, we can compute the new gray levels z which are distributed
according to the PDF pz (z) using the following procedure:
2. We equalize pr (r) and round the resulting values sk to the closest integer
in the range [0, L-1]
Figure 4.28 Schema of process for computing the histogram specification
Example
First step we compute pr (r) and we normalize it and round the values.
Second step, given a specific PDF pz (z), we normalize it and round the values.
Now we need to create the map from s to z using zq = G−1(sk). Then we apply this map to the histogram-equalized image to obtain the histogram-specified image. For instance, s0 = 1 is mapped to z3, since G(z3) = 1. This means that every pixel whose value is 1 in the histogram-equalized image maps to the value 3 in the histogram-specified image.
Note that the result does not match the specified histogram exactly, but the trend of the specified histogram, moving the intensities toward the high end of the intensity range, is achieved.
Figure 4.32 The histogram-specified image doesn't match the specified histogram exactly
Comparison Histogram equalization and histogram specification
What we have seen so far was global: it was computed over the entire image. We can however enhance detail over small areas of an image (hence locally). Such a procedure works as follows:
• at each location we compute the histogram equalization or specification of the pixels in the neighbourhood (the submatrix)
Chapter 5
Spatial filtering
Filters are generally used for enhancement and for removing unnecessary components from an image. Spatial filtering defines a local operation.
By local operations we mean functions that edit a pixel of an image based on the value of the pixel itself and the values of its neighbouring pixels. Local operations are defined by means of a filter, aka kernel, which defines the neighbourhood involved in the computation. Each neighbourhood pixel is associated with a weight.
Figure 5.1 Schematic of a local operation. The new value of the red pixel (s) is computed as a function of the original value of the red pixel and its neighbours (the set of green highlighted pixels). In the rightmost image the filter is shown with the weights associated with each neighbour pixel.
Local operations are performed in the spatial domain of the image, that is, the space containing the pixels. For this reason we refer to local operations as spatial filtering, and the kernel/filter is called a spatial filter.
Based on how the filtering process is performed on the image, the filter can be linear or non-linear. For instance, if the filter evaluates a correlation/convolution it is linear; if it evaluates a min/max it is non-linear.
Figure 5.2 Spatial filter
Spatial linear filtering means that each output pixel is a linear combination of input pixels, expressed as a weighted summation. This operation is directly obtained by computing the correlation (we can also call it convolution) of each pixel of the image with the weights of the filter.
Assuming a filter of dimensions m × n, with m = 2a + 1 and n = 2b + 1, the correlation operation is defined by:
Figure 5.3
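For reference, the standard form of the correlation shown in figure 5.3 is:

g(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t) f(x + s, y + t)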
How do we compute the convolution with an even-sized or non-square kernel? We pad it in order to make it odd-sized and place the non-zero values in the right positions, considering that the middle point is the one that will be replaced.
For computing the correlation at a pixel placed at the border of the image, we instead pad the image with 0's on the adjacent sides.
Figure 5.5 Example of padding. The two kernels are derivative kernels; in particular, the 2x2 derivative kernel is called the Roberts filter
Correlation vs Convolution
In the CV context convolution and correlation are often used as synonyms: filters are usually symmetric, and this makes convolution and correlation equal.
The filters obtained by applying convolution are called convolutional filters.
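A sketch of how a linear filter can be applied in OpenCV (the kernel shown, a 3x3 normalized box filter, is illustrative):

#include <opencv2/imgproc.hpp>

void linearFilterDemo(const cv::Mat &src, cv::Mat &dst)
{
    // 3x3 normalized box kernel: the weights sum to 1, so the brightness is preserved
    cv::Mat kernel = cv::Mat::ones(3, 3, CV_32F) / 9.0f;
    // filter2D computes the correlation; for symmetric kernels
    // correlation and convolution coincide
    cv::filter2D(src, dst, -1, kernel);
}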
Figure 5.7 Comparison of the correlation and convolution formulas
The filter weights can change the image brightness. In particular, the brightness is unchanged if Σi wi = 1, where wi are the weights of the convolutional filter. If the sum is < 1 the brightness is decreased, and if it is greater than 1 the brightness is increased.
Smoothing (also called averaging) spatial filters are used to reduce sharp transitions in intensity, i.e. to blur, which results in denoising: indeed, noise often appears as sharp transitions in intensity. Smoothing is also used for reducing irrelevant detail in an image (by irrelevant we mean small pixel regions). Smoothing is often referred to as lowpass filtering, a term borrowed from frequency domain processing.
The size of the filter can be increased: the larger the filter size, the stronger the smoothing (blurring).
There are two types of linear smoothing filters:
• Average filter
• Gaussian filter
Average filter
An average filter consists of a simple normalized box filter. It simply takes the
average of all the pixels under the kernel area and replaces the central element.
A box kernel assigns the same weights to all the pixels.
Some common average filter kernels are reported in section 5.3.2, figure 5.26.
Gaussian filter
w(r) = G(r) = K e^(−r² / (2σ²))
where K is a constant and r is the Euclidean distance between the middle point p and a neighbour pixel q, r = ||p − q||. Hence the new value of the middle pixel is a weighted average of its neighbours, with weights given by G(r).
Figure 5.8 Left picture: kernel illustration. Right picture: distances from the
middle pixel for different odd sized kernel.
Gaussian filtering effects vary based on the size of the kernel and the standard deviation: the bigger the kernel size, the stronger the smoothing; the bigger the standard deviation, the stronger the smoothing. Anyway, we don't need a Gaussian kernel that is too big, since the values of a Gaussian function at a large distance from the mean are small enough to be ignored; hence nothing is gained by using very large Gaussian kernels.
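A sketch of the corresponding OpenCV calls (kernel size and σ are illustrative):

#include <opencv2/imgproc.hpp>

void smoothingDemo(const cv::Mat &src)
{
    cv::Mat boxSmoothed, gaussSmoothed;
    // Normalized box (average) filter with a 5x5 kernel
    cv::blur(src, boxSmoothed, cv::Size(5, 5));
    // Gaussian filter with a 5x5 kernel and sigma = 1.5
    cv::GaussianBlur(src, gaussSmoothed, cv::Size(5, 5), 1.5);
}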
Figure 5.9 Two averaging filters. The one at the left is a normalized box filter.
The right kernel is a Gaussian kernel.
Figure 5.11 Averaging filters example. Note how the blurring increases as the
window size increases.
Figure 5.12 Gaussian smoothing example. Note how the blurring increases as σ
increases.
A spatial filter kernel is a matrix; a separable kernel is one that can be expressed as the outer product of two vectors. For instance, a separable square filter of size n x n can be split into two filters: one of size n x 1 and one of size 1 x n.
This operation lets us speed up the computation.
Figure 5.13 Example of filter separated in a column vector and a row vector.
Sharpening filters produce the opposite result of the averaging filters: they enhance the transitions in intensity, aka the edges. Sharpening is often referred to as highpass filtering: high frequencies (which are responsible for fine details) are passed, while low frequencies are attenuated or rejected.
Derivative filters can implement a first or second derivative operator. We recall the mathematical definitions of the first and second derivative operators.
A first order derivative operator :
Figure 5.14 Example of computation of first and second order derivative on pixels.
The scan line represents an image in 1D.
• we compute the gradient for each pixel: we compute the two partial deriva-
tives. This is achieved by convolving the two kernels (one kernel for com-
puting fx and one for calculating fy ) with the image. Hence, we obtain two
images representing the two partial derivatives.
• we compute the magnitude image using the previous two images according to one of the two magnitude formulas: usually it is more convenient to approximate the squares and square root of the exact magnitude formula, M(x, y) = √(fx²(x, y) + fy²(x, y)), with the absolute values, M(x, y) = |fx(x, y)| + |fy(x, y)|. This is done for the sake of the computational cost.
The kernels used for computing the partial derivatives are built starting either from the direct formulas (reported in figure 5.15 with the corresponding kernels) or by means of different formulations that imply larger filters, such as the 2x2 Roberts and 3x3 Sobel kernels (figure 5.18).
Figure 5.15 Filters from the direct application of the formulas: 2 elements for each of the two vectors constituting the gradient
Figure 5.16 Sum up of the first order derivative. Isotropic means it is invariant to rotations.
Note that the square root and the absolute value are not linear operators.
3x3 filters are the most commonly used, since the direct-application filters (whose sizes are 2x1 and 1x2) are missing a centered element that defines a center of spatial symmetry. Moreover, larger filters allow a better noise reduction.
The larger filters come from two different formulations, called Roberts and Sobel. Roberts considers two pixels for computing each gradient component per pixel, while Sobel takes into account 6 pixels and gives more importance to the middle ones (they are multiplied by 2, indeed). The Roberts and Sobel filters are square filters and they are represented in figure 5.18 with their formulations.
The sum of the elements in the Sobel filters is 0, hence the resulting image is darker than the original one.
Figure 5.17 Sobel gradient computed from the left-most figure.
Figure 5.18 Roberts (2x2 matrices) and Sobel (3x3 matrices) filters. z5 indicates a generic pixel at position (x, y) of the image f(x, y)
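A sketch of the gradient-magnitude computation with the OpenCV Sobel operator (the depth and kernel-size choices are illustrative):

#include <opencv2/imgproc.hpp>

void sobelMagnitude(const cv::Mat &gray, cv::Mat &magnitude)
{
    cv::Mat fx, fy, absFx, absFy;
    // Partial derivatives with 3x3 Sobel kernels; CV_16S keeps the negative values
    cv::Sobel(gray, fx, CV_16S, 1, 0, 3);
    cv::Sobel(gray, fy, CV_16S, 0, 1, 3);
    cv::convertScaleAbs(fx, absFx);
    cv::convertScaleAbs(fy, absFy);
    // Approximate magnitude as a weighted sum of |fx| and |fy| (0.5 each to avoid saturation)
    cv::addWeighted(absFx, 0.5, absFy, 0.5, 0, magnitude);
}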
The second derivative operator is the Laplacian. Given an image f(x, y), the Laplacian is defined as (we write f instead of f(x, y)):

∇²f = ∂²f/∂x² + ∂²f/∂y² = f(x+1, y) + f(x−1, y) + f(x, y+1) + f(x, y−1) − 4·f(x, y)   (5.1)

where the partial second-order derivatives are:

∂²f/∂x² = f(x+1, y) + f(x−1, y) − 2·f(x, y)   (5.2)

∂²f/∂y² = f(x, y+1) + f(x, y−1) − 2·f(x, y)   (5.3)
Figure 5.19 Sum up of the second order derivative. Note that the laplacian is a
linear operator
The Laplacian is used for sharpening images. Given an image f(x, y), the sharpened image g(x, y) is obtained by the following equation involving the second derivative:

g(x, y) = f(x, y) + c · ∇²f(x, y)

where c = −1 if the center coefficient of the Laplacian filter is negative (we subtract the Laplacian image); otherwise c = 1.
Figure 5.20
Figure 5.21 Example laplacian computation. Top-left image is the original image.
The top-right image is the Laplacian image obtained with the first filter in the
figure 5.20. The bottom left image is the sharpened image using c = -1. The
bottom right image is the sharpened image using c = 1.
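A sketch of Laplacian sharpening in OpenCV (the depth handling and the kernel choice are illustrative):

#include <opencv2/imgproc.hpp>

void laplacianSharpen(const cv::Mat &gray, cv::Mat &sharpened)
{
    cv::Mat lap, gray32;
    // ksize = 1 uses the standard 3x3 kernel whose centre coefficient is -4 (negative)
    cv::Laplacian(gray, lap, CV_32F, 1);
    gray.convertTo(gray32, CV_32F);
    // Negative centre coefficient, so c = -1: g = f - Laplacian(f)
    cv::Mat sharp32 = gray32 - lap;
    sharp32.convertTo(sharpened, CV_8U); // convertTo saturates back to [0, 255]
}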
5.3 Non-linear filtering & Restoration filters
Non-linear filters are mainly used as restoration filters for image restoration, i.e. for denoising. However, smoothing filters can also be employed for noise reduction. A corrupted image may be restored using:
• A linear filter:
– Averaging filter
– Gaussian filter
• A non-linear filter:
– Median filter
– Bilateral filter
– Min or max filter (for salt and pepper noise)
– Alpha-Trimmed Mean Filter
– Adaptive filters
Image restoration is the operation of reducing the noise affecting an image. Indeed, every electrical signal is affected by noise. In particular, we can think of Gaussian noise, which is the typical noise affecting electrical signals.
Figure 5.23 Classic model of noise affecting signal acquisition. The noise here is
represented as an image of the same size as the input image.
In order to restore the original signal, we exploit a restoration filter, which provides the best estimate of the original image. To apply a proper restoration filter we would need to estimate the noise and then remove it. In practice, algorithm designers don't look at the image and try to estimate the noise: they directly use Gaussian filters for noise reduction, since Gaussian noise is the most common, as said before.
Noise models
There are different shapes of noise, and they are represented as PDFs. For simulation purposes, we represent a noise image by generating a matrix of random intensity values with a specified probability density function (this approach is not used for salt-and-pepper noise, though).
Figure 5.24 Most common PDF in image processing application. On axes are
represented intensity values. Rayleigh is pronounced ’reilai’.
We know that for a Gaussian random variable the probability that a value z lies in the range z̄ ± σ is approximately 0.68, while the probability that it lies in the range z̄ ± 2σ is about 0.95.
Let's analyse how the different noise models affect the test pattern shown in figure 5.25.
Figure 5.25 Test pattern
Figure 5.26 Different noise shapes applied to the test pattern. The gamma noise
causes a darker image since the gamma noise PDF has peaks closer to the dark
intensity.
Figure 5.27 Different noise shapes applied to the test pattern
Note that the histograms resulting from adding the noise to the image have the same shape as the PDF of the noise. Indeed, the histogram is by definition a PDF.
Salt and pepper noise resembles salt and pepper granules distributed randomly over the image, hence the name of this type of noise. Pepper means that some pixels are pushed towards dark values, while salt means the opposite.
We indicate with P, and call it noise density or just spatial density, the probability that a pixel is corrupted by salt or pepper noise; it is defined as P = Ps + Pp, where Ps and Pp are the probabilities that a pixel is corrupted by salt noise and pepper noise respectively. For instance, if an image is corrupted by salt-and-pepper noise with Ps = Pp = 0.25, it means that 1/4 of the pixels of the image are turned white and 1/4 black, i.e. completely outside their original values. In figure 5.27 (i) an example of salt-and-pepper noise is shown.
As a rule of thumb, a low (hence good) spatial density has Ps and Pp less than 0.2.
Given an image f, we replace the intensity value of each pixel with the value g(x, y) computed by means of an m×n averaging kernel, as reported in figure 5.28.
The median filter replaces the value of a pixel by the median value in a specific
pixel’s neighbourhood defined by the kernel size. How is the median defined? The
median is the middle number in a sorted, ascending or descending, list of numbers.
An n x n median filter takes a snapshot of the n x n area of the image overlapping the filter, computes the median of the values of the pixels surrounding the centre pixel, and replaces the centre pixel with the computed median.
Median filters discard the highest and lowest peaks that might come from salt-and-pepper noise, while an averaging filter will take those values into account when computing the new pixel intensity.
Figure 5.30 Median computation. The values in the median filter correspond to those of the original image. The median is 7, and it replaces the value 2. Note that the value 92 is due to salt-and-pepper noise with probability 0.1. Salt and pepper causes a lot of problems to averaging filters: 92 is rejected by the median but not by the average.
If we apply the same median filter multiple times, it does not remove the information content of the image (in the images below we don't see many differences among the results of multiple applications of the median filter). This is interesting, since the same does not hold for Gaussian smoothing: applying it several times destroys the image instead.
Figure 5.31 Median filters example about passing the same median filter more
than once
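In OpenCV the median filter is available as cv::medianBlur; a small sketch (the kernel size is illustrative):

#include <opencv2/imgproc.hpp>

void medianDemo(const cv::Mat &noisy, cv::Mat &denoised)
{
    // 3x3 median filter: effective against salt-and-pepper noise
    cv::medianBlur(noisy, denoised, 3);
}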
5.3.4 Bilateral filter
The bilateral filter is a non-linear spatial filter, highly effective in noise removal while keeping the edges sharp. The output image is smoothed everywhere but at the edges. It is more computationally demanding than the other filters. This is due to the fact that bilateral filters exploit two Gaussian kernels:
• a Gaussian kernel in the spatial domain, a function of the distance between pixels;
• a Gaussian kernel in the intensity domain, a function of the intensity difference between pixels.
Thus, the Gaussian function of space makes sure that only nearby pixels are
considered for blurring, while the Gaussian function of intensity difference makes
sure that only those pixels with similar intensities to the central pixel are considered
for blurring. So it preserves the edges since pixels at edges will have large intensity
variation. The bilateral filter is described by the following equation:
Figure 5.32 Bilateral kernel. p is the position of the middle pixel with intensity I(p), while q is a neighbour pixel with intensity I(q). In(p) is the new intensity value of the middle pixel.
Figure 5.34 The bilateral filter smooths an input image while preserving its edges. Each pixel is replaced by a weighted average of its neighbours. Each neighbour is weighted by a spatial component that penalizes distant pixels and a range component that penalizes pixels with a different intensity. The combination of both components ensures that only nearby similar pixels contribute to the final result. The weights are represented for the central pixel (under the arrow).
Figure 5.35 If we increase σr too much, we end up with just a normal Gaussian filter effect.
Changing σs preserves the edges but makes some regions smoother.
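The corresponding OpenCV call, sketched with illustrative parameter values:

#include <opencv2/imgproc.hpp>

void bilateralDemo(const cv::Mat &src, cv::Mat &dst)
{
    // d: neighbourhood diameter; sigmaColor: range (intensity) sigma; sigmaSpace: spatial sigma
    cv::bilateralFilter(src, dst, 9, 75.0, 75.0);
}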
Bilateral vs Median
The bilateral filter reduces more noise than the median filter. If we increase the median kernel size to remove more noise, we also start to lose the edges.
Figure 5.36
Max and min filters are non-linear filters used to manage salt-and-pepper noise. They simply select the min or max pixel in the neighbourhood:
• Max filter: highlights salt noise and removes pepper noise. The max filter selects the maximum-intensity pixel in the neighbourhood.
• Min filter: highlights pepper noise and removes salt noise. The min filter selects the minimum-intensity pixel in the neighbourhood.
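In OpenCV, min and max filtering over a rectangular neighbourhood correspond to the grayscale morphological operations erode and dilate; a sketch:

#include <opencv2/imgproc.hpp>

void minMaxDemo(const cv::Mat &src, cv::Mat &minFiltered, cv::Mat &maxFiltered)
{
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
    cv::erode(src, minFiltered, kernel);  // min filter: removes salt (white) noise
    cv::dilate(src, maxFiltered, kernel); // max filter: removes pepper (black) noise
}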
Figure 5.37 Max and min filters example
Alpha-trimmed mean filters are non-linear filters. Basically, it is a mean filter with a different formulation, the alpha-trimmed mean: the d/2 lowest and d/2 highest intensity values in the neighbourhood are discarded, and the remaining pixels are averaged.
Figure 5.40 Alpha-trimmed mean example. The value 2 of the centre pixel is replaced with the intensity value 7 (6.8 rounded up)
Figure 5.41 Comparison of filters. The 5x5 Alpha-trimmed mean filter is computed
with d=5
5.3.7 Adaptive filters
Adaptive filters are non-linear filters. An adaptive filter tunes its effect based on the local image variance (indicated as σ²L) of the neighbourhood of the image centered on the pixel (x, y). In particular, the effect of such a filter depends on the comparison between the local image variance σ²L and the variance of the noise corrupting the image, indicated as σ²η, which is assumed known a priori. We distinguish the following cases:
• σ²η ≪ σ²L: the filter is weak. The new pixel value is close to the old noisy value, so the new image resembles the original one. In this case we have a high local variance, which is typically associated with edges, which are hence preserved.
• σ²η ≈ σ²L: the filter effect is strong, and in this case it returns the arithmetic mean of the pixels in the neighbourhood. Such a filter is hence called an ADAPTIVE MEAN FILTER.
Figure 5.42 Adaptive mean filter. In this case σ²η = 1000 ≈ σ²L. We note that image d is much sharper than the other ones.
Median filters perform well with a low spatial density: Ps and Pp less than 0.2. For higher spatial densities the adaptive median filter performs better.
The adaptive median filter is more tricky than the adaptive mean filter, since the size of the neighbourhood is changed at run time depending on certain conditions (more info at p. 338 of the Digital Image Processing book).
Usually adaptive filters are better than the corresponding non-adaptive filters. However, there are also cases where adaptive filters are slightly worse: when this happens, it affects only some tiny part of the image, and the overall image still looks good. Adaptive filters anyway have a bigger computational cost, due to their bigger complexity.
We don't consider only the single RGB channels: there are other colour encodings as well -> we didn't go into detail in the course.
Chapter 6
Filtering in frequency
Figure 6.2 Fourier transforms and inverse transforms in the continuous and discrete domains
6.1.1 Example of Rect-sinc transform
Figure 6.4 3D transform of a 3D rect. Note that we denote by u and v the axes in the frequency domain.
The DFT of an image can be expressed in polar form. This representation allows us to distinguish the magnitude, which is called the Spectrum, from the phase.
Figure 6.5 Polar form. R and I indicate the real and imaginary parts.
Spectrum and phase encode information. The phase encodes more significant information than the spectrum: the phase carries the shape features of the image, while the spectrum carries the intensity. In the figure we show the reconstruction of the image (inverse transform) using only the phase (hence setting the spectrum to 1), using only the spectrum (hence setting the phase to 0), and reconstructing the image using the phase and spectrum of the rect.
A shift is typically needed to align the Fourier spectrum with the image representation space. We recall that, by our convention, the origin in both the spatial and frequency domain of an image is the top left corner. Shifting the Fourier transform moves the center of the transformed image.
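A sketch of how the centred log-spectrum can be computed in OpenCV (the quadrant swap is the usual trick; the details below are illustrative):

#include <opencv2/core.hpp>

// Compute the log-magnitude spectrum of a grayscale image, shifted so that
// the zero frequency ends up in the centre.
cv::Mat logSpectrum(const cv::Mat &gray)
{
    cv::Mat planes[] = {cv::Mat_<float>(gray), cv::Mat::zeros(gray.size(), CV_32F)};
    cv::Mat complexImg, mag;
    cv::merge(planes, 2, complexImg);
    cv::dft(complexImg, complexImg);          // forward DFT
    cv::split(complexImg, planes);
    cv::magnitude(planes[0], planes[1], mag); // spectrum = sqrt(Re^2 + Im^2)
    mag += cv::Scalar::all(1);
    cv::log(mag, mag);                        // log scale for visualization

    // Shift: swap diagonally opposite quadrants to centre the zero frequency
    mag = mag(cv::Rect(0, 0, mag.cols & -2, mag.rows & -2)); // even size for the swap
    int cx = mag.cols / 2, cy = mag.rows / 2;
    cv::Mat q0(mag, cv::Rect(0, 0, cx, cy)), q1(mag, cv::Rect(cx, 0, cx, cy));
    cv::Mat q2(mag, cv::Rect(0, cy, cx, cy)), q3(mag, cv::Rect(cx, cy, cx, cy));
    cv::Mat tmp;
    q0.copyTo(tmp); q3.copyTo(q0); tmp.copyTo(q3);
    q1.copyTo(tmp); q2.copyTo(q1); tmp.copyTo(q2);
    return mag;
}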
88
Figure 6.7 Spectrum shifting example in 1D and 2D.
89
6.1.4 Rotation and translation
Translation does not affect the Fourier spectrum (it only changes the phase), while rotation rotates the transform by the same angle.
6.2 Sampling
$$F_X = \frac{1}{\Delta X} > 2\,u_{max}, \qquad F_Y = \frac{1}{\Delta Y} > 2\,v_{max}$$
where umax and vmax are the maximum spatial frequencies of the signal along X
and Y respectively. In particular, u and v are the name of the axes in the frequency
domain (Figure 6.10).
90
Figure 6.10 Over and under sampling. Ideally we should get the first picture. The
second one presents aliasing.
Figure 6.11 Here we show a sampling grid partially overlapping three 2-D signals
(regions of an image) of low, mid, and high spatial frequencies. Note that the level
of spatial “detail” in the regions is proportional to frequency (i.e., higher-frequency
signals contain more bars). As expected, all three digitized regions exhibit alias-
ing to some degree, but the effects are dramatically different, worsening as the
discrepancy between detail (frequency) and sampling rate increases.
We can mitigate the aliasing by applying a Low Pass filter before sampling. An
LPF filter is a smoothing filter. In the figure 6.12 the middle image is the sample of
the first one. We can see aliasing by looking at the trousers. The last one instead
doesn’t present aliasing and it has been obtained by applying a 5x5 averaging filter
before sampling.
91
Figure 6.12 Aliasing example. The aliasing is present on the trousers and scarf in
the second image. The last one is the result of applying a LPF filter.
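As a minimal OpenCV sketch of this idea (file names and the downsampling factor are illustrative, not taken from the notes), smoothing with a 5x5 averaging filter before decimation reduces the aliasing:

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE); // hypothetical file
    if (img.empty()) return 1;

    // Naive sampling: keep one pixel out of four in each direction (aliasing likely).
    cv::Mat naive;
    cv::resize(img, naive, cv::Size(), 0.25, 0.25, cv::INTER_NEAREST);

    // Anti-aliased sampling: apply a 5x5 averaging (low pass) filter first.
    cv::Mat smoothed, antialiased;
    cv::blur(img, smoothed, cv::Size(5, 5));
    cv::resize(smoothed, antialiased, cv::Size(), 0.25, 0.25, cv::INTER_NEAREST);

    cv::imwrite("naive.png", naive);
    cv::imwrite("antialiased.png", antialiased);
    return 0;
}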
In figure 6.13 we instead show that aliasing can create results that are visually misleading: the image seems fine but it actually contains aliasing. The figure represents the results obtained by sampling checkerboards of different square sizes with an imaging system of 96x96 pixels of unit width and length. We notice that sampling a checkerboard whose square size is less than 1 pixel results in undersampling, which in turn results in aliasing.
92
6.3 Moiré effect
93
Filtering in the frequency domain consists in multiplying the Fourier transform of the image by the Fourier transform of the kernel. In that sense, filtering by convolving in the spatial domain is equivalent to filtering by multiplying in the frequency domain.
The ideal low pass filter passes without attenuation all frequencies within a circle of radius D0 from the center and cuts off all the frequencies outside the circle. The zero frequency is in the middle of the figure.
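For reference, the ideal low pass filter transfer function (the formula itself is missing from the extracted notes, but this is the standard definition) is:
$$H(u, v) = \begin{cases} 1 & D(u, v) \le D_0 \\ 0 & D(u, v) > D_0 \end{cases}$$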
where D(u,v) represents the distance between a point (u,v) in the frequency do-
main and the center of the PxQ frequency filter. D0 is called the cutoff frequency.
94
Ringing effect
The drawback of the ideal low pass filter is a ringing effect that occurs along the edges of the filtered spatial domain image. This happens in two scenarios:
95
Figure 6.21 Ringing example. The ringing effect is evident from image c to e. The
ringing becomes finer in texture as the amount of high frequency content removed
decreases.
The Butterworth low pass filter (BLPF) is tuned by means of the order n. Large values of n make this filter similar to the ideal low pass filter: the ringing phenomenon is not present for low values of n, and it becomes more evident as we increase it. For values around 20 we get a BLPF practically equal to the ILPF.
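For reference, the standard Butterworth low pass transfer function of order n (not shown in the extracted notes) is:
$$H(u, v) = \frac{1}{1 + \left[ D(u, v) / D_0 \right]^{2n}}$$
As n grows, the transition around $D_0$ becomes sharper and the filter approaches the ideal low pass filter.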
Figure 6.24 Ringing effect grows with the order of the filter
Figure 6.25 Butterworth ringing effect. The order of the filter is fixed to 2 while the cutoff frequency changes.
97
6.5.3 Gaussian low pass filter - GLPF
• Filling gaps
98
Figure 6.28 filling gap example2
• Beautification
• reducing noise
99
6.5.4 Sum up Low pass filtering
100
6.6 Highpass filters - sharpening
Features of LP and HP filters such as ringing are similar for a given type of filter.
101
Figure 6.34 Ringing effect on spatial domain
102
6.6.2 Butterworth Highpass
103
6.6.4 Relation frequency and spatial filtering
The spatial filter can be obtained by computing the inverse transform of the frequency domain filter.
Noise can affect specific regions of the frequency domain; we use selective filters to process only those regions. For all the selective filters, the ideal, Butterworth and Gaussian definitions are available.
104
6.7.1 Bandreject and bandpass filters
105
6.7.2 Notch filter
Figure 6.43 Notch filters example. The figure reports the Butterworth notch filter transfer function formula applied to the image.
106
Chapter 7
Mid level
Computer vision systems often start with an image enhancement module, which corresponds to low level processing and which we discussed in the previous chapters. We now move to mid level algorithms, which are based on more complex models: line and edge detection.
We have already seen the derivative filters, which highlight the edge pixels. They can be considered edge detectors even if the filtered images lack the concept of edge: they just produce an output for an input. We want to build something that captures the concept of an edge. Edges are caused by a variety of factors:
107
Figure 7.2 Edge factors
Line detection is based on the derivatives: the gradient (in particular we recall the Sobel filter) and the Laplacian.
The first order derivative highlights thick edges, while the second order derivative highlights thin edges.
Moreover, looking at the derivative graph, we can see that the second derivative is positive at the beginning of the ramp, negative at the end of the ramp, and zero at points along the ramp and at points of constant intensity. The sign of the second derivative indicates the transition from light to dark. We can conclude that the magnitude of the first derivative can be used to detect the presence of an edge at a point in an image: it is 0 where the intensity is constant and constant along a ramp. The sign of the second derivative can instead be used to determine whether an edge pixel lies on the dark or on the light side of an edge, and its zero crossing can be used for locating the center of thick edges.
109
7.1.3 Derivatives and noise
Derivatives amplify noise. In particular the 2nd derivative is noisier than the 1st.
110
7.1.4 Gradient edge detector (1st derivative)
Figure 7.9 Gradient and edge direction. Note the edge direction is orthogonal to
the gradient direction.
Note that for the computation of the gradient magnitude the formulation involving the absolute values (|gx| + |gy|) is usually exploited, since the one involving the square root is more computationally demanding.
111
Figure 7.10 Gradient magnitude computed using the Sobel mask
Figure 7.11 Gradient phase. In the sky we get noise because there is no variation, so the values of gx and gy are essentially random.
112
Gradient masks
We can use the following masks (derivative kernels) for detecting/highlighting the edges along the four directions: all four Sobel or Prewitt kernels are convolved with the input image. This highlights all the edges along those 4 directions. Anyway, the Sobel kernel is the most used since it gives a good noise reduction result. The Prewitt kernel is a form of the Sobel filter where the center element is not multiplied by 2.
We can also decide to convolve the image with only some of the filters (or just one) in order to highlight edges only along some specific direction. It is important to mention that using a specific oriented kernel doesn't mean that edges with a different orientation from the one specified by the kernel are not highlighted at all: they are simply highlighted more weakly than those oriented in the direction of the kernel (see figure 7.13).
We recall that the first order derivative is not isotropic (invariant to rotation); the Prewitt and Sobel kernels are isotropic only with respect to the vertical and horizontal directions. This means that they exhibit their strongest response predominantly for vertical and horizontal edges, but they can still be used for detecting diagonal edges.
The book reports Kirsch's kernels, which are useful for highlighting edges along 8 directions: Kirsch's method takes a single kernel mask and rotates it in 45 degree increments through all 8 compass directions: N, NW, W, SW, S, SE, E, and NE.
113
Figure 7.13 Result of detecting diagonal edges by applying the Sobel ±45° kernels
$$g(x, y) = \nabla^2 f(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4f(x, y)$$
where
$$\frac{\partial^2 f}{\partial x^2} = f(x+1, y) + f(x-1, y) - 2f(x, y), \qquad \frac{\partial^2 f}{\partial y^2} = f(x, y+1) + f(x, y-1) - 2f(x, y)$$
The Laplacian edge detector is used together with thresholding: we threshold the Laplacian image g(x, y) (after smoothing it with a smoothing kernel).
Moreover, the Laplacian kernels are usually isotropic: their response is invariant to rotation, that is, it is independent of the orientation of the edges. Anyway, sometimes we would like to use the Laplacian anisotropically in order to detect edges along a specific direction; the preferred direction of each kernel is then weighted with a larger coefficient, such as 2. The Laplacian kernels used for edge detection are shown in figure 7.15.
We actually use the second order derivative especially for line detection rather than edge detection, since it has a stronger response to fine details (including noise) such as thin lines. For edge detection we instead prefer the first order derivative, since it produces thicker lines, hence edges.
114
Figure 7.14 Example of isotropic laplacian kernel
Moreover the Laplacian filter produces the double-line effect: the detected edges
are represented with two lines.
115
Figure 7.17 Double-line effect example
Derivative filters provide a set of tools for detecting edges along the x, y and diagonal directions, and for estimating their direction. Such tools are the building blocks for mid-level edge detection algorithms: a combination of several filterings and processings applied in a given order.
For finding edges we hence combine filters to create an edge detection algorithm. We follow a specific order of steps (see also the sketch after this list):
1. We first apply a smoothing filter for noise removal: this leads to a more selective edge detection.
2. We compute the gradient using the derivative kernel(s) along the orientations of the edges we would like to highlight. We prefer first order derivative kernels since they produce thicker lines than the second order derivatives.
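A minimal OpenCV sketch of these steps (assumed file name and threshold value are only illustrative; the notes do not prescribe specific parameters):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("building.png", cv::IMREAD_GRAYSCALE); // hypothetical file
    if (img.empty()) return 1;

    // 1. Smoothing for noise removal.
    cv::Mat smooth;
    cv::GaussianBlur(img, smooth, cv::Size(5, 5), 1.0);

    // 2. Gradient with the Sobel kernels.
    cv::Mat gx, gy;
    cv::Sobel(smooth, gx, CV_32F, 1, 0);
    cv::Sobel(smooth, gy, CV_32F, 0, 1);

    // Gradient magnitude approximated by |gx| + |gy|.
    cv::Mat mag = cv::abs(gx) + cv::abs(gy);

    // 3. Thresholding to keep only the strongest edge pixels.
    cv::Mat mag8u, edges;
    cv::normalize(mag, mag8u, 0, 255, cv::NORM_MINMAX, CV_8U);
    cv::threshold(mag8u, edges, 80, 255, cv::THRESH_BINARY); // threshold is arbitrary

    cv::imwrite("edges.png", edges);
    return 0;
}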
116
7.2.1 Example of application of the the edge detector algo
Figure 7.19 Thresholding. Notice that smoothing the image prior to the gradient computation allows us to maintain more connectivity of the edges (look at the edge of the roof).
117
7.3 The Canny edge detector
The Canny algorithm (1986) is a famous edge detector addressing the following targets:
• edge points well localized: the edges found should be as close to the true edges as possible.
• single edge point response: it should return only one point for each true edge point.
4. non-maxima suppression
5. hysteresis thresholding
7.3.1 Smoothing
Edges are calculated using vertical, horizontal and diagonal masks. Edge direction
is also computed.
118
Figure 7.21 Result of computing the gradient. We highlight the angle of the
orientation of the mask that provides that contribution in finding the edge.
Edges are oriented and we group them based on their orientation. In particular we group them into 4 groups representing the 4 main orientations: horizontal, vertical, +45 and -45. We recall that we look at the edge normal (the gradient direction) to get the edge orientation, which is orthogonal to it. For instance, if the edge normal is in the range of directions from -22.5 to 22.5 or from -157.5 to 157.5, we flag the edge as a horizontal edge.
This step is needed in order to carry out the next step.
edge orientation for a 3x3 region: horizontal, vertical, -45 and +45; the suppression process is:
• if the magnitude of the centered pixel is less than that of at least one of its two neighbors along dk, the gradient at the centered pixel is set to 0 (suppressed); otherwise it is kept.
In this way, across each edge only the pixel that is a local maximum along the edge normal survives (is not set to 0).
We threshold the output of the previous step. This step is meant to reduce false
edge points:
For strong and weak edges we mean its strength, which is defined by the gradient
magnitude.
The threshold is a double threshold: a low threshold TL and a high threshold TH .
We can visualize the thresholding as creating two images: IH , where the input
image has only pixels that are ≥ TH ; and IL , whose pixels are ≥ TL . Hence, all
non zero pixels in IH will be contained also in IL . Hence, we eliminate from IL
all the non zero pixels from IH :
Then, since in IH we might have some pixel gaps (a border with empty pixel)
we use the following procedure for filling up those gaps and so forming the edge
pixels:
• For every pixel in the image IH : we mark as valid edge pixels all neighbor
pixels in the image IL .
120
Figure 7.23 Initial point with TH = 1 (in IH survive only 2 and 3) and TL = 1
Figure 7.24 After applying step 1: we add to IH all the pixels in IL in the
neighborhood of the pixels of IH
121
7.3.6 Example of canny algo
122
Figure 7.27 Canny algo output for different value of σ
The value of σ should be calibrated based on the threshold: the σ and the threshold
should grow inversely; if one is high, the other one should be low.
We present a comparison among a gradient image, the Canny edge detector and the Marr-Hildreth edge detector, to highlight that an edge detector is more than a derivative filter. We can see how clean the output of the Canny edge detector is.
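In OpenCV the whole Canny chain (gradient, non-maxima suppression and hysteresis thresholding) is available as a single call; a minimal sketch, with illustrative file name and thresholds:

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE); // hypothetical file
    if (img.empty()) return 1;

    cv::Mat blurred, edges;
    // Gaussian smoothing, as in the first step of the algorithm.
    cv::GaussianBlur(img, blurred, cv::Size(5, 5), 1.4);
    // Low (TL) and high (TH) hysteresis thresholds; the values are illustrative.
    cv::Canny(blurred, edges, 50, 150);

    cv::imwrite("canny.png", edges);
    return 0;
}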
123
Figure 7.29 Example comparison of edge detectors
124
7.5 Line detection
A possible approach is the following, but it takes large computation time (O(n3 )).
2. For each couple of edge points, we evaluate the line passing through them
3. For each line, we count how many edge points are laying on the considered
line (or we can also consider how many edge points are close to that line so
that to have some degree of tolerance)
Then we move from the xy-plane to the ab-plane, which is called the parameter space since 'a' now represents our unknown parameter:
b = −xa + y
For each point in the xy-plane we hence consider the group of lines passing through it. The line passing through two points is then given by the intersection of the two groups of lines passing through the two corresponding points. In the ab-plane the same line is given by the point (a,b) where two curves intersect: given two points (x1, y1) and (x2, y2), the line passing through them can be expressed as y1 = a'x1 + b' or y2 = a'x2 + b', where (a', b') is the point in the ab-plane where the line of the group of lines of (x1, y1) intersects the line of the group of lines of (x2, y2).
125
Figure 7.31 Intersection of two lines in the xy-plane and the corresponding result
in the ab-plane.
We highlighted in green the vertical line in the group of lines of the two points.
Normal representation
126
Figure 7.33 Example of the normal representation of a group of lines passing through a point (x,y). For each point (x,y) we map the lines passing through it using equation 7.1 ($x\cos\theta + y\sin\theta = \rho$). For point number 1, which corresponds to (0,0), we obtain a horizontal line since $0\cdot\cos\theta + 0\cdot\sin\theta = 0 \;\forall\theta$. The points A and B mark intersections of the curves. Point B marks the intersection of the curves corresponding to points 2, 3 and 4 in the xy image plane: hence points 2, 3 and 4 lie on a straight line oriented at 45° and at distance ρ = 71 from the origin. Point A instead indicates that the corresponding points lie on a straight line passing through the origin (ρ = 0).
127
Figure 7.34 Example of the normal representation of an edge image. Each white pixel is represented as a sinusoidal curve. Where the sinusoidal curves intersect, the corresponding points in the image plane are aligned, hence there is a line. In the example we have 8 white intersections.
128
Figure 7.36 Quantized ρθ plane and illustration of the procedure for counting the number of points that lie on a line.
In the following example we can see that the Canny edge detector alone outputs a lot of lines.
Figure 7.37 Canny edge detector vs Hough transform line detector. We highlighted the lines of interest with the blue boxes in the ρθ plane: the boxes have been placed by considering the cells with the highest counts corresponding to θ = ±90, since the road to highlight has long lines oriented at 90 degrees (there is no distinction between +90 and −90, that's why we consider both).
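A minimal OpenCV sketch of line detection with the Hough transform in the ρθ parameter space (file name, vote threshold and drawing length are illustrative assumptions):

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

int main() {
    cv::Mat img = cv::imread("road.png", cv::IMREAD_GRAYSCALE); // hypothetical file
    if (img.empty()) return 1;

    cv::Mat edges;
    cv::Canny(img, edges, 50, 150);

    // Accumulator resolution: 1 pixel for rho, 1 degree for theta.
    // 150 is the minimum number of votes (edge points on the line).
    std::vector<cv::Vec2f> lines;
    cv::HoughLines(edges, lines, 1, CV_PI / 180, 150);

    // Draw each detected (rho, theta) line on a color copy of the input.
    cv::Mat out;
    cv::cvtColor(img, out, cv::COLOR_GRAY2BGR);
    for (const cv::Vec2f& l : lines) {
        float rho = l[0], theta = l[1];
        double a = std::cos(theta), b = std::sin(theta);
        cv::Point2d p0(a * rho, b * rho);
        cv::Point p1(cvRound(p0.x + 2000 * (-b)), cvRound(p0.y + 2000 * a));
        cv::Point p2(cvRound(p0.x - 2000 * (-b)), cvRound(p0.y - 2000 * a));
        cv::line(out, p1, p2, cv::Scalar(255, 0, 0), 2);
    }
    cv::imwrite("hough_lines.png", out);
    return 0;
}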
129
7.5.4 Generalized Hough transform - Circle detection
The Hough transform works for more complex shapes, not only for lines. Generally we can extend it to any function of the form:
$$g(v, c) = 0$$
For instance, for circles:
$$(x - c_1)^2 + (y - c_2)^2 = c_3^2 \tag{7.3}$$
Figure 7.38 Circle 2D detection using hough transform. We fixed the radius and
found the circle passing through the 4 points. The color bar indicates the value
of the accumulator cells.
130
Figure 7.39 Multiple circles 2D detection using hough transform. We fixed the
radius.Note that we want to find 3 circles and so we look for the three accumulator
cells with the highest value in the parameter space.
131
Chapter 8
Morphological operators
Morphological operators are low-level operators that operate on the shapes found in an image. They generally work on binary images but can be extended to grayscale images.
The morphological operations are based on mathematical set theory (meaning that we use the notation of set theory for defining the morphological operators). A set is made of tuples:
• in a grayscale image the tuples are triples (i,j,k) where (i,j) are the coordinates of a pixel of a foreground object and k is the intensity value.
The two main morphological operators are erosion and dilation, together with their combinations: mainly opening and closing.
8.1 Erosion
$$A \ominus B = \{z \mid (B)_z \subseteq A\}$$
132
Figure 8.1 Some structuring elements
Figure 8.2 Thinning object example. We can see how the final shape is similar to
the initial one but thinner.
133
• separate weakly connected objects
The shape of the structuring element determines in which direction the erosion
operates.
Figure 8.4 Example illustrating different results for different structuring elements.
134
8.2 Dilation
$$A \oplus B = \{z \mid (B)_z \cap A \neq \emptyset\}$$
135
Figure 8.6 Example of merging unconnected components
8.3.1 Opening
$$A \circ B = (A \ominus B) \oplus B$$
It causes:
• contour smoothing
8.3.2 Closing
$$A \bullet B = (A \oplus B) \ominus B$$
It causes:
136
8.3.3 Comparison opening vs erosion and closing vs dilation
137
8.3.4 More combinations
We can create more combinations using erosion and dilation. Complex combinations may effectively remove noise.
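A minimal OpenCV sketch of erosion, dilation, opening and closing on a binary image (the input file and the 3x3 structuring element are illustrative choices):

#include <opencv2/opencv.hpp>

int main() {
    // Hypothetical binary input image (white objects on black background).
    cv::Mat bin = cv::imread("binary.png", cv::IMREAD_GRAYSCALE);
    if (bin.empty()) return 1;

    // 3x3 rectangular structuring element B.
    cv::Mat B = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));

    cv::Mat eroded, dilated, opened, closed;
    cv::erode(bin, eroded, B);    // erosion: thins objects, removes small details
    cv::dilate(bin, dilated, B);  // dilation: thickens objects, fills small gaps
    cv::morphologyEx(bin, opened, cv::MORPH_OPEN, B);   // erosion followed by dilation
    cv::morphologyEx(bin, closed, cv::MORPH_CLOSE, B);  // dilation followed by erosion

    cv::imwrite("opened.png", opened);
    cv::imwrite("closed.png", closed);
    return 0;
}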
138
Chapter 9
Segmentation
• $\bigcup_{i=1}^{n} R_i = R$
• $R_i \cap R_j = \emptyset \quad \forall i, j \; (i \neq j)$
139
Segmentation result depends on how many categories we choose. In the example
below, we can see that starting from the same image, we can recognize only the
person or the person, the face, their hats and their jackets. For the penguin
example, we can distinguish only the penguin or the penguin and the floor.
• Watershed transformation
• Clustering-based methods
• Model-based segmentation
• Edge-based methods
• Multi-scale segmentation
• Many others…
We cover only the first 4/5. The techniques differ from each other for how we
define similarity and the geometry used for defining regions.
140
9.0.1 Application
One way to extract the objects from the background is to select a threshold T on
the histogram, such that any point of the image at which f (x, y) > T corresponds
to the pixels of the object; and f (x, y) ≤ T corresponds to the background. In
other words the segmented image g(x,y) is obtained as:
$$g(x, y) = \begin{cases} 1 & f(x, y) > T \\ 0 & f(x, y) \leq T \end{cases}$$
where a, b, c are any intensity values. Segmentations requiring more than two thresholds are difficult to solve, and we use other methods such as region growing.
• LOCAL: when T is not constant but the value of T at any point (x,y) in an
image depends on properties of a neighborhood of (x,y) such as the average
intensity of the pixels in the neighborhood.
• Distance between the peaks: the further apart they are, the easier it is to separate the modes.
• Relative size of the objects compared to the background: small objects have a really low impact on the histogram.
• Non-uniform illumination of the object. The same effect can also be caused by the reflectance properties of the object.
142
Figure 9.6 Noise affecting thresholding
Figure 9.8 Small objects have really low impact on the histogram.
143
9.2 Otsu’s optimal threshold
It locates the optimal value of the threshold automatically based on the histogram
of the image.
It is a global thresholding method based on the histogram. It assumes that two
classes are created by thresholding. The Otsu algorithm determines the optimal
threshold by minimizing intra-class intensity variance, or equivalently, by maxi-
mizing inter-class variance. What does it mean intra and inter class variance?
Thinking in terms of international and intranational: the first is ‘across many
nations’, and the second ‘within one nation’. Otsu’s goal is to maximise the inter-
class differences, and minimise the intra-class differences. That is, the pixels in
each class must be as similar as possible, and those in different groups must be
as different as possible.
The steps of the algo are:
$$P_1(k) = \sum_{i=0}^{k} p_i \tag{9.1}$$
4. We compute the image global mean, i.e. the average intensity of the entire image $m_g$:
$$m_g = \sum_{i=0}^{L-1} i\, p_i$$
We also compute the cumulative mean up to level k:
$$m(k) = \sum_{i=0}^{k} i\, p_i$$
6. We compute the pixel average value in classes 1 and 2 for k = 0, 1, ..., L-1:
$$m_1(k) = \frac{1}{P_1(k)} \sum_{i=0}^{k} i\, p_i \tag{9.3}$$
$$m_2(k) = \frac{1}{P_2(k)} \sum_{i=k+1}^{L-1} i\, p_i \tag{9.4}$$
We compute the global variance:
$$\sigma_G^2 = \sum_{i=0}^{L-1} (i - m_g)^2\, p_i$$
and the between-class (inter-class) variance:
$$\sigma_B^2(k) = P_1 (m_1 - m_g)^2 + P_2 (m_2 - m_g)^2 \tag{9.5}$$
$$= P_1 P_2 (m_1 - m_2)^2 \tag{9.6}$$
$$= \frac{\big(m_g\, P_1(k) - m(k)\big)^2}{P_1(k)\,\big(1 - P_1(k)\big)} \tag{9.7}$$
The separability measure is
$$\eta(k) = \frac{\sigma_B^2(k)}{\sigma_G^2}$$
and the optimal threshold is the value $k^*$ such that
$$\sigma_B^2(k^*) = \max_{k}\, \sigma_B^2(k)$$
This procedure maximizes the inter-class variance of the data and at the same time minimizes the intra-class variance. Indeed, the global variance can be expressed as the sum of the two variances, which lets us express the inter-class variance in terms of the intra-class variance. The intra-class variance is defined as:
$$\sigma_{in}^2(k) = P_1(k)\,\sigma_1^2 + P_2(k)\,\sigma_2^2$$
$$\sigma_G^2 = \sigma_{in}^2(k) + \sigma_B^2(k)$$
hence:
$$\sigma_B^2(k) = \sigma_G^2 - \sigma_{in}^2(k)$$
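In OpenCV, Otsu's optimal threshold is computed directly by cv::threshold; a minimal sketch (the file name is illustrative):

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::Mat img = cv::imread("text.png", cv::IMREAD_GRAYSCALE); // hypothetical file
    if (img.empty()) return 1;

    cv::Mat binary;
    // The threshold argument (0 here) is ignored when THRESH_OTSU is set:
    // OpenCV computes k* on the histogram by maximizing the between-class variance.
    double kStar = cv::threshold(img, binary, 0, 255,
                                 cv::THRESH_BINARY | cv::THRESH_OTSU);
    std::cout << "Otsu threshold k* = " << kStar << std::endl;

    cv::imwrite("otsu.png", binary);
    return 0;
}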
145
Figure 9.9 Maximizing the inter variance means minimize the intra variance. The
goal here is to separate the text from the background.
146
Figure 9.11 Otsu example2
9.2.1 Smoothing
Smoothing the image can help Otsu's method. Indeed, smoothing prior to thresholding improves the shape of the histogram, better separating the modes.
If the object to separate from the background is too small, then it will not have enough impact on the histogram, and we still want to be able to separate the modes. In this case, we can exploit edge detection for improving the histogram and separating the modes. The idea is to consider only the pixels that lie on or near the edges.
147
Figure 9.13 Small object problem
The steps to use for combining edge detection and Otsu are:
• Compute the edge image of the input image as either the magnitude of the
gradient or the Laplacian
• Threshold the edge image using T to produce a binary (black 0 and white
1) image g(x,y).
• We mask the input image with the thresholded image g(x,y). In this way we take only the pixels that lie on or near the edges. This means we multiply g(x,y) by f(x,y), which results in:
$$z(x, y) = g(x, y)\, f(x, y) = \begin{cases} f(x, y) & g(x, y) = 1 \\ 0 & g(x, y) = 0 \end{cases} \tag{9.8}$$
148
Figure 9.14 Small object segmented using edge detection + Otsu
where mG is the global image mean. This method is useful for example when the
illumination is not uniform.
In the example below, in order to face non uniform illumination or reflectance, the
image is partitioned and the thresholding is operated on each partition. In each
partition, the illuminance and reflectance is supposed uniform.
When T depends on the spatial coordinates (x,y) themselves, then variable thresh-
olding is often referred to as dynamic or adaptive thresholding.
149
Figure 9.15 Image partitioning based thresholding. The iterative global algorithm
is the k-means.
$$\sigma_B^2 = \sum_{k=1}^{K} P_k\, (m_k - m_G)^2$$
where
$$P_k = \sum_{i \in c_k} p_i \tag{9.11}$$
$$m_k = \frac{1}{P_k} \sum_{i \in c_k} i\, p_i \tag{9.12}$$
The K classes are separated by K−1 thresholds whose optimal values $k_1^*, k_2^*, \ldots, k_{K-1}^*$ satisfy:
$$\sigma_B^2(k_1^*, k_2^*, \ldots, k_{K-1}^*) = \max_{k_1, k_2, \ldots, k_{K-1}} \sigma_B^2(k_1, k_2, \ldots, k_{K-1})$$
150
9.2.5 What is Percentile?
It a statistic score. The i-percentile is the smallest number that is greater than
i% of the numbers in a set. If n is the 99th percentile of an image, it means that
99% of the pixels have intensity below to n.
In the example below, it is represented how many people have the age in the
x-axes. The 14th percentile is 35: 14% of all the people ( total number of the
people is given by summing the height of the bins = 34) is younger than 35.
Indeed, 14% of 34 = about 5 people. Indeed, 5/34 * 100 = 14.7059.
The 95th percentile is computed as follows: 95% of 34=32.3 people. Hence for
finding the corresponding x value, we need to sum all the bins till we don’t exceed
32.3: 5+7+8+5+2+3=30 people which corresponds to age 55. Hence, 30/34
*100=88.2353, hence 88th percentile (88% of people are younger than 55.). Since
the last bin whose cumulative sum is 33, corresponds to the 97th percentile (33/34
*100=97.0588), the 95th percentile corresponds to an age between 50 and 54.
Figure 9.17 Percentile example. The number abova the bins are the percentige
corresponding to that value on the x-axes. The value inside the bins is their height
(y value).
151
9.3 Region growing methods
9.3.1 Connectivity
Let f(x,y) be the input image and let S(x,y) be the image called SEED ARRAY, containing 1's at the locations of the seed points and 0's elsewhere.
The steps are:
1. Determining the seed points: we threshold f(x,y) in order to find all the connected components, and then we erode each component down to a single pixel. These single pixels will be the seed points constituting S(x,y).
152
9.3.3 Example
We consider an X-ray image of welding and we would like to find the defects on
it. We apply the algo:
1. Step 1: we determine the seed points by noticing that the defects are the
brightest regions in the image. We then extract the seed points by thresh-
olding it using 99.9 percentile of intensity values in the image: this leads to
T = 254 which in turns implies that the seed points value will be 255.
Then we morphologically erode each connected component down to one pixel.
Note that we can also specify a dual thresholding technique for the similarity. In the example we try both the single and the dual thresholding technique, using T1 = 68 and T2 = 126 for the dual thresholding, while we used only T1 for the single thresholding. We can see that we cannot accomplish the goal using dual thresholding in this example. The T values correspond closely to the valleys of the histogram.
The final result does not correspond to the thresholded image alone, since the good region of the weld (the white one) failed the predicate, and the background points are excluded too since they are not 8-connected to the seeds.
153
Figure 9.18 Growing regions technique example
The fourth step of the region growing technique expresses the difference with respect to thresholding based segmentation. The region growing method labels every connected component, hence in the example below we can distinguish defect 1, defect 2 and so on, since the growing method returns a list of pixels for each labelled connected component.
The thresholding technique instead wouldn't differentiate the different defects, but would return just two classes: pixels below T and pixels above T. If we would like to get a list of pixels for every defect from this method, we need to scan the two-class image and check the connectivity.
154
9.4 Segmentation using Watersheds
• local minima
• Points at which a drop of water would fall to a single minimum. The points satisfying this condition are called the CATCHMENT BASIN or WATERSHED of that minimum.
• Points at which a drop of water could fall into two (or more) different minima. The points satisfying this condition are called WATERSHED LINES.
The goal of the watershed segmentation is to find the watershed lines. To achieve it, we imagine these lines as if they delimited dams, and to find them we flood the basins. Suppose that a hole is punched in each regional minimum and that the entire catchment basin is flooded from below by letting water rise through the holes at a uniform rate. When two catchment basins are about to merge (water would overflow from one basin into the other), we build a dam between them, taller than any pixel in the image (it has the highest pixel value). The points of the dams define the watershed lines, which are the desired segmentation boundaries.
Morphologically this procedure consists in sequentially computing dilation operations on the connected components, and the watershed lines are created as soon as two connected components would merge together.
155
Figure 9.20 Dam construction example. Note that the two little 'isles' are not selected, since a drop of water over there would fall into a single minimum, and hence the 'isle' points are part of the catchment basin.
156
Figure 9.21 Dam construction example with dilation
157
Figure 9.22 Watershed applied on gradient
For overcoming such issue we use markers, which drive the segmentation. Markers
are hence the set of pixels from where the flooding will start ( the seed points ).
Intuitively, think of markers as the pixels which we are sure of belonging to the
objects present in the image. Generally, the number of markers is equal to the
number of objects (classes) + 1 (for the background). Indeed, we define two kinds of markers:
• internal markers: they are associated with the objects of interest and hence individuate local minima
All the markers are represented with different pixel values (colors). Most of the
pixels in the marker image are initially 0 (black) which indicates an unknown
region and the watershed algorithm assigns each of these 0 pixels a class out
of the number of classes (n-1 for objects and 1 background). Intuitively, using
these markers we are telling the watershed algorithm to group points like these
together. So, the concept is simple. Start growing these markers (labeled with
different colors(labels)). When they are about to meet, construct barriers at those
locations. These barriers give us segmentation results.
Marker selection
When creating the marker image, we need to define its pixel as follows:
• 1: background
For finding the objects and hence the pixels to set to value > 1, we usually compute
their euclidean distance from the first zero (=black) pixel. We can accomplish
this by using the distanceTransform function, threshold it in order to extract the
furthest and nearest points, and then fill the area bounded by the contours of the
objects we found (these are basins). All the other pixels of the marker image are
instead set to 1 if they represent the background.
Then the watershed algorithm uses the marker to group similar points together
and it returns an image whose pixels have been set as follows:
159
• 1: the background as we defined
Example of markers
In the following example the criteria for defining the internal markers are:
• all the points in such region have the same intensity value
Then we smooth the image and apply the watershed algorithm under the restriction that the internal markers are the only allowed regional minima. This procedure defines the watershed lines, which are the external markers.
The external markers partition the image into regions, each containing an internal marker and part of the background. Hence, we then need to segment each region into two classes: background and object. For doing it we can use any segmentation technique; in the example here we use the watershed segmentation.
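A minimal OpenCV sketch of the marker-based watershed pipeline described above (distance transform, thresholding of the peaks, labelling of the internal markers, watershed). File name, the 0.5 peak threshold and the dilation used to define the unknown region are illustrative assumptions:

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("coins.png"); // hypothetical color image
    if (img.empty()) return 1;

    // Binary mask of the foreground objects (Otsu thresholding as a simple choice).
    cv::Mat gray, bin;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
    cv::threshold(gray, bin, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Euclidean distance of every foreground pixel from the nearest zero (black) pixel.
    cv::Mat dist;
    cv::distanceTransform(bin, dist, cv::DIST_L2, 3);
    cv::normalize(dist, dist, 0.0, 1.0, cv::NORM_MINMAX);

    // Keep only the points farthest from the background: these are the internal markers.
    cv::Mat peaks;
    cv::threshold(dist, peaks, 0.5, 1.0, cv::THRESH_BINARY);
    peaks.convertTo(peaks, CV_8U, 255);

    // Marker image as described in the notes: 1 = background, >1 = objects, 0 = unknown.
    cv::Mat markers;
    cv::connectedComponents(peaks, markers); // CV_32S: 0 outside the peaks, 1..n per object
    markers += 1;                            // background becomes 1, objects 2..n+1
    cv::Mat dilated, unknown;
    cv::dilate(bin, dilated, cv::Mat(), cv::Point(-1, -1), 3);
    unknown = dilated & ~peaks;              // pixels we are not sure about
    markers.setTo(0, unknown);

    // The watershed algorithm floods the unknown pixels from the markers and
    // writes -1 on the watershed (dam) lines.
    cv::watershed(img, markers);
    return 0;
}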
160
161
162
Any unauthorized distribution (digitally or on paper) is expressly prohibited
without prior authorization from the author of these notes
163
Chapter 10
Segmentation by clustering
For applying any clustering technique we need to represent each pixel with a
feature vector, which can be multi-dimensional e.g. for coloured images it can
be the values of the 3 channels R, G and B. The feature vectors thus form a feature space, and a single vector hence contains all the measurements that may be relevant to describe a pixel: spatial position (coordinates), intensity/brightness (grayscale images), color information (RGB/YUV/CIELAB) and so on.
A clustering algorithm for segmenting images uses this representation for grouping
similar pixels.
Clustering techniques often evaluate how similar two pixels are based on their
feature vector. This task is carried out using a distance function to compare the
vectors.
Some typical distance functions are the following where D is the dimension of the
feature vector.
164
Two basic approaches are used to cluster:
• Divisive clustering
• Agglomerative clustering
• K-means
• mean shift
• spectral clustering
10.3 k-means
165
The k clusters are disjoint sets C1 , C2 , .., CK and each cluster Ci has a centroid
µi .
The k-means objective function measures the distance between each data point
and the centroid of its cluster. The goal is to minimize the error made by approx-
imating the points with the center of the cluster it belongs to.
$$\min \sum_{i=1}^{k} \sum_{x \in C_i} d(x, \mu_i) \tag{10.1}$$
Figure 10.1 (a) is the original image, (b) is the feature space and (c) is the feature space after applying k-means
166
2. Associate each point to the closest centroid.
3. Recompute each centroid as the mean of the points in its cluster:
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
4. Repeat steps 2 and 3 until the centroids no longer move appreciably or we reach a maximum number of steps.
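A minimal OpenCV sketch of k-means segmentation on the pixel color vectors (K, the file name and the termination criteria are illustrative assumptions):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("landscape.png"); // hypothetical color image
    if (img.empty()) return 1;

    // One row per pixel, 3 float columns (B, G, R): the feature vectors.
    cv::Mat samples;
    img.convertTo(samples, CV_32F);
    samples = samples.reshape(1, img.rows * img.cols);

    // Run k-means with K clusters in the color feature space.
    int K = 4;
    cv::Mat labels, centers;
    cv::kmeans(samples, K, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 10, 1.0),
               3, cv::KMEANS_PP_CENTERS, centers);

    // Replace every pixel with the centroid of its cluster.
    cv::Mat segmented(img.size(), img.type());
    for (int i = 0; i < img.rows * img.cols; ++i) {
        int c = labels.at<int>(i);
        cv::Vec3b color(cv::saturate_cast<uchar>(centers.at<float>(c, 0)),
                        cv::saturate_cast<uchar>(centers.at<float>(c, 1)),
                        cv::saturate_cast<uchar>(centers.at<float>(c, 2)));
        segmented.at<cv::Vec3b>(i / img.cols, i % img.cols) = color;
    }
    cv::imwrite("kmeans_segmentation.png", segmented);
    return 0;
}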
10.3.2 Initialization
Cluster centroids need to be initialized and the final solution depends on the
initialization.
We can opt for two initialization methods:
Pro:
• fast to converge
Cons:
K-means image segmentation can be done on the histogram (faster) or the pixel
vectors (better result).
The distance measures can be:
167
• Intensity level difference (grayscale)
Figure 10.2 Example of k-means segmentation using grayscale intensity level dif-
ference and color channel difference.
Figure 10.3 Example of k-means segmentation using color channel difference with
different number of clusters.
Figure 10.4 We can segment only some object in the image not necessarily con-
nected. Anyway some clusters are meaningless as shown in the last image.
168
Figure 10.5 Example of k-means segmentation using color channel difference with
different number of clusters.
169
10.4 Beyond k-means
• it is non-optimal method
• look for the modes (that is the major peaks) of the density function
• identify regions of the input space that climb to the same peak: such regions
belong to the same region/cluster.
This leads to an intuition about how the density function could be used for seg-
mentation.
The problem is that we have pixels and not a distribution, so we need to find a function that is related to the pixel distribution. In order to estimate the density function starting from a given set of pixels, we use the non-parametric approach called the Parzen window technique, aka kernel density estimation. It basically convolves the samples with a given kernel of radius r: performing this convolution we are smoothing the data, which results in a PDF.
Figure 10.7 Example of density estimation using a gaussian kernel. The six in-
dividual kernels are the red dashed curves, the kernel density estimate the blue
curves. The data points are the rug plot on the horizontal axis.
After getting the PDF, it is important to find the peaks of the distribution, which
can be achieved by using mean shift.
170
The kernel is defined as:
$$K(x) = c_k\, k\!\left(\left|\frac{x}{r}\right|^2\right) = c_k\, k\!\left(\frac{|x|^2}{r^2}\right)$$
The last one is the most used. The Epanechnikov kernel has a parabolic shape.
Hence we compute the convolution with a given kernel of width r, which is ex-
pressed as the sum of a translated kernel for each sample in the dataset:
$$f(x) = \frac{1}{N r^n} \sum_{i=1}^{N} K(x - x_i) = \frac{c_k}{N r^n} \sum_{i=1}^{N} k\!\left(\frac{|x - x_i|^2}{r^2}\right)$$
where f(x) is the estimated PDF, $x_i$ are the input samples, k(·) is the kernel function (Parzen window) and n is the dimension of the vector x. We note that the kernel is defined everywhere and centered at $x_i$.
Figure 10.8 Density function estimation example. We start from the set of sample
and get the PDF. We used the mean shift for seeking the peaks.
171
We note that this procedure is quite computationally complex. It is possible but
might be inefficient.
CAN WE DO BETTER? Mean shift.
Mean shift is a non parametric technique for finding stationary points such as peaks
in a density function (we are connecting data processing with the computer vision.
That is kinda magic). The key idea is to find the peaks in the high dimensional
data distribution WITHOUT computing the distribution function explicitly.
Avoiding the explicit computation of the distribution function is possible because mean shift assumes that the input data points are sampled from an underlying density function (PDF): in other words, we consider the data points to be samples of a PDF. Indeed, if we needed to estimate the PDF explicitly, we would need to estimate its parameters, e.g.:
$$f(x) = \sum_j c_j\, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}$$
We can instead avoid estimating the PDF and estimate its gradient. In this way, mean shift becomes a steepest-ascent method: the modes are found by following the direction indicated by the derivative. Indeed, we are interested in finding the modes, so we just need to look for where the maxima are without reconstructing the whole shape of the PDF.
$$\nabla f(x) = \frac{1}{N r^n} \sum_{i=1}^{N} \nabla K(x - x_i)$$
recalling:
$$K(x - x_i) = c_k\, k\!\left(\frac{|x - x_i|^2}{r^2}\right)$$
then, by the chain rule:
$$\nabla f(x) = \frac{c_k}{N r^n} \sum_{i=1}^{N} k'\!\left(\frac{|x - x_i|^2}{r^2}\right) \nabla\!\left(\frac{|x - x_i|^2}{r^2}\right) = \frac{c_k}{N r^n} \sum_{i=1}^{N} k'\!\left(\frac{|x - x_i|^2}{r^2}\right) \frac{2\,(x - x_i)}{r^2} \tag{10.3}$$
173
Let’s analyse the last formulation in detail:
$$m_g(x) = \frac{\nabla f_k(x)}{A\, f_g(x)}$$
Geometrically, the mean shift vector starts at x and points towards the direction
to follow to find the mode.
The iterative procedure to find the modes using the mean shift vector is:
174
The mode pruning is needed to prune the saddle points which have zero gradient.
The saddle points are unstable locations and small perturbation causes the process
to move away.
In the next page we report an example of such procedure. We note that the mean
shift vector magnitude depends on the gradient. As result, the steps are smaller
towards a mode and hence convergence depends on appropriate step size.
Figure 10.14 mean shift vector magnitude about the previous example
175
Figure 10.15 Example of mean shift procedure
176
10.5.3 Optimization technique
• then, all the points that were traversed by the window belong to the mode
where the window ends. Hence, a mode identifies a cluster.
We note that the window trajectories follow the steepest ascent directions.
177
10.5.4 Means shift clustering segmentation
178
Mean shift can effectively separate complex modal structures.
where xs and xr are respectively the spatial and the color part of the feature
vector; and ks = kr actually (we used this notation to make it evident the two
components).
Figure 10.20 Example of discontinuities preserved. We can note that the grass
region presents different clusters although it is always grass. Also the tripod keeps
its discontinuities: we can distinguish the parts constituting it.
179
Figure 10.21 Example of discontinuities preserved as the windows size changes.
hs , hr in our notation are rs , rr as the window sizes o.
180
10.5.5 Segmentation examples using means shift
181
182
10.5.6 Means shift pros and cons
Pros:
• robust to outliers
Cons:
• Depends on window size. It is not always easy to find the best value and
an inappropriate choice can cause modes to be merged.
• Computationally expensive
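OpenCV provides a mean shift based filtering of the joint spatial-color domain; a minimal sketch (the window radii correspond to the spatial and range window sizes of the notes, and the values are illustrative):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("garden.png"); // hypothetical color image
    if (img.empty()) return 1;

    // Spatial window radius (sp) and color window radius (sr); values are illustrative.
    double sp = 20.0, sr = 40.0;
    cv::Mat filtered;
    cv::pyrMeanShiftFiltering(img, filtered, sp, sr);

    cv::imwrite("meanshift.png", filtered);
    return 0;
}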
183
• define a set of labels $L = \{l_1, l_2, \ldots, l_m\}$ with |L| = m.
Labels are associated with sites. A site is a pixel location. We denote with Ω the set of pixels.
Given a pixel location p, its label h is assigned by means of a function:
$$h = f_p = f(p)$$
1. A DATA TERM. It evaluates how well $f_p$ fits pixel location p and defines a match criterion based on the image feature vector.
2. A SMOOTHING/CONTINUITY/NEIGHBORHOOD TERM.
It depends on the labels in neighboring pixels. The neighborhood of p is
defined as A(p) and is commonly formed by 4 neighbors.
$$\sum_{q \in A(p)} E_{smooth}(f_p, f_q)$$
We want to find a balance between the two terms for detecting strong boundaries.
A smoothing term too weak leads to fragmentation.
The energy function evaluated over the whole image is defined as:
$$E(f) = \sum_{p \in \Omega} \Big( E_{data}(p, f_p) + \sum_{q \in A(p)} E_{smooth}(f_p, f_q) \Big)$$
184
smoothness term). Interactions are modelled considering a graph (this is the pixel representation used by the MRF; we don't go into the details). The MRF uses an undirected graph (hence bi-directional arcs) where the arcs represent the links of mutual influence: how a pixel is influenced by its neighbors.
The MRF is used to solve a minimization problem: in our case we would like to select the labelling that minimizes the energy function.
The data and smoothing term are defined based on the specific application. Let’s
see different examples about how we can define them.
• We select m random pixels in the image, one per label (recall that m is the size of the label set).
• We create a feature vector for each selected random pixel: $x_i = [I_i, \mu_i, \sigma_i]^T$ for 1 ≤ i ≤ m. In this way we obtain m seeds for the m classes.
• We evaluate Edata as the squared L2 norm between the current pixel's feature vector and the reference of its class:
$$E_{data}(p, l_i) = \|x_i - x_p\|_2^2 = \sum_{k=1}^{n} (x_{i,k} - x_{p,k})^2$$
We can define the data term as the PEAKS in the feature space: we select m
local peaks in the feature space by finding them using the Mean shift algorithm.
185
10.6.4 Smoothness term definition examples
We can define it by the POTTS MODEL. Considering two general labels l and h,
if they are equal we set the smoothness term to 0, otherwise to a constant c.
$$E_{smooth}(l - h) = E_{smooth}(a) = \begin{cases} 0 & \text{if } a = 0 \\ c & \text{otherwise} \end{cases} \tag{10.4}$$
where c is a constant.
A truncated version is also available.
The minimization problem defined by the MRF can be solved using the Belief
Propagation (BP) algorithm. Such algorithm is based on message passing between
neighboring pixels and they convey information about labels and related costs. The
algo runs several iterations and messages are delivered at each iteration.
Consider the message sent by pixel p to pixel q. It contains
• local information at p
Throughout the process each node acts as q in iteration i (it receives info from
the neighborhood pixels) and as p in iteration i+1 (it sends info to another pixel).
186
10.7.1 Message
A message is a function mapping the ordered discrete set of labels L (of size m)
into an array of length m having, for each position, the message value.
Such function is defined as :
$$m^t_{p \to q}(l)$$
which indicates the message value for label l ∈ L sent from p to q at time t.
187
How to read it? From p to q at time t, the label l should be the one over all the
possible label h that minimize the cost at p of label h + the cost if q takes label
l and p takes h + the cost at p of label h from the perspective of pixels in p’s
neighbourhood excluding q.
More detailed interpretation:
• Node q inquires all its neighbours (such as p) asking their opinion about
selecting label l at q.
• Node p gets the question – to reply, it scans all possible h and tells q about
the lowest possible cost when q takes l.
• Node q is excluded from the computation (It cannot influence the opinion
of p about q itself).
So far we have discussed the message from p to q. Then, we need to extend this process to every neighbor of q. At q we also need to take into account the local data term, hence the total cost at q is (the smoothness costs do not appear because they are already embedded in $m^t_{p \to q}(l)$):
$$E_{data}(q, l) + \sum_{p \in A(q)} m^t_{p \to q}(l)$$
Hence, in these costs we consider the neighborhoods of both the sender and the receiver node.
These costs are the core of the Belief Propagation algorithm. The steps of this
algorithm are:
1. Initialization
3. Termination
188
10.7.2 BP initialization
We need to initialize the messages and the related costs at time 0. Note that at
time 0 we miss the history of the previous neighbors.
Upon termination (at time $t_0$) the final label at each pixel location is determined as
$$j = \arg\min_{1 \le i \le m} c_i^{t_0}$$
If more than one minimum is found, we can for instance favour the labels assigned
to adjacent locations.
So we end up having a label assigned to each pixel.
BP can be used for image segmentation. Each label found by running the BP algorithm defines one or more segments: the pixels with the same label belong to the same segment.
BP is often applied at multiple resolutions (this technique is called pyramidal BP).
189
Figure 10.23 Example 1 - BP segmentation
190
Chapter 11
Active contours
11.1 Snake
A first approach to locate such boundary curves in images is called Snakes. The snake approach mathematically defines an active contour as a 2D energy-minimizing spline curve that evolves (moves) towards image features such as strong edges and lines. We recall that a spline is a mathematical tool for defining a curve piecewise.
Snakes method consists in the following steps:
1. Initialization process
2. The curve moves itself to try to fit the contour of the object
3. The contours stop moving when most of the points are edge pixels.
The initialization process consists in placing a spline curve close to the desired contour. This initialization step can be performed manually by the user through a user interface, or automatically. A good initialization is critical.
191
11.1.2 Contour move
After initialization, the curve we drew starts deforming itself to fit the contour of the object as best as it can. What makes the contour move is a balance of energies. This energy is made of two components: an internal energy that forces the snake to stay in shape and so controls the deformation of the snake, and an external energy that forces the snake to get attracted by some image elements such as the edges. Having defined a spline curve, hence a snake, as v(s) = (x(s), y(s)), we define the energies as follows:
Figure 11.1 Internal energy. The parameters α(s) and β(s) defines the stretch
and the curvature of the snake pointwise. For instance setting β(s) = 0 at some
points allows the snake to develop a corner.
Looking at the external energy we can see that it depends on some terms referring
to some elements of the image: line, edges and terminations. Based on how we
adjust the weights the snake behaves different. The edge term is defined to be
proportional either to the image gradients, or to a smoothed version of the image
Laplacian:
Figure 11.3 Left: edge energy term proportional to the image gradients; Right:
edge energy term proportional to a smoothed version of the image Laplacian. Gσ
is a Gaussian with standard deviation σ
The line term is expressed in terms of the intensity of the image: Eline = I(x, y).
Then depending on the sign of the weights wline the snake will be attracted either
to light lines or dark lines.
The Eterm deals with guiding the snake towards the termination of the line seg-
ments and the corners.
192
Figure 11.4 Eterm definition
However in practice, most systems only consider the edge term in the definition of the external energy.
193
Figure 11.6 Snake example using only Eterm and Eedge
194
Figure 11.8 Snake example for pedestrian detection. In this application they could extract the shape of the pedestrian instead of drawing a rectangle around it. If we change the energy balance we can change the behaviour of the snake and make it go inside the concave region between the legs.
Figure 11.9 Snake example from the paper ’Active contour model by combining
edge and region information discrete dynamic systems’ by Zhifeng Wang and Yan-
Jun Liu (https://journals.sagepub.com/doi/pdf/10.1177/1687814017692947)
11.2 Scissors
The snake method moves the curve only after we have initialized it near a boundary of interest. A different approach is Scissors: scissors optimize the contour in real time as the user is drawing.
195
Figure 11.10 Scissor example
196
Chapter 12
Detecting and matching salient elements of the image is a key task in computer
vision. Such elements are called features. A feature is a ”meaningful” part of the
image.
Features have two main components:
Hence, when dealing with features we give an input image and we would like to
output a set of points + description of the region (aka signature, fingerprint, ...).
197
12.2 Feature pipeline sum-up
Figure 12.2
198
Figure 12.3 Stability measure example
12.2.2 Descriptors
The descriptor is a vector representation of the local patch (that is, the surrounding area of a keypoint). Keep in mind that we don't choose the pixels of the surrounding area: this is up to the designer, hence we need to consult the manual to know how the descriptor is implemented.
The descriptor is based on:
• color
• texture
• orientation
• etc.
Once we have the keypoints and the descriptors, one operation we would like to do is to match the features. Matching means:
• evaluate features in the two images. Hence, find the keypoints and the
descriptor for each of them.
199
• find similar features looking at their descriptor. We use a distance metric
for stating how similar the two features are based on their descriptor.
Figure 12.4 Matching example. The green lines indicate good feature matches,
while red lines wrong matches.
• motion detection
• 3D reconstruction
Figure 12.5 Scene reconstruction & structure from motion (SFM). The number
of triangles represents the number of photo used for the reconstruction and in
particular the orientation of the triangles indicates the angle at which the image
was shot.
• Stitching
Figure 12.6 Stitching
Figure 12.7
201
• Object detection
Figure 12.9
Figure 12.10
We now focus on feature detection (also called keypoint detection), which is the first step of the feature pipeline.
202
Figure 12.11
What are good points? The easiest keypoint to detect are the corner points
because they are fixed at a position. Indeed if we take edge points, they can move
along the axis, while if we consider points in a uniform regions they can move
everywhere.
The salient points (aka keypoints) can be detected using many different criteria
and different algorithms exist. The most famous algorithm is the Harris-Stephens
corner detector.
The intuition behind the algorithm is to consider a patch in an image and a shifted version of the patch. Then:
• in a uniform region the two patches will be similar
• for salient points instead the two patches will be different. In particular, corners produce a large difference if the patch is moved.
4. We use the autocorrelation for computing a score which we use for classifying
the patch as consisting of flat, edge, or a corner
The autocorrelation function is computed for all the pixels (s,t) in the patch:
$$E(\Delta x, \Delta y) = \sum_s \sum_t w(x_s, y_t)\,\big[I(x_s + \Delta x,\, y_t + \Delta y) - I(x_s, y_t)\big]^2$$
where:
• I(xs + ∆x, yt + ∆y) is a patch of the same size as I(xs , yt ) but shifted by
(∆x, ∆y)
We can approximate E using the Taylor series. The accuracy of this approximation
is higher as delta x and delta y is smaller and smaller.
where the partial derivatives are computed for the points in the patch:
$$I_x = \frac{\partial I}{\partial x}(x_s, y_t), \qquad I_y = \frac{\partial I}{\partial y}(x_s, y_t)$$
Hence, substituting into the auto-correlation function and neglecting the weights yields:
$$E(\Delta x, \Delta y) = \sum_s \sum_t \big[I_x(s, t)\,\Delta x + I_y(s, t)\,\Delta y\big]^2$$
204
We can write it in matrix form:
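The matrix expression itself is missing from the extracted notes; for reference, the standard form (with the weights neglected, as above) is:
$$E(\Delta x, \Delta y) \approx \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} A \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}, \qquad A = \sum_s \sum_t \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$
with $I_x$, $I_y$ evaluated at $(x_s, y_t)$.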
The matrix A describes how the region changes for a small displacement. This
matrix is REAL and SYMMETRIC. Under these conditions, the eigenvectors are
ORTHOGONAL and point to the directions of max data spread (where the data
are the points of the partial derivative evaluated in each point of the patch
(Ix (s, t), Iy (s, t))). The corresponding eigenvalues are proportional to the amount
of data spread in the direction of the eigenvectors.
Thus, we see that the eigenvalues of the matrix formed from derivatives in the
image patch can be used to differentiate between the three scenarios of interest:
205
• one large and one small eigenvalue imply the presence of a vertical or hori-
zontal boundary, hence an edge.
Figure 12.14 λ are the eigenvalues. The green region represents the value of the
eigenvalues identifying a corner, the grey region identifies an uniform region, and
the orange one identifies an edge.
The weights of the autocorrelation function are convolved with the autocorrelation
matrix :
Figure 12.16
206
and the weighting function can be written as:
$$w(x_s, y_t) = e^{-\frac{s^2 + t^2}{2\sigma^2}}$$
The corner response score R can then be computed from the eigenvalues in different ways, for instance:
• $R = \lambda_0 - \alpha \lambda_1$ [Triggs]
• $R = \dfrac{\det(A)}{\operatorname{trace}(A)} = \dfrac{\lambda_0 \lambda_1}{\lambda_0 + \lambda_1}$ [Brown, Szeliski, Winder]
$$I(x, y) \;\rightarrow\; I(x, y) + c$$
Figure 12.17
207
Figure 12.18
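A minimal OpenCV sketch of the Harris-Stephens detector (note that OpenCV uses the response R = det(A) − k·trace(A)²; file name, block size and threshold are illustrative):

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("checkerboard.png", cv::IMREAD_GRAYSCALE); // hypothetical file
    if (img.empty()) return 1;

    // blockSize = neighbourhood (patch) size, ksize = Sobel aperture,
    // k = the alpha constant of the response R = det(A) - k * trace(A)^2.
    cv::Mat response;
    cv::cornerHarris(img, response, 2, 3, 0.04);

    // Keep only the strongest responses (corners).
    cv::Mat corners;
    cv::normalize(response, response, 0, 255, cv::NORM_MINMAX, CV_32F);
    cv::threshold(response, corners, 120, 255, cv::THRESH_BINARY);

    corners.convertTo(corners, CV_8U);
    cv::imwrite("harris_corners.png", corners);
    return 0;
}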
208
12.6 Other detectors
The SUSAN detector can detect both corners and edges. It is interesting because:
• robust to noise
Figure 12.21 USAN example. In the right most image, the USAN are the white
region inside the masks.
Based on the area of the USAN we distinguish corners, edges and uniform regions:
SUSAN stands for SMALLEST USAN: hence we take into account only the USANs that have the smallest area (which indicates a corner).
Features can also focus on blobs instead of points. A blob is a region in which some properties are constant or approximately constant:
12.7.1 MSER
• compute some statistics for each region e.g. area, convexity, circularity, and
so on.
• analyze how persistent each blob is. Under the hood, MSER is monitoring
the changes to each of the thresholded images and is looking for regions of
the thresholded images that maintain unchanged properties over a large set
of the possible threshold values.
210
’Extremal’ refers to the property that all pixels inside the MSER have either higher
(bright extremal regions) or lower (dark extremal regions) intensity than all the
pixels on its outer boundary.
211
12.8 Object detection in different scale space
We would like to analyze the image at multiple scales. To achieve this we use the SCALE SPACE as a tool. The scale space transforms the original image f(x, y) into a family of derived images f(x, y, t). These images are identified by the parameter t, which indicates the location of the image in the scale space. Looking at the pyramid representation in figure 12.25, t represents the level in the pyramid and the original image is at the base.
As t increases, the details are gradually suppressed (reducing the scale, we reduce the resolution, which means the information content of the image). Why should we aim at suppressing image details? By removing the details we remove the weak elements and select the strongest elements of the image.
• linearity
• scale invariance
212
The literature demonstrated that Gaussian smoothing meets such criteria [T. Lindeberg, "Scale-space: A framework for handling image structures at multiple scales"]. The key element is the N-dimensional Gaussian kernel $g : \mathbb{R}^N \times \mathbb{R}_+ \to \mathbb{R}$:
$$g(x, t) = \frac{1}{(2\pi t)^{N/2}}\, e^{-\frac{x_1^2 + \ldots + x_N^2}{2t}}$$
where t is the scale parameter (it plays the role of the variance $\sigma^2$ of the Gaussian).
Thus the scale space is represented as a convolution with a Gaussian kernel. In the continuous domain this is expressed as (the discrete domain is left as an exercise):
$$L(x, t) = \int_{\xi \in \mathbb{R}^N} f(x - \xi)\, g(\xi, t)\, d\xi$$
Figure 12.26 Gaussian scale space for t=0 (orginal image), t=4, t=64, t=256.
213
Figure 12.27 No local minima are created in the scale space.
• select the best scale (hence the value of t) for each edge
• combine the edges found at multiple scales (for their best value of t).
We need to define a measure of edge strength for identifying the best scale value
t. A simple idea is to normalize by t the magnitude of the edge .
Strong edges can be found at all scales. At fine scales edges are accurate but many
spurious edges are found. At coarse scales edges are inaccurate but only strong
edges are found. The combination of the detection leads to the best results.
214
Figure 12.28 Edge detection example
Defining a metric for edge strength lets us find the strongest edges (found at all
scales.)
215
Figure 12.29 Edge detection example with edge metric
216
12.9 The SIFT feature
12.9.1 Introduction
The Scale Invariant Feature Transform offers very reliable keypoint detector and
descriptor. It is widely used. When images are similar in nature (same scale,
similar orientation, etc), corner detection and MSERs are suitable as whole image
features. However, in the presence of variables such as scale changes, rotation,
changes in illumination, and changes in viewpoint, we are forced to use methods
like SIFT. SIFT features (called keypoints) are invariant to image scale and rota-
tion, and are robust across a range of affine distortions, changes in 3-D viewpoint,
noise, and changes of illumination. The input to SIFT is an image. Its output is an
n-dimensional feature vector whose elements are the invariant feature descriptors.
The key idea is based on mapping the image content into local feature coordi-
nates. This makes the feature detection invariant to translation, rotation, scale
and other transforms.
• It is efficient.
• object recognition
217
• query by example
The SIFT algorithm covers the first two blocks of the feature pipeline: feature detection and description.
Figure 12.32
To describe the algorithm we are going to cover the 4 elements that characterise it:
• scale-space organization
• keypoint localization
• orientation measurement
• descriptor calculation
This procedure is applied to each octave. That is, the first image of the new
octave is formed by:
1. downsampling the original image enough times to achieve half the size of
the image in the previous octave;
2. getting the rest of the images in the new octave by Gaussian smoothing the
downsampled image with standard deviations equal to the ones of the previous
octave: $k\sigma, k^2\sigma, k^3\sigma, \dots$ Hence, within one octave, intermediate scales
are obtained by Gaussian smoothing.
What is k?
Within one octave we have s intervals, which are the gaps between one image
and the next one. We can define the number of images in an octave based on
the number of intervals: in an octave there are s + 1 images. For each image
within an octave we use a different standard deviation: starting from the bottom
image, the standard deviations are $\sigma, k\sigma, k^2\sigma, \dots, k^s\sigma$. In particular, the last image
of one octave (with standard deviation $k^s\sigma$) has the same smoothing as the first
one of the next octave, but we recall that the latter is downsampled, so the result is
different. Hence, the constraint yields:
\[ k^s \sigma = 2\sigma \;\longrightarrow\; k = 2^{1/s} \]
For instance, for s = 2 (3 images) the standard deviations will be $\sigma, \sqrt{2}\,\sigma, 2\sigma$.
219
Figure 12.33 Scale space organization
DoG
220
Figure 12.34 scale space sift for s=2
All the layers are images. Let's take a look at one of them.
They look like an edge image. Why? Subtracting consecutive layers means
evaluating the difference between two Gaussian smoothed images which have the
same type of smoothing (Gaussian) but different smoothing intensity. Such an
operation is actually achieved in one shot by applying the Difference of Gaussian
(DoG) kernel, which is represented by the function D(x, y, σ). This kernel brings
up the edges because it approximates the edge detection filter LoG (Laplacian
of Gaussian), which is exploited in the Marr-Hildreth edge detector algorithm (it
simply convolves the image with the LoG filter). Given the Gaussian filter G(x, y):
\[ G(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}} \]
the LoG filter computes its laplacian:
∇2 G(x, y)
As told before the LoG is usually approximated by the DoG, which computes the
following formula:
\[ D(x, y) = \frac{1}{2\pi\sigma_1^2} \, e^{-\frac{x^2 + y^2}{2\sigma_1^2}} - \frac{1}{2\pi\sigma_2^2} \, e^{-\frac{x^2 + y^2}{2\sigma_2^2}} \]
where σ1 > σ2 .
Hence, in SIFT, the scale space plus the subtraction of consecutive layers implements the DoG filter.
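As a sketch of this idea (not the exact SIFT implementation), one octave of the DoG stack can be obtained in OpenCV by blurring with increasing sigma and subtracting consecutive layers. The base sigma and the number of intervals s below are assumptions.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
    img.convertTo(img, CV_32F);                     // signed type, so differences are not clipped

    const int s = 2;                                // intervals per octave (assumption)
    const double k = std::pow(2.0, 1.0 / s);        // scale step k = 2^(1/s)
    const double sigma0 = 1.6;                      // base sigma (assumption)

    std::vector<cv::Mat> gauss, dog;
    for (int i = 0; i <= s; ++i) {                  // s + 1 smoothed images in the octave
        cv::Mat L;
        cv::GaussianBlur(img, L, cv::Size(0, 0), sigma0 * std::pow(k, i));
        gauss.push_back(L);
    }
    for (std::size_t i = 1; i < gauss.size(); ++i) {
        cv::Mat D;
        cv::subtract(gauss[i], gauss[i - 1], D);    // one DoG layer
        dog.push_back(D);
    }
    return 0;
}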
222
12.9.4 Keypoint localization
The DoG explained before is the very first step to localize the keypoints. Indeed,
after that process, SIFT searches for the maxima and minima of the DoG which
will represent the initial keypoints that will be redefined later.
Figure below shows the procedure used by SIFT to find extrema in a D(x, y, σ)
image. At each location (shown in black) in a D(x, y, σ) image, the value of the
pixel at that location is compared to the values of its 8 neighbors in the current
image and its 9 neighbors in the images above and below. Then, the point is
selected as an extremum (maximum or minimum) point if its value is larger than
the values of all its neighbors, or smaller than all of them. NOTE that no extrema
can be detected in the first and last scale of an octave because it has respectively
no lower and upper scale image of the same size.
This makes the SIFT features scale independent.
(See page 890 in the book Digital Image Processing, 4th edition.)
Figure 12.39 Output of SIFT: it is a set of keypoints with associated scales. The
size of the circle represents the level of the scale where the keypoint was found.
224
12.9.5 What is the best value for s?
We consider the repeatability of the feature, that is, how well the features are
repeated over the same element in different scale images of an octave. From the
experiments we can see that the repeatability has a peak for s = 3. Hence this is
the best value to use for s.
Figure 12.40
At this point we have computed keypoints that SIFT considers stable and we
have achieved scale independence. Now we have to achieve invariance to image
orientation.
Each keypoint previously computed comes with its scale. For each keypoint we
consider its Gaussian smoothed image representing the scale and we name it L.
Then we compute the gradient magnitude and orientation using L:
\[ M(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2} \tag{12.1} \]
\[ \Theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right) \tag{12.2} \]
225
Figure 12.41 Histogram creation
Figure 12.42 Keypoint detection example. The arrows represent the keypoint
orientation.
The procedures discussed up to this point are used for assigning an image location,
scale, and orientation to each keypoint, thus providing invariance to these three
variables. The next step is to compute a descriptor for a local region around each
keypoint that is highly distinctive and at the same time as invariant as possible
to changes in scale, orientation, illumination, and image viewpoint. The idea is
to be able to use these descriptors to identify matches (similarities) between local
regions in two or more images.
The idea used by SIFT is to evaluate a descriptor for each keypoint in the corre-
sponding image L considering coordinates of pixels that are rotated based on the
measured keypoint orientation.
226
The full descriptor calculation process is the following. We consider a neighbor-
hood 16x16 of the keypoint, and for each pixel inside this region we compute the
gradient magnitude and direction. Each gradient magnitude point is weighted
(multiplied) with a gaussian function centered on the keypoint: recalling the bell-
shaped surface of the gaussian, the weights decrease as we move away from the
center, so the points around the keypoint have high weights.
Then we group all the pixels of the 16x16 neighborhood region into 4x4 pixel
groups, thus obtaining 16 sub-regions. For each sub-region, we quantize all the
gradient orientations into eight possible directions differing by 45 degrees and we
evaluate a histogram with 8 bins, one for each possible direction. After applying
this procedure to all the sub-regions, we get a descriptor which is a 128-dimensional
vector (this is the data structure used by SIFT, but for simplicity we can imagine
it as a 4x4 matrix, each cell containing a histogram: indeed 8 directions x 4 rows
x 4 columns = 128 elements).
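In practice we rarely implement this by hand. A minimal OpenCV (C++) sketch that extracts SIFT keypoints and their 128-dimensional descriptors is shown below, assuming an OpenCV version (4.4 or later) where SIFT lives in the main features2d module; the file names are placeholders.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> keypoints;            // location, scale and orientation
    cv::Mat descriptors;                            // one 128-dimensional row per keypoint
    sift->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
    cv::Mat out;
    cv::drawKeypoints(img, keypoints, out, cv::Scalar::all(-1),
                      cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS);  // circle size = scale
    cv::imwrite("sift_keypoints.jpg", out);
    return 0;
}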
Description refinements
The 128-dimensional vector is then normalized to unit norm. This ensures good
invariance to uniform illumination changes: a contrast change (a multiplicative
constant) is canceled by the normalization, while a brightness change caused by a
constant being added to each pixel will not affect the gradient values because they
are computed from pixel differences. Therefore, the descriptor is invariant to affine
changes in illumination.
228
Figure 12.45 SIFT using VLFeat library
The robustness of the SIFT features to noise factors was accurately measured.
Robustness to noise
SIFT turns out to be robust to noisy images and to random rotations. From the
graph in figure 12.47 we can see that the capability of the system to correctly match
the features on the randomly rotated noisy images decreases as the noise increases,
but it remains high.
230
Figure 12.48 Robustness to affine transform
The number of features discovered is quite high even in the case of 2% noise and
a 30° viewpoint change.
231
12.10 Overview of other SIFT-based features
SIFT provides very good results but there exist other techniques.
The main ones that share the same idea as SIFT are:
• PCA-SIFT
• SURF
• GLOH
PCA recall
PCA finds the linear projection that:
• minimizes the mean squared distance between the original data points and their
projections (sum of the blue lines). Hence this represents the error coming from
projecting the data.
So PCA provides us with a lower-dimensional space that preserves as much infor-
mation as possible.
PCA is based on the eigenvalue decomposition of the covariance matrix. The
eigenvectors with the largest eigenvalues, called principal components, are the basis
of the new space. The first principal component is the eigenvector pointing in the
direction of largest variance. Each other principal component is orthogonal to the
previous ones and points in the direction of the largest remaining variance.
PCA-SIFT
2. Step 4: it is different from step 4 of the SIFT procedure. Indeed, for com-
puting the descriptor we consider a 41x41 patch at the given scale, centered
at the keypoint and normalized to a canonical direction [more details in Yan
Ke and Rahul Sukthankar, "PCA-SIFT: A More Distinctive Representation
for Local Image Descriptors", Computer Vision and Pattern Recognition,
2004].
3. While in SIFT we computed the weighted histograms, here we do not.
Instead, we concatenate the horizontal and vertical gradients of the region
highlighted by the 41x41 patch (which becomes 39x39 because of border
effects) into a long vector of size: 2 directions x 39 x 39 = 3042.
5. reduce the vector using PCA, e.g. from 3042 to 36 dimensions (a very strong
reduction).
[Reference paper: Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, “SURF:
Speeded Up Robust Features”, European Computer Vision Conference (ECCV),
2006 ]
SURF stands for Speeded Up Robust Features. The idea is to make the
computation faster by approximating the Hessian matrix and the descriptors
using integral images. The integral image IP(x, y) has the same size as the
original image, but its pixels are not grey-scale intensities: each pixel represents
the sum of all the pixels of the original image in the rectangle between the origin (0,0) and that pixel:
\[ IP(x_i, y_i) = \sum_{i=0}^{x_i - 1} \sum_{j=0}^{y_i - 1} I(i, j) \]
Given an integral image we can evaluate the sum of all the pixels in any given
area by just summing up the vertex points:
S =A+D−B−C
This makes the computation of a box filter fast. Box filters are composed of black
and white rectangles, where white is a positive weight, while black is a negative
weight. A box filter has a behaviour similar to the Laplacian. For this
reason we use this filter for approximating the Hessian matrix (instead SIFT uses
DoG).
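A minimal OpenCV (C++) sketch of the integral image trick follows: cv::integral precomputes IP and the sum over any rectangle is then obtained from its four corners, in constant time. The rectangle coordinates are arbitrary placeholders.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat ii;
    cv::integral(img, ii, CV_32S);                  // ii has size (rows+1) x (cols+1)

    // Sum of the pixels inside the rectangle [x, x+w) x [y, y+h)
    int x = 10, y = 20, w = 30, h = 40;             // hypothetical rectangle
    int A = ii.at<int>(y, x);                       // top-left corner
    int B = ii.at<int>(y, x + w);                   // top-right corner
    int C = ii.at<int>(y + h, x);                   // bottom-left corner
    int D = ii.at<int>(y + h, x + w);               // bottom-right corner
    int sum = A + D - B - C;                        // cost independent of w and h
    (void)sum;
    return 0;
}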
In SURF the scale space is different from the one in SIFT. In SIFT the images are
smoothed and subsampled to move along the pyramid. In SURF instead we use
box filters of increasing size. Indeed, the computation time of box filters does not
depend on filter size using the integral image.
Regarding orientation, it is evaluated in a different way from as done in SIFT. We
consider a neighborhood of size 6σ (recall that σ is the scale at which the point
was detected) and evaluate the gradients in x and y of the region highlighted using
the Haar wavelets filters.
For the descriptor, we consider the 16 regions as in SIFT but, instead of creating
the histograms, for each region we compute the Haar wavelet responses dx and dy
of the sample points and sum them. We then obtain for each region a vector made
of 4 elements (hence 4x16 = 64 features):
\[ \left[\; \sum d_x, \;\; \sum d_y, \;\; \sum |d_x|, \;\; \sum |d_y| \;\right] \]
We normalize it.
235
Figure 12.54 Descriptor computation
236
Figure 12.56 Surf features matching example
12.10.3 GLOH
GLOH (Gradient Location and Orientation Histogram) is a SIFT-like descriptor
computed on a log-polar location grid with:
• 16 possible orientations
Other local features, not based on the SIFT idea, include:
• Shape context
• LBP
• BRIEF
• ORB
12.11.1 Shape context
For each point pi of the shape, a log-polar histogram of the relative positions of
the other points is built using:
• 5 possible distances
• 12 possible orientations
\[ h_i(k) = \#\{\, q \neq p_i : (q - p_i) \in \mathrm{bin}(k) \,\} \]
From this it follows that the descriptor is translation invariant. Scale invariance is
achieved by normalizing all radial distances by the mean distance between point pairs.
It is also robust to deformations, noise and outliers. However, it is NOT rotation
invariant!
238
Figure 12.58 Shape context. The right-hand image (d), (e), (f) represent the
log-polar histograms of the single keypoint highlighted in the image (a) and (b).
We can see that (d), (e) represent the histogram for the same keypoint and they
are similar even if their position in the image is different. (c) Diagram of log-polar
histogram bins highlighting the center part. In total the diagram identifies 60
sectors.
Figure 12.59 Shape matching using shape contexts. Despite the image are differ-
ent, they are matched as similar because of their shape. The numbers below each
image represent their distance from the query image.
12.11.2 LBP
It was proposed for texture recognition in 1994. The LBP method compares the
central pixel with its 8 surrounding neighbors and builds a binary sequence from the
signs of the differences: following the pixels along the circle, if a neighbor is greater
than the center pixel we assign 1, otherwise 0. Then we compute the histogram of
such values. We might normalize the histogram as a final step, but this is optional.
239
Figure 12.60 LBP procedure
The procedure can be applied to subregions of the image instead of to the whole image:
we divide the image into subwindows, run the procedure previously described on each
of them and then concatenate the histograms.
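The method is simple enough to write directly. The following is a minimal C++/OpenCV sketch of the basic 8-neighbor LBP followed by a 256-bin histogram; the clockwise neighbor ordering and the input file name are arbitrary choices of the sketch.

#include <opencv2/opencv.hpp>
#include <array>

int main() {
    cv::Mat img = cv::imread("texture.jpg", cv::IMREAD_GRAYSCALE);  // hypothetical input
    cv::Mat lbp = cv::Mat::zeros(img.size(), CV_8U);
    const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};   // neighbors visited clockwise
    const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    for (int y = 1; y < img.rows - 1; ++y)
        for (int x = 1; x < img.cols - 1; ++x) {
            uchar c = img.at<uchar>(y, x);
            uchar code = 0;
            for (int k = 0; k < 8; ++k)              // bit = 1 if the neighbor is >= the center
                code |= (img.at<uchar>(y + dy[k], x + dx[k]) >= c) << k;
            lbp.at<uchar>(y, x) = code;
        }
    std::array<int, 256> hist{};                     // histogram of the LBP codes
    for (int y = 0; y < lbp.rows; ++y)
        for (int x = 0; x < lbp.cols; ++x)
            hist[lbp.at<uchar>(y, x)]++;
    return 0;
}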
12.11.3 BRIEF
• Gaussian smoothing of the patch (before comparing single pixels, to reduce the
sensitivity to noise)
How do we sample the pixel pairs to compare? We can opt for different solutions,
but the best option for most cases is an isotropic Gaussian distribution.
240
BRIEF is robust to viewpoint changes and occlusions, but not to rotation and scale
changes (rotation invariance is what ORB adds, as described below). Anyway, some
different objects can be matched: even if they appear different to us, if they have
some common features they will look equal to a computer.
Figure 12.63 Distribution of Hamming distances for matching pair of points (thin
blue lines) and for non-matching (wrong matching) pairs (thick red lines) related
to the wall dataset. We can use this graph for determining if the images are the
same or not.
12.11.4 ORB
ORB (Oriented FAST and Rotated BRIEF) combines the FAST keypoint detector (not
covered by this course) with the BRIEF descriptor and adds rotation invariance to
BRIEF. The orientation assignment is based on the intensity centroid with respect to
the central pixel.
HOMEWORK: read documentation: https://medium.com/data-breach/introduction-to-orb-oriented-fast-and-ro
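As a usage sketch (hypothetical file names, default parameters apart from the feature count), ORB keypoints and binary descriptors can be computed and matched with the Hamming distance in OpenCV as follows.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img1 = cv::imread("img1.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("img2.jpg", cv::IMREAD_GRAYSCALE);
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);     // maximum number of features
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat des1, des2;                               // binary descriptors
    orb->detectAndCompute(img1, cv::noArray(), kp1, des1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, des2);
    cv::BFMatcher matcher(cv::NORM_HAMMING, true);    // Hamming distance + cross check
    std::vector<cv::DMatch> matches;
    matcher.match(des1, des2, matches);
    return 0;
}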
12.11.5 Comparison
How many features are matched correctly? And how many wrongly? The number
of features is different for every feature method.
Features found on images are usually matched between images: that is, in fact, why
we extract them. There are different matching strategies we can opt for. The matching
strategy answers the question 'Which features do we compare?'. Moreover, there is
also a match evaluation step to rank the matches: we assign a number to each match,
which usually corresponds to the Euclidean distance or the Hamming distance between
the descriptors.
A common criterion is the nearest neighbor distance ratio, which compares the
distance d1 to the nearest neighbor with the distance d2 to the second nearest neighbor:
\[ NNDR = \frac{d_1}{d_2} \]
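This is the ratio test that is typically applied after a k-nearest-neighbor match. A minimal OpenCV (C++) sketch, assuming float descriptors such as SIFT and a 0.8 threshold (both assumptions), is the following.

#include <opencv2/opencv.hpp>
#include <vector>

// Keep a match only when NNDR = d1/d2 is below a threshold.
void ratioTestMatch(const cv::Mat& des1, const cv::Mat& des2,
                    std::vector<cv::DMatch>& good, float maxNNDR = 0.8f) {
    cv::BFMatcher matcher(cv::NORM_L2);              // L2 norm for float descriptors
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(des1, des2, knn, 2);            // two nearest neighbors per descriptor
    for (const auto& pair : knn)
        if (pair.size() == 2 && pair[0].distance < maxNNDR * pair[1].distance)
            good.push_back(pair[0]);                 // NNDR small enough: keep the match
}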
After choosing the matching strategy, we would like to measure the performance
of our choice. We can use the following indicators:
• FALSE POSITIVES (FP): number of non-matches that were wrongly matched
• precision = $\frac{TP}{TP + FP}$
• recall = $\frac{TP}{TP + FN}$
244
Chapter 13
Face detection
• very low false positive rate (< 10−6 ) is required to avoid false positive.
• boosting
245
13.1.1 Haar features
Haar features are combinations of rectangular black and white filters. For each
feature we sum up all the pixels overlapping the black area and all the ones over-
lapping the white one, then we subtract the two sums. Hence, a Haar feature is
defined as:
\[ f(x) = \sum_i p_{black}(i) - \sum_i p_{white}(i) \]
Figure 13.1 Haar feature. The number of pixels in the black and white area are
the same. (c) is misleading here.
They are very coarse features but they are very sensitive to:
• edges
• bars
They are also very computationally efficient: a large number of features can be
computed efficiently, and this compensates for the coarseness.
How many features? Let's consider a 24x24 patch (this is not a constant anyway).
It can generate about 160k features! This number of features is huge
and we cannot evaluate all of them in a feasible way. To overcome this issue Viola
and Jones came up with a new method based on weak learners.
We can define a weak learner as a simple classifier that works on the number
evaluated by a Haar feature $f_j(x)$. Basically, we set a threshold $\theta_j$ on each single
feature $f_j(x)$; in this way we can separate positive and negative examples better
than a random guess. Formally, the weak learner outputs:
\[ h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases} \]
where $p_j$ is a parity term: since it appears on both sides it could be removed, but it is
used for deciding the direction of the inequality (if it is -1, the inequality flips).
The weak learners are trainable: the parameters they can learn are $p_j$ (hence the
direction of the inequality) and the threshold $\theta_j$.
13.1.3 Boosting
In order to train this model, we use the AdaBoost algorithm. Given a dataset
of face and non-face images, we start with equal weights, and then for each round of
boosting:
• evaluate each rectangle filter on each example
• select the best weak classifier, i.e. the filter and threshold with the lowest weighted error
• reweight the examples (we increase the weights of the examples that were wrongly
classified)
So at each round, the best weak classifier is found taking into account the effect of
the previously selected classifiers.
13.1.4 Cascading
So now we can train several classifiers with an increasing number of features until
the target detection rate is reached. The training is performed by re-weighting the
training examples after each stage, giving a higher weight to samples wrongly
classified in previous stages.
The cascaded classifier achieves an accuracy similar to the monolithic classifier (obtained
with the boosting step alone), but the cascade is about 10x faster.
The structure used by Viola and Jones uses 32 stages and a total of 4297 features.
249
• 5k faces: all frontal rescaled to 24x24 pixels
• 10k non-faces
The data had the following variation factors: individuals, illumination and pose.
The Viola and Jones detector can also be used for face tracking.
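In OpenCV the trained cascade is available out of the box. A minimal sketch, assuming the haarcascade_frontalface_default.xml file shipped with the library is available in the working directory, is the following.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::CascadeClassifier faceCascade;
    faceCascade.load("haarcascade_frontalface_default.xml");   // pretrained frontal-face cascade
    cv::Mat img = cv::imread("people.jpg", cv::IMREAD_GRAYSCALE);
    cv::equalizeHist(img, img);                                 // reduce illumination effects
    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(img, faces, 1.1, 3);           // scale factor, min neighbors
    for (const auto& r : faces)
        cv::rectangle(img, r, cv::Scalar(255), 2);
    cv::imwrite("detected.jpg", img);
    return 0;
}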
Figure 13.8 Profile faces were not in the training set so Viola & Jones method
failed to detect it.
Figure 13.9
250
Figure 13.10 False positives and false negatives
251
Chapter 14
The camera
What are the main elements of a camera? Shutter, sensor and camera lens.
This was the first approach used in photography and it describes the fundamentals of
how a camera creates an image of the real world.
Figure 14.1
By placing a screen with a small hole between the scene and the sensor, only a small
portion of the light rays lands on the sensor and only part of the object is imaged
(one ray per point of the object reaches the sensor).
The screen hence performs a selection of the light rays that depends on the size of
the hole. This model is called the PINHOLE CAMERA MODEL.
The smaller the pinhole the sharper the image will be.
253
14.1.1 Camera obscura
The pinhole model was used long ago. Initially it was exploited to build the
camera obscura, which was a room with a pinhole that let artists paint on a
screen what was projected onto that screen by the light coming through the
pinhole.
A camera obscura can be seen in Padua at the Museo del precinema at Prato
della Valle.
Let’s try to model a real camera using a pinhole model. We use hence a pinhole
camera and a coordinate system. Note that in the modelling we neglect some
effects.
254
The main elements of such camera are :
• The optical axis, which is a line perpendicular to the image plane and passing
through the optical center. This is a real element that we consider also in
a real camera.
• The principal point, which is the point where the image plane and the optical
axis intersect.
• The focal length, which is the distance between the optical center and the image
plane.
For instance, consider when we find a keypoint, an edge, a line or a blob: in which
reference system is the output of the algorithm given? Did we use any other reference
system? We use a different reference system whose origin is not the center of
the image plane: we set as origin pixel (0,0) the top-left corner of the image,
which is not the center of the image plane in a pinhole model.
255
14.2 Projective geometry
Figure 14.8
Now let’s add a new plane in front of the optical center that is parallel to the
image plane and at the same distance f from the optical center. From now on we
work on this plane: it is easier to consider this plane instead of the image plane since
the new plane avoids the upside-down effect.
Figure 14.9
Thus we remove the original image and start using only the plane added previously.
We use this plane for deriving the geometric description about the projection from
3D to 2D.
256
Hence looking at a pinhole camera, we need to consider only the part in the yellow
box represented in the figure 14.11. For all the discussion we consider the pinhole
camera model as camera.
Figure 14.10
Figure 14.11
We start our geometric analysis from the 'similar triangles rule', which states that
the two triangles (the pink and the green zone) have the same angles:
\[ \frac{Y_p}{y_p} = \frac{Z_p}{f} \]
The equality of the similar triangle rule holds also for the other coordinates leading
to the following PROJECTION EQUATIONS from 3D to 2D:
\[ x_p = f \, \frac{X_p}{Z_p} \tag{14.1} \]
\[ y_p = f \, \frac{Y_p}{Z_p} \tag{14.2} \]
\[ \tan(\alpha) = \frac{Y_p}{Z_p} \tag{14.3} \]
Figure 14.14
We can rearrange the previous equations into a matrix form using the homogeneous
coordinates.
258
Figure 14.15 Recall: transforming to/from homogeneous coordinates
Figure 14.16 P is the projection matrix. The vector M represents the original
image, while the vector m the virtual one: the virtual image is equal to a scale
factor of the original one.
Thus a camera is a device that converts the 3D world into a 2D image and this
operation can be described by the equation m̃ ≈ P M̃ . Thus the projection matrix
describes how the 3D world is mapped onto the image plane. In particular, for
f=1, we get the essential perspective projection P which represents the core of
the projection process:
\[ P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = [\, I \,|\, \mathbf{0} \,] \tag{14.4} \]
Now reflect:
– What measurement unit is used for distances in the 3D world?
– What measurement unit is used for distances on the projection plane?
– What measurement unit do we commonly use for distances in a digital image?
In the 3D world and on the projection plane we use the meter. While in a digital
plane we use pixels. For projection we use meter initially, but then we need to
translate it into pixels.
After projecting, we need to map the projected points onto the image plane into
pixels. Thus we need to move in the image pixel reference system: map from (x,y)
to (u,v). The transformation is defined considering the top-left corner of the
plane (x,y) (which, we recall, is the plane we added previously to avoid the
upside-down effect of the image) as the origin of the new reference system (u,v),
and considering also the coordinates of the principal point (which, we recall, is
the point where the image plane and the optical axis intersect), which we call (u0, v0).
The mapping from (x,y) to (u,v) performs a conversion from metric distances to
pixels using the pixel width w and height h. Width and height are used to define
the conversion factors:
\[ k_u = \frac{1}{w} \tag{14.5} \]
\[ k_v = \frac{1}{h} \tag{14.6} \]
Note that pixels may not be square, so we need to look at the camera specifications
to get the pixel size.
Mapping from (x,y) to (u,v) is then obtained by translation and scaling:
u = u0 + xp ku (14.7)
v = v0 + yp kv (14.8)
We can combine the two mappings: from 3D to 2D image plane, and then from
image plane to pixels. Hence, we can map from 3D to pixel coordinate directly.
We achieve this by substituting the two equations for mapping from the image
plane (x,y) to the pixel coordinates (u,v) into the projection equation reported in
the figure 12.13.
\[ u = u_0 + x_p k_u = u_0 + k_u f \, \frac{X_p}{Z_p} = u_0 + f_u \, \frac{X_p}{Z_p} \tag{14.9} \]
\[ v = v_0 + y_p k_v = v_0 + k_v f \, \frac{Y_p}{Z_p} = v_0 + f_v \, \frac{Y_p}{Z_p} \tag{14.10} \]
These relations can be written in matrix form through the camera matrix K (defined
below). The previous equation m̃ ≈ P M̃ still holds; what changes is only the formulation of P:
We moved from
\[ P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = [\, I \,|\, \mathbf{0} \,] \]
to
\[ P = \begin{bmatrix} f_u & 0 & u_0 & 0 \\ 0 & f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = K \, [\, I \,|\, \mathbf{0} \,] \tag{14.12} \]
The camera matrix K is defined as:
\[ K = \begin{bmatrix} f_u & 0 & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{14.13} \]
The matrix K depends on the INTRINSIC PARAMETERS: ku , kv , u0 , v0 , f (in-
deed fu , fv embed three parameters). These parameters define the projection
characteristics of the camera. Hence, this matrix is unique to a specific camera
and it can be used to remove distortion due to the lenses of a specific camera.
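As a sketch of how K is used in practice, the snippet below projects a 3D point to pixel coordinates with cv::projectPoints, which applies the pinhole projection (plus optional distortion). The intrinsic values are hypothetical, distortion is set to zero, and the point is assumed to be already expressed in the camera reference frame (zero rotation and translation, i.e. P = K [I|0]).

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Hypothetical intrinsic parameters fu, fv, u0, v0
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320,
                                           0, 800, 240,
                                           0,   0,   1);
    cv::Mat dist = cv::Mat::zeros(5, 1, CV_64F);          // no lens distortion in this sketch
    std::vector<cv::Point3d> points = {{0.1, 0.2, 2.0}};  // point in the camera reference frame
    cv::Mat rvec = cv::Mat::zeros(3, 1, CV_64F);          // identity rotation
    cv::Mat tvec = cv::Mat::zeros(3, 1, CV_64F);          // no translation
    std::vector<cv::Point2d> pixels;
    cv::projectPoints(points, rvec, tvec, K, dist, pixels);
    // pixels[0] = (u0 + fu * X/Z, v0 + fv * Y/Z)
    return 0;
}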
The camera can move and hence it is meaningful to have both a 3D world reference
system and a 3D camera reference system. This means that the camera provides
some observation given its own point of view and we need to convert the 3D points
of the world into the camera viewpoint. This is achieved by using 3D rotation
and translation to align the axis of the two reference systems. A rototranslation
in 3D in homogeneous coordinates is expressed as:
\[ T = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \tag{14.14} \]
The EXTRINSIC PARAMETERS are therefore 6:
• 3 for translations
• 3 for rotations
Hence, everytime we rotate or move our camera, we change these extrinsic pa-
rameters. Moreover, the plane x,y of the camera reference system is the image
plane and the camera axis z indicates the direction of the camera.
REMIND: Extrinsic is about camera, intrinsic about the projection inside the cam-
era.
Figure 14.19
262
Figure 14.20 Left image represents the camera viewpoint where the plane are com-
puted using m̃ ' P T M̃ . The right image instead represents the world viewpoint.
–> EXTRA CONTENT NOT SEEN DURING THE COURSE reported for a better
understanding.
The rototranslation is defined as $P_C = R(P_W - C) = R P_W - RC = R P_W + t$ (with
$t = -RC$), which can be written in matrix form using homogeneous coordinates:
\[
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
=
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 1 & -c_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tilde{P}_W
=
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tilde{P}_W
\tag{14.15}
\]
where (X, Y, Z) are the coordinates of $P_C$ and $\tilde{P}_W$ is the world point in homogeneous coordinates.
Figure 14.21 Alignment of the world and camera reference systems with respect
formula 14.15. C stands for camera and it is the centre of the pinhole. Pw and
Pc are the same physical point described in two different coordinate systems.
263
Figure 14.22 Example of a simple stereo system involving only translation since
the axis are already aligned in the same direction.
264
Let’s now focus on rotation forgetting for a while the translation vector (let’s
suppose the two systems are overlapping and hence no translation is needed ).
Hence our transformation system would be Pc = RPw :
\[
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
=
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\tilde{P}_W
\tag{14.16}
\]
Let’s consider the case where the world x axis (1,0,0) corresponds to camera axis
(a,b,c). In this case we can immediately write down the first column of R.
\[
\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix}
=
\begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\;\longrightarrow\;
r_{11} = a, \;\; r_{21} = b, \;\; r_{31} = c
\tag{14.17}
\]
Figure 14.23
PC = RPW → R−1 PC = PW → RT PC = PW
\[
\begin{bmatrix} r_{11} & r_{21} & r_{31} & 0 \\ r_{12} & r_{22} & r_{32} & 0 \\ r_{13} & r_{23} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
=
\tilde{P}_W
\tag{14.18}
\]
Let's consider the scenario where the camera X axis (1,0,0) corresponds to the world
axis (a,b,c). In this case we can immediately write down the first column of $R^T$,
which is the first row of R.
\[
\begin{bmatrix} r_{11} & r_{21} & r_{31} & 0 \\ r_{12} & r_{22} & r_{32} & 0 \\ r_{13} & r_{23} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix}
=
\begin{bmatrix} a \\ b \\ c \\ 1 \end{bmatrix}
\;\longrightarrow\;
r_{11} = a, \;\; r_{12} = b, \;\; r_{13} = c
\tag{14.19}
\]
Figure 14.24
Figure 14.25 Example of rotation of two coordinate system with respect to the
world reference system
266
14.2.5 Projection recap
• 4 reference systems
• 3 transformations
Figure 14.26 Recap of the different reference systems involved in the projection
process.
Figure 14.27 Projection recap. The object of the world is projected onto the image
place, which has its own reference system. Then the image plane is quantized into
the pixel frame (the pixel frame has the same size of the image plane actually. In
the image it is shown smaller just for graphical reasons). And then the camera
frame. If we move the camera the reference system of the camera frame change.
A movement of the camera or of the object can be observed at the image plane
or pixel frame.
267
14.3 Inverse projection
m̃ ' P T M̃
• Pixel quantization
• we accept as a result, the direction of the object, but not the 3D position.
Indeed we cannot reverse the z position.
• We neglect the quantization effect: the pixel location and the projected
point are considered the same. This is acceptable for high-resolution sensors.
We now replace the bare pinhole with a lens. The ideal thin lens model assumes:
• the main axis lying on the optical axis (hence the optical camera axis is
perpendicular to the lens plane);
• a thin width, so that we can neglect the lens width and consider it as a plane
where every deviation occurs.
A lens deviates light and focuses the rays in a single point. In particular:
• the rays parallel to the lens axis are deviated through the focal point (we
recall every lens has two focal points)
Figure 14.29 the rays passing by the center are not deviated while the rays parallel
to the lens axis are deviated through the focal point
The relationship between the object distance from the lens and the image distance
from the lens is expressed by the THIN LENS EQUATION:
\[ \frac{1}{d_0} + \frac{1}{d_1} = \frac{1}{f} \]
269
Figure 14.30 Thin lens equation
The focal length is a constructive element of the lens and it represents the distance at
which the virtual and real images appear: hence it indicates the distance at which
the parallel rays intersect.
The distance between sensor and lens determines the focus of a point. A lens
can focus only objects at a specific distance. Everything outside that distance
is not as sharp as the items at that distance: those points generate a circle of
confusion.
Changing the distance between lens and sensor changes which points are in focus,
but it does not cancel out the circle of confusion.
270
Figure 14.32 Changing distance btw lens and sensor
For reducing the circle of confusion we can add a barrier, which works like a
pinhole.
Pay attention that the focal length has a different meaning in the two models we saw:
• pinhole camera model: f is the distance between pinhole and sensor;
• thin lens model: f is the distance at which rays parallel to the optical axis converge.
The lens used so far is thin and ideal. However, real lenses are affected by additional
effects: distortion, chromatic aberrations and other minor effects.
14.5.1 Distortion
Distortion is a deviation from the ideal behaviour described so far. Such deviation
undergoes a pattern that can be: radial and/or tangential.
Radial distortion
In a radial distortion the entity of the distortion depends on the distance of the
distorted point from the image center.
272
Figure 14.37 Radial distortion
Radial distortion causes straight lines to appear curved. Radial distortion becomes
larger the farther points are from the center of the image. In particular, the farther
points can be radially distorted in two ways: either by pincushion distortion or by
barrel distortion.
The radial distortion can be analysed by means of distortion patterns, which can
be either experimentally measured, or analytically described if the structure of the
lens is known.
Camera models commonly consider a polynomial approximation for the radial
distortion. Given the distorted point (x, y) at radial distance r from the image
center, in OpenCV the correction $(x_{corr}, y_{corr})$ is modelled as:
\[ x_{corr} = x \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \]
\[ y_{corr} = y \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) \]
Figure 14.40 Radial distortion: the arrows represent the displacement of the point
Tangential distortion
The tangential distortion is due to the lens not being perfectly parallel to the image
plane and, in OpenCV, it is modelled as:
\[ x_{corr} = x + [\, 2 p_1 x y + p_2 (r^2 + 2 x^2) \,] \]
\[ y_{corr} = y + [\, p_1 (r^2 + 2 y^2) + 2 p_2 x y \,] \]
Chromatic aberration
The refractive index depends on the wavelength of the rays. This causes a different
bending of the rays based on their wavelength and a modification of the lens focal
length. Thus each color ray bends differently and intersects in a different position.
This results in color fringes near the image edges.
275
For inverse projection, we first compensate the distortion and then we use a non-
distorting thin lens camera model.
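A minimal OpenCV (C++) sketch of this compensation step follows; the intrinsic matrix and the distortion coefficients (k1, k2, p1, p2, k3) are hypothetical values that would normally come from calibration.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat distorted = cv::imread("distorted.jpg");
    // Hypothetical intrinsics and distortion coefficients (k1, k2, p1, p2, k3)
    cv::Mat K = (cv::Mat_<double>(3, 3) << 800, 0, 320, 0, 800, 240, 0, 0, 1);
    cv::Mat distCoeffs = (cv::Mat_<double>(5, 1) << -0.3, 0.1, 0.001, 0.001, 0.0);
    cv::Mat undistorted;
    cv::undistort(distorted, undistorted, K, distCoeffs);   // compensates radial + tangential distortion
    cv::imwrite("undistorted.jpg", undistorted);
    return 0;
}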
One parameter of a camera is the field of view. It is the angle perceived by the
camera.
Previously we considered a 3D point at angle α. The angle α, which represents the
angle at which a point is seen by the camera, can be at most 1/2 of the Field of
View (FoV). Given the sensor size d and the focal length f, the half angle of view is:
\[ \varphi = \arctan\!\left(\frac{d}{2f}\right) \]
276
Anyway, in a camera we cannot change the sensor size, but we can change the
focal length by zooming in and out (so we are changing an intrinsic parameters).
277
Figure 14.48 Dimensionality reduction machine
Figure 14.50 Our brain is used to the effect of projection. In this example the two
lines have equal length, but we perceive them differently.
278
Figure 14.51 Length can’t be trusted. Indeed everything is mapped according to
a scale factor. In the figure b and c have the same length in a 3D world but not
in the 2D image.
We remind that we use lenses over pinhole to gather more light and focus.
279
14.7.1 Perspective vs viewpoint
Focal length changes both the subject size and the perspective (e.g background).
Figure 14.53 Perspective. The left image is taken with the camera in the position
at the bottom (middle figure), while the right image with the camera closer. In
particular the left photo was taken with a high focal distance and high aperture.
The telephoto lens makes it easier to select background (a small change in view-
point is a big change in background).
Telephoto lenses are also used for getting the Vertigo effect (or Dolly zoom). This
is an effect that lets us perceive a static scene as if it were in motion. The effect is
obtained by simultaneously zooming in and moving the camera backward, or zooming
out and moving the camera forward, so that the subject keeps roughly the same size
while the perspective changes.
A sensor deals with converting photons to electrons. There are two main types of
sensor: CCD and CMOS.
CCD and CMOS do not differentiate on wavelength/photon energy: they are
essentially greyscale sensors. For getting the colors we need to opt for different
configurations:
• 3-chip color: we split the incoming light and separate the R, G and B components
so that each hits the corresponding filter+sensor.
• single-chip color: a single imager with a filter. These sensors are based on the
Bayer pattern: over every pixel there is a tiny filter that selects only one of
the R, G and B colors.
281
Figure 14.57 Direct image Foveon sensor
14.7.3 Exposure
The main aspect about taking photo is the exposure which basically defines the
amount of incoming light. The exposure is regulated by two parameters:
• aperture
• shutter speed
282
Figure 14.59 Shutter speed
Aperture
The aperture is the diameter of the lens opening and it is expressed as a fraction
of focal length f-number such as:
- f/2.0 on a 50mm: aperture = 25mm
– f/2.0 on a 100mm: aperture = 50mm
The lower the f/stop, the larger the aperture.
Why do we express the aperture in terms of focal length? Because when we increase
the focal length (zoom in), the light coming from the subject is spread over a larger
area of the sensor, so less light reaches each pixel. Thus, we need to tune the aperture
according to the focal length to compensate for the reduced amount of light.
Figure 14.60 When we double the focal length, the amount of light gathered per unit
area is divided by 4. So we need to enlarge the aperture to compensate.
The choice of the aperture determines the depth of field and the diffraction effect.
Diffraction effect
Usually we should opt for a small aperture if we neglect the exposure time.
However too small aperture leads to the diffraction effect that causes a
confused image.
283
Figure 14.61 Diffraction effect. With an aperture of 0.35mm the image is in focus,
but for smaller apertures the image gets blurred.
Depth of field
A smaller aperture increases the range in which the object is approximately
in focus. The depth of field is the depth range in which we can neglect the presence
of the circle of confusion (figure 14.62).
Portrait lenses provide a large aperture for getting a small DOF. Every lens
can achieve a large DOF, but not every lens can achieve a small one.
284
Shutter
A picture is acquired when the shutter opens and closes, exposing the sensor
to the incoming light. The exposure time is expressed in second as ratio.
There exists a reciprocity between shutter and aperture: given an amount of
light we can get the same exposure by doubling the exposure time and split
in half the aperture.
The choice of the shutter speed determines the freeze motion vs motion blur:
14.7.4 ISO
ISO defines the sensitivity of the sensor: it is useful to prevent the sensor to
get saturated (saturated pixels appear to be just white).
285
14.8 Camera calibration
We can decide to calibrate the camera for several reasons: removing the
distortion, estimating the 3D world from the camera motion, evaluating depth,
and so on.
286
Figure 14.66 Some calibration application
There are also other applications: we can detect the position of an object
such as using the ground plane as reference. Indeed camera are not used for
understanding the environment but they can be also used as measurement
tool such as for measuring the size/distance of something.
287
Several methods exists for calibrating a camera. The general process consists
in the following steps:
Figure 14.68 projected checkerboard image with the reference system of the object
288
Figure 14.69 Typical calibration image set
For K: $f_u, f_v, u_0, v_0 \rightarrow \theta_K$ (14.24)
For distortion: $k_1, k_2, k_3, p_1, p_2 \rightarrow \theta_d$ (14.25)
\[ \min_{\theta_K, \theta_d} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| K \, [\, I \,|\, \mathbf{0} \,] \, T_i \, \tilde{P}_{i,j} - \tilde{p}_{i,j} \right|^2 \tag{14.26} \]
289
details about homography can be found at ’Learning OpenCV 3’
at page 664 or at the course ’3D data processing’ held by prof. A.
Pretto.
After calibrating the camera we compute the reprojection error. The reprojection
error estimates how well the projected 3D points recreate each point's true
projection and thus it provides a quantitative measure of accuracy. We hence
calculate the absolute norm between what we got with our transformation
and the real position. To find the average error, we calculate the arithmetic
mean of the errors calculated for all the calibration images.
290
14.8.4 Sum - up
To find the intrinsic and extrinsic parameters, we must provide some sample
images of a well defined pattern (e.g. a chess board). We find some specific
points of which we already know the relative positions (e.g. square corners
in the chess board). We know the coordinates of these points in real world
space and we know the coordinates in the image, so we can solve for the
distortion coefficients. For better results, we need at least 10 test patterns.
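A minimal OpenCV (C++) calibration sketch based on chessboard images is reported below; the pattern size, square size and file names are assumptions to adapt to the actual setup.

#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

int main() {
    const cv::Size patternSize(9, 6);                 // inner chessboard corners (assumption)
    const float squareSize = 0.025f;                  // square side in meters (assumption)

    // 3D coordinates of the corners in the board reference system (Z = 0 on the board plane)
    std::vector<cv::Point3f> board;
    for (int i = 0; i < patternSize.height; ++i)
        for (int j = 0; j < patternSize.width; ++j)
            board.emplace_back(j * squareSize, i * squareSize, 0.f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    cv::Size imageSize;
    for (int n = 0; n < 15; ++n) {                    // at least ~10 views of the pattern
        cv::Mat img = cv::imread("calib_" + std::to_string(n) + ".jpg", cv::IMREAD_GRAYSCALE);
        if (img.empty()) continue;
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, patternSize, corners)) {
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
            imageSize = img.size();
        }
    }

    cv::Mat K, distCoeffs;
    std::vector<cv::Mat> rvecs, tvecs;                // one rotation and translation per image
    double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                     K, distCoeffs, rvecs, tvecs);
    // rms is the RMS reprojection error in pixels
    (void)rms;
    return 0;
}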
NOTE from lab 8: R and t are unique for each image taken, so t is a vector
containing one translation vector per image, and R contains one rotation matrix
or rotation vector per image (by using the Rodrigues formula we can pass from
rotation matrix to rotation vector and vice versa). The 3x3 camera matrix and the
distortion vector are instead unique for each camera and the same for all the images.
291
Chapter 15
Image resampling
We discuss how to change the image resolution. The image resolution
is changed by resampling the image. The resampling can be performed with
different techniques depending on the final result we aim at.
Nearest neighbor interpolation
For enlarging the image we can fill in the added pixels with the color value of
the closest original pixel. This technique has a fast execution but poor results due
to the pixelization effect.
292
Figure 15.2 Nearest neighbor interpolation example
Bilinear interpolation
The new pixel values are taken as the weighted average of the closest 4
samples. The coefficient weights depend on the distance to the samples.
Bicubic interpolation
The new pixel values are taken by considering the 16 closest samples. This
is performed by using 3rd-degree polynomials:
\[ p(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{i,j} \, x^i y^j \]
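In OpenCV the three interpolation methods above correspond to the flags of cv::resize, as in this minimal sketch (the 4x enlargement factor and the file name are arbitrary).

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("input.jpg");
    cv::Mat nn, bilinear, bicubic;
    cv::resize(img, nn,       cv::Size(), 4, 4, cv::INTER_NEAREST);  // fast, pixelization
    cv::resize(img, bilinear, cv::Size(), 4, 4, cv::INTER_LINEAR);   // weighted average of 4 samples
    cv::resize(img, bicubic,  cv::Size(), 4, 4, cv::INTER_CUBIC);    // 16 samples, cubic polynomial
    return 0;
}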
293
Figure 15.4 Bicubic interpolation
294
Chapter 16
High-level vision
Another possible task is object localization. This problem deals with both
classifying an object and localizing it. The position of the object is highlighted
by using a bounding box, which contains the required object with high probability.
Moreover, a category is associated to each bounding box.
Very often when we use object detection we are looking for only one object.
296
Figure 16.4 Semantic segmentation
Figure 16.5
The main challenges in recognizing objects are:
• perspective deformations
• illumination changes
• intra-class variations
297
Figure 16.6 Example of different camera positions. We would like to recognize it
as a car despite the different point of view.
298
Figure 16.9 Example of intra-class variations. An object can be represented with
different shapes. We would like to recognize it as a chair despite their different
shapes.
Figure 16.10
• template matching
What is a template? A template is:
• something fashioned, shaped, or designed to serve as a model
• an example
There are some challenges we need to cope with when using this method.
These difficulties come from the fact that it is difficult to find a universal
template. Indeed, every object can be affected by deformation, different
viewpoints, affine transforms (scale, rotations, translations, and so on), noise
and different illumination. Anyway, we could overcome some of these
issues by using multiple templates.
300
Figure 16.13 Human bodies can have different shapes and so we should use dif-
ferent templates.
Given an image and a template, how can we evaluate the match? We could
simply compute the difference between the image and the template. However, such
an approach does not always provide reliable results: in the figure below we can see
that the template turns out to be more similar to a completely different face.
301
There are actually far better approaches. In this course we are not covering
all of them but we focus only on the rigid template matching.
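For rigid template matching OpenCV provides cv::matchTemplate, which slides the template over the image and scores every position. A minimal sketch with normalized cross-correlation (file names are placeholders) is the following.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("scene.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat tmpl = cv::imread("template.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat result;
    cv::matchTemplate(img, tmpl, result, cv::TM_CCOEFF_NORMED);   // score map
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);
    // For TM_CCOEFF_NORMED the best match is at maxLoc (top-left corner of the template)
    cv::rectangle(img, maxLoc, maxLoc + cv::Point(tmpl.cols, tmpl.rows), cv::Scalar(255), 2);
    cv::imwrite("match.jpg", img);
    return 0;
}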
Figure 16.16
302
16.2.3 Generalized Hough Transform rigid template matching
The Hough transform approach works for more complex shapes and doesn't
require sliding the template over the image. Anyway, the parameter space
might have a high dimensionality.
The choice about which approach to use, whether Hough transform or corre-
lation, depends on how complicated the shape is (then the parameter space
has high dimensionality) and how large the image is (then the sliding window is
computationally expensive).
However, the Hough transform approach is quite robust to illumination changes.
For dealing with scale changes we can opt for one of the following:
For dealing with rotation we can opt for one of the following:
The idea is to decompose a complex object (the pattern) into a composition
of multiple elements (they can also be called features, or words). Then we look
for those elements in the image. This technique is designed to be invariant
to several factors, mainly to viewpoint and deformations.
303
Figure 16.17 Bag of words
We evaluate the contribution of every words and then the object corresponds
to the word that gave the most contribution. Hence we can use Bag of words
for classifying images.
304
1. Extract features, hence the keypoints and descriptor. We hence get
the features vector at the end of this step.
Then, given the codebook and an image we would like to analyse it: we use
a classifier such as SVM to classify the image based on the words evaluated.
305
Chapter 17
17.1.1 ML framework
306
17.1.2 Linear model and Kernel
Let's start from a very simple case, a linear model:
\[ f(x_i) = W x_i \]
where W is the vector of parameters that the ML algorithm needs to find.
However, this model cannot describe non-linear phenomena. We can overcome this
issue by using a kernel function. A kernel function maps the data into a different
space, where a linear model can describe a non-linear phenomenon. However, kernels
are often application-dependent and hence hard to generalize.
Figure 17.2 Kernel lets us map a non linear distribution of data to a linear one.
307
Take into account that DL is not always the best approach for image classi-
fication, sometime using a ML technique leads to better results.
• data are represented and classified into one single network. This is
defined as end to end learning.
308
Figure 17.4 DNN
17.2.1 Neurons
The input to the neurons of one layer are the output of the neurons in the
preceding layer. This mechanism is defined as feedforward network.
The output of a neuron is the output of an activation function applied to a
weighted sum of the inputs plus a bias. There exist different types of
activation functions.
Figure 17.5 Model of an artificial neuron. The l denotes a layer in the network.
The formula highlights the output of a neurons at a given layer l. The weights
and bias are tuned through the training process.
309
17.2.2 Activation functions
Sigmoid
It represents the natural firing of a neuron. It maps the input into a value
within the range [0,1]. This results in smoothing the input. However the
sigmoid neurons saturate and kill gradients.
Tanh
The tanh function maps $\mathbb{R} \rightarrow [-1, 1]$. It presents the same saturation
problem seen for the sigmoid function. Indeed the tanh is a scaled sigmoid:
\[ \tanh(x) = 2\,\mathrm{sigm}(2x) - 1 \]
310
Figure 17.7 Tanh function
ReLU
311
17.2.3 Overview of a fully connected NN layer and output layer
17.2.5 Training
1. Select a random subset of the training data (mini batch): a small batch
of images speed up the computation of the gradient
4. we measure the error between the desired and actual values using the
cost function.
312
6. we update the weights
Overfitting
Overfitting means that the model fits the training data perfectly and hence
cannot generalize, leading to errors on the test data.
Techniques for overcoming overfitting are:
• Early stopping: we use the validation error for deciding when training should
be stopped. Training is stopped when we have not observed any improvement
for n epochs.
The output layer depends on the classification problem we are solving. For
instance we have two classes, several representation are possible: either one
value describing the class or two values describing the score of each class.
This can be generalized to N classes.
We can use deep learning both to classification problem and regression prob-
lem. Regression problem means to estimate a continuous value such as the
height of a person.
313
Figure 17.11 Classification vs Regression
A CNN is a specific type of DNN used for dealing with images. It learns the
structure of an image: an image is a 2D matrix and the interesting infor-
mation an image brings is local, i.e the spatial neighbourhood is important.
For the spatial neighbourhood we mean that considering two pixels close to
each other then they might be related to the same object, background, con-
text. CNN makes use of some convolutional layers (not all layers) for taking
advantage of this local properties of the image.
The keypoints of a CNN are the following:
314
• for each input image we consider the receptive field, i.e a group of pixels
close to each others and analyse them together inside the convolutional
layer. Thus this represents a local processing that taking into account
the spatial neighbourhood property of an image explained before.
Figure 17.13 Partial vs fully connected layers. The partial layer is showed in the
left picture.
• in the convolutional layers the weights are shared, i.e every convo-
lutional units share the same parameters even if they operate on a
different input receptive field. This reduces the number of weights to
be trained and hence increases the performance.
Figure 17.14 1D convolution example highlighting the fact that the weight are
shared
• a CNN can compute multiple feature maps: each group of pixels (receptive
field) can be evaluated in different ways by means of different kernels. In
particular, every feature map has its own set of shared weights.
How do we compute different feature maps? In a practical scenario, we
choose the number of feature maps and the different sizes of the
convolutional kernels.
Figure 17.15 1D example of different feature maps for the same input
• a CNN also exploits some pooling layers. This kind of layer reduces the
feature maps computed by the convolutional layers, thus it performs
a dimensionality reduction. The pooling layers usually consist of max
or average kernels: max pooling is the most commonly employed. Pooling
enhances the performance and makes the NN more reliable, since it
selects the most relevant element in the image even though the resolution
is decreased. Moreover, it adds deformation invariance.
Pooling performs a feature map subsampling. This operation is defined by 2
parameters: the pool area and the pool stride.
Usually stride and area are equal, so that the pooling areas of different pools
do not overlap. Rarely they are different.
316
Figure 17.16 Pooling example with pool area and pool stride equals to 2.
Hence the framework of a CNN is: local convolutional layer -> dimensionality
reduction –> convolutional layer -> dimensionality reduction –> tiny fully
connected NN.
We use this framework for increasing the performance. Indeed we can use also
a single feature maps but it is has been proved that the previous configuration
increases the performance.
Figure 17.18 Example of convolution + pooling. The right picture is a visual
representation of a feature map.
317
Figure 17.19 Convolutional layer example. Weights are the red numbers and they
are superimposed over the image.
318
The MNIST dataset of handwritten digits is considered the 'hello world' for deep learning.
For solving the MNIST problem of recognizing handwritten digits we:
Figure 17.20 CNN used for solving the MNIST problem and the number of pa-
rameters involved
Figure 17.21 Illustration of the convolutional and pooling layer in the MNIST
problem
319
Figure 17.22 Illustration of the weights of first and second layer of the MNIST
CNN
Figure 17.23 Visual summary of an input image propagating through the CNN.
320
Figure 17.24 Visual testing and training accuracy and error.
Dataset:
• 60k images (50k training, 10k testing), 32×32 pixels
• 10 classes, 6k images per class
• High-quality images, clearly separable classes
• Color images: 3 input channels
Figure 17.25 Visual testing and training accuracy and error + CIFAR dataset
321
Figure 17.26 Visual summary of an input image propagating through the CNN.
322
17.6 Keras
Keras is a python deep learning library supporting GPU and CPU. It repre-
sents a simple interface to TensorFlow, Theano and CNTK.
Figure 17.28
Figure 17.29
3. we compile the model. Compiling means to set the parameter for the
training process.
323
Figure 17.30
Figure 17.31
Figure 17.32
6. Generate prediction
Figure 17.33
• convolutional + RELU
• pooling
• fully connected
• softmax activation
324
Figure 17.34 Typical CNN
Figure 17.35 2D convolutional layer definition with all their possible arguments
initialized at their default value
As conv2D it takes a long list of arguments but we need to declare only those
we need.
325
Figure 17.37 maxpooling2D
326
17.8 Compiling a model in Keras
Compiling a model requires specifying:
• an optimizer
• a loss function
• a list of metrics
17.8.1 Optimizer
The loss function is also called objective function or optimization score function.
The loss function measures the distance between the actual and the desired output
and drives the training process.
The type of loss function depends on the task. Good starting points for the
different tasks are:
17.8.3 Metric
A metric function is similar to a loss function, except that the results from
evaluating a metric are not used when training the model but it is used to
judge the performance of the model.
328
Figure 17.43 List of metrics
17.9 Colab
329
Chapter 18
Notes
18.1 HSV
In the HSV color representation the hue range is 0-360, but in OpenCV it is
divided by 2, i.e. 0-180. This is because the HSV image has to fit into three 8-bit
channels. S and V are instead in the range [0, 255].
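A minimal OpenCV (C++) sketch of the conversion and of a color-based mask is the following; the hue/saturation/value bounds are just an arbitrary example range.

#include <opencv2/opencv.hpp>

int main() {
    cv::Mat bgr = cv::imread("input.jpg");           // OpenCV loads color images as BGR
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);       // H in [0,180), S and V in [0,255]
    cv::Mat mask;
    // Example range for low hues (remember that 360 degrees are mapped to 0-180)
    cv::inRange(hsv, cv::Scalar(0, 100, 100), cv::Scalar(10, 255, 255), mask);
    cv::imwrite("mask.png", mask);
    return 0;
}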
18.2 Contours
Figure 18.2 Contours example storing all the (x,y) coordinates of the boundary of a
shape versus storing only some of them.
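A minimal OpenCV (C++) sketch of contour extraction follows; CHAIN_APPROX_NONE keeps every boundary point while CHAIN_APPROX_SIMPLE keeps only some of them, which is the difference illustrated by the figure. The threshold value and file name are placeholders.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("shapes.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat binary;
    cv::threshold(img, binary, 128, 255, cv::THRESH_BINARY);     // contours need a binary image
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    cv::Mat vis = cv::Mat::zeros(img.size(), CV_8UC3);
    cv::drawContours(vis, contours, -1, cv::Scalar(0, 255, 0), 2);
    cv::imwrite("contours.png", vis);
    return 0;
}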
Any unauthorized distribution (digitally or on paper) is expressly prohibited
without prior authorization from the author of these notes