CSCI218: Foundations of Artificial Intelligence
Human Vision System
Robot Vision System
Image Formation
Simple Image Feature
Image Color Histogram
Simple Image Feature
Edge
Simple Image Feature
Texture (e.g., Gray-Level Co-Occurrence Matrix (GLCM))
- Characterises how often pairs of pixels with specific values and in a specified spatial relationship occur in an image
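The co-occurrence counting described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `glcm`, the offset parameters `dx`/`dy`, and the 4-level toy image are all assumptions made for the example.

```python
import numpy as np

def glcm(image, dx=1, dy=0, levels=4):
    """Count how often gray level i has gray level j at offset (dy, dx)."""
    image = np.asarray(image)
    counts = np.zeros((levels, levels), dtype=np.int64)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Only count pairs whose second pixel lies inside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r, c], image[r2, c2]] += 1
    return counts

# Hypothetical 4x4 image with gray levels 0..3.
toy = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])

# Spatial relationship: "the pixel immediately to the right".
glcm_right = glcm(toy, dx=1, dy=0, levels=4)
print(glcm_right)
```

Texture statistics such as contrast or homogeneity are typically computed from the normalised version of this matrix.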
Simple Image Feature
Optical Flow: Whenever there is relative movement between the camera and one or more
objects in the scene, the resulting apparent motion in the image is called optical flow.
Simple Image Feature
Segmentation of natural images
Classifying Images
Important sources of appearance variation
Classifying Images
Why convolutional neural networks classify images well
Detecting Objects
Faster RCNN for object detection
The 3D World
Binocular stereopsis
Using Computer Vision
Understanding what people are doing
Using Computer Vision
Automated image captioning
Using Computer Vision
Visual question-answering
Using Computer Vision
Reconstruction from many views
Using Computer Vision
Geometry from a single view
Using Computer Vision
Making pictures
Using Computer Vision
Image Transformation (Paired)
Using Computer Vision
Image Transformation (Unpaired)
Using Computer Vision
Image Transformation (Style transfer)
Using Computer Vision
Image Generation (by GAN)
Using Computer Vision
Controlling movement with vision
Using Computer Vision
Navigation
Image Analysis
§ Overview of Image Analysis
§ Collecting and Representing Image
§ Image Recognition
§ Bag-of-Visual-Words model
§ Deep Convolutional Neural Networks
Overview of Image Analysis
§ Image analysis
§ Refers to the representation, processing, and modelling of visual data to
derive useful insights
§ Suffers from the semantic gap
§ Visual data (image, video, …) is unstructured
§ Semantic gap
§ The gap between the high-level concepts used by humans and the low-level
features used by computers
Overview of Image Analysis
§ Image recognition (in a narrow sense)
§ Image classification
§ Object detection, localisation, tracking
§ Scene segmentation and reconstruction
§ Image search and retrieval
Overview of Image Analysis
§ Image classification
Examples: face recognition, OCR, scene recognition, object recognition
Overview of Image Analysis
§ Object detection, localisation, tracking
Object detection and localization
Object tracking (https://www.youtube.com/watch?v=dKpRsdYSCLQ)
Overview of Image Analysis
§ Scene segmentation and reconstruction
[Farabet et al. PAMI 2013]
http://twd20g.blogspot.com.au/2011/12/this-work-presents-novel-system-that.html
https://www.3dflow.net/elementsCV/S4.xhtml
Image Analysis Steps
§ Collection and labelling
§ Collect representative images from a given task and label the ground
truth
§ Image representation
§ Select and/or design appropriate image representations (invariant and
discriminative)
§ Image analysis techniques
§ Apply and/or design appropriate analysis techniques for the given tasks
(classification, detection, tracking, segmentation, etc.)
Representing Image
§ Why is representing images difficult?
§ Scale, rotation, illumination, occlusion, background clutter, deformation, …
§ Invariant and Discriminative representation
Cat:
Representing Image
§ Traditional representation (before year 2000)
§ Hand-crafted, global features
§ Intensity, colour, texture, shape, structure, etc.
Examples: colour histogram in an RGB space; face recognition with raw pixel intensities
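A global colour histogram of the kind mentioned above can be sketched as follows. The function name `colour_histogram`, the choice of 4 bins per channel, and the random stand-in image are assumptions for illustration only.

```python
import numpy as np

def colour_histogram(image, bins_per_channel=4):
    """Joint RGB histogram of an H x W x 3 image, normalised to sum to 1."""
    pixels = np.asarray(image).reshape(-1, 3)
    hist, _ = np.histogramdd(
        pixels,
        bins=(bins_per_channel,) * 3,
        range=[(0, 256)] * 3,
    )
    return hist.ravel() / hist.sum()

# Random 8-bit image standing in for a real photograph.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3))

feature = colour_histogram(image)          # 4 * 4 * 4 = 64-dimensional vector
rotated = colour_histogram(np.rot90(image))
print(feature.shape, np.allclose(feature, rotated))
```

Because the histogram discards pixel positions, it is unchanged by rotating the image, which illustrates the invariance that a representation aims for; the same property also limits how discriminative it can be.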
Representing Image
§ Days of the BoVW model (2000 ~ 2012)
§ SIFT, HOG, SURF, CENTRIST, filter-based, …
§ Invariant to view angle, scale, illumination, ...
SIFT (Scale-Invariant Feature Transform)
http://www.robots.ox.ac.uk/~vgg/software/
Image courtesy of David Lowe, IJCV04
Deep Learning Model
Convolutional Neural Networks (CNNs)
§ A special multi-stage architecture inspired by visual system
§ Higher stages compute more global, more invariant features
Deep Learning Model
https://www.datasciencecentral.com/lenet-5-a-classic-cnn-architecture/
Convolution
§ For standard 2D convolution:
Filter
§ The stride is 1.
§ The height and width are changed as:
$$W_{\mathrm{out}} = \frac{W_{\mathrm{in}} - W_{\mathrm{filter}}}{\mathrm{Stride}} + 1 = (5 - 3)/1 + 1 = 3.$$
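The 5 × 5 input and 3 × 3 filter case above can be checked with a naive sliding-window implementation. This is a sketch under stated assumptions: the function name `conv2d_valid` and the averaging filter are illustrative, and, as is usual in CNNs, the filter is applied without flipping (strictly a cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image with no padding ("valid" mode)."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            # Element-wise product of patch and filter, then sum.
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25).reshape(5, 5)   # toy 5 x 5 input
kernel = np.ones((3, 3)) / 9.0        # 3 x 3 averaging filter
feature_map = conv2d_valid(image, kernel, stride=1)
print(feature_map.shape)              # spatial size after convolution
```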
Convolution
We need Zero-Padding to keep image size:
The width/height will become:
$$W_{\mathrm{out}} = \frac{W_{\mathrm{in}} - W_{\mathrm{filter}} + 2 \times \mathrm{Padding}}{\mathrm{Stride}} + 1$$
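The output-size formula with padding can be evaluated directly. A minimal sketch, assuming the helper name `conv_output_size` and the 5 × 5 input / 3 × 3 filter example used earlier:

```python
import numpy as np

def conv_output_size(w_in, w_filter, stride=1, padding=0):
    """Width (or height) of the output: (W_in - W_filter + 2*Padding) / Stride + 1."""
    return (w_in - w_filter + 2 * padding) // stride + 1

no_padding = conv_output_size(5, 3, stride=1, padding=0)
with_padding = conv_output_size(5, 3, stride=1, padding=1)

# Zero-padding a 5 x 5 image by one pixel on every side gives a 7 x 7 array.
padded = np.pad(np.ones((5, 5)), pad_width=1, mode="constant", constant_values=0)
print(no_padding, with_padding, padded.shape)
```

With a 3 × 3 filter and stride 1, a padding of 1 keeps the output the same size as the input.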
Convolution Layers
In convolution layers:
§ Filters are called Kernels and become 3D. The parameters of
kernels (i.e., weights) are to be learned.
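Kernel 1, …, Kernel N, each of size $K_h \times K_w \times C_{\mathrm{in}}$
Input: $H \times W \times C_{\mathrm{in}}$ → Output: $H \times W \times C_{\mathrm{out}}$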
Convolution Layers
In convolution layers:
§ Feature maps are the outputs of each layer. The number of
feature maps is the number of channels.
Feature map 1, …, Feature map N
Input: $H \times W \times C_{\mathrm{in}}$ → Output: $H \times W \times C_{\mathrm{out}}$
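The relationship between 3D kernels and feature maps can be made concrete with a naive convolution layer. This is a sketch only: the function name `conv_layer`, the array layout (height, width, channels), and the random weights are assumptions, and no padding is used, so the spatial size shrinks here.

```python
import numpy as np

def conv_layer(x, kernels):
    """x: (H, W, C_in); kernels: (N, K_h, K_w, C_in). Stride 1, no padding."""
    h, w, c_in = x.shape
    n, kh, kw, kc = kernels.shape
    assert kc == c_in, "each kernel must span all input channels"
    out = np.zeros((h - kh + 1, w - kw + 1, n))
    for k in range(n):                      # one feature map per kernel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i + kh, j:j + kw, :]
                out[i, j, k] = np.sum(patch * kernels[k])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))           # input with 3 channels
kernels = rng.standard_normal((4, 3, 3, 3))  # 4 kernels of size 3 x 3 x 3

feature_maps = conv_layer(x, kernels)
num_weights = kernels.size                   # learnable weights, biases omitted
print(feature_maps.shape, num_weights)
```

Four kernels produce four feature maps, so the output has four channels regardless of how many channels the input had.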
Convolutional Neural Networks
§ Multi-stage Architecture
Convolution
Non-linearity
Pooling
Convolutional Neural Networks
Convolution
- A set of filters convolve with the input
- Share weights across the input space (translation equivariance)
Input
Filters
Feature Map
Convolutional Neural Networks
Non-linearity
Sigmoid: $f(x) = 1 / (1 + e^{-x})$
Tanh: $f(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$
ReLU: $f(x) = \max(x, 0)$
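The three non-linearities can be written directly from their definitions. A minimal NumPy sketch; the function names and the sample input are chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # values in (0, 1)
print(tanh(x))      # values in (-1, 1)
print(relu(x))      # negative inputs clipped to 0
```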
Convolutional Neural Networks
Spatial pooling
§ Non-overlapping / overlapping regions
§ Max or sum
§ Invariance to small transformations
Max pooling
Sum/Average pooling
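Non-overlapping max and average pooling can be sketched with a reshape. The function name `pool2d`, the 2 × 2 window, and the toy feature map are assumptions for the example; overlapping pooling would need a sliding window instead.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size regions of a 2D feature map."""
    fm = np.asarray(feature_map, dtype=float)
    h, w = fm.shape[0] // size, fm.shape[1] // size
    # Split the map into (h, size, w, size) blocks; drop any ragged border.
    blocks = fm[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

fm = np.array([[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]])

max_pooled = pool2d(fm, size=2, mode="max")
avg_pooled = pool2d(fm, size=2, mode="average")
print(max_pooled)
print(avg_pooled)
```

Small shifts of a strong response within a pooling region leave the max-pooled output unchanged, which is the source of the invariance to small transformations.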
Deep Learning Model
CNNs: ImageNet Breakthrough
[Krizhevsky et al. NIPS 2012]
● Krizhevsky et al. won the 2012 ImageNet classification challenge with a much bigger ConvNet
○ deeper: 7 stages vs 3 before
○ larger: 60 million parameters vs 1 million before
○ 16.4% error (top-5) vs Next best 26.2% error
● This was made possible by:
○ fast hardware: GPU-optimized code
○ big dataset: 1.2 million images vs thousands before
○ better regularization: dropout, etc.
Image courtesy of Deng et al.
Deep Learning Model
Learned Features of CNNs
[Matthew D. Zeiler et al. ECCV 2014]
Deep Learning Model
Object detection (Source: Rich feature hierarchies for accurate object detection and semantic
segmentation, CVPR 2014)
Face Recognition (Source: DeepFace: Closing the Gap to Human-Level Performance in Face Verification,
CVPR 2014)
Deep Learning Model
§ Directly use pre-trained CNNs
§ Which layer to use?
§ How to pool the features in a convolutional layer?
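One common answer to the pooling question is to collapse the spatial dimensions of a convolutional layer's output into a single vector per image. The sketch below assumes a 7 × 7 × 512 activation tensor; the shape and the random values are stand-ins, not the output of any particular pre-trained network.

```python
import numpy as np

# Stand-in for the activations of one convolutional layer: (H, W, C).
rng = np.random.default_rng(0)
activations = rng.standard_normal((7, 7, 512))

# Global average pooling: mean response of each channel over all positions.
avg_descriptor = activations.mean(axis=(0, 1))

# Global max pooling: strongest response of each channel.
max_descriptor = activations.max(axis=(0, 1))

print(avg_descriptor.shape, max_descriptor.shape)
```

Either vector can then be fed to a conventional classifier, such as a linear SVM, as the image representation.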
Deep Learning Model
§ Directly use pre-trained CNNs
§ Which layer to use?
Convolutional layer
Fully connected layer
Deep Learning Model
§ Fine-tune pre-trained CNNs
§ To incorporate extra information from the images of a
new recognition task
§ Make the pre-trained CNNs adapt to this new task
Pre-trained CNNs → fine-tune on → new recognition task
Image courtesy of Deng et al.
http://people.csail.mit.edu/bzhou/
Summary
§ Computer vision is a key component of AI
§ Image analysis is an important and broad area
§ Feature representation is key for image analysis
§ Deep Learning techniques are now widely used
Acknowledgement
The lecture slides are based on the materials from ai.berkeley.edu
Thank you. Questions?