Lecture 1 Intro
Introduction
Chen-Change Loy
吕健勤
https://www.mmlab-ntu.com/person/ccloy/
https://www.mmlab-ntu.com/
Outline
• Get familiarized with PyTorch and OpenMMLab for developing deep learning
applications
• Design and train deep learning models for solving different computer vision
applications
Course hours
• Lecture + tutorial/lab
• Three hours per week
• Every Friday 6:30pm – 9:30pm
• Venue
• Hybrid - Face-to-face at LT4 and Zoom
• Course materials
• Course notes of the particular week will be available on NTULearn before the lecture
• The recording of the lecture will be available after each lecture. You will be able to see the videos under 'Course Media'.
Course outline
Week Date Topic Lecturer
1 14 Jan Introduction Loy Chen Change
2 21 Jan Convolutional Neural Networks Loy Chen Change
3 28 Jan Training Convolutional Neural Networks Loy Chen Change
4 4 Feb Object Detection Loy Chen Change
5 11 Feb Image Segmentation Liu Ziwei
6 18 Feb Transformers Loy Chen Change
7 25 Feb Autoencoder Liu Ziwei
Recess Week
8 11 Mar Generative Adversarial Networks Liu Ziwei
9 18 Mar Image Super-Resolution Loy Chen Change
10 25 Mar Image Editing Loy Chen Change
11 1 Apr Open-World Visual Recognition Guest Speaker
12 8 Apr Unsupervised Representation Learning Loy Chen Change
13 14 Apr Creating Small and Efficient Networks Loy Chen Change
Instructors
Loy Chen Change
• PhD, 2010, Queen Mary University of London
• Post-doctoral Fellow, 2011, Queen Mary University of London; Vision Semantics
• Research Assistant Professor, 2013, The Chinese University of Hong Kong
• Visiting Scholar, Chinese Academy of Sciences
Research areas
• Image restoration and processing
• Image editing and manipulation
• Detection, segmentation and recognition
• Deep learning
• Face analysis
Email: ccloy@ntu.edu.sg
Office: By email appointment
Instructors
Liu Ziwei
• PhD, 2017, The Chinese University of Hong Kong
• Post-doctoral Fellow, 2017, UC Berkeley
• Post-doctoral Fellow, 2018, The Chinese University of Hong Kong
Research areas
• Deep fashion understanding
• Detection, segmentation and recognition
• Generative adversarial networks
• Computer graphics
• Domain adaptation and long-tailed recognition
Email: Ziwei.liu@ntu.edu.sg
Office: By email appointment
Assessment
• Homework: 20%
• Paper reading and oral presentation (Group): 20%
• Projects (Individual): 40%
• Quiz: 20%
Homework
• Two to three short assignments
• Late submissions will be penalized at 5% per day, for up to 3 days
Paper reading and oral presentation
• Group of up to two members
• Instructions, paper list: TBA
• Choose one paper
• Report your choice by week 9
• Study and find more related work; find connections
• Focus presentation on ideas; not too detailed
• Total 15 min/team
• Record your videos and submit
• Professors are your audience
Projects
• Project 1
• Handout on 18 Feb (Week 6)
• Deadline on 25 Mar (Week 10)
• Project 2
• Handout on 18 Mar (Week 9)
• Deadline on 22 Apr (Week 14)
Projects
• Python is the recommended programming language
• PC with at least 1 GPU is recommended
• Projects are to be done individually
• Each student should submit their final report (in .pdf) and code (in a .zip file) individually, via their own NTULearn account, before the deadline.
• Late submissions will be penalized at 5% per day, for up to 3 days
• Assessment criteria will be indicated in the handout
Quiz
• To be updated
Instructors (TAs)
• To be updated
Where to ask questions
• Post your queries on the 'Discussion Board' so we can learn from each other.
• If you have questions that you would like a more private answer, you can send
an email to ai6126@e.ntu.edu.sg.
References
• Deep Learning (http://www.deeplearningbook.org/)
• Computer Vision: Algorithms and Applications (http://szeliski.org/Book/)
Outline
Machine Learning
Computer Vision
What is computer vision? Computer vision is a multidisciplinary field that could broadly be called a subfield of artificial
intelligence and machine learning.
What is computer vision?
In particular, computer vision enables computers and systems to derive meaningful information from digital images, videos
and other visual inputs
Computer vision works much the same as human vision, except humans have a head start.
Human sight has the advantage of lifetimes of context to train how to tell objects apart, how far away they are, whether they
are moving and whether there is something wrong in an image.
Computer vision trains machines to perform these functions, but it has to do it with cameras, data and algorithms rather than
retinas, optic nerves and a visual cortex.
Predictions
How does computer vision work? Computer vision typically needs lots of data. It needs to be 'trained' over and over until it discerns distinctions and ultimately recognizes images.
Chihuahua
or
Muffin?
For example, to train a computer to distinguish a chihuahua from a muffin, it needs to be fed vast quantities of chihuahua and muffin images to learn the differences and recognize them correctly.
You might have heard the term image processing. How is computer vision different from image processing?
Image processing is the process of restoring or enhancing an image, for example by changing its brightness or contrast. It is just a type of digital signal processing and is not concerned with understanding the content of an image.
But nowadays, to do well in image processing, one often needs to understand the content of an image, for instance to know where the face is and then perform a specific enhancement. So we still need to apply computer vision in many cases.
Let’s talk about some history. Scientists and engineers have been trying to develop ways for machines to see and understand
visual data for about 50 years.
Experimentation began in 1959 when neurophysiologists Hubel and Wiesel recorded electrical activity from individual neurons
in the brains of cats
They showed a cat an array of images, attempting to correlate a response in its brain.
They discovered that it responded first to hard edges or lines, and scientifically, this meant that image processing starts with
simple shapes like straight edges.
At about the same time, the first computer image scanning technology was developed, enabling computers to digitize and
acquire images.
In the 1960s, AI emerged as an academic field of study, and it also marked the beginning of the AI quest to solve the human
vision problem.
In 1958, inspired by the structure of biological cells, Frank Rosenblatt created the perceptron, an algorithm for pattern recognition based on a two-layer computer learning network using simple addition and subtraction.
A neural network is nothing but a collection of many interconnected neurons arranged in a hierarchical manner.
In 1982, neuroscientist David Marr established that vision works hierarchically and introduced algorithms for machines to
detect edges, corners, curves and similar basic shapes.
Concurrently, computer scientist Kunihiko Fukushima developed a network of cells that could recognize patterns. The network,
called the Neocognitron, included convolutional layers in a neural network.
In 1998, a type of neural network in which the parameters are shared spatially was pioneered by Yann LeCun. It was applied by several banks to recognize hand-written numbers on checks.
By 2001, the first real-time face detection application appeared.
There have been many more developments in the field since then, including better methods to extract features, such as SIFT, and the use of machine learning techniques for solving tasks like object recognition and detection.
Despite the progress, scientists quickly came to realise that tasks that are easily, or even unconsciously, done by humans are very difficult for a computer, and vice versa.
But why is it so challenging? Human perception comes with pre-existing knowledge about objects and geometry.
Seeing a 2D image, one can easily distinguish between foreground and background objects and effortlessly recognise an object
even if it is subject to occlusions, background clutter or deformations.
But computers cannot distinguish easily between object pixels and background pixels. They perceive digital images not as
something continuous with semantic information, but just as a series of discrete numerical values.
In addition, computer vision systems are usually very sensitive to variations such as scale, viewpoint, illumination and intra-class differences.
About 10 years ago, things changed thanks to the ImageNet dataset, deep convolutional neural networks, and powerful GPUs.
In 2010, the ImageNet dataset became available. It contained millions of tagged images across a thousand object classes and provided a foundation for training the CNNs and deep learning models used today.
In 2012, a team from the University of Toronto used a deep neural network in the ImageNet image recognition contest. The
model, called AlexNet, significantly reduced the error rate for image recognition.
After this breakthrough, error rates have fallen to just a few percent. And computer vision has started to work very well in
many applications
A key driver for the growth of these applications is the flood of visual information flowing from smartphones, security systems,
traffic cameras and other visually instrumented devices.
One important application of computer vision is self-driving vehicles. The development of self-driving vehicles relies on
computer vision to make sense of the visual input from a car’s cameras and other sensors like LiDAR.
It’s essential to identify other cars, traffic signs, lane markers, pedestrians, bicycles and all of the other visual information
encountered on the road. The process involves many computer vision basic tasks.
The first one is image classification, which aims to classify an image as, say, a dog, an apple, or a person's face. More precisely, the goal is to accurately predict that a given image belongs to a certain class.
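To make the task concrete, here is a minimal classification sketch using a pretrained torchvision model. This is illustrative only: the ResNet-18 choice, the image file name and the exact weights API are my assumptions, not course material, and API details vary with the torchvision version.

```python
# Minimal image-classification sketch with a pretrained torchvision model.
# Assumes torchvision >= 0.13 and an arbitrary test image "dog.jpg".
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()  # inference mode: no dropout / batch-norm updates

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("dog.jpg").convert("RGB")      # hypothetical input image
batch = preprocess(image).unsqueeze(0)            # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                         # shape: (1, 1000)
    probs = logits.softmax(dim=1)
    top_prob, top_class = probs.max(dim=1)

print(f"predicted ImageNet class index: {top_class.item()} "
      f"(confidence {top_prob.item():.2f})")
```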
Another task is object detection. The goal is to locate the presence of objects with a bounding box and predict types or classes
of the located objects in an image.
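Similarly, a hedged sketch of object detection with an off-the-shelf pretrained detector from torchvision; the Faster R-CNN model, image path and score threshold are assumptions for illustration, not a prescribed recipe.

```python
# Object-detection sketch: boxes, class labels and confidence scores per image.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           FasterRCNN_ResNet50_FPN_Weights)

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

image = Image.open("street.jpg").convert("RGB")   # hypothetical input image
tensor = transforms.ToTensor()(image)             # detector expects [0, 1] tensors

with torch.no_grad():
    # Input is a list of images; output is one dict per image with
    # 'boxes' (x1, y1, x2, y2), 'labels' (class indices) and 'scores'.
    detections = model([tensor])[0]

keep = detections["scores"] > 0.8                 # keep confident detections only
for box, label, score in zip(detections["boxes"][keep],
                             detections["labels"][keep],
                             detections["scores"][keep]):
    print(f"class {label.item()} at {box.tolist()} (score {score:.2f})")
```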
Semantic segmentation is needed too. The goal is to classify all the pixels of an image into meaningful classes of objects. These
classes are “semantically interpretable” and correspond to real-world categories.
[Figure: semantic segmentation. Input: single RGB image. Label each pixel in the image with a category label (Sky, Trees, Cat, Cow, Grass).]
For instance, you could isolate all the pixels associated with a cat and color them in yellow. This is also known as dense
prediction because it predicts the meaning of each pixel.
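As a rough sketch of dense prediction in code (again an assumption on my part, using a pretrained torchvision DeepLabV3 model and a hypothetical image path rather than anything course-specific), each pixel is assigned the class with the highest score:

```python
# Semantic-segmentation sketch: one class label per pixel (dense prediction).
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import (deeplabv3_resnet50,
                                              DeepLabV3_ResNet50_Weights)

model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
image = Image.open("cat_on_grass.jpg").convert("RGB")  # hypothetical input image
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]          # shape: (1, num_classes, H, W)

# Dense prediction: argmax over the class dimension gives a per-pixel label map
# (e.g. class 8 is 'cat' in the PASCAL VOC label set used by this model).
mask = logits.argmax(dim=1)[0]
print(mask.shape, mask.unique())
```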
Another task is known as keypoint detection, which involves detecting people and localizing their key points simultaneously.
Keypoints are spatial locations or points in the image that define what is interesting or what stands out in the image.
I would like to show you a few more interesting applications of computer vision. The first one is video inpainting: we can select a region in a video and make it disappear while filling in the missing pixels of the background.
[Figure: a blurry, low-resolution input and the enhanced image]
The second application is known as image super-resolution. In Hollywood movies, you have seen the amazing capability to 'zoom and enhance' images, obtaining finer and finer details until the critical piece of evidence necessary to put the bad guy away appears. In reality, such an infinite zoom is hardly practical.
[Figure: LR input (heavily compressed) and SR output (1024x1024)]
But now, with new technologies, even when the input image is heavily degraded, we can use deep learning to recover the details to a certain extent.
[Figures: LR inputs (heavily compressed; 134x134) and SR outputs (1024x1024)]
BasicVSR Series
During training, the network takes an image and makes a prediction about what it is 'seeing'. It uses the labels as supervision and compares them with its predictions.
[Figure: the network predicts 'Cat' but the label is 'Dog'; the prediction is wrong, so the weights are adjusted]
The neural network checks the accuracy of its predictions over a series of iterations and adjusts its parameters, or weights, whenever it makes a mistake.
[Figure: the network predicts 'Dog' and the label is 'Dog'; the prediction is correct]
This process continues until the predictions become accurate, at which point the network can recognize objects correctly.
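The predict, compare, adjust-the-weights cycle described above is exactly what a supervised training loop does. Below is a minimal PyTorch sketch; the random toy data, the two-class cat/dog setup and the hyper-parameters are illustrative assumptions, not the course's actual code.

```python
# Minimal supervised training loop: predict, compare with the label,
# then adjust the weights when the prediction is wrong (illustrative only).
import torch
import torch.nn as nn

# Toy stand-ins for real data: 64 random 3x32x32 "images", labels 0 = cat, 1 = dog.
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 2, (64,))

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 2),                              # two logits: cat vs. dog
)
criterion = nn.CrossEntropyLoss()                   # compares prediction with label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    logits = model(images)                          # make a prediction
    loss = criterion(logits, labels)                # how wrong is it?
    optimizer.zero_grad()
    loss.backward()                                 # compute gradients
    optimizer.step()                                # adjust the weights
    acc = (logits.argmax(dim=1) == labels).float().mean()
    print(f"epoch {epoch}: loss={loss.item():.3f}, accuracy={acc:.2f}")
```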
Much like a human making out an image at a distance, a deep convolutional network first discerns hard edges and simple
shapes in the shallower layers, then it starts to recognize parts and attributes in the deeper layers of the network
Computer vision is still not fully solved. There are still a lot of exciting research topics.
With further research on and refinement of the technology, the future of computer vision will see it perform a broader range of functions.
Not only will computer vision models be easier to train but also be able to discern more from images than they do now.
While computer vision brings a lot of convenience to humankind, we also need to beware of malicious uses of the technology, such as DeepFakes.
We will see one more example here.
We have seen many applications of computer vision. In the future, computer vision will also play a vital role in the development of artificial general intelligence (AGI) by giving machines the ability to process visual information as well as, or even better than, the human visual system.
Image Credits
• OpenPose
• MIT News, How the Brain Distinguishes between Objects
• Juan Hernandez, Making AI Interpretable with Generative Adversarial Networks
Game Playing
Deep Learning
Enabling machines to acquire knowledge and skills induced from massive data
Autonomous Driving
Robo-advisor, Finance Trading Bot
Identity Authentication
Using SenseTime's face recognition technology, more than 400 million people registered and completed facial identity authentication in 2016.
Safe City
Entertainment
Autonomous driving
Results of StyleGAN
Credit: Tero Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019
Almost Everything can be Faked!
CycleGAN
[Zhu et al., ICCV 2017]
StarGAN
[Choi et al., CVPR 2018]
MUNIT
[Huang et al., ECCV 2018]
High-quality and fine-grained face editing
[Figures: face editing examples: add age, add smile, add beard, add hair]
High-quality and high-resolution exemplar-based portrait style transfer
[Figures: input images and exemplar-based style transfer results]
• Why learning?
• Statistical learning
• Supervised learning
• Empirical risk minimization (using polynomial regression as an example)
• Underfitting and overfitting
• Bias-variance dilemma
What do you see? How do we do that?!
• image recognition
• speech processing
• natural language processing
• robotic control
• ... and many others.
???
Why learning?
Extracting semantic information requires models of high complexity, which
cannot be designed by hand.
However, one can write a program that learns the task of extracting semantic
information.
Supervised learning is usually concerned with the two following inference problems: classification and regression.
We do not know how well an algorithm will work in practice (the expected risk) because we do not know the true distribution of the data the algorithm will work on.
Consider a loss function ℓ: 𝒴×𝒴 → ℝ such that ℓ(𝑦, 𝑓(𝐱)) ≥ 0 measures how close the prediction 𝑓(𝐱) is to 𝑦.
We are looking for a function 𝑓 ∈ ℱ with a small expected risk (or generalization error)
$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim P(X, Y)}\left[\ell(y, f(\mathbf{x}))\right].$$
This means that for a given data generating distribution 𝑃(𝑋, 𝑌) and for a given hypothesis space ℱ, the optimal model is
$$f_* = \arg\min_{f \in \mathcal{F}} R(f).$$
Empirical risk minimization
Unfortunately, since 𝑃(𝑋, 𝑌) is unknown, the expected risk cannot be evaluated and the
optimal model cannot be determined.
However, given training data $\mathbf{d} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, N\}$ drawn from $P(X, Y)$, we can compute an estimate, the empirical risk (or training error):
$$\hat{R}(f, \mathbf{d}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i)).$$
This estimate can be used for finding an approximation of $f_*$. This results in the empirical risk minimization principle:
$$f_*^{\mathbf{d}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f, \mathbf{d}).$$
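As a small illustration (my own sketch, not from the slides), the empirical risk is simply an average loss over the dataset; here it is in NumPy with a squared error loss and a toy candidate model, both assumed for the example.

```python
# Empirical risk: average loss of a model f over a finite dataset d.
import numpy as np

def empirical_risk(f, xs, ys, loss):
    """R_hat(f, d) = (1/N) * sum of loss(y_i, f(x_i)) over the dataset."""
    return np.mean([loss(y, f(x)) for x, y in zip(xs, ys)])

squared_error = lambda y, y_hat: (y - y_hat) ** 2

# Toy dataset and a toy candidate model (assumed for illustration).
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.8])
f = lambda x: x            # a simple candidate model

print(empirical_risk(f, xs, ys, squared_error))
```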
Empirical risk minimization
Most machine learning algorithms, including neural networks, implement
empirical risk minimization.
Under mild assumptions, the empirical risk minimizer converges to the optimal model as the number of training samples grows:
$$\lim_{N \to \infty} f_*^{\mathbf{d}} = f_*$$
Polynomial regression
Consider the joint probability distribution 𝑃(𝑋, 𝑌) induced by the data generating process
(𝑥, 𝑦) ∼ 𝑃(𝑋, 𝑌) ⇔ 𝑥 ∼ 𝑈[−10, 10], 𝜀 ∼ 𝒩(0, 𝜎²), 𝑦 = 𝑔(𝑥) + 𝜀
where 𝑥 ∈ ℝ, 𝑦 ∈ ℝ and 𝑔 is an unknown polynomial of degree 3.
Polynomial regression
Our goal is to find a function 𝑓 that makes good predictions on average over
𝑃(𝑋, 𝑌).
Consider the hypothesis space of polynomials of degree 3, $f \in \mathcal{F}$, defined through their parameters $\mathbf{w} \in \mathbb{R}^4$ such that
$$\hat{y} \triangleq f(x; \mathbf{w}) = \sum_{d=0}^{3} w_d x^d$$
Polynomial regression
For this regression problem, we use the squared error loss
$$\ell(y, f(x; \mathbf{w})) = \left(y - f(x; \mathbf{w})\right)^2.$$
Minimizing the empirical risk with this loss is ordinary least squares, and the empirical risk minimizer has the closed-form solution
$$\mathbf{w}_*^{\mathbf{d}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y},$$
where $\mathbf{X}$ is the design matrix of the training inputs and $\mathbf{y}$ the vector of training targets.
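A small sketch of the whole pipeline under the slide's setup (the specific cubic g, the noise level and the sample size are assumptions for illustration): generate data, build the design matrix, and solve the least-squares problem.

```python
# Polynomial regression by empirical risk minimization (least squares).
import numpy as np

rng = np.random.default_rng(0)

# Data-generating process from the slides: x ~ U[-10, 10], y = g(x) + noise,
# with g an (unknown) degree-3 polynomial; the coefficients below are assumed.
g = lambda x: 0.5 * x**3 - 2.0 * x**2 + x + 3.0
N, sigma = 100, 20.0
x = rng.uniform(-10, 10, size=N)
y = g(x) + rng.normal(0, sigma, size=N)

# Design matrix with columns [1, x, x^2, x^3]; w has 4 parameters.
X = np.vander(x, N=4, increasing=True)

# Closed-form ERM solution for the squared error loss: w* = (X^T X)^-1 X^T y
# (computed with lstsq for numerical stability).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w_star
print("estimated coefficients:", np.round(w_star, 2))
print("training (empirical) risk:", np.mean((y - y_hat) ** 2))
```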
We define the Bayes risk as the minimal expected risk over all possible functions,
$$R_B = \min_{f \in \mathcal{Y}^{\mathcal{X}}} R(f),$$
and call the model $f_B$ that achieves this minimum the Bayes model.
The capacity of a hypothesis space loosely describes how large and varied the set of functions it can represent is. In practice, capacity can be controlled through hyper-parameters of the learning algorithm. For example:
• The degree of the family of polynomials;
• The number of layers in a neural network;
• The number of training iterations;
• Regularization terms.
Under-fitting and over-fitting
• If the capacity of $\mathcal{F}$ is too low, then $f_B \notin \mathcal{F}$ and $R(f) - R_B$ is large for any $f \in \mathcal{F}$, including $f_*$ and $f_*^{\mathbf{d}}$. Such models $f$ are said to underfit the data.
• If the capacity of $\mathcal{F}$ is too high, then $f_B \in \mathcal{F}$ or $R(f_*) - R_B$ is small. However, because of the high capacity of the hypothesis space, the empirical risk minimizer $f_*^{\mathbf{d}}$ could fit the training data arbitrarily well, such that
$$\underbrace{R(f_*^{\mathbf{d}})}_{\text{expected risk}} \;\geq\; \underbrace{R_B}_{\text{Bayes risk}} \;\geq\; \underbrace{\hat{R}(f_*^{\mathbf{d}}, \mathbf{d})}_{\text{empirical risk}} \;\geq\; 0.$$
• In this situation, 𝑓∗𝐝 becomes too specialized with respect to the true data generating
process and a large reduction of the empirical risk (often) comes at the price of an increase
of the expected risk of the empirical risk minimizer 𝑅(𝑓∗𝐝 ). In this situation, 𝑓∗𝐝 is said to
overfit the data.
Under-fitting and over-fitting
Can you draw the generalization error and training error curves?
Under-fitting and over-fitting
Therefore, our goal is to adjust the capacity of the hypothesis space such that the expected
risk of the empirical risk minimizer gets as low as possible.
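To see this numerically, the sketch below (an assumed synthetic setup, consistent with the polynomial example above) sweeps the polynomial degree, i.e. the capacity, and compares the training error with the error on independent data: low degrees underfit, high degrees overfit.

```python
# Sweep model capacity (polynomial degree) and compare training vs. held-out error.
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: 0.5 * x**3 - 2.0 * x**2 + x + 3.0   # assumed "unknown" cubic

def sample(n, sigma=20.0):
    x = rng.uniform(-10, 10, size=n)
    return x, g(x) + rng.normal(0, sigma, size=n)

x_train, y_train = sample(30)
x_test, y_test = sample(1000)     # independent data, used only for evaluation

for degree in [1, 3, 9, 15]:
    X_train = np.vander(x_train, N=degree + 1, increasing=True)
    X_test = np.vander(x_test, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_err = np.mean((y_train - X_train @ w) ** 2)
    test_err = np.mean((y_test - X_test @ w) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:10.1f}, "
          f"test MSE = {test_err:10.1f}")
```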
Under-fitting and over-fitting
When over-fitting,
$$R(f_*^{\mathbf{d}}) \geq R_B \geq \hat{R}(f_*^{\mathbf{d}}, \mathbf{d}) \geq 0.$$
This indicates that the empirical risk $\hat{R}(f_*^{\mathbf{d}}, \mathbf{d})$ is a poor estimator of the expected risk $R(f_*^{\mathbf{d}})$.
Nevertheless, an unbiased estimate of the expected risk can be obtained by evaluating $f_*^{\mathbf{d}}$ on data $\mathbf{d}_{\text{test}}$ independent from the training samples $\mathbf{d}$:
$$\hat{R}(f_*^{\mathbf{d}}, \mathbf{d}_{\text{test}}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}_{\text{test}}} \ell(y_i, f_*^{\mathbf{d}}(\mathbf{x}_i)).$$
This test error estimate can be used to evaluate the actual performance of the model.
However, it should not be used, at the same time, for model selection.
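One common protocol (sketched here as an assumption, not the course's prescribed recipe) is to use a separate validation split for model selection and to touch the held-out test split only once, for the final risk estimate.

```python
# Evaluation protocol sketch: validation split for model selection
# (choosing the polynomial degree), test split for the final risk estimate.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-10, 10, size=300)
y = 0.5 * x**3 - 2.0 * x**2 + x + 3.0 + rng.normal(0, 20.0, size=300)

perm = rng.permutation(len(x))
tr, va, te = perm[:180], perm[180:240], perm[240:]   # train / val / test split

def fit(degree, idx):
    X = np.vander(x[idx], N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    return w

def mse(w, degree, idx):
    X = np.vander(x[idx], N=degree + 1, increasing=True)
    return np.mean((y[idx] - X @ w) ** 2)

# Model selection: pick the degree with the lowest validation error.
candidates = [(mse(fit(d, tr), d, va), d) for d in range(1, 10)]
best_degree = min(candidates)[1]

# Final performance estimate: evaluate the selected model once on the test set.
w_best = fit(best_degree, tr)
print("selected degree:", best_degree)
print("test risk estimate:", mse(w_best, best_degree, te))
```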
Evaluation protocol
Under-fitting and over-fitting
Bias-variance dilemma
• Whenever we discuss model prediction, it is important to understand prediction errors (bias and variance).
• Gaining a proper understanding of these errors helps us not only to build accurate models but also to avoid the mistakes of overfitting and underfitting.
Bias-variance dilemma
Bias
• Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
• A model with high bias pays very little attention to the training data and oversimplifies the model.
• This always leads to high error on both training and test data.
Variance
• Variance is the variability of the model's prediction for a given data point, i.e. how spread out the predictions are.
• A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
• As a result, such models perform very well on training data but have high error rates on test data.
Bias-variance dilemma
Bias-variance decomposition: Consider a fixed, unseen point $x$ and the prediction $\hat{Y} = f_*^{\mathbf{d}}(x)$ of the empirical risk minimizer at $x$. Taking the expectation over training sets $\mathbf{d}$,
$$\mathbb{E}_{\mathbf{d}}\left[\left(y - f_*^{\mathbf{d}}(x)\right)^2\right] = \sigma^2 + \operatorname{Bias}_{\mathbf{d}}\!\left(f_*^{\mathbf{d}}(x)\right)^2 + \operatorname{Var}_{\mathbf{d}}\!\left(f_*^{\mathbf{d}}(x)\right)$$
where
• the first term is an irreducible noise term with zero mean and variance $\sigma^2$;
• the second term is a bias term that measures the discrepancy between the average model and the Bayes model, $\operatorname{Bias}_{\mathbf{d}}\left(f_*^{\mathbf{d}}(x)\right) = \mathbb{E}_{\mathbf{d}}\left[f_*^{\mathbf{d}}(x)\right] - f_B(x)$;
• the third term is a variance term that quantifies the variability of the predictions, $\operatorname{Var}_{\mathbf{d}}\left(f_*^{\mathbf{d}}(x)\right) = \mathbb{E}_{\mathbf{d}}\left[f_*^{\mathbf{d}}(x)^2\right] - \mathbb{E}_{\mathbf{d}}\left[f_*^{\mathbf{d}}(x)\right]^2$.
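The decomposition can also be checked numerically. The sketch below (toy setup assumed, reusing the synthetic cubic from before) resamples many training sets d from the same process, refits a low-capacity model on each, and estimates the bias squared and variance of its prediction at a fixed point.

```python
# Numerically estimate bias^2 and variance of f_*^d(x0) at a fixed point x0
# by resampling many training sets from the same data-generating process.
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: 0.5 * x**3 - 2.0 * x**2 + x + 3.0    # noise-free target (Bayes model)
sigma, n_train, degree, x0 = 20.0, 30, 1, 5.0      # degree-1 fit => high bias

predictions = []
for _ in range(2000):                               # many training sets d
    x = rng.uniform(-10, 10, size=n_train)
    y = g(x) + rng.normal(0, sigma, size=n_train)
    X = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    predictions.append(np.vander([x0], N=degree + 1, increasing=True) @ w)

predictions = np.array(predictions).ravel()
bias_sq = (predictions.mean() - g(x0)) ** 2         # (E_d[f(x0)] - f_B(x0))^2
variance = predictions.var()                        # E_d[f(x0)^2] - E_d[f(x0)]^2
print(f"bias^2 = {bias_sq:.1f}, variance = {variance:.1f}, noise = {sigma**2:.1f}")
```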
Bias-variance dilemma
Bias-variance trade-off - Question
• If our model is too simple and has very few parameters, will it have high or low bias, and high or low variance?
• If our model has a large number of parameters, will it have high or low bias, and high or low variance?
Bias-variance dilemma
Bias-variance trade-off
• Reducing the capacity makes 𝑓∗𝐝 fit the data less on average, which increases the bias term.
• Increasing the capacity makes 𝑓∗𝐝 vary a lot with the training data, which increases the variance
term.
Bias-variance dilemma
Bias and variance explained using a bull's-eye diagram