
AI6126 Advanced Computer Vision

Last update: 13 Jan 2022 5:00 pm

Introduction
Chen-Change Loy
吕健勤

https://www.mmlab-ntu.com/person/ccloy/

https://www.mmlab-ntu.com/
Outline

• About this course


• Computer vision background and a little history
• Applications and success
• Fundamentals of machine learning
• Why learning?
• Statistical learning
About This Course
Course objectives
• Understand deep learning models such as convolutional networks and
generative adversarial networks and how they are essential for different
computer vision tasks

• Become familiar with PyTorch and OpenMMLab for developing deep learning applications

• Design and train deep learning models for solving different computer vision
applications
Course hours
• Lecture + tutorial/lab
• Three hours per week
• Every Friday 6:30pm – 9:30pm

• Venue
• Hybrid - Face-to-face at LT4 and Zoom

• Course materials
• Course notes of the particular week will be available on NTULearn before the lecture
• A recording of each lecture will be available after the lecture. You will be able to see the videos under 'Course Media'.
Course outline
Week Date Topic Lecturer
1 14 Jan Introduction Loy Chen Change
2 21 Jan Convolutional Neural Networks Loy Chen Change
3 28 Jan Training Convolutional Neural Networks Loy Chen Change
4 4 Feb Object Detection Loy Chen Change
5 11 Feb Image Segmentation Liu Ziwei
6 18 Feb Transformers Loy Chen Change
7 25 Feb Autoencoder Liu Ziwei
Recess Week
8 11 Mar Generative Adversarial Networks Liu Ziwei
9 18 Mar Image Super-Resolution Loy Chen Change
10 25 Mar Image Editing Loy Chen Change
11 1 Apr Open-World Visual Recognition Guest Speaker
12 8 Apr Unsupervised Representation Learning Loy Chen Change
13 14 Apr Creating Small and Efficient Networks Loy Chen Change
Instructors
Loy Chen Change

• PhD, 2010, Queen Mary University of London
• Post-doctoral Fellow, 2011, Queen Mary University of London / Vision Semantics
• Research Assistant Professor, 2013, The Chinese University of Hong Kong
• Nanyang Associate Professor, 2018, Nanyang Technological University
• Adjunct Associate Professor, The Chinese University of Hong Kong
• Visiting Scholar, Chinese Academy of Sciences

Research areas
• Image restoration and processing
• Image editing and manipulation
• Detection, segmentation and recognition
• Deep learning
• Face analysis

Email: ccloy@ntu.edu.sg
Office: By email appointment
Instructors
Liu Ziwei

• PhD, 2017, The Chinese University of Hong Kong
• Post-doctoral Fellow, 2017, UC Berkeley
• Post-doctoral Fellow, 2018, The Chinese University of Hong Kong
• Nanyang Assistant Professor, 2020, Nanyang Technological University

Research areas
• Deep fashion understanding
• Detection, segmentation and recognition
• Generative adversarial networks
• Computer graphics
• Domain adaptation and long-tailed recognition

Email: Ziwei.liu@ntu.edu.sg
Office: By email appointment
Assessment
• Homework: 20%
• Paper reading and oral presentation (Group): 20%
• Projects (Individual): 40%
• Quiz: 20%
Homework
• Two to three short assignments
• Late submissions will be penalized (5% per day, up to 3 days)
Paper reading and oral presentation
• Group of up to two members
• Instructions, paper list: TBA
• Choose one paper
• Report your choice by week 9
• Study and find more related work; find connections
• Focus presentation on ideas; not too detailed
• Total 15 min/team
• Record your videos and submit
• Professors are your audience
Projects
• Project 1
• Handout on 18 Feb (Week 6)
• Deadline on 25 Mar (Week 10)

• Project 2
• Handout on 18 Mar (Week 9)
• Deadline on 22 Apr (Week 14)
Projects
• Python is the recommended programming language
• PC with at least 1 GPU is recommended
• Projects are to be done individually
• Each student should submit the final report (in .pdf) and code (in a .zip file) individually, via their NTULearn account, before the deadline
• Late submissions will be penalized (5% per day, up to 3 days)
• Assessment criteria will be indicated in the handout
Quiz
• To be updated
Instructors (TAs)
• To be updated
Where to ask questions
• Post your queries on the 'Discussion Board' so we can learn from each other.

• If you have questions that you would like answered more privately, you can send an email to ai6126@e.ntu.edu.sg.

• If possible, please use the discussion forum on NTULearn instead of sending an email.
References
Deep Learning

http://www.deeplearningbook.org/
References
Computer Vision: Algorithms and Applications

http://szeliski.org/Book/
Outline

• About this course


• Computer vision background and a little history
• Applications and success
• Fundamentals of machine learning
• Why learning?
• Statistical learning
Learn:
What is computer vision?
The history of computer vision
Computer vision tasks
How does computer vision work?
What is computer vision?
(Diagram: Computer Vision is a subfield of Machine Learning, which in turn is a subfield of Artificial Intelligence.)

What is computer vision? Computer vision is a multidisciplinary field that could broadly be called a subfield of artificial
intelligence and machine learning.
What is computer vision?

Computer vision enables machines to see, observe and understand.


What is computer vision?

In particular, computer vision enables computers and systems to derive meaningful information from digital images, videos and other visual inputs.
Computer vision works much the same as human vision, except humans have a head start.
Human sight has the advantage of lifetimes of context to train how to tell objects apart, how far away they are, whether they
are moving and whether there is something wrong in an image.
Computer vision trains machines to perform these functions, but it has to do it with cameras, data and algorithms rather than
retinas, optic nerves and a visual cortex.
(Figure: a computer vision model producing predictions from input images.)
How does computer vision work? Computer vision typically needs lots of data. It needs to be 'trained' over and over until it discerns distinctions and ultimately recognizes images.
(Figure: a computer vision model deciding between 'Chihuahua' and 'Muffin'.)
For example, to train a computer to distinguish a chihuahua from a muffin, it needs to be fed vast quantities of chihuahua and muffin images to learn the differences and recognize them correctly.
You might have heard the term image processing. How is computer vision different from image processing?
Image processing is the process of restoring or enhancing an image, for example by changing its brightness or contrast. It is a form of digital signal processing and is not concerned with understanding the content of an image.
But nowadays, to do well in image processing, one often needs to understand the content of an image, for instance, to know where the face is before performing a special enhancement. So we still need to apply computer vision in many cases.
Let’s talk about some history. Scientists and engineers have been trying to develop ways for machines to see and understand
visual data for about 50 years.
Experimentation began in 1959 when neurophysiologists Hubel and Wiesel recorded electrical activity from individual neurons
in the brains of cats
They showed a cat an array of images, attempting to correlate a response in its brain.
They discovered that it responded first to hard edges or lines, and scientifically, this meant that image processing starts with
simple shapes like straight edges.
At about the same time, the first computer image scanning technology was developed, enabling computers to digitize and
acquire images.
In the 1960s, AI emerged as an academic field of study, and it also marked the beginning of the AI quest to solve the human
vision problem.
In 1958, inspired by the biological cell structure, Frank Rosenblatt created the perceptron, an algorithm for pattern recognition based on a two-layer computer learning network using simple addition and subtraction.
A neural network is nothing but a collection of many interconnected neurons arranged in a hierarchical manner.
In 1982, neuroscientist David Marr established that vision works hierarchically and introduced algorithms for machines to
detect edges, corners, curves and similar basic shapes.
Concurrently, computer scientist Kunihiko Fukushima developed a network of cells that could recognize patterns. The network,
called the Neocognitron, included convolutional layers in a neural network.
In 1998, a type of neural network was pioneered by Yann LeCun, in which the parameters are shared spatially. It was applied by several banks to recognize hand-written numbers on checks.
By 2001, the first real-time face detection application appeared.
There were many more developments in the field thereafter, including better methods to extract features (such as SIFT) and the use of machine learning techniques for solving tasks like object recognition and detection.
Despite the progress, scientists quickly came to realise that tasks that are easily, or even unconsciously, done by humans are very difficult for a computer, and vice versa.
But why is it so challenging? Human perception comes with pre-existing knowledge about objects and geometry.
Seeing a 2D image, one can easily distinguish between foreground and background objects and effortlessly recognise an object
even if it is subject to occlusions, background clutter or deformations.
But computers cannot distinguish easily between object pixels and background pixels. They perceive digital images not as
something continuous with semantic information, but just as a series of discrete numerical values.
In addition, computer vision systems are usually very sensitive to variations such as scale, viewpoint or illumination, as well as to intra-class differences.
About 10 years ago, things changed thanks to the ImageNet dataset, deep convolutional neural networks, and powerful GPUs.
In 2010, the ImageNet dataset became available. It contained millions of tagged images across a thousand object classes and provided a foundation for training the CNNs and deep learning models used today.
In 2012, a team from the University of Toronto used a deep neural network in the ImageNet image recognition contest. The
model, called AlexNet, significantly reduced the error rate for image recognition.
After this breakthrough, error rates fell to just a few percent, and computer vision started to work very well in many applications.
A key driver for the growth of these applications is the flood of visual information flowing from smartphones, security systems,
traffic cameras and other visually instrumented devices.
One important application of computer vision is self-driving vehicles. The development of self-driving vehicles relies on
computer vision to make sense of the visual input from a car’s cameras and other sensors like LiDAR.
It’s essential to identify other cars, traffic signs, lane markers, pedestrians, bicycles and all of the other visual information
encountered on the road. The process involves many computer vision basic tasks.
The first one is image classification, which aims to classify an image as, say, a dog, an apple, or a person's face. More precisely, the goal is to accurately predict that a given image belongs to a certain class.
Another task is object detection. The goal is to locate the presence of objects with a bounding box and predict types or classes
of the located objects in an image.
(Figure: semantic segmentation. Input: a single RGB image. Output: every pixel labelled with a category such as Sky, Trees, Cat, Cow or Grass.)
Semantic segmentation is needed too. The goal is to classify all the pixels of an image into meaningful classes of objects. These
classes are “semantically interpretable” and correspond to real-world categories.
For instance, you could isolate all the pixels associated with a cat and color them in yellow. This is also known as dense
prediction because it predicts the meaning of each pixel.
Another task is known as keypoint detection, which involves detecting people and localizing their key points simultaneously.
Keypoints are spatial locations or points in the image that define what is interesting or what stands out in the image.
I would like to show you a few more interesting applications in computer vision. The first one is video inpainting. We can select a region in a video and make it disappear, while at the same time filling in the missing pixels of the background.
(Figure: a blurry, low-resolution input and its enhanced output image.)
The second application is known as image super-resolution. In Hollywood movies, you have seen the amazing capability of
“zoom and enhance” images to obtain finer and finer details until they have the critical piece of evidence necessary to put the
bad guy away. In reality such an infinite zoom is practically hard.
LR input (heavily compressed) SR output (1024x1024)

But now, with new technologies, even when the input image is heavily degraded, we can use deep learning to recover the details to a certain extent.
LR input (heavily compressed) SR output (1024x1024)

You can see more examples here


LR input (74x74)

SR output (1024x1024)
LR input (134x134)

SR output (1024x1024)
BasicVSR Series

Of course, we can do it for videos as well


Another application is called image-to-image translation. Here you can change the style of an image to some pre-defined
styles. For instance, converting a photo taken in the summer to have a winter style.
The last application I want to show you is image generation.
Pay attention to all these images. Are they real or are they fake images?
They are all fake images generated by a type of neural network called a Generative Adversarial Network.
Image generation has many applications and it is one of the most popular and exciting fields of research in computer vision.
Not just face images, now we can generate synthetic images of different kinds, like cats.
How does computer vision work? To train a deep convolutional network for object classification, we need to prepare training
data with labels. That is, we will have a lot of images and each of them is annotated with a class label.
Depending on the tasks we want to solve, we will need to prepare data with the corresponding type of annotation.
(Figure: a deep convolutional network takes an image labelled 'Dog' and predicts 'Cat'.)

During the training, it takes an image and makes a prediction about what it is “seeing.” It uses the labels as supervision and compares them with its predictions.
(Figure: the prediction 'Cat' does not match the label 'Dog'; the prediction is marked wrong and the network adjusts its weights.)

The neural network checks the accuracy of its predictions in a series of iterations and adjusts its parameters, or weights, when it makes a mistake.
(Figure: the network now predicts 'Dog' for the image labelled 'Dog'; the prediction is marked correct.)

This process continues until the predictions become consistently correct. It can then recognize the objects accurately.
Much like a human making out an image at a distance, a deep convolutional network first discerns hard edges and simple shapes in the shallower layers, then it starts to recognize parts and attributes in the deeper layers of the network.
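To make this training loop concrete, here is a minimal PyTorch-style sketch. The `model` and `train_loader` are hypothetical placeholders (any image classifier and any loader of labelled mini-batches); this is an illustrative toy loop, not the pipeline used in the course projects.

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=10, lr=1e-3):
    """Minimal supervised training sketch (illustrative only)."""
    criterion = nn.CrossEntropyLoss()               # compares predictions with labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        for images, labels in train_loader:
            logits = model(images)                  # forward pass: make a prediction
            loss = criterion(logits, labels)        # how wrong is the prediction?

            optimizer.zero_grad()
            loss.backward()                         # compute gradients of the error
            optimizer.step()                        # adjust the weights
    return model
```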
Computer vision is still not fully solved. There are still a lot of exciting research topics.
With further research on and refinement of the technology, the future of computer vision will see it perform a broader range of functions.
Not only will computer vision models be easier to train but also be able to discern more from images than they do now.
While computer vision brings a lot of convenience to humankind, we also need to beware of malicious uses of the technology such as DeepFake.
We will see one more example here.
We have seen many applications of computer vision. In the future, computer vision will also play a vital role in the development of artificial general intelligence (AGI) by giving machines the ability to process visual information as well as, or even better than, the human visual system.
Image Credits

OpenPose

MIT News,
How the Brain Distinguishes between Objects

Juan Hernandez,
Making AI Interpretable with Generative Adversarial Networks

Ethan Yanjia Li,


10 Papers You Should Read to Understand Image Classification in the Deep Learning Era
Outline

• About this course


• Computer vision background and a little history
• Applications and success
• Fundamentals of machine learning
• Why learning?
• Statistical learning
Applications and Success
Deep learning enables AI breakthroughs
Facial Recognition

Voice Recognition Image Recognition

Game Playing

Deep Learning
Enabling machines to acquire knowledge and skills induced from massive data

Natural Language Processing

Autonomous Driving
Robo-advisor, Finance Trading Bot
Identity Authentication

Using SenseTime's face recognition technology, more than 400 million people registered and completed facial identity authentication in 2016.
Safe City
Entertainment
Entertainment
Entertainment
Autonomous driving
Results of StyleGAN

Credit: Tero Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019
Almost Everything can be Faked!

CycleGAN
[Zhu et al., ICCV 2017]

StarGAN
[Choi et al., CVPR 2018]
MUNIT
[Huang et al., ECCV 2018]
High-quality and fine-grained face editing

Add age

Add age
High-quality and fine-grained face editing

Add smile

Add smile
High-quality and fine-grained face editing

Add beard

Add beard
High-quality and fine-grained face editing

Add hair

Add hair
High-quality and high-resolution exemplar-based portrait style transfer

Input Image

Different Style Images


High-quality and high-resolution exemplar-based portrait style transfer

Input Image
High-quality and high-resolution exemplar-based portrait style transfer

Cartoon Caricature Anime


Relighting of humans under novel illuminations with free poses and viewpoints, requiring only posed human videos under unknown lighting for training.
Reconstructing garments from 3D point clouds and simulating realistic garment dynamics.
Driving 3D characters to dance with music, outperforming existing methods in terms of the quality of the generated dances, the diversity of motions and the alignment with the rhythm of the music.

Breaking: facile toprock

House dance: energetic up-down

Ballet jazz: pirouette

Middle hip-hop: casual steps


3D human structure and texture estimation from a single image
Outline

• About this course


• Computer vision background
• A little history
• ImageNet and breakthrough
• Applications and success
• Fundamentals of machine learning
• Why learning?
• Statistical learning
Fundamentals of
Machine Learning

Credit: Gilles Louppe, “Lecture 1: Fundamentals of machine learning”, 2019.


Outline
Set the fundamentals of machine learning

• Why learning?
• Statistical learning
• Supervised learning
• Empirical risk minimization (using polynomial regression as an example)
• Underfitting and overfitting
• Bias-variance dilemma
What do you see? How do we do that?!

Exceptional generalization ability


Sheepdog or mop?

Exceptional discriminative ability


Chihuahua or muffin?

Exceptional discriminative ability


Why learning?
The automatic extraction of semantic information from raw signal is at the core
of many applications, such as

• image recognition
• speech processing
• natural language processing
• robotic control
• ... and many others.

How can we write a computer program that implements that?


Why learning?
The (human) brain is so good at interpreting visual information that the gap between raw data and its semantic interpretation is difficult to assess intuitively.
Why learning?
Extracting semantic information requires models of high complexity, which
cannot be designed by hand.

However, one can write a program that learns the task of extracting semantic
information.

Techniques used in practice consist of:


• defining a parametric model with high capacity,
• optimizing its parameters, by "making it work" on the training data.
Supervised learning
Consider an unknown joint probability distribution $P(X, Y)$.

Assume training data
$$(\mathbf{x}_i, y_i) \sim P(X, Y), \quad \text{with } \mathbf{x}_i \in \mathcal{X},\ y_i \in \mathcal{Y},\ i = 1, \dots, N.$$

• In most cases,
• $\mathbf{x}_i$ is a $p$-dimensional vector of features or descriptors,
• $y_i$ is a scalar (e.g., a category or a real value).
• The training data is generated i.i.d.
• The training data can be of any finite size $N$.
• In general, we do not have any prior information about $P(X, Y)$.

Note: the i.i.d. assumption is often made for training datasets to imply that all samples stem from the same generative process and that the generative process is assumed to have no memory of past generated samples.
Supervised learning
Inference

Supervised learning is usually concerned with the two following inference problems:

Classification: Given $(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y} = \mathbb{R}^p \times \{1, \dots, C\}$, for $i = 1, \dots, N$, we want to estimate for any new $\mathbf{x}$,
$$\arg\max_{y} P(Y = y \mid X = \mathbf{x}).$$

Regression: Given $(\mathbf{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y} = \mathbb{R}^p \times \mathbb{R}$, for $i = 1, \dots, N$, we want to estimate for any new $\mathbf{x}$,
$$\mathbb{E}[Y \mid X = \mathbf{x}].$$
Empirical risk minimization
Our objective is to train a good model.

How to define whether a model is good or bad?

We do not know how well an algorithm will work in practice (the expected risk) because we do not know the true distribution of the data the algorithm will work on.

What we can do is measure the model's performance on a known set of training data (the empirical risk / training error).
Empirical risk minimization
Consider a function $f: \mathcal{X} \to \mathcal{Y}$ produced by some learning algorithm. The predictions of this function can be evaluated through a loss
$$\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R},$$
such that $\ell(y, f(\mathbf{x})) \geq 0$ measures how close the prediction $f(\mathbf{x})$ is to $y$.

Examples of loss functions

Classification (0-1 loss): $\ell(y, f(\mathbf{x})) = \mathbf{1}_{y \neq f(\mathbf{x})}$

Regression (squared error): $\ell(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$
Empirical risk minimization
Let $\mathcal{F}$ denote the hypothesis space, i.e. the set of all functions $f$ that can be produced by the chosen learning algorithm.

We are looking for a function $f \in \mathcal{F}$ with a small expected risk (or generalization error)
$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim P(X, Y)}\left[\ell(y, f(\mathbf{x}))\right].$$

This means that for a given data-generating distribution $P(X, Y)$ and for a given hypothesis space $\mathcal{F}$, the optimal model is
$$f_* = \arg\min_{f \in \mathcal{F}} R(f).$$
Empirical risk minimization
Unfortunately, since $P(X, Y)$ is unknown, the expected risk cannot be evaluated and the optimal model cannot be determined.

However, if we have i.i.d. training data $\mathbf{d} = \{(\mathbf{x}_i, y_i) \mid i = 1, \dots, N\}$, we can compute an estimate, the empirical risk (or training error)
$$\hat{R}(f, \mathbf{d}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i)).$$

This estimate can be used for finding an approximation of $f_*$. This results in the empirical risk minimization principle:
$$f_*^{\mathbf{d}} = \arg\min_{f \in \mathcal{F}} \hat{R}(f, \mathbf{d}).$$
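As a small illustration, here is a sketch in plain Python/NumPy of computing the empirical risk for the two example losses above; the function and variable names are chosen for this example only.

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """Classification loss: 1 if the prediction differs from the label, else 0."""
    return float(y != y_pred)

def squared_loss(y, y_pred):
    """Regression loss: squared difference between label and prediction."""
    return (y - y_pred) ** 2

def empirical_risk(f, data, loss):
    """Average loss of a model f over a finite training set d = [(x_i, y_i), ...]."""
    return np.mean([loss(y, f(x)) for x, y in data])

# Example: a toy regression model f(x) = 2x evaluated on three samples.
d = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(empirical_risk(lambda x: 2 * x, d, squared_loss))
```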
Empirical risk minimization
Most machine learning algorithms, including neural networks, implement
empirical risk minimization.

Under regularity assumptions, empirical risk minimizers converge:
$$\lim_{N \to \infty} f_*^{\mathbf{d}} = f_*.$$
Polynomial regression

Consider the joint probability distribution $P(X, Y)$ induced by the data-generating process
$$(x, y) \sim P(X, Y) \iff x \sim U[-10; 10],\ \varepsilon \sim \mathcal{N}(0, \sigma^2),\ y = g(x) + \varepsilon,$$
where $x \in \mathbb{R}$, $y \in \mathbb{R}$ and $g$ is an unknown polynomial of degree 3.
Polynomial regression
Our goal is to find a function $f$ that makes good predictions on average over $P(X, Y)$.

Consider the hypothesis space $f \in \mathcal{F}$ of polynomials of degree 3, defined through their parameters $\mathbf{w} \in \mathbb{R}^4$ such that
$$\hat{y} \triangleq f(x; \mathbf{w}) = \sum_{d=0}^{3} w_d x^d.$$
Polynomial regression
For this regression problem, we use the squared error loss
$$\ell(y, f(x, \mathbf{w})) = \left(y - f(x, \mathbf{w})\right)^2$$
to measure how wrong the predictions are.

Therefore, our goal is to find the best value $\mathbf{w}_*$ such that
$$\mathbf{w}_* = \arg\min_{\mathbf{w}} R(\mathbf{w}) = \arg\min_{\mathbf{w}} \mathbb{E}_{(x, y) \sim P(X, Y)}\left[\left(y - f(x, \mathbf{w})\right)^2\right].$$
Polynomial regression
Given a large enough training set $\mathbf{d} = \{(x_i, y_i) \mid i = 1, \dots, N\}$, the empirical risk minimization principle tells us that a good estimate $\mathbf{w}_*^{\mathbf{d}}$ of $\mathbf{w}_*$ can be found by minimizing the empirical risk:
$$\mathbf{w}_*^{\mathbf{d}} = \arg\min_{\mathbf{w}} \frac{1}{N} \sum_{(x_i, y_i) \in \mathbf{d}} \left(y_i - f(x_i, \mathbf{w})\right)^2.$$
Polynomial regression
This is ordinary least squares regression, for which the solution is known analytically:
$$\mathbf{w}_*^{\mathbf{d}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}.$$

The expected risk minimizer $\mathbf{w}_*$ within our hypothesis space is $g$ itself.

Therefore, on this toy problem, we can verify that
$$f(x; \mathbf{w}_*^{\mathbf{d}}) \to f(x; \mathbf{w}_*) = g(x) \quad \text{as } N \to \infty.$$
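A minimal NumPy sketch of this toy problem follows; the ground-truth cubic `g` and the noise level are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    """Assumed ground-truth cubic (for illustration only)."""
    return 0.5 * x**3 - 2.0 * x**2 + x + 3.0

def sample(N, sigma=20.0):
    """Draw N samples from the data-generating process y = g(x) + noise."""
    x = rng.uniform(-10, 10, size=N)
    return x, g(x) + rng.normal(0, sigma, size=N)

def fit_degree3(x, y):
    """Closed-form OLS fit w = (X^T X)^{-1} X^T y, via a stable least-squares solver."""
    X = np.vander(x, N=4, increasing=True)    # columns [1, x, x^2, x^3]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

x, y = sample(N=1000)
print(fit_degree3(x, y))   # approaches the coefficients of g as N grows
```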
Polynomial regression
Let's see what happens if we have different numbers of training samples.
(Plots: the fitted polynomial for training sets of increasing size N.)
More training samples help us to find a better function!
$$f(x; \mathbf{w}_*^{\mathbf{d}}) \to f(x; \mathbf{w}_*) = g(x) \quad \text{as } N \to \infty.$$
Under-fitting and over-fitting

What if we consider a hypothesis space ℱ in which candidate functions 𝑓 are


either too "simple" or too "complex" with respect to the true data generating
process?
Under-fitting and over-fitting
(Plots of fits with increasing capacity: at first the error is large; then the error reduces; eventually the training error is small, but is this a good solution?)
Under-fitting and over-fitting
Let $\mathcal{Y}^{\mathcal{X}}$ be the set of all functions $f: \mathcal{X} \to \mathcal{Y}$.

We define the Bayes risk as the minimal expected risk over all possible functions,
$$R_B = \min_{f \in \mathcal{Y}^{\mathcal{X}}} R(f),$$
and call the Bayes model the model $f_B$ that achieves this minimum.

No model $f$ can perform better than $f_B$.


Under-fitting and over-fitting
The capacity of a hypothesis space induced by a learning algorithm intuitively represents the ability to find a good model $f \in \mathcal{F}$ for any function, regardless of its complexity.

In practice, capacity can be controlled through hyper-parameters of the learning algorithm (see the sketch after this list). For example:
• The degree of the family of polynomials;
• The number of layers in a neural network;
• The number of training iterations;
• Regularization terms.
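As a rough illustration of how one such hyper-parameter controls capacity, the sketch below reuses the toy cubic problem (with assumed constants), sweeps the polynomial degree, and compares the training error with the error on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):                                   # assumed ground-truth cubic (illustration only)
    return 0.5 * x**3 - 2.0 * x**2 + x + 3.0

def sample(N, sigma=20.0):
    x = rng.uniform(-10, 10, size=N)
    return x, g(x) + rng.normal(0, sigma, size=N)

def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)

x_train, y_train = sample(30)
x_test, y_test = sample(1000)               # held-out data approximates the expected risk

for degree in [1, 3, 9, 15]:
    w = np.polyfit(x_train, y_train, deg=degree)   # empirical risk minimizer at this capacity
    print(degree, mse(w, x_train, y_train), mse(w, x_test, y_test))
# Low degree: both errors stay high (under-fitting).
# Very high degree: the training error shrinks while the held-out error grows (over-fitting).
```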
Under-fitting and over-fitting
• If the capacity of $\mathcal{F}$ is too low, then $f_B \notin \mathcal{F}$ and $R(f) - R_B$ is large for any $f \in \mathcal{F}$, including $f_*$ and $f_*^{\mathbf{d}}$. Such models $f$ are said to underfit the data.

• If the capacity of $\mathcal{F}$ is too high, then $f_B \in \mathcal{F}$ or $R(f_*) - R_B$ is small. However, because of the high capacity of the hypothesis space, the empirical risk minimizer $f_*^{\mathbf{d}}$ could fit the training data arbitrarily well, such that
$$\underbrace{R(f_*^{\mathbf{d}})}_{\text{expected risk}} \geq \underbrace{R_B}_{\text{Bayes risk}} \geq \underbrace{\hat{R}(f_*^{\mathbf{d}}, \mathbf{d})}_{\text{empirical risk}} \geq 0.$$

• In this situation, $f_*^{\mathbf{d}}$ becomes too specialized with respect to the true data-generating process, and a large reduction of the empirical risk (often) comes at the price of an increase of the expected risk of the empirical risk minimizer $R(f_*^{\mathbf{d}})$. In this situation, $f_*^{\mathbf{d}}$ is said to overfit the data.
Under-fitting and over-fitting
Can you draw the generalization error and training error curves?
Under-fitting and over-fitting
Therefore, our goal is to adjust the capacity of the hypothesis space such that the expected
risk of the empirical risk minimizer gets as low as possible.
Under-fitting and over-fitting
When over-fitting,
$$R(f_*^{\mathbf{d}}) \geq R_B \geq \hat{R}(f_*^{\mathbf{d}}, \mathbf{d}) \geq 0.$$

This indicates that the empirical risk $\hat{R}(f_*^{\mathbf{d}}, \mathbf{d})$ is a poor estimator of the expected risk $R(f_*^{\mathbf{d}})$.

Nevertheless, an unbiased estimate of the expected risk can be obtained by evaluating $f_*^{\mathbf{d}}$ on data $\mathbf{d}_{\text{test}}$ independent from the training samples $\mathbf{d}$:
$$\hat{R}(f_*^{\mathbf{d}}, \mathbf{d}_{\text{test}}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}_{\text{test}}} \ell(y_i, f_*^{\mathbf{d}}(\mathbf{x}_i)).$$

This test error estimate can be used to evaluate the actual performance of the model. However, it should not be used, at the same time, for model selection.
Evaluation protocol
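A minimal sketch of a typical evaluation protocol, under the same assumed toy setup (the split sizes and candidate degrees are illustrative): fit on the training split, select the capacity on the validation split, and report the test error once at the end.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data from an assumed cubic data-generating process (illustration only).
x = rng.uniform(-10, 10, size=200)
y = 0.5 * x**3 - 2.0 * x**2 + x + 3.0 + rng.normal(0, 20.0, size=200)

# Split into training (fit), validation (model selection) and test (final estimate).
idx = rng.permutation(len(x))
tr, va, te = idx[:120], idx[120:160], idx[160:]

def mse(w, i):
    return np.mean((np.polyval(w, x[i]) - y[i]) ** 2)

# Choose the capacity (polynomial degree) using the validation split only.
best_degree = min(range(1, 10), key=lambda d: mse(np.polyfit(x[tr], y[tr], d), va))

# Evaluate the selected model once on the untouched test split.
w = np.polyfit(x[tr], y[tr], best_degree)
print(best_degree, mse(w, te))
```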
Under-fitting and over-fitting
Bias-variance dilemma
• Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance).

• There is a tradeoff between a model’s ability to minimize bias and variance.

• Gaining a proper understanding of these errors helps us not only to build accurate models but also to avoid the mistakes of overfitting and underfitting.
Bias-variance dilemma
Bias
• Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
• A model with high bias pays very little attention to the training data and oversimplifies the model.
• It always leads to high error on training and test data.

Variance
• Variance is the variability of the model prediction for a given data point, or a value that tells us the spread of our data.
• A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
• As a result, such models perform very well on training data but have high error rates on test data.
Bias-variance dilemma
Bias-variance decomposition: Consider a fixed unseen sample $x$ and the prediction $\hat{Y} = f_*^{\mathbf{d}}(x)$ of the empirical risk minimizer at $x$.

It can be shown that the expected error on the unseen sample $x$ is
$$\mathbb{E}_{\mathbf{d}}\left[\left(y - f_*^{\mathbf{d}}(x)\right)^2\right] = \sigma^2 + \mathrm{Bias}_{\mathbf{d}}\left(f_*^{\mathbf{d}}(x)\right)^2 + \mathrm{Var}_{\mathbf{d}}\left(f_*^{\mathbf{d}}(x)\right),$$
where
• the first term is an irreducible noise term with zero mean and variance $\sigma^2$;
• the second term is a bias term that measures the discrepancy between the average model and the Bayes model, $\mathrm{Bias}_{\mathbf{d}}(f_*^{\mathbf{d}}(x)) = \mathbb{E}_{\mathbf{d}}[f_*^{\mathbf{d}}(x)] - f_B(x)$;
• the third term is a variance term that quantifies the variability of the predictions, $\mathrm{Var}_{\mathbf{d}}(f_*^{\mathbf{d}}(x)) = \mathbb{E}_{\mathbf{d}}[(f_*^{\mathbf{d}}(x))^2] - \mathbb{E}_{\mathbf{d}}[f_*^{\mathbf{d}}(x)]^2$.
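A small Monte Carlo sketch of this decomposition under the same assumed toy problem: for a fixed point x0, repeatedly redraw training sets, refit models of low and high capacity, and estimate the bias and variance of their predictions (all constants are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA = 20.0                                 # assumed noise level (illustration only)

def g(x):                                    # plays the role of the Bayes model here
    return 0.5 * x**3 - 2.0 * x**2 + x + 3.0

def draw_training_set(N=30):
    x = rng.uniform(-10, 10, size=N)
    return x, g(x) + rng.normal(0, SIGMA, size=N)

def bias_variance_at(x0, degree, trials=500):
    """Estimate Bias^2 and Var of the degree-`degree` fit at x0 over many training sets."""
    preds = np.empty(trials)
    for t in range(trials):
        x, y = draw_training_set()
        preds[t] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - g(x0)) ** 2      # (E_d[f_d(x0)] - f_B(x0))^2
    var = preds.var()                        # E_d[f_d(x0)^2] - E_d[f_d(x0)]^2
    return bias2, var

for degree in [1, 3, 9]:
    print(degree, bias_variance_at(x0=5.0, degree=degree))
# Low capacity: large bias^2, small variance. High capacity: small bias^2, large variance.
```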
Bias-variance dilemma
Bias-variance trade-off - Question

If our model is too simple and has very few parameters, it will result in
high/low bias and high/low variance.

If our model has a large number of parameters, then it is going to have
high/low bias and high/low variance.
Bias-variance dilemma

Bias-variance trade-off

• Reducing the capacity makes $f_*^{\mathbf{d}}$ fit the data less on average, which increases the bias term.
• Increasing the capacity makes $f_*^{\mathbf{d}}$ vary a lot with the training data, which increases the variance term.
Bias-variance dilemma
Bias and variance explained using bulls-eye diagram

Which example indicates underfitting?

Which example indicates overfitting?

How to mitigate these problems?


Bias-variance dilemma
Bias and variance explained using bulls-eye diagram

How to mitigate underfitting?


• This happens when a model is unable to capture the underlying pattern of the data.
• These models usually have high bias and low variance.
• It happens when we have an insufficient amount of data to build an accurate model, or when we try to use an overly simple model to capture the complex patterns in the data.
• Increase the model complexity; add more meaningful features.
Bias-variance dilemma
Bias and variance explained using bulls-eye diagram

How to mitigate overfitting?


• This happens when our model is too complex and too specialized to a small amount of training data.
• These models usually have low bias and high variance.
• Increase the size of the data, remove outliers in the data, reduce the complexity of the model, or reduce the feature dimension.
Bias-variance dilemma

An optimal balance of bias and variance means the model would never overfit or underfit the data.
