Module-1 Deep Learning
Information
Title: Deep learning
Authors:
Ian Goodfellow
Yoshua Bengio
Aaron Courville
ISBN: 9780262337434
Chapter Organization (Part 1)
Chapter Organization (Part 2)
Machine Learning and AI
Chapter 5. Machine Learning Basics
This chapter provides a brief course in the most important
general principles
1. Learning algorithms
2. Capacity, overfitting and underfitting
3. Hyperparameters and validation sets
4. Estimators, bias and variance
5. Maximum likelihood estimation
6. Bayesian statistics
7. Supervised learning algorithms
8. Unsupervised learning algorithms
9. Stochastic gradient descent
10. Building a machine learning algorithm
11. Challenges motivating deep learning
Checkers Game: The First ML Application
Samuel’s checkers-playing program appears to be the world’s first self-learning program (Arthur Samuel, 1959).
Over thousands of games, the program learned to recognize which patterns of play led to wins and which led to losses.
Learning Algorithms
• A field of study that gives computers the ability to learn
without being explicitly programmed (Arthur Samuel,
1959)
Classification
The computer program is asked to specify which of K categories some input belongs to.
Example: object recognition.
Regression
Transcription and Machine Translation
Performance Measure, 𝑃
Accuracy (error rate) for categorical data:
measures the proportion of examples for which the model produces the correct output.
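Written as a formula (the standard definition; it is not spelled out on the slide):

$\mathrm{accuracy} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\left[\hat{y}^{(i)} = y^{(i)}\right], \qquad \mathrm{error\ rate} = 1 - \mathrm{accuracy}$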
Linear Regression
• Linear regression is a machine learning algorithm based on supervised learning.
• Regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting.
• Different regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables used.
Linear Regression
• Linear regression predicts a dependent variable value (y) based on a given independent variable (x).
• This regression technique finds a linear relationship between x (input) and y (output), hence the name Linear Regression.
• In the accompanying figure, X (input) is work experience and Y (output) is the salary of a person.
• The regression line is the best-fit line for our model.
Linear Regression
While training the model we are given:
x: input training data (univariate, i.e. one input variable)
y: labels for the data (supervised learning)
When training, the model fits the best line to predict the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ₁ and θ₂ values:
θ₁: intercept
θ₂: coefficient of x
Performance measure, 𝑃:
mean squared error (MSE) on (𝑿^(test), 𝒚^(test))
𝑚: number of examples in the dataset
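Written out (the standard definition used in the textbook):

$MSE_{test} = \frac{1}{m}\sum_{i}\left(\hat{y}^{(test)} - y^{(test)}\right)_i^2 = \frac{1}{m}\left\|\hat{y}^{(test)} - y^{(test)}\right\|_2^2$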
Find 𝒘 by Minimizing 𝑀𝑆𝐸_train
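For reference, the standard closed-form solution (as derived in the textbook): setting the gradient of the training MSE to zero gives the normal equations,

$\nabla_{w} MSE_{train} = 0 \;\Rightarrow\; w = \left(X^{(train)\top} X^{(train)}\right)^{-1} X^{(train)\top} y^{(train)}$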
Linear Regression Problem
Generalization
Training error of the model:
$MSE_{train} = \frac{1}{m^{(train)}}\left\|X^{(train)} w - y^{(train)}\right\|_2^2$
But we actually care about the test set error:
$MSE_{test} = \frac{1}{m^{(test)}}\left\|X^{(test)} w - y^{(test)}\right\|_2^2$
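A minimal NumPy sketch of these two quantities (the data and variable names below are illustrative assumptions, not from the slides): fit 𝒘 on the training set and compare training and test MSE.

import numpy as np

def fit_linear_regression(X_train, y_train):
    """Least-squares fit of w (solves the normal equations)."""
    # lstsq is numerically safer than forming (X^T X)^-1 explicitly
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return w

def mse(X, y, w):
    """Mean squared error of predictions X @ w against targets y."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)

# illustrative synthetic data (assumed)
rng = np.random.default_rng(0)
X_train = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + feature
y_train = 3.0 + 2.0 * X_train[:, 1] + rng.normal(0, 1.0, 50)
X_test = np.column_stack([np.ones(20), rng.uniform(0, 10, 20)])
y_test = 3.0 + 2.0 * X_test[:, 1] + rng.normal(0, 1.0, 20)

w = fit_linear_regression(X_train, y_train)
print("MSE_train:", mse(X_train, y_train, w))
print("MSE_test: ", mse(X_test, y_test, w))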
Generalization
In ML, generalization is the ability to perform well on previously unobserved inputs.
1. Make the training error small
2. Make the gap between training and test error small
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
Overfitting occurs when the gap between the training error and the test error is too large.
Polynomial Model
$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
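A short NumPy sketch of this polynomial model (the data and degrees below are illustrative assumptions): fitting the same noisy data with different degrees shows underfitting vs. overfitting through the train/test errors.

import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)   # noisy target
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for M in (1, 3, 9):                                   # model capacity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, deg=M)      # least-squares fit of w_0 .. w_M
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {M}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")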
Capacity and Learning Theory
• Quantifying the capacity of a model enables statistical learning theory to make quantitative predictions
• The most important results in statistical learning theory show that the discrepancy between training error and generalization error:
– is bounded from above by a quantity that grows as the model capacity grows
– but shrinks as the number of training examples increases
Usefulness of Statistical Learning Theory
• Provides intellectual justification that machine learning algorithms can work
• But it is rarely used in practice with deep learning, because:
– the bounds are loose
– it is also difficult to determine the capacity of deep learning algorithms
Typical Generalization Error
Relationship between capacity and error: typically, generalization error has a U-shaped curve as a function of capacity.
Arbitrarily High Capacity: Nonparametric Models
• When do we reach the most extreme case of arbitrarily high capacity?
• Parametric models such as linear regression learn a function described by a parameter vector whose size is finite and fixed before any data is observed
• Nonparametric models have no such limitation
• Nearest-neighbor regression is an example (see the sketch below)
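A minimal sketch of nearest-neighbor regression (illustrative code, not the book's): the prediction for a query point is simply the target of the closest training point, so the effective capacity grows with the training set.

import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_query):
    """Predict y for each query point as the target of its nearest training point."""
    preds = []
    for x in X_query:
        dists = np.sum((X_train - x) ** 2, axis=1)   # squared L2 distances
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

# tiny illustrative example
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
print(nearest_neighbor_predict(X_train, y_train, np.array([[0.9], [1.6]])))  # -> [1. 4.]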
Nearest neighbor regression
Figure: synthetic regression problem, with noise added to a 5th-degree polynomial. A single test set was generated, along with several training sets of different sizes. Error bars show the 95% confidence interval.
As training set size increases, optimal capacity increases, then plateaus after reaching sufficient complexity.
Probably Approximately Correct
• Learning theory claims that an ML algorithm can generalize from a finite training set of examples
• This seems to contradict basic principles of logic: inductive reasoning, i.e. inferring general rules from a limited number of samples, is not logically valid
• To infer a rule describing every member of a set, one must have information about every member of the set
• ML avoids this problem by offering only probabilistic rules
Other Ways of Expressing Solution Preference
• So far, the only method for modifying a learning algorithm has been to change its representational capacity
– adding or removing functions from the hypothesis space
– specific example: the degree of the polynomial
• The specific identity of those functions can also affect the behavior of our algorithm
• Example for linear regression: include weight decay
$J(w) = MSE_{train} + \lambda\, w^\top w$
• λ is chosen to control the preference for smaller weights
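A small NumPy sketch of weight decay for linear regression (illustrative; the closed form below is the standard ridge-regression solution, and the data are assumed, not from the slides):

import numpy as np

def fit_with_weight_decay(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w^T w (ridge regression / weight decay)."""
    m, n = X.shape
    # Setting the gradient to zero gives: w = (X^T X + lam * m * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

# illustrative data: degree-9 polynomial features of a noisy quadratic
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
y = 1.0 - 2.0 * x ** 2 + rng.normal(0, 0.1, 30)
X = np.vander(x, N=10, increasing=True)        # columns: x^0 ... x^9

for lam in (0.0, 1e-3, 1.0):
    w = fit_with_weight_decay(X, y, lam)
    print(f"lambda={lam}: ||w|| = {np.linalg.norm(w):.2f}")   # larger lambda -> smaller weights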
Tradeoff Between Fitting the Data and Being Small
Controlling a model's tendency to overfit or underfit via weight decay:
weight decay applied to a high-degree polynomial (degree 9) while the data actually come from a quadratic function.
Degree-9 Regression Model with Weight Decay
A large 𝒘 over-fits the training data.
Regularization
Regularization is any modification made to a learning algorithm that is intended to prevent overfitting.
It trades off fitting the training data against keeping 𝒘 small.
Training Set Size and Generalization
Polynomial model (degree 9)
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, Chapter 1.
Hyperparameters vs. Parameters
Hyperparameters are higher-level properties of a model:
they determine the model's capacity (or complexity),
they are not learned during training,
e.g., the degree of the regression model, the weight decay coefficient λ.
Setting Hyperparameters with a Validation Set
Setting hyperparameters on the training set is not appropriate: they would be set to yield maximum overfitting (e.g., the highest possible degree of the regression model, 𝜆 → 0).
k-fold cross-validation: split the data into k folds; on the i-th iteration, hold out the i-th fold as the validation fold, train on the remaining folds, and record the validation error 𝐸ᵢ; then average 𝐸₁, …, 𝐸ₖ.
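A compact NumPy sketch of choosing the weight-decay hyperparameter λ by k-fold cross-validation (the data, candidate λ values, and ridge fit below are illustrative assumptions, not the slides' own code):

import numpy as np

def ridge_fit(X, y, lam):
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

def kfold_cv_error(X, y, lam, k=5):
    """Average validation MSE over k folds for a given hyperparameter lam."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))   # E_i
    return np.mean(errors)

# illustrative data, then pick the lambda with the lowest average validation error
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 40)
y = 1.0 - 2.0 * x ** 2 + rng.normal(0, 0.1, 40)
X = np.vander(x, N=10, increasing=True)
best_lam = min([1e-4, 1e-2, 1.0], key=lambda lam: kfold_cv_error(X, y, lam))
print("selected lambda:", best_lam)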
Estimation in Statistics
Point estimation: a point estimator is any function of the data samples 𝗑⁽¹⁾, …, 𝗑⁽ᵐ⁾, i.e. $\hat{\theta}_m = g\left(x^{(1)}, \dots, x^{(m)}\right)$.
Function estimation: a type of estimation that predicts the relationship between input and target variables.
The quality of a point estimator is described by its bias and its variance, $\mathrm{Var}(\hat{\theta})$.
Graphical Illustration of Bias and Variance
Figure panels: a high-bias model; a high-variance model.
Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff, 2012.
Bias-Variance Trade-off with Capacity
Bias & Variance on 𝑀𝑆𝐸
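The connection these slides draw between MSE, bias, and variance is the standard decomposition (as in the textbook):

$MSE = \mathbb{E}\left[\left(\hat{\theta}_m - \theta\right)^2\right] = \mathrm{Bias}\left(\hat{\theta}_m\right)^2 + \mathrm{Var}\left(\hat{\theta}_m\right)$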
Ways to Trade Off Bias & Variance
Cross-validation: highly successful on many real-world tasks.
MSE of the estimates: MSE incorporates both bias and variance.
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model.
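In symbols (the standard formulation), MLE selects the parameters that maximize the likelihood of the observed samples, equivalently the sum of log-likelihoods:

$\theta_{ML} = \arg\max_{\theta} p_{model}(\mathbb{X}; \theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}\left(x^{(i)}; \theta\right)$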
Linear Regression as Maximum Likelihood
Instead of producing a single prediction 𝑦^, we consider the model as producing a conditional distribution 𝑝(𝑦 | 𝗑).
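Concretely (a standard result, consistent with the textbook): taking the conditional distribution to be Gaussian with fixed variance,

$p(y \mid x) = \mathcal{N}\left(y;\, \hat{y}(x; w),\, \sigma^2\right), \qquad \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}\right) = -\frac{m}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2,$

so maximizing the conditional log-likelihood with respect to 𝒘 yields the same estimate as minimizing 𝑀𝑆𝐸_train.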
Bayesian statistics
Chapter 5. Machine Learning Basics
Part 1 (frequentist statistics, point estimation)
- 5.1 Learning algorithms
- 5.2 Capacity, overfitting and underfitting
- 5.3 Hyperparameters and validation sets
- 5.4 Estimators, bias and variance
- 5.5 Maximum likelihood estimation
Part 2 (Bayesian statistics)
- 5.6 Bayesian statistics
- 5.7 Supervised learning algorithms
Bayesian statistics
Bayesian perspective
- Uses probability to reflect degrees of certainty in states of knowledge
- The dataset is directly observed and so is not random
- The parameter 𝜽 is represented as a random variable
The prior
- We represent our knowledge of 𝜽 using the prior probability distribution 𝑝(𝜽), specified before observing the data
- Select a broad prior distribution (reflecting a high degree of uncertainty), such as a uniform distribution over a finite range, or a Gaussian
Mathematical description
Set of data samples {𝗑⁽¹⁾, 𝗑⁽²⁾, ⋯ , 𝗑⁽ᵐ⁾}
The dataset is directly observed and so is not random
The parameter 𝜽 is represented as a random variable
Combine the data likelihood with the prior via Bayes' rule:
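The rule referred to above, in its standard form:

$p\left(\theta \mid x^{(1)}, \dots, x^{(m)}\right) = \frac{p\left(x^{(1)}, \dots, x^{(m)} \mid \theta\right)\, p(\theta)}{p\left(x^{(1)}, \dots, x^{(m)}\right)}$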
Relative to MLE
Bayesian estimation makes predictions using a full distribution over 𝜽.
After observing 𝑚 samples, the predicted distribution over the next data sample 𝗑⁽ᵐ⁺¹⁾ is given by:
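The predictive distribution referred to above, in its standard form (marginalizing over 𝜽):

$p\left(x^{(m+1)} \mid x^{(1)}, \dots, x^{(m)}\right) = \int p\left(x^{(m+1)} \mid \theta\right)\, p\left(\theta \mid x^{(1)}, \dots, x^{(m)}\right)\, d\theta$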
Maximum A Posteriori (MAP) Estimation
Choose the point of maximal posterior probability.
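In symbols (standard form), MAP maximizes the log posterior, i.e. the log-likelihood plus the log prior:

$\theta_{MAP} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} \left[\log p(x \mid \theta) + \log p(\theta)\right]$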
MLE vs. MAP
Figure: comparison of the MLE estimate and the MAP estimate.
Bayesian statistics
Figure: Bayesian updating (posterior ∝ likelihood × prior), shown after the 1st sampled data point and after the 20th sampled data point.
Support Vector Machine (SVM)
Definition
- Addresses the two-class classification problem in a direct way
- Finds a hyperplane that separates the classes in feature space by as large a margin as possible
Separating hyperplane
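For reference, the standard maximum-margin formulation behind this definition (not spelled out on the slide): the separating hyperplane is $w^\top x + b = 0$, a point is classified by the sign of $w^\top x + b$, and the margin is maximized by solving

$\min_{w, b}\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y^{(i)}\left(w^\top x^{(i)} + b\right) \ge 1 \;\; \text{for all } i.$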
Support Vector Machine (SVM)
Support Vector Classifier

Tree Based Methods
Bagging
Diagram: Total Sample → Sampling → Data → Modeling → Average or Majority Vote → Final Model
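A minimal sketch of the bagging procedure in the diagram above (illustrative; it assumes scikit-learn's DecisionTreeClassifier is available as the base model and that class labels are small non-negative integers): draw bootstrap samples, fit one model per sample, and combine predictions by majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    """Fit n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))       # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the individual models' predictions (integer labels assumed)."""
    votes = np.stack([m.predict(X) for m in models])     # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# hypothetical usage (X_train, y_train, X_test are placeholders, not defined here):
# models = bagging_fit(X_train, y_train); y_hat = bagging_predict(models, X_test)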
Tree Based Methods
Boosting
Diagram: Total Sample → repeated rounds of (Modeling → Weak Classifier → weight update) → weighted combination into the Final Model
Tree Based Methods
Boosting
- Consider the problem of classifying '+' and '-' points with a tree-based classifier
- First, a weak classifier labels the points using a single vertical line on the left side
- Then, the incorrectly classified points are given larger weights (the large '+' symbols in the second figure), and another weak classifier is trained (the line on the right side)
- Repeat this procedure and finally merge the weak classifiers (a code sketch of this procedure follows)
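A compact sketch of this boosting procedure in the AdaBoost style (the specific weighting scheme is an assumption; the slides describe the idea generically). Labels are assumed to be ±1, and decision stumps from scikit-learn serve as the weak classifiers.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """y must be in {-1, +1}. Returns the weak classifiers and their weights."""
    m = len(X)
    sample_w = np.full(m, 1.0 / m)                   # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # a weak classifier (decision stump)
        stump.fit(X, y, sample_weight=sample_w)
        pred = stump.predict(X)
        err = np.sum(sample_w * (pred != y)) / np.sum(sample_w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weight of this weak classifier
        sample_w *= np.exp(-alpha * y * pred)        # boost weights of misclassified points
        sample_w /= sample_w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final model: sign of the weighted sum of the weak classifiers."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)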
Other Supervised Learning Algorithms
Linear Regression
- Ridge
- Lasso
Logistic Regression
LDA (Linear Discriminant Analysis)
Random Forest
KNN (K-Nearest Neighbors)
Naïve Bayes
Neural Network (MLP)
…
Unsupervised Learning
Algorithms
K-means clustering
Find the K clusters that best describe the data
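A minimal NumPy sketch of the K-means procedure (the initialization and stopping rule are illustrative, not the slides' own code): alternately assign each point to its nearest centroid and move each centroid to the mean of its assigned points.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # k data points as initial centroids
    for _ in range(n_iters):
        # assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its cluster
        # (empty clusters are ignored here for simplicity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels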
Stochastic Gradient Descent
(SGD)
Gradient Descent
A method for updating the parameters.
Consider the model cost function 𝐽(𝜽); gradient descent repeatedly updates the parameters in the direction of the negative gradient, $\theta \leftarrow \theta - \epsilon \nabla_{\theta} J(\theta)$, where ε is the learning rate.
Limitation of Gradient Descent
The issue of local minima.
Stochastic Gradient Descent (SGD)
The SGD method
- An extension of gradient descent
- Nearly all of deep learning is powered by this method (the deep-learning cost surface is not convex)
Uses batch (minibatch) learning
- Calculate the loss function on a batch (sample) of the data
- Update the parameters to 𝜽_new using the gradient of the batch loss function
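A short NumPy sketch of minibatch SGD for the linear-regression cost above (the learning rate, batch size, and data layout are illustrative assumptions):

import numpy as np

def sgd_linear_regression(X, y, lr=0.01, batch_size=8, n_epochs=50, seed=0):
    """Minimize J(theta) = mean squared error, using gradients of minibatch losses."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))                       # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of the minibatch MSE: (2/n) X_b^T (X_b theta - y_b)
            grad = 2.0 / len(batch) * X[batch].T @ (X[batch] @ theta - y[batch])
            theta = theta - lr * grad                       # theta_new = theta - lr * batch gradient
    return theta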
SGD vs. GD
GD moves in the steepest-descent direction, but is slower to compute per iteration for large datasets.
SGD can be viewed as noisy descent, but is faster per iteration.
The next Deep Learning Seminar