Module-1 Deep Learning
Information
Title: Deep learning
Authors:
Ian Goodfellow
Yoshua Bengio
Aaron Courville
ISBN: 9780262337434
Chapter Organization (Part 1)
Chapter Organization (Part 2)
Machine Learning and AI
Chapter 5. Machine Learning Basics
This chapter provides a brief course in the most important
general principles
1. Learning algorithms
2. Capacity, overfitting and underfitting
3. Hyperparameters and validation sets
4. Estimators, bias and variance
5. Maximum likelihood estimation
6. Bayesian statistics
7. Supervised learning algorithms
8. Unsupervised learning algorithms
9. Stochastic gradient descent
10. Building a machine learning algorithm
11. Challenges motivating deep learning
Checkers Game: The First ML Application
Samuel’s checkers-playing program appears to be the world’s first self-learning program (Arthur Samuel, 1959).
Over thousands of games, the program learned to recognize which patterns of play led to wins and which led to losses.
Learning Algorithms
• A field of study that gives computers the ability to learn
without being explicitly programmed (Arthur Samuel,
1959)
Classification
The computer program is asked to specify which of K categories some input belongs to.
Example: object recognition.
Regression
Transcription and Machine Translation
Performance Measure, 𝑃
Accuracy (error rate) for categorical data:
measures the proportion of examples for which the model produces the correct output.
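Written as a formula (the standard definition; it is not spelled out on the slide):

$\mathrm{accuracy} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\left[\hat{y}^{(i)} = y^{(i)}\right], \qquad \mathrm{error\ rate} = 1 - \mathrm{accuracy}$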
Linear Regression
• Linear regression is a machine learning algorithm based on supervised learning.
• Regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting.
• Different regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables used.
Linear Regression
• Linear regression predicts a dependent variable value (y) based on a given independent variable (x).
• This regression technique finds a linear relationship between x (input) and y (output), hence the name Linear Regression.
• In the accompanying figure, X (input) is work experience and Y (output) is the salary of a person.
• The regression line is the best-fit line for our model.
Linear Regression
While training the model we are given:
x: input training data (univariate, i.e. one input variable)
y: labels for the data (supervised learning)
When training, the model fits the best line to predict the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ₁ and θ₂ values:
θ₁: intercept
θ₂: coefficient of x
Performance measure, 𝑃:
mean squared error (MSE) on (𝑿^(test), 𝒚^(test))
𝑚: number of examples in the dataset
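Written out (the standard definition used in the textbook):

$MSE_{test} = \frac{1}{m}\sum_{i}\left(\hat{y}^{(test)} - y^{(test)}\right)_i^2 = \frac{1}{m}\left\|\hat{y}^{(test)} - y^{(test)}\right\|_2^2$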
Find 𝒘 by Minimizing 𝑀𝑆𝐸_train
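For reference, the standard closed-form solution (as derived in the textbook): setting the gradient of the training MSE to zero gives the normal equations,

$\nabla_{w} MSE_{train} = 0 \;\Rightarrow\; w = \left(X^{(train)\top} X^{(train)}\right)^{-1} X^{(train)\top} y^{(train)}$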
Linear Regression Problem
Generalization
Training error of the model:
$MSE_{train} = \frac{1}{m^{(train)}}\left\|X^{(train)} w - y^{(train)}\right\|_2^2$
But we actually care about the test set error:
$MSE_{test} = \frac{1}{m^{(test)}}\left\|X^{(test)} w - y^{(test)}\right\|_2^2$
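A minimal NumPy sketch of these two quantities (the data and variable names below are illustrative assumptions, not from the slides): fit 𝒘 on the training set and compare training and test MSE.

import numpy as np

def fit_linear_regression(X_train, y_train):
    """Least-squares fit of w (solves the normal equations)."""
    # lstsq is numerically safer than forming (X^T X)^-1 explicitly
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return w

def mse(X, y, w):
    """Mean squared error of predictions X @ w against targets y."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)

# illustrative synthetic data (assumed)
rng = np.random.default_rng(0)
X_train = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + feature
y_train = 3.0 + 2.0 * X_train[:, 1] + rng.normal(0, 1.0, 50)
X_test = np.column_stack([np.ones(20), rng.uniform(0, 10, 20)])
y_test = 3.0 + 2.0 * X_test[:, 1] + rng.normal(0, 1.0, 20)

w = fit_linear_regression(X_train, y_train)
print("MSE_train:", mse(X_train, y_train, w))
print("MSE_test: ", mse(X_test, y_test, w))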
Generalization
In ML, generalization is the ability to perform well on previously unobserved inputs.
1. Make the training error small
2. Make the gap between training and test error small
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
Overfitting occurs when the gap between the training error and the test error is too large.
Polynomial Model
$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
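A short NumPy sketch of this polynomial model (the data and degrees below are illustrative assumptions): fitting the same noisy data with different degrees shows underfitting vs. overfitting through the train/test errors.

import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)   # noisy target
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for M in (1, 3, 9):                                   # model capacity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, deg=M)      # least-squares fit of w_0 .. w_M
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {M}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")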
Capacity and Learning Theory
• Quantifying the capacity of a model enables statistical learning theory to make quantitative predictions
• The most important results in statistical learning theory show that the discrepancy between training error and generalization error:
– is bounded from above by a quantity that grows as the model capacity grows
– but shrinks as the number of training examples increases
Usefulness of Statistical Learning Theory
• Provides intellectual justification that machine learning algorithms can work
• But it is rarely used in practice with deep learning, because:
– the bounds are loose
– it is also difficult to determine the capacity of deep learning algorithms
Typical Generalization Error
Relationship between capacity and error: typically, generalization error has a U-shaped curve as a function of capacity.
Arbitrarily High Capacity: Nonparametric Models
• When do we reach the most extreme case of arbitrarily high capacity?
• Parametric models such as linear regression learn a function described by a parameter vector whose size is finite and fixed before any data is observed
• Nonparametric models have no such limitation
• Nearest-neighbor regression is an example (see the sketch below)
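A minimal sketch of nearest-neighbor regression (illustrative code, not the book's): the prediction for a query point is simply the target of the closest training point, so the effective capacity grows with the training set.

import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_query):
    """Predict y for each query point as the target of its nearest training point."""
    preds = []
    for x in X_query:
        dists = np.sum((X_train - x) ** 2, axis=1)   # squared L2 distances
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

# tiny illustrative example
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
print(nearest_neighbor_predict(X_train, y_train, np.array([[0.9], [1.6]])))  # -> [1. 4.]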
Nearest neighbor regression
Figure: synthetic regression problem, with noise added to a 5th-degree polynomial. A single test set was generated, along with several training sets of different sizes. Error bars show the 95% confidence interval.
As training set size increases, optimal capacity increases, then plateaus after reaching sufficient complexity.
Probably Approximately Correct
• Learning theory claims that an ML algorithm can generalize from a finite training set of examples
• This seems to contradict basic principles of logic: inductive reasoning, i.e. inferring general rules from a limited number of samples, is not logically valid
• To infer a rule describing every member of a set, one must have information about every member of the set
• ML avoids this problem by offering only probabilistic rules
Other Ways of Expressing Solution Preference
• So far, the only method for modifying a learning algorithm has been to change its representational capacity
– adding or removing functions from the hypothesis space
– specific example: the degree of the polynomial
• The specific identity of those functions can also affect the behavior of our algorithm
• Example for linear regression: include weight decay
$J(w) = MSE_{train} + \lambda\, w^\top w$
• λ is chosen to control the preference for smaller weights
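A small NumPy sketch of weight decay for linear regression (illustrative; the closed form below is the standard ridge-regression solution, and the data are assumed, not from the slides):

import numpy as np

def fit_with_weight_decay(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w^T w (ridge regression / weight decay)."""
    m, n = X.shape
    # Setting the gradient to zero gives: w = (X^T X + lam * m * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

# illustrative data: degree-9 polynomial features of a noisy quadratic
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
y = 1.0 - 2.0 * x ** 2 + rng.normal(0, 0.1, 30)
X = np.vander(x, N=10, increasing=True)        # columns: x^0 ... x^9

for lam in (0.0, 1e-3, 1.0):
    w = fit_with_weight_decay(X, y, lam)
    print(f"lambda={lam}: ||w|| = {np.linalg.norm(w):.2f}")   # larger lambda -> smaller weights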
Tradeoff Between Fitting the Data and Being Small
Controlling a model's tendency to overfit or underfit via weight decay:
weight decay applied to a high-degree polynomial (degree 9) while the data actually come from a quadratic function.
Degree-9 Regression Model with Weight Decay
A large 𝒘 over-fits the training data.
Regularization
Regularization is any modification made to a learning algorithm that is intended to prevent overfitting.
It trades off fitting the training data against keeping 𝒘 small.
Training Set Size and Generalization
Polynomial model (degree 9)
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, Chapter 1.
Hyperparameters vs. Parameters
Hyperparameters are higher-level properties of a model:
they determine the model's capacity (or complexity),
they are not learned during training,
e.g., the degree of the regression model, the weight decay coefficient λ.
Setting Hyperparameters with a Validation Set
Setting hyperparameters on the training set is not appropriate: they would be set to yield maximum overfitting (e.g., the highest possible degree of the regression model, 𝜆 → 0).
k-fold cross-validation: split the data into k folds; on the i-th iteration, hold out the i-th fold as the validation fold, train on the remaining folds, and record the validation error 𝐸ᵢ; then average 𝐸₁, …, 𝐸ₖ.
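A compact NumPy sketch of choosing the weight-decay hyperparameter λ by k-fold cross-validation (the data, candidate λ values, and ridge fit below are illustrative assumptions, not the slides' own code):

import numpy as np

def ridge_fit(X, y, lam):
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

def kfold_cv_error(X, y, lam, k=5):
    """Average validation MSE over k folds for a given hyperparameter lam."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))   # E_i
    return np.mean(errors)

# illustrative data, then pick the lambda with the lowest average validation error
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 40)
y = 1.0 - 2.0 * x ** 2 + rng.normal(0, 0.1, 40)
X = np.vander(x, N=10, increasing=True)
best_lam = min([1e-4, 1e-2, 1.0], key=lambda lam: kfold_cv_error(X, y, lam))
print("selected lambda:", best_lam)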
Estimation in Statistics
Point estimation: a point estimator is any function of the data samples 𝗑⁽¹⁾, …, 𝗑⁽ᵐ⁾, i.e. $\hat{\theta}_m = g\left(x^{(1)}, \dots, x^{(m)}\right)$.
Function estimation: a type of estimation that predicts the relationship between input and target variables.
The quality of a point estimator is described by its bias and its variance, $\mathrm{Var}(\hat{\theta})$.
Graphical Illustration of Bias and Variance
Figure panels: a high-bias model; a high-variance model.
Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff, 2012.
Bias-Variance Trade-off with Capacity
Bias & Variance on 𝑀𝑆𝐸
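The connection these slides draw between MSE, bias, and variance is the standard decomposition (as in the textbook):

$MSE = \mathbb{E}\left[\left(\hat{\theta}_m - \theta\right)^2\right] = \mathrm{Bias}\left(\hat{\theta}_m\right)^2 + \mathrm{Var}\left(\hat{\theta}_m\right)$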
Ways to Trade Off Bias & Variance
Cross-validation: highly successful on many real-world tasks.
MSE of the estimates: MSE incorporates both bias and variance.
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model.
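In symbols (the standard formulation), MLE selects the parameters that maximize the likelihood of the observed samples, equivalently the sum of log-likelihoods:

$\theta_{ML} = \arg\max_{\theta} p_{model}(\mathbb{X}; \theta) = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}\left(x^{(i)}; \theta\right)$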
Linear Regression as Maximum Likelihood
Instead of producing a single prediction 𝑦^, we consider the model as producing a conditional distribution 𝑝(𝑦 | 𝗑).
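Concretely (a standard result, consistent with the textbook): taking the conditional distribution to be Gaussian with fixed variance,

$p(y \mid x) = \mathcal{N}\left(y;\, \hat{y}(x; w),\, \sigma^2\right), \qquad \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}\right) = -\frac{m}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2,$

so maximizing the conditional log-likelihood with respect to 𝒘 yields the same estimate as minimizing 𝑀𝑆𝐸_train.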
Bayesian statistics
Chapter 5. Machine Learning Basics
Part 1 (frequentist statistics, point estimation)
- 5.1 Learning algorithms
- 5.2 Capacity, overfitting and underfitting
- 5.3 Hyperparameters and validation sets
- 5.4 Estimators, bias and variance
- 5.5 Maximum likelihood estimation
Part 2 (Bayesian statistics)
- 5.6 Bayesian statistics
- 5.7 Supervised learning algorithms
Bayesian statistics
Bayesian perspective
- Uses probability to reflect degrees of certainty in states of knowledge
- The dataset is directly observed and so is not random
- The parameter 𝜽 is represented as a random variable
The prior
- We represent our knowledge of 𝜽 using the prior probability distribution 𝑝(𝜽), specified before observing the data
- Select a broad prior distribution (reflecting a high degree of uncertainty), such as a uniform distribution over a finite range, or a Gaussian
Mathematical description
Set of data samples {𝗑⁽¹⁾, 𝗑⁽²⁾, ⋯ , 𝗑⁽ᵐ⁾}
The dataset is directly observed and so is not random
The parameter 𝜽 is represented as a random variable
Combine the data likelihood with the prior via Bayes' rule:
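The rule referred to above, in its standard form:

$p\left(\theta \mid x^{(1)}, \dots, x^{(m)}\right) = \frac{p\left(x^{(1)}, \dots, x^{(m)} \mid \theta\right)\, p(\theta)}{p\left(x^{(1)}, \dots, x^{(m)}\right)}$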
Relative to MLE
Bayesian estimation makes predictions using a full distribution over 𝜽.
After observing 𝑚 samples, the predicted distribution over the next data sample 𝗑⁽ᵐ⁺¹⁾ is given by:
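The predictive distribution referred to above, in its standard form (marginalizing over 𝜽):

$p\left(x^{(m+1)} \mid x^{(1)}, \dots, x^{(m)}\right) = \int p\left(x^{(m+1)} \mid \theta\right)\, p\left(\theta \mid x^{(1)}, \dots, x^{(m)}\right)\, d\theta$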
Maximum A Posteriori (MAP) Estimation
Choose the point of maximal posterior probability.
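In symbols (standard form), MAP maximizes the log posterior, i.e. the log-likelihood plus the log prior:

$\theta_{MAP} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} \left[\log p(x \mid \theta) + \log p(\theta)\right]$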
MLE vs. MAP
Figure: comparison of the MLE estimate and the MAP estimate.
Bayesian statistics
Figure: Bayesian updating (posterior ∝ likelihood × prior), shown after the 1st sampled data point and after the 20th sampled data point.
Support Vector Machine (SVM)
Definition
- Addresses the two-class classification problem in a direct way
- Finds a hyperplane that separates the classes in feature space by as large a margin as possible
Separating hyperplane
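For reference, the standard maximum-margin formulation behind this definition (not spelled out on the slide): the separating hyperplane is $w^\top x + b = 0$, a point is classified by the sign of $w^\top x + b$, and the margin is maximized by solving

$\min_{w, b}\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y^{(i)}\left(w^\top x^{(i)} + b\right) \ge 1 \;\; \text{for all } i.$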
Support Vector Machine (SVM)
Support Vector Classifier

Tree Based Methods
Bagging
Diagram: Total Sample → Sampling → Data → Modeling → Average or Majority Vote → Final Model
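A minimal sketch of the bagging procedure in the diagram above (illustrative; it assumes scikit-learn's DecisionTreeClassifier is available as the base model and that class labels are small non-negative integers): draw bootstrap samples, fit one model per sample, and combine predictions by majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    """Fit n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))       # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the individual models' predictions (integer labels assumed)."""
    votes = np.stack([m.predict(X) for m in models])     # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# hypothetical usage (X_train, y_train, X_test are placeholders, not defined here):
# models = bagging_fit(X_train, y_train); y_hat = bagging_predict(models, X_test)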
Tree Based Methods
Boosting
Diagram: Total Sample → repeated rounds of (Modeling → Weak Classifier → weight update) → weighted combination into the Final Model
Tree Based Methods
Boosting
- Consider the problem of classifying '+' and '-' points with a tree-based classifier
- First, a weak classifier labels the points using a single vertical line on the left side
- Then, the incorrectly classified points are given larger weights (the large '+' symbols in the second figure), and another weak classifier is trained (the line on the right side)
- Repeat this procedure and finally merge the weak classifiers (a code sketch of this procedure follows)
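A compact sketch of this boosting procedure in the AdaBoost style (the specific weighting scheme is an assumption; the slides describe the idea generically). Labels are assumed to be ±1, and decision stumps from scikit-learn serve as the weak classifiers.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """y must be in {-1, +1}. Returns the weak classifiers and their weights."""
    m = len(X)
    sample_w = np.full(m, 1.0 / m)                   # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # a weak classifier (decision stump)
        stump.fit(X, y, sample_weight=sample_w)
        pred = stump.predict(X)
        err = np.sum(sample_w * (pred != y)) / np.sum(sample_w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weight of this weak classifier
        sample_w *= np.exp(-alpha * y * pred)        # boost weights of misclassified points
        sample_w /= sample_w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final model: sign of the weighted sum of the weak classifiers."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)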
Other Supervised Learning Algorithms
Linear Regression
- Ridge
- Lasso
Logistic Regression
LDA (Linear Discriminant Analysis)
Random Forest
KNN (K-Nearest Neighbors)
Naïve Bayes
Neural Network (MLP)
…
Unsupervised Learning
Algorithms
K-means clustering
Find the K clusters that best describe the data
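A minimal NumPy sketch of the K-means procedure (the initialization and stopping rule are illustrative, not the slides' own code): alternately assign each point to its nearest centroid and move each centroid to the mean of its assigned points.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # k data points as initial centroids
    for _ in range(n_iters):
        # assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its cluster
        # (empty clusters are ignored here for simplicity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels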
Stochastic Gradient Descent
(SGD)
Gradient Descent
A method for updating the parameters.
Consider the model cost function 𝐽(𝜽); gradient descent repeatedly updates the parameters in the direction of the negative gradient, $\theta \leftarrow \theta - \epsilon \nabla_{\theta} J(\theta)$, where ε is the learning rate.
Limitation of Gradient Descent
The issue of local minima.
Stochastic Gradient Descent (SGD)
The SGD method
- An extension of gradient descent
- Nearly all of deep learning is powered by this method (the deep-learning cost surface is not convex)
Uses batch (minibatch) learning
- Calculate the loss function on a batch (sample) of the data
- Update the parameters to 𝜽_new using the gradient of the batch loss function
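A short NumPy sketch of minibatch SGD for the linear-regression cost above (the learning rate, batch size, and data layout are illustrative assumptions):

import numpy as np

def sgd_linear_regression(X, y, lr=0.01, batch_size=8, n_epochs=50, seed=0):
    """Minimize J(theta) = mean squared error, using gradients of minibatch losses."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))                       # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of the minibatch MSE: (2/n) X_b^T (X_b theta - y_b)
            grad = 2.0 / len(batch) * X[batch].T @ (X[batch] @ theta - y[batch])
            theta = theta - lr * grad                       # theta_new = theta - lr * batch gradient
    return theta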
SGD vs. GD
GD moves in the steepest-descent direction, but is slower to compute per iteration for large datasets.
SGD can be viewed as noisy descent, but is faster per iteration.
The next Deep Learning Seminar