
Module-1 Deep Learning (Autosaved)

The document provides information about the book 'Deep Learning' by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, released on November 10, 2016. It covers foundational concepts in machine learning, including learning algorithms, supervised and unsupervised learning, and the challenges of overfitting and underfitting. The text emphasizes the importance of model capacity and generalization in achieving effective machine learning outcomes.

Uploaded by

yashbnv

Book Information
Title: Deep learning

Authors:
 Ian Goodfellow
 Yoshua Bengio
 Aaron Courville

Released: November 10, 2016

ISBN: 9780262337434

Chapter Organization (Part 1)

Chapter Organization (Part 2)

Machine Learning and AI

Chapter 5. Machine Learning Basics
This chapter provides a brief course in the most important
general principles

1. Learning algorithms
2. Capacity, overfitting and underfitting
3. Hyperparameters and validation sets
4. Estimators, bias and variance
5. Maximum likelihood estimation
6. Bayesian statistics
7. Supervised learning algorithms
8. Unsupervised learning algorithms
9. Stochastic gradient descent
10. Building a machine learning algorithm
11. Challenges motivating deep learning

Checkers Game: The First ML Application
Samuel's checkers-playing program appears to be the world's first self-learning program (Arthur Samuel, 1959)
 Over thousands of games, the program learned to recognize which patterns led to a win or a loss
 Eventually, the program played much better than Samuel himself could

Learning Algorithms
• A field of study that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959)
• A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃, if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸 (Tom M. Mitchell, 1997)
 𝐸: the experience of playing thousands of games
 𝑇: playing the game of checkers
 𝑃: the fraction of games it wins against human opponents
 By this definition, Samuel's program has learned to play checkers

Task, 𝑇
Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by humans.
Machine learning tasks are described in terms of how the machine learning system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event.

Task, 𝑇
Classification (with missing inputs)
Regression
Transcription
Machine translation
Structured output
Anomaly detection
Synthesis and sampling
Imputation of missing values
Denoising

Classification
The computer program is asked to specify which of K categories some input belongs to.
Example: object recognition.

Regression

Transcription and Machine Translation

Performance Measure, 𝑃
Accuracy (or error rate) for categorical data
 Measures the proportion of examples for which the model produces the correct output
Average log-probability for density estimation
 Gives a continuous-valued score for each example
E.g., a test set for validating the model with unseen data
1. Separate the data into a training set and a test set
2. Train the model with the training set
3. Measure the model's performance with the test set
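
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 1-D two-class data, the threshold "model", and the helper names `train_test_split`, `fit_threshold`, and `accuracy` are hypothetical and do not come from the slides.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Step 1: shuffle the examples and hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def fit_threshold(X_train, y_train):
    """Step 2: a toy model, the midpoint between the two class means."""
    return 0.5 * (X_train[y_train == 0].mean() + X_train[y_train == 1].mean())

def accuracy(y_true, y_pred):
    """Proportion of examples for which the model output is correct."""
    return float(np.mean(y_true == y_pred))

# Hypothetical data: two 1-D Gaussian classes centered at -2 and +2.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

X_tr, y_tr, X_te, y_te = train_test_split(X, y)
threshold = fit_threshold(X_tr, y_tr)

# Step 3: measure performance on data the model has not seen.
test_accuracy = accuracy(y_te, (X_te > threshold).astype(int))
print(f"test accuracy = {test_accuracy:.3f}, error rate = {1 - test_accuracy:.3f}")
```

The error rate is simply one minus the accuracy, so either number can serve as 𝑃 for a classification task.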
Experience, 𝐸
Unsupervised learning algorithms
 Experience a dataset containing many features
 Learn useful properties of the structure of the dataset
Supervised learning algorithms
 Experience a dataset associated with labels
 Learn to predict the labels from the data
Reinforcement learning algorithms
 Do not just experience a fixed dataset, but interact with an environment
 Learn actions to maximize cumulative reward

Example: Linear Regression

Linear Regression
• Linear Regression is a machine learning
algorithm based on supervised learning.
• Regression models a target prediction
value based on independent variables. It is
mostly used for finding out the relationship
between variables and forecasting.
• Different regression models differ based on
– the kind of relationship between
dependent and independent variables they
are considering, and the number of
independent variables being used.
Linear Regression
•Linear regression performs the
task to predict a dependent
variable value (y) based on a given
independent variable (x).
•So, this regression technique
finds out a linear relationship
between x (input) and y(output).
•Hence, the name Linear
Regression.
•In the figure above, X (input) is
the work experience and Y
(output) is the salary of a person.
•The regression line is the best fit
line for our model.
Linear Regression
While training the model we are given:
x: input training data (univariate: one input variable)
y: labels for the data (supervised learning)
When training the model, it fits the best line to predict the value of y for a given value of x.
The model gets the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we finally use our model for prediction, it will predict the value of y for the input value of x.
How to update θ1 and θ2 values to get the best fit line?
Cost Function (J):
•By achieving the best-fit
regression line, the model
aims to predict y value such
that the error difference
between predicted value
and true value is minimum.
•So, it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).
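
One common way to update θ1 and θ2 is gradient descent on a mean squared error cost. The sketch below assumes that cost, J = mean((pred − y)²), and a hypothetical experience/salary dataset; the learning rate and step count are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def cost(theta1, theta2, x, y):
    """J: mean squared difference between predicted and true y."""
    pred = theta1 + theta2 * x
    return float(np.mean((pred - y) ** 2))

def fit_line(x, y, learning_rate=0.05, n_steps=2000):
    """Gradient descent on J, starting from theta1 = theta2 = 0."""
    theta1, theta2 = 0.0, 0.0
    for _ in range(n_steps):
        error = (theta1 + theta2 * x) - y
        grad1 = 2.0 * np.mean(error)       # dJ/dtheta1 (intercept)
        grad2 = 2.0 * np.mean(error * x)   # dJ/dtheta2 (coefficient of x)
        theta1 -= learning_rate * grad1
        theta2 -= learning_rate * grad2
    return theta1, theta2

# Hypothetical data: years of experience (x) and salary (y), exactly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 30.0 + 5.0 * x

theta1, theta2 = fit_line(x, y)
print(f"theta1 = {theta1:.4f}, theta2 = {theta2:.4f}, J = {cost(theta1, theta2, x, y):.2e}")
```

Because the hypothetical data lie exactly on a line, the fitted values approach the intercept and slope used to generate them.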
Formal Definition of Linear Regression
Task, 𝑇: to predict 𝑦 from 𝗑 by outputting $\hat{y} = w^\top x$
 $x \in \mathbb{R}^n$: input data
 $y \in \mathbb{R}$: output value ($\hat{y}$: predicted by the model)
 $w \in \mathbb{R}^n$: parameters (or weights)
Experience, 𝐸: training set $(X^{(train)}, y^{(train)})$
Performance measure, 𝑃: mean squared error (MSE) on $(X^{(test)}, y^{(test)})$
 $MSE_{test} = \frac{1}{m} \sum_i (\hat{y}^{(test)} - y^{(test)})_i^2$
 𝑚: number of examples in the dataset

Find w by Minimizing $MSE_{train}$
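
Setting the gradient of the training MSE with respect to w to zero gives the normal equations, (XᵀX) w = Xᵀy. A minimal NumPy sketch follows, assuming XᵀX is invertible; the synthetic data and the names `fit_normal_equations` and `w_true` are hypothetical.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve (X^T X) w = X^T y, the zero-gradient condition of the training MSE."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Hypothetical training set: 50 examples, 3 features, small Gaussian noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_train = X_train @ w_true + 0.1 * rng.normal(size=50)

w = fit_normal_equations(X_train, y_train)
print("w =", np.round(w, 3), " training MSE =", round(mse(X_train, y_train, w), 5))
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of XᵀX, which is usually the numerically safer choice.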

Linear Regression Problem

Generalization

• Central challenge of ML is that the algorithm must perform well on new, previously unseen inputs
– Not just those on which our model has been trained
• Ability to perform well on previously unobserved inputs is called generalization
Generalization Error
• When training a machine learning model we have access to a training set
– We can compute some error measure on the training set, called training error
• ML differs from optimization
– We also want the generalization error (also called test error) to be low
• Generalization error definition
– Expected value of the error on a new input
– For linear regression, the test error is: $\frac{1}{m^{(test)}} \| X^{(test)} w - y^{(test)} \|_2^2$
• Expected value is computed as an average over inputs taken from the distribution encountered in practice
Estimating the generalization error
• We estimate generalization error of a ML model by measuring performance on a test set that was collected separately from the training set
• In the linear regression example we train the model by minimizing the training error
 $\frac{1}{m^{(train)}} \| X^{(train)} w - y^{(train)} \|_2^2$
– But we actually care about the test error
 $\frac{1}{m^{(test)}} \| X^{(test)} w - y^{(test)} \|_2^2$
• How can we affect performance when we observe only the training set?
– Statistical learning theory provides some answers
Statistical Learning Theory
• Need assumptions about training and test sets
– Training/test data arise from same process
– We make use of i.i.d. assumptions
1. Examples in each data set are independent
2. Training set and testing set are identically distributed
– We call the shared distribution the data generating distribution, $p_{data}$
• Probabilistic framework and i.i.d. assumptions allow us to study the relationship between training and testing error
Why are training/test errors
unequal?
• Expected training error of a randomly
selected model is equal to expected
test error of model
– If we have a joint distribution p(x,y) and we
randomly sample from it to generate the
training and test sets. For some fixed
value w the expected training set error is
the same as the expected test set error
• But we don’t fix the parameters w in
advance!
– We sample the training set and then use it to choose w to reduce training set error
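
A small simulation can make this point concrete. The sketch below assumes a hypothetical 1-D data generating process (y = 2x plus Gaussian noise) and a one-parameter model ŷ = w·x; none of these choices come from the slides. With w fixed in advance, the average training and test errors agree; with w chosen to fit each training set, the average training error falls below the average test error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m i.i.d. examples from a hypothetical data generating distribution."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=m)
    return x, y

def mse(w, x, y):
    return float(np.mean((w * x - y) ** 2))

n_trials, m = 2000, 10
w_fixed = 1.0                      # chosen before seeing any data
fixed = np.zeros((n_trials, 2))    # columns: training error, test error
fitted = np.zeros((n_trials, 2))

for t in range(n_trials):
    x_tr, y_tr = sample(m)
    x_te, y_te = sample(m)
    fixed[t] = mse(w_fixed, x_tr, y_tr), mse(w_fixed, x_te, y_te)
    # Least-squares choice of w using the training set only.
    w_hat = np.sum(x_tr * y_tr) / np.sum(x_tr ** 2)
    fitted[t] = mse(w_hat, x_tr, y_tr), mse(w_hat, x_te, y_te)

mean_fixed = fixed.mean(axis=0)
mean_fitted = fitted.mean(axis=0)
print("fixed w : train %.3f, test %.3f" % tuple(mean_fixed))
print("fitted w: train %.3f, test %.3f" % tuple(mean_fitted))
```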
Generalization
In ML, generalization is the ability to perform well on previously unobserved inputs. How well an algorithm performs is determined by its ability to:
1. Make the training error small
2. Make the gap between training and test error small
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set
Overfitting occurs when the gap between the training error and the test error is too large
We can control whether a model is more likely to overfit or underfit by altering its capacity

Generalization
Overfitting
 Occurs where a machine learning model can't generalize or fit well on an unseen dataset
 Refers to a modeling error that occurs when a function corresponds too closely to a dataset
 The model learns the detail and noise in the training dataset to the extent that it negatively impacts the performance of the model on a new dataset
Underfitting
 Occurs where a machine learning model can neither model the training dataset nor generalize to a new dataset

Capacity of a model
• Model capacity is ability to fit variety of functions
– Model with Low capacity struggles to fit training
set
– A High capacity model can overfit by
memorizing properties of training set not
useful on test set
• When model has higher capacity, it overfits
– One way to control capacity of a learning
algorithm is by choosing the hypothesis
space
• i.e., set of functions that the learning algorithm is
allowed to select as being the solution
– E.g., the linear regression algorithm has the set of all
linear functions of its input as the hypothesis space
– We can generalize to include polynomials in its hypothesis space
Capacity of Polynomial Curve Fits
• A polynomial of degree 1 gives a linear regression model with the prediction
 $\hat{y} = b + wx$
– By introducing $x^2$ as another feature provided to the regression model, we can learn a model that is quadratic as a function of x
 $\hat{y} = b + w_1 x + w_2 x^2$
• The output is still a linear function of the parameters so we can use normal equations to train in closed-form
• We can continue to add more powers of x as additional features, e.g., a polynomial of degree 9
 $\hat{y} = b + \sum_{i=1}^{9} w_i x^i$
Appropriate Capacity
• Machine Learning algorithms will perform well when their capacity is appropriate for the true complexity of the task that they need to perform and the amount of training data they are provided with
• Models with insufficient capacity are unable to solve complex tasks
• Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit
Principle of Capacity in action
• We fit three models to a training set
– Data generated synthetically by sampling x values and choosing y deterministically (a quadratic function)
Polynomial model: $y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
[Figure: three panels]
Linear function fit: cannot capture curvature present in data (underfitting)
Quadratic function fit: generalizes well to unseen data. No underfitting or overfitting
Polynomial of degree 9: suffers from overfitting. Used Moore-Penrose inverse to solve underdetermined normal equations (we have N equations corresponding to N training samples). Solution passes through all points but does not capture correct structure: deep valley between two points not true of underlying function, also function is decreasing at first point, not increasing
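
The three fits can be reproduced in a rough way with NumPy. This is a sketch under assumptions: the particular quadratic, the five evenly spaced training points, and the dense test grid are hypothetical choices, and `np.linalg.pinv` stands in for the Moore-Penrose inverse mentioned above.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns are x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_polynomial(x, y, degree):
    # Moore-Penrose pseudoinverse: least squares when overdetermined,
    # minimum-norm interpolating solution when underdetermined.
    return np.linalg.pinv(design_matrix(x, degree)) @ y

def mse(x, y, w):
    return float(np.mean((design_matrix(x, len(w) - 1) @ w - y) ** 2))

def true_function(x):
    # Hypothetical underlying quadratic (an assumption for this sketch).
    return 1.0 - 2.0 * x + 3.0 * x ** 2

x_train = np.linspace(-1.0, 1.0, 5)    # fewer points than degree-9 parameters
y_train = true_function(x_train)       # y chosen deterministically
x_test = np.linspace(-1.0, 1.0, 201)
y_test = true_function(x_test)

results = {}
for degree in (1, 2, 9):
    w = fit_polynomial(x_train, y_train, degree)
    results[degree] = (mse(x_train, y_train, w), mse(x_test, y_test, w))
    print(f"degree {degree}: train MSE = {results[degree][0]:.2e}, "
          f"test MSE = {results[degree][1]:.2e}")
```

In this setup the degree-1 model has a large training error, the degree-2 model fits both sets, and the degree-9 model interpolates the training points while its test error stays above zero.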
Ordering Learning Machines by Capacity
• Goal of learning is to choose an optimal element of a structure (e.g., polynomial degree) and estimate its coefficients from a given training sample.
• For approximating functions linear in parameters such as polynomials, complexity is given by the number of free parameters.
• For functions nonlinear in parameters, the complexity is defined as the VC-dimension.
• The optimal choice of model complexity provides the minimum of the expected risk.
Representational and Effective Capacity
• Representational capacity:
– Specifies family of functions learning algorithm can choose from
• Effective capacity:
– May be less than the representational capacity, because imperfections in the optimization algorithm limit which functions are actually found
• Occam's razor:
– Among competing hypotheses that explain known observations equally well, choose the simplest one
• Idea is formalized in VC dimension
VC Dimension
• Statistical learning theory quantifies model capacity
• VC dimension quantifies capacity of a binary classifier
• VC dimension is the largest value of m for which there exists a training set of m different points that the classifier can label arbitrarily
VC dimension (capacity) of a linear classifier in $\mathbb{R}^2$ is 3
Capacity and Learning Theory
• Quantifying the capacity of a model enables statistical learning theory to make quantitative predictions
• Most important results in statistical learning theory show that:
• The discrepancy between training error and generalization error
– is bounded from above by a quantity that grows as the model capacity grows
– But shrinks as the number of training examples increases
Usefulness of statistical learning theory
• Provides intellectual justification that machine learning algorithms can work
• But rarely used in practice with deep learning
• This is because:
– The bounds are loose
– Also difficult to determine capacity of deep learning algorithms
Typical generalization error
Relationship between capacity and error
Typically generalization error has a U-shaped curve as a function of model capacity
Arbitrarily high capacity: Nonparametric models
• When do we reach the most extreme case of arbitrarily high capacity?
• Parametric models such as linear regression:
• learn a function described by a parameter vector whose size is finite and fixed before data is observed
• Nonparametric models have no such limitation
• Nearest-neighbor regression is an example
Nearest neighbor regression
• Simply store the X and y from the training set
• When asked to classify a test point x the model looks up the nearest entry in the training set and returns the associated target, i.e.,
 $\hat{y} = y_i$ where $i = \arg\min_i \| X_{i,:} - x \|_2^2$
• Algorithm can be generalized to distance metrics other than the $L^2$ norm, such as learned distance metrics
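
The lookup rule above is short enough to write out directly. A minimal sketch with squared L2 distance follows; the function name `nearest_neighbor_predict` and the tiny dataset are hypothetical.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return y_i for the training row X[i, :] closest to x in squared L2 distance."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    i = int(np.argmin(sq_dist))
    return y_train[i]

# "Training" only stores the data; there are no parameters to fit.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0.0, 10.0, 20.0])

prediction = nearest_neighbor_predict(X_train, y_train, np.array([0.9, 1.2]))
print(prediction)
```

Every stored training point is predicted exactly, which is why the capacity of this model grows with the size of the training set.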


Bayes Error
• Ideal model is an oracle that knows the true probability distributions that generate the data
• Even such a model incurs some error due to noise/overlap in the distributions
• The error incurred by an oracle making predictions from the true distribution p(x,y) is called the Bayes error
Effect of size of training set
• Expected generalization error can never increase as the number of training examples increases
• For nonparametric models, more data yields better generalization until the best possible error is achieved
• Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error
Effect of training set size
Synthetic regression problem: noise added to a 5th degree polynomial. Generated a single test set and several different sizes of training sets. Error bars show 95% confidence interval.
As training set size increases, optimal capacity increases. It plateaus after reaching sufficient complexity.
Probably Approximately Correct
• Learning theory claims that a ML algorithm generalizes from a finite training set of examples
• This seems to contradict basic principles of logic
• Inductive reasoning, or inferring general rules from a limited number of samples, is not logically valid
• To infer a rule describing every member of a set, one must have information about every member of the set
• ML avoids this problem with probabilistic rules
• Rather than rules of logical reasoning


The no free lunch theorem
• PAC does not entirely resolve the problem
• No free lunch theorem states:
• Averaged over all distributions, every algorithm has the same error classifying unobserved points
• i.e., no ML algorithm is universally better than any other
• The most sophisticated algorithm has the same average error rate as one that merely predicts that every point belongs to the same class
• Fortunately these results hold only when we average over all possible data generating distributions
Regularization
• No free lunch theorem implies that we
must design our ML algorithms to
perform well on a specific task
• We do so by building a set of
preferences into learning algorithm
• When these preferences are aligned
with the learning problems it performs
better

Other ways of solution preference
• Only method so far for modifying learning algo is to change representational capacity
• Adding/removing functions from hypothesis space
• Specific example: degree of polynomial
• Specific identity of those functions can also affect behavior of our algorithm
• Ex of linear regression: include weight decay
 $J(w) = MSE_{train} + \lambda w^\top w$
• λ chosen to control preference for smaller weights
Tradeoff between fitting data and being small
Controlling a model's tendency to overfit or underfit via weight decay
Weight decay for high degree polynomial (degree 9) while the data comes from a quadratic

9-degree Regression Model with Weight
Decay

Large 𝒘 over-fits
to training data


Regularization
Regularization is any modification to a learning algorithm intended to prevent overfitting
 Regularization is able to control the performance of an algorithm
 Intended to reduce test error but not training error
Example: weight decay for a regression problem
 $J(w) = MSE_{train} + \lambda w^\top w$
• 𝐽(𝒘): cost function to be minimized on training
• 𝜆: control factor of the preference for smaller weights (𝜆 ≥ 0)
 Trades off between fitting the training data and 𝒘 being small

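For linear regression, the weight decay cost J(w) = MSE_train + λ wᵀw has a closed-form minimizer. The sketch below assumes MSE_train = (1/m)‖Xw − y‖², which leads to (XᵀX + mλI) w = Xᵀy; the noisy quadratic data, the degree-9 features, and the λ values are hypothetical.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns are x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_weight_decay(X, y, lam):
    """Minimize (1/m)||Xw - y||^2 + lam * w^T w in closed form."""
    m, n = X.shape
    # Setting the gradient of J(w) to zero gives (X^T X + m*lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

# Hypothetical data: a noisy quadratic, fit with a degree-9 polynomial.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1.0, 1.0, size=10))
y_train = 1.0 - 2.0 * x_train + 3.0 * x_train ** 2 + 0.1 * rng.normal(size=10)
X_train = design_matrix(x_train, degree=9)

lambdas = [1e-6, 1e-3, 1e-1, 1e1]
weight_norms = []
for lam in lambdas:
    w = fit_weight_decay(X_train, y_train, lam)
    weight_norms.append(float(np.linalg.norm(w)))
    train_mse = float(np.mean((X_train @ w - y_train) ** 2))
    print(f"lambda = {lam:g}: ||w|| = {weight_norms[-1]:.3f}, train MSE = {train_mse:.4f}")
```

Larger λ gives smaller weights and a higher training error, which is the trade-off described above.
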
Training Set Size and Generalization
Polynomial (degree-9)
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Chapter 1.

Hyperparameters vs. parameters
Hyperparameters are higher-level properties of a model
 Decide the model's capacity (or complexity)
 Are not learned during training
 e.g., degree of regression model, weight decay
Parameters are properties of the model that are learned from the training data
 Learned during training by a ML model
 e.g., weights of regression model

Setting Hyperparameters with Validation Set
Setting hyperparameters in the training step is not appropriate
 Hyperparameters will be set to yield overfitting (e.g., higher degree of regression model, 𝜆 → 0)
Test set must not be seen for training nor for model choosing (hyperparameter setting)
So, we need a validation set that the training algorithm does not observe
1. Split validation set from training data
2. Train a model with training data (not including validation set)

𝑘-fold Cross Validation
The training set is split into 𝑘 folds. In each iteration one fold is the validation fold and the remaining folds are the training folds.
1st iteration: 𝐸1
2nd iteration: 𝐸2
3rd iteration: 𝐸3
...
𝑘-th iteration: 𝐸𝑘

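A rough sketch of the procedure, used here to compare polynomial degrees (a hyperparameter). Averaging the per-fold errors 𝐸1, …, 𝐸𝑘 into one score is a common convention and is assumed here; the noisy quadratic data and the helper names are hypothetical.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validation_error(x, y, degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial fit over k folds."""
    folds = k_fold_indices(len(x), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)   # fit on training folds
        pred = np.polyval(coeffs, x[val])                 # evaluate on validation fold
        errors.append(np.mean((pred - y[val]) ** 2))      # E_i
    return float(np.mean(errors))

# Hypothetical data: a noisy quadratic.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.normal(size=40)

cv_errors = {degree: cross_validation_error(x, y, degree) for degree in (1, 2, 9)}
for degree, err in cv_errors.items():
    print(f"degree {degree}: cross-validation MSE = {err:.4f}")
```

The degree with the lowest cross-validation error would then be chosen, and the test set is still untouched.
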
Estimation in Statistics
Point estimation
 Attempt to provide the single best prediction of the true but unknown property of a model using sample data, $x^{(1)}, \dots, x^{(m)}$
Function estimation
 Types of estimation that predict the relationship between input and target variables
Point estimator
 $\hat{\theta}_m = g(x^{(1)}, \dots, x^{(m)})$
 𝑚: number of data elements
 $\hat{\theta}$: point estimator for the property of a model (e.g., expectation)
 $x^{(1)}, \dots, x^{(m)}$: independent and identically distributed (i.i.d.) data points
Properties of an Estimator
Bias measures the expected deviation from the true value of 𝜽
 $bias(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta$
Variance measures the deviation from the expected estimator value that any particular sampling of the data is likely to cause
 $Var(\hat{\theta})$

Graphical Illustration of Bias and Variance
High-bias model
High-variance model
Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff, 2012

Bias-Variance Trade-off with Capacity

Bias & Variance on MSE

Ways to Trade-off Bias & Variance
Cross-validation
 Highly successful on many real-world
tasks

MSE of estimates
 MSE incorporates both bias and
variance
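
The decomposition MSE = bias² + variance can be checked numerically. The sketch below compares two estimators of the variance of a Gaussian, dividing by m and by m − 1; the true variance, the sample size, and the number of trials are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_variance, m, n_trials = 4.0, 5, 200_000

# Each row is one dataset of m i.i.d. samples from N(0, true_variance).
samples = rng.normal(0.0, np.sqrt(true_variance), size=(n_trials, m))

def summarize(estimates, theta):
    """Empirical bias, variance, and MSE of an estimator across many datasets."""
    bias = float(np.mean(estimates) - theta)
    variance = float(np.var(estimates))
    mse = float(np.mean((estimates - theta) ** 2))
    return bias, variance, mse

biased = summarize(np.var(samples, axis=1, ddof=0), true_variance)    # divide by m
unbiased = summarize(np.var(samples, axis=1, ddof=1), true_variance)  # divide by m - 1

for name, (bias, variance, mse) in (("1/m", biased), ("1/(m-1)", unbiased)):
    print(f"{name:8s} bias = {bias:+.3f}, variance = {variance:.3f}, "
          f"MSE = {mse:.3f}, bias^2 + variance = {bias ** 2 + variance:.3f}")
```

In this setup the 1/m estimator is biased but has lower variance, which shows how the two quantities can be traded against each other.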


Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model

Linear Regression as Maximum Likelihood
Instead of a single prediction 𝑦^, we think of the model as producing a conditional distribution, 𝑝(𝑦|𝗑)

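As a worked step, assume the conditional distribution is Gaussian with mean given by the model prediction and a fixed variance σ² (the fixed σ² is an assumption of this sketch). The conditional log-likelihood of m i.i.d. training examples is then:

```latex
\begin{aligned}
p(y \mid x) &= \mathcal{N}\!\left(y;\ \hat{y}(x; w),\ \sigma^2\right) \\
\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; w\right)
  &= -m \log \sigma - \frac{m}{2} \log(2\pi)
     - \sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{2\sigma^2}
\end{aligned}
```

Only the last term depends on w, so maximizing the log-likelihood with respect to w gives the same w as minimizing the mean squared error on the training set.
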
Bayesian statistics

Chapter 5. Machine Learning Basics
Part 1
- 5.1 Learning algorithms
- 5.2 Capacity, overfitting and underfitting
- 5.3 Hyperparameters and validation sets
- 5.4 Estimators, bias and variance
- 5.5 Maximum likelihood estimation
 (5.4 and 5.5: frequentist statistics, point estimation)
Part 2
- 5.6 Bayesian statistics
- 5.7 Supervised learning algorithms
- 5.8 Unsupervised learning algorithms
- 5.9 Stochastic gradient descent
- 5.10 Building a machine learning algorithm
- 5.11 Challenges motivating deep learning

Bayesian statistics
Bayesian perspective
- Uses probability to reflect degrees of certainty of states of
knowledge
- The dataset is directly observed and so is not random
- Parameter θ is represented as random variable
The prior
- We represent our knowledge of 𝜽 using the prior probability distribution, denoted 𝑝(𝜽), before observing data
- Select a broad prior distribution (with a high degree of uncertainty), such as a uniform distribution over a finite range or volume, or a Gaussian

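Mathematical description
Set of data samples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$
The dataset is directly observed and so is not random
Parameter 𝜽 is represented as a random variable
Combine the data likelihood with the prior via Bayes' rule:
 $p(\theta \mid x^{(1)}, \dots, x^{(m)}) = \frac{p(x^{(1)}, \dots, x^{(m)} \mid \theta)\, p(\theta)}{p(x^{(1)}, \dots, x^{(m)})}$
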
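Relative to MLE
Make predictions using a full distribution over 𝜽
After observing m samples, the predicted distribution over the next data sample, $x^{(m+1)}$, is given by:
 $p(x^{(m+1)} \mid x^{(1)}, \dots, x^{(m)}) = \int p(x^{(m+1)} \mid \theta)\, p(\theta \mid x^{(1)}, \dots, x^{(m)})\, d\theta$
The prior distribution has influence by shifting probability mass toward regions of the parameter space that are preferred a priori
Bayesian methods typically generalize much better when limited training data is available
But they have a high computational cost when the number of training examples is large
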
Maximum A Posteriori (MAP) Estimation
Choose the point of maximal posterior probability
 $\theta_{MAP} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta \log p(x \mid \theta) + \log p(\theta)$
Has the advantage of leveraging information that is brought by the prior
The additional information helps reduce the variance of the MAP estimate
But it increases bias
Regularized estimation strategies can be interpreted as making the MAP approximation

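A small example of how the prior shifts the estimate, using a coin flip model. This is only a sketch: the Bernoulli likelihood and the Beta(a, b) prior are assumptions chosen because the MAP estimate then has a simple closed form; they do not appear in the slides.

```python
def bernoulli_mle(heads, m):
    """Maximum likelihood estimate of the heads probability: heads / m."""
    return heads / m

def bernoulli_map(heads, m, a=2.0, b=2.0):
    """MAP estimate under a hypothetical Beta(a, b) prior (mode of the posterior).

    The posterior is Beta(heads + a, m - heads + b); for a, b > 1 its mode is
    (heads + a - 1) / (m + a + b - 2).
    """
    return (heads + a - 1.0) / (m + a + b - 2.0)

# With very little data the prior pulls the estimate toward 0.5.
print(bernoulli_mle(3, 3), bernoulli_map(3, 3))
# With more data the two estimates nearly agree.
print(bernoulli_mle(180, 300), bernoulli_map(180, 300))
```

After three heads in three flips, the MLE claims the coin always lands heads, while the MAP estimate stays more moderate: lower variance at the price of some bias.
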
MLE vs MAP
[Figure panels: <MLE>, <MAP>]

Bayesian statistics
[Figure: result after the 1st, 2nd, and 20th sampled data points]
* Source: PRML (Pattern Recognition and Machine Learning) textbook

Supervised Learning Algorithms

Support Vector Machine (SVM)
Definition
- Approaches the two-class classification problem in a direct way
- Find a plane that separates the classes in feature space with as large a margin as possible
Separating Hyperplane
* Image from "Introduction to Statistical Learning with R", Springer

Support Vector Machine (SVM)
Maximal Margin Classifier

Support Vector Machine (SVM)
Support Vector Classifier
[Figure panels: <With large C>, <With small C>]
* Image from "Introduction to Statistical Learning with R", Springer

Support Vector Machine (SVM)
Kernel Method (SVM)
- In SVM, we just need to calculate inner products of vectors
- If the data is not linearly separable, we map the data to a higher-dimensional space and build a Support Vector Classifier there
* Image from "Introduction to Statistical Learning with R", Springer

Tree Based Methods
Decision Tree Classification
* Slides from Seo Hui (LG Electronics), "Gradient Boosting Model"

Tree Based Methods
Decision Tree Regression

Tree Based Methods
Decision Tree Complexity

Tree Based Methods
Bagging
Total Sample → Sampling Data → Modeling (repeated for each of 5 samples) → Average or Majority Vote → Final Model
[Figure panels: <Model with Full dataset>, <Model No. 1>, <Model No. 2>, <Model No. 3>, <Model No. 4>, <Model No. 5>]

Tree Based Methods
Boosting
Total Sample → Modeling → Weak Classifier → weight (repeated 5 times) → Final Model

Tree Based Methods
Boosting
- Suppose the problem is to classify ‘+’ and ‘-’ with a tree-based classifier
- First, a weak classifier classifies the labels with a single vertical line on the left side
- Then, give more weight to the incorrectly classified points (the large annotated ‘+’ in the second figure), and fit a weak classifier again (the line on the right side)
- Repeat this procedure, and finally merge the weak classifiers

Other Supervised Learning Algorithms
Linear Regression
- Ridge
- Lasso
Logistic Regression
LDA (Linear Discriminant Analysis)
Random Forest
KNN (K-Nearest Neighbor)
Naïve Bayes
Neural Network (MLP)

Unsupervised Learning Algorithms

K-means clustering
Find the K clusters that best describe the data
* Slides from Andrew Ng (Stanford Univ.), "Machine Learning"

K-means clustering
Number of clusters k = 2
- Randomly initialize "centroids"

K-means clustering
Number of clusters k = 2
- Assign cluster membership
- Update the cluster centroid (average of the data points in each cluster)

K-means clustering
Number of clusters k = 2
- Update cluster membership
- Repeat this procedure until no membership changes

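The steps on the K-means slides (initialize, assign, update, repeat) can be sketched as follows. The two synthetic blobs, the initialization from randomly chosen data points, and the iteration cap are hypothetical choices for illustration.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Plain K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the centroids as k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)
        if np.array_equal(new_labels, labels):
            break  # no membership update: converged
        labels = new_labels
        # Update each centroid to the average of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Hypothetical data: two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

centroids, labels = k_means(X, k=2)
print(np.round(centroids, 2))
```
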
Other Unsupervised Learning Algorithms
PCA (Principal Component Analysis)
ICA (Independent Component Analysis)
ARM (Association Rule Mining)
- Apriori rule
- FP-growth
- Eclat algorithm
Expectation Maximization
Density Estimation

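Stochastic Gradient Descent (SGD)

Gradient Descent
The method for parameter update
Consider the model cost function 𝐽(𝜽)
Gradient of 𝐽(𝜽) with respect to 𝜽 is: $\nabla_\theta J(\theta)$
Update the new parameter: $\theta_{new} = \theta - \epsilon \nabla_\theta J(\theta)$, where $\epsilon$ is the learning rate
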
Limitation of Gradient Descent
Issue of local minimum
If the starting point for gradient descent is chosen inappropriately, it may converge to a local minimum and fail to reach the global minimum

Stochastic Gradient Descent (SGD)
The SGD method
- Extension of gradient descent
- Nearly all of deep learning is powered by this method (deep learning's cost surface is not convex)
Using batch learning (minibatches; one epoch is a full pass over all batches)
- Calculate the loss function on a batch (a sample of the training data)
- Update the new parameter 𝜽𝑛e𝒘 with the gradient of the batch loss function
- At each update, the batch loss function changes, because a different batch is sampled

SGD vs GD
GD goes in steepest descent direction, but slower to compute
per iteration for large datasets
SGD can be viewed as noisy descent, but faster per iteration
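
A minimal sketch of minibatch SGD on a linear regression cost, to show where the noise and the per-iteration speed come from. The synthetic data, batch size, learning rate, and epoch count are hypothetical and not tuned.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.05, batch_size=16, n_epochs=50, seed=0):
    """Minimize the mean squared error with minibatch SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(len(X))            # reshuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ theta - y[batch]
            # Gradient of the batch loss only: a noisy estimate of the
            # full-dataset gradient, but much cheaper to compute.
            grad = 2.0 * X[batch].T @ error / len(batch)
            theta -= learning_rate * grad
    return theta

# Hypothetical data: 500 examples, 3 features, small Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=500)

theta = sgd_linear_regression(X, y)
print(np.round(theta, 3))
```

Full-batch gradient descent would use all 500 examples for every update; here each update touches only 16, so many more updates fit in the same compute budget.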


The next Deep Learning Seminar

[Part 2] Deep Networks: Modern Practice

Chapter 6. Deep Feedforward Networks
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Back-Propagation and Other Differentiation Algorithms
6. Historical Notes