
Optimization algorithms

Dr. Abhijit Debnath


University of Engineering and Management
Newtown, Kolkata
Why optimizers?

• Minimizing the Loss Function:
The optimizer adjusts model parameters to reduce the difference between the predicted output and the actual target values, minimizing the loss function.

• Learning Rate Management:
Optimizers manage the learning rate, which determines the step size for parameter updates. Some optimizers adapt the learning rate dynamically.

• Handling Non-Convex Functions:
Deep learning models often have highly non-convex loss functions. Optimizers help navigate these complex surfaces and avoid poor local minima or saddle points.
Why optimizers?

• Accelerating Convergence:
Techniques like momentum or adaptive learning rates help optimizers converge to the optimal solution faster.

• Addressing Gradient Scaling Issues:
Optimizers like RMSProp and Adam normalize gradients to handle problems like vanishing or exploding gradients.
Common Optimization Algorithms

1. Gradient Descent (GD)
2. Momentum
3. Nesterov Accelerated Gradient (NAG)
4. Adagrad
5. RMSProp
6. Adam
7. AdaMax
8. Nadam
9. AMSGrad
Gradient Descent (GD)
Mathematical Formulation:
θ = θ − η∇J(θ), where η is the learning rate and ∇J(θ) is the gradient of the loss function J with respect to the parameters θ.
Variants:
1. Batch Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Gradient Descent
Pros:
- Simple and effective for convex loss functions.

Cons:
- High computational cost for large datasets (Batch GD).
- Noise and oscillations in SGD.
- Requires careful learning rate tuning.
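To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent on a toy least-squares loss; the toy data, η = 0.1, and the iteration count are illustrative choices, not values from the slides.

```python
import numpy as np

# Toy linear-regression data: y = 3x + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

theta = np.zeros(1)   # model parameter(s)
eta = 0.1             # learning rate (eta in the formula above)

for step in range(200):
    # Loss J(theta) = mean squared error; gradient computed on the full batch
    residual = X @ theta - y
    grad = 2.0 * X.T @ residual / len(y)   # dJ/dtheta
    theta = theta - eta * grad             # theta <- theta - eta * grad J(theta)

print(theta)  # should approach [3.0]
```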
Gradient Descent
SGD and GD
Mini-batch GD

• Mini-batch GD: splits the dataset into smaller subsets (mini-batches) and performs one parameter update per mini-batch, balancing the stability of batch GD with the speed of SGD, as shown in the sketch below.
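A minimal sketch of the mini-batch variant on the same kind of toy least-squares problem; the batch size of 16 and the epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=256)

theta = np.zeros(1)
eta, batch_size = 0.1, 16

for epoch in range(20):
    # Shuffle once per epoch, then split the dataset into mini-batches
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)
        theta = theta - eta * grad   # one update per mini-batch

print(theta)  # approaches [3.0]
```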


Momentum
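The momentum equations on this slide are not reproduced in this text version; below is a minimal sketch of the classical momentum update, assuming the common form v ← βv − η∇J(θ), θ ← θ + v with β = 0.9 and an illustrative η.

```python
import numpy as np

def momentum_step(theta, velocity, grad, eta=0.01, beta=0.9):
    """One classical-momentum update: accumulate a velocity, then move along it."""
    velocity = beta * velocity - eta * grad   # v <- beta*v - eta*grad
    theta = theta + velocity                  # theta <- theta + v
    return theta, velocity

# Toy usage on J(theta) = theta^2 (gradient 2*theta)
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, velocity = momentum_step(theta, velocity, grad=2 * theta)
print(theta)  # close to the minimum at 0
```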
Nesterov Accelerated Gradient (NAG)
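Likewise, a minimal sketch of Nesterov accelerated gradient, which evaluates the gradient at the look-ahead point θ + βv before stepping; this is the standard formulation, with η and β chosen for illustration rather than taken from the slides.

```python
import numpy as np

def nag_step(theta, velocity, grad_fn, eta=0.01, beta=0.9):
    """One NAG update: gradient is taken at the look-ahead position theta + beta*v."""
    lookahead_grad = grad_fn(theta + beta * velocity)
    velocity = beta * velocity - eta * lookahead_grad
    theta = theta + velocity
    return theta, velocity

# Toy usage on J(theta) = theta^2
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(100):
    theta, velocity = nag_step(theta, velocity, grad_fn=lambda t: 2 * t)
print(theta)  # approaches 0
```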
Adagrad

Adagrad keeps a running sum of the squared gradient of the loss for each parameter (a per-parameter estimate of the variance of the gradients) and divides the learning rate by its square root:

G_t = G_{t−1} + g_t² ,   θ = θ − η g_t / √(G_t + ϵ)

Parameters with frequently large gradients get smaller steps, while rarely updated (sparse) parameters keep larger effective learning rates (a code sketch follows the pros and cons below).

Pros:
• Adapts learning rates, making it suitable for sparse data.
• Reduces the need for manual learning rate tuning.

Cons:
• Learning rate decreases continuously, leading to convergence issues for long training.
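A minimal Adagrad sketch consistent with the per-parameter update above; the toy two-parameter objective, η = 0.1, and ϵ = 1e-8 are illustrative assumptions, not values from the slides.

```python
import numpy as np

def adagrad_step(theta, accum, grad, eta=0.1, eps=1e-8):
    """One Adagrad update: accumulate squared gradients, scale the step per parameter."""
    accum = accum + grad ** 2                            # G_t = G_{t-1} + g_t^2
    theta = theta - eta * grad / (np.sqrt(accum) + eps)  # per-parameter step size
    return theta, accum

# Toy usage on J(theta) = theta1^2 + 10*theta2^2 (different scales per parameter)
theta, accum = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([2 * theta[0], 20 * theta[1]])
    theta, accum = adagrad_step(theta, accum, grad)
print(theta)  # both coordinates move toward 0, each with its own effective step size
```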
RMSProp

Pros:
• Works well for non-stationary objectives.
• Effective for RNNs and training deep networks.

Cons:
• Requires a learning-rate scheduler.
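The RMSProp equations are not reproduced in this text version; below is a minimal sketch of the standard formulation, assuming the usual decay ρ = 0.9 and illustrative values η = 0.01, ϵ = 1e-8.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, eta=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: exponential moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2         # E[g^2]_t
    theta = theta - eta * grad / (np.sqrt(avg_sq) + eps)  # normalized step
    return theta, avg_sq

# Toy usage on J(theta) = theta^2
theta, avg_sq = np.array([5.0]), np.zeros(1)
for _ in range(1000):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=2 * theta)
print(theta)  # approaches 0
```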
Adadelta

Parameter update (a code sketch follows the pros and cons below):
E[g²]_t = ρ E[g²]_{t−1} + (1 − ρ) g_t²
Δθ_t = −( √(E[Δθ²]_{t−1} + ϵ) / √(E[g²]_t + ϵ) ) · g_t ,   θ_{t+1} = θ_t + Δθ_t
E[Δθ²]_t = ρ E[Δθ²]_{t−1} + (1 − ρ) Δθ_t²

Pros:
• Works well for datasets with sparse gradients and features, similar to Adagrad.
• Easy to implement and tune due to fewer hyperparameters (ρ and ϵ).

Cons:
• The decay rate (ρ) must be carefully chosen for optimal performance.
• May struggle with very large datasets.
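A minimal sketch of the Adadelta update above, which replaces the global learning rate with a running RMS of previous updates; ρ = 0.95 and ϵ = 1e-6 here are common illustrative defaults, not values from the slides.

```python
import numpy as np

def adadelta_step(theta, avg_sq_grad, avg_sq_update, grad, rho=0.95, eps=1e-6):
    """One Adadelta update: no global learning rate, step scaled by RMS of past updates."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    theta = theta + update
    return theta, avg_sq_grad, avg_sq_update

# Toy usage on J(theta) = theta^2
theta = np.array([5.0])
avg_sq_grad, avg_sq_update = np.zeros(1), np.zeros(1)
for _ in range(2000):
    theta, avg_sq_grad, avg_sq_update = adadelta_step(
        theta, avg_sq_grad, avg_sq_update, grad=2 * theta)
print(theta)  # moves toward 0 (Adadelta takes small steps on this toy problem)
```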
Adam

The Adam optimizer is a widely used optimization algorithm in deep learning, combining the strengths of two other popular optimizers: Momentum and RMSProp.
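A minimal sketch of the standard Adam update, combining a Momentum-style first moment with an RMSProp-style second moment and bias correction; β1 = 0.9, β2 = 0.999, and ϵ = 1e-8 are the usual published defaults, while the toy problem and η = 0.01 are illustrative choices.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: biased moment estimates, bias correction, then the step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (Momentum-style)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on J(theta) = theta^2; a larger eta than the default is used for the toy
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, m, v, grad=2 * theta, t=t, eta=0.01)
print(theta)  # approaches 0
```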
Adam
Pros
• Efficient and Fast: Combines the benefits of Momentum and RMSProp, generally converging faster than SGD and its variants.
• Adaptive Learning Rates: Automatically adjusts the learning rate for each parameter, reducing the need for manual tuning.
• Handles Noisy Gradients: Suitable for non-stationary objectives and noisy gradient problems.
• Widely Applicable: Performs well across various architectures, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers.
Adam
Cons
• Suboptimal for Certain Problems: May not always generalize as well as SGD with momentum in tasks requiring very fine-tuned convergence.
• Requires Tuning: Hyperparameters (β1, β2, η) may require fine-tuning for optimal performance.
• Non-Convergence Issues: In some cases, Adam may fail to converge to an optimal solution. Variants like AMSGrad address this issue (see the sketch below).
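For reference, a minimal sketch of the AMSGrad modification mentioned above, which keeps the running maximum of the second-moment estimate so the effective per-parameter step size never increases; this follows the commonly cited formulation and is not code from the slides.

```python
import numpy as np

def amsgrad_step(theta, m, v, v_max, grad, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: like Adam, but normalize by the running max of v."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                      # never let the denominator shrink
    theta = theta - eta * m / (np.sqrt(v_max) + eps)  # effective step size is non-increasing
    return theta, m, v, v_max

# Toy usage on J(theta) = theta^2
theta = np.array([5.0])
m, v, v_max = np.zeros(1), np.zeros(1), np.zeros(1)
for _ in range(2000):
    theta, m, v, v_max = amsgrad_step(theta, m, v, v_max, grad=2 * theta)
print(theta)  # approaches 0
```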
Regularization
What is regularization?

• Regularization is a technique used in machine learning and statistics to prevent overfitting by introducing additional constraints or penalties to a model.

• Overfitting occurs when a model learns not only the underlying pattern in the training data but also the noise, making it perform poorly on unseen data.
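As a concrete illustration of a penalty-based constraint, here is a minimal sketch of L2 regularization (weight decay) added to a least-squares loss; the toy data, the learning rate, and the penalty strength λ = 0.1 are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

# Toy data: only the first of five features actually matters
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

w, eta, lam = np.zeros(5), 0.05, 0.1

for _ in range(500):
    # Regularized loss: J(w) = mean((Xw - y)^2) + lam * ||w||^2
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w   # penalty term shrinks the weights
    w = w - eta * grad

print(w)  # weights are shrunk toward zero compared with the unregularized fit
```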
Types of regularization

Pros and cons

Comparison
