Optimization algorithms
Dr. Abhijit Debnath
University of Engineering and Management
Newtown, Kolkata
Why optimizers?
Minimizing the Loss Function:
The optimizer adjusts model parameters to reduce the difference between the
predicted output and the actual target values, minimizing the loss function.
Learning Rate Management:
Optimizers manage the learning rate, which determines the step size for parameter
updates. Some optimizers adapt the learning rate dynamically.
Handling Non-Convex Functions:
Deep learning models often have highly non-convex loss functions. Optimizers help
navigate these complex surfaces and avoid poor local minima or saddle points.
Why optimizers?
Accelerating Convergence:
Techniques like momentum or adaptive learning rates help optimizers converge to
the optimal solution faster.
Addressing Gradient Scaling Issues:
Optimizers like RMSProp and Adam normalize gradients to handle problems like
vanishing or exploding gradients.
Common Optimization Algorithms
1. Gradient Descent (GD)
2. Momentum
3. Nesterov Accelerated Gradient (NAG)
4. Adagrad
5. RMSProp
6. Adam
7. AdaMax
8. Nadam
9. AMSGrad
Gradient Descent (GD)
Mathematical Formulation:
θ = θ - η∇J(θ)
Variants:
1. Batch Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini-Batch Gradient Descent
Pros:
- Simple and effective for convex loss functions.
Cons:
- High computational cost for large datasets (Batch GD).
- Noise and oscillations in SGD.
- Requires careful learning rate tuning.
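A minimal sketch of the update θ = θ - η∇J(θ) on an illustrative quadratic loss J(θ) = 0.5·‖θ‖²; the learning rate and iteration count below are arbitrary choices for the example, not values from the slides.

import numpy as np

def grad_J(theta):
    # gradient of the illustrative loss J(theta) = 0.5 * ||theta||^2
    return theta

theta = np.array([3.0, -2.0])   # arbitrary starting parameters
eta = 0.1                       # learning rate (illustrative)

for step in range(100):
    theta = theta - eta * grad_J(theta)   # theta = theta - eta * grad J(theta)

print(theta)   # approaches the minimizer [0, 0]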
Gradient Descent, SGD vs. GD, and Mini-batch GD (illustrated with figures in the original slides)
• Mini-batch GD: splits the dataset into smaller subsets and performs one parameter update per subset.
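A minimal sketch of one epoch of mini-batch gradient descent on a small linear-regression problem; the data, batch size of 32, and learning rate are illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # illustrative features
y = X @ np.array([1., 2., 3., 4., 5.])        # illustrative targets
w = np.zeros(5)
eta, batch_size = 0.1, 32

idx = rng.permutation(len(X))                 # shuffle, then split into mini-batches
for start in range(0, len(X), batch_size):
    b = idx[start:start + batch_size]
    # gradient of 0.5 * mean squared error on this mini-batch only
    grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
    w -= eta * grad                           # one update per mini-batch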
Momentum
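The momentum slide's figure is not reproduced here; the classical update is v = γv - η∇J(θ), θ = θ + v. A minimal sketch on the same illustrative quadratic loss, with assumed constants η = 0.1 and γ = 0.9.

import numpy as np

def grad_J(theta):
    return theta                   # gradient of the illustrative loss 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)           # velocity: exponentially weighted sum of past gradients
eta, gamma = 0.1, 0.9              # learning rate and momentum coefficient (illustrative)

for step in range(100):
    v = gamma * v - eta * grad_J(theta)   # accumulate velocity
    theta = theta + v                     # move along the velocity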
Nesterov Accelerated Gradient (NAG)
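NAG evaluates the gradient at the look-ahead point θ + γv rather than at θ. A minimal sketch under the same illustrative loss and constants as the momentum example.

import numpy as np

def grad_J(theta):
    return theta                   # gradient of the illustrative loss 0.5 * ||theta||^2

theta, v = np.array([3.0, -2.0]), np.zeros(2)
eta, gamma = 0.1, 0.9

for step in range(100):
    lookahead = theta + gamma * v            # peek ahead along the current velocity
    v = gamma * v - eta * grad_J(lookahead)  # gradient taken at the look-ahead point
    theta = theta + v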
Adagrad
Adagrad keeps a per-parameter (per-feature) running sum of squared gradients, an estimate of each gradient's variance, and divides the learning rate by its square root: θ = θ - (η / √(G + ϵ)) ∇J(θ).
Pros:
- Adapts learning rates, making it suitable for sparse data.
- Reduces the need for manual learning rate tuning.
Cons:
- Learning rate decreases continuously, leading to convergence issues for long training.
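A minimal sketch of the Adagrad update on the same illustrative quadratic loss; the learning rate and ϵ below are assumed values.

import numpy as np

def grad_J(theta):
    return theta                   # gradient of the illustrative loss 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
G = np.zeros_like(theta)           # running sum of squared gradients, per parameter
eta, eps = 0.5, 1e-8

for step in range(100):
    g = grad_J(theta)
    G += g ** 2                                # accumulate squared gradients
    theta -= eta / (np.sqrt(G) + eps) * g      # per-parameter scaled step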
RMSProp
Pros:
- Works well for non-stationary objectives.
- Effective for RNNs and training deep networks.
Cons:
- Requires a learning-rate scheduler.
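A minimal sketch of the RMSProp update, which replaces Adagrad's ever-growing sum with an exponentially decaying average of squared gradients; the decay rate 0.9 is a common default, the other constants are illustrative.

import numpy as np

def grad_J(theta):
    return theta                   # gradient of the illustrative loss 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
Eg2 = np.zeros_like(theta)         # decaying average of squared gradients
eta, rho, eps = 0.01, 0.9, 1e-8

for step in range(200):
    g = grad_J(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2       # forget old gradients gradually
    theta -= eta / np.sqrt(Eg2 + eps) * g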
Adadelta
Parameter update: Δθ = -(RMS[Δθ] / RMS[g]) · g, where RMS[·] is an exponentially decaying root-mean-square with decay rate ρ, so no global learning rate η is required.
Pros:
- Works well for datasets with sparse gradients and features, similar to Adagrad.
- Easy to implement and tune due to fewer hyperparameters (ρ and ϵ).
Cons:
- The decay rate (ρ) must be carefully chosen for optimal performance.
- May struggle with very large datasets.
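A minimal sketch of the Adadelta parameter update above; note that it needs no global learning rate, only ρ and ϵ (the values below are illustrative).

import numpy as np

def grad_J(theta):
    return theta                   # gradient of the illustrative loss 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
Eg2 = np.zeros_like(theta)         # decaying average of squared gradients
Edx2 = np.zeros_like(theta)        # decaying average of squared parameter updates
rho, eps = 0.95, 1e-6

for step in range(500):
    g = grad_J(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # unit-consistent step, no eta
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    theta += dx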
Adam
The Adam optimizer is a widely used optimization algorithm in deep learning,
combining the strengths of two other popular optimizers: Momentum and
RMSProp.
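A minimal sketch of the Adam update, pairing a momentum-style first-moment estimate with an RMSProp-style second-moment estimate and correcting both for their zero initialization; β1 = 0.9, β2 = 0.999, ϵ = 1e-8 are the usual defaults, while the loss and learning rate are illustrative.

import numpy as np

def grad_J(theta):
    return theta                   # gradient of the illustrative loss 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)           # first moment (mean of gradients), as in Momentum
v = np.zeros_like(theta)           # second moment (uncentered variance), as in RMSProp
eta, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad_J(theta)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)                  # bias correction for zero-initialized moments
    v_hat = v / (1 - b2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)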
Adam
Pros
Efficient and Fast: Combines the benefits of Momentum and RMSProp,
generally converging faster than SGD and its variants.
Adaptive Learning Rates: Automatically adjusts the learning rate for each
parameter, reducing the need for manual tuning.
Handles Noisy Gradients: Suitable for non-stationary objectives and noisy
gradient problems.
Widely Applicable: Performs well across various architectures, including
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and
transformers.
Adam
Cons
Suboptimal for Certain Problems: May not always generalize as well as SGD
with momentum in tasks requiring very fine-tuned convergence.
Requires Tuning: Hyperparameters (β1, β2, η) may require fine-tuning for
optimal performance.
Non-Convergence Issues: In some cases, Adam may fail to converge to an
optimal solution. Variants like AMSGrad address this issue.
Regularization
What is regularization?
Regularization is a technique used in machine learning and statistics to prevent
overfitting by introducing additional constraints or penalties to a model.
Overfitting occurs when a model learns not only the underlying pattern in the
training data but also the noise, making it perform poorly on unseen data.
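As an illustration of how a penalty enters the objective, here is a minimal sketch of a loss with an L2 (weight-decay) penalty added to the data-fitting term; the choice of an L2 penalty and the strength lam are assumptions for the example, not taken from the slides.

import numpy as np

def regularized_loss(w, X, y, lam=0.1):
    data_fit = np.mean((X @ w - y) ** 2)       # how well the model fits the training data
    penalty = lam * np.sum(w ** 2)             # L2 penalty discourages large weights
    return data_fit + penalty

def regularized_grad(w, X, y, lam=0.1):
    # gradient of the data-fit term plus gradient of the penalty term
    return 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w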
Types of regularization, their pros and cons, and a comparison (presented as tables and figures in the original slides).