UNIT – I BASICS OF DEEP LEARNING
Linear Algebra:
Linear algebra is a branch of mathematics that focuses on the study of
vectors, matrices, and linear transformations, which are fundamental
concepts in many areas of mathematics, science, and engineering. It deals
with solving systems of linear equations, analysing vector spaces, and
understanding the properties of matrices.
What is Linear Algebra?
Linear Algebra is the branch of mathematics that studies vector spaces,
linear equations, linear functions, and matrices, in both finite-dimensional
and infinite-dimensional settings.
Linear Algebra Equations
The general linear equation is represented as u1x1 + u2x2 + … + unxn = v
Where,
u’s – represents the coefficients
x’s – represents the unknowns
v – represents the constant
A collection of such equations over the same unknowns is called a system
of linear algebraic equations. The left-hand side of each equation is a
linear function of the unknowns:
(x1, …, xn) → u1x1 + … + unxn
Linear Algebra Topics
Below is the list of important topics in Linear Algebra.
Matrix inverses and determinants
Linear transformations
Singular value decomposition
Orthogonal matrices
Mathematical operations with matrices (i.e. addition, multiplication)
Projections
Solving systems of equations with matrices
Eigenvalues and eigenvectors
Euclidean vector spaces
Positive-definite matrices
Linear dependence and independence
The foundational concepts essential for understanding linear algebra,
detailed here, include:
Linear Functions
Vector spaces
Matrix
These foundational ideas are interconnected, and together they allow a
system of linear equations to be represented mathematically. Broadly,
vectors are objects that can be added together and scaled, and linear
functions are the maps between vectors that respect these two operations.
Branches of Linear Algebra
Linear Algebra is divided into different branches based on the
difficulty level of topics, which are,
Elementary Linear Algebra
Advanced Linear Algebra
Applied Linear Algebra
Elementary Linear Algebra
Elementary Linear algebra covers the topics of basic linear algebra
such as Scalars and Vectors, Matrix and matrix operation, etc.
Linear Equations
Linear equations form the basis of linear algebra and are equations of
the first order. These equations represent straight lines in geometry
and are characterized by constants and variables without exponents
or products of variables. Solving systems of linear equations involves
finding the values of the variables that satisfy all equations
simultaneously.
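A linear equation is the simplest form of equation in algebra; in two
variables it represents a straight line when plotted on a graph.
Example: 2x + 3y = 6 is a linear equation. Solving two such equations
together gives the point where the two lines intersect, provided the lines
are not parallel or identical. For instance, 2x + 3y = 6 and x – y = –2
intersect at (0, 2), whereas 2x + 3y = 6 and 4x + 6y = 12 describe the
same line (the second equation is twice the first) and so share infinitely
many solutions.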
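As a minimal sketch of solving two such equations together, the Python snippet below applies Cramer's rule to a 2×2 system. The function name solve_2x2 and the second equation x - y = -2 are assumptions chosen for illustration and are not part of these notes. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule.

    Returns (x, y), or None when the determinant is zero, i.e. when the
    two lines are parallel or identical and have no single intersection.
    """
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# Hypothetical system: 2x + 3y = 6 and x - y = -2
unique = solve_2x2(2, 3, 6, 1, -1, -2)

# 2x + 3y = 6 and 4x + 6y = 12 (the second equation is twice the first)
same_line = solve_2x2(2, 3, 6, 4, 6, 12)

print(unique, same_line)
```

The second call returns None because both equations describe the same line, so there is no single intersection point to report.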
Advanced Linear Algebra
Advanced linear algebra mostly covers all the advanced topics related
to linear algebra such as Linear function, Linear transformation,
Eigenvectors, and Eigenvalues, etc.
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors are fundamental concepts in linear
algebra that offer deep insights into the properties of linear
transformations. An eigenvector of a square matrix is a non-zero vector
that, when multiplied by the matrix, results in a scalar multiple of itself. This
scalar is known as the eigenvalue associated with the eigenvector.
Eigenvalues and eigenvectors are essential in various applications,
including stability analysis, quantum mechanics, and the study of
dynamical systems.
Consider a transformation that changes the direction or length of
vectors, except for some special vectors that only get stretched or
shrunk. These special vectors are eigenvectors, and the factor by which
they are stretched or shrunk is the eigenvalue.
Example: For the matrix A = [[2, 0], [0, 3]], the vector v = (1, 0) is an
eigenvector because Av = (2, 0) = 2v, so 2 is the corresponding eigenvalue.
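The example can be checked numerically. The sketch below assumes the example matrix is the 2×2 diagonal matrix with diagonal entries 2 and 3, written as a list of rows; the helper matvec is a hypothetical name introduced here.

```python
def matvec(A, v):
    """Multiply matrix A (a list of rows) by the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


A = [[2, 0], [0, 3]]

v = [1, 0]
Av = matvec(A, v)
v_is_eigenvector = Av == [2 * x for x in v]   # is Av equal to 2v?

w = [0, 1]
Aw = matvec(A, w)
w_is_eigenvector = Aw == [3 * x for x in w]   # is Aw equal to 3w?

print(Av, Aw)
```

Under this assumption the second standard basis vector is also an eigenvector, with eigenvalue 3.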
Singular Value Decomposition
Singular Value Decomposition (SVD) is a powerful mathematical
technique used in signal processing, statistics, and machine
learning. It factors a matrix A into three matrices, A = UΣVᵀ: Vᵀ applies
an initial rotation (or reflection), the diagonal matrix Σ scales along
the coordinate axes by the singular values, and U applies a final rotation
(or reflection). It’s essential for identifying the intrinsic geometric
structure of data.
Vector Space in Linear Algebra
A vector space (or linear space) is a collection of vectors, which may
be added together and multiplied (“scaled”) by numbers, called
scalars. Scalars are often real numbers, but can also be complex
numbers. Vector spaces are central to the study of linear algebra and are
used in various scientific fields.
Basic vectors in Linear Algebra
Linear Map
A linear map (or linear transformation) is a mapping between two
vector spaces that preserves the operations of vector addition and
scalar multiplication. The concept is central to linear algebra and has
significant implications in geometry and abstract algebra.
A linear map is a way of moving vectors around in a space that keeps the
grid lines parallel and evenly spaced.
Example: Scaling objects in a video game world without
changing their basic shape is like applying a linear map.
Positive Definite Matrices
A positive definite matrix is a symmetric matrix where all its
eigenvalues are positive. These matrices are significant in optimisation
problems, as they ensure the existence of a unique minimum in quadratic
forms.
Example: The matrix A = [[2, 0], [0, 2]] is positive definite because the
quadratic form xᵀAx = 2(x1² + x2²) is positive for every non-zero vector
x = (x1, x2).
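A short sketch of the quadratic form for this example, assuming the matrix is the 2×2 diagonal matrix with both diagonal entries equal to 2. The function name quadratic_form and the sample vectors are hypothetical. Evaluating a handful of vectors only illustrates the property; it does not prove positive definiteness.

```python
def quadratic_form(A, x):
    """Return x^T A x for a square matrix A (list of rows) and vector x."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax))


A = [[2, 0], [0, 2]]

# A few arbitrary non-zero vectors, chosen only for illustration
samples = [[1, 0], [0, 1], [1, -1], [-3, 2], [0.5, 0.25]]
values = [quadratic_form(A, x) for x in samples]

print(values)
```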
Matrix Exponential
The matrix exponential is a function on square matrices analogous to
the exponential function for real numbers. It is used in solving systems
of linear differential equations, among other applications in physics and
engineering.
Matrix exponentials stretch or compress spaces in ways that depend
smoothly on time, much like how interest grows continuously in a
bank account.
Example: For the matrix A = [[0, −1], [1, 0]], the exponential of tA is a
rotation by the angle t, so the amount of rotation depends on the “time”
parameter t.
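The rotation example can be illustrated with a truncated Taylor series, exp(M) = I + M + M²/2! + M³/3! + …. This is a sketch for small matrices only, not a production-quality method; the function name expm_taylor and the choice of 30 terms are assumptions.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def expm_taylor(M, terms=30):
    """Approximate exp(M) by summing the first `terms` Taylor terms."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds M^k / k!, starting at k = 0
    for k in range(1, terms):
        term = matmul(term, M)
        term = [[x / k for x in row] for row in term]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result


t = math.pi / 2                                # the "time" parameter
A = [[0.0, -1.0], [1.0, 0.0]]
tA = [[t * x for x in row] for row in A]

R = expm_taylor(tA)
expected = [[math.cos(t), -math.sin(t)],
            [math.sin(t), math.cos(t)]]        # rotation by angle t
print(R)
```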
Linear Computations
Linear computations involve numerical methods for solving linear
algebra problems, including systems of linear equations,
eigenvalues, and eigenvectors calculations. These computations are
essential in computer simulations, optimisations, and modelling.
These are techniques for crunching numbers in linear algebra problems,
like finding the best-fit line through a set of points or solving systems of
equations quickly and accurately.
Linear Independence
A set of vectors is linearly independent if no vector in the set is a
linear combination of the others. The concept of linear independence is
central to the study of vector spaces, as it helps define bases and
dimension.
Vectors are linearly independent if none of them can be made by
combining the others. It’s like saying each vector brings something
unique to the table that the others don’t.
Example: (1, 0) and (0, 1) are linearly independent in 2D space because
you can’t create one of these vectors by scaling the other.
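In two dimensions, one way to check this is the determinant of the 2×2 matrix formed by the two vectors: it is non-zero exactly when the vectors are linearly independent. A minimal sketch, with the hypothetical helper name independent_2d:

```python
def independent_2d(u, v):
    """Return True if the 2D vectors u and v are linearly independent."""
    determinant = u[0] * v[1] - u[1] * v[0]
    return determinant != 0


standard_basis = independent_2d((1, 0), (0, 1))
# (2, 4) is twice (1, 2), so this pair is linearly dependent
scaled_pair = independent_2d((1, 2), (2, 4))

print(standard_basis, scaled_pair)
```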
Linear Subspace
A linear subspace (or simply subspace) is a subset of a vector space
that is closed under vector addition and scalar multiplication. A subspace
is a smaller space that lies within a larger vector space, following the
same rules of vector addition and scalar multiplication.
Example: The set of all vectors of the form (a, 0) in 2D space is a
subspace, representing all points along the x-axis.
Applied Linear Algebra
In Applied Linear Algebra, the topics covered are generally the practical
implications of Elementary and advanced linear Algebra topics such
as the Complement of a matrix, matrix factorization and norm of vectors,
etc.
Linear Programming
Linear programming is a method to achieve the best outcome in a
mathematical model whose requirements are represented by linear
relationships. It is widely used in business and economics to maximize
profit or minimize cost while considering constraints.
This is a technique for optimizing (maximizing or minimizing) a linear
objective function, subject to linear equality and inequality constraints. It’s
like planning the best outcome under given restrictions.
Example: Maximizing profit in a business while considering constraints
like budget, material costs, and labor.
Linear Equation Systems
Systems of linear equations involve multiple linear equations that share
the same set of variables. The solution to these systems is the set of
values that satisfy all equations simultaneously, which can be found using
various methods, including substitution, elimination, and matrix
operations.
Example: Finding the intersection point of two lines represented by
two equations.
Gaussian Elimination
Gaussian elimination is a systematic method for solving systems of linear
equations. It involves applying a series of operations to transform the
system’s matrix into its row echelon form or reduced row echelon form,
making it easier to solve for the variables. It is a step-by-step procedure to
simplify a system of linear equations into a form that’s easier to solve.
Example: Systematically eliminating variables in a system of
equations until each equation has only one variable left to solve for.
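The procedure can be sketched in Python as follows. This version reduces the augmented matrix all the way to reduced row echelon form (often called Gauss-Jordan elimination) and uses exact fractions. The function name gaussian_elimination and the 3×3 system are hypothetical, and the sketch assumes a square system with a unique solution.

```python
from fractions import Fraction


def gaussian_elimination(A, b):
    """Solve A x = b for a square, non-singular A using row operations."""
    n = len(A)
    # Build the augmented matrix [A | b] with exact arithmetic
    M = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so that the pivot becomes 1
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        # Eliminate this variable from every other row
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    # Each row now reads x_i = value
    return [row[n] for row in M]


# Hypothetical system:
#   2x +  y -  z =   8
#  -3x -  y + 2z = -11
#  -2x +  y + 2z =  -3
solution = gaussian_elimination(
    [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])
print(solution)
```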
Vectors in Linear Algebra
In linear algebra, vectors are fundamental mathematical objects that
represent quantities that have both magnitude and direction.
Vector operations such as addition and scalar multiplication are the
core operations of linear algebra. Vectors are used to solve systems
of linear equations, to represent linear transformations, and in
matrix operations such as multiplication and inversion.
Many physical quantities that have both a magnitude and a direction,
such as force and velocity, are represented by vectors.
In linear algebra, vectors are elements of a vector space that can be
scaled and added. Essentially, they are arrows with a length and
direction.
Linear Function
A formal definition of a linear function is provided below:
f(ax) = af(x), and f(x + y) = f(x) + f(y)
where a is a scalar, f(x) and f(y) are vectors in the range of f, and x and y
are vectors in the domain of f.
A linear function is a type of function that maintains the properties of
vector addition and scalar multiplication when mapping between two
vector spaces. Specifically a function T: V ->W is considered linear if it
satisfies two key properties:
| Property | Description | Equation |
|---|---|---|
| Additive Property | A linear transformation’s ability to preserve vector addition. | T(u + v) = T(u) + T(v) |
| Homogeneous Property | A linear transformation’s ability to preserve scalar multiplication. | T(cu) = cT(u) |
V and W: Vector spaces
u and v: Vectors in vector space V
c: Scalar
T: Linear transformation from V to W
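The additive property requires that the function T preserves the
vector addition operation, meaning that the image of the sum of two
vectors is equal to the sum of the images of the individual vectors.
The homogeneous property requires that scaling a vector before applying
T gives the same result as scaling its image by the same scalar.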
For example, we have a linear transformation T that takes a two-
dimensional vector (x, y) as input and outputs a new two-dimensional
vector (u, v) according to the following rule:
T(x, y) = (2x + y, 3x – 4y)
To verify that T is a linear transformation, we need to show that it satisfies
two properties:
Additivity: T(u + v) = T(u) + T(v)
Homogeneity: T(cu) = cT(u)
Let’s take two input vectors (x1, y1) and (x2, y2) and compute their images
under T:
T(x1, y1) = (2x1 + y1, 3x1 – 4y1)
T(x2, y2) = (2x2 + y2, 3x2 – 4y2)
Now let’s compute the image of their sum:
T(x1 + x2, y1 + y2) = (2(x1 + x2) + (y1 + y2), 3(x1 + x2) – 4(y1 + y2)) =
(2x1 + y1 + 2x2 + y2, 3x1 + 3x2 – 4y1 – 4y2) = (2x1 + y1, 3x1 – 4y1) +
(2x2 + y2, 3x2 – 4y2) = T(x1, y1) + T(x2, y2)
So T satisfies the additivity property.
Now let’s check the homogeneity property. Let c be a scalar and (x,
y) be a vector:
T(cx, cy) = (2(cx) + cy, 3(cx) – 4(cy)) = (c(2x) + c(y), c(3x) – c(4y)) =
c(2x + y, 3x – 4y) = cT(x, y)
So T also satisfies the homogeneity property. Therefore, T is a linear
transformation.
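The algebraic verification above can be spot-checked numerically. The sketch below evaluates both properties for T(x, y) = (2x + y, 3x – 4y) on one arbitrary pair of vectors and one scalar; the test values and helper names are assumptions, and a numeric check on a few values illustrates the proof rather than replacing it.

```python
def T(x, y):
    """The transformation T(x, y) = (2x + y, 3x - 4y) from the text."""
    return (2 * x + y, 3 * x - 4 * y)


def add(p, q):
    """Add two 2D vectors component-wise."""
    return (p[0] + q[0], p[1] + q[1])


def scale(c, p):
    """Multiply a 2D vector by the scalar c."""
    return (c * p[0], c * p[1])


# Arbitrary test values, chosen only for illustration
u, v, c = (1, 2), (-3, 5), 7

additive = T(*add(u, v)) == add(T(*u), T(*v))       # T(u + v) == T(u) + T(v)
homogeneous = T(*scale(c, u)) == scale(c, T(*u))    # T(cu) == cT(u)

print(additive, homogeneous)
```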
Linear Algebra Matrix
A matrix is a rectangular array of numbers organized in rows and
columns. Letters such as a, b, and c (or u, v, and w) are commonly used
to represent the entries of a matrix.
Matrices are often used to represent linear transformations, such as
scaling, rotation, and reflection.
The size of a matrix is determined by its number of rows and columns.
For instance, a matrix with three rows and two columns is referred to as
a 3×2 matrix, and a matrix with 4 rows and 2 columns as a 4×2 matrix.
The basic matrix operations are addition, subtraction, scalar
multiplication, and matrix multiplication.
When matrices of the same size are added or subtracted, the
corresponding entries are simply added or subtracted.
Scalar multiplication involves multiplying every entry in the matrix by a
scalar (a number).
Matrix multiplication is a more complex operation: each entry of the
product is obtained by multiplying the entries of a row of the first matrix
by the corresponding entries of a column of the second matrix and adding
the results. It requires the number of columns of the first matrix to equal
the number of rows of the second.
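The matrix operations described here can be sketched in plain Python, with a matrix stored as a list of rows. The function names and the example matrices are hypothetical and chosen only for illustration.

```python
def mat_add(A, B):
    """Add two matrices of the same size entry by entry."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same size")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scalar_mul(c, A):
    """Multiply every entry of A by the scalar c."""
    return [[c * a for a in row] for row in A]


def mat_mul(A, B):
    """Multiply A (m x n) by B (n x p): rows of A times columns of B."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


A = [[1, 2], [3, 4], [5, 6]]      # a 3x2 matrix
B = [[1, 0], [0, 1], [2, 2]]      # another 3x2 matrix
C = [[1, 2, 3], [4, 5, 6]]        # a 2x3 matrix

total = mat_add(A, B)
doubled = scalar_mul(2, A)
product = mat_mul(A, C)           # (3x2) times (2x3) gives a 3x3 matrix
print(total, doubled, product)
```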
For example: Let’s consider a simple example to understand more,
suppose we have two vectors, v1, and v2 in a two-dimensional space. We
can represent these vectors as a column matrix, such as:
Now we will apply a linear transformation that doubles the value of the first
component and subtracts the value of the second component. Now we
can represent this transformation as a 2×2 linear matrix A
To apply this to vector v1, simply multiply the matrix A with vector v1
The resulting vector, [0,-2] is the transformed version of v1. Similarly, we
can apply the same transformation to v2
The resulting vector, [3,-4] is the transformed version of v2.
Numerical Linear Algebra
Numerical linear algebra, also called applied linear algebra, explores how
matrix operations can solve real-world problems using computers. It
focuses on creating efficient algorithms for continuous mathematics tasks.
These algorithms are vital for solving problems like least-squares
optimization, finding eigenvalues, and solving systems of linear equations.
In numerical linear algebra, various matrix decomposition methods such
as eigen decomposition, singular value decomposition, and QR
factorization are utilized to tackle these challenges.
Linear Algebra Applications
Linear algebra is ubiquitous in science and engineering. Its concepts
of vectors, matrices, and linear transformations provide the tools for
modelling natural phenomena, optimising processes, and solving
complex problems in computer science, physics, economics, and
beyond. The following are some specific real-world applications of
linear algebra.
1. Computer Graphics and Animation
Linear algebra is indispensable in computer graphics, gaming, and
animation. It helps in transforming the shapes of objects and their
positions in scenes through rotations, translations, scaling, and more. For
instance, when animating a character, linear transformations are used to
rotate limbs, scale objects, or shift positions within the virtual world.
2. Machine Learning and Data Science
In machine learning, linear algebra is at the heart of algorithms used for
classifying information, making predictions, and understanding the
structures within data. It’s crucial for operations in high-dimensional data
spaces, optimizing algorithms, and even in the training of neural networks
where matrix and tensor operations define the efficiency and effectiveness
of learning.
3. Quantum Mechanics
The state of quantum systems is described using vectors in a complex
vector space. Linear algebra enables the manipulation and prediction of
these states through operations such as unitary transformations (evolution
of quantum states) and eigenvalue problems (energy levels of quantum
systems).
4. Cryptography
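Linear algebraic concepts are used in cryptography for encoding messages
and ensuring secure communication. The classical Hill cipher, for example,
encrypts blocks of letters by multiplying them by a key matrix modulo 26,
and modern lattice-based cryptosystems rely on linear-algebra problems that
are easy to set up but believed to be extremely difficult to reverse without
the key. (Widely used public key cryptosystems such as RSA rest on number
theory, specifically modular arithmetic, rather than on linear algebra.)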
5. Control Systems
In engineering, linear algebra is used to model and design control systems.
The behavior of systems, from simple home heating systems to complex
flight control mechanisms, can be modeled using matrices that describe
the relationships between inputs, outputs, and the system’s state.
6. Network Analysis
Linear algebra is used to analyze and optimize networks, including
internet traffic, social networks, and logistical networks. Google’s
PageRank algorithm, which ranks web pages based on their links to and
from other sites, is a famous example that uses the eigenvectors of a
large matrix representing the web.
7. Image and Signal Processing
Techniques from linear algebra are used to compress, enhance, and
reconstruct images and signals. Singular value decomposition (SVD), for
example, is a method to compress images by identifying and eliminating
redundant information, significantly reducing the size of image files
without substantially reducing quality.
8. Economics and Finance
Linear algebra models economic phenomena, optimizes financial
portfolios, and evaluates risk. Matrices are used to represent and solve
systems of linear equations that model supply and demand, investment
portfolios, and market equilibrium.
9. Structural Engineering
In structural engineering, linear algebra is used to model structures,
analyze their stability, and simulate how forces and loads are distributed
throughout a structure. This helps engineers design buildings, bridges,
and other structures that can withstand various stresses and strains.
10. Robotics
Robots are designed using linear algebra to control their movements
and perform tasks with precision. Kinematics, which involves the
movement of parts in space, relies on linear transformations to calculate
the positions, rotations, and scaling of robot parts.
Solved Examples
Example 1: Find the sum of the two vectors A = 2i + 3j + 5k
and B = -i + 2j + k
Solution: Adding the corresponding components gives
A + B = (2 – 1)i + (3 + 2)j + (5 + 1)k = i + 5j + 6k
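The component-wise sum can be checked with a short Python sketch. Representing each vector as a tuple of its (i, j, k) coefficients is an assumption made for illustration.

```python
# Coefficients of i, j and k for each vector
A = (2, 3, 5)
B = (-1, 2, 1)

# Add the vectors component by component
total = tuple(a + b for a, b in zip(A, B))
print(total)
```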
Eigen Decomposition
Eigen decomposition is a method used in linear algebra to break down a
square matrix into simpler components called eigenvalues and
eigenvectors. This process helps us understand how a matrix behaves
and how it transforms data.
Eigen decomposition is particularly useful in fields like physics,
machine learning, and computer graphics, as it simplifies
complex calculations.
Fundamental Theory of Eigen Decomposition
Eigen decomposition separates a matrix into its eigenvalues and
eigenvectors. Mathematically, for a square matrix A, a scalar λ is an
eigenvalue and a non-zero vector v is a corresponding eigenvector if:
Av = λv
Where:
A is the matrix.
λ is the eigenvalue.
v is the eigenvector.
If an n×n matrix A has n linearly independent eigenvectors, it can be
represented as:
A = VΛV⁻¹
Where:
V is the matrix whose columns are the eigenvectors.
Λ is the diagonal matrix of the corresponding eigenvalues.
V⁻¹ is the inverse of V.
This decomposition is significant because it transforms matrix operations
into simpler, scalar operations involving eigenvalues, making
computations easier.
How to Perform Eigen decomposition?
To perform Eigen decomposition on a matrix, follow these steps:
Step 1: Find the Eigenvalues:
Solve the characteristic equation:
det(A − λI) = 0
Here, A is the square matrix, λ is the eigenvalue, and I is the identity
matrix of the same dimension as A.
Step 2: Find the Eigenvectors:
For each eigenvalue λ, substitute it back into the equation:
(A−λI)v=0
This represents a system of linear equations where v is the eigenvector
corresponding to the eigenvalue λ.
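Step 3: Construct the Eigenvector Matrix V:
Place all the eigenvectors as columns in the matrix V. If A is an n×n
matrix with n linearly independent eigenvectors (which is guaranteed when
there are n distinct eigenvalues), V will be an invertible n×n matrix.
Step 4: Form the Diagonal Matrix Λ:
Construct a diagonal matrix Λ by placing the eigenvalues on its diagonal,
in the same order as their eigenvectors appear in V:
Λ = diag(λ1, λ2, …, λn)
Step 5: Calculate the Inverse of V:
Find V⁻¹, the inverse of the eigenvector matrix V. It exists only if the
eigenvectors are linearly independent; otherwise A cannot be decomposed
in this way.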
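The five steps can be sketched for the 2×2 case, where the characteristic equation is a quadratic that can be solved directly. The function names and the example matrix are hypothetical, and the sketch assumes real, distinct eigenvalues. Larger matrices would normally be handled by a numerical library rather than by hand.

```python
import math


def eig_2x2(A):
    """Steps 1-2: eigenvalues and eigenvectors of a 2x2 matrix.

    Assumes the eigenvalues are real and distinct.
    """
    (a, b), (c, d) = A
    if b == 0 and c == 0:                      # already diagonal
        return [a, d], [(1.0, 0.0), (0.0, 1.0)]
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4 * det             # discriminant of det(A - lam*I) = 0
    if disc <= 0:
        raise ValueError("sketch needs real, distinct eigenvalues")
    root = math.sqrt(disc)
    lams = [(trace + root) / 2, (trace - root) / 2]
    # Solve (A - lam*I) v = 0 using whichever off-diagonal entry is non-zero
    vecs = [(b, lam - a) if b != 0 else (lam - d, c) for lam in lams]
    return lams, vecs


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inverse_2x2(M):
    """Step 5: inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[4, 1], [2, 3]]                            # hypothetical example matrix
lams, vecs = eig_2x2(A)
V = [[vecs[0][0], vecs[1][0]],
     [vecs[0][1], vecs[1][1]]]                  # Step 3: eigenvectors as columns
Lam = [[lams[0], 0.0], [0.0, lams[1]]]          # Step 4: eigenvalues on the diagonal
reconstructed = matmul(matmul(V, Lam), inverse_2x2(V))
print(lams, reconstructed)
```

Multiplying V, Λ and V⁻¹ back together recovers the starting matrix up to floating-point rounding, which is a useful sanity check on the decomposition.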
Importance of Eigen decomposition
Eigen decomposition is widely used because it makes complex tasks
simpler:
Simplifying Matrix Powers: It helps in easily calculating powers of
matrices, which is useful in solving equations and modeling systems.
Data Simplification: It is used in techniques like PCA to reduce large
datasets into fewer dimensions, making them easier to analyze.
Physics: In quantum mechanics, it helps in understanding how
systems change over time.
Image Processing: It is used in tasks like image compression and
enhancement, making handling images more efficient.
Probability is a fundamental concept in statistics that helps us understand
the likelihood of different events occurring. Within probability theory, there
are three key types of probabilities: joint, marginal, and conditional
probabilities.
Marginal Probability refers to the probability of a single event
occurring, without considering any other events.
Joint Probability is the probability of two or more events happening at
the same time. It is the probability of the intersection of these events.
Conditional Probability deals with the probability of an event
occurring given that another event has already occurred.
Probability of an Event
Probability of an event quantifies how likely it is for that event to occur. It
is a measure that ranges from 0 to 1, where 0 indicates the event cannot
happen and 1 indicates the event is certain to happen.
When all outcomes are equally likely, the probability of an event A,
denoted as P(A), is defined as:
P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
Sample Space (S)
The set of all possible outcomes of a random experiment. For example, if
you roll a die, the sample space S is {1, 2, 3, 4, 5, 6}.
Event (A)
A subset of the sample space is called event in probability.
Event is the specific outcome or set of outcomes that we are interested in.
For instance, getting an even number when rolling a die is an event A =
{2, 4, 6}.
Joint Probability
Joint probability is the probability of two (or more) events happening
simultaneously. It is denoted as P(A∩B) for two events A and B, which
reads as the probability of both A and B occurring.
For two events A and B, the joint probability is defined as:
P(A∩B)=P(both A and B occur)
Examples of Joint Probability
Rolling Two Dice
Let A be the event that the first die shows a 3.
Let B be the event that the second die shows a 5.
The joint probability P(A∩B) is the probability that the first die shows a 3
and the second die shows a 5. Since the outcomes are independent,
P(A∩B) = P(A) ⋅ P(B).
Given: P(A) = 1/6 and P(B) = 1/6, so
⇒ P(A∩B) = 1/6 × 1/6 = 1/36.
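The same result can be reproduced by enumerating all 36 equally likely outcomes of two dice. This is a minimal sketch; the helper name prob is hypothetical.

```python
from fractions import Fraction
from itertools import product

# All (first die, second die) outcomes, each equally likely
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event as favourable outcomes over total outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


p_a = prob(lambda o: o[0] == 3)                    # first die shows a 3
p_b = prob(lambda o: o[1] == 5)                    # second die shows a 5
p_joint = prob(lambda o: o[0] == 3 and o[1] == 5)  # both at once

print(p_a, p_b, p_joint)
```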
Marginal Probability
Marginal probability refers to the probability of an event occurring,
irrespective of the outcomes of other variables. It is obtained by summing
or integrating the joint probabilities over all possible values of the other
variables.
For two events A and B, the marginal probability of event A is defined as:
P(A) = ∑_B P(A, B), where the sum runs over all possible outcomes of B
Where P(A, B) is the joint probability of both events A and B occurring
together. If the variables are continuous, the summation is replaced by
integration:
P(A) = ∫ P(A, B) dB
Examples of Marginal Probability
Consider a table showing the joint probability distribution of two discrete
random variables X and Y:
X/Y Y=1 Y=2
X=1 0.1 0.2
X=2 0.3 0.4
To find the marginal probability of X = 1:
P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) = 0.1 + 0.2 = 0.3
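The marginals of this joint table can be computed by summing over the other variable. A minimal sketch, storing the table as a dictionary keyed by (x, y) and using exact fractions; the helper names are hypothetical.

```python
from fractions import Fraction

# Joint distribution P(X = x, Y = y) from the table, keyed by (x, y)
joint = {
    (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
    (2, 1): Fraction(3, 10), (2, 2): Fraction(4, 10),
}


def marginal_x(x):
    """P(X = x): sum the joint probabilities over every value of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """P(Y = y): sum the joint probabilities over every value of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


p_x1, p_x2 = marginal_x(1), marginal_x(2)
p_y1, p_y2 = marginal_y(1), marginal_y(2)
print(p_x1, p_x2, p_y1, p_y2)
```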
Conditional Probability
Conditional probability is the probability of an event occurring given that
another event has already occurred. It provides a way to update our
predictions or beliefs about the occurrence of an event based on new
information.
The conditional probability of event A given event B is denoted as P(A∣B)
and is defined by the formula:
P(A∣B) = P(A∩B) / P(B), provided P(B) > 0
Where:
P(A∩B) is the joint probability of both events A and B occurring.
P(B) is the probability of event B occurring.
Examples of Conditional Probability
Suppose we have a deck of 52 cards, and we want to find the probability
of drawing an Ace given that we have drawn a red card.
Let A be the event of drawing an Ace.
Let B be the event of drawing a red card.
There are 2 red Aces in a deck (Ace of hearts and Ace of diamonds) and
26 red cards in total.
P(A∣B) = P(A∩B) / P(B) = (2/52) / (26/52) = 2/26 = 1/13
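The card example can be verified by building the deck and counting. This is a sketch; the rank and suit labels and the helper name prob are assumptions made for illustration.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]   # 52 cards
red_suits = {"hearts", "diamonds"}


def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


p_b = prob(lambda card: card[1] in red_suits)                         # P(B): red card
p_a_and_b = prob(lambda card: card[0] == "A" and card[1] in red_suits)  # P(A and B)
p_a_given_b = p_a_and_b / p_b                                         # P(A | B)

print(p_a_given_b)
```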
Difference between Joint, Marginal, and Conditional Probability
The key differences between joint, marginal and conditional probability are
listed in the following table:

| Aspect | Joint Probability | Marginal Probability | Conditional Probability |
|---|---|---|---|
| Definition | The probability of two or more events occurring together. | The probability of a single event irrespective of the occurrence of other events. | The probability of an event given that another event has occurred. |
| Notation | P(A∩B) or P(A, B) | P(A) or P(B) | P(A∣B) or P(B∣A) |
| Formula | For independent events: P(A∩B) = P(A) × P(B). For dependent events: P(A∩B) = P(A) × P(B∣A). | P(A) = ∑_B P(A∩B) | P(A∣B) = P(A∩B) / P(B) |
| Example | Probability of rolling a 2 and flipping heads: P(2 ∩ Heads) | Probability of rolling a 2: P(2) | Probability of rolling a 2 given that the coin flip is heads: P(2 ∣ Heads) |
| Calculation Context | Calculated from a joint probability distribution. | Calculated by summing the joint probabilities over all the outcomes of the other variable. | Calculated using the joint probability and the marginal probability of the given condition. |
| Dependencies | Involves multiple events happening simultaneously. | Does not depend on other events. | Depends on the occurrence of another event. |
| Use Case | Used to find the likelihood of combined events in probabilistic models. | Used to find the likelihood of a single event in the presence of multiple events. | Used to update the probability of an event based on new information. |
Bayes’ Theorem is a mathematical formula that helps determine
the conditional probability of an event based on prior knowledge and
new evidence.
Bayes Theorem and Conditional Probability
Bayes’ theorem (also known as the Bayes Rule or Bayes Law) is used to
determine the conditional probability of event A when event B has already
occurred.
The general statement of Bayes’ theorem is “The conditional probability of
an event A, given the occurrence of another event B, is equal to the
product of the probability of B given A and the probability of A, divided
by the probability of event B”, i.e.
P(A|B) = P(B|A) · P(A) / P(B)
For example, if we want to find the probability that a white marble drawn
at random came from the first bag, given that a white marble has already
been drawn, and there are three bags each containing some white and
black marbles, then we can use Bayes’ Theorem.
Bayes Theorem Formula
For any two events A and B, the formula for Bayes’ theorem is given by:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A) and P(B) are the probabilities of events A and B, and P(B) is
never equal to zero,
P(A|B) is the probability of event A when event B happens,
P(B|A) is the probability of event B when A happens.
Bayes Theorem Statement
Bayes’ theorem for a set of n events is defined as follows.
Let E1, E2, …, En be a set of events associated with the sample space S,
in which all the events E1, E2, …, En have a non-zero probability of
occurrence and together form a partition of S. Let A be an event from
space S, with P(A) > 0, for which we have to find the probability. Then,
according to Bayes’ theorem,
P(Ei|A) = P(Ei) · P(A|Ei) / ∑k P(Ek) · P(A|Ek)
for i = 1, 2, …, n, where the sum runs over k = 1, 2, …, n.
Bayes’ theorem is also known as the formula for the probability of
“causes”. Since the Ei’s are a partition of the sample space S, exactly
one of the events Ei occurs at any given time. Thus, the Bayes’ theorem
formula gives the probability of a particular Ei, given that event A has
occurred.
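As an illustration of the partition form, the sketch below revisits the three-bag marble setting with hypothetical contents, since the notes do not give the counts: each bag is assumed equally likely to be chosen, and the bags are assumed to hold 2, 3 and 4 white marbles out of 5. The posterior for each bag is its prior times the likelihood of drawing white, divided by the total probability of drawing white.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Return P(E_i | A) for each i, given P(E_i) and P(A | E_i)."""
    # Total probability of A, summed over the partition E_1, ..., E_n
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    if evidence == 0:
        raise ValueError("P(A) must be non-zero")
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


# Hypothetical setup: three equally likely bags
priors = [Fraction(1, 3)] * 3
# Assumed P(white | bag i): 2, 3 and 4 white marbles out of 5
likelihoods = [Fraction(2, 5), Fraction(3, 5), Fraction(4, 5)]

posteriors = posterior(priors, likelihoods)
print(posteriors)
```

The three posteriors sum to 1, as they must for a partition of the sample space.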
Terms Related to Bayes Theorem
After learning about Bayes theorem in detail, let us understand some
important terms related to the concepts we covered in formula and
derivation.
Hypotheses
Hypotheses refer to possible events or outcomes in the sample space,
they are denoted as E1, E2, …, En.
Each hypothesis represents a distinct scenario that could explain an
observed event.
Prior Probability
Prior probability P(Ei) is the initial probability of an event occurring before
any new data is taken into account.
It reflects existing knowledge or assumptions about the event.
Example: The probability of a person having a disease before taking a
test.
Posterior Probability
Posterior probability P(Ei∣A) is the updated probability of an event after
considering new information.
It is derived using the Bayes Theorem.
Example: The probability of having a disease given a positive test
result.
Conditional Probability
The probability of an event A based on the occurrence of another event
B is termed conditional Probability.
It is denoted as P(A|B) and represents the probability of A when event
B has already happened.
Joint Probability
The probability of two or more events occurring together, at the same
time, is called joint probability.
For two events A and B, it is denoted as P(A∩B).
Random Variables
Real-valued variables whose possible values are determined by
random experiments are called random variables.
Probabilities of their values that are estimated from repeated trials of the
experiment are called experimental probabilities.
Bayes Theorem Applications
Bayesian inference, which is derived directly from Bayes’ theorem, has
found application in many fields, including medicine, science, philosophy,
engineering, sports, and law.
Some of the Key Applications are:
Medical Testing → Finding the real probability of having a disease
after a positive test.
Spam Filters → Checking if an email is spam based on keywords.
Weather Prediction → Updating the chance of rain based on new
data.
AI & Machine Learning → Used in Naïve Bayes classifiers to predict
outcomes.
Numerical Computation
In neural networks, a loss function measures how well or how poorly the
model is performing at the current instant. Training uses this loss as a
guide: we aim to minimize it, since a lower loss generally means that the
model’s predictions are closer to the targets and that the model is more
likely to perform well on unseen data.
Overview:
Define loss functions in neural networks, explaining their role in measuring
model performance and how optimization aims to minimize these
functions.
Explain the mathematical concepts behind gradient-based optimization
and how these optimizers navigate the “terrain” of the loss function to
reach a global minimum.
Differentiate between Batch Gradient Descent, Stochastic Gradient
Descent (SGD), and Mini-Batch Gradient Descent, covering their
mechanics and how they update model parameters.
Describe how gradients indicate a direction for optimization and the
importance of setting an appropriate learning rate to ensure effective and
efficient convergence.
Explore the strengths and limitations of each method (e.g., stability vs.
speed) and how these impact deep learning model training, especially for
large datasets.
Identify common challenges, such as getting stuck in local minima,
choosing an optimal learning rate, and managing memory constraints that
affect all gradient-based optimizers.
Showcase practical applications of these optimizers in deep learning,
highlighting typical code implementations and best practices for real-world
data science and machine learning tasks.
Role of an Optimizer
As discussed in the introduction, optimizers update the parameters of neural
networks, such as weights and biases, to minimize the loss function. Here,
the loss function defines the terrain and tells the optimizer whether it is moving
in the right direction to reach the bottom of the valley, ideally the global minimum.
The Intuition Behind Optimizers with an Example
Let us imagine a climber hiking down a hill without a map. He doesn’t know
the right way to reach the valley, but he can tell whether he is moving closer
(going downhill) or further away (uphill) from his final destination.
If he keeps taking steps in the correct direction, he will reach his aim, i.e., the
valley.
This is the intuition behind optimizers: to reach a global minimum of the
loss function.
Instances of Gradient-Based Optimizers
The different instances of gradient descent optimizers are as follows:
Batch Gradient Descent, also called Vanilla Gradient Descent or Gradient Descent (GD)
Stochastic Gradient Descent (SGD)
Mini-Batch Gradient Descent (MB-GD)
Batch Gradient Descent
Gradient descent is an optimization algorithm used when training deep learning
models. It updates the parameters iteratively to minimize a given function. On a
convex function it can reach the global minimum; on a non-convex function it may
only reach a local minimum.
The update rule is:
θ_j = θ_j − α ⋅ ∂J(θ)/∂θ_j
In the above formula,
α is the learning rate,
J is the cost function, and
θ_j is the parameter to be updated.
As you can see, the gradient is the partial derivative of J (the cost function)
with respect to θ_j.
Note that as we get closer to a minimum, the slope (the gradient) of
the curve becomes less and less steep. The derivative therefore gets smaller,
which automatically reduces the step size even though the learning rate itself
stays fixed.
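The shrinking step size can be seen in a minimal sketch. The cost function J(θ) = θ², the starting point, and the learning rate are assumptions chosen only to keep the example small.

```python
def cost(theta):
    """Illustrative convex cost function J(theta) = theta**2."""
    return theta ** 2


def gradient(theta):
    """Derivative of the illustrative cost: dJ/dtheta = 2 * theta."""
    return 2.0 * theta


alpha = 0.1       # learning rate, never changed inside the loop
theta = 5.0       # arbitrary starting point
step_sizes = []   # absolute size of each update

for _ in range(20):
    step = alpha * gradient(theta)   # step = learning rate * gradient
    theta = theta - step             # move against the gradient
    step_sizes.append(abs(step))

print("first three steps:", step_sizes[:3])
print("last three steps: ", step_sizes[-3:])
print("final theta:", theta, "final cost:", cost(theta))
```

The learning rate stays at 0.1 throughout; only the gradient shrinks as θ approaches the minimum, so each step is smaller than the one before.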
It is the most basic and one of the most widely used optimizers: it directly uses
the derivative of the loss function and the learning rate to reduce the loss and
move towards the global minimum.
Thus, the gradient descent optimization algorithm has many applications,
including:
Linear Regression,
Classification Algorithms,
Backpropagation in Neural Networks, etc.
For batch gradient descent, the gradient of the cost function J(θ) with respect
to the network parameters θ is computed over the entire training dataset before
each update:
θ = θ − α⋅∇J(θ)
Our aim is to reach the bottom of the graph (cost vs. weight), or a point where
we can no longer move downhill: a local minimum.
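A minimal sketch of batch gradient descent is given here, assuming a one-feature linear model y = w·x + b with a mean-squared-error cost. The helper name batch_gradient, the toy data, and the learning rate are illustrative assumptions, not part of the notes above.

```python
def batch_gradient(w, b, xs, ys):
    """Gradient of J = (1 / 2m) * sum((w*x + b - y)**2) over ALL m examples."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / m
    grad_b = sum(errors) / m
    return grad_w, grad_b


# Toy dataset generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
alpha = 0.05   # learning rate

for epoch in range(2000):
    # One parameter update per pass over the entire dataset.
    grad_w, grad_b = batch_gradient(w, b, xs, ys)
    w -= alpha * grad_w
    b -= alpha * grad_b

print("learned w:", w, "learned b:", b)
```

Each loop iteration reads every training example before changing w and b once, which is what distinguishes the batch variant from the others.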
Role of Gradient
In general, the gradient represents the slope of a function; for a function of
several parameters, it is the vector of partial derivatives. These derivatives
describe how the loss changes in response to a small change in each of the
function’s parameters. That information tells us which step to take next to
reduce the loss function’s output.
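The idea of measuring the change in the loss for a small change in each parameter can be sketched numerically with central differences. The two-parameter loss function and the name numerical_gradient are assumptions made for this illustration; real frameworks normally compute gradients analytically rather than this way.

```python
def loss(params):
    """Illustrative loss with its minimum at w1 = 3, w2 = -1."""
    w1, w2 = params
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2


def numerical_gradient(f, params, eps=1e-6):
    """Approximate each partial derivative by nudging one parameter at a time."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


params = [0.0, 0.0]
grad = numerical_gradient(loss, params)

# Analytically: dL/dw1 = 2 * (w1 - 3) and dL/dw2 = 4 * (w2 + 1).
print("numerical gradient at (0, 0):", grad)
```

At (0, 0) the first component is negative and the second is positive, so reducing the loss means increasing w1 and decreasing w2.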
Role of Learning Rate
The learning rate controls the size of the steps our optimisation algorithm takes
towards the minimum. To ensure that the gradient descent algorithm reaches a
minimum, we must set the learning rate to an appropriate value that is neither too
low nor too high.
Taking very large steps, i.e., using a large learning rate, may overshoot the
minimum, so the model may never reach the optimal value of the loss function.
On the contrary, taking very small steps, i.e., using a small learning rate, makes
convergence very slow.
The step size also depends on the gradient value, since each update is the
learning rate multiplied by the gradient.
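The effect of the learning rate can be sketched by running the same number of updates with three different values. The cost J(θ) = θ², the starting point, and the three learning rates are assumptions picked to make the three behaviours visible.

```python
def run_gradient_descent(alpha, theta=5.0, steps=50):
    """Minimize J(theta) = theta**2 with a fixed learning rate alpha."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # gradient of theta**2 is 2*theta
    return theta


too_small = run_gradient_descent(alpha=0.001)   # barely moves in 50 steps
reasonable = run_gradient_descent(alpha=0.1)    # ends close to the minimum at 0
too_large = run_gradient_descent(alpha=1.1)     # overshoots further every step

print("alpha = 0.001 ->", too_small)
print("alpha = 0.1   ->", reasonable)
print("alpha = 1.1   ->", too_large)
```

With the smallest rate θ is still far from 0 after 50 steps, with the middle rate it is very close to 0, and with the largest rate each update jumps across the minimum and lands further away than before.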
As we discussed, the gradient represents the direction of increase. However, we
aim to find the minimum point in the valley, so we have to go in the opposite
direction of the gradient. Therefore, we update parameters in the negative
gradient direction to minimize the loss.
Advantages of Batch Gradient Descent
Efficient Computation: By processing the entire dataset in one go, batch
gradient descent can compute gradients efficiently with vectorized matrix
operations, as long as the dataset fits in memory.
Simple Implementation: The straightforward approach of calculating gradients
on all data points makes batch gradient descent easy to code, especially with
frameworks like TensorFlow and PyTorch.
Enhanced Convergence Stability: With gradients computed over the full
dataset, batch gradient descent offers a smoother path to convergence, reducing
fluctuations in updates and aiding in reliable model training.
Disadvantages of Batch Gradient Descent
1. Susceptible to Local Minima: Batch gradient descent may get stuck in local
minima, limiting optimization effectiveness in non-convex loss surfaces.
2. Slow Convergence on Large Datasets: Since weights are updated after
processing the entire dataset, convergence can be extremely slow for large
datasets, delaying training.
3. High Memory Demand: Calculating gradients on the entire dataset requires
substantial memory, making it challenging to implement on limited-memory
systems or massive datasets.
Stochastic Gradient Descent
To overcome some of the disadvantages of the GD algorithm, the SGD algorithm
comes into the picture as an extension of gradient descent. One disadvantage of
gradient descent is that it requires a lot of memory to load the entire dataset at
once to compute the derivative of the loss function. In SGD, we instead compute
the derivative on one data point at a time and so update the model’s parameters
more frequently: the parameters are updated after the loss computation on each
training example.
For example, given a dataset of 1000 rows, SGD updates the model parameters
1000 times in one complete pass over the dataset, instead of once as in
gradient descent.
Algorithm: θ = θ − α⋅∇J(θ; x(i); y(i)), where {x(i), y(i)} are the training
examples
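The per-example update can be sketched as follows, assuming a one-feature linear model y = w·x + b with a squared-error loss. The function name sgd_epoch, the synthetic noise-free data, and the learning rate are illustrative assumptions.

```python
import random


def sgd_epoch(w, b, data, alpha, rng):
    """One pass over the data, updating the parameters after EVERY example."""
    order = list(range(len(data)))
    rng.shuffle(order)                 # visit the examples in random order
    updates = 0
    for i in order:
        x, y = data[i]
        error = w * x + b - y          # prediction error on this one example
        w -= alpha * error * x         # gradient of 0.5 * error**2 w.r.t. w
        b -= alpha * error             # gradient of 0.5 * error**2 w.r.t. b
        updates += 1
    return w, b, updates


rng = random.Random(0)                 # fixed seed so the run is repeatable
# 1000 rows generated from y = 2x + 1 (assumed, noise-free for simplicity).
data = [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1.0, 1.0) for _ in range(1000))]

w, b = 0.0, 0.0
updates_per_epoch = 0
for epoch in range(20):
    w, b, updates_per_epoch = sgd_epoch(w, b, data, alpha=0.05, rng=rng)

print("updates in one epoch:", updates_per_epoch)
print("learned w:", w, "learned b:", b)
```

With 1000 rows, one pass performs 1000 parameter updates, where batch gradient descent would perform one.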
We want training to be even faster, so we take a gradient descent step for
each training example. The figure comparing the two methods shows the
implications:
In the left diagram we have SGD, which takes one gradient descent step per
training example, and in the right diagram we have GD, which takes one step per
pass over the entire training set.
SGD is quite noisy. Each of its steps is much cheaper than a GD step, but it
might not settle at a minimum.
In SGD, the updates take more iterations to reach a minimum than in GD, which
takes fewer steps. SGD is noisier because the model parameters are updated
frequently from single examples, which gives the updates high variance and
makes the loss fluctuate by varying amounts.
Advantages of Stochastic Gradient Descent
1. Faster Convergence: With frequent updates to model parameters, SGD often
makes progress more quickly than batch gradient descent, which suits large
datasets.
2. Lower Memory Usage: SGD processes one data point at a time, eliminating the
need to hold the entire dataset in memory.
3. Better Minima Exploration: The noisy updates can potentially allow SGD to
escape local minima, increasing the chance of reaching a better minimum.
Disadvantages of Stochastic Gradient Descent
1. High Parameter Variability: The frequent updates introduce high variance in
model parameters, which can cause unstable convergence.
2. Risk of Overshooting: Even near the global minimum, SGD can overshoot due
to its fluctuating updates, complicating precise convergence.
3. Learning Rate Adjustment Needed: To match the stable convergence of
gradient descent, the learning rate must be gradually reduced over time,
requiring careful tuning.
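Gradually reducing the learning rate can be sketched with a simple schedule. Inverse-time decay is only one possible choice and is an assumption here, as are the function name decayed_learning_rate and the constants.

```python
def decayed_learning_rate(initial_rate, decay, step):
    """Inverse-time decay: the rate shrinks as the update count grows."""
    return initial_rate / (1.0 + decay * step)


# Learning rate at the start, after 100 updates, and after 1000 updates.
rates = [decayed_learning_rate(0.1, 0.01, step) for step in (0, 100, 1000)]

for step, rate in zip((0, 100, 1000), rates):
    print("update", step, "-> learning rate", rate)
```

Early updates use a large rate to make quick progress, while later updates use a smaller rate so the noisy SGD steps fluctuate less around the minimum. The decay constant is one more value that has to be tuned.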
Mini-Batch Gradient Descent
To overcome the large time complexity of the SGD algorithm, the MB-GD algorithm
comes into the picture as an extension of SGD. It also addresses the memory
problem of gradient descent, and for that reason it is often considered the best
of the gradient descent variants. MB-GD takes a batch of points, i.e., a subset
of the dataset, to compute the derivative.
It is observed that the derivative of the loss function for MB-GD is almost the
same as the derivative of the loss function for GD after several iterations.
However, MB-GD needs more iterations than GD to reach a minimum, so the total
computation cost can still be large, even though each iteration only processes
one batch.
The weight update therefore depends on the derivative of the loss for a batch of
points. The updates in MB-GD are noisier than in GD because the batch derivative
does not always point towards the minimum.
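A minimal sketch of mini-batch gradient descent follows, again assuming a one-feature linear model y = w·x + b with a mean-squared-error cost. The helper names minibatches and minibatch_gradient, the batch size of 32, and the synthetic data are illustrative assumptions.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield shuffled batches that together cover the dataset once."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def minibatch_gradient(w, b, batch):
    """Average gradient of 0.5 * (w*x + b - y)**2 over one batch."""
    errors = [(w * x + b - y, x) for x, y in batch]
    grad_w = sum(e * x for e, x in errors) / len(batch)
    grad_b = sum(e for e, _ in errors) / len(batch)
    return grad_w, grad_b


rng = random.Random(0)   # fixed seed so the run is repeatable
# 1000 rows generated from y = 2x + 1 (assumed, noise-free for simplicity).
data = [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1.0, 1.0) for _ in range(1000))]

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32
updates_per_epoch = 0

for epoch in range(50):
    updates_per_epoch = 0
    for batch in minibatches(data, batch_size, rng):
        grad_w, grad_b = minibatch_gradient(w, b, batch)
        w -= alpha * grad_w          # one update per batch
        b -= alpha * grad_b
        updates_per_epoch += 1

print("updates in one epoch:", updates_per_epoch)
print("learned w:", w, "learned b:", b)
```

With 1000 rows and a batch size of 32, one pass makes 32 updates (the last batch holds the remaining 8 rows), which sits between the single update of batch gradient descent and the 1000 updates of SGD.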
Advantages of Mini Batch Gradient Descent
1. Frequent, Stable Updates: Mini-batch gradient descent offers frequent updates
to model parameters while lowering variance, balancing speed and stability.
2. Moderate Memory Requirement: It balances memory usage, needing only a
medium amount to store mini-batches, making it feasible for large datasets.
Disadvantages of Mini Batch Gradient Descent
1. Higher Noise in Updates: Mini-batch updates introduce noise in parameter
updates, which can lead to less precise optimization compared to full-batch
gradient descent.
2. Slower Convergence: Compared to batch gradient descent, converging may
take longer, requiring more iterations to reach the minimum.
3. Risk of Local Minima: Mini-batch gradient descent can still get stuck in local
minima, especially in non-convex optimization problems.
Challenges with All Types of Gradient-based Optimizers
Optimum Learning Rate: We must choose an optimum learning rate value. If the
learning rate is too small, gradient descent may take a long time to converge;
if it is too large, it may overshoot the minimum. For more about this challenge,
refer to the above section on the role of the learning rate.
Constant Learning Rate: All the parameters share one constant learning rate, but
there may be some parameters that we do not want to change at the same rate.
Local minimum: You may get stuck at a local minimum, i.e., you may not reach the
global minimum.
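The local minimum problem can be sketched on a small non-convex function. The function f(x) = x⁴ − 3x² + x, the two starting points, and the learning rate are assumptions chosen so that the function has one local and one global minimum.

```python
def f(x):
    """Illustrative non-convex function with two separate minima."""
    return x ** 4 - 3.0 * x ** 2 + x


def f_prime(x):
    """Derivative of f."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x, alpha=0.01, steps=500):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x


from_right = descend(2.0)    # starts on the right-hand slope
from_left = descend(-2.0)    # starts on the left-hand slope

print("start at  2.0 -> x =", from_right, " f(x) =", f(from_right))
print("start at -2.0 -> x =", from_left, " f(x) =", f(from_left))
```

Both runs stop where the derivative is zero, but the run that starts at 2.0 settles in the shallower valley near x ≈ 1.13, while the run that starts at −2.0 reaches the deeper valley near x ≈ −1.30. The same algorithm with the same learning rate gives a different answer depending only on where it starts.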