
UNIT – I BASICS OF DEEP LEARNING

Linear Algebra:
Linear algebra is a branch of mathematics that focuses on the study of
vectors, matrices, and linear transformations, which are fundamental
concepts in many areas of mathematics, science, and engineering. It deals
with solving systems of linear equations, analysing vector spaces, and
understanding the properties of matrices.

What is Linear Algebra?


Linear Algebra is a branch of mathematics that deals with matrices, vectors, and finite- and infinite-dimensional spaces. It is the study of vector spaces, linear equations, linear functions, and matrices.
Linear Algebra Equations
The general linear equation is represented as u1x1 + u2x2 + … + unxn = v
Where,
 u’s – represent the coefficients
 x’s – represent the unknowns
 v – represents the constant
A collection of such equations is called a system of linear algebraic equations. Each equation is described by the linear function
(x1, …, xn) → u1x1 + … + unxn

Linear Algebra Topics


Below is the list of important topics in Linear Algebra.
 Matrix inverses and determinants
 Linear transformations
 Singular value decomposition
 Orthogonal matrices
 Mathematical operations with matrices (i.e. addition, multiplication)
 Projections
 Solving systems of equations with matrices
 Eigenvalues and eigenvectors
 Euclidean vector spaces
 Positive-definite matrices
 Linear dependence and independence
The foundational concepts essential for understanding linear algebra, detailed here, include:
 Linear Functions
 Vector spaces
 Matrix
These foundational ideas are interconnected, allowing for the
mathematical representation of a system of linear equations. Generally,
vectors are entities that can be combined, and linear functions refer to
vector operations that encompass vector combination.
Branches of Linear Algebra
Linear Algebra is divided into different branches based on the
difficulty level of topics, which are,
 Elementary Linear Algebra
 Advanced Linear Algebra
 Applied Linear Algebra

Elementary Linear Algebra


Elementary linear algebra covers basic topics such as scalars and vectors, matrices and matrix operations, etc.
Linear Equations
Linear equations form the basis of linear algebra and are equations of
the first order. These equations represent straight lines in geometry
and are characterized by constants and variables without exponents
or products of variables. Solving systems of linear equations involves
finding the values of the variables that satisfy all equations
simultaneously.
A linear equation is the simplest form of equation in algebra,
representing a straight line when plotted on a graph.
Example: 2x + 3y = 6 is a linear equation. If you have two independent equations, such as 2x + 3y = 6 and x − y = 1, solving them together gives you the point where the two lines intersect.
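If it helps to see this computationally, here is a minimal sketch using NumPy (assuming it is installed) that solves the pair of equations above and returns the intersection point:

import numpy as np

# Coefficient matrix and right-hand side for the system
#   2x + 3y = 6
#    x -  y = 1
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])
b = np.array([6.0, 1.0])

solution = np.linalg.solve(A, b)  # the point where the two lines intersect
print(solution)                   # [1.8 0.8]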

Advanced Linear Algebra


Advanced linear algebra mostly covers all the advanced topics related
to linear algebra such as Linear function, Linear transformation,
Eigenvectors, and Eigenvalues, etc.
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors are fundamental concepts in linear
algebra that offer deep insight into the properties of linear
transformations. An eigenvector of a square matrix is a non-zero vector
that, when the matrix multiplies it, results in a scalar multiple of itself. This
scalar is known as the eigenvalue associated with the eigenvector. They
are essential in various applications, including stability analysis, quantum
mechanics, and the study of dynamical systems.
Consider a transformation that changes the direction or length of
vectors, except for some special vectors that only get stretched or
shrunk. These special vectors are eigenvectors, and the factor by which
they are stretched or shrunk is the eigenvalue.
Example: For the matrix A = [[2, 0], [0, 3]], the vector v = (1, 0) is an eigenvector because Av = 2v, and 2 is the eigenvalue.
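As a quick computational check of this example, here is a small NumPy sketch (NumPy assumed available) that recovers the eigenvalues and verifies Av = 2v:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # columns of `eigenvectors` are the eigenvectors
print(eigenvalues)                            # [2. 3.]

v = np.array([1.0, 0.0])
print(A @ v)                                  # [2. 0.] = 2 * v, so 2 is the eigenvalue for v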
Singular Value Decomposition
Singular Value Decomposition (SVD) is a powerful mathematical
technique used in signal processing, statistics, and machine
learning. It decomposes a matrix into three other matrices, where
one represents the rotation, another the scaling, and the third the
final rotation. It’s essential for identifying the intrinsic geometric structure
of data.
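A minimal NumPy sketch of SVD is shown below; the matrix M is an illustrative assumption, not taken from the text:

import numpy as np

M = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# U and Vt are the two rotations, S holds the scaling factors (singular values).
U, S, Vt = np.linalg.svd(M)
print(S)                                     # singular values, largest first
print(np.allclose(U @ np.diag(S) @ Vt, M))   # True: the three factors reconstruct M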
Vector Space in Linear Algebra
A vector space (or linear space) is a collection of vectors, which may
be added together and multiplied (“scaled”) by numbers, called
scalars. Scalars are often real numbers, but can also be complex
numbers. Vector spaces are central to the study of linear algebra and are
used in various scientific fields.
Linear Map
A linear map (or linear transformation) is a mapping between two
vector spaces that preserves the operations of vector addition and
scalar multiplication. The concept is central to linear algebra and has
significant implications in geometry and abstract algebra.
A linear map is a way of moving vectors around in a space that keeps the
grid lines parallel and evenly spaced.
Example: Scaling objects in a video game world without
changing their basic shape is like applying a linear map.
Positive Definite Matrices
A positive definite matrix is a symmetric matrix where all its
eigenvalues are positive. These matrices are significant in optimisation
problems, as they ensure the existence of a unique minimum in quadratic
forms.
Example: The matrix A = [[2, 0], [0, 2]] is positive definite because the quadratic form x^T A x is positive for every non-zero vector x.
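One common way to check positive definiteness numerically is to look at the eigenvalues; the sketch below (an illustration only, with an arbitrarily chosen test vector) does this for the matrix above:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 2.0]])

# A symmetric matrix is positive definite iff all of its eigenvalues are positive.
print(np.all(np.linalg.eigvalsh(A) > 0))   # True

x = np.array([1.0, -3.0])                  # any non-zero vector
print(x @ A @ x)                           # 20.0, which is > 0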
Matrix Exponential
The matrix exponential is a function on square matrices analogous to
the exponential function for real numbers. It is used in solving systems
of linear differential equations, among other applications in physics and
engineering.
Matrix exponentials stretch or compress spaces in ways that depend
smoothly on time, much like how interest grows continuously in a
bank account.
Example: The exponential of the matrix A = [[0, −1], [1, 0]] represents rotations, where the amount of rotation depends on the “time” parameter.
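A small sketch using SciPy’s expm (assuming SciPy is installed) shows this rotation behaviour; the angle t below is chosen arbitrarily for illustration:

import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

t = np.pi / 2                    # the "time" parameter: rotate by 90 degrees
R = expm(t * A)                  # matrix exponential exp(tA)
print(np.round(R, 3))            # approximately [[0, -1], [1, 0]]
print(np.round(R @ np.array([1.0, 0.0]), 3))   # [0. 1.]: the x-axis is rotated onto the y-axis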
Linear Computations
Linear computations involve numerical methods for solving linear
algebra problems, including systems of linear equations,
eigenvalues, and eigenvectors calculations. These computations are
essential in computer simulations, optimisations, and modelling.
These are techniques for crunching numbers in linear algebra problems,
like finding the best-fit line through a set of points or solving systems of
equations quickly and accurately.
Linear Independence
A set of vectors is linearly independent if no vector in the set is a
linear combination of the others. The concept of linear independence is
central to the study of vector spaces, as it helps define bases and
dimension.
Vectors are linearly independent if none of them can be made by
combining the others. It’s like saying each vector brings something
unique to the table that the others don’t.
Example: (1, 0) and (0, 1) are linearly independent in 2D space because you can’t create one of these vectors by scaling or adding the other.
Linear Subspace
A linear subspace (or simply subspace) is a subset of a vector space
that is closed under vector addition and scalar multiplication. A subspace
is a smaller space that lies within a larger vector space, following the
same rules of vector addition and scalar multiplication.
Example: The set of all vectors of the form (a, 0) in 2D space is a subspace, representing all points along the x-axis.
Applied Linear Algebra
In Applied Linear Algebra, the topics covered are generally the practical
implications of Elementary and advanced linear Algebra topics such
as the Complement of a matrix, matrix factorization and norm of vectors,
etc.
Linear Programming
Linear programming is a method to achieve the best outcome in a
mathematical model whose requirements are represented by linear
relationships. It is widely used in business and economics to maximize
profit or minimize cost while considering constraints.
This is a technique for optimizing (maximizing or minimizing) a linear
objective function, subject to linear equality and inequality constraints. It’s
like planning the best outcome under given restrictions.
Example: Maximizing profit in a business while considering constraints
like budget, material costs, and labor.
Linear Equation Systems
Systems of linear equations involve multiple linear equations that share
the same set of variables. The solution to these systems is the set of
values that satisfy all equations simultaneously, which can be found using
various methods, including substitution, elimination, and matrix
operations.
Example: Finding the intersection point of two lines represented by
two equations.
Gaussian Elimination
Gaussian elimination is a systematic method for solving systems of linear
equations. It involves applying a series of operations to transform the
system’s matrix into its row echelon form or reduced row echelon form,
making it easier to solve for the variables. It is a step-by-step procedure to
simplify a system of linear equations into a form that’s easier to solve.
Example: Systematically eliminating variables in a system of
equations until each equation has only one variable left to solve for.
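The short sketch below is one way to code this procedure (a simplified illustration with partial pivoting, not a production solver):

import numpy as np

def gaussian_elimination(A, b):
    """Solve Ax = b by forward elimination to row echelon form, then back substitution."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        # Partial pivoting: move the row with the largest pivot up, for numerical stability.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]          # multiplier used to eliminate the variable
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):         # back substitution on the triangular system
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 3.0], [1.0, -1.0]])
b = np.array([6.0, 1.0])
print(gaussian_elimination(A, b))          # [1.8 0.8], same answer as np.linalg.solve(A, b)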

Vectors in Linear Algebra


In linear algebra, vectors are fundamental mathematical objects that
represent quantities that have both magnitude and direction.
 Vector operations such as addition and scalar multiplication are core concepts in linear algebra. They can be used to solve systems of linear equations, represent linear transformations, and perform matrix operations such as multiplication and inversion.
 Vectors, a fundamental component of linear algebra, are essential for representing the magnitude and direction of many physical processes.
 In linear algebra, vectors are elements of a vector space that can be
scaled and added. Essentially, they are arrows with a length and
direction.
Linear Function
A formal definition of a linear function is provided below:
f(ax) = af(x), and f(x + y) = f(x) + f(y)
where a is a scalar, f(x) and f(y) are vectors in the range of f, and x and y
are vectors in the domain of f.

A linear function is a type of function that maintains the properties of vector addition and scalar multiplication when mapping between two vector spaces. Specifically, a function T: V → W is considered linear if it satisfies two key properties:

 Additive Property: a linear transformation’s ability to preserve vector addition, i.e. T(u + v) = T(u) + T(v).
 Homogeneous Property: a linear transformation’s ability to preserve scalar multiplication, i.e. T(cu) = cT(u).
 V and W: Vector spaces


 u and v: Vectors in vector space V
 c: Scalar
 T: Linear transformation from V to W
 The additive property requires that the function T preserves the vector addition operation, meaning that the image of the sum of two vectors is equal to the sum of the images of each individual vector.
For example, we have a linear transformation T that takes a two-
dimensional vector (x, y) as input and outputs a new two-dimensional
vector (u, v) according to the following rule:
T(x, y) = (2x + y, 3x – 4y)
To verify that T is a linear transformation, we need to show that it satisfies
two properties:
 Additivity: T(u + v) = T(u) + T(v)
 Homogeneity: T(cu) = cT(u)

Let’s take two input vectors (x1, y1) and (x2, y2) and compute their images
under T:
 T(x1, y1) = (2x1 + y1, 3x1 – 4y1)
 T(x2, y2) = (2x2 + y2, 3x2 – 4y2)
 Now let’s compute the image of their sum:
 T(x1 + x2, y1 + y2) = (2(x1 + x2) + (y1 + y2), 3(x1 + x2) – 4(y1 + y2)) =
(2x1 + y1 + 2x2 + y2, 3x1 + 3x2 – 4y1 – 4y2) = (2x1 + y1, 3x1 – 4y1) +
(2x2 + y2, 3x2 – 4y2) = T(x1, y1) + T(x2, y2)
 So T satisfies the additivity property.
 Now let’s check the homogeneity property. Let c be a scalar and (x,
y) be a vector:
 T(cx, cy) = (2(cx) + cy, 3(cx) – 4(cy)) = (c(2x) + c(y), c(3x) – c(4y)) =
c(2x + y, 3x – 4y) = cT(x, y)

So T also satisfies the homogeneity property. Therefore, T is a linear transformation.
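A short numerical check of these two properties (with arbitrarily chosen vectors u, w and scalar c, not taken from the text) can be written as follows:

import numpy as np

def T(v):
    """T(x, y) = (2x + y, 3x - 4y), written as multiplication by a 2x2 matrix."""
    A = np.array([[2.0, 1.0],
                  [3.0, -4.0]])
    return A @ v

u = np.array([1.0, 2.0])
w = np.array([-3.0, 5.0])
c = 4.0

print(np.allclose(T(u + w), T(u) + T(w)))  # True: additivity holds
print(np.allclose(T(c * u), c * T(u)))     # True: homogeneity holds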
Linear Algebra Matrix
 A matrix in linear algebra is a rectangular array of numbers organized in rows and columns. Letters such as a, b, and c are commonly used to represent the entries of a matrix.
 Matrices are often used to represent linear transformations, such as scaling, rotation, and reflection.
 A matrix’s size is determined by the number of rows and columns it contains. A matrix with three rows and two columns, for instance, is referred to as a 3×2 matrix.
 Matrices support operations including addition, subtraction, and multiplication.
 When matrices are added or subtracted, the corresponding entries are simply added or subtracted.
 Scalar multiplication involves multiplying every entry in the matrix by a scalar (a number).
 Matrix multiplication is a more complex operation that involves multiplying and adding certain entries of the matrices.
 For instance, a matrix with 4 rows and 2 columns is known as a 4×2 matrix. The entries are frequently represented by letters like u, v, and w.

For example, suppose we have two vectors, v1 and v2, in a two-dimensional space. We can represent each of these vectors as a column matrix.
Now we apply a linear transformation that doubles the value of the first component and subtracts the value of the second component. This transformation can be represented as a 2×2 matrix A. To apply it to vector v1, we simply multiply the matrix A by the vector v1.
The resulting vector, [0, -2], is the transformed version of v1. Similarly, we can apply the same transformation to v2.
The resulting vector, [3, -4], is the transformed version of v2.
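The matrices and vectors in this example appeared as figures that are not reproduced here, so the values in the sketch below are assumptions chosen to be consistent with the stated result [0, -2] for v1; it simply shows how a 2×2 matrix is applied to a column vector:

import numpy as np

# Assumed transformation: first output = 2x - y, second output = -y
# (chosen so that v1 = [1, 2] maps to [0, -2], matching the text).
A = np.array([[2.0, -1.0],
              [0.0, -1.0]])

v1 = np.array([[1.0],    # column vector, values assumed for illustration
               [2.0]])

print(A @ v1)            # [[ 0.]
                         #  [-2.]]  -> the transformed version of v1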


Numerical Linear Algebra
Numerical linear algebra, also called applied linear algebra, explores how
matrix operations can solve real-world problems using computers. It
focuses on creating efficient algorithms for continuous mathematics tasks.
These algorithms are vital for solving problems like least-square
optimization, finding Eigenvalues, and solving systems of linear equations.
In numerical linear algebra, various matrix decomposition methods such as eigendecomposition, singular value decomposition, and QR factorization are utilized to tackle these challenges.
Linear Algebra Applications
Linear algebra is ubiquitous in science and engineering, providing
the tools for modelling natural phenomena, optimising processes,
and solving complex calculations in computer science, physics,
economics, and beyond.
Linear algebra, with its concepts of vectors, matrices, and linear
transformations, serves as a foundational tool in numerous fields,
enabling the solving of complex problems across science, engineering,
computer science, economics, and more. The following are some specific applications of linear algebra in the real world.
1. Computer Graphics and Animation
Linear algebra is indispensable in computer graphics, gaming, and
animation. It helps in transforming the shapes of objects and their
positions in scenes through rotations, translations, scaling, and more. For
instance, when animating a character, linear transformations are used to
rotate limbs, scale objects, or shift positions within the virtual world.
2. Machine Learning and Data Science
In machine learning, linear algebra is at the heart of algorithms used for
classifying information, making predictions, and understanding the
structures within data. It’s crucial for operations in high-dimensional data
spaces, optimizing algorithms, and even in the training of neural networks
where matrix and tensor operations define the efficiency and effectiveness
of learning.
3. Quantum Mechanics
The state of quantum systems is described using vectors in a complex
vector space. Linear algebra enables the manipulation and prediction of
these states through operations such as unitary transformations (evolution
of quantum states) and eigenvalue problems (energy levels of quantum
systems).
4. Cryptography
Linear algebraic concepts are used in cryptography for encoding messages
and ensuring secure communication. Public key cryptosystems, such as
RSA, rely on operations that are easy to perform but extremely difficult to
reverse without the key, many of which involve linear algebraic
computations.
5. Control Systems
In engineering, linear algebra is used to model and design control systems.
The behavior of systems, from simple home heating systems to complex
flight control mechanisms, can be modeled using matrices that describe
the relationships between inputs, outputs, and the system’s state.
6. Network Analysis
Linear algebra is used to analyze and optimize networks, including
internet traffic, social networks, and logistical networks. Google’s
PageRank algorithm, which ranks web pages based on their links to and
from other sites, is a famous example that uses the eigenvectors of a
large matrix representing the web.
7. Image and Signal Processing
Techniques from linear algebra are used to compress, enhance, and
reconstruct images and signals. Singular value decomposition (SVD), for
example, is a method to compress images by identifying and eliminating
redundant information, significantly reducing the size of image files
without substantially reducing quality.
8. Economics and Finance
Linear algebra models economic phenomena, optimizes financial
portfolios, and evaluates risk. Matrices are used to represent and solve
systems of linear equations that model supply and demand, investment
portfolios, and market equilibrium.
9. Structural Engineering
In structural engineering, linear algebra is used to model structures,
analyze their stability, and simulate how forces and loads are distributed
throughout a structure. This helps engineers design buildings, bridges,
and other structures that can withstand various stresses and strains.
10. Robotics
Robots are designed using linear algebra to control their movements
and perform tasks with precision. Kinematics, which involves the
movement of parts in space, relies on linear transformations to calculate
the positions, rotations, and scaling of robot parts.
Solved Examples
Example 1: Find the sum of the two vectors A = 2i + 3j + 5k and B = −i + 2j + k.
Solution: Adding component-wise, A + B = (2 − 1)i + (3 + 2)j + (5 + 1)k = i + 5j + 6k.
Eigen Decomposition

Eigen decomposition is a method used in linear algebra to break down a


square matrix into simpler components called eigenvalues and
eigenvectors. This process helps us understand how a matrix behaves
and how it transforms data.
For Example - Eigen decomposition is particularly useful in fields like
Physics, Machine learning, and Computer graphics, as it simplifies
complex calculations.

Fundamental Theory of Eigen Decomposition


Eigen decomposition separates a matrix into its eigenvalues and
eigenvectors. Mathematically, for a square matrix A, if there exists a
scalar λ (eigenvalue) and a non-zero vector v (eigenvector) such that:
Av = λv
Where:
 A is the matrix.
 λ is the eigenvalue.
 v is the eigenvector.
Then the matrix A can be represented as:
A = VΛV⁻¹
Where:
 V is the matrix of eigenvectors.
 Λ is the diagonal matrix of eigenvalues.
 V⁻¹ is the inverse of the matrix V.
This decomposition is significant because it transforms matrix operations
into simpler, scalar operations involving eigenvalues, making
computations easier.
How to Perform Eigen decomposition?
To perform Eigen decomposition on a matrix, follow these steps:
 Step 1: Find the Eigenvalues:
Solve the characteristic equation:
det(A − λI) = 0
Here, A is the square matrix, λ is the eigenvalue, and I is the identity matrix of the same dimension as A.
 Step 2: Find the Eigenvectors:
For each eigenvalue λ, substitute it back into the equation:
(A − λI)v = 0
This represents a system of linear equations where v is the eigenvector corresponding to the eigenvalue λ.
 Step 3: Construct the Eigenvector Matrix V:
Place all the eigenvectors as columns in the matrix V. If there are n distinct eigenvalues, V will be an n×n matrix.
 Step 4: Form the Diagonal Matrix Λ:
Construct a diagonal matrix Λ by placing the eigenvalues on its diagonal: Λ = diag(λ1, λ2, …, λn).
 Step 5: Calculate the Inverse of V:
Find V⁻¹, the inverse of the eigenvector matrix V, if the matrix is invertible.
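A compact NumPy sketch of these steps (with an arbitrarily chosen 2×2 matrix) is shown below; it also verifies the reconstruction A = VΛV⁻¹:

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Steps 1-3: eigenvalues and the eigenvector matrix V (eigenvectors as columns).
eigenvalues, V = np.linalg.eig(A)
print(eigenvalues)                        # eigenvalues 5 and 2 (order may vary)

# Step 4: diagonal matrix of eigenvalues.
Lam = np.diag(eigenvalues)

# Step 5: inverse of V, then check the reconstruction A = V Lam V^-1.
V_inv = np.linalg.inv(V)
print(np.allclose(V @ Lam @ V_inv, A))    # True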
Importance of Eigen decomposition
Eigen decomposition is widely used because it makes complex tasks
simpler:
 Simplifying Matrix Powers: It helps in easily calculating powers of
matrices, which is useful in solving equations and modeling systems.
 Data Simplification: It is used in techniques like PCA to reduce large
datasets into fewer dimensions, making them easier to analyze.
 Physics: In quantum mechanics, it helps in understanding how
systems change over time.
 Image Processing: It is used in tasks like image compression and
enhancement, making handling images more efficient.

Probability is a fundamental concept in statistics that helps us understand


the likelihood of different events occurring. Within probability theory, there
are three key types of probabilities: joint, marginal, and conditional
probabilities.
 Marginal Probability refers to the probability of a single event
occurring, without considering any other events.
 Joint Probability is the probability of two or more events happening at
the same time. It is the probability of the intersection of these events.
 Conditional Probability deals with the probability of an event
occurring given that another event has already occurred.
Probability of an Event
Probability of an event quantifies how likely it is for that event to occur. It
is a measure that ranges from 0 to 1, where 0 indicates the event cannot
happen and 1 indicates the event is certain to happen.
The probability of an event A, denoted as P(A), is defined as:
P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
Sample Space (S)
The set of all possible outcomes of a random experiment. For example, if
you roll a die, the sample space S is {1, 2, 3, 4, 5, 6}.
Event (A)
A subset of the sample space is called event in probability.
Event is the specific outcome or set of outcomes that we are interested in.
For instance, getting an even number when rolling a die is an event A =
{2, 4, 6}.
Joint Probability
Joint probability is the probability of two (or more) events happening
simultaneously. It is denoted as P(A∩B) for two events A and B, which
reads as the probability of both A and B occurring.
For two events A and B, the joint probability is defined as:
P(A∩B)=P(both A and B occur)
Examples of Joint Probability
Rolling Two Dice
 Let A be the event that the first die shows a 3.
 Let B be the event that the second die shows a 5.
The joint probability P(A∩B) is the probability that the first die shows a 3 and the second die shows a 5. Since the outcomes are independent,
P(A∩B) = P(A) ⋅ P(B).
Given: P(A) = 1/6 and P(B) = 1/6, so
P(A∩B) = 1/6 × 1/6 = 1/36.

Marginal Probability
Marginal probability refers to the probability of an event occurring,
irrespective of the outcomes of other variables. It is obtained by summing
or integrating the joint probabilities over all possible values of the other
variables.
For two events A and B, the marginal probability of event A is defined as:
P(A) = ∑B P(A, B)
Where P(A, B) is the joint probability of both events A and B occurring together. If the variables are continuous, the summation is replaced by integration:
P(A) = ∫ P(A, B) dB
Examples of Marginal Probability
Consider a table showing the joint probability distribution of two discrete
random variables X and Y:
X/Y Y=1 Y=2

X=1 0.1 0.2

X=2 0.3 0.4

To find the marginal probability of X = 1:


P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) = 0.1 + 0.2 = 0.3

Conditional Probability
Conditional probability is the probability of an event occurring given that
another event has already occurred. It provides a way to update our
predictions or beliefs about the occurrence of an event based on new
information.
The conditional probability of event A given event B is denoted as P(A∣B)
and is defined by the formula:

P(A|B) = P(A∩B) / P(B)
Where:
 P(A∩B) is the joint probability of both events A and B occurring.
 P(B) is the probability of event B occurring.
Examples of Conditional Probability
Suppose we have a deck of 52 cards, and we want to find the probability
of drawing an Ace given that we have drawn a red card.
 Let A be the event of drawing an Ace.
 Let B be the event of drawing a red card.
There are 2 red Aces in a deck (Ace of hearts and Ace of diamonds) and
26 red cards in total.
P(A|B) = P(A∩B) / P(B) = (2/52) / (26/52) = 2/26 = 1/13
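The same calculation can be done with exact fractions in Python, as a quick sanity check of the card example:

from fractions import Fraction

# Deck of 52 cards: 26 red cards, 2 of which are red Aces.
p_red = Fraction(26, 52)                  # marginal probability P(B)
p_ace_and_red = Fraction(2, 52)           # joint probability P(A ∩ B)

p_ace_given_red = p_ace_and_red / p_red   # conditional probability P(A | B)
print(p_ace_given_red)                    # 1/13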

Difference between Joint, Marginal, and Conditional


Probability
The key differences between joint, marginal and conditional probability are
listed in the following table:
 For Independent Events
o P(A∩B) = P(A) x P(B)
 For Dependent Events
o P(A∩B) = P(A) x P(B|A)

Joint Probability
 Definition: The probability of two or more events occurring together.
 Notation: P(A∩B) or P(A, B)
 Formula: P(A∩B) = P(A) × P(B|A)
 Example: Probability of rolling a 2 and flipping heads: P(2 ∩ Heads)
 Calculation Context: Calculated from a joint probability distribution.
 Dependencies: Involves multiple events happening simultaneously.
 Use Case: Used to find the likelihood of combined events in probabilistic models.

Marginal Probability
 Definition: The probability of a single event irrespective of the occurrence of other events.
 Notation: P(A) or P(B)
 Formula: P(A) = ∑B P(A∩B)
 Example: Probability of rolling a 2: P(2)
 Calculation Context: Calculated by summing the joint probabilities over all outcomes of the other variable.
 Dependencies: Does not depend on other events.
 Use Case: Used to find the likelihood of a single event in the presence of multiple events.

Conditional Probability
 Definition: The probability of an event given that another event has occurred.
 Notation: P(A|B) or P(B|A)
 Formula: P(A|B) = P(A∩B) / P(B)
 Example: Probability of rolling a 2 given that the coin flip is heads: P(2 | Heads)
 Calculation Context: Calculated using the joint probability and the marginal probability of the given condition.
 Dependencies: Depends on the occurrence of another event.
 Use Case: Used to update the probability of an event based on new information.
Bayes’ Theorem is a mathematical formula that helps determine
the conditional probability of an event based on prior knowledge and
new evidence.
Bayes Theorem and Conditional Probability
Bayes’ theorem (also known as the Bayes Rule or Bayes Law) is used to
determine the conditional probability of event A when event B has already
occurred.
The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of event B.”
For example, if we want to find the probability that a white marble drawn
at random came from the first bag, given that a white marble has already
been drawn, and there are three bags each containing some white and
black marbles, then we can use Bayes’ Theorem.
Bayes Theorem Formula
For any two events A and B, Bayes’ theorem is given by:
P(A|B) = [P(B|A) × P(A)] / P(B)
Where,
 P(A) and P(B) are the probabilities of events A and B (and P(B) is never equal to zero),
 P(A|B) is the probability of event A when event B happens,
 P(B|A) is the probability of event B when A happens.
Bayes Theorem Statement
Bayes’ Theorem for a set of n events is defined as follows.
Let E1, E2, …, En be a set of events associated with the sample space S, in which all the events E1, E2, …, En have a non-zero probability of occurrence and form a partition of S. Let A be an event from the space S for which we have to find the probability; then, according to Bayes’ theorem,
P(Ei | A) = [P(Ei) ⋅ P(A | Ei)] / [∑k P(Ek) ⋅ P(A | Ek)], for i = 1, 2, …, n.
Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the Ei’s are a partition of the sample space S, and at any given time only one of the events Ei occurs. Thus, the Bayes theorem formula gives the probability of a particular Ei, given that event A has occurred.
Terms Related to Bayes Theorem
After learning about Bayes theorem in detail, let us understand some
important terms related to the concepts we covered in formula and
derivation.
Hypotheses
 Hypotheses refer to possible events or outcomes in the sample space,
they are denoted as E1, E2, …, En.
 Each hypothesis represents a distinct scenario that could explain an
observed event.
Prior Probability
 The prior probability P(Ei) is the initial probability of an event occurring before any new data is taken into account.
 It reflects existing knowledge or assumptions about the event.
 Example: The probability of a person having a disease before taking a test.
Posterior Probability
 Posterior probability P(Ei|A) is the updated probability of an event after considering new information.
 It is derived using the Bayes Theorem.
 Example: The probability of having a disease given a positive test
result.
Conditional Probability
 The probability of an event A based on the occurrence of another event
B is termed conditional Probability.
 It is denoted as P(A|B) and represents the probability of A when event
B has already happened.
Joint Probability
 The probability of two or more events occurring together at the same time is called the joint probability.
 For two events A and B, the joint probability is denoted as P(A∩B).
Random Variables
 Real-valued variables whose possible values are determined by
random experiments are called random variables.
 The probability of finding such variables is the experimental probability.
Bayes Theorem Applications
Bayesian inference, which is derived directly from Bayes’ theorem, has found application in many fields, including medicine, science, philosophy, engineering, sports, and law.
Some of the Key Applications are:
 Medical Testing → Finding the real probability of having a disease
after a positive test.
 Spam Filters → Checking if an email is spam based on keywords.
 Weather Prediction → Updating the chance of rain based on new
data.
 AI & Machine Learning → Used in Naïve Bayes classifiers to predict
outcomes.
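As an illustration of the medical-testing application above, the short sketch below applies Bayes’ theorem with hypothetical numbers (the prevalence, sensitivity, and false-positive rate are assumptions, not data from the text):

# Hypothetical numbers chosen only to illustrate Bayes' theorem.
p_disease = 0.01              # prior P(D): prevalence of the disease
p_pos_given_disease = 0.95    # P(+ | D): test sensitivity
p_pos_given_healthy = 0.05    # P(+ | not D): false-positive rate

# Total probability of a positive test result.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(D | +): probability of having the disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))      # about 0.161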

Numerical Computation
In neural networks, we have the concept of loss functions, which tell us about the performance of our neural network, i.e., how well or poorly the model is performing at the current instant. To train the network to perform better on unseen data, we use this loss: we aim to minimize the loss, since a lower loss implies that the model will perform better.

Overview:

 Define loss functions in neural networks, explaining their role in measuring

model performance and how optimization aims to minimize these

functions.

 Explain the mathematical concepts behind gradient-based optimization

and how these optimizers navigate the “terrain” of the loss function to

reach a global minimum.

 Differentiate between Batch Gradient Descent, Stochastic Gradient

Descent (SGD), and Mini-Batch Gradient Descent, covering their

mechanics and how they update model parameters.

 Describe how gradients indicate a direction for optimization and the

importance of setting an appropriate learning rate to ensure effective and

efficient convergence.

 Explore the strengths and limitations of each method (e.g., stability vs.

speed) and how these impact deep learning model training, especially for

large datasets.

 Identify common challenges, such as getting stuck in local minima,

choosing an optimal learning rate, and managing memory constraints that

affect all gradient-based optimizers.


 Showcase practical applications of these optimizers in deep learning,

highlighting typical code implementations and best practices for real-world

data science and machine learning tasks.

Role of an Optimizer

As discussed in the introduction, optimizers update the parameters of neural networks, such as the weights and the learning rate, to minimize the loss function. Here, the loss function defines the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

The Intuition Behind Optimizers with an Example


Let us imagine a climber hiking down a hill with no sense of direction. He does not know the right way to reach the valley, but he can tell whether he is moving closer to it (going downhill) or further away (going uphill). If he keeps taking steps in the correct direction, he will reach his aim, i.e., the valley.

This is the intuition behind optimizers: to reach the global minimum of the loss function.

Instances of Gradient-Based Optimizers

Different instances of Gradient descent Optimizers are as follows:

 Batch Gradient Descent or Vanilla Gradient Descent or Gradient Descent (GD)

 Stochastic Gradient Descent (SGD)

 Mini batch Gradient Descent (MB-GD)

Batch Gradient Descent

Gradient descent is an optimization algorithm used when training deep learning

models. It’s based on a convex function and updates its parameters iteratively to

minimize a given function to its local minimum.

The parameter update rule is:

θj = θj − α · ∂J(θ)/∂θj

In the above formula,

 α is the learning rate,

 J is the cost function, and

 θ is the parameter to be updated.

As you can see, the gradient is the partial derivative of J (the cost function) with respect to θj.

Note that as we get closer to the global minimum, the slope (gradient) of the curve becomes less and less steep, which results in a smaller value of the derivative and therefore a smaller step, even though the learning rate itself is unchanged.

It is the most basic but most used optimizer that directly uses the derivative of

the loss function and learning rate to reduce the loss function and tries to reach

the global minimum.

Thus, the Gradient Descent Optimization algorithm has many applications

including:

 Linear Regression ,

 Classification Algorithms ,

 Backpropagation in Neural Networks, etc.

The update uses the gradient of the cost function J(θ) with respect to the network parameters θ, computed over the entire training dataset. Our aim is to reach the bottom of the graph (cost vs. weight), that is, a point where we can no longer move downhill: a local minimum.
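To make the update rule concrete, here is a minimal batch gradient descent sketch that fits a single weight to toy data (the data, learning rate, and epoch count are illustrative assumptions, not from the text):

import numpy as np

# Toy data: y is roughly 3x, so the fitted weight should approach 3.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

theta = 0.0     # parameter to be updated
alpha = 0.1     # learning rate

for epoch in range(200):
    grad = np.mean(2 * (theta * x - y) * x)   # dJ/dtheta averaged over the ENTIRE dataset
    theta = theta - alpha * grad              # update in the negative gradient direction

print(round(theta, 2))                        # close to 3.0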

Role of Gradient

In general, a Gradient represents the slope of the equation, while gradients are

partial derivatives. They describe the change reflected in the loss function with

respect to the small change in the function’s parameters. This slight change in

loss functions can tell us about the next step to reduce the loss function’s output.

Role of Learning Rate

The learning rate determines the size of the steps our optimization algorithm takes toward the global minimum. To ensure that the gradient descent algorithm converges to the minimum, we must set the learning rate to an appropriate value that is neither too low nor too high.

Taking very large steps, i.e., using a large learning rate, may overshoot the global minimum, and the model may never reach the optimal value of the loss function. On the contrary, taking very small steps, i.e., using a small learning rate, will take forever to converge.
Thus, the step size also depends on the gradient value.

As we discussed, the gradient represents the direction of increase. However, we

aim to find the minimum point in the valley, so we have to go in the opposite

direction of the gradient. Therefore, we update parameters in the negative

gradient direction to minimize the loss.

Advantages of Batch Gradient Descent

 Efficient Computation: By processing the entire dataset in one go, batch

gradient descent efficiently computes gradients, especially with matrix

operations, optimizing performance on large datasets.

 Simple Implementation: The straightforward approach of calculating gradients

on all data points makes batch gradient descent easy to code, especially with

frameworks like TensorFlow and PyTorch.

 Enhanced Convergence Stability: With gradients computed over the full

dataset, batch gradient descent offers a smoother path to convergence, reducing

fluctuations in updates and aiding in reliable model training.

Disadvantages of Batch Gradient Descent


1. Susceptible to Local Minima: Batch gradient descent may get stuck in local

minima, limiting optimization effectiveness in non-convex loss surfaces.

2. Slow Convergence on Large Datasets: Since weights are updated after

processing the entire dataset, convergence can be extremely slow for large

datasets, delaying training.

3. High Memory Demand: Calculating gradients on the entire dataset requires

substantial memory, making it challenging to implement on limited-memory

systems or massive datasets.

Stochastic Gradient Descent

To overcome some of the disadvantages of the GD algorithm, the SGD algorithm

comes into the picture as an extension of the Gradient Descent. One of the

disadvantages of the Gradient Descent algorithm is that it requires a lot of

memory to load the entire dataset at a time to compute the derivative of the loss

function. So, In the SGD algorithm, we compute the derivative by taking one data

point at a time, i.e., try to update the model’s parameters more frequently.

Therefore, the model parameters are updated after the loss computation on each

training example.

So, if we have a dataset that contains 1000 rows, applying SGD will update the model parameters 1000 times in one complete pass over the dataset (one epoch), instead of once as in Gradient Descent.

Algorithm: θ = θ − α ⋅ ∇J(θ; x(i), y(i)), where {x(i), y(i)} is a single training example.

Because we take a gradient descent step for each training example, training proceeds even faster. Comparing the two approaches:

 In SGD, we take one gradient descent step per training example, whereas in GD we take one step per pass over the entire training set.

 SGD’s path is quite noisy, but it is also much faster, and it might not converge exactly to a minimum.

It is observed that in SGD the updates take more iterations to reach the minimum than in GD; GD reaches the minimum in fewer steps. The SGD algorithm is also noisier, because the model parameters are updated frequently with high variance, causing the loss function to fluctuate with varying intensity.
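A minimal SGD sketch (same kind of toy data as the batch sketch earlier, with assumed values) shows the one-update-per-example behaviour:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

theta = 0.0
alpha = 0.05

# One epoch of SGD: 1000 parameter updates for a 1000-row dataset.
for i in rng.permutation(len(x)):
    grad_i = 2 * (theta * x[i] - y[i]) * x[i]   # gradient from a single training example
    theta = theta - alpha * grad_i

print(round(theta, 2))                          # noisy, but already near 3.0 after one epoch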

Advantages of Stochastic Gradient Descent

1. Faster Convergence: With frequent updates to model parameters, SGD

converges more quickly than other methods, ideal for large datasets.

2. Lower Memory Usage: SGD processes one data point at a time, eliminating the

need to store the entire loss function and saving memory.

3. Better Minima Exploration: The random updates can allow SGD to escape local minima, increasing the chance of reaching a better minimum.

Disadvantages of Stochastic Gradient Descent

1. High Parameter Variability: The frequent updates introduce high variance in

model parameters, which can cause unstable convergence.

2. Risk of Overshooting: Even near the global minimum, SGD can overshoot due

to its fluctuating updates, complicating precise convergence.

3. Learning Rate Adjustment Needed: To match the stable convergence of

gradient descent, the learning rate must be gradually reduced over time,

requiring careful tuning.

Mini-Batch Gradient Descent


To overcome the large time complexity of the SGD algorithm, the MB-GD algorithm comes into the picture as an extension of SGD. It also mitigates the drawbacks of batch Gradient Descent, which is why it is often considered the best among these variations of gradient descent. The MB-GD algorithm takes a batch of points, i.e., a subset of the dataset, to compute the derivative.

It is observed that, after several iterations, the derivative of the loss function for MB-GD is almost the same as the derivative of the loss function for GD. However, the number of iterations needed to reach the minimum is larger for MB-GD than for GD, and the computation cost is also larger.

The weight update therefore depends on the derivative of the loss for a batch of points. The updates in MB-GD are noisier than in GD, because the mini-batch derivative does not always point exactly toward the minimum.
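A minimal mini-batch sketch (toy data and batch size are assumptions) shows how each update uses a subset of points rather than one point or the full dataset:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

theta = 0.0
alpha = 0.1
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(x))                  # shuffle the data each epoch
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]      # a subset (mini-batch) of points
        grad = np.mean(2 * (theta * x[batch] - y[batch]) * x[batch])
        theta = theta - alpha * grad                 # one update per mini-batch

print(round(theta, 2))                               # approximately 3.0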

Advantages of Mini Batch Gradient Descent

1. Frequent, Stable Updates: Mini-batch gradient descent offers frequent updates

to model parameters while lowering variance, balancing speed and stability.

2. Moderate Memory Requirement: It balances memory usage, needing only a

medium amount to store mini-batches, making it feasible for large datasets.


Disadvantages of Mini Batch Gradient Descent

1. Higher Noise in Updates: Mini-batch updates introduce noise in parameter

updates, which can lead to less precise optimization compared to full-batch

gradient descent.

2. Slower Convergence: Compared to batch gradient descent, converging may

take longer, requiring more iterations to reach the minimum.

3. Risk of Local Minima: Mini-batch gradient descent can still get stuck in local

minima, especially in non-convex optimization problems.

Challenges with All Types of Gradient-based


Optimizers

 Optimum Learning Rate: We must choose an optimum learning rate value. If we

choose a learning rate that is too small, gradient descent may take a long time to

converge. For more about this challenge, refer to the above section on Learning

Rate, which we discussed in the Gradient Descent Algorithm.

 Constant Learning Rate: All the parameters have a constant learning rate, but

there may be some parameters that we do not want to change at the same rate.

 Local minima: You may get stuck at a local minimum, i.e., you may never reach the global minimum.
