Mathematics for Machine Learning: Essential Equations (V4)
1. Basic Linear Algebra
• Scalar Multiplication:
  c · v = [c v_1, c v_2, ⋯, c v_n]^T
• Matrix-Vector Multiplication:
  (A · v)_i = ∑_{j=1}^{n} a_ij v_j,  i = 1, ⋯, m,  for A ∈ R^{m×n}, v ∈ R^n
• Norm of a Vector:
  ||v|| = √(v_1² + v_2² + ⋯ + v_n²)
• Dot Product:
  u · v = ∑_{i=1}^{n} u_i v_i
• Cross Product (3D Vectors):
  u × v = [u_2 v_3 − u_3 v_2,  u_3 v_1 − u_1 v_3,  u_1 v_2 − u_2 v_1]^T
• Outer Product:
  (u ⊗ v)_ij = u_i v_j,  an m × n matrix for u ∈ R^m, v ∈ R^n
• Matrix Addition:
  (A + B)_ij = a_ij + b_ij
• Matrix Multiplication:
  (A · B)_ij = ∑_{k=1}^{n} a_ik b_kj
• Transpose of a Matrix:
  (A^T)_ij = a_ji
• Inverse of a Matrix (for square A):
  A^{−1} · A = I, where I is the identity matrix
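All of the identities above map directly onto NumPy primitives. A minimal sketch to check them numerically (the array values are arbitrary illustrative data, not from the text):

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

print(3.0 * v)                   # scalar multiplication: (c v_1, ..., c v_n)
print(np.dot(u, v))              # dot product: sum_i u_i v_i
print(np.linalg.norm(v))         # Euclidean norm: sqrt(sum_i v_i^2)
print(np.cross(u, v))            # cross product of 3-D vectors
print(np.outer(u, v))            # outer product: entry (i, j) is u_i v_j
print(A @ np.array([1.0, 1.0]))  # matrix-vector product
print(A.T)                       # transpose: (A^T)_ij = a_ji
print(np.linalg.inv(A) @ A)      # A^{-1} A = I, up to floating-point error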
2. Basic Probability and Statistics
• Conditional Probability:
  P(A|B) = P(A ∩ B) / P(B)
• Law of Total Probability:
  P(A) = ∑_i P(A ∩ B_i) = ∑_i P(A|B_i) P(B_i)
• Bayes’ Theorem:
  P(A|B) = P(B|A) P(A) / P(B)
• Expectation:
  E[X] = ∑_i x_i P(x_i)  (discrete)  or  E[X] = ∫ x p(x) dx  (continuous)
• Variance:
  Var(X) = E[(X − E[X])²]
• Standard Deviation:
  σ = √Var(X)
• Covariance:
  Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
• Correlation Coefficient:
  ρ_{X,Y} = Cov(X, Y) / (σ_X σ_Y)
• Probability Mass Function (PMF):
  P(X = x) = p(x),  with ∑_x p(x) = 1
• Probability Density Function (PDF):
  ∫_{−∞}^{∞} p(x) dx = 1  for continuous random variables
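The sample analogues of these moments are easy to compute directly; a minimal sketch with synthetic data (the seed, sample size, and distribution parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=10_000)    # y correlated with x

mean_x = x.mean()                                  # sample estimate of E[X]
var_x = ((x - mean_x) ** 2).mean()                 # Var(X) = E[(X - E[X])^2]
cov_xy = ((x - mean_x) * (y - y.mean())).mean()    # Cov(X, Y)
rho = cov_xy / (x.std() * y.std())                 # correlation coefficient
print(mean_x, var_x, cov_xy, rho)                  # rho should be near 0.45 here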
3. Basic Calculus
• Derivative of a Function:
  d/dx [f(x)] = lim_{h→0} [f(x + h) − f(x)] / h
• Partial Derivatives:
  ∂f/∂x = lim_{h→0} [f(x + h, y) − f(x, y)] / h
• Gradient:
  ∇f(x) = [∂f/∂x_1, ∂f/∂x_2, ⋯, ∂f/∂x_n]^T
• Chain Rule:
  dy/dx = (dy/du) · (du/dx)
• Second Derivative (Hessian Matrix):
  H(f)_ij = ∂²f / (∂x_i ∂x_j),  the n × n matrix of all second-order partial derivatives, from ∂²f/∂x_1² to ∂²f/∂x_n²
• Taylor Series Expansion:
  f(x) ≈ f(a) + f′(a)(x − a) + [f′′(a)/2!](x − a)² + ⋯
• Gradient Descent Update Rule:
w ← w − η∇J(w)
• Optimization Objective:
  min_x f(x)
• Logarithmic Derivative:
  d/dx [ln x] = 1/x
• Exponential Derivative:
  d/dx [e^x] = e^x
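The limit definitions above are also how analytic gradients are sanity-checked in practice; a minimal finite-difference sketch (the objective f is an arbitrary example, not from the text):

import numpy as np

def f(x):
    # example objective: f(x) = x_0^2 + 3 x_1^2, analytic gradient (2 x_0, 6 x_1)
    return x[0] ** 2 + 3.0 * x[1] ** 2

def numerical_gradient(f, x, h=1e-6):
    # central-difference approximation of each partial derivative
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))          # ~ [2, 12], matching the analytic gradient
x = x - 0.1 * numerical_gradient(f, x)   # one gradient-descent step with eta = 0.1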
4. Basic Optimization
• Gradient Descent:
  w_{t+1} = w_t − η ∇J(w_t)
• Learning Rate Decay:
  η_t = η_0 / (1 + λt)
• Stochastic Gradient Descent (SGD):
  w ← w − η ∇J(w; x_i, y_i)
• Momentum-based Optimization:
  v_t = β v_{t−1} + (1 − β) ∇J(w),  w ← w − η v_t
• Nesterov Accelerated Gradient (NAG):
  w_{t+1} = w_t − η ∇J(w_t + β(w_t − w_{t−1}))
• RMSProp (s_t is a running average of squared gradients):
  s_t = β s_{t−1} + (1 − β)(∇J(w))²,  w ← w − η ∇J(w) / √(s_t + ϵ)
• Adam Optimization:
  m_t = β_1 m_{t−1} + (1 − β_1) ∇J(w),  v_t = β_2 v_{t−1} + (1 − β_2)(∇J(w))²
  m̂_t = m_t / (1 − β_1^t),  v̂_t = v_t / (1 − β_2^t)
  w_{t+1} ← w_t − η m̂_t / (√v̂_t + ϵ)
• Regularized Optimization Objective:
  J(w) = Loss(w) + λ ||w||²
• Projected Gradient Descent:
  w_{t+1} = Π_C(w_t − η ∇J(w_t)),  where Π_C projects onto the feasible set C
• Newton’s Method:
  w_{t+1} = w_t − η H^{−1} ∇J(w_t),  where H is the Hessian matrix
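A minimal sketch of the Adam update applied to a toy quadratic objective (the objective, step count, and learning rate are arbitrary illustrative choices; β_1 = 0.9 and β_2 = 0.999 are the usual defaults):

import numpy as np

def grad_J(w):
    # gradient of the toy objective J(w) = ||w||^2 / 2
    return w

w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
beta1, beta2, eta, eps = 0.9, 0.999, 0.1, 1e-8

for t in range(1, 301):
    g = grad_J(w)
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # Adam update

print(w)   # should end up close to the minimizer [0, 0]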
5. Basic Regression Equations
• Linear Regression Hypothesis:
ŷ = X · w + b
• Mean Absolute Error (MAE):
  MAE = (1/m) ∑_{i=1}^{m} |y_i − ŷ_i|
• Mean Squared Error (MSE):
  MSE = (1/m) ∑_{i=1}^{m} (y_i − ŷ_i)²
• Ridge Regression Objective:
  J(w) = (1/m) ∑_{i=1}^{m} (ŷ_i − y_i)² + λ ∑_{j=1}^{n} w_j²
• Lasso Regression Objective:
  J(w) = (1/m) ∑_{i=1}^{m} (ŷ_i − y_i)² + λ ∑_{j=1}^{n} |w_j|
• Logistic Regression Hypothesis:
  ŷ = σ(X · w + b),  σ(z) = 1 / (1 + e^{−z})
• Binary Cross-Entropy Loss:
  J(w) = −(1/m) ∑_{i=1}^{m} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]
• Coefficient of Determination (R-squared):
  R² = 1 − [∑_{i=1}^{m} (y_i − ŷ_i)²] / [∑_{i=1}^{m} (y_i − ȳ)²]
• Adjusted R-squared:
  R̄² = 1 − (1 − R²)(n − 1) / (n − p − 1),  where n is the sample size and p the number of predictors
• Gradient of the MSE Loss:
  ∇J(w) = (1/m) X^T (Xw − y)  (for the convention J(w) = (1/2m) ||Xw − y||²)
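A minimal sketch tying the hypothesis, the MSE gradient, and R² together on synthetic data (the data, learning rate, and ridge strength are arbitrary; the gradient follows the J = (1/2m)||Xw − y||² convention noted above):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])              # hypothetical ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
eta, lam, m = 0.1, 0.01, len(y)
for _ in range(500):
    # gradient of (1/2m)||Xw - y||^2 plus the ridge penalty gradient 2*lam*w
    grad = X.T @ (X @ w - y) / m + 2.0 * lam * w
    w = w - eta * grad

y_hat = X @ w
mse = np.mean((y - y_hat) ** 2)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(w, mse, r2)   # w near true_w, R^2 near 1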
6. Basic Neural Network Concepts
• Perceptron Update Rule:
w ← w + η(y − ŷ)x
• Sigmoid Activation Function:
  σ(z) = 1 / (1 + e^{−z})
• ReLU Activation Function:
  f(x) = max(0, x)
• Softmax Function:
  Softmax(z_i) = e^{z_i} / ∑_{j=1}^{n} e^{z_j}
• Loss Function for Multi-Class Classification:
  J(w) = −(1/m) ∑_{i=1}^{m} ∑_{k=1}^{K} y_ik log(ŷ_ik)
• Forward Propagation (Single Layer):
  a = σ(w^T x + b)
• Backward Propagation (Gradient for Weights):
  ∂J/∂w = x(ŷ − y)
• Gradient Descent for Neural Networks:
  w ← w − η ∂J/∂w
• Dropout Regularization:
  h̃_i^(l) = r_i h_i^(l),  r_i ∼ Bernoulli(p)
• Batch Normalization:
  x̂_i = (x_i − μ_B) / √(σ_B² + ϵ),  y_i = γ x̂_i + β
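The activations above in NumPy, plus a single-unit forward pass; a minimal sketch (the input size and values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
x = rng.normal(size=4)
w = rng.normal(size=4)
b = 0.1
a = sigmoid(w @ x + b)               # forward propagation for one sigmoid unit
print(a, relu(x), softmax(x))        # the softmax output sums to 1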
7. Basic Clustering Concepts
• k-Means Objective Function:
  J = ∑_{k=1}^{K} ∑_{i ∈ C_k} ||x_i − μ_k||²
• Centroid Update Rule:
  μ_k = (1/|C_k|) ∑_{x ∈ C_k} x
• Distance Metric (Euclidean Distance):
  d(x, y) = √(∑_{i=1}^{n} (x_i − y_i)²)
• Silhouette Score:
  s(i) = [b(i) − a(i)] / max(a(i), b(i))
• DBSCAN Core Point Condition:
  |N_ϵ(x)| ≥ MinPts,  where N_ϵ(x) = {y : d(x, y) ≤ ϵ}
• Hierarchical Clustering Dendrogram Objective:
  At each step, merge the pair of clusters (A, B) that minimizes the linkage criterion L(A, B)
• Gaussian Mixture Model (GMM):
  p(x) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k)
• Expectation-Maximization (E-step):
  γ_ik = π_k N(x_i | μ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x_i | μ_j, Σ_j)
• Expectation-Maximization (M-step):
  μ_k = [∑_{i=1}^{N} γ_ik x_i] / [∑_{i=1}^{N} γ_ik]
  Σ_k = [∑_{i=1}^{N} γ_ik (x_i − μ_k)(x_i − μ_k)^T] / [∑_{i=1}^{N} γ_ik]
• Elbow Method for Optimal k:
  Plot J(k) against k and choose the k at the "elbow", beyond which further increases in k yield only marginal decreases in J. A sketch of the k-means loop itself follows.
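A minimal k-means sketch implementing the assignment and centroid-update steps above (the two synthetic blobs, K, and the iteration count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),    # blob around (0, 0)
               rng.normal(3.0, 0.5, (50, 2))])   # blob around (3, 3)
K = 2
mu = X[rng.choice(len(X), K, replace=False)]     # random initial centroids

for _ in range(20):
    # assignment step: each point joins its nearest centroid
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # update step: centroid = mean of its assigned points
    mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])

J = sum(np.sum((X[labels == k] - mu[k]) ** 2) for k in range(K))
print(mu, J)   # centroids near the blob centers; J is the k-means objective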
8. Basic Dimensionality Reduction Concepts
• Principal Component Analysis (PCA) Objective:
  Maximize ||Xw||² subject to ||w|| = 1
• Covariance Matrix for PCA:
  C = (1/m) X^T X  (with X mean-centered)
• Eigen Decomposition for PCA:
Cw = λw
• t-SNE Objective:
  C = ∑_{i≠j} p_ij log(p_ij / q_ij)
• Singular Value Decomposition (SVD):
  X = U Σ V^T
• LDA Objective (Fisher’s Criterion):
  J(w) = (w^T S_b w) / (w^T S_w w)
• Reconstruction Error for PCA:
  Error = ||X − X̂||_F
• Kernel PCA Transformation:
  Perform PCA on the mapped data ϕ(x), computed implicitly via the kernel k(x, x′) = ϕ(x)^T ϕ(x′)
• Autoencoder Reconstruction:
  X ≈ g(f(X)),  where f is the encoder and g the decoder
• Explained Variance Ratio:
  Ratio_i = λ_i / ∑_j λ_j
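A minimal PCA sketch via the covariance-matrix eigendecomposition above (the data are arbitrary synthetic samples; np.linalg.eigh is used because C is symmetric):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                   # center X, as C = (1/m) X^T X assumes

C = X.T @ X / len(X)                     # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # solves C w = lambda w
order = eigvals.argsort()[::-1]          # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratio = eigvals / eigvals.sum()          # explained variance ratio
Z = X @ eigvecs[:, :2]                   # project onto the top-2 components
print(ratio, Z.shape)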
9. Basic Probability Distributions
• Bernoulli Distribution:
  P(X = x) = p^x (1 − p)^{1−x},  x ∈ {0, 1}
• Binomial Distribution:
  P(X = k) = C(n, k) p^k (1 − p)^{n−k},  k ∈ {0, 1, ⋯, n}
• Poisson Distribution:
  P(X = k) = λ^k e^{−λ} / k!,  k = 0, 1, 2, ⋯
• Uniform Distribution:
  f(x) = 1/(b − a) for a ≤ x ≤ b,  0 otherwise
• Normal Distribution:
  f(x) = [1 / √(2πσ²)] e^{−(x−μ)² / (2σ²)}
• Exponential Distribution:
  f(x) = λ e^{−λx} for x ≥ 0,  0 for x < 0
• Beta Distribution:
  f(x; α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β),  x ∈ [0, 1]
• Gamma Distribution:
  f(x; α, β) = β^α x^{α−1} e^{−βx} / Γ(α),  x ≥ 0
• Multinomial Distribution:
  P(X_1 = x_1, ⋯, X_k = x_k) = [n! / (x_1! x_2! ⋯ x_k!)] p_1^{x_1} p_2^{x_2} ⋯ p_k^{x_k}
• Chi-Square Distribution:
  f(x; k) = x^{k/2 − 1} e^{−x/2} / [2^{k/2} Γ(k/2)],  x ≥ 0
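These densities and mass functions are straightforward to evaluate directly; a minimal sketch of a few of them (the argument values are arbitrary):

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

print(normal_pdf(0.0))                              # peak of the standard normal, ~0.3989
print(binomial_pmf(3, 10, 0.5))                     # C(10, 3) / 2^10 ~ 0.1172
print(sum(poisson_pmf(k, 2.0) for k in range(50)))  # the PMF sums to ~1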
10. Basic Reinforcement Learning Concepts
• Bellman Equation for State-Value Function:
  V(s) = E[R_{t+1} + γ V(S_{t+1}) | S_t = s]
• Bellman Equation for Action-Value Function:
  Q(s, a) = E[R_{t+1} + γ Q(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
• Policy Improvement:
  π′(s) = argmax_a Q(s, a)
• Temporal Difference Update Rule:
  V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
• Q-Learning Update Rule:
  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
• SARSA Update Rule:
  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]
• Reward Function:
  R(s, a) = E[R_{t+1} | S_t = s, A_t = a]
• Value Iteration Update Rule:
  V(s) ← max_a [R(s, a) + γ ∑_{s′} P(s′ | s, a) V(s′)]
• Actor-Critic Policy Update:
  θ ← θ + α ∇_θ log π_θ(a|s) δ,  where δ is the TD error
• Discounted Return:
  G_t = ∑_{k=0}^{∞} γ^k R_{t+k+1}
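A minimal tabular Q-learning sketch on a hypothetical 5-state chain (move left or right, reward 1 only on reaching the last state; the environment, episode counts, and hyperparameters are all invented for illustration):

import numpy as np

n_states, n_actions = 5, 2                # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(5)

for _ in range(500):                      # episodes
    s = 0
    for _ in range(100):                  # cap on episode length
        if rng.random() < eps:            # epsilon-greedy exploration
            a = rng.integers(n_actions)
        else:                             # greedy, breaking ties at random
            a = rng.choice(np.flatnonzero(Q[s] == Q[s].max()))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update rule from above
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == n_states - 1:
            break

print(Q)   # the greedy policy should prefer action 1 (right) in every state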