Deep Learning.
Rohit Kumar
IIT Delhi
August 8, 2025
Logistic Regression:
▶ Binary logistic regression assumes there are two output labels $Y_i \in \{0, 1\}$.
▶ Binary logistic regression postulates the conditional probability $\Pr(Y_i = 1 \mid X_i)$ of the form
\[
\Pr(Y_i = 1 \mid X_i) = \sigma(X_i^\top \beta + \alpha),
\]
where
\[
\sigma(x) = \frac{\exp(x)}{1 + \exp(x)}
\]
is called the sigmoid function.
Logistic Regression:
▶ Based on the MLE estimator we get
\[
\widehat{\Pr}(Y_i = 1 \mid X_i) = \sigma(X_i^\top \hat{\beta} + \hat{\alpha}).
\]
▶ We can define our classifier based on this as
\[
\phi(X_i) = \arg\max \bigl\{ \widehat{\Pr}(Y_i = 1 \mid X_i),\; \widehat{\Pr}(Y_i = 0 \mid X_i) \bigr\}
= \begin{cases} 1 & \widehat{\Pr}(Y_i = 1 \mid X_i) \ge .5 \\ 0 & \text{otherwise.} \end{cases}
\]
▶ It is easy to see that
\[
\widehat{\Pr}(Y_i = 1 \mid X_i) \ge .5 \iff X_i^\top \hat{\beta} + \hat{\alpha} \ge 0.
\]
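A minimal Python sketch of this fitted rule, assuming estimates $\hat\beta$ and $\hat\alpha$ are already available from some MLE routine; the numerical values below are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + np.exp(-x))

def classify(X, beta_hat, alpha_hat):
    """Label 1 if Pr_hat(Y = 1 | X) >= 0.5, else 0.

    Since sigma is monotone, this is equivalent to X @ beta_hat + alpha_hat >= 0.
    """
    p_hat = sigmoid(X @ beta_hat + alpha_hat)   # Pr_hat(Y_i = 1 | X_i)
    return (p_hat >= 0.5).astype(int)

# Illustrative estimates for two covariates (not fitted to any real data).
beta_hat, alpha_hat = np.array([1.5, -0.8]), 0.2
X = np.array([[0.4, 0.1], [-1.0, 2.0]])
print(classify(X, beta_hat, alpha_hat))   # array of 0/1 labels
```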
Logistic Regression:
▶ So $\phi(X_i)$ is the predicted label and $Y_i$ is the true label.
▶ We can express the Type I and Type II errors as
\[
FP = \Pr(Y_i = 0 \,\&\, \phi(X_i) = 1), \qquad
FN = \Pr(Y_i = 1 \,\&\, \phi(X_i) = 0).
\]
▶ We use sample splitting to get an idea of FP and FN: fit on one part of the data and evaluate on a held-out part.
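A hedged sketch of sample splitting: fit on one half, estimate FP and FN on the held-out half. The simulated data and the plugged-in estimates are placeholders for whatever data and fitting routine are actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data; in practice (X, Y) are the observed sample.
n, d = 500, 2
X = rng.normal(size=(n, d))
Y = (X @ np.array([1.0, -1.0]) + rng.normal(size=n) > 0).astype(int)

# Split the sample: fit on the first half, evaluate on the held-out half.
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

# ... fit beta_hat, alpha_hat by MLE on X[train], Y[train] ...
beta_hat, alpha_hat = np.array([1.0, -1.0]), 0.0   # stand-in values

p_hat = 1.0 / (1.0 + np.exp(-(X[test] @ beta_hat + alpha_hat)))
pred = (p_hat >= 0.5).astype(int)
FP = np.mean((Y[test] == 0) & (pred == 1))   # estimate of Pr(Y = 0 & phi(X) = 1)
FN = np.mean((Y[test] == 1) & (pred == 0))   # estimate of Pr(Y = 1 & phi(X) = 0)
print(FP, FN)
```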
Logistic Regression:
[Figure: logistic regression as a single neuron — the inputs $1, x_1, \ldots, x_k$ enter with weights $\alpha, \beta_1, \ldots, \beta_k$, and the output is $\sigma\bigl(\alpha + \sum_{i=1}^{k} \beta_i x_i\bigr)$.]
Multiclass Logistic Regression:
▶ Multiclass logistic regression assumes there are $J + 1$ output labels $Y_i \in \{0, 1, \ldots, J\}$.
▶ The logistic regression postulates the conditional probability $\Pr(Y_i = k \mid X_i)$ of the form
\[
\Pr(Y_i = k \mid X_i) = \frac{\exp(Z_i^k)}{\sum_{j=0}^{J} \exp(Z_i^j)},
\]
where
\[
Z_i^k = X_i^\top \beta^k + \alpha^k.
\]
The map from the scores $(Z_i^0, \ldots, Z_i^J)$ to these probabilities is called the softmax function.
Multiclass Logistic Regression:
▶ Based on the MLE estimator we get
\[
\widehat{\Pr}(Y_i = k \mid X_i) = \frac{\exp(\hat{Z}_i^k)}{\sum_{j=0}^{J} \exp(\hat{Z}_i^j)}.
\]
▶ We can define our classifier based on this as
\[
\phi(X_i) = \arg\max_{k \in \{0, 1, \ldots, J\}} \bigl\{ \widehat{\Pr}(Y_i = k \mid X_i) \bigr\}.
\]
▶ It is easy to see that
\[
\widehat{\Pr}(Y_i = k \mid X_i) \ge \widehat{\Pr}(Y_i = j \mid X_i) \text{ for all } j \neq k
\iff \hat{Z}_i^k \ge \hat{Z}_i^j \text{ for all } j \neq k.
\]
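A short Python sketch of this rule: stack the class scores $\hat Z_i^k$, apply the softmax, and take the argmax (which, by monotonicity, equals the argmax over the scores themselves). The estimates below are illustrative, not fitted.

```python
import numpy as np

def softmax(Z):
    """Softmax along the last axis, with the usual max shift for numerical stability."""
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def classify_multiclass(X, B_hat, a_hat):
    """B_hat holds one column beta_hat^k per class; a_hat one intercept per class."""
    Z_hat = X @ B_hat + a_hat        # scores Z_hat_i^k, shape (n, J + 1)
    probs = softmax(Z_hat)           # Pr_hat(Y_i = k | X_i)
    return probs.argmax(axis=1)      # same as Z_hat.argmax(axis=1)

# Illustrative estimates: d = 2 covariates, J + 1 = 3 classes.
B_hat = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0, -1.0]])
a_hat = np.array([0.0, 0.1, -0.1])
X = np.array([[2.0, -1.0], [-0.5, 1.5]])
print(classify_multiclass(X, B_hat, a_hat))
```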
Multiclass Logistic Regression:
[Figure: a network diagram with an input layer ($x_1, \ldots, x_4$), a hidden layer, and an output layer.]
Revisit Logistic Regression:
[Figure: the four points of the unit square in the $(x_1, x_2)$ plane with XOR labels — $(0,0) \to 0$, $(1,1) \to 0$, $(0,1) \to 1$, $(1,0) \to 1$ — which are not linearly separable.]
Revisit Logistic Regression:
[Figure: a network with inputs $x_1, x_2$, one hidden layer, and a single output $y$.]
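To see how one hidden layer resolves the XOR pattern above, here is a sketch with hand-chosen (not estimated) weights: one hidden unit approximates OR, the other AND, and the output unit fires when OR holds but AND does not.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xor_net(x1, x2):
    """Hand-chosen weights: h1 ~ OR(x1, x2), h2 ~ AND(x1, x2), y ~ h1 AND NOT h2."""
    h1 = sigmoid(20 * (x1 + x2) - 10)    # ~1 if at least one input is 1
    h2 = sigmoid(20 * (x1 + x2) - 30)    # ~1 only if both inputs are 1
    return sigmoid(20 * (h1 - h2) - 10)  # ~1 exactly when the inputs differ

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, int(xor_net(x1, x2) > 0.5))
# prints labels 0, 1, 1, 0 — the XOR pattern no single linear boundary can produce
```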
Neural Network:
▶ Logistic regression is an example of the simplest neural network, consisting only of the input and output layers.
▶ Neural networks are just a generalization of multiclass logistic regression.
▶ In general, a neural network has one input layer, one or more hidden layers, and one output layer.
▶ There can be many hidden layers, and each layer is assigned an activation function.
Shallow Network:
[Figure: a shallow network with an input layer ($x_1, \ldots, x_4$), a single hidden layer, and an output layer.]
Deep Network:
[Figure: a deep network with an input layer ($x_1, \ldots, x_4$), hidden layers $1$ through $L$, and an output layer.]
Neural Network:
▶ Our objective is to approximate a function $f: \mathcal{X} \mapsto \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^d$, using a neural network.
▶ We consider a network with $L$ hidden layers, with the width of layer $l$ denoted $H_l$ for $l = 0, 1, \ldots, L + 1$.
▶ $H_0 = d$ is the number of inputs and $H_{L+1} = 1$ is the width of the output layer.
▶ Let us denote the output vector of the $l$-th layer by $x^{(l)} \in \mathbb{R}^{H_l}$, which will serve as the input to the next layer.
▶ Each output of the $l$-th layer is fed into the $(l+1)$-th layer as input.
Neural Network:
▶ Each input is weighted by $\beta$ and shifted by a bias $\alpha$ as $\beta^\top x + \alpha$; an activation function is then applied to get the output of the layer,
\[
\sigma(\beta^\top x + \alpha).
\]
▶ For layer $l + 1$ the output of each neuron is
\[
x_k^{(l+1)} = \sigma\bigl( (\beta_k^{(l+1)})^\top x^{(l)} + \alpha_k^{(l+1)} \bigr), \qquad 1 \le k \le H_{l+1},
\]
where $\beta_k^{(l+1)}$ and $\alpha_k^{(l+1)}$ are respectively known as the weights and bias, while the function $\sigma(\cdot)$ is known as the activation function.
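A small sketch of a single layer's forward pass, one neuron at a time, matching the formula above; the widths, weights, and ReLU activation are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer_forward(x_prev, betas, alphas, act=relu):
    """Compute the H_{l+1} outputs of layer l+1 from the layer-l output x_prev.

    betas  : list of H_{l+1} weight vectors beta_k^{(l+1)}, each of length H_l
    alphas : list of H_{l+1} biases alpha_k^{(l+1)}
    """
    return np.array([act(b @ x_prev + a) for b, a in zip(betas, alphas)])

# Illustrative layer with H_l = 3 inputs and H_{l+1} = 2 neurons.
x_prev = np.array([1.0, -2.0, 0.5])
betas = [np.array([0.2, 0.4, -0.1]), np.array([-0.3, 0.1, 0.5])]
alphas = [0.1, -0.2]
print(layer_forward(x_prev, betas, alphas))
```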
Neural Network:
▶ We can write the whole thing in matrix form as
\[
x^{(l+1)} = \sigma(\mathcal{L}^{(l+1)}(x^{(l)})) = \sigma \circ \mathcal{L}^{(l+1)}(x^{(l)}),
\]
where $\mathcal{L}^{(l+1)}(x^{(l)}) = B^{(l+1)} x^{(l)} + A^{(l+1)}$ with
\[
B^{(l+1)} = \begin{pmatrix} (\beta_1^{(l+1)})^\top \\ (\beta_2^{(l+1)})^\top \\ \vdots \\ (\beta_{H_{l+1}}^{(l+1)})^\top \end{pmatrix},
\qquad
A^{(l+1)} = \begin{pmatrix} \alpha_1^{(l+1)} \\ \alpha_2^{(l+1)} \\ \vdots \\ \alpha_{H_{l+1}}^{(l+1)} \end{pmatrix}.
\]
▶ The final output can be written as
\[
f(x; \theta) = \mathcal{L}^{(L+1)} \circ \sigma \circ \mathcal{L}^{(L)} \circ \sigma \circ \mathcal{L}^{(L-1)} \circ \cdots \circ \sigma \circ \mathcal{L}^{(1)}(x).
\]
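The same computation vectorized: stack the $\beta_k^{(l+1)}$ into $B^{(l+1)}$ and the $\alpha_k^{(l+1)}$ into $A^{(l+1)}$, then compose the layers to evaluate $f(x; \theta)$. The random initialization and the ReLU choice are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def init_params(widths):
    """widths = [H_0, H_1, ..., H_{L+1}]; returns one (B, A) pair per layer."""
    return [(rng.normal(size=(h_out, h_in)), rng.normal(size=h_out))
            for h_in, h_out in zip(widths[:-1], widths[1:])]

def forward(x, params, act=relu):
    """f(x; theta) = L^(L+1) o sigma o L^(L) o ... o sigma o L^(1)(x)."""
    for B, A in params[:-1]:
        x = act(B @ x + A)          # x^(l+1) = sigma(B^(l+1) x^(l) + A^(l+1))
    B, A = params[-1]
    return B @ x + A                # final affine layer, no activation

params = init_params([4, 8, 8, 1])   # d = 4 inputs, two hidden layers, one output
print(forward(np.array([0.5, -1.0, 2.0, 0.1]), params))
```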
Activation Function:
▶ The activation function plays a pivotal role in helping the network represent complex non-linear functions.
▶ Examples (see the sketch after this list):
1. Linear: $\sigma(x) = x$.
2. ReLU: $\sigma(x) = \max\{0, x\}$.
3. Leaky ReLU: $\sigma(x) = \max\{0, x\} + \alpha \min\{0, x\}$ for a small slope $\alpha > 0$.
4. Sigmoid: $\sigma(x) = \exp(x)/(1 + \exp(x))$.
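The four examples as one-line functions (here $\alpha$ is the leaky-ReLU slope, not a bias).

```python
import numpy as np

def linear(x):
    return x

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    # x for x >= 0, alpha * x for x < 0
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5])
for f in (linear, relu, leaky_relu, sigmoid):
    print(f.__name__, f(x))
```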
Activation Functions:
[Figure: plots of $\sigma(\xi)$ — (1) Linear (2) ReLU (3) Leaky ReLU with $\alpha = 0.1$.]
Activation Functions:
[Figure: plots of $\sigma(\xi)$ — (1) Logistic (2) Tanh (3) Sine.]
Example: 1 Layer.
[Figure: a one-hidden-layer ReLU network on a scalar input $x_1^{(0)} \in [0, 4]$. Each hidden unit (one with weight $w = 2$, bias $b = -2$; one with weight $w = 1$, bias $b = 0$) produces a piecewise-linear output $x^{(1)}$ with one kink; the output layer (weights $w = 1$, bias $b = 0$) sums them into a piecewise-linear function $x_1^{(2)}$ with two kinks.]
Deep vs Shallow:
▶ When and why are deep networks better than shallow ones?
▶ Both shallow and deep networks can approximate arbitrarily well any continuous function of $d$ variables on a compact domain.
▶ Suppose we want to approximate functions with a compositional structure
\[
f(x_1, \ldots, x_d) = h_1\bigl(h_2 \ldots \bigl(h_j\bigl(h_{i_1}(x_1, x_2),\, h_{i_2}(x_3, x_4)\bigr), \ldots \bigr)\bigr).
\]
▶ For shallow learning, we need parameter complexity of order $\epsilon^{-d/r}$.
▶ For deep learning, we need parameter complexity of only around $\epsilon^{-2/r}$.
Gradient Descent:
▶ Suppose we wish to solve the minimization problem $\theta^* = \arg\min_\theta \Pi(\theta)$.
▶ Consider the Taylor expansion about $\theta_0$,
\[
\Pi(\theta_0 + \Delta\theta) = \Pi(\theta_0) + \frac{\partial \Pi(\theta_0)}{\partial \theta}^{\!\top} \Delta\theta
+ \frac{1}{2}\, \Delta\theta^\top \frac{\partial^2 \Pi(\hat{\theta})}{\partial \theta\, \partial \theta^\top} \Delta\theta,
\]
for some $\hat{\theta} = \theta_0 + \alpha \Delta\theta$, where $0 \le \alpha \le 1$.
▶ When $\|\Delta\theta\|$ is small we can neglect the second-order term:
\[
\Pi(\theta_0 + \Delta\theta) \approx \Pi(\theta_0) + \frac{\partial \Pi(\theta_0)}{\partial \theta}^{\!\top} \Delta\theta.
\]
Gradient Descent:
▶ We should choose $\Delta\theta$ so as to reduce the objective function.
▶ We therefore choose the step $\Delta\theta$ in the opposite direction of the gradient,
\[
\Delta\theta = -\eta\, \frac{\partial \Pi(\theta_0)}{\partial \theta},
\]
with the step size $\eta \ge 0$, also known as the learning rate.
▶ This is the crux of the GD algorithm.
Gradient Descent:
1. Initialize $k = 0$ and $\theta_0$.
2. While $|\Pi(\theta_k) - \Pi(\theta_{k-1})| > \epsilon_1$, do
   (a) Evaluate $\dfrac{\partial \Pi(\theta_k)}{\partial \theta}$.
   (b) Update $\theta_{k+1} = \theta_k - \eta\, \dfrac{\partial \Pi(\theta_k)}{\partial \theta}$.
   (c) Increment $k = k + 1$.
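A direct Python transcription of this loop, applied to a simple quadratic objective whose minimizer is known; the learning rate $\eta$ and tolerance $\epsilon_1$ are tuning choices.

```python
import numpy as np

def gradient_descent(pi, grad, theta0, eta=0.1, eps1=1e-8, max_iter=10_000):
    """Minimize Pi(theta): stop once |Pi(theta_k) - Pi(theta_{k-1})| <= eps1."""
    theta = np.asarray(theta0, dtype=float)
    prev_val = np.inf
    for _ in range(max_iter):
        val = pi(theta)
        if abs(val - prev_val) <= eps1:
            break
        theta = theta - eta * grad(theta)   # theta_{k+1} = theta_k - eta * dPi/dtheta
        prev_val = val
    return theta

# Illustrative objective Pi(theta) = ||theta - c||^2 with known minimizer c.
c = np.array([1.0, -2.0])
pi = lambda th: float(np.sum((th - c) ** 2))
grad = lambda th: 2.0 * (th - c)
print(gradient_descent(pi, grad, theta0=np.zeros(2)))   # approximately [1, -2]
```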
Advanced Algorithms:
▶ In general, the update formula for most optimization algorithms is
\[
[\theta_{k+1}]_i = [\theta_k]_i - [\eta_k]_i\, [g_k]_i, \qquad 1 \le i \le N_\theta.
\]
▶ Momentum methods make use of the history of the gradient:
\[
[\eta_k]_i = \eta, \qquad g_k = \beta_1 g_{k-1} + (1 - \beta_1)\, \frac{\partial \Pi(\theta_k)}{\partial \theta}, \qquad g_{-1} = 0.
\]
▶ Adam's algorithm: $g_k$ is the same as in the momentum algorithm (with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$). Additionally,
\[
[G_k]_i = \beta_2 [G_{k-1}]_i + (1 - \beta_2) \left( \frac{\partial \Pi(\theta_k)}{\partial \theta_i} \right)^{\!2},
\qquad
[\eta_k]_i = \frac{\eta}{\sqrt{[G_k]_i} + \epsilon}.
\]
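A sketch of the two update rules as written above; note that the full Adam algorithm also includes bias-correction terms, which the slide (and this sketch) omit. The hyperparameter defaults follow the values quoted above.

```python
import numpy as np

def momentum_step(theta, grad, g_prev, eta=0.01, beta1=0.9):
    """Momentum update: g_k = beta1 * g_{k-1} + (1 - beta1) * grad."""
    g = beta1 * g_prev + (1 - beta1) * grad
    return theta - eta * g, g

def adam_step(theta, grad, g_prev, G_prev, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update as written on the slide (no bias correction shown)."""
    g = beta1 * g_prev + (1 - beta1) * grad                 # first moment
    G = beta2 * G_prev + (1 - beta2) * grad ** 2            # second moment, per coordinate
    step = eta / (np.sqrt(G) + eps)                         # coordinate-wise learning rate
    return theta - step * g, g, G

# One illustrative step on a toy gradient.
theta, grad = np.zeros(3), np.array([0.5, -1.0, 2.0])
theta_m, g = momentum_step(theta, grad, g_prev=np.zeros(3))
theta_a, g, G = adam_step(theta, grad, g_prev=np.zeros(3), G_prev=np.zeros(3))
print(theta_m, theta_a)
```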
Back Propagation:
▶ In deep learning we are minimizing the least-squares criterion
\[
\sum_{i=1}^{N} \bigl(Y_i - f(X_i; \theta)\bigr)^2.
\]
▶ Recall that
\[
f(x; \theta) = \mathcal{L}^{(L+1)} \circ \sigma \circ \mathcal{L}^{(L)} \circ \sigma \circ \mathcal{L}^{(L-1)} \circ \cdots \circ \sigma \circ \mathcal{L}^{(1)}(x).
\]
▶ Applying the chain rule to this composition, we can compute the derivative with respect to every parameter by working backwards from the output layer; this is back propagation.
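A hedged sketch of back propagation for the squared loss with one hidden layer: the forward pass stores the intermediate quantities, and the backward pass applies the chain rule from the output back to the first layer's parameters. The shapes and the ReLU activation are assumptions for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def loss_and_grads(X, Y, B1, A1, B2, A2):
    """Squared loss sum_i (Y_i - f(X_i))^2 for f = L2 o relu o L1, with gradients."""
    # Forward pass (keep intermediates for the backward pass).
    Z1 = X @ B1.T + A1          # hidden-layer pre-activations, shape (n, H1)
    H = relu(Z1)                # hidden output x^(1)
    f = H @ B2.T + A2           # network output, shape (n, 1)
    resid = f - Y[:, None]
    loss = np.sum(resid ** 2)

    # Backward pass: chain rule, starting from dLoss/df = 2 * resid.
    dF = 2.0 * resid                        # (n, 1)
    dB2 = dF.T @ H                          # dLoss/dB2, shape (1, H1)
    dA2 = dF.sum(axis=0)                    # dLoss/dA2
    dH = dF @ B2                            # (n, H1)
    dZ1 = dH * relu_grad(Z1)                # back through the activation
    dB1 = dZ1.T @ X                         # dLoss/dB1, shape (H1, d)
    dA1 = dZ1.sum(axis=0)                   # dLoss/dA1
    return loss, (dB1, dA1, dB2, dA2)

# Tiny illustrative problem.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 3)), rng.normal(size=20)
B1, A1 = rng.normal(size=(5, 3)), np.zeros(5)
B2, A2 = rng.normal(size=(1, 5)), np.zeros(1)
print(loss_and_grads(X, Y, B1, A1, B2, A2)[0])
```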