Logistic Regression: Exercises and Solutions
Prof. Abdelatif Hafid
ESISA
March 21, 2025
Exercise 1: Sigmoid Function
(a) Write the mathematical expression for the sigmoid function (sometimes denoted σ(z)).
(b) Calculate g(z) when z = 2.5.
(c) What are the limits of g(z) as z approaches ±∞?
(d) Why is the sigmoid function useful for binary classification?
Solution:
(a) The sigmoid function is:
g(z) = 1 / (1 + e^(−z))
(b) For z = 2.5:
g(2.5) = 1 / (1 + e^(−2.5)) = 1 / (1 + 0.0821) ≈ 0.924
(c) Limits:
• As z → +∞: g(z) → 1
• As z → -∞: g(z) → 0
(d) The sigmoid function is ideal for binary classification because:
• It maps any real input to (0,1)
• The output can be interpreted as a probability
• It has a natural decision boundary at g(z) = 0.5
Exercise 2: Decision Boundary
For a logistic regression model with parameters represented as vectors:
w = [w1, w2]^T = [2, −1]^T,    b = −3
Answer the following:
(a) Express z in terms of x = [x1, x2]^T using vector notation.
(b) Find the decision boundary equation.
(c) Determine where the model predicts y = 1.
Solution:
(a) Linear combination in vector form:
z = w^T x + b = [2  −1] · [x1, x2]^T + (−3)
Expanding:
z = 2x1 − x2 − 3
(b) Decision boundary:
g(z) = 0.5 =⇒ z = 0
Substituting z = 0:
2x1 − x2 − 3 = 0 =⇒ x2 = 2x1 − 3
(c) Predicting y = 1 (z > 0):
2x1 − x2 − 3 > 0 =⇒ x2 < 2x1 − 3
Exercise 3: Cost Function
For a single training example where y = 1, and h(x) represents the predicted value (also denoted f_{w,b}(x) or ŷ):
(a) Calculate the loss when ŷ = 0.7
(b) Calculate the loss when ŷ = 0.1
(c) Explain why mean squared error isn’t used
Solution:
(a) For ŷ = 0.7:
loss(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ)
= −(1) log(0.7) − (0) log(0.3)
= − log(0.7)
= 0.357
(b) For ŷ = 0.1:
loss(ŷ, y) = − log(0.1)
= 2.303
(c) Mean squared error isn’t used because:
• It creates a non-convex optimization problem
• Multiple local minima make optimization unreliable
• Gradient descent may not find the global minimum
Exercise 4: Gradient Descent
Given the dataset:
x1 x2 y
2 1 1
3 -1 0
1 2 1
4 0 0
(a) Write gradient descent update equations
(b) Calculate the first iteration (α = 0.1, starting from zeros)
(c) Describe the role of α
(d) Discuss potential issues with large α
Solution:
(a) Gradient descent update equations:
∂J/∂wj = (1/m) Σ_{i=1}^{m} (ŷ^(i) − y^(i)) xj^(i),    ∂J/∂b = (1/m) Σ_{i=1}^{m} (ŷ^(i) − y^(i)),
and the simultaneous updates are
wj := wj − α ∂J/∂wj,    b := b − α ∂J/∂b,
where ŷ = g(z) and z = w1 x1 + w2 x2 + b.
(b) First iteration (α = 0.1, starting from w1 := 0, w2 := 0, b := 0):
ŷ^(1) = ŷ^(2) = ŷ^(3) = ŷ^(4) = g(0) = 0.5.
For w1:
w1 := w1 − α · (1/4) Σ_{i=1}^{4} (ŷ^(i) − y^(i)) x1^(i)
   = 0 − 0.1 · (1/4) [(0.5 − 1)·2 + (0.5 − 0)·3 + (0.5 − 1)·1 + (0.5 − 0)·4]
   = 0 − 0.1 · (1/4) [−1 + 1.5 − 0.5 + 2]
   = 0 − 0.1 · 0.5
   = −0.05.
For w2:
w2 := w2 − α · (1/4) Σ_{i=1}^{4} (ŷ^(i) − y^(i)) x2^(i)
   = 0 − 0.1 · (1/4) [(0.5 − 1)·1 + (0.5 − 0)·(−1) + (0.5 − 1)·2 + (0.5 − 0)·0]
   = 0 − 0.1 · (1/4) [−0.5 − 0.5 − 1 + 0]
   = 0 − 0.1 · (−0.5)
   = 0.05.
For b:
b := b − α · (1/4) Σ_{i=1}^{4} (ŷ^(i) − y^(i))
  = 0 − 0.1 · (1/4) [(0.5 − 1) + (0.5 − 0) + (0.5 − 1) + (0.5 − 0)]
  = 0 − 0.1 · (1/4) [−0.5 + 0.5 − 0.5 + 0.5]
  = 0 − 0.1 · 0
  = 0.
(c) The role of α (learning rate):
• Controls the step size in the gradient descent updates.
• Determines how quickly the model converges to the minimum of the cost function.
• A small α results in slow convergence, while a large α can cause divergence or overshoot.
(d) Potential issues with large α:
• Updates may overshoot the optimal solution.
• The parameters may oscillate around the minimum.
• Gradient descent might diverge, failing to converge to the minimum.
Exercise 5: Model Evaluation
Given the following confusion matrix:
              Predicted 0    Predicted 1
Actual 0          45              5
Actual 1          10             40
Answer the following:
(a) Calculate the Accuracy, Precision, Recall, and F1 score.
(b) Which metric matters most when false positives are costly?
Solution:
(a) With TP = 40, TN = 45, FP = 5, FN = 10:
Accuracy = (TP + TN) / Total = (40 + 45) / 100 = 0.85 (85%).
Precision = TP / (TP + FP) = 40 / (40 + 5) = 40/45 ≈ 0.89 (89%).
Recall = TP / (TP + FN) = 40 / (40 + 10) = 40/50 = 0.80 (80%).
F1 = 2 · (Precision · Recall) / (Precision + Recall) = 2 · (0.89 · 0.80) / (0.89 + 0.80) = 1.424 / 1.69 ≈ 0.84 (84%).
(b) Most important metric for costly false positives:
When false positives are costly, Precision is the most critical metric: high precision means that when the model predicts a positive outcome it is very likely correct, which minimizes the impact of false positives.
Exercise 6: Regularization
(a) Write the L2 regularized cost function
(b) Explain λ’s effect on overfitting
(c) Calculate regularization impact with λ = 1.5, w1 = 0.8
Solution:
(a) L2 regularized cost function:
J(w, b) = (1/m) Σ_{i=1}^{m} [ −y^(i) log(h(x^(i))) − (1 − y^(i)) log(1 − h(x^(i))) ] + (λ / (2m)) Σ_{j=1}^{n} wj²
(b) Effects of λ:
• Large λ: Stronger regularization, simpler model
• Small λ: Weaker regularization, more complex model
• λ = 0: No regularization (original model)
(c) For λ = 1.5, w1 = 0.8:
• Unregularized gradient term: ∂J/∂w1
• Regularization term: (λ/m) · w1 = (1.5/m) · 0.8 = 1.2/m
• Updated gradient: ∂J/∂w1 + (λ/m) · w1 = ∂J/∂w1 + 1.2/m