MACHINE LEARNING HOMEWORK 1 SOLUTIONS
HARISH BALAJI BOOMINATHAN (hb2917)
September 16, 2024
Problem 1:
Goal: To find a one-dimensional function that takes a scalar input and outputs a scalar, $f : \mathbb{R} \to \mathbb{R}$.
Form of the function:
$$ f(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d $$
where $d$ is the degree of the polynomial.
The empirical risk is given by
$$ R_{\mathrm{emp}}(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - f(x_i;\theta) \big)^2 $$
On partially differentiating with respect to $\theta$ and equating to $0$, we get:
$$ \theta^{*} = (X^{T}X)^{-1}X^{T}Y $$
After splitting the dataset into two random halves, we compute $\theta^{*}$ on one half and evaluate this regression model for $d$ ranging from 1 to 100; a sketch of this procedure is given below.
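The following is a minimal MATLAB sketch of the procedure just described, assuming the data are already loaded as N-by-1 column vectors x and y; the variable names and the use of pinv (to guard against the nearly singular $X^{T}X$ that arises at high degrees) are illustrative choices, not the exact submitted code.

```matlab
% Sketch: closed-form polynomial least squares on a random half/half split,
% evaluated for degrees d = 1..100. Assumes x and y are N-by-1 column vectors.
N = numel(x);
idx = randperm(N);
trIdx = idx(1:floor(N/2));          teIdx = idx(floor(N/2)+1:end);
xTr = x(trIdx);  yTr = y(trIdx);    xTe = x(teIdx);  yTe = y(teIdx);

maxDeg = 100;
trainErr = zeros(maxDeg, 1);
testErr  = zeros(maxDeg, 1);
for d = 1:maxDeg
    XTr = xTr .^ (0:d);                          % design matrix [1, x, x.^2, ..., x.^d]
    XTe = xTe .^ (0:d);
    theta = pinv(XTr' * XTr) * (XTr' * yTr);     % closed-form theta* = (X'X)^-1 X'Y
    trainErr(d) = mean((yTr - XTr * theta).^2) / 2;   % empirical risk on the training half
    testErr(d)  = mean((yTe - XTe * theta).^2) / 2;   % empirical risk on the held-out half
end
plot(1:maxDeg, trainErr, 1:maxDeg, testErr);
xlabel('degree d'); ylabel('empirical risk'); legend('train error', 'test error');
```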
Figure: fitted regression model for $d = 100$.
When we use cross-validation and plot the training error against the test error, we see that the test error attains its minimum at degree $d = 9$, with an error value of 1.085044e+21.
As the graph shows, the testing error reaches its minimum at degree 9 and increases afterwards, while the training error keeps decreasing and stays constant once it reaches 0. This shows that the model starts to overfit as the degree increases, which is why the testing error grows rapidly.
Problem 2:
Goal: To build a multivariate regression function $f : \mathbb{R}^{100} \to \mathbb{R}$, where the basis functions are of the form
$$ f(x;\theta) = \sum_{i=1}^{k} \theta_i x_i $$
We use an $\ell_2$ penalty to regularize the model and reduce the risk of overfitting.
$$ R_{\mathrm{reg}}(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - f(x_i;\theta) \big)^2 + \frac{\lambda}{2N} \|\theta\|^2 $$
When we partially differentiate the risk with respect to $\theta$ and equate it to $0$, we get the model with minimum risk:
$$ \theta^{*} = (X^{T}X + \lambda I)^{-1}X^{T}Y $$
On applying two-fold cross-validation and plotting the error for values of $\lambda$ in the range 0 to 1000, we obtain the following.
We can see that the test error is minimum for $\lambda = 422$. As $\lambda$ increases further, we can also observe that the values in $\theta$ shrink and the model slowly starts to underfit; hence both the training error and the testing error increase.
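A minimal MATLAB sketch of this two-fold cross-validation sweep over $\lambda$ is given below. The variable names X (an N-by-100 design matrix) and y, and the integer grid of $\lambda$ values, are assumptions rather than the exact code behind the reported numbers; the backslash solve is used in place of an explicit matrix inverse but computes the same $\theta^{*}$.

```matlab
% Sketch: ridge regression, theta* = (X'X + lambda*I)^-1 X'Y,
% with two-fold cross-validation over lambda in 0..1000.
% Assumes X is N-by-100 and y is N-by-1.
N = size(X, 1);
idx = randperm(N);
A = idx(1:floor(N/2));  B = idx(floor(N/2)+1:end);
lambdas = 0:1000;
cvErr = zeros(numel(lambdas), 1);

ridgeFit = @(Xt, yt, lam) (Xt' * Xt + lam * eye(size(Xt, 2))) \ (Xt' * yt);
halfMse  = @(Xv, yv, th) mean((yv - Xv * th).^2) / 2;

for j = 1:numel(lambdas)
    lam = lambdas(j);
    thA = ridgeFit(X(A,:), y(A), lam);   % train on fold A, test on fold B
    thB = ridgeFit(X(B,:), y(B), lam);   % train on fold B, test on fold A
    cvErr(j) = (halfMse(X(B,:), y(B), thA) + halfMse(X(A,:), y(A), thB)) / 2;
end
[~, best] = min(cvErr);
fprintf('minimum test error at lambda = %d\n', lambdas(best));
```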
Problem 3:
(i) To prove:
For $g(z) = \frac{1}{1+e^{-z}}$, the identity $g(-z) = 1 - g(z)$ holds.
Proof:
$$ g(z) = \frac{1}{1+e^{-z}} \quad \rightarrow (1) $$
Then,
$$ g(-z) = \frac{1}{1+e^{z}} $$
On multiplying and dividing by $e^{-z}$, we get
$$ g(-z) = \frac{e^{-z}}{e^{-z}+1} $$
On adding and subtracting 1 in the numerator,
$$ g(-z) = \frac{1+e^{-z}-1}{1+e^{-z}} $$
$$ g(-z) = \frac{1+e^{-z}}{1+e^{-z}} - \frac{1}{1+e^{-z}} $$
$$ g(-z) = 1 - \frac{1}{1+e^{-z}} $$
Using equation (1):
$$ g(-z) = 1 - g(z) $$
Hence proved.
(ii) To prove:
$$ g^{-1}(y) = \ln\left(\frac{y}{1-y}\right) $$
Proof:
We know that $g(g^{-1}(y)) = y$. Therefore,
$$ g(g^{-1}(y)) = \frac{1}{1+e^{-g^{-1}(y)}} = y $$
$$ \frac{1}{y} = 1 + e^{-g^{-1}(y)} $$
$$ \frac{1}{y} - 1 = e^{-g^{-1}(y)} $$
$$ \frac{1-y}{y} = e^{-g^{-1}(y)} $$
On taking the natural logarithm on both sides we have,
$$ \ln\left(\frac{1-y}{y}\right) = \ln\left(e^{-g^{-1}(y)}\right) $$
Using the properties of logarithms,
$$ \ln\left(\frac{1-y}{y}\right) = -g^{-1}(y) $$
Therefore,
$$ g^{-1}(y) = -\ln\left(\frac{1-y}{y}\right) $$
Again using the properties of logarithms,
$$ g^{-1}(y) = \ln\left(\frac{y}{1-y}\right) $$
Hence proved.
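Both identities can also be checked numerically. The short MATLAB snippet below (a sanity check, not part of the required proof) evaluates them on a grid of points; both residuals come out close to machine precision.

```matlab
% Numerical sanity check of the two identities proved above.
g    = @(z) 1 ./ (1 + exp(-z));     % logistic function g(z)
ginv = @(y) log(y ./ (1 - y));      % claimed inverse g^-1(y)

z = linspace(-5, 5, 101);
fprintf('max |g(-z) - (1 - g(z))| = %g\n', max(abs(g(-z) - (1 - g(z)))));
fprintf('max |g^-1(g(z)) - z|     = %g\n', max(abs(ginv(g(z)) - z)));
```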
Problem 4:
Goal:
To implement a linear logistic regression algorithm for binary classification in MATLAB using gradient descent.
Classification function:
$$ f(x;\theta) = \big(1 + \exp(-\theta^{T}x)\big)^{-1} $$
that minimizes the empirical risk with the logistic loss
$$ R_{\mathrm{emp}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - 1)\log\big(1 - f(x_i;\theta)\big) - y_i \log\big(f(x_i;\theta)\big) $$
$$ R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} (1 - y_i)\log\big(1 - f(x_i;\theta)\big) + y_i \log\big(f(x_i;\theta)\big) $$
We know that
$$ f(x_i;\theta) = g(\theta^{T}x_i) \quad\text{and}\quad \frac{\partial}{\partial\theta}\, g(\theta^{T}x_i) = g(\theta^{T}x_i)\big(1 - g(\theta^{T}x_i)\big)\,x_i \quad \rightarrow (1) $$
Therefore,
$$ R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} (1 - y_i)\log\big(1 - g(\theta^{T}x_i)\big) + y_i \log\big(g(\theta^{T}x_i)\big) $$
Hence the gradient is,
$$ \frac{\partial}{\partial\theta} R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} (1 - y_i)\left(\frac{1}{1 - g(\theta^{T}x_i)}\right)\left(-\frac{\partial}{\partial\theta} g(\theta^{T}x_i)\right) + y_i \left(\frac{1}{g(\theta^{T}x_i)}\right)\left(\frac{\partial}{\partial\theta} g(\theta^{T}x_i)\right) $$
$$ \frac{\partial}{\partial\theta} R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \left(\frac{1}{g(\theta^{T}x_i)}\right) - (1 - y_i)\left(\frac{1}{1 - g(\theta^{T}x_i)}\right) \right) \frac{\partial}{\partial\theta} g(\theta^{T}x_i) $$
Using equation (1):
$$ \frac{\partial}{\partial\theta} R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \left(\frac{1}{g(\theta^{T}x_i)}\right) - (1 - y_i)\left(\frac{1}{1 - g(\theta^{T}x_i)}\right) \right) g(\theta^{T}x_i)\big(1 - g(\theta^{T}x_i)\big)\,x_i $$
$$ \frac{\partial}{\partial\theta} R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Big( y_i \big(1 - g(\theta^{T}x_i)\big) - (1 - y_i)\,g(\theta^{T}x_i) \Big)\,x_i $$
$$ \frac{\partial}{\partial\theta} R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \big( y_i - y_i\,g(\theta^{T}x_i) - g(\theta^{T}x_i) + y_i\,g(\theta^{T}x_i) \big)\,x_i $$
$$ \frac{\partial}{\partial\theta} R_{\mathrm{emp}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \big( y_i - g(\theta^{T}x_i) \big)\,x_i $$
We can apply batch gradient descent to obtain $\theta^{*}$:
$$ \theta^{(t+1)} = \theta^{(t)} - \eta\,\frac{\partial}{\partial\theta} R_{\mathrm{emp}}\big(\theta^{(t)}\big) $$
Here, $\eta$ is the step size, and the iterations can be stopped when the descent step is negligible in size, i.e. when it is less than a tolerance $\varepsilon$; both $\varepsilon$ and $\eta$ are hyperparameters.
We initialize $\theta^{(0)}$ to a small random vector.
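A minimal MATLAB sketch of this batch gradient descent loop is shown below. The function name, the iteration cap maxIter, and the assumption that X carries a leading column of ones and y takes values in {0, 1} are illustrative choices, not the exact submitted code.

```matlab
% Sketch: batch gradient descent for linear logistic regression.
% X: N-by-(d+1) design matrix (first column all ones), y: N-by-1 labels in {0,1}.
% eta: step size, tol: tolerance epsilon, maxIter: safety cap on iterations.
function [theta, t] = logisticGD(X, y, eta, tol, maxIter)
    N = size(X, 1);
    theta = 0.01 * randn(size(X, 2), 1);        % small random initialization
    for t = 1:maxIter
        p = 1 ./ (1 + exp(-X * theta));         % f(x_i; theta) for every sample
        grad = -(X' * (y - p)) / N;             % gradient derived above
        step = eta * grad;
        theta = theta - step;                   % batch update
        if norm(step) < tol                     % stop when the descent step is negligible
            break;
        end
    end
end
```

With this sketch, the runs reported below would correspond to calls such as logisticGD(X, y, 2, 0.01, 50000), with a sample classified as 1 whenever $f(x;\theta) \geq 0.5$ (the usual threshold, assumed here).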
For $\eta = 3$ and $\varepsilon = 0.1$, the model produces a 12.5% binary classification error (87.5% accuracy) and takes 19 iterations to converge.
For $\eta = 2$ and $\varepsilon = 0.03$, the model produces a 10.5% binary classification error (89.5% accuracy) and takes 161 iterations to converge.
For $\eta = 2$ and $\varepsilon = 0.01$, the model produces a 2% binary classification error (98% accuracy) and takes 1050 iterations to converge.
For $\eta = 2$ and $\varepsilon = 0.001$, the model produces a 0% binary classification error (100% accuracy) and takes 27,784 iterations to converge.
This model performs excellently, with 0% error and 100% accuracy, but it takes 27,784 iterations and converges many iterations after it first reaches zero error. So we can increase the step size and the tolerance so that the model reaches convergence and attains zero error at just the right time.
For $\eta = 4$ and $\varepsilon = 0.0025$, the model produces a 0% binary classification error (100% accuracy) and takes just 10,292 iterations to converge.
Here the model performs with 100% accuracy and 0% error, converges right after the binary classification error reaches zero, and runs in the least amount of time.