
Department of Electrical Engineering

School of Science and Engineering

EE514/CS535 Machine Learning

HOMEWORK 1 – SOLUTIONS
Due Date: 23:55, Tuesday, March 02, 2021 (Submit online on LMS)
Format: 7 problems, for a total of 100 marks
Contribution to Final Percentage: 2.5%
Instructions:

• Each student must submit his/her own hand-written assignment, scanned in a single PDF
document.

• You are allowed to collaborate with your peers but copying your colleague’s solution is
strictly prohibited. Anybody found guilty would be subjected to disciplinary action in
accordance with the university rules and regulations.
• Note: Vectors are written in lowercase bold in the homework; for your written
submission, kindly use an underline instead. In addition, use capital letters for matrices and
lowercase letters for scalars.

Problem 1 (20 marks)


(Note: Compile and submit screenshots of your plots for this question.)

Polynomial Regression - Polynomial regression is a form of regression analysis in which
the relationship between the independent variable and the dependent variable is modeled
as an n-th degree polynomial. The model equation that relates the input to the output is
of the form:
$$y_i(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \cdots + \theta_M x_i^M.$$

(a) [4 marks] Assuming we use an M-th degree polynomial for our regression analysis
and we have N outputs $y_1, y_2, \ldots, y_N$, express the regression model in the matrix form
$y = A\theta$.
(b) [8 marks] We will now implement the regression model on one feature variable using
programming tools. Use the following code to generate some toy data:
import pandas as pd
xdic = {'x': {11: 300, 12: 170, 13: 288, 14: 360, 15: 319, 16: 330,
              17: 520, 18: 345, 19: 399, 20: 479}}
ydic = {'y': {11: 305, 12: 270, 13: 360, 14: 370, 15: 379, 16: 400,
              17: 407, 18: 450, 19: 450, 20: 485}}
x = pd.DataFrame.from_dict(xdic)
y = pd.DataFrame.from_dict(ydic)
import numpy as np
x_seq = np.linspace(x.min(), x.max(), 300).reshape(-1, 1)

Next, use the scikit-learn library to implement polynomial regression of degree 2 on
the data. Finally, use the following code to display the data points and the regression
curve:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(x,y)
plt.plot(x_seq,polyreg.predict(x_seq),color="black")
plt.title("Polynomial regression with degree "+str(degree))
plt.show()

(c) [8 marks] Repeat the regression process for degrees 1 (i.e., linear regression), 3, 4,
and 5. What problem do you observe as the degree of the polynomial increases? (Hint:
think about the trained model’s accuracy when implemented on a test dataset.)

Solution:
(a)
$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
=
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^M \\
1 & x_2 & x_2^2 & \cdots & x_2^M \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^M
\end{bmatrix}
\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_M \end{bmatrix}
$$
(b) Code to implement polynomial regression:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Pipeline: expand x into polynomial features, then fit ordinary least squares.
degree = 2
polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression())
polyreg.fit(x, y)

Plot: [scatter of the data points with the fitted degree-2 regression curve; figure omitted]
(c) Plots: [regression curves for degrees 1, 3, 4, and 5; figures omitted]

As the degree of the polynomial increases, the regression curve overfits the training data,
which is likely to give poor accuracy on a test dataset drawn from the same distribution.
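A minimal sketch (not part of the required submission) of how these fits can be reproduced, reusing the x, y, and x_seq variables defined in part (b):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Fit and plot one model per requested degree, reusing x, y and x_seq from part (b).
for degree in [1, 3, 4, 5]:
    polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    polyreg.fit(x, y)
    plt.figure()
    plt.scatter(x, y)
    plt.plot(x_seq, polyreg.predict(x_seq), color="black")
    plt.title("Polynomial regression with degree " + str(degree))
    plt.show()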

Problem 2 (10 marks)


Least Squares Formulation - In this question we will derive the least squares solution
from the perspective of an optimization problem.
(a) [2 marks] First, let us formulate the problem as an optimization problem.
Consider the following system: the correct output is given by $y \in \mathbb{R}^n$. The measured
output $\hat{y} \in \mathbb{R}^n$ for a total of $m$ instances is given by $A = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m] \in \mathbb{R}^{n \times m}$
(where $n > m$). Using these $m$ instances, we want to determine the best approximation
of $y$ by taking a weighted sum of the $m$ measured outputs, where the weights
are given by the vector $\theta \in \mathbb{R}^m$.

The weight vector $\theta \in \mathbb{R}^m$ minimizes the mean-square error between the correct and
measured output. We can define this as an optimization problem as shown:
$$\theta^* = \underset{\theta}{\text{minimize}} \; f(\theta),$$
where $f(\theta)$ is the cost function which we aim to minimize. For this problem, write
down the expression for $f(\theta)$.
(b) [8 marks] Find the expression for $\theta^*$ that minimises the cost function $f(\theta)$ by computing
its gradient.
Note: You might find the following relationships useful for this proof:
$$\|a\|_2^2 = a^T a$$
$$(AB)^T = B^T A^T$$

You will also need to review derivatives of matrices and vectors.

This final expression is also known as the Left Pseudo-Inverse; verify your answer
by looking up the formula.

Solution:
(a) $f(\theta) = \|y - A\theta\|_2^2$

(b)
$$
\begin{aligned}
f(\theta) &= \|y - A\theta\|_2^2 \\
&= (y - A\theta)^T (y - A\theta) \\
&= (y^T - \theta^T A^T)(y - A\theta) \\
&= y^T y - y^T A\theta - \theta^T A^T y + \theta^T A^T A\theta
\end{aligned}
$$
Setting the gradient to zero:
$$
\begin{aligned}
\frac{d}{d\theta} f(\theta) &= \frac{d}{d\theta}\left( y^T y - y^T A\theta - \theta^T A^T y + \theta^T A^T A\theta \right) = 0 \\
-A^T y - A^T y + 2 A^T A\theta &= 0 \\
A^T A\theta &= A^T y \\
\theta^* &= (A^T A)^{-1} A^T y
\end{aligned}
$$
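As a sanity check (not required by the homework), the closed-form solution can be compared against NumPy's least-squares and pseudo-inverse routines; the matrix A and vector y below are synthetic, made up only for illustration:

import numpy as np

# Synthetic over-determined system with n > m, as assumed in the problem.
rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

# Closed-form solution theta* = (A^T A)^{-1} A^T y (left pseudo-inverse of A applied to y).
theta_closed = np.linalg.inv(A.T @ A) @ A.T @ y

# Reference answers from NumPy.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_pinv = np.linalg.pinv(A) @ y

print(np.allclose(theta_closed, theta_lstsq))  # True
print(np.allclose(theta_closed, theta_pinv))   # True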

Problem 3 (15 marks)


Multivariate Gradient Descent - You are presented with the following feature vector
X, the parameter vector θ, and the gold label y. The initial values of the parameter vector
are also given, and assume x0 to be the bias absorbed into the feature vector.

       
$$
X = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
  = \begin{bmatrix} 1 \\ 2 \\ 12 \\ 7 \\ 5 \end{bmatrix}, \qquad
\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \end{bmatrix}
       = \begin{bmatrix} 0.7 \\ 2.0 \\ -0.2 \\ 0.3 \\ 0.7 \end{bmatrix}, \qquad
y = 7
$$
Perform 3 iterations of batch gradient descent on the dataset. Assume learning rate α =
0.001 and round your answers to 4 decimal places. Show all your steps and the mathematical
equations that you use.

Solution: We use the following loss function for our calculations:
$$L(\theta) = \frac{1}{2}\|y - X\theta\|_2^2$$
Then we simultaneously update our parameters using the following equation:
$$\theta_i = \theta_i - \alpha \frac{\partial L}{\partial \theta_i}$$
The parameter vector over the 3 iterations is then:
$$
\theta^{(1)} = \begin{bmatrix} 0.6992 \\ 1.9984 \\ -0.2097 \\ 0.2943 \\ 0.6960 \end{bmatrix}, \quad
\theta^{(2)} = \begin{bmatrix} 0.6987 \\ 1.9974 \\ -0.2159 \\ 0.2907 \\ 0.6934 \end{bmatrix}, \quad
\theta^{(3)} = \begin{bmatrix} 0.6983 \\ 1.9967 \\ -0.2203 \\ 0.2881 \\ 0.6916 \end{bmatrix},
$$
and the cost $L(\theta)$ decreases from 0.8100 to 0.5180 to 0.3655.
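A short NumPy sketch of the same update rule is given below; the exact decimals produced depend on whether intermediate values are rounded (and on the loss convention used when reporting the cost), so they may differ slightly from the hand-computed table above:

import numpy as np

x = np.array([1.0, 2.0, 12.0, 7.0, 5.0])      # feature vector, bias absorbed as x0 = 1
theta = np.array([0.7, 2.0, -0.2, 0.3, 0.7])  # initial parameters
y = 7.0
alpha = 0.001

for t in range(1, 4):
    error = y - x @ theta              # residual of the single training example
    # Gradient of L(theta) = 1/2 * (y - x.theta)^2 w.r.t. theta_i is -(y - x.theta) * x_i,
    # so the simultaneous update is theta_i <- theta_i + alpha * error * x_i.
    theta = theta + alpha * error * x
    print("theta after iteration", t, ":", np.round(theta, 4))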

Problem 4 (10 marks)
Hamming Distance - For categorical data in k-NN classification, we often use the Hamming
distance as our distance metric. The table below gives the binary vectors for a set of
colours and their frequency in the training data:

Colour    Binary Vector (1 2 3 4)    No. in Train Data
White     0 0 0 1                    2
Pink      1 0 0 1                    1
Red       1 0 0 0                    2
Orange    1 1 0 0                    3
Yellow    0 1 0 0                    3
Green     0 1 1 0                    1
Blue      0 0 1 0                    1
Purple    1 0 1 0                    2

Table 1: Binary coded vectors for different colours

(a) [4 marks] Suppose you are provided an unknown colour, defined by the vector:
‘1011’. Run the k-NN algorithm with k = 3 using Hamming distance to predict
the colour. Show all of your working.
(b) [6 marks] We can also use Hamming distance to compute the error in the final prediction.
For the test data tabulated below, run the k-NN algorithm for k = 5 and calculate the
prediction error using Hamming distance.

Test Data Point    Binary Vector (1 2 3 4)    No. in Test Data
X                  0 0 1 1                    2
Y                  0 1 0 1                    1
Z                  0 1 1 0                    2

Table 2: Binary coded vectors for test data


Hint: Make sure to average out the total error across all the test data points!

Solution:
(a) The Hamming distances from the query '1011' to each training colour are listed in Table 3 below.
Finding the three nearest neighbours, k = 3: Pink: 1, Purple: 2
Predicted colour: Purple

Colour    Binary Vector (1 2 3 4)    No. in Train Data    Distance
White     0 0 0 1                    2                    2
Pink      1 0 0 1                    1                    1
Red       1 0 0 0                    2                    2
Orange    1 1 0 0                    3                    3
Yellow    0 1 0 0                    3                    4
Green     0 1 1 0                    1                    3
Blue      0 0 1 0                    1                    2
Purple    1 0 1 0                    2                    1

Table 3: Hamming distance between the test data point '1011' and all colours
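A small Python sketch of this procedure (the dictionary simply transcribes Table 1; the function names are illustrative): each colour is repeated according to its training frequency, Hamming distances are computed, and the k nearest copies vote.

from collections import Counter

# Training colours: (binary vector, frequency in the training data) from Table 1.
train = {
    "White":  ("0001", 2), "Pink":   ("1001", 1), "Red":    ("1000", 2),
    "Orange": ("1100", 3), "Yellow": ("0100", 3), "Green":  ("0110", 1),
    "Blue":   ("0010", 1), "Purple": ("1010", 2),
}

def hamming(a, b):
    # Number of positions at which two equal-length binary strings differ.
    return sum(ca != cb for ca, cb in zip(a, b))

def knn_predict(query, k=3):
    # Repeat each colour by its training frequency, sort by Hamming distance,
    # keep the k nearest copies, and take a majority vote.
    expanded = [(hamming(query, vec), name)
                for name, (vec, freq) in train.items() for _ in range(freq)]
    nearest = sorted(expanded)[:k]
    return Counter(name for _, name in nearest).most_common(1)[0][0]

print(knn_predict("1011", k=3))  # Purple: the 3 nearest copies are Pink (x1) and Purple (x2)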

(b) (Using the same working as the last part.)

Finding the five nearest neighbours, k = 5:

X : White: 2, Blue: 1, Pink: 1, Green: 1  =>  predicted White,  Error = 1 × 2 = 2
Y : Yellow: 3, White: 2                   =>  predicted Yellow, Error = 1 × 1 = 1
Z : Yellow: 3, Green: 1, Blue: 1          =>  predicted Yellow, Error = 1 × 2 = 2

(Also award marks if X is predicted as Purple, or for any other valid answer for any test point.)
Note: The error for each test point is the Hamming distance between its actual binary vector and the
predicted colour's binary vector, multiplied by the number of identical test points.
We can find the prediction error by summing these errors and dividing by the total number
of test points:
Prediction Error = 5/5 = 1
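A continuation of the earlier sketch computes this averaged error; it reuses the train, hamming, and knn_predict names from the part (a) sketch, and the test dictionary below transcribes Table 2:

# Test points: (binary vector, number of identical points in the test data) from Table 2.
test = {"X": ("0011", 2), "Y": ("0101", 1), "Z": ("0110", 2)}

total_error = 0
total_points = 0
for name, (vec, count) in test.items():
    pred = knn_predict(vec, k=5)                 # predicted colour for this test point
    err = hamming(vec, train[pred][0]) * count   # Hamming error times number of identical points
    total_error += err
    total_points += count

print(total_error / total_points)  # 1.0, matching the value computed above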

Problem 5 (15 marks)


(Note: Compile and submit screenshots of your plots for this question.)

Curse of Dimensionality - In this question we will look closely at the problem asso-
ciated with higher dimensions in hypercubes when using k-NN. During your lectures, you
learnt that as the dimensions increase, the neighbours in a constant vicinity decrease. Let
us now visualise this phenomenon through simulations.
The expression for average distance in a unit cube $[0, 1]^3$ is given by:
$$\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}\; dx_1\, dx_2\, dy_1\, dy_2\, dz_1\, dz_2$$
Solving this expression is really tedious, and as dimensions increase, it becomes harder
to solve and visualise as well. Instead, we will numerically calculate average distance by
plotting a frequency histogram of distances.

(a) [10 marks] You are required to build a function that repeatedly takes two random points
from an n-dimensional unit hypercube and calculates the Euclidean distance between them, for
i = 10000 iterations. You are given the following starter code:
import matplotlib.pyplot as plt
%matplotlib inline
import math
import random

def freq_plot(n_dim, iterations = 10000):

    dist = []
    # code required here

    plt.figure()
    plt.hist(dist, range = [0, math.sqrt(n_dim)], bins=100,
             density = True)
    title = 'n = ' + str(n_dim)
    plt.gca().set(title=title, xlabel='Distances', ylabel='Frequency')
    return

Use the random.random() function for generating a float between 0 and 1.


Attach a screenshot of your final working function code.
(b) [5 marks] Run the function from part (a) for n_dim = 2, 3, 5, 10, 100, 1000, and attach
screenshots of your plots.

Solution:
(a)

import matplotlib.pyplot as plt
%matplotlib inline
import math
import random
import numpy as np

def freq_plot(n_dim, iterations = 10000):

    dist = []
    for i in range(0, iterations):
        # Draw two points uniformly at random from the n-dimensional unit hypercube.
        a_vec = []
        b_vec = []
        for n in range(0, n_dim):
            a_vec.append(random.random())
            b_vec.append(random.random())
        # Euclidean distance between the two random points.
        dist.append(np.linalg.norm(np.array(a_vec) - np.array(b_vec)))

    plt.figure()
    plt.hist(dist, range = [0, math.sqrt(n_dim)], bins=100,
             density = True)
    title = 'n = ' + str(n_dim)
    plt.gca().set(title=title, xlabel='Distances', ylabel='Frequency')
    return

(b) Plots given in Figure 1.

Figure 1: Average distance in a unit hypercube for different dimensions [plots omitted]

Problem 6 (15 marks)
KD Trees - KD trees, among the most commonly used structures for accelerating k-NN, are
space-partitioning data structures that divide the feature set based on particular attributes,
thereby reducing the time complexity of the k-NN algorithm.
(a) [10 marks] The following table contains data points $x = (x_1, x_2)$ drawn from a distribution
over $\mathbb{R}^2$. Use these data points to create a KD tree. Begin your solution by splitting
on the $x_1$ feature, and then alternate between the two features at each level of the
tree when determining the next split. Each leaf node should hold 2-3 data points.
x1 1 3 12 7 8 13 2 21 5 17 31 4 47 19 1 9 14
x2 6 8 9 21 23 7 12 9 1 3 16 20 10 11 4 31 39
(b) [5 marks] What major issue can you observe when using KD trees to classify a test
data point?

Solution:
(a) [KD tree diagram omitted. An alternative valid tree has also been accepted for this question.]
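As the diagram is not reproduced, the rough sketch below shows one way such a tree can be built programmatically: split at the median of the current axis, alternate axes at each level, and stop once a node holds 2-3 points. The exact tree depends on the median and tie-breaking conventions, so a correctly drawn hand solution may differ.

# Data points (x1, x2) from the problem statement.
points = [(1, 6), (3, 8), (12, 9), (7, 21), (8, 23), (13, 7), (2, 12), (21, 9),
          (5, 1), (17, 3), (31, 16), (4, 20), (47, 10), (19, 11), (1, 4), (9, 31),
          (14, 39)]

def build_kd(pts, axis=0, depth=0):
    # Stop splitting once a node holds 2-3 points (leaf).
    if len(pts) <= 3:
        print("  " * depth + "leaf: " + str(pts))
        return
    # Sort along the current axis and split at the median point.
    pts = sorted(pts, key=lambda p: p[axis])
    mid = len(pts) // 2
    print("  " * depth + "split on x" + str(axis + 1) + " at " + str(pts[mid][axis]))
    build_kd(pts[:mid], 1 - axis, depth + 1)   # points below the median
    build_kd(pts[mid:], 1 - axis, depth + 1)   # points at/above the median

build_kd(points)  # first split is on x1, as required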

(b) While using a KD tree to classify a data point is faster than the conventional k-NN algorithm,
the closest neighbour may lie in another branch of the tree and be missed by the search, so the
likelihood of misclassification increases.

Problem 7 (15 marks)


Principal Component Analysis - In class we studied PCA as a means of reducing dimensionality
by minimizing the squared error between the actual values and their predictions in the lower
dimension (also known as projections onto a range space). In this question, we will
prove that this problem is equivalent to maximizing the squared length of these projections,
i.e., maximizing the variance of the projected data.
Let us define the original problem from the class lecture. Consider $n$ feature vectors of
the form $x \in \mathbb{R}^d$. Using only $k$ basis vectors, we want to project $x$ into a new space of lower
dimensionality, from $\mathbb{R}^d$ to $\mathbb{R}^k$:
$$z = W^T x,$$
where $W$ is the orthogonal mapping matrix of size $d \times k$. To obtain an approximation of $x$
we use the following reconstruction:
$$\hat{x} = W z,$$
which is equivalent to:
$$\hat{x} = W W^T x = P x,$$
where $P = W W^T$ is the projection operator, which is idempotent and symmetric:
$P^2 = P = P^T$.
Our objective is to minimize the sum of squared errors during reconstruction, which we can
write as follows:
$$\text{minimize} \; \|x - Px\|_2^2$$

(a) [10 marks] Show that this expression can be reduced to:
$$\text{minimize} \; \left[ \|x\|_2^2 - \|Px\|_2^2 \right]$$
Note: You might find the following relationships useful for this proof:
$$\|a\|_2^2 = a^T a$$
$$(AB)^T = B^T A^T$$
(b) [5 marks] Use the final expression from part (a) and the Pythagorean theorem on
$\|x\|_2^2$ to show that this problem is equivalent to maximizing the variance of the projected
data. You may draw a diagram to support your argument.

Solution:
(a)
$$
\begin{aligned}
\|x - Px\|_2^2 &= (x - Px)^T (x - Px) \\
&= x^T x - 2 x^T P x + x^T P^T P x \\
&= x^T x - 2 x^T P^T P x + x^T P^T P x \qquad (\text{since } P = P^T P) \\
&= x^T x - x^T P^T P x \\
&= \|x\|_2^2 - \|Px\|_2^2
\end{aligned}
$$
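This identity can also be checked numerically (not part of the required proof); the sketch below builds an orthonormal W via a QR decomposition so that P = W W^T is a valid projection, with the dimensions chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2

# Orthonormal W (d x k) from a QR decomposition, so P = W W^T satisfies P^2 = P = P^T.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))
P = W @ W.T

x = rng.normal(size=d)
lhs = np.linalg.norm(x - P @ x) ** 2
rhs = np.linalg.norm(x) ** 2 - np.linalg.norm(P @ x) ** 2

print(np.isclose(lhs, rhs))  # True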

(b) We can use the Pythagorean theorem on $\|x\|_2^2$ as shown:
$$\|x\|_2^2 = \|Px\|_2^2 + \|(I - P)x\|_2^2,$$
where the first term is the variance of the data and is fixed, the second term is the captured
variance, which we want to maximize, and the third term is the reconstruction error, which we
want to minimize.
We can also draw a right triangle and label it as follows:

[Right-triangle diagram omitted: the hypotenuse has squared length $\|x\|_2^2$, and the two legs have
squared lengths $\|Px\|_2^2$ and $\|(I - P)x\|_2^2$.]

It is easy to see that minimizing the reconstruction error is the same as maximizing the
captured variance of data.

— End of Homework —
