Profs. Nicolas Flammarion and Martin Jaggi


Machine Learning – CS-433 - IC
20.01.2022 from 08h15 to 11h15 in STCC
Duration: 180 minutes
Student One
SCIPER : 111111

Do not turn the page before the start of the exam. This document is double-sided and has 20
pages; the last ones are possibly blank. Do not unstaple.

• This is a closed book exam. No electronic devices of any kind.
• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet (hand-
written or 11pt min font size) if you have one; place all other personal items below your desk.
• You each have a different exam.
• Only answers in this booklet count. No extra loose answer sheets. You can use the last two pages
as scrap paper.
• For the multiple choice questions, we give:
+2 points if your answer is correct,
0 points if you give no answer or more than one,
−0.5 points if your answer is incorrect.
• For the true/false questions, we give:
+1 point if your answer is correct,
0 points if you give no answer or more than one,
−1 point if your answer is incorrect.
• Use a black or dark blue ballpoint pen and clearly erase with correction fluid if necessary.
• If a question turns out to be wrong or ambiguous, we may decide to nullify it.


First part: multiple choice questions


For each question, mark the box corresponding to the correct answer. Each question has exactly one
correct answer.

Robustness to outliers
We consider a classification problem on linearly separable data. Our dataset contains an outlier: a point that is
very far from the other data points (and also far from the SVM margin, but still correctly classified by the SVM
classifier). We trained SVM, logistic regression, and 1-nearest-neighbour models on this dataset. We tested the
trained models on a test set that comes from the same distribution as the training set but does not contain any
outlier points. After that, we removed the outlier and retrained our models.
Question 1 After retraining, which classifier will change its decision boundary around the test points?

Logistic regression
SVM
1-nearest-neighbors classifier
All of them

Solution: The solution of the SVM problem depends only on the support vectors, and the outlier is not a
support vector, as stated in the problem. Thus the SVM decision boundary will not change after retraining.
For the 1-NN classifier, decisions are based on the nearest data points, which are not the outlier since the
outlier is very far from all other data points. Thus the decisions of the 1-NN classifier will also not change.
The logistic regression solution depends on all of the data points, so removing one point will change the
decision boundary.

SVM
Question 2 For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm. The hard-
margin SVM problem for linearly separable points in $\mathbb{R}^D$ is to minimize the Euclidean norm $\|w\|_2$ under
some constraints. What are the additional constraints for this optimization problem?

$w^\top x_n \geq 1 \;\; \forall n \in \{1, \cdots, N\}$
$\frac{y_n}{w^\top x_n} \geq 1 \;\; \forall n \in \{1, \cdots, N\}$
$y_n w^\top x_n \geq 1 \;\; \forall n \in \{1, \cdots, N\}$
$y_n + w^\top x_n \geq 1 \;\; \forall n \in \{1, \cdots, N\}$


Cross Validation
Question 3 Consider the K-fold cross validation on a linear regression model with a sufficiently large
amount of training data. When K is large, the computational complexity of the K-fold cross validation with
respect to K is of order

O(1/K).
O(K).
O(1).
O(K(K − 1)).

Solution: Because each training process uses (K − 1)/K = O(1) fraction of the data, and there are K such
processes.
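
To make the counting argument concrete, here is a minimal NumPy sketch of K-fold cross-validation for least squares (not part of the exam; the dataset and the helper name are made up): the loop fits K models, each on a (K − 1)/K fraction of the N samples, so the total work grows as O(K).

```python
# Minimal K-fold cross-validation for least squares (illustrative sketch).
import numpy as np

def kfold_cv_mse(X, y, K):
    N = len(y)
    folds = np.array_split(np.arange(N), K)
    errors = []
    for k in range(K):                                   # K fits in total
        val_idx = folds[k]
        train_idx = np.setdiff1d(np.arange(N), val_idx)  # (K-1)/K of the data
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residual = y[val_idx] - X[val_idx] @ w
        errors.append(np.mean(residual ** 2) / 2)
    return np.mean(errors)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
print(kfold_cv_mse(X, y, K=10))  # total work grows linearly with K
```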

Bias-Variance Decomposition
Question 4 Consider a regression model where data $(x, y)$ is generated by an input $x$ sampled uniformly at
random from $[0, 1]$ and $y(x) = x^2 + \varepsilon$, where $\varepsilon$ is random noise with mean 0 and variance 1. Two models
are fit for the regression: model A is a trained quadratic function $g(x; w) = w_2 x^2 + w_1 x + w_0$ where
$w = (w_0, w_1, w_2)^\top \in \mathbb{R}^3$, and model B is the constant function $h(x) = 1/2$. Then compared to model B, model
A has

higher bias, higher variance.
lower bias, higher variance.
higher bias, lower variance.
lower bias, lower variance.

Solution: Model B has zero variance because it outputs the constant 1/2, which does not depend on the training
data. Model A, a trained quadratic, can represent the true function $x^2$, so it has lower bias but a nonzero variance
coming from fitting the noisy training data.
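
As a quick illustration (not part of the exam), the following Monte Carlo sketch, with an assumed training-set size and number of repetitions, estimates the bias and variance of both models under the stated data distribution.

```python
# Monte Carlo estimate of bias and variance for model A (fitted quadratic)
# and model B (constant 1/2) under y = x² + ε, ε ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_repeats = 50, 2000
x_test = np.linspace(0, 1, 101)
true_f = x_test ** 2

preds_A = np.empty((n_repeats, x_test.size))
for r in range(n_repeats):
    x = rng.uniform(0, 1, n_train)
    y = x ** 2 + rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, deg=2)            # model A: least-squares quadratic
    preds_A[r] = np.polyval(coeffs, x_test)

bias2_A = np.mean((preds_A.mean(axis=0) - true_f) ** 2)
var_A = np.mean(preds_A.var(axis=0))
bias2_B = np.mean((0.5 - true_f) ** 2)          # model B: constant prediction, zero variance
print(f"model A: bias² ≈ {bias2_A:.3f}, variance ≈ {var_A:.3f}")
print(f"model B: bias² ≈ {bias2_B:.3f}, variance = 0")
# Expected: model A has (much) lower bias but higher variance than model B.
```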

Optimization
Question 5 Let $n$ be an integer such that $n \geq 2$, let $A \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^n$, and consider the function
$f(x) = x^\top A x$ defined over $\mathbb{R}^n$. Which of the following is the gradient of the function $f$?

$A^\top x + Ax$
$2A^\top x$
$2x^\top A$
$2Ax$

Solution: $\nabla f(x) = A^\top x + Ax$. See lab 11, exercise 1; here the matrix $A$ is not symmetric.
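
A small numerical sanity check of this gradient (not part of the exam), using finite differences in NumPy on a deliberately non-symmetric A:

```python
# Check that the gradient of f(x) = xᵀAx is Aᵀx + Ax for a non-symmetric A.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))        # deliberately not symmetric
x = rng.normal(size=n)

f = lambda v: v @ A @ v
analytic = A.T @ x + A @ x

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```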


Question 6 Which of the following functions reaches a global maximum on the set I? (Note that [., .]
and (., .) denote closed and open intervals respectively)

• $f_1(x) = -x^4$, $I = [-5, 5]$

• $f_2(x) = \arccos(x)$, $I = (-1, 1)$

• $f_3(x) = x \exp(-x)$, $I = (-\infty, 0)$

• $f_4(x) = \sin(\cos(x)) \sin(x)$, $I = \mathbb{R}_+$

f1 , f4
f1 , f3 , f4
f1 , f2 , f3 , f4
f1 , f2 , f4

Solution:

• $f_1$: Yes, computing the first and second derivatives shows that the maximum is reached at $x = 0$. Also
note that any continuous function on a compact set reaches its maximum and minimum.

• $f_2$: No. We can compute the first derivative starting from $\cos(\arccos(x)) = x$:
$-\sin(\arccos(x)) \arccos'(x) = 1$
$\arccos'(x) = -\frac{1}{\sin(\arccos(x))}$
$\arccos'(x) = -\frac{1}{\sqrt{1 - \cos(\arccos(x))^2}}$
$\arccos'(x) = -\frac{1}{\sqrt{1 - x^2}}$
The derivative of $f_2$ can never be equal to 0 on $I$, so no maximum is attained on the open interval.

• $f_3$: No. The first derivative is $-e^{-x}(x - 1)$, which is positive on $I$, so $f_3$ is strictly increasing and no
maximum is attained on the open interval $I$.

• $f_4$: Yes. Since $f_4$ is the product of the periodic functions $\sin(\cos(x))$ and $\sin(x)$, it is also periodic. Since
$f_4$ is periodic on $I$, which has infinite measure, it reaches its maximum infinitely many times.

• $f_5$: Yes. Computing the gradient of $f_5$, we find $\nabla f_5(x_1, x_2) = (-\operatorname{sgn}(x_1) - 6x_2, -10x_2)$, which is equal
to $(0, 0)$ at $(x_1, x_2) = (0, 0)$. Then computing the Hessian, we find that it is a diagonal matrix filled
with negative values, hence $f_5$ does reach a maximum at $(0, 0)$.


Linear Models
Question 7 Consider a linear regression model on a dataset which we split into a training set and a test
set. After training, our model gives a mean-squared error of 0.1 on the training set and a mean-squared
error of 5.3 on the test set. Recall that the mean-squared error (MSE) is given by:
$$\mathrm{MSE}_w(y, X) = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - x_n^\top w \right)^2$$

Which of the following statements is correct ?

Ridge regression can help reduce the gap between the training MSE and the test MSE.
Retraining the model with feature augmentation (e.g. adding polynomial features) will increase the
training MSE.
Retraining while discarding some training samples will likely reduce the gap between the train MSE
and the test MSE.
Using cross-validation can help decrease the training MSE of this very model.

Solution:

• Feature augmentation: Incorrect; feature augmentation will increase overfitting, hence decrease
the training MSE even more.

• Cross-validation: Incorrect; cross-validation can help select a model that overfits less, but it does not
help to get better performance on this specific model.

• Discarding some training samples: Incorrect; reducing the number of training samples is more likely
to increase the overfitting.

• Ridge regression: Correct; the regularization from ridge regression can be useful to reduce overfitting.

Regularization
Question 8 Which of the following statements is incorrect ?

Training a model with L1 -regularization ...

can reduce the storage cost of the final model.


is used to help escaping local minima during training.
can reduce overfitting.
can be named Lasso regression when in combination with an MSE loss function and a linear model.

Solution:

• reduce the storage: is true, the parameters tend to be sparse when using L1 -regularization, thus
parameters with 0 value do not need to be stored.

• escaping local minima: is false (so check this box), L1 -regularization has nothing to do with the
optimization of the model.

• reduce overfitting: is true, it is a regularization technique, hence it reduces the complexity of the
model.

• Lasso: is true, see the course.


Neural networks
Let $f : \mathbb{R}^D \to \mathbb{R}$ be an $L$-hidden-layer multi-layer perceptron (MLP) such that

$$f(x) = \sigma_{L+1}\left( w^\top \sigma_L\left( W_L \, \sigma_{L-1}\left( W_{L-1} \ldots \sigma_1(W_1 x) \right) \right) \right),$$

with $w \in \mathbb{R}^M$, $W_1 \in \mathbb{R}^{M \times D}$ and $W_\ell \in \mathbb{R}^{M \times M}$ for $\ell = 2, \ldots, L$, where $\sigma_i$ for $i = 1, \ldots, L+1$ is an entry-wise
activation function. For any MLP $f$ and a classification threshold $\tau$, let $C_{f,\tau}$ be a binary classifier that
outputs YES for a given input $x$ if $f(x) \leq \tau$ and NO otherwise.
Question 9 Assume $\sigma_{L+1}$ is the element-wise sigmoid function and $C_{f,\frac{1}{2}}$ is able to obtain a high accuracy
on a given binary classification task $T$. Let $g$ be the MLP obtained by multiplying the parameters in the
last layer of $f$, i.e. $w$, by 2. Moreover, let $h$ be the MLP obtained by replacing $\sigma_{L+1}$ with the element-wise
ReLU. Finally, let $q$ be the MLP obtained by doing both of these actions. Which of the following is true?

$$\mathrm{ReLU}(x) = \max\{x, 0\} \qquad \mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

$C_{h,0}$ may have an accuracy significantly lower than $C_{f,\frac{1}{2}}$ on $T$
$C_{g,\frac{1}{2}}$, $C_{h,0}$, and $C_{q,0}$ have the same accuracy as $C_{f,\frac{1}{2}}$ on $T$
$C_{g,\frac{1}{2}}$ may have an accuracy significantly lower than $C_{f,\frac{1}{2}}$ on $T$
$C_{q,0}$ may have an accuracy significantly lower than $C_{f,\frac{1}{2}}$ on $T$
Solution: Thresholding the sigmoid output at $\frac{1}{2}$ is equivalent to thresholding the input of the last activation
function at 0, and the ReLU output crosses 0 exactly when its input does, so $C_{h,0}$ makes the same predictions as $C_{f,\frac{1}{2}}$.
Moreover, multiplying the last-layer weights $w$ by 2 does not change the sign of the pre-activation. Therefore
$C_{g,\frac{1}{2}}$ and $C_{q,0}$ also make the same predictions, and all three have the same accuracy as $C_{f,\frac{1}{2}}$ on $T$.

Question 10 Assume the weights in $f$ were obtained by initializing all parameters to zero and running
SGD. Assume there exists $\tau$ such that $C_{f,\tau}$ is a good classifier. Let $g$ be the MLP obtained by randomly
keeping one neuron per layer and removing the other $M - 1$ neurons from each layer, and multiplying all
weights except for the first layer by $M$ (the number of neurons in each layer in the original network). What
is the probability that $C_{g,\tau}$ is a good classifier?

Less than $\frac{1}{M^L}$
Between $\frac{1}{M^L}$ and $\frac{1}{M}$
$1$
Not possible to determine with the given information.

Solution: When all weights are initialized to zero, the value for all hidden neurons will be the same and
remain the same. As such, keeping one neuron in each layer and multiplying its output by the number of
neurons will keep the output unchanged. Therefore g and f are equal.


Question 11 Which of the following techniques do not improve the generalization performance in deep
learning?

None. All techniques here improve generalization.


Dropout
Data augmentation
Tuning the optimizer
L2 regularization

Solution: Once correct hyperparameters are chosen, any of these strategies may be applied to improve
the generalization performance.

Adversarial Examples
Consider a linear model $\hat{y} = x^\top w$ with the squared loss under an $\ell_\infty$-bounded adversarial perturbation. For
a single point $(x, y)$, it corresponds to the following objective:

$$\max_{\tilde{x}:\, \|x - \tilde{x}\|_\infty \leq \varepsilon} \left( y - \tilde{x}^\top w \right)^2, \tag{OP}$$

where the constraint $\|x - \tilde{x}\|_\infty \leq \varepsilon$ is expressed in the $\ell_\infty$-norm, i.e. $|x_i - \tilde{x}_i| \leq \varepsilon$ for every $i$.

Question 12 Assume that $w = (3, -2)^\top$, $x = (-1, 2)^\top$, $y = 2$. What is the maximum value of the
optimization problem in Eq. (OP)?

$(9 + 5\varepsilon)^2$
$(3 + 10\varepsilon)^2$
$(10 - \varepsilon)^2$
Other
$(5 + 9\varepsilon)^2$

Solution: First, it is convenient to reparametrize the objective in terms of an additive perturbation $\delta$:
$\max_{\delta:\|\delta\|_\infty \leq \varepsilon} \left( y - x^\top w - \delta^\top w \right)^2$. If we plug in the given values of $w$, $x$, $y$, we get
$\max_{\delta:\|\delta\|_\infty \leq \varepsilon} \left( 9 - 3\delta_1 + 2\delta_2 \right)^2$. We can maximize this objective independently over $\delta_1$ and $\delta_2$ by noting
that the optimal value is attained at a boundary of the feasible set, i.e. for $|\delta_1| = |\delta_2| = \varepsilon$. This leads to the
maximizer $\delta^\star = (-\varepsilon, \varepsilon)^\top$ and the maximum value $(9 + 5\varepsilon)^2$.

Question 13 Assume that $w = (3, -2)^\top$, $x = (-1, 2)^\top$, $y = 2$. What is the optimal $\tilde{x}^\star$ that maximizes
the objective in Eq. (OP)?

$(-1 + \varepsilon, 2)^\top$
$(-1 - \varepsilon, 2 - \varepsilon)^\top$
$(-1 - \varepsilon, 2)^\top$
$(-1 + \varepsilon, 2 + \varepsilon)^\top$
Other

Solution: The optimal $\delta^\star$ is equal to $(-\varepsilon, \varepsilon)^\top$. This corresponds to $\tilde{x}^\star = x + \delta^\star = (-1 - \varepsilon, 2 + \varepsilon)^\top$, which
corresponds to the answer “Other”.
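
A brute-force numerical check of Questions 12 and 13 (not part of the exam): because the objective is a squared affine function of the perturbation, its maximum over the ℓ∞ ball is attained at a corner, so it suffices to enumerate the corners; the value of ε below is an arbitrary choice.

```python
# Enumerate the corners of the ℓ∞ ball around x and compare with the closed form.
import itertools
import numpy as np

w = np.array([3.0, -2.0])
x = np.array([-1.0, 2.0])
y, eps = 2.0, 0.1

best_val, best_xt = -np.inf, None
for signs in itertools.product([-1.0, 1.0], repeat=2):
    xt = x + eps * np.array(signs)          # a corner of the ℓ∞ ball around x
    val = (y - xt @ w) ** 2
    if val > best_val:
        best_val, best_xt = val, xt

print(best_xt)                        # [-1.1  2.1]  i.e. (-1 - ε, 2 + ε)
print(best_val, (9 + 5 * eps) ** 2)   # both 90.25
```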


KNN
Question 14 The KNN algorithm needs a notion of distance to assess which points are “nearest”. Identify
the distance measures that can be used in the KNN algorithm.

(a) Euclidean distance: the distance associated to the $L_2$ norm $\|x\|_2 := \sqrt{x_1^2 + \cdots + x_D^2}$
(b) Manhattan distance: the distance associated to the $L_1$ norm $\|x\|_1 := |x_1| + \cdots + |x_D|$
(c) The distance associated to the $L_4$ norm $\|x\|_4 := \left( |x_1|^4 + \cdots + |x_D|^4 \right)^{1/4}$

only a and c
only b
only c
a, b and c
only b and c
only a and b
only a

Solution: The distance measure is simply an algorithmic choice made in the KNN algorithm; all three of these norms induce valid distances, so a, b and c can all be used.

Linear Regression
Question 15 Assume we are doing linear regression with mean-squared loss and L2-regularization on
four one-dimensional data points. Our prediction model can be written as $f(x) = ax + b$ and the optimization
problem can be written as

$$a^\star, b^\star = \operatorname*{argmin}_{a, b} \sum_{n=1}^{4} \left[ y_n - f(x_n) \right]^2 + \lambda a^2.$$

Assume that our data points are $Y = [1, 3, 2, 4]$ and $X = [-2, -1, 0, 3]$. For example $y_1 = 1$ and $x_1 = -2$.
What is the optimal value for the bias, $b^\star$?

2
None of the above answers.
Depends on the value of $\lambda$
3
2.5

Solution: If we take the derivative of the loss w.r.t. $b$ and set it to zero we get $\sum_{i=1}^{4} [y_i - a x_i - b] = 0$. Since
$\sum_{i=1}^{4} a x_i = 0$ (the $x_i$ sum to zero), the optimal value for $b$ is equal to the mean of the target values,
$b^\star = \frac{1 + 3 + 2 + 4}{4} = 2.5$.
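
A quick numerical confirmation (not part of the exam) that the optimal bias does not depend on λ for this dataset; the normal equations below are derived from the stated objective.

```python
# Minimize  Σ (y - a x - b)² + λ a²  for several λ and watch b* stay at 2.5.
import numpy as np

X = np.array([-2.0, -1.0, 0.0, 3.0])
Y = np.array([1.0, 3.0, 2.0, 4.0])

for lam in [0.0, 1.0, 100.0]:
    # Normal equations in (a, b) for the regularized objective.
    A = np.array([[np.sum(X**2) + lam, np.sum(X)],
                  [np.sum(X),          len(X)  ]])
    rhs = np.array([np.sum(X * Y), np.sum(Y)])
    a_opt, b_opt = np.linalg.solve(A, rhs)
    print(f"lambda={lam:6.1f}  a*={a_opt:.3f}  b*={b_opt:.3f}")
# b* = 2.5 for every λ because the x-values sum to zero.
```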


Subgradients
Question 16 Consider the Parametric ReLU function defined as

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{otherwise} \end{cases}$$

where a ∈ R is an arbitrary number. Which of the following statements is true regarding the subgradients
of f (x) at x = 0?

If a subgradient exists, then it is not unique.


A subgradient does not exist at x = 0.
A subgradient exists even though f (x) is not necessarily differentiable at x = 0.
None of the mentioned answers.

Solution: For values of $a > 1$ the function $f(x)$ is concave with a downward kink at $x = 0$, hence no subgradient exists there.
For $a = 1$ the function is differentiable at 0 and the subgradient is unique. Hence the correct answer is “None of the mentioned answers”.

Logistic regression

T
Consider a binary classification task as in Figure 1, which consists of 14 two-dimensional linearly separable
samples (circles corresponds to label y = 1 and pluses corresponds to label y = 0). We would like to predict
AF
the label y = 1 of a sample (x1 , x2 ) when the following holds true
1
Pr(y = 1|x1 , x2 , w1 , w2 ) = > 0.5
1 + exp(−w1 x1 − w2 x2 )

where w1 and w2 are parameters of the model.


[Figure 1 omitted: a scatter plot in the $(x_1, x_2)$ plane with the origin $(0, 0)$ marked.]

Figure 1: Two-dimensional dataset.


Question 17 If we obtain $(w_1, w_2)$ by optimizing the following objective

$$\sum_{n=1}^{N} -\log \Pr(y_n \mid x_{n1}, x_{n2}, w_1, w_2) + \frac{C}{2} w_2^2,$$

where $C$ is very large, then the decision boundary will be close to which of the following lines?

$x_1 + x_2 = 0$
$x_1 = 0$
$x_2 = 0$
$x_1 - x_2 = 0$

Solution: When $C$ is very large, the term $\frac{C}{2} w_2^2$ dominates and thus $w_2$ is pushed close to 0. Then
$\Pr(y = 1 \mid x_1, x_2, w_1, w_2) = \frac{1}{1 + \exp(-w_1 x_1)}$ indicates that the classifier outputs 1 or 0 depending on the sign of
$x_1$. Therefore the decision boundary is $x_1 = 0$.
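
A small sketch of this effect (not part of the exam): the toy dataset and the use of scipy.optimize.minimize are my own choices, not the exam's. Penalizing only w2 with a large C drives it toward zero, so the learned boundary approaches x1 = 0.

```python
# Minimize  Σ -log Pr(y_n | x_n, w) + (C/2) w2²  for small and large C.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(14, 2))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(float)   # toy linearly separable labels

def objective(w, C):
    z = X @ w
    # Numerically stable negative log-likelihood of logistic regression + penalty on w2 only.
    nll = np.sum(np.logaddexp(0.0, z) - y * z)
    return nll + 0.5 * C * w[1] ** 2

for C in [0.0, 1e4]:
    res = minimize(objective, x0=np.zeros(2), args=(C,))
    print(f"C={C:8.0f}  w={np.round(res.x, 3)}")
# With large C, w2 ≈ 0, so the decision boundary w1·x1 + w2·x2 = 0 approaches x1 = 0.
```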

Recommender systems and word vectors


Question 18 What alternates in Alternating Least Squares for Matrix Factorization for a movie recommender system?

recommendation steps and optimization steps
updates based on different movie rating examples from the training set
expectation steps and maximization steps
updates to user embeddings and updates to movie embeddings

Question 19 Which NLP model architectures can differentiate between the sentences “I have to read
this book.” and “I have this book to read.”?

(a) a convolutional model based on word2vec vectors


(b) a recurrent neural network based on GloVe word vectors
(c) a bag-of-words model based on GloVe word vectors

only b and c
only b
only a and c
a, b and c
only c
only a
only a and b


Generative Networks
Question 20 Consider a Generative Adversarial Network (GAN) which successfully produces images of
goats. Which of the following statements is false?

After the training, the discriminator loss should ideally reach a constant value.
The generator aims to learn the distribution of goat images.
The generator can produce unseen images of goats.
The discriminator can be used to classify images as goat vs non-goat.

Solution: “classify goat vs non-goat...” is FALSE, because the discriminator classifies images into real or
fake.

Clustering
Question 21 For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm. Let
$x_1, \ldots, x_N \in \mathbb{R}^D$ be a dataset of $N \geq 2$ distinct points. For any integer $K \geq 1$, consider the following value:

$$L_K^\star = \inf_{\substack{\mu_1, \ldots, \mu_K \in \mathbb{R}^D \\ z_{nk} \in \{0,1\} \text{ s.t. } \sum_{k=1}^{K} z_{nk} = 1}} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \|x_n - \mu_k\|_2^2$$

The statements below are all true except one. Which of the following statements is false?

For $2 \leq K < N$ and with the initial means randomly chosen as $K$ data points, the K-means algorithm
with $K$ clusters is not guaranteed to reach the optimal $L_K^\star$ loss value.
For $2 \leq K < N$, $L_K^\star$ is hard to compute.
For $K \geq N$, $L_K^\star = 0$.
The sequence $(L_K^\star)_{1 \leq K \leq N}$ is not necessarily strictly decreasing.
$L_1^\star$ corresponds to the population variance of the dataset.

Solution: $(L_K^\star)_{1 \leq K \leq N}$ is necessarily strictly decreasing. Since all the data points are distinct, we can
always strictly improve the loss by assigning $\mu_{K+1} = x_i$ where $x_i \neq \mu_k$ for all $1 \leq k \leq K$.


Question 22 You are given the data (xn , yn )1≤n≤N ∈ R2 illustrated in Figure 2 which you want to
cluster into an inner ring and an outer ring (hence a number of clusters K = 2). Which of the following
statement(s) is/are correct?

(a) There exists some initialization such that K-means clustering succeeds.

(b) There exists an appropriate feature expansion such that K-means (with standard initialization) succeeds.

(c) There exists an appropriate feature expansion such that the Expectation Maximization algorithm (with
standard initialization) for a Gaussian Mixture Model succeeds.

Only a and c
Only a
All of them
Only b and c
None of them
Only a and b
Only c
Only b
AF
Solution: For example, the feature expansion $(x_i, y_i, c\sqrt{x_i^2 + y_i^2})_{1 \leq i \leq n}$ for $c$ big enough will easily be clusterable using K-means
or a GMM.

Figure 2: Two-dimensional dataset
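
A minimal sketch (not part of the exam) of the feature expansion suggested in the solution, with an invented two-ring dataset and scikit-learn's KMeans: appending a scaled radius coordinate makes the two rings easy to separate with K-means.

```python
# Compare K-means on raw (x, y) coordinates vs. the radius-expanded features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200
angles = rng.uniform(0, 2 * np.pi, size=n)
radii = np.concatenate([np.full(n // 2, 1.0), np.full(n // 2, 3.0)])
radii += 0.05 * rng.normal(size=n)                      # noisy inner/outer rings
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
true_ring = (radii > 2.0).astype(int)

c = 10.0                                                # scale factor from the solution
X_expanded = np.c_[X, c * np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)]

for name, data in [("raw (x, y)", X), ("expanded (x, y, c*r)", X_expanded)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    acc = max(np.mean(labels == true_ring), np.mean(labels != true_ring))
    print(f"{name:22s} agreement with rings: {acc:.2f}")
# Raw coordinates give roughly chance-level agreement; the expanded features give ~1.0.
```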


Linear Algebra
Question 23 Given a matrix $X$ of shape $D \times N$ with a singular value decomposition (SVD) $X = U S V^\top$,
suppose $X$ has rank $K$ and let $A = X X^\top$.
Which one of the following statements is false?

The eigenvalues of $A$ are the singular values of $X$
$A$ is positive semi-definite, i.e. all eigenvalues of $A$ are non-negative
The eigendecomposition of $A$ is also a singular value decomposition (SVD) of $A$
A vector $x$ that can be expressed as a linear combination of the last $D - K$ columns of $U$, i.e.
$x = \sum_{i=K+1}^{D} w_i u_i$ (where $u_i$ is the $i$-th column of $U$), lies in the null space of $X^\top$

Solution:

• “A is PSD...” and “The eigendecomposition of...” are true because $A = U S^2 U^\top$.

• “The eigenvalues of... are the singular values...” is false because the eigenvalues of $A$ are the squares
of the singular values of $X$.

• “A vector x that...” is true.
Proof: $X^\top = V S^\top U^\top$, so $X^\top x = X^\top \sum_{i=K+1}^{D} w_i u_i = \sum_{i=K+1}^{D} w_i (X^\top u_i) = \sum_{i=K+1}^{D} w_i (V S^\top e_i)$; the
last equality follows from the fact that $u_i$ is orthogonal to every row of $U^\top$ except the $i$-th row, i.e.
$U^\top u_i = e_i$. As $X^\top$ has rank $K$, $S^\top$ has non-zeros along the diagonal only up to the $K$-th row/column,
i.e. $S^\top e_i = 0$ for $i \geq K + 1$. Therefore $X^\top x = 0$.
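
A quick NumPy check of these facts (not part of the exam), on a randomly generated rank-K matrix:

```python
# Eigenvalues of A = XXᵀ equal the squared singular values of X, and the trailing
# left-singular vectors lie in the null space of Xᵀ.
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 5, 8, 3
X = rng.normal(size=(D, K)) @ rng.normal(size=(K, N))   # rank-K matrix of shape D×N

U, s, Vt = np.linalg.svd(X)
A = X @ X.T
eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]

print(np.allclose(eigvals[:len(s)], s ** 2, atol=1e-8))  # True: eigenvalues = singular values²
x = U[:, K:] @ rng.normal(size=D - K)                    # combination of the last D-K columns of U
print(np.allclose(X.T @ x, 0, atol=1e-8))                # True: x is in the null space of Xᵀ
```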


Second part: true/false questions


For each question, mark the box (without erasing) TRUE if the statement is always true and the box
FALSE if it is not always true (i.e., it is sometimes false).

Question 24 (Neural Networks) Weight sharing allows CNNs to deal with image data without using too
many parameters. However, weight sharing increases the variance of a model.

TRUE FALSE

Solution: False, weight sharing increases bias.

Question 25 (Kernels) For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm.
Define the function $k(x, x') = \frac{1}{1 - x^\top x'}$ on the set of $x, x' \in \mathbb{R}^D$ such that $\|x\|_2 < 1$ and $\|x'\|_2 < 1$. The function
$k(x, x')$ is a valid kernel.

TRUE FALSE

Solution: True. It is a valid kernel, because $k(x, x') = \frac{1}{1 - x^\top x'} = \sum_{i=0}^{\infty} (x^\top x')^i$, i.e. a sum with
non-negative coefficients of powers of the linear kernel $x^\top x'$, which is therefore itself a valid kernel.

Question 26 (Bias/Variance Decomposition) Consider a linear regression model where the data is generated
by input $x$ and output $y = w^\top x + \varepsilon$, where $w$ is a fixed vector and $\varepsilon$ is Gaussian noise with zero
mean and variance $\sigma^2$. Then there is no machine learning algorithm that can achieve a training error lower
than $\sigma^2$.

TRUE FALSE

Solution: False. It is the true error that cannot be less than σ 2 instead of the training error. Moreover,
overfitting is always an option, and it is often possible to have a small training error.
Question 27 (Logistic regression) Consider a binary classification task with loss $L(w) = \sum_{n=1}^{N} -y_n x_n^\top w + \log(1 + e^{x_n^\top w})$,
with linearly separable samples and $y \in \{0, 1\}$. Then there exists an optimal $w$ with exactly 0 loss and
100% training accuracy.

TRUE FALSE

Solution: False. While there exists a $w$ which achieves 100% training accuracy, exactly 0 loss is not attainable: the loss only approaches 0 as $\|w\|_2 \to \infty$.


Question 28 (Linear Models) For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean
norm. For $y \in \mathbb{R}^D$, $X \in \mathbb{R}^{D \times D}$, the solution of the least squares problem

$$w^\star = \operatorname*{argmin}_{w \in \mathbb{R}^D} \|y - Xw\|_2^2$$

is always unique.

TRUE FALSE

Solution: False. This is true only if $X$ is full-rank. Suppose that

$$X = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \quad \text{and} \quad y = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},$$

then $w_1^\star = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$ and $w_2^\star = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$ are both solutions.

Question 29 (PCA) Your friend performed Principal Component Analysis on some data and claimed that
he retained at least 95% of the variance using $k$ principal components. This is equivalent to
$\frac{\sum_{i \geq k+1} \lambda_i}{\sum_i \lambda_i} \leq 0.05$,
where $\lambda_1, \ldots, \lambda_k, \ldots$ are the eigenvalues associated to each principal component, sorted in non-increasing
order.

TRUE FALSE

Solution: True. The variance explained with $k$ principal components is at least 95%. This means that
$\frac{\sum_{i \leq k} \lambda_i}{\sum_i \lambda_i}$ is at least 0.95, which is equivalent to $\frac{\sum_{i \geq k+1} \lambda_i}{\sum_i \lambda_i} \leq 0.05$. Hence, the claim is true.
Question 30 (Adversarial robustness) Let $\|v\|_\infty := \max_i |v_i|$ denote the $\ell_\infty$-norm. Assume a binary
classification problem with $y \in \{-1, 1\}$. Then the adversarial zero-one loss $\max_{\delta: \|\delta\|_\infty \leq \varepsilon} 1_{y f(x+\delta) \leq 0}$ ($1_C$ is
equal to 1 if the condition $C$ is true and to 0 if $C$ is false) is always upper bounded by the adversarial hinge
loss $\max_{\delta: \|\delta\|_\infty \leq \varepsilon} \max\{0, 1 - y f(x + \delta)\}$.

TRUE FALSE

Solution: True. Since the hinge loss upper bounds the zero-one loss pointwise, taking the maximum over $\delta$ of both
of them preserves the inequality between them.

Question 31 (Expectation Maximization) If we run the Expectation Maximization (EM) algorithm on the
log-likelihood for a Gaussian Mixture Model, then the sequence of log-likelihoods {L(θ (t) )}t∈N is guaranteed
to be non-decreasing.

TRUE FALSE

Solution: True. By construction of the algorithm, each parameter update does not decrease the log-likelihood.


Question 32 (FastText Supervised Classifier) The FastText supervised classifier can be modelled as a
two-layer linear (one hidden layer) neural network with a non-linear loss function at the output.

TRUE FALSE

Solution: True. The FastText supervised classifier consists of two layers - embedding layer and linear
classifier. The embedding layer can be thought of as another linear layer if the input is thought of as
encoding word frequency.

Question 33 (Matrix Factorization) Consider the matrix factorization objective function
$\frac{1}{2} \sum_{(d,n) \in \Omega} \left( x_{dn} - (W Z^\top)_{dn} \right)^2$. When minimizing this objective with SGD over the embedding matrices
$W$ and $Z$, you should initialize $W$ and $Z$ with zeros.

TRUE FALSE

Solution: False. In fact, it will not work if you initialize the embeddings to zero. ‘Zero’ is a stationary
point of the optimization problem. If you start all the parameters at zero, they will never change.
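
A tiny sketch (not part of the exam) of why zero initialization fails: at W = Z = 0 the gradient of the objective is exactly zero, so (S)GD never moves. The fully observed toy matrix below is an assumption for brevity.

```python
# At W = Z = 0 the gradient of ½ Σ (x_dn - (WZᵀ)_dn)² vanishes.
import numpy as np

rng = np.random.default_rng(0)
D, N, r = 6, 5, 2
X = rng.normal(size=(D, N))          # observed ratings (all entries observed here)

W = np.zeros((D, r))
Z = np.zeros((N, r))

error = X - W @ Z.T                  # residual matrix
grad_W = -error @ Z                  # ∂loss/∂W
grad_Z = -error.T @ W                # ∂loss/∂Z
print(np.linalg.norm(grad_W), np.linalg.norm(grad_Z))  # both 0.0: a stationary point
```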

Question 34 (MSE and Neural Networks) The mean squared error (MSE) is convex w.r.t. the parameters
of a multi-layer perceptron with more than one hidden layer and linear activation functions.

TRUE FALSE
AF
Solution: False. Consider a simple two layer neural network with single neuron in each layer such that the
output of the network is equal to w1 w2 x where w1 is the weight of the first layer and w2 is the weight of the
second layer. You can simply check that MSE loss is not convex w.r.t the parameters.
DR


Third part, open questions


Answer in the space provided! Your answer must be justified with all steps. Leave the check-boxes empty;
they are used for the grading.

PCA
Let x1 , ..., xN be a dataset of N vectors in RD .

Question 35: (1 point.) What does it mean for the data vectors x1, ..., xN to be centered, as required for principal
component analysis (PCA) to be meaningful? Use the notation xnd for individual entries.

0 1

Solution: The data is centered, i.e. $\mathbb{E}[x] = 0$, or in other words $\frac{1}{N}\sum_{n=1}^{N} x_n = 0$, or $\frac{1}{N}\sum_{n=1}^{N} x_{nd} = 0$ for all $d$.
Question 36: (1 point.) Write down the covariance matrix of the dataset X = (x1 , ..., xN ) ∈ RD×N , and
state its dimensions. (Note that for PCA we assume data to be already centered, as in the previous
question)

0 1

Solution: $\mathrm{cov} = \frac{1}{N} X X^\top \in \mathbb{R}^{D \times D}$.
Now let x be a random vector distributed according to the uniform distribution over the finite normalized
dataset x1, ..., xN from above. Consider the problem of finding a unit vector $w \in \mathbb{R}^D$ such that the random
variable $w^\top x$ has maximal variance.

Question 37: (1 point.) What is the variance of the random variable $w^\top x$ over the randomness of $x$?

0 1

Solution: $\mathrm{Var}[w^\top x] = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n)^2$
Question 38: (2 points.) Show that the solution of the problem $\operatorname{argmax}_{w:\|w\|=1} \mathrm{Var}[w^\top x]$ is to set $w$ to
be the first principal vector of x1, ..., xN.

0 1 2

Solution: $\mathrm{Var}[w^\top x] = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} w^\top x_n x_n^\top w = \frac{1}{N} w^\top X X^\top w$, which (by
the variational characterization of eigenvectors) is maximized over unit vectors $w$ when $w$ is the top eigenvector of
$X X^\top$. (One can add an argument for why this is also the top left-singular vector of $X$.)
Question 39: (1 point.) Explain in words what the above result says about how PCA relates to the dataset.

0 1

Solution: Among all directions, the first principal component is the one onto which the projected data has
maximum variance. (Mentioning “max variance” alone is not sufficient; one needs to say what kind of variance,
related to the data and to the direction.)
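
A short NumPy sketch (not part of the exam) mirroring Questions 35 to 39 on an invented dataset: centre the data, form the covariance matrix, take its top eigenvector, and check that no random direction yields a larger projected variance.

```python
# PCA by hand: centering, covariance, top eigenvector, and the max-variance property.
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3
X = rng.normal(size=(D, N)) * np.array([[3.0], [1.0], [0.3]])   # columns are samples

X = X - X.mean(axis=1, keepdims=True)        # Q35: centre, so (1/N) Σ x_nd = 0 for all d
C = X @ X.T / N                              # Q36: covariance matrix, shape D × D

eigvals, eigvecs = np.linalg.eigh(C)
w_pc1 = eigvecs[:, -1]                       # top eigenvector = first principal direction

var_pc1 = np.mean((w_pc1 @ X) ** 2)          # Q37: variance of wᵀx under the empirical distribution
random_dirs = rng.normal(size=(100, D))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
var_random = np.mean((random_dirs @ X) ** 2, axis=1)

print(var_pc1 >= var_random.max())           # Q38-39: True, the first PC maximizes projected variance
```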


The Perceptron
Setting

Let us consider a binary classification problem with a training set $S = \{(x_n, y_n)\}_{n=1}^{N}$ such that:

$$x_n \in \mathbb{R}^D, \text{ and } y_n \in \{-1, 1\}, \text{ for all } n = 1, \cdots, N,$$

where N, D are integers such that N, D ≥ 1.


We consider the Perceptron classifier which classifies $x \in \mathbb{R}^D$ following the rule:

$$f_{w,b}(x) = \operatorname{sgn}(w^\top x + b),$$

where $w \in \mathbb{R}^D$ is the weight vector, $b \in \mathbb{R}$ is the threshold, and the sign function is defined as

$$\operatorname{sgn}(z) = \begin{cases} +1 & \text{if } z \geq 0 \\ -1 & \text{if } z < 0 \end{cases}$$

Question 40: (1 point.) As seen in the course, explain how we can ignore the threshold $b$ and only deal
with classifiers passing through the origin, i.e., of the form $f_w(x) = \operatorname{sgn}(w^\top x)$.

0 1

Solution: We can view the threshold as an additional weight by appending the constant 1 to the input
$x$. It amounts to considering the augmented input $\tilde{x}^\top = [x^\top, 1]$ and weight $\tilde{w}^\top = [w^\top, b]$, since $\tilde{x}^\top \tilde{w} = x^\top w + b$.
AF
For the remainder of the exercise, we proceed without this additive threshold, as explained in the previous
question, and only consider classifiers of the form $f_w(x) = \operatorname{sgn}(w^\top x)$. We make the following two
assumptions:

• Bounded input: There exists a real number $R \geq 0$ such that for all $n = 1, \cdots, N$ we have
$\|x_n\|_2 \leq R$,
where for any vector $z \in \mathbb{R}^D$ we use $\|z\|_2$ to refer to the Euclidean norm of $z$, i.e., $\|z\|_2 = \sqrt{\sum_{d=1}^{D} z_d^2}$.

• Linearly separable data: There exists $w^\star \in \mathbb{R}^D$ and a real number $\gamma > 0$ such that for all
$n = 1, \cdots, N$ we have $y_n w^{\star\top} x_n \geq \gamma$.

We will use the Perceptron algorithm to train the Perceptron classifier and find a separating hyperplane.
Let t denote the number of parameter updates we have performed and wt the weight vector after t updates.
We initialize with w0 = 0. If this weight vector is already a separating hyperplane, we are done. If not, we
pick an arbitrary point xi that is currently misclassified. This point is used to update the weight vector wt
as
$$w_{t+1} := w_t + y_i x_i \quad \text{where } y_i x_i^\top w_t \leq 0.$$

The algorithm is formally written below.


We will study the convergence of this learning algorithm.

Convergence

Question 41: (2 points.) For t > 0, show that when making the tth update according to the Perceptron
update, we have:

$$w^{\star\top} w_t \geq w^{\star\top} w_{t-1} + \gamma$$


Algorithm 1: Perceptron algorithm

t ← 0; w_t ← 0;
while there exists j ∈ {1, · · · , N} such that y_j x_j^⊤ w_t ≤ 0 do
    Pick an arbitrary i ∈ {1, · · · , N} such that y_i x_i^⊤ w_t ≤ 0;
    w_{t+1} ← w_t + y_i x_i;
    t ← t + 1
end
return w_t
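
A Python sketch of Algorithm 1 (not part of the exam) on an invented linearly separable dataset; it also reports the number of updates t, which can be compared against the bound derived in the following questions.

```python
# Perceptron algorithm on a toy linearly separable dataset.
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """X: (N, D) inputs, y: (N,) labels in {-1, +1}. Returns (w, number of updates)."""
    w = np.zeros(X.shape[1])
    for t in range(max_updates):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w, t
        i = mistakes[0]                 # pick an arbitrary misclassified point
        w = w + y[i] * X[i]             # Perceptron update
    return w, max_updates

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
X = X[np.abs(X @ w_true) > 0.5][:100]   # keep 100 points with a margin around the separator
y = np.sign(X @ w_true)

w, t = perceptron(X, y)
print("updates:", t, "training accuracy:", np.mean(np.sign(X @ w) == y))
```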

0 1 2

Solution: Let us assume that the $t$-th update is made on sample $i$; we then have

$$w^{\star\top} w_t = w^{\star\top} (w_{t-1} + y_i x_i) = w^{\star\top} w_{t-1} + y_i w^{\star\top} x_i \geq w^{\star\top} w_{t-1} + \gamma.$$

The first equality follows from the definition of the Perceptron update and the last inequality follows from
the assumption of linear separability.

Question 42: (1 point.) Show that it implies that

$$w^{\star\top} w_t \geq t\gamma$$

0 1

Solution: By induction on $t$ we obtain $w^{\star\top} w_t \geq w^{\star\top} w_{t-1} + \gamma \geq \cdots \geq w^{\star\top} w_0 + t\gamma$, and by definition $w_0 = 0$.
Therefore $w^{\star\top} w_t \geq t\gamma$.

Question 43: (1 point.) Using the previous question, derive a lower bound on $\|w_t\|_2$ depending on $t$, $\gamma$
and $\|w^\star\|_2$.
(We recall that a lower bound is a constant $C > 0$ such that $\|w_t\|_2 \geq C$. Note that we will only accept
non-trivial lower bounds as correct answers.)

0 1

Solution: Using the Cauchy-Schwarz inequality, $\|w_t\|_2 \|w^\star\|_2 \geq w^{\star\top} w_t \geq t\gamma$. Therefore $\|w_t\|_2 \geq \frac{t\gamma}{\|w^\star\|_2}$.

Question 44: (3 points.) Derive the following upper bound on the squared norm $\|w_t\|_2^2$:

$$\|w_t\|_2^2 \leq \|w_{t-1}\|_2^2 + R^2$$

0 1 2 3

Solution: We again assume that the $t$-th update is made on sample $i$; we then have

$$\|w_t\|_2^2 = \|w_{t-1} + y_i x_i\|_2^2 = \|w_{t-1}\|_2^2 + 2 y_i w_{t-1}^\top x_i + \|x_i\|_2^2 \leq \|w_{t-1}\|_2^2 + \|x_i\|_2^2 \leq \|w_{t-1}\|_2^2 + R^2.$$

The first equality follows from the definition of the Perceptron update, the second from expanding the square,
the first inequality from the fact that updates are made only on mistakes, i.e., $y_i w_{t-1}^\top x_i \leq 0$ whenever an
update is made, and the last inequality from the boundedness assumption on the input.

Question 45: (1 point.) Show that it implies that

$$\|w_t\|_2^2 \leq t R^2$$

0 1

Solution: By induction on $t$ we obtain $\|w_t\|_2^2 \leq \|w_{t-1}\|_2^2 + R^2 \leq \cdots \leq \|w_0\|_2^2 + t R^2$, and by definition $w_0 = 0$.
Therefore $\|w_t\|_2^2 \leq t R^2$.

Question 46: (3 points.) Combine the results obtained in the previous questions to obtain an upper bound
on t depending on the quantities R, γ and w? .

0 1 2 3

Solution: Using Question 43 we get

$$\|w_t\|_2 \geq \frac{t\gamma}{\|w^\star\|_2}.$$

From Question 45 we have

$$\|w_t\|_2^2 \leq t R^2.$$

Therefore, combining both, we obtain

$$\frac{t^2 \gamma^2}{\|w^\star\|_2^2} \leq \|w_t\|_2^2 \leq t R^2,$$

which yields

$$t \leq \frac{R^2 \|w^\star\|_2^2}{\gamma^2}.$$

Question 47: (2 points.) Qualitatively analyze the previous result. What have you proven? Interpret the
dependency on R, γ and w? .

0 1 2

Solution: We have shown that the Perceptron algorithm terminates with a correct (separating) solution after a
finite number of updates, and that this number of updates is inversely proportional to the squared margin $\gamma^2$ and
proportional to $R^2 \|w^\star\|_2^2$.

Perceptron and margin

Recall that we define the max-margin $M^\star$ as

$$M^\star = \max_{w \in \mathbb{R}^D, \|w\|_2 = 1} M \quad \text{such that} \quad y_n x_n^\top w \geq M \text{ for } n = 1, \cdots, N,$$

and a max-margin separating hyperplane $\bar{w}$ as a solution of this problem:

$$\bar{w} \in \operatorname*{arg\,max}_{w \in \mathbb{R}^D, \|w\|_2 = 1} M \quad \text{such that} \quad y_n x_n^\top w \geq M \text{ for } n = 1, \cdots, N.$$


Question 48: (2 points.) Bound the number of perceptron updates $t$ using the quantities $R$ and $M^\star$. Prove
your result.

0 1 2

Solution: The max-margin hyperplane $\bar{w}$ satisfies the linear separability assumption with $\|\bar{w}\|_2 = 1$ and
$\gamma = M^\star$; equivalently, by definition of $\gamma$ and $M^\star$ we have $\gamma / \|w^\star\|_2 \leq M^\star$ (1 point if proven properly).
Applying the bound of Question 46 with $w^\star = \bar{w}$ and $\gamma = M^\star$, we obtain

$$t \leq \frac{R^2}{(M^\star)^2}.$$
Question 49: (1 point.) Does it imply that the output of the Perceptron algorithm is a max-margin
separating hyperplane?

0 1

Solution: No, it does not: the bound only controls the number of updates, and we have no guarantee about which
separating hyperplane the Perceptron algorithm converges to.

