Profs. Nicolas Flammarion and Martin Jaggi


Machine Learning – CS-433 - IC
20.01.2022 from 08h15 to 11h15 in STCC
Duration: 180 minutes
Student One
SCIPER : 111111

Do not turn the page before the start of the exam. This document is double-sided and has 20
pages; the last ones are possibly blank. Do not unstaple.

• This is a closed book exam. No electronic devices of any kind.
• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet (hand-
written or 11pt min font size) if you have one; place all other personal items below your desk.
• You each have a different exam.
• Only answers in this booklet count. No extra loose answer sheets. You can use the last two pages
as scrap paper.
• For the multiple choice questions, we give:
+2 points if your answer is correct,
0 points if you give no answer or more than one,
−0.5 points if your answer is incorrect.
• For the true/false questions, we give:
+1 point if your answer is correct,
0 points if you give no answer or more than one,
−1 point if your answer is incorrect.
• Use a black or dark blue ballpoint pen and clearly erase with correction fluid if necessary.
• If a question turns out to be wrong or ambiguous, we may decide to nullify it.


First part: multiple choice questions


For each question, mark the box corresponding to the correct answer. Each question has exactly one
correct answer.

Robustness to outliers
We consider a classification problem on linearly separable data. Our dataset contains an outlier: a point that is
very far from the other data points (and also far from the SVM margin, but still correctly classified by the SVM
classifier). We trained SVM, logistic regression, and 1-nearest-neighbour models on this dataset. We tested the
trained models on a test set that comes from the same distribution as the training set but does not contain any
outlier points. After that, we removed the outlier and retrained our models.
Question 1 After retraining, which classifier will change its decision boundary around the test points?

Logistic regression
SVM
1-nearest-neighbors classifier
All of them

Solution: The solution of the SVM problem depends only on the support vectors, and the outlier is not a
support vector, as stated in the problem. Thus the SVM decision boundary will not change after retraining.
For the 1-NN classifier, decisions are based on the nearest data points, which are not the outlier since the
outlier is very far from all other data points. Thus the decisions of the 1-NN classifier will also not change.
The logistic regression solution depends on all of the data points, so removing one point will change the
decision boundary.

SVM
Question 2 For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm. The hard-
margin SVM problem for linearly separable points in $\mathbb{R}^D$ is to minimize the Euclidean norm $\|w\|_2$ under
some constraints. What are the additional constraints for this optimization problem?

$w^\top x_n \geq 1 \;\; \forall n \in \{1, \cdots, N\}$
$\frac{y_n}{w^\top x_n} \geq 1 \;\; \forall n \in \{1, \cdots, N\}$
$y_n w^\top x_n \geq 1 \;\; \forall n \in \{1, \cdots, N\}$
$y_n + w^\top x_n \geq 1 \;\; \forall n \in \{1, \cdots, N\}$


Cross Validation
Question 3 Consider the K-fold cross validation on a linear regression model with a sufficiently large
amount of training data. When K is large, the computational complexity of the K-fold cross validation with
respect to K is of order

O(1/K).
O(K).
O(1).
O(K(K − 1)).

Solution: Because each training process uses (K − 1)/K = O(1) fraction of the data, and there are K such
processes.
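
To make the counting argument concrete, here is a minimal NumPy sketch of K-fold cross-validation for least squares (not part of the exam; the dataset and the helper name are made up): the loop fits K models, each on a (K − 1)/K fraction of the N samples, so the total work grows as O(K).

```python
# Minimal K-fold cross-validation for least squares (illustrative sketch).
import numpy as np

def kfold_cv_mse(X, y, K):
    N = len(y)
    folds = np.array_split(np.arange(N), K)
    errors = []
    for k in range(K):                                   # K fits in total
        val_idx = folds[k]
        train_idx = np.setdiff1d(np.arange(N), val_idx)  # (K-1)/K of the data
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residual = y[val_idx] - X[val_idx] @ w
        errors.append(np.mean(residual ** 2) / 2)
    return np.mean(errors)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
print(kfold_cv_mse(X, y, K=10))  # total work grows linearly with K
```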

Bias-Variance Decomposition
Question 4 Consider a regression model where data $(x, y)$ is generated by an input $x$ sampled uniformly at
random from $[0, 1]$ and $y(x) = x^2 + \varepsilon$, where $\varepsilon$ is random noise with mean 0 and variance 1. Two models
are fit for the regression: model A is a trained quadratic function $g(x; w) = w_2 x^2 + w_1 x + w_0$ where
$w = (w_0, w_1, w_2)^\top \in \mathbb{R}^3$, and model B is the constant function $h(x) = 1/2$. Then compared to model B, model
A has

higher bias, higher variance.
lower bias, higher variance.
higher bias, lower variance.
lower bias, lower variance.

Solution: Model B has zero variance because it outputs the constant 1/2, which does not depend on the training
data. Model A, a trained quadratic, can represent the true function $x^2$, so it has lower bias but a nonzero variance
coming from fitting the noisy training data.
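
As a quick illustration (not part of the exam), the following Monte Carlo sketch, with an assumed training-set size and number of repetitions, estimates the bias and variance of both models under the stated data distribution.

```python
# Monte Carlo estimate of bias and variance for model A (fitted quadratic)
# and model B (constant 1/2) under y = x² + ε, ε ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_repeats = 50, 2000
x_test = np.linspace(0, 1, 101)
true_f = x_test ** 2

preds_A = np.empty((n_repeats, x_test.size))
for r in range(n_repeats):
    x = rng.uniform(0, 1, n_train)
    y = x ** 2 + rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, deg=2)            # model A: least-squares quadratic
    preds_A[r] = np.polyval(coeffs, x_test)

bias2_A = np.mean((preds_A.mean(axis=0) - true_f) ** 2)
var_A = np.mean(preds_A.var(axis=0))
bias2_B = np.mean((0.5 - true_f) ** 2)          # model B: constant prediction, zero variance
print(f"model A: bias² ≈ {bias2_A:.3f}, variance ≈ {var_A:.3f}")
print(f"model B: bias² ≈ {bias2_B:.3f}, variance = 0")
# Expected: model A has (much) lower bias but higher variance than model B.
```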

Optimization
Question 5 Let $n$ be an integer such that $n \geq 2$, let $A \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^n$, and consider the function
$f(x) = x^\top A x$ defined over $\mathbb{R}^n$. Which of the following is the gradient of the function $f$?

$A^\top x + Ax$
$2A^\top x$
$2x^\top A$
$2Ax$

Solution: $\nabla f(x) = A^\top x + Ax$. See lab 11, exercise 1; here the matrix $A$ is not symmetric.
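
A small numerical sanity check of this gradient (not part of the exam), using finite differences in NumPy on a deliberately non-symmetric A:

```python
# Check that the gradient of f(x) = xᵀAx is Aᵀx + Ax for a non-symmetric A.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))        # deliberately not symmetric
x = rng.normal(size=n)

f = lambda v: v @ A @ v
analytic = A.T @ x + A @ x

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```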


Question 6 Which of the following functions reaches a global maximum on the set I? (Note that [., .]
and (., .) denote closed and open intervals respectively)

• $f_1(x) = -x^4$, $I = [-5, 5]$

• $f_2(x) = \arccos(x)$, $I = (-1, 1)$

• $f_3(x) = x \exp(-x)$, $I = (-\infty, 0)$

• $f_4(x) = \sin(\cos(x)) \sin(x)$, $I = \mathbb{R}_+$

f1 , f4
f1 , f3 , f4
f1 , f2 , f3 , f4
f1 , f2 , f4

Solution:

• $f_1$: Yes, computing the first and second derivatives shows that the maximum is reached at $x = 0$. Also
note that any continuous function on a compact set reaches its maximum and minimum.

• $f_2$: No. We can compute the first derivative starting from $\cos(\arccos(x)) = x$:
$-\sin(\arccos(x)) \arccos'(x) = 1$
$\arccos'(x) = -\frac{1}{\sin(\arccos(x))}$
$\arccos'(x) = -\frac{1}{\sqrt{1 - \cos(\arccos(x))^2}}$
$\arccos'(x) = -\frac{1}{\sqrt{1 - x^2}}$
The derivative of $f_2$ can never be equal to 0 on $I$, so no maximum is attained on the open interval.

• $f_3$: No. The first derivative is $-e^{-x}(x - 1)$, which is positive on $I$, so $f_3$ is strictly increasing and no
maximum is attained on the open interval $I$.

• $f_4$: Yes. Since $f_4$ is the product of the periodic functions $\sin(\cos(x))$ and $\sin(x)$, it is also periodic. Since
$f_4$ is periodic on $I$, which has infinite measure, it reaches its maximum infinitely many times.

• $f_5$: Yes. Computing the gradient of $f_5$, we find $\nabla f_5(x_1, x_2) = (-\operatorname{sgn}(x_1) - 6x_2, -10x_2)$, which is equal
to $(0, 0)$ at $(x_1, x_2) = (0, 0)$. Then computing the Hessian, we find that it is a diagonal matrix filled
with negative values, hence $f_5$ does reach a maximum at $(0, 0)$.


Linear Models
Question 7 Consider a linear regression model on a dataset which we split into a training set and a test
set. After training, our model gives a mean-squared error of 0.1 on the training set and a mean-squared
error of 5.3 on the test set. Recall that the mean-squared error (MSE) is given by:
$$\mathrm{MSE}_w(y, X) = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - x_n^\top w \right)^2$$

Which of the following statements is correct ?

Ridge regression can help reduce the gap between the training MSE and the test MSE.
Retraining the model with feature augmentation (e.g. adding polynomial features) will increase the
training MSE.
Retraining while discarding some training samples will likely reduce the gap between the train MSE
and the test MSE.
Using cross-validation can help decrease the training MSE of this very model.

Solution:

• Feature augmentation: Incorrect; feature augmentation will increase overfitting, hence decrease
the training MSE even more.

• Cross-validation: Incorrect; cross-validation can help select a model that overfits less, but it does not
help to get better performance on this specific model.

• Discarding some training samples: Incorrect; reducing the number of training samples is more likely
to increase the overfitting.

• Ridge regression: Correct; the regularization from ridge regression can be useful to reduce overfitting.

Regularization
Question 8 Which of the following statements is incorrect ?

Training a model with L1 -regularization ...

can reduce the storage cost of the final model.


is used to help escaping local minima during training.
can reduce overfitting.
can be named Lasso regression when in combination with an MSE loss function and a linear model.

Solution:

• reduce the storage: is true, the parameters tend to be sparse when using L1 -regularization, thus
parameters with 0 value do not need to be stored.

• escaping local minima: is false (so check this box), L1 -regularization has nothing to do with the
optimization of the model.

• reduce overfitting: is true, it is a regularization technique, hence it reduces the complexity of the
model.

• Lasso: is true, see the course.


Neural networks
Let $f : \mathbb{R}^D \to \mathbb{R}$ be an $L$-hidden-layer multi-layer perceptron (MLP) such that

$$f(x) = \sigma_{L+1}\left( w^\top \sigma_L\left( W_L \, \sigma_{L-1}\left( W_{L-1} \ldots \sigma_1(W_1 x) \right) \right) \right),$$

with $w \in \mathbb{R}^M$, $W_1 \in \mathbb{R}^{M \times D}$ and $W_\ell \in \mathbb{R}^{M \times M}$ for $\ell = 2, \ldots, L$, where $\sigma_i$ for $i = 1, \ldots, L+1$ is an entry-wise
activation function. For any MLP $f$ and a classification threshold $\tau$, let $C_{f,\tau}$ be a binary classifier that
outputs YES for a given input $x$ if $f(x) \leq \tau$ and NO otherwise.
Question 9 Assume $\sigma_{L+1}$ is the element-wise sigmoid function and $C_{f,\frac{1}{2}}$ is able to obtain a high accuracy
on a given binary classification task $T$. Let $g$ be the MLP obtained by multiplying the parameters in the
last layer of $f$, i.e. $w$, by 2. Moreover, let $h$ be the MLP obtained by replacing $\sigma_{L+1}$ with the element-wise
ReLU. Finally, let $q$ be the MLP obtained by doing both of these actions. Which of the following is true?

$$\mathrm{ReLU}(x) = \max\{x, 0\} \qquad \mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

$C_{h,0}$ may have an accuracy significantly lower than $C_{f,\frac{1}{2}}$ on $T$
$C_{g,\frac{1}{2}}$, $C_{h,0}$, and $C_{q,0}$ have the same accuracy as $C_{f,\frac{1}{2}}$ on $T$
$C_{g,\frac{1}{2}}$ may have an accuracy significantly lower than $C_{f,\frac{1}{2}}$ on $T$
$C_{q,0}$ may have an accuracy significantly lower than $C_{f,\frac{1}{2}}$ on $T$
Solution: Thresholding the sigmoid output at $\frac{1}{2}$ is equivalent to thresholding the input of the last activation
function at 0, and the ReLU output crosses 0 exactly when its input does, so $C_{h,0}$ makes the same predictions as $C_{f,\frac{1}{2}}$.
Moreover, multiplying the last-layer weights $w$ by 2 does not change the sign of the pre-activation. Therefore
$C_{g,\frac{1}{2}}$ and $C_{q,0}$ also make the same predictions, and all three have the same accuracy as $C_{f,\frac{1}{2}}$ on $T$.

Question 10 Assume the weights in $f$ were obtained by initializing all parameters to zero and running
SGD. Assume there exists $\tau$ such that $C_{f,\tau}$ is a good classifier. Let $g$ be the MLP obtained by randomly
keeping one neuron per layer and removing the other $M - 1$ neurons from each layer, and multiplying all
weights except for the first layer by $M$ (the number of neurons in each layer in the original network). What
is the probability that $C_{g,\tau}$ is a good classifier?

Less than $\frac{1}{M^L}$
Between $\frac{1}{M^L}$ and $\frac{1}{M}$
$1$
Not possible to determine with the given information.

Solution: When all weights are initialized to zero, the value for all hidden neurons will be the same and
remain the same. As such, keeping one neuron in each layer and multiplying its output by the number of
neurons will keep the output unchanged. Therefore g and f are equal.


Question 11 Which of the following techniques do not improve the generalization performance in deep
learning?

None. All techniques here improve generalization.


Dropout
Data augmentation
Tuning the optimizer
L2 regularization

Solution: Once correct hyperparameters are chosen, any of these strategies may be applied to improve
the generalization performance.

Adversarial Examples
Consider a linear model $\hat{y} = x^\top w$ with the squared loss under an $\ell_\infty$-bounded adversarial perturbation. For
a single point $(x, y)$, it corresponds to the following objective:

$$\max_{\tilde{x}:\, \|x - \tilde{x}\|_\infty \leq \varepsilon} \left( y - \tilde{x}^\top w \right)^2, \tag{OP}$$

where the constraint $\|x - \tilde{x}\|_\infty \leq \varepsilon$ is expressed in the $\ell_\infty$-norm, i.e. $|x_i - \tilde{x}_i| \leq \varepsilon$ for every $i$.

Question 12 Assume that $w = (3, -2)^\top$, $x = (-1, 2)^\top$, $y = 2$. What is the maximum value of the
optimization problem in Eq. (OP)?

$(9 + 5\varepsilon)^2$
$(3 + 10\varepsilon)^2$
$(10 - \varepsilon)^2$
Other
$(5 + 9\varepsilon)^2$

Solution: First, it is convenient to reparametrize the objective in terms of an additive perturbation $\delta$:
$\max_{\delta:\|\delta\|_\infty \leq \varepsilon} \left( y - x^\top w - \delta^\top w \right)^2$. If we plug in the given values of $w$, $x$, $y$, we get
$\max_{\delta:\|\delta\|_\infty \leq \varepsilon} \left( 9 - 3\delta_1 + 2\delta_2 \right)^2$. We can maximize this objective independently over $\delta_1$ and $\delta_2$ by noting
that the optimal value is attained at a boundary of the feasible set, i.e. for $|\delta_1| = |\delta_2| = \varepsilon$. This leads to the
maximizer $\delta^\star = (-\varepsilon, \varepsilon)^\top$ and the maximum value $(9 + 5\varepsilon)^2$.

Question 13 Assume that $w = (3, -2)^\top$, $x = (-1, 2)^\top$, $y = 2$. What is the optimal $\tilde{x}^\star$ that maximizes
the objective in Eq. (OP)?

$(-1 + \varepsilon, 2)^\top$
$(-1 - \varepsilon, 2 - \varepsilon)^\top$
$(-1 - \varepsilon, 2)^\top$
$(-1 + \varepsilon, 2 + \varepsilon)^\top$
Other

Solution: The optimal $\delta^\star$ is equal to $(-\varepsilon, \varepsilon)^\top$. This corresponds to $\tilde{x}^\star = x + \delta^\star = (-1 - \varepsilon, 2 + \varepsilon)^\top$, which
corresponds to the answer “Other”.
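
A brute-force numerical check of Questions 12 and 13 (not part of the exam): because the objective is a squared affine function of the perturbation, its maximum over the ℓ∞ ball is attained at a corner, so it suffices to enumerate the corners; the value of ε below is an arbitrary choice.

```python
# Enumerate the corners of the ℓ∞ ball around x and compare with the closed form.
import itertools
import numpy as np

w = np.array([3.0, -2.0])
x = np.array([-1.0, 2.0])
y, eps = 2.0, 0.1

best_val, best_xt = -np.inf, None
for signs in itertools.product([-1.0, 1.0], repeat=2):
    xt = x + eps * np.array(signs)          # a corner of the ℓ∞ ball around x
    val = (y - xt @ w) ** 2
    if val > best_val:
        best_val, best_xt = val, xt

print(best_xt)                        # [-1.1  2.1]  i.e. (-1 - ε, 2 + ε)
print(best_val, (9 + 5 * eps) ** 2)   # both 90.25
```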


KNN
Question 14 The KNN algorithm needs a notion of distance to assess which points are “nearest”. Identify
the distance measures that can be used in the KNN algorithm.

(a) Euclidean distance: the distance associated to the $L_2$ norm $\|x\|_2 := \sqrt{x_1^2 + \cdots + x_D^2}$
(b) Manhattan distance: the distance associated to the $L_1$ norm $\|x\|_1 := |x_1| + \cdots + |x_D|$
(c) The distance associated to the $L_4$ norm $\|x\|_4 := \left( |x_1|^4 + \cdots + |x_D|^4 \right)^{1/4}$

only a and c
only b
only c
a, b and c
only b and c
only a and b
only a

Solution: The distance measure is simply an algorithmic choice made in the KNN algorithm; all three of these norms induce valid distances, so a, b and c can all be used.

Linear Regression
Question 15 Assume we are doing linear regression with mean-squared loss and L2-regularization on
four one-dimensional data points. Our prediction model can be written as $f(x) = ax + b$ and the optimization
problem can be written as

$$a^\star, b^\star = \operatorname*{argmin}_{a, b} \sum_{n=1}^{4} \left[ y_n - f(x_n) \right]^2 + \lambda a^2.$$

Assume that our data points are $Y = [1, 3, 2, 4]$ and $X = [-2, -1, 0, 3]$. For example $y_1 = 1$ and $x_1 = -2$.
What is the optimal value for the bias, $b^\star$?

2
None of the above answers.
Depends on the value of $\lambda$
3
2.5

Solution: If we take the derivative of the loss w.r.t. $b$ and set it to zero we get $\sum_{i=1}^{4} [y_i - a x_i - b] = 0$. Since
$\sum_{i=1}^{4} a x_i = 0$ (the $x_i$ sum to zero), the optimal value for $b$ is equal to the mean of the target values,
$b^\star = \frac{1 + 3 + 2 + 4}{4} = 2.5$.
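
A quick numerical confirmation (not part of the exam) that the optimal bias does not depend on λ for this dataset; the normal equations below are derived from the stated objective.

```python
# Minimize  Σ (y - a x - b)² + λ a²  for several λ and watch b* stay at 2.5.
import numpy as np

X = np.array([-2.0, -1.0, 0.0, 3.0])
Y = np.array([1.0, 3.0, 2.0, 4.0])

for lam in [0.0, 1.0, 100.0]:
    # Normal equations in (a, b) for the regularized objective.
    A = np.array([[np.sum(X**2) + lam, np.sum(X)],
                  [np.sum(X),          len(X)  ]])
    rhs = np.array([np.sum(X * Y), np.sum(Y)])
    a_opt, b_opt = np.linalg.solve(A, rhs)
    print(f"lambda={lam:6.1f}  a*={a_opt:.3f}  b*={b_opt:.3f}")
# b* = 2.5 for every λ because the x-values sum to zero.
```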


Subgradients
Question 16 Consider the Parametric ReLU function defined as

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{otherwise} \end{cases}$$

where a ∈ R is an arbitrary number. Which of the following statements is true regarding the subgradients
of f (x) at x = 0?

If a subgradient exists, then it is not unique.


A subgradient does not exist at x = 0.
A subgradient exists even though f (x) is not necessarily differentiable at x = 0.
None of the mentioned answers.

Solution: For values of $a > 1$ the function $f(x)$ is concave with a downward kink at $x = 0$, hence no subgradient exists there.
For $a = 1$ the function is differentiable at 0 and the subgradient is unique. Hence the correct answer is “None of the mentioned answers”.

Logistic regression

T
Consider a binary classification task as in Figure 1, which consists of 14 two-dimensional linearly separable
samples (circles corresponds to label y = 1 and pluses corresponds to label y = 0). We would like to predict
AF
the label y = 1 of a sample (x1 , x2 ) when the following holds true
1
Pr(y = 1|x1 , x2 , w1 , w2 ) = > 0.5
1 + exp(−w1 x1 − w2 x2 )

where w1 and w2 are parameters of the model.


[Figure 1 omitted: a scatter plot in the $(x_1, x_2)$ plane with the origin $(0, 0)$ marked.]

Figure 1: Two-dimensional dataset.


Question 17 If we obtain $(w_1, w_2)$ by optimizing the following objective

$$\sum_{n=1}^{N} -\log \Pr(y_n \mid x_{n1}, x_{n2}, w_1, w_2) + \frac{C}{2} w_2^2,$$

where $C$ is very large, then the decision boundary will be close to which of the following lines?

$x_1 + x_2 = 0$
$x_1 = 0$
$x_2 = 0$
$x_1 - x_2 = 0$

Solution: When $C$ is very large, the term $\frac{C}{2} w_2^2$ dominates and thus $w_2$ is pushed close to 0. Then
$\Pr(y = 1 \mid x_1, x_2, w_1, w_2) = \frac{1}{1 + \exp(-w_1 x_1)}$ indicates that the classifier outputs 1 or 0 depending on the sign of
$x_1$. Therefore the decision boundary is $x_1 = 0$.
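
A small sketch of this effect (not part of the exam): the toy dataset and the use of scipy.optimize.minimize are my own choices, not the exam's. Penalizing only w2 with a large C drives it toward zero, so the learned boundary approaches x1 = 0.

```python
# Minimize  Σ -log Pr(y_n | x_n, w) + (C/2) w2²  for small and large C.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(14, 2))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(float)   # toy linearly separable labels

def objective(w, C):
    z = X @ w
    # Numerically stable negative log-likelihood of logistic regression + penalty on w2 only.
    nll = np.sum(np.logaddexp(0.0, z) - y * z)
    return nll + 0.5 * C * w[1] ** 2

for C in [0.0, 1e4]:
    res = minimize(objective, x0=np.zeros(2), args=(C,))
    print(f"C={C:8.0f}  w={np.round(res.x, 3)}")
# With large C, w2 ≈ 0, so the decision boundary w1·x1 + w2·x2 = 0 approaches x1 = 0.
```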

Recommender systems and word vectors


Question 18 What alternates in Alternating Least Squares for Matrix Factorization for a movie recommender system?

recommendation steps and optimization steps
updates based on different movie rating examples from the training set
expectation steps and maximization steps
updates to user embeddings and updates to movie embeddings

Question 19 Which NLP model architectures can differentiate between the sentences “I have to read
this book.” and “I have this book to read.”?

(a) a convolutional model based on word2vec vectors


(b) a recurrent neural network based on GloVe word vectors
(c) a bag-of-words model based on GloVe word vectors

only b and c
only b
only a and c
a, b and c
only c
only a
only a and b


Generative Networks
Question 20 Consider a Generative Adversarial Network (GAN) which successfully produces images of
goats. Which of the following statements is false?

After the training, the discriminator loss should ideally reach a constant value.
The generator aims to learn the distribution of goat images.
The generator can produce unseen images of goats.
The discriminator can be used to classify images as goat vs non-goat.

Solution: “classify goat vs non-goat...” is FALSE, because the discriminator classifies images into real or
fake.

Clustering
Question 21 For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm. Let
$x_1, \ldots, x_N \in \mathbb{R}^D$ be a dataset of $N \geq 2$ distinct points. For any integer $K \geq 1$, consider the following value:

$$L_K^\star = \inf_{\substack{\mu_1, \ldots, \mu_K \in \mathbb{R}^D \\ z_{nk} \in \{0,1\} \text{ s.t. } \sum_{k=1}^{K} z_{nk} = 1}} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \|x_n - \mu_k\|_2^2$$

The statements below are all true except one. Which of the following statements is false?

For $2 \leq K < N$ and with the initial means randomly chosen as $K$ data points, the K-means algorithm
with $K$ clusters is not guaranteed to reach the optimal $L_K^\star$ loss value.
For $2 \leq K < N$, $L_K^\star$ is hard to compute.
For $K \geq N$, $L_K^\star = 0$.
The sequence $(L_K^\star)_{1 \leq K \leq N}$ is not necessarily strictly decreasing.
$L_1^\star$ corresponds to the population variance of the dataset.

Solution: $(L_K^\star)_{1 \leq K \leq N}$ is necessarily strictly decreasing. Since all the data points are distinct, we can
always strictly improve the loss by assigning $\mu_{K+1} = x_i$ where $x_i \neq \mu_k$ for all $1 \leq k \leq K$.


Question 22 You are given the data (xn , yn )1≤n≤N ∈ R2 illustrated in Figure 2 which you want to
cluster into an inner ring and an outer ring (hence a number of clusters K = 2). Which of the following
statement(s) is/are correct?

(a) There exists some initialization such that K-means clustering succeeds.

(b) There exists an appropriate feature expansion such that K-means (with standard initialization) succeeds.

(c) There exists an appropriate feature expansion such that the Expectation Maximization algorithm (with
standard initialization) for a Gaussian Mixture Model succeeds.

Only a and c
Only a
All of them
Only b and c
None of them
Only a and b
Only c
Only b
AF
Solution: For example, the feature expansion $(x_i, y_i, c\sqrt{x_i^2 + y_i^2})_{1 \leq i \leq n}$ for $c$ big enough will easily be clusterable using K-means
or a GMM.

Figure 2: Two-dimensional dataset
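
A minimal sketch (not part of the exam) of the feature expansion suggested in the solution, with an invented two-ring dataset and scikit-learn's KMeans: appending a scaled radius coordinate makes the two rings easy to separate with K-means.

```python
# Compare K-means on raw (x, y) coordinates vs. the radius-expanded features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200
angles = rng.uniform(0, 2 * np.pi, size=n)
radii = np.concatenate([np.full(n // 2, 1.0), np.full(n // 2, 3.0)])
radii += 0.05 * rng.normal(size=n)                      # noisy inner/outer rings
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
true_ring = (radii > 2.0).astype(int)

c = 10.0                                                # scale factor from the solution
X_expanded = np.c_[X, c * np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)]

for name, data in [("raw (x, y)", X), ("expanded (x, y, c*r)", X_expanded)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    acc = max(np.mean(labels == true_ring), np.mean(labels != true_ring))
    print(f"{name:22s} agreement with rings: {acc:.2f}")
# Raw coordinates give roughly chance-level agreement; the expanded features give ~1.0.
```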


Linear Algebra
Question 23 Given a matrix $X$ of shape $D \times N$ with a singular value decomposition (SVD) $X = U S V^\top$,
suppose $X$ has rank $K$ and let $A = X X^\top$.
Which one of the following statements is false?

The eigenvalues of $A$ are the singular values of $X$
$A$ is positive semi-definite, i.e. all eigenvalues of $A$ are non-negative
The eigendecomposition of $A$ is also a singular value decomposition (SVD) of $A$
A vector $x$ that can be expressed as a linear combination of the last $D - K$ columns of $U$, i.e.
$x = \sum_{i=K+1}^{D} w_i u_i$ (where $u_i$ is the $i$-th column of $U$), lies in the null space of $X^\top$

Solution:

• “A is PSD...” and “The eigendecomposition of...” are true because $A = U S^2 U^\top$.

• “The eigenvalues of... are the singular values...” is false because the eigenvalues of $A$ are the squares
of the singular values of $X$.

• “A vector x that...” is true.
Proof: $X^\top = V S^\top U^\top$, so $X^\top x = X^\top \sum_{i=K+1}^{D} w_i u_i = \sum_{i=K+1}^{D} w_i (X^\top u_i) = \sum_{i=K+1}^{D} w_i (V S^\top e_i)$; the
last equality follows from the fact that $u_i$ is orthogonal to every row of $U^\top$ except the $i$-th row, i.e.
$U^\top u_i = e_i$. As $X^\top$ has rank $K$, $S^\top$ has non-zeros along the diagonal only up to the $K$-th row/column,
i.e. $S^\top e_i = 0$ for $i \geq K + 1$. Therefore $X^\top x = 0$.
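
A quick NumPy check of these facts (not part of the exam), on a randomly generated rank-K matrix:

```python
# Eigenvalues of A = XXᵀ equal the squared singular values of X, and the trailing
# left-singular vectors lie in the null space of Xᵀ.
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 5, 8, 3
X = rng.normal(size=(D, K)) @ rng.normal(size=(K, N))   # rank-K matrix of shape D×N

U, s, Vt = np.linalg.svd(X)
A = X @ X.T
eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]

print(np.allclose(eigvals[:len(s)], s ** 2, atol=1e-8))  # True: eigenvalues = singular values²
x = U[:, K:] @ rng.normal(size=D - K)                    # combination of the last D-K columns of U
print(np.allclose(X.T @ x, 0, atol=1e-8))                # True: x is in the null space of Xᵀ
```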


Second part: true/false questions


For each question, mark the box (without erasing) TRUE if the statement is always true and the box
FALSE if it is not always true (i.e., it is sometimes false).

Question 24 (Neural Networks) Weight sharing allows CNNs to deal with image data without using too
many parameters. However, weight sharing increases the variance of a model.

TRUE FALSE

Solution: False, weight sharing increases bias.

Question 25 (Kernels) For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm.
Define the function $k(x, x') = \frac{1}{1 - x^\top x'}$ on the set of $x, x' \in \mathbb{R}^D$ such that $\|x\|_2 < 1$ and $\|x'\|_2 < 1$. The function
$k(x, x')$ is a valid kernel.

TRUE FALSE

Solution: True. It is a valid kernel, because $k(x, x') = \frac{1}{1 - x^\top x'} = \sum_{i=0}^{\infty} (x^\top x')^i$, i.e. a sum with
non-negative coefficients of powers of the linear kernel $x^\top x'$, which is therefore itself a valid kernel.

Question 26 (Bias/Variance Decomposition) Consider a linear regression model where the data is generated
by input $x$ and output $y = w^\top x + \varepsilon$, where $w$ is a fixed vector and $\varepsilon$ is Gaussian noise with zero
mean and variance $\sigma^2$. Then there is no machine learning algorithm that can achieve a training error lower
than $\sigma^2$.

TRUE FALSE

Solution: False. It is the true error that cannot be less than σ 2 instead of the training error. Moreover,
overfitting is always an option, and it is often possible to have a small training error.
Question 27 (Logistic regression) Consider a binary classification task with loss $L(w) = \sum_{n=1}^{N} -y_n x_n^\top w + \log(1 + e^{x_n^\top w})$,
with linearly separable samples and $y \in \{0, 1\}$. Then there exists an optimal $w$ with exactly 0 loss and
100% training accuracy.

TRUE FALSE

Solution: False. While there exists a $w$ which achieves 100% training accuracy, exactly 0 loss is not attainable: the loss only approaches 0 as $\|w\|_2 \to \infty$.


Question 28 (Linear Models) For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean
norm. For $y \in \mathbb{R}^D$, $X \in \mathbb{R}^{D \times D}$, the solution of the least squares problem

$$w^\star = \operatorname*{argmin}_{w \in \mathbb{R}^D} \|y - Xw\|_2^2$$

is always unique.

TRUE FALSE

Solution: False. This is true only if $X$ is full-rank. Suppose that

$$X = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \quad \text{and} \quad y = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},$$

then $w_1^\star = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$ and $w_2^\star = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$ are both solutions.

Question 29 (PCA) Your friend performed Principal Component Analysis on some data and claimed that
he retained at least 95% of the variance using $k$ principal components. This is equivalent to
$\frac{\sum_{i \geq k+1} \lambda_i}{\sum_i \lambda_i} \leq 0.05$,
where $\lambda_1, \ldots, \lambda_k, \ldots$ are the eigenvalues associated to each principal component, sorted in non-increasing
order.

TRUE FALSE

Solution: True. The variance explained with $k$ principal components is at least 95%. This means that
$\frac{\sum_{i \leq k} \lambda_i}{\sum_i \lambda_i}$ is at least 0.95, which is equivalent to $\frac{\sum_{i \geq k+1} \lambda_i}{\sum_i \lambda_i} \leq 0.05$. Hence, the claim is true.
Question 30 (Adversarial robustness) Let $\|v\|_\infty := \max_i |v_i|$ denote the $\ell_\infty$-norm. Assume a binary
classification problem with $y \in \{-1, 1\}$. Then the adversarial zero-one loss $\max_{\delta: \|\delta\|_\infty \leq \varepsilon} 1_{y f(x+\delta) \leq 0}$ ($1_C$ is
equal to 1 if the condition $C$ is true and to 0 if $C$ is false) is always upper bounded by the adversarial hinge
loss $\max_{\delta: \|\delta\|_\infty \leq \varepsilon} \max\{0, 1 - y f(x + \delta)\}$.

TRUE FALSE

Solution: True. Since the hinge loss upper bounds the zero-one loss pointwise, taking the maximum over $\delta$ of both
of them preserves the inequality between them.

Question 31 (Expectation Maximization) If we run the Expectation Maximization (EM) algorithm on the
log-likelihood for a Gaussian Mixture Model, then the sequence of log-likelihoods {L(θ (t) )}t∈N is guaranteed
to be non-decreasing.

TRUE FALSE

Solution: True. By construction of the algorithm, each parameter update does not decrease the log-likelihood.


Question 32 (FastText Supervised Classifier) The FastText supervised classifier can be modelled as a
two-layer linear (one hidden layer) neural network with a non-linear loss function at the output.

TRUE FALSE

Solution: True. The FastText supervised classifier consists of two layers - embedding layer and linear
classifier. The embedding layer can be thought of as another linear layer if the input is thought of as
encoding word frequency.

Question 33 (Matrix Factorization) Consider the matrix factorization objective function
$\frac{1}{2} \sum_{(d,n) \in \Omega} \left( x_{dn} - (W Z^\top)_{dn} \right)^2$. When minimizing this objective with SGD over the embedding matrices
$W$ and $Z$, you should initialize $W$ and $Z$ with zeros.

TRUE FALSE

Solution: False. In fact, it will not work if you initialize the embeddings to zero. ‘Zero’ is a stationary
point of the optimization problem. If you start all the parameters at zero, they will never change.
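
A tiny sketch (not part of the exam) of why zero initialization fails: at W = Z = 0 the gradient of the objective is exactly zero, so (S)GD never moves. The fully observed toy matrix below is an assumption for brevity.

```python
# At W = Z = 0 the gradient of ½ Σ (x_dn - (WZᵀ)_dn)² vanishes.
import numpy as np

rng = np.random.default_rng(0)
D, N, r = 6, 5, 2
X = rng.normal(size=(D, N))          # observed ratings (all entries observed here)

W = np.zeros((D, r))
Z = np.zeros((N, r))

error = X - W @ Z.T                  # residual matrix
grad_W = -error @ Z                  # ∂loss/∂W
grad_Z = -error.T @ W                # ∂loss/∂Z
print(np.linalg.norm(grad_W), np.linalg.norm(grad_Z))  # both 0.0: a stationary point
```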

Question 34 (MSE and Neural Networks) The mean squared error (MSE) is convex w.r.t. the parameters
of a multi-layer perceptron with more than one hidden layer and linear activation functions.

TRUE FALSE
AF
Solution: False. Consider a simple two layer neural network with single neuron in each layer such that the
output of the network is equal to w1 w2 x where w1 is the weight of the first layer and w2 is the weight of the
second layer. You can simply check that MSE loss is not convex w.r.t the parameters.
DR


Third part, open questions


Answer in the space provided! Your answer must be justified with all steps. Leave the check-boxes empty;
they are used for the grading.

PCA
Let x1 , ..., xN be a dataset of N vectors in RD .

Question 35: (1 point.) What does it mean for the data vectors x1, ..., xN to be centered, as required for principal
component analysis (PCA) to be meaningful? Use the notation xnd for individual entries.

0 1

Solution: The data is centered, i.e. $\mathbb{E}[x] = 0$, or in other words $\frac{1}{N}\sum_{n=1}^{N} x_n = 0$, or $\frac{1}{N}\sum_{n=1}^{N} x_{nd} = 0$ for all $d$.
Question 36: (1 point.) Write down the covariance matrix of the dataset X = (x1 , ..., xN ) ∈ RD×N , and
state its dimensions. (Note that for PCA we assume data to be already centered, as in the previous
question)

0 1

Solution: $\mathrm{cov} = \frac{1}{N} X X^\top \in \mathbb{R}^{D \times D}$.
Now let x be a random vector distributed according to the uniform distribution over the finite normalized
dataset x1, ..., xN from above. Consider the problem of finding a unit vector $w \in \mathbb{R}^D$ such that the random
variable $w^\top x$ has maximal variance.

Question 37: (1 point.) What is the variance of the random variable $w^\top x$ over the randomness of $x$?

0 1

Solution: $\mathrm{Var}[w^\top x] = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n)^2$
Question 38: (2 points.) Show that the solution of the problem $\operatorname{argmax}_{w:\|w\|=1} \mathrm{Var}[w^\top x]$ is to set $w$ to
be the first principal vector of x1, ..., xN.

0 1 2

Solution: $\mathrm{Var}[w^\top x] = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} w^\top x_n x_n^\top w = \frac{1}{N} w^\top X X^\top w$, which (by
the variational characterization of eigenvectors) is maximized over unit vectors $w$ when $w$ is the top eigenvector of
$X X^\top$. (One can add an argument for why this is also the top left-singular vector of $X$.)
Question 39: (1 point.) Explain in words what the above result says about how PCA relates to the dataset.

0 1

Solution: Among all directions, the first principal component is the one onto which the projected data has
maximum variance. (Mentioning “max variance” alone is not sufficient; one needs to say what kind of variance,
related to the data and to the direction.)
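
A short NumPy sketch (not part of the exam) mirroring Questions 35 to 39 on an invented dataset: centre the data, form the covariance matrix, take its top eigenvector, and check that no random direction yields a larger projected variance.

```python
# PCA by hand: centering, covariance, top eigenvector, and the max-variance property.
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 3
X = rng.normal(size=(D, N)) * np.array([[3.0], [1.0], [0.3]])   # columns are samples

X = X - X.mean(axis=1, keepdims=True)        # Q35: centre, so (1/N) Σ x_nd = 0 for all d
C = X @ X.T / N                              # Q36: covariance matrix, shape D × D

eigvals, eigvecs = np.linalg.eigh(C)
w_pc1 = eigvecs[:, -1]                       # top eigenvector = first principal direction

var_pc1 = np.mean((w_pc1 @ X) ** 2)          # Q37: variance of wᵀx under the empirical distribution
random_dirs = rng.normal(size=(100, D))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
var_random = np.mean((random_dirs @ X) ** 2, axis=1)

print(var_pc1 >= var_random.max())           # Q38-39: True, the first PC maximizes projected variance
```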


The Perceptron
Setting

Let us consider a binary classification problem with a training set $S = \{(x_n, y_n)\}_{n=1}^{N}$ such that:

$$x_n \in \mathbb{R}^D, \text{ and } y_n \in \{-1, 1\}, \text{ for all } n = 1, \cdots, N,$$

where N, D are integers such that N, D ≥ 1.


We consider the Perceptron classifier which classifies $x \in \mathbb{R}^D$ following the rule:

$$f_{w,b}(x) = \operatorname{sgn}(w^\top x + b),$$

where $w \in \mathbb{R}^D$ is the weight vector, $b \in \mathbb{R}$ is the threshold, and the sign function is defined as

$$\operatorname{sgn}(z) = \begin{cases} +1 & \text{if } z \geq 0 \\ -1 & \text{if } z < 0 \end{cases}$$

Question 40: (1 point.) As seen in the course, explain how we can ignore the threshold $b$ and only deal
with classifiers passing through the origin, i.e., of the form $f_w(x) = \operatorname{sgn}(w^\top x)$.

0 1

Solution: We can view the threshold as an additional weight by appending the constant 1 to the input
$x$. It amounts to considering the augmented input $\tilde{x}^\top = [x^\top, 1]$ and weight $\tilde{w}^\top = [w^\top, b]$, since $\tilde{x}^\top \tilde{w} = x^\top w + b$.
AF
For the remainder of the exercise, we proceed without this additive threshold, as explained in the previous
question, and only consider classifiers of the form $f_w(x) = \operatorname{sgn}(w^\top x)$. We make the following two
assumptions:

• Bounded input: There exists a real number $R \geq 0$ such that for all $n = 1, \cdots, N$ we have
$\|x_n\|_2 \leq R$,
where for any vector $z \in \mathbb{R}^D$ we use $\|z\|_2$ to refer to the Euclidean norm of $z$, i.e., $\|z\|_2 = \sqrt{\sum_{d=1}^{D} z_d^2}$.

• Linearly separable data: There exists $w^\star \in \mathbb{R}^D$ and a real number $\gamma > 0$ such that for all
$n = 1, \cdots, N$ we have $y_n w^{\star\top} x_n \geq \gamma$.

We will use the Perceptron algorithm to train the Perceptron classifier and find a separating hyperplane.
Let t denote the number of parameter updates we have performed and wt the weight vector after t updates.
We initialize with w0 = 0. If this weight vector is already a separating hyperplane, we are done. If not, we
pick an arbitrary point xi that is currently misclassified. This point is used to update the weight vector wt
as
$$w_{t+1} := w_t + y_i x_i \quad \text{where } y_i x_i^\top w_t \leq 0.$$

The algorithm is formally written below.


We will study the convergence of this learning algorithm.

Convergence

Question 41: (2 points.) For t > 0, show that when making the tth update according to the Perceptron
update, we have:

$$w^{\star\top} w_t \geq w^{\star\top} w_{t-1} + \gamma$$


Algorithm 1: Perceptron algorithm

t ← 0; w_t ← 0;
while there exists j ∈ {1, · · · , N} such that y_j x_j^⊤ w_t ≤ 0 do
    Pick an arbitrary i ∈ {1, · · · , N} such that y_i x_i^⊤ w_t ≤ 0;
    w_{t+1} ← w_t + y_i x_i;
    t ← t + 1
end
return w_t
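
A Python sketch of Algorithm 1 (not part of the exam) on an invented linearly separable dataset; it also reports the number of updates t, which can be compared against the bound derived in the following questions.

```python
# Perceptron algorithm on a toy linearly separable dataset.
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """X: (N, D) inputs, y: (N,) labels in {-1, +1}. Returns (w, number of updates)."""
    w = np.zeros(X.shape[1])
    for t in range(max_updates):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w, t
        i = mistakes[0]                 # pick an arbitrary misclassified point
        w = w + y[i] * X[i]             # Perceptron update
    return w, max_updates

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
X = X[np.abs(X @ w_true) > 0.5][:100]   # keep 100 points with a margin around the separator
y = np.sign(X @ w_true)

w, t = perceptron(X, y)
print("updates:", t, "training accuracy:", np.mean(np.sign(X @ w) == y))
```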

0 1 2

Solution: Let us assume that the $t$-th update is made on sample $i$; we then have

$$w^{\star\top} w_t = w^{\star\top} (w_{t-1} + y_i x_i) = w^{\star\top} w_{t-1} + y_i w^{\star\top} x_i \geq w^{\star\top} w_{t-1} + \gamma.$$

The first equality follows from the definition of the Perceptron update and the last inequality follows from
the assumption of linear separability.

Question 42: (1 point.) Show that it implies that

$$w^{\star\top} w_t \geq t\gamma$$

0 1

Solution: By induction on $t$ we obtain $w^{\star\top} w_t \geq w^{\star\top} w_{t-1} + \gamma \geq \cdots \geq w^{\star\top} w_0 + t\gamma$, and by definition $w_0 = 0$.
Therefore $w^{\star\top} w_t \geq t\gamma$.

Question 43: (1 point.) Using the previous question, derive a lower bound on $\|w_t\|_2$ depending on $t$, $\gamma$
and $\|w^\star\|_2$.
(We recall that a lower bound is a constant $C > 0$ such that $\|w_t\|_2 \geq C$. Note that we will only accept
non-trivial lower bounds as correct answers.)

0 1

Solution: Using the Cauchy-Schwarz inequality, $\|w_t\|_2 \|w^\star\|_2 \geq w^{\star\top} w_t \geq t\gamma$. Therefore $\|w_t\|_2 \geq \frac{t\gamma}{\|w^\star\|_2}$.

Question 44: (3 points.) Derive the following upper bound on the squared norm $\|w_t\|_2^2$:

$$\|w_t\|_2^2 \leq \|w_{t-1}\|_2^2 + R^2$$

0 1 2 3

Solution: We again assume that the $t$-th update is made on sample $i$; we then have

$$\|w_t\|_2^2 = \|w_{t-1} + y_i x_i\|_2^2 = \|w_{t-1}\|_2^2 + 2 y_i w_{t-1}^\top x_i + \|x_i\|_2^2 \leq \|w_{t-1}\|_2^2 + \|x_i\|_2^2 \leq \|w_{t-1}\|_2^2 + R^2.$$

The first equality follows from the definition of the Perceptron update, the second from expanding the square,
the first inequality from the fact that updates are made only on mistakes, i.e., $y_i w_{t-1}^\top x_i \leq 0$ whenever an
update is made, and the last inequality from the boundedness assumption on the input.

Question 45: (1 point.) Show that it implies that

$$\|w_t\|_2^2 \leq t R^2$$

0 1

Solution: By induction on $t$ we obtain $\|w_t\|_2^2 \leq \|w_{t-1}\|_2^2 + R^2 \leq \cdots \leq \|w_0\|_2^2 + t R^2$, and by definition $w_0 = 0$.
Therefore $\|w_t\|_2^2 \leq t R^2$.

Question 46: (3 points.) Combine the results obtained in the previous questions to obtain an upper bound
on t depending on the quantities R, γ and w? .

0 1 2 3

Solution: Using Question 43 we get

$$\|w_t\|_2 \geq \frac{t\gamma}{\|w^\star\|_2}.$$

From Question 45 we have

$$\|w_t\|_2^2 \leq t R^2.$$

Therefore, combining both, we obtain

$$\frac{t^2 \gamma^2}{\|w^\star\|_2^2} \leq \|w_t\|_2^2 \leq t R^2,$$

which yields

$$t \leq \frac{R^2 \|w^\star\|_2^2}{\gamma^2}.$$

Question 47: (2 points.) Qualitatively analyze the previous result. What have you proven? Interpret the
dependency on R, γ and w? .

0 1 2

Solution: We have shown that the Perceptron algorithm terminates with a correct (separating) solution after a
finite number of updates, and that this number of updates is inversely proportional to the squared margin $\gamma^2$ and
proportional to $R^2 \|w^\star\|_2^2$.

Perceptron and margin

Recall that we define the max-margin $M^\star$ as

$$M^\star = \max_{w \in \mathbb{R}^D, \|w\|_2 = 1} M \quad \text{such that} \quad y_n x_n^\top w \geq M \text{ for } n = 1, \cdots, N,$$

and a max-margin separating hyperplane $\bar{w}$ as a solution of this problem:

$$\bar{w} \in \operatorname*{arg\,max}_{w \in \mathbb{R}^D, \|w\|_2 = 1} M \quad \text{such that} \quad y_n x_n^\top w \geq M \text{ for } n = 1, \cdots, N.$$


Question 48: (2 points.) Bound the number of perceptron updates $t$ using the quantities $R$ and $M^\star$. Prove
your result.

0 1 2

Solution: The max-margin hyperplane $\bar{w}$ satisfies the linear separability assumption with $\|\bar{w}\|_2 = 1$ and
$\gamma = M^\star$; equivalently, by definition of $\gamma$ and $M^\star$ we have $\gamma / \|w^\star\|_2 \leq M^\star$ (1 point if proven properly).
Applying the bound of Question 46 with $w^\star = \bar{w}$ and $\gamma = M^\star$, we obtain

$$t \leq \frac{R^2}{(M^\star)^2}.$$
Question 49: (1 point.) Does it imply that the output of the Perceptron algorithm is a max-margin
separating hyperplane?

0 1

Solution: No, it does not: the bound only controls the number of updates, and we have no guarantee about which
separating hyperplane the Perceptron algorithm converges to.

