EPFL Machine Learning Final Exam 2021 Solutions
Do not turn the page before the start of the exam. This document is double-sided, has 20
pages, the last ones are possibly blank. Do not unstaple.
• This is a closed book exam. No electronic devices of any kind.
• Place on your desk: your student ID, writing utensils, one double-sided A4 page cheat sheet (hand-
written or 11pt min font size) if you have one; place all other personal items below your desk.
• You each have a different exam.
• Only answers in this booklet count. No extra loose answer sheets. You can use the last two pages
as scrap paper.
• For the multiple choice questions, we give:
+2 points if your answer is correct,
0 points if you give no answer or more than one,
−0.5 points if your answer is incorrect.
• For the true/false questions, we give:
+1 point if your answer is correct,
0 points if you give no answer or more than one,
−1 point if your answer is incorrect.
• Use a black or dark blue ballpoint pen and, if necessary, clearly erase mistakes with correction fluid.
• If a question turns out to be wrong or ambiguous, we may decide to nullify it.
Robustness to outliers
We consider a classification problem on linearly separable data. Our dataset has an outlier: a point that is very far (in distance) from the other datapoints, and also far from the SVM margin, but still correctly classified by the SVM classifier.
We trained SVM, logistic regression and 1-nearest-neighbour models on this dataset. We tested the trained models on a test set that comes from the same distribution as the training set but does not contain any outlier points. After that, we removed the outlier and retrained our models.
Question 1 After retraining, which classifier will change its decision boundary around the test points?
Logistic regression
SVM
1-nearest-neighbors classifier
All of them
Solution: The solution of the SVM problem only depends on the support vectors, and the outlier is not a support vector, as stated in the problem. Thus the SVM decision boundary will not change after retraining. The decisions of the 1-NN classifier are based on the nearest datapoints, which will not be the outlier, since the outlier is very far from all other datapoints. Thus the decisions of the 1-NN classifier will also not change. The logistic regression solution depends on all of the datapoints, so removing one point will change the decision boundary.
SVM
Question 2 For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm. The hard-margin SVM problem for linearly separable points in $\mathbb{R}^D$ is to minimize the Euclidean norm $\|w\|_2$ under some constraints. What are the additional constraints for this optimization problem?
$w^\top x_n \ge 1 \quad \forall n \in \{1, \dots, N\}$
$\frac{w^\top x_n}{y_n} \ge 1 \quad \forall n \in \{1, \dots, N\}$
$y_n w^\top x_n \ge 1 \quad \forall n \in \{1, \dots, N\}$
$y_n + w^\top x_n \ge 1 \quad \forall n \in \{1, \dots, N\}$
Cross Validation
Question 3 Consider the K-fold cross validation on a linear regression model with a sufficiently large
amount of training data. When K is large, the computational complexity of the K-fold cross validation with
respect to K is of order
O(1/K).
O(K).
O(1).
O(K(K − 1)).
Solution: Because each training process uses a (K − 1)/K = O(1) fraction of the data, and there are K such processes.
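A minimal sketch of the counting argument, under the assumption of synthetic data and a closed-form least-squares fit: each of the K folds triggers one training run on a (K − 1)/K = O(1) fraction of the data, so the total cost grows like O(K).

```python
import numpy as np

def fit_least_squares(X, y):
    # Closed-form linear regression fit via least squares.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def k_fold_cv_mse(X, y, K):
    N = len(y)
    folds = np.array_split(np.arange(N), K)
    errors = []
    for val_idx in folds:                                   # K iterations ...
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        w = fit_least_squares(X[train_idx], y[train_idx])   # ... each on ~(K-1)/K of the data
        residual = y[val_idx] - X[val_idx] @ w
        errors.append(np.mean(residual ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

for K in (2, 5, 10, 50):
    print(K, "folds -> CV MSE:", round(k_fold_cv_mse(X, y, K), 4))
# Total work = K trainings, each on an O(1) fraction of the data, hence O(K) overall.
```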
Bias-Variance Decomposition
Question 4 Consider a regression model where the data (x, y) is generated by an input x sampled uniformly at random from [0, 1] and $y(x) = x^2 + \varepsilon$, where ε is random noise with mean 0 and variance 1. Two models are fit to the data: model A is a trained quadratic function $g(x; w) = w_2 x^2 + w_1 x + w_0$ where $w = (w_0, w_1, w_2)^\top \in \mathbb{R}^3$, and model B is the constant function h(x) = 1/2. Then, compared to model B, model A has
Solution: Model B has zero variance because it outputs the constant 1/2, which does not depend on the training data.
Optimization
Question 5 Let n be an integer such that n ≥ 2, let $A \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^n$, and consider the function $f(x) = x^\top A x$ defined over $\mathbb{R}^n$. Which of the following is the gradient of the function f?
$A^\top x + Ax$
$2A^\top x$
$2x^\top A$
$2Ax$
Solution: $\nabla f(x) = A^\top x + Ax$. See lab 11, exercise 1; here the matrix A is not symmetric.
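A quick numerical check of this gradient (an illustrative addition, with a randomly drawn non-symmetric A): the analytic expression $A^\top x + Ax$ matches central finite differences, while $2Ax$ does not in general.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))          # deliberately not symmetric
x = rng.normal(size=n)

f = lambda v: v @ A @ v

analytic = A.T @ x + A @ x           # claimed gradient: A^T x + A x

# Central finite differences as an independent reference.
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(analytic, numeric, atol=1e-5))   # True
print(np.allclose(2 * A @ x, numeric, atol=1e-5))  # False in general (only if A is symmetric)
```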
Question 6 Which of the following functions reaches a global maximum on the set I? (Note that [., .]
and (., .) denote closed and open intervals respectively)
f1 , f4
f1 , f3 , f4
f1 , f2 , f3 , f4
f1 , f2 , f4
Solution:
• f1: Yes, computing the first and second derivative leads to the maximum being reached at x = 0. Also note that any continuous function on a compact set reaches its maximum and minimum.
• f2: No. Recall the derivative of arccos:
$$\cos(\arccos(x)) = x$$
$$-\sin(\arccos(x)) \arccos'(x) = 1$$
$$\arccos'(x) = -\frac{1}{\sin(\arccos(x))} = -\frac{1}{\sqrt{1 - \cos(\arccos(x))^2}} = -\frac{1}{\sqrt{1 - x^2}}$$
The derivative of f cannot be equal to 0 on I.
• f3: No, the first derivative is $-e^x(x - 1)$, hence no maximum can be reached on I.
• f4: Yes, since f is the product of the periodic functions $\sin(\cos(x))$ and $\sin(x)$, it is also periodic. Since f is periodic and I has infinite measure, f reaches its maximum infinitely many times.
• f5 : Yes, Computing the gradient of f , we find ∇f (x1 , x2 ) = (− sgn(x1 ) − 6x2 , −10x2 ), which is equal
to (0, 0) for (x1 , x2 ) = (0, 0). Then computing the Hessian, we find that it is a diagonal matrix filled
with negative values, hence f does reach a maximum at (0, 0).
Linear Models
Question 7 Consider a linear regression model on a dataset which we split into a training set and a test
set. After training, our model gives a mean-squared error of 0.1 on the training set and a mean-squared
error of 5.3 on the test set. Recall that the mean-squared error (MSE) is given by:
$$\mathrm{MSE}_w(y, X) = \frac{1}{2N} \sum_{n=1}^{N} \left(y_n - x_n^\top w\right)^2$$
Which of the following statements is correct?
Ridge regression can help reduce the gap between the training MSE and the test MSE.
Retraining the model with feature augmentation (e.g. adding polynomial features) will increase the
training MSE.
Retraining while discarding some training samples will likely reduce the gap between the train MSE
and the test MSE.
Using cross-validation can help decrease the training MSE of this very model.
Solution:
• feature augmentation: Incorrect, using feature augmentation will increase overfitting, hence decrease the training MSE even more.
• cross-validation: Incorrect, cross-validation can help to select a model that overfits less; it does not help to get better performance out of this specific model.
• discarding some training samples: Incorrect, reducing the number of training samples is more likely to increase the overfitting.
• Ridge regression: Correct, the regularization from ridge regression can be useful to reduce overfitting.
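A small sketch of the ridge point on made-up data: with aggressive polynomial feature augmentation and few samples, plain least squares shows a large train/test gap, and an L2 penalty shrinks it. The data, the degree and the λ values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_features(x, degree):
    # Feature augmentation: [1, x, x^2, ..., x^degree].
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    # Solve min_w ||y - Xw||^2 + lam * ||w||^2 via an augmented least-squares system.
    D = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
    y_aug = np.concatenate([y, np.zeros(D)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

def mse(X, y, w):
    # Matches the exam's definition: (1 / 2N) * sum of squared residuals.
    return np.mean((y - X @ w) ** 2) / 2

# Noisy quadratic ground truth, few training points.
f_true = lambda x: 1 + 2 * x - x ** 2
x_tr = rng.uniform(-1, 1, size=15)
y_tr = f_true(x_tr) + 0.3 * rng.normal(size=15)
x_te = rng.uniform(-1, 1, size=500)
y_te = f_true(x_te) + 0.3 * rng.normal(size=500)

degree = 12                      # over-parameterized on purpose
X_tr, X_te = poly_features(x_tr, degree), poly_features(x_te, degree)

for lam in (0.0, 1e-2, 1.0):
    w = fit_ridge(X_tr, y_tr, lam)
    print(f"lambda={lam:g}: train MSE={mse(X_tr, y_tr, w):.3f}, test MSE={mse(X_te, y_te, w):.3f}")
# lambda = 0 typically shows the largest train/test gap; a moderate lambda shrinks it.
```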
Regularization
Question 8 Which of the following statements is incorrect?
Solution:
• reduce the storage: is true, the parameters tend to be sparse when using L1-regularization, thus parameters with value 0 do not need to be stored.
• escaping local minima: is false (so check this box), L1-regularization has nothing to do with the optimization of the model.
• reduce overfitting: is true, it is a regularization technique, hence it reduces the complexity of the model.
Neural networks
Let f : RD → R be an L-hidden layer multi-layer perceptron (MLP) such that
$C_{g,1/2}$ may have an accuracy significantly lower than $C_{f,1/2}$ on T
$C_{q,0}$ may have an accuracy significantly lower than $C_{f,1/2}$ on T
Solution: Since the threshold 1/2 for the sigmoid corresponds to the input to the last activation function being positive, $C_{h,0}$ is true. Moreover, multiplying the weights by 2 does not change the sign of the output. Therefore both $C_{g,1/2}$ and $C_{q,0}$ are also true.
Question 10 Assume the weights in f were obtained by initializing all parameters to zero and running SGD. Assume there exists τ such that $C_{f,\tau}$ is a good classifier. Let g be the MLP obtained by randomly keeping one neuron per layer and removing the other M − 1 neurons from each layer, and multiplying all weights except for the first layer by M (the number of neurons in each layer in the original network). What is the probability that $C_{g,\tau}$ is a good classifier?
Less than $1/M^L$
Between $1/M^L$ and $1/M$
1
Not possible to determine with the given information.
Solution: When all weights are initialized to zero, the value for all hidden neurons will be the same and
remain the same. As such, keeping one neuron in each layer and multiplying its output by the number of
neurons will keep the output unchanged. Therefore g and f are equal.
Question 11 Which of the following techniques do not improve the generalization performance in deep
learning?
Solution: Once correct hyperparameters are chosen, any of these strategies may be applied to improve
the generalization performance.
Adversarial Examples
Consider a linear model $\hat{y} = x^\top w$ with the squared loss under an $\ell_\infty$-bounded adversarial perturbation. For a single point (x, y), it corresponds to the following objective:
$$\max_{\tilde{x}:\ \|x - \tilde{x}\|_\infty \le \varepsilon} \left(y - \tilde{x}^\top w\right)^2, \qquad \text{(OP)}$$
where $\|\cdot\|_\infty$ denotes the $\ell_\infty$-norm, i.e. the constraint means $|x_i - \tilde{x}_i| \le \varepsilon$ for every i.
Question 12 Assume that $w = (3, -2)^\top$, $x = (-1, 2)^\top$, $y = 2$. What is the maximum value of the optimization problem in Eq. (OP)?
$(9 + 5\varepsilon)^2$
$(3 + 10\varepsilon)^2$
$(10 - \varepsilon)^2$
Other
$(5 + 9\varepsilon)^2$
Solution: First, it is convenient to reparametrize the objective in terms of an additive perturbation δ: $\max_{\delta: \|\delta\|_\infty \le \varepsilon} \left(y - x^\top w - \delta^\top w\right)^2$. If we plug in the given values of w, x, y, we get $\max_{\delta: \|\delta\|_\infty \le \varepsilon} (9 - 3\delta_1 + 2\delta_2)^2$. We can maximize this objective independently over $\delta_1$ and $\delta_2$ by noting that the optimal value is attained at a boundary of the feasible set, i.e. for $|\delta_1| = |\delta_2| = \varepsilon$. This leads to the maximizer $\delta^\star = (-\varepsilon, \varepsilon)^\top$ and the maximum value $(9 + 5\varepsilon)^2$.
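An illustrative numerical confirmation of this value: since the objective is convex in δ, its maximum over the ℓ∞ ball lies at a corner, so checking the four corners for a fixed ε suffices (ε = 0.1 is an arbitrary choice).

```python
import numpy as np
from itertools import product

w = np.array([3.0, -2.0])
x = np.array([-1.0, 2.0])
y = 2.0
eps = 0.1

best = -np.inf
for signs in product([-1.0, 1.0], repeat=2):
    # The objective is convex in delta, so its maximum over the l_inf ball
    # is attained at one of the corners delta = (+-eps, +-eps).
    delta = eps * np.array(signs)
    val = (y - (x + delta) @ w) ** 2
    best = max(best, val)

print(best, (9 + 5 * eps) ** 2)   # both should be 90.25 for eps = 0.1
```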
Question 13 Assume that $w = (3, -2)^\top$, $x = (-1, 2)^\top$, $y = 2$. What is the optimal $\tilde{x}^\star$ that maximizes the objective in Eq. (OP)?
$(-1 + \varepsilon, 2)^\top$
$(-1 - \varepsilon, 2 - \varepsilon)^\top$
$(-1 - \varepsilon, 2)^\top$
$(-1 + \varepsilon, 2 + \varepsilon)^\top$
Other
Solution: The optimal $\delta^\star$ is equal to $(-\varepsilon, \varepsilon)^\top$. This corresponds to $\tilde{x}^\star = x + \delta^\star = (-1-\varepsilon, 2+\varepsilon)^\top$, which corresponds to the answer "Other".
KNN
Question 14 The KNN algorithm needs a notion of distance to assess which points are "nearest". Identify the distance measures that can be used in the KNN algorithm:
(a) Euclidean distance: the distance associated with the L2 norm $\|x\|_2 := \sqrt{x_1^2 + \cdots + x_D^2}$
(b) Manhattan distance: the distance associated with the L1 norm $\|x\|_1 := |x_1| + \cdots + |x_D|$
(c) The distance associated with the L4 norm $\|x\|_4 := \left(|x_1|^4 + \cdots + |x_D|^4\right)^{1/4}$
only a and c
only b
only c
a, b and c
only b and c
only a and b
only a
Solution: The distance measure is an algorithmic choice in KNN; any valid distance, including all three above, can be used.
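All three are valid distances, so any of them can drive KNN. The sketch below (synthetic two-blob data and an arbitrary k, both assumptions) runs scikit-learn's KNN with the Minkowski metric for p = 1, 2 and 4.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=2.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for p in (1, 2, 4):   # L1 (Manhattan), L2 (Euclidean), L4 norms
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    acc = knn.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"L{p} distance -> test accuracy {acc:.2f}")
```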
Linear Regression
Question 15 Assume we are doing linear regression with mean-squared loss and L2-regularization on four one-dimensional data points. Our prediction model can be written as f(x) = ax + b and the optimization problem can be written as
$$a^\star, b^\star = \operatorname*{argmin}_{a, b} \sum_{n=1}^{4} \left[y_n - f(x_n)\right]^2 + \lambda a^2.$$
Assume that our data points are Y = [1, 3, 2, 4] and X = [−2, −1, 0, 3]. For example, $y_1 = 1$ and $x_1 = -2$. What is the optimal value of $b^\star$?
2
None of the above answers.
Depends on the value of λ
3
2.5
Solution: If we take the derivative of the loss w.r.t. b and set it to zero, we get $\sum_{i=1}^{4} [y_i - a x_i - b] = 0$. Since $\sum_{i=1}^{4} a x_i = 0$ (the $x_i$ sum to zero), the optimal value for b is equal to the mean of the target values, $b = \frac{1+3+2+4}{4} = 2.5$.
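A quick numerical confirmation (illustrative only): solving the 2×2 normal equations of the penalized objective for several λ shows that b⋆ stays at 2.5, because the x-values sum to zero and the equations for a and b decouple.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

for lam in (0.0, 0.1, 1.0, 100.0):
    # Normal equations of  sum_n (y_n - a x_n - b)^2 + lam * a^2  w.r.t. (a, b).
    A = np.array([[np.sum(x ** 2) + lam, np.sum(x)],
                  [np.sum(x),            len(x)   ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a_star, b_star = np.linalg.solve(A, rhs)
    print(f"lambda={lam:g}: a*={a_star:.3f}, b*={b_star:.3f}")
# b* stays at 2.5 = mean(y) because sum(x) = 0, so the two equations decouple.
```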
Subgradients
Question 16 Consider the Parametric ReLU function defined as
$$f(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{otherwise} \end{cases}$$
where a ∈ R is an arbitrary number. Which of the following statements is true regarding the subgradients
of f (x) at x = 0?
Solution: For values of a > 1 the function f(x) has a concave kink at x = 0 (the slope decreases from a to 1), and hence no subgradient exists there. For a = 1 we have a unique subgradient. Hence the correct answer is None of the above answers.
Logistic regression
Consider a binary classification task as in Figure 1, which consists of 14 two-dimensional linearly separable samples (circles correspond to label y = 1 and pluses correspond to label y = 0). We would like to predict the label y = 1 of a sample $(x_1, x_2)$ when the following holds:
$$\Pr(y = 1 \mid x_1, x_2, w_1, w_2) = \frac{1}{1 + \exp(-w_1 x_1 - w_2 x_2)} > 0.5$$
[Figure 1: the 14 samples plotted in the $(x_1, x_2)$ plane, with the origin at (0, 0).]
Question 17 If we train this logistic regression model with an additional penalty term $\frac{C}{2} w_2^2$ added to the cost, where C is very large, then the decision boundary will be close to which of the following lines?
x1 + x2 = 0
x1 = 0
x2 = 0
x1 − x2 = 0
Solution: When C is very large, we are essentially optimizing over $\frac{C}{2} w_2^2$, and thus $w_2$ is close to 0. Then $\Pr(y = 1 \mid x_1, x_2, w_1, w_2) = \frac{1}{1 + \exp(-w_1 x_1)}$ indicates that the classifier outputs 1 or 0 depending on the sign of $x_1$. Therefore the decision boundary is $x_1 = 0$.
Question 18 […] recommender system?
Question 19 Which NLP model architectures can differentiate between the sentences “I have to read
this book.” and “I have this book to read.”?
only b and c
only b
only a and c
a, b and c
only c
only a
only a and b
Generative Networks
Question 20 Consider a Generative Adversarial Network (GAN) which successfully produces images of
goats. Which of the following statements is false?
After the training, the discriminator loss should ideally reach a constant value.
The generator aims to learn the distribution of goat images.
The generator can produce unseen images of goats.
The discriminator can be used to classify images as goat vs non-goat.
Solution: “classify goat vs non-goat...” is FALSE, because the discriminator classifies images into real or
fake.
Clustering
Question 21 For any vector $v \in \mathbb{R}^D$ let $\|v\|_2 := \sqrt{v_1^2 + \cdots + v_D^2}$ denote the Euclidean norm. Let $x_1, \dots, x_N \in \mathbb{R}^D$ be a dataset of N ≥ 2 distinct points. For any integer K ≥ 1, consider the following value:
$$L_K^\star = \inf_{\substack{\mu_1, \dots, \mu_K \in \mathbb{R}^D \\ z_{nk} \in \{0,1\} \ \text{s.t.} \ \sum_{k=1}^{K} z_{nk} = 1}} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \, \|x_n - \mu_k\|_2^2$$
The statements below are all true except one. Which of the following statements is false?
For 2 ≤ K < N and with the initial means randomly chosen as K data points, the K-means algorithm
with K clusters is not guaranteed to reach the optimal L?K loss value.
For 2 ≤ K < N , L?K is hard to compute.
For K ≥ N , L?K = 0.
The sequence (L?K )1≤K≤N is not necessarily strictly decreasing.
Solution: $(L_K^\star)_{1 \le K \le N}$ is in fact necessarily strictly decreasing: since all the datapoints are distinct, for K < N we can always strictly improve the loss by adding a centroid $\mu_{K+1} = x_i$ for some $x_i$ with $x_i \ne \mu_k$ for all $1 \le k \le K$.
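An empirical illustration of the last point (synthetic data, scikit-learn's KMeans): the achievable loss decreases with K and reaches 0 at K = N, while random initialization only guarantees a local optimum, not $L_K^\star$ itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))          # N = 12 distinct points
N = len(X)

for K in (1, 2, 4, 8, 12):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances; divide by N to match the (1/N) in L*_K.
    print(f"K={K:2d}: approximate loss {km.inertia_ / N:.4f}")
# The values decrease with K and reach 0 at K = N (each point is its own centroid);
# with random initialization, KMeans is only guaranteed a local optimum, not L*_K itself.
```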
Question 22 You are given the data (xn , yn )1≤n≤N ∈ R2 illustrated in Figure 2 which you want to
cluster into an inner ring and an outer ring (hence a number of clusters K = 2). Which of the following
statement(s) is/are correct?
(a) There exists some initialization such that K-means clustering succeeds.
(b) There exists an appropriate feature expansion such that K-means (with standard initialization) suc-
ceeds.
(c) There exists an appropriate feature expansion such that the Expectation Maximization algorithm (with
standard initialization) for a Gaussian Mixture Model succeeds.
Only a and c
Only a
All of them
Only b and c
None of them
Only a and b
Only c
Only b
Solution: For example, the feature expansion $(x_i, y_i, c\sqrt{x_i^2 + y_i^2})_{1 \le i \le n}$ for c big enough will easily be clusterable using K-means or GMM.
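A sketch of the feature-expansion idea from the solution (synthetic rings, arbitrary noise level and constant c): appending $c\sqrt{x^2 + y^2}$ as a third feature separates the two rings along that axis, so plain K-means recovers them.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def ring(radius, n):
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + 0.1 * rng.normal(size=n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

X = np.vstack([ring(1.0, 200), ring(3.0, 200)])
labels_true = np.array([0] * 200 + [1] * 200)

# Raw coordinates: K-means cuts the data roughly in half -> poor clustering.
raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Feature expansion (x, y, c * radius) with c large enough to dominate.
c = 10.0
X_exp = np.column_stack([X, c * np.linalg.norm(X, axis=1)])
expanded = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_exp)

def agreement(pred):
    # Cluster labels are arbitrary, so take the better of the two matchings.
    acc = np.mean(pred == labels_true)
    return max(acc, 1 - acc)

print("raw features:     ", agreement(raw))       # roughly 0.5-0.75 (fails)
print("expanded features:", agreement(expanded))  # ~1.0 (succeeds)
```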
Linear Algebra
Question 23 Given a matrix X of shape D × N with a singular value decomposition (SVD) $X = USV^\top$, suppose X has rank K and let $A = XX^\top$. Which one of the following statements is false?
Solution:
• “A is PSD...” and “The eigendecomposition of...” are true because $A = U S^2 U^\top$.
• “The eigenvalues of... are the singular values...” is false because the eigenvalues of A are the squares of the singular values of X.
Proof: $X^\top = V S^\top U^\top$. For a vector $\vec{x} = \sum_{i=K+1}^{D} w_i \vec{u}_i$, we have $X^\top \vec{x} = \sum_{i=K+1}^{D} w_i (X^\top \vec{u}_i) = \sum_{i=K+1}^{D} w_i (V S^\top \vec{e}_i)$, where the last equality follows from the fact that $\vec{u}_i$ is orthogonal to every row of $U^\top$ except the i-th row, i.e. $U^\top \vec{u}_i = \vec{e}_i$. As X has rank K, $S^\top$ has non-zeros along the diagonal only up to the K-th row/column, i.e. $S^\top \vec{e}_i = 0$ for $i \ge K+1$. Therefore $X^\top \vec{x} = 0$.
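A one-line numerical check of the false statement (illustrative, random X): the eigenvalues of $A = XX^\top$ equal the squared singular values of X, not the singular values themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # D x N matrix
A = X @ X.T

sing = np.linalg.svd(X, compute_uv=False)    # singular values of X (descending)
eig = np.sort(np.linalg.eigvalsh(A))[::-1]   # eigenvalues of A (descending)

print(np.allclose(eig, sing ** 2))  # True: eigenvalues of A are squared singular values
print(np.allclose(eig, sing))       # False in general
```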
Question 24 (Neural Networks) Weight sharing allows CNNs to deal with image data without using too
many parameters. However, weight sharing increases the variance of a model.
TRUE FALSE
Question 25 (Kernels) Define the function $k(x, x') = \frac{1}{1 - x^\top x'}$ on the set of $x, x' \in \mathbb{R}^D$ such that $\|x\|_2 < 1$ and $\|x'\|_2 < 1$. The function $k(x, x')$ is a valid kernel.
TRUE FALSE
Solution: True. It is a valid kernel, because $k(x, x') = \frac{1}{1 - x^\top x'} = \sum_{i=0}^{\infty} (x^\top x')^i$, a sum (with non-negative coefficients) of polynomial kernels, which is itself a valid kernel.
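An illustrative numerical spot check of this argument: for random points strictly inside the unit ball, the Gram matrix of $k(x, x') = \frac{1}{1 - x^\top x'}$ has no negative eigenvalues (up to rounding), consistent with it being a valid kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 50

# Random points strictly inside the unit ball, so x^T x' < 1 and k is well defined.
X = rng.uniform(-0.5, 0.5, size=(N, D))      # norms are at most 0.5 * sqrt(3) < 1

G = 1.0 / (1.0 - X @ X.T)                    # Gram matrix of the candidate kernel
eigvals = np.linalg.eigvalsh(G)

print(eigvals.min() >= -1e-9)  # True: no significantly negative eigenvalue -> PSD (numerically)
```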
Question 26 (Bias/Variance Decomposition) Consider a linear regression model where the data is generated by input x and output $y = w^\top x + \varepsilon$, where w is a fixed vector and ε is Gaussian noise with zero mean and variance $\sigma^2$. Then there is no machine learning algorithm that can achieve a training error lower than $\sigma^2$.
TRUE FALSE
Solution: False. It is the true error that cannot be less than σ 2 instead of the training error. Moreover,
overfitting is always an option, and it is often possible to have a small training error.
Question 27 (Logistic regression) Consider a binary classification task with loss $L(w) = \sum_{n=1}^{N} -y_n x_n^\top w + \log(1 + e^{x_n^\top w})$ with linearly separable samples and $y \in \{0, 1\}$. Then there exists an optimal w with exact 0 loss and 100% training accuracy.
TRUE FALSE
Solution: False. While there exists w which has 100% training accuracy, exact 0 loss is not attainable.
Question 28 (Linear regression) The least-squares solution
$$w^\star = \operatorname*{argmin}_{w \in \mathbb{R}^D} \|y - Xw\|_2^2$$
is always unique.
TRUE FALSE
Solution: False. For example, if $X = \begin{pmatrix} 1 & 1 \end{pmatrix}$ and $y = 1$, then $w^\star_1 = (1, 0)^\top$ and $w^\star_2 = (0, 1)^\top$ are both solutions.
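A small numerical illustration of the non-uniqueness (the specific X and y are assumptions consistent with the counterexample above): when X has linearly dependent columns, several w attain the same zero residual, and numpy's lstsq returns yet another minimizer, the minimum-norm one.

```python
import numpy as np

# One sample, two redundant features: X has rank 1 < D = 2.
X = np.array([[1.0, 1.0]])
y = np.array([1.0])

w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])

print(np.sum((y - X @ w1) ** 2), np.sum((y - X @ w2) ** 2))  # both residuals are 0.0

# np.linalg.lstsq returns the minimum-norm solution, another valid minimizer.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_min_norm)   # [0.5, 0.5]
```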
Question 29 (PCA) Your friend performed Principal Component Analysis on some data and claimed that he retained at least 95% of the variance using k principal components. This is equivalent to
$$\frac{\sum_{i \ge k+1} \lambda_i}{\sum_i \lambda_i} \le 0.05,$$
where $\lambda_1, \dots, \lambda_k, \dots$ are the eigenvalues associated to each principal component, sorted in non-increasing order.
TRUE FALSE
Solution: True. The variance explained with k principal components is at least 95%, i.e. $\frac{\sum_{i \le k} \lambda_i}{\sum_i \lambda_i} \ge 0.95$, which is equivalent to the stated condition. Hence, the claim is true.
Question 30 (Adversarial robustness) Let $\|v\|_\infty := \max_i |v_i|$ denote the $\ell_\infty$-norm. Assume a binary classification problem with $y \in \{-1, 1\}$. Then the adversarial zero-one loss $\max_{\delta: \|\delta\|_\infty \le \varepsilon} \mathbf{1}_{y f(x+\delta) \le 0}$ (where $\mathbf{1}_C$ is equal to 1 if the condition C is true and to 0 if C is false) is always upper bounded by the adversarial hinge loss $\max_{\delta: \|\delta\|_\infty \le \varepsilon} \max\{0, 1 - y f(x+\delta)\}$.
TRUE FALSE
Solution: True. Since the hinge loss upper bounds the zero-one loss pointwise, taking the maximum over the perturbation on both sides preserves the inequality between them.
Question 31 (Expectation Maximization) If we run the Expectation Maximization (EM) algorithm on the
log-likelihood for a Gaussian Mixture Model, then the sequence of log-likelihoods {L(θ (t) )}t∈N is guaranteed
to be non-decreasing.
TRUE FALSE
Solution: True. By construction of the algorithm, each EM parameter update does not decrease the log-likelihood.
Question 32 (FastText Supervised Classifier) The FastText supervised classifier can be modelled as a
two-layer linear (one hidden layer) neural network with a non-linear loss function at the output.
TRUE FALSE
Solution: True. The FastText supervised classifier consists of two layers, an embedding layer followed by a linear classifier. The embedding layer can be thought of as another linear layer if the input is thought of as encoding word frequencies.
TRUE FALSE
Solution: False. In fact, it will not work if you initialize the embeddings to zero. ‘Zero’ is a stationary
point of the optimization problem. If you start all the parameters at zero, they will never change.
Question 34 (MSE and Neural Networks) The mean squared error (MSE) is convex w.r.t. the parameters of a multi-layer perceptron with more than one hidden layer and linear activation functions.
TRUE FALSE
Solution: False. Consider a simple two-layer neural network with a single neuron in each layer, such that the output of the network is equal to $w_1 w_2 x$, where $w_1$ is the weight of the first layer and $w_2$ is the weight of the second layer. One can easily check that the MSE loss is not convex w.r.t. the parameters.
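A numerical check of this example (illustrative, with a single data point (x, y) = (1, 1)): the loss $L(w_1, w_2) = (w_1 w_2 - 1)^2$ violates the convexity inequality on the segment between (−1, −1) and (1, 1).

```python
import numpy as np

# Two-layer "network" with one neuron per layer and linear activations: f(x) = w1 * w2 * x.
x, y = 1.0, 1.0
loss = lambda w1, w2: (w1 * w2 * x - y) ** 2

p = np.array([-1.0, -1.0])
q = np.array([1.0, 1.0])
mid = (p + q) / 2                       # = (0, 0)

lhs = loss(*mid)                        # L(midpoint) = (0 - 1)^2 = 1
rhs = 0.5 * loss(*p) + 0.5 * loss(*q)   # average of endpoint losses = 0

print(lhs, rhs, lhs <= rhs)  # 1.0 0.0 False -> the convexity inequality fails
```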
PCA
Let x1 , ..., xN be a dataset of N vectors in RD .
Question 35: (1 point.) What does it mean for the data vectors $x_1, \dots, x_N$ to be centered, as required for principal component analysis (PCA) to be meaningful? Use the notation $x_{nd}$ for individual entries.
0 1
Solution: The data is centered, i.e. $\mathbb{E}[x] = 0$, or in other words $\frac{1}{N}\sum_{n=1}^{N} x_n = 0$, or $\frac{1}{N}\sum_{n=1}^{N} x_{nd} = 0 \ \forall d$.
Question 36: (1 point.) Write down the covariance matrix of the dataset $X = (x_1, \dots, x_N) \in \mathbb{R}^{D \times N}$, and state its dimensions. (Note that for PCA we assume the data to be already centered, as in the previous question.)
0 1
Solution: $C = \frac{1}{N} X X^\top \in \mathbb{R}^{D \times D}$.
Question 37: (1 point.) What is the variance of the random variable $w^\top x$ over the randomness of x?
0 1
Solution: $\operatorname{Var}[w^\top x] = \frac{1}{N} \sum_{n=1}^{N} (w^\top x_n)^2$
Question 38: (2 points.) Show that the solution of the problem $\operatorname*{argmax}_{w: \|w\|=1} \operatorname{Var}[w^\top x]$ is to set w to be the first principal vector of $x_1, \dots, x_N$.
0 1 2
Solution: $\operatorname{Var}[w^\top x] = \frac{1}{N}\sum_{n=1}^{N} (w^\top x_n)^2 = \frac{1}{N}\sum_{n=1}^{N} w^\top x_n x_n^\top w = \frac{1}{N} w^\top X X^\top w$, which (by definition of eigenvectors) is maximized over unit-norm w when w is the top eigenvector of $XX^\top$. (One can add an argument for why this gives the top singular vector of X.)
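A quick empirical check of this result on made-up data: the variance of the projection onto the top eigenvector of $\frac{1}{N}XX^\top$ equals the top eigenvalue and is at least the projected variance along any random unit direction.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 500

# Centered data matrix of shape D x N, with one dominant direction.
X = rng.normal(size=(D, N)) * np.array([3.0, 1.0, 1.0, 0.5, 0.2])[:, None]
X = X - X.mean(axis=1, keepdims=True)

C = X @ X.T / N                           # empirical covariance (D x D)
eigvals, eigvecs = np.linalg.eigh(C)
w_pca = eigvecs[:, -1]                    # top eigenvector = first principal direction

def proj_var(w):
    return np.mean((w @ X) ** 2)          # Var[w^T x] = (1/N) sum_n (w^T x_n)^2

random_dirs = rng.normal(size=(1000, D))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

print(proj_var(w_pca) >= max(proj_var(w) for w in random_dirs))  # True
print(np.isclose(proj_var(w_pca), eigvals[-1]))                  # projected variance = top eigenvalue
```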
Question 39: (1 point.) Explain in words what the above result says about how PCA relates to the dataset.
0 1
Solution: The data has maximum variance when projected onto the first principal component. (Mentioning 'max variance' alone is not sufficient; one needs to say what kind of variance, i.e. the variance of the data projected onto that direction.)
The Perceptron
Setting
Let us consider a binary classification problem with a training set $S = \{(x_n, y_n)\}_{n=1}^{N}$ such that $x_n \in \mathbb{R}^D$ and $y_n \in \{-1, +1\}$ for all n. We consider classifiers of the form
$$f_{w,b}(x) = \operatorname{sgn}(w^\top x + b),$$
where $w \in \mathbb{R}^D$ is the weight vector, $b \in \mathbb{R}$ is the threshold, and the sign function is defined as
$$\operatorname{sgn}(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$$
Question 40: (1 point.) As seen in the course, explain how we can ignore the threshold b and only deal
with classifiers passing through the origin, i.e., of the form fw (x) = sgn(w> x).
0 1
Solution: We can view the threshold as an additional weight by adding a constant input 1 to the input x. It amounts to considering the augmented input $\tilde{x}^\top = [x^\top, 1]$ and weight vector $\tilde{w}^\top = [w^\top, b]$, since $\tilde{x}^\top \tilde{w} = x^\top w + b$.
For the remainder of the exercise, we proceed without this additive threshold, as explained in the previ-
ous question, and only consider classifiers of the form fw (x) = sgn(w> x). We make the following two
assumptions:
• Bounded input: There exists a real number R ≥ 0 such that for all n = 1, ..., N we have
$$\|x_n\|_2 \le R,$$
where for any vector $z \in \mathbb{R}^D$ we use $\|z\|_2$ to refer to the Euclidean norm of z, i.e., $\|z\|_2 = \sqrt{\sum_{d=1}^{D} z_d^2}$.
• Linearly separable data: There exists $w^\star \in \mathbb{R}^D$ and a real number γ > 0 such that for all n = 1, ..., N we have
$$y_n w^{\star\top} x_n \ge \gamma.$$
We will use the Perceptron algorithm to train the Perceptron classifier and find a separating hyperplane.
Let t denote the number of parameter updates we have performed and wt the weight vector after t updates.
We initialize with w0 = 0. If this weight vector is already a separating hyperplane, we are done. If not, we
pick an arbitrary point xi that is currently misclassified. This point is used to update the weight vector wt
as
$$w_{t+1} := w_t + y_i x_i \quad \text{where } y_i x_i^\top w_t \le 0.$$
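A minimal sketch of this update rule on synthetic separable data (the data generation, the margin filter and the iteration cap are assumptions, not part of the exam); it also reports the number of updates t, which the following questions bound.

```python
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """Run the Perceptron updates w <- w + y_i x_i on misclassified points."""
    w = np.zeros(X.shape[1])              # w_0 = 0
    t = 0
    while t < max_updates:
        # Points with y_n x_n^T w <= 0 are treated as misclassified.
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:
            return w, t                   # separating hyperplane found
        i = mistakes[0]                   # pick an arbitrary misclassified point
        w = w + y[i] * X[i]
        t += 1
    return w, t

rng = np.random.default_rng(0)
# Linearly separable data with a margin: labels given by a ground-truth w*.
w_true = np.array([1.0, -2.0])
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ w_true) > 0.2]           # enforce a margin gamma > 0
y = np.sign(X @ w_true)

w_hat, t = perceptron(X, y)
print("updates:", t, "training accuracy:", np.mean(np.sign(X @ w_hat) == y))
```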
Convergence
Question 41: (2 points.) For t > 0, show that when making the t-th update according to the Perceptron update, we have:
$$w^{\star\top} w_t \ge w^{\star\top} w_{t-1} + \gamma.$$
0 1 2
Solution: Let us assume that the t-th update is made on the sample i. We then have
$$w^{\star\top} w_t = w^{\star\top}(w_{t-1} + y_i x_i) = w^{\star\top} w_{t-1} + y_i w^{\star\top} x_i \ge w^{\star\top} w_{t-1} + \gamma.$$
The first equation follows from the definition of the Perceptron update and the last inequality follows from the assumption of linear separability.
Question 42: (1 point.) Deduce that $w^{\star\top} w_t \ge t\gamma$.
0 1
Solution: By induction on t we obtain $w^{\star\top} w_t \ge w^{\star\top} w_{t-1} + \gamma \ge \cdots \ge w^{\star\top} w_0 + t\gamma$, and by definition $w_0 = 0$, so $w^{\star\top} w_t \ge t\gamma$.
Question 43: (1 point.) Using the previous question, derive a lower bound on $\|w_t\|_2$ depending on t, γ and $\|w^\star\|_2$.
(We recall that a lower bound is a constant C > 0 such that $\|w_t\|_2 \ge C$. Note that we will only accept non-trivial lower bounds as correct answers.)
0 1
Solution: Using the Cauchy-Schwarz inequality, $\|w_t\|_2 \|w^\star\|_2 \ge w^{\star\top} w_t \ge t\gamma$. Therefore $\|w_t\|_2 \ge \frac{t\gamma}{\|w^\star\|_2}$.
Question 44: (3 points.) Derive the following upper bound on the squared norm $\|w_t\|_2^2$:
$$\|w_t\|_2^2 \le \|w_{t-1}\|_2^2 + R^2.$$
0 1 2 3
Solution: We still assume that the t-th update is made on sample i. We then have
$$\|w_t\|_2^2 = \|w_{t-1} + y_i x_i\|_2^2 = \|w_{t-1}\|_2^2 + 2 y_i w_{t-1}^\top x_i + y_i^2 \|x_i\|_2^2 \le \|w_{t-1}\|_2^2 + \|x_i\|_2^2 \le \|w_{t-1}\|_2^2 + R^2.$$
The first equation follows from the definition of the Perceptron update, the second equation follows from expanding the square, the third inequality follows from the fact that updates are made only on mistakes, i.e., $y_i w_{t-1}^\top x_i \le 0$ whenever an update is made (and $y_i^2 = 1$), and the last inequality follows from the boundedness assumption on the input.
Question 45: (1 point.) Deduce that $\|w_t\|_2^2 \le tR^2$.
0 1
Solution: By induction on t we obtain $\|w_t\|_2^2 \le \|w_{t-1}\|_2^2 + R^2 \le \cdots \le \|w_0\|_2^2 + tR^2$, and by definition $w_0 = 0$. Therefore $\|w_t\|_2^2 \le tR^2$.
Question 46: (3 points.) Combine the results obtained in the previous questions to obtain an upper bound on t depending on the quantities R, γ and $w^\star$.
0 1 2 3
Solution: From Question 43 we have
$$\|w_t\|_2 \ge \frac{t\gamma}{\|w^\star\|_2}.$$
From Question 45 we have $\|w_t\|_2^2 \le tR^2$, hence
$$\frac{t^2 \gamma^2}{\|w^\star\|_2^2} \le \|w_t\|_2^2 \le tR^2,$$
which yields
$$t \le \frac{R^2 \|w^\star\|_2^2}{\gamma^2}.$$
Question 47: (2 points.) Qualitatively analyze the previous result. What have you proven? Interpret the
dependency on R, γ and w? .
0 1 2
Solution: We have shown that the Perceptron algorithm terminates with a correct (separating) solution after a finite number of updates, and that this number grows proportionally to $R^2$ and inversely proportionally to the squared margin $\gamma^2 / \|w^\star\|_2^2$.
Question 48: (2 points.) Bound the number of Perceptron updates t using the quantities R and $M^\star$. Prove your result.
0 1 2
Solution: By definition of γ and $M^\star$ we have $\gamma / \|w^\star\|_2 \le M^\star$ (1 point if proven properly). Since the bound of Question 46 holds for any valid pair $(w^\star, \gamma)$, we may choose $w^\star$ to be a maximum-margin separator, for which $\gamma / \|w^\star\|_2 = M^\star$. Therefore we obtain
$$t \le \frac{R^2}{M^{\star 2}}.$$
Question 49: (1 point.) Does it imply that the output of the Perceptron algorithm is a max-margin
separating hyperplane?
0 1
Solution: No, it does not: the bound only shows that the Perceptron terminates with some separating hyperplane; it gives no information about which separating hyperplane the algorithm converges to.