EE 769 Introduction to Machine Learning
Sheet 4 — 2020-21-2
Linear classification
1. Assume that we have an augmented input vector $\hat{x} = [x^T \; 1]^T$, where the last dimension
is always 1, there is no bias $b$, and $y_i = \operatorname{sign}(w \cdot \hat{x}_i)$. When $x$ is 1-dimensional,
show pictorially and mathematically that all $\hat{x}_i$ lie on a straight line. Further, show
that the decision threshold is at the point where the normal to $w$ intersects this line
formed by the $\hat{x}_i$.
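A minimal numerical sketch of the 1-dimensional case in problem 1. The weight vector `w` and the sample inputs `xs` are illustrative assumptions, not values from the sheet. The sketch checks that every augmented point has second coordinate 1, and that the sign of the dot product with `w` flips at the input whose augmented vector is orthogonal to `w`.

```python
def augment(x):
    """Append the constant 1 to a scalar input, giving a 2-D augmented point."""
    return (x, 1.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(z):
    """Return -1, 0, or +1."""
    return (z > 0) - (z < 0)

w = (2.0, -3.0)                      # hypothetical weights (w_1, w_2), w_1 != 0
xs = [-1.0, 0.0, 1.0, 1.5, 2.0, 4.0]  # hypothetical 1-D inputs

augmented = [augment(x) for x in xs]

# All augmented points share the same last coordinate, so they lie on one line.
on_line = all(p[1] == 1.0 for p in augmented)

# The point on that line that is orthogonal to w gives the decision threshold.
threshold = -w[1] / w[0]
labels = [sign(dot(w, p)) for p in augmented]

print(on_line, threshold, labels)
```

Plotting the points of `augmented` together with the line through the origin perpendicular to `w` gives the pictorial version of the same statement.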
2. Assume that for a binary classification problem with 2-dimensional input, the training
label is given by:
$t_i = \operatorname{sign}(x_{i,1}) \times \operatorname{sign}(x_{i,2})$, (1)
where $x_{i,1} \neq 0$, $x_{i,2} \neq 0$, $\forall i$. Suggest a new variable $x_{i,3}$ that is a function of the two
original variables to make this data linearly separable.
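One candidate for problem 2 can be checked numerically. The sketch below assumes the product of the two inputs as the third variable, which is only one possible choice, and uses made-up sample points, one per quadrant. It tests whether a linear rule on the three variables reproduces the labels of equation (1).

```python
def sign(z):
    """Return -1, 0, or +1."""
    return (z > 0) - (z < 0)

def target(x1, x2):
    """Training label from equation (1)."""
    return sign(x1) * sign(x2)

def add_product_feature(x1, x2):
    """Assumed third variable: the product of the two original inputs."""
    return (x1, x2, x1 * x2)

# Hypothetical sample points, one per quadrant (no coordinate is zero).
points = [(1.0, 2.0), (-0.5, 3.0), (-2.0, -1.0), (4.0, -0.25)]

# A linear classifier in the 3-D space that only looks at the new variable.
w = (0.0, 0.0, 1.0)

targets = [target(*p) for p in points]
predictions = [
    sign(sum(wi * xi for wi, xi in zip(w, add_product_feature(*p))))
    for p in points
]

print(targets, predictions)
```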
3. Gaussian class-conditionals:
(a) Show that the quadratic term cancels out when the covariance matrices of the two
Gaussian class conditional densities of a binary classification problem are equal,
thus yielding a linear decision boundary, even if the prior probabilities of the two
classes are unequal.
(b) Show that the Bayesian classification decision takes the form of a logistic regression
classifier when the class conditionals are Gaussians with the same covariance
matrix.
(c) Show that the decision boundaries of a Bayesian classifier are still (piece-wise) linear
for a three-class classifier as long as the class conditional densities are Gaussian
distributed with the same covariance matrix.
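As a starting point for problem 3, the log posterior ratio for two classes can be written out as follows. This is a sketch under assumed notation that does not appear in the sheet: class means $\mu_1, \mu_2$, shared covariance $\Sigma$, and priors $P(C_1), P(C_2)$.

```latex
\begin{align*}
\ln \frac{p(x \mid C_1)\, P(C_1)}{p(x \mid C_2)\, P(C_2)}
  &= -\tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)
     + \tfrac{1}{2}(x - \mu_2)^T \Sigma^{-1} (x - \mu_2)
     + \ln \frac{P(C_1)}{P(C_2)} \\
  &= (\mu_1 - \mu_2)^T \Sigma^{-1} x
     - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1
     + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2
     + \ln \frac{P(C_1)}{P(C_2)} \\
  &= w^T x + w_0 ,
\qquad
P(C_1 \mid x) = \frac{1}{1 + \exp\!\left(-(w^T x + w_0)\right)} .
\end{align*}
```

The normalizing constants of the two Gaussians are identical because $\Sigma$ is shared, so they drop out of the ratio, and the two $x^T \Sigma^{-1} x$ terms cancel. The priors only shift $w_0$. For three classes, the same expression applies to each pair of classes.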
Department of Electrical Engineering, Indian Institute of Technology Bombay Page 1 of 4
Amit Sethi asethi@iitb.ac.in
4. Bayesian classification:
(a) Assume that in a population, the prevalence of a disease is 10 per 100,000. A
company makes a test to detect the disease, but it fails to detect even a single case,
no matter how large a population it is used on. What is the expected accuracy
of such a test (ratio of correct decisions to total decisions) when tested on 100
million people?
(b) A better test is developed for the same disease, which has a sensitivity of 0.99
and a specificity of 0.95. If a mass screening drive is conducted for 100 million
people, what is the expected number of people with the disease who will be missed
(false negatives), and what is the expected number of people who will be sent to a
doctor but do not have the disease (false positives)? What is the expected
accuracy of this classifier in a mass screening scenario? You can read about
sensitivity and specificity here:
https://en.wikipedia.org/wiki/Sensitivity_and_specificity.
(c) For the previous part, assume that the test technology uses a logistic regression
classifier that computes the probability of the disease using the logistic function. If
that probability is above 0.5, then the human subject is assumed to be diseased
and sent to a doctor for further examination. However, we need not use this default
threshold, and we can vary it between 0 and 1. If we want to miss approximately
0.1% of diseased subjects in a mass screening, then should we move this threshold
closer to 0 or closer to 1, and what will the sensitivity be? Is there any disadvantage
of doing so?
(d) Assume that the classifier is applied to a different population where the disease
burden is 100 patients per 100,000 population. In such a scenario, with the same
decision threshold of p = 0.5, what will the accuracy of the classifier be?
(e) Now, assume that the societal cost (or risk) of missing a patient is $1,000,000, while
that of falsely calling a healthy person a patient is $1,000. Which of the following two
settings will incur a lower expected societal cost (or risk) when used for mass screening?
(i) A decision threshold of 0.5 with a sensitivity of 0.99 and a specificity of 0.95?
(ii) A decision threshold of 0.1 with a sensitivity of 0.999 and a specificity of 0.8?
(f) As is evident from the previous parts, accuracy by itself is an incomplete measure of
a classifier’s performance. We need to know both sensitivity and specificity (or
FNR and FPR, or PPV and NPV etc., or a confusion matrix). On top of that,
we can vary the decision threshold in certain classifiers, such as logistic regression.
For such classifiers, there is a measure of performance that takes into account the
entire range of thresholds, which is called AUC or AU-ROC (area under receiver
operating characteristic curve). Assume that the classifier yields p of 0.4, 0.55,
0.8, 0.9, 0.91, and 0.99 for positive samples, and 0.01, 0.02, 0.04, 0.11, 0.14, 0.24,
0.3, 0.43, and 0.54 for the negative samples tested. For this scenario, compute the
following:
(i) Sensitivity and specificity for a threshold of 0.5.
(ii) Sensitivity and specificity for a threshold of 0.4.
(iii) AUC.
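The quantities asked for in problem 4 can be computed with a few small helpers. The sketch below is one way to set this up, not a prescribed method. The function names are illustrative, a score strictly above the threshold is treated as a positive decision (following the wording "above 0.5" in part (c)), and AUC is computed as the fraction of (positive, negative) score pairs that are ranked correctly, with ties counted as half.

```python
def screening_outcomes(population, cases_per_100k, sensitivity, specificity):
    """Expected false negatives, false positives, and accuracy in a screening."""
    diseased = population * cases_per_100k / 100_000
    healthy = population - diseased
    false_negatives = (1.0 - sensitivity) * diseased
    false_positives = (1.0 - specificity) * healthy
    accuracy = (population - false_negatives - false_positives) / population
    return {"fn": false_negatives, "fp": false_positives, "accuracy": accuracy}

def expected_cost(outcomes, cost_fn, cost_fp):
    """Total expected cost given per-case costs of the two error types."""
    return outcomes["fn"] * cost_fn + outcomes["fp"] * cost_fp

def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Assumes a score strictly above the threshold means 'diseased'."""
    true_pos = sum(p > threshold for p in pos_scores)
    true_neg = sum(n <= threshold for n in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)

def auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Values taken from the sheet.
population = 100_000_000
part_a = screening_outcomes(population, 10, sensitivity=0.0, specificity=1.0)
part_b = screening_outcomes(population, 10, sensitivity=0.99, specificity=0.95)
part_d = screening_outcomes(population, 100, sensitivity=0.99, specificity=0.95)
cost_i = expected_cost(part_b, cost_fn=1_000_000, cost_fp=1_000)
cost_ii = expected_cost(
    screening_outcomes(population, 10, sensitivity=0.999, specificity=0.8),
    cost_fn=1_000_000,
    cost_fp=1_000,
)

pos = [0.4, 0.55, 0.8, 0.9, 0.91, 0.99]
neg = [0.01, 0.02, 0.04, 0.11, 0.14, 0.24, 0.3, 0.43, 0.54]
at_05 = sensitivity_specificity(pos, neg, 0.5)
at_04 = sensitivity_specificity(pos, neg, 0.4)
auc_value = auc(pos, neg)

print(part_a, part_b, part_d, cost_i, cost_ii, at_05, at_04, auc_value)
```

Note that the positive sample with p = 0.4 sits exactly on the threshold in part (f)(ii), so the result there depends on whether the rule is "above" or "at or above" the threshold. The sketch uses the former.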
5. Soft-SVM:
(a) Which of the following points are support vectors in a soft-SVM?
(i) Points outside the margin on the correct side of the decision boundary.
(ii) Points on the margin on the correct side of the decision boundary.
(iii) Points inside the margin on the correct side of the decision boundary.
(iv) Points on the decision boundary.
(v) Points inside the margin on the wrong side of the decision boundary.
(vi) Points outside the margin on the wrong side of the decision boundary.
(b) For each of the sub-parts in the previous part, what is the value or range of values
for the slack variable ξi ?
(c) As $\lambda$ is increased in the soft-SVM loss function $L_{w,b} = \lambda \|w\|_2^2 + \sum_{i=1}^{N} [1 - t_i (w^T x_i + b)]_+$, do
you expect the number of support vectors to increase or decrease? Give a reason.
(d) Analyze the behavior of a soft-SVM as $\lambda$ is increased, whose cost function is given
by $L_{w,b} = \lambda \|w\|_1 + \sum_{i=1}^{N} [1 - t_i (w^T x_i + b)]_+$, where the L2-norm of $w$ has been
replaced by the L1-norm. Such an SVM is called an L1-SVM [Ref: "Feature Selection via
Concave Minimization and Support Vector Machines" by P. S. Bradley and O. L.
Mangasarian in ICML 1998, and doi:10.1093/bioinformatics/btp286].
(e) Suggest a way to convert the output of an SVM into a continuous measure that
may be interpreted as a probability measure, instead of simply a discrete binary
decision.
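For parts (a) and (b) of problem 5, it can help to evaluate the slack numerically for one point in each of the six regions. The sketch below assumes the usual hinge-loss definition of slack, the amount by which the margin constraint is violated and zero otherwise, and a hypothetical 1-D classifier with w = 1 and b = 0, so that the decision boundary is at x = 0 and the margins are at x = +1 and x = -1. All sample points carry the label t = +1.

```python
def slack(t, w, x, b):
    """Hinge-loss slack: max(0, 1 - t * (w . x + b))."""
    functional_margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - functional_margin)

w, b = (1.0,), 0.0  # hypothetical 1-D classifier, not from the sheet

# One positive-class point (t = +1) per region listed in part (a).
cases = {
    "(i) outside margin, correct side": 2.0,
    "(ii) on margin, correct side": 1.0,
    "(iii) inside margin, correct side": 0.5,
    "(iv) on decision boundary": 0.0,
    "(v) inside margin, wrong side": -0.5,
    "(vi) outside margin, wrong side": -2.0,
}

slacks = {name: slack(+1, w, (x,), b) for name, x in cases.items()}

for name, value in slacks.items():
    print(f"{name}: slack = {value}")
```

Moving each sample point within its region shows which regions give a single slack value and which give a range.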
6. Advanced regularization:
(a) Instead of using either L2 or L1 regularization in isolation, elastic net uses both L2
and L1 regularization in an optimization objective of the form
$L_{\text{error}} + \lambda_2 \|w\|_2^2 + \lambda_1 \|w\|_1$, with $\lambda_2 > 0$, $\lambda_1 > 0$,
where $w$ denotes the weights of the model (such as linear or logistic regression) and
$L_{\text{error}}$ is the unregularized loss (e.g. MSE or cross-entropy).
The reason for using both regularizations is that L1 eliminates some variables and
L2 groups correlated variables such that two correlated variables either remain
together in the model, or get eliminated together from the model, depending on
the values of $\lambda_1$ and $\lambda_2$ and the strength of the correlation.
Assume two perfectly correlated variables x1 and x2 (you can assume that these
are exactly the same variables) along with other variables, and show that by just
using L1 penalty, it is possible to have no increase in the L1 penalty if we add a
constant to the coefficient of one of those variables, and subtract another constant
from the coefficient of its correlated variable. Then show that using the L2 penalty
stabilizes the coefficients such that they remain equal to each other, whether they
are zero or non-zero. [Read more at:
https://rss.onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1467-9868.2005.00503.x]
(b) (Optional) Read about the SCAD penalty, and state its advantages and disadvantages.
[This is a quick read: https://andrewcharlesjones.github.io/posts/2020/03/scad/]
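The claim in problem 6(a) can be explored numerically before proving it. The sketch below assumes two identical variables whose coefficients both start at 1.0 (a made-up value), and shifts one coefficient up and the other down by the same constant c. Using the same constant for both is an assumption made here so that the sum of the two coefficients, and hence the unregularized loss, stays unchanged. The shifts are kept small enough that both coefficients keep the same sign.

```python
def l1_penalty(weights):
    return sum(abs(v) for v in weights)

def l2_penalty(weights):
    return sum(v * v for v in weights)

base = (1.0, 1.0)                 # hypothetical equal coefficients of x1 and x2
shifts = [0.0, 0.25, 0.5, 0.75]   # constant moved from one coefficient to the other

shifted = [(base[0] + c, base[1] - c) for c in shifts]

# The model output depends only on the sum of the two coefficients.
sums = [w1 + w2 for w1, w2 in shifted]

l1_values = [l1_penalty(w) for w in shifted]
l2_values = [l2_penalty(w) for w in shifted]

for c, l1, l2 in zip(shifts, l1_values, l2_values):
    print(f"c = {c}: L1 penalty = {l1}, L2 penalty = {l2}")
```

Comparing the two penalty columns across the shifts shows which penalty is indifferent to how the total weight is split between the two variables and which one prefers an equal split.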