Machine Learning Lab Guide
Machine Learning Lab Guide
Introduction 5
4 Complexity 37
4.1 The Support Vector Machine, svc . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 The Nonlinear Support Vector Machine, svc . . . . . . . . . . . . . . . . . . . 38
4.3 Bias and Variance in Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3
4.4 Bias and Variance in Classification . . . . . . . . . . . . . . . . . . . . . . . . 40
4
Introduction
• classification;
• evaluation;
• complexity;
• regression;
• clustering.
• understand pattern recognition theory to such an extent that you are able to read recent
literature on the topic in engineering-oriented journals (e.g. IEEE Tr. on PAMI);
• know which statistical methods to apply to which problems, on which assumptions they
are based and how these methods interrelate;
• be able to construct a learning system to solve a given simple problem, using existing
software.
Prior knowledge
Basic working knowledge of multivariate statistics and linear algebra is required to follow the
course. Next to that, it is expected that you have had the course CSE2510 Machine Learning,
or something comparable.
Software Originally, this course uses Matlab to do the programming exercises. After
many requests, the change to Python is made. Although both programming languages are
5
very similar, the devil is in the details. It may be, that in some locations, still Matlab
notation is used. Please let us know when you find something!
You can get PRTools for Python from https://github.com/DMJTax/prtools.
Notation
Most weeks contain a few optional exercises, indicated like this:
optional
An exercise between these lines is optional. This means that you are not required to do it.
It can give you some extra background and tips. Only work on it if you think the subject is
interesting and if you have sufficient time left.
end optional
Some other notational conventions are:
• Variables and code will be indicated using the teletype font (for example: x,
mean(x)). For larger pieces of code, we will use the notation:
>>> % A piece of test code
>>> import numpy as np
>>> x = np.random.rand(10,2)
>>> np.mean(x)
>>> np.std(x)
Here, >>> is the prompt. If pieces of code are not preceded by >>> it will mean it’s
wiser to write a script.
• An alert sign like the one in the margin here indicates it is essential you read the text
next to it carefully.
• A book sign indicates where you can read more on the theory behind the subject dis-
X, Slides cussed. Unless indicated otherwise, numbers refer to chapters and sections in the book
“Pattern Recognition” by Theodoridis & Koutroumbas (4th edition). “Slides” means
the theory is discussed in the slides, which can be downloaded in hand-out format from
the Blackboard site.
6
Week 1
Objectives When you have done the exercises for this week, you
Exercise 1.1 (a) Assume that we managed to represent objects from a two-class classifi-
cation problem by a single feature. We know that the objects from class ω1 have a 2.2
Gaussian distribution with µ1 = 0 and σ12 = 1/2, and the objects from class ω2 have
a Gaussian distribution with µ2 = 1 and σ22 = 1/2. Derive the position of the de-
cision boundary when both class priors are equal. (Note that we are now assuming
that we know the distributions exactly, so effectively we’re making the Bayes classi-
fier.)
(b) Again, assume we have a two-class classification problem in a 1D feature space,
but now assume that objects from class ω1 have a uniform distribution between 0 and
1, and objects from class ω2 have a uniform distribution between 2 and 3. Where is the
decision boundary now?
(c) And where is the decision boundary when the objects from class ω2 have a uniform
distribution between 0.5 and 1.5? (The distribution of ω1 did not change, classes have
equal prior.)
(d) And where is the decision boundary when the objects from class ω2 have a uniform
distribution between 0.5 and 2.5? (The distribution of ω1 did not change, classes have
equal prior.)
7
Exercise 1.2 (a) Assume we represent the objects in a two-class classification problem by
a single feature. We know that the objects from class ω1 have a Gaussian distribution
with µ1 = 0 and σ12 = 1/2, and the objects from class ω2 have a Gaussian distribution
with µ2 = 1 and σ22 = 1/2. Derive the position of the decision boundary when both
class priors are equal, but we have a loss matrix of:
0 0.5
L= . (1.1)
1.0 0
(b) Assume again we have a two-class classification problem in a 1D feature space, but
now assume that objects from class ω1 have a uniform distribution between 0 and 1,
and objects from class ω2 have a uniform distribution between 0.5 and 2.5. Given the
loss matrix (1.1), where is the decision boundary now?
p(x|ωi )
0.5
x
-2 -1 0 1 2 3 4
Figure 1.1: The class-conditional probabilities of two classes p(x|ω1 ) (dashed blue line) and
p(x|ω2 ) (solid black line) in a 1-dimensional feature space.
Exercise 1.3 In figure 1.1 two triangular-shaped class conditional probability density func-
tions are given. The first one p(x|ω1 ) is indicated by a dashed blue line and the second
p(x|ω2 ) with a solid black line. The class priors are assumed equal here.
(a) Again, use the Bayes’ rule to derive the class posterior probabilities of the fol-
lowing objects: x = 3, x = −0.5, x = +0.5. To which class are the objects therefore
assigned?
(b) Which object is on the decision boundary of the Bayes classifier?
Exercise 1.4 Now assume that class ω1 in figure 1.1 is twice as small as class ω2 . That
means p(ω1 ) = 1/3 and p(ω2 ) = 2/3.
(a) Compute the posterior probabilities for x = 3, x = −0.5, x = 0.5.
(b) Where is the decision boundary of the Bayes classifier now?
Exercise 1.5 Compute the Bayes error for the class distributions given in figure 1.1, where
the classes have equal prior.
8
1.2 Prtools for Python
In order to learn something about many machine learning methods, the toolbox PRTools
was developed. Originally, it is a Matlab toolbox, but it has been, provisionally, translated
into Python. It should mimic the most important parts of PRTools for Matlab.
Again, you can get PRTools for Python from https://github.com/DMJTax/prtools.
The advantages of PRTools (for Matlab or Python) are:
Of course, you are not required to use PRTools. If you feel more comfortable with (for
instance) scikit-learn, please use that. But then we will not be able to help with bugs or
implementation problems.
When you are starting with Python, we assume that you already imported numpy and
PRTools, like1 :
>>> import numpy as np
>>> import prtools as pr
For plotting purposes we also use matplotlib, so also include:
>>> import matplotlib.pyplot as plt
1.3 Datasets
One key entity in Machine Learning is the idea of an object. We always assume we can
represent any object by a set of values, often just measurements. Whatever object we are
considering, in order to perform operations on an object by a computer, we have to encode
this object by some numbers. An object in real life and in the computer are therefore two
different things. Furthermore, in this course we will use the convention that each object is
represented by a row vector of features (thus an 1 × d array).
When you want to conclude something from data, it is often not from one object, but from
a set of objects. We assume we have a set of objects from which we want to obtain some
knowledge. This set is called a dataset. A dataset is a set of n objects and is stored in a n × d
array.
After you have specified the problem you want to solve, you have to collect examples and do
measurements on these examples. For that, you have to define what features are likely to be
informative for the problem you specified.
1
To get the Python code, please have a look at the link given on page 6.
9
Exercise 1.6 (a) Make a fake dataset containing 10 objects with 3 features each. Invent
the feature values yourself. Below is an example piece of code which fills the matrix x
with 2 objects, each containing 3 features:
>>> x = np.array([[ 0.7,0.3,0.2],[2.1,4.5,0]])
Make your own matrix x.
(b) Compute the means (using np.mean) and standard deviations (np.std) of your 3
features of 10 objects. What is the difference between mean(x), mean(x,axis=0) and
mean(x,axis=1)?
Exercise 1.7 (a) When a dataset contains just two features per object, it can be visualized
in a scatterplot (we come back to this later). Make a scatterplot by:
plt.scatter(x[:,0],x[:,1])
Looking at the data matrix you created in Exercise 1.6, find out which object is plotted
where in the plot.
(b) When you look at the scatterplot, can you identify outlier objects, or structure in
the data?
When the dataset is used to train classifiers, it is also required that for each object a class
label is present. This indicates from which class the object originates, according to the expert
(i.e. you).
Exercise 1.8 (a) Invent labels for the objects that you defined in the previous question.
The labels can be numbers, like 1 or 2. Store them in a column vector lab, and create
a PRTools dataset by
>>> lab = np.array([1,1,1,1,2,2,2,2,2,2]).reshape(10,1)
>>> lab = np.array([[1,1,1,1,2,2,2,2,2,2]]).T # alternative
>>> a = pr.prdataset(x,lab)
Check if the resulting dataset has the correct number of objects, the correct number of
features and correct number of classes (correct here means: what you expect). You can
do that by just typing the variable on the Matlab command line:
>>> print(a)
A scatterplot is the most simple plot you can make: it simply plots the first feature against
the second feature. If you have three features, you can use 3D plots; if you have even more,
you will have to select at most three of them by hand (although later we will discuss ways of
visualising more features at once).
Exercise 1.9 Load the dataset “boomerangs” (use the function boomerangs and choose the
number of objects to generate).
(a) Use pr.scatterd(a) to make a scatterplot of the first two features, and
pr.scatterd(a[:,[1,2]]) for the features 2 and 3 (note that Python starts count-
ing with 0).
10
There are many other (artificial) datasets defined in PRTools. The table below lists a few
of them:
gendatb Generation of banana shaped classes
gendatc Generation of circular classes
gendatd Generation of two difficult classes
gendats Generation of two Gaussian distributed classes
gendath Generation of the Higleyman dataset
When you want to extract the original data matrix from a prdataset, you can use the
’+’-operator:
>> a = pr.gendatb()
>> b = +a
>> print(b)
This is sometimes useful when you want to remove the labels from a dataset:
>> a = pr.gendatb() # data with labels
>> b = pr.prdataset(+a) # data without labels
>> print(b)
The labels are stored in a.targets so you can retrieve them with lab = a.targets.
On the prdataset you can perform operations. If the dataset is labeled, you can train a
classifier. If the dataset has arbitrary real-valued targets, you can train a regressor. But you
can also do feature normalisation, feature reduction, or clustering. All these operations are
stored in a prmapping.
Assume we want to train a nearest mean classifier on the Banana dataset. In PRTools you
do:
>>> a = pr.gendatb()
>>> w = pr.nmc(a)
>>> print(w)
Nearest mean, 2 to 2 trained mapping
The result of the operation a*w is again a dataset. It is the classified, rescaled or mapped
result of applying the mapping definition stored in w to a.
For mappings which change the labels of the objects (so the mapping is actually a classifier)
the routines labeld and testc are useful. labeld and testc are the general classification
11
and testing routines respectively. They can handle any classifier from any routine.
>>> lab = b*pr.labeld()
>>> print(lab)
[[-1]
[-1]
[-1]
[-1]
...
[ 1]
[ 1]]
>>> e = b*pr.testc()
0.14
Note that in the above examples we use the typical Matlab conventions: everything is a
matrix, and operations are often defined in terms of ’multiplication’ *. If you prefer the
Python way of working, you can also call the methods explicitly:
>>> a = pr.gendatb()
>>> w = pr.nmc()
>>> w.train(a)
>>> b = w.eval(a)
>>> e = pr.testc(b)
A few of the available classifiers is listed below:
ldc Linear discriminant analysis
qdc Quadratic discriminant analysis
nmc Nearest mean classifier
fisherc Fishers linear discriminant
knnc k-nearest neighbor classifier
parzenc Parzen classifier
naivebc Naive-Bayes classifier
mogc Mixture-of-Gaussians classifier
stumpc decision stump classifier
dectreec Decision tree classifier
adaboostc AdaBoost
svc Support vector classifier
The list is not complete. Feel free to add your favorite classifier!
Some classifiers require additional hyperparameters to be specified. For instance, in the
support vector classifier, you can specify the kernel, a kernel parameter, and a regularisation
parameter. Or in the k-nearest neighbor classifier you can specify the number of neighbors k.
You can supply that as additional input during training:
>>> w = pr.svc(a,(’rbf’,4.5,1))
You can also specify an untrained mapping beforehand with the required hyperparameters,
and train it afterwards on some training set:
>>> u = pr.svc((’rbf’,4.5,1))
>>> w = a*u
12
or
>>> w = pr.svc((’rbf’,4.5,1))
>>> w.train(a)
Finally, you can also visualise the decision boundary of classifiers. This is done using the
function plotc. In order to see the relevant region in the feature space, first a scatterplot of
the (training) dataset has to be made. For example:
>>> a = pr.gendath()
>>> w = pr.parzenc(a)
>>> pr.scatterd(a)
>>> pr.plotc(w)
Exercise 1.10 Practice the use of PRTools. Create some of the artificial datasets, make
scatterplots of the data, train some classifiers and plot their decision boundaries.
(a) Train a classifier and plot its decision boundary (together with the training set).
Can you verify that the function testc gives the correct classification rate on the train-
ing set as you see in the plot?
Exercise 1.12 (a) Make a rough sketch of the shape of the ellipsoids of a normal density
for which the covariance matrix has the form 10 09 . Take this ellipsoid, what is the
ratio of the length of its long axis in comparison with its short axis?
(b) Imagine we rotate the above sketch of ellipsoids clockwise such that their orienta-
tion becomes diagonal. Qualitatively, what happens to the variances and the covariances
for this new normal distributions? How do they compare to the original variances and
covariances as given in the covariance matrix above?
(c) Draw a figure visualizing a 2D normal density in which the two variables/features
are completely correlated, i.e., the correlation coefficient between the two variables is 1.
Give the covariance matrix that matches your sketch.
Exercise 1.13 Generate 1000 random point from a 2D standard normal distribution using
randn. (Note the discrepancy between the math, in which typically column vectors are
employed, and computational software, in which feature vectors often are row vectors.
This means, in this case, that the matrix with random numbers should be size 1000×2.)
(a) Turn these random points into a prdataset a and scatterd the data.
The command w = gaussm(a) determines estimates for a normal distribution based on
the data set a, and stores the estimated parameters in the mapping w.
(b) Use plotm(w) to visualize the 2D density estimates on top of the scattered data.
13
Exercise 1.14 Consider two 2D normal distributions with different means but equal covari-
ance matrices. Assume
the latter to be equal to a multiple of the identity matrix, i.e.,
c 0
take Σ to be 0 c for some c > 0. The class priors are not necessarily equal. (Note
that now we’re basically having the LDA classifier.)
(a) What shapes can the decision boundary take on? Demonstrate this mathematically
starting from the expressions in Section 2.4.2 in the book.
2.4.2
Exercise 1.15 Consider two 2D normal distributions with different means but equal covari-
ance matrices.
(a) What shape does the optimal decision boundary have?
Generate 10 data points per class from a data set that fulfills the above assumptions.
Create a data set a with these 20 points and these two classes. We can estimate linear
discriminants by means of ldc(a) and quadratic discriminant boundaries by means of
qdc(a).
(b) Given the shape of the optimal decision boundary, how does the boundary look
for the trained normal density based quadratic discriminant?
The PRTools command plotc plots the decision boundary. If you plotted decision
boundaries of multple classifiers, you can use plt.legend() to get a legend.
(c) Scatter the data set a and plot the decision boundary estimated by means of ldc(a)
and qdc(a). Can you explain their difference? You might want to revisit the previous
question (b).
(d) What happens when the number of points per class increases? What happens in
the limit of an infinite number of points?
Exercise 1.16 Generate a 2D data set with 2 classes with 10 samples from a uniform distri-
bution (using rand) and 10 samples from a normal distribution (using randn), respec-
tively.
(a) Train a ldc and a qdc on this data, scatter the data, and plot the two decision
boundaries (with plotc). Determine the how many points are on the wrong side of the
boundary.
(b) With w a trained classifier and a a data set, the labels that the classifier assigns
to the data points in a can be retrieved using labeld: a*w*labeld. Check your count
from the previous question using this command.
Next to classifiers, PRTools also has the possibility to estimate densities. In this section
we are going to estimate the density using a Parzen density estimator, called parzenm in
PRTools.
14
(Type help(pr.gendats) to understand what type of data we have now.)
(b) We define the width parameter h for the Parzen kernel:
>>> h = 0.5;
(c) The function parzenm estimates a density for a given dataset. In most cases a
PRTools prdataset is labeled, and these labels are used in the function parzenm to
estimate a density for each class. To define a Parzen density estimator with a certain
width parameter h on the entire dataset, ignoring labels, type:
>>> a = pr.prdataset(+a)
>>> w = pr.parzenm(a,h)
This mapping can now be plotted along with the data:
>>> pr.scatterd(a); pr.plotm(w)
If your graphs look a little “bumpy”, you can increase the grid size PRTools uses for
plotting:
>>> pr.plotm(w,gridsize=100)
and try the above again.
(d) Plot the Parzen density estimate for different values of h. What is the best value
of h?
When you want to evaluate a fit of a density model to some data, you have to define an error.
One possibility is to use the log-likelihood, defined as:
!
Y X
LL(X ) = log p̂(x i ) = log (p̂(x i )) (1.2)
i i
Exercise 1.18 Use the data from the same distribution as in the previous exercise to train
a Parzen density estimator for different values of h. Compute the log-likelihood of this
15
training set given the estimated densities (for different h):
a = pr.gendats([20,20],1,8) # Generate data
a = pr.prdataset(+a)
hs = [0.01,0.05,0.1,0.25,0.5,1,1.5,2,3,4,5] # Array of h’s to try
LL = np.zeros(len(hs))
for i in range(len(hs)): # For each h...
w = pr.parzenm(a,hs[i]) # estimate Parzen density
LL[i] = np.sum(np.log(+(a*w))); # calculate log-likelihood
Exercise 1.19 (a) Use the same data as in the previous exercise, but now split the data
into a training and test set of equal size. Estimate a Parzen density on the training set
and compute the Parzen density for the test set. Compute the log-likelihood on both
the training and test sets for h = [0.1, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5]. Plot these log-likelihood
vs. h curves:
[trn,tst] = pr.gendat(a,0.5) # Split into trn and tst, both 50%
hs = [0.01,0.05,0.1,0.25,0.5,1,1.5,2,3,4,5] % Array of h’s to try
Ltrn = np.zeros(len(hs))
Ltst = np.zeros(len(hs))
for i in range(len(hs)): # For each h...
w = pr.parzenm(trn,hs[i]) # estimate Parzen density
Ltrn[i] = np.sum(np.log(+(trn*w))) # calculate trn log-likelihood
Ltst[i] = np.sum(np.log(+(tst*w))) # calculate tst log-likelihood
This week you saw the basis of machine learning: object definition, data collection, and Bayes’
rule. Starting from good measurement data, the rest of the analysis (visualisation, clustering,
classification, regression) becomes much easier. Starting from poorly defined objects or poorly
sampled datasets with insufficient example objects, your analysis becomes very hard and no
clear conclusions will be possible (except that more data is needed).
In this last section we have a quick look at the scaling problem. It appears that some classifiers
are sensitive to the scaling of features. That means, that when one of the features is rescaled
to very small or very large values, the classifier will change dramatically. It can even mean
16
that the classifier is not capable of finding a good solution. Here we will try to find out, which
classifiers are sensitive to scaling, and which are not.
Exercise 1.20 (a) Generate a simple 2D dataset (for instance, using gendatb) and plot the
decision boundaries of the six classifiers listed above. Use k = 1 for the knnc.
(b) Make a new dataset in which the second feature is 10 times larger than the dataset
above. Do this in Matlab by
newtrain = a;
newtrain(:,2) = 10*newtrain(:,2)
and in Python:
newtrain = a;
newtrain[:,[1]] *= 10.
Train six classifiers nmc,ldc,qdc,fisherc,parzenc,knnc and plot the decision bound-
aries.
(c) Which classifiers are affected by this rescaling of the feature space? Why are they
affected, and others not?
(d) Is it an advantage or a disadvantage?
17
18
Week 2
These exercises are meant to give you both some practice with the material covered in the
lectures and the literature and an impression of what you are expected to know. There are
a lot of exercises. So, judge for yourself which exercises you need to, want to, or should
practice. There are anyway too many to go through in the 2 hours of exercise lab that we
have. On another note, not all exercises are thoroughly checked. If you think something is
wrong, unclear, etc., let us (i.e., any of the lecturers or TAs) know.
Objectives When you have done the exercises for this week and went through the related
reading material, you should
• be able to formulate the basic least squares regression models and derive its optimal
estimators,
• comprehend the idea and use of polynomial and transformed regression,
• understand how Fisher’s linear discriminant is formulated in terms of standard
linear regression,
• know how a perceptron classifier is optimized.
• be able to derive the expression for the posteriors in the case of logistic discrimi-
nation,
• have an idea of the basic shape of the logistic posterior.
Exercise 2.1 Consider standard linear regression with the squared loss as the performance
measure:
XN
(xTi w − yi )2 = kXw − Y k2 . (2.1)
i=1
19
Note that in the expression above there is some confusing notation going on. The
(feature) vector xi ∈ Rd is a column vector, while all features per object in X ∈ RN ×d
are in rows. Y is an N -vector with all corresponding outputs.
The aim is to minimizing this sum of squared residuals between the linearly predicted
and actual output over w ∈ Rd .
(a) Assume that (X T X)−1 exists. Show that (X T X)−1 X T Y gives a least squares
solution to the above problem, i.e., it minimizes kXw − Y k2 .
(b) Given that (X T X)−1 exists, what does that tell us about the data? More specifi-
cally, what limitation on the number of observations does this imply, what does invert-
ibility say about the dimensionality of the (affine) subspace our data is in, and what
difference does the presence or absence of the origin in this subspace make? To what
extent are these limitations enough to guarantee invertibility?
Let us now allow for an intercept (or bias term), i.e., we also model a constant offset in the
regression function. We do this by the trick of adding a column of ones to the matrix X. Let
Z refer to this new matrix.
Consider standard linear regression (with intercept) with the squared loss as the performance
measure, kZw − Y k2 , which we want to minimize for w.
Exercise 2.2 (a) Assume that (Z T Z)−1 exists. Show that (Z T Z)−1 Z T Y gives a least
squares solution to this least squares problem.
(b) Given that (Z T Z)−1 exists, what does that tell us about how the data is scattered?
Can you formulate necessary and sufficient requirements to for the existence of this
inverse?
(c) Construct an example data set, for which the inverse of X T X exists, while the
inverse for Z T Z does not.
Exercise 2.3 Let us consider a couple of settings in which we have very few observations.
These problems may be referred to as underdetermined (do you understand the choice
of adjective?).
(a) Given a data set consisting of one training point only. The input is x = π in 1D,
while the output y equals e. Sketch/describe all linear solutions with intercept that
minimize the squared loss on this data set.
(b) Give a mathematical expression for the (set of) solutions of this 1D problem.
1 1 and Y = 1
(c) Describe how the linear least squares solutions look if X = −2 1 −1
(d) For this last problem with 2D inputs, what is the value that the minimizer takes
on? In other words, what is the sum of the squared residuals in this case?
(e) Forget about the intercept for a moment and describe how the linear least squares
solutions look if X = (1, 1) and Y = π.
(f ) Look again at the three solution sets that you have determined for the three re-
gression problems above. Given that all solutions in these sets are valid, which would
be more preferable? Could you come up with a good rule to single out an element from
every one of these three sets? What is your reason for singling out this solution?
20
end optional
Exercise 2.4 Consider a regression training data set with four 1D inputs X = (−2, −1, 0, 3)T
and corresponding outputs Y = (1, 1, 2, 3)T .
(a) Let us assume that we fit a linear function without intercept to this data under
squared loss. Calculate the optimal function fit for the given data set.
(b) Let us now also include an intercept. Still, we stick to fitting linear functions that
we fit using the squared loss. Calculate the optimal value that we find for the intercept.
(c) Think of polynomial regression (see also Section 2.2), what is the minimum poly-
nomial degree that we need in order to fit the regression curve exactly to the training
data? Is there a difference between the situation with and without intercept?
The function linearr allows one to perform polynomial regression. Polynomial regression fits
a polynomial of some maximal degree to the data in a least squares sense. Even though this
results in a nonlinear function in x, the problem may still be referred to as linear regression
as the regression function is, in fact, linear in the unknown parameters w estimated from the
data. Enough with the confusion, let’s do some experiments!
Exercise 2.5 Using gendatr or prdataset, generate data where the inputs x are drawn
uniformly from the interval [0, 1] and the corresponding outputs y are obtained by
squaring this value and adding Gaussian noise to the inputs: y = x2 + ε with ε a
random sample from the standard normal distribution.
(a) Study the behavior of polynomials of degree 0 to 3 for different training set sizes
(e.g. 4, 40, and 400 samples?). You may want to have a look at the data and the fitted
polynomial models. You can also estimate the squared error using a somewhat large
test set.
The idea of not only using the original features, but also powers of those feature values (as
in polynomial regression) can of course be applied more liberally. There is, in principle, no
reason to limit oneself to polynomials. Especially if you understand what data you are dealing
with, if you understand the problem you are going to crack, or if there is any type of a priori
knowledge available, you could even design dedicated features.
Exercise 2.6 Consider the following regression problem from 2D to 1D. The input vectors
x are from a standard normal distribution in 2D. The corresponding outputs, y, are
obtained through the following equation: y = 50 sin(x1 ) sin(x2 ) + ε, where ε has a
standard normal distribution as well (but in 1D of course).
(a) Visualize 10,000 samples from this regression problem and have a look at the data
from different points of view.
21
(b) Fit a linear linear regression to these 10,000 points and measure the error on a
separate test set.
(c) Fit a second degree linear regression to these 10,000 points. Again measure the
error on a separate test set. Try the same for some higher degrees.
(d) Why can these higher-degree polynomials not fit this data better than the standard
linear regressor? Can you figure out what seems to be happening? (If not, maybe the
next question helps.)
(e) Let the input x be as in the above, but now take y = x1 x2 . Fit linear regressions
of degree 1 and 2 and report the error they make and/or visualize the solutions in
comparison with the actual training data.
optional
Exercise 2.7 (a) Given that we have a data set with d features, how many monomials of
degree m do we have? (Roughly, a monomial of degree m is the product of exactly
m variables, where every variable can occur multiple times. E.g. x5 and x2 z 3 are
monomials of degree 5. 1 is the only zero degree monomial.)
(b) Compared with the number of features that linearr uses with increasing degree
m, does the number of features when all cross-terms are included grow essentially faster?
(c) Can you come up with real-world classification problems where a polynomial ex-
pansion of even a moderate degree becomes infeasible?
end optional
Fisher’s linear discriminant (FLD, in PRTools referred to as Fisher’s linear classifier , fisherc,
and probably known under various other names as well, e.g. linear regression classifier,
Fisher classifier, least squares classifier) is the classifier that can be constructed with the use
of standard linear regression. We consider the two-class case, in which the input variables
are simply taken to be the feature vectors in our data set, while the corresponding classes
are typically encoded numerically as +1 and −1, i.e., Y ∈ {−1, 1}N in case of N training
samples. A test sample x is then assigned to the class sign(wT x). One typically assumes that
an intercept is included in the regression model.
Exercise 2.8 (a) Assume N is the total number of training samples, we have the same
number of samples in both classes, and the data mean is the zero vector.1 Show that
the optimal w is given by 21 T + (m+ − m− ), where T is the standard (biased) sample
estimate for the covariance matrix of the data and m+ and m− are the estimated class
means for the positive and negative classes, respectively. Note: A+ indicates that we
take the so-called pseudo-inverse of the matrix A. More on this you can find in any of
the suggested books. Also have a look at Exercise 3.5.
1
Alternatively, we could also have stated that it is given that the mean is the zero vector.
22
optional
At times, it can be convenient to have a probabilistic regression model. For instance, because
its predictions can be more easily combined with other probabilistic approaches or because it
may allow one to say something about the possible spread/uncertainty of an estimate.
Exercise 2.9 Let us consider a regression problem setting with a simple 1D input and 1D
output. Assume that x has a normal distribution with variance τ 2 and mean ν. In
addition, given x, let y be normally distributed around xw + w0 with variance σ 2 .
(a) Write out explicitly the joint probability distribution for observations (x, y)T .
(b) Assume ν, τ , and σ known. Consider the likelihood of this model for a data set
{(xi , yi )T }N
i=1 and derive the maximum (log-)likelihood estimate for w and w0 .
(c) Instead of the normal distribution for x, we now take a generic distribution, say,
p(x|θ), which depends on some parameter θ. When estimating the parameters, θ and
w by means of maximum likelihood, why can they be optimized separately? That is,
why does one not need to know the one to determine the other.
This last exercise shows why we may as well forget about the marginal distribution over
x if one is merely interested in the fitting of w.
(d) Determine the ML estimates for w, w0 , and σ given that these are the free param-
eters of the model.
optional
23
(e) Consider the likelihood of this model for a data set {(xi , yi )T }N
i=1 and derive the
estimates for w and w0 that maximize the likelihood times the above prior.
end optional
optional
Exercise 2.11 We have a 2-class classification problem in 2D for which we are going to
consider some specific linear classifiers. The class labels are from Y = {−1, +1} and we
have the feature vectors (0, −1), (1, 1), and (−1, 1). The first belongs to class −1 the
last two are from the positive class.
Let us consider two hypothesis classes. One is H0 = {h(x; a) = sign(aT x)|a ∈ R2 } and
the other is H1 = {h(x; a) = sign(x1 + a1 x2 + a2 )|a ∈ R2 }.
(a) Visualize, for both hypothesis classes, how the error rate depends on the choice of
parameters. (For this visualization purpose, tt should be OK to limit a to something
like [−2, 2]2 ⊂ R2 ) Certainly for H0 , it is doable (and potentially instructive) to sketch
the situation with pen and paper, but I can imagine you want to use some sort of 3D
function plotting for this.
(b) H0 , determine the (precise) subset of R2 for which the error rate on this training
set is minimal. What is the error rate achieved in this case?
Characterizing the set of optimal solutions in H1 seems much more troublesome. Do
have a go at it and see how far you get! Let us at least have a look at the following.
(c) What are the three corner points that you can identify in your plot or sketch from
2.11(a)? Draw the data set and indicate which three classifiers these correspond to.
Give their precise parameter values.
(d) Swap the labels for the training points and redo the plot from 2.11(a) (and compare
the latter to the previous visualization). What does the best training set error become
and what are the corner points of the set of all optimal a now?
24
(e) Construct a data set for which you can find an a from H0 that gives you an error
of 12 , while in H1 there is a perfect classifier achieving 0 error. How about the other
way around
(f ) Is there a data sets for which the training error rate cannot be better than 1 when
using either H0 or H−1 ?
Exercise 2.13 (a) Let the two samples in one class consist of the two corners of the left
side of the unit square [0, 1]2 and let the other class take the opposite two corners as
samples.
Determine the perceptron solution when you take η = 1, initialize with w = (0, 0), and
let φ(x) = (x, 1) for x ∈ R2 .
end optional
Exercise 2.15 Test and compare the Fisher classifier and perlc on the same dataset that
you have used in Exercise 2.14 by plotting the decision boundary in a scatterplot. Of
course, you could also measure and compare their performances on a large, new, and
unseen test dataset.
25
26
Week 3
These exercises are meant to give you both some practice with the material covered in the
lectures and the literature and an impression of what you are expected to know. There are
a lot of exercises. So, judge for yourself which exercises you need to, want to, or should
practice. There are anyway too many to go through in the 2 hours of exercise lab that we
have. On another note, not all exercises are thoroughly checked. If you think something is
wrong, unclear, etc., let us (i.e., any of the lecturers or TAs) know.
Objectives When you have done the exercises for this week and went through the related
reading material, you should
3.1 Regularization
Consider standard linear regression with the squared Euclidean norm as the so-called regu-
larizer (also referred to as L2 regularization):
N
X
(xTi w − yi )2 + λkwk2 = kXw − Y k2 + λkwk2 . (3.1)
i=1
27
The aim is to minimize this objective function over w ∈ Rd . The regularizer tries to keep the
weights small. This form of regression is also referred to as ridge regression (ridger).
We investigate the effect of this regularization term a bit. First, however, some (optional)
math.
optional
Exercise 3.1 (a) Given a data matrix X. Show that X T X is positive semidefinite (psd)
matrix.
(b) Show that X T X + λI, with λ > 0 and I the d × d-identify matrix, is always
invertible.
(c) Show that (X T X+λI)−1 X T Y gives the least squares solution to the above problem.
(d) Assume λ is fixed to a positive value, what solution do you find if the data set
grows larger and larger? Or in other words, how does the regularizer’s influence change
with a growing data set size?
(e) What solution is obtained in the limit of λ going to infinity?
Exercise 3.2 Set up an experiment to find if ridger does, in fact, regularize the intercept,
or not.
end optional
Exercise 3.3 We consider a regression data set where the inputs x are drawn uniformly
from the interval [0, 1]. The corresponding outputs y are obtained by adding Gaussian
noise to the inputs: y = x + ε with ε a random sample from the standard normal
distribution. You can generate regression data sets by means of gendatr or prdataset.
Ridge regression is carried out using ridger.
(a) Create a training data set of size 2 and a large test data set (1,000 or 10,000
samples will do). Study the regression fit to the training data for different amounts of
regularization (using command like scatterr and plotr). Also check how the squared
error varies for different λs (using commends like testr). Check a couple of new draws
for the training set and also try a different set size of, for instance, of 100. Try regular-
ization parameters varying from 0 (or something smallish like 10−3 ) to something larger
like 103 .
(b) Determine (roughly) which value for the regularization parameter gives, on average,
the best performance. Determine this optimum for a training set size of 2, of 10, and
for a training set size of 100. You can limit your search to λs in the range [10−3 , 103 ].
Exercise 3.4 Given a regression data set of fixed size. How would you tune the regularization
parameter?
optional
28
3.2 A Pseudoinverse
Exercise 3.5 The solution for the limit of lambda approaching from above to zero, i.e.,
limλ↓0 (X T X + λI)−1 , is equal to the pseudoinverse of X T X and denoted (X T X)+ .
Reason why, among all solutions of the unregularized regression problem, (X T X)+ X T Y
provides the minimum norm solution irrespective of the number of training samples. A
precise mathematical proof is not expected.
Exercise 3.6 Let’s say that we have two polynomial expansions P1 (x) and P2 (x) for a single
input variable x. Assume we have a linear transformation T (i.e., just a matrix) for
which T P1 (x) = P2 (x).
(a) Show that, if T is invertible, for unregularized linear regression, these two repre-
sentations are equivalent. That is, if w1 and w2 are the respective solutions, we have
w1T P1 (x) = w2T P2 (x) for all x.
(b) Why is the training loss using representation P1 never smaller than that obtained
with P2 , even if T may not be invertible? Can you give an example where it is strictly
better?
(c) Construct an example, possibly numerical, that shows that if we use L2 regularized
regression, P1 and P2 can lead to different solutions, even if T is invertible. Why is this
the case?
end optional
Another way of regularizing that is popular is through an L1 norm instead of the L2 norm.
This leads to the so-called least absolute shrinkage and selection operator or LASSO for short.
What is nice about the LASSO is that it often also results in a selection of features (feature
selection is also implemented through some other approaches as we will see later). That is, it
automatically leads to a reduction of the number of features that the final regressor depends
on. Here, we investigate that behavior a bit.
29
3.4 Hypothesis Classes and (Surrogate) Losses
Exercise 3.8 (a) In the linear regression setting above, what is the hypothesis class, what
is the loss, what is the regularization term?
(b) Let’s say we fit a Gaussian distribution to a data set D by means of maximum
likelihood. Give a hypothesis class and a loss. Do we have a regularization term?
In the typical settings that we encounter, the measure of fit can often be expressed as the
sum over individual elements in the data set. That is, we define a loss per element (x, y) ∈ D,
`(x, y|h) and calculate the overall or, more often, the expected loss as the average loss over
the data set: N1 N
P
i=1 `(x i , yi |h). Note that this is often what loss refers to: the loss per data
point (and this is also how we will typically use it). The expected loss is also referred to as
the (empirical)
R risk. Just mentioning risk most often refers to the expected true loss for a
given h: p(x, y)`(x, y|h)dxdy—the integral becomes a sum whenever appropriate, e.g. in
the classification setting y is, of course, a discrete label.
optional
Exercise 3.9 (a) Write down the loss (per data point) for linear regression in case we fit
linear functions.
(b) What loss1 is used when fitting a model to i.i.d. data under log-likelihood?
(c) Can you give a loss that cannot be expressed in terms of, or that is not equivalent
to, a sum over individual element in a data set?
(d) Can you give a learning methods for which seems to be difficult to formulate in
terms of hypothesis class plus loss?
30
(b) Consider Fisher’s linear discriminant. Rewrite its objective function in the form
that uses a margin-based loss function. How does v look now?
(c) Can you determine how the margin-based loss that corresponds to the SVM looks
like?
(d) What is the (obvious? logical? canonical?) hypothesis class for LDA (or ldc) and
what is the loss typically used? Do you think that LDA can be formulated in terms of
a margin-based loss?
end optional
Exercise 3.11 (a) Reconsider the solution found in 2.9(e). With which form of L2 regu-
larized linear regression do we get to exactly the same solution?
(b) Again, consider the solution found in 2.9(e). Consider standard multivariate linear
regression with squared Euclidean norm regularization as in 3.1. What prior should one
assume over w to let the MAP solution coincide with the minimizer of Equation (3.1)?
(c) Again, in the light of 2.9(e), with what prior do we get to the lasso as discussed in
Section 3.3?
optional
Exercise 3.12 Assume a linear relation y = wT x between input and output and assume
that, given the input, the output is normally distributed with mean 0 and unknown
variance σ.
(a) Given a data matrix X = (x1 , . . . , xN )T and corresponding output vector Y =
(y1 , . . . , yN )T . Determine the likelihood for w and show that its maximizer is equal to
the standard least squares estimator that we already saw in Section 2.1.
(b) Determine the maximum likelihood estimator for σ.
(c) Assume, additionally, a Gaussian prior over w with mean 0. The posterior for w in
this setting is Gaussian as well. Determine its mean and covariance using (2.116) from
Bishop’s book or, for instance, the technique of completing the squares.
(d) Determine the MAP estimator for w.
end optional
31
3.7 Sources of Variation
Exercise 3.13 In this exercise we investigate the difference in behavior of the error on the
training and the test set. Generate a large test set and study the variations in the
classification error based on repeatedly generated training sets:
>> t = gendath([500 500]);
>> a = gendath([20 20]); t*ldc(a)*testc
Repeat the last line e.g. 30 times.
(a) What causes the variation in the error?
Now do the same for different test sets:
>> a = gendath([20 20]);
>> w = ldc(a);
>> t = gendath([500 500]); t*w*testc
Repeat the last line e.g. 30 times.
(b) Again explain what causes the variance observed in the results.
The function cleval allows you to calculate so-called learning curves. These are curves that
plot classification errors against the number of points in the training data set. (Check help
cleval for the details.)
In this section we are going to study some learning curves for different data sets and different
classifiers. Some of the plots and number that will be generated demonstrate certain salient
and/or noteworthy behavior.
After reading a question, and before just implementing the things to be implemented, try to
think about the outcome to be expected. Eventually, say at the end of this course, you should
not be surprised by many of these things anymore!
Exercise 3.14 Generate Highleyman classes (gendath) with a 1000 samples per class. En-
large the feature dimensionality of this set by adding 60 dimensions of class independent
randomness, i.e., plain noise. Use cleval to generate learning curves for nmc, ldc, and
qdc using 64, 128, 256, and 512 objects in the training set (make sure that you repeat
often enough. . . ). (In Matlab, use plote to plot these curves in a single figure (check
the help). In Python, the curve is automatically shown.)
(a) Can you explain the overall behavior of these curves?
(b) Explain why the curves intersect. Which classifier performs best?
(c) What do you expect the limiting behavior of learning curves is? That is, if we were
able to train on more and more data?
32
Exercise 3.15 (a) Take your favorite data set and study a learning curve for the first
nearest neighbor classier (1NN). What can be expected of the learning curve of the
apparent error?
optional
Exercise 3.16 Redo the experiments from Exercise 3.14 but substitute gendath with
gendats and gendatd. Don’t forget to add the 60 noise features.
(a) What kind of typical differences do you see when comparing behaviors for different
data sets? Explain these differences?
3.8.2 Dipping
Exercise 3.17 Study the learning curve for nmc in combination with gendatc for very, very,
very small training set sizes.
(a) What seems to be happening? And is this expected?
Exercise 3.18 This more an challenge than an exercise. You might consider skipping it
and move on to the next exercise. Then again, you also might take on this challenge!
(Anyway, do keep track of time somewhat.)
(a) Create a two-class problem in which ldc, on average, outperforms qdc for large
sample sizes.
end optional
Like learning curves that typically plot the classification error against the number of training
set samples, making feature curves can also be informative. The latter studies how the
classification error varies with varying number of feature dimensionality. We can use the
clevalf command to perform such study.
Exercise 3.19 Load the data set mfeat kar and make a feature curve for 4, 8, 16, 32, and
64 features using 50% of the data for training based on qdc.
(a) When redoing this experiment, would you expect the same curve? Why? Or why
not?
(b) What, generally, happens to the learning curve when 40% or 80% of the data is
used for training? Can you explain this?
33
3.9 Cross-Validation
The prcrossval function allows you to perform a cross-validation on a given data set using
a particular classifier (or even a whole bunch of them in one go).
Exercise 3.20 Generate a small data set using gendatb, say, with 10 objects per class.
(a) Using n-fold cross-validation, make plots for the error rates for kNN and 1NN over
different values of n. Also calculate the standard deviation of the error estimate, e.g.,
by performing the cross-validation 10 time (check the help).
(b) What do you notice about the estimated error rates? What is the general
trend (maybe you should redo the data generation and the cross-validation a couple
of times).
(c) What happen to the variance of the estimates for varying n? Again, we are inter-
ested in the general trend.
(d) How would the observations change if one would repeat the experiments with much
larger dataset? Would they change?
optional
A confusion matrix is of the size C × C, where C is the number of classes. An element C(i, j)
encodes the confusion between the classes i and j. More precisely, C(i, j) gives the number
of objects that are from i, but are labeled as j. Nothing confusing about it.
Exercise 3.21 To create a confusion matrix you may use something like the following code.
>> lab = getlab(test_set);
>> w = fisherc(train_set);
>> lab2 = test_set*w*labeld;
>> confmat(lab,lab2)
An alternative is:
>> confmat(test_set*w)
(a) What is stored in lab and what does lab2 contain?
Exercise 3.22 Load the digits data sets mfeat zer and mfeat kar. Split the data up in a
training and a test set (using gendat).
(a) Use your favorite classifier. What are the error rates on the test sets?
(b) Use confmat to study the error distributions more closely. Where do the larger
parts of the errors stem from?
(c) Can you explain which of the two feature sets is insensitive to image rotations?
34
a b
Exercise 3.23 Given a confusion matrix c d for a two-class problem.
(a) Give an estimate of the expected cost (per object) when misclassifying class 1
as class 2 is twice as expensive as the other way around and assuming that a correct
classification incurs no cost.
end optional
Exercise 3.24 Make a learning curve for fisherc for sample sizes ranging from 1 per class
to about 50 per class for gendats in 40 dimensions. Is this the learning curve that you
had expected? Can you explain it?
35
36
Week 4
Complexity
Objectives When you have done the exercises for this week, you
• should know the fundament of the support vector classifier (i.e. maximum margin),
• should be able to kernelize a nearest mean classifier,
• optimise a hyperparameter using crossvalidation.
In order to keep the code text simple, we may assume that we imported Prtools as
>> from prtools import *
Exercise 4.2 (a) Demonstrate, possibly graphically/geometrically, that the support vector
classifier is sensitive to feature scaling. Hint: this can be done in 2D based on a training
set of size 3 (like in 4.1(a) and 4.1(b)) and a single test point.
Exercise 4.3 (a) Study, again, small data sets in 2D, for instance those from 4.1(a) and
4.2, or generate one yourself and compare the solution of the fisherc classifier to those
obtained by means of an SVM. In what cases do they differ? Do you see the pattern?
optional
37
Exercise 4.4 Make exercise 3.17 in the book.
Section 3.8
end optional
Exercise 4.5 (a) Assume we have two objects, represented by 1-dimensional feature vectors
x and χ. Find a feature mapping φ that leads to the inner product exp(−(x − χ)2 ).
Hints: expand the term −(x − χ)2 and write exp(2xχ) as a series based on the Taylor
series of the exponential.
(b) What is the dimensionality of the space that φ maps a 1-dimensional feature vector
to?
Exercise 4.7 The function svc can be used to both construct linear and non-
linear support vector machines. The following kernels K are defined:
’linear’ linear kernel (default)
’poly’ polynomial kernel with degree par
’rbf’ RBF or Gaussian kernel with width par
To define the kernel in svc, supply a second input argument with a list of kernel type,
kernel parameter, and tradeoff parameter C: svc(a,(kernel type,par,C)).
(a) On a = gendatb([20,20]), train an svc with a rbf’ kernel, i.e., the Gaussian
kernel, for kernel widths that vary from fairly small (0.1?) to fairly large (10?). Check
with a large (enough) independent banana test set how the performance varies for the
different choices of kernel widths.
(b) How does the kernel width of parzenc relate to the width of the radial basis
function?
(c) Why can the svc, potentially, perform much faster at test time than the Parzen
classifier?
Exercise 4.8 The prcrossval function allows you to perform a cross-validation on a given
data set using a particular classifier. You should supply a dataset, an untrained mapping
and, optionally, the number of folds. The function then returns an error estimate for
each of the folds:
>> e = prcrossval(a,svc([],(’rbf’,2.0,1.)),k=10)
38
Given the banana dataset with 200 objects per class, a=gendatb([200,200]), we want
to optimise the hyperparameter of an RBF support vector classifier.
(a) Choose a range of possible hyperparameter-values for the radial basis kernel (some-
thing like s = [0.2,0.5,1.0,2.0,5.0,7.0,10.0,25.0]?). Estimate for each s the
crossvalidation error. Which s is optimal?
optional
Exercise 4.9 (a) Use the fact that (A + BB T )−1 B = A−1 B(I + B T A−1 B)−1 to show that
X T (XX T + λIN )−1 = (X T X + λId )−1 X T , with Ia the a × a identity matrix.
(b) Using the identity from 4.9(a), show that estimating the y to an unobserved x
through ridge regression can be expressed completely in terms of inner products between
input vectors and values from Y .
Define k(x, z) = (xT z + c)2 . We will see that k is a kernel by explicitly finding the
mapping φ that takes the feature vectors x and z to a new space where the inner
product equal k(x, y), i.e., you will find φ such that φ(x)T φ(z) = k(x, z).
(c) First take x and z to be 1-dimensional and write out/expand k(x, y).
(d) Do the same for 2-dimensional x and z.
(e) See the pattern.
Unfortunately, there is no kernelized ridge regression in PRTools. So, for the next
experiment, you have to implement it yourself. The function proxm could come in
handy.
(f ) You may want to check that kernel ridge with a very small λ and the above kernel
k (with c fixed) basically solves the problem in 2.6(e). (But maybe you should not do
the experiment with too large a training set. . . )
(g) Why does the choice of c does basically not matter in the foregoing experiment as
long as it is not set to 0 (and one has enough training samples)?
39
both across degree and across training set sizes. How do the bias and variance change?
Did you expect this in the light of the the bias-variance tradeoff?
Exercise 4.12 Generate a small Banana data set, using gendatb, say, with 10 samples per
class. Let us use a linear classifier on this data set, for instance, LDA, logistic regression,
or the NMC.
(a) Determine which parts of this data set are as good as always misclassified. Is this
a manifestation of the bias or the variance?
(b) Determine for which parts of the feature space the classifiers often disagree. Is this
a manifestation of the bias or the variance?
end optional
40
Week 5
In what follows, p is used for pmfs as well as pdfs. Likewise, the term probability may also
refer to a value that is actually a density. This should not lead to any (real) confusion. Inform
us, however, if you think it does. Admittedly, we generally may use somewhat sloppy notation.
E.g. we may not be talking in terms of random variables there where it might actually be
more appropriate to do so. In a similar vein, we often just talk about the distributions p(a)
and p(µ, b) over a and (µ, b), respectively, without making very explicit, for instance, that
these are of course different p.
Exercise 5.1 (a) So, what is the probability that you do indeed know all of ML?
41
5.2 Maximum Likelihood and A Posteriori Estimation
Exercise 5.2 Given N i.i.d. observations xi , with i ∈ {1, . . . , N }, from a true distribution p
(e.g. a pdf or a pmf). Assume we want to fit a distributional model p(·|θ) parameterized
through θ for such observations. So, p(·|θ) is a distribution for every choice of θ, which
takes xi ’s as input variable.
(a) Consider a fixed parameter value θ and assume that this is the correct model.
Write down the probability (or, in the continuous setting, the probability density) of
observing the sample of N xi ’s given this parameter value.
(b) How is the solution to the previous question also referred to if we consider it as a
function of the parameter?
Exercise 5.3 The maximizer of the likelihood, gives us the maximum likelihood estimator for
the parameter. You could say that this parameter choice maximizes the probability1 of
observing our i.i.d. training data set {x1 , . . . , xN } among all possible parameter choices.
Another way to come to an estimate for our parameter θ is to aim for the most probable
choice of parameter given the data.
(a) Use Bayes’ theorem to write the probability of a specific parameter θ, given the
observed training data, in terms of the likelihood of θ.
The probability distribution over the parameter space that comes in through Bayes’
theorem, gives us the possibility to encode our a priori expectations about the possible
solution. Once we have decided on this so-called prior, we find another estimate of our
parameter by maximizing the probability of θ, given the observed training data. This
estimator is called the maximum a posteriori estimator or MAP estimator for short.
(b) When maximizing the a posteriori
Q probability, why do we not need to bother
ourselves with the data marginal p(xi )?
(c) What solution does the MAP estimator give us if we assume p(θ) to be constant,
i.e., we assume a so-called flat prior?
(d) Say we are not only interested in the optimal estimator, but we also care about
the actual probability p(θ|x1 , . . . , xN ). Assuming Q
p(x|θ) to be the true model for an
observation x, can you come to an expression for p(xi ) in terms of θ, among other
terms? Hint: from the joint distribution over x and θ, we can of course derive any
marginal. . .
optional
42
Exercise 5.4 (a) We want to fit a model p(x, y|w). Give the log-likelihood of a single
labeled observation (xi , yi ) and a single unlabeled observation xj .
Here’s a taste of what the clustering lectures will have in store for us (see Section 5.13).
(b) Say our model is QDA, show that if we only have unlabeled data available, the
log of the model likelihood on the training data is equal to the objective function used
when fitting a mixture of Gaussians.
(c) What is (probably) the most-used technique to estimate the parameters of such a
model with so-called unobserved latent variables? How would you get to the likelihood
estimates?
end optional
Exercise 5.5 In a two-class problem, labels y ∈ {−1, +1} can be modeled through a
Bernoulli distribution. Let the parameter q be the probability of observing y = −1.
Assume the a priori distribution for q equals p(q) = q −1 (1 − q)−1 .
(a) Determine the maximum a posteriori probability for q given that we observed one
+1 and one −1.
(b) The prior used is referred to as the Haldane prior. It is a so-called improper prior.
Which properties of a pdf does this improper prior fulfill and which doesn’t it?
optional
Exercise 5.6 Let us a priori assume that the mean m of a 1D distribution that we want to
fit is within an interval [−a, +a] around the origin. Within this interval, the probability
for every mean is equal.
(a) What kind of distribution do we have on [−a, a]?
(b) Let us assume in addition that the 1D model distribution that we are fitting is
normal and that there is no prior assumption on the variance (or, equivalently, that the
prior on the variance is flat). Given N observations x1 , . . . , xN from R, determine the
MAP estimates for the mean and the variance. (No need to derive this very formally
and precise, but your solution should be correct of course and you should be able to
properly defend it.)
(c) Let us now assume that the 1D model distribution that we are fitting is actually
uniform between l and u. There is no prior assumption on any of the other parameters
except for the mean. Given N observations x1 , . . . , xN from R, determine the MAP
estimates for l and u.
end optional
43
Exercise 5.7 Consider a two-class classification problem in a feature space of one dimension.
Let Ni be the number of observations from class i. Let (xi , yi ) (i ∈ {1, . . . , N }) be
N = N1 + N2 pairs of observations, i.e., we have N different training samples with
feature value xi and corresponding class label yi . Assume now we want to train a linear
discriminant classifier (LDA) that assumes Gaussian class-conditional distributions for
which the two variances (and standard deviations) are equal. That is, we model the
classes by normal distributions with equal (co)variance structure.
(a) Write down the full density p(x, y) for this two-class Gaussian model.
(b) Use the logarithmic loss to write down the (maximum) likelihood risk functional
for this model. Indicate clearly which parameters have to be estimated, i.e., how are
the two priors, the two means, and the single standard deviation denoted?
(c) Write down the expression for the empirical likelihood risk functional for the two-
class Gaussian model and express it in the N different training samples (xi , yi ) (i ∈
{1, . . . , N }).
(d) Derive the maximum likelihood estimates for the priors, means, and variances for
this model. The derivation doesn’t have to be given in every single detail, but should
contain sufficient formulas and/or explanatory text to be able to follow the proof.
Assume now, we have some prior knowledge about the mean of class 1 that we can
encode by means of a prior over this variable. Assume this prior to be a Gaussian
distribution with zero mean and variance λ2 , i.e., we expect the mean of class one to be
near the origin.
(e) Given again the N observations, write down the expression for the posterior prob-
ability of the five parameters considered earlier. You may want to use Bayes theorem
and the earlier obtained expression for the (empirical) likelihood risk functional in order
to come to an expression for this posterior.
(f ) Determine the MAP solution for this posterior probability.
(g) What happens in the extreme cases that λ goes to either 0 or ∞?
(h) Consider the following additional extreme cases: limN1 →∞ , limN1 ↓0 , lims↓0 (where
in the last expression s denotes the standard deviation of the Gaussian classes).
(i) For which of the above five extreme cases, do we find the ML estimate back? In
which cases does the prior decide on the optimal MAP solution?
Exercise 5.8 In case we are dealing with a 1D scale parameter s, like the standard deviation
in a normal distribution, a standard prior to use for this parameter is 1s .
(a) Why is this not a proper prior?
(b) Will the MAP estimator for such scale parameter typically be larger or smaller
than the ML estimator, or will they be exactly the same?
44
(b) Can we see these priors as regularizers?
Exercise 5.10 (a) Reconsider the solution found in 2.9(e). With which form of L2 regu-
larized linear regression do we get to exactly the same solution?
(b) Consider standard multivariate linear regression with squared Euclidean norm
regularization as in 3.1. What prior should one assume over w to let the MAP solution
coincide with the minimizer of Equation (3.1)?
(c) With what prior do we get to the lasso as discussed in Section 3.3?
When doing actual predictions, especially in the Bayesian setting, one typically is not inter-
ested in the estimates of the parameters of a distribution as such. In the regression setting,
for instance, one rather cares about the output value for a particular input given all of the
observed data.
Exercise 5.11 A random variable Y takes on the value +1 with probability θ and −1 with
probability 1 − θ. A random draw lead us to make the observation Y = +1.
(a) Determine the maximum likelihood estimate for θ.
(b) Take the a prior probability for θ to be uniform on [0, 1], determine its MAP
estimate.
(c) Assume Y ∗ is an i.i.d. draw from the same distribution as Y . Determine the
predictive distribution for p(Y ∗ |Y = +1).
We are going to take a new (i.i.d.) draw Y ∗ from the same distribution as Y . The idea
is, however, that we should guess at the outcome of this draw beforehand.
(d) Assume the loss to be minimized with our guess is the error rate (or the 0-1 loss),
what label should we guess for this new draw according to the maximum likelihood
estimate, i.e., which guess minimizes the expected error? What would our guess be if
we rely on the MAP estimate? And what if we use the predictive distribution?
(e) Assume guessing +1, while the actual draw will give −1, costs us 1, while the
cost of make the other mistake, guessing −1, while the actual draw will give +1, is a
nonnegative variable c. Relying on the predictive distribution, for which cost c does it
become beneficial to guess −1? At this same cost, what guess would we prefer according
to the ML and MAP?
(f ) What guess should we do for Y ∗ in case, the loss considered is the squared loss?
Again, go through the above three scenarios: ML, MAP, and predictive distribution.
Note that, a priori, we are of course not limited to choosing −1 or +1.
45
5.8 Some Bayesian Network Basics
Exercise 5.12 Is there an essential difference between the dependence of the two variables
a and b as defined by the two Bayesian networks below?
a b
a b
Exercise 5.13 (a) With three nodes, how many different DAGs can be con-
structed?
(b) With three nodes, how many essentially different DAG “types“ are there (graphs
that are not equivalent by permuting the nodes)?
(c) With three different random variables, how many essentially different (in the
sense of describing different conditional independences) Bayesian networks can be con-
structed?
(d) How many essentially different “types“ of Bayesian networks are there (where,
similar to b), we consider a type different when the conditional independencies described
can’t be found by simply permuting indices of the variables)?
Exercise 5.14 Given 2 variables for which we cannot make any independence assumptions.
In this case, we can make 2 different decompositions of the joint distribution in terms
of individual distributions (i.e., distributions for single variables that can, however, be
conditioned on any number of other variables). In particular, we have p(x1 )p(x2 |x1 )
and p(x2 )p(x1 |x2 ).
(a) Given 3 variables for which we cannot make any independence assumptions, how
many different decompositions does the joint distribution allow in terms of individual
distributions?
(b) Can you come up with the general solution for N variables?
optional
Exercise 5.15 Given a Bayesian network for the variables xi , i ∈ {1, . . . , N }. Denote by
P a(xk ) the subset of variables that are parents of xk . Define the product of conditional
probability density functions
N
Y
p(xi |P a(xi )).
i=1
end optional
46
Exercise 5.16 Assuming that p(a|b, c) = p(a|b), show that p(c|a, b) = p(c|b).
Exercise 5.17 Consider a problem with three variables a, b, and c and discrete observations.
(a) Show by constructing a counter example that p(a, b) = p(a)p(b) does not generally
imply that p(a, b|c) = p(a|c)p(b|c).
(b) Also show that the implication in the other direction generally does not hold.
Again, try to do so by means of a simple counterexample.
a b c
(a) Assume X, Y, Z ∈ {a, b, c, d} and X, Y , and Z different from each other. Suppose
we want to consider paths between X and Y . How many possible, essentially different,
graphs do we have to consider?
(b) For which of the graphs of X, Y , and Z are all paths from X to Y blocked by Z?
(c) For which of the graphs do we have that X ⊥ Y |Z?
(d) Assume now that there is also an arrow from c to d. How do the answers to the
above questions change?
optional
Exercise 5.19 = Exercise 2.39 from Pattern Recognition (fourth edition) by Theodoridis
and Koutroumbas.
end optional
47
s e l
(a) Determine p(e = 1), i.e., the a priori probability of passing the exam.
(b) Calculate p(e = 1|l = 1) and p(l = 1|e = 1).
(c) What value does p(l = 1|e = 1, s = 1) take on?
(d) Given that somebody passed the exam, is it more likely that the person studied
(properly) or that s/he attended the lectures? What is the difference in probability?
optional
Exercise 5.21 = Exercise 2.41 from Pattern Recognition (fourth edition) by Theodoridis
and Koutroumbas.
end optional
Exercise 5.23 (a) Given that the independence assumption made in a particular Bayesian
network are indeed correct, why would this typically lead to improved performance over
a statistical model that does not make these assumptions?
(b) Even if the assumptions are only correct to some degree, why can they help in
getting to improved performance?
Let us now compare ldc and a naive Bayes version of ldc that treats all features as
independent given the class label. (The latter you may have to implement yourself.)
(c) Generate a data set in 10 dimensions of size 5000 using gendats and compare the
learning curves for 10, 20, 30, . . . , 100 training samples of ldc and its naive version.
Qualitatively explain their behavior and their performance differences.
(d) How do those learning curves look if gendatd is used rather than gendats? Explain
the differences we now see.
optional
Exercise 5.24 Consider the polynomial regression setting in Subsection 8.1.1 of Bishop.
Specifically, look on page 264 in Figure 8.6.
(a) What does it mean that xn is modeled with a small, solid dot?
(b) What would it mean if these xn are modeled by means of a shaded node?
48
(c) Can you come up with situations where one would consider these xn to be deter-
ministic and situations where these are stochastic?
(d) In most classification settings one considers, would the input feature vector typi-
cally be taken stochastic or deterministic?
Exercise 5.25 Draw a Bayesian network for LDA in which both its parameters, the feature
and label data (say, N samples to train on), and their relations are shown. Are there
any (conditional) independency assumptions that can be made?
end optional
49
5.11 Clustering
In the rest of this week’s exercises, we will discuss the problem of clustering; this practical
11 session is intended to familiarise you with the different techniques that were discussed, more
specifically hierarchical clustering, the mixtures-of-Gaussians and clustering evaluation.
This week we will focus on which objects belong together in groups or clusters. This clustering
process enables us to extract structure from the data, and to exploit this structure for various
purposes such as building a classifier or creating a taxonomy of objects.
The most difficult part of clustering is to determine whether there is truly any structure
present in the data and if so, what this structure is. To this end, we will also employ cluster
validity measures to estimate the quality of the clustering we have obtained.
In the lectures we discussed hierarchical clustering at length. There are basically two choices
13 that need to be made in hierarchical clustering in order to construct a dendrogram:
In this course, we will only employ the Euclidean distance between two samples as a measure
of dissimilarity. As you will recall from the lecture, there are three types of linkage: complete,
single and average linkage. Once the dendrogram has been constructed, we need to cut the
dendrogram at an appropriate level to obtain a clustering.
Exercise 5.26 Start with the hall dataset, an artificial dataset with clear structure. This
dataset can be loaded into a prdataset by using read mat("hall").
(a) Load the dataset and use scatterd to visualise it. How many clusters are visible
in the plot?
(b) What is the most suitable clustering?
Exercise 5.27 Load the rnd dataset and make a scatterplot to visualise it. This is a uni-
formly distributed dataset, with no apparent cluster structure. We will hierarchically
cluster this dataset to get an idea of what a dendrogram looks like when there is no
structure in the data.
(a) Plot the dendrogram with complete linkage using dendro(+a, "complete").
What is apparent?
(b) Perform hierarchical clustering with hclust on the rnd dataset with complete
linkage. The function hclust is a mapping that can be trained on a distance matrix D
using complete linkage, to get 3 clusters, by doing:
lab = hclust(D,’complete’,3)
50
This distance matrix can be obtained from the original dataset a by computing some
dissimilarity between the objects:
D = a*proxm(a,(’eucl’)) # Euclidean distance
Of course, you’re free to choose other dissimilarity definitions as well:
D = a*proxm(a,(’city’)) # or, city-block distance
You can now relabel the original data with the new labels, and plot:
b = prdataset(+a,lab)
scatterd(b)
(c) Repeat this for single and average linkage. Do you observe the same behavior as
with complete linkage?
Exercise 5.28 (a) Perform hierarchical clustering on the hall dataset with complete link-
age: what do the lengths of the vertical stems in the dendrogram tell us about the
clustering?
(b) Cut the dendrogram at different levels, i.e. experiment with different numbers of
clusters when calling the hclust. Can you think of ways in which a good clustering can
be defined?
(c) Can you devise a simple rule-of-thumb (in terms of vertical stem lengths) for finding
a good clustering in the dendrogram?
(d) Now perform single linkage hierarchical clustering. Do you notice any significant
differences with the complete linkage dendrogram?
(e) Do you notice any differences with the average linkage dendrogram?
During the lectures, the concept of clustering based on the quality of a mixture-of-Gaussian
density fit to the data was discussed. The operation of the Expectation-Maximisation (EM) 14.2
algorithm, which is employed to estimate the parameters of the mixture model, was also
discussed in detail.
In the following exercise, mixture model clustering will be explored with the aid of the mog
function. This function performs the Expectation-Maximization procedure to find the pa-
rameters of a Mixture of Gaussians, trained on some dataset a:
w = mog(a,(3,’full’,0.001))
scatterd(a)
plotm(w)
The function mog has three input parameters: (1) k for the number of clusters, (2) the shape
of the covariance matrix of each of the clusters, (3) a regularisation parameter that regularises
the inverse covariance matrix.
When you fitted the mixture, the trained mapping outputs a probability density for each of
the mixture components. So if you ask for k=3 clusters, each input object x, will return a
vector of 3 values. You can now compute the (log-)likelihood of some dataset by first summing
51
the probabilities of all clusters, then take the logarithm, and then add all log-probabilities of
the full dataset:
pred = a*w
logL = numpy.sum(numpy.log(numpy.sum(+pred,axis=1)))
print(logL)
Exercise 5.29 (a) Load the triclust dataset and play around with the function mog.
Vary the number of Gaussians employed in the mixture model, and also vary the type
of Gaussian employed. Relate the type (’full’,’diag’,’sphr’) of the Gaussian to its
covariance matrix.
(b) On the cigars dataset, fit an unconstrained Gaussian (type = ’full’) mixture
model using the function mog. For the number of clusters k, assume the following values:
k = 1, 2, 3, 4, 5. Which k do you expect to be the best? .
(c) Now try clustering the messy dataset. What is the best shape to employ for the
Gaussians? What is the optimal number of clusters?
This part of the session is intended to familiarise you with different cluster validity measures.
16 We will achieve this by:
2. coding and applying the Davies-Bouldin index to various algorithms and datasets.
The fusion graph plots the fusion level as a function of the number of clusters (g). For example,
the fusion level at g = 2 represents the (single, complete, average) link distance between the
clusters that are merged to create two clusters from three clusters. A simple heuristic to
determine the number of clusters in hierarchical clustering is to cut the dendrogram at the
point where we observe a large jump in the fusion graph.
The following three exercises focus on the estimation of the number of clusters based on the
fusion graph.
Exercise 5.31 (a) Load the triclust dataset. Perform single linkage hierarchical cluster-
ing and display its fusion graph by using the function: fusion graph. Where do you
observe the largest jump?
(b) Now perform complete linkage hierarchical clustering. Does the fusion graph give
a clear indication of the number of clusters in the data? If not, why not?
52
Exercise 5.32 (a) Load the hall dataset. Perform single linkage hierarchical clustering
and display the fusion graph. What do you observe in the fusion graph?
Exercise 5.33 (a) Finally, load the messy dataset. Perform single linkage hierarchical clus-
tering. According to the fusion graph, where should the dendrogram be cut?
(b) Does a satisfactory clustering result from cutting the dendrogram at this point?
Motivate.
(c) Now perform complete linkage clustering. Is the clustering suggested by the fu-
sion graph better or worse than the clustering obtained with single linkage cluster-
ing?
D.L. Davies and D.W. Bouldin2 introduced a cluster separation measure which is based on
both the within-scatter of the clusters in a given clustering and the separation between the
clusters. Formally, this measure is known as the Davies-Bouldin index (DBI). It assumes that
clusters are spherical, and that a desirable clustering consists of compact clusters that are
well-separated.
Suppose we wish to compute the DBI for a clustering consisting of n objects assigned to g
clusters. We can compute a score for every possible pair of clusters in this clustering, which
is inversely proportional to the distance between the cluster means and directly proportional
to the sum of the within-scatters in the pair of clusters. This score is given by
σj + σk
Rjk = , j, k = 1, 2, . . . , g; k 6= j. (5.1)
kµj − µk k
Here µj is the mean of all the objects in cluster j and σj is the within scatter of cluster j,
given by: v
u1 X
u
σj = t kx i − µj k2 , (5.2)
nj
x i ∈Cj
where Cj is the set of objects associated with cluster j and nj is the number of objects in
cluster j. The score, Rjk , is small when the means of clusters j and k are far apart and the
sum of the within-scatter for these clusters is small. Since cluster j can be paired with g − 1
other clusters, resulting in g − 1 pair-scores, Rjk , j = 1, 2, . . . , g; k 6= j, a conservative estimate
of the cluster score for cluster j, when paired with all other clusters, is obtained by assigning
the maximal pair-score with cluster j:
The Davies-Bouldin index of the complete clustering is then determined by averaging these
maximal pair-scores for all clusters:
g
1X
IDB = Rj . (5.4)
g
j=1
2
IEEE Transactions on Pattern Analysis and Machine Intelligence 1, pp. 224–227, 1979.
53
Exercise 5.34 We will employ the Davies-Bouldin index to evaluate the clustering produced
by hierarchical clustering. We will do so for a range of clusters in order to determine
the optimal number of clusters, i.e. the best level to cut the dendrogram at. To achieve
this we employ the function dbi.
(a) The function dbi takes the dataset features and the labels predicted by the clus-
tering method as inputs. It computes, for each clustering, the DBI. Familiarize yourself
with the operation of this function.
(b) Load the triclust dataset and make a scatterplot. What do you expect the DBI
values as a function of the number of clusters to look like?
(c) Now apply dbi to this dataset with complete linkage clustering, starting at 2
clusters and stopping at 10. What is evident from the DBI values?
(d) Apply dbi to the hall dataset with complete linkage clustering, starting at 2
clusters and stopping at 20. What do you observe? Can you explain your observa-
tion?
54
Week 6
These exercises are meant to give you both some practice with the material covered in the
lectures and the literature and an impression of what you are expected to know. Judge for
yourself which exercises you need to, want to, or should practice. On another note, at this
point, the lecturer (ML) cannot claim that all exercises are thoroughly checked. If you think
something is wrong, unclear, etc., let us (i.e., me (ML) , any of the other lecturers, or any of
the TAs) know.
Objectives When you have done the exercises for this week and went through the related
reading material, you should
Exercise 6.1 Given a data set with 100 features. Say you want to find the 5 best features.
How many feature sets do you need to check for this in case we have to rely on an
exhaustive search?
55
6.2 Stuff on Scatter Matrices
Exercise 6.2 Given a set of feature vectors {xi } for which the sample covariance matrix
equals C and given a linear transformation A. Show that the covariance matrix for the
transformed feature vectors {Axi } equals ACAT .
optional
Exercise 6.3 Note: this might be a tiresome exercise but if you have too much time on your
hands. . . One way or the other, you should know that Sm = Sb + Sw for a fact!
(a) Show that Σi = E[(x − µi )(x − µi )T ] = E[(x − x0 )(x − x0 )T ] in which x and x0 are
equally distributed, i.e., according to the distribution of class i.
(b) Exercise 5.12 Ex. 5.12 from the book. Show that the mixture scatter matrix is the
sum of the within-class and between-class scatter matrices. The fact proven in (a) can
be very helpful here.
Exercise 6.4 (a) Variation to Exercise 5.18 from Pattern Recognition by Theodoridis and
Koutroumbas (where the book is slightly inaccurate). Convince yourself that if the
number of classes equals M that the rank of the between scatter matrix Sb is at most
M −1. You can either try to proof this, which might be a bit though, or you can develop
some insight be staring at the formula or doing some experiments in, say, Matlab.
(b) In which (two) different cases can the rank of Sb be smaller than M − 1? Note
that this applies similarly to any estimate of a covariance matrix based on M samples.
Exercise 6.5 Make Exercise 5.20 from Pattern Recognition by Theodoridis and Koutroum-
bas. (Note that Sxw actually equals Sw ?)
end optional
56
6.4 PCA and Fisher
Exercise 6.7 Assume we have a d-dimensional data set for which we have a covariance
matrix C.
(a) Say that we map all the data point linearly to 1 dimension by mean of the d-vector
v. Show that the data variance in this 1D space equals v T Cv.
(b) Assume that v has norm 1, i.e., ||v|| = 1. Show that the v maximizing v T Cv is the
eigenvector corresponding to the largest eigenvalue.
optional
Exercise 6.8 Assume that the overall mean of a data set is zero and that all feature vectors
are stored in the rows of the matrix X.
(a) Show that the eigenvector of X T X with the largest eigenvalue is equal to the
largest principal component.
(b) Given that v is a (column) eigenvector with eigenvalue λ of XX T , show that X T v
is an eigenvector of X T X. What is the eigenvalue associated to the eigenvector X T v?
(c) Show that a (row) feature vector x from the original space can be mapped onto
the first PC by using xX T v.
(d) Describe now how one can kernelize PCA. That is, describe the procedure to do
the actual calculation. All ingredients are there. (Hint: realize what information XX T
and xX T contain.)
Exercise 6.9 Consider 1000 samples from a 3D Gaussian distribution. Assume this sample
has zero mean and a 3 × 3-covariance matrix C given by
99 0 0
C = 0 99 0 .
0 0 124
(a) Which direction gives the first principle component for this data set?
It turns out, that these 1000 samples are in fact the class means of 1000 different classes
of bird species described by three different features. Assume that all class priors are
1
1000 and that all these classes are normally distributed with covariance matrix equal to
the identity matrix.
(b) Give the total covariance matrix (also called the mixture scatter matrix) for this
data set.
(c) The Fisher mapping determines the Eigenvectors with the largest Eigenvalues of
the matrix Σ−1 −1
W ΣT (or, if you prefer, ΣW ΣB .) ΣW , ΣB , and ΣT are the within-class
covariance, the between-class, and the total covariance matrices, respectively. Which
direction gives the optimal Fisher mapping if we want to reduce the feature space to
one dimension?
57
In a next step, a linear feature transformation T is applied to the original 3D space.
The 3 × 3-matrix describing this transformation is given by
1
1 2 0
T = 12 1 0 .
0 0 1
(d) Recall your answer under b. and show that the total covariance matrix of the data
after the transformation equals
125 100 0
100 125 0 .
0 0 125
(e) Are the first two features correlated and, if so, are they positively or negatively
correlated?
(f ) Determine the first principal component for this new feature space.
(g) Let v the Fisher vector you found under exercise c. How can you calculate the
optimal Fisher mapping in Matlab for the transformed data set, given you can use
standard functions from Matlab, but only v and T as arguments?
Exercise 6.10 Consider the following two-class problem in three dimensions. Class one has
a normal distribution with unit covariance matrix centered in the origin. Class two lies
symmetrically around the first class and consists of two normal distributions (a mixture
of Gaussians if you like) of which the two means are located in (−3, −3, 2) and (3, 3, −2).
The classes have equal priors. We are going to study dimensionality reductions for this
classification problem.
(a) Consider the Fisher mapping (fisherm). Give a formula for the Fisher criterion
that this procedure optimizes. Make sure that all variables etc. are properly clarified.
Explain what this criterion aims to do and how it works.
(b) Explain why you would expect to find better subspaces with the Fisher mapping
in case you are dealing with small training set sizes rather then large training set sizes.
(c) What can you say about the subspaces that you will find in case the training set
is very, very large? Is it a good subspace or a bad subspace and why is this so?
(d) How would PCA work under different sizes of training sets on this data set? Would
it work well in comparison to the Fisher mapping? Explain your answer.
(e) For a particular training set, the covariance matrix has the form
3 0 2
0 0 0
2 0 3
58
Exercise 6.11 Let us consider a two-dimensional, two-class problem in which the two classes
are normally (Gaussian) distributed with both covariance matrices equal to 1 0 and
0 1
with equal class priors. The two class means are given by 00 and 20 , respectively. We
are going to qualitatively study the learning curve of the true error of a classification
procedure that first extracts a single feature from a sample of this data set using PCA
and subsequently builds a nearest mean classifier (NMC) in this 1-dimensional space.
(a) Show that the total covariance matrix, which is used in PCA, for this data set
equals 20 01 .
1 is the first principle component of the total covariance matrix.
(b) Show that 0
Show that 01 is the second.
(c) If we reduce the original data by means of the first principle component as given
in b., does the Bayes error increase, decrease, or stay exactly the same?
(d) What happens if we estimate the covariance matrices and calculate the first prin-
ciple component based on a small training set? How does the expected true error,
using PCA for feature extraction and NMC for classification, for the small training set
compare to the true error rate for a very large training data set?
(e) How does the learning curve for this problem with the above classification procedure
look qualitatively?
(f ) How does the learning curve look qualitatively if the class covariance matrices equal
10 ?
04
(g) How does the learning curve look qualitatively if the class covariance matrices equal
10 ?
02
end optional
Exercise 6.12 We have a 4-class problem in 3 dimensions. All classes are normally dis-
tributed with the identity matrix as the covariance matrix (lucky us). The four class
means are at (−5, 0, 0), (5, 0, 0), (0, 0, 3), and (0, 0, −3). The classes have equal priors.
We are going to study dimensionality reductions for this classification problem.
(a) Assume we have an infinite amount of training data; so we basically know the class
distributions perfectly. What will be the first principal component? Briefly explain
your answer. You can use explicit calculations in this, but there is no need for this.
(b) Now consider the Fisher mapping and let us reduce the dimensionality to 1 (one)
to start with. In which direction of the following three will the optimal 1D subspace
be: (1, 0, 0), (0, 1, 0), or (0, 0, 1)? Explain your answer. Again: explicit calculations are
not needed, but if you prefer to, you can of course provide them.
(c) We are still in the infinite training set size setting. Explain that if we reduce
the dimensionality to 2 (two), that PCA and Fisher mapping both provide the same
subspace.
(d) We now rescale the first feature by dividing all its values by 4. Does the 2D PCA
subspace change? Does the 2D Fisher subspace change? Explain if, why, and how they
change.
59
(e) Assume now we are back in the situation of question c. but we are now in the more
realistic setting in which we carry out PCA and Fisher on the basis of only a handful
of samples. E.g. three or four per class. We again are going to reduce to 2 dimensions.
Which method will turn out to work best, given that we will use the 2D representation
for classification?
Exercise 6.13 We are dealing with a 20-dimensional classification problem with three classes
and 10 samples per class. We would like to reduce the dimensionality of this initial 20-
dimensional space to only 5. As feature subset evaluation criterion you have taken the
classification performance of your favorite classifier.
(a) How many criterion evaluations are necessary to come to a choice of 5 features
when you rely on sequential feature forward selection?
(b) How many criterion evaluations are necessary when you use sequential feature
backward selection instead?
(c) How many criterion evaluations does one have to make when one would search
for the optimal subset of 5 features? Optimality is of course measured in terms of the
evaluation criterion chosen.
(d) Which one of the three feature subsets obtained above would you prefer to use for
your actual classification problem? Explain your answer.
optional
Exercise 6.14 Say we want to solve a high-dimensional classification problem (e.g. 1000
features or more maybe) with relatively few training objects (say, 100) by selecting a
few features (say, about 30) from the large, original collection.
(a) Pick three feature selection strategies (e.g. feature forward selection etc.). Write
these down and provide a brief description of how they work for every one of them.
(b) From the three selection strategies you picked, give the best performing and the
worst performing strategies (w.r.t. the expected true error rate). Explain your answer
properly and provide convincing reasons for your choices.
(c) Which two selection strategies would you pick if the speed of the selection process
matters the most? Which one will be the fastest and which one the slowest?
Exercise 6.15 We consider feature selection based on the Bayes error: the absolutely mini-
mal error that can be achieved on the classification problem at hand.
(a) Construct a 3-dimensional 2-class classification problem for which feature forward
selection (using the Bayes error as criterion) will outperform individual feature selection
if we want to have a subspace of dimension 2. Make clear that the subspaces that will
be obtained are indeed different and that the better subspace is the one that feature
forward selection obtains. Also be clear in the description of the distributions that
60
you choose for the two classes (drawings could be fine, but may be hard to correctly
interpret).
(b) Show that, when using the Bayes error as the selection criterion, individual feature
selection can never outperform feature forward selection when reducing a 3-dimensional
problem to 2 dimensions.
(c) Construct a 2-class classification problem (that should at least be 4D according to
b!) for which individual feature forward (using again the Bayes error as criterion) will
outperform feature forward selection if we want a subspace of dimension 2. Again, make
clear that the subspaces that will be obtained are indeed different and that the better
subspace is the one that feature forward selection obtains. Be clear in the description
of the distributions that you choose for the two classes.
Exercise 6.16 You are dealing with a 10-dimensional classification problem with two classes
and 100 samples per class. You decide to study the nearest mean classifier (NMC)
and the Fisher classifier (fisherc) in order to solve this classification problem. You also
decide to perform a feature selection to 3 (three) features before you train your final
classifier. As feature subset evaluation criterion you take the classification performance
the respective classifier has on the training set.
(a) How many criterion evaluations are necessary to come to your choice of 3 features
(out of the 10) when you use sequential feature forward selection?
(b) How many criterion evaluations are necessary to come to your choice of 3 features
(out of the 10) when you use sequential feature backward selection?
(c) How many criterion evaluations do you need to make sure you find the overall
optimal combination of 3 features?
(d) How many criterion evaluations do you need to make sure you find the overall
optimal combination of features?
(e) Say it is allowed for features to be chosen more than once. How many criterion
evaluations do you need to make sure you find the overall optimal combination of 3
features (so they don’t have to be unique) for the Fisher classifier? And how many for
the NMC? [Yes, this is a tricky one.]
end optional
Exercise 6.17 Given a data set with 100 features. Say you want to find the 5 best features.
How many feature sets do you need to check for this in case we have to rely on an
exhaustive search?
61
(b) Which features are the best two and do the algorithms pick them out? And if not,
do they agree among each other on the optimal choice?
Exercise 6.19 Load the diabetes data set using a = diabetes. This data set is a real-
world data set (though an old one). The aim is, or rather, was to predict the presence
of diabetes mellitus in people based on the eight measurements made. Some of the
features included are diastolic blood pressure, triceps skin fold thickness, and age.
(a) Split up the data set in 50% for training and the rest for testing. Based on the
training set, what is the optimal 5-dimensional subspace for the fisher classifier? (Hint:
you can loop you way out of this but it might be more handy to use some feature from
nchoosek; read the help carefully.)
(b) What is the optimal 5-dimensional space for a 1NN classifier?
(c) What is the error attained on the test set for these classifiers with their respective
optimal feature spaces.
(d) Check some search strategies and selection criteria (help feateval) and see
whether they are able to pick out the same 5-dimensional subspace. If not, do at
least the error rates on the test set come close in case one uses the same classifier?
(e) Do another 50-50 split of the original data set. When redoing some of the feature
selections, do you find back the same feature sets?
optional
a = mfeat_zer\\
[trn,tst] = gendat(a,.04)\\
clsf = ldc\\
featnum = [1:5:47]\\
\\
prwaitbar off\\
prwarning off\\
\\
mf = max(featnum)\\
e0 = clevalf(trn,clsf,featnum,size(trn,1),1,tst);\\
rp = randperm(size(a,2));\\
er = clevalf(trn(:,rp),clsf,featnum,size(trn,1),1,tst(:,rp));\\
[w,r] = featseli(trn,’eucl-m’,mf)\\
e1 = clevalf(trn*w,clsf,featnum,size(trn,1),1,tst*w);\\
[w,r] = featself(trn,’eucl-m’,mf)\\
e2 = clevalf(trn*w,clsf,featnum,size(trn,1),1,tst*w);\\
[w,r] = featselb(trn,’eucl-m’)\\
e3 = clevalf(trn*w,clsf,featnum,size(trn,1),1,tst*w);\\
figure\\
plote({e0,er,e1,e2,e3})\\
62
legend({’o’,’r’,’i’,’f’,’b’})}
}
(a) Run it a couple of times and have a look at the output, both in the figure and in
the command window, and try to understand it. Also have a look at the code of course.
E.g. what does randperm do in the code and what is rp used for? What does featseli
do? And ’eucl-m’?
(b) Can you find a selection strategy, which optimal performance is, by and large,
better than the best performing one in (a)? Recall that there are not only so-called
filter approaches.
3 Z
2
Feature A
0 X Y
−1
−2
−2 −1 0 1 2 3 4 5 6 7
Feature 1
Figure 6.1: Three equally probable classes X, Y, and Z in a 2-dimensional feature space.
Exercise 6.21 Consider the three equally probable classes X, Y, and Z in a 2-dimensional
feature space. See Figure 6.7. All have a uniform and circular distributions of diameter
2 and the three class means are located in (0, 0), (2, 0), and (5, 3), respectively. In
addition, the distributions are such that all class covariance matrices, as well as the
within-class covariance matrix SW , are equal to the identity matrix ( 10 01 ).
(a) In a forward selection scheme, what feature is selected when the selection criterion
used is the sum of the (squared) Mahalanobis distances1 between the classes? (You may
1
Mahalanobis distances are the same as the Euclidean distance, but they have a correction based on some
covariance. Specifically for this exercise, the Mahalanobis distance between two classes based on the two class
−1
mean m1 and m2 is given by (m1 − m2 )T SW (m1 − m2 ).
63
also use the multi-class Fisher criterion—often denoted JF —as this leads to the same
result.)
(b) Using the same criterion, what feature is selected when a backward approach is
used?
(c) Use the given class means to determine the between-class scatter, or between-class
covariance matrix, and check that both (3, 2) and (−2, 3) are eigenvectors.
(d) Use the previous between-class matrix, together with the within-class covariance
matrix, to determine the 1-dimensional optimal solution based on the Fisher criterion.
(e) Which of the three previous solutions performs worst? How much class overlap is
attained when relying on the feature forward selection procedure?
(f ) Change one or more of the current three class means and modify the problem such
that the feature extraction performs better (i.e., gives lower class overlap) than the two
feature selection methods. Stated differently: provide three class means and show that
the feature extraction approach above now outperforms the feature selection methods.
(g) Finally, assume that the class distributions are squares aligned with the feature
axes instead of the circular distributions and that the three class means are still the
same. In this setting, do the solutions to questions 1, 2, and 3 change? And if so, how?
9
8
7
6
5
4
3
2
1
1 2 3 4 5 6 7 8 9 10 11
Exercise 6.22 Let us consider a two-dimensional, three-class problem in which all classes
are uniformly distributed discs with radius 1. All class means are variable but differ
so that there is no class overlap in the 2D feature space. The prior probabilities of all
classes are equal. Figure 6.2 above gives an example configuration in which the three
class means are given by (2, 3), (5, 7), and (10, 6), respectively.
(a) Let us reduce the feature dimensionality from 2 to 1 by means of the Fisher
mapping (fisherm in terms of good ol’ Matlab’s PRTools). Configure the three class
means such that in the 2D space there is no overlap but in the 1D space found by the
Fisher mapping two of the three classes completely overlap. You are only allowed to put
64
9
8
7
6
5
4
3
2
1
1 2 3 4 5 6 7 8 9 10 11
Figure 6.3: Example configuration of two classes (both of which in turn consist of two clusters).
the class means in the grid points given in the figure. Provide the three coordinates of
your solution.
(b) Is it possible to give three means for which there is no overlap in 2D but for which
all three classes overlap completely in 1D when using the Fisher mapping?
We now consider a similar setting as exemplified in Figure 6.2, but now we have two
classes (one black one white) both consisting of two clusters. All four clusters are
uniformly distributed discs with radius 1. (The four clusters have equal priors.) An
example configuration is given in Figure 6.3.
(c) Configure the four class means such that if one would look at either of the two
features, both classes would perfectly overlap, while in the original 2D space the two
classes are perfectly non-overlapping. In other words, construct an example for which
selecting a single feature would give a Bayes error of 0.5, while the Bayes error in the
original space is 0. Again, you are only allowed to put the means in the grid points
given in the figure. Provide the four coordinates of the clusters of your solution. Make
sure you make clear where the black clusters go and where the white go.
(d) Would a linear feature extraction technique be able to provide a better one-
dimensional subspace for the problem created in c. than feature selection is able to
do? That is, can feature extraction find a 1D feature for which the two classes do not
fully overlap?
(e) Is it possible to create a classification problem, again positioning the four clusters,
such that feature selection will outperform any kind of feature extraction? Explain your
answer.
Exercise 6.23 Figure 6.4 displays a three-class problem in which all classes are uniformly
distributed on a 2 × 2 square. The three class centers are located in (−1, 6), (0, 0), and
(2, 0), respectively. The prior probabilities of all classes are equal.
(a) What is the Bayes error for this two-dimensional classification problem?
65
8
4
Feature B
−1
−2
−5 −4 −3 −2 −1 0 1 2 3 4 5 6
Feature A
Figure 6.4:
We now are going to reduce the dimensionality of this classification problem from the
original two features to a single one.
(b) Using the means of the classes, what is the sum of squared Euclidean distances for
this problem if one would only consider Feature A? What if one would only consider
Feature B? If we would perform a feature selection based on this error, which one of
the two features would we choose and why?
(c) What is the Bayes error for this problem if one would only consider Feature A?
What if one would only consider Feature B? If we would perform a feature selection
based on this error, which one of the two features would we choose and why?
Assume now that the sizes of the squares of all three classes increase from 2 × 2 to 4 × 4
(the class centers indeed remain the same).
(d) Similar to b., again using the means of the classes, what is the sum of squared
Euclidean distances for this problem if one would only consider Feature A? What if one
would only consider Feature B?
(e) Same as in c., what is the Bayes error for this problem if one would only consider
Feature A? What if one would only consider Feature B?
Instead of feature selection, we could also have resorted to feature extraction. In order
to perform PCA, we need the covariance matrix of the data.
(f ) Determine the between-class covariance matrix (or the between class scatter) for
the three class problem we consider.
(g) Assume that the covariance matrix of the squares is given by 20 02 . Use this in
combination with your answer in f. to determine the total covariance.
66
end optional
6.8 Combining
optional
(a) Train three nearest mean classifiers on the different data sets:
w1 = nmc(b1)*classc; w2 = nmc(b2)*classc; w3 = nmc(b3)*classc;
(The classc is a trick to make sure that classifiers output posteriors.) Estimate the
three corresponding error rates on the corresponding c-data sets, e.g., for the first data
set: compute c1*w1*testc.
(b) Train and test the same classifier on the data sets b = [b1 b2 b3] and c
= [c1 c2 c3] (or, in Python: b = pr.concatenate([b1,b2,b3],axis=1) and c =
pr.concatenate([c1,c2,c3],axis=1)), which contain all features from the three data
sets above. Compare its error rate to those from (a).
Given the trained individual classifiers w1, w2, and w3,
a fixed combiner can be constructed as follows: v =
pr.concatenate([c1*w1,c2*w2,c3*w3],axis=1)*pr.meanc() in Python, or v =
[w1; w2; w3]*meanc in Matlab. This is the mean combiner. Other combiners that
PRTools allows are minc, maxc, prodc, and medianc of which it should be clear what
kind of combining rule these implement.
67
(c) Test some combiners, by means of c*v*testc in Matlab or v*pr.testc() in
Python, and compare the results to the four earlier error rates you estimated.
(d) Looking back at the exercise, do you understand why the special form of the
gendat command, in which the I and J occur, had to be used? What would we have
been doing if we would just have generated the b and c data sets by applying gendat
to the three a data sets individually? Obviously, you may want to check help gendat
for an initial clue and to see what I and J contain.
Exercise 6.26 (a) Generate a data set by means of gendath and train two dif-
ferently behaving linear classifiers on it. Take, for instance, w1 equal to nmc
and w2 toldc. Generate posterior probabilities from these by computing p1 =
a*w1*classc and p2 = a*w2*classc and combine them in one data set by p =
pr.concatenate([p1,p2],axis=1) in Python, or p = [p1 p2] in Matlab.
How many features does this new data set have and why is this so?
(b) Scatter plot the first and third dimension of p. What do these dimensions corre-
spond to? Try to understand that the original classifiers correspond to horizontal and
vertical lines at 0.5 in this same scatter plot.
(c) Straight lines not parallel to any of the two axes actually combine the posteriors
from the two classifiers. Are there combiners possible that indeed seem improve upon
the two individual classifiers?
Exercise 6.27 As simply as fixed combiners can be constructed, one can construct trained
combiners. An example is v = [nmc*classc ldc*classc]*fisherc, which takes the
two classifier outputs of nmc and ldc and combines them through training a fisherc
on top of it. All is simultaneously trained by executing w3 = a*v on a data set a.
(a) Take the data set from Exercise 6.26, gendath, construct a training and a test set,
and compare nmc, ldc, and the trained combiner.
end optional
68
(c) Repeat the experiment but replace the base classifier with nmc and with fisherc.
Are there noteworthy differences between the various solutions obtained? Also compare
results between different base classifiers.
(d) What can we expect AdaBoost to do when we employ it in combination with, say,
a 1NN or an SVM with a radial basis function? Could one expect any improvements
from such scheme?
69