Lecture Notes - Machine Learning For The Sciences
Lecture Notes - Machine Learning For The Sciences
Preface to v2
This is an introductory machine-learning course specifically developed with STEM
students in mind. Our goal is to provide the interested reader with the basics to
employ machine learning in their own projects and to familiarize themself with the
terminology as a foundation for further reading of the relevant literature.
In these lecture notes, we discuss supervised, unsupervised, and reinforcement
learning. The notes start with an exposition of machine learning methods with-
out neural networks, such as principle component analysis, t-SNE, clustering, as
well as linear regression and linear classifiers. We continue with an introduction
to both basic and advanced neural-network structures such as dense feed-forward
and conventional neural networks, recurrent neural networks, restricted Boltzmann
machines, (variational) autoencoders, generative adversarial networks. Questions of
interpretability are discussed for latent-space representations and using the examples
of dreaming and adversarial attacks. The final section is dedicated to reinforcement
learning, where we introduce basic notions of value functions and policy learning.
These lecture notes are based on a course taught at ETH Zurich and the Uni-
versity of Zurich for the first time in the fall of 2021 by Titus Neupert and Mark
H Fischer. The lecture notes are further based on a shorter German version of
the lecture notes published in the Springer essential series, ISBN 978-3-658-32268-7,
doi:https://doi.org/10.1007/978-3-658-32268-7. The content of these lecture notes
together with exercises is available under ml-lectures.org.
This second version of the lecture notes is an updated version for the course
taught at the University of Zurich in spring 2022 by Mark H Fischer. In particular,
the notation should be more consistent, a short discussion of the maximal-likelihood
principle was introduced in Sec. 5.1, and the interpretation of latent-space variables
was extended in Sec. 6.1.
University of Zurich i
Machine Learning for the Sciences CONTENTS
Contents
1 Introduction 1
1.1 Why machine learning for the sciences? . . . . . . . . . . . . . . . . . 1
1.2 Overview and learning goals . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5 Unsupervised Learning 49
5.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 49
5.2 Restricted Boltzmann machine . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 Training an RBM . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Example: signal or image reconstruction/denoising . . . . . . 53
5.3 Training an RNN without supervision . . . . . . . . . . . . . . . . . . 54
5.3.1 Example: generating molecules with an RNN . . . . . . . . . 55
5.4 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . 57
University of Zurich ii
Machine Learning for the Sciences CONTENTS
7 Reinforcement Learning 71
7.1 Exploration versus exploitation . . . . . . . . . . . . . . . . . . . . . 72
7.2 Finite Markov decision process . . . . . . . . . . . . . . . . . . . . . . 72
7.3 Policies and value functions . . . . . . . . . . . . . . . . . . . . . . . 74
7.4 Temporal-difference learning . . . . . . . . . . . . . . . . . . . . . . . 76
7.5 Function approximation . . . . . . . . . . . . . . . . . . . . . . . . . 78
8 Concluding remarks 79
1 Introduction
1.1 Why machine learning for the sciences?
Machine learning (ML) and artificial neural networks are everywhere and change our
daily life more profoundly than we might be aware of. However, these concepts are
not a particularly recent invention. Their foundational principles emerged already
in the 1940s. The perceptron, the predecessor of the artificial neuron, the basic unit
of many neural networks to date, was invented by Frank Rosenblatt in 1958, and
even cast into a hardware realization by IBM.
It then took half a century for these ideas to become technologically relevant.
Now, artificial intelligence based on neural-network algorithms has become an in-
tegral part of data processing with widespread applications. Famous milestones
include image based tasks such as recognizing a cat and distinguishing it from a dog
(2012), the generation of realistic human faces (2018), but also sequence-to-sequence
(seq2seq) tasks, such as language translation, or developing strategies to play and
beat the best human players in games from chess to Go.
The reason for ML’s tremendous success is twofold. First, the availability of big
and structured data caters to machine-learning applications. Second, while deep
(feed-forward) networks (made from many “layers” of artificial neurons) with many
variational parameters are tremendously more powerful than few-layer ones, it only
recently, in the last decade or so, became feasible to train such networks. This big
leap is known as the “deep learning revolution”.
Machine learning refers to algorithms that infer information from data in an im-
plicit way. If the algorithms are inspired by the functionality of neural activity in the
brain, the term cognitive or neural computing is used. Artificial neural networks re-
fer to a specific, albeit most broadly used, ansatz for machine learning. Another field
that concerns iteself with inferring information from data is statistics. In that sense,
both machine learning and statistics have the same goal. However, the way this goal
is achieved is markedly different: while statistics uses insights from mathematics to
extract information in a well defined and deterministic manner, machine learning
aims at optimizing a variational function using available data through learning.
The mathematical foundations of machine learning with neural networks are
poorly understood: we do not know why deep learning works. Nevertheless, there
are some exact results for special cases. For instance, certain classes of neural
networks are a complete basis of smooth functions, that is, when equipped with
enough variational parameters, they can approximate any smooth high-dimensional
function with arbitrarily precision. Other variational functions with this property
we commonly use are Taylor or Fourier series (with the coefficients as “variational”
parameters). We can think of neural networks as a class or variational functions,
for which the parameters can be efficiently optimized—learned—with respect to a
desired objective.
As an example, this objective can be the classification of handwritten digits from
‘0’ to ‘9’. The input to the neural network would be an image of the number, encoded
in a vector of grayscale values. The output is a probability distribution saying how
likely it is that the image shows a ‘0’, ‘1’, ‘2’, and so on. The variational parameters
of the network are adjusted until it accomplishes that task well. This is a classical
example of supervised learning. To perform the network optimization, we need data
University of Zurich 1
Machine Learning for the Sciences 1 INTRODUCTION
Figure 1: Datasets for image classifications: Examples of the digits from the
handwritten MNIST dataset (top row) and images from the Galaxy Zoo project
(bottom row).
consisting of input data (the pixel images) and labels (the integer number shown on
the respective image).
Our hope is that the optimized network also recognizes handwritten digits it has
not seen during the learning. This property of a network is called generalization. It
stands in opposition to a tendency called overfitting, which means that the network
has learned specificities of the data set it was presented with, rather than the ab-
stract features necessary to identify the respective digit. An illustrative example of
overfitting is fitting a polynomial of degree 9 to 10 data points, which will always be
a perfect fit. Does this mean that this polynomial best characterizes the behavior of
the measured system? Of course not. Fighting overfitting and creating algorithms
that generalize well are key challenges in machine learning. We will study several
approaches to achieve this goal.
Handwritten digit recognition has become one of the standard benchmark prob-
lems in the field. Why so? The reason is simple: there exists a very good and freely
available data set for it, the MNIST database 1 , see top row of Fig. 1. This curious
fact highlights an important aspect of machine learning: it is all about data. The
most efficient way to improve machine learning results is to provide more and better
data. Thus, one should keep in mind that despite the widespread applications, ma-
chine learning is not the hammer for every nail. Machine learning is most beneficial
if large and balanced data sets in a machine-readable way are available, such that
the algorithm can learn all aspects of the problem equally.
This lecture is an introduction specifically targeting the use of machine learning
in different domains of science. In scientific research, we see a vastly increasing
number of applications of machine learning, mirroring the developments in industrial
technology. With that, machine learning presents itself as a universal new tool for
the exact sciences, standing side-by-side with methods such as calculus, traditional
statistics, and numerical simulations. This poses the question, where in the scientific
workflow, summerized in Fig. 2, these novel methods are best employed.
While machine learning has been used in various fields that have been dealing
with ’big data’ problems—famous examples include the search of the Higgs boson in
high-energy collision data or finding and classifying galaxies in astronomy—the last
years have witnessed a host of new applications in the sciences: from high-accuracy
1
http://yann.lecun.com/exdb/mnist
University of Zurich 2
Machine Learning for the Sciences 1 INTRODUCTION
Observation
design experiment
aquire data
analyse data
University of Zurich 3
Machine Learning for the Sciences 1 INTRODUCTION
University of Zurich 4
Machine Learning for the Sciences 1 INTRODUCTION
probability distribution P
discriminative/supervised generative/unsupervised
additional topics
dimensional reduction (Sec. 2) reinforcement learning (Sec. 7)
• PCA
• t-SNE
Figure 3: Overview over the plan of the lecture from the perspective of learning
probability distributions.
University of Zurich 5
Machine Learning for the Sciences 1 INTRODUCTION
help with individual segments of the scientific workflow (Fig. 2). While the type of
abstraction needed to formulate Newton’s laws of classical mechanics seems incred-
ibly complex, neural networks are very good at implicit knowledge representation.
To understand precisely how they achieve certain tasks, however, is not an easy
undertaking. We will discuss this question of interpretability in Sec. 6.
A third class of algorithms, which does not neatly fit the framework of approxi-
mating a statistical model and thus the distinction into discriminative and generative
algorithms is known as reinforcement learning. Instead of approximating a statisti-
cal model, reinforcement learning tries to optimize strategies (actions) for achieving
a given task. Reinforcement learning has gained a lot of attention with Google’s
AlphaGo Zero, a computer program that beat the best Go players in the world. As
an example for an application in the sciences, reinforcement learning can be used to
decide on what experimental configuration to perform next or to control a complex
experiment. While the whole topic is beyond the scope of this lecture, we will give
an introduction to the basic concepts of reinforcement learning in Sec. 7.
A final note on the practice of learning. While the machine learning machinery is
extremely powerful, using an appropriate architecture and the right training details,
captured in what are called hyperparameters, is crucial for its successful application.
Though there are attempts to learn a suitable model and all hyperparameters as part
of the overall learning process, this is not a simple task and requires immense com-
putational resources. A large part of the machine learning success is thus connected
to the experience of the scientist using the appropriate algorithms. We strongly
encourage solving the accompanying exercises carefully and taking advantage of the
exercise classes.
1.3 Resources
While it may seem that implementing ML tasks is computationally challenging,
actually almost any ML task one might be interested in can be done with relatively
few lines of code simply by relying on external libraries or mathematical computing
systems such as Mathematica or Matlab. At the moment, most of the external
libraries are written for the Python programming language. Here are some useful
Python libraries:
3. Scikit-Learn. Whereas TensorFlow and PyTorch are catered for deep learn-
ing practitioners, Scikit-Learn provides much of the traditional machine learn-
ing tools, including linear regression and PCA.
University of Zurich 6
Machine Learning for the Sciences 1 INTRODUCTION
1.4 Prerequisites
This course is aimed at students of the (natural) sciences with a basic mathemat-
ics education and some experience in programming. In particular, we assume the
following prerequisites:
1.5 References
For further reading, we recommend the following books:
University of Zurich 7
Machine Learning for the Sciences 2 STRUCTURING DATA
PCA algorithm
Given a dataset of m samples with n data features, we can arrange our data in the
form of a m by n matrix X where the element xij corresponds to the value of the
jth data feature of the ith sample. We will also use the feature vector xi for all the
n features of one sample i = 1, . . . , m. The vector xi can take values in the feature
space, for example xi ∈ Rn . Going back to our diabetes example, we might have 10
data features. Furthermore if we are given information regarding 100 patients, our
University of Zurich 8
Machine Learning for the Sciences 2 STRUCTURING DATA
1. Center the data by subtracting from each column the mean of that col-
umn,
1 Xm
xi 7→ xi − xi . (2.1)
m i=1
This ensures that the mean of each data feature is zero.
i=1
We have thus reduced the dimensionality of the data from n to l. Notice that
there are actually two things happening: First, of course, we now only have l data
features. But second, the l data features are new features and not simply a selec-
tion of the original data. Rather, they are a linear combination of them. Using our
diabetes example again, one of the “new” data features could be the sum of the aver-
age blood pressure and the body mass index. These new features are automatically
extracted by the algorithm.
But why did we have to go through such an elaborate procedure to do this instead
of simply removing a couple of features? The reason is that we want to maximize
the variance in our data. We will give a precise definition of the variance later in
the chapter, but briefly the variance just means the spread of the data. Using PCA,
we have essentially obtained l “new” features which maximise the spread of the data
when plotted as a function of this feature. We illustrate this with an example.
University of Zurich 9
Machine Learning for the Sciences 2 STRUCTURING DATA
∆wp [cm]
setosa
versicolor
2
w2
virginica 0 0
0
−2 −2
0 2 4 6 −4 −2 0 2 4 −4 −2 0 2 4
petal length [cm] ∆lp [cm] w1
Figure 4: PCA on Iris Dataset. The original data (left) is first centered (center),
and then, PCA ’fits an ellipsoid around the data’ to extract the new feature axes
(right).
it is easy to visualise the data. In Fig. 4, we show how the data is transformed under
the PCA algorithm.
Notice that there is no dimensional reduction here since l = n. In this case, the
PCA algorithm amounts simply to a rotation of the original data. However, it still
produces 2 new features which are orthogonal linear combinations of the original
features: petal length and petal width, i.e.
w1 = 0.922 × petal length + 0.388 × petal width,
(2.4)
w2 = −0.388 × petal length + 0.922 × petal width.
We see very clearly that the first new feature w1 has a much larger variance than the
second feature w2 . In fact, if we are interested in distinguishing the three different
species of flowers, as in a classification task, its almost sufficient to use only the
data feature with the largest variance, w1 . This is the essence of (PCA) dimensional
reduction.
Finally, it is important to note that it is not always true that the feature with
the largest variance is the most relevant for the task and it is possible to construct
counter examples where the feature with the least variance contains all the useful
information. However, PCA is often a good guiding principle and can yield inter-
esting insights into the data. Most importantly, it is also interpretable, i.e., not only
does it separate the data, but we also learn which linear combination of features
can achieve this separation. We will see that for many neural network algorithms,
in contrast, a lack of interpretability is a central issue.
Φ : Rn → RN , (2.5)
University of Zurich 10
Machine Learning for the Sciences 2 STRUCTURING DATA
PCA
1.0
0.5
w2
0.0
original
−0.5
1.0
−1.0
0.5
−1.0 −0.5 0.0 0.5 1.0
w1
f2
0.0
kernel PCA
−0.5
1.0
−1.0
0.5
−1.0 −0.5 0.0 0.5 1.0
f1
w2
0.0
−0.5
−1.0
Figure 5: Kernel PCA versus PCA. For kernel PCA, an RBF kernel with γ = 10
was used.
which is a map from the original n-dimensional space (corresponding to the n original
data features) to a N -dimensional feature space. By embedding the original data in
this new space, we hope to be able to separate the data with a linear transformation,
in other words, kernel PCA simply involves performing the standard PCA on the
transformed data Φ(x). Here, we will assume that the transformed data is centered,
i.e.,
Φ(xi ) = 0 (2.6)
X
i=1
i=1
University of Zurich 11
Machine Learning for the Sciences 2 STRUCTURING DATA
i=1
we see that finding the eigenvectors is equivalent to finding the coefficients aji . On
substituting this form back into Eq. (2.8), we find
m
"m # "m #
Φ(xi )Φ(xi ) T
aji Φ(xj ) = λj aji Φ(xi ) . (2.10)
X X X
where K(x, y) = Φ(x)T Φ(y) is known as the kernel. Thus, we see that if we directly
specify the kernels we can avoid explicit performing the transformation Φ. In matrix
form, we find the eigenvalue equation K 2 aj = λj Kaj , which simplifies to
Kaj = λj aj . (2.12)
Note that this simplification requires λj 6= 0, which will be the case for relevant
principle components. (If λj = 0, then the corresponding eigenvectors would be
irrelevant components to be discarded.) After solving the above equation and ob-
taining the coefficients ajl , the kernel PCA transformation is then simply given by
the overlap with the eigenvectors v j , namely
m m
x → Φ(x) → yj = v Tj Φ(x) = aji Φ(xi ) Φ(x) =
T
(2.13)
X X
aji K(xi , x),
i=1 i=1
or the Gaussian kernel, also known as the radial basis function (RBF) kernel, defined
by
KRBF (x, y) = exp −γkx − yk2 , (2.15)
qP
where c, d, and γ are tunable parameters and kx − yk = j (xj − yj ) denotes
2
the Euclidian distance. Using the RBF kernel, we compare the result of kernel PCA
with that of standard PCA, as shown on the right of Fig. (5). It is clear that kernel
PCA leads to a meaningful separation of the data while standard PCA completely
fails.
University of Zurich 12
Machine Learning for the Sciences 2 STRUCTURING DATA
1/(2m), so that each data point makes a significant contribution and outliers are
not simply discarded in the minimization.
The probability distribution in the target space is chosen to be a so-called Student
t-distribution
(1 + ky i − y j k2 )−1
qij = P . (2.18)
k6=l (1 + ky k − y l k )
2 −1
This choice has several advantages over the Gaussian choice in the space of the orig-
inal data points: (i) it is symmetric upon interchanging i and j, (ii) it is numerically
more efficiently evaluated because there are no exponentials, (iii) it has ’fatter’ tails
which helps to produce more meaningful maps in the lower dimensional space.
In order to minimize the distance between pij and qij , we have to introduce a
(real-valued) measure for the similarity between the two probability distributions.
We will then use this measure as a so-called loss function or cost function, which we
aim to minimize by adjusting the y’s 6 . Here, we choose the Kullback-Leibler (KL)
6
Note that we can also frame PCA as a optimization problem, where we are interested in finding
the linear combination of features that maximize the variance. Unlike in t-SNE, this optimization
can be done analytically, see exercises
University of Zurich 13
Machine Learning for the Sciences 2 STRUCTURING DATA
Figure 6: PCA vs. t-SNE Application of both methods on 5000 samples from
the MNIST handwritten digit dataset. We see that perfect clustering cannot be
achieved with either method, but t-SNE delivers the much better result.
divergence
pij 7
L({y i }) = pij log (2.19)
XX
,
i j qij
which we will frequently encounter during this lecture. An important property of
the KL divergence is that it is non-negative 8 .
The minimization of L({y i }) with respect to the positions y i can be achieved
with a variety of methods. In the simplest case it can be gradient descent, which we
will discuss in more detail in a later chapter. As the name suggests, gradient descent
follows the direction of largest gradient of the cost function to find the minimum.
To this end it is useful that these gradients can be calculated in a simple form
∂L
= 4 (pij − qij )(y i − y j )(1 + ||y i − y j ||2 )−1 . (2.20)
X
∂y i j
Finally, let us discuss the choice of the σi in Eq. (2.16). Intuitively, we want
to choose σi such that distances are equally well resolved in both dense and sparse
regions of the dataset. As a consequence, we choose smaller values of σi in dense
regions as compared to sparser regions. More formally, a given value of σi induces
a probability distribution Pi over all the other data points. This distribution has
an entropy (here we use the Shannon entropy, in general it is a measure for the
“uncertainty” represented by the distribution)
The value of H(Pi ) increases as σi increases, in other words the more uncertainty is
added to the distances. The algorithm searches for the σi that result in a Pi with
fixed perplexity
Perp(Pi ) = 2H(Pi ) . (2.22)
7
In general, we use log for the natural logarithm. In case of a different base, this is denoted as
loga .
8
Importantly, while the KL divergence is a measure for the similarity of two probability dis-
tributions, it does not define a metric on the space of probability distributions, but, as the name
suggests, a divergence. Amongst other things, the KL divergence is not symmetric.
University of Zurich 14
Machine Learning for the Sciences 2 STRUCTURING DATA
The target value of the perplexity is chosen a priory and is the main parameter that
controls the outcome of the t-SNE algorithm. It can be interpreted as a smooth
measure for the effective number of neighbors. Typical values for the perplexity are
between 5 and 50.
By now, t-SNE is implemented as standard in many packages. They involve
some extra tweaks that force points y i to stay close together at the initial steps
of the optimization and create a lot of empty space. This facilitates the moving
of larger clusters in early stages of the optimization until a globally good arrange-
ment is found. If the dataset is very high-dimensional it is advisable to perform an
initial dimensionality reduction (to somewhere between 10 and 100 dimensions, for
instance) with PCA before running t-SNE.
While t-SNE is a very powerful clustering technique, it has its limitations. (i)
The target dimension should be 2 or 3, for much larger dimensions the ansatz for qij
is not suitable. (ii) If the dataset is intrinsically high-dimensional (so that also the
PCA pre-processing fails), t-SNE may not be a suitable technique. (iii) Due to the
stochastic nature of the optimization, results are not reproducible. The result may
end up looking very different when the algorithm is initialized with some slightly
different initial values for y i .
i=1 j=1
University of Zurich 15
Machine Learning for the Sciences 2 STRUCTURING DATA
90
70
60
50
Naturally, we want to minimize the loss function with respect to the assignment
wij . However, a change in this assignment also changes µj . For this reason, it is
natural to use an iterative algorithm and divide each update step into two parts.
The first part updates the wij according to
1, if j = argminl ||xi − µl ||,
wij = (2.26)
0, else ,
in other words, each data point is assigned to the nearest cluster centroid. The
second part is a recalculation of the centroids according to Eq. (2.25).
The algorithm is initialized by choosing at random k distinct data points as
initial positions of the centroids. Then one repeats the above two-part steps until
convergence, meaning until the wij do not change anymore. Figure 7 shows an
example of clustering on a dataset for the Old Faithful geyser 9 .
A few things to keep in mind when using k-means: First, in this algorithm we
use the Euclidean distance measure k · k. It is advisable to standardize the data such
that each feature has mean zero and a standard deviation of one when averaging
over all data points. Otherwise—if some features are overall numerically smaller
than others—, the differences in various features may be weighted very differently
by the algorithm. Second, the results of k-means may depend on the initialization.
One should re-run the algorithm with a few different initializations to avoid running
into bad local minima. Finally, k-means assumes spherical clusters by construction.
If the actual cluster form is very far from such a spherical form, k-means fails to
assign the points correctly.
Applications of k-means are manifold: in economy they include marked seg-
mentation, in science any classification problem such as that of phases of matter,
document clustering, image compression (color reduction), etc.. In general, it helps
to build intuition about the data at hand.
9
https://www.stat.cmu.edu/ larry/all-of-statistics/=data/faithful.dat
University of Zurich 16
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
or in matrix notation
f (x|β) = x̃T β (3.2)
where x̃T = (1, x1 , x2 , . . . , xn ) and β = (β0 , . . . , βn )T are (n + 1) dimensional row
vectors.
The aim then is to find parameters β̂ such that f (x|β̂) is a good estimator for
the output value y. In order to quantify what it means to be a “good” estimator,
we again need to specify a loss function L(β). The good set of parameters β̂ is then
the minimizer of this loss function
There are many, inequivalent, choices for this loss function. For our purpose, we
University of Zurich 17
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
choose the loss function to be the residual sum of squares (RSS) defined as
m
RSS(β) = [yi − f (xi |β)]2
X
i=1
m
n
2 (3.4)
=
X X
yi − β0 − βj xij ,
i=1 j=1
where the sum runs over the m samples of the dataset. This loss function is some-
times also called the L2-loss and can be seen as a measure of the distance between
the output values from the dataset yi and the corresponding predictions f (xi |β).
It is convenient to define the m by (n + 1) data matrix X, f each row of which
corresponds to an input sample x̃Ti , as well as the output vector Y T = (y1 , . . . , ym ).
With this notation, Eq. (3.4) can be expressed succinctly as a matrix equation
RSS(β) = (Y − Xβ)
f T (Y − Xβ).
f (3.5)
The minimum of RSS(β) can be easily solved by considering the partial derivatives
with respect to β, i.e.,
∂RSS
= −2X
fT (Y − Xβ),
f
∂β
(3.6)
∂ 2 RSS
T = 2X X.
fT f
∂β∂β
∂ 2 RSS
At the minimum, ∂RSS
∂β
= 0 and ∂β∂β T
is positive-definite. Assuming X f is full-
fT X
∂RSS
= 0,
∂β β=β̂
(3.7)
=⇒ X fβ̂ = X
fT X fT Y ,
=⇒ β̂ = (X f −1 X
fT X) fT Y .
If X
fT Xf is not full-rank, which can happen if certain data features are perfectly
correlated (e.g., x1 = 2x3 ), the solution to X f =X
fT Xβ fT Y can still be found, but it
would not be unique. Note that the RSS is not the only possible choice for the loss
function and a different choice would lead to a different solution.
What we have done so far is uni-variate linear regression, that is linear regres-
sion where the output y is a single, real-valued number. The generalization to the
multi-variate case, where the output is a p-component vector y T = (y1 , . . . yp ), is
straightforward. The model takes the form
n
fk (x|β) = β0k + (3.8)
X
βjk xj ,
j=1
University of Zurich 18
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
yi = yitrue + i , i = 1, · · · , m. (3.9)
We assume that this error is a Gaussian random variable with mean µ = 0 and
variance σ 2 , which we denote by ∼ N (0, σ 2 ) 10 . Assuming that a linear model
in Eq. (3.1) is a suitable model for our dataset, we are interested in the following
question: How does our solution β̂ as given in Eq. (3.7) compare with the true
solution β true which obeys
n
yi = β0true + βjtrue xij + i , i = 1, . . . , m?
X
(3.10)
j=1
where the second line follows because E() = 0 and (X fT Y true = β true .
f −1 X
fT X)
Equation (3.12) implies that linear regression is unbiased. Note that other machine
learning algorithms will in general be biased.
10
Given a probability distribution P (x), we in general denote by x ∼ P (x) that x follows
the probability distribution P (x). Concretely, for discrete x this means that we choose x with
probability P (x), while for continuous x, the probability to lie in the interval dx is giben by
P (x)dx.
University of Zurich 19
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
What about the standard error or uncertainty of our solution? This information
is contained in the covariance matrix
Var(β̂) = E [β̂ − E(β̂)][β̂ − E(β̂)]T . (3.13)
The covariance matrix can be computed for the case of linear regression using the
solution in Eq. (3.7), the expectation value in Eq. (3.12) and the assumption in
Eq. (3.10) that Y = Y true + yielding
Var(β̂) = E [β̂ − E(β̂)][β̂ − E(β̂)]T
h ih iT
=E (X fT (X
f −1 X
fT X) f −1 X
fT X) fT (3.14)
= E (X f −1 X
fT X) fT T X(
fX f −1 .
fT X)
This expression can be simplified by using the fact that our input matrices X
f are
independent of the draw such that
Var(β̂) = (X
fT X) fT E(T )X(
f −1 X fX f −1
fT X)
= (X f −1 X
fT X) fT σ 2 I X(
fX f −1
fT X) (3.15)
= σ 2 (X f −1 .
fT X)
Here, the second line follows from the fact that different samples are uncorrelated,
which implies that E(T ) = σ 2 I with I the identity matrix. The diagonal elements
of σ 2 (X f −1 then correspond to the variance
fT X)
Var(β̂) = E [β̂ − E(β̂)][β̂ − E(β̂)]T
(3.16)
= σ 2 (X f −1
fT X)
q
of the individual parameters βi . The standard error or uncertainty is then Var(β̂i ).
There is one more missing element: we have not explained how to obtain the
variances σ 2 of the outputs y. In an actual machine learning task, we would not
know anything about the true relation, as given in Eq. (3.10), governing our dataset.
The only information we have access to is a single dataset. Therefore, we have to
estimate the variance using the samples in our dataset, which is given by
1 m
σ̂ 2 = (yi − f (xi |β̂))2 , (3.17)
X
m − n − 1 i=1
where yi are the output values from our dataset and f (xi |β̂) is the corresponding
prediction. Note that we normalized the above expression by (m − n − 1) instead
of m to ensure that E(σ̂ 2 ) = σ 2 , meaning that σ̂ 2 is an unbiased estimator of σ 2 .
Our ultimate goal is not simply to fit a model to the dataset. We want our
model to generalize to inputs not within the dataset. To assess how well this is
achieved, let us consider the prediction ãT β̂ on a new random input-output pair
(a, y0 ). The output is again subject to an error y0 = ãT β true + . In order to
compute the expected error of the prediction, we compute the expectation value of
the loss function over these previously unseen data. This is also known as the test or
University of Zurich 20
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
generalization error. For the square-distance loss function, this is the mean square
error (MSE)
MSE(β̂) =E (y0 − ãT β̂)2
=E ( + ãT β true − ãT β̂)2
(3.18)
=E(2 ) + [ãT (β true − E(β̂))]2 + E [ãT (β̂ − E(β̂))]2
=σ 2 + [ãT Bias(β̂)]2 + ãT Var(β̂)ã.
There are three terms in the expression. The first term is the irreducible or intrinsic
uncertainty of the dataset. The second term represents the bias and the third term
is the variance of the model. For RSS linear regression, the estimate is unbiased so
that
MSE(β̂) = σ 2 + ãT Var(β̂)ã. (3.19)
Based on the assumption that the dataset indeed derives from a linear model as
given by Eq. (3.10) with a Gaussian error, it can be shown that the RSS solution,
Eq. (3.7), gives the minimum error among all unbiased linear estimators, Eq. (3.1).
This is known as the Gauss-Markov theorem.
This completes our error analysis of the method.
i=1 j=0
where λ > 0 is a positive parameter. This is almost the same as the RSS apart
from the term proportional to λ [c.f. Eq. (3.4)]. The effect of this new term is to
penalize large parameters βj and bias the model towards smaller absolute values.
The parameter λ is an example of a hyper-parameter, which is kept fixed during the
training. On fixing λ and minimizing the loss function, we obtain the solution
β̂ ridge = (X f + λI)−1 X
fT X fT Y , (3.22)
University of Zurich 21
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
Error
Generalization Error
Variance
Bias
Model Complexity
from which we can see that as λ → ∞, β̂ ridge → 0. By computing the bias and
variance,
it is also obvious that increasing λ increases the bias, while reducing the variance.
This is the tradeoff between bias and variance. By appropriately choosing λ it is
possible that the generalization error can be reduced. We will introduce in the next
section a common strategy how to find the optimal value for λ.
The techniques presented here to reduce the generalization error, namely drop-
ping of features and biasing the model to small parameters, are part of a large class
of methods known as regularization. Comparing the two methods, we can see a sim-
ilarity. Both methods actually reduce the complexity of our model. In the former,
some parameters are set to zero, while in the latter, there is a constraint which effec-
tively reduces the magnitude of all parameters. A less complex model has a smaller
variance but larger bias. By balancing these competing effects, generalization can
be improved, as illustrated schematically in Fig. 8.
In the next chapter, we will see that these techniques are useful beyond ap-
plications to linear methods. We illustrate the different concepts in the following
example.
3.1.3 Example
We illustrate the concepts of linear regression using a medical dataset. In the process,
we will also familiarize ourselves with the standard machine learning workflow [see
Fig. 9]. For this example, we are given 10 data features, namely age, sex, body
mass index, average blood pressure, and six blood serum measurements from 442
diabetes patients, and our task is to train a model f (x|β) [Eq. (3.1)] to predict a
quantitative measure of the disease progression after one year.
Recall that the final aim of a machine-learning task is not to obtain the smallest
possible value for the loss function such as the RSS, but to minimize the general-
ization error on unseen data [c.f. Eq. (3.18)]. The standard approach relies on a
division of the dataset into three subsets: training set, validation set and test set.
The standard workflow is summarized in Box 1.
University of Zurich 22
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
Dataset
Tune
Hyperparameters Final
Validate Train Model Test Final
Model Model Model
Box 1: ML Workflow
1. Divide the dataset into training set T , validation set V and test set S. A
common ratio for the split is 70 : 15 : 15.
3. Train the model with only the training set, in other words minimize the
loss function on the training set. [This corresponds to Eq. (3.7) or (3.22)
for the linear regression, where X
f only contains the training set.]
4. Evaluate the MSE (or any other chosen metric) on the validation set, [c.f.
Eq. (3.18)]
1 X
MSEvalidation (β̂) = (yj − f (xj |β̂))2 . (3.24)
|V| j∈V
5. Pick a different value for the hyperparameters and repeat steps 3 and 4,
until validation error is minimized.
It is important to note that the test set S was not involved in optimizing either
parameters β or the hyperparameters such as λ.
Applying this procedure to the diabetes dataset11 , we obtain the results in
Fig. 10. We compare RSS linear regression with the ridge regression, and indeed
we see that by appropriately choosing the regularization hyperparameter λ, the
generalization error can be minimized.
As side remark regarding the ridge regression, we can see on the left of Fig. 11,
that as λ increases, the magnitude of the parameters, Eq. (3.22), β̂ ridge decreases.
11
Source: https://www4.stat.ncsu.edu/∼boos/var.select/diabetes.html
University of Zurich 23
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
Consider on the other hand, a different form of regularization, which goes by the
name lasso regression, where the loss function is given by
m n
Llasso (β) = [yi − f (xi |β)]2 + α (3.26)
X X
|βj |.
i=1 j=0
Despite the similarities, lasso regression has a very different behavior as depicted on
the right side of Fig. 11. As α increases some parameters actually vanish and can
be ignored completely. This actually corresponds to dropping certain data features
completely and can be useful if we are interested in selecting the most important
features in a dataset.
University of Zurich 24
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
j=1
hyperplane with a sign depending on which side of the plane the point xi lies. To
use this model as a classifier, we thus define
which yields {−1, +1}. If the two classes are (completely) linearly separable, then
the goal of the classification is to find a hyperplane that separates the two classes
in feature space. Specifically, we look for parameters β, such that
where M is called the margin. The optimal solution β̂ then maximizes this margin.
Note that instead of fixing the norm of βj≥1 and maximizing M , it is customary to
minimize nj=1 βj2 setting M = 1 in Eq. (3.29).
P
In most cases, the two classes are not completely separable. In order to still find
a good classifier, we allow some of the points xi to lie within the margin or even
on the wrong side of the hyperplane. For this purpose, we rewrite the optimization
constraint Eq. (3.29) to
1X n
min βj2 + C (3.31)
X
ξi
β,{ξi } 2
j=1 i
subject to the constraint Eq. (3.30). Note that the second term with hyperparameter
C acts like a regularizer, in particular a lasso regularizer. As we have seen in the
example of the previous section, such a regularizer tries to set as many ξi to zero as
possible.
University of Zurich 25
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
2.0
1.8
Width [cm]
1.6
1.4
1.2
1.0
4.00 4.50 5.00 5.50 6.00
Legth [cm]
Figure 12: Binary classification. Hyperplane separating the two classes and
margin M of the linear binary classifier. The support vectors are denoted by a
circle around them.
βj = (3.33)
X
αi yi xij ,
i
0 = (3.34)
X
αi yi
i
αi = C − µi , ∀i. (3.35)
subject to αi yi = 0 and 0 ≤ αi ≤ C 12
. Using Eq. (3.33), we can reexpress βj to
P
i
find
f (x|{αi }) = 0
αi yi xT xi + β0 , (3.40)
X
where the sum only runs over the points xi , which lie within the margin, as all other
points have αi ≡ 0 [see Eq. (3.37)]. These points are thus called the support vectors
12
Note that the constraints for the minimization are not equalities, but actually inequalities. A
solution thus has to fulfil the additional Karush-Kuhn-Tucker constraints
University of Zurich 26
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
and are denoted in Fig. 12 with a circle around them. Finally, note that we can use
Eq. (3.37) again to find β0 .
0
(y)
e
1 .
.. ..
.
y −→ e(y) = = 1 , (3.41)
e(y)
y
.
. ..
. .
e(y)
p
0
(y)
where el = 1 if l = y and zero for all other l = 1, . . . , p. A main advantage of this
encoding is that we are not forced to choose a potentially biasing ordering of the
classes as we would when arranging them along the ray of integers.
A linear approach to this problem then again mirrors the case for linear re-
gression. We fit a multi-variate linear model, Eq. (3.8), to the one-hot encoded
dataset {(x1 , e(y1 ) ), . . . , (xm , e(ym ) )}. By minimising the RSS, Eq. (3.4), we obtain
the solution
β̂ = (X f −1 X
fT X) fT Y, (3.42)
where Y is the m by p output matrix. The prediction given an input x is then a
p-dimensional vector f (x|β̂) = x̃T β̂. On a generic input x, it is obvious that the
components of this prediction vector would be real valued, rather than being one of
the one-hot basis vectors. To obtain a class prediction F (x|β̂) = 1, . . . , p, we simply
take the index of the largest component of that vector, i.e.,
The argmax function is a non-linear function and is a first example of what is referred
to as activation function.
For numerical minimization, it is better to use a smooth generalization of the
argmax function as activation function. Such an activation function is given by the
University of Zurich 27
Machine Learning for the Sciences 3 SUPERVISED LEARNING I
softmax function
efk (x|β̂)
Fk (x|β̂) = Pp . (3.44)
fk0 (x|β̂)
k0 =1 e
sion 13 .
The current linear approach based on classification of one-hot encoded data
generally works poorly when there are more than two classes. We will see in the
next chapter that relatively straightforward non-linear extensions of this approach
can lead to much better results.
13
Note that the softmax function for two classes is the logistic function.
University of Zurich 28
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
x1
Input x2
Output
x3
"carrot"
...
xn
information flow
University of Zurich 29
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
ReLu
sigmoid
tanh
Figure 14: The artificial neuron. Left: schematic of a single neuron and its
functional form. Right: examples of the commonly used activation functions: ReLU,
sigmoid function and hyperbolic tangent.
where z1 , z2 , . . . , zk are the outputs of the neurons from the preceding layer to which
the neuron is connected.
The linear function is parametrized as
k
q(z1 , . . . , zk ) = wj zj + b. (4.2)
X
j=1
Here, the real numbers w1 , w2 , . . . , wk are called weights and can be thought of as
the “strength” of each respective connection between neurons in the preceding layer
and this neuron. The real parameter b is known as the bias and is simply a constant
offset 14 . The weights and biases are the variational parameters we will need to
optimize when we train the network.
The activation function g is crucial for the neural network to be able to approxi-
mate any smooth function, since so far we merely performed a linear transformation.
For this reason, g has to be nonlinear. In analogy to biological neurons, g represents
the property of the neuron that it “spikes”, in other words it produces a noticeable
output only when the input potential grows beyond a certain threshold value. The
most common choices for activation functions, shown in Fig. 14, include:
• ReLU : ReLU stands for rectified linear unit and is zero for all numbers smaller
than zero, while a linear function for all positive numbers.
14
Note that this bias is unrelated to the bias we learned about in regression.
University of Zurich 30
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
• Softmax: The softmax function is a common activation function for the last
layer in a classification problem (see below).
The choice of activation function is part of the neural network architecture and
is therefore not changed during training (in contrast to the variational parameters
weights and bias, which are adjusted during training). Typically, the same activation
function is used for all neurons in a layer, while the activation function may vary
from layer to layer. Determining what a good activation function is for a given layer
of a neural network is typically a heuristic rather than systematic task.
Note that the softmax provides a special case of an activation function as it
explicitly depends on the output of the q functions in the other neurons of the same
layer. Let us label by l = 1, . . . , n the n neurons in a given layer and by ql the
output of their respective linear transformation. Then, the softmax is defined as
eql
gl (q1 , . . . , qn ) = Pn (4.3)
l0 =1 eql0
gl (q1 , . . . , qn ) = 1, (4.4)
X
Here, W [n] and b[n] are the weight matrix and bias vectors of the n-th layer. Specif-
ically, W [1] is the k × l weight matrix of the hidden layer with k and l the number
[1]
of neurons in the input and hidden layer, respectively. Wij is the j-the entry of
University of Zurich 31
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
[1]
the weight vector of the i-th neuron in the hidden layer, while bi is the bias of this
[2] [2]
neuron. The Wij and bi are the respective quantities for the output layer. This
network is called fully connected or dense, because each neuron in a given layer takes
as input the output from all the neurons in the previous layer, in other words all
weights are allowed to be non-zero.
Note that for the evaluation of such a network, we first calculate all the neurons’
values of the first hidden layer, which feed into the neurons of the second hidden layer
and so on until we reach the output layer. This procedure, which is possible only
for feed-forward neural networks, is obviously much more efficient than evaluating
the nested function of each output neuron independently.
4.3 Training
Adjusting all the weights and biases to achieve the task given using data samples D =
{(x1 , y 1 ), . . . , (xm , y m )} constitutes the training of the network. In other words, the
training is the process that makes the network an approximation F θ (x) ≈ F(x) to
the mathematical function F(x) = y that we want it to represent. Each neuron has
its own bias and weights, a potentially huge number of variational parameters that
we collectively denote by θ = {W, B}, and we will need to adjust all of them.
We have already seen in the previous chapter how one in principle trains a
variational function: We introduce a loss function L(θ), which characterizes how
well the network is doing at predicting the correct output for each input. The loss
function now depends, through the neural network, on all the weights and biases.
Once we have defined a loss function, we also already understand how to train the
network: we need to minimize L(θ) with respect to W and B. However, L is typically
a high-dimensional function and may have many nearly degenerate minima. Unlike
in the previous chapter, finding the loss function’s absolute minimum exactly is
typically intractable analytically and may come at prohibitive costs computationally.
The practical goal is therefore rather to find a “good” instead than the absolute
minimum through training. Having found such “good” values for W, B, the network
can then be applied on previously unseen data.
In order to minimize the loss function numerically, we employ an iterative method
called gradient descent 15 . Intuitively, the method corresponds to “walking down the
hill” in our many parameter landscape until we reach a (local) minimum. For this
15
We will discuss more elaborate variations of this method in the exercises.
University of Zurich 32
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
purpose, we use the derivatives of the cost function to update all the weights and
biases incrementally and search for the minimum of the function via tiny steps on
the many-dimensional surface. More specifically, we can update all weights and
biases in each step as
∂L(θ)
θα → θα − η . (4.6)
∂θα
The variable η, also referred to as learning rate, specifies the size of the step we
use to walk the landscape. If the learning rate is too small in the beginning, we
might get stuck in a local minimum early on, while for too large η we might never
find a minimum. The learning rate is a hyperparameter of the training algorithm.
Note that gradient descent is just a discrete many-variable version of the analytical
search for extrema which we know from calculus: An extremum is characterized by
vanishing derivatives in all directions, which results in convergence in the gradient
descent algorithm outlined above.
The choice of loss function may strongly impact the efficiency of the training
and is based on heuristics (as was the case with the choice of activation functions).
In the previous chapter, we already encountered one loss function, the mean square
error
1 X m
L(θ) = ||F θ (xi ) − y i ||22 . (4.7)
2m i=1
qP
Here, ||a||2 = i ai is the L2 norm and thus, this loss function is also referred
2
L1 loss. Note that the L2 norm, given the squares, puts more weight on outliers
than the L1 loss.
The two loss functions introduced so far are the most common loss functions for
regression tasks, where the network provides a continuous output. For tasks, where
each neuron outputs a probability, for example in classification problems, a great
choice is the cross-entropy between true label, y i and the network output, F θ (xi )
defined as
m
Lent (θ) = − [y i · log (F θ (xi )) + (1 − y i ) · log (1 − F θ (xi ))] , (4.9)
X
i=1
where the logarithm is taken element-wise. This loss function is also called negative
log-likelihood. It is here written for outputs that lie between 0 and 1, as is the
case when the activation function of the last layer of the network is sigmoid σ(z) =
1/(1 + e−z ).
Before continuing with the intricacies of implementing such a gradient descent
algorithm for (deep) neural networks, we can try to better understand the loss
functions. While the L1 and L2 loss are obvious measures, the choice of cross
entropy is less intuitive. The different cost functions actually differ by the speed
of the learning process. The learning rate is largely determined by the partial
derivatives of the cost function ∂L/∂θ, see Eq. (4.6) and slow learning appears
University of Zurich 33
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
when these derivatives become small. Let us consider the toy example of a single
neuron with sigmoid activation F (x) = σ(wx + b). Then, the L2 cost function has
derivatives ∂w L and ∂b L that are proportional to σ 0 (w + b). We observe that these
derivatives become very small for σ(w + b) → 1, because σ 0 is very small in that
limit, leading to a slowdown of learning. This slowdown is also observed in more
complex neural networks with L2 loss; we considered the simple case here only for
analytical simplicity.
Given this observation, we want to see whether the cross entropy can improve
the situation. We again compute the derivative of the cost function with respect to
the weights for a single term in the sum and a network that is composed of a single
sigmoid and a general input-output pair {x, y}
1−y
!
∂Lent y
=− − σ 0 (wx + b)x
∂w σ(wx + b) 1 − σ(wx + b)
σ 0 (wx + b)x (4.10)
= [σ(wx + b) − y]
σ(wx + b)[1 − σ(wx + b)]
= x[σ(wx + b) − y],
where in the last step we used that σ 0 (z) = σ(z)[1−σ(z)]. This is a much better result
than what we got for the L2 loss. The learning rate is here directly proportional
to the error between data point and prediction [σ(wx + b) − y]. The mathematical
reason for this change is that σ 0 (z) cancels out due to the specific form of the cross
entropy. A similar expression holds true for the derivative with respect to b,
∂Lent
= [σ(wx + b) − y]. (4.11)
∂b
In fact, if we insisted that we want the very intuitive form of Eqs. (4.10) and (4.11)
for the gradients, we can derive the cost function for the sigmoid activation function
to be the cross-entropy. To see this, we start from
∂L ∂L 0
= F (4.12)
∂b ∂F
and F 0 = F (1 − F ) for the sigmoid activation, which, comparing to the desired form
of Eq. (4.11), yields
∂L F −y
= . (4.13)
∂F F (1 − F )
When integrated with respect to F , Eq. (4.13) gives exactly the cross-entropy (up
to a constant). Starting from Eqs. (4.10) and (4.11), we can thus think of choosing
the cost function using backward engineering. Following this logic, we can think of
other pairs of final-layer activations and cost functions that may work well together.
What happens if we change the activation function in the last layer from sigmoid
to softmax, which is usually used for the case of a classification task? For the loss
function, we consider just the first term in the cross entropy—for softmax, this form
is appropriate—
m
L(θ) = − y i · log (F θ (xi )) , (4.14)
X
i=1
where, again, the logarithm is taken element-wise. For concreteness, let us look at
one-hot encoded classification problem. Then, all y i labels are vectors with exactly
University of Zurich 34
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
one entry “1”. Let that entry have index ni in the vector, such that (y i )j = δni ,j .
The loss function then reads
m
L(θ) = − log (Fni (xi )) . (4.15)
X
i=1
Due to the properties of the softmax, Fni (xi ) is always ≤ 1, so that the loss function
is minimized if it approaches 1, the value of the label. For the gradients, we obtain
∂L m
1 ∂Fni (xi )
=−
X
∂bj i=1 Fni (xi ) ∂bj
m
1
=− [Fni (xi )δni ,j − Fni (xi )Fj (xi )] (4.16)
X
i=1
and a similar result for the derivatives with respect to the weights. We observe that,
again, the gradient has a similar favorable structure to the previous case, in that it
is linearly dependent on the error that the network makes.
Backpropagation
While the process of optimizing the many variables of the loss function is mathemat-
ically straightforward to understand, it presents a significant numerical challenge:
[k]
For each variational parameter, for instance a weight in the k-th layer Wij , the par-
[k]
tial derivative ∂L/∂Wij has to be computed. And this has to be done each time the
network is evaluated for a new dataset during training. Naively, one could assume
that the whole network has to be evaluated for each derivative. Luckily there is an
algorithm that allows for an efficient and parallel computation of all derivatives—
this algorithm is known as backpropagation. The algorithm derives directly from the
chain rule of differentiation for nested functions and is based on two observations:
(1) The loss function is a function of the neural network F θ (x), that is L ≡ L(F ).
(2) To determine the derivatives in layer k only the derivatives of the subsequent
layers, given as Jacobi matrix
Df [l] (z [l−1] ) = ∂f [l] /∂z [l−1] , (4.17)
with l > k and z [l−1] the output of the previous layer, as well as
[k]
[k−1] [k]
∂z [k]
∂g [k] [k]
∂qi
∂g
[k] zj θα = Wij
= = ∂qi
∂g [k] [k] (4.18)
∂θα
[k]
∂qi
[k] ∂θα
[k] θα = bi
∂qi
are required.
The calculation of the Jacobi matrix thus has to be performed only once for every
update. In contrast to the evaluation of the network itself, which is propagating
forward, (output of layer k is input to layer k + 1), we find that a change in the
Output propagates backwards through the network, hence the name16 .
16
Backpropagation is actually a special case of a set of techniques known as automatic differenti-
ation (AD). AD makes use of the fact that any computer program can be composed of elementary
operations (addition, subtraction, multiplication, division) and elementary functions (sin, exp, . . . ).
By repeated application of the chain rule, derivatives of arbitrary order can be computed auto-
matically.
University of Zurich 35
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
Algorithm 2: Backpropagation
University of Zurich 36
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
a k = 282 long vector, the hidden layer contains l neurons and the output layer
has p = 10 neurons, each corresponding to one digit in the one-hot encoding. The
output is then a probability distribution over these 10 neurons that will determine
which digit the network identifies.
As an exercise, we build a neural network according to these guidelines and train
it. How exactly one writes the code depends on the library of choice, but the generic
structure will be the following:
Example 1: MNIST
1. Import the data: The MNIST database is available for download at http:
//yann.lecun.com/exdb/mnist/
• Input layer: 282 = 784 neurons (the greyscale value of each pixel of
the image, normalized to a value in [0, 1), is one component of the
input vector).
• Fully connected hidden layer: Here one can experiment, starting
from as few as 10 neurons. The use of a sigmoid activation function
is recommended, but others can in principle be used.
• Output layer: Use 10 neurons, one for each digit. The proper acti-
vation function for this classification task is, as discussed, a softmax
function.
3. Choose the loss function: Since we are dealing with a classification task
with a softmax output activation layer, we use the cross-entropy, Eq. (4.14).
With the training completed, we want to understand how well the final model
performs in recognizing handwritten digits. For that, we introduce the accuracy
defined by
correct predictions
accuracy = . (4.19)
total predictions
Running 20 trainings with 30 hidden neurons with a learning rate η = 0.5 (a
mini-batch size of 10 and train for 30 epochs), we obtain an average accuracy of
96.11 %. With a L2 loss, we obtain only slightly worse results of 95.95%. For
100 hidden neurons, we obtain 97.99%. That is a considerable improvement over a
quadratic cost, where we obtain 96.73%—the more complex the network, the more
important it is to choose a good loss function. Still, these numbers are not even close
to state of the art neural network performances. The reason is that we have used
the simplest possible all-to-all connected architecture with only one hidden layer.
University of Zurich 37
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
4.5 Regularization
In the previous sections, we have illustrated an artificial neural network that is
constructed analogous to neuronal networks in the brain. A priori, a model is only
given a rough structure, within which it has a huge number of parameters to adjust
by learning from the training set. While we already understand that this is an
extraordinarily powerful approach, this method of learning comes with its own set
of challenges. The most prominent of them is the generalization of the rules learned
from training data to unseen data.
We have already encountered in the previous chapter how the naive optimization
of a linear model reduces its generalization. However, we have also seen how the
generalization error can be improved by reducing the variance at the cost of a biased
model using regularization. Training neural networks comes with the same issue:
we are always showing the model we built a training set that is limited in one way
or another and we need to make sure that the neural network does not learn partic-
ularities of that given training set, but actually extracts a general knowledge. As we
will discuss in the following, the mitigation strategies of regularization encountered
for linear models can be similarly used for neural networks.
Step zero to avoid over-fitting is to create a sufficiently representative and diverse
training set. Once this is taken care of, we can take several steps for the regular-
ization of the network. The simplest, but at the same time most powerful option is
introducing dropout layers. This regularization is very similar to dropping features
that we discussed for linear regression. However, the dropout layer ignores a ran-
domly selected subset of neuron outputs in the network only during training. Which
neurons are dropped is chosen at random for each training step. This regulariza-
tion procedure is illustrated in Fig. 16. By randomly discarding a certain fraction
University of Zurich 38
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
of neurons we add noise to the hidden layers that ensures that the network does
not get fixed at small particular features of the training set and is better equipped
to recognize the more general features. Another way of looking at it is that this
procedure corresponds to training a large number of neural networks with different
neuron connections in parallel. The fraction of neurons that are ignored in a dropout
layer is a hyperparameter that is fixed a priori. Maybe it is counter-intuitive but
the best performance is often achieved when this number is sizable, between 20%
and 50%. It shows the remarkable resilience of the network against fluctuations.
Note that, as is usually done, we penalize only the weights of the neural network,
such that the sums above run over all weights Wj of the network. Finally, while we
can in principle set different λ for each layer, it is common to use one for the whole
network.
As for the linear models, L2 regularization shrinks all parameters symmetrically,
whereas L1 regularization usually causes a subset of parameters to vanish, a property
often used for feature selection. For this reason, both methods are also referred to as
weight decay. Either way, both L1 and L2 regularizations restrict the expressiveness
of the neural network, thus encouraging it to learn generalizable features rather than
overfitting specific features of the data set.
The weights are commonly initialized with small values and increase during train-
ing. This naturally limits the capacity of the network, because for very small weights
it effectively acts as a linear model (when one approximates the activation function
by a linear function). Only once the weights become bigger, the network explores
its nonlinearity.
Another regularization technique consists in artificially enlarging the data set.
Data is often costly, but we have extra knowledge about what further data might
look like and feed this information in the machine learning workflow. For instance,
going back to the MNIST example, we may shift or tilt the existing images or apply
small transformations to them. By doing that, researchers were able to improve
MNIST performance by almost 1 percent 18 . In particular if we know symmetries
of the problem from which the data originates (such as time translation invariance,
invariance under spatial translations or rotations), effective generation of augmented
datasets is possible. For classification problems in particular, data may not be
17
L2 regularization is also known as Tikhonov regularization
18
See Simard et al., http://dx.doi.org/10.1109/ICDAR.2003.1227801
University of Zurich 39
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
Kernel
Figure 17: Convolutional layer in 2D. Here with filter size k = 3 and stride
s = 2. The filter is first applied to the 3 × 3 sub-image in the top left of the input,
which yields the first pixel in the feature map. The filter then moves s neurons to
the right, which yields the next pixel and so on. After moving all the way to the
right, the filter moves s pixels down and starts from the left again until reaching the
bottom right.
University of Zurich 40
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
Figure 18: Pooling layer (a) an average pooling and (b) a max pooling layer (both
n = 3).
For two-dimensional data, such as shown in the example in Fig. 17, we write the
discrete convolution explicitly as
k X
k
qi,j = fn,m xsi−m,sj−n + b0 , (4.25)
X
m=1 n=1
where fn,m are the weights of the kernel, which has linear size k, and b0 is a bias.
Finally, s is called stride and refers to the number of pixels the filter moves per
application. The output, q, is called feature map. Note that the dimension of the
feature map is nq × nq with nq = b(nin − k)/s + 1c when the input image is of
dimensions nin × nin : application of a convolutional layer thus reduces the image
size, an effect not always intended. To avoid this reduction, the original data can
be padded, for example by adding zeros around the border of the data to ensure the
feature map has the same dimension as the input data.
For typical convolutional networks, one applies a number of filters for each layer
in parallel, where each filter is trained to recognize different features. For instance,
one filter could start to be sensitive to contours in an image, while another filter
recognizes the brightness of a region. Further, while filters in the first layers may be
sensitive to local patterns, the ones in the later layers recognize larger structures.
This distribution of functionalities between filters happens automatically, it is not
preconceived when building the neural network.
4.6.2 Pooling
Another very useful layer, in particular in combination with convolutional layers, is
the pooling layer. Each neuron in the pooling layer takes input from n (neighboring)
neurons in the previous layer—in the case of a convolutional network for each feature
map individually—and only retains the most significant information. Thus, the
pooling layer helps to reduce the spatial dimension of the data. What is considered
significant depends on the particular circumstances: Picking the neuron with the
maximum input among the n, called max pooling, detects whether a given feature is
present in the window. Furthermore, max pooling is useful to avoid dead neurons, in
other words neurons that are stuck with a value near 0 irrespective of the input and
such a small gradient for its weights and biases that this is unlikely to change with
further training. This is a scenario that can often happen especially when using the
ReLU activation function. Average pooling, in other words taking the average value
of the n inputs is a straight forward compression. Note that unlike other layers, the
pooling layer has just a small set of n connections with no adjustable weights. The
functionality of the pooling layer is shown in Fig. 18 (a) and (b).
University of Zurich 41
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
An extreme case of pooling is global pooling, where the full input is converted to
a single output. Using a max pooling, this would then immediately tell us, whether
a given feature is present in the data.
University of Zurich 42
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
DNA random
G G conv pool dense
C C
T T
A A
T T DNA
C T
A T
G G
G G
C T
T T random
A A
G A
G A
T A
Figure 20: Neural network classification of DNA sequences. The two lower
panels show loss function and accuracy on the training (evaluation) data in green
(orange) as a function of the training step respectively.
example for supervised learning with neural networks that will be useful in many
other scenarios. In particular, we implement the following architecture:
• Input layer: The input layer has dimension 36 × 4 (36 entries per
DNA sequence, 4 to encode each of 4 different bases A, G, C, T)
Example: [[1,0,0,0], [0,0,1,0], [0,0,1,0], [0,0,0,1]] = ACCT
• Convolutional layer: Kernel size k = 4, stride s = 1 and number of
filters N = 64.
• Pooling layer: max pooling over n = 2 neurons, which reduces the
output of the previous layer by a factor of 2.
• Dense layer: 256 neurons with a ReLU activation function.
• Output layer: 2 neurons (DNA and non-DNA output) with softmax
activation function.
A schematic of the network structure as well as the evolution of the loss and the
accuracy measured over the training and validation sets with the number of training
steps are shown in Fig. 20. Comparing the accuracies of the training and validation
sets is a standard way to avoid overfitting: On the examples from the training set
we can simply check the accuracy during training. When training and validation
University of Zurich 43
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
accuracy reach the same number, this indicates that we are not overfitting on the
training set since the validation set is never used to adjust the weights and biases. A
decreasing validation accuracy despite an increasing training accuracy, on the other
hand, is a clear sign of overfitting.
We see that this simple convolutional network is able to achieve around 80%
accuracy. By downloading a larger training set, ensuring that only truly random
sequences are labeled as such, and by optimizing the hyper-parameters of the net-
work, it is likely that an even higher accuracy can be achieved. We also encourage
you to test other architectures: one can try to add more layers (both convolutional
and dense), adjust the size of the convolution kernel or stride, add dropout layers,
and finally, test whether it is possible to reach higher accuracies without over-fitting
on the training set.
For the loss function, we again use cross-entropy between the output and the labels.
Notice here the repeated structure of convolutional layers and pooling layers. This
is a very common structure for deep convolutional networks. With this model, we
achieve an accuracy on the MNIST test set of 99.1%, a massive improvement over
the simple dense network.
University of Zurich 44
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
Figure 21: Recurrent neural network architecture. The input xt is fed into
the recurrent cell together with the (hidden) memory ht−1 of the previous step to
produce the new memory ht and the output yt . One can understand the recurrent
structure via the “unwrapped” depiction of the structure on the right hand side of
the figure. The red arrows indicate how gradients are propagated back in time for
updating the network parameters.
University of Zurich 45
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
where we used for the nonlinearity the hyperbolic tangent, a common choice, which
is applied element-wise. Further, if the input data xt has dimension n and the
hidden state ht dimension m, the weight matrices Whh and Wxh have dimensions
m × m and m × n, respectively. Finally, the output at step t can be calculated using
the hidden state ht ,
y t = Who ht . (4.27)
A schematic of this implementation is depicted in Fig. 22(a). Note that in this
simplest implementation, the output is only a linear operation on the hidden state.
A straight forward extension—necessary in the case of a classification problem—is
to add a non-linear element to the output as well. Concretely,
y t = g(Who ht + by ) (4.28)
with g(q) some activation function, such as a softmax. Note that while in principle
an output can be calculated at every step, this is only done after the last input
element in a classification task.
An interesting property of RNNs is that the weight matrices and biases, the
parameters we learn during training, are the same for each input element. This
property is called parameter sharing and is in stark contrast to dense networks.
In the latter architecture, each input element is connected through its own weight
matrix. While it might seem that this property could be detrimental in terms of
representability of the network, it can greatly help extracting sequential information:
Similar to a filter in a CNN, the network does not need to learn the exact location
of some specific sequence that carries the important information, it only learns to
recognize this sequence somewhere in the data. Note that the way each input element
is processed differently is instead implemented through the hidden memory.
Parameter sharing is, however, also the root of a major problem when training
a simple RNN. To see this, remember that during training, we update the network
parameters using gradient descent. As in the previous sections, we can use backprop-
agation to achieve this optimization. Even though the unwrapped representation of
the RNN in Fig. 21 suggests a single hidden layer, the gradients for the backpropaga-
tion have to also propagate back through time. This is depicted in Fig. 21 with the
red arrows 20 . Looking at the backpropagation algorithm in Sec. 4.3 , we see that to
0
use data points from N steps back, we need to multiply N − 1 Jacobi matrices Df [t ]
with t − N < t0 ≤ t. Using Eq. (4.27), we can write each Jacobi matrix as a product
of the derivative of the activation function, ∂q tanh(q), with the weight matrix. If ei-
ther of these factors 21 is much smaller than 1, the gradients decrease exponentially.
This is known as the problem of vanishing gradients. Alternatively, if the factors
are much larger than 1, the gradients grow exponentially, known as the problem
of exploding gradients. Both problems make learning long-term dependencies with
simple RNNs practically impossible.
20
In the context of RNNs, backpropagation is thus referred to as backpropagation through time
(BPTT).
21
For the weight matrix this means the singular values.
University of Zurich 46
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
(a) (b)
Figure 22: Comparison of (a) a simple RNN and (b) a LSTM. The boxes
denote neural networks with the respective activation function, while the circles de-
note element-wise operations. The dark green box indicates that the four individual
neural networks can be implemented as one larger one.
Note that the problem of exploding gradients can be mitigated by clipping the
gradients, in other words scaling them to a fixed size. Furthermore, we can use the
ReLU activation function instead of a hyperbolic tangent, as the derivative of the
ReLU for q > 0 is always 1. However, the problem of the shared weight matrices
can not so easily be resolved. In order to learn long-time dependencies, we have to
introduce a different architecture. In the following, we will discuss the long short-
term memory (LSTM) network. This architecture and its variants are used in most
applications of RNNs nowadays.
University of Zurich 47
Machine Learning for the Sciences 4 SUPERVISED LEARNING II
which due to the hyperbolic tangent, −1 < gtα < 1 for each element. However, we
do not necessarily update each element of the cell state, but rather we introduce
another gate, which determines whether to actually write to the cell,
again with 0 < iαt < 1. Finally, we update the cell state
Output step
In the final step, we decide how much of the information stored in the cell state
should be written to the new hidden state,
The full structure of the LSTM with the four gates and the element-wise opera-
tions is schematically shown in Fig. 22(b). Note that we can concatenate the input
xt and hidden memory ht−1 into a vector of size n + m and write one large weight
matrix W of size 4m × (m + n).
So far, we have only used the RNN in a supervised setting for classification
purposes, where the input is a sequence and the output a single class at the end of
the full sequence. A network that performs such a task is thus called a many-to-one
RNN. We will see in the next section, that unlike the other network architectures
encountered in this section, RNNs can straight-forwardly be used for unsupervised
learning, usually as one-to-many RNNs.
University of Zurich 48
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
5 Unsupervised Learning
In Sec. 4, we discussed supervised learning tasks, for which datasets consist of input-
output pairs, or data-label pairs. More often than not, however, we have data
without labels and would like to extract information from such a dataset. Clustering
problems fall in this category, for instance: We suspect that the data can be divided
into different types, but we do not know which features distinguish these types.
Mathematically, we can think of the data x as samples that were drawn from
a probability distribution Pdata (x). The unsupervised learning task is to represent
this distribution with a model Pθ (x), for example represented by a neural network
and parametrized by θ. The model can then be used to study properties of the
distribution or to generate new ‘artificial’ data. The models we encounter in this
chapter are thus also referred to as generative models. In general, unsupervised
learning is conceptually more challenging than supervised learning. At the same
time, unsupervised algorithms are highly desirable, since unlabelled data is much
more abundant than labelled data. Moreover, we can in principle use a generative
model for a classification task by learning the joint probability distribution of the
data-label pair.
In this chapter, we will introduce three types of neural networks that are spe-
cific to unsupervised learning tasks: Restricted Boltzmann machines, autoencoders,
and generative adversarial networks. Furthermore, we will discuss how the RNN
introduced in the previous chapter can also be used for an unsupervised task.
i=1
where m is the number of samples in the data {xi }. The goal is to find the param-
eters θ that maximize the likelihood. Thanks to the monotonicity of the logarithm
function, we can work with the log-likelihood,
m
θ̂ = argmax Pθ (xi )
Y
θ i=1
m (5.2)
= argmax log Pθ (xi ),
X
θ i=1
University of Zurich 49
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Figure 23: Restricted Boltzmann machine. Each of the three visible units
and five hidden units represents a variable that can take the values ±1 and the
connections between them represent the entries Wij of the weight matrix that enters
the energy function (5.3).
cross-entropy between two probability distributions: the ‘true’ distribution Pdata (x),
from which the data has been drawn, and Pθ (x). While we do not have access to
Pdata (x) in principle, we estimate it empirically as a distribution peaked at the m
data points we have in our data set.
E(v, h) = − (5.3)
X X X
ai vi − bj hj − vi Wij hj ,
i j ij
where the vectors a, b, and the matrix W constitute the variational parameters θ
of the model. Given the energy, the probability distribution over the configurations
(v, h) is defined as
1
Pθrbm (v, h) = e−E(v,h) , (5.4)
Z
University of Zurich 50
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
where
Z= e−E(v,h) (5.5)
X
v,h
is a normalization factor called the partition function. The sum in Eq. (5.5) runs
over all binary vectors v and h, in other words vectors with entries 0 or 1. The
probability that the model assigns to a visible vector v is then the marginal over
the joint probability distribution Eq. (5.4),
1 X −E(v,h)
Pθrbm (v) = Pθrbm (v, h) = (5.6)
X
e ,
h Z h
where the sum runs over all hidden-layer configurations h ∈ {0, 1}nh .
As a result of the restriction, the visible units, with the hidden units fixed, are
mutually independent: given a choice of the hidden units h, we have an indepen-
dent probability distribution for each visible unit given by
where σ(x) = 1/(1 + e−x ) is the sigmoid function. Similarly, with the visible units
fixed, the individual hidden units are also mutually independent with the probability
distribution
The visible (hidden) units can thus be interpreted as artificial neurons connected to
the hidden (visible) units with sigmoid activation function and bias a (b). A direct
consequence of this mutual independence is that sampling a vector v or h reduces to
sampling every component individually. Notice that this simplification comes about
due to the restriction that visible (hidden) units do not directly interact amongst
themselves, meaning there are no terms proportional to vi vj or hi hj in Eq. (5.3). In
the following, we explain how one can train an RBM and discuss possible applications
of RBMs.
k=1
For the gradient descent, we need derivatives of the loss function of the form
∂L(θ) m
∂ log Pθrbm (xk )
=− (5.10)
X
.
∂Wij k=1 ∂W ij
University of Zurich 51
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
with the second one containing a sum over all possible visible-layer configurations
v ∈ {0, 1}nv . Note that similarly simple forms are found for the derivatives with
respect to the components of a and b. We can then iteratively update the parameters
just as we have done in Sec. 4,
∂L(θ)
Wij → Wij − η (5.12)
∂Wij
with a sufficiently small learning rate η. As we have seen in the previous chapter in
the context of backpropagation, we can reduce the computational cost by replacing
the summation over the whole data set in Eq. (5.10) with a summation over a small
randomly chosen batch of samples. This reduction in the computational cost comes
at the expense of noise, but at the same time it can help to improve generalization.
However, there is one more problem: The second summation in Eq. (5.11), which
contains 2nv terms, cannot be efficiently evaluated exactly. Instead, we have to
approximate the sum by sampling the visible layer v from the marginal probability
distribution Pθrbm (v). This sampling can be done using Gibbs sampling as follows:
With sufficiently many steps r, the vector v(r) is an unbiased sample drawn from
Pθrbm (v). By repeating the procedure, we can obtain multiple samples to estimate
the summation. Note that this is still rather computationally expensive, requiring
multiple evaluations on the model.
The key innovation which allows the training of an RBM to be computation-
ally feasible was proposed by Geoffrey Hinton (2002). Instead of obtaining multiple
samples, we simply perform the Gibbs sampling with r steps and estimate the sum-
mation with a single sample, in other words we replace the second summation in
Eq. (5.11) with
where v 0 = v(r) is simply the sample obtained from r-step Gibbs sampling. With
this modification, the gradient, Eq. (5.11), can be approximated as
University of Zurich 52
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Having trained the RBM to represent the underlying data distribution Pdata (x),
there are a few ways one can use the trained model:
1. Pretraining We can use W and b as the initial weights and biases for a deep
network (c.f. Chapter 4), which is then fine-tuned with gradient descent and
backpropagation.
University of Zurich 53
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
original
Gibbs
sampling
a reconstruction of the signal, where the missing part has been repaired, as can been
seen at the bottom of Fig. 24.
Note that the same procedure can be used to reconstruct or denoise images. Due
to the limitation to binary data, however, the picture has to either be binarized, or
the input size to the RBM becomes fairly large for high-resolution pictures. It is
thus not surprising that while RBMs have been popular in the mid-2000s, they have
largely been superseded by more modern architectures such as generative adversarial
networks which we shall explore later in the chapter. However, they still serve a
pedagogical purpose and could also provide inspiration for future innovations, in
particular in science. A recent example is the idea of using an RBM to represent a
quantum mechanical state 22 .
t=1
where xt+1 is now the ‘label’ for the input xt and y t is the output of the network and
t runs over the input sequence with length N , where one usually uses an ‘end’ label
22
Carleo and Troyer, Science 355, 602 (2017)
University of Zurich 54
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
'label'
Training Generation
=
? ? ? ? ?
output
RNN
RNN
RNN
RNN
RNN
RNN
RNN
RNN
RNN
RNN
(...) (...)
input
Figure 25: RNN used as a generator. For training, left, the input data shifted
by one, xt+1 , are used as the label. For the generation of new sequences, right, we
input a single data point x1 and the RNN uses the recurrent steps to generate a
new sequence.
for xN +1 to signal the end of the sequence. This training is schematically shown in
Fig. 25.
For generating a new sequence, it is enough to have one single input point x1 to
start the sequence. Note that since we now can start with a single data point x1
and generate a whole sequence of data points {y t }, this mode of using an RNN is
referred to as one-to-many. This sequence generation is shown on the right side of
Fig. 25.
University of Zurich 55
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
HO CH 3
O N N H3C O O
CH 3
O
guess of the network, here we sample in each step from the probability distribution
y t to again have a one-hot-encoded vector for the input of the next step.
5.4 Autoencoders
Autoencoders are neuron-based generative models, initially introduced for dimen-
sionality reduction. The original purpose, thus, is similar to that of PCA or t-SNE
that we already encountered in Sec. 2, namely the reduction of the number of fea-
tures that describe our input data. Unlike for PCA, where we have a clear recipe
how to reduce the number of features, an autoencoder learns the best way of achiev-
ing the dimensionality reduction. An obvious question, however, is how to measure
the quality of the compression, which is essential for the definition of a loss function
and thus, training. In the case of t-SNE, we introduced two probability distributions
based on the distance of samples in the original and feature space, respectively, and
minimized their difference using the Kullback-Leibler divergence.
The solution the autoencoder uses is to have a neural network do first, the
dimensionality reduction, or encoding to the latent space, x 7→ eθ (x) = z, and
then, the decoding back to the original dimension, z 7→ dθ (z), see Fig. 27. For
the purpose of our discussion here, we assume that both the encoder and decoder
are deterministic. This architecture allows us to directly compare the original input
x with the reconstructed output dθ (eθ (x)), such that the autoencoder trains itself
unsupervised by minimizing the difference. A good example of a loss function that
achieves successful training and that we have encountered already is the L2 loss,
1 Xm
Lae (θ) = kxi − dθ (eθ (xi ))k22 . (5.16)
2m i=1
In other words, we compare point-wise the difference between the input to the
encoder with the decoder’s output.
Intuitively, the latent space with its lower dimension presents a bottleneck for the
information propagation from input to output. The goal of training is to find and
keep the most relevant information for the reconstruction to be optimal. The latent
space then corresponds to the reduced space in PCA and t-SNE. Note that much
like in t-SNE but unlike in PCA, the new features are in general not independent.
University of Zurich 56
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Encoder Decoder
1. If two input data points are close (according to some measure), their images
in the latent space should also be close. We call this property continuity.
2. Any point in the latent space should be mapped through the decoder onto a
meaningful data point, a property we call completeness.
While there are principle ways to achieve regularization along similar paths as dis-
cussed in the previous section on supervised learning, we will discuss here a solution
that is particularly useful as a generative model: the variational autoencoder (VAE).
The idea behind VAEs is for the encoder to output not just an exact point z
in the latent space, but a (factorized) Normal distribution of points, N (µ, σ). In
particular, the output of the encoder comprises two vectors, the first representing
the means µ, and the second the standard deviations σ. The input for the decoder
is then sampled from this distribution, z ∼ N (µ, σ), and the original input is recon-
structed and compared to the original input for training. In addition to the standard
loss function comparing input and output of the VAE, we further add a regulariza-
tion term to the loss function such that the distributions from the encoder are close
to a standard normal distribution N (0, 1). Using the Kullback-Leibler divergence,
Eq. (2.19), to measure the deviation from the standard normal distribution, the full
University of Zurich 57
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
encoder decoder
In this expression, the first term quantifies the reconstruction loss with xin i the
input to and xout
i the reconstructed data from the VAE. The second term is the
regularization on the latent space for each input data point, which for two (diagonal)
Normal distributions can be expressed in a closed form, see second line of Eq. (5.17).
This procedure regularizes the training through the introduction of noise, similar to
the dropout layer in Sec. 4. However, the regularization here not only generically
increases generalization, but also enforces the desired structure in the latent space.
The structure of a VAE is shown in Fig. 28. By enforcing the mean and variance
structure of the encoder output, the latent space fulfills the requirements outlined
above. This type of structure can then serve as a generative model for many different
data types: anything from human faces to complicated molecular structures. Hence,
the variational autoencoder goes beyond extracting information from a dataset, but
can be used for scientific discovery. Note, finally, that the general structure of the
variational autoencoder can be applied beyond the simple case introduced above.
As an example, a different distribution function can be enforced in the latent space
other than the standard Normal distribution, or a different neural network can be
used as encoder and decoder, such as a RNN.
University of Zurich 58
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
i=1
where m is the number of samples in the data set {xi }. The goal is to choose the
parameters θ such as to maximize the likelihood. While we have used the maximum
likelihood principle explicitly only in the case of the RBM in Sec. 5.2, both the RNN
and the VAE of Secs 5.3 and 5.4 can be understood in this framework .
The methods and their respective models can be distinguished by the way Pθmodel (x)
is defined, trained, and evaluated (see Fig. 29). We differentiate between models
that define Pθ (x) explicitly through some functional form and those, where the
probability distribution is defined implicitly. The explicit models have the general
advantage that maximization of the likelihood is rather straight-forward, since we
have direct access to this function. The disadvantage is that the functional forms are
generically limiting the ability of the model to fit the data distribution or become
computationally intractable.
Among the explicit density models, we thus further distinguish between those
that represent a computationally tractable density and those that do not. An ex-
ample for tractable explicit density models are so-called fully visible belief networks
(FVBNs) that decompose the probability distribution over an n-dimensional vector
x into a product of conditional probabilities
n
PθFVBN (x) = pFVBN (xj |x1 , · · · , xj−1 ). (5.19)
Y
θ
j=1
We can already see that once we use the model to draw new samples, this sampling
is done one entry of the vector x at a time (first x1 is drawn, then, knowing x1 , x2
is drawn etc.). This is computationally costly and not parallelizable but is useful
for data with a sequential structure. An example of a fully visible belief network is
the RNN we discussed in Section 5.3.
Models that encode an explicit density, but require approximations to maximize
the likelihood can either be variational in nature or use stochastic methods. We
have seen examples of both. Variational methods define a lower bound to the log-
likelihood which can be maximized,
University of Zurich 59
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Maximum Likelihood
Generative Models
explicit implicit
probability density probability density
GAN
often rely on a Markov-chain process for their evaluation and training: The model
is defined by a probability q(x0 |x) from which the next sample x0 is drawn, which
depends on the current sample x but not any others. RBMs are an example for this
class of models, where Gibbs sampling realizes the Markov chain. RBMs are guar-
anteed to converge to a random sample drawn from Pθrbm (x), but this convergence
may be slow and as such drawing samples inefficient.
All the above classes of models allow for an explicit representation of a model
probability density function. In contrast, GANs and related models, only allow
indirect access to the probability density: The model only allows us to sample from
it. While this makes optimization potentially harder, the architecture circumvents
many of the other previously mentioned problems. In particular, GANs can generate
samples in parallel with no need for a Markov chain; There are few restrictions on the
form of the generator function (as compared to Boltzmann machines, for instance,
which have a restricted form to make Markov chain sampling work); no variational
bound is needed; and some GAN model families are known to be asymptotically
consistent (meaning that for a large enough model they are approximations to any
probability distribution).
GANs have been immensely successful in several application scenarios, most of
which in the field of image generation, and largely on the ImageNet database. Some
of the standard tasks in this context are generating an image from a sentence or
phrase that describes its content (“a blue flower”), generating realistic images from
sketches, generating abstract maps from satellite photos generating a high-resolution
(“super-resolution”) image from a lower resolution one, or predicting a next frame in
a video. Note that for the actual assessment of the output, the likelihood might not
be relevant, as especially for images, the quality of an output is subjective. As far
as more science-related applications are concerned, GANs have been used to predict
the impact of climate change on individual houses or to generate new molecules that
have been later synthesized.
University of Zurich 60
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Noise Generator
Figure 30: Architecture of a GAN. A generator G(z) creates new samples (usu-
ally with z drawn randomly) and a discriminator D tries to distinguish generated
from real data.
where we have also indicated the two sets of parameters on which the two functions
depend: θD and θG , respectively. The game is then defined by two cost functions:
The discriminator wants to minimize LD (θD ; θG ) by only changing θD , while the
generator LG (θG ; θD ) by only changing θG . So, each player’s cost function depends
on both their and the other player’s parameters, the latter of which cannot be
University of Zurich 61
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
controlled by the player. The solution to this game optimization problem is a (local)
minimum, in other words a point in (θD , θG )-space where LD (θD ; θG ) has a local
minimum with respect to θD and LG (θG ; θD ) has a local minimum with respect to
θG . In game theory such a solution is called a Nash equilibrium. Let us now specify
possible choices for the cost functions as well as for D and G.
The most important requirement of G is that it is differentiable. In contrast
to VAEs, it can thus not have discrete variables on the output layer. A typical
representation is a deep (possibly convolutional) neural network 24 . Then θG are
the networks weights and biases. The input z is drawn from some simple prior
distribution, such as the uniform distribution or a normal distribution. The space,
where z is living in, is called the latent space as in the case of the autoencoder. It is
important that the latent space (z) has at least as high a dimension as the feature
space (x) if the model Pθ (x) should have full support on the feature space.
The training proceeds in steps. At each step, a minibatch of x is drawn from the
data set and a minibatch of z is sampled from the prior distribution. Using this,
gradient descent-type updates are performed: One update of θD using the gradient
of LD (θD ; θG ) and one of θG using the gradient of LG (θG ; θD ).
where the sums over i and j run over the respective minibatches, which in general
contain N1 and N2 points, respectively.
For the generator G, more variations of the cost functions have been explored.
Maybe the most intuitive one is
LG (θG ; θD ) = −LD (θD ; θG ), (5.23)
which corresponds to the so-called zero-sum or minimax game. Its solution is for-
mally given by
?
θG = arg min max [−LD (θD ; θG )] . (5.24)
θG θD
This form of the cost is convenient for theoretical analysis, because there is only a
single target function, which helps drawing parallels to conventional optimization.
However, other cost functions have been proven superior in practice. The reason
is that minimization can get trapped very far from an equilibrium: When the dis-
criminator manages to learn rejecting generator samples with high confidence, the
gradient of the generator will be very small, making its optimization very hard.
Instead, we can use the cross-entropy also for the generator cost function (but
this time from the generator’s perspective)
1 X
LG (θG ; θD ) = − log DθD (GθG (z j )). (5.25)
2N2 j
24
A popular Deep Conventional architecture is called DCGAN
University of Zurich 62
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Now, the generator maximizes the probability of the discriminator being mistaken.
This way, each player still has a strong gradient when the player is loosing the game.
We observe that this version of LG (θD ; θG ) has no direct dependence of the training
data. Of course, such a dependence is implicit via D, which has learned from
the training data. This indirect dependence also acts like a regularizer, preventing
overfitting: G has no possibility to directly ‘fit’ its output to training data.
instead of the order in Eq. (5.24), where V(D,G) is the overall ‘loss function’ 25 .
While the interchange of min and max might seem innocent, the reversed order
corresponds to a different solution: It is now sufficient for G to always produce one
(and the same) output that the discriminator D classified as data with very high
probability. Due to the mode collapse problem, GANs are not good at exploring
ergodically the full space of possible outputs. They rather produce few very good
possible outputs.
One strategy to fight mode collapse is called minibatch features. Instead of letting
D rate one sample at a time, a minibatch of real and generated samples is considered
at once. It then detects whether the generated samples are unusually close to each
other.
One-sided label smoothing
Often, the discriminator D gives proper results, but shows a too confident prob-
ability 26 . This overshooting confidence can be counteracted by one-sided label
smoothing and is a kind of regularization. The idea is to simply replace the target
value for the real examples in the loss function with a value 1 − α < 1 . This results
in a smoother distribution of the discriminator. Why do we only perform this off-
set one-sided and not also give a small nonzero value β to the fake samples target
25
The function is rather a value function, as we have not really defined a single loss function. In
Eq. (5.24), we would have V (D, G) = LD (θD ; θG ).
26
Note that this is not only a problem of the discriminator in GANs, but more general for
neural-network classifiers. The technique is thus useful also other binary classification problems
with neural networks.
University of Zurich 63
Machine Learning for the Sciences 5 UNSUPERVISED LEARNING
Consider now a range of x for which Pdata (x) is small but Pθ (x) is large (a “spurious
mode”). D? (x) will have a peak near this spurious mode. This means D reinforces
incorrect behavior of G and encourages G to reproduce samples that it already
makes (irrespective of whether they are anything like real data).
Using GANs with labelled data
If (at least partially) labeled data is available, using the labels when training the
discriminator D may improve the performance of G. In this constellation, G still
has the same task as before and does not interact with the labels. However, for data
with n classes D is constructed as a classifier for (n + 1) classes, where the extra
class corresponds to ‘fake’ data that D attributes to coming from G. If a data point
has a label, then this label is used as a reference in the cost function. If a datapoint
has no label, then the first n outputs of D are summed up.
Arithmetics with GANs
GANs can do linear arithmetics with inputs to add or remove abstract features from
the output. In other words, the latent space, in which z is a point, has structure that
can be exploited. This has been demonstrated using a DCGAN trained on images of
faces: The gender and the feature ‘wearing glasses’ can be added or subtracted and
thus changed at will. Of course, such a result is only empirical, there is no formal
mathematical theory why it works.
University of Zurich 64
Machine Learning for the Sciences 6 INTERPRETABILITY
University of Zurich 65
Machine Learning for the Sciences 6 INTERPRETABILITY
Figure 31: The Copernicus problem. Relation between angles to describe the
position of Mars in the heliocentric and geocentric coordinate system within the
reference frame of the fixed stars.
In the previous sections, We minimized the distance between the output of the
network F θ (x) and a chosen target output y, which we did by minimizing a loss
function, such as
L = |F θ (x) − y|2 . (6.2)
However, unlike in supervised learning, where we minimized the loss minimization
with respect to the parameters θ of the network, we are here interested in minimizing
with respect to the input x while keeping the parameters θ fixed. For this purpose,
we again use gradient descent,
∂L
x 7→ x − η , (6.3)
∂x
where η is the learning rate. With a sufficient number of iterations, an initial input
x0 will be transformed into a final input x∗ , such that
F θ (x∗ ) ≈ y. (6.4)
University of Zurich 66
Machine Learning for the Sciences 6 INTERPRETABILITY
Figure 32: Plant leaves. Top: Some samples from the plants dataset. Bottom:
Samples generated by using the "dreaming" procedure starting from random noise.
University of Zurich 67
Machine Learning for the Sciences 6 INTERPRETABILITY
tually be in for a surprise. Even though the images may ‘resemble’ each other to
our naked eye, the different cameras can have a different noise profile which might
not be perceptible to the human eye. We will see in the next section that even such
minute image distortions can already be sufficient to completely confuse a model.
University of Zurich 68
Machine Learning for the Sciences 6 INTERPRETABILITY
+ 0.01 ⇥
<latexit sha1_base64="Uq9hQrleaI6OL5WW9wlQj7F7mts=">AAACFXicbVDLSgMxFM3UVx1fY126CRZBEMpMFXRZdOOygn1AZyiZNNOGZjJDckcspd/hzq3+hDtx69p/8CNM21nY1gMhJ+fey7k5YSq4Btf9tgpr6xubW8Vte2d3b//AOSw1dZIpyho0EYlqh0QzwSVrAAfB2qliJA4Fa4XD22m99ciU5ol8gFHKgpj0JY84JWCkrlM6xz7GbsX1zO0Dj5nuOmXzngGvEi8nZZSj3nV+/F5Cs5hJoIJo3fHcFIIxUcCpYBPbzzRLCR2SPusYKokxCcaz3Sf41Cg9HCXKHAl4pv6dGJNY61Ecms6YwEAv16bif7VOBtF1MOYyzYBJOjeKMoEhwdMgcI8rRkGMDCFUcbMrpgOiCAUT14ILKCJ1SpT538S2TTzechirpFmteBeV6v1luXaTB1VEx+gEnSEPXaEaukN11EAUPaEX9IrerGfr3fqwPuetBSufOUILsL5+AUktnPI=</latexit>
= <latexit sha1_base64="1LPVExGShI5/e/wN46yy9j0Q6Ro=">AAACAnicbVBNS8NAEN34WeNX1aOXYBE8laQKehGKXjy2YD+gDWWznbRLN5uwOxFK6M2bV/0T3sSrf8T/4I9w2+ZgWx8MPN6bYWZekAiu0XW/rbX1jc2t7cKOvbu3f3BYPDpu6jhVDBosFrFqB1SD4BIayFFAO1FAo0BAKxjdT/3WEyjNY/mI4wT8iA4kDzmjaKT6ba9YcsvuDM4q8XJSIjlqveJPtx+zNAKJTFCtO56boJ9RhZwJmNjdVENC2YgOoGOopBFoP5sdOnHOjdJ3wliZkujM1L8TGY20HkeB6YwoDvWyNxX/8zophjd+xmWSIkg2XxSmwsHYmX7t9LkChmJsCGWKm1sdNqSKMjTZLGxBRaVOqDL/TWzbxOMth7FKmpWyd1mu1K9K1bs8qAI5JWfkgnjkmlTJA6mRBmEEyAt5JW/Ws/VufVif89Y1K585IQuwvn4B67yXlA==</latexit>
“Healthy” “Unhealthy”
87.4% confidence 87.8% confidence
+ 0.01 ⇥
<latexit sha1_base64="Uq9hQrleaI6OL5WW9wlQj7F7mts=">AAACFXicbVDLSgMxFM3UVx1fY126CRZBEMpMFXRZdOOygn1AZyiZNNOGZjJDckcspd/hzq3+hDtx69p/8CNM21nY1gMhJ+fey7k5YSq4Btf9tgpr6xubW8Vte2d3b//AOSw1dZIpyho0EYlqh0QzwSVrAAfB2qliJA4Fa4XD22m99ciU5ol8gFHKgpj0JY84JWCkrlM6xz7GbsX1zO0Dj5nuOmXzngGvEi8nZZSj3nV+/F5Cs5hJoIJo3fHcFIIxUcCpYBPbzzRLCR2SPusYKokxCcaz3Sf41Cg9HCXKHAl4pv6dGJNY61Ecms6YwEAv16bif7VOBtF1MOYyzYBJOjeKMoEhwdMgcI8rRkGMDCFUcbMrpgOiCAUT14ILKCJ1SpT538S2TTzechirpFmteBeV6v1luXaTB1VEx+gEnSEPXaEaukN11EAUPaEX9IrerGfr3fqwPuetBSufOUILsL5+AUktnPI=</latexit>
=
<latexit sha1_base64="1LPVExGShI5/e/wN46yy9j0Q6Ro=">AAACAnicbVBNS8NAEN34WeNX1aOXYBE8laQKehGKXjy2YD+gDWWznbRLN5uwOxFK6M2bV/0T3sSrf8T/4I9w2+ZgWx8MPN6bYWZekAiu0XW/rbX1jc2t7cKOvbu3f3BYPDpu6jhVDBosFrFqB1SD4BIayFFAO1FAo0BAKxjdT/3WEyjNY/mI4wT8iA4kDzmjaKT6ba9YcsvuDM4q8XJSIjlqveJPtx+zNAKJTFCtO56boJ9RhZwJmNjdVENC2YgOoGOopBFoP5sdOnHOjdJ3wliZkujM1L8TGY20HkeB6YwoDvWyNxX/8zophjd+xmWSIkg2XxSmwsHYmX7t9LkChmJsCGWKm1sdNqSKMjTZLGxBRaVOqDL/TWzbxOMth7FKmpWyd1mu1K9K1bs8qAI5JWfkgnjkmlTJA6mRBmEEyAt5JW/Ws/VufVif89Y1K585IQuwvn4B67yXlA==</latexit>
“Unhealthy” “Healthy”
94.4% confidence 99.6% confidence
Figure 33: Adversarial examples. Generated using the fast gradient sign method
with T = 1 iteration and = 0.01. The target model is Google’s InceptionV3 deep
convolutional network with a test accuracy of ∼ 95% on the binary ("Healthy" vs
"Unhealthy") plants dataset.
target F θ without actually having access to it. Although it might seem surprising,
this strategy has been found to work albeit with a lower success rate as compared
to white box methods. We illustrate these concepts in the example below.
6.3.1 Example
We use the same plant-leaves-classification example as above. The target model F θ
which we want to ’attack’ is a pretrained model using Google’s well known Incep-
tionV3 deep convolutional neural network containing over 20 million parameters29 .
The model achieved a test accuracy of ∼ 95%. Assuming we have access to the
gradients of the model F θ , we can consider a white box attack. Starting from an
image in the dataset which the target model correctly classifies and applying the
fast gradient sign method (Alg. 5) with = 0.01 and T = 1, we obtain an adversar-
ial image which differs from the original image by almost imperceptible amount of
noise as depicted on the left of Fig. 33. Any human would still correctly identify the
image but yet the network, which has around 95% accuracy has completely failed.
If, however, the gradients and outputs of the target model F θ are hidden, the
above white box attack strategy becomes unfeasible. In this case, we can adopt the
following ‘black box attack’ strategy. We train a secondary model Gθ0 , and then
29
This is an example of transfer learning. The base model, InceptionV3, has been trained on
a different classification dataset, ImageNet, with over 1000 classes. To apply this network to our
binary classification problem, we simply replace the top layer with a simple duo-output dense
softmax layer. We keep the weights of the base model fixed and only train the top layer.
University of Zurich 69
Machine Learning for the Sciences 6 INTERPRETABILITY
applying the FGSM algorithm to Gθ0 to generate adversarial examples for Gθ0 . Note
that it is not necessary for Gθ0 to have the same network architecture as the target
model F θ . In fact, it is possible that we do not even know the architecture of our
target model.
Let us consider another pretrained network based on MobileNet containing about
2 million parameters. After retraining the top classification layer of this model to a
test accuracy of ∼ 95%, we apply the FGSM algorithm to generate some adversarial
examples. If we now test these examples on our target model F θ , we notice a
significant drop in the accuracy as shown on the graph on the right of Fig. 34. The
fact that the drop in accuracy is greater for the black box generated adversarial
images as compared to images with random noise (of the same scale) added to it,
shows that adversarial images have some degree of transferability between models.
As a side note, on the left of Fig. 34 we observe that black box attacks are more
effective when only T = 1 iteration of the FGSM algorithm is used, contrary to the
situation for the white box attack. This is because, with more iterations, the method
has a tendency towards overfitting the secondary model, resulting in adversarial
images which are less transferable.
These forms of attacks highlight not only a problem for the scientific application
of neural networks, but a serious vulnerability of data-driven machine-learning tech-
niques in general. Defending against such attack is thus an active area of research,
buts is largely a cat and mouse game between the attacker and defender.
University of Zurich 70
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
7 Reinforcement Learning
In the previous sections, we have introduced data-based learning, where we are given
a dataset {xi } for training. Depending on whether we are given labels yi with each
data point, we have further divided our learning task as either being supervised or
unsupervised, respectively. The aim of machine learning is then to classify unseen
data (supervised), or extract useful information from the data and generate new
data resembling the data in the given dataset (unsupervised). However, the concept
of learning as commonly understood certainly encompasses other forms of learning
that are not falling into these data-driven categories.
An example for a form of learning not obviously covered by supervised or unsu-
pervised learning is learning how to walk: in particular, a child that learns how to
walk does not first collect data on all possible ways of successfully walking to extract
rules on how to walk best. Rather, the child performs an action, sees what happens,
and then adjusts their actions accordingly. This kind of learning thus happens best
‘on-the-fly’, in other words while performing the attempted task. Reinforcement
learning formalizes this different kind of learning and introduces suitable (compu-
tational) methods.
As we will explain in the following, the framework of reinforcement learning
considers an agent that interacts with an environment through actions, which, on
the one hand, changes the state of the agent and on the other hand, leads to a
reward. Whereas we tried to minimize a loss function in the previous sections,
the main goal of reinforcement learning is to maximize this reward by learning an
appropriate policy. One way of reformulating this task is to find a value function,
which associates to each state (or state-action pair) a value, or expected total reward.
Note that, importantly, to perform our learning task we do not require knowledge, a
model, of the environment. All that is needed is feedback to our actions in the form
of a reward signal and a new state. We stress again that we study in the following
methods that learn at each time step 30 .
The framework of reinforcement learning is very powerful and versatile. Exam-
ples include:
• We can train a robot to perform a task, such as using an arm to collect samples.
The state of the agent is the position of the robot arm, the actions move the
arm, and the agent receives a reward for each sample collected.
University of Zurich 71
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
• We can train an agent to play a game, with the state being the current state of
the game and a reward is received once for winning. The most famous example
for such an agent is Google’s AlphaGo, which outperforms humans in the
game of Go. A possible way of applying reinforcement learning in the sciences
is to phrase a problem as a game. An example, where such rephrasing was
successfully applied, is error correction for (topological) quantum computers.
• In the following, we will use a toy example to illustrate the concepts introduced:
We want to train an agent to help us with the plants in our lab: in particular,
the state of the agent is the water level. The agent can turn on and off a
growth lamp and it can send us a message if we need to show up to water the
plants. Obviously, we would like to optimize the growth of the plants and not
have them die.
As a full discussion of reinforcement learning goes well beyond the scope of this
lecture, we will focus in the following on the main ideas and terminology with no
claim of completeness.
University of Zurich 72
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
state action
agent
reward
environment
Obviously, the value of an action now depends on the state the agent is in, such that
we write for the optimal total reward q∗ (s, a). Alternatively, we can also assign to
each state a value v∗ (s), which quantizes the optimal reward from this state.
Finally, we can define what we want to accomplish by learning: knowing our
current state s, we want to know what action to choose such that our future total
University of Zurich 73
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
on on
text
high low
off off
Figure 36: Transition graph of the MDP for the plant-watering agent. The
states ‘high’ and ‘low’ are denoted with large circles, the actions with small black
circles, and the arrows correspond to the probabilities and rewards.
As such a sum is not guaranteed to converge for a continuous task, the total reward
is the discounted return
∞
Gt = Rt+1 + γRt+2 + γ Rt+3 + · · · =
2
γ k Rt+k+1 , (7.3)
X
k=0
with 0 ≤ γ < 1 the discount rate. Equation (7.3) is more general and can be used
for an episodic task by setting γ = 1 and Rt = 0 for t > T . Note that for rewards
which are bound, this sum is guaranteed to converge to a finite value.
This is the expectation value for the return starting in state s and choosing action
a, but using the policy π for all future actions. Note that one of the key ideas
University of Zurich 74
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
This equation, known as the Bellman equation, relates the value of state s to the
expected reward and the (discounted) value of the next state after having chosen an
action under the policy π(a|s).
As an example, we can write the Bellman equation for the strategy of always
leaving the lamps on in our toy model. Then, we find the system of linear equations
von (h) = p(h, ron |h, on)[ron + γvon (h)] + p(l, ron |h, on)[ron + γvon (l)]
= ron + γ[αvon (h) + (1 − α)von (l)], (7.8)
von (l) = β[ron + γvon (l)] + (1 − β)[rfail + γvon (h)], (7.9)
from which we can solve easily for von (h) and von (l).
Instead of calculating the value function for all possible policies, we can directly
try and find the optimal policy π∗ , for which vπ∗ (s) > vπ0 (s) for all policies π 0 and
s ∈ S. For this policy, we find the Bellman optimality equations
v∗ (s) = max
a
q∗ (s, a)
= max
a
E(Rt+1 + γv∗ (St+1 )|St = s, At = a) (7.10)
= max p(s0 , r|s, a)[r + γv∗ (s0 )]. (7.11)
X
a
s0 ,r
Importantly, the Bellman optimality equations do not depend on the actual policy
anymore. As such, Eq. (7.11) defines a non-linear system of equations, which for
a sufficiently simple MDP can be solved explicitly. For our toy example, the two
equations for the value functions are
and
β[ron + γv∗ (l)] + (1 − β)[rfail + γv∗ (h)]
v∗ (l) = max β 0 [roff + γv∗ (l)] + (1 − β 0 )[rfail + γv∗ (h)] . (7.13)
rtext + γv∗ (h)
Note that equivalent equations to Eqs. (7.10) and (7.11) hold for the state-action
value function
Once we know v∗ , the optimal policy π∗ (a|s) is the greedy policy that chooses
the action a that maximizes the right-hand side of Eq. (7.11). If, instead, we know
University of Zurich 75
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
q∗ (s, a), then we can directly choose the action which maximizes q∗ (s, a), namely
π∗ (a|s) = argmaxa0 q∗ (s, a0 ), without looking one step ahead.
While Eqs (7.11) or (7.15) can be solved explicitly for a sufficiently simple system,
such an approach, which corresponds to an exhaustive search, is often not feasible.
In the following, we distinguish two levels of complexity: First, if the explicit solution
is too hard, but we can still keep track of all possible value functions—we can choose
either the state or the state-action value function—we can use a tabular approach.
A main difficulty in this case is the evaluation of a policy, or prediction, which
is needed to improve on the policy. While various methods for policy evaluation
and policy improvement exist, we will discuss in the following an approach called
temporal-difference learning. Second, in many cases the space of possible states is
much too large to allow for a complete knowledge of all value functions. In this case,
we additionally need to approximate the value functions. For this approximation,
we can use the methods encountered in the previous chapters, such as (deep) neural
networks.
s0 ,r
This second step is called policy improvement. The full policy iteration then proceeds
iteratively
π0 → vπ0 → π1 → vπ1 → π2 → · · · (7.17)
until convergence to v∗ and hence π∗ . Note that, indeed, the Bellman optimality
equation (7.11) is the fixed-point equation for this procedure.
Policy iteration requires a full evaluation of the value function of πk for every
iteration k, which is usually a costly calculation. Instead of fully evaluating the value
function under a fixed policy, we can also directly try and calculate the optimal value
function by iteratively solving the Bellman optimality equation,
v [k+1] (s) = max p(s0 , r|s, a)[r + γv [k] (s0 )]. (7.18)
X
a
s0 ,r
Note that once we have converged to the optimal value function, the optimal pol-
icy is given by the greedy policy corresponding to the right-hand side of Eq. (7.16)
An alternative way of interpreting this iterative procedure is to perform policy im-
provement every time we update the value function, instead of finishing the policy
evaluation each time before policy improvement. This procedure is called value iter-
ation and is an example of a generalized policy iteration, the idea of allowing policy
evaluation and policy improvement to interact while learning.
University of Zurich 76
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
In the following, we want to use such a generalized policy iteration scheme for the
(common) case, where we do not have a model for our environment. In this model-
free case, we have to perform the (generalized) policy improvement using only our
interactions with the environment. It is instructive to first think about how to
evaluate a policy. We have seen in Eqs. (7.4) and (7.6) that the value function can
also be written as an expectation value,
We can thus either try to directly sample the expectation value of the first line—this
can be done using Monte Carlo sampling over possible state-action sequences—or
we try to use the second line to iteratively solve for the value function. In both cases,
we start from state St and choose an action At according to the policy we want to
evaluate. The agent’s interaction with the environment results in the reward Rt+1
and the new state St+1 . Using the second line, Eq. (7.20), goes under the name
temporal-difference learning and is in many cases the most efficient method. In
particular, we make the following updates
vπ[k+1] (St ) = vπ[k] (St ) + α[Rt+1 + γvπ[k] (St+1 ) − vπ[k] (St )]. (7.21)
The expression in the brackets is the difference between our new estimate and the
old estimate of the value function and α < 1 is a learning rate. As we look one
step ahead for our new estimate, the method is called one-step temporal difference
method.
We now want to use generalized policy iteration to find the optimal value. We
already encountered a major difficulty when improving a policy using a value func-
tion based on experience in Sec. 7.1: it is difficult to maintain enough exploration
over possible action-state pairs and not end up exploiting the current knowledge.
However, this sampling is crucial for both Monte Carlo methods and the temporal-
difference learning we discuss here. In the following, we will discuss two different
methods of performing the updates, both working on the state-action value function,
instead of the value function. Both have in common that we look one step ahead
to update the state-action value function. A general update should then be of the
form
q [k+1] (St , a) = q [k] (St , a) + α[Rt+1 + γq [k] (St+1 , a0 ) − q [k] (St , a)] (7.22)
and the question is then what action a we should take for the state-action pair and
what action a0 should be taken in the new state St+1 .
Starting from a state S0 , we first choose an action A0 according to a policy
derived from the current estimate of the state-action value function 31 , such as an
-greedy policy. For the first approach, we perform updates as
q [k+1] (St , At ) = q [k] (St , At ) + α[Rt+1 + γq [k] (St+1 , At+1 ) − q [k] (St , At )]. (7.23)
As above, we are provided a reward Rt+1 and a new state St+1 through our interac-
tion with the environment. To choose the action At+1 , we again use a policy derived
31
We assume here an episodic task. At the very beginning of training, we may initialize the
state-action value function randomly.
University of Zurich 77
Machine Learning for the Sciences 7 REINFORCEMENT LEARNING
from q [k] (s = St+1 , a). Since we are using the policy for choosing the action in the
next state St+1 , this approach is called on-policy. Further, since in this particular
case we use the quintuple St , At , Rt+1 , St+1 , At+1 , this algorithm is referred to as
Sarsa. Finally, note that for the next step, we use St+1 , At+1 as the state-action pair
for which q [k] (s, a) is updated.
Alternatively, we only keep the state St from the last step and first choose the
action At for the update using the current policy. Then, we choose our action from
state St+1 in greedy fashion, which effectively uses q [k] (s = St , a) as an approximation
for q∗ (s = St , a). This leads to
q [k+1] (St , At ) = q [k] (St , At ) + α[Rt+1 + γ max q [k] (St+1 , a) − q [k] (St , At )]. (7.24)
a
θ[k+1] = θ[k] + α[R + γv̂π (S 0 ; θ[k] ) − v̂π (S; θ[k] )]∇v̂π (S; θ[k] ) (7.25)
with 0 < α < 1 again the learning rate. Note that, even though the new estimate
also depends on θ[k] , we only take the derivative with respect to the old estimate.
This method is thus referred to as semi-gradient method. In a similar fashion, we
can reformulate the Sarsa algorithm introduced for generalized gradient iteration.
University of Zurich 78
Machine Learning for the Sciences 8 CONCLUDING REMARKS
8 Concluding remarks
In this lecture, ‘Machine Learning for the Sciences’, we have discussed common
structures and algorithms of machine learning to analyze data or learn policies to
achieve a given goal. Even though machine learning is often associated with neu-
ral networks, we have first introduced methods commonly known from statistical
analysis, such as linear regression. Neural networks, which we used for most of this
lecture, are much less controlled than these conventional methods. As an example,
we do not try to find an absolute minimum in the optimization procedure, but one
of many almost degenerate minima. This uncertainty might feel like a loss of control
to a scientist, but it is crucial for the successful generalization of the trained network
The goal of our discussions was not to provide the details needed for an actual
implementation—as all standard algorithms are provided by standard libraries such
as TensorFlow or PyTorch, this is indeed not necessary—but to give an overview
over the most important terminology and the common algorithms. We hope that
such an overview is helpful for reading the literature and deciding, whether a given
method is suitable for your own problems.
To help with the use of machine learning in your own research, here a few lessons
for a successful machine learner:
4. Don’t be afraid of lingo. Not everything that sounds fancy actually is.
Regarding the use of machine learning in a scientific setting, several points should
be kept in mind. First, unlike in many other applications, scientific data often
exhibits specific structure, correlations, or biases, which are known beforehand. It is
thus important to use our prior knowledge in the construction of the neural network
and the loss function. There are also many situations, where the output of the
network has to satisfy conditions, such as symmetries, to be meaningful in a given
context. This should ideally be included in the definition of the network. Finally,
scientific analysis needs to be well defined and reproducible. Machine learning,
with its intrinsic stochastic nature, does not easily satisfy these conditions. It is
thus crucial to document carefully the architecture and all hyperparameters of a
network and training. The results should be compared to conventional statistical
methods, their robustness to variations in the structure of the network and the
hyperparameters should be checked.
University of Zurich 79
Machine Learning for the Sciences 8 CONCLUDING REMARKS
Machine Learning
a task-oriented overview
g
learnin
d
vise
er RNN
up (sequential)
CNN
logistic
regression
s
(spatial)
orks SVM
dense n etw
(brute force) ra l
eu
n
classifica
ridge tio Bellman Eq.
reinf
regression n (brute force)
n
sio
orcement learnin
es
regr
optim
linear Sarsa
regression (on-policy)
MDP
al polic
Q-learning
stru
PCA (off-policy)
y
ctu
re
function
on
g e n e r a ti
g
t-SNE approximation
neural networks
un (V)AE GAN
su RBM RNN
pe
rv i
sed
le a rn
ing
©MHF
Figure 37: Machine Learning overview. Methods covered in this lecture from
the point of view of tasks.
University of Zurich 80