Hennig 2021 Probabilistic Machine Learning
PHILIPP HENNIG
SCRIBES:
FREDERIK KÜNSTNER (2018/19)
ANN-KATHRIN SCHALKAMP (2019)
TIM REBIG (2020)
ZAFIR STOJANOVSKI (2021)
PROBABILISTIC
MACHINE
LEARNING
LECTURE NOTES
Questions about this document (typos, structure, etc.) should be directed to
Zafir Stojanovski at zafir.stojanovski@student.uni-tuebingen.de
Probabilistic Reasoning
Gaussian Processes
Understanding Kernels
Gauss-Markov Models
Bibliography
Reasoning under Uncertainty

“Probability theory is nothing but common sense reduced to calculation.” (Pierre-Simon de Laplace)

An inference problem requires statements about the value of an unobserved (latent) variable x based on observations y which are related to x, but may not be sufficient to fully determine x. This requires a notion of uncertainty.
We hope in this chapter to give an intuition on reasoning under
incomplete information as well as present a rigorous mathematical
construction of the foundation of modern probability theory.
Examples
A Card Trick
Three cards with colored faces are placed into a bag. One card is red on both faces, one is white on both faces and the last is red on one face and white on the other (see Fig. 1). After mixing the bag, we pick a card and see that one face is white. What is the color of its other face?
While we do not have direct information about the back of the card, we can use what we know about the setup to make an educated guess about the probability that it is also white. Make a guess; is it 1/2? 2/3? Something else? We will revisit the problem at a later stage.
Figure 1: A card trick
A helpful mental image to think about probabilities is to imagine truth as a finite amount of “mass” which can be spread over a space of mutually exclusive “elementary” events. More events can then be constructed from unions and intersections of sets of elementary events. An example of this construction is roulette (see Fig. 2), where the numbers 0 to 36 constitute the set of elementary events and Red/Black, Odd/Even, Low/High and more elaborate combinations are constructed from these elementary events.
Figure 2: Events for a Roulette (layout of the numbers 1 to 36 with the fields MANQUE, PASSE, IMPAIR, PAIR and the dozens 12P, 12M, 12D)

Formalization
The goal of this section is to establish a formal framework for probabilistic reasoning. For this purpose, we will introduce Kolmogorov’s probability theory. Published in 1933, the approach by the Soviet mathematician Andrey Kolmogorov [1] still lays the foundation of modern probability theory and is based on an axiomatic system.
[1] Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. 1933
Kolmogorov’s Axioms are a pure mathematical construction. We first present a simplified form of the axioms;
2. Normalization: p(Ω) = 1.
p ( A ∪ B ) = p ( A ) + p ( B ).
1. E ∈ F
2. ( A, B ∈ F ) ⇒ ( A − B ∈ F )
3. $(A_1, A_2, \dots \in \mathcal{F}) \Rightarrow \left(\bigcup_{i=1}^{\infty} A_i \in \mathcal{F} \,\wedge\, \bigcap_{i=1}^{\infty} A_i \in \mathcal{F}\right)$
1. P(∅) = 0
P( A) = P( A, B) + P( A, ¬ B).
P( A, B) = P( B | A) P( A) = P( A | B) P( B).
Rule:
Bayes’ Theorem
Finally, we can state the mechanism that is at the heart of all proba-
bilistic reasoning.
$$P(A_i \mid X) = \frac{P(A_i)\, P(X \mid A_i)}{\sum_{j=1}^{n} P(A_j)\, P(X \mid A_j)}.$$
where B is the space of possible values for B. Note that despite the
name, the prior is not necessarily what you know before seeing the
data, but the marginal distribution P( X ) = ∑d∈D P( X, d) under all
possible data.
$$p(C = WW \mid W) = \frac{p(W \mid C = WW)\, p(C = WW)}{p(W)}.$$
Filling in numbers,
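As a hedged illustration of this step, the card trick can also be checked by brute-force enumeration. The sketch below assumes that every card and every face is equally likely to be the one we see; the variable names are illustrative and not part of the lecture.

```python
from fractions import Fraction

# Each card is a pair of face colors (R = red, W = white).
cards = {"RR": ("R", "R"), "WW": ("W", "W"), "RW": ("R", "W")}

# Enumerate all (card, visible face, hidden face) outcomes; each has weight 1/6.
outcomes = []
for name, (face_1, face_2) in cards.items():
    outcomes.append((name, face_1, face_2))
    outcomes.append((name, face_2, face_1))
weight = Fraction(1, len(outcomes))

# p(W): probability that the visible face is white.
p_white_seen = sum(weight for _, seen, _ in outcomes if seen == "W")
# p(W, C = WW): visible face is white and the card is the white/white one.
p_white_and_ww = sum(weight for name, seen, _ in outcomes
                     if seen == "W" and name == "WW")
# Bayes' theorem: p(C = WW | W) = p(W, C = WW) / p(W).
p_ww_given_white = p_white_and_ww / p_white_seen

print(p_white_seen, p_ww_given_white)  # 1/2 2/3
```

The enumeration agrees with Bayes' theorem using p(W | C = WW) = 1, p(C = WW) = 1/3 and p(W) = 1/2.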
A⇒B is equivalent to p( B| A) = 1
¬B ⇒ ¬ A is equivalent to p(¬ A|¬ B) = 1
p( B|¬ A) ≤ p( B) A is false implies B becomes less plausible
p( A| B) ≥ p( A) B is true implies A becomes more plausible.
p( A = 0, B = 0) = p( A = 0) p( B = 0),
p( A = 0, B = 1) = p( A = 0) p( B = 1),
p( A = 1, B = 0) = p( A = 1) p( B = 0),
p( A = 1, B = 1) = p( A = 1) p( B = 1).
$$P(v_1 \mid t) = \frac{P(t \mid v_1) \cdot P(v_1)}{P(t)} = \frac{P(t \mid v_1) \cdot P(v_1)}{P(t \mid v_1) \sum_i P(v_i)} = \frac{P(v_1)}{\sum_i P(v_i)}$$
You might also convince yourself before opening the door that one of the possible hypotheses is more likely than the others, for instance a specific visitor l whom you would very much like to see. Doing so, you can unknowingly adjust the prior in such a way that this event dominates the posterior. This is a concern of critics of probabilistic reasoning who often argue that it is possible to obtain any desired explanation for the data as long as the hypothesis in question has non-zero probability. A similar human flaw would be to not accept that one of the possible hypotheses is actually impossible (the person l might be dead) and needs to be assigned a prior probability of 0.
In the poem the person is greeted by darkness there and nothing more after opening the door. Realizing that there is no visitor creates an inconsistency in our theory, as all former hypotheses v1, ..., vn now have probability 0, which contradicts the observed tapping. The hypothesis space has to contain some explanation for the observation t. The appropriate construction of the hypothesis space can be one of the main challenges in practice.
Reasoning that the wind w might have caused the tapping at this
point and adding this new hypothesis reveals a frequent problem of
probabilistic inference. We have to know the correct variables before
we start reasoning and include them. Otherwise, no matter what
prior distribution we choose, the results will be flawed.
Condensed content
P( A) = P( A, B) + P( A, ¬ B)
P( A, B) = P( A | B) · P( B) = P( B | A) · P( A)
• Bayes’ Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B, A) + P(B, \neg A)}$$
$$\begin{aligned}
p(A, B, \dots, Z) &= p_1 \\
p(\neg A, B, \dots, Z) &= p_2 \\
&\;\;\vdots \\
p(\neg A, \neg B, \dots, Z) &= p_{67\,108\,863} \\
p(\neg A, \neg B, \dots, \neg Z) &= 1 - \sum_{i=1}^{2^n - 1} p_i
\end{aligned}$$
fully represented,
p(A, R, E, B) = p(A | R, E, B) p(R | E, B) p(E | B) p(B).
• Similarly, we can assume that the radio broadcast does not de-
pend on your house being robbed, such that p( R| E, B) = p( R| E).
Note that this last point does not imply that the alarm system is independent of the radio broadcast, p(A | R) ≠ p(A). If an earthquake increases the probability of false alarms and the probability of a radio broadcast, knowing that there was a radio broadcast increases the probability that the alarm will go off. Those simplifications lead to a system with 8 = 4 + 2 + 1 + 1 parameters,
p( A, R, E, B) = p( A| E, B) p( R| E) p( E) p( B).
p(E) = 10^{-3}, p(B) = 10^{-3}.
p(R = 1 | E = 1) = 1, p(R = 1 | E = 0) = 0.
For the alarm, we will assume that it can send false alarms at a rate f = 1/1,000, that a burglar has an α_B = 99/100 chance of triggering it, while an earthquake only has an α_E = 1/100 chance of triggering it.
This yields the following table of probabilities,
p( A = 0| B = 0, E = 0) = (1 − f ) = 0.999,
p( A = 0| B = 0, E = 1) = (1 − f )(1 − α E ) = 0.98901,
p( A = 0| B = 1, E = 0) = (1 − f )(1 − α B ) = 0.00999,
p( A = 0| B = 1, E = 1) = (1 − f )(1 − α B )(1 − α E ) = 0.0098901,
p( A = 1| B = 0, E = 0) = f = 0.001,
p( A = 1| B = 0, E = 1) = 1 − (1 − f )(1 − α E ) = 0.01099,
p( A = 1| B = 1, E = 0) = 1 − (1 − f )(1 − α B ) = 0.99001,
p( A = 1| B = 1, E = 1) = 1 − (1 − f )(1 − α B )(1 − α E ) = 0.9901099.
$$p(B = 1 \mid A = 1) = \frac{p(A = 1, B = 1)}{p(A = 1)} = \frac{\sum_{R,E} p(A = 1, B = 1, R, E)}{\sum_{B,R,E} p(A = 1, B, R, E)} = 0.495.$$
The somewhat lengthy calculation can be seen here. If we also know that the radio broadcasts an announcement about an earthquake,
$$p(B = 1 \mid A = 1, R = 1) = \frac{p(A = 1, B = 1, R = 1)}{p(A = 1, R = 1)} = \frac{\sum_{E} p(A = 1, B = 1, R = 1, E)}{\sum_{B,E} p(A = 1, B, R = 1, E)} = 0.08.$$
The phenomenon of reducing the probability of an event by adding
more observation is often referred to as explaining away. In this
example the information about the radio announcement explains
away the break-in as a reason for the alarm.
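The two posteriors above can be reproduced by summing the factorized joint over all unobserved variables. The following is a minimal sketch under the stated model (priors of 10^{-3}, R deterministically equal to E, and the alarm table built from f, α_B and α_E); the function names are illustrative.

```python
from itertools import product

f, alpha_b, alpha_e = 1 / 1000, 99 / 100, 1 / 100  # false-alarm and trigger rates
p_b, p_e = 1e-3, 1e-3                               # priors on burglary and earthquake


def p_alarm(a, b, e):
    """p(A=a | B=b, E=e): the alarm stays silent only if no cause triggers it."""
    p_silent = (1 - f) * (1 - alpha_b) ** b * (1 - alpha_e) ** e
    return 1 - p_silent if a == 1 else p_silent


def joint(a, r, e, b):
    """p(A, R, E, B) = p(A | E, B) p(R | E) p(E) p(B), with R = E deterministically."""
    p_r = 1.0 if r == e else 0.0
    return p_alarm(a, b, e) * p_r * (p_e if e else 1 - p_e) * (p_b if b else 1 - p_b)


def posterior_burglary(**evidence):
    """p(B=1 | evidence), summing the joint over every unobserved variable."""
    numerator = denominator = 0.0
    for a, r, e, b in product([0, 1], repeat=4):
        state = {"a": a, "r": r, "e": e, "b": b}
        if any(state[key] != value for key, value in evidence.items()):
            continue  # state contradicts the evidence
        denominator += joint(a, r, e, b)
        if b == 1:
            numerator += joint(a, r, e, b)
    return numerator / denominator


p_alarm_only = posterior_burglary(a=1)            # p(B=1 | A=1)
p_alarm_and_radio = posterior_burglary(a=1, r=1)  # p(B=1 | A=1, R=1)
print(p_alarm_only, p_alarm_and_radio)
```

The second value is much smaller than the first, which is the explaining-away effect: once the earthquake is known, it accounts for the alarm.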
p( A, E, B, R) = p( A| E, B, R) p( R| E, B) p( E| B) p( B),
leads to one such DAG, but this other factorization leads to an-
other graphical representation where the direction of each edge is
reversed,
p( A, E, B, R) = p( B| A, E, R) p( E| A, R) p( R| A) p( A).
p( A, E, B, R) = p( A| E, B) p( R| E) p( E) p( B).
p( A, B) = p( A) p( B).
p( A, B | C ) = p( A|C ) p( B|C ).
The DAG for the two coins example is not unique. For exam-
ple, computing the probabilities
p(A = 1) = 1/2, p(B = 1) = 1/2,
p(C = 1| A = 1, B = 1) = 1, p(C = 1| A = 0, B = 1) = 0,
p(C = 1| A = 1, B = 0) = 0, p(C = 1| A = 0, B = 0) = 1,
p ( A | B ) = p ( A ), p ( B | C ) = p ( B ), p ( C | A ) = p ( C ), p ( C | B ) = p ( C ),
p( A, B, C ) = p(C | A, B) p( A) p( B),
p( A, B, C ) = p( A| B, C ) p( B) p(C ),
p( A, B, C ) = p( B| A, C ) p( A) p(C ),
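As a hedged check of the independence statements above, the sketch below enumerates the joint of the two-coins example, assuming (as the conditional table suggests) that C = 1 exactly when A and B agree. It tests every pair of variables for independence and then tests whether the full joint factorizes.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
# Joint over (A, B, C): A and B are fair coins, C = 1 exactly when A == B.
joint = {(a, b, c): half * half * (1 if c == int(a == b) else 0)
         for a, b, c in product([0, 1], repeat=3)}


def marginal(keep):
    """Sum the joint down to the variable positions listed in `keep`."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0) + p
    return out


def independent(i, j):
    """Check p(X_i, X_j) = p(X_i) p(X_j) for all value pairs."""
    p_ij, p_i, p_j = marginal([i, j]), marginal([i]), marginal([j])
    return all(p_ij[(u, v)] == p_i[(u,)] * p_j[(v,)]
               for u, v in product([0, 1], repeat=2))


pairwise = [independent(0, 1), independent(0, 2), independent(1, 2)]
p_a, p_b, p_c = marginal([0]), marginal([1]), marginal([2])
jointly = all(joint[(a, b, c)] == p_a[(a,)] * p_b[(b,)] * p_c[(c,)]
              for a, b, c in product([0, 1], repeat=3))
print(pairwise, jointly)  # [True, True, True] False
```

Every pair is independent, yet the three variables are not jointly independent, which is why several different factorizations (and DAGs) describe the same distribution.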
Condensed content
p ( B ) = p ( A ) + p (W ) , p (W ) = p ( B ) − p ( A ) .
The Product and the Sum rules apply to the probability den-
sity function, and taken together imply Bayes’ rule.
$$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} \qquad \text{Product rule,}$$
$$p_{X_1}(x_1) = \int_{\mathbb{R}} p_X(x_1, x_2)\, dx_2 \qquad \text{Sum rule,}$$
$$p(x_1 \mid x_2) = \frac{p(x_1) \cdot p(x_2 \mid x_1)}{\int p(x_1) \cdot p(x_2 \mid x_1)\, dx_1} \qquad \text{Bayes' Theorem.}$$
Those rules, however, do not apply to the cumulative distribution function P_X(x). Fig. 6 illustrates the joint, marginal and conditional densities on a two-dimensional example.
Figure 6: Joint probability density function for two variables, highlighting the marginal p(y) (rear panel) and conditional probability density p(x | y = 0) (cutting through the joint density).
Formal definitions
(Margin note: This is mostly useful to understand other reference material on the subject; do not worry if the definitions sound too convoluted at first.)
We now give the promised formal definitions which lead the way to a rigorous formulation of densities and probabilities on continuous spaces. The first challenge arrives when deriving a σ-algebra F for continuous spaces. We will not use the canonical way by taking the power set of the elements of our continuous space Ω as our σ-algebra. This is because the power set can contain sets which are not measurable with respect to the Lebesgue measure [5]. This
[5] https://en.wikipedia.org/wiki/Lebesgue_measure
• Ω ∈ τ, and ∅ ∈ τ
The elements of the topology τ are called open sets. In the Euclidean vector space R^d, the canonical topology is that of all sets U that satisfy $x \in U :\Rightarrow \exists \varepsilon > 0 : ((\|y - x\| < \varepsilon) \Rightarrow (y \in U))$.
Consider (Ω, F ) and (Γ, G). If both F and G are Borel σ-algebras,
then any continuous function X is measurable (and can thus be
used to define a random variable). This is because, for continu-
ous functions, pre-images of open sets are open sets and Borel
σ-algebras are the smallest σ-algebras to contain all those sets.
PX ( G ) = P( X −1 ( G )) = P({ω | X (ω ) ∈ G }).
Note that not all measures have densities (e.g. measures with point masses).
and, for d = 1,
$$P(a \le X < b) = F(b) - F(a) = \int_a^b f(x)\, dx.$$
$$p(X = 1 \mid \pi) = \pi, \qquad p(X = 0 \mid \pi) = 1 - \pi.$$
A nice choice for the prior p(π), to make the computation easy, is the Beta distribution with parameters a, b > 0,
$$p(\pi) = \frac{\pi^{a-1} (1 - \pi)^{b-1}}{Z},$$
where Z is a normalization constant to ensure that $\int_0^1 p(\pi)\, d\pi = 1$ and is given by the Beta function,
$$Z = B(a, b) = \int_0^1 \pi^{a-1} (1 - \pi)^{b-1}\, d\pi.$$
$$p(\pi \mid n, m) = \frac{\pi^{n+a-1} (1 - \pi)^{m+b-1}}{B(a + n, b + m)}.$$
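To make the conjugate update concrete, here is a small sketch that evaluates this posterior and checks numerically that it integrates to one. It assumes that n counts observations of X = 1 and m counts observations of X = 0 (as the exponents suggest); the counts and prior parameters below are made up for illustration.

```python
import math


def beta_function(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), computed in log space."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def beta_posterior_pdf(pi, n, m, a=1.0, b=1.0):
    """Density p(pi | n, m) after n observations of X=1 and m of X=0."""
    return pi ** (n + a - 1) * (1 - pi) ** (m + b - 1) / beta_function(a + n, b + m)


n, m, a, b = 7, 3, 2.0, 2.0  # hypothetical counts and prior parameters

# Midpoint-rule check that the posterior is normalized, plus its mean.
steps = 10_000
grid = [(k + 0.5) / steps for k in range(steps)]
total = sum(beta_posterior_pdf(p, n, m, a, b) for p in grid) / steps
mean = sum(p * beta_posterior_pdf(p, n, m, a, b) for p in grid) / steps

# The mean of a Beta(a + n, b + m) distribution is (a + n) / (a + b + n + m).
print(total, mean, (a + n) / (a + b + n + m))
```

The prior parameters a and b act like pseudo-counts that are simply added to the observed counts.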
Condensed content
– Not every measure has a density, but all pdfs define measures
– Densities transform under continuously differentiable, injective functions g : x ↦ y with non-vanishing Jacobian as
$$p_Y(y) = \begin{cases} p_X(g^{-1}(y)) \cdot |J_{g^{-1}}(y)| & \text{if } y \text{ is in the range of } g, \\ 0 & \text{otherwise.} \end{cases}$$
As the next tool to add to our toolbox, we will look at Monte Carlo
methods.
In many probabilistic inference problems, the main computa-
tional issue is the computation of expectations and marginal proba-
bilities,
$$\mathbb{E}_{p(x)}[x] = \int x\, p(x)\, dx, \qquad p(y) = \mathbb{E}_{p(x)}[p(y \mid x)] = \int p(y \mid x)\, p(x)\, dx,$$
$$= \frac{1}{n} \operatorname{var}(f) = O(n^{-1}).$$
and count the number of samples that fall within the unit circle ($x^\top x < 1$).
Figure 8: Estimating π by sampling
While this procedure only needs ≈ 9 samples to get the first digit right, it is not great when high precision is required; to get to single-float precision (≈ 10^{-7}), it needs about 10^{14} samples. Fig. 9 shows the error w.r.t. the number of samples.
[Figure 9: the estimate φ̂ (left) and its error (right) against the number of samples, from 10^0 to 10^5.]
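The sampling procedure described above can be sketched in a few lines. This is an illustrative implementation, assuming points are drawn uniformly from the unit square so that the fraction landing inside the quarter circle approximates π/4; the function name and seed are arbitrary.

```python
import random


def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n uniform samples in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:  # sample falls within the unit circle
            inside += 1
    # The quarter circle has area pi / 4, the square has area 1.
    return 4.0 * inside / n


for n in (10, 1_000, 100_000):
    print(n, estimate_pi(n))
```

Because the error shrinks only like O(n^{-1/2}), each additional digit costs roughly a hundred times more samples.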
Sampling
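$$\mathcal{B}(x; \alpha, 1) = \frac{1}{B(\alpha, 1)} x^{\alpha - 1}$$
with
$$B(\alpha, 1) = \frac{\Gamma(\alpha)\, \Gamma(1)}{\Gamma(\alpha + 1)} = \int_0^1 x^{\alpha - 1}\, dx = \frac{1}{\alpha},$$
which gives us
$$\mathcal{B}(x; \alpha, 1) = \alpha x^{\alpha - 1}.$$
Now notice that by setting x = u^{1/α} we get exactly a Beta-distributed random variable:
$$p_x(x) = p_u(u(x)) \cdot \left| \frac{\partial u(x)}{\partial x} \right| = \alpha \cdot x^{\alpha - 1} = \mathcal{B}(x; \alpha, 1).$$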
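The transformation above translates directly into a sampler. The sketch below draws u uniformly on [0, 1] and returns x = u^{1/α}; it then compares the empirical mean against α/(α + 1), the mean of a Beta(α, 1) distribution. The value of α and the sample size are arbitrary choices for illustration.

```python
import random


def sample_beta_alpha_1(alpha, n, seed=0):
    """Draw n samples from Beta(alpha, 1) via the transform x = u**(1/alpha)."""
    rng = random.Random(seed)
    return [rng.random() ** (1.0 / alpha) for _ in range(n)]


alpha = 3.0
samples = sample_beta_alpha_1(alpha, 100_000)
empirical_mean = sum(samples) / len(samples)

# The mean of Beta(alpha, 1) is alpha / (alpha + 1).
print(empirical_mean, alpha / (alpha + 1.0))
```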
Rejection Sampling
One issue with Inverse Transform sampling is that the normaliza-
tion constant needs to be known; we cannot use an unnormalized
$$p(x) = \frac{\tilde{p}(x)}{Z}.$$
cq( x ) ≥ p̃( x ),
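A minimal sketch of rejection sampling under this bound follows. The unnormalized target, the standard normal proposal q and the constant c are all assumptions chosen so that c q(x) ≥ p̃(x) holds by construction; they are not taken from the lecture.

```python
import math
import random


def p_tilde(x):
    """Hypothetical unnormalized target: a Gaussian bump with ripples."""
    return math.exp(-0.5 * x * x) * (1.0 + math.sin(3.0 * x) ** 2)


def q_pdf(x):
    """Proposal density q: standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# With this c, c * q(x) = 2 exp(-x^2 / 2) >= p_tilde(x) for every x.
C = 2.0 * math.sqrt(2.0 * math.pi)


def rejection_sample(n_proposals, seed=0):
    """Propose from q and keep x with probability p_tilde(x) / (c q(x))."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        x = rng.gauss(0.0, 1.0)       # x ~ q
        u = rng.random()              # u ~ Uniform(0, 1)
        if u * C * q_pdf(x) < p_tilde(x):
            accepted.append(x)
    return accepted


n_proposals = 20_000
samples = rejection_sample(n_proposals)
acceptance_rate = len(samples) / n_proposals
print(acceptance_rate)
```

The normalization constant Z of the target is never needed; a loose bound c only costs efficiency, since the acceptance rate equals Z / c.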
Importance Sampling v
Importance sampling is a slightly less simple method; if it is not
possible to compute the inverse transform to sample from p( x ), but
the PDF can still be evaluated, we can use samples from a proxy
$$\tilde{\phi} = \frac{1}{n} \sum_i f(x_i) \frac{p(x_i)}{q(x_i)} = \frac{1}{n} \sum_i f(x_i)\, w_i,$$
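The estimator above can be sketched as follows. As an illustrative assumption, the target p is a standard normal, the proposal q is a wider normal with standard deviation 2, and f(x) = x², so the true value of the expectation is 1.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(f, n, seed=0):
    """Estimate E_p[f(x)] with samples from q, weighted by w_i = p(x_i) / q(x_i)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                                 # x_i ~ q = N(0, 2^2)
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # importance weight
        total += f(x) * w
    return total / n


second_moment = importance_estimate(lambda x: x * x, 50_000)
print(second_moment)  # close to 1, the variance of N(0, 1)
```

Choosing q wider than p keeps the weights bounded; a proposal with lighter tails than p can make the variance of the estimator very large.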
Condensed content
$$p(x) = \frac{\tilde{p}(x)}{Z}$$
assuming that it is possible to evaluate the unnormalized density p̃
(but not p) at arbitrary points.
Typical example: Compute moments of a posterior
$$p(x \mid D) = \frac{p(D \mid x)\, p(x)}{\int p(D, x)\, dx} \quad \text{as} \quad \mathbb{E}_{p(x \mid D)}(x^n) \approx \frac{1}{S} \sum_{i=1}^{S} x_i^n \quad \text{with } x_i \sim p(x \mid D)$$
p ( x i | x 1 , x 2 , . . . , x i −1 ) = p ( x i | x i −1 ).
• Compute a = p(x′)/p(x_t).
• If a > 1, accept x_{t+1} = x′; otherwise accept x′ with probability a, and on rejection set x_{t+1} = x_t.
• Compute $a = \frac{p(x')\, q(x_t \mid x')}{p(x_t)\, q(x' \mid x_t)}$.
Using this method, the samples will spend more “time” in regions where p(x) is high (lower probability of sampling a better proposition) and less “time” in regions where p(x) is low (any proposition would be good), but the algorithm can still visit regions of low probability (see Fig. 13 for an example).
(Margin note: See chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH for a visualization of the Metropolis-Hastings algorithm in 2D created by Chi Feng.)
Metropolis-Hastings draws samples from p(x) in the limit of infinite sampling steps. The proof sketch involves the existence of a stationary distribution, which is a distribution that does not change over time (anymore). For Markov Chains, its existence can be shown through the detailed balance equation:
$$p(x)\, T(x \to x') = p(x')\, T(x' \to x),$$
[Figure 13: Example of an MCMC execution with a Gaussian proposal distribution. Steps 1–5 and step 300; each panel shows p(x) and q(x) over x. How the dots are distributed in the plot depends on how the Markov chain moves. The sample distribution still appears not to be uniform and the Markov chain has not yet mixed 'perfectly'.]
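A random-walk Metropolis sampler of the kind illustrated in the figure can be sketched as below. The unnormalized target (a mixture of two Gaussian bumps), the step size and the burn-in length are hypothetical choices; because the Gaussian proposal is symmetric, the ratio q(x_t | x′)/q(x′ | x_t) cancels and only p̃(x′)/p̃(x_t) is needed.

```python
import math
import random


def p_tilde(x):
    """Hypothetical unnormalized target: two Gaussian bumps at 2 and 5."""
    return math.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * math.exp(-0.5 * (x - 5.0) ** 2)


def metropolis(n_steps, step_size=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)   # x' ~ q(x' | x_t)
        a = p_tilde(proposal) / p_tilde(x)         # acceptance ratio
        if a > 1.0 or rng.random() < a:            # accept with probability min(1, a)
            x = proposal
        chain.append(x)                            # on rejection, x_t is repeated
    return chain


chain = metropolis(50_000)
burned = chain[5_000:]                             # discard burn-in samples
chain_mean = sum(burned) / len(burned)
print(chain_mean)  # the target mean is (1 * 2 + 0.5 * 5) / 1.5 = 3
```

Note that rejected proposals still add a sample (a copy of the current state) to the chain; dropping them would bias the result.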
Gibbs Sampling
This is a special case of Metropolis-Hastings. It employs the idea that sampling from a high-dimensional joint distribution is often difficult, while sampling from a one-dimensional conditional distribution is easier. So instead of directly sampling from the joint distribution p(x), the Gibbs sampler alternates between drawing from the respective conditional distributions $p(x_i \mid x_{j \neq i})$.
(Margin note: chi-feng.github.io/mcmc-demo/app.html#GibbsSampling,banana provides a visualization of the Gibbs Sampling created by Chi Feng.)
procedure Gibbs(p(x))
    x_i ← rand() ∀i                                  (initialize randomly)
    for t = 1, ..., T do
        x_1^(t+1) ∼ p(x_1 | x_2^(t), x_3^(t), ..., x_m^(t))
        x_2^(t+1) ∼ p(x_2 | x_1^(t+1), x_3^(t), ..., x_m^(t))
        ...
        x_m^(t+1) ∼ p(x_m | x_1^(t+1), x_2^(t+1), ..., x_{m-1}^(t+1))
    end for
end procedure
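The procedure above can be made concrete for a case where the conditionals are known in closed form. As an assumed example (not from the lecture), the sketch below samples a zero-mean bivariate Gaussian with unit variances and correlation ρ, for which x1 | x2 ∼ N(ρ x2, 1 − ρ²) and symmetrically for x2.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, n_steps, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance Gaussian with correlation rho."""
    rng = random.Random(seed)
    x1, x2 = rng.random(), rng.random()        # initialize randomly
    cond_std = math.sqrt(1.0 - rho * rho)      # std of each conditional
    samples = []
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, cond_std)     # x1 ~ p(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_std)     # x2 ~ p(x2 | x1), using the new x1
        samples.append((x1, x2))
    return samples


rho = 0.8
samples = gibbs_bivariate_gaussian(rho, 20_000)

# Empirical correlation of the chain, to compare against rho.
n = len(samples)
mean_1 = sum(s[0] for s in samples) / n
mean_2 = sum(s[1] for s in samples) / n
cov = sum((s[0] - mean_1) * (s[1] - mean_2) for s in samples) / n
var_1 = sum((s[0] - mean_1) ** 2 for s in samples) / n
var_2 = sum((s[1] - mean_2) ** 2 for s in samples) / n
empirical_rho = cov / math.sqrt(var_1 * var_2)
print(empirical_rho)
```

Every draw is accepted, but for strongly correlated variables the chain moves in small axis-aligned steps and mixes slowly.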
Condensed content
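$$p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} =: \mathcal{N}(x; \mu, \sigma^2)$$
[Figure: the Gaussian density p(x) over x, with μ − σ, μ and μ + σ marked on the x-axis.]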
of a quadratic polynomial:
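$$\mathcal{N}(x; \mu, \Sigma) = \exp\left(a + \eta^\top x - \frac{1}{2} x^\top \Lambda x\right) = \exp\left(a + \eta^\top x - \frac{1}{2} \operatorname{tr}(x x^\top \Lambda)\right)$$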
with the natural parameters Λ = Σ^{-1} (the precision matrix), η = Λμ, and the sufficient statistics x, xx^⊤. The scaling of the normal
soids (See Fig. 16).
Those properties make it convenient to perform inference on Gaussian random variables, using the simple tools of linear algebra.
Figure 16: Two-dimensional Gaussian distribution.
The Gaussian is its own conjugate prior, meaning that given a Gaussian prior p(x) and a Gaussian likelihood p(y | x), the posterior p(x | y) is also a Gaussian (see Fig. 17). For
$$p(x) = \mathcal{N}(x; \mu, \sigma^2) \quad \text{and} \quad p(y \mid x) = \mathcal{N}(y; x, \nu^2),$$
the posterior is given by
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\int p(y \mid x)\, p(x)\, dx} = \mathcal{N}(x; m, s^2),$$
$$\text{where} \quad m = \frac{\sigma^{-2} \mu + \nu^{-2} y}{\sigma^{-2} + \nu^{-2}} \quad \text{and} \quad s^2 = \frac{1}{\sigma^{-2} + \nu^{-2}}.$$
The derivation of these expressions can be seen here.
Gaussians are closed under multiplication.
$$\mathcal{N}(x; a, A)\, \mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\, Z$$
where $C = (A^{-1} + B^{-1})^{-1}$, $c = C (A^{-1} a + B^{-1} b)$, and $Z = \mathcal{N}(a; b, A + B)$.
Figure 17: Gaussian Prior, Likelihood and Posterior.
Gaussians are closed under marginalization. Marginalization is a special case of a linear projection: it is a projection with a matrix which has the entry 1 on its diagonal at the indices of the variables for which we construct the marginal, and 0s elsewhere.
Figure 18: Linear projection of a Gaussian distributed random variable.
Assuming that x, y, z are distributed according to
$$\mathcal{N}\left( \begin{bmatrix} x \\ y \\ z \end{bmatrix}; \begin{bmatrix} \mu_x \\ \mu_y \\ \mu_z \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} & \Sigma_{xz} \\ \Sigma_{yx} & \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zx} & \Sigma_{zy} & \Sigma_{zz} \end{bmatrix} \right),$$
Numerical Stability
Inverting matrices is very often subject to numerical instability,
which happens when the matrices are close to singular [10]. If you
x = numpy.linalg.inv(A) @ b. [11]
with [14]
L = numpy.linalg.cholesky(A) [15]
[10] wikipedia.org/wiki/Invertible_matrix
[11] Docs for numpy.linalg.inv
[14] wikipedia.org/wiki/Cholesky_decomposition
[15] Docs for numpy.linalg.cholesky
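A minimal sketch contrasting the two routes follows, assuming A is symmetric positive definite (as covariance matrices are). The test matrix is random and purely illustrative. For brevity the two triangular systems are solved with the general-purpose `numpy.linalg.solve`; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of L.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 1e-3 * np.eye(5)      # symmetric positive definite test matrix
b = rng.standard_normal(5)

# Route 1: form the explicit inverse (generally discouraged).
x_inv = np.linalg.inv(A) @ b

# Route 2: factor A = L L^T and solve two triangular systems instead.
L = np.linalg.cholesky(A)           # L is lower triangular
z = np.linalg.solve(L, b)           # solve L z = b
x_chol = np.linalg.solve(L.T, z)    # solve L^T x = z

residual = np.linalg.norm(A @ x_chol - b)
print(residual)
```

The Cholesky factor can also be reused: the same L gives the log-determinant as 2 * sum(log(diag(L))), which appears in Gaussian log-densities.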
Inferring (in)dependence
A neat property that Gaussian distributions have is that they also allow us to infer the (in)dependence between the variables.
A zero off-diagonal element in the covariance matrix implies marginal independence.
$$[\Sigma]_{ij} = 0 \;\Rightarrow\; p(x_i, x_j) = \mathcal{N}(x_i; [\mu]_i, [\Sigma]_{ii}) \cdot \mathcal{N}(x_j; [\mu]_j, [\Sigma]_{jj})$$
Condensed content
$$f(x) = w_1 + w_2 x.$$
[Figure 20: Gaussian prior on the weights (axes w1, w2), along with the matching Gaussian prior on the function space (axes x, f(x)).]
To recap the notation, we have a dataset X = [x_1, ..., x_N], X ∈ 𝒳^N, and an output dataset y ∈ R^N, a function f(x) ∈ R, and a feature vector for each data point (which we choose as φ_x = [1, x]^⊤). To make use of vectorization, we build the following feature matrix containing the feature vectors
$$\phi_X = \begin{pmatrix} \phi_{x_1} & \dots & \phi_{x_N} \end{pmatrix} = \begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_N \end{pmatrix}.$$
[Figure 21: Posterior on the weights (axes w1, w2), and the matching posterior on the function (axes x, f(x)), after seeing several datapoints. The result of applying more datapoints can be seen here.]
Those equations can be difficult to digest at first glance, but do not be intimidated. We will first see some results those equations produce, but will return to them at the end of the chapter for more details.
Feature functions
Furthermore, one can use sines and cosines to get a Fourier regres-
sion
$$\phi(x) = \phi_x = \begin{bmatrix} \cos(x) & \cos(2x) & \cos(3x) & \sin(x) & \sin(2x) & \sin(3x) \end{bmatrix}^\top,$$
[Figures: the dataset (axes x, y) and regression results (axes x, f(x)) for several feature choices. Recoverable panel labels:]
• φ_x = [1, x, x², x³]^⊤
• φ_α(x) = |x − α| − α
• φ_α(x) = e^{−|x−α|}
• Bell curve regression, using Gaussian distributions with different locations, φ_α(x) = e^{−(x−α)²}
Condensed content
$$f(x) = \phi(x)^\top w = \phi_x^\top w$$
Dimensionality
It is useful to look at the dimensionality of each variable to get a sense of the computational complexity. Assume we have a dataset with N data points and choose a set of F features to fit.
• The prior mean μ and prior covariance matrix Σ of the weights w yield the prior on the weights p(w) = N(w; μ, Σ), where μ is an F-dimensional vector and Σ an [F × F] matrix.
• The function vector f_X and noise matrix σ²I give the likelihood p(y | f_X) = N(y; f_X, σ²I), where f_X is an N-dimensional vector and σ²I an [N × N] (diagonal) matrix for N data points.
• The feature function generates an [F × N] matrix that links the [N × N] data space and the [F × F] feature space.
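$$\begin{aligned}
p(w \mid y, \phi_X) &= \mathcal{N}\Big(w;\; \mu + \Sigma \phi_X \left(\phi_X^\top \Sigma \phi_X + \sigma^2 I\right)^{-1} (y - \phi_X^\top \mu),\\
&\qquad\quad\; \Sigma - \Sigma \phi_X \left(\phi_X^\top \Sigma \phi_X + \sigma^2 I\right)^{-1} \phi_X^\top \Sigma\Big) \\
&= \mathcal{N}\Big(w;\; \left(\Sigma^{-1} + \sigma^{-2} \phi_X \phi_X^\top\right)^{-1} \left(\Sigma^{-1} \mu + \sigma^{-2} \phi_X y\right),\\
&\qquad\quad\; \left(\Sigma^{-1} + \sigma^{-2} \phi_X \phi_X^\top\right)^{-1}\Big).
\end{aligned}$$
$$\begin{aligned}
p(f_X \mid y, \phi_X) &= \mathcal{N}\Big(\phi_X^\top w;\; \phi_X^\top \mu + \phi_X^\top \Sigma \phi_X \left(\phi_X^\top \Sigma \phi_X + \sigma^2 I\right)^{-1} (y - \phi_X^\top \mu),\\
&\qquad\quad\; \phi_X^\top \Sigma \phi_X - \phi_X^\top \Sigma \phi_X \left(\phi_X^\top \Sigma \phi_X + \sigma^2 I\right)^{-1} \phi_X^\top \Sigma \phi_X\Big) \\
&= \mathcal{N}\Big(\phi_X^\top w;\; \phi_X^\top \left(\Sigma^{-1} + \sigma^{-2} \phi_X \phi_X^\top\right)^{-1} \left(\Sigma^{-1} \mu + \sigma^{-2} \phi_X y\right),\\
&\qquad\quad\; \phi_X^\top \left(\Sigma^{-1} + \sigma^{-2} \phi_X \phi_X^\top\right)^{-1} \phi_X\Big).
\end{aligned}$$
Figure 26: Matrix Inversion Lemma for the posterior on the function values.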
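The two forms of the weight posterior are related by the matrix inversion lemma, and a short numerical check can make this tangible. The sketch below uses made-up linear data, the features φ_x = [1, x]^⊤, a standard normal prior on w and a fixed noise level σ; all of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 20, 0.5
x = rng.uniform(-8, 8, size=N)
y = 1.0 + 0.5 * x + sigma * rng.standard_normal(N)   # hypothetical dataset

phi_X = np.vstack([np.ones(N), x])    # feature matrix, shape [F x N]
F = phi_X.shape[0]
mu, Sigma = np.zeros(F), np.eye(F)    # prior p(w) = N(w; mu, Sigma)

# Form 1: inverts an [N x N] matrix in data space.
G = phi_X.T @ Sigma @ phi_X + sigma**2 * np.eye(N)
mean_data = mu + Sigma @ phi_X @ np.linalg.solve(G, y - phi_X.T @ mu)
cov_data = Sigma - Sigma @ phi_X @ np.linalg.solve(G, phi_X.T @ Sigma)

# Form 2: inverts an [F x F] matrix in feature space.
Sigma_inv = np.linalg.inv(Sigma)
precision = Sigma_inv + phi_X @ phi_X.T / sigma**2
cov_feat = np.linalg.inv(precision)
mean_feat = cov_feat @ (Sigma_inv @ mu + phi_X @ y / sigma**2)

print(mean_data, mean_feat)
print(np.allclose(mean_data, mean_feat), np.allclose(cov_data, cov_feat))
```

Which form is cheaper depends on the sizes: the first costs O(N³), the second O(F³), so the feature-space form is preferable when there are many more data points than features.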
parameters, and search this subspace instead. Consider the family of feature functions parametrized by θ,
$$\phi(x; \theta) = \frac{1}{1 + \exp\left(-\frac{x - \theta_1}{\theta_2}\right)},$$
illustrated in Fig. 27. The number of feature functions is still infinite, as there is an infinite array of choices for (θ1, θ2), but the dimensionality of the parametrization is fixed.
Figure 27: Parametrized Family of Functions, $\phi(x; \theta) = \left(1 + \exp\left(-\frac{x - \theta_1}{\theta_2}\right)\right)^{-1}$. The parameter θ1 controls the intercept (where on the x-axis the function value crosses the 1/2 point) and the parameter θ2 controls the slope.
The parameters θ1 and θ2 can be treated as unknown parameters, just as the weights w. However, it is more difficult to infer them, since the likelihood function
$$p(y \mid w, \theta) = \mathcal{N}\left(y;\; \phi(x; \theta)^\top w,\; \sigma^2 I\right)$$
contains a non-linear mapping of θ. Due to this non-linearity, we cannot use the full Bayesian treatment through linear algebra operations for the Normal distribution. It is still technically possible to perform inference over θ, but the computational cost is prohibitive. However, if θ is known, the distributions related to the weights are still linear combinations of Gaussians, and we can still use linear algebra to infer the weights. It would be nice to have an approximate solution for θ, but still do full inference on the weights w. This is where Maximum Likelihood (ML) and Maximum A-Posteriori (MAP) estimation come into play. Instead of integrating over θ, we can fit it by selecting the most likely values using
$$\underbrace{\theta^\star = \arg\max_\theta\, p(D \mid \theta)}_{\text{Maximum Likelihood}}, \qquad \underbrace{\theta^\star = \arg\max_\theta\, p(\theta \mid D)}_{\text{Maximum A-Posteriori}}.$$
(Margin note: Maximum A-Posteriori weighs the likelihood term by a prior on θ, $p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)$, and maximizes the posterior w.r.t. θ (hence, “A-Posteriori”).)
The parameters of the linear model w will still get a full proba-
bilistic treatment and will be integrated out in the inference of the
posterior, but the parameters that select the features (also known
as hyper-parameters) θ, are too costly to properly infer and will get
fitted.
To get a better understanding where these expressions come
from, notice that the evidence in our posterior for f
$$p(f \mid y, x, \theta) = \frac{p(y \mid f, x, \theta)\, p(f \mid \theta)}{\int p(y \mid f, x, \theta)\, p(f \mid \theta)\, df} = \frac{p(y \mid f, x, \theta)\, p(f \mid \theta)}{\underbrace{p(y \mid x, \theta)}_{\text{the evidence}}}$$
This gives
We can drop the term N/2 log(2π ) because it does not affect the
minimization.
To better see that the first term is a square error, it might be useful to rewrite it as $\left\| \left(\phi_X^{\theta\top} \Sigma \phi_X^{\theta} + \Lambda\right)^{-1/2} (y - \phi_X^{\theta\top} \mu) \right\|^2$, which is the squared distance between $\phi_X^{\theta\top} \mu$ and y scaled by (the square-root of) the precision matrix. The Model Complexity term, $\log \det\left(\phi_X^{\theta\top} \Sigma \phi_X^{\theta} + \Lambda\right)$, measures the “volume” of hypotheses covered by the joint Gaussian distribution.
The Model Complexity term, also called Occam’s factor, adds a penalty for features that lead to a large hypothesis space. This is based on the principle that, everything kept equal, simpler explanations should be favored over more complex ones.
(Margin note: “Numquam ponenda est pluralitas sine necessitate.” Plurality must never be posited without necessity. William of Occam)
The aforementioned minimization procedure tries to both:
• explain the observed data well, by making the resulting $\phi_X^{\theta\top} \mu$ close to y
It is important to note that by using Maximum Likelihood (or Maximum A-Posteriori) solutions, we do not capture the uncertainty on the hyper-parameters. However, they make it possible to get some solution about which features to use in a reasonable time, which would be intractable otherwise.
If you are worried about fitting or hand-picking features for Bayesian regression, remember that this also applies to deep learning, where we have to reason about the choice of activation functions. By highlighting assumptions and priors, the probabilistic view forces us to address this problem directly, rather than obscuring it with notation and intuitions.
(Margin note: The usual way to train such networks, however, does not include the Occam factor. The method used here is often referred to as Type-II Maximum Likelihood, whereas neural networks typically use Type-I. The following reference contains more details on the application of those ideas to neural networks: MacKay. The Evidence Framework Applied to Classification Networks. Neural Computation, 1992)
Connection to deep learning
Up until this point, we haven’t really talked about how to solve the minimization problem stemming from the Maximum Likelihood over the model’s hyperparameters. Since the optimization problem doesn’t have an analytical solution, we can turn to a very prominent tool commonly used in deep learning: Automatic Differentiation.
[Figure 28: The computation graph for L(θ), with nodes θ, φ, Δ, K, G, e, c, L and edges m1 to m9.]
In general, Automatic Differentiation (AD) is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program (i.e. mathematical expression), no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately, and using at most a small constant factor more arithmetic operations than the original program.
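To illustrate the idea of propagating derivatives through elementary operations, here is a toy forward-mode AD sketch based on dual numbers. It is only an illustration of the principle (deep learning frameworks typically use reverse mode), and the `Dual` class and helper names are hypothetical. It differentiates the sigmoid feature φ(x; θ) with respect to θ1.

```python
import math


class Dual:
    """Number val + dot * eps with eps^2 = 0; `dot` carries the derivative."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)

    __radd__ = __add__

    def __neg__(self):
        return Dual(-self.val, -self.dot)

    def __sub__(self, other):
        return self + (-Dual.lift(other))

    def __rsub__(self, other):
        return Dual.lift(other) + (-self)

    def __mul__(self, other):  # product rule
        o = Dual.lift(other)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    __rmul__ = __mul__

    def __truediv__(self, other):  # quotient rule
        o = Dual.lift(other)
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / (o.val ** 2))

    def __rtruediv__(self, other):
        return Dual.lift(other) / self


def exp(x):
    """Elementary function with its derivative rule: d exp(u) = exp(u) du."""
    x = Dual.lift(x)
    e = math.exp(x.val)
    return Dual(e, e * x.dot)


def sigmoid_feature(x, theta1, theta2):
    """phi(x; theta) = 1 / (1 + exp(-(x - theta1) / theta2))."""
    return 1.0 / (1.0 + exp(-(x - theta1) / theta2))


def d_feature_d_theta1(x, theta1, theta2):
    """Seed theta1 with derivative 1 and read off d phi / d theta1."""
    return sigmoid_feature(x, Dual(theta1, 1.0), theta2).dot


x, theta1, theta2 = 0.7, 0.2, 1.5                    # arbitrary test point
ad_value = d_feature_d_theta1(x, theta1, theta2)
s = 1.0 / (1.0 + math.exp(-(x - theta1) / theta2))
analytic = -s * (1.0 - s) / theta2                   # hand-derived derivative
print(ad_value, analytic)
```

The same mechanism extends to matrix operations such as inverses and log-determinants, each of which only needs its own local derivative rule.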
Looking back at our derived loss L(θ), we split the computations as follows:
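$$L(\theta) = \frac{1}{2} \left( (y - \phi_X^{\theta\top} \mu)^\top \left(\phi_X^{\theta\top} \Sigma \phi_X^{\theta} + \Lambda\right)^{-1} (y - \phi_X^{\theta\top} \mu) + \log \left| \phi_X^{\theta\top} \Sigma \phi_X^{\theta} + \Lambda \right| \right)$$
with the intermediate quantities
$$\Delta := y - \phi_X^{\theta\top} \mu, \qquad K := \phi_X^{\theta\top} \Sigma \phi_X^{\theta} + \Lambda, \qquad G := K^{-1}, \qquad e := \Delta^\top G\, \Delta, \qquad c := \log |K|.$$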
[Figure 29: Graphical representation of Hierarchical Bayesian Linear Regression. The parameters controlling the features, θ, are learned from the data using Maximum Likelihood, in a similar fashion as a Neural Network, and full inference is carried over the weights of those features, w. Layers from bottom to top: input x; parameters θ1, ..., θ9; features [φx]1, ..., [φx]9; weights w1, ..., w9; output y.]
Further assuming that our prior for w and θ is also Gaussian and
centered with unit covariance matrix, we get
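$$\begin{aligned}
\arg\max_{w,\theta}\, p(w, \theta \mid y) &= \arg\min_{w,\theta}\; \sum_i w_i^2 + \sum_j \theta_j^2 + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\| y_i - \phi_i^{\theta\top} w \right\|^2 \\
&\approx \arg\min_{w,\theta}\; \underbrace{\sum_i w_i^2 + \sum_j \theta_j^2}_{r(\theta, w)} + \underbrace{\frac{1}{2\sigma^2} \frac{n}{b} \sum_{\beta=1}^{b} \left\| y_\beta - \phi_\beta^{\theta\top} w \right\|^2}_{L(\theta, w)} \;\sim\; \mathcal{N}\left(r + L(\theta, w),\; O(b^{-1})\right).
\end{aligned}$$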
Condensed content
We can use the following abstraction
mean function: $m(x) = \phi_x^\top \mu$, $m : \mathcal{X} \to \mathbb{R}$,
covariance function (Kernel): $K(a, b) = \phi_a^\top \Sigma \phi_b$, $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$,
to rewrite the posterior as
$$p(f_{x'} \mid y, \phi_X) = \mathcal{N}\Big(f_{x'};\; m_{x'} + K_{x'X} (K_{XX} + \sigma^2 I)^{-1} (y - m_X),\; K_{x'x'} - K_{x'X} (K_{XX} + \sigma^2 I)^{-1} K_{Xx'}\Big),$$
where $m_a = \phi_a^\top \mu$ and $K_{ab} = \phi_a^\top \Sigma \phi_b$. The feature vectors φ_X, φ_{x'} are hidden in the computation of the mean function and the kernel. We will see that for some models, it is not necessary to construct the (infinite) feature vectors to compute the posterior: the mean and kernel can be computed in closed form.
(Margin note: To see how Gaussian Process inference can be implemented and how the introduced abstraction allows us to hide the feature functions in the computation, take a look at the jupyter notebook Gaussian_Process_Regression.ipynb.)
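A compact sketch of this posterior computation is given below. It is not the notebook referenced in the notes; the squared-exponential kernel, the zero mean function, the noise level and the toy data are all assumptions made for illustration. The linear systems are solved through a Cholesky factorization rather than an explicit inverse.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, scale=1.0):
    """Squared-exponential kernel matrix K[i, j] = k(a_i, b_j) (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return scale**2 * np.exp(-0.5 * d**2 / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, sigma=0.1,
                 mean=lambda x: np.zeros_like(x)):
    """Posterior mean and covariance of f at x_test given noisy observations."""
    K_XX = rbf_kernel(x_train, x_train) + sigma**2 * np.eye(len(x_train))
    K_sX = rbf_kernel(x_test, x_train)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K_XX)                       # K_XX + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - mean(x_train)))
    post_mean = mean(x_test) + K_sX @ alpha            # m + K (K + s^2 I)^-1 (y - m)
    V = np.linalg.solve(L, K_sX.T)
    post_cov = K_ss - V.T @ V                          # K - K (K + s^2 I)^-1 K
    return post_mean, post_cov


x_train = np.array([-4.0, -1.0, 0.5, 3.0])             # hypothetical data
y_train = np.sin(x_train)
x_test = np.linspace(-8, 8, 50)

post_mean, post_cov = gp_posterior(x_train, y_train, x_test)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
print(post_mean[:3], post_std[:3])
```

Only kernel evaluations appear in the code; the feature vectors never have to be constructed, which is what makes infinitely many features tractable.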
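with parameters $c_1 < \dots < c_F$ in $[c_{\min}, c_{\max}]$. The kernel can then be written as
$$\begin{aligned}
\phi_a^\top \Sigma \phi_b &= \sigma^2 \frac{c_{\max} - c_{\min}}{F} \sum_{\ell=1}^{F} \exp\left(-\frac{(a - c_\ell)^2}{2\lambda^2}\right) \exp\left(-\frac{(b - c_\ell)^2}{2\lambda^2}\right), \\
&= \sigma^2 \frac{c_{\max} - c_{\min}}{F} \exp\left(-\frac{(a - b)^2}{4\lambda^2}\right) \sum_{\ell=1}^{F} \exp\left(-\frac{\left(c_\ell - \frac{1}{2}(a + b)\right)^2}{\lambda^2}\right).
\end{aligned}$$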
Visualizing kernels
The following figure shows the prior for different feature functions
and how increasing the number of features (by taking the limit
towards infinity) leads to a kernel. The final posteriors are inferred
from the dataset introduced in Fig. 22, Page 48
[Figures: priors and posteriors for an increasing number of features; axes x and f(x).]
and converges to
$$K(a, b) = \sigma^2 (\min(a, b) - c_0).$$
The derivation can be found here.
Cubic Splines
The Cubic Splines start similarly to the Wiener process as threshold functions, but they keep increasing linearly after being activated, like ReLU activation functions,
$$\phi_\ell(x) = \begin{cases} x - c_\ell & \text{if } x \ge c_\ell, \\ 0 & \text{otherwise,} \end{cases}$$
and converge to
$$K(a, b) = \sigma^2 \left( \frac{1}{3} \min(a - c_0, b - c_0)^3 + \frac{1}{2} |a - b| \min(a - c_0, b - c_0)^2 \right).$$
Its derivation can be seen here.
[Figures: priors and posteriors for the cubic spline features; axes x and f(x).]
The first two properties should be easy to prove from the properties of positive semi-definite matrices. The third property is the result of Mercer’s Theorem [19], which we will cover in the next chapter, and the last property is the result of the Schur Product Theorem [20]. Its proof is involved and is the result of the fact that the Hadamard
[19] wikipedia.org/wiki/Mercer’s_theorem
[20] wikipedia.org/wiki/Schur_product_theorem
−10
−8 −6 −4 −2 0 2 4 6 8
x
10
f (x)
−10
−8 −6 −4 −2 0 2 4 6 8
x
−10
−8 −6 −4 −2 0 2 4 6 8
x
10
f (x)
−10
−8 −6 −4 −2 0 2 4 6 8
x
68 probabilistic machine learning
−10
−8 −6 −4 −2 0 2 4 6 8
x
10
f (x)
−10
−8 −6 −4 −2 0 2 4 6 8
x
−10
−8 −6 −4 −2 0 2 4 6 8
x
10
f (x)
−10
−8 −6 −4 −2 0 2 4 6 8
x
gaussian processes v 69
3
x+8
0 φ( x ) = .
5
$$\phi(x) = [1, \; x, \; x^2]^\top.$$
Condensed content
Gaussian Processes
k ( a, b) = k(τ ) with τ := a − b
We can obtain the expectation (i.e. the mean in the Gaussian set-
ting) of the functions at the explicit locations X by maximizing the
posterior:
$$= \arg\min_{f_X} \; -p(f_X \mid y)$$
1. ∀ x ∈ X : k(·, x ) ∈ H
$$\mu(x) = k_{xX} \underbrace{(k_{XX} + \sigma^2 I)^{-1} y}_{=: w} = \sum_{i=1}^{n} w_i \, k(x, x_i) \quad \text{for } n \in \mathbb{N}.$$
This means that we can think of the RKHS as the space that is
spanned by the posterior mean functions of GP regression.
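To make this concrete, here is a minimal sketch that computes the weights w once and then evaluates the posterior mean as a weighted sum of kernel functions centred on the training inputs. The square-exponential kernel, the toy data `X`, `y` and the noise level `sigma` are assumptions chosen for illustration.

```python
import numpy as np

def se_kernel(a, b, ell=1.0):
    """Square-exponential kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# Hypothetical training data.
X = np.array([-4.0, -1.5, 0.0, 2.0, 5.0])
y = np.sin(X)
sigma = 0.1

# w := (k_XX + sigma^2 I)^{-1} y
w = np.linalg.solve(se_kernel(X, X) + sigma**2 * np.eye(len(X)), y)

def posterior_mean(x):
    """mu(x) = k_xX w, evaluated for an array of test inputs x."""
    return se_kernel(x, X) @ w

x_test = np.linspace(-8.0, 8.0, 5)
mu = posterior_mean(x_test)

# The same quantity written explicitly as sum_i w_i k(x, x_i).
mu_as_sum = sum(w[i] * se_kernel(x_test, X[i:i + 1])[:, 0] for i in range(len(X)))
```

Both `mu` and `mu_as_sum` describe the same function: an element of the span of the kernel functions k(·, x_i).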
m( x ) = k xX (k XX + σ2 I )−1 y
GP’s expected square error is the RKHS’s worst case square error
At this point, one might say that the Bayesian and the frequentist viewpoints are roughly the same. However, notice that when we talk about the posterior distribution of a Gaussian process regression, we are not just talking about the posterior mean (which is also the mode), but we also consider the width, thus encapsulating the entire probability distribution. This ability to quantify uncertainty is often seen as the main selling point of the probabilistic framework: by keeping track of the remaining volume of hypotheses, we can say how certain we are about our estimate. A natural question that arises is whether there is a statistical interpretation of the posterior variance in the Gaussian process framework.
For a moment, suppose that we have noise-free observations.
Given this assumption, let’s observe how far the posterior mean
could be from the truth in a given RKHS:
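$$\begin{aligned}
\sup_{f \in \mathcal{H}, \|f\| \le 1} \left( m(x) - f(x) \right)^2
&= \sup_{f \in \mathcal{H}, \|f\| \le 1} \Big( \sum_i f(x_i) \underbrace{[K_{XX}^{-1} k(X, x)]_i}_{w_i} - f(x) \Big)^2 \\
\text{reproducing property:} \quad
&= \sup_{f \in \mathcal{H}, \|f\| \le 1} \Big\langle \sum_i w_i k(\cdot, x_i) - k(\cdot, x), \; f(\cdot) \Big\rangle_{\mathcal{H}}^2 \\
\text{Cauchy-Schwarz } (|\langle a, b \rangle| \le \|a\| \cdot \|b\|): \quad
&= \Big\| \sum_i w_i k(\cdot, x_i) - k(\cdot, x) \Big\|_{\mathcal{H}}^2 \\
\text{reproducing property:} \quad
&= \sum_{ij} w_i w_j k(x_i, x_j) - 2 \sum_i w_i k(x, x_i) + k(x, x) \\
&= k_{xx} - k_{xX} K_{XX}^{-1} k_{Xx} = \mathbb{E}_{|y} \left[ (f_x - \mu_x)^2 \right]
\end{aligned}$$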
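The last two lines of this derivation are easy to verify numerically. The sketch below evaluates the squared RKHS norm via the expanded sum and compares it with the noise-free GP posterior variance; the kernel, the training inputs `X` and the test input `x` are illustrative assumptions.

```python
import numpy as np

def k(a, b):
    """Square-exponential kernel matrix (unit length scale, assumed here)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.array([-3.0, -1.0, 0.5, 2.0])   # hypothetical training inputs
x = np.array([1.2])                    # hypothetical test input

K_XX = k(X, X)
k_Xx = k(X, x)[:, 0]
k_xx = k(x, x)[0, 0]

# w_i = [K_XX^{-1} k(X, x)]_i
w = np.linalg.solve(K_XX, k_Xx)

# sum_ij w_i w_j k(x_i, x_j) - 2 sum_i w_i k(x, x_i) + k(x, x)
worst_case_sq_error = w @ K_XX @ w - 2.0 * w @ k_Xx + k_xx

# k_xx - k_xX K_XX^{-1} k_Xx
posterior_variance = k_xx - k_Xx @ np.linalg.solve(K_XX, k_Xx)
```

The two numbers agree up to rounding, which is the content of the final equality.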
Then (simplified!):
$$f(x) = \sum_{i \in I} z_i \lambda_i^{1/2} \phi_i(x) \sim \mathcal{GP}(0, k).$$
Then,
Kernels for which the RKHS lies dense in the space of all continuous functions are known as universal kernels. One such example is the square-exponential kernel (also known as the Gaussian or RBF kernel):
$$k(a, b) = \exp\left( -\frac{(a - b)^2}{2} \right)$$
When using such kernels for GP/kernel-ridge regression, for any continuous function $f$ and $\epsilon > 0$, there is an RKHS element $\hat{f} \in \mathcal{H}_k$ such that $\|f - \hat{f}\| < \epsilon$ (where $\|\cdot\|$ is the maximum norm on a compact subset of $\mathbb{X}$).
In fact, there are two main aspects that are wrong with the observed properties of our estimator. The first one is that, as we increase the number of evaluations, we start seeing the mean of the posterior deviating largely from the true function, thus increasing the overall error. The second one is that the uncertainty contracts, meaning that the algorithm becomes more and more certain in the predictions it is making, even though we observe that the posterior is nothing like the true function.
Note that the statements about universality that we have made at the beginning of the section still do apply. It is just that the convergence rate of the algorithm is terribly low.
If f is "not well covered" by the RKHS, the number of data points required to achieve $\epsilon$ error can be exponential in $\epsilon$. Outside of the observation range, there are no guarantees at all. The following technical theorem defines the notions of convergence precisely:

Figure 36: Convergence rate of the GP regressor (error $\|f - m\|^2$ against the number of function evaluations). The golden lines indicate a rate of $O(1/\sqrt{n})$.

Theorem 46 (v.d. Vaart & v. Zanten, 2011). Let $f_0$ be an element of the Sobolev space $W_2^{\beta}([0, 1]^d)$ with $\beta > d/2$. Let $k_s$ be a kernel on $[0, 1]^d$ whose RKHS is norm-equivalent to the Sobolev space $W_2^s([0, 1]^d)$ of order $s := \alpha + d/2$ with $\alpha > 0$. If $f_0 \in C^{\beta}([0, 1]^d) \cap W_2^{\beta}([0, 1]^d)$ and $\min(\alpha, \beta) > d/2$, then we have
$$\mathbb{E}_{\mathcal{D}_n \mid f_0} \left[ \int \|f - f_0\|^2_{L_2(P_X)} \, d\Pi_n(f \mid \mathcal{D}_n) \right] = O\left( n^{-2 \min(\alpha, \beta) / (2\alpha + d)} \right) \quad (n \to \infty), \qquad (1)$$
where $\mathbb{E}_{X, Y \mid f_0}$ denotes expectation with respect to $\mathcal{D}_n = (x_i, y_i)_{i=1}^n$ with the model $x_i \sim P_X$ and $p(y \mid f_0) = \mathcal{N}(y; f_0(X), \sigma^2 I)$, and $\Pi_n(f \mid \mathcal{D}_n)$ the posterior given by GP-regression with kernel $k_s$.

Margin note: The Sobolev space $W_2^s(\mathbb{X})$ is the vector space of real-valued functions over $\mathbb{X}$ whose derivatives up to $s$-th order have bounded $L_2$ norm. $L_2(P_X)$ is the Hilbert space of square-integrable functions with respect to $P_X$. wikipedia.org/wiki/Sobolev_space
Condensed content
• GPs are quite powerful: They can learn any function in the
RKHS (a large, generally infinite-dimensional space!)
• GPs are quite limited: If $f \notin \mathcal{H}_k$, they may converge very (e.g. exponentially) slowly to the truth.
Time Series v
p ( x i | x 0 , x 1 , . . . , x i −1 ) = p ( x i | x i −1 ).
2. The observations y are local – each $y_t$ only depends on the latent $x_t$: $p(y_t \mid X) = p(y_t \mid x_t)$
In the typical predictive setting of these problems, given observed data $Y_{0:t-1} = (y_0, y_1, \dots, y_{t-1})$, we first want to infer the current latent state $x_t$:

Figure 37: The graphical model under the two assumptions.

$$\begin{aligned}
p(x_t \mid Y_{0:t-1})
&= \frac{\int_{j \neq t} p(X) \, p(Y_{0:t-1} \mid X) \, dx_j}{\int p(X) \, p(Y_{0:t-1} \mid X) \, dX} \\
&= \frac{\int_{j \neq t} p(Y_{0:t-1} \mid X_{0:t-1}) \, p(x_0) \prod_{0 < j < t} p(x_j \mid x_{j-1}) \, dx_j \; p(x_t \mid x_{t-1}) \prod_{j > t} p(x_j \mid x_{j-1}) \, dx_j}{\int p(Y_{0:t-1} \mid X_{0:t-1}) \, p(x_0) \prod_{0 < j < t} p(x_j \mid x_{j-1}) \, p(x_t \mid x_{t-1}) \prod_{j > t} p(x_j \mid x_{j-1}) \, dX} \\
&= \frac{\int_{j < t} p(x_t \mid x_{t-1}) \, p(Y_{0:t-1} \mid X_{0:t-1}) \, p(x_0) \prod_{0 < j < t} p(x_j \mid x_{j-1}) \, dx_j}{\int_{j \le t} p(Y_{0:t-1} \mid X_{0:t-1}) \, p(x_0) \prod_{0 < j < t} p(x_j \mid x_{j-1}) \, dx_j} \\
&= \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid Y_{0:t-1}) \, dx_{t-1}
\end{aligned}$$
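Filtering: $O(T)$
$$\text{predict:} \quad p(x_t \mid Y_{0:t-1}) = \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid Y_{0:t-1}) \, dx_{t-1} \qquad \text{(Chapman-Kolmogorov Eq.)}$$
$$\text{update:} \quad p(x_t \mid Y_{0:t}) = \frac{p(y_t \mid x_t) \, p(x_t \mid Y_{0:t-1})}{p(y_t)}$$
Smoothing: $O(T)$
$$\text{smooth:} \quad p(x_t \mid Y) = p(x_t \mid Y_{0:t}) \int p(x_{t+1} \mid x_t) \, \frac{p(x_{t+1} \mid Y)}{p(x_{t+1} \mid Y_{0:t})} \, dx_{t+1}$$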
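For a latent state with finitely many values, the integrals in the predict and update steps become sums, and one filtering step is two lines of linear algebra. The following sketch assumes a discrete state space; the transition matrix `T`, the likelihood vector `lik` and the function name `filter_step` are hypothetical.

```python
import numpy as np

def filter_step(belief, T, lik):
    """One predict/update step for a discrete latent state.

    belief[i] = p(x_{t-1} = i | Y_{0:t-1})
    T[i, j]   = p(x_t = j | x_{t-1} = i)
    lik[j]    = p(y_t | x_t = j) for the observed y_t
    """
    predicted = belief @ T                 # Chapman-Kolmogorov (sum over x_{t-1})
    unnormalized = lik * predicted         # numerator of the update step
    return predicted, unnormalized / unnormalized.sum()

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
belief = np.array([0.5, 0.5])
lik = np.array([0.7, 0.1])

predicted, updated = filter_step(belief, T, lik)
```

Calling `filter_step` once per observation gives a filter whose cost is linear in the length T of the time series.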
Gauss-Markov Models v
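$$\begin{aligned}
&= \mathcal{N}(x_t; A m_{t-1}, A P_{t-1} A^\top + Q) \\
&= \mathcal{N}(x_t; m_t^-, P_t^-)
\end{aligned}$$
$$\begin{aligned}
\text{update:} \quad p(x_t \mid Y_{1:t})
&= \frac{p(y_t \mid x_t) \, p(x_t \mid Y_{1:t-1})}{p(y_t)} \\
&= \frac{\mathcal{N}(y_t; H x_t, R) \, \mathcal{N}(x_t; m_t^-, P_t^-)}{\mathcal{N}(y_t; H m_t^-, H P_t^- H^\top + R)} \\
&= \mathcal{N}(x_t; m_t^- + K z, (I - K H) P_t^-) \\
&= \mathcal{N}(x_t; m_t, P_t) \quad \text{where} \\
K &:= P_t^- H^\top (H P_t^- H^\top + R)^{-1} \quad \text{(gain)} \\
z &:= y_t - H m_t^- \quad \text{(residual)}
\end{aligned}$$
$$\begin{aligned}
\text{smooth:} \quad p(x_t \mid Y)
&= p(x_t \mid Y_{1:t}) \int p(x_{t+1} \mid x_t) \, \frac{p(x_{t+1} \mid Y)}{p(x_{t+1} \mid Y_{1:t})} \, dx_{t+1} \\
&= \mathcal{N}(x_t; m_t, P_t) \int \mathcal{N}(x_{t+1}; A x_t, Q) \, \frac{\mathcal{N}(x_{t+1}; m_{t+1}^s, P_{t+1}^s)}{\mathcal{N}(x_{t+1}; m_{t+1}^-, P_{t+1}^-)} \, dx_{t+1} \\
&= \mathcal{N}(x_t; m_t + G_t (m_{t+1}^s - m_{t+1}^-), P_t + G_t (P_{t+1}^s - P_{t+1}^-) G_t^\top) \\
&= \mathcal{N}(x_t; m_t^s, P_t^s) \quad \text{where} \\
G_t &:= P_t A^\top (P_{t+1}^-)^{-1} \quad \text{(smoother gain)}
\end{aligned}$$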
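(Kalman) Filter:
$$\begin{aligned}
p(x_t) &= \mathcal{N}(x_t; m_t^-, P_t^-) && \text{predict step} \\
m_t^- &= A m_{t-1} && \text{predictive mean} \\
P_t^- &= A P_{t-1} A^\top + Q && \text{predictive covariance} \\
p(x_t \mid y_t) &= \mathcal{N}(x_t; m_t, P_t) && \text{update step} \\
z_t &= y_t - H m_t^- && \text{innovation residual} \\
S_t &= H P_t^- H^\top + R && \text{innovation covariance} \\
K_t &= P_t^- H^\top S_t^{-1} && \text{Kalman gain} \\
m_t &= m_t^- + K_t z_t && \text{estimation mean} \\
P_t &= (I - K_t H) P_t^- && \text{estimation covariance}
\end{aligned}$$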
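The filter equations translate almost line by line into NumPy. The sketch below implements one predict-and-update step; the function name `kalman_step` and the scalar toy model at the bottom are assumptions made for illustration.

```python
import numpy as np

def kalman_step(m_prev, P_prev, y, A, Q, H, R):
    """One Kalman filter step: predict from (m_prev, P_prev), then update on y."""
    # Predict step
    m_pred = A @ m_prev
    P_pred = A @ P_prev @ A.T + Q
    # Update step
    z = y - H @ m_pred                       # innovation residual
    S = H @ P_pred @ H.T + R                 # innovation covariance
    # K = P_pred H^T S^{-1}, computed with a solve (S and P_pred are symmetric)
    K = np.linalg.solve(S, H @ P_pred).T
    m = m_pred + K @ z
    P = (np.eye(len(m)) - K @ H) @ P_pred
    return m, P

# Scalar toy model: static state, unit prior variance, unit observation noise.
A = np.eye(1)
Q = np.zeros((1, 1))
H = np.eye(1)
R = np.eye(1)
m, P = kalman_step(np.zeros(1), np.eye(1), np.array([2.0]), A, Q, H, R)
```

In this toy case the gain is 1/2, so the estimate moves halfway from the prior mean 0 to the observation 2 and the variance halves.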
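In another example, by setting $F = -\frac{1}{\lambda}$, $L = \frac{\sqrt{2} \, \theta}{\sqrt{\lambda}}$, the SDE yields the Ornstein-Uhlenbeck process:
$$m(t) = x_0 \, e^{-\frac{t - t_0}{\lambda}}, \qquad k(t_a, t_b) = \theta^2 \left( e^{-\frac{|t_a - t_b|}{\lambda}} - e^{\frac{2 t_0 - t_a - t_b}{\lambda}} \right)$$
Margin note: $\det e^{X} = e^{\operatorname{tr} X}$.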
Condensed content
• Markov Chains capture finite memory of a time series through conditional independence
• Gauss-Markov models map this state to linear algebra
• Kalman filter is the name for the corresponding algorithm
For more on Gaussian and approximately Gaussian filters see, e.g., Simo Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013. https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf
to be fuzzy.
be found.
exist.
We can use the logistic function on top of a Gaussian Process regression model to adapt it for classification, thus creating logistic regression. In particular, take a Gaussian Process prior over $f$, $p(f) = \mathcal{GP}(f; m, k)$, and use the likelihood
$$p(y \mid f_x) = \sigma(y f_x) = \begin{cases} \sigma(f) & \text{if } y = 1, \\ 1 - \sigma(f) & \text{if } y = -1. \end{cases}$$
Margin note: as is its derivative $\frac{\partial \pi_f}{\partial f} = \pi_f (1 - \pi_f)$.
$$\log p(f_X \mid Y) = -\frac{1}{2} f_X^\top K_{XX}^{-1} f_X + \sum_{i=1}^{n} \log \sigma(y_i f_{x_i}) + \text{const.}$$
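$$\begin{aligned}
\text{The evidence:} \quad & \mathbb{E}_{p(y, f)}[1] = \int 1 \cdot p(y, f) \, df = \int p(y, f) \, df = Z, \\
\text{The mean:} \quad & \mathbb{E}_{p(f \mid y)}[f] = \int f \cdot p(f \mid y) \, df = \frac{1}{Z} \int f \cdot p(f, y) \, df = \bar{f}, \\
\text{The variance:} \quad & \mathbb{E}_{p(f \mid y)}[f^2] - \bar{f}^2 = \int f^2 \cdot p(f \mid y) \, df - \bar{f}^2 = \frac{1}{Z} \int f^2 \cdot p(f, y) \, df - \bar{f}^2 = \operatorname{var}(f).
\end{aligned}$$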
Recall that Z can be useful for parameter tuning, f¯ provides a
useful point estimate, and var( f ) is a good estimate of the error
around f¯.
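$$\begin{aligned}
q(f_x \mid y) &= \int p(f_x \mid f_X) \, q(f_X) \, df_X \\
&= \int \mathcal{N}\left( f_x; \; m_x + K_{xX} K_{XX}^{-1} (f_X - m_X), \; K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx} \right) q(f_X) \, df_X \\
&= \mathcal{N}\left( f_x; \; m_x + K_{xX} K_{XX}^{-1} (\hat{f} - m_X), \; K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx} + K_{xX} K_{XX}^{-1} \Sigma K_{XX}^{-1} K_{Xx} \right).
\end{aligned}$$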
Finally, all we need is an optimizer which will find the local max-
imum of the log-posterior. Practically, any optimizer that follows
the gradient will do the job. For the sake of illustration, we will
consider the second order Newton Optimization method, since it is
very efficient for convex optimization problems such as ours.
procedure GP-Logistic-Train(K_XX, m_X, y)
    f ← m_X                                         ▷ initialize
    while not converged do
        r ← (y + 1)/2 − σ(f) = ∇ log p(y | f_X)     ▷ gradient of log likelihood
Condensed content
Let’s quickly recap what we did in the previous lecture. For the
purpose of classification, we made extensive use of the sigmoid link
function:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad \text{with} \quad \frac{d\sigma(x)}{dx} = \sigma(x) (1 - \sigma(x))$$
We were interested in extending Gaussian Process regression to the
classification setting, by constructing a Gaussian approximation
q( f X | y) for the (usually non-Gaussian) posterior distribution at
the training points. In order to find the mode of the posterior, we
had to compute the gradient of the log-posterior and set it to zero:
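$$\begin{aligned}
\nabla \log p(f_X \mid y) &= \sum_{i=1}^{n} \nabla \log \sigma(y_i f_{x_i}) - K_{XX}^{-1} (f_X - m_X) = 0 \\
\Rightarrow \quad K_{XX}^{-1} (f_X - m_X) &= \sum_{i=1}^{n} \nabla \log \sigma(y_i f_{x_i}) = \nabla \log p(y \mid f_X) =: r
\end{aligned}$$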
Then at test time, recall that the mode of the approximation is given by:
$$\mathbb{E}_q(f_x) = m_x + k_{xX} K_{XX}^{-1} (\hat{f}_X - m_X) = m_x + k_{xX} r$$
Notice that in the last expression, the expected function value for a test point explicitly depends on the gradients of the log-likelihood of the training points.
In particular, observe the dashed-blue line in Fig. 51 – the two training points in the middle have the largest value for the gradient. In turn, these points will have the highest contribution in determining the expected function value for a new test point.
Figure 51: Towards Support Vector Machines. Notice how the two points around zero would already provide a strong support for a good classifier.
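The recap can be turned into a short numerical sketch. The code below finds the posterior mode with Newton iterations, written in the rearranged form $f \leftarrow m_X + K (I + W K)^{-1} (W (f - m_X) + r)$, which avoids inverting $K$ explicitly; this rearrangement, the stopping rule, the square-exponential kernel and the toy data are assumptions for illustration and are not taken from the notes.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def gp_logistic_mode(K, m, y, n_iter=50, tol=1e-10):
    """Newton iterations for the mode of p(f_X | y); labels y are in {-1, +1}."""
    f = m.copy()
    n = len(y)
    for _ in range(n_iter):
        pi = sigmoid(f)
        r = (y + 1) / 2 - pi                 # gradient of the log likelihood
        W = np.diag(pi * (1 - pi))           # negative Hessian of the log likelihood
        f_new = m + K @ np.linalg.solve(np.eye(n) + W @ K, W @ (f - m) + r)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    r = (y + 1) / 2 - sigmoid(f)             # gradient at the mode
    return f, r

# Hypothetical 1-D training set.
X = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
m = np.zeros(len(X))

f_hat, r = gp_logistic_mode(K, m, y)

# Test-time mean with zero prior mean: E_q(f_x) = k_xX r
x_test = np.array([-1.0, 1.0])
k_xX = np.exp(-0.5 * (x_test[:, None] - X[None, :]) ** 2)
mean_test = k_xX @ r
```

At convergence the stationarity condition $K_{XX}^{-1} (\hat{f}_X - m_X) = r$ holds, and the entries of `r` with the largest magnitude belong to the training points closest to the decision boundary.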
So, $x_i$ with $|f_i| \gg 1$, where $\nabla \log p(y_i \mid f_i) \approx 0$, contribute almost nothing to $\mathbb{E}_q(f_x)$. On the other hand, the $x_i$ with $|f_i| < 1$ can be considered as "support points".
This realization leads us to the idea to try and make the connection between GP classification and Support Vector Machines.
$$-\log p(f_X \mid y) = \sum_i -\log \sigma(y_i f_i) + \|f_X\|^2_{K_{XX}}$$
By explicitly setting the loss term to the Hinge Loss $\ell(y_i; f_i) = [1 - y_i f_i]_+$, we obtain the Support Vector Machine learning algorithm.
At this point, one could postulate that the Hinge Loss is the limiting object for the log likelihood (see Fig. 52), for which the gradient for $f > 1$ is zero.
This would describe the phenomenon of support points in the previous example of GP classification. In turn, this would mean that we would have constructed a probabilistic interpretation of Support Vector Machines.
Unfortunately, that is not the case, as the Hinge loss is not a log-likelihood: the implied "likelihood" $\exp(-\ell)$ does not sum to a constant over the two labels,
$$\exp(-\ell(y_i; f_i)) + \exp(-\ell(-y_i; f_i)) = \exp(-[1 - y_i f_i]_+) + \exp(-[1 + y_i f_i]_+) \neq \text{const}.$$
(dotted black line). This is a necessary requirement, since the likelihood is a probability distribution in the observed data (but not necessarily the latent parameters).
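The normalization argument can be checked numerically. The sketch below compares the sum over both labels for the logistic likelihood and for the "likelihood" implied by the hinge loss; the grid of `f` values is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def hinge(y, f):
    """Hinge loss [1 - y f]_+."""
    return np.maximum(0.0, 1.0 - y * f)

f = np.linspace(-2.0, 2.0, 9)

# Sum over both labels y in {+1, -1}.
logistic_total = sigmoid(f) + sigmoid(-f)
hinge_total = np.exp(-hinge(+1, f)) + np.exp(-hinge(-1, f))
```

`logistic_total` is identically one, whereas `hinge_total` changes with `f`, so no single normalization constant turns the hinge loss into a negative log-likelihood.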
the form:
$$p(y \mid f_T) = \mathcal{N}(y; f_T, \sigma^2 I), \qquad p(f) = \mathcal{GP}(f; 0, k)$$
One crucial aspect that we notice from the data is that the count
values in the first 40 days since the outbreak differ significantly
from the rest of the days (observe the values of the solid blue line,
ignore the noise bands for now).
In the standard GP regression, our likelihood assumed that
the noise had the same scale σ across all observations. Ideally, we
would like to represent the values in the first 40 days with higher
uncertainty.
In order to account for this, we perform Laplace approximation
on the likelihood:
Now, with this new approximation for the likelihood, we can ac-
count for different uncertainties at different points in time.
Hyperparameter optimization
First, we show that computing evidences (marginal likelihoods)
is possible, which could then be used to perform hyperparameter
optimization. Let’s start by writing down the evidence term:
$$p(y \mid X) = \int p(y, f \mid X) \, df = \int \exp\left( \log p(y \mid f) \, p(f \mid X) \right) df$$
In the standard GP setting, both the likelihood and the prior were Gaussian, so computing this integral only involved linear algebra. However, in the case of Generalized Linear Models, this is not the case, as the likelihood is typically non-Gaussian. For this reason, we construct a Laplace approximation:
$$\log p(y \mid f) \, p(f \mid X) \approx \log p(y \mid \hat{f}) \, p(\hat{f} \mid X) - \frac{1}{2} (f - \hat{f})^\top (K^{-1} + W) (f - \hat{f}) = \log q(y, f \mid X)$$
From there, we have:
$$\begin{aligned}
p(y \mid X) \approx q(y \mid X) &= \exp\left( \log p(y \mid \hat{f}) \, p(\hat{f} \mid X) \right) \int \exp\left( -\frac{1}{2} (f - \hat{f})^\top (K^{-1} + W) (f - \hat{f}) \right) df \\
&= \exp\left( \log p(y \mid \hat{f}) \right) \mathcal{N}(\hat{f}; m_X, k_{XX}) \, (2\pi)^{n/2} \, |(K^{-1} + W)^{-1}|^{1/2} \\
\log q(y \mid X) &= \log p(y \mid \hat{f}) - \frac{1}{2} (\hat{f} - m_X)^\top K_{XX}^{-1} (\hat{f} - m_X) - \frac{1}{2} \log\left( |K| \cdot |K^{-1} + W| \right) \\
&= \sum_{i=1}^{n} \log \sigma(y_i \hat{f}_{x_i}) - \frac{1}{2} (\hat{f} - m_X)^\top K_{XX}^{-1} (\hat{f} - m_X) - \frac{1}{2} \log |B|
\end{aligned}$$
From there, we only need to compute the gradient and the Hessian:
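$$\nabla \log p(v \mid y) = \sum_{i=1}^{n} \nabla \log \sigma(y_i \phi_{x_i}^\top v) - \Sigma^{-1} (v - \mu) \quad \text{with} \quad \frac{\partial \log \sigma(y_i \phi_{x_i}^\top v)}{\partial v_j} = [\phi_{x_i}]_j \left( \frac{y_i + 1}{2} - \sigma(\phi_{x_i}^\top v) \right)$$
$$\nabla \nabla^\top \log p(v \mid y) = \sum_{i=1}^{n} \nabla \nabla^\top \log \sigma(y_i \phi_{x_i}^\top v) - \Sigma^{-1} \quad \text{with} \quad \frac{\partial^2 \log \sigma(y_i \phi_{x_i}^\top v)}{\partial v_a \partial v_b} = -[\phi_{x_i}]_a [\phi_{x_i}]_b \underbrace{\sigma(\phi_{x_i}^\top v) (1 - \sigma(\phi_{x_i}^\top v))}_{=: w_i}$$
$$=: -(W + \Sigma^{-1}) = -\left( \phi_X \operatorname{diag}(w) \, \phi_X^\top + \Sigma^{-1} \right) \in \mathbb{R}^{F \times F}$$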
Deep Learning
Consider a deep feedforward neural network:
$$p(y \mid W) = \prod_{i=1}^{n} \sigma(f_W(x_i))$$
$$f_W(x) = w_L^\top \phi(w_{L-1} \phi(\dots (w_1 x) \dots))$$
W i −1 W
$$\approx \mathcal{N}(f(x); f_{W^*}(x), G(x) \Psi G(x)^\top) =: \mathcal{N}(f(x); m(x), v(x))$$
statement:
Condensed content
• arise if the empirical risk has zero gradient for large values of f
(→ hinge-loss)
• Laplace approximations are not for free, but feasible for many
deep models, and easy to implement
Exponential Family v
Conjugate Priors v
We will extend this idea with the concept of conjugate priors. A prior
is said to be conjugate to a likelihood if the posterior arising from
the combination of the likelihood and the conjugate prior has the
same form as the prior. The Gaussian prior is the conjugate prior to
the Gaussian likelihood.
$$p(x \mid \pi) = \pi^a (1 - \pi)^b.$$
the prior,
$$p(\pi) = \mathcal{D}(\alpha_1, \dots, \alpha_K) = \frac{1}{B(\alpha_1, \dots, \alpha_K)} \prod_k \pi_k^{\alpha_k - 1},$$
Exponential Family v
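$$\nabla_w \log p(x_1, \dots, x_n \mid w) = 0 \quad \Rightarrow \quad \nabla_w \log Z(w) = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i).$$
$$\begin{aligned}
\nabla_w \int p_w(x \mid w) \, dx &= \int \nabla_w p_w(x \mid w) \, dx \\
&= \int \phi(x) \, dp_w(x \mid w) - \nabla_w \log Z(w) \int dp_w(x \mid w) = 0
\end{aligned}$$
$$p(x) \approx \hat{p}(x \mid w) = \exp\left( \phi(x)^\top w - \log Z(w) \right)$$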
Condensed Content
• They also allow analytic MAP inference from only a finite set of
sufficient statistics.
Repeated observations and hyperparameters can be expressed using some syntactic sugar to make it easier to draw complex graphical models. A box with sharp edges drawn around a set of nodes and labeled with a number n is called a plate and denotes n copies of the content of the box. A small filled circle denotes a (hyper-)parameter that is set or optimized, and which is not part of the generative model.
Figure 60: Plates and Hyperparameters
$$p(y, w) = \prod_{i=1}^{n} \mathcal{N}(y_i; \phi(x_i)^\top w, \sigma^2) \, \mathcal{N}(w; \mu, \Sigma)$$
• the arrows meet head-to-head at the node, and neither the node,
nor any of its descendants is in C.
the graph.
Essentially, MRFs allow for a more compact definition of conditional independence compared to directed graphs. Nevertheless, the associated joint probability distribution cannot be easily read from the graph.
Figure 64: Markov Blanket for a Markov Random Field
$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(\{x_i \in c\}).$$
p( x ) = exp(− E( x ))
Condensed content
• When you want to model a process for which you have a “sci-
entific” theory or some generative knowledge, writing down the
directed model is a good start.
• However, reading off the joint from the graph is tricky as it re-
quires calculating the normalization constant, which is usually
intractable.
• When your model has millions of parameters, and you are more
worried about computational complexity than interpretability,
the conditional independence structure of MRFs can help keep-
ing things tractable.
Factor graphs v
$$\begin{aligned}
p(x) &= p(x_1) \, p(x_2 \mid x_1) \dots p(x_n \mid x_{n-1}), \\
&= \frac{1}{Z} \psi_{1,2}(x_1, x_2) \dots \psi_{n-1,n}(x_{n-1}, x_n).
\end{aligned}$$
Factor Graphs
Factor Graphs are an explicit representation of functional relationships;
$$p(x) = \prod_{ch} p(x_{ch} \mid x_{pa(ch)}),$$
draw a circle for each variable $x_i$, a box for each conditional in the factorization and connect each $x_i$ to the factorizations it appears in.
$$p(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(\{x_i \in c\}),$$
draw a circle for each variable xi , a box for each factor (clique) ψc
and connect each ψc to the variables used in the factor.
[Figure: directed graphs, undirected graphs and factor graphs for the factorizations $p(x_1, x_2, x_3) = p(x_3 \mid x_1, x_2) \, p(x_1) \, p(x_2)$, $p(x_1, x_2, x_3) = p(x_1, x_2 \mid x_3) \, p(x_3)$, $p(x) = p_{23}(x_2, x_3 \mid x_1) \, p(x_1)$ and $p(x) = p_2(x_2 \mid x_1) \, p_3(x_3 \mid x_1) \, p(x_1)$.]
The graphical view itself does not always capture the entire
structure. Nevertheless, when factor graphs are encoded with an
explicit functional form, part of the structure can be automatically
deduced and used for inference. For this purpose, we introduce the
Sum-Product algorithm.
The Sum-Product Algorithm v
$$\begin{aligned}
p(x_i) &= \sum_{x_{\neq i}} p(x_1, \dots, x_n), \\
&= \frac{1}{Z} \underbrace{\Big( \sum_{x_{i-1}} \psi_{i-1,i}(x_{i-1}, x_i) \dots \sum_{x_0} \psi_{0,1}(x_0, x_1) \Big)}_{:= \mu_{\rightarrow}(x_i)} \underbrace{\Big( \sum_{x_{i+1}} \psi_{i,i+1}(x_i, x_{i+1}) \dots \sum_{x_n} \psi_{n-1,n}(x_{n-1}, x_n) \Big)}_{:= \mu_{\leftarrow}(x_i)}, \\
&= \frac{1}{Z} \mu_{\rightarrow}(x_i) \, \mu_{\leftarrow}(x_i),
\end{aligned}$$
with $Z = \sum_{x_i} \mu_{\rightarrow}(x_i) \mu_{\leftarrow}(x_i)$. The terms $\mu_{\rightarrow}(x_i)$ and $\mu_{\leftarrow}(x_i)$ are called messages, which can be computed recursively.
By storing local messages, all marginals can be computed in $O(n k^2)$, as in filtering and smoothing. Computing a message from the preceding one can be done by taking the sum of the product of the local factors and incoming messages. The local marginal can be computed by taking the sum of the product of incoming messages, hence the name of the algorithm.
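A minimal sketch of this forward-backward scheme on a chain is given below, together with a brute-force reference that enumerates the joint. It assumes that every variable has the same number k of states and that the pairwise factors are given as k × k arrays; the names `chain_marginals` and `brute_force_marginals` are hypothetical.

```python
import numpy as np
from itertools import product

def chain_marginals(psis):
    """Marginals of x_0..x_n on a chain; psis[i][a, b] = psi_{i,i+1}(x_i=a, x_{i+1}=b)."""
    k = psis[0].shape[0]
    forward = [np.ones(k)]                   # mu_->(x_0) = 1
    for psi in psis:
        forward.append(forward[-1] @ psi)    # sum over the left neighbour
    backward = [np.ones(k)]                  # mu_<-(x_n) = 1
    for psi in reversed(psis):
        backward.append(psi @ backward[-1])  # sum over the right neighbour
    backward = backward[::-1]
    unnormalized = [f * b for f, b in zip(forward, backward)]
    return [p / p.sum() for p in unnormalized]

def brute_force_marginals(psis):
    """Reference implementation: build the full joint table, then sum it out."""
    n, k = len(psis) + 1, psis[0].shape[0]
    joint = np.zeros((k,) * n)
    for xs in product(range(k), repeat=n):
        joint[xs] = np.prod([psis[i][xs[i], xs[i + 1]] for i in range(n - 1)])
    joint /= joint.sum()
    return [joint.sum(axis=tuple(j for j in range(n) if j != i)) for i in range(n)]

rng = np.random.default_rng(0)
psis = [rng.random((3, 3)) for _ in range(4)]    # random positive factors
marginals = chain_marginals(psis)
reference = brute_force_marginals(psis)
```

The message-passing version touches each factor twice, while the brute-force version enumerates all $k^{n+1}$ joint states.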
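$$\begin{aligned}
\max_{x_1, \dots, x_n} p(x_1, \dots, x_n) &= \frac{1}{Z} \max_{x_0} \cdots \max_{x_n} \psi_{0,1}(x_0, x_1) \cdots \psi_{n-1,n}(x_{n-1}, x_n) \\
&= \frac{1}{Z} \max_{x_0, x_1} \Big( \psi_{0,1}(x_0, x_1) \cdots \max_{x_n} \psi_{n-1,n}(x_{n-1}, x_n) \Big)
\end{aligned}$$
$$\mu_{x_i \to f_{i,i+1}}(x_i) = \mu_{f_{i-1,i} \to x_i}(x_i)$$
$$x_{i-1}^{\max} = \phi(x_i^{\max})$$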
Sum-Product on Trees
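expanding the joint, we obtain:
$$\begin{aligned}
p(x) &= \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, x_s) \\
&= \prod_{s \in \mathrm{ne}(x)} \underbrace{\Big( \sum_{x_s} F_s(x, x_s) \Big)}_{=: \mu_{f_s \to x}(x)} \\
&= \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)
\end{aligned}$$
Figure 74: Messages from factors to variables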
structured subgraphs:
$$F_s(x, x_s) = f_s(x, x_1, \dots, x_m) \, G_1(x_1, x_{s1}) \cdots G_m(x_m, x_{sm})$$
where { x1 , . . . , xm } are the nodes in xs and xsi are the neighbors of
xi . Then, we obtain the factor-to-variable messages:
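$$= \prod_{\ell \in \mathrm{ne}(x_i) \setminus f_s} \mu_{f_\ell \to x_i}(x_i)$$
$$\mu_{x \to f}(x) = \prod_{\emptyset} \sum_{\emptyset} := 1, \qquad \mu_{f \to x}(x) = \sum_{\emptyset} f(x, \emptyset) \prod_{\emptyset} := f(x).$$
Figure 77: Messages from leaf nodes in the sum-product algorithm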
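$$\mu_{f_\ell \to x_j}(x_j) = \sum_{x_{\ell j}} f_\ell(x_j, x_{\ell j}) \prod_{i \in \{\ell j\} = \mathrm{ne}(f_\ell) \setminus x_j} \mu_{x_i \to f_\ell}(x_i)$$
$$\mu_{x_j \to f_\ell}(x_j) = \prod_{i \in \mathrm{ne}(x_j) \setminus f_\ell} \mu_{f_i \to x_j}(x_j)$$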
To get the marginal of each node, once the root has received all the messages, pass messages from the root back to the leaves. Once every node has received the messages from all its neighbors, take the product of all incoming messages at each variable (and normalize).
This implies that inference on the marginals of all variables in a tree-structured factor graph is linear in graph size.
Incorporating observations
If one or more nodes $x_o$ in the graph are observed ($x_o = \hat{x}_o$), we introduce factors $f(x_i^o) = \delta(x_i^o - \hat{x}_i^o)$ into the graph. This amounts to "clamping" the variables to their observed value.
Say $x := [x_o, x_h]$. Because $p(x_o, x_h) \propto p(x_h \mid x_o)$, the sum-product algorithm can thus be used to compute posterior marginal distributions over the hidden variables $x_h$.
             x2 = 0   x2 = 1   p(x1)
x1 = 0        0.3      0.4      0.7
x1 = 1        0.3      0.0      0.3
p(x2)         0.6      0.4
• max( a + b, a + c) = a + max(b, c)
• log maxx p( x) = maxx log p( x)
Thus, we can compute the most probable state xmax by taking the
sum-product algorithm and replacing all summations with maxi-
mizations (the max-product algorithm). For numerical stability, we
can further replace all products of p with sums of log p (the max-
sum algorithm). The only complication is that, if we also want to
know the arg max, we have to track it separately using an addi-
tional data structure.
$$\mu_{f_\ell \to x_j}(x_j) = \max_{x_{\ell j}} \Big[ \log f_\ell(x_j, x_{\ell j}) + \sum_{i \in \{\ell j\} = \mathrm{ne}(f_\ell) \setminus x_j} \mu_{x_i \to f_\ell}(x_i) \Big]$$
$$\mu_{x_j \to f_\ell}(x_j) = \sum_{i \in \mathrm{ne}(x_j) \setminus f_\ell} \mu_{f_i \to x_j}(x_j)$$
4. once the root has messages from all its neighbors, pass messages from the root towards the leaves. At each factor node, set $x_{\ell j}^{\max} = \phi(x_j)$ (this is known as backtracking).
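On a chain, max-sum with backtracking reduces to the familiar Viterbi recursion. The sketch below is one possible implementation under the assumption that all variables share the same number of states; `chain_max_sum` and `brute_force_map` are hypothetical names, and the latter is only an exhaustive reference for small problems.

```python
import numpy as np
from itertools import product

def chain_max_sum(log_psis):
    """Most probable state of a chain; log_psis[i][a, b] = log psi(x_i=a, x_{i+1}=b)."""
    k = log_psis[0].shape[0]
    msg = np.zeros(k)                  # message arriving at the leaf x_0
    backpointers = []
    for lp in log_psis:
        scores = msg[:, None] + lp     # scores[a, b]: best score ending in x_i=a, x_{i+1}=b
        backpointers.append(scores.argmax(axis=0))   # phi: best x_i for every x_{i+1}
        msg = scores.max(axis=0)       # max replaces the sum of sum-product
    x = [int(msg.argmax())]            # maximizing state at the root x_n
    for bp in reversed(backpointers):  # backtracking from the root to the leaf
        x.append(int(bp[x[-1]]))
    return x[::-1], float(msg.max())

def brute_force_map(log_psis):
    """Reference: enumerate every joint configuration."""
    n, k = len(log_psis) + 1, log_psis[0].shape[0]
    score = lambda xs: sum(log_psis[i][xs[i], xs[i + 1]] for i in range(n - 1))
    best = max(product(range(k), repeat=n), key=score)
    return list(best), float(score(best))

rng = np.random.default_rng(1)
log_psis = [np.log(rng.random((3, 3))) for _ in range(4)]
x_max, best_score = chain_max_sum(log_psis)
x_ref, ref_score = brute_force_map(log_psis)
```

The list `backpointers` is the additional data structure mentioned above: it stores, for each value of a variable, the maximizing value of its predecessor.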
Condensed content
[Figure: $X$ (D documents × V words) $\sim$ $Q$ (D documents × K topics) $\times$ $U^\top$]
Dimensionality reduction v
$$Z := \phi(X) \in \mathbb{R}^{D \times K}$$
$$L(X, \psi(Z)) = L(X, \psi \circ \phi(X))$$
• save memory
• “find structure”
Linear PCA
Let us derive the famous Principal Component Analysis (PCA) algorithm. Again, consider a dataset $X \in \mathbb{R}^{D \times V}$. Furthermore, consider an orthonormal basis $\{u_i\}_{i=1,\dots,V}$, $u_i^\top u_j = \delta_{ij}$. Then, we can represent any point $x_d$ as a linear combination of the projections onto the orthonormal basis:
$$x_d = \sum_{i=1}^{V} (x_d^\top u_i) u_i =: \sum_{i=1}^{V} \alpha_{di} u_i \quad \text{or simply vectorized} \quad X = (X U) U^\top$$
$$\tilde{x}_d := \sum_{k=1}^{K} a_{dk} u_k + \sum_{\ell=K+1}^{V} b_\ell u_\ell$$
First, let’s find adk and b j . Since the vectors u are orthonormal, recall
that ∑ j uij ukj = δik . Then, we simply differentiate with respect to the
parameters that we wish to optimize, and set the derivatives to zero
in order to obtain their optimal values:
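$$\begin{aligned}
\frac{\partial J}{\partial a_{d\ell}} &= \frac{2}{D} \sum_{v=1}^{V} \Big( x_d - \sum_{k=1}^{K} a_{dk} u_k - \sum_{j=K+1}^{V} b_j u_j \Big)_v (-u_{\ell v}) \\
&= \frac{2}{D} (-x_d^\top u_\ell) + \frac{2}{D} a_{d\ell} \overset{!}{=} 0 \\
&\Longrightarrow \quad a_{dk} = x_d^\top u_k
\end{aligned}$$
$$\begin{aligned}
\frac{\partial J}{\partial b_\ell} &= \frac{2}{D} \sum_{d=1}^{D} \sum_{v=1}^{V} \Big( x_d - \sum_{k=1}^{K} a_{dk} u_k - \sum_{j=K+1}^{V} b_j u_j \Big)_v (-u_{\ell v}) \\
&= \frac{2}{D} \sum_{d=1}^{D} (-x_d^\top u_\ell) + 2 b_\ell \overset{!}{=} 0 \\
&\Longrightarrow \quad b_j = \bar{x}^\top u_j \quad \text{where} \quad \bar{x} := \frac{1}{D} \sum_d x_d
\end{aligned}$$
$$\begin{aligned}
x_d - \tilde{x}_d &= x_d - \sum_{k=1}^{K} a_{dk} u_k - \sum_{j=K+1}^{V} b_j u_j = \sum_{\ell=1}^{V} (x_d^\top u_\ell) u_\ell - \sum_{k=1}^{K} (x_d^\top u_k) u_k - \sum_{j=K+1}^{V} (\bar{x}^\top u_j) u_j \\
&= \sum_{\ell=1}^{K} (x_d^\top u_\ell) u_\ell - \sum_{k=1}^{K} (x_d^\top u_k) u_k + \sum_{\ell=K+1}^{V} (x_d^\top u_\ell) u_\ell - \sum_{j=K+1}^{V} (\bar{x}^\top u_j) u_j \\
&= \sum_{j=K+1}^{V} \left( (x_d - \bar{x})^\top u_j \right) u_j
\end{aligned}$$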
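Using this result, along with the following notation for the sample covariance matrix $S := \frac{1}{D} \sum_{d=1}^{D} (x_d - \bar{x})(x_d - \bar{x})^\top$, we obtain:
$$\begin{aligned}
J &= \frac{1}{D} \sum_{d=1}^{D} \|x_d - \tilde{x}_d\|^2 = \frac{1}{D} \sum_{d=1}^{D} \sum_{j=K+1}^{V} \left( (x_d - \bar{x})^\top u_j \right)^2 \\
&= \frac{1}{D} \sum_{j=K+1}^{V} \sum_{d=1}^{D} u_j^\top (x_d - \bar{x})(x_d - \bar{x})^\top u_j \\
&= \sum_{j=K+1}^{V} u_j^\top S u_j
\end{aligned}$$
$$\tilde{x}_d := \sum_{k=1}^{K} a_{dk} u_k + \sum_{j=K+1}^{V} b_j u_j = \sum_{i=1}^{K} (x_d^\top u_i) u_i + \sum_{i=K+1}^{V} (\bar{x}^\top u_i) u_i$$
$$J = \sum_{j=K+1}^{V} \lambda_j$$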
$$\hat{X} = X - \mathbf{1} \bar{x}^\top \quad \text{(resulting in } b = 0\text{)}$$
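The result can be checked with a few lines of NumPy: center the data, take the eigenvectors of the sample covariance with the K largest eigenvalues, reconstruct, and compare the reconstruction error with the sum of the discarded eigenvalues. The random data and the function name `pca_reconstruct` are assumptions for illustration; this is not the notes' own implementation.

```python
import numpy as np

def pca_reconstruct(X, K):
    """Rank-K PCA reconstruction of the rows of X, with error J and spectrum."""
    x_bar = X.mean(axis=0)
    X_hat = X - x_bar                          # centering, so that b = 0
    S = X_hat.T @ X_hat / X.shape[0]           # sample covariance (1/D normalization)
    lam, U = np.linalg.eigh(S)                 # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]              # reorder: largest eigenvalue first
    lam, U = lam[order], U[:, order]
    U_K = U[:, :K]
    X_tilde = x_bar + (X_hat @ U_K) @ U_K.T    # keep only the K leading directions
    J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
    return X_tilde, J, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # hypothetical dataset
K = 2
X_tilde, J, lam = pca_reconstruct(X, K)
discarded = lam[K:].sum()
```

`J` and `discarded` coincide up to rounding, which is the statement $J = \sum_{j=K+1}^{V} \lambda_j$.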
Probabilistic PCA v
We have seen several times that various statistical algorithms have a probabilistic interpretation. In this section, we explore the probabilistic aspects of PCA, in order to better understand its implicit assumptions.
$$J = -c \cdot \log p(X \mid \tilde{X}) + \log Z = \frac{1}{D} \sum_{d=1}^{D} \|x_d - \tilde{x}_d\|^2$$
$$p(X \mid \tilde{X}) = \prod_{d=1}^{D} \mathcal{N}(x_d; \tilde{x}_d, \sigma^2 I)$$
ing graphical model can be seen in Fig. 80. That being said, the marginal likelihood can be formulated as:
Figure 80: A graphical model of probabilistic PCA.
$$p(X) = \int \prod_{d=1}^{D} p(x_d \mid a_d) \, p(a_d) \, da_d = \prod_d \mathcal{N}(x_d; \mu, C)$$
$$\log p(X) = -\frac{DV}{2} \log(2\pi) - \frac{D}{2} \log |C| - \frac{1}{2} \sum_{d=1}^{D} (x_d - \mu)^\top C^{-1} (x_d - \mu)$$
Thus, by plugging the maximizer back in, the maximum (log) likelihood can be written as:
$$\log p(X) = -\frac{D}{2} \left( V \log(2\pi) + \log |C| + \operatorname{tr}(C^{-1} S) \right)$$
where S is again the sample covariance matrix. Furthermore, it can
be shown that the maximum likelihood estimates for V and σ2 are:
Primary results v
Now that we have obtained the first tool to analyze our dataset
with, let us see the primary results. If we ignore the preprocessing
steps, the implementation is a one-line solution in Python:
For each of the samples, one could postulate about the potential
topics from which the words were generated. However, there are
several problems with our approach:
In the next chapter we look into how one could resolve these issues.
Latent Dirichlet Allocation v
As we discussed the issues of PCA in the last chapter, our goal now
is to create a model with the following properties:
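[Figure: $W$ (D documents × V words) $\sim$ $\Pi$ (D documents × K topics) $\times$ $\Theta$ (K topics × V words)]
[Figure: graphical model with nodes $\alpha_d$, $\pi_d$, $c_{di}$, $w_{di}$, $\theta_k$, $\beta_k$ and plates $i = [1, \dots, I_d]$, $d = [1, \dots, D]$, $k = [1, \dots, K]$]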
Designing an algorithm v
For now, let us focus on the terms colored in blue. By further ex-
panding the Dirichlet and utilizing the notation for ndkv mentioned
above, we obtain:
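$$\begin{aligned}
p(\Pi \mid \alpha) \cdot p(C \mid \Pi) &= \prod_{d=1}^{D} \mathcal{D}(\pi_d; \alpha_d) \cdot \prod_{d=1}^{D} \prod_{i=1}^{I_d} \Big( \prod_{k=1}^{K} \pi_{dk}^{c_{dik}} \Big) \\
&= \prod_{d=1}^{D} \Big( \frac{\Gamma(\sum_k \alpha_{dk})}{\prod_k \Gamma(\alpha_{dk})} \prod_{k=1}^{K} \pi_{dk}^{\alpha_{dk} - 1} \Big) \cdot \prod_{d=1}^{D} \prod_{i=1}^{I_d} \Big( \prod_{k=1}^{K} \pi_{dk}^{c_{dik}} \Big) \\
&= \prod_{d=1}^{D} \Big( \frac{\Gamma(\sum_k \alpha_{dk})}{\prod_k \Gamma(\alpha_{dk})} \prod_{k=1}^{K} \pi_{dk}^{\alpha_{dk} - 1 + n_{dk\cdot}} \Big)
\end{aligned}$$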
One could perform the similar steps for the terms colored in green.
Then, we can formulate the joint as follows:
p(C, Π, Θ, W ) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)
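$$= \prod_{d=1}^{D} \Big( \frac{\Gamma(\sum_k \alpha_{dk})}{\prod_k \Gamma(\alpha_{dk})} \prod_{k=1}^{K} \pi_{dk}^{\alpha_{dk} - 1 + n_{dk\cdot}} \Big) \cdot \prod_{k=1}^{K} \Big( \frac{\Gamma(\sum_v \beta_{kv})}{\prod_v \Gamma(\beta_{kv})} \prod_{v=1}^{V} \theta_{kv}^{\beta_{kv} - 1 + n_{\cdot kv}} \Big)$$
$$p(C \mid \Theta, \Pi, W) = \frac{p(W, C, \Theta, \Pi)}{\sum_C p(W, C, \Theta, \Pi)} = \prod_{d=1}^{D} \prod_{i=1}^{I_d} \frac{\prod_{k=1}^{K} (\pi_{dk} \theta_{k w_{di}})^{c_{dik}}}{\sum_{k'} (\pi_{dk'} \theta_{k' w_{di}})}$$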
Note that this conditional independence can easily be read off from
the graph (see Fig. 83). Recall the definition for a Markov blanket in
directed graphical models: when for a given variable (in our case
C) we condition on its parents (Π), children (W) and co-parents (Θ),
the terms in the variable become independent i.e. factorize.
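$$\begin{aligned}
p(\Theta, \Pi \mid C, W) &= \frac{p(C, W, \Pi, \Theta)}{\int p(\Theta, \Pi, C, W) \, d\Theta \, d\Pi} \\
&= \frac{\prod_d \mathcal{D}(\pi_d; \alpha_d) \prod_k \pi_{dk}^{n_{dk\cdot}} \; \prod_k \mathcal{D}(\theta_k; \beta_k) \prod_v \theta_{kv}^{n_{\cdot kv}}}{p(C, W)} \\
&= \Big( \prod_d \mathcal{D}(\pi_d; \alpha_{d:} + n_{d:\cdot}) \Big) \Big( \prod_k \mathcal{D}(\theta_k; \beta_{k:} + n_{\cdot k:}) \Big)
\end{aligned}$$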
Note that this conditional independence can not be easily read off
from the above graph!
K-Means v
$$k_i = \arg\min_k \|m_k - x_i\|^2$$
$$m_k \leftarrow \frac{1}{R_k} \sum_i r_{ki} x_i, \quad \text{where} \quad R_k = \sum_i r_{ki}.$$
$$J(r, m) = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \|x_i - m_k\|^2.$$
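The two update rules and the objective J can be sketched as follows. The toy data, the initialization from two data points and the function name `kmeans` are assumptions for illustration; in particular, empty clusters simply keep their previous mean here.

```python
import numpy as np

def kmeans(X, m_init, n_iter=20):
    """Alternate hard assignments and mean updates; also record J after each assignment."""
    m = np.array(m_init, dtype=float)
    costs = []
    for _ in range(n_iter):
        # Assignment step: k_i = argmin_k ||m_k - x_i||^2
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=-1)
        k = d2.argmin(axis=1)
        costs.append(d2[np.arange(len(X)), k].sum())   # J(r, m) for the current r, m
        # Update step: m_k <- mean of the points assigned to cluster k
        for j in range(len(m)):
            if np.any(k == j):
                m[j] = X[k == j].mean(axis=0)
    return m, k, costs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])     # two hypothetical blobs
means, assignments, costs = kmeans(X, X[[0, -1]])
```

Since each of the two steps can only lower J, the recorded costs never increase from one iteration to the next.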
Soft K-Means
being maximized:
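$$\begin{aligned}
(r, m) &= \arg\min_{r, m} \sum_i^n \sum_k^K r_{ik} \|x_i - m_k\|^2 \\
&= \arg\max \sum_i^n \sum_k^K r_{ik} \left( -\tfrac{1}{2} \sigma^{-2} \|x_i - m_k\|^2 \right) + \text{const} \\
&= \arg\max \prod_i^n \sum_k^K r_{ik} \exp\left( -\tfrac{1}{2} \sigma^{-2} \|x_i - m_k\|^2 \right) / Z \\
&= \arg\max \prod_i^n \sum_k^K r_{ik} \, \mathcal{N}(x_i; m_k, \sigma^2 I) \\
&= \arg\max p(x \mid m, r)
\end{aligned}$$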
For K clusters with means and variances $(\mu_k, \Sigma_k)_{k=1,\dots,K}$, the generative process first chooses which cluster to draw from with probability $\pi_k$ and then samples from $\mathcal{N}(x; \mu_k, \Sigma_k)$.
The likelihood model can be written as
$$p(x \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k) \quad \text{with} \quad \pi_k \in [0, 1], \; \sum_{k=1}^{K} \pi_k = 1.$$
Figure 89: Generative model matching Fig. 92.
Given a dataset $x_1, \dots, x_n$, we want to learn the generative model $(\pi, \mu, \Sigma)$ (see Fig. 90 for a graphical representation), using the likelihood
$$p(x \mid \pi, \mu, \Sigma) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k).$$
$$p(\pi, \mu, \Sigma \mid x) = \frac{p(x \mid \pi, \mu, \Sigma) \, p(\pi, \mu, \Sigma)}{p(x)},$$
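$$\nabla_{\Sigma_j} \log p(x \mid \pi, \mu, \Sigma) = \frac{1}{2} \sum_i^n \underbrace{\frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{j'} \pi_{j'} \, \mathcal{N}(x_i; \mu_{j'}, \Sigma_{j'})}}_{=: r_{ji}} \left( \Sigma_j^{-1} (x_i - \mu_j)(x_i - \mu_j)^\top \Sigma_j^{-1} - \Sigma_j^{-1} \right)$$
Margin note: $\partial |\Sigma|^{-1/2} / \partial \Sigma = -\frac{1}{2} |\Sigma|^{-3/2} |\Sigma| \Sigma^{-1}$ and $\partial (v^\top \Sigma^{-1} v) / \partial \Sigma = -\Sigma^{-1} v v^\top \Sigma^{-1}$.
$$\nabla_{\Sigma_j} \log p = 0 \quad \Rightarrow \quad \Sigma_j = \frac{1}{R_j} \sum_i^n r_{ji} (x_i - \mu_j)(x_i - \mu_j)^\top, \qquad R_j := \sum_i r_{ji}$$
$$0 = \sum_i^n \frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{j'} \pi_{j'} \, \mathcal{N}(x_i; \mu_{j'}, \Sigma_{j'})} + \lambda \pi_j = \sum_i^n r_{ij} + \lambda \pi_j$$
Now, if we sum the above expression over all j, and use the constraint $\sum_j \pi_j = 1$, we obtain the optimal value for $\lambda$:
$$\lambda = -n, \qquad \pi_j = \frac{R_j}{n}$$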
If we know the responsibilities $r_{ij}$, we can optimize $\mu, \Sigma, \pi$ analytically. And if we know $\mu, \Sigma, \pi$, we can set $r_{ij}$! This leads us to the following algorithm:
2. Set
$$r_{ij} = \frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{j'}^{k} \pi_{j'} \, \mathcal{N}(x_i; \mu_{j'}, \Sigma_{j'})}$$
3. Update
$$R_j = \sum_i^n r_{ji}, \qquad \mu_j = \frac{1}{R_j} \sum_i^n r_{ij} x_i, \qquad \Sigma_j = \frac{1}{R_j} \sum_i^n r_{ij} (x_i - \mu_j)(x_i - \mu_j)^\top, \qquad \pi_j = \frac{R_j}{n}$$
4. Go back to 2.
This algorithm might seem arbitrary at first, but it is a case of the
Expectation-Maximization algorithm, which fits a probabilistic model
by alternating between (1) computing the expectation of some latent
variables – the responsibilities; and (2) maximizing the likelihood of
the parameters – the cluster parameters.
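The alternation between responsibilities and parameter updates can be sketched in NumPy as below. The data, the initialization (unit covariances, means placed on two data points) and the names `gauss_pdf` and `em_gmm` are assumptions for illustration; the log-likelihood is recorded in every iteration so that its behaviour can be inspected.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Density N(x; mu, Sigma) for every row x of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, pi, mu, Sigma, n_iter=30):
    n, k = len(X), len(pi)
    log_liks = []
    for _ in range(n_iter):
        # Responsibilities r_ij (E-step)
        dens = np.stack([pi[j] * gauss_pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
        log_liks.append(np.log(dens.sum(axis=1)).sum())
        r = dens / dens.sum(axis=1, keepdims=True)
        # Parameter updates (M-step)
        R = r.sum(axis=0)
        mu = (r.T @ X) / R[:, None]
        Sigma = np.stack([(r[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / R[j]
                          for j in range(k)])
        pi = R / n
    return pi, mu, Sigma, log_liks

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, size=(100, 2)),
               rng.normal(loc=4.0, size=(100, 2))])    # hypothetical two-cluster data
pi0 = np.array([0.5, 0.5])
mu0 = X[[0, -1]].copy()
Sigma0 = np.stack([np.eye(2), np.eye(2)])
pi, mu, Sigma, log_liks = em_gmm(X, pi0, mu0, Sigma0)
```

The recorded log-likelihood should not decrease from one iteration to the next, which is the characteristic behaviour of EM.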
$$r_{ij} = \frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{j'}^{k} \pi_{j'} \, \mathcal{N}(x_i; \mu_{j'}, \Sigma_{j'})} = \frac{R_j \exp(-\beta \|x_i - m_j\|^2)}{\sum_{j'} R_{j'} \exp(-\beta \|x_i - m_{j'}\|^2)}$$
Expectation Maximization v
Let us note that, even though we have spent a lot of time and energy deriving the probabilistic interpretation of K-Means, this was in fact not our ultimate goal.
We wanted to find a particular algorithmic structure that can be used for probabilistic generative models where it is not straightforward to find a maximum likelihood expression in closed form. This is exactly what the EM algorithm achieves.
Figure 91: Graphical model for the Gaussian mixture with latent variables z
Let us first revisit the Gaussian Mixture Model, and introduce the latent variable z so that things simplify. Consider the binary random variable $z_{ij} \in \{0; 1\}$ s.t. $\sum_j z_{ij} = 1$. We define:
$$p(z_{ij} = 1) = \pi_j, \qquad p(x_i \mid z_{ij} = 1) = \mathcal{N}(x_i; \mu_j, \Sigma_j)$$
$$\begin{aligned}
p(z_{ij} = 1 \mid x_i, \mu, \Sigma) &= \frac{p(z_{ij} = 1) \, p(x_i \mid z_{ij} = 1, \mu_j, \Sigma_j)}{\sum_{j'}^{k} p(z_{ij'} = 1) \, p(x_i \mid z_{ij'} = 1, \mu_{j'}, \Sigma_{j'})} \\
&= \frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{j'} \pi_{j'} \, \mathcal{N}(x_i; \mu_{j'}, \Sigma_{j'})} \\
&= r_{ij}
\end{aligned}$$
So it turns out that the responsibilities rij are the marginal posterior
probability ([E]xpectation) for zij = 1! In the previous chapter, we
have seen that if we knew the cluster responsibilities rij , we could
optimize µ, Σ and π analytically, and vice-versa. We did not know
z, so we replaced it with its expectation, leading to the Expectation-
Maximization algorithm, which repeats the two following steps:
Generic EM
The EM algorithm attempts to find maximum likelihood estimates
for models with latent variables. In this section, we describe a more
abstract view of EM which can be extended to other latent variable
models. Let x be the entire set of observed variables and z the entire
set of latent variables. We are interested in finding the maximum
(log) likelihood estimate for the model:
$$
\theta^\star = \arg\max_\theta \log p(x \mid \theta) = \arg\max_\theta \log \left( \sum_z p(x, z \mid \theta) \right)
$$
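1. Compute $p(z \mid x, \theta)$:
$$
p(z_{ij} = 1 \mid x_i, \mu, \Sigma)
  = \frac{p(z_{ij} = 1) \, p(x_i \mid z_{ij} = 1)}{\sum_{j'}^{k} p(z_{ij'} = 1) \, p(x_i \mid z_{ij'} = 1)}
  = \frac{\pi_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}{\sum_{j'} \pi_{j'} \, \mathcal{N}(x_i; \mu_{j'}, \Sigma_{j'})} =: r_{ij}
$$
2. Maximize
$$
\mathbb{E}_{p(z \mid x, \theta)} \left[ \log p(x, z \mid \theta) \right]
  = \sum_i \sum_j r_{ij} \left( \log \pi_j + \log \mathcal{N}(x_i; \mu_j, \Sigma_j) \right)
$$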
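The two steps above can be sketched in a few lines of code. What follows is a minimal illustration for a one-dimensional Gaussian mixture, not the implementation used in the lecture: the function names (`e_step`, `m_step`, `fit_gmm`), the synthetic data, the initial values, and the variance floor `min_var` are all assumptions made for this example. The M-step uses the standard closed-form weighted mean, variance, and mixing weight.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(xs, pis, means, variances):
    """Responsibilities r[i][j] = pi_j N(x_i; mu_j, var_j) / sum_j' pi_j' N(x_i; mu_j', var_j')."""
    resp = []
    for x in xs:
        weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, means, variances)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    return resp


def m_step(xs, resp, min_var=1e-6):
    """Closed-form maximiser of sum_i sum_j r_ij (log pi_j + log N(x_i; mu_j, var_j))."""
    n, k = len(xs), len(resp[0])
    pis, means, variances = [], [], []
    for j in range(k):
        r_j = sum(resp[i][j] for i in range(n))
        mean = sum(resp[i][j] * xs[i] for i in range(n)) / r_j
        var = sum(resp[i][j] * (xs[i] - mean) ** 2 for i in range(n)) / r_j
        pis.append(r_j / n)
        means.append(mean)
        variances.append(max(var, min_var))  # assumed floor to avoid collapse
    return pis, means, variances


def log_likelihood(xs, pis, means, variances):
    """Observed-data log likelihood log p(x | theta)."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, means, variances)))
        for x in xs
    )


def fit_gmm(xs, pis, means, variances, n_iter=50):
    """Alternate E- and M-steps; record the log likelihood after each iteration."""
    history = []
    for _ in range(n_iter):
        resp = e_step(xs, pis, means, variances)
        pis, means, variances = m_step(xs, resp)
        history.append(log_likelihood(xs, pis, means, variances))
    return pis, means, variances, history


# Synthetic data from two well-separated clusters (illustrative values).
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
pis, means, variances, history = fit_gmm(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The recorded `history` should be non-decreasing, which is a convenient check when debugging an EM implementation.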
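Convergence of EM
Variational Approximation
$$
\begin{aligned}
L(q) &= \int \prod_i^n q_i(z_i) \left( \log p(x, z) - \sum_i \log q_i(z_i) \right) dz \\
     &= \int q_j(z_j) \left( \int \log p(x, z) \prod_{i \neq j} q_i(z_i) \, dz_i \right) dz_j - \int q_j(z_j) \log q_j(z_j) \, dz_j + \text{const} \\
     &= \int q_j(z_j) \log \tilde{p}(x, z_j) \, dz_j - \int q_j(z_j) \log q_j(z_j) \, dz_j + \text{const}
\end{aligned}
$$
where $\log \tilde{p}(x, z_j) = \mathbb{E}_{q, i \neq j} \left[ \log p(x, z) \right] + \text{const}$.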
which we maximize w.r.t. $q_j$. In turn, this minimizes
$D_{KL}(q(z_j) \,\|\, \tilde{p}(x, z_j))$, thus obtaining the minimum $q_j^*$ with
(left) shows the optimal solution to $D_{KL}(q \,\|\, p)$ (the
“zero-enforcing” or “mode-seeking” direction) and (right) the optimal
solution to $D_{KL}(p \,\|\, q)$ (the “nonzero-enforcing” or
“support-covering” direction).
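The difference between the two directions of the KL divergence can be checked numerically. The sketch below is only an illustration under assumed values: it discretizes a bimodal target p (two unit-variance Gaussians at -4 and 4) on a grid and compares two candidate Gaussians, one sitting on a single mode (`q_mode`) and one matching the overall mean and variance of p (`q_broad`). The grid, the target, and both candidates are choices made for this example, not taken from the lecture.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def normalise(weights):
    """Turn non-negative grid values into a discrete probability vector."""
    total = sum(weights)
    return [w / total for w in weights]


def kl(a, b):
    """Discrete KL divergence D_KL(a || b) = sum_i a_i log(a_i / b_i)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


# Grid on [-15, 15] with spacing 0.01.
grid = [-15.0 + 0.01 * i for i in range(3001)]

# Bimodal target p and two Gaussian candidates q.
p = normalise([0.5 * normal_pdf(x, -4.0, 1.0) + 0.5 * normal_pdf(x, 4.0, 1.0) for x in grid])
q_mode = normalise([normal_pdf(x, 4.0, 1.0) for x in grid])               # covers one mode
q_broad = normalise([normal_pdf(x, 0.0, math.sqrt(17.0)) for x in grid])  # moment-matched to p

kl_q_p = {"mode": kl(q_mode, p), "broad": kl(q_broad, p)}  # D_KL(q || p)
kl_p_q = {"mode": kl(p, q_mode), "broad": kl(p, q_broad)}  # D_KL(p || q)
```

Under these assumptions, `kl_q_p` prefers the single-mode candidate, while `kl_p_q` prefers the broad one, matching the "mode-seeking" and "support-covering" descriptions.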
Condensed content
EM
Variational Inference
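$$
\begin{aligned}
\log q^\star(z) &= \mathbb{E}_{q(\pi, \mu, \Sigma)} \left[ \log p(x, z, \pi, \mu, \Sigma) \right] + \text{const} \\
  &= \mathbb{E}_{q(\pi)} \left[ \log p(z \mid \pi) \right] + \mathbb{E}_{q(\mu, \Sigma)} \left[ \log p(x \mid z, \mu, \Sigma) \right] + \text{const} \\
  &= \sum_n \sum_k z_{nk} \underbrace{\left( \mathbb{E}_{q(\pi)} \left[ \log \pi_k \right] + \frac{1}{2} \mathbb{E}_{q(\mu, \Sigma)} \left[ \log \det(\Sigma_k^{-1}) - (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \right] \right)}_{=: \log \rho_{nk}} + \text{const}
\end{aligned}
$$
$$
q^\star(z) \propto \prod_n \prod_k \rho_{nk}^{z_{nk}}, \quad \text{or, writing } r_{nk} = \frac{\rho_{nk}}{\sum_j \rho_{nj}}, \quad
q^\star(z) = \prod_n \prod_k r_{nk}^{z_{nk}}, \quad \text{with } r_{nk} = \mathbb{E}_{q(z)} [z_{nk}].
$$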
Note that q? (z) factorizes over n, even though we did not impose
this restriction; we only imposed a factorization between z and
π, µ, Σ, which leads to conditional independence.
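Computing these expectations for $\log \rho_{nk}$ can be a bit difficult
to do manually, but can be done given a table of values for
$$
\psi(x) = \frac{\partial}{\partial x} \log \Gamma(x),
$$
where $\psi(x)$ is the digamma function (see
wikipedia.org/wiki/Digamma_function). We need to compute
$$
\mathbb{E}_{\mathcal{D}(\pi; \alpha)} \left[ \log \pi_k \right] = \psi(\alpha_k) - \psi\left( \sum_{k'} \alpha_{k'} \right)
$$
$$
\mathbb{E}_{\mathcal{W}(\Sigma_k^{-1}; W_k, \nu_k)} \left[ \log \det \Sigma_k^{-1} \right]
  = \sum_{d=1}^{D} \psi\left( \frac{\nu_k + 1 - d}{2} \right) + D \log 2 + \log \det(W_k)
$$
$$
\mathbb{E}_{\mathcal{N}(\mu_k; m_k, \Sigma_k / \beta_k) \, \mathcal{W}(\Sigma_k^{-1}; W_k, \nu_k)} \left[ (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \right]
  = D / \beta_k + \nu_k (x_n - m_k)^\top W_k (x_n - m_k).
$$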
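The first of these expectations is easy to check numerically. The sketch below is an illustration only: `digamma` is a hand-written approximation (recurrence plus the standard asymptotic series), used here instead of a table so that the example needs nothing beyond the standard library, and the Dirichlet parameters and sample count are arbitrary choices. The Monte Carlo estimate serves purely as a sanity check of the closed form.

```python
import math
import random


def digamma(x):
    """Approximate digamma function psi(x) for x > 0."""
    result = 0.0
    while x < 6.0:  # recurrence psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # asymptotic series: log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def expected_log_dirichlet(alpha):
    """E[log pi_k] under Dirichlet(alpha): psi(alpha_k) - psi(sum_k alpha_k)."""
    psi_total = digamma(sum(alpha))
    return [digamma(a) - psi_total for a in alpha]


def monte_carlo_expected_log(alpha, n_samples=20000, seed=0):
    """Estimate E[log pi_k] by sampling Dirichlet draws via normalised Gamma variates."""
    rng = random.Random(seed)
    sums = [0.0] * len(alpha)
    for _ in range(n_samples):
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        for k, g in enumerate(draws):
            sums[k] += math.log(g / total)
    return [s / n_samples for s in sums]


alpha = [2.0, 3.0, 5.0]  # illustrative parameters
closed_form = expected_log_dirichlet(alpha)
sampled = monte_carlo_expected_log(alpha)
```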
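$$
\begin{aligned}
\log q^\star(\pi, \mu, \Sigma) &= \mathbb{E}_{q(z)} \left[ \log p(x, z, \pi, \mu, \Sigma) \right] + \text{const} \\
  &= \mathbb{E}_{q(z)} \left[ \log p(\pi) + \sum_k \log p(\mu_k, \Sigma_k) + \log p(z \mid \pi) + \sum_n \log p(x_n \mid z, \mu, \Sigma) \right] + \text{const} \\
  &= \log p(\pi) + \sum_k \log p(\mu_k, \Sigma_k) + \mathbb{E}_{q(z)} \left[ \log p(z \mid \pi) \right] + \sum_n \sum_k \mathbb{E}_{q(z)} [z_{nk}] \log \mathcal{N}(x_n; \mu_k, \Sigma_k) + \text{const}.
\end{aligned}
$$
with $\log \tilde{\pi}_k = \mathbb{E}_{\mathcal{D}(\pi; \alpha_k)} \left[ \log \pi_k \right]$ and
$\log \det \tilde{\Sigma}_k^{-1} = \mathbb{E}_{\mathcal{W}(\Sigma_k^{-1}; W_k, \nu_k)} \left[ \log \det \Sigma_k^{-1} \right]$.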
Let us return to our topic modeling example. Recall that the pos-
terior p(Π, Θ, C | W ) is intractable. Luckily, Variational Inference
provides a method to construct efficient approximations for in-
tractable distributions. So, we desire to find an approximation q
that factorizes:
q(Π, Θ, C ) = q(C ) · q(Π, Θ)
Thus, we obtain:
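$$
q(C) = \prod_{d,i} q(c_{di}) \quad \text{with} \quad q(c_{di}) = \prod_k \tilde{\gamma}_{dik}^{c_{dik}}
\quad \text{where} \quad \tilde{\gamma}_{dik} = \gamma_{dik} \Big/ \sum_k \gamma_{dik}
$$
$$
\begin{aligned}
\log q^*(\Pi, \Theta) &= \mathbb{E}_{\prod_{d,i} q(c_{di:})} \left[ \sum_{d,k} (\alpha_{dk} - 1 + n_{dk\cdot}) \log \pi_{dk} + \sum_{k,v} (\beta_{kv} - 1 + n_{\cdot kv}) \log \theta_{kv} \right] + \text{const} \\
  &= \sum_{d=1}^{D} \sum_{k=1}^{K} \left( \alpha_{dk} - 1 + \mathbb{E}_{q(C)}(n_{dk\cdot}) \right) \log \pi_{dk}
   + \sum_{k=1}^{K} \sum_{v=1}^{V} \left( \beta_{kv} - 1 + \mathbb{E}_{q(C)}(n_{\cdot kv}) \right) \log \theta_{kv} + \text{const}
\end{aligned}
$$
$$
q^*(\Pi, \Theta) = \prod_{d=1}^{D} \mathcal{D}\left( \pi_d; \tilde{\alpha}_{d:} := [\alpha_{d:} + \tilde{\gamma}_{d\cdot:}] \right)
  \cdot \prod_{k=1}^{K} \mathcal{D}\left( \theta_k; \tilde{\beta}_{kv} := \Big[ \beta_{kv} + \sum_{d}^{D} \sum_{i=1}^{I_d} \tilde{\gamma}_{dik} \, \mathbb{I}(w_{di} = v) \Big]_{v = 1, \dots, V} \right).
$$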
Last but not least, we could explicitly compute the ELBO. No-
tice from above, that in practice calculating the ELBO isn’t strictly
necessary. However, it could be a useful tool for monitoring the
progress and debugging the algorithm. To compute the ELBO we
need:
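The entropies can be computed from the tabulated values. For the
expectation, we use $\mathbb{E}_{q(C)}(n_{dkv}) = \sum_i \gamma_{dik} \, \mathbb{I}(w_{di} = v)$ and use
$\mathbb{E}_{\mathcal{D}(\pi_d; \tilde{\alpha})}(\log \pi_{dk}) = \psi(\tilde{\alpha}_{dk}) - \psi(\sum_k \tilde{\alpha}_{dk})$ from above.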
procedure LDA(W, α, β)
    γ̃_dik ← Dirichlet_rand(α)                                    // initialize
    L ← −∞
    while L not converged do
        for d = 1, …, D; k = 1, …, K do
            α̃_dk ← α_dk + ∑_i γ̃_dik                              // update document-topics distributions
        end for
        for k = 1, …, K; v = 1, …, V do
            β̃_kv ← β_kv + ∑_{d,i} γ̃_dik I(w_di = v)              // update topic-word distributions
        end for
        for d = 1, …, D; k = 1, …, K; i = 1, …, I_d do
            γ̃_dik ← exp(ψ(α̃_dk) + ψ(β̃_{k,w_di}) − ψ(∑_v β̃_kv))   // update word-topic assignments
            γ̃_dik ← γ̃_dik / γ̃_di·
        end for
        L ← Bound(γ̃, w, α̃, β̃)                                    // update bound
    end while
end procedure
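The procedure above translates fairly directly into code. The following is a rough sketch under simplifying assumptions rather than a reference implementation: the priors are symmetric scalars `alpha` and `beta`, the loop runs for a fixed number of iterations instead of monitoring the bound L, documents are lists of integer word ids, and `digamma` is a hand-written approximation. The function name `lda_vi` and the toy corpus are made up for this example.

```python
import math
import random


def digamma(x):
    """Approximate digamma function psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def lda_vi(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_iter=30, seed=0):
    """Mean-field updates for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    # gamma[d][i][k]: approximate probability that word i of document d belongs to topic k
    gamma = []
    for doc in docs:
        rows = []
        for _ in doc:
            draw = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
            total = sum(draw)
            rows.append([g / total for g in draw])
        gamma.append(rows)

    for _ in range(n_iter):
        # update document-topic Dirichlet parameters
        alpha_t = [[alpha + sum(row[k] for row in gamma[d]) for k in range(n_topics)]
                   for d in range(len(docs))]
        # update topic-word Dirichlet parameters
        beta_t = [[beta] * vocab_size for _ in range(n_topics)]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                for k in range(n_topics):
                    beta_t[k][w] += gamma[d][i][k]
        # update word-topic assignments (normalised in log space for stability)
        psi_beta_sum = [digamma(sum(beta_t[k])) for k in range(n_topics)]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                logits = [digamma(alpha_t[d][k]) + digamma(beta_t[k][w]) - psi_beta_sum[k]
                          for k in range(n_topics)]
                top = max(logits)
                weights = [math.exp(l - top) for l in logits]
                total = sum(weights)
                gamma[d][i] = [wk / total for wk in weights]
    return gamma, alpha_t, beta_t


# Toy corpus: 3 documents over a vocabulary of 4 word ids (illustrative only).
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 0, 2, 3]]
gamma, alpha_t, beta_t = lda_vi(docs, n_topics=2, vocab_size=4)
```

Subtracting the largest logit before exponentiating does not change the normalised result; it only guards against underflow.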
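Customizing models and algorithms
$$
p(x, z \mid \eta) = \prod_{n=1}^{N} \exp\left( \eta^\top \phi(x_n, z_n) - \log Z(\eta) \right)
$$
$$
p(\eta \mid \nu, v) = \exp\left( \eta^\top v - \nu \log Z(\eta) - \log F(\nu, v) \right)
$$
$$
\log q^*(z) = \mathbb{E}_{q(\eta)} \left( \log p(x, z, \eta) \right) + \text{const}
            = \mathbb{E}_{q(\eta)} \left( \log p(x, z \mid \eta) \right) + \text{const}
            = \sum_{n=1}^{N} \mathbb{E}_{q(\eta)}(\eta)^\top \phi(x_n, z_n) + \text{const}
$$
$$
q^*(z) = \prod_{n=1}^{N} \exp\left( \mathbb{E}(\eta)^\top \phi(x_n, z_n) - \log Z(\mathbb{E}(\eta)) \right)
$$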
and get away with less? First, note that p(C, Θ, Π | W ) = p(Θ, Π |
C, W ) p(C | W ). Now, we minimize
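$$
\begin{aligned}
D_{KL}(q(\Pi, \Theta, C) \,\|\, p(\Pi, \Theta, C \mid W))
  &= \int q(\Pi, \Theta \mid C) \, q(C) \log \left( \frac{q(\Pi, \Theta \mid C) \, q(C)}{p(\Pi, \Theta \mid C, W) \, p(C \mid W)} \right) dC \, d\Pi \, d\Theta \\
  &= \int q(\Pi, \Theta \mid C) \, q(C) \left( \log \frac{q(\Pi, \Theta \mid C)}{p(\Pi, \Theta \mid C, W)} + \log \frac{q(C)}{p(C \mid W)} \right) dC \, d\Pi \, d\Theta
\end{aligned}
$$
where:
$$
p(C, W) = \prod_d \left( \frac{\Gamma(\sum_k \alpha_{dk})}{\Gamma(\sum_k \alpha_{dk} + n_{dk\cdot})} \prod_k \frac{\Gamma(\alpha_{dk} + n_{dk\cdot})}{\Gamma(\alpha_{dk})} \right)
  \prod_k \left( \frac{\Gamma(\sum_v \beta_{kv})}{\Gamma(\sum_v \beta_{kv} + n_{\cdot kv})} \prod_v \frac{\Gamma(\beta_{kv} + n_{\cdot kv})}{\Gamma(\beta_{kv})} \right)
$$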
Therefore:
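$$
\gamma_{dik} \propto \exp\left( \mathbb{E}_{q(C^{\backslash di})} \left[ \log(\alpha_{dk} + n_{dk\cdot}^{\backslash di}) + \log(\beta_{k w_{di}} + n_{\cdot k w_{di}}^{\backslash di}) - \log\Big( \sum_v \beta_{kv} + n_{\cdot kv}^{\backslash di} \Big) \right] \right)
$$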
Under our assumption for the factorization $q(C) = \prod_{d,i} q(c_{di})$, the
counts $n_{dk\cdot}$ are sums of independent Bernoulli variables (i.e. they
have a multinomial distribution). Computing their expected logarithm
is tricky and of complexity $O(n_{d\cdot\cdot}^2)$, i.e. quadratic in the
counts that we are dealing with. That is likely why the original
paper didn’t do this. In fact, it took three years for a solution to be
provided by Yee Whye Teh and Max Welling.
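$$
\begin{aligned}
P(R = r \mid f, N) &= \frac{N!}{(N - r)! \cdot r!} \cdot f^r \cdot (1 - f)^{N - r} \\
  &= \binom{N}{r} \cdot f^r \cdot (1 - f)^{N - r} \\
  &\approx \mathcal{N}(r; N f, N f (1 - f))
\end{aligned}
$$
$$
\log(\alpha + n) \approx \log(\alpha + \mathbb{E}(n)) + (n - \mathbb{E}(n)) \cdot \frac{1}{\alpha + \mathbb{E}(n)} - \frac{1}{2} (n - \mathbb{E}(n))^2 \cdot \frac{1}{(\alpha + \mathbb{E}(n))^2}
$$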
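Taking the expectation of this second-order expansion removes the linear term and leaves $\log(\alpha + \mathbb{E}(n)) - \mathrm{var}(n) / (2 (\alpha + \mathbb{E}(n))^2)$. The sketch below compares that approximation with the exact expectation for a count n that is a sum of independent Bernoulli variables. It is only a numerical illustration: the probabilities `probs` and the value of `alpha` are arbitrary, and the function names are invented for this example.

```python
import math


def count_distribution(probs):
    """Exact distribution of n = sum of independent Bernoulli(p_i) variables.

    Built by adding one variable at a time, so the cost is quadratic in len(probs).
    """
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for n, mass in enumerate(dist):
            new[n] += mass * (1.0 - p)
            new[n + 1] += mass * p
        dist = new
    return dist


def exact_expected_log(alpha, probs):
    """E[log(alpha + n)] computed from the exact count distribution."""
    return sum(mass * math.log(alpha + n) for n, mass in enumerate(count_distribution(probs)))


def taylor_expected_log(alpha, probs):
    """Second-order approximation: log(alpha + E[n]) - var[n] / (2 (alpha + E[n])^2)."""
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    return math.log(alpha + mean) - 0.5 * var / (alpha + mean) ** 2


probs = [0.1, 0.7, 0.3, 0.9, 0.5, 0.2, 0.8, 0.6]  # illustrative word-topic probabilities
alpha = 0.5
exact = exact_expected_log(alpha, probs)
second_order = taylor_expected_log(alpha, probs)
zeroth_order = math.log(alpha + sum(probs))  # ignores the variance of n entirely
```

The approximation only needs the mean and variance of the count, both of which are linear-time sums, whereas the exact computation is quadratic.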
[Figure: π_d against year, 1800 to 2000. Graphical model fragment with nodes h and φ_d, labels "kernel" and "metadata", plate d = [1, …, D].]
Now, let us return to our problem at hand and see how exactly we
can incorporate the meta-data. One can pose the problem of con-
structing a smooth latent structure as a regression problem, which
motivates the use of Gaussian Processes. We will use GP regression
for the latent function f , which together with the metadata Φ in-
forms the prior α for the document topic distribution. The updated
model to generate the words W of documents d = 1, . . . , D with
features φd ∈ F is as follows:
The first part is the well known rational-quadratic kernel which en-
codes the smoothness properties. The second part is an indication
of change in presidency – notice that we allow for slight shift in
[Figure: p(topic) against year, 1800 to 2000.]
[Figure: ⟨π_k | φ⟩ against year, 1800 to 2000, with topic word labels: law, war, America, good, American, people, work, made, business.]
Making decisions
But for this setting, we have assumed that we know p. Let’s illus-
trate an example for the (more common) case when we don’t know
p.
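[Figure: payout against N, with N on a logarithmic axis from 10^0 to 10^3.]
$$
\bar{\pi}_k := \mathbb{E}_{p(\pi_k \mid n_k, m_k)}[\pi] = \frac{m_k}{n_k}, \qquad
\sigma_k^2 := \mathrm{var}_{p(\pi_k \mid n_k, m_k)}[\pi] = \frac{m_k (n_k - m_k)}{n_k^2 (n_k + 1)} = O(n_k^{-1})
$$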
In fact, the smooth curved lines in the plot above are exactly the
standard deviations at each time point i, which we know is the
expected distance of the estimated quantity to the true value.
One idea is, at time i, to choose the option k that maximizes
$\bar{\pi}_k + \sqrt{c \, \sigma_k^2}$. Now, one could pose the question: which is
the best value for c? A large c ensures uncertain options are
preferred, thus leading to exploration. On the other hand, a small c
ignores uncertainty, thus leading to exploitation.
One possibility is to let c grow at a rate less than $O(n_k^{1/2})$. Then,
the variance of the chosen options will drop faster than c grows, so
their exploration will stop unless their mean is good. However, as c
grows, unexplored choices will eventually become dominant, and are
thus always explored eventually.
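The selection rule described above can be sketched as a small simulation. Several details below are assumptions made for the example rather than part of the text: the posterior uses a uniform Beta(1, 1) prior (so that the mean and variance are defined before any pulls), the exploration weight grows as c = 2 log(t + 1), and the three win probabilities are arbitrary.

```python
import math
import random


def posterior_mean_var(n, m):
    """Mean and variance of Beta(m + 1, n - m + 1), the posterior after m wins in n pulls."""
    a, b = m + 1.0, n - m + 1.0
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


def run_bandit(true_probs, n_rounds, seed=0):
    """At each round pull the option maximising mean + sqrt(c * variance)."""
    rng = random.Random(seed)
    pulls = [0] * len(true_probs)
    wins = [0] * len(true_probs)
    for t in range(1, n_rounds + 1):
        c = 2.0 * math.log(t + 1.0)  # assumed slowly growing exploration weight
        scores = []
        for n, m in zip(pulls, wins):
            mean, var = posterior_mean_var(n, m)
            scores.append(mean + math.sqrt(c * var))
        k = scores.index(max(scores))
        pulls[k] += 1
        if rng.random() < true_probs[k]:
            wins[k] += 1
    return pulls, wins


# Three options with well-separated win probabilities (illustrative values).
pulls, wins = run_bandit([0.2, 0.5, 0.8], n_rounds=2000)
```

With these settings the best option should end up with the large majority of pulls, while the other two are still revisited occasionally because c keeps growing.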
[Figure: ∑_t n_it against N for p = 45%, 50%, 55%; regret against N on logarithmic axes, showing the regret bound, expected regret and sampled regret.]
[Figure: f against x, for x from −5 to 5.]
Continuous-Armed Bandits
$$
p(y \mid x) = \mathcal{N}(y; f_x, \sigma^2)
$$
$$
R(T) := \sum_{t=1}^{T} \left( f(x_t) - f(x_*) \right)
$$
$$
p(f) = \mathcal{GP}(f; \mu, k)
$$
[Figure: f against x, for x from −5 to 5.]
For this reason, we build GP regression on the observations (see
Fig. 106). One pedestrian way to find where it is most likely that
the minimum lies is to iteratively draw samples from the posterior
GP and record where the minimum of each sample is. Based on this
GP Upper Confidence Bound
Figure 107: GP UCB. Top: GP posterior. Bottom: Utility u(x)
A more structured and theoretically motivated algorithm is GP
Upper Confidence Bound (GP-UCB; Srinivas, Krause, Kakade, Seeger,
ICML 2009). Under the posterior $p(f \mid y) = \mathcal{GP}(f; \mu_{t-1}, \sigma_{t-1}^2)$,
we define the pointwise utility as:
$$
u_t(x) = \mu_{t-1}(x) - \sqrt{\beta_t} \, \sigma_{t-1}(x)
$$
thus $\lim_{T \to \infty} R_T / T = 0$ (“no regret”).
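A bare-bones version of this utility on a grid of candidate points might look as follows. This is a sketch under several assumptions that are not from the text: a zero-mean GP prior with a unit-lengthscale squared-exponential kernel, a small fixed noise level, a constant β instead of a schedule β_t, and an arbitrary test function. The linear algebra is written out by hand to keep the example free of dependencies; in practice one would use a numerical library and factorize the Gram matrix once per step.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel k(a, b)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(mat):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, rhs):
    """Forward substitution: solve low @ out = rhs."""
    out = []
    for i, row in enumerate(low):
        out.append((rhs[i] - sum(row[k] * out[k] for k in range(i))) / row[i])
    return out


def solve_lower_transposed(low, rhs):
    """Back substitution: solve low.T @ out = rhs."""
    n = len(low)
    out = [0.0] * n
    for i in reversed(range(n)):
        s = rhs[i] - sum(low[k][i] * out[k] for k in range(i + 1, n))
        out[i] = s / low[i][i]
    return out


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of f(x_new) under a zero-mean GP prior."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    low = cholesky(gram)
    k_star = [rbf(a, x_new) for a in xs]
    weights = solve_lower_transposed(low, solve_lower(low, ys))
    v = solve_lower(low, k_star)
    mean = sum(k * w for k, w in zip(k_star, weights))
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mean, math.sqrt(var)


def gp_ucb_minimise(f, candidates, x0, n_steps, beta=4.0):
    """Repeatedly evaluate f where u(x) = mean(x) - sqrt(beta) * std(x) is smallest."""
    xs, ys = [x0], [f(x0)]
    for _ in range(n_steps):
        best_x, best_u = None, float("inf")
        for x in candidates:
            mean, std = gp_posterior(xs, ys, x)
            u = mean - math.sqrt(beta) * std
            if u < best_u:
                best_x, best_u = x, u
        xs.append(best_x)
        ys.append(f(best_x))
    return xs, ys


candidates = [-5.0 + 0.1 * i for i in range(101)]
xs, ys = gp_ucb_minimise(lambda x: math.sin(x) + 0.1 * x, candidates, x0=0.0, n_steps=15)
```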
Entropy Search
Figure 108: Entropy Search. Top: GP posterior hypotheses. Bottom: Utility u(x)
The limitation of the GP-UCB algorithms is that they focus solely on
minimizing regret. It might not be true that you always want to
collect the minimum function values. Ideally we would like, in a
guided fashion, to efficiently learn where the minimum is. This is
the driving idea behind Entropy Search (Villemonteix et al., 2009;
Hennig & Schuler, 2012). In particular, instead of evaluating where
you think the minimum lies, evaluate where you expect to learn most
about the minimum. For this, we need to make use of the entropy:
$$
H(p) := - \int p(x) \log \frac{p(x)}{b(x)} \, dx
$$
Condensed content
F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the
sum-product algorithm. IEEE Transactions on Information Theory,
2001.