Distributed Gibbs Sampling of Latent Dirichlet Allocation: The Gritty Details


Yi Wang
yiwang@tencent.com
THIS IS AN EARLY DRAFT. YOUR FEEDBACK IS HIGHLY APPRECIATED.
1 Introduction
I started to write this chapter in November 2007, right after my first MapReduce
implementation of the AD-LDA algorithm [7]. I had worked hard to understand LDA,
but could not find any document that was comprehensive, complete, and contained
all the necessary details. It is true that some open source implementations of
LDA exist, for example, the LDA Gibbs sampler in Java
(http://arbylon.net/projects/LdaGibbsSampler.java) and GibbsLDA++
(http://gibbslda++.sourceforge.net), but code does not reveal the math
derivations. I read Griffiths' paper [4] and technical report [3], but realized
that most derivations are skipped there. It was lucky for me that the
Topics-over-Time paper [9], an extension of LDA, shows part of the derivation.
That derivation is helpful for understanding LDA, but it is short and not
self-contained. A primer [5] by Gregor Heinrich contains more details than the
documents mentioned above; indeed, most of my initial understanding of LDA comes
from [5]. As [5] focuses more on the Bayesian background of latent variable
modeling, I wrote this article focusing more on the derivation of the Gibbs
sampling algorithm for LDA. You may notice that the Gibbs sampling updating rule
we derive in (38) differs slightly from what is presented in [3] and [5], but
matches that in [9].
2 LDA and Its Learning Problem
In the literature, the training documents of LDA are often denoted by a long
and segmented vector W:
W = \left[\begin{array}{ll}
\{w_1, \ldots, w_{N_1}\}, & \text{words in the 1st document} \\
\{w_{N_1+1}, \ldots, w_{N_1+N_2}\}, & \text{words in the 2nd document} \\
\ldots & \\
\{w_{1+\sum_{j=1}^{D-1} N_j}, \ldots, w_{\sum_{j=1}^{D} N_j}\}, & \text{words in the D-th document}
\end{array}\right] ,   (1)

where N_d denotes the number of words in the d-th document.

An alternative representation of the training documents uses two vectors, d and W:

\left[\begin{array}{c} d \\ W \end{array}\right]
= \left[\begin{array}{ccc} d_1, & \ldots, & d_N \\ w_1, & \ldots, & w_N \end{array}\right] ,   (2)

where N = \sum_{j=1}^{D} N_j, each d_i indexes a document (1 \le d_i \le D), each w_i
indexes a word in the vocabulary, and the tuple [d_i, w_i]^T denotes an edge between
the two vertices indexed by d_i and w_i respectively. This representation is
comprehensive and is used in many implementations of LDA, e.g., [8]. However, many
papers (including this article) avoid it in the exposition, because it misleads
readers into treating W and d as two sets of random variables. In fact, d is the
structure of W and is highly dependent on W; d is also the structure of the hidden
variables Z, as shown by (3).
In the rest of this article, we consider the hidden variables

Z = \left[\begin{array}{ll}
\{z_1, \ldots, z_{N_1}\}, & \text{topics of words in the 1st document} \\
\{z_{N_1+1}, \ldots, z_{N_1+N_2}\}, & \text{topics of words in the 2nd document} \\
\ldots & \\
\{z_{1+\sum_{j=1}^{D-1} N_j}, \ldots, z_{\sum_{j=1}^{D} N_j}\}, & \text{topics of words in the D-th document}
\end{array}\right] ,   (3)

where each z_i \in Z corresponds to a word w_i \in W in (1).
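For concreteness, here is a minimal Python sketch of these representations, assuming the corpus is given as a list of documents over 0-based word ids; the names docs, W, d, and Z are illustrative:

import random

# A toy corpus: each document is a list of word ids from a vocabulary of size V.
docs = [[0, 2, 2, 1], [3, 3, 0], [1, 4, 4, 4, 2]]
K = 2  # number of topics

# Flatten into the long, segmented vector W of (1), the document-index vector d
# of (2), and a parallel vector Z of topic assignments as in (3).
W = [w for doc in docs for w in doc]
d = [m for m, doc in enumerate(docs) for _ in doc]
Z = [random.randrange(K) for _ in W]  # topics initialized at random

assert len(W) == len(d) == len(Z) == sum(len(doc) for doc in docs)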
3 Dirichlet and Multinomial
To derive the Gibbs sampling algorithm for LDA, it is important to be familiar
with the conjugacy between the Dirichlet distribution and the multinomial
distribution. The Dirichlet distribution is defined as

Dir(p; \alpha) = \frac{1}{B(\alpha)} \prod_{t=1}^{|\alpha|} p_t^{\,\alpha_t - 1} ,   (4)

where the normalizing constant is the multinomial Beta function, which can be
expressed in terms of the Gamma function:

B(\alpha) = \frac{\prod_{i=1}^{|\alpha|} \Gamma(\alpha_i)}{\Gamma\!\left( \sum_{i=1}^{|\alpha|} \alpha_i \right)} .   (5)

When the Dirichlet distribution is used in LDA to model the prior, the \alpha_i's are
positive integers. In this case, the Gamma function degenerates to the factorial
function:

\Gamma(n) = (n - 1)! \, .   (6)
The multinomial distribution is defined as

Mult(x; p) = \frac{n!}{\prod_{i=1}^{K} x_i!} \prod_{i=1}^{K} p_i^{\,x_i} ,   (7)

where x_i denotes the number of times that value i appears in the samples drawn from
the discrete distribution p, K = |x| = |p| = |\alpha|, and n = \sum_{i=1}^{K} x_i. The
normalization constant is the multinomial coefficient, which counts the possible
orderings of the sample.
We say that the Dirichlet distribution is the conjugate distribution of the
multinomial distribution, because if the prior of p is Dir(p; \alpha) and x is
generated by Mult(x; p), then the posterior distribution of p, p(p | x, \alpha), is
a Dirichlet distribution:

p(p | x, \alpha) = Dir(p; x + \alpha)
               = \frac{1}{B(x + \alpha)} \prod_{t=1}^{|\alpha|} p_t^{\,x_t + \alpha_t - 1} .   (8)
Because (8) is a probability distribution function, integrating it over p should
result in 1:
1 = \int \frac{1}{B(x + \alpha)} \prod_{t=1}^{|\alpha|} p_t^{\,x_t + \alpha_t - 1} \, dp
  = \frac{1}{B(x + \alpha)} \int \prod_{t=1}^{|\alpha|} p_t^{\,x_t + \alpha_t - 1} \, dp .   (9)

This implies

\int \prod_{t=1}^{|\alpha|} p_t^{\,x_t + \alpha_t - 1} \, dp = B(x + \alpha) .   (10)
This property will be used in the following derivations.
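As a quick numerical illustration of (5), (8), and (10), here is a minimal Python sketch, assuming NumPy and SciPy are available; the names log_multinomial_beta, alpha, and x are illustrative:

import numpy as np
from scipy.special import gammaln

def log_multinomial_beta(alpha):
    """log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i), cf. (5)."""
    alpha = np.asarray(alpha, dtype=float)
    return gammaln(alpha).sum() - gammaln(alpha.sum())

alpha = np.array([2.0, 1.0, 1.0])   # Dirichlet prior parameters
x = np.array([5, 0, 3])             # observed counts from Mult(x; p)

# Conjugacy (8): the posterior of p is Dir(p; x + alpha), whose normalizing
# constant is B(x + alpha), which is exactly the integral in (10).
print(x + alpha)                          # posterior parameters [7. 1. 4.]
print(log_multinomial_beta(x + alpha))    # log B(x + alpha)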
4 Learning LDA by Gibbs Sampling
There have been three strategies to learn LDA: EM with variational inference [2],
EM with expectation propagation [6], and Gibbs sampling [4]. In this article, we
focus on the Gibbs sampling method, whose performance is comparable with that of
the other two but which is less prone to getting trapped in local optima.
Gibbs sampling is one of a class of sampling methods known as Markov chain Monte
Carlo. We use it to sample from the posterior distribution p(Z | W), given the
training data W represented in the form of (1). As will be shown in the following
text, given a sample of Z we can infer the model parameters \Phi and \Theta. This
forms a learning algorithm whose general framework is shown in Fig. 1.

Figure 1: The procedure of learning LDA by Gibbs sampling.
In order to sample from p(Z | W) using the Gibbs sampling method, we need the full
conditional posterior distribution p(z_i | Z_{-i}, W), where Z_{-i} denotes all the
z_j's with j \ne i. In particular, Gibbs sampling does not require knowing the exact
form of p(z_i | Z_{-i}, W); instead, it is enough to have a function f(\cdot), where

f(z_i | Z_{-i}, W) \propto p(z_i | Z_{-i}, W) .   (11)

The following mathematical derivation produces such a function f(\cdot).
The Joint Distribution of LDA    We start by deriving the joint distribution

p(Z, W | \alpha, \beta) = p(W | Z, \beta) \, p(Z | \alpha) ,   (12)

which is the basis of both the Gibbs updating rule and the parameter estimation
rule. (The notation used in this article largely follows [5].) As p(W | Z, \beta)
and p(Z | \alpha) depend on \beta and \alpha respectively, we derive them separately.
According to the definition of LDA, we have

p(W | Z, \beta) = \int p(W | Z, \Phi) \, p(\Phi | \beta) \, d\Phi ,   (13)

where p(\Phi | \beta) is a product of Dirichlet distributions:

p(\Phi | \beta) = \prod_{k=1}^{K} p(\phi_k | \beta)
              = \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\,\beta_v - 1} ,   (14)

and p(W | Z, \Phi) is a product of multinomials:

p(W | Z, \Phi) = \prod_{i=1}^{N} \phi_{z_i, w_i}
             = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{k,v}^{\,n(v;\,k)} ,   (15)

where the counts n(v; k) form a K \times V matrix and n(v; k) is the number of times
that topic k is assigned to word v. With W and Z defined by (1) and (3)
respectively, we can compute n(v; k) by

n(v; k) = \sum_{i=1}^{N} I\{ w_i = v \wedge z_i = k \} ,   (16)

where N is the corpus size in words.
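A minimal Python sketch of the count (16), assuming W and Z are sequences of 0-based word and topic ids; the function name word_topic_counts is illustrative:

import numpy as np

def word_topic_counts(W, Z, K, V):
    """n(v; k) of (16): the number of positions i with w_i = v and z_i = k."""
    counts = np.zeros((K, V), dtype=np.int64)   # row k holds the vector n(. ; k)
    for w_i, z_i in zip(W, Z):
        counts[z_i, w_i] += 1
    return counts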
Given (15) and (14), (13) becomes

p(W | Z, \beta) = \int \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\,n(v;k) + \beta_v - 1} \, d\Phi .   (17)

Using the property of the integral of a product, which we learned in our college
days,

\int \prod_{k=1}^{K} f_k(\phi_k) \, d\phi_1 \ldots d\phi_K = \prod_{k=1}^{K} \int f_k(\phi_k) \, d\phi_k ,   (18)

we have

p(W | Z, \beta) = \prod_{k=1}^{K} \left[ \int \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{k,v}^{\,n(v;k) + \beta_v - 1} \, d\phi_k \right]
              = \prod_{k=1}^{K} \left[ \frac{1}{B(\beta)} \int \prod_{v=1}^{V} \phi_{k,v}^{\,n(v;k) + \beta_v - 1} \, d\phi_k \right] .   (19)

Using property (10), we can compute the integral over each \phi_k in closed form, so

p(W | Z, \beta) = \prod_{k=1}^{K} \frac{B(n(\cdot;\,k) + \beta)}{B(\beta)} ,   (20)

where n(\cdot; k) denotes the count vector [n(1; k), \ldots, n(V; k)].
Now we derive p(Z | \alpha) analogously to p(W | Z, \beta). Similar to (14), we have

p(\Theta | \alpha) = \prod_{d=1}^{D} p(\theta_d | \alpha)
                 = \prod_{d=1}^{D} \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_{d,k}^{\,\alpha_k - 1} ;   (21)

similar to (15), we have

p(Z | \Theta) = \prod_{i=1}^{N} \theta_{d_i, z_i}
            = \prod_{d=1}^{D} \prod_{k=1}^{K} \theta_{d,k}^{\,n(k;\,d)} ;   (22)

and similar to (20) we have

p(Z | \alpha) = \int p(Z | \Theta) \, p(\Theta | \alpha) \, d\Theta
            = \prod_{d=1}^{D} \left[ \frac{1}{B(\alpha)} \int \prod_{k=1}^{K} \theta_{d,k}^{\,n(k;d) + \alpha_k - 1} \, d\theta_d \right]
            = \prod_{d=1}^{D} \frac{B(n(\cdot;\,d) + \alpha)}{B(\alpha)} ,   (23)

where n(k; d) is the number of times that topic k is assigned to words in document
d, and n(\cdot; d) denotes the count vector [n(1; d), \ldots, n(K; d)]. With the
input data defined by (2), n(k; d) can be computed as

n(k; d) = \sum_{i=1}^{N} I\{ d_i = d \wedge z_i = k \} ,   (24)

where N is the corpus size in words.
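The document-side count (24) can be accumulated in the same way from the representation (2); a short Python sketch with illustrative names, assuming d and Z are sequences of 0-based document and topic ids:

import numpy as np

def doc_topic_counts(d, Z, D, K):
    """n(k; d) of (24): the number of words in document d assigned topic k."""
    counts = np.zeros((D, K), dtype=np.int64)   # row d holds the vector n(. ; d)
    np.add.at(counts, (np.asarray(d), np.asarray(Z)), 1)
    return counts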
Given (20) and (23), the joint distribution (12) becomes

p(Z, W | \alpha, \beta) = p(W | Z, \beta) \, p(Z | \alpha)
 = \prod_{k=1}^{K} \frac{B(n(\cdot;\,k) + \beta)}{B(\beta)} \cdot \prod_{d=1}^{D} \frac{B(n(\cdot;\,d) + \alpha)}{B(\alpha)} .   (25)
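For reference, (25) is convenient to evaluate in log space, e.g. for monitoring convergence; the following sketch assumes the count conventions of the earlier sketches (nvk[k, v] = n(v; k), nkd[d, k] = n(k; d)) and that NumPy/SciPy are available:

import numpy as np
from scipy.special import gammaln

def log_B(a):
    """log of the multinomial Beta function (5)."""
    return gammaln(a).sum() - gammaln(a.sum())

def log_joint(nvk, nkd, alpha, beta):
    """log p(Z, W | alpha, beta) per (25)."""
    word_part = sum(log_B(nvk[k] + beta) - log_B(beta) for k in range(nvk.shape[0]))
    doc_part = sum(log_B(nkd[d] + alpha) - log_B(alpha) for d in range(nkd.shape[0]))
    return word_part + doc_part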
The Gibbs Updating Rule    With (25), we can derive the Gibbs updating rule for LDA:

p(z_i = k | Z_{-i}, W, \alpha, \beta) = \frac{p(z_i = k, Z_{-i}, W | \alpha, \beta)}{p(Z_{-i}, W | \alpha, \beta)} .   (26)

Because the factor p(w_i | Z_{-i}, W_{-i}, \alpha, \beta) in the denominator does
not depend on z_i, (26) gives

p(z_i = k | Z_{-i}, W, \alpha, \beta) \propto \frac{p(Z, W | \alpha, \beta)}{p(Z_{-i}, W_{-i} | \alpha, \beta)} , \quad \text{where } z_i = k .   (27)
The numerator in (27) is (25). The denominator has a similar form:

p(Z_{-i}, W_{-i} | \alpha, \beta)
 = \prod_{k=1}^{K} \frac{B(n^{-i}(\cdot;\,k) + \beta)}{B(\beta)} \cdot \prod_{d=1}^{D} \frac{B(n^{-i}(\cdot;\,d) + \alpha)}{B(\alpha)} ,   (28)

where n^{-i}(v; k) is the number of times that topic k is assigned to word v, and
n^{-i}(k; d) is the number of words in document d that are assigned topic k, both
computed with the i-th word and its topic assignment excluded. In more detail,

n^{-i}(v; k) = \sum_{1 \le j \le N,\; j \ne i} I\{ w_j = v \wedge z_j = k \} ,
\qquad
n^{-i}(k; d) = \sum_{1 \le j \le N,\; j \ne i} I\{ d_j = d \wedge z_j = k \} ,   (29)

where d_j denotes the document to which the j-th word in the corpus belongs. Some
properties follow directly from these definitions:

n(v; k) = \begin{cases} n^{-i}(v; k) + 1 & \text{if } v = w_i \text{ and } k = z_i ; \\ n^{-i}(v; k) & \text{in all other cases;} \end{cases}
\qquad
n(k; d) = \begin{cases} n^{-i}(k; d) + 1 & \text{if } d = d_i \text{ and } k = z_i ; \\ n^{-i}(k; d) & \text{in all other cases.} \end{cases}   (30)
From (30) we also have

\sum_{v=1}^{V} n(v; z_i) = 1 + \sum_{v=1}^{V} n^{-i}(v; z_i) ,
\qquad
\sum_{k=1}^{K} n(k; d_i) = 1 + \sum_{k=1}^{K} n^{-i}(k; d_i) .   (31)

(30) and (31) are the key to computing (27). Expanding the products in (25) and
(28) and taking the ratio in (27), all factors with k \ne z_i or d \ne d_i cancel
due to (30), and we get

p(z_i | Z_{-i}, W, \alpha, \beta) \propto
\frac{B(n(\cdot;\,z_i) + \beta)}{B(n^{-i}(\cdot;\,z_i) + \beta)} \cdot
\frac{B(n(\cdot;\,d_i) + \alpha)}{B(n^{-i}(\cdot;\,d_i) + \alpha)} .   (32)
This expression can be further simplified by expanding B(\cdot). Considering

B(x) = \frac{\prod_{k=1}^{\dim x} \Gamma(x_k)}{\Gamma\!\left( \sum_{k=1}^{\dim x} x_k \right)} ,   (33)

(32) becomes

p(z_i | Z_{-i}, W, \alpha, \beta) \propto
\frac{ \prod_{v=1}^{V} \Gamma\!\left( n(v; z_i) + \beta_v \right) \big/ \Gamma\!\left( \sum_{v=1}^{V} ( n(v; z_i) + \beta_v ) \right) }
     { \prod_{v=1}^{V} \Gamma\!\left( n^{-i}(v; z_i) + \beta_v \right) \big/ \Gamma\!\left( \sum_{v=1}^{V} ( n^{-i}(v; z_i) + \beta_v ) \right) }
\cdot
\frac{ \prod_{k=1}^{K} \Gamma\!\left( n(k; d_i) + \alpha_k \right) \big/ \Gamma\!\left( \sum_{k=1}^{K} ( n(k; d_i) + \alpha_k ) \right) }
     { \prod_{k=1}^{K} \Gamma\!\left( n^{-i}(k; d_i) + \alpha_k \right) \big/ \Gamma\!\left( \sum_{k=1}^{K} ( n^{-i}(k; d_i) + \alpha_k ) \right) } .   (34)
Using (30), most factors in the products of (34) cancel, and we get

p(z_i | Z_{-i}, W, \alpha, \beta) \propto
\frac{ \Gamma\!\left( n(w_i; z_i) + \beta_{w_i} \right) \big/ \Gamma\!\left( \sum_{v=1}^{V} ( n(v; z_i) + \beta_v ) \right) }
     { \Gamma\!\left( n^{-i}(w_i; z_i) + \beta_{w_i} \right) \big/ \Gamma\!\left( \sum_{v=1}^{V} ( n^{-i}(v; z_i) + \beta_v ) \right) }
\cdot
\frac{ \Gamma\!\left( n(z_i; d_i) + \alpha_{z_i} \right) \big/ \Gamma\!\left( \sum_{k=1}^{K} ( n(k; d_i) + \alpha_k ) \right) }
     { \Gamma\!\left( n^{-i}(z_i; d_i) + \alpha_{z_i} \right) \big/ \Gamma\!\left( \sum_{k=1}^{K} ( n^{-i}(k; d_i) + \alpha_k ) \right) } .   (35)
Here it is important to know that

\Gamma(y) = (y - 1)! \quad \text{if } y \text{ is a positive integer,}   (36)

and more generally \Gamma(y) / \Gamma(y - 1) = y - 1. Expanding (35) and using (30)
and (31) to cancel most factors in these ratios leads us to the Gibbs updating rule
for LDA:

p(z_i | Z_{-i}, W, \alpha, \beta) \propto
\frac{ n(w_i; z_i) + \beta_{w_i} - 1 }{ \left( \sum_{v=1}^{V} ( n(v; z_i) + \beta_v ) \right) - 1 }
\cdot
\frac{ n(z_i; d_i) + \alpha_{z_i} - 1 }{ \left( \sum_{k=1}^{K} ( n(k; d_i) + \alpha_k ) \right) - 1 } .   (37)
Note that the denominator of the second factor on the right-hand side of (37) does
not depend on z_i, the argument of the function p(z_i | Z_{-i}, W, \alpha, \beta).
Because the Gibbs sampling method requires only a function f(z_i) \propto p(z_i),
cf. (11), we can use the following updating rule in practice:

p(z_i | Z_{-i}, W, \alpha, \beta) \propto
\frac{ n(w_i; z_i) + \beta_{w_i} - 1 }{ \left( \sum_{v=1}^{V} ( n(v; z_i) + \beta_v ) \right) - 1 }
\left( n(z_i; d_i) + \alpha_{z_i} - 1 \right) .   (38)
Parameter Estimation    Given the sampled topics Z, as well as the inputs \alpha,
\beta, and W, we can estimate the model parameters \Phi and \Theta.
Because n(t; k) counts the co-occurrences of word t and topic k, it is proportional
to the joint probability p(w = t, z = k); similarly, n(k; d) is proportional to
p(z = k, d). Therefore, we can estimate the model parameters by simply normalizing
the smoothed counts, which leads to the estimation equations (43).
A more precise understanding comes from [5]. According to the definitions
\phi_{k,t} = p(w = t | z = k) and \theta_{m,k} = p(z = k | d = m), we can interpret
\phi_{k,t} and \theta_{m,k} as the posterior quantities

\phi_{k,t} = p(w = t | z = k, W, Z, \beta) ,
\qquad
\theta_{m,k} = p(z = k | d = m, Z, \alpha) .   (39)
The following trick evaluates (39) using the joint distribution (25):

\phi_{k,t} \, \theta_{m,k} = p(w = t | z = k, W, Z, \beta) \; p(z = k | d = m, Z, \alpha)
 = p(w = t, z = k | d = m, W, Z, \alpha, \beta)
 = \frac{ p(\widetilde{W}, \widetilde{Z} | \alpha, \beta) }{ p(W, Z | \alpha, \beta) } ,   (40)

where \widetilde{W} = \{W, t\}, \widetilde{Z} = \{Z, k\}, and \widetilde{d} = \{d, m\};
that is, we append one more word t, with topic assignment k, to document m. Both the
numerator and the denominator can be computed using (25). Analogously to the
computation of (27), (40) becomes

\phi_{k,t} \, \theta_{m,k} =
\frac{ \Gamma\!\left( n(t; k) + 1 + \beta_t \right) \big/ \Gamma\!\left( \sum_{v=1}^{V} ( n(v; k) + \beta_v ) + 1 \right) }
     { \Gamma\!\left( n(t; k) + \beta_t \right) \big/ \Gamma\!\left( \sum_{v=1}^{V} ( n(v; k) + \beta_v ) \right) }
\cdot
\frac{ \Gamma\!\left( n(k; m) + 1 + \alpha_k \right) \big/ \Gamma\!\left( \sum_{k'=1}^{K} ( n(k'; m) + \alpha_{k'} ) + 1 \right) }
     { \Gamma\!\left( n(k; m) + \alpha_k \right) \big/ \Gamma\!\left( \sum_{k'=1}^{K} ( n(k'; m) + \alpha_{k'} ) \right) } .   (41)

Note that the +1 terms in (41) are due to |\widetilde{W}| = |W| + 1 and
|\widetilde{Z}| = |Z| + 1.
Analogously to the simplification of (34), simplifying (41) by
\Gamma(y + 1) / \Gamma(y) = y results in

\phi_{k,t} \, \theta_{m,k} =
\frac{ n(t; k) + \beta_t }{ \sum_{v=1}^{V} ( n(v; k) + \beta_v ) }
\cdot
\frac{ n(k; m) + \alpha_k }{ \sum_{k'=1}^{K} ( n(k'; m) + \alpha_{k'} ) } .   (42)

Compared with (38), (42) does not have the -1 terms, because of the +1 terms in (41).
Because the two factors on the right-hand side of (42) correspond to
p(w = t | z = k, W, Z, \beta) and p(z = k | d = m, Z, \alpha) respectively, according
to the definitions of \phi_{k,t} and \theta_{m,k} we have

\phi_{k,t} = \frac{ n(t; k) + \beta_t }{ \sum_{t'=1}^{V} ( n(t'; k) + \beta_{t'} ) } ,
\qquad
\theta_{m,k} = \frac{ n(k; m) + \alpha_k }{ \sum_{k'=1}^{K} ( n(k'; m) + \alpha_{k'} ) } .   (43)
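A short Python sketch of (43) applied to whole count matrices at once, under the same illustrative conventions as the earlier sketches (nvk[k, t] = n(t; k), nkd[m, k] = n(k; m)):

import numpy as np

def estimate_phi_theta(nvk, nkd, alpha, beta):
    """phi[k, t] and theta[m, k] per (43): normalize the smoothed counts."""
    phi = (nvk + beta) / (nvk + beta).sum(axis=1, keepdims=True)      # K x V
    theta = (nkd + alpha) / (nkd + alpha).sum(axis=1, keepdims=True)  # D x K
    return phi, theta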
The Algorithm    With (38) known, we can realize the learning procedure shown in
Fig. 1 by Algorithm 1.
It is notable that this algorithm does not require \alpha and \beta as inputs,
because it takes \alpha and \beta as zero vectors. This simplification is reasonable
because we can interpret the conjugacy of the Dirichlet and multinomial
distributions [1] as saying that \alpha_k and \beta_t denote the numbers of times
that topic k and word t, respectively, have been observed outside the training data
set. Taking \alpha and \beta as zero vectors means we have no prior knowledge of
these numbers.
This simplification also makes it easy to maintain the quantities used in (38). We
use a 2D array NWZ[w,z] to maintain n(w; z) and a 2D array NZM[z,m] to maintain
n(z; m). To accelerate the computation, we use a 1D array NZ[z] to maintain
n(z) = \sum_{v=1}^{V} n(v; z). Note that we do not need an array NM[m] for
\sum_{k=1}^{K} n(k; m), which appears in (37) but not in (38).
zero all count variables NWZ, NZM, NZ ;
foreach document m ∈ [1, D] do
    foreach word n ∈ [1, N_m] in document m do
        sample topic index z_{m,n} ∼ Mult(1/K) for word w_{m,n} ;
        increment document-topic count: NZM[z_{m,n}, m]++ ;
        increment topic-term count: NWZ[w_{m,n}, z_{m,n}]++ ;
        increment topic-term sum: NZ[z_{m,n}]++ ;
    end
end
while not finished do
    foreach document m ∈ [1, D] do
        foreach word n ∈ [1, N_m] in document m do
            NWZ[w_{m,n}, z_{m,n}]--, NZ[z_{m,n}]--, NZM[z_{m,n}, m]-- ;
            sample a new topic index z̃_{m,n} according to (44) ;
            NWZ[w_{m,n}, z̃_{m,n}]++, NZ[z̃_{m,n}]++, NZM[z̃_{m,n}, m]++ ;
        end
    end
    if converged and L sampling iterations since last read out then
        read out parameter set Φ and Θ according to (43) ;
    end
end
Algorithm 1: The Gibbs sampling algorithm that learns LDA.
If \alpha and \beta are known non-zero vectors, we can easily modify the algorithm
to make use of them. In particular, NWZ[w,z] = n(w; z) + \beta_w,
NZM[z,m] = n(z; m) + \alpha_z, and NZ[z] = \sum_{v=1}^{V} ( n(v; z) + \beta_v ).
Only the initialization part of the algorithm needs to be modified.
The input of this algorithm is not in the form of (1); instead, it is in a more
convenient form: each document is represented by the list of words it contains, and
the input is given as a set of documents. Denote the n-th word in the m-th document
by w_{m,n}; the corresponding topic is denoted by z_{m,n} and the corresponding
document is m. With these correspondences known, the above derivations can be easily
implemented in the algorithm.
The sampling operation in this algorithm does not use the full conditional posterior
distribution (38) exactly; instead, it drops the -1 terms:

p(z_i) \propto
\frac{ n(w_i; z_i) + \beta_{w_i} }{ \sum_{v=1}^{V} \left( n(v; z_i) + \beta_v \right) }
\left( n(z_i; d_i) + \alpha_{z_i} \right) .   (44)

This makes it elegant to keep the sampling of new topics consistent with the
updating of the counts (NWZ, NZ and NZM) using three lines of code: decrease the
counts, sample, and increase the counts. Because the counts are decreased before
sampling, the n(\cdot) values used in (44) are exactly the n^{-i}(\cdot) values, so
(44) with the decreased counts equals (38) with the full counts; this is the
physical meaning of the -1 terms in (38). I learned this trick from Heinrich's code
at http://arbylon.net/projects/LdaGibbsSampler.java.
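To make the three-line trick concrete, here is a Python sketch of one Gibbs sweep implementing (44) with the decrease/sample/increase pattern; it is an illustration under the array conventions above (NWZ[w, z], NZM[z, m], NZ[z]), with \alpha and \beta assumed entrywise positive, rather than the original implementation:

import numpy as np

def gibbs_sweep(W, d, Z, NWZ, NZM, NZ, alpha, beta):
    """One pass over all words, resampling each topic according to (44)."""
    K = NZ.shape[0]
    beta_sum = beta.sum()
    for i, (w, m) in enumerate(zip(W, d)):
        z = Z[i]
        # Decrease the counts: without word i they equal the n^{-i}(.) counts.
        NWZ[w, z] -= 1; NZ[z] -= 1; NZM[z, m] -= 1
        # Evaluate (44) for every candidate topic k at once.
        p = (NWZ[w, :] + beta[w]) / (NZ + beta_sum) * (NZM[:, m] + alpha)
        z = np.random.choice(K, p=p / p.sum())
        # Increase the counts with the newly sampled topic.
        NWZ[w, z] += 1; NZ[z] += 1; NZM[z, m] += 1
        Z[i] = z
    return Z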
Figure 2: The ground-truth used to synthesize our testing data.
5 Experiments
The synthetic image data mentioned in [4] is useful for testing the correctness of
any implementation of an LDA learning algorithm. To synthesize this data set, we fix
\Phi = \{\phi_k\}_{k=1}^{K=10} as visualized in Fig. 2 (also Fig. 1 in [4]), set
\alpha = 1, and set the number of words/pixels in each document/image to 100.
Because every image has 5 \times 5 pixels, the vocabulary size is 25. All these
settings are identical to those described in [4].
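A sketch of how such a data set can be synthesized, with the bar-shaped \phi_k of Fig. 2 following the description in [4]; the number of documents and the random seed are illustrative, as the text does not state them:

import numpy as np

rng = np.random.default_rng(0)
K, side = 10, 5                  # 10 topics on a 5x5 pixel grid
V, words_per_doc, num_docs = side * side, 100, 1000

# Ground-truth phi: topics 0..4 are horizontal bars, topics 5..9 vertical bars.
phi = np.zeros((K, V))
for r in range(side):
    bar = np.zeros((side, side)); bar[r, :] = 1.0 / side
    phi[r] = bar.ravel()
    bar = np.zeros((side, side)); bar[:, r] = 1.0 / side
    phi[side + r] = bar.ravel()

# Each document/image: draw theta ~ Dir(alpha = 1), then sample 100 words/pixels.
docs = []
for _ in range(num_docs):
    theta = rng.dirichlet(np.ones(K))
    z = rng.choice(K, size=words_per_doc, p=theta)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])
    docs.append(w)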
Fig. 3 shows the result of running Gibbs sampling to learn an LDA from this testing
data set, where the left pane shows the convergence of the likelihood, and the right
pane shows how the estimate of \Phi evolves over the iterations of the algorithm.
The 20 rows of images in the right pane are visualizations of \Phi estimated at
iterations 1, 2, 3, 5, 7, 10, 15, 20, 25, 50, 75, 100, 125, 150, 175, 200, 250, 300,
400, and 499. From this figure we can see that from iteration 100 on, the estimated
\Phi is almost identical to Fig. 1(a) of [4].

Figure 3: The result of running the Gibbs sampling algorithm to learn an LDA from
the synthetic data. Each row of 10 images in the right pane visualizes the
\Phi = \{\phi_k\}_{k=1}^{K=10} estimated at the iterations indicated by red circles
in the left pane.
6 Acknowledgement
Thanks go to Gregor Heinrich and Xuerui Wang for helping me grasp some
details in the derivation of Gibbs sampling of LDA.
References
[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
[3] T. Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical report, Stanford University, 2004.
[4] T. Griffiths and M. Steyvers. Finding scientific topics. In PNAS, volume 101, pages 5228-5235, 2004.
[5] G. Heinrich. Parameter estimation for text analysis. Technical report, vsonix GmbH and University of Leipzig, Germany, 2009.
[6] T. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In UAI, 2002. https://research.microsoft.com/~minka/papers/aspect/minka-aspect.pdf.
[7] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2007.
[8] M. Steyvers and T. Griffiths. Matlab topic modeling toolbox. Technical report, University of California, Berkeley, 2007.
[9] X. Wang and A. McCallum. Topics over time: A non-Markov continuous-time model of topical trends. In KDD, 2006.