Deriving formulas used in a variational Bayesian inference for
Correlated Topic Models
Tomonari MASADA @ Nagasaki University
December 21, 2012
1 Model
This manuscript includes a derivation of update formulas for correlated topic models (CTM)[1]. We give
a generative description of CTM below.
1. For each topic $k$, draw a multinomial $\mathrm{Mul}(\phi_k)$ from a Dirichlet prior $\mathrm{Dir}(\beta)$.
2. For each document $d$,
(a) Draw $m_d$ from a Gaussian $\mathcal{N}(\mu, \Sigma)$.
(b) Let $\theta_{dk} \equiv \exp(m_{dk}) / \sum_{k'} \exp(m_{dk'})$.
(c) For the $i$th word token, draw a topic $z_{di}$ from a multinomial $\mathrm{Mul}(\theta_d)$.
(d) For the $i$th word token, draw a word $x_{di}$ from a multinomial $\mathrm{Mul}(\phi_{z_{di}})$.
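The generative process above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function names, the toy parameter values, and the choice to pass a lower-triangular factor `chol` of $\Sigma$ (rather than $\Sigma$ itself) are not part of the manuscript.

```python
import math
import random

def sample_dirichlet(beta, rng):
    """Step 1: draw phi_k ~ Dir(beta) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(b, 1.0) for b in beta]
    total = sum(draws)
    return [g / total for g in draws]

def softmax(m):
    """Step 2(b): map m_d to topic proportions theta_d."""
    top = max(m)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - top) for v in m]
    total = sum(exps)
    return [e / total for e in exps]

def sample_document(mu, chol, phi, n_words, rng):
    """Steps 2(a)-(d); chol is assumed lower triangular with chol chol^T = Sigma."""
    K = len(mu)
    eps = [rng.gauss(0.0, 1.0) for _ in range(K)]
    # m_d = mu + chol eps is a draw from N(mu, Sigma)
    m = [mu[k] + sum(chol[k][j] * eps[j] for j in range(k + 1)) for k in range(K)]
    theta = softmax(m)
    topics = rng.choices(range(K), weights=theta, k=n_words)  # z_di
    words = [rng.choices(range(len(phi[z])), weights=phi[z])[0] for z in topics]  # x_di
    return theta, topics, words

rng = random.Random(0)
beta = [0.5, 0.5, 0.5, 0.5]            # toy vocabulary of 4 words
mu = [0.0, 0.0, 0.0]                   # K = 3 topics
chol = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
phi = [sample_dirichlet(beta, rng) for _ in mu]
theta, topics, words = sample_document(mu, chol, phi, n_words=20, rng=rng)
```

With this `chol`, topics 1 and 2 have correlation 0.8 in $m_d$, which is the kind of dependence a Dirichlet prior on $\theta_d$ cannot express.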
A full joint distribution can be written as follows:
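\[
\begin{aligned}
p(x, z, \phi, m \mid \beta, \mu, \Sigma)
&= p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z) \\
&= \prod_k p(\phi_k \mid \beta) \cdot \prod_d p(m_d \mid \mu, \Sigma) \cdot \prod_d \prod_i p(z_{di} \mid m_d)\, p(x_{di} \mid \phi_{z_{di}}) \\
&= \prod_k \frac{\Gamma(\sum_w \beta_w)}{\prod_w \Gamma(\beta_w)} \prod_w \phi_{kw}^{\beta_w - 1}
\cdot \prod_d \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} \exp\Big\{ -\frac{1}{2} (m_d - \mu)^T \Sigma^{-1} (m_d - \mu) \Big\} \\
&\quad \cdot \prod_d \prod_i \prod_k \Big\{ \frac{\exp(m_{dk})}{\sum_{k'} \exp(m_{dk'})}\, \phi_{k x_{di}} \Big\}^{\delta(z_{di} = k)} ,
\end{aligned}
\tag{1}
\]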
where δ(·) is equal to one when the condition inside the parentheses holds and is equal to zero otherwise.
2 Variational Bayesian inference
A log evidence of an observed document set x can be lower-bounded by using Jensen’s inequality as follows:
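\[
\begin{aligned}
\ln p(x \mid \beta, \mu, \Sigma)
&= \ln \int\!\!\int \sum_z p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z)\, d\phi\, dm \\
&= \ln \int\!\!\int \sum_z q(z) q(\phi) q(m)\, \frac{p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z)}{q(z) q(\phi) q(m)}\, d\phi\, dm \\
&\ge \int\!\!\int \sum_z q(z) q(\phi) q(m) \ln \frac{p(\phi \mid \beta)\, p(m \mid \mu, \Sigma)\, p(z \mid m)\, p(x \mid \phi, z)}{q(z) q(\phi) q(m)}\, d\phi\, dm \\
&= \int \sum_z q(z) q(m) \ln p(z \mid m)\, dm + \int q(\phi) \ln p(\phi \mid \beta)\, d\phi \\
&\quad + \int \sum_z q(z) q(\phi) \ln p(x \mid \phi, z)\, d\phi + \int q(m) \ln p(m \mid \mu, \Sigma)\, dm \\
&\quad - \sum_z q(z) \ln q(z) - \int q(\phi) \ln q(\phi)\, d\phi - \int q(m) \ln q(m)\, dm .
\end{aligned}
\tag{2}
\]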
With respect to variational posteriors, we assume:
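• $q(z)$ is factorized as $\prod_d \prod_i q(z_{di} \mid \gamma_{di}) = \prod_d \prod_i \prod_k \gamma_{dik}^{\delta(z_{di} = k)}$;
• $q(\phi)$ is factorized as $\prod_k q(\phi_k \mid \zeta_k)$, where each $q(\phi_k \mid \zeta_k)$ is a Dirichlet; and
• $q(m)$ is factorized as $\prod_d \prod_k q(m_{dk} \mid r_{dk}, s_{dk})$, where each $q(m_{dk} \mid r_{dk}, s_{dk})$ is a univariate Gaussian with mean $r_{dk}$ and standard deviation $s_{dk}$.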
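2.1
\[
\begin{aligned}
\int \sum_z q(z) q(m) \ln p(z \mid m)\, dm
&= \int \sum_z \prod_d \prod_i \prod_k \gamma_{dik}^{\delta(z_{di} = k)} \prod_d \prod_k q(m_{dk} \mid r_{dk}, s_{dk})
\ln \prod_d \prod_i \prod_k \Big\{ \frac{\exp(m_{dk})}{\sum_{k'} \exp(m_{dk'})} \Big\}^{\delta(z_{di} = k)} dm \\
&= \sum_d \sum_i \sum_k \gamma_{dik} \int q(m_{dk} \mid r_{dk}, s_{dk}) \ln \exp(m_{dk})\, dm_{dk}
- \sum_d \sum_i \sum_k \gamma_{dik} \int q(m_d \mid r_d, s_d) \ln \sum_{k'} \exp(m_{dk'})\, dm_d \\
&= \sum_d \sum_i \sum_k \gamma_{dik} r_{dk}
- \sum_d \sum_i \sum_k \gamma_{dik} \int q(m_d \mid r_d, s_d) \ln \sum_{k'} \exp(m_{dk'})\, dm_d
\end{aligned}
\tag{3}
\]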
We obtain a lower bound by the variational method proposed in [1]. Since $\ln x \le \frac{x}{\nu} - 1 + \ln \nu$ for any $\nu > 0$, we introduce a new variable $\nu_d$ for each document and obtain the following inequality:
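\[
\begin{aligned}
\int q(m_d \mid r_d, s_d) \ln \sum_k \exp(m_{dk})\, dm_d
&\le \int q(m_d \mid r_d, s_d) \Big\{ \nu_d^{-1} \sum_k \exp(m_{dk}) - 1 + \ln \nu_d \Big\} dm_d \\
&= \ln \nu_d - 1 + \nu_d^{-1} \sum_k \int q(m_{dk} \mid r_{dk}, s_{dk}) \exp(m_{dk})\, dm_{dk} \\
&= \ln \nu_d - 1 + \nu_d^{-1} \sum_k \exp(r_{dk} + s_{dk}^2/2) .
\end{aligned}
\tag{4}
\]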
Therefore, Eq. (3) can be lower-bounded as follows:
\[
\int \sum_z q(z) q(m) \ln p(z \mid m)\, dm
\ge \sum_d \sum_i \sum_k \gamma_{dik} \Big\{ r_{dk} - \ln \nu_d + 1 - \nu_d^{-1} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2/2) \Big\} .
\tag{5}
\]
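As a quick numerical sanity check of the bound in Eq. (4), the following sketch evaluates its right-hand side for several values of $\nu_d$. The function name and the toy values of $r_d$ and $s_d$ are assumptions made for illustration only.

```python
import math

def log_sum_exp_bound(r, s, nu):
    """Right-hand side of Eq. (4): ln nu - 1 + (1/nu) * sum_k exp(r_k + s_k^2 / 2)."""
    expected_sum = sum(math.exp(rk + sk * sk / 2.0) for rk, sk in zip(r, s))
    return math.log(nu) - 1.0 + expected_sum / nu

r = [0.5, -1.0, 0.2]   # toy variational means r_dk
s = [0.3, 0.8, 0.5]    # toy variational standard deviations s_dk

# The bound is tightest where nu equals the expected sum of exponentials.
nu_star = sum(math.exp(rk + sk * sk / 2.0) for rk, sk in zip(r, s))
tightest = log_sum_exp_bound(r, s, nu_star)
others = [log_sum_exp_bound(r, s, nu) for nu in (0.5, 1.0, 5.0, 20.0)]
```

At `nu_star` the right-hand side reduces to $\ln \nu_d$, and every other choice of $\nu_d$ gives a looser (larger) value.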
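2.2
\[
\begin{aligned}
\int q(\phi) \ln p(\phi \mid \beta)\, d\phi
&= \sum_k \int \frac{\Gamma(\sum_w \zeta_{kw})}{\prod_w \Gamma(\zeta_{kw})} \prod_w \phi_{kw}^{\zeta_{kw} - 1}
\ln \Big\{ \frac{\Gamma(\sum_w \beta_w)}{\prod_w \Gamma(\beta_w)} \prod_w \phi_{kw}^{\beta_w - 1} \Big\} d\phi_k \\
&= K \ln \Gamma\Big(\sum_w \beta_w\Big) - K \sum_w \ln \Gamma(\beta_w)
+ \sum_k \sum_w (\beta_w - 1) \Big\{ \Psi(\zeta_{kw}) - \Psi\Big(\sum_{w'} \zeta_{kw'}\Big) \Big\}
\end{aligned}
\tag{6}
\]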
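\[
\begin{aligned}
\int \sum_z q(z) q(\phi) \ln p(x \mid \phi, z)\, d\phi
&= \sum_d \sum_i \sum_k \gamma_{dik} \int \frac{\Gamma(\sum_w \zeta_{kw})}{\prod_w \Gamma(\zeta_{kw})} \prod_w \phi_{kw}^{\zeta_{kw} - 1} \ln \phi_{k x_{di}}\, d\phi_k \\
&= \sum_d \sum_i \sum_k \gamma_{dik} \Big\{ \Psi(\zeta_{k x_{di}}) - \Psi\Big(\sum_w \zeta_{kw}\Big) \Big\}
\end{aligned}
\tag{7}
\]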
These derivations are identical to those for latent Dirichlet allocation (LDA).
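2.3
\[
\begin{aligned}
\int q(m) \ln p(m \mid \mu, \Sigma)\, dm
&= \sum_d \int \prod_k q(m_{dk} \mid r_{dk}, s_{dk})
\ln \Big[ \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}} \exp\Big\{ -\frac{1}{2} (m_d - \mu)^T \Sigma^{-1} (m_d - \mu) \Big\} \Big] dm_d \\
&= -\frac{DK}{2} \ln 2\pi - \frac{D}{2} \ln |\Sigma|
- \frac{1}{2} \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk}
- \frac{1}{2} \sum_d (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) ,
\end{aligned}
\tag{8}
\]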
where $(\Sigma^{-1})_{kk'}$ means the $(k, k')$th entry of $\Sigma^{-1}$. The last two terms are derived as follows:
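\[
\begin{aligned}
& \int \prod_k q(m_{dk} \mid r_{dk}, s_{dk})\, (m_d - \mu)^T \Sigma^{-1} (m_d - \mu)\, dm_d \\
&= \int \prod_k q(m_{dk} \mid r_{dk}, s_{dk}) \Big\{ \sum_k (m_{dk} - \mu_k)^2 (\Sigma^{-1})_{kk}
+ \sum_k \sum_{k' \ne k} (m_{dk} - \mu_k)(m_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} \Big\} dm_d \\
&= \sum_k (r_{dk}^2 + s_{dk}^2 - 2 r_{dk} \mu_k + \mu_k^2) (\Sigma^{-1})_{kk}
+ \sum_k \sum_{k' \ne k} (r_{dk} - \mu_k)(r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} \\
&= \sum_k s_{dk}^2 (\Sigma^{-1})_{kk}
+ \sum_k \sum_{k'} (r_{dk} - \mu_k)(r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} \\
&= \sum_k s_{dk}^2 (\Sigma^{-1})_{kk} + (r_d - \mu)^T \Sigma^{-1} (r_d - \mu)
\end{aligned}
\tag{9}
\]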
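The identity in Eq. (9) can be checked numerically by comparing its closed form with a Monte Carlo average over draws $m_{dk} \sim \mathcal{N}(r_{dk}, s_{dk}^2)$. The sketch below is illustrative only: the function names and toy values are assumptions, and `prec` stands for $\Sigma^{-1}$.

```python
import random

def expected_quadratic_exact(r, s, mu, prec):
    """Closed form of Eq. (9): sum_k s_k^2 prec_kk + (r - mu)^T prec (r - mu)."""
    K = len(r)
    diff = [r[k] - mu[k] for k in range(K)]
    trace_term = sum(s[k] ** 2 * prec[k][k] for k in range(K))
    quad_term = sum(diff[k] * prec[k][l] * diff[l] for k in range(K) for l in range(K))
    return trace_term + quad_term

def expected_quadratic_mc(r, s, mu, prec, n_samples, rng):
    """Monte Carlo estimate of E_q[(m - mu)^T prec (m - mu)] with independent m_k."""
    K = len(r)
    total = 0.0
    for _ in range(n_samples):
        dev = [rng.gauss(r[k], s[k]) - mu[k] for k in range(K)]
        total += sum(dev[k] * prec[k][l] * dev[l] for k in range(K) for l in range(K))
    return total / n_samples

r, s, mu = [1.0, 0.0, -1.0], [0.5, 1.0, 0.7], [0.5, 0.5, 0.0]
prec = [[2.0, -0.5, 0.0], [-0.5, 1.0, 0.3], [0.0, 0.3, 1.5]]
exact = expected_quadratic_exact(r, s, mu, prec)
estimate = expected_quadratic_mc(r, s, mu, prec, 50000, random.Random(0))
```

The off-diagonal entries of `prec` contribute only through the means, because the factorized $q(m_d)$ has no cross-covariance between topics.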
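2.4
\[
\sum_z q(z) \ln q(z) = \sum_d \sum_i \sum_k \gamma_{dik} \ln \gamma_{dik}
\tag{10}
\]
\[
\int q(\phi) \ln q(\phi)\, d\phi
= \sum_k \ln \Gamma\Big(\sum_w \zeta_{kw}\Big) - \sum_k \sum_w \ln \Gamma(\zeta_{kw})
+ \sum_k \sum_w (\zeta_{kw} - 1) \Big\{ \Psi(\zeta_{kw}) - \Psi\Big(\sum_{w'} \zeta_{kw'}\Big) \Big\}
\tag{11}
\]
\[
\int q(m) \ln q(m)\, dm = -\frac{DK}{2} - DK \ln \sqrt{2\pi} - \sum_d \sum_k \ln s_{dk}
\tag{12}
\]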
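Eq. (12) is the negative entropy of the factorized Gaussian. As a brief worked step, assuming $s_{dk}$ denotes the standard deviation of $q(m_{dk} \mid r_{dk}, s_{dk})$, each factor contributes:

```latex
\[
\int q(m_{dk} \mid r_{dk}, s_{dk}) \ln q(m_{dk} \mid r_{dk}, s_{dk})\, dm_{dk}
= -\frac{1}{2} \ln\big(2 \pi e\, s_{dk}^2\big)
= -\frac{1}{2} - \ln \sqrt{2\pi} - \ln s_{dk} .
\]
```

Summing this over the $D$ documents and $K$ topics gives the three terms of Eq. (12).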
3 Updating posteriors
Consequently, the lower bound in Eq. (2) is obtained as follows:
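\[
\begin{aligned}
\ln p(x \mid \beta, \mu, \Sigma) \ge{}
& \sum_d \sum_i \sum_k \gamma_{dik} \Big\{ r_{dk} - \ln \nu_d + 1 - \nu_d^{-1} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2/2) \Big\} \\
& + K \ln \Gamma\Big(\sum_w \beta_w\Big) - K \sum_w \ln \Gamma(\beta_w)
+ \sum_k \sum_w (\beta_w - 1) \Big\{ \Psi(\zeta_{kw}) - \Psi\Big(\sum_{w'} \zeta_{kw'}\Big) \Big\} \\
& - \sum_k \ln \Gamma\Big(\sum_w \zeta_{kw}\Big) + \sum_k \sum_w \ln \Gamma(\zeta_{kw})
- \sum_k \sum_w (\zeta_{kw} - 1) \Big\{ \Psi(\zeta_{kw}) - \Psi\Big(\sum_{w'} \zeta_{kw'}\Big) \Big\} \\
& + \sum_d \sum_i \sum_k \gamma_{dik} \Big\{ \Psi(\zeta_{k x_{di}}) - \Psi\Big(\sum_w \zeta_{kw}\Big) \Big\}
- \sum_d \sum_i \sum_k \gamma_{dik} \ln \gamma_{dik} \\
& - \frac{DK}{2} \ln 2\pi - \frac{D}{2} \ln |\Sigma|
- \frac{1}{2} \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk}
- \frac{1}{2} \sum_d (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) \\
& + \frac{DK}{2} + DK \ln \sqrt{2\pi} + \sum_d \sum_k \ln s_{dk} .
\end{aligned}
\tag{13}
\]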
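Let $L$ denote the right-hand side. With respect to $\nu_d$, we obtain a derivative:
\[
\frac{\partial L}{\partial \nu_d}
= \sum_i \sum_k \gamma_{dik} \Big\{ -\nu_d^{-1} + \nu_d^{-2} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2/2) \Big\} .
\tag{14}
\]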
Note that $\sum_i \sum_k \gamma_{dik}$ is equal to $n_d$, the length of document $d$. From $\partial L / \partial \nu_d = 0$, we obtain $\nu_d = \sum_k \exp(r_{dk} + s_{dk}^2/2)$. With respect to $\gamma_{dik}$, we obtain a derivative:
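\[
\frac{\partial L}{\partial \gamma_{dik}}
= r_{dk} - \ln \nu_d + 1 - \nu_d^{-1} \sum_{k'} \exp(r_{dk'} + s_{dk'}^2/2)
+ \Psi(\zeta_{k x_{di}}) - \Psi\Big(\sum_w \zeta_{kw}\Big) - \ln \gamma_{dik} - 1 .
\tag{15}
\]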
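Therefore, by using $\nu_d = \sum_k \exp(r_{dk} + s_{dk}^2/2)$, we can update $\gamma_{dik}$ as
\[
\gamma_{dik} \propto \exp(r_{dk}) \cdot \frac{\exp \Psi(\zeta_{k x_{di}})}{\exp \Psi(\sum_w \zeta_{kw})} .
\]
With respect to $r_{dk}$,
\[
\frac{\partial L}{\partial r_{dk}}
= n_{dk} - \frac{n_d}{\nu_d} \exp(r_{dk} + s_{dk}^2/2) - \sum_{k'} (r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'} ,
\tag{16}
\]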
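where $n_{dk} \equiv \sum_i \gamma_{dik}$. The equation $\partial L / \partial r_{dk} = 0$ cannot be solved analytically. Therefore, we maximize
\[
L(r_{dk}) = n_{dk} r_{dk} - \frac{n_d}{\nu_d} \exp(r_{dk} + s_{dk}^2/2)
+ \frac{1}{2} r_{dk}^2 (\Sigma^{-1})_{kk} - r_{dk} \sum_{k'} (r_{dk'} - \mu_{k'}) (\Sigma^{-1})_{kk'}
\tag{17}
\]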
by some gradient-based method (e.g. L-BFGS). With respect to $s_{dk}$, we maximize
\[
L(s_{dk}) = -\frac{n_d}{\nu_d} \exp(r_{dk} + s_{dk}^2/2) - \frac{1}{2} s_{dk}^2 (\Sigma^{-1})_{kk} + \ln s_{dk}
\tag{18}
\]
by using a gradient
\[
\frac{\partial L(s_{dk})}{\partial s_{dk}}
= -\frac{n_d}{\nu_d}\, s_{dk} \exp(r_{dk} + s_{dk}^2/2) - s_{dk} (\Sigma^{-1})_{kk} + \frac{1}{s_{dk}} .
\tag{19}
\]
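The one-dimensional maximization over $s_{dk}$ can be sketched as follows. Differentiating Eq. (18) by the chain rule puts a factor $s_{dk}$ on the exponential term, and the resulting gradient is strictly decreasing for $s_{dk} > 0$ when $(\Sigma^{-1})_{kk} > 0$, so this sketch uses plain bisection on the gradient as a simple stand-in for a general gradient-based optimizer. The function names, the search interval, and the toy values are assumptions, and $\nu_d$ is held fixed during the update.

```python
import math

def objective_s(s, r, n_d, nu_d, prec_kk):
    """L(s_dk) of Eq. (18); prec_kk stands for (Sigma^{-1})_kk."""
    return -(n_d / nu_d) * math.exp(r + s * s / 2.0) - 0.5 * s * s * prec_kk + math.log(s)

def gradient_s(s, r, n_d, nu_d, prec_kk):
    """Derivative of Eq. (18) with respect to s_dk."""
    return -(n_d / nu_d) * s * math.exp(r + s * s / 2.0) - s * prec_kk + 1.0 / s

def maximize_s(r, n_d, nu_d, prec_kk, lo=1e-8, hi=10.0, iters=100):
    """Find the root of the gradient by bisection on the assumed bracket [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gradient_s(mid, r, n_d, nu_d, prec_kk) > 0.0:
            lo = mid   # still climbing: the maximizer lies to the right
        else:
            hi = mid   # past the peak: the maximizer lies to the left
    return 0.5 * (lo + hi)

# Toy values: r_dk = 0.3, n_d = 50 tokens, nu_d = 4.0, (Sigma^{-1})_kk = 1.2
s_star = maximize_s(0.3, 50.0, 4.0, 1.2)
```

Because the objective is concave in $s_{dk}$ on the positive half-line, the root of the gradient is its unique maximizer; the $\ln s_{dk}$ term keeps the solution strictly positive.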
With respect to $\zeta_{kw}$, we obtain the following update: $\zeta_{kw} = \beta_w + \sum_d \sum_i \delta(x_{di} = w)\, \gamma_{dik}$.
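The closed-form updates for $\gamma_{dik}$ and $\zeta_{kw}$ can be sketched in standard-library Python. Everything named here is an assumption for illustration: the helper `digamma` approximates $\Psi$ with the recurrence $\Psi(x) = \Psi(x+1) - 1/x$ and a truncated asymptotic series (the standard library has no digamma function), documents are lists of integer word ids, and only tokens whose word is $w$ contribute to $\zeta_{kw}$.

```python
import math

def digamma(x):
    """Approximate Psi(x) for x > 0: shift x up to at least 6, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def update_gamma(r_d, zeta, word):
    """gamma_dik for one token, normalized over topics k."""
    K = len(r_d)
    logits = [r_d[k] + digamma(zeta[k][word]) - digamma(sum(zeta[k])) for k in range(K)]
    top = max(logits)  # subtract the max before exponentiating
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def update_zeta(beta, docs, gammas, K):
    """zeta_kw = beta_w + sum of gamma_dik over tokens with x_di = w."""
    zeta = [list(beta) for _ in range(K)]
    for words, gamma_d in zip(docs, gammas):
        for word, gamma_di in zip(words, gamma_d):
            for k in range(K):
                zeta[k][word] += gamma_di[k]
    return zeta

beta = [0.1, 0.1, 0.1]                      # toy vocabulary of 3 words
zeta = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # K = 2 topics
docs = [[0, 2, 2], [1]]                     # word ids x_di
r = [[0.2, -0.1], [0.0, 0.5]]               # variational means r_dk
gammas = [[update_gamma(r_d, zeta, w) for w in words] for r_d, words in zip(r, docs)]
new_zeta = update_zeta(beta, docs, gammas, K=2)
```

The terms of Eq. (15) that do not depend on $k$ cancel in the normalization, which is why `update_gamma` needs neither $\nu_d$ nor $s_{dk}$.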
With respect to Σ, we have the following function to be maximized:
\[
L(\Sigma) = -\frac{D}{2} \ln |\Sigma|
- \frac{1}{2} \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk}
- \frac{1}{2} \sum_d (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) .
\tag{20}
\]
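From the first term in Eq. (20), we obtain a derivative $\frac{\partial \ln |\Sigma|}{\partial \Sigma_{kk'}} = \mathrm{tr}\Big( \Sigma^{-1} \frac{\partial \Sigma}{\partial \Sigma_{kk'}} \Big)$. The matrix $\Sigma^{-1} \frac{\partial \Sigma}{\partial \Sigma_{kk'}}$ has non-zero entries only in the $k'$th column, and the column has an entry $(\Sigma^{-1})_{lk}$ at the $l$th row. Therefore, $\frac{\partial \ln |\Sigma|}{\partial \Sigma_{kk'}} = (\Sigma^{-1})_{k'k}$. By symmetry, $\frac{\partial \ln |\Sigma|}{\partial \Sigma} = \Sigma^{-1}$.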
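For the second term in Eq. (20), it holds that $\sum_k s_{dk}^2 (\Sigma^{-1})_{kk} = \mathrm{tr}(\Sigma^{-1} S_d)$, where $S_d$ is a diagonal matrix whose $k$th diagonal entry is $s_{dk}^2$. By using the identity$^1$ $\frac{\partial\, \mathrm{tr}(A \Sigma^{-1} B)}{\partial \Sigma} = -\Sigma^{-1} B A \Sigma^{-1}$, we obtain $\frac{\partial \sum_d \sum_k s_{dk}^2 (\Sigma^{-1})_{kk}}{\partial \Sigma} = -\Sigma^{-1} \Big( \sum_d S_d \Big) \Sigma^{-1}$.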
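For the last term in Eq. (20), it holds that $(r_d - \mu)^T \Sigma^{-1} (r_d - \mu) = \mathrm{tr}\big( (r_d - \mu)^T \Sigma^{-1} (r_d - \mu) \big)$. Therefore, by using the identity $\frac{\partial\, \mathrm{tr}(A \Sigma^{-1} B)}{\partial \Sigma} = -\Sigma^{-1} B A \Sigma^{-1}$ again, we obtain $\frac{\partial (r_d - \mu)^T \Sigma^{-1} (r_d - \mu)}{\partial \Sigma} = -\Sigma^{-1} (r_d - \mu)(r_d - \mu)^T \Sigma^{-1}$.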
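Consequently,
\[
\frac{\partial L(\Sigma)}{\partial \Sigma}
= -\frac{D}{2} \Sigma^{-1}
+ \frac{1}{2} \Sigma^{-1} \Big( \sum_d S_d \Big) \Sigma^{-1}
+ \frac{1}{2} \Sigma^{-1} \Big( \sum_d (r_d - \mu)(r_d - \mu)^T \Big) \Sigma^{-1} .
\tag{21}
\]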
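Therefore, $\frac{\partial L(\Sigma)}{\partial \Sigma} = 0$ holds when $\Sigma^{-1} = \frac{1}{D} \Sigma^{-1} \Big( \sum_d \big\{ S_d + (r_d - \mu)(r_d - \mu)^T \big\} \Big) \Sigma^{-1}$. By multiplying $\Sigma$ from the left and the right, we obtain $\Sigma = \frac{1}{D} \sum_d \big\{ S_d + (r_d - \mu)(r_d - \mu)^T \big\}$.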
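The closed-form update for $\Sigma$ needs no matrix inversion and can be sketched directly. The function name and the toy inputs are assumptions for illustration; `r` and `s` hold the per-document vectors $r_d$ and $s_d$, and $\mu$ is taken as given.

```python
def update_sigma(r, s, mu):
    """Sigma = (1/D) * sum_d { S_d + (r_d - mu)(r_d - mu)^T }, with S_d = diag(s_dk^2)."""
    D, K = len(r), len(mu)
    sigma = [[0.0] * K for _ in range(K)]
    for r_d, s_d in zip(r, s):
        diff = [r_d[k] - mu[k] for k in range(K)]
        for k in range(K):
            sigma[k][k] += s_d[k] ** 2              # diagonal matrix S_d
            for l in range(K):
                sigma[k][l] += diff[k] * diff[l]    # outer product term
    return [[v / D for v in row] for row in sigma]

# Two toy documents with K = 2 topics
r = [[1.0, 0.0], [-1.0, 0.0]]
s = [[0.5, 0.5], [0.5, 0.5]]
mu = [0.0, 0.0]
sigma = update_sigma(r, s, mu)
```

The variational variances $s_{dk}^2$ enter only the diagonal, so they inflate the estimated topic variances without affecting the off-diagonal covariances.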
References
[1] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005.
$^1$ cf. Eq. (16) in http://research.microsoft.com/en-us/um/people/minka/papers/matrix/minka-matrix.pdf