Notes 5 Multivariate Distributions
Distribution Modeling
Multivariate Distributions
Recall
Let X = (X1 , . . . , Xn ) be an n-dimensional vector of random variables. We have the following definitions and statements.
Definition 0.1 (Joint CDF). For all x = (x1 , . . . , xn )⊤ ∈ Rn , the joint cumulative distribution function (CDF) of X satisfies
$$F_X(x) = F_X(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n).$$
Definition 0.2 (Marginal CDF). For a fixed i, the marginal CDF of Xi satisfies
$$F_{X_i}(x_i) = P(X_i \le x_i) = F_X(\infty, \ldots, \infty, x_i, \infty, \ldots, \infty).$$
It is straightforward to generalize the previous definition to joint marginal distributions. For example, the joint marginal distribution of Xi and Xj satisfies
$$F_{X_i, X_j}(x_i, x_j) = P(X_i \le x_i, X_j \le x_j) = F_X(\infty, \ldots, \infty, x_i, \infty, \ldots, \infty, x_j, \infty, \ldots, \infty).$$
If the joint CDF is absolutely continuous, then it has an associated probability density function (PDF)
so that
$$F_X(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(u_1, \ldots, u_n)\, du_1 \cdots du_n.$$
Similar statements also apply to the marginal CDFs. A collection of random variables is independent
if the joint CDF (or PDF if it exists) can be factored into the product of the marginal CDFs (or
PDFs).
The mean vector of X, assuming it exists, is E[X] := (E[X1 ], . . . , E[Xn ])⊤ , whereas, again assuming it exists, the covariance matrix of X satisfies
$$\mathrm{Cov}(X) := \Sigma := E\left[(X - E[X])(X - E[X])^\top\right],$$
so that the (i, j)th element of Σ is simply the covariance of Xi and Xj . Note that the covariance matrix is symmetric so that Σ⊤ = Σ, its diagonal elements satisfy Σi,i ≥ 0, and it is positive semi-definite so that x⊤ Σx ≥ 0 for all x ∈ Rn .
Similarly, the correlation matrix of X has (i, j)th element
$$\rho_{ij} := \mathrm{Corr}(X_i, X_j).$$
It is also symmetric and positive semi-definite, and has 1's along the diagonal.
The density function of the p-variate multivariate normal distribution Np (µ, Σ) is
$$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}.$$
The term (x − µ)⊤ Σ−1 (x − µ) appearing inside the exponent is a quadratic form. This particular quadratic form is also called the squared Mahalanobis distance between the vector x and the mean vector µ. Thus, the density depends on x only through the squared Mahalanobis distance (x − µ)⊤ Σ−1 (x − µ), which is the equation for a hyper-ellipse centered at µ. The multivariate normal density is therefore constant on ellipsoids of the form
$$(x - \mu)^\top \Sigma^{-1} (x - \mu) = c^2, \qquad c > 0.$$
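As a minimal illustration (a NumPy sketch; the mean vector, covariance matrix and point below are made-up example values), the squared Mahalanobis distance can be computed directly from its definition:

```python
import numpy as np

# Hypothetical mean vector and covariance matrix (illustrative values only).
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

x = np.array([2.5, 1.0])

# Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu).
diff = x - mu
d2 = diff @ np.linalg.solve(Sigma, diff)
print(d2)
```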
If the variables are uncorrelated then the variance-covariance matrix will be a diagonal matrix with
variances of the individual variables appearing on the main diagonal of the matrix and zeros everywhere
else:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_p^2 \end{pmatrix}.$$
Let X ∼ Np (µ, Σ) be p-variate multivariate normal with mean µ and variance-covariance matrix Σ,
where
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}.$$
(iii) Conditional Distributions. Subdivide the vector X into two subsets X1 and X2 ,
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad
X_1 = \begin{pmatrix} X_{11} \\ X_{12} \\ \vdots \\ X_{1p} \end{pmatrix}, \qquad
X_2 = \begin{pmatrix} X_{21} \\ X_{22} \\ \vdots \\ X_{2q} \end{pmatrix},$$
and partition µ and Σ conformably as µ = (µ1⊤ , µ2⊤ )⊤ and
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Then the conditional distribution of X1 given X2 = x2 is again multivariate normal, with
$$X_1 \mid X_2 = x_2 \sim N_p\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).$$
(iv) If two variates of the multivariate normal, say X1 and X2 , are uncorrelated, i.e. ρ12 = 0 (equivalently σ12 = 0), then X1 and X2 are independent. This property is not true in general for other distributions. However, it is always true that if two variates are independent, then they are uncorrelated, no matter what their joint distribution is.
(v) The characteristic function of a multivariate normal distribution with mean µ and covariance
matrix Σ ≥ 0 is, for t ∈ Rp ,
$$\phi(t) = \exp\!\left( i\, t'\mu - \tfrac{1}{2} t'\Sigma t \right).$$
(vi) Linear Combinations. If X ∼ Np (µ, Σ) and Y = AX + b for A(q × p) and b(q × 1), then
Y ∼ Nq (Aµ + b, AΣA′ ).
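The linear-combination property (vi) is easy to check empirically. The following sketch (with arbitrary illustrative choices of µ, Σ, A and b) draws a large sample of X, forms Y = AX + b, and compares the sample moments of Y with Aµ + b and AΣA′:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])   # q = 2, p = 3
b = np.array([0.5, -0.5])

X = rng.multivariate_normal(mu, Sigma, size=100_000)   # rows are draws of X
Y = X @ A.T + b                                        # Y_i = A X_i + b

print(Y.mean(axis=0), A @ mu + b)   # sample mean vs A mu + b
print(np.cov(Y, rowvar=False))      # compare with A Sigma A^T
print(A @ Sigma @ A.T)
```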
Generating Multivariate Normally Distributed Random Vectors
Suppose we wish to generate a random vector X = (X1 , . . . , Xn ) with distribution X ∼ MNn (0, Σ). The case E[X] ≠ 0 can easily be accommodated afterwards. Let Z = (Z1 , . . . , Zn )⊤ , where the Zi are i.i.d. N (0, 1) random variables for i = 1, . . . , n. If C is an (n × m) matrix, then
$$C^\top Z \sim MN(0,\, C^\top C).$$
Thus, our task reduces to finding a matrix C such that C⊤ C = Σ. This can be achieved using
the Cholesky decomposition of Σ. A standard result from linear algebra states that any symmetric
positive-definite matrix M can be written as
M = U⊤ DU,
where U is an upper triangular matrix and D is a diagonal matrix with strictly positive diagonal
entries. Since Σ is symmetric positive-definite, we can express it as
$$\Sigma = U^\top D U = (U^\top \sqrt{D})(\sqrt{D}\, U) = (\sqrt{D}\, U)^\top (\sqrt{D}\, U).$$
Therefore, the matrix $C = \sqrt{D}\, U$ satisfies $C^\top C = \Sigma$, and is called the Cholesky decomposition of Σ.
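A minimal NumPy sketch of this procedure (the Σ below is just an illustrative positive-definite matrix). Note that np.linalg.cholesky returns a lower-triangular factor L with Σ = L L⊤, so L plays the role of C⊤ above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative positive-definite covariance matrix and mean vector.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])
mu = np.zeros(3)   # the case E[X] != 0 is handled by adding mu at the end

L = np.linalg.cholesky(Sigma)             # lower triangular, Sigma = L @ L.T

n_samples = 100_000
Z = rng.standard_normal((3, n_samples))   # i.i.d. N(0, 1) entries
X = (mu[:, None] + L @ Z).T               # each row is one draw from N(mu, Sigma)

print(np.cov(X, rowvar=False))            # should be close to Sigma
```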
Maximum Likelihood Estimation
Suppose now that Y1 , . . . , Yn are i.i.d. observations from Np (µ, Σ). The log-likelihood function is
$$\ell(\mu, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} (Y_i - \mu)^\top \Sigma^{-1} (Y_i - \mu).$$
Setting the gradient with respect to µ equal to zero gives
$$\partial_\mu \ell(\mu, \Sigma) = \sum_{i=1}^{n} \Sigma^{-1} (Y_i - \mu) = 0.$$
Since $\sum_{i=1}^{n} \Sigma^{-1}(Y_i - \mu) = \Sigma^{-1} \sum_{i=1}^{n} (Y_i - \mu)$, we deduce that µ̂ is the empirical mean:
$$\hat\mu = \bar Y = \begin{pmatrix} n^{-1}\sum_{i=1}^{n} Y_{i,1} \\ \vdots \\ n^{-1}\sum_{i=1}^{n} Y_{i,p} \end{pmatrix}.$$
By using the properties of the trace function, the concentrated log-likelihood function becomes:
$$\begin{aligned}
\ell(\hat\mu, \Sigma) &= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} (Y_i - \hat\mu)^\top \Sigma^{-1} (Y_i - \hat\mu) \\
&= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} \operatorname{tr}\!\left( (Y_i - \bar Y)^\top \Sigma^{-1} (Y_i - \bar Y) \right) \\
&= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left( \Sigma^{-1} \sum_{i=1}^{n} (Y_i - \bar Y)(Y_i - \bar Y)^\top \right) \\
&= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left( \Sigma^{-1} S \right),
\end{aligned}$$
where
$$S = \sum_{i=1}^{n} (Y_i - \bar Y)(Y_i - \bar Y)^\top.$$
Differentiating with respect to Σ−1 and setting the derivative equal to zero gives
$$\frac{\partial \ell(\hat\mu, \Sigma)}{\partial \Sigma^{-1}} = \frac{n}{2}\Sigma - \frac{1}{2}S = 0,$$
so that
$$\hat\Sigma = \frac{1}{n} S = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar Y)(Y_i - \bar Y)^\top.$$
The simplest and most common method of estimating a multivariate normal distribution is to take the sample mean vector and sample covariance matrix as our estimators of µ and Σ, respectively. It is easy to justify this choice since they are the maximum likelihood estimators. It is also common to take n/(n − 1) times Σ̂, i.e. to divide S by n − 1 rather than n, as an estimator of Σ, since this estimator is known to be unbiased.
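A short NumPy sketch of these estimators on simulated data (the true µ and Σ below are arbitrary illustrative values): the MLE divides the centered outer products by n, while np.cov with its default denominator n − 1 gives the unbiased version.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true = np.array([0.0, 1.0])
Sigma_true = np.array([[1.0, 0.4],
                       [0.4, 2.0]])
Y = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # rows = observations
n = Y.shape[0]

mu_hat = Y.mean(axis=0)                     # MLE of mu (sample mean)
resid = Y - mu_hat
Sigma_mle = resid.T @ resid / n             # MLE: divide by n
Sigma_unbiased = resid.T @ resid / (n - 1)  # unbiased: divide by n - 1

print(np.allclose(Sigma_unbiased, np.cov(Y, rowvar=False)))  # np.cov uses n - 1 by default
```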
However, the multivariate normal distribution has several well-known shortcomings as a model for real data:
(i) The tails of its univariate marginal distributions are too thin; they do not assign enough weight to extreme events.
(ii) The joint tails of the distribution do not assign enough weight to joint extreme outcomes.
(iii) The distribution has a strong form of symmetry, known as elliptical symmetry.
Multivariate Gaussian Mixture Model
Multivariate Gaussian mixture model. The probability density function of the random vector X of dimension d is defined as a weighted sum of Gaussian densities:
$$f(x) = \sum_{j=1}^{K} \pi_j\, \phi_d(x; \mu_j, \Sigma_j),$$
where K is the number of mixture components, µj and Σj are the mean vector and the covariance
matrix of the Gaussian distribution associated with the j th component, and πj is the mixture weight
such that $\sum_{j=1}^{K} \pi_j = 1$. The log-likelihood function of the sample X = {X1 , . . . , XN } is:
$$\ell(\theta) = \sum_{i=1}^{N} \ln\!\left( \sum_{j=1}^{K} \pi_j\, \phi_d(X_i; \mu_j, \Sigma_j) \right).$$
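The mixture density and log-likelihood translate directly into code. The sketch below (component parameters and data are made up for illustration) evaluates ℓ(θ) for a d-dimensional sample using scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Log-likelihood of a sample X (N x d) under a K-component Gaussian mixture."""
    # densities[:, j] = phi_d(X_i; mu_j, Sigma_j)
    densities = np.column_stack([
        multivariate_normal.pdf(X, mean=m, cov=c) for m, c in zip(means, covs)
    ])
    return np.sum(np.log(densities @ np.asarray(weights)))

# Illustrative two-component mixture in d = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_log_likelihood(X, weights, means, covs))
```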
One-dimensional Case
In the one-dimensional case, each component is a univariate Gaussian and the parameters to be estimated are:
• Mixture weights πk .
• Means µk .
• Variances σk2 .
In the EM algorithm, we maximize the expected complete-data log-likelihood, which is the log-
likelihood considering both the observed data X and the latent variables Z = {z1 , z2 , . . . , zN } that
indicate which Gaussian component generated each data point.
$$\log L(\pi, \mu, \sigma^2) = \sum_{i=1}^{N} \sum_{k=1}^{K} I(z_i = k)\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right],$$
where I(zi = k) is an indicator function that equals 1 if zi = k (i.e., if the i-th data point was generated
by the k-th Gaussian component) and 0 otherwise.
Since we don’t know the values of zi (they are latent variables), we take the expectation with respect
to their posterior distribution, which leads to the expected complete-data log-likelihood:
$$Q(\pi, \mu, \sigma^2) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik}\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right].$$
E-Step
In the E-step of the EM algorithm, we calculate the responsibilities, which represent the probability
that each data point xi was generated by each Gaussian component k. These responsibilities are
denoted as γik and are defined as:
$$\gamma_{ik} = P(z_i = k \mid x_i),$$
where zi is the latent variable indicating which Gaussian component generated the data point xi . The responsibilities γik can be computed using Bayes' theorem:
$$\gamma_{ik} = \frac{P(z_i = k)\, p(x_i \mid z_i = k)}{\sum_{j=1}^{K} P(z_i = j)\, p(x_i \mid z_i = j)}.$$
In the context of a Gaussian Mixture Model, P(zi = k) is the mixing coefficient πk , and p(xi | zi = k) is the probability density function of the k-th Gaussian, given by:
$$p(x_i \mid z_i = k) = \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right).$$
Substituting, the responsibilities become
$$\gamma_{ik} = \frac{\pi_k\, \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right)}{\sum_{j=1}^{K} \pi_j\, \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left( -\frac{(x_i - \mu_j)^2}{2\sigma_j^2} \right)}.$$
This formula gives the responsibility γik that the k-th Gaussian component has for explaining the i-th
data point xi .
The responsibilities γik are then used in the M-step to update the parameters πk , µk , and σk2 .
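A minimal NumPy/SciPy sketch of this E-step for the one-dimensional case (the function name e_step and the parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma2):
    """Responsibilities gamma[i, k] = P(z_i = k | x_i) for a 1-D Gaussian mixture."""
    # Unnormalized weights pi_k * N(x_i | mu_k, sigma_k^2), shape (N, K).
    weighted = pi * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(sigma2))
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([-0.2, 0.1, 2.8, 3.1])
gamma = e_step(x, pi=np.array([0.5, 0.5]),
               mu=np.array([0.0, 3.0]), sigma2=np.array([1.0, 1.0]))
print(gamma.sum(axis=1))   # each row sums to 1
```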
M-Step
In the M-step, we maximize Q(π, µ, σ 2 ) with respect to the parameters πk , µk , and σk2 .
To find the optimal πk , we maximize Q(π, µ, σ 2 ) with respect to πk , subject to the constraint that $\sum_{k=1}^{K} \pi_k = 1$. The relevant part of Q is:
$$Q(\pi) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \pi_k.$$
We introduce a Lagrange multiplier λ for the constraint and set up the Lagrangian:
$$\mathcal{L} = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \pi_k + \lambda\!\left( \sum_{k=1}^{K} \pi_k - 1 \right).$$
Setting $\partial \mathcal{L} / \partial \pi_k = \sum_{i=1}^{N} \gamma_{ik} / \pi_k + \lambda = 0$ and using the constraint yields
$$\pi_k^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}.$$
This is the average responsibility that component k takes over all data points.
Update of Means µk :
To find the optimal µk , we maximize Q(π, µ, σ 2 ) with respect to µk . The relevant part of Q is:
$$Q(\mu) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2),$$
where
$$\log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) = -\frac{1}{2}\log(2\pi\sigma_k^2) - \frac{(x_i - \mu_k)^2}{2\sigma_k^2}.$$
Keeping only the terms that depend on µk ,
$$Q(\mu_k) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\, \frac{(x_i - \mu_k)^2}{\sigma_k^2}.$$
Setting the derivative equal to zero,
$$\frac{\partial Q(\mu_k)}{\partial \mu_k} = \sum_{i=1}^{N} \gamma_{ik}\, \frac{x_i - \mu_k}{\sigma_k^2} = 0
\quad\Longrightarrow\quad
\mu_k^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}}.$$
This is the weighted average of the data points, where the weights are the responsibilities.
Update of Variances σk2 :
To find the optimal σk2 , we maximize Q(π, µ, σ 2 ) with respect to σk2 . The relevant part of Q is:
$$Q(\sigma_k^2) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \log \sigma_k^2 + \frac{(x_i - \mu_k)^2}{\sigma_k^2} \right].$$
Taking the derivative with respect to σk2 and setting it to zero gives:
$$\frac{\partial Q(\sigma_k^2)}{\partial \sigma_k^2} = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \frac{1}{\sigma_k^2} - \frac{(x_i - \mu_k)^2}{\sigma_k^4} \right] = 0
\quad\Longrightarrow\quad
(\sigma_k^2)^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_{ik}\, (x_i - \mu_k^{\text{new}})^2}{\sum_{i=1}^{N} \gamma_{ik}}.$$
This is the weighted variance, where the weights are the responsibilities.
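The three update formulas are straightforward to implement. The sketch below (continuing the hypothetical e_step function from the E-step section; m_step is likewise an illustrative name) performs one M-step given the responsibilities:

```python
import numpy as np

def m_step(x, gamma):
    """One M-step for a 1-D Gaussian mixture, given responsibilities gamma (N x K)."""
    Nk = gamma.sum(axis=0)                   # effective number of points per component
    pi_new = Nk / len(x)                     # average responsibility
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk                       # weighted means
    sigma2_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk   # weighted variances
    return pi_new, mu_new, sigma2_new
```

Alternating e_step and m_step until the log-likelihood stabilizes yields the EM algorithm for the one-dimensional mixture.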
Multi-dimensional Case
In the multi-dimensional case the complete-data log-likelihood is
$$\log L(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \sum_{k=1}^{K} I(z_i = k)\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right],$$
where I(zi = k) is an indicator function that equals 1 if zi = k (i.e., if the i-th data point was generated
by the k-th Gaussian component) and 0 otherwise. Since we don’t know the values of zi (they are
latent variables), we take the expectation with respect to their posterior distribution, which leads to
the expected complete-data log-likelihood:
$$Q(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik}\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right].$$
E-step
Similar to the one-dimensional case, given the current parameters at the t-th iteration, we compute the responsibilities (posterior probabilities of the latent variables):
$$\gamma_{ik}^{(t+1)} = \frac{\pi_k^{(t)}\, \mathcal{N}\!\left( x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)} \right)}{\sum_{j=1}^{K} \pi_j^{(t)}\, \mathcal{N}\!\left( x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)} \right)}.$$
M-Step
To find the optimal πk , we maximize Q(π, µ, Σ) with respect to πk , subject to the constraint that $\sum_{k=1}^{K} \pi_k = 1$. The relevant part of Q is:
$$Q(\pi) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \pi_k.$$
We introduce a Lagrange multiplier λ for the constraint and set up the Lagrangian:
$$\mathcal{L} = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \pi_k + \lambda\!\left( \sum_{k=1}^{K} \pi_k - 1 \right).$$
Taking the derivative with respect to πk ,
$$\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\sum_{i=1}^{N} \gamma_{ik}}{\pi_k} + \lambda = 0.$$
Solving for πk and using the constraint $\sum_{k=1}^{K} \pi_k = 1$, we get:
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}^{(t+1)}.$$
This is the average responsibility that component k takes over all data points.
Update of Means µk :
To find the optimal µk , we maximize Q(π, µ, Σ) with respect to µk . The relevant part of Q is:
$$Q(\mu) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k),$$
where
$$\log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k).$$
Keeping only the terms that depend on µk ,
$$Q(\mu_k) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\, (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k).$$
Setting the derivative equal to zero,
$$\frac{\partial Q(\mu_k)}{\partial \mu_k} = \sum_{i=1}^{N} \gamma_{ik}\, \Sigma_k^{-1} (x_i - \mu_k) = 0
\quad\Longrightarrow\quad
\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}}.$$
This is the weighted average of the data points, where the weights are the responsibilities.
Update of Covariances Σk :
To find the optimal Σk , we maximize Q(π, µ, Σ) with respect to Σk . The relevant part of Q now involves both the log-determinant and the quadratic form:
$$Q(\Sigma_k) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \log|\Sigma_k| + (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) \right].$$
Taking the derivative with respect to Σk and setting it to zero,
$$\frac{\partial Q(\Sigma_k)}{\partial \Sigma_k} = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \Sigma_k^{-1} - \Sigma_k^{-1}(x_i - \mu_k)(x_i - \mu_k)^\top \Sigma_k^{-1} \right] = 0
\quad\Longrightarrow\quad
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}}.$$
This is the weighted covariance matrix, where the weights are the responsibilities.
In summary, with n observations and responsibilities γj,i for component j and observation i, the M-step updates at iteration k + 1 are:
$$\pi_j^{(k+1)} = \frac{1}{n}\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)},$$
$$\mu_j^{(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}\, x_i}{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}},$$
$$\Sigma_j^{(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}\, (x_i - \mu_j^{(k+1)})(x_i - \mu_j^{(k+1)})^\top}{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}}.$$
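In practice this multi-dimensional EM iteration is available off the shelf. As a usage sketch (the simulated clusters below are purely illustrative), scikit-learn's GaussianMixture fits such a model by an EM iteration of this form (it also adds a small regularization term to the covariance diagonals for numerical stability):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated data from two illustrative 2-D Gaussian clusters.
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), size=300),
    rng.multivariate_normal([4.0, 4.0], [[1.0, 0.6], [0.6, 1.0]], size=200),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.weights_)            # estimated pi_k
print(gmm.means_)              # estimated mu_k
print(gmm.covariances_)        # estimated Sigma_k
resp = gmm.predict_proba(X)    # responsibilities gamma_ik from the final E-step
```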
Mean-Variance Mixture Distributions
We now generalize the multivariate normal to obtain multivariate normal mixture distributions. The
random vector X is said to have a (multivariate) normal mean-variance mixture distribution if
$$X \stackrel{d}{=} m(W) + \sqrt{W}\, AZ, \qquad (4)$$
where
(i) Z ∼ Nk (0, Ik );
(ii) W ≥ 0 is a non-negative, scalar-valued random variable which is independent of Z;
(iii) A ∈ Rd×k is a matrix of constants and m : [0, ∞) → Rd is a measurable function.
Conditional on W = w, X is then multivariate normal, X | W = w ∼ Nd (m(w), wΣ), where Σ = AA′ . For instance, a possible concrete specification for the function m(W ) is
$$m(W) = \mu + W\gamma. \qquad (5)$$
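A simulation sketch of (4) with the specification (5). The choice of an exponentially distributed mixing variable W and all parameter values below are illustrative assumptions, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for X = mu + W*gamma + sqrt(W) * A @ Z  (eqs. (4)-(5)).
mu = np.array([0.0, 0.0])
gamma = np.array([0.5, -0.2])
A = np.linalg.cholesky(np.array([[1.0, 0.3],
                                 [0.3, 1.0]]))   # so Sigma = A @ A.T

n = 100_000
W = rng.exponential(scale=1.0, size=n)           # assumed mixing distribution (E[W] = 1)
Z = rng.standard_normal((n, 2))
X = mu + W[:, None] * gamma + np.sqrt(W)[:, None] * (Z @ A.T)

print(X.mean(axis=0))   # should be close to mu + E[W] * gamma
```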
Appendix: EM-algorithm
The expectation–maximization (EM) algorithm is an iterative method to find the maximum likelihood
estimate when the statistical model depends on unobserved latent variables.
We denote by Y the sample of observed data and by Z the sample of unobservable data. We have:
$$\begin{aligned}
\ell(Y, Z; \theta) &= \sum_{i=1}^{n} \ln f(y_i, z_i; \theta) \\
&= \sum_{i=1}^{n} \ln\big( f(z_i; \theta)\, f(y_i \mid z_i; \theta) \big) \\
&= \sum_{i=1}^{n} \ln f(z_i; \theta) + \sum_{i=1}^{n} \ln f(y_i \mid z_i; \theta).
\end{aligned}$$
To overcome the latency of the Zi data, the EM algorithm is used. This is an iterative procedure consisting of an E-step, or expectation step (where essentially Zi is replaced by an estimate given the observed data and current parameter estimates), and an M-step, or maximization step (where the parameter estimates are updated). The EM algorithm consists of iteratively applying the two steps:
(E–Step) We calculate the conditional expectation of the so-called augmented log-likelihood (in the above equation) given the data Y1 , . . . , Yn and the current parameter values θ(k) . This results in the objective function
$$Q(\theta; \theta^{(k)}) = E\left[ \ell(Y, Z; \theta) \mid Y, \theta^{(k)} \right].$$
In practice, performing the E-step amounts to replacing any functions g(Wi ) of the latent mixing variables by the quantities E[g(Wi ) | Yi ; θ(k) ].
(M–Step) We maximize Q(θ; θ(k) ) with respect to the parameter vector θ to obtain the updated estimate θ(k+1) .
Alternating between these steps, the EM algorithm produces improved parameter estimates at each iteration, in the sense that the value of the original likelihood never decreases, and the iterates converge to a (local) maximum of the likelihood, which under suitable regularity conditions gives the maximum likelihood (ML) estimates.