
MA634 Financial Risk Management

Distribution Modeling

Multivariate Distributions
Recall
Let X = (X1 , . . . , Xn ) be an n-dimensional vector of random variables. We have the following definitions and statements.
Definition 0.1 (Joint CDF). For all x = (x1 , . . . , xn )⊤ ∈ Rn , the joint cumulative distribution
function (CDF) of X satisfies

FX (x) = FX (x1 , . . . , xn ) = P (X1 ≤ x1 , . . . , Xn ≤ xn ).

Definition 0.2 (Marginal CDF). For a fixed i, the marginal CDF of Xi satisfies

FXi (xi ) = FX (∞, . . . , ∞, xi , ∞, . . . , ∞).

It is straightforward to generalize the previous definition to joint marginal distributions. For example,
the joint marginal distribution of Xi and Xj satisfies

Fij (xi , xj ) = FX (∞, . . . , ∞, xi , ∞, . . . , ∞, xj , ∞, . . . , ∞).

If the joint CDF is absolutely continuous, then it has an associated probability density function (PDF)
so that

$$
F_X(x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(u_1, \ldots, u_n)\, du_1 \cdots du_n .
$$

Similar statements also apply to the marginal CDFs. A collection of random variables is independent
if the joint CDF (or PDF if it exists) can be factored into the product of the marginal CDFs (or
PDFs).

If X1 = (X1 , . . . , Xk )⊤ and X2 = (Xk+1 , . . . , Xn )⊤ is a partition of X, then the conditional CDF satisfies

$$
F_{X_2 | X_1}(x_2 \mid x_1) = P(X_2 \le x_2 \mid X_1 = x_1).
$$

If X has a PDF f (·), then it satisfies

$$
F_{X_2 | X_1}(x_2 \mid x_1) = \frac{\int_{-\infty}^{x_{k+1}} \cdots \int_{-\infty}^{x_n} f(x_1, \ldots, x_k, u_{k+1}, \ldots, u_n)\, du_{k+1} \cdots du_n}{f_{X_1}(x_1)},
$$

where fX1 (·) is the joint marginal PDF of X1 .

Assuming it exists, the mean vector of X is given by

$$
\mathrm{E}[X] := \big(\mathrm{E}[X_1], \ldots, \mathrm{E}[X_n]\big)^\top .
$$

Likewise, again assuming it exists, the covariance matrix of X satisfies

$$
\mathrm{Cov}(X) := \Sigma := \mathrm{E}\big[(X - \mathrm{E}[X])(X - \mathrm{E}[X])^\top\big],
$$

so that the (i, j)th element of Σ is simply the covariance of Xi and Xj . Note that the covariance matrix
is symmetric so that Σ⊤ = Σ, its diagonal elements satisfy Σi,i ≥ 0, and it is positive semi-definite so
that x⊤ Σx ≥ 0 for all x ∈ Rn .

The correlation matrix, ρ(X), has as its (i, j)th element

ρij := Corr(Xi , Xj ).

It is also symmetric, positive semi-definite and has 1’s along the diagonal.

For any matrix A ∈ Rk×n and vector a ∈ Rk we have

E[AX + a] = AE[X] + a, (1)


Cov(AX + a) = A Cov(X) A⊤ . (2)
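As a quick sanity check, the affine transformation rules (1) and (2) can be verified numerically. The following minimal sketch (assuming NumPy is available; the dimensions and parameter values are purely illustrative) draws correlated normal samples and compares the empirical mean and covariance of AX + a with the formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for this sketch)
mu = np.array([1.0, -2.0, 0.5])                       # E[X]
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])                   # Cov(X)
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0,  3.0]])                      # A in R^{2x3}
a = np.array([0.5, -1.0])                             # a in R^2

# Simulate X and apply the affine map Y = AX + a
X = rng.multivariate_normal(mu, Sigma, size=200_000)  # shape (N, 3)
Y = X @ A.T + a                                       # shape (N, 2)

# Empirical moments vs. formulas (1) and (2)
print(np.mean(Y, axis=0), A @ mu + a)                 # E[AX + a] = A E[X] + a
print(np.cov(Y, rowvar=False), A @ Sigma @ A.T)       # Cov(AX + a) = A Sigma A^T
```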

Multivariate Normal Distribution


A random vector X = (X1 , . . . , Xp ) with nonsingular covariance matrix Σ has a multivariate normal
distribution if and only if its density is

$$
f(x) = \frac{1}{(2\pi)^{p/2}\sqrt{\det \Sigma}}\, \exp\!\left( -\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu) \right). \tag{3}
$$

The term (x − µ)′ Σ−1 (x − µ) appearing inside the exponent of the multivariate normal density
is a quadratic form. This particular quadratic form is also called the squared Mahalanobis distance
between the vector x and the mean vector µ. Thus, the density depends on x only through
the squared Mahalanobis distance (x − µ)′ Σ−1 (x − µ). Setting this quadratic form equal to a constant
gives the equation of a hyper-ellipsoid centered at µ, so the multivariate normal density is constant on ellipsoids of the form

(x − µ)′ Σ−1 (x − µ) = constant.
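To make the role of the Mahalanobis distance concrete, here is a minimal sketch (assuming NumPy and SciPy; the parameter values are illustrative) that evaluates the density (3) directly from the quadratic form and cross-checks it against scipy.stats.multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumptions for this sketch)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
x = np.array([0.5, 0.0])

# Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)
diff = x - mu
d2 = diff @ np.linalg.solve(Sigma, diff)

# Density (3) written in terms of the quadratic form
p = len(mu)
density = np.exp(-0.5 * d2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

# Cross-check against SciPy
print(density, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```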

If the variables are uncorrelated then the variance-covariance matrix will be a diagonal matrix with
variances of the individual variables appearing on the main diagonal of the matrix and zeros everywhere
else:

$$
\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_p^2
\end{pmatrix}
$$

Let X ∼ Np (µ, Σ) be p-variate multivariate normal with mean µ and variance-covariance matrix Σ,
where

$$
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}.
$$

(i) Marginal distributions. All subsets of X are themselves multivariate normal.

(ii) Any linear combination of the Xi , say c′X = c1X1 + c2X2 + · · · + cpXp , is normally distributed as c′X ∼ N(c′µ, c′Σc).

(iii) Conditional Distributions. Subdivide the vector X into two subvectors X1 and X2 , of dimensions p and q respectively,

$$
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad
X_1 = \begin{pmatrix} X_{11} \\ X_{12} \\ \vdots \\ X_{1p} \end{pmatrix}, \qquad
X_2 = \begin{pmatrix} X_{21} \\ X_{22} \\ \vdots \\ X_{2q} \end{pmatrix},
$$

so that

$$
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_{p+q}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right).
$$

The conditional distribution of X1 given X2 = x2 , f (X1 |X2 = x2 ), is p-variate multivariate normal,

$$
N_p\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).
$$

(iv) If two variates of the multivariate normal, say X1 and X2 , are uncorrelated (ρ12 = 0, which implies σ12 = 0), then X1 and X2 are independent. This property is not in general true for other distributions. However, it is always true that if two variates are independent, then they are uncorrelated, no matter what their joint distribution is.

(v) The characteristic function of a multivariate normal distribution with mean µ and covariance matrix Σ ≥ 0 is, for t ∈ Rp ,

$$
\phi(t) = \exp\!\left( i\, t'\mu - \frac{1}{2}\, t'\Sigma t \right).
$$

(vi) Linear Combinations. If X ∼ Np (µ, Σ) and Y = AX + b for A(q × p) and b(q × 1), then
Y ∼ Nq (Aµ + b, AΣA′ ).

(vii) Convolutions. If X and Y are independent d-dimensional random vectors satisfying X ∼ Nd (µ, Σ) and Y ∼ Nd (µ̃, Σ̃), then we may take the product of characteristic functions to show that

X + Y ∼ Nd (µ + µ̃, Σ + Σ̃).
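As an illustration of property (iii), the following minimal sketch (assuming NumPy; the dimensions and parameter values are illustrative) computes the closed-form conditional mean and covariance of X1 given X2 = x2 and compares them with a crude Monte Carlo estimate obtained by keeping only samples whose X2 falls in a narrow band around x2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 3-dimensional example: X1 = first two components, X2 = third
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.8, 0.6],
                  [0.8, 1.5, 0.4],
                  [0.6, 0.4, 1.0]])
p, q = 2, 1

mu1, mu2 = mu[:p], mu[p:]
S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
S21, S22 = Sigma[p:, :p], Sigma[p:, p:]

x2 = np.array([0.5])

# Closed-form conditional moments from property (iii)
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

# Monte Carlo check: keep samples whose X2 is close to x2
X = rng.multivariate_normal(mu, Sigma, size=2_000_000)
mask = np.abs(X[:, p] - x2[0]) < 0.01
print(cond_mean, X[mask][:, :p].mean(axis=0))
print(cond_cov, np.cov(X[mask][:, :p], rowvar=False))
```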

Generating Multivariate Normally Distributed Random Vectors

Suppose we wish to generate a random vector X = (X1 , . . . , Xn ) with distribution X ∼ M Nn (0, Σ).
The case E[X] ̸= 0 can easily be accommodated afterwards. Let Z = (Z1 , . . . , Zn )⊤ , where the Zi are
i.i.d. N (0, 1) random variables for i = 1, . . . , n. If C is an (n × m) matrix, then

C⊤ Z ∼ M N (0, C⊤ C).

Thus, our task reduces to finding a matrix C such that C⊤ C = Σ. This can be achieved using
the Cholesky decomposition of Σ. A standard result from linear algebra states that any symmetric
positive-definite matrix M can be written as

M = U⊤ DU,

where U is an upper triangular matrix and D is a diagonal matrix with strictly positive diagonal
entries. Since Σ is symmetric positive-definite, we can express it as
$$
\Sigma = U^\top D U = (U^\top \sqrt{D})(\sqrt{D}\, U) = (\sqrt{D}\, U)^\top (\sqrt{D}\, U).
$$

Therefore, the matrix C = √D U satisfies C⊤C = Σ; this factorization is called the Cholesky decomposition of Σ.
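A minimal sketch of this recipe (assuming NumPy; the covariance matrix and sample size are illustrative): numpy.linalg.cholesky returns a lower-triangular L with LL⊤ = Σ, so taking C = L⊤ gives C⊤C = Σ and X = C⊤Z = LZ has the desired distribution; a nonzero mean is added afterwards.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative target covariance and mean (assumptions for this sketch)
Sigma = np.array([[1.0, 0.7, 0.3],
                  [0.7, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])
mu = np.array([0.1, -0.2, 0.05])   # handled by a shift at the end
n_samples = 100_000

# Cholesky factor: L is lower triangular with L @ L.T = Sigma, i.e. C = L.T
L = np.linalg.cholesky(Sigma)

# Z has i.i.d. N(0,1) entries; each column of L @ Z is one draw from N(0, Sigma)
Z = rng.standard_normal((Sigma.shape[0], n_samples))
X = (L @ Z).T + mu                 # accommodate a nonzero mean afterwards

# Empirical check against the target moments
print(X.mean(axis=0))
print(np.cov(X, rowvar=False))
```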

Estimation of Multivariate Normal Distributions


We assume that Y1 , . . . , Yn are i.i.d. with Yi ∼ Np (µ, Σ). We have the following log-likelihood function:

$$
\ell(\mu, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} (Y_i - \mu)^\top \Sigma^{-1} (Y_i - \mu).
$$

The first-order condition with respect to µ is:

$$
\partial_\mu \ell(\mu, \Sigma) = \sum_{i=1}^{n} \Sigma^{-1} (Y_i - \mu) = 0.
$$

Since $\sum_{i=1}^{n} \Sigma^{-1}(Y_i - \mu) = \Sigma^{-1}\sum_{i=1}^{n}(Y_i - \mu)$, we deduce that µ̂ is the empirical mean:

$$
\hat{\mu} = \bar{Y} = \begin{pmatrix} n^{-1}\sum_{i=1}^{n} Y_{i,1} \\ \vdots \\ n^{-1}\sum_{i=1}^{n} Y_{i,p} \end{pmatrix}.
$$

By using the properties of the trace function, the concentrated log-likelihood function becomes:

$$
\begin{aligned}
\ell(\hat{\mu}, \Sigma) &= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} (Y_i - \hat{\mu})^\top \Sigma^{-1} (Y_i - \hat{\mu}) \\
&= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} \operatorname{tr}\!\left( (Y_i - \bar{Y})^\top \Sigma^{-1} (Y_i - \bar{Y}) \right) \\
&= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} \operatorname{tr}\!\left( \Sigma^{-1} (Y_i - \bar{Y})(Y_i - \bar{Y})^\top \right) \\
&= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left( \Sigma^{-1} S \right),
\end{aligned}
$$

where S is the p × p matrix defined in the following way:

$$
S = \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^\top .
$$

We deduce the first-order condition:

$$
\frac{\partial \ell(\hat{\mu}, \Sigma)}{\partial \Sigma^{-1}} = \frac{n}{2}\,\Sigma - \frac{1}{2}\,S = 0.
$$

It follows that the ML estimator of Σ is the empirical covariance matrix:

$$
\hat{\Sigma} = \frac{1}{n}\, S = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^\top .
$$

The simplest and most common method of estimating a multivariate normal distribution is to take
the sample mean vector and sample covariance matrix as our estimators of µ and Σ, respectively. It
is easy to justify this choice since they are the maximum likelihood estimators. It is also common to
take n/(n − 1) times the sample covariance matrix as an estimator of Σ as this estimator is known to
be unbiased.
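A minimal sketch of these estimators (assuming NumPy; the data below are simulated purely for illustration): the ML estimates are the sample mean and the covariance with divisor n, while multiplying by n/(n − 1) gives the unbiased version.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data purely for illustration
true_mu = np.array([0.0, 0.5])
true_Sigma = np.array([[1.0, 0.3],
                       [0.3, 0.8]])
Y = rng.multivariate_normal(true_mu, true_Sigma, size=500)   # shape (n, p)
n = Y.shape[0]

# Maximum likelihood estimators
mu_hat = Y.mean(axis=0)
centered = Y - mu_hat
Sigma_mle = centered.T @ centered / n          # divisor n

# Unbiased estimator: n/(n-1) times the MLE
Sigma_unbiased = Sigma_mle * n / (n - 1)       # same as np.cov(Y, rowvar=False)

print(mu_hat)
print(Sigma_mle)
print(Sigma_unbiased)
```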

Limitations. In many risk-management applications the multivariate normal distribution is not a good description of reality. It has three main defects:

(i) The tails of its univariate marginal distributions are too thin; they do not assign enough weight
to extreme events.

(ii) The joint tails of the distribution do not assign enough weight to joint extreme outcomes.

(iii) The distribution has a strong form of symmetry, known as elliptical symmetry.

We now discuss some distributions that address some of these defects.

Multivariate Gaussian Mixture Model
The probability density function of the random vector X of dimension d is defined as a weighted sum of Gaussian densities:

$$
f(x) = \sum_{j=1}^{K} \pi_j\, \phi_d(x; \mu_j, \Sigma_j)
$$

where K is the number of mixture components, µj and Σj are the mean vector and the covariance matrix of the Gaussian distribution associated with the j-th component, and πj is the mixture weight, with $\sum_{j=1}^{K} \pi_j = 1$. The log-likelihood function of the sample X = {X1 , . . . , XN } is:

 
$$
\ell(\theta) = \sum_{i=1}^{N} \ln\!\left( \sum_{j=1}^{K} \pi_j\, \phi_d(X_i; \mu_j, \Sigma_j) \right)
$$

To maximize the log-likelihood function, we use the EM algorithm¹.
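A minimal sketch of evaluating this mixture density and its log-likelihood (assuming NumPy and SciPy; the component parameters and data points are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component, 2-dimensional mixture (assumptions for this sketch)
weights = np.array([0.4, 0.6])
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.array([[1.5, 0.4], [0.4, 0.8]])]

def gmm_density(x, weights, means, covs):
    """Weighted sum of Gaussian densities, f(x) = sum_j pi_j phi_d(x; mu_j, Sigma_j)."""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

def gmm_log_likelihood(X, weights, means, covs):
    """Log-likelihood of a sample X of shape (N, d)."""
    return np.sum(np.log([gmm_density(x, weights, means, covs) for x in X]))

X = np.array([[0.1, -0.2], [2.8, 1.1], [1.5, 0.5]])
print(gmm_density(X[0], weights, means, covs))
print(gmm_log_likelihood(X, weights, means, covs))
```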

One-dimensional Case

In the case of a one-dimensional mixture, the parameters of the GMM are:

• Mixing coefficients πk , where $\sum_{k=1}^{K} \pi_k = 1$.

• Means µk .

• Variances σk2 .

In the EM algorithm, we maximize the expected complete-data log-likelihood, which is the log-
likelihood considering both the observed data X and the latent variables Z = {z1 , z2 , . . . , zN } that
indicate which Gaussian component generated each data point.

The complete-data log-likelihood is:

$$
\log L(\pi, \mu, \sigma^2) = \sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{I}(z_i = k)\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right]
$$

where I(zi = k) is an indicator function that equals 1 if zi = k (i.e., if the i-th data point was generated
by the k-th Gaussian component) and 0 otherwise.

Since we don’t know the values of zi (they are latent variables), we take the expectation with respect
to their posterior distribution, which leads to the expected complete-data log-likelihood:

$$
Q(\pi, \mu, \sigma^2) = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik}\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right]
$$

where γik = P (zi = k|xi ) is the responsibility calculated in the E-step.


¹ Refer to the Appendix for the EM algorithm.

E-Step

In the E-step of the EM algorithm, we calculate the responsibilities, which represent the probability
that each data point xi was generated by each Gaussian component k. These responsibilities are
denoted as γik and are defined as:
γik = P (zi = k|xi )

where zi is the latent variable indicating which Gaussian component generated the data point xi . The
responsibilities γik can be computed using Bayes’ theorem:

$$
\gamma_{ik} = \frac{P(z_i = k)\, p(x_i \mid z_i = k)}{\sum_{j=1}^{K} P(z_i = j)\, p(x_i \mid z_i = j)}
$$

In the context of a Gaussian Mixture Model, P (zi = k) is the mixing coefficient πk , and p(xi |zi = k)
is the probability density function of the k-th Gaussian, given by:

$$
p(x_i \mid z_i = k) = \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right)
$$

Substituting these into the expression for γik :

$$
\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
$$

Expanding the Gaussian pdf:

$$
\gamma_{ik} = \frac{\pi_k\, \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right)}{\sum_{j=1}^{K} \pi_j\, \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left( -\frac{(x_i - \mu_j)^2}{2\sigma_j^2} \right)}
$$

This formula gives the responsibility γik that the k-th Gaussian component has for explaining the i-th
data point xi .

The responsibilities γik are then used in the M-step to update the parameters πk , µk , and σk2 .
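A minimal sketch of this E-step for the one-dimensional case (assuming NumPy; the data and parameter values below are illustrative), computing the full N × K matrix of responsibilities at once:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def e_step(x, pi, mu, sigma2):
    """Responsibilities gamma[i, k] = P(z_i = k | x_i) for a 1-D GMM.

    x: data of shape (N,); pi, mu, sigma2: component parameters of shape (K,).
    """
    weighted = pi[None, :] * normal_pdf(x[:, None], mu[None, :], sigma2[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

# Illustrative data and current parameter values (assumptions for this sketch)
x = np.array([-1.2, 0.3, 0.5, 4.8, 5.1])
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 5.0])
sigma2 = np.array([1.0, 1.0])

gamma = e_step(x, pi, mu, sigma2)
print(gamma)            # each row sums to 1
```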

M-Step

In the M-step, we maximize Q(π, µ, σ 2 ) with respect to the parameters πk , µk , and σk2 .

Update of Mixing Coefficients πk :

To find the optimal πk , we maximize Q(π, µ, σ 2 ) with respect to πk , subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$. The relevant part of Q is:

$$
Q(\pi) = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik} \log \pi_k
$$

We introduce a Lagrange multiplier λ for the constraint and set up the Lagrangian:

$$
\mathcal{L} = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik} \log \pi_k + \lambda\left( \sum_{k=1}^{K} \pi_k - 1 \right)
$$

Taking the derivative with respect to πk and setting it to zero gives:


$$
\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\sum_{i=1}^{N} \gamma_{ik}}{\pi_k} + \lambda = 0
$$
Solving for πk and using the constraint $\sum_{k=1}^{K} \pi_k = 1$, we get:

$$
\pi_k^{\text{new}} = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ik}
$$

This is the average responsibility that component k takes over all data points.

Update of Means µk :

To find the optimal µk , we maximize Q(π, µ, σ 2 ) with respect to µk . The relevant part of Q is:

$$
Q(\mu) = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik} \log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)
$$

Substituting the Gaussian pdf:

$$
\log \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) = -\frac{1}{2}\log(2\pi\sigma_k^2) - \frac{(x_i - \mu_k)^2}{2\sigma_k^2}
$$

The relevant term for µk is the quadratic term:

$$
Q(\mu_k) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\, \frac{(x_i - \mu_k)^2}{\sigma_k^2}
$$

Taking the derivative with respect to µk and setting it to zero:

$$
\frac{\partial Q(\mu_k)}{\partial \mu_k} = \sum_{i=1}^{N} \gamma_{ik}\, \frac{x_i - \mu_k}{\sigma_k^2} = 0
$$

This leads to:

$$
\mu_k^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}}
$$

This is the weighted average of the data points, where the weights are the responsibilities.

Update of Variances σk2 :

To find the optimal σk2 , we maximize Q(π, µ, σ 2 ) with respect to σk2 . The relevant part of Q is:

$$
Q(\sigma_k^2) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \log \sigma_k^2 + \frac{(x_i - \mu_k)^2}{\sigma_k^2} \right]
$$
Taking the derivative with respect to σk2 and setting it to zero gives:

$$
\frac{\partial Q(\sigma_k^2)}{\partial \sigma_k^2} = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \frac{1}{\sigma_k^2} - \frac{(x_i - \mu_k)^2}{\sigma_k^4} \right] = 0
$$

Multiplying through by σk2 and solving for σk2 , we get:


$$
\sigma_k^{2,\text{new}} = \frac{\sum_{i=1}^{N} \gamma_{ik}\,(x_i - \mu_k^{\text{new}})^2}{\sum_{i=1}^{N} \gamma_{ik}}
$$

This is the weighted variance, where the weights are the responsibilities.
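A minimal sketch of these M-step updates (assuming NumPy, and reusing the responsibility matrix gamma from the E-step sketch above; all names are illustrative):

```python
import numpy as np

def m_step(x, gamma):
    """M-step updates for a 1-D GMM given responsibilities gamma of shape (N, K)."""
    Nk = gamma.sum(axis=0)                                   # effective counts per component
    pi_new = Nk / len(x)                                     # average responsibilities
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk           # responsibility-weighted means
    sigma2_new = (gamma * (x[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / Nk  # weighted variances
    return pi_new, mu_new, sigma2_new

# Example usage with the e_step sketch above (hypothetical values):
# pi, mu, sigma2 = m_step(x, e_step(x, pi, mu, sigma2))
```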

Multi-dimensional Case

The complete-data log-likelihood is:

$$
\log L(\pi, \mu, \Sigma) = \sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{I}(z_i = k)\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right]
$$

where I(zi = k) is an indicator function that equals 1 if zi = k (i.e., if the i-th data point was generated
by the k-th Gaussian component) and 0 otherwise. Since we don’t know the values of zi (they are
latent variables), we take the expectation with respect to their posterior distribution, which leads to
the expected complete-data log-likelihood:

$$
Q(\pi, \mu, \Sigma) = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik}\left[ \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right]
$$

where γik = P (zi = k|xi ) is the responsibility calculated in the E-step.

E-step

Similar to the one-dimensional case, given the current parameters at the t-th iteration, we compute the responsibilities (posterior probabilities of the latent variables):

$$
\gamma_{ik}^{(t+1)} = \frac{\pi_k^{(t)}\, \mathcal{N}\!\left(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)}\right)}{\sum_{j=1}^{K} \pi_j^{(t)}\, \mathcal{N}\!\left(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)}\right)},
$$

M-Step

In the M-step, we maximize Q(π, µ, Σ) with respect to the parameters πk , µk , and Σk .

Update of Mixing Coefficients πk :

To find the optimal πk , we maximize Q(π, µ, Σ) with respect to πk , subject to the constraint $\sum_{k=1}^{K} \pi_k = 1$. The relevant part of Q is:

$$
Q(\pi) = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik} \log \pi_k
$$
We introduce a Lagrange multiplier λ for the constraint and set up the Lagrangian:

$$
\mathcal{L} = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik} \log \pi_k + \lambda\left( \sum_{k=1}^{K} \pi_k - 1 \right)
$$

Taking the derivative with respect to πk and setting it to zero gives:

$$
\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\sum_{i=1}^{N} \gamma_{ik}}{\pi_k} + \lambda = 0
$$
Solving for πk and using the constraint $\sum_{k=1}^{K} \pi_k = 1$, we get:

$$
\pi_k^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}
$$

This is the average responsibility that component k takes over all data points.

Update of Means µk :

To find the optimal µk , we maximize Q(π, µ, Σ) with respect to µk . The relevant part of Q is:

$$
Q(\mu) = \sum_{i=1}^{N}\sum_{k=1}^{K} \gamma_{ik} \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
$$

Substituting the Gaussian pdf:

$$
\log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1}(x_i - \mu_k)
$$

The relevant term for µk is the quadratic form:

$$
Q(\mu_k) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\,(x_i - \mu_k)^\top \Sigma_k^{-1}(x_i - \mu_k)
$$

Taking the derivative with respect to µk and setting it to zero:

$$
\frac{\partial Q(\mu_k)}{\partial \mu_k} = \sum_{i=1}^{N} \gamma_{ik}\, \Sigma_k^{-1}(x_i - \mu_k) = 0
$$

This leads to:

$$
\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}}
$$

This is the weighted average of the data points, where the weights are the responsibilities.

Update of Covariances Σk :

To find the optimal Σk , we maximize Q(π, µ, Σ) with respect to Σk . The relevant part of Q is again
the quadratic form:

$$
Q(\Sigma_k) = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \log|\Sigma_k| + (x_i - \mu_k)^\top \Sigma_k^{-1}(x_i - \mu_k) \right]
$$

Taking the derivative with respect to Σk and setting it to zero gives:

$$
\frac{\partial Q(\Sigma_k)}{\partial \Sigma_k} = -\frac{1}{2}\sum_{i=1}^{N} \gamma_{ik}\left[ \Sigma_k^{-1} - \Sigma_k^{-1}(x_i - \mu_k)(x_i - \mu_k)^\top \Sigma_k^{-1} \right] = 0
$$

Multiplying through by Σk and solving for Σk , we get:

$$
\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top}{\sum_{i=1}^{N} \gamma_{ik}^{(t+1)}}
$$

This is the weighted covariance matrix, where the weights are the responsibilities.

Summary: The EM algorithm consists of the following iterations (here n denotes the sample size and m the number of mixture components); a code sketch of the full iteration follows the list.

(i) we set k = 0 and initialize the algorithm with starting values $\pi_j^{(0)}$, $\mu_j^{(0)}$ and $\Sigma_j^{(0)}$;

(ii) E-step: we calculate the posterior probabilities $\gamma_{j,i}^{(k+1)}$:

$$
\gamma_{j,i}^{(k+1)} = \frac{\pi_j^{(k)}\, \phi_d(x_i; \mu_j^{(k)}, \Sigma_j^{(k)})}{\sum_{s=1}^{m} \pi_s^{(k)}\, \phi_d(x_i; \mu_s^{(k)}, \Sigma_s^{(k)})}
$$

(iii) M-step: we update the estimators π̂j , µ̂j and Σ̂j :

$$
\pi_j^{(k+1)} = \frac{1}{n}\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}
$$

$$
\mu_j^{(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}\, x_i}{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}}
$$

$$
\Sigma_j^{(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}\, (x_i - \mu_j^{(k+1)})(x_i - \mu_j^{(k+1)})^\top}{\sum_{i=1}^{n} \gamma_{j,i}^{(k+1)}}
$$

(iv) we iterate steps (ii) and (iii) until convergence;

(v) finally, we have

$$
\hat{\pi}_j = \pi_j^{(\infty)}, \qquad \hat{\mu}_j = \mu_j^{(\infty)}, \qquad \hat{\Sigma}_j = \Sigma_j^{(\infty)}.
$$
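A minimal sketch of the full EM iteration for a multivariate GMM (assuming NumPy and SciPy; the initialization, tolerance, and simulated data are illustrative assumptions, and in practice k-means initialization and covariance regularization are commonly added):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=200, tol=1e-8, seed=0):
    """EM for a K-component multivariate Gaussian mixture. X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Naive initialization (an assumption of this sketch): random means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j]
        dens = np.column_stack([
            multivariate_normal(mean=mu[j], cov=Sigma[j]).pdf(X) for j in range(K)
        ])
        weighted = dens * pi
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: update weights, means, covariances
        Nj = gamma.sum(axis=0)
        pi = Nj / n
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]

        # Convergence check on the observed-data log-likelihood
        ll = np.sum(np.log(weighted.sum(axis=1)))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma

# Illustrative usage on simulated two-component data
rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 300),
               rng.multivariate_normal([4, 4], [[1.0, 0.3], [0.3, 0.5]], 200)])
pi, mu, Sigma = em_gmm(X, K=2)
print(pi, mu, sep="\n")
```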

Mean-Variance Mixture Distributions
We now generalize the multivariate normal to obtain multivariate normal mixture distributions. The
random vector X is said to have a (multivariate) normal mean-variance mixture distribution if

$$
X \overset{d}{=} m(W) + \sqrt{W}\, A Z, \tag{4}
$$

where

(i) Z ∼ Nk (0, Ik );

(ii) W ≥ 0 is a non-negative, scalar-valued rv which is independent of Z;

(iii) A ∈ Rd×k is a matrix; and

(iv) m : [0, ∞) → Rd is a measurable function.

In this case we have that

X | W = w ∼ Nd (m(w), wΣ),

where Σ = AA′ . For instance, a possible concrete specification for the function m(W ) is

m(W ) = µ + W γ, (5)

where µ and γ are parameter vectors in Rd . Since E(X | W ) = µ + W γ and cov(X | W ) = W Σ, it follows in this case by simple calculations that

E(X) = E(E(X | W )) = µ + E(W )γ, (6)

cov(X) = E(cov(X | W )) + cov(E(X | W )) = E(W )Σ + var(W )γγ ′ . (7)
If we take γ = 0, we obtain a (multivariate) normal variance mixture distribution.
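A minimal sketch of simulating from (4)–(5) (assuming NumPy; the choice of W as an inverse-gamma variable, which yields a multivariate-t-like variance mixture, and all parameter values are illustrative assumptions), with an empirical check of the moment formulas (6)–(7):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters (assumptions for this sketch)
d = 2
mu = np.array([0.0, 0.0])
gamma = np.array([0.5, -0.2])          # skewness parameter; gamma = 0 gives a variance mixture
A = np.linalg.cholesky(np.array([[1.0, 0.4],
                                 [0.4, 1.0]]))   # so Sigma = A A'
nu = 6.0                                # mixing-variable parameter (assumption)
n = 100_000

# Mixing variable W >= 0, independent of Z; here W ~ InverseGamma(nu/2, nu/2)
W = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)

# Z ~ N_d(0, I), then X = mu + W*gamma + sqrt(W) * A Z, as in (4)-(5)
Z = rng.standard_normal((n, d))
X = mu + W[:, None] * gamma + np.sqrt(W)[:, None] * (Z @ A.T)

# Compare simulated moments with (6)-(7): E(X) = mu + E(W) gamma,
# cov(X) = E(W) Sigma + var(W) gamma gamma'
EW, varW = W.mean(), W.var()
Sigma = A @ A.T
print(X.mean(axis=0), mu + EW * gamma)
print(np.cov(X, rowvar=False), EW * Sigma + varW * np.outer(gamma, gamma))
```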

Appendix: EM-algorithm
The expectation–maximization (EM) algorithm is an iterative method to find the maximum likelihood
estimate when the statistical model depends on unobserved latent variables.

We denote by Y the sample of observed data and by Z the sample of unobservable data. We have:

$$
\begin{aligned}
\ell(Y, Z; \theta) &= \sum_{i=1}^{n} \ln f(y_i, z_i; \theta) \\
&= \sum_{i=1}^{n} \ln\big( f(z_i; \theta)\, f(y_i \mid z_i; \theta) \big) \\
&= \sum_{i=1}^{n} \ln f(z_i; \theta) + \sum_{i=1}^{n} \ln f(y_i \mid z_i; \theta).
\end{aligned}
$$

Because the Zi are latent (unobserved), the EM algorithm is used. This is an iterative procedure
consisting of an E-step, or expectation step (where, essentially, Zi is replaced by an estimate given the
observed data and the current parameter estimates), and an M-step, or maximization step (where the
parameter estimates are updated). The EM algorithm consists in iteratively applying the two steps:

(E–Step) We calculate the conditional expectation of the so-called augmented likelihood (in the above
equation) given the data Y1 , . . . , Yn , using the parameter values θ(k) . This results in the objective
function

$$
Q(\theta; \theta^{(k)}) = \mathrm{E}\left[ \ell(Y, Z; \theta) \mid Y, \theta^{(k)} \right],
$$

viewed as a function of the parameter vector θ. In practice, performing the E-step amounts to replacing
any function g(Wi ) of the latent mixing variables by the quantity E[ g(Wi ) | Yi ; θ(k) ].

(M–Step) We estimate θ(k+1) by maximizing Q(θ; θ(k) ):

$$
\theta^{(k+1)} = \arg\max_{\theta}\, Q(\theta; \theta^{(k)}).
$$

By alternating between these steps, the EM algorithm produces improved parameter estimates at each
iteration (in the sense that the value of the original likelihood never decreases), and the iterates converge
to the maximum likelihood (ML) estimates (in general, to a local maximum of the likelihood).
