A Different(ial) Way
Matrix Derivatives Again
Steven W. Nydick
University of Minnesota
May 17, 2012
Outline

Introduction
  Notation
  Matrix Calculus: Idea Two
Useful Matrix Algebra Rules
  Vectorized Operators
  Patterned Matrices
Vector Differential Calculus
  Continuity and Differentiability
  Cauchy's Rules and the Hessian
  Matrix Differential Calculus
Basic Differentials and Derivatives
  Preliminary Results
  Scalar Functions
  Vector Functions
  Matrix Functions
Differentials of Special Matrices
  The Inverse
  The Exponential and Logarithm
References
Introduction

Notation
X: A matrix
x: A vector
x: A scalar
φ(x), φ(x), or φ(X): A scalar function (of a scalar, vector, or matrix)
f(x), f(x), or f(X): A vector function
F(x), F(x), or F(X): A matrix function
x^T or X^T: The transpose of x or X
x_ij: The element in the ith row and jth column of X
(x^T)_ij: The element in the ith row and jth column of X^T
D f(x): The derivative of the function f(x)
d f(x): The differential of the function f(x)
Matrix Calculus: Idea Two
Basic Idea
Vector calculus is well established, but matrix calculus is difficult.
The paper written by Schönemann took one version of the calculus of
vectors and applied it to matrices:

1. The trace operator was a scalar function (of a matrix) that
   essentially turned matrices into vectors and computed a dot product
   between them:

   tr(A^T X) = vec(A)^T vec(X)

   vec is the vectorizing operator, stacking the columns of a matrix to
   create a very long vector.
2. After applying the trace operator, an important subset of
   maximization problems could be solved by an application of standard
   vector calculus rules.
The current paper is based on the following ideas:

1. First, the entire treatment used differentials.
   This allows a vector function to remain a vector instead of turning
   into a matrix.
2. Second, the derivative was taken with respect to vec(X).
   This keeps the problem a vector derivative problem instead of a
   matrix derivative problem. Moreover, by undoing the vec operator, we
   retain the correct derivative matrix.
Useful Matrix Algebra Rules
Matrix Algebra in Magnus
There are several matrix algebra properties and matrices that Magnus
references throughout his paper and book:
1. The Kronecker Product
2. The Vec/Vech Operator
3. The Duplication Matrix
4. The Commutation Matrix
I will go through these operators in some depth.
Vectorized Operators
The Kronecker Product
The Kronecker Product: Transforms matrices A (m × n) and B (s × t) into
a matrix C (ms × nt):

A ⊗ B = [ a_11 B  a_12 B  ...  a_1n B ]
        [ a_21 B  a_22 B  ...  a_2n B ]    (1)
        [  ...     ...    ...   ...   ]
        [ a_m1 B  a_m2 B  ...  a_mn B ]

The most important Kronecker properties are discussed on pp. 27-28 of
Magnus & Neudecker (1999).
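As a quick illustration (my addition, not from the original slides), base
R computes the Kronecker product with the %x% operator:

A <- matrix(1:4, nrow = 2)            # 2 x 2
B <- diag(2)                          # 2 x 2
C <- A %x% B                          # Kronecker product: 4 x 4
dim(C)                                # ms x nt = 4 x 4
all.equal(C[1:2, 1:2], A[1, 1] * B)   # upper-left block is a_11 * B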
The Vec Operator
The Vec Operator: Creates a vector from a matrix by stacking the columns
of the matrix.

Assume A is an m × n matrix such that:

A = [ a_1  a_2  ...  a_n ]

where a_1, a_2, ..., a_n are the columns of A. Then:

vec(A) = [ a_1 ]
         [ a_2 ]    (2)
         [ ... ]
         [ a_n ]

Note that vec(A) is an mn × 1 column vector.
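Because base R stores matrices column-major, vec is just c() (a note of
mine, not from the slides):

A <- matrix(1:6, nrow = 2)   # columns (1,2), (3,4), (5,6)
c(A)                         # vec(A): 1 2 3 4 5 6, an mn x 1 vector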
Vec and Kronecker
The vec operator is related to the Kronecker product as follows:

vec(ab^T) = vec[ ab_1  ab_2  ...  ab_n ]

          = [ ab_1 ]   [ b_1 a ]
            [ ab_2 ] = [ b_2 a ] = b ⊗ a
            [ ...  ]   [  ...  ]
            [ ab_n ]   [ b_n a ]

Thus, as a basic rule,

vec(ab^T) = b ⊗ a    (3)

where a and b can be vectors of any size.
Vec and Kronecker 2
Now, assume that AXC is a conformable matrix product. Furthermore, let
e_1, e_2, ..., e_q be the columns (or rows) of a q × q identity matrix,
where q is the number of columns in X. Then:

Σ_{j=1}^{q} (x_j e_j^T) = [ x_1 0 ... 0 ] + [ 0 x_2 ... 0 ] + ... + [ 0 ... 0 x_q ]
                        = X

That is, a matrix can be written as a sum of outer products of its
columns with elementary vectors.
Now, using the expansion above,

vec(AXC) = vec[ A ( Σ_{j=1}^{q} x_j e_j^T ) C ]
         = vec[ Σ_{j=1}^{q} (A x_j e_j^T C) ]
         = vec[ Σ_{j=1}^{q} (A x_j)(e_j^T C) ]

because A and C are constants, (A x_j) is a column vector, and (e_j^T C)
is a row vector.
We can continue deriving:

vec[ Σ_{j=1}^{q} (A x_j)(e_j^T C) ] = Σ_{j=1}^{q} vec[(A x_j)(e_j^T C)]
                                    = Σ_{j=1}^{q} [(e_j^T C)^T ⊗ (A x_j)]    by (3)
                                    = Σ_{j=1}^{q} [(C^T e_j) ⊗ (A x_j)]
                                    = Σ_{j=1}^{q} [(C^T ⊗ A)(e_j ⊗ x_j)]

because we can pull a sum outside of the vec operator.
Finally:

Σ_{j=1}^{q} [(C^T ⊗ A)(e_j ⊗ x_j)] = (C^T ⊗ A) Σ_{j=1}^{q} (e_j ⊗ x_j)
                                   = (C^T ⊗ A) Σ_{j=1}^{q} vec(x_j e_j^T)    by (3)
                                   = (C^T ⊗ A) vec[ Σ_{j=1}^{q} (x_j e_j^T) ]
                                   = (C^T ⊗ A) vec(X)                        (4)

Therefore, a matrix product can be vectorized such that we only need to
apply the vec operator to one matrix.
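A quick numerical sanity check of (4) (my addition), using random
conformable matrices:

set.seed(1)
A <- matrix(rnorm(6), 2, 3)
X <- matrix(rnorm(12), 3, 4)
C <- matrix(rnorm(8), 4, 2)
lhs <- c(A %*% X %*% C)           # vec(AXC)
rhs <- (t(C) %x% A) %*% c(X)      # (C^T ⊗ A) vec(X)
all.equal(lhs, c(rhs))            # TRUE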
The Vech Operator
The Vech Operator: Creates a vector from a symmetric matrix by stacking
the non-duplicated elements column-wise.

Assume A is a symmetric, square, n × n matrix:

A = [ a_11 a_21 ... a_n1 ]
    [ a_21 a_22 ... a_n2 ]
    [  ...          ...  ]
    [ a_n1 a_n2 ... a_nn ]

Then:

vech(A) = (a_11, a_21, ..., a_n1, a_22, ..., a_n2, ..., a_nn)^T    (5)
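In R, vech can be extracted with lower.tri (my sketch, not from the
slides):

A <- matrix(c(1, 2, 3,
              2, 4, 5,
              3, 5, 6), nrow = 3)   # symmetric
A[lower.tri(A, diag = TRUE)]        # vech(A): 1 2 3 4 5 6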
Patterned Matrices
The Commutation Matrix
Magnus describes several useful patterned matrices.

One useful matrix: the Commutation Matrix, K_mn, defined such that

K_mn vec(A_{m×n}) = vec(A^T_{n×m})    (6)

The number of rows and columns of K corresponds to the length of vec(A),
because both vec(A) and vec(A^T) have the same number of elements.
Moreover, the unique matrix K_mn (with mn rows and mn columns) takes
m × n to n × m; that is, it flips the columns to be the rows.

Note: the commutation matrix will always be square.
The commutation matrix changes an mn-length vector into an nm-length
vector, so it is square and of size mn × mn.

Moreover, the commutation matrix just rearranges the elements of the
original vector, so it must be a rearranged identity matrix designed to
pick off the appropriate elements and put each in the correct place.

For instance:

A_{3×2} = [ 1 2 ]        K_32 = [ 1 0 0 0 0 0 ]
          [ 3 4 ]               [ 0 0 0 1 0 0 ]
          [ 5 6 ]               [ 0 1 0 0 0 0 ]
                                [ 0 0 0 0 1 0 ]
                                [ 0 0 1 0 0 0 ]
                                [ 0 0 0 0 0 1 ]
Thus:

K_32 vec(A) = K_32 (1, 3, 5, 2, 4, 6)^T
            = (1, 2, 3, 4, 5, 6)^T
            = vec[ 1 3 5 ]
                 [ 2 4 6 ]
            = vec(A^T)
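The same example, checked numerically in R (my addition):

A <- matrix(1:6, nrow = 3, byrow = TRUE)     # the 3 x 2 matrix above
K32 <- matrix(0, 6, 6)
K32[cbind(1:6, c(1, 4, 2, 5, 3, 6))] <- 1    # the rearranged identity matrix
all.equal(c(K32 %*% c(A)), c(t(A)))          # K_32 vec(A) = vec(A^T): TRUE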
What is the commutation matrix for an arbitrary vec(X_{m×n})?

Given an m × n matrix X, vec(X) will be mn × 1:

1. Elements 1 through m of vec(X) correspond to column 1 of X.
2. Elements (m + 1) through 2m of vec(X) correspond to column 2 of X.
3. There will be n of these repeating sequences, one for each column of X.

What is contained in the columns of K_mn:

1. The first m columns of K_mn affect only the first m elements of vec(X).
2. The second m columns of K_mn affect only the second m elements of vec(X).
3. There will be n of these blocks.
So the commutation matrix contains n column blocks, each affecting a
particular column of X and corresponding to a particular set of m
elements in vec(X):

K_mn = [ k_1 k_2 ... k_m | k_{m+1} ... k_{2m} | ... | k_{m(n-1)+1} ... k_{mn} ]

The vertical lines separate the elements in different columns of X, and
each of the k_i is an elementary vector. Why?
Now where does each element of a particular block go in the new matrix?
We are turning vec(X) into vec(X^T):

1. There are n rows (and m columns) in X^T.
2. The first column block of K_mn takes the first column of X and puts
   it in the first row of X^T.
3. The second column block of K_mn takes the second column and puts it
   in the second row.
4. Because there are n rows in X^T, elements in the first column of X
   (directly next to each other in vec(X)) are now separated by n
   elements in vec(X^T), and likewise for elements in the second column
   of X, etc.
Therefore:

1. The columns of K_mn affect the elements of vec(X), in order.
2. The rows of K_mn represent the particular places of vec(X^T), in order.
3. For the first column block of K_mn (affecting the first column of X),
   there are n rows separating each element in X^T.

So to create a commutation matrix...

1. Create an mn × mn matrix.
2. Divide the matrix into blocks of m columns: write a line separating
   column m from column m + 1, column 2m from column 2m + 1, etc. There
   will be n such column blocks.
3. Divide the matrix into blocks of n rows: write a line separating row
   n from row n + 1, row 2n from row 2n + 1, etc. There will be m such
   row blocks.
4. The first n entries of vec(X^T) (corresponding to the first n rows of
   K_mn) will be the elements directly to the right of the column
   separators.
5. The second n entries of vec(X^T) (corresponding to the second n rows
   of K_mn) will be the elements one column to the right of the column
   separators, etc.
Or, in general: the row of K_mn holding x_ij's position in vec(X^T)
(row (i - 1)n + j) is the elementary row vector e^T_{(j-1)m+i}, which
picks off x_ij's position in vec(X), where e_k denotes the kth column of
the mn × mn identity matrix. Written out block by block, the first n
rows of K_mn are e_1^T, e_{m+1}^T, ..., e_{m(n-1)+1}^T; the next n rows
are e_2^T, e_{m+2}^T, ..., e_{m(n-1)+2}^T; and so on, down to e_m^T,
e_{2m}^T, ..., e_{mn}^T.
Now let B be a p × q matrix, X be a q × n matrix, and A be an m × n
matrix. Then:

K_pm vec[(BXA^T)_{p×m}] = vec[(BXA^T)^T]
                        = vec(AX^T B^T)
                        = (B ⊗ A) vec(X^T)            by (4)
                        = (B ⊗ A) K_qn vec(X_{q×n})   by (6)

But because

K_pm vec(BXA^T) = K_pm (A ⊗ B) vec(X)    by (4)

it follows that

(B ⊗ A) K_qn = K_pm (A ⊗ B)    (7)
The Duplication Matrix
Another useful matrix: the Duplication Matrix, D_n, defined such that

D_n vech(A_{n×n}) = vec(A_{n×n})    (8)

The number of rows of D corresponds to the length of vec(A), and the
number of columns of D corresponds to the length of vech(A). Because
vech(A) will always be shorter than vec(A), D will have at least as many
rows as columns.

Furthermore, the columns of D are linearly independent. Why?
The length of vec(A) is equal to the number of elements in A, and the
length of vech(A) is equal to the number of elements on the lower
triangle of A, where n is the number of rows/columns of A.

Therefore:

Rows of D_n = n^2        Columns of D_n = n(n + 1)/2
How does the duplication matrix appear?

Each column corresponding to an off-diagonal element of A will have two
1s. Each column corresponding to a diagonal element of A will have only
one 1.

A_{3×3} = [ 1 2 3 ]        D_3 = [ 1 0 0 0 0 0 ]
          [ 2 4 5 ]              [ 0 1 0 0 0 0 ]
          [ 3 5 6 ]              [ 0 0 1 0 0 0 ]
                                 [ 0 1 0 0 0 0 ]
                                 [ 0 0 0 1 0 0 ]
                                 [ 0 0 0 0 1 0 ]
                                 [ 0 0 1 0 0 0 ]
                                 [ 0 0 0 0 1 0 ]
                                 [ 0 0 0 0 0 1 ]
Why? Well, multiplying vech(A) by D_3:

D_3 vech(A) = D_3 (1, 2, 3, 4, 5, 6)^T
            = (1, 2, 3, 2, 4, 5, 3, 5, 6)^T
            = vec[ 1 2 3 ]
                 [ 2 4 5 ]
                 [ 3 5 6 ]
            = vec(A)
What is the duplication matrix for an arbitrary vech(X_{n×n})?

Given an n × n matrix X, vech(X) will be [n(n + 1)/2] × 1:

1. The first n elements of vech(X) correspond to the first column of X.
2. The next n - 1 elements of vech(X) correspond to the second column of X.
3. The next n - 2 elements of vech(X) correspond to the third column of X.
4. The last element of vech(X) corresponds to the nth column of X.

Note that vech(X) will affect ever-decreasing numbers of elements in the
columns.
So the duplication matrix contains n column blocks, each affecting a
particular column of X and corresponding to a particular set of elements
in vech(X):

D_n = [ d_1 ... d_n | d_{n+1} ... d_{n+(n-1)} | ... | d_{n(n+1)/2} ]

Rather than dividing blocks of the same length, the separators divide
blocks of increasingly shortening lengths, because the number of
elements in vech(X) corresponding to a particular column of X decreases
by 1 in each column.

How many elements are in each column of D_n?
Now where does each element of a particular block go in the new matrix?
We are turning vech(X) into vec(X):

1. There are n rows and n columns of X.
2. The first column block of D_n takes the first column and puts it in
   the first column and first row.
3. The second column block of D_n takes the second column and puts it in
   the second column and second row.
4. Because there are n rows in X, elements in the first column of X are
   both directly next to each other at one point and separated by n
   elements at another point.
So to create a duplication matrix...

1. Create an n^2 × [n(n + 1)/2] matrix.
2. Divide the matrix into column blocks of decreasing size, starting
   with size n: write a line separating column n from column n + 1,
   column n + (n - 1) from column n + (n - 1) + 1, etc. There will be n
   such column blocks.
3. Divide the matrix into row blocks of size n: write a line separating
   row n from row n + 1, row 2n from row 2n + 1, etc. There will be n
   such row blocks.
4. The first n entries of vec(X) (corresponding to the first n rows of
   D_n) will use the first n columns of D_n.
5. The second n entries of vec(X) will consist of the second column in
   the first block of D_n followed by all of the entries in the second
   block of D_n.
6. The third n entries of vec(X) will consist of the third column in the
   first block of D_n, followed by the second column in the second block
   of D_n, followed by all of the entries in the third block of D_n.
Or, in general: each row of D_n is an elementary row vector that picks
off the vech(X) entry belonging in that position of vec(X). Rows of D_n
corresponding to lower-triangle (or diagonal) elements x_ij with i ≥ j
pick off that element's own vech(X) entry, while rows corresponding to
upper-triangle elements x_ij with i < j reuse the vech(X) entry of the
mirrored element x_ji, exactly following the block pattern described
above.
Patterned Matrix Code
commutator <- function(m, n){
  # Builds K_mn: K %*% c(A) equals c(t(A)) for an m x n matrix A
  mn <- m * n
  K  <- matrix(0, mn, mn)
  index <- 0
  col   <- 0
  for(i in 1:n){
    index <- index + 1
    row   <- index             # rows step by n within a column block
    for(j in 1:m){
      col <- col + 1
      K[row, col] <- 1
      row <- row + n
    }
  }
  return(K)
}

duplicator <- function(n){
  # Builds D_n: D %*% vech(A) equals c(A) for a symmetric n x n matrix A
  D <- matrix(0, n^2, n * (n + 1) / 2)
  index <- n + 1
  row   <- 0
  for(i in 1:n){
    index <- index - 1               # block i has `index` new elements
    n2 <- n
    col.blocksep <- n - index + 1    # first vech column used by block i
    if(index != n){
      # upper-triangle rows: reuse vech columns from earlier blocks
      for(k in (index + 1):n){
        row <- row + 1
        D[row, col.blocksep] <- 1
        n2 <- n2 - 1
        col.blocksep <- col.blocksep + n2
      }
    }
    # lower-triangle (and diagonal) rows: fresh vech columns
    for(j in 1:index){
      row <- row + 1
      col.ident <- col.blocksep + j - 1
      D[row, col.ident] <- 1
    }
  }
  return(D)
}
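A quick usage check (my addition, not in the original slides) that these
functions satisfy definitions (6) and (8) and identity (7):

A <- matrix(1:6, nrow = 3)                          # 3 x 2
all.equal(c(commutator(3, 2) %*% c(A)), c(t(A)))    # (6): TRUE

S <- matrix(c(1, 2, 3, 2, 4, 5, 3, 5, 6), 3, 3)     # symmetric
vechS <- S[lower.tri(S, diag = TRUE)]
all.equal(c(duplicator(3) %*% vechS), c(S))         # (8): TRUE

A2 <- matrix(rnorm(6), 3, 2)                        # m x n = 3 x 2
B2 <- matrix(rnorm(8), 4, 2)                        # p x q = 4 x 2
all.equal((B2 %x% A2) %*% commutator(2, 2),         # (B ⊗ A) K_qn
          commutator(4, 3) %*% (A2 %x% B2))         # = K_pm (A ⊗ B): TRUE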
Vector Differential Calculus
Continuity and Differentiability
Continuity and Accumulation
To understand the treatment of matrix calculus, we should review a few
definitions.

Continuity: φ(c) is continuous at c if one of two things holds:

1. For any ε > 0, there exists a δ > 0 such that ||u|| < δ forces
   ||φ(c + u) - φ(c)|| < ε.
   Given any small distance past c in the y direction, we can find
   points close to c in the x direction. This only applies to
   accumulation points.
2. c is not an accumulation point.

An accumulation point (or cluster point) is just a limiting point,
meaning that φ(c) is the limit of the function φ(c + u) as u → 0.
Non-accumulation points are also called isolated points or "dots in
space," and are automatically continuous, but trivially so.
Differentiability and Taylor Series
Taylor Series: Using the Taylor series, we can approximate any (suitably
smooth) function with a polynomial of any size:

φ(x) = φ(c) + [φ'(c)/1!](x - c) + ... + [φ^(n)(c)/n!](x - c)^n + ...
     = Σ_{k=0}^{∞} [φ^(k)(c)/k!] (x - c)^k
     = Σ_{k=0}^{p} [φ^(k)(c)/k!] (x - c)^k + r_c(x - c)

where r_c (the remainder) usually converges at some rate.
Taylor Series: Replacing x with c + u, so that u = x - c:

φ(c + u) = Σ_{k=0}^{∞} [φ^(k)(c)/k!] u^k
         = Σ_{k=0}^{p} [φ^(k)(c)/k!] u^k + r_c(u)
         = φ(c) + u φ'(c) + u^2 φ''(c)/2 + r_{2c}(u)
         = φ(c) + u φ'(c) + r_{1c}(u)

The third line is the second-order Taylor formula;
the fourth line is the first-order Taylor formula.
Differentiability and Taylor Series

Rewriting the equation on the previous page:

[φ(c + u) - φ(c)]/u = φ'(c) + r_{1c}(u)/u

We know, based on calculus, that

lim_{u→0} [φ(c + u) - φ(c)]/u = φ'(c)

which is the definition of the derivative, and which implies

lim_{u→0} r_{1c}(u)/u = 0
Differentiability
Based on the expansion above, we have

φ(c + u) = φ(c) + u φ'(c) + r_{1c}(u)

so that φ(c) + u φ'(c) is the best linear approximation to the original
function. But the strength of the linear approximation depends on the
size of r_{1c}(u).

The first differential:

d φ(c; u) = u φ'(c)    (9)

Equation (9) is the linear part of φ(c + u) - φ(c).
Multidimensional Taylor Series
We can expand to linearize a vector function:

f(c + u) = f(c) + A(c)u + r_c(u)
         = f(c) + D f(c) u + r_c(u)

As ||u|| → 0, d f(c; u) = D f(c) u is called the differential, D f(c) is
the first derivative (the Jacobian matrix), and ∇f(c) = D f(c)^T is the
gradient of f at c.

Letting ||u|| → 0 is equivalent to setting w as a unit-length vector and
t as a scalar (such that tw = u) and letting t → 0: the directional
derivative approach.
Properties of the Differential
Note 1: For the differential to make sense, the original function must
be defined on a ball B(c; r) surrounding c with radius r, and
c + u ∈ B(c; r).

Note 2: If f : S → R^m is defined on a set S, c is an interior point of
that set, the function is continuous at c, and each of the partial
derivatives exists in some small neighborhood surrounding c, then the
derivative exists at c.

Note 3: There is only one first derivative. The rows of the Jacobian are
gradients of particular partial functions of the vector function f,
whereas the columns are the partial derivatives of f with respect to a
particular element of c.
Multivariate Chain Rule
If h(x) = g(f(x)), f(c) = b, and the function h is differentiable at c,
then

D h(c) = (D g(b))(D f(c))    (10)

Expanding the multivariate chain rule (with f : R^n → R^p and
g : R^p → R^k), the k × n Jacobian of h factors as

[ ∂h(c)_1/∂c_1 ... ∂h(c)_1/∂c_n ]   [ ∂g(b)_1/∂b_1 ... ∂g(b)_1/∂b_p ]   [ ∂f(c)_1/∂c_1 ... ∂f(c)_1/∂c_n ]
[      ...             ...      ] = [      ...             ...      ] × [      ...             ...      ]
[ ∂h(c)_k/∂c_1 ... ∂h(c)_k/∂c_n ]   [ ∂g(b)_k/∂b_1 ... ∂g(b)_k/∂b_p ]   [ ∂f(c)_p/∂c_1 ... ∂f(c)_p/∂c_n ]
Keeping track of the multivariate chain rule is straightforward if you
remember that the partial functions go down the rows and the partial
values go across the columns.

If h = φ, a univariate function, and f = f(t), a multivariate function
of a scalar, the multivariate chain rule simplifies. For example:

g(x) = x_1^2 + 2x_2        f(t) = [ t + 2cos(t) ]
                                  [    ln(t)    ]
Method 1: Substitute, then differentiate.

φ(t) = g(f(t))
     = (t + 2cos(t))^2 + 2 ln(t)
     = t^2 + 4t cos(t) + 4cos^2(t) + 2 ln(t)

So

dφ(t)/dt = 2t + 4t[-sin(t)] + 4cos(t) + 8cos(t)[-sin(t)] + 2(1/t)
         = 2t - 4t sin(t) + 4cos(t) - 8cos(t)sin(t) + 2/t
Method 2: Apply the multivariate chain rule directly.

φ(t) = g(f(t)), so

dφ(t)/dt = [ ∂g(f(t))/∂f_1(t)   ∂g(f(t))/∂f_2(t) ] [ ∂f_1(t)/∂t ]
                                                   [ ∂f_2(t)/∂t ]
         = [ 2(t + 2cos(t))   2 ] [ 1 - 2sin(t) ]
                                  [     1/t     ]
         = 2(t + 2cos(t))[1 - 2sin(t)] + 2(1/t)
         = 2t - 4t sin(t) + 4cos(t) - 8cos(t)sin(t) + 2/t
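A numeric check of both methods (my addition), comparing the analytic
derivative against a central finite difference:

phi  <- function(t) (t + 2 * cos(t))^2 + 2 * log(t)     # g(f(t))
dphi <- function(t) 2 * t - 4 * t * sin(t) + 4 * cos(t) -
                    8 * cos(t) * sin(t) + 2 / t
t0 <- 1.3; h <- 1e-6
(phi(t0 + h) - phi(t0 - h)) / (2 * h)                   # numeric slope
dphi(t0)                                                # matches closely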
By virtue of the multivariate chain rule process,

f : R → R^m,    g : R^m → R,    φ : R → R

Therefore, if φ is a scalar function of a scalar but has a vector as an
intermediate step, then we have the chain rule from vector calculus:

dφ/dt = Σ_{i=1}^{m} (∂g/∂x_i)(∂x_i/∂t)
Cauchy's Rules and the Hessian
Cauchy's Rule of Invariance

When we apply the chain rule to a composite differential (instead of
only a derivative), the differentials also apply sequentially:

d h(c; u) = D h(c) u                 by (9)
          = (D g(b))(D f(c)) u       by (10)
          = D g(b) d f(c; u)         by (9)
          = d g[b; d f(c; u)]        (11)

Moving a little bit in the u direction moves f(c) up a particular
amount, and moving f(c) up a particular amount moves g(b) up a
particular amount (because b, and hence g(b), depends on f(c)).
The Hessian
A real-valued function can be approximated with a 2nd-degree polynomial:

φ(c + u) = φ(c) + Dφ(c) u + (1/2) u^T B u + r_{2c}(u)    (12)

as long as the remainder converges at a particular rate:

lim_{||u||→0} r(u)/||u||^2 = 0
Properties of the (Second) Differential
Properties of the second differential:

1. The second differential is just the differential of the first
   differential.
2. The conditions for the second differential (and second derivative) to
   exist are identical to the conditions for the first differential to
   exist. We are just pretending that the first differential is our
   original function.
The Hessian Matrix
Even though only one vector a satisfies

d φ(c; u) = a^T u

an infinite number of matrices B satisfy

d^2 φ(c; u) = u^T B u

The unique Hessian is defined as

Hφ(c) = (1/2)[B(c) + B(c)^T]    (13)
Cauchy's Rule of Invariance: Part II

Unfortunately, the second differential is not Cauchy invariant:

d^2 h(c; u) ≠ d^2 g(b; d f(c; u))

Why? Well, by the original chain rule, we have

h'(c) = g'(f(c)) f'(c)

which implies that d h(c; u) = d g(b; d f(c; u)). But when taking the
second derivative, the product rule gets in the way:

h''(c) = g''(f(c)) [f'(c)]^2 + g'(f(c)) f''(c)
       ≠ g''(f(c)) [f'(c)]^2
In other words:

1. In the original function, u is a constant with respect to c.
2. In the derivative function, d f(c; u) is no longer a constant with
   respect to c. This is the same reason why we must apply the product
   rule in the middle of two chain rules.

Therefore,

d^2 h(c; u) = d^2 g(b; d f(c; u)) + d g(b; d^2 f(c; u))    (14)

by applying the product and chain rules to the first differential.
Matrix Differential Calculus
The Transition: Part I
The transition from vector calculus to matrix calculus is
straightforward (according to Magnus).

Step 1: First, he amends his notation to consider matrix derivatives.
If for vector derivatives

D f(x) := ∂f(x)/∂x^T

then for matrix derivatives

D F(X) := ∂F(X)/∂[vec(X)]^T

He thus turns a matrix into a vector to apply the vector theory.
The Transition: Part II
Step 2: Second, he amends his vector differentials to apply to matrices:

vec[d F(C; U)] = d vec[F(C; U)] = A(C) vec(U)    (15)

Thus, every partial derivative is specified, the order is prespecified,
and the theory proceeds from the previous section.
The Transition: Part III
Therefore, to find differentials with respect to matrices:

1. Apply the vec operator to both sides.
2. Take the differential of both sides.
3. Simplify until A(X) d vec(X) is on the right side.
   A is a matrix, d vec(X) is a vector, and A must not depend on
   d vec(X). A is the derivative.
4. Take the differential again.
5. Simplify until [d vec(X)]^T B(X) [d vec(X)] remains;
   (1/2)[B(X) + B(X)^T] is the Hessian.
Basic Differentials and Derivatives
Preliminary Results
Basic Properties of Differentials
The first six differential/derivative rules:

d A = O                                  (16)
d(αF) = α d F                            (17)
d(F + G) = d F + d G                     (18)
d tr F = tr(d F)                         (19)
d(FG) = (d F)G + F(d G)                  (20)
d(F ⊗ G) = (d F) ⊗ G + F ⊗ (d G)         (21)

These rules are a consequence of the differential being a linear
operator on the derivative, and a derivative matrix being a matrix of
derivatives.
For instance, take Equation (18):

d(F + G) = d F + d G

For an arbitrary element i in the differential vector,

d_i(F + G) = D_i.(F + G)^T u

where D_i.(F + G)^T is the ith row of D(F + G). Finally,

D_i.(F + G)^T u = Σ_j [D_ij(F + G) u_j]
                = Σ_j [D_ij(F) u_j] + Σ_j [D_ij(G) u_j]
                = d_i F + d_i G

Because linearity applies for an arbitrary element in the differential
vector, it holds for the entire vector of differentials.
Now, take Equation (20):

d(FG) = (d F)G + F(d G)

For an arbitrary element (i, j),

[d(FG)]_ij = d(FG)_ij = d( Σ_k f_ik g_kj )
           = Σ_k d(f_ik g_kj)
           = Σ_k [(d f_ik) g_kj + f_ik (d g_kj)]
           = Σ_k [(d f_ik) g_kj] + Σ_k [f_ik (d g_kj)]
           = [(d F)G]_ij + [F(d G)]_ij

Therefore, formulas that work on a linear operator of the derivative
also work on the differential.
Scalar Functions
Basic Scalar Functions: φ(x) = a^T x

Our first function: φ(x) = a^T x. Then

d φ(x) = d(a^T x) = a^T d x    by (17)

Thus:

d(a^T x) = a^T d x    (22)
D(a^T x) = a^T        (23)
Basic Scalar Functions: φ(x) = x^T Ax

Our next function: φ(x) = x^T Ax. Then

d φ(x) = d(x^T Ax)
       = d(x^T)Ax + x^T d(Ax)       by (20)
       = d(x)^T Ax + x^T A d(x)     by (17)
       = x^T A^T d(x) + x^T A d(x)
       = [x^T (A^T + A)] d(x)

Thus:

d(x^T Ax) = [x^T (A^T + A)] d(x)    (24)
D(x^T Ax) = x^T (A^T + A)           (25)
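A numeric check of (25) (my addition), using central differences:

set.seed(2)
A <- matrix(rnorm(9), 3, 3); x <- rnorm(3)
phi <- function(x) c(t(x) %*% A %*% x)
num <- sapply(1:3, function(i) {
  e <- replace(numeric(3), i, 1e-6)
  (phi(x + e) - phi(x - e)) / 2e-6
})
all.equal(c(t(x) %*% (t(A) + A)), num)   # D(x^T A x) = x^T (A^T + A): TRUE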
Scalar Functions of Mat 1: φ(X) = a^T Xb

Our third function: φ(X) = a^T Xb.
Now the differential is with respect to a matrix.

d vec[φ(X)] = vec[d φ(X)]              by (15)
            = vec[d(a^T Xb)]
            = vec[a^T (d X)b]          by (17)
            = (b^T ⊗ a^T) vec(d X)     by (4)
            = (b^T ⊗ a^T) d vec(X)     by (15)
According to the previous derivation, the matrix differential of a^T Xb
is:

d vec[a^T Xb] = (b^T ⊗ a^T) d vec(X)          (26)
D vec[a^T Xb] = b^T ⊗ a^T = (b ⊗ a)^T         (27)

Notice the (important) use of the vec-to-Kronecker rule.
Scalar Functions of Mat 2: φ(X) = a^T XX^T a

A slightly more complicated function: φ(X) = a^T XX^T a.
By Equation (15), the matrix differential is as follows:

d vec[φ(X)] = vec[d φ(X)]                                  by (15)
            = vec[d(a^T XX^T a)]
            = vec[a^T (d X)X^T a + a^T X d(X^T)a]          by (20) and (17)
            = vec[a^T (d X)X^T a + a^T X(d X)^T a]
            = vec[a^T (d X)X^T a + (a^T X(d X)^T a)^T]     scalar transpose
            = vec[a^T (d X)X^T a + a^T (d X)X^T a]
            = vec[2a^T (d X)X^T a]
Finishing:

vec[2a^T (d X)X^T a] = 2[(X^T a)^T ⊗ a^T] vec(d X)    by (4)
                     = [2(X^T a)^T ⊗ a^T] d vec(X)    by (15)

Thus, the matrix differential of a^T XX^T a is:

d vec[a^T XX^T a] = 2[(X^T a)^T ⊗ a^T] d vec(X)          (28)
D vec[a^T XX^T a] = 2(X^T a)^T ⊗ a^T = 2(X^T a ⊗ a)^T    (29)
Trace Functions

Finding the differential of trace functions uses

tr(A^T B) = vec(A)^T vec(B)    (30)

Why can we use Equation (30)? Well:

tr(A^T B) = Σ_{j=1}^{n} Σ_{i=1}^{m} (a_ij b_ij) = vec(A)^T vec(B)

Vectorizing both matrices and taking the dot product is equivalent to
summing the elementwise products of the entries of the two matrices
(the squares of every entry when A = B).
Trace Functions: tr(A^T X)

The first trace function: tr(A^T X).

d[tr(A^T X)] = d[vec(A)^T vec(X)]     by (30)
             = vec(A)^T d vec(X)      by (17)

And the matrix differential of tr(A^T X):

d[tr(A^T X)] = vec(A)^T d vec(X)    (31)
D[tr(A^T X)] = vec(A)^T             (32)
Trace Functions: tr(X^p)

The next trace function: tr(X^p).

d[tr(X^p)] = tr[d(X^p)]
           = tr[(d X)X^{p-1} + X(d X)X^{p-2} + ... + X^{p-1}(d X)]       by (20)
           = tr[(d X)X^{p-1}] + tr[X(d X)X^{p-2}] + ... + tr[X^{p-1}(d X)]
           = tr[X^{p-1}(d X)] + tr[X^{p-1}(d X)] + ... + tr[X^{p-1}(d X)]
           = p tr[X^{p-1}(d X)]
           = p vec([X^T]^{p-1})^T d vec(X)                               by (30)

Both the second-to-third and the third-to-fourth lines use typical trace
rules (e.g., linearity of traces and cyclic permutation).
And the matrix differential of tr(X^p):

d[tr(X^p)] = p vec([X^T]^{p-1})^T d vec(X)    (33)
D[tr(X^p)] = p vec([X^T]^{p-1})^T             (34)

Note that Equation (34) is similar to differentiating a scalar
polynomial.
Trace Functions: tr(X^T X)

The first power trace to differentiate: tr(X^T X).

d tr(X^T X) = tr[d(X^T X)]
            = tr[d(X^T)X + X^T d(X)]          by (20)
            = tr[d(X)^T X] + tr[X^T d(X)]
            = tr[(d(X)^T X)^T] + tr[X^T d(X)]
            = tr[X^T d(X)] + tr[X^T d(X)]
            = 2 tr[X^T d(X)]
And we have

d tr(X^T X) = 2 tr[X^T d(X)]
            = 2 vec(X)^T d vec(X)    by (30)

which implies

d tr(X^T X) = 2 vec(X)^T d vec(X)    (35)
D tr(X^T X) = 2 vec(X)^T             (36)
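Equation (36) is easy to verify numerically (my addition): tr(X^T X) is
just the sum of squared entries, so its gradient should be 2 vec(X).

X <- matrix(rnorm(6), 2, 3)
phi <- function(v) sum(v^2)              # tr(X^T X) as a function of vec(X)
num <- sapply(seq_along(X), function(i) {
  e <- replace(numeric(6), i, 1e-6)
  (phi(c(X) + e) - phi(c(X) - e)) / 2e-6
})
all.equal(num, 2 * c(X))                 # (36): TRUE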
Trace Functions: tr(XAXB)

The next power trace function: tr(XAXB).

d tr(XAXB) = tr[d(XAXB)]
           = tr[d(X)AXB + XA d(X)B]            by (20)
           = tr[AXB d(X) + BXA d(X)]
           = tr[(AXB + BXA) d(X)]
           = vec[(AXB + BXA)^T]^T d vec(X)     by (30)

And thus:

d tr(XAXB) = vec[(AXB + BXA)^T]^T d vec(X)    (37)
D tr(XAXB) = vec[(AXB + BXA)^T]^T             (38)
Trace Functions: tr[(X^T AXB)^p]

The final scalar function: tr[(X^T AXB)^p].
This function (in its general state) captures the remaining
differentials on pages 358-359 of Magnus:

tr(X^T X) = Σ_i Σ_j x_ij^2:      set A = I, B = I, and p = 1
tr[(X^T X)^p] = tr[(XX^T)^p]:    set A = I and B = I
First, apply the product rule sequentially:

d[tr{(X^T AXB)^p}] = tr{d[(X^T AXB)^p]}
                   = tr{d[X^T AXB](X^T AXB)^{p-1}
                       + ... + (X^T AXB)^{p-1} d[X^T AXB]}
                   = p tr{(X^T AXB)^{p-1} d(X^T AXB)}

Next, note that

d(X^T AXB) = (d X)^T AXB + X^T A(d X)B

which is due to the product rule and multiplication by constants.
And replacing:

d[tr{(X^T AXB)^p}] = p tr{(X^T AXB)^{p-1} d(X^T AXB)}
                   = p tr{(X^T AXB)^{p-1} [(d X)^T AXB + X^T A(d X)B]}
                   = p tr{(X^T AXB)^{p-1} (d X)^T AXB
                       + (X^T AXB)^{p-1} X^T A(d X)B}
                   = p tr{[(X^T AXB)^{p-1} (d X)^T AXB]^T
                       + (X^T AXB)^{p-1} X^T A(d X)B}
                   = p tr{B^T X^T A^T (d X)[(X^T AXB)^{p-1}]^T
                       + (X^T AXB)^{p-1} X^T A(d X)B}
                   = p tr{[(X^T AXB)^{p-1}]^T B^T X^T A^T (d X)
                       + (X^T AXB)^{p-1} X^T A(d X)B}
We ultimately have:

d[tr{(X^T AXB)^p}] = p tr{[(X^T AXB)^{p-1}]^T B^T X^T A^T (d X)
                         + (X^T AXB)^{p-1} X^T A(d X)B}
                   = p tr{[(X^T AXB)^{p-1}]^T B^T X^T A^T (d X)
                         + B(X^T AXB)^{p-1} X^T A(d X)}
                   = p tr{ [[(X^T AXB)^{p-1}]^T B^T X^T A^T
                         + B(X^T AXB)^{p-1} X^T A] (d X)}
                   = p vec{ [[(X^T AXB)^{p-1}]^T B^T X^T A^T
                         + B(X^T AXB)^{p-1} X^T A]^T }^T d vec(X)
                   = p vec{ AX[B(X^T AXB)^{p-1}]
                         + A^T X[B(X^T AXB)^{p-1}]^T }^T d vec(X)
Which implies:

d[tr{(X^T AXB)^p}] = p vec{AX[B(X^T AXB)^{p-1}]
                         + A^T X[B(X^T AXB)^{p-1}]^T}^T d vec(X)    (39)
D[tr{(X^T AXB)^p}] = p vec{AX[B(X^T AXB)^{p-1}]
                         + A^T X[B(X^T AXB)^{p-1}]^T}^T             (40)

Even though Equations (39) and (40) do not appear interesting, they
generalize to all of the matrix differentials on pages 358-359 of
Magnus. For instance:

D[tr(X^T X)] = 1 vec{IX[I(X^T IXI)^0] + I^T X[I(X^T IXI)^0]^T}^T
             = vec[XI + XI^T]^T
             = vec[X + X]^T = 2 vec(X)^T
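Equation (40) can also be checked numerically (my addition); here with
p = 2 and random square matrices:

set.seed(3)
n <- 3; p <- 2
A <- matrix(rnorm(n^2), n, n); B <- matrix(rnorm(n^2), n, n)
X <- matrix(rnorm(n^2), n, n)
phi <- function(v) {
  Xv <- matrix(v, n, n); M <- t(Xv) %*% A %*% Xv %*% B
  sum(diag(M %*% M))                                  # tr[(X^T A X B)^2]
}
M1 <- B %*% t(X) %*% A %*% X %*% B                    # B (X^T A X B)^{p-1}
grad <- p * c(A %*% X %*% M1 + t(A) %*% X %*% t(M1))  # Equation (40)
num <- sapply(seq_along(X), function(i) {
  e <- replace(numeric(n^2), i, 1e-6)
  (phi(c(X) + e) - phi(c(X) - e)) / 2e-6
})
all.equal(grad, num, tolerance = 1e-6)                # TRUE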
Trace Differentials
The standard process of computing trace differentials:

1. Put the differential inside the trace operator.
2. Perform the standard product rule or chain rule.
3. Take transposes and rotate to get d(X) on the outside.
4. Combine terms.
5. Use the trace-to-vec identity.
Vector Functions
Vector Functions of Vec 1: f(x) = A(x)x
The first vector function: f(x) = A(x)x.
This is the most general function mentioned on pp. 360 in Magnus.
f(x) = A(x)x
If A depends on x, then
d[f(x)] = d[A(x)x]
= d[A(x)]x + A(x) d x
by (20)
= vec{d[A(x)]x} + A(x) d x
= vec{I d[A(x)]x} + A(x) d x
= (xT I) vec{d[A(x)]} + A(x) d x
= (xT I) D vec[A(x)] d x + A(x) d x
= (xT I) D vec[A(x)] + A(x) d x
Steven W. Nydick
by (9)
(41)
Vector Functions of Vec 2: f(x) = [xT x]a(x)
The second vector function: f(x) = [xT x]a(x).
If a depends on x, then
d f(x) = d{[xT x]a(x)}
= d[xT x]a(x) + xT x d[a(x)]
by (20)
= d[xT Ix]a(x) + xT x D a(x) d x
= [xT (IT + I)] d(x)a(x) + xT x D a(x) d x
by (24)
= [2xT ] d(x)a(x) + xT x D a(x) d x
= 2a(x)xT d(x) + xT x D a(x) d x
= [2a(x)xT + xT x D a(x)] d x
Steven W. Nydick
(42)
Vector Functions of Mat: f(X) = Xa
A vector function of a matrix: f(X) = Xa.
The second example of Magnus (f(X) = XT ) is redundant.
d[f(X)] = d[Xa]
= d[X]a
by (17)
= vec(d[X]a)
= vec(I d[X]a)
= (aT In ) vec(d X)
T
= (a In ) d vec(X)
Steven W. Nydick
by (4)
(43)
Matrix Functions
Matrix Functions of Vec: F(x) = xx^T

The first (and easiest) matrix function: F(x) = xx^T.
To differentiate a matrix, first vectorize it:

d vec(F) = d vec(xx^T)
         = vec[d(xx^T)]
         = vec[d(x)x^T + x d(x^T)]                       by (20)
         = vec[d(x)x^T] + vec[x d(x^T)]
         = vec[I_n d(x)x^T] + vec[x d(x^T)I_n]
         = (x ⊗ I_n) d vec(x) + (I_n ⊗ x) d vec(x^T)     by (4)
         = (x ⊗ I_n) d vec(x) + (I_n ⊗ x) d vec(x)
         = [(x ⊗ I_n) + (I_n ⊗ x)] d vec(x)              (44)
Matrix Functions of Mat 1: F(X) = X^2

A matrix power to differentiate: F(X) = X^2 (X is square, n × n).

d vec(F) = d vec(X^2) = vec[d(XX)]
         = vec[d(X)X + X d(X)]                           by (20)
         = vec[I_n d(X)X] + vec[X d(X)I_n]
         = (X^T ⊗ I_n) d vec(X) + (I_n ⊗ X) d vec(X)     by (4)
         = [X^T ⊗ I_n + I_n ⊗ X] d vec(X)                (45)

Make sure to vectorize first, and the differentials are easy.
Matrix Functions of Mat 2: F(X) = X^T

Another matrix to differentiate: F(X) = X^T (X is of size m × n).
Just remember that there is a commutation matrix to help:

d vec(F) = d vec(X^T)
         = d[K_mn vec(X)]    by (6)
         = K_mn d vec(X)     (46)

Apply the commutation matrix prior to differentiating, and then realize
that the commutation matrix is a constant with respect to X.
Matrix Functions of Mat 3: F(X) = X^T X

Another matrix to differentiate: F(X) = X^T X (X is of size m × n),
also taking advantage of the commutation matrix:

d vec(F) = d vec(X^T X)
         = vec[d(X^T X)]
         = vec[d(X^T)X + X^T d(X)]                                by (20)
         = vec[I_n d(X^T)X] + vec[X^T d(X)I_n]
         = (X^T ⊗ I_n) d vec(X^T) + (I_n ⊗ X^T) d vec(X)          by (4)
         = (X^T ⊗ I_n) d[K_mn vec(X)] + (I_n ⊗ X^T) d vec(X)      by (6)
         = (X^T ⊗ I_n)K_mn d vec(X) + (I_n ⊗ X^T) d vec(X)
         = [(X^T ⊗ I_n)K_mn + (I_n ⊗ X^T)] d vec(X)
By Equation (7),

K_pm (A ⊗ B) = (B ⊗ A) K_qn

where A is of size m × n and B is of size p × q.

Therefore, because X is of size m × n, X^T is of size n × m, and I_n is
of size n × n, we have

d vec(F) = [(X^T ⊗ I_n)K_mn + (I_n ⊗ X^T)] d vec(X)
         = [K_nn (I_n ⊗ X^T) + I_{n^2} (I_n ⊗ X^T)] d vec(X)    by (7)
         = [(K_nn + I_{n^2})(I_n ⊗ X^T)] d vec(X)               (47)
Matrix Functions of Mat 4: F(X) = XAX^T

Another matrix to differentiate: F(X) = XAX^T.
If A is symmetric and X is of size m × n, then

d vec(F) = d vec(XAX^T)
         = vec[d(XAX^T)]
         = vec[d(X)AX^T + XA d(X^T)]                              by (20)
         = vec[d(X)AX^T] + vec[XA d(X^T)]
         = vec[I_m d(X)AX^T] + vec[XA d(X^T)I_m]
         = ([AX^T]^T ⊗ I_m) d vec(X) + (I_m ⊗ XA) d vec(X^T)      by (4)
         = (XA ⊗ I_m) d vec(X) + (I_m ⊗ XA) d[K_mn vec(X)]        by (6)
         = (XA ⊗ I_m) d vec(X) + (I_m ⊗ XA)K_mn d vec(X)
Continuing:

d vec(F) = (XA ⊗ I_m) d vec(X) + (I_m ⊗ XA)K_mn d vec(X)
         = I_{m^2}(XA ⊗ I_m) d vec(X) + K_mm (XA ⊗ I_m) d vec(X)    by (7)
         = [(I_{m^2} + K_mm)(XA ⊗ I_m)] d vec(X)                    (48)

Make sure to remember the order of the commutation matrices:

K_mn vec(A_{m×n}) = vec(A^T_{n×m})
K_pm (A_{m×n} ⊗ B_{p×q}) = (B_{p×q} ⊗ A_{m×n}) K_qn
Matrix Functions of Mat 5: F(X) = X^T AX^T

A final matrix function: F(X) = X^T AX^T (X is of size m × n).

d vec(F) = d vec(X^T AX^T)
         = vec[d(X^T AX^T)]
         = vec[d(X^T)AX^T + X^T A d(X^T)]                              by (20)
         = vec[d(X^T)AX^T] + vec[X^T A d(X^T)]
         = vec[I_n d(X^T)AX^T] + vec[X^T A d(X^T)I_m]
         = (XA^T ⊗ I_n) d vec(X^T) + (I_m ⊗ X^T A) d vec(X^T)          by (4)
         = (XA^T ⊗ I_n) d[K_mn vec(X)] + (I_m ⊗ X^T A) d[K_mn vec(X)]  by (6)
         = (XA^T ⊗ I_n)K_mn d vec(X) + (I_m ⊗ X^T A)K_mn d vec(X)
         = {[(XA^T ⊗ I_n) + (I_m ⊗ X^T A)]K_mn} d vec(X)               (49)
Differentials of Special Matrices
The Inverse
Differential of The Inverse: F(X) = X^{-1}

The differential of the inverse is a "once seen, never forgotten"
problem. Note that if X is invertible, then

XX^{-1} = X^{-1}X = I

A matrix times its inverse is the identity matrix, and the differential
of the left side is equivalent to that of the right side. The right side
is a constant (I), and what is the differential of a constant?
Because

d(I) = O    by (16)

we have

d(X^{-1}X) = d(I)
d(X^{-1}X) = O                        by (16)
d(X^{-1})X + X^{-1} d(X) = O          by (20)
d(X^{-1})X = -X^{-1} d(X)
d(X^{-1}) = -X^{-1} d(X)X^{-1}

And:

d(X^{-1}) = -X^{-1} d(X)X^{-1}    (50)
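A quick first-order check of (50) (my addition):

set.seed(4)
X <- matrix(rnorm(9), 3, 3)
dX <- matrix(rnorm(9), 3, 3) * 1e-6      # small perturbation
lhs <- solve(X + dX) - solve(X)          # actual change in X^{-1}
rhs <- -solve(X) %*% dX %*% solve(X)     # -X^{-1} dX X^{-1}
max(abs(lhs - rhs))                      # tiny: agreement to first order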
Differential of the Inverse
To differentiate a function that involves the inverse of X:

1. Vectorize everything.
2. Perform standard rules: multiplication rules, chain rules, pulling
   out constants.
3. Isolate d(X^{-1}) and perform the differential-of-inverse rule.
4. Separate particular linear combinations and use trace rules.
5. Use the vec-Kronecker rule or the trace-vec rule.
6. Fiddle with transposes and K_mn, and recombine terms.
Inverse Example 1: φ(X) = tr(AX^{-1})

An example of inverse differentials: φ(X) = tr(AX^{-1}).

d φ(X) = d tr(AX^{-1}) = tr[d(AX^{-1})]

Perform standard rules (e.g., multiplication by a constant):

d φ(X) = tr[d(AX^{-1})] = tr[A d(X^{-1})]

Perform the differential-of-inverse rule:

d φ(X) = tr[A d(X^{-1})]
       = tr[A(-X^{-1} d(X)X^{-1})] = -tr[AX^{-1} d(X)X^{-1}]    by (50)
Use trace identities (e.g., transposing and rotating):

d φ(X) = -tr[AX^{-1} d(X)X^{-1}]
       = -tr[X^{-1}AX^{-1} d(X)]

And finally, perform the trace-vec rule:

d φ(X) = -tr[X^{-1}AX^{-1} d(X)]
       = -vec[(X^{-1}AX^{-1})^T]^T d vec(X)    (51)
Inverse Example 2: M = I_m - X(X^T X)^{-1}X^T

The next example is a strange matrix function (with X of size m × n):

M = I_m - X(X^T X)^{-1}X^T

M is idempotent (so the square of M is M itself):

M^2 = [I_m - X(X^T X)^{-1}X^T][I_m - X(X^T X)^{-1}X^T]
    = I_m^2 - I_m[X(X^T X)^{-1}X^T] - [X(X^T X)^{-1}X^T]I_m
        + [X(X^T X)^{-1}X^T]^2
    = I_m - 2[X(X^T X)^{-1}X^T] + X(X^T X)^{-1}(X^T X)(X^T X)^{-1}X^T
    = I_m - 2X(X^T X)^{-1}X^T + X(X^T X)^{-1}X^T
    = I_m - X(X^T X)^{-1}X^T = M
M = I_m - X(X^T X)^{-1}X^T might look pretty familiar:

H = X(X^T X)^{-1}X^T    (52)

maps y into the column space defined by the predictors, but

M = I_m - X(X^T X)^{-1}X^T    (53)

maps y into the space orthogonal to the predictors (the error space):

Hy = ŷ        My = e
To find the differential, perform standard (e.g., product) rules:

d(M) = d[I_m - X(X^T X)^{-1}X^T]
     = -d[X(X^T X)^{-1}X^T]
     = -[d(X)(X^T X)^{-1}X^T + X d[(X^T X)^{-1}]X^T
         + X(X^T X)^{-1} d(X^T)]                                   by (20)

Concentrate on the inverse differential:

d[(X^T X)^{-1}] = -(X^T X)^{-1} d(X^T X)(X^T X)^{-1}               by (50)
                = -(X^T X)^{-1}[d(X^T)X + X^T d(X)](X^T X)^{-1}    by (20)
Plug the differential back into the equation:

d(M) = -[d(X)(X^T X)^{-1}X^T + X d[(X^T X)^{-1}]X^T + X(X^T X)^{-1} d(X^T)]
     = -[d(X)(X^T X)^{-1}X^T
         - X(X^T X)^{-1}[d(X^T)X + X^T d(X)](X^T X)^{-1}X^T
         + X(X^T X)^{-1} d(X^T)]
     = -[d(X)(X^T X)^{-1}X^T
         - X(X^T X)^{-1} d(X^T)X(X^T X)^{-1}X^T
         - X(X^T X)^{-1}X^T d(X)(X^T X)^{-1}X^T
         + X(X^T X)^{-1} d(X^T)]
     = -[I_m d(X)(X^T X)^{-1}X^T - X(X^T X)^{-1}X^T d(X)(X^T X)^{-1}X^T
         + X(X^T X)^{-1} d(X^T)I_m - X(X^T X)^{-1} d(X^T)X(X^T X)^{-1}X^T]
     = -[M d(X)(X^T X)^{-1}X^T + X(X^T X)^{-1} d(X^T)M]
And finally, vectorizing everything:

d vec(M) = -vec[M d(X)(X^T X)^{-1}X^T + X(X^T X)^{-1} d(X^T)M]
         = -vec[M d(X)(X^T X)^{-1}X^T] - vec[X(X^T X)^{-1} d(X^T)M]

which leads to the vec-Kronecker rule:

d vec(M) = -vec[M d(X)(X^T X)^{-1}X^T] - vec[X(X^T X)^{-1} d(X^T)M]
         = -{[(X^T X)^{-1}X^T]^T ⊗ M} d vec(X)
            - [M^T ⊗ X(X^T X)^{-1}] d vec(X^T)
And then applying the commutation matrix,

d vec(X^T) = K_mn d vec(X)

to the function above:

d vec(M) = -{[(X^T X)^{-1}X^T]^T ⊗ M} d vec(X)
            - [M^T ⊗ X(X^T X)^{-1}] d vec(X^T)
         = -[X(X^T X)^{-1} ⊗ M] d vec(X)
            - [M ⊗ X(X^T X)^{-1}]K_mn d vec(X)                by (6)
         = -I_{m^2}[X(X^T X)^{-1} ⊗ M] d vec(X)
            - K_mm [X(X^T X)^{-1} ⊗ M] d vec(X)               by (7)
         = -(I_{m^2} + K_mm)[X(X^T X)^{-1} ⊗ M] d vec(X)      (54)
Inverse Example 3: F(X) = AX^{-1}A^T

A third example: F(X) = AX^{-1}A^T (X is symmetric).
If X is symmetric, then X^{-1} is symmetric and d(X) is symmetric.

Find the differential with respect to d vec(X), but use the duplication
matrix to limit the number of freely varying terms to those on and below
the diagonal.
To differentiate a symmetric matrix:

1. Take the full d vec[F(X)] differential.
2. Simplify as in every other differential.
3. After d vec(X) is isolated, use the duplication matrix inside the
   differential operator to restrict vec(X) = D_n vech(X).
4. Pull D_n outside of the differential operator because it is constant
   with respect to X.
First, taking differentials:

d[F(X)] = d[AX^{-1}A^T]
        = A d(X^{-1})A^T                 by (17)
        = A[-X^{-1} d(X)X^{-1}]A^T       by (50)
        = -AX^{-1} d(X)X^{-1}A^T

Then vectorizing the differential:

vec[d(F)] = vec[-AX^{-1} d(X)X^{-1}A^T]
          = -[(X^{-1}A^T)^T ⊗ AX^{-1}] d vec(X)    by (4)
Continuing:

vec[d(F)] = -[(X^{-1}A^T)^T ⊗ AX^{-1}] d vec(X)
          = -[AX^{-1} ⊗ AX^{-1}] d vec(X)       by the symmetry of X^{-1}

We can finally impose the duplication identity:

vec[d(F)] = -[AX^{-1} ⊗ AX^{-1}] d[vec(X)]         (55)
          = -[AX^{-1} ⊗ AX^{-1}] d[D_n vech(X)]    (56)
          = -[AX^{-1} ⊗ AX^{-1}]D_n d vech(X)      (57)
Inverse Example 4: φ = ι^T X^{-1} ι

Assume ι is a vector of 1s. Then

φ = ι^T X^{-1} ι

is the sum of all of the elements of X^{-1}.

Now, if X is symmetric, then

d vec(φ) = vec[d(ι^T X^{-1} ι)]
         = vec[ι^T d(X^{-1}) ι]                       by (17)
         = vec[-ι^T X^{-1} d(X)X^{-1} ι]              by (50)
         = -[(X^{-1}ι)^T ⊗ ι^T X^{-1}] d vec(X)       by (4)
         = -[ι^T X^{-1} ⊗ ι^T X^{-1}]D_n d vech(X)    (58)
The Exponential and Logarithm
The Exponential: The Special Case
The Maclaurin series is just the Taylor series with the constant set to
0:

φ(x) = Σ_{k=0}^{∞} [φ^(k)(0)/k!] x^k

The definition of the exponential function:

(I) The derivative of e^x equals e^x.
(II) φ^(k)(0) = 1 for all k.

... which leads to the Maclaurin representation:

e^x = Σ_{k=0}^{∞} x^k/k! = 1 + x + x^2/2 + x^3/6 + x^4/24 + ...

We could also replace x with any function of x:

e^{f(x)} = Σ_{k=0}^{∞} f(x)^k/k!
To find the differential of an exponential function, expand the function
as a Maclaurin series and differentiate.

An example function for the exponential: xA.

d(e^{xA}) = d[ Σ_{k=0}^{∞} (xA)^k/k! ] = d[ Σ_{k=0}^{∞} x^k A^k/k! ]
          = Σ_{k=0}^{∞} d(x^k A^k)/k!                 by (18)
          = Σ_{k=0}^{∞} [d(x^k)/k!] A^k               (A is constant)
          = Σ_{k=0}^{∞} [k x^{k-1} d x] A^k/k!
          = Σ_{k=0}^{∞} [x^{k-1} A^k/(k - 1)!] d x
Factorials do not exist for negative numbers, so change the boundaries
(the k = 0 term vanishes):

d(e^{xA}) = Σ_{k=1}^{∞} [x^{k-1} A^k/(k - 1)!] d x
          = A Σ_{k=1}^{∞} [x^{k-1} A^{k-1}/(k - 1)!] d x

And set m = k - 1:

d(e^{xA}) = A Σ_{m=0}^{∞} [x^m A^m/m!] d x

Noticing that the summation is equal to the original exponential:

d(e^{xA}) = A Σ_{m=0}^{∞} [(xA)^m/m!] d x = A e^{xA} d x    (59)
The Exponential: The General Case
By analogy, define a matrix exponential:

exp(X) = Σ_{k=0}^{∞} (1/k!) X^k    (60)

To take the derivative of a matrix exponential, follow similar steps:

d F(X) = d[exp(X)]
       = d[ Σ_{k=0}^{∞} (1/k!) X^k ]
       = Σ_{k=0}^{∞} (1/k!) d(X^k)                      by (18)
       = Σ_{k=0}^{∞} (1/k!) [(d X)X^{k-1} + X(d X)X^{k-2}
           + ... + X^{k-1}(d X)]                        by (20)
Continuing:

d F(X) = Σ_{k=0}^{∞} (1/k!) [(d X)X^{k-1} + X(d X)X^{k-2} + ... + X^{k-1}(d X)]
       = Σ_{k=0}^{∞} (1/k!) Σ_{j=0}^{k-1} X^j (d X) X^{k-j-1}
       = Σ_{k=1}^{∞} (1/k!) Σ_{j=0}^{k-1} X^j (d X) X^{k-j-1}

Note that the bounds change because 0 ≤ j ≤ k - 1, but if k = 0, then
0 ≤ j ≤ -1, which is a contradiction.
Setting m = k - 1, so k = m + 1:

d F(X) = Σ_{k=1}^{∞} (1/k!) Σ_{j=0}^{k-1} X^j (d X) X^{k-j-1}
       = Σ_{m=0}^{∞} [1/(m + 1)!] Σ_{j=0}^{m} X^j (d X) X^{m-j}

Because everything up to the X^j (d X) X^{m-j} term is a scalar, we have

tr d[exp(X)] = tr{ Σ_{m=0}^{∞} [1/(m + 1)!] Σ_{j=0}^{m} X^j (d X) X^{m-j} }
             = Σ_{m=0}^{∞} [1/(m + 1)!] Σ_{j=0}^{m} tr[X^j (d X) X^{m-j}]
Finishing:

tr d[exp(X)] = Σ_{m=0}^{∞} [1/(m + 1)!] Σ_{j=0}^{m} tr[X^j (d X) X^{m-j}]
             = Σ_{m=0}^{∞} [1/(m + 1)!] (m + 1) tr[X^m (d X)]
             = tr{ Σ_{m=0}^{∞} (1/m!) X^m (d X) }
             = tr[exp(X) d(X)] = vec[exp(X)^T]^T d vec(X)    (61)

Note that we must take traces to obtain a sensible result.
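A first-order numeric check of (61) (my addition). This assumes the expm
package for the matrix exponential; any such routine would do:

library(expm)
set.seed(5)
X  <- matrix(rnorm(9) / 3, 3, 3)
dX <- matrix(rnorm(9), 3, 3) * 1e-6
lhs <- sum(diag(expm(X + dX) - expm(X)))    # tr d[exp(X)]
rhs <- sum(diag(expm(X) %*% dX))            # tr[exp(X) dX]
all.equal(lhs, rhs, tolerance = 1e-4)       # agree to first order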
The Logarithm: The Special Case
The Mercator series is the Maclaurin series expansion for ln(1 + x):

ln(1 + x) = Σ_{k=0}^{∞} [φ^(k)(0)/k!] x^k
          = φ(0) + φ'(0)x + φ''(0)x^2/2! + φ'''(0)x^3/3!
              + φ''''(0)x^4/4! + ...
          = ln(1 + 0) + x/(1 + 0) + (-1)x^2/[2!(1 + 0)^2]
              + (-2)(-1)x^3/[3!(1 + 0)^3] + ...
          = 0 + x - x^2/2 + x^3/3 - x^4/4 + ...
          = Σ_{k=1}^{∞} (-1)^{k+1} x^k/k

Replacing x with -x, all terms in the sum become negative:

ln(1 - x) = -x - x^2/2 - x^3/3 - x^4/4 - ... = -Σ_{k=1}^{∞} x^k/k
To find the differential of a logarithmic function, expand as a Mercator
series and differentiate.

An example function for the logarithm: xA.

d[ln(I_n - xA)] = d[ -Σ_{k=1}^{∞} (xA)^k/k ]
                = -Σ_{k=1}^{∞} [d(x^k)/k] A^k         by (18)
                = -Σ_{k=1}^{∞} [k x^{k-1} d x] A^k/k
                = -A Σ_{k=1}^{∞} [x^{k-1} A^{k-1}] d x
                = -A Σ_{m=0}^{∞} [xA]^m d x

The last line uses the changing-indices trick.
And if |x| < 1, then, due to the geometric series,

Σ_{k=0}^{∞} x^k = 1/(1 - x)

By analogy, if x and A satisfy a similar constraint, then

d[ln(I_n - xA)] = -A Σ_{m=0}^{∞} [xA]^m d x
                = -A(I_n - xA)^{-1} d x    (62)
The Logarithm: The General Case
For the multivariate case, define:

ln(I_n - X) = -Σ_{k=1}^{∞} (1/k) X^k

To take the differential of ln(I_n - X), notice that we ultimately use
the same expansion as for the exponential differential:

d F(X) = d[ln(I_n - X)] = d[ -Σ_{k=1}^{∞} (1/k) X^k ]
       = -Σ_{k=1}^{∞} (1/k) d(X^k)                              by (18)
       = -Σ_{k=1}^{∞} (1/k) Σ_{j=0}^{k-1} X^j (d X) X^{k-j-1}   by (20)
Setting m = k - 1, so k = m + 1:

d F(X) = -Σ_{k=1}^{∞} (1/k) Σ_{j=0}^{k-1} X^j (d X) X^{k-j-1}
       = -Σ_{m=0}^{∞} [1/(m + 1)] Σ_{j=0}^{m} X^j (d X) X^{m-j}

Because everything up to the X^j (d X) X^{m-j} term is a scalar, we have

tr d[ln(I_n - X)] = -tr{ Σ_{m=0}^{∞} [1/(m + 1)] Σ_{j=0}^{m} X^j (d X) X^{m-j} }
                  = -Σ_{m=0}^{∞} [1/(m + 1)] Σ_{j=0}^{m} tr[X^j (d X) X^{m-j}]
Finishing:

tr d[ln(I_n - X)] = -Σ_{m=0}^{∞} [1/(m + 1)] Σ_{j=0}^{m} tr[X^j (d X) X^{m-j}]
                  = -Σ_{m=0}^{∞} [1/(m + 1)] (m + 1) tr[X^m (d X)]
                  = -tr{ Σ_{m=0}^{∞} X^m (d X) }
                  = -tr[(I_n - X)^{-1} d X]
                  = -vec{[(I_n - X)^{-1}]^T}^T d vec(X)    (63)

Differentials of both exponentials and logarithms of multivariate
arguments behave similarly to the univariate case, but inside of trace
operators.
References

Abadir, K. M., & Magnus, J. R. (2005). Matrix algebra. Cambridge, UK:
Cambridge University Press.

Magnus, J. R., & Neudecker, H. (1986). Symmetry, 0-1 matrices and
Jacobians: A review. Econometric Theory, 2, 157-190.

Magnus, J. R., & Neudecker, H. (1999). Matrix differential calculus with
applications in statistics and econometrics. New York, NY: John Wiley &
Sons.