November 27, 2009
The Kalman Filter Explained
Tristan Fletcher
www.cs.ucl.ac.uk/staff/T.Fletcher/
1 Introduction
The aim of this document is to derive the filtering equations for the simplest Linear Dynamical System case, the Kalman Filter, outline the filter's implementation, do a similar thing for the smoothing equations and conclude with parameter learning in an LDS (calibrating the Kalman Filter).
The document is based closely on Bishop's [1] and Ghahramani's [2] texts,
but is more suitable for those who wish to understand every aspect of the
mathematics required and see how it all comes together in a procedural
sense.
2 Model Specification
The simplest form of Linear Dynamical System (LDS) models a discrete time process where a latent variable h is updated every time step by a constant linear state transition A with the addition of zero-mean Gaussian noise η^h:

h_t = A h_{t-1} + η^h_t  where  η^h_t ∼ N(0, Σ_H)
⇒ p(h_t | h_{t-1}) ∼ N(A h_{t-1}, Σ_H)   (2.1)

This latent variable is observed through a constant linear function of the latent state B, also subject to zero-mean Gaussian noise η^v:

v_t = B h_t + η^v_t  where  η^v_t ∼ N(0, Σ_V)
⇒ p(v_t | h_t) ∼ N(B h_t, Σ_V)   (2.2)
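As a concrete illustration, the generative process defined by (2.1) and (2.2) can be simulated directly. The dimensions and parameter values below are arbitrary choices for the sketch, not part of the model specification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example parameters (2-d latent state, 1-d observation).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # state transition A
B = np.array([[1.0, 0.0]])          # observation matrix B
Sigma_H = 0.01 * np.eye(2)          # latent (process) noise covariance
Sigma_V = np.array([[0.5]])         # observation noise covariance

T = 100
h = np.zeros((T, 2))
v = np.zeros((T, 1))
h[0] = rng.multivariate_normal(np.zeros(2), np.eye(2))   # draw h_1
for t in range(T):
    if t > 0:
        # h_t = A h_{t-1} + eta^h_t,  eta^h_t ~ N(0, Sigma_H)   (2.1)
        h[t] = A @ h[t - 1] + rng.multivariate_normal(np.zeros(2), Sigma_H)
    # v_t = B h_t + eta^v_t,  eta^v_t ~ N(0, Sigma_V)           (2.2)
    v[t] = B @ h[t] + rng.multivariate_normal(np.zeros(1), Sigma_V)
```

The arrays h and v then hold one sampled trajectory of the latent and observed sequences used throughout the rest of the document.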
We wish to infer the probability distribution of h_t given the observations up to that point in time, v_{1:t}, i.e. p(h_t | v_{1:t}), which can be expressed recursively. Starting with an initial distribution for our latent variable given the first observation:
p(h_1 | v_1) ∝ p(v_1 | h_1) p(h_1)

and the assumption that h_1 has a Gaussian distribution:

p(h_1) ∼ N(µ_0, σ_0²)
values for p(h_t | v_{1:t}) for subsequent values of t can be found by iteration:

p(h_t | v_{1:t}) = p(h_t | v_t, v_{1:t-1})
= p(h_t, v_t | v_{1:t-1}) / p(v_t | v_{1:t-1})
∝ p(h_t, v_t | v_{1:t-1})
= ∫_{h_{t-1}} p(h_t, v_t | v_{1:t-1}, h_{t-1}) p(h_{t-1} | v_{1:t-1})
= ∫_{h_{t-1}} p(v_t | v_{1:t-1}, h_{t-1}, h_t) p(h_t | h_{t-1}, v_{1:t-1}) p(h_{t-1} | v_{1:t-1})

h_t ⊥ v_{1:t-1} | h_{t-1}  ⇒  = ∫_{h_{t-1}} p(v_t | v_{1:t-1}, h_{t-1}, h_t) p(h_t | h_{t-1}) p(h_{t-1} | v_{1:t-1})

v_t ⊥ h_{t-1} | v_{1:t-1}, h_t  ⇒  = ∫_{h_{t-1}} p(v_t | v_{1:t-1}, h_t) p(h_t | h_{t-1}) p(h_{t-1} | v_{1:t-1})

= ∫_{h_{t-1}} p(v_t | h_t) p(h_t | h_{t-1}) p(h_{t-1} | v_{1:t-1})   (2.3)
The distributions described in (2.1) and (2.2) are both Gaussian, and the operations in (2.3) of multiplication and then integration yield Gaussians when performed on Gaussians, so we know that p(h_t | v_{1:t}) will itself be a Gaussian:

p(h_t | v_{1:t}) ∼ N(µ_t, σ_t²)   (2.4)

and the task is to derive the expressions for µ_t and σ_t².
3 Deriving the state estimate variance
If we define ∆h ≡ h − E[ĥ], where h denotes the actual value of the latent variable, ĥ is its estimated value, E[ĥ] is the expected value of this estimate and F is the covariance of the error estimate, then:

F_{t|t-1} = E[∆h_t ∆h_t^T]
= E[(A ∆h_{t-1} + ∆η^h_t)(A ∆h_{t-1} + ∆η^h_t)^T]
= E[A ∆h_{t-1} ∆h_{t-1}^T A^T + ∆η^h_t ∆η^{hT}_t + ∆η^h_t ∆h_{t-1}^T A^T + A ∆h_{t-1} ∆η^{hT}_t]
= A E[∆h_{t-1} ∆h_{t-1}^T] A^T + E[∆η^h_t ∆η^{hT}_t]
= A F_{t-1|t-1} A^T + Σ_H   (3.1)
The subscript in F_{t|t-1} denotes the fact that this is F's value before an observation is made at time t (i.e. its a priori value), while F_{t|t} denotes its value after an observation is made (its posterior value). This more informative notation allows the update equation in (2.1) to be expressed as follows:

ĥ_{t|t-1} = A ĥ_{t-1|t-1}   (3.2)
Once we have an observation (and are therefore dealing with posterior values), we can define ε_t as the difference between the observation we'd expect to see given our estimate of the latent state (its a priori value) and the one actually observed, i.e.:

ε_t = v_t − B ĥ_{t|t-1}   (3.3)

Now that we have an observation, if we wish to add a correction to our a priori estimate that is proportional to the error ε_t, we can use a coefficient κ:

ĥ_{t|t} = ĥ_{t|t-1} + κ ε_t   (3.4)
This allows us to express F_{t|t} recursively:

F_{t|t} = Cov(h_t − ĥ_{t|t})
= Cov(h_t − (ĥ_{t|t-1} + κ ε_t))
= Cov(h_t − (ĥ_{t|t-1} + κ(v_t − B ĥ_{t|t-1})))
= Cov(h_t − (ĥ_{t|t-1} + κ(B h_t + η^v_t − B ĥ_{t|t-1})))
= Cov(h_t − ĥ_{t|t-1} − κ B h_t − κ η^v_t + κ B ĥ_{t|t-1})
= Cov((I − κB)(h_t − ĥ_{t|t-1}) − κ η^v_t)
= Cov((I − κB)(h_t − ĥ_{t|t-1})) + Cov(κ η^v_t)
= (I − κB) Cov(h_t − ĥ_{t|t-1}) (I − κB)^T + κ Cov(η^v_t) κ^T
= (I − κB) F_{t|t-1} (I − κB)^T + κ Σ_V κ^T
= (F_{t|t-1} − κ B F_{t|t-1})(I − κB)^T + κ Σ_V κ^T
= F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} (κB)^T + κ B F_{t|t-1} (κB)^T + κ Σ_V κ^T
= F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} B^T κ^T + κ(B F_{t|t-1} B^T + Σ_V) κ^T   (3.5)

If we define the innovation variance as S_t = B F_{t|t-1} B^T + Σ_V, then (3.5) becomes:

F_{t|t} = F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} B^T κ^T + κ S_t κ^T   (3.6)
4 Minimizing the state estimate variance
If we wish to minimize the variance of F_{t|t}, we can use the mean square error (MSE) measure:

E[‖h_t − ĥ_{t|t}‖²] = Tr(Cov(h_t − ĥ_{t|t})) = Tr(F_{t|t})   (4.1)

The only coefficient we have control over is κ, so we wish to find the κ that gives us the minimum MSE, i.e. we need to find κ such that:

∂Tr(F_{t|t}) / ∂κ = 0

(3.6) ⇒ ∂Tr(F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} B^T κ^T + κ S_t κ^T) / ∂κ = 0
⇒ −2(B F_{t|t-1})^T + 2κ S_t = 0
⇒ κ = F_{t|t-1} B^T S_t^{-1}   (4.2)
This optimum value for κ in terms of minimizing the MSE is known as the Kalman Gain and will be denoted K.
If we multiply both sides of (4.2) by S_t K^T:

K S_t K^T = F_{t|t-1} B^T K^T   (4.3)

Substituting this into (3.6):

F_{t|t} = F_{t|t-1} − K B F_{t|t-1} − F_{t|t-1} B^T K^T + F_{t|t-1} B^T K^T
= (I − KB) F_{t|t-1}   (4.4)
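Equation (4.4) can be checked numerically: with the Kalman Gain from (4.2) substituted for κ, the general expression (3.5) collapses to (I − KB) F_{t|t-1}. The matrices below are arbitrary, generated only for this check:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary symmetric positive definite F_{t|t-1} and observation model.
M = rng.standard_normal((3, 3))
F_prior = M @ M.T + 3 * np.eye(3)   # F_{t|t-1}
B = rng.standard_normal((2, 3))
Sigma_V = np.eye(2)

S = B @ F_prior @ B.T + Sigma_V          # innovation variance S_t
K = F_prior @ B.T @ np.linalg.inv(S)     # Kalman Gain (4.2)

I = np.eye(3)
# General covariance update (3.5), valid for any gain:
joseph = (I - K @ B) @ F_prior @ (I - K @ B).T + K @ Sigma_V @ K.T
# Simplified update (4.4), valid only at the optimal gain:
simple = (I - K @ B) @ F_prior

assert np.allclose(joseph, simple)
```

The identity holds because K S_t K^T = F_{t|t-1} B^T K^T at the optimal gain, exactly as used in the substitution above; for any other gain only the longer form (3.5) is correct.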
5 Filtered Latent State Estimation Procedure (The Kalman Filter)
The procedure for estimating the latent state h_t, which when using the MSE-optimal value for κ is called Kalman Filtering, proceeds as follows:
1. Choose initial values for ĥ and F (i.e. ĥ_{0|0} and F_{0|0}).
2. Advance latent state estimate:
   ĥ_{t|t-1} = A ĥ_{t-1|t-1}
3. Advance estimate covariance:
   F_{t|t-1} = A F_{t-1|t-1} A^T + Σ_H
4. Make an observation v_t.
5. Calculate the innovation:
   ε_t = v_t − B ĥ_{t|t-1}
6. Calculate S_t:
   S_t = B F_{t|t-1} B^T + Σ_V
7. Calculate K:
   K = F_{t|t-1} B^T S_t^{-1}
8. Update latent state estimate:
   ĥ_{t|t} = ĥ_{t|t-1} + K ε_t
9. Update estimate covariance (from (4.4)):
   F_{t|t} = (I − KB) F_{t|t-1}
10. Cycle through stages 2 to 9 for each time step.

Note that ĥ_{t|t} and F_{t|t} correspond to µ_t and σ_t² from (2.4).
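The ten stages above translate almost line for line into NumPy. The function below is a sketch under the assumption that observations arrive as rows of an array v; the variable names are illustrative, not from the source:

```python
import numpy as np

def kalman_filter(v, A, B, Sigma_H, Sigma_V, h0, F0):
    """Run stages 2-9 of section 5 over observations v[0..T-1].

    Returns the filtered means h_hat[t] (= h_{t|t}) and covariances F[t] (= F_{t|t}).
    """
    T = len(v)
    d = h0.shape[0]
    h_hat = np.zeros((T, d))
    F = np.zeros((T, d, d))
    h_prev, F_prev = h0, F0
    for t in range(T):
        # 2. Advance latent state estimate:  h_{t|t-1} = A h_{t-1|t-1}
        h_pred = A @ h_prev
        # 3. Advance estimate covariance:    F_{t|t-1} = A F_{t-1|t-1} A^T + Sigma_H
        F_pred = A @ F_prev @ A.T + Sigma_H
        # 4./5. Observe v_t and calculate the innovation eps_t = v_t - B h_{t|t-1}
        eps = v[t] - B @ h_pred
        # 6. Innovation variance:            S_t = B F_{t|t-1} B^T + Sigma_V
        S = B @ F_pred @ B.T + Sigma_V
        # 7. Kalman Gain:                    K = F_{t|t-1} B^T S_t^{-1}
        K = F_pred @ B.T @ np.linalg.inv(S)
        # 8. Update latent state estimate:   h_{t|t} = h_{t|t-1} + K eps_t
        h_hat[t] = h_pred + K @ eps
        # 9. Update estimate covariance:     F_{t|t} = (I - K B) F_{t|t-1}
        F[t] = (np.eye(d) - K @ B) @ F_pred
        h_prev, F_prev = h_hat[t], F[t]
    return h_hat, F
```

Stage 10 is the for-loop itself; the returned arrays are the µ_t and σ_t² of (2.4) at every time step.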
6 Smoothed Latent State Estimation
The smoothed probability of the latent variable is the probability it had a
value at time t after a sequence of T observations, i.e. p(ht |v1:T ). Unlike the
Kalman Filter which you can update with each observation, one has to wait
until T observations have been made and then retrospectively calculate the
probability the latent variable had a value at time t where t < T .
Commencing at the final time step in the sequence (t = T) and working backwards to the start (t = 1), p(h_t | v_{1:T}) can be evaluated as follows:

p(h_t | v_{1:T}) = ∫_{h_{t+1}} p(h_t | h_{t+1}, v_{1:T}) p(h_{t+1} | v_{1:T})

h_t ⊥ v_{t+1:T} | h_{t+1}  ⇒  p(h_t | h_{t+1}, v_{1:T}) = p(h_t | h_{t+1}, v_{1:t})

∴ p(h_t | v_{1:T}) = ∫_{h_{t+1}} p(h_t | h_{t+1}, v_{1:t}) p(h_{t+1} | v_{1:T})
= ∫_{h_{t+1}} p(h_{t+1}, h_t | v_{1:t}) p(h_{t+1} | v_{1:T}) / p(h_{t+1} | v_{1:t})
∝ ∫_{h_{t+1}} p(h_{t+1}, h_t | v_{1:t}) p(h_{t+1} | v_{1:T})
= ∫_{h_{t+1}} p(h_{t+1} | h_t, v_{1:t}) p(h_t | v_{1:t}) p(h_{t+1} | v_{1:T})

h_{t+1} ⊥ v_{1:t} | h_t  ⇒  = ∫_{h_{t+1}} p(h_{t+1} | h_t) p(h_t | v_{1:t}) p(h_{t+1} | v_{1:T})   (6.1)
As before, we know that p(h_t | v_{1:T}) will be a Gaussian and we will need to establish its mean and variance at each t, i.e. in a similar manner to (2.4):

p(h_t | v_{1:T}) ∼ N(h^s_t, F^s_t)   (6.2)

Using the filtered values ĥ_{t|t} and F_{t|t} calculated in the previous section for each time step, the procedure for estimating the smoothed parameters h^s_t and F^s_t works backwards from the last time step in the sequence, i.e. from t = T, as follows:
1. Set h^s_T and F^s_T to ĥ_{T|T} and F_{T|T} from steps 8 and 9 in section 5.
2. Calculate A^s_t:
   A^s_t = (A F_{t|t})^T (A F_{t|t} A^T + Σ_H)^{-1}
3. Calculate S^s_t:
   S^s_t = F_{t|t} − A^s_t A F_{t|t}
4. Calculate the smoothed latent variable estimate h^s_t:
   h^s_t = A^s_t h^s_{t+1} + ĥ_{t|t} − A^s_t A ĥ_{t|t}
5. Calculate the smoothed estimate covariance F^s_t:
   F^s_t = ½[(A^s_t F^s_{t+1} A^{sT}_t + S^s_t) + (A^s_t F^s_{t+1} A^{sT}_t + S^s_t)^T]
6. Calculate the smoothed cross-variance X^s_t:
   X^s_t = A^s_t F^s_t + h^s_t ĥ^T_{t|t}
7. Cycle through stages 2 to 6 for each time step backwards through the sequence from t = T − 1 to t = 1.
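Assuming the filtered means and covariances from section 5 are available as arrays, the steps above can be sketched as the following backward pass (implementing the recursion as written, including the symmetrisation in step 5 and the cross-variance of step 6; variable names are illustrative):

```python
import numpy as np

def kalman_smoother(h_filt, F_filt, A, Sigma_H):
    """Backward recursion of section 6 over filtered statistics.

    h_filt[t], F_filt[t] are the filtered h_{t|t}, F_{t|t} from section 5.
    Returns smoothed means h_s, covariances F_s and cross-variances X_s.
    """
    T, d = h_filt.shape
    h_s = np.zeros((T, d))
    F_s = np.zeros((T, d, d))
    X_s = np.zeros((T, d, d))
    # 1. Initialise at t = T with the filtered values.
    h_s[-1] = h_filt[-1]
    F_s[-1] = F_filt[-1]
    for t in range(T - 2, -1, -1):
        # 2. A^s_t = (A F_{t|t})^T (A F_{t|t} A^T + Sigma_H)^{-1}
        A_s = (A @ F_filt[t]).T @ np.linalg.inv(A @ F_filt[t] @ A.T + Sigma_H)
        # 3. S^s_t = F_{t|t} - A^s_t A F_{t|t}
        S_s = F_filt[t] - A_s @ A @ F_filt[t]
        # 4. h^s_t = A^s_t h^s_{t+1} + h_{t|t} - A^s_t A h_{t|t}
        h_s[t] = A_s @ h_s[t + 1] + h_filt[t] - A_s @ A @ h_filt[t]
        # 5. F^s_t = symmetrised (A^s_t F^s_{t+1} A^sT_t + S^s_t)
        M = A_s @ F_s[t + 1] @ A_s.T + S_s
        F_s[t] = 0.5 * (M + M.T)
        # 6. X^s_t = A^s_t F^s_t + h^s_t h_{t|t}^T
        X_s[t] = A_s @ F_s[t] + np.outer(h_s[t], h_filt[t])
    return h_s, F_s, X_s
```

A full smoothing run is the filter of section 5 followed by this function, the forward pass supplying h_filt and F_filt.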
7 Expectation Maximization (Calibrating the Kalman Filter)
The procedures outlined in the previous sections are fine if we assume that we know the values in the parameter set θ = {µ_0, σ_0², A, Σ_H, B, Σ_V}, but in order to learn these values we will need to perform the Expectation Maximization (EM) algorithm.
The joint probability of T time steps of the latent and observable variables is:

p(h_{1:T}, v_{1:T}) = p(h_1) ∏_{t=2}^{T} p(h_t | h_{t-1}) ∏_{t=1}^{T} p(v_t | h_t)   (7.1)

Making the dependence on the parameters explicit, the likelihood of the model given the parameter set θ is:

p(h_{1:T}, v_{1:T} | θ) = p(h_1 | µ_0, σ_0²) ∏_{t=2}^{T} p(h_t | h_{t-1}, A, Σ_H) ∏_{t=1}^{T} p(v_t | h_t, B, Σ_V)   (7.2)

Taking logs gives us the model's log likelihood:

ln p(h_{1:T}, v_{1:T} | θ) = ln p(h_1 | µ_0, σ_0²) + ∑_{t=2}^{T} ln p(h_t | h_{t-1}, A, Σ_H) + ∑_{t=1}^{T} ln p(v_t | h_t, B, Σ_V)   (7.3)
We will deal with each of the three components of (7.3) in turn. Using V to represent the whole set of observations v_{1:T}, H for h_{1:T}, θ_old to represent our parameter values before an iteration of the EM loop, the superscript n to represent the value of a parameter after an iteration of the loop, c to represent terms that are not dependent on µ_0 or σ_0², λ to represent (σ_0²)^{-1} and Q = E_{H|θ_old}[ln p(H, V | θ)], we will first find the expected value of ln p(h_1 | µ_0, σ_0²):

Q = −½ ln σ_0² − ½ E_{H|θ_old}[(h_1 − µ_0)^T λ (h_1 − µ_0)] + c
= −½ ln σ_0² − ½ E_{H|θ_old}[h_1^T λ h_1 − h_1^T λ µ_0 − µ_0^T λ h_1 + µ_0^T λ µ_0] + c
= ½ ln|λ| − ½ Tr(λ E_{H|θ_old}[h_1 h_1^T − h_1 µ_0^T − µ_0 h_1^T + µ_0 µ_0^T]) + c   (7.4)
In order to find the µ_0 which maximizes the expected log likelihood described in (7.4), we will differentiate it with respect to µ_0 and set the differential to zero:

∂Q/∂µ_0 = 2λµ_0 − 2λE[h_1] = 0
⇒ µ_0^n = E[h_1]   (7.5)

Proceeding in a similar manner to establish the maximal λ:

∂Q/∂λ = ½(σ_0² − (E[h_1 h_1^T] − E[h_1] µ_0^T − µ_0 E[h_1^T] + µ_0 µ_0^T)) = 0
⇒ σ_0²ⁿ = E[h_1 h_1^T] − E[h_1] E[h_1^T]   (7.6)
In order to optimize for A and Σ_H we substitute for p(h_t | h_{t-1}, A, Σ_H) in (7.3), giving:

Q = −(T−1)/2 ln|Σ_H| − ½ E_{H|θ_old}[∑_{t=2}^{T} (h_t − A h_{t-1})^T Σ_H^{-1} (h_t − A h_{t-1})] + c   (7.7)

Maximizing with respect to these parameters then gives:

A^n = (∑_{t=2}^{T} E[h_t h_{t-1}^T]) (∑_{t=2}^{T} E[h_{t-1} h_{t-1}^T])^{-1}   (7.8)

Σ_H^n = 1/(T−1) ∑_{t=2}^{T} { E[h_t h_t^T] − A^n E[h_{t-1} h_t^T] − E[h_t h_{t-1}^T] A^{nT} + A^n E[h_{t-1} h_{t-1}^T] A^{nT} }   (7.9)
In order to determine values for B and Σ_V we substitute for p(v_t | h_t, B, Σ_V) in (7.3) to give:

Q = −T/2 ln|Σ_V| − ½ E_{H|θ_old}[∑_{t=1}^{T} (v_t − B h_t)^T Σ_V^{-1} (v_t − B h_t)] + c   (7.10)

Maximizing this with respect to B and Σ_V gives:

B^n = (∑_{t=1}^{T} v_t E[h_t^T]) (∑_{t=1}^{T} E[h_t h_t^T])^{-1}   (7.11)

Σ_V^n = 1/T ∑_{t=1}^{T} { v_t v_t^T − B^n E[h_t] v_t^T − v_t E[h_t^T] B^{nT} + B^n E[h_t h_t^T] B^{nT} }   (7.12)
Using the values calculated from the smoothing procedure in section 6:

E[h_t] = h^s_t
E[h_t h_t^T] = F^s_t
E[h_t h_{t-1}^T] = X^s_t

we can now set out the procedure for parameter learning using Expectation Maximization:
1. Choose starting values for the parameters θ = {µ_0, σ_0², A, Σ_H, B, Σ_V}.
2. Using the parameter set θ, calculate the filtered statistics ĥ_{t|t} and F_{t|t} for each time step as described in section 5.
3. Using the parameter set θ, calculate the smoothed statistics h^s_t, F^s_t and X^s_t for each time step as described in section 6.
4. Update A:
   A^n = [∑_{t=1}^{T} h^s_t h^{sT}_{t-1} + X^s_t][∑_{t=1}^{T} h^s_t h^{sT}_t + F^s_t]^{-1}
5. Update Σ_H:
   Σ_H^n = [T − 1]^{-1} (γ − h^s_1 h^{sT}_1 − F^s_1 − A^n (∑_{t=1}^{T} h^s_t h^{sT}_{t-1} + X^s_t)^T)
6. Update B:
   B^n = [∑_{t=1}^{T} v_t h^{sT}_t][∑_{t=1}^{T} h^s_t h^{sT}_t + F^s_t]^{-1}
7. Update Σ_V:
   Σ_V^n = (∑_{t=1}^{T} v_t v_t^T − B^n (∑_{t=1}^{T} v_t h^{sT}_t)^T) T^{-1}
8. Update µ_0:
   µ_0^n = h^s_1
9. Update σ_0²:
   σ_0²ⁿ = F^s_1 + h^s_1 h^{sT}_1 − µ_0^n µ_0^{nT}
10. Iterate steps 2 to 9 a given number of times or until the difference between parameter values from succeeding iterations is below a predefined threshold.
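As a partial sketch of the M-step, updates 6 to 9 translate directly from the smoothed statistics; the A and Σ_H updates (steps 4 and 5) follow the same pattern but additionally need the cross-variances X^s_t and the quantity γ referred to in step 5, so they are omitted here. The function name and array conventions are illustrative assumptions:

```python
import numpy as np

def m_step_observation(v, h_s, F_s):
    """Updates 6-9 of the EM procedure from smoothed statistics.

    v: observations, shape (T, p); h_s: smoothed means, shape (T, d);
    F_s: smoothed covariances, shape (T, d, d).
    """
    T = len(v)
    # 6. B^n = [sum_t v_t h^sT_t] [sum_t h^s_t h^sT_t + F^s_t]^{-1}
    sum_vh = np.einsum('ti,tj->ij', v, h_s)                   # sum_t v_t h^sT_t
    sum_hh = np.einsum('ti,tj->ij', h_s, h_s) + F_s.sum(axis=0)
    B_n = sum_vh @ np.linalg.inv(sum_hh)
    # 7. Sigma_V^n = (sum_t v_t v_t^T - B^n (sum_t v_t h^sT_t)^T) / T
    Sigma_V_n = (np.einsum('ti,tj->ij', v, v) - B_n @ sum_vh.T) / T
    # 8. mu_0^n = h^s_1
    mu0_n = h_s[0]
    # 9. sigma_0^2n = F^s_1 + h^s_1 h^sT_1 - mu_0^n mu_0^nT
    sigma0_n = F_s[0] + np.outer(h_s[0], h_s[0]) - np.outer(mu0_n, mu0_n)
    return B_n, Sigma_V_n, mu0_n, sigma0_n
```

In a full EM loop this would sit inside step 10's iteration, alternating with the filter (section 5) and smoother (section 6) passes that supply h_s and F_s.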
References
[1] C. M. Bishop, Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer, 2006.
[2] Z. Ghahramani and G. Hinton, “Parameter estimation for linear dynam-
ical systems,” Tech. Rep., 1996.