
Statistical modelling

Rajen D. Shah
r.shah@statslab.cam.ac.uk

Course webpage: http://www.statslab.cam.ac.uk/~rds37/statistical_modelling.html

Introduction
This course is largely about analysing data composed of observations that come in the form of
pairs
\[ (y_1, x_1), \ldots, (y_n, x_n). \tag{0.0.1} \]
Our aim will be to infer an unknown regression function relating the values yi , to the xi , which
may be p-dimensional vectors xi = (xi1 , . . . , xip )T . The yi are often called the response, target
or dependent variable; the xi are known as predictors, covariates, independent variables or
explanatory variables. Below are some examples of possible responses and covariates.

Response             Covariates
-------------------  ----------------------------------------------------------------
House price          Numbers of bedrooms, bathrooms; plot area; year built; location
Weight loss          Type of diet plan; type of exercise regime
Short-sightedness    Parents’ short-sightedness; hours spent watching TV or reading books

First note that in each of the examples above, it would be hopeless to attempt to find a
deterministic function that gives the response for every possible set of values of the covariates.
Instead, it makes sense to think of the data-generating mechanism as being inherently random,
with perhaps a deterministic function relating average values of the responses to values of the
covariates.
We model the responses yi as realisations of random variables Yi . Depending on how the
data were collected, it may seem appropriate to also treat the xi as random. However, in such
cases we usually condition on the observed values of the explanatory variables. To aid intuition,
it may help to imagine a hypothetical sequence of repetitions of the ‘experiment’ that was
conducted to produce the data with the xi , i = 1, . . . , n held fixed, and think of the dataset at
hand as being one of the many elements of such a sequence.
In the course Principles of Statistics, theory was developed for data that were i.i.d. In our
setting here, this assumption is not appropriate: the distributions of Yi and Yj may well be
different if xi ≠ xj . In fact what we are interested in is how the distributions of the Yi differ.
However, we will still usually assume that the data are at least independent. It turns out that
with this assumption of independence, much of the theory from Principles of Statistics can be
applied, with little modification.
In this course we will study some of the most popular and important statistical models for
data of the form (0.0.1). We begin with the linear model, which you will have met in Statistics
IB.

Contents

1 Linear models
  1.1 Ordinary least squares (OLS)
    1.1.1 Orthogonal projections
    1.1.2 Analysis of OLS
  1.2 Normal errors
    1.2.1 Maximum likelihood estimation
    1.2.2 The multivariate normal distribution and related distributions
    1.2.3 Inference for the normal linear model
    1.2.4 ANOVA and ANCOVA
    1.2.5 Model selection
    1.2.6 Model checking

2 Exponential families and generalised linear models
  2.1 Non-normal responses
  2.2 Exponential families
  2.3 Exponential dispersion families
  2.4 Generalised linear models
    2.4.1 Choice of link function
    2.4.2 Likelihood equations
  2.5 Inference
    2.5.1 The score function
    2.5.2 Fisher information
    2.5.3 Two key asymptotic results
    2.5.4 Inference in generalised linear models
  2.6 Computation

3 Specific regression problems
  3.1 Binomial regression
    3.1.1 Link functions
    3.1.2 *A classification view of logistic regression*
    3.1.3 *Logistic regression and linear discriminant analysis*
    3.1.4 Model checking
  3.2 Poisson regression
    3.2.1 Likelihood equations
    3.2.2 Modelling rates with an offset
    3.2.3 Contingency tables
    3.2.4 Test for independence of columns and rows
    3.2.5 Test for homogeneity of rows

Chapter 1

Linear models

1.1 Ordinary least squares (OLS)


The linear regression model assumes that

Y = Xβ + ε,

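where
\[
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix},
\]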
and the εi are to be considered as random errors that satisfy
(A1) E(εi ) = 0,
(A2) Var(εi ) = σ 2 ,
(A3) Cov(εi , εj ) = 0 for i ≠ j.

A word on models. It is important to recognise that this, or any, statistical model is a
mathematical object and cannot really be thought of as a ‘true’ representation of reality.
Nevertheless, statistical models can be a useful representation of reality. Though the model may
be wrong, it can still be used to answer questions of interest, and help inform decisions.

The design matrix X


If we want to include an intercept term in the linear model, we can simply take our design
matrix X as
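\[
X = \begin{pmatrix} 1 & x_1^T \\ \vdots & \vdots \\ 1 & x_n^T \end{pmatrix}.
\]
To include quadratic terms, we may take
\[
X = \begin{pmatrix}
1 & x_1^T & x_{11}^2 & \cdots & x_{1p}^2 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n^T & x_{n1}^2 & \cdots & x_{np}^2
\end{pmatrix}.
\]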
The resulting model will not be linear in the xi , but it is still a linear model because it is linear
in β.

Least squares
Under assumptions A1–A3, a sensible way to estimate β is using OLS. This gives an estimate
β̂ that satisfies

\[
\hat\beta := \arg\min_{b \in \mathbb{R}^p} \|Y - Xb\|^2 = (X^T X)^{-1} X^T Y,
\]
provided the n by p matrix X has full column rank (i.e. r(X) = p) so X T X is invertible
(see example sheet). The fitted values, Ŷ := X β̂, are then given by X(X T X)−1 X T Y . Let
P := X(X T X)−1 X T . Then P is known as the ‘hat’ matrix because it puts the hat on Y . In fact
it is an orthogonal projection on to the column space of X. To discuss this further, we recall
some facts about projections from linear algebra.
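
As a small numerical illustration of the formulas above, the following sketch computes β̂, the hat matrix P and the fitted values with NumPy. The data values and variable names are made up for the example and are not part of the course material.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, an intercept and one covariate.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
Y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

# OLS estimate: solve the normal equations X^T X b = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix P = X (X^T X)^{-1} X^T and fitted values P Y.
P = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = P @ Y

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse of X T X explicitly; for ill-conditioned designs `np.linalg.lstsq` is usually the safer choice.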

1.1.1 Orthogonal projections


Let V be a subspace of Rn and define

V ⊥ := {w ∈ Rn : wT v = 0 for all v ∈ V }.

V ⊥ is known as the orthogonal complement of V . Fact: Rn = V ⊕ V ⊥ , so each x ∈ Rn
may be written uniquely as x = v + w with v ∈ V and w ∈ V ⊥ . This follows because we
can pick an orthonormal basis for V , v1 , . . . , vm , and then extend it to an orthonormal basis
v1 , . . . , vm , vm+1 , . . . , vn for Rn . V ⊥ is then the span of vm+1 , . . . , vn .

Definition 1. A matrix Π ∈ Rn×n is called an orthogonal projection on to V ≤ Rn if Πx = v
when x = v + w with v ∈ V , w ∈ V ⊥ . Thus Π acts as the identity on V and sends everything
orthogonal to V to 0. We will say that Π is an orthogonal projection if it is an orthogonal
projection on to its column space.

Let Π be an orthogonal projection on to V . Here are some important properties.

(i) The column space (a.k.a. range, image) of Π is V .

(ii) I − Π is an orthogonal projection on to V ⊥ , so I − Π fixes everything in V ⊥ and sends
everything in V to 0. Indeed, (I − Π)(v + w) = v + w − Π(v + w) = v + w − v = w.

(iii) Π2 = Π = ΠT , so Π is idempotent and symmetric. The former is clear from the definition.
To see that Π is symmetric observe that for all u1 , u2 ∈ Rn ,

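\[
0 = (\Pi u_1)^T (I - \Pi) u_2 = u_1^T (\Pi^T - \Pi^T \Pi) u_2,
\]
because Πu1 ∈ V and (I − Π)u2 ∈ V ⊥ . As this holds for all u1 , u2 , we must have ΠT = ΠT Π,
and ΠT Π is symmetric.

In fact, we can see that Π2 = Π = ΠT is an alternative definition for Π being an orthogonal
projection. Indeed, if v is in the column space of Π, then v = Πu, for some u ∈ Rn . But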
then Πv = Π2 u = Πu = v, so Π fixes everything in its column space. Now if v is orthogonal
to the column space of Π, then Πv = ΠT v = 0.

(iv) Orthonormal bases of V and V ⊥ are eigenvectors of Π with eigenvalues 1 and 0 respectively.
Therefore we can form the eigendecomposition Π = U DU T where U is an orthogonal
matrix with columns as eigenvectors of Π and D is a diagonal matrix of corresponding
eigenvalues.

(v) r(Π) = dim(V ). Also, by the eigendecomposition above,

r(Π) = tr(D) = tr(U T U D) = tr(U DU T ) = tr(Π),

where we have used the cyclic property of the trace.

Note that the matrix P = X(X T X)−1 X T defined earlier is the orthogonal projection on to
the column space of X. Indeed, P Xb = Xb and if w is orthogonal to the column space of X, so
X T w = 0, then P w = 0. Also, our derivation of P Y as the linear combination of columns of X
that is closest in Euclidean distance to Y reveals another property of orthogonal projections: if
Π is an orthogonal projection on to V , then for any v ∈ Rn , Πv is the closest point on V to the
vector v. In other words,
\[
\Pi v = \arg\min_{u \in V} \|v - u\|^2.
\]
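
The properties listed above are easy to check numerically for the hat matrix. The sketch below does so for a randomly generated design matrix; the seed, the dimensions and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
n, p = 8, 3
X = rng.normal(size=(n, p))               # hypothetical full-column-rank design
P = X @ np.linalg.solve(X.T @ X, X.T)     # orthogonal projection on to col(X)

idempotent = np.allclose(P @ P, P)        # P^2 = P
symmetric = np.allclose(P, P.T)           # P = P^T
trace_equals_p = np.isclose(np.trace(P), p)   # r(P) = tr(P) = p

# Eigenvalues of an orthogonal projection are 0 or 1; exactly p of them equal 1.
eigvals = np.linalg.eigvalsh(P)
n_unit_eigs = int(np.sum(np.isclose(eigvals, 1.0)))

# P v is the closest point of col(X) to v: another point X b is no closer.
v = rng.normal(size=n)
b = rng.normal(size=p)
closest = np.linalg.norm(v - P @ v) <= np.linalg.norm(v - X @ b)
```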

1.1.2 Analysis of OLS


Back to Statistics: the fitted values of OLS are given by the projection of the vector of responses,
Y on to the column space of the matrix of predictors X.
Recall that the covariance between two random vectors Z1 ∈ Rn1 and Z2 ∈ Rn2 is defined
by
\[
\mathrm{Cov}(Z_1, Z_2) := E[\{Z_1 - E(Z_1)\}\{Z_2 - E(Z_2)\}^T].
\]
The correlation matrix between Z1 and Z2 is the n1 by n2 matrix with entries given by
\[
\mathrm{Corr}(Z_1, Z_2)_{ij} := \frac{\mathrm{Cov}(Z_1, Z_2)_{ij}}{\sqrt{\mathrm{Var}(Z_{1,i})\,\mathrm{Var}(Z_{2,j})}}.
\]

For any constants a1 ∈ Rn1 and a2 ∈ Rn2 , Cov(Z1 + a1 , Z2 + a2 ) = Cov(Z1 , Z2 ). Also recall that
for any d by n1 matrix A and any constant vector m ∈ Rd , as expectation is a linear operator,
E(m + AZ1 ) = m + AE(Z1 ).
We can show that the vector of residuals, ε̂ := Y − Ŷ = (I − P )Y is uncorrelated with the
fitted values Ŷ :

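\[
\begin{aligned}
\mathrm{Cov}(PY, (I - P)Y) &= \mathrm{Cov}(P\varepsilon, (I - P)\varepsilon) \\
&= E\{P \varepsilon \varepsilon^T (I - P)^T\} \\
&= P \underbrace{E(\varepsilon \varepsilon^T)}_{\sigma^2 I} (I - P) \\
&= \sigma^2 P(I - P) = 0.
\end{aligned}
\]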

Here is another way to think of the OLS coefficients that can offer further insight. Let us
write Xj for the j th column of X, and X−j for the n × (p − 1) matrix formed by removing the
j th column from X. Define P−j as the orthogonal projection on to the column space of X−j .

Proposition 1. Let Xj⊥ := (I − P−j )Xj , so Xj⊥ is the orthogonal projection of Xj on to the
orthogonal complement of the column space of X−j . Then

\[
\hat\beta_j = \frac{(X_j^\perp)^T Y}{\|X_j^\perp\|^2}.
\]

Proof. Note that Y = P Y + (I − P )Y and
\[
X_j^T (I - P_{-j})(I - P) Y = X_j^T (I - P) Y = 0,
\]
so
\[
\frac{(X_j^\perp)^T Y}{\|X_j^\perp\|^2} = \frac{(X_j^\perp)^T X (X^T X)^{-1} X^T Y}{\|X_j^\perp\|^2}.
\]
Since Xj⊥ is orthogonal to the column space of X−j , we have
\[
(X_j^\perp)^T X = (0 \;\cdots\; 0 \;\; \underbrace{(X_j^\perp)^T X_j}_{j\text{th position}} \;\; 0 \;\cdots\; 0)
\]
and $(X_j^\perp)^T X_j = X_j^T (I - P_{-j}) X_j = \|(I - P_{-j}) X_j\|^2$.

We see that Var(β̂j ) = σ 2 kXj⊥ k−2 . Thus if Xj is closely aligned to the column space of X−j ,
the variance of β̂j will be large. In particular, if a pair of variables are highly correlated with
each other, the variances of the estimates of the corresponding coefficients will be large.
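
Proposition 1 and the variance formula can be verified numerically. The following is a minimal sketch on simulated data; the helper `projection`, the seed and the dimensions are assumptions made for the example.

```python
import numpy as np

def projection(A):
    """Orthogonal projection on to the column space of A (A assumed full column rank)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(1)            # fixed seed, illustrative only
n, p, j = 20, 3, 1                        # j is a 0-based column index
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Full OLS fit.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# X_j^perp = (I - P_{-j}) X_j: the part of X_j orthogonal to the other columns.
X_minus_j = np.delete(X, j, axis=1)
X_j_perp = (np.eye(n) - projection(X_minus_j)) @ X[:, j]

# Proposition 1: beta_hat_j = (X_j^perp)^T Y / ||X_j^perp||^2.
beta_j_partial = X_j_perp @ Y / (X_j_perp @ X_j_perp)

# Var(beta_hat_j) / sigma^2 computed two ways.
var_factor_direct = np.linalg.inv(X.T @ X)[j, j]
var_factor_perp = 1.0 / (X_j_perp @ X_j_perp)
```

Making column j nearly collinear with the others shrinks ‖Xj⊥‖ and inflates both variance factors, in line with the remark above.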
We can measure the quality of a regression procedure by its mean-squared prediction error
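(MSPE). This is defined here as
\[
\frac{1}{n} E(\|X\beta - X\hat\beta\|^2).
\]
Note that X β̂ = P Y = Xβ + P ε, so
\[
E(\|X\beta - X\hat\beta\|^2) = E(\varepsilon^T P^T P \varepsilon) = E\{\mathrm{tr}(\varepsilon^T P \varepsilon)\} = \mathrm{tr}\{E(\varepsilon \varepsilon^T) P\} = \sigma^2 \mathrm{tr}(P) = \sigma^2 p.
\]
Thus
\[
\frac{1}{n} E(\|X\beta - X\hat\beta\|^2) = \sigma^2 \frac{p}{n}.
\]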
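
A quick Monte Carlo check of the MSPE formula is sketched below. It uses normal errors purely for convenience (only A1–A3 are needed for the result), and the seed, dimensions and number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)            # fixed seed, illustrative only
n, p, sigma = 30, 4, 1.5
X = rng.normal(size=(n, p))               # hypothetical fixed design
P = X @ np.linalg.solve(X.T @ X, X.T)

n_rep = 20000
eps = sigma * rng.normal(size=(n_rep, n))  # one error vector per row

# X beta_hat - X beta = P eps. Since P is symmetric, row r of eps @ P is (P eps_r)^T.
errors = eps @ P
mspe_mc = np.mean(np.sum(errors ** 2, axis=1)) / n

mspe_theory = sigma ** 2 * p / n           # sigma^2 p / n
```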
More is true. Note that β̂ is unbiased, as

Eβ (β̂) = Eβ {(X T X)−1 X T Xβ} = β. (1.1.1)

Further,
Var(β̂) = (X T X)−1 X T Var(ε){(X T X)−1 X T }T = σ 2 (X T X)−1 . (1.1.2)
In fact it is the best linear unbiased estimator (BLUE), that is, for any other unbiased estimator β̃ that
is linear in Y , we have that Var(β̃) − Var(β̂) is positive semi-definite. In particular this means that
given a new observation x∗ ∈ Rp , we can estimate the regression function at x∗ optimally in the
sense that E{(x∗ T β − x∗ T β̂)2 } ≤ E{(x∗ T β − x∗ T β̃)2 }.
Theorem 2 (Gauss–Markov). Under (A1)–(A3) OLS is BLUE.

1.2 Normal errors


1.2.1 Maximum likelihood estimation
The method of least squares is just one way to construct an estimator. A more general technique
is that of maximum likelihood estimation. Here given data y ∈ Rn that we take as a realisation
of a random variable Y , we specify its density f (y; θ) up to some unknown vector of parameters
θ ∈ Θ ⊆ Rd , where Θ is the parameter space. The likelihood function is a function of θ for each
fixed y given by
\[
L(\theta) := L(\theta; y) = c(y) f(y; \theta),
\]
where c(y) is an arbitrary constant of proportionality. We form an estimate θ̂ by choosing that
θ which maximises the likelihood. Often it is easier to work with the log-likelihood defined by
\[
\ell(\theta) := \ell(\theta; y) = \log f(y; \theta) + \log(c(y)).
\]

If we assume that the errors εi in our linear model have N (0, σ 2 ) distributions, we see that
the log-likelihood for (β, σ 2 ) is
\[
\ell(\beta, \sigma^2) = -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2.
\]

The maximiser of this over β is precisely the least squares estimator (X T X)−1 X T Y . Maximum
likelihood does much more than simply give us another interpretation of OLS here. It allows us
to perform inference: that is, construct confidence intervals for parameters and perform hypothesis
tests. Before moving on to this topic, we review some facts about the multivariate normal
distribution.

1.2.2 The multivariate normal distribution and related distributions


Multivariate normal distribution
We say a random variable Z ∈ Rd has a d-variate normal distribution if for every t ∈ Rd , tT Z has
a univariate normal distribution. Thus linear combinations of Z are also normal: for any m ∈ Rk
and A ∈ Rk×d , m + AZ is multivariate normal. Fact: the multivariate normal distribution is
uniquely characterised by its mean and variance. Thus we write Z ∼ Nd (µ, Σ) when
E(Z) = µ and Var(Z) = Σ. Note that m + AZ ∼ Nk (m + Aµ, AΣAT ).
When Σ is invertible, the density of Z is given by
\[
f(z; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right\}, \quad z \in \mathbb{R}^d.
\]

Note that for example, the vector of residuals from the normal linear model (I − P )ε ∼
Nn (0, σ 2 (I − P )) but it does not have a density of the form given above as I − P is not
invertible.

Proposition 3. If Z1 and Z2 are jointly normal (i.e. (Z1 , Z2 ) has a multivariate normal
distribution), then if Cov(Z1 , Z2 ) := E[{Z1 − E(Z1 )}{Z2 − E(Z2 )}T ] = 0, we have that Z1 and Z2
are independent.

Proof. Let Z̃1 and Z̃2 be independent and have the same distributions as Z1 and Z2 respectively.
Then the mean and variance of the random variables (Z̃1 , Z̃2 ) and (Z1 , Z2 ) are identical and they
are both multivariate normal (the former is multivariate normal because sums of independent
normal random variables are normal). Since a multivariate normal distribution is uniquely
determined by its mean and variance, we must have $(\tilde Z_1, \tilde Z_2) \overset{d}{=} (Z_1, Z_2)$.

χ2 distribution
We say Z has a χ2 distribution on k degrees of freedom, and write Z ∼ χ2k , if
\[
Z \overset{d}{=} Z_1^2 + \cdots + Z_k^2,
\]
where $Z_1, \ldots, Z_k \overset{\text{i.i.d.}}{\sim} N(0, 1)$.
Proposition 4. Let Π be an n by n orthogonal projection with rank k, and let ε ∼ Nn (0, σ 2 I).
Then kΠεk2 ∼ σ 2 χ2k .
Proof. As Π is an orthogonal projection, we may form its eigendecomposition U DU T where U
is an orthogonal matrix and D is diagonal with entries in {0, 1}. Then

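\[
\|\Pi \varepsilon\|^2 = \|D U^T \varepsilon\|^2 \quad \text{and} \quad \|D \varepsilon\|^2
\]
have the same distribution. But
\[
\frac{1}{\sigma^2} \|D \varepsilon\|^2 = \frac{1}{\sigma^2} \sum_{i : D_{ii} \neq 0} \varepsilon_i^2 \sim \chi^2_k.
\]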

Student’s t distribution
We say Z has a t distribution on k degrees of freedom, and write Z ∼ tk , if
\[
Z \overset{d}{=} \frac{Z_1}{\sqrt{Z_2 / k}},
\]
where Z1 and Z2 are independent N (0, 1) and χ2k random variables respectively.

Multivariate t distribution
This is a generalisation of the Student’s t distribution above. We say Z has a p-dimensional
multivariate t distribution on k degrees of freedom, and write Z ∼ tk (µ, Σ), if
\[
Z \overset{d}{=} \mu + \frac{Z_1}{\sqrt{Z_2 / k}},
\]
where Z1 and Z2 are independent Np (0, Σ) and χ2k random variables respectively. It can be
shown that when Σ is invertible, Z has density
\[
f(z) := \frac{\Gamma((k + p)/2)}{\Gamma(k/2) (k\pi)^{p/2}} |\Sigma|^{-1/2} \left\{ 1 + \frac{1}{k} (z - \mu)^T \Sigma^{-1} (z - \mu) \right\}^{-(k + p)/2},
\]
where the gamma function Γ satisfies Γ(m) = (m − 1)! for m ≥ 1.

F distribution
We say Z has an F distribution on k and l degrees of freedom, and write Z ∼ Fk,l , if
\[
Z \overset{d}{=} \frac{Z_1 / k}{Z_2 / l},
\]

where Z1 and Z2 are independent and follow χ2k and χ2l distributions respectively.

Notation. We will denote the upper α-points of the χ2k , tk and Fk,l distributions by χ2k (α),
tk (α) and Fk,l (α) respectively. (So, for example, if Z ∼ χ2k then P{Z ≥ χ2k (α)} = α. As the tk
distribution is symmetric, if Z ∼ tk , then P{−tk (α/2) ≤ Z ≤ tk (α/2)} = 1 − α.)

Informal summary

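\[
\begin{aligned}
\chi^2_k &= \underbrace{N(0,1)^2 + \cdots + N(0,1)^2}_{k \text{ times}}, \\
t_k &= \frac{N(0,1)}{\sqrt{\chi^2_k / k}}, \\
F_{k,l} &= \frac{\chi^2_k / k}{\chi^2_l / l}, \quad \text{so } t_l^2 = F_{1,l},
\end{aligned}
\]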

with appropriate independence between relevant random variables.

1.2.3 Inference for the normal linear model


Distribution of β̂
We already know the mean and variance of β̂ (equations (1.1.1) and (1.1.2)). As it is a linear
combination of ε, we know it must be normally distributed: β̂ ∼ Np (β, σ 2 (X T X)−1 ).

Distribution of σ̂ 2
The maximum likelihood estimate for σ 2 is
\[
\hat\sigma^2 := \frac{1}{n} \|Y - X\hat\beta\|^2 = \frac{1}{n} \|(I - P) Y\|^2 = \frac{1}{n} \|(I - P) \varepsilon\|^2.
\]
We already know that the fitted values P Y and residuals (I − P )Y are uncorrelated. But
(P Y, (I − P )Y ) is a linear transformation of the multivariate normal Y , so P Y and (I − P )Y
must be independent. Therefore β̂ = (X T X)−1 X T P Y and σ̂ 2 are independent. Proposition 4
shows that σ̂ 2 ∼ σ 2 χ2n−p /n. Note that E(σ̂ 2 ) = (n − p)σ 2 /n, so σ̂ 2 is a biased estimator of σ 2 .
Let
\[
\tilde\sigma^2 := \frac{n}{n - p} \hat\sigma^2 = \frac{1}{n - p} \|Y - X\hat\beta\|^2 \sim \frac{\sigma^2}{n - p} \chi^2_{n-p},
\]
so σ̃ 2 is now an unbiased estimator of σ 2 .
Now that we know the joint distribution of (β̂, σ̃ 2 ), it is rather easy to construct confidence
sets for β.

Confidence statements for β


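We can obtain confidence sets for β by using the fact that the quantity
\[
\frac{\hat\beta - \beta}{\tilde\sigma} \overset{d}{=} \frac{N_p(0, (X^T X)^{-1})}{\sqrt{\frac{1}{n-p} \chi^2_{n-p}}} \quad \text{(numerator and denominator independent)}
\]
is a pivot, that is its distribution does not depend on β or σ 2 . In fact it has a $t^{(p)}_{n-p}(0, (X^T X)^{-1})$
distribution. For example, observe that
\[
\frac{\hat\beta_j - \beta_j}{\sqrt{\tilde\sigma^2 (X^T X)^{-1}_{jj}}} \sim t_{n-p},
\]
so a (1 − α)-confidence interval for βj is given by
\[
\left[ \hat\beta_j - \sqrt{\tilde\sigma^2 (X^T X)^{-1}_{jj}} \; t_{n-p}(\alpha/2), \;\; \hat\beta_j + \sqrt{\tilde\sigma^2 (X^T X)^{-1}_{jj}} \; t_{n-p}(\alpha/2) \right] =: C_j(\alpha).
\]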
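
The interval Cj (α) can be computed in a few lines. In the sketch below the function name, the toy data and the choice to pass the critical value tn−p (α/2) in as an argument (here 2.228, the tabulated upper 2.5% point of t10) are assumptions for illustration rather than anything prescribed by the notes.

```python
import numpy as np

def ols_confidence_intervals(X, Y, t_crit):
    """Return (beta_hat, sigma_tilde_sq, intervals) for the normal linear model.

    t_crit is the upper alpha/2 point t_{n-p}(alpha/2); it is supplied by the
    caller (for example from tables) rather than computed here.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    sigma_tilde_sq = resid @ resid / (n - p)          # unbiased variance estimate
    se = np.sqrt(sigma_tilde_sq * np.diag(XtX_inv))   # standard errors of beta_hat_j
    intervals = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
    return beta_hat, sigma_tilde_sq, intervals

# Hypothetical data with n = 12 and p = 2, so n - p = 10 degrees of freedom.
x = np.arange(12.0)
X = np.column_stack([np.ones(12), x])
noise = np.array([0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.4, -0.3, 0.1, 0.2, -0.3])
Y = 2.0 + 0.5 * x + noise
beta_hat, sigma_tilde_sq, ci = ols_confidence_intervals(X, Y, t_crit=2.228)
```

Row j of `ci` holds the lower and upper end points of the interval for the jth coefficient.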

Note that $\prod_{j=1}^p C_j(\alpha)$ does not constitute a 1 − α confidence cuboid for the entire parameter
vector β, though $\prod_{j=1}^p C_j(\alpha/p)$ does have coverage at least 1 − α (see example sheet). However,
the latter can have very large volume.
A confidence ellipsoid with much lower volume can be constructed by considering

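\[
\|X(\beta - \hat\beta)\|^2 = \|P \varepsilon\|^2,
\]
which has a σ 2 χ2p distribution (by Proposition 4) and is independent of σ̃ 2 . Thus
\[
1 - \alpha = P_{\beta, \sigma^2} \left( \frac{\frac{1}{p} \|X(\beta - \hat\beta)\|^2}{\tilde\sigma^2} \leq F_{p, n-p}(\alpha) \right),
\]
so
\[
\left\{ b \in \mathbb{R}^p : \frac{\frac{1}{p} \|X(b - \hat\beta)\|^2}{\tilde\sigma^2} \leq F_{p, n-p}(\alpha) \right\}
\]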
is a (1 − α)-level confidence set for β. One disadvantage of this confidence set is that it might
be harder to interpret.
Of course the arguments used to arrive at the confidence intervals above can also be used
to perform hypothesis tests of the form

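\[
H_0 : \beta_j = \beta_{0,j} \quad \text{against} \quad H_1 : \beta_j \neq \beta_{0,j}
\]
and
\[
H_0 : \beta = \beta_0 \quad \text{against} \quad H_1 : \beta \neq \beta_0.
\]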

Prediction intervals
Given a new observation x∗ , we can easily form a confidence interval for x∗ T β, the regression
function at x∗ , by noting that

\[
x^{*T} (\hat\beta - \beta) \sim N(0, \sigma^2 x^{*T} (X^T X)^{-1} x^*),
\]
so
\[
\frac{x^{*T} (\hat\beta - \beta)}{\sqrt{\tilde\sigma^2 x^{*T} (X^T X)^{-1} x^*}} \sim t_{n-p}.
\]
A (1 − α)-level prediction interval for x∗ is a random interval I depending only on Y such
that Pβ,σ2 (Y ∗ ∈ I) = 1 − α where Y ∗ := x∗ T β + ε∗ and ε∗ ∼ N (0, σ 2 ) independently of
ε1 , . . . , εn . This will be wider than the confidence interval for x∗ T β as it must take into account
the additional variability of ε∗ . Indeed
\[
Y^* - x^{*T} \hat\beta = \varepsilon^* + x^{*T} (\beta - \hat\beta) \sim N(0, \sigma^2 \{1 + x^{*T} (X^T X)^{-1} x^*\}),
\]
so
\[
\frac{Y^* - x^{*T} \hat\beta}{\sqrt{\tilde\sigma^2 \{1 + x^{*T} (X^T X)^{-1} x^*\}}} \sim t_{n-p}.
\]
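
To see how much wider the prediction interval is than the confidence interval, the two half-widths can be compared directly. This is a minimal sketch: the function name, the toy data, the new point `x_star` and the tabulated critical value 2.228 for t10 (0.025) are assumptions for the example.

```python
import numpy as np

def interval_half_widths(X, Y, x_star, t_crit):
    """Half-widths of the confidence interval for x*^T beta and of the
    prediction interval for Y*, given the critical value t_{n-p}(alpha/2)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    sigma_tilde_sq = resid @ resid / (n - p)
    leverage = x_star @ XtX_inv @ x_star              # x*^T (X^T X)^{-1} x*
    fit = x_star @ beta_hat                           # centre of both intervals
    conf_hw = t_crit * np.sqrt(sigma_tilde_sq * leverage)
    pred_hw = t_crit * np.sqrt(sigma_tilde_sq * (1.0 + leverage))
    return fit, conf_hw, pred_hw

# Hypothetical data with n = 12, p = 2 and a new point outside the observed range.
x = np.arange(12.0)
X = np.column_stack([np.ones(12), x])
noise = np.array([0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.4, -0.3, 0.1, 0.2, -0.3])
Y = 2.0 + 0.5 * x + noise
x_star = np.array([1.0, 15.0])
fit, conf_hw, pred_hw = interval_half_widths(X, Y, x_star, t_crit=2.228)
```

The confidence interval is `fit ± conf_hw` and the prediction interval is `fit ± pred_hw`; the latter does not shrink to zero width as n grows because of the extra 1 inside the square root.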

The Bayesian normal linear model
So far we have treated β and σ 2 as unknown but fixed quantities. We have constructed estima-
tors of these quantities and tried to understand how we expect them to vary under hypothetical
repetitions of the experiment used to generate the data (with the design matrix fixed). This is
a frequentist approach to inference.
A Bayesian approach instead treats unknown parameters as random variables, and examines
their distribution conditional on the data observed. To fix ideas, suppose we have posited that
the density of the r.v. representing our data Y ∈ Rn conditional on a parameter vector θ is
p(y|θ). In addition to this statistical model, the Bayesian method requires that we agree on a
marginal distribution for θ, p(θ). This can represent prior information about the parameters that
is known before any of the data has been analysed, and hence it is called the prior distribution.
Inference about θ is based on the posterior distribution, p(θ|y), which satisfies

\[
p(\theta | y) = \frac{p(y | \theta) \, p(\theta)}{p(y)}.
\]

Taking the mean or mode of the posterior distribution gives point estimates for θ. Note that
in order to determine the posterior p(θ|y), we only need knowledge of the right-hand side up to
multiplication by an arbitrary function of y, in particular it suffices to consider p(y|θ)p(θ). To
recover p(θ|y) we simply multiply by
\[
\left( \int p(y | \theta') p(\theta') \, d\theta' \right)^{-1},
\]

or alternatively we may be able to spot the form of the density for θ and find the normalising
constant that way.
In contrast to frequentist confidence sets, using p(θ|y), we can construct sets S such that
the posterior probability of {θ ∈ S} is at least 1 − α. These are known as credible sets.
In the context of the Bayesian linear model, it is convenient to work with the precision ω :=
σ −2 rather than the variance. A commonly used prior for the parameters (β, ω) is p(β, ω) = ω −1 .
This is not a density since it does not have a finite integral. Nevertheless, the posterior resulting
from this prior is a genuine density, and inference based on this posterior has many similarities
with inference in the frequentist context. To see this we first recall the gamma distribution.

The gamma distribution. If a random variable Z has density

\[
f(z; a, b) = \frac{b^a z^{a-1} e^{-bz}}{\Gamma(a)} \quad \text{for } z \geq 0 \text{ and } a, b > 0,
\]
we write Z ∼ Γ(a, b) and say Z has a gamma distribution with shape a and rate b. We note,
for future use, that since the gamma density integrates to 1, we must have that
\[
\int_{z=0}^{\infty} z^{a-1} e^{-bz} \, dz = \frac{\Gamma(a)}{b^a}. \tag{1.2.1}
\]

Let us write the likelihood as

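\[
p(y | \beta, \omega) \propto
\underbrace{\omega^{(n-p)/2} \exp\{-\omega \|(I - P) y\|^2 / 2\}}_{\propto \; \Gamma((n-p)/2 + 1, \; \|(I - P) y\|^2 / 2) \text{ density}}
\;
\underbrace{\omega^{p/2} \exp\{-\omega (\beta - \hat\beta)^T X^T X (\beta - \hat\beta) / 2\}}_{\text{for fixed } \omega, \; \propto \; N_p(\hat\beta, \; \omega^{-1} (X^T X)^{-1}) \text{ density}}.
\]
Then, multiplying by the prior, we see that the posterior is a product of gamma and normal
densities: informally
\[
p(\beta, \omega | y) = \underbrace{\Gamma((n-p)/2, \; \|(I - P) y\|^2 / 2)}_{p(\omega | y)} \times \underbrace{N_p(\hat\beta, \; \omega^{-1} (X^T X)^{-1})}_{p(\beta | \omega, y)}.
\]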

Thus β|ω, Y ∼ Np (β̂, ω −1 (X T X)−1 ). Compare this to the distribution of β̂ in the frequentist
setting. The marginal posterior for β can be obtained by integrating out ω in the joint posterior
above. Rather than performing the integration directly, we note that as a function of ω alone,
the joint posterior is of the form
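\[
\omega^{A-1} \times \exp(-\omega B),
\]
where
\[
A = \frac{n}{2}, \qquad B = \frac{1}{2} \{ \|(I - P) y\|^2 + \|X(\beta - \hat\beta)\|^2 \}.
\]
Thus by (1.2.1), we have that the marginal posterior for β satisfies
\[
\begin{aligned}
p(\beta | y) &\propto \int_{\omega=0}^{\infty} \omega^{A-1} \exp(-\omega B) \, d\omega \\
&\propto B^{-A} \\
&\propto \left( 1 + \frac{\|X(\beta - \hat\beta)\|^2}{\|(I - P) y\|^2} \right)^{-\{(n-p)/2 + p/2\}} \\
&\propto \left\{ 1 + \frac{1}{n-p} (\beta - \hat\beta)^T (\tilde\sigma^{-2} X^T X) (\beta - \hat\beta) \right\}^{-\{(n-p)/2 + p/2\}},
\end{aligned}
\]
which we recognise as proportional to the density of a $t^{(p)}_{n-p}(\hat\beta, \tilde\sigma^2 (X^T X)^{-1})$ distribution. Thus
\[
\frac{\beta - \hat\beta}{\tilde\sigma} \,\Big|\, Y \sim t^{(p)}_{n-p}(0, (X^T X)^{-1}),
\]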

similarly to the frequentist case, though here it is β rather than β̂ that is random. From this
we see that
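\[
\frac{\beta_j - \hat\beta_j}{\sqrt{\tilde\sigma^2 (X^T X)^{-1}_{jj}}} \,\Big|\, Y \sim t_{n-p},
\qquad
\frac{\frac{1}{p} \|X(\beta - \hat\beta)\|^2}{\tilde\sigma^2} \,\Big|\, Y \sim F_{p, n-p},
\]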
so the frequentist confidence regions described in earlier sections can also be thought of as
Bayesian credible regions, when the prior p(β, ω) ∝ ω −1 is used.
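
The factorisation of the posterior into p(ω|y) and p(β|ω, y) suggests a simple way to draw from it: sample ω from its gamma distribution, then β from the conditional normal. The sketch below does this on simulated data; the seed, sample sizes and variable names are illustrative assumptions. Note that NumPy parametrises the gamma distribution by shape and scale, where scale is the reciprocal of the rate.

```python
import numpy as np

rng = np.random.default_rng(3)            # fixed seed, illustrative only
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)   # hypothetical data

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)     # ||(I - P) y||^2

n_draws = 20000
# omega | y ~ Gamma(shape = (n - p)/2, rate = rss/2), i.e. scale = 2/rss.
omega = rng.gamma(shape=(n - p) / 2, scale=2.0 / rss, size=n_draws)

# beta | omega, y ~ N_p(beta_hat, omega^{-1} (X^T X)^{-1}).
L = np.linalg.cholesky(XtX_inv)           # L L^T = (X^T X)^{-1}
z = rng.normal(size=(n_draws, p))
beta_draws = beta_hat + (z @ L.T) / np.sqrt(omega)[:, None]

posterior_mean = beta_draws.mean(axis=0)  # should be close to beta_hat
```

Empirical quantiles of the columns of `beta_draws` then give approximate credible intervals, which should be close to the frequentist intervals Cj (α).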

Testing significance of groups of variables


Often we want to test whether a given group of variables is significant. Consider partitioning
\[
X = (X_0 \;\; X_1) \quad \text{and} \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix},
\]
where X0 is n by p0 and X1 is n by p − p0 , and correspondingly β0 ∈ Rp0 and β1 ∈ Rp−p0 . We
are interested in testing
\[
H_0 : \beta_1 = 0 \quad \text{against} \quad H_1 : \beta_1 \neq 0.
\]
One sensible way of proceeding is to construct a generalised likelihood ratio test. Recall that
given an n-vector Y , assumed to have density f (y; θ) for some unknown θ ∈ Θ, the likelihood
ratio test for testing
\[
H_0 : \theta \in \Theta_0 \quad \text{against} \quad H_1 : \theta \notin \Theta_0,
\]
where Θ0 ⊂ Θ, rejects the null hypothesis for large values of wLR defined by
\[
w_{LR}(H_0) = 2 \log \left\{ \frac{\sup_{\theta' \in \Theta} L(\theta')}{\sup_{\theta' \in \Theta_0} L(\theta')} \right\} = 2 \Big\{ \sup_{\theta' \in \Theta} \ell(\theta') - \sup_{\theta' \in \Theta_0} \ell(\theta') \Big\}.
\]

Let us apply the generalised likelihood ratio test to the problem of assigning significance
to groups of variables in the linear model. Write β̌0 and σ̌ 2 for the MLEs of the vector of
regression coefficients and the variance respectively under the null hypothesis (i.e. when the
model is Y = X0 β0 + ε with ε ∼ Nn (0, σ 2 I)).
We have
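\[
\begin{aligned}
w_{LR}(H_0) &= -n \log(\hat\sigma^2) - \frac{1}{\hat\sigma^2} \|Y - X\hat\beta\|^2 + n \log(\check\sigma^2) + \frac{1}{\check\sigma^2} \|Y - X_0 \check\beta_0\|^2 \\
&= -n \log \left\{ \frac{\|(I - P) Y\|^2}{\|(I - P_0) Y\|^2} \right\},
\end{aligned}
\]
where P0 denotes the orthogonal projection on to the column space of X0 .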

To determine the right cutoff for an α-level test, we need to obtain the distribution of (a
monotone function of the) argument of the logarithm under the null hypothesis, that is, the
distribution of
\[
\frac{\|(I - P_0) \varepsilon\|^2}{\|(I - P) \varepsilon\|^2}.
\]
By dividing top and bottom by σ 2 , we see that the distribution of the quantity above doesn’t
depend on any unknown parameters. To find its distribution we argue as follows. Write

I − P0 = (I − P ) + (P − P0 ).

Now since the columns of P and P0 are in the column space of X, (I − P )(P − P0 ) = 0, so

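\[
\|(I - P_0) \varepsilon\|^2 = \|(I - P) \varepsilon\|^2 + \|(P - P_0) \varepsilon\|^2,
\]
whence
\[
\frac{\|(I - P_0) \varepsilon\|^2}{\|(I - P) \varepsilon\|^2} = 1 + \frac{\|(P - P_0) \varepsilon\|^2}{\|(I - P) \varepsilon\|^2}.
\]
Also
\[
\mathrm{Cov}((I - P) \varepsilon, (P - P_0) \varepsilon) = E\{(I - P) \varepsilon \varepsilon^T (P - P_0)^T\} = \sigma^2 (I - P)(P - P_0) = 0.
\]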

As the random vector
\[
\begin{pmatrix} (I - P) \varepsilon \\ (P - P_0) \varepsilon \end{pmatrix}
\]
is multivariate normal (being the image of a multivariate normal vector under a linear map),
we know that (I − P )ε and (P − P0 )ε are independent. Hence $\|(I - P)\varepsilon\|^2$ and $\|(P - P_0)\varepsilon\|^2$ are
independent. We know that $\|(I - P)\varepsilon\|^2 / \sigma^2 \sim \chi^2_{n-p}$. It turns out that $\|(P - P_0)\varepsilon\|^2 / \sigma^2 \sim \chi^2_{p-p_0}$.
This follows from Proposition 4 and the fact that P − P0 is an orthogonal projection with
rank p − p0 . Indeed, it is certainly symmetric, and
(P − P0 )2 = P − P P0 − P0 P + P0 = P − P0 ,
the final equality following from P0 P = P0T P T = (P P0 )T = P0T = P0 . Thus P − P0 is an
orthogonal projection, so we know
r(P − P0 ) = tr(P − P0 ) = tr(P ) − tr(P0 ) = r(P ) − r(P0 ) = p − p0 .
Finally, we may conclude that
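\[
\frac{\frac{1}{p - p_0} \|(P - P_0) \varepsilon\|^2}{\frac{1}{n - p} \|(I - P) \varepsilon\|^2} \sim F_{p - p_0, n - p}.
\]
In summary, we can perform a generalised likelihood ratio test for
\[
H_0 : \beta_1 = 0 \quad \text{against} \quad H_1 : \beta_1 \neq 0
\]
at level α by comparing the test statistic
\[
\frac{\frac{1}{p - p_0} \|(P - P_0) Y\|^2}{\frac{1}{n - p} \|(I - P) Y\|^2}
\]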

to Fp−p0 ,n−p (α) and rejecting for large values of the test statistic.
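
The test statistic is straightforward to compute from the two projections. The sketch below uses data simulated under the null hypothesis; the helper `projection`, the seed and the dimensions are assumptions for the example, and the critical value Fp−p0 ,n−p (α) is left to be looked up separately.

```python
import numpy as np

def projection(A):
    """Orthogonal projection on to the column space of A (A assumed full column rank)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(4)            # fixed seed, illustrative only
n, p0, p = 50, 2, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X0 = X[:, :p0]                            # null model uses the first p0 columns
Y = X0 @ np.array([1.0, -1.0]) + rng.normal(size=n)   # generated under H0

P, P0 = projection(X), projection(X0)
I = np.eye(n)
rss_full = np.sum(((I - P) @ Y) ** 2)     # ||(I - P) Y||^2
rss_null = np.sum(((I - P0) @ Y) ** 2)    # ||(I - P0) Y||^2
explained = np.sum(((P - P0) @ Y) ** 2)   # ||(P - P0) Y||^2

# F statistic, to be compared with the upper alpha point of F_{p-p0, n-p}.
f_stat = (explained / (p - p0)) / (rss_full / (n - p))
```

Because of the decomposition I − P0 = (I − P ) + (P − P0 ), the numerator can equally be computed as `rss_null - rss_full`, which only needs the two residual sums of squares.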

1.2.4 ANOVA and ANCOVA


Although so far we have thought of our covariates as being real-valued (i.e. things like age,
time, height, volume etc.), categorical predictors (also known as factors) can also be dealt with.
These can arise in situations such as the following. Consider measuring the weight loss of people
each participating in one of J different exercise regimes, the first regime being no exercise (the
control). Let the weight loss of the k th participant of regime j be Yjk . The model that the
responses are independent with
Yjk ∼ N (µj , σ 2 ), j = 1, . . . , J; k = 1, . . . nj
can be cast within the framework of the normal linear model by writing
\[
Y = \begin{pmatrix} Y_{11} \\ \vdots \\ Y_{1 n_1} \\ Y_{21} \\ \vdots \\ Y_{2 n_2} \\ \vdots \\ Y_{J1} \\ \vdots \\ Y_{J n_J} \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
\vdots & \vdots & & & \vdots \\
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 1 & 0 & \cdots & 0 \\
 & & \vdots & & \\
0 & \cdots & \cdots & 0 & 1 \\
\vdots & & & \vdots & \vdots \\
0 & \cdots & \cdots & 0 & 1
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_J \end{pmatrix}.
\]

This type of model is known as a one-way analysis of variance (ANOVA). If all the nj were
equal, it would be called a balanced one-way ANOVA.
An alternative parametrisation is

Yjk = µ + αj + εjk , εjk ∼ N (0, σ 2 ); j = 1, . . . , J; k = 1, . . . , nj ,

where µ is the baseline or mean effect and αj is the effect of the j th regime in relation to the
baseline.
Notice that the parameter vector (µ, α1 , . . . , αJ ) is not identifiable since, for example, replacing µ
with µ + c and each αj with αj − c gives the same model for every c ∈ R. To make the model
identifiable, one option is to constrain α1 = 0. This is known as a corner point constraint and
is the default in R. This makes it easier to test for differences from the control. Another option
is to use a sum-to-zero constraint: $\sum_{j=1}^J n_j \alpha_j = 0$. Note that the particular constraints used
do not affect the fitted values in any way.
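
The claim that the choice of constraint does not change the fitted values can be checked on a small example. The sketch below builds the design matrix for the group-means parametrisation and for the corner point parametrisation; the group labels and responses are made-up values for illustration.

```python
import numpy as np

# Hypothetical labels for J = 3 regimes (0 is the control), unbalanced group sizes.
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
Y = np.array([0.5, 0.7, 0.3, 1.9, 2.1, 3.2, 2.8, 3.0, 3.4])
J = 3

# Group-means parametrisation: column j is the indicator of regime j.
X_means = (groups[:, None] == np.arange(J)[None, :]).astype(float)

# Corner point constraint alpha_1 = 0: intercept plus indicators of regimes 2..J.
X_corner = np.column_stack([np.ones(len(Y)), X_means[:, 1:]])

mu_hat, *_ = np.linalg.lstsq(X_means, Y, rcond=None)        # estimates of mu_j
coef_corner, *_ = np.linalg.lstsq(X_corner, Y, rcond=None)  # (mu, alpha_2, ..., alpha_J)

fitted_means = X_means @ mu_hat
fitted_corner = X_corner @ coef_corner
```

Here `mu_hat` contains the group averages, while `coef_corner` contains the control average followed by the differences from the control; both designs span the same column space, so the fitted values coincide.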
If each of the subjects in our hypothetical experiment also went on one of I different diets,
then writing Yijk now to mean the weight loss of the k th participant of exercise regime j and
diet i, we might model the Yijk as independent with

Yijk = µ + αi + βj + εijk , εijk ∼ N (0, σ 2 ); i = 1, . . . , I; j = 1, . . . , J; k = 1, . . . , nij .

This model is called an additive two-way ANOVA because it assumes that the effects of the
different factors are additive. The model is over-parametrised and as before, constraints must
be imposed on the parameters to ensure identifiability. By default, R uses the corner point
constraints α1 = β1 = 0.
If the contribution of one of the exercise regimes to the response was not the same for all
the different types of diets, it may be more appropriate to use the model

Yijk = µ + αi + βj + γij + εijk .

The γij are known as interaction terms.


One might also have information about the subjects in the form of continuous variables,
e.g. blood pressure, BMI. Since all the models above are normal linear models, these variables
can simply be appended to the design matrix to include them in the model. A linear model
that contains both factors and continuous variables is known as an analysis of covariance
(ANCOVA).

1.2.5 Model selection


Recall that the MSPE of β̂ is σ 2 p/n. If only p0 of the components of β were non-zero, say
the first p0 , then we could perform regression on just X0 , the matrix formed from the first p0
columns of X, and the resulting estimator of the non-zero coefficients, β̂0 , would have reduced
MSPE σ 2 p0 /n, rather than σ 2 p/n. Moreover

\[
\mathrm{Var}(\hat\beta_{0,j}) = \frac{\sigma^2}{\|(I - P_{0,-j})X_j\|^2} \le \frac{\sigma^2}{\|(I - P_{-j})X_j\|^2} = \mathrm{Var}(\hat\beta_j),
\]

for j = 1, . . . , p0 (see example sheet). Here P0,−j is the orthogonal projection on to the column
space of X0,−j , the matrix formed by removing the j th column from X0 .
It is thus useful to check whether a model formed from a smaller set of variables can ad-
equately explain the data observed. Another advantage of selecting the right model is that it
allows one to focus on variables of interest.

Coefficient of determination
One popular measure of the goodness of fit of a linear model is the coefficient of determination
or R2 . It compares the residual sum of squares (RSS) under the model in question to a minimal
model containing just an intercept, and is defined by

\[
R^2 := \frac{\|Y - \bar{Y} 1_n\|^2 - \|(I - P)Y\|^2}{\|Y - \bar{Y} 1_n\|^2},
\]

where 1n is an n-vector of 1’s. The interpretation of R2 is as the proportion of the total variation
in the data explained by the model. It takes values between 0 and 1 with higher values indicating
a better fit. The R2 never decreases when variables are added to the model. The adjusted R2 ,
R̃2 , defined by
\[
\tilde{R}^2 := 1 - (1 - R^2)\,\frac{n-1}{n-p},
\]
can be motivated by analogy with the F statistic, and takes account of the number of parameters.
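
As a small numerical illustration, the following Python sketch computes R2 and the adjusted R2 for a simple linear regression with an intercept and one covariate (so p = 2). The data are hypothetical, and the closed-form least squares expressions for the slope and intercept are the standard ones for this special case.

```python
# Hypothetical data for a simple linear regression (intercept + one covariate).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n, p = len(y), 2  # p counts the intercept and the slope

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # ||(I - P)Y||^2
tss = sum((yi - y_bar) ** 2 for yi in y)                # ||Y - Ybar 1_n||^2

r2 = (tss - rss) / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
```

Since (n − 1)/(n − p) ≥ 1, the adjusted value is never larger than R2 itself.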

AIC
Another approach to measuring the fit of a model is Akaike’s Information Criterion (AIC). We
will describe AIC in a more general setting than the normal linear model, since it will be used
when assessing the fit of generalised linear models which will be introduced in the next chapter.
Suppose that our data (Y1 , xT1 ), . . . , (Yn , xTn ) are generated with Yi independent conditional
on the design matrix X whose rows are the xTi . Suppose that given xi , the true pdf of Yi is
gxi and from a model F := {(fxi (·; θ))ni=1 , θ ∈ Θ ⊆ Rp } the corresponding maximum likelihood
fitted pdf is fxi (·; θ̂). One measure of the quality of fˆxi (·) := fxi (·; θ̂) as an estimate of the true
density gxi is the Kullback–Leibler divergence, K(gxi , fˆxi ) defined as
\[
K(g_{x_i}, \hat{f}_{x_i}) := \int_{-\infty}^{\infty} [\log\{g_{x_i}(y)\} - \log\{\hat{f}_{x_i}(y)\}]\, g_{x_i}(y)\, dy.
\]

For an overall measure of fit, we can consider


\[
\bar{K} := \frac{1}{n} \sum_{i=1}^{n} K(g_{x_i}, \hat{f}_{x_i}).
\]

One can show via Jensen’s inequality that K̄ ≥ 0 with equality if and only if each gxi =
fˆxi (almost surely). Thus if K̄ is low, we have a good fit. Given a collection of different
fitted densities for the data, it is therefore desirable to select that which minimises K̄. This is
equivalent to minimising
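\[
\tilde{K} := -\frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \log\{\hat{f}_{x_i}(y)\}\, g_{x_i}(y)\, dy
= -\frac{1}{n} \sum_{i=1}^{n} E_{Y_i^* \sim g_{x_i}}[\log\{\hat{f}_{x_i}(Y_i^*)\} \mid Y_i].
\]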

Of course, we cannot compute K̄ or K̃ from the data since this requires knowledge of gxi
for i = 1, . . . , n. However, it can be shown that it is possible to estimate E(K̃) (where the
expectation is over the randomness in the fˆxi ). Akaike’s information criterion (AIC) defined as

\[
\mathrm{AIC} := -2\ell(\hat\theta) + 2p
= 2 \times (-\text{maximised log-likelihood} + \text{number of parameters in the model})
\]

satisfies E(AIC)/n ≈ 2E(K̃) for large n, provided the true densities gxi , i = 1, . . . , n are
contained in the model F.
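In the normal linear model where X is n by p with full column rank, AIC amounts to

\[
n\{1 + \log(2\pi\hat\sigma^2)\} + 2(p + 1),
\]

where $\hat\sigma^2 = \|(I - P)Y\|^2/n$ is the maximum likelihood estimate of the error variance. Thus the best
set of variables to use according to the AIC method is determined by minimising
$n \log(\hat\sigma) + p$ across all candidate models.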

*Corrected information criterion*


In fact we may form an unbiased estimate of 2nE(K̃) in the normal linear model. Suppose
we have computed β̂ from data Y generated by Y = Xβ + ε with ε ∼ Nn (0, σ 2 I). Now let
Y ∗ = Xβ + ε∗ where ε∗ ∼ Nn (0, σ 2 I) and ε∗ and ε are independent. Then
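\[
\begin{aligned}
2nE(\tilde{K}) &= E\left[ E\left\{ n\log(2\pi\hat\sigma^2) + \frac{\|Y^* - X\hat\beta\|^2}{\hat\sigma^2} \,\middle|\, Y \right\} \right] \\
&= E\{n\log(2\pi\hat\sigma^2)\} + E\left( \frac{n\sigma^2 + \|X\beta - X\hat\beta\|^2}{\hat\sigma^2} \right).
\end{aligned}
\]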

Fact: If $Z \sim \chi^2_k$ with k > 2 then $E(Z^{-1}) = (k - 2)^{-1}$. Since $\hat\sigma^2$ and
$\|X\beta - X\hat\beta\|^2 = \|P\varepsilon\|^2$ are independent, the second expectation in the display above equals

\[
\frac{n(n + p)}{n - p - 2},
\]

provided n > p + 2. Thus an unbiased estimator of 2nE(K̃) is

\[
n\log(2\pi\hat\sigma^2) + \frac{n(n + p)}{n - p - 2}.
\]

The corrected information criterion, AICc , is given by

\[
\mathrm{AIC}_c = n\log(2\pi\hat\sigma^2) + n\,\frac{1 + p/n}{1 - (p + 2)/n}.
\]

Note that

\[
\begin{aligned}
n\,\frac{1 + p/n}{1 - (p + 2)/n} &= n\left( 1 + 2\,\frac{(p + 1)/n}{1 - (p + 2)/n} \right) \\
&= n + 2(p + 1)\,\frac{1}{1 - (p + 2)/n}.
\end{aligned}
\]

Thus when p/n is small, AICc ≈ AIC in the case of the normal linear model.
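
The following Python sketch evaluates both criteria for the normal linear model using the formulas above. The function names and the numerical inputs are hypothetical. Subtracting the two expressions gives AICc − AIC = 2(p + 1)(p + 2)/(n − p − 2), which does not depend on σ̂ and shrinks as n grows with p fixed.

```python
import math


def aic_normal(n, p, sigma2_hat):
    """AIC in the normal linear model: n{1 + log(2 pi sigma2_hat)} + 2(p + 1)."""
    return n * (1 + math.log(2 * math.pi * sigma2_hat)) + 2 * (p + 1)


def aicc_normal(n, p, sigma2_hat):
    """Corrected criterion; requires n > p + 2."""
    if n <= p + 2:
        raise ValueError("AICc needs n > p + 2")
    return n * math.log(2 * math.pi * sigma2_hat) + n * (1 + p / n) / (1 - (p + 2) / n)


def gap(n, p, sigma2_hat=1.0):
    """AICc - AIC; the sigma2_hat terms cancel."""
    return aicc_normal(n, p, sigma2_hat) - aic_normal(n, p, sigma2_hat)


gap_small_n = gap(50, 3)
gap_large_n = gap(1000, 3)
```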

Orthogonality
One way to use the above model selection criteria is to fit each of the $2^{p-1}$ submodels that can
be created using our design matrix (assuming we include an intercept every time and the first
column of X is a column of 1’s) and pick the one that seems best based on our criterion of
choice. However, if p is reasonably large, this becomes a very computationally intensive task.

One situation where such an approach is feasible is when the columns of X are orthogonal.
Indeed, more generally, if X can be partitioned as X = (X0 X1 ) with the vector of coefficients
correspondingly partitioned as β = (β0T , β1T )T , we say that β0 and β1 are orthogonal sets of
parameters if X0T X1 = 0. Then
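\[
\begin{aligned}
\hat\beta &= \left\{ \begin{pmatrix} X_0^T \\ X_1^T \end{pmatrix} (X_0 \;\; X_1) \right\}^{-1} \begin{pmatrix} X_0^T \\ X_1^T \end{pmatrix} Y \\
&= \begin{pmatrix} (X_0^T X_0)^{-1} & 0 \\ 0 & (X_1^T X_1)^{-1} \end{pmatrix} \begin{pmatrix} X_0^T \\ X_1^T \end{pmatrix} Y \\
&= \begin{pmatrix} (X_0^T X_0)^{-1} X_0^T Y \\ (X_1^T X_1)^{-1} X_1^T Y \end{pmatrix} = \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix}.
\end{aligned}
\]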

If all the columns of X are orthogonal, we can easily find the best fitting model (in terms of the
RSS) with p0 variables. We simply sort the $\|\hat\beta_j X_j\|^2 = (X_j^T Y)^2/\|X_j\|^2$ (excluding the intercept
term) in decreasing order, and pick variables corresponding to the first p0 terms. This works
because letting XS for S ⊆ {1, . . . , p} be the matrix formed from the columns of X indexed by
S, and writing PS for the projection on to the column space of XS ,
\[
\|(I - P_S)Y\|^2 = \Big\| Y - \sum_{j \in S} \hat\beta_j X_j \Big\|^2 = \|Y\|^2 - \sum_{j \in S} \|\hat\beta_j X_j\|^2.
\]

Exact orthogonality is of course unlikely to occur unless we have designed the design matrix X
ourselves, either through choosing the values of the original covariates, or through transforming
them in particular ways. A very common example of the latter is mean-centring each variable
before adding an intercept term, so the intercept coefficient is then orthogonal to the rest of the
coefficients.
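
A small Python sketch of the orthogonal case follows. The design (an intercept and three mutually orthogonal ±1 columns) and the response are hypothetical. Each non-intercept column is ranked by its contribution $(X_j^T Y)^2/\|X_j\|^2$, and the RSS formula above is checked against a direct computation of the residuals.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical design with mutually orthogonal columns; column 0 is the intercept.
columns = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],
    [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
]
y = [3.1, 0.9, 3.0, 1.2, 5.2, 2.8, 4.9, 3.1]


def contribution(j):
    """||beta_hat_j X_j||^2 = (X_j^T Y)^2 / ||X_j||^2 for an orthogonal design."""
    return dot(columns[j], y) ** 2 / dot(columns[j], columns[j])


# Rank the non-intercept columns by how much each reduces the RSS.
ranked = sorted(range(1, len(columns)), key=contribution, reverse=True)


def rss_from_formula(subset):
    return dot(y, y) - sum(contribution(j) for j in subset)


def rss_direct(subset):
    fitted = [0.0] * len(y)
    for j in subset:
        beta_j = dot(columns[j], y) / dot(columns[j], columns[j])
        fitted = [f + beta_j * c for f, c in zip(fitted, columns[j])]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


p0 = 2
best_subset = [0] + ranked[:p0]  # intercept plus the top p0 variables
```

No submodel is ever refitted here: the ranking alone determines the best model of each size.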

*Forward and backward selection*


When the design matrix does not have orthogonal columns another strategy to avoid a search
through all submodels is a forward selection approach.

Forward selection.

1. Start by fitting an intercept only model: call this S0 .

2. Add to the current model the predictor variable that reduces the residual sum of squares the
most.

3. Continue step 2 until all predictor variables have been chosen or until a large number of
predictor variables has been selected. This produces a sequence of sub-models S0 ⊂ S1 ⊂
S2 ⊂ · · · .

4. Pick a model from the sequence of models created using either AIC or R2 based criteria
(or something better!).
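
Steps 1 to 3 can be sketched in Python as follows. This is one possible implementation, not a prescribed one: it keeps the current residual vector and orthogonalises the unused predictors against each chosen column (Gram–Schmidt), so that the reduction in RSS from adding an orthogonalised column x is $(x^T r)^2/\|x\|^2$, where r is the current residual. The predictor names and data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_component(v, q):
    """Return v minus its projection on to the direction q."""
    coef = dot(v, q) / dot(q, q)
    return [a - coef * b for a, b in zip(v, q)]


def forward_selection(predictors, y):
    """Return the path [(chosen names, RSS), ...], starting at the intercept-only model."""
    ones = [1.0] * len(y)
    resid = remove_component(y, ones)  # residuals of the intercept-only fit
    remaining = {name: remove_component(col, ones) for name, col in predictors.items()}
    chosen, path = [], [([], dot(resid, resid))]
    while remaining:
        def rss_drop(name):
            col = remaining[name]
            norm2 = dot(col, col)
            return dot(col, resid) ** 2 / norm2 if norm2 > 1e-12 else 0.0

        best = max(remaining, key=rss_drop)
        q = remaining.pop(best)
        if dot(q, q) <= 1e-12:  # remaining columns add nothing new
            break
        resid = remove_component(resid, q)
        remaining = {name: remove_component(col, q) for name, col in remaining.items()}
        chosen.append(best)
        path.append((list(chosen), dot(resid, resid)))
    return path


# Hypothetical predictors and response.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [0.5, 2.0, 1.0, 3.0, 0.0, 2.5],
    "x3": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
path = forward_selection(predictors, y)
```

Each entry of `path` is one of the nested models S0 ⊂ S1 ⊂ · · · together with its RSS; step 4 would then compare these using a criterion such as AIC.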

An alternative is:

Backward selection.

1. Fit the largest model available (i.e. include all predictors) and call this S0 .

2. Exclude the predictor variable whose removal from the current model increases the residual
sum of squares the least.

3. Continue step 2 until all predictor variables have been removed (or a large number of
predictor variables have been removed). This produces a sequence of submodels S0 ⊃
S1 ⊃ S2 ⊃ · · · .

4. Finally pick a model from the sequence as with forward selection.

*Inference after model selection*


Once a model has been selected, it is tempting to simply pretend that the variables in the sub-
model were the only ones that were ever collected and then proceed with constructing confidence
intervals and using other inferential tools. But this ignores the fact that the data has already
been used to select the submodel. Recall how we can imagine a 1 − α level confidence interval as
being a particular construction of an interval that when applied to data generated through hy-
pothetical repetitions of the “experiment” (keeping X fixed), gives intervals a proportion 1 − α
of which we expect to contain the true parameter. However when the confidence intervals to be
constructed are determined based on the response, we cannot interpret confidence intervals this
way, because different responses would have led to different models being selected. The same
issue arises for other inferential methods. This is a major problem in Statistics and is currently the
subject of a great deal of research.
What can we do to combat this problem? One option is to divide the observations into
two halves. One half can be used to pick the best model and then the other half to construct
confidence intervals, p-values etc. However, because we are only using part of the data to
perform inference, our procedures will lose power. Moreover different splits of the data will give
different results (this is less of a problem since we can try to aggregate results in some way).
An alternative is to try to perform model selection in a way such that for almost all datasets
(i.e. realisations of the response Y ), we expect the same submodel to be selected. In any case,
inferences drawn after model selection must be reported with care: this is a tricky issue with
no easy universally accepted solutions.

1.2.6 Model checking


The validity of the inferences drawn from the normal linear model rests on four assumptions.

(A1) E(εi ) = 0. If this is false, the coefficients in the linear model need to be interpreted with
care. Furthermore, our estimate of σ 2 will tend to be inflated and F -tests may lose power
though they will have the correct size (see example sheet).

(A2) Var(εi ) = σ 2 . This assumption of constant variance is called homoscedasticity, and its
violation (nonconstant variance) is called heteroscedasticity. A violation of this assumption
means the least squares estimates are not as efficient as they could be, and furthermore
hypothesis tests and confidence intervals need not have their nominal levels and coverages
respectively. If the variances of the errors are known up to an unknown multiplicative
constant, weighted least squares can be used (see example sheet).

(A3) Cov(εi , εj ) = 0 for i ≠ j: the errors are uncorrelated. When data are ordered in time
or space, this assumption is often violated. As with heteroscedasticity, the standard
inferential techniques can give misleading results.
(A4) The errors εi are normally distributed. Though the confidence intervals and hypothesis
tests we have studied rest on the assumption of normality, arguments based on the central
limit theorem can be used to show that even when the errors are not normally distributed,
provided (A1–A3) are satisfied, inferences are still asymptotically valid under reasonable
conditions.
A useful way of assessing whether the assumptions above are satisfied is to analyse the
residuals ε̂ := (I − P )Y arising from the model fit. This is usually done graphically rather than
through formal tests. An advantage of the graphical approach is that we can look for many
different signs for departures from the assumptions simultaneously. One potential issue is that
it may not always be clear what indicates a genuine violation of assumptions compared to the
natural variation that one should expect even if the assumptions held.
Note that under (A1), E(ε̂) = 0. It is common to plot the residuals against the fitted values
Ŷi , and also against each of the variables in the design matrix (including those not in the current
model). If (A1) holds, there should not be an obvious trend in the mean of the residuals.
Under (A2) and (A3), Var(ε̂) = σ 2 (I − P ). Define the studentised residuals to be
\[
\hat\eta_i := \frac{\hat\varepsilon_i}{\tilde\sigma \sqrt{1 - p_i}}, \quad \text{where } p_i := P_{ii}, \quad i = 1, \ldots, n.
\]
Provided σ̃ is a good estimate of σ, the variance of η̂i should be approximately 1. A standard
check of the validity of (A2) involves plotting $\sqrt{|\hat\eta_i|}$ against the fitted values.
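If (A1–A4) hold, then we’d expect the η̂i to look roughly like an i.i.d. sample from a N (0, 1)
distribution since
\[
\hat\eta_i \approx \frac{\hat\varepsilon_i}{\sigma \sqrt{1 - p_i}}
\]
and so
\[
\mathrm{Cov}(\hat\eta_i, \hat\eta_j) \approx \frac{-P_{ij}}{\sqrt{(1 - p_i)(1 - p_j)}},
\]
for i ≠ j. When n ≫ p we expect this covariance to be close to 0 because
\[
\frac{1}{n^2} \sum_{i,j} P_{ij}^2 = \frac{1}{n^2}\mathrm{tr}(P^T P) = \frac{1}{n^2}\mathrm{tr}(P) = \frac{1}{n^2} r(P) = \frac{p}{n^2},
\]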

(so the average of the squared entries of P is small).


A good way of checking that the η̂i look roughly standard normal is to look at a Quantile–
Quantile (Q–Q) plot. This involves plotting the order statistics of the sample of η̂i against the
expected order statistics of the normal distribution. Since the latter are rather complicated to
compute, we often approximate the expected value of the ith order statistic Z(i) from a sample
of i.i.d. standard normal random variables Z1 , . . . , Zn , by
\[
E(Z_{(i)}) \approx \Phi^{-1}\Big( \frac{i}{n + 1} \Big).
\]
In summary, we
1. sort the studentised residuals, η̂1 , . . . , η̂n , into increasing order, and
2. plot them against $\{\Phi^{-1}(i/(n + 1)) : i = 1, \ldots, n\}$.
We expect an approximately straight line through the origin with gradient 1 if our normality
assumption is correct.
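
The two steps above can be sketched in Python using `statistics.NormalDist` from the standard library for $\Phi^{-1}$. The function name `qq_points` and the residual values are hypothetical; plotting the resulting pairs is left to whichever graphics tool is at hand.

```python
from statistics import NormalDist


def qq_points(studentised_residuals):
    """Pair the approximate expected normal order statistics with the sorted residuals."""
    n = len(studentised_residuals)
    ordered = sorted(studentised_residuals)
    std_normal = NormalDist()  # N(0, 1)
    theoretical = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical studentised residuals.
points = qq_points([0.3, -1.2, 0.8, -0.1, 1.5])
```

Each pair in `points` is (theoretical quantile, observed order statistic), so a scatter plot of these pairs is the Q–Q plot.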

Variable transformations
We have already discussed how predictors may be transformed so that models that are nonlinear
in the original data (but linear in the parameter β) still fall within the linear model framework.
Sometimes it can also be helpful to transform the response so that it fits the linear model.
Consider the following model

Yi = exp(xTi β + εi ), i = 1, . . . , n, ε ∼ Nn (0, σ 2 I).

If we make the transformation Yi ↦ log(Yi ) we will have a linear model in the logged response.
The Box–Cox family of transformations is given by
\[
y \mapsto y^{(\lambda)} :=
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\[1ex]
\log(y) & \text{if } \lambda = 0.
\end{cases}
\]
Typically one plots the log-likelihood of the transformed data $(y_1^{(\lambda)}, \ldots, y_n^{(\lambda)})$ as a function of λ
and then selects a value of λ which lies close to the λ that maximises the log-likelihood, and
still gives a model with interpretable parameters.
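
A minimal Python sketch of the transformation itself is given below (the function name `box_cox` is our own). It only evaluates the map for a positive y; computing the profile log-likelihood over λ is not attempted here. The λ = 0 case is the limit of the λ ≠ 0 case, which the final line illustrates numerically.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a single positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("the Box-Cox transform needs y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


identity_shift = box_cox(5.0, 1.0)     # lam = 1 is y - 1
square_root_case = box_cox(4.0, 0.5)   # (sqrt(4) - 1) / 0.5
near_zero = box_cox(3.0, 1e-8)         # close to log(3)
```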

Unusual observations
Often we may find that though the bulk of our data satisfy the assumptions (A1–A4) and fit the
model well, there are a few observations that do not. These are called outliers. It is important
to detect these so that they can be excluded when fitting the model, if necessary. A more subtle
way in which an observation can be unusual is if it is unusual in the predictor space i.e. it has
an unusual x value; it is this we discuss first.

Leverage. Recall that the fitted values Ŷ satisfy

Ŷi = (P Y )i = Pi1 Y1 + · · · + Pii Yi + · · · + Pin Yn .

The value pi := Pii is called the leverage of the ith observation. It measures the contribution
that Yi makes to the fitted value Ŷi . It can be shown that 0 ≤ pi ≤ 1. Since Var(εˆi ) = σ 2 (1−pi ),
values of pi close to 1 force the regression line (or plane) to pass very close to Yi .
The idea of leverage is about the potential for an observation to have a large effect on the
fit; if the observation does not have an unusual response value, it is possible that removing the
observation will change the estimated regression coefficients very little. However in this case,
the R2 and the results of an F -test with the null hypothesis as the intercept only model may
still change a lot.
The relationship $\sum_{i=1}^{n} p_i = \mathrm{tr}(P) = p$ motivates a rule of thumb that says the influence of
the ith observation may be of concern if pi > 3p/n. When the design matrix consists of just a
single variable and a column of 1’s representing an intercept term (as the first column), it can
be shown that
\[
p_i = \frac{1}{n} + \frac{(X_{i2} - \bar{X}_2)^2}{\sum_{k=1}^{n} (X_{k2} - \bar{X}_2)^2},
\]
where $\bar{X}_2 := \frac{1}{n}\sum_{i=1}^{n} X_{i2}$.
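
The single-covariate formula is easy to evaluate directly. In the Python sketch below the covariate values are hypothetical, with one point deliberately placed far from the rest; the leverages sum to p = 2 and only the distant point exceeds the 3p/n threshold.

```python
def leverages(x):
    """Leverages p_i for a design with an intercept and the single covariate x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical covariate values; the last one is unusual in predictor space.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
p = 2
lev = leverages(x)
flagged = [i for i, p_i in enumerate(lev) if p_i > 3 * p / len(x)]
```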

Cook’s distance. The Cook’s distance Di of the observation (Yi , xi ) is defined as
\[
D_i := \frac{\frac{1}{p}\|X(\hat\beta_{(-i)} - \hat\beta)\|^2}{\tilde\sigma^2},
\]
where $\hat\beta_{(-i)}$ is the OLS estimate of β when omitting observation (Yi , xi ).
The interpretation of Cook’s distance is that if Di = Fp,n−p (α) then omitting the ith data
point moves the m.l.e. of β to the edge of the (1 − α)-level confidence set for β.
Note that we do not need to fit n + 1 linear models to compute all of the Cook’s distances,
since in fact
\[
D_i = \frac{1}{p}\,\frac{p_i}{1 - p_i}\,\hat\eta_i^2 \quad \text{(see example sheet).}
\]
Thus Cook’s distance combines the studentised fitted residuals with the leverage as a measure
of influence. A rule of thumb is that we should be concerned about the influence of (Yi , xi ) if
Di > Fp,n−p (0.5).
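
The shortcut can be checked numerically. The Python sketch below works with a simple linear regression on hypothetical data and assumes that σ̃² is the usual unbiased estimate RSS/(n − p). It computes the Cook's distances once from the leverages and studentised residuals, and once by actually refitting without each observation.

```python
import math


def ols(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def full_fit_summary(x, y, p=2):
    n = len(x)
    b0, b1 = ols(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in resid) / (n - p)  # assumed definition of sigma-tilde^2
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return b0, b1, resid, sigma2, lev


def cooks_from_formula(x, y, p=2):
    _, _, resid, sigma2, lev = full_fit_summary(x, y, p)
    eta = [r / math.sqrt(sigma2 * (1 - h)) for r, h in zip(resid, lev)]
    return [h / (1 - h) * e * e / p for h, e in zip(lev, eta)]


def cooks_by_refitting(x, y, p=2):
    b0, b1, _, sigma2, _ = full_fit_summary(x, y, p)
    distances = []
    for i in range(len(x)):
        c0, c1 = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # ||X(beta_(-i) - beta_hat)||^2: change in fitted values at all n points.
        shift = sum(((c0 + c1 * xk) - (b0 + b1 * xk)) ** 2 for xk in x)
        distances.append(shift / (p * sigma2))
    return distances


# Hypothetical data with one high-leverage point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 10.0]
d_formula = cooks_from_formula(x, y)
d_refit = cooks_by_refitting(x, y)
```

The two lists agree up to rounding, although the first needs only the single full fit.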

Chapter 2

Exponential families and generalised linear models

2.1 Non-normal responses


Suppose we are interested in predicting the probability that an internet advert gets clicked by
web surfers visiting the page where it is displayed, based on its colour, size, position, font used
and other information. Given a vector of responses Y ∈ {0, 1}n (1 = ‘clicked’ and 0 = ‘didn’t
click’) and a design matrix X collecting together the relevant information, a linear model would
attempt to find β̂ such that Y and X β̂ are close. However the fitted values do not relate well
to probabilities that Yi = 1: indeed there is no guarantee that we even have X β̂ ∈ [0, 1]n .
Really, we would like to model
Yi ∼ Bin(1, µi ),
with µi related to some function of the predictors xi whose range is contained in [0, 1].
Generalised linear models (GLMs) extend linear models to deal with situations such as that
discussed above. We can think of a normal linear model as consisting of three components.

(i) The random component: Y1 , . . . , Yn are independent normal random variables, with Yi
having mean µi and variance σ 2 .

(ii) The systematic component: a linear predictor η = (η1 , . . . , ηn )T , where ηi = xTi β.

(iii) The link between the random and systematic components: µi = ηi .

Of course this is an unnecessarily wasteful way to write out the linear model, but it is suggestive
of generalisations.
GLMs extend linear models in (i) and (iii) above, allowing different classes of distributions
for the response variables and allowing a more general link:

ηi = g(µi )

where g is a strictly increasing, twice differentiable function.

2.2 Exponential families


We want to consider a class of distributions large enough to include the normal, binomial
and other familiar distributions, but which is still relatively simple, both conceptually and
computationally.

Why is this a useful endeavour? We could just work with a particular family of distributions
for the response that is useful for our own purposes, and develop algorithms for estimating
parameters and theory for the distributions of our estimates (just as we did for the normal
linear model). However, if we work in a more general framework, then we may be able to
formulate inference procedures and develop computational techniques that are applicable for a
number of families of distributions.
We begin our quest for such a general framework with the concept of an exponential family.
We motivate the idea by starting with a single density or probability mass function f0 (y),
y ∈ Y ⊆ R. Rather than always writing “density or probability mass function”, we will use
the term “model function” to mean either a density function or p.m.f. (Of course, those of you
who attended Probability and Measure will know that p.m.f.’s are just densities with respect
to counting measure, so we could equally well use “density” throughout).
We will require that f0 be a non-degenerate model function, that is if Y has model function
f0 then Var(Y ) > 0. For example, f0 might be the uniform density on the unit interval
Y = [0, 1], or the probability mass function taking the value 1/2 at each point of Y = {0, 1}.
We can generate a whole family of model functions based on f0 via exponential tilting:

\[
f(y; \theta) = \frac{e^{y\theta} f_0(y)}{\int e^{y'\theta} f_0(y')\,dy'}, \quad y \in \mathcal{Y}.
\]

We can only consider values of θ for which the integral in the denominator is finite. Note that
the denominator is precisely the moment generating function of f0 evaluated at θ. Let us briefly
recall some facts about moment generating functions before proceeding.

The moment and cumulant generating functions. The moment generating function
(m.g.f.) of a random variable, or equivalently its model function, is M (t) := E(etY ). The
cumulant generating function (c.g.f.) is the logarithm of the m.g.f.: K(t) := log(M (t)). The set
of values where these functions are finite is an interval containing 0. If this contains an open
interval about 0, then we have the series expansions

\[
\begin{aligned}
M(t) &= \sum_{r=0}^{\infty} E(Y^r)\,\frac{t^r}{r!}, \\
K(t) &= \sum_{r=0}^{\infty} \kappa_r\,\frac{t^r}{r!},
\end{aligned}
\]

where κr is known as the rth cumulant. Standard theory about power series tells us that

\[
E(Y^r) = M^{(r)}(0), \qquad \kappa_r = K^{(r)}(0).
\]

Check that κ1 = E(Y ) and κ2 = Var(Y ).


Let K now be the c.g.f. of f0 and suppose Θ := {θ : K(θ) < ∞} is an open interval
containing 0. Then the class of model functions

{f (y; θ) : θ ∈ Θ}

is called the natural exponential family (of order 1) generated by f0 , and is an example of
an exponential family. With a different generating model function f0 , we can get a different
exponential family.

The parameter θ is called the natural parameter and Θ is called the natural parameter space.
Note that we may write $f(y; \theta) = e^{\theta y - K(\theta)} f_0(y)$, so $\int_{\mathcal{Y}} e^{\theta y - K(\theta)} f_0(y)\,dy = 1$ for all θ ∈ Θ.
The mean of f (y; θ) is of course related to the parameter θ, and it is often useful to
reparametrise the family of model functions in terms of their means. To discuss this, let us
first find the mean and variance of f (y; θ), i.e. the first and second cumulants.
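The m.g.f. of f (·; θ), M (t; θ), is
\[
\begin{aligned}
M(t; \theta) &= \int_{\mathcal{Y}} e^{ty} e^{\theta y - K(\theta)} f_0(y)\,dy \\
&= e^{K(\theta + t) - K(\theta)} \int_{\mathcal{Y}} e^{(\theta + t)y - K(\theta + t)} f_0(y)\,dy \\
&= e^{K(\theta + t) - K(\theta)}, \quad \text{for } \theta, \theta + t \in \Theta.
\end{aligned}
\]
Thus if Y has model function f (y; θ), then, writing K(t; θ) := log M (t; θ) = K(θ + t) − K(θ) for
its c.g.f.,
\[
E_\theta(Y) = \frac{d}{dt} K(t; \theta)\Big|_{t=0} = K'(\theta), \qquad
\mathrm{Var}_\theta(Y) = \frac{d^2}{dt^2} K(t; \theta)\Big|_{t=0} = K''(\theta).
\]
It can be shown that since f0 was assumed to be non-degenerate, so must be every f (y; θ).
Then $\mu(\theta) := E_\theta(Y) = K'(\theta)$ satisfies
\[
\mu'(\theta) = K''(\theta) > 0,
\]
so µ is a smooth, strictly increasing function from Θ to M := {µ(θ) : θ ∈ Θ} (M for ‘mean
space’), with inverse function θ := θ(µ). This leads to the mean value parametrisation:
\[
f(y; \mu) = e^{\theta(\mu) y - K(\theta(\mu))} f_0(y), \quad y \in \mathcal{Y},\ \mu \in \mathcal{M}.
\]
The function V : M → (0, ∞) defined by $V(\mu) = \mathrm{Var}_{\theta(\mu)}(Y) = K''(\theta(\mu))$ is called the variance
function.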

Examples.

1. Let f0 = φ, the standard normal density. Then $M(\theta) = e^{\theta^2/2}$, θ ∈ R, so $K(\theta) = \frac{1}{2}\theta^2$. Thus
the natural exponential family generated by the standard normal density is
\[
f(y; \theta) = e^{\theta y - \theta^2/2}\,\frac{1}{\sqrt{2\pi}}\,e^{-y^2/2} = \frac{1}{\sqrt{2\pi}}\,e^{-(y - \theta)^2/2}, \quad y \in \mathbb{R},\ \theta \in \mathbb{R}.
\]
This is the N (θ, 1) family. Clearly µ(θ) = θ, θ(µ) = µ, M = R and V (µ) = 1, as can be
verified by taking derivatives of K(θ).

2. Let f0 denote the Pois(1) p.m.f.:
\[
f_0(y) = e^{-1}\,\frac{1}{y!}, \quad y \in \{0, 1, \ldots\}.
\]
Then
\[
M(\theta) = e^{-1} \sum_{r=0}^{\infty} \frac{e^{\theta r}}{r!} = \exp(e^{\theta} - 1).
\]
Thus with exponential tilting, we get
\[
f(y; \theta) = e^{\theta y - \exp(\theta)}\,\frac{1}{y!} = \frac{(e^{\theta})^y \exp(-e^{\theta})}{y!}, \quad y \in \{0, 1, \ldots\},\ \theta \in \mathbb{R}.
\]
This is the $\mathrm{Pois}(e^{\theta})$ family of distributions. The mean function is $\mu = e^{\theta}$ with inverse
θ = log(µ), and the variance function is V (µ) = µ; the mean space is M = (0, ∞).
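
The Poisson example can be checked numerically. The Python sketch below tilts the Pois(1) p.m.f. on a truncated support (the truncation at 60 is our own choice, made so that the neglected tail is negligible for the θ used) and compares the result with the Pois(e^θ) p.m.f., mean and variance.

```python
import math


def tilt(f0, support, theta):
    """Exponentially tilt the model function f0 on a finite support."""
    weights = [math.exp(theta * y) * f0(y) for y in support]
    normaliser = sum(weights)  # the m.g.f. of f0 evaluated at theta
    return [w / normaliser for w in weights]


def pois1_pmf(y):
    return math.exp(-1.0) / math.factorial(y)


support = list(range(60))  # truncated support; the tail mass beyond 59 is negligible
theta = 0.7
tilted = tilt(pois1_pmf, support, theta)

tilted_mean = sum(y * f for y, f in zip(support, tilted))
tilted_var = sum((y - tilted_mean) ** 2 * f for y, f in zip(support, tilted))

lam = math.exp(theta)
pois_lam_at_3 = math.exp(-lam) * lam ** 3 / math.factorial(3)
```

Both `tilted_mean` and `tilted_var` come out equal to e^θ up to rounding, matching K′(θ) = K″(θ) = e^θ.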

Technical conditions. Why did we impose the technical conditions that the set of values
where the c.g.f. of f0 is finite, Θ, is an open interval containing 0? Note that then given any
θ ∈ Θ, {t : θ + t ∈ Θ} = Θ − θ is an open interval containing 0. Thus the result we have shown
that
K(θ + t) − K(θ)
is the c.g.f. of f (·; θ) is valid for all t ∈ Θ − θ, and as this has a power series expansion, we can
recover the cumulants by taking derivatives and evaluating at 0.

2.3 Exponential dispersion families


The natural exponential families are not broad enough for our purposes. We should like more
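control over the variance. A family of model functions of the form
\[
f(y; \theta, \sigma^2) = a(\sigma^2, y) \exp\Big[ \frac{1}{\sigma^2}\{\theta y - K(\theta)\} \Big], \quad y \in \mathcal{Y},\ \theta \in \Theta,\ \sigma^2 \in \Phi \subseteq (0, \infty), \tag{2.3.1}
\]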

where

• a(σ 2 , y) is a known positive function (c.f. f0 (y) that generated the exponential family),

• Θ is an open interval,

and in addition the model functions are non-degenerate, is called an exponential dispersion
family (of order 1). The parameter σ 2 is called the dispersion parameter. (Note many authors
simply call the family of model functions in (2.3.1) an example of an exponential family.)
Let K(·; θ, σ 2 ) be the c.g.f. of the model function f (y; θ, σ 2 ) in (2.3.1). It can be shown (see
example sheet) that
\[
K(t; \theta, \sigma^2) = \frac{1}{\sigma^2}\{K(\sigma^2 t + \theta) - K(\theta)\},
\]
for θ + σ 2 t ∈ Θ. Since the set of values where K(·; θ, σ 2 ) is finite contains an open interval
about 0, if Y has model function (2.3.1) then

\[
E_{\theta,\sigma^2}(Y) = K'(\theta), \qquad \mathrm{Var}_{\theta,\sigma^2}(Y) = \sigma^2 K''(\theta).
\]

As before, we may define $\mu(\theta) := K'(\theta)$. Since $\mathrm{Var}_{\theta,\sigma^2}(Y) > 0$ (by non-degeneracy of the
model functions), $K''(\theta) > 0$, so we can define an inverse function to µ, θ(µ). Further define
M := {µ(θ) : θ ∈ Θ} and variance function V : M → (0, ∞) given by $V(\mu) := K''(\theta(\mu))$
(though now the variance of the model function is actually σ 2 V (µ)).

Examples.

1. Consider the family N (ν, τ 2 ) where ν ∈ R and τ 2 ∈ (0, ∞). We may write the densities
as
\[
f(y; \nu, \tau^2) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\Big( -\frac{y^2}{2\tau^2} \Big) \exp\Big\{ \frac{1}{\tau^2}\Big( \nu y - \frac{1}{2}\nu^2 \Big) \Big\},
\]
showing that this family is an exponential dispersion family with θ = ν, σ 2 = τ 2 and
$K(\theta) = \theta^2/2$. Of course we know that µ(θ) = θ and V (µ) = 1, but we can also check this by
differentiating K.

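2. Let Z ∼ Bin(n, p). Then $Y := Z/n \sim \frac{1}{n}\mathrm{Bin}(n, p)$ has p.m.f.
\[
f(y; p) = \binom{n}{ny} p^{ny} (1 - p)^{n(1 - y)}, \quad y \in \{0, 1/n, 2/n, \ldots, 1\}.
\]
Consider the family of p.m.f.’s of the form above with p ∈ (0, 1) and n ∈ N. To show this
is an exponential dispersion family, we write
\[
\begin{aligned}
f(y; p) &= \exp\Big\{ ny \log\Big( \frac{p}{1 - p} \Big) + n \log(1 - p) \Big\} \binom{n}{ny} \\
&= \exp\Big\{ \frac{y\theta - \log(1 + e^{\theta})}{\sigma^2} \Big\} \binom{1/\sigma^2}{y/\sigma^2},
\end{aligned}
\]
with σ 2 = 1/n, θ = log{p/(1 − p)} and $K(\theta) = \log(1 + e^{\theta})$. To find the mean function
µ(θ), we differentiate K:
\[
\mu(\theta) = \frac{d}{d\theta} \log(1 + e^{\theta}) = \frac{e^{\theta}}{1 + e^{\theta}} \quad (= p),
\]
with inverse θ(µ) = log{µ/(1 − µ)}. Differentiating once more we see that
\[
\begin{aligned}
V(\mu) &= \frac{(1 + e^{\theta(\mu)}) e^{\theta(\mu)} - (e^{\theta(\mu)})^2}{(1 + e^{\theta(\mu)})^2} \\
&= \frac{e^{\theta(\mu)}}{1 + e^{\theta(\mu)}} \Big( 1 - \frac{e^{\theta(\mu)}}{1 + e^{\theta(\mu)}} \Big) \\
&= \mu(1 - \mu).
\end{aligned}
\]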
Here M = (0, 1) and, since σ 2 = 1/n, Φ = {1/n : n ∈ N}.
3. Consider the gamma family of densities,
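\[
f(y; \alpha, \lambda) = \frac{\lambda^{\alpha} y^{\alpha - 1} e^{-\lambda y}}{\Gamma(\alpha)} \quad \text{for } y > 0 \text{ and } \alpha, \lambda > 0.
\]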
It is not immediately clear how to write this in exponential dispersion family form, so let us
take advantage of the fact that we know the mean and variance of a gamma distribution. If
Y has the gamma density then Eα,λ (Y ) = α/λ and Varα,λ (Y ) = α/λ2 . If this family were
an exponential dispersion family then µ = α/λ and σ 2 V (µ) = α/λ2 . It is not clear what
we should take as σ 2 . However, the y α−1 term would need to be absorbed by the a(y, σ 2 )
in the definition of the EDF. Thus we can try taking σ 2 as a function of α alone. What
function must this be? Imagine that α = λ, so σ 2 V (µ) = σ 2 × constant ∝ 1/λ = 1/α.
Thus we must have σ 2 = α−1 (or some constant multiple of it). In the new parametrisation
where α = σ −2 and λ = (µσ 2 )−1
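\[
\begin{aligned}
f(y; \mu, \sigma^2) &= \frac{y^{\sigma^{-2} - 1} \exp\{-y/(\sigma^2 \mu)\}}{(\sigma^2 \mu)^{\sigma^{-2}}\, \Gamma(\sigma^{-2})} \\
&= \frac{y^{\sigma^{-2} - 1}}{(\sigma^2)^{\sigma^{-2}}\, \Gamma(\sigma^{-2})} \exp\Big\{ \frac{1}{\sigma^2}\Big( -\frac{y}{\mu} - \log \mu \Big) \Big\} \\
&= \frac{y^{\sigma^{-2} - 1}}{(\sigma^2)^{\sigma^{-2}}\, \Gamma(\sigma^{-2})} \exp\Big[ \frac{1}{\sigma^2}\{y\theta - K(\theta)\} \Big],
\end{aligned}
\]
where $\theta(\mu) = -\mu^{-1}$ and $K(\theta) = \log(-\theta^{-1})$. We found the variance function to be
$V(\mu) = \mu^2$, and both M and Φ are (0, ∞).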

2.4 Generalised linear models
Having finally defined the concept of an exponential dispersion family, we can now define what
a generalised linear model is. A generalised linear model for observations (Y1 , x1 ), . . . , (Yn , xn )
is defined by the following properties.

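1. Y1 , . . . , Yn are independent, each Yi having model function in the same exponential
dispersion family of the form
\[
f(y; \theta_i, \sigma_i^2) = a(\sigma_i^2, y) \exp\Big[ \frac{1}{\sigma_i^2}\{\theta_i y - K(\theta_i)\} \Big], \quad y \in \mathcal{Y},\ \theta_i \in \Theta,\ \sigma_i^2 \in \Phi \subseteq (0, \infty),
\]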

with σi2 = σ 2 ai where a1 , . . . , an are known and ai > 0, though σ 2 may be unknown. Note
that the functions a and K must be fixed for all i.

2. The mean µi of the ith observation and the ith component of the linear predictor ηi := xTi β
are linked by the equation

g(µi ) = ηi , i = 1, . . . , n,

where g is a strictly increasing, twice differentiable function called the link function.

2.4.1 Choice of link function


Note that the only allowable values of β are those such that g −1 (xTi β) is in the mean space M of
the exponential dispersion family. Allowing a non-identity link function is particularly useful
when M does not coincide with R, as for the Poisson, gamma and binomial model functions.
This is because if we choose g to map M to the whole real line, then no restriction needs to be
placed on β.
For example, if we had
\[
Y_i \sim \frac{1}{n_i}\mathrm{Bin}(n_i, \mu_i),
\]
then we know that M = (0, 1). A popular choice for g in this situation is the logit function:
g(µ) = log{µ/(1 − µ)}.
Recall the function θ(µ) from the definition of the exponential dispersion family (the inverse
of the mean function). The choice
g(µ) = θ(µ)
is called the canonical link function. In view of this, we may also refer to the function θ(µ) as
the canonical link function. The logit function is the canonical link when the Yi have scaled
binomial distributions as above.
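
For the scaled binomial family, the pieces fit together as in the Python sketch below (function names are our own). It checks numerically that the logit inverts the mean function K′, and that a central finite-difference approximation to K″ (step size chosen by us) agrees with the variance function µ(1 − µ).

```python
import math


def K(theta):
    """Cumulant function of the scaled binomial family: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))


def mean_function(theta):
    """mu(theta) = K'(theta) = e^theta / (1 + e^theta)."""
    return 1.0 / (1.0 + math.exp(-theta))


def logit(mu):
    """Canonical link theta(mu) = log{mu / (1 - mu)}."""
    return math.log(mu / (1.0 - mu))


def variance_function(mu):
    return mu * (1.0 - mu)


theta = 0.3
mu = mean_function(theta)
round_trip = logit(mu)  # should recover theta

h = 1e-4  # finite-difference step, an arbitrary small value
numeric_second_derivative = (K(theta + h) - 2 * K(theta) + K(theta - h)) / h ** 2
```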

2.4.2 Likelihood equations


A sensible way to estimate β in a generalised linear model is using maximum likelihood. The
likelihood for a generalised linear model is
\[
L(\beta, \sigma^2; y_1, \ldots, y_n) = \exp\Big\{ \sum_{i=1}^{n} \frac{1}{\sigma^2 a_i} [\theta(\mu_i) y_i - K(\theta(\mu_i))] \Big\} \prod_{i=1}^{n} a(\sigma^2 a_i, y_i),
\]

where $\theta(\mu_i) = \theta(g^{-1}(x_i^T \beta))$.

Using the canonical link function can simplify some calculations. With g the canonical link
function, θ(µi ) = xTi β, so we have log-likelihood
\[
\ell(\beta, \sigma^2; y_1, \ldots, y_n) = \sum_{i=1}^{n} \frac{1}{\sigma^2 a_i}\{y_i x_i^T \beta - K(x_i^T \beta)\} + \sum_{i=1}^{n} \log\{a(\sigma^2 a_i, y_i)\}.
\]

One feature of the log-likelihood above that makes it particularly easy to maximise over β is
that the Hessian is negative semi-definite so the log-likelihood is a concave function of β (for
any fixed σ 2 ). We have
\[
\begin{aligned}
\frac{\partial \ell(\beta, \sigma^2)}{\partial \beta} &= \sum_{i=1}^{n} \frac{x_i}{\sigma^2 a_i}\{y_i - K'(x_i^T \beta)\}, \\
\frac{\partial^2 \ell(\beta, \sigma^2)}{\partial \beta\, \partial \beta^T} &= -\sum_{i=1}^{n} \frac{x_i x_i^T}{\sigma^2 a_i} K''(x_i^T \beta),
\end{aligned}
\]

and $K'' > 0$. This in particular means that as a function of β, the log-likelihood cannot have
multiple local maxima. Indeed, we know that if a maximiser of the log-likelihood, β̂, exists, it
must satisfy
\[
\frac{\partial}{\partial \beta} \ell(\beta, \sigma^2)\Big|_{\beta = \hat\beta} = 0. \tag{2.4.1}
\]

However due to concavity of the log-likelihood, the converse is also true: if β̂ satisfies (2.4.1)
then it must maximise the log-likelihood. Indeed, for any β0 , consider the function

\[
f(t) := \ell(\hat\beta + t(\beta_0 - \hat\beta), \sigma^2).
\]

Note that $f(0) = \ell(\hat\beta, \sigma^2)$ and $f(1) = \ell(\beta_0, \sigma^2)$. A Taylor expansion of f about 0 gives us
\[
f(1) = f(0) + f'(0) + \frac{1}{2} f''(t)
\]
for some t ∈ [0, 1] (note this is a Taylor expansion with a “mean-value” form of the remainder).
Noting that $f'(0) = 0$ by assumption,

\[
f(1) - f(0) = \frac{1}{2} (\beta_0 - \hat\beta)^T\, \frac{\partial^2 \ell(\beta, \sigma^2)}{\partial \beta\, \partial \beta^T}\Big|_{\beta = \tilde\beta}\, (\beta_0 - \hat\beta) \le 0,
\]

where $\tilde\beta := \hat\beta + t(\beta_0 - \hat\beta)$.
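
Because a root of the score equation (2.4.1) is automatically a maximiser, one natural numerical strategy is Newton's method applied to the score. The Python sketch below does this for a Poisson model with the canonical (log) link, where K(θ) = e^θ and σ 2 ai = 1, using an intercept and one covariate. The data, starting value and stopping rule are our own choices, and plain Newton steps are not guaranteed to converge from every starting point; this is an illustration rather than a recommended algorithm.

```python
import math


def loglik(beta, x, y):
    """Poisson log-likelihood with log link, dropping terms free of beta."""
    b0, b1 = beta
    return sum(yi * (b0 + b1 * xi) - math.exp(b0 + b1 * xi) for xi, yi in zip(x, y))


def score(beta, x, y):
    """Score: sum_i x_i {y_i - K'(x_i^T beta)} with x_i = (1, x_i) and K' = exp."""
    b0, b1 = beta
    resid = [yi - math.exp(b0 + b1 * xi) for xi, yi in zip(x, y)]
    return sum(resid), sum(xi * ri for xi, ri in zip(x, resid))


def newton_poisson(x, y, tol=1e-10, max_iter=50):
    beta = (math.log(sum(y) / len(y)), 0.0)  # start at the intercept-only fit
    for _ in range(max_iter):
        u0, u1 = score(beta, x, y)
        if max(abs(u0), abs(u1)) < tol:
            break
        mu = [math.exp(beta[0] + beta[1] * xi) for xi in x]
        # Negative Hessian: sum_i K''(x_i^T beta) x_i x_i^T, a 2 x 2 matrix.
        h00 = sum(mu)
        h01 = sum(m * xi for m, xi in zip(mu, x))
        h11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta + (negative Hessian)^{-1} * score.
        beta = (beta[0] + (h11 * u0 - h01 * u1) / det,
                beta[1] + (h00 * u1 - h01 * u0) / det)
    return beta


# Hypothetical count data.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 4, 7, 12]
beta_hat = newton_poisson(x, y)
```

At `beta_hat` the score is numerically zero, and by the argument above no other β gives a larger log-likelihood.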

2.5 Inference
Having generalised the normal linear model, how do we compute maximum likelihood estimators
and how can we perform inference (i.e. construct confidence sets, perform hypothesis tests)?
These tasks were fairly simple in the normal linear model setting since the maximum likelihood
estimator had an explicit form. In our more general setting, this will not (necessarily) be the
case. Despite this, we can still perform inference and compute m.l.e.’s, but approximations
must be involved in both of these tasks. We first turn to the problem of inference.

2.5.1 The score function
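Consider data (Y1 , xT1 ), . . . , (Yn , xTn ) with the Yi independent given the xi , and suppose Y =
(Y1 , . . . , Yn )T has density in
\[
\{f(y; \theta),\ y \in \mathcal{Y}^n : \theta \in \Theta \subseteq \mathbb{R}^d\} = \Big\{ \prod_{i=1}^{n} f_{x_i}(y_i; \theta),\ y_i \in \mathcal{Y} : \theta \in \Theta \subseteq \mathbb{R}^d \Big\}.
\]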

We will review some theory associated with maximum likelihood estimators in this setting.
Here we simply aim to sketch out the main results; for a rigorous treatment see your Principles
of Statistics notes (or borrow someone’s). In particular, we do not state all the conditions
required for the results to be true (broadly known as “regularity conditions”), but they will all
be satisfied for the generalised linear model setting to which we wish to apply the results.
Let θ̂ be the maximum likelihood estimator of θ (assuming it exists and is unique). If
we cannot write down the explicit form of θ̂ as a function of the data, in order to study its
properties, we must argue from what we do know about the m.l.e.—the fact that it maximises
the likelihood, or equivalently the log-likelihood. This means θ̂ satisfies


\[
\frac{\partial}{\partial \theta}\ell(\theta; Y)\Big|_{\theta=\hat\theta} = 0,
\]

where
\[
\ell(\theta; Y) = \log f(Y; \theta) = \sum_{i=1}^n \log f_{x_i}(Y_i; \theta).
\]

We call the vector of partial derivatives of the log-likelihood the score function, $U(\theta; Y)$:
\[
U_r(\theta; Y) := \frac{\partial}{\partial \theta_r}\ell(\theta; Y).
\]
Two key features of the score function are that provided the order of differentiation w.r.t. a
component of θ and integration over the sample space Y n may be interchanged,

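1. $E_\theta\{U(\theta; Y)\} = 0$,

2. $\mathrm{Var}_\theta\{U(\theta; Y)\} = -E_\theta\Big\{\dfrac{\partial^2}{\partial \theta\, \partial \theta^T}\ell(\theta; Y)\Big\}$.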
To see the first property, note that for r = 1, . . . , d,
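\begin{align*}
E_\theta\{U_r(\theta; Y)\} &= \int_{\mathcal{Y}^n} \frac{\partial}{\partial \theta_r}\log\{f(y; \theta)\}\, f(y; \theta)\, dy \\
&= \int_{\mathcal{Y}^n} \frac{\partial}{\partial \theta_r} f(y; \theta)\, dy \\
&= \frac{\partial}{\partial \theta_r}\int_{\mathcal{Y}^n} f(y; \theta)\, dy = \frac{\partial}{\partial \theta_r}(1) = 0.
\end{align*}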

We leave property 2 as an exercise.
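
As a numerical sanity check of the two properties (not a proof), the following sketch computes the relevant expectations exactly for n i.i.d. Bernoulli(θ) observations, where the score depends on the data only through the number of successes S. The choice of model, the function name `score_moments` and the values n = 20, θ = 0.3 are illustrative assumptions, not part of the notes.

```python
from math import comb

def score_moments(n, theta):
    """Exact mean and variance of the score, and expected negative Hessian,
    for n i.i.d. Bernoulli(theta) observations (S = number of ones)."""
    mean_u = second_moment = neg_hess = 0.0
    for s in range(n + 1):
        pmf = comb(n, s) * theta**s * (1 - theta)**(n - s)   # P(S = s)
        u = s / theta - (n - s) / (1 - theta)                # score U(theta)
        h = -s / theta**2 - (n - s) / (1 - theta)**2         # second derivative of log-lik
        mean_u += pmf * u
        second_moment += pmf * u**2
        neg_hess += pmf * (-h)
    var_u = second_moment - mean_u**2
    return mean_u, var_u, neg_hess

mean_u, var_u, neg_hess = score_moments(n=20, theta=0.3)
print(mean_u, var_u, neg_hess)
```

Under these assumptions the first printed value should be zero up to rounding, and the last two should agree with each other and with n/{θ(1 − θ)}.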

2.5.2 Fisher information


The quantity
\[
i(\theta) := \mathrm{Var}_\theta\{U(\theta; Y)\}
\]
is known as the Fisher information. It can be thought of as a measure of how easy it is to
estimate θ when it is the true parameter value (larger values make estimation easier). A related quantity is the observed information
matrix, j(θ) defined by
\[
j(\theta) = -\frac{\partial^2}{\partial \theta\, \partial \theta^T}\ell(\theta; Y).
\]
Note that i(θ) = Eθ (j(θ)).

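Example. Consider our friend the normal linear model: $Y = X\beta + \varepsilon$, $\varepsilon \sim N_n(0, \sigma^2 I)$. Then
\[
i(\beta, \sigma^2) = \begin{pmatrix} \sigma^{-2} X^T X & 0 \\ 0 & n\sigma^{-4}/2 \end{pmatrix}.
\]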

Note that writing i−1 (β) for the top left p × p sub-matrix of i−1 (β, σ 2 ) (the matrix inverse of
i(β, σ 2 )), we have that Var(β̂) = i−1 (β).
In fact we have the following result.
Theorem 5 (Cramér–Rao lower bound). Let θ̃ be an unbiased estimator of θ. Then under
regularity conditions,
Varθ (θ̃) − i−1 (θ)
is positive semi-definite.
Proof. We only sketch the proof when d = 1. By the Cauchy–Schwarz inequality,

i(θ)Var(θ̃) = Var(U (θ))Var(θ̃) ≥ {Cov(θ̃, U (θ))}2 .

As E{U (θ)} = 0,

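\begin{align*}
\mathrm{Cov}(\tilde\theta, U(\theta)) = E(\tilde\theta\, U(\theta)) &= \int_{\mathcal{Y}^n} \tilde\theta(y) \Big\{\frac{\partial}{\partial \theta}\log f(y; \theta)\Big\} f(y; \theta)\, dy \\
&= \int_{\mathcal{Y}^n} \frac{\partial}{\partial \theta} f(y; \theta)\, \tilde\theta(y)\, dy \\
&= \frac{\partial}{\partial \theta}\int_{\mathcal{Y}^n} \tilde\theta(y) f(y; \theta)\, dy = \frac{\partial}{\partial \theta} E_\theta \tilde\theta.
\end{align*}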

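But as $\tilde\theta$ is unbiased we finally get
\[
\mathrm{Cov}(\tilde\theta, U(\theta)) = \frac{\partial}{\partial \theta}\theta = 1,
\]
so that $i(\theta)\mathrm{Var}(\tilde\theta) \ge 1$, i.e. $\mathrm{Var}(\tilde\theta) \ge i^{-1}(\theta)$.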
Since the m.l.e. of β in the normal linear model, β̂ := (X T X)−1 X T Y is unbiased, we
conclude that β̂ has the minimum variance among all unbiased estimators of β (not just the
linear unbiased estimators as the Gauss–Markov theorem yields).
It turns out that this is, to a certain extent, a general feature of maximum likelihood
estimators (in finite dimensional models), as we now discuss.

2.5.3 Two key asymptotic results


A feature of maximum likelihood estimators is that asymptotically they are normally distributed
with mean the true parameter value θ and variance the inverse of the Fisher information matrix
evaluated at θ. Thus asymptotically they achieve the Cramér–Rao lower bound. To make this
a little more precise, let us recall some definitions to do with convergence of random variables.

*Convergence of random variables*. We say a sequence of random variables $Z_1, Z_2, \ldots$ with corresponding distribution functions $F_1, F_2, \ldots$ converges in distribution to a random variable $Z$ with distribution function $F$, and write $Z_n \xrightarrow{d} Z$, if $F_n(x) \to F(x)$ at all $x$ where $F$ is continuous.
A sequence of random vectors Zn ∈ Rk converges in distribution to a continuous random
vector Z ∈ Rk when
P(Zn ∈ B) → P(Z ∈ B)
for all (Borel) sets B for which δB := cl(B) \ int(B) has P(Z ∈ δB) = 0.
For example, the multidimensional central limit theorem (CLT) states that if $Z_1, Z_2, \ldots$ are i.i.d. random vectors in $\mathbb{R}^k$ with positive definite variance $\Sigma$ and mean $\mu \in \mathbb{R}^k$, then writing $\bar Z^{(n)}$ for $\frac{1}{n}\sum_{i=1}^n Z_i$, we have
\[
\sqrt{n}(\bar Z^{(n)} - \mu) \xrightarrow{d} N_k(0, \Sigma).
\]

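A stronger mode of convergence is convergence in probability. We say $Z_n \in \mathbb{R}^k$ tends to $Z \in \mathbb{R}^k$ in probability if for every $\epsilon > 0$
\[
\mathbb{P}(\|Z_n - Z\| > \epsilon) \to 0,
\]
as $n \to \infty$. If $Z_n \xrightarrow{p} Z$ then $Z_n \xrightarrow{d} Z$.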
For example, the weak law of large numbers (WLLN) states that if $Z_1, Z_2, \ldots$ are i.i.d. with mean $\mu \in \mathbb{R}^k$ then $\bar Z^{(n)} \xrightarrow{p} \mu$. Some useful results concerning convergence of random variables are:

Proposition 6 (Continuous mapping theorem). If $Z_n \xrightarrow{d} Z$ and $h : \mathbb{R} \to \mathbb{R}$ is continuous, then $h(Z_n) \xrightarrow{d} h(Z)$.

Proposition 7 (Slutsky’s lemma). If $Y_n \xrightarrow{d} Y$ and $Z_n \xrightarrow{p} c$, where $c$ is a constant, then

1. $Y_n + Z_n \xrightarrow{d} Y + c$,

2. $Y_n Z_n \xrightarrow{d} cY$,

3. $Y_n / Z_n \xrightarrow{d} Y / c$, provided $c \neq 0$.

Asymptotic distribution of maximum likelihood estimators.

Theorem 8. Assume that the Fisher information matrix when there are n observations, i(n) (θ)
(where we have made the dependence on n explicit) satisfies i(n) (θ)/n → I(θ) for some positive
definite matrix I. Then denoting the maximum likelihood estimator of θ when there are n
observations by θ̂(n) , under regularity conditions we have
\[
\sqrt{n}(\hat\theta^{(n)} - \theta) \xrightarrow{d} N_d(0, I^{-1}(\theta)).
\]

A short-hand and informal version of writing this (which is fine for this course) is that

θ̂ ∼ ANd (θ, i−1 (θ)),

to be read “θ̂ is asymptotically normal with mean θ and variance i−1 (θ)”.

*Sketch of proof*. Here is a sketch of the proof when d = 1 and our data are i.i.d. rather
than simply independent.
A Taylor expansion of the score function about the true parameter value θ gives

0 = U (θ̂) = U (θ) − (θ̂ − θ)j(θ) + Remn (θ).

Since E(U (θ)) = 0 and Var(U (θ)) = i(θ) = ni1 (θ), where i1 (θ) is the Fisher information of the
first observation, by the CLT we have
\[
\frac{U(\theta)}{\sqrt{n}} \xrightarrow{d} N(0, i_1(\theta)).
\]

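By Slutsky’s lemma (no. 1 additive), provided that
\[
\frac{\mathrm{Rem}_n(\theta)}{\sqrt{n}} \xrightarrow{p} 0,
\]
we have
\[
\sqrt{n}(\hat\theta - \theta)\frac{j(\theta)}{n} = \frac{U(\theta)}{\sqrt{n}} + \frac{\mathrm{Rem}_n(\theta)}{\sqrt{n}} \xrightarrow{d} N(0, i_1(\theta)).
\]
But by WLLN, $j(\theta)/n \xrightarrow{p} i_1(\theta)$ as $n \to \infty$, so by Slutsky’s lemma (no. 3),
\[
\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N(0, i_1^{-1}(\theta)).
\]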

Relevance of the result. How are we to use this result? The first issue is that as the true
parameter θ is unknown, so is i−1 (θ). However, provided that i−1 (θ) is a continuous function
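of θ, we may estimate this well with $i^{-1}(\hat\theta)$, and we can show that, for example
\[
\frac{\hat\theta_j - \theta_j}{\sqrt{(i^{-1}(\hat\theta))_{jj}}} \xrightarrow{d} N(0, 1).
\]
Thus we can create an approximate $1 - \alpha$ level confidence interval for $\theta_j$ with
\[
\Big[\hat\theta_j - z_{\alpha/2}\sqrt{(i^{-1}(\hat\theta))_{jj}},\ \hat\theta_j + z_{\alpha/2}\sqrt{(i^{-1}(\hat\theta))_{jj}}\Big],
\]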

where zα is the upper α-point of N (0, 1). The coverage of this confidence interval tends to 1 − α
as n → ∞. Similarly, an asymptotic 1 − α level confidence set for θ is given by

{θ0 : (θ̂ − θ0 )T i(θ̂)(θ̂ − θ0 ) ≤ χ2d (α)}.

Another issue is that we never have an infinite amount of data. What does the asymptotic
result have to say when we have maybe 100 observations? From a purely logical point of view,
it says absolutely nothing. You will have had it drilled into you long ago in Analysis I that
even the first trillion terms of a sequence have nothing to do with its limiting behaviour. On
the other hand, we can be more optimistic and hope that n = 100 is large enough for the finite
sample distribution of θ̂ to be close to the limiting distribution. Performing simulations can help
justify this optimism and give us values of n for which we can expect the limiting arguments to
apply.
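
The kind of simulation suggested above can be sketched as follows. This is only an illustration under assumed settings: i.i.d. Bernoulli(θ) data with θ = 0.3, for which θ̂ is the sample mean and i⁻¹(θ̂) = θ̂(1 − θ̂)/n. The function name `wald_coverage`, the seed and the number of repetitions are arbitrary choices.

```python
import random
from math import sqrt

def wald_coverage(theta, n, reps=2000, z=1.959964, seed=0):
    """Monte Carlo estimate of the coverage of the approximate 95% interval
    theta_hat +/- z * sqrt(i^{-1}(theta_hat)) for Bernoulli(theta) data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s = sum(rng.random() < theta for _ in range(n))   # number of successes
        theta_hat = s / n                                  # m.l.e.
        se = sqrt(theta_hat * (1 - theta_hat) / n)         # sqrt of i^{-1}(theta_hat)
        hits += (theta_hat - z * se <= theta <= theta_hat + z * se)
    return hits / reps

coverage_100 = wald_coverage(theta=0.3, n=100)
coverage_10 = wald_coverage(theta=0.3, n=10)
print(coverage_100, coverage_10)
```

Comparing the two estimates gives a rough feel for how large n needs to be in this particular model before the nominal level is approximately attained; other models and parameter values may behave quite differently.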

Wilks’ theorem. The result on asymptotic normality of maximum likelihood estimators
allows us to construct confidence intervals for individual components of θ and hence perform
hypothesis tests of the form H0 : θj = 0, H1 : θj ≠ 0. Now suppose we wish to test

H0 : θ ∈ Θ0 against
H1 : θ ∉ Θ0

where Θ0 ⊂ Θ, the full parameter space, and Θ0 is of lower dimension than Θ. The precise
meaning of dimension when Θ0 and Θ are not affine spaces (i.e. a translation of a subspace) but
rather general manifolds would require us to go into the realm of differential geometry, which we
won’t do here. Perhaps the most important case of interest is when θ = (θ0T , θ1T )T and θ0 ∈ Rd0
with Θ = Rd , and we are testing

H0 : θ0 = 0 against
H1 : θ0 ≠ 0.

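Wilks’ theorem gives the asymptotic distribution of the likelihood ratio statistic
\[
w_{LR}(H_0) = 2\log\Big\{\frac{\sup_{\theta' \in \Theta} L(\theta')}{\sup_{\theta' \in \Theta_0} L(\theta')}\Big\} = 2\Big\{\sup_{\theta' \in \Theta}\ell(\theta') - \sup_{\theta' \in \Theta_0}\ell(\theta')\Big\}.
\]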

Theorem 9 (Wilks’ theorem). Suppose that H0 is true. Then, under regularity conditions
\[
w_{LR}(H_0) \xrightarrow{d} \chi^2_k
\]

where k = dim(Θ) − dim(Θ0 ).

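*Sketch of proof* We only sketch the proof when the null hypothesis is simple so $\Theta_0 = \{\theta_0\}$, and when the data $Y_1, Y_2, \ldots$ are i.i.d. rather than just independent. A Taylor expansion of $\ell(\theta_0)$ centred at the (unrestricted) maximum likelihood estimate $\hat\theta$ gives
\[
\ell(\theta_0) = \ell(\hat\theta) + (\theta_0 - \hat\theta)^T U(\hat\theta) - \frac{1}{2}(\hat\theta - \theta_0)^T j(\hat\theta)(\hat\theta - \theta_0) + \mathrm{Rem}_n(\hat\theta).
\]
Using $U(\hat\theta) = 0$ and provided that $\mathrm{Rem}_n(\hat\theta) \xrightarrow{p} 0$,
\[
2\{\ell(\hat\theta) - \ell(\theta_0)\} = (\hat\theta - \theta_0)^T j(\hat\theta)(\hat\theta - \theta_0) - 2\mathrm{Rem}_n(\hat\theta) \xrightarrow{d} \chi^2_k
\]
under H0 by Slutsky’s theorem, provided $(\hat\theta - \theta_0)^T j(\hat\theta)(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_k$.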
Note that the likelihood ratio test in conjunction with Wilks’ theorem can also be used to
test whether individual components of θ are 0. Unlike the analogous situation in the normal
linear model where the F -test for an individual variable is equivalent to the t-test, here tests
based on asymptotic normality of θ̂ and the likelihood ratio test will in general be different—
usually the likelihood ratio test is to be preferred, though it may require more computation
to calculate the test statistic.
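
To illustrate that the two tests generally differ, here is a small sketch for a one-parameter example: s successes in n i.i.d. Bernoulli(θ) trials and the simple null H0 : θ = θ0. The data values (s = 32, n = 50, θ0 = 0.5) and the function name `lr_and_wald` are made up for illustration; 3.841 is the upper 5% point of the χ² distribution with one degree of freedom.

```python
from math import log

def lr_and_wald(s, n, theta0):
    """Likelihood ratio and Wald statistics for H0: theta = theta0,
    given s successes in n i.i.d. Bernoulli(theta) trials (0 < s < n)."""
    theta_hat = s / n

    def loglik(theta):
        return s * log(theta) + (n - s) * log(1 - theta)

    w_lr = 2 * (loglik(theta_hat) - loglik(theta0))
    fisher_info = n / (theta_hat * (1 - theta_hat))      # i(theta_hat)
    w_wald = (theta_hat - theta0) ** 2 * fisher_info     # based on asymptotic normality
    return w_lr, w_wald

CHI2_1_UPPER_5PCT = 3.841   # upper 5% point of chi-squared with 1 degree of freedom
w_lr, w_wald = lr_and_wald(s=32, n=50, theta0=0.5)
reject_lr = w_lr > CHI2_1_UPPER_5PCT
reject_wald = w_wald > CHI2_1_UPPER_5PCT
print(w_lr, w_wald, reject_lr, reject_wald)
```

The two statistics are referred to the same limiting distribution but take different values on the same data, so for other data sets they can lead to different decisions.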

2.5.4 Inference in generalised linear models


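Let $i(\beta, \sigma^2)$ be the Fisher information in a generalised linear model. It can be shown that this matrix is block diagonal, so writing $i(\beta)$ for the $p \times p$ top left submatrix of $i(\beta, \sigma^2)$ and $i(\sigma^2)$ for the bottom right entry, we have
\[
i(\beta, \sigma^2) = \begin{pmatrix} i(\beta) & 0 \\ 0 & i(\sigma^2) \end{pmatrix}
\quad\text{and}\quad
i^{-1}(\beta, \sigma^2) = \begin{pmatrix} i^{-1}(\beta) & 0 \\ 0 & i^{-1}(\sigma^2) \end{pmatrix}.
\]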

The asymptotic results we have studied then show that
\[
\hat\beta \sim AN_p(\beta, i^{-1}(\beta)).
\]
This (along with continuity of $i^{-1}(\beta)$) justifies the following asymptotic $(1 - \alpha)$-level confidence set for $\beta_j$:
\[
\Big[\hat\beta_j - z_{\alpha/2}\sqrt{\{i^{-1}(\hat\beta)\}_{jj}},\ \hat\beta_j + z_{\alpha/2}\sqrt{\{i^{-1}(\hat\beta)\}_{jj}}\Big],
\]
where $z_\alpha$ is the upper α-point of $N(0, 1)$. To test $H_0 : \beta_j = 0$ against $H_1 : \beta_j \neq 0$, we can reject $H_0$ if the confidence interval above excludes 0, i.e. if
\[
\frac{|\hat\beta_j|}{\sqrt{\{i^{-1}(\hat\beta)\}_{jj}}} > z_{\alpha/2}.
\]

Now suppose β is partitioned as $\beta = (\beta_0^T, \beta_1^T)^T$ where $\beta_0 \in \mathbb{R}^{p_0}$ and we wish to test $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$. Write $\check\beta_0$ for the m.l.e. of $\beta_0$ under the null model, and assume for the moment that the dispersion parameter $\sigma^2$ is known (as is the case for the Poisson and binomial models, the two most important generalised linear models). Write $\tilde\ell(\mu, \sigma^2)$ for $\ell(\beta, \sigma^2)$ so that
\[
\tilde\ell(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n \frac{1}{a_i}\big[y_i\theta(\mu_i) - K\{\theta(\mu_i)\}\big] + \sum_{i=1}^n \log\{a(\sigma^2 a_i, y_i)\}
\]

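with $\mu_i = g^{-1}(x_i^T\beta)$. Further write

• $\tilde\ell(y, \sigma^2)$ for $\tilde\ell(\mu, \sigma^2)$ with each $\mu_i$ replaced by its m.l.e. under the model where $\mu_1, \ldots, \mu_n$ are unrestricted (i.e. $\mu_i$ is replaced by $y_i$),
• $\tilde\ell(\hat\mu, \sigma^2)$ for $\ell(\hat\beta, \sigma^2)$, and
• $\tilde\ell(\check\mu, \sigma^2)$ for $\ell(\check\beta_0, \sigma^2)$ (so $\check\mu_i = g^{-1}(x_i^T\check\beta_0)$).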

The deviance of the model in H1 is
\[
D(y; \hat\mu) := 2\sigma^2\{\tilde\ell(y, \sigma^2) - \tilde\ell(\hat\mu, \sigma^2)\};
\]
the deviance of the null model is $D(y; \check\mu) := 2\sigma^2\{\tilde\ell(y, \sigma^2) - \tilde\ell(\check\mu, \sigma^2)\}$. The deviance may be thought of as the appropriate generalisation to GLMs of the residual sum of squares from the linear model. Note that the deviance is never larger in the larger model, i.e. $D(y; \hat\mu) \le D(y; \check\mu)$.
Notice that
\[
w_{LR}(H_0) = \frac{D(y; \check\mu) - D(y; \hat\mu)}{\sigma^2},
\]
so by Wilks’ theorem we may test if $\beta_1 = 0$ at level α by rejecting the null hypothesis when the value of this test statistic is larger than $\chi^2_{p-p_0}(\alpha)$.
If $\sigma^2$ is unknown it must be replaced with an estimate such as
\[
\tilde\sigma^2 := \frac{1}{n-p}\sum_{i=1}^n \frac{(Y_i - \hat\mu_i)^2}{a_i V(\hat\mu_i)}.
\]
Provided that $\tilde\sigma^2 \xrightarrow{p} \sigma^2$, Slutsky’s theorem ensures that the asymptotic distribution of $w_{LR}(H_0)$ remains unchanged, though it is often better to use $F_{p-p_0,\,n-p}(\alpha)$ as the critical value for the test statistic
\[
\frac{\frac{1}{p-p_0}\{D(y; \check\mu) - D(y; \hat\mu)\}}{\tilde\sigma^2}
\]
in this case.

2.6 Computation
We have seen how despite the maximum likelihood estimator β̂ of β in a generalised linear model
not having an explicit form (except in special cases such as the normal linear model), we can
show that asymptotically the m.l.e. has rather attractive properties and we can still perform
inference that is asymptotically valid. How are we to compute β̂ when all we know about it is
the fact that it satisfies
\[
0 = \frac{\partial \ell(\beta, \sigma^2)}{\partial \beta}\Big|_{\beta=\hat\beta} =: U(\hat\beta)? \tag{2.6.1}
\]

Here, with a slight abuse of notation, we have written U (β) for the first p components of
U (β, σ 2 ); similarly let us write j(β) and i(β) for the top left p × p submatrix of j(β, σ 2 ) and
i(β, σ 2 ) respectively.
If U were linear in β, we should be able to solve the system of linear equations in (2.6.1) to
find β̂. Though in general U won’t be a linear function, given that it is differentiable (recall that
the link function g is required to be twice differentiable), an application of Taylor’s theorem
shows that it is at least locally linear, so

U (β) ≈ U (β0 ) − j(β0 )(β − β0 )

for β close to β0 . If we managed to find a β0 close to β̂, the fact that U (β̂) = 0 suggests
approximating β̂ by the solution of

U (β0 ) − j(β0 )(β − β0 ) = 0

in β, i.e.
β0 + j −1 (β0 )U (β0 ),
where we have assumed that j(β0 ) is invertible. This motivates the following iterative algorithm
(the Newton–Raphson algorithm): starting with an initial guess at β̂, β̂0 , at the mth iteration
we update
β̂m = β̂m−1 + j −1 (β̂m−1 )U (β̂m−1 ). (2.6.2)
A potential issue with this algorithm is that j(β̂m−1 ) may be singular or close to singular and
thus make the algorithm unstable. The method of Fisher scoring replaces j(β̂m−1 ) with i(β̂m−1 )
which is always positive definite (subject to regularity conditions) and generally better behaved.
Fisher scoring may not necessarily converge to β̂ but almost always does. We terminate the
algorithm when successive iterations produce negligible difference.
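
A minimal sketch of the update (2.6.2) in a one-parameter setting, assuming a Poisson model with log(µi) = βxi and no intercept, so that U(β) = Σ xi{yi − exp(βxi)} and j(β) = Σ xi² exp(βxi). Because the log link is canonical here, j(β) does not depend on y, so Newton–Raphson and Fisher scoring coincide. The data, the starting value 0 and the function names are made up for illustration.

```python
from math import exp

def score(beta, x, y):
    # U(beta) = sum_i x_i (y_i - exp(beta x_i)) for Poisson responses with log(mu_i) = beta x_i
    return sum(xi * (yi - exp(beta * xi)) for xi, yi in zip(x, y))

def observed_info(beta, x):
    # j(beta) = sum_i x_i^2 exp(beta x_i); equals i(beta) as the link is canonical
    return sum(xi * xi * exp(beta * xi) for xi in x)

def newton_raphson(x, y, beta0=0.0, tol=1e-12, max_iter=100):
    beta = beta0
    for _ in range(max_iter):
        step = score(beta, x, y) / observed_info(beta, x)
        beta += step                       # the update (2.6.2) with p = 1
        if abs(step) < tol:                # stop when successive iterates barely differ
            break
    return beta

x = [0.5, 1.0, 1.5, 2.0]   # made-up covariate values
y = [1, 2, 4, 9]           # made-up counts
beta_hat = newton_raphson(x, y)
print(beta_hat, score(beta_hat, x, y))
```

At the returned value the score should be numerically zero, which is the defining property (2.6.1) of the m.l.e.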
Let us examine this procedure in more detail. It can be shown (see example sheet) that the score function and Fisher information matrix have entries
\begin{align*}
U_j(\beta) &= \sum_{i=1}^n \frac{(y_i - \mu_i)X_{ij}}{\sigma^2 a_i V(\mu_i) g'(\mu_i)}, & j &= 1, \ldots, p, \\
i_{jk}(\beta) &= \sum_{i=1}^n \frac{X_{ij}X_{ik}}{\sigma^2 a_i V(\mu_i)\{g'(\mu_i)\}^2}, & j, k &= 1, \ldots, p.
\end{align*}

Choosing the canonical link g(µ) = θ(µ) simplifies Uj (β) and ijk (β) since g 0 (µ) = 1/V (µ). Let
$W(\mu)$ be the $n \times n$ diagonal matrix with ith diagonal entry
\[
W_{ii}(\mu) := \frac{1}{a_i V(\mu_i)\{g'(\mu_i)\}^2}.
\]

Further let Ỹ (µ) ∈ Rn be the vector with ith component

Ỹi (µ) = g 0 (µi )(yi − µi ).

Then we may write

U (β) = σ −2 X T W Ỹ
i(β) = σ −2 X T W X.

Let us set

Wm := W (µ̂m )
Ỹm := Ỹ (µ̂m ).

(Note here the subscript m is not indexing different components of a single vector Ỹ but different
vectors Ỹm .) Then we see that

β̂m = β̂m−1 + (X T Wm−1 X)−1 X T Wm−1 Ỹm−1 .

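If we define the adjusted dependent variable $Z_m$ by
\[
Z_m := \tilde Y_m + \hat\eta_m,
\]
where $\hat\eta_m = X\hat\beta_m$, then
\[
\hat\beta_m = (X^T W_{m-1} X)^{-1} X^T W_{m-1} Z_{m-1} = \arg\min_{b \in \mathbb{R}^p}\Big\{\sum_{i=1}^n W_{m-1,ii}(Z_{m-1,i} - x_i^T b)^2\Big\}.
\]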

See example sheet 1 for the final equality. Thus the sequence of approximations to β̂ are given
by iterative weighted least squares (IWLS) of the adjusted dependent variable Zm−1,i on X with
weights given by the diagonal entries of Wm−1 .
With this formulation, we can start with an initial guess of $\hat\mu$ rather than one of $\hat\beta$. An obvious choice for this initial guess $\hat\mu_0$ is the response $y$, although a small adjustment such as $\hat\mu_{0,i} = \max\{y_i, \epsilon\}$ for $\epsilon > 0$ may be necessary if $g(\mu) = \log(\mu)$ for example, to avoid problems when $y_i = 0$.
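
The IWLS iteration can be sketched as follows for a Poisson model with log link, an intercept and a single covariate, so that each weighted least squares step only needs a 2 × 2 linear solve. For this model a_i = 1, V(µ) = µ and g′(µ) = 1/µ, giving W_ii = µi and adjusted dependent variable Zi = ηi + (yi − µi)/µi. The data, the value ε = 0.5 and the helper names `wls_2` and `iwls_poisson` are illustrative assumptions.

```python
from math import exp, log

def wls_2(x, z, w):
    """Weighted least squares of z on (1, x); returns (intercept, slope)."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sz = sum(wi * zi for wi, zi in zip(w, z))
    sxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = sw * sxx - sx * sx
    return (sxx * sz - sx * sxz) / det, (sw * sxz - sx * sz) / det

def iwls_poisson(x, y, eps=0.5, tol=1e-10, max_iter=100):
    mu = [max(yi, eps) for yi in y]            # initial guess mu_0 from the response
    beta = (0.0, 0.0)
    for _ in range(max_iter):
        eta = [log(m) for m in mu]             # eta = g(mu)
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]   # adjusted dependent variable
        w = list(mu)                           # W_ii = 1 / (V(mu) g'(mu)^2) = mu
        new = wls_2(x, z, w)
        mu = [exp(new[0] + new[1] * xi) for xi in x]
        converged = max(abs(a - b) for a, b in zip(new, beta)) < tol
        beta = new
        if converged:
            break
    return beta, mu

x = [0, 1, 2, 3, 4, 5]        # made-up covariate values
y = [1, 3, 2, 6, 9, 13]       # made-up counts
beta_hat, mu_hat = iwls_poisson(x, y)
print(beta_hat, sum(mu_hat), sum(y))
```

At convergence the score equations hold, so with an intercept in the model the fitted means should sum to the observed total.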

Chapter 3

Specific regression problems

3.1 Binomial regression


Suppose we have data $(y_1, x_1^T), \ldots, (y_n, x_n^T) \in \mathbb{R} \times \mathbb{R}^p$ where it seems reasonable to assume the $y_i$ are realisations of random variables $Y_i$ that are independent for $i = 1, \ldots, n$ and
\[
Y_i \sim \frac{1}{n_i}\mathrm{Bin}(n_i, \mu_i), \qquad \mu_i \in (0, 1)
\]
with the ni known positive integers. An example of such data could be the proportion Yi of
ni organisms to have been killed by concentrations of various drugs / temperature level etc.
collected together in a vector xi . Often the ni = 1 so Yi ∈ {0, 1}—we could have 1 representing
spam and 0 representing ham for example. If we assume that µi = E(Yi ) is related to the
covariates xi through g(µi ) = xTi β for some link function g and unknown vector of coefficients
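$\beta \in \mathbb{R}^p$, then this model falls within the framework of the generalised linear model. Indeed,
\begin{align*}
f(y_i; \mu_i) &= \binom{n_i}{n_i y_i}\mu_i^{n_i y_i}(1 - \mu_i)^{n_i - n_i y_i} \\
&= \underbrace{\binom{n_i}{n_i y_i}}_{a(a_i, y_i)} \exp\Big[\frac{1}{n_i^{-1}}\Big\{y_i \underbrace{\log\Big(\frac{\mu_i}{1 - \mu_i}\Big)}_{\theta_i = \theta(\mu_i)} + \underbrace{\log(1 - \mu_i)}_{-K(\theta_i)}\Big\}\Big].
\end{align*}
We can take the dispersion parameter as 1 and let $a_i = n_i^{-1}$.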


Once we have chosen a link function, we can obtain the m.l.e. of β using the IWLS algorithm
and then perform hypothesis tests or construct confidence intervals that are asymptotically valid
using the general theory of maximum likelihood estimators.

3.1.1 Link functions


In order to avoid having to place restrictions on the values β can take, we can choose a link
function g such that the image g((0, 1)) = g(M) = R. Three commonly used link functions are
given below in increasing order of their popularity (coincidentally this is also the order in which
they were introduced). Their graphs are plotted in Figure 3.1.

1. g(µ) = log(− log(1 − µ)) gives the complementary log–log link.

2. g(µ) = Φ−1 (µ) where Φ is the c.d.f. of the standard normal distribution (so Φ−1 is the
quantile function of the standard normal) gives the probit link.

3. $g(\mu) = \log\Big(\dfrac{\mu}{1 - \mu}\Big)$ is the logit link. This is the canonical link function for the GLM.

The probit link gives an interesting latent variable interpretation of the model. Consider the
case where ni = 1. Imagine that there exists a Y ∗ ∈ Rn such that

Y ∗ = Xβ ∗ + ε

where ε ∼ Nn (0, σ 2 I). Suppose we do not observe Y ∗ but instead, only see Y ∈ {0, 1}n with
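ith component given by
\[
Y_i = 1_{\{Y_i^* > 0\}}.
\]
Then we see that
\begin{align*}
\mathbb{P}(Y_i = 1) = \mathbb{P}(Y_i^* > 0) &= \mathbb{P}(x_i^T\beta^* > -\varepsilon_i) \\
&= \mathbb{P}(x_i^T\beta^*/\sigma > Z_i) \quad\text{where } Z_i \sim N(0, 1) \\
&= \Phi(x_i^T\beta) \quad\text{where } \beta := \beta^*/\sigma.
\end{align*}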

The models generated by the other two link functions also have latent variable interpretations.
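
The latent variable interpretation of the probit link can be checked by simulation. The sketch below fixes a single hypothetical value of $x_i^T\beta^*$ and of σ, simulates the latent $Y_i^*$, and compares the proportion of ones with $\Phi(x_i^T\beta^*/\sigma)$. All numerical values, the seed and the function names are assumptions made for illustration.

```python
import random
from math import erf, sqrt

def std_normal_cdf(t):
    # Phi(t), the standard normal c.d.f., written via the error function
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def simulate_latent(x_beta_star, sigma, reps=20000, seed=1):
    """Proportion of simulated Y_i = 1{Y_i^* > 0} with Y_i^* = x_i^T beta^* + eps_i."""
    rng = random.Random(seed)
    hits = sum(x_beta_star + rng.gauss(0.0, sigma) > 0 for _ in range(reps))
    return hits / reps

x_beta_star, sigma = 0.8, 2.0      # hypothetical values of x_i^T beta^* and sigma
empirical = simulate_latent(x_beta_star, sigma)
theoretical = std_normal_cdf(x_beta_star / sigma)
print(empirical, theoretical)
```

Only the ratio β*/σ affects the distribution of the observed Yi, which is why β* and σ are not separately identifiable from the binary data.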
Of the three link functions, by far the most popular is the logit link. This is partly because it is the canonical link, and so simplifies some calculations, but perhaps more importantly, the coefficients from a model with logit link (a logistic regression model) are easy to interpret. The value $e^{\beta_j}$ gives the multiplicative change in the odds $\mu_i/(1 - \mu_i)$ for a unit increase in the value of the jth variable, keeping the values of all other variables fixed. To see this note that
\[
\frac{\mu_i}{1 - \mu_i} = \exp\Big(\sum_{j=1}^p X_{ij}\beta_j\Big) = \prod_{j=1}^p (e^{\beta_j})^{X_{ij}}.
\]
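
A tiny numerical illustration of this interpretation, with made-up coefficients and covariate values: increasing one variable by a unit multiplies the odds by the exponential of its coefficient, whatever the values of the other variables.

```python
from math import exp

def odds(x, beta):
    # odds mu / (1 - mu) under the logit link: exp(x^T beta)
    return exp(sum(xj * bj for xj, bj in zip(x, beta)))

beta = [-1.0, 0.7, 0.2]          # hypothetical coefficients
x = [1.0, 2.0, 5.0]              # hypothetical covariate vector
x_plus = [1.0, 3.0, 5.0]         # second variable increased by one unit
ratio = odds(x_plus, beta) / odds(x, beta)
print(ratio, exp(beta[1]))
```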

Figure 3.1: The graphs of three commonly used link functions for binomial regression, showing g(µ) against µ ∈ (0, 1) (legend: logistic, probit, c log−log).

3.1.2 *A classification view of logistic regression*
In the case where ni = 1 for all i, logistic regression can be thought of as a classification
procedure. The response value of each observation is then either 0 or 1, and so divides the
observations into two classes. Having fit a logistic regression to some data which we shall call
the training data, we can then predict responses (class labels) for new data for which we only
have the covariate values. We can do this by applying the function Ĉτ below to each new
observation:
\[
\hat C_\tau(x) := 1_{\{\hat\pi(x) \ge \tau\}},
\]
where
\[
\hat\pi(x) := \frac{\exp(x^T\hat\beta)}{1 + \exp(x^T\hat\beta)}
\]
and β̂ is the m.l.e. of β based on the training data. The value τ is a threshold and should be set
according to how bad predicting a class label of 1 when it is in fact 0 is, compared to predicting
a class label of 0 when it is in fact 1.
If in addition to our training data, we have another set of labelled data, we can plot the
proportion of class 1 observations correctly classified against the proportion of class 0 observa-
tions incorrectly classified using Ĉτ , for different values of τ . This set of data is known as a
test set. As τ varies between 0 and 1, the points plotted trace out what is known as a Receiver
Operating Characteristic (ROC) curve. This gives a visual representation of how good a clas-
sifier our model is, and can serve as a way of comparing different classifiers. A classifier with
ROC curve always above that of another classifier is certainly to be preferred. However, when
ROC curves of classifiers cross, no classifier uniformly dominates the other. In these cases, a
common measure of performance is the area under the ROC curve (AUC). If in a particular
application, there is a certain probability of incorrectly classifying a class 0 observation that can
be tolerated (say 5%), and the chance of incorrectly classifying a class 1 observation is to be
minimised subject to this error tolerance, then ROC curves should be compared at the relevant
point.
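
The construction of the ROC curve and the AUC can be sketched as follows. The scores stand in for fitted probabilities π̂(x) on a test set and, like the labels, are made up; the function names `roc_points` and `auc` are our own, and the area is computed with the trapezium rule.

```python
def roc_points(scores, labels):
    """(proportion of class 0 misclassified, proportion of class 1 correctly classified)
    for the classifier 1{score >= tau}, as tau decreases through the observed scores."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    points = [(0.0, 0.0)]                     # tau above every score: nothing labelled 1
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= tau and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= tau and l == 0)
        points.append((fp / n0, tp / n1))
    return points

def auc(points):
    """Area under the ROC curve by the trapezium rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]   # made-up fitted probabilities
labels = [1, 1, 0, 1, 0, 1, 0, 0]                    # made-up true classes
points = roc_points(scores, labels)
auc_value = auc(points)
print(points)
print(auc_value)
```

When all scores are distinct, this area equals the proportion of (class 1, class 0) pairs in which the class 1 observation receives the higher score.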
Of course all these comparisons are contingent on the particular test set used. Given a
collection of data, it is advisable to randomly split it into training and test sets several times
and average the ROC curves produced by each of the splits. Suppose the training sets are all
of size ntr , say. The average ROC curve is then a measure of the average performance of the
classification procedure when it is fed ntr observations where, thinking of the covariates now
as (realisations of) random variables, this average is over the joint distribution of response and
covariates.

3.1.3 *Logistic regression and linear discriminant analysis*


Additional support for the logistic link function can be gained by considering a classification
problem where we have i.i.d. data (Y1 , X1 ), . . . , (Yn , Xn ) ∈ {0, 1} × Rp . Let (Y, X) = (Y1 , X1 )
(note X is not the design matrix here, it is a p-dimensional random vector). Assume that

X|Y = 0 ∼ Np (µ0 , Σ), X|Y = 1 ∼ Np (µ1 , Σ). (3.1.1)

Suppose further that P(Yi = 1) = π1 = 1 − π0 . Now


   
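\begin{align}
\log\Big(\frac{\mathbb{P}(Y = 1 \mid X)}{\mathbb{P}(Y = 0 \mid X)}\Big) &= \log\Big(\frac{\pi_1}{\pi_0}\Big) - \frac{1}{2}(\mu_1 + \mu_0)^T\Sigma^{-1}(\mu_1 - \mu_0) + X^T\Sigma^{-1}(\mu_1 - \mu_0) \tag{3.1.2} \\
&= \alpha + X^T\beta, \tag{3.1.3}
\end{align}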

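with
\begin{align*}
\alpha &:= \log\Big(\frac{\pi_1}{\pi_0}\Big) - \frac{1}{2}(\mu_1 + \mu_0)^T\Sigma^{-1}(\mu_1 - \mu_0), \\
\beta &:= \Sigma^{-1}(\mu_1 - \mu_0).
\end{align*}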

Thus the log odds of the posterior class probabilities is precisely of the form needed for the
logistic regression model to be correct.
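
The identity (3.1.2)–(3.1.3) can be verified numerically in the one-dimensional case p = 1, where Σ is a scalar variance. The sketch below computes the posterior log odds directly from Bayes' rule and compares it with α + xβ; the parameter values and function names are hypothetical.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mean, var):
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# hypothetical population parameters for p = 1
pi1, mu0, mu1, var = 0.3, -1.0, 2.0, 1.5
pi0 = 1 - pi1

beta = (mu1 - mu0) / var
alpha = log(pi1 / pi0) - 0.5 * (mu1 + mu0) * (mu1 - mu0) / var

def log_odds_bayes(x):
    # log{P(Y = 1 | X = x) / P(Y = 0 | X = x)} computed from the class densities
    return log(pi1 * normal_pdf(x, mu1, var)) - log(pi0 * normal_pdf(x, mu0, var))

for x in [-2.0, 0.0, 1.0, 3.5]:
    print(x, log_odds_bayes(x), alpha + beta * x)
```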
Typically if it is known that the data generating process is (3.1.1), then a classifier is
formed by replacing the population parameters π1 , µ0 , µ1 and Σ in (3.1.2) with estimates,
and then classifying to the class with the largest posterior probability. This gives Fisher’s
linear discriminant analysis (LDA), which you will have already met if you took Principles of
Statistics.
The logistic regression model is more general in that it makes fewer assumptions. It does
not specify the distribution of the covariates and instead treats them as fixed (i.e. it conditions
on them). When the mixture of Gaussians model in (3.1.1) is correct, one can expect LDA to
perform better. However, when (3.1.1) is not satisfied, logistic regression may be preferred.

3.1.4 Model checking


We have not discussed model checking for GLMs but it proceeds in much the same way as for
the normal linear model, and residuals are the chief means for assessing the validity of model
assumptions. With GLMs there are several different types of residuals one can consider. One
form of residual builds on the analogy that the deviance is like the residual sum of squares from
the normal linear model. The deviance residuals in a GLM are defined as
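\[
d_i := \mathrm{sign}(y_i - \hat\mu_i)\sqrt{D(y_i; \hat\mu_i)},
\]
where $D(y_i; \hat\mu_i)$ is the ith summand in the definition of $D(y; \hat\mu)$, so
\[
D(y_i; \hat\mu_i) = \frac{2}{a_i}\big[y_i\{\theta(y_i) - \theta(\hat\mu_i)\} - \{K(\theta(y_i)) - K(\theta(\hat\mu_i))\}\big].
\]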
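
For binomial regression with $a_i = 1/n_i$, $\theta(\mu) = \log\{\mu/(1-\mu)\}$ and $K(\theta) = \log(1 + e^\theta)$, the summand above simplifies to $2n_i[y_i\log(y_i/\hat\mu_i) + (1 - y_i)\log\{(1 - y_i)/(1 - \hat\mu_i)\}]$, with the convention $0\log 0 = 0$. The sketch below computes deviance residuals on that basis; the proportions, fitted values and numbers of trials are made up, and the helper names are our own.

```python
from math import copysign, log, sqrt

def xlogy(x, y):
    # x * log(y) with the convention 0 * log(0) = 0
    return 0.0 if x == 0 else x * log(y)

def binomial_deviance_residual(y, mu_hat, n):
    """y: observed proportion, mu_hat: fitted mean in (0, 1), n: number of trials (a_i = 1/n)."""
    d = 2 * n * (xlogy(y, y / mu_hat) + xlogy(1 - y, (1 - y) / (1 - mu_hat)))
    return copysign(sqrt(max(d, 0.0)), y - mu_hat)   # max() guards against rounding below zero

ys = [0.2, 0.5, 0.9]       # made-up observed proportions
mus = [0.3, 0.5, 0.8]      # made-up fitted means
ns = [10, 20, 10]          # made-up numbers of trials
residuals = [binomial_deviance_residual(y, m, n) for y, m, n in zip(ys, mus, ns)]
deviance = sum(r * r for r in residuals)
print(residuals, deviance)
```

By construction the squared deviance residuals sum to the deviance D(y; µ̂) (here with σ² = 1).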
In binomial regression (and also Poisson regression), one can sometimes test a particular
model against a saturated model:

H0 :µi = g −1 (xTi β) i = 1, . . . , n against


H1 :µ1 , . . . , µn unrestricted.

In this case,
\[
w_{LR}(H_0) = \frac{D(y; \hat\mu) - D(y; y)}{\sigma^2} = \frac{D(y; \hat\mu)}{\sigma^2},
\]
but standard asymptotic theory no longer ensures that this converges in distribution to a $\chi^2_{n-p}$. Nevertheless, other asymptotic arguments can sometimes be used to justify referring the likelihood ratio statistic to $\chi^2_{n-p}$, for instance when

• $Y_i \sim \frac{1}{n_i}\mathrm{Bin}(n_i, \mu_i)$, with $n_i$ large, and

• $Y_i \sim \mathrm{Pois}(\mu_i)$ with $\mu_i$ large.

This is because in these cases, the individual Yi get close to normally distributed random
variables. Such asymptotics are known as small dispersion asymptotics.

3.2 Poisson regression
We have seen how binomial regression can be appropriate when the responses are proportions
(including the important case when the proportions are in {0, 1} i.e. the classification scenario).
Now we consider count data e.g. the number of texts you receive each day, or the number of terrorist attacks that occur in a country each week. Another example where count data arises is the following: imagine conducting an (online) survey where perhaps you ask people to enter their college and their voting intentions. The survey may be live for a fixed amount of time and then you can collect together the data into a 2-way contingency table:

College   | Labour | Conservative | Liberal Democrats | Other
Trinity   |        |              |                   |
...       |        |              |                   |

When the responses are counts, it may be sensible to model them as realisations of Poisson
random variables. A word of caution though. A Poisson regression model entails a particular
relationship between the mean and variance of the responses: if Yi ∼ Pois(µi ), then Var(Yi ) = µi .
In many situations we may find this assumption is violated. Nevertheless, the Poisson regression
model can often be a reasonable approximation.
If the probability of occurrence of an event in a given time interval is proportional to the
length of that time interval and independent of the occurrence of other events, then the number
of events in any specified time interval will be Poisson distributed. Wikipedia lists a number of
situations where Poisson data arise naturally:

• Telephone calls arriving in a system,

• Photons arriving at a telescope

• The number of mutations on a strand of DNA per unit length

• Cars arriving at a traffic light

• ...

The Poisson regression model assumes that our data (Y1 , x1 ), . . . , (Yn , xn ) ∈ {0, 1, . . .} × Rp
have Y1 , . . . , Yn independent with Yi ∼ Pois(µi ), µi > 0. An example sheet question asks you
to verify that {Pois(µ) : µ ∈ (0, ∞)} is an exponential dispersion family with dispersion
parameter σ 2 = 1. In line with the GLM framework, we assume the µi are related to the
covariates through g(µi ) = xTi β for a link function g.
By far the most commonly used link function is the log link—this also happens to be the
canonical link. In fact the Poisson regression model is often called the log-linear model. We
only consider the log link here. Two reasons for the popularity of the log link are:

• {log(µ) : µ ∈ (0, ∞)} = R. The parameter space for β is then simply Rp and no restrictions
are needed.

• Interpretability: if
\[
\mu_i = \exp\Big(\sum_{j=1}^p X_{ij}\beta_j\Big) = \prod_{j=1}^p (e^{\beta_j})^{X_{ij}},
\]
then we see that $e^{\beta_j}$ is the multiplicative change in the expected value of the response for a unit increase in the jth variable.

In the next practical class we’ll look at data from the English Premier League and attempt to model the home and away scores $Y_{ij}^h$ and $Y_{ij}^a$ when team i is home to team j as independent Poisson random variables with respective means
\[
\mu_{ij}^h = \exp(\Delta + \alpha_i - \beta_j), \qquad \mu_{ij}^a = \exp(\alpha_j - \beta_i).
\]

Here ∆ represents the home advantage (we expect it to be greater than 0) and αi and βi the
offensive and defensive strengths of team i.

3.2.1 Likelihood equations


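With $\log(\mu_i) = x_i^T\beta$, we have
\begin{align*}
L(\beta) &= \prod_{i=1}^n e^{-\mu_i(\beta)}\frac{\mu_i(\beta)^{y_i}}{y_i!} \\
&\propto \prod_{i=1}^n \exp(-e^{x_i^T\beta})\exp(y_i x_i^T\beta),
\end{align*}
so
\[
\ell(\beta) = -\sum_{i=1}^n \exp(x_i^T\beta) + \sum_{i=1}^n y_i x_i^T\beta.
\]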
Let us consider the case where we have an intercept term. We can either say that the first
column of the design matrix X is a column of 1’s, or we can include it explicitly in the model.
In the latter case we take
log(µi ) = α + xTi β,
so the log-likelihood is
\[
\ell(\alpha, \beta) = -\sum_{i=1}^n \exp(\alpha + x_i^T\beta) + \sum_{i=1}^n y_i(\alpha + x_i^T\beta).
\]

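Now differentiating w.r.t. α, we have
\[
0 = \frac{\partial \ell(\hat\alpha, \hat\beta)}{\partial \alpha} = \sum_{i=1}^n \{y_i - \exp(\hat\alpha + x_i^T\hat\beta)\}.
\]
Thus writing $\hat\mu_i := \exp(\hat\alpha + x_i^T\hat\beta)$, we have
\[
\sum_{i=1}^n \hat\mu_i = \sum_{i=1}^n y_i.
\]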

The deviance and Pearson’s χ2 -statistic


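The fact above simplifies the deviance in a Poisson GLM. We have
\[
\tilde\ell(\mu, \sigma^2) = -\sum_{i=1}^n \mu_i + \sum_{i=1}^n y_i\log(\mu_i),
\]
so
\[
D(y; \hat\mu) = 2\sum_{i=1}^n y_i\log\Big(\frac{y_i}{\hat\mu_i}\Big) - 2\sum_{i=1}^n (y_i - \hat\mu_i) = 2\sum_{i=1}^n y_i\log\Big(\frac{y_i}{\hat\mu_i}\Big),
\]
when an intercept term is included.
Write $y_i = \hat\mu_i + \delta_i$, so we have that $\sum_i \delta_i = 0$. Then, by a Taylor expansion, assuming that $\delta_i/\hat\mu_i$ is small for each i,
\begin{align*}
D(y; \hat\mu) &= 2\sum_{i=1}^n (\hat\mu_i + \delta_i)\log\Big(1 + \frac{\delta_i}{\hat\mu_i}\Big) \\
&\approx 2\sum_{i=1}^n \Big(\delta_i + \frac{\delta_i^2}{\hat\mu_i} - \frac{\delta_i^2}{2\hat\mu_i}\Big) \\
&= \sum_{i=1}^n \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i}.
\end{align*}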

The quantity in the final line is known as Pearson’s χ2 statistic.


Though we have described Pearson’s χ2 statistic as an approximation to the deviance, this
does not mean that the deviance is superior in the sense that, for example, its distribution is
closer to that of a χ2n−p (when the µi are large so small dispersion asymptotics are relevant).
On the contrary, Pearson’s χ2 statistic, as its name suggests, is often to be preferred for this
purpose.
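
To see how close the two statistics can be, the sketch below evaluates both for a made-up set of counts under an intercept-only Poisson model, for which every fitted mean equals the sample mean (so that the fitted means sum to the observed total, as required above). The data and function names are assumptions for illustration.

```python
from math import log

def poisson_deviance(y, mu):
    # D(y; mu) = 2 sum y_i log(y_i / mu_i) - 2 sum (y_i - mu_i), with 0 log 0 := 0
    return 2 * sum((yi * log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
                   for yi, mi in zip(y, mu))

def pearson_chi2(y, mu):
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

y = [8, 12, 9, 11, 10]                    # made-up counts
mu_hat = [sum(y) / len(y)] * len(y)       # intercept-only fit: all fitted means equal the mean of y
deviance = poisson_deviance(y, mu_hat)
pearson = pearson_chi2(y, mu_hat)
print(deviance, pearson)
```

Here the relative deviations δi/µ̂i are at most 0.2, so the Taylor approximation is accurate and the two values are close; with small fitted means the agreement can be much worse.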

3.2.2 Modelling rates with an offset


Often the expected value of a response count Yi is proportional to a known value ti . For instance,
ti might be an amount of time or a population size, such as in modelling crime counts for various
cities. Or, it might be a spatial area, such as in modelling counts of a particular animal species.
Then the sample rate is Yi /ti , with expected value µi /ti .
In most such situations, it seems more natural to assume that it is the expected rate µi /ti
that is related to the covariates, rather than E(Yi ) itself. A log-linear model for the expected
rate would then model Yi ∼ Pois(µi ) with

log(µi /ti ) = log(µi ) − log(ti ) = xTi β

so
µi = ti exp(xTi β).
This is the usual Poisson regression but with an offset of log(ti ). Since these are known con-
stants, they can be readily incorporated into the estimation procedure.
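
As a simple illustration, consider the special case of an intercept-only rate model, log(µi) = log(ti) + α. The score equation for α is Σ{yi − ti exp(α)} = 0, so the m.l.e. has the closed form exp(α̂) = Σyi / Σti. The sketch below checks this on made-up counts and exposures; with covariates present one would instead use IWLS with the offset added to the linear predictor.

```python
from math import exp, log

y = [12, 30, 7, 45]                   # made-up counts
t = [100.0, 250.0, 80.0, 400.0]       # made-up exposures (e.g. population sizes)

alpha_hat = log(sum(y) / sum(t))      # solves sum_i {y_i - t_i exp(alpha)} = 0
mu_hat = [ti * exp(alpha_hat) for ti in t]
score_alpha = sum(yi - mi for yi, mi in zip(y, mu_hat))
print(exp(alpha_hat), score_alpha)
```

Note that the estimated common rate is the ratio of totals, which in general differs from the unweighted average of the individual sample rates yi/ti.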

3.2.3 Contingency tables


An r-way contingency table is a way of presenting responses which represent frequencies when
the responses are classified according to r different factors.
We are primarily interested in r = 2 and r = 3. In these cases, we may write the data as

{Yij : i = 1, . . . , I, j = 1, . . . , J}, or
{Yijk : i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , K}

respectively.
Consider the example of the online survey that aimed to cross-classify individuals according
to their college and voting intentions. This data could be presented as a two-way contingency
table. If we also recorded people’s gender, for example, we would have a three-way contingency table. A sensible model for this data is that the $Y_{ij}$, the numbers of individuals falling into the (i, j)th cells, are independent with $Y_{ij} \sim \mathrm{Pois}(\mu_{ij})$.
Suppose we happened to end up with n = 400 forms filled. We could also imagine a situation
where rather than accepting all the survey responses that happened to arrive in a given time,
we fix the number of submissions to consider in advance, so we keep the survey live until we
have 400 forms filled. In this case a multinomial model may be more appropriate.
Recall that a random vector $Z = (Z_1, \ldots, Z_m)$ is said to have a multinomial distribution with parameters $n$ and $p_1, \ldots, p_m$, written $Z \sim \mathrm{Multi}(n; p_1, \ldots, p_m)$, if $\sum_{i=1}^m p_i = 1$ and
\[
\mathbb{P}(Z_1 = z_1, \ldots, Z_m = z_m) = \frac{n!}{z_1! \cdots z_m!}\, p_1^{z_1} \cdots p_m^{z_m},
\]
for $z_i \in \{0, \ldots, n\}$ with $z_1 + \cdots + z_m = n$.
In the second data collection scenario described above, only the overall total n = 400 was
fixed, so we might model

(Yij )i=1,...,I, j=1,...,J ∼ Multi(n; (pij )i=1,...,I, j=1,...,J ),

where
\[
p_{ij} = \frac{\mu_{ij}}{\sum_{i=1}^I \sum_{j=1}^J \mu_{ij}}.
\]

At first sight, this second model might seem to fall outside the GLM framework as the
responses Yij are not independent (adding up to n).
However, the following result suggests an alternative approach. Recall the fact that if Z1 , Z2
are independent with Zi ∼ Pois(µi ), then Z1 + Z2 ∼ Pois(µ1 + µ2 ). Obviously induction gives
a similar result for any finite collection of independent Poisson random variables.

Proposition 10. Let $Z = (Z_1, \ldots, Z_m)$ be a random vector having independent components,
with $Z_i \sim \mathrm{Pois}(\mu_i)$ for $i = 1, \ldots, m$. Conditional on $\sum_i Z_i = n$, we have that
$Z \sim \mathrm{Multi}(n; p_1, \ldots, p_m)$, where $p_i = \mu_i / \sum_j \mu_j$ for $i = 1, \ldots, m$.

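Proof. The joint distribution of $Z_1, \ldots, Z_m$ is
\[ \mathbb{P}_{\mu_1,\ldots,\mu_m}(Z_1 = z_1, \ldots, Z_m = z_m) = \exp\Big(-\sum_{i=1}^m \mu_i\Big) \prod_{i=1}^m \frac{\mu_i^{z_i}}{z_i!}, \qquad z_i \in \{0, 1, \ldots\}, \]
and $S := \sum_i Z_i \sim \mathrm{Pois}\big(\sum_j \mu_j\big)$. It follows that provided $\sum_i z_i = n$,
\begin{align*}
\mathbb{P}_{\mu_1,\ldots,\mu_m}(Z_1 = z_1, \ldots, Z_m = z_m \mid S = n)
  &= \frac{\exp\big(-\sum_j \mu_j\big) \prod_i (\mu_i^{z_i} / z_i!)}{\exp\big(-\sum_j \mu_j\big) \big(\sum_j \mu_j\big)^n / n!} \\
  &= \frac{n!}{z_1! \cdots z_m!} \, p_1^{z_1} \cdots p_m^{z_m},
\end{align*}
where $p_i = \mu_i / \sum_j \mu_j$ for $i = 1, \ldots, m$.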
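
As a quick numerical sanity check of Proposition 10, the following sketch enumerates all outcomes with a given total and compares the conditional Poisson probabilities with the multinomial probabilities. The means `mu` and the total `n` are arbitrary hypothetical values, and the helper names are chosen here for illustration only.

```python
import math
from itertools import product

def pois_pmf(z, mu):
    """Poisson probability mass function."""
    return math.exp(-mu) * mu ** z / math.factorial(z)

def multinomial_pmf(z, n, p):
    """Multinomial probability mass function for counts z summing to n."""
    coef = math.factorial(n)
    for zi in z:
        coef //= math.factorial(zi)  # each partial quotient is an integer
    return coef * math.prod(pi ** zi for pi, zi in zip(p, z))

mu = [1.5, 0.7, 2.3]  # hypothetical Poisson means
n = 4                 # value of the total we condition on
total = sum(mu)
p = [m / total for m in mu]

max_err = 0.0
for z in product(range(n + 1), repeat=len(mu)):
    if sum(z) != n:
        continue
    joint = math.prod(pois_pmf(zi, mi) for zi, mi in zip(z, mu))
    cond = joint / pois_pmf(n, total)  # P(Z = z | S = n), as S ~ Pois(total)
    max_err = max(max_err, abs(cond - multinomial_pmf(z, n, p)))
```

Here `max_err` should be zero up to floating-point rounding, whatever positive means are chosen.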

Multinomial likelihood
First consider the multinomial likelihood obtained if we suppose that

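\[ (Y_{ij})_{i=1,\ldots,I,\; j=1,\ldots,J} \sim \mathrm{Multi}\big(n; (p_{ij})_{i=1,\ldots,I,\; j=1,\ldots,J}\big), \]

where
\[ p_{ij} = \frac{\mu_{ij}}{\sum_{i=1}^I \sum_{j=1}^J \mu_{ij}}, \]
and
\[ \log(\mu_{ij}) = \alpha + x_{ij}^T \beta. \]
Thus
\[ p_{ij} = \frac{\exp(x_{ij}^T \beta)}{\sum_{i=1}^I \sum_{j=1}^J \exp(x_{ij}^T \beta)}. \]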
Here the explanatory variables xij will depend on the particular model being fit.
[Consider the “colleges and voting intentions” example. Each of the 400 submitted survey
forms can be thought of as realisations of i.i.d. random variables Zl , l = 1, . . . , 400, taking values
in the collection of categories {Trinity, . . .} × {Labour, Conservative, . . .}. If we assume that the
two components of the Zl are independent, then we may write

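\[ p_{ij} = \mathbb{P}(Z_{l1} = \text{college}_i, Z_{l2} = \text{party}_j) = \mathbb{P}(Z_{l1} = \text{college}_i) \, \mathbb{P}(Z_{l2} = \text{party}_j) = q_i r_j, \tag{3.2.1} \]
for some $q_i, r_j \geq 0$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, with $\sum_{i=1}^I q_i = \sum_{j=1}^J r_j = 1$. To parametrise this
in terms of $\beta$, we can take

\[ x_{ij}^T = (\underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{I \text{ components}}, \underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{J \text{ components}}), \]

with the ones in positions $i$ and $I + j$, so $x_{ij}^T \beta = \beta_i + \beta_{I+j}$, and for identifiability we may take $\beta_1 = \beta_{I+1} = 0$.]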
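
The indicator coding above can be sketched in a few lines of Python. The function name `design_vector`, the table dimensions and the coefficient values are hypothetical; indices are 1-based to match the notation of the notes.

```python
def design_vector(i, j, I, J):
    """Indicator vector x_ij of length I + J, for 1-based indices i and j."""
    x = [0] * (I + J)
    x[i - 1] = 1      # one in position i (the college block)
    x[I + j - 1] = 1  # one in position I + j (the party block)
    return x

I, J = 3, 4  # hypothetical numbers of colleges and parties
# Hypothetical coefficients with beta_1 = beta_{I+1} = 0 for identifiability
beta = [0.0, 0.2, -0.1, 0.0, 0.5, 0.3, -0.4]

x_23 = design_vector(2, 3, I, J)
# Linear predictor x_23^T beta, which equals beta_2 + beta_{I+3}
lin_pred = sum(xk * bk for xk, bk in zip(x_23, beta))
```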


The log-likelihood for the multinomial model is
\begin{align*}
\ell_m(\beta \mid n) &= \sum_{i,j} y_{ij} \log\{p_{ij}(\beta)\} \\
  &= \sum_{i,j} y_{ij} x_{ij}^T \beta - n \log\Big( \sum_{i,j} \exp(x_{ij}^T \beta) \Big),
\end{align*}
where we have emphasised the fact that the likelihood is based on the conditional distribution
of the counts $y_{ij}$ given the total $n$.

Poisson likelihood
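Now consider the Poisson model, but where $\sum_{i,j} y_{ij} = n$. With $\log(\mu_{ij}) = \alpha + x_{ij}^T \beta$, we have
log-likelihood
\begin{align*}
\ell_P(\alpha, \beta) &= -\sum_{i,j} \mu_{ij}(\alpha, \beta) + \sum_{i,j} y_{ij} \log\{\mu_{ij}(\alpha, \beta)\} \\
  &= -\sum_{i,j} \exp(\alpha + x_{ij}^T \beta) + \sum_{i,j} y_{ij} (\alpha + x_{ij}^T \beta) \\
  &= -\exp(\alpha) \sum_{i,j} \exp(x_{ij}^T \beta) + \sum_{i,j} y_{ij} x_{ij}^T \beta + n\alpha.
\end{align*}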

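Now let us reparametrise $(\alpha, \beta) \mapsto (\tau, \beta)$ where
\[ \tau = \sum_{i,j} \mu_{ij} = \exp(\alpha) \sum_{i,j} \exp(x_{ij}^T \beta). \]

We have
\begin{align*}
\ell_P(\tau, \beta) &= \sum_{i,j} y_{ij} x_{ij}^T \beta - n \log\Big( \sum_{i,j} \exp(x_{ij}^T \beta) \Big) + \{n \log(\tau) - \tau\} \\
  &= \ell_m(\beta \mid n) + \ell_P(\tau).
\end{align*}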

To maximise the log-likelihood above, we can maximise over β and τ separately. Thus if β ∗ is
the m.l.e. from the multinomial model, and β̂ is the m.l.e. from the Poisson model, we see that
(assuming the m.l.e.’s are unique) β ∗ = β̂. Several equivalences of the multinomial and Poisson
models emerge from this fact.

• The deviances from the Poisson model and the multinomial model are the same.

• The fitted values from both models are the same. Indeed, in the multinomial model, the
fitted values are
\[ n \hat{p}_{ij} := n \, \frac{\exp(x_{ij}^T \hat\beta)}{\sum_{i=1}^I \sum_{j=1}^J \exp(x_{ij}^T \hat\beta)}. \]

For the Poisson model, the fitted values are
\[ \hat\mu_{ij} := \hat\tau \, \frac{\exp(x_{ij}^T \hat\beta)}{\sum_{i=1}^I \sum_{j=1}^J \exp(x_{ij}^T \hat\beta)}. \]

But recall that since we have included an intercept term in the Poisson model,
\[ n = \sum_{i,j} y_{ij} = \sum_{i,j} \hat\mu_{ij} = \hat\tau. \]

Summary. Multinomial models can be fitted using a Poisson log-linear model, provided that an
intercept is included in the Poisson model. The Poisson models used to mimic multinomial
models are known as surrogate Poisson models.
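
The decomposition of the Poisson log-likelihood into a multinomial part and a part depending only on $\tau$ holds for every parameter value, so it can be checked numerically. The sketch below does this for a hypothetical 2 x 3 table with an additive linear predictor; the counts, the parameter values and the function names are assumptions made for illustration, and both log-likelihoods omit their additive constants, as in the text.

```python
import math

# Hypothetical 2x3 table of counts (not from the notes)
y = [[12, 7, 5],
     [9, 14, 3]]
n = sum(sum(row) for row in y)
cells = [(i, j) for i in range(2) for j in range(3)]

def eta(beta, i, j):
    """Linear predictor x_ij^T beta = a_i + b_j, with beta = (a, b)."""
    a, b = beta
    return a[i] + b[j]

def loglik_multinomial(beta):
    """l_m(beta | n), dropping the constant multinomial coefficient."""
    log_norm = math.log(sum(math.exp(eta(beta, i, j)) for i, j in cells))
    return sum(y[i][j] * eta(beta, i, j) for i, j in cells) - n * log_norm

def loglik_poisson(alpha, beta):
    """l_P(alpha, beta), dropping the constant -sum log(y_ij!)."""
    return sum(-math.exp(alpha + eta(beta, i, j))
               + y[i][j] * (alpha + eta(beta, i, j)) for i, j in cells)

# Arbitrary parameter values: the identity does not depend on them
alpha = 0.4
beta = ([0.0, -0.3], [0.0, 0.5, -1.1])
tau = math.exp(alpha) * sum(math.exp(eta(beta, i, j)) for i, j in cells)

lhs = loglik_poisson(alpha, beta)
rhs = loglik_multinomial(beta) + (n * math.log(tau) - tau)
gap = abs(lhs - rhs)  # zero up to rounding error
```

Since $n \log(\tau) - \tau$ is maximised at $\tau = n$, the same check also illustrates why the Poisson fitted values sum to the observed total.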

3.2.4 Test for independence of columns and rows


To test whether the rows and columns are independent (i.e. if (3.2.1) holds), we can consider a
surrogate Poisson model that takes

log(µij ) = µ + ai + bj ,

where to ensure identifiability, we enforce the corner point constraints $a_1 = b_1 = 0$. Thus there
are $1 + (I - 1) + (J - 1) = I + J - 1$ parameters, compared with $IJ$ in the saturated model. Provided
the cell counts $y_{ij}$ are large enough, small dispersion asymptotics can be used to justify comparing
the deviance or Pearson's $\chi^2$ statistic to $\chi^2_{IJ-I-J+1} = \chi^2_{(I-1)(J-1)}$.
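
For reference, here is a sketch of the quantities being compared. These are standard facts for the independence model, stated here without proof. Writing $y_{i+}$ and $y_{+j}$ for the row and column totals (notation assumed here), the fitted values of the surrogate Poisson model have a closed form, and the deviance $D$ and Pearson statistic $X^2$ are

```latex
\begin{align*}
\hat\mu_{ij} &= \frac{y_{i+} \, y_{+j}}{n}, \\
D &= 2 \sum_{i,j} y_{ij} \log\frac{y_{ij}}{\hat\mu_{ij}}, \qquad
X^2 = \sum_{i,j} \frac{(y_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}},
\end{align*}
```

with the convention $0 \log 0 = 0$. The usual extra term $-2\sum_{i,j}(y_{ij} - \hat\mu_{ij})$ in the Poisson deviance vanishes because the model contains an intercept.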

3.2.5 Test for homogeneity of rows
Consider the following example. In a flu vaccine trial, patients were randomly allocated to one
of two groups. The first received a placebo, the other the vaccine. The levels of antibody after
six weeks were:

            Small   Moderate   Large   Total

Placebo        25          8       5      38
Vaccine         6         18      11      35

We are interested in the homogeneity of the different rows: is there a different response from
the vaccine group? Here the row totals were fixed before the responses were observed. We can
thus model the responses in each row as having a multinomial distribution.
If $n_i$, $i = 1, \ldots, I$, denotes the sum of the $i$th row, we model the response in the $i$th row, $Y_i$, as

\[ Y_i \sim \mathrm{Multi}(n_i; p_{i1}, \ldots, p_{iJ}), \]

with $Y_1, \ldots, Y_I$ independent. Note that $\sum_{j=1}^J p_{ij} = 1$ for all $i$.
The hypothesis of homogeneity of rows can be represented by requiring that $p_{ij} = q_j$ for all
$i$, for some vector of probabilities $(q_1, \ldots, q_J)^T$. Thus the mean in the $(i, j)$th cell is $\mu_{ij} := n_i q_j$.
You will discover for yourself on the example sheet that this form of multinomial model can
be fitted using a surrogate Poisson model with

log(µij ) = µ + ai + bj ,

which is the same as for the independence example. For identifiability we may take a1 = b1 = 0.
Here, the ai are playing the role of intercepts for each row.
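
As an illustration, the following sketch computes the Pearson statistic and the deviance for the flu vaccine table under homogeneity of rows. It relies on the standard fact (not derived in these notes) that the fitted values are $n_i \hat{q}_j$ with $\hat{q}_j$ the $j$th column total divided by $n$. The p-value line uses the exact chi-squared tail $\exp(-x/2)$, which is valid only because the degrees of freedom here equal 2.

```python
import math

# Antibody levels from the flu vaccine trial above
table = [[25, 8, 5],    # placebo
         [6, 18, 11]]   # vaccine

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# Fitted values n_i * q_j, with q_j estimated by (column total) / n
fitted = [[ri * cj / n for cj in col_tot] for ri in row_tot]

pearson = sum((y - m) ** 2 / m
              for yr, mr in zip(table, fitted) for y, m in zip(yr, mr))
deviance = 2 * sum(y * math.log(y / m)
                   for yr, mr in zip(table, fitted)
                   for y, m in zip(yr, mr) if y > 0)

df = (len(row_tot) - 1) * (len(col_tot) - 1)
# For df = 2 only, the chi-squared upper tail probability is exp(-x / 2)
p_pearson = math.exp(-pearson / 2)
p_deviance = math.exp(-deviance / 2)
```

Both statistics are large relative to a $\chi^2_2$ distribution, so the data give strong evidence against homogeneity of the placebo and vaccine rows.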

Three-way contingency tables


Now suppose we have a three-way contingency table with

Y ∼ Multi(n; (pijk ), i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , K).

Consider again that the table is constructed from i.i.d. random variables Z1 , . . . , Zn taking
values in the categories
{1, . . . , I} × {1, . . . , J} × {1, . . . , K}.
Let us write Z1 = (A, B, C). Note that pijk = P(A = i, B = j, C = k). There are now eight
hypotheses concerning independence which may be of interest. Broken into four classes, they
are:

1. $H_1$: $p_{ijk} = \alpha_i \beta_j \gamma_k$ for all $i, j, k$. Summing over $j$ and $k$ we see that $\alpha_i = \mathbb{P}(A = i)$. Thus
this model corresponds to

\[ \mathbb{P}(A = i, B = j, C = k) = \mathbb{P}(A = i) \, \mathbb{P}(B = j) \, \mathbb{P}(C = k), \]

i.e. $A$, $B$ and $C$ are independent.

2. $H_2$: $p_{ijk} = \alpha_i \beta_{jk}$ for all $i, j, k$. As before we see that $\alpha_i = \mathbb{P}(A = i)$, and summing over
$i$ we get $\beta_{jk} = \mathbb{P}(B = j, C = k)$. This corresponds to saying $A$ is independent of $(B, C)$.
Two other hypotheses are obtained by permutation of $A, B, C$.

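3. $H_3$: $p_{ijk} = \beta_{ij} \gamma_{ik}$ for all $i, j, k$. If we denote summing over an index with a '+', so for
example
\[ p_{i++} := \sum_{j,k} p_{ijk} = \sum_{j,k} \beta_{ij} \gamma_{ik} = \beta_{i+} \gamma_{i+}, \]

we see that
\[ \mathbb{P}(B = j, C = k \mid A = i) = \frac{p_{ijk}}{p_{i++}} = \frac{\beta_{ij} \gamma_{ik}}{\beta_{i+} \gamma_{i+}}. \]

Summing over $k$ and over $j$ in turn, we see that

\[ \mathbb{P}(B = j \mid A = i) = \frac{\beta_{ij}}{\beta_{i+}}, \qquad \mathbb{P}(C = k \mid A = i) = \frac{\gamma_{ik}}{\gamma_{i+}}. \]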

This means that

P(B = j, C = k|A = i) = P(B = j|A = i)P(C = k|A = i),

so B and C are conditionally independent given A. Two other hypotheses are obtained
by permuting A, B, C.

4. $H_4$: $p_{ijk} = \alpha_{jk} \beta_{ik} \gamma_{ij}$ for all $i, j, k$. This hypothesis cannot be expressed as a conditional
independence statement, but means there are no three-way interactions.
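
The argument under $H_3$ can be checked numerically. The sketch below builds a table of probabilities of the form $p_{ijk} \propto \beta_{ij} \gamma_{ik}$ from hypothetical positive factors and verifies that $B$ and $C$ are conditionally independent given $A$; the dimensions and factor values are arbitrary choices for illustration.

```python
from itertools import product

I, J, K = 2, 3, 2
# Hypothetical positive factors beta_ij and gamma_ik
beta = [[1.0, 2.0, 0.5], [0.3, 1.2, 2.5]]
gamma = [[0.7, 1.3], [2.0, 0.4]]

raw = {(i, j, k): beta[i][j] * gamma[i][k]
       for i, j, k in product(range(I), range(J), range(K))}
total = sum(raw.values())
p = {cell: v / total for cell, v in raw.items()}  # normalise to sum to 1

max_err = 0.0
for i in range(I):
    p_i = sum(p[i, j, k] for j in range(J) for k in range(K))  # p_{i++}
    for j, k in product(range(J), range(K)):
        p_ij = sum(p[i, j, kk] for kk in range(K))  # p_{ij+}
        p_ik = sum(p[i, jj, k] for jj in range(J))  # p_{i+k}
        joint = p[i, j, k] / p_i                    # P(B=j, C=k | A=i)
        # Compare with P(B=j | A=i) * P(C=k | A=i)
        max_err = max(max_err, abs(joint - (p_ij / p_i) * (p_ik / p_i)))
```

Normalising by `total` only rescales the factors, so the product form $\beta_{ij} \gamma_{ik}$ is preserved and `max_err` should vanish up to rounding.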
