Nice Econometrics Notes!!!!!!!
Gregory Wawro
Associate Professor
Department of Political Science
Columbia University
420 W. 118th St.
New York, NY 10027
This course draws liberally on lecture notes prepared by Professors Neal Beck, Lucy Goodhart,
George Jakubson, Nolan McCarty, and Chris Zorn. The course also draws from the following
works:
Aldrich, John H. and Forrest D. Nelson. 1984. Linear Probability, Logit and Probit Models.
Beverly Hills, CA: Sage.
Alvarez, R. Michael and Jonathan Nagler. 1995. “Economics, Issues and the Perot Candidacy:
Voter Choice in the 1992 Presidential Election.” American Journal of Political Science
39:714–744.
Baltagi, Badi H. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.
Beck, Nathaniel, and Jonathan N. Katz. 1995. “What To Do (and Not To Do) with Time-Series
Cross-Section Data in Comparative Politics.” American Political Science Review 89:634–647.
Davidson, Russell and James G. MacKinnon. 1993. Estimation and Inference in Econometrics.
New York: Oxford University Press.
Eliason, Scott R. 1993. Maximum Likelihood Estimation: Logic and Practice. Newbury Park,
CA: Sage.
Fox, John. 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage
Publications.
Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Gujarati, Damodar N. 2003. Basic Econometrics. Fourth Edition. New York: McGraw Hill.
Hsiao, Cheng. 2003. Analysis of Panel Data. 2nd ed. Cambridge: Cambridge University Press.
Kiefer, Nicholas M. 1988. “Economic Duration Data and Hazard Functions.” Journal of Economic
Literature 24:646–679.
Kennedy, Peter. 2003. A Guide to Econometrics, Fifth Edition. Cambridge, MA: MIT Press.
King, Gary. 1989. Unifying Political Methodology. New York: Cambridge University Press.
Lancaster, Tony. 1990. The Econometric Analysis of Transition Data. Cambridge: Cambridge
University Press.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thou-
sand Oaks: Sage Publications.
Maddala, G. S. 2001. Introduction to Econometrics. Third Edition. New York: John Wiley and
Sons.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge,
MA: MIT Press.
Yamaguchi, Kazuo. 1991. Event History Analysis. Newbury Park, CA: Sage.
Zorn, Christopher J. W. 2001. “Generalized Estimating Equations Models for Correlated Data:
A Review With Applications.” American Journal of Political Science 45: 470–90.
TABLE OF CONTENTS

LIST OF FIGURES

2.11 What is rank and why would a matrix not be of full rank?
2.12 The Rank of a non-square matrix
2.13 Application of the inverse to regression analysis
2.14 Partial differentiation
2.15 Multivariate distributions

6 Goodness of Fit
6.1 The R-squared measure of goodness of fit
6.1.1 The uses of R2 and three cautions
6.2 Testing Multiple Hypotheses: the F-statistic and R2
6.3 The relationship between F and t for a single restriction
6.4 Calculation of the F-statistic using the estimated residuals
6.5 Tests of structural change
6.5.1 The creation and interpretation of dummy variables
6.5.2 The “dummy variable trap”
6.5.3 Using dummy variables to estimate separate intercepts and slopes for each group
6.6 The use of dummy variables to perform Chow tests of structural change
6.6.1 A note on the estimate of s2 used in these tests

8 Regression Diagnostics
8.1 Before you estimate the regression
8.2 Outliers, leverage points, and influence points
8.2.1 How to look at outliers in a regression model
8.3 How to look at leverage points and influence points
8.4 Leverage
8.4.1 Standardized and studentized residuals
8.4.2 DFBETAS

II Violations of Gauss-Markov Assumptions in the Classical Linear Regression Model

11 Large Sample Results and Asymptotics
11.1 What are large sample results and why do we care about them?
11.2 What are desirable large sample properties?
11.3 How do we figure out the large sample properties of an estimator?
11.3.1 The consistency of β̂ OLS
11.3.2 The asymptotic normality of OLS
11.4 The large sample properties of test statistics
11.5 The desirable large sample properties of ML estimators
11.6 How large does n have to be?

12 Heteroskedasticity
12.1 Heteroskedasticity as a violation of Gauss-Markov
12.1.1 Consequences of non-spherical errors
12.2 Consequences for efficiency and standard errors
12.3 Generalized Least Squares
12.3.1 Some intuition
12.4 Feasible Generalized Least Squares
12.5 White-consistent standard errors
12.6 Tests for heteroskedasticity
12.6.1 Visual inspection of the residuals
12.6.2 The Goldfeld-Quandt test
12.6.3 The Breusch-Pagan test

13 Autocorrelation
13.1 The meaning of autocorrelation
13.2 Causes of autocorrelation
13.3 Consequences of autocorrelation for regression coefficients and standard errors
13.4 Tests for autocorrelation
13.4.1 The Durbin-Watson test
13.4.2 The Breusch-Godfrey test
13.5 The consequences of autocorrelation for the variance-covariance matrix
13.6 GLS and FGLS under autocorrelation
13.7 Non-AR(1) processes
13.8 OLS estimation with lagged dependent variables and autocorrelation
13.9 Bias and “contemporaneous correlation”
13.10 Measurement error
13.11 Instrumental variable estimation
13.12 In the general case, why is IV estimation unbiased and consistent?

14 Simultaneous Equations Models and 2SLS
14.1 Simultaneous equations models and bias
14.1.1 Motivating example: political violence and economic growth
14.1.2 Simultaneity bias
14.2 Reduced form equations
14.3 Identification
14.3.1 The order condition
14.4 IV estimation and two-stage least squares
14.4.1 Some important observations
14.5 Recapitulation of 2SLS and computation of goodness-of-fit
14.6 Computation of standard errors in 2SLS
14.7 Three-stage least squares
14.8 Different methods to detect and test for endogeneity
14.8.1 Granger causality
14.8.2 The Hausman specification test
14.8.3 Regression version
14.8.4 How to do this in Stata
14.9 Testing for the validity of instruments

16.2 Testing for unit or time effects
16.2.1 How to do this test in Stata
16.3 LSDV as fixed effects
16.4 What types of variation do different estimators use?
16.5 Random effects estimation
16.6 FGLS estimation of random effects
16.7 Testing between fixed and random effects
16.8 How to do this in Stata
16.9 Panel regression and the Gauss-Markov assumptions

19 Count Data, Models for Limited Dependent Variables, and Duration Models
19.1 Event count models and poisson estimation
19.2 Limited dependent variables: the truncation example
19.3 Censored data and tobit regression
19.4 Sample selection: the Heckman model
19.5 Duration models

LIST OF FIGURES

1.1 An Example of PDF
1.2 An example of a CDF
1.3 Joint PDF
1.4 Conditional PDF
1.5 Conditional PDF, X and Y independent
1.6 Plots of special distributions
Part I
fX(x) = P[X = xᵢ] for i = 1, 2, 3, . . . , n.

and

∫_{−∞}^{∞} f(x)dx = 1.

P[a < X < b] = ∫_a^b f(x)dx

• For a continuous rv

F(x) = P[X ≤ x] = P[−∞ ≤ X ≤ x] = ∫_{−∞}^x f(u)du.
[Figure 1.1: An example of a PDF]

[Figure 1.2: An example of a CDF]
f (x, y) = P [X = x and Y = y]
• See Fig. 1.4 for an example of a condit’l PDF and Fig. 1.5 for an example
of a condit’l PDF for two independent variables.
if X is continuous.
1.3.2 Variance
• The distribution of values of X around its expected value can be measured
by the variance, which is defined as:
var[X] = E[(X − µX )2 ]
= E[X 2 ] − (E[X])2
var[Y ] = b2 var[X].
1.3.4 Covariance
• The covariance of two rvs, X and Y , with means µX and µY is defined as
1.3.6 Correlation
• The size of the covariance will depend on the units in which X and Y are
measured. This has led to the development of the correlation coefficient,
which gives a measure of statistical association that ranges between −1 and
+1.
• The population correlation coefficient, ρ, is defined as:

ρ = cov(X, Y)/√(var(X)var(Y)) = cov(X, Y)/(σ_x σ_y)
[Figure 1.6: Plots of special distributions: χ² densities (df = 1, 5, 10), t densities
(df = 1, 5, 10), and F densities (df = 1,1; 1,5; 1,10)]
• The sample avg. is a rv: take a different sample, likely to get a different
average. The hypothetical distribution of the sample avg. is the sampling
distribution.
• We can deduce the variance of this sample avg.:

var(Ȳ) = σ_y²/N.
• If the underlying population is distributed normal, we can also say that the avg. is
  distributed normal (why?). If the underlying population does not have a normal distribution,
  however (and income distributions, for example, are typically not normal), we cannot infer
  that the sampling distribution is normal.
• Can use two laws of statistics that can tell us about the precision and
distribution of our estimator (in this case the avg.):
• We will estimate statistics that are analogous to a sample mean and are also
distributed normal. We can again use the normal as the basis of hypothesis
tests. We will also estimate parameters that are distributed as χ2 , as t, and
as F .
Section 2
2x + 4y = 8
x + 5y = 7
• Easy to substitute to get expressions for x (or y) and solve for y (or x).
Harder to do for more equations—motivation for using matrix algebra.
• In regression analysis, estimation of coefficients requires a solution to a set
of N equations with k unknowns.
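• As a concrete illustration (added here, not part of the original notes), the two-equation
  system above can be written in matrix form and solved in R with solve():

# Solve 2x + 4y = 8 and x + 5y = 7 by writing the system as Ax = b
A <- matrix(c(2, 1, 4, 5), nrow = 2)   # rows of A are (2, 4) and (1, 5)
b <- c(8, 7)
solve(A, b)                            # returns x = 2, y = 1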
• Scalar: a 1 × 1 matrix.
• A matrix is equal to another matrix if every element in them is identical.
This implies that the matrices have the same dimensions.
• Properties of Addition:
A+0=A
(A + B) = (B + A)
(A + B)0 = A0 + B0
(A + B) + C = A + (B + C)
• Again, number of columns in the first matrix must be equal to the number
of the rows in the second matrix, otherwise there will be an element that
has nothing to be multiplied with.
• Thus, for Aₙₘ · Bⱼₖ, m = j. Note also that

A · B = C, where A is n × m, B is m × k, and C is n × k.
AI = A
(AB)0 = B0 A0
(AB)C = A(BC)
A(B + C) = AB + AC or (A + B)C = AC + BC
y1 = β0 + β1 X1 + ε1
y2 = β0 + β1 X2 + ε2
y = Xβ + ε
where y is n × 1, X is n × 2, β is 2 × 1, and ε is n × 1.
where i is a column vector of ones of dimensions n × 1 and x is the data
vector.
• Suppose we want to get a vector containing the deviation of each
observation of x from its mean (why would that be useful?):
[x₁ − x̄, x₂ − x̄, . . . , xₙ − x̄]′ = x − ix̄ = x − (1/n)ii′x

• Since x = Ix,

x − (1/n)ii′x = Ix − (1/n)ii′x = (I − (1/n)ii′)x = M0 x
• M0 has (1 − 1/n) for its diagonal elements and −1/n for all its off-diagonal
elements (∴ symmetric). Also, M0 is equal to its square, M0 M0 = M0 so it
is idempotent.
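• A quick numerical check of these properties of M0 (an illustration added here, using a small
  made-up data vector) in R:

n <- 5
x <- c(2, 4, 6, 8, 10)
M0 <- diag(n) - (1/n) * matrix(1, n, n)        # I - (1/n) i i'
all.equal(as.vector(M0 %*% x), x - mean(x))    # M0 x gives deviations from the mean
all.equal(M0 %*% M0, M0)                       # idempotent
all.equal(t(M0), M0)                           # symmetric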
• Matrix representation:

[2 3; 3 −6] [x; y] = [5; −3]
• In general form, this would be expressed as: Ax = b
• Solve by using the inverse: A−1 where A−1 A = I.
• Pre-multiply both sides of the equation by A−1 :
Ix = A−1 b
x = A−1 b
• Only square matrices have inverses, but not all square matrices possess an
inverse.
• To have an inverse (i.e., be non-singular), matrix must be “full rank” ⇒ its
determinant (e.g., |A|) is not zero.
• For a 2 × 2 matrix A = [a b; c d], |A| = (ad − bc) and A⁻¹ = (1/|A|) [d −b; −c a]
3x + y = 7
6x + 2y = 14
[3 1; 6 2] [x; y] = [7; 14]
• Can’t solve this: the second equation is not truly a different equation; it is a
“linear combination” of the first ⇒ an infinity of solutions.
• The matrix [3 1; 6 2] does not have full “row rank.”
• One of its rows can be expressed as a linear combination of the others (⇒
its determinant = 0).
• Does not have full column rank either; if a square matrix does not have full
row rank it will not have full column rank.
• Rows and columns must be “linearly independent” for matrix to have full
rank.
y = Xβ + ε
• It turns out that the optimal predictor in this sense is the conditional expec-
tation of y given x, also known as the regression function, E(yi |xi ) = g(xi , β).
• g(.) presumably involves a number of fixed parameters (β) needed to correctly
specify the theoretical relationship between yi and xi .
• Whatever part of the actually observed value of yi is not captured in g(xi , β)
in the notation of this equation must be a random component which has a
mean of zero conditional on xi .
• ∴ another way to write the equation above in a way that explicitly captures
the randomness inherent in any political process is as follows:
yi = g(xi, β) + εi
• The first order conditions for a minimum (applying the chain rule) are:

∂(Σᵢ eᵢ²)/∂β̂₀ = Σᵢ 2(yᵢ − β̂₀ − β̂₁xᵢ)(−1) = 0 ⇒ Σᵢ eᵢ = 0

∂(Σᵢ eᵢ²)/∂β̂₁ = Σᵢ 2(yᵢ − β̂₀ − β̂₁xᵢ)(−xᵢ) = 0 ⇒ Σᵢ xᵢeᵢ = 0

Σᵢ xᵢyᵢ = β̂₀ Σᵢ xᵢ + β̂₁ Σᵢ xᵢ²
ȳ = β̂0 + β̂1 x̄
⇒ OLS regression line passes through the mean of the data (but may not be
true if no constant in the model) and
β̂0 = ȳ − β̂1 x̄
• Use the last expression in place of β̂₀ in the solution for β̂₁ and use Σᵢ xᵢ = Nx̄:

Σᵢ xᵢyᵢ − Nx̄ȳ = β̂₁ (Σᵢ xᵢ² − Nx̄²)

or

β̂₁ = (Σᵢ xᵢyᵢ − Nx̄ȳ) / (Σᵢ xᵢ² − Nx̄²) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
|D| = 4N Σᵢ xᵢ² − 4(Σᵢ xᵢ)²
    = 4N Σᵢ xᵢ² − 4N²x̄²
    = 4N (Σᵢ xᵢ² − Nx̄²)
    = 4N Σᵢ (xᵢ − x̄)² > 0 if there is variation in x
• If there are more regressors, one can add normal equations of the form ∂(Σ eᵢ²)/∂β̂ₖ = 0.
(Notice that there is no x0i . Actually, we assume that x0i = 1 for all i.)
• May also be written as
yi = β0 + β1 x1i + . . . + βk xki + εi
or
or
e′e = y′y − 2β̂′X′y + β̂′X′Xβ̂

X′Xβ̂ = X′y

• If the inverse of X′X exists (i.e., assuming full rank) then the solution is:

β̂ = (X′X)⁻¹X′y
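• As an illustration (simulated data, added here and not in the original notes), the matrix
  formula reproduces what lm() computes:

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                                 # column of ones for the constant
beta.hat <- solve(t(X) %*% X) %*% (t(X) %*% y)   # (X'X)^{-1} X'y
cbind(beta.hat, coef(lm(y ~ x)))                 # the two columns agree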
Section 4
• OLS estimation relies on five basic assumptions about the way in which the
data are generated.
• If these assumptions hold, then OLS is BLUE (Best Linear Unbiased
Estimator).
Assumptions:
In English:
• Note that

E(εε′) = [E(ε₁²) E(ε₁ε₂) . . . E(ε₁ε_N); . . . ; E(ε_Nε₁) E(ε_Nε₂) . . . E(ε_N²)]
var(εi ) = E[(εi − E(εi ))2 ] = E[ε2i ]
and
cov(εi , εj ) = E[(εi − E(εi ))(εj − E(εj ))] = E(εi εj )
β̂ = (X′X)⁻¹X′y
  = (X′X)⁻¹X′(Xβ + ε)
  = β + (X′X)⁻¹X′ε

• Taking expectations:

E[β̂|X] = β + (X′X)⁻¹X′E[ε|X] = β

E[β̂] = E_X[E[β̂|X]] = β + E_X[(X′X)⁻¹X′E[ε|X]] = β
The last step holds because the expectation over X of something that is
always equal to zero is still zero.
var(β̂) = E[(β̂ − β)(β̂ − β)′]
• Thus:

var(β̂) = E[((X′X)⁻¹X′ε)((X′X)⁻¹X′ε)′]

Taking the transpose of the second factor gives:

((X′X)⁻¹X′ε)′ = ε′X[(X′X)⁻¹]′ = ε′X(X′X)⁻¹

• Then,

var(β̂) = E[((X′X)⁻¹X′ε)(ε′X(X′X)⁻¹)]
        = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]

• Next, let’s pass through the conditional expectations operator rather than the unconditional
  to obtain the conditional variance.

E[(X′X)⁻¹X′εε′X(X′X)⁻¹|X] = (X′X)⁻¹X′E(εε′|X)X(X′X)⁻¹
                           = (X′X)⁻¹X′σ²IX(X′X)⁻¹
                           = σ²(X′X)⁻¹X′X(X′X)⁻¹
                           = σ²(X′X)⁻¹
• But eᵢ = yᵢ − xᵢ′β̂ = εᵢ − xᵢ′(β̂ − β).
• Note that:
• Let

M = (I_N − X(X′X)⁻¹X′)

e = My ⇒ e′e = y′M′My = y′My = y′e = e′y

e = My = M[Xβ + ε] = MXβ + Mε = Mε
• Now we need some matrix algebra: the trace operator gives the sum of the
diagonal elements of a matrix. So, for example:
tr [1 7; 2 3] = 4
• Three key results for trace operations:
1. tr(ABC) = tr(BCA) = tr(CBA)
2. tr(cA) = c · tr(A)
3. tr(A − B) = tr(A) − tr(B)
• For a 1 × 1 matrix, the trace is equal to the matrix (Why?). The matrix
ε0 Mε is a 1 × 1 matrix. As a result:
• Later, in the section on asymptotic results, we will be able to show that the distribution
  of β̂ is approximately normal as the sample size gets large, without any assumptions on the
  distribution of the true error terms and without having to condition on X.

• In the meantime, it is also useful to remember that, for a multivariate normal distribution,
  each element of the vector β̂ is also distributed normal, so that we can say that:
• Let β̃ be a linear estimator of β calculated as Cy, where C is a K × n matrix.

• If β̃ is unbiased for β, then:

E[β̃] = E[Cy] = β

which implies that

E[C(Xβ + ε)] = β
⇒ CXβ + CE[ε] = β
⇒ CXβ = β and CX = I
var[β̃] = E[(β̃ − E[β̃])(β̃ − E[β̃])′]
       = E[(β̃ − β)(β̃ − β)′]
       = E[(Cy − β)(Cy − β)′]
       = E[((CXβ + Cε) − β)((CXβ + Cε) − β)′]
       = E[(β + Cε − β)(β + Cε − β)′]
       = E[(Cε)(Cε)′]
       = CE[εε′]C′
       = σ²CC′
(D + (X′X)⁻¹X′)X = I

so that,

DX + (X′X)⁻¹X′X = I

and

DX + I = I

and

DX = 0
• Given this last step, we can finally re-express the variance of β̃ in terms of D and X.

Therefore:

• The diagonal elements of var[β̃] and var(β̂) are the variances of the estimates. So to prove
  efficiency it is sufficient to show that the diagonal elements of σ²DD′ are all positive.
• If (X̄ − µ)/(σ/√n) > +Zα/2 or (X̄ − µ)/(σ/√n) < −Zα/2, we reject the null hypothesis.
  Intuitively, if X̄ is very different from what we said it is under the null, we reject the
  null.
• The confidence interval approach is to construct the 95% confidence interval
for µ.
Pr[X̄ − Zα/2 σ/√n ≤ µ ≤ X̄ + Zα/2 σ/√n] = 0.95
• If the hypothesized value of µ = 4 lies in the confidence interval, then we
accept the null. If the hypothesized value does not lie in the confidence
interval, then we reject H0 with 95% confidence. Intuitively, we reject
the null hypothesis if the hypothesized value is far from the likely range of
values of µ suggested by our estimate.
β̂|X ∼ N(β, σ²(X′X)⁻¹)

β̂ₖ|X ∼ N(βₖ, σ²(X′X)⁻¹ₖₖ)
• Let us assume, at this point, that we can use the relationship between the
marginal and conditional multivariate normal distributions to say simply that
the OLS regression coefficients are normally distributed.
5.2.1 Proof that (β̂ₖ − βₖ)/se(β̂ₖ) = (β̂ₖ − βₖ)/√(s²(X′X)⁻¹ₖₖ) ∼ t_{N−k}

tₖ = [(β̂ₖ − βₖ)/(σ√((X′X)⁻¹ₖₖ))] / √([(N − k)s²/σ²]/(N − k))
   = [(β̂ₖ − βₖ)/(σ√((X′X)⁻¹ₖₖ))] · (σ/s)
   = (β̂ₖ − βₖ)/(s√((X′X)⁻¹ₖₖ))
   = (β̂ₖ − βₖ)/se(β̂ₖ)
5.2.2 Proof that (N − k)s²/σ² ∼ χ²_{N−k}

• This proof will be done by figuring out whether the component parts of this fraction are
  summed squares of variables that are distributed standard normal. The proof will proceed
  using something we used earlier (e = Mε).

(N − k)s²/σ² = e′e/σ² = ε′MMε/σ² = (ε/σ)′M(ε/σ)
• We know that ε/σ ∼ N(0, 1)

• Thus, if M = I we have (ε/σ)′(ε/σ) ∼ χ²_N. This holds because the matrix multiplication
  would give us N sums of squared standard normal random variables. However, it can also be
  shown that if M is any idempotent matrix, then:

(ε/σ)′M(ε/σ) ∼ χ²[tr(M)]

• We showed before that tr(M) = N − k

(ε/σ)′M(ε/σ) ∼ χ²_{N−k}
• This completes the proof. In conclusion, the test statistic (β̂ₖ − βₖ)/se(β̂ₖ) ∼ t_{N−k}
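• An illustrative check in R (simulated data, added here): the statistic computed from
  s²(X′X)⁻¹ₖₖ matches the t value reported by summary(lm):

set.seed(14)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)
b  <- coef(fit)[2]
se <- sqrt(vcov(fit)[2, 2])     # s^2 (X'X)^{-1} element for the slope
c(b / se, summary(fit)$coefficients[2, "t value"])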
• What is the standard error? βb2 and βb3 are both random variables. From
section 1.3.7, var(X − Y ) = var(X) + var(Y ) − 2cov(XY ). Thus:
se(β̂₂ − β̂₃) = √(var(β̂₂) + var(β̂₃) − 2cov(β̂₂, β̂₃))

• How do we get those variances and covariances? They are all elements of var(β̂), the
  variance-covariance matrix of β̂ (Which elements?).
t = (q̂ − q)/se(q̂) = (R′β̂ − q)/√(R′[s²(X′X)⁻¹]R)

• How does that get us exactly what we had above?

R′β̂ = (β̂₂ − β̂₃) and q = 0.
• More complicatedly:

R′[s²(X′X)⁻¹] = [0 0 1 −1] · [σ²_β̂₀    σ_β̂₀β̂₁   σ_β̂₀β̂₂   σ_β̂₀β̂₃ ;
                              σ_β̂₁β̂₀   σ²_β̂₁    σ_β̂₁β̂₂   σ_β̂₁β̂₃ ;
                              σ_β̂₂β̂₀   σ_β̂₂β̂₁   σ²_β̂₂    σ_β̂₂β̂₃ ;
                              σ_β̂₃β̂₀   σ_β̂₃β̂₁   σ_β̂₃β̂₂   σ²_β̂₃ ]

              = [σ_β̂₂β̂₀ − σ_β̂₃β̂₀   σ_β̂₂β̂₁ − σ_β̂₃β̂₁   σ²_β̂₂ − σ_β̂₃β̂₂   σ_β̂₂β̂₃ − σ²_β̂₃]

And post-multiplying by R gives the scalar R′[s²(X′X)⁻¹]R = σ²_β̂₂ + σ²_β̂₃ − 2σ_β̂₂β̂₃.

Thus, se(q̂) = √(R′[s²(X′X)⁻¹]R) = √(var(β̂₂) + var(β̂₃) − 2cov(β̂₂, β̂₃))
• We will come back to this when we do multiple hypotheses via F -tests.
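• As an added illustration (simulated data; the variable names below are hypothetical), the
  same standard error and t statistic can be computed in R from vcov(), which returns
  s²(X′X)⁻¹:

set.seed(2)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + 0.8 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)
V <- vcov(fit)                         # s^2 (X'X)^{-1}
R <- c(0, 0, 1, -1)                    # picks out beta2 - beta3
q.hat <- sum(R * coef(fit))
se.q  <- sqrt(t(R) %*% V %*% R)
q.hat / se.q                           # t statistic for H0: beta2 = beta3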
Section 6
Goodness of Fit
6.1 The R-squared measure of goodness of fit
• The original fitting criterion used to produce the OLS regression coefficients
was to minimize the sum of squared errors. Thus, the sum of squared errors
itself could serve as a measure of the fit of the model. In other words, how
well does the model fit the data?
• Unfortunately, the sum of squared errors will always rise if we add another
observation or if we multiply the values of y by a constant. Thus, if we want
a measure of how well the model fits the data we might ask instead whether
variation in X is a good predictor of variation in y.
• Recall the mean deviations matrix from Section 2.9, which is used to subtract
the column means from every column of a matrix. E.g.,
M0 X = [x₁₁ − x̄₁ . . . x_{1K} − x̄_K; . . . ; x_{n1} − x̄₁ . . . x_{nK} − x̄_K]
• Using M0 , we can take the mean deviations of both sides of our sample
regression equation:
M0 y = M0 Xβ̂ + M0 e

• To square the sum of deviations, we just use

(M0 y)′(M0 y) = y′M0′M0 y = y′M0 y

and

y′M0 y = (M0 Xβ̂ + M0 e)′(M0 Xβ̂ + M0 e)
       = ((M0 Xβ̂)′ + (M0 e)′)(M0 Xβ̂ + M0 e)
       = (β̂′X′M0 + e′M0)(M0 Xβ̂ + M0 e)
       = β̂′X′M0 Xβ̂ + e′M0 Xβ̂ + β̂′X′M0 e + e′M0 e
• Now consider, what is the term M0 e? The M0 matrix takes the deviation of
a variable from its mean, but the mean of e is equal to zero, so M0 e = e and
X0 M0 e = X0 e = 0.
• Why does the last hold true? Intuitively, the way that we have minimized
the OLS residuals is to set the estimated residuals orthogonal to the data
matrix – there is no information in the data matrix that helps to predict the
residuals.
• You can also prove that X′e = 0 using the fact that e = Mε and that
  M = (I_N − X(X′X)⁻¹X′):

X′e = X′(I_N − X(X′X)⁻¹X′)ε = (X′ − X′X(X′X)⁻¹X′)ε = X′ε − X′ε = 0.
• The first term in this decomposition is the regression sum of squares (or the
variation in y that is explained) and the second term is the error sum of
squares. What we have shown is that the total variation in y can be fully
decomposed into the explained and unexplained portion. That is,
SST = SSR + SSE
3. R2 does not necessarily lie between zero and one if we do not include a
constant in the model.
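• To make the decomposition concrete, here is an illustrative R check (simulated data, not
  from the original notes) that SST = SSR + SSE and that SSR/SST matches the reported R²:

set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSE <- sum(resid(fit)^2)                 # error (residual) sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)    # regression sum of squares
all.equal(SST, SSR + SSE)
c(SSR / SST, summary(fit)$r.squared)     # both equal R-squared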
• Given our estimate β̂ of β, our interest centers on how much Rβ̂ deviates
from q and we base our hypothesis test on this deviation.
• Let m = Rβ̂ − q. Intuition: will reject H0 if m is too large. Since m is
a vector, we must have some way to measure distance in multi-dimensional
space. One possibility—suggested by Wald—is to use the normalized sum of
squares, or m0 {var(m)}−1 m.
• This is equivalent to estimating (m/σₘ)′(m/σₘ).
• If m can be shown to be normal w/ mean zero, then the statistic above would
be made up of a sum of squared standard normal variables (with the number
of squared standard normal variables given by the number of restrictions)
⇒ χ2 (Wald statistic).
• We know that β̂ is normally distributed ⇒ m is normally distributed (Why?).
• Further, var(m) = var(Rβ̂ − q) = R{σ 2 (X0 X)−1 }R0 and so the statistic
• Since, we do not know σ 2 and must estimate it by s2 , we cannot use the Wald
statistic. Instead, we derive the sample statistic:
F = [(Rβ̂ − q)′[σ²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J] / {[(n − k)s²/σ²]/(n − k)}

  [Notice that we are dividing by the same correction factor that we used to prove the
  appropriateness of the t-test in the single hypothesis case]
• Recall that

s² = e′e/(n − k) = ε′Mε/(n − k).
• Since under the null hypothesis Rβ = q, then Rβ̂ − q = Rβ̂ − Rβ = R(β̂ − β) = R(X′X)⁻¹X′ε.
• Using these expressions, which are true under the null, we can transform the
sample statistic above so that it is a function of ε/σ.
• In other words, the loss of fit can also be seen as (most of) the usual numerator from the
  F-statistic. Under the null, β̂* = β̂, thus (Rβ̂ − q) is equal to (β̂* − β̂) and R is equal to
  the identity matrix. Thus, we simply need to introduce our s² estimate and divide by J, the
  number of restrictions, to obtain an F-statistic.

F_{J,n−k} = [(e*′e* − e′e)/J] / [e′e/(n − k)]
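• A sketch of this computation in R (simulated data, added here as an illustration), comparing
  the by-hand F with anova():

set.seed(4)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.7 * x2 + rnorm(n)
unres <- lm(y ~ x1 + x2)
res   <- lm(y ~ x1)                  # restriction: coefficient on x2 equals zero
J <- 1; k <- 3
e.u <- sum(resid(unres)^2)
e.r <- sum(resid(res)^2)
((e.r - e.u) / J) / (e.u / (n - k))  # F statistic
anova(res, unres)                    # same F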
• If we include all of them in the regression with a constant, then the data
matrix X does not have full rank, it has rank at most of K − 1 (perfect
collinearity). In this case, the matrix (X0 X) is also not of full rank and has
rank of at most K − 1.
• Another way of putting this, is that we do not have enough information
to estimate the constant separately if we include all the regional dummies
because, once we have them in, the overall model constant is completely
determined by the individual regional intercepts and the weight of each region.
It cannot be separately estimated, because it is already given by what is in
the model.
• If you include a constant in the model, you can only include M − 1 of M
mutually exclusive and collectively exhaustive dummy variables. Or could
drop the constant.
• Suppose, however, we think that men will contribute more to political cam-
paigns as their income rises than will women. To test this hypothesis, we
would estimate the model:
Yi = β0 + β1 dm,i + β2 Xi + β3 dm,i · Xi + εi
To include the last term in the model, simply multiply the income variable
by the gender dummy. If income is measured in thousands of dollars, what
is the marginal effect of an increase of $1,000 dollars for men on expected
contributions to political campaigns?
• How do we test whether gender has any impact on the marginal effect of
income on political contributions? How do we test the hypothesis that gender
has no impact on the behavior underlying political contributions?
• It is generally a bad idea to estimate interaction terms without including the
main effects.
• If, however, I impose the restriction that gender does not matter, then I
estimate the restricted model:
Yi = β 0 + β 2 Xi + εi
In this case, I am restricting β 1 to be equal to zero and β 3 to be equal to
zero.
• In general terms, we wish to test the unrestricted model where the OLS
coefficients differ by gender.
• The Chow test to determine whether the restriction is valid is performed
using an F -test and can be accomplished using the approach outlined above.
• This version of the Chow Test is exactly equivalent to running the unrestricted
model and then performing an F -test on the joint hypothesis that β 1 = 0 and
β 3 =0. Why is this?
• We saw previously that an F -test could also be computed using:
F_{J,n−k} = [(e*′e* − e′e)/J] / [e′e/(n − k)]
This is just the same as we did above. We are simply getting the unrestricted
error sum of squares from the regressions run on men and women separately.
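• A hedged sketch of the Chow computation in R (the data and variable names below are made up
  for illustration):

set.seed(5)
n <- 200
male    <- rbinom(n, 1, 0.5)
income  <- rnorm(n, 50, 10)
contrib <- 10 + 5 * male + 0.3 * income + 0.2 * male * income + rnorm(n, sd = 5)
e.r <- sum(resid(lm(contrib ~ income))^2)                     # restricted: one line for everyone
e.u <- sum(resid(lm(contrib ~ income, subset = male == 1))^2) +
       sum(resid(lm(contrib ~ income, subset = male == 0))^2) # unrestricted: separate regressions
J <- 2; k <- 4
((e.r - e.u) / J) / (e.u / (n - k))                           # compare with qf(0.95, J, n - k)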
• This form of the F -test could be derived as an extension of testing a restriction
on the OLS coefficients.
• Under the restriction, the β vector of regression coefficients was equal to β ∗ .
In the case outlined above:
β = [β₀; β₁; β₂; β₃]   and   β* = [β₀; 0; β₂; 0]
• But the null hypothesis is given by β = β*. In terms of our notation for expressing a
  hypothesis as a restriction on the matrix of coefficients, Rβ = q, we have β = β*, so R is
  equal to the identity matrix. Moreover, the distance Rβ̂ − q is now given by β̂* − β̂.
• As a result, the quantity (β̂* − β̂)′X′X(β̂* − β̂) = e*′e* − e′e can be written as:

(Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)
• This is most of the F -statistic, except that we now have to divide by s2 , equal
to e0 e/(n − k), our estimate of the variance of the true errors, and divide by
J to get the same F -test that we would get above.
• Testing this restriction on the vector of OLS coefficients, however, is exactly
the same thing as testing that β 1 and β 3 are jointly equal to zero, and the
latter computing procedure is far easier in statistical software.
has a chi-square distribution with K degrees of freedom and a test that the
difference between the two parameters is equal to zero can be based on this
statistic.
• What remains for us to resolve is whether it is valid to base such a test on
estimates of V1 and V2 . We shall see, when we examine asymptotic results
from infinite samples that we can do so as our sample becomes sufficiently
large.
• Finally, there are some more sophisticated techniques which allow you to be
agnostic about the locations of the breaks (i.e., let the data determine where
they are).
Section 7
• What is the algebraic solution for β̂, the estimate of β in this partitioned
version of the regression model?
• To find the solution, we use the “normal equations,” (X′X)β̂ = (X′y), employing the
  partitioned matrix [X₁ X₂] for X.

• In this case, (X′X) = [X₁′; X₂′] · [X₁ X₂] = [X₁′X₁  X₁′X₂; X₂′X₁  X₂′X₂]

• The derivation of (X′y) proceeds analogously.

• The normal equations for this partitioned matrix give us two expressions with two unknowns:

[X₁′X₁  X₁′X₂; X₂′X₁  X₂′X₂] · [β̂₁; β̂₂] = [X₁′y; X₂′y]
• We can conclude from the above that if X01 X2 = 0—i.e., if X1 and X2 are
orthogonal—then the estimated coefficient β̂ 1 from a regression of y on X1
will be the same as the estimated coefficients from a regression of y on X1
and X2 .
• If not, however, and if we mistakenly omit X2 from the regression, then report-
ing the coefficient of β̂ 1 as (X1 0 X1 )−1 X1 0 y will over-state the true coefficient
by (X1 0 X1 )−1 X1 0 X2 β̂ 2 .
• The solution for β̂ 2 is found by multiplying through the second row of the ma-
trix and substituting the above expression for β̂ 1 . This method for estimating
the OLS coefficients is called “partialling out.”
• Here, the matrix M is once again the “residual-maker,” with the sub-script
indicating the relevant set of explanatory variables. Thus:
M1 X2 = a vector of residuals from a regression of X2 on X1 .
M1 y = a vector of residuals from a regression of y on X1 .
• Using the fact that M1 is idempotent, so that M01 M1 = M1 , and setting
X∗2 = M1 X2 and y∗ = M1 y, we can write the solution for β̂ 2 as:
β̂₂ = (X₂*′X₂*)⁻¹X₂*′y*
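• An illustrative R check of this “partialling out” result (simulated data, added here):

set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 1.5 * x1 - 2 * x2 + rnorm(n)
x2.star <- resid(lm(x2 ~ x1))          # M1 X2: x2 with x1 (and the constant) partialled out
y.star  <- resid(lm(y ~ x1))           # M1 y
coef(lm(y.star ~ x2.star - 1))         # equals the coefficient on x2 in the full regression
coef(lm(y ~ x1 + x2))["x2"]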
• From the formula for partitioned regression above, we can say that:
d = (X′X)⁻¹X′(y − zc) = b − (X′X)⁻¹X′zc
• Thus,
u′u = e′e + c²(z∗′z∗) − 2cz∗′e

  e = My = y∗ and z∗′e = z∗′y∗. Since c = (z∗′z∗)⁻¹(z∗′y∗), we can say that
  z∗′e = z∗′y∗ = c(z∗′z∗).
• Therefore:
y = X1 β 1 + X2 β 2 + ε
• Thus, unless (X₁′X₂) = 0 or β₂ = 0, β̂₁ is biased. Also, note that (X₁′X₁)⁻¹X₁′X₂ gives the
  coefficient(s) in a regression of X₂ on X₁.
• Thus, if we had a good idea of the sign of the covariance and the sign of β 2 ,
we would be able to predict the direction of the bias.
• If there are more than two explanatory variables in the regression, then the
direction of the bias on β̂ 1 when we exclude X2 will depend on the partial
correlation between X1 and X2 (since this always has the same sign as the
regression coefficient in the OLS regression model) and the sign of the true
β2.
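• A small simulation (added here for illustration) makes the direction concrete: with a
  positive partial correlation between X1 and X2 and a positive β2, the short regression
  overstates β1:

set.seed(7)
bias <- replicate(1000, {
  x1 <- rnorm(200)
  x2 <- 0.6 * x1 + rnorm(200)              # x2 positively correlated with x1
  y  <- 1 + 1 * x1 + 2 * x2 + rnorm(200)
  coef(lm(y ~ x1))["x1"] - 1               # short regression omits x2
})
mean(bias)   # positive, close to beta2 * cov(x1, x2)/var(x1) = 2 * 0.6 = 1.2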
  This last step involves multiplying out M₁X₂ and using E(ε′X₂) = 0 and E(X₁′ε) = 0, both of
  which we know from the Gauss-Markov assumptions.

• From here we proceed as we did when working out the expectation of the sum of squared errors
  originally:

E(e₁′e₁) = β₂′X₂′M₁X₂β₂ + σ²tr(M₁) = β₂′X₂′M₁X₂β₂ + (n − K₁)σ²
The first term in the expression is the increase in the error sum of squares
(SSE) that results when X2 is dropped from the regression.
• We can show that β₂′X₂′M₁X₂β₂ is positive, so the s² derived from the regression of y on X₁
  will be biased upward (Intuition: the shorter version of the model leaves more variance
  unaccounted for).
• Omitted variable bias is a result of a violation of the Gauss-Markov assump-
tions: we assumed we had the correct model.
7.6 Multicollinearity
• Partitioned regression can be used to show that if you add another variable
to a regression, the standard errors of the previously included coefficients will
rise and their t-statistics will fall.
• Magnitude of change depends on covariance b/t variables.
• The estimated variance of β̂k is given by
• We can use the formula for the inverse of a partitioned matrix to say some-
thing about what this element will be.
• To find the variance of α̂₂ = β̂₂, we would need to know (X′X)⁻¹₂₂, the bottom right element
  of the inverse of the matrix above.

• To find this, we will use a result on the inverse of a partitioned matrix (See Greene, 6th
  Ed., pp. 966). The formula for the bottom right element of the inverse of a partitioned
  matrix is given by: F₂ = (A₂₂ − A₂₁A₁₁⁻¹A₁₂)⁻¹, where the A’s just refer to the elements of
  the original partitioned matrix.
• Using this for the above matrix we have:

(X′X)⁻¹₂₂ = [Σ(x₂ᵢ − x̄₂)² − Σ(x₂ᵢ − x̄₂)(x₁ᵢ − x̄₁) · (1/Σ(x₁ᵢ − x̄₁)²) · Σ(x₁ᵢ − x̄₁)(x₂ᵢ − x̄₂)]⁻¹

          = [var(X₂)(n − 1) − (cov(X₁, X₂)(n − 1) · cov(X₁, X₂)(n − 1))/(var(X₁)(n − 1))]⁻¹

          = [var(X₂)(n − 1) − cov(X₁, X₂)²(n − 1)/var(X₁)]⁻¹
• Multiply through by 1 = var(X₂)/var(X₂) on both sides of the equation to give:

(X′X)⁻¹₂₂ = [(1 − cov(X₁, X₂)²/(var(X₁)var(X₂))) · var(X₂)(n − 1)]⁻¹

          = [(1 − r₁₂²) · var(X₂)(n − 1)]⁻¹

          = 1/[(1 − r₁₂²)var(X₂)(n − 1)]

• In this expression, r₁₂² is the squared partial correlation coefficient between X₁ and X₂.
• It follows then that: var(β̂₂) = s²(X′X)⁻¹ₖₖ = s²/[(1 − r₁₂²)var(X₂)(n − 1)]. Thus, the
  variance of the OLS coefficient rises when the two included variables have a higher partial
  correlation.
• Intuition: the partialling out formula for the OLS regression coefficients
demonstrated that each regression coefficient β̂k is estimated from the co-
variance between y and Xk once you have netted out the effect of the other
variables. If Xk is highly correlated with another variable, then there is less
“information” remaining once you net out the effect of the remaining variables
in your data set.
This result can be generalized for a multivariate model with K regressors
(including the constant) and in which data are entered in standard form, and
not in deviations:
var(β̂ₖ) = s²/[(1 − R²_{k.X−k})var(Xₖ)(n − 1)]

  where R²_{k.X−k} is the R² from a regression of Xₖ on all the other regressors, X₋ₖ.
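• An illustrative check of this formula in R (simulated collinear regressors, added here):

set.seed(8)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.3 * rnorm(n)            # highly collinear with x1
y  <- 1 + x1 + x2 + rnorm(n)
fit  <- lm(y ~ x1 + x2)
s2   <- summary(fit)$sigma^2
R2.k <- summary(lm(x2 ~ x1))$r.squared     # R^2 of x2 on the other regressor(s)
s2 / ((1 - R2.k) * var(x2) * (n - 1))      # the formula above
vcov(fit)["x2", "x2"]                      # matches the estimated variance of the x2 coefficient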
• Implications for the standard errors when the variables are collinear: R²_{k.X−k} will be
  high, so that the variance and standard errors will be high and thus the t-statistics will
  be low (but check F-test). If perfectly collinear, R²_{k.X−k} = 1.
Regression Diagnostics
• First commandment of multivariate analysis: “First, know your data.”
[Figure: scatterplot matrix of the data (var 1, var 2, var 3)]

[Figure: plot of residuals against fitted values]
library(car)   # the car package supplies the Duncan data and av.plots(); mod.duncan is a fitted lm() model
attach(Duncan)
av.plots(mod.duncan, labels=row.names(Duncan), ask=F)
[Figure: added-variable plots for the Duncan regression (prestige | others plotted against
each regressor | others, including education | others)]
[Figure: scatterplots of Y against X]
and then sorting on errors to get the data points organized from highest to
lowest residual. This would also allow us to tell the X values for which the
residuals were largest.
8.4 Leverage
• We should draw a distinction between outliers and high leverage observations.
• The regressor values that are most informative (i.e., have the most influence
on the coefficient estimates) and lead to relatively large reductions in the vari-
ance of coefficient estimates are those that are far removed from the majority
of values of the explanatory variables.
• Intuition: regressor values that differ substantially from average values con-
tribute a great deal to isolating the effects of changes in the regressors.
• Most common measure of influence comes from the “hat matrix”:

H = X(X′X)⁻¹X′

• For any vector y, Hy is the set of fitted values (or ŷ) in the least squares regression of
  y on X. In matrix language, it projects any n × 1 vector into the column space of X.

• For our purposes, Hy = Xβ̂ = ŷ. Recall also that M = (I − X(X′X)⁻¹X′) = (I − H)
• The least squares residuals are:
e = My = Mε = (I − H)ε
• Thus, the variance-covariance matrix for the least squares residual vector is:

var(e) = E[ee′] = ME[εε′]M′ = σ²MM′ = σ²M = σ²(I − H)
• The diagonal elements of this matrix are the variance of each individual esti-
mated residual. Since these diagonal elements are not generally the same, the
residuals are heteroskedastic. Since the off-diagonal elements are not gener-
ally equal to zero, we can also say that the individual residuals are correlated
(they have non-zero covariance). This is the case even when the true errors
are homoskedastic and independent, as we assume under the Gauss-Markov
assumptions.
• For an individual residual, eᵢ, the variance is equal to var(eᵢ) = σ²(1 − hᵢ). Here, hᵢ is
  one of the diagonal elements in the hat matrix, H. This is considered a measure of leverage,
  since hᵢ increases as xᵢ moves farther from the mean of the x values.
• Average hat values are given by h̄ = (k + 1)/n where k is the number of
coefficients in the model (including the constant). 2h̄ and 3h̄ are typically
treated as thresholds that hat values must exceed to be noteworthy.
• Thus, observations with the greatest leverage have corresponding residuals with the smallest
  variance. Observations with high leverage tend to pull the regression line toward them,
  decreasing the size and variance of their residuals.
• Note that high leverage observations are not necessarily bad: they contribute
a great deal of info about the coefficient estimates (assuming they are not
coding mistakes). They might indicate misspecification (different models for
different parts of the data).
• Fig. 8.5 contains a plot of hat values. This was produced using the R com-
mands:
plot(hatvalues(xyreg))        # index plot of the hat values
abline(h=c(2,3)*2/N, lty=2)   # cutoffs at 2*hbar and 3*hbar, where hbar = (k+1)/n = 2/N here
[Figure 8.5: index plot of hat values, with reference lines at 2h̄ and 3h̄]
8.4.2 DFBETAS
• A measure that is often used to tell how much each data point affects the
coefficient is the DFBETAS, due to Belsley, Kuh, and Welch (1980).
• DFBETAS focuses on the difference between coefficient values when the ith
observation is included and excluded, scaling the difference by the estimated
standard error of the coefficient:
DFBETASₖᵢ = (β̂ₖ − β̂ₖ₍ᵢ₎) / (σ̂₍ᵢ₎ √aₖₖ)

  where aₖₖ is the kth diagonal element of (X′X)⁻¹.
• Belsley et al. suggest that observations with |DFBETASₖᵢ| > 2/√N deserve special attention
  (this accounts for the fact that single observations have less of an effect as sample size
  grows). It is also common practice simply to use an absolute value of one, meaning that the
  observation shifted the estimated coefficient by at least one standard error.
where xyreg contains the output from the lm command. To plot them for
the slope coefficient for our fake data, do
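• The command itself is missing from the notes at this point; a plausible version (an
  assumption on my part, not the original line) uses R's built-in dfbetas():

plot(dfbetas(xyreg)[, 2])                 # column 2 holds the DFBETAS for the slope
abline(h = c(-2, 2) / sqrt(N), lty = 2)   # the 2/sqrt(N) cutoffs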
[Figure: index plot of DFBETAS for the slope coefficient]
Section 9
➢ For example, what is the expected change in our dependent variable for
a given change in the explanatory variable, where the given change is
substantively interesting or intuitive. This could include, for instance,
indicating how much the dependent variable changes for a one standard
deviation movement up or down in the explanatory variable.
➢ Relate this back to descriptive statistics.
9.1.1 Prediction
• We know the expected effect on y of a given change in xₖ from β̂. But what is the predicted
  value of y, or ŷ, for a given vector of values of x?
• In-Sample Prediction tells us how the level of y would change for a change
in the Xs for a particular observation in our sample.
• For example, at a given value of each X variable, x0 , ŷ, is equal to:
ŷ = x0 β̂
y0 = x0 β + ε0
y0 − ŷ0 = e0 = x0 β + ε0 − x0 β̂ = x0 (β − β̂) + ε0
This is the variance of the forecast error, and thus the variance of the actual
value of y around the predicted value.
• If the regression contains a constant term, then an equivalent expression (derived using the
  expression for partitioned matrices) is:

var[e0] = σ²[1 + 1/n + Σ_{j=1}^{K−1} Σ_{k=1}^{K−1} (x0j − x̄j)(x0k − x̄k)(Z′M0Z)^{jk}]
➢ will fall with the number of data points because the inverse will be smaller
and smaller (intuition—as you get more data points, you will estimate β̂
more precisely).
• The variance above can be estimated using:
• What is unknown are the parameters of the DGP. For a single, normally
distributed random variable, those unknown parameters would be the mean
and the variance. For the dependent variable in a regression model, y, those
parameters would include the regression coefficients, β, and the variance of
the errors, σ 2 .
• What we want is a measure of “inverse probability,” a means to estimate
P (parameters|data) where the parameters are what we don’t know about the
model. In this inverse probability statement, the data are taken as given
and the probability is a measure of absolute uncertainty over various sets of
coefficients.
• We cannot derive this measure of absolute uncertainty, but we can get close
to it, using the notion of likelihood, which employs Bayes Theorem.
• Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B) = P(A ∩ B)/P(B)
• Let’s paraphrase this for our case of having data and wanting to find out
about coefficients:
P(parameters|data) = P(data|parameters)P(parameters)/P(data)
• For the purposes of estimation, we treat the marginal probability of seeing
the data, P (data), as additional information that we use only to scale our
beliefs about the parameters. Thus, we can say more simply:
P (parameters|data) ∝ P (data|parameters) P (parameters) (10.1)
where ∝ means “is proportional to”.
• The conditional probability of the parameters given the data, or
P (parameters | data), is also known as the likelihood function (e.g., L(µ, σ 2 |x),
where µ refers to the mean and σ 2 refers to the variance).
• The likelihood function may be read as the likelihood of any value of the
parameters, µ and σ 2 in the univariate case, given the sample of data that
we observe. As our estimates of the parameters of interest, we might want
to use the values of the parameters that were most likely given the data at
hand.
Note that there is an implicit iid assumption, which enables us to write the
joint likelihood as the product of the marginals.
• The method of finding the most likely values of µ and σ 2 , given the sample
data, is to maximize the likelihood function. The expression above is the
likelihood function since it is the area under the curve generated by the
statistical function that gives you a probability and since this is proportional
to P (parameters|data).
• An easier way to perform the maximization, computationally, is to maximize
the log likelihood function, which is just the natural log of the likelihood
function. Since the log is a monotonic function, the values that maximize L
are the same as the values that maximize ln L.
ln L(µ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/2) Σᵢ (xᵢ − µ)²/σ²
• To find the values of µ and σ 2 , we take the derivatives of the log likelihood
with respect to the parameters and set them equal to zero.
∂ ln L/∂µ = (1/σ²) Σᵢ (xᵢ − µ) = 0

∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ (xᵢ − µ)² = 0
• To solve the likelihood equations, multiply both sides by σ 2 in the first equa-
tion and solve for µ̂, the estimate of µ. Next insert this in the second equation
and solve for σ 2 . The solutions are:
µ̂_ML = (1/n) Σᵢ xᵢ = x̄ₙ

σ̂²_ML = (1/n) Σᵢ (xᵢ − x̄ₙ)²
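• As an added illustration, maximizing this log likelihood numerically in R (via optim(), with
  a simulated sample) recovers the same estimates as the formulas:

set.seed(9)
x <- rnorm(200, mean = 3, sd = 2)
negll <- function(par) {                   # par = c(mu, log(sigma^2)) keeps the search unconstrained
  mu <- par[1]; s2 <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sqrt(s2), log = TRUE))
}
est <- optim(c(0, 0), negll)$par
c(est[1], exp(est[2]))                     # numerical ML estimates of mu and sigma^2
c(mean(x), mean((x - mean(x))^2))          # the analytical formulas (divisor n, not n - 1)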
• You should recognize these as the sample mean and variance, so that we
have some evidence that ML estimation is reliable. Note, however, that the
denominator of the calculated variance is n rather than (n−1) ⇒ ML estimate
of the variance is biased (although it is consistent).
• This implies that the yi are independently and normally distributed with
respective means β0 + β1 xi and a common variance σ 2 . The joint density of
the observations is therefore:
f(y₁, . . . , yₙ|β, σ²) = Π_{i=1}^{n} (1/(2πσ²))^{1/2} exp[−(1/2)(yᵢ − β₀ − β₁xᵢ)²/σ²]
• We will maximize the log likelihood function with respect to β0 and β1 and
then with respect to σ 2 .
• Note that only the last term in the log likelihood function involves β0 and
β1 and that maximizing this is the same as minimizing the sum of squared
errors, since there is a negative sign in front of the whole term.
• Thus, the ML estimators of β0 and β1 , equal to β̂ M L are the same as the least
squares estimates that we have been dealing with throughout.
• Substituting β̂_ML into the log likelihood and setting Q̂ = Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ)², which is
  the SSE, we get the log likelihood function in terms of σ² only:

ln L(σ) = −(n/2) ln(2π) − (n/2) ln σ² − Q̂/(2σ²) = a constant − (n/2) ln σ² − Q̂/(2σ²)
• Differentiating this with respect to σ and setting the derivative equal to zero
we get:
∂ ln L/∂σ = −n/σ̂ + Q̂/σ̂³ = 0

  This gives the ML estimator for σ²: σ̂² = Q̂/n = SSE/n

• This estimator is different from the unbiased estimator that we have been using,
  σ̂²_OLS = Q̂/(n − k) = SSE/(n − k).
• For large N , however, the two estimates will be very close. Thus, we can say
that the ML estimate is consistent for σ 2 , and we will define the meaning of
consistency explicitly in the future.
• We can use these results to show the value of the log likelihood function at its maximum, a
  quantity that can sometimes be useful for computing test statistics. We will continue to use
  the short-hand that Q̂ = SSE and will also substitute into the equation for the log
  likelihood, ln L, that σ̂²_ML = Q̂/n. Substituting this into the equation for the log
  likelihood above, we get:
ln L(β, σ²) = −(n/2) ln(2π) − (n/2) ln(Q̂/n) − Q̂/(2(Q̂/n))
            = −(n/2) ln(2π) − (n/2) ln Q̂ + (n/2) ln(n) − n/2
• Putting all the terms that do not rely on Q̂ together, since they are constant
and do not rely on the values of the estimated regression coefficients, β̂ M L ,
we can say that the maximum value of the log likelihood is:
max ln L = a constant − (n/2) ln Q̂ = a constant − (n/2) ln(SSE)
• Taking the anti-log, we can say that the maximum value of the likelihood
function is:
max L = a constant · (SSE)−n/2
• This shows that, for a given sample size, n, the maximum value of the like-
lihood and log likelihood functions will rise as SSE falls. This can be handy
to know, as it implies that measures of “goodness of fit” can be based on the
value of the likelihood or log likelihood function at its maximum.
• In conclusion, we can say that ML is equivalent to OLS for the classical
linear regression model. The real power of ML, however, is that we can
use it in many cases where we do not assume that the errors are normally
distributed or that the model is equal to yi = β0 +β1 xi +ui , so that the model
is far more general.
• For instance, models with binary dependent variables are estimated via ML,
using the logistic distribution for the data if a logit model is chosen and a
normal if the probit is used. Event count models are estimated via ML,
assuming that the original data is distributed Poisson. Thus, ML is the oper-
ative method for estimating a number of models that are frequently employed
in political science.
• The likelihood ratio (LR) test is analogous to an F -test calculated using the
residuals from the restricted and un-restricted models. Let θ be the set of
parameters in the model and let L(θ) be the likelihood function.
• Hypotheses such as θ1 = 0 impose restrictions on the set of parameters θ.
What the LR test says is that we first obtain the maximum of L(θ) without
any restrictions and we then calculate the likelihood with the restrictions
imposed by the hypothesis to be tested.
• We then consider the ratio
max L(θ) under the restrictions
λ=
max L(θ) without the restrictions
• λ will in general be less than one since the restricted maximum will be less
than the unrestricted maximum.
• If the restrictions are not valid, then λ will be significantly less than one. If
they are exactly correct, then λ will equal one. The LR test consists of using
−2 ln λ as a test statistic ∼ χ2k , where k is the number of restrictions. If this
amount is larger than the critical value of the chi-square distribution, then
we reject the null.
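• An illustrative LR test in R (simulated data, added here), using logLik() on restricted and
  unrestricted linear models:

set.seed(10)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.7 * x2 + rnorm(n)
unres <- lm(y ~ x1 + x2)
res   <- lm(y ~ x1)                                   # restriction: coefficient on x2 equals zero
LR <- -2 * as.numeric(logLik(res) - logLik(unres))    # -2 ln(lambda)
LR
pchisq(LR, df = 1, lower.tail = FALSE)                # p-value from the chi-square(1)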
• Another frequently used test statistic that is used with ML estimates is the
Wald test, which is analogous to the t-statistic:
W = (β̂_ML − β̃)/se(β̂_ML)
• The only difference from a regular t-test is that this is a large sample test.
Since in large samples, the t-distribution becomes equivalent to the standard
normal, the Wald statistic is distributed standard normal.
I(θ̂|y) = −E[∂² ln L(θ̃|y)/∂θ̃∂θ̃′]_{θ̃=θ̂}

  where θ̂ are the ML estimates (e.g., this would include β̂_ML and σ̂² for the linear
  regression model). Thus, the information matrix is the negative of the expectation of the
  second derivatives of the likelihood function estimated at the point where θ̃ = θ̂.
• If θ is a single parameter, then the larger is I(θ̂|y) then the more curved the
likelihood (or log likelihood) and the more precise is θ̂. If θ is a vector of
parameters (the normal case) then I(θ̂|y) is a K × K matrix with diagonal
elements containing the information on each corresponding element of θ̂.
• Because of the expectations operator, the information matrix must be estimated. The most
  intuitive estimate is:

I(θ̂|y) = −[∂² ln L(θ̃|y)/∂θ̃∂θ̃′]_{θ̃=θ̂}
In other words, one just uses the data sample at hand to estimate the second
derivatives.
• The information matrix is closely related to the asymptotic variance of the ML estimates,
  θ̂. It can be shown that the asymptotic variance of the ML estimates, across an infinite
  number of hypothetically repeated samples, is:

V(θ̂) ≈ lim_{n→∞} [I(θ̂|y)/n]⁻¹ = lim_{n→∞} [(1/n) E(−∂² ln L(θ̃|y)/∂θ̃∂θ̃′)]⁻¹_{θ̃=θ̂}
• Intuition: the variance is inversely related to the curvature. The greater the
curvature, the more that the log likelihood resembles a spike around the ML
estimates θ̂ and the lower the variance in those estimates will be.
• In a given sample of data, one can estimate V(θ̂) on the basis of the estimated information:

V(θ̂) ≈ [(1/n)(−∂² ln L(θ̃|y)/∂θ̃∂θ̃′)]⁻¹_{θ̃=θ̂}
• This expression is also known as the the Cramer-Rao lower bound on the
variance of the estimates. The Cramer-Rao lower bound is the lowest value
that the asymptotic variance on any estimator can take.
• Thus, the ML estimates have the very attractive property that they are
asymptotically efficient. In other words, in large samples, they achieve the
lowest possible variance (thus highest efficiency) of any potential estimate of
the true parameters.
• For the linear regression case, the estimate of V(θ̂) is σ̂²_ML(X′X)⁻¹. This is the same as
  the variance-covariance matrix of the β̂_OLS except that we use the ML estimate of σ² to
  estimate the true variance of the error term.
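• A numerical illustration (added here; not the notes' own example): inverting the Hessian of
  −ln L returned by optim() gives approximate standard errors in the same spirit:

set.seed(12)
x <- rnorm(200, mean = 3, sd = 2)
negll <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
fit <- optim(c(0, 0), negll, hessian = TRUE)   # Hessian of -ln L at the maximum
sqrt(diag(solve(fit$hessian)))                 # approximate standard errors for (mu, log sigma)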
Part II
Violations of Gauss-Markov
Assumptions in the Classical Linear
Regression Model
Section 11
➢ Used normality to show that β̂_OLS ∼ N(β, σ²(X′X)⁻¹) and that (n − k)s²/σ² ∼ χ²_{n−k}, where
  the latter involved showing that (n − k)s²/σ² can also be expressed as a quadratic form of
  ε/σ which is distributed standard normal.
➢ Both of these results were used to show that we could calculate test
statistics that were distributed as t and F . It is these results on
test statistics, and our ability to perform hypothesis tests, that are
invalidated if we cannot assume that the true errors are normally
distributed.
2. Non-linear functional forms
➢ We may be interested in estimating non-linear functions of the original
model.
➢ E.g., suppose that you have an unbiased estimator, π ∗ of
π = 1/(1 − β)
1. “Convergence in probability”:

   Xₙ →ᵖ X iff lim_{n→∞} Pr(|X(ω) − Xₙ(ω)| ≥ ε) = 0

   where ε is some small positive number. Or
• For the next step we need the Slutsky Theorem: For a continuous function
g(xn ) that is not a function of n, plim g(xn ) = g(plim xn ).
• An implication of this thm is that if xn and yn are random variables with
plim xn = c and plim yn = d, then plim (xn · yn ) = c · d.
• If Xn and Yn are random matrices with plim Xn = A and plim Yn = B then
plim Xn Yn = AB.
• Thus, we can say that:
plim β̂ OLS = β + plim (X0 X/n)−1 plim (X0 ε/n)
• Since the inverse is a continuous function, the Slutsky thm enables us to bring
the first plim into the parenthesis:
plim β̂ OLS = β + (plim X0 X/n)−1 plim (X0 ε/n)
• More generally, each element of (X0 X) is composed of the sum of squares and
the sum of cross-products of the explanatory variables. As such, the elements
of (X0 X) grow larger with each additional data point, n.
• But if we assume the elements of this matrix do not grow at a rate faster
than n and the columns of X are not linear dependent, then dividing by n,
gives convergence to a finite number.
• We can now say that plim β̂ OLS = β + Q−1 plim (X0 ε/n) and the next step
in the proof is to show that plim (X0 ε/n) is equal to zero. To demonstrate
this, we will prove that its expectation is equal to zero and that its variance
converges to zero.
• Think about the individual elements in (X′ε/n). This is a k × 1 vector in which each element is the sum over all n observations of a given explanatory variable multiplied by the corresponding realization of the error term. In other words:
(1/n) X′ε = (1/n) Σ_{i=1}^{n} xi εi = (1/n) Σ_{i=1}^{n} wi = w̄    (11.1)
lim_{n→∞} var[w̄] = 0 · Q = 0
• Therefore, we can say that w̄ converges in mean square to 0 ⇒ plim (X0 ε/n)
is equal to zero, so that:
plim β̂ OLS = β + Q−1 · 0 = β
Thus, the OLS estimator is consistent as well as unbiased.
• Thus,
var[√n w̄] = σ² Q̄n = σ² (1/n)[Q1 + Q2 + · · · + Qn] = σ² (1/n) Σ_{i=1}^{n} xi xi′ = σ² (X′X/n)
• Assuming the sum is not dominated by any particular term and that lim_{n→∞}(X′X/n) = Q, then
lim_{n→∞} σ² Q̄n = σ² Q
• We can now invoke the Lindeberg-Feller CLT to state formally that if the ε are iid w/ mean 0 and finite variance, and if each element of X is finite and lim_{n→∞}(X′X/n) = Q, then
(1/√n) X′ε →^d N[0, σ²Q]
It follows that:
Q^{-1}(1/√n) X′ε →^d N[Q^{-1}·0, Q^{-1}(σ²Q)Q^{-1}]
[Recalling that if a random variable X has a variance equal to σ 2 , then kX,
where k is a constant, has a variance equal to k 2 σ 2 ].
• Combining terms, and recalling what it was we were originally interested in:
√n (β̂_OLS − β) →^d N[0, σ²Q^{-1}]
• To complete the steps, we can also show that s² = e′e/(n − k) is consistent for σ². Thus, s²(X′X/n)^{-1} is consistent for σ²Q^{-1}. As a result, a consistent estimate for the asymptotic variance of β̂_OLS (= (σ²/n)Q^{-1} = (σ²/n)(X′X/n)^{-1}) is s²(X′X)^{-1}.
• Thus, we can say that β̂ OLS is normally distributed and a consistent estimate
of its asymptotic variance is given by s2 (X0 X)−1 , even when the error terms
are not distributed normally. We have gone through the rigors of large sample
proofs in order to show that in large samples OLS retains desirable properties
that are similar to what it has in small samples when all of the G–M conditions hold.
• To conclude, the desirable properties of OLS do not rely on the assumption
that the true error term is normally distributed. We can appeal to large
sample results to show that the sampling distribution will still be normal
as the sample size becomes large and that it will have variance that can be
consistently estimated by s2 (X0 X)−1 .
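• A small simulation sketch of this point (the design, error distribution, and sample size are arbitrary assumptions): even with skewed, non-normal errors, the OLS slope estimates are approximately normal across replications, and s²(X′X)^{-1} tracks their sampling variance.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 2000
slopes, se_slopes = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.exponential(scale=1.0, size=n) - 1.0   # skewed, mean-zero errors
    y = 0.5 + 2.0 * x + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (n - 2)
    V = s2 * np.linalg.inv(X.T @ X)
    slopes.append(b[1])
    se_slopes.append(np.sqrt(V[1, 1]))

print(np.std(slopes), np.mean(se_slopes))   # the two should be close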
• As the sample size becomes large, we can replace this estimate by its prob-
ability limit, which is the true standard error, so that the denominator just
becomes a constant. When we do so, we have a normally distributed random
variable in the numerator divided by a constant, so the test-statistic is dis-
tributed as z, or standard normal. Another way to think of this intuitively is
that the t-distribution converges to the z distribution as n becomes large.
• Testing joint hypotheses also proceeds via constructing an F-test. In this case, we recall the formula for an F-test, put in terms of the restriction matrix:
F[J, n − K] = {(Rb − q)′[R(X′X)^{-1}R′]^{-1}(Rb − q)/J} / s²
• In small samples, we had to take account of the distribution of s2 itself. This
gave us the ratio of two chi-squared random variables, which is distributed
F.
• In large samples, we can replace s² by its probability limit, σ², which is just a constant. Multiplying both sides by J, we now have a test statistic, JF, composed of a chi-squared random variable in the numerator over a constant, so the JF statistic is the large-sample analog of the F-test. If the JF statistic is larger than the critical value, then we can say that the restrictions are unlikely to be true.
Heteroskedasticity
12.1 Heteroskedasticity as a violation of Gauss-Markov
• The third of the Gauss-Markov assumptions is that E[εε′] = σ²In. The variance-covariance matrix of the true error terms is structured as:
E(εε′) =
[ E(ε1²)    E(ε1ε2)   . . .  E(ε1εN) ]
[   ⋮          ⋮      . . .     ⋮    ]
[ E(εNε1)   E(εNε2)   . . .  E(εN²)  ]
 =
[ σ²  0   . . .  0  ]
[ 0   σ²  . . .  0  ]
[ ⋮    ⋮   . . .  ⋮  ]
[ 0   0   . . .  σ² ]
• If all of the diagonal terms are equal to one another, then each realization
of the error term has the same variance, and the errors are said to be ho-
moskedastic. If the diagonal terms are not the same, then the true error term
is heteroskedastic.
• Also, if the off-diagonal elements are zero, then the covariance between dif-
ferent error terms is zero and the errors are uncorrelated. If the off-diagonal
terms are non-zero, then the error terms are said to be auto-correlated and the
error term for one observation is correlated with the error term for another
observation.
• When the third assumption is violated, the variance-covariance matrix of the
error term does not take the special form, σ 2 In , and is generally written,
instead as σ 2 Ω. The disturbances in this case are said to be non-spherical
and the model should then be estimated by Generalized Least Squares, which
employs σ 2 Ω rather than σ 2 In .
• We can also use the formula we derived in the lecture on asymptotics to show
that β̂ OLS is consistent.
• It can be shown that the characteristic vectors are all orthogonal and for each
characteristic vector, c0i ci = 1 (Greene, 6th ed. p. 968–969). It follows that
CC0 = I, and C0 C = I, a fact that we will use below.
• GLS consists of estimating the following equation, using the standard OLS
solutions for the regression coefficients:
Py = PXβ + Pε
or
y∗ = X∗ β + ε∗ .
Thus, β̂_GLS = (X*′X*)^{-1}(X*′y*) = (X′Ω^{-1}X)^{-1}(X′Ω^{-1}y)
• It follows that the variance of ε*, E[ε*ε*′], is equal to Pσ²ΩP′, and:
Pσ²ΩP′ = σ² PΩP′ = σ² Λ^{-1/2}C′ΩCΛ^{-1/2}
       = σ² Λ^{-1/2}C′CΛ^{1/2}Λ^{1/2}C′CΛ^{-1/2}
       = σ² Λ^{-1/2}Λ^{1/2}Λ^{1/2}Λ^{-1/2}
       = σ² In
• In this general case, the maximum likelihood estimator will be the GLS estimator, β̂_GLS, and the Cramer-Rao lower bound for the variance of β̂_GLS is given by σ²(X′Ω^{-1}X)^{-1}.
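• A minimal numpy sketch of the GLS formula above, assuming Ω is known (in practice it must be estimated); the simulated heteroskedastic design is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 0.5])

# heteroskedastic errors: variance rises with x^2 (an assumed form for Omega)
omega_diag = 1.0 + X[:, 1] ** 2
y = X @ beta + rng.normal(size=n) * np.sqrt(omega_diag)

Omega_inv = np.diag(1.0 / omega_diag)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
V_gls = np.linalg.inv(X.T @ Omega_inv @ X)   # variance up to sigma^2, per the CR bound above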
εi = γ0 + γ1 xi + ui
• Since β̂ OLS is consistent and unbiased for β, we can use the OLS residuals
to estimate the model above. The procedure is that we first estimate the
original model using OLS, and then use the residuals from this regression to
estimate Ω̂ and √ωi.
• We then transform the data using these estimates, and use OLS again on
the transformed data to estimate β̂ GLS and the correct standard errors for
hypothesis testing.
• One can also use the estimated error terms from the last stage to conduct
FGLS again and keep on doing this until the error model begins to converge,
so that the estimated residuals barely change as one moves through the iter-
ations.
• Some examples (a short weighted-least-squares sketch follows this list):
➢ If σi² = σ²xi², we would divide all observations through by xi.
➢ If σi² = σ²xi, we would divide all observations through by √xi.
➢ If σi² = σ²(γ0 + γ1 Xi + ui), we would divide all observations through by √(γ0 + γ1 xi).
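• A minimal sketch of the second case above (σi² = σ²xi, with xi > 0 assumed): divide y and each column of X by √xi and run OLS on the transformed data. The simulated data are purely illustrative.

import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.7 * x + rng.normal(size=n) * np.sqrt(x)   # var(eps_i) = sigma^2 * x_i

w = 1.0 / np.sqrt(x)                      # weight each observation by 1/sqrt(x_i)
Xw, yw = X * w[:, None], y * w
b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)   # the GLS/WLS estimate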
• It has also been suggested that the test-statistic should be calculated using
only the first and last thirds of the data points, excluding the middle section
of the data, to sharpen the test results.
σi² = σ² f(α0 + α′zi)
Autocorrelation
13.1 The meaning of autocorrelation
• Auto-correlation is often paired with heteroskedasticity because it is another
way in which the variance-covariance matrix of the true error terms (if we
could observe it) is different from the Gauss-Markov assumption, E[εε0 ] =
σ 2 In .
• In our earlier discussion of heteroskedasticity, we saw what happened when
we relax the assumption that the variance of the error term is constant. An
equivalent way of saying this is that we relax the assumption that the errors
are “identically distributed.” In this section, we see what happens when we
relax the assumption that the error terms are independent.
• In this case, we can have errors that covary (e.g., if one error is positive and
large the next error is likely to be positive and large) and are correlated. In
either case, one error can give us information about another.
• Two types of error correlation:
1. Spatial correlation: e.g., of contiguous households, states, or counties.
2. Temporal or Autocorrelation: errors from adjacent time periods are cor-
related with one another. Thus, εt is correlated with εt+1 , εt+2 , . . . and
εt−1 , εt−2 , . . . , ε1 .
• The correlation between εt and εt−k is called autocorrelation of order k.
➢ The correlation between εt and εt−1 is the first-order autocorrelation and
is usually denoted by ρ1 .
➢ The correlation between εt and εt−2 is the second-order autocorrelation
and is usually denoted by ρ2 .
E(εε′) =
[ E(ε1²)    E(ε1ε2)   . . .  E(ε1εT) ]
[   ⋮          ⋮      . . .     ⋮    ]
[ E(εTε1)   E(εTε2)   . . .  E(εT²)  ]
Here, the off-diagonal elements are the covariances between the different error
terms. Why is this?
• Autocorrelation implies that the off-diagonal elements are not equal to zero.
• Since Σ et² is approximately equal to Σ et−1² as the sample size becomes large,
• First, we must write the E[εε0 ] matrix in a way that can represent autocor-
relation:
E[εε0 ] = σ 2 Ω
• Recall that
corr(εt, εt−s) = E[εt εt−s]/E[εt²] = ρs = γs/γ0
where γs = cov(εt , εt−s ) and γ0 = var(εt )
• Let R be the “autocorrelation matrix” showing the correlation between all
the disturbance terms. Then E[εε0 ] = γ0 R
• Second, we need to calculate the autocorrelation matrix. It helps to make
an assumption about the process generating the disturbances or true errors.
The most common assumption is that the errors follow an “autoregressive
process” of order one, written as AR(1).
An AR(1) process is represented as:
εt = ρεt−1 + ut
εt = ut + ρut−1 + ρ2 ut−2 + · · ·
• Each disturbance, εt , embodies the entire past history of the us, with the
most recent shocks receiving greater weight than those in the more distant
past.
• The successive values of ut are uncorrelated, so we can estimate the variance
of εt , which is equal to E[ε2t ], as:
If
y = a⁰·x + a¹·x + a²·x + a³·x + · · ·
then
y = x/(1 − a).
• Using this (with a = ρ²), we see that
var[εt] = σu²/(1 − ρ²) = σε² = γ0    (13.1)
• In other words, the auto-correlations fade over time. They are always less
than one and become less and less the farther two disturbances are apart in
time.
• But (given our assumptions) the ut s are serially independent, have constant
variance (σu2 ), and the covariance between different us is zero. Thus, the new
error term, ut , conforms to all the desirable features of Gauss-Markov errors.
• If we conduct OLS using this transformed data, and use our normal esti-
mate for the standard errors, or s2 (X0 X)−1 , we will get correct and efficient
standard errors.
• The transformation is:
yt∗ = yt − ρyt−1
x∗t = xt − ρxt−1
• If we drop the first observation (because we don’t have a lag for it), then we
are following the Cochrane-Orcutt procedure. If we keep the first observation
and use it with the following transformation:
y1* = √(1 − ρ²) y1
x1* = √(1 − ρ²) x1
then we are following something called the Prais-Winsten procedure. Both
of these are examples of FGLS.
• But how do we do this if we don’t “know” ρ? Since β̂OLS is unbiased, our
estimated residuals are unbiased for the true disturbances, εt .
• In this case, we can compute the residuals from an initial OLS regression, use them to estimate ρ, and then perform FGLS using this estimate. The standard errors that we calculate using this procedure will be asymptotically efficient.
• To estimate ρ from the residuals, et, compute:
ρ̂ = Σ et et−1 / Σ et²
We could also estimate ρ from a regression of et on et−1 .
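• A minimal sketch of this FGLS procedure (a single Cochrane-Orcutt style iteration) on simulated AR(1) data; the design and the decision not to iterate further are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
T = 300
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):                        # AR(1) errors with rho = 0.6
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(T), x])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols
rho_hat = (e[1:] @ e[:-1]) / (e @ e)         # rho-hat = sum(e_t e_{t-1}) / sum(e_t^2)

# Cochrane-Orcutt: quasi-difference the data and drop the first observation
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
b_fgls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)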
• Thus,
E[β̂_OLS] = β + cov(yt−1, εt)/var(yt)
• To show that β̂OLS is biased, we need to show only that cov(yt−1 , εt ) 6= 0, and
to show that β̂OLS is inconsistent, we need to show only that the limit of this
covariance as T → ∞ is not equal to zero.
cov[yt−1 , εt ] = cov[yt−1 , ρεt−1 + ut ] = ρcov[yt−1 , εt−1 ] = ρcov[yt , εt ]
• The last step is true assuming the DGP is “stationary” and the ut s are un-
correlated.
• Continuing:
ρcov[yt , εt ] = ρcov[βyt−1 + εt , εt ] = ρ{βcov[yt−1 , εt ] + cov[εt , εt ]} (13.3)
= ρ{βcov[yt−1 , εt ] + var[εt ]} (13.4)
y = βx + ε
• This implies that x can be written as z − u and the true model can then be
written as:
y = β(z − u) + ε = βz + (ε − βu) = βz + η
The new disturbance, η, is a function of u (the error in measuring x). z is also
a function of u. This sets up a non-zero covariance (and correlation) between
z, our imperfect measure of the true regressor x, and the new disturbance
term, (ε − βu).
• The covariance between them, as we showed then, is equal to:
cov[z, (ε − βu)] = cov[x + u, ε − βu] = −βσu2
• How does this lead to bias? We go back to the original equation for the
expectation of the estimated β̂ coefficient, E[β̂] = β + (X0 X)−1 X0 ε. It is
easy to show:
E[β̂] = E[(z′z)^{-1}(z′y)] = E[(z′z)^{-1}(z′(βz + η))]    (13.5)
     = E[[(x + u)′(x + u)]^{-1}(x + u)′(βx + ε)]    (13.6)
     = E[ Σi (x + u)(βx + ε) / Σi (x + u)² ] = βx²/(x² + σu²) = β · x²/(x² + σu²)    (13.7)
• Note: establishing a non-zero covariance between the regressor and the error
term is sufficient to prove bias but is not the same as indicating the direction
of the bias. In this special case, however, β̂ is biased downward.
• To show consistency we must show that:
plim β̂ = β + (X′X/n)^{-1}(X′ε/n) = β    (recall Q* = X′X/n)
• This involves having to show that plim (X′ε/n) = 0.
• In the case above,
plim β̂ = plim (1/n) Σi (x + u)(βx + ε) / plim (1/n) Σi (x + u)² = βQ*/(Q* + σu²)
• Describing the bias and inconsistency in the case of a lagged dependent vari-
able with autocorrelation would follow the same procedure. We would look
at the expectation term to show the bias and the probability limit to show
inconsistency. See particularly Greene, 6th ed., pp. 325–327.
y = β0 + β1 x1 + β2 x2 + ε
x1 = Zα + u
y = β0 + β1 x̃1 + β2 x2 + ε
yt = Xβ + γyt−1 + εt
where εt = ρεt−1 + ut
• Use the predicted values from a regression of yt on Xt and Xt−1 as an estimate
of yt−1 . The coefficient on ỹt−1 is a consistent estimate of γ, so it can be used
to estimate ρ and perform FGLS.
X̃ = Zα = Z(Z0 Z)−1 Z0 X
• Then
1. Suppose some random event sparks political violence (ηi is big). This
could be because you did not control for the actions of demagogues in
stirring up crowds.
2. This causes growth to fall through the effect captured in β1 .
3. Thus, Gi and ηi are negatively correlated.
(E[ε0 η] = E[ε]E[η] because the two error terms are assumed independent).
14.3 Identification
• There are two conditions to check for identification: the order condition and
the rank condition. In theory, since the rank condition is more binding, one
checks first the order condition and then the rank condition. In practice, very
few people bother with the rank condition.
• In general, this means that there must be at least one exogenous variable in
the system, excluded from that equation, in order to estimate the coefficient
on the endogenous variable that is included as an explanatory variable in that
equation.
• These conditions are necessary for a given degree of identification. The Rank
Condition is sufficient for each type of identification. The Rank Condition
assumes the order condition and adds that the reduced form equations must
each have full rank.
1. Estimate the reduced form equations using OLS. To do this, regress each
endogenous variable on all the exogenous variables in the system. In our
running example we would have:
Vi = φ0 + φ1 Zi + φ2 Xi + ηi∗
and
Gi = γ0 + γ1 Zi + γ2 Xi + ε∗i
2. From these first-stage regressions, estimate G̃i and Ṽi for each observa-
tion. The predicted values of the endogenous variables can then be used
to estimate the structural models:
Vi = α0 + α1 G̃i + α2 Zi + ηi
Gi = β0 + β1 Ṽi + β2 Xi + εi
• This is just what we would do in standard IV estimation, which is to regress
the problem variable on its instruments and then use the predicted value in
the main regression.
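• A minimal numpy sketch of this two-stage logic for one endogenous regressor and one instrument, on simulated data (all names and the data-generating design are illustrative). As discussed below, the second-stage standard errors should be recomputed from residuals that use the actual regressor rather than its fitted values.

import numpy as np

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)                    # instrument
v = rng.normal(size=n)                    # common shock creating endogeneity
g = 0.8 * z + v + rng.normal(size=n)      # endogenous regressor G
y = 1.0 + 0.5 * g - 1.0 * v + rng.normal(size=n)

# Stage 1: regress the endogenous variable on the instrument(s)
Z = np.column_stack([np.ones(n), z])
g_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ g)

# Stage 2: use the fitted values in place of the endogenous regressor
X2 = np.column_stack([np.ones(n), g_hat])
b_2sls = np.linalg.solve(X2.T @ X2, X2.T @ y)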
• This gives us an estimate of the fit of the structural model. There is one oddity
in the calculated R2 that may result. When we derived R2 , we used the fact
that the OLS normal equations produce residuals such that X′e = 0.
This gave us the result that:
SST = SSR + SSE    or    Σ(yi − ȳ)² = β̂² Σ(xi − x̄)² + Σ ei²
The variances in y can be completely partitioned between the variance from
the model and the variance from the residuals. This is no longer the case
when you re-estimate the errors using the real values of Gi and Vi rather
than their instruments and you can get cases where SSE > SST. In this
case, you can get negative values of R2 in the second stage. This may be
perfectly okay if the coefficients are of the right sign and the standard errors
are small.
• It also matters in this case whether you estimate R² as SSR/SST or as 1 − SSE/SST.
β̂_IV = (X̃′X̃)^{-1}X̃′y = (X′Z(Z′Z)^{-1}Z′X)^{-1}(X′Z(Z′Z)^{-1}Z′(Xβ + ε))
     = β + (X′Z(Z′Z)^{-1}Z′X)^{-1}(X′Z(Z′Z)^{-1}Z′ε)
     = β + (X̃′X̃)^{-1}(X̃′ε)
var[β̂_IV] = E[(β̂_IV − β)(β̂_IV − β)′] = E[(X̃′X̃)^{-1}X̃′εε′X̃(X̃′X̃)^{-1}]
• As with R2 , we should use the estimated residuals from the structural model
with the true variables in it rather than the predicted values. These are
consistent estimates of the true disturbances.
ε̂i = Gi − (β̂ 0 + β̂ 1 Vi + β̂ 2 Xi )
• The IV estimator can have very large standard errors, because the instru-
ments by which X is proxied are not perfectly correlated with it and your
residuals will be larger.
xt = β0 + β1 xt−1 + β2 yt−1 + εt
• If a t-test indicates that β2 = 0, then we say that y does not Granger cause x.
If y does not Granger cause x, then x is often said to be exogenous in a system
of equations with y. Here, exogeneity implies only that prior movements in
y do not lead to later movements in x.
• Kennedy has a critique of Granger Causality in which he points out that under
this definition weather reports “cause” the weather and that an increase in
Christmas card sales “cause” Christmas. The problem here is that variables
that are based on expectations (that the weather will be rainy, that Christmas
will arrive) cause earlier changes in behavior (warnings to carry an umbrella
and a desire to buy cards).
• The trouble, for a long time, was that no-one knew how to estimate this vari-
ance since it should involve the covariances between β̂ IV and β̂ OLS . Hausman
solved this by proving that the covariance between an efficient estimator β̂ OLS
of a parameter vector β and its difference from an inefficient estimator β̂ IV
of the same parameter vector is zero (under the null). (For more details see
Greene, 6th ed., Section 12.4.)
• Based on this proof, we can say that:
• Under the null hypothesis, we are using two different but consistent estimators
of σ 2 . If we use s2 as a common estimator of this, the Hausman statistic will
be:
H = (β̂_IV − β̂_OLS)′[(X̃′X̃)^{-1} − (X′X)^{-1}]^{-1}(β̂_IV − β̂_OLS) / s²
• This test statistic is distributed χ2 but the appropriate degrees of freedom for
the test statistic will depend on the context (i.e., how many of the variables
in the regression are thought to be endogenous).
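• A sketch of this statistic as a hypothetical helper function (the names hausman, X, Xt, and the use of a pseudo-inverse are my own assumptions); df is the number of regressors treated as endogenous, and b_iv, b_ols, s2 come from the two estimations.

import numpy as np
from scipy import stats

# X: regressor matrix; Xt (X-tilde): the same matrix with first-stage fitted
# values in place of the suspect regressor; s2 is the common estimate of sigma^2.
def hausman(b_iv, b_ols, X, Xt, s2, df):
    d = b_iv - b_ols
    V_diff = s2 * (np.linalg.inv(Xt.T @ Xt) - np.linalg.inv(X.T @ X))
    H = d @ np.linalg.pinv(V_diff) @ d        # pinv guards against singularity
    return H, stats.chi2.sf(H, df)            # statistic and p-value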
y = Xβ + X̃α + Wγ + ε
1. Estimate the 2SLS residuals using η̂i = Vi − (α̂0 + α̂1 Gi + α̂2 Zi). Use η̂i as your consistent estimate of the true errors.
2. Regress η̂i on all the instruments used in estimating the 2SLS coefficients.
3. You can approximately test the over-identifying restrictions via inspec-
tion of the F -statistic for the null hypothesis that all instruments are
jointly insignificant. However, since the residuals are only consistent for
the true errors, this test is only valid asymptotically and you should
technically use a large-sample test.
➢ An alternative is to use n·R² ∼ᵃ χ² w/ df = (# of instruments) − (# of endogenous regressors). If we reject the null ⇒ at least some of the IVs are not exogenous.
Section 15
• In the case of an AR(1) process we can substitute infinitely for the y terms
on the right-hand side (as we did previously) to show that:
yt = µ + γµ + γ²µ + ... + εt + γεt−1 + γ²εt−2 + ... + γ^∞ εt−∞ = Σ_{i=0}^{∞} γⁱµ + Σ_{i=0}^{∞} γⁱεt−i
• So, one way to remember an auto-regressive process is that your current state
is a function of all your previous errors.
• We can present this information far more simply using the lag operator or
L:
Lxt = xt−1 and L2 xt = xt−2 and (1 − L)xt = xt − xt−1
• Using the lag operator, we can write the original AR(1) series as:
yt = µ + γLyt + εt
so that:
(1 − γL)yt = µ + εt
and
yt = µ/(1 − γL) + εt/(1 − γL) = µ/(1 − γ) + εt/(1 − γL) = Σ_{i=0}^{∞} γⁱµ + Σ_{i=0}^{∞} γⁱεt−i
• The last step comes from something we encountered before: how to represent
an infinite series:
A = x(1 + a + a² + ... + aⁿ)
If |a| < 1, then the solution to this series is approximately A = x/(1 − a). In other words, the sequence is convergent (has a finite solution).
• Thus, for the series
Σ_{i=0}^{∞} γⁱεt−i = εt + γεt−1 + γ²εt−2 + ... = εt + γLεt + γ²L²εt + ...
we have that a = γL and Σ_{i=0}^{∞} γⁱεt−i = εt/(1 − γL).
• For similar reasons, Σ_{i=0}^{∞} γⁱµ = µ/(1 − γL) = µ/(1 − γ), because Lµ = µ. In other words, µ, the per-period innovation, is not subscripted by time and is assumed to be the same in each period.
15.3 Stationarity
• So, an AR(1) process can be written quite simply as:
yt = µ/(1 − γ) + εt/(1 − γL)
• Recall, though, that this requires that |γ| < 1. If |γ| ≥ 1 then we cannot
even define yt . yt keeps on growing as the error terms collect. Its expectation
will be undefined and its variance will be infinite.
• That means that we cannot use standard statistical procedures if the auto-regressive process is characterized by |γ| ≥ 1. This is known as the stationarity condition, and the data series is said to be stationary if |γ| < 1.
• The problem is that our standard results for consistency and for hypothesis testing require that (X′X/n) converges to a finite, positive definite matrix. This is no longer true: the elements of (X′X) grow without bound when the data series X is non-stationary.
• In the more general case of an AR(p) model, the only difference is that the lag
function by which we divide the right-hand side, (1 − γL), is more complex
and is often written as C(L). In this case, the stationarity condition requires
that the roots of this more complex expression “lie outside the unit circle.”
yt = µ + εt − θεt−1
In this case, your current state depends on only the current and previous
errors.
Using the lag operator:
yt = µ + (1 − θL)εt
Thus,
yt/(1 − θL) = µ/(1 − θL) + εt
Once again, if |θ| < 1, then we can invert the series and express yt as an
infinite series of its own lagged values:
yt = µ/(1 − θ) − θyt−1 − θ²yt−2 − ... + εt
• If we had a more general, MA(q) process, with more lags, we could go through
the same steps, but we would have a more complex function of the lags than
(1 − θL). Greene’s textbook refers to this function as D(L). In this case, the
invertibility condition is satisfied when the roots of D(L) lie outside the unit
circle (see Greene, 6th ed., pp. 718–721).
• The question is, what kind of non-stationary sequence do we have and how
can we tell it’s non-stationary. Consider the following types of non-stationary
series:
1. The Pure Random Walk
yt = yt−1 + εt
2. The Random Walk with Drift
yt = µ + yt−1 + εt
For the random walk with drift process, we can show that E[yt] = y0 + tµ and var[yt] = tσ². Both the mean and the variance are non-constant.
➢ In this case, first differencing of the series also will give you a variable
that has a constant mean and variance.
3. The Trend Stationary Process
yt = µ + βt + εt
(1 − L)yt = α + ν
• A unit-root test is based on a model that nests the different processes above
into one regression that you run to test the properties of the underlying data
series:
yt = µ + βt + γyt−1 + εt
• Next subtract yt−1 from both sides of the equation to produce the equation
below. This produces a regression with a (difference) stationary dependent
variable (even under the null of non-stationarity) and this regression forms
the basis for Dickey-Fuller tests of a unit root:
yt − yt−1 = µ + βt + (γ − 1)yt−1 + εt
• There is one complication. Two statisticians, Dickey and Fuller (1979, 1981), showed that if the root is exactly equal to one, the standard errors will be under-estimated, so that revised critical values are required for the test statistic above.
• For this reason, the test for stationarity is referred to as the Dickey-Fuller
test. The augmented Dickey-Fuller test applies to the same equation above
but adds lags of the first difference in y, (yt − yt−1 ).
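• A sketch of this augmented Dickey-Fuller regression built by hand with numpy on a simulated random walk; the single augmentation lag and the drift-plus-trend specification are illustrative choices, and the t-ratio on (γ − 1) must be compared with Dickey-Fuller critical values rather than the usual t table.

import numpy as np

rng = np.random.default_rng(6)
T = 250
y = np.cumsum(rng.normal(size=T))          # a pure random walk (unit root) for illustration

dy = np.diff(y)                            # y_t - y_{t-1}
dep = dy[1:]                               # lose one more observation for the lagged difference
X = np.column_stack([
    np.ones(T - 2),                        # drift (mu)
    np.arange(2, T),                       # deterministic trend (beta * t)
    y[1:-1],                               # y_{t-1}: its coefficient is (gamma - 1)
    dy[:-1],                               # one lagged difference (the "augmented" part)
])
b = np.linalg.solve(X.T @ X, X.T @ dep)
e = dep - X @ b
s2 = e @ e / (X.shape[0] - X.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
adf_stat = b[2] / se[2]                    # compare to Dickey-Fuller critical values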
• One problem with the Dickey-Fuller unit-root test is that it has low power
and seems to privilege the null hypothesis of a random walk process over the
alternatives.
• To sum up: If your data looks like a random walk, you will have to difference it
until you get something that looks stationary. If your data looks like it’s trend
stationary, you will have to de-trend it until you get something stationary.
An ARMA model carried out on differenced data is called an ARIMA model,
standing for “Auto-Regressive, Integrated, Moving-Average.”
• If yt and yt−s are both expressed in terms of deviations from their means, and
if var(yt ) = var(yt−s ) then:
corr(yt, yt−s) = cov(yt, yt−s)/√(var(yt) var(yt−s)) = E[yt yt−s]/E[yt²] = γs/γ0
γ0 = var[yt] = E{yt − E[yt]}² = E[(εt − θεt−1)²] = E(εt²) + θ²E(εt−1²) = (1 + θ²)σε²
and
γ1 = cov(yt, yt−1) = E[(εt − θεt−1)(εt−1 − θεt−2)] = −θσε²
• The covariances between yt and yt−s when s > 1 are zero, because the ex-
pression for yt only involves two error terms. Thus, the ACF for an MA(1)
process has one or two spikes and then shows no autocorrelation.
yt = ρyt−1 + εt
• There will be an initial spike at the first lag (where the autocorrelation equals
ρ) and then nothing, because no other lagged value of y is significant.
• For the MA(1) process, the partial autocorrelation function will look like a
gradually declining wave, because any MA(1) process can be written as an
infinite AR process with declining weights on the lagged values of y.
This allows for memory but may have a large number of coefficients,
reducing your degrees of freedom.
3. The Exponential Distributed Lag (EDL) Model
At = Σ_{i=0}^{M} (Xt−i λⁱ)β + εt
Part III
Special Topics
Section 16
• The degrees of freedom for the model are now NT − k − N − T. The significance, or not, of the country-specific and time-specific effects can be tested by using an F-test to see if the country (time) dummies are jointly significant.
• The general approach of including unit-specific dummies is known as Least
Squares Dummy Variables model, or LSDV.
• Can also include (T − 1) year dummies for time effects. These give the
difference between the predicted causal effect from xit β and what you would
expect for that year. There has to be one year that provides the baseline
prediction.
Then:
ȳi. = αi + x̄i. β + ε̄i.
If we run OLS on this regression it will produce what is known as the “Be-
tween Effects” estimator, or β BE , which shows how the mean level of the
dependent variable for each country varies with the mean level of the inde-
pendent variables.
Subtracting this from eq. 16.1 gives
(yit − ȳi.) = (xit − x̄i.)β + (εit − ε̄i.)
• If we run OLS on this regression it will produce what is known as the “Fixed
Effects” estimator, or β F E .
• It is identical to LSDV and is sometimes called the within-group estimator,
because it uses only the variation in yit and xit within each group (or country)
to estimate the β coefficients. Any variation between countries is assumed to
spring from the unobserved fixed effects.
• Note that if time-invariant regressors are included in the model, the standard
FE estimator will not produce estimates for the effects of these variables.
Similar issue w/ LSDV.
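• A minimal numpy sketch of the within transformation for a balanced panel with a single regressor: demean y and x within each unit and compute the slope from the demeaned data. All names and the simulated design are illustrative.

import numpy as np

rng = np.random.default_rng(7)
N, T = 30, 10
alpha = rng.normal(size=N)                      # unit-specific fixed effects
x = rng.normal(size=(N, T))
y = alpha[:, None] + 0.5 * x + rng.normal(size=(N, T))

# within transformation: subtract each unit's mean over time
y_w = y - y.mean(axis=1, keepdims=True)
x_w = x - x.mean(axis=1, keepdims=True)

beta_fe = (x_w * y_w).sum() / (x_w ** 2).sum()  # FE (within) estimator, single regressor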
S^T_xy = Σ_{i=1}^{N} Σ_{t=1}^{T} (xit − x̄)(yit − ȳ)
and:
S^T_xy = S^W_xy + S^B_xy
and
β̂_FE = [S^W_xx]^{-1}[S^W_xy]
while
β̂_BE = [S^B_xx]^{-1}[S^B_xy]
• The standard β̂ OLS uses all the variation in yit and xit to calculate the slope
coefficients while β̂ F E just uses the variation across time and β̂ BE just uses
the variation across countries.
• We can show that β̂ OLS is a weighted average of β̂ F E and β̂ BE . In fact:
β̂_OLS = F^W β̂_FE + F^B β̂_BE
where F^W = [S^W_xx + S^B_xx]^{-1} S^W_xx and F^B = [I − F^W]
• The way in which the RE model differs from the original OLS estimation (with
no fixed effects) is only in the specification of the disturbance term. When the
model differs from the standard G-M assumptions only in the specification
of the errors, the regression coefficients can be consistently and efficiently
estimated by Generalized Least Squares (GLS) or (when we don't exactly know Ω) by FGLS.
• Thus, we can do a transformation of the original data that will create a new
var-cov matrix for the disturbances that conforms to G-M.
To estimate this we will need to know Ω^{-1} = [IN ⊗ Σ]^{-1}, which means that we need to estimate Σ^{-1/2}:
Σ^{-1/2} = (1/σε)[I − (θ/T) iT iT′]
where
θ = 1 − σε/√(Tσu² + σε²)
• Then the transformation of yi and Xi for FGLS is
Σ^{-1/2} yi = (1/σε) [ yi1 − θȳi. ,  yi2 − θȳi. ,  … ,  yiT − θȳi. ]′
β̂ RE = F̂ W β̂ F E + (I − F̂ W )β̂ BE
where:
F̂^W = [S^W_xx + λS^B_xx]^{-1} S^W_xx
and
λ = σε²/(σε² + Tσu²) = (1 − θ)²
• If λ = 1, then the RE model reduces to OLS. There is essentially no country-
specific disturbance term, so the regression coefficients are most efficiently
estimated using the OLS method.
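• A sketch of this quasi-demeaning, assuming the variance components σu² and σε² have already been estimated somehow; OLS on the transformed data then gives the (F)GLS random-effects estimates.

import numpy as np

def re_transform(y, X, sigma_u2, sigma_e2):
    """y: (N, T) array; X: (N, T, k) array; returns quasi-demeaned data."""
    N, T = y.shape
    theta = 1.0 - np.sqrt(sigma_e2 / (T * sigma_u2 + sigma_e2))
    y_star = y - theta * y.mean(axis=1, keepdims=True)   # y_it - theta * ybar_i.
    X_star = X - theta * X.mean(axis=1, keepdims=True)
    return y_star.reshape(N * T), X_star.reshape(N * T, -1)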
• Greene, 6th ed., p. 205–206, also shows how to perform a Breusch-Pagan test
for RE based on the residuals from the original OLS regression. This tests
for the appropriateness of OLS versus the alternative of RE. It does not test
RE against FE.
• Consider the following example to see how this will show up in the variance-
covariance matrix of the errors. Assume that N = 2 and T = 2 and that the
data are stacked by country and then by time period:
• The covariance matrix for spherical errors is:
Ω =
[ σ²  0   0   0  ]
[ 0   σ²  0   0  ]
[ 0   0   σ²  0  ]
[ 0   0   0   σ² ]
where
Ω =
[ E(ε11ε11)  E(ε11ε12)  E(ε11ε21)  E(ε11ε22) ]
[ E(ε12ε11)  E(ε12ε12)  E(ε12ε21)  E(ε12ε22) ]
[ E(ε21ε11)  E(ε21ε12)  E(ε21ε21)  E(ε21ε22) ]
[ E(ε22ε11)  E(ε22ε12)  E(ε22ε21)  E(ε22ε22) ]
• This is the covariance matrix with contemporaneous correlation:
Ω =
[ σ²   0    σ12  0   ]
[ 0    σ²   0    σ12 ]
[ σ12  0    σ²   0   ]
[ 0    σ12  0    σ²  ]
• This is the covariance matrix with all of the above and first-order serial correlation:
Ω =
[ σ1²  ρ    σ12  0   ]
[ ρ    σ1²  0    σ12 ]
[ σ12  0    σ2²  ρ   ]
[ 0    σ12  ρ    σ2² ]
• The most popular method for addressing these issues is known as panel cor-
rected standard errors (PCSEs) which is due to Beck and Katz (APSR ’95).
• In other cases of autocorrelation and/or heteroskedasticity, we have suggested GLS or FGLS. We would transform the data so that the errors become spherical by multiplying x and y by Ω̂^{-1/2}. The FGLS estimates of β are now equal to (X′Ω̂^{-1}X)^{-1}(X′Ω̂^{-1}y).
• FGLS is most often performed by first using the OLS residuals to estimate ρ
(to implement the Prais-Winsten method) and then using the residuals from
an OLS regression on this data to estimate the contemporaneous correlation.
This is known as the Parks method and is done in Stata via xtgls.
• Beck and Katz argue, however, that unless T >> N , then very few periods
of data are being used to compute the contemporaneous correlation between
each pair of countries.
• In addition, estimates of the panel-specific autocorrelation coefficient, ρi , are
likely to be biased downwards when they are based on few T observations.
• In other words, FGLS via the Parks method has undesirable small sample
properties when T is not many magnitudes greater than N .
• To show that FGLS leads to biased estimates of the standard errors in TSCS data, Beck and Katz use Monte Carlo simulation.
• Let
Σ =
[ σ1²   σ12   σ13  · · ·  σ1N ]
[ σ12   σ2²   σ23  · · ·  σ2N ]
[  ⋮      ⋮     ⋮    ⋱     ⋮   ]
[ σ1N   σ2N   σ3N  · · ·  σN² ]
• Use OLS residuals, denoted eit for unit i at time t (in Beck and Katz’s nota-
tion), to estimate the elements of Σ:
Σ̂ij = (Σ_{t=1}^{T} eit ejt) / T,    (16.2)
which means the estimate of the full matrix Σ̂ is
Σ̂ = E′E / T
where E is a T × N matrix formed by re-shaping the NT × 1 vector of OLS residuals, such that the columns contain the T × 1 vectors of residuals for each cross-sectional unit (or, conversely, each row contains the N × 1 vector of residuals for each cross-section in a given time period):
E =
[ e11  e21  . . .  eN1 ]
[ e12  e22  . . .  eN2 ]
[  ⋮    ⋮    . . .   ⋮  ]
[ e1T  e2T  . . .  eNT ]
Then
Ω̂ = (E′E / T) ⊗ IT
• Compute SEs using the square roots of the diagonal elements of (X′X)^{-1} X′Ω̂X (X′X)^{-1}.
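• A plausible numpy rendering of this computation (not Beck and Katz's own code), assuming a balanced panel with the NT × 1 residual vector e stacked by unit as in the example above:

import numpy as np

def pcse(X, e, N, T):
    """Panel-corrected standard errors for a balanced panel stacked by unit."""
    E = e.reshape(N, T).T                 # T x N matrix of residuals, one column per unit
    Sigma_hat = (E.T @ E) / T             # N x N contemporaneous covariance estimate
    Omega_hat = np.kron(Sigma_hat, np.eye(T))
    XtX_inv = np.linalg.inv(X.T @ X)
    V = XtX_inv @ X.T @ Omega_hat @ X @ XtX_inv
    return np.sqrt(np.diag(V))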
• There are two problems with this way of estimating the probability of yi = 1.
1. The model may predict values of yi outside the zero-one range.
2. The linear probability model is heteroskedastic. To see this, note that if yi = xi β + εi, then εi = 1 − xi β when yi = 1 and εi = −xi β when yi = 0. Thus:
• Given these two drawbacks, it seems wise to select a non-linear function for
the predicted probability that will not exceed the values of one or zero and
that will offer a better fit to the data, where the data is composed of zeroes
and ones.
• For this reason, we normally assume a cumulative density function for the
error term where:
lim Pr(Y = 1) = 1
xβ→+∞
and
lim Pr(Y = 1) = 0
xβ→−∞
• The two most popular choices for a continuous probability distribution that
satisfy these requirements are the normal and the logistic.
• The normal distribution function for the errors gives rise to the probit model:
F (xi β) = Φ(xi β) where Φ is the cumulative normal distribution.
• The logistic function gives rise to the logit model:
F(xi β) = e^{xi β}/(1 + e^{xi β}) = Λ(xi β).
• The vector of coefficients, β, is estimated via maximum likelihood.
• Maximizing the log likelihood with respect to β yields the values of the pa-
rameters that maximize the likelihood of observing the sample. From what
we learned of maximum likelihood earlier in the course, we can also say that
these are the most probable values of the coefficients for the sample.
• The first order conditions for the maximization require that:
∂ ln L/∂β = Σ_{i=1}^{n} [ yi (fi/Fi) + (1 − yi)(−fi/(1 − Fi)) ] xi = 0
• As a probability, it must lie between zero and one. When we take the natural
log of this, to get the log likelihood,
ln L = Σ_{i=1}^{n} { yi ln F(xi β) + (1 − yi) ln[1 − F(xi β)] },
we will get a number between negative infinity and zero. The natural log of one is zero, because e⁰ = 1, and the natural log of zero is negative infinity, because e^{−∞} = 0. Thus, increases in the log likelihood toward zero imply that the probability of seeing the entire sample given the estimated coefficients is higher.
• The maximum likelihood method selects that value of β that maximizes the
log likelihood and reports the value of the log likelihood at β̂ M L . A larger
negative number implies that the model does not fit the data well.
• As you add variables to the model, you expect the log likelihood to get closer to zero (i.e., increase). ln L̂0 is the log likelihood for the model with just a constant; ln L̂ is the log likelihood for the full model, so ln L̂0 should always be more negative (farther from zero) than ln L̂.
• Where ln L̂0 = ln L̂, the additional variables have added no predictive power,
and the pseudo- R2 is equal to zero. When ln L̂= 0, the model now perfectly
predicts the data, and the pseudo- R2 is equal to one.
• A similar measure is the likelihood ratio test of the hypothesis that the coefficients on all the explanatory variables in the model (except the constant) are zero (similar to the F-test under OLS).
• Although these measures offer the benefit of comparison to OLS techniques,
they do not easily get at what we want to do, which is to predict accurately
when yi is going to fall into a particular category. An alternative measure of
the goodness of fit of the model might be the percentage of observations that
were correctly predicted. This is calculated as:
The method of model prediction is to say that we predict a value of one for
yi if F (xi β) ≥ 0.5.
Let ŷi be the predicted value of yi .
• The percent correctly predicted is:
(1/N) Σ_{i=1}^{N} [ yi ŷi + (1 − yi)(1 − ŷi) ]
• This measure has some shortcomings itself, which are overcome by the ex-
pected percent correctly predicted (ePCP):
ePCP = (1/N) [ Σ_{yi=1} p̂i + Σ_{yi=0} (1 − p̂i) ]
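• A sketch of both fit measures, assuming fitted probabilities p_hat and observed 0/1 outcomes y as numpy arrays and the conventional 0.5 cutoff.

import numpy as np

def pcp(y, p_hat, cutoff=0.5):
    y_hat = (p_hat >= cutoff).astype(float)            # predicted 0/1 outcomes
    return np.mean(y * y_hat + (1 - y) * (1 - y_hat))  # percent correctly predicted

def epcp(y, p_hat):
    return (p_hat[y == 1].sum() + (1 - p_hat[y == 0]).sum()) / len(y)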
Section 18
yi∗ = xi β + εi .
yi = 1 if yi* < γ1
     2 if γ1 < yi* ≤ γ2
     3 if γ2 < yi* ≤ γ3
     ⋮
     m if γ_{m−1} < yi*
Pr(yi = 1) = Φ(γ1 − xi β)
Pr(yi = 2) = Φ(γ2 − xi β) − Φ(γ1 − xi β)
Pr(yi = 3) = Φ(γ3 − xi β) − Φ(γ2 − xi β)
..
.
Pr(yi = m) = 1 − Φ(γm−1 − xi β)
= Φ(xi β − γm−1 )
• Then
Pr(zij = 1) = Φ(γj − xi β) − Φ(γj−1 − xi β)
• We compute marginal effects just as for the dichotomous probit model. For the three-category case these would be:
∂Pr(yi = 1)/∂xi = −φ(γ1 − xi β)β
∂Pr(yi = 2)/∂xi = [φ(γ1 − xi β) − φ(γ2 − xi β)]β
∂Pr(yi = 3)/∂xi = φ(γ2 − xi β)β
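• A sketch of these three-category probabilities and marginal effects using scipy's normal cdf and pdf; beta and the cutpoints gamma1, gamma2 are assumed to have been estimated already.

import numpy as np
from scipy.stats import norm

def oprobit_probs(xb, gamma):
    """xb: array of x_i beta; gamma: (gamma1, gamma2). Returns Pr(y = 1, 2, 3)."""
    g1, g2 = gamma
    p1 = norm.cdf(g1 - xb)
    p2 = norm.cdf(g2 - xb) - norm.cdf(g1 - xb)
    p3 = 1.0 - norm.cdf(g2 - xb)
    return np.column_stack([p1, p2, p3])

def oprobit_margins(xb, beta, gamma):
    """Marginal effects of x on each category probability, per the formulas above."""
    g1, g2 = gamma
    return (-norm.pdf(g1 - xb)[:, None] * beta,
            (norm.pdf(g1 - xb) - norm.pdf(g2 - xb))[:, None] * beta,
            norm.pdf(g2 - xb)[:, None] * beta)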
Uij = xi β j + εij
In this case xi are the attributes of the voters selecting a party and β j are
choice-specific coefficients determining how utility from a party varies with
voter attributes (e.g., union members are presumed to derive more utility
from the Democratic party). Uij is the utility to voter i of party j.
• The voter will choose party one over party two if Ui1 > Ui2, i.e., if εi2 − εi1 < xi(β1 − β2).
• The likelihood of this being true can be modeled using the logistic function if
we assume that the original errors are distributed log-Weibull. This produces
a logistic distribution for (εi1 − εi2 ) because if two random variables, εi1 and
εi2 , are distributed log-Weibull, then the difference between the two random
variables is distributed according to the logistic distribution.
• When there are only two choices (party one or party two) the probability of
a voter choosing party 1 is given by:
Pr(yi = 1|xi) = e^{xi(β1−β2)} / (1 + e^{xi(β1−β2)})
Thus, the random utility model in the binary case gives you something that
looks very like the regular logit.
• We notice from this formulation that the two coefficients, β 1 and β 2 cannot
be estimated separately. One of the coefficients, for instance β 1 , serves as
a base and the estimated coefficients (β j − β 1 ) tell us the relative utility of
choice j compared to choice 1. We normalize β 1 to zero, so that Ui1 = 0 and
we measure the probability that parties 2 and 3 are selected over party 1 as
voter characteristics vary.
• Thus, the probability of selecting party j compared to party 1 depends on the
relative utility of that choice compared to one. Recalling that exi β1 = e0 = 1,
and substituting into the equation above, we find that the probability of
selecting party 2 over party 1 equals:
Pr(yi = 2) / Pr(yi = 1) = e^{xi β2}
• In just the same way, if there is a third party option, we can say that the
relative probability of voting for that party over party 1 is equal to:
Pr(yi = 3) / Pr(yi = 1) = e^{xi β3}
Finally, we could use a ratio of the two expressions above to give us the
relative probability of picking Party 2 over Party 3.
• What we may want, however, is the probability that you pick Party 2 (or any
other party). In other words, we want Pr(yi = 2) and Pr(yi = 3). To get
this, we use a little simple algebra. First, from the ratios above, Pr(yi = 2) = e^{xi β2} Pr(yi = 1) and Pr(yi = 3) = e^{xi β3} Pr(yi = 1).
• Next, we use the fact that the three probabilities must sum to one (the three
choices are mutually exclusive and collectively exhaustive):
So that:
Pr(yi = 1) + exi β2 Pr(yi = 1) + exi β3 Pr(yi = 1) = 1
And:
Pr(yi = 1)[1 + e^{xi β2} + e^{xi β3}] = 1
so that
Pr(yi = 1) = 1 / (1 + e^{xi β2} + e^{xi β3})
• Using this expression for the probability of the first choice, we can say that:
Pr(yi = 2) = e^{xi β2} / (1 + e^{xi β2} + e^{xi β3})
Pr(yi = 3) = e^{xi β3} / (1 + e^{xi β2} + e^{xi β3})
Pr(yi = 1) = 1 / (1 + e^{xi β2} + e^{xi β3})
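• A sketch of these probabilities in numpy for a general J-choice case, with the base category's coefficients normalized to zero as in the text; B is assumed to be a k × J coefficient matrix whose first column is all zeros.

import numpy as np

def mnl_probs(X, B):
    """X: n x k design matrix; B: k x J coefficients with the base column set to 0."""
    eta = X @ B                                            # utilities x_i beta_j
    expu = np.exp(eta - eta.max(axis=1, keepdims=True))    # subtract max for numerical stability
    return expu / expu.sum(axis=1, keepdims=True)          # Pr(y_i = j)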
• The likelihood function is then estimated and computed using these proba-
bilities for each of the three choices. The model is estimated in Stata using
mlogit.
• When the choice depends on attributes of the alternatives instead of the
attributes of the individuals, the model is estimated as a Conditional Logit.
This model is estimated in Stata using clogit.
• Thus, e^{β2} will tell you the change in the relative probabilities for a one-unit change in x. This is also known as the relative risk or the relative risk ratio.
• The IIA problem afflicts both the multinomial logit and the conditional logit
models. An option is to use the multinomial probit model, although it comes
with its own set of problems.
Section 19
• For each observation, the probability of seeing a particular count is given by:
Pr(Yi = yi) = e^{−λ} λ^{yi} / yi!,    yi = 0, 1, 2, ...
λi = exi β or equivalently ln λi = xi β
• We can now say that the probability of observing a given count for period i
is:
Pr(Yi = yi) = e^{−λi} λi^{yi} / yi!,    yi = 0, 1, 2, ...
and E[yi|xi] = λi = e^{xi β}
• Given this probability for a particular value at each observation, the likelihood
of observing the entire sample is the following:
Pr[Y|λ] = Π_{i=1}^{n} e^{−λi} λi^{yi} / yi!
Taking logs gives ln L = Σ_{i=1}^{n} [−λi + yi xi β − ln yi!]. Since the last term does not involve β, it can be dropped, and the log likelihood can be maximized with respect to β to give the maximum likelihood estimates of the coefficients.
• This model can be done in Stata using the poisson command.
• For inference about the effects of explanatory variables, we can
➢ examine the predicted number of events based on a given set of values
for the variables, which is given by λi = exi β .
➢ examine the factor change: for a unit change in xk , the expected count
changes by a factor of eβk , holding other variables constant.
➢ examine the percentage change in the expected count for a δ unit change
in xk : 100 × [e(βk ·δ) − 1], holding other variables constant.
• Marginal effects can be computed in Stata after poisson using mfx com-
pute.
• The Poisson model offers no natural counterpart to the R2 in a linear regres-
sion because the conditional mean (the expected number of counts in each
period) is non-linear. Many alternatives have been suggested. Greene p. 908–
909 offers a variety and Cameron and Trivedi (1998) give more. A popular
statistic is the standard pseudo-R², which compares the log-likelihood for
the full model to the log-likelihood for the model containing only a constant.
• One feature of the Poisson distribution is that E[yi ] = var[yi ]. If E[yi ] <
var[yi ], this is called overdispersion and the Poisson is inappropriate. A neg-
ative binomial or generalized event count model can be used in this case.
• If the censoring cut point is not equal to zero, and if we also have an upper-
censoring cut point, then the conditional expectation of yi will be more com-
plicated, but the point remains that in estimating the conditional mean we
must take account of the fact that yi could be equal to xi β or to the censored
value(s). If we don’t take this into account, we are likely to get incorrect
estimates of the β̂ coefficients just as we did in the case of the truncated
regression.
• The censored model (with an upper or lower censoring point or both) is
estimated in Stata using the tobit command.
• For an example of tobit in practice, see a paper presented at the American
Politics seminar last year, “Accountability and Coercion: Is Justice Blind
when It Runs for Office” by Greg Huber and Sanford Gordon, which recently
appeared in the American Journal of Political Science.
• We assume that the two error terms, ui and εi are distributed bivariate nor-
mal, with standard errors, σu and σε , and that the correlation between them
is equal to ρ.
• Thus, duration models, and the kinds of questions that go with them, should
be investigated using techniques that permit you to include the effects of
time as a chronological sequence, and not just as a numerical marker. These
models will also allow you to test for the effect of other “covariates” on the
probability of war ending.
• Censoring is also a common (but easily dealt with) problem with duration
analysis.
• Let us begin with a very simple example in which we are examining the prob-
ability of a “spell” of war or strike lasting t periods. Our dependent variable
is the random variable, T , which has a continuous probability distribution
f (t), where t is a realization of T .
• The cumulative probability is:
F(t) = ∫₀ᵗ f(s) ds = Pr(T ≤ t)
• Note that the normal distribution might not be a good choice of functional form for this distribution, because the normal takes on negative values and time does not.
The probability that a spell is of length at least t is given by the survival function:
S(t) = 1 − F(t) = Pr(T ≥ t)
• We are sometimes interested in a related issue. Given that a spell has lasted
until time t, what is the probability that it ends in the next short interval of
time, say ∆t?:
l(t, ∆t) = Pr[t ≤ T ≤ t + ∆t | T ≥ t]
• A useful function for characterizing this aspect of the question is the hazard
rate:
λ(t) = lim_{∆t→0} Pr[t ≤ T ≤ t + ∆t | T ≥ t]/∆t = lim_{∆t→0} [F(t + ∆t) − F(t)]/[∆t · S(t)] = f(t)/S(t)
Roughly, the hazard rate is the rate at which spells are completed after du-
ration t, given that they last at least until t.
• Now, we build in different functional forms for F (t) to get different models of
duration. In the exponential model, the hazard rate, λ, is just a parameter to
be estimated and the survival function, S(t) = e−λt . As such, the exponential
model has a hazard rate that does not vary over time.
• The Weibull model allows the hazard rate to vary over the duration of the spell. In the Weibull model, λ(t) = λp(λt)^{p−1}, where p is also a parameter to be estimated, and the survival function is S(t) = e^{−(λt)^p}. This model can “accelerate” or “decelerate” the effect of time and is thus called an accelerated failure time model.
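• A sketch of the exponential and Weibull survival and hazard functions given above, as plain numpy functions; the parameter values in the example call are arbitrary.

import numpy as np

def exp_surv(t, lam):
    return np.exp(-lam * t)                           # S(t) for the exponential model

def exp_hazard(t, lam):
    return np.full_like(np.asarray(t, float), lam)    # constant hazard rate

def weibull_surv(t, lam, p):
    return np.exp(-(lam * t) ** p)                    # S(t) = exp(-(lambda t)^p)

def weibull_hazard(t, lam, p):
    return lam * p * (lam * t) ** (p - 1)             # lambda(t) = lambda p (lambda t)^(p-1)

t = np.linspace(0.1, 5, 50)
print(weibull_hazard(t, lam=0.5, p=1.5)[:3])          # rising hazard when p > 1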
• How do we relate duration to explanatory factors like how well the combatants
are funded? As in the Poisson model, we derive a λi for each observation (war
or strike), where λi = e−xi β .
• Either of these models can be estimated in Stata via maximum likelihood
using the streg command. This command permits you to adopt the Weibull
or the exponential or one of several other distributions to characterize your
survival function.
• If you do not specify otherwise, your output will present the effect of your
covariates on your “relative hazard rate” (i.e., how a one unit increase in
the covariate moves the hazard rate up or down). If you use the command
streg depvar expvars, nohr, your output will be in terms of the actual
coefficients. In all cases, you can use the predict command after streg to
get the predicted time until end of spell for each observation. In cases where
the hazard is constant over time, you can also predict the hazard rate using
predict, hazard.