
POLS W4912

Multivariate Political Analysis

Gregory Wawro
Associate Professor
Department of Political Science
Columbia University
420 W. 118th St.
New York, NY 10027

phone: (212) 854-8540


fax: (212) 222-0598
email: gjw10@columbia.edu
ACKNOWLEDGEMENTS

This course draws liberally on lecture notes prepared by Professors Neal Beck, Lucy Goodhart,
George Jakubson, Nolan McCarty, and Chris Zorn. The course also draws from the following
works:

Aldrich, John H. and Forrest D. Nelson. 1984. Linear Probability, Logit and Probit Models. Beverly Hills, CA: Sage.

Alvarez, R. Michael and Jonathan Nagler. 1995. "Economics, Issues and the Perot Candidacy: Voter Choice in the 1992 Presidential Election." American Journal of Political Science 39:714-744.

Amemiya, Takeshi. 1985. Advanced Econometrics. Cambridge: Harvard University Press.

Baltagi, Badi H. 1995. Econometric Analysis of Panel Data. New York: John Wiley & Sons.

Beck, Nathaniel, and Jonathan N. Katz. 1995. "What To Do (and Not To Do) with Time-Series Cross-Section Data in Comparative Politics." American Political Science Review 89:634-647.

Davidson, Russell and James G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Eliason, Scott R. 1993. Maximum Likelihood Estimation: Logic and Practice. Newbury Park, CA: Sage.

Fox, John. 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage Publications.

Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.

Gujarati, Damodar N. 2003. Basic Econometrics, Fourth Edition. New York: McGraw Hill.

Herron, Michael. 2000. "Post-Estimation Uncertainty in Limited Dependent Variable Models." Political Analysis 8:83-98.

Hsiao, Cheng. 2003. Analysis of Panel Data, 2nd ed. Cambridge: Cambridge University Press.

Kiefer, Nicholas M. 1988. "Economic Duration Data and Hazard Functions." Journal of Economic Literature 26:646-679.

Kennedy, Peter. 2003. A Guide to Econometrics, Fifth Edition. Cambridge, MA: MIT Press.

King, Gary. 1989. Unifying Political Methodology. New York: Cambridge University Press.

Lancaster, Tony. 1990. The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press.

Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks: Sage Publications.

Maddala, G. S. 2001. Introduction to Econometrics, Third Edition. New York: John Wiley and Sons.

Wooldridge, Jeffrey M. 2002. Introductory Econometrics: A Modern Approach. Cincinnati, OH: South-Western College Publishing.

Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Yamaguchi, Kazuo. 1991. Event History Analysis. Newbury Park, CA: Sage.

Zorn, Christopher J. W. 2001. "Generalized Estimating Equations Models for Correlated Data: A Review With Applications." American Journal of Political Science 45:470-490.

TABLE OF CONTENTS
LIST OF FIGURES x

I The Classical Linear Regression Model 1


1 Probability Theory, Estimation, and Statistical Inference as a Prelude to the
Classical Linear Regression Model 2
1.1 Why is this review pertinent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Joint probability density functions . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 Conditional probability density functions . . . . . . . . . . . . . . . . . . . . 7
1.3 Expectations, variance, and covariance . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Properties of expected values . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.3 Properties of variance: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.5 Properties of covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.6 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.7 Variance of correlated variables . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.8 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Important distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.1 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 χ2 (Chi-squared) distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.3 Student’s t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.4 The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Statistical inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Matrix Algebra Review 19


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Terminology and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Types of matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Addition and subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Multiplying a vector or matrix by a constant . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Multiplying two vectors/matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Representing a regression model via matrix multiplication . . . . . . . . . . . . . . 23
2.9 Using matrix multiplication to compute useful quantities . . . . . . . . . . . . . . . 24
2.10 Representing a system of linear equations via matrix multiplication . . . . . . . . . 24

2.11 What is rank and why would a matrix not be of full rank? . . . . . . . . . . . . . . 26
2.12 The Rank of a non-square matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.13 Application of the inverse to regression analysis . . . . . . . . . . . . . . . . . . . . 27
2.14 Partial differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.15 Multivariate distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 The Classical Regression Model 29


3.1 Overview of ordinary least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Optimal prediction and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Criteria for optimality for estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Linear predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Ordinary least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.1 Bivariate example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.2 Multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.3 The multiple regression model in matrix form . . . . . . . . . . . . . . . . . 38
3.5.4 OLS using matrix notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Properties of OLS in finite samples 40


4.1 Gauss-Markov assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Using the assumptions to show that β̂ is unbiased . . . . . . . . . . . . . . . . . . . 41
4.3 Using the assumptions to find the variance of β̂ . . . . . . . . . . . . . . . . . . . . 42
4.4 Finding an estimate for σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Distribution of the OLS coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6 Efficiency of OLS regression coefficients . . . . . . . . . . . . . . . . . . . . . . . . . 47

5 Inference using OLS regression coefficients 50


5.1 A univariate example of a hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Hypothesis testing of multivariate regression coefficients . . . . . . . . . . . . . . . . 51
5.2.1 Proof that (β̂k − βk)/se(β̂k) = (β̂k − βk)/√(s²[(X′X)⁻¹]kk) ∼ t_{N−k} . . . . . . . . 52
5.2.2 Proof that (N − k)s²/σ² ∼ χ²_{N−k} . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Testing the equality of two regression coefficients . . . . . . . . . . . . . . . . . . . 54
5.4 Expressing the above as a “restriction” on the matrix of coefficients . . . . . . . . . 54

6 Goodness of Fit 57
6.1 The R-squared measure of goodness of fit . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1.1 The uses of R2 and three cautions . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Testing Multiple Hypotheses: the F -statistic and R2 . . . . . . . . . . . . . . . . . 61
6.3 The relationship between F and t for a single restriction . . . . . . . . . . . . . . . 63
6.4 Calculation of the F -statistic using the estimated residuals . . . . . . . . . . . . . . 64
6.5 Tests of structural change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.5.1 The creation and interpretation of dummy variables . . . . . . . . . . . . . . 67
6.5.2 The “dummy variable trap” . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5.3 Using dummy variables to estimate separate intercepts and slopes for each
group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.6 The use of dummy variables to perform Chow tests of structural change . . . . . . . 69
6.6.1 A note on the estimate of s2 used in these tests . . . . . . . . . . . . . . . . 71

7 Partitioned Regression and Bias 73


7.1 Partitioned regression, partialling-out and applications . . . . . . . . . . . . . . . . 73
7.2 R2 and the addition of new variables . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3 Omitted variable bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.3.1 Direction of the Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.3.2 Bias in our estimate of σ 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.3.3 Testing for omitted variables: The RESET test . . . . . . . . . . . . . . . . 79
7.4 Including an irrelevant variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.5 Model specification guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.6 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.6.1 How to tell if multicollinearity is likely to be a problem . . . . . . . . . . . . 85
7.6.2 What to do if multicollinearity is a problem . . . . . . . . . . . . . . . . . . 85

8 Regression Diagnostics 86
8.1 Before you estimate the regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.2 Outliers, leverage points, and influence points . . . . . . . . . . . . . . . . . . . . . 86
8.2.1 How to look at outliers in a regression model . . . . . . . . . . . . . . . . . . 86
8.3 How to look at leverage points and influence points . . . . . . . . . . . . . . . . . . 89
8.4 Leverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.4.1 Standardized and studentized residuals . . . . . . . . . . . . . . . . . . . . . 95
8.4.2 DFBETAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

9 Presentation of Results, Prediction, and Forecasting 99


9.1 Presentation and interpretation of regression coefficients . . . . . . . . . . . . . . . 99
9.1.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
9.2 Encompassing and non-encompassing tests . . . . . . . . . . . . . . . . . . . . . . . 102

10 Maximum Likelihood Estimation 105


10.1 What is a likelihood (and why might I need it)? . . . . . . . . . . . . . . . . . . . . 105
10.2 An example of estimating the mean and the variance . . . . . . . . . . . . . . . . . 108
10.3 Are ML and OLS equivalent? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
10.4 Inference and hypothesis testing with ML . . . . . . . . . . . . . . . . . . . . . . . . 112
10.4.1 The likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.5 The precision of the ML estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

II Violations of Gauss-Markov Assumptions in the Classical Linear
Regression Model 116
11 Large Sample Results and Asymptotics 117
11.1 What are large sample results and why do we care about them? . . . . . . . . . . . 117
11.2 What are desirable large sample properties? . . . . . . . . . . . . . . . . . . . . . . 119
11.3 How do we figure out the large sample properties of an estimator? . . . . . . . . . . 121
11.3.1 The consistency of β̂ OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
11.3.2 The asymptotic normality of OLS . . . . . . . . . . . . . . . . . . . . . . . . 124
11.4 The large sample properties of test statistics . . . . . . . . . . . . . . . . . . . . . . 126
11.5 The desirable large sample properties of ML estimators . . . . . . . . . . . . . . . . 127
11.6 How large does n have to be? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

12 Heteroskedasticity 129
12.1 Heteroskedasticity as a violation of Gauss-Markov . . . . . . . . . . . . . . . . . . . 129
12.1.1 Consequences of non-spherical errors . . . . . . . . . . . . . . . . . . . . . . 130
12.2 Consequences for efficiency and standard errors . . . . . . . . . . . . . . . . . . . . 131
12.3 Generalized Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
12.3.1 Some intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
12.4 Feasible Generalized Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
12.5 White-consistent standard errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
12.6 Tests for heteroskedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
12.6.1 Visual inspection of the residuals . . . . . . . . . . . . . . . . . . . . . . . . 136
12.6.2 The Goldfeld-Quandt test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
12.6.3 The Breusch-Pagan test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

13 Autocorrelation 138
13.1 The meaning of autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
13.2 Causes of autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
13.3 Consequences of autocorrelation for regression coefficients and standard errors . . . 139
13.4 Tests for autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
13.4.1 The Durbin-Watson test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
13.4.2 The Breusch-Godfrey test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13.5 The consequences of autocorrelation for the variance-covariance matrix . . . . . . . 142
13.6 GLS and FGLS under autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . 145
13.7 Non-AR(1) processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
13.8 OLS estimation with lagged dependent variables and autocorrelation . . . . . . . . 148
13.9 Bias and “contemporaneous correlation” . . . . . . . . . . . . . . . . . . . . . . . . 150
13.10 Measurement error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
13.11 Instrumental variable estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
13.12 In the general case, why is IV estimation unbiased and consistent? . . . . . . . . . 153

14 Simultaneous Equations Models and 2SLS 155
14.1 Simultaneous equations models and bias . . . . . . . . . . . . . . . . . . . . . . . . 155
14.1.1 Motivating example: political violence and economic growth . . . . . . . . . 155
14.1.2 Simultaneity bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
14.2 Reduced form equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
14.3 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
14.3.1 The order condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
14.4 IV estimation and two-stage least squares . . . . . . . . . . . . . . . . . . . . . . . 160
14.4.1 Some important observations . . . . . . . . . . . . . . . . . . . . . . . . . . 161
14.5 Recapitulation of 2SLS and computation of goodness-of-fit . . . . . . . . . . . . . . 162
14.6 Computation of standard errors in 2SLS . . . . . . . . . . . . . . . . . . . . . . . . 163
14.7 Three-stage least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
14.8 Different methods to detect and test for endogeneity . . . . . . . . . . . . . . . . . . 166
14.8.1 Granger causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
14.8.2 The Hausman specification test . . . . . . . . . . . . . . . . . . . . . . . . . 167
14.8.3 Regression version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
14.8.4 How to do this in Stata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
14.9 Testing for the validity of instruments . . . . . . . . . . . . . . . . . . . . . . . . . . 169

15 Time Series Modeling 171


15.1 Historical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
15.2 The auto-regressive and moving average specifications . . . . . . . . . . . . . . . . . 171
15.2.1 An autoregressive process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
15.3 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
15.4 A moving average process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
15.5 ARMA processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
15.6 More on stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
15.7 Integrated processes, spurious correlations, and testing for unit roots . . . . . . . . 176
15.7.1 Determining the specification . . . . . . . . . . . . . . . . . . . . . . . . . . 180
15.8 The autocorrelation function for AR(1) and MA(1) processes . . . . . . . . . . . . . 180
15.9 The partial autocorrelation function for AR(1) and MA(1) processes . . . . . . . . . 181
15.10 Different specifications for time series analysis . . . . . . . . . . . . . . . . . . . . 182
15.11 Determining the number of lags . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
15.12 Determining the correct specification for your errors . . . . . . . . . . . . . . . . . 185
15.13 Stata Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

III Special Topics 187


16 Time-Series Cross-Section and Panel Data 188
16.1 Unobserved country effects and LSDV . . . . . . . . . . . . . . . . . . . . . . . . . 188
16.1.1 Time effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

16.2 Testing for unit or time effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
16.2.1 How to do this test in Stata . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
16.3 LSDV as fixed effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
16.4 What types of variation do different estimators use? . . . . . . . . . . . . . . . . . . 192
16.5 Random effects estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
16.6 FGLS estimation of random effects . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
16.7 Testing between fixed and random effects . . . . . . . . . . . . . . . . . . . . . . . . 197
16.8 How to do this in Stata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
16.9 Panel regression and the Gauss-Markov assumptions . . . . . . . . . . . . . . . . . . 198

17 Models with Discrete Dependent Variables 204


17.1 Discrete dependent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
17.2 The latent choice model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
17.3 Interpreting coefficients and computing marginal effects . . . . . . . . . . . . . . . . 208
17.4 Measures of goodness of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

18 Discrete Choice Models for Multiple Categories 211


18.1 Ordered probit and logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
18.2 Multinomial Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
18.2.1 Interpreting coefficients, assessing goodness of fit . . . . . . . . . . . . . . . 216
18.2.2 The IIA assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

19 Count Data, Models for Limited Dependent Variables, and Duration Models 218
19.1 Event count models and poisson estimation . . . . . . . . . . . . . . . . . . . . . . . 218
19.2 Limited dependent variables: the truncation example . . . . . . . . . . . . . . . . . 221
19.3 Censored data and tobit regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
19.4 Sample selection: the Heckman model . . . . . . . . . . . . . . . . . . . . . . . . . . 224
19.5 Duration models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

LIST OF FIGURES
1.1 An Example of PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 An example of a CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Joint PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Conditional PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Conditional PDF, X and Y independent . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Plots of special distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 Data Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


3.2 Data Plot w/ OLS Regression Line . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

8.1 Data on Presidential Approval, Unemployment, and Inflation . . . . . . . . . . . . . 87


8.2 A plots of fitted values versus residuals . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.3 Added variable plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.4 An influential outlier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.5 Hat values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.6 Plot of DFBETAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98


Part I

The Classical Linear Regression Model


Section 1

Probability Theory, Estimation, and Statistical Inference


as a Prelude to the Classical Linear Regression Model
1.1 Why is this review pertinent?
1. Most of the phenomena that we care about in the world can be characterized
as random variables, meaning that we can conceive of their values as being
determined by the outcome of a chance experiment. To understand the
behavior of a random variable, we need to understand probability, which
assigns a likelihood to each realization of that random variable.
2. In univariate statistics, much of the focus is on estimating the expectation
(or mean) of a random variable. In multivariate analysis, we will focus on
relating movements in something we care about to variables that can
explain it—but we will still care about expectations.
3. We will focus on establishing the conditional expectation of a dependent
variable Y . In other words, given the value of X (where X may be a single
variable or a set of variables) what value do we expect Y to take on? That
conditional expectation is often written as βX.
4. Since we do not directly observe all the data that make up the world
(and/or cannot run experiments), we must estimate βX using a sample. To
understand what we can say about that estimate, we will have to employ
statistical inference.
5. Finally, many of the statistics we estimate (where a statistic is simply any
quantity calculated from sample data) will conform to particular known
probability distributions. This will simplify the business of conducting
hypothesis tests. To assist in understanding the rationale behind those
hypothesis tests, it helps to review the distributions in question.


1.2 Probability theory


• The set of all possible outcomes of a random experiment is called the
population or sample space and an event is one cell (or subset) of the
sample space.
• If events cannot occur jointly, they are mutually exclusive. If they exhaust
all of the things that could happen they are collectively exhaustive.

1.2.1 Properties of probability


1. 0 ≤ P (A) ≤ 1 for every A.
2. If A, B, C constitute an exhaustive set of events, then P (A + B + C) = 1.
3. If A, B, C are mutually exclusive events, then
P (A + B + C) = P (A) + P (B) + P (C).

1.2.2 Random variables


• Definition: a variable whose value is determined by a chance experiment.
• Convention: a rv is denoted by an upper case letter (e.g., X) and
realizations of a rv are denoted by lower case letters (e.g., x1 , x2 , . . ., xn ).
• A random variable is discrete if it takes on a finite set (or countably
infinite set) of values. Otherwise the rv is continuous.
• A probability density function (PDF) assigns a probability to each
value (or event or occurrence) of a random variable X.
• Thus, a discrete PDF for a variable X taking on the values x1 , x2 , x3 , . . . , xn
is a function such that:

fX (x) = P [X = xi ] for i = 1, 2, 3, . . . , n.

and is zero otherwise (there is no probability of X taking on any other


value).

• A continuous probability density function for a variable X that covers a


continuous range is a function such that:

f(x) ≥ 0 for all x

and

∫_{-∞}^{∞} f(x) dx = 1.

The probability of X falling between a and b is

P[a < X < b] = ∫_{a}^{b} f(x) dx.

• The cumulative distribution function (CDF) gives the probability of a random variable being less than or equal to some value: F(x) = P[X ≤ x].
• For a discrete rv X,

F(x) = Σ_{xj ≤ x} f(xj).

• For a continuous rv,

F(x) = P[X ≤ x] = P[−∞ ≤ X ≤ x] = ∫_{-∞}^{x} f(u) du.
Figure 1.1: An Example of PDF (f(x) plotted against x)

Figure 1.2: An example of a CDF (P = F(x) plotted against x)

1.2.3 Joint probability density functions


• For all of the political science examples we will talk about, we will be
interested in how one variable is related to another, so we will be interested
in joint probability density functions:

f (x, y) = P [X = x and Y = y]

• See Fig. 1.3 for an example of a joint PDF.


• The marginal probability density function, fX (x), can be derived from the
joint probability density function, fX,Y (x, y), by summing/integrating over
all the different values that Y could take on.
fX(x) = Σ_y fX,Y(x, y)   if fX,Y is discrete   (1.1)
fX(x) = ∫_{-∞}^{∞} fX,Y(x, y) dy   if fX,Y is continuous   (1.2)

1.2.4 Conditional probability density functions


• Since we are interested in understanding how one variable is related to
another, we will also consider the conditional PDF, which gives, for
example, the probability that X has the realization x given that Y has the
realization y. The conditional PDF is written as:

f(x|y) = P(X = x | Y = y)   (1.3)
       = fX,Y(x, y) / fY(y)   (1.4)

• Statistical Independence: We say that X and Y are statistically


independent if and only if

fX,Y (x, y) = fX (x) · fY (y)

• See Fig. 1.4 for an example of a condit’l PDF and Fig. 1.5 for an example
of a condit’l PDF for two independent variables.
Figure 1.3: Joint PDF

Figure 1.4: Conditional PDF

Figure 1.5: Conditional PDF, X and Y independent

1.3 Expectations, variance, and covariance


• We will often want to know about the central tendency of a random
variable. We are particularly interested in the expected value of the random
variable, which is the average value that it takes on over many repeated
trials.
• The expected value (µ), or population mean, is the first moment of X.
E[X] = Σ_{j=1}^{n} xj f(xj)   (1.5)

if X is discrete and has the possible values x1, x2, . . . , xn, and

E[X] = ∫_{-∞}^{∞} x f(x) dx   (1.6)

if X is continuous.

1.3.1 Properties of expected values


• If b is a constant, then E(b) = b.
• If a and b are constants, then E(aX + b) = aE(X) + b.
• If X and Y are independent, then E(XY ) = E(X)E(Y )
• If X is a random variable with a PDF f(x) and if g(X) is any function of X, then
E[g(X)] = Σ_x g(x)f(x)   if X is discrete
E[g(X)] = ∫_{-∞}^{∞} g(x)f(x) dx   if X is continuous
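• A quick numerical sketch of E[X] and E[g(X)] (Python/NumPy; the fair-die distribution here is just an illustrative choice):

```python
import numpy as np

# A fair six-sided die: possible values and their probabilities f(x)
x = np.arange(1, 7)
f = np.full(6, 1 / 6)

EX = np.sum(x * f)         # E[X] = sum_j x_j f(x_j) = 3.5
Eg = np.sum(x**2 * f)      # E[g(X)] for g(x) = x^2, about 15.17

print(EX, Eg)
```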

1.3.2 Variance
• The distribution of values of X around its expected value can be measured
by the variance, which is defined as:

var(X) = σx2 = E[(X − µ)2 ]

• The variance is the second moment of X. The standard deviation of X, denoted σX, is +√var[X].
• The variance is computed as follows:
var[X] = Σ_x (x − µ)² f(x)   if X is discrete
var[X] = ∫_{-∞}^{∞} (x − µ)² f(x) dx   if X is continuous

• We can also express the variance as

var[X] = E[(X − µX )2 ]
= E[X 2 ] − (E[X])2

1.3.3 Properties of variance:


• If b is a constant, then var[b] = 0.
• If a and b are constants, and Y = a + bX then

var[Y ] = b2 var[X].

• If X and Y are independent random variables, then


var(X + Y ) = var(X) + var(Y ) and var(X − Y ) = var(X) + var(Y ).

1.3.4 Covariance
• The covariance of two rvs, X and Y , with means µX and µY is defined as

cov(X, Y ) = E[(X − µX )(Y − µY )] (1.7)


= E[XY ] − µX µY . (1.8)

1.3.5 Properties of covariance


• If X and Y are independent, then E[XY ] = E[X]E[Y ] implying
cov(X, Y ) = 0.
• cov(a + bX, c + dY ) = bd · cov(X, Y ), where a, b, c, and d are constants.

1.3.6 Correlation
• The size of the covariance will depend on the units in which X and Y are
measured. This has led to the development of the correlation coefficient,
which gives a measure of statistical association that ranges between −1 and
+1.
• The population correlation coefficient, ρ, is defined as:

ρ = cov(X, Y)/√(var(X) var(Y)) = cov(X, Y)/(σX σY)

1.3.7 Variance of correlated variables

var(X + Y ) = var(X) + var(Y ) + 2cov(X, Y )


var(X − Y ) = var(X) + var(Y ) − 2cov(X, Y )

1.3.8 Conditional expectation


• The conditional expectation of X, given Y = y, is defined as
E(X|Y = y) = Σ_x x f(x|Y = y)   if X is discrete
E(X|Y = y) = ∫_{-∞}^{∞} x f(x|Y = y) dx   if X is continuous

• The Law of Iterated Expectations indicates how the expected value of


X relates to the conditional expectation of X: E(X) = Ey [E(X|Y )].

1.4 Important distributions


1.4.1 The normal distribution
• A random variable X is normally distributed (denoted X ∼ N(µ, σ²)) if it has the PDF
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   (1.9)
where −∞ < x < ∞, µ is the mean, and σ² is the variance of the distribution.
• Properties of the Normal Distribution:
➢ It is symmetric around its mean value (so mean=median=mode).
➢ Approximately 68% of the area under a normal curve lies between the
values of µ ± σ, about 95% of the area lies between µ ± 2σ, and about
99.7% in the range µ ± 3σ.
➢ A normal distribution is completely characterized by its two parameters.
• Any normally distributed variable can be transformed into a standard normal variable by subtracting its mean and dividing by its standard deviation.
➢ E.g., if Y ∼ N(µY, σ²_Y) and Z = (Y − µY)/σY, then Z ∼ N(0, 1). That is,
fZ(z) = (1/√(2π)) exp(−z²/2)   (1.10)

1.4.2 χ2 (Chi-squared) distribution


• Let Z1, Z2, . . . , Zk be independent standard normal variables. Then
U = Σ_{i=1}^{k} Z²_i
has a chi-squared distribution with k degrees of freedom (denoted U ∼ χ²_k).

1.4.3 Student’s t distribution


• Suppose Z ∼ N(0, 1), U ∼ χ²_k, and U and Z are distributed independently of each other. Then the variable
t = Z/√(U/k)
has a t distribution with k degrees of freedom.

1.4.4 The F distribution


• The F distribution is the distribution of the ratio of two independent
chi-squared random variables divided by their respective degrees of freedom.
• That is, if U ∼ χ²_m and V ∼ χ²_n, then the variable
F = (U/m)/(V/n)
has an F distribution with m and n degrees of freedom (denoted F_{m,n}).
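• A short simulation sketch (Python/NumPy) of how these distributions arise from independent standard normals, matching the definitions above (the degrees of freedom chosen here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
k, m, n, draws = 5, 5, 10, 100_000

U = (rng.normal(size=(draws, k)) ** 2).sum(axis=1)   # sum of k squared N(0,1): chi-squared_k
Z = rng.normal(size=draws)
t = Z / np.sqrt(U / k)                               # Student's t with k df

W = (rng.normal(size=(draws, m)) ** 2).sum(axis=1)   # chi-squared_m
V = (rng.normal(size=(draws, n)) ** 2).sum(axis=1)   # chi-squared_n
F = (W / m) / (V / n)                                # F with (m, n) df

print(U.mean(), k)             # E[chi-squared_k] = k
print(t.mean(), t.var())       # roughly 0 and k/(k - 2)
print(F.mean(), n / (n - 2))   # E[F] = n/(n - 2) for n > 2
```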
Figure 1.6: Plots of special distributions: chi-squared, t, and F densities for various degrees of freedom (df = 1, 5, 10; for F, df = (1,1), (1,5), (1,10))

1.5 Statistical inference


• In the real world, we typically do not know the true probability distribution of our population of interest; we estimate it using a sample, relying on the laws of statistics.
• If we don’t know, for instance, the true distribution of income in the
population, we could take a random sample of individuals.
• Each individual that we poll, or observation that we record, can be viewed
as a random variable, X1 , X2 , . . . , Xn , because they have an income whose
level is in part the result of an experiment. Each of those random draws (or
variables) will, however, have a known value, x1 , x2 , . . . , xn .
• If each comes from the same population (i.e., w/ same likelihood of being
rich or poor) then the observations are said to be identically distributed.
If the selection of one person does not affect the chances of selection of
another person, they are said to be independent. If the observations are
both, they are described as iid—standard assumption.
• Once we have our sample, we can estimate a sample mean,
Ȳ = (1/N) Σ_{i=1}^{N} Yi,
and a sample variance,
s²_Y = (1/(N − 1)) Σ_{i=1}^{N} (Yi − Ȳ)².

• What we want to know is how good these estimates are as descriptions of


the real world → use statistical inference.
• We can show that the sample avg. is an unbiased estimator of the true
population mean, µY :
E(Ȳ) = (1/N) Σ_{i=1}^{N} E(Yi) = µY

• The sample avg. is a rv: take a different sample, likely to get a different
average. The hypothetical distribution of the sample avg. is the sampling
distribution.
• We can deduce the variance of this sample avg.:
var(Ȳ) = σ²_Y / N.
• If the underlying population is distributed normal, we can also say that the avg. is distributed normal (why?). If the underlying population does not have a normal distribution, however (and income distributions typically are not normal), we cannot infer that the sampling distribution is exactly normal.
• Can use two laws of statistics that can tell us about the precision and
distribution of our estimator (in this case the avg.):

➢ The Law of Large Numbers states that, under general conditions, Ȳ


will be near µy with very high probability when N is large. The
property that Ȳ is near µy with increasing prob. as N ↑ is called
convergence in probability or, more concisely, consistency. We will
return to consistency in the section on asymptotics.
➢ The Central Limit Theorem states that, under general conditions,
the distribution of Ȳ is well approximated by a normal distribution
when N is large.
➢ ∴ (under general conditions) can treat Ȳ as normally distributed and
use the variance of the average to construct hypothesis tests (e.g., “how
likely it is that avg. income < $20,000 given our estimate?”).
➢ Can also construct confidence intervals, outlining the range of income in
which we expect the true mean income to fall with a given probability.

• We will estimate statistics that are analogous to a sample mean and are also
distributed normal. We can again use the normal as the basis of hypothesis
tests. We will also estimate parameters that are distributed as χ2 , as t, and
as F .
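• A short simulation sketch of the LLN and CLT (Python/NumPy; the lognormal "income" population is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true = np.exp(10 + 0.5**2 / 2)        # population mean of a lognormal(10, 0.5)

def sample_mean(n):
    return rng.lognormal(mean=10, sigma=0.5, size=n).mean()

# Law of Large Numbers: the sample average approaches the true mean as N grows
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n), mu_true)

# Central Limit Theorem: across repeated samples, the sample average is
# approximately normally distributed even though incomes are not
means = np.array([sample_mean(100) for _ in range(5_000)])
print(means.mean(), means.std())         # centered near mu_true, spread shrinking like 1/sqrt(100)
```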
Section 2

Matrix Algebra Review


2.1 Introduction
• A matrix is simply a rectangular array of numbers, like a table. Often,
when we collect data, it takes the form of a table (or excel spreadsheet)
with variable names in the columns across the top and observations in the
rows down the side. That’s a typical matrix.
• A vector (a single-columned matrix) can also be thought of as a coordinate
in space.
• Matrix algebra is the topic that covers the mathematical operations we can
do with numbers. These include addition, subtraction, multiplication,
division (via the inverse), differentiation and factoring. In doing these
matrix operations, we just apply the particular operation to the matrix as a
whole rather than to a single number.
• Matrix algebra is sometimes also called linear algebra because one of the
major uses of matrix algebra is to solve systems of linear equations such as
the following:

2x + 4y = 8
x + 5y = 7

• Easy to substitute to get expressions for x (or y) and solve for y (or x).
Harder to do for more equations—motivation for using matrix algebra.
• In regression analysis, estimation of coefficients requires a solution to a set
of N equations with k unknowns.


2.2 Terminology and notation


• Matrices are represented w/ bold capital letters (e.g., A)
• Vectors are represented w/ bold lower case letters (e.g., a).
• Each element in a matrix, A, is represented as aij , where i gives the row
and j gives the column.
• Transpose: switch the row and column positions of a matrix (so that the
transposed matrix has aji for its elements). Example:
| 4 2 |′   | 4 1 |
| 1 5 |  = | 2 5 |

• Scalar: a 1 × 1 matrix.
• A matrix is equal to another matrix if every element in them is identical.
This implies that the matrices have the same dimensions.

2.3 Types of matrices


• Square: the number of rows equals the numbers of columns.
• Symmetric: A matrix whose elements off the main left-right diagonal are a
mirror-image of one another (⇒ A0 = A). Symmetric matrices must be
square.
• Diagonal: All off-diagonal elements are zero.
• Identity: A square matrix with all elements on the main diagonal = 1, = 0
everywhere else. This functions as the matrix equivalent to the number one.
• Idempotent: a matrix A is idempotent if A2 = A (⇒ An = A). For a
matrix to be idempotent it must be square.

2.4 Addition and subtraction


• As though we were dropping one matrix over another. We are adding the
two exactly together. For that reason, we can add matrices only of exactly
the same dimensions and we find the addition by adding together each
equivalently located element. Example:
     
| 2 4 |   | 1 1 |   | 3 5 |
| 1 3 | + | 1 1 | = | 2 4 |

• In subtraction, it is simply as though we are removing one matrix from the other, so we subtract each element from the equivalent element. Example:

| 2 4 |   | 1 1 |   | 1 3 |
| 1 3 | − | 1 1 | = | 0 2 |

• Properties of Addition:

A+0=A
(A + B) = (B + A)
(A + B)0 = A0 + B0
(A + B) + C = A + (B + C)

2.5 Multiplying a vector or matrix by a constant


• For any constant, k, kA = kaij ∀ i, j.

2.6 Multiplying two vectors/matrices


• To multiply two vectors or two matrices together, they must be
conformable—i.e., row dimension of one matrix must match the column
dimension of the other.
• Inner-product multiplication. For two vectors, a and b, the inner product is written as a′b.
Let a = (1, 2, 4)′ and b = (2, 4, 3)′. Then
a′b = (1)(2) + (2)(4) + (4)(3) = 22
• Note: a′ is 1 × 3, b is 3 × 1, and the result is 1 × 1.
• Note: if a0 b = 0, they are said to be orthogonal. (⇒ the two vectors in
space are at right angles to each other).

2.7 Matrix multiplication


• Same as vector multiplication except that we treat the first row of the
matrix as the transpose vector and continue with the other rows and
columns in the same way. An example helps:
       
| 2 4 |   | 1 2 |   | (2)(1)+(4)(1)  (2)(2)+(4)(1) |   | 6 8 |
| 1 3 | · | 1 1 | = | (1)(1)+(3)(1)  (1)(2)+(3)(1) | = | 4 5 |

• Again, number of columns in the first matrix must be equal to the number
of the rows in the second matrix, otherwise there will be an element that
has nothing to be multiplied with.
• Thus, for the product of A (n × m) and B (j × k), we need m = j. Note also that the product of A (n × m) and B (m × k) is C (n × k).

• Makes a difference which matrix is pre-multiplying and which is


post-multiplying (e.g., A · B = B · A only under special conditions).
• Properties of Multiplication:

AI = A
(AB)0 = B0 A0
(AB)C = A(BC)
A(B + C) = AB + AC or (A + B)C = AC + BC
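• A quick NumPy sketch of these operations, using the small examples above:

```python
import numpy as np

A = np.array([[2, 4],
              [1, 3]])
B = np.array([[1, 2],
              [1, 1]])
a = np.array([1, 2, 4])
b = np.array([2, 4, 3])

print(a @ b)                                # inner product a'b = 22
print(A @ B)                                # matrix product [[6, 8], [4, 5]]
print(np.allclose((A @ B).T, B.T @ A.T))    # (AB)' = B'A'
print(np.allclose(A @ np.eye(2), A))        # AI = A
```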

2.8 Representing a regression model via matrix


multiplication
• We have a model: y = β0 + β1 X + ε, implying:

y1 = β0 + β1 X1 + ε1
y2 = β0 + β1 X2 + ε2

• This can be represented in matrix form by:

y = Xβ + ε

where y is n × 1, X is n × 2, β is 2 × 1, and ε is n × 1.

2.9 Using matrix multiplication to compute useful


quantities
• To get the average of four observations of a random variable: 30, 32, 31, 33.
 
(1/4, 1/4, 1/4, 1/4) · (30, 32, 31, 33)′ = x̄, and this can also be written as (1/n) i′x = x̄, where i is a column vector of ones of dimension n × 1 and x is the data vector.
• Suppose we want to get a vector containing the deviation of each
observation of x from its mean (why would that be useful?):
 
(x1 − x̄, x2 − x̄, . . . , xn − x̄)′ = x − i x̄ = x − (1/n) i i′x
• Since x = Ix,
x − (1/n) i i′x = Ix − (1/n) i i′x = [I − (1/n) i i′] x = M0 x

• M0 has (1 − 1/n) for its diagonal elements and −1/n for all its off-diagonal
elements (∴ symmetric). Also, M0 is equal to its square, M0 M0 = M0 so it
is idempotent.
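• A NumPy sketch of M0 for the four observations above (30, 32, 31, 33), checking that it demeans x and is symmetric and idempotent:

```python
import numpy as np

x = np.array([30.0, 32.0, 31.0, 33.0])
n = len(x)
i = np.ones((n, 1))                      # column vector of ones

M0 = np.eye(n) - (i @ i.T) / n           # I - (1/n) i i'

print(M0 @ x)                            # deviations from the mean: [-1.5, 0.5, -0.5, 1.5]
print(np.allclose(M0, M0.T))             # symmetric
print(np.allclose(M0 @ M0, M0))          # idempotent
```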

2.10 Representing a system of linear equations via


matrix multiplication
• Say that we have a fairly elementary set of linear equations:
2x + 3y = 5
3x − 6y = −3
• Matrix representation:
| 2   3 | | x |   |  5 |
| 3  −6 | | y | = | −3 |
• In general form, this would be expressed as: Ax = b
• Solve by using the inverse: A−1 where A−1 A = I.
• Pre-multiply both sides of the equation by A−1 :

Ix = A−1 b
x = A−1 b

• Only square matrices have inverses, but not all square matrices possess an
inverse.
• To have an inverse (i.e., be non-singular), matrix must be “full rank” ⇒ its
determinant (e.g., |A|) is not zero.
   
• For a 2 × 2 matrix
A = | a  b |
    | c  d |
the determinant is |A| = ad − bc and
A⁻¹ = (1/|A|) |  d  −b |
              | −c   a |
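• A NumPy sketch of solving the 2 × 2 system above via the inverse (np.linalg.solve does the same thing more stably):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [3.0, -6.0]])
b = np.array([5.0, -3.0])

print(np.linalg.det(A))          # -21: nonzero, so A is full rank / non-singular
x = np.linalg.inv(A) @ b         # x = A^{-1} b
print(x)                         # solution: x = 1, y = 1

print(np.linalg.solve(A, b))     # same answer without forming the inverse explicitly
```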

2.11 What is rank and why would a matrix not be of full


rank?
• Let us consider a set of two linear equations with two unknowns:

3x + y = 7
6x + 2y = 14

    
| 3 1 | | x |   |  7 |
| 6 2 | | y | = | 14 |
• Can’t solve this: the second equation is not truly a different equation; it is a
“linear combination” of the first ⇒ an infinity of solutions.
 
• The matrix
| 3 1 |
| 6 2 |
does not have full "row rank."
• One of its rows can be expressed as a linear combination of the others (⇒
its determinant = 0).
• Does not have full column rank either; if a square matrix does not have full
row rank it will not have full column rank.
• Rows and columns must be “linearly independent” for matrix to have full
rank.

2.12 The Rank of a non-square matrix


• For non-square matrices, rank(A) = rank (A0 ) ≤ min(number of rows,
number of columns).
 
• For example, for the 3 × 2 matrix with rows (5, 1), (2, −3), and (7, 4), the rank of the matrix is at most two (the number of columns).
• Why is this? Think of each of the three rows as a vector in space. With two linearly independent rows of two elements each, we can already produce any third row we could imagine as a linear combination of them. In technical language, the rows span a vector space of dimension at most two.

2.13 Application of the inverse to regression analysis


• Can’t use this to directly solve

y = Xβ + ε

• We have the error vector and X is not square.


• But we will use the inverse to get β̂ by seeking a solution that minimizes the sum of squared residuals; this will give us a system of K equations from which we may derive the K unknown β coefficients.

2.14 Partial differentiation


• To find the partial derivative of a function of a vector x, f(x), with respect to x, i.e. ∂f(x)/∂x:
➢ take the derivative of the function with respect to each element xi and stack the results into a vector.
• ∂(a′x)/∂x = a
• ∂(x′Ax)/∂x = 2Ax (for symmetric A).

2.15 Multivariate distributions


• Multivariate distribution describes the joint distribution of any group of
random variables X1 to Xn .
• This set can be collected as a vector or matrix and we can write their
moments (i.e., mean and variance) in matrix notation.
• E.g., if X1 to Xn are written as x, then E(x) = [E(X1 ), E(X2 ), . . . , E(Xn )].
• For any n × 1 random vector, x, its variance-covariance matrix, denoted
var(x), is defined as:
 
          | σ²_1   σ12   . . .  σ1n  |
var(x) =  | σ21    σ²_2  . . .  σ2n  | = E[(x − µ)(x − µ)′] = Σ   (2.1)
          |  ⋮      ⋮     ⋱      ⋮   |
          | σn1    σn2   . . .  σ²_n |

• This matrix is symmetric. Why? In addition, if the individual elements of x


are independent, then the off-diagonal elements are equal to zero. Why?
• x ∼ N(µ, Σ).
• The properties of the multivariate normal ⇒ each element of this vector is
normally distributed and that any two elements of x are independent iff
they are uncorrelated.
Section 3

The Classical Regression Model


3.1 Overview of ordinary least squares
• OLS is appropriate whenever the Gauss-Markov assumptions are satisfied.
• Technically simple and tractable.
• Allows us to develop the theory of hypothesis testing.

• Regression analysis tells us the relationship between a dependent variable, y, and an independent variable, x.

3.2 Optimal prediction and estimation


• Suppose that we observe some data (xi , yi ) for i = 1, . . ., N observational
units of individuals.
• Our theory tells us that xi and yi are related in some systematic fashion for
all individuals.
• Use the tools of statistics to describe this relationship more precisely and to
test its empirical validity.
• For now, no assumptions concerning the functional form linking the variables
in the data set.
• First problem: how to predict yi optimally given the values of xi ?
• xi is known as an independent variable and yi is the dependent or endogenous
variable.
• The function that delivers the best prediction of yi for given values of xi is
known as an optimal predictor.
➢ The “best predictor” is a real-valued function g(·), not necessarily linear.

Figure 3.1: Data Plot (Y plotted against X)

• How to define "best predictor"? One criterion we could apply is to minimize the mean-square error, i.e., to solve:
min_{g(·)} E[(yi − g(xi, β))²]

• It turns out that the optimal predictor in this sense is the conditional expec-
tation of y given x, also known as the regression function, E(yi |xi ) = g(xi , β).
• g(.) presumably involves a number of fixed parameters (β) needed to correctly
specify the theoretical relationship between yi and xi .
• Whatever part of the actually observed value of yi is not captured in g(xi , β)
in the notation of this equation must be a random component which has a
mean of zero conditional on xi .
• ∴ another way to write the equation above in a way that explicitly captures
the randomness inherent in any political process is as follows:

yi = g(xi, β) + εi

where εi is an error term or disturbance, with a mean of zero conditional on


xi . For the moment, we shall not say anything else about the distribution of
εi .

3.3 Criteria for optimality for estimators


• Need to define what we mean by "good" or optimal estimators.
• Normally, there are two dimensions along which we compare estimators.

1. Unbiasedness, i.e.: E(β̂) = β

• Large sample analog of unbiasedness is “consistency”.


2. Efficiency: var(β̂) ≤ var(β̂*), where β̂* is any unbiased estimator for β.
• Large sample analog is asymptotic efficiency.

3.4 Linear predictors


• We often restrict g(.) to being a linear function of xi , or a linear function of
non-linear transformations of xi .
• More specifically, the best linear predictor (BLP) of yi given values of xi is
denoted by:
E ∗ (yi |xi ) = β0 + β1 xi .
• Note that the BLP is not the same as the conditional expectation, unless the
conditional expectation happens to be linear. As before, we can write the
equation in an alternative but equivalent fashion:
yi = β0 + β1 xi + εi

• The assumption of linearity corresponds to a restriction that the true rela-


tionship between yi and xi be a line in (yi , xi ) space. Such a line is known as
a regression line.
• Moreover, the fact that we are trying to predict yi given xi requires that the
error term be equal to the vertical distance between the true regression line
and the data points.
• Thus, in order to estimate the true regression line—that is, to approximate
it using the information contained in our finite sample of observations—one
possible criterion will consist of minimizing some function of the vertical
distance between the estimated line and the scattered points.
➢ In doing so, we hope to exhaust all the information in xi that is useful
in order to predict yi linearly.
• The problem that we are now confronted with consists of estimating the true
parameters β0 and β1 and drawing statistical inferences from these estimates.
• One way: ordinary least squares estimation (OLS).
• Under special assumptions (the Gauss-Markov assumptions or assump-
tions of the classic linear regression model), OLS estimates are “optimal” in
terms of unbiasedness and efficiency.

3.5 Ordinary least squares


• The estimates of β0 and β1 resulting from the minimization of the sum of
squared vertical deviations are known as the OLS estimates.

3.5.1 Bivariate example


• The vertical deviations from the estimated line, or residuals ei are as follows:
ei = yi − β̂0 − β̂1 xi ,
where β̂0 and β̂1 are the estimates of the parameters that fully describe the
regression line.
• Thus, the objective is to solve the following problem:
min_{β̂0, β̂1} Σ_{i=1}^{N} e²_i = Σ_{i=1}^{N} (yi − β̂0 − β̂1 xi)²

• The first order conditions for a minimum (applying the chain rule) are:
∂(Σ e²_i)/∂β̂0 = Σ_{i=1}^{N} 2(yi − β̂0 − β̂1 xi)(−1) = 0 ⇒ Σ_{i=1}^{N} ei = 0
∂(Σ e²_i)/∂β̂1 = Σ_{i=1}^{N} 2(yi − β̂0 − β̂1 xi)(−xi) = 0 ⇒ Σ_{i=1}^{N} xi ei = 0

• These are the "normal equations." They can be re-written by substituting for ei and collecting terms. This yields the following two expressions:
Σ_{i=1}^{N} yi = N β̂0 + β̂1 Σ_{i=1}^{N} xi
Σ_{i=1}^{N} xi yi = β̂0 Σ_{i=1}^{N} xi + β̂1 Σ_{i=1}^{N} x²_i

• To obtain a solution, divide the first equation by n:

ȳ = β̂0 + β̂1 x̄

⇒ OLS regression line passes through the mean of the data (but may not be
true if no constant in the model) and

β̂0 = ȳ − β̂1 x̄

• Use the last expression in place of β̂0 in the solution for β̂1 and use Σ_{i=1}^{N} xi = N x̄:
Σ_{i=1}^{N} xi yi − N x̄ȳ = β̂1 (Σ_{i=1}^{N} x²_i − N x̄²)
or
β̂1 = (Σ xi yi − N x̄ȳ)/(Σ x²_i − N x̄²) = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

• Check to see if this is a minimum. The matrix of second derivatives is:
| ∂²(Σe²_i)/∂β̂0²      ∂²(Σe²_i)/∂β̂0∂β̂1 |   | 2N      2N x̄     |
| ∂²(Σe²_i)/∂β̂0∂β̂1   ∂²(Σe²_i)/∂β̂1²    | = | 2N x̄   2 Σ x²_i  |

• Sufficient condition for a minimum: matrix must be positive definite.


• The two diagonal elements are positive, so we only need to verify that the
determinant is positive:
|D| = 4N Σ_i x²_i − 4 (Σ_i xi)²
    = 4N Σ_i x²_i − 4N² x̄²
    = 4N (Σ_i x²_i − N x̄²)
    = 4N Σ_i (xi − x̄)² > 0 if there is variation in x

• If there are more regressors, one can add normal equations of the form ∂(Σ e²_i)/∂β̂k = 0.

• Math is tedious; more convenient to use matrix notation.
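• A quick numerical sketch of the bivariate formulas above (Python/NumPy, with simulated data; the "true" intercept and slope are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.normal(size=N)
y = 0.5 + 2.0 * x + rng.normal(size=N)    # "true" beta0 = 0.5, beta1 = 2

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

e = y - beta0_hat - beta1_hat * x
print(beta0_hat, beta1_hat)
# The normal equations hold: residuals sum to zero and are orthogonal to x
print(np.isclose(e.sum(), 0.0), np.isclose((x * e).sum(), 0.0))
```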


Figure 3.2: Data Plot w/ OLS Regression Line (Y plotted against X)

3.5.2 Multiple regression


• When we want to know the direct effects of a number of independent variables
on a dependent variable, we use multiple regressions by assuming that

E(yi |x1i , . . . , xki ) = β0 + β1 x1i + . . . + βk xki

(Notice that there is no x0i . Actually, we assume that x0i = 1 for all i.)
• May also be written as

yi = β0 + β1 x1i + . . . + βk xki + εi

• The sample regression function can be estimated by Least Squares as well


and can be written as:

ŷi = β̂0 + β̂1 x1i + . . . + β̂k xki

or

yi = β̂0 + β̂1 x1i + . . . + β̂k xki + ei

• Interpretation of coefficients: βk is the effect on y of a one-unit increase in xk, holding all of the other xs constant.

➢ Example: Infant deaths per 1000 live births

Infant mortality = 125 − .972 × (% Females who Read) − .002 × (Per capita GNP)

➢ Holding per capita GNP constant, a one percentage point increase in female literacy reduces the expected infant mortality rate by .972 deaths per 1000 live births.
➢ Holding female literacy rates constant, a one thousand dollar increase in per capita GNP reduces the expected infant mortality rate by roughly 2 deaths per 1000 live births.

3.5.3 The multiple regression model in matrix form


• To expedite presentation and for computational reasons, it is important to be able to express the linear regression model in matrix form. To this end, note that the observations of a multiple regression can be written as:
y1 = β0 + β1 x11 + . . . + βk xk1 + . . . + βK xK1 + ε1
 ⋮
yi = β0 + β1 x1i + . . . + βk xki + . . . + βK xKi + εi
 ⋮
yn = β0 + β1 x1n + . . . + βk xkn + . . . + βK xKn + εn
• Now let y = (y1, . . . , yn)′, β = (β0, . . . , βK)′, ε = (ε1, . . . , εn)′, and let X be the n × (K + 1) matrix whose ith row is (1, x1i, . . . , xKi).
• Now the entire model can be written as
y = Xβ + ε

• We will often want to refer to specific vectors within X. Let xk refer to


a column in X (the kth independent variable) and xi be a row in X (the
independent variables for the ith observation).
• Thus, we can write the model for a specific observation as
yi = x0i β + εi

3.5.4 OLS using matrix notation

• The sum of squared residuals is
Σ_{i=1}^{N} e²_i = Σ_{i=1}^{N} (yi − x′_i β̂)²

where β̂ is now a vector of unknown coefficients.


• The minimization problem can now be expressed as
min_{β̂} e′e = (y − Xβ̂)′(y − Xβ̂)
• Expanding this gives:
e′e = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂
or
e′e = y′y − 2β̂′X′y + β̂′X′Xβ̂
• The first order conditions now become:
∂e′e/∂β̂ = −2X′y + 2X′Xβ̂ = 0
• The solution then satisfies the least squares normal equations:
X′Xβ̂ = X′y
• If the inverse of X′X exists (i.e., assuming full rank), then the solution is:
β̂ = (X′X)⁻¹X′y
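• A NumPy sketch of β̂ = (X′X)⁻¹X′y on simulated data (in practice a numerically stable solver or canned routine is preferable to the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # constant plus two regressors
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(size=N)

beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)                # (X'X)^{-1} X'y

# Equivalent, better-conditioned ways to get the same estimate:
print(beta_hat)
print(np.linalg.solve(X.T @ X, X.T @ y))
print(np.linalg.lstsq(X, y, rcond=None)[0])
```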
Section 4

Properties of OLS in finite samples


4.1 Gauss-Markov assumptions

• OLS estimation relies on five basic assumptions about the way in which the
data are generated.
• If these assumptions hold, then OLS is BLUE (Best Linear Unbiased
Estimator).

Assumptions:

1. The true model is a linear functional form of the data: y = Xβ + ε.


2. E[ε|X] = 0
3. E[εε0 |X] = σ 2 I
4. X is n × k with rank k (i.e., full column rank)
5. ε|X ∼ N [0, σ 2 I]

In English:

• Assumptions 1 and 2 ⇒ E[y|X] = Xβ. (Why?)


• Assumption 2: “strict exogeneity” assumption. ⇒ E[εi ] = 0, since
E[εi ] = EX [E[εi |Xi ]] = 0 (via the Law of Iterated Expectations).
• Assumption 3: errors are spherical/iid.
• Assumption 4: the data matrix has full column rank, so X′X is invertible (or "non-singular"); the regressors are not perfectly multicollinear.
• Sometimes add to Assumption 4 “X is a non-stochastic matrix” or “X is
fixed in repeated samples.” This amounts to saying that X is something
that we fix in an experiment. Generally not true for political science.


➢ We will assume X can be fixed or random, but it is generated by a


mechanism unrelated to ε.

• Note that
           | E(ε1²)     E(ε1 ε2)   . . .  E(ε1 εN) |
E(εε′) =   |   ⋮           ⋮                 ⋮     |
           | E(εN ε1)   E(εN ε2)   . . .  E(εN²)   |
with
var(εi) = E[(εi − E(εi))²] = E[ε²_i]
and
cov(εi, εj) = E[(εi − E(εi))(εj − E(εj))] = E(εi εj)

• Thus, we can understand the assumption that E(εε0 ) = σ 2 In , as a statement


about the variance-covariance matrix of the error terms, which is equal to:
 
σ 2 0 ... 0
 0 σ 2 ... 0 
 
 .. . .. 
. 
2
0 0 ... σ

• Diagonal elements ⇒ homoskedasticity; Off-diagonal elements ⇒ no


auto-correlation.

4.2 Using the assumptions to show that β̂ is unbiased


• β̂ is unbiased if E[β̂] = β.

β̂ = (X′X)⁻¹X′y
  = (X′X)⁻¹X′(Xβ + ε)
  = β + (X′X)⁻¹X′ε

• Taking expectations:

E[β̂|X] = β + (X′X)⁻¹X′E[ε|X] = β

• So the coefficients are unbiased "conditioned" on the data set at hand. Using the law of iterated expectations, however, we can also say something about the unconditional expectation of the coefficients.
• By the LIE, the unconditional expectation of β̂ can be derived by "averaging" the conditional expectation over all the samples of X that we could observe.

E[β̂] = EX[E[β̂|X]] = β + EX[(X′X)⁻¹X′E[ε|X]] = β

The last step holds because the expectation over X of something that is always equal to zero is still zero.
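• A small Monte Carlo simulation illustrates unbiasedness: averaging β̂ over many samples drawn from a known model recovers the true coefficients. This is only an illustrative sketch; the sample size, number of replications, and true parameter values below are arbitrary assumptions:

set.seed(42)
n <- 50; reps <- 2000
beta <- c(1, 2)                    # true intercept and slope
draws <- replicate(reps, {
  x <- rnorm(n)
  y <- beta[1] + beta[2]*x + rnorm(n)
  coef(lm(y ~ x))                  # one estimate of (beta0, beta1)
})
rowMeans(draws)                    # should be close to c(1, 2)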

4.3 Using the assumptions to find the variance of β̂


• Now that we have the expected value of β̂, we should be able to find its variance using the equation for the variance-covariance matrix of a vector of random variables (see Eq. 2.1).

var(β̂) = E[(β̂ − β)(β̂ − β)′]

β̂ − β = (X′X)⁻¹X′(Xβ + ε) − β
      = β + (X′X)⁻¹X′ε − β
      = (X′X)⁻¹X′ε

• Thus:

var(β̂) = E[((X′X)⁻¹X′ε)((X′X)⁻¹X′ε)′]

• Use the transpose rule (AB)′ = B′A′:

((X′X)⁻¹X′ε)′ = ε′X[(X′X)⁻¹]′

• The transpose of the inverse is equal to the inverse of the transpose:

[(X′X)⁻¹]′ = [(X′X)′]⁻¹

• Since (X′X) is a symmetric matrix:

[(X′X)⁻¹]′ = [(X′X)′]⁻¹ = (X′X)⁻¹

which gives:

((X′X)⁻¹X′ε)′ = ε′X[(X′X)⁻¹]′ = ε′X(X′X)⁻¹

• Then,

var(β̂) = E[((X′X)⁻¹X′ε)(ε′X(X′X)⁻¹)]
       = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]

• Next, let's pass through the conditional expectations operator rather than the unconditional to obtain the conditional variance.

E[(X′X)⁻¹X′εε′X(X′X)⁻¹ | X] = (X′X)⁻¹X′E(εε′|X)X(X′X)⁻¹
                            = (X′X)⁻¹X′σ²IX(X′X)⁻¹
                            = σ²(X′X)⁻¹X′X(X′X)⁻¹ = σ²(X′X)⁻¹

• But the variance of β̂ depends on the unconditional expectation.
• Use a theorem on the decomposition of variance from Greene, 6th ed., p. 1008:

var(β̂) = EX[var(β̂|X)] + varX[E(β̂|X)]

• The second term in this decomposition is zero since E[β̂|X] = β for all X, thus:

var(β̂) = σ²(X′X)⁻¹

4.4 Finding an estimate for σ²

• Since σ² = E[(ε − E(ε))²] = E[ε²], and e is an estimate of ε, (Σi ei²)/N would seem a natural choice.

• But ei = yi − xi′β̂ = εi − xi′(β̂ − β).

• Note that:

e = y − Xβ̂ = y − X(X′X)⁻¹X′y = (IN − X(X′X)⁻¹X′)y

• Let

M = (IN − X(X′X)⁻¹X′)

M is a symmetric, idempotent matrix (i.e., M = M′ and MM = M).

• Note MX = 0.

• We can also show (by substituting for y) that:

e = My ⇒ e′e = y′M′My = y′My = y′e = e′y

and (even more useful)

e = My = M[Xβ + ε] = MXβ + Mε = Mε

• So E(e′e) = E(ε′Mε).



• Now we need some matrix algebra: the trace operator gives the sum of the diagonal elements of a matrix. So, for example:

tr [ 1  7
     2  3 ] = 1 + 3 = 4

• Three key results for trace operations:
1. tr(ABC) = tr(BCA) = tr(CAB)
2. tr(cA) = c · tr(A)
3. tr(A − B) = tr(A) − tr(B)
• For a 1 × 1 matrix, the trace is equal to the matrix (Why?). The matrix ε′Mε is a 1 × 1 matrix. As a result:

E(ε′Mε) = E[tr(ε′Mε)] = E[tr(Mεε′)]

using the trace operations above, and

E[tr(Mεε′)] = tr(ME[εε′]) = tr(Mσ²I) = σ²tr(M)

• Thus, E(e′e) = σ²tr(M)

tr(M) = tr(IN − X(X′X)⁻¹X′)
      = tr(IN) − tr(X(X′X)⁻¹X′)
      = N − tr(X(X′X)⁻¹X′)
      = N − tr(X′X(X′X)⁻¹)
      = N − tr(IK) = N − k

• Putting this all together: E(e′e) = σ²(N − k) ⇒ s² = e′e/(N − k) is an unbiased estimator for σ².
• Thus, our estimate of the variance of β̂ is s²(X′X)⁻¹.
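• These formulas are easy to verify numerically: s² computed from the residuals and s²(X′X)⁻¹ reproduce what summary() and vcov() report for an lm fit. A minimal sketch with simulated data (names and parameter values are arbitrary):

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

X  <- model.matrix(fit)
e  <- resid(fit)
k  <- ncol(X)
s2 <- sum(e^2)/(n - k)                      # e'e / (N - k)
V  <- s2 * solve(t(X) %*% X)                # s^2 (X'X)^{-1}

c(s2, summary(fit)$sigma^2)                 # should agree
all.equal(V, vcov(fit), check.attributes = FALSE)   # should be TRUE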

4.5 Distribution of the OLS coefficients


• Use Assumption Five to figure this out.
• In solving for β̂, we arrived at an expression that was a linear function of the error terms. Earlier, we stated that:

β̂ = β + (X′X)⁻¹X′ε

This is a linear function of ε, analogous to Az + b, where A = (X′X)⁻¹X′, b = β, and z = ε, a random vector.

• Next, we can use a property of the multivariate normal distribution:

If z is a multivariate normal vector (i.e., z ∼ N[µ, Σ]), then Az + b ∼ N[Aµ + b, AΣA′].

This implies that β̂|X ∼ N[β, σ²(X′X)⁻¹].

• If we could realistically treat X as non-stochastic, then we could just say that β̂ was distributed multivariate normal, without any conditioning on X.
• Later, in the section on asymptotic results, we will be able to show that the distribution of β̂ is approximately normal as the sample size gets large, without any assumptions on the distribution of the true error terms and without having to condition on X.
• In the meantime, it is also useful to remember that, for a multivariate normal distribution, each element of the vector β̂ is also distributed normal, so that we can say that:

β̂k|X ∼ N[βk, σ²[(X′X)⁻¹]kk]

• We will use this result for hypothesis testing.



4.6 Efficiency of OLS regression coefficients


• To show this, we must prove that there is no other linear unbiased estimator β̃ of the true parameters such that var(β̃) < var(β̂).

• Let β̃ be a linear estimator of β calculated as Cy, where C is a K × n matrix.
• If β̃ is unbiased for β, then:

E[β̃] = E[Cy] = β

which implies that

E[C(Xβ + ε)] = β
⇒ CXβ + CE[ε] = β
⇒ CXβ = β and CX = I

We will use the last result when we calculate the variance of β̃.

var[β̃] = E[(β̃ − E[β̃])(β̃ − E[β̃])′]
       = E[(β̃ − β)(β̃ − β)′]
       = E[(Cy − β)(Cy − β)′]
       = E[((CXβ + Cε) − β)((CXβ + Cε) − β)′]
       = E[(β + Cε − β)(β + Cε − β)′]
       = E[(Cε)(Cε)′]
       = CE[εε′]C′
       = σ²CC′

• Now we want to compare this variance to the variance of β̂, which equals σ²(X′X)⁻¹.

• Let C = D + (X′X)⁻¹X′

• The elements of D could be positive or negative. We are not saying anything yet about the "size" of this matrix versus (X′X)⁻¹X′.

• Now, var[β̃] = σ²CC′ = σ²(D + (X′X)⁻¹X′)(D + (X′X)⁻¹X′)′

• Recalling that CX = I, we can now say that:

(D + (X′X)⁻¹X′)X = I

so that

DX + (X′X)⁻¹X′X = I  ⇒  DX + I = I  ⇒  DX = 0

• Given this last step, we can finally re-express the variance of β̃ in terms of D and X.

var[β̃] = σ²(D + (X′X)⁻¹X′)(D + (X′X)⁻¹X′)′
       = σ²(D + (X′X)⁻¹X′)(D′ + X(X′X)⁻¹)    (using (A + B)′ = A′ + B′)
       = σ²(DD′ + (X′X)⁻¹X′X(X′X)⁻¹ + DX(X′X)⁻¹ + (X′X)⁻¹X′D′)
       = σ²(DD′ + (X′X)⁻¹)    since DX = 0 and (X′X)⁻¹X′D′ = (DX(X′X)⁻¹)′ = 0

Therefore:

var[β̃] = σ²(DD′ + (X′X)⁻¹) = var(β̂) + σ²DD′

• The diagonal elements of var[β̃] and var(β̂) are the variances of the individual estimates. So to prove efficiency it is sufficient to show that the diagonal elements of σ²DD′ are all non-negative.

• Since each diagonal element of DD′ is a sum of squares (i.e., Σj dij²), no diagonal element can be negative.
• They could all be zero, but if so then D = 0 and C = (X′X)⁻¹X′, meaning that β̃ = Cy = β̂, which completes the proof.

• In conclusion, any other unbiased linear estimator of β will have a sampling


variance that is larger than the sampling variance of the OLS regression
coefficients.
• OLS is “good” in the sense that it produces regression coefficients that are
“BLUE”, the best, linear unbiased estimator of the true, underlying
population parameters.
Section 5

Inference using OLS regression coefficients


• While we have explored the properties of the OLS regression coefficients, we
have not explained what they tell us about the true parameters, β, in the
population. This is the task of statistical inference.
• Inference in multivariate regression analysis is conducted via hypothesis test-
ing. Two approaches:
1. Test of significance approach
2. Confidence interval approach
• Some key concepts: the size and the power of a test, Type I and Type II
Errors, and the null hypothesis and the alternative hypothesis.

5.1 A univariate example of a hypothesis test


• In the univariate case, we were often concerned with finding the average of a random distribution. Via the Central Limit Theorem, we could say that X̄ ∼ N(µ, σ²/n). Thus, it followed that:

Z = (X̄ − µ)/σX̄ = (X̄ − µ)/(σ/√n) = √n(X̄ − µ)/σ ∼ N(0, 1)

• We could then use the Z distribution to perform hypothesis tests.
• E.g.: H0: µ = 4 and H1: µ ≠ 4.
• Next, let us decide the size of the test. Suppose we want to have 95% "confidence" with regard to our inferences, implying a confidence coefficient of (1 − α) = 0.95, so α, which is the size of the test, is equal to 0.05. The size of the test gives us a critical value of ±Zα/2.
• The test of significance approach is then to calculate (X̄ − µ)/(σ/√n) directly.
• If (X̄ − µ)/(σ/√n) > +Zα/2 or (X̄ − µ)/(σ/√n) < −Zα/2, we reject the null hypothesis. Intuitively, if X̄ is very different from what we said it is under the null, we reject the null.
• The confidence interval approach is to construct the 95% confidence interval for µ:

Pr[ X̄ − Zα/2 · σ/√n ≤ µ ≤ X̄ + Zα/2 · σ/√n ] = 0.95

• If the hypothesized value of µ = 4 lies in the confidence interval, then we do not reject the null. If the hypothesized value does not lie in the confidence interval, then we reject H0 with 95% confidence. Intuitively, we reject the null hypothesis if the hypothesized value is far from the likely range of values of µ suggested by our estimate.

5.2 Hypothesis testing of multivariate regression coefficients

• As demonstrated earlier, the regression coefficients are distributed multivariate normal conditional on X:

β̂|X ∼ N[β, σ²(X′X)⁻¹]

and each individual regression coefficient is distributed normal conditional on X:

β̂k|X ∼ N[βk, σ²[(X′X)⁻¹]kk]

• Let us assume, at this point, that we can use the relationship between the marginal and conditional multivariate normal distributions to say simply that the OLS regression coefficients are normally distributed:

β̂k ∼ N[βk, σ²[(X′X)⁻¹]kk]

• Thus, we should be able to use

Z = (β̂k − βk)/σβ̂k = (β̂k − βk)/√(σ²[(X′X)⁻¹]kk) = (β̂k − βk)/(σ√[(X′X)⁻¹]kk) ∼ N(0, 1)

to conduct hypothesis tests, right?

• So, why are regression results always reported with the t-statistic? The problem is that we don't know σ² and have to estimate it using s² = e′e/(N − k).
• Aside: as the sample size becomes large, we can ignore the sampling distribution of s² and proceed as though s² = σ². Given our assumption that the error is normally distributed, however, we can use the t-distribution.

5.2.1 Proof that (β̂k − βk)/se(β̂k) = (β̂k − βk)/√(s²[(X′X)⁻¹]kk) ∼ t_{N−k}

• Recall: if Z1 ∼ N(0, 1) and another variable Z2 ∼ χ²_k is independent of Z1, then the variable defined as:

t = Z1/√(Z2/k) = Z1√k/√Z2

is said to follow the Student's t distribution with k degrees of freedom.

• We have something that is distributed like Z1 = (β̂k − βk)/(σ√[(X′X)⁻¹]kk).

• What we need is to divide it by something that is chi-squared. Let us assume for now that (N − k)s²/σ² ∼ χ²_{N−k}.

• If that were true, we could divide one by the other to get:

t_{N−k} = [(β̂k − βk)/(σ√[(X′X)⁻¹]kk)] / √{[(N − k)s²/σ²]/(N − k)}
        = [(β̂k − βk)/(σ√[(X′X)⁻¹]kk)] · (σ/s)
        = (β̂k − βk)/(s√[(X′X)⁻¹]kk)
        = (β̂k − βk)/se(β̂k)

5.2.2 Proof that (N − k)s²/σ² ∼ χ²_{N−k}

• This proof will be done by figuring out whether the component parts of this fraction are sums of squared variables that are distributed standard normal. The proof will proceed using something we used earlier (e = Mε).

(N − k)s²/σ² = e′e/σ² = ε′MMε/σ² = (ε/σ)′M(ε/σ)

• We know that ε/σ ∼ N(0, 1).

• Thus, if M = I we have (ε/σ)′(ε/σ) ∼ χ²_N. This holds because the matrix multiplication would give us the sum of N squared standard normal random variables. However, it can also be shown that if M is any idempotent matrix, then:

(ε/σ)′M(ε/σ) ∼ χ²[tr(M)]

• We showed before that tr(M) = N − k, so

(ε/σ)′M(ε/σ) ∼ χ²_{N−k}

• This completes the proof. In conclusion, the test statistic (β̂k − βk)/se(β̂k) ∼ t_{N−k}.
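• The t-statistics and p-values reported by summary() can be reproduced from this formula. A sketch, again with simulated data (the model and names are illustrative only):

set.seed(7)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5*x1 + rnorm(n)                 # x2 is irrelevant by construction
fit <- lm(y ~ x1 + x2)

b     <- coef(fit)
se    <- sqrt(diag(vcov(fit)))              # s * sqrt([(X'X)^{-1}]kk)
tstat <- b/se                               # tests of H0: beta_k = 0
pval  <- 2*pt(-abs(tstat), df = n - length(b))
cbind(tstat, pval)                          # compare with summary(fit)$coefficients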

5.3 Testing the equality of two regression coefficients


• This section will introduce us to the notion of testing a “restriction” on the
vector of OLS coefficients.
• Suppose we assume that the true model is:

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + εi


• Suppose also that we want to test the hypothesis H0 : β2 = β3 .
• We can also express this hypothesis as: H0 : β2 − β3 = 0.
• We test this hypothesis using the estimator (β̂2 − β̂3). If the estimated value is very different from the hypothesized value of zero, then we will be able to reject the null. We can construct the t-statistic for the test using:

t = [(β̂2 − β̂3) − (β2 − β3)] / se(β̂2 − β̂3)

• What is the standard error? β̂2 and β̂3 are both random variables. From section 1.3.7, var(X − Y) = var(X) + var(Y) − 2cov(X, Y). Thus:

se(β̂2 − β̂3) = √[var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3)]

• How do we get those variances and covariances? They are all elements of var(β̂), the variance-covariance matrix of β̂ (Which elements?).

5.4 Expressing the above as a “restriction” on the matrix


of coefficients
• We could re-express the hypothesis as a linear restriction on the vector of coefficients:

[0 0 1 −1] · (β0, β1, β2, β3)′ = 0,  or  R′β = q, where R = (0, 0, 1, −1)′ and q = 0

• The sample estimate of R′β is R′β̂; the sample estimate of q is q̂ = R′β̂.
• Consistent with the procedure of hypothesis tests, we could calculate t = (q̂ − q)/se(q̂).

• To test this hypothesis, we need se(q̂). Since q̂ is a linear function of β̂, and since we have estimated the variance of β̂ as s²(X′X)⁻¹, we can estimate q̂'s variance as var(q̂) = R′[s²(X′X)⁻¹]R.
• This is just the matrix version of the rule that var(aX) = a²var(X) (see section 1.3.3).
• Thus, se(q̂) = √{R′[s²(X′X)⁻¹]R}
• Given all of this, the t-statistic for the test of the equality of β2 and β3 can be expressed as:

t = (q̂ − q)/se(q̂) = (R′β̂ − q)/√{R′[s²(X′X)⁻¹]R}

• How does that get us exactly what we had above?

R′β̂ = (β̂2 − β̂3) and q = 0.

• More explicitly, let V = s²(X′X)⁻¹ denote the estimated variance-covariance matrix of β̂, with diagonal elements var(β̂j) and off-diagonal elements cov(β̂j, β̂k). Then

R′[s²(X′X)⁻¹] = [0 0 1 −1] · V
              = [ cov(β̂2, β̂0) − cov(β̂3, β̂0),  cov(β̂2, β̂1) − cov(β̂3, β̂1),  var(β̂2) − cov(β̂3, β̂2),  cov(β̂2, β̂3) − var(β̂3) ]

and

R′[s²(X′X)⁻¹]R = [var(β̂2) − cov(β̂3, β̂2)] − [cov(β̂2, β̂3) − var(β̂3)]
               = var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3)

Thus, se(q̂) = √{R′[s²(X′X)⁻¹]R} = √[var(β̂2) + var(β̂3) − 2cov(β̂2, β̂3)]
• We will come back to this when we do multiple hypotheses via F -tests.
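• In R, the pieces of this calculation come straight from vcov(). The sketch below tests H0: β2 = β3 by hand on simulated data (the variable names and true values are made up for illustration); the car package offers canned routines for linear hypotheses of this kind.

set.seed(99)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5*x1 + 1.0*x2 + 1.0*x3 + rnorm(n)    # beta2 = beta3 in truth
fit <- lm(y ~ x1 + x2 + x3)

V     <- vcov(fit)                               # s^2 (X'X)^{-1}
q.hat <- coef(fit)["x2"] - coef(fit)["x3"]
se.q  <- sqrt(V["x2", "x2"] + V["x3", "x3"] - 2*V["x2", "x3"])
t.q   <- q.hat/se.q
2*pt(-abs(t.q), df = n - length(coef(fit)))      # two-sided p-value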
Section 6

Goodness of Fit
6.1 The R-squared measure of goodness of fit
• The original fitting criteria used to produce the OLS regression coefficients
was to minimize the sum of squared errors. Thus, the sum of squared errors
itself could serve as a measure of the fit of the model. In other words, how
well does the model fit the data?
• Unfortunately, the sum of squared errors will always rise if we add another
observation or if we multiply the values of y by a constant. Thus, if we want
a measure of how well the model fits the data we might ask instead whether
variation in X is a good predictor of variation in y.
• Recall the mean deviations matrix M0 from Section 2.9, which is used to subtract the column means from every column of a matrix. E.g.,

M0X = [ x11 − x̄1   . . .   x1K − x̄K
           ⋮                   ⋮
        xn1 − x̄1   . . .   xnK − x̄K ]

• Using M0, we can take the mean deviations of both sides of our sample regression equation:

M0y = M0Xβ̂ + M0e

• To get the sum of squared deviations, we just use (recalling that M0 is symmetric and idempotent, so M0′M0 = M0):

(M0y)′(M0y) = y′M0′M0y = y′M0y

and

y′M0y = (M0Xβ̂ + M0e)′(M0Xβ̂ + M0e)
      = ((M0Xβ̂)′ + (M0e)′)(M0Xβ̂ + M0e)
      = (β̂′X′M0 + e′M0)(M0Xβ̂ + M0e)
      = β̂′X′M0Xβ̂ + e′M0Xβ̂ + β̂′X′M0e + e′M0e

• Now consider, what is the term M0e? The M0 matrix takes the deviation of a variable from its mean, but the mean of e is equal to zero, so M0e = e and X′M0e = X′e = 0.
• Why does the last equality hold? Intuitively, the way that we have minimized the sum of squared residuals sets the estimated residuals orthogonal to the data matrix: there is no information in the data matrix that helps to predict the residuals.
• You can also prove that X′e = 0 using the fact that e = Mε and that M = (IN − X(X′X)⁻¹X′):

X′e = X′(IN − X(X′X)⁻¹X′)ε = (X′ − X′X(X′X)⁻¹X′)ε = X′ε − X′ε = 0.

• Given this last result, we can say that:

y′M0y = β̂′X′M0Xβ̂ + e′M0e

• The first term in this decomposition is the regression sum of squares (or the variation in y that is explained) and the second term is the error sum of squares. What we have shown is that the total variation in y can be fully decomposed into the explained and unexplained portions. That is,

SST = SSR + SSE

• Thus, a good measure of fit is the coefficient of determination or R²:

R² = SSR/SST

• It follows that:

1 = β̂′X′M0Xβ̂/y′M0y + e′M0e/y′M0y = R² + e′M0e/y′M0y

and

R² = 1 − e′M0e/y′M0y = 1 − SSE/SST

• By construction, R² can vary between 1 (if the model fully explains y and all points lie along the regression plane) and zero (if all the estimated slope coefficients are equal to zero).
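• The decomposition can be checked numerically: computing SST, SSR, and SSE from an lm fit reproduces the reported R². A sketch with simulated data (values are arbitrary):

set.seed(5)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5*x + rnorm(n)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)
SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(y))^2)
c(SSR + SSE, SST)                           # verifies SST = SSR + SSE
c(1 - SSE/SST, summary(fit)$r.squared)      # the two should match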

6.1.1 The uses of R2 and three cautions


• R2 is frequently used to compare different models and decide between them.
Other things equal, we would prefer a model that explains more to a model
that explains less. There are three qualifications that need to be borne in
mind, however, when using R2 , one of which motivates the use of the adjusted
R2 .

1. R2 (almost) always rises when you add another regressor


• Consider a model y = Xb + e.
• Add to this model an explanatory variable, z, which I know a priori to be
completely independent of y. Unless the estimated regression coefficient
on z is exactly equal to zero, I will be adding some information to the
model. Thus, I will automatically lower the estimated residuals and
increase R2 , leading to a temptation to throw ever more variables into
the model.
• Alternative: Adjusted R²:

R̄² = 1 − (e′M0e/y′M0y) · [(n − 1)/(n − K)]

• As the number of regressors increases, the sum of squared errors decreases, but the adjustment term increases. Thus, the adjusted R² can go up or down when you add a new regressor. The adjusted R² can also be written as:

R̄² = 1 − [(n − 1)/(n − K)](1 − R²).

2. R2 is not comparable across different dependent variables

• If you are comparing two models on the basis of R2 or the adjusted R2 ,


the sample size, n, and the dependent variable must be the same.
• Say you used ln y as the dependent variable in your next model. R2
measures the proportion of the variation in the dependent variable that
is explained by the model, but in the first case it will be measuring the
variation in y and in the second case it will be measuring the variation in
ln y. The two are not expected to be the same and cannot be taken as the
basis for model comparison. Also, clearly, as the number of observations
changes, adjusted R2 will also change.
• The standard error of the regression is recommended instead of R2 in
these cases.

3. R2 does not necessarily lie between zero and one if we do not include a
constant in the model.

• Inclusion of a constant gives the result from the normal equations that ȳ = x̄′β̂, which also implies that ē = 0. If you do not include a constant term in the regression model, this does not necessarily hold, which means that we cannot decompose the variation in y as we did above.
• As a result, if you do not include a constant, it is possible to get an R2
of less than zero. Generally, you should exclude the constant term only
if you have strong theoretical reasons to believe the dependent variable
is zero when all of the explanatory variables are zero.

• R2 is often abused. We will cover alternative measures that can be used to


judge the adequacy of a regression model. These include Akaike’s Informa-
tion Criterion, the Bayes Information Criterion and Amemiya’s Prediction
Criterion.

6.2 Testing Multiple Hypotheses: the F -statistic and R2

• Can use the restriction approach to test multiple hypotheses. E.g., H0 : β2 =


0 and β3 = 0.
• The method we will use can accommodate any case in which we have J < K
hypotheses to test jointly, where K is the number of regressors including the
constant.
• J restrictions ⇒ J rows in the restriction matrix, R. We can express the null hypothesis associated w/ the example above as a set of linear restrictions on the β vector, Rβ = q:

Rβ = q  ⇒  [ 0 0 1 0
             0 0 0 1 ] · (β0, β1, β2, β3)′ = (0, 0)′

• Given our estimate β̂ of β, our interest centers on how much Rβ̂ deviates from q and we base our hypothesis test on this deviation.
• Let m = Rβ̂ − q. Intuition: we will reject H0 if m is too large. Since m is a vector, we must have some way to measure distance in multi-dimensional space. One possibility (suggested by Wald) is to use the normalized sum of squares, m′{var(m)}⁻¹m.

• This is equivalent to estimating (m/σm)′(m/σm).

• If m can be shown to be normal w/ mean zero, then the statistic above would
be made up of a sum of squared standard normal variables (with the number
of squared standard normal variables given by the number of restrictions)
⇒ χ2 (Wald statistic).
• We know that β̂ is normally distributed ⇒ m is normally distributed (Why?).

• Further, var(m) = var(Rβ̂ − q) = R{σ²(X′X)⁻¹}R′ and so the statistic

W = m′{var(m)}⁻¹m = (Rβ̂ − q)′[σ²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q) ∼ χ²

• Since we do not know σ² and must estimate it by s², we cannot use the Wald statistic. Instead, we derive the sample statistic:

F = {(Rβ̂ − q)′[σ²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J} / {[(n − k)s²/σ²]/(n − k)}

[Notice that we are dividing by the same correction factor that we used to prove the appropriateness of the t-test in the single hypothesis case.]
• Recall that

s² = e′e/(n − k) = ε′Mε/(n − k).

• Since under the null hypothesis Rβ = q, then Rβ̂ − q = Rβ̂ − Rβ = R(β̂ − β) = R(X′X)⁻¹X′ε.
• Using these expressions, which are true under the null, we can transform the sample statistic above so that it is a function of ε/σ:

F = ({R(X′X)⁻¹X′ε/σ}′[R(X′X)⁻¹R′]⁻¹{R(X′X)⁻¹X′ε/σ}/J) / ([M(ε/σ)]′[M(ε/σ)]/(n − k))
• The F -statistic is the ratio of two quadratic forms in ε/σ.
• Thus, both the numerator and the denominator are, in effect, the sum of
“squared” standard normal random variables (where the ε/σ are the standard
normal variables) and are therefore distributed χ2 .
• Since it can be shown that the two quadratic forms are independent (see
Greene, 6th Ed., p. 85) ⇒ the sample statistic above, is distributed as F
with degrees of freedom [J, n − k], where J is the number of restrictions and
n − k is the degrees of freedom in the estimation of s2 .

• Cancel the σs to get

F = [(Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)/J] / [e′e/(n − k)]

or

F = (Rβ̂ − q)′[s²R(X′X)⁻¹R′]⁻¹(Rβ̂ − q) / J

• Question posed by this test: how far is the estimated value, Rβ̂, from its value under the null, where this distance is "scaled" in the F test by its estimated variance, equal to s²R(X′X)⁻¹R′.
• The F statistic from our running example is:

F[2,n−k] = { [β̂2  β̂3] · [ var(β̂2)      cov(β̂2, β̂3)
                          cov(β̂3, β̂2)  var(β̂3)    ]⁻¹ · (β̂2, β̂3)′ } / 2

The subscripts indicate that its degrees of freedom are two (for the number of restrictions) and (n − k) (for the number of free data points used to estimate σ²).
• The F-statistic can be handy when confronted w/ high collinearity.
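• The Wald form of the F statistic can be computed directly from the estimated coefficients and vcov(). A sketch for H0: β2 = 0 and β3 = 0 on simulated data (the model is made up for illustration):

set.seed(11)
n <- 150
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.8*x1 + 0.1*x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

R <- rbind(c(0, 0, 1, 0),                    # picks out beta2
           c(0, 0, 0, 1))                    # picks out beta3
q <- c(0, 0)
m  <- R %*% coef(fit) - q
Vm <- R %*% vcov(fit) %*% t(R)               # s^2 R (X'X)^{-1} R'
J  <- nrow(R)
F.stat <- drop(t(m) %*% solve(Vm) %*% m)/J
c(F.stat, pf(F.stat, J, n - length(coef(fit)), lower.tail = FALSE))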

6.3 The relationship between F and t for a single restriction

• Assume that in the example above we now have only one restriction reflected in the R matrix: β2 = 0. The relevant F-statistic is given by:

F = (β̂2)′[var(β̂2)]⁻¹(β̂2) / 1.

Since [var(β̂2)] is a matrix containing one element, its inverse is equal to 1/var(β̂2).

• Thus, F = β̂2²/var(β̂2), which is just equal to the square of the estimated t-statistic.

6.4 Calculation of the F-statistic using the estimated residuals
• Another approach to using the F -test is to regard it as a measure of “loss of
fit.” We’ll show that this is equivalent although it may be easier to implement
computationally.
• Imposing any restrictions on the model is likely to increase the sum of the
squared errors, simply because the minimization of the error terms is being
done subject to a constraint. On the other hand, if the null hypothesis is
true, then the restrictions should not substantially increase the error term.
Therefore, we can build a test of the null hypothesis from the estimated
residuals, which will result in an F -statistic equivalent to the one above.
• Let e* equal the estimated residuals from the restricted model, so e* = y − Xβ̂*, where β̂* is the restricted estimator.
• Adding and subtracting Xβ̂ we get:

e* = y − Xβ̂ − X(β̂* − β̂) = e − X(β̂* − β̂)

• The sum of squared residuals in the restricted model is:

e*′e* = (e′ − (β̂* − β̂)′X′)(e − X(β̂* − β̂)) = e′e + (β̂* − β̂)′X′X(β̂* − β̂) ≥ e′e

(the cross-product terms vanish because X′e = 0).

• The loss of fit is equal to:

e*′e* − e′e = (β̂* − β̂)′X′X(β̂* − β̂) = (Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)

• In other words, the loss of fit can also be seen as (most of) the usual numerator from the F-statistic. Under the null, β̂* = β̂, thus (Rβ̂ − q) is equal to (β̂* − β̂) with R equal to the identity matrix. Thus, we simply need to introduce our s² estimate and divide by J, the number of restrictions, to obtain an F-statistic.

F[J,n−k] = [(e*′e* − e′e)/J] / [e′e/(n − k)]

• Thus, an equivalent procedure for calculating the F-statistic is to run the restricted model, calculate the residuals, run the unrestricted model, calculate this set of residuals, and construct the F-statistic based on the expression above.
• Since e′e/SST = 1 − R², we can also divide the above expression through by the SST to yield:

F[J,n−k] = [(R² − R*²)/J] / [(1 − R²)/(n − k)]

where R*² is the coefficient of determination in the restricted model.
• If we want to test the significance of the model as a whole, we use:

F[k−1,n−k] = [R²/(k − 1)] / [(1 − R²)/(n − k)]
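• The loss-of-fit version is straightforward to implement: fit the restricted and unrestricted models, collect e′e from each, and form the ratio; anova() on the two nested fits reports the same F. A sketch continuing the joint hypothesis H0: β2 = β3 = 0 from the earlier simulated example:

set.seed(11)
n <- 150
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.8*x1 + 0.1*x2 + rnorm(n)

unrestricted <- lm(y ~ x1 + x2 + x3)
restricted   <- lm(y ~ x1)                   # imposes beta2 = beta3 = 0

ee.u <- sum(resid(unrestricted)^2)
ee.r <- sum(resid(restricted)^2)
J <- 2; k <- length(coef(unrestricted))
F.stat <- ((ee.r - ee.u)/J) / (ee.u/(n - k)) # matches the Wald form above
F.stat
anova(restricted, unrestricted)              # reports the same F and p-value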

6.5 Tests of structural change


• We often pool different sorts of data in order to test hypotheses of interest.
For instance, we are liable to cumulate different U.S. Presidents in a time-
series used to test the relationship between the economy and approval and to
pool different countries when testing for the relationship between democracy
and income levels.
• In some cases, this pooling is inappropriate because we would not expect
the causal mechanisms that underlie our model to function identically across
different countries or time-periods.
• Tests for structural change or model stability (generally known as Chow tests) allow us to test whether the model is significantly different for two different time periods or different groups of observations (e.g., countries).

• Suppose that we are interested in how a set of explanatory variables affects a


dependent variable before and after some critical juncture. Denote the data
before the juncture as y1 and X1 and after as y2 and X2 . We also allow the
coefficients across the two periods to vary.
• We wish to test an unrestricted model

[ y1 ]   [ X1   0  ]   [ β1 ]   [ ε1 ]
[ y2 ] = [ 0    X2 ] · [ β2 ] + [ ε2 ]

against the restricted model

[ y1 ]   [ X1 ]        [ ε1 ]
[ y2 ] = [ X2 ] · β* + [ ε2 ]

This is just a special case of linear restrictions (β1 = β2) where R = [I  −I] and q = 0.
• The restricted sum of squares is e*′e*, which we estimate using all n1 + n2 observations (so that this is just the normal sum of squared errors). The unrestricted sum of squares is e1′e1 + e2′e2, which is estimated from each sub-sample separately. Thus, we can construct an F-test from:

F[K, n1+n2−2K] = [(e*′e* − e1′e1 − e2′e2)/K] / [(e1′e1 + e2′e2)/(n1 + n2 − 2K)]

• In this case, the number of restrictions, J, is equal to the number of coeffi-


cients in the restricted model, K, because we are assuming that each one of
these coefficients is the same across sub-samples. If the F -statistic is suffi-
ciently large, we will reject the hypothesis that the coefficients are equal.
• This test can also be performed using dummy variables.
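• A sketch of the Chow test in R: fit the pooled (restricted) model and the two sub-sample (unrestricted) models, then build the F statistic from their residual sums of squares. The group sizes and coefficient values below are invented for illustration:

set.seed(21)
n1 <- 60; n2 <- 60
x <- rnorm(n1 + n2)
g <- rep(c(0, 1), c(n1, n2))                 # sub-sample indicator
y <- ifelse(g == 0, 1 + 0.5*x, 2 + 1.5*x) + rnorm(n1 + n2)

pooled <- lm(y ~ x)                          # restricted: one set of coefficients
fit1   <- lm(y[g == 0] ~ x[g == 0])          # unrestricted sub-sample fits
fit2   <- lm(y[g == 1] ~ x[g == 1])

K    <- length(coef(pooled))
ee.r <- sum(resid(pooled)^2)
ee.u <- sum(resid(fit1)^2) + sum(resid(fit2)^2)
F.chow <- ((ee.r - ee.u)/K) / (ee.u/(n1 + n2 - 2*K))
c(F.chow, pf(F.chow, K, n1 + n2 - 2*K, lower.tail = FALSE))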

6.5.1 The creation and interpretation of dummy variables


• So far, we have mainly discussed continuous explanatory variables. Now we
turn to the creation and interpretation of dummy variables that take on the
value zero or one to indicate membership of a binary category and discuss
their uses as explanatory variables.
• Suppose that we are trying to determine whether money contributed to po-
litical campaigns varies across regions of the country (e.g., northeast, south,
midwest, and west).
• We can estimate: Yi = β0 + β1 D1i + β2 D2i + β3 D3i + εi
Where
➢ D1 = 1 if the contributor is from the south.
➢ D2 = 1 if the contributor is from the midwest.
➢ D3 = 1 if the contributor is from the west.
• This type of statistical analysis is called ANOVA (or analysis of variance)
and informs you about how contribution levels vary by region.
• What is the interpretation of the intercept in this regression and what is the
interpretation of the dummy variables and their t-statistics?

6.5.2 The “dummy variable trap”


• We do not include all the dummy variables in the regression above. Why?
• There is a constant in the regression above. The constant appears in the X
data matrix as a column of ones. Each of the dummy variables appears in
the data matrix as a column of zeroes and ones. Since they are mutually
exclusive and collectively exhaustive, if we were to introduce them all into
the data matrix, the sum:

D1i + D2i + D3i + D4i = 1

for all observations.



• If we include all of them in the regression with a constant, then the data
matrix X does not have full rank, it has rank at most of K − 1 (perfect
collinearity). In this case, the matrix (X0 X) is also not of full rank and has
rank of at most K − 1.
• Another way of putting this, is that we do not have enough information
to estimate the constant separately if we include all the regional dummies
because, once we have them in, the overall model constant is completely
determined by the individual regional intercepts and the weight of each region.
It cannot be separately estimated, because it is already given by what is in
the model.
• If you include a constant in the model, you can only include M − 1 of M
mutually exclusive and collectively exhaustive dummy variables. Or could
drop the constant.

6.5.3 Using dummy variables to estimate separate intercepts and slopes


for each group
• The model above allowed us to estimate different intercepts for each region
in a model of political contributions. Suppose I am now running a model
that looks at the effect of gender on campaign contributions. We might also
expect personal income to affect the level of donations and include it in our
model as well.
• Once we include both continuous and categorical variables into the same
model, we are no longer performing ANOVA but ANCOVA (analysis of co-
variance). If we were interested only in estimating a model looking at the
effects of gender and income on political contributions, our model would take
the following form:
yi = β0 + β1 dm,i + β2 Xi + εi
This model simply estimates a different intercept for men.

• Suppose, however, we think that men will contribute more to political cam-
paigns as their income rises than will women. To test this hypothesis, we
would estimate the model:
Yi = β0 + β1 dm,i + β2 Xi + β3 dm,i · Xi + εi
To include the last term in the model, simply multiply the income variable
by the gender dummy. If income is measured in thousands of dollars, what
is the marginal effect of an increase of $1,000 dollars for men on expected
contributions to political campaigns?
• How do we test whether gender has any impact on the marginal effect of
income on political contributions? How do we test the hypothesis that gender
has no impact on the behavior underlying political contributions?
• It is generally a bad idea to estimate interaction terms without including the
main effects.
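• A sketch of the gender-by-income interaction model in R (all names and values are hypothetical). The coefficient on the interaction is the difference in the income slope for men, so the marginal effect of a $1,000 increase in income is β2 for women and β2 + β3 for men:

set.seed(30)
n <- 300
income  <- runif(n, 10, 100)                 # measured in thousands of dollars
male    <- rbinom(n, 1, 0.5)
contrib <- 50 + 20*male + 2*income + 1.5*male*income + rnorm(n, sd = 25)

fit <- lm(contrib ~ male*income)             # expands to male + income + male:income
coef(fit)
# Testing whether gender matters at all is the joint F-test that the
# coefficients on male and male:income are both zero.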

6.6 The use of dummy variables to perform Chow tests of


structural change
• One of the more common applications of the F-test is in tests of structural change (or parameter stability) due to Chow. The test basically asks whether the behavior that one is modeling varies across different sub-sets of observations.
• We often estimate a causal model using samples that combine men and women, rich and poor countries, and different time periods. Yet we can hypothesize that the behavior that we are modeling varies by gender, income level and time period. If that is correct, then restricting the coefficients to be the same across the different sub-sets of observations is inappropriate.
• For example, in the model of political contributions, if I assume that giving
to political campaigns does depend on gender, I would estimate the “unre-
stricted” model:
Yi = β 0 + β 1 dm,i + β 2 Xi + β 3 dm,i · Xi + εi

• If, however, I impose the restriction that gender does not matter, then I
estimate the restricted model:
Yi = β 0 + β 2 Xi + εi
In this case, I am restricting β 1 to be equal to zero and β 3 to be equal to
zero.
• In general terms, we wish to test the unrestricted model where the OLS
coefficients differ by gender.
• The Chow test to determine whether the restriction is valid is performed
using an F -test and can be accomplished using the approach outlined above.
• This version of the Chow Test is exactly equivalent to running the unrestricted
model and then performing an F -test on the joint hypothesis that β 1 = 0 and
β 3 =0. Why is this?
• We saw previously that an F-test could also be computed using:

F[J,n−k] = [(e*′e* − e′e)/J] / [e′e/(n − k)]

This is just the same as we did above. We are simply getting the unrestricted error sum of squares from the regressions run on men and women separately.
• This form of the F-test could be derived as an extension of testing a restriction on the OLS coefficients.
• Under the restriction, the β vector of regression coefficients is equal to β*. In the case outlined above:

β = (β0, β1, β2, β3)′  and  β* = (β0, 0, β2, 0)′

• But the null hypothesis is given by β = β*. In terms of our notation for expressing a hypothesis as a restriction on the coefficients, Rβ = q, we have R equal to the identity matrix and q = β*. Moreover, the distance Rβ̂ − q is now given by β̂ − β̂*.

• As a result, the quantity (β̂* − β̂)′X′X(β̂* − β̂) = e*′e* − e′e can be written as:

(Rβ̂ − q)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q)

• This is most of the F-statistic, except that we now have to divide by s², equal to e′e/(n − k), our estimate of the variance of the true errors, and divide by J, to get the same F-test that we would get above.
• Testing this restriction on the vector of OLS coefficients, however, is exactly the same thing as testing that β1 and β3 are jointly equal to zero, and the latter computing procedure is far easier in statistical software.

6.6.1 A note on the estimate of s2 used in these tests


• In using these Chow tests, we have assumed that σ 2 is the same across sub-
samples, so that it is okay if we use the s2 estimate from the restricted or
unrestricted model (although in practice we use s2 from the unrestricted
model).
• If this is not persuasive, there is an alternative. Suppose that θ̂ 1 and θ̂ 2 are
two normally distributed estimators of a parameter based on independent
sub-samples with variance-covariance matrices V1 and V2 . Then, under the
null hypothesis that the two estimators have the same expected value, θ̂ 1 − θ̂ 2
has mean 0 and variance V1 + V2 . Thus, the Wald statistic:

W = (θ̂1 − θ̂2)′[V1 + V2]⁻¹(θ̂1 − θ̂2)

has a chi-square distribution with K degrees of freedom and a test that the
difference between the two parameters is equal to zero can be based on this
statistic.
• What remains for us to resolve is whether it is valid to base such a test on
estimates of V1 and V2 . We shall see, when we examine asymptotic results
from infinite samples that we can do so as our sample becomes sufficiently
large.

• Finally, there are some more sophisticated techniques which allow you to be
agnostic about the locations of the breaks (i.e., let the data determine where
they are).
Section 7

Partitioned Regression and Bias


7.1 Partitioned regression, partialling-out and applications
• We will first examine the OLS formula for the regression coefficients using
“partitioned regression” as a means of better understanding the effects of
excluding a variable that should have been included in the model.
• Suppose our regression model contains two sets of variables, X1 and X2. Thus:

y = Xβ + ε = X1β1 + X2β2 + ε = [X1  X2] · (β1, β2)′ + ε

• What is the algebraic solution for β̂, the estimate of β in this partitioned version of the regression model?
• To find the solution, we use the "normal equations," (X′X)β̂ = X′y, employing the partitioned matrix [X1  X2] for X.
• In this case,

X′X = [ X1′
        X2′ ] · [X1  X2] = [ X1′X1   X1′X2
                             X2′X1   X2′X2 ]

• The derivation of X′y proceeds analogously.
• The normal equations for this partitioned matrix give us two expressions with two unknowns:

[ X1′X1   X1′X2     [ β̂1        [ X1′y
  X2′X1   X2′X2 ] ·   β̂2 ]  =     X2′y ]


• We can solve directly for β̂1, the first set of coefficients, by multiplying through the first row of the partitioned matrix and then using the trick of pre-multiplying by (X1′X1)⁻¹:

(X1′X1)β̂1 + (X1′X2)β̂2 = X1′y

and

β̂1 = (X1′X1)⁻¹X1′y − (X1′X1)⁻¹X1′X2β̂2 = (X1′X1)⁻¹X1′(y − X2β̂2)

• We can conclude from the above that if X1′X2 = 0, i.e., if X1 and X2 are orthogonal, then the estimated coefficient β̂1 from a regression of y on X1 will be the same as the estimated coefficients from a regression of y on X1 and X2.
• If not, however, and if we mistakenly omit X2 from the regression, then reporting the coefficient β̂1 as (X1′X1)⁻¹X1′y will misstate the true coefficient by (X1′X1)⁻¹X1′X2β̂2.
• The solution for β̂2 is found by multiplying through the second row of the matrix and substituting the above expression for β̂1. This method for estimating the OLS coefficients is called "partialling out."

(X2′X1)β̂1 + (X2′X2)β̂2 = X2′y

• Now substitute for β̂1:

(X2′X1)[(X1′X1)⁻¹X1′y − (X1′X1)⁻¹X1′X2β̂2] + (X2′X2)β̂2 = X2′y

Multiplying through and rearranging gives:

[X2′X2 − X2′X1(X1′X1)⁻¹X1′X2]β̂2 = X2′y − X2′X1(X1′X1)⁻¹X1′y
X2′[I − X1(X1′X1)⁻¹X1′]X2β̂2 = X2′[I − X1(X1′X1)⁻¹X1′]y

Solving for β̂2 gives

β̂2 = [X2′(I − X1(X1′X1)⁻¹X1′)X2]⁻¹[X2′(I − X1(X1′X1)⁻¹X1′)y]
   = (X2′M1X2)⁻¹(X2′M1y)

• Here, the matrix M is once again the "residual maker," with the subscript indicating the relevant set of explanatory variables. Thus:

M1X2 = a vector of residuals from a regression of X2 on X1.
M1y = a vector of residuals from a regression of y on X1.

• Using the fact that M1 is symmetric and idempotent, so that M1′M1 = M1, and setting X2* = M1X2 and y* = M1y, we can write the solution for β̂2 as:

β̂2 = (X2*′X2*)⁻¹X2*′y*

• Thus, we have an entirely separate approach by which we can estimate the OLS coefficient β̂2 when X2 is a single variable:
1. Regress X2 on X1 and calculate the residuals.
2. Regress y on X1 and calculate the residuals.
3. Regress the second set of residuals on the first. The first round of regressions "partials out" the effect of X1 on X2 and y, allowing us to calculate the marginal effect of X2 on y, accounting for X1.
• The process can also be carried out in reverse, partialling X2 out of X1 and y, which yields:

β̂1 = (X1*′X1*)⁻¹X1*′y*
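• This partialling-out (Frisch-Waugh) result is easy to verify numerically: regressing residuals on residuals reproduces the coefficient from the full multiple regression. A sketch with simulated, deliberately correlated regressors:

set.seed(8)
n <- 100
x1 <- rnorm(n)
x2 <- 0.6*x1 + rnorm(n)                      # x1 and x2 are correlated
y  <- 1 + 2*x1 - 1*x2 + rnorm(n)

full <- lm(y ~ x1 + x2)

e.x2 <- resid(lm(x2 ~ x1))                   # x2 with x1 (and the constant) partialled out
e.y  <- resid(lm(y ~ x1))                    # y with x1 (and the constant) partialled out
partial <- lm(e.y ~ e.x2)

c(coef(full)["x2"], coef(partial)["e.x2"])   # identical coefficient on x2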

7.2 R2 and the addition of new variables


• Using the partitioning approach, we can prove that R2 will always rise when
we add a variable to the regression.
• Start with the model y = Xβ + ε and y = Xb + e where b = β̂
• Add to this model a single variable z with coefficient c, so that y = Xd + zc + u (d ≠ b unless z is orthogonal to X).
• R² will increase if u′u < e′e. To find out if this inequality holds, we have to find an expression relating u and e.
• From the formula for partitioned regression above, we can say that:

d = (X′X)⁻¹X′(y − zc) = b − (X′X)⁻¹X′zc

and we know that u = y − Xd − zc.

• Inserting the expression for d into this equation for the residuals, we get:

u = y − Xb + X(X′X)⁻¹X′zc − zc     (7.1)

• Since I − X(X′X)⁻¹X′ = M, we can rewrite Eq. 7.1 as:

u = e − Mzc = e − z*c

where z* = Mz.

• Thus,

u′u = e′e + c²(z*′z*) − 2cz*′e

Now e = My = y* and z*′e = z*′y*. Since c = (z*′z*)⁻¹(z*′y*), we can say that z*′e = z*′y* = c(z*′z*).

• Therefore:

u′u = e′e + c²(z*′z*) − 2c²(z*′z*) = e′e − c²(z*′z*)

Since the last term is c² times a sum of squares, it cannot be negative.

• Thus, u′u < e′e unless c = 0. In other words, adding another variable to the regression will always result in a higher R² unless the OLS coefficient on the additional variable is exactly equal to zero.

7.3 Omitted variable bias


• The question of which variables should be included in a regression model is
part of the larger question of selecting the right specification. Other issues in
specification testing include whether you have the right functional form and
whether the parameters are the same across sub-samples.
• We will now use a different approach to prove that the OLS coefficients will
normally be biased when you exclude a variable from the regression that is
part of the true, population model.
• Suppose that the correctly specified regression model would be:

y = X1 β 1 + X2 β 2 + ε

• If we don't know this and assume that the correct model is y = X1β1 + ε, then we will regress y on X1 only. Then our estimate of β1 will be:

β̂1 = (X1′X1)⁻¹X1′y

• Now substitute y = X1β1 + X2β2 + ε from the true equation for y in the expression for β̂1:

β̂1 = β1 + (X1′X1)⁻¹X1′X2β2 + (X1′X1)⁻¹X1′ε

• To check for bias (i.e., whether E(β̂1) ≠ β1), take the expectation:

E(β̂1) = β1 + (X1′X1)⁻¹X1′X2β2

• Thus, unless (X1′X2) = 0 or β2 = 0, β̂1 is biased. Also, note that (X1′X1)⁻¹X1′X2 gives the coefficient(s) in a regression of X2 on X1.
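• A small simulation shows the bias formula at work: when a relevant, positively correlated regressor is omitted, the short-regression coefficient is centered away from the truth by the auxiliary-regression coefficient times β2. The numbers below are assumptions chosen purely for illustration:

set.seed(3)
n <- 200; reps <- 2000
b1 <- 1; b2 <- 2                             # true coefficients
draws <- replicate(reps, {
  x1 <- rnorm(n)
  x2 <- 0.5*x1 + rnorm(n)                    # coefficient of x2 on x1 is 0.5
  y  <- b1*x1 + b2*x2 + rnorm(n)
  c(short = coef(lm(y ~ x1))["x1"],          # omits x2
    long  = coef(lm(y ~ x1 + x2))["x1"])     # correct specification
})
rowMeans(draws)   # short is centered near b1 + 0.5*b2 = 2; long is centered near 1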

7.3.1 Direction of the Bias


• If we have just two explanatory variables in our data set, then the coefficient from a regression of X2 on X1 will be equal to cov(X1, X2)/var(X1).

• Thus, if we had a good idea of the sign of the covariance and the sign of β 2 ,
we would be able to predict the direction of the bias.
• If there are more than two explanatory variables in the regression, then the
direction of the bias on β̂ 1 when we exclude X2 will depend on the partial
correlation between X1 and X2 (since this always has the same sign as the
regression coefficient in the OLS regression model) and the sign of the true
β2.

7.3.2 Bias in our estimate of σ 2


• In order to calculate the standard error of the coefficients, we need to estimate σ². If we did this just from the regression including X1 we would get:

s² = e1′e1/(n − K1)

but e1 = M1y = M1(X1β1 + X2β2 + ε) = M1X2β2 + M1ε. (Note that M1X1 = 0.)
• Does E(e1′e1) = σ²(n − K1)?

E(e1′e1) = E[(M1X2β2 + M1ε)′(M1X2β2 + M1ε)]
         = E[(β2′X2′M1′ + ε′M1′)(M1X2β2 + M1ε)]
         = β2′X2′M1X2β2 + E(β2′X2′M1ε) + E(ε′M1′X2β2) + E(ε′M1ε)
         = β2′X2′M1X2β2 + E(ε′M1ε)

This last step involves multiplying out M1X2 and using E(ε′X2) = 0 and E(X1′ε) = 0, both of which we know from the Gauss-Markov assumptions.
• From here we proceed as we did when working out the expectation of the sum of squared errors originally:

E(e1′e1) = β2′X2′M1X2β2 + σ²tr(M1) = β2′X2′M1X2β2 + (n − K1)σ²

The first term in the expression is the increase in the error sum of squares (SSE) that results when X2 is dropped from the regression.
• We can show that β2′X2′M1X2β2 is positive, so the s² derived from the regression of y on X1 will be biased upward (intuition: the shorter version of the model leaves more variance unaccounted for).
• Omitted variable bias is a result of a violation of the Gauss-Markov assump-
tions: we assumed we had the correct model.

7.3.3 Testing for omitted variables: The RESET test


• The RESET test (due to Ramsey) checks for patterns in the estimated resid-
uals against the predicted values of y, ŷ.
• Specifically, the test is conducted by estimating the original regression, com-
puting the predicted values, and then using powers of these predicted values,
ŷi2 , ŷi3 and of the original regressors, X, as explanatory variables in an ex-
panded regression. An F -test on these additional regressors is then employed
to detect whether or not they are jointly significant.
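• A sketch of the RESET test done by hand in R (add-on packages such as lmtest provide canned versions, but the manual route makes the logic clear). The quadratic data-generating process below is invented so that the misspecification is detectable:

set.seed(13)
n <- 100
x <- runif(n, 0, 5)
y <- 1 + 2*x + 0.5*x^2 + rnorm(n)            # true model is quadratic

base <- lm(y ~ x)                            # (mis)specified linear model
yhat <- fitted(base)
aug  <- lm(y ~ x + I(yhat^2) + I(yhat^3))    # add powers of the fitted values

anova(base, aug)                             # F-test on the added powers;
                                             # a small p-value signals misspecification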

7.4 Including an irrelevant variable


• What if the true model is y = X1 β 1 +ε and we estimate y = X1 β 1 +X2 β 2 +ε?
We are likely to find that the estimated coefficient β̂ 2 cannot be distinguished
from zero.
• β̂ 1 will not be biased, nor will s2 . [To verify, use β 2 = 0 in the above
equations].
• This does not mean we should estimate “kitchen sink” regressions.
• The cost of including an irrelevant variable is a loss of efficiency. It can be shown that the sampling variances of the coefficient estimates rise when we add a variable (unless it is orthogonal to the included regressors), which means we will estimate the coefficients less precisely. Our standard errors will rise and our t-statistics will fall.

7.5 Model specification guidelines


• Think about theoretically likely confounding variables; alternative hypothe-
ses.
• Avoid “stepwise regression”.
➢ A common technique of model building is to build it iteratively by adding
regressors sequentially and keeping those that are statistically significant.
This is a very problematic technique.
➢ Consider a simple case of adding columns from X sequentially and dis-
carding those whose estimated coefficient is not statistically different from
zero. First, we estimate y = β0 + x1 β1 + ε, a simple model with an in-
tercept and one explanatory variable.
➢ We will keep x1 in if |β̂1|/se(β̂1) > tα/2 (the t-statistic is bigger than the critical value).
➢ The problem is that unless y = β0 + x1 β1 + ε is the true model, β̂1 and
se(βˆ1 ) will be biased. So our decision of what to include will be based on
biased estimates.
• Avoid “Data Mining”

➢ Data mining refers to the practice of estimating lots of different models


to see which has the best fit and gives statistically significant coefficients.
➢ The problem is that if you are using a 95% confidence interval (size of the
test, α, is .05) then the probability that you will reject the null hypothesis
that β = 0, when it is true, is 5%.
➢ If you try out twenty different variables, all of which actually have true β = 0, the expected number that will look significant in this sample is approximately one. Thus, when you are doing repeated tests, your probability of incorrectly rejecting the null is actually much higher than the given size of the test.

• Try “Cross Validation”


➢ Randomly divide your data set in half. Select your model using the first
data set and guided by theory.
➢ Next, after deciding on the right model using the first data set, apply
it to the second data set. If any of the variables are not significant in
your second data set, there’s a good chance that they were incorrectly
included to begin with.
• Let theory be your guide
➢ Include variables that are theoretically plausible or necessary for accounting for relevant (but theoretically uninteresting) phenomena.
➢ If theory is not a good guide, for instance when it comes to determining the number of lags in a time series model, you could try using one of the criteria used in the profession for model selection, including the Akaike Information Criterion.

7.6 Multicollinearity
• Partitioned regression can be used to show that if you add another variable
to a regression, the standard errors of the previously included coefficients will
rise and their t-statistics will fall.
• Magnitude of change depends on covariance b/t variables.
• The estimated variance of β̂k is given by

var(β̂k) = s²[(X′X)⁻¹]kk

⇒ the estimated variance depends on the presence of other variables in the regression via their influence on the kth diagonal element of the inverse matrix (X′X)⁻¹.

• We can use the formula for the inverse of a partitioned matrix to say some-
thing about what this element will be.

• Let the true model be equal to Yi = β0 + β1X1i + β2X2i + εi.

• Measure all the data as deviations from means, so that the transformed model is Yi* = α1X1i* + α2X2i* + εi*, where α1 = β1, α2 = β2, and X1i* = (x1i − x̄1), etc.
• You can think of this as the case in which we had first "partialled out" the intercept, by regressing all the variables on an intercept and using the residuals from this regression, which amount to the deviations from the mean.
• The data matrix is now X = [X1*  X2*] and

X′X = [ X1*′X1*   X1*′X2*
        X2*′X1*   X2*′X2* ]

• Multiplying out this last matrix,

X′X = [ Σi(x1i − x̄1)²              Σi(x1i − x̄1)(x2i − x̄2)
        Σi(x2i − x̄2)(x1i − x̄1)     Σi(x2i − x̄2)²           ]

• To find the variance of α̂2 = β̂2, we would need to know [(X′X)⁻¹]22, the bottom right element of the inverse of the matrix above.
• To find this, we will use a result on the inverse of a partitioned matrix (see Greene, 6th ed., p. 966). The formula for the bottom right element of the inverse of a partitioned matrix is given by F2 = (A22 − A21A11⁻¹A12)⁻¹, where the A's just refer to the blocks of the original partitioned matrix.
• Using this for the above matrix we have:

[(X′X)⁻¹]22 = [ Σi(x2i − x̄2)² − Σi(x2i − x̄2)(x1i − x̄1) · (1/Σi(x1i − x̄1)²) · Σi(x1i − x̄1)(x2i − x̄2) ]⁻¹
            = [ var(X2)(n − 1) − cov(X1, X2)(n − 1) · cov(X1, X2)(n − 1) / (var(X1)(n − 1)) ]⁻¹
            = [ var(X2)(n − 1) − cov(X1, X2)²(n − 1)/var(X1) ]⁻¹

• Multiply through by 1 = var(X2)/var(X2) to give:

[(X′X)⁻¹]22 = [ (1 − cov(X1, X2)²/(var(X1)var(X2))) · var(X2)(n − 1) ]⁻¹
            = [ (1 − r12²) · var(X2)(n − 1) ]⁻¹
            = 1 / [(1 − r12²)var(X2)(n − 1)]

• In this expression, r12² is the squared partial correlation coefficient between X1 and X2.
• It follows then that:

var(β̂2) = s²[(X′X)⁻¹]22 = s² / [(1 − r12²)var(X2)(n − 1)]

Thus, the variance of the OLS coefficient rises when the two included variables have a higher partial correlation.
• Intuition: the partialling out formula for the OLS regression coefficients
demonstrated that each regression coefficient β̂k is estimated from the co-
variance between y and Xk once you have netted out the effect of the other
variables. If Xk is highly correlated with another variable, then there is less
“information” remaining once you net out the effect of the remaining variables
in your data set.
This result can be generalized for a multivariate model with K regressors (including the constant) in which the data are entered in standard form, not in deviations:

var(β̂k) = s² / [(1 − R²k.X−k)var(Xk)(n − 1)]

where R²k.X−k is the R² from a regression of Xk on all the other regressors, X−k.

• It follows that the variance of β̂k :


➢ rises with σ 2 the variance of ε and its estimate, s2 .
➢ falls as the variance of Xk rises and falls with the number of data points.
➢ rises when Xk and the other regressors are highly correlated.
• Implications for adding another variable: As you add another vari-
2
able to the data set, Rk.X−k will almost always rise ⇒ estimating a kitchen
sink regression raises the standard errors and reduces the efficiency of the
estimates.
• Implications for the bias in the standard errors when you have an
omitted variable: omitted relevant variable ⇒ s2 is biased upwards.

➢ Direction of bias for standard errors is unclear, however.


➢ While s2 in the numerator increases, the Rk.X−k
2
will be smaller than it
would have been if we had estimated the full model (just know std. errs.
are likely to be wrong).

• Implications for the standard errors when the variables are collinear:
2
Rk.X−k will be high, so that the variance and standard errors will be high and
thus the t-statistics will be low (but check F -test). If perfectly collinear,
2
Rk.X−k = 1.

7.6.1 How to tell if multicollinearity is likely to be a problem


1. Check the bivariate correlations between variables using the R command cor.
2. If overall R2 is high, but few if any of the t-statistics surpass the critical value
⇒ high collinearity.
3. Look at the Variance Inflation Factors (VIFs). The VIF for each variable is equal to:

VIFk = 1/(1 − R²k.X−k)
Rules of thumb: evidence of multicollinearity if the largest VIF is greater
than 10 or the mean of all the VIFs is considerably larger than one. VIFs
can be computed in R using Fox’s car library.
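• VIFs can also be computed by hand from the auxiliary regressions, which makes the definition transparent. A sketch with simulated data in which x1 and x2 are made highly collinear (the car library's VIF routine should give the same numbers for a model like this one):

set.seed(17)
n <- 100
x1 <- rnorm(n)
x2 <- 0.9*x1 + rnorm(n, sd = 0.3)            # highly collinear with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

vifs <- c(
  x1 = 1/(1 - summary(lm(x1 ~ x2 + x3))$r.squared),
  x2 = 1/(1 - summary(lm(x2 ~ x1 + x3))$r.squared),
  x3 = 1/(1 - summary(lm(x3 ~ x1 + x2))$r.squared)
)
vifs                                         # x1 and x2 should have large VIFs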

7.6.2 What to do if multicollinearity is a problem


1. Nothing. Report the F -tests as an indication that your variables are jointly
significant and the bivariate correlations or VIFs as evidence that multi-
collinearity is likely to explain the lack of individual significance.
2. Get more data.
3. Amalgamate variables that are very highly collinear into an index of the
under-lying phenomenon. The only problem with this approach is that you
lose the clarity of interpretation that comes with entering variables individu-
ally.
Section 8

Regression Diagnostics
• First commandment of multivariate analysis: “First, know your data.”

8.1 Before you estimate the regression


• It is often useful to plot variables in model against each other to get a sense of
bivariate correlations (other useful plots: histograms, nonparametric kernel
density plots, lowess curves in plots).
• R’s graphical capabilities are tremendous, providing numerous canned rou-
tines for displaying data. Fox’s car library augments what is in the standard
R package. E.g., see Figure 8.1, which was produced by the pairs command.

8.2 Outliers, leverage points, and influence points


• An outlier is any point that is far from the fitted values. In other words, it
has a large residual. Why might we care about checking particular outliers?
Two main reasons, although analysts also don’t enjoy large outliers because
they reduce the fit of the model:
➢ Outliers may signal an error in transcribing data.
➢ A pattern in the outliers may imply an incorrect specification—a missing
variable or a likely quadratic relationship in the true model.

8.2.1 How to look at outliers in a regression model


• Plot the residuals against the fitted values ŷ. If the model is properly speci-
fied, no patterns should be visible in the residuals. E.g., see Fig. 8.2.
• You might also want to graph the residuals against just one of the original
explanatory variables.

[Figure 8.1: Data on Presidential Approval, Unemployment, and Inflation. A scatterplot matrix (pairs plot) of the three variables, labeled var 1, var 2, and var 3.]

[Figure 8.2: A plot of fitted values versus residuals (fitted values on the horizontal axis, residuals on the vertical axis).]

• It would be nice if we had a method that enabled us to identify influential


pairs or subsets of observations. This can be done with an “added variable
plot,” which reduces a higher-dimension regression problem to a series of
two-dimensional plots.
• An added variable plot is produced for a given regressor xk by regressing
both the dependent variable and xk on all of the other K − 1 regressors. The
residuals from these regressions are then plotted against each other.
• Fig. 8.3 shows an example of some added variable plots from Fox, using the
car library. This was produced by the commands:
library(car)

Duncan <- read.table(’/usr/lib/R/library/car/data/Duncan.txt’,header=T)

attach(Duncan)

mod.duncan <- lm(prestige~income+education)

av.plots(mod.duncan,labels=row.names(Duncan),ask=F)

8.3 How to look at leverage points and influence points


• A leverage point is any point that is far from the mean value of X = X̄. A
point with high leverage is capable of exerting influence over the estimated
regression coefficients (the slope of the regression line).
• An influence point is a data point that actually does change the estimated
regression coefficient, such that if that point were excluded, the estimated
coefficient would change substantially.
• Why do points far from the mean of X = X̄ have a bigger impact on the fit of the line? Recall that if a constant is included, ȳ = x̄′β̂, so that X̄ becomes the fulcrum around which the line can tilt. See Fig. 8.4.
• We could also have worked out which were the large residuals in the estimated
regression using the lm command:

Figure 8.3: Added variable plots

[Three added-variable plots from the Duncan regression: prestige | others plotted against (Intercept) | others, income | others, and education | others.]

Figure 8.4: An influential outlier


[Scatterplots of Y against X illustrating how an influential outlier can tilt the fitted line.]

xyreg <- lm(Y~X)

xyresid <- resid(xyreg)

and then sorting on errors to get the data points organized from highest to
lowest residual. This would also allow us to tell the X values for which the
residuals were largest.
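• One way to carry out the sorting step, continuing with xyreg and xyresid from the commands above (a sketch):

sorted <- order(abs(xyresid), decreasing=TRUE) # largest absolute residuals first
head(cbind(X=X[sorted], residual=xyresid[sorted]))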

8.4 Leverage
• We should draw a distinction between outliers and high leverage observations.
• The regressor values that are most informative (i.e., have the most influence
on the coefficient estimates) and lead to relatively large reductions in the vari-
ance of coefficient estimates are those that are far removed from the majority
of values of the explanatory variables.
• Intuition: regressor values that differ substantially from average values con-
tribute a great deal to isolating the effects of changes in the regressors.
• Most common measure of influence comes from the “hat matrix”:

H = X(X0 X)−1 X0

• For any vector y, Hy is the set of fitted values (i.e., ŷ) in the least squares
regression of y on X. In matrix language, H projects any n × 1 vector into
the column space of X.
• For our purposes, Hy = Xβ̂ = ŷ. Recall also that M = (I − X(X0 X)−1 X0 ) =
(I − H)
• The least squares residuals are:

e = My = Mε = (I − H)ε

• Thus, the variance-covariance matrix for the least squares residual vector is:

E[ee0 ] = ME[εε0 ]M0 = σ 2 MM0 = σ 2 M = σ 2 (I − H)



• The diagonal elements of this matrix are the variance of each individual esti-
mated residual. Since these diagonal elements are not generally the same, the
residuals are heteroskedastic. Since the off-diagonal elements are not gener-
ally equal to zero, we can also say that the individual residuals are correlated
(they have non-zero covariance). This is the case even when the true errors
are homoskedastic and independent, as we assume under the Gauss-Markov
assumptions.
• For an individual residual, ei , the variance is equal to var(ei ) = σ 2 (1 − hi ),
where hi is the ith diagonal element of the hat matrix, H. This is considered
a measure of leverage, since hi increases with the distance of xi from the mean of the regressors.
• Average hat values are given by h̄ = (k + 1)/n where k is the number of
coefficients in the model (including the constant). 2h̄ and 3h̄ are typically
treated as thresholds that hat values must exceed to be noteworthy.
• Thus, observations with the greatest leverage have corresponding residuals
with the smallest variance. Observations with high leverage tend to pull the
regression line toward them, decreasing the size and variance of their residuals.
• Note that high leverage observations are not necessarily bad: they contribute
a great deal of info about the coefficient estimates (assuming they are not
coding mistakes). They might indicate misspecification (different models for
different parts of the data).
• Fig. 8.5 contains a plot of hat values. This was produced using the R com-
mands:

plot(hatvalues(xyreg))

abline(h=c(2,3)*2/N,lty=2)
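(In this snippet N is assumed to be the number of observations, so 2/N corresponds to h̄ = (k + 1)/n for a regression with a constant and one slope; the dashed lines mark the 2h̄ and 3h̄ cutoffs.)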

Figure 8.5: Hat values


[Index plot of hatvalues(xyreg), with the threshold lines added by abline().]

8.4.1 Standardized and studentized residuals


• To identify which residuals are significantly large, we first standardize them
by dividing by the appropriate standard error (given by the square root of
each diagonal element of the variance-covariance matrix):
êi = ei /[s2 (1 − hi )]1/2
This gives the standardized residuals.
• Since each residual has been standardized by its own standard error, we can
now compare the standardized residuals against a critical value (e.g., ±1.96)
to see if it is truly “large”.
• A related measure is the “studentized residuals.” Standardized residuals use
s2 as the estimate for σ 2 . Studentized residuals are calculated in exactly the
same way, except that for the estimate of σ 2 we use s2i calculated using all
the data points except i.
• Let β̂(i) = [X(i)0 X(i)]−1 X(i)0 y be the OLS estimator obtained by omitting
the ith observation. The variance of the residual yi − x0i β̂(i) is given by
σ(i)2 {1 + x0i [X(i)0 X(i)]−1 xi }.
• Then
e∗i = (yi − x0i β̂(i))/(σ̂(i){1 + x0i [X(i)0 X(i)]−1 xi }1/2 )
follows a t-distribution with N − K − 1 df (assuming normal errors). Note
that
σ̂(i)2 = [y − Xβ̂(i)]0 [y − Xβ̂(i)]/(N − K − 1)
• Studentized residuals can be interpreted as the t-statistic on a dummy vari-
able equal to one for the observation in question and zero everywhere else.
Such a dummy variable will effectively absorb the observation and so remove
its influence in determining the other coefficients in the model.
• Studentized residuals that are greater than 2 in absolute value are regarded
as outliers.
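• In R, these quantities are available directly for lm objects; a short sketch using the earlier xyreg fit:

std.res <- rstandard(xyreg)                    # standardized residuals
stud.res <- rstudent(xyreg)                    # studentized residuals
which(abs(stud.res) > 2)                       # candidate outliers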

• Studentized residuals are generally preferred to standardized residuals for the
detection of outliers since they give us an idea of which residuals are large
while purging the estimate of the error variance of the influence of that one data point.
• Related measures that combine the index of leverage taken from the diagonal
elements of the hat matrix and the studentized residuals include DFITS,
Cook’s Distance, and Welsch Distance.
• This studentized residual still does not, however, tell us how much the inclu-
sion of a single data point has affected the line. For that, we need a measure
of influence.

8.4.2 DFBETAS
• A measure that is often used to tell how much each data point affects the
coefficient is DFBETAS, due to Belsley, Kuh, and Welsch (1980).
• DFBETAS focuses on the difference between coefficient values when the ith
observation is included and excluded, scaling the difference by the estimated
standard error of the coefficient:
DFBETASki = (β̂k − β̂k (i))/(σ̂(i)√akk )
where akk is the kth diagonal element of (X0 X)−1 .
• Belsley et al. suggest that observations with
|DFBETASki | > 2/√N
deserve special attention (the cutoff accounts for the fact that single observations
have less of an effect as the sample size grows). It is also common practice simply
to use an absolute value of one, meaning that the observation shifted the
estimated coefficient by at least one standard error.

• To access the DFBETAS in R (e.g., for plotting) do something like

dfbs.xyreg <- dfbetas(xyreg)

where xyreg contains the output from the lm command. To plot them for
the slope coefficient for our fake data, do

plot(dfbs.xyreg[,2],ylab="DFBETAS for X")

See Fig. 8.6 for the resulting plot.



Figure 8.6: Plot of DFBETAS


[Index plot of the DFBETAS for X.]
Section 9

Presentation of Results, Prediction, and Forecasting


9.1 Presentation and interpretation of regression coefficients
• Some rules of thumb for presenting results of regression analysis:
➢ Provide a table of descriptive statistics (space permitting).
➢ A table of OLS regression results should include coefficient estimates and
standard errors (or t statistics), measures of model fit (adjusted R2 , F -
tests, standard error of the regression), information about the sample
(size), and estimation approach.
➢ Make your tables readable. Do not simply cut and paste regression out-
put. No one needs to see coefficients/std. errs reported to more than
the second or third decimal place. Refrain from using scientific notation
(take logs, divide by some factor if necessary). “Big” models make for
unattractive tables.
➢ Discuss substantive significance, providing ancillary tables or plots to
help with inferences about the magnitude of effects.
• Substantive significance is easier to assess in linear models than in nonlinear
models, although walking readers through some simulations is often helpful.

➢ For example, what is the expected change in our dependent variable for
a given change in the explanatory variable, where the given change is
substantively interesting or intuitive. This could include, for instance,
indicating how much the dependent variable changes for a one standard
deviation movement up or down in the explanatory variable.
➢ Relate this back to descriptive statistics.


9.1.1 Prediction
• We know from β̂ the expected effect on y of a given change in xk .
But what is the predicted value of y, or ŷ, for a given vector of values of x?
• In-Sample Prediction tells us how the level of y would change for a change
in the Xs for a particular observation in our sample.
• For example, at a given vector of values of the X variables, x0 , ŷ is equal to:

ŷ = x0 β̂

This is the conditional mean of y, conditional on the explanatory variables


and the estimated parameters of the model.
• Let us say that we have a growth model estimated on 100 countries including
Namibia and Singapore. An example of in-sample prediction would be to
predict how Namibia’s expected (and predicted) growth rate would change if
Namibia had Singapore’s level of education.
• When we do in-sample prediction we don’t have to be concerned about “fun-
damental uncertainty” (arising from the error term which alters the value of
y) because we are assuming that the true random error attached to Namibia
would stay the same and we only change the average level of education.
• We also do not have to be concerned with what is called “estimation un-
certainty” arising from the fact that we estimated β̂ because we are not
generalizing out of the sample.
• Out-of-sample prediction or forecasting gives the prediction for the de-
pendent variable for observations that are not included in the sample. When
the prediction refers to the future, it is known as a forecast.
• For given values of the explanatory variables, x0 , the predicted value is still,
ŷ = x0 β̂. To get an idea of how precise this forecast is, we need a measure of
the variance of y0 , the actual value of y at x0 , around ŷ.

y0 = x0 β + ε0

• Therefore the forecast error is equal to:

y0 − ŷ0 = e0 = x0 β + ε0 − x0 β̂ = x0 (β − β̂) + ε0

• The expected value of this is equal to zero. Why?


• The variance of the forecast error is equal to:

var[e0 ] = E[e0 e00 ] = E[(x0 (β − β̂) + ε0 )(x0 (β − β̂) + ε0 )0 ]
= E[(x0 (β − β̂) + ε0 )((β − β̂)0 x00 + ε00 )]
= E[x0 (β − β̂)(β − β̂)0 x00 + ε0 ε00 + ε0 (β − β̂)0 x00 + ε00 x0 (β − β̂)]
= x0 E[(β − β̂)(β − β̂)0 ]x00 + E[ε0 ε00 ]
= x0 [var(β̂)]x00 + σ 2
= x0 [σ 2 (X0 X)−1 ]x00 + σ 2

This is the variance of the forecast error, and thus the variance of the actual
value of y around the predicted value.
• If the regression contains a constant term, then an equivalent expression
(derived using the result for partitioned matrices) is:
var[e0 ] = σ 2 [1 + 1/n + Σj=1,...,K−1 Σk=1,...,K−1 (x0j − x̄j )(x0k − x̄k )(Z0 M0 Z)jk ]

where Z is the K − 1 columns of X that do not include the constant and


(Z0 M0 Z)jk is the jkth element of the inverse of the (Z0 M0 Z) matrix.
• Implications for the variance of the forecast:
➢ will go up with the true variance of the errors, σ 2 , and this effect is still
present no matter how many data points we have.
➢ will go up as the distance of x0 from the mean of the data rises. Why is
this? As x0 moves farther from the mean, any difference between β and β̂ is
multiplied by a larger number. Meanwhile, the true regression line and the
predicted regression line both go through the mean of the data.

➢ will fall with the number of data points because the inverse will be smaller
and smaller (intuition—as you get more data points, you will estimate β̂
more precisely).
• The variance above can be estimated using:

var[e0 ] = x0 [s2 (X0 X)−1 ]x00 + s2

The square root of this is the standard error of the forecast.


• A confidence interval can then be formed using: ŷ ± tα/2 se(e0 ).
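• A sketch of how such forecast intervals can be obtained in R with predict(); the model, data frame, and new values below are hypothetical:

mod <- lm(y ~ x1 + x2, data=mydata)            # hypothetical model and data
newobs <- data.frame(x1=2.5, x2=0.7)           # hypothetical x0
# interval="prediction" uses the forecast-error variance, i.e., it adds s^2
# to the variance of the fitted value, as in the formula above
predict(mod, newdata=newobs, interval="prediction", level=0.95)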
• The Clarify software by King et al. (available for Stata) can be used to help
with presenting predictions as well as other kinds of inferences (see “Making
the Most of Statistical Analysis: Improving Interpretation and Presentation”
by Gary King, Michael Tomz, and Jason Wittenberg, American Journal of
Political Science, 44:4, April 2000, pp. 341–355; this article and the software
described in the article can be found at
http://gking.harvard.edu/clarify/docs/clarify.html).

9.2 Encompassing and non-encompassing tests


• All of the hypothesis tests we have examined so far have been what are known
as “encompassing tests.” In other words, we have always stated the test as
a linear restriction on a given model and have tested H0 : Rβ = q versus
H1 : Rβ ≠ q.
• For example, testing Model A below against Model B is a question of running
an F -test on the hypothesis that β3 = β4 = 0:
Model A: Yi = β0 + β1 X1i + β2 X2i + β3 X3i + β4 X4i + εi
Model B: Yi = β0 + β1 X1i + β2 X2i + εi
• This approach can be restrictive. In particular, it does not allow us to test
which of two possible sets of regressors is more appropriate. In many cases, we
are interested in testing a non-nested hypothesis of the form H0 : y = Xβ+ε0
versus H1 : y = Zδ + ε1 .

• For example, how do we choose between Model C and Model D?


Model C: Yi = α0 + α1 X1i + α2 X2i + ε0i
Model D: Yi = β0 + β1 Z1i + β2 Z2i + ε1i
• The problem in conducting tests of this sort is to transform the two hy-
potheses above so that we get a “maintained” model (i.e., reflecting the null
hypothesis, H0 ) that encompasses the alternative. Once we have an encom-
passing model, we can base our test on restrictions to this model.
• Note that we could combine the two models above as Model F:
Model F: Yi = λ0 + λ1 X1i + λ2 X2i + λ3 Z1i + λ4 Z2i + ui
• We could then proceed to run an F -test on λ3 = λ4 = 0, as a test of C against
D, or an F -test of λ1 = λ2 = 0, as a test of D against C.
• Problem: the results of the model comparisons may depend on which model
we treat as the reference model and the standard errors of all the coefficients
(and thus the precision of the F -tests) will be affected by the inclusion of
these additional variables. This approach also assumes the same error term
for both models.
• A way out of this problem, known as the J-test, is proposed by Davidson
and MacKinnon (1981) and uses a non-nested F -test procedure. The test
proceeds as follows:
1. Estimate Model D and obtain the predicted y values, ŶiD .
2. Add these predicted values as an additional regressor to Model C and
run the following model. This model is an example of the encompassing
principle.
Yi = α0 + α1 X1i + α2 X2i + α3 ŶiD + ui
3. Use a t-test to test the hypothesis that α3 = 0. If this null is not rejected,
it means that the predicted values from Model D add no significant in-
formation to the model that can be used to predict Yi , so that Model C
is the true model and encompasses Model D. Here, “encompasses” means
that it encompasses the information that is in D. If the null hypothesis
is rejected, then Model C cannot be the true model.

4. Reverse the order of estimations. Estimate Model C first, obtain the


predicted values, ŶiC , and use these as an additional explanatory variable
in Model D. Run a t-test on the null hypothesis that β3 , the coefficient
on ŶiC , is equal to zero. If you do not reject the null, then Model D is
the true model.
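• A minimal R sketch of these J-test steps; the data frame dat and the variable names are hypothetical:

modC <- lm(Y ~ X1 + X2, data=dat)
modD <- lm(Y ~ Z1 + Z2, data=dat)

yhatD <- fitted(modD)                          # steps 1-3: t-test on yhatD's coefficient
summary(lm(Y ~ X1 + X2 + yhatD, data=dat))

yhatC <- fitted(modC)                          # step 4: reverse the roles of the models
summary(lm(Y ~ Z1 + Z2 + yhatC, data=dat))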
• One potential problem with this test is that you can find yourself rejecting
both C and D or neither. This is a small sample problem, however.
• Another popular test is the Cox test (see Greene, 5th ed., pp. 155–159, for
details).
Section 10

Maximum Likelihood Estimation


10.1 What is a likelihood (and why might I need it)?
• OLS is simply a method for fitting a regression line that, under the Gauss-
Markov assumptions, has attractive statistical properties (unbiasedness, effi-
ciency). However, it has no particular statistical justification.
• Maximum likelihood (ML) estimation has a theoretical justification and, if we
are prepared to make assumptions about the population model that generates
our data, it has “attractive statistical properties” (consistency, efficiency)
whether or not the Gauss-Markov assumptions hold.
• In addition, ML estimation is one method for estimating regression models
that are not linear.
• Also, if an unbiased minimum variance estimator (something BLUE) exists,
then ML estimation chooses it and ML estimation will achieve the Cramer-
Rao lower bound (more explanation on this jargon later).
• Where does the notion of a likelihood come from? It’s a partial answer to
the statistician’s search for “inverse probability.”
• If we know (or can assume) the underlying model in the population, we can
estimate the probability of seeing any particular sample of data. Thus, we
could estimate P (data|model) where the model consists of parameters and a
data generating function.
• This is the notion of conditional probability that underlies hypothesis testing,
“How likely is it that we observe the data y, if the null is true?” In most
cases, however, statisticians know the data; what they want to find out about
is the parameters that generated the data.
• Often, statisticians will feel comfortable assigning a particular type of distri-
bution to the DGP (e.g., the errors are distributed normally). This is the
same as saying that we are confident about the “functional form.”


• What is unknown are the parameters of the DGP. For a single, normally
distributed random variable, those unknown parameters would be the mean
and the variance. For the dependent variable in a regression model, y, those
parameters would include the regression coefficients, β, and the variance of
the errors, σ 2 .
• What we want is a measure of “inverse probability,” a means to estimate
P (parameters|data) where the parameters are what we don’t know about the
model. In this inverse probability statement, the data are taken as given
and the probability is a measure of absolute uncertainty over various sets of
coefficients.
• We cannot derive this measure of absolute uncertainty, but we can get close
to it, using the notion of likelihood, which employs Bayes Theorem.
• Bayes Theorem: P (A|B) = P (B|A)P (A)/P (B) = P (A ∩ B)/P (B)

• Let’s paraphrase this for our case of having data and wanting to find out
about coefficients:
P (parameters|data) = P (data|parameters) P (parameters)/P (data)
• For the purposes of estimation, we treat the marginal probability of seeing
the data, P (data), as additional information that we use only to scale our
beliefs about the parameters. Thus, we can say more simply:
P (parameters|data) ∝ P (data|parameters) P (parameters) (10.1)
where ∝ means “is proportional to”.
• The conditional probability of the parameters given the data, or
P (parameters | data), is also known as the likelihood function (e.g., L(µ, σ 2 |x),
where µ refers to the mean and σ 2 refers to the variance).
• The likelihood function may be read as the likelihood of any value of the
parameters, µ and σ 2 in the univariate case, given the sample of data that
we observe. As our estimates of the parameters of interest, we might want
to use the values of the parameters that were most likely given the data at
hand.

• We can maximize the right-hand side of Eq. 10.1, P (data|parameters), with


respect to µ and σ 2 to find the values of the parameters that maximize the
likelihood of getting the particular data sample.
• We can’t calculate the absolute probability of any particular value of the
parameters, but we can tell something about the values of the parameters
that make the observed sample most likely.
• Given the proportional relationship, the values of the parameters that maxi-
mize the likelihood of getting the particular data sample are also the values
of the parameters that are most likely to obtain given the sample.
• Thus, we can maximize P (data|parameters) with respect to the parameters
to find the maximum likelihood estimates of the parameters.
• We treat P (parameters) as a prior assumption that we make independent of
the sample and often adopt what is called a “flat prior” so that we place no
particular weight on any value of the parameters.
• To estimate the parameters of the causal model, we start with the statistical
function that is thought to have generated the data (the “data generating
function”). We then use this model to derive the expression for the probability
of observing the sample, and we maximize this expression with respect to the
parameters to obtain the maximum likelihood estimates.

10.2 An example of estimating the mean and the variance


• The general form of a normal distribution with mean µ and variance σ 2 is:
f (x|µ, σ 2 ) = (2πσ 2 )−1/2 exp[−(x − µ)2 /(2σ 2 )]
This function yields a particular value of x from a normal distribution with
mean µ and variance σ 2 .
• If we had many, independent random variables, all generated by the same un-
derlying probability density function, then the function generating the sample
as a whole would be given by:
f (x1 , . . . , xn |µ, σ 2 ) = Πi=1,...,n (2πσ 2 )−1/2 exp[−(xi − µ)2 /(2σ 2 )]

Note that there is an implicit iid assumption, which enables us to write the
joint likelihood as the product of the marginals.
• The method of finding the most likely values of µ and σ 2 , given the sample
data, is to maximize the likelihood function. The expression above is the
likelihood function since it is the area under the curve generated by the
statistical function that gives you a probability and since this is proportional
to P (parameters|data).
• An easier way to perform the maximization, computationally, is to maximize
the log likelihood function, which is just the natural log of the likelihood
function. Since the log is a monotonic function, the values that maximize L
are the same as the values that maximize ln L.
ln L(µ, σ 2 ) = −(n/2) ln(2π) − (n/2) ln σ 2 − (1/2) Σi=1,...,n (xi − µ)2 /σ 2

• To find the values of µ and σ 2 , we take the derivatives of the log likelihood
with respect to the parameters and set them equal to zero.
∂ ln L/∂µ = (1/σ 2 ) Σi (xi − µ) = 0
∂ ln L/∂σ 2 = −n/(2σ 2 ) + (1/(2σ 4 )) Σi (xi − µ)2 = 0

• To solve the likelihood equations, multiply both sides by σ 2 in the first equa-
tion and solve for µ̂, the estimate of µ. Next insert this in the second equation
and solve for σ 2 . The solutions are:
µ̂ML = (1/n) Σi xi = x̄n
σ̂ 2ML = (1/n) Σi (xi − x̄n )2

• You should recognize these as the sample mean and variance, so that we
have some evidence that ML estimation is reliable. Note, however, that the
denominator of the calculated variance is n rather than (n−1) ⇒ ML estimate
of the variance is biased (although it is consistent).
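• A short sketch that checks these solutions numerically in R, maximizing the normal log likelihood with optim() on simulated data and comparing the result with the analytical ML estimates:

set.seed(1)
x <- rnorm(200, mean=5, sd=2)

# negative log likelihood; sigma^2 enters on the log scale so it stays positive
negloglik <- function(theta) {
  mu <- theta[1]; sigma2 <- exp(theta[2])
  0.5*length(x)*log(2*pi*sigma2) + 0.5*sum((x - mu)^2)/sigma2
}

fit <- optim(c(mean(x), log(var(x))), negloglik)
c(fit$par[1], exp(fit$par[2]))                 # numerical ML estimates of mu and sigma^2
c(mean(x), sum((x - mean(x))^2)/length(x))     # analytical ML solutions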

10.3 Are ML and OLS equivalent?


• We would be particularly perturbed if OLS and ML gave us different estimates
of the regression coefficients, β̂.
• Since the error terms under the Gauss-Markov assumptions are assumed to
be distributed normal, we can set up the regression model as an ML problem,
just as we did in the previous example, and check the estimated parameters,
β̂ and σ̂ 2 .
• The regression model is:
yi = β0 + β1 xi + ui , where ui ∼ iid N (0, σ 2 ).

• This implies that the yi are independently and normally distributed with
respective means β0 + β1 xi and a common variance σ 2 . The joint density of
the observations is therefore:
f (y1 , . . . , yn |β, σ 2 ) = Πi=1,...,n (2πσ 2 )−1/2 exp[−(yi − β0 − β1 xi )2 /(2σ 2 )]

and the log likelihood is equal to:


ln L(β, σ 2 ) = −(n/2) ln(2π) − (n/2) ln σ 2 − (1/2) Σi (yi − β0 − β1 xi )2 /σ 2

• We will maximize the log likelihood function with respect to β0 and β1 and
then with respect to σ 2 .
• Note that only the last term in the log likelihood function involves β0 and
β1 and that maximizing this is the same as minimizing the sum of squared
errors, since there is a negative sign in front of the whole term.
• Thus, the ML estimators of β0 and β1 , equal to β̂ M L are the same as the least
squares estimates that we have been dealing with throughout.
• Substituting β̂ ML into the log likelihood and setting Q̂ = Σi (yi − β̂0 − β̂1 xi )2 ,
which is the SSE, we get the log likelihood function in terms of σ 2 only:

ln L(σ 2 ) = −(n/2) ln(2π) − (n/2) ln σ 2 − Q̂/(2σ 2 ) = a constant − (n/2) ln σ 2 − Q̂/(2σ 2 )
• Differentiating this with respect to σ and setting the derivative equal to zero
we get:
∂ ln L/∂σ = −n/σ̂ + Q̂/σ̂ 3 = 0
This gives the ML estimator for σ 2 : σ̂ 2 = Q̂/n = SSE/n.

• This estimator is different from the unbiased estimator that we have been
using, σ̂ 2OLS = Q̂/(n − k) = SSE/(n − k).

• For large N , however, the two estimates will be very close. Thus, we can say
that the ML estimate is consistent for σ 2 , and we will define the meaning of
consistency explicitly in the future.
• We can use these results to show the value of the log likelihood function at
its maximum, a quantity that can sometimes be useful for computing test
statistics. We will continue to use the short-hand that Q̂ = SSE and will
also substitute into the equation for the log likelihood, ln L, that σ̂ 2ML = Q̂/n.
Substituting this into the equation for the log likelihood above, we get:
ln L(β, σ 2 ) = −(n/2) ln(2π) − (n/2) ln(Q̂/n) − Q̂/(2(Q̂/n))
= −(n/2) ln(2π) − (n/2) ln Q̂ + (n/2) ln(n) − n/2

• Putting all the terms that do not rely on Q̂ together, since they are constant
and do not rely on the values of the estimated regression coefficients, β̂ M L ,
we can say that the maximum value of the log likelihood is:
max ln L = a constant − (n/2) ln Q̂ = a constant − (n/2) ln(SSE)
• Taking the anti-log, we can say that the maximum value of the likelihood
function is:
max L = a constant · (SSE)−n/2

• This shows that, for a given sample size, n, the maximum value of the like-
lihood and log likelihood functions will rise as SSE falls. This can be handy
to know, as it implies that measures of “goodness of fit” can be based on the
value of the likelihood or log likelihood function at its maximum.
• In conclusion, we can say that ML is equivalent to OLS for the classical
linear regression model. The real power of ML, however, is that we can
use it in many cases where we do not assume that the errors are normally
distributed or that the model is equal to yi = β0 +β1 xi +ui , so that the model
is far more general.

• For instance, models with binary dependent variables are estimated via ML,
using the logistic distribution for the data if a logit model is chosen and a
normal if the probit is used. Event count models are estimated via ML,
assuming that the original data is distributed Poisson. Thus, ML is the oper-
ative method for estimating a number of models that are frequently employed
in political science.

10.4 Inference and hypothesis testing with ML


• The tests that one can perform, and the statistics we can compute, on the
ML regression coefficients are analogous to the tests and statistics we can
form under OLS. The difference is that they are “large sample” tests.

10.4.1 The likelihood ratio test

• The likelihood ratio (LR) test is analogous to an F -test calculated using the
residuals from the restricted and un-restricted models. Let θ be the set of
parameters in the model and let L(θ) be the likelihood function.
• Hypotheses such as θ1 = 0 impose restrictions on the set of parameters θ.
What the LR test says is that we first obtain the maximum of L(θ) without
any restrictions and we then calculate the likelihood with the restrictions
imposed by the hypothesis to be tested.
• We then consider the ratio
λ = [max L(θ) under the restrictions]/[max L(θ) without the restrictions]
• λ will in general be less than one since the restricted maximum will be less
than the unrestricted maximum.
• If the restrictions are not valid, then λ will be significantly less than one. If
they are exactly correct, then λ will equal one. The LR test consists of using
−2 ln λ as a test statistic ∼ χ2k , where k is the number of restrictions. If this
amount is larger than the critical value of the chi-square distribution, then
we reject the null.
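• A sketch of the LR test for a linear model in R using logLik(); the models and data frame are hypothetical, with two restrictions imposed:

unrestricted <- lm(Y ~ X1 + X2 + X3 + X4, data=dat)
restricted <- lm(Y ~ X1 + X2, data=dat)

LR <- as.numeric(2*(logLik(unrestricted) - logLik(restricted)))  # -2 ln(lambda)
pchisq(LR, df=2, lower.tail=FALSE)             # df = number of restrictions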

• Another frequently used test statistic that is used with ML estimates is the
Wald test, which is analogous to the t-statistic:

W = (β̂ ML − β̃)/se(β̂ ML )
where β̃ is the hypothesized value of β̂ ML and se(β̂ ML ) is the standard
error of β̂ ML , calculated using σ̂ 2ML (X0 X)−1 .

• The only difference from a regular t-test is that this is a large sample test.
Since in large samples, the t-distribution becomes equivalent to the standard
normal, the Wald statistic is distributed standard normal.

10.5 The precision of the ML estimates


• If the likelihood is very curved at the maximum, then the ML estimate of
any parameter contains more information about what the true parameters
are likely to be.
• If the likelihood is quite flat at the maximum, then the maximum is not
telling you much about the likelihood as a whole and the maximum is not a
great summary of the likelihood. Hence, a measure of the likelihood function’s
curvature is also a measure of the precision of the ML estimate.
• What is a reasonable measure of the likelihood function’s curvature? The
second derivative of any function tells us how much the slope (or gradient)
of the function changes as we move along the x axis.
• At the maximum, the slope is changing from positive, through zero, to neg-
ative. The larger the second derivative, however, the more curved the likeli-
hood function is at the maximum and the more information is given by the
ML estimates.

• Information about curvature and precision can therefore be captured by the


second derivatives of the likelihood function. Indeed, the information ma-
trix of the likelihood function is used for this purpose and is given by:
I(θ̂|y) = −E[∂ 2 ln L(θ̃|y)/∂ θ̃∂ θ̃0 ]θ̃=θ̂

where θ̂ are the ML estimates (e.g., this would include β̂ M L and σ̂ 2 for the
linear regression model). Thus, the information matrix is the negative of the
expectation of the second derivatives of the likelihood function estimated at
the point where θ̃ = θ̂.
• If θ is a single parameter, then the larger is I(θ̂|y) then the more curved the
likelihood (or log likelihood) and the more precise is θ̂. If θ is a vector of
parameters (the normal case) then I(θ̂|y) is a K × K matrix with diagonal
elements containing the information on each corresponding element of θ̂.
• Because of the expectations operator, the information matrix must be esti-
mated. The most intuitive is:
I(θ̂|y) = −[∂ 2 ln L(θ̃|y)/∂ θ̃∂ θ̃0 ]θ̃=θ̂

In other words, one just uses the data sample at hand to estimate the second
derivatives.
• The information matrix is closely related to the asymptotic variance of the
ML estimates, θ̂. It can be shown that the asymptotic variance of the ML
estimates, across an infinite number of hypothetically repeated samples, is:
V (θ̂) ≈ limn→∞ [I(θ̂|y)/n]−1 = limn→∞ {(1/n) E[−∂ 2 ln L(θ̃|y)/∂ θ̃∂ θ̃0 ]θ̃=θ̂ }−1

• Intuition: the variance is inversely related to the curvature. The greater the
curvature, the more that the log likelihood resembles a spike around the ML
estimates θ̂ and the lower the variance in those estimates will be.

• In a given sample of data, one can estimate V(θ̂) on the basis of the estimated
information:
V (θ̂) ≈ [(1/n)(−∂ 2 ln L(θ̃|y)/∂ θ̃∂ θ̃0 )θ̃=θ̂ ]−1

• This expression is also known as the the Cramer-Rao lower bound on the
variance of the estimates. The Cramer-Rao lower bound is the lowest value
that the asymptotic variance on any estimator can take.
• Thus, the ML estimates have the very attractive property that they are
asymptotically efficient. In other words, in large samples, they achieve the
lowest possible variance (thus highest efficiency) of any potential estimate of
the true parameters.
• For the linear regression case, the estimate of V (θ̂) is σ̂ 2ML (X0 X)−1 . This is
the same as the variance-covariance matrix of the β̂ OLS except that we use
the ML estimate of σ 2 to estimate the true variance of the error term.

Part II

Violations of Gauss-Markov
Assumptions in the Classical Linear
Regression Model
Section 11

Large Sample Results and Asymptotics


11.1 What are large sample results and why do we care
about them?
• Large sample results for any estimator, θ̂, are the properties that we can say
hold true as the number of data points, n, used to estimate θ̂ becomes “large.”
• Why do we care about these large sample results? We have an OLS model
that, when the Gauss-Markov assumptions hold, has desirable properties.
Why would we ever want to rely on the more difficult mathematical proofs
that involve the limits of estimators as n becomes large?
• Recall the Gauss-Markov assumptions:

1. The true model is a linear functional form of the data: y = Xβ + ε.


2. E[ε|X] = 0
3. E[εε0 |X] = σ 2 I
4. X is n × k with rank k (i.e., full column rank)
5. ε|X ∼ N [0, σ 2 I]

• Recall that if we are prepared to make a further, simplifying assumption, that


X is fixed in repeated samples, then the expectations conditional on X can
be written in unconditional form.
• There are two main reasons for the use of large sample results, both of which
have to do with violations of these assumptions.
1. Errors are not distributed normally
➢ If assumption 5, above, does not hold, then we cannot use the small
sample results.
➢ We established that β̂ OLS is unbiased and “best linear unbiased es-
timator” without any recourse to the normality assumption and we
also established that the variance of β̂ OLS = σ 2 (X0 X)−1 .


➢ Used normality to show that β̂ OLS ∼ N (β, σ 2 (X0 X)−1 ) and that (n −
k)s2 /σ 2 ∼ χ2n−k where the latter involved showing that (n − k)s2 /σ 2
can also be expressed as a quadratic form of ε/σ which is distributed
standard normal.
➢ Both of these results were used to show that we could calculate test
statistics that were distributed as t and F . It is these results on
test statistics, and our ability to perform hypothesis tests, that are
invalidated if we cannot assume that the true errors are normally
distributed.
2. Non-linear functional forms
➢ We may be interested in estimating non-linear functions of the original
model.
➢ E.g., suppose that you have an unbiased estimator, π ∗ of

π = 1/(1 − β)

but you want to estimate β. You cannot simply use (1 − 1/π ∗ ) as an


unbiased estimate β ∗ of β.
➢ To do so, you would have to be able to prove that E(1−1/π ∗ ) = β. It
is not true, however, that the expected value of a non-linear function
of π is equal to the non-linear function of the expected value of π
(this fact is also known as “Jensen’s Inequality”).
➢ Thus, we can’t make the last step that we would require (in small
samples) to show that our estimator of β is unbiased. As Kennedy
says, “the algebra associated with finding (small sample) expected
values can become formidable whenever non-linearities are involved.”
This problem disappears in large samples.
➢ The models for discrete and limited dependent variables that we will
discuss later in the course involve non-linear functions of the param-
eters, β. Thus, we cannot use the G-M assumptions to prove that
these estimates are unbiased. However, ML estimates have attrac-
tive large sample properties. Thus, our discussion of the properties of
those models will always be expressed in terms of large sample results.

11.2 What are desirable large sample properties?


• Under finite sample conditions, we look for estimators that are unbiased and
efficient (i.e., have minimum variance for any unbiased estimator). We have
also found it useful, when calculating test statistics, to have estimators that
are normally distributed. In the large sample setting, we look for analogous
properties.
• The large sample analog to unbiasedness is consistency. An estimator β̂ is
consistent if it converges in probability to β, that is if its probability limit as
n becomes large is β (plim β̂ = β).
• The best way to think about this is that the sampling distribution of β̂
collapses to a spike around the true β.
• Two Notions of Convergence

1. “Convergence in probability”:

Xn →P X iff limn→∞ Pr(|X(ω) − Xn (ω)| ≥ ε) = 0
where ε is some small positive number. Or

plim Xn (ω) = X(ω)

2. “Convergence in quadratic mean or mean square”: If Xn has mean µn


and variance σn2 such that limn→∞ µn = c and limn→∞ σn2 = 0, then Xn
converges in mean square to c.
Note: convergence in quadratic mean implies convergence in probability
(Xn →qm c ⇒ Xn →P c).

• An estimator may be biased in small samples but consistent. For example,


take the estimator β̂ = β + 1/n. In small samples this is biased, but as the
sample size becomes infinite, the bias 1/n goes to zero (limn→∞ 1/n = 0).

• Although an estimator may be biased yet consistent, it is very hard (read


impossible) for it to be unbiased but inconsistent. If β̂ is unbiased, there is
nowhere for it to collapse to but β. Thus, the only way an estimator could
be unbiased but inconsistent is if its sampling distribution never collapses to
a spike around the true β.
• We are also interested in asymptotic normality. Even though an estimator
may not be distributed normally in small samples, we can usually appeal to
some version of the central limit theorem (see Greene, 6th Ed., Appendix
D.2.6 or Kennedy Appendix C) to show that it will be distributed normally
in large samples.
• More precisely, the different versions of the central limit theorem state that
the mean of any random variable, whatever the distribution of the underlying
variable, will in the limit be distributed such that:
√n(x̄n − µ) →d N [0, σ 2 ]

Thus, even if xi is not distributed normal, the sampling distribution of the


average of an independent sample of the xi ’s will be distributed normal.
• To see how we arrived at this result, begin with the variance of the mean:
var(x̄n ) = σ 2 /n.
• Next, by the CLT, this will be distributed normal: x̄n ∼ N (µ, σ 2 /n)
• As a result,
(x̄n − µ)/(σ/√n) = √n(x̄n − µ)/σ ∼ N (0, 1)
Multiplying the expression above by σ will get you the result on the asymp-
totic distribution of the sample average.
• When it comes to establishing the asymptotic normality of estimators, we
can usually express that estimator in the form of an average, as a sum of
values divided by the number of observations n, so that we can then apply
the central limit theorem.

• We are also interested in asymptotic efficiency. An estimator is asymptoti-


cally efficient if it is consistent, asymptotically normally distributed, and has
an asymptotic covariance matrix that is not larger than the asymptotic co-
variance matrix of any other consistent, asymptotically normally distributed
estimator.
• We will rely on the result that (under most conditions) the maximum like-
lihood estimator is asymptotically efficient. In fact, it attains the smallest
possible variance, the Cramer-Rao lower bound, if that bound exists. Thus,
to show that any estimator is asymptotically efficient, it is sufficient to show
that the estimator in question either is the maximum likelihood estimate or
has identical asymptotic properties.

11.3 How do we figure out the large sample properties of


an estimator?
• To show that any estimator of any quantity, θ̂, is consistent, we have to show
that plim θ̂ = θ. The means of doing so is to show that any bias approaches
zero as n becomes large and the variance in the sampling distribution also
collapses to zero.
• To show that θ̂ is asymptotically normal, we have to show that its sampling
distribution can be expressed as the sampling distribution of a sample average

pre-multiplied by √n.
• Let’s explore the large sample properties of β̂OLS w/o assuming that ε ∼
N (0, σ 2 I).

11.3.1 The consistency of β̂ OLS


• Begin with the expression for β̂ OLS :
β̂ OLS = β + (X0 X)−1 X0 ε

• Instead of taking the expectation, we now take the probability limit:


plim β̂ OLS = β + plim (X0 X)−1 X0 ε

We can multiply both sides of the equation by n/n = 1 to produce:


plim β̂ OLS = β + plim (X0 X/n)−1 (X0 ε/n)

• For the next step we need the Slutsky Theorem: For a continuous function
g(xn ) that is not a function of n, plim g(xn ) = g(plim xn ).
• An implication of this thm is that if xn and yn are random variables with
plim xn = c and plim yn = d, then plim (xn · yn ) = c · d.
• If Xn and Yn are random matrices with plim Xn = A and plim Yn = B then
plim Xn Yn = AB.
• Thus, we can say that:
plim β̂ OLS = β + plim (X0 X/n)−1 plim (X0 ε/n)

• Since the inverse is a continuous function, the Slutsky thm enables us to bring
the first plim into the parenthesis:
plim β̂ OLS = β + (plim X0 X/n)−1 plim (X0 ε/n)

• Let’s assume that


lim (X0 X/n) = Q
n→∞
where Q is a finite, positive definite matrix. In words, as n increases, the
elements of X0 X do not increase at a rate greater than n and the explanatory
variables are not linearly dependent.
• To fix ideas, let’s consider a case where this assumption would not be valid:
yt = β0 + β1 t + εt
which would give
X0 X = [ T , Σt ; Σt , Σt2 ] = [ T , T (T + 1)/2 ; T (T + 1)/2 , T (T + 1)(2T + 1)/6 ]
(rows separated by semicolons, with the sums running over t = 1, . . . , T ). Taking limits gives
limT →∞ (X0 X/T ) = [ 1 , ∞ ; ∞ , ∞ ].

• More generally, each element of (X0 X) is composed of the sum of squares and
the sum of cross-products of the explanatory variables. As such, the elements
of (X0 X) grow larger with each additional data point, n.
• But if we assume the elements of this matrix do not grow at a rate faster
than n and the columns of X are not linear dependent, then dividing by n,
gives convergence to a finite number.
• We can now say that plim β̂ OLS = β + Q−1 plim (X0 ε/n) and the next step
in the proof is to show that plim (X0 ε/n) is equal to zero. To demonstrate
this, we will prove that its expectation is equal to zero and that its variance
converges to zero.
• Think about the individual elements in (X0 ε/n). This is a k × 1 matrix in
which each element is the sum of all n observations of a given explanatory
variable multiplied by each realization of the error term. In other words:
(1/n) X0 ε = (1/n) Σi xi εi = (1/n) Σi wi = w̄ (11.1)

where w̄ is a k × 1 vector of the sample averages of xki εi . (Verify this if it is


not clear to you.)
• Since we are still assuming that X is non-stochastic, we can work through
the expectations operator to say that:
E[w̄] = (1/n) Σi E[wi ] = (1/n) Σi xi E[εi ] = (1/n) X0 E[ε] = 0 (11.2)

In addition, using the fact that E[εε0 ] = σ 2 I we can say that:


var[w̄] = E[w̄w̄0 ] = (1/n) X0 E[εε0 ]X (1/n) = (σ 2 /n)(X0 X/n)
• In the limit, as n → ∞, σ 2 /n → 0 and X0 X/n → Q, and thus
limn→∞ var[w̄] = 0 · Q = 0

• Therefore, we can say that w̄ converges in mean square to 0 ⇒ plim (X0 ε/n)
is equal to zero, so that:
plim β̂ OLS = β + Q−1 · 0 = β
Thus, the OLS estimator is consistent as well as unbiased.

11.3.2 The asymptotic normality of OLS


• Let’s show that the OLS estimator, β̂OLS , is also asymptotically normal. Start
with
β̂ OLS = β + (X0 X)−1 X0 ε

and then subtract β from each side and multiply through by √n to yield (using
√n = n/√n):
√n(β̂ OLS − β) = (X0 X/n)−1 (1/√n) X0 ε
We've already established that the first term on the right-hand side converges
to Q−1 . We need to derive the limiting distribution of the term (1/√n) X0 ε.

• From equations 11.1 and 11.2, we can write
(1/√n) X0 ε = √n(w̄ − E[w̄])
and then find the limiting distribution of √n w̄.
• To do this, we will use a variant of the CLT (called Lindeberg-Feller), which
allows for variables to come from different distributions. From above, we know that
w̄ = (1/n) Σi xi εi
which means w̄ is the average of n independent vectors xi εi with means 0 and
variances
var[xi εi ] = σ 2 xi x0i = σ 2 Qi

• Thus,
var[√n w̄] = σ 2 Q̄n = σ 2 (1/n)[Q1 + Q2 + · · · + Qn ] = σ 2 (1/n) Σi xi x0i = σ 2 (X0 X/n)
• Assuming the sum is not dominated by any particular term and that
limn→∞ (X0 X/n) = Q, then
limn→∞ σ 2 Q̄n = σ 2 Q

• We can now invoke the Lindeberg-Feller CLT to formally state that if the ε
are iid with mean 0 and finite variance, and if each element of X is finite and
limn→∞ (X0 X/n) = Q, then
(1/√n) X0 ε →d N [0, σ 2 Q]
It follows that:
Q−1 (1/√n) X0 ε →d N [Q−1 · 0, Q−1 (σ 2 Q)Q−1 ]
[Recall that if a random variable X has variance σ 2 , then kX, where k is a
constant, has variance k 2 σ 2 .]
• Combining terms, and recalling what we were originally interested in:
√n(β̂ OLS − β) →d N [0, σ 2 Q−1 ]

• How do we get from here to a statement about the normal distribution of
β̂ OLS ? Divide through by √n on both sides and add β to show that the OLS
estimator is asymptotically distributed normal:
β̂ OLS ∼a N [β, (σ 2 /n)Q−1 ]

• To complete the steps, we can also show that s2 = e0 e/(n − k) is consistent for σ 2 .
Thus, s2 (X0 X/n)−1 is consistent for σ 2 Q−1 . As a result, a consistent estimate
for the asymptotic variance of β̂ OLS (= (σ 2 /n)Q−1 = (σ 2 /n)(X0 X/n)−1 ) is s2 (X0 X)−1 .

• Thus, we can say that β̂ OLS is normally distributed and a consistent estimate
of its asymptotic variance is given by s2 (X0 X)−1 , even when the error terms
are not distributed normally. We have gone through the rigors of large sample
proofs in order to show that in large samples OLS retains desirable properties
that are similar to what it has in small sample when all of the G–M conditions
hold.
• To conclude, the desirable properties of OLS do not rely on the assumption
that the true error term is normally distributed. We can appeal to large
sample results to show that the sampling distribution will still be normal
as the sample size becomes large and that it will have variance that can be
consistently estimated by s2 (X0 X)−1 .

11.4 The large sample properties of test statistics


• Given the normality results and the consistent estimate of the asymptotic
variance given by s2 (X0 X)−1 , hypothesis testing proceeds almost as normal.
Hypotheses on individual coefficients can still be estimated by constructing
a t-statistic:
(β̂k − βk )/SE(β̂k )
• When we did this in small samples, we made clear that we had to use an
estimate, √([s2 (X0 X)−1 ]kk ), for the true standard error in the denominator, equal
to √([σ 2 (X0 X)−1 ]kk ).

• As the sample size becomes large, we can replace this estimate by its prob-
ability limit, which is the true standard error, so that the denominator just
becomes a constant. When we do so, we have a normally distributed random
variable in the numerator divided by a constant, so the test-statistic is dis-
tributed as z, or standard normal. Another way to think of this intuitively is
that the t-distribution converges to the z distribution as n becomes large.
• Testing joint hypothesis also proceeds via constructing an F -test. In this case,
we recall the formula for an F -test, put in terms of the restriction matrix:
F [J, n − K] = {(Rb − q)0 [R(X0 X)−1 R0 ]−1 (Rb − q)/J}/s2
• In small samples, we had to take account of the distribution of s2 itself. This
gave us the ratio of two chi-squared random variables, which is distributed
F.
• In large samples, we can replace s2 by its probability limit, σ 2 which is just
a constant value. Multiplying both sides by J, we now have that the test
statistic JF is composed of a chi-squared random variable in the numerator
over a constant, so the JF statistic is the large sample analog of the F -test.
If the JF statistic is larger than the critical value, then we can say that the
restrictions are unlikely to be true.

11.5 The desirable large sample properties of ML estimators
• Recall that MLEs are consistent, asymptotically normally distributed, and
asymptotically efficient, in that they always achieve the Cramer-Rao lower
bound, when this bound exists.
• Thus, MLEs always have desirable large sample properties although their
small sample estimates (as of σ 2 ) may be biased.
• We could also have shown that OLS was going to be consistent, asymptotically
normally distributed, and asymptotically efficient by indicating that OLS is
the MLE for the classical linear regression model.

11.6 How large does n have to be?


• Does any of this help us if we have to have an infinity of data points before we
can attain consistency and asymptotic normality? How do we know how many
data points are required before the sampling distribution of β̂ OLS becomes
approximately normal?
• One way to check this is via Monte Carlo studies. Using a non-normal dis-
tribution for the error terms, we simulate draws from a distribution of the
model y = Xβ + ε, using a different set of errors each time.
• We then calculate the β̂ OLS from repeated samples of size n and plot these
different sample estimates β̂ OLS on a histogram.
• We gradually enlarge n, checking how large n has to be in order to give us a
sampling distribution that is approximately normal.
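• A small Monte Carlo sketch along these lines, using skewed (chi-squared) errors rather than normal ones:

set.seed(1)
n <- 25                                        # enlarge n to see the histogram become more normal
reps <- 2000
x <- runif(n)
slope <- numeric(reps)
for (r in 1:reps) {
  e <- rchisq(n, df=2) - 2                     # mean-zero but skewed errors
  y <- 1 + 2*x + e
  slope[r] <- coef(lm(y ~ x))[2]
}
hist(slope, breaks=40)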
Section 12

Heteroskedasticity
12.1 Heteroskedasticity as a violation of Gauss-Markov
• The third of the Gauss-Markov assumptions is that E[εε0 ] = σ 2 In . The
variance-covariance matrix of the true error terms is structured as:
 
2
σ 0 ... 0
E(ε21 ) E(ε1 ε2 ) . . . E(ε1 εN )
 
 0 σ 2 ... 0 
0
E(εε ) = .
. . . . .
. =  ..
. .
   

.
E(εN ε1 ) E(εN ε2 ) . . . E(ε2N )
 
0 0 ... σ 2

• If all of the diagonal terms are equal to one another, then each realization
of the error term has the same variance, and the errors are said to be ho-
moskedastic. If the diagonal terms are not the same, then the true error term
is heteroskedastic.
• Also, if the off-diagonal elements are zero, then the covariance between dif-
ferent error terms is zero and the errors are uncorrelated. If the off-diagonal
terms are non-zero, then the error terms are said to be auto-correlated and the
error term for one observation is correlated with the error term for another
observation.
• When the third assumption is violated, the variance-covariance matrix of the
error term does not take the special form, σ 2 In , and is generally written,
instead as σ 2 Ω. The disturbances in this case are said to be non-spherical
and the model should then be estimated by Generalized Least Squares, which
employs σ 2 Ω rather than σ 2 In .


12.1.1 Consequences of non-spherical errors


• The OLS estimator, β̂OLS , is still unbiased and (under most conditions) con-
sistent.
Proof for Unbiasedness:
As before:
β̂ OLS = β + (X0 X)−1 X0 ε
and we still have that E[ε|X] = 0, thus we can say that E[β̂ OLS ] = β.

• The variance of β̂ OLS is E[(β̂ − E(β̂))(β̂ − E(β̂))0 ], which we can write as:

var(β̂ OLS ) = E[(X0 X)−1 X0 εε0 X(X0 X)−1 ]


= (X0 X)−1 X0 (σ 2 Ω)X(X0 X)−1
= σ 2 (X0 X)−1 (X0 ΩX)(X0 X)−1

• Therefore, if the errors are normally distributed,

β̂ OLS ∼ N [β, σ 2 (X0 X)−1 (X0 ΩX)(X0 X)−1 ]

• We can also use the formula we derived in the lecture on asymptotics to show
that β̂ OLS is consistent.

plim β̂ OLS = β + Q−1 plim (X0 ε/n)

To show that plim (X0 ε/n) = 0, we demonstrated that the expectation of
(X0 ε/n) was equal to zero and that its variance was equal to (σ 2 /n)(X0 X/n), which
goes to zero as n grows to infinity.
• In the case where E[εε0 ] = σ 2 Ω, the variance of (X0 ε/n) is equal to (σ 2 /n)(X0 ΩX/n).
So long as this matrix converges to a finite matrix, then β̂ OLS is also consistent
for β.
• Finally, in most cases, β̂ OLS is asymptotically normally distributed with mean
β and its variance-covariance matrix is given by σ 2 (X0 X)−1 (X0 ΩX)(X0 X)−1 .

12.2 Consequences for efficiency and standard errors


• So what’s the problem with using OLS when the true error term may be
heteroskedastic? It seems like it still delivers us estimates of the coefficients
that have highly desirable properties.
• The true variance of β̂ OLS is no longer σ 2 (X0 X)−1 , so that any inference based
on s2 (X0 X)−1 is likely to be “misleading”.
• Not only is the wrong matrix used, but s2 may be a biased estimator of σ 2 .
• In general, there is no way to tell the direction of bias, although Goldberger
(1964) shows that in the special case of only one explanatory variable (in
addition to the constant term), s2 is biased downward if high error variances
correspond with high values of the independent variable.
• Whatever the direction of the bias, we should not use the standard equation
for the standard errors in hypothesis tests on β̂ OLS .
• More importantly, OLS is no longer efficient, since another method called
Generalized Least Squares will give estimates of the regression coeffi-
cients, β̂ GLS , that are unbiased and have a smaller variance.

12.3 Generalized Least Squares


• We assume, as we always do for a variance matrix, that Ω is positive definite
and symmetric. Thus, it can be factored into a set of matrices containing its
characteristic roots and vectors.
Ω = CΛC0

• Here the columns of C contain the characteristic vectors of Ω and Λ is a


diagonal matrix containing its characteristic roots. We can “factor” Ω using
the square root of Λ, or Λ1/2 , which is a matrix containing the square roots
of the characteristic roots on the diagonal.
• Then, if T = CΛ1/2 , TT0 = Ω. The last result holds because TT0 =
CΛ1/2 Λ1/2 C0 = CΛC0 = Ω. Also, if we let P0 = CΛ−1/2 , then P0 P = Ω−1 .

• It can be shown that the characteristic vectors are all orthogonal and for each
characteristic vector, c0i ci = 1 (Greene, 6th ed. p. 968–969). It follows that
CC0 = I, and C0 C = I, a fact that we will use below.
• GLS consists of estimating the following equation, using the standard OLS
solutions for the regression coefficients:
Py = PXβ + Pε
or
y∗ = X∗ β + ε∗ .
0 0
Thus, β̂ GLS = (X∗ X∗ )−1 (X∗ y∗ ) = (X0 Ω−1 X)−1 (X0 Ω−1 y)
0
• It follows that the variance of ε∗ , is equal to E[ε∗ ε∗ ] = Pσ 2 ΩP0 , and:
Pσ 2 ΩP0 = σ 2 PΩP0 = σ 2 Λ−1/2 C0 ΩCΛ−1/2
= σ 2 Λ−1/2 C0 CΛ1/2 Λ1/2 C0 CΛ−1/2
= σ 2 Λ−1/2 Λ1/2 Λ1/2 Λ−1/2
= σ 2 In

12.3.1 Some intuition


• In the standard case of heteroskedasticity, GLS consists of dividing each ob-
servation by the square root of its own element in the Ω matrix, ωi . The nice
thing about this is that the variance of ε∗ is equal to E[ε∗ ε∗0 ] = Pσ 2 ΩP0 = σ 2 I.
• In effect, we’ve removed the heteroskedasticity from the residuals, and we can
0
go ahead and estimate the variance of β̂ GLS using the formula σ 2 (X∗ X∗ )−1 =
σ 2 (X0 Ω−1 X)−1 . This is also known as the Aitken estimator after the statisti-
cian who originally proposed the method in 1935.
• We can also show, using the standard methods, that β̂ GLS is unbiased, con-
sistent, and asymptotically normally distributed.
• We can conclude that β̂ GLS is BLUE for the generalized model in which the
variance of the errors is given by σ 2 Ω. The result follows by applying the
Gauss-Markov theorem to model y∗ = X∗ β + ε∗ .

• In this general case, the maximum likelihood estimator will be the GLS esti-
mator, β̂ GLS , and the Cramer-Rao lower bound for the variance of β̂ GLS
is given by σ 2 (X0 Ω−1 X)−1 .

12.4 Feasible Generalized Least Squares


• All of the above assumes that Ω is a known matrix, which is usually not the
case.
• One option is to estimate Ω in some way and to use Ω̂ in place of Ω in the
GLS model above. For instance, one might believe that the true error term
was a function of one (or more) of the independent variables. Thus,

εi = γ0 + γ1 xi + ui

• Since β̂ OLS is consistent and unbiased for β, we can use the OLS residuals
to estimate the model above. The procedure is that we first estimate the
original model using OLS, and then use the residuals from this regression to

estimate Ω̂ and ωi .
• We then transform the data using these estimates, and use OLS again on
the transformed data to estimate β̂ GLS and the correct standard errors for
hypothesis testing.
• One can also use the estimated errors terms from the last stage to conduct
FGLS again and keep on doing this until the error model begins to converge,
so that the estimated residuals barely change as one moves through the iter-
ations.
• Some examples:
➢ If σi2 = σ 2 x2i , we would divide all observations through by xi .
➢ If σi2 = σ 2 xi , we would divide all observations through by √xi .
➢ If σi2 = σ 2 (γ0 + γ1 xi + ui ), we would divide all observations through by
√(γ0 + γ1 xi ).
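• In R, transformations of this kind can be carried out with the weights argument of lm(), which weights each observation by the inverse of its error variance; a sketch for the first and third cases (names hypothetical):

wls <- lm(y ~ x, data=dat, weights=1/x^2)      # case sigma_i^2 = sigma^2 * x_i^2

ols <- lm(y ~ x, data=dat)                     # FGLS variant: model the squared residuals,
aux <- lm(resid(ols)^2 ~ x, data=dat)          # then weight by the inverse of the fitted
fgls <- lm(y ~ x, data=dat, weights=1/fitted(aux))  # variances (check that they are positive)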

• Essential Result: So long as our estimate of Ω is consistent, the FGLS


estimator will be consistent and will be asymptotically efficient.
• Problem: Except for the simplest cases, the finite-sample properties and ex-
act distributions of FGLS estimators are unknown. Intuitively, var(β̂ F GLS ) ≠
var(β̂ GLS ) because we also have to take into account the uncertainty in es-
timating Ω. Since we cannot work out the small sample distribution, we
cannot even say that β̂ F GLS is unbiased in small samples.

12.5 White-consistent standard errors


• If you can find a consistent estimator for Ω, then go ahead and perform FGLS
if you have a large number of data points available.
• Otherwise, and in most cases, analysts estimate the regular OLS model and
use “White-consistent standard errors”, which are described by Leamer as
“White-washing heteroskedasticity”.
• White’s heteroskedasticity consistent estimator of the variance-covariance
matrix of β̂ OLS is recommended whenever OLS estimates are being used
for inference in a situation in which heteroskedasticity is suspected, but the
researcher is unable to consistently estimate Ω and use FGLS.
• White’s method consists of finding a consistent estimator for the true OLS
variance below:
var[β̂OLS] = (1/n) [(1/n)X′X]⁻¹ [(1/n)X′(σ²Ω)X] [(1/n)X′X]⁻¹

• The trick to White’s estimation of the asymptotic variance-covariance matrix


is to recognize that what we need is a consistent estimator of:
Q∗ = σ²X′ΩX/n = (1/n) Σi σ²i xixi′

• Under very general conditions


S0 = (1/n) Σi e²i xixi′

(which is a symmetric k × k matrix with k(k + 1)/2 unique terms) is consistent for Q∗.


(See Greene, 6th ed., p. 162–3)
• Because β̂ OLS is consistent for β, we can show that White’s estimate of
var[β̂ OLS ] is consistent for the true asymptotic variance.
• Thus, without specifying the exact nature of the heteroskedasticity, we can
still calculate a consistent estimate of var[β̂ OLS ] and use this in the normal
way to derive standard errors and conduct hypothesis tests. This makes the
White-consistent estimator extremely attractive in a wide variety of situa-
tions.
• The White heteroskedasticity consistent estimator is
Est. asy. var[β̂OLS] = (1/n) [(1/n)X′X]⁻¹ [(1/n) Σi e²i xixi′] [(1/n)X′X]⁻¹
= n (X′X)⁻¹ S0 (X′X)⁻¹

• Adding the robust qualifier to the regression command in Stata produces


standard errors computed from this estimated asymptotic variance matrix.
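• A minimal Stata sketch (y, x1, and x2 are placeholder names):
regress y x1 x2, vce(robust)         // White heteroskedasticity-consistent standard errors
* in older versions the equivalent syntax is: regress y x1 x2, robust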

12.6 Tests for heteroskedasticity


• Homoskedasticity implies that the error variance will be the same across
different observations and should not vary with X, the independent variables.
• Unsurprisingly, all tests for heteroskedasticity rely on checking how far the
error variances for different groups of observations depart from one another,
or how precisely they are explained by the explanatory variables.

12.6.1 Visual inspection of the residuals


• Plot them and look for patterns (residuals against fitted values, residuals
against explanatory vars).
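• In Stata these plots are available directly after regress; a minimal sketch with placeholder names:
regress y x1 x2
rvfplot                              // residuals against fitted values
rvpplot x1                           // residuals against a particular explanatory variable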

12.6.2 The Goldfeld-Quandt test


• The Goldfeld-Quandt tests consists of comparing the estimated error vari-
ances for two groups of observations.
• First, sort the data points by one of the explanatory variables (e.g., country
size), and split the sample into two groups (e.g., small and large countries). Then run the model separately for the two groups.
• If the true errors are homoskedastic, then the estimated error variances for
the two groups should be approximately the same.
• We estimate the true error variance by s²g = (e′g eg)/(ng − k), where the sub-
script g indicates that this is the value for each group. Since each estimated
error variance is distributed chi-squared, the test statistic below is distributed
F.
• If the true error is homoskedastic, the statistic will be approximately equal
to one. The larger the F -statistic, the less likely it is that the errors are
homoskedastic. It should be noted that the formulation below is computed
assuming that the errors are higher for the first group.
[e′1e1/(n1 − k)] / [e′2e2/(n2 − k)] ∼ F [n1 − k, n2 − k]

• It has also been suggested that the test-statistic should be calculated using
only the first and last thirds of the data points, excluding the middle section
of the data, to sharpen the test results.
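• A minimal Stata sketch of the test, assuming we sort on x1 and suspect the larger error variance in the second half of the sample (all names are placeholders):
sort x1
regress y x1 x2 if _n <= _N/2
scalar s1 = e(rss)/e(df_r)           // estimated error variance, group 1
scalar df1 = e(df_r)
regress y x1 x2 if _n > _N/2
scalar s2 = e(rss)/e(df_r)           // estimated error variance, group 2
scalar df2 = e(df_r)
display "Goldfeld-Quandt F = " s2/s1 // compare to the F(df2, df1) distribution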

12.6.3 The Breusch-Pagan test


• The main problem w/ Goldfeld-Quandt: requires knowledge of how to sort
the data points.
• The Breusch-Pagan test uses a different approach to see if the error variances
are systematically related to the independent variables, or to any subset or
transformation of those variables.
• The idea behind the test is that for the regression

σ²i = σ²f(α0 + α′zi)

where zi is a vector of independent variables, if α = 0 then the errors are


homoskedastic.
• To perform the test, regress e²i/(e′e/n) − 1 on some combination of the inde-
pendent variables (zi).
• We then compute a Lagrange Multiplier statistic as the regression sum of
squares from that regression divided by 2. That is,
SSR/2 ∼ χ²m−1 (asymptotically),
where m is the number of explanatory variables in the auxiliary regression.
• Intuition: if the null hypothesis of no heteroskedasticity is true, then the SSR
should equal zero. A non-zero SSR is telling us that e²i/(e′e/n) − 1 varies
with the explanatory variables.
• Note that this test depends on the fact that β̂OLS is consistent for β, so that
the residuals ei from an OLS regression of yi on xi will be consistent for εi.
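• In Stata, a version of this test is available after regress; a minimal sketch with placeholder names:
regress y x1 x2
estat hettest                        // Breusch-Pagan/Cook-Weisberg test using the fitted values
estat hettest x1 x2                  // or specify the z variables directly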
Section 13

Autocorrelation
13.1 The meaning of autocorrelation
• Auto-correlation is often paired with heteroskedasticity because it is another
way in which the variance-covariance matrix of the true error terms (if we
could observe it) is different from the Gauss-Markov assumption, E[εε0 ] =
σ 2 In .
• In our earlier discussion of heteroskedasticity, we saw what happened when
we relax the assumption that the variance of the error term is constant. An
equivalent way of saying this is that we relax the assumption that the errors
are “identically distributed.” In this section, we see what happens when we
relax the assumption that the error terms are independent.
• In this case, we can have errors that covary (e.g., if one error is positive and
large the next error is likely to be positive and large) and are correlated. In
either case, one error can give us information about another.
• Two types of error correlation:
1. Spatial correlation: e.g., of contiguous households, states, or counties.
2. Temporal or Autocorrelation: errors from adjacent time periods are cor-
related with one another. Thus, εt is correlated with εt+1 , εt+2 , . . . and
εt−1 , εt−2 , . . . , ε1 .
• The correlation between εt and εt−k is called autocorrelation of order k.
➢ The correlation between εt and εt−1 is the first-order autocorrelation and
is usually denoted by ρ1 .
➢ The correlation between εt and εt−2 is the second-order autocorrelation
and is usually denoted by ρ2 .


• What does this imply for E[εε′]?

E(εε′) =
| E(ε²1)    E(ε1ε2)   . . .   E(ε1εT) |
|   ..         ..       . .      ..   |
| E(εTε1)   E(εTε2)   . . .   E(ε²T)  |

Here, the off-diagonal elements are the covariances between the different error
terms. (Why? Because E[εt] = 0, E[εtεs] is simply cov[εt, εs].)
• Autocorrelation implies that the off-diagonal elements are not equal to zero.

13.2 Causes of autocorrelation


• Misspecification of the model
• Data manipulation, smoothing, seasonal adjustment
• Prolonged influence of shocks
• Inertia

13.3 Consequences of autocorrelation for regression coef-


ficients and standard errors
• As with heteroskedasticity, in the case of autocorrelation, the βOLS regression
coefficients are still unbiased and consistent. Note, this is true only in the
case where the model does not contain a lagged dependent variable. We will
come to this latter case in a bit.
• As before, OLS is inefficient. Further, inferences based on the standard OLS
estimate of var(β̂OLS) = s²(X′X)⁻¹ are wrong.
• OLS estimation in the presence of autocorrelation (typically positive) is likely to lead to an under-
estimate of σ², meaning that our t-statistics will be inflated, and we are more
likely to reject the null when we should not do so.

13.4 Tests for autocorrelation


• Visual inspection, natch.
• Test statistics
➢ There are two well-known test statistics to compute to detect autocorre-
lation.
1. Durbin-Watson test: most well-known but does not always give un-
ambiguous answers and is not appropriate when there is a lagged
dependent variable (LDV) in the original model.
2. Breusch-Godfrey (or LM) test: does not have a similar indeterminate
range (more later) and can be used with a LDV.
➢ Both tests use the fact, exploited in tests for heteroskedasticity, that
since βOLS is consistent for β, the residuals, ei , will be consistent for εi .
Both tests, therefore, use the residuals from a preceding OLS regression
to estimate the actual correlations between error terms.
➢ At this point, it is useful to remember the expression for the correlation
of two variables:
corr(x, y) = cov(x, y) / [√var(x) √var(y)]
➢ For the error terms, if we assume that var(εt) = var(εt−1) = E[ε²t], then
we have
corr(εt, εt−1) = E[εt εt−1] / E[ε²t] = ρ1
➢ To get the sample average of this autocorrelation, we would compute this
ratio for each pair of succeeding error terms, and divide by n.

13.4.1 The Durbin-Watson test


• The Durbin-Watson test does something similar to estimating the sample
autocorrelation between error terms. The DW statistic is calculated as:
d = Σ_{t=2}^{T} (et − et−1)² / Σ_{t=1}^{T} e²t

• We can write d as d = (Σe²t + Σe²t−1 − 2Σet et−1) / Σe²t.
• Since Σe²t is approximately equal to Σe²t−1 as the sample size becomes large,
and Σet et−1 / Σe²t is the sample estimate of the autocorrelation coefficient, ρ, we have
that d ≈ 2(1 − ρ̂).
• If ρ = +1, then d = 0, if ρ = −1, then d = 4. If ρ = 0, then d = 2. Thus, if
d is close to zero or four, the residuals can be said to be highly correlated.
• The problem is that the exact sampling distribution of d depends on the
values of the explanatory variables. To get around this problem, Durbin and
Watson have derived upper (du ) and lower limits (dl ) for the significance levels
of d.
• If we were testing for positive autocorrelation, versus the null hypothesis of
no autocorrelation, we would use the following procedure:
➢ If d < dl , reject the null of no autocorrelation.
➢ If d > du , we do not reject the null hypothesis.
➢ If dl < d < du , the test is inconclusive.
• Can be calculated within Stata using dwstat.

13.4.2 The Breusch-Godfrey test


• To perform this test, simply regress the OLS residuals, et , on all the explana-
tory variables, xt , and on their own lags, et−1 , et−2 , et−3 , . . . , et−p .
• This test works because the coefficient on the lagged error terms will only
be significant if the partial autocorrelation between et and et−p is significant.
The partial autocorrelation is the autocorrelation between the error terms
accounting for the effect of the X explanatory variables.
• Thus, the null hypothesis of no autocorrelation can be tested using an F -
test to see whether the coefficients on the lagged error terms, et−1 , et−2 ,
et−3 , . . . , et−p are jointly equal to zero.
• If you would like to be precise, and consider that since we are using et as
an estimate of εt , you can treat this as a large sample test, and say that
p · F , where p is the number of lagged error terms restricted to be zero, is
asymptotically distributed chi-squared with degrees of freedom equal to p.
Then use the chi-squared distribution to give critical values for the test.
• Using what is called the Lagrange Multiplier (LM) approach to testing, we
can also show that n · R² is asymptotically distributed as χ²p and use this
statistic for testing.
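• A Stata sketch of both tests, assuming the data have been declared time series with a placeholder time variable year:
tsset year
regress y x1 x2
estat dwatson                        // Durbin-Watson d statistic
estat bgodfrey, lags(1)              // Breusch-Godfrey LM test with one lag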

13.5 The consequences of autocorrelation for the variance-


covariance matrix
• In the case of heteroskedasticity, we saw that we needed to make some as-
sumptions about the form of the E[εε0 ] matrix in order to be able to transform
the data and use FGLS to calculate correct and efficient standard errors.
• Under auto-correlation, the off-diagonal elements will not be equal to zero,
but we don’t know what those off-diagonal elements are. In order to conduct
FGLS, we usually make some assumption about the form of autocorrelation
and the process by which the disturbances are generated.

• First, we must write the E[εε′] matrix in a way that can represent autocor-
relation:
E[εε′] = σ²Ω

• Recall that
corr(εt, εt−s) = E[εt εt−s] / E[ε²t] = ρs = γs/γ0
where γs = cov(εt, εt−s) and γ0 = var(εt)
• Let R be the “autocorrelation matrix” showing the correlation between all
the disturbance terms. Then E[εε′] = γ0R
• Second, we need to calculate the autocorrelation matrix. It helps to make
an assumption about the process generating the disturbances or true errors.
The most common assumption is that the errors follow an “autoregressive
process” of order one, written as AR(1).
An AR(1) process is represented as:

εt = ρεt−1 + ut

where E[ut] = 0, E[u²t] = σ²u, and cov[ut, us] = 0 if t ≠ s.


• By repeated substitution, we have:

εt = ut + ρut−1 + ρ2 ut−2 + · · ·

• Each disturbance, εt , embodies the entire past history of the us, with the
most recent shocks receiving greater weight than those in the more distant
past.
• The successive values of ut are uncorrelated, so we can estimate the variance
of εt , which is equal to E[ε2t ], as:

var[εt ] = σu2 + ρ2 σu2 + ρ4 σu2 + . . .

• This is a series of positive numbers (σ²u) multiplied by increasing even powers of
ρ. If |ρ| ≥ 1, the series diverges and we won't be able to
get a finite expression for var[εt]. To proceed, we assume that |ρ| < 1.

• Here is a useful trick for a series. If we have an infinite series of numbers:

y = x + a·x + a²·x + a³·x + . . .

then, provided |a| < 1,
y = x/(1 − a).
• Using this, we see that
var[εt] = σ²u/(1 − ρ²) = σ²ε = γ0    (13.1)

• We can also estimate the covariances between the errors:


cov[εt, εt−1] = E[εt εt−1] = E[εt−1(ρεt−1 + ut)] = ρ var[εt−1] = ρσ²u/(1 − ρ²)

• And, since εt = ρεt−1 + ut = ρ(ρεt−2 + ut−1) + ut = ρ²εt−2 + ρut−1 + ut, we have

cov[εt, εt−2] = E[εt εt−2] = E[εt−2(ρ²εt−2 + ρut−1 + ut)] = ρ² var[εt−2] = ρ²σ²u/(1 − ρ²)

• By repeated substitution, we can see that:


cov[εt, εt−s] = E[εt εt−s] = ρ^s σ²u/(1 − ρ²) = γs
• Now we can get back to the autocorrelations. Since corr[εt, εt−s] = γs/γ0, we
have:
corr[εt, εt−s] = [ρ^s σ²u/(1 − ρ²)] / [σ²u/(1 − ρ²)] = ρ^s

• In other words, the auto-correlations fade over time. They are always less
than one and become less and less the farther two disturbances are apart in
time.

• The auto-correlation matrix, R, shows all the auto-correlations between the
disturbances. Given the link between the auto-correlation matrix and E[εε′] =
σ²Ω, we can now say that:
E[εε′] = σ²Ω = γ0R
or
σ²Ω = [σ²u/(1 − ρ²)] ×
| 1         ρ         ρ²        ρ³   . . .  ρ^{T−1} |
| ρ         1         ρ         ρ²   . . .  ρ^{T−2} |
| ρ²        ρ         1         ρ    . . .  ρ^{T−3} |
| ..                                  . . .     ..   |
| ρ^{T−1}   ρ^{T−2}   ρ^{T−3}        . . .  1       |

13.6 GLS and FGLS under autocorrelation


• Given that we now have the expression for σ 2 Ω, we can in theory transform
the data and estimate via GLS (or use an estimate, σ̂ 2 Ω̂, for FGLS).
• We are transforming the data so that we get an error term that conforms to
the Gauss-Markov assumptions. In the heteroskedasticity case, we showed
how to transform the data using the matrix P derived from σ²Ω.
• In the case of autocorrelation of the AR(1) type, what’s nice is that the
transformation is fairly simple, even though the matrix expression for σ 2 Ω
may not look that simple.
• Let’s start with a simple model with an AR(1) error term:
yt = β0 + β1 xt + εt , t = 1, 2, . . ., T (13.2)
where εt = ρεt−1 + ut
• Now lag the data by one period and multiply it by ρ:
ρyt−1 = β0 ρ + β1 ρxt−1 + ρεt−1

• Subtracting this from Equation 13.2 above, we get:


yt − ρyt−1 = β0 (1 − ρ) + β1 (xt − ρxt−1 ) + ut

• But (given our assumptions) the ut s are serially independent, have constant
variance (σu2 ), and the covariance between different us is zero. Thus, the new
error term, ut , conforms to all the desirable features of Gauss-Markov errors.
• If we conduct OLS using this transformed data, and use our normal esti-
mate for the standard errors, or s2 (X0 X)−1 , we will get correct and efficient
standard errors.
• The transformation is:
yt∗ = yt − ρyt−1
x∗t = xt − ρxt−1

• If we drop the first observation (because we don’t have a lag for it), then we
are following the Cochrane-Orcutt procedure. If we keep the first observation
and use it with the following transformation:
y∗1 = √(1 − ρ²) y1
x∗1 = √(1 − ρ²) x1
then we are following something called the Prais-Winsten procedure. Both
of these are examples of FGLS.
• But how do we do this if we don't "know" ρ? Since β̂OLS is consistent, the
estimated residuals are consistent estimates of the true disturbances, εt.
• In this case, we can estimate ρ using the residuals from an initial OLS regres-
sion and then perform FGLS using this estimate. The standard
errors that we calculate using this procedure will be asymptotically efficient.
• To estimate ρ from the residuals, et, compute:
ρ̂ = Σ et et−1 / Σ e²t
We could also estimate ρ from a regression of et on et−1 .

• To sum up this procedure:


1. Estimate the model using OLS. Test for autocorrelation. If tests reveal
this to be present, estimate the autocorrelation coefficient, ρ, using the
residuals from the OLS estimation.
2. Transform the data.
3. Estimate OLS using the transformed data.
• For Prais-Winsten in Stata do:
prais depvar expvars
or
prais depvar expvars, corc
for Cochrane-Orcutt.
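• The same steps can also be done by hand; a minimal sketch, assuming the data are tsset and using placeholder names y and x:
regress y x
predict e, residuals
generate e_lag = L.e
regress e e_lag, noconstant          // gives rho-hat
scalar rho = _b[e_lag]
generate ystar = y - rho*L.y         // quasi-differenced data
generate xstar = x - rho*L.x
regress ystar xstar                  // Cochrane-Orcutt-style FGLS (drops the first observation)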

13.7 Non-AR(1) processes


• A problem with the Prais-Winsten and Cochrane-Orcutt versions of FGLS
is that the disturbances may not be distributed AR(1). In this situation, we
will have used the wrong assumptions to estimate the auto-correlations.
• There is an approach, analogous to the White-consistent standard errors, that
directly estimates the correct OLS standard errors (from
(X′X)⁻¹X′(σ²Ω)X(X′X)⁻¹), using an estimate of X′(σ²Ω)X based on very
general assumptions about the autocorrelation of the error terms. These
standard errors are called Newey-West standard errors.
• Stata users can use the following command to compute these:
newey depvar expvars, lag(#) t(varname)
In this case, lag(#) tells Stata how many lags there are between any two
disturbances before the autocorrelations die out to zero. t(varname) tells
Stata the variable that indexes time.
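• A minimal sketch; in newer versions of Stata, newey expects the data to be tsset rather than a t() option (names are placeholders):
tsset year
newey y x1 x2, lag(2)                // Newey-West SEs allowing autocorrelation up to two lags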

13.8 OLS estimation with lagged dependent variables and


autocorrelation
• We said previously that autocorrelation, like heteroskedasticity, does not af-
fect the unbiasedness of the OLS estimates, just the means by which we cal-
culate efficient and correct standard errors. There is one important exception
to this.
• Suppose that we use the lagged value of the dependent variable, yt−1, as a regressor,
so that:
yt = βyt−1 + εt    where |β| < 1
and
εt = ρεt−1 + ut
where E[ut] = 0, E[u²t] = σ²u, and cov[ut, us] = 0 if t ≠ s.
• The OLS estimate is β̂OLS = (X′X)⁻¹X′y, where X = yt−1. Let us also as-
sume that yt and yt−1 have been transformed so that they are measured as
deviations from their means.
β̂OLS = Σ_{t=2}^{T} yt−1 yt / Σ_{t=2}^{T} y²t−1 = Σ_{t=2}^{T} yt−1(βyt−1 + εt) / Σ_{t=2}^{T} y²t−1
= β + Σ_{t=2}^{T} yt−1 εt / Σ_{t=2}^{T} y²t−1

• Thus,
E[β̂OLS] = β + cov(yt−1, εt)/var(yt)

• To show that β̂OLS is biased, we need to show only that cov(yt−1, εt) ≠ 0, and
to show that β̂OLS is inconsistent, we need to show only that the limit of this
covariance as T → ∞ is not equal to zero.
cov[yt−1 , εt ] = cov[yt−1 , ρεt−1 + ut ] = ρcov[yt−1 , εt−1 ] = ρcov[yt , εt ]

• The last step is true assuming the DGP is “stationary” and the ut s are un-
correlated.
• Continuing:
ρcov[yt , εt ] = ρcov[βyt−1 + εt , εt ] = ρ{βcov[yt−1 , εt ] + cov[εt , εt ]} (13.3)
= ρ{βcov[yt−1 , εt ] + var[εt ]} (13.4)

• Since cov[yt−1 , εt ] = ρcov[yt , εt ] (from above) we have:


cov[yt−1 , εt ] = ρβcov[yt−1 , εt ] + ρvar[εt ]
so that
cov[yt−1, εt] = ρ var[εt] / (1 − βρ).
• Given our calculation of var[εt ] when the error is AR(1) (see Eq. 13.1):
cov[yt−1, εt] = ρσ²u / [(1 − βρ)(1 − ρ²)]

• Thus, if ρ is positive, the estimate of β is biased upward (more of the fit


is imputed to the lagged dependent variable than to the systematic relation
between the error terms). Moreover, since the covariance above will not
diminish to zero in the limit as T → ∞, the estimated regression coefficients
will also be inconsistent.
• It can also be shown that the estimate of ρ in the Durbin-Watson test will
be biased downward, leading us to accept the null hypothesis of no auto-
correlation too often. For that reason, when we include a lagged dependent
variable in the model, we should be careful to use the Breusch-Godfrey test
to determine whether autocorrelation is present.

13.9 Bias and “contemporaneous correlation”


• A more immediate question is what general phenomenon produces this bias
and how to obtain unbiased estimates when this problem exists. The answer
turns out to be a general one, so it is worth exploring.
• The bias (and inconsistency) in this case arose because of a violation of the
G-M assumptions E[ε] = 0 and X is a known matrix of constants (“fixed in
repeated samples”). We needed this so that we could say that E[X0 ε] = 0.
This yielded our proof of unbiasedness.

E[β̂OLS] = β + E[(X′X)⁻¹X′ε] = β

• Recall we showed that we could relax the assumption that X is fixed in


repeated samples if we substituted the assumption that X is “strictly exoge-
nous”, so that E[ε|X] = 0. This will also yield the result that E[X0 ε] = 0
• The problem in the case of the lagged dependent variable is that E[X0 ε] 6= 0.
When there is a large and positive error in the preceding period, we would
be likely to get a large and positive εt and a large and positive yt−1 .
• Thus, we see a positive relationship between the current error and the regres-
sor. This is all that we need to get bias, a “contemporaneous” correlation
between a regressor and the error term, such that cov[xt, εt] ≠ 0.

13.10 Measurement error


• Another case in which we get biased and inconsistent estimates because of
contemporaneous correlation between a regressor and the error is measure-
ment error.
• Suppose that the true relationship does not contain an intercept and is:

y = βx + ε

but x is measured with error as z, where z = x + u is what we observe and


E[ut ] = 0, E[u2t ] = σu2 .

• This implies that x can be written as z − u and the true model can then be
written as:
y = β(z − u) + ε = βz + (ε − βu) = βz + η
The new disturbance, η, is a function of u (the error in measuring x). z is also
a function of u. This sets up a non-zero covariance (and correlation) between
z, our imperfect measure of the true regressor x, and the new disturbance
term, (ε − βu).
• The covariance between them, as we showed then, is equal to:
cov[z, (ε − βu)] = cov[x + u, ε − βu] = −βσu2

• How does this lead to bias? We go back to the original equation for the
expectation of the estimated β̂ coefficient, E[β̂] = β + E[(X′X)⁻¹X′ε]. It is
easy to show:
E[β̂] = E[(z′z)⁻¹(z′y)] = E[(z′z)⁻¹(z′(βz + η))]    (13.5)
= E{[(x + u)′(x + u)]⁻¹(x + u)′(βx + ε)}    (13.6)
= E[ Σi(xi + ui)(βxi + εi) / Σi(xi + ui)² ] ≈ β Σi x²i / (Σi x²i + nσ²u)    (13.7)

• Note: establishing a non-zero covariance between the regressor and the error
term is sufficient to prove bias but is not the same as indicating the direction
of the bias. In this special case, however, β̂ is biased toward zero (attenuated).
• To show consistency we must show that:
plim β̂ = β + plim(X′X/n)⁻¹ · plim(X′ε/n) = β    (recall Q∗ = plim X′X/n)
• This involves having to show that plim X′ε/n = 0.
• In the case above,
plim β̂ = plim[(1/n)Σi(xi + ui)(βxi + εi)] / plim[(1/n)Σi(xi + ui)²] = βQ∗/(Q∗ + σ²u)

• Describing the bias and inconsistency in the case of a lagged dependent vari-
able with autocorrelation would follow the same procedure. We would look
at the expectation term to show the bias and the probability limit to show
inconsistency. See particularly Greene, 6th ed., pp. 325–327.

13.11 Instrumental variable estimation


• In any case of contemporaneous correlation between a regressor and the error
term, we can use an approach known as instrumental variable (or IV) esti-
mation. The intuition to this approach is that we will find an instrument for
X, where X is the variable correlated with the error term. This instrument
should ideally be highly correlated with X, but uncorrelated with the error
term.
• We can use more than one instrument to estimate βIV . Say that we have one
variable, x1 , measured with error in the model:

y = β0 + β1 x1 + β2 x2 + ε

• We believe that there exists a set of variables, Z, that is correlated with x1


but not with ε. We estimate the following model:

x1 = Zα + u

where α is the vector of regression coefficients on Z, with α̂ = (Z′Z)⁻¹Z′x1.


• We then calculate the fitted or predicted values of x1, or x̃1, equal to Zα̂,
which is Z(Z′Z)⁻¹Z′x1. These fitted values should not be correlated with
the error term because they are derived from instrumental variables that are
uncorrelated with the error term.
• We then use the fitted values x̃1 in the original model for x1 .

y = β0 + β1 x̃1 + β2 x2 + ε

This gives an unbiased estimate βIV of β1 .



In the case of a model with autocorrelation and a lagged dependent variable,


Hatanaka (1974), suggests the following IV estimation for the model:

yt = Xβ + γyt−1 + εt

where εt = ρεt−1 + ut
• Use the predicted values from a regression of yt on Xt and Xt−1 as an estimate
of yt−1 . The coefficient on ỹt−1 is a consistent estimate of γ, so it can be used
to estimate ρ and perform FGLS.

13.12 In the general case, why is IV estimation unbiased


and consistent?
• Suppose that
y = Xβ + ε
where X contains one regressor that is contemporaneously correlated with
the error term and the other variables are uncorrelated. The intercept and
the variables uncorrelated with the error can serve as their own (perfect)
instruments.
• Each instrument is correlated with the variable of interest and uncorrelated
with the error term. We have at least one instrument for the explanatory
variable correlated with the error term. By regressing X on Z we get X̃, the
predicted values of X. For each uncorrelated variable, the predicted value is
just itself since it perfectly predicts itself. For the correlated variables, the
predicted value is the value given by the first-stage model.

X̃ = Zα̂ = Z(Z′Z)⁻¹Z′X

• Then

β̂IV = (X̃′X̃)⁻¹X̃′y
= [X′Z(Z′Z)⁻¹Z′Z(Z′Z)⁻¹Z′X]⁻¹ (X′Z(Z′Z)⁻¹Z′y)

• Simplifying and substituting for y we get:

E[β̂IV] = E[(X̃′X̃)⁻¹X̃′y]
= E[(X′Z(Z′Z)⁻¹Z′X)⁻¹(X′Z(Z′Z)⁻¹Z′(Xβ + ε))]
= E{[X′Z(Z′Z)⁻¹Z′X]⁻¹(X′Z(Z′Z)⁻¹Z′X)β
+ [X′Z(Z′Z)⁻¹Z′X]⁻¹(X′Z(Z′Z)⁻¹Z′ε)}
= β + 0

since E[Z′ε] is zero by assumption.


Section 14

Simultaneous Equations Models and 2SLS


14.1 Simultaneous equations models and bias
• Where we have a system of simultaneous equations we can get biased and
inconsistent regression coefficients for the same reason as in the case of mea-
surement error (i.e., we introduce a “contemporaneous correlation” between
at least one of the regressors and the error term).

14.1.1 Motivating example: political violence and economic growth


• We are interested in the links between economic growth and political violence.
Assume that we have good measures of both.
• We could estimate the following model

Growth = f (V iolence, OtherF actors)

(This is the bread riots are bad for business model).


• We could also estimate a model

V iolence = f (Growth, OtherF actors)

(This is the hunger and privation causes bread riots model).


• We could estimate either by OLS. But what if both are true? Then violence
helps to explain growth and growth helps to explain violence.
• In previous estimations we have treated the variables on the right-hand side
as exogenous. In this case, however, some of them are endogenous because
they are themselves explained by another causal model. The model now has
two equations:
Gi = β0 + β1 Vi + εi
Vi = α0 + α1 Gi + ηi


14.1.2 Simultaneity bias


• What happens if we just run OLS on the equation we care about? In general,
this is a bad idea (although we will get into the case in which we can). The
reason is simultaneity bias.
• To see the problem, we will go through the following simulation.

1. Suppose some random event sparks political violence (ηi is big). This
could be because you did not control for the actions of demagogues in
stirring up crowds.
2. This causes growth to fall through the effect captured in β1 .
3. Thus, Gi and ηi are negatively correlated.

• What happens in this case if we estimate Vi = α0 + α1 Gi + ηi by OLS without


taking the simultaneity into account? We will mis-estimate α1 because growth
tends to be low when ηi is high.
• In fact, we are likely to estimate α1 with a negative bias, because E[G′η] is
negative. We may even produce the mistaken result that low growth produces
high violence solely through the simultaneity.
• To investigate the likely direction and existence of bias, let's look at E[X′ε],
or the covariance between the explanatory variables and the error term.

E[G′η] = E[(β0 + β1V + ε)η] = E[(β0 + β1(α0 + α1G + η) + ε)η]

• Multiplying through and taking expectations we get:

E[G′η] = E[(β0 + β1α0)′η] + E[β1α1G′η] + E[β1η′η] + E[ε′η]

• Passing through the expectations operator we get:

E[G′η] = (β0 + β1α0)E[η] + β1α1E[G′η] + β1E[η²] + E[ε]E[η]

(E[ε′η] = E[ε]E[η] because the two error terms are assumed independent).

• Since E[ε] = 0, E[η] = 0, and E[η²] = σ²η, we get:

E[G′η] = β1α1E[G′η] + β1σ²η

So, (1 − β1α1)E[G′η] = β1σ²η and E[G′η] = β1σ²η/(1 − β1α1)

• Since this is non-zero, we have a bias in the estimate of α1 , the effect of


growth, G, on violence, V .
• The equation above indicates that the size of bias will depend on the variance
of the disturbance term, η, and the size of the coefficient on V, β1 . These
parameters jointly determine the magnitude of the feedback effect from errors
in the equation for violence, through the impact of violence, back onto growth.
• The actual computation of the bias term is not quite this simple in the case
of the models above because we will have other variables in the matrix of
right-hand side variables, X.
• Nevertheless, the expression above can help us to deduce the likely direction
of bias on the endogenous variable. When we have other variables in the
model, the coefficients on these variables will also be affected by bias, but we
cannot tell the direction of this bias a priori.
• How can we estimate the causal effects without bias in the presence of simul-
taneity? Doing so will involve re-expressing all the endogenous variables as
a function of the exogenous variables and the error terms. This leads to the
question of the “identification” of the system.

14.2 Reduced form equations


• In order to estimate a system of simultaneous equations (or any equation in
that system) the model must be either “just identified” or “over-identified”.
These conditions depend on the exogenous variables in the system of equa-
tions.

• Suppose we have the following system of simultaneous equations:


Gi = β0 + β1 Vi + β2 Xi + εi
Vi = α0 + α1 Gi + α2 Zi + ηi
• We can insert the second equation into the first in order to derive an expres-
sion for growth that does not involve the endogenous variable, violence:
Gi = β0 + β1 (α0 + α1 Gi + α2 Zi + ηi ) + β2 Xi + εi
Then you can pull all the expressions involving Gi onto the left-hand side and
divide through to give:
 
Gi = (β0 + β1α0)/(1 − β1α1) + [β1α2/(1 − β1α1)]Zi + [β2/(1 − β1α1)]Xi + (εi + β1ηi)/(1 − β1α1)
• We can perform a similar exercise to write the violence equation as:
 
Vi = (α0 + α1β0)/(1 − α1β1) + [α1β2/(1 − α1β1)]Xi + [α2/(1 − α1β1)]Zi + (ηi + α1εi)/(1 − α1β1)
• We can then estimate these reduced form equations directly.
Gi = γ0 + γ1 Zi + γ2 Xi + ε∗i
Vi = φ0 + φ1 Zi + φ2 Xi + ηi∗
Where,
γ0 = (β0 + β1α0)/(1 − β1α1),   γ1 = β1α2/(1 − β1α1),   γ2 = β2/(1 − β1α1)
Also,
φ0 = (α0 + α1β0)/(1 − α1β1),   φ1 = α2/(1 − α1β1),   φ2 = α1β2/(1 − α1β1)
• In this case, we can solve backward from this for the original regression
coefficients of interest, α1 and β1. We can re-express the coefficients from the
reduced form equations above to yield:
α1 = φ2/γ2   and   β1 = γ1/φ1

• This is often labeled “Indirect Least Squares,” a method that is mostly of


pedagogic interest.
• Thus, we uncover and estimate the true relationship between growth and
violence by looking at the relationship between those two variables and the
exogenous variables. We say that X and Z identify the model.
• Without the additional variables, there would be no way to estimate α1 and
β1 from the reduced form. The model would be under-identified.
• It is also possible to have situations where there is more than one solution
for α1 and β1 from the reduced form. This can occur if either equation has
more than one additional variable.
• Since one way to estimate the coefficients of interest without bias is to es-
timate the reduced form and compute the original coefficients from this di-
rectly, it is important to determine whether your model is identified. We can
estimate just-identified and over-identified models, but not under-identified
models.

14.3 Identification
• There are two conditions to check for identification: the order condition and
the rank condition. In theory, since the rank condition is more binding, one
checks first the order condition and then the rank condition. In practice, very
few people bother with the rank condition.

14.3.1 The order condition


• Let g be the number of endogenous variables in the system (here 2) and let
k be the total number of variables (endogenous and exogenous) missing from
the equation under consideration. Then:

1. If k = g − 1, the equation is exactly identified.


2. If k > g − 1, the equation is over-identified.
3. If k < g − 1, the equation is under-identified.

• In general, this means that there must be at least one exogenous variable in
the system, excluded from that equation, in order to estimate the coefficient
on the endogenous variable that is included as an explanatory variable in that
equation.
• These conditions are necessary for a given degree of identification. The Rank
Condition is sufficient for each type of identification. The Rank Condition
assumes the order condition and adds that the reduced form equations must
each have full rank.

14.4 IV estimation and two-stage least squares


1. If the model is just-identified, we could regress the endogenous variables on
the exogenous variables and work back from the reduced form coefficients to
estimate the structural parameters of interest.
2. If the model is just or over-identified, we can use the exogenous variables as
instruments for Gi and Vi . In this case, we use the instruments to form esti-
mates of the two endogenous variables, G̃i and Ṽi that are now uncorrelated
with the error term and we use these in the original structural equations.

• The second method is generally easier. It is also known as Two-Stage Least


Squares, or 2SLS because it inolves the following stages:

1. Estimate the reduced form equations using OLS. To do this, regress each
endogenous variable on all the exogenous variables in the system. In our
running example we would have:

Vi = φ0 + φ1 Zi + φ2 Xi + ηi∗

and
Gi = γ0 + γ1 Zi + γ2 Xi + ε∗i

2. From these first-stage regressions, estimate G̃i and Ṽi for each observa-
tion. The predicted values of the endogenous variables can then be used
to estimate the structural models:
Vi = α0 + α1 G̃i + α2 Zi + ηi
Gi = β0 + β1 Ṽi + β2 Xi + εi
• This is just what we would do in standard IV estimation, which is to regress
the problem variable on its instruments and then use the predicted value in
the main regression.

14.4.1 Some important observations


• This method gives you unbiased and consistent estimates of α1 and β1 . An-
other way of saying this is that G̃i and Ṽi are good instruments for Gi and
Vi .
➢ A good instrument is highly correlated with the variable that we are
instrumenting for and uncorrelated with the error term. In this case,
G̃i and Ṽi are highly correlated with Gi and Vi because we use all the
information available to us in the exogenous variables to come up with
an estimate.
➢ Second, G̃i and Ṽi are uncorrelated with the error terms ηi and εi because
of the properties of OLS and the way we estimate the first stage. A
property of OLS is that the estimated residuals are uncorrelated with
the regressors and thus uncorrelated with the predicted values (G̃i and
Ṽi ).
➢ Thus, since the direct effect of growth on Vi , for example, in the first
reduced form equation is part of the estimated residual, the predicted
value Ṽi can be treated as exogenous to growth. We have broken the
chain of connection that runs from one model to another. We can say
that E[G̃′η] = 0 and E[Ṽ′ε] = 0.
• When the system is exactly identified, 2SLS will give you results that are iden-
tical to those you would obtain from estimated the reduced form equations
and using those coefficients directly to estimate α1 and β1 .

• There is one case in which estimation of a system of simultaneous equations


by OLS will not give you biased and inconsistent estimates. This is the case
of recursive systems. The following is an example of a recursive system:
yi = α0 + α1 Xi + εi
pi = β0 + β1 yi + β2 Ri + ηi
qi = δ0 + δ1 yi + δ2 pi + δ3 Si + νi
Here, all the errors are independent. In this system of recursive equations,
substituting for the endogenous variables yi and pi will ultimately get you to
the exogenous variable Xi , so we don’t get the feedback loops and correlations
between regressor and the error that we did in the earlier case.
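• A hedged Stata sketch of 2SLS for each structural equation, using the newer ivregress syntax (ivreg in older versions); G, V, X, and Z here stand for whatever measures of growth, violence, and the exogenous variables are in your data:
ivregress 2sls V Z (G = X), first    // violence equation: G endogenous, X the excluded instrument
ivregress 2sls G X (V = Z), first    // growth equation: V endogenous, Z the excluded instrument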

14.5 Recapitulation of 2SLS and computation of goodness-


of-fit
Let us review the two-stage least square procedure:
1. We first estimated the reduced form equations for Violence and Growth using
OLS.
Vi = φ0 + φ1 Zi + φ2 Xi + ηi∗
and
Gi = γ0 + γ1 Zi + γ2 Xi + ε∗i
We used the predicted value of Gi and Vi , or G̃i and Ṽi , as instruments in the
second stage.
2. We estimated the structural equations using the instruments:
Vi = α0 + α1 G̃i + α2 Zi + ηi
Gi = β0 + β1 Ṽi + β2 Xi + εi
And get unbiased coefficients.
• We would normally compute R² as:
R² = SSR/SST = Σ(xi − x̄)²β̂² / Σ(yi − ȳ)²
or
R² = 1 − SSE/SST = 1 − Σe²i / Σ(yi − ȳ)² = 1 − e′e/(y′M⁰y)

• In the second stage, however, the estimated residuals are:


η̂i = Vi − (α̂0 + α̂1 G̃i + α̂2 Zi )
ε̂i = Gi − (β̂0 + β̂1 Ṽi + β̂2 Xi )

• If we use these residuals in our computation of R2 we will get a statistic that


tells us how well the model with the instruments fits the data. If we want
an estimate of how well the original structural model fits the data, we should
estimate the residuals using the true endogenous variables Gi and Vi . Thus,
we use:
η̂i = Vi − (α̂0 + α̂1 Gi + α̂2 Zi )
ε̂i = Gi − (β̂0 + β̂1 Vi + β̂2 Xi )

• This gives us an estimate of the fit of the structural model. There is one oddity
in the calculated R² that may result. When we derived R², we used the fact
that the OLS normal equations force the residuals to satisfy X′e = 0.
This gave us the result that:
SST = SSR + SSE   or   Σ(yi − ȳ)² = Σ(xi − x̄)²β̂² + Σe²i
The variances in y can be completely partitioned between the variance from
the model and the variance from the residuals. This is no longer the case
when you re-estimate the errors using the real values of Gi and Vi rather
than their instruments and you can get cases where SSE > SST. In this
case, you can get negative values of R2 in the second stage. This may be
perfectly okay if the coefficients are of the right sign and the standard errors
are small.
• It also matters in this case whether you estimate R² as SSR/SST or as 1 − SSE/SST.

14.6 Computation of standard errors in 2SLS


• Let us denote all the original right-hand side variables in the two structural
models as X, the instrumented variables as X̃, and the exogenous variables
that we used to estimate X̃ as Z.

• In section 13.12, we showed that:

β̂IV = (X̃′X̃)⁻¹X̃′y = (X′Z(Z′Z)⁻¹Z′X)⁻¹(X′Z(Z′Z)⁻¹Z′(Xβ + ε))
= β + (X′Z(Z′Z)⁻¹Z′X)⁻¹(X′Z(Z′Z)⁻¹Z′ε)
= β + (X̃′X̃)⁻¹(X̃′ε)

var[β̂IV] = E[(β̂IV − β)(β̂IV − β)′] = E[(X̃′X̃)⁻¹X̃′εε′X̃(X̃′X̃)⁻¹]

• If the errors conform to the Gauss-Markov assumptions, then E[εε′] = σ²IN
and
var[β̂IV] = σ²(X̃′X̃)⁻¹
• To estimate σ² we would normally use e′e/(n − K).

• As with R2 , we should use the estimated residuals from the structural model
with the true variables in it rather than the predicted values. These are
consistent estimates of the true disturbances.

η̂i = Vi − (α̂0 + α̂1 Gi + α̂2 Zi )

ε̂i = Gi − (β̂ 0 + β̂ 1 Vi + β̂ 2 Xi )

• The IV estimator can have very large standard errors, because the instru-
ments by which X is proxied are not perfectly correlated with it and your
residuals will be larger.

14.7 Three-stage least squares


• What if, in the process above, we became concerned that E[εε′] ≠ σ²IN?
• We would perform three-stage least squares.
1. Estimate the reduced form equations in OLS and calculate predicted
values of the endogenous variables G̃i and Ṽi .
2. Estimate the structural equations with the fitted values.
3. Use the residuals calculated in the manner above (using the actual val-
ues of Gi and Vi ) to test for heteroskedasticity and/or autocorrelation
and compute the appropriate standard errors if either are present. In
the case of heteroskedasticity, this would mean either FGLS and a data
transformation or White-Corrected Standard Errors. In the case of auto-
correlation, this would mean either FGLS and a data transformation or
Newey-West standard errors.

14.8 Different methods to detect and test for endogeneity


1. A priori tests for endogeneity – Granger Causality.
2. A test to look at whether the coefficients under OLS are markedly different
from the coefficients under 2SLS – Hausman Specification Test.

14.8.1 Granger causality


• A concept that is often used in time series work to define endogeneity is
“Granger Causality.” What it defines is really “pre-determinedness.” Granger
causality is absent when we can say that:

f (xt |xt−1 , yt−1 ) = f (xt |xt−1 )

The definition states that in the conditional distribution, lagged values of yt


add no information to our prediction of xt beyond that provided by lagged
values of xt itself.
• This is tested by estimating:

xt = β0 + β1 xt−1 + β2 yt−1 + εt

• If a t-test indicates that β2 = 0, then we say that y does not Granger cause x.
If y does not Granger cause x, then x is often said to be exogenous in a system
of equations with y. Here, exogeneity implies only that prior movements in
y do not lead to later movements in x.
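• A minimal Stata sketch of this regression-based check, assuming tsset data and placeholder names x and y:
tsset year
regress x L.x L.y
test L.y                             // H0: lagged y adds nothing, i.e., y does not Granger cause x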
• Kennedy has a critique of Granger Causality in which he points out that under
this definition weather reports “cause” the weather and that an increase in
Christmas card sales “cause” Christmas. The problem here is that variables
that are based on expectations (that the weather will be rainy, that Christmas
will arrive) cause earlier changes in behavior (warnings to carry an umbrella
and a desire to buy cards).

14.8.2 The Hausman specification test


• If β̂ OLS and β̂ IV are “close” in magnitude, then it would appear that endo-
geneity is not producing bias.
• This intuition has been formalized into a test by the econometrician Jerry
Hausman. The test is called a specification test because it tells you whether
you were right to use 2SLS. In this case, however, it can also be used to test
the original assumption of endogeneity. The logic is as follows:
• H0 : There is no endogeneity
In this case both β̂ OLS and β̂ IV are consistent and β̂ OLS is efficient relative
to β̂ IV (recall that OLS is BLUE).
• H1 : There is endogeneity
In this case β̂ IV remains consistent while β̂ OLS is inconsistent. Thus, their
values will diverge.
• The suggestion, then, is to examine d = (β̂ IV − β̂ OLS ). The question is how
large this difference should be before we assume that something is up. This
will depend on the variance of d.
• Thus, we can form a Wald statistic to test the hypothesis above:

W = d′[Estimated Asymptotic Variance(d)]⁻¹d

• The trouble, for a long time, was that no-one knew how to estimate this vari-
ance since it should involve the covariances between β̂ IV and β̂ OLS . Hausman
solved this by proving that the covariance between an efficient estimator β̂ OLS
of a parameter vector β and its difference from an inefficient estimator β̂ IV
of the same parameter vector is zero (under the null). (For more details see
Greene, 6th ed., Section 12.4.)
• Based on this proof, we can say that:

Asy. var[β̂ IV − β̂ OLS ] = Asy. var[β̂ IV ] − Asy. var[β̂ OLS ]



• Under the null hypothesis, we are using two different but consistent estimators
of σ 2 . If we use s2 as a common estimator of this, the Hausman statistic will
be:
H = (β̂IV − β̂OLS)′[(X̃′X̃)⁻¹ − (X′X)⁻¹]⁻¹(β̂IV − β̂OLS) / s²
• This test statistic is distributed χ2 but the appropriate degrees of freedom for
the test statistic will depend on the context (i.e., how many of the variables
in the regression are thought to be endogenous).

14.8.3 Regression version


• The test statistic above can be automatically computed in most standard
software packages. In the case of IV estimation (of which 2SLS is an example)
there is a completely equivalent way of running the Hausman test using an
“auxiliary regression”.
• Assume that the model has K1 potentially endogenous variables, X, and K2
remaining variables, W. We have predicted values of X, X̃, based on the
reduced form equations. We estimate the model:

y = Xβ + X̃α + Wγ + ε

• The test for endogeneity is performed as an F -test on the K1 regression


coefficients α being different from zero, where the degrees of freedom are K1
and (n − (K1 + K2 ) − K1 ). If α = 0, then X is said to be exogenous.
• The intuition is as follows. If the X variables are truly exogenous, then the
β should be unbiased and there will be no extra information added by the
fitted values. If the X variables are endogenous, then the fitted values will
add extra information and account for some of the variation in y. Thus, they
will have a coefficient on them significantly different from zero.
• Greene does not go through the algebra to prove that the augmented regres-
sion is equivalent to running the Hausman test but refers readers to Davidson
and MacKinnon (1993). Kennedy has a nice exposition of the logic above on
p. 197–198.

14.8.4 How to do this in Stata


1. To perform 2SLS or any other type of IV estimation, see ivreg.
The command is:
ivreg depvar [varlist1] (varlist2 = varlist_iv)
2. To perform 3SLS, see reg3
3. To perform a Hausman test, see hausman
The hausman test is used in conjunction with other regression commands.
To use it, you would:
Run the less efficient model (here IV or reg3)
hausman, save
Run the fully efficient model (here OLS)
hausman
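4. In newer versions of Stata, a sketch of the equivalent workflow (placeholder names) is:
ivregress 2sls y w (x = z1 z2)       // 2SLS with x endogenous, z1 and z2 as instruments
estat endogenous                     // Durbin/Wu-Hausman tests of the endogeneity of x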

14.9 Testing for the validity of instruments


• We stated previously that the important features of “good” instruments, Z,
are that they be highly correlated with the endogenous variables, X, and
uncorrelated with the true errors, ε.
• The first requirement can be tested fairly simply by inspecting the reduced
form model in which each endogenous variables is regressed on its instruments
to yield predicted values, X̃. In a model with one instrument, look at the
t-statistic. In an IV model with multiple instruments, look at the F -statistic.
• For the second requirement, the conclusion that Z and ε are uncorrelated in
the case of one instrument must be a leap of faith, since we cannot observe ε
and must appeal to theory or introspection.

• For multiple instruments, however, so long as we are prepared to believe that


one of the instruments is uncorrelated with the error, we can test the assump-
tion that remaining instruments are uncorrelated with the error term. This
is done via an auxiliary regression and is know as “testing over-identification
restrictions.”
• It is called this because, if there is more than one instrument, then the en-
dogenous regressor is over-identified. The logic of the test is simple.
➢ If at least one of the instruments is uncorrelated with the true error term,
then 2SLS gives consistent estimates of the true errors.
➢ The residuals from the 2SLS estimation can then be used as the depen-
dent variable in a regression on all the instruments.
➢ If the instruments are, in fact, correlated with the true errors, then this
will be apparent in a significant F -statistic on the instruments being
jointly significant for the residuals.
• Thus, the steps in the test for over-identifying restrictions are as follows:

1. Estimate the 2SLS residuals using η̂i = Vi − (α̂0 + α̂1Gi + α̂2Zi). Use η̂i as
your consistent estimate of the true errors.
2. Regress η̂i on all the instruments used in estimating the 2SLS coefficients.
3. You can approximately test the over-identifying restrictions via inspec-
tion of the F -statistic for the null hypothesis that all instruments are
jointly insignificant. However, since the residuals are only consistent for
the true errors, this test is only valid asymptotically and you should
technically use a large-sample test.
➢ An alternative is to use n · R², which is asymptotically distributed χ² with
df = # of instruments − # of endogenous regressors. If we reject the
null ⇒ at least some of the IVs are not exogenous.
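• In Stata, an over-identification test is available directly after ivregress 2sls; a minimal sketch with placeholder names:
ivregress 2sls y w (x = z1 z2)
estat overid                         // Sargan and Basmann tests of the over-identifying restrictions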
Section 15

Time Series Modeling


15.1 Historical background
• Genesis of modern time series models: large structural models of the macro-
economy, which involved numerous different variables, were poor predictors
of actual economic outcomes.
• Box and Jenkins showed that the present and future values of an economic
variable could often be better predicted by its own past values than by other
variables—dynamic models.
• In political science: used in models of presidential approval and partisan
identity.

15.2 The auto-regressive and moving average specifica-


tions
• A time series is a sequence of numerical data in which each observation is
associated with a particular instant in time.
• Univariate time series analysis: analysis of a single sequence of data
• Multivariate time series analysis: several sets of data for the same sequence
of time periods
• The purpose of time series analysis is to study the dynamics, or temporal
structure of the data.
• Two representations of the temporal structure that will allow us to describe
almost all dynamics (for “stationary” sequences) are the auto-regressive and
moving-average representations.


15.2.1 An autoregressive process


• The AR(1) process:
yt = µ + γyt−1 + εt
is said to be auto-regressive (or self-regressive) because the current value is
explained by past values, so that:
E[yt ] = µ + γyt−1

• This AR(1) process also contains a per-period innovation of µ, although this


is often set to zero. This is sometimes referred to as a “drift” term. A more
general, pth-order autoregression or AR(p) process would be written:
yt = µ + γ1yt−1 + γ2yt−2 + ... + γpyt−p + εt

• In the case of an AR(1) process we can substitute infinitely for the y terms
on the right-hand side (as we did previously) to show that:

yt = µ + γµ + γ²µ + . . . + εt + γεt−1 + γ²εt−2 + . . . = Σ_{i=0}^{∞} γ^i µ + Σ_{i=0}^{∞} γ^i εt−i

• So, one way to remember an auto-regressive process is that your current state
is a function of all your previous errors.
• We can present this information far more simply using the lag operator or
L:
Lxt = xt−1 and L2 xt = xt−2 and (1 − L)xt = xt − xt−1
• Using the lag operator, we can write the original AR(1) series as:
yt = µ + γLyt + εt
so that:
(1 − γL)yt = µ + εt
and
yt = µ/(1 − γL) + εt/(1 − γL) = µ/(1 − γ) + εt/(1 − γL) = Σ_{i=0}^{∞} γ^i µ + Σ_{i=0}^{∞} γ^i εt−i

• The last step comes from something we encountered before: how to represent
an infinite series:
A = x(1 + a + a² + . . . + a^n)
If |a| < 1, then as n → ∞ the solution to this series is A = x/(1 − a). In
other words, the sequence is convergent (has a finite solution).

• Thus, for the series Σ_{i=0}^{∞} γ^i εt−i = εt + γεt−1 + γ²εt−2 + . . . = εt + γLεt + γ²L²εt + . . . ,
we have that a = γL and Σ_{i=0}^{∞} γ^i εt−i = εt/(1 − γL).
• For similar reasons, Σ_{i=0}^{∞} γ^i µ = µ/(1 − γL) = µ/(1 − γ), because Lµ = µ. In other words,
µ, the per-period drift term, is not subscripted by time and is assumed to be
the same in each period.
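• A short Stata simulation may make the algebra concrete: an AR(1) series with γ = 0.7 and µ = 0 (all names and values here are illustrative):
clear
set obs 200
set seed 12345
generate t = _n
tsset t
generate eps = rnormal()
generate y = eps in 1                // start the series at the first innovation
replace y = 0.7*L.y + eps in 2/l     // recursively build y_t = 0.7*y_{t-1} + eps_t
corrgram y, lags(10)                 // autocorrelations should decay roughly geometrically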

15.3 Stationarity
• So, an AR(1) process can be written quite simply as:
yt = µ/(1 − γ) + εt/(1 − γL)

• Recall, though, that this requires that |γ| < 1. If |γ| ≥ 1 then we cannot
even define yt . yt keeps on growing as the error terms collect. Its expectation
will be undefined and its variance will be infinite.
• That means that we cannot use standard statistical procedures if the auto-
regressive process is characterized by |γ| > 1. This is known as the station-
arity condition and the data series is said to be stationary if |γ| < 1.
• The problem is that our standard results for consistency and for hypothesis
testing requires that (X0 X)−1 is a finite, positive matrix. This is no longer
true. The matrix (X0 X) will be infinite when the data series X is non-
stationary.

• In the more general case of an AR(p) model, the only difference is that the lag
function by which we divide the right-hand side, (1 − γL), is more complex
and is often written as C(L). In this case, the stationarity condition requires
that the roots of this more complex expression “lie outside the unit circle.”

15.4 A moving average process


• A first-order, moving average process MA(1) is written as:

yt = µ + εt − θεt−1

In this case, your current state depends on only the current and previous
errors.
Using the lag operator:
yt = µ + (1 − θL)εt
Thus,
yt/(1 − θL) = µ/(1 − θL) + εt
Once again, if |θ| < 1, then we can invert the series and express yt as an
infinite series of its own lagged values:
yt = µ/(1 − θ) − θyt−1 − θ²yt−2 − . . . + εt

Now we have written our MA process as an AR process of infinite lag length


p, describing yt in terms of all its own past values and the contemporaneous
error term. Thus, an MA(1) process can be written as an infinite AR(p)
process.
• Similarly, when we expressed the AR(1) function in terms of all past errors
terms, we were writing it as an infinite MA(p) process.
• Notice, that this last step once again relies on the condition that |θ| < 1.
This is referred to in this case as the invertibility condition, implying that
we can divide through by (1 − θL).

• If we had a more general, MA(q) process, with more lags, we could go through
the same steps, but we would have a more complex function of the lags than
(1 − θL). Greene’s textbook refers to this function as D(L). In this case, the
invertibility condition is satisfied when the roots of D(L) lie outside the unit
circle (see Greene, 6th ed., pp. 718–721).

15.5 ARMA processes


• Time series can also be posited to contain both AR and MA terms. However,
if we go through the inversion above, getting:
yt = µ/(1 − θ) − θyt−1 − θ²yt−2 − . . . + εt
and then substitute for the lagged yt s in the AR process, we will arrive at an
expression for yt that is based only on a constant and a complex function of
past errors. See Greene, 6th ed., p. 717 for an example.
• We could also write an ARMA(1,1) process as:
yt = µ/(1 − γ) + [(1 − θL)/(1 − γL)]εt

• An ARMA process with p autoregressive components and q moving average


components is called an ARMA(p, q) process.
• Where does this get us? We can estimate yt and apply the standard proofs
of consistency if the time series is stationary, so it makes sense to discuss
what stationarity means in the context of time series data. If the time series
is stationary and can be characterized by an AR process, then the model
can be estimated using OLS. If it is stationary and characterized by an MA
process, you will need to use a more complicated estimation procedure (non-
linear least squares).
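• In Stata, stationary AR, MA, and ARMA models can all be fit by maximum likelihood with arima; a minimal sketch assuming tsset data and a placeholder series y:
arima y, ar(1)                       // AR(1)
arima y, ma(1)                       // MA(1), estimated by non-linear methods
arima y, arima(1,0,1)                // ARMA(1,1)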

15.6 More on stationarity


• There are two main concepts of stationarity applied in the literature.
1. Strict Stationarity: For the process yt = ρyt−1 + εt , strict stationarity
implies that:
➢ E[yt ] = µ exists and is independent of t.
➢ var[yt ] = γ0 is a finite, positive constant, independent of t.
➢ cov[yt , ys ] = γ(|t − s|) is a finite function of |t − s|, but not of t or s.
➢ AND all other “higher order moments” (such as skewness or kurtosis)
are also independent of t.
2. Weak Stationarity (or covariance stationarity): removes the condition
on the “higher order moments” of yt .
• A stationary time series will tend to revert to its mean (mean reversion) and
fluctuations around this mean will have a broadly consistent amplitude.
• Intuition: if we take two slices from the data series, they should have approxi-
mately the same mean and the covariance between points should depend only
on the number of time periods that divide them. This will not be true of “in-
tegrated” series, so we will have to transform the data to make it stationary
before estimating the model.

15.7 Integrated processes, spurious correlations, and test-


ing for unit roots
• One of the main concerns w/ non-stationary series is spurious correlation.
• Suppose you have a non-stationary, highly trending series, yt , and you regress
it on another highly trending series, xt . You are likely to find a significant
relationship between yt and xt even when there was none, because we see
upward movement in both produced in their own dynamics.
• Thus, when the two time series are non-stationary, standard critical values
of the t and F statistics are likely to be highly misleading about true causal
relationships.

• The question is, what kind of non-stationary sequence do we have and how
can we tell it is non-stationary? Consider the following types of non-stationary
series:
1. The Pure Random Walk

yt = yt−1 + εt

This DGP can also be written in terms of the initial value and all past shocks:

yt = y0 + ε1 + ε2 + · · · + εt

E[yt ] = E[y0 + ε1 + · · · + εt ] = y0
➢ In similar fashion it can be shown that the variance of yt is tσ².
Thus, the mean is constant but the variance increases indefinitely as
the number of time points grows.
➢ If you take the first difference of the data process, however, ∆yt =
yt − yt−1 , we get:
yt − yt−1 = εt
The mean of this process is constant (and equal to zero) and its
variance is also a finite constant. Thus, the first difference of a random
walk process is a difference stationary process.
2. The Random Walk with Drift

yt = µ + yt−1 + εt

For the random walk with drift process, we can show that E[yt ] = y0 + tµ
and var[yt ] = tσ 2 . Both the mean and the variance are non-constant.
➢ In this case, first differencing of the series also will give you a variable
that has a constant mean and variance.

3. The Trend Stationary Process

yt = µ + βt + εt

➢ yt is non-stationary because the mean of yt is equal to µ + βt, which


is non-constant, although its variance is constant and equal to σ 2 .
➢ Once the values of µ and β are known, however, the mean can be
perfectly predicted. Therefore, if we subtract the mean of yt from
yt , the resulting series will be stationary, and is thus called a trend
stationary process in comparison to the difference stationary processes
described above.
• Each of these series is characterized by a unit root, meaning that the coefficient
on the lagged value of yt = 1 in each process. For a trend stationary process,
this follows because you can re-write the time series as yt = µ + βt + εt =
yt−1 + β + εt − εt−1 .
• In each case, the DGP can be written as:

(1 − L)yt = α + ν

where α = 0, µ, and β respectively in each process and ν is a stationary


process.
• In all cases, the data should be detrended or differenced to produce a station-
ary series. But which? The matter is not merely of academic interest: differ-
encing a trend stationary process induces MA(1) autocorrelation in the errors
(εt − εt−1), while detrending a random walk fails to remove the stochastic
trend and leaves spuriously autocorrelated residuals.

• A unit-root test is based on a model that nests the different processes above
into one regression that you run to test the properties of the underlying data
series:
yt = µ + βt + γyt−1 + εt

• Next subtract yt−1 from both sides of the equation to produce the equation
below. This produces a regression with a (difference) stationary dependent
variable (even under the null of non-stationarity) and this regression forms
the basis for Dickey-Fuller tests of a unit root:

yt − yt−1 = µ + βt + (γ − 1)yt−1 + εt

➢ Failing to reject the hypothesis that (γ − 1) = 0 gives evidence for a random
walk, because (γ − 1) = 0 ⇒ γ = 1.
➢ If (γ − 1) = 0 and µ is significantly different from zero we have evidence
for a random walk with drift.
➢ If (γ − 1) is significantly different from zero (and < 0) we have evidence
of a stationary process.
➢ If (γ − 1) is significantly different from and less than zero, and β (the
coefficient on the trend variable) is significant, we have evidence for a
trend stationary process.

• There is one complication. Two statisticians, Dickey and Fuller (1979, 1981),
showed that under the null hypothesis of a unit root (γ exactly equal to one)
the usual t statistic does not follow its standard distribution, so that revised
critical values are required for the test statistic above.
• For this reason, the test for stationarity is referred to as the Dickey-Fuller
test. The augmented Dickey-Fuller test applies to the same equation above
but adds lags of the first difference in y, (yt − yt−1 ).
• One problem with the Dickey-Fuller unit-root test is that it has low power
and seems to privilege the null hypothesis of a random walk process over the
alternatives.

• To sum up: If your data looks like a random walk, you will have to difference it
until you get something that looks stationary. If your data looks like it’s trend
stationary, you will have to de-trend it until you get something stationary.
An ARMA model carried out on differenced data is called an ARIMA model,
standing for “Auto-Regressive, Integrated, Moving-Average.”
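To make the distinction concrete, here is a minimal Stata sketch (not part of the original notes; all series names and parameter values are made up, and rnormal() assumes a reasonably recent Stata release) that simulates one series of each type and applies the augmented Dickey-Fuller test to both:

* Simulate 200 observations of a pure random walk and a trend stationary series.
clear
set obs 200
set seed 12345
gen t = _n
tsset t
gen eps = rnormal()
gen rw = eps in 1
replace rw = rw[_n-1] + eps in 2/200      // random walk: y_t = y_t-1 + e_t
gen trendst = 0.5 + 0.1*t + eps           // trend stationary: y_t = mu + beta*t + e_t
dfuller rw, trend lags(4)                 // should typically fail to reject the unit root
dfuller trendst, trend lags(4)            // should typically reject the unit root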

15.7.1 Determining the specification


• You have data that is now stationary. How do you figure out which ARMA
specification to use? How many, if any, AR terms should there be? How
many, if any, MA terms?
• As a means of deciding the specification, analysts in the past have looked at
the autocorrelations and partial autocorrelations between yt and yt−s . This is
also known as looking at the autocorrelation function (ACF) and the partial
autocorrelation function (or PACF).
• Recall that

corr(εt , εt−s) = cov(εt , εt−s)/√(var(εt) var(εt−s)) = E[εt εt−s]/E[ε²t] = γs/γ0

• If yt and yt−s are both expressed in terms of deviations from their means, and
if var(yt) = var(yt−s) then:

corr(yt , yt−s) = cov(yt , yt−s)/√(var(yt) var(yt−s)) = E[yt yt−s]/E[y²t] = γs/γ0

15.8 The autocorrelation function for AR(1) and MA(1) processes
• We showed in the section on autocorrelation in the error terms that if yt =
ρyt−1 + εt then corr[yt , yt−s ] = ρs .
• In this context, εt is white noise. E[εt ] = 0, E[ε2t ] = σε2 , E[εt εt−1 ] = 0
• Thus, the autocorrelations for an AR(1) process tend to die away gradually.

• By contrast, the autocorrelations for an MA(1) process die away abruptly.


• Let
yt = εt − θεt−1
Then

γ0 = var[yt ] = E{yt −E[yt ]}2 = E[(εt −θεt−1 )2 ] = E(ε2t )+θ2 E(ε2t−1 ) = (1+θ2 )σε2

and

γ1 = cov[yt , yt−1 ] = E[(εt − θεt−1 )(εt−1 − θεt−2 )] = −θE[ε2t−1 ] = −θσε2

• The covariances between yt and yt−s when s > 1 are zero, because the ex-
pression for yt only involves two error terms. Thus, the ACF for an MA(1)
process has one or two spikes and then shows no autocorrelation.

15.9 The partial autocorrelation function for AR(1) and MA(1) processes
• The partial autocorrelation is the simple correlation between yt and yt−s minus
that part explained by the intervening lags.
• Thus, the partial autocorrelation between yt and yt−s is estimated by the last
coefficient in the regression of yt on [yt−1 , yt−2 , ..., yt−s ]. The appearance of the
partial autocorrelation function is the reverse of that for the autocorrelation
function. For a true AR(1) process,

yt = ρyt−1 + εt

• There will be an initial spike at the first lag (where the autocorrelation equals
ρ) and then nothing, because no other lagged value of y is significant.
• For the MA(1) process, the partial autocorrelation function will look like a
gradually declining wave, because any MA(1) process can be written as an
infinite AR process with declining weights on the lagged values of y.
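As an illustration (a sketch with made-up parameter values, not part of the original notes), one can simulate an AR(1) and an MA(1) series in Stata and compare their correlograms; the decay and cut-off patterns described above should be visible:

* Simulate an AR(1) (rho = 0.7) and an MA(1) (theta = 0.5) series.
clear
set obs 500
set seed 98765
gen t = _n
tsset t
gen eps = rnormal()
gen ar1 = eps in 1
replace ar1 = 0.7*ar1[_n-1] + eps in 2/500
gen ma1 = eps - 0.5*eps[_n-1]          // first observation will be missing
corrgram ar1, lags(12)                 // ACF dies out gradually; PACF has one spike
corrgram ma1, lags(12)                 // ACF has one spike; PACF dies out gradually
ac ar1, lags(12)
pac ma1, lags(12)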

15.10 Different specifications for time series analysis


• We now turn to models in which yt is related to its own past values and a set
of exogenous, explanatory variables, xt .
• The pathbreaking time series work in pol. sci. concerned how presidential
approval levels respond to past levels of approval and measures of presidential
performance. Let’s use At , standing for the current level of approval, as the
dependent variable for our examples here.
➢ More on this topic can be found in an article by Neal Beck, 1991, “Com-
paring Dynamic Specifications,” Political Analysis.

1. The Static Model


At = βXt + εt
Comment from Beck: “This assumes that approval adjusts instantaneously
to new information (‘no stickiness’) and that prior information is of no
consequence (‘no memory’).”
2. The Finite Distributed Lag Model

At = β0 Xt + β1 Xt−1 + · · · + βM Xt−M + εt

This allows for memory but may have a large number of coefficients,
reducing your degrees of freedom.
3. The Exponential Distributed Lag (EDL) Model

At = (Xt + λXt−1 + λ²Xt−2 + · · · + λM Xt−M)β + εt

This reduces the number of parameters that you have to estimate. In
the case in which the number of lags can be taken as infinite, this can
also be written as:

At = (1 + λL + λ²L² + · · ·)Xt β + εt = Xt β/(1 − λL) + εt

4. The Partial Adjustment Model


Multiply both sides of the EDL model above by the “Koyck Transforma-
tion” or (1 − λL). After simplification, this will yield:
At = Xt β + λAt−1 + εt − λεt−1
➢ This says that current approval is a function of current exogenous
variables and the past value of approval, allowing for memory and
stickiness. By including a lagged dependent variable, you are actu-
ally allowing for all past values of X to affect your current level of
approval, with more recent values of X weighted more heavily.
➢ If the errors in the original EDL model were also AR(1), that is
to say if errors in preceding periods also had an effect on current
approval (e.g., ut = λut−1 + εt ), with the size of that effect falling in
an exponential way, then the Koyck transformation will actually give
you a specification in which the errors are iid. In other words:
At = Xt β + λAt−1 + εt
This specification is very often used in applied work with a stationary
variable whose current level is affected by memory and stickiness. If
the error is indeed iid, then the model can be estimated using OLS.
5. Models for “Difference Stationary” Data: The Error Correction
Model
Often, if you run an augmented Dickey-Fuller test and are unable to
reject the hypothesis that your data has a unit root, you will wind up
running the following type of regression:
∆At = ∆Xt β + εt
As Beck notes (p. 67) this is equivalent to saying that only the infor-
mation in the current period counts and that it creates an instantaneous
change in approval. This can be restrictive.
➢ Moreover, it will frequently result in finding no significant results in
first differences, although you have strong theoretical priors that the
dependent variable is related to the independent variables.

➢ What you can do in this instance, with some justification, is to run an


Error Correction Model that allows you to estimate both long-term
and short-term dynamics.
➢ Let us assume, for the moment, that both A and X are integrated (of
order one). Then we would not be advised to include them in a model
in terms of levels. However, if they are in a long-term, equilibrium
relationship with one another, then the errors:
et = At − Xt α
should be stationary. Moreover, if we can posit that people respond
to their “errors” and that gradually the relationship comes back into
equilibrium, then we can introduce the error term into a model of the
change in approval. This is the Error Correction Model.
∆At = ∆Xt β + γ(At−1 − Xt−1 α) + εt
The γ coefficient in this model is telling you how fast errors are ad-
justed. An attraction of this model is that it allows you to estimate
long-term dynamics (in levels) and short-term dynamics (in changes)
simultaneously.
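A sketch of how the last two specifications above might be estimated in Stata (the variable names A, X, and year are hypothetical; D. and L. are Stata's difference and lag operators, available once the data are tsset):

tsset year
* Partial adjustment / lagged dependent variable model:
regress A X L.A
* Engle-Granger style error correction model in two steps:
regress A X                     // long-run relationship in levels
predict ehat, resid             // e_t = A_t - X_t * alpha-hat
regress D.A D.X L.ehat          // short-run dynamics plus error correction term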

15.11 Determining the number of lags


• In order to select the appropriate number of lags of the dependent variable
(in an AR(p)) model, you could use the “general to specific” methodology.
Include a number of lags that you think is more than sufficient and take out
of the model any lags that are not significant.
• Second, analysts are increasingly using adjusted measures of fit (analogous
to R2 ) that compensate for the fact that as you include more lags (to fit the
data better) you are reducing the degrees of freedom.
• The two best known are the Akaike Information Criterion and the Schwarz
Criterion. Both are based on the standard error of the estimate (actually
on s2 ). Thus, you want to minimize both, but you are penalized when you
include additional lags to reduce the standard error of the estimate.

• The equations for the two criteria are:


AIC(K) = ln(e′e/n) + 2K/n

SC(K) = ln(e′e/n) + (K ln n)/n
The Schwarz Criterion penalizes degrees of freedom lost more heavily.
• The model with the smallest AIC or SC indicates the appropriate number
of lags.
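A rough Stata sketch of the calculation (hypothetical variable names; e(rss), e(N), and e(df_m) are returned by regress, and K here is taken as the number of estimated slope coefficients; in newer versions of Stata, estat ic reports likelihood-based versions of the same criteria). For a fair comparison the candidate models should be fit over a common sample.

* Compare an AR(2) and an AR(4) specification for y.
tsset year
regress y L(1/2).y
scalar aic2 = ln(e(rss)/e(N)) + 2*e(df_m)/e(N)
scalar sc2  = ln(e(rss)/e(N)) + e(df_m)*ln(e(N))/e(N)
regress y L(1/4).y
scalar aic4 = ln(e(rss)/e(N)) + 2*e(df_m)/e(N)
scalar sc4  = ln(e(rss)/e(N)) + e(df_m)*ln(e(N))/e(N)
scalar list aic2 sc2 aic4 sc4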

15.12 Determining the correct specification for your errors


• Beck gives examples (p. 60 and p. 64) where the error term is autocorrelated
even when you have built in lags of the dependent variable.
• This could occur if the effect of past errors dies out at a different rate than
does the effect of past levels in the explanatory variables. If this is the case,
you could still get an error term that is AR or MA, even when you have
introduced a lagged dependent variable, although doing this will generally
reduce the autocorrelation.
• The autocorrelation could be produced by an AR or an MA process in the
error term. To determine which, you might again look at the autocorrelations
and partial autocorrelations.
• We know how to transform the data in the case where the errors are AR(1)
using the Cochrane-Orcutt transformation or using Newey-West standard er-
rors. If the model was otherwise static, I would also suggest these approaches.

15.13 Stata Commands


• Dickey-Fuller Tests:
dfuller varname, noconstant lags(#) trend regress
• The Autocorrelation and Partial Autocorrelation Function:
corrgram varname, lags(#)
ac varname, lags(#)
pac varname, lags(#)
• ARIMA Estimation:
arima depvar [varlist], ar(numlist) ma(numlist)

Part III

Special Topics
Section 16

Time-Series Cross-Section and Panel Data


• Consider the model
yit = β0 + xit β + εit
where
➢ i = 1,. . . , N cross-sectional units (e.g., countries, states, households)
➢ t = 1, . . . , T time units (e.g., years, quarters, months)
➢ N T = total number of data points.
• Oftentimes, this model will be estimated pooling the data from different units
and time periods.
• Advantages: can give more leverage in estimating parameters; consistent w/
“general” theories

16.1 Unobserved country effects and LSDV


• In the model:

Govt Spendingit = β0 + β1 Opennessit + β2 Zit + εit

It might be argued that the level of government spending as a percentage


of GDP differs for reasons that are specific to each country (e.g., solidaristic
values in Sweden). This is also known as cross-sectional heterogeneity.
• If these unit-specific factors are correlated with other variables in the model,
we will have an instance of omitted variable bias. Even if not, we will get
larger standard errors because we are not incorporating sources of cross-
country variation into the model.
• We could try to explicitly incorporate all the systematic factors that might
lead to different levels of government spending across countries, but this
approach places high demands on us in terms of data gathering.


• Another way to do this, which may not be as demanding data-wise, is to


introduce a set of country dummies into the model.

Govt Spendingit = αi + β1 Opennessit + β2 Zit + εit

This is equivalent to introducing a country-specific intercept into the model.


Either include a dummy for all the countries but one, and keep the inter-
cept term, or estimate the model with a full set of country dummies and no
intercept.

16.1.1 Time effects


• There might also be time-specific effects (e.g., government spending went up
everywhere in 1973–74 in OECD economies because the first oil shock led to
unemployment and increased government unemployment payments). Once
again, if the time-specific factors are not accounted for, we could face the
problem of bias.
• To account for this, introduce a set of dummies for each time period.

Govt Spendingit = αi + δt + β1 Opennessit + β2 Zit + εit

• The degrees of freedom for the model are now N T − k − N − T . The signifi-
cance, or not, of the country-specific and time-specific effects can be tested by
using an F -test to see if the country (time) dummies are jointly significant.
• The general approach of including unit-specific dummies is known as Least
Squares Dummy Variables model, or LSDV.
• Can also include (T − 1) year dummies for time effects. These give the
difference between the predicted causal effect from xit β and what you would
expect for that year. There has to be one year that provides the baseline
prediction.

16.2 Testing for unit or time effects


• For LSDV (including an intercept), we want to test the hypothesis that
α1 = α2 = ... = αN −1 = 0

• Can use an F -test:


F (N − 1, N T − N − K) = [(R²UR − R²R)/(N − 1)] / [(1 − R²UR)/(N T − N − K)]
In this case, the unrestricted model is the one with the country dummies (and
hence different intercepts); the restricted model is the one with just a single
intercept. A similar test could also be performed on the year dummies.

16.2.1 How to do this test in Stata


• After the regress command you type:

1. If there are (N-1) country dummies and an intercept


test dummy1=dummy2=dummy3=dummy4=...=dummyN-1=0
2. If there are N country dummies and no intercept
test dummy1=dummy2=dummy3=dummy4=...=dummyN
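If the country dummies share a common name stem (say cdum1, cdum2, ..., a naming convention assumed here for illustration), a shorthand for case 1 is testparm, which tests that all of the listed coefficients are jointly zero:

* Joint F-test that all (N-1) country dummy coefficients are zero:
testparm cdum*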

16.3 LSDV as fixed effects


• Least squares dummy variable estimation is also known as Fixed Effects,
because it assumes that the variation in the dependent variable, yit , for given
countries or years can be estimated as a given, fixed effect.
• Before we go into the justification for this, let us examine which part of the
variation in yit is used to calculate the remaining β coefficients under fixed
effects.
• A fixed effects model can be estimated by transforming the data. To do this,
calculate the country mean of yit for all the different countries. Let the group
mean of a given country, i, be represented as ȳi.

• Let the original model be

yit = αi + xit β + εit (16.1)

Then:
ȳi. = αi + x̄i. β + ε̄i.
If we run OLS on this regression it will produce what is known as the “Be-
tween Effects” estimator, or β BE , which shows how the mean level of the
dependent variable for each country varies with the mean level of the inde-
pendent variables.
Subtracting this from eq. 16.1 gives

(yit − ȳi. ) = (αi − αi ) + (xit − x̄i. )β + (εit − ε̄i. )

or
(yit − ȳi. ) = (xit − x̄i. )β + (εit − ε̄i. )

• If we run OLS on this regression it will produce what is known as the “Fixed
Effects” estimator, or β F E .
• It is identical to LSDV and is sometimes called the within-group estimator,
because it uses only the variation in yit and xit within each group (or country)
to estimate the β coefficients. Any variation between countries is assumed to
spring from the unobserved fixed effects.
• Note that if time-invariant regressors are included in the model, the standard
FE estimator will not produce estimates for the effects of these variables.
Similar issue w/ LSDV.

➢ An IV approach can produce such estimates, but requires some exogeneity assump-


tions that may not be met in practice.

• The effects of slow-moving variables can be estimated very imprecisely due


to collinearity.
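A sketch in Stata (a hypothetical panel with variables country, year, spend, and openness): the within slopes can be obtained equivalently from xtreg, fe, from LSDV via areg, or by demeaning the data by hand. The point estimates coincide; the degrees-of-freedom correction for the standard errors differs slightly in the manual version.

xtset country year
xtreg spend openness, fe                  // within (fixed effects) estimator
areg spend openness, absorb(country)      // LSDV with absorbed country dummies
egen sbar = mean(spend), by(country)
egen obar = mean(openness), by(country)
gen s_dm = spend - sbar
gen o_dm = openness - obar
regress s_dm o_dm                         // regression on group-demeaned data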

16.4 What types of variation do different estimators use?


• Let us now determine the sum of squares (X0 X) and cross-products (X0 y)
for the OLS estimator and within-group estimator in order to clarify which
estimator uses what variation to calculate the β coefficients.
• Let Sxx be the sum of squares and let Sxy be the cross-products. Let the
overall means of the data be represented as ȳ and x̄.
• Then the total sum of squares and cross-products (which define the variation
that we use to estimate β̂ OLS ) is:
S^T_xx = Σi Σt (xit − x̄)(xit − x̄)′

S^T_xy = Σi Σt (xit − x̄)(yit − ȳ)

where the double sums run over i = 1, . . . , N and t = 1, . . . , T (the same applies below).

• The within-group sum of squares and cross-products (used to estimate β̂ F E )


is:
S^W_xx = Σi Σt (xit − x̄i.)(xit − x̄i.)′

S^W_xy = Σi Σt (xit − x̄i.)(yit − ȳi.)

• The between-group sum of squares and cross-products (used to estimate β̂ BE )


is:
S^B_xx = Σi Σt (x̄i. − x̄)(x̄i. − x̄)′

S^B_xy = Σi Σt (x̄i. − x̄)(ȳi. − ȳ)

• It is easy to verify that:


S^T_xx = S^W_xx + S^B_xx

and:

S^T_xy = S^W_xy + S^B_xy

• We also have that:

β̂OLS = [S^T_xx]⁻¹[S^T_xy] = [S^W_xx + S^B_xx]⁻¹[S^W_xy + S^B_xy]

and
β̂FE = [S^W_xx]⁻¹[S^W_xy]

while,
β̂BE = [S^B_xx]⁻¹[S^B_xy]

• The standard β̂ OLS uses all the variation in yit and xit to calculate the slope
coefficients while β̂ F E just uses the variation across time and β̂ BE just uses
the variation across countries.
• We can show that β̂ OLS is a weighted average of β̂ F E and β̂ BE . In fact:

β̂OLS = F^W β̂FE + F^B β̂BE

where F^W = [S^W_xx + S^B_xx]⁻¹ S^W_xx and F^B = [I − F^W]

16.5 Random effects estimation


• Fixed effects is completely appropriate if we believe that the country-specific
effects are indeed fixed, estimable amounts that we can calculate for each
country.
• Thus, we believe that Sweden will always have an intercept of 1.2 units (for
instance). If we were able to take another sample, we would once again
estimate the same intercept for Sweden. There are cases, however, where we
may not believe that we can estimate some fixed amount for each country.
• In particular, assume that we have a panel data model run on 20 countries,
but which should be generalizable to 100 different countries. We cannot
estimate the given intercept for each country or each type of country because
we don’t have all of them in the sample for which we estimate the model.
• In this case, we might want to estimate the βs on the explanatory variables
taking into account that there could be country-specific effects that would
enter as a random shock from a known distribution.
• The appropriate model that accounts for cross-national variation is random
effects:
yit = α + xit β + ui + εit
In this model, α is a general intercept and ui is a time-invariant, random
disturbance characterizing the ith country. Thus, country-effects are treated
as country-specific shocks. We also assume in this model that:
E[εit ] = E[ui ] = 0
E[ε²it] = σ²ε, E[u²i] = σ²u
E[εit uj] = 0 ∀ i, t, j; E[εit εjs] = 0 ∀ t ≠ s, i ≠ j; E[ui uj] = 0 for i ≠ j.
• For each country, we have a separate error term, equal to wit , where:
wit = εit + ui
and
E[w²it] = σ²ε + σ²u, and E[wit wis] = σ²u for t ≠ s.

• It is because the RE model decomposes the disturbance term into different


components that it is also known as an error components model.
• For each panel (or country), the variance-covariance matrix of the T distur-
bance terms will take the following form:

Σ = σ²ε IT + σ²u iT i′T,

a T × T matrix with (σ²ε + σ²u) in every diagonal cell and σ²u in every off-diagonal cell,

where iT is a (T × 1) vector of ones.


• The full variance-covariance matrix for all the N T observations is:
 
Σ 0 0 ... 0
0 Σ 0 . . . 0
Ω =  ..  = IN ⊗ Σ
 
. . .
.
. .
.
0 0 0 ... Σ

• The way in which the RE model differs from the original OLS estimation (with
no fixed effects) is only in the specification of the disturbance term. When the
model differs from the standard G-M assumptions only in the specification
of the errors, the regression coefficients can be consistently and efficiently
estimated by Generalized Least Squares (GLS) or (when we don’t exactly
know Ω) by FGLS.
• Thus, we can do a transformation of the original data that will create a new
var-cov matrix for the disturbances that conforms to G-M.

16.6 FGLS estimation of random effects


• The FGLS estimator is

β̂FGLS = (X′Ω̂⁻¹X)⁻¹(X′Ω̂⁻¹y)

To estimate this we will need to know Ω⁻¹ = [IN ⊗ Σ]⁻¹, which means that
we need to estimate Σ^(−1/2):

Σ^(−1/2) = (1/σε)[IT − (θ/T) iT i′T]

where

θ = 1 − σε/√(T σ²u + σ²ε)
• Then the transformation of yi and Xi for FGLS is

Σ^(−1/2) yi = (1/σε) [yi1 − θȳi. ,  yi2 − θȳi. ,  . . . ,  yiT − θȳi.]′

with a similar looking expression for the rows of Xi .


• It can be shown that the GLS estimator, β̂ RE , like the OLS estimator, is a
weighted average of the within (FE) and between (BE) estimators:

β̂ RE = F̂ W β̂ F E + (I − F̂ W )β̂ BE

where:

F̂^W = [S^W_xx + λ S^B_xx]⁻¹ S^W_xx

and

λ = σ²ε/(σ²ε + T σ²u) = (1 − θ)²
• If λ = 1, then the RE model reduces to OLS. There is essentially no country-
specific disturbance term, so the regression coefficients are most efficiently
estimated using the OLS method.

• If λ = 0, the country-specific shocks dominate the other parts of the distur-


bance. Then the RE model reduces to the fixed effects model. We attribute
all cross-country variation to the country-specific shock, with none attributed
to the random disturbance εit and just use the cross-time variation to estimate
the slope coefficients.
• To the extent that λ differs from one, we can see that the OLS estimation
involves an inefficient weighting of the two least squares estimators (within
and between) and GLS will produce more efficient results.
• To estimate the RE model when Ω is unknown, we use the original OLS
results, which are consistent, to get estimates of σε2 and σu2 .

16.7 Testing between fixed and random effects


• If the random, country-specific disturbance term, ui , is correlated with any
of the other explanatory variables, xit , then we will get biased estimates in
the OLS stage because the regressors and the disturbance will be contempo-
raneously correlated.
• The coefficient on xit will be biased and inconsistent, which means the OLS
estimates of the residuals will be biased and the β̂ RE will be biased. This
sets us up for a Hausman test:
H0 : E[ui xit ] = 0; RE appropriate ⇒ β̂ RE is approximately equal to β̂ F E but is
more efficient (has smaller standard errors).

H1 : E[ui xit ] 6= 0; RE is not appropriate ⇒ β̂ RE will be different from β̂ F E (and


inconsistent).

• In this setting, the Hausman test statistic is calculated as:

W = χ²(K) = [β̂FE − β̂RE]′ Σ̂⁻¹ [β̂FE − β̂RE]
where
Σ̂ = var[β̂ F E ] − var[β̂ RE ]
If the Hausman test statistic is larger than its appropriate critical value, then
we reject RE as the appropriate specification.

• Greene, 6th ed., p. 205–206, also shows how to perform a Breusch-Pagan test
for RE based on the residuals from the original OLS regression. This tests
for the appropriateness of OLS versus the alternative of RE. It does not test
RE against FE.

16.8 How to do this in Stata


• xtreg depvar [varlist], re for RE
xtreg depvar [varlist], fe for FE
xtreg depvar [varlist], be for between effects
To perform the hausman test, type
xthausman
After xtreg depvar [varlist], re
To run the Breusch-Pagan test for RE versus OLS, type
xttest0
After xtreg depvar [varlist], re
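In later versions of Stata, the same comparison can also be run with estimates store and the general hausman command (a sketch, using the same placeholder syntax as above):

xtreg depvar [varlist], fe
estimates store fixed
xtreg depvar [varlist], re
estimates store random
hausman fixed random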

16.9 Panel regression and the Gauss-Markov assumptions


• OLS is BLUE if the errors are iid, implying that E[εit ] = 0, E[ε2it ] = σε2 ,
and E[εit εjs ]=0 for t 6= s or i 6= j
• The errors in panel regressions, however, are particularly unlikely to be spher-
ical:
1. We might expect the errors in time-series, cross-sectional settings to be
contemporaneously correlated (e.g., economies are linked).
2. We might also expect the errors in panel models to show “panel het-
eroskedasticity” (e.g., the scale of the dependent variable may vary across
countries).
3. The errors may show first-order serial correlation (or autocorrelation) or
some other type of temporal dependence.

• Consider the following example to see how this will show up in the variance-
covariance matrix of the errors. Assume that N = 2 and T = 2 and that the
data are stacked by country and then by time period:
• The covariance matrix for spherical errors is:

Ω = [ σ²   0    0    0
      0    σ²   0    0
      0    0    σ²   0
      0    0    0    σ² ]
where  
Ω = [ E(ε11 ε11)  E(ε11 ε12)  E(ε11 ε21)  E(ε11 ε22)
      E(ε12 ε11)  E(ε12 ε12)  E(ε12 ε21)  E(ε12 ε22)
      E(ε21 ε11)  E(ε21 ε12)  E(ε21 ε21)  E(ε21 ε22)
      E(ε22 ε11)  E(ε22 ε12)  E(ε22 ε21)  E(ε22 ε22) ]
• This is the covariance matrix with contemporaneous correlation:

Ω = [ σ²    0     σ12   0
      0     σ²    0     σ12
      σ12   0     σ²    0
      0     σ12   0     σ²  ]

• This is the covariance matrix with contemporaneous correlation and panel-


specific heteroskedasticity:

Ω = [ σ1²   0     σ12   0
      0     σ1²   0     σ12
      σ12   0     σ2²   0
      0     σ12   0     σ2² ]

• This is the covariance matrix with all of the above and first-order serial cor-
relation:

Ω = [ σ1²   ρ     σ12   0
      ρ     σ1²   0     σ12
      σ12   0     σ2²   ρ
      0     σ12   ρ     σ2² ]

• The most popular method for addressing these issues is known as panel cor-
rected standard errors (PCSEs) which is due to Beck and Katz (APSR ’95).
• In other cases of autocorrelation and/or heteroskedasticity, we have suggested
GLS or FGLS. We would transform the data so that the errors become spherical
by multiplying X and y by Ω̂^(−1/2). The FGLS estimates of β are then equal
to (X′Ω̂⁻¹X)⁻¹(X′Ω̂⁻¹y).
• FGLS is most often performed by first using the OLS residuals to estimate ρ
(to implement the Prais-Winsten method) and then using the residuals from
an OLS regression on this data to estimate the contemporaneous correlation.
This is known as the Parks method and is done in Stata via xtgls.
• Beck and Katz argue, however, that unless T >> N , then very few periods
of data are being used to compute the contemporaneous correlation between
each pair of countries.
• In addition, estimates of the panel-specific autocorrelation coefficient, ρi , are
likely to be biased downwards when they are based on few T observations.
• In other words, FGLS via the Parks method has undesirable small sample
properties when T is not many magnitudes greater than N .
• To show that FGLS leads to biased estimates of the standard errors in TSCS,
Beck and Katz use Monte Carlo Simulation

1. Simulate 1000 different samples of data with known properties. Each of


the 1000 samples will be “small” in terms of T and N .
2. Compute the β coefficients and the standard errors of those coefficients
using FGLS on the 1000 different runs of data.
3. Compare the variance implied by the calculated standard errors with the
actual variance found in the data to see if the calculated standard errors
are correct.

• PCSEs are built on the same approach as White-Consistent Standard Errors


in the case of heteroskedasticity and Newey-West Standard Errors in the case
of autocorrelation. Instead of transforming the data, we use the following
equation to compute standard errors that account for non-sphericity:

var[β̂] = (X′X)⁻¹ X′ΩX (X′X)⁻¹

where Ω denotes the variance-covariance matrix of the errors.


• Beck and Katz assert that the kind of non-sphericity in TSCS produces the
following Ω:

Ω = [ σ1² IT   σ12 IT   · · ·   σ1N IT
      σ21 IT   σ2² IT   · · ·   σ2N IT
      ...
      σN1 IT   σN2 IT   · · ·   σN² IT ]

• Let

Σ = [ σ1²   σ12   σ13   · · ·   σ1N
      σ12   σ2²   σ23   · · ·   σ2N
      ...
      σ1N   σ2N   σ3N   · · ·   σN² ]

• Use OLS residuals, denoted eit for unit i at time t (in Beck and Katz’s nota-
tion), to estimate the elements of Σ:

Σ̂ij = (Σt eit ejt)/T, where the sum runs over t = 1, . . . , T,    (16.2)
which means the estimate of the full matrix Σ̂ is
Σ̂ = E′E/T
where E is a T × N matrix of the re-shaped N T × 1 vector of OLS residuals,
such that the columns contain the T × 1 vectors of residuals for each cross-
sectional unit (conversely, each row contains the residuals for all N cross-
sectional units in a given time period):

E = [ e11   e21   . . .   eN1
      e12   e22   . . .   eN2
      ...
      e1T   e2T   . . .   eNT ]

Then

Ω̂ = (E′E/T) ⊗ IT ,
• Compute SEs using the square roots of the diagonal elements of

(X′X)⁻¹ X′Ω̂X (X′X)⁻¹,    (16.3)

where X denotes the N T × k matrix of stacked vectors of explanatory vari-


ables, xit .

• Intuition behind why PCSEs do well: similar to White’s


heteroskedasticity-consistent standard errors for cross-sect’l estimators, but
better b/c take advantage of info provided by the panel structure.
• Good small sample properties confirmed by Monte Carlo studies.
• Note that this solves the problems of panel heteroskedasticity and contem-
poraneous correlation—not serial correlation. Serial correlation must be re-
moved before applying this fix.
➢ This can be done using the Prais-Winsten transformation or by including
a lagged dependent variable (Beck and Katz are partial to the latter).
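In Stata, PCSEs are available through xtpcse; a sketch with hypothetical variable names, handling serial correlation either with a Prais-Winsten AR(1) transformation or with a lagged dependent variable:

xtset country year
xtpcse y x1 x2, correlation(ar1)     // Prais-Winsten transformation plus PCSEs
xtpcse y L.y x1 x2                   // lagged dependent variable plus PCSEs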
Section 17

Models with Discrete Dependent Variables


17.1 Discrete dependent variables
• There are many cases where the observed dependent variable is not continu-
ous, but instead takes on discrete values. For example:

➢ A binary or dichotomous dependent variable: yi = 0 or 1 where these


values can stand for any qualitative measure. We will estimate this type
of data using logit or probit models.
➢ Polychotomous but ordered data: yi = 0, 1, 2, 3, . . . where each value is
part of a ranking, for example [did not graduate high school, graduated
HS, some college, BA, MA etc]. We will estimate this type of data using
ordered logit or probit models.
➢ Polychotomous or multinomial and unordered data, where there is no
particular ranking to the data. We will estimate this type of data using
multinomial or conditional logit models.
➢ Count data, yi = 0, 1, 2, 3, . . . but where the different values might rep-
resent the number of events per period rather than different qualitative
categories. We will estimate this type of data using count models such
as Poisson or negative-binomial models.
➢ Duration data where we are interested in the time it takes until an event
occurs. The dependent variable indicates a discrete time period.

• With continuous data, we are generally interested in estimating the expec-


tation of yi based on the data. With discrete dependent variables, we are
generally interested in the probability of seeing a particular value and how
that changes with the explanatory variables.


17.2 The latent choice model


• One of the most common ways of modeling and motivating discrete choices
is the latent choice model:
yi∗ = xi β + εi ,
where we observe:

yi = 1 if yi∗ > 0
yi = 0 if yi∗ ≤ 0
• Then, the probability that we observe a value of one for the dependent vari-
able is:

Pr[yi = 1|xi ] = Pr[yi∗ ≥ 0|xi ] = Pr[xi β + εi ≥ 0] = Pr[εi ≥ −xi β].

• Let F (·) be the cumulative distribution function (cdf) of ε, so that F (εi) =
Pr[ε ≤ εi ].
Pr[yi = 1|xi ] = 1 − F (−xi β)
Pr[yi = 0|xi ] = F (−xi β)
If f (·), the pdf, is symmetric, then:

Pr[yi = 1|xi ] = F (xi β) and Pr[yi = 0|xi ] = 1 − F (xi β)

• One way to represent the probability of yi = 1 is as the Linear Probability


Model:
Pr[yi = 1|xi ] = F (xi β) = xi β

• This is the model that we implicitly estimate if we regress a column of ones


and zeroes on a set of explanatory variables using ordinary least squares. The
probability of yi = 1 simply rises in proportion to the explanatory variables.

• There are two problems with this way of estimating the probability of yi = 1.
1. The model may predict values of yi outside the zero-one range.
2. The linear probability model is heteroskedastic. To see this note that, if
yi = xi β + εi , then εi = 1 − xi β when yi =1 and εi = −xi β when yi =0.
Thus:

var[εi |xi ] = Pr[yi = 1] · (1 − xi β)² + Pr[yi = 0] · (−xi β)²
            = xi β(1 − xi β)² + (1 − xi β)(xi β)²
            = xi β(1 − xi β)

• Given these two drawbacks, it seems wise to select a non-linear function for
the predicted probability that will not exceed the values of one or zero and
that will offer a better fit to the data, where the data is composed of zeroes
and ones.
• For this reason, we normally assume a cumulative distribution function for the
error term where:

lim Pr(Y = 1) = 1 as xβ → +∞

and

lim Pr(Y = 1) = 0 as xβ → −∞

• The two most popular choices for a continuous probability distribution that
satisfy these requirements are the normal and the logistic.
• The normal distribution function for the errors gives rise to the probit model:
F (xi β) = Φ(xi β) where Φ is the cumulative normal distribution.
• The logistic function gives rise to the logit model:
F (xi β) = e^(xi β)/(1 + e^(xi β)) = Λ(xi β).
• The vector of coefficients, β, is estimated via maximum likelihood.

• Treating each observation, yi , as an independent random draw from a given


distribution, we can write out the likelihood of seeing the entire sample as:

Pr(Y1 = y1 , Y2 = y2 , ..., Yn = yn ) = Pr(Y1 = y1 ) Pr(Y2 = y2 ) · · · Pr(Yn = yn )


= Π{yi=0} [1 − F (xi β)] · Π{yi=1} F (xi β).

• This can be more conveniently written as:

L = Πi [F (xi β)]^yi [1 − F (xi β)]^(1−yi), with the product taken over i = 1, . . . , n.

• We take the natural log of this to get the log likelihood:

ln L = Σi {yi ln F (xi β) + (1 − yi) ln[1 − F (xi β)]}

• Maximizing the log likelihood with respect to β yields the values of the pa-
rameters that maximize the likelihood of observing the sample. From what
we learned of maximum likelihood earlier in the course, we can also say that
these are the most probable values of the coefficients for the sample.
• The first order conditions for the maximization require that:

∂ ln L/∂β = Σi [ yi (fi/Fi) − (1 − yi)(fi/(1 − Fi)) ] xi = 0

where fi is the probability distribution function, dFi /d(xi β)


• The standard errors are estimated by directly computing the Hessian of the
log likelihood using the data at hand. The estimated variance-covariance
matrix of the coefficients is the inverse of the negative Hessian. The Hessian
is the matrix containing all the second derivatives of the log likelihood with
respect to the parameters, or

∂² ln L / ∂β∂β′.

• Maximum likelihood estimators have attractive large sample properties (con-


sistency, asymptotic efficiency, asymptotic normality). We therefore treat the
estimated variance of the coefficients as consistent for the true variance. Thus,
we use the z distribution (which assumes that the variance is known) as the
source of our critical values.

17.3 Interpreting coefficients and computing marginal effects
• A positive sign on a slope coefficient obtained from a logit or probit indicates
that the probability of yi = 1 increases as the explanatory variable increases.
• We would like to know how much the probability increases as xi increases.
In other words, we would like to know the marginal effect. In OLS, this was
given by β
• Recall that for OLS:

∂E(y|x)/∂x = β

This is not true for logit and probit.
• In binary models, ∂E(y|x)/∂x = f (xi β)β
• Thus, for probit models:

∂E(y|x)/∂x = φ(xi β)β

and for logit models:

∂E(y|x)/∂x = [e^(xi β)/(1 + e^(xi β))²] β = Λ(xi β)[1 − Λ(xi β)]β

• Two points to note about the marginal effects:


1. They are not constant.
2. They are maximized at xi β = 0, which is where E[y|x] = 0.5.
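A sketch of how this is typically done in Stata (hypothetical variable names; mfx compute evaluates the marginal effects at the sample means of the regressors):

probit vote income educ
mfx compute            // reports phi(x*b)*b evaluated at the means
logit vote income educ
mfx compute            // reports Lambda(x*b)[1 - Lambda(x*b)]*b at the means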

17.4 Measures of goodness of fit


• A frequently used measure of fit for discrete choice models is the Pseudo-R2
which, as its name implies, is an attempt to tell you how well the model fits
the data.

Pseudo-R² = 1 − (ln L̂ / ln L̂0)
where ln L̂ is the log likelihood from the full model and ln L̂0 is the log like-
lihood from the model estimated with a constant only. This will always be
between 0 and 1.
• The log likelihood is always negative. The likelihood function that we start
with is just the probability of seeing the entire sample:
Pr[Y1 = y1 , Y2 = y2 , ..., Yn = yn ] = Π{yi=0} [1 − F (xi β)] · Π{yi=1} F (xi β).

• As a probability, it must lie between zero and one. When we take the natural
log of this, to get the log likelihood,

ln L = Σi {yi ln F (xi β) + (1 − yi) ln[1 − F (xi β)]},

we will get a number between negative infinity and zero. The natural log of
one is zero, because e⁰ = 1. The natural log of zero approaches negative
infinity, because e^(−∞) = 0. Thus, increases in the log
likelihood toward zero imply that the probability of seeing the entire sample
given the estimated coefficients is higher.
• The maximum likelihood method selects that value of β that maximizes the
log likelihood and reports the value of the log likelihood at β̂ M L . A larger
negative number implies that the model does not fit the data well.
• As you add variables to the model, you expect the log likelihood to get closer
to zero (i.e., increase). ln L̂0 is the log likelihood for the model with just a
constant; ln L̂ is the log likelihood for the full model,so ln L̂0 should always
be larger and more negative than ln L̂.

• Where ln L̂0 = ln L̂, the additional variables have added no predictive power,
and the pseudo- R2 is equal to zero. When ln L̂= 0, the model now perfectly
predicts the data, and the pseudo- R2 is equal to one.
• A similar measure is the likelihood ratio test of the hypothesis that the
coefficients on all the explanatory variables in the model (except the
constant) are zero (similar to the F -test under OLS).
• Although these measures offer the benefit of comparison to OLS techniques,
they do not easily get at what we want to do, which is to predict accurately
when yi is going to fall into a particular category. An alternative measure of
the goodness of fit of the model might be the percentage of observations that
were correctly predicted. This is calculated as:

[(Observation = 1 ∩ Prediction = 1) + (Observation = 0 ∩ Prediction = 0)] / n

The method of model prediction is to say that we predict a value of one for
yi if F (xi β) ≥ 0.5.
Let ŷi be the predicted value of yi .
• The percent correctly predicted is:

(1/N) Σi [ yi ŷi + (1 − yi)(1 − ŷi) ]

• This measure has some shortcomings itself, which are overcome by the ex-
pected percent correctly predicted (ePCP):

ePCP = (1/N) [ Σ{yi=1} p̂i + Σ{yi=0} (1 − p̂i) ]
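Both quantities are easy to compute after a logit or probit fit; a Stata sketch with hypothetical variable names:

logit y x1 x2
predict phat if e(sample)                  // predicted probabilities
gen byte yhat = (phat >= 0.5) if e(sample)
gen byte hit  = (y == yhat) if e(sample)
summarize hit                              // mean = percent correctly predicted
gen epcp_i = y*phat + (1 - y)*(1 - phat)
summarize epcp_i                           // mean = expected percent correctly predicted (ePCP)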
Section 18

Discrete Choice Models for Multiple Categories


18.1 Ordered probit and logit
• Suppose we are interested in a model where the dependent variable takes on
more than two discrete values, but the values can be ordered. The latent
model framework can be used for this type of model.

yi∗ = xi β + εi

yi = 1 if yi∗ < γ1
yi = 2 if γ1 < yi∗ ≤ γ2
yi = 3 if γ2 < yi∗ ≤ γ3
...
yi = m if γm−1 < yi∗

• This gives the probabilities

Pr(yi = 1) = Φ(γ1 − xi β)
Pr(yi = 2) = Φ(γ2 − xi β) − Φ(γ1 − xi β)
Pr(yi = 3) = Φ(γ3 − xi β) − Φ(γ2 − xi β)
..
.
Pr(yi = m) = 1 − Φ(γm−1 − xi β)
= Φ(xi β − γm−1 )

• To write down the likelihood, let

zij = 1 if yi = j
zij = 0 otherwise, for j = 1, . . . , m.


• Then
Pr(zij = 1) = Φ(γj − xi β) − Φ(γj−1 − xi β)

• The likelihood for an individual is

Li = [Φ(γ1 − xi β)]^zi1 [Φ(γ2 − xi β) − Φ(γ1 − xi β)]^zi2 · · · [1 − Φ(γm−1 − xi β)]^zim
   = Πj [Φ(γj − xi β) − Φ(γj−1 − xi β)]^zij

• The likelihood function for the sample then is

L = Πi Πj [Φ(γj − xi β) − Φ(γj−1 − xi β)]^zij

• We compute marginal effects just like for the dichotomous probit model. For
the three-category case these would be:

∂ Pr(yi = 1)/∂xi = −φ(γ1 − xi β)β
∂ Pr(yi = 2)/∂xi = [φ(γ1 − xi β) − φ(γ2 − xi β)]β
∂ Pr(yi = 3)/∂xi = φ(γ2 − xi β)β
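A Stata sketch (hypothetical three-category outcome): the cutpoints γ are reported along with β, predicted probabilities for each category come from predict, and mfx with the outcome() option gives the marginal effects for a chosen category.

oprobit satisfaction income educ
predict p1 p2 p3                      // Pr(y=1), Pr(y=2), Pr(y=3) for each observation
mfx compute, predict(outcome(2))      // marginal effects on Pr(y = 2) at the means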

18.2 Multinomial Logit


• Suppose we are interested in modeling vote choices in multi-party systems.
In this setting, the dependent variable (party choice) might be y ∈ {1, 2, 3},
but there is no implicit ordering; they are just different choices. Hence, this
model is different from the ordered logit and probit models discussed above.
• The multinomial logit or probit is motivated through a re-working of the
latent variable model. The latent variable that we don’t see now becomes
the “utility” to the voter of choosing a particular party. Because the utility
also depends on an individual shock, the model is called the random utility
model.
• In the current example, we have three different choices for voter i, where
those choices are sub-scripted by j.

Uij = xi β j + εij

In this case xi are the attributes of the voters selecting a party and β j are
choice-specific coefficients determining how utility from a party varies with
voter attributes (e.g., union members are presumed to derive more utility
from the Democratic party). Uij is the utility to voter i of party j.
• The voter will chose party one over party two if:

Ui1 > Ui2 ⇒ xi β 1 + εi1 > xi β 2 + εi2 ⇒ (εi1 − εi2 ) > xi (β 1 − β 2 )

• The likelihood of this being true can be modeled using the logistic function if
we assume that the original errors are distributed log-Weibull. This produces
a logistic distribution for (εi1 − εi2 ) because if two random variables, εi1 and
εi2 , are distributed log-Weibull, then the difference between the two random
variables is distributed according to the logistic distribution.

• When there are only two choices (party one or party two) the probability of
a voter choosing party 1 is given by:

Pr(yi = 1|xi) = e^(xi(β1−β2)) / [1 + e^(xi(β1−β2))]
Thus, the random utility model in the binary case gives you something that
looks very like the regular logit.
• We notice from this formulation that the two coefficients, β 1 and β 2 cannot
be estimated separately. One of the coefficients, for instance β 1 , serves as
a base and the estimated coefficients (β j − β 1 ) tell us the relative utility of
choice j compared to choice 1. We normalize β 1 to zero, so that Ui1 = 0 and
we measure the probability that parties 2 and 3 are selected over party 1 as
voter characteristics vary.
• Thus, the probability of selecting party j compared to party 1 depends on the
relative utility of that choice compared to one. Recalling that exi β1 = e0 = 1,
and substituting into the equation above, we find that the probability of
selecting party 2 over party 1 equals:

Pr(yi = 2)/Pr(yi = 1) = e^(xi β2)

• In just the same way, if there is a third party option, we can say that the
relative probability of voting for that party over party 1 is equal to:

Pr(yi = 3)/Pr(yi = 1) = e^(xi β3)
Finally, we could use a ratio of the two expressions above to give us the
relative probability of picking Party 2 over Party 3.

• What we may want, however, is the probability that you pick Party 2 (or any
other party). In other words, we want Pr(yi = 2) and Pr(yi = 3). To get
this, we use a little simple algebra. First:

Pr(yi = 2) = exi β2 · Pr(yi = 1)

Pr(yi = 3) = exi β3 · Pr(yi = 1)

• Next, we use the fact that the three probabilities must sum to one (the three
choices are mutually exclusive and collectively exhaustive):

Pr(yi = 1) + Pr(yi = 2) + Pr(yi = 3) = 1

So that:
Pr(yi = 1) + exi β2 Pr(yi = 1) + exi β3 Pr(yi = 1) = 1
And:
Pr(yi = 1)[1 + e^(xi β2) + e^(xi β3)] = 1

so that

Pr(yi = 1) = 1/[1 + e^(xi β2) + e^(xi β3)]
• Using this expression for the probability of the first choice, we can say that:

Pr(yi = 2) = e^(xi β2)/[1 + e^(xi β2) + e^(xi β3)]
Pr(yi = 3) = e^(xi β3)/[1 + e^(xi β2) + e^(xi β3)]
Pr(yi = 1) = 1/[1 + e^(xi β2) + e^(xi β3)]
• The likelihood function is then estimated and computed using these proba-
bilities for each of the three choices. The model is estimated in Stata using
mlogit.
• When the choice depends on attributes of the alternatives instead of the
attributes of the individuals, the model is estimated as a Conditional Logit.
This model is estimated in Stata using clogit.

18.2.1 Interpreting coefficients, assessing goodness of fit


• We can enter the coefficients into the expressions above to get the probability
of a vote for any particular party. Further, we can easily tell something
about the relative probabilities of picking one party over another and how
that changes with a one-unit change in an x variable.
• Recall: Pr(yi = 2)/Pr(yi = 1) = e^(xi β2)

• Thus, e^(β2) will tell you the change in the relative probabilities for a one-unit
change in x. This is also known as the relative risk or the relative risk ratio.
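A sketch in Stata (hypothetical three-party vote variable, with party 1 as the base category; older releases use basecategory() rather than baseoutcome()):

mlogit vote income union, baseoutcome(1)
mlogit, rrr            // replay the results as relative risk ratios, e^b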

18.2.2 The IIA assumption


• The drawback to multinomial logit is the Independence of Irrelevant Alter-
natives (IIA) assumption. In this model, the odds ratios are independent of
the other alternatives.
• That is to say, the ratio of the likelihood of voting Democrat to voting Re-
publican is independent of whether option Nader is an option and does not
change if we add other options to the model. This is due to the fact that we
treat the errors in the multinomial logit model as independent.
• Thus, the error term, εi1 , which makes someone more likely to vote for the
Democrats than the Republicans, is treated as being wholly independent of
the error term that might make them more likely to vote for the Greens.
• The assumption aids estimation, but is not always warranted in behavioral
situations where the types of alternatives available might have an effect on
the relative utility of other choices. In other words, if you could vote for
the Greens, you might be less likely to vote for the Democrats compared to
Republicans.
• One test of whether the assumption is inappropriate in your model setting
is to run the model with and without a particular category and use a Haus-
man test to see whether the coefficients for the explanatory variables in the
remaining categories change.

• The IIA problem afflicts both the multinomial logit and the conditional logit
models. An option is to use the multinomial probit model, although it comes
with its own set of problems.
Section 19

Count Data, Models for Limited Dependent Variables, and Duration Models
19.1 Event count models and Poisson estimation
• Suppose that we have data on the number of coups, or the number of instances
of state-to-state conflict per year over a 20 year stretch. This data is clearly
not continuous, and therefore should not be modeled using OLS, because we
never see 2.5 conflicts.
• In addition, OLS could sometimes predict that we should see negative num-
bers of state-to-state conflict, something we want to rule out given that our
data are always zero or positive.
• It should not, however, be modeled as an ordered probit or logit because the
numbers represent more than different categories. They are count data, and
changes in the count have a cardinal as well as an ordinal interpretation. That
is, we can interpret a change of one in the dependent variable as meaning
something about the magnitude of conflict, rather than just identifying a
change in category.
• Enter the Poisson distribution. This is an important discrete probability
distribution for a count starting at zero with no upper bound.
• For data that we believe could be modeled as a Poisson distribution, each
observation is a unit of time (one year in our data example) and the Poisson
gives the probability of observing a particular count for that period.


• For each observation, the probability of seeing a particular count is given by:

Pr(Yi = yi) = e^(−λ) λ^yi / yi!,    yi = 0, 1, 2, ...

• In this model, λ is just a parameter that can be estimated, as µ is for the


normal distribution. It can also be shown that the expected event count in
each period, or E(yi ) is equal to λ and that the variance of the count is also
equal to λ. Thus, the variance of the event counts increases as their number
increases.
• The distribution above implies that the expected event count is just the same
in each period. This is the univariate analysis of state-to-state conflict. We
are saying what the expected number of events is for each year.
• In multivariate analysis, however, we can do better. We can relate the number
of events in each period to explanatory variables. This requires relating λ
somehow to explanatory variables that we think should be associated with
instances of state-to-state conflict. Then we will get a different parameter λi
for each observation, related to the level of the explanatory variables.
• The standard way to do this for the Poisson model is to say that:

λi = exi β or equivalently ln λi = xi β

• We can now say that the probability of observing a given count for period i
is:

Pr(Yi = yi) = e^(−λi) λi^yi / yi!,    yi = 0, 1, 2, ...

and E[yi |xi ] = λi = e^(xi β)
• Given this probability for a particular value at each observation, the likelihood
of observing the entire sample is the following:

Pr[Y | λ] = Πi [ e^(−λi) λi^yi / yi! ]

• The log likelihood of this is:

ln L = Σi [ −λi + yi xi β − ln yi! ]

Since the last term does not involve β, it can be dropped and the log likelihood
can be maximized with respect to β to give the maximum likelihood estimates
of the coefficients.
• This model can be done in Stata using the poisson command.
• For inference about the effects of explanatory variables, we can
➢ examine the predicted number of events based on a given set of values
for the variables, which is given by λi = exi β .
➢ examine the factor change: for a unit change in xk , the expected count
changes by a factor of eβk , holding other variables constant.
➢ examine the percentage change in the expected count for a δ unit change
in xk : 100 × [e(βk ·δ) − 1], holding other variables constant.
• Marginal effects can be computed in Stata after poisson using mfx com-
pute.
• The Poisson model offers no natural counterpart to the R2 in a linear regres-
sion because the conditional mean (the expected number of counts in each
period) is non-linear. Many alternatives have been suggested. Greene p. 908–
909 offers a variety and Cameron and Trivedi (1998) give more. A popular
statistic is the standard Pseudo-R², which compares the log-likelihood for
the full model to the log-likelihood for the model containing only a constant.
• One feature of the Poisson distribution is that E[yi ] = var[yi ]. If E[yi ] <
var[yi ], this is called overdispersion and the Poisson is inappropriate. A neg-
ative binomial or generalized event count model can be used in this case.
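A Stata sketch with hypothetical variables (the irr option reports the factor changes e^b discussed above, and nbreg fits a negative binomial model whose test of alpha = 0 is one check for overdispersion):

poisson conflicts trade democracy
poisson, irr                        // replay as incidence-rate ratios (factor changes)
mfx compute                         // marginal effects on the expected count
nbreg conflicts trade democracy     // negative binomial alternative if overdispersed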

19.2 Limited dependent variables: the truncation example


• The technical definition of a limited dependent variable is that it is limited
because some part of the data is simply not observed and missing (truncation)
or not observed and set at some limit (censoring).
• Although it is not often encountered, we start by discussing the truncation
case because it establishes the intuition.
• Imagine that we have a random variable, x, with a normal distribution. How-
ever, we do not observe the distribution of x below some value a, because the
distribution is truncated.
• If we wanted to estimate the expected value of x using the values we do
observe, we would over-estimate the expected value, because we are not taking
into account the observations that are truncated.
• How do we tell what the mean of x would have been had we seen all the
observations?
• To establish the expected value, we need the probability density function
(pdf) of x given the truncation:

f (x|x > a) = f (x) / Pr[x > a]
The denominator simply re-scales the distribution by the probability that we
observe x, so that we will get a pdf that would integrate to one.
• In addition:

Pr[x > a] = 1 − Φ((a − µ)/σ) = 1 − Φ(α)
where Φ is the cumulative standard normal distribution and the term in
brackets simply transforms the normal distribution of x into a standard nor-
mal.
• Thus: f (x|x > a) = f (x)/[1 − Φ(α)]

• To get the expected value of x, we use:


E[x|x > a] = ∫a^∞ x f (x|x > a) dx = µ + σλ(α)
where α = (a − µ)/σ, φ(α) is the standard normal density, and:
➢ λ(α) = φ(α)/[1 − Φ(α)] when truncation is at x > a
➢ λ(α) = −φ(α)/Φ(α) when truncation is at x < a
• The function λ(α) is also known as the “inverse Mills ratio” and is relevant
for succeeding models.
• Now, imagine that we have a dependent variable, yi , which is truncated. For
example, we only see values of yi > a, or where xi β + εi > a, implying that
εi > a − xi β. In this case,
 the probability that we observe yi is equal to
Pr[εi > a − xi β] = 1 − Φ((a − xi β)/σ).
• Given our assumptions about the distribution of εi , we can calculate this


probability, meaning that we can correct the conditional mean we estimate.
• In most cases, we assume that the conditional mean of yi , or E[yi |xi ] = xi β.
This is not the case for the truncated dependent variable, for the reasons
expressed above.
• Instead, and following a similar logic:
E[yi |yi > a] = xi β + σ · φ[(a − xi β)/σ] / (1 − Φ[(a − xi β)/σ]) = xi β + σλ(αi)
• What this implies, very basically, is that we need to add an additional term
to a regression model where we have truncated data in order to correctly
estimate our β̂ coefficients. If we don’t include the term that adjusts for the
probability that we see yi to begin with, we will get an inconsistent estimate,
similar to what would happen if we omitted a relevant variable.
• You can do this in Stata using the command truncreg. The truncated
regression is estimated using maximum likelihood.

19.3 Censored data and tobit regression


• Suppose that instead of the data being simply missing, we observe the value
of yi = a, whenever yi < a. We can imagine a latent variable, yi ∗, but we
observe a whenever yi ∗ ≤ a, and yi ∗ otherwise. This is known as a case of
“left-censoring.”
• In this case: E[yi ] = Φa + (1 − Φ)(µ + σλ)
• With probability Φ, yi = a and with probability (1 − Φ), yi = µ + σλ.
• In the “nice” case, where a = 0, the expression above simplifies to:

E[yi ] = Φ(µ/σ)(µ + σλ)

• In a multivariate regression model, where normally E[yi |xi ] = xi β, the cen-


soring with a = 0 implies that:

E[yi |xi ] = Φ(xi β/σ)(xi β + σλi )

where λi = φ[(0 − xi β)/σ] / (1 − Φ[(0 − xi β)/σ]) = φ(xi β/σ)/Φ(xi β/σ)

• If the censoring cut point is not equal to zero, and if we also have an upper-
censoring cut point, then the conditional expectation of yi will be more com-
plicated, but the point remains that in estimating the conditional mean we
must take account of the fact that yi could be equal to xi β or to the censored
value(s). If we don’t take this into account, we are likely to get incorrect
estimates of the β̂ coefficients just as we did in the case of the truncated
regression.
• The censored model (with an upper or lower censoring point, or both) is
estimated in Stata using the tobit command; a sketch of the syntax appears
at the end of this section.
• For an example of tobit in practice, see the paper presented at the American
Politics seminar last year, “Accountability and Coercion: Is Justice Blind
When It Runs for Office?” by Greg Huber and Sanford Gordon, which recently
appeared in the American Journal of Political Science.
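• A minimal sketch of the tobit syntax, again with hypothetical variable names and a lower censoring point at zero:

    * Tobit: y is censored from below at 0
    * ll(0) sets the lower censoring limit; ul(#) would set an upper limit
    tobit y x1 x2, ll(0)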
19.4 Sample selection: the Heckman model
• Imagine that you want to estimate the impact of participating in a community
group on one’s level of satisfaction with democratic institutions. This is a key
concern for scholars of social capital. Perhaps taking part in some community
organization is likely to increase one’s trust in government and to increase
satisfaction with government performance. Thus, taking part in a community
organization is like a treatment. We want to see how this treatment could
affect outcomes for the “average” person.
• The problem is that the people participating in community groups are un-
likely to be the same as the “average person.” There may well be something
about them that leads them to take part in community initiatives and, if this
unobserved characteristic also makes them likely to trust government, we will
bias our estimates of the treatment effect.
• The difficulty is that we only observe those people who joined the community
groups.
• The situation is analogous to one of truncation. Here, the truncation is
“incidental.” It does not occur at given, definite values of yi , but instead the
truncation cuts out those people for whom the underlying attractiveness of
community service did not reach some critical level.
• Let us call the underlying attractiveness of community service a random vari-
able, zi , and trust in government, our dependent variable of interest, yi .
• The expression governing whether someone joins a community group or not
is given by:
zi∗ = wi γ + ui, and the individual joins when zi∗ > 0.
• The equation of primary interest is:
yi = xi β + εi
• We assume that the two error terms, ui and εi, are distributed bivariate
normal, with standard deviations σu and σε, and that the correlation between
them is equal to ρ.
• Then we can show that:
E[yi | yi observed] = E[yi | zi∗ > 0]
                    = E[yi | ui > −wi γ]
                    = xi β + E[εi | ui > −wi γ]
                    = xi β + ρσε λi(αu)
                    = xi β + βλ λi(αu)
where αu = −wi γ/σu and λi(αu) = φ(wi γ/σu)/Φ(wi γ/σu).
• This sounds like a complicated model, but in fact it is not. The sample
selection model is typically computed via the Heckman two-stage procedure.
1. Estimate the probability that zi∗ > 0 by the probit model to obtain
estimates of γ. For each observation in the sample, compute λ̂i (αu ) =
φ(wi γ̂/σu )/Φ(wi γ̂/σu ).
2. Estimate β and βλ by least squares regression of yi on xi and λ̂i .
• The model is estimated in Stata using the heckman command. The output that
Stata gives you will report the coefficients in the outcome equation yi =
xi β + εi, the coefficients in the selection equation zi∗ = wi γ + ui, and
estimates of ρ and σε.
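• A minimal sketch of the syntax with hypothetical variable names: y, x1, and x2 in the outcome equation and w1, w2 in the selection equation. By default heckman uses full maximum likelihood; the twostep option requests the two-stage procedure described above:

    * Sample selection model: select() specifies the selection (probit) equation
    heckman y x1 x2, select(w1 w2)

    * Heckman two-step estimator instead of full maximum likelihood
    heckman y x1 x2, select(w1 w2) twostep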
19.5 Duration models
• Suppose your data record the length of civil wars or the number of days that
strikes last. You are interested in the factors that prolong civil conflict
or that bring strikes to a speedy end.
• You should not model these durations with the Poisson distribution, because
you are not interested in the number of days of unrest per se. You are
interested in the probability that a war will end, and this probability can
depend on how long the combatants have already been fighting.
• Thus, duration models, and the kinds of questions that go with them, should
be investigated using techniques that permit you to include the effects of
time as a chronological sequence, and not just as a numerical marker. These
models will also allow you to test for the effect of other “covariates” on the
probability of war ending.
• Censoring is also a common (but easily dealt with) problem with duration
analysis.
• Let us begin with a very simple example in which we are examining the prob-
ability of a “spell” of war or strike lasting t periods. Our dependent variable
is the random variable, T, which has a continuous probability density function
f(t), where t is a realization of T.
• The cumulative probability is:
F(t) = ∫_0^t f(s) ds = Pr(T ≤ t)
• Note that the normal distribution might not be a good choice of functional
form for this distribution, because the normal puts positive probability on
negative values and durations cannot be negative.
• The probability that a spell lasts at least t is given by the survival
function:
S(t) = 1 − F(t) = Pr(T ≥ t)
• We are sometimes interested in a related issue: given that a spell has lasted
until time t, what is the probability that it ends in the next short interval
of time, say ∆t?
l(t, ∆t) = Pr[t ≤ T ≤ t + ∆t | T ≥ t]
• A useful function for characterizing this aspect of the question is the
hazard rate:
λ(t) = lim_(∆t→0) Pr[t ≤ T ≤ t + ∆t | T ≥ t]/∆t = lim_(∆t→0) [F(t + ∆t) − F(t)]/[∆t S(t)] = f(t)/S(t)
Roughly, the hazard rate is the rate at which spells are completed after du-
ration t, given that they last at least until t.
• Now, we build in different functional forms for F(t) to get different models
of duration. In the exponential model, the hazard rate, λ, is just a parameter
to be estimated and the survival function is S(t) = e^(−λt). As such, the
exponential model has a hazard rate that does not vary over time.
• The Weibull model allows the hazard rate to vary over the duration of the
spell. In the Weibull model, λ(t) = λp(λt)^(p−1), where p is also a parameter
to be estimated, and the survival function is S(t) = e^(−(λt)^p). This model
can “accelerate” or “decelerate” the effect of time and is thus called an
accelerated failure time model.
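• As a small numerical sketch of these formulas (made-up parameter values, not estimates from any data): with λ = 0.1 and p = 1.5 the hazard rises with t, and the hazard and survival functions at t = 10 can be computed directly:

    * Hypothetical Weibull parameters: rate (lambda) = 0.1, shape (p) = 1.5
    scalar rate = 0.1
    scalar shape = 1.5
    scalar tval = 10
    display "hazard at t = 10:   " rate*shape*((rate*tval)^(shape - 1))
    display "survival at t = 10: " exp(-((rate*tval)^shape))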
• How do we relate duration to explanatory factors, like how well the combatants
are funded? As in the Poisson model, we derive a λi for each observation (war
or strike), where λi = e^(−xi β).
• Either of these models can be estimated in Stata via maximum likelihood
using the streg command. This command permits you to adopt the Weibull
or the exponential or one of several other distributions to characterize your
survival function.
• If you do not specify otherwise, your output will present the effect of your
covariates as hazard ratios (i.e., the multiplicative effect of a one-unit
increase in the covariate on the hazard rate). If you add the nohr option
(streg covariates, nohr), your output will be in terms of the actual
coefficients. (Note that the duration variable is not listed in streg itself;
it is declared beforehand with stset, as in the sketch below.) In all cases,
you can use the predict command after streg to get the predicted time until
the end of the spell for each observation. You can also predict the hazard
rate using predict, hazard; for the exponential model this hazard is constant
over time.
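• A minimal sketch with hypothetical variable names: length is the duration of each spell, ended indicates whether the spell was observed to end (0 for right-censored spells), and x1, x2 are covariates. The duration structure is declared first with stset, so streg lists only the covariates:

    * Declare the survival-time structure: duration and failure indicator
    stset length, failure(ended)

    * Weibull model; output is reported as hazard ratios by default
    streg x1 x2, distribution(weibull)

    * Same model, reported as coefficients rather than hazard ratios
    streg x1 x2, distribution(weibull) nohr

    * Exponential model (constant hazard)
    streg x1 x2, distribution(exponential)

    * Predicted (median) survival time and predicted hazard from the most recent model
    predict t_hat, median time
    predict h_hat, hazard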
• One of the problems that is sometimes encountered with duration data is
individual-level heterogeneity. That is, even if two conflicts had the same
levels of the explanatory factors, x, they might differ in length because of
some unobserved, conflict-specific factor (note the similarity to unobserved,
country-specific factors in panel data).
• To estimate a duration model in this context, we use a “proportional hazards”
model, such as the Cox proportional hazards model. This model allows us to say
how the hazard rate changes as we alter the value of x, but does not permit
us to compute the absolute level of the hazard (the baseline hazard) for each
observation. This model is estimated in Stata using stcox.
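• A minimal sketch, assuming the same hypothetical stset declaration as in the streg example above:

    * Cox proportional hazards model; the baseline hazard is left unspecified
    stcox x1 x2

    * Report coefficients rather than hazard ratios
    stcox x1 x2, nohr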