
Econ 480-3

Introduction to Econometrics

SPRING 2022
Ver. May 23, 2022

Northwestern University

Lecture Notes by

IVAN A. CANAY
Department of Economics
Northwestern University

© 2022 IVAN A. CANAY


ALL RIGHTS RESERVED
Contents

I A Primer on Linear Models 9

1 Linear Regression 11
1.1 Interpretations of the Linear Regression Model . . . . . . . . 11
1.1.1 Interpretation 1: Linear Conditional Expectation . . . 11
1.1.2 Interpretation 2: “Best” Linear Approximation to the
Conditional Expectation or “Best” Linear Predictor . 12
1.1.3 Interpretation 3: Causal Model . . . . . . . . . . . . . 13
1.2 Linear Regression when E[XU ] = 0 . . . . . . . . . . . . . . . 14
1.2.1 Solving for β . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Estimating β . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.1 Ordinary Least Squares . . . . . . . . . . . . . . . . . 15
1.3.2 Projection Interpretation . . . . . . . . . . . . . . . . 16

2 More on Linear Regression 19


2.1 Solving for Sub-vectors of β . . . . . . . . . . . . . . . . . . . 19
2.2 Estimating Sub-Vectors of β . . . . . . . . . . . . . . . . . . . 20
2.3 Properties of LS . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Gauss-Markov Theorem . . . . . . . . . . . . . . . . . 22
2.3.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Limiting Distribution . . . . . . . . . . . . . . . . . . 23
2.4 Estimation of V . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Measures of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 Basic Inference and Endogeneity 29


3.1 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Tests of A Single Linear Restriction . . . . . . . . . . 30
3.1.3 Tests of Multiple Linear Restrictions . . . . . . . . . . 32
3.1.4 Tests of Nonlinear Restrictions . . . . . . . . . . . . . 32
3.2 Linear Regression when E[XU ] ̸= 0 . . . . . . . . . . . . . . . 33
3.2.1 Motivating Examples . . . . . . . . . . . . . . . . . . . 33


4 Endogeneity 37
4.1 Instrumental Variables . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Partition of β: solve for endogenous components . . . 39
4.2 Estimating β . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.1 The Instrumental Variables (IV) Estimator . . . . . . 40
4.2.2 The Two-Stage Least Squares (TSLS) Estimator . . . 41
4.3 Properties of the TSLS Estimator . . . . . . . . . . . . . . . . 43
4.3.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.2 Limiting Distribution . . . . . . . . . . . . . . . . . . 43
4.3.3 Estimation of V . . . . . . . . . . . . . . . . . . . . . . 44

5 More on Endogeneity 45
5.1 Efficiency of the TSLS Estimator . . . . . . . . . . . . . . . . 45
5.2 “Weak” Instruments . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Interpretation under Heterogeneity . . . . . . . . . . . . . . . 48
5.3.1 Monotonicity in Latent Index Models . . . . . . . . . 51
5.3.2 IV in Randomized Experiments . . . . . . . . . . . . . 52

6 GMM & EL 53
6.1 Generalized Method of Moments . . . . . . . . . . . . . . . . 53
6.1.1 Over-identified Linear Model . . . . . . . . . . . . . . 53
6.1.2 The GMM Estimator . . . . . . . . . . . . . . . . . . 54
6.1.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . 55
6.1.4 Asymptotic Normality . . . . . . . . . . . . . . . . . . 55
6.1.5 Estimation of the Efficient Weighting Matrix . . . . . 56
6.1.6 Overidentification Test . . . . . . . . . . . . . . . . . . 57
6.2 Empirical Likelihood . . . . . . . . . . . . . . . . . . . . . . . 57
6.2.1 Asymptotic Properties and First Order Conditions . . 59

7 Panel Data 61
7.1 Fixed Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1.1 First Differences . . . . . . . . . . . . . . . . . . . . . 62
7.1.2 Deviations from Means . . . . . . . . . . . . . . . . . 63
7.1.3 Asymptotic Properties . . . . . . . . . . . . . . . . . . 64
7.2 Random Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 Dynamic Models . . . . . . . . . . . . . . . . . . . . . . . . . 68

8 Difference in Differences 71
8.1 A Simple Two by Two Case . . . . . . . . . . . . . . . . . . . 71
8.1.1 Pre and post comparison . . . . . . . . . . . . . . . . 73
8.1.2 Treatment and control comparison . . . . . . . . . . . 73
8.1.3 Taking both differences . . . . . . . . . . . . . . . . . 73
8.1.4 A linear regression representation with individual data 75
8.2 A More General Case . . . . . . . . . . . . . . . . . . . . . . 75

8.2.1 Thinking ahead: inference and few treated groups . . 76


8.3 Synthetic Controls . . . . . . . . . . . . . . . . . . . . . . . . 77
8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

II Some Topics 83

9 Non-Parametric Regression 85
9.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.2 Nearest Neighbor vs. Binned Estimator . . . . . . . . . . . . 85
9.3 Nadaraya-Watson Kernel Estimator . . . . . . . . . . . . . . 86
9.3.1 Asymptotic Properties . . . . . . . . . . . . . . . . . . 88
9.4 Local Linear Estimator . . . . . . . . . . . . . . . . . . . . . . 92
9.4.1 Nadaraya-Watson vs Local Linear Estimator . . . . . 94
9.5 Related Methods . . . . . . . . . . . . . . . . . . . . . . . . . 94

10 Regression Discontinuity and Matching 95


10.1 Regression Discontinuity Design . . . . . . . . . . . . . . . . . 96
10.1.1 Identification . . . . . . . . . . . . . . . . . . . . . . . 96
10.1.2 Estimation via Local Linear Regression . . . . . . . . 97
10.1.3 Bandwidth Choice . . . . . . . . . . . . . . . . . . . . 98
10.1.4 Other RD Designs . . . . . . . . . . . . . . . . . . . . 99
10.1.5 Extension to Fuzzy RD . . . . . . . . . . . . . . . . . 101
10.1.6 Validity of RD . . . . . . . . . . . . . . . . . . . . . . 102
10.1.7 RD Packages . . . . . . . . . . . . . . . . . . . . . . . 103
10.2 Matching Estimators . . . . . . . . . . . . . . . . . . . . . . . 103
10.2.1 Identification through Unconfoundedness . . . . . . . 103
10.2.2 Matching Metrics . . . . . . . . . . . . . . . . . . . . . 104
10.2.3 Matching Estimator . . . . . . . . . . . . . . . . . . . 105
10.2.4 Propensity Score Matching and Weighting . . . . . . . 106

11 Random Forests 109


11.1 Coming soon . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

12 LASSO 111
12.1 High Dimensionality and Sparsity . . . . . . . . . . . . . . . . 111
12.2 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
12.2.1 Theoretical Properties of the LASSO . . . . . . . . . . 115
12.3 Adaptive LASSO . . . . . . . . . . . . . . . . . . . . . . . . . 116
12.4 Penalties for Model Selection Consistency . . . . . . . . . . . 117
12.5 Choosing lambda . . . . . . . . . . . . . . . . . . . . . . . . . 118
12.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . 119

13 Binary Choice 121


13.1 Linear Index Model . . . . . . . . . . . . . . . . . . . . . . . . 121
13.1.1 Identification . . . . . . . . . . . . . . . . . . . . . . . 122
13.1.2 Identification of the parametric binary model . . . . . 123
13.1.3 Identification via median independence . . . . . . . . 124
13.2 Estimation of the Linear Index Model . . . . . . . . . . . . . 127
13.2.1 Estimation of parametric binary model . . . . . . . . 127
13.2.2 Estimation of marginal effects . . . . . . . . . . . . . . 130
13.3 Linear probability model . . . . . . . . . . . . . . . . . . . . . 131

III A Primer on Inference and Standard Errors 135

14 HC Variance Estimation 137


14.1 Setup and notation . . . . . . . . . . . . . . . . . . . . . . . . 137
14.2 Consistency of HC standard errors . . . . . . . . . . . . . . . 138
14.3 Improving finite sample performance: HC2 . . . . . . . . . . 141
14.4 The Behrens-Fisher Problem . . . . . . . . . . . . . . . . . . 142
14.4.1 The homoskedastic case . . . . . . . . . . . . . . . . . 143
14.4.2 The robust EHW variance estimator . . . . . . . . . . 144
14.4.3 An unbiased estimator of the variance . . . . . . . . . 145

15 HAC Covariance Estimation 149


15.1 Setup and notation . . . . . . . . . . . . . . . . . . . . . . . . 149
15.2 Limit theorems for dependent data . . . . . . . . . . . . . . . 149
15.3 Estimating long-run variances . . . . . . . . . . . . . . . . . . 151
15.3.1 A naive approach . . . . . . . . . . . . . . . . . . . . . 152
15.3.2 Simple truncation . . . . . . . . . . . . . . . . . . . . 153
15.3.3 Weighting and truncation: the HAC estimator . . . . 153

16 Cluster Covariance Estimation 159


16.1 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . 160
16.2 Rates of Convergence . . . . . . . . . . . . . . . . . . . . . . . 161
16.3 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . 163
16.4 Cluster Covariance Estimation . . . . . . . . . . . . . . . . . 165
16.4.1 Application to Linear Regression . . . . . . . . . . . . 165
16.4.2 Small q ad-hoc adjustments . . . . . . . . . . . . . . . 168
16.4.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . 168

17 Bootstrap 171
17.1 Confidence Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 171
17.1.1 Pivots and Asymptotic Pivots . . . . . . . . . . . . . . 172
17.1.2 Asymptotic Approximations . . . . . . . . . . . . . . . 172
17.2 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 173

17.2.1 The Nonparametric Mean . . . . . . . . . . . . . . . . 173


17.2.2 Asymptotic Refinements . . . . . . . . . . . . . . . . . 177
17.2.3 Implementation of the Bootstrap . . . . . . . . . . . . 178

18 Subsampling & Randomization Tests 181


18.1 Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
18.2 Randomization Tests . . . . . . . . . . . . . . . . . . . . . . . 184
18.2.1 Motivating example: sign changes . . . . . . . . . . . 184
18.2.2 The main result . . . . . . . . . . . . . . . . . . . . . 185
18.2.3 Special case: Permutation tests . . . . . . . . . . . . . 187
Part I

A Primer on Linear Models

Lecture 1

Linear Regression1

1.1 Interpretations of the Linear Regression Model


Let (Y, X, U ) be a random vector where Y and U take values in R and
X takes values in Rk+1 . Assume further that the first component of X
is a constant equal to one, i.e., X = (X0 , X1 , . . . , Xk )′ with X0 = 1. Let
β = (β0 , β1 , . . . , βk )′ ∈ Rk+1 be such that
Y = X ′β + U .
The parameter β0 is sometimes referred to as the intercept parameter and the
remaining βj parameters are sometimes referred to as the slope parameters.
There are several ways to interpret β depending on the assumptions imposed
on (Y, X, U ). We will study three such ways.

1.1.1 Interpretation 1: Linear Conditional Expectation


Suppose E[Y |X] = X ′ β and define U = Y − E[Y |X]. (Note that we’ve
implicitly assumed E[|Y |] < ∞, so E[Y |X] exists.) This implies that
E[U |X] = 0 and therefore that E[U ] = 0. Moreover, E[XU ] = 0, so
Cov[X, U ] = 0. In this case, β is just a convenient way of summarizing
a feature of the joint distribution of Y and X, namely, the conditional ex-
pectation. It is tempting to interpret the coefficient βj for 1 ≤ j ≤ k as
the ceteris paribus (i.e., holding X−j and U constant) effect of a one unit
change in Xj on Y , but this is incorrect. Indeed, more generally, it is not
appropriate to think of differences in (or derivatives of) conditional expec-
tations causally. After all, Y could be an indicator for rain and X could be
an indicator for carrying an umbrella. In this case, it may be the case that
E[Y |X] is increasing in X, but one would not want to think of carrying an
umbrella as causing rain. What is missing is a model of how Y is determined
as a function of X (and possibly other unobserved variables).
1
This lecture is based on Azeem Shaikh’s lecture notes. I want to thank him for kindly
sharing them.


1.1.2 Interpretation 2: “Best” Linear Approximation to the Conditional Expectation or “Best” Linear Predictor
In general, one would not expect the conditional expectation to be linear.
Suppose E[Y 2 ] < ∞ and E[XX ′ ] < ∞ (equivalently, that E[Xj2 ] < ∞
for 1 ≤ j ≤ k). Under these assumptions, one may consider what is the
“best” linear approximation (i.e., function of the form X ′ b for some choice
of b ∈ Rk+1 ) to the conditional expectation. To this end, consider the
minimization problem

$$\min_{b \in \mathbb{R}^{k+1}} E[(E[Y|X] - X'b)^2].$$

Denote by β a solution to this minimization problem. In this case, β is simply


a convenient way of summarizing another feature of the joint distribution
of Y and X, namely, the “best” linear approximation to the conditional
expectation. For the same reasons as before, it is not correct to interpret
the coefficient βj for 1 ≤ j ≤ k as the ceteris paribus effect of a one unit
change in Xj on Y .
Let V = E[Y |X] − Y , so E[XV ] = 0. Note that

$$\begin{aligned}
E[(E[Y|X] - X'b)^2] &= E[(E[Y|X] - Y + Y - X'b)^2] \\
&= E[(V + Y - X'b)^2] \\
&= E[V^2 + 2V(Y - X'b) + (Y - X'b)^2] \\
&= E[V^2] + 2E[VY] - 2E[VX']b + E[(Y - X'b)^2] \\
&= \text{constant} + E[(Y - X'b)^2].
\end{aligned}$$

Thus, β also solves


$$\min_{b \in \mathbb{R}^{k+1}} E[(Y - X'b)^2].$$

In this sense, β is also a convenient way of summarizing the “best” linear


predictor of Y given X. Again, it is tempting to interpret βj for 1 ≤ j ≤ k
causally, but this is not correct.
Consider the second minimization problem. Note E[(Y −X ′ b)2 ] is convex
(as a function of b) and

Db E[(Y − X ′ b)2 ] = E[−2X(Y − X ′ b)] .

Hence, β must satisfy


E[X(Y − X ′ β)] = 0 .
If we define U = Y − X ′ β, then we may rewrite this equation as

E[XU ] = 0 .

1.1.3 Interpretation 3: Causal Model


Suppose Y = g(X, U ), where X are the observed determinants of Y and U
are the unobserved determinants of Y . Such a relationship is a model of how
Y is determined and may come from physics, economics, etc. The effect of
Xj on Y holding X−j and U constant (i.e., ceteris paribus) is determined
by g. If g is differentiable, then it is given by DXj g(X, U ). If we assume
further that
g(X, U ) = X ′ β + U ,
then the ceteris paribus effect of Xj on Y is simply βj . We may normalize
U so that E[U ] = 0 (by replacing U with U − E[U ] and β0 with β0 + E[U ]
if this is not the case). On the other hand, E[U |X], E[U |Xj ] and E[U Xj ]
for 1 ≤ j ≤ k may or may not equal zero. These are now statements about
the relationship between the observed and unobserved determinants of Y .
Probably the easiest way to think about causal relationships is in terms
of potential outcomes. As a simple illustration, consider a randomized con-
trolled experiment where individuals are randomly assigned to a treatment
(a drug) that is intended to improve their health status. Let Y denote the
observed health status and X ∈ {0, 1} denote whether the individual takes
the drug or not. The causal relationship between X and Y can be described
using the so-called potential outcomes:

Y(0): potential outcome in the absence of treatment,
Y(1): potential outcome in the presence of treatment.

In other words, we imagine two potential health status variables (Y (0), Y (1))
where Y (0) is the value of the outcome that would have been observed if
(possibly counter-to-fact) X were 0; and Y (1) is the value of the outcome
that would have been observed if (possibly counter-to-fact) X were 1.
The difference Y (1)−Y (0) is called the treatment effect, and the quantity
E[Y (1) − Y (0)] is usually referred to as the average treatment effect. Using
this notation, we may rewrite the observed outcome as

$$\begin{aligned}
Y &= XY(1) + (1 - X)Y(0) \\
&= E[Y(0)] + (Y(1) - Y(0))X + (Y(0) - E[Y(0)]) \\
&= \beta_0 + \beta_1 X + U,
\end{aligned}$$

where
$$\beta_0 = E[Y(0)], \qquad \beta_1 = Y(1) - Y(0), \qquad U = Y(0) - E[Y(0)].$$

In order for β1 to be a constant parameter, we need to assume that Y(1) − Y(0) is constant across individuals. Under all these assumptions, we end up with a linear constant-effect causal model with U ⊥⊥ X (from the nature of the randomized experiment), E[U] = 0, and so E[XU] = 0. Notice that, in order to have a linear causal model, a randomized controlled experiment is not enough; we also need a constant treatment effect. Without such an assumption it can be shown that a regression of Y on X identifies the average treatment effect (ATE). The ATE is often interpreted as a causal parameter because it is an average of causal effects.
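To make the potential outcomes notation concrete, here is a minimal simulation sketch in Python/NumPy (the data generating process and parameter values are invented purely for illustration): treatment is randomly assigned, treatment effects are heterogeneous, and the slope from regressing Y on a constant and X is close to the average treatment effect.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Potential outcomes with heterogeneous treatment effects (ATE = 2)
    Y0 = rng.normal(size=n)
    Y1 = Y0 + rng.normal(loc=2.0, scale=1.0, size=n)

    # Randomized treatment assignment, independent of (Y0, Y1)
    X = rng.integers(0, 2, size=n)
    Y = X * Y1 + (1 - X) * Y0          # observed outcome

    # Regress Y on a constant and X; the slope estimates E[Y(1) - Y(0)]
    W = np.column_stack([np.ones(n), X])
    beta_hat = np.linalg.lstsq(W, Y, rcond=None)[0]
    print(beta_hat[1], (Y1 - Y0).mean())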

1.2 Linear Regression when E[XU ] = 0


1.2.1 Solving for β
Let (Y, X, U ) be a random vector where Y and U take values in R and
X takes values in Rk+1 . Assume further that the first component of X
is a constant equal to one, i.e., X = (X0 , X1 , . . . , Xk )′ with X0 = 1. Let
β = (β0 , β1 , . . . , βk )′ ∈ Rk+1 be such that

Y = X ′β + U .

Suppose E[XU ] = 0, E[XX ′ ] < ∞, and that there is no perfect collinearity


in X. The justification of the first assumption varies depending on which
of the three preceding interpretations we invoke. The second assumption
ensures that E[XX ′ ] exists. The third assumption is equivalent to the as-
sumption that the matrix E[XX ′ ] is in fact invertible. Since E[XX ′ ] is
positive semi-definite, invertibility of E[XX ′ ] is equivalent to E[XX ′ ] being
positive definite. We say that there is perfect collinearity or multicollinearity
in X if there exists nonzero c ∈ Rk+1 such that P {c′ X = 0} = 1, i.e., if we
can express one component of X as a linear combination of the others.

Lemma 1.1 Let X be a random vector such that E[XX ′ ] < ∞. Then
E[XX ′ ] is invertible if and only if there is no perfect collinearity in X.
Proof: We first argue that if E[XX ′ ] is invertible, then there is no perfect
collinearity in X. To see this, suppose there is perfect collinearity in X,
i.e., that there exists a nonzero c ∈ Rk+1 such that P {c′ X = 0} = 1. Note
that E[XX ′ ]c = E[X(X ′ c)] = 0. Hence, the columns of E[XX ′ ] are linearly
dependent, i.e., E[XX ′ ] is not invertible.
We now argue that if there is no perfect collinearity in X, then E[XX ′ ] is
invertible. To see this, suppose E[XX ′ ] is not invertible. Then, the columns
of E[XX ′ ] must be linearly dependent, i.e., there exists nonzero c ∈ Rk+1
such that E[XX ′ ]c = 0. This implies further that c′ E[XX ′ ]c = E[(c′ X)2 ] =
0, which in turn implies that P {c′ X = 0} = 1, i.e., that there is perfect collinearity in X.
The first assumption above together with the fact that U = Y − X ′ β
implies that E[X(Y − X ′ β)] = 0, i.e., E[XY ] = E[XX ′ ]β. Since E[XX ′ ] is invertible, we have that there is a unique solution to this system of equations, namely,
$$\beta = E[XX']^{-1} E[XY].$$
If E[XX ′ ] is not invertible, i.e., there is perfect collinearity in X, then there
will be more than one solution to this system of equations. Importantly, any
two solutions β and β̃ will necessarily satisfy P {X ′ β = X ′ β̃} = 1. Depend-
ing on the interpretation, this may be an important distinction or not. For
instance, in the second interpretation, each such solution corresponds to the
same “best” linear predictor of Y given X, whereas in the third interpreta-
tion different values of β could have wildly different implications for how X
affects Y holding U constant.

1.3 Estimating β
Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that

Y = X ′β + U .

Suppose that E[XU ] = 0, E[XX ′ ] < ∞ and that there is no perfect


collinearity in X. Above we described three different interpretations and
justifications of such a model. We now discuss estimation of β.

1.3.1 Ordinary Least Squares


Let (Y, X, U ) be distributed as described above and denote by P the marginal
distribution of (Y, X). Let (Y1 , X1 ), . . . , (Yn , Xn ) be an i.i.d. sequence of ran-
dom vectors with distribution P . By analogy with the expression we derived
for β under these assumptions, the natural estimator of β is simply
$$\hat\beta_n = \left(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'\right)^{-1}\left(\frac{1}{n}\sum_{1 \le i \le n} X_i Y_i\right).$$

This estimator is called the ordinary least squares (OLS) estimator of β because it can also be derived as the solution to the following minimization problem:
$$\min_{b \in \mathbb{R}^{k+1}} \frac{1}{n}\sum_{1 \le i \le n} (Y_i - X_i'b)^2.$$
To see this, note that $\frac{1}{n}\sum_{1 \le i \le n}(Y_i - X_i'b)^2$ is convex (as a function of b) and
$$D_b\, \frac{1}{n}\sum_{1 \le i \le n} (Y_i - X_i'b)^2 = -\frac{2}{n}\sum_{1 \le i \le n} X_i(Y_i - X_i'b).$$
Hence β̂n must satisfy
$$\frac{1}{n}\sum_{1 \le i \le n} X_i(Y_i - X_i'\hat\beta_n) = 0,$$
i.e.,
$$\frac{1}{n}\sum_{1 \le i \le n} X_i Y_i = \frac{1}{n}\sum_{1 \le i \le n} X_i X_i' \hat\beta_n.$$
The matrix
$$\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'$$
may not be invertible, but, since E[XX ′ ] is invertible, it will be invertible with probability approaching one.
The ith fitted value is denoted by Ŷi = Xi′ β̂n. The ith residual is denoted by Ûi = Yi − Ŷi. By definition, we therefore have that
$$\frac{1}{n}\sum_{1 \le i \le n} X_i \hat U_i = 0.$$
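As a concrete illustration, the sketch below (Python/NumPy; the simulated design and coefficient values are made up for illustration) computes β̂n from the sample moment matrices and checks the first-order condition that the residuals are orthogonal to the regressors.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # first column is the constant
    beta = np.array([1.0, 2.0, -0.5])
    Y = X @ beta + rng.normal(size=n)

    # beta_hat = (1/n sum X_i X_i')^{-1} (1/n sum X_i Y_i)
    beta_hat = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
    U_hat = Y - X @ beta_hat                                     # residuals

    print(beta_hat)
    print(X.T @ U_hat / n)    # approximately zero: sample analogue of E[XU] = 0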

1.3.2 Projection Interpretation


Define

$$\begin{aligned}
\mathbf{Y} &= (Y_1, \ldots, Y_n)' \\
\mathbf{X} &= (X_1, \ldots, X_n)' \\
\hat{\mathbf{Y}} &= (\hat Y_1, \ldots, \hat Y_n)' = \mathbf{X}\hat\beta_n \\
\mathbf{U} &= (U_1, \ldots, U_n)' \\
\hat{\mathbf{U}} &= (\hat U_1, \ldots, \hat U_n)' = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}\hat\beta_n.
\end{aligned}$$

In this notation,
β̂n = (X′ X)−1 X′ Y
and may be equivalently described as the solution to

$$\min_{b \in \mathbb{R}^{k+1}} |\mathbf{Y} - \mathbf{X}b|^2.$$

Hence, Xβ̂n is the vector in the column space of X that is closest (in terms
of Euclidean distance) to Y. From the above, we see that X′ Û = 0, thus Û
is orthogonal to all of the columns of X (and thus orthogonal to all of the
vectors in the column space of X). In this sense,

Xβ̂n = X(X′ X)−1 X′ Y



is the orthogonal projection of Y onto the ((k +1)-dimensional) column space


of X. The matrix
P = X(X′ X)−1 X′
is known as a projection matrix. It projects a vector in Rn (such as Y)
onto the column space of X. Note that P2 = P, which reflects the fact that
projecting something that already lies in the column space of X onto the
column space of X does nothing. The matrix P is also symmetric. The
matrix
M=I−P
is also a projection matrix. It projects a vector onto the ((n − k − 1)-
dimensional) vector space orthogonal to the column space of X. Hence,
MX = 0. Note that MY = Û. For this reason, M is sometimes called the
“residual maker” matrix.
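For a small design the matrices P and M can be formed explicitly. The following sketch (Python/NumPy, illustrative data only) verifies the properties discussed above: P is idempotent, MX = 0, and MY recovers the OLS residuals.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 20
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = rng.normal(size=n)

    P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection onto the column space of X
    M = np.eye(n) - P                        # "residual maker"

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    U_hat = Y - X @ beta_hat

    print(np.allclose(P @ P, P))             # P^2 = P
    print(np.allclose(M @ X, 0))             # MX = 0
    print(np.allclose(M @ Y, U_hat))         # MY = residuals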

Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.

Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.


Lecture 2

More on Linear Regression1

2.1 Solving for Sub-vectors of β


Partition X into X1 and X2 , where X1 takes values in Rk1 and X2 takes
values in Rk2 . Partition β into β1 and β2 analogously. In this notation,

Y = X1′ β1 + X2′ β2 + U .

Our preceding results imply that
$$\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} E[X_1 X_1'] & E[X_1 X_2'] \\ E[X_2 X_1'] & E[X_2 X_2'] \end{pmatrix}^{-1} \begin{pmatrix} E[X_1 Y] \\ E[X_2 Y] \end{pmatrix}.$$

Using the so-called partitioned matrix inverse formula, it would be possible to derive formulae for β1 and β2, but such an exercise is not particularly illuminating. We therefore take a different approach to arrive at the same
formulae. In doing so, we will make use of the following notation: for a
random variable A and a random vector B, denote by BLP(A|B) the best
linear predictor of A given B, i.e. B ′ E[BB ′ ]−1 E[BA]. If A is a random
vector, then define BLP(A|B) component-wise.
Define Ỹ = Y − BLP(Y |X2 ) and X̃1 = X1 − BLP(X1 |X2 ). Consider the
linear regression Ỹ = X̃1′ β̃1 + Ũ , where E[X̃1 Ũ ] = 0 (as, for example, in the
second interpretation of the linear regression model described before). It
follows that β̃1 = β1. To see this, note that E[X̃1 X̃1′ ] is invertible (because each component of X̃1 is a linear combination of the components of X and there is no perfect collinearity in X), so
$$\begin{aligned}
\tilde\beta_1 &= E[\tilde X_1 \tilde X_1']^{-1} E[\tilde X_1 \tilde Y] \\
&= E[\tilde X_1 \tilde X_1']^{-1}\left(E[\tilde X_1 Y] - E[\tilde X_1\, \mathrm{BLP}(Y|X_2)]\right) \\
&= E[\tilde X_1 \tilde X_1']^{-1} E[\tilde X_1 Y],
\end{aligned}$$

1
This lecture is based on Azeem Shaikh’s lecture notes. I want to thank him for kindly
sharing them.


where the first equality follows from the formula for β̃1 , the second equality
follows from the expression for Ỹ , and the third equality follows from the
fact that E[X̃1 X2′ ] = 0 (because X̃1 is the error term from a regression of X1
on X2 ). Note that this first part of the derivation shows that β̃1 is also the
population coefficient of a linear regression of Y on X̃1 . If we now replace
Y by its expression and do some additional steps, we get
$$\begin{aligned}
\tilde\beta_1 &= E[\tilde X_1 \tilde X_1']^{-1}\left(E[\tilde X_1 X_1'\beta_1] + E[\tilde X_1 X_2'\beta_2] + E[\tilde X_1 U]\right) \\
&= E[\tilde X_1 \tilde X_1']^{-1} E[\tilde X_1 X_1'\beta_1] \\
&= E[\tilde X_1 \tilde X_1']^{-1}\left(E[\tilde X_1 \tilde X_1'\beta_1] + E[\tilde X_1\, \mathrm{BLP}(X_1|X_2)'\beta_1]\right) \\
&= \beta_1,
\end{aligned}$$

where the first equality follows from the expression for Y , the second equal-
ity follows from the fact that E[X̃1 X2′ ] = 0 and E[X̃1 U ] = 0 (because
E[XU ] = 0), the third equality follows from the expression for X̃1 , and the
final equality follows from the fact that E[X̃1 X2′ ] = 0.
In other words, β1 in the linear regression of Y on X1 and X2 is equal to
the coefficient in a linear regression of the error term from a linear regression
of Y on X2 on the error terms from a linear regression of the components
of X1 on X2 . This gives meaning to the common description of β1 as the
“effect” of X1 on Y after “controlling for X2 .”
Notice that if we take X2 to be just a constant, then Ỹ = Y − E[Y ] and
X̃1 = X1 − E[X1 ]. Hence,

β1 = E[(X1 − E[X1 ])(X1 − E[X1 ])′ ]−1 E[(X1 − E[X1 ])(Y − E[Y ])]
= Var[X1 ]−1 Cov[X1 , Y ] .

Finally, also note that if we use our formula to interpret the coefficient βj associated with the jth covariate for 1 ≤ j ≤ k, we obtain
$$\beta_j = \frac{\mathrm{Cov}[\tilde X_j, Y]}{\mathrm{Var}[\tilde X_j]}, \qquad (2.1)$$

which shows that each coefficient in a multivariate regression is the bivariate


slope coefficient for the corresponding regressor, after “partialling out” all
the other variables in the model.

2.2 Estimating Sub-Vectors of β


Partition X into X1 and X2 , where X1 takes values in Rk1 and X2 takes
values in Rk2 . Partition β into β1 and β2 analogously. In this notation,

Y = X1′ β1 + X2′ β2 + U .

Using the preceding results, we can derive estimation counterparts to the


results above about solving for sub-vectors of β. Again, this may be done
using the partitioned matrix inverse formula, but we will use a different
approach. Let X1 = (X1,1 , . . . , X1,n )′ and X2 = (X2,1 , . . . , X2,n )′ . Denote by
P1 the projection matrix onto the column space of X1 and P2 the projection
matrix onto the column space of X2 . Define M1 = I − P1 and M2 = I − P2 .
First note that
Y = X1 β̂1,n + X2 β̂2,n + Û .
This implies that
M2 Y = M2 X1 β̂1,n + Û
because M2 Û = Û, as Û is orthogonal to the column space of X (and hence
the column space of X2 as well). This implies further that

(M2 X1 )′ M2 Y = (M2 X1 )′ M2 X1 β̂1,n

because (M2 X1 )′ Û = X′1 Û = 0, as X′ Û = 0. Note that the matrix (M2 X1 )′ (M2 X1 )


is invertible provided that X′ X is invertible. Hence,

β̂1,n = ((M2 X1 )′ (M2 X1 ))−1 (M2 X1 )′ M2 Y .

In other words, β̂1,n can be obtained by estimating via OLS the coefficients
from a linear regression of M2 Y on M2 X1 . Upon recognizing that M2 Y are
the residuals from a regression of Y on X2 and that the columns of M2 X1 are
the residuals from regressions of the columns of X1 on X2 , we see that this
formula exactly parallels the formula we derived earlier for a sub-vector of
β. This result is sometimes referred to as the Frisch-Waugh-Lovell (FWL)
decomposition.
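The FWL result is easy to verify numerically. The sketch below (Python/NumPy; the design is simulated only for illustration) compares the coefficient on X1 from the full regression with the coefficient from regressing the X2-residualized Y on the X2-residualized X1.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5_000
    x2 = rng.normal(size=n)
    X2 = np.column_stack([np.ones(n), x2])            # constant plus a control
    X1 = 0.5 * x2 + rng.normal(size=n)                # regressor of interest, correlated with x2
    Y = 1.0 + 2.0 * X1 - 1.0 * x2 + rng.normal(size=n)

    def ols(A, b):
        return np.linalg.solve(A.T @ A, A.T @ b)

    # Coefficient on X1 from the full regression of Y on (X1, X2)
    beta_full = ols(np.column_stack([X1, X2]), Y)[0]

    # FWL: residualize Y and X1 with respect to X2, then regress the residuals
    Y_tilde = Y - X2 @ ols(X2, Y)
    X1_tilde = X1 - X2 @ ols(X2, X1)
    beta_fwl = ols(X1_tilde.reshape(-1, 1), Y_tilde)[0]

    print(beta_full, beta_fwl)    # identical up to numerical error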

2.3 Properties of LS
Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that

Y = X ′β + U .

Suppose that E[XU ] = 0, E[XX ′ ] < ∞, and that there is no perfect


collinearity in X. Denote by P the marginal distribution of (Y, X). Let
(Y1 , X1 ), . . . , (Yn , Xn ) be an i.i.d. sample of random vectors with distribution
P . Above we described estimation of β via OLS under these assumptions.
We now discuss properties of the resulting estimator β̂n, imposing stronger assumptions as needed.

2.3.1 Bias
Suppose in addition that E[U |X] = 0. Equivalently, assume that E[Y |X] =
X ′ β. Under this stronger assumption,

E[β̂n ] = β .

In fact,
E[β̂n |X1 , . . . , Xn ] = β .
To see this, note that

β̂n = (X′ X)−1 X′ Y = β + (X′ X)−1 X′ U .

Hence,
E[β̂n |X1 , . . . , Xn ] = β + (X′ X)−1 X′ E[U|X1 , . . . , Xn ] .
Note for any 1 ≤ i ≤ n that

E[Ui |X1 , . . . , Xn ] = E[Ui |Xi ] = 0 ,

where the first equality follows from the fact that Xj is independent of Ui
for i ̸= j. The desired conclusion thus follows.

2.3.2 Gauss-Markov Theorem


Suppose E[U |X] = 0 and that Var[U |X] = σ 2 . When Var[U |X] is constant
(and therefore does not depend on X) we say that U is homoskedastic.
Otherwise, we say that U is heteroskedastic. The Gauss-Markov Theorem says that under these assumptions the OLS estimator is “best” in the sense
that it has the “smallest” value of Var[A′ Y|X1 , . . . , Xn ] among all estimators
of the form
A′ Y
for some matrix A = A(X1 , . . . , Xn ) satisfying

E[A′ Y|X1 , . . . , Xn ] = β .

Here, “smallest” is understood to be in terms of the partial order obtained


by defining B ≥ B̃ if and only if B − B̃ is positive semi-definite. This class
of estimators, of course, includes the OLS estimator as a special case (by
setting A′ = (X′ X)−1 X′ ). This property is sometimes expressed as saying
that OLS is the “best linear unbiased estimator (BLUE)” of β under these
assumptions.
To establish this property of OLS, first note that

E[A′ Y|X1 , . . . , Xn ] = A′ Xβ + A′ E[U|X1 , . . . , Xn ] ,



so E[A′ Y|X1 , . . . , Xn ] = β if and only if A′ X = I. Next, note that

Var[A′ Y|X1 , . . . , Xn ] = A′ Var[Y|X1 , . . . , Xn ]A


= A′ Var[U|X1 , . . . , Xn ]A
= A′ Aσ 2 .

When A′ = (X′ X)−1 X′ , this last expression is simply (X′ X)−1 σ 2 . It therefore
suffices to show that
A′ A − (X′ X)−1
is positive semi-definite for all matrices A satisfying A′ X = I. To this end,
define
C = A − X(X′ X)−1 .
Then,

A′ A − (X′ X)−1 = (C + X(X′ X)−1 )′ (C + X(X′ X)−1 ) − (X′ X)−1


= C′ C + C′ X(X′ X)−1 + (X′ X)−1 X′ C
= C′ C ,

where the last equality follows from the fact that

X′ C = X′ A − X′ X(X′ X)−1 = I − I = 0 .

The desired conclusion thus follows from the fact that C′ C is positive semi-
definite by construction.

2.3.3 Consistency
In this case we do not need additional assumptions. Note that E[XY ] < ∞ since XY = XX ′ β + XU, and both E[XX ′ ] and E[XU ] exist. Under this assumption, the OLS estimator β̂n is consistent for β, i.e., β̂n → β in probability as n → ∞. To see this, simply note that by the WLLN
$$\frac{1}{n}\sum_{1 \le i \le n} X_i X_i' \xrightarrow{P} E[XX'], \qquad \frac{1}{n}\sum_{1 \le i \le n} X_i Y_i \xrightarrow{P} E[XY]$$
as n → ∞. The desired result therefore follows from the CMT.

2.3.4 Limiting Distribution


Suppose E[XX ′ ] < ∞ and that Var[XU ] = E[XX ′ U 2 ] < ∞. Then,
$$\sqrt{n}(\hat\beta_n - \beta) \xrightarrow{d} N(0, V)$$
as n → ∞, where
$$V = E[XX']^{-1} E[XX'U^2] E[XX']^{-1}.$$
To see this, note that
$$\sqrt{n}(\hat\beta_n - \beta) = \left(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{1 \le i \le n} X_i U_i\right).$$
The WLLN implies that
$$\frac{1}{n}\sum_{1 \le i \le n} X_i X_i' \xrightarrow{P} E[XX'] \qquad (2.2)$$
as n → ∞. The CLT implies that
$$\frac{1}{\sqrt{n}}\sum_{1 \le i \le n} X_i U_i \xrightarrow{d} N(0, \mathrm{Var}[XU])$$
as n → ∞. Thus, the desired result follows from the CMT.

2.4 Estimation of V
In order to make use of the preceding estimators, we will require a consistent
estimator of
V = E[XX ′ ]−1 E[XX ′ U 2 ]E[XX ′ ]−1 .
Note that V has the so-called sandwich form. As with most sandwich esti-
mators, the interesting object is the “meat” and not the “bread”. Indeed,
the bread can be consistently estimated by (2.2).
Focusing our attention to the meat, we first consider the case where
E[U |X] = 0 and Var[U |X] = σ 2 (i.e., under homoskedasticity). Under these
conditions,
Var[XU ] = E[XX ′ U 2 ] = E[XX ′ ]σ 2 .
Hence,
V = E[XX ′ ]−1 σ 2 .
A natural choice of estimator is therefore
$$\hat{V}_n = \left(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'\right)^{-1}\hat\sigma_n^2,$$
where σ̂n² is a consistent estimator of σ². The main difficulty in showing that this estimator is a consistent estimator of V lies in choosing a consistent estimator of σ². A natural choice of such an estimator is
$$\hat\sigma_n^2 = \frac{1}{n}\sum_{1 \le i \le n} \hat U_i^2.$$

Note that
Ûi = Yi − Xi′ β̂n = Ui − Xi′ (β̂n − β) ,
so

Ûi2 = (Ui − Xi′ (β̂n − β))2 = Ui2 − 2Ui Xi′ (β̂n − β) + (Xi′ (β̂n − β))2 .

The WLLN implies that
$$\frac{1}{n}\sum_{1 \le i \le n} U_i^2 \xrightarrow{P} \sigma^2$$
as n → ∞. Next, note that the WLLN and CMT imply further that
$$\frac{1}{n}\sum_{1 \le i \le n} U_i X_i'(\hat\beta_n - \beta) = (\hat\beta_n - \beta)'\,\frac{1}{n}\sum_{1 \le i \le n} X_i U_i = o_P(1).$$
Finally, note that
$$\frac{1}{n}\sum_{1 \le i \le n} (X_i'(\hat\beta_n - \beta))^2 \le \frac{1}{n}\sum_{1 \le i \le n} |X_i|^2\, |\hat\beta_n - \beta|^2,$$

which tends in probability to zero because of the WLLN, CMT and the fact
that E[|X|2 ] < ∞ (which follows from the fact that E[XX ′ ] < ∞). The
desired conclusion thus follows.
When we do not assume Var[U |X] = σ², a natural choice of estimator is
$$\hat{V}_n = \left(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'\right)^{-1}\left(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i' \hat U_i^2\right)\left(\frac{1}{n}\sum_{1 \le i \le n} X_i X_i'\right)^{-1}. \qquad (2.3)$$

Later in the class we will prove that this estimator is consistent, i.e., $\hat{V}_n \xrightarrow{P} V$ as n → ∞,

regardless of the functional form of Var[U |X]. This estimator is called the
Heteroskedasticity Consistent (HC) estimator of V. The standard errors
used to construct t-statistics are the square roots of the diagonal elements
of V̂n , and this is the topic of the third part of this class. It is important to
note that, by default, Stata reports homoskedastic-only standard errors.
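A minimal implementation of (2.3) is sketched below (Python/NumPy; the heteroskedastic design is simulated only for illustration). The reported standard errors are the square roots of the diagonal elements of V̂n/n, and they can differ noticeably from the homoskedasticity-only ones.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 2_000
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    U = rng.normal(size=n) * np.exp(0.5 * x)          # heteroskedastic errors
    Y = 1.0 + 2.0 * x + U

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    U_hat = Y - X @ beta_hat

    bread = np.linalg.inv(X.T @ X / n)
    meat = (X * U_hat[:, None] ** 2).T @ X / n        # (1/n) sum X_i X_i' Uhat_i^2
    V_hat = bread @ meat @ bread

    se_robust = np.sqrt(np.diag(V_hat) / n)                  # HC standard errors
    se_homosk = np.sqrt(np.diag(bread) * U_hat.var() / n)    # homoskedasticity-only
    print(se_robust, se_homosk)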

2.5 Measures of Fit


When reporting the results of estimating a linear regression via OLS, it is
common to report a measure of fit known as R2, defined as follows:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS},$$
where
$$TSS = \sum_{1 \le i \le n} (Y_i - \bar Y_n)^2, \qquad ESS = \sum_{1 \le i \le n} (\hat Y_i - \bar Y_n)^2, \qquad SSR = \sum_{1 \le i \le n} \hat U_i^2.$$

Here, T SS is short for total sum of squares, ESS is short for explained sum
of squares, and SSR is short for sum of squared residuals. To show that
the two expressions for R2 are the same, and that 0 ≤ R2 ≤ 1, it suffices to
show that
SSR + ESS = T SS .
Moreover, R2 = 1 if and only if SSR = 0, i.e., Ûi = 0 for all 1 ≤ i ≤ n.
Similarly, R2 = 0 if and only if ESS = 0, i.e., Ŷi = Ȳn for all 1 ≤ i ≤ n. In
this sense, R2 is a measure of the “fit” of a regression.
Note that $\frac{1}{n}\sum_{1 \le i \le n} (Y_i - \bar Y_n)^2$ may be viewed as an estimator of Var[Yi] and $\frac{1}{n}\sum_{1 \le i \le n} \hat U_i^2$ may be viewed as an estimator of Var[Ui]. Thus, R2 may be viewed as an estimator of the quantity
$$1 - \frac{\mathrm{Var}[U_i]}{\mathrm{Var}[Y_i]}.$$

Replacing these estimators with their unbiased counterparts yields “adjusted” R2, defined as
$$\bar R^2 = 1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS}.$$

Note that R2 always increases with the inclusion of an additional regressor,


whereas R̄2 may not. Note also that R̄2 ≤ R2 , so R̄2 ≤ 1, but, unlike R2 ,
R̄2 may be less than zero.
It is important to understand that a high R2 does not justify interpreting
a linear regression as a causal model, just as a low R2 does not invalidate
interpreting a linear regression as a causal model.
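A quick numerical check of these definitions (Python/NumPy sketch, illustrative data only): TSS = ESS + SSR, the two formulas for R2 coincide, and the adjusted version penalizes the number of regressors.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    Y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)
    k = X.shape[1] - 1                                 # number of slope coefficients

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Y_hat = X @ beta_hat
    U_hat = Y - Y_hat

    TSS = np.sum((Y - Y.mean()) ** 2)
    ESS = np.sum((Y_hat - Y.mean()) ** 2)
    SSR = np.sum(U_hat ** 2)

    R2 = 1 - SSR / TSS
    R2_adj = 1 - (n - 1) / (n - k - 1) * SSR / TSS
    print(np.isclose(TSS, ESS + SSR), R2, R2_adj)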

Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.

Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.


Lecture 3

Basic Inference and Endogeneity1

3.1 Inference
Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that
Y = X ′β + U .
Suppose that E[XU ] = 0, that there is no perfect collinearity in X, that
E[XX ′ ] < ∞, and Var[XU ] < ∞. Denote by P the marginal distribution of
(Y, X). Let (Y1 , X1 ), . . . , (Yn , Xn ) be an i.i.d. sample of random vectors with
distribution P . Under these assumptions, we established the asymptotic
normality of the OLS estimator β̂n,
$$\sqrt{n}(\hat\beta_n - \beta) \xrightarrow{d} N(0, V) \qquad (3.1)$$
with
V = E[XX ′ ]−1 E[XX ′ U 2 ]E[XX ′ ]−1 . (3.2)
We also described a consistent estimator V̂n of the limiting variance V. We
now use these results to develop methods for inference. We will study in
particular Wald tests for certain hypotheses. Some other testing principles
will be covered later in class. Confidence regions will be constructed using
the duality between hypothesis testing and the construction of confidence
regions.
Below we will assume further that Var[XU ] = E[XX ′ U 2 ] is non-singular.
This would be implied, for example, by the assumption that P {E[U 2 |X] >
0} = 1. Since E[XX ′ ] is non-singular under the assumption of no perfect
collinearity in X, this implies that V is non-singular.
1
This lecture is based on Azeem Shaikh’s lecture notes. I want to thank him for kindly
sharing them.


3.1.1 Background
Consider the following somewhat generic version of a testing problem. One
observes data Wi = (Yi , Xi ), i = 1, . . . , n, i.i.d. with distribution P ∈ P =
{Pβ : β ∈ Rk+1 } and wishes to test
H0 : β ∈ B0 versus H1 : β ∈ B1 (3.3)
where B0 and B1 form a partition of Rk+1 . In our context, β will be
the coefficient in a linear regression but in general it could be any other
parameter.
A test is simply a function ϕn = ϕn (W1 , . . . , Wn ) that returns the prob-
ability of rejecting the null hypothesis after observing W1 , . . . , Wn . For the
time being, we will only consider non-randomized tests which means that
the function ϕn will take only two values: it will be equal to 1 for rejection
and equal to 0 for non rejection. Most often, ϕn is the indicator function of a
certain test statistic Tn = Tn (W1 , . . . , Wn ) being greater than some critical
value cn (1 − α), that is,
ϕn = I {Tn > cn (1 − α)} . (3.4)
The test is said to be (pointwise) asymptotically of level α (or consistent in
levels) if
$$\limsup_{n \to \infty} E_{P_\beta}[\phi_n] = \limsup_{n \to \infty} P_\beta\{\phi_n = 1\} \le \alpha, \quad \forall \beta \in B_0.$$

Such tests include: Wald tests, quasi-likelihood ratio tests, and Lagrange
multiplier tests.

3.1.2 Tests of A Single Linear Restriction


Consider testing
H0 : r′ β = c versus H1 : r′ β ̸= c ,
where r is a nonzero (k + 1)-dimensional vector and c is a scalar, at level α.
Probably the most important case in this class happens when r selects the
sth component of β, in which case we get
H0 : βs = c versus H1 : βs ̸= c .
The CMT implies that
$$\sqrt{n}(r'\hat\beta_n - r'\beta) \xrightarrow{d} N(0, r'Vr)$$
as n → ∞. Since V is non-singular, r′ Vr > 0. The CMT implies that $r'\hat{V}_n r \xrightarrow{P} r'Vr$ as n → ∞. A natural choice of test statistic for this problem is the absolute value of the t-statistic,
$$t_{\mathrm{stat}} = \frac{\sqrt{n}(r'\hat\beta_n - c)}{\sqrt{r'\hat{V}_n r}},$$

so that Tn = |tstat|. Note that when r selects the sth component of β, we get r′ V̂n r = V̂n,[s,s], i.e., the sth diagonal element of V̂n.
Such a test statistic has the property that large values of Tn provide evidence against the null hypothesis H0, and so using rejection rules of the form “reject H0 if Tn is greater than a certain threshold” makes sense. This threshold value is usually called the critical value.
A suitable choice of critical value for this test statistic is $z_{1-\alpha/2}$, which exploits the fact that, under the null hypothesis,
$$t_{\mathrm{stat}} = \frac{\sqrt{n}(r'\hat\beta_n - c)}{\sqrt{r'\hat{V}_n r}} \xrightarrow{d} N(0,1). \qquad (3.5)$$

To see that this test is consistent in level, note that whenever r′ β = c,
$$\begin{aligned}
P\{\phi_n = 1\} &= P\{T_n > z_{1-\alpha/2}\} \\
&= P\{t_{\mathrm{stat}} > z_{1-\alpha/2}\} + P\{t_{\mathrm{stat}} < -z_{1-\alpha/2}\} \\
&\to 1 - \Phi(z_{1-\alpha/2}) + \Phi(-z_{1-\alpha/2}) \\
&= 1 - \Phi(z_{1-\alpha/2}) + \Phi(z_{\alpha/2}) \\
&= 1 - \left(1 - \tfrac{\alpha}{2}\right) + \tfrac{\alpha}{2} \\
&= \alpha.
\end{aligned}$$

This construction may be modified in a straightforward fashion for testing


“one-sided” hypotheses, i.e.,

H0 : r′ β ≤ c versus H1 : r′ β > c .

In addition, note that by using the duality between hypothesis testing


and the construction of confidence regions, we may construct a confidence
region of level α for each component βs of β as
$$\begin{aligned}
C_n &= \left\{ c \in \mathbb{R} : \frac{\sqrt{n}\,|\hat\beta_{n,s} - c|}{\sqrt{\hat{V}_{n,[s,s]}}} \le z_{1-\alpha/2} \right\} \\
&= \left[ \hat\beta_{n,s} - z_{1-\alpha/2}\sqrt{\frac{\hat{V}_{n,[s,s]}}{n}},\; \hat\beta_{n,s} + z_{1-\alpha/2}\sqrt{\frac{\hat{V}_{n,[s,s]}}{n}} \right].
\end{aligned}$$

This confidence region satisfies

P {βs ∈ Cn } → 1 − α

as n → ∞. It is straightforward to modify this construction to construct a


confidence region of level α for r′ β.
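Putting the pieces together, the sketch below (Python/NumPy/SciPy; simulated data and the heteroskedasticity-consistent V̂n from Lecture 2, all for illustration) computes the t-statistic for H0: β1 = 2 and the associated confidence interval.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    n = 2_000
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    Y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    U_hat = Y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X / n)
    V_hat = bread @ ((X * U_hat[:, None] ** 2).T @ X / n) @ bread

    s, c, alpha = 1, 2.0, 0.05                       # H0: beta_1 = 2 at the 5% level
    se = np.sqrt(V_hat[s, s] / n)
    t_stat = (beta_hat[s] - c) / se
    z = norm.ppf(1 - alpha / 2)
    print(abs(t_stat) > z)                                     # reject H0?
    print(beta_hat[s] - z * se, beta_hat[s] + z * se)          # confidence interval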

3.1.3 Tests of Multiple Linear Restrictions


Consider testing
H0 : Rβ = c versus H1 : Rβ ̸= c ,
where R is a p × (k + 1)-dimensional matrix and c is a p-dimensional vector,
at level α. In order to rule out redundant equations, we require that the
rows of R are linearly independent. The CMT implies that
$$\sqrt{n}(R\hat\beta_n - R\beta) \xrightarrow{d} N(0, RVR')$$

as n → ∞. Note that because V is assumed to be non-singular, RVR′ is also


non-singular. To see this, consider a′ RVR′ a for a non-zero vector a ∈ Rp .
Next, note that a′ R ̸= 0 because the rows of R are assumed to be linearly
independent. Hence, a′ RVR′ a > 0 because V is assumed to be non-singular.
Hence, from our earlier results, we see that
$$n(R\hat\beta_n - R\beta)'(R\hat{V}_n R')^{-1}(R\hat\beta_n - R\beta) \xrightarrow{d} \chi^2_p$$

as n → ∞. Thus, a natural choice of test statistic in this case is therefore

Tn = n(Rβ̂n − c)′ (RV̂n R′ )−1 (Rβ̂n − c)

and a suitable choice of critical value is cp,1−α , the 1 − α quantile of χ2p . The
resulting test is consistent in level.
Note that by using the duality between hypothesis testing and the con-
struction of confidence regions, we may construct a confidence region of level
α for β as

$$C_n = \left\{ c \in \mathbb{R}^{k+1} : n(\hat\beta_n - c)'\hat{V}_n^{-1}(\hat\beta_n - c) \le c_{k+1,1-\alpha} \right\}.$$

This confidence region satisfies

P {β ∈ Cn } → 1 − α

as n → ∞. It is straightforward to modify this construction to construct a


confidence region of level α for Rβ.
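The Wald statistic for several linear restrictions follows the same recipe. The sketch below (Python/NumPy/SciPy; simulated data for illustration) tests H0: β1 = 0 and β2 = 0 jointly and compares Tn with the chi-squared critical value.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(6)
    n = 2_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = X @ np.array([1.0, 0.3, 0.0]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    U_hat = Y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X / n)
    V_hat = bread @ ((X * U_hat[:, None] ** 2).T @ X / n) @ bread

    R = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])                  # restrictions: beta_1 = 0, beta_2 = 0
    c = np.zeros(2)
    diff = R @ beta_hat - c
    Tn = n * diff @ np.linalg.solve(R @ V_hat @ R.T, diff)
    print(Tn, chi2.ppf(0.95, df=2))                  # reject if Tn exceeds the critical value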

3.1.4 Tests of Nonlinear Restrictions


Consider testing

H0 : f (β) = 0 versus H1 : f (β) ̸= 0 ,

where f : Rk+1 → Rp , at level α. Assume that f is continuously differ-


entiable at β and denote by Dβ f (β) the p × (k + 1)-dimensional matrix of

partial derivatives of f evaluated at β. Assume that the rows of Dβ f (β) are


linearly independent. The Delta Method implies that
$$\sqrt{n}(f(\hat\beta_n) - f(\beta)) \xrightarrow{d} N(0, D_\beta f(\beta)\, V\, D_\beta f(\beta)')$$

as n → ∞. The CMT implies that
$$D_\beta f(\hat\beta_n)\,\hat{V}_n\, D_\beta f(\hat\beta_n)' \xrightarrow{P} D_\beta f(\beta)\, V\, D_\beta f(\beta)'$$

as n → ∞. It is now straightforward to modify the construction of the test in


the preceding section appropriately to develop a test for the present purpose.
It is also straightforward to modify the construction of the confidence region
in the preceding section to construct a confidence region of level α for f (β).
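As an example of how the delta method operationalizes a nonlinear restriction, the sketch below (Python/NumPy/SciPy; the design and the particular restriction f(β) = β1 β2 − 1 are chosen only for illustration) evaluates the gradient at β̂n and forms the corresponding Wald statistic.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(7)
    n = 5_000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    Y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=n)   # beta_1 * beta_2 = 1, so H0 holds

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    U_hat = Y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X / n)
    V_hat = bread @ ((X * U_hat[:, None] ** 2).T @ X / n) @ bread

    f = beta_hat[1] * beta_hat[2] - 1.0                      # f(beta_hat)
    D = np.array([0.0, beta_hat[2], beta_hat[1]])            # gradient of f at beta_hat
    Tn = n * f ** 2 / (D @ V_hat @ D)
    print(Tn, chi2.ppf(0.95, df=1))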

3.2 Linear Regression when E[XU ] ̸= 0


Let (Y, X, U ) be a random vector where Y and U take values in R and
X takes values in Rk+1 . Assume further that the first component of X is
constant and equal to one, i.e., X = (X0 , X1 , . . . , Xk )′ with X0 = 1. Let
β = (β0 , β1 , . . . , βk )′ ∈ Rk+1 be such that

Y = X ′β + U .

In contrast to our earlier discussion, we do not assume that E[XU ] = 0.


Any Xj such that E[Xj U ] = 0 is said to be exogenous; any Xj such that
E[Xj U ] ̸= 0 is said to be endogenous. By normalizing β0 if necessary,
we assume X0 is exogenous. Note that it must be the case that we are
interpreting this regression as a causal model.
Note that since E[XU ] ̸= 0 we have that

E[XY ] = E[XX ′ ]β + E[XU ]

and so
E[XX ′ ]−1 E[XY ] = β + E[XX ′ ]−1 E[XU ] .
The results from the previous class showed that the least squares estimator β̂n of β converges to E[XX ′ ]−1 E[XY ]. It follows that
$$\hat\beta_n \xrightarrow{P} \beta + E[XX']^{-1} E[XU],$$
so that β̂n is inconsistent for β under endogeneity.

3.2.1 Motivating Examples


We now briefly review some common ways in which endogeneity may arise.

Omitted Variables
Suppose k = 2, so
Y = β0 + β1 X1 + β2 X2 + U .
We are interpreting this regression as a causal model and are willing to
assume that E[XU ] = 0 (i.e., E[U ] = E[X1 U ] = E[X2 U ] = 0), but X2 is
unobserved. An example of a situation like this is when Y is wages, X1 is
education, and X2 is ability. Given unobserved ability, we may rewrite this
model as
Y = β0∗ + β1∗ X1 + U ∗ ,
with

β0∗ = β0 + β2 E[X2 ]
β1∗ = β1
U ∗ = β2 (X2 − E[X2 ]) + U .

Note that we have normalized β0∗ so that E[U ∗ ] = 0. In this model,

E[X1 U ∗ ] = β2 Cov[X1 , X2 ] ,

so X1 is endogenous whenever β2 ̸= 0 and Cov[X1 , X2 ] ̸= 0. Based on


the results from the previous class, it follows immediately that running a
regression of Y on X1 produces an estimator with the property that
$$\hat\beta^*_{1,n} \xrightarrow{P} \beta_1 + \beta_2\,\frac{\mathrm{Cov}[X_1, X_2]}{\mathrm{Var}[X_1]}, \qquad (3.6)$$

where the term β2 [Var[X1 ]]−1 Cov[X1 , X2 ] is usually referred to as omitted


variable bias.
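The formula in (3.6) can be checked by simulation. In the sketch below (Python/NumPy; all parameter values invented for illustration) the omitted regressor plays the role of "ability": the short regression slope converges to β1 plus the omitted variable bias term.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 200_000
    X2 = rng.normal(size=n)                       # unobserved ("ability")
    X1 = 0.8 * X2 + rng.normal(size=n)            # observed ("education"), correlated with X2
    Y = 1.0 + 2.0 * X1 + 1.5 * X2 + rng.normal(size=n)

    # Short regression of Y on a constant and X1 only
    W = np.column_stack([np.ones(n), X1])
    beta_short = np.linalg.solve(W.T @ W, W.T @ Y)

    bias = 1.5 * np.cov(X1, X2, bias=True)[0, 1] / np.var(X1)
    print(beta_short[1], 2.0 + bias)              # both close to beta_1 + omitted variable bias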

Measurement Error
Partition X into X0 and X1 , where X0 = 1 and X1 takes values in Rk .
Partition β analogously. In this notation,

Y = β0 + X1′ β1 + U .

We are interpreting this regression as a causal model and are willing to


assume that E[XU ] = 0, but X1 is not observed. Instead, X̂1 is observed,
where
X̂1 = X1 + V .
Assume E[V ] = 0, Cov[X1 , V ] = 0, and Cov[U, V ] = 0. We may therefore
rewrite this model as
Y = β0∗ + X̂1′ β1∗ + U ∗ ,

with

β0∗ = β0
β1∗ = β1
U ∗ = −V ′ β1 + U .

In this model,

E[X̂1 U ∗ ] = −E[X̂1 V ′ β1 ] = −E[V V ′ ]β1 ,

so X̂1 is typically endogenous. Note that in the case where X1 is a scalar random variable, and using results from the previous class, it follows that running a regression of Y on X̂1 produces an estimator with the property that
$$\hat\beta^*_{1,n} \xrightarrow{P} \beta_1 + \frac{E[\hat X_1 U^*]}{\mathrm{Var}[\hat X_1]} = \beta_1\left(1 - \frac{\mathrm{Var}[V]}{\mathrm{Var}[\hat X_1]}\right) < \beta_1, \qquad (3.7)$$
so that the regression coefficient is biased towards zero when the regressor of interest is measured with so-called classical measurement error. The last inequality follows from using X̂1 = X1 + V. Indeed, in the extreme case where X̂1 = V, it follows that $\hat\beta^*_{1,n} \xrightarrow{P} 0$.

Simultaneity
A classical example of simultaneity is given by supply and demand. De-
note by Qs the quantity supplied and by Qd the quantity demanded. As a
function of (non-market clearing) price P̃ , assume that

Qd = β0d + β1d P̃ + U d
Qs = β0s + β1s P̃ + U s ,

where E[U s ] = E[U d ] = E[U s U d ] = 0. We observe (Q, P ), where Q and P


are such that the market clears, i.e., Qs = Qd . This implies that

β0d + β1d P + U d = β0s + β1s P + U s ,

so
1
P = (β0d − β0s + U d − U s ) .
β1s d
− β1
It follows that P is endogenous in both of the equations

Q = β0d + β1d P + U d
Q = β0s + β1s P + U s
36 LECTURE 3. BASIC INFERENCE AND ENDOGENEITY

because
Var[U d ]
E[P U d ] =
β1s − β1d
Var[U s ]
E[P U s ] = − s .
β1 − β1d

As it was the case in the previous two examples, running a regression of


Q in P will result in estimators that do not converge to either β1d or β1s .

Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.

Wooldridge, J. M. (2010): Econometric analysis of cross section and


panel data, MIT press.
Lecture 4

Endogeneity1

4.1 Instrumental Variables


Let (Y, X, U ) be a random vector where Y and U take values in R and
X takes values in Rk+1 . Assume further that the first component of X is
constant and equal to one, i.e., X = (X0 , X1 , . . . , Xk )′ with X0 = 1. Let
β = (β0 , β1 , . . . , βk )′ ∈ Rk+1 be such that

Y = X ′β + U .

In contrast to our earlier discussion, we do not assume that E[XU ] = 0.


Any Xj such that E[Xj U ] = 0 is said to be exogenous; any Xj such that
E[Xj U ] ̸= 0 is said to be endogenous.
In order to overcome the difficulty associated with E[XU ] ̸= 0, we as-
sume that there is an additional random vector Z taking values in Rℓ+1 with
ℓ + 1 ≥ k + 1 such that E[ZU ] = 0. We assume that any exogenous compo-
nent of X is contained in Z (the so-called included instruments). In partic-
ular, we assume that the first component of Z is constant and equal to one,
i.e., Z = (Z0 , Z1 , . . . , Zℓ )′ with Z0 = 1. We also assume that E[ZX ′ ] < ∞,
E[ZZ ′ ] < ∞ and that there is no perfect collinearity in Z. The components
of Z are sometimes referred to as instruments or instrumental variables. The
requirement that E[ZU ] = 0 is termed instrument exogeneity. We further
assume that the rank of E[ZX ′ ] is k+1. This condition is termed instrument
relevance or the rank condition. Note that a necessary condition for this to
be true is that ℓ ≥ k. This is sometimes referred to as the order condition.
Using the fact that U = Y − X ′ β and E[ZU ] = 0, we see that β solves
the system of equations

E[ZY ] = E[ZX ′ ]β .

1
This lecture is based on Azeem Shaikh’s lecture notes. I want to thank him for kindly
sharing them.


Since ℓ + 1 ≥ k + 1, this may be an over-determined system of equations. In


order to find an explicit formula for β, the following lemma is useful.

Lemma 4.1 Suppose there is no perfect collinearity in Z and let Π be such


that BLP(X|Z) = Π′ Z. E[ZX ′ ] has rank k + 1 if and only if Π has rank
k + 1. Moreover, the matrix Π′ E[ZX ′ ] is invertible.
Proof: Write X = Π′ Z + V where E[ZV ′ ] = 0. It follows that E[ZX ′ ] =
E[ZZ ′ ]Π. Recall the rank inequality, which states that

rank(AB) ≤ min{rank(A), rank(B)}

for any conformable matrices A and B. Applying this result, we see that

rank(E[ZZ ′ ]Π) ≤ rank(Π) .

We have further that

rank(Π) = rank(E[ZZ ′ ]−1 E[ZZ ′ ]Π) ≤ rank(E[ZZ ′ ]Π) .

Hence,
rank(E[ZX ′ ]) = rank(E[ZZ ′ ]Π) = rank(Π) ,
as desired.
To complete the proof, note that Π′ E[ZX ′ ] = Π′ E[ZZ ′ ]Π and argue that
Π′ E[ZZ ′ ]Π is invertible using arguments given earlier.

Since β solves Π′ E[ZY ] = Π′ E[ZX ′ ]β, we arrive at two formulae for β


by applying the lemma:
$$\begin{aligned}
\beta &= (\Pi' E[ZX'])^{-1}\, \Pi' E[ZY] \qquad &(4.1) \\
&= (\Pi' E[ZZ']\Pi)^{-1}\, \Pi' E[ZY]. \qquad &(4.2)
\end{aligned}$$
Note that if k = ℓ, then Π is an invertible matrix and therefore
$$\beta = (E[ZX'])^{-1} E[ZY]. \qquad (4.3)$$
In this case, we say that β is exactly identified. Otherwise, we say that β is over-identified.
A third formula for β arises by replacing Π with E[ZZ ′ ]−1 E[ZX ′ ],
$$\beta = \left(E[ZX']'\, E[ZZ']^{-1} E[ZX']\right)^{-1} E[ZX']'\, E[ZZ']^{-1} E[ZY]. \qquad (4.4)$$

Before proceeding, it is useful to use the preceding lemma to further examine


the rank condition in some simpler settings. To this end, consider the case

where k = ℓ and only Xk is endogenous. Let Zj = Xj for all 0 ≤ j ≤ k − 1.


In this case,
$$\Pi' = \begin{pmatrix}
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 \\
\pi_0 & \pi_1 & \cdots & \pi_{\ell-1} & \pi_\ell
\end{pmatrix}.$$
The rank condition therefore requires πℓ ̸= 0. In other words, the instrument
Zℓ must be “correlated with Xk after controlling for X0 , X1 , . . . , Xk−1 .”

4.1.1 Partition of β: solve for endogenous components


Partition X into X1 and X2 , where X2 is exogenous. Partition Z into Z1
and Z2 and β into β1 and β2 analogously. Note that Z2 = X2 are included
instruments and Z1 are excluded instruments. In this notation,
Y = X1′ β1 + X2′ β2 + U .
Note that
BLP(Y |Z2 ) = BLP(X1′ β1 |Z2 ) + BLP(X2′ β2 |Z2 ) + BLP(U |Z2 )
= BLP(X1 |Z2 )′ β1 + X2′ β2 ,
where the second equality uses the fact that E[Z2 U ] = 0. It follows that
Y ∗ = X1∗ ′ β1 + U ,
where
Y ∗ = Y − BLP(Y |Z2 )
X1∗ = X1 − BLP(X1 |Z2 ) .
This calculation shows again the sense in which we may interpret β1 as
summarizing the effect of X1 on Y “after controlling for X2 .” In the exactly
identified case, it follows that
E[Z1 Y ∗ ] = E[Z1 X1∗ ′ ]β1 .
Since there must be a unique solution to this system of equations, it must
be the case that E[Z1 X1∗ ′ ] is invertible. It follows that
β1 = E[Z1 X1∗ ′ ]−1 E[Z1 Y ∗ ] .
In the over-identified case, we may repeat this calculation with X̂1∗ = BLP(X1∗ |Z1 )
in place of Z1 . This yields
β1 = E[X̂1∗ X1∗ ′ ]−1 E[X̂1∗ Y ∗ ]
= E[X̂1∗ X̂1∗′ ]−1 E[X̂1∗ Y ∗ ] ,
where the second equality uses the fact that X1∗ = X̂1∗ +V with E[X̂1∗ V ′ ] = 0.

4.2 Estimating β
Let (Y, X, Z, U ) be a random vector where Y and U take values in R, X
takes values in Rk+1 , and Z takes values in Rℓ+1 . Let β ∈ Rk+1 be such
that
Y = X ′β + U .
Suppose E[ZX ′ ] < ∞, E[ZZ ′ ] < ∞, E[ZU ] = 0, there is no perfect
collinearity in Z, and that the rank of E[ZX ′ ] is k + 1. We now discuss
estimation of β.

4.2.1 The Instrumental Variables (IV) Estimator


We first consider the case in which k = ℓ. Let (Y, X, Z, U ) be distributed
as described above and denote by P the marginal distribution of (Y, X, Z).
Let (Y1 , X1 , Z1 ), . . . , (Yn , Xn , Zn ) be an i.i.d. sequence of random variables
with distribution P . By analogy with the expression we derived for β in
(4.3) under these assumptions, the natural estimator of β is simply
 −1  
1 X 1 X
β̂n =  Zi Xi′   Zi Yi  .
n n
1≤i≤n 1≤i≤n

This estimator is called the instrumental variables (IV) estimator of β. Note that β̂n satisfies
$$\frac{1}{n}\sum_{1 \le i \le n} Z_i(Y_i - X_i'\hat\beta_n) = 0.$$
In particular, Ûi = Yi − Xi′ β̂n satisfies
$$\frac{1}{n}\sum_{1 \le i \le n} Z_i \hat U_i = 0.$$

To gain further insight on the IV estimator, partition X into X0 and X1, where X0 = 1 and X1 is assumed to take values in R. Do the same with Z and β. An interesting interpretation of the IV estimator of β1 is obtained by multiplying and dividing by $\frac{1}{n}\sum_{i=1}^n (Z_{1,i} - \bar Z_{1,n})^2$, i.e.,
$$\hat\beta_{1,n} = \frac{\frac{1}{n}\sum_{i=1}^n (Z_{1,i} - \bar Z_{1,n})Y_i \,\big/\, \frac{1}{n}\sum_{i=1}^n (Z_{1,i} - \bar Z_{1,n})^2}{\frac{1}{n}\sum_{i=1}^n (Z_{1,i} - \bar Z_{1,n})X_{1,i} \,\big/\, \frac{1}{n}\sum_{i=1}^n (Z_{1,i} - \bar Z_{1,n})^2}. \qquad (4.5)$$

The IV estimator of β1 is simply the ratio of the regression slope of Y on


Z1 (the so-called reduced form) to the regression slope of X1 on Z1 (the
so-called first stage). To see this in a different way, write the model as

Y = β0 + β1 X1 + U

and
X1 = π0 + π1 Z1 + V ,
so that replacing the second equation into the first one delivers

Y = β0∗ + β1 π1 Z1 + U ∗

with

β0∗ = β0 + β1 π0
U ∗ = U + β1 V .

Thus, the estimated slope in the reduced form converges in probability to


β1 π1 , while the estimated slope in the first stage converges to π1 . The
IV estimator takes the ratio of these two, therefore delivering a consistent
estimator of β1 . Note that the IV estimand is predicated on the notion that
the first stage slope is not zero (π1 ̸= 0), which is just another way to state
our rank condition in this simple case.
This estimator may be expressed more compactly using matrix notation.
Define

Z = (Z1 , . . . , Zn )′
X = (X1 , . . . , Xn )′
Y = (Y1 , . . . , Yn )′ .

In this notation, we have

β̂n = (Z′ X)−1 (Z′ Y) .
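In the just-identified case the IV estimator can be computed either from the matrix formula or as the ratio of the reduced-form slope to the first-stage slope; the sketch below (Python/NumPy; the endogenous design is simulated only for illustration) does both and also shows that OLS is inconsistent here.

    import numpy as np

    rng = np.random.default_rng(9)
    n = 100_000
    Z1 = rng.normal(size=n)                          # instrument
    V = rng.normal(size=n)
    U = rng.normal(size=n) + 0.8 * V                 # endogeneity: U is correlated with V
    X1 = 0.5 * Z1 + V                                # first stage, pi_1 = 0.5
    Y = 1.0 + 2.0 * X1 + U                           # beta_1 = 2

    Z = np.column_stack([np.ones(n), Z1])
    X = np.column_stack([np.ones(n), X1])

    beta_iv = np.linalg.solve(Z.T @ X, Z.T @ Y)      # (Z'X)^{-1} Z'Y

    def slope(a, b):
        return np.cov(a, b, bias=True)[0, 1] / np.var(a)

    print(beta_iv[1], slope(Z1, Y) / slope(Z1, X1))  # reduced form over first stage
    print(np.linalg.solve(X.T @ X, X.T @ Y)[1])      # OLS slope: inconsistent for beta_1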

4.2.2 The Two-Stage Least Squares (TSLS) Estimator


Now consider the case in which ℓ > k. The expressions derived for β in this
case involved Π, where BLP(X|Z) = Π′ Z. An estimate of Π can be obtained
by OLS. More precisely, since Π = E[ZZ ′ ]−1 E[ZX ′ ], a natural estimator of
Π is
$$\hat\Pi_n = \left(\frac{1}{n}\sum_{1 \le i \le n} Z_i Z_i'\right)^{-1}\left(\frac{1}{n}\sum_{1 \le i \le n} Z_i X_i'\right).$$

With this estimator of Π, a natural estimator of β is simply
$$\begin{aligned}
\hat\beta_n &= \left(\frac{1}{n}\sum_{1 \le i \le n} \hat\Pi_n' Z_i X_i'\right)^{-1}\left(\frac{1}{n}\sum_{1 \le i \le n} \hat\Pi_n' Z_i Y_i\right) \\
&= \left(\frac{1}{n}\sum_{1 \le i \le n} \hat\Pi_n' Z_i Z_i' \hat\Pi_n\right)^{-1}\left(\frac{1}{n}\sum_{1 \le i \le n} \hat\Pi_n' Z_i Y_i\right).
\end{aligned}$$

The first equation above provides an interpretation of the TSLS estimator as an IV estimator with Π̂′n Zi playing the role of the instrument. Note further that if k + 1 = ℓ + 1 and Π̂n is invertible, then the TSLS estimator of β is exactly equal to the IV estimator of β. The second equality might be expected from our calculations in (4.2). To justify it here, write Xi = Π̂′n Zi + V̂i and note from properties of OLS that
$$\frac{1}{n}\sum_{1 \le i \le n} Z_i \hat V_i' = 0.$$
This estimator of β is called the two-stage least squares (TSLS) estimator of β. Note that β̂n satisfies
$$\frac{1}{n}\sum_{1 \le i \le n} \hat\Pi_n' Z_i (Y_i - X_i'\hat\beta_n) = 0.$$
In particular, Ûi = Yi − Xi′ β̂n satisfies
$$\frac{1}{n}\sum_{1 \le i \le n} \hat\Pi_n' Z_i \hat U_i = 0.$$

Notice that this implies that Ûi is orthogonal to the instruments that are equal to exogenous regressors (the included instruments), but may not be orthogonal to the other regressors. It is termed the TSLS estimator because it may be obtained in
the following way: first, regress (each component of) Xi on Zi to obtain
X̂i = Π̂′n Zi ; second, regress Yi on X̂i to obtain β̂n . However, in order to
obtain proper standard errors, it is recommended to compute the estimator
in one step (see the following section).
The estimator may again be expressed more compactly using matrix
notation. Define

X̂ = (X̂1 , . . . , X̂n )′
= PZ X ,

where
PZ = Z(Z′ Z)−1 Z′
is the projection matrix onto the column space of Z. In this notation, we
have

β̂n = (X̂′ X)−1 (X̂′ Y)


= (X′ PZ X)−1 (X′ PZ Y) ,

which should be expected given our previous derivation in (4.4).
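As a sketch (hypothetical names; an over-identified case with two excluded instruments), the TSLS estimator can be computed directly from the matrix expressions above, in one step rather than by two literal OLS regressions:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 2000

    # Two excluded instruments (plus a constant) for one endogenous regressor.
    Z1, Z2 = rng.normal(size=n), rng.normal(size=n)
    V = rng.normal(size=n)
    U = 0.7 * V + rng.normal(size=n)
    X1 = 1.0 + 1.0 * Z1 - 0.5 * Z2 + V
    Y = 2.0 - 1.0 * X1 + U

    X = np.column_stack([np.ones(n), X1])
    Z = np.column_stack([np.ones(n), Z1, Z2])

    # First stage: Pi_hat = (Z'Z)^{-1} Z'X, then X_hat = Z Pi_hat = P_Z X.
    Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
    X_hat = Z @ Pi_hat

    # TSLS: (X_hat'X)^{-1} X_hat'Y = (X'P_Z X)^{-1} X'P_Z Y.
    beta_tsls = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
    print(beta_tsls)   # roughly (2, -1)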



4.3 Properties of the TSLS Estimator


Let (Y, X, Z, U ) be a random vector where Y and U take values in R, X
takes values in Rk+1 , and Z takes values in Rℓ+1 . Let β ∈ Rk+1 be such
that
Y = X ′β + U .
Suppose E[ZX ′ ] < ∞, E[ZZ ′ ] < ∞, E[ZU ] = 0, there is no per-
fect collinearity in Z, and that the rank of E[ZX ′ ] is k + 1. Denote by P
the marginal distribution of (Y, X, Z). Let (Y1 , X1 , Z1 ), . . . , (Yn , Xn , Zn ) be
an i.i.d. sequence of random variables with distribution P . Above we de-
scribed estimation of β via TSLS under these assumptions. We now discuss
properties of the resulting estimator, β̂n , imposing stronger assumptions as
needed.

4.3.1 Consistency
Under the assumptions stated above, the TSLS estimator, β̂n , is consistent
for β, i.e., β̂n →P β as n → ∞. To see this, first recall from our results on
OLS that

Π̂n →P Π

as n → ∞. Next, note that the WLLN implies that

(1/n) ∑_{1≤i≤n} Zi Zi′ →P E[ZZ ′ ]
(1/n) ∑_{1≤i≤n} Zi Yi →P E[ZY ]

as n → ∞. The desired result therefore follows from the CMT.

4.3.2 Limiting Distribution


In addition to the assumptions above, assume that Var[ZU ] = E[ZZ ′ U 2 ] <
∞. Then,
√n(β̂n − β) →d N (0, V)
as n → ∞, where

V = E[Π′ ZZ ′ Π]−1 Π′ Var[ZU ]ΠE[Π′ ZZ ′ Π]−1 .

To see this, note that


   −1   
√ 1 X 1 X
n(β̂n − β) = Π̂′n  Zi Zi′  Π̂n  Π̂′n  √ Z i Ui   .
n n
1≤i≤n 1≤i≤n

As in the preceding section, we have that


Π̂n →P Π
(1/n) ∑_{1≤i≤n} Zi Zi′ →P E[ZZ ′ ]

as n → ∞. The CLT implies that


(1/√n) ∑_{1≤i≤n} Zi Ui →d N (0, Var[ZU ]) .

The desired result thus follows from the CMT.

4.3.3 Estimation of V
A natural estimator of V is given by
   −1
1 X
V̂n = Π̂′n  Zi Zi′  Π̂n 
n
1≤i≤n
 
1 X
× Π̂′n  Zi Zi′ Ûi2  Π̂n
n
1≤i≤n
   −1
1 X
× Π̂′n  Zi Zi′  Π̂n  ,
n
1≤i≤n

where Ûi = Yi − Xi′ β̂n . As in our discussion of OLS, the primary difficulty
in establishing the consistency of this estimator lies in showing that
(1/n) ∑_{1≤i≤n} Zi Zi′ Ûi2 →P Var[ZU ]

as n → ∞. Note that part of the complication lies in the fact that we do


not observe Ui and therefore have to use Ûi . However, the desired result can
be shown by arguing exactly as in the second part of this class.
Note that Ûi = Yi − Xi′ β̂n ̸= Yi − X̂i′ β̂n , so the standard errors from two
repeated applications of OLS will be incorrect. Assuming Var[ZU ] is invert-
ible, inference may now be carried out exactly the same way as discussed
for the OLS estimator, simply replacing the OLS quantities with their TSLS
counterparts.
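A minimal sketch of this computation (hypothetical names; heteroskedasticity-robust, and using the residuals Ûi = Yi − Xi′ β̂n rather than Yi − X̂i′ β̂n ):

    import numpy as np

    def tsls_robust(Y, X, Z):
        """TSLS point estimate and robust (sandwich) standard errors.

        Y: (n,) outcomes, X: (n, k+1) regressors, Z: (n, l+1) instruments.
        """
        n = len(Y)
        Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)     # first stage
        W = Z @ Pi_hat                                 # W_i = Pi_hat' Z_i
        beta_hat = np.linalg.solve(W.T @ X, W.T @ Y)   # TSLS estimate
        U_hat = Y - X @ beta_hat                       # residuals use X, not X_hat
        A_inv = np.linalg.inv(W.T @ W / n)             # inverse of estimated Pi'E[ZZ']Pi
        meat = (W * U_hat[:, None]).T @ (W * U_hat[:, None]) / n
        V_hat = A_inv @ meat @ A_inv                   # sandwich form of V_n
        se = np.sqrt(np.diag(V_hat) / n)
        return beta_hat, se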

Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.
Wooldridge, J. M. (2010): Econometric analysis of cross section and
panel data, MIT press.
Lecture 5

More on Endogeneity1

5.1 Efficiency of the TSLS Estimator


Let (Y, X, Z, U ) be a random vector where Y and U take values in R, X
takes values in Rk+1 , and Z takes values in Rℓ+1 . Let β ∈ Rk+1 be such
that
Y = X ′β + U .
Suppose E[ZX ′ ] < ∞, E[ZZ ′ ] < ∞, E[ZU ] = 0, there is no perfect
collinearity in Z, and that the rank of E[ZX ′ ] is k + 1. Denote by P the
marginal distribution of (Y, X, Z). Let (Y1 , X1 , Z1 ), . . . , (Yn , Xn , Zn ) be an
i.i.d. sequence of random variables with distribution P .
The TSLS estimator identifies β by means of the projection matrix Π =
E[ZZ ′ ]−1 E[ZX ′ ]. However, note that we could have solved for β using any
(ℓ + 1) × (k + 1) dimensional matrix Γ such that E[Γ′ ZX ′ ] has rank k + 1.
For any such matrix,

β = E[Γ′ ZX ′ ]−1 E[Γ′ ZY ] ,

and we could have estimated β using


 −1  
1 X 1 X
β̃n =  Γ′ Zi Xi′   Γ′ Zi Yi  .
n n
1≤i≤n 1≤i≤n

Note one could use a consistent estimate of Γ, Γ̂n , instead. By arguing as


before, it is possible to show under our assumptions that β̃n →P β as n → ∞.
If in addition Var[ZU ] = E[ZZ ′ U 2 ] < ∞, then, by arguing as before, it is
also possible to show that
√n(β̃n − β) →d N (0, Ṽ)

as n → ∞, where

Ṽ = E[Γ′ ZX ′ ]−1 Γ′ Var[ZU ]ΓE[Γ′ ZX ′ ]−1′ .

1 This lecture is based on Azeem Shaikh’s lecture notes. I want to thank him for kindly sharing them.

We now argue that under certain assumptions, the “best” choice of Γ is


given by Π, i.e., Ṽ ≥ V.
In order to establish this claim, we assume that E[U |Z] = 0 and Var[U |Z] =
σ 2 . In addition, define W ∗ = Π′ Z and W = Γ′ Z, which will simplify the
notation below. To see that Ṽ ≥ V, first note that under these assumptions

Ṽ = σ 2 E[Γ′ ZX ′ ]−1 E[Γ′ ZZ ′ Γ]E[Γ′ ZX ′ ]−1′


= σ 2 E[Γ′ ZZ ′ Π]−1 E[Γ′ ZZ ′ Γ]E[Γ′ ZZ ′ Π]−1′
= σ 2 E[W W ∗′ ]−1 E[W W ′ ]E[W W ∗′ ]−1′

and

V = σ 2 E[Π′ ZZ ′ Π]−1 E[Π′ ZZ ′ Π]E[ΠZZ ′ Π]−1


= σ 2 E[Π′ ZZ ′ Π]−1
= σ 2 E[W ∗ W ∗′ ]−1 ,

where in both cases the first equality follows from Var[ZU ] = E[ZZ ′ U 2 ] =
σ 2 E[ZZ ′ ], and the second equality used the fact that X = Π′ Z + V with
E[ZV ′ ] = 0. It suffices to show that Ṽ−1 ≤ V−1 , i.e., to show that

E[W ∗ W ∗′ ] − E[W W ∗′ ]′ E[W W ′ ]−1 E[W W ∗′ ] ≥ 0 .

Yet this follows upon realizing that the left-hand side of the preceding display
is simply E[Ŵ ∗ Ŵ ∗′ ] with

Ŵ ∗ = W ∗ − BLP(W ∗ |W ) = W ∗ − E[W W ∗′ ]′ E[W W ′ ]−1 W .

When we do not assume that E[U |Z] = 0 and Var[U |Z] = σ 2 , then
better estimators for β exist. Such estimators are most easily treated as a
special case of the generalized method of moments (GMM), which will be
covered later in class.

5.2 “Weak” Instruments


It turns out that the normal approximation justified by the preceding results
can be poor in finite samples, especially when the rank of E[ZX ′ ] is “close”
to being < k + 1. As a result, hypothesis tests and confidence regions based
off of this approximation can behave poorly in finite-samples as well. To
gain some insight into this phenomenon in a more elementary way, suppose

Yi = Xi β + Ui
Xi = Zi π + Vi ,

where Z1 , . . . , Zn are non-random, (U1 , V1 ), . . . , (Un , Vn ) is a sequence of i.i.d.


N (0, Σ) random vectors. Suppose π ̸= 0. Consider the estimator given by
β̂n = ( (1/n) ∑_{i=1}^n Zi Yi ) / ( (1/n) ∑_{i=1}^n Zi Xi ) .

Note that

√n(β̂n − β) = ( (1/√n) ∑_{i=1}^n Zi Ui ) / ( ( (1/n) ∑_{i=1}^n Zi2 ) π + (1/n) ∑_{i=1}^n Zi Vi ) .
The finite-sample, joint distribution of the numerator and denominator is
simply
a bivariate normal with mean vector ( 0 , Z¯n2 π )′ , variances Z¯n2 σU2 and (1/n) Z¯n2 σV2 , and covariance (1/√n) Z¯n2 σU,V , where

Z¯n2 = (1/n) ∑_{i=1}^n Zi2 .

This joint distribution completely determines the finite-sample distribution of √n(β̂n − β). In particular, it is the ratio of two (correlated) normal random variables. If Z¯n2 → Z¯2 as n → ∞, then it is straightforward to show that

√n(β̂n − β) →d N ( 0 , σU2 / (π 2 Z¯2 ) ) .
This approximation effectively treats the denominator like a constant equal
to its mean, so we would expect it to be “good” when the mean is “large”,
i.e.,
Z¯n2 π ≫ (1/√n) (Z¯n2 )1/2 σV .
When π is “small”, however, the approximation may be quite poor
in finite-samples. Note in particular that π ̸= 0 is not sufficient for the
approximation to be good in finite-samples.
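A small simulation along the lines of the display above (a sketch only; the parameter values are made up) makes the point concrete: with a strong first stage the draws of √n(β̂n − β) look roughly normal, while with a weak first stage they are skewed and heavy-tailed even though π ̸= 0.

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps, beta = 100, 5000, 1.0

    def draws_of_root_n_error(pi, rho=0.9):
        """Monte Carlo draws of sqrt(n)(beta_hat - beta) for first-stage slope pi."""
        Z = rng.normal(size=n)                       # kept fixed across replications
        out = np.empty(reps)
        for r in range(reps):
            V = rng.normal(size=n)
            U = rho * V + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
            X = pi * Z + V
            Y = beta * X + U
            beta_hat = (Z @ Y) / (Z @ X)             # simple IV estimator
            out[r] = np.sqrt(n) * (beta_hat - beta)
        return out

    for pi in (1.0, 0.1):
        d = draws_of_root_n_error(pi)
        print(pi, np.median(d), np.mean(np.abs(d) > 5))   # median and tail mass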
A variety of ways of carrying out inference that does not suffer from this
problem have been proposed in the literature. We now describe one simple
and popular method. Consider the problem of testing the null hypothesis
that H0 : β = c versus the alternative hypothesis H1 : β ̸= c at level α.
Note that under the null hypothesis, one can compute Ui = Yi − Xi′ β and
Zi Ui = Zi (Yi − Xi′ β). Since it must be the case that E[ZU ] = 0, we can
simply test whether this is true using Z1 U1 , . . . , Zn Un . To formalize this
idea, assume Var[ZU ] is invertible and define Wi (c) = Zi (Yi − Xi′ c). Note
that when β = c, we have that
√n W̄n (c) = (1/√n) ∑_{1≤i≤n} Wi (c) →d N (0, Σ(c)) ,

where Σ(c) = Var[W (c)]. Define


Σ̂n (c) = (1/n) ∑_{1≤i≤n} (Wi (c) − W̄n (c))(Wi (c) − W̄n (c))′ .

Using arguments given earlier, we see when β = c that


Tn = n W̄n (c)′ Σ̂n (c)−1 W̄n (c) →d χ2ℓ+1 .

One may therefore test the null hypothesis by comparing Tn with cℓ+1,1−α ,
the 1 − α quantile of the χ2ℓ+1 distribution. As we will discuss in the second
part of the class, one may now construct a confidence region using the duality
between hypothesis testing and the construction of confidence regions. A
closely related variant of this idea leads to the Anderson-Rubin test, in which
one tests whether all of the coefficients in a regression of Yi − Xi′ c on Zi are
zero.
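A sketch of the test described above (hypothetical names; scipy is assumed to be available for the χ2 quantile):

    import numpy as np
    from scipy.stats import chi2

    def weak_iv_robust_test(Y, X, Z, c, alpha=0.05):
        """Test H0: beta = c using the moments W_i(c) = Z_i (Y_i - X_i'c)."""
        n = len(Y)
        W = Z * (Y - X @ c)[:, None]                 # rows are W_i(c)'
        W_bar = W.mean(axis=0)
        W_c = W - W_bar
        Sigma_hat = W_c.T @ W_c / n                  # estimate of Sigma(c)
        Tn = n * W_bar @ np.linalg.solve(Sigma_hat, W_bar)
        crit = chi2.ppf(1 - alpha, df=Z.shape[1])    # chi-squared with l+1 d.o.f.
        return Tn, crit, Tn > crit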
Recent research in econometrics suggests that this method has good
power properties when the model is exactly identified, but may be less de-
sirable when the model is over-identified. Other methods for the case in
which the model is over-identified and/or one is only interested in some fea-
ture of β (e.g., one of the slope parameters) have been proposed and are the
subject of current research as well.
Instead of using these “more complicated” methods, researchers may at-
tempt a two-step method as follows. In the first step, they would investigate
whether the rank of E[ZX ′ ] is “close” to being < k + 1 or not by carrying
out a hypothesis test of the null hypothesis that H0 : rank(E[ZX ′ ]) < k + 1
versus the alternative hypothesis that H1 : rank(E[ZX ′ ]) = k + 1. In some
cases, such a test is relatively easy to carry out given what we have al-
ready learned: e.g., when there is a single endogenous regressor, such a test
is equivalent to a test of the null hypothesis that certain coefficients in a
linear regression are all equal to zero versus not all equal to zero. In the
second step, they would only use these “more complicated” methods if they
failed to reject this null hypothesis. This two-step method will also behave
poorly in finite-samples and should not be used. A deeper discussion of
these “uniformity” issues takes place in Econ 481.

5.3 Interpretation under Heterogeneity


Despite possible inefficiencies, TSLS remains popular. One reason stems
from the following interpretation in the presence of heterogeneous effects of
X on Y . To motivate this, note that in the model
Y = X ′β + U ,
the effect of a change in X (say, from X = x to X = x′ ) is the same for
everybody. It seems sensible that in many cases the effect of a change in X

on Y may be different for different people. To allow for such heterogeneity,


we allow for β to be random. When β is random, we may absorb U into the
intercept and simply write
Y = X ′β .
Note that this means that when we work with a random sample where
variables are indexed by i, we would write Yi = Xi′ βi , which makes it explicit
that every individual has a unique effect βi .
For the time being, assume that k = 1 and write D in place of X1 , which
is assumed to take values in {0, 1}. In this notation,

Y = β0 + β1 D .

In this case, we interpret β0 as Y (0) and β1 as Y (1) − Y (0), where Y (1) and
Y (0) are potential or counterfactual outcomes. Using this notation, we may
rewrite the equation as

Y = DY (1) + (1 − D)Y (0) .

The potential outcome Y (0) is the value of the outcome that would have
been observed if (possibly counter-to-fact) D were 0; the potential outcome
Y (1) is the value of the outcome that would have been observed if (possibly
counter-to-fact) D were 1. The variable D is typically called the treatment
and Y (1) − Y (0) is called the treatment effect. The quantity E[Y (1) − Y (0)]
is usually referred to as the average treatment effect.
If D were randomly assigned (e.g., by the flip of a coin, as in a randomized
controlled trial), then
(Y (0), Y (1)) ⊥⊥ D .
In this case, under mild assumptions, the slope coefficient from OLS regres-
sion of Y on a constant and D yields a consistent estimate of the average
treatment effect. To see this, note that the estimand is

Cov[Y, D] / Var[D] = E[Y |D = 1] − E[Y |D = 0]
                   = E[Y (1)|D = 1] − E[Y (0)|D = 0]
                   = E[Y (1) − Y (0)] ,

where the first equality follows from a homework exercise, the second equal-
ity follows from the equation for Y , and the third equality follows from
independence of (Y (0), Y (1)) and D.
Otherwise, we generally expect D to depend on (Y (1), Y (0)). In this
case, OLS will not yield a consistent estimate of the average treatment
effect. To proceed further, we therefore assume, as usual, that there is an
instrument Z that also takes values in {0, 1}. We may thus consider the

slope coefficient from TSLS regression of Y on D with Z as an instrument.


The estimand in this case is
Cov[Y, Z] / Cov[D, Z] = ( E[Y |Z = 1] − E[Y |Z = 0] ) / ( E[D|Z = 1] − E[D|Z = 0] ) ,

where the equality follows by multiplying and dividing by Var[Z] and using
earlier results. Our goal is to express this quantity in terms of the treatment
effect Y (1) − Y (0) somehow. To this end, analogously to our equation for
Y above, it is useful to also introduce a similar equation for D:

D = ZD(1) + (1 − Z)D(0)
= D(0) + (D(1) − D(0))Z
= π0 + π1 Z ,

where π0 = D(0), π1 = D(1) − D(0), and D(1) and D(0) are potential or
counterfactual treatments (rather than outcomes). We impose the following
versions of instrument exogeneity and instrument relevance, respectively:

(Y (1), Y (0), D(1), D(0)) ⊥⊥ Z

and
P {D(1) ̸= D(0)} = P {π1 ̸= 0} > 0 .
Note that the first part of the assumption basically states that Z is as good
as randomly assigned. In addition, note that we are implicitly assuming that
Z does not affect Y directly, i.e., potential outcomes take the form Y (d) as
opposed to Y (d, z). This is the exclusion restriction in this setting. In the
linear model with constant effects, the exclusion restriction is expressed by
the omission of the instruments from the causal equation of interest and by
requiring that E[ZU ] = 0.
We further assume the following monotonicity (or perhaps better called
uniform monotonicity) condition:

P {D(1) ≥ D(0)} = P {π1 ≥ 0} = 1 .

Under these assumptions, note that

E[Y |Z = 1] − E[Y |Z = 0] = E[Y (1)D(1) + Y (0)(1 − D(1))|Z = 1]


− E[Y (1)D(0) + Y (0)(1 − D(0))|Z = 0]
= E[Y (1)D(1) + Y (0)(1 − D(1))]
− E[Y (1)D(0) + Y (0)(1 − D(0))]
= E[(Y (1) − Y (0))(D(1) − D(0))]
= E[Y (1) − Y (0)|D(1) > D(0)]P {D(1) > D(0)} ,

where the first equality follows from the equations for Y and D, the second
equality follows from instrument exogeneity, and the fourth equality follows
from the monotonicity assumption. Furthermore,

E[D|Z = 1] − E[D|Z = 0] = E[D(1) − D(0)] = P {D(1) > D(0)} .

Hence, the TSLS estimand equals

E[Y (1) − Y (0)|D(1) > D(0)] ,

which is termed the local average treatment effect (LATE). It is the average
treatment effect among the subpopulation of people for whom a change in
the value of the instrument switched them from being non-treated to treated.
We often refer to such subpopulation as compliers.
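In a sample, the estimand above is just the Wald ratio of two differences in means; a minimal sketch (hypothetical names; Y, D and Z are arrays of equal length, with D and Z binary):

    import numpy as np

    def wald_late(Y, D, Z):
        """Sample analog of (E[Y|Z=1]-E[Y|Z=0]) / (E[D|Z=1]-E[D|Z=0])."""
        reduced_form = Y[Z == 1].mean() - Y[Z == 0].mean()   # effect of Z on Y
        first_stage = D[Z == 1].mean() - D[Z == 0].mean()    # effect of Z on D
        return reduced_form / first_stage                    # estimate of the LATE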
A few remarks are in order: First, it is important to understand that
this result depends crucially on the monotonicity assumption. Second, it is
important to understand that this quantity may or may not be of interest.
Third, it is important to understand that a consequence of this calculation is
that in a world with heterogeneity “different instruments estimate different
parameters.” Finally, this result also depends on the simplicity of the model.
When covariates are present, the entire calculation breaks down. Some
generalizations are available.

5.3.1 Monotonicity in Latent Index Models


The monotonicity assumption states that while the instrument may have no
effect on some people, all those who are affected are affected in the same
way. Without monotonicity, we would have

E[Y |Z = 1] − E[Y |Z = 0] = E[Y (1) − Y (0)|D(1) > D(0)]P {D(1) > D(0)}
− E[Y (1) − Y (0)|D(1) < D(0)]P {D(1) < D(0)} .

We might therefore have a situation where treatment effects are positive for
everyone (i.e., Y (1) − Y (0) > 0) yet the reduced form is zero because effects
on compliers are canceled out by effects on defiers, i.e., those individuals for
which the instrument pushes them out of treatment (D(1) = 0 and D(0) =
1). This doesn’t come up in a constant effect model where β = Y (1) − Y (0)
is constant, as in such case

E[Y |Z = 1] − E[Y |Z = 0] = β{P {D(1) > D(0)} − P {D(1) < D(0)}}


= βE[D(1) − D(0)] ,

and so a zero reduced-form effect means either the first stage is zero or
β = 0.
It is worth noting that monotonicity assumptions are easy to interpret
in latent index models. In such models individual choices are determined by

a threshold crossing rule involving observed and unobserved components of


the utility. In our context, we could write
D = I{γ0 + γ1 Z − V > 0} ,

where γ1 > 0 and V is an unobserved random variable assumed to be in-


dependent of Z. This latent index model characterizes potential treatment
assignments as

D(0) = I{γ0 > V } and D(1) = I{γ0 + γ1 > V } .

Notably, in this model the monotonicity assumption is automatically satis-


fied since γ1 > 0 is a constant.

5.3.2 IV in Randomized Experiments


The preceding derivation is often relevant in the context of randomized trials,
where treatment assignment is independent of potential outcomes. However,
in cases where there is non-compliance, one could interpret the treatment
assignment as an “offer of treatment” Z (the instrument), and the actual
treatment D as the variable that determines whether the subject actually
had the intended treatment. This is the case in experiments where partic-
ipation is voluntary among those randomly assigned to receive treatment.
At the same time, it is often the case that no one in the control group has
access to the experimental intervention. In other words, D(0) = 0 while
D(1) ∈ {0, 1}. Since the group that receives the assigned treatment (the
compliers) is a self-selected subset of those offered treatment, a comparison
between those actually treated (D = 1) and the control (D = 0) group is
misleading. Two alternatives are frequently used. The first one is a com-
parison between those who were offered treatment (Z = 1) and the control
(Z = 0) group. This comparison is indeed based on randomly assigned Z
and identifies a parameter known as intention to treat effect. The second one
is to do IV, using the randomly assigned intended treatment as an instrumental variable for treatment received, which solves the sort of compliance problem previously discussed. In this case, LATE returns the effect of treatment on the
treated, i.e., E[Y (1) − Y (0)|D = 1].

Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.

Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.


Lecture 6

Generalized Method of
Moments and Empirical
Likelihood

6.1 Generalized Method of Moments


6.1.1 Over-identified Linear Model
Let (Y, X, Z, U ) be a random vector where Y and U take values in R, X takes
values in Rk+1 , and Z takes values in Rℓ+1 . We assume that ℓ ≥ k, E[ZU ] =
0, E[ZX ′ ] < ∞, and rank(E[ZX ′ ]) = k + 1. Assume further that the first
component of X is constant and equal to one. Let β = (β0 , β1 , . . . , βk )′ ∈
Rk+1 be such that
Y = X ′β + U .
Using the fact that U = Y − X ′ β and E[ZU ] = 0, we see that β solves the
system of equations
E[Z(Y − X ′ β)] = 0 .
Since ℓ + 1 ≥ k + 1, this may be an over-determined system of equations and
today we will focus on the case where ℓ > k. This situation is called over-
identified. There are ℓ − k = r more moment restrictions than parameters
to estimate. We usually call r the number of over-identifying restrictions.
The above is a special case of a more general class of moment condition
models. Let m(Y, X, Z, β) be an ℓ + 1 dimensional function of a k + 1
dimensional parameter β such that

E[m(Y, X, Z, β)] = 0 . (6.1)

In the linear model, m(Y, X, Z, β) = Z(Y −X ′ β). In econometrics, this class


of models are called moment condition models. In the statistics literature,
these are known as estimating equations.


6.1.2 The GMM Estimator


Let (Y, X, Z, U ) be distributed as described above and denote by P the
marginal distribution of (Y, X, Z). Let (Y1 , X1 , Z1 ), . . . , (Yn , Xn , Zn ) be an
i.i.d. sequence of random variables with distribution P . Define the sample
analog of E[m(Y, X, Z, β)] by
m̄n (β) = (1/n) ∑_{i=1}^n mi (β) = (1/n) ∑_{i=1}^n Zi (Yi − Xi′ β) = (1/n) Z′ (Y − Xβ) , (6.2)

where in what follows we will use the notation mi (β) = m(Yi , Xi , Zi , β).
The method of moments estimator for β is defined as the parameter value
which sets m̄n (β) = 0. This is generally not possible when ℓ > k as there are
more equations than free parameters. The idea of the generalized method of
moments (GMM) is to define an estimator that sets m̄n (β) “close” to zero,
given a notion of “distance”.
Let Λn be an (ℓ + 1) × (ℓ + 1) matrix such that Λn →P Λ for a symmetric positive definite matrix Λ and define

Qn (β) = n m̄n (β)′ Λn m̄n (β) .

This is a non-negative measure of the “distance” between the vector m̄n (β)
and the origin. For example, if Λn = I, then Qn (β) = n |m̄n (β)|2 , the square
of the Euclidean norm, scaled by the sample size n. The GMM estimator of
β is defined as the value that minimizes Qn (β), this is

β̂n = argmin_{b∈Rk+1} Qn (b) . (6.3)

Note that if k = ℓ, then m̄n (β̂n ) = 0, and the GMM estimator is the method
of moments estimator. The first order conditions for the GMM estimator
are

0 = (∂/∂b) Qn (β̂n )
  = 2n ( (∂/∂b′ ) m̄n (β̂n ) )′ Λn m̄n (β̂n )
  = −(2/n) ( Z′ X )′ Λn Z′ (Y − Xβ̂n ) , (6.4)

so

( Z′ X )′ Λn ( Z′ X ) β̂n = ( Z′ X )′ Λn Z′ Y , (6.5)

which establishes a closed-form solution for the GMM estimator in the linear model,

β̂n = ( ( Z′ X )′ Λn ( Z′ X ) )−1 ( Z′ X )′ Λn Z′ Y . (6.6)

Without matrix notation, we can write this estimator as


β̂n = ( ( (1/n) ∑_{i=1}^n Zi Xi′ )′ Λn ( (1/n) ∑_{i=1}^n Zi Xi′ ) )−1 ( (1/n) ∑_{i=1}^n Zi Xi′ )′ Λn ( (1/n) ∑_{i=1}^n Zi Yi ) .

The matrix (Z′ X)′ Λn (Z′ X) may not be invertible for a given n, but, since
E[ZX ′ ]′ ΛE[ZX ′ ] is invertible, it will be invertible with probability ap-
proaching one. Note that using similar arguments to those in Lemma 4.1
we can claim that E[ZX ′ ]′ ΛE[ZX ′ ] is invertible provided E[ZX ′ ] has rank
k + 1 and Λ has rank ℓ + 1.

6.1.3 Consistency
Let
Σ = E[ZX ′ ] .
Then, by the WLLN and the CMT
( (1/n) Z′ X )′ Λn ( (1/n) Z′ X ) →P Σ′ ΛΣ

and

(1/n) Z′ Y →P E[ZY ] = Σβ
as n → ∞. The desired result therefore follows from the CMT.

6.1.4 Asymptotic Normality


In addition to the assumptions above, assume that

Ω = E[ZZ ′ U 2 ]

is finite and invertible. Write


√n(β̂n − β) = ( ( (1/n) ∑_{i=1}^n Zi Xi′ )′ Λn ( (1/n) ∑_{i=1}^n Zi Xi′ ) )−1 ( (1/n) ∑_{i=1}^n Zi Xi′ )′ Λn ( (1/√n) ∑_{i=1}^n Zi Ui ) ,

and note that by the CLT


(1/√n) ∑_{i=1}^n Zi Ui →d N (0, Ω) .

Using this, the results on convergence in probability, and the CMT, we


conclude that

√n(β̂n − β) →d N (0, V)

where
V = (Σ′ ΛΣ)−1 (Σ′ ΛΩΛΣ)(Σ′ ΛΣ)−1 .
In general, GMM estimators are asymptotically normal with “sandwich
form” asymptotic variances. The optimal weight matrix Λ∗ is the one which minimizes V. This turns out to be Λ∗ = Ω−1 . The proof is left as an exercise.
This yields the efficient GMM estimator
β̂n = ( ( Z′ X )′ Ω−1 ( Z′ X ) )−1 ( Z′ X )′ Ω−1 Z′ Y ,

which satisfies

√n(β̂n − β) →d N (0, (Σ′ Ω−1 Σ)−1 ) .
In practice, Ω is not known but it can be estimated consistently. For
any Ω̂n →P Ω, we still call β̂n the efficient GMM estimator, as it has the
same asymptotic distribution. By “efficient”, we mean that this estimator
has the smallest asymptotic variance in the class of GMM estimators with
this set of moment conditions. This is a weak concept of optimality, as we
are only considering alternative weight matrices Λn . However, it turns out
that the GMM estimator is semiparametrically efficient, as shown by Gary
Chamberlain (1987). If it is known that E[mi (β)] = 0 and this is all that
is known, this is a semi-parametric problem, as the distribution of the data
P is unknown. Chamberlain showed that in this context, if an estimator
has this asymptotic variance, it is semiparametrically efficient. This result
shows that no estimator has greater asymptotic efficiency than the efficient
GMM estimator. No estimator can do better (in this first-order asymptotic
sense), without imposing additional assumptions.

6.1.5 Estimation of the Efficient Weighting Matrix


Given any weight matrix Λn with the properties previously discussed, the
GMM estimator β̂n is consistent yet inefficient. For example, we can set
Λn = I. In the linear model, under the additional assumption that Z has no
perfect collinearity and that E[ZZ ′ ] < ∞, a better choice is Λn = (Z′ Z)−1 ,
which leads to TSLS since
β̂n = ( ( Z′ X )′ Λn ( Z′ X ) )−1 ( Z′ X )′ Λn Z′ Y
   = ( ( Z′ X )′ (Z′ Z)−1 ( Z′ X ) )−1 ( Z′ X )′ (Z′ Z)−1 Z′ Y
   = ( X′ PZ X )−1 X′ PZ Y .

As before, PZ = Z(Z′ Z)−1 Z′ is a projection matrix. Given any such first-step


estimator, we can define residuals Ûi = Yi − Xi′ β̂n and moment equations
m̂i = Zi Ûi . Construct,
m̂∗i = m̂i − (1/n) ∑_{i=1}^n m̂i .

Now define

Λ∗n = ( (1/n) ∑_{i=1}^n m̂∗i m̂∗i ′ )−1 .
By using arguments similar to those in the second part of this class, it can be shown that Λ∗n →P Ω−1 and GMM using the weighting matrix above is asymptotically efficient. This is typically referred to as the efficient two-step GMM estimator.
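A sketch of the two-step procedure in the linear model (hypothetical names; the first step uses Λn = (Z′Z)−1 , i.e. TSLS, the second step uses the centered moments, and the overidentification statistic discussed below is returned as a by-product):

    import numpy as np

    def two_step_gmm(Y, X, Z):
        """Efficient two-step GMM for the moment condition E[Z(Y - X'beta)] = 0."""
        n = len(Y)

        def gmm(Lmbda):
            A = (Z.T @ X).T @ Lmbda @ (Z.T @ X)
            b = (Z.T @ X).T @ Lmbda @ (Z.T @ Y)
            return np.linalg.solve(A, b)

        beta_1 = gmm(np.linalg.inv(Z.T @ Z))              # step 1: TSLS
        m = Z * (Y - X @ beta_1)[:, None]                 # m_i = Z_i * U_hat_i
        m_c = m - m.mean(axis=0)                          # centered moments
        Lmbda_star = np.linalg.inv(m_c.T @ m_c / n)       # estimated Omega^{-1}
        beta_2 = gmm(Lmbda_star)                          # step 2: efficient GMM

        m_bar = (Z * (Y - X @ beta_2)[:, None]).mean(axis=0)
        J = n * m_bar @ Lmbda_star @ m_bar                # overidentification statistic
        return beta_2, J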
A common alternative choice is to use
Λn = ( (1/n) ∑_{i=1}^n m̂i m̂i ′ )−1 , (6.7)

which uses the uncentered moment conditions. Since E[mi ] = 0 these two
estimators are asymptotically equivalent under the hypothesis of correct
specification. However, the uncentered estimator may be a poor choice when
constructing hypothesis tests, as under the alternative hypothesis the mo-
ment conditions are violated, i.e. E[mi ] ̸= 0.

6.1.6 Overidentification Test


Let
Q∗n (β) = n m̄n (β)′ Λ∗n m̄n (β) .
If the moment condition model is correctly specified in the sense that there exists β ∈ Rk+1 such that (6.1) holds, then it can be shown that

Q∗n (β̂n ) →d χ2ℓ−k , (6.8)

as n → ∞, where β̂n is the efficient two-step GMM estimator. In addition,


Q∗n (β̂n ) → ∞ if E[m(Y, X, Z, β)] ̸= 0 for all β ∈ Rk+1 . The proof of
this result is left as an exercise. Note that the degrees of freedom of the
asymptotic distribution are the number of overidentifying restrictions. The
overidentification test then rejects the null hypothesis “there exists β ∈ Rk+1
such that (6.1) holds” when Q∗n (β̂n ) exceeds the 1 − α quantile of χ2ℓ−k .

6.2 Empirical Likelihood


Empirical Likelihood (EL) is a data-driven nonparametric method of esti-
mation and inference for moment restriction models, which does not require
weight matrix estimation like GMM and is invariant to nonsingular linear
transformations of the moment conditions. It was introduced by Art B.
Owen and later studied in depth by Qin and Lawless (1994), Imbens, Spady
and Johnson (1998) and Kitamura (2001), among others. It is basically a
non-parametric analog of Maximum Likelihood.

Consider the same setting as before, where

E[m(Y, X, Z, β)] = 0 . (6.9)

Again, in the linear model m(Y, X, Z, β) = Z(Y − X ′ β) and we still use the
notation mi (β) = m(Yi , Xi , Zi , β).
Empirical likelihood may be viewed as parametric inference in moment
condition models, using a data-determined parametric family of distribu-
tions. The parametric family is a multinomial distribution on the observed
values (Y1 , X1 , Z1 ), . . . , (Yn , Xn , Zn ). This parametric family will have n − 1
parameters. Having the number of parameters grow as quickly as the sample
size makes empirical likelihood very different than parametric likelihood.
The multinomial distribution which places probability pi at each obser-
vation of the data will satisfy the above moment condition if and only if
∑_{i=1}^n pi mi (β) = 0 . (6.10)

The empirical likelihood estimator is the value of β which maximizes the


multinomial log-likelihood subject to the above restriction. This is, the
empirical likelihood function is
Rn (b) ≡ max_{p1 ,...,pn} { ∏_{i=1}^n npi : pi > 0 ; ∑_{i=1}^n pi = 1 ; ∑_{i=1}^n pi mi (b) = 0 } . (6.11)

The Lagrangian for the empirical log-likelihood is


L(b, p1 , . . . , pn , λ, κ) = ∑_{i=1}^n log(npi ) − κ ( ∑_{i=1}^n pi − 1 ) − nλ′ ∑_{i=1}^n pi mi (b)

where κ and λ are Lagrange multipliers. For a given value b ∈ Rk+1 , the
first order conditions with respect to pi , κ and λ are:
∂L/∂pi = 1/pi − κ − nλ′ mi (b) = 0
∂L/∂κ = ∑_{i=1}^n pi − 1 = 0
∂L/∂λ = n ∑_{i=1}^n pi mi (b) = 0 .

Multiplying the first equation by pi , summing over i, and using the second and third equations, we find κ = n and

pi (b) = (1/n) · 1 / (1 + λ(b)′ mi (b)) , (6.12)

where λ(b) solves g(λ) = 0 and


g(λ) ≡ (1/n) ∑_{i=1}^n mi (b) / (1 + λ′ mi (b)) = 0 . (6.13)

It follows that we can write log(Rn (b)) as


log(Rn (b)) = − ∑_{i=1}^n log(1 + λ(b)′ mi (b)) .

The EL estimator of β is the value that maximizes log(Rn (b)),


β̃n = argmax_{b∈Rk+1} log(Rn (b)) .

The EL estimator of the Lagrange multiplier λ is λ̃n = λ(β̃n ) and the EL


probabilities are p̃i = pi (β̃n ).
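Computationally, the EL estimator involves an inner problem (solving g(λ) = 0 for each candidate b) nested inside an outer maximization over b. A rough sketch of this nesting (hypothetical names; the Newton safeguard is deliberately crude, and scipy's general-purpose optimizer is assumed for the outer step):

    import numpy as np
    from scipy.optimize import minimize

    def el_log_likelihood(b, Y, X, Z, newton_steps=25):
        """Profile empirical log-likelihood log R_n(b) for m_i = Z_i (Y_i - X_i'b)."""
        m = Z * (Y - X @ b)[:, None]                      # rows are m_i(b)'
        n, L = m.shape
        lam = np.zeros(L)
        for _ in range(newton_steps):                     # inner problem: g(lambda) = 0
            denom = np.clip(1.0 + m @ lam, 1e-10, None)   # crude safeguard, sketch only
            g = (m / denom[:, None]).mean(axis=0)
            Jac = -(m / denom[:, None] ** 2).T @ m / n
            lam = lam - np.linalg.solve(Jac, g)
        return -np.sum(np.log(np.clip(1.0 + m @ lam, 1e-10, None)))

    def el_estimator(Y, X, Z, b0):
        """Maximize log R_n(b) over b, starting from b0 (e.g. the TSLS estimate)."""
        res = minimize(lambda b: -el_log_likelihood(b, Y, X, Z), b0, method="Nelder-Mead")
        return res.x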

6.2.1 Asymptotic Properties and First Order Conditions


It turns out that the limit distribution of the EL estimator is the same as
that of efficient GMM. This is,
√n(β̃n − β) →d N (0, (Σ′ Ω−1 Σ)−1 ) . (6.14)
We are going to skip the proof of this result in this class. What is more
interesting is to compare the first order conditions of the EL estimator to
those of two-step GMM. In order to do so, let’s re-write the first order
condition of the EL estimator, paying specific attention to the linear model
where mi (β) = Zi (Yi − Xi′ β). Define

Mi (β) = −(∂/∂β ′ ) mi (β) = Zi Xi′ ,

and let

Σ(β) = E[Mi (β)] = E[Zi Xi′ ]
Ω(β) = E[mi (β)mi (β)′ ] = E[Zi Zi′ Ui2 ] .

Note that Σ(β) does not depend on β in the linear model. However, in
non-linear models it does and so we keep the dependence on β throughout
the remainder of the section. Denote the sample analogs of Σ(β) and Ω(β)
by
Σ̂n (β) = (1/n) ∑_{i=1}^n Mi (β)
Ω̂n (β) = (1/n) ∑_{i=1}^n mi (β)mi (β)′ .

Using this notation, the EL estimators (β̃n , λ̃n ) jointly solve:


0 = (1/n) ∑_{i=1}^n mi (β̃n ) / (1 + λ̃′n mi (β̃n )) (6.15)
0 = (1/n) ∑_{i=1}^n Mi (β̃n )′ λ̃n / (1 + λ̃′n mi (β̃n )) . (6.16)

Note that since 1/(1 + a) = 1 − a/(1 + a), we can re-write (6.15) and solve
for λ̃n ,
" n #−1
1 X mi (β̃n )mi (β̃n )′
λ̃n = m̄n (β̃n ) . (6.17)
n
i=1
1 + λ̃′n mi (β̃n )
where m̄n (β) = n−1 1≤i≤n mi (β). By (6.16) and (6.17),
P

" n
#′ " n
#−1
1X Mi (β̃n ) 1 X mi (β̃n )mi (β̃n )′
m̄n (β̃n ) = 0 .
n
i=1
1 + λ̃′n mi (β̃n ) n 1 + λ̃′n mi (β̃n )
i=1

The equation above can be written as

Σ̃n (β̃n )′ Ω̃n (β̃n )−1 m̄n (β̃n ) = 0 ,

where by (6.12),

Σ̃n (β) = ∑_{i=1}^n p̃i Mi (β)
Ω̃n (β) = ∑_{i=1}^n p̃i mi (β)mi (β)′ .

We can now see that EL and GMM have very similar first order condi-
tions. Recall from (6.4) that the first order condition for a GMM estimator
is given by
Σ̂n (β̂n )′ Ω̂n−1 m̄n (β̂n ) = 0 ,

where Ω̂n is a consistent estimator of Ω based on a preliminary estimator


of β. It is not surprising then that these two estimators are first order
equivalent. However, using a different estimator of the Jacobian matrix Σ
and the absence of a preliminary estimator for Ω gives EL some favorable
second order properties relative to GMM. The cost of this is some additional
computational complexity. This is a topic we cover in Econ 481.

Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.
Lecture 7

Panel Data1

Let (Y, X, η, U ) be a random vector where Y , η, and U take values in R and


X takes values in Rk . Note that here we are not assuming that the first
component of X is a constant equal to one. Let β = (β1 , . . . , βk )′ ∈ Rk be
such that
Y = X ′β + η + U ,
where we assume both η and U are unobserved. In addition, we want to
allow for the possibility that X and η are correlated, so that E[Xη] ̸= 0.
Given this, combining η + U into a single unobservable would require an
IV to get an estimator of β, even if we assume E[XU ] = 0. Today we will
see that when we observe the same units (individuals, firms, families, etc)
multiple times (across time, regions, etc) we may identify and consistently
estimate β without an IV, at least under certain restrictions on η and U .
Suppose that we observe the same unit at two different points in time,
and that the unobservable η captures unobserved heterogeneity that is unit
specific and constant over time. This is, consider the model

Y1 = X1′ β + η + U1
Y2 = X2′ β + η + U2 .

Note that we are also assuming that β is a constant parameter that does not
change over time. If this is the case, we could simply take first differences,
i.e.,

Y2 − Y1 = (X2 − X1 )′ β + U2 − U1
∆Y = ∆X ′ β + ∆U ,

and remove the unobserved individual effect η in the process. Notice that

E[∆X∆U ] = E[X2 U2 ] + E[X1 U1 ] − E[X2 U1 ] − E[X1 U2 ] . (7.1)


1 This lecture is based on Alex Torgovitsky’s lecture notes. I want to thank him for kindly sharing them.


For the expression above to be equal to zero it is not enough to assume


that E[X2 U2 ] = E[X1 U1 ] = 0, which would be the standard orthogonality
assumption. We also need that E[X2 U1 ] = E[X1 U2 ] = 0, i.e., that the
covariates in a given time period are uncorrelated with the unobservables
in other time periods. This is called strict exogeneity. If this is the case,
running least squares of ∆Y on ∆X would deliver a consistent estimator of

β = E[∆X∆X ′ ]−1 E[∆X∆Y ] , (7.2)

provided that E[∆X∆X ′ ] is invertible.


Before we proceed to formalize and extend some of these ideas, there
are a few aspects that are worth keeping in mind. First, observing the
same units over multiple time periods (the so-called panel data) allows us to
control for unobserved factors that are constant over time (the η). The trick
we just used would not work if η was allowed to change over time. Second,
the requirement that E[∆X∆X ′ ] is invertible means that we need X to
change over time, so the trick we just used does not allow us to estimate
coefficients of variables that are constant over time. Indeed, such variables
are removed by the transformation in the same way η is removed. Finally,
strict exogeneity is arguably stronger than simply assuming E[Xt Ut ] = 0 for
all t. Cases where X2 is a decision variable of an agent in a context where U1
is known at t = 2 may seriously question the validity of E[X2 U1 ] = 0. Note
that this type of dynamic argument is distinct from omitted variables bias
in the sense that it could occur even if we were to argue that E[Xt Ut ] = 0
or even E[Ut |Xt ] = 0.

7.1 Fixed Effects


7.1.1 First Differences
Let (Y, X, η, U ) be distributed as described above and denote by P the
distribution of
(Yi,1 , . . . , Yi,T , Xi,1 , . . . , Xi,T ) . (7.3)
We assume that we have a random sample of size n, so that the observed
data is given by {(Yi,t , Xi,t ) : 1 ≤ i ≤ n, 1 ≤ t ≤ T }. Note that while the
sampling process is i.i.d. across i, we are being completely agnostic about
the dependence across time for a given unit i. We then consider

Yi,t = Xi,t β + ηi + Ui,t , i = 1, . . . , n t = 1, . . . , T , (7.4)

under the assumptions on Xi,t and Ui,t that we formalize below. Now define

∆Xi,t = Xi,t − Xi,t−1



for t ≥ 2, and proceed analogously with the other random variables. Note
again that ∆ηi = 0. Applying this transformation to (7.4), we get

∆Yi,t = ∆Xi,t β + ∆Ui,t , i = 1, . . . , n t = 2, . . . , T . (7.5)

It follows that a regression of ∆Yi,t on ∆Xi,t provides a consistent estimator


of β if the following two assumptions hold,

FD1. E[Ui,t |Xi,1 , . . . , Xi,T ] = 0 for all t = 1, . . . , T ,


FD2. ∑_{t=2}^T E[∆Xi,t ∆Xi,t ′ ] < ∞ and is invertible.

FD1 is a sufficient condition for E[∆Ui,t |∆Xi,t ] = 0. FD2 fails if some


component of Xi,t does not vary over time. The first-difference estimator
then takes the form
 −1  
X X X X
′ 
β̂nfd =  ∆Xi,t ∆Xi,t  ∆Xi,t ∆Yi,t  . (7.6)
1≤i≤n 2≤t≤T 1≤i≤n 2≤t≤T

Under the assumption that Var[Ui,t |Xi,1 , . . . , Xi,T ] is constant (homoskedas-


ticity), together with the assumption of no serial correlation in Ui,t , it is
possible to show that β̂nfd is not asymptotically efficient and that a different
transformation of the data delivers an estimator with a lower asymptotic
variance under those assumptions. We will discuss this further after describing this alternative transformation.
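A sketch of the first-difference estimator (7.6) for a balanced panel (hypothetical names; Y stored as an (n, T) array and X as an (n, T, k) array):

    import numpy as np

    def first_difference(Y, X):
        """First-difference estimator for a balanced panel."""
        dY = np.diff(Y, axis=1).reshape(-1)              # stack Delta Y_{i,t} over i and t
        dX = np.diff(X, axis=1).reshape(-1, X.shape[2])  # stack Delta X_{i,t}
        return np.linalg.solve(dX.T @ dX, dX.T @ dY)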

7.1.2 Deviations from Means


An alternative transformation to remove the individual effects ηi from (7.4)
is the so-called de-meaning technique. In order to define this formally, let
Ẋi,t = Xi,t − X̄i where X̄i = (1/T ) ∑_{1≤t≤T} Xi,t ,

and define Ẏi,t and U̇i,t analogously. Note that η̇i = 0 for all i = 1, . . . , n.
Applying this transformation to (7.4), we get

Ẏi,t = Ẋi,t β + U̇i,t , i = 1, . . . , n t = 1, . . . , T . (7.7)

It follows that a regression of Ẏi,t on Ẋi,t provides a consistent estimator of


β if the following two assumptions hold,

FE1. E[Ui,t |Xi,1 , . . . , Xi,T ] = 0 for all t = 1, . . . , T ,


FE2. ∑_{t=1}^T E[Ẋi,t Ẋi,t ′ ] < ∞ and is invertible.

FE1, which is the same strict exogeneity condition in FD1, is a sufficient


condition for E[U̇i,t |Ẋi,t ] = 0. As before, FE2 fails if some component of
Xi,t does not vary over time. The de-meaning estimator (commonly known
as the fixed effect estimator) or dummy variable estimator takes the form
 −1  
X X X X
′ 
β̂nfe =  Ẋi,t Ẋi,t  Ẋi,t Ẏi,t  . (7.8)
1≤i≤n 1≤t≤T 1≤i≤n 1≤t≤T

Under the assumption that Var[Ui,t |Xi,1 , . . . , Xi,T ] is constant (homoskedas-


ticity), together with the assumption of no serial correlation in Ui,t , it is
possible to show that β̂nfe is asymptotically efficient. We discuss this in the
next section.

7.1.3 Asymptotic Properties


Deriving an asymptotic approximation for estimators in panel data models
involves two elements that were not present with cross-sectional data. First,
the data is i.i.d. across i but may be dependent across time. This is, we
may suspect that Xi,t and Xi,s for t ̸= s may not be independent. Second,
the data has two indices now: the number of units (denoted by n) and the
number of time periods (denoted by T ). We will definitely need nT → ∞ to
get a useful asymptotic approximation, but we may achieve this by all sort
of different assumptions about how n and/or T grow. The two standard
approximations are n → ∞ and T fixed (the so-called short panels) and
n → ∞ and T → ∞ (the so-called large panels). Many commonly used
panels in applied research include thousands of units (n large) and few time
periods (T small) so we will focus on short panels first and discuss large
panels later in class.
Under asymptotics where n → ∞ and fixed T , we can show that β̂nfe
and β̂nfd are asymptotically normal using similar arguments to those we use
before, provided we assume

(Yi,1 , . . . , Yi,T , Xi,1 , . . . , Xi,T , Ui,1 , . . . , Ui,T )

are i.i.d. across i = 1, . . . , n. Start by writing


 −1  
√ 1  √1
X X X X
′ 
n(β̂nfe − β) =  Ẋi,t Ẋi,t Ẋi,t U̇i,t  .
n n
1≤i≤n 1≤t≤T 1≤i≤n 1≤t≤T

In order to make this expression more tractable, we use two tricks. First,
note that
∑_{1≤t≤T} Ẋi,t U̇i,t = ∑_{1≤t≤T} Ẋi,t Ui,t − Ūi ∑_{1≤t≤T} Ẋi,t = ∑_{1≤t≤T} Ẋi,t Ui,t , (7.9)

where the last step follows from ∑_{1≤t≤T} Ẋi,t = 0. We can therefore replace U̇i,t with Ui,t . Second, let Ẋi = (Ẋi,1 , . . . , Ẋi,T )′ be the T × k matrix of stacked observations for unit i, and define Ui in the same way. Using this notation, we can write

Ẋi′ Ẋi = ∑_{1≤t≤T} Ẋi,t Ẋi,t ′ and Ẋi′ Ui = ∑_{1≤t≤T} Ẋi,t Ui,t . (7.10)

Combining (7.9) and (7.10), we obtain


 −1  
√ 1 X 1 X
n(β̂nfe − β) =  Ẋi′ Ẋi   √ Ẋi′ Ui  .
n n
1≤i≤n 1≤i≤n

By the law of large numbers and FE2,


(1/n) ∑_{1≤i≤n} Ẋi′ Ẋi →P ΣẊ ≡ E[Ẋi′ Ẋi ] = ∑_{1≤t≤T} E[Ẋi,t Ẋi,t ′ ] .

In addition, by the central limit theorem and FE1,


(1/√n) ∑_{1≤i≤n} Ẋi′ Ui →d N (0, Ω), where Ω = Var[Ẋi′ Ui ] = E[Ẋi′ Ui Ui′ Ẋi ] .

Combining these results with the CMT we get


√n(β̂nfe − β) →d N (0, Vfe ) (7.11)

where

Vfe = ΣẊ−1 Ω ΣẊ−1 . (7.12)
Historically, researchers often assumed that Ui,t was serially uncorrelated
with variance independent of Xi,t (i.e. homoskedastic). The default stan-
dard errors in Stata are still based on these assumptions. However, these
assumptions are difficult to justify for most economic data, which is of-
ten strongly autocorrelated and heteroskedastic. One faces basically the
same trade-off as with heteroskedasticity in the cross-sectional case. The
most common strategy is to use the fully robust consistent estimator of the
asymptotic variance,
 −1   −1
1 X 1 X 1 X
V̂fe =  Ẋi′ Ẋi   Ẋi′ Ûi Ûi′ Ẋi   Ẋi′ Ẋi  ,
n n n
1≤i≤n 1≤i≤n 1≤i≤n

where Ûi = Ẏi − Ẋi β̂nfe . This is what Stata computes when one uses the cluster(unit) option to xtreg, where unit is the variable that indexes i. This is a consistent covariance matrix estimator that allows for arbitrary inter-temporal correlation patterns and heteroskedasticity across individuals. As we will see later in class, this estimator is generally known as a cluster covariance estimator (CCE) and is consistent as n → ∞, i.e., V̂fe →P Vfe .
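A sketch of the within estimator together with the cluster covariance estimator V̂fe above (hypothetical names; same (n, T) and (n, T, k) data layout as before):

    import numpy as np

    def fixed_effects(Y, X):
        """Within (fixed effects) estimator with unit-clustered standard errors."""
        n, T, k = X.shape
        Y_dot = Y - Y.mean(axis=1, keepdims=True)        # de-meaned outcomes
        X_dot = X - X.mean(axis=1, keepdims=True)        # de-meaned regressors

        XX = sum(X_dot[i].T @ X_dot[i] for i in range(n))
        XY = sum(X_dot[i].T @ Y_dot[i] for i in range(n))
        beta_fe = np.linalg.solve(XX, XY)

        # Cluster covariance estimator: scores are summed within each unit first.
        A_inv = np.linalg.inv(XX / n)
        scores = np.stack([X_dot[i].T @ (Y_dot[i] - X_dot[i] @ beta_fe) for i in range(n)])
        B = scores.T @ scores / n
        V_fe = A_inv @ B @ A_inv
        se = np.sqrt(np.diag(V_fe) / n)
        return beta_fe, se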

A comment on efficiency. Traditional arguments in favor of the fixed


effects (or within-group) estimator β̂nfe over the first-difference estimator β̂nfd
rely on the fact that under homoskedasticity and no-serial correlation of
Ui,t , β̂nfe has a lower asymptotic variance than β̂nfd . Intuitively, taking first
differences introduces correlation in ∆Ui,t as

E[∆Ui,t ∆Ui,t−1 ] = E[Ui,t Ui,t−1 − Ui,t−1 Ui,t−1 − Ui,t Ui,t−2 + Ui,t−1 Ui,t−2 ]
= − Var(Ui,t−1 ) .

However, in the other extreme where Ui,t follows a random walk, i.e., Ui,t =
Ui,t−1 + Vi,t for some i.i.d. sequence Vi,t , then ∆Ui,t = Vi,t . These results, at
the end of the day, rely on homoskedasticity and so it is advised to simply use
a robust standard error as above and forget about efficiency considerations.
Note that when T = 2, these two estimators are numerically the same.
In addition, first differences are used in dynamic panels and difference in
differences, as we will discuss later.

Remark 7.1 Panel data traditionally deals with units over time. However,
we can think about other cases where the data has a two-dimensional in-
dex and where we believe that one of the indices may exhibit within group
dependence. For example, it could be that we observe “employees” within
“firms”, or “students” within “schools”, or “families” in metropolitan sta-
tistical areas (MSA), etc. Cases like these are similar but not identical to
panel data. To start, units are not “repeated” in the sense that each unit
is potentially observed only once in the sample. In addition, these are cases
where “T ” is usually large and “n” is small. For example, we typically
observe many students (which may be dependent within a school) and few
schools. We will study these cases further in the second part of this class.

7.2 Random Effects


Fixed effects approaches are attractive to economists because they provide
a way of addressing omitted variables bias and related forms of endogeneity,
as long as the omitted factors are time constant. An alternative way to
exploit the time dimension of the panel is to model the evolution of the
unobservable term over time within a unit, and use this model to increase
efficiency relative to ordinary pooled linear regressions. This is known as a
random effects approach. Random effects are not as widely used as fixed
effects in economics because they focus on efficiency rather than bias and
robustness. Nevertheless, random effects approaches are occasionally used
and also have some interesting connections to fixed effects and other types
of panel data models.
The standard random effects model adds the following assumption to
(7.4),

RE1. E[ηi |Xi,1 , . . . , Xi,T ] = 0 .

Hence all of the unobservable time-invariant factors that were being con-
trolled for in the fixed effects approach are now assumed to be mean in-
dependent (ergo, uncorrelated) with the explanatory variables at all time
periods. The strict exogeneity condition of the fixed effects approach (i.e.
FE1) is still maintained, so that the aggregate error term Vit = ηi + Ui,t now
satisfies E[Vit |Xi1 , . . . , XiT ] = 0 for all t = 1, . . . , T . The idea behind the
random effects approach is to exploit the serial correlation in Vit that is gen-
erated by having a common ηi component in each time period. Specifically,
the baseline approach maintains the following.

RE2. (i) Var[Ui,t |Xi,1 , . . . , Xi,T ] = σU2 , (ii) Var[ηi |Xi,1 , . . . , Xi,T ] = ση2 , (iii)
E[Ui,t Ui,s |Xi,1 , . . . , Xi,T ] = 0 for all t ̸= s, (iv) E[Ui,t ηi |Xi,1 , . . . , Xi,T ] =
0 for all t = 1, . . . , T .

Under these assumptions,

Var[Vi,t |Xi,1 , . . . , Xi,T ] = E[ηi2 + Ui,t2 + 2ηi Ui,t |Xi,1 , . . . , Xi,T ] = ση2 + σU2 ,

and

E[Vi,t Vi,s |Xi,1 , . . . , Xi,T ] = E[ηi2 +Ui,t Ui,s +ηi Ui,t +ηi Ui,s |Xi,1 , . . . , Xi,T ] = ση2 .

Combining these results and stacking the observations for unit i, we get that

E[Vi Vi′ |Xi ] = Ω = σU2 IT + ση2 ιT ι′T , (7.13)

where IT is the T × T identity matrix and ιT is a T -dimensional vector of


ones. Under these assumptions, the estimator with the lowest asymptotic
variance is
 −1  
X X
β̂nre =  Xi′ Ω−1 Xi   Xi′ Ω−1 Yi  , (7.14)
1≤i≤n 1≤i≤n

where Xi = (Xi,1 , . . . , Xi,T )′ is the T × k matrix of stacked observations for


unit i, and similarly for Yi . Note this is just a generalized least squares
(GLS) estimator of β. This GLS estimator is, nevertheless, unfeasible, since
Ω depends on the unknown parameters σU2 and ση2 . However, these two can
be easily estimated to form Ω̂ and deliver a feasible GLS estimator of β.
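A sketch of the (feasible) random effects GLS estimator, taking estimates of σU2 and ση2 as given (hypothetical names; how the variance components are estimated is left aside here):

    import numpy as np

    def random_effects_gls(Y, X, sigma_u2, sigma_eta2):
        """GLS estimator (7.14) with Omega = sigma_u2 * I_T + sigma_eta2 * 11'."""
        n, T, k = X.shape
        Omega_inv = np.linalg.inv(sigma_u2 * np.eye(T) + sigma_eta2 * np.ones((T, T)))
        A = np.zeros((k, k))
        b = np.zeros(k)
        for i in range(n):     # accumulate X_i' Omega^{-1} X_i and X_i' Omega^{-1} Y_i
            A += X[i].T @ Omega_inv @ X[i]
            b += X[i].T @ Omega_inv @ Y[i]
        return np.linalg.solve(A, b)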
A few aspects are worth discussing. First, the efficiency gains hold under
the additional structure imposed by RE1 and RE2. In particular, we are
possibly gaining efficiency in a context where the unobserved heterogeneity
ηi is assumed to be mean independent of Xi . In other words, unobserved
time-invariant factors must be uncorrelated with observed covariates. This
was precisely what made the fixed effects approach attractive in the first

place. Second, the efficiency gains hold under the homoskedasticity and
independence assumptions in RE2 and do not hold more generally. These
are undoubtedly strong assumptions. Third, unlike the fixed effects estima-
tor, the random effects approach allows to estimate regression coefficients
associated with time-invariant covariates (this is, some of the Xi,t may be
constant across time - i.e., gender of the individual). So if the analysis is
primarily concerned with the effect of a time-invariant regressor and panel
data is available, it makes sense to consider some sort of random effects type
of approach. Fourth, under RE1 and RE2 β is identified in a single cross-
section. The parameters that require panel data for identification in this
model are the variances of the components of the error ση2 and σU2 , which
are needed for the GLS approach. Finally, note that the terminology “fixed
effects” and “random effects” is arguably confusing as ηi is random in both
approaches.
A last word of caution should be made about the use of Hausman spec-
ification tests. These are tests that compare β̂nfe with β̂nre in order to test
the validity of RE1 (assuming RE2 holds). Under the null hypothesis that
RE1 holds, both estimators are consistent but β̂nre is efficient. Under the
alternative hypothesis, β̂nfe is consistent while β̂nre is not. Now, suppose we
were to define a new estimator β̂n∗ as follows

β̂n∗ = β̂nfe I{Hausman test rejects} + β̂nre I{Hausman test accepts} . (7.15)

The problem with this new estimator is that its finite sample distribution
looks very different from the usual normal approximations. This is gener-
ally the case when there is pre-testing, understood as a situation where we
conduct a test in a first step, and then depending on the outcome of this
test, we do A or B in a second step. A formal analysis of these uniformity
issues are covered in 481 and are beyond the scope of this class.

7.3 Dynamic Models


One benefit of panel data is that it allows us to analyze economic relation-
ships that are inherently dynamic. Specifically, we may be interested in the
effect of lagged outcomes on future outcomes. Let {Yi,t : 1 ≤ i ≤ n, 1 ≤ t ≤
T } be a sequence of random variables and consider the model

Yi,t = ρYi,t−1 + ηi + Ui,t , i = 1, . . . , n t = 2, . . . , T , (7.16)

where ηi and Ui,t are the same as before but now Yi,t−1 is allowed to have a
direct effect on Yi,t , a feature sometimes referred to as state dependence. We
assume that |ρ| < 1. As is common in dynamic panel data (and time series)
contexts, we will assume that the model is dynamically complete in the sense
that all appropriate lags of Yi,t have been removed from the time-varying

error Uit , i.e.,

E[Ui,t |Yi,t−1 , Yi,t−2 , . . . ] = 0 for all t = 1, . . . , T. (7.17)

Consider now taking first differences to (7.16) to obtain,

∆Yi,t = ρ∆Yi,t−1 + ∆Ui,t , i = 1, . . . , n t = 2, . . . , T ,

where, as before, ∆Yi,t = Yi,t − Yi,t−1 and similarly for Ui,t . In general we
will have Cov(∆Yi,t−1 , ∆Ui,t ) ̸= 0 since (7.16) implies

Cov(Yi,t−1 , Ui,t−1 ) ̸= 0 . (7.18)

A similar conclusion would arise if we tried to use the de-meaning trans-


formation. This inherent endogeneity is a generic feature of models that
have both state dependence and time-invariant heterogeneity. In order to
get rid of the fixed effects we have to compare outcomes over time, but if
past outcomes have effects on future outcomes then differenced error terms
will still be correlated with differenced outcomes.
The most commonly proposed solution to this problem is to use other
lagged outcomes as instruments. Given (7.17), we know that Yi,t−2 is un-
correlated with both Ui,t and Ui,t−1 , hence

Cov(Yi,t−2 , ∆Ui,t ) = 0 .

At the same time, we also know that

Cov(Yi,t−2 , ∆Yi,t−1 ) = Cov(Yi,t−2 , Yi,t−1 ) − Cov(Yi,t−2 , Yi,t−2 )


= Cov(Yi,t−2 , ρYi,t−2 + ηi + Ui,t−1 ) − Cov(Yi,t−2 , Yi,t−2 )
= −(1 − ρ) Var[Yi,t−2 ] + Cov(Yi,t−2 , ηi ) ,

which makes Yi,t−2 a valid instrument for ∆Yi,t−1 since we assumed |ρ| < 1
and Cov(Yi,t−2 , ηi ) ̸= 0. An actual expression for this last covariance can be
obtained under additional assumptions. For example, under the assumption
that the initial condition, Yi,0 , is independent of ηi (and ηi ⊥ Ui,t ), then
Cov(Yi,t−2 , ηi ) = ση2 ∑_{j=0}^{t−3} ρj .

This strategy requires T ≥ 3, since otherwise we would not have data on


Yi,t−2 . For larger T we could include additional lags such as Yi,t−3 , Yi,t−4 ,
etc. Following such an approach delivers (T − 2)(T − 1)/2 linear moment
restrictions of the form

E[Yi,t−k (∆Yi,t − ρ∆Yi,t−1 )] = 0, t = 3, . . . , T, k = 2, . . . , t − 1 . (7.19)



The predictive power of these lags for ∆Yi,t−1 is likely to get progressively
weaker as the lag distance gets larger. Weak instrument problems may arise
as a consequence. If T ≥ 4 then one could consider using the differenced
term ∆Yi,t−2 (instead or in addition to the level Yi,t−2 ) as an instrument
for ∆Yi,t−1 . In the literature, these approaches are frequently referred to as
Arellano-Bond or Anderson-Hsiao estimators; see Arellano (2003).
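As a sketch, the simplest Anderson-Hsiao-type estimator uses the level Yi,t−2 as an instrument for ∆Yi,t−1 in the first-differenced equation (hypothetical names; Y stored as an (n, T) array):

    import numpy as np

    def anderson_hsiao(Y):
        """IV estimate of rho using Y_{i,t-2} as instrument for dY_{i,t-1}, t = 3,...,T."""
        dY = np.diff(Y, axis=1)               # column s holds dY_{i,s+2}
        dep = dY[:, 1:].reshape(-1)           # dY_{i,t} for t = 3, ..., T
        lag = dY[:, :-1].reshape(-1)          # dY_{i,t-1}
        inst = Y[:, :-2].reshape(-1)          # Y_{i,t-2}
        return (inst @ dep) / (inst @ lag)    # just-identified IV estimator of rho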

Bibliography
Arellano, M. (2003): Panel Data Econometrics, Oxford University Press.

Wooldridge, J. M. (2010): Econometric analysis of cross section and


panel data, MIT press.
Lecture 8

Difference in Differences

Today we will focus again on the problem of evaluating the impact of a


program or treatment on a population outcome Y . As before, we will use
potential outcomes to describe the problem,

Y (0) potential outcome in the absence of treatment


. (8.1)
Y (1) potential outcome in the presence of treatment

The treatment effect is the difference Y (1) − Y (0) and the usual quantity
of interest is E[Y (1) − Y (0)], typically referred to as the average treatment
effect.
Suppose that we observe a random sample of n individuals from this
population, and that for each individual i we observe both Yi (1) and Yi (0).
Clearly, for each i we can compute the treatment effect Yi (1) − Yi (0) and
estimate the average treatment effect as
(1/n) ∑_{i=1}^n Yi (1) − (1/n) ∑_{i=1}^n Yi (0) .

This is, as we know, infeasible. Indeed, a large fraction of the work in


econometric theory precisely deals with deriving methods that may recover
the average treatment effect (or similar quantities) from observing Yi (1) for
individuals receiving treatment and Yi (0) for individuals without treatment.
The difference in differences (DD) approach is a popular method in this
class that exploits group-level treatment assignments that vary over time.
We start describing this method in the context of a simple two-groups two-
periods example below.

8.1 A Simple Two by Two Case


The simplest setup in which to describe the DD approach is one where outcomes are
observed for two groups for two time periods. One of the groups is exposed


to a treatment in the second period but not in the first period. The second
group is not exposed to the treatment during either period. To be specific,
let
{(Yj,t , Dj,t ) : j ∈ {1, 2} and t ∈ {1, 2}} (8.2)
denote the observed data, where Yj,t and Dj,t ∈ {0, 1} denote the outcome
and treatment status of group j at time t. Note that in our setup, Dj,t = 1
if and only if j = 1 and t = 2 (assuming the first group is the one receiving
treatment in the second period). The parameter we will be able to identify
is
θ = E[Y1,2 (1) − Y1,2 (0)] , (8.3)
which is simply the average treatment effect on the treated : the average effect
of the treatment that occurs in group 1 in period 2. In order to interpret
θ as an average treatment effect, one would need to make the additional
assumption that
θ = E[Yj,t (1) − Yj,t (0)] (8.4)
is constant across j and t. This is a strong assumption and, in principle,
not fundamental for the DD approach. The assumption in (8.4) has partic-
ular bite when we consider multiple treated groups. Consider the following
example as an illustration.

Example 8.1 On April 1, 1992, New Jersey raised the state minimum wage
from $4.25 to $5.05. Card and Krueger (1994) collected data on employment
at fast food restaurants in New Jersey in February 1992 (t = 1) and again
in November 1992 (t = 2) to study the effect of increasing the minimum
wage on employment. They also collected data from the same type of restau-
rants in eastern Pennsylvania, just across the river. The minimum wage in
Pennsylvania stayed at $4.25 throughout this period. In our notation, New
Jersey would be the first group, Yj,t would be the employment rate in group j
at time t, and Dj,t denotes an increase in the minimum wage (the treatment)
in group j at time t.
The identification strategy of DD relies on the following assumption,
E[Y2,2 (0) − Y2,1 (0)] = E[Y1,2 (0) − Y1,1 (0)] , (8.5)
i.e., both groups have “common trends” in the absence of a treatment. One
way to parametrize this assumption is to assume that
Yj,t (0) = ηj + γt + Uj,t , (8.6)
where E[Uj,t ] = 0, and ηj and γt are (non-random) group and time ef-
fects. This additive structure for non-treated potential outcomes implies
that E[Yj,2 (0) − Yj,1 (0)] = γ2 − γ1 ≡ γ, which is constant across groups.
Note that this assumption, together with (8.3) imply that
E[Y1,2 (1)] = θ + η1 + γ2 . (8.7)

In the context of the previous example, this assumption says that in the
absence of a minimum wage change, employment is determined by the sum
of a time-invariant state effect, a year effect that is common across states,
and a zero mean shock. Before we discuss the identifying power of this
structure, we discuss two natural (but unsuccessful) approaches that may
come to mind.

8.1.1 Pre and post comparison


A natural approach to identify θ in (8.4) would be to compare Y1,2 and Y1,1 ,
this is, using outcomes before and after the policy change for the treated
group alone. This approach delivers,

E[∆Y1,2 ] = E[Y1,2 (1) − Y1,1 (0)] = θ + γ ,

where ∆Y1,2 = Y1,2 − Y1,1 and γ = γ2 − γ1 . Clearly, this approach does


not identify θ in the presence of time trends, i.e., γ ̸= 0. In the context
of Example 8.1, the employment rate in New Jersey may have been going
up (or down) in the absence of a policy change (the treatment), and so
before and after comparisons confound the time trend as being part of the
treatment effect. Unless one is willing to assume γ = 0, this approach does
not identify θ.

8.1.2 Treatment and control comparison


A second natural approach to identify θ in (8.4) would be to compare Y1,2
and Y2,2 , that is, using outcomes from both groups in the second time period.
This approach delivers,

E[Y1,2 − Y2,2 ] = E[Y1,2 (1) − Y2,2 (0)] = θ − η ,

where η = η2 − η1 . Clearly, this approach does not identify θ in the presence


of persistent group differences, i.e., η ̸= 0. In the context of Example 8.1, the
employment rate in New Jersey and Pennsylvania may be idiosyncratically
different in the absence of a policy change and so comparing these two states
confound these permanent differences as being part of the treatment effect.
Unless one is willing to assume η = 0, this approach does not identify θ.

8.1.3 Taking both differences


The DD approach exploits the common trends assumption in (8.5) to identify
θ. The idea is to consider a second “difference” to remove γ (the time trend)
from the difference that arises from comparing pre and post outcomes. In
other words, the idea is to take the “difference” of the “differences”, ∆Y1,2
74 LECTURE 8. DIFFERENCE IN DIFFERENCES

Emp. rate

Emp
. tren
d - con
−γ trol

−η d
te
- trea
trend
Emp.

θ
coun
terfa
c tual
- tre
ated Y1,1 (0) + Y2,2 (0) − Y2,1 (0)

Time
t=1 t=2

Figure 8.1: Causal effects in the DD model

and ∆Y2,2 , to obtain

E[∆Y1,2 − ∆Y2,2 ] = E[Y1,2 (1) − Y1,1 (0)] − E[Y2,2 (0) − Y2,1 (0)]
=θ+γ−γ =θ .

Thus, the approach identifies the treatment effect by taking the differences
between pre-versus-post comparisons in the two groups, and exploiting the
fact that the time trend γ is “common” in the two groups.
Note that an alternative interpretation to the same idea is to compare
(Y1,2 − Y2,2 ) and (Y1,1 − Y2,1 ), this is, the treatment and control comparison
before and after the policy change. This is because

E[(Y1,2 − Y2,2 ) − (Y1,1 − Y2,1 )] = E[Y1,2 (1) − Y2,2 (0)] − E[Y1,1 (0) − Y2,1 (0)]
=θ−η+η =θ .

Using this representation, the difference for the pre-period is used to identify
the persistent group difference η, a strategy that again works under the
common trends assumption in (8.5).
A final interpretation of the same idea is that the DD approach construct
a counterfactual potential outcome Y1,2 (0) (which is unobserved) by combin-
ing Y1,1 (0), Y2,2 (0), and Y2,1 (0), which are all observed. The “constructed”
potential outcome is simply

Ỹ1,2 (0) = Y1,1 (0) + Y2,2 (0) − Y2,1 (0)


= η1 + γ1 + η2 + γ2 − (η2 + γ1 ) + U1,1 + U2,2 − U2,1
= η1 + γ2 + Ũ1,2 ,

where Ũ1,1 = U1,1 + U2,2 − U2,1 . Computing E[Y1,2 − Ỹ1,2 (0)] = θ therefore
delivers a valid identification strategy. Figure 8.1 illustrates this idea.
8.2. A MORE GENERAL CASE 75

8.1.4 A linear regression representation with individual data


Suppose that we observe

{(Yi,j,t , Dj,t ) : i ∈ Ij,t , j ∈ {1, 2} and t ∈ {1, 2}} , (8.8)

where Ij,t is the set of individual in group j at time t. For simplicity, take
the treatment indicator Dj,t = I{j = 1}I{t = 2} to be non-random and
note that the observed outcome is

Yi,j,t = Yi,j,t (1)Dj,t + (1 − Dj,t )Yi,j,t (0) = (Yi,j,t (1) − Yi,j,t (0))Dj,t + Yi,j,t (0) ,

so that if we define Ui,j,t = Yi,j,t − E[Yi,j,t ], we can write

Yi,j,t = θDj,t + ηj + γt + Ui,j,t (8.9)


= η1 + γ1 + θDj,t + ηj − η1 + γt − γ1 + Ui,j,t
= δ + θDj,t + ηI{j = 2} + γI{t = 2} + Ui,j,t ,

where δ = η1 + γ1 . Thus, we can estimate θ by running a regression of


Yi,j,t on (1, Dj,t , I{j = 2}, I{t = 2}) and extracting the coefficient on Dj,t .
The regression formulation of the DD model offers a convenient way to
construct DD estimates and standard errors. It also makes it easy to add
additional groups and time periods to the regression setup. We might, for
example, add additional control groups and pre-treatment periods. The
resulting generalization thus includes a dummy for each state and period
but is otherwise unchanged.

8.2 A More General Case


Now consider the case with many groups and many time periods (and no
individual data for now). The derivation in (8.9) suggests that the natural
regression to consider would be

Yj,t = θDj,t + ηj + γt + Uj,t with E[Uj,t ] = 0 . (8.10)

Here, the observed data is given by {(Yj,t , Dj,t ) : j ∈ J0 ∪ J1 , t ∈ T0 ∪ T1 },


where Yj,t is the outcome of unit j at time t, Dj,t is the (non-random)
treatment status of unit j at time t, T0 is the set of pre-treatment time
periods, T1 is the set of post-treatment time periods, J0 is the set of controls
units, and J1 is the set of treatment units. The scalar random variables ηj ,
γt and Uj,t are unobserved and θ ∈ Θ ⊆ R is the parameter of interest. The
regression in (8.10) is known as the two-way fixed effect regression.
Define
1 X 1 X
∆n,j = Yj,t − Yj,t , (8.11)
|T1 | |T0 |
t∈T1 t∈T0
76 LECTURE 8. DIFFERENCE IN DIFFERENCES

and
1 X 1 X
θ̂n = ∆n,j − ∆n,j . (8.12)
|J1 | |J0 |
j∈J1 j∈J0

It is easy to show that θ̂n is the LS estimator of a regression of Yj,t on Dj,t


with groups fixed effects (ηj ) and time fixed effects (γt ). i.e., the regression
in (8.10). Simple algebra shows that
 
1 X 1 X 1 X
θ̂n − θ = Uj,t − Uj,t 
|J1 | |T1 | |T0 |
j∈J1 t∈T1 t∈T0
 
1 X
 1
X 1 X
− Uj,t − Uj,t  .
|J0 | |T1 | |T0 |
j∈J0 t∈T1 t∈T0

It follows immediately from E[Uj,t ] = 0 that E[θ̂n ] = θ. This estimator


is also consistent and asymptotically normal in an asymptotic framework
with a large number of treated and untreated groups, i.e., |J1 | → ∞ and
|J0 | → ∞. Other asymptotic approximations may lead to substantially
different results and we will discuss some of these in the second part of this
class.
The parameter θ could be interpreted as the ATE under assumption
(8.4), or as the ATT under the assumption that

E[Yj,t (1) − Yj,t (0)] , (8.13)

is constant for all j ∈ J1 and t ∈ T1 . Alternatively, one could estimate a


different θj for each j ∈ J1 . In general, it has recently been show that the
regression in (8.10) may produce misleading estimates (i.e., θ̂n inconsistent
for the ATT), if the policy’s effect is heterogeneous between groups or over
time, as is often the case in empirical settings. A special case where it would
be consistent for the ATT under the parallel trends assumption alone is when
(i) the design is staggered, meaning that groups’ treatment can only increase
over time and can change at most once; (ii) the treatment is binary; (iii)
there is no variation in treatment timing: all treated groups start receiving
the treatment at the same date. However, conditions (i)-(iii) are seldom met
in practice. See De Chaisemartin and D’Haultfoeuille (2022), and references
therein, for details on these issues.

8.2.1 Thinking ahead: inference and few treated groups


Inference in DD could be tricky and requires thinking. Two issues are of
particular importance. First, what exactly is assumed to be “large”? Are
groups going to infinity? Say, |J1 | → ∞ and |J0 | → ∞. What happens if we
have a few treated groups but many controls? Say, |J1 | fixed and |J0 | → ∞.
8.3. SYNTHETIC CONTROLS 77

What happens if we have few treated and control groups but many time
periods? Say, |J1 | and |J0 | fixed, but |T1 | → ∞ and |T0 | → ∞. Second,
what are the assumptions on Uj,t ? It is typically common to assume that
Uj,t ⊥ Uj ′ ,s for all j ′ ̸= j and (t, s). However, one would expect Uj,t and
Uj,s to be correlated, at least for t and s being “close” to each other. On
top of this, in the context of individual data one would expect Ui,j,t to be
correlated with Ui′ ,j,s - i.e., units in the same group may be dependent to
each other even if they are in different time periods. Each of these aspects
have tremendous impact on which inference tools end up being valid or not.
We will discuss some of these in the second part of this class.
As a way to illustrate how important these assumptions may be, let’s
consider the case where J1 = {1} but |J0 | → ∞ - we also assume that |T0 |
and |T1 | are finite. This is, only the first group is treated, while there are
many control groups. This is common in empirical applications with US
state level data, where often a few states exhibit a policy change while all
the other states do not. The DD estimator in this case reduced to

1 X
θ̂n = ∆n,1 − ∆n,j ,
|J0 |
j∈J0
 
1 X 1 X 1 X 1 X 1 X
=θ+ U1,t − U1,t − Uj,t − Uj,t 
|T1 | |T0 | |J0 | |T1 | |T0 |
t∈T1 t∈T0 j∈J0 t∈T1 t∈T0
P 1 X 1 X
→θ+ U1,t − U1,t ,
|T1 | |T0 |
t∈T1 t∈T0

as |J0 | → 0, assuming {Uj,t : t ∈ T0 ∪ T1 } is i.i.d. across j ∈ J0 . We conclude


that the DD estimator is not even consistent for θ. Interestingly enough, it
is still possible to do inference on θ using the approach proposed by Conley
and Taber (2011) or, more recently, the randomization approach in Canay
et al. (2017).

8.3 Synthetic Controls


Empirical applications with one or few treated groups and many control
groups are ubiquitous in economics. The DD approach as described above
in essence treats all control groups as being of equal quality as a control
group. This may not be true and so the researcher may want to somehow
weight the controls in order to give more importance to those controls that
seem “better” for the given treated group. This is basically the idea of
the synthetic control method, originally proposed by Abadie et al. (2010).
The application in their paper is the effect of California’s tobacco control
program on state-wide smoking rates. During the time period in question,
78 LECTURE 8. DIFFERENCE IN DIFFERENCES

there were 38 states in the US that did not implement such programs. Rather
than just using a standard DD analysis - which effectively treats each state
as being of equal quality as a control group - ADH propose choosing a
weighted average of the potential controls. Of course, choosing a suitable
control group or groups is often done informally, including matching on pre-
treatment predictors. ADH formalize the procedure by optimally choosing
weights, and they propose methods for inference.
Consider the simple case in Section 8.1.2, except that we now assume
there are J0 possible controls and that J1 = {1}. Synthetic controls also
allow the model for potential outcomes to be more flexible, specially when
it comes to the parallel trends assumption required for DD. To be concrete,
in what follows assume that

Yj,t = ηj γt + Uj,t , (8.14)

so that now the time effect and the group effect interact with each other
(note that common trends does not hold in this case). Comparing Y1,2 and
Yj,2 for any j ∈ J0 delivers

E[Y1,2 − Yj,2 ] = E[Y1,2 (1) − Yj,2 (0)] = θ + γ2 (η1 − ηj ) ,

and so this approach does not identify θ in the presence of persistent group
differences. The idea behind synthetic controls is to construct the so-called
synthetic control X
Ỹ1,2 (0) = wj Yj,2 ,
j∈J0
P
by appropriately choosing the weights {wj : j ∈ J0 , wj ≥ 0, j∈J0 wj = 1}.
In order for
h this idea to work,
i it must be the case that E[Y1,2 (0)] = E[Ỹ1,2 (0)]
so that E Y1,2 − Ỹ1,2 (0) = θ. Now, for a given set of weights, this approach
delivers
   
h i X X
E Y1,2 − Ỹ1,2 (0) = E Y1,2 − wj Yj,2  = θ + γ2 η1 − wj η j  .
j∈J0 j∈J0

It follows this approach identifies θ if we could choose the weights in a way


such that X
η1 = wj ηj . (8.15)
j∈J0

This is, however, not feasible as we do not observe the group effects ηj . The
main result in Abadie et al. (2010) can be stated for the example in this
∗ ∗
P as ∗follows: suppose that there exists weights {wj : j ∈ J0 , wj ≥
section
0, j∈J0 wj = 1} such that
X
Y1,1 = wj∗ Yj,1 . (8.16)
j∈J0
8.3. SYNTHETIC CONTROLS 79

If we construct the synthetic control using these optimal weights wj∗ ,


X
Ỹ1,2 (0) = wj∗ Yj,2 ,
j∈J0
h i
then it follows that E Y1,2 − Ỹ1,2 (0) = θ.
Proving this result in the context of our example is straightforward.
First, note that by (8.16) we get that
X X
η1 γ1 + U1,1 = wj∗ ηj γ1 + wj∗ Uj,1 ,
j∈J0 j∈J0

so that  
X X
γ1 η1 − wj∗ ηj  = wj∗ (U1,1 − Uj,1 ) . (8.17)
j∈J0 j∈J0

Next note that


X
Y1,2 − Ỹ1,2 (0) = θ + η1 γ2 + U1,2 − wj∗ (ηj γ2 + Uj,2 )
j∈J0
 
X X
= θ + γ2 η1 − wj∗ ηj  + wj∗ (U1,2 − Uj,2 )
j∈J0 j∈J0
γ2 X ∗ X
=θ+ wj (U1,1 − Uj,1 ) + wj∗ (U1,2 − Uj,2 ) ,
γ1
j∈J0 j∈J0

where we used (8.17) in the third equality. The result follows from E[Uj,t ] =
0 for all (j, t).
We then get the weights by “matching” the observed outcomes of the
treated group and the control groups in the period before the policy change.
In practice, Y1,1 may not lie in the convex hull of {Yj,1 : j ∈ JP
0 } and so the
method relies on minimizing the distance between Y1,1 and j∈J0 wj Yj,1 .
Abadie et al. (2010) provide some formal arguments around these issues,
and in particular require that |T0 | → ∞ and that Uj,t is independent across
j and t. However, the model they consider is slightly more general than the
standard DD model, as it does not require the “common trends” assumption.
The basic idea can be extended in the presence of covariates Xj that
are not (or would not be) affected by the policy change. In this case, the
weights would be chosen to minimize the distance between
X
(Y1,1 , X1 ) and wj (Yj,1 , Xj ) .
j∈J0

The optimal weights - which differ depending on how we define distance -


produce the synthetic control whose pre-intervention outcome and predictors
80 LECTURE 8. DIFFERENCE IN DIFFERENCES

of post-intervention outcome are “closest”. Abadie et al. (2010) propose


permutation methods for inference. We will discuss permutation tests in
the second part of this class. This method has become popular in recent
years and you will probably see it used in applied papers.

8.4 Discussion
To keep the exposition simple we have ignored covariates. However, it is
straightforward to incorporate additional covariates under the assumption
that potential outcomes are linear in those covariates, i.e.,

E[Yj,t (0)|Xj,t ] = ηj + γt + Xj,t β.

This would simply entail adding Xj,t′ β to the regression in (8.10).

It is important to keep in mind that all the results on DD follow from


the assumption that
E[Yj,t (0)] = ηj + γt ,
which is a way to model the “common trends” assumption in (8.5). Where
there are multiple time periods, people will often look at the pre (and post)
treatment trends and compare them between treatment and control as a
way to “eye-ball” verify this assumption. An unpleasant feature of this
assumption is that is not robust to nonlinear transformations of the outcome
variables. In other words, the assumption that

E[Y2,2 (0) − Y2,1 (0)] = E[Y1,2 (0) − Y1,1 (0)] ,

does not imply, for example, that

E[log Y2,2 (0) − log Y2,1 (0)] = E[log Y1,2 (0) − log Y1,1 (0)] .

Indeed, the two assumptions are non-nested and one would typically suspect
that both cannot hold at the same time.

Bibliography
Abadie, A., A. Diamond, and J. Hainmueller (2010): “Synthetic con-
trol methods for comparative case studies: Estimating the effect of Cal-
ifornia’s tobacco control program,” Journal of the American Statistical
Association, 105, 493–505.
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.
Canay, I. A., J. P. Romano, and A. M. Shaikh (2017): “Randomization
Tests under an Approximate Symmetry Assumption,” Econometrica, 85,
1013–1030.
BIBLIOGRAPHY 81

Card, D. and A. B. Krueger (1994): “Minimum wages and employment:


a case study of the fast-food industry in New Jersey and Pennsylvania,”
The American Economic Review, 84, 772–793.

Conley, T. G. and C. R. Taber (2011): “Inference with “difference


in differences” with a small number of policy changes,” The Review of
Economics and Statistics, 93, 113–125.

De Chaisemartin, C. and X. D’Haultfoeuille (2022): “Two-way fixed


effects and differences-in-differences with heterogeneous treatment effects:
A survey,” Tech. rep., National Bureau of Economic Research.
82 LECTURE 8. DIFFERENCE IN DIFFERENCES
Part II

Some Topics

83
Lecture 9

Non-Parametric Regression

9.1 Setup
Let (Y, X) be a random vector where Y and X take values in R and let
P be the distribution of (Y, X). The case where X ∈ Rk will be discussed
later. We are interested in the conditional mean of Y given X:

m(x) = E[Y |X = x] .

Let {(Y1 , X1 ), . . . , (Yn , Xn )} be an i.i.d. sample from P . We first consider


the discrete case. If X takes ℓ values {x1 , x2 , . . . , xℓ }, then
Pn
I{Xi = x}Yi
m̂(x) = Pi=1n
i=1 I{Xi = x}

is a natural estimator of m(x) for x ∈ {x1 , x2 , . . . , xℓ }. It is straightforward


to show that m̂(x) is consistent and asymptotically normal if E[Y 2 ] < ∞.

9.2 Nearest Neighbor vs. Binned Estimator


Suppose now that X is a continuous random variable. In this case, the
event {Xi = x} has zero probability, so that the previous estimator will be
undefined for almost every value of x. However, if m(x) is continuous, we
can use observations that are “close” to x to estimate m(x). This motivates
the following estimator:

Definition 9.1 (q-Nearest Neighbor Estimator) Let Jq (x) be the set


of indices in {1, . . . , n} associated with q closest-to-x values of {X1 , . . . , Xn }.
The q-nearest neighbor estimator is defined as
1 X
m̂nn (x) = Yi .
q
i∈Jq (x)

85
86 LECTURE 9. NON-PARAMETRIC REGRESSION

Note that if q = n, we are using all of the observations in the estimation


of each point. Then m̂nn (x) just becomes Ȳn , producing a perfectly flat
estimated function. The variance is very low but unless mnn (x) is truly flat,
bias will be high for many values of x. Alternatively, we can use the Xi that
is closest to x. In this case, bias should be relatively small, but since so few
observations are used, variance is high. Generally, picking q is a problem.
One way to do this is via cross validation – to be discussed later.
The q-NN estimator takes an average of the q observations closest to x,
and so the number of “local” observations is always q. However, this means
that the distance between these observations and x is random. In particular,

h = max |Xi − x|
i∈Jq (x)

is random. As an alternative to the q-NN method, we can fix an h and


consider all observations with |Xi − x| ≤ h. Now, it is the number of local
observations that is random. This gives rise to the binned estimator:

Definition 9.2 (Binned Estimator) Let h > 0 be given. The binned es-
timator is defined as
Pn
I{|Xi − x| ≤ h}Yi
m̂b (x) = Pi=1
n . (9.1)
i=1 I{|Xi − x| ≤ h}

The above formula can be interpreted as a weighted average,


n
X I{|Xi − x| ≤ h}
m̂b (x) = wi (x)Yi with wi (x) = Pn
i=1 i=1 I{|Xi − x| ≤ h}
Pn
and i=1 wi (x) = 1. Just as the choice of q mattered for the q-NN estimator,
the choice of h will be important for the binned estimator.

9.3 Nadaraya-Watson Kernel Estimator


One deficiency of the binned estimator is that it is discontinuous at x =
Xi ±h. This occurs because the weights used are based on indicator functions
but in principle one could use some other (continuous) weights. The family
of weights typically used in non-parametric estimation are called “kernels”.
Our goal is to obtain continuous estimates m̂(x) by using continuous kernels.

Definition 9.3 (2nd order, Non-negative, Symmetric Kernel) A second-


order kernel function k(u) : R → R satisfies
R∞
1. −∞ k(u)du = 1

2. 0 ≤ k(u) < ∞
9.3. NADARAYA-WATSON KERNEL ESTIMATOR 87

3. k(u) = k(−u)
R∞ 2
4. κ2 = −∞ u k(u)du ∈ (0, ∞)
Note that the definition of the kernel does not involve continuity. Indeed,
the binned estimator can be written in terms of a kernel function. To see
this, let
1
k0 (u) = I{|u| ≤ 1}
2
be the uniform density on [−1, 1]. Observe that
   
|Xi − x| Xi − x
I{|Xi − x| ≤ h} = I ≤ 1 = 2k0
h h
so that we can write m̂b (x) in (9.1) as
 
Pn Xi −x
k
i=1 0 h Yi
m̂(x) = P   .
n Xi −x
i=1 k0 h

This is a special case of the so-called Nadaraya-Watson estimator.

Definition 9.4 (Nadaraya-Watson Kernel Estimator) Let k(u) be a


second-order kernel and h > 0 be a bandwidth. Then, the Nadaraya-Watson
estimator is defined as
 
Pn Xi −x
i=1 k h Yi
m̂(x) = P   .
n Xi −x
i=1 k h

The Nadaraya-Watson estimator is also known as the kernel regression


estimator or the local constant estimator. The bandwidth h > 0 plays
the same role as before. In particular, the larger the h, the smoother the
estimates (but the higher the bias):

h → ∞ ⇒ m̂(x) → Ȳn .

The smaller the h, the more erratic the estimates (but the lower the bias):

h → 0 ⇒ m̂(Xi ) → Yi .

Some popular continuous kernels include the Gaussian kernel,


 2
1 u
kg (u) = √ exp − ,
2π 2
and the Epanechnikov kernel,
3
ke (u) = (1 − u2 )I{|u| ≤ 1} .
4
88 LECTURE 9. NON-PARAMETRIC REGRESSION

9.3.1 Asymptotic Properties


We will use the asymptotic framework in which n → ∞, h → 0, nh → ∞,
and h = O(n−1/5 ). We wish to show that for each x,
√ √ √
nh(m̂(x) − m(x)) = nh∆1 (x) + nh∆2 (x)

where nh∆2 (x) converges
√ to a limit that is asymptotically normal and
centered at zero, and √nh∆1 (x) converges to an asymptotic bias term. The
rate of convergence is nh since this reflects the “effective” number of ob-
servations that we are using.
Start by writing Yi = m(Xi ) + Ui so that E[Ui |Xi ] = 0, and let
σ 2 (x) = Var[Ui |Xi = x] .
Fix x ∈ R and write
Yi = m(x) + (m(Xi ) − m(x)) + Ui .
Then we can rewrite the numerator of m̂(x) as
n   n  
1 X Xi − x 1 X Xi − x
k Yi = k m(x)
nh h nh h
i=1 i=1
n  
1 X Xi − x
+ k (m(Xi ) − m(x))
nh h
i=1
n  
1 X Xi − x
+ k Ui
nh h
i=1
= fˆ(x)m(x) + ∆
ˆ 1 (x) + ∆
ˆ 2 (x) ,

where fˆ(x) is the non-parametric density estimator of the pdf of X, f (x).


It follows that
1 ˆ 
ˆ 2 (x) .
m̂(x) − m(x) = ∆1 (x) + ∆
fˆ(x)
ˆ 2 (x) and derive its mean and variance. Since E[Ui |Xi ] = 0,
First consider ∆
   
1
ˆ 2 (x)] = E k Xi − x
E[∆ Ui = 0 .
h h
The variance can be expressed as
"   2 #
h i
ˆ 2 (x) = 1 X i − x
Var ∆ E k Ui
nh2 h
"  #
Xi − x 2 2

1
= E k σ (Xi )
nh2 h
z−x 2 2
Z ∞  
1
= k σ (z)f (z)dz
nh2 −∞ h
9.3. NADARAYA-WATSON KERNEL ESTIMATOR 89

We now simplify the expression by the change of variables u = h1 (z − x).


This will require z to be in the interior of the support of X. Using the
proposed substitution yields
Z ∞
σ 2 (x)f (x) ∞
Z  
1 2 2 2 1
k (u) σ (x + hu)f (x + hu)du = k (u) du + o ,
nh −∞ nh −∞ nh

assuming f (x) is continuously differentiable at x and σ 2 (x) is continuous at


x. Let Z ∞
R(k) = k (u)2 du
−∞
denote the so-called roughness of the kernel. By the above derivation,
h i σ 2 (x)f (x)R(k)  
1
ˆ
Var ∆2 (x) = +o ,
nh nh

and so by a triangular array CLT,


√ d
ˆ 2 (x) → N (0, σ 2 (x)f (x)R(k)) .
nh∆
ˆ 1 (x) and derive its mean and variance. Start with the
Second consider ∆
mean,
   
ˆ 1 Xi − x
E[∆1 (x)] = E k (m(Xi ) − m(x))
h h
1 ∞
 
z−x
Z
= k (m(z) − m(x))f (z)dz
h −∞ h
Z ∞
= k (u) (m(x + hu) − m(x))f (x + hu)du .
−∞

Assuming twice continuous differentiability of m(x) (together with f (x) con-


tinuously differentiable), expand m(x) and f (x) up to o(h2 ), i.e.,
1
m(x + hu) − m(x) = m′ (x)hu + m′′ (x)h2 u2 + o(h2 )
2
f (x + hu) = f (x) + f ′ (x)hu + o(h) .

Plug into the previous integral to obtain,


Z ∞
h2 u2 ′′
 

m (x) f (x) + uhf ′ (x) du + o(h2 )

k (u) m (x)hu +
−∞ 2
Z ∞ 
= uk (u) du m′ (x)f (x)h
−∞
Z ∞   
2 2 1 ′′ ′ ′
+ u k (u) du h m (x)f (x) + m (x)f (x) + o(h2 ) .
−∞ 2
90 LECTURE 9. NON-PARAMETRIC REGRESSION

Let κ2 be defined as Z ∞
κ2 = u2 k (u) du
−∞
and  
1 ′′ −1 ′ ′
B(x) = m (x) + f (x)m (x)f (x) .
2
Using this notation and the symmetry of the kernel, we can write
ˆ 1 (x)] = κ2 h2 f (x)B(x) + o(h2 ) .
E[∆
A similar expansion shows that
 2  
ˆ 1 (x) = O h 1
h i
Var ∆ =o .
nh nh
Again, by a triangular array CLT,
√ d
nh(∆ˆ 1 (x) − h2 κ2 f (x)B(x)) → 0.
P
Putting all the pieces together and using the fact that fˆ(x) → f (x), we have
our theorem

Theorem 9.1 (Asymptotic Normality) Suppose that


1. f (x) is continuously differentiable at the interior point x with f (x) > 0.

2. m(x) is twice continuously differentiable at x.

3. σ 2 (x) > 0 is continuous at x.

4. k(x) is a non-negative, symmetric, 2nd order kernel.

5. E[|Y |2+δ ] < ∞ for some δ > 0.

6. n → ∞, h → 0, nh → ∞, and h = O(n−1/5 ).

It follows that
√ σ 2 (x)R(k)
 
2
 d
nh m̂(x) − m(x) − h κ2 B(x) → N 0, .
f (x)
From the theorem, we also have that the asymptotic mean squared error
of the NW estimator is
σ 2 (x)R(k)
M SE(x) = h4 κ22 B 2 (x) + .
nhf (x)
The optimal rate of h which minimises the asymptotic MSE is therefore
Cn−1/5 , where C is a function of (κ2 , B(x), σ 2 (x), R(k), f (x)). In this case,
the MSE converges at rate O(n−4/5 ), which is the same as the rate obtained
in density estimation. It is possible to estimate C, for instance by plug-in
approaches. However, this is cumbersome and other methods, such as cross
validation may be easier.
9.3. NADARAYA-WATSON KERNEL ESTIMATOR 91

Kernel Choice. The asymptotic distribution of our estimator depends on


the kernel through R(k) and κ2 . An optimal kernel would therefore minimize
R(k). It turns out that the Epanechnikov family is optimal for regression,
as with density estimation.

Bandwidth Choice. The constant C for the optimal bandwidth depends


on the first and second derivatives of the mean function m(x). When the
derivative function B(x) is large, the optimal bandwidth is small. When the
derivative is small, the optimal bandwidth is large. There exists reference
bandwidths for nonparametric density estimation (like Silverman’s rule-of-
thumb) but in nonparametric regression these are less natural.

Bias and Undersmoothing. Note that the bias term needs to be es-
timated to obtain valid confidence intervals. However, B(x) depends on
m′ (x), m′′ (x), f ′ (x) and f (x). Estimating these objects is arguably more
complicated than the problem we started out with. A (proper) residual
bootstrap could be used to obtain valid confidence interval.
Alternatively, we can undersmooth. Undersmoothing is about choosing
h such that √
nhh2 → 0 ,
which makes the bias small, i.e.,

nhh2 κ2 B(x) ≈ 0 .
This eliminates the asymptotic bias but requires h to be smaller than opti-
mal, since optimality requires that
nhh4 → C > 0 .
Such an h will also be incompatible with bandwidth choice methods like cross
validation. Further, undersmoothing does not work well in finite samples.
Better methods exist, though they are outside the scope of the course.

Curse of Dimensionality. Now consider the problem of estimating m :


Rdx → R. where dx > 1 is the dimension of X. The multivariate NW esti-
mator is implemented in a way similar to the one we just described, except
that we now require a multivariate kernel and dx bandwidths. However, the
rate of convergence of the NW estimator becomes
p √
nh1 . . . hdX or nhdx
depending on whether or not we use the same bandwidth for each component
of X. This is the curse of dimensionality: the higher dx , the slower the rate
of convergence. Intuitively, with higher dimensions, it becomes harder to
find “effective” observations. In this case, optimal bandwidths and MSE are
−1 −4
h = O(n 4+dx ) and M SE = O(n 4+dx ) .
92 LECTURE 9. NON-PARAMETRIC REGRESSION

Linear conditional mean. The NW estimator may not perform well


when m(x) is linear, that is when m(x) = β0 +β1 x. In particular, it performs
poorly if the marginal distribution of Xi is not roughly uniform. Suppose
Yi = β0 + β1 Xi so that there is no error in the model. The NW estimator,
when applied to this data generated by this purely linear model, yields a
nonlinear output.

Boundaries of the Support. The NW estimator performs poorly on the


boundaries of the support of X. For points on the boundary, bias is of order
O(h). For x s.t. x ≤ min{X1 , . . . , Xn }, the NW estimator is an average only
of Yi values for observations to the right of x. If m(x) is positively sloped,
the NW estimator will be upward biased. Our change of variable argument
no longer applies and the estimator is inconsistent at the boundary.

9.4 Local Linear Estimator


The Nadaraya-Watson estimator is often called a local constant estimator
because it locally (about x) approximates the CEF m(x) as a constant. To
see this, note that m̂(x) solves the minimization problem:
n  
X Xi − x
m̂(x) = argmin k (Yi − c)2 , (9.2)
c h
i=1

which is a weighted regression Yi on an intercept only. Without the weights,


the estimation problem will have the sample mean as the solution. The
NW estimator generalizes this to a “local” mean. This suggests that we can
construct alternative nonparametric estimators of m(x) by using other local
approximations.
A popular choice is the local linear (LL) approximation. Instead of ap-
proximating m(x) locally as a constant, the LL approximation approximates
m(x) locally by a linear function. We will do this by locally weighted least
squares.

Definition 9.5 (Local Linear (LL) Estimator) For each x, solve the fol-
lowing minimization problem,
n  
X Xi − x
{β̂0 (x), β̂1 (x)} = argmin k (Yi − b0 − b1 (Xi − x))2 . (9.3)
(b0 ,b1 ) h
i=1

The local linear estimator of m(x) is the local intercept: β̂0 (x).
The LL estimator of the derivative of m(x) is the estimated slope coefficient:

m̂′ (x) = β̂1 (x) .


9.4. LOCAL LINEAR ESTIMATOR 93

If we write the local model

Yi = β0 + β1 (Xi − x) + Ui with E[U |X = x] = 0 ,

then taking conditional expectations, we see how using the regressor Xi − x


rather than Xi makes the intercept equal to m(x) = E[Y |X = x].
To obtain the least squares formula, set, for each x,

Zi (x) = (1, Xi − x)′

and  
Xi − x
ki (x) = k .
h
Then
  n
!−1 n
β̂0 (x) X

X
= ki (x)Zi (x)Zi (x) ki (x)Zi (x)Yi , (9.4)
β̂1 (x) i=1 i=1

so that for each x, the estimator is just weighted least squares of Y in Z(x).
In fact, as h → ∞, the LL estimator approaches the full-sample linear least-
squares estimator
m̂(x) = β̂0 + β̂1 x .
This is because as h → ∞, all observations receive equal weight regardless
of x. The LL estimator is thus a flexible generalization of least squares.
Deriving the asymptotic distribution of the LL estimator is similar to
that of the NW estimator, but much more involved. We will skip that here.

Theorem 9.2 (Asymptotic Normality) Let m̂(x) be the LL estimator


as previously defined. Under conditions 1-6 in the NW theorem,
√ σ 2 (x)R(k)
   
2 1 ′′ d
nh m̂(x) − m(x) − h κ2 m (x) → N 0, .
2 f (x)

Relative to the bias of the NW estimator,


 
1 ′′ −1 ′ ′
B(x) = m (x) + f (x)m (x)f (x) ,
2

the second term is no longer present. This simplified expression suggests


reduced bias, though in theory, bias could be larger as opposing terms could
cancel out. Because the bias of LL estimator does not depend of f (x), we
say that it is design adaptive. Furthermore, for the LL estimator to be
consistent and asymptotically normal only continuity of f (x) is required,
not differentiability. As such, relative to the NW estimator, we can relax
condition 1.
94 LECTURE 9. NON-PARAMETRIC REGRESSION

9.4.1 Nadaraya-Watson vs Local Linear Estimator


In contrast to the NW estimator, the LL estimator preserves linear data.
In particular, if Yi = β0 + β1 Xi , then for any sub-sample, a local linear
regression fits exactly, so that m̂(x) = m(x).
Furthermore, the distribution of the LL estimator is invariant to the first
derivative of m. As such, it has zero bias when the true regression is linear.
In addition, the LL estimator has better properties at the boundary than
the NW estimator. Intuitively, the local linear estimator fits a (weighted)
least-squares line through data near the boundary. As such, even if x is at
the boundary of the regression support, this estimator will be unbiased as
long as the true relationship is linear. More generally, the LL estimator has
bias of order O(h2 ) at all x.
Extensions that allow for discontinuities in m(x), f (x) and σ(x) exist.

9.5 Related Methods


There are several other non-parametric methods we did not cover. For
example, the Splines Estimator is the unique minimizer of
n
X Z
2
(Yi − m̂(Xi )) + λ (m̂′′ (u))2 du .
i=1

One advantage of spline estimators over kernels is that global inequality and
equality constraints can be imposed more conveniently. Series Estimators
are of the form
Xτn
m̂(x) = β̂j φj (x) .
j=0

They are typically very easy to compute. However, there is relatively little
theory about how to select the basis functions φ(x) and the smoothing
parameters τn .

Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.
Lecture 10

Regression Discontinuity and


Matching

(These notes will be revised before this class takes place )


Today we study evaluation methods to measure treatment effects of a
policy or intervention on an outcome Y . To do this, we define potential
outcomes as usual where

Y (0) potential outcome in the absence of treatment


Y (1) potential outcome in the presence of treatment.

Let D ∈ {0, 1} denote treatment assignment status. The treatment effect


is Y (1) − Y (0) and a parameter of interest could be the average treatment
effect (ATE) E[Y (1) − Y (0)] or the average treatment on the treated (ATT)
E[Y (1) − Y (0)|D = 1], to list a few.
The fundamental difficulty of evaluation arises from the fact that we can-
not simultaneously observe Y (1) and Y (0), but only observe either of these
per each individual. Identification of the treatment effect may fail if treated
individuals from whom we observe Y (1) are systemically different from non-
treated individuals from whom we observe Y (0). In order to circumvent this,
evaluation methods construct counterfactuals in a convincing way, dealing
with endogenous selection. Popular approaches include (1) randomized con-
trolled experiments (or RCTs) which exploits controlled/randomized assign-
ment rules, (2) natural experiments which takes advantage of some “natural”
randomization as difference-in-differences does and (3) instrumental vari-
ables and control function methods which rely on exclusion restrictions or
models for an assignment rule. Today we introduce two additional methods:
(4) discontinuity design methods which exploit discreteness in the treatment
assignment rule and (5) matching methods which attempt to reproduce the
treatment group among the non-treated using information of observed co-
variates.

95
96 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING

10.1 Regression Discontinuity Design


The regression discontinuity designs (RDD) are characterized by a triplet:
score, cutoff, and treatment. Suppose that units receive a score and a treat-
ment is assigned based on the score and a known cutoff. Specifically, the
treatment is given to units whose score is above the cutoff and it is withheld
from units whose score is below the cutoff. For example, we can think of a
situation where a scholarship is given to students whose grade in the SAT
exceeds 2100. The abrupt change in the probability of treatment assignment
allows us to learn something about the effect of treatment.
To be more precise, let us introduce some notation. An observed random
variable Z ∈ R denotes the score (it is so-called a running variable). A
known constant c denotes the cutoff. We normalize the cutoff c to 0 without
loss of generality. Then we can represent the treatment assignment by
Di = I{Zi ≥ 0}
and the observed outcome Yi by

Yi (0) if Zi < 0
Yi = .
Yi (1) if Zi ≥ 0
Given this notation, the conditional expected outcome is

E[Y (0) | Z = z] if z < 0
E[Y | Z = z] = (10.1)
E[Y (1) | Z = z] if z ⩾ 0
and the idea would be to exploit the discontinuity in E[Y | Z = z] at the
cutoff to identify some type of treatment effect.

10.1.1 Identification
The RDD allows us to estimate a specific type of treatment effect known as
the ATE at the cutoff, E[Y (1) − Y (0) | Z = 0]. Note that this is an average
effect for those individuals with scores exactly at the cutoff of the running
variable. However, Y (0) at the cutoff is not observed by design, so we need
some assumptions.
Note that a special situation occurs at the cutoff Z = 0, as illustrated by
Figure 10.1 where we plot the conditional mean function in (10.1). Consider
two groups of units: one with score equal to 0, and the other with score
barely below 0, say Z = −ε. If the value of E[Y (0)|Z = −ε] are not
abruptly different from E[Y (0)|Z = 0], then units with Z = −ε would
be a valid counterfactural to units with Z = 0. Putting it formally, if the
conditional mean function E[Y (0)|Z = z] is continuous at z = 0, then the
ATE at the cutoff, denoted by θsrd , can be identified as follows
θsrd = E[Y (1) − Y (0) | Z = 0] = E[Y | Z = 0] − lim E[Y | Z = z] .
z↑0
10.1. REGRESSION DISCONTINUITY DESIGN 97

Figure 10.1: Graphs of means of potential outcomes conditional on the


score Z

It is worth highlighting that the parameter θsrd is a “local” ATE in the


sense that it measures an ATE at a specific point of Z, cutoff. It is similar
to the LATE parameter discussed under IV, but not necessarily the same.
Essentially, RDD exploit the discontinuous dependence of D on Z such
that P {D = 1 | Z = z} is discontinuous at z = 0. In the so-called sharp
design, there is a perfect compliance in the sense that every unit with score
above 0 receives treatment and every unit with score below 0 is in the control
group. This creates a discontinuity in P {D = 1 | Z = z} as the cutoff,

0 if z < 0
P {D = 1 | Z = z} = .
1 if z ⩾ 0

In the so-called fuzzy design, there may be imperfect compliance.

10.1.2 Estimation via Local Linear Regression


In order to estimate θsrd , we can construct nonparametric estimators of
E[Y |Z = 0] and limz→0− E[Y |Z = z] using data to the right and left of 0.
Local linear (LL) estimators are predominantly used because the point of
interest E[Y (1) − Y (0) | Z = 0] is always on the boundary. As addressed in
the last class, LL regression estimators performs better at the boundary than
Nadaraya-Watson estimators in that the bias of LL regression estimators at
the boundary is of order h2 whereas that of Nadaraya-Watson estimators is
of order h given a bandwidth h.
98 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING

In addition, LL regression is simple to implement in RDD: we can obtain


the LL regression estimator first by computing kernel weights based on c = 0
and then by running a weighted least squares regression on observations
either above or below zero. Especially with an uniform kernel, LL estimators
are the same as two unweighted linear regressions on observations with Zi ∈
[−h, 0) and Zi ∈ [0, h]. Specifically, the LL regression estimator of E[Y |Z =
0] is given as
n  
n
+ +
o X Zi 2
β̂0 , β̂1 = argmin k I {Zi ⩾ 0} Yi − b+ +
0 − b1 Zi
(b+ + h
0 ,b1 ) i=1

and the LL regression estimator of limz→0− E[Y |Z = z] is given as


n  
n
− −
o X Zi 2
β̂0 , β̂1 = argmin k I {Zi < 0} Yi − b− −
0 − b1 Zi .
(b− − h
0 ,b1 ) i=1

Note that the regressor is (Zi − c) but we are assuming c = 0. Givne this,
we can estimate θsrd by

θ̂srd = β̂0+ − β̂0− .

Figure 10.2 depicts estimating θsrd using LL regression estimators. The


blue and red curves are conditional mean functions in (10.1) on (z, y) ∈ R2 .
The dots around the curves represent a sample {(Zi , Yi ) : i = 1, · · · , n}. LL
regression estimators use only the observations around the cutoff c within
the window [−h, h]. We obtain the black line on the left by running a
weighted least squares regression using observations below the cutoff and in
the window (blue dots). Its intercept β̂0− is our estimate of E[Y (0)|Z = 0].
Similarly, we we obtain β̂0+ , an estimate of E[Y (1)|Z = 0].

10.1.3 Bandwidth Choice


Running LL regression requires to pick a bandwidth h. Choosing the band-
width is not straightforward due to a trade-off: heuristically, the bias in-
creases and the variance decreases as the bandwidth h increases. Figure
10.3 shows that the bias, indicated as the gap between E[Y (0)|Z = z] and
the intercept β̂0− (or between E[Y (1)|Z = z] and the intercept β̂0+ ), increases
as h changes from h1 to h2 .
Taking this trade-off into account, Imbens and Kalyanaraman (2012)
propose an “optimal” plug-in bandwidth

ĥIK = ĈIK · n−1/5 .

Calonico et al. (2014) improve this result and suggest

ĥCCT = ĈCCT · n−1/5 .


10.1. REGRESSION DISCONTINUITY DESIGN 99

Figure 10.2: Estimation of θsrd via LL regression estimators

In addition to ĥCCT , they also propose bias correction methods and new
variance estimators that account for the additional noise introduced by es-
timating bias. While it is common to see papers based on undersmoothing,
i.e., use uh5 → 0 and ignore asymptotic bias, yet using ĥCCT is a better
approach.

10.1.4 Other RD Designs


So far we have focused on RDD with a single running variable and a single
cutoff where P {D = 1|Z = z} is either 0 or 1. There are other designs
generalizing these simple RDD. Their inference methods use similar tools
(LL regresion, etc) but are different.

Sharp RD (SRD) and Fuzzy RD (FRD) While sharp RDD are char-
acterized by perfect compliance, fuzzy RDD allow partial compliance which
arises if some units with running variable above c decide not to receive treat-
ment. For example, people may not cast a vote even if they are older than
18 and eligible for voting. Such partial compliance induces a discontinuity
in P {D = 1|Z = z} at c, but it does not necessarily change from 0 to 1.
100 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING

Figure 10.3: Biases of E[Y (d)|Z = 0] in Local Linear Regression for


Varying Bandwidth

Kink RD (KRD) and Kink Fuzzy RD (KFRD) KRD/KFRD designs


assume that P {D = 1|Z = z} is continuous but have a kink at the cutoff c
which in turn introduces kinks of E[Y |Z = z] at c. Conceptually, KRD and
KFRD are similar to SRD and FRD except that they exploit discontinuity
in the first derivatives of P {D = 1|Z = z} and E[Y |Z = z].

Multiple scores RD and Geographic RD These deigns involve at least


two running variables and discontinuities arise in R2 (or higher dimensional
space). As an example of multiple scores RD, we can think of a scholar-
ship which is awarded to a student whose math as well as English scores
are above certain numbers. Similarly, geographic RD uses location infor-
mation represented by latitude and longitude as their running variables and
boundary between two different regions as a cutoff.

Multiple cutoff RD This involves multiple treatments that are given at


multiple cutoffs.
10.1. REGRESSION DISCONTINUITY DESIGN 101

10.1.5 Extension to Fuzzy RD


Imperfect compliance often occurs in real applications. Some units with
score above c may decide not to take up treatment. Other units with score
below c sometimes manage to receive treatment. For instance, having a
score Z larger than c makes the application “strong” but may not guarantee
a scholarship.
Under imperfect compliance, the probability of receiving treatment changes
at c, but not necessarily from 0 to 1. This allows for identification of another
local treatment effect. The argument is similar to LATE, but a little subtler
due to limits. To formalize, define potential treatments as usual:

D(0) potential receipt of treatment when the treatment is assigned


D(1) potential receipt of treatment when the treatment is not assigned.

The treatment status is D = D(1)I{Z ≥ c} + D(0)I{Z < c}. If E[Y (d)|Z =


z] and E[D(d)|Z = z] for d = 0, 1 are continuous in z at c, then the canonical
parameter is identified,
E [Yi (1) | Zi = c] − E [Yi (0) | Zi = c]
θfrd =
E [Di (1) | Zi = c] − E [Di (0) | Zi = c]
limz↓c E [Yi | Zi = z] − limz↑c E [Yi | Zi = z]
= .
limz↓c E [Di | Zi = z] − limz↑c E [Di | Zi = z]
The parameter θfrd can be interpreted as the ATE for units with Zi = c and
only for compliers who are affected by the cutoff and satisfy Di (1) > Di (0).
We need to estimate four different conditional mean functions to estimate
θfrd . As in SRD, we can use local linear regression estimators. Let us define
the following estimators:
n  
n
+ +
o X Zi − c 2
β̂0 , β̂1 = argmin k I {Zi ⩾ c} Yi − b+ +
0 − b1 (Zi − c)
(b+ + h
0 ,b1 ) i=1
n  
n
− −
o X Zi − c 2
β̂0 , β̂1 = argmin k I {Zi < c} Yi − b− −
0 − b1 (Zi − c)
(b− − h
0 ,b1 ) i=1
n  
 + + X Zi − c 2
γ̂0 , γ̂1 = argmin k I {Zi ⩾ c} Di − g0+ − g1+ (Zi − c)
(g0+ ,g1+ ) i=1 h
n  
 − − X Zi − c 2
γ̂0 , γ̂1 = argmin k I {Zi < c} Di − g0− − g1− (Zi − c) .
(g0− ,g1− ) i=1 h

Then we estimate θfrd by

β̂0+ − β̂0−
θ̂frd = .
γ̂0+ − γ̂0−
102 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING

Alternatively, we can obtain the estimator θ̂frd using two stage least
squares. Define intention to treat by T = I{Z ≥ c}. Note T is a valid
instrument for D in that T is exogenous conditional on Z. It can be shown
that the LL approach with uniform kernels and same bandwidths is numer-
ically equivalent to a TSLS regression:

Yi = δ0 + θfrd Di + δ1 (Zi − c) + δ2 Ti (Zi − c) + Ui

with Ti as the excluded instrument for Di on the sample {i : c − hn ≤ Zi ≤


c + hn }.

10.1.6 Validity of RD
RD imposes relatively weak assumptions and identifies a very specific and
local parameter. The identification hinges on the continuity of E[Y (d)|Z =
z] at the cutoff, yet this assumption is fundamentally untestable and can
be violated in the following situation. Suppose that the running variable
is a test score. Individuals know the cutoff and have an option to re-take
the test, and may do so if their scores are just below the cutoff. This leads
to a discontinuity of the density fZ (z) of Z at the cutoff c, and possibly a
discontinuity of E[Y (d)|Z = z] as well because it is a functional of fZ (z),
Z
fY,Z (y, z)
E[Y (d)|Z = z] = yfY |Z (y|z)dy where fY |Z (y|z) = .
fZ (z)
This may invalidate the design. This problem is called “manipulation” of
the running variable.
As a way to detect such manipulation, ? proposes a test for continuity of
the density of fZ (z) at the cutoff. In principle, one does not need continuity
of the density of Z at c, but a discontinuity is suggestive of violations of
the no-manipulation assumption. ? also propose a new test based on order
statistics that does not require smoothness assumptions.
In addition to manipulation, the continuity assumption may fail to hold
due to discontinuity in the distribution of covariates. To see this, suppose
there are an observed factor X and an unobserved factor U that affect
potential outcomes, say

Y (d) = md (Z, X) + U

for some function md . Suppose that the distribution of X is discontinuous


at z = 0. Then the discontinuity in X at 0 may affect the outcome because
Z Z
E[Y (d)|Z = z] = E[E[Y (d)|Z = z, X]|Z = z] = yfY |Z,X (y|z, x)fX|Z (x|z)dxdy

f (y,z,x) f (x,z)
where fY |Z,X (y|z, x) = Y,Z,X
fZ,X (z,x) and fX|Z (x|z) = X,Z
fZ (z) . These effects
may be attributed erroneously to the treatment of interest.
10.2. MATCHING ESTIMATORS 103

A common practice to test discontinuity of covariates is to test the null


hypothesis that

H0 : lim E(X | Z = z) = lim E(X | Z = z).


z↑0 z↓0

The rejection of this null suggests that E(Y (d) | Z = z) may not be con-
tinuous either. However, E(X | Z = z) could be still continuous and H0
holds true even if the distribution of X is discontinuous at the cutoff. The
intuition on how discontinuity in X may confound the effect of the treat-
ment is about the entire distribution of X. ? propose a test for continuity
of FX|Z (x|z) at the cutoff. The test is easy to implement and based on
permutation tests and it involves novel asymptotic arguments.

10.1.7 RD Packages
The statistical packages to compute LL RD estimators and run RDD validity
tests are available online. Below we introduce four packages. 1

rdrobust package It provides estimation, inference and graphical presen-


tation using local polynomials, partitioning and spacings estimators. rdrobust
implements local polynomial RD point estimators with classic and robust
bias-corrected confidence intervals. rdbwselect, which is called by rdrobust,
provides different data-driven bandwidth selectors based on ?, cross-validation,
and Calonico et al. (2014). rdplot plots data with “optimal” block length.

rddensity package It implements automatic manipulation tests based


on density discontinuity at the cutoff using polynomial density estimator.
rddensity runs manipulation testing using local polynomial density esti-
mation and rdbwdensity selects a bandwidth or window.

rdperm packages It implements the approximate permutation test for


RDD, developed in ?.

rdcont packages It implements the approximate sign-test for RDD, de-


veloped in ?.

10.2 Matching Estimators


10.2.1 Identification through Unconfoundedness
We change gear and study another way of estimating average treatment
effect using matching estimators.
1
You can download rdrobust and rddensity at https://rdpackages.github.io, and
rdperm and rdcont at http://sites.northwestern.edu/iac879/software.
104 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING

Suppose we observe (Y, D, X) and consider the following unconfounded-


ness assumption

(Y (0), Y (1)) ⊥⊥ D | X (10.2)

which is often alternatively called selection on observables, conditional in-


dependence and so on. Unconfoundedness assumes that for subgroups of
agents with the same X there are no unobservable differences between the
treatment and control groups. This provides a way to identify the “condi-
tional” average treatment effect (CATE) because

E[Y (1) − Y (0) | X = x] = E[Y (1) | D = 1, X = x] − E[Y (0) | D = 0, X = x]


= E[Y | D = 1, X = x] − E[Y | D = 0, X = x]

where the first line follows by unconfoundedness. Note that by integrating


over X we can identify the ATE as well.
The idea behind the matching estimators is to find (or “match”) units in
the treatment group (D = 1) and control group (D = 0) with the same value
of X, i.e., X = x. To be able to match, we need the overlap assumption

0 < P {D = 1 | X = x} < 1 for all x

meaning that, within each subgroup of agents with the same X, there should
be both treated and control units. Complication arises when X is continu-
ously distributed.
Identification through the unconfoundedness assumption is inherently
different from RDD. In sharp RDD, the unconfoundedness assumption holds
trivially because if we define D = I{Z ≥ c} then

(Y (0), Y (1)) ⊥⊥ D | Z.

Moreover, the overlap assumption never holds in sharp RDD because the
probabilities to receive treatments given the running variable is either 1 or
0, i.e.,

P {D = 1 | Z < c} = 0 and P {D = 0 | Z ⩾ c} = 0.

10.2.2 Matching Metrics


If X ∈ Rk has continuous components, the event {X = x} has measure
zero, and so previous matching strategy is not feasible. To get around this,
we match X’s that are close according to some matching metric.
A common matching metric Mahalanobis distance is given by

Mij = (Xi − Xj )′ Σ−1 (Xi − Xj )


10.2. MATCHING ESTIMATORS 105

where Σ = Var[X]. Given this, j is the qth closest to Xi if


n
X
I {Mis ⩽ Mij } = q.
s=1

For alternative choices, there are Euclidean distance

Mij = |Xi − Xj |

and the diagonal version of the Mahalanobis distance

Mij = (Xi − Xj )′ diag Σ−1 (Xi − Xj ) .


 

10.2.3 Matching Estimator


In order to define the matching estimator, fix q. Let jq (i) be the index
j ∈ {1, · · · , n} that solves the following two conditions:

Opposing treatment: P Dj = 1 − Di
Opposing qth closest to i: s:Ds =1−Di I {Mis ⩽ Mij } = q.

That is, jq (i) is the index of the unit that is the qth closest to unit i in terms
of the covariate values, among the units with the treatment opposite to that
of unit i. Let Jq (i) denote the set of indices for the first q matches for unit
i:
Jq (i) = {j1 (i), . . . , jq (i)} .
Then the matching estimator of θate = E[Y (1) − Y (0)] is given by
n 
1 X  YiP if Di = d
θ̂ate = Ŷi (1) − Ŷi (0) where Ŷ (d) = 1 .
n q j∈Zq (i) Yj ̸ d
if Di =
i=1

This is a type of nearest neighbor (NN) estimator and thus as q in-


creases the variance goes down while the bias increases. ? study asymptotic
properties of θ̂ate under a fixed number of matches as n → ∞. Here we
summarize some of noteworthy properties of θ̂ate . First, θ̂ate is consistent
as n → ∞ for fixed q. Second, the bias of order O(n−1/kc ) where kc is the
dimension of the continuous covariates and the variance is of order O(1/n).

Since nBias converges to 0, some constant, or ∞ for kc = 1, kc = 2, and

kc > 2 respectively, the estimator is not n-asymptotically normal if kc > 2.
Third, θ̂ate is generally not efficient, and even if the bias is low enough, the
estimators are not efficient given a fixed number of matches. Lastly, regard-
ing the resampling methods which we will cover later in this course, ? show
that the bootstrap is generally invalid for the matching estimators due to
non-smoothness in the matching process. However, subsampling is valid for
kc ≤ 2.
106 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING

10.2.4 Propensity Score Matching and Weighting


Aforementioned properties of the matching estimators suggest that we can
make inference with the matching estimator only in limited cases where
the number of continuous covariates does not exceed 2. Propensity score
matching provides an alternative way to match.
Let p(X) = P {D = 1 | X = x} denote the propensity score. ? make an
observation of paramount importance that unconfoundedness implies that

(Y (0), Y (1)) ⊥⊥ D | p(X).

This means that we no longer need to condition on the entire X but only
on one-dimensional propensity score p(X) in order to achieve independence
between the potential outcome (Y (0), Y (1)) and the treatment status D. To
see how it holds, note that

P {D = 1 | Y (0), Y (1), p(X)} = E[E[D | Y (0), Y (1), P (X), X] | Y (0), Y (1), p(X)]
= E[E[D | Y (0), Y (1), X] | Y (0), Y (1), p(X)]
= E[E[D | X] | Y (0), Y (1), p(X)]
= E[p(X) | Y (0), Y (1), p(X)]
= p(X),

which is the same as P {D = 1 | p(X)}. Interestingly, all the biases due to


observable covariates can be removed by conditioning solely on the propen-
sity score.
The Rosenbaum-Rubin result implies that

θate = E[E[Y | D = 1, p(X)] − E[Y | D = 0, p(X)]]

and thus we can use the matching estimator matching on the propensity
score only. This can be formulated by nothing that
   
DY 1
E =E E[DY (1) | p(X)]
p(X) p(X)
 
1
=E E[D | p(X)]E[Y (1) | p(X)] = E[Y (1)]
p(X)

and similarly
 
(1 − D)Y
E = E[Y (0)]
1 − p(X)

which in turn allows us to write


     
[Di − p(Xi )]Yi DY (1 − D)Y
θate = E =E −E .
p(Xi )(1 − p(Xi ) p(X) 1 − p(X)
BIBLIOGRAPHY 107

We define the estimator for θate by the sample analog of θate :


n  
1X [Di − p (Xi )] Yi
θ̂n = .
n p (Xi ) (1 − p (Xi ))
i=1

θ̂n does not explicitly match observations but puts weights induced by the
propensity score to outcome Yi , while it is still based on the unconfounded-
ness assumption.
The propensity score is a scalar. ? imply that the bias term is of lower

order than the variance term and matching leads to a n-consistent, asymp-
totically normal estimator. Given the data, we cannot compute θ̂n because
it depends on the unknown propensity score function p(·). The estimator
based on the true propensity score has the same asymptotic variance in ?.
With estimated propensity scores, the asymptotic variance of matching es-
timators is more involved due to the “generated regressor”. The topic is
beyond our scope. Those who are interested can consult ?.

Bibliography
Calonico, S., M. D. Cattaneo, and R. Titiunik (2014): “Robust Non-
parametric Confidence Intervals for Regression-Discontinuity Designs,”
Econometrica, 82, 2295–2326.

Imbens, G. and K. Kalyanaraman (2012): “Optimal bandwidth choice


for the regression discontinuity estimator,” The Review of Economic Stud-
ies, 933–959.
108 LECTURE 10. REGRESSION DISCONTINUITY AND MATCHING
Lecture 11

Random Forests

11.1 Coming soon


The lecture notes for this lecture will be updated before class takes place.

109
110 LECTURE 11. RANDOM FORESTS
Lecture 12

LASSO

12.1 High Dimensionality and Sparsity


Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk . Let β = (β1 , . . . , βk )′ ∈ Rk be such that

Y = X ′β + U .

We observe a random sample {(Yi , Xi ) : 1 ≤ i ≤ n} from the distribution of


(Y, X) and without loss of generality, we further assume that
n n
1X 2 1X
Ȳn ≡ Yi = 0 and σ̂n,j ≡ (Xi,j − X̄j )2 = 1 , (12.1)
n n
i=1 i=1

where Xi,j denotes the j th component of Xi . In other words, we assume the


model does not have a constant term and that all variables are on the same
scale (something that will be important later). Our goal today is to study
estimation of β when k is large relative to n. That could mean that k < n,
but not by much, or simply that k > n. For simplicity, we assume X and U
are independent.
When k > n, the ordinary least squares estimator is not well-behaved
since the X′ X matrix does not have full rank and is not invertible. In partic-
ular, the estimator is not unique and will overfit the data. If all explanatory
variables are important in determining the outcome, it is not possible to
tease out their individual effects. However, if the model is sparse – that
is, only a few components of X have an influence on Y – then it might be
possible to discriminate between the relevant and irrelevant components of
X. The following definition formalizes this notion.

Definition 12.1 (Sparsity) Let S = {j : βj ̸= 0} be the identity of the


relevant regressors. A model is said to be sparse if s = |S| is fixed as n → ∞.

111
112 LECTURE 12. LASSO

If we knew the identity of the relevant regressors S then we could simply


do least squares in the usual manner. Since this would represent a sort of
ideal situation, we will call such a strategy the “oracle”.

Definition 12.2 (Oracle Estimator) The oracle estimator β̂no is the in-
feasible estimator that is estimated by least squares using only the variables
in S.
In practice, we do not know the set S and so our goal is to estimate β,
and possibly S, exploiting the fact that the model is known to be sparse.
In particular, we would like our estimator β̂n to satisfy three properties:
estimation consistency, model selection consistency, and oracle efficiency.

Definition 12.3 (Estimation Consistency) An estimator β̂n is estima-


tion consistent if
P
β̂n → β .

Definition 12.4 (Model-Selection Consistency) Let

Ŝn = {j : β̂n,j ̸= 0}

be the set of relevant covariates selected by an estimator β̂n . Then, β̂n is


model-selection consistent if

P {Ŝn = S} → 1 as n → ∞ .

Definition 12.5 (Oracle Efficiency) An estimator β̂n is oracle efficient


if it achieves the same asymptotic variance as the oracle estimator β̂no .
Achieving Oracle efficiency requires stronger conditions than achieving
model selection consistency, which in turn requires stronger assumptions
than estimation consistency. The last statement is rather straightforward if
we are able to consistently select variables, we would then be able to run
least squares on the selected variables. On the other hand, it is possible for
∥β̂n − β∥22 to be small even when β̂n is non-zero at every component, so that
selection never occurs.

12.2 LASSO
LASSO is short for Least Absolute Shrinkage and Selection Operator and is
one of the well known estimators for sparse models. The LASSO estimator
β̂n is defined as the solution to the following minimization problem
n
X k
X 
β̂n = arg min (Yi − Xi′ b)2 + λn |bj | , (12.2)
b
i=1 j=1
12.2. LASSO 113

where λn is a scalar tuning parameter. For a fixed λn > 0, LASSO corre-


sponds to OLS with an additional term that imposes a penalty for non-zero
coefficients. The penalty term shrinks the estimated coefficients towards
zero and this gives us model selection, albeit at the cost of introducing bias
in the estimated coefficients. The LASSO estimator can be alternatively
described as the solution to
n
X k
X
min (Yi − Xi′ b)2 subject to |bj | ≤ t , (12.3)
b
i=1 j=1

where now t is a scalar tuning parameter.


LASSO has the feature of delivering estimated coefficients that can be
exactly 0 for a given sample size n. The form of the penalty function is
important for selection, which does not occur under OLS or other penalty
functions (e.g., ridge regression). For intuition, consider penalty functions
of the form
Xk
|bj |γ .
j=1

If γ > 1, the objective function is continuously differentiable at all points.


The first order condition with respect to βn,j would then be
n
X
2 (Yi − Xi′ β)Xi,j = λn γ|βj |γ−1 sign(βj ) .
i=1

Suppose βj = 0. Then, β̂n,j = 0 if and only if


n
X n
X

0= (Yi − Xi β̂n )Xi,j = (Ui − Xi′ (β̂n − β))Xi,j . (12.4)
i=1 i=1

Whenever U is continuously distributed, the above equation holds with prob-


ability 0 and model selection does not occur.
On the other hand, if γ ≤ 1, the penalty function is not differentiable at
0. In this case, Karush-Kuhn-Tucker conditions are expressed in terms of
the subgradient.

Definition 12.6 (Sub-gradient & Sub-differential) We say g(·) ∈ R


is a sub-gradient of f (x) : R → R at point x if f (z) ≥ f (x) + g(z − x) for
all z ∈ R. The set of sub-gradients of f (·) at x, denoted by ∂f (x), is the
sub-differential of f (·) at x.
In the case of LASSO, we need the sub-differential of the absolute value
f (x) = |x|. For x < 0 the sub-gradient is uniquely given by ∂f (x) = {−1}.
For x > 0 the sub-gradient is uniquely given by ∂f (x) = {1}. At x = 0
the sub-differential is defined by the inequality |z| ≥ gz for all z, which is
114 LECTURE 12. LASSO

y
y = |x|

1
2x

− 12 x

Figure 12.1: Two sub-gradients of f (x) = |x| at x = 0

satisfied if and only if g ∈ [−1, 1]. We therefore have ∂f (x) = [−1, 1]. This
is illustrated in Figure 12.1
For non-differentiable functions, the Karush-Kuhn-Tucker theorem states
that a point minimizes the objective function of interest if and only if 0 is
in the sub-differential. Applying this to the problem in (12.2) implies that
the first order conditions are given by
n
X
2 (Yi − Xi′ β̂n )Xi,j = λn sign(β̂n,j ) if β̂n,j ̸= 0 (12.5)
i=1

and
n
X
−λn ≤ 2 (Yi − Xi′ β̂n )Xi,j ≤ λn if β̂n,j = 0 . (12.6)
i=1
Compared to our previous result in (12.4), this inequality is attained with
positive probability even when U is continuously distributed. Model selec-
tion is therefore possible when the penalty function has a cusp at 0. The
difference between using a penalty with γ = 1 (LASSO) and γ = 2 (Ridge)
in the constraint problem in (12.3) is illustrated in Figure 12.2 for the simple
case where k = 2.

Figure 12.2: Constrained problem in (12.3) when k = 2: γ = 1 (left


panel) and γ = 2 (right panel).
12.2. LASSO 115

12.2.1 Theoretical Properties of the LASSO


For ease of exposition, we only discuss the case where k as fixed as n → ∞.
Assume without loss of generality that S consists of the first s variables.
We partition X into X = (X1′ , X2′ )′ where X1 are the first s explanatory
variables, and X2 are the k − s remaining variables. Partition the variance-
covariance matrix of X accordingly,
E[X1 X1′ ] E[X1 X2′ ]
 

Σ = E[XX ] = .
E[X2 X1′ ] E[X2 X2′ ]

Assumption 12.1 (Irrepresentable Condition) ∃η > 0 s.t.


∥E[X2 X1′ ]E[X1 X1′ ]−1 · sign(β1 , . . . , βs )∥∞ ≤ 1 − η .
To understand the condition, note that when the sign of β is unknown,
we basically require the condition to hold for all possible signs. That is,
∥E[X1 X1′ ]−1 E[X1 X2′ ]∥∞ ≤ 1 − η .
This means that the regression coefficients of the irrelevant variables on the
relevant variables must all be less than 1. In that sense, the former are
irrepresentable by the latter. Under this condition, the following holds.

Theorem 12.1 (Zhao and Yu (2006)) Suppose k and s are fixed and
that {Xi : 1 ≤ i ≤ n} and {Ui : 1 ≤ i ≤ n} are i.i.d. and mutually in-
dependent. Let X have finite second moments, and U have mean 0 and
variance σ 2 . Suppose also that the irrepresentable condition holds and that
λn λn
→ 0 and 1+c → ∞ for 0 ≤ c < 1 . (12.7)
n n 2
Then LASSO is model-selection consistent.
The irrepresentable condition is a restrictive condition. When this con-

dition fails and λn / n → λ∗ > 0, it can be shown that LASSO selects too
many variables (i.e., it selects a model of bounded size that contains all
variables in S). Intuitively, if the relevant variables and irrelevant variables
are highly correlated, we will not be able discriminate between them.
Knight and Fu (2000) showed that the LASSO estimator is asymptoti-

cally normal when λn / n → λ∗ ≥ 0, but that the nonzero parameters are
estimated with some asymptotic bias if λ∗ > 0. If λ∗ = 0, LASSO has the
same limiting distribution as the LS estimator and so even with λ∗ = 0,
LASSO is not oracle efficient. In addition, the requirement for asymptotic
1+c
normality is at conflict with λn /n 2 → ∞ and so it follows that LASSO
cannot be both model selection consistent and asymptotically normal (hence
oracle efficient) at the same time. Oracle efficient penalization methods work
by penalizing small coefficients a lot and large coefficients very little or not
at all. This could be done by using weights (as in the Adaptive LASSO
below) or by changing the penalty function (which we discuss later).
116 LECTURE 12. LASSO

12.3 Adaptive LASSO


Definition 12.7 (Adaptive LASSO) The adaptive LASSO is the esti-
mator β̃n that arises from the following two steps.
1. Estimate β using ordinary LASSO,
n
X k
X 
′ 2
β̂n = arg min (Yi − Xi b) + λ1,n |bj | ,
b
i=1 j=1

where λ1,n / n → λ∗ > 0.

2. Let Ŝ1 = {j : β̂n =


̸ 0} be the set of selected covariates from the first
step. Estimate β by
X n X X 
2 −1
β̃n = arg min (Yi − Xi,j bj ) + λ2,n |β̂n,j | |bj | ,
b
i=1 j∈Ŝ1 j∈Ŝ1

where λ2,n / n → 0 and λ2,n → ∞.
Adaptive LASSO imposes a penalty in the second step that is inversely
proportional to the magnitude of the estimated coefficient in the first step.
This adaptive weights allows us to eliminate small, irrelevant covariates
while retaining the relevant ones without introducing asymptotic bias.

Theorem 12.2 (Zou, 2006) Suppose {Xi : 1 ≤ i ≤ n} and {Ui : 1 ≤ i ≤


n} are i.i.d. and mutually independent. Let X have finite second moments,
and U have mean 0 and variance σ 2 . The adaptive LASSO is model selection
consistent and oracle efficient, i.e.,
√ d
n(β̃n − β) → N(0, σ 2 E(X1 X1′ )−1 ) .

To see that adaptive LASSO is oracle efficient, note that the asymptotic
variance of the estimator is the same we would have achieved had we known
the set S and performed OLS on it. The rates at which λ1,n and λ2,n grow
are important for this result.
To see why the adaptive LASSO is model selection consistent and oracle
efficient, consider the following. Recall that β1 , . . . , βs ̸= 0 and βs+1 , ..., βk =
0. Suppose that β̂n has r non-zero components asymptotically. Without the
irrepresentable condition, the LASSO includes too many variables, so that
s ≤ r ≤ k. Without loss of generality, suppose β̂n is non-zero in its first
r components. Let b be any r × 1 vector, and let β̃n denote the adaptive

LASSO estimator. Define u = n(b − β). Some algebra shows,
n  r 2 r
√ X 1 X X 1
n(β̃n −β) = arg min Ui − √ Xi,j uj +λ2,n |β̂n,j |−1 (|βj + √ uj |−|βj |) .
u n n
i=1 j=1 j=1
12.4. PENALTIES FOR MODEL SELECTION CONSISTENCY 117


By Knight and Fu (2000), β̂n converges at rate n, so |β̂n − β| =
OP (n−1/2 ). We then split the analysis according to whether βj is zero or
not.
Case βj = 0: here |β̂n,j | = OP (n−1/2 ). Then,
1
λ2,n |β̂n,j |−1 (|βj + √ uj | − |βj |) ≈ λ2,n |uj |
n

where we have “canceled” the 1/ n term using |β̂n,j |. Now suppose uj ̸= 0
and note that
λ2,n |uj | → ∞ since λ2,n → ∞ .
The penalty effectively tends to infinity, so that bj ̸= 0 (uj ̸= 0) cannot be
the minimizer. It must be that uj = 0, i.e., bj = βj = 0.
Case βj ̸= 0: here |β̂n,j | = OP (1). It follows that,
1 1
λ2,n |β̂n,j |−1 (|βj + √ uj | − |βj |) ≈ λ2,n √ |uj | .
n n
P λ
It follows that λ2,n √1n |uj | → 0 since √2,n
n
→ 0 and uj = OP (1). That
is, asymptotically there is no penalty on non-zero terms, and the adaptive
LASSO becomes asymptotically equivalent to OLS estimation on S. This
gives rise to model selection consistency and oracle efficiency.

12.4 Penalties for Model Selection Consistency


Another way to achieve a model-selection consistent estimator is to use a
penalty function that is strictly concave (as a function of |bj |) and has a cusp
at the origin. As previously mentioned, LASSO is essentially OLS with an
L1 penalty term. As such, it belongs to the larger class of Penalized Least
Squares estimators:
X n Xk 
P LS ′ 2
β̂n (λ) = arg min (Yi − Xi b) + pλ (|bj |) .
b
i=1 j=1

Clearly, ordinary LASSO corresponds to the case where pλ (|ν|) = λ|ν|, but
such a penalty is not strictly concave and so model selection consistency
generally does not occur. Some alternative penalty functions include that
have the desire property are
1. Bridge: pλ (|ν|) = λ|ν|γ for 0 < γ < 1
2. Smoothly Clipped Absolute Deviation (SCAD): for a > 2,
    
′ λ (aλ/n − |ν|)+ λ
pλ (|ν|) = λ I |ν| ≤ + I |ν| > .
n (a − 1)λ/n n
Note that this function is defined by its derivative.
118 LECTURE 12. LASSO

Figure 12.3: Bridge penalty (solid line), SCAD penalty (dashed line) and
minimax concave penalty (dotted line)

3. Minimax Concave: for a > 0,


Z |ν|  
nx
pλ (|ν|) = λ 1− dx
0 aλ +

where (x)+ = max{0, x}. These penalty functions are plotted in Figure
12.3. Note that they are all steeply sloped near ν = 0. Bridge penalty, like
the LASSO, continues to increase far away from ν = 0, whereas SCAD and
minimax concave penalties flatten out. For this reason, the latter penalties
exhibit lower bias.

12.5 Choosing lambda


The need for model selection consistency imposes constraints on the growth
rate of λn , but does not pin down their specific values. In practice, λn for
the ordinary LASSO is often chosen by Q-fold cross validation.
Let Q be some integer, and suppose for ease of exposition that n = Qnq .
We partition the sample into the sets I1 , . . . , IQ each with nq members. For
each 1 ≤ q ≤ Q, perform LASSO on all but the observations in Iq to obtain
β̂n,−q (λ). Then, calculate the squared prediction error of β̂n,−q (λ) on the
set Iq : X
Γq (λ) = (Yi − Xi′ β̂n,−q (λ))2 .
i∈Iq

Doing
PQ so for each q, we are able to find total error for each λ: Γ(λ) =
q=1 Γq (λ). Then we define the cross validated λ as:

λ̂CV
n = arg min Γ(λ) .
λ

For the adaptive LASSO, we need to choose both λ1,n and λ2,n . A
computationally efficient way of doing so is to choose λ1,n via the above
12.6. CONCLUDING REMARKS 119

cross-validation procedure, and then having fixed this λ1,n , choose λ2,n by
a second round of cross-validation.
Arguably, there exist few results about the properties of the LASSO when
λn is chosen via cross-validation. In a recent working paper, Chetverikov
et al. (2016) show that in a model with random design, in which k is allowed
to depend on n, and assuming Ui |Xi is Gaussian, it follows that

∥β̂n − β∥2,n ≤ Q · ((|S| log k)/n)1/2 log7/8 (kn)

holds with high probability, where ∥b−β∥2,n = ( n1 ni=1 (Xi′ b)2 )1/2 is the pre-
P

diction norm. It turns out that ((|S| log k)/n)1/2 is the fastest convergence
rate possible so that cross-validated LASSO is nearly optimal. However, it
is not known if the log7/8 (kn) term can be dropped.
Finally, we mention one alternative approach to choosing λn . This is
done by minimizing the Bayesian Information Criterion. Define:
n
1X
σ̂ 2 (λ) = (Yi − Xi′ β̂n (λ))2 ,
n
i=1

and
log(n)
BIC(λ) = log(σ̂ 2 (λ)) + |Ŝn (λ)|Cn
n
where Cn is an arbitrary sequence that tends to ∞. Wang et al. (2009) show
that under some technical conditions, choosing λn to minimize BIC(λ) leads
to model selection consistency when U is normally distributed.

12.6 Concluding Remarks


Today we focused on the framework that keeps k fixed even as n → ∞.
There exist many extensions to the stated theorems that are valid in cases
where kn = O(na ) or even kn = O(en ). Sources such as Fan et al. (2011)
and Horowitz (2015).
Finally, we note that many packages are available for LASSO estimation.
A few starting points are lassopack in Stata, and glmnet or parcor in R.

Bibliography
Chetverikov, D., Z. Liao, and V. Chernozhukov (2016): “On Cross-
Validated LASSO,” arXiv preprint arXiv:1605.02214.

Fan, J., J. Lv, and L. Qi (2011): “Sparse High-Dimensional Models in


Economics,” Annual Review of Economics, 291–317.

Horowitz, J. L. (2015): “Variable selection and estimation in high-


dimensional models,” Canadian Journal of Economics, 48, 389–407.
120 LECTURE 12. LASSO

Knight, K. and W. Fu (2000): “Asymptotics for lasso-type estimators,”


The Annals of statistics, 28, 1356–1378.

Wang, H., B. Li, and C. Leng (2009): “Shrinkage tuning parameter


selection with a diverging number of parameters,” Journal of the Royal
Statistical Society. Series B: Statistical Methodology, 71, 671–683.

Zhao, P. and B. Yu (2006): “On Model Selection Consistency of Lasso,”


The Journal of Machine Learning Research, 7, 2541–2563.

Zou, H. (2006): “The adaptive lasso and its oracle properties,” Journal of
the American Statistical Association, 101, 1418–1429.
Lecture 13

Binary Choice

Let (Y, X) be a random vector where Y takes values in {0, 1} and X takes
values in Rk+1 . Let us consider the problem of estimating

P {Y = 1 | X} . (13.1)

This problem has two interpretations that can deliver different approaches.
The first interpretation consists in predicting the outcome variable Y for
a given value of the covariate X. This problem can be solved estimating
the probability (13.1) - which is also called propensity score - by different
methods, for instance local linear regression or classification trees.
The second interpretation of the problem consist in viewing (13.1) as
a model with structure, where we are interested in the partial effects or
the causal effects of X. This is traditionally the approach that is often
used in the Industrial Organization, where (13.1) models the behavior of
the decision makers and the estimated model is used to do counterfactual
analysis. Sometimes, this second interpretation is called a structural form
for (13.1), while the first one is a reduced form.
In this lecture, we consider the second interpretation. We restrict our
attention to parametric and semiparametric models using the linear index
model. This model assumes the existence of β ∈ Rk+1 such that

P {Y = 1 | X} = P Y = 1 | X ′ β .

(13.2)

This condition reduces the dimension of the problem. To see this, note that
the left hand side in (13.2) is a function of X ∈ Rk+1 . While the right hand
side in (13.2) is a function of X ′ β ∈ R, which is known as linear index.

13.1 Linear Index Model


Let (Y, X, U ) be a random vector where Y takes values in {0, 1}, X takes val-
ues in Rk+1 with X0 = 1 and U take values in R. Let β = (β0 , β1 , . . . , βk )′ ∈

121
122 LECTURE 13. BINARY CHOICE

Rk+1 be such that


Y = I{X ′ β − U ≥ 0} .
This model is known as Threshold crossing model or Single index model or
Linear index model.
In this setup, the binary outcome Y often indicates the observable choice
between two alternatives of a decision maker. This choice is modeled by
utility maximization. For instance, let us consider two alternatives A and
B that gives the following utility levels

X ′ βA + UA and X ′ βB + UB

to the decision maker, respectively. If the decision maker maximizes utility,


she will choose B over A if

X ′ βB + UB ≥ X ′ βA + UA ,

which is equivalent to

X ′ (βB − βA ) + UB − UA ≥ 0 .

This implies that X ′ β − U can be interpreted as the difference in the utility


level between two choices. Often times, one of the options is normalize to
zero, which leads to the linear index model described above.

13.1.1 Identification
Let us denote by P the distribution of the observed data. And denote by
P = {Pθ : θ ∈ Θ} a (statistical) model for P . These probability distri-
butions are indexed by the parameter θ, where this parameter could have
infinite dimensional components, e.g. a nonparametric distribution of the
unobservable component.
Using this notation, the model P is correctly specified if the distribution
of the observable data belong to the model, i.e. P ∈ P. In this case, there
is a parameter θ such that P = Pθ . Now our interest interest might be in θ
or a function of λ(θ).

Definition 13.1 Let Θ0 (P ) be the collection of θ such that P = Pθ , i.e.

Θ0 (P ) = {θ ∈ Θ : Pθ = P } .

We say that θ is identified if Θ0 (P ) is a singleton for all P ∈ P.

Remark 13.1 λ(θ) may be identified even if θ is not. For instance, in


the linear model, we can identified the coefficient using the assumptions
described in Lecture 1, but we cannot identified the distribution of the errors.
In this example, θ is defined by the coefficients of the linear model, the
distribution of the covariates, and the conditional distribution of the errors.
13.1. LINEAR INDEX MODEL 123

13.1.2 Identification of the parametric binary model


In the binary model described by the linear index model, the parameter is
θ = (β, PX , PU |X ). Denote by Θ the set of all the possible values of θ. And
let us consider the following parametric assumption.

Assumption 13.1 Consider the parametric assumptions:


P1. PU |X = N (0, σ 2 ).

P2. There exists no A ⊆ Rk+1 such that A has probability one under PX
and A is a proper linear subspace of Rk+1 .
The parametric assumption P1 allow us to replace PU |X with σ, since
that parameter characterizes the parametric distribution. Now we can write
θ = (β, PX , σ). Using this notation, let us study the identification of θ. We
prove the result by contradiction: assume there are two values θ = (β, PX , σ)
and θ∗ = (β ∗ , PX∗ , σ ∗ ) such that θ ̸= θ∗ and Pθ = Pθ∗ and the reach a
contradiction.
First, notice that the marginal distribution of X is identified from the
joint distribution of (Y, X). This implies that PX = PX∗ . Second, we can
use assumption P1 to compute the probability of Y = 1 given X using both
models. That is
 ′   ′ ∗
Xβ Xβ
Pθ {Y = 1|X} = Φ and Pθ∗ {Y = 1|X} = Φ ,
σ σ∗

which by assumption on Pθ and Pθ∗ deliver the same probability. Since Φ(·)
is an increasing function, we obtain
β β∗
 
X′ − ∗ =0.
σ σ
By assumption P2, we conclude
β β∗
= ∗ . (13.3)
σ σ
Otherwise, we can define a proper linear subspace A = {x ∈ Rk+1 |x′ (β/σ −
β ∗ /σ ∗ ) = 0} such that A has probability one under PX .
Note that we cannot conclude that β = β ∗ or σ = σ ∗ . Indeed, our
analysis shows that any θ and θ∗ such that (13.3) holds and PX = PX∗
satisfies Pθ = Pθ∗ . This implies we cannot identify θ = (β, PX , σ) but we
can identify λ(θ) = (PX , β/σ).
A few remarks are in order. First, researchers typically assume further
that |β| = 1 or β0 = 1 or σ = 1, which is a normalization argument to
conclude that now we can identify the parameter θ. Second, the model with
σ = 1 is called the Probit model. Let us verify the identification of θ.
124 LECTURE 13. BINARY CHOICE

Under this additional assumption, we have σ ∗ = 1. By equation (13.3), we


conclude β = β ∗ . This implies that θ = (β, PX , 1) and θ∗ = (β ∗ , PX , 1)
are the same. Finally, other parametric assumptions on the conditional
distribution PU |X , which is used to compute the probability Y = 1 given
X, deliver other models. In particular, if this parametric distribution is the
logistic distribution, we obtain the Logit model.
A natural question that motivate the next section is the identification of
θ without parametric assumptions on PU |X .

13.1.3 Identification via median independence


Let us recall that in the linear model that we studied in Lecture 1, the
identification assumption that we need from PU |X was E[U |X] = 0. If we
assume E[U |X] = 0 instead of Assumption P1, it can be prove that nothing
is learned about (β, PU |X ). This result was shown by Manski (1988), which
point it out that this conditional mean assumption is not even useful to
identify λ(θ) = β.
In general, the mean independence assumptions are rather useless in non-
linear models. This is not the case of the median independence assumption,
Med(U |X) = 0, which can be used to identify λ(θ) = β. Let us describe in
more detail the assumptions necessary for the identification result.

Assumption 13.2 Consider the following semi-parametric assumptions that


include the existence of a special covariate:

S1. Med(U |X) = 0 with probability 1 under PX .

S2. There exists no A ⊆ Rk+1 such that A has probability one under PX
and A is a proper linear subspace of Rk+1 .

S3. |β| = 1.

S4. PX is such that at least one component of X has support equal to


R conditional on the other components with probability 1 under PX .
Moreover, the corresponding component of β is non-zero.

Note that S1 is a weaker assumption than P1, S2 is the same as assump-


tion P2 and S3 is a normalization assumption similar to σ = 1 discussed in
the Probit case. What is new is assumption S4, which is a stronger assump-
tion on PX and also on β. This assumption is also known as the special
regressor assumption and will be fundamental for the identification of β.
Before to present the main result of this section, let us present the fol-
lowing lemma. This provides an additional insight that is useful for the
proof of the identification result.
13.1. LINEAR INDEX MODEL 125

Lemma 13.1 Let θ = (β, PX , PU |X ) be a parameter that satisfies S1. Con-


sider any β ∗ . If

Pθ X ′ β ∗ < 0 ≤ X ′ β ∪ X ′ β < 0 ≤ X ′ β ∗ > 0 ,



(13.4)

then there exists no θ∗ = (β ∗ , PX∗ , PU∗ |X ) satisfying S1 and also having Pθ =


Pθ ∗ .
Proof. Suppose by contradiction that (13.4) holds yet there exists such
θ∗ . Because Pθ = Pθ∗ , we conclude that PX = PX ∗ . Let us recall that
Y = I{X ′ β − U ≥ 0}. Note that
1 1
⇐⇒ Pθ X ′ β ≥ U |X ≥ ,

Pθ {Y = 1|X} ≥
2 2
by Assumption S1, the median independence assumption, this happens

⇐⇒ X ′ β ≥ 0 .

In a similar way, we have


1 1
⇐⇒ Pθ∗ X ′ β ∗ ≥ U |X ≥ ⇐⇒ X ′ β ∗ ≥ 0 ,

Pθ∗ {Y = 1|X} ≥
2 2
where the last equivalence follows by Assumption S1 as before. Now, our
condition (13.4) implies that with positive probabilty, either

X ′β∗ < 0 ≤ X ′β or X ′β < 0 ≤ X ′β∗ ,

which implies that either


1
Pθ∗ {Y = 1|X} < ≤ Pθ {Y = 1|X}
2
or
1
Pθ {Y = 1|X} < ≤ Pθ∗ {Y = 1|X} .
2
This contradicts the fact that Pθ = Pθ∗ . This complete the proof.

Now we are ready to state our mail result of this section.

Theorem 13.1 Under assumption S1-S4, λ(θ) = β is identified.


Proof. Assume without loss of generality that the (special) component of
X specified in S4 is the kth component and that βk > 0.
Let θ = (β, PX , PU |X ) be a parameter that satisfies S1-S2. Consider any
β ̸= β. We want to show that there is no θ∗ = (β ∗ , PX∗ , PU∗ |X ) that satisfies

S1-S4 such that Pθ = Pθ∗ .


From the previous Lemma it is sufficient to show that:

Pθ X ′ β ∗ < 0 ≤ X ′ β ∪ X ′ β < 0 ≤ X ′ β ∗ > 0 ,



126 LECTURE 13. BINARY CHOICE

which is equivalent to prove that

Pθ X ′ β ∗ < 0 ≤ X ′ β > 0


or
Pθ X ′ β < 0 ≤ X ′ β ∗ > 0 .


To prove this condition (or one of the equivalence), we consider three


cases according to the sign of βk∗ . In what follows, we denote by X−k as the
vector X without its kth component. In a similar way, β−k and β−k ∗ are the

vector of parameters without their kth component.

Case 1. Suppose βk∗ < 0. Then,


 ′ β∗
X−k ′ β
X−k

 ′ ∗ ′ −k −k
Pθ X β < 0 ≤ X β = Pθ Xk > − , Xk > − .
βk∗ βk
By Assumption S4, the above probability is positive.

Case 2. Suppose βk∗ = 0. Then,

X ′ β−k
 
′ ∗ ′ ′ ∗
< 0, Xk > − −k

Pθ X β < 0 ≤ X β = Pθ X−k β−k , (13.5)
βk
and
X ′ β−k
 
′ ′ ∗ ′ ∗
≥ 0, Xk < − −k

Pθ X β < 0 ≤ X β = Pθ X−k β−k . (13.6)
βk
′ β ∗ < 0} > 0, we can use Assumption S4 to conclude that (13.5)
If Pθ {X−k −k
′ β ∗ ≤ 0} > 0, as before, we conclude (13.6) is positive
is positive. If Pθ {X−k −k
using Assumption S4.

Case 3. Suppose βk∗ > 0. Then,

X ′ β−k X ′ β∗
 
′ ∗ ′
− −k ≤ Xk < − −k∗ −k

Pθ X β < 0 ≤ X β = Pθ (13.7)
βk βk
and
 ′ β∗
X−k ′ β
X−k

 ′ ′ ∗ −k −k
Pθ X β < 0 ≤ X β = Pθ − ≤ Xk < − . (13.8)
βk∗ βk
As we did in Case 2, we can prove that (13.7) or (13.8) is positive using
Assumption S4. Thus, we only need to prove that
 ′ ′ β∗ 
X−k β−k X−k −k
Pθ > >0
βk βk∗
13.2. ESTIMATION OF THE LINEAR INDEX MODEL 127

or ′ β
X−k X ′ β∗
 
−k
Pθ < −k∗ −k >0.
βk βk
Let us assume by contradiction that the probabilities above are equal to
zero. This implies
 ′ ′ β∗ 
X−k β−k X−k −k
Pθ = ∗ =1,
βk βk

which is equivalent to
β∗
   
β−k
Pθ ′
X−k − −k =0 =1.
βk βk∗

By Assumption S2, that said that there is no proper linear subspace that
contains X, we conclude
β−k β∗
= −k .
βk βk∗
This implies that β ∗ is a scalar multiple of β. By Assumption S3, we con-
clude β = β ∗ , but this is a contradiction. This complete the proof of this
case.

13.2 Estimation of the Linear Index Model


13.2.1 Estimation of parametric binary model
The semi-parametric assumptions S1-S4 and Theorem 13.1 implies that the
parameter β is identified. However, this result is not enough to identify the
marginal effects. Now let us explore how to estimate β in the parametric
case and how can we interpret it.
Let us consider the parametric case. That is

P {Y = 1|X} = F (X ′ β) ,

where F (·) can be the Probit function, F (x) = Φ(x), or the Logit function,
exp(x)
F (x) = 1+exp(x) . Suppose we have a random sample of size n from the
distribution (Y, X); this is (Y1 , X1 ), . . . , (Yn , Xn ). Since we have a paramet-
ric model, we can use the maximum likelihood estimator. Let us write the
likelihood of the observation Yi :

fβ (Yi |Xi ) = F (Xi′ β)Yi (1 − F (Xi′ β))1−Yi .


128 LECTURE 13. BINARY CHOICE

We can use this expression to write the log-likelihood of the random sample:
n
1X
ℓn (b) = ln (fb (Yi |Xi ))
n
i=1
n
1 X
Yi ln F (Xi′ β) + (1 − Yi ) ln 1 − F (Xi′ β)
 
= .
n
i=1

It can be shown that β is the unique maximizer of Q(b) = E[ℓn (b)]. Let
us denote by β̂n the maximum likelihood estimator (MLE). The asymptotic
normality of the MLE implies
√  
d
n β̂n − β → N (0, V) ,

where V = I−1
β and

∂2
 
Iβ = −E ln (f (Y
β i |Xi ))
∂β∂β ′

is the Fisher information matrix. By the information equality, we can rewrite


the expression above as follows
 
∂ ∂
Iβ = E ln (fβ (Yi |Xi )) ′ ln (fβ (Yi |Xi )) .
∂β ∂β

Since
Yi − F (Xi′ β)
 

ln (fβ (Yi |Xi )) = F ′ (Xi′ β)Xi ,
∂β F (Xi′ β)(1 − F (Xi′ β))

we can rewrite one more time the information matrix as follows


" 2 #
Yi − F (Xi′ β) ′ ′ 2 ′
Iβ =E F (Xi β) Xi Xi
F (Xi′ β)(1 − F (Xi′ β))
F ′ (Xi′ β)2
 

=E Xi Xi ,
F (Xi′ β)(1 − F (Xi′ β))

where the second equality above follows from the law of iterated expectations
and law of total variance.
This final expression implies that we can estimate the asymptotic vari-
ance, I−1β . This estimation can be done using the MLE and the sample
analogue to compute the expected value. Note that this implies that we
can do inference on β, but nothing yet about the inference on the marginal
effects.
13.2. ESTIMATION OF THE LINEAR INDEX MODEL 129

How can we interpret β?


Let us assume that Xj is continuously distributed. In the linear regression
with E[U |X] = 0, we had

∂E[Y |X]
= βj .
∂Xj

In this case, βj was capturing the marginal effect of Xj on Y . In the Binary


models we rather have that the marginal effect is non-linear and depends on
X:
∂E[Y |X] ∂P {Y = 1|X}
=
∂Xj ∂Xj
= F ′ (X ′ β)βj ,

where F ′ (·) is the derivative of F . In the case of the Probit, we obtain

∂P {Y = 1|X}
= ϕ(X ′ β)βj ,
∂Xj

and in the case of the Logit, we have

∂P {Y = 1|X}
= F (X ′ β)(1 − F (X ′ β))βj .
∂Xj

Note that the marginal effect of Xj on E[Y |X] depends on the linear
index X ′ β and βj . However, we can still extract information by simply
inspecting β. For instance, we can use the ratio between βj and βk to
obtain the ratio of the partial effects, since we have
∂P {Y =1|X}
∂Xj βj
∂P {Y =1|X}
= .
βk
∂Xk

Also, because F (·) is an increasing function, we can conclude that the sign βj
identifies the sign of the marginal effect of Xj on E[Y |X]. Finally, it is
possible to obtain upper bounds on the marginal effects from β using that
F ′ (·) is bounded. In the case of the Probit model, we obtain

∂P {Y = 1|X} 1
≤ 0.4βj since ϕ(x) ≤ ϕ(0) = √ ≈ 0.4 ,
∂Xj 2π
and in the case of the Logit model,

∂P {Y = 1|X} 1 1
≤ βj since F (x)(1 − F (x)) ≤ .
∂Xj 4 4
130 LECTURE 13. BINARY CHOICE

13.2.2 Estimation of marginal effects


Let us define and compute the average/mean marginal effect as follows
 
∂P {Y = 1|X}
= E F ′ (X ′ β) βj .
 
E
∂Xj

We can estimate this quantity using the sample analogue and the MLE:
n
1 X ′ ′ 
F Xi β̂n β̂n,j .
n
i=1

We can also compute the marginal effects “at the average”, this is defined
by
F ′ (E[X]′ β)βj ,
and can be estimated by
F ′ (X̄n′ β̂n )β̂n,j .
Stata offers both options with the option margins. Note that these two
quantities, average marginal effect and the marginal effect at the average,
are different. The second one could make sense if there is meaning behind
the evaluation of the effect at the average of the sample, but often this is
not the case. For instance, if Xj is a binary variable (e.g. gender).
It is important to note that we computed the average marginal effect of
Xj assuming this variable was continuously distributed. Now, let us focus
to the case in which this variable is binary. In particular, let us consider
the partition X = (X1 , D), where X1 ∈ Rk and D ∈ {0, 1}. Also, let us
consider the partition for β = (β1 , β2 ) accordingly. In this case, the following
expression  
∂P {Y = 1|X}
= E F ′ (X ′ β) β2
 
E
∂D
does not make a lot of sense since D take only two values. Instead, we can
consider the following marginal effect of D,

P {Y = 1|X1 , D = 1} − P {Y = 1|X1 , D = 0} = F (X1′ β1 + β2 ) − F (X1′ β1 ) .

Then, we can define the average marginal effect of D equal to

E F (X1′ β1 + β2 ) − F (X1′ β1 ) ,
 

which can be estimated by the sample analogue and the MLE:


n
1X
F (X1′ β̂n,1 + β̂n,2 ) − F (X1′ β̂n,1 ) .
n
i=1
13.3. LINEAR PROBABILITY MODEL 131

We can compute and report the standard errors for those estimated
marginal effects. Let us remember that for the continuous case, we derived

∂P {Y = 1|X}
= F ′ (X ′ β)βj ,
∂Xj

which is a known function of β. For the discrete case, we obtained F (X1′ β1 +


β2 ) − F (X1′ β1 ) which is also a known function of β. This implies that we
can compute standard errors via the Delta Method. We can also compute
the marginal effect on the treated by only conditioning on D = 1. Stata has
options for this; see margins for more details.

Logit and the odds ratio interpretation


In statistic and Biostatistic, the logit model has particular appeal. Let
pi = P {Yi = 1|Xi }. Using the parametric form of the Logit,

exp(Xi′ β)
pi = ,
1 + exp(Xi′ β)

we can conclude an equation for the odds ratio or relative risk,


pi
= exp(Xi′ β) ,
1 − pi
which, after taking logs, is equivalent to
 
pi
ln = Xi′ β .
1 − pi

In this case, we can interpret βj as the marginal effect of Xj on the log odds
ratio. For example, suppose a clinical trial, denote by Y = 1 if you live and
Y = 0 if you die. An odds ratio of 2 means that the odds of survival are
twice those of death. Now, if βj = 0.1, it means the relative probability of
survival increases by 10% (roughly) if Xj increase in one unit.

13.3 Linear probability model


Some people still advocate the use of the Linear Probability Model (LPM)
where
Y = X ′β + U
and E[U |X] = 0. The main reason for this stance is that β in the linear case
is a well-studied model and directly delivers “marginal effects”. Moreover,
the linear model easily accommodates the analysis of instrumental variables,
panel with fixed effects, etc. Finally, and as we discussed in Lecture 5, if
Y is binary and there are heterogeneous effects, the TSLS estimator admits
132 LECTURE 13. BINARY CHOICE

a LATE interpretation. All these possible extensions discussed above are


hard to implement together in the Probit/Logit model.
However, it is hard to interpret the linear probability model causally
as E[Y |X] cannot be linear in most cases (e.g. Probit/Logit model). As
we discuss in Lecture 1, the causal interpretation requires to believe in the
existence of a model and this is not the case of the LPM. This is usually
recognized by some of their supporters who claim:

The true E[Y |X] may arise from a causal model, but the regres-
sion is only providing a linear approximation to the true E[Y |X].

This suggests that the LPM follows the second interpretation of the linear
regression presented in Lecture 1. This means that LPM is a descriptive
tool that approximate E[Y |X] rather than a model that admit a causal
interpretation.
The linear probability model delivers predicted probabilities outside [0, 1],
which makes it internally inconsistent as a model. A well-known textbook
that support this approach recognize this issue. In Angrist and Pischke
(2008, p. 103) appears textually

...[linear regression] may generate fitted values outside the lim-


ited dependent variable boundaries. This fact bothers some re-
searchers and has generated a lot of bad press for the linear
probability model.

Angrist and Pischke (2008) acknowledge that there are available ap-
proaches for the binary choice model, which admit a causal interpretation
and are different than the LPM. However, in page 197, they add about this
point the following

“Yet we saw that the added complexity and extra work required
to interpret the results from latent index models may not be
worth the trouble”

At the very least, this statement may be controversial.

Remark 13.2 It is expected that Logit, Probit, and LPM yield quite dif-
ferent estimates β̂n . For instance, if we use the upper bounds for marginal
effects, we get

β̂logit ≈ 4β̂ols
β̂probit ≈ 2.5β̂ols
β̂logit ≈ 1.6β̂probit

However, average marginal effects from Logit, Probit, and even LPM are
often “close”, partly due because there is averaging going on.
BIBLIOGRAPHY 133

The binary choice model discussed here using the linear index model
is an idea that is applied to other settings. For instance, ordered choice
models, where individual decides how many units to buy from the same
item, or unordered choice models, where individual decides to buy one of
many different alternatives. In these kind of models, it is common to find
conditional Logit and multinomial Logit. The most popular example in
Industrial Organization (IO) is the random coefficient logit model introduced
by Berry et al. (1995), which is also known as BLP and is useful to estimate
demand. These topics are covered in second year IO classes.

Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.

Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in


Market Equilibrium,” Econometrica, 63, 841–890.

Manski, C. F. (1988): “Identification of Binary Response Models,” Journal


of the American Statistical Association, 83, 729–738.
134 LECTURE 13. BINARY CHOICE
Part III

A Primer on Inference and


Standard Errors

135
Lecture 14

Heteroskedastic-Consistent
Variance Estimation

14.1 Setup and notation


Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that

Y = X ′β + U . (14.1)

Suppose that E[XU ] = 0, that there is no perfect collinearity in X, that


E[XX ′ ] < ∞, and that Var[XU ] < ∞. Denote by P the marginal distri-
bution of (Y, X). Let (Y1 , X1 ), . . . , (Yn , Xn ) be an i.i.d. sample of random
vectors with distribution P . Under these assumptions, we established the
asymptotic normality of the OLS estimator, β̂n , this is
√ d
n(β̂n − β) → N (0, V)

for
V = E[XX ′ ]−1 E[XX ′ U 2 ]E[XX ′ ]−1 .
We wish to test

H0 : β ∈ B0 versus H1 : β ∈ B1

where B0 and B1 form a partition of Rk+1 , paying particular attention


to hypotheses for one of the components of β. Without loss of generality,
assume we are interested in the first slope component of β so that,

H0 : β1 = c versus H1 : β1 ̸= c . (14.2)

The CMT implies that


√ d
n(β̂1,n − β1 ) → N (0, V1 )

137
138 LECTURE 14. HC VARIANCE ESTIMATION

as n → ∞ where V1 = V[2,2] is the element of V corresponding to β1 . A


natural choice of test statistic for this problem is the absolute value of the
t-statistic,

n(β̂1,n − c)
tstat = q ,
V̂1,n
so that Tn = |tstat |. In order for tstat to be asymptotically standard normal
under the null hypothesis, we need a consistent estimator V̂n of the limiting
variance V. In this part of the course we will cover consistent estimators of
V under different assumptions on the dependence and heterogeneity in the
data. We will, however, start with the usual i.i.d. setting, where one of such
estimators is
 −1   −1
1 X 1 X 1 X
V̂n =  Xi Xi′   Xi Xi′ Ûi2   Xi Xi′  ,
n n n
1≤i≤n 1≤i≤n 1≤i≤n

where Ûi = Yi − Xi′ β̂n . This is the most widely used form of the robust,
heteroskedasticity-consistent standard errors and it is associated with the
work of White (1980) (see also Eicker, 1967; Huber, 1967). We will refer to
these as robust EHW (or HC) standard errors.

14.2 Consistency of HC standard errors


P
We now prove that V̂n → V. The main difficulty lies in showing that
1 X P
Xi Xi′ Ûi2 → Var[XU ]
n
1≤i≤n
as n → ∞.
Note that
1 X 1 X 1 X
Xi Xi′ Ûi2 = Xi Xi′ Ui2 + Xi Xi′ (Ûi2 − Ui2 ) .
n n n
1≤i≤n 1≤i≤n 1≤i≤n

Under the assumption that Var[XU ] < ∞, the first term on the righthand
side of the preceding display converges in probability to Var[XU ]. It there-
fore suffices to show that the second term on the righthand side of the
preceding display converges in probability to zero. We argue this separately
for each of the (k + 1)2 terms. To this end, note for any 0 ≤ j ≤ k and
0 ≤ j ′ ≤ k that

1 X 1 X
Xi,j Xi,j ′ (Ûi2 − Ui2 ) ≤ |Xi,j Xi,j ′ ||Ûi2 − Ui2 |
n n
1≤i≤n 1≤i≤n
1 X
≤ |Xi,j Xi,j ′ | max |Ûi2 − Ui2 | .
n 1≤i≤n
1≤i≤n
14.2. CONSISTENCY OF HC STANDARD ERRORS 139

Because E[XX ′ ] < ∞, we have that E[|Xj Xj ′ |] < ∞. Hence,


1 X
|Xi,j Xi,j ′ | = OP (1) ,
n
1≤i≤n

so it suffices to show that

max |Ûi2 − Ui2 | = oP (1) .


1≤i≤n

For this purpose, the following lemma will be useful:

Lemma 14.1 Let Z1 , . . . , Zn be an i.i.d. sequence of random vectors such


1
that E[|Zi |r ] < ∞. Then max1≤i≤n |Zi | = oP (n r ), i.e.,
1 P
n− r max |Zi | → 0 .
1≤i≤n

Proof: Let ϵ > 0 be given. Note that


1 [
P {n− r max |Zi | > ϵ} = P{ {|Zi |r > ϵr n}}
1≤i≤n
1≤i≤n
X
≤ P {|Zi |r > ϵr n}
1≤i≤n
1 X
≤ E[|Zi |r I{|Zi |r > ϵr n}]
nϵr
1≤i≤n
1
= E[|Zi |r I{|Zi |r > ϵr n}]
ϵr
→ 0

as n → ∞, where the first equality follows by inspection, the first inequal-


ity follows from Bonferonni’s inequality, the second inequality follows from
Markov’s inequality, the final equality follows from the i.i.d. assumption,
and the convergence to zero follows from the assumption that E[|Zi |r ] < ∞.

We now use Lemma 14.1 to establish the desired convergence in prob-


ability to zero. Note that E[|X|2 ] < ∞ (which follows from the fact that
E[XX ′ ] < ∞) and E[|U X|2 ] < ∞ (which follows from the fact that Var[XU ] <
∞). Recall that Ûi = Ui − Xi′ (β̂n − β), so that

|Ûi2 − Ui2 | ≤ 2|Ui ||Xi ||β̂n − β| + |Xi |2 |β̂n − β|2 .



Next, note that Lemma 14.1 and the fact that n(β̂n − β) = OP (1) imply
that

|β̂n − β| max |Ui ||Xi | = oP (1)


1≤i≤n

|β̂n − β|2 max |Xi |2 = oP (1) .


1≤i≤n
140 LECTURE 14. HC VARIANCE ESTIMATION

The desired conclusion thus follows. If we combine this result with


1 X P
Xi Xi′ → E[XX ′ ] ,
n
1≤i≤n

it follows immediately that


P
V̂n → V .
Let V̂1,n denote the (2, 2)-diagonal element of V̂n - i.e., the entry corre-
sponding to β1 . It follows that the test that rejects H0 in (14.2) when

n(β̂1,n − c)
Tn = |tstat | = q
V̂1,n

exceeds z1− α2 , is consistent in levels. As before, using the duality between


hypothesis testing and the construction of confidence regions, we may con-
struct a confidence region of level α for β1 as

 
 n(β̂1,n − c) 
Cn = c ∈ R : q ≤ z1− α2
V̂1,n
 
 s s 
 V̂1,n V̂1,n 
= β̂1,n − z1− 2 α , β̂1,n + z1− 2α .
 n n 

This confidence region satisfies

P {β1 ∈ Cn } → 1 − α

as n → ∞.
It is worth noting that Stata does not compute V̂n in the default “ro-
bust” option, but rather a version of this estimator that includes a finite
sample adjustment to “inflate” the estimated residuals (known to be too
small in finite samples). This version of the HC estimator is commonly
known as HC1 and given by
 −1   −1
1 X 1 X 1 X
V̂hc1,n = Xi Xi′   Xi Xi′ Ûi∗2   Xi Xi′  ,
n n n
1≤i≤n 1≤i≤n 1≤i≤n

where Ûi∗2 = n−k−1


n
Ûi2 . It is immediate to see that this estimator is also
consistent for V. With the obvious modification for the components βj ,
1 ≤ j ≤ k, and using HC1 standard errors, these are the “robust” confidence
intervals reported by Stata. Other versions, including the one discussed in
the next section are also available as an option.
14.3. IMPROVING FINITE SAMPLE PERFORMANCE: HC2 141

The consistency of the standard errors does not necessarily translate into
accurante finite sample inference on β in general, something that lead to a
number of finite sample adjustments that are sometimes used in practice.
The simplest one is the HC1 correction, although better alternatives are
available. Below we discuss some of these adjustments.

14.3 Improving finite sample performance: HC2


An alternative to V̂n and V̂hc1,n is what MacKinnon and White (1985) call
the HC2 variance estimator, here denoted by V̂hc2,n . In order to define this
estimator, we need additional notation. Let

P = X(X′ X)−1 X′

be the n × n projection matrix, with i-th column denoted by

Pi = X(X′ X)−1 Xi

and (i, i)-th element denoted by

Pii = Xi′ (X′ X)−1 Xi .

Let Ω be the n × n diagonal matrix with i-th diagonal element equal to


σ 2 (Xi ) = Var[Ui |Xi ], and let en,i be the n-vector with i-th element equal to
one and all other elements equal to zero. Let I be the n × n identity matrix
and M = I − P be the residual maker matrix. The residuals Ûi = Yi − Xi′ β̂n
can be written as

Ûi = e′n,i MU , or, in vector form, Û = MU . (14.3)

The (conditional) expected value of the square of the residual is

E[Ûi2 |X1 , . . . , Xn ] = E[(e′n,i MU)2 |X1 , . . . , Xn ]


= (en,i − Pi )′ Ω(en,i − Pi ) .

If we further assume homoskedasticity (i.e., Var[U |X] = σ 2 ), the last ex-


pression reduces to

E[Ûi2 |X1 , . . . , Xn ] = σ 2 (1 − Pii ) ,

by exploiting that P is an idempotent matrix. In other words, even when


the error term U is homoskedastic, the LS residual Û is heteroskedastic
(due to the presence of Pii ). Moreover, since it can be shown that n1 ≤ Pii ≤
1, it follows that Var[Ûi ] underestimates σ 2 under homoskedasticity. This
discussion makes it natural to consider
Ûi2
Ũi2 ≡ , (14.4)
1 − Pii
142 LECTURE 14. HC VARIANCE ESTIMATION

as the squared residual to use in variance estimation as Ũi2 is unbiased for


E[Ui2 |X1 , . . . , Xn ] under homoskedasticity. This is the motivation for the
variance estimator MacKinnon and White (1985) introduce as HC2,

n
!−1 n
! n
!−1
1X 1X 1X
V̂hc2,n = Xi Xi′ Xi Xi′ Ũi2 Xi Xi′ , (14.5)
n n n
i=1 i=1 i=1

where Ũi2 is as in (14.4). Under heteroskedasticity this estimator is unbiased


only in some simple examples (we will cover one of these next class), but it
is biased in general. However, it is expected to have lower bias relative to
HC/HC1 - a statement supported by simulations.
There are other finite sample adjustments that give place to HC3, HC4,
and even HC5. For example, HC3 is equivalent to HC2 with

Ûi2
Ũi∗2 ≡ , (14.6)
(1 − Pii )2

replacing Ũi2 , and its justification is related to the Jackknife estimator of


the variance of β̂n . However, we will not consider these in class as these
adjustments do not deliver noticeable additional benefits relative to HC2
(at least for the purpose of this class). It is worth noting that HC2 and HC3
are available as an option in Stata.

14.4 The Behrens-Fisher Problem


The Behrens-Fisher problem is that of comparing the means of two popula-
tions when the ratio of their variances is unknown and the distributions are
assumed normal, i.e.,

Y (0) ∼ N (µ0 , σ 2 (0)) and Y (1) ∼ N (µ1 , σ 2 (1)) . (14.7)

We know from previous results that this problem can be viewed as a special
case of linear regression with a binary regressor, i.e. X = (1, D) and D ∈
{0, 1}. In this case, the coefficient on D identifies the average treatment
effect, which in this case equals precisely µ1 − µ0 . To be specific, consider
the linear model
Y = X ′ β + U = β0 + β1 D + U
where
Y = Y (1)D + (1 − D)Y (0) ,
and U is assumed to be normally distributed conditional on D, with zero
conditional mean and

Var[U |D = d] = σ 2 (d) for d ∈ {0, 1} .


14.4. THE BEHRENS-FISHER PROBLEM 143

We are interested in
Cov(Y, D)
β1 = = E[Y |D = 1] − E[Y |D = 0] .
Var(D)
Because D is binary, the least squares estimator of β1 can be written as
β̂1,n = Ȳ1 − Ȳ0 , (14.8)
where for d ∈ {0, 1},
n n
1 X X
Ȳd = Yi I{Di = d} and nd = I{Di = d} .
nd
i=1 i=1

Conditional on D(n) = (D1 , . . . , Dn ), the exact finite sample variance of β̂1,n


is
σ 2 (0) σ 2 (1)
V1∗ = Var[β̂1,n |D(n) ] = + ,
n0 n1
so that, under normality, it follows that
σ 2 (0) σ 2 (1)
 
β̂1,n |D(n) ∼ N β1 , + .
n0 n1
The problem of how to do inference on β1 in the absence of knowledge of
σ 2 (d) in this context is old, and known as the Behrens-Fisher problem. In
particular, the question is whether there exists κ ∈ R such that for some
∗ we get
estimator V̂1,n
β̂1,n − β1
q ∼ t(κ) , (14.9)

V̂1,n
where t(κ) denotes a t-distribution with κ degrees of freedom (dof). We
explore this question below under different assumptions.

Comment on notation. Today we are framing the discussion around the


“actual” conditional variance of β̂1,n as opposed to the asymptotic variance.
This means that the estimator V̂1,n ∗ above is an estimator of such variance

(which also explains why there is no n in the numerator of (14.9)). Of
course, if V̂1,n is a consistent estimator of the asymptotic variance of β̂1,n ,
∗ = 1 V̂
then V̂1,n n 1,n is an estimator of the variance of β̂1,n . I will use ∗ to
denote finite sample variances.

14.4.1 The homoskedastic case


Suppose the errors are homoskedastic: σ 2 = σ 2 (0) = σ 2 (1), so that the exact
conditional variance of β̂1,n is
 
∗ 2 1 1
V1 = σ + .
n0 n1
144 LECTURE 14. HC VARIANCE ESTIMATION

In this case, we can estimate σ 2 by


n
1 X
σ̂ 2 = (Yi − Xi′ β̂n )2 ,
n−2
i=1

and let  
∗ 1 1
V̂1,ho = σ̂ 2 + , (14.10)
n0 n1
be the estimator of V1∗ . This estimator has two important features.

(a) Unbiased. Since σ̂ 2 is unbiased for σ 2 , it follows that V̂1,ho is unbiased

for the true variance V1 .

(b) Chi-square. Under normality of U given D, the scaled distribution of



V̂1,ho is chi-square with n − 2 dof,

V̂1,ho
(n − 2) ∼ χ2 (n − 2) .
V1∗

It follows that, under normality of U given D, the t-stat has an exact t-


distribution under the null hypothesis in (14.2),

β̂1,n − c
tho = q ∼ t(n − 2) . (14.11)

V̂1,ho

This t-distribution with dof equal to n − 2 can be used to test (14.2) and,
by duality, for the construction of exact confidence intervals, i.e.,
 q q 
1−α n−2 ∗ n−2 ∗
CSho = β̂1,n − t1− α V̂1,ho , β̂1,n + t1− α V̂1,ho . (14.12)
2 2

Here tn−2
1− α denotes the 1− α2 quantile of a t distributed random variable with
2
n − 2 dof. Such confidence interval is exact under these two assumptions,
normality and homoskedasticity, and we can conclude that (14.9) holds with
κ = n − 2.

14.4.2 The robust EHW variance estimator


In the Behrens-Fisher example, the component of the EHW variance esti-
mator n1 V̂n corresponding to β1 simplifies to

∗ σ̂ 2 (0) σ̂ 2 (1)
V̂1,hc = +
n0 n1
where
n
1 X
σ̂ 2 (d) = (Yi − Ȳd )2 I{Di = d} for d ∈ {0, 1} .
nd
i=1
14.4. THE BEHRENS-FISHER PROBLEM 145

Unfortunately, however, there are no assumptions under which there exists


a value of κ such that (14.9) holds, even when U is normally distributed
conditional on D.
The standard, normal-distribution-based, 1−α confidence interval based
on the robust variance estimator is
 q q 
1−α ∗ ∗
CShc = β̂1,n − z1− α2 V̂1,hc , β̂1,n + z1− α2 V̂1,hc . (14.13)

In small samples the properties of these standard errors are not always

attractive: V̂1,hc is biased downward, i.e.,

∗ n0 − 1 σ 2 (0) n1 − 1 σ 2 (1)
E[V̂1,hc ]= + < V1∗ ,
n0 n0 n1 n1
1−α
and CShc can have coverage substantially below 1 − α. A common “cor-
n−2
rection” for this problem is to replace z1− α2 with t1− α . However, as we will
2
illustrate in the next section, such correction if often ineffective.

14.4.3 An unbiased estimator of the variance


An alternative to n1 V̂n is what MacKinnon and White (1985) call the HC2
variance estimator, here denoted by n1 V̂hc2,n . We learned that this estimator
is unbiased under homoskedasticity and that, in general, it removes only part
of the bias under heteroskedasticity. However, in the single binary regressor
(Behrens-Fisher) case the MacKinnon-White HC2 correction removes the
entire bias. Its form in this case is

∗ σ̃ 2 (0) σ̃ 2 (1)
V̂1,hc2 = + ,
n0 n1
where
n
1 X
σ̃ 2 (d) = (Yi − Ȳd )2 I{Di = d} .
nd − 1
i=1

These conditional variance estimators differ from σ̂ 2 (d) by a factor nd /(nd −


1). In combination with the normal approximation to the distribution of
the t-statistic, this variance estimator leads the following 1 − α confidence
interval
 q q 
1−α ∗ ∗
CShc2 = β̂1,n − z1− α2 V̂1,hc2 , β̂1,n + z1− α2 V̂1,hc2 . (14.14)


The estimator V̂1,hc2 is unbiased for V1∗ , but it does not satisfy the chi-
square property in (b) above. As a result, the associated confidence interval
is still not exact. Just as in the previous case, there are no assumptions
under which there exists a value of κ such that (14.9) holds, even when U
146 LECTURE 14. HC VARIANCE ESTIMATION

dof σ 2 (0) = 0 σ 2 (0) = 1 σ 2 (0) = 2



V̂ho ∞ 72.5 94.0 99.8
n−2 74.5 95.0 99.8

V̂1,hc ∞ 76.8 80.5 86.6
n−2 78.3 82.0 88.1

V̂1,hc2 ∞ 82.5 85.2 89.8
n−2 83.8 86.5 91.0

Table 14.1: Angrist-Pischke design. n1 = 3, n0 = 27.

is normally distributed conditional on D. In fact, in small samples these


standard errors do not work very well.
Consider the following simple simulation, borrowed from Imbens and
Kolesar (2012) and Angrist and Pischke (2008), where n1 = 3, n0 = 27,
Ui |Di ∼ N (0, σ 2 (Di )) ,
σ 2 (1) = 1, σ 2 (0) ∈ {0, 1, 2}, and 1 − α = 0.95. The results are reported
in Table 14.1. From the table it is visible that the confidence intervals
typically undercover. Note that the table also reports the usual finite-sample
n−2
adjustment, i.e. replacing z1− α2 with t1− α . However, in this single-binary-
2
covariate case it is easy to see why n − 2 may be a poor choice for the
degrees of freedom for the approximating t-distribution. Suppose that there
are many units with Di = 0 and few units with Di = 1 (say n1 = 3 and
n0 = 1, 000, 000). In that case E[Yi |Di = 0] is estimated relatively precisely,
with variance σ 2 (0)/n0 ≈ 0. As a result the distribution of the t-statistic is
approximately equal to that of
Ȳ1 − E[Yi |Di = 1]
p .
σ̃ 2 (1)/n1
The latter has, under normality, an exact t-distribution with dof equal to
n1 − 1 = 2, substantially different from the t-distribution with n − 2 ≈ ∞
dof. The question is whether we can figure out the appropriate dof in an
automatic data dependent way, and this leads to topic of “degrees of freedom
adjustment”.

Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.
Imbens, G. W. and M. Kolesar (2012): “Robust standard errors in small
samples: some practical advice,” Tech. rep., National Bureau of Economic
Research.
BIBLIOGRAPHY 147

MacKinnon, J. G. and H. White (1985): “Some heteroskedasticity-


consistent covariance matrix estimators with improved finite sample prop-
erties,” Journal of Econometrics, 29, 305–325.

White, H. (1980): “A heteroskedasticity-consistent covariance matrix es-


timator and a direct test for heteroskedasticity,” Econometrica, 817–838.
148 LECTURE 14. HC VARIANCE ESTIMATION
Lecture 15

Heteroskedasticity
Autocorrelation Consistent
Covariance Estimation

15.1 Setup and notation


Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that
Y = X ′β + U . (15.1)
Suppose that E[U |X] = 0, that there is no perfect collinearity in X, that
E[XX ′ ] < ∞, and that Var[XU ] < ∞. We have already discussed Het-
eroskedasticity Consistent (HC) covariance matrix estimation, together with
a number of adjustments intended to improve the performance in finite sam-
ples. Today we will consider the case where the sample (Y1 , X1 ), . . . , (Yn , Xn )
is not necessarily i.i.d. due to the presence of dependence across observations.
In particular, we will introduce the term “autocorrelation”, to denote the
case where Xi and Xi′ may not be independent for i ̸= i′ . To derive an
estimator of the variance covariance matrix that is consistent we need two
tools: (a) appropriate LLNs and CLTs for dependent processes, and (b) and
description of the object we intend to estimate. For the sake of simplicity,
we will start assuming that Xi = X1,i is a scalar random variable that is
naturally ordered (e.g., a time series) to then move to the general case.

15.2 Limit theorems for dependent data


This section will informally discuss the issues that arise when extending
law of large numbers and central limit theorems to dependent data. The
discussion here follows Mikusheva (2007) and Billingsley (1995).

149
150 LECTURE 15. HAC COVARIANCE ESTIMATION

Let’s start with the LLN. When {Xi : 1 ≤ i ≤ n} is a sequence of i.i.d.


2 , it follows that
random variables with mean µ and variance σX
" n # n
1X 1 X σ2
Var Xi = 2 Var[Xi ] = X → 0 , (15.2)
n n n
i=1 i=1

and so convergence in probability follows by a simple application of Cheby-


shev’s inequality. However, without the independence, we need additional
assumptions to control the variance of the average. We will start by assum-
ing that the process we are dealing with are “stationary” as follows,

Definition 15.1 A process {Xi : 1 ≤ i ≤ n} is strictly stationary if, for
each j, the distribution of (Xi, . . . , Xi+j) is the same for all i.

Definition 15.2 A process {Xi : 1 ≤ i ≤ n} is weakly stationary if E[Xi ],


E[Xi2 ], and, for each j, γj ≡ Cov[Xi , Xi+j ], do not depend on i.
Assuming stationarity, the unique mean µ is well defined, and we can
consider the variance of the sample average again:

Var[(1/n) ∑_{i=1}^{n} Xi] = (1/n²) ∑_{i=1}^{n} ∑_{k=1}^{n} Cov[Xi, Xk]
  = (1/n²) (nγ0 + 2(n − 1)γ1 + 2(n − 2)γ2 + · · · )
  = (1/n) (γ0 + 2 ∑_{j=1}^{n} γj (1 − j/n)) ,

where we have used the notation γj = Cov[Xi, Xi+j], so that γ0 = σ²_X.

We can immediately see that, for the variance to vanish, we need to make
sure the last summation does not explode. A sufficient condition for this is
absolute summability,

∑_{j=−∞}^{∞} |γj| < ∞ ,  (15.3)

in which case a law of large numbers follows one more time from an appli-
cation of Chebyshev’s inequality.

Lemma 15.1 If {Xi : 1 ≤ i ≤ n} is a weakly stationary time series (with


mean µ) with absolutely summable auto-covariances, then a law of large
numbers holds (in probability and L2).

Remark 15.1 Stationarity is not enough. Let ζ ∼ N (0, σζ2 ). Suppose


Xi = ζ ∀i. Then Cov[Xi , Xi′ ] = σζ2 ∀i, i′ , so we do not have absolute
summability, and clearly we do not have a LLN since the average equals ζ,
which is random.

Remark 15.2 Absolute summability follows from mixing assumptions,


i.e., assuming the sequence {Xi : 1 ≤ i ≤ n} is α-mixing, see Billingsley
(1995, Lemma 3, p. 365). The notion of α-mixing captures the dependence
in the data as follows. Let αn be a number such that

|P (A ∩ B) − P (A)P (B)| ≤ αn , (15.4)

for any A ∈ σ(X1 , . . . , Xj ), B ∈ σ(Xj+n , Xj+n+1 , . . . ), where σ(X) is the


σ-field generated by X, and j ≥ 1, n ≥ 1. If αn → 0 as n → ∞, the
sequence is then said to be α-mixing, the idea being that Xj and Xj+n are
then approximately independent for large n.
From the new proof of LLN one can guess that the variance in a central
limit theorem should change. Remember that we wish to normalize the sum
in such a way that the limit variance would be 1. Note that

Var[(1/√n) ∑_{i=1}^{n} Xi] = γ0 + 2 ∑_{j=1}^{n} γj (1 − j/n) → γ0 + 2 ∑_{j=1}^{∞} γj = Ω ,  (15.5)

where Ω is called the long-run variance. There are many central limit the-
orems for serially correlated observations. Below we provide a commonly
used version, see Billingsley (1995, Theorem 27.4) for a proof under slightly
stronger assumptions.
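As a concrete illustration (not part of the original discussion): if Xi follows a stationary AR(1), Xi = ρXi−1 + ei with |ρ| < 1 and Var[ei] = σ²_e, then γj = ρ^{|j|} σ²_e/(1 − ρ²), the autocovariances are absolutely summable, and

Ω = γ0 + 2 ∑_{j=1}^{∞} γj = (σ²_e/(1 − ρ²)) (1 + ρ)/(1 − ρ) = σ²_e/(1 − ρ)² ,

so positive serial correlation (ρ > 0) inflates the long-run variance relative to γ0.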

Theorem 15.1 Suppose that {Xi : 1 ≤ i ≤ n} is a strictly stationary αn-mixing
stochastic process with E[|X|^{2+δ}] < ∞, E[X] = 0, and

∑_{n=1}^{∞} αn^{δ/(2+δ)} < ∞ .  (15.6)

Then Ω in (15.5) is finite (i.e., (15.3) holds) and, provided Ω > 0,

(1/√n) ∑_{i=1}^{n} Xi →_d N(0, Ω) .  (15.7)

15.3 Estimating long-run variances


Let's go back to the regression model in (15.1). In the case of i.i.d. data, one
of the exclusion restrictions is formulated as E[Ui | Xi] = 0. However, whenever
the data are potentially dependent (time series, panel data, clustered data), we
have to describe the conditional mean relative to all variables that may be
important. In particular, we say Xi is weakly exogenous if
E[Ui | Xi, Xi−1, . . .] = 0, where we are implicitly assuming the observations
have a natural ordering (as is the case for time series).
Now consider the LS estimator of β,

√n(β̂n − β) = ((1/n) ∑_{i=1}^{n} Xi Xi′)^{−1} (1/√n) ∑_{i=1}^{n} Xi Ui .  (15.8)

Invoking Lemma 15.1 under appropriate assumptions on {Xi : 1 ≤ i ≤ n} gives us
(1/n) ∑_{i=1}^{n} Xi Xi′ →_P ΣX = E[XX′]. In addition, assuming the conditions in
Theorem 15.1 for ηi ≡ Xi Ui gives

(1/√n) ∑_{i=1}^{n} ηi →_d N(0, Ω) ,  (15.9)

and thus

√n(β̂n − β) →_d N(0, ΣX^{−1} Ω ΣX^{−1}) .  (15.10)
The only thing that is different from the usual sandwich formula is the
"meat", i.e., Ω = ∑_{j=−∞}^{∞} γj, where the γj are now the autocovariances of ηi.
This long-run variance is significantly harder to estimate than the usual
variance-covariance matrices that arise under i.i.d. assumptions. We need
to figure out how to estimate Ω.
Below we discuss the HAC approach. For simplicity, however, let's ignore
the fact that in practice Ui will be replaced by a regression residual Ûi (such a
modification is easy to incorporate and follows similar steps to those in
previous lectures).

15.3.1 A naive approach


We know Ω is the sum of all auto-covariances (an infinite number of them).
However, we can only estimate n − 1 of them with a sample of size n. What
if we just use the ones we can estimate? This gives the following estimator,

Ω̃ ≡ ∑_{j=−(n−1)}^{n−1} γ̂j ,   γ̂j = (1/n) ∑_{i=1}^{n−j} ηi ηi+j .  (15.11)

Unfortunately, this does not result in a consistent estimator. To see this,
note that

Ω̃ = ∑_{j=−(n−1)}^{n−1} γ̂j = (1/n) ∑_{j=−(n−1)}^{n−1} ∑_{i=1}^{n−j} ηi ηi+j = ((1/√n) ∑_{i=1}^{n} ηi)² →_d (N(0, Ω))² .

The problem is that we are summing too many imprecisely estimated co-
variances. So, the noise does not die out. For example, to estimate γn−1 we
use only one observation.

15.3.2 Simple truncation


Given the problem with the naive estimator, a natural question would be:
what if we do not use all the covariances? This gives us a truncated estimator,

Ω̄ ≡ ∑_{j=−mn}^{mn} γ̂j = γ̂0 + 2 ∑_{j=1}^{mn} γ̂j ,  (15.12)

where mn < n, mn → ∞, and mn/n → 0 as n → ∞. First, we have to notice
that due to truncation there will be a finite-sample bias. As mn increases,
the bias due to truncation becomes smaller and smaller. But we do not want
to increase mn too fast, for the reason stated above (we do not want to sum
up noise). Assume that we can choose mn in such a way that this estimator
is consistent. Then we might face another small-sample problem: this
estimator may be negative, Ω̄ < 0 (or, in the vector case, Ω̄ may not be
positive definite). To see this, take mn = 1, so that Ω̄ = γ̂0 + 2γ̂1. In small
samples we may find γ̂1 < −γ̂0/2, in which case Ω̄ will be negative.

15.3.3 Weighting and truncation: the HAC estimator


Following Newey and West (1987), the suggestion is to create a weighted
sum of sample auto-covariances with weights guaranteeing positive-definiteness:

Ω̂n ≡ ∑_{j=−(n−1)}^{n−1} k(j/mn) γ̂j .  (15.13)

We need conditions on mn and k(·) to give us consistency and positive-definiteness.
First, mn → ∞ as n → ∞, although not too fast. For the proof below we will
assume that mn³/n → 0, but the result can be proved under mn²/n → 0
(see Andrews, 1991). On the other hand, k(·) needs to guarantee
positive-definiteness by down-weighting high-lag covariances, but we also need
k(j/mn) → 1 as n → ∞ for consistency. As with non-parametric density
estimation, there exists a variety of kernels that satisfy all the properties
needed for consistency and positive-definiteness. The first one, proposed by
Newey and West (1987), was the Bartlett kernel, which is defined as follows.

Bartlett kernel (Newey and West, 1987)

k(x) = 1 − |x| if |x| ≤ 1, and k(x) = 0 otherwise.

Parzen kernel (Gallant, 1987)

k(x) = 1 − 6x² + 6|x|³ if |x| ≤ 1/2,  k(x) = 2(1 − |x|)³ if 1/2 ≤ |x| ≤ 1,  and k(x) = 0 otherwise.

Quadratic spectral kernel (Andrews, 1991)

k(x) = (25/(12π²x²)) ( sin(6πx/5)/(6πx/5) − cos(6πx/5) ) .

These kernels are all symmetric about 0; the first two kernels have
bounded support [−1, 1], and the QS kernel has unbounded support. For the
Bartlett and Parzen kernels, the weight assigned to γ̂j decreases with |j|
and becomes zero for |j| ≥ mn. Hence, mn in these functions is also known
as a truncation lag parameter. For the quadratic spectral kernel, mn does not
have this interpretation because the weight decreases to zero at |j| = 1.2mn
but then exhibits damped sine waves afterwards. Note that for the first two
kernels we can write

Ω̂n ≡ ∑_{j=−mn}^{mn} k(j/mn) γ̂j ,  (15.14)

so that the truncation at mn is explicit. In the results that follow below we


will focus on this representation to simplify the intuition behind the formal
arguments.
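To fix ideas, here is a minimal Python sketch of the estimator in (15.13)–(15.14) with the Bartlett kernel, for a scalar, mean-zero series (the function name is illustrative, and in practice ηi would be replaced by residual-based scores):

import numpy as np

def newey_west_lrv(eta, m):
    # HAC (Newey-West) estimate of the long-run variance, Bartlett kernel k(x) = 1 - |x|
    eta = np.asarray(eta, dtype=float)
    n = eta.shape[0]
    omega = eta @ eta / n                        # gamma_0
    for j in range(1, m + 1):
        gamma_j = eta[:-j] @ eta[j:] / n         # sample autocovariance at lag j
        omega += 2.0 * (1.0 - j / m) * gamma_j   # weight k(j/m), zero at j = m
    return omega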

Theorem 15.2 Assume that {ηi : 1 ≤ i ≤ n} is a weakly stationary sequence
with mean zero and autocovariances γj = Cov[ηi, ηi+j] that satisfy (15.3).
Assume that

1. mn → ∞ as n → ∞ and mn³/n → 0.

2. k(x) : R → [−1, 1], k(0) = 1, k(x) is continuous at 0, and k(−x) = k(x).

3. For all j the sequence ξi,j = ηi ηi+j − γj is stationary and

sup_j ∑_{k=1}^{∞} |Cov(ξi,j, ξi+k,j)| < C

for some constant C (limited dependence).

Then, Ω̂n →_P Ω.

I will provide a sketch of the proof under the assumptions of the theorem
above. Start by writing the difference between our estimator and the object
of interest,

Ω̂n − Ω = −∑_{|j|>mn} γj + ∑_{j=−mn}^{mn} (k(j/mn) − 1) γj + ∑_{j=−mn}^{mn} k(j/mn)(γ̂j − γj) .

The first term represents a truncation error; it is non-stochastic and goes to
zero as mn → ∞ by (15.3).
The second term is also non-stochastic and represents the error from using a
kernel (as opposed to uniform weights). If we let

fn(j) ≡ |k(j/mn) − 1| |γj| ,

under condition 2 it follows that fn(j) ≤ g(j) ≡ 2|γj|, which, by (15.3), is
summable. By the same condition, fn(j) → f(j) = 0 for all j and, invoking
the dominated convergence theorem,

|∑_{j=−mn}^{mn} (k(j/mn) − 1) γj| ≤ ∑_{j=−mn}^{mn} |k(j/mn) − 1| |γj| → 0 ,  (15.15)

so that the second term vanishes asymptotically.
Now consider the third term. This is the error from estimating the
covariances and it is stochastic. Notice that for the first term we want mn
big enough to eliminate it; for this last term, we want mn to be small enough.
Start by noting that γ̂j = (1/n) ∑_{i=1}^{n−j} ηi ηi+j is biased for γj, with

γj* ≡ E[γ̂j] = ((n − j)/n) γj .

This bias disappears as n → ∞, so we can split the last term into two parts,

∑_{j=−mn}^{mn} k(j/mn)(γ̂j − γj*) + ∑_{j=−mn}^{mn} k(j/mn)(γj* − γj) ,  (15.16)

and conclude that the second part (which is non-stochastic) goes to zero by
arguments similar to those in (15.15). It then suffices to show that

|∑_{j=−mn}^{mn} k(j/mn)(γ̂j − γj*)| ≤ ∑_{j=−mn}^{mn} |γ̂j − γj*| →_P 0 .

To do this, let ξi,j = ηi ηi+j − γj, so that

(1/n) ∑_{i=1}^{n−j} ξi,j = γ̂j − γj* .

Simple algebra shows that

E[(γ̂j − γj*)²] = (1/n²) ∑_{i=1}^{n−j} ∑_{k=1}^{n−j} Cov[ξi,j, ξk,j]
  ≤ (1/n²) ∑_{i=1}^{n−j} ∑_{k=1}^{n−j} |Cov[ξi,j, ξk,j]|
  ≤ (1/n²) ∑_{i=1}^{n−j} C
  ≤ C/n ,

where in the second inequality we used sup_{1≤j<∞} ∑_{k=1}^{∞} |Cov(ξi,j, ξi+k,j)| < C.
By Chebyshev's inequality,

P{|γ̂j − γj*| > ϵ} ≤ E[(γ̂j − γj*)²]/ϵ² ≤ C/(nϵ²) ,  (15.17)

where, importantly, the bound holds uniformly in 1 ≤ j < ∞. This characterizes
the accuracy with which we estimate each covariance. Now we need to assess
how many auto-covariances we can estimate well simultaneously:

P{∑_{j=−mn}^{mn} |γ̂j − γj*| > ϵ} ≤ P{∪_{j=−mn}^{mn} {|γ̂j − γj*| > ϵ/(2mn + 1)}}
  ≤ ∑_{j=−mn}^{mn} P{|γ̂j − γj*| > ϵ/(2mn + 1)}
  ≤ ∑_{j=−mn}^{mn} E[(γ̂j − γj*)²](2mn + 1)²/ϵ²
  ≤ (2mn + 1) C (2mn + 1)²/(nϵ²)
  ≤ C* mn³/(nϵ²) → 0 ,

where the last step uses mn³/n → 0. This completes the proof.
We have proved consistency but we have not addressed the question of
positive definiteness of our HAC estimator. To do this, it is convenient to
characterize positive definiteness using the Fourier transformation of Ω̂. We
will skip this in class, but the interested reader should see Newey and West
(1987).

Bandwidth choice. After the original paper by Newey and West (1987), a
series of papers addressed the issue of bandwidth choice (notably, Andrews
(1991)). The general idea here is that we face a bias-variance trade-off in the
choice of the bandwidth mn (also called the truncation lag): a bigger mn
reduces the truncation bias, but it increases the number of estimated
covariances used (and hence the variance of the estimator). Andrews (1991)
proposed to choose mn by minimizing the mean squared error (MSE) of the
HAC estimator,

MSE(Ω̂n) = bias(Ω̂n)² + Var(Ω̂n) .  (15.18)

Andrews (1991) did this minimization and showed that the optimal bandwidth
is mn = C* n^{1/r}, where r = 3 for the Bartlett kernel and r = 5 for the other
kernels. He also provided values for the optimal constant C*, which depends
on the kernel used, among other things.
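For illustration, a deterministic bandwidth of this form can be paired with the sketch above; the constant below is arbitrary, whereas Andrews (1991) derives a data-dependent C*:

# illustrative use of newey_west_lrv from the sketch above
n = len(eta)
m_n = max(1, int(4 * n ** (1.0 / 3.0)))   # m_n = C * n^(1/3), r = 3 for the Bartlett kernel
omega_hat = newey_west_lrv(eta, m_n)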

Bibliography
Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Con-
sistent Covariance Matrix Estimation,” Econometrica, 59, 817–858.

Billingsley, P. (1995): Probability and Measure, Wiley-Interscience.

Mikusheva, A. (2007): “Course materials for Time Series Analysis,” MIT


OpenCourseWare, Massachusetts Institute of Technology.

Newey, W. K. and K. D. West (1987): “A Simple, Positive Semi-


definite, Heteroskedasticity and Autocorrelation Consistent Covariance
Matrix,” Econometrica, 55, 703–08.
Lecture 16

Cluster Covariance Estimation

Let (Y, X, U) be a random vector where Y and U take values in R and
X takes values in R^{k+1}. Assume further that the first component of X is
a constant equal to one, i.e., X = (X0, X1, . . . , Xk)′ with X0 = 1. Let
β ∈ R^{k+1} be such that
Y = X ′β + U . (16.1)
and assume that E[XU ] = 0, that there is no perfect collinearity in X, that
E[XX ′ ] < ∞, and that Var[XU ] < ∞.
To describe the sampling process, we use two indices as we did when we
covered panel data. The first index, j = 1, . . . , q, denotes what are known
as “clusters”. By clusters, we mean a group of observations that may be
related to each other. For example, a cluster could be a family, a school, an
industry, or a city. The second index, i = 1, . . . , nj , denotes units within a
cluster. For example, family members, students, firms, or individuals. If we
let
Xj = (X1,j , . . . , Xnj ,j )′
be a nj × (k + 1) matrix of stacked observations for cluster j, and define Yj
and Uj analogously, we can write (16.1) as

Yj = Xj β + Uj , j = 1, . . . , q ,

where E[Xj′ Uj ] = 0. We will assume that (Yj , Xj ) are independent across


j ≤ q but wish to allow for the possibility that (Yi,j , Xi,j ) and (Yi′ ,j , Xi′ ,j )
may be dependent for i ̸= i′ and same j ≤ q. In terms of constructing
valid tests for hypotheses on the parameter β, this problem translates into
constructing standard errors that account for the fact that (Xi,j , Ui,j ) and
(Xi′ ,j , Ui′ ,j ) may be correlated within a cluster. In order to do this, we
start by presenting appropriate versions of the law of large numbers and the
central limit theorem, following Hansen and Lee (2019).


16.1 Law of Large Numbers


We start by focusing on the sample mean of Xi,j. Since the least squares
estimator is a function of sample means, doing so will prove useful to analyze
the properties of the LS estimator of β later on. Note that we can write

X̄n = (1/n) ∑_{j=1}^{q} Xj′ 1j = (1/n) ∑_{j=1}^{q} ∑_{i=1}^{nj} Xi,j ,

where 1j is an nj-dimensional vector of ones and n = ∑_{j=1}^{q} nj. Our first result is:

Theorem 16.1 Suppose that as n → ∞,

max_{j≤q} nj/n → 0  (16.2)

and that

lim_{M→∞} sup_{i,j} E[||Xi,j|| I{||Xi,j|| > M}] = 0 .  (16.3)

Then, as n → ∞,

||X̄n − E[X̄n]|| →_P 0 .
A few comments about this theorem are worth highlighting. First, the
condition in (16.3) states that Xi,j is uniformly integrable and this is a tech-
nical requirement that is usually assumed outside the i.i.d. setting. Second,
the condition in (16.2) states that each cluster size nj is asymptotically neg-
ligible. This automatically holds when nj is fixed as q → ∞, which is the
traditional framework we discussed with panel data. It also implies that
q → ∞, so we do not explicitly list this as a condition.

Remark 16.1 The condition in (16.2) allows for considerable heterogeneity
in cluster sizes and it allows the cluster sizes to grow with the sample size, so
long as the growth is not proportional. An example that we will use multiple
times below is nj = n^a for 0 ≤ a < 1, which leads to

n = ∑_{j=1}^{q} nj = ∑_{j=1}^{q} n^a = q n^a  ⟹  q = n^{1−a} .

Assumption (16.2) is necessary for consistent parameter estimation while
allowing arbitrary within-cluster dependence. Otherwise a single cluster
could dominate the sample average.

16.2 Rates of Convergence


Under i.i.d. sampling the rate of convergence of the sample mean is n^{−1/2}.
That is,

√n(X̄n − E[X̄n]) →_d N(0, V) .

In the case of clustered data, the rate of convergence may or may not be
affected. We will see in the examples below that it often is affected. For
instance, if the dependence within the cluster is strong, the rate of convergence
is determined by the number of clusters: q^{−1/2}. If the dependence within
clusters is weak (in a precise sense that we illustrate later in the examples),
the rate of convergence is n^{−1/2}. However, if the dependence is in between
weak and strong, the rate of convergence can be in between, or even slower
than, the rates mentioned above.
To analyze the convergence rate, we can compute the standard deviation of
the sample mean. That is,

sd(X̄n) = (Var[X̄n])^{1/2} = (1/n) (∑_{j=1}^{q} Var[Xj′ 1j])^{1/2} ,

where the last equality uses the fact that the Xj are independent across clusters j.

Example 16.1 Consider the case where nj = n^a and q = n^{1−a} for some
a ∈ (0, 1). Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and
Cov[Xi,j, Xi′,j] = 0 for i ̸= i′. In this case it follows that

Var[Xj′ 1j] = ∑_{i=1}^{nj} Var[Xi,j] = nj ,

and then

sd(X̄n) = (1/n)(q nj)^{1/2} = n^{−1/2} ,

where the last equality follows from q = n^{1−a} and nj = n^a.

Example 16.2 Consider the case where nj = n^a and q = n^{1−a} for some
a ∈ (0, 1). Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and
Cov[Xi,j, Xi′,j] = 1. In this case it follows that

Var[Xj′ 1j] = ∑_{i=1}^{nj} ∑_{i′=1}^{nj} Cov[Xi,j, Xi′,j] = nj² = n^{2a} ,

and then

sd(X̄n) = (1/n)(q n^{2a})^{1/2} = q^{−1/2} ,

where the last equality follows because q = n^{1−a}.

Example 16.3 Consider the case where nj = n^a and q = n^{1−a} for some
a ∈ (0, 1). Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and
Cov[Xi,j, Xi′,j] = 1/|i − i′| for i ̸= i′. These conditions imply that

Var[Xj′ 1j] = ∑_{i=1}^{nj} ∑_{i′=1}^{nj} Cov[Xi,j, Xi′,j] = ∑_{i=1}^{nj} (1 + ∑_{i′̸=i} 1/|i − i′|) .

Now, let us rely on the following asymptotic proportionality approximation,

∑_{i′=1}^{nj} 1/|i − i′| ∝ log(nj) ∝ log(n) ,

where the last approximation follows because log(nj) = a log(n). We can use
this to conclude that

Var[Xj′ 1j] ∝ ∑_{i=1}^{nj} (1 + log(n)) ∝ nj log(n) .

This last expression can be used to approximate the standard deviation of the
sample mean,

sd(X̄n) ∝ (1/n)(∑_{j=1}^{q} nj log(n))^{1/2} = (1/n)(q n^a log(n))^{1/2} = (log(n)/n)^{1/2} ,

where the last equality uses q = n^{1−a}. It follows from here that the
convergence rate is slower than n^{−1/2} since

√n sd(X̄n) ∝ (log(n))^{1/2} → ∞ as n → ∞ .

At the same time, the convergence rate is faster than q^{−1/2} since

√q sd(X̄n) ∝ (n^{1−a} log(n)/n)^{1/2} = (log(n)/n^a)^{1/2} → 0 as n → ∞ .
Example 16.4 Consider the case where there are two types of clusters. In
the first group, the number of clusters is q1 = n/2 and nj = 1 for j = 1, . . . , q1.
In the second group, the number of clusters is q2 = n^{1−a}/2 and nj = n^a for
j = q1 + 1, . . . , q1 + q2 and a ∈ (0, 1). The total number of clusters is denoted
by q = q1 + q2. Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and
Cov[Xi,j, Xi′,j] = 1. These conditions imply that

Var[Xj′ 1j] = ∑_{i=1}^{nj} ∑_{i′=1}^{nj} Cov[Xi,j, Xi′,j] = nj² ,

which we can use to compute the standard deviation of the sample mean,

sd(X̄n) = (1/n)(∑_{j=1}^{q} nj²)^{1/2} = (1/n)(q1 + q2 n^{2a})^{1/2} ,

where the last equality follows because Var[Xj′ 1j] = 1 if j = 1, . . . , q1, and
Var[Xj′ 1j] = n^{2a} if j = q1 + 1, . . . , q1 + q2. Now, we can use that q1 = n/2 and
q2 = n^{1−a}/2 to conclude that

sd(X̄n) = (1/n)(n/2 + (n^{1−a}/2) n^{2a})^{1/2} = ((1 + n^a)/(2n))^{1/2} ∝ n^{−(1−a)/2} .

It follows that the convergence rate is slower than n^{−1/2} since

√n sd(X̄n) ∝ n^{1/2} n^{−(1−a)/2} = n^{a/2} → ∞ as n → ∞ .

In addition, the convergence rate is slower than q^{−1/2} since q = q1 + q2 ∝ n and

√q sd(X̄n) ∝ n^{1/2} n^{−(1−a)/2} = n^{a/2} → ∞ as n → ∞ .

This means that sd(X̄n) goes to zero at a slower rate than both n^{−1/2} and q^{−1/2}.
The examples above provide insight into the convergence rate of X̄n, as
measured by sd(X̄n). The rate can be n^{−1/2} (the square root of the sample
size), it can be q^{−1/2} (the square root of the number of clusters), it can be
in between these two rates, and, under heterogeneity, it can even be slower
than both. When X̄n is a vector, it is possible that each of its elements
converges at a different rate.
The last example illustrates the importance of considering heterogeneous
cluster sizes. In that case, the reason why the convergence rate is slower than
both n^{−1/2} and q^{−1/2} is that the number of clusters is determined by the
large number of small clusters (q1), but the convergence rate is determined
by the (relatively) small number of large clusters (q2, whose variance
contribution is q2 n^{2a}).
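The scaling in Examples 16.1 and 16.2 is easy to verify by simulation. A small Monte Carlo sketch in Python (purely illustrative; rho = 0 mimics Example 16.1 and rho = 1 mimics Example 16.2):

import numpy as np

def sd_of_grand_mean(q, nj, rho, reps=2000, seed=0):
    # Monte Carlo sd of the sample mean with q clusters of size nj and
    # within-cluster correlation rho generated by a common cluster shock
    rng = np.random.default_rng(seed)
    means = np.empty(reps)
    for r in range(reps):
        common = rng.standard_normal((q, 1))
        idio = rng.standard_normal((q, nj))
        X = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * idio
        means[r] = X.mean()
    return means.std()

With rho = 1 the output is roughly q^(-1/2) regardless of nj, while with rho = 0 it is roughly (q*nj)^(-1/2).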

16.3 Central Limit Theorem


Under i.i.d. sampling the standard deviation of the sample mean, X̄n, is of
order O(n^{−1/2}), so √n is the natural scaling to obtain the central limit
theorem (CLT). However, clustering can alter the rate of convergence, as we
saw in the examples above. Thus, it is essential to standardize the sample
mean by the actual variance rather than an assumed rate. Now, define the
variance-covariance matrix of √n X̄n by

Ωn = E[n(X̄n − E[X̄n])(X̄n − E[X̄n])′] = (1/n) ∑_{j=1}^{q} E[(Xj′ 1j − E[Xj′ 1j])(Xj′ 1j − E[Xj′ 1j])′] ,

where 1j is an nj-dimensional vector of ones. We also denote by λn = λmin(Ωn)
the minimum eigenvalue of Ωn.
The next theorem presents a central limit theorem for the sample mean
considering the correct scaling, Ωn^{−1/2}√n, so that

Ωn^{−1/2} √n (X̄n − E[X̄n])

is a random variable with mean zero and covariance matrix equal to the
identity matrix Ik+1 by construction.

Theorem 16.2 (Central Limit Theorem) Suppose that for some 2 ≤ r < +∞,

lim_{M→∞} sup_{i,j} E[||Xi,j||^r I{||Xi,j|| > M}] = 0 ,  (16.4)

and

(∑_{j=1}^{q} nj^r)^{2/r} / n ≤ C < ∞ ,  (16.5)

for some positive C > 0. Assume further that as n → ∞,

max_{j≤q} nj²/n → 0  (16.6)

and

λn ≥ λ > 0 ,  (16.7)

for some positive λ > 0. Then, as n → ∞,

Ωn^{−1/2} √n (X̄n − E[X̄n]) →_d N(0, Ik+1) .  (16.8)

Let’s discuss the conditions under which this theorem holds. Assumption
(16.4) states that ||Xi,j ||r is uniformly integrable. When r = 2, this condi-
tion is similar to the Lindeberg condition for the CLT under independent
heterogeneous sampling. Assumption (16.5) involves a trade-off between the
cluster sizes and the number of moments r. It is least restrictive for large
r, and more restrictive for small r. Note that as r → ∞, we can conclude
maxj≤q n2j /n = O(1), which is implied by (16.6).
Assumption (16.6) allows for growing and heterogeneous cluster sizes.
It allows clusters to grow uniformly at the rate nj = na for any 0 ≤ a ≤
(r − 2)/(2r − 2). Note that this requires the cluster sizes to be bounded
if r = 2. It also allows for only a small number of clusters to grow. For
example, nj = n̄ (bounded clusters) for q − k clusters and nj = q a/2 for k
clusters, with k fixed. In this case the assumption holds for any a < 1 and
r = 2.

Finally, Assumption (16.7) specifies that Var[√n c′X̄n] does not vanish
for any vector c ̸= 0, since the condition implies that the minimum
eigenvalue of the variance-covariance matrix is positive.

16.4 Cluster Covariance Estimation


Let us now consider the estimation of Ωn. Suppose that E[Xj′ 1j] = 0 for all
j = 1, . . . , q. This implies that

Ωn = (1/n) ∑_{j=1}^{q} E[(Xj′ 1j − E[Xj′ 1j])(Xj′ 1j − E[Xj′ 1j])′]

is equal to

(1/n) ∑_{j=1}^{q} E[Xj′ 1j 1j′ Xj] .

The natural estimator of Ωn would be

Ω̂n = (1/n) ∑_{j=1}^{q} Xj′ 1j 1j′ Xj = (1/n) ∑_{j=1}^{q} (∑_{i=1}^{nj} Xi,j)(∑_{i=1}^{nj} Xi,j)′ .

It is worth mentioning that this estimator is robust to dependence within
clusters: it allows for arbitrary within-cluster correlation patterns. It also
allows for heterogeneity, since E[Xj′ 1j 1j′ Xj] can vary across j. The following
theorem shows that this estimator can be used to studentize the sample mean
X̄n under the right scaling.

Theorem 16.3 (Consistency of CCE) Under the same assumptions of
Theorem 16.2 and assuming that E[Xj′ 1j] = 0, we obtain as n → ∞ that

||Ω̂n − Ωn|| →_P 0

and

Ω̂n^{−1/2} √n X̄n →_d N(0, Ik+1) .
This theorem shows that the cluster covariance estimator is consistent.
Moreover, replacing the covariance matrix in the central limit theorem de-
scribed in Theorem 16.2 with the estimated covariance matrix does not affect
the asymptotic distribution. This implies that cluster-robust t-statistics are
asymptotically standard normal. It is worth mentioning that we do not
need to know the actual rate of convergence of X̄n, as the cluster covariance
estimator captures this rate of convergence. For the proof of these results,
see Hansen and Lee (2019).

16.4.1 Application to Linear Regression


Let us recall our initial setup. For each cluster j, let us denote by
Xj = (X1,j, . . . , Xnj,j)′ ∈ R^{nj×(k+1)} the matrix of stacked observations.
Define Yj ∈ R^{nj} and Uj ∈ R^{nj} in a similar way. Using this notation, we have

Yj = Xj β + Uj ,  j = 1, . . . , q,  where E[Xj′ Uj] = 0 ,

and we assume that (Yj, Xj) are independent across clusters but remain
agnostic about the dependence within clusters.
The least squares (LS) estimator of β is given by

β̂n = ((1/n) ∑_{j=1}^{q} Xj′ Xj)^{−1} (1/n) ∑_{j=1}^{q} Xj′ Yj .

Using this expression and the model for Yj, we can derive the following:

√n(β̂n − β) = ((1/n) ∑_{j=1}^{q} Xj′ Xj)^{−1} (1/√n) ∑_{j=1}^{q} Xj′ Uj .

Now, let us introduce some notation before we discuss the consistency and
asymptotic normality properties of the LS estimator:

Σn = (1/n) ∑_{j=1}^{q} E[Xj′ Xj]  and  Ωn = (1/n) ∑_{j=1}^{q} E[Xj′ Uj Uj′ Xj] .

Consistency of LS

If condition (16.2) in Theorem 16.1 holds, Σn has full rank with λmin(Σn) ≥ C > 0,
and the uniform integrability condition in (16.3) holds for Xi,j Xi,j′ and
Xi,j Ui,j, then

β̂n →_P β .

Asymptotic Normality of LS

To properly normalize √n(β̂n − β) we define

Vn = Σn^{−1} Ωn Σn^{−1} ,

as the rate of convergence may not be √n. Using this notation, we assume
that the conditions in Theorem 16.2 hold for some r, Σn has full rank,
λmin(Σn) ≥ C > 0, λmin(Ωn) ≥ C > 0, and the uniform integrability condition
in (16.4) holds for Xi,j Xi,j′ and Xi,j Ui,j. It follows that as n → ∞,

Vn^{−1/2} √n(β̂n − β) →_d N(0, Ik+1) .

Note that to conduct inference, all we need is a consistent estimator V̂n such
that

V̂n^{−1/2} √n(β̂n − β) →_d N(0, Ik+1) .
This is what we develop next.

Cluster Covariance Estimator


Building on earlier results, we can immediately derive a consistent estimator
of Vn as follows.

Definition 16.1 (CCE) The CCE estimator of Vn is given by

V̂n = ((1/n) ∑_{j=1}^{q} Xj′ Xj)^{−1} ((1/n) ∑_{j=1}^{q} Xj′ Ûj Ûj′ Xj) ((1/n) ∑_{j=1}^{q} Xj′ Xj)^{−1} ,

where Ûj = Yj − Xj β̂n are the LS residuals.

Under the same conditions listed for the asymptotic normality of the LS
estimator, we can conclude that

||V̂n − Vn|| →_P 0  and  V̂n^{−1/2} √n(β̂n − β) →_d N(0, Ik+1) .

Note that in the special case with nj = 1 for all j = 1, . . . , q, this estimator
becomes the HC estimator presented in Lecture 14. It is worth mentioning
that Stata uses a multiplicative adjustment to reduce the bias,

V̂stata = ((n − 1)/(n − k − 1)) (q/(q − 1)) V̂n .
This estimator allows for arbitrary within-cluster correlation patterns and
heteroskedasticity across clusters. Unlike HAC estimators, it does not re-
quire the selection of a kernel or bandwidth parameter.
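For concreteness, here is a minimal Python sketch of the LS estimator together with the CCE (function and variable names are illustrative; standard errors for β̂n are then sqrt(diag(V̂n)/n)):

import numpy as np

def cluster_robust_vcov(X, Y, cluster):
    # LS estimator and cluster covariance estimator (CCE) of V_n
    X = np.asarray(X, dtype=float)      # (n, k+1) with a leading column of ones
    Y = np.asarray(Y, dtype=float)      # (n,)
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                        # (1/n) sum_j X_j'X_j
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # LS estimator
    U_hat = Y - X @ beta_hat                       # LS residuals
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = (np.asarray(cluster) == g)
        s = X[idx].T @ U_hat[idx]                  # X_j' U_hat_j
        meat += np.outer(s, s)
    meat /= n
    Sigma_inv = np.linalg.inv(Sigma_hat)
    V_hat = Sigma_inv @ meat @ Sigma_inv           # sandwich formula
    return beta_hat, V_hat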

Inference

For s ∈ {0, 1, . . . , k}, let βs be the s-th element of β and let V̂n,s be the
(s + 1)-th diagonal element of V̂n. Using this notation, consider testing

H0 : βs = c versus H1 : βs ̸= c

at level α. Using the results we just derived, it follows that under the null
hypothesis the t-statistic is asymptotically standard normal,

tstat = √n(β̂n,s − c) / √(V̂n,s) →_d N(0, 1) as n → ∞ .

This implies that the test that rejects H0 when |tstat| > z1−α/2 is consistent
in level, where z1−α/2 is the critical value defined by the (1 − α/2)-quantile of
the standard normal distribution.

Remark 16.2 The previous results show that inference based on the t-
statistic for the LS estimator of β and the cluster robust covariance estima-
tor V̂n is valid as q → ∞, regardless of whether n → ∞ as well or not, and
regardless of whether the dependence within each cluster is weak or strong.
That is, even though the LS estimator may converge at different rates de-
pending on the data structure, the studentization by the CCE captures the
relevant rate of convergence and makes the t-statistic adaptive.

16.4.2 Small q ad-hoc adjustments


As we mentioned before, the cluster-robust inference asymptotics are based
on many clusters, that is, q → ∞. However, empirical settings often have few
clusters (few regions, few schools, few states, etc.). Following the ideas
discussed for HC standard errors in Lecture 14, there are some finite-sample
adjustments that people use in practice. For instance, Bell and McCaffrey
(2002) propose a bias-reduction modification analogous to that of HC2. That is,

V̂bm = ((1/n) ∑_{j=1}^{q} Xj′ Xj)^{−1} ((1/n) ∑_{j=1}^{q} Xj′ Ũj Ũj′ Xj) ((1/n) ∑_{j=1}^{q} Xj′ Xj)^{−1} ,

where

Ũj = (Inj − Pjj)^{−1/2} Ûj ,

Inj is the nj × nj identity matrix, Pjj is the nj × nj matrix defined as

Pjj = Xj (X′X)^{−1} Xj′ ,

and X is the n × (k + 1) matrix constructed by stacking X1 through Xq. Bell
and McCaffrey (2002) also propose a t critical value with a degrees-of-freedom
adjustment, following the same intuition we discussed in the context of the
Behrens-Fisher problem.
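A sketch of the residual adjustment in Python, building on the CCE sketch above (names are illustrative, and the code assumes each Inj − Pjj is positive definite so that its inverse square root exists):

import numpy as np

def bm_adjusted_residuals(X, U_hat, cluster):
    # Bell-McCaffrey adjustment: U~_j = (I_nj - P_jj)^(-1/2) U^_j
    X = np.asarray(X, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)
    U_tilde = np.array(U_hat, dtype=float, copy=True)
    for g in np.unique(cluster):
        idx = np.where(np.asarray(cluster) == g)[0]
        Xg = X[idx]
        P = Xg @ XtX_inv @ Xg.T                    # P_jj
        w, V = np.linalg.eigh(np.eye(len(idx)) - P)
        M_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        U_tilde[idx] = M_inv_sqrt @ U_hat[idx]
    return U_tilde

The adjusted residuals Ũj then replace Ûj in the middle ("meat") term of the CCE.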

16.4.3 Simulations
Table 16.1 reports simulation results for the following five designs. The
model in all designs is

Yi = β0 + β1 Xi + Ui ,
Xi = V_{Ci} + Wi ,
Ui = ν_{Ci} + ηi ,

where Ci denotes the cluster of i and all variables are N(0, 1). Also, in all
designs β0 = β1 = 0. In the first design there are q = 10 clusters, with
nj = 30 units in each cluster. In the second design q = 5 with nj = 30.
In the third design there are q = 10 clusters, half with nj = 10 and half with
nj = 50. The fourth design has heteroskedasticity, with ηi | Xi ∼ N(0, 0.9Xi²),
and in the fifth design the covariate is fixed within the clusters: Wi = 0 and
V_{Ci} ∼ N(0, 2). The last two designs have q = 10 clusters with nj = 30.
The following table reports the coverage probability of the confidence intervals.
Table 16.1: Coverage probabilities; designs as in Imbens and Kolesar/CGM, nominal level 1 − α = 95%

            dof      I      II     III    IV     V
V̂n          ∞       84.7   73.9   79.6   85.7   81.7
V̂n          q − 1   89.5   86.9   85.2   90.2   86.4
V̂stata      ∞       86.7   78.8   81.9   87.6   83.6
V̂stata      q − 1   91.1   90.3   87.2   91.8   88.1
V̂bm         ∞       89.2   84.7   87.2   89.1   87.7
V̂bm         q − 1   93.0   93.3   91.3   92.8   91.4
V̂bm         K_bm    94.4   95.3   94.4   94.2   96.6

Final Comments
When q is small, V̂bm (and even more so V̂n) typically leads to confidence sets
that under-cover. Bell and McCaffrey (2002) obtain more traction from the
degrees-of-freedom adjustment to the t-distribution that is used to compute
the critical value of the test. This adjustment performs well sometimes, but
it is ad hoc (no formal results).
The literature on inference with few clusters (i.e., q fixed) has made
significant progress recently and the main alternatives to using CCE are:

• The Wild Bootstrap: See Cameron et al. (2008) and Canay et al.
(2021).

• Exact t-approach: See Ibragimov and Müller (2010).

• Approximate Randomization Tests: See Canay et al. (2017).

The cluster consistent estimator discussed above is still a very good option
when q is large, but remember that its performance does not improve if nj
gets large while q remains small.

Bibliography
Bell, R. M. and D. F. McCaffrey (2002): “Bias reduction in standard
errors for linear regression with multi-stage samples,” Survey Methodology,
28, 169–182.

Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008): “Bootstrap-


based improvements for inference with clustered errors,” The Review of
Economics and Statistics, 90, 414–427.

Canay, I. A., J. P. Romano, and A. M. Shaikh (2017): “Randomization


Tests under an Approximate Symmetry Assumption,” Econometrica, 85,
1013–1030.

Canay, I. A., A. Santos, and A. M. Shaikh (2021): “The wild boot-


strap with a “small” number of “large” clusters,” Review of Economics
and Statistics, 103, 346–363.

Hansen, B. E. and S. Lee (2019): “Asymptotic theory for clustered sam-


ples,” Journal of econometrics, 210, 268–290.

Hansen, C. B. (2007): “Asymptotic properties of a robust variance matrix


estimator for panel data when T is large,” Journal of Econometrics, 141,
597–620.

Ibragimov, R. and U. K. Müller (2010): “t-Statistic based correlation


and heterogeneity robust inference,” Journal of Business & Economic
Statistics, 28, 453–468.
Lecture 17

Bootstrap

17.1 Confidence Sets


Let Xi , i = 1, . . . , n be an i.i.d. sample of observations with distribution P ∈
P. The family P may be a parametric, nonparametric, or semiparametric
family of distributions. We are interested in making inferences about some
parameter θ(P ) ∈ Θ = {θ(P ) : P ∈ P}. Typical examples of θ(P ) are the
mean of P or median of P , but, more generally, it could be any function of
P . Specifically, we are interested in constructing a confidence set for θ(P );
that is, a random set, Cn = Cn (X1 , . . . , Xn ) such that

P {θ(P ) ∈ Cn } ≈ 1 − α ,

at least for n sufficiently large.


The typical way of constructing such sets is based off of approximating
the distribution of a root, Rn = Rn (X1 , . . . , Xn , θ(P )). A root is simply any
real-valued function depending on both the data, Xi , i = 1, . . . , n, and the
parameter of interest, θ(P ). The idea is that if the distribution of the root
were known, then one could straightforwardly construct a confidence set for
θ(P ). To illustrate this idea, let Jn (P ) denote the sampling distribution of
Rn and define the corresponding cumulative distribution function as,

Jn (x, P ) = P {Rn ≤ x} . (17.1)

The notation is intended to emphasize the fact that the distribution of the
root depends on both the sample size, n, and the distribution of the data,
P . Using Jn (x, P ), we may choose a constant c such that

P {Rn ≤ c} ≈ 1 − α .

Given such a c, the set

Cn = {θ ∈ Θ : Rn (X1 , . . . , Xn , θ) ≤ c}


is a confidence set in the sense described above. We may also choose c1 and
c2 so that
P {c1 ≤ Rn ≤ c2 } ≈ 1 − α .
Given such c1 and c2 , the set

Cn = {θ ∈ Θ : c1 ≤ Rn (X1 , . . . , Xn , θ) ≤ c2 }

is a confidence set in the sense described above.

17.1.1 Pivots and Asymptotic Pivots


In some rare instances, Jn (x, P ) does not depend on P . In these instances,
the root is said to be pivotal or a pivot. For example, if θ(P ) is the mean of
P and P = {N (θ, 1) : θ ∈ R}, then the root

Rn = √n(X̄n − θ(P))  (17.2)

is a pivot because Rn ∼ N (0, 1). In this case, we may construct confidence


sets Cn with finite-sample validity; that is,

P {θ(P ) ∈ Cn } = 1 − α

for all n and P ∈ P.


Sometimes, the root may not be pivotal in the sense described above,
but it may be asymptotically pivotal or an asymptotic pivot in that Jn (x, P )
converges in distribution to a limit distribution J(x, P ) that does not de-
pend on P . For example, if θ(P ) is the mean of P and P is the set of all
distributions on R with a finite, nonzero variance, then

Rn = √n(X̄n − θ(P)) / σ̂n  (17.3)
is asymptotically pivotal because it converges in distribution to J(x, P ) =
Φ(x). In this case, we may construct confidence sets that are asymptotically
valid in the sense that

lim P {θ(P ) ∈ Cn } = 1 − α
n→∞

for all P ∈ P.

17.1.2 Asymptotic Approximations


Typically, the root will be neither a pivot nor an asymptotic pivot. The
distribution of the root, Jn (x, P ), will typically depend on P , and, when it
exists, the limit distribution of the root, J(x, P ), will, too. For example,

if θ(P ) is the mean of P and P is the set of all distributions on R with a


finite, nonzero variance, then

Rn = √n(X̄n − θ(P))  (17.4)

converges in distribution to J(x, P ) = Φ(x/σ(P )). In this case, we can ap-


proximate this limit distribution with Φ(x/σ̂n ), which will lead to confidence
sets that are asymptotically valid in the sense described above.
Note that this third approach depends very heavily on the limit distribu-
tion J(x, P ) being both known and tractable. Even if it is known, the limit
distribution may be difficult to work with (e.g., it could be the supremum
of some complicated stochastic process with many nuisance parameters).
Moreover, even if it is known and manageable, the method may be poor in
finite-samples because it essentially relies on a double approximation: first,
Jn (x, P ) is approximated by J(x, P ), then J(x, P ) is approximated in some
way by estimating the unknown parameters of the limit distribution.

17.2 The Bootstrap


The bootstrap is a fourth, more general approach to approximating Jn (x, P ).
The idea is very simple: replace the unknown P with an estimate P̂n . Given
P̂n , it is possible to compute (either analytically or using simulation to any
desired degree of accuracy) Jn (x, P̂n ). In the case of i.i.d. data, a typical
choice is the empirical distribution (though if P = P (ψ) for some finite-
dimensional parameter ψ, then one may also use P̂n = P (ψ̂n ) for some
estimate ψ̂n of ψ). The hope is that whenever P̂n is “close” to P (which
may be ensured, for example, by the Glivenko-Cantelli Theorem), Jn (x, P̂n )
is “close” to Jn (x, P ). Essentially, this requires that Jn (x, P ), when viewed
as a function of P , is continuous in an appropriate neighborhood of P . Often,
this turns out to be true, but, unfortunately, it is not true in general.

17.2.1 The Nonparametric Mean


We will now consider the case where P is a distribution on R and θ(P ) is

the mean of P . We will consider first the root Rn = n(X̄n − θ(P )). Let
P̂n denote the empirical distribution of the Xi , i = 1, . . . , n. Under what
conditions is Jn (x, P̂n ) “close” to Jn (x, P )?
The sequence of distributions P̂n is a random sequence, so it is more
convenient to answer the question first for a nonrandom sequence Pn . The
following theorem does exactly that.

Theorem 17.1 Let θ(P ) be the mean of P and let P denote the set of all
distributions on R with a finite, nonzero variance. Consider the root Rn =

√n(X̄n − θ(P)). Let Pn, n ≥ 1 be a nonrandom sequence of distributions such

that Pn converges in distribution to P , θ(Pn ) → θ(P ) and σ 2 (Pn ) → σ 2 (P ).


Then,

(i) Jn (x, Pn ) converges in distribution to J(x, P ) = Φ(x/σ(P )).

(ii) Jn−1 (1 − α, Pn ) = inf{x ∈ R : Jn (x, Pn ) ≥ 1 − α} converges to

J −1 (1 − α, P ) = z1−α σ(P ) .

Proof. (i) For each n, let Xn,i, i = 1, . . . , n be an i.i.d. sequence of random
variables with distribution Pn. We must show that

√n(X̄n,n − θ(Pn))

converges in distribution to N(0, σ²(P)). To this end, let

Zn,i = (Xn,i − θ(Pn)) / σ(Pn)

and apply the Lindeberg-Feller central limit theorem. We must show that

lim_{n→∞} E[Z²n,i I{|Zn,i| > ϵ√n}] = 0 .

Let ϵ > 0 be given. By the assumption that Pn converges in distribution to P
and Slutsky's Theorem,

Zn,i →_d Z = (X − θ(P)) / σ(P) ,

where X has distribution P. It follows that for any λ > 0 for which the
distribution of |Z| is continuous at λ, we have that

E[Z²n,i I{|Zn,i| > λ}] → E[Z² I{|Z| > λ}] .

To prove this last claim, we need a couple of results:

1. Lehmann and Romano (2005, Example 11.2.14): Suppose that Yn and Y are
real-valued random variables and that Yn →_d Y. If the Yn are uniformly
bounded, then E[Yn] → E[Y]. (In general, convergence in distribution does
not imply convergence of moments!)

2. Continuous Mapping Theorem (Lehmann and Romano, 2005, Theorem 11.2.13):
Suppose that Yn →_d Y. Let g be a measurable map from R to R. Let C be the
set of points in R for which g is continuous. If P{Y ∈ C} = 1, then
g(Yn) →_d g(Y).

We now use these two results. First, note that for any λ > 0 for which the
distribution of |Z| is continuous at λ, the continuous mapping theorem above
implies that

g(|Zn,i|) = Z²n,i I{|Zn,i| ≤ λ} →_d Z² I{|Z| ≤ λ} = g(|Z|) .  (17.5)

Note that g is discontinuous at λ but that P{|Z| = λ} = 0, and so the result
follows. Second, note that

E[Z²n,i I{|Zn,i| > λ}] = E[Z²n,i] − E[Z²n,i I{|Zn,i| ≤ λ}] .

The first term on the right-hand side is always equal to one and also equal to
E[Z²] = 1. The second term is the expectation of

Z²n,i I{|Zn,i| ≤ λ} ∈ [0, λ²] ,

which is uniformly bounded. By (17.5) and the first result above,

E[Z²n,i I{|Zn,i| ≤ λ}] → E[Z² I{|Z| ≤ λ}] .

We conclude that

E[Z²n,i I{|Zn,i| > λ}] = E[Z²] − E[Z²n,i I{|Zn,i| ≤ λ}] → E[Z²] − E[Z² I{|Z| ≤ λ}] = E[Z² I{|Z| > λ}] .

As λ → ∞, E[Z² I{|Z| > λ}] → 0. To complete the proof, note that for any
fixed λ > 0

E[Z²n,i I{|Zn,i| > ϵ√n}] ≤ E[Z²n,i I{|Zn,i| > λ}]

for n sufficiently large. Thus,

√n Z̄n,n →_d N(0, 1)

under Pn. The desired result now follows from Slutsky's Theorem and the fact
that σ(Pn) → σ(P).
(ii) This follows from part (i) and Lemma 17.1 below applied to Fn(x) = Jn(x, Pn)
and F(x) = J(x, P).

Lemma 17.1 Let Fn , n ≥ 1 and F be nonrandom distribution functions


on R such that Fn converges in distribution to F . Suppose F is continuous
and strictly increasing at F −1 (1 − α) = inf{x ∈ R : F (x) ≥ 1 − α}. Then,
Fn−1 (1 − α) = inf{x ∈ R : Fn (x) ≥ 1 − α} → F −1 (1 − α).

Proof: Let q = F −1 (1 − α). Fix δ > 0 and choose ϵ so that 0 < ϵ < δ
and F is continuous at q − ϵ and q + ϵ. This is possible because F is
continuous at q and therefore continuous in a neighborhood of q. Hence,
Fn (q − ϵ) → F (q − ϵ) < 1 − α and Fn (q + ϵ) → F (q + ϵ) > 1 − α, where the
inequalities follow from the assumption that F is strictly increasing at q. For
n sufficiently large, we thus have that Fn (q−ϵ) < 1−α and Fn (q+ϵ) > 1−α.
It follows that q − ϵ ≤ Fn^{−1}(1 − α) ≤ q + ϵ for such n.
We are now ready to pass from the nonrandom sequence Pn to the ran-
dom sequence P̂n .

Theorem 17.2 Let θ(P ) be the mean of P and let P denote the set of
all distributions on R with a finite, nonzero variance. Consider the root

Rn = √n(X̄n − θ(P)). Then,

(i) Jn (x, P̂n ) converges in distribution to J(x, P ) = Φ(x/σ(P )) a.s.

(ii) Jn−1 (1 − α, P̂n ) converges to J −1 (1 − α, P ) = z1−α σ(P ) a.s.

Proof: By the Glivenko-Cantelli Theorem,

sup_{x∈R} |P̂n((−∞, x]) − P((−∞, x])| → 0 a.s.

This implies that P̂n converges in distribution to P a.s. Since |x| ≤ 1 + x²
and σ²(P) < ∞, we have that E[|X|] ≤ 1 + E[X²] < ∞. Thus, we
may apply the Strong Law of Large Numbers to conclude that θ(P̂n ) = X̄n
converges to θ(P ) a.s. and σ(P̂n ) converges to σ(P ) a.s. Thus, w.p.1, P̂n
satisfies the assumptions of Theorem 17.1. The conclusions of the theorem
now follow.

Remark 17.1 Similar results hold for the studentized root in (17.3), where
σ̂n is a consistent estimator of σ(P ). Using this root leads to the so-called
Bootstrap-t, as the root is just the t-statistic. A key step in the proof of
this result is to show that σ̂n converges in probability to σ(P ) under an
appropriate sequence of distributions. We skip this in this class. However,
the advantage of working with a studentized root like the one in (17.3) is
that the limit distribution of Rn is pivotal, which affects the properties of
the bootstrap approximation as discussed in the next section.
It now follows from Slutsky's Theorem that confidence sets of the form

Cn = {θ ∈ R : Rn(X1, . . . , Xn, θ) ≤ Jn^{−1}(1 − α, P̂n)} ,

which are known as symmetric confidence sets, or

Cn = {θ ∈ R : Jn^{−1}(α/2, P̂n) ≤ Rn(X1, . . . , Xn, θ) ≤ Jn^{−1}(1 − α/2, P̂n)} ,

which are known as equi-tailed confidence sets, satisfy

P {θ(P ) ∈ Cn } → 1 − α (17.6)

for all P ∈ P.
In general, the consistency of the bootstrap is proved in the following
two steps:

1. For some choice of metric (or pseudo-metric) d on the space of prob-


ability measures, it must be known that d(Pn , P ) → 0 implies that
Jn (Pn ) converges weakly to J(P ). That is, the convergence of Jn (P )
to J(P ) must hold in a suitably locally uniform in P manner. After
all, we are replacing P by P̂n so Jn (P ) must be smooth in P . Note
that in Theorem 17.1, the “metric” d that we used involved weak con-
vergence together with convergence of first and second moments, see
Remark 15.4.1 in Lehmann and Romano (2005) for details. However,
other problems may require a different metric.

2. The estimator P̂n must then be known to satisfy d(P̂n , P ) → 0 almost


surely or in probability under P . This is what we proved in the proof
of Theorem 17.2.

17.2.2 Asymptotic Refinements


Note that even a confidence set Cn based off of the asymptotic normality of
either root would satisfy (17.6). It can be shown under certain conditions
(that ensure the existence of so-called Edgeworth expansions of Jn (x, P ))
that one-sided confidence sets Cn based off of such an asymptotic approxi-
mation satisfy

P {θ(P ) ∈ Cn } − (1 − α) = O(n−1/2 ) . (17.7)

One-sided confidence sets based off of the bootstrap and the root
Rn = √n(X̄n − θ(P)) also satisfy (17.7), though there is some evidence to
suggest that the bootstrap does a bit better in terms of the size of the
O(n^{−1/2}) term. On the other hand, one-sided confidence sets based off the
bootstrap-t, i.e., using the root

Rn = √n(X̄n − θ(P)) / σ̂n
as in Remark 17.1, satisfy

P {θ(P ) ∈ Cn } − (1 − α) = O(n−1 ) . (17.8)

Thus, the one-sided coverage error of the bootstrap-t interval is O(n−1 ) and
is of smaller order than that provided by the normal approximation or the
bootstrap based on a nonstudentized root. One-sided confidence sets that

satisfy only (17.7) are said to be first-order accurate, whereas one-sided


confidence sets that satisfy (17.8) are said to be second-order accurate. See
Section 15.5 of Lehmann and Romano (2005) for further details.
A heuristic reason why the bootstrap based on the root (17.3) outper-
forms the bootstrap based on the root (17.4) is as follows. In the case of
(17.4), the bootstrap is estimating a distribution that has mean 0 and un-
known variance σ 2 (P ). The main contribution to the estimation error is the
implicit estimation of σ 2 (P ) by σ 2 (P̂n ). On the other hand, the root (17.3)
has a distribution that is nearly independent of P since it is an asymptotic
pivot.
The bootstrap may also provide a refinement in two-sided tests. For
example, symmetric intervals based on the absolute value of the root in
(17.3) are O(n−2 ), versus the asymptotic approximation that is of order
O(n−1 ). Note that, by construction, such intervals are symmetric about θ̂n .

17.2.3 Implementation of the Bootstrap


Outside certain exceptional cases, the bootstrap approximation Jn (x, P̂n )
cannot be calculated exactly, i.e., it is often not available in closed form.
However, we can approximate this distribution to an arbitrary degree of ac-
curacy by taking samples from P̂n , computing the root for each of these
samples, and then using the empirical distribution of these roots as an
approximation to Jn (x, P̂n ). The usual algorithm used to implement the
bootstrap involves the following steps.

Step 1. Conditional on the data (X1 , . . . , Xn ), draw B samples of size n from


P̂n. Denote the jth sample by

(X*_{1,j}, . . . , X*_{n,j})

for j = 1, . . . , B. When P̂n is the empirical distribution, this amounts


to resampling the original observations in (X1 , . . . , Xn ) with replace-
ment.

Step 2. For each bootstrap sample j, compute the root, i.e.,

R*_{j,n} = Rn(X*_{1,j}, . . . , X*_{n,j}, θ̂n) .

Note that θ(P̂n) = θ̂n, so in the bootstrap distribution the parameter
θ(P) becomes θ̂n.

Step 3. Compute the empirical cdf of (R*_{1,n}, . . . , R*_{B,n}) as

Ln(x) = (1/B) ∑_{j=1}^{B} I{R*_{j,n} ≤ x} .  (17.9)

Step 4. Compute the desired function of Ln (x), for example, a quantile,

Ln^{−1}(1 − α) = inf{x ∈ R : Ln(x) ≥ 1 − α} ,

for a given significance level α.

Remark 17.2 Sampling from P̂n in Step 1 is easy even when P̂n is the
empirical distribution. In that case P̂n is a discrete probability distribution
that puts probability mass 1/n at each sample point (X1, . . . , Xn), so sampling
from P̂n is equivalent to drawing observations (each with probability 1/n) from
the observed data with replacement. In consequence, a bootstrap sample will
likely have some ties and repeated values, which is generally not a problem.
In parametric problems one would simply draw a new sample of size n from
P̂n = P(ψ̂n).
Because B can be taken to be large (assuming enough computing power),
the resulting approximation Ln(x) can be made arbitrarily close to Jn(x, P̂n).
It then follows that the properties of tests and confidence sets based on
Jn^{−1}(1 − α, P̂n) and Ln^{−1}(1 − α) are the same. In practice, values of B on the
order of 1,000 are frequently enough for the approximation to work well.
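A minimal Python sketch of this algorithm for the bootstrap-t root of Remark 17.1, applied to the mean (illustrative names; it returns a symmetric confidence interval based on the quantile of |R*|):

import numpy as np

def bootstrap_t_ci(x, alpha=0.05, B=1000, seed=0):
    # Steps 1-4 with the studentized root R_n = sqrt(n)(mean - theta)/sigma_hat
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = x.mean()
    sigma_hat = x.std(ddof=1)
    roots = np.empty(B)
    for b in range(B):
        xs = rng.choice(x, size=n, replace=True)                           # Step 1: resample from P_hat_n
        roots[b] = np.sqrt(n) * (xs.mean() - theta_hat) / xs.std(ddof=1)   # Step 2: bootstrap root
    c = np.quantile(np.abs(roots), 1 - alpha)                              # Steps 3-4: quantile of |R*|
    half = c * sigma_hat / np.sqrt(n)
    return theta_hat - half, theta_hat + half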

Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.

Lehmann, E. and J. P. Romano (2005): Testing Statistical Hypotheses,


Springer, New York, 3rd ed.

Politis, D. N., J. P. Romano, and M. Wolf (1999): Subsampling,


Springer, New York.
Lecture 18

Subsampling &
Randomization Tests

18.1 Subsampling
Suppose Xi , i = 1, . . . , n is an i.i.d. sequence of random variables with dis-
tribution P ∈ P. Let θ(P ) be some real-valued parameter of interest, and
let θ̂n = θ̂n(X1, . . . , Xn) be some estimate of θ(P). Consider the root

Rn = √n(θ̂n − θ(P)) ,

where a root is a functional depending on both the data and θ(P).
Let Jn(P) denote the sampling distribution of Rn and define the corresponding
cumulative distribution function as,

Jn (x, P ) = P {Rn ≤ x} . (18.1)

We wish to estimate Jn (x, P ) so we can make inferences about θ(P ). For


example, we would like to estimate quantiles of Jn (x, P ), so we can construct
confidence sets for θ(P ). Unfortunately, we do not know P , and, as a result,
we do not know Jn (x, P ).
The bootstrap solved this problem simply by replacing the unknown P
with an estimate P̂n . In the case of i.i.d. data, a typical choice of P̂n is the
empirical distribution of the Xi , i = 1, . . . , n. For this approach to work,
we essentially required that Jn (x, P ) when viewed as a function of P was
continuous in a certain neighborhood of P . An alternative to the bootstrap
known as subsampling, originally due to Politis and Romano (1994), does
not impose this requirement but rather the following much weaker condition.

Assumption 18.1 There exists a limiting law J(P ) such that Jn (P ) con-
verges weakly to J(P ) as n → ∞.


In order to motivate the idea behind subsampling, consider the follow-


ing thought experiment. Suppose for the time being that θ(P ) is known.
Suppose that, instead of n i.i.d. observations from P , we had a very, very
large number of i.i.d. observations from P . For concreteness, suppose Xi , i =
1, . . . , m is an i.i.d. sequence of random variables with distribution P with
m = nk for some very big k. We could then estimate Jn (x, P ) by looking
at the empirical distribution of

√n(θ̂n(X_{n(j−1)+1}, . . . , X_{nj}) − θ(P)),  j = 1, . . . , k .

This is an i.i.d. sequence of random variables with distribution Jn(P).
Therefore, by the Glivenko-Cantelli theorem, we know that this empirical
distribution is a good estimate of Jn(x, P), at least for large k. In fact, with
a simple trick, we could show that it is even possible to improve upon this
estimate by using all possible sets of data of size n from the m observations,
not just those that are disjoint; that is, estimate Jn(x, P) with the empirical
distribution of the

√n(θ̂_{n,j} − θ(P)),  j = 1, . . . , \binom{m}{n} ,

where θ̂_{n,j} is the estimate of θ(P) computed using the jth set of data of size
n from the original m observations.
In practice m = n, so, even if we knew θ(P ), this idea won’t work. The
key idea behind subsampling is the following simple observation: replace n
with some smaller number b that is much smaller than n. We would then
expect

√b(θ̂_{b,j} − θ(P)),  j = 1, . . . , \binom{n}{b} ,

where θ̂_{b,j} is the estimate of θ(P) computed using the jth set of data of
size b from the original n observations, to be a good estimate of Jb(x, P), at
least if \binom{n}{b} is large. Of course, we are interested in Jn(x, P), not Jb(x, P).
We therefore need some way to force Jn(x, P) and Jb(x, P) to be close to
one another. To ensure this, it suffices to assume that Jn(x, P) → J(x, P).
Therefore, Jb(x, P) and Jn(x, P) are both close to J(x, P), and thus close
to one another as well, at least for large b and n. In order to ensure that
both b and \binom{n}{b} are large, at least asymptotically, it suffices to assume that
b → ∞, but b/n → 0.
This procedure is still not feasible because in practice we typically do
not know θ(P). But we can replace θ(P) with θ̂n. This would cause no
problems if

√b(θ̂n − θ(P)) = (√b/√n) · √n(θ̂n − θ(P))

is small, which follows from b/n → 0. The next theorem formalizes the above
discussion.

Theorem 18.1 Assume Assumption 18.1. Also, let Jn(P) denote the sampling
distribution of τn(θ̂n − θ(P)) for some normalizing sequence τn → ∞, let
Nn = \binom{n}{b}, and assume that τb/τn → 0, b → ∞, and b/n → 0 as n → ∞.

i) If x is a continuity point of J(·, P), then Ln,b(x) → J(x, P) in probability,
where

Ln,b(x) = (1/Nn) ∑_{j=1}^{Nn} I{τb(θ̂_{b,j} − θ̂n) ≤ x} .  (18.2)

ii) Let

cn,b(1 − α) = inf{x : Ln,b(x) ≥ 1 − α} ,
c(1 − α, P) = inf{x : J(x, P) ≥ 1 − α} .

If J(·, P) is continuous at c(1 − α, P), then

P{τn(θ̂n − θ(P)) ≤ cn,b(1 − α)} → 1 − α as n → ∞ .  (18.3)

In practice, Nn is too large to actually compute Ln (x), so what one


would do is randomly sample B of the Nn possible data sets of size b and
just use B in place of Nn when computing Ln (x). Provided B = Bn → ∞,
all the conclusions of the theorem remain valid. This approximation step is
similar in spirit to approximating the bootstrap distribution Jn (x, P̂n ) using
simulations from P̂n . In fact, except for the first step, implementing the
bootstrap and subsampling requires the same algorithm. The change in the
first step is as follows:

Step 1: Non-parametric Bootstrap. Conditional on the data (X1 , . . . , Xn ),


draw B samples of size n from the original observations with replace-
ment.

Step 1: Subsampling. Conditional on the data (X1 , . . . , Xn ), draw B sam-


ples of size b from the original observations without replacement.
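A minimal Python sketch of the subsampling approximation Ln,b for the root with τn = √n (illustrative; `stat` maps a sample to the scalar estimate θ̂):

import numpy as np

def subsampling_critical_value(x, stat, b, alpha=0.05, B=1000, seed=0):
    # approximates the (1 - alpha) quantile of the limit law using subsamples of size b
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    theta_hat = stat(x)
    roots = np.empty(B)
    for j in range(B):
        idx = rng.choice(n, size=b, replace=False)   # draw WITHOUT replacement
        roots[j] = np.sqrt(b) * (stat(x[idx]) - theta_hat)
    return np.quantile(roots, 1 - alpha)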

Essentially, all we required was that Jn (x, P ) converged in distribution


to a limit distribution J(x, P ), whereas for the bootstrap we required this
and additionally that Jn (x, P ) was continuous in a certain sense. Show-
ing continuity of Jn (x, P ) was very problem specific. There are examples
where Jn (x, P ) → J(x, P ), but this continuity fails (e.g., the extreme order
statistic). Subsampling would have no problems handling the extreme order
statistic.
Typically, when both the bootstrap and subsampling are valid, the boot-
strap works better in the sense of higher-order asymptotics (see the lecture
notes on the bootstrap), but subsampling is more generally valid.
184 LECTURE 18. SUBSAMPLING & RANDOMIZATION TESTS

There is a variant of the bootstrap known as the m-out-of-n bootstrap.


Instead of using Jn (x, P̂n ) to approximate Jn (x, P ), one uses Jm (x, P̂n )
where m is much smaller than n. If one assumes that m2 /n → 0, then
all the conclusions of the theorem remain valid with Jm (x, P̂n ) in place of
Ln (x). This follows because if m2 /n → 0, then (i) m/n → 0 and (ii) with
probability tending to 1, the approximation to Jm (x, P̂n ) is the same as the
approximation to Ln (x) because the probability of drawing all distinct ob-
servations tends to 1. To see this, note that this probability is simply equal to

n(n − 1)(n − 2) · · · (n − m + 1)/n^m = ∏_{1≤i≤m−1} (1 − i/n) .

Since 1 − i/n ≥ 1 − m/n for i ≤ m − 1, we have that

∏_{1≤i≤m−1} (1 − i/n) ≥ (1 − m/n)^m = (1 − (m²/n)/m)^m .

If m²/n → 0, then for every ϵ > 0 we have that m²/n < ϵ for all n sufficiently
large. Therefore,

(1 − (m²/n)/m)^m > (1 − ϵ/m)^m → exp(−ϵ) .

By choosing ϵ > 0 sufficiently small, we see that the desired probability
converges to 1.

18.2 Randomization Tests


Before we describe the general construction of randomization tests, we start
our discussion in the context of a simple example.

18.2.1 Motivating example: sign changes


Let X = (X1 , . . . , X10 ) ∼ P be an i.i.d. sample of size 10 where each Xi
takes values in R, has a finite mean θ ∈ R, and has a distribution that is
symmetric about θ. Let P be the collection of all distributions P satisfying
these conditions. Consider testing

H0 : θ = 0 vs H1 : θ ̸= 0 .

We only have 10 observations, so using an asymptotic approximation does not
seem fruitful. At the same time, this is more general than the normal
location model where each Xi has distribution N(θ, σ^2), so exploiting
normality is not possible.
Suppose we decided to use the absolute value of X̄10 to test the above
hypothesis. Denote this test statistic by T (X). The question is: how do we
compute a critical value that delivers a valid test? It turns out we can do
this by exploiting symmetry.
To do this, let ϵi take on either the value 1 or −1 for i = 1, . . . , 10. Note
that the distribution of X = (X1, . . . , X10) is symmetric about 0 under the
null hypothesis. Now consider a transformation g = (ϵ1, . . . , ϵ10) of R^10
that defines the following mapping

   (X1, . . . , X10) ↦ gX = (ϵ1 X1, . . . , ϵ10 X10) .   (18.4)

Finally, let G be the collection of all M = 2^10 such transformations. It
follows that the random variables X and gX have the same distribution under
the null hypothesis. What this means is that we can get “new samples” from
P by simply applying g to X. We can get a total of M = 1,024 samples and use
these samples to simulate the distribution of T(X). This approach leads to a
test that is valid in finite samples, as the next section shows.
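
A minimal sketch of this test for simulated data, enumerating all M = 2^10
sign changes, is given below; the data-generating process and the 5% level
are illustrative assumptions.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
x = rng.standard_normal(10)        # symmetric about 0, so H0 holds here
T_obs = abs(x.mean())              # T(X) = |Xbar_10|

# Recompute the statistic for every g = (eps_1, ..., eps_10) in G.
T_g = np.array([abs((np.array(eps) * x).mean())
                for eps in product((-1.0, 1.0), repeat=10)])

# Fraction of transformed statistics at least as large as the observed one;
# rejecting when this fraction is <= alpha is the non-randomized test below.
p_hat = float(np.mean(T_g >= T_obs))
print("p-value:", p_hat, "reject at 5% level:", p_hat <= 0.05)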

18.2.2 The main result


In this section X denotes the observed sample and P denotes the distribution
of the entire sample X (as in the motivating example). Since all results are
finite sample in nature, we do not use an index n to denote the sample size
and do not index objects by n.
Based on data X taking values in a sample space X , it is desired to test
the null hypothesis H0 : P ∈ P0 , where P is the true distribution of X and
P0 is a subset of distributions in the space P. Let G be a finite group of
transformations g : X ↦ X. The following assumption allows for a general
test construction.

Definition 18.1 (Randomization Hypothesis) Under the null hypothesis, the
distribution of X is invariant under the transformations in G; that is, for
every g ∈ G, gX and X have the same distribution whenever X ∼ P ∈ P0.
Note that we do not require the alternative hypothesis parameter space
to remain invariant under g in G. Only the space P0 is assumed invariant.
Let T(X) be any real-valued test statistic for testing H0. Suppose that the
group G has M elements. Given X = x, let

   T^(1)(x) ≤ T^(2)(x) ≤ · · · ≤ T^(M)(x)   (18.5)

be the ordered values of T(gx) as g varies in G. Fix a nominal level α,
0 < α < 1, and let k be defined as

k = ⌈(1 − α)M ⌉ (18.6)



where ⌈C⌉ denotes the smallest integer greater than or equal to C. Let

   M^+(x) = Σ_{j=1}^{M} I{T^(j)(x) > T^(k)(x)} ,
   M^0(x) = Σ_{j=1}^{M} I{T^(j)(x) = T^(k)(x)} .

Now set

   a(x) = (Mα − M^+(x)) / M^0(x) ,   (18.7)

and define the randomization test as

   ϕ(x) = 1      if T(x) > T^(k)(x) ,
   ϕ(x) = a(x)   if T(x) = T^(k)(x) ,   (18.8)
   ϕ(x) = 0      if T(x) < T^(k)(x) .

This test is randomized, and since M^+(x) ≤ M − k ≤ Mα and
M^+(x) + M^0(x) ≥ M − k + 1 > Mα, we have a(x) ∈ [0, 1).
Under the randomization hypothesis, Hoeffding (1952) shows that this
construction results in a test of exact level α, and this is true for any
choice of test statistic T(X). Note that this is possibly a randomized test
if (1 − α)M is not an integer and there are ties in the ordered values.
Alternatively, if one prefers not to randomize, the slightly conservative but
non-randomized test that rejects when T(X) > T^(k)(X), i.e.,

   ϕnr(X) = I{T(X) > T^(k)(X)} ,   (18.9)

is level α.
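
The construction in (18.6)-(18.8) can be coded directly. The sketch below
takes the observed statistic and the M values T(gX) (with the identity
transformation included) and returns the rejection probability ϕ(x); the
function name and interface are illustrative, not from the notes.

import numpy as np

def randomization_test(T_obs, T_group_values, alpha):
    """Return phi(x) in (18.8): 1 (reject), a(x) (randomize), or 0 (accept).

    T_group_values holds the M values T(gx) for g in G, including the
    identity transformation, so T_obs is one of its entries.
    """
    M = len(T_group_values)
    T_sorted = np.sort(T_group_values)       # T^(1)(x) <= ... <= T^(M)(x)
    k = int(np.ceil((1 - alpha) * M))        # (18.6)
    T_k = T_sorted[k - 1]                    # T^(k)(x)
    M_plus = int(np.sum(T_sorted > T_k))
    M_zero = int(np.sum(T_sorted == T_k))
    a = (M * alpha - M_plus) / M_zero        # (18.7), always in [0, 1)
    if T_obs > T_k:
        return 1.0
    if T_obs == T_k:
        return a
    return 0.0

Rejecting only when the returned value equals one corresponds to the
non-randomized test ϕnr in (18.9).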

Theorem 18.2 Suppose that X has distribution P on X and the problem is to
test the null hypothesis P ∈ P0. Let G be a finite group of transformations
of X onto itself. Suppose the randomization hypothesis (Definition 18.1)
holds. Given a test statistic T(X), let ϕ be the randomization test as
described above. Then, ϕ(X) is a similar α level test, i.e.,

EP [ϕ(X)] = α, for all P ∈ P0 .

Proof. By construction, for every x ∈ X,

   Σ_{g∈G} ϕ(gx) = M^+(x) + a(x)M^0(x) = Mα ,

and so

   Mα = EP[ Σ_{g∈G} ϕ(gX) ] = Σ_{g∈G} EP[ϕ(gX)] .

By the randomization hypothesis, EP[ϕ(gX)] = EP[ϕ(X)] for every g ∈ G, so that

   Mα = Σ_{g∈G} EP[ϕ(gX)] = Σ_{g∈G} EP[ϕ(X)] = M EP[ϕ(X)] ,

and the result follows.

Remark 18.1 Note that by construction the randomization test not only has
level α in finite samples, but is also “similar”, meaning that EP[ϕ(X)] is
never below α (it equals α exactly) for any P ∈ P0.
In general, one can define a p-value p̂ of a randomization test by

   p̂ = (1/M) Σ_{g∈G} I{T(gX) ≥ T(X)} .   (18.10)

It can be shown that p̂ satisfies, under the null hypothesis,

P {p̂ ≤ u} ≤ u for all 0 ≤ u ≤ 1 . (18.11)

Therefore, the non-randomized test that rejects when p̂ ≤ α is level α.


Because G may be large, one may resort to an approximation to con-
struct the randomization test, for example, by randomly sampling trans-
formations g from G with or without replacement. In the former case, for
example, suppose g1, . . . , gB−1 are i.i.d. and uniformly distributed on G. Let

   p̃ = (1/B) [ 1 + Σ_{i=1}^{B−1} I{T(gi X) ≥ T(X)} ] .   (18.12)

Then, it can be shown that, under the null hypothesis,

P {p̃ ≤ u} ≤ u for all 0 ≤ u ≤ 1 , (18.13)

where this probability reflects variation in both X and the sampling of the
gi .
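
A sketch of this stochastic approximation for the sign-change example above
follows; the data and the choice B = 1000 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)

def T(v):
    # test statistic T(X) = |Xbar|
    return abs(v.mean())

# Sample B - 1 transformations g uniformly from G (here, random sign changes).
B = 1000
count = sum(T(rng.choice((-1.0, 1.0), size=x.size) * x) >= T(x)
            for _ in range(B - 1))
p_tilde = (1 + count) / B          # as in (18.12); reject H0 when p_tilde <= alpha
print("approximate p-value:", p_tilde)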

18.2.3 Special case: Permutation tests


Probably the most popular application of randomization tests in economics is
the class of so-called permutation tests, which are just a special case of
the general construction we just described.

Two sample problem


Suppose that Y1 , . . . , Ym are i.i.d. observations from a distribution PY and,
independently, Z1 , . . . , Zn are i.i.d. observations from a distribution PZ . In
other words, we have two samples that are not paired, i.e., Z1 and Y1 do not
correspond to the same “unit”. Here X is given by

X = (Y1 , . . . , Ym , Z1 , . . . , Zn ) .

Consider testing
H0 : PY = PZ vs H1 : PY ̸= PZ .
To describe an appropriate group of transformations G, let N = m + n.
For x = (x1, . . . , xN) ∈ R^N, let gx ∈ R^N be defined by

   (x1, . . . , xN) ↦ gx = (xπ(1), . . . , xπ(N)) ,   (18.14)

where (π(1), . . . , π(N)) is a permutation of {1, . . . , N}. Let G be the
collection of all such g, so that M = N!. It follows that whenever PY = PZ,
X and gX have the same distribution.
In essence, each transformation g produces a new data set gx, of which the
first m elements are used as the Y sample and the remaining n as the Z sample
to recompute the test statistic. Note that, if a test statistic is chosen
that is invariant under permutations within each of the Y and Z samples, like
Ȳm − Z̄n, it is enough to consider the (N choose m) transformed data sets
obtained by taking m observations from all N as the Y observations and the
remaining n as the Z observations (which, of course, is equivalent to using a
subgroup G′ of G).
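
A sketch of this permutation test with the statistic |Ȳm − Z̄n|, using B
random permutations rather than all N! of them, is given below; the sample
sizes and data are simulated for illustration only.

import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=8)             # Y_1, ..., Y_m with m = 8
z = rng.normal(size=12)            # Z_1, ..., Z_n with n = 12
pooled = np.concatenate([y, z])
m = y.size

def stat(v):
    # |Ybar - Zbar| computed on the first m and last n entries of v
    return abs(v[:m].mean() - v[m:].mean())

B = 4999
T_obs = stat(pooled)
count = sum(stat(rng.permutation(pooled)) >= T_obs for _ in range(B - 1))
p_tilde = (1 + count) / B          # approximate permutation p-value, as in (18.12)
print("permutation p-value:", p_tilde)

Essentially the same code covers the treatment-effects example in the next
subsection when the difference-in-means statistic is used, since permuting
the treatment indicators D is then equivalent to permuting the pooled outcomes.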

Treatment effects
Suppose that we observe a random sample {(Y1 , D1 ), . . . , (Yn , Dn )} from a
randomized controlled trial where

Y = Y (1)D + (1 − D)Y (0)

is the observed outcome and D ∈ {0, 1} is the exogenous treatment assignment.
Here, (Y(0), Y(1)) are the usual potential outcomes. Suppose that we are
interested in testing the hypothesis that the distribution Q0 of Y(0) is the
same as the distribution Q1 of Y(1). That is,

   H0 : Q0 = Q1 vs. H1 : Q0 ̸= Q1 .   (18.15)

Under the null hypothesis in (18.15), it follows that the distributions of
{(Y1, D1), . . . , (Yn, Dn)} and {(Y1, Dπ(1)), . . . , (Yn, Dπ(n))} are the same
for any permutation (π(1), . . . , π(n)) of {1, . . . , n}, and so a permutation
test that permutes individuals from “treatment” to “control” (or from
“control” to “treatment”) delivers a test that is valid in finite samples.

However, researchers are often interested in hypotheses about the average
treatment effect (ATE) as opposed to those in (18.15). For example, consider

   H0 : E[Y(1)] = E[Y(0)] vs. H1 : E[Y(1)] ̸= E[Y(0)] .   (18.16)

In this case, one may still consider the permutation test that results from
considering all possible permutations of the vector of treatment assignments
(D1, . . . , Dn). Unfortunately, such an approach does not lead to a valid test
and may over-reject in finite samples. These tests may be asymptotically
valid, though, if one carefully chooses an appropriate test statistic.
The distinction between the null hypothesis in (18.15) and that in (18.16),
and its implications for the properties of permutation tests, are often
ignored in applied research.
Randomization tests are often dismissed in applied research due to the belief
that the randomization hypothesis is too strong to hold in a real empirical
application. For example, the distribution P may not be symmetric in
hypotheses about the mean of X. However, it turns out that the randomization
test is asymptotically valid (under certain conditions), even when P is not
symmetric. See Bugni et al. (2018) for an example in the context of
randomized controlled experiments. Moreover, recent developments on the
asymptotic properties of randomization tests show that such a construction
may be particularly useful in regression models with a fixed and small number
of clusters; see Canay et al. (2017). That approach does not require symmetry
in the distribution of X, but rather symmetry in the asymptotic distribution
of θ̂n, which automatically holds when these estimators are asymptotically
normal. We cover these topics in Econ 481.

Bibliography
Bugni, F. A., I. A. Canay, and A. M. Shaikh (2018): “Inference under
Covariate Adaptive Randomization,” Journal of the American Statistical
Association, 113, 1784–1796.

Canay, I. A., J. P. Romano, and A. M. Shaikh (2017): “Randomization Tests
under an Approximate Symmetry Assumption,” Econometrica, 85, 1013–1030.

Lehmann, E. and J. P. Romano (2005): Testing Statistical Hypotheses,
Springer, New York, 3rd ed.

Politis, D. N., J. P. Romano, and M. Wolf (1999): Subsampling, Springer,
New York.

Bibliography
Abadie, A., A. Diamond, and J. Hainmueller (2010): “Synthetic control
methods for comparative case studies: Estimating the effect of California’s
tobacco control program,” Journal of the American Statistical Association,
105, 493–505.

Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Consistent
Covariance Matrix Estimation,” Econometrica, 59, 817–858.

Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics: An
empiricist’s companion, Princeton University Press.

Arellano, M. (2003): Panel Data Econometrics, Oxford University Press.

Bell, R. M. and D. F. McCaffrey (2002): “Bias reduction in standard errors
for linear regression with multi-stage samples,” Survey Methodology, 28,
169–182.

Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market
Equilibrium,” Econometrica, 63, 841–890.

Billingsley, P. (1995): Probability and Measure, Wiley-Interscience.

Bugni, F. A., I. A. Canay, and A. M. Shaikh (2018): “Inference under
Covariate Adaptive Randomization,” Journal of the American Statistical
Association, 113, 1784–1796.

Calonico, S., M. D. Cattaneo, and R. Titiunik (2014): “Robust Nonparametric
Confidence Intervals for Regression-Discontinuity Designs,” Econometrica,
82, 2295–2326.

Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008): “Bootstrap-based
improvements for inference with clustered errors,” The Review of Economics
and Statistics, 90, 414–427.

Canay, I. A., J. P. Romano, and A. M. Shaikh (2017): “Randomization Tests
under an Approximate Symmetry Assumption,” Econometrica, 85, 1013–1030.

Canay, I. A., A. Santos, and A. M. Shaikh (2021): “The wild bootstrap with a
“small” number of “large” clusters,” Review of Economics and Statistics,
103, 346–363.

Card, D. and A. B. Krueger (1994): “Minimum wages and employment: a case
study of the fast-food industry in New Jersey and Pennsylvania,” The
American Economic Review, 84, 772–793.

Chetverikov, D., Z. Liao, and V. Chernozhukov (2016): “On Cross-Validated
LASSO,” arXiv preprint arXiv:1605.02214.

Conley, T. G. and C. R. Taber (2011): “Inference with “difference in
differences” with a small number of policy changes,” The Review of Economics
and Statistics, 93, 113–125.

De Chaisemartin, C. and X. D’Haultfoeuille (2022): “Two-way fixed effects
and differences-in-differences with heterogeneous treatment effects: A
survey,” Tech. rep., National Bureau of Economic Research.

Fan, J., J. Lv, and L. Qi (2011): “Sparse High-Dimensional Models in
Economics,” Annual Review of Economics, 291–317.

Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.

Hansen, B. E. and S. Lee (2019): “Asymptotic theory for clustered samples,”
Journal of Econometrics, 210, 268–290.

Hansen, C. B. (2007): “Asymptotic properties of a robust variance matrix
estimator for panel data when T is large,” Journal of Econometrics, 141,
597–620.

Horowitz, J. L. (2015): “Variable selection and estimation in
high-dimensional models,” Canadian Journal of Economics, 48, 389–407.

Ibragimov, R. and U. K. Müller (2010): “t-Statistic based correlation and
heterogeneity robust inference,” Journal of Business & Economic Statistics,
28, 453–468.

Imbens, G. and K. Kalyanaraman (2012): “Optimal bandwidth choice for the
regression discontinuity estimator,” The Review of Economic Studies, 933–959.

Imbens, G. W. and M. Kolesar (2012): “Robust standard errors in small
samples: some practical advice,” Tech. rep., National Bureau of Economic
Research.

Knight, K. and W. Fu (2000): “Asymptotics for lasso-type estimators,” The
Annals of Statistics, 28, 1356–1378.

Lehmann, E. and J. P. Romano (2005): Testing Statistical Hypotheses,
Springer, New York, 3rd ed.

MacKinnon, J. G. and H. White (1985): “Some heteroskedasticity-consistent
covariance matrix estimators with improved finite sample properties,”
Journal of Econometrics, 29, 305–325.

Manski, C. F. (1988): “Identification of Binary Response Models,” Journal of
the American Statistical Association, 83, 729–738.

Mikusheva, A. (2007): “Course materials for Time Series Analysis,” MIT
OpenCourseWare, Massachusetts Institute of Technology.

Newey, W. K. and K. D. West (1987): “A Simple, Positive Semi-definite,
Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,”
Econometrica, 55, 703–08.

Politis, D. N., J. P. Romano, and M. Wolf (1999): Subsampling, Springer,
New York.

Wang, H., B. Li, and C. Leng (2009): “Shrinkage tuning parameter selection
with a diverging number of parameters,” Journal of the Royal Statistical
Society. Series B: Statistical Methodology, 71, 671–683.

White, H. (1980): “A heteroskedasticity-consistent covariance matrix
estimator and a direct test for heteroskedasticity,” Econometrica, 817–838.

Wooldridge, J. M. (2010): Econometric analysis of cross section and panel
data, MIT Press.

Zhao, P. and B. Yu (2006): “On Model Selection Consistency of Lasso,” The
Journal of Machine Learning Research, 7, 2541–2563.

Zou, H. (2006): “The adaptive lasso and its oracle properties,” Journal of
the American Statistical Association, 101, 1418–1429.
