Lecture Notes 480
Introduction to Econometrics
SPRING 2022
Ver. May 23, 2022
Northwestern University
Lecture Notes by
IVAN A. CANAY
Department of Economics
Northwestern University
1 Linear Regression
1.1 Interpretations of the Linear Regression Model
1.1.1 Interpretation 1: Linear Conditional Expectation
1.1.2 Interpretation 2: "Best" Linear Approximation to the Conditional Expectation or "Best" Linear Predictor
1.1.3 Interpretation 3: Causal Model
1.2 Linear Regression when E[XU] = 0
1.2.1 Solving for β
1.3 Estimating β
1.3.1 Ordinary Least Squares
1.3.2 Projection Interpretation
4 Endogeneity
4.1 Instrumental Variables
4.1.1 Partition of β: solve for endogenous components
4.2 Estimating β
4.2.1 The Instrumental Variables (IV) Estimator
4.2.2 The Two-Stage Least Squares (TSLS) Estimator
4.3 Properties of the TSLS Estimator
4.3.1 Consistency
4.3.2 Limiting Distribution
4.3.3 Estimation of V
5 More on Endogeneity
5.1 Efficiency of the TSLS Estimator
5.2 "Weak" Instruments
5.3 Interpretation under Heterogeneity
5.3.1 Monotonicity in Latent Index Models
5.3.2 IV in Randomized Experiments
6 GMM & EL
6.1 Generalized Method of Moments
6.1.1 Over-identified Linear Model
6.1.2 The GMM Estimator
6.1.3 Consistency
6.1.4 Asymptotic Normality
6.1.5 Estimation of the Efficient Weighting Matrix
6.1.6 Overidentification Test
6.2 Empirical Likelihood
6.2.1 Asymptotic Properties and First Order Conditions
7 Panel Data
7.1 Fixed Effects
7.1.1 First Differences
7.1.2 Deviations from Means
7.1.3 Asymptotic Properties
7.2 Random Effects
7.3 Dynamic Models
8 Difference in Differences
8.1 A Simple Two by Two Case
8.1.1 Pre and post comparison
8.1.2 Treatment and control comparison
8.1.3 Taking both differences
8.1.4 A linear regression representation with individual data
8.2 A More General Case
II Some Topics
9 Non-Parametric Regression
9.1 Setup
9.2 Nearest Neighbor vs. Binned Estimator
9.3 Nadaraya-Watson Kernel Estimator
9.3.1 Asymptotic Properties
9.4 Local Linear Estimator
9.4.1 Nadaraya-Watson vs Local Linear Estimator
9.5 Related Methods
12 LASSO
12.1 High Dimensionality and Sparsity
12.2 LASSO
12.2.1 Theoretical Properties of the LASSO
12.3 Adaptive LASSO
12.4 Penalties for Model Selection Consistency
12.5 Choosing lambda
12.6 Concluding Remarks
17 Bootstrap
17.1 Confidence Sets
17.1.1 Pivots and Asymptotic Pivots
17.1.2 Asymptotic Approximations
17.2 The Bootstrap
Lecture 1
Linear Regression
E[XU ] = 0 .
In other words, we imagine two potential health status variables (Y (0), Y (1))
where Y (0) is the value of the outcome that would have been observed if
(possibly counter-to-fact) X were 0; and Y (1) is the value of the outcome
that would have been observed if (possibly counter-to-fact) X were 1.
The difference Y (1)−Y (0) is called the treatment effect, and the quantity
E[Y (1) − Y (0)] is usually referred to as the average treatment effect. Using
this notation, we may rewrite the observed outcome as
Y = β0 + β1 X + U ,
where
β0 = E[Y(0)],  β1 = Y(1) − Y(0),  U = Y(0) − E[Y(0)] .
We thus end up with a linear constant-effect causal model with U ⊥⊥ X (from the nature
of the randomized experiment), E[U ] = 0, and so E[XU ] = 0. Notice that,
in order to have a linear causal model a randomized controlled experiment
is not enough; we also need a constant treatment effect. Without such an
assumption it can be shown that a regression of Y on X identifies the average
treatment effect (ATE). The ATE is often interpreted as a causal parameter
because it is an average of causal effects.
Y = X ′β + U .
Lemma 1.1 Let X be a random vector such that E[XX ′ ] < ∞. Then
E[XX ′ ] is invertible if and only if there is no perfect collinearity in X.
Proof: We first argue that if E[XX ′ ] is invertible, then there is no perfect
collinearity in X. To see this, suppose there is perfect collinearity in X,
i.e., that there exists a nonzero c ∈ Rk+1 such that P {c′ X = 0} = 1. Note
that E[XX ′ ]c = E[X(X ′ c)] = 0. Hence, the columns of E[XX ′ ] are linearly
dependent, i.e., E[XX ′ ] is not invertible.
We now argue that if there is no perfect collinearity in X, then E[XX ′ ] is
invertible. To see this, suppose E[XX ′ ] is not invertible. Then, the columns
of E[XX ′ ] must be linearly dependent, i.e., there exists nonzero c ∈ Rk+1
such that E[XX′]c = 0. This implies further that c′E[XX′]c = E[(c′X)²] = 0, which in turn implies that P{c′X = 0} = 1, i.e., that there is perfect collinearity in X.
The first assumption above together with the fact that U = Y − X′β implies that E[X(Y − X′β)] = 0, i.e., E[XY] = E[XX′]β. Since E[XX′] is invertible, it follows that β = E[XX′]⁻¹E[XY].
1.3 Estimating β
Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that
Y = X ′β + U .
i.e.,
(1/n) Σ_{1≤i≤n} Xi Yi = (1/n) Σ_{1≤i≤n} Xi Xi′ β̂n .
The matrix (1/n) Σ_{1≤i≤n} Xi Xi′ is the sample analog of E[XX′]; provided it is invertible, this system can be solved for β̂n. It is convenient to use the following matrix notation:
Y = (Y1, . . . , Yn)′
X = (X1, . . . , Xn)′
Ŷ = (Ŷ1, . . . , Ŷn)′ = Xβ̂n
U = (U1, . . . , Un)′
Û = (Û1, . . . , Ûn)′ = Y − Ŷ = Y − Xβ̂n .
In this notation,
β̂n = (X′ X)−1 X′ Y
and may be equivalently described as the solution to
min_{b∈R^{k+1}} |Y − Xb|² .
Hence, Xβ̂n is the vector in the column space of X that is closest (in terms
of Euclidean distance) to Y. From the above, we see that X′Û = 0; thus Û is orthogonal to all of the columns of X (and thus orthogonal to all of the vectors in the column space of X). In this sense, Xβ̂n is the orthogonal projection of Y onto the column space of X.
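As a rough illustration (not part of the original notes; the simulated data and variable names are ours), a minimal Python sketch of the OLS formula and of the orthogonality X′Û = 0:

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 2

# Design matrix with a constant as the first component, as in the notes.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(size=n)

# OLS: beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual vector is orthogonal to every column of X (up to rounding).
U_hat = Y - X @ beta_hat
print(beta_hat)
print(X.T @ U_hat)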
Lecture 2

More on Linear Regression¹

¹This lecture is based on Azeem Shaikh's lecture notes. I want to thank him for kindly sharing them.

Y = X1′ β1 + X2′ β2 + U .
where the first equality follows from the formula for β̃1 , the second equality
follows from the expression for Ỹ , and the third equality follows from the
fact that E[X̃1 X2′ ] = 0 (because X̃1 is the error term from a regression of X1
on X2 ). Note that this first part of the derivation shows that β̃1 is also the
population coefficient of a linear regression of Y on X̃1 . If we now replace
Y by its expression and do some additional steps, we get
where the first equality follows from the expression for Y , the second equal-
ity follows from the fact that E[X̃1 X2′ ] = 0 and E[X̃1 U ] = 0 (because
E[XU ] = 0), the third equality follows from the expression for X̃1 , and the
final equality follows from the fact that E[X̃1 X2′ ] = 0.
In other words, β1 in the linear regression of Y on X1 and X2 is equal to
the coefficient in a linear regression of the error term from a linear regression
of Y on X2 on the error terms from a linear regression of the components
of X1 on X2 . This gives meaning to the common description of β1 as the
“effect” of X1 on Y after “controlling for X2 .”
Notice that if we take X2 to be just a constant, then Ỹ = Y − E[Y ] and
X̃1 = X1 − E[X1 ]. Hence,
β1 = E[(X1 − E[X1 ])(X1 − E[X1 ])′ ]−1 E[(X1 − E[X1 ])(Y − E[Y ])]
= Var[X1 ]−1 Cov[X1 , Y ] .
Finally, also note that if we use our formula to interpret the coefficient
βj associated with the jth covariate for 1 ≤ j ≤ k, we obtain
βj = Cov[X̃j, Y] / Var[X̃j] ,   (2.1)
where X̃j is the error term from a linear regression of Xj on all of the other regressors.
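A small numerical check of this partialling-out interpretation, as a sketch with simulated data (the design and names are ours): the coefficient on X1 in the long regression equals the coefficient from regressing the residualized Ỹ on the residualized X̃1.

import numpy as np

rng = np.random.default_rng(1)
n = 1000
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant plus one control
X1 = 0.5 * X2[:, 1] + rng.normal(size=n)                 # correlated with X2
Y = 1.0 + 2.0 * X1 - 1.0 * X2[:, 1] + rng.normal(size=n)

def ols(A, b):
    return np.linalg.solve(A.T @ A, A.T @ b)

# Long regression of Y on (X2, X1): the last coefficient is beta_1.
beta_long = ols(np.column_stack([X2, X1]), Y)

# FWL: residualize X1 and Y on X2, then regress residual on residual.
X1_tilde = X1 - X2 @ ols(X2, X1)
Y_tilde = Y - X2 @ ols(X2, Y)
beta_fwl = (X1_tilde @ Y_tilde) / (X1_tilde @ X1_tilde)

print(beta_long[-1], beta_fwl)   # identical up to rounding error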
Y = X1′ β1 + X2′ β2 + U .
In other words, β̂1,n can be obtained by estimating via OLS the coefficients
from a linear regression of M2 Y on M2 X1 . Upon recognizing that M2 Y are
the residuals from a regression of Y on X2 and that the columns of M2 X1 are
the residuals from regressions of the columns of X1 on X2 , we see that this
formula exactly parallels the formula we derived earlier for a sub-vector of
β. This result is sometimes referred to as the Frisch-Waugh-Lovell (FWL)
decomposition.
2.3 Properties of LS
Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that
Y = X ′β + U .
2.3.1 Bias
Suppose in addition that E[U |X] = 0. Equivalently, assume that E[Y |X] =
X ′ β. Under this stronger assumption,
E[β̂n ] = β .
In fact,
E[β̂n |X1 , . . . , Xn ] = β .
To see this, note that
β̂n = (X′X)⁻¹X′Y = β + (X′X)⁻¹X′U .
Hence,
E[β̂n |X1 , . . . , Xn ] = β + (X′X)⁻¹X′ E[U|X1 , . . . , Xn ] .
Note for any 1 ≤ i ≤ n that
E[Ui |X1 , . . . , Xn ] = E[Ui |Xi ] = 0 ,
where the first equality follows from the fact that Xj is independent of Ui for j ≠ i, and the second follows from E[U|X] = 0. The desired conclusion thus follows.
E[A′ Y|X1 , . . . , Xn ] = β .
When A′ = (X′ X)−1 X′ , this last expression is simply (X′ X)−1 σ 2 . It therefore
suffices to show that
A′ A − (X′ X)−1
is positive semi-definite for all matrices A satisfying A′ X = I. To this end,
define
C = A − X(X′ X)−1 .
Then,
X′ C = X′ A − X′ X(X′ X)−1 = I − I = 0 .
The desired conclusion thus follows from the fact that C′ C is positive semi-
definite by construction.
2.3.3 Consistency
In this case we do not need additional assumptions. Note that E[XY ] < ∞
since XY = XX ′ β + XU , and both E[XX ′ ] and E[XU ] exist. Under this
assumption, the OLS estimator β̂n is consistent for β, i.e., β̂n →P β as n → ∞. To see this, simply note that by the WLLN
(1/n) Σ_{1≤i≤n} Xi Xi′ →P E[XX′]
(1/n) Σ_{1≤i≤n} Xi Yi →P E[XY]
as n → ∞, and the desired conclusion then follows from the CMT. Under the additional assumption that Var[XU] = E[XX′U²] < ∞, we also have √n(β̂n − β) →d N(0, V) as n → ∞, where
V = E[XX′]⁻¹ E[XX′U²] E[XX′]⁻¹ .
To see this, note that
√n(β̂n − β) = ((1/n) Σ_{1≤i≤n} Xi Xi′)⁻¹ (1/√n) Σ_{1≤i≤n} Xi Ui .
2.4 Estimation of V
In order to make use of the preceding estimators, we will require a consistent
estimator of
V = E[XX ′ ]−1 E[XX ′ U 2 ]E[XX ′ ]−1 .
Note that V has the so-called sandwich form. As with most sandwich esti-
mators, the interesting object is the “meat” and not the “bread”. Indeed,
the bread can be consistently estimated by (2.2).
Focusing our attention to the meat, we first consider the case where
E[U |X] = 0 and Var[U |X] = σ 2 (i.e., under homoskedasticity). Under these
conditions,
Var[XU ] = E[XX ′ U 2 ] = E[XX ′ ]σ 2 .
Hence,
V = E[XX ′ ]−1 σ 2 .
A natural choice of estimator is therefore
V̂n = ((1/n) Σ_{1≤i≤n} Xi Xi′)⁻¹ σ̂n² ,
where σ̂n² = (1/n) Σ_{1≤i≤n} Ûi².
Note that
Ûi = Yi − Xi′ β̂n = Ui − Xi′ (β̂n − β) ,
so
Ûi2 = (Ui − Xi′ (β̂n − β))2 = Ui2 − 2Ui Xi′ (β̂n − β) + (Xi′ (β̂n − β))2 .
as n → ∞. Next, note that the WLLN and CMT imply further that
(1/n) Σ_{1≤i≤n} Ui Xi′ (β̂n − β) = (β̂n − β)′ (1/n) Σ_{1≤i≤n} Xi Ui = oP(1) .
Similarly,
(1/n) Σ_{1≤i≤n} (Xi′(β̂n − β))² ≤ (1/n) Σ_{1≤i≤n} |Xi|² |β̂n − β|² ,
which tends in probability to zero because of the WLLN, CMT and the fact
that E[|X|2 ] < ∞ (which follows from the fact that E[XX ′ ] < ∞). The
desired conclusion thus follows.
When we do not assume Var[U |X] = σ 2 , a natural choice of estimator is
V̂n = ((1/n) Σ_{1≤i≤n} Xi Xi′)⁻¹ ((1/n) Σ_{1≤i≤n} Xi Xi′ Ûi²) ((1/n) Σ_{1≤i≤n} Xi Xi′)⁻¹ .   (2.3)
Later in the class we will prove that this estimator is consistent, i.e.,
P
V̂n → V as n → ∞ ,
regardless of the functional form of Var[U |X]. This estimator is called the
Heteroskedasticity Consistent (HC) estimator of V. The standard errors
used to construct t-statistics are the square roots of the diagonal elements
of V̂n , and this is the topic of the third part of this class. It is important to
note that, by default, Stata reports homoskedastic-only standard errors.
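As an illustration (a sketch with simulated data; names are ours, and this is the basic HC0 variant), the estimator in (2.3) and the implied standard errors can be computed as follows:

import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
U = rng.normal(size=n) * (1.0 + 0.8 * np.abs(X[:, 1]))   # heteroskedastic errors
Y = X @ np.array([1.0, 2.0]) + U

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
U_hat = Y - X @ beta_hat

# Sandwich: bread = ((1/n) sum X_i X_i')^{-1}, meat = (1/n) sum X_i X_i' Uhat_i^2.
bread = np.linalg.inv(X.T @ X / n)
meat = (X * U_hat[:, None] ** 2).T @ X / n
V_hat = bread @ meat @ bread

# Since sqrt(n)(beta_hat - beta) is approximately N(0, V), SE = sqrt(diag(V_hat)/n).
se = np.sqrt(np.diag(V_hat) / n)
print(beta_hat, se)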
Here, T SS is short for total sum of squares, ESS is short for explained sum
of squares, and SSR is short for sum of squared residuals. To show that
the two expressions for R2 are the same, and that 0 ≤ R2 ≤ 1, it suffices to
show that
SSR + ESS = T SS .
Moreover, R2 = 1 if and only if SSR = 0, i.e., Ûi = 0 for all 1 ≤ i ≤ n.
Similarly, R2 = 0 if and only if ESS = 0, i.e., Ŷi = Ȳn for all 1 ≤ i ≤ n. In
this sense, R² is a measure of the "fit" of a regression.
Note that (1/n) Σ_{1≤i≤n} (Yi − Ȳn)² may be viewed as an estimator of Var[Yi] and (1/n) Σ_{1≤i≤n} Ûi² may be viewed as an estimator of Var[Ui]. Thus, R² may be viewed as an estimator of the quantity
1 − Var[Ui] / Var[Yi] .
Lecture 3

Basic Inference and Endogeneity¹

3.1 Inference
Let (Y, X, U ) be a random vector where Y and U take values in R and X
takes values in Rk+1 . Assume further that the first component of X is a
constant equal to one. Let β ∈ Rk+1 be such that
Y = X ′β + U .
Suppose that E[XU ] = 0, that there is no perfect collinearity in X, that
E[XX ′ ] < ∞, and Var[XU ] < ∞. Denote by P the marginal distribution of
(Y, X). Let (Y1 , X1 ), . . . , (Yn , Xn ) be an i.i.d. sample of random vectors with
distribution P . Under these assumptions, we established the asymptotic
normality of the OLS estimator β̂n ,
√n(β̂n − β) →d N(0, V)   (3.1)
with
V = E[XX ′ ]−1 E[XX ′ U 2 ]E[XX ′ ]−1 . (3.2)
We also described a consistent estimator V̂n of the limiting variance V. We
now use these results to develop methods for inference. We will study in
particular Wald tests for certain hypotheses. Some other testing principles
will be covered later in class. Confidence regions will be constructed using
the duality between hypothesis testing and the construction of confidence
regions.
Below we will assume further that Var[XU ] = E[XX ′ U 2 ] is non-singular.
This would be implied, for example, by the assumption that P {E[U 2 |X] >
0} = 1. Since E[XX ′ ] is non-singular under the assumption of no perfect
collinearity in X, this implies that V is non-singular.
¹This lecture is based on Azeem Shaikh's lecture notes. I want to thank him for kindly sharing them.
3.1.1 Background
Consider the following somewhat generic version of a testing problem. One
observes data Wi = (Yi , Xi ), i = 1, . . . , n, i.i.d. with distribution P ∈ P =
{Pβ : β ∈ Rk+1 } and wishes to test
H0 : β ∈ B0 versus H1 : β ∈ B1 (3.3)
where B0 and B1 form a partition of Rk+1 . In our context, β will be
the coefficient in a linear regression but in general it could be any other
parameter.
A test is simply a function ϕn = ϕn (W1 , . . . , Wn ) that returns the prob-
ability of rejecting the null hypothesis after observing W1 , . . . , Wn . For the
time being, we will only consider non-randomized tests which means that
the function ϕn will take only two values: it will be equal to 1 for rejection
and equal to 0 for non rejection. Most often, ϕn is the indicator function of a
certain test statistic Tn = Tn (W1 , . . . , Wn ) being greater than some critical
value cn(1 − α), that is,
ϕn = I {Tn > cn (1 − α)} . (3.4)
The test is said to be (pointwise) asymptotically of level α (or consistent in levels) if
lim sup_{n→∞} E_{Pβ}[ϕn] = lim sup_{n→∞} Pβ{ϕn = 1} ≤ α for all β ∈ B0 .
Such tests include: Wald tests, quasi-likelihood ratio tests, and Lagrange
multiplier tests.
H0 : r′ β ≤ c versus H1 : r′ β > c .
P {βs ∈ Cn } → 1 − α
and a suitable choice of critical value is cp,1−α , the 1 − α quantile of χ2p . The
resulting test is consistent in level.
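A sketch of such a Wald test for p linear restrictions Rβ = c (the numerical inputs below are illustrative placeholders, not estimates from the notes):

import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, V_hat, n, R, c, alpha=0.05):
    # T_n = n (R beta_hat - c)'(R V_hat R')^{-1}(R beta_hat - c) ~ chi^2_p under H0.
    diff = R @ beta_hat - c
    Tn = n * diff @ np.linalg.solve(R @ V_hat @ R.T, diff)
    crit = chi2.ppf(1 - alpha, df=R.shape[0])
    return Tn, crit, Tn > crit

# Illustrative inputs; in practice beta_hat and V_hat come from OLS and the
# heteroskedasticity-consistent estimator of the previous lecture.
beta_hat = np.array([1.02, 1.95, -0.03])
V_hat = np.diag([1.0, 1.2, 0.9])
R = np.array([[0.0, 0.0, 1.0]])          # H0: the third coefficient equals zero
c = np.array([0.0])
print(wald_test(beta_hat, V_hat, n=1000, R=R, c=c))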
Note that by using the duality between hypothesis testing and the construction of confidence regions, we may construct a confidence region Cn for β satisfying
P{β ∈ Cn} → 1 − α .
Y = X ′β + U .
and so
E[XX ′ ]−1 E[XY ] = β + E[XX ′ ]−1 E[XU ] .
The results from the previous class showed that the least squares estimator β̂n converges in probability to E[XX′]⁻¹E[XY]. It follows that
β̂n →P β + E[XX′]⁻¹E[XU] ,
Omitted Variables
Suppose k = 2, so
Y = β0 + β1 X1 + β2 X2 + U .
We are interpreting this regression as a causal model and are willing to
assume that E[XU ] = 0 (i.e., E[U ] = E[X1 U ] = E[X2 U ] = 0), but X2 is
unobserved. An example of a situation like this is when Y is wages, X1 is
education, and X2 is ability. Given unobserved ability, we may rewrite this
model as
Y = β0∗ + β1∗ X1 + U ∗ ,
with
β0∗ = β0 + β2 E[X2 ]
β1∗ = β1
U ∗ = β2 (X2 − E[X2 ]) + U .
In this model,
E[X1 U*] = β2 Cov[X1, X2] ,
and therefore
β̂*_{1,n} →P β1 + β2 Cov[X1, X2] / Var[X1] ,   (3.6)
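A quick simulation of (3.6) (the numbers are our own illustration): omitting X2 shifts the probability limit of the short-regression slope by β2 Cov[X1, X2]/Var[X1].

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
X2 = rng.normal(size=n)
X1 = 0.6 * X2 + rng.normal(size=n)             # Cov[X1, X2] = 0.6
beta0, beta1, beta2 = 1.0, 2.0, 1.5
Y = beta0 + beta1 * X1 + beta2 * X2 + rng.normal(size=n)

# Short regression of Y on a constant and X1 only (X2 omitted).
X = np.column_stack([np.ones(n), X1])
b_short = np.linalg.solve(X.T @ X, X.T @ Y)

implied = beta1 + beta2 * np.cov(X1, X2)[0, 1] / np.var(X1)
print(b_short[1], implied)                      # both close to 2 + 1.5*0.6/1.36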
Measurement Error
Partition X into X0 and X1 , where X0 = 1 and X1 takes values in Rk .
Partition β analogously. In this notation,
Y = β0 + X1′ β1 + U .
with
β0∗ = β0
β1∗ = β1
U ∗ = −V ′ β1 + U .
In this model,
Simultaneity
A classical example of simultaneity is given by supply and demand. De-
note by Qs the quantity supplied and by Qd the quantity demanded. As a
function of (non-market clearing) price P̃ , assume that
Qd = β0d + β1d P̃ + U d
Qs = β0s + β1s P̃ + U s ,
so, imposing market clearing Qd = Qs at the equilibrium price P,
P = (β0d − β0s + U d − U s) / (β1s − β1d) .
It follows that P is endogenous in both of the equations
Q = β0d + β1d P + U d
Q = β0s + β1s P + U s
because
E[P U d] = Var[U d] / (β1s − β1d)
E[P U s] = −Var[U s] / (β1s − β1d) .
Lecture 4

Endogeneity¹
Y = X ′β + U .
E[ZY ] = E[ZX ′ ]β .
¹This lecture is based on Azeem Shaikh's lecture notes. I want to thank him for kindly sharing them.
for any conformable matrices A and B. Applying this result, we see that
Hence,
rank(E[ZX ′ ]) = rank(E[ZZ ′ ]Π) = rank(Π) ,
as desired.
To complete the proof, note that Π′E[ZX′] = Π′E[ZZ′]Π and argue that Π′E[ZZ′]Π is invertible using arguments given earlier.
4.2 Estimating β
Let (Y, X, Z, U ) be a random vector where Y and U take values in R, X
takes values in Rk+1 , and Z takes values in Rℓ+1 . Let β ∈ Rk+1 be such
that
Y = X ′β + U .
Suppose E[ZX ′ ] < ∞, E[ZZ ′ ] < ∞, E[ZU ] = 0, there is no perfect
collinearity in Z, and that the rank of E[ZX ′ ] is k + 1. We now discuss
estimation of β.
(1/n) Σ_{1≤i≤n} Zi Ûi = 0 .
Y = β0 + β1 X1 + U
and
X1 = π0 + π1 Z1 + V ,
so that substituting the second equation into the first delivers
Y = β0∗ + β1 π1 Z1 + U ∗
with
β0∗ = β0 + β1 π0
U ∗ = U + β1 V .
Z = (Z1 , . . . , Zn )′
X = (X1 , . . . , Xn )′
Y = (Y1 , . . . , Yn )′ .
(1/n) Σ_{1≤i≤n} Π̂n′ Zi Ûi = 0 .
Notice that this implies that Ûi is orthogonal to those instruments that are also included as exogenous regressors, but it may not be orthogonal to the remaining (endogenous) regressors. It is termed the TSLS estimator because it may be obtained in
the following way: first, regress (each component of) Xi on Zi to obtain
X̂i = Π̂′n Zi ; second, regress Yi on X̂i to obtain β̂n . However, in order to
obtain proper standard errors, it is recommended to compute the estimator
in one step (see the following section).
The estimator may again be expressed more compactly using matrix
notation. Define
X̂ = (X̂1 , . . . , X̂n )′
= PZ X ,
where
PZ = Z(Z′ Z)−1 Z′
is the projection matrix onto the column space of Z. In this notation, we have
β̂n = (X′ PZ X)⁻¹ X′ PZ Y .
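A minimal sketch of this formula (simulated data with one endogenous regressor and one excluded instrument; the design and names are ours):

import numpy as np

rng = np.random.default_rng(4)
n = 5000
Z1 = rng.normal(size=n)                        # excluded instrument
V = rng.normal(size=n)
X1 = 0.8 * Z1 + V                              # endogenous regressor (first stage)
U = 0.7 * V + rng.normal(size=n)               # correlated with X1 through V
Y = 1.0 + 2.0 * X1 + U

X = np.column_stack([np.ones(n), X1])          # regressors, constant included
Z = np.column_stack([np.ones(n), Z1])          # instruments, constant included

# TSLS: beta = (X' P_Z X)^{-1} X' P_Z Y with P_Z = Z (Z'Z)^{-1} Z'.
PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_tsls = np.linalg.solve(PZX.T @ X, PZX.T @ Y)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)   # inconsistent in this design
print(beta_tsls, beta_ols)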
4.3.1 Consistency
Under the assumptions stated above, the TSLS estimator β̂n is consistent for β, i.e., β̂n →P β as n → ∞. To see this, first recall from our results on OLS that
Π̂n →P Π
as n → ∞. Next, note that the WLLN implies that
(1/n) Σ_{1≤i≤n} Zi Zi′ →P E[ZZ′]
(1/n) Σ_{1≤i≤n} Zi Yi →P E[ZY]
4.3.3 Estimation of V
A natural estimator of V is given by
V̂n = (Π̂n′ ((1/n) Σ_{1≤i≤n} Zi Zi′) Π̂n)⁻¹ (Π̂n′ ((1/n) Σ_{1≤i≤n} Zi Zi′ Ûi²) Π̂n) (Π̂n′ ((1/n) Σ_{1≤i≤n} Zi Zi′) Π̂n)⁻¹ ,
where Ûi = Yi − Xi′ β̂n . As in our discussion of OLS, the primary difficulty
in establishing the consistency of this estimator lies in showing that
(1/n) Σ_{1≤i≤n} Zi Zi′ Ûi² →P Var[ZU]
Lecture 5
More on Endogeneity
as n → ∞, where
and
where in both cases the first equality follows from Var[ZU ] = E[ZZ ′ U 2 ] =
σ 2 E[ZZ ′ ], and the second equality used the fact that X = Π′ Z + V with
E[ZV ′ ] = 0. It suffices to show that Ṽ−1 ≤ V−1 , i.e., to show that
Yet this follows upon realizing that the left-hand side of the preceding display
is simply E[Ŵ ∗ Ŵ ∗′ ] with
When we do not assume that E[U |Z] = 0 and Var[U |Z] = σ 2 , then
better estimators for β exist. Such estimators are most easily treated as a
special case of the generalized method of moments (GMM), which will be
covered later in class.
Yi = Xi β + Ui
Xi = Zi π + Vi ,
Note that
√n(β̂n − β) = [ (1/√n) Σ_{i=1}^n Zi Ui ] / [ (1/n) Σ_{i=1}^n Zi² π + (1/n) Σ_{i=1}^n Zi Vi ] .
The finite-sample joint distribution of the numerator and denominator is simply
N( (0, Z̄n² π)′ ,  [ Z̄n² σU²          (1/√n) Z̄n² σU,V
                   (1/√n) Z̄n² σU,V   (1/n) Z̄n² σV² ] ) ,
where
Z̄n² = (1/n) Σ_{i=1}^n Zi² .
One may therefore test the null hypothesis by comparing Tn with cℓ+1,1−α ,
the 1 − α quantile of the χ2ℓ+1 distribution. As we will discuss in the second
part of the class, one may now construct a confidence region using the duality
between hypothesis testing and the construction of confidence regions. A
closely related variant of this idea leads to the Anderson-Rubin test, in which
one tests whether all of the coefficients in a regression of Yi − Xi′ c on Zi are
zero.
Recent research in econometrics suggests that this method has good
power properties when the model is exactly identified, but may be less de-
sirable when the model is over-identified. Other methods for the case in
which the model is over-identified and/or one is only interested in some fea-
ture of β (e.g., one of the slope parameters) have been proposed and are the
subject of current research as well.
Instead of using these “more complicated” methods, researchers may at-
tempt a two-step method as follows. In the first step, they would investigate
whether the rank of E[ZX ′ ] is “close” to being < k + 1 or not by carrying
out a hypothesis test of the null hypothesis that H0 : rank(E[ZX ′ ]) < k + 1
versus the alternative hypothesis that H1 : rank(E[ZX ′ ]) = k + 1. In some
cases, such a test is relatively easy to carry out given what we have al-
ready learned: e.g., when there is a single endogenous regressor, such a test
is equivalent to a test of the null hypothesis that certain coefficients in a
linear regression are all equal to zero versus not all equal to zero. In the
second step, they would only use these “more complicated” methods if they
failed to reject this null hypothesis. This two-step method will also behave poorly in finite samples and should not be used. A deeper discussion of these "uniformity" issues takes place in Econ 481.
Y = β0 + β1 D .
In this case, we interpret β0 as Y (0) and β1 as Y (1) − Y (0), where Y (1) and
Y (0) are potential or counterfactual outcomes. Using this notation, we may
rewrite the equation as
Y = DY(1) + (1 − D)Y(0) = Y(0) + (Y(1) − Y(0))D .
The potential outcome Y (0) is the value of the outcome that would have
been observed if (possibly counter-to-fact) D were 0; the potential outcome
Y (1) is the value of the outcome that would have been observed if (possibly
counter-to-fact) D were 1. The variable D is typically called the treatment
and Y (1) − Y (0) is called the treatment effect. The quantity E[Y (1) − Y (0)]
is usually referred to as the average treatment effect.
If D were randomly assigned (e.g., by the flip of a coin, as in a randomized
controlled trial), then
(Y (0), Y (1)) ⊥⊥ D .
In this case, under mild assumptions, the slope coefficient from OLS regres-
sion of Y on a constant and D yields a consistent estimate of the average
treatment effect. To see this, note that the estimand is
Cov[Y, D] / Var[D] = E[Y|D = 1] − E[Y|D = 0]
                   = E[Y(1)|D = 1] − E[Y(0)|D = 0]
                   = E[Y(1) − Y(0)] ,
where the first equality follows from a homework exercise, the second equal-
ity follows from the equation for Y , and the third equality follows from
independence of (Y (0), Y (1)) and D.
Otherwise, we generally expect D to depend on (Y (1), Y (0)). In this
case, OLS will not yield a consistent estimate of the average treatment
effect. To proceed further, we therefore assume, as usual, that there is an
instrument Z that also takes values in {0, 1}. We may thus consider the
where the equality follows by multiplying and dividing by Var[Z] and using
earlier results. Our goal is to express this quantity in terms of the treatment
effect Y (1) − Y (0) somehow. To this end, analogously to our equation for
Y above, it is useful to also introduce a similar equation for D:
D = ZD(1) + (1 − Z)D(0)
= D(0) + (D(1) − D(0))Z
= π0 + π1 Z ,
where π0 = D(0), π1 = D(1) − D(0), and D(1) and D(0) are potential or
counterfactual treatments (rather than outcomes). We impose the following
versions of instrument exogeneity and instrument relevance, respectively:
(Y(0), Y(1), D(0), D(1)) ⊥⊥ Z
and
P{D(1) ≠ D(0)} = P{π1 ≠ 0} > 0 .
Note that the first part of the assumption basically states that Z is as good
as randomly assigned. In addition, note that we are implicitly assuming that
Z does not affect Y directly, i.e., potential outcomes take the form Y (d) as
opposed to Y (d, z). This is the exclusion restriction in this setting. In the
linear model with constant effects, the exclusion restriction is expressed by
the omission of the instruments from the causal equation of interest and by
requiring that E[ZU ] = 0.
We further assume the following monotonicity (or perhaps better called uniform monotonicity) condition:
P{D(1) ≥ D(0)} = 1 .
where the first equality follows from the equations for Y and D, the second
equality follows from instrument exogeneity, and the fourth equality follows
from the monotonicity assumption. Furthermore,
(E[Y | Z = 1] − E[Y | Z = 0]) / (E[D | Z = 1] − E[D | Z = 0]) = E[Y(1) − Y(0) | D(1) > D(0)] ,
which is termed the local average treatment effect (LATE). It is the average
treatment effect among the subpopulation of people for whom a change in
the value of the instrument switched them from being non-treated to treated.
We often refer to this subpopulation as the compliers.
A few remarks are in order: First, it is important to understand that
this result depends crucially on the monotonicity assumption. Second, it is
important to understand that this quantity may or may not be of interest.
Third, it is important to understand that a consequence of this calculation is
that in a world with heterogeneity “different instruments estimate different
parameters.” Finally, this result also depends on the simplicity of the model.
When covariates are present, the entire calculation breaks down. Some
generalizations are available.
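A simulation sketch of the LATE result (the heterogeneous-effects design below is our own illustration, with one-sided noncompliance so that monotonicity holds automatically): the Wald/IV estimand recovers the average effect for compliers, not the ATE.

import numpy as np

rng = np.random.default_rng(5)
n = 200_000
Z = rng.integers(0, 2, size=n)                    # randomly assigned instrument

# Potential treatments: D(0) = 0 for everyone, D(1) = 1 for about 60% (compliers).
D1 = (rng.uniform(size=n) < 0.6).astype(int)
D0 = np.zeros(n, dtype=int)
D = np.where(Z == 1, D1, D0)

# Heterogeneous treatment effects, larger on average for compliers.
effect = rng.normal(1.0, 1.0, size=n) + 0.5 * D1
Y = rng.normal(size=n) + effect * D

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
late = effect[D1 > D0].mean()
ate = effect.mean()
print(wald, late, ate)                            # wald ~ late = 1.5, ate = 1.3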
E[Y |Z = 1] − E[Y |Z = 0] = E[Y (1) − Y (0)|D(1) > D(0)]P {D(1) > D(0)}
− E[Y (1) − Y (0)|D(1) < D(0)]P {D(1) < D(0)} .
We might therefore have a situation where treatment effects are positive for
everyone (i.e., Y (1) − Y (0) > 0) yet the reduced form is zero because effects
on compliers are canceled out by effects on defiers, i.e., those individuals for
which the instrument pushes them out of treatment (D(1) = 0 and D(0) =
1). This doesn’t come up in a constant effect model where β = Y (1) − Y (0)
is constant, as in such a case
E[Y | Z = 1] − E[Y | Z = 0] = β (E[D | Z = 1] − E[D | Z = 0]) ,
and so a zero reduced-form effect means either the first stage is zero or β = 0.
It is worth noting that monotonicity assumptions are easy to interpret
in latent index models. In such models individual choices are determined by
Lecture 6

Generalized Method of Moments and Empirical Likelihood
where in what follows we will use the notation mi (β) = m(Yi , Xi , Zi , β).
The method of moments estimator for β is defined as the parameter value
which sets m̄n (β) = 0. This is generally not possible when ℓ > k as there are
more equations than free parameters. The idea of the generalized method of
moments (GMM) is to define an estimator that sets m̄n (β) “close” to zero,
given a notion of “distance”.
Let Λn be an (ℓ + 1) × (ℓ + 1) matrix such that Λn →P Λ for a symmetric positive definite matrix Λ and define
Qn(β) = n m̄n(β)′ Λn m̄n(β) .
This is a non-negative measure of the “distance” between the vector m̄n (β)
and the origin. For example, if Λn = I, then Qn (β) = n |m̄n (β)|2 , the square
of the Euclidean norm, scaled by the sample size n. The GMM estimator of β is defined as the value that minimizes Qn(β), that is,
β̂n = argmin_{b ∈ R^{k+1}} Qn(b) .
Note that if k = ℓ, then m̄n (β̂n ) = 0, and the GMM estimator is the method
of moments estimator. The first order conditions for the GMM estimator
are
0 = (∂/∂b) Qn(β̂n)
  = 2 ((∂/∂b) m̄n(β̂n))′ Λn m̄n(β̂n)
  = −2 ((1/n) Z′X)′ Λn ((1/n) Z′(Y − Xβ̂n))   (6.4)
so
2 (Z′X)′ Λn (Z′X) β̂n = 2 (Z′X)′ Λn Z′Y ,   (6.5)
which establishes a closed-form solution for the GMM estimator in the linear model,
β̂n = ((Z′X)′ Λn (Z′X))⁻¹ (Z′X)′ Λn Z′Y .   (6.6)
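A sketch of (6.6) in an over-identified linear model (simulated data; names are ours). The second step below uses Λn = Ω̂n⁻¹, the efficient weighting discussed in the next subsections:

import numpy as np

rng = np.random.default_rng(6)
n = 5000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])       # 3 instruments
V = rng.normal(size=n)
X = np.column_stack([np.ones(n), Z[:, 1] + 0.5 * Z[:, 2] + V])   # 2 regressors
U = 0.6 * V + rng.normal(size=n)
Y = X @ np.array([1.0, 2.0]) + U

def gmm(Lam):
    # beta = ((Z'X)' Lam (Z'X))^{-1} (Z'X)' Lam Z'Y, as in (6.6).
    ZX, ZY = Z.T @ X, Z.T @ Y
    return np.linalg.solve(ZX.T @ Lam @ ZX, ZX.T @ Lam @ ZY)

b1 = gmm(np.eye(Z.shape[1]))                      # first step: identity weighting
U_hat = Y - X @ b1
Omega_hat = (Z * U_hat[:, None] ** 2).T @ Z / n   # estimate of E[Z Z' U^2]
b2 = gmm(np.linalg.inv(Omega_hat))                # second step: efficient GMM
print(b1, b2)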
The matrix (Z′ X)′ Λn (Z′ X) may not be invertible for a given n, but, since
E[ZX ′ ]′ ΛE[ZX ′ ] is invertible, it will be invertible with probability ap-
proaching one. Note that using similar arguments to those in Lemma 4.1
we can claim that E[ZX ′ ]′ ΛE[ZX ′ ] is invertible provided E[ZX ′ ] has rank
k + 1 and Λ has rank ℓ + 1.
6.1.3 Consistency
Let
Σ = E[ZX ′ ] .
Then, by the WLLN and the CMT,
((1/n) Z′X)′ Λn ((1/n) Z′X) →P Σ′ΛΣ
and
(1/n) Z′Y →P E[ZY] = Σβ
as n → ∞. The desired result therefore follows from the CMT.
Let
Ω = E[ZZ′U²] .
Then √n(β̂n − β) →d N(0, V) as n → ∞, where
V = (Σ′ΛΣ)⁻¹ (Σ′ΛΩΛΣ) (Σ′ΛΣ)⁻¹ .
In general, GMM estimators are asymptotically normal with “sandwich
form" asymptotic variances. The optimal weighting matrix Λ* is the one that minimizes V. This turns out to be Λ* = Ω⁻¹. The proof is left as an exercise.
This yields the efficient GMM estimator
β̂n = ((Z′X)′ Ω⁻¹ (Z′X))⁻¹ (Z′X)′ Ω⁻¹ Z′Y ,
which satisfies
√n(β̂n − β) →d N(0, (Σ′Ω⁻¹Σ)⁻¹) .
In practice, Ω is not known but it can be estimated consistently. For any Ω̂n →P Ω, we still call β̂n the efficient GMM estimator, as it has the
same asymptotic distribution. By “efficient”, we mean that this estimator
has the smallest asymptotic variance in the class of GMM estimators with
this set of moment conditions. This is a weak concept of optimality, as we
are only considering alternative weight matrices Λn . However, it turns out
that the GMM estimator is semiparametrically efficient, as shown by Gary
Chamberlain (1987). If it is known that E[mi (β)] = 0 and this is all that
is known, this is a semi-parametric problem, as the distribution of the data
P is unknown. Chamberlain showed that in this context, if an estimator
has this asymptotic variance, it is semiparametrically efficient. This result
shows that no estimator has greater asymptotic efficiency than the efficient
GMM estimator. No estimator can do better (in this first-order asymptotic
sense), without imposing additional assumptions.
For example, when Λn = (Z′Z)⁻¹, the GMM estimator becomes
β̂n = ((Z′X)′ (Z′Z)⁻¹ (Z′X))⁻¹ (Z′X)′ (Z′Z)⁻¹ Z′Y
   = (X′ PZ X)⁻¹ X′ PZ Y ,
i.e., the TSLS estimator.
The weighting matrix Ω can be estimated either with the moment conditions centered at their sample mean or with the uncentered moment conditions. Since E[mi] = 0, these two estimators are asymptotically equivalent under the hypothesis of correct specification. However, the uncentered estimator may be a poor choice when constructing hypothesis tests, as under the alternative hypothesis the moment conditions are violated, i.e., E[mi] ≠ 0.
Again, in the linear model m(Y, X, Z, β) = Z(Y − X ′ β) and we still use the
notation mi (β) = m(Yi , Xi , Zi , β).
Empirical likelihood may be viewed as parametric inference in moment
condition models, using a data-determined parametric family of distribu-
tions. The parametric family is a multinomial distribution on the observed
values (Y1 , X1 , Z1 ), . . . , (Yn , Xn , Zn ). This parametric family will have n − 1
parameters. Having the number of parameters grow as quickly as the sample
size makes empirical likelihood very different than parametric likelihood.
The multinomial distribution which places probability pi at each obser-
vation of the data will satisfy the above moment condition if and only if
Σ_{i=1}^n pi mi(β) = 0 .   (6.10)
where κ and λ are Lagrange multipliers. For a given value b ∈ Rk+1 , the
first order conditions with respect to pi , κ and λ are:
∂L/∂pi = 1/pi − κ − nλ′mi(b) = 0
∂L/∂κ = Σ_{i=1}^n pi − 1 = 0
∂L/∂λ = n Σ_{i=1}^n pi mi(b) = 0 .
Note that Σ(β) does not depend on β in the linear model. However, in
non-linear models it does and so we keep the dependence on β throughout
the remainder of the section. Denote the sample analogs of Σ(β) and Ω(β)
by
Σ̂n(β) = (1/n) Σ_{i=1}^n Mi(β)
Ω̂n(β) = (1/n) Σ_{i=1}^n mi(β) mi(β)′ .
Note that since 1/(1 + a) = 1 − a/(1 + a), we can re-write (6.15) and solve
for λ̃n,
λ̃n = [ (1/n) Σ_{i=1}^n mi(β̃n) mi(β̃n)′ / (1 + λ̃n′ mi(β̃n)) ]⁻¹ m̄n(β̃n) .   (6.17)
where m̄n(β) = n⁻¹ Σ_{1≤i≤n} mi(β). By (6.16) and (6.17),
[ (1/n) Σ_{i=1}^n Mi(β̃n) / (1 + λ̃n′ mi(β̃n)) ]′ [ (1/n) Σ_{i=1}^n mi(β̃n) mi(β̃n)′ / (1 + λ̃n′ mi(β̃n)) ]⁻¹ m̄n(β̃n) = 0 .
where by (6.12),
Σ̃n(β) = Σ_{i=1}^n p̃i Mi(β)
Ω̃n(β) = Σ_{i=1}^n p̃i mi(β) mi(β)′ .
We can now see that EL and GMM have very similar first order condi-
tions. Recall from (6.4) that the first order condition for a GMM estimator
is given by
Σ̂n(β̂n)′ Ω̂n⁻¹ m̄n(β̂n) = 0 ,
Lecture 7
Panel Data
Y1 = X1′ β + η + U1
Y2 = X2′ β + η + U2 .
Note that we are also assuming that β is a constant parameter that does not
change over time. If this is the case, we could simply take first differences,
i.e.,
Y2 − Y1 = (X2 − X1 )′ β + U2 − U1
∆Y = ∆X ′ β + ∆U ,
and remove the unobserved individual effect η in the process. Notice that
under the assumptions on Xi,t and Ui,t that we formalize below. Now define
∆Xi,t = Xi,t − Xi,t−1
for t ≥ 2, and proceed analogously with the other random variables. Note
again that ∆ηi = 0. Applying this transformation to (7.4), we get
∆Yi,t = ∆Xi,t′ β + ∆Ui,t ,   i = 1, . . . , n,  t = 2, . . . , T .   (7.5)
Now define Ẋi,t = Xi,t − (1/T) Σ_{1≤s≤T} Xi,s, and define Ẏi,t and U̇i,t analogously. Note that η̇i = 0 for all i = 1, . . . , n. Applying this transformation to (7.4), we get
Ẏi,t = Ẋi,t′ β + U̇i,t ,   i = 1, . . . , n,  t = 1, . . . , T .   (7.7)
In order to make this expression more tractable, we use two tricks. First,
note that
Σ_{1≤t≤T} Ẋi,t U̇i,t = Σ_{1≤t≤T} Ẋi,t Ui,t − Ūi Σ_{1≤t≤T} Ẋi,t = Σ_{1≤t≤T} Ẋi,t Ui,t ,   (7.9)
where the last step follows from Σ_{1≤t≤T} Ẋi,t = 0. We can therefore replace
U̇i,t with Ui,t. Second, let Ẋi = (Ẋi,1, . . . , Ẋi,T)′ be the T × k matrix of stacked observations for unit i, and define Ui in the same way. Using this notation, we can write
Ẋi′ Ẋi = Σ_{1≤t≤T} Ẋi,t Ẋi,t′   and   Ẋi′ Ui = Σ_{1≤t≤T} Ẋi,t Ui,t .   (7.10)
where Ûi = Ẏi − Ẋi β̂nfe. This is what Stata computes when one uses the cluster(unit) option to xtreg, where unit is the variable that indexes i. This is a consistent covariance matrix estimator that allows for arbitrary inter-temporal correlation patterns and heteroskedasticity across individuals. As we will see later in class, this estimator is generally known as a cluster covariance estimator (CCE) and is consistent as n → ∞, i.e., V̂fe →P Vfe.
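A sketch of the within estimator and of this cluster covariance estimator (simulated balanced panel; dimensions and names are ours):

import numpy as np

rng = np.random.default_rng(7)
n, T, k = 500, 5, 1
eta = rng.normal(size=n)                                  # individual effects
X = rng.normal(size=(n, T, k)) + eta[:, None, None]       # correlated with eta
Y = X @ np.array([2.0]) + eta[:, None] + rng.normal(size=(n, T))

# Within transformation: deviations from individual means remove eta_i.
Xd = X - X.mean(axis=1, keepdims=True)
Yd = Y - Y.mean(axis=1, keepdims=True)

A = sum(Xd[i].T @ Xd[i] for i in range(n))
b = sum(Xd[i].T @ Yd[i] for i in range(n))
beta_fe = np.linalg.solve(A, b)

# Cluster covariance estimator: the meat sums (Xd_i' Uhat_i)(Xd_i' Uhat_i)' over units.
Uhat = Yd - Xd @ beta_fe
meat = sum(np.outer(Xd[i].T @ Uhat[i], Xd[i].T @ Uhat[i]) for i in range(n))
V_fe = np.linalg.inv(A) @ meat @ np.linalg.inv(A)
print(beta_fe, np.sqrt(np.diag(V_fe)))                    # estimate and clustered SE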
E[∆Ui,t ∆Ui,t−1 ] = E[Ui,t Ui,t−1 − Ui,t−1 Ui,t−1 − Ui,t Ui,t−2 + Ui,t−1 Ui,t−2 ]
= − Var(Ui,t−1 ) .
However, in the other extreme where Ui,t follows a random walk, i.e., Ui,t =
Ui,t−1 + Vi,t for some i.i.d. sequence Vi,t , then ∆Ui,t = Vi,t . These results, at
the end of the day, rely on homoskedasticity and so it is advised to simply use
a robust standard error as above and forget about efficiency considerations.
Note that when T = 2, these two estimators are numerically the same.
In addition, first differences are used in dynamic panels and difference in
differences, as we will discuss later.
Remark 7.1 Panel data traditionally deals with units over time. However,
we can think about other cases where the data has a two-dimensional in-
dex and where we believe that one of the indices may exhibit within group
dependence. For example, it could be that we observe “employees” within
“firms”, or “students” within “schools”, or “families” in metropolitan sta-
tistical areas (MSA), etc. Cases like these are similar but not identical to
panel data. To start, units are not “repeated” in the sense that each unit
is potentially observed only once in the sample. In addition, these are cases
where “T ” is usually large and “n” is small. For example, we typically
observe many students (which may be dependent within a school) and few
schools. We will study these cases further in the second part of this class.
Hence all of the unobservable time-invariant factors that were being con-
trolled for in the fixed effects approach are now assumed to be mean in-
dependent (ergo, uncorrelated) with the explanatory variables at all time
periods. The strict exogeneity condition of the fixed effects approach (i.e.
FE1) is still maintained, so that the aggregate error term Vit = ηi + Ui,t now
satisfies E[Vit |Xi1 , . . . , XiT ] = 0 for all t = 1, . . . , T . The idea behind the
random effects approach is to exploit the serial correlation in Vit that is gen-
erated by having a common ηi component in each time period. Specifically,
the baseline approach maintains the following.
RE2. (i) Var[Ui,t |Xi,1 , . . . , Xi,T ] = σU2 , (ii) Var[ηi |Xi,1 , . . . , Xi,T ] = ση2 , (iii)
E[Ui,t Ui,s |Xi,1 , . . . , Xi,T ] = 0 for all t ̸= s, (iv) E[Ui,t ηi |Xi,1 , . . . , Xi,T ] =
0 for all t = 1, . . . , T .
and
E[Vi,t Vi,s |Xi,1 , . . . , Xi,T ] = E[ηi2 +Ui,t Ui,s +ηi Ui,t +ηi Ui,s |Xi,1 , . . . , Xi,T ] = ση2 .
Combining these results and stacking the observations for unit i, we get that
place. Second, the efficiency gains hold under the homoskedasticity and
independence assumptions in RE2 and do not hold more generally. These
are undoubtedly strong assumptions. Third, unlike the fixed effects estimator, the random effects approach allows one to estimate regression coefficients associated with time-invariant covariates (that is, some of the Xi,t may be constant across time, e.g., the gender of the individual). So if the analysis is
primarily concerned with the effect of a time-invariant regressor and panel
data is available, it makes sense to consider some sort of random effects type
of approach. Fourth, under RE1 and RE2 β is identified in a single cross-
section. The parameters that require panel data for identification in this
model are the variances of the components of the error ση2 and σU2 , which
are needed for the GLS approach. Finally, note that the terminology “fixed
effects” and “random effects” is arguably confusing as ηi is random in both
approaches.
A last word of caution should be made about the use of Hausman specification tests. These are tests that compare β̂nfe with β̂nre in order to test
the validity of RE1 (assuming RE2 holds). Under the null hypothesis that
RE1 holds, both estimators are consistent but β̂nre is efficient. Under the
alternative hypothesis, β̂nfe is consistent while β̂nre is not. Now, suppose we
were to define a new estimator β̂n∗ as follows
β̂n∗ = β̂nfe I{Hausman test rejects} + β̂nre I{Hausman test accepts} . (7.15)
The problem with this new estimator is that its finite sample distribution
looks very different from the usual normal approximations. This is gener-
ally the case when there is pre-testing, understood as a situation where we
conduct a test in a first step, and then depending on the outcome of this
test, we do A or B in a second step. A formal analysis of these uniformity issues is covered in Econ 481 and is beyond the scope of this class.
where ηi and Ui,t are the same as before but now Yi,t−1 is allowed to have a
direct effect on Yi,t , a feature sometimes referred to as state dependence. We
assume that |ρ| < 1. As is common in dynamic panel data (and time series)
contexts, we will assume that the model is dynamically complete in the sense
that all appropriate lags of Yi,t have been removed from the time-varying
where, as before, ∆Yi,t = Yi,t − Yi,t−1 and similarly for Ui,t . In general we
will have Cov(∆Yi,t−1, ∆Ui,t) ≠ 0, since (7.16) implies that Yi,t−1 depends on Ui,t−1, which also enters ∆Ui,t. However,
Cov(Yi,t−2, ∆Ui,t) = 0 ,
which makes Yi,t−2 a valid instrument for ∆Yi,t−1 since we assumed |ρ| < 1
and Cov(Yi,t−2 , ηi ) ̸= 0. An actual expression for this last covariance can be
obtained under additional assumptions. For example, under the assumption
that the initial condition, Yi,0 , is independent of ηi (and ηi ⊥ Ui,t ), then
Cov(Yi,t−2, ηi) = ση² Σ_{j=0}^{t−3} ρ^j .
The predictive power of these lags for ∆Yi,t−1 is likely to get progressively
weaker as the lag distance gets larger. Weak instrument problems may arise
as a consequence. If T ≥ 4, then one could consider using the differenced term ∆Yi,t−2 (instead of, or in addition to, the level Yi,t−2) as an instrument for ∆Yi,t−1. In the literature, these approaches are frequently referred to as
Arellano-Bond or Anderson-Hsiao estimators; see Arellano (2003).
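A sketch of the Anderson-Hsiao idea on a simulated AR(1) panel (the design is our own illustration): difference out ηi and instrument ∆Yi,t−1 with the level Yi,t−2.

import numpy as np

rng = np.random.default_rng(8)
n, T, rho = 2000, 6, 0.5
eta = rng.normal(size=n)
Y = np.zeros((n, T))
Y[:, 0] = eta + rng.normal(size=n)             # initial condition
for t in range(1, T):
    Y[:, t] = rho * Y[:, t - 1] + eta + rng.normal(size=n)

# Differenced equation: dY_t = rho * dY_{t-1} + dU_t, with dY_{t-1} endogenous.
dY = np.diff(Y, axis=1)
y = dY[:, 2:].ravel()                          # dY_t for t = 3, ..., T-1
x = dY[:, 1:-1].ravel()                        # dY_{t-1}
z = Y[:, 1:-2].ravel()                         # level Y_{t-2} as instrument

rho_iv = (z @ y) / (z @ x)                     # simple (just-identified) IV estimate
rho_ols = (x @ y) / (x @ x)                    # biased: dY_{t-1} correlated with dU_t
print(rho_iv, rho_ols)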
Bibliography
Arellano, M. (2003): Panel Data Econometrics, Oxford University Press.
Lecture 8

Difference in Differences
The treatment effect is the difference Y (1) − Y (0) and the usual quantity
of interest is E[Y (1) − Y (0)], typically referred to as the average treatment
effect.
Suppose that we observe a random sample of n individuals from this
population, and that for each individual i we observe both Yi (1) and Yi (0).
Clearly, for each i we can compute the treatment effect Yi (1) − Yi (0) and
estimate the average treatment effect as
(1/n) Σ_{i=1}^n Yi(1) − (1/n) Σ_{i=1}^n Yi(0) .
to a treatment in the second period but not in the first period. The second group is not exposed to the treatment during either period. To be specific,
let
{(Yj,t , Dj,t ) : j ∈ {1, 2} and t ∈ {1, 2}} (8.2)
denote the observed data, where Yj,t and Dj,t ∈ {0, 1} denote the outcome
and treatment status of group j at time t. Note that in our setup, Dj,t = 1
if and only if j = 1 and t = 2 (assuming the first group is the one receiving
treatment in the second period). The parameter we will be able to identify
is
θ = E[Y1,2 (1) − Y1,2 (0)] , (8.3)
which is simply the average treatment effect on the treated : the average effect
of the treatment that occurs in group 1 in period 2. In order to interpret
θ as an average treatment effect, one would need to make the additional
assumption that
θ = E[Yj,t (1) − Yj,t (0)] (8.4)
is constant across j and t. This is a strong assumption and, in principle,
not fundamental for the DD approach. The assumption in (8.4) has partic-
ular bite when we consider multiple treated groups. Consider the following
example as an illustration.
Example 8.1 On April 1, 1992, New Jersey raised the state minimum wage
from $4.25 to $5.05. Card and Krueger (1994) collected data on employment
at fast food restaurants in New Jersey in February 1992 (t = 1) and again
in November 1992 (t = 2) to study the effect of increasing the minimum
wage on employment. They also collected data from the same type of restau-
rants in eastern Pennsylvania, just across the river. The minimum wage in
Pennsylvania stayed at $4.25 throughout this period. In our notation, New
Jersey would be the first group, Yj,t would be the employment rate in group j
at time t, and Dj,t denotes an increase in the minimum wage (the treatment)
in group j at time t.
The identification strategy of DD relies on the following assumption,
E[Y2,2 (0) − Y2,1 (0)] = E[Y1,2 (0) − Y1,1 (0)] , (8.5)
i.e., both groups have “common trends” in the absence of a treatment. One
way to parametrize this assumption is to assume that
Yj,t (0) = ηj + γt + Uj,t , (8.6)
where E[Uj,t ] = 0, and ηj and γt are (non-random) group and time ef-
fects. This additive structure for non-treated potential outcomes implies
that E[Yj,2 (0) − Yj,1 (0)] = γ2 − γ1 ≡ γ, which is constant across groups.
Note that this assumption, together with (8.3) imply that
E[Y1,2 (1)] = θ + η1 + γ2 . (8.7)
In the context of the previous example, this assumption says that in the
absence of a minimum wage change, employment is determined by the sum
of a time-invariant state effect, a year effect that is common across states,
and a zero mean shock. Before we discuss the identifying power of this
structure, we discuss two natural (but unsuccessful) approaches that may
come to mind.
[Figure 8.1: Employment rate over time (t = 1, 2) for the treated and control groups, the counterfactual trend for the treated group (Y1,1(0) + Y2,2(0) − Y2,1(0)), and the treatment effect θ.]
E[∆Y1,2 − ∆Y2,2 ] = E[Y1,2 (1) − Y1,1 (0)] − E[Y2,2 (0) − Y2,1 (0)]
=θ+γ−γ =θ .
Thus, the approach identifies the treatment effect by taking the differences
between pre-versus-post comparisons in the two groups, and exploiting the
fact that the time trend γ is “common” in the two groups.
Note that an alternative interpretation of the same idea is to compare (Y1,2 − Y2,2) and (Y1,1 − Y2,1), that is, the treatment and control comparison before and after the policy change. This is because
E[(Y1,2 − Y2,2 ) − (Y1,1 − Y2,1 )] = E[Y1,2 (1) − Y2,2 (0)] − E[Y1,1 (0) − Y2,1 (0)]
=θ−η+η =θ .
Using this representation, the difference for the pre-period is used to identify
the persistent group difference η, a strategy that again works under the
common trends assumption in (8.5).
A final interpretation of the same idea is that the DD approach constructs a counterfactual potential outcome Y1,2(0) (which is unobserved) by combining Y1,1(0), Y2,2(0), and Y2,1(0), which are all observed. The "constructed" potential outcome is simply
Ỹ1,2(0) = Y1,1(0) + Y2,2(0) − Y2,1(0) = η1 + γ2 + Ũ1,1 ,
where Ũ1,1 = U1,1 + U2,2 − U2,1 . Computing E[Y1,2 − Ỹ1,2 (0)] = θ therefore
delivers a valid identification strategy. Figure 8.1 illustrates this idea.
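A minimal sketch of the two-by-two DD calculation with individual-level data generated according to (8.6) (all numbers are our own illustration):

import numpy as np

rng = np.random.default_rng(9)
n = 50_000                                     # individuals per group-period cell
eta = {1: 1.0, 2: -0.5}                        # group effects
gamma = {1: 0.0, 2: 0.8}                       # common time effects
theta = 2.0                                    # treatment effect on the treated

def cell_mean(j, t):
    treated = (j == 1 and t == 2)
    y = eta[j] + gamma[t] + theta * treated + rng.normal(size=n)
    return y.mean()

dd = (cell_mean(1, 2) - cell_mean(1, 1)) - (cell_mean(2, 2) - cell_mean(2, 1))
print(dd)                                      # close to theta = 2.0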
where Ij,t is the set of individuals in group j at time t. For simplicity, take
the treatment indicator Dj,t = I{j = 1}I{t = 2} to be non-random and
note that the observed outcome is
Yi,j,t = Yi,j,t (1)Dj,t + (1 − Dj,t )Yi,j,t (0) = (Yi,j,t (1) − Yi,j,t (0))Dj,t + Yi,j,t (0) ,
and
θ̂n = (1/|J1|) Σ_{j∈J1} ∆n,j − (1/|J0|) Σ_{j∈J0} ∆n,j .   (8.12)
What happens if we have few treated and control groups but many time
periods? Say, |J1 | and |J0 | fixed, but |T1 | → ∞ and |T0 | → ∞. Second,
what are the assumptions on Uj,t ? It is typically common to assume that
Uj,t ⊥ Uj ′ ,s for all j ′ ̸= j and (t, s). However, one would expect Uj,t and
Uj,s to be correlated, at least for t and s being “close” to each other. On
top of this, in the context of individual data one would expect Ui,j,t to be correlated with Ui′,j,s, i.e., units in the same group may be dependent on each other even if they are in different time periods. Each of these aspects has a tremendous impact on which inference tools end up being valid or not.
We will discuss some of these in the second part of this class.
As a way to illustrate how important these assumptions may be, let’s
consider the case where J1 = {1} but |J0| → ∞; we also assume that |T0| and |T1| are finite. That is, only the first group is treated, while there are many control groups. This is common in empirical applications with US state level data, where often a few states exhibit a policy change while all the other states do not. The DD estimator in this case reduces to
θ̂n = ∆n,1 − (1/|J0|) Σ_{j∈J0} ∆n,j
   = θ + (1/|T1|) Σ_{t∈T1} U1,t − (1/|T0|) Σ_{t∈T0} U1,t − (1/|J0|) Σ_{j∈J0} [ (1/|T1|) Σ_{t∈T1} Uj,t − (1/|T0|) Σ_{t∈T0} Uj,t ]
   →P θ + (1/|T1|) Σ_{t∈T1} U1,t − (1/|T0|) Σ_{t∈T0} U1,t ,
there were 38 states in the US that did not implement such programs. Rather
than just using a standard DD analysis - which effectively treats each state
as being of equal quality as a control group - ADH propose choosing a
weighted average of the potential controls. Of course, choosing a suitable
control group or groups is often done informally, including matching on pre-
treatment predictors. ADH formalize the procedure by optimally choosing
weights, and they propose methods for inference.
Consider the simple case in Section 8.1.2, except that we now assume there are |J0| possible controls and that J1 = {1}. Synthetic controls also allow the model for potential outcomes to be more flexible, especially when it comes to the parallel trends assumption required for DD. To be concrete,
in what follows assume that
Yj,t(0) = γt ηj + Uj,t ,
so that now the time effect and the group effect interact with each other
(note that common trends does not hold in this case). Comparing Y1,2 and
Yj,2 for any j ∈ J0 delivers
and so this approach does not identify θ in the presence of persistent group
differences. The idea behind synthetic controls is to construct the so-called
synthetic control
Ỹ1,2(0) = Σ_{j∈J0} wj Yj,2 ,
by appropriately choosing the weights {wj : j ∈ J0, wj ≥ 0, Σ_{j∈J0} wj = 1}. In order for this idea to work, it must be the case that E[Y1,2(0)] = E[Ỹ1,2(0)], so that E[Y1,2 − Ỹ1,2(0)] = θ. Now, for a given set of weights, this approach delivers
E[Y1,2 − Ỹ1,2(0)] = E[Y1,2 − Σ_{j∈J0} wj Yj,2] = θ + γ2 (η1 − Σ_{j∈J0} wj ηj) .
This is, however, not feasible as we do not observe the group effects ηj. The main result in Abadie et al. (2010) can be stated for the example in this section as follows: suppose that there exist weights {wj* : j ∈ J0, wj* ≥ 0, Σ_{j∈J0} wj* = 1} such that
Y1,1 = Σ_{j∈J0} wj* Yj,1 .   (8.16)
so that
γ1 (η1 − Σ_{j∈J0} wj* ηj) = Σ_{j∈J0} wj* (U1,1 − Uj,1) .   (8.17)
where we used (8.17) in the third equality. The result follows from E[Uj,t ] =
0 for all (j, t).
We then get the weights by “matching” the observed outcomes of the
treated group and the control groups in the period before the policy change.
In practice, Y1,1 may not lie in the convex hull of {Yj,1 : j ∈ J0} and so the method relies on minimizing the distance between Y1,1 and Σ_{j∈J0} wj Yj,1.
Abadie et al. (2010) provide some formal arguments around these issues,
and in particular require that |T0 | → ∞ and that Uj,t is independent across
j and t. However, the model they consider is slightly more general than the
standard DD model, as it does not require the “common trends” assumption.
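A sketch of the weight-matching step using a constrained optimizer (the pre-treatment outcomes below are purely illustrative): choose nonnegative weights summing to one so that the weighted average of the controls is as close as possible to the treated group in the pre-period.

import numpy as np
from scipy.optimize import minimize

# Pre-treatment outcomes for the treated group and three potential controls
# (illustrative numbers; with several pre-periods these are vectors).
Y_treated_pre = np.array([10.0, 11.0, 12.5])
Y_controls_pre = np.array([[9.0, 10.5, 12.0],
                           [12.0, 12.5, 13.0],
                           [8.0, 9.0, 10.0]])            # one row per control group

def loss(w):
    # Squared distance between the treated pre-period path and the synthetic control.
    return np.sum((Y_treated_pre - w @ Y_controls_pre) ** 2)

J0 = Y_controls_pre.shape[0]
res = minimize(loss, x0=np.full(J0, 1.0 / J0),
               bounds=[(0.0, 1.0)] * J0,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
               method="SLSQP")
w_star = res.x
print(w_star, w_star @ Y_controls_pre)                   # weights and synthetic path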
The basic idea can be extended in the presence of covariates Xj that
are not (or would not be) affected by the policy change. In this case, the
weights would be chosen to minimize the distance between
(Y1,1, X1)   and   Σ_{j∈J0} wj (Yj,1, Xj) .
8.4 Discussion
To keep the exposition simple we have ignored covariates. However, it is
straightforward to incorporate additional covariates under the assumption
that potential outcomes are linear in those covariates, i.e.,
E[Yj,t(0) | Xj,t] = ηj + γt + Xj,t′ β .
E[log Y2,2 (0) − log Y2,1 (0)] = E[log Y1,2 (0) − log Y1,1 (0)] .
Indeed, the two assumptions are non-nested and one would typically suspect
that both cannot hold at the same time.
Bibliography
Abadie, A., A. Diamond, and J. Hainmueller (2010): “Synthetic con-
trol methods for comparative case studies: Estimating the effect of Cal-
ifornia’s tobacco control program,” Journal of the American Statistical
Association, 105, 493–505.
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics: An empiricist's companion, Princeton University Press.
Canay, I. A., J. P. Romano, and A. M. Shaikh (2017): “Randomization
Tests under an Approximate Symmetry Assumption,” Econometrica, 85,
1013–1030.
Some Topics
Lecture 9
Non-Parametric Regression
9.1 Setup
Let (Y, X) be a random vector where Y and X take values in R and let
P be the distribution of (Y, X). The case where X ∈ Rk will be discussed
later. We are interested in the conditional mean of Y given X:
m(x) = E[Y |X = x] .
h = max_{i∈Jq(x)} |Xi − x|
Definition 9.2 (Binned Estimator) Let h > 0 be given. The binned es-
timator is defined as
m̂b(x) = Σ_{i=1}^n I{|Xi − x| ≤ h} Yi / Σ_{i=1}^n I{|Xi − x| ≤ h} .   (9.1)
2. 0 ≤ k(u) < ∞
3. k(u) = k(−u)
4. κ2 = ∫_{−∞}^{∞} u² k(u) du ∈ (0, ∞)
Note that the definition of the kernel does not involve continuity. Indeed,
the binned estimator can be written in terms of a kernel function. To see
this, let
k0(u) = (1/2) I{|u| ≤ 1}
be the uniform density on [−1, 1]. Observe that
I{|Xi − x| ≤ h} = I{ |Xi − x|/h ≤ 1 } = 2 k0((Xi − x)/h) ,
so that we can write m̂b(x) in (9.1) as
m̂(x) = Σ_{i=1}^n k0((Xi − x)/h) Yi / Σ_{i=1}^n k0((Xi − x)/h) .
The larger the h, the smoother the estimates (but the higher the bias): h → ∞ ⇒ m̂(x) → Ȳn. The smaller the h, the more erratic the estimates (but the lower the bias): h → 0 ⇒ m̂(Xi) → Yi.
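A minimal sketch of the Nadaraya-Watson estimator with the uniform kernel k0 (simulated data; the regression function and the bandwidth are our own choices):

import numpy as np

rng = np.random.default_rng(10)
n = 1000
X = rng.uniform(-2, 2, size=n)
Y = np.sin(2 * X) + 0.3 * rng.normal(size=n)          # m(x) = sin(2x)

def k0(u):
    return 0.5 * (np.abs(u) <= 1)                     # uniform kernel on [-1, 1]

def nw(x, h):
    w = k0((X - x) / h)
    return np.sum(w * Y) / np.sum(w)                  # kernel-weighted local average

grid = np.linspace(-1.5, 1.5, 7)
print([round(nw(x, h=0.2), 3) for x in grid])
print(np.round(np.sin(2 * grid), 3))                  # true m(x) for comparison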
Let κ2 be defined as
κ2 = ∫_{−∞}^{∞} u² k(u) du
and
B(x) = (1/2) m″(x) + f(x)⁻¹ m′(x) f′(x) .
Using this notation and the symmetry of the kernel, we can write
E[∆̂1(x)] = κ2 h² f(x) B(x) + o(h²) .
A similar expansion shows that
Var[∆̂1(x)] = O(h²/(nh)) = o(1/(nh)) .
Again, by a triangular array CLT,
√(nh) (∆̂1(x) − h² κ2 f(x) B(x)) →d 0 .
Putting all the pieces together and using the fact that f̂(x) →P f(x), we have our theorem.
6. n → ∞, h → 0, nh → ∞, and h = O(n−1/5 ).
It follows that
√(nh) (m̂(x) − m(x) − h² κ2 B(x)) →d N(0, σ²(x) R(k) / f(x)) .
From the theorem, we also have that the asymptotic mean squared error
of the NW estimator is
MSE(x) = h⁴ κ2² B²(x) + σ²(x) R(k) / (n h f(x)) .
The optimal rate of h which minimizes the asymptotic MSE is therefore Cn^{−1/5}, where C is a function of (κ2, B(x), σ²(x), R(k), f(x)). In this case, the MSE converges at rate O(n^{−4/5}), which is the same as the rate obtained in density estimation. It is possible to estimate C, for instance by plug-in approaches. However, this is cumbersome and other methods, such as cross-validation, may be easier.
Bias and Undersmoothing. Note that the bias term needs to be es-
timated to obtain valid confidence intervals. However, B(x) depends on
m′ (x), m′′ (x), f ′ (x) and f (x). Estimating these objects is arguably more
complicated than the problem we started out with. A (proper) residual
bootstrap could be used to obtain valid confidence interval.
Alternatively, we can undersmooth. Undersmoothing is about choosing
h such that
√(nh) h² → 0 ,
which makes the bias small, i.e.,
√(nh) h² κ2 B(x) ≈ 0 .
This eliminates the asymptotic bias but requires h to be smaller than opti-
mal, since optimality requires that
nhh4 → C > 0 .
Such an h will also be incompatible with bandwidth choice methods like cross
validation. Further, undersmoothing does not work well in finite samples.
Better methods exist, though they are outside the scope of the course.
Definition 9.5 (Local Linear (LL) Estimator) For each x, solve the fol-
lowing minimization problem,
{β̂0(x), β̂1(x)} = argmin_{(b0,b1)} Σ_{i=1}^n k((Xi − x)/h) (Yi − b0 − b1(Xi − x))² .   (9.3)
The local linear estimator of m(x) is the local intercept: β̂0 (x).
The LL estimator of the derivative of m(x) is the estimated slope coefficient: m̂′(x) = β̂1(x). To write the solution in closed form, let Zi(x) = (1, Xi − x)′ and
ki(x) = k((Xi − x)/h) .
Then
(β̂0(x), β̂1(x))′ = ( Σ_{i=1}^n ki(x) Zi(x) Zi(x)′ )⁻¹ Σ_{i=1}^n ki(x) Zi(x) Yi ,   (9.4)
so that for each x, the estimator is just weighted least squares of Y on Z(x).
In fact, as h → ∞, the LL estimator approaches the full-sample linear least-
squares estimator
m̂(x) = β̂0 + β̂1 x .
This is because as h → ∞, all observations receive equal weight regardless
of x. The LL estimator is thus a flexible generalization of least squares.
Deriving the asymptotic distribution of the LL estimator is similar to
that of the NW estimator, but much more involved. We will skip that here.
One advantage of spline estimators over kernels is that global inequality and
equality constraints can be imposed more conveniently. Series Estimators
are of the form
    m̂(x) = Σ_{j=0}^{τn} β̂j φj(x) .
They are typically very easy to compute. However, there is relatively little
theory about how to select the basis functions φ(x) and the smoothing
parameters τn .
Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.
Lecture 10
Regression Discontinuity and Matching
10.1.1 Identification
The RDD allows us to estimate a specific type of treatment effect known as
the ATE at the cutoff, E[Y (1) − Y (0) | Z = 0]. Note that this is an average
effect for those individuals with scores exactly at the cutoff of the running
variable. However, Y (0) at the cutoff is not observed by design, so we need
some assumptions.
Note that a special situation occurs at the cutoff Z = 0, as illustrated by
Figure 10.1 where we plot the conditional mean function in (10.1). Consider
two groups of units: one with score equal to 0, and the other with score
barely below 0, say Z = −ε. If the value of E[Y(0)|Z = −ε] is not abruptly different from E[Y(0)|Z = 0], then units with Z = −ε would be a valid counterfactual for units with Z = 0. Putting it formally, if the conditional mean function E[Y(0)|Z = z] is continuous at z = 0, then the ATE at the cutoff, denoted by θsrd, can be identified as follows
    θsrd = E[Y(1) − Y(0) | Z = 0] = E[Y | Z = 0] − lim_{z↑0} E[Y | Z = z] .
Note that the regressor is (Zi − c) but we are assuming c = 0. Given this, we can estimate θsrd by local linear regression on each side of the cutoff, taking the difference between the two estimated intercepts.
In addition to ĥCCT, they also propose bias correction methods and new variance estimators that account for the additional noise introduced by estimating the bias. While it is common to see papers based on undersmoothing, i.e., choosing h so that nh⁵ → 0 and ignoring the asymptotic bias, using ĥCCT is a better approach.
Sharp RD (SRD) and Fuzzy RD (FRD). While sharp RDD are characterized by perfect compliance, fuzzy RDD allow partial compliance, which arises if some units with running variable above c decide not to receive treatment. For example, people may not cast a vote even if they are older than 18 and eligible to vote. Such partial compliance still induces a discontinuity in P{D = 1|Z = z} at c, but the probability does not necessarily jump from 0 to 1.
    θ̂frd = ( β̂0+ − β̂0− ) / ( γ̂0+ − γ̂0− ) .
Alternatively, we can obtain the estimator θ̂frd using two stage least squares. Define the intention to treat by T = I{Z ≥ c}. Note that T is a valid instrument for D in that T is exogenous conditional on Z. It can be shown that the LL approach with uniform kernels and the same bandwidth on both sides is numerically equivalent to a TSLS regression.
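To make the FRD construction concrete, the sketch below computes θ̂frd as the ratio of the outcome jump to the treatment-probability jump, each estimated by local linear regression with a uniform kernel. The simulated design, the cutoff c = 0, and the bandwidth are assumptions for illustration only.

    import numpy as np

    def ll_intercept(z, v, h):
        """Local linear fit of v on (1, z) using observations with |z| <= h (uniform kernel);
        returns the estimated intercept, i.e. the fitted value at the cutoff z = 0."""
        sel = np.abs(z) <= h
        Zmat = np.column_stack([np.ones(sel.sum()), z[sel]])
        coef, *_ = np.linalg.lstsq(Zmat, v[sel], rcond=None)
        return coef[0]

    rng = np.random.default_rng(3)
    n = 5000
    Z = rng.uniform(-1, 1, n)                       # running variable, cutoff at 0
    T = (Z >= 0).astype(float)                      # intention to treat
    D = rng.binomial(1, 0.2 + 0.6 * T)              # partial compliance: jump in P{D=1|Z} at 0
    Y = 1.0 * D + 0.5 * Z + rng.normal(0, 1, n)     # true effect of D equal to 1 (hypothetical)

    h = 0.2
    num = ll_intercept(Z[Z >= 0], Y[Z >= 0], h) - ll_intercept(Z[Z < 0], Y[Z < 0], h)
    den = ll_intercept(Z[Z >= 0], D[Z >= 0], h) - ll_intercept(Z[Z < 0], D[Z < 0], h)
    print("theta_frd_hat =", round(num / den, 3))   # (beta0+ - beta0-) / (gamma0+ - gamma0-)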
10.1.6 Validity of RD
RD imposes relatively weak assumptions and identifies a very specific and
local parameter. The identification hinges on the continuity of E[Y (d)|Z =
z] at the cutoff, yet this assumption is fundamentally untestable and can
be violated in the following situation. Suppose that the running variable
is a test score. Individuals know the cutoff and have an option to re-take
the test, and may do so if their scores are just below the cutoff. This leads
to a discontinuity of the density fZ (z) of Z at the cutoff c, and possibly a
discontinuity of E[Y (d)|Z = z] as well because it is a functional of fZ (z),
    E[Y(d)|Z = z] = ∫ y f_{Y|Z}(y|z) dy   where   f_{Y|Z}(y|z) = f_{Y,Z}(y, z) / f_Z(z) .
This may invalidate the design. This problem is called “manipulation” of
the running variable.
As a way to detect such manipulation, ? proposes a test for continuity of the density fZ(z) of Z at the cutoff. In principle, one does not need continuity of the density of Z at c, but a discontinuity is suggestive of violations of the no-manipulation assumption. ? also propose a new test based on order statistics that does not require smoothness assumptions.
In addition to manipulation, the continuity assumption may fail to hold
due to discontinuity in the distribution of covariates. To see this, suppose
there are an observed factor X and an unobserved factor U that affect
potential outcomes, say
Y (d) = md (Z, X) + U
where f_{Y|Z,X}(y|z, x) = f_{Y,Z,X}(y, z, x) / f_{Z,X}(z, x) and f_{X|Z}(x|z) = f_{X,Z}(x, z) / f_Z(z). These effects may be attributed erroneously to the treatment of interest.
The rejection of this null suggests that E[Y(d) | Z = z] may not be continuous either. However, E[X | Z = z] could still be continuous, and H0 could hold, even if the distribution of X is discontinuous at the cutoff; the intuition on how a discontinuity in X may confound the effect of the treatment involves the entire distribution of X. ? propose a test for continuity of F_{X|Z}(x|z) at the cutoff. The test is easy to implement, is based on permutation tests, and involves novel asymptotic arguments.
10.1.7 RD Packages
Statistical packages to compute LL RD estimators and run RDD validity tests are available online. Below we introduce four packages.
meaning that, within each subgroup of agents with the same X, there should be both treated and control units. Complications arise when X is continuously distributed.
Identification through the unconfoundedness assumption is inherently
different from RDD. In sharp RDD, the unconfoundedness assumption holds
trivially because if we define D = I{Z ≥ c} then
(Y (0), Y (1)) ⊥⊥ D | Z.
Moreover, the overlap assumption never holds in sharp RDD because the probability of receiving treatment given the running variable is either 1 or 0, i.e.,
    P{D = 1 | Z < c} = 0   and   P{D = 0 | Z ≥ c} = 0 .
Let Mij = |Xi − Xj| and let jq(i) denote the index j satisfying:
    Opposing treatment: Dj = 1 − Di ;
    qth closest to i among opposing units: Σ_{s: Ds = 1−Di} I{Mis ≤ Mij} = q .
That is, jq (i) is the index of the unit that is the qth closest to unit i in terms
of the covariate values, among the units with the treatment opposite to that
of unit i. Let Jq (i) denote the set of indices for the first q matches for unit
i:
Jq (i) = {j1 (i), . . . , jq (i)} .
Then the matching estimator of θate = E[Y(1) − Y(0)] is given by
    θ̂ate = (1/n) Σ_{i=1}^n ( Ŷi(1) − Ŷi(0) ) ,
where
    Ŷi(d) = Yi if Di = d,   and   Ŷi(d) = (1/q) Σ_{j∈Jq(i)} Yj if Di ≠ d .
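The following Python sketch implements this nearest-neighbor matching estimator for a scalar covariate. The data-generating process (selection on X, true ATE equal to 1) is a hypothetical example chosen only so that unconfoundedness holds.

    import numpy as np

    def matching_ate(Y, D, X, q=1):
        """Nearest-neighbor matching estimator of E[Y(1) - Y(0)] with scalar covariate X.
        For each unit, the q closest units with the opposite treatment are averaged."""
        n = len(Y)
        Y1 = np.empty(n)
        Y0 = np.empty(n)
        for i in range(n):
            opp = np.where(D != D[i])[0]                       # units with the opposite treatment
            J_q = opp[np.argsort(np.abs(X[opp] - X[i]))[:q]]   # indices of the q closest matches
            own, imp = Y[i], Y[J_q].mean()
            Y1[i], Y0[i] = (own, imp) if D[i] == 1 else (imp, own)
        return np.mean(Y1 - Y0)

    rng = np.random.default_rng(4)
    n = 2000
    X = rng.normal(size=n)
    D = rng.binomial(1, 1 / (1 + np.exp(-X)))     # treatment depends on X only
    Y = 1.0 * D + X + rng.normal(size=n)          # true ATE equal to 1 (illustrative)
    print("theta_ate_hat =", round(matching_ate(Y, D, X, q=1), 3))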
This means that we no longer need to condition on the entire X but only
on one-dimensional propensity score p(X) in order to achieve independence
between the potential outcome (Y (0), Y (1)) and the treatment status D. To
see how it holds, note that
    P{D = 1 | Y(0), Y(1), p(X)} = E[ E[D | Y(0), Y(1), p(X), X] | Y(0), Y(1), p(X) ]
                                 = E[ E[D | Y(0), Y(1), X] | Y(0), Y(1), p(X) ]
                                 = E[ E[D | X] | Y(0), Y(1), p(X) ]
                                 = E[ p(X) | Y(0), Y(1), p(X) ]
                                 = p(X) ,
and thus we can use the matching estimator matching on the propensity score only. This can be formalized by noting that
    E[ DY / p(X) ] = E[ (1/p(X)) E[DY(1) | p(X)] ]
                   = E[ (1/p(X)) E[D | p(X)] E[Y(1) | p(X)] ] = E[Y(1)] ,
and similarly
    E[ (1 − D)Y / (1 − p(X)) ] = E[Y(0)] .
The sample analogue of these expressions, θ̂n, does not explicitly match observations but instead attaches weights induced by the propensity score to each outcome Yi, while it is still based on the unconfoundedness assumption.
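A minimal sketch of the resulting inverse-propensity-weighting estimator follows, using the true propensity score for simplicity. The design is purely illustrative and the true score would normally have to be estimated.

    import numpy as np

    def ipw_ate(Y, D, pscore):
        """Inverse-propensity-weighting estimator based on the two moment conditions
        E[DY/p(X)] = E[Y(1)] and E[(1-D)Y/(1-p(X))] = E[Y(0)]."""
        return np.mean(D * Y / pscore) - np.mean((1 - D) * Y / (1 - pscore))

    rng = np.random.default_rng(5)
    n = 5000
    X = rng.normal(size=n)
    p = 1 / (1 + np.exp(-X))                     # true propensity score p(X)
    D = rng.binomial(1, p)
    Y = 1.0 * D + X + rng.normal(size=n)         # true ATE equal to 1 (illustrative)

    print("theta_hat (true p):", round(ipw_ate(Y, D, p), 3))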
The propensity score is a scalar. The results in ? imply that the bias term is of lower order than the variance term, so that matching on the propensity score leads to a √n-consistent, asymptotically normal estimator. Given the data, we cannot compute θ̂n because it depends on the unknown propensity score function p(·). The estimator based on the true propensity score has the same asymptotic variance as the one derived in ?. With estimated propensity scores, the asymptotic variance of matching estimators is more involved due to the "generated regressor" problem. The topic is beyond our scope. Those who are interested can consult ?.
Bibliography
Calonico, S., M. D. Cattaneo, and R. Titiunik (2014): “Robust Non-
parametric Confidence Intervals for Regression-Discontinuity Designs,”
Econometrica, 82, 2295–2326.
Lecture 11
Random Forests
Lecture 12
LASSO
Y = X ′β + U .
Definition 12.2 (Oracle Estimator) The oracle estimator β̂n^o is the infeasible estimator obtained by least squares using only the variables in S.
In practice, we do not know the set S and so our goal is to estimate β,
and possibly S, exploiting the fact that the model is known to be sparse.
In particular, we would like our estimator β̂n to satisfy three properties:
estimation consistency, model selection consistency, and oracle efficiency.
Model selection consistency requires that the estimated support, Ŝn = {j : β̂n,j ≠ 0}, recovers the true support S with probability approaching one, i.e.,
    P{Ŝn = S} → 1   as n → ∞ .
12.2 LASSO
LASSO is short for Least Absolute Shrinkage and Selection Operator and is
one of the well known estimators for sparse models. The LASSO estimator
β̂n is defined as the solution to the following minimization problem

    β̂n = arg min_b  Σ_{i=1}^n (Yi − Xi'b)² + λn Σ_{j=1}^k |bj| ,   (12.2)
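A minimal coordinate-descent sketch of (12.2) in Python follows. The soft-thresholding update, the sparse design, and the penalty level are illustrative; this is not the solver used by standard packages, just a transparent implementation of the objective above.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(X, Y, lam, n_iter=200):
        """Coordinate-descent LASSO for the objective in (12.2):
        sum_i (Yi - Xi'b)^2 + lam * sum_j |b_j|."""
        n, k = X.shape
        b = np.zeros(k)
        col_ss = (X ** 2).sum(axis=0)                 # sum_i X_ij^2 for each j
        for _ in range(n_iter):
            for j in range(k):
                r_j = Y - X @ b + X[:, j] * b[j]      # partial residual excluding coordinate j
                b[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_ss[j]
        return b

    rng = np.random.default_rng(6)
    n, k, s = 200, 20, 3
    X = rng.normal(size=(n, k))
    beta = np.zeros(k); beta[:s] = [2.0, -1.5, 1.0]   # sparse truth: only the first s coefficients nonzero
    Y = X @ beta + rng.normal(size=n)

    b_hat = lasso_cd(X, Y, lam=50.0)
    print("estimated support:", np.flatnonzero(np.abs(b_hat) > 1e-8))
    print(np.round(b_hat[:5], 2))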
[Figure 12.1 shows the function y = |x| together with the lines y = (1/2)x and y = −(1/2)x.]
satisfied if and only if g ∈ [−1, 1]. We therefore have ∂f (x) = [−1, 1]. This
is illustrated in Figure 12.1
For non-differentiable functions, the Karush-Kuhn-Tucker theorem states
that a point minimizes the objective function of interest if and only if 0 is
in the sub-differential. Applying this to the problem in (12.2) implies that
the first order conditions are given by
    2 Σ_{i=1}^n (Yi − Xi'β̂n) Xi,j = λn sign(β̂n,j)   if β̂n,j ≠ 0   (12.5)
and
    −λn ≤ 2 Σ_{i=1}^n (Yi − Xi'β̂n) Xi,j ≤ λn   if β̂n,j = 0 .   (12.6)
Compared to our previous result in (12.4), this inequality is attained with
positive probability even when U is continuously distributed. Model selec-
tion is therefore possible when the penalty function has a cusp at 0. The
difference between using a penalty with γ = 1 (LASSO) and γ = 2 (Ridge)
in the constraint problem in (12.3) is illustrated in Figure 12.2 for the simple
case where k = 2.
Theorem 12.1 (Zhao and Yu (2006)) Suppose k and s are fixed and
that {Xi : 1 ≤ i ≤ n} and {Ui : 1 ≤ i ≤ n} are i.i.d. and mutually in-
dependent. Let X have finite second moments, and U have mean 0 and
variance σ 2 . Suppose also that the irrepresentable condition holds and that
    λn/n → 0   and   λn / n^{(1+c)/2} → ∞   for 0 ≤ c < 1 .   (12.7)
Then LASSO is model-selection consistent.
The irrepresentable condition is a restrictive condition. When this condition fails and λn/√n → λ* > 0, it can be shown that LASSO selects too many variables (i.e., it selects a model of bounded size that contains all variables in S). Intuitively, if the relevant variables and irrelevant variables are highly correlated, we will not be able to discriminate between them.
Knight and Fu (2000) showed that the LASSO estimator is asymptotically normal when λn/√n → λ* ≥ 0, but that the nonzero parameters are estimated with some asymptotic bias if λ* > 0. If λ* = 0, LASSO has the same limiting distribution as the LS estimator, and so even with λ* = 0, LASSO is not oracle efficient. In addition, the requirement for asymptotic normality is in conflict with λn/n^{(1+c)/2} → ∞, and so it follows that LASSO cannot be both model selection consistent and asymptotically normal (hence oracle efficient) at the same time. Oracle efficient penalization methods work by penalizing small coefficients a lot and large coefficients very little or not at all. This can be done by using weights (as in the Adaptive LASSO below) or by changing the penalty function (which we discuss later).
To see that adaptive LASSO is oracle efficient, note that the asymptotic
variance of the estimator is the same we would have achieved had we known
the set S and performed OLS on it. The rates at which λ1,n and λ2,n grow
are important for this result.
To see why the adaptive LASSO is model selection consistent and oracle
efficient, consider the following. Recall that β1 , . . . , βs ̸= 0 and βs+1 , ..., βk =
0. Suppose that β̂n has r non-zero components asymptotically. Without the
irrepresentable condition, the LASSO includes too many variables, so that
s ≤ r ≤ k. Without loss of generality, suppose β̂n is non-zero in its first
r components. Let b be any r × 1 vector, and let β̃n denote the adaptive LASSO estimator. Define u = √n(b − β). Some algebra shows,

    √n( β̃n − β ) = arg min_u  Σ_{i=1}^n ( Ui − (1/√n) Σ_{j=1}^r Xi,j uj )² + λ2,n Σ_{j=1}^r |β̂n,j|^{−1} ( |βj + uj/√n| − |βj| ) .
By Knight and Fu (2000), β̂n converges at rate √n, so |β̂n − β| = OP(n^{−1/2}). We then split the analysis according to whether βj is zero or not.
Case βj = 0: here |β̂n,j| = OP(n^{−1/2}). Then,
    λ2,n |β̂n,j|^{−1} ( |βj + uj/√n| − |βj| ) ≈ λ2,n |uj| ,
where we have "canceled" the 1/√n term using |β̂n,j|. Now suppose uj ≠ 0 and note that
    λ2,n |uj| → ∞   since λ2,n → ∞ .
The penalty effectively tends to infinity, so that bj ≠ 0 (uj ≠ 0) cannot be the minimizer. It must be that uj = 0, i.e., bj = βj = 0.
Case βj ≠ 0: here |β̂n,j| = OP(1). It follows that,
    λ2,n |β̂n,j|^{−1} ( |βj + uj/√n| − |βj| ) ≈ λ2,n (1/√n) |uj| .
It follows that λ2,n (1/√n) |uj| →P 0 since λ2,n/√n → 0 and uj = OP(1). That is, asymptotically there is no penalty on non-zero terms, and the adaptive LASSO becomes asymptotically equivalent to OLS estimation on S. This gives rise to model selection consistency and oracle efficiency.
Clearly, ordinary LASSO corresponds to the case where pλ(|ν|) = λ|ν|, but such a penalty is not strictly concave and so model selection consistency generally does not occur. Some alternative penalty functions that do have the desired property are
1. Bridge: pλ (|ν|) = λ|ν|γ for 0 < γ < 1
2. Smoothly Clipped Absolute Deviation (SCAD): for a > 2,
    p'λ(|ν|) = λ [ I{|ν| ≤ λ/n} + ( (aλ/n − |ν|)+ / ((a − 1)λ/n) ) I{|ν| > λ/n} ] .
Note that this function is defined through its derivative.
Figure 12.3: Bridge penalty (solid line), SCAD penalty (dashed line) and
minimax concave penalty (dotted line)
where (x)+ = max{0, x}. These penalty functions are plotted in Figure
12.3. Note that they are all steeply sloped near ν = 0. Bridge penalty, like
the LASSO, continues to increase far away from ν = 0, whereas SCAD and
minimax concave penalties flatten out. For this reason, the latter penalties
exhibit lower bias.
Doing so for each q, we are able to find the total error for each λ: Γ(λ) = Σ_{q=1}^Q Γq(λ). Then we define the cross-validated λ as:
    λ̂n^{CV} = arg min_λ Γ(λ) .
For the adaptive LASSO, we need to choose both λ1,n and λ2,n . A
computationally efficient way of doing so is to choose λ1,n via the above
cross-validation procedure, and then having fixed this λ1,n , choose λ2,n by
a second round of cross-validation.
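The Q-fold procedure just described can be sketched in a few lines. The snippet below continues the coordinate-descent sketch above (it reuses lasso_cd and the simulated X, Y from that example); the grid of penalty levels and the number of folds are illustrative choices.

    import numpy as np

    def kfold_cv_lasso(X, Y, lambdas, Q=5, seed=0):
        """Q-fold cross-validation for the LASSO penalty level:
        Gamma(lam) = sum over folds q of the out-of-fold squared errors Gamma_q(lam)."""
        n = X.shape[0]
        folds = np.random.default_rng(seed).integers(0, Q, size=n)
        gamma = np.zeros(len(lambdas))
        for q in range(Q):
            tr, te = folds != q, folds == q
            for a, lam in enumerate(lambdas):
                b = lasso_cd(X[tr], Y[tr], lam)
                gamma[a] += np.sum((Y[te] - X[te] @ b) ** 2)     # Gamma_q(lam)
        return lambdas[int(np.argmin(gamma))]

    lambdas = np.linspace(1.0, 100.0, 20)
    print("cross-validated lambda:", kfold_cv_lasso(X, Y, lambdas))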
Arguably, there exist few results about the properties of the LASSO when λn is chosen via cross-validation. In a recent working paper, Chetverikov et al. (2016) show that in a model with random design, in which k is allowed to depend on n, and assuming Ui|Xi is Gaussian, a bound on ∥β̂n − β∥2,n of order ((|S| log k)/n)^{1/2} log^{7/8}(kn) holds with high probability, where ∥b − β∥2,n = ( (1/n) Σ_{i=1}^n (Xi'(b − β))² )^{1/2} is the prediction norm. It turns out that ((|S| log k)/n)^{1/2} is the fastest convergence rate possible, so that cross-validated LASSO is nearly optimal. However, it is not known if the log^{7/8}(kn) term can be dropped.
Finally, we mention one alternative approach to choosing λn. This is done by minimizing the Bayesian Information Criterion. Define:
    σ̂²(λ) = (1/n) Σ_{i=1}^n (Yi − Xi'β̂n(λ))² ,
and
    BIC(λ) = log(σ̂²(λ)) + |Ŝn(λ)| Cn log(n)/n ,
where Cn is an arbitrary sequence that tends to ∞. Wang et al. (2009) show
that under some technical conditions, choosing λn to minimize BIC(λ) leads
to model selection consistency when U is normally distributed.
Bibliography
Chetverikov, D., Z. Liao, and V. Chernozhukov (2016): “On Cross-
Validated LASSO,” arXiv preprint arXiv:1605.02214.
Zou, H. (2006): “The adaptive lasso and its oracle properties,” Journal of
the American Statistical Association, 101, 1418–1429.
Lecture 13
Binary Choice
Let (Y, X) be a random vector where Y takes values in {0, 1} and X takes
values in Rk+1 . Let us consider the problem of estimating
P {Y = 1 | X} . (13.1)
This problem has two interpretations that can deliver different approaches. The first interpretation consists in predicting the outcome variable Y for a given value of the covariate X. This problem can be solved by estimating the probability (13.1), which is also called the propensity score, by different methods, for instance local linear regression or classification trees.
The second interpretation of the problem consists in viewing (13.1) as a model with structure, where we are interested in the partial effects or the causal effects of X. This is traditionally the approach used in Industrial Organization, where (13.1) models the behavior of decision makers and the estimated model is used to do counterfactual analysis. Sometimes, this second interpretation is called a structural form for (13.1), while the first one is a reduced form.
In this lecture, we consider the second interpretation. We restrict our
attention to parametric and semiparametric models using the linear index
model. This model assumes the existence of β ∈ R^{k+1} such that
    P{Y = 1 | X} = P{Y = 1 | X'β} .   (13.2)
This condition reduces the dimension of the problem. To see this, note that the left-hand side in (13.2) is a function of X ∈ R^{k+1}, while the right-hand side in (13.2) is a function of X'β ∈ R, which is known as the linear index.
For instance, in a random utility model an agent chooses between two alternatives, A and B, with utilities
    X'βA + UA   and   X'βB + UB .
Alternative B is chosen whenever
    X'βB + UB ≥ X'βA + UA ,
which is equivalent to
    X'(βB − βA) + UB − UA ≥ 0 .
13.1.1 Identification
Let us denote by P the distribution of the observed data, and denote by P = {Pθ : θ ∈ Θ} a (statistical) model for P. These probability distributions are indexed by the parameter θ, where this parameter could have infinite dimensional components, e.g., a nonparametric distribution of the unobservable component.
Using this notation, the model P is correctly specified if the distribution of the observable data belongs to the model, i.e., P ∈ P. In this case, there is a parameter θ such that P = Pθ. Now our interest might be in θ or in a function λ(θ). The set of parameter values consistent with P is
    Θ0(P) = {θ ∈ Θ : Pθ = P} .
P2. There exists no A ⊆ Rk+1 such that A has probability one under PX
and A is a proper linear subspace of Rk+1 .
The parametric assumption P1 allows us to replace PU|X with σ, since that parameter characterizes the parametric distribution. Now we can write θ = (β, PX, σ). Using this notation, let us study the identification of θ. We prove the result by contradiction: assume there are two values θ = (β, PX, σ) and θ* = (β*, PX*, σ*) such that θ ≠ θ* and Pθ = Pθ*, and then reach a contradiction.
First, notice that the marginal distribution of X is identified from the
joint distribution of (Y, X). This implies that PX = PX∗ . Second, we can
use assumption P1 to compute the probability of Y = 1 given X using both
models. That is,
    Pθ{Y = 1|X} = Φ( X'β/σ )   and   Pθ*{Y = 1|X} = Φ( X'β*/σ* ) ,
which by assumption on Pθ and Pθ* deliver the same probability. Since Φ(·) is an increasing function, we obtain
    X'( β/σ − β*/σ* ) = 0 .
By assumption P2, we conclude
    β/σ = β*/σ* .   (13.3)
Otherwise, we can define a proper linear subspace A = {x ∈ Rk+1 |x′ (β/σ −
β ∗ /σ ∗ ) = 0} such that A has probability one under PX .
Note that we cannot conclude that β = β ∗ or σ = σ ∗ . Indeed, our
analysis shows that any θ and θ∗ such that (13.3) holds and PX = PX∗
satisfies Pθ = Pθ∗ . This implies we cannot identify θ = (β, PX , σ) but we
can identify λ(θ) = (PX , β/σ).
A few remarks are in order. First, researchers typically assume further that |β| = 1, or β0 = 1, or σ = 1; this is a normalization that lets us conclude that the parameter θ is identified. Second, the model with σ = 1 is called the Probit model. Let us verify the identification of θ.
S2. There exists no A ⊆ Rk+1 such that A has probability one under PX
and A is a proper linear subspace of Rk+1 .
S3. |β| = 1.
⟺ X'β ≥ 0 .
    Pθ{ X'β* < 0 ≤ X'β } > 0   or   Pθ{ X'β < 0 ≤ X'β* } > 0 .
    Pθ{ X'β* < 0 ≤ X'β } = Pθ{ X_{−k}'β*_{−k} < 0, Xk > −X_{−k}'β_{−k}/βk } ,   (13.5)
and
    Pθ{ X'β < 0 ≤ X'β* } = Pθ{ X_{−k}'β*_{−k} ≥ 0, Xk < −X_{−k}'β_{−k}/βk } .   (13.6)
If Pθ{X_{−k}'β*_{−k} < 0} > 0, we can use Assumption S4 to conclude that (13.5) is positive. If Pθ{X_{−k}'β*_{−k} ≥ 0} > 0, as before, we conclude (13.6) is positive using Assumption S4.
    Pθ{ X'β* < 0 ≤ X'β } = Pθ{ −X_{−k}'β_{−k}/βk ≤ Xk < −X_{−k}'β*_{−k}/β*k }   (13.7)
and
    Pθ{ X'β < 0 ≤ X'β* } = Pθ{ −X_{−k}'β*_{−k}/β*k ≤ Xk < −X_{−k}'β_{−k}/βk } .   (13.8)
As we did in Case 2, we can prove that (13.7) or (13.8) is positive using Assumption S4. Thus, we only need to prove that
    Pθ{ X_{−k}'β_{−k}/βk > X_{−k}'β*_{−k}/β*k } > 0
13.2. ESTIMATION OF THE LINEAR INDEX MODEL 127
or
    Pθ{ X_{−k}'β_{−k}/βk < X_{−k}'β*_{−k}/β*k } > 0 .
Let us assume by contradiction that the probabilities above are equal to zero. This implies
    Pθ{ X_{−k}'β_{−k}/βk = X_{−k}'β*_{−k}/β*k } = 1 ,
which is equivalent to
    Pθ{ X_{−k}'( β_{−k}/βk − β*_{−k}/β*k ) = 0 } = 1 .
By Assumption S2, which says that there is no proper linear subspace that contains X with probability one, we conclude
    β_{−k}/βk = β*_{−k}/β*k .
This implies that β* is a scalar multiple of β. By Assumption S3, we conclude β = β*, but this is a contradiction. This completes the proof of this case.
    P{Y = 1|X} = F(X'β) ,
where F(·) can be the Probit function, F(x) = Φ(x), or the Logit function, F(x) = exp(x)/(1 + exp(x)). Suppose we have a random sample of size n from the distribution of (Y, X); that is, (Y1, X1), . . . , (Yn, Xn). Since we have a parametric model, we can use the maximum likelihood estimator. Let us write the likelihood of the observation Yi:
    fb(Yi|Xi) = F(Xi'b)^{Yi} ( 1 − F(Xi'b) )^{1−Yi} .
We can use this expression to write the log-likelihood of the random sample:
    ℓn(b) = (1/n) Σ_{i=1}^n ln( fb(Yi|Xi) )
          = (1/n) Σ_{i=1}^n [ Yi ln F(Xi'b) + (1 − Yi) ln( 1 − F(Xi'b) ) ] .
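For concreteness, the following Python sketch maximizes this log-likelihood for the Logit model by Newton-Raphson. The simulated design and true coefficients are illustrative assumptions; the variance estimate uses the sample analogue of I_β discussed below.

    import numpy as np

    def logit_mle(X, Y, n_iter=25):
        """Maximum likelihood for the Logit model P{Y=1|X} = F(X'b) with
        F(v) = exp(v)/(1+exp(v)), via Newton-Raphson on the log-likelihood."""
        n, k = X.shape
        b = np.zeros(k)
        for _ in range(n_iter):
            p = 1 / (1 + np.exp(-X @ b))              # F(Xi'b)
            score = X.T @ (Y - p)                     # gradient of the (un-normalized) log-likelihood
            hess = -(X * (p * (1 - p))[:, None]).T @ X
            b = b - np.linalg.solve(hess, score)      # Newton step
        p = 1 / (1 + np.exp(-X @ b))
        info = (X * (p * (1 - p))[:, None]).T @ X / n # sample analogue of I_beta (F' = F(1-F) for Logit)
        return b, np.linalg.inv(info) / n             # beta_hat and estimated Var(beta_hat) = I^{-1}/n

    rng = np.random.default_rng(7)
    n = 2000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0])))))

    b_hat, V_hat = logit_mle(X, Y)
    print("beta_hat:", np.round(b_hat, 3), " s.e.:", np.round(np.sqrt(np.diag(V_hat)), 3))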
It can be shown that β is the unique maximizer of Q(b) = E[ℓn(b)]. Let us denote by β̂n the maximum likelihood estimator (MLE). The asymptotic normality of the MLE implies
    √n( β̂n − β ) →d N(0, V) ,
where V = I_β^{−1} and
    I_β = −E[ ∂²/∂β∂β' ln( f_β(Yi|Xi) ) ] .
Since
    ∂/∂β ln f_β(Yi|Xi) = ( (Yi − F(Xi'β)) / ( F(Xi'β)(1 − F(Xi'β)) ) ) F'(Xi'β) Xi ,
the information matrix can be written as
    I_β = E[ (∂/∂β ln f_β(Yi|Xi)) (∂/∂β ln f_β(Yi|Xi))' ] = E[ ( F'(Xi'β)² / ( F(Xi'β)(1 − F(Xi'β)) ) ) Xi Xi' ] ,
where the second equality above follows from the law of iterated expectations and the law of total variance.
This final expression implies that we can estimate the asymptotic vari-
ance, I−1β . This estimation can be done using the MLE and the sample
analogue to compute the expected value. Note that this implies that we
can do inference on β, but nothing yet about the inference on the marginal
effects.
In the linear probability model, the marginal effect of Xj is constant,
    ∂E[Y|X]/∂Xj = βj .
In the Probit model,
    ∂P{Y = 1|X}/∂Xj = φ(X'β) βj ,
and in the Logit model,
    ∂P{Y = 1|X}/∂Xj = F(X'β)( 1 − F(X'β) ) βj .
Note that the marginal effect of Xj on E[Y|X] depends on the linear index X'β as well as on βj. However, we can still extract information by simply inspecting β. For instance, we can use the ratio between βj and βk to obtain the ratio of the partial effects, since we have
    ( ∂P{Y = 1|X}/∂Xj ) / ( ∂P{Y = 1|X}/∂Xk ) = βj / βk .
Also, because F(·) is an increasing function, we can conclude that the sign of βj identifies the sign of the marginal effect of Xj on E[Y|X]. Finally, it is possible to obtain upper bounds on the marginal effects from β using that F'(·) is bounded. In the case of the Probit model, we obtain
    ∂P{Y = 1|X}/∂Xj ≤ 0.4 βj   since   φ(x) ≤ φ(0) = 1/√(2π) ≈ 0.4 ,
and in the case of the Logit model,
    ∂P{Y = 1|X}/∂Xj ≤ (1/4) βj   since   F(x)( 1 − F(x) ) ≤ 1/4 .
We can estimate this quantity using the sample analogue and the MLE:
    (1/n) Σ_{i=1}^n F'(Xi'β̂n) β̂n,j .
We can also compute the marginal effect "at the average", defined by
    F'(E[X]'β) βj ,
which can be estimated by
    F'(X̄n'β̂n) β̂n,j .
Stata offers both options through the margins command. Note that these two quantities, the average marginal effect and the marginal effect at the average, are different. The second one could make sense if there is meaning in evaluating the effect at the sample average, but often this is not the case, for instance when Xj is a binary variable (e.g., gender).
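The two quantities are easy to compute by hand. The short sketch below continues the Logit MLE example above (it assumes X and b_hat from that snippet are in scope) and compares the average marginal effect, the marginal effect at the average, and the crude upper bound βj/4.

    import numpy as np

    # Continuing the Logit sketch above: X is the design matrix (first column a constant)
    # and b_hat the Logit MLE. Marginal effects below are for the second regressor (j = 1).
    F = lambda v: 1 / (1 + np.exp(-v))

    p_hat = F(X @ b_hat)
    ame = np.mean(p_hat * (1 - p_hat)) * b_hat[1]        # average marginal effect: (1/n) sum F'(Xi'b) b_j
    p_bar = F(X.mean(axis=0) @ b_hat)
    mea = p_bar * (1 - p_bar) * b_hat[1]                 # marginal effect at the average: F'(Xbar'b) b_j
    print("AME:", round(float(ame), 3), " MEA:", round(float(mea), 3),
          " bound b_j/4:", round(float(b_hat[1] / 4), 3))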
It is important to note that we computed the average marginal effect of Xj assuming this variable was continuously distributed. Now, let us focus on the case in which this variable is binary. In particular, consider the partition X = (X1, D), where X1 ∈ R^k and D ∈ {0, 1}, and partition β = (β1, β2) accordingly. In this case, the expression
    E[ ∂P{Y = 1|X}/∂D ] = E[ F'(X'β) ] β2
does not make a lot of sense since D takes only two values. Instead, we can consider the following marginal effect of D,
    E[ F(X1'β1 + β2) − F(X1'β1) ] .
We can compute and report standard errors for these estimated marginal effects. Recall that for the continuous case we derived
    ∂P{Y = 1|X}/∂Xj = F'(X'β) βj .
For the Logit model, write
    pi = exp(Xi'β) / ( 1 + exp(Xi'β) ) ,
so that log( pi/(1 − pi) ) = Xi'β.
In this case, we can interpret βj as the marginal effect of Xj on the log odds
ratio. For example, consider a clinical trial and denote by Y = 1 surviving and by Y = 0 dying. An odds ratio of 2 means that the odds of survival are twice those of death. Now, if βj = 0.1, it means that the odds of survival increase by roughly 10% if Xj increases by one unit.
The true E[Y|X] may arise from a causal model, but the regression is only providing a linear approximation to the true E[Y|X]. This suggests that the LPM follows the second interpretation of the linear regression presented in Lecture 1. This means that the LPM is a descriptive tool that approximates E[Y|X] rather than a model that admits a causal interpretation.
The linear probability model delivers predicted probabilities outside [0, 1], which makes it internally inconsistent as a model. A well-known textbook that supports this approach recognizes this issue; see the discussion in Angrist and Pischke (2008, p. 103).
Angrist and Pischke (2008) acknowledge that there are available approaches for the binary choice model which admit a causal interpretation and are different from the LPM. However, on page 197, they add the following about this point:
“Yet we saw that the added complexity and extra work required
to interpret the results from latent index models may not be
worth the trouble”
Remark 13.2 It is expected that Logit, Probit, and LPM yield quite different estimates β̂n. For instance, if we use the upper bounds for marginal effects, we get
    β̂logit ≈ 4 β̂ols
    β̂probit ≈ 2.5 β̂ols
    β̂logit ≈ 1.6 β̂probit .
However, average marginal effects from Logit, Probit, and even the LPM are often "close", partly because there is averaging going on.
The binary choice model discussed here using the linear index model is an idea that applies to other settings, for instance ordered choice models, where an individual decides how many units of the same item to buy, or unordered choice models, where an individual chooses one of many different alternatives. In these kinds of models, it is common to find the conditional Logit and the multinomial Logit. The most popular example in Industrial Organization (IO) is the random coefficient Logit model introduced by Berry et al. (1995), which is also known as BLP and is useful to estimate demand. These topics are covered in second year IO classes.
Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.
Lecture 14
Heteroskedastic-Consistent
Variance Estimation
Y = X ′β + U . (14.1)
for
    V = E[XX']^{−1} E[XX'U²] E[XX']^{−1} .
We wish to test
    H0 : β ∈ B0   versus   H1 : β ∈ B1 ,
with the leading example being
    H0 : β1 = c   versus   H1 : β1 ≠ c .   (14.2)
where Ûi = Yi − Xi′ β̂n . This is the most widely used form of the robust,
heteroskedasticity-consistent standard errors and it is associated with the
work of White (1980) (see also Eicker, 1967; Huber, 1967). We will refer to
these as robust EHW (or HC) standard errors.
Under the assumption that Var[XU ] < ∞, the first term on the righthand
side of the preceding display converges in probability to Var[XU ]. It there-
fore suffices to show that the second term on the righthand side of the
preceding display converges in probability to zero. We argue this separately
for each of the (k + 1)2 terms. To this end, note for any 0 ≤ j ≤ k and
0 ≤ j ′ ≤ k that
    | (1/n) Σ_{1≤i≤n} Xi,j Xi,j' (Ûi² − Ui²) | ≤ (1/n) Σ_{1≤i≤n} |Xi,j Xi,j'| |Ûi² − Ui²|
                                               ≤ ( (1/n) Σ_{1≤i≤n} |Xi,j Xi,j'| ) max_{1≤i≤n} |Ûi² − Ui²| .
P {β1 ∈ Cn } → 1 − α
as n → ∞.
It is worth noting that Stata does not compute V̂n in the default “ro-
bust” option, but rather a version of this estimator that includes a finite
sample adjustment to “inflate” the estimated residuals (known to be too
small in finite samples). This version of the HC estimator is commonly
known as HC1 and given by
    V̂hc1,n = ( (1/n) Σ_{1≤i≤n} Xi Xi' )^{−1} ( (1/n) Σ_{1≤i≤n} Xi Xi' Ûi*² ) ( (1/n) Σ_{1≤i≤n} Xi Xi' )^{−1} ,
The consistency of the standard errors does not necessarily translate into accurate finite sample inference on β in general, something that has led to a number of finite sample adjustments that are sometimes used in practice. The simplest one is the HC1 correction, although better alternatives are available. Below we discuss some of these adjustments.
    P = X(X'X)^{−1}X'   and   Pii = Xi'(X'X)^{−1}Xi ,

    V̂hc2,n = ( (1/n) Σ_{i=1}^n Xi Xi' )^{−1} ( (1/n) Σ_{i=1}^n Xi Xi' Ũi² ) ( (1/n) Σ_{i=1}^n Xi Xi' )^{−1} ,   (14.5)

    Ũi*² ≡ Ûi² / (1 − Pii)² .   (14.6)
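To see these estimators side by side, here is a minimal numpy sketch computing OLS with the baseline robust variance, a degrees-of-freedom inflation in the spirit of HC1, and a leverage-adjusted version in the spirit of (14.6). The heteroskedastic design is an illustrative assumption, and the exact finite-sample factors vary across implementations.

    import numpy as np

    def hc_variances(Y, X):
        """OLS with EHW-type variance estimators."""
        n, k1 = X.shape                                             # k1 = k + 1 regressors
        XtX_inv_n = np.linalg.inv(X.T @ X / n)
        beta = np.linalg.solve(X.T @ X, X.T @ Y)
        u = Y - X @ beta                                            # residuals U_hat_i
        P_ii = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)     # leverages P_ii

        def sandwich(u2):
            meat = (X * u2[:, None]).T @ X / n
            return XtX_inv_n @ meat @ XtX_inv_n / n                 # estimate of Var(beta_hat)

        return beta, sandwich(u**2), sandwich(u**2) * n / (n - k1), sandwich(u**2 / (1 - P_ii)**2)

    rng = np.random.default_rng(8)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, 2.0]) + np.abs(X[:, 1]) * rng.normal(size=n)   # heteroskedastic errors

    beta, V0, V1, V2 = hc_variances(Y, X)
    print("beta_hat:", np.round(beta, 3))
    print("robust s.e. of beta_1:", [round(float(np.sqrt(V[1, 1])), 4) for V in (V0, V1, V2)])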
We know from previous results that this problem can be viewed as a special
case of linear regression with a binary regressor, i.e. X = (1, D) and D ∈
{0, 1}. In this case, the coefficient on D identifies the average treatment
effect, which in this case equals precisely µ1 − µ0 . To be specific, consider
the linear model
Y = X ′ β + U = β0 + β1 D + U
where
Y = Y (1)D + (1 − D)Y (0) ,
and U is assumed to be normally distributed conditional on D, with zero
conditional mean and
We are interested in
    β1 = Cov(Y, D)/Var(D) = E[Y|D = 1] − E[Y|D = 0] .
Because D is binary, the least squares estimator of β1 can be written as
    β̂1,n = Ȳ1 − Ȳ0 ,   (14.8)
where for d ∈ {0, 1},
    Ȳd = (1/nd) Σ_{i=1}^n Yi I{Di = d}   and   nd = Σ_{i=1}^n I{Di = d} .
and let
    V̂*_{1,ho} = σ̂² ( 1/n0 + 1/n1 ) ,   (14.10)
be the estimator of V1*. This estimator has two important features.
(a) Unbiased. Since σ̂² is unbiased for σ², it follows that V̂*_{1,ho} is unbiased for the true variance V1*.
    tho = ( β̂1,n − c ) / √(V̂*_{1,ho}) ∼ t(n − 2) .   (14.11)
This t-distribution with dof equal to n − 2 can be used to test (14.2) and, by duality, for the construction of exact confidence intervals, i.e.,
    CS^{1−α}_{ho} = [ β̂1,n − t^{n−2}_{1−α/2} √(V̂*_{1,ho}) , β̂1,n + t^{n−2}_{1−α/2} √(V̂*_{1,ho}) ] .   (14.12)
Here t^{n−2}_{1−α/2} denotes the 1 − α/2 quantile of a t distributed random variable with n − 2 dof. Such a confidence interval is exact under these two assumptions, normality and homoskedasticity, and we can conclude that (14.9) holds with κ = n − 2.
    V̂*_{1,hc} = σ̂²(0)/n0 + σ̂²(1)/n1 ,
where
    σ̂²(d) = (1/nd) Σ_{i=1}^n (Yi − Ȳd)² I{Di = d}   for d ∈ {0, 1} .
In small samples the properties of these standard errors are not always attractive: V̂*_{1,hc} is biased downward, i.e.,
    E[V̂*_{1,hc}] = ((n0 − 1)/n0) σ²(0)/n0 + ((n1 − 1)/n1) σ²(1)/n1 < V1* ,
and CS^{1−α}_{hc} can have coverage substantially below 1 − α. A common "correction" for this problem is to replace z_{1−α/2} with t^{n−2}_{1−α/2}. However, as we will illustrate in the next section, such a correction is often ineffective.
    V̂*_{1,hc2} = σ̃²(0)/n0 + σ̃²(1)/n1 ,
where
    σ̃²(d) = (1/(nd − 1)) Σ_{i=1}^n (Yi − Ȳd)² I{Di = d} .
The estimator V̂*_{1,hc2} is unbiased for V1*, but it does not satisfy the chi-square property in (b) above. As a result, the associated confidence interval is still not exact. Just as in the previous case, there are no assumptions under which there exists a value of κ such that (14.9) holds, even when U is normally distributed.
Bibliography
Angrist, J. D. and J.-S. Pischke (2008): Mostly harmless econometrics:
An empiricist’s companion, Princeton university press.
Imbens, G. W. and M. Kolesar (2012): “Robust standard errors in small
samples: some practical advice,” Tech. rep., National Bureau of Economic
Research.
Lecture 15
Heteroskedasticity Autocorrelation Consistent Covariance Estimation
We can immediately see that, for the variance to vanish, we need to make
sure the last summation does not explode. A sufficient condition for this is
absolute summability,
    Σ_{j=−∞}^{∞} |γj| < ∞ ,   (15.3)
in which case a law of large numbers follows one more time from an appli-
cation of Chebyshev’s inequality.
where Ω is called the long-run variance. There are many central limit the-
orems for serially correlated observations. Below we provide a commonly
used version, see Billingsley (1995, Theorem 27.4) for a proof under slightly
stronger assumptions.
The problem is that we are summing too many imprecisely estimated co-
variances. So, the noise does not die out. For example, to estimate γn−1 we
use only one observation.
These kernels are all symmetric at 0, where the first two kernels have
a bounded support [−1, 1], and the QS has unbounded support. For the
Bartlett and Parzen kernels, the weight assigned to γ̂j decreases with |j|
and becomes zero for |j| ≥ mn . Hence, mn in these functions is also known
as a truncation lag parameter. For the quadratic spectral, mn does not have
this interpretation because the weight decreases to zero at |j| = 1.2mn , but
then exhibits damped sine waves afterwards. Note that for the first two
kernels, we can write
    Ω̂n ≡ Σ_{j=−mn}^{mn} k( j/mn ) γ̂j .   (15.14)
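As a concrete example of (15.14), the sketch below computes the HAC long-run variance of a scalar series with the Bartlett kernel. The AR(1) design and the truncation lag m = n^{1/3} are illustrative choices (the lag mimics the mn³/n → 0 requirement discussed below), not a recommendation.

    import numpy as np

    def hac_long_run_variance(x, m):
        """HAC estimate of the long-run variance of a scalar series, as in (15.14),
        with the Bartlett kernel k(u) = (1 - |u|)_+ and truncation lag m."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        xc = x - x.mean()
        gamma = np.array([np.sum(xc[j:] * xc[:n - j]) / n for j in range(m + 1)])  # gamma_hat_j
        weights = 1 - np.arange(m + 1) / m                                         # k(j/m), j >= 0
        return gamma[0] + 2 * np.sum(weights[1:] * gamma[1:])    # symmetric sum over j = -m, ..., m

    rng = np.random.default_rng(9)
    n = 2000
    e = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.5 * x[t - 1] + e[t]              # AR(1) with coefficient 0.5, long-run variance 4

    m = int(n ** (1 / 3))
    print("HAC long-run variance estimate:", round(hac_long_run_variance(x, m), 3))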
1. mn → ∞ as n → ∞ and mn³/n → 0.
I will provide a sketch of the proof below under the assumptions of the
theorem above. Start by writing the difference between our estimator and
the object of interest,
    Ω̂n − Ω = − Σ_{|j|>mn} γj + Σ_{j=−mn}^{mn} ( k(j/mn) − 1 ) γj + Σ_{j=−mn}^{mn} k(j/mn) ( γ̂j − γj ) .
and conclude that the last term (which is non-stochastic) goes to zero by
similar arguments to those in (15.15). It then follows that it suffices to show
that
    | Σ_{j=−mn}^{mn} k(j/mn) ( γ̂j − γj* ) | ≤ Σ_{j=−mn}^{mn} | γ̂j − γj* | →P 0 .
By Chebyshev's inequality,
    P{ |γ̂j − γj*| > ϵ } ≤ E[(γ̂j − γj*)²]/ϵ² ≤ C/(nϵ²) ,   (15.17)
where, importantly, the bound holds uniformly in 1 ≤ j < ∞. This characterizes the accuracy with which we estimate each covariance. Now we need to assess how many auto-covariances we can estimate well simultaneously:
    P{ Σ_{j=−mn}^{mn} |γ̂j − γj*| > ϵ } ≤ P{ ∪_{j=−mn}^{mn} { |γ̂j − γj*| > ϵ/(2mn + 1) } }
                                        ≤ Σ_{j=−mn}^{mn} P{ |γ̂j − γj*| > ϵ/(2mn + 1) }
                                        ≤ Σ_{j=−mn}^{mn} E[(γ̂j − γj*)²] (2mn + 1)²/ϵ²
                                        ≤ (2mn + 1) C(2mn + 1)²/(nϵ²)
                                        ≤ C* mn³/(nϵ²) → 0 ,
where the last step uses mn³/n → 0. This completes the proof.
We have proved consistency but we have not addressed the question of
positive definiteness of our HAC estimator. To do this, it is convenient to
characterize positive definiteness using the Fourier transformation of Ω̂. We
will skip this in class, but the interested reader should see Newey and West
(1987).
Andrews (1991) did this minimization and showed that the optimal bandwidth is mn = C* n^{1/r}, where r = 3 for the Bartlett kernel and r = 5 for the other kernels. He also provided values for the optimal constant C*, which depends on the kernel used, among other things.
Bibliography
Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Con-
sistent Covariance Matrix Estimation,” Econometrica, 59, 817–858.
Lecture 16
Cluster Covariance Estimation

    Yj = Xj β + Uj ,   j = 1, . . . , q ,
and that the uniform integrability condition in (16.3) holds. Then, as n → ∞,
    || X̄n − E[X̄n] || →P 0 .
A few comments about this theorem are worth highlighting. First, the
condition in (16.3) states that Xi,j is uniformly integrable and this is a tech-
nical requirement that is usually assumed outside the i.i.d. setting. Second,
the condition in (16.2) states that each cluster size nj is asymptotically neg-
ligible. This automatically holds when nj is fixed as q → ∞, which is the
traditional framework we discussed with panel data. It also implies that
q → ∞, so we do not explicitly list this as a condition.
where the last equation uses that Xj are independent across clusters j.
Example 16.1 Consider the case where nj = n^a and q = n^{1−a} for some a ∈ (0, 1). Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and Cov[Xi,j, Xi',j] = 0. In this case it follows that
    Var[Xj'1j] = Σ_{i=1}^{nj} Var[Xi,j] = nj ,
and then
    sd(X̄n) = (1/n) ( q nj )^{1/2} = n^{−1/2} ,
where the last equality follows from q = n^{1−a} and nj = n^a.
Example 16.2 Consider the case where nj = n^a and q = n^{1−a} for some a ∈ (0, 1). Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and Cov[Xi,j, Xi',j] = 1. In this case it follows that
    Var[Xj'1j] = Σ_{i=1}^{nj} Σ_{i'=1}^{nj} Cov[Xi,j, Xi',j] = nj² = n^{2a} ,
and then
    sd(X̄n) = (1/n) ( q n^{2a} )^{1/2} = q^{−1/2} ,
where the last equality follows because q = n^{1−a}.
Example 16.3 Consider the case where nj = n^a and q = n^{1−a} for some a ∈ (0, 1). Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and Cov[Xi,j, Xi',j] = 1/|i − i'| for i ≠ i'. These conditions imply that
    Var[Xj'1j] = Σ_{i=1}^{nj} Σ_{i'=1}^{nj} Cov[Xi,j, Xi',j] = Σ_{i=1}^{nj} ( 1 + Σ_{i'≠i} 1/|i − i'| ) .
This last expression can be used to approximate the standard deviation of the sample mean,
    sd(X̄n) ∝ (1/n) ( Σ_{j=1}^q nj log(n) )^{1/2} = (1/n) ( q n^a log(n) )^{1/2} = √( log(n)/n ) ,
where the last equality uses q = n^{1−a}. It follows from here that the convergence rate is slower than n^{−1/2} since
    √n sd(X̄n) ∝ √(log(n)) → ∞   as n → ∞ .
At the same time, the convergence rate is faster than q^{−1/2} since
    √q sd(X̄n) ∝ √( n^{1−a} log(n)/n ) = √( log(n)/n^a ) → 0   as n → ∞ .
Example 16.4 Consider the case where there are two types of clusters. In the first group, the number of clusters is q1 = n/2 and nj = 1 for j = 1, . . . , q1. In the second group, the number of clusters is q2 = n^{1−a}/2 and nj = n^a for j = q1 + 1, . . . , q1 + q2, with a ∈ (0, 1). The total number of clusters is denoted by q = q1 + q2. Let Xi,j ∈ R be a random variable such that Var[Xi,j] = 1 and Cov[Xi,j, Xi',j] = 1. These conditions imply that
    Var[Xj'1j] = Σ_{i=1}^{nj} Σ_{i'=1}^{nj} Cov[Xi,j, Xi',j] = nj² ,
which we can use to compute the standard deviation of the sample mean,
    sd(X̄n) = (1/n) ( Σ_{j=1}^q nj² )^{1/2} = (1/n) ( q1 + q2 n^{2a} )^{1/2} ,
is a random variable with mean zero and covariance matrix equal to the
identity matrix Ik+1 by construction.
and
    ( Σ_{j=1}^q nj^r )^{2/r} / n ≤ C < ∞ ,   (16.5)
for some positive C > 0. Assume further that as n → ∞,
    max_{j≤q} nj²/n → 0   (16.6)
and
    λn ≥ λ > 0 ,   (16.7)
for some positive λ > 0. Then, as n → ∞
    Ωn^{−1/2} √n ( X̄n − E[X̄n] ) →d N(0, I_{k+1}) .   (16.8)
Let's discuss the conditions under which this theorem holds. Assumption (16.4) states that ||Xi,j||^r is uniformly integrable. When r = 2, this condition is similar to the Lindeberg condition for the CLT under independent heterogeneous sampling. Assumption (16.5) involves a trade-off between the cluster sizes and the number of moments r. It is least restrictive for large r, and more restrictive for small r. Note that as r → ∞, we can conclude max_{j≤q} nj²/n = O(1), which is implied by (16.6).
Assumption (16.6) allows for growing and heterogeneous cluster sizes. It allows clusters to grow uniformly at the rate nj = n^a for any 0 ≤ a ≤ (r − 2)/(2r − 2). Note that this requires the cluster sizes to be bounded if r = 2. It also allows for only a small number of clusters to grow. For example, nj = n̄ (bounded clusters) for q − k clusters and nj = q^{a/2} for k clusters, with k fixed. In this case the assumption holds for any a < 1 and r = 2.
Finally, Assumption (16.7) specifies that Var[√n c'X̄n] does not vanish for any vector c ≠ 0, since the condition implies that the minimum eigenvalue of the variance-covariance matrix is positive.
is equal to
    (1/n) Σ_{j=1}^q E[ Xj' 1j 1j' Xj ] .
    || Ω̂n − Ωn || →P 0
and
    Ω̂n^{−1/2} √n X̄n →d N(0, I_{k+1}) .
This theorem shows that the cluster covariance estimator is consistent. Moreover, replacing the covariance matrix in the central limit theorem described in Theorem 16.2 with the estimated covariance matrix does not affect the asymptotic distribution. This implies that cluster-robust t-statistics are asymptotically standard normal. It is worth mentioning that we do not need to know the actual rate of convergence of X̄n, as the cluster covariance estimator captures this rate of convergence. For the proof of these results, see Hansen and Lee (2019).
Using this expression and the model for Yj, we can derive the following
    √n( β̂n − β ) = ( (1/n) Σ_{j=1}^q Xj'Xj )^{−1} (1/√n) Σ_{j=1}^q Xj'Uj .
Now, let us introduce notation before we discuss the consistency and the
asymptotic normality properties of the LS estimator.
    Σn = (1/n) Σ_{j=1}^q E[Xj'Xj]   and   Ωn = (1/n) Σ_{j=1}^q E[Xj'Uj Uj'Xj] .
Consistency of LS
If the condition (16.2) in Theorem 16.1 holds, Σn has full rank with λmin(Σn) ≥ C > 0, and the uniform integrability condition in (16.3) holds for Xi,j Xi,j' and Xi,j Ui,j, then
    β̂n →P β .
Asymptotic Normality of LS
To properly normalize √n(β̂n − β) we define
    Vn = Σn^{−1} Ωn Σn^{−1} ,
as the rate of convergence may not be √n. Using this notation, we assume that the conditions in Theorem 16.2 hold for some r, Σn has full rank, λmin(Σn) ≥ C > 0, λmin(Ωn) ≥ C > 0, and the uniform integrability condition in (16.4) holds for Xi,j Xi,j' and Xi,j Ui,j. It follows that as n → ∞:
    Vn^{−1/2} √n( β̂n − β ) →d N(0, I_{k+1}) .
Note that to conduct inference, all we need is a consistent estimator V̂n such that
    V̂n^{−1/2} √n( β̂n − β ) →d N(0, I_{k+1}) .
This is what we develop next.
Note that in the special case with nj = 1 for all j = 1, . . . , q, this estimator
becomes the HC estimator presented in Lecture 14. It is worth mentioning
that Stata uses a multiplicative adjustment to reduce the bias,
    V̂stata = ( (n − 1)/(n − k − 1) ) ( q/(q − 1) ) V̂n .
This estimator allows for arbitrary within-cluster correlation patterns and
heteroskedasticity across clusters. Unlike HAC estimators, it does not re-
quire the selection of a kernel or bandwidth parameter.
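A compact numpy sketch of the cluster covariance estimator follows. The design mimics the simulation layout described later in this lecture (a common cluster-level shock in both the regressor and the error), with q = 50 clusters of size 30 chosen purely for illustration.

    import numpy as np

    def cluster_ols(Y, X, cluster):
        """OLS with the cluster covariance estimator Sigma_hat^{-1} Omega_hat Sigma_hat^{-1},
        where Omega_hat = (1/n) sum_j Xj' Uj_hat Uj_hat' Xj sums over clusters j."""
        n, k1 = X.shape
        beta = np.linalg.solve(X.T @ X, X.T @ Y)
        u = Y - X @ beta
        Sigma_inv = np.linalg.inv(X.T @ X / n)
        Omega = np.zeros((k1, k1))
        for c in np.unique(cluster):
            s = X[cluster == c].T @ u[cluster == c]       # Xj' Uj_hat, one (k+1)-vector per cluster
            Omega += np.outer(s, s) / n
        V = Sigma_inv @ Omega @ Sigma_inv                  # estimate of V_n for sqrt(n)(beta_hat - beta)
        return beta, V / n                                 # V/n approximates Var(beta_hat)

    rng = np.random.default_rng(10)
    q, nj = 50, 30
    cl = np.repeat(np.arange(q), nj)
    X = np.column_stack([np.ones(q * nj), rng.normal(size=q)[cl] + rng.normal(size=q * nj)])
    Y = 0.0 + 0.0 * X[:, 1] + rng.normal(size=q)[cl] + rng.normal(size=q * nj)   # beta0 = beta1 = 0

    beta, V = cluster_ols(Y, X, cl)
    print("beta_1:", round(float(beta[1]), 3), " cluster-robust s.e.:", round(float(np.sqrt(V[1, 1])), 3))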
Inference
For s ∈ {0, 1, . . . , k}, let βs be the s-th element of β and let V̂n,s be the
(s + 1)-th diagonal element of V̂n . Using this notation, consider testing
H0 : βs = c versus H1 : βs ̸= c
at level α. Using the results we just derived, it follows that under the null
hypothesis, the t-statistic is asymptotically standard normal,
    tstat = √n( β̂n,s − c ) / √(V̂n,s) →d N(0, 1)   as n → ∞ .
This implies that the test that rejects H0 when |tstat| > z_{1−α/2} is consistent in levels, where z_{1−α/2} is the critical value defined by the (1 − α/2)-quantile of the standard normal distribution.
Remark 16.2 The previous results show that inference based on the t-statistic for the LS estimator of β and the cluster robust covariance estimator V̂n is valid as q → ∞, regardless of whether the cluster sizes nj grow as well or not, and regardless of whether the dependence within each cluster is weak or strong. That is, even though the LS estimator may converge at different rates depending on the data structure, the studentization by the CCE captures such rates of convergence and makes the t-statistic adaptive.
where
Ũj = (Inj − Pjj )−1/2 Ûj ,
Inj is the nj × nj identity matrix, Pjj is the nj × nj matrix defined as
16.4.3 Simulations
Table 16.1 reports simulations results for the following five designs. The
model in all designs is
Yi = β0 + β1 Xi + Ui
Xi = VCi + Wi
Ui = νCi + ηi ,
where Ci denotes the cluster of i and all variables are N (0, 1). Also, in all
designs β0 = β1 = 0. In the first design there are q = 10 clusters, with
nj = 30 units in each cluster. In the second design q = 5 with nj = 30.
                  dof     I      II     III    IV     V
    V̂n           ∞      84.7   73.9   79.6   85.7   81.7
                  q−1    89.5   86.9   85.2   90.2   86.4
    V̂stata       ∞      86.7   78.8   81.9   87.6   83.6
                  q−1    91.1   90.3   87.2   91.8   88.1
    V̂bm          ∞      89.2   84.7   87.2   89.1   87.7
                  q−1    93.0   93.3   91.3   92.8   91.4
                  kbm    94.4   95.3   94.4   94.2   96.6
Final Comments
When q is small, V̂bm (and even more so V̂n) typically leads to confidence sets that under-cover. Bell and McCaffrey (2002) obtain additional traction from a degrees-of-freedom adjustment to the t-distribution that is used to compute the critical value of the test. This adjustment performs well sometimes, but it is ad-hoc (no formal results).
The literature on inference with few clusters (i.e., q fixed) has made significant progress recently, and the main alternatives to using the CCE are:
• The Wild Bootstrap: see Cameron et al. (2008) and Canay et al. (2021).
The cluster consistent estimator discussed above is still a very good option
when q is large, but remember that its performance does not improve if nj
gets large while q remains small.
Bibliography
Bell, R. M. and D. F. McCaffrey (2002): “Bias reduction in standard
errors for linear regression with multi-stage samples,” Survey Methodology,
28, 169–182.
Lecture 17
Bootstrap
P {θ(P ) ∈ Cn } ≈ 1 − α ,
The notation is intended to emphasize the fact that the distribution of the
root depends on both the sample size, n, and the distribution of the data,
P . Using Jn (x, P ), we may choose a constant c such that
P {Rn ≤ c} ≈ 1 − α .
Cn = {θ ∈ Θ : Rn (X1 , . . . , Xn , θ) ≤ c}
is a confidence set in the sense described above. We may also choose c1 and
c2 so that
P {c1 ≤ Rn ≤ c2 } ≈ 1 − α .
Given such c1 and c2 , the set
    Cn = {θ ∈ Θ : c1 ≤ Rn(X1, . . . , Xn, θ) ≤ c2}
is also a confidence set in the same sense. Ideally, we would like the coverage to be exact, i.e.,
    P{θ(P) ∈ Cn} = 1 − α
for every n and every P ∈ P, but this is typically too demanding, so instead we require that
    lim_{n→∞} P{θ(P) ∈ Cn} = 1 − α
for all P ∈ P.
Theorem 17.1 Let θ(P) be the mean of P and let P denote the set of all distributions on R with a finite, nonzero variance. Consider the root Rn = √n(X̄n − θ(P)). Let Pn, n ≥ 1 be a nonrandom sequence of distributions such that
    J^{−1}(1 − α, P) = z_{1−α} σ(P) .
    Zn,i = ( Xn,i − θ(Pn) ) / σ(Pn)
and apply the Lindeberg-Feller central limit theorem. We must show that
    lim_{n→∞} E[ Zn,i² I{|Zn,i| > ϵ√n} ] = 0 .
    Zn,i →d Z = ( X − θ(P) ) / σ(P) ,
where X has distribution P.
where X has distribution P . It follows that for any λ > 0 for which the
distribution of |Z| is continuous at λ, we have that
    E[ Zn,i² I{|Zn,i| > λ} ] → E[ Z² I{|Z| > λ} ] .
We now use these two results. First, note that for any λ > 0 for which
the distribution of |Z| is continuous at λ, the continuous mapping theorem
above implies that
    g(|Zn,i|) = Zn,i² I{|Zn,i| ≤ λ} →d Z² I{|Z| ≤ λ} = g(|Z|) .   (17.5)
The first term on the right-hand side is always equal to one and also equal
to E[Z 2 ] = 1. The second term is the expectation of
    Zn,i² I{|Zn,i| ≤ λ} ∈ [0, λ²] ,
We conclude that
    E[ Zn,i² I{|Zn,i| > λ} ] = E[Z²] − E[ Zn,i² I{|Zn,i| ≤ λ} ]
                             → E[Z²] − E[ Z² I{|Z| ≤ λ} ]
                             = E[ Z² I{|Z| > λ} ] .
As λ → ∞, E[Z² I{|Z| > λ}] → 0. To complete the proof, note that for any fixed λ > 0 and all n sufficiently large,
    E[ Zn,i² I{|Zn,i| > ϵ√n} ] ≤ E[ Zn,i² I{|Zn,i| > λ} ]
under Pn . The desired result now follows from Slutsky’s Theorem and the
fact that σ(Pn ) → σ(P ).
(ii) This follows from part (i) and Lemma 17.1 below applied to Fn (x) =
Jn (x, P ) and F (x) = J(x, P ).
Proof: Let q = F^{−1}(1 − α). Fix δ > 0 and choose ϵ so that 0 < ϵ < δ and F is continuous at q − ϵ and q + ϵ. This is possible because F is continuous at q and therefore continuous in a neighborhood of q. Hence, Fn(q − ϵ) → F(q − ϵ) < 1 − α and Fn(q + ϵ) → F(q + ϵ) > 1 − α, where the inequalities follow from the assumption that F is strictly increasing at q. For n sufficiently large, we thus have that Fn(q − ϵ) < 1 − α and Fn(q + ϵ) > 1 − α. It follows that q − ϵ ≤ Fn^{−1}(1 − α) ≤ q + ϵ for such n.
We are now ready to pass from the nonrandom sequence Pn to the ran-
dom sequence P̂n .
Theorem 17.2 Let θ(P) be the mean of P and let P denote the set of all distributions on R with a finite, nonzero variance. Consider the root Rn = √n(X̄n − θ(P)). Then,
Remark 17.1 Similar results hold for the studentized root in (17.1) where
σ̂n is a consistent estimator of σ(P ). Using this root leads to the so-called
Bootstrap-t, as the root is just the t-statistic. A key step in the proof of
this result is to show that σ̂n converges in probability to σ(P ) under an
appropriate sequence of distributions. We skip this in this class. However,
the advantage of working with a studentized root like the one in (17.3) is
that the limit distribution of Rn is pivotal, which affects the properties of
the bootstrap approximation as discussed in the next section.
It now follows from Slutsky's Theorem that confidence sets of the form described above satisfy
    P{θ(P) ∈ Cn} → 1 − α   (17.6)
for all P ∈ P.
In general, the consistency of the bootstrap is proved in the following
two steps:
One-sided confidence sets based on the bootstrap and the root Rn = √n(X̄n − θ(P)) also satisfy (17.7), though there is some evidence to suggest that they do a bit better in the size of the O(n^{−1/2}) term. On the other hand, one-sided confidence sets based on the bootstrap-t, i.e., using the root
    Rn = √n( X̄n − θ(P) ) / σ̂n
as in Remark 17.1, satisfy an analogous expansion with a remainder of order O(n^{−1}). Thus, the one-sided coverage error of the bootstrap-t interval is O(n^{−1}) and is of smaller order than that provided by the normal approximation or the bootstrap based on a nonstudentized root. One-sided confidence sets that
    Ln(x) = (1/B) Σ_{j=1}^B I{ R*_{j,n} ≤ x } .   (17.9)
    Ln^{−1}(1 − α) = inf{ x ∈ R : Ln(x) ≥ 1 − α } ,
Remark 17.2 Sampling from P̂n in Step 1 is easy even when P̂n is the
empirical distribution. In such case P̂n is a discrete probability distribution
that puts probability mass n1 at each sample point (X1 , . . . , Xn ), so sampling
from P̂n is equivalent to drawing observations (with probability n1 ) from the
observed data with replacement. In consequence, a bootstrap sample will
likely have some ties and multiple values, which is generally not a problem.
In parametric problems one would simply get a new sample of size n from
P̂n = P (ψ̂n ).
Because B can be taken to be large (assuming enough computing power), the resulting approximation Ln(x) can be made arbitrarily close to Jn(x, P̂n). It then follows that the properties of tests and confidence sets based on Jn^{−1}(1 − α, P̂n) and Ln^{−1}(1 − α) are the same. In practice, values of B on the order of 1,000 are frequently enough for the approximation to work well.
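The complete algorithm for the bootstrap-t interval for the mean fits in a few lines. The sketch below draws B bootstrap samples from the empirical distribution P̂n (i.e., resamples with replacement), builds Ln as in (17.9) for the studentized root, and inverts it; the skewed data-generating process is illustrative only.

    import numpy as np

    def bootstrap_t_ci(X, alpha=0.05, B=1000, seed=0):
        """Bootstrap-t confidence interval for the mean, using the studentized root
        R_n = sqrt(n)(Xbar - theta)/sigma_hat and the empirical approximation L_n in (17.9)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        xbar, s = X.mean(), X.std(ddof=1)
        roots = np.empty(B)
        for b in range(B):
            Xs = rng.choice(X, size=n, replace=True)          # a draw from P_hat_n
            roots[b] = np.sqrt(n) * (Xs.mean() - xbar) / Xs.std(ddof=1)
        lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
        # invert the root: theta such that lo <= sqrt(n)(Xbar - theta)/s_hat <= hi
        return xbar - hi * s / np.sqrt(n), xbar - lo * s / np.sqrt(n)

    rng = np.random.default_rng(11)
    X = rng.exponential(scale=1.0, size=100)                  # skewed data, true mean equal to 1
    print("bootstrap-t 95% CI for the mean:", tuple(round(c, 3) for c in bootstrap_t_ci(X)))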
Bibliography
Hansen, B. E. (2019): “Econometrics,” University of Wisconsin - Madison.
Lecture 18
Subsampling & Randomization Tests
18.1 Subsampling
Suppose Xi , i = 1, . . . , n is an i.i.d. sequence of random variables with dis-
tribution P ∈ P. Let θ(P ) be some real-valued parameter of interest, and
let θ̂n = θ̂n(X1, . . . , Xn) be some estimate of θ(P). Consider the root
    Rn = √n( θ̂n − θ(P) ) ,
where a root is a functional that depends on both the data and θ(P). Let Jn(P) denote the sampling distribution of Rn and define the corresponding cumulative distribution function as,
Assumption 18.1 There exists a limiting law J(P ) such that Jn (P ) con-
verges weakly to J(P ) as n → ∞.
where θ̂n,j is the estimate of θ(P ) computed using the jth set of data of size
n from the original m observations.
In practice m = n, so, even if we knew θ(P), this idea won't work. The key idea behind subsampling is the following simple observation: replace n with some number b that is much smaller than n. We would then expect
    √b( θ̂b,j − θ(P) ) ,   j = 1, . . . , (n choose b) ,
where θ̂b,j is the estimate of θ(P) computed using the jth set of data of size b from the original n observations, to be a good estimate of Jb(x, P), at least if (n choose b) is large. Of course, we are interested in Jn(x, P), not Jb(x, P). We therefore need some way to force Jn(x, P) and Jb(x, P) to be close to one another. To ensure this, it suffices to assume that Jn(x, P) → J(x, P). Therefore, Jb(x, P) and Jn(x, P) are both close to J(x, P), and thus close to one another as well, at least for large b and n. In order to ensure that both b and (n choose b) are large, at least asymptotically, it suffices to assume that b → ∞, but b/n → 0.
This procedure is still not feasible because in practice we typically do
not know θ(P ). But we can replace θ(P ) with θ̂n . This would cause no
problems if
    √b( θ̂n − θ(P) ) = (√b/√n) √n( θ̂n − θ(P) )
is small, which follows from b/n → 0 in this case. The next theorem formalizes the above discussion.
Theorem 18.1 Assume Assumption 18.1. Also, let Jn(P) denote the sampling distribution of τn(θ̂n − θ(P)) for some normalizing sequence τn → ∞, let Nn = (n choose b), and assume that τb/τn → 0, b → ∞, and b/n → 0 as n → ∞.
ii) Let
If b²/n → 0, then for every ϵ > 0 we have that b²/n < ϵ for all n sufficiently large. Therefore,
    ( 1 − b/n )^b > ( 1 − ϵ/b )^b → exp(−ϵ) .
By choosing ϵ > 0 sufficiently small, we see that the desired probability converges to 1.
H0 : θ = 0 vs H1 : θ ̸= 0 .
compute a critical value that delivers a valid test? It turns out we can do
this by exploiting symmetry.
To do this, let ϵi take on either the value 1 or −1 for i = 1, . . . , 10. Note
that the distribution of X = (X1 , . . . , X10 ) is symmetric about 0 under the
null hypothesis. Now consider a transformation g = (ϵ1 , . . . , ϵ10 ) of R10 that
defines the following mapping
where ⌈C⌉ denotes the smallest integer greater than or equal to C. Let
    M⁺(x) = Σ_{j=1}^M I{ T^{(j)}(x) > T^{(k)}(x) }
and
    M⁰(x) = Σ_{j=1}^M I{ T^{(j)}(x) = T^{(k)}(x) } .
Now set
    a(x) = ( Mα − M⁺(x) ) / M⁰(x) ,   (18.7)
and define the randomization test as
    ϕ(x) = 1 if T(x) > T^{(k)}(x) ;   ϕ(x) = a(x) if T(x) = T^{(k)}(x) ;   ϕ(x) = 0 if T(x) < T^{(k)}(x) .   (18.8)
is level α.
Theorem 18.2 Suppose that X has distribution P on X and the problem is to test the null hypothesis P ∈ P0. Let G be a finite group of transformations of X onto itself. Suppose the randomization hypothesis (18.1) holds. Given a test statistic T(X), let ϕ be the randomization test as described above. Then, ϕ(X) is a similar α level test, i.e.,
and so
    Mα = EP[ Σ_{g∈G} ϕ(gX) ] = Σ_{g∈G} EP[ ϕ(gX) ] .
Remark 18.1 Note that by construction the randomization test not only is of level α for all n, but is also "similar", meaning that EP[ϕ(X)] is exactly α, and in particular never below α, for any P ∈ P0.
In general, one can define a p-value p̂ of a randomization test by
    p̂ = (1/M) Σ_{g∈G} I{ T(gX) ≥ T(X) } .   (18.10)
where this probability reflects variation in both X and the sampling of the
gi .
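The sign-change example from the beginning of this section can be coded directly. The sketch below uses the statistic T(X) = |√n X̄n| and M − 1 randomly sampled sign changes plus the identity; the data-generating process and M are illustrative assumptions, and the reported p-value is the Monte Carlo version in (18.10).

    import numpy as np

    def sign_change_test(X, alpha=0.05, M=999, seed=0):
        """Randomization test of H0: the distribution of X is symmetric about 0, using random
        sign changes g = (eps_1, ..., eps_n) and T(X) = |sqrt(n) Xbar|.
        Returns p_hat = (1/M) sum_g I{T(gX) >= T(X)} (identity transformation included)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        T_obs = abs(np.sqrt(n) * X.mean())
        stats = [T_obs]                                         # the identity transformation
        for _ in range(M - 1):
            eps = rng.choice([-1.0, 1.0], size=n)               # a random element of the group G
            stats.append(abs(np.sqrt(n) * (eps * X).mean()))
        p_hat = np.mean(np.array(stats) >= T_obs)
        return p_hat, p_hat <= alpha

    rng = np.random.default_rng(12)
    X = rng.normal(loc=0.3, scale=1.0, size=10)                 # n = 10, as in the example above
    p_hat, reject = sign_change_test(X)
    print("randomization p-value:", round(float(p_hat), 3), " reject at 5%:", reject)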
X = (Y1 , . . . , Ym , Z1 , . . . , Zn ) .
Consider testing
H0 : PY = PZ vs H1 : PY ̸= PZ .
To describe an appropriate group of transformations G, let N = m + n.
For x = (x1 , . . . , xN ) ∈ RN , let gx ∈ RN be defined by
Treatment effects
Suppose that we observe a random sample {(Y1 , D1 ), . . . , (Yn , Dn )} from a
randomized controlled trial where
    H0 : Q0 = Q1 vs. H1 : Q0 ≠ Q1 .   (18.15)
    H0 : E[Y(1)] = E[Y(0)] vs. H1 : E[Y(1)] ≠ E[Y(0)] .   (18.16)
In this case, one may still consider the permutation test that results from considering all possible permutations of the vector of treatment assignments (D1, . . . , Dn). Unfortunately, such an approach does not lead to a valid test and may over-reject in finite samples. These tests may be asymptotically valid, though, if one carefully chooses an appropriate test statistic.
The distinction between the null hypothesis in (18.15) and that in (18.16)
and their implications on the properties of permutation tests are often ig-
nored in applied research.
Randomization tests are often dismissed in applied research due to the belief that the randomization hypothesis is too strong to hold in a real empirical application. For example, the distribution P may not be symmetric in hypotheses about the mean of X. However, it turns out that the randomization test is asymptotically valid (under certain conditions) even when P is not symmetric. See Bugni et al. (2018) for an example in the context of randomized controlled experiments. Moreover, recent developments on the asymptotic properties of randomization tests show that such a construction may be particularly useful in regression models with a fixed and small number of clusters, see Canay et al. (2017). The approach does not require symmetry in the distribution of X, but rather symmetry in the asymptotic distribution of θ̂n, which automatically holds when these estimators are asymptotically normal. We cover these topics in Econ 481.
Bibliography
Bugni, F. A., I. A. Canay, and A. M. Shaikh (2018): “Inference under
Covariate Adaptive Randomization,” Journal of the American Statistical
Association, 113, 1784–1796.