Survey Data in Macroeconomics
III. Econometric Methods for the Analysis of Survey Microdata
Prof. Dr. Lena Dräger
Johannes Gutenberg-University Mainz, GSEFM field course
Email: ldraeger@uni-mainz.de
Outline
1 Analysis of Survey Data in STATA
2 Pooled Cross-Sections
3 Panel Estimators
4 Analysis of Binary and Ordinal (Qualitative) Variables
5 Sample Selection and Attrition Bias
6 Sampling Weights
Textbooks: Wooldridge (2010) and Greene (2012)
Analysis of Survey Data in STATA
Analysis of Survey Microdata
In general:
Control for socio-demographic characteristics (where possible) or
individual fixed effects (only panel data)
Possibly control for time trends and macroeconomic control variables
Be careful with causal statements regarding policy changes; these are generally not possible
Probit/logit estimators: Define a “representative survey respondent” and evaluate marginal effects at this point ⇒ otherwise, marginal effects are evaluated at (or averaged over) the sample values of the regressors, which may differ across models
Analysis of Survey Microdata in STATA
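If the dataset consists of repeated cross-sections, pooled estimators need no declaration; tsset timevar requires each value of timevar to occur only once, so it applies to aggregated (wave-level) time series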
If the dataset has a panel structure, declare the data to be a panel:
xtset panelvar timevar
For a single cross-section, you can declare the dataset to be survey
data: svyset
Think about meaningful truncation
Get to know your dataset!
Implications of Truncation
Micro survey data of macroeconomic expectations is typically
truncated, at least for non-experts
We might also think about truncating extremely high incomes etc.
Should we adjust our estimator to the truncated nature of the data?
⇒ E.g., use estimators for censored data like a tobit estimator?
⇒ No, because truncated observations are simply dropped, not set to
some fixed value (censored)
⇒ Truncation may bias OLS estimates towards 0; a maximum likelihood estimator that models the truncation (truncated regression, in STATA: truncreg) remains consistent
⇒ In practice, estimations are typically not adjusted for truncation,
probably because the truncation only becomes binding in very few cases
Pooled Cross-Sections
Pool micro cross-sections of survey waves over time
Individual and time component
⇒ If identical individuals are followed over time: Use panel estimators
⇒ If there are repeated (random) cross-sections: Use a simple OLS
estimator on the pooled cross-sections
Pooled OLS
Assumptions:
Random samples from a population at different points in time
Observations are independent, but not identically distributed
Include either macro control variables or time dummies to control for
aggregate changes over time
Include individual characteristics to avoid omitted variables bias
Standard errors may be corrected to account for heteroscedasticity
(STATA: option vce(robust)) or clustering (STATA: option vce(cluster
clustvar))
Pooled cross-section model:
\[ y_t = x_t \beta + u_t, \quad t = 1, 2, \dots, T \tag{1} \]
Coefficients in the vector $\beta$ can be consistently estimated by OLS if:
1 $E(x_t' u_t) = 0$, $t = 1, 2, \dots, T$
⇒ No contemporaneous correlation between explanatory variables and the error term
2 $\operatorname{rank}\left[ \sum_{t=1}^{T} E(x_t' x_t) \right] = K$
⇒ No perfect linear dependencies (collinearity) between the explanatory variables
For unbiasedness of standard errors of the OLS estimator, we additionally need:
3 $E(u_t^2 x_t' x_t) = \sigma^2 E(x_t' x_t)$, $t = 1, 2, \dots, T$, where $\sigma^2 = E(u_t^2)$ for all $t$
⇒ Homoscedasticity
4 $E(u_t u_s x_t' x_s) = 0$, $t \neq s$, $t, s = 1, 2, \dots, T$
⇒ No serial correlation
Diff-in-diff Estimation
Popular in applied microeconometrics for policy analysis (e.g.
unexpected change in unemployment insurance only in one of two
neighboring states)
Natural experiment: One control group (A) and one treatment group
(B, dummy dB)
Two time periods: One before the policy change, one afterwards
(dummy d2)
Research question: What is the effect of the change in unemployment
insurance on the individual labor market experience?
Estimation equation:
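\[ y_{i,t} = \beta_0 + \beta_2\, dB_{i,t} + \beta_3\, d2_t + \delta_1\, d2_t \cdot dB_{i,t} + u_{i,t} \tag{2} \]
OLS estimator of $\delta_1$:
\[ \hat{\delta}_1 = (\bar{y}_{B,2} - \bar{y}_{B,1}) - (\bar{y}_{A,2} - \bar{y}_{A,1}) \tag{3} \]
where $\bar{y}_{A,1}$, $\bar{y}_{A,2}$, $\bar{y}_{B,1}$ and $\bar{y}_{B,2}$ are the sample averages of y for the control (A) and the treatment (B) group before and after the policy change (time periods 1 and 2)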
⇒ δˆ1 measures the relative difference in individual labor market
experience that can be attributed to the policy change only
experienced by group B.
Estimator allows for time- and group-specific effects
The estimator is unbiased if the policy change is not systematically
related to other factors that affect yi,t (and are hidden in ui,t ) ⇒
usually include further control variables for both time and
cross-sectional variation
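The diff-in-diff estimator can be illustrated with four group means. The sketch below is a minimal example on made-up numbers (the outcome values and the function name `did` are assumptions for illustration, not data from the course); it computes the change in the treatment group B minus the change in the control group A, with period 1 before and period 2 after the policy change.

```python
from statistics import mean

# Hypothetical outcomes (e.g. weeks in unemployment); all values are made up.
# Keys are (group, period): "A" = control, "B" = treatment, 1 = before, 2 = after.
y = {
    ("A", 1): [10.0, 12.0, 11.0],
    ("A", 2): [11.0, 13.0, 12.0],
    ("B", 1): [14.0, 15.0, 16.0],
    ("B", 2): [18.0, 19.0, 20.0],
}

def did(y):
    """Change in group B (after minus before) minus change in group A."""
    change_treated = mean(y[("B", 2)]) - mean(y[("B", 1)])
    change_control = mean(y[("A", 2)]) - mean(y[("A", 1)])
    return change_treated - change_control

delta_hat = did(y)
print(delta_hat)
```

In this toy example the control group captures the common time effect, so only the additional change in group B is attributed to the policy.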
Robust and Clustered Standard Errors
Problem: OLS assumes that observations are drawn from identical
distributions, but this is often not the case in survey data
Cross-sectional or time heteroscedasticity: E.g., variance of income is
larger for high income respondents than for respondents with lower
income
Clustering: Data may differ across certain groups, either different time
periods or different cross-section groups
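⇒ With heteroscedasticity, the error variance depends on the regressors; with clustering, errors are correlated within groups ⇒ in both cases the conventional standard errors are biased, while the coefficient estimates remain consistent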
Robust and clustered standard errors are available with many
estimators, including panel and probit/logit models
OLS standard errors:
\[ \operatorname{Var}(\hat{\beta}) = (x'x)^{-1} (x' u u' x) (x'x)^{-1} = \sigma^2 (x'x)^{-1} \tag{4} \]
Robust standard errors:
Account for (cross-sectional) heteroscedasticity:
\[ \operatorname{Var}(\hat{\beta}) = (x'x)^{-1} \left[ \sum_{i=1}^{N} u_i^2 x_i' x_i \right] (x'x)^{-1} \tag{5} \]
In STATA: option vce(robust)
Clustered standard errors:
Similar to robust standard errors, but the correction terms are summed within each cluster g, which allows arbitrary correlation of the errors within a cluster ⇒ assume independence across clusters
\[ \operatorname{Var}(\hat{\beta}) = (x'x)^{-1} \left[ \sum_{g=1}^{G} x_g' u_g u_g' x_g \right] (x'x)^{-1} \tag{6} \]
Appropriate cluster variables may emerge from the estimation context
Clustering is possible along several dimensions (multi-way clustering),
otherwise include fixed effects along one dimension and cluster along
the other
In STATA: option vce(cluster clustvar), e.g. vce(cluster income) or
vce(cluster year)
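To make the difference between the three variance formulas concrete, the following sketch evaluates them for the simplest possible case: one regressor and no intercept, so that every matrix reduces to a scalar. The data, the variable names and the function `ols_variances` are assumptions for illustration; the sketch also omits the finite-sample corrections that statistical software typically applies to robust and clustered variances.

```python
from collections import defaultdict

def ols_variances(x, y, cluster):
    """Slope of y on x (no intercept) with classical, robust and clustered variances."""
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    n = len(x)
    # Classical: sigma^2 (x'x)^{-1}, sigma^2 estimated with N - 1 degrees of freedom.
    classical = sum(u * u for u in resid) / (n - 1) / sxx
    # Robust: (x'x)^{-1} [sum_i u_i^2 x_i^2] (x'x)^{-1}.
    robust = sum((xi * u) ** 2 for xi, u in zip(x, resid)) / sxx ** 2
    # Clustered: the scores x_i * u_i are summed within each cluster before squaring.
    score = defaultdict(float)
    for xi, u, g in zip(x, resid, cluster):
        score[g] += xi * u
    clustered = sum(s * s for s in score.values()) / sxx ** 2
    return beta, classical, robust, clustered

# Hypothetical data: six respondents observed in three survey waves (the clusters).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.3, 5.8]
wave = [1, 1, 2, 2, 3, 3]

beta, v_classical, v_robust, v_cluster = ols_variances(x, y, wave)
print(beta, v_classical, v_robust, v_cluster)
```

If every observation forms its own cluster, the clustered formula collapses to the robust one, which is a quick check that the two corrections differ only in how the score terms are grouped.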
Panel Estimators
Accounts for time-variation of individual units
Survey data: Usually unbalanced panel
In principle, we can test for fixed vs. random effects (Hausman test), but intuitively, individual fixed effects make more sense ⇒ account for unobserved constant individual effects
Fixed Effects Estimator
Unobserved effects model:
\[ y_{it} = x_{it} \beta + c_i + u_{it}, \quad t = 1, 2, \dots, T \tag{7} \]
The model includes time-invariant individual effects $c_i$ by estimating a dummy variable for each individual unit over time
1 Exogeneity assumption:
\[ E(u_{it} \mid x_i, c_i) = 0, \quad t = 1, 2, \dots, T \tag{8} \]
⇒ Strict exogeneity of $x_{it}$ conditional on $c_i$
With fixed effects, $x_{it}$ cannot include any time-invariant characteristics such as gender, race etc.! All variables must be time-varying at least for SOME part of the sample
2 Rank condition:
\[ \operatorname{rank}\left( \sum_{t=1}^{T} E\left[ (x_{it} - \bar{x}_i)' (x_{it} - \bar{x}_i) \right] \right) = K \tag{9} \]
⇒ Consistent estimation if exogeneity assumption and rank condition hold
3 Homoscedasticity and no serial correlation of residuals:
\[ E(u_i u_i' \mid x_i, c_i) = \sigma_u^2 I_T \tag{10} \]
⇒ Efficient and unbiased estimation
Fixed Effects: Within and Between Estimator
Within estimator:
\[ (y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i) \beta + (u_{it} - \bar{u}_i) \tag{11} \]
⇒ Eliminates the unobserved effect by subtracting time-averages for each cross-section, with $\bar{y}_i = T^{-1} \sum_{t=1}^{T} y_{it}$ ⇒ Evaluate time-variance within cross-sections
⇒ Estimated by pooled OLS
⇒ In STATA: xtreg depvar indepvars [if] [in] [weight], fe [options]
Between estimator:
\[ \bar{y}_i = \bar{x}_i \beta + c_i + \bar{u}_i \tag{12} \]
⇒ Eliminates the time dimension by averaging over time for each cross-section, so that only the variation between cross-sections is used
⇒ In STATA: xtreg depvar indepvars [if] [in] [weight], be [options]
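The effect of the within transformation can be seen in a toy panel. The sketch below assumes a hypothetical data-generating process $y_{it} = 2 x_{it} + c_i$ without noise, where the individual effect $c_i$ is correlated with $x_{it}$; the numbers and the helper names `slope` and `within_slope` are illustrative assumptions only.

```python
from statistics import mean

# Hypothetical panel: y_it = 2 * x_it + c_i with c_1 = 0 and c_2 = 10 (no noise).
panel = {
    1: {"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]},
    2: {"x": [4.0, 5.0, 6.0], "y": [18.0, 20.0, 22.0]},
}

def slope(x, y):
    """OLS slope of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def within_slope(panel):
    """OLS slope after subtracting each unit's time averages from x and y."""
    xd, yd = [], []
    for unit in panel.values():
        mx, my = mean(unit["x"]), mean(unit["y"])
        xd += [a - mx for a in unit["x"]]
        yd += [b - my for b in unit["y"]]
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

pooled_x = [a for unit in panel.values() for a in unit["x"]]
pooled_y = [b for unit in panel.values() for b in unit["y"]]

beta_pooled = slope(pooled_x, pooled_y)  # ignores c_i and is biased upwards here
beta_within = within_slope(panel)        # c_i drops out after demeaning
print(beta_pooled, beta_within)
```

Pooled OLS mixes the individual effect into the slope because the unit with the larger $c_i$ also has larger $x_{it}$, whereas the demeaned regression only uses variation over time within each unit.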
First Differencing: FD Estimator
We can also first-difference the data to eliminate the unobserved
individual component in (7):
\[ \Delta y_{it} = \Delta x_{it} \beta + \Delta u_{it}, \quad t = 2, \dots, T \tag{13} \]
Use pooled OLS to estimate (13)
Loses the first time period of each individual
Same assumptions as for FE estimator need to hold
We can use this version to construct a diff-in-diff estimator
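With only two time periods, first differencing and the within transformation give the same slope estimate. The sketch below checks this on a made-up two-period panel; the data and the function names `fd_slope` and `within_slope` are assumptions for illustration.

```python
from statistics import mean

# Hypothetical two-period panel: unit -> ((x_1, x_2), (y_1, y_2)); values are made up.
panel = {
    1: ((1.0, 3.0), (2.5, 6.0)),
    2: ((2.0, 5.0), (12.0, 19.0)),
    3: ((0.0, 1.0), (5.0, 6.5)),
}

def fd_slope(panel):
    """Pooled OLS (no intercept) of the first difference of y on that of x."""
    dx = [x[1] - x[0] for x, _ in panel.values()]
    dy = [y[1] - y[0] for _, y in panel.values()]
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

def within_slope(panel):
    """Pooled OLS on deviations from each unit's time averages."""
    num = den = 0.0
    for x, y in panel.values():
        mx, my = mean(x), mean(y)
        num += sum((a - mx) * (b - my) for a, b in zip(x, y))
        den += sum((a - mx) ** 2 for a in x)
    return num / den

beta_fd = fd_slope(panel)
beta_fe = within_slope(panel)
print(beta_fd, beta_fe)
```

For more than two periods the two estimators generally differ, and which one is more efficient depends on the serial correlation of the errors.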
Dynamic Panel Models
Standard FE estimators are biased if the model includes a lagged dependent variable and T is small (Nickell (1981) bias)
Arellano-Bond estimator to account for a lagged dependent variable (Arellano and Bond (1991), in STATA: xtabond)
Nickell bias is not a severe problem if T is large ⇒ often the case in aggregate country panels
If N is small, the Least Squares Dummy Variable Corrected (LSDVC) estimator is an alternative (Bruno (2005), in STATA: xtlsdvc)
Generally: Serial correlation not a big issue with micro survey data, applies more to aggregated (macro) panel models
Accounting for Cross-Sectional Correlation in Panel Data
Usual panel estimators assume that cross-sections are not correlated
However, we may find contemporaneous correlation across either
individuals or aggregate cross-sections
Tests for cross-sectional correlation:
Breusch and Pagan (1980) statistic for cross-sectional independence of
the residuals of a fixed effects model (see also Greene, 2012).
In STATA: xttest2
Pesaran (2004) test for cross-sectional dependence, can also handle
unbalanced panels. In STATA: xtcsd, pesaran show
xtcsd includes further tests by Frees (1995, 2004) and Friedman (1937)
Solution:
Use OLS regression with clustered standard errors, robust to
heteroscedasticity across cross-sections, and within cross-section
correlation
Estimate with feasible generalized least squares (FGLS), estimator is
robust to AR(1) autocorrelation within cross-sections,
heteroscedasticity and cross-sectional correlation.
Estimator in STATA: xtgls
Use OLS regression with panel-corrected standard errors that account
for heteroscedasticity and contemporaneous correlation across
cross-sections and, potentially, for autocorrelation.
Estimator in STATA: xtpcse
Analysis of Binary and Ordinal (Qualitative) Variables
Binary Variables
Dependent variable is a dummy, e.g. =1 if a person is employed and 0
otherwise
We’re interested in the response probability conditional on a set of
explanatory variables x:
\[ p(x) \equiv P(y = 1 \mid x) = P(y = 1 \mid x_1, x_2, \dots, x_K) \tag{14} \]
Interpretation of coefficients not straightforward, except for the signs
Estimate marginal effects post-regression
Probit or Logit Estimators
Latent variable model:
\[ y^* = x\beta + e, \quad y = 1[y^* > 0] \tag{15} \]
e is a continuously distributed variable independent of x with a symmetric distribution around 0
Let G be the cdf of e, then by the symmetry of the pdf of e around 0, we have $1 - G(-z) = G(z)$. Use this to get:
\[ P(y = 1 \mid x) = P(y^* > 0 \mid x) = P(e > -x\beta \mid x) = 1 - G(-x\beta) = G(x\beta) \tag{16} \]
Estimation by maximum likelihood
Robust or clustered standard errors to account for heteroscedasticity
or clustered correlations
Probit estimator: Assume a standard normal distribution for G:
\[ G(x\beta) = \Phi(x\beta) \equiv \int_{-\infty}^{x\beta} \phi(\nu)\, d\nu \tag{17} \]
In STATA: probit depvar [indepvars] [if] [in] [weight] [, options]
Logit estimator: Assume a logistic distribution for G:
\[ G(x\beta) = \Lambda(x\beta) \equiv \frac{\exp(x\beta)}{1 + \exp(x\beta)} \tag{18} \]
In STATA: logit depvar [indepvars] [if] [in] [weight] [, options]
Marginal Effects
In order to get an estimate of the marginal effect of a regressor on the
probability that y = 1, we need to derive marginal effects after the
estimation of probit/logit models
For a continuous regressor $x_j$:
\[ \frac{\partial p(x)}{\partial x_j} = g(x\beta)\beta_j, \quad \text{where } g(z) \equiv \frac{dG}{dz}(z) \tag{19} \]
For a discrete regressor $x_K$, the marginal effect of changing $x_K$ from 0 to 1, holding all other variables fixed, is given by:
\[ G(\beta_1 + \beta_2 x_2 + \dots + \beta_{K-1} x_{K-1} + \beta_K) - G(\beta_1 + \beta_2 x_2 + \dots + \beta_{K-1} x_{K-1}) \tag{20} \]
⇒ The marginal effect of a single regressor depends on all regressors in x
Default option: margins, dydx(*) reports average marginal effects over the sample; with the option atmeans, marginal effects are evaluated at the mean of variables in x
With survey micro data: Define a representative participant and evaluate marginal effects there, so that effects are comparable across different models
In STATA: margins, dydx(*) at(sex=1 age=49 inc=3 employ=3)
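The marginal effects for a continuous and a dummy regressor can be evaluated by hand at a representative respondent. The sketch below uses hypothetical coefficients (the dictionary `beta`, the respondent `rep` and all function names are assumptions for illustration, and the same coefficients are reused for probit and logit purely to keep the example short).

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit_pdf(z):
    p = logit_cdf(z)
    return p * (1.0 - p)

def index(beta, x):
    """Linear index x * beta for one respondent."""
    return sum(beta[k] * x[k] for k in beta)

def me_continuous(pdf, beta, x, name):
    """Continuous regressor: density at the index times the coefficient."""
    return pdf(index(beta, x)) * beta[name]

def me_dummy(cdf, beta, x, name):
    """Dummy regressor: probability with the dummy at 1 minus probability at 0."""
    x1 = dict(x)
    x1[name] = 1.0
    x0 = dict(x)
    x0[name] = 0.0
    return cdf(index(beta, x1)) - cdf(index(beta, x0))

# Hypothetical coefficients and a representative respondent (age 49, not employed).
beta = {"const": -1.0, "age": 0.02, "employed": 0.5}
rep = {"const": 1.0, "age": 49.0, "employed": 0.0}

probit_age = me_continuous(norm_pdf, beta, rep, "age")
probit_emp = me_dummy(norm_cdf, beta, rep, "employed")
logit_age = me_continuous(logit_pdf, beta, rep, "age")
logit_emp = me_dummy(logit_cdf, beta, rep, "employed")
print(probit_age, probit_emp, logit_age, logit_emp)
```

Changing any entry of `rep` changes all four numbers, which is why a fixed representative respondent keeps marginal effects comparable across models.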
Bi-Probit Estimators
What if we are interested in the likelihood that two dummy variables
equal 1 simultaneously?
\[ y_1 = 1[x_1 \beta_1 + e_1 > 0] \tag{21} \]
\[ y_2 = 1[x_2 \beta_2 + e_2 > 0] \tag{22} \]
Error terms in $e \equiv (e_1, e_2)$ are assumed to be bivariate normally distributed: $e \mid x \sim N(0, \Omega)$
If e1 and e2 are correlated, joint maximum likelihood estimation of both equations is more efficient than two separate probit estimations
In STATA: biprobit depvar1 depvar2 [indepvars] [if] [in] [weight] [, options]
Compute marginal effects for the likelihood of both y1 and y2 being 1: margins, predict(p11) dydx(*) at(sex=1 age=49 inc=3 employ=3)
margins also computes conditional marginal effects or marginal effects for one variable being 1 and the other being 0
Ordered Probit or Logit Estimators
Dependent variable is ordinal, e.g. has values from 1-5 where the
ordering matters
Often the case for qualitative survey data, e.g. Likert scale: (1 – like),
(2 – like somewhat), (3 – neutral), (4 – dislike somewhat), (5 – dislike)
Latent variable model:
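\[ y^* = x\beta + e, \quad e \mid x \sim N(0, 1) \tag{23} \]
\[ y = \begin{cases} 0 & \text{if } y^* \le \alpha_1 \\ 1 & \text{if } \alpha_1 < y^* \le \alpha_2 \\ \vdots & \\ J & \text{if } y^* > \alpha_J \end{cases} \]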
Estimate parameters in α and β by maximum likelihood assuming
either a normal or a logistic distribution for e
Make sure the ordering is meaningful, i.e. qualitative expectations
should assign higher numbers to higher expectations
In STATA: oprobit or ologit
Marginal effects have to be estimated separately for each ordinal realisation, e.g. in STATA: margins, dydx(*) predict(outcome(5)) at(sex=1 age=49 inc=3 employ=3), which gives the marginal effects of all regressors for the likelihood that the ordinal variable takes on the value of 5
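The ordered probit model implies one probability, and one marginal effect, per outcome. The sketch below computes both for a single respondent from a hypothetical index value, hypothetical cutpoints and a hypothetical coefficient (all numbers and function names are assumptions for illustration), using outcomes 0 to J as in the latent variable model.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def outcome_probs(xb, cuts):
    """P(y = j | x) for j = 0..J, given the index xb and increasing cutpoints."""
    cdf = [0.0] + [norm_cdf(a - xb) for a in cuts] + [1.0]
    return [cdf[j + 1] - cdf[j] for j in range(len(cuts) + 1)]

def marginal_effects(xb, cuts, beta_k):
    """Derivative of each outcome probability w.r.t. a continuous regressor x_k."""
    pdf = [0.0] + [norm_pdf(a - xb) for a in cuts] + [0.0]
    return [(pdf[j] - pdf[j + 1]) * beta_k for j in range(len(cuts) + 1)]

# Hypothetical values: four cutpoints give five ordered outcomes.
cuts = [-1.0, -0.3, 0.4, 1.2]
xb = 0.25
beta_k = 0.5

probs = outcome_probs(xb, cuts)
effects = marginal_effects(xb, cuts, beta_k)
print(probs)
print(effects)
```

Because the probabilities always sum to one, the marginal effects across outcomes sum to zero: a regressor that raises the probability of the highest category must lower the probability of at least one other category.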
Binary or Ordinal Dependent Variables in Panel Data
All estimators for binary or ordinal dependent variables are also
available for panel data
With repeated (random) cross-sections: Simply use the estimators on the pooled data, controlling for individual characteristics
With panel data (identical individuals followed over time):
Use within transformation or first-differencing to eliminate the unobserved effect $c_i$ (in a linear probability model)
Estimate random-effects or population-averaged probit or ordered probit models (in STATA: xtprobit and xtoprobit)
Population-averaged probit: Estimates at the average value of unobserved components $c_i$ in the population, $\bar{c}$
Random-effects probit: Assume a normal distribution for $c_i$, integrate out $c_i$ when constructing the log-likelihood function, marginal effects can be estimated at c = 0
Sample Selection and Attrition Bias
Assumption: Survey samples are drawn randomly and are
representative of the overall population
What if there is non-random sample selection (= incidental
truncation)? ⇒ Survivorship or attrition bias
Respondents leave the survey permanently (e.g. professional forecasters
get a new job at a firm not participating in the survey, a household
drops out for private reasons or the respondent dies, a firm goes
bankrupt)
Certain groups of participants are more likely to be selected into a
panel dimension of the survey
What if the decision to not answer a particular question is
non-random? ⇒ Non-response bias
Certain questions might be deemed more difficult than others
Heckman (1979) Estimator
Two-step estimation procedure by Heckman (1979) to account for
attrition
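Unobserved selection equation:
\[ z_i^* = w_i'\gamma + u_i, \quad z_i = 1 \text{ if } z_i^* > 0 \text{ and } 0 \text{ otherwise} \]
\[ \operatorname{Prob}(z_i = 1 \mid w_i) = \Phi(w_i'\gamma) \]
\[ \operatorname{Prob}(z_i = 0 \mid w_i) = 1 - \Phi(w_i'\gamma) \tag{24} \]
Regression equation:
\[ y_i = x_i'\beta + \varepsilon_i \quad \text{observed only if } z_i = 1 \tag{25} \]
\[ (u_i, \varepsilon_i) \sim \text{bivariate normal}[0, 0, 1, \sigma_\varepsilon, \rho] \]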
Need to assume that zi and wi are observed for a random sample
Combining (24) and (25) gives:
\[ E[y_i \mid z_i = 1, x_i, w_i] = x_i'\beta + \rho \sigma_\varepsilon \lambda(w_i'\gamma), \tag{26} \]
with $\lambda(w_i'\gamma) = \phi(w_i'\gamma) / \Phi(w_i'\gamma)$
⇒ Not accounting for sample selection results in an omitted variable bias
⇒ OLS is not consistent!
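The step from the selection and regression equations to the conditional mean of $y_i$ can be spelled out as follows. This is a sketch that relies only on the bivariate normality of $(u_i, \varepsilon_i)$ with $\operatorname{Var}(u_i) = 1$, which implies $E[\varepsilon_i \mid u_i] = \rho \sigma_\varepsilon u_i$, and on the mean of a truncated standard normal variable.

```latex
\begin{align*}
E[y_i \mid z_i = 1, x_i, w_i]
  &= x_i'\beta + E[\varepsilon_i \mid u_i > -w_i'\gamma] \\
  &= x_i'\beta + \rho \sigma_\varepsilon \, E[u_i \mid u_i > -w_i'\gamma] \\
  &= x_i'\beta + \rho \sigma_\varepsilon \, \frac{\phi(-w_i'\gamma)}{1 - \Phi(-w_i'\gamma)} \\
  &= x_i'\beta + \rho \sigma_\varepsilon \, \frac{\phi(w_i'\gamma)}{\Phi(w_i'\gamma)}
\end{align*}
```

The last line uses the symmetry of the normal distribution. The additional term vanishes only if $\rho = 0$, i.e. if selection is unrelated to the error of the regression equation.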
Two-Step Heckman Correction
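1 Estimate a probit model with maximum likelihood of (24) to obtain estimates of $\gamma$. For each observation, compute $\hat{\lambda}_i = \phi(w_i'\hat{\gamma}) / \Phi(w_i'\hat{\gamma})$ and $\hat{\delta}_i = \hat{\lambda}_i (\hat{\lambda}_i + w_i'\hat{\gamma})$
2 Estimate $y_i = x_i'\beta + \beta_\lambda \hat{\lambda}_i + \varepsilon_i$ by least squares to obtain estimates of $\beta$ and $\beta_\lambda$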
The estimator of the correlation between the selection equation and
the regression model, ρ̂, provides a measure for the degree of attrition
bias ⇒ Test whether ρ̂ = 0
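The quantities computed after the first-step probit are simple functions of the fitted index. The sketch below evaluates the inverse Mills ratio and the associated delta term for a few hypothetical index values (the values in `index_values` and the function names are assumptions for illustration; in practice the index comes from the estimated selection equation).

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills(z):
    """lambda = phi(z) / Phi(z), evaluated at the fitted selection index z."""
    return norm_pdf(z) / norm_cdf(z)

def delta(z):
    """delta = lambda * (lambda + z), as computed after the first step."""
    lam = inverse_mills(z)
    return lam * (lam + z)

# Hypothetical fitted selection indices w_i' gamma_hat for four respondents.
index_values = [-1.0, 0.0, 1.0, 2.0]
lambdas = [inverse_mills(z) for z in index_values]
deltas = [delta(z) for z in index_values]
print(lambdas)
print(deltas)
```

Respondents with a low selection index, who are unlikely to be observed, receive a large correction term, while the term shrinks towards zero for respondents who are almost certain to be in the sample.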
Two-Step Heckman Correction in STATA
Cross-section data: heckman depvar [indepvars], select([depvar_s =]
varlist_s) twostep [options]
Binary data: heckprobit depvar [indepvars] [if] [in] [weight],
select([depvar_s =] varlist_s) [options]
Ordinal data: heckoprobit depvar [indepvars] [if] [in] [weight],
select([depvar_s =] varlist_s) [options]
Older versions of STATA offer no module for a Heckman correction with panel data; Stata 16 and later provide xtheckman (random-effects model with sample selection)
Sampling Weights
Sampling weights are calculated by surveys in order to ensure the representativeness of the sample w.r.t. the overall population
Check whether the sample contains a weight variable (index), e.g. household head weight index in the Michigan Survey of Consumers
Sampling weights (in STATA: pweight) denote the inverse of the
probability that the observation is included due to the sampling design
Use weights in the estimations, e.g.:
oprobit cons_past infl_exp int_exp age sex income
[pweight=weight], robust
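The role of inverse-probability weights can be shown with a weighted mean. The sketch below uses a made-up sample in which one group is under-sampled (the values, inclusion probabilities and variable names are assumptions for illustration) and compares the unweighted with the weighted average.

```python
from statistics import mean

# Hypothetical sample of (inflation expectation in %, inclusion probability).
# The first two respondents belong to a group that is sampled with probability 0.1,
# the remaining four to a group that is sampled with probability 0.4.
sample = [(2.0, 0.1), (4.0, 0.1), (6.0, 0.4), (6.0, 0.4), (8.0, 0.4), (8.0, 0.4)]

values = [v for v, _ in sample]
weights = [1.0 / p for _, p in sample]  # sampling weight = inverse inclusion probability

unweighted_mean = mean(values)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
print(unweighted_mean, weighted_mean)
```

Each respondent from the under-sampled group stands for more members of the population, so the weighted mean moves towards that group's lower expectations.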
Literature
Arellano, M. and S. Bond (1991).
Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations.
Review of Economic Studies 58 (2), 277–297.
Breusch, T. and A. R. Pagan (1980).
The Lagrange multiplier test and its applications to model specification in
econometrics.
Review of Economic Studies 47 (1), 239–253.
Bruno, G. S. (2005).
Approximating the bias of the LSDV estimator for dynamic unbalanced panel
data models.
Economics Letters 87 (3), 361–366.
Frees, E. W. (1995).
Assessing cross-sectional correlations in panel data.
Journal of Econometrics 64, 393–414.
Frees, E. W. (2004).
Longitudinal and Panel Data: Analysis and Applications in the Social
Sciences.
Cambridge University Press.
Friedman, M. (1937).
The use of ranks to avoid the assumption of normality implicit in the analysis
of variance.
Journal of the American Statistical Association 32, 675–701.
Greene, W. (2012).
Econometric Analysis (7th ed.).
Pearson Education.
Heckman, J. J. (1979).
Sample selection bias as a specification error.
Econometrica 47 (1), 153–161.
Nickell, S. (1981).
Biases in dynamic models with fixed effects.
Econometrica 49 (6), 1417–1426.
Pesaran, M. H. (2004).
General diagnostic tests for cross section dependence in panels.
Cambridge Working Paper in Economics 0435.
Wooldridge, J. M. (2010).
Econometric Analysis of Cross Section and Panel Data (2nd ed.).
Cambridge, MA: MIT Press.