Causal Inference Extended Tutorial
Amit Sharma
Researcher, Microsoft Research India
amshar@microsoft.com
@amt_shrma
http://www.github.com/amit-sharma/causal-inference-tutorial
Session Objectives and Takeaways
• What is causal inference? Why should we care?
• Most machine learning algorithms depend on correlations.
• Correlations alone are a dangerous path to actionable insights.
We have lots of data!
www.tylervigen.com
Observed data can be confounded, or even spurious.
www.tylervigen.com
Given data ⟨X, Y⟩ and the model y = βx + ε, which quantity are we after?
Prediction: estimate ŷ.
Causation: estimate β̂.
Hofman, Sharma, and Watts (2017). Science, 355.6324
PART I:
Causality, what?
I.A What is causality?
A question that has attracted scholars for centuries
“More has been learned about causal inference in the last few decades than the sum total of everything that had been learned about it in all prior recorded history” (Gary King)
What are they talking about?
No one knows what causality means
Hume, 18th century
*Interventionist definition [
http://plato.stanford.edu/entries/causation-mani/]
Bag of tricks => Powerful statistical frameworks
I.B Why should we care?
We have increasing amounts of data and highly accurate
predictions. How is causal inference useful?
Predictive systems are everywhere
How do predictive systems work?
Aim: Predict future activity for a user.
…
We see data about their user profile and past activity.
E.g., for any user, we might see their age, gender, past
activity and their social network.
From data to prediction
[Figure: users sorted from higher activity to lower activity]
Use these correlations to make a predictive model.
Future Activity -> f(number of friends, logins in past month)
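For concreteness, a minimal sketch of fitting such a predictive model (column names and numbers are hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy log: one row per user (illustrative numbers only).
df = pd.DataFrame({
    "num_friends": [3, 50, 12, 80, 7],
    "logins_past_month": [2, 25, 10, 28, 5],
    "future_activity": [1, 30, 9, 35, 4],
})

# Future Activity -> f(number of friends, logins in past month)
model = LinearRegression()
model.fit(df[["num_friends", "logins_past_month"]], df["future_activity"])
new_user = pd.DataFrame({"num_friends": [20], "logins_past_month": [15]})
print(model.predict(new_user))  # a prediction built on correlations, not causes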
From data to “actionable insights”
How do we know
what causes what?
Decision: To increase activity, would it make sense to
launch a campaign to increase friends?
Probably not.
There can be many hidden causes for an action, some of which may be hard to quantify.
Another example: Search Ads
x% of ads shown are effective
Estimating the impact of ads
Ignoring hidden causes can also lead to completely wrong conclusions.
Example: Did a system change lead to better outcomes?
Have a current production algorithm. Want to test if a new algorithm is better.
Say, a feature that provides information or a discount for a financial product.
Algorithm A vs. Algorithm B?
Comparing old versus new algorithm
Two algorithms, A (production) and B (new), running on the system.
From system logs, collect data for 1000 sessions for each. Measure Success Rate (SR).
So let us look at SR separately for each of the 4 user groups.
[Figure: SR for Algorithms A and B within each of the 4 groups]
Is Algorithm A better?
Simpson (1951)
Answer (as usual): Maybe, maybe not.
II.A In search of an intervention
and a counterfactual:
A historical tour of causal
inference
…And there was fire!
1854: London was having a devastating cholera outbreak
Enter John Snow. He found higher cholera deaths near a water pump, but this could be merely correlational.
[Figure: Snow's comparison of households supplied by the S&V Water Company vs. the Lambeth Water Company]
COUNTERFACTUAL
Fisher experiment
A way to combine the ideas of an intervention and
counterfactual.
Q. : Estimate the effect of fertilizers.
Controlled experiment
Fisher experiment
A way to combine the ideas of an intervention and
counterfactual.
Q. : Estimate the effect of fertilizers.
Natural experiment
Fisher experiment
A way to combine the ideas of an intervention and
counterfactual.
Q. : Estimate the effect of fertilizers.
[Figure: the ideal experiment — a user and a cloned user, with the causal estimate as the difference in clicks to recommendations between the two worlds]
Two questions leading from Fisher's work
What if you have a complicated randomized experiment?
E.g., instead of randomizing plots, randomize at the level of farmers. But not all farmers may comply.
Two questions leading from Fisher's work
1. Need a language for “complicated” questions.
Counterfactuals cannot be expressed by probability alone.
Try: “If the fertilizer was applied, what is the probability that the yield was greater than it would have been had the fertilizer not been applied?”
Frameworks:
• Causal graphical models [Pearl 2009]
• Potential Outcomes framework [Imbens-Rubin 2016]
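One common way to write that fertilizer question in potential-outcomes notation (my notation, not the slides'): let Y(1) and Y(0) be the yield of the same plot with and without fertilizer. The question asks for
P( Y(1) > Y(0) | fertilizer applied ),
which involves both outcomes of the same plot, only one of which is ever observed — so it cannot be rewritten as a probability over the observed data alone.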
Graphical Models: Express causal relationships
visually
Graphical Models: Express causal relationships
as a Bayesian network
Graphical Models: Make assumptions explicit
Graphical Models: A language for intervention
and counterfactual
Graphical models: Provide a mechanistic way of identifying a causal effect
Tricky to find correct variables to condition on. Fortunately, graphical models make it precise.
“Backdoor” paths: Look for (undirected) paths that point to both X and Y.
Pearl (2009)
Backdoor criterion: Condition on enough
variables to cover all backdoor paths
Identified Causal Effect (backdoor adjustment over conditioning set Z):
P(Y | do(X)) = Σz P(Y | X, Z=z) P(Z=z)
Potential Outcomes: Every variable has a counterfactual
Imagine every unit has two potential outcomes: Y(1) under treatment and Y(0) without.
Need to estimate the difference between the real and counterfactual world.
Potential Outcomes: Estimating an effect identified from the backdoor criterion
Imagine both potential outcomes for every unit.
Causal effect: E[Y(1) − Y(0)]
II.C Different ways to identify
and estimate causal effect
Running example: Estimating effect of an
algorithm
P(CTR | do(Algorithm))
Lookback: Need answers to “what if” questions
Counterfactual thinking*:
What would have happened if I had changed X?
[Figure: a user and a cloned user; the causal estimate is the difference in clicks to recommendations between the two worlds]
A spectrum of methods, trading off EASE OF USE against VALIDITY:
• Randomization: A/B test, multi-armed bandits
• Natural experiments: regression discontinuity, instrumental variables
• Conditioning: stratification, matching, propensity scores
1. Randomization to the rescue
Randomizing algorithm assignment: A/B test
[Figure: users randomly assigned to the Old or New Algorithm; the causal estimate is the difference in clicks to recommendations]
Two goals:
1. Show the best known algorithm to most users.
2. Keep randomizing to update knowledge about competing algorithms.
Bandits: The right mix of explore and exploit
[Figure: users assigned to the current-best algorithm or a random algorithm, comparing clicks to recommendations]
Repeat:
(Explore) With low probability ε, choose an output item randomly.
(Exploit) Otherwise, show the current-best algorithm.
Li-Chu-Langford-Schapire (2010)
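A minimal ε-greedy sketch of that explore-exploit loop (simulated click probabilities; the cited paper uses contextual bandits, which also condition the choice on user features):

import random

# Hypothetical click probabilities for two competing algorithms.
TRUE_CTR = {"A": 0.10, "B": 0.12}
clicks = {"A": 0, "B": 0}
shows = {"A": 1, "B": 1}   # start at 1 to avoid division by zero
epsilon = 0.1

for _ in range(10000):
    if random.random() < epsilon:                    # Explore: pick randomly
        arm = random.choice(["A", "B"])
    else:                                            # Exploit: current best
        arm = max(shows, key=lambda a: clicks[a] / shows[a])
    shows[arm] += 1
    clicks[arm] += random.random() < TRUE_CTR[arm]   # simulated user click

print({a: clicks[a] / shows[a] for a in shows})      # estimated CTRs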
Caveat: Not always feasible to randomize, or to ensure that people fully comply.
Algorithm: Regression discontinuity
For any top-k recommendation list:
Using logs, identify apps that were similarly ranked but could not make it into the top-k shown apps.
Measure the difference in app installs between shown and not-shown apps for each user.
Imbens-Lemieux (2008)
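A minimal sketch of that comparison, on a hypothetical log with a rec_rank column where ranks 1–3 were shown (this loosely mirrors the rec_rank field in the hands-on dataset later):

import pandas as pd

# Hypothetical log: the rank each app received, and whether it was installed.
# Ranks 1-3 were shown to the user; rank 4 just missed the cutoff.
logs = pd.DataFrame({
    "rec_rank":  [3, 3, 3, 3, 4, 4, 4, 4],
    "installed": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Compare apps just above vs. just below the top-k cutoff (k = 3).
shown = logs.loc[logs["rec_rank"] == 3, "installed"].mean()
not_shown = logs.loc[logs["rec_rank"] == 4, "installed"].mean()
print("Discontinuity estimate:", shown - not_shown)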
Another natural experiment: Instrumental Variables
We can look at as-if-random variation due to external events.
E.g., featuring on the Today show may lead to a sudden spike in installs for an app.
Such external shocks can be used to determine causal effects, such as the effect of showing recommendations.
Angrist-Pischke (2008)
Cont. example: Effect of store recommendations
How many new visits are caused by the recommender system?
Demand for App 1 is correlated with demand for App 2. Users would most likely have visited App 2 even without recommendations.
[Figure: traffic to App 1 on normal days vs. during a sudden spike in visits to App 1, with click-throughs from App 1 to App 2 in each case]
If demand for App 2 remains constant, additional views to App 2 would not have happened had these new users not visited App 1.
Sharma-Hofman-Watts (2015)
Exploiting sudden variation in traffic to App 1
To compute the causal CTR of visits to App 1 on visits to App 2:
• Compare the observed effect of the external event separately on visits to App 1 and on rec. clicks to App 2.
• Causal click-through rate = Δ(rec. clicks to App 2) / Δ(visits to App 1)
More generally…
[Diagram: Instrument (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U) affecting both X and Y]
As-If-Random: Z is independent of U.
Exclusion: Z ⊥ Y | X, U (Z affects Y only through X).
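A minimal simulated sketch of the Wald/IV estimator this diagram licenses (all names and numbers are mine; in the app example, Z is the external traffic shock, X is visits to App 1, and Y is rec. clicks to App 2):

import numpy as np

rng = np.random.default_rng(0)
n = 100000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.binomial(1, 0.5, size=n)             # as-if-random instrument
x = 1.0 * z + u + rng.normal(size=n)         # cause, driven by Z and U
y = 2.0 * x + u + rng.normal(size=n)         # outcome; true effect of X is 2.0

# Wald estimator: (change in Y per unit Z) / (change in X per unit Z)
wald = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
print(wald)   # close to 2.0, despite confounding by U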
Sources of natural experiments [Dunning (2012)]:
• Lotteries
• Weather shocks
• Discontinuities
• Hard-to-find variations
Examples: a change in access to digital services, a change in train stops in a city, a change in medicines at a hospital, …
Caveat: Natural experiments are hard to find
Estimates may not be generalizable to all products.
[Figure: cloned-user comparisons of clicks to recommendations, shown for two different settings]
1. Make assumptions about how the data was generated.
2. Create a graphical model representing those assumptions.
Backdoor criterion: Condition on enough
variables to cover all backdoor paths
Algorithm: Stratification
With observational data:
1. Assume a graphical model that explains how the data was generated.
2. Choose variables to condition on using the backdoor criterion.
3. Stratify data into subsamples such that each subsample has the same value of all conditioned variables.
4. Evaluate the difference in the outcome variable separately within these strata.
5. (Optional) Aggregate over all data. (A sketch of steps 3–5 follows below.)
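A minimal pandas sketch of steps 3–5, assuming a hypothetical dataframe with binary treatment and outcome columns plus the conditioned variables chosen in step 2 (column names are mine):

import pandas as pd

def stratified_estimate(df, conditioned_vars):
    # Step 3: stratify on the conditioned variables.
    strata = df.groupby(conditioned_vars)
    # Step 4: difference in mean outcome between treated and control per
    # stratum. (Strata missing either group yield NaN and are dropped below.)
    per_stratum = strata.apply(
        lambda s: s.loc[s["treatment"] == 1, "outcome"].mean()
                - s.loc[s["treatment"] == 0, "outcome"].mean())
    # Step 5: aggregate, weighting each stratum by its share of the data.
    weights = strata.size() / len(df)
    return (per_stratum * weights).dropna().sum()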
Stratification may be inefficient if there are multiple hidden causes
Stratification creates tiny strata when data is high-dimensional, making it hard to obtain stable estimates.
E.g., activity data may be high-dimensional: a vector of purchases in each app category.
Morgan-Winship (2014)
This was stratification…
[Figure: User 2 (shown Old Algorithm) and User 1 (shown New Algorithm) are the same on all relevant attributes, so their clicks to recommendations can be compared directly]
A relaxation, the propensity-score idea:
[Figure: User 2 and User 1 are equally likely to be shown the New Algorithm, so comparing their clicks to recommendations balances treatment assignment]
Morgan-Winship (2014)
Example: Causal effect of a social news feed
Friends on a social network may Like similar items.
E.g., on Last.fm, friends of a user may like similar music to the user.
[Figure: user u with friends f1–f5, and the same user with non-friends n1–n5]
Remember, we are always looking for the ideal
experiment with multiple worlds
[Figure: the naïve estimate compares different users across Old and New Algorithm; the causal estimate compares a user with their clone, via clicks to recommendations]
Example: Randomization + Instrumental Variable
Fisher example: You cannot randomize over plots, but you can randomize which farmers get fertilizer.
Random assignment as an instrument
[Diagram: Random assignment (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U) affecting both X and Y; randomization makes Z independent of U]
Can use this variation to compute causal effect
An increase in Z can lead to a change in Y only
through X.
II.D. Key takeaways
Three steps to causal inference
Model
Use a causal graphical model. Make your assumptions explicit, even if it is cumbersome.
Identify
Use the graph to identify the effect you are after, or check if your desired identification strategy is valid (BEFORE looking at the data).
Estimate
Use any statistical method to estimate the identified effect.
Estimating causal effects: Best practices
Whenever possible, use randomization.
If the number of output items is low, consider using explore-exploit methods.
III. Hands-on tutorial
Code and resources available at http://www.github.com/amit-sharma/causal-inference-tutorial
Example 1: Does having ice cream cause you
to swim?
Step 1: Model the world
[Graph: Temperature → Eating ice cream; Temperature → Swimming]
Step 2: Identify the causal effect
Conditioning on Temperature satisfies the backdoor criterion:
P(Swimming | do(Ice cream)) = Σt P(Swimming | Ice cream, Temperature = t) P(Temperature = t)
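To make the estimation step below self-contained, here is one hypothetical way to simulate the dataframe df it expects, with w0 standing in for temperature as the single common cause (the tutorial notebook may generate it differently):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10000
w0 = rng.normal(size=n)                                    # temperature (common cause)
treatment = (w0 + rng.normal(size=n) > 0).astype(int)      # eating ice cream
outcome = 0.5 * treatment + 2.0 * w0 + rng.normal(size=n)  # swimming; true effect 0.5
df = pd.DataFrame({"w0": w0, "Treatment": treatment, "Outcome": outcome})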
Step 3: Estimate the effect
import numpy as np
from sklearn import linear_model

def linear_causal_estimate(df):
    # Adjust for the observed common cause (w0) by including it as a regressor.
    observed_common_causes = df[["w0"]]
    treatment_2d = df["Treatment"].values.reshape(len(df["Treatment"]), -1)
    features = np.concatenate((treatment_2d, observed_common_causes), axis=1)
    model = linear_model.LinearRegression()
    model.fit(features, df["Outcome"])
    coefficients = model.coef_
    # The coefficient on Treatment is the causal estimate (under the linear model).
    estimate = {"value": coefficients[0], "intercept": model.intercept_}
    return estimate

estimate = linear_causal_estimate(df)
print("Causal Estimate is " + str(estimate["value"]))
Example 2: Effect of recommendations
Study the effect of an app store recommendation system.
Using system logs:
Compare two recommendation algorithms.
Estimate the causal effect of recommendations.
I. Which of the two algorithms is better?
Situation: Two algorithms, A and B, were used to show app recommendations on the Store.
import pandas as pd
user_app_visits_A = pd.read_csv("../datasets/user_app_visits_A.csv")
user_app_visits_B = pd.read_csv("../datasets/user_app_visits_B.csv")
Dataset at a glance
Data description:
user_id: unique ID for a user
activity_level: user's activity level, discrete (1: lowest, 4: highest)
product_id: unique ID for an app
category: category for an app (e.g., productivity, music)
is_rec_visit: whether the app visit came through a recommendation click-through
rec_rank: rank in the recommendation list (only the top-3 apps are shown to the user; -1 means the app was not in the recommendation list)
What’s in the dataset?
> nrow(user_app_visits_A)
[1] 1,000,000
> length(unique(user_app_visits_A$user_id))
[1] 10,000
> length(unique(user_app_visits_A$product_id))
[1] 990
> length(unique(user_app_visits_A$category))
[1] 10
Causal assumptions
We ask the system designers / look at the source code for the system.
The algorithm was selected based on the activity level of users.
Further, CTR depends on:
• Activity level
• Time of day
• App category
Step 3: Estimate the effect of changing algorithm
> stratified_by_category_estimate(user_app_visits_A)
Source: local data frame [10 x 2]
category stratified_estimate
1 1 0.1758294
2 2 0.2276829
3 3 0.2763157
4 4 0.1239860
5 5 0.1767163
… … …
> stratified_by_category_estimate(user_app_visits_B)
Source: local data frame [10 x 2]
category stratified_estimate
1 1 0.2002127
2 2 0.2517528
3 3 0.3021371
4 4 0.1503150
5 5 0.1999519
… … …
I. Which of the two algorithms is better?
The two algorithms lead to roughly the same CTR.
Answer: Both are equally effective.
> naive_observational_estimate(user_app_visits_A)
naive_estimate
[1] 0.200768
> ranking_discontinuity_estimate(user_app_visits_A)
discontinuity_estimate
[1] 0.121362
Note that the naive estimate (0.20) is substantially larger than the discontinuity-based causal estimate (0.12).
Advanced methods
IV.A. Inverse Propensity Weighting
Weighting to balance treatment and control
Suppose people who were shown a recommendation were already very likely to find the item anyway.
[Figure: clicks to recommendations, treated users vs. reweighted control users]
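A minimal simulated sketch of inverse propensity weighting (the propensity model is a logistic regression on a single observed confounder; names and numbers are mine):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50000
x = rng.normal(size=n)                         # observed confounder
t = rng.binomial(1, 1 / (1 + np.exp(-x)))      # treatment, more likely for high x
y = 1.0 * t + 2.0 * x + rng.normal(size=n)     # outcome; true effect is 1.0

# Estimate the propensity e(x) = P(T=1 | x).
e = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]

# Weight treated units by 1/e(x) and control units by 1/(1 - e(x)).
ipw = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
print(ipw)   # close to 1.0, unlike the naive difference of means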
Caveat
What if the likelihood (propensity) is too low? The weights 1/e(x) blow up, producing high-variance estimates.
IV.B. Doubly robust estimation
Prediction + IPW
Use an outcome-prediction model as the base, and IPW on its residuals as the unbiased correction. The combination is consistent if either the propensity model or the outcome model is correct.
Simple example: Estimating the average height of students in a class
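A minimal simulated sketch of the doubly robust idea on this example: predict every student's height, then correct with inverse-propensity-weighted residuals from the students actually measured (all data and names are mine):

import numpy as np

rng = np.random.default_rng(1)
n = 10000
x = rng.normal(size=n)                      # covariate (e.g., age)
height = 150 + 5 * x + rng.normal(size=n)   # true heights; mean is about 150
e = 1 / (1 + np.exp(-x))                    # propensity of being measured
measured = rng.binomial(1, e)               # taller students measured more often

# Outcome model: a deliberately crude constant prediction for everyone.
pred = np.full(n, 149.0)

# Doubly robust estimate: prediction everywhere, plus IPW-corrected
# residuals where we actually measured.
dr = np.mean(pred + measured * (height - pred) / e)
print(dr)   # close to 150 even though the outcome model is wrong,
            # because the propensity model is correct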
IV.C. Refuting the obtained causal estimate
The causal estimate is a result of your assumptions
The “causal” part does not come from the data.
Sanity check 1: Add random variables to your model
Can add randomly drawn common causes, or random instruments. The estimated effect should not change.
Sanity check 2: Replace treatment by a placebo (A/A test)
Can randomize or permute the treatment. The estimated effect should drop to zero.
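A minimal sketch of the placebo check, reusing df and linear_causal_estimate from Example 1 above (the permuted treatment should yield an estimate near zero):

placebo_df = df.copy()
placebo_df["Treatment"] = np.random.permutation(placebo_df["Treatment"].values)
print("Placebo estimate (should be near 0):",
      linear_causal_estimate(placebo_df)["value"])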
Sanity check 3: Divide data into subsets (cross-validation)
Create subsets of your data. The estimate should remain stable across subsets.
Sensitivity analysis: How strong would a confounder have to be to reverse your estimate?
Use simulation to add the effect of unknown confounders.
Does smoking cause lung cancer?
[Graph: Smoking → Lung cancer, with Demographics and Genes as potential common causes]
Observational causal inference: Best practices
Always follow the four steps: Model, Identify, Estimate,
Refute.
Refute is the most important step.
V. Hands-on exercise
Example 3: Estimating causal effect with many auxiliary variables
VI. Case studies on online social systems
Studies
1. Distinguishing between homophily and influence
2. Measuring peer effects
3. Is running contagious?
4. Estimating counterfactuals in an online ad system
5. Measuring effect of marketing campaigns
Distinguishing homophily and contagion is impossible
Shalizi-Thomas (2011)
Measuring peer effects
Use randomized experiments to validate.
Is running contagious?
Use weather as an instrument: Does rain in NYC cause your friends in LA to also run less?
Aral-Nicolaides (2017)
Estimating causal effects in ad systems
Bottou et al. (2013)
Estimating effect of marketing campaigns
Use synthetic controls. Construct them by using Bayesian modeling [Brodersen et al. (2015)].
VII. Further curiosity
1. Dealing with social networks
Networks mess up causal inference, because individual units are no longer independent.
2. Causal inference and machine learning
Amit Sharma
Researcher, Microsoft Research India
amshar@microsoft.com
@amt_shrma