
WORKSHOP: University of Michigan (2018)

Causal inference in online systems:


Methods, pitfalls and best
practices
From Prediction to Causation

Amit Sharma
Researcher, Microsoft Research India
amshar@microsoft.com
@amt_shrma

http://www.github.com/amit-sharma/causal-inference-tutorial
Session Objectives and Takeaways
• What is causal inference? Why should we care?
  • Most machine learning algorithms depend on correlations.
  • Correlations alone are a dangerous path to actionable insights.
• Learn how to formulate and estimate causal effects.
  • To evaluate the impact of online systems.
  • To make underlying algorithms more robust to changes in data.
• Apply causal inference methods to a practical problem.
  • Estimating the causal impact of a recommendation system.
Agenda
• 9am-11am: Introduction to causal inference
• 11:15am-12pm: Working with code
• 12-1:30pm: Lunch
• 1:30pm-2:30pm Advanced methods and a working
example
• 3-4pm: Case studies and further curiosities

3
We have lots of data!
www.tylervigen.com 4
Observed data can be confounded, or even
spurious.
www.tylervigen.com 5
y = βx + ε

Given data ⟨X, Y⟩:
Prediction: estimate ŷ.
Causation: estimate β̂.

Hofman, Sharma, and Watts (2017). Science, 355.6324
6
PART I:
Causality, what?

7
I.A What is causality?

8
A question that has attracted scholars for
centuries

Largely a philosophical pursuit for many centuries.


9
“More has been learned about causal
inference in the last few decades than the
sum total of everything that had been
learned about it in all prior recorded history”
--Gary King, Harvard University

10
A “causal revolution” is upon us.


--Judea Pearl, University of California Los Angeles

11
What are they talking about?

12
No one knows what causality means
Hume, 18th century

If you strike a match and it lights up, does the striking cause the lighting?
What if you repeat the experiment 100 times?
How do you know that striking always leads to light?
How is it different from regularity or predictability?
Does causality even exist?

13
Still, everyone aims to find causal effects
Empirical causal inference is pragmatic, best-effort.

It concerns the effect of actions that generalize to all reasonable contexts.
If I strike a match, would it light up (assuming that everything else stays the same)?

"A bag of tricks to produce knowledge!"
--Try the action multiple times
--Try controlling for the environment
--Somehow account for uncontrolled factors
14
A practical definition

Definition: X causes Y iff changing X leads to a change in Y, keeping everything else constant.

The causal effect is the magnitude by which Y is changed by a unit change in X.

This is called the "interventionist" interpretation of causality.

*Interventionist definition [http://plato.stanford.edu/entries/causation-mani/] 15
Bag of tricks => Powerful statistical frameworks

Potential Outcomes Bayesian Networks

16
I.B Why should we care?
We have increasing amounts of data and highly accurate
predictions. How is causal inference useful?

17
Predictive systems are everywhere
How do predictive systems work?
Aim: Predict future activity for a user.


We see data about their user profile and past activity.

E.g., for any user, we might see their age, gender, past
activity and their social network.
From data to prediction

[Figure: users grouped by number of friends, showing higher vs. lower activity]
Use these correlations to make a predictive model.
Future Activity -> f(number of friends, logins in past month)
From data to "actionable insights"

Number of friends can predict activity with high accuracy.
How do we increase the activity of users?
Would increasing the number of friends increase people's activity on our system?
Maybe, maybe not(!)
Different explanations are possible

How do we know
what causes what?
Decision: To increase activity, would it make sense to
launch a campaign to increase friends?
Another example: Search Ads

Search engines use ad targeting to show relevant ads.
The prediction model is based on the user's search query.
Search ads have the highest click-through rate (CTR) in online ads.
Are search ads really that effective?
Ad targeting was highly accurate.

Blake-Nosko-Tadelis (2014)
But search results point to the same website

Counterfactual question: Would I have reached Amazon.com anyway, without the ad?
Without reasoning about causality, we may overestimate the effectiveness of ads:
if x% of ads shown appear effective, fewer than x% actually are.
Okay, search ads have an explicit intent. Display ads should be fine?

Probably not.
There can be many hidden causes for an action, some of which may be hard to quantify.

Estimating the impact of ads

Toys R Us designs new ads.
Big jump in clicks on their ads compared to past campaigns. Were these ads more effective?
People buy more toys in December anyway.
It is misleading to compare ad campaigns with changing underlying demand.
So far, so good. Be mindful of hidden causes, or else we might overestimate causal effects.

(But) Ignoring hidden causes can also lead to completely wrong conclusions.
Example: Did a system change lead to better outcomes?
We have a current production algorithm and want to test if a new algorithm is better.
Say a feature that provides information or discounts for a financial product.

Algorithm A vs. Algorithm B?

Comparing old versus new algorithm
Two algorithms, A (production) and B (new), running on the system.
From system logs, collect data for 1000 sessions for each. Measure Success Rate (SR).

            Old Algorithm (A)   New Algorithm (B)
Total SR    50/1000 (5%)        54/1000 (5.4%)

Is the new algorithm better?
Change in SR by income of people
So let us look at SR separately for each group.

                    Old Algorithm (A)   New Algorithm (B)
Low-income users    10/400 (2.5%)       4/200 (2%)
High-income users   40/600 (6.6%)       50/800 (6.2%)
The Simpson's paradox

                              Old Algorithm (A)   New Algorithm (B)
CTR for Low-Activity users    10/400 (2.5%)       4/200 (2%)
CTR for High-Activity users   40/600 (6.6%)       50/800 (6.2%)
Total CTR                     50/1000 (5%)        54/1000 (5.4%)

Is Algorithm A better?
Simpson (1951)
Answer (as usual): Maybe, maybe not.

E.g., Algorithm A could have been shown at different times than B.
There could be other hidden causal variations.
Example: Simpson's paradox in Reddit

Average comment length decreases over time.
But for each yearly cohort of users, comment length increases over time.
Barbosa-Cosley-Sharma-Cesar (2016)

Making sense of such data can be too complex.
PART II:
Causal inference
basics

39
II.A In search of an intervention
and a counterfactual:
A historical tour of causal
inference

40
…And there was fire!

Action: Strike stones.


Outcome: Sparks? Fire?
Probably one of the oldest experiments in causal inference.
Intervention: Taking an action.
41
The idea of "controlling" while taking an action

Hooke discovered that pulling a spring causes it to extend in length in proportion to the force applied. [Hooke's law]

This only works if the other end is fixed, or controlled.
This is a common property of many physical experiments, leading to the notion of a "controlled" experiment.
Intervention: Taking an action while keeping other relevant factors constant.
42
What if you cannot intervene?
For centuries, controlled experiments have worked well for many physical experiments and do so even today.

However, they do not work in the messier life sciences or social sciences.

We needed another big idea for causality to branch out of physical experiments:
the idea of a "counterfactual".

43
1854: London was having a devastating
cholera outbreak 44
Enter John Snow. He found higher cholera deaths near a water pump, but this could have been merely correlational.
45
[Map: service areas of the Southwark & Vauxhall Water Company and the Lambeth Water Company]

New Idea: Two major water companies served London: one drawing water upstream and one downstream.
46
S&V (Southwark & Vauxhall) and Lambeth:
No difference in neighborhood, still an 8-fold increase in cholera with the downstream company.
47
INTERVENTION: acting in the real world.
COUNTERFACTUAL: we observe data from the real world, but no data from the counterfactual world.
48
Fisher experiment
A way to combine the ideas of an intervention and
counterfactual.
Q. : Estimate the effect of fertilizers.

Controlled experiment 49
Fisher experiment
A way to combine the ideas of an intervention and
counterfactual.
Q. : Estimate the effect of fertilizers.

Natural experiment 50
Fisher experiment
A way to combine the ideas of an intervention and
counterfactual.
Q. : Estimate the effect of fertilizers.

Randomized Controlled experiment 51


Ideal experiment

Naïve estimate: compare clicks to recommendations under the new algorithm with those of past users under the old algorithm.
Causal estimate: compare clicks for a user under the new algorithm with clicks for a cloned copy of the same user under the old algorithm.

Ideally, this requires creation of multiple worlds.
52
II.B How do we systematically
reason about and estimate the
relationship between effects
and their causes?

53
Two questions leading from Fisher's work
What if you have a complicated randomized experiment?
E.g. instead of plots, randomize at the level of farmers. But all farmers may not comply.

What if you could not randomize but had a lot of historical data?
E.g. yield and fertilizer data for thousands of plots.

54
Two questions leading from Fisher's work
1. Need a language for "complicated" questions.
Counterfactuals cannot be expressed by probability alone.
Try: "Given that the fertilizer was applied, what is the probability that the yield would have been greater had the fertilizer not been applied?"

→ Pearl's graphical model framework

2. Need an estimation framework that understands this language.
Methods for estimating a causal quantity.
→ Neyman-Rubin's potential outcome framework
56
Aside: Formulating causal inference problems

Causal inference: a principled basis for both experimental and non-experimental methods.

Such questions form the basis of almost all scientific inquiry.
E.g., they occur in medicine (drug trials, effect of a drug), social sciences (effect of a certain policy), and genetics (effect of genes on disease).

Frameworks:
• Causal graphical models [Pearl 2009]
• Potential Outcomes Framework [Imbens-Rubin 2016]
Graphical Models: Express causal relationships visually

ctr = f(alg, act, time)

Edges represent direct causes.
Directed paths represent indirect causes.

58
Graphical Models: Express causal relationships
as a Bayesian network

Markov condition: A node is independent of all other


non-descendants given its parents.
Leads to factorization of joint probability.

59
Graphical Models: Make assumptions explicit

The graph encodes all causal assumptions.


Assumptions are the nodes and edges that are
missing.
60
Example: Assumptions encoded in the graph
• The activity level of users affects which algorithm they are shown and their overall CTR.
• CTR is different at different times of day.
• Unobserved characteristics of a user determine when they visit the Store, which also affect their activity level, and in turn the algorithm they are shown.

61
Graphical Models: A language for intervention and counterfactual

Intervention on a node: acting to change the node exogenously, severing its ties to its parents.

P(CTR | do(Algorithm)), which is different from P(CTR | Algorithm).

62
Graphical Models: A language for intervention and counterfactual

Counterfactual: The recommendations were always shown. But what would have happened if they were not?

63
Graphical models: Provide a mechanistic way of identifying a causal effect

Appeals to the idea of controlling.
When we cannot control the environment, use conditioning.
64
Should we also restrict our comparison to people who come at the same times?

CTR does depend on the time of a user's visit.
But the algorithm assigned does not change based on time.
While CTR may be different at different times, any algorithm is equally likely to be shown at any point in time.

65
Tricky to find the correct variables to condition on.
Fortunately, graphical models make it precise.

"Backdoor" paths: look for (undirected) paths that point to both the treatment (Algorithm) and the outcome (CTR).

Pearl (2009) 66
Backdoor criterion: Condition on enough
variables to cover all backdoor paths

Identified Causal Effect:
P(CTR | do(Algorithm)) = Σ_activity P(CTR | Algorithm, Activity) P(Activity)

67
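For completeness, a short derivation sketch of this backdoor adjustment, written in LaTeX notation and assuming the graph used in this example (Activity is the only common cause of Algorithm and CTR). The second line uses the fact that intervening on the Algorithm does not change a user's Activity:

\begin{aligned}
P(\mathrm{CTR} \mid do(\mathrm{Algorithm}=a))
 &= \sum_{act} P(\mathrm{CTR} \mid \mathrm{Algorithm}=a, \mathrm{Activity}=act)\, P(\mathrm{Activity}=act \mid do(\mathrm{Algorithm}=a)) \\
 &= \sum_{act} P(\mathrm{CTR} \mid \mathrm{Algorithm}=a, \mathrm{Activity}=act)\, P(\mathrm{Activity}=act)
\end{aligned}

Every term on the right-hand side involves only observed quantities, which is what makes the effect estimable from data.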
Identified Causal Effect: the backdoor adjustment above.

For complicated graphs, do-calculus provides a set of rules to automate the identification process.

Wait, but how do we estimate this?

68
Potential Outcomes: Every variable has a counterfactual
Imagine each unit has two potential outcomes: one under treatment and one under control.
We need to estimate the difference between the real and the counterfactual world.

This formulation has led to some powerful methods for estimating causal effects.

It is equivalent to graphical models.

69
Potential Outcomes: Estimating an effect identified from the backdoor criterion
Causal effect: the average difference between the two potential outcomes, E[Y(1) − Y(0)].

Can estimate it by regressing the outcome on the treatment and the conditioning variables.
Valid if all effects are linear.

70
Unifying the two frameworks
Use graphical models and do-calculus for
• modeling the world
• identifying the causal effect

Use potential outcomes-based methods for
• estimating the causal effect

71
II.C Different ways to identify
and estimate causal effect

72
Running example: Estimating the effect of an algorithm

P(CTR | do(Algorithm))
73
Lookback: Need answers to "what if" questions

Counterfactual thinking*:
What would have happened if I had changed X?
E.g. What would the CTR have been had we not shifted to the new algorithm?

*Counterfactual theories of causation
http://plato.stanford.edu/entries/causation-counterfactual/
Ideal experiment

Naïve estimate: compare clicks to recommendations under the new algorithm with those of past users under the old algorithm.
Causal estimate: compare clicks for a user under the new algorithm with clicks for a cloned copy of the same user under the old algorithm.

Ideally, this requires creation of multiple worlds.
Methods for answering causal questions
(ordered from highest validity to greatest ease of use)

Randomization: A/B tests, multi-armed bandits
Natural experiments: regression discontinuity, instrumental variables
Conditioning: stratification, matching, propensity scores
1. Randomization to the rescue

Randomizing algorithm assignment: A/B test

We cannot clone users.
Next best alternative: randomly assign which users see the new algorithm's recommendations and which see the old algorithm's.
Randomization removes hidden variation

Causal estimate: compare clicks to recommendations for randomly assigned users of the new algorithm against randomly assigned users of the old algorithm.
Cost: Possibly risky, unethical

Say the new algorithm was really bad.
We can decrease the percentage of users who see the new algorithm, but how do we know this beforehand?
Such manual tweaks become even more inefficient when there are multiple algorithms to test.
Efficient randomization: Multi-armed bandits

Two goals:
1. Show the best
known algorithm to
most users.
2. Keep randomizing
to update
knowledge about
competing
algorithms.
Bandits: The right mix of explore and exploit

Most users see the current-best algorithm; the remaining users see a randomly chosen algorithm, and clicks to recommendations are recorded for both groups.
Algorithm: ɛ-greedy multi-armed bandits

Repeat:
(Explore) With low probability ɛ, choose an output item randomly.
(Exploit) Otherwise, show the current-best algorithm.

Use CTR results for the random output items to train new algorithms offline.
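A minimal Python sketch of this ɛ-greedy loop. The list of candidate algorithms and the reward function (returning 1 for a click, 0 otherwise) are placeholders for illustration, not part of the tutorial's code:

import random

def epsilon_greedy(algorithms, reward_fn, epsilon=0.1, rounds=10000):
    """Run an epsilon-greedy bandit over candidate recommendation algorithms.
    reward_fn(arm) should return 1 if the shown recommendation was clicked, else 0."""
    ctr_estimates = {a: 0.0 for a in algorithms}   # running mean reward per arm
    counts = {a: 0 for a in algorithms}            # number of times each arm was shown
    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.choice(algorithms)        # Explore: pick an arm at random
        else:
            arm = max(algorithms, key=lambda a: ctr_estimates[a])  # Exploit: current best
        reward = reward_fn(arm)
        counts[arm] += 1
        # Incremental update of the arm's estimated CTR
        ctr_estimates[arm] += (reward - ctr_estimates[arm]) / counts[arm]
    return ctr_estimates

The same loop structure underlies contextual bandits, where a current-best arm is kept per context rather than globally.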
Practical Example: Contextual bandits on Yahoo! News

Actions: different news articles to display.
A/B tests using all articles are inefficient.
Randomize the articles shown using an ɛ-greedy policy.
Better: use the context of the visit (user, browser, time, etc.) to maintain different current-best algorithms for different contexts.

Li-Chu-Langford-Schapire (2010) 84
Caveat: Not always feasible to randomize, or ensure that people fully comply

Randomization may be too expensive or involve ethical hazards.
There may not be perfect compliance with random assignment, e.g. a referral experiment for a subscription service like Netflix.

Even when feasible, randomization methods need a limited set of "good" alternatives to test.
• How do we identify a good set of algorithms or a good set of parameters?
• Common metrics like CTR will not be useful, because they might miss hidden causes.

We need causal inference "without intervention".
2. So how about naturally occurring experiments?

"Natural" experiments: exploit variation in observed data
We can exploit naturally occurring, close-to-random variation in data.
Since the data is not randomized, we need assumptions about the data-generating process.
If there is sufficient reason to believe the assumptions, we can estimate causal effects.

Dunning (2002), Rosenzweig-Wolpin (2000)
Example: Effect of Store recommendations
Suppose instead of comparing recommendation algorithms, we want to estimate the causal effect of showing any algorithmic recommendation.

This can be used to benchmark how much revenue a recommendation system brings, and to allocate resources accordingly
(and perhaps help analyze the tradeoff with users' privacy).
Exploiting arbitrary cutoffs to recommendations

Only 3 recommendations are shown to the user.
Assumption: closely-ranked not-shown apps are as relevant as shown apps.

Causal effect of being shown as a recommendation: compare the number of app installs for the 3rd ranked app (shown) with the 4th ranked app (not shown), for the same user.
Algorithm: Regression discontinuity
For any top-k recommendation list:
Using logs, identify apps that were similarly ranked but did not make it into the top-k shown apps.
Measure the difference in app installs between shown and not-shown apps for each user.

Imbens-Lemieux (2008)
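A minimal pandas sketch of this comparison. The rec_rank column matches the dataset description used later in the tutorial; the installs outcome column is illustrative, and the sketch assumes that ranks just below the cutoff are also logged:

import pandas as pd

def rd_estimate(logs: pd.DataFrame, k: int = 3) -> float:
    """Compare outcomes for apps just above and just below the top-k cutoff."""
    shown = logs[logs["rec_rank"] == k]          # last rank that was shown (e.g., 3rd app)
    not_shown = logs[logs["rec_rank"] == k + 1]  # first rank that was not shown (e.g., 4th app)
    # Difference in average outcome attributable to being shown as a recommendation
    return shown["installs"].mean() - not_shown["installs"].mean()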
Another natural experiment: Instrumental Variables
We can look at as-if random variation due to external events.
E.g. being featured on the Today show may lead to a sudden spike in installs for an app.
Such external shocks can be used to determine causal effects, such as the effect of showing recommendations.

Angrist-Pischke (2008)
Cont. example: Effect of store recommendations
How many new visits are caused by the recommender system?
Demand for App 1 is correlated with demand for App 2.
Users would most likely have visited App 2 even without recommendations.

On normal days, from traffic to App 1 and click-throughs from App 1 to App 2 alone, we cannot say much about the causal effect of recommendations shown on App 1.
External shock brings as-if random users to App 1

[Figure: a spike in visits to App 1, with a corresponding spike in click-throughs from App 1 to App 2]

If demand for App 2 remains constant, the additional views of App 2 would not have happened had these new users not visited App 1.

Sharma-Hofman-Watts (2015)
Exploiting sudden variation in traffic to App 1
To compute the causal CTR of visits to App 1 on visits to App 2:
• Compare the observed effect of the external event separately on visits to App 1 and on recommendation clicks to App 2.
• Causal click-through rate = (change in recommendation click-throughs to App 2) / (change in visits to App 1).
More generally…

Instrument (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U) affecting both X and Y.

As-if-random: Z is independent of U.
Exclusion: Z ⫫ Y | X, U (the instrument affects the outcome only through the cause).
97
Sources of such instruments: lotteries, weather, shocks, discontinuities, and other hard-to-find variations.
Examples: a change in access to digital services, a change in train stops in a city, a change in medicines at a hospital.

Dunning (2012)
Caveat: Natural experiments are hard to find
Estimates may not be generalizable to all products.
Critical assumptions may not be satisfied.

Suppose both sources of experimentation, controlled and natural, are ruled out.
Can we estimate causal effects with only observational data?
3. What can we conclude with
observed data?
Imagine a randomized experiment…

Compare clicks to recommendations for a random user shown the new algorithm against a random user shown the old algorithm.
Compare with a similar user instead of random

Causal estimate: compare clicks to recommendations for User 1 shown the new algorithm against a similar User 2 shown the old algorithm.
Continuing example: Effect of Algorithm on CTR
Does new Algorithm B increase CTR for recommendations on the Windows Store, compared to old Algorithm A?

1. Make assumptions about how the data was generated.
2. Create a graphical model representing those assumptions.
Backdoor criterion: Condition on enough
variables to cover all backdoor paths
Algorithm: Stratification
With observational data:
1. Assume a graphical model that explains how the
data was generated.
2. Choose variables to condition on using backdoor
criterion.
3. Stratify data into subsamples such that each
subsample has the same value of all conditioned
variables.
4. Evaluate the difference in outcome variable
separately within these strata.
5. (Optional) Aggregate over all data.
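A minimal pandas sketch of steps 3 to 5 above. The column names and the "A"/"B" treatment labels are illustrative, not the tutorial's exact schema:

import pandas as pd

def stratified_effect(df, treatment="algorithm", outcome="is_rec_visit", strata=("activity_level",)):
    """Steps 3-5: difference in mean outcome between the two treatment values
    within each stratum, aggregated with weights proportional to stratum size."""
    effects, weights = [], []
    for _, stratum in df.groupby(list(strata)):
        means = stratum.groupby(treatment)[outcome].mean()
        if {"A", "B"} <= set(means.index):       # both algorithms present in this stratum
            effects.append(means["B"] - means["A"])
            weights.append(len(stratum))
    # Weighted aggregate over all strata (step 5)
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)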
Stratification may be inefficient if there are multiple hidden causes
Stratification creates tiny strata when the data is high-dimensional, making it hard to obtain stable estimates.
E.g. activity data may be high-dimensional: a vector of purchases in each app category.

Key Idea: Instead of conditioning on all relevant attributes, condition on the likelihood of being assigned an algorithm.

Morgan-Winship (2014)
This was stratification…

User 1 (shown the new algorithm) is compared with a similar User 2 (shown the old algorithm) who is the same on all relevant attributes.
Instead, condition on propensity to treatment

User 1 is compared with a similar User 2 who was equally likely to be shown the new algorithm.
Continued example: Effect of Algorithm on CTR

Based on the backdoor criterion, we need to condition only on Activity.
Activity is multi-dimensional.
Estimate the likelihood of being shown the new algorithm using observed algorithm-user pairs.
Compare CTR between users with the same propensity score.
Algorithm: Propensity score matching
With observational data:
1. Assume a graphical model that explains how the data was
generated.
2. Choose variables to condition on using backdoor criterion.
3. Compute propensity score for each user based on conditioned
variables.
4. Match pairs of individuals with similar scores, where one of them
saw Old Algorithm and the other saw New Algorithm.
5. Compare the outcome variable within each such matched pair
and aggregate.

Morgan-Winship (2014)
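A minimal sketch of these steps with scikit-learn and pandas; the logistic-regression propensity model, the 1-nearest-neighbour matching scheme, and the column names are illustrative choices, not the exact procedure used for the Store data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def psm_effect(df, covariates, treatment="is_new_algorithm", outcome="clicked"):
    """Estimate the treatment effect by matching each treated user to the
    control user with the closest propensity score."""
    X, t, y = df[covariates].values, df[treatment].values, df[outcome].values
    # Step 3: propensity score = P(treatment | conditioning variables)
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
    diffs = []
    for i in treated:
        # Step 4: nearest control unit by propensity score
        j = control[np.argmin(np.abs(ps[control] - ps[i]))]
        # Step 5: outcome difference within the matched pair
        diffs.append(y[i] - y[j])
    return float(np.mean(diffs))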
Example: Causal effect of a social news feed
Friends on a social network may Like similar items.
E.g. on Last.fm, friends of a user may like similar music to the user.
This may be due to influence, or simply due to homophily.

Causal question: Given only log data, how can we determine social influence due to the newsfeed, compared to homophily effects?
117
Example: Causal effect of a social newsfeed
Solution: Use matching based on past items liked by each user to create a control group of non-friends that are as similar to a user as her friends.

[Figure: a user's ego network of friends alongside a matched set of non-friends]

Sharma-Cosley (2015), Aral-Muchnik-Sundararajan (2009) 118


Caveat: Causal effect only if the assumed graphical model is correct
There might be unknown and unobserved causes that affect an algorithm's CTR.
E.g. early adopters, more tech-savvy users, or some other characteristic.
There might also be known but unobserved user features.
E.g. their age, or the context in which they use an online system.

At best, with only observational data, we can obtain strong hints of causality.
4. Many of these techniques can be combined

120
Remember, we are always looking for the ideal experiment with multiple worlds:
compare clicks to recommendations for a user under the new algorithm against a cloned copy of the same user under the old algorithm.

121
Example: Randomization + Instrumental Variable
Fisher example: You could not randomize over plots, but could randomize which farmers get fertilizers.

Algorithm example: You cannot remove recommendations at random, but could advertise a focal product to a random subset of people on the homepage.

122
Random assignment as an instrument

Random assignment (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U) affecting both X and Y.
Exclusion: Z ⫫ Y | X, U.
123


Can use this variation to compute causal effect
An increase in Z can lead to a change in Y only through X.
So the change in Y is the product of the Z→X and X→Y arrows.
Compare the extent by which random assignment affects X versus Y.

Causal effect (X→Y) = (effect of Z on Y) / (effect of Z on X)

125
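A minimal Python sketch of this ratio (Wald) estimator; the variable names are illustrative:

import numpy as np

def wald_iv_estimate(z, x, y):
    """Causal effect of X on Y using instrument Z:
    (effect of Z on Y) / (effect of Z on X)."""
    z, x, y = map(np.asarray, (z, x, y))
    # Ratio of covariances: the Var(Z) terms in the two regression slopes cancel
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

The numerator is the reduced-form effect of Z on Y and the denominator is the first-stage effect of Z on X; dividing cancels everything except the X→Y arrow.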
II.D. Key takeaways
Three steps to causal inference
Model
Use a causal graphical model. Make your assumptions explicit, even if it is cumbersome.

Identify
Use the graph to identify the effect you are after, or check whether your desired identification strategy is valid (BEFORE looking at the data).

Estimate
Use any statistical method to estimate the identified effect.

127
Estimating causal effects: Best practices
Whenever possible, use randomization.
If the number of output items is low, consider using explore-exploit methods.

If randomization is not feasible, consider exploiting natural experiments.
It is better to consider multiple sources of natural experiments.

If natural experiments are hard to find, consider using conditioning methods.
Treat their results as strong hints of causality.
Causal inference is tricky
Correlations are seldom enough, and sometimes horribly misleading.

Always be skeptical of causal claims from any observational data.
More data does not automatically lead to better causal estimates.

http://tylervigen.com/spurious-correlations
III. Hands-on tutorial
Code and resources available at
http://www.github.com/amit-sharma/causal-inference-tutorial

Contact: amshar@microsoft.com, @amt_shrma

Prerequisites
Need Python, Jupyter notebook
Packages: numpy, pandas
1. Install them using:
pip install numpy pandas
pip install scikit-learn
(Optional)
pip install seaborn

2. Clone git repository

https://www.github.com/amit-sharma/causal-inference-tutorial
Example 1: Does having ice cream cause you
to swim?

133
Step 1: Model the world

Temperature

Ice Cream Swimming

134
Step 2: Identify the causal effect

Temperature

Ice Cream Swimming

P(Swimming | do(Ice Cream)) = Σ_temp P(Swimming | Ice Cream, Temperature = temp) P(Temperature = temp)  135
Step 3: Estimate the effect

import numpy as np
import pandas as pd
from sklearn import linear_model

def linear_causal_estimate(df):
    # Observed common cause(s): here the single column 'w0' (temperature)
    observed_common_causes = df[['w0']]
    # Reshape the treatment into a 2-d column vector
    treatment_2d = df["Treatment"].values.reshape(len(df["Treatment"]), -1)
    # Regress the outcome on the treatment plus the observed common causes
    features = np.concatenate((treatment_2d, observed_common_causes), axis=1)
    model = linear_model.LinearRegression()
    model.fit(features, df["Outcome"])
    coefficients = model.coef_
    # The coefficient on the treatment column is the (linear) causal estimate
    estimate = {'value': coefficients[0], 'intercept': model.intercept_}
    return estimate

estimate = linear_causal_estimate(df)
print("Causal Estimate is " + str(estimate["value"]))
136
Step 3: Estimate the effect

137
Example 2: Effect of recommendations
Study the effect of app store recommendation
system.
Using system logs,
Compare two recommendation algorithms.
Estimate the causal effect of recommendations.
I. Which of the two algorithms is better?
Situation: Two algorithms, A and B, were used to show app recommendations on the Store.

Data: System log data recording users' visits.

Causal question: Which algorithm leads to higher click-through rates?
Loading user-app visits data
Jupyter notebook

user_app_visits_A = pd.read_csv("../datasets/user_app_visits_A.csv")

user_app_visits_B = pd.read_csv("../datasets/user_app_visits_B.csv")
Dataset at a glance
Data description
user_id: Unique ID for user
activity_level: User’s activity level. Discrete (1:Lowest, 4:Highest)
product_id: Unique ID for an app
category: Category for an app (e.g. productivity, music, etc.)
is_rec_visit: Whether the app visit came through a
recommendation click-through.
rec_rank: Rank in the recommendation list (only top-3 apps
shown to user, -1 means that app was not in the recommendation
list)
What’s in the dataset?

> nrow(user_app_visits_A)
[1] 1,000,000
> length(unique(user_app_visits_A$user_id))
[1] 10,000
> length(unique(user_app_visits_A$product_id))
[1] 990
> length(unique(user_app_visits_A$category))
[1] 10
Causal assumptions
We ask the system designers/look at the source code
for the system.
Algorithm was selected based on activity level of
users.
Further, CTR depends on
• Activity level
• Time of day
• App category
Step 3: Estimate the effect of changing algorithm

Naïve estimate for comparing algorithms: Algorithm B appears to be better.

Step 1: Model the world
Step 2: Using the backdoor criterion, identify the correct variables to condition on
Step 3: Estimate the effect of changing algorithm
Stratified estimate for comparing algorithms
> stratified_by_activity_estimate(user_app_visits_A)
Source: local data frame [4 x 2]
activity_level stratified_estimate
1 1 0.1248852
2 2 0.1750483
3 3 0.2266394
4 4 0.2763522
> stratified_by_activity_estimate(user_app_visits_B)
Source: local data frame [4 x 2]
activity_level stratified_estimate
1 1 0.1253469
2 2 0.1753933
3 3 0.2257211
4 4 0.2749867
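For readers following along in Python rather than R, a rough pandas analogue of the stratified estimate above, assuming is_rec_visit is the click-through indicator as in the dataset description:

import pandas as pd

user_app_visits_A = pd.read_csv("../datasets/user_app_visits_A.csv")
# CTR within each activity-level stratum: fraction of visits that came via a recommendation click
print(user_app_visits_A.groupby("activity_level")["is_rec_visit"].mean())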
If we had conditioned on category…

> stratified_by_category_estimate(user_app_visits_A)
Source: local data frame [10 x 2]
category stratified_estimate
1 1 0.1758294
2 2 0.2276829
3 3 0.2763157
4 4 0.1239860
5 5 0.1767163
… … …
> stratified_by_category_estimate(user_app_visits_B)
Source: local data frame [10 x 2]
category stratified_estimate
1 1 0.2002127
2 2 0.2517528
3 3 0.3021371
4 4 0.1503150
5 5 0.1999519
… … …
I. Which of the two algorithms is better?
The two algorithms lead to roughly the same CTR.
Answer: Both are equally effective.

Still, the CTR estimate is likely an over-estimate of the causal effect of recommendations, as people might have visited some of the apps anyway.
How do we estimate the causal effect?
II. What is the causal effect of the recommendation system?
Situation: Two algorithms, A and B, were used to show app recommendations on the Store.

Data: System logs containing user-app visits.

Causal question: How many apps would users have visited had no recommendations been shown?

Step 1, Model: Graphical model to estimate causal effect

We observe total recommendation click-throughs.
But some of them may be due to correlated demand.
Step 2, Identify: Using regression discontinuity analysis

We know that the Store only shows the top-3 recommendations.

Comparing the number of visits to the 4th ranked app (not shown to the user) with the 3rd ranked app can be used to estimate the effect of showing a recommendation.
(Assuming the 3rd and 4th ranked apps are equally relevant to a user.)
Step 3, Estimate: Discontinuity estimate for recommendations

> naive_observational_estimate(user_app_visits_A)
naive_estimate
[1] 0.200768

> ranking_discontinuity_estimate(user_app_visits_A)
discontinuity_estimate
[1] 0.121362

About 40% of app visits coming from recommendation click-throughs are not causal: (0.200768 − 0.121362) / 0.200768 ≈ 0.40.
They could have happened even without the recommendation system.
IV. Advanced methods

155
Advanced methods

1. Inverse propensity weighting


2. Doubly robust estimation

Refutation: 4th step in causal analysis


(Model, Identify, Estimate, Refute)

156
IV.A. Inverse Propensity
Weighting

157
Weighting to balance treatment and control
Suppose the people who were shown a recommendation were already very likely to find the app themselves.

If we had a score for their likelihood of finding it themselves, we could:
downweight the people who saw recommendations and had a higher likelihood;
give higher weight to the people who had a lower likelihood of finding it anyway.

It turns out this creates a balanced dataset, assuming the likelihood is accurate. 158
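A minimal Python sketch of inverse propensity weighting; the propensity scores are assumed to come from a separate model (e.g. a logistic regression on the conditioning variables), and the variable names are illustrative:

import numpy as np

def ipw_estimate(y, t, propensity):
    """IPW estimate of E[Y(1)] - E[Y(0)].
    y: outcomes, t: binary treatment indicator, propensity: P(t = 1 | covariates)."""
    y, t, p = map(np.asarray, (y, t, propensity))
    treated_mean = np.mean(t * y / p)              # upweight rarely-treated units
    control_mean = np.mean((1 - t) * y / (1 - p))  # upweight rarely-untreated units
    return treated_mean - control_mean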
Estimation

Causal estimate: compare clicks to recommendations for users shown the new algorithm against control users shown the old algorithm, reweighted by their inverse propensity.

159
Caveat
What if the likelihood/propensity is too low?
This leads to high-variance estimates.
A single value can derail the estimate.

Can we combine the benefits of prediction and causal inference?

160
IV.B. Doubly robust estimation

161
Prediction + IPW
Use IPW as the unbiased estimate.

To reduce variance, add a term based on the output of a predictive ML model that tries to predict the outcome variable.

Combine them in a special way to obtain a key property:
the estimate is unbiased if either the propensity is correctly estimated, or the prediction model is accurate.

162
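A minimal Python sketch of a doubly robust (AIPW-style) estimator combining the two pieces; mu1 and mu0 stand for outcome predictions under treatment and control from any ML model, and are assumptions of this sketch:

import numpy as np

def doubly_robust_estimate(y, t, propensity, mu1, mu0):
    """AIPW estimate of E[Y(1)] - E[Y(0)].
    mu1, mu0: predicted outcomes under treatment / control from an ML model.
    Unbiased if either the propensity model or the outcome model is correct."""
    y, t, p, mu1, mu0 = map(np.asarray, (y, t, propensity, mu1, mu0))
    est1 = np.mean(mu1 + t * (y - mu1) / p)            # prediction plus IPW correction
    est0 = np.mean(mu0 + (1 - t) * (y - mu0) / (1 - p))
    return est1 - est0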
Simple example: Estimating average height of
students in a class

163
IV.C. Refuting the obtained
causal estimate

164
Causal estimate is a result of your assumptions
The "causal" part does not come from the data.
It comes from your assumptions that lead to identification.

The data is simply used for statistical estimation.

It is critical to verify your assumptions. How?

165
Sanity check 1: Add random variables to your
model
Can add randomly drawn common causes, or
random instruments.

Rerun your analysis.

Does the causal estimate change?

166
Sanity check 2: Replace treatment by a placebo (A/A test)
Randomize or permute the treatment.
Or match treatment and control to have the same distribution of the treatment variable.

Rerun your analysis.
Does the causal estimate change?

167
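A minimal Python sketch of this placebo check: permute the treatment column and re-run whatever estimator was used for the main analysis; the placebo estimates should hover around zero. The estimator argument is a placeholder for your own analysis function:

import numpy as np

def placebo_refute(df, estimator, treatment="algorithm", n_runs=20, seed=0):
    """Re-estimate the effect after randomly permuting the treatment column.
    If the original estimate survives this check, something is wrong."""
    rng = np.random.default_rng(seed)
    placebo_estimates = []
    for _ in range(n_runs):
        shuffled = df.copy()
        shuffled[treatment] = rng.permutation(shuffled[treatment].values)
        placebo_estimates.append(estimator(shuffled))
    return float(np.mean(placebo_estimates))  # should be approximately zero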
Sanity Check 3: Divide data into subsets
(cross-validation)
Create subsets of your data.

Rerun your analysis.

Does the causal estimate change?

168
Sensitivity Analysis: How bad would a confounder have to be for your estimate to reverse?
Use simulation to add the effect of unknown confounders.
Use domain knowledge to guide reasonable values for the simulation.
Make comparisons to other known estimates.

169
Does smoking cause lung cancer?

Demographics      Genes

Smoking → Lung Cancer

Cornfield (1959) showed that the effect of Genes would have to be 8 times that of any known confounder for the effect to go to zero.

170
Observational causal inference: Best practices
Always follow the four steps: Model, Identify, Estimate, Refute.
Refute is the most important step.

Aim for simplicity.
If your analysis is too complicated, it is most likely wrong.

Try at least two methods with different assumptions.
Higher confidence in the estimate if both methods agree.

171
V. Hands-on exercise

172
Example 3: Estimating causal effect with many
auxiliary variables

Use multiple methods


Add Step 4: Try to refute your own estimate.

173
VI. Case studies on online
social systems

174
Studies
1. Distinguishing between homophily and influence
2. Measuring peer effects
3. Is running contagious?
4. Estimating counterfactuals in an online ad system
5. Measuring effect of marketing campaigns

175
Distinguishing homophily and contagion is
impossible

176
Measuring peer effects

Use randomized experiments to validate.

179
Is running contagious?
Use weather as an instrument: Does rain in NYC cause
your friends in LA to also run less?

181
Estimating causal effects in ad systems

182
Estimating effect of marketing campaigns
Use synthetic controls.
Construct them by using Bayesian modeling.

An intuitive method when no controls are present in the data.

185
VII. Further curiosity

186
1. Dealing with social networks
Networks mess up causal inference, because individual units are no longer independent.

The key assumption of independence of counterfactuals breaks.

If I treat a person, she might spread the treatment (partially) to her friends, so the assignment to her friends is not completely randomized.
Think of a video sharing feature.

187
1. Dealing with social networks
Networks mess up causal inference, because individual units are no longer independent.

Possible solution: New Zealand.
Or better clustering of the network (Ugander et al.).
Or really complicated statistics (Eckles et al.).

188
2. Causal inference and machine learning

Causal inference → robust prediction

(Supervised) ML: predicted value under the training distribution P(X, y).
Causal inference: predicted value under the counterfactual distribution P'(X, y).
2. Causal inference and machine learning

Machine learning can use causal inference methods for robust, generalizable prediction.
Causal inference can use ML algorithms to better model the non-linear effect of confounders, or to find low-dimensional representations.

In general, be wary of methods that have not been empirically tested, especially ones that you do not understand.
3. Automating causal inference
Causal inference is tricky; it requires patience and ingenuity.

But it also has a lot of mechanical and statistical parts.

To what extent can the tasks of modeling, identification, estimation and refutation be automated like ML?
-- Finding the best method based on data
-- Mining natural experiments from data

This is a really, really nascent field.

191
Causal inference: Best practices
Always follow the four steps: Model, Identify, Estimate, Refute.
Refute is the most important step.

Aim for simplicity.
If your analysis is too complicated, it is most likely wrong.

Remember the order for validity: Randomization, Natural experiments, Conditioning.
Consider observational methods as strong hints.
Thank you!

Amit Sharma
Researcher, Microsoft Research India

amshar@microsoft.com
@amt_shrma
