Randomization and Selection on Observables
Lukáš Lafférs
Matej Bel University, Dept. of Mathematics
MUNI Brno, 11.11.2021 - 12.11.2021

We may be fortunate enough to run a randomized experiment. This makes identification and estimation of causal effects easy. But even a proper experiment may be "broken" in many interesting ways. In many other cases an experiment is not possible, and we rely on observable characteristics making the treatment "as good as random". There are different ways to do this, with different pros and cons.

Randomization

- N individuals
- D_i ∈ {0,1} treatment indicator
- Y_i(D_i) potential outcomes
- Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0) observed outcome
- Y_i(·) is a function of the i-th treatment only and there are no interactions between units; there are no hidden versions of the treatment, everyone receives either 0 or 1
- δ_i = Y_i(1) − Y_i(0) is the individual treatment effect

Y(1) and Y(0): what are they really?

Pr(Y_i(1) = y) = Pr(Y_i = y | do(D = 1))

What if we cannot manipulate the treatment? What if manipulation does not even make sense? Is it enough if we can contemplate it? Sometimes we can manipulate the treatment; sometimes nature manipulates it for us (e.g. gender).

Causal inference is a missing data problem. You have to fix this. Somehow.

[DAGs: in observational data, U points into both D and Y; in a randomized trial the arrow from U into D is absent. The same pictures repeat with an observed X added.]

Treatment is randomized, so all the parents of D are removed: there is no way for X or U to have any influence on D. Y is a "collider" on the path between D and X, and that path is therefore blocked. Hence D ⊥⊥ X and D ⊥⊥ U.

[DAGs for the two arms D = 1 and D = 0, each with Y, U and X.]

Randomization manipulated the treatment status of these people. If randomization was successful, the two groups will not differ in terms of X.

Randomization is the benchmark

If randomization worked, we should have

E[X | D = 1] = E[X | D = 0],

and this can be checked in the data. The subjects should ideally differ only in terms of D. Apples to apples.

Economics Nobel Prize 2019: Abhijit Banerjee, Esther Duflo and Michael Kremer, for their experimental approach to alleviating global poverty. https://www.nobelprize.org/prizes/economic-sciences/2019/press-release/

We aim to have comparable units.

E[Y(1) − Y(0)] - average treatment effect (ATE)

E[Y | do(D = 1)] = E[Y(1)] = E[Y(1) | D = 1] (observed) = E[Y(1) | D = 0] (unobserved)
E[Y | do(D = 0)] = E[Y(0)] = E[Y(0) | D = 0] (observed) = E[Y(0) | D = 1] (unobserved)

where the last equality on each line holds under randomization, so that

E[Y(1)] − E[Y(0)] = E[Y(1) | D = 1] − E[Y(0) | D = 0] = E[Y | D = 1] − E[Y | D = 0].

E[Y(1) − Y(0) | D = 1] - average treatment effect on the treated (ATT)

E[Y(1) − Y(0) | D = 1] = E[Y(1) | D = 1] (observed) − E[Y(0) | D = 1] (unobserved)
                       = E[Y | D = 1] − E[Y | D = 0] under randomization.

Here only one counterfactual, E[Y(0) | D = 1], is needed.

Decomposition

E[Y | D = 1] − E[Y | D = 0] = ( E[Y(1) | D = 1] − E[Y(0) | D = 1] )   ... ATT = E[Y(1) − Y(0) | D = 1]
                            + ( E[Y(0) | D = 1] − E[Y(0) | D = 0] )   ... selection bias

Selection bias is zero under randomization.

Potential problems (not potential outcomes this time):
- randomization itself
- outcome attrition
- knowing you are in an experiment
- sample size (experiments are expensive)
- external validity
- non-scalability
- peer effects, general equilibrium effects

Duflo, Esther, Rachel Glennerster, and Michael Kremer. "Using randomization in development economics research: A toolkit." Handbook of Development Economics 4 (2007): 3895-3962.

Some further tips
- Prospective trials often lead to surprises. Some programs fail. Beware of publication bias.
- Report not only the effects we are interested in, but also mechanisms and potential side effects.
- RCTs are costly and difficult, but feasible.
- Spillover effects are real.

Kremer, Michael. "Randomized evaluations of educational programs in developing countries: Some lessons." American Economic Review 93.2 (2003): 102-106.
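To make the benchmark concrete, here is a minimal simulation sketch in R (my own illustration, not code from the lecture): with randomized D, the difference in means recovers the ATE, and the balance condition E[X | D = 1] = E[X | D = 0] holds up to sampling noise.

```r
# Minimal RCT simulation (illustrative only).
set.seed(1)
n  <- 10000
x  <- rnorm(n)               # observed characteristic
y0 <- x + rnorm(n)           # potential outcome without treatment
y1 <- y0 + 2                 # individual effect delta_i = 2 for everyone
d  <- rbinom(n, 1, 0.5)      # randomized treatment, independent of X
y  <- d * y1 + (1 - d) * y0  # observed outcome

mean(y[d == 1]) - mean(y[d == 0])  # difference in means, close to ATE = 2
mean(x[d == 1]) - mean(x[d == 0])  # balance check, close to 0
t.test(x ~ d)                      # formal balance test on X
```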
Implementation matters too. It is important to have a partner organization you can trust.

Example: Tennessee STAR experiment

- Student Teacher Achievement Ratio: do smaller classes make sense? They are expensive.
- Cost $12 million; implemented on 11,600 kindergarten children in 1985/86.
- Long, expensive, logistically difficult.
- A useful benchmark, but we might want to learn about the effects sooner.
- You can work with the data on your own: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/10766

Example: Tennessee STAR experiment. Apples to apples? [Table 2.2.1 of Angrist and Pischke (2009).]

RCT and regression

Write Y = α + ρD + η, where α = E[Y(0)], ρ = Y(1) − Y(0) and η = Y(0) − E[Y(0)]. Then

E[Y | D = 1] = α + ρ + E[η | D = 1]
E[Y | D = 0] = α + E[η | D = 0]
E[Y | D = 1] − E[Y | D = 0] = ρ (treatment effect) + E[η | D = 1] − E[η | D = 0] (selection bias),

if we assume that ρ is non-random (homogeneous treatment effects).

RCT and regression + covariates

- Assignment was random only within schools, so add school-specific intercepts.
- Inclusion of covariates may improve the statistical precision of the estimate of ρ:

Y = α + ρD + X^T γ + η

Note: we still assume homogeneous treatment effects, and we now assume a specific linear form for how X is connected to Y; this may be thought of as an approximation. [Adjust for X in an RCT or not? See Negi and Wooldridge (2021).]

Example: Tennessee STAR experiment. [Table 2.2.2 of Angrist and Pischke (2009).]

Selection on observables

Y(0), Y(1) ⊥⊥ D | X

We rarely have the luxury of an RCT, especially in economics. Observational data may be useful for recovering causal relationships, but this often requires modelling and deep institutional knowledge. Sometimes we have something that resembles an RCT; we will discuss this later. Here we assume that the richness of X allows us to close all the backdoor paths from D to Y.

The assumption Y(0), Y(1) ⊥⊥ D | X has various labels:
- conditional independence assumption (CIA)
- unconfoundedness
- ignorability
- selection on observables

[DAG: X points into both D and Y; D points into Y.]

How realistic is this model? Well, obviously: it depends. If you have a rich set of information (many, many variables in X), it might be fine. But then the model is tricky to specify, and you also need a large data set. And within a large data set units are very different, so homogeneity rarely makes sense.

Identification is straightforward. There are, however, different statistical techniques for estimating the effects. We will cover these classes of estimation techniques:
- regression
- matching
- propensity score weighting

What do they have in common? They are all estimated from observational data. There is no randomization, and no quasi-randomization, involved.

Regression

We know a lot about the mechanics of linear regression, projections, etc. In the first part of the course we were silent about the causal interpretation; we assumed that the model was correctly specified.

Y = α + ρD + X^T γ + ε
E[Y(1) | X] = E[Y | X, D = 1] = α + ρ + X^T γ
E[Y(0) | X] = E[Y | X, D = 0] = α + X^T γ

For this simple linear model with no heterogeneity,

ATT = E[Y(1) − Y(0) | X] = ρ = E[Y(1) − Y(0)] = ATE.

We made use of E[ε | X, D] = 0.

[DAG: X points into D and Y; ε points into Y.]

Linearity?

Y = f(D, X) + ε
E[Y(1) | X, D = 1] = E[Y | X, D = 1] = f(1, X)
E[Y(0) | X, D = 1] = E[Y | X, D = 0] = f(0, X)   (by the CIA)

Define δ_X ≡ E[Y | X, D = 1] − E[Y | X, D = 0]. Then

E[Y(1) − Y(0) | X, D = 1] = f(1, X) − f(0, X) = δ_X   (by the CIA)
E[Y(1) − Y(0) | D = 1] = E[ E[Y(1) − Y(0) | X, D = 1] | D = 1 ] = ∑_x δ_x Pr(X = x | D = 1)
E[Y(1) − Y(0)] = E[ E[Y(1) − Y(0) | X] ] = ∑_x δ_x Pr(X = x)

Matching

Matching is a class of statistical techniques that takes "We aim to have comparable units." very seriously.
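A minimal sketch of the δ_X aggregation above, on simulated data (an illustration of mine with made-up numbers, not the lecture's data): compute δ_x within each cell of a discrete X, then weight by Pr(X = x) for the ATE or by Pr(X = x | D = 1) for the ATT.

```r
# Stratification estimator for a binary X (illustrative simulation).
set.seed(2)
n <- 10000
x <- rbinom(n, 1, 0.4)                    # discrete covariate
d <- rbinom(n, 1, 0.2 + 0.5 * x)          # selection on observables: D depends on X
y <- 1 + 3 * x + (2 + x) * d + rnorm(n)   # heterogeneous effect: 2 if x = 0, 3 if x = 1

delta <- sapply(0:1, function(k)
  mean(y[d == 1 & x == k]) - mean(y[d == 0 & x == k]))  # delta_x cell by cell

ate <- sum(delta * table(x) / n)               # weights Pr(X = x)
att <- sum(delta * table(x[d == 1]) / sum(d))  # weights Pr(X = x | D = 1)
c(naive = mean(y[d == 1]) - mean(y[d == 0]), ate = ate, att = att)
```

The naive difference in means mixes the treatment effect with selection bias; the two weighted averages of δ_x separate the estimands.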
Example: Matching - Titanic

- 700 out of 2,200 people on board survived.
- Did wealth affect the probability of survival?
- Women and children were given priority, but they were also more likely to be in the first class.

[DAG: D → Y, with X1 and X2 each pointing into both D and Y. D - first class, X1 - gender, X2 - age (old/young), Y - survived.]

Two back-door paths. Any unobserved confounders are ruled out.

Four categories: {young male, young female, old male, old female}.

E[Y | D = 1] − E[Y | D = 0] = 0.354
E[Y(1) − Y(0)] = ∑_x δ_x Pr(X = x) = 0.196
E[Y(1) − Y(0) | D = 1] = ∑_x δ_x Pr(X = x | D = 1) = 0.238
E[Y(1) − Y(0) | D = 0] = ∑_x δ_x Pr(X = x | D = 0) = 0.189

By stratification we lose information. As a reward, we get something that is easy to interpret and implement. If we stratify finely, we may have few observations in some groups: there is no 12-year-old boy in the first class.

ATT = ∑_{k=1}^{K} (Ȳ_{1,k} − Ȳ_{0,k}) · N_T^k / N_T
ATC = ∑_{k=1}^{K} (Ȳ_{1,k} − Ȳ_{0,k}) · N_C^k / N_C
ATE = ∑_{k=1}^{K} (Ȳ_{1,k} − Ȳ_{0,k}) · N^k / N

where K is the number of categories, Ȳ_{1,k} and Ȳ_{0,k} are the mean outcomes of the treated and the controls in category k, N_T^k, N_C^k and N^k are the numbers of treated, controls and all units within category k, and N_T, N_C and N are the corresponding overall numbers.

Example: Matching - Angrist (1998)

- Voluntary military service: how did it affect wages? The military was the largest employer; military size declined sharply in 1987.
- Compares applicants; 50% of them enlisted. Applicants are not chosen at random.
- 698,000 observations. Information in X: year of application, test score group, schooling level, year of birth.
- Heterogeneous across race: separate estimates for Whites and Non-whites.
- 8,760 cells, but only 5,654 had at least 25 observations.

[Fig. 2, Fig. 3 and part of Table 2 in Angrist (1998).]

Matching vs. Regression

These results differ. Why? Explore the simplest possible case: binary X.

Saturated model (heterogeneous effects):

Y = β_0 + β_1 X + δ_0 D(1 − X) + δ_1 DX
δ_1 = E[Y | X = 1, D = 1] − E[Y | X = 1, D = 0]
δ_0 = E[Y | X = 0, D = 1] − E[Y | X = 0, D = 0]

The implied ATT is a weighted average of the cell-level effects:

E[Y(1) − Y(0) | D = 1] = ∑_x δ_x Pr(X = x | D = 1)
  = δ_0 Pr(X = 0 | D = 1) + δ_1 Pr(X = 1 | D = 1)
  = δ_0 · Pr(D = 1 | X = 0)·Pr(X = 0) / Pr(D = 1) + δ_1 · Pr(D = 1 | X = 1)·Pr(X = 1) / Pr(D = 1)
  = δ_0 w_0^M + δ_1 w_1^M

Non-saturated model (homogeneous effects):

Y = α + ρD + γX + ε

The conditional ATT is assumed to be the same for both X = 1 and X = 0, and [see 3.3.1 in Angrist and Pischke (2009)]

ρ̂ → ∑_x δ_x [Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x))] Pr(X = x) / ∑_x [Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x))] Pr(X = x) = δ_0 w_0^R + δ_1 w_1^R

Comparison - Matching vs. Regression

w_x^M = Pr(D = 1 | X = x) · Pr(X = x) / Pr(D = 1)   ... proportional to the share of treated among X = x
w_x^R = Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x)) · Pr(X = x) / ∑_x [Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x))] Pr(X = x)   ... proportional to the variance of D given X = x

Matching puts most weight on cells with many treated units; regression puts most weight on cells where treatment status varies the most.

Different types of matching

In many interesting cases exact matches are not possible. We then need some measure of how similar different units are, and there are many ways this can be done.

Overlap

0 < P(D = 1 | X) < 1,

where P(D = 1 | X) > 0 is needed for E[Y(1) | D = 0] and P(D = 1 | X) < 1 for E[Y(0) | D = 1]. It is important to have comparable units. If we don't, we may drop the offending observations, or we may rely on extrapolation. Dropping observations means we estimate effects only on a subpopulation, so the object of interest changes. You don't want to extrapolate much but, at the same time, you want your effect to be representative enough.
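Returning to the matching-versus-regression weights above, a small numeric sketch (made-up probabilities, not estimates from any dataset) shows how the two schemes diverge.

```r
# Matching (ATT) weights vs. regression weights for binary X
# (made-up probabilities, purely illustrative).
px  <- c(0.5, 0.5)    # Pr(X = 0), Pr(X = 1)
pdx <- c(0.1, 0.8)    # Pr(D = 1 | X = x): treatment much likelier when X = 1
pd  <- sum(pdx * px)  # Pr(D = 1)

wM <- pdx * px / pd                                     # share of treated in cell x
wR <- pdx * (1 - pdx) * px / sum(pdx * (1 - pdx) * px)  # variance of D in cell x

rbind(matching = wM, regression = wR)
# matching   puts ~0.89 of the weight on X = 1, where most treated units are;
# regression puts ~0.64 there, favouring cells where D varies the most.
```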
One-to-one matching

ATT = (1/N_T) ∑_{i: D_i = 1} ( Y_i − Y_{j(i)} ),

where j(i) is a control-group unit "similar" to i in terms of X. We compare Y_i to that similar unit.

One-to-many matching

ATT = (1/N_T) ∑_{i: D_i = 1} ( Y_i − (1/M) ∑_{m=1}^{M} Y_{j_m(i)} ),

where j_m(i) is one of the M control-group units "similar" to i in terms of X. We compare Y_i to the average of the similar units.

Nearest neighbour covariate matching

How to measure how similar the units are? Plain Euclidean distance:

||X_i − X_j||² = (X_i − X_j)^T (X_i − X_j) = ∑_{n=1}^{p} (X_{ni} − X_{nj})²

Or weight by the variances:

||X_i − X_j||² = (X_i − X_j)^T V̂^{-1} (X_i − X_j) = ∑_{n=1}^{p} (X_{ni} − X_{nj})² / σ̂_n²

Or weight by the covariance matrix (Mahalanobis distance):

||X_i − X_j||² = (X_i − X_j)^T Σ̂^{-1} (X_i − X_j)

Bias

- The larger the dimension of X, the more difficult it is to find good matches.
- Data-hungry: X_i converges to X_{j(i)} only slowly.

Bias-corrected matching estimator:

ATT_BC = (1/N_T) ∑_{i: D_i = 1} [ ( Y_i − Y_{j(i)} ) − ( Ê[Y | X = X_i, D = 0] − Ê[Y | X = X_{j(i)}, D = 0] ) ],

where the second difference is the bias-correction term.

Variance?

Without replacement (each control unit is used at most once):

σ̂²_ATT = (1/N_T) ∑_{i: D_i = 1} ( Y_i − (1/M) ∑_{m=1}^{M} Y_{j_m(i)} − ATT )²

With replacement (control units may be used more than once), the same expression is augmented by a correction term involving K_i, the number of times unit i is reused as a match. (In this particular case the bootstrap fails - Abadie and Imbens (2008).)

Matching vs. Regression - practical considerations

- There are many different ways to perform matching. There are many different ways to perform regression. Researcher degrees of freedom are a problem.
- Matching is appealing because it is easy to communicate to outsiders.
- Regression is appealing as there seem to be (or are?) fewer degrees of freedom.

Angrist (1998), page 255: Angrist, Joshua. "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants." Econometrica 66.2 (1998): 249-288.

Example: LaLonde (1986)

- Very influential study. Does job training increase future wages?
- Having a randomized treatment (NSW - National Supported Work), LaLonde could compare matching estimators based on two observational datasets (CPS - Current Population Survey, and PSID - Panel Study of Income Dynamics) to the estimate from the randomized sample, which served as a benchmark.
- Results are pessimistic: estimates from the observational datasets are all over the place! E.g. $800 vs. -$8,000 vs. -$4,400.
- Well, the samples are very different.
- It is important to check how comparable the treated and controls are in our matched sample. This is called balance. The success of matching can be shown using a balance graph. An excellent implementation is in the MatchIt package in R.

[Balance plots: distributional balance for "race" and "educ", unadjusted vs. adjusted sample, by treatment group.]

Propensity score

p(x) = P(D = 1 | X = x)

We may sidestep the high dimensionality of X in a very neat way: project it onto the one quantity that matters, the probability of treatment.

Propensity score matching

This idea comes from Donald Rubin (e.g. Rubin, 1977) and Paul Rosenbaum (Rosenbaum and Rubin, 1983, over 30k citations).

Y(0), Y(1) ⊥⊥ D | X  =⇒  Y(0), Y(1) ⊥⊥ D | p(X)

Proof sketch:

Pr(D = 1 | Y(1), Y(0), p(X)) = E[D | Y(1), Y(0), p(X)]
  = E[ E[D | Y(1), Y(0), X] | Y(1), Y(0), p(X) ]
  = E[ E[D | X] | Y(1), Y(0), p(X) ]   (by the CIA)
  = E[ p(X) | Y(1), Y(0), p(X) ] = p(X),

while

Pr(D = 1 | p(X)) = E[D | p(X)] = E[ E[D | X, p(X)] | p(X) ] = E[ p(X) | p(X) ] = p(X).

The two coincide, which gives the conditional independence given p(X).
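The balancing property just derived also implies D ⊥⊥ X | p(X), which can be eyeballed in a small simulation (again an illustration, not lecture code): X is badly imbalanced overall, but nearly balanced within strata of the propensity score.

```r
# Balancing property of the propensity score (illustrative simulation).
set.seed(3)
n  <- 100000
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(x1 + x2)  # true propensity score p(X)
d  <- rbinom(n, 1, p)

strata <- cut(p, quantile(p, 0:5 / 5), include.lowest = TRUE)  # 5 PS blocks

mean(x1[d == 1]) - mean(x1[d == 0])  # raw imbalance in x1: clearly nonzero
tapply(x1, list(strata, d), mean)    # within strata, columns 0 and 1 nearly agree
```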
Propensity score matching

[DAG: X → p(X) → D, X → Y, D → Y.]

Conditioning on p(X) closes the backdoor path from D to Y through X. Also notice that D ⊥⊥ X | p(X): the remaining path between D and X runs through Y, which is a collider on that path, so it is blocked.

δ_{p(X)} = E[Y | D = 1, p(X)] − E[Y | D = 0, p(X)]
E[Y(1) − Y(0) | D = 1] = E[ δ_{p(X)} | D = 1 ]

Propensity score matching, step by step:

1. Use logit/probit to estimate the propensity scores: log[ p(X) / (1 − p(X)) ] = X^T β.
2. Sort observations according to p̂(X).
3. Stratify the sample into blocks so that mean scores are not statistically different between treated and controls.
4. Check for balance. If there is no balance within a block → split the block. If, for some variable, there is no balance in any block → revisit the model specification in Step 1.

Implemented in Stata by Becker, Sascha O., and Andrea Ichino. "Estimation of average treatment effects based on propensity scores." The Stata Journal 2.4 (2002): 358-377.

There are other ways PS matching can be implemented:
- nearest neighbour matching
- radius matching
- kernel matching - weight controls by a kernel function, so controls close to the propensity score of the treated unit get larger weight

Example: Dehejia and Wahba (2002)

Uses data from LaLonde (1986). Compares the randomized NSW data to two observational datasets: CPS and PSID.

PS matching in detail:
- With or without replacement? Smaller PS distance vs. fewer comparison units.
- How many comparison units? Smaller PS distance vs. increased precision.
- Which matching method to use? Caliper matching can use more (fewer) matches if (not) available. If overlap is good, different matching methods will lead to similar results.

National Supported Work Program: provided work experience to people with social problems. Here is a randomized sample from LaLonde (1986).

[Figs. 1-2, Figs. 5-6, Table 2 and Table 3 from Dehejia and Wahba (2002).]

Lessons to take

- When few control units are available, use sampling with replacement (you can use the same control twice).
- When enough control units are available, sampling without replacement is fine.
- Careful diagnostics aid the right choices.

So perhaps it is not as bad as LaLonde (1986) suggested?

→ Reply: Smith, Jeffrey A., and Petra E. Todd. "Does matching overcome LaLonde's critique of nonexperimental estimators?" Journal of Econometrics 125.1-2 (2005): 305-353. Results are sensitive to the covariates used in the PS estimation and to the choice of the sample. PSM "...does not represent a general solution to the evaluation problem".

→ Rejoinder: Dehejia, Rajeev. "Practical propensity score matching: a reply to Smith and Todd." Journal of Econometrics 125.1-2 (2005): 355-364. Yes, one should check the sensitivity of estimates to the PS model specification. A high-quality comparison group should not be too sensitive. With this in mind, PSM works fine, even in the different subsamples of LaLonde (1986).

Implementation issues: there are many ways PS matching can be implemented. [Fig. 1 in Caliendo, Marco, and Sabine Kopeinig. "Some practical guidance for the implementation of propensity score matching." Journal of Economic Surveys 22.1 (2008): 31-72.]
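The MatchIt package mentioned earlier wraps these implementation choices. A minimal sketch on the lalonde sample that ships with the package (argument names as in recent MatchIt versions; details may differ across versions, so treat this as a starting point rather than a recipe):

```r
# Nearest-neighbour matching on a logit propensity score with MatchIt,
# using the lalonde data bundled with the package.
library(MatchIt)
data("lalonde", package = "MatchIt")

m <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
             data     = lalonde,
             method   = "nearest",  # one-to-one nearest-neighbour PS matching
             distance = "glm")      # logit propensity score

summary(m)           # covariate balance before and after matching

md <- match.data(m)  # the matched sample
mean(md$re78[md$treat == 1]) - mean(md$re78[md$treat == 0])  # crude ATT on 1978 earnings
```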
Inverse propensity score weighting

Y(0), Y(1) ⊥⊥ D | X implies

ATE = E[Y(1)] − E[Y(0)] = E[ Y·D / p(X) ] − E[ Y·(1 − D) / (1 − p(X)) ]
ATT = E[Y(1) | D = 1] − E[Y(0) | D = 1] = ( E[Y·D] − E[ Y·(1 − D)·p(X) / (1 − p(X)) ] ) / Pr(D = 1)

Why does this work?

E[ Y·D / p(X) ] = E[ E[ Y·D / p(X) | X ] ]
  = E[ E[ Y(1)/p(X) | D = 1, X ] · Pr(D = 1 | X) ]
  = E[ E[ Y(1)/p(X) | D = 1, X ] · p(X) ]
  = E[ E[ Y(1) | D = 1, X ] ]
  = E[ E[ Y(1) | X ] ] = E[Y(1)]   (the last step uses the CIA),

and similarly for the other quantities.

Estimation: first estimate p̂, then

ATE = (1/N) ∑_i Y_i D_i / p̂(X_i) − (1/N) ∑_i Y_i (1 − D_i) / (1 − p̂(X_i))
ATT = (1/N_T) [ ∑_i Y_i D_i − ∑_i Y_i (1 − D_i) · p̂(X_i) / (1 − p̂(X_i)) ]

Normalized versions (more stable):

ATE = [ ∑_i Y_i D_i / p̂(X_i) ] / [ ∑_i D_i / p̂(X_i) ] − [ ∑_i Y_i (1 − D_i) / (1 − p̂(X_i)) ] / [ ∑_i (1 − D_i) / (1 − p̂(X_i)) ]
ATT = [ ∑_i Y_i D_i ] / [ ∑_i D_i ] − [ ∑_i Y_i (1 − D_i) p̂(X_i) / (1 − p̂(X_i)) ] / [ ∑_i (1 − D_i) p̂(X_i) / (1 − p̂(X_i)) ]

Weighting: Hirano and Imbens (2001). Performance under different constructions of standard errors: Bodory, Camponovo, Huber, and Lechner (2020). R package treatweight by Bodory and Huber (2021).

- Sensitive to the specification of p(·).
- May require trimming.
- Does not rely on stratification or matching (fewer degrees of freedom?).
- Standard errors need to take into account that the propensity scores are only estimated (Hirano, Imbens and Ridder, 2003).
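The estimators above in a few lines of R (a simulation sketch of mine with crude trimming, as cautioned above; not the treatweight package):

```r
# Plain and normalized IPW estimators on simulated data with selection on X.
set.seed(4)
n <- 10000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))                 # selection on the observable X
y <- x + 2 * d + rnorm(n)                    # constant effect: ATE = ATT = 2

ps <- fitted(glm(d ~ x, family = binomial))  # estimated propensity score
ps <- pmin(pmax(ps, 0.01), 0.99)             # crude trimming of extreme scores

ate      <- mean(y * d / ps) - mean(y * (1 - d) / (1 - ps))
ate_norm <- sum(y * d / ps) / sum(d / ps) -
            sum(y * (1 - d) / (1 - ps)) / sum((1 - d) / (1 - ps))
att_norm <- sum(y * d) / sum(d) -
            sum(y * (1 - d) * ps / (1 - ps)) / sum((1 - d) * ps / (1 - ps))
c(ate = ate, ate_normalized = ate_norm, att_normalized = att_norm)
```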
Wrap-up

There are different ways to estimate the quantity of interest (e.g. ATE, ATT) if our observables are informative enough to explain the selection bias: regression, matching, IPW. They all have pros and cons. It is the selection-on-observables assumption that drives the identification. Without it, any of these estimators is dubious at best.

Thank you for your attention!

References

Tips and tricks on the implementation of randomization:
Duflo, Esther, Rachel Glennerster, and Michael Kremer. "Using randomization in development economics research: A toolkit." Handbook of Development Economics 4 (2007): 3895-3962.

This book is a classic. Somewhat opinionated. By the pioneers of the field:
Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly Harmless Econometrics. Princeton University Press, 2009.

Very readable and engaging book, highly recommended:
Cunningham, Scott. Causal Inference: The Mixtape. Yale University Press, 2021.

Adjusting for X in an RCT?
Negi, Akanksha, and Jeffrey M. Wooldridge. "Revisiting regression adjustment in experiments with heterogeneous treatment effects." Econometric Reviews 40.5 (2021): 504-534. Or this Twitter summary: https://twitter.com/jmwooldridge/status/1457001530985492495?s=21

Angrist, Joshua. "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants." Econometrica 66.2 (1998): 249-288.

Some practical recommendations on matching:
Imbens, Guido W. "Matching methods in practice: Three examples." Journal of Human Resources 50.2 (2015): 373-419.

Why the bootstrap fails in matching:
Abadie, Alberto, and Guido W. Imbens. "On the failure of the bootstrap for matching estimators." Econometrica 76.6 (2008): 1537-1557.

Book-length treatment of causal inference. Long, very rich and detailed exposition:
Imbens, Guido W., and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Rubin, Donald B. "Assignment to treatment group on the basis of a covariate." Journal of Educational Statistics 2.1 (1977): 1-26.

PSM paper:
Rosenbaum, Paul R., and Donald B. Rubin. "The central role of the propensity score in observational studies for causal effects." Biometrika 70.1 (1983): 41-55.

Large-sample theory for PS matching:
Abadie, Alberto, and Guido W. Imbens. "Matching on the estimated propensity score." Econometrica 84.2 (2016): 781-807.

Very popular article on implementation issues in PSM:
Caliendo, Marco, and Sabine Kopeinig. "Some practical guidance for the implementation of propensity score matching." Journal of Economic Surveys 22.1 (2008): 31-72.

Pessimistic view of policy evaluations based on observational data:
LaLonde, Robert J. "Evaluating the econometric evaluations of training programs with experimental data." The American Economic Review (1986): 604-620.

Addressing the LaLonde critique:
Dehejia, Rajeev H., and Sadek Wahba. "Propensity score-matching methods for nonexperimental causal studies." Review of Economics and Statistics 84.1 (2002): 151-161.

Reply:
Smith, Jeffrey A., and Petra E. Todd. "Does matching overcome LaLonde's critique of nonexperimental estimators?" Journal of Econometrics 125.1-2 (2005): 305-353.

Rejoinder:
Dehejia, Rajeev. "Practical propensity score matching: a reply to Smith and Todd." Journal of Econometrics 125.1-2 (2005): 355-364.

IPW estimators:
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. "Efficient estimation of average treatment effects using the estimated propensity score." Econometrica 71.4 (2003): 1161-1189.
Hirano, Keisuke, and Guido W. Imbens. "Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization." Health Services and Outcomes Research Methodology 2.3 (2001): 259-278.
Bodory, H., Camponovo, L., Huber, M., and Lechner, M. "The finite sample performance of inference methods for propensity score matching and weighting estimators." Journal of Business & Economic Statistics 38.1 (2020): 183-200.