Randomization and Selection on Observables
Lukáš Lafférs
Matej Bel University, Dept. of Mathematics
MUNI Brno, 11.11.2021 - 12.11.2021

We may be fortunate enough to run a randomized experiment. This makes identification and estimation of causal effects easy. But even a proper experiment may be "broken" in many interesting ways. In many other cases an experiment is not possible, and we rely on observable characteristics making the treatment "as good as random". There are different ways to do this, with different pros and cons.

Randomization

- N individuals
- D_i ∈ {0,1} treatment indicator
- Y_i(D_i) potential outcomes
- Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0) observed outcome
- Y_i(·) is a function of the i-th treatment only and there are no interactions between units; there are no hidden versions of the treatment, everyone receives either 0 or 1
- δ_i = Y_i(1) − Y_i(0) is the individual treatment effect

Y(1) and Y(0): what are they really?

Pr(Y_i(1) = y) = Pr(Y_i = y | do(D = 1))

What if we cannot manipulate the treatment? What if manipulation does not even make sense? Is it enough if we can contemplate it? Sometimes we can manipulate the treatment; sometimes nature manipulates it for us (e.g. gender).

Causal inference is a missing data problem. You have to fix this. Somehow.

[DAGs: in observational data, U points into both D and Y; in a randomized trial the arrow from U into D is absent. The same pictures repeat with an observed X added.]

Treatment is randomized, so all the parents of D are removed: there is no way for X or U to have any influence on D. Y is a "collider" on the path between D and X, and that path is therefore blocked. Hence D ⊥⊥ X and D ⊥⊥ U.

[DAGs for the two arms D = 1 and D = 0, each with Y, U and X.]

Randomization manipulated the treatment status of these people. If randomization was successful, the two groups will not differ in terms of X.

Randomization is the benchmark

If randomization worked, we should have

E[X | D = 1] = E[X | D = 0],

and this can be checked in the data. The subjects should ideally differ only in terms of D. Apples to apples.

Economics Nobel Prize 2019: Abhijit Banerjee, Esther Duflo and Michael Kremer, for their experimental approach to alleviating global poverty. https://www.nobelprize.org/prizes/economic-sciences/2019/press-release/

We aim to have comparable units.

E[Y(1) − Y(0)] - average treatment effect (ATE)

E[Y | do(D = 1)] = E[Y(1)] = E[Y(1) | D = 1] (observed) = E[Y(1) | D = 0] (unobserved)
E[Y | do(D = 0)] = E[Y(0)] = E[Y(0) | D = 0] (observed) = E[Y(0) | D = 1] (unobserved)

where the last equality on each line holds under randomization, so that

E[Y(1)] − E[Y(0)] = E[Y(1) | D = 1] − E[Y(0) | D = 0] = E[Y | D = 1] − E[Y | D = 0].

E[Y(1) − Y(0) | D = 1] - average treatment effect on the treated (ATT)

E[Y(1) − Y(0) | D = 1] = E[Y(1) | D = 1] (observed) − E[Y(0) | D = 1] (unobserved)
                       = E[Y | D = 1] − E[Y | D = 0] under randomization.

Here only one counterfactual, E[Y(0) | D = 1], is needed.

Decomposition

E[Y | D = 1] − E[Y | D = 0] = ( E[Y(1) | D = 1] − E[Y(0) | D = 1] )   ... ATT = E[Y(1) − Y(0) | D = 1]
                            + ( E[Y(0) | D = 1] − E[Y(0) | D = 0] )   ... selection bias

Selection bias is zero under randomization.

Potential problems (not potential outcomes this time):
- randomization itself
- outcome attrition
- knowing you are in an experiment
- sample size (experiments are expensive)
- external validity
- non-scalability
- peer effects, general equilibrium effects

Duflo, Esther, Rachel Glennerster, and Michael Kremer. "Using randomization in development economics research: A toolkit." Handbook of Development Economics 4 (2007): 3895-3962.

Some further tips
- Prospective trials often lead to surprises. Some programs fail. Beware of publication bias.
- Report not only the effects we are interested in, but also mechanisms and potential side effects.
- RCTs are costly and difficult, but feasible.
- Spillover effects are real.

Kremer, Michael. "Randomized evaluations of educational programs in developing countries: Some lessons." American Economic Review 93.2 (2003): 102-106.
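To make the benchmark concrete, here is a minimal simulation sketch in R (my own illustration, not code from the lecture): with randomized D, the difference in means recovers the ATE, and the balance condition E[X | D = 1] = E[X | D = 0] holds up to sampling noise.

```r
# Minimal RCT simulation (illustrative only).
set.seed(1)
n  <- 10000
x  <- rnorm(n)               # observed characteristic
y0 <- x + rnorm(n)           # potential outcome without treatment
y1 <- y0 + 2                 # individual effect delta_i = 2 for everyone
d  <- rbinom(n, 1, 0.5)      # randomized treatment, independent of X
y  <- d * y1 + (1 - d) * y0  # observed outcome

mean(y[d == 1]) - mean(y[d == 0])  # difference in means, close to ATE = 2
mean(x[d == 1]) - mean(x[d == 0])  # balance check, close to 0
t.test(x ~ d)                      # formal balance test on X
```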
Implementation matters too. It is important to have a partner organization you can trust.

Example: Tennessee STAR experiment

- Student Teacher Achievement Ratio: do smaller classes make sense? They are expensive.
- Cost $12 million; implemented on 11,600 kindergarten children in 1985/86.
- Long, expensive, logistically difficult.
- A useful benchmark, but we might want to learn about the effects sooner.
- You can work with the data on your own: https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/10766

Example: Tennessee STAR experiment. Apples to apples? [Table 2.2.1 of Angrist and Pischke (2009).]

RCT and regression

Write Y = α + ρD + η, where α = E[Y(0)], ρ = Y(1) − Y(0) and η = Y(0) − E[Y(0)]. Then

E[Y | D = 1] = α + ρ + E[η | D = 1]
E[Y | D = 0] = α + E[η | D = 0]
E[Y | D = 1] − E[Y | D = 0] = ρ (treatment effect) + E[η | D = 1] − E[η | D = 0] (selection bias),

if we assume that ρ is non-random (homogeneous treatment effects).

RCT and regression + covariates

- Assignment was random only within schools, so add school-specific intercepts.
- Inclusion of covariates may improve the statistical precision of the estimate of ρ:

Y = α + ρD + X^T γ + η

Note: we still assume homogeneous treatment effects, and we now assume a specific linear form for how X is connected to Y; this may be thought of as an approximation. [Adjust for X in an RCT or not? See Negi and Wooldridge (2021).]

Example: Tennessee STAR experiment. [Table 2.2.2 of Angrist and Pischke (2009).]

Selection on observables

Y(0), Y(1) ⊥⊥ D | X

We rarely have the luxury of an RCT, especially in economics. Observational data may be useful for recovering causal relationships, but this often requires modelling and deep institutional knowledge. Sometimes we have something that resembles an RCT; we will discuss this later. Here we assume that the richness of X allows us to close all the backdoor paths from D to Y.

The assumption Y(0), Y(1) ⊥⊥ D | X has various labels:
- conditional independence assumption (CIA)
- unconfoundedness
- ignorability
- selection on observables

[DAG: X points into both D and Y; D points into Y.]

How realistic is this model? Well, obviously: it depends. If you have a rich set of information (many, many variables in X), it might be fine. But then the model is tricky to specify, and you also need a large data set. And within a large data set units are very different, so homogeneity rarely makes sense.

Identification is straightforward. There are, however, different statistical techniques for estimating the effects. We will cover these classes of estimation techniques:
- regression
- matching
- propensity score weighting

What do they have in common? They are all estimated from observational data. There is no randomization, and no quasi-randomization, involved.

Regression

We know a lot about the mechanics of linear regression, projections, etc. In the first part of the course we were silent about the causal interpretation; we assumed that the model was correctly specified.

Y = α + ρD + X^T γ + ε
E[Y(1) | X] = E[Y | X, D = 1] = α + ρ + X^T γ
E[Y(0) | X] = E[Y | X, D = 0] = α + X^T γ

For this simple linear model with no heterogeneity,

ATT = E[Y(1) − Y(0) | X] = ρ = E[Y(1) − Y(0)] = ATE.

We made use of E[ε | X, D] = 0.

[DAG: X points into D and Y; ε points into Y.]

Linearity?

Y = f(D, X) + ε
E[Y(1) | X, D = 1] = E[Y | X, D = 1] = f(1, X)
E[Y(0) | X, D = 1] = E[Y | X, D = 0] = f(0, X)   (by the CIA)

Define δ_X ≡ E[Y | X, D = 1] − E[Y | X, D = 0]. Then

E[Y(1) − Y(0) | X, D = 1] = f(1, X) − f(0, X) = δ_X   (by the CIA)
E[Y(1) − Y(0) | D = 1] = E[ E[Y(1) − Y(0) | X, D = 1] | D = 1 ] = ∑_x δ_x Pr(X = x | D = 1)
E[Y(1) − Y(0)] = E[ E[Y(1) − Y(0) | X] ] = ∑_x δ_x Pr(X = x)

Matching

Matching is a class of statistical techniques that takes "We aim to have comparable units." very seriously.
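A minimal sketch of the δ_X aggregation above, on simulated data (an illustration of mine with made-up numbers, not the lecture's data): compute δ_x within each cell of a discrete X, then weight by Pr(X = x) for the ATE or by Pr(X = x | D = 1) for the ATT.

```r
# Stratification estimator for a binary X (illustrative simulation).
set.seed(2)
n <- 10000
x <- rbinom(n, 1, 0.4)                    # discrete covariate
d <- rbinom(n, 1, 0.2 + 0.5 * x)          # selection on observables: D depends on X
y <- 1 + 3 * x + (2 + x) * d + rnorm(n)   # heterogeneous effect: 2 if x = 0, 3 if x = 1

delta <- sapply(0:1, function(k)
  mean(y[d == 1 & x == k]) - mean(y[d == 0 & x == k]))  # delta_x cell by cell

ate <- sum(delta * table(x) / n)               # weights Pr(X = x)
att <- sum(delta * table(x[d == 1]) / sum(d))  # weights Pr(X = x | D = 1)
c(naive = mean(y[d == 1]) - mean(y[d == 0]), ate = ate, att = att)
```

The naive difference in means mixes the treatment effect with selection bias; the two weighted averages of δ_x separate the estimands.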
Example: Matching - Titanic

- 700 out of 2,200 people on board survived.
- Did wealth affect the probability of survival?
- Women and children were given priority, but they were also more likely to be in the first class.

[DAG: D → Y, with X1 and X2 each pointing into both D and Y. D - first class, X1 - gender, X2 - age (old/young), Y - survived.]

Two back-door paths. Any unobserved confounders are ruled out.

Four categories: {young male, young female, old male, old female}.

E[Y | D = 1] − E[Y | D = 0] = 0.354
E[Y(1) − Y(0)] = ∑_x δ_x Pr(X = x) = 0.196
E[Y(1) − Y(0) | D = 1] = ∑_x δ_x Pr(X = x | D = 1) = 0.238
E[Y(1) − Y(0) | D = 0] = ∑_x δ_x Pr(X = x | D = 0) = 0.189

By stratification we lose information. As a reward, we get something that is easy to interpret and implement. If we stratify finely, we may have few observations in some groups: there is no 12-year-old boy in the first class.

ATT = ∑_{k=1}^{K} (Ȳ_{1,k} − Ȳ_{0,k}) · N_T^k / N_T
ATC = ∑_{k=1}^{K} (Ȳ_{1,k} − Ȳ_{0,k}) · N_C^k / N_C
ATE = ∑_{k=1}^{K} (Ȳ_{1,k} − Ȳ_{0,k}) · N^k / N

where K is the number of categories, Ȳ_{1,k} and Ȳ_{0,k} are the mean outcomes of the treated and the controls in category k, N_T^k, N_C^k and N^k are the numbers of treated, controls and all units within category k, and N_T, N_C and N are the corresponding overall numbers.

Example: Matching - Angrist (1998)

- Voluntary military service: how did it affect wages? The military was the largest employer; military size declined sharply in 1987.
- Compares applicants; 50% of them enlisted. Applicants are not chosen at random.
- 698,000 observations. Information in X: year of application, test score group, schooling level, year of birth.
- Heterogeneous across race: separate estimates for Whites and Non-whites.
- 8,760 cells, but only 5,654 had at least 25 observations.

[Fig. 2, Fig. 3 and part of Table 2 in Angrist (1998).]

Matching vs. Regression

These results differ. Why? Explore the simplest possible case: binary X.

Saturated model (heterogeneous effects):

Y = β_0 + β_1 X + δ_0 D(1 − X) + δ_1 DX
δ_1 = E[Y | X = 1, D = 1] − E[Y | X = 1, D = 0]
δ_0 = E[Y | X = 0, D = 1] − E[Y | X = 0, D = 0]

The implied ATT is a weighted average of the cell-level effects:

E[Y(1) − Y(0) | D = 1] = ∑_x δ_x Pr(X = x | D = 1)
  = δ_0 Pr(X = 0 | D = 1) + δ_1 Pr(X = 1 | D = 1)
  = δ_0 · Pr(D = 1 | X = 0)·Pr(X = 0) / Pr(D = 1) + δ_1 · Pr(D = 1 | X = 1)·Pr(X = 1) / Pr(D = 1)
  = δ_0 w_0^M + δ_1 w_1^M

Non-saturated model (homogeneous effects):

Y = α + ρD + γX + ε

The conditional ATT is assumed to be the same for both X = 1 and X = 0, and [see 3.3.1 in Angrist and Pischke (2009)]

ρ̂ → ∑_x δ_x [Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x))] Pr(X = x) / ∑_x [Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x))] Pr(X = x) = δ_0 w_0^R + δ_1 w_1^R

Comparison - Matching vs. Regression

w_x^M = Pr(D = 1 | X = x) · Pr(X = x) / Pr(D = 1)   ... proportional to the share of treated among X = x
w_x^R = Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x)) · Pr(X = x) / ∑_x [Pr(D = 1 | X = x)(1 − Pr(D = 1 | X = x))] Pr(X = x)   ... proportional to the variance of D given X = x

Matching puts most weight on cells with many treated units; regression puts most weight on cells where treatment status varies the most.

Different types of matching

In many interesting cases exact matches are not possible. We then need some measure of how similar different units are, and there are many ways this can be done.

Overlap

0 < P(D = 1 | X) < 1,

where P(D = 1 | X) > 0 is needed for E[Y(1) | D = 0] and P(D = 1 | X) < 1 for E[Y(0) | D = 1]. It is important to have comparable units. If we don't, we may drop the offending observations, or we may rely on extrapolation. Dropping observations means we estimate effects only on a subpopulation, so the object of interest changes. You don't want to extrapolate much but, at the same time, you want your effect to be representative enough.
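Returning to the matching-versus-regression weights above, a small numeric sketch (made-up probabilities, not estimates from any dataset) shows how the two schemes diverge.

```r
# Matching (ATT) weights vs. regression weights for binary X
# (made-up probabilities, purely illustrative).
px  <- c(0.5, 0.5)    # Pr(X = 0), Pr(X = 1)
pdx <- c(0.1, 0.8)    # Pr(D = 1 | X = x): treatment much likelier when X = 1
pd  <- sum(pdx * px)  # Pr(D = 1)

wM <- pdx * px / pd                                     # share of treated in cell x
wR <- pdx * (1 - pdx) * px / sum(pdx * (1 - pdx) * px)  # variance of D in cell x

rbind(matching = wM, regression = wR)
# matching   puts ~0.89 of the weight on X = 1, where most treated units are;
# regression puts ~0.64 there, favouring cells where D varies the most.
```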
One-to-one matching

ATT = (1/N_T) ∑_{i: D_i = 1} ( Y_i − Y_{j(i)} ),

where j(i) is a control-group unit "similar" to i in terms of X. We compare Y_i to that similar unit.

One-to-many matching

ATT = (1/N_T) ∑_{i: D_i = 1} ( Y_i − (1/M) ∑_{m=1}^{M} Y_{j_m(i)} ),

where j_m(i) is one of the M control-group units "similar" to i in terms of X. We compare Y_i to the average of the similar units.

Nearest neighbour covariate matching

How to measure how similar the units are? Plain Euclidean distance:

||X_i − X_j||² = (X_i − X_j)^T (X_i − X_j) = ∑_{n=1}^{p} (X_{ni} − X_{nj})²

Or weight by the variances:

||X_i − X_j||² = (X_i − X_j)^T V̂^{-1} (X_i − X_j) = ∑_{n=1}^{p} (X_{ni} − X_{nj})² / σ̂_n²

Or weight by the covariance matrix (Mahalanobis distance):

||X_i − X_j||² = (X_i − X_j)^T Σ̂^{-1} (X_i − X_j)

Bias

- The larger the dimension of X, the more difficult it is to find good matches.
- Data-hungry: X_i converges to X_{j(i)} only slowly.

Bias-corrected matching estimator:

ATT_BC = (1/N_T) ∑_{i: D_i = 1} [ ( Y_i − Y_{j(i)} ) − ( Ê[Y | X = X_i, D = 0] − Ê[Y | X = X_{j(i)}, D = 0] ) ],

where the second difference is the bias-correction term.

Variance?

Without replacement (each control unit is used at most once):

σ̂²_ATT = (1/N_T) ∑_{i: D_i = 1} ( Y_i − (1/M) ∑_{m=1}^{M} Y_{j_m(i)} − ATT )²

With replacement (control units may be used more than once), the same expression is augmented by a correction term involving K_i, the number of times unit i is reused as a match. (In this particular case the bootstrap fails - Abadie and Imbens (2008).)

Matching vs. Regression - practical considerations

- There are many different ways to perform matching. There are many different ways to perform regression. Researcher degrees of freedom are a problem.
- Matching is appealing because it is easy to communicate to outsiders.
- Regression is appealing as there seem to be (or are?) fewer degrees of freedom.

Angrist (1998), page 255: Angrist, Joshua. "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants." Econometrica 66.2 (1998): 249-288.

Example: LaLonde (1986)

- Very influential study. Does job training increase future wages?
- Having a randomized treatment (NSW - National Supported Work), LaLonde could compare matching estimators based on two observational datasets (CPS - Current Population Survey, and PSID - Panel Study of Income Dynamics) to the estimate from the randomized sample, which served as a benchmark.
- Results are pessimistic: estimates from the observational datasets are all over the place! E.g. $800 vs. -$8,000 vs. -$4,400.
- Well, the samples are very different.
- It is important to check how comparable the treated and controls are in our matched sample. This is called balance. The success of matching can be shown using a balance graph. An excellent implementation is in the MatchIt package in R.

[Balance plots: distributional balance for "race" and "educ", unadjusted vs. adjusted sample, by treatment group.]

Propensity score

p(x) = P(D = 1 | X = x)

We may sidestep the high dimensionality of X in a very neat way: project it onto the one quantity that matters, the probability of treatment.

Propensity score matching

This idea comes from Donald Rubin (e.g. Rubin, 1977) and Paul Rosenbaum (Rosenbaum and Rubin, 1983, over 30k citations).

Y(0), Y(1) ⊥⊥ D | X  =⇒  Y(0), Y(1) ⊥⊥ D | p(X)

Proof sketch:

Pr(D = 1 | Y(1), Y(0), p(X)) = E[D | Y(1), Y(0), p(X)]
  = E[ E[D | Y(1), Y(0), X] | Y(1), Y(0), p(X) ]
  = E[ E[D | X] | Y(1), Y(0), p(X) ]   (by the CIA)
  = E[ p(X) | Y(1), Y(0), p(X) ] = p(X),

while

Pr(D = 1 | p(X)) = E[D | p(X)] = E[ E[D | X, p(X)] | p(X) ] = E[ p(X) | p(X) ] = p(X).

The two coincide, which gives the conditional independence given p(X).
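The balancing property just derived also implies D ⊥⊥ X | p(X), which can be eyeballed in a small simulation (again an illustration, not lecture code): X is badly imbalanced overall, but nearly balanced within strata of the propensity score.

```r
# Balancing property of the propensity score (illustrative simulation).
set.seed(3)
n  <- 100000
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(x1 + x2)  # true propensity score p(X)
d  <- rbinom(n, 1, p)

strata <- cut(p, quantile(p, 0:5 / 5), include.lowest = TRUE)  # 5 PS blocks

mean(x1[d == 1]) - mean(x1[d == 0])  # raw imbalance in x1: clearly nonzero
tapply(x1, list(strata, d), mean)    # within strata, columns 0 and 1 nearly agree
```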
Propensity score matching

[DAG: X → p(X) → D, X → Y, D → Y.]

Conditioning on p(X) closes the backdoor path from D to Y through X. Also notice that D ⊥⊥ X | p(X): the remaining path between D and X runs through Y, which is a collider on that path, so it is blocked.

δ_{p(X)} = E[Y | D = 1, p(X)] − E[Y | D = 0, p(X)]
E[Y(1) − Y(0) | D = 1] = E[ δ_{p(X)} | D = 1 ]

Propensity score matching, step by step:

1. Use logit/probit to estimate the propensity scores: log[ p(X) / (1 − p(X)) ] = X^T β.
2. Sort observations according to p̂(X).
3. Stratify the sample into blocks so that mean scores are not statistically different between treated and controls.
4. Check for balance. If there is no balance within a block → split the block. If, for some variable, there is no balance in any block → revisit the model specification in Step 1.

Implemented in Stata by Becker, Sascha O., and Andrea Ichino. "Estimation of average treatment effects based on propensity scores." The Stata Journal 2.4 (2002): 358-377.

There are other ways PS matching can be implemented:
- nearest neighbour matching
- radius matching
- kernel matching - weight controls by a kernel function, so controls close to the propensity score of the treated unit get larger weight

Example: Dehejia and Wahba (2002)

Uses data from LaLonde (1986). Compares the randomized NSW data to two observational datasets: CPS and PSID.

PS matching in detail:
- With or without replacement? Smaller PS distance vs. fewer comparison units.
- How many comparison units? Smaller PS distance vs. increased precision.
- Which matching method to use? Caliper matching can use more (fewer) matches if (not) available. If overlap is good, different matching methods will lead to similar results.

National Supported Work Program: provided work experience to people with social problems. Here is a randomized sample from LaLonde (1986).

[Figs. 1-2, Figs. 5-6, Table 2 and Table 3 from Dehejia and Wahba (2002).]

Lessons to take

- When few control units are available, use sampling with replacement (you can use the same control twice).
- When enough control units are available, sampling without replacement is fine.
- Careful diagnostics aid the right choices.

So perhaps it is not as bad as LaLonde (1986) suggested?

→ Reply: Smith, Jeffrey A., and Petra E. Todd. "Does matching overcome LaLonde's critique of nonexperimental estimators?" Journal of Econometrics 125.1-2 (2005): 305-353. Results are sensitive to the covariates used in the PS estimation and to the choice of the sample. PSM "...does not represent a general solution to the evaluation problem".

→ Rejoinder: Dehejia, Rajeev. "Practical propensity score matching: a reply to Smith and Todd." Journal of Econometrics 125.1-2 (2005): 355-364. Yes, one should check the sensitivity of estimates to the PS model specification. A high-quality comparison group should not be too sensitive. With this in mind, PSM works fine, even in the different subsamples of LaLonde (1986).

Implementation issues: there are many ways PS matching can be implemented. [Fig. 1 in Caliendo, Marco, and Sabine Kopeinig. "Some practical guidance for the implementation of propensity score matching." Journal of Economic Surveys 22.1 (2008): 31-72.]
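The MatchIt package mentioned earlier wraps these implementation choices. A minimal sketch on the lalonde sample that ships with the package (argument names as in recent MatchIt versions; details may differ across versions, so treat this as a starting point rather than a recipe):

```r
# Nearest-neighbour matching on a logit propensity score with MatchIt,
# using the lalonde data bundled with the package.
library(MatchIt)
data("lalonde", package = "MatchIt")

m <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
             data     = lalonde,
             method   = "nearest",  # one-to-one nearest-neighbour PS matching
             distance = "glm")      # logit propensity score

summary(m)           # covariate balance before and after matching

md <- match.data(m)  # the matched sample
mean(md$re78[md$treat == 1]) - mean(md$re78[md$treat == 0])  # crude ATT on 1978 earnings
```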
Inverse propensity score weighting

Y(0), Y(1) ⊥⊥ D | X implies

ATE = E[Y(1)] − E[Y(0)] = E[ Y·D / p(X) ] − E[ Y·(1 − D) / (1 − p(X)) ]
ATT = E[Y(1) | D = 1] − E[Y(0) | D = 1] = ( E[Y·D] − E[ Y·(1 − D)·p(X) / (1 − p(X)) ] ) / Pr(D = 1)

Why does this work?

E[ Y·D / p(X) ] = E[ E[ Y·D / p(X) | X ] ]
  = E[ E[ Y(1)/p(X) | D = 1, X ] · Pr(D = 1 | X) ]
  = E[ E[ Y(1)/p(X) | D = 1, X ] · p(X) ]
  = E[ E[ Y(1) | D = 1, X ] ]
  = E[ E[ Y(1) | X ] ] = E[Y(1)]   (the last step uses the CIA),

and similarly for the other quantities.

Estimation: first estimate p̂, then

ATE = (1/N) ∑_i Y_i D_i / p̂(X_i) − (1/N) ∑_i Y_i (1 − D_i) / (1 − p̂(X_i))
ATT = (1/N_T) [ ∑_i Y_i D_i − ∑_i Y_i (1 − D_i) · p̂(X_i) / (1 − p̂(X_i)) ]

Normalized versions (more stable):

ATE = [ ∑_i Y_i D_i / p̂(X_i) ] / [ ∑_i D_i / p̂(X_i) ] − [ ∑_i Y_i (1 − D_i) / (1 − p̂(X_i)) ] / [ ∑_i (1 − D_i) / (1 − p̂(X_i)) ]
ATT = [ ∑_i Y_i D_i ] / [ ∑_i D_i ] − [ ∑_i Y_i (1 − D_i) p̂(X_i) / (1 − p̂(X_i)) ] / [ ∑_i (1 − D_i) p̂(X_i) / (1 − p̂(X_i)) ]

Weighting: Hirano and Imbens (2001). Performance under different constructions of standard errors: Bodory, Camponovo, Huber, and Lechner (2020). R package treatweight by Bodory and Huber (2021).

- Sensitive to the specification of p(·).
- May require trimming.
- Does not rely on stratification or matching (fewer degrees of freedom?).
- Standard errors need to take into account that the propensity scores are only estimated (Hirano, Imbens and Ridder, 2003).
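The estimators above in a few lines of R (a simulation sketch of mine with crude trimming, as cautioned above; not the treatweight package):

```r
# Plain and normalized IPW estimators on simulated data with selection on X.
set.seed(4)
n <- 10000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))                 # selection on the observable X
y <- x + 2 * d + rnorm(n)                    # constant effect: ATE = ATT = 2

ps <- fitted(glm(d ~ x, family = binomial))  # estimated propensity score
ps <- pmin(pmax(ps, 0.01), 0.99)             # crude trimming of extreme scores

ate      <- mean(y * d / ps) - mean(y * (1 - d) / (1 - ps))
ate_norm <- sum(y * d / ps) / sum(d / ps) -
            sum(y * (1 - d) / (1 - ps)) / sum((1 - d) / (1 - ps))
att_norm <- sum(y * d) / sum(d) -
            sum(y * (1 - d) * ps / (1 - ps)) / sum((1 - d) * ps / (1 - ps))
c(ate = ate, ate_normalized = ate_norm, att_normalized = att_norm)
```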
Wrap-up

There are different ways to estimate the quantity of interest (e.g. ATE, ATT) if our observables are informative enough to explain the selection bias: regression, matching, IPW. They all have pros and cons. It is the selection-on-observables assumption that drives the identification. Without it, any of these estimators is dubious at best.

Thank you for your attention!

References

Tips and tricks on the implementation of randomization:
Duflo, Esther, Rachel Glennerster, and Michael Kremer. "Using randomization in development economics research: A toolkit." Handbook of Development Economics 4 (2007): 3895-3962.

This book is a classic. Somewhat opinionated. By the pioneers of the field:
Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly Harmless Econometrics. Princeton University Press, 2009.

Very readable and engaging book, highly recommended:
Cunningham, Scott. Causal Inference: The Mixtape. Yale University Press, 2021.

Adjusting for X in an RCT?
Negi, Akanksha, and Jeffrey M. Wooldridge. "Revisiting regression adjustment in experiments with heterogeneous treatment effects." Econometric Reviews 40.5 (2021): 504-534. Or this Twitter summary: https://twitter.com/jmwooldridge/status/1457001530985492495?s=21

Angrist, Joshua. "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants." Econometrica 66.2 (1998): 249-288.

Some practical recommendations on matching:
Imbens, Guido W. "Matching methods in practice: Three examples." Journal of Human Resources 50.2 (2015): 373-419.

Why the bootstrap fails in matching:
Abadie, Alberto, and Guido W. Imbens. "On the failure of the bootstrap for matching estimators." Econometrica 76.6 (2008): 1537-1557.

Book-length treatment of causal inference. Long, very rich and detailed exposition:
Imbens, Guido W., and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Rubin, Donald B. "Assignment to treatment group on the basis of a covariate." Journal of Educational Statistics 2.1 (1977): 1-26.

PSM paper:
Rosenbaum, Paul R., and Donald B. Rubin. "The central role of the propensity score in observational studies for causal effects." Biometrika 70.1 (1983): 41-55.

Large-sample theory for PS matching:
Abadie, Alberto, and Guido W. Imbens. "Matching on the estimated propensity score." Econometrica 84.2 (2016): 781-807.

Very popular article on implementation issues in PSM:
Caliendo, Marco, and Sabine Kopeinig. "Some practical guidance for the implementation of propensity score matching." Journal of Economic Surveys 22.1 (2008): 31-72.

Pessimistic view of policy evaluations based on observational data:
LaLonde, Robert J. "Evaluating the econometric evaluations of training programs with experimental data." The American Economic Review (1986): 604-620.

Addressing the LaLonde critique:
Dehejia, Rajeev H., and Sadek Wahba. "Propensity score-matching methods for nonexperimental causal studies." Review of Economics and Statistics 84.1 (2002): 151-161.

Reply:
Smith, Jeffrey A., and Petra E. Todd. "Does matching overcome LaLonde's critique of nonexperimental estimators?" Journal of Econometrics 125.1-2 (2005): 305-353.

Rejoinder:
Dehejia, Rajeev. "Practical propensity score matching: a reply to Smith and Todd." Journal of Econometrics 125.1-2 (2005): 355-364.

IPW estimators:
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. "Efficient estimation of average treatment effects using the estimated propensity score." Econometrica 71.4 (2003): 1161-1189.
Hirano, Keisuke, and Guido W. Imbens. "Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization." Health Services and Outcomes Research Methodology 2.3 (2001): 259-278.
Bodory, H., Camponovo, L., Huber, M., and Lechner, M. "The finite sample performance of inference methods for propensity score matching and weighting estimators." Journal of Business & Economic Statistics 38.1 (2020): 183-200.