Difference in Differences

Lukáš Lafférs
Matej Bel University, Dept. of Mathematics
MUNI Brno, 2.12.2021, 6.1.2022

- One of the current leading research designs for estimating causal effects.
- It is based on the assumption that, absent the treatment, outcomes of treated and untreated units would have evolved in the same (or a similar) way over time.
- Any time-constant unobservables are taken care of.
- It is very popular: 26% of the most cited papers published in 2015-2019 used DiD.

This lecture
- Examples
- 2x2 setup
  - Identification
  - Regression formulation + covariates
- Different complications
- DiD with covariates (without linearity)
- Two-way fixed effects model (TWFE)
- (*) Recent developments (problems with TWFE)

John Snow - Cholera (1854)
- The first careful analysis of this type was done by the epidemiologist John Snow in the 19th century in Soho, London.
- At the time of the cholera outbreak, cholera was believed to spread via miasma (via "air").
- Snow challenged this view with his careful analysis.
- Snow compared the evolution of cholera-related deaths in 2 groups of (otherwise similar) houses, where one group had its water supply changed for a cleaner one.
Source: https://www.rcseng.ac.uk/library-and-publications/library/blog/mapping-disease-john-snow-and-cholera/

Cholera deaths, differences over time first

Water company                       1849   1854   Difference
Lambeth (treated)                     85     19          -66
Southwark and Vauxhall (control)     135    147           12
Difference in differences: (-66) - 12 = -78

$(Y^{L}_{1854} - Y^{L}_{1849}) - (Y^{SV}_{1854} - Y^{SV}_{1849}) = (-66) - 12 = -78$

Cholera deaths, differences across companies first

Water company                       1849   1854
Lambeth (treated)                     85     19
Southwark and Vauxhall (control)     135    147
Difference                           -50   -128
Difference in differences: -128 - (-50) = -78

$(Y^{L}_{1854} - Y^{SV}_{1854}) - (Y^{L}_{1849} - Y^{SV}_{1849}) = -128 - (-50) = -78$

Example: Minimum wage and employment
- What is the impact of minimum wages on employment?
- From February '92 to November '92:
  - Pennsylvania (control): $4.25 → $4.25
  - New Jersey (treated): $4.25 → $5.05
- They look at the subpopulation where the minimum wage mattered: they surveyed 400 fast-food restaurants.
- The outcome variable was the average number of employees per store.

Card and Krueger (1994)
Was the minimum wage binding?
Source: Figure 2 in Card and Krueger (1994).

Card and Krueger (1994)
Average employment per store

State                     February   November   Difference
Pennsylvania (control)        23.3      21.14        -2.16
New Jersey (treated)         20.44       21.0         0.56
Difference                   -2.86      -0.14
Difference in differences: -0.14 - (-2.86) = 2.72, equivalently 0.56 - (-2.16) = 2.72

$(E[Y_{Nov} \mid NJ] - E[Y_{Nov} \mid PA]) - (E[Y_{Feb} \mid NJ] - E[Y_{Feb} \mid PA]) = -0.14 - (-2.86) = 2.72$
$(E[Y_1 \mid D=1] - E[Y_1 \mid D=0]) - (E[Y_0 \mid D=1] - E[Y_0 \mid D=0]) = -0.14 - (-2.86) = 2.72$

Again: how comparable are the units? Work hard to convince your reader that it is the treatment that matters. Apples to apples.

Basic 2x2 case
Causal graphical model
[Figure: causal graph with nodes Time, D and Y]
Outcomes are changing in time and this is unrelated to the treatment.

Identification
What we have seen before:
- Under $(Y(0), Y(1)) \perp\!\!\!\perp D$, we have $ATE = E[Y(1) - Y(0)] = E[Y \mid D=1] - E[Y \mid D=0]$.
- Under $Y(0) \perp\!\!\!\perp D$, we have $ATT = E[Y(1) - Y(0) \mid D=1] = E[Y \mid D=1] - E[Y \mid D=0]$.
Here we have introduced time, so we have potential outcomes $Y_t(1), Y_t(0)$ and the observed outcome $Y_t$, with $Y_0(d) = Y_{before}(d)$ and $Y_1(d) = Y_{after}(d)$.
This is the object of interest:
$ATT = E[Y_1(1) - Y_1(0) \mid D=1] = E[Y_1(1) \mid D=1] - \underbrace{E[Y_1(0) \mid D=1]}_{\text{unobserved}}$

Identification
How do we identify the ATT?
- Assumption 1: Consistency. $\forall t: D = d \Rightarrow Y_t = Y_t(d)$
- Assumption 2: Parallel trends. $E[Y_1(0) - Y_0(0) \mid D=1] = E[Y_1(0) - Y_0(0) \mid D=0]$ (weaker than $(Y_1(0) - Y_0(0)) \perp\!\!\!\perp D$)
- Assumption 3: No pre-treatment effect. $E[Y_0(1) \mid D=1] - E[Y_0(0) \mid D=1] = 0$
- Assumption 4: SUTVA (often not stated explicitly). No interactions between individuals and no hidden versions of the treatment (no hidden variability, everyone receives the same treatment).

Identification
How do we identify the ATT?
$ATT = E[Y_1(1) - Y_1(0) \mid D=1]$   (definition)
$= E[Y_1(1) \mid D=1] - E[Y_1(0) \mid D=1]$   (linearity of $E(\cdot)$)
$= E[Y_1 \mid D=1] - E[Y_1(0) \mid D=1]$   (consistency)
$= E[Y_1 \mid D=1] - \left( E[Y_0(0) \mid D=1] + E[Y_1(0) \mid D=0] - E[Y_0(0) \mid D=0] \right)$   (parallel trends)
$= E[Y_1 \mid D=1] - \left( E[Y_0(0) \mid D=1] + E[Y_1 \mid D=0] - E[Y_0 \mid D=0] \right)$   (consistency)
$= E[Y_1 \mid D=1] - \left( E[Y_0(1) \mid D=1] + E[Y_1 \mid D=0] - E[Y_0 \mid D=0] \right)$   (no pre-treatment effect)
$= E[Y_1 \mid D=1] - \left( E[Y_0 \mid D=1] + E[Y_1 \mid D=0] - E[Y_0 \mid D=0] \right)$   (consistency)
$= \left( E[Y_1 \mid D=1] - E[Y_0 \mid D=1] \right) - \left( E[Y_1 \mid D=0] - E[Y_0 \mid D=0] \right)$   (observed quantities only)

Regression formulation
- Treatment assignment: $D \in \{0, 1\}$
- Time (pre/post, before/after): $T \in \{0, 1\}$
$Y = \beta_0 + \beta_1 D + \beta_2 T + \beta_3 D \cdot T + \varepsilon$
This is a saturated model:
$\beta_0 = E[Y_0 \mid D=0]$
$\beta_1 = E[Y_0 \mid D=1] - E[Y_0 \mid D=0]$
$\beta_2 = E[Y_1 \mid D=0] - E[Y_0 \mid D=0]$
$\beta_3 = (E[Y_1 \mid D=1] - E[Y_1 \mid D=0]) - (E[Y_0 \mid D=1] - E[Y_0 \mid D=0])$

Complications
- Parallel trends may only hold conditional on X:
  $E[Y_1(0) - Y_0(0) \mid X, D=1] = E[Y_1(0) - Y_0(0) \mid X, D=0]$
- The parallel trends assumption is NOT scale invariant:
  $E[Y_1(0) - Y_0(0) \mid D=1] = E[Y_1(0) - Y_0(0) \mid D=0] \;\not\Rightarrow\; E[\log Y_1(0) - \log Y_0(0) \mid D=1] = E[\log Y_1(0) - \log Y_0(0) \mid D=0]$
  (unless D is randomly assigned: Roth and Sant'Anna (2020))
- Effects may be heterogeneous.
- Units may be treated at different times.

Differential timing
$Y_{it} = \delta D_{it} + \gamma X_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
Differential timing with state-level (or any group-level) treatments:
$Y_{ist} = \delta D_{st} + \gamma X_{ist} + \alpha_{s\cdot} + \alpha_{\cdot t} + \varepsilon_{ist}$
Aggregated version (this will lead to the same estimate of δ but with higher standard errors):
$Y_{st} = \delta D_{st} + \gamma X_{st} + \alpha_{s\cdot} + \alpha_{\cdot t} + \varepsilon_{st}$
- $D_{it} = 1$ if unit i is treated at time t
- $D_{st} = 1$ if state s is treated at time t
- $\alpha_{i\cdot}$: constant for unit i
- $\alpha_{s\cdot}$: constant for state s
- $\alpha_{\cdot t}$: constant for time t
- $X_{it}, X_{ist}$: covariates (beware of colliders!)

Statistical inference?
Estimate $\hat\delta$ via OLS. BUT:
- Outcomes are likely serially correlated within states (groups), and thus standard errors may be too optimistic (too small).
- Panels are long.
- There is often very little variation in $D_{st}$.
- Simulations in Bertrand et al. (2004) show that you can reject a true null in 45% of cases (instead of 5%)!
How to fix this?
- Block bootstrap.
  (Sample states with replacement.)
- Ignore the time dimension altogether. (We are back in a 2x2 table.)
- Cluster standard errors (at the level of groups or individuals): this allows arbitrary correlation between outcomes within a given state (or individual) over time.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How much should we trust differences-in-differences estimates?" The Quarterly Journal of Economics 119.1 (2004): 249-275.

Pre-treatment trends? Event study
$Y_{it} = \underbrace{\sum_{\tau=-q}^{-2} \delta_\tau D^{\tau}_{it}}_{\text{leads}} + \underbrace{\sum_{\tau=0}^{m} \delta_\tau D^{\tau}_{it}}_{\text{lags}} + \gamma X_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
$D^{\tau}_{it}$ is an indicator for unit i being τ periods away from the initial treatment at time t.
If state i adopted a new policy in t = 2000, then $D^{-1}_{i,1999} = D^{0}_{i,2000} = D^{1}_{i,2001} = \dots = 1$ and, e.g., $D^{-2}_{i,1999} = D^{0}_{i,1999} = D^{1}_{i,1999} = D^{2}_{i,1999} = 0$.

Pre-treatment trends? Event study
[Figure: event-study plot for the specification above]

Pre-treatment trends? Event study
[Figure: event-study plot]
(The previous figure was too beautiful; normally it looks more like this one.)

Placebo tests
There is a lot of room for creativity:
- choose workers unaffected by the minimum wage
- change the treatment date to a fake one
- choose a fake treatment group
- change the outcome to one that should plausibly be unaffected
- look at different subgroups
- use your domain knowledge

Empirical Application - Cheng and Hoekstra (2013)
- Did the gun reform have an impact on violence?
- Different states adopted the law at different times.
- Cheng and Hoekstra provide evidence that it is not associated with other types of crime (e.g. car theft).
- The new law was associated with an 8-10% increase in homicides.
- Using 20 different placebo dates, the average estimate is essentially zero.
Source: Chapter 9.6.6 in https://mixtape.scunning.com/difference-in-differences.html

DiD with covariates based on IPW
- Parallel trends conditional on X: $E[Y_1(0) - Y_0(0) \mid X, D=1] = E[Y_1(0) - Y_0(0) \mid X, D=0]$
- No effect of D on X: $X(1) = X(0) = X$
- No pre-treatment effect: $E[Y_0(1) \mid D=1] - E[Y_0(0) \mid D=1] = 0$
- Common support: $P(D=1, T=1 \mid X, (D,T) \in \{(d,t),(1,1)\}) < 1$ for all $(d,t) \in \{(1,0),(0,1),(0,0)\}$
$ATT = E\left[ Y \cdot \left( \frac{D \cdot T}{\Pi} - \frac{D \cdot (1-T) \cdot \rho_{1,1}(X)}{\rho_{1,0}(X) \cdot \Pi} - \frac{(1-D) \cdot T \cdot \rho_{1,1}(X)}{\rho_{0,1}(X) \cdot \Pi} + \frac{(1-D) \cdot (1-T) \cdot \rho_{1,1}(X)}{\rho_{0,0}(X) \cdot \Pi} \right) \right]$
where $\Pi = P(D=1, T=1)$ and $\rho_{d,t}(X) = P(D=d, T=t \mid X)$.
Lechner, Michael. "The Estimation of Causal Effects by Difference-in-Difference Methods." Foundations and Trends (R) in Econometrics 4.3 (2011): 165-224.

Two-way fixed effects model (TWFE)
$Y_{it} = \delta D_{it} + \gamma X_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
- It looks reasonable: we extend the basic 2x2 setup to multiple time periods, covariates and differential timing.
- Units can be treated in different time periods.
- We even plugged in dummies for greater flexibility (but hey, more is better, right?).
- But, after all, what is this δ?

Goodman-Bacon (2021) decomposition
- We estimate $Y_{it} = \delta D_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$ to get $\hat\delta$.
- Staggered rollout setup. Once treated, then treated forever: $D_{it} = 1 \Rightarrow D_{i,t+1} = 1$.
- Goodman-Bacon (2021) shows that this $\hat\delta$ is a weighted average of different $\hat\delta^{2x2}$.
- These are based on different 2x2 comparisons, just like in Card and Krueger (1994).
- This is great, because we understand what $\hat\delta^{2x2}$ from the canonical 2x2 setup means!
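As a reminder of what a single canonical 2x2 building block is, the following minimal sketch recomputes the Card and Krueger (1994) estimate from the four cell means quoted in the table earlier in these slides, once directly and once as the interaction coefficient of the saturated regression. The variable names and the use of NumPy are illustrative assumptions, not part of the original lecture.

```python
import numpy as np

# Cell means copied from the Card and Krueger (1994) table in these slides
# (average employment per store). Keys are (D, T): D=1 for New Jersey
# (treated), T=1 for November (post period).
cell_means = {
    (0, 0): 23.3,   # Pennsylvania, February
    (0, 1): 21.14,  # Pennsylvania, November
    (1, 0): 20.44,  # New Jersey, February
    (1, 1): 21.0,   # New Jersey, November
}

# Difference in differences computed directly from the four means.
did_direct = (cell_means[1, 1] - cell_means[1, 0]) - (
    cell_means[0, 1] - cell_means[0, 0]
)

# Saturated regression Y = b0 + b1*D + b2*T + b3*D*T fitted to the four cells.
# Four parameters and four cells, so the fit is exact and b3 equals the DiD.
cells = sorted(cell_means)
X = np.array([[1.0, d, t, d * t] for d, t in cells])
y = np.array([cell_means[c] for c in cells])
b0, b1, b2, b3 = np.linalg.solve(X, y)

print(f"direct DiD = {did_direct:.2f}, regression b3 = {b3:.2f}")
```

Both routes give 2.72. Fitting on cell means only reproduces the point estimate; standard errors would require the store-level data, which are not reproduced here.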
- There are 3 groups: k - early adopters, l - late adopters, U - untreated.
$\hat\delta = w_{kU} \hat\delta^{2x2}_{kU} + w_{lU} \hat\delta^{2x2}_{lU} + w_{kl} \hat\delta^{2x2}_{kl} + w_{lk} \hat\delta^{2x2}_{lk}$
- Weights depend on: (i) how large the groups are, (ii) how much variation there is in the treatments.
- Just like in OLS, large weights are given to groups with higher variation.
- This result is about estimators, not estimands.
- Adding or removing time periods changes the weights.

Diagnostics
A similar decomposition can be done if you have many different groups.
Source: Cunningham's reconstruction of Cheng and Hoekstra (2013)

Diagnostics (different paper)
Additional control group here (circles).
Source: Goodman-Bacon (2021)

What is TWFE really?
Consider a situation in which the treatment changes the trend line by δ in every period (as opposed to only once).
Static specification (a single δ):
$Y_{it} = \delta \sum_{\tau=0}^{m} D^{\tau}_{it} + \gamma X_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
$D^{\tau}_{it}$ is an indicator for unit i being τ periods away from the initial treatment at time t.
If state i adopted a new policy in t = 2000, then $D^{-1}_{i,1999} = D^{0}_{i,2000} = D^{1}_{i,2001} = \dots = 1$.

What is TWFE really?
Dynamic specification (multiple $\delta_\tau$):
$Y_{it} = \sum_{\tau=-q}^{-2} \delta_\tau D^{\tau}_{it} + \sum_{\tau=0}^{m} \delta_\tau D^{\tau}_{it} + \gamma X_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
Yes, we run some regressions. But what do we actually get? How do we interpret these $\hat\delta$ or $\hat\delta_\tau$?

Sun and Abraham (2021)
Consider e.g.
$Y_{it} = \sum_{\tau=-q}^{-2} \delta_\tau D^{\tau}_{it} + \sum_{\tau=0}^{m} \delta_\tau D^{\tau}_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
- Common practice is to use the leads to test for differences in pre-trends.
- But these coefficients are contaminated by both the pre-trends and treatment effect heterogeneity.
- They propose a way to examine how much of a problem this is.
- They also propose an estimator that uses the never-treated as a comparison group.

Callaway and Sant'Anna (2021)
- Staggered treatment adoption setup: $D_{it} = 1 \Rightarrow D_{i,t+1} = 1$.
- Decompose everything into "lego" pieces:
$ATT(g,t) = E[Y_t(g) - Y_t(0) \mid G_g = 1]$
the ATT at time t for the group first treated at time g
($G_g = 1$).
They make:
- a limited treatment anticipation assumption
- different conditional parallel trends assumptions:
  - comparing to never-treated individuals
  - comparing to not-yet-treated individuals
Source: https://pedrohcgs.github.io/files/Callaway SantAnna 2020 slides.pdf

Estimate via OLS?
$Y_{ist} = \sum_{\tau=-q}^{-2} \lambda_\tau D_{s\tau} + \sum_{\tau=0}^{m} \delta_\tau D_{s\tau} + \gamma X_{ist} + \alpha_{s\cdot} + \alpha_{\cdot t} + \varepsilon_{ist}$

E.g. based on comparing to never-treated individuals (denoted as C = 1), they get:
$ATT(g,t) = E\left[ \left( \frac{G_g}{E[G_g]} - \frac{\frac{p_g(X) C}{1 - p_g(X)}}{E\left[ \frac{p_g(X) C}{1 - p_g(X)} \right]} \right) (Y_t - Y_{g-1}) \right]$
where $p_g(X)$ is a propensity score.
Comparing to never-treated individuals:
- The never-treated are re-weighted to match those treated at time g (IPW style).
- They have a doubly robust version of this expression.
- Different ATT(g,t) are weighted to form different parameters of interest.
- did and DRDID packages.

[Figure: event-study estimates] Much nicer with their method.

de Chaisemartin and d'Haultfoeuille (2020)
Consider the following object of interest:
$ATT(g,t) = E[Y_t(g) - Y_t(0) \mid G_g = 1]$
Let δ be the TWFE estimand from the regression
$Y_{it} = \delta D_{it} + \alpha_{i\cdot} + \alpha_{\cdot t} + \varepsilon_{it}$
Then
$\delta = E\left[ \sum_{(i,t): D_{it}=1} \frac{1}{N_1} w_{it} \cdot ATT(g,t) \right]$
- But the weights $w_{it}$ can be negative(!)
- So, in general, $\delta \neq ATT$.
- What is δ then? It depends on the assumptions you impose (have a look at de Chaisemartin and d'Haultfoeuille (2020)).

Two very recent reviews!
The status quo has changed. New papers are emerging very rapidly.
- de Chaisemartin and D'Haultfœuille - Two-Way Fixed Effects and Differences-in-Differences with Heterogeneous Treatment Effects: A Survey (Dec 15 2021)
- Roth, Sant'Anna, Bilinski and Poe - What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature (Jan 3 2022)

Concluding remarks
- The stream of new papers shows a rather depressing set of results.
- Note that this is relevant only if there is differential treatment timing.
- TWFE is not what we would like it to be, and all these papers show various degrees of hopelessness.
- But they also provide alternative estimators and implementations in R/Stata.

Concluding remarks
What are the important questions we should ask?
- Who do we compare with whom?
- What is the object of interest?
- What kind of parallel trends assumptions will we impose?

Thank you for your attention!

References
- The chapter on DiD in Cunningham's book is long, but fun nevertheless. I found the notation somewhat inconsistent.
  Cunningham, Scott. Causal Inference. Yale University Press, 2021. Free here: https://mixtape.scunning.com/difference-in-differences.html
- Introductory video on 2x2 DiD identification etc.: Brady Neal, Causal Inference course, https://www.youtube.com/watch?v=2nDgrNP7XSE
- Chapter 18 in Bruce Hansen's Econometrics book is a good start.
- Inference problems with DiD:
  Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How much should we trust differences-in-differences estimates?" The Quarterly Journal of Economics 119.1 (2004): 249-275.
- Parallel trends and functional forms:
  Roth, Jonathan, and Pedro H. C. Sant'Anna. "When Is Parallel Trends Sensitive to Functional Form?" arXiv preprint arXiv:2010.04814 (2020).
- DiD with covariates based on IPW:
  Lechner, Michael. "The Estimation of Causal Effects by Difference-in-Difference Methods." Foundations and Trends (R) in Econometrics 4.3 (2011): 165-224.
- Cheng, Cheng, and Mark Hoekstra. "Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Expansions to Castle Doctrine." Journal of Human Resources 48.3 (2013): 821-54.

Recent advances:
- Taylor Wright's DiD reading group: https://taylorjwright.github.io/did-reading-group/
  This is the best source. Videos of presentations by the authors of some of the most important recent contributions in the DiD literature.
- Goodman-Bacon, Andrew. "Difference-in-differences with variation in treatment timing." Journal of Econometrics (2021).
- Sun, Liyang, and Sarah Abraham. "Estimating dynamic treatment effects in event studies with heterogeneous treatment effects." Journal of Econometrics 225.2 (2021): 175-199.
- Callaway, Brantly, and Pedro H. C. Sant'Anna. "Difference-in-differences with multiple time periods." Journal of Econometrics 225.2 (2021): 200-230.
- de Chaisemartin, Clément, and Xavier D'Haultfœuille. "Two-way fixed effects estimators with heterogeneous treatment effects." American Economic Review 110.9 (2020): 2964-96.
- de Chaisemartin, Clément, and Xavier D'Haultfœuille. "Two-Way Fixed Effects and Differences-in-Differences with Heterogeneous Treatment Effects: A Survey." Available at SSRN (2021).
- Roth, Jonathan, Pedro H. C. Sant'Anna, Alyssa Bilinski, and John Poe. "What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature." (2022).
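
Appendix: a toy illustration of the TWFE problem
As a numerical companion to the TWFE slides, here is a minimal sketch of how the TWFE coefficient can have the wrong sign under differential timing and dynamic effects. Everything in it is a made-up assumption for illustration: two units, three periods, no noise, no never-treated group, untreated outcomes equal to zero (so parallel trends holds trivially), and an effect of 1 on impact that grows to 5 one period later. It is not taken from any of the cited papers.

```python
import numpy as np

# Hypothetical toy panel: unit 0 adopts at t=1 (early), unit 1 at t=2 (late).
adoption = {0: 1, 1: 2}
periods = [0, 1, 2]
# Assumed treatment effect by time since adoption (all effects are positive).
effect_by_event_time = {0: 1.0, 1: 5.0}

rows = []
for unit, g in adoption.items():
    for t in periods:
        d = 1.0 if t >= g else 0.0
        effect = effect_by_event_time[t - g] if d else 0.0
        # Untreated outcome is 0 for every unit and period, so Y = effect.
        rows.append((unit, t, d, effect))

unit_ids = np.array([r[0] for r in rows])
time_ids = np.array([r[1] for r in rows])
D = np.array([r[2] for r in rows])
Y = np.array([r[3] for r in rows])

# TWFE design: treatment dummy, unit dummies, time dummies (first period dropped).
X = np.column_stack(
    [D]
    + [(unit_ids == u).astype(float) for u in adoption]
    + [(time_ids == t).astype(float) for t in periods[1:]]
)
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
delta_twfe = coef[0]

# Simple average of the true effects over all treated unit-period cells.
att_treated_cells = Y[D == 1].mean()

print(f"TWFE delta = {delta_twfe:.3f}, average effect on treated cells = {att_treated_cells:.3f}")
```

In this toy panel the TWFE coefficient is -1 although every treated cell has a positive effect (their average is 7/3). The reason is that the already-treated early adopter serves as a control for the late adopter, so its period-2 cell enters with a negative weight.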