Faculty of Economics and Administration

Essays in econometrics of model uncertainty

Habilitation Thesis

Lukáš Lafférs

Brno 2023

Annotation

Author: Lukáš Lafférs
Title: Essays in econometrics of model uncertainty
Year: 2023

This habilitation thesis deals with the econometrics of model uncertainty, and it consists of four academic papers on this topic. The introductory chapter motivates two different research approaches to dealing with model uncertainty, where the four papers in this collection make a contribution. Each research approach is presented in a separate chapter, starting with an introduction that explains the basic setup, then summarizing the papers' main results and contributions to the literature and concluding with a discussion of limitations and suggestions for future research.

Acknowledgements

I wish to thank my mentors Katarína Pavlíčková, Radoslav Harman, Marian Grendár, Gernot Doppelhofer and Alexei Onatski, who supported me throughout different stages of my studies and career. I am grateful for my fantastic co-authors and curious students, who have helped me learn and expand my knowledge in many ways. I also thank my family and friends for supporting me throughout.

Contents

Introduction
1 Model uncertainty and machine learning
  1.1 Causal mediation analysis with double machine learning (Farbmacher, Huber, Lafférs, Langen, and Spindler 2022)
    1.1.1 Setup
    1.1.2 Main results
    1.1.3 Empirical application
    1.1.4 Conclusions
  1.2 Evaluating (weighted) dynamic treatment effects by double machine learning (Bodory, Huber, and Lafférs 2022)
    1.2.1 Setup
    1.2.2 Main results
    1.2.3 Empirical application
    1.2.4 Conclusions
  1.3 Limitations and future research avenues
2 Model uncertainty and incomplete models
  2.1 Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition (Huber and Lafférs 2022)
    2.1.1 Setup
    2.1.2 Main results
    2.1.3 Empirical application
    2.1.4 Conclusions
  2.2 Sensitivity of the bounds on the ATE in the presence of sample selection (Lafférs and Nedela 2017)
    2.2.1 Setup and main results
    2.2.2 Empirical application
    2.2.3 Conclusions
  2.3 Limitations and future research avenues
3 Authorship contribution statements
Bibliography
A Essays
  A.1 Causal mediation analysis with double machine learning (Farbmacher, Huber, Lafférs, Langen, and Spindler 2022)
  A.2 Evaluating (weighted) dynamic treatment effects by double machine learning (Bodory, Huber, and Lafférs 2022)
  A.3 Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition (Huber and Lafférs 2022)
  A.4 Sensitivity of the bounds on the ATE in the presence of sample selection (Lafférs and Nedela 2017)

Introduction

Econometric models are tools that help to merge assumptions and data into conclusions. Economic theory, institutional rules or expert knowledge - all these pieces of information help to construct an econometric model, which consists of a set of restrictions on observable variables. These restrictions take the form of mathematical statements and can be combined with various statistical techniques in order to analyze data. Unlike in some natural sciences, such as physics, where models describe the behaviour of inanimate objects, economic models deal with people, whose behaviour is much less predictable; consequently, there is rarely a strong consensus on how an econometric model should look. This is why model uncertainty has always been at the forefront of econometric research.

There are different strategies for coping with model uncertainty. This collection of essays contributes to two particular streams of literature: how to deal with model uncertainty in a high-dimensional setup by using machine learning tools within the treatment effects literature, and how to conduct sensitivity analysis of identifying assumptions in a systematic way using incomplete models. While the contributions of the papers summarized here are mostly methodological and theoretical, they can be applied to a wide range of economic applications, especially in labor economics, health economics and policy evaluation. Importantly, each paper in this collection showcases the usefulness of the proposed method through an empirically relevant application.

Model uncertainty and machine learning

The first line of thought attempts to choose the best model among a set of (potentially many) competing alternatives. This approach has given rise to the model selection literature, which explicitly balances model complexity and fit to the data: among several models that fit the data similarly well, the less complex model is preferred. Examples include the Akaike information criterion or the Bayesian information criterion (see e.g. Claeskens, Hjort, et al. (2008) for an overview). It has been recognized that if the same dataset is used to simultaneously choose the model and estimate its parameters, the resulting model may be overly optimistic (in terms of fit) and its properties may be poor (Leeb and Pötscher (2005)). In recent years, more and more data have become available, which makes the number of possible models very large.
This gave rise to high-dimensional methods capable of handling situations where the number of variables may be larger than the sample size (see e.g. Bühlmann and Van De Geer (2011) for an overview). Recent advances in this stream of literature have adopted high-dimensional methods to improve the estimation of a low-dimensional object of interest, such as the average treatment effect of a policy change (Chernozhukov et al. (2018)).

The first two essays in this collection contribute to this stream of literature. Their contribution is that they extend the existing results of Chernozhukov et al. (2018) to two new frameworks, namely mediation analysis and dynamic treatment effects. These two frameworks were not considered in the original Chernozhukov et al. (2018), and the two papers presented here therefore constitute a distinct value added to the literature. Mediation analysis and dynamic treatment effects cover a wide range of useful economic applications, and thanks to these papers, high-dimensional data can be utilized, which further broadens their empirical appeal.

Farbmacher, Huber, Lafférs, Langen, and Spindler (2022)¹ studies the problem of causal mediation analysis in a high-dimensional setup. The paper makes use of the Double Machine Learning (DML) framework of Chernozhukov et al. (2018) to guide the choice of observed confounders, and therefore the choice of the model, in a data-driven way. It proves that the estimators of direct and indirect (mediated) effects based on DML possess good statistical properties, which is also supported via a simulation study. An empirical illustration based on National Longitudinal Survey of Youth data suggests that the effect of health insurance coverage on general health does not appear to be mediated via more frequent routine checkups.

Bodory, Huber, and Lafférs (2022)² considers the estimation of dynamic treatment effects in a situation with a high-dimensional set of covariates. Instead of the effect of a single treatment, it considers the effects of a sequence of treatments, where past treatments may influence the receipt of future treatments. The paper justifies the use of the DML framework of Chernozhukov et al. (2018) and therefore chooses the set of relevant confounders in a data-driven way using machine learning algorithms. Dynamic treatment effects for specific subgroups are also considered. The performance of the estimators is demonstrated using a simulation study, and the method is illustrated on an evaluation of the effect of training sequences provided by the Job Corps programme on employment.

Model uncertainty and incomplete models

A completely different approach to dealing with model uncertainty is to make use of as little information as possible and create a model that is not completely specified. This leads to situations where the object of interest is not point identified but only partially identified (see e.g. Tamer (2010) for a review). Even as the sample size goes to infinity, the parameter of interest is still only bounded: there is an interval of values for the parameter, all of which are compatible with the model assumptions. There is a tension between the amount of information that is assumed and how much can be learnt, which was summarized by Charles Manski (Manski (2003), p. 1): "Law of Decreasing Credibility: The credibility of inference decreases with the strength of the assumptions maintained." Empirical examples may include interval estimates for different menus of assumptions.
The width of these estimated intervals represents model uncertainty, which is different from statistical uncertainty. Another area where these methods are useful is sensitivity analysis. It is useful to explore how important different assumptions are by relaxing them and considering the width of the identified set. The last two essays in this collection deal with sensitivity analysis by extending its application to two important classes of problems in econometric practice, namely mediation analysis and sample selection models.

Huber and Lafférs (2022)³ deals with causal mediation analysis, exploring sensitivity to identifying assumptions that are commonly violated in empirical applications. These include treatment and mediator exogeneity and outcome attrition, even after controlling for observable variables. This paper proposes a method for bounding direct and indirect effects under relaxations of these assumptions. The relaxation parameters are interpretable, and by varying them we can set the amount of misspecification we wish to consider. The paper provides some suggestions on how these can be set in practice, so that the comparisons are meaningful. The method is applied to a gender wage gap decomposition using National Longitudinal Survey of Youth 1979 data, where the overall effect is decomposed into a direct and an indirect (mediated) effect. The sensitivity analysis suggests that the results tend to be sensitive both to outcome attrition and to violations of mediator exogeneity.

Lafférs and Nedela (2017)⁴ provides a computational tool for conducting a sensitivity analysis of the method for estimating average treatment effects under a sample selection problem presented in Lee, D. S. (2009), "Training, wages, and sample selection: Estimating sharp bounds on treatment effects," The Review of Economic Studies, 76(3), 1071-1102. Empirical applications include, for instance, a job market program where, even under treatment randomization, some people choose not to work, and ignoring this selectivity leads to biased estimators. Lafférs and Nedela (2017) controls the departure from the identifying assumptions in Lee (2009) via relaxation parameters that are easy to interpret. One of them allows the proportion of individuals for whom assignment to the program would have a negative effect on their employment (the monotonicity assumption) not to be set to zero but only bounded from above by some number, say 1%. The method shows that the results presented in Lee (2009) are sensitive even to mild deviations from the monotonicity and treatment exogeneity assumptions.

1. Farbmacher, H., Huber, M., Lafférs, L., Langen, H., & Spindler, M. (2022). Causal mediation analysis with double machine learning. The Econometrics Journal, 25(2), 277-300. doi: 10.1093/ectj/utac003. Full text available at https://academic.oup.com/ectj/article/25/2/277/6517682.
2. Bodory, H., Huber, M., & Lafférs, L. (2022). Evaluating (weighted) dynamic treatment effects by double machine learning. The Econometrics Journal, 25(3), 628-648. doi: 10.1093/ectj/utac018. Full text available at https://academic.oup.com/ectj/article-abstract/25/3/628/6604379.
3. Huber, M., & Lafférs, L. (2022). Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition. Econometric Reviews, 41(10), 1141-1163. doi: 10.1080/07474938.2022.2127077. Full text available at https://www.tandfonline.com/doi/full/10.1080/07474938.2022.2127077.
The four papers summarized in this collection contribute to different research areas. Figure 1 schematically depicts the placement of the different papers within the literature.

[Figure 1: The placement of the different papers within the literature. The papers are positioned along two dimensions - Chapter 1: Machine Learning versus Chapter 2: Incomplete Models, and Mediation Analysis versus Sample Selection - with the legend: 1 Farbmacher, Huber, Lafférs, Langen, and Spindler (2022); 2 Bodory, Huber, and Lafférs (2022); 3 Huber and Lafférs (2022); 4 Lafférs and Nedela (2017).]

4. Lafférs, L., & Nedela Jr, R. (2017). Sensitivity of the bounds on the ATE in the presence of sample selection. Economics Letters, 158, 84-87. doi: 10.1016/j.econlet.2017.06.039. Full text available at https://www.sciencedirect.com/science/article/abs/pii/S0165176517302665.

Authors' contributions (qualitative and quantitative) are listed in Chapter 3. Complete papers and appendices are provided in Appendix A.⁵

5. Due to copyright restrictions, the public version of the habilitation thesis does not include the full texts; links to these papers are given in footnotes 1-4.

Chapter 1
Model uncertainty and machine learning

In recent decades, machine learning (ML) algorithms have revolutionized many fields of science. Their powerful prediction capabilities have facilitated progress in classification, computer vision, pattern recognition, business intelligence, and many other fields. Economics is no exception. While most ML advances were about prediction, the main interest in economics lies in understanding underlying mechanisms, so it is the effect of a certain variable that is often of interest. Thus, estimation is at the forefront of econometric research. In this chapter, the focus is placed on ML-based solutions for dealing with model uncertainty in the context of treatment effect estimation, which is a large but by no means exhaustive part of the econometric literature at the intersection of ML methods and economics. Notable advances have recently been made in treatment effect heterogeneity (Wager and Athey (2018)) and many other subfields. The usefulness of ML methods in economics was recognized early on (Varian (2014)); for more recent reviews, we refer to Mullainathan and Spiess (2017), Athey and Imbens (2019) or Athey (2019).

Consider the following example. A job-seeker went through a training course to improve her skill set. Suppose we wish to estimate a causal effect to find out whether the intervention actually worked, that is, whether it improved her employment chances or raised her wage. We typically have a rich set of information about job-seekers that is managed by employment offices, often collected via questionnaires and possibly linked with registry data. The number of potential predictors can be very large, and the choice of the model, e.g. which of the predictors are relevant, is challenging. It may be based on institutional knowledge or expert opinion, but it is inherently subjective to some extent. An important question is whether the prediction performance of ML-based estimators can help to provide unbiased estimators in a high-dimensional setup. Another reason why handling high-dimensional models is useful is that most models are based on non-experimental data and rely on identifying assumptions. In the context of the job-seeker, treatment assignment is not the result of randomization but a choice, and therefore naive comparisons would lead to biased estimates of the true causal effects.
We have to rely on a selection-on-observables assumption, namely that after controlling for a rich set of characteristics (e.g. socioeconomic variables and employment history) the people who went through the training are comparable to those who did not. In other words, we assume that we control for all, or at least most, possible confounders. The more information we have, the more plausible this assumption and, consequently, our conclusions are.

In the rest of this introduction, the main ideas of the Double Machine Learning (DML) framework (Chernozhukov et al. (2018)) for estimation in a high-dimensional setup are presented. The two essays in this chapter build heavily on these results. The DML framework is presented via a very simple example - estimation of an average treatment effect (ATE) under the selection-on-observables assumption (also known as the conditional independence assumption). This particular example is selected because it closely resembles the two setups considered in this chapter.

Double Machine Learning - an illustrative example

Consider a scenario in which we have an outcome of interest $Y$, a binary treatment variable $D$, and a high-dimensional vector of covariates $X$. We use potential outcome notation, so the observed outcome is a combination of the two potential outcomes: $Y = Y(1)D + Y(0)(1-D)$. Suppose that we are interested in the average treatment effect of the binary treatment, $\Delta = E[Y(1) - Y(0)] = \theta_1 - \theta_0$, where $\theta_d = E[Y(d)]$. Furthermore, assume that the vector of covariates $X$ is rich enough to make the treatment $D$ as good as random in terms of its relationship to the potential outcomes (Assumption 1). Also assume that we do not lack comparison units, so that the common support assumption holds (Assumption 2).

(1) Conditional independence of $D$: $\{Y(1), Y(0)\} \perp D \mid X$,
(2) Common support: $\Pr(D = d \mid X = x) > 0$.

Figure 1.1 shows the directed acyclic graph (DAG) (Judea Pearl (2009)) that represents the causal structure of the estimation of the average treatment effect of a binary treatment in this setup.

[Figure 1.1: The causal structure under the conditional independence assumption of $D$.]

In terms of identification, the situation is simple. Controlling for $X$ is sufficient for recovering the true mean potential outcomes and subsequently the average treatment effect $\Delta = E[Y(1)] - E[Y(0)] = \theta_1 - \theta_0$. We may base our estimation on either an outcome model $E[Y \mid D = d, X]$ or a propensity score $\Pr(D = d \mid X)$. Under Assumption (1) we have

$$E[Y(d)] = E\big[E[Y \mid D = d, X]\big], \quad \text{and} \quad E[Y(d)] = E\left[\frac{Y \cdot I(D = d)}{\Pr(D = d \mid X)}\right].$$

As long as we can estimate the conditional expectation $E[Y \mid D = d, X]$ or $\Pr(D = d \mid X)$, we have two consistent estimators for the mean potential outcome $\theta_d$:

$$\hat\theta^{(1)}_d = \frac{1}{n}\sum_{i=1}^{n} \hat E[Y_i \mid D_i = d, X_i], \quad \text{or} \quad \hat\theta^{(2)}_d = \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i \cdot I(D_i = d)}{\widehat{\Pr}(D_i = d \mid X_i)}.$$

ML estimators are biased

The problem is that in the high-dimensional situation, when there are many covariates in $X$, the estimation may not be feasible if there are too many covariates relative to the sample size. ML estimators, such as the lasso, random forests or neural nets, can deal with high-dimensional data. The problem is that these estimators are in general biased and their speed of convergence is typically slow. Because they aim to minimize the prediction error, they trade off some bias for a reduction in variance. This property is well known as the bias-variance trade-off (see e.g. James, Witten, Hastie, and Tibshirani (2021)). We refer to this as regularization bias.
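To make the two plug-in estimators $\hat\theta^{(1)}_d$ and $\hat\theta^{(2)}_d$ above concrete, the following is a minimal sketch in Python, using scikit-learn random forests as stand-ins for generic ML learners of the outcome model and the propensity score. The function name, the learner choice, the assumption of numpy-array inputs and the trimming threshold are illustrative assumptions of this sketch, not part of the original papers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def naive_plugin_estimates(y, d, X, treat=1):
    """Naive plug-in estimators of the mean potential outcome E[Y(d)].

    theta1: regression plug-in based on an ML estimate of E[Y | D=d, X].
    theta2: inverse-probability-weighting plug-in based on Pr(D=d | X).
    Both reuse the full sample for fitting and averaging, so they inherit
    the regularization bias of the underlying ML learners.
    """
    # Outcome model mu(d, X) = E[Y | D=d, X], fitted on units with D = d only
    mu = RandomForestRegressor(n_estimators=200, random_state=0)
    mu.fit(X[d == treat], y[d == treat])

    # Propensity score p_d(X) = Pr(D = d | X)
    ps = RandomForestClassifier(n_estimators=200, random_state=0)
    ps.fit(X, d)
    p_hat = ps.predict_proba(X)[:, list(ps.classes_).index(treat)]
    p_hat = np.clip(p_hat, 0.01, 0.99)            # trim extreme propensity scores

    theta1 = mu.predict(X).mean()                 # regression plug-in
    theta2 = np.mean((d == treat) * y / p_hat)    # IPW plug-in
    return theta1, theta2
```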
Neither $\hat\theta^{(1)}_d$ nor $\hat\theta^{(2)}_d$ will be root-n consistent, under typical assumptions, if they are estimated via ML algorithms. Another strategy is to assume a linear model for $E[Y \mid D, X]$ and then interpret the coefficient on $D$ as the treatment effect. The interpretation of this coefficient as a true causal effect hinges upon very restrictive assumptions on treatment heterogeneity (Goldsmith-Pinkham, Hull, and Kolesár (2022)).¹ But even if such a model were correct, the coefficient on $D$ would still be biased if ML estimators are used, again due to the bias-variance trade-off. This example illustrates that even in such a simple situation, estimation is in general challenging. The main problem is the bias of the ML estimators.

Removing regularization bias

Chernozhukov et al. (2018) built on early results of Neyman (1959), James M Robins and Rotnitzky (1995) and Newey (1994) and developed a methodology that can overcome the problem of bias of ML estimators in the context of low-dimensional parameter estimation. The key idea is to construct a moment condition that is locally insensitive to some of the bias that comes from the ML estimators; they call this property Neyman orthogonality.

Coming back to the illustrative example, consider two functions for $D = 0, 1$: the propensity score $p_d(X) \equiv \Pr(D = d \mid X)$ and the outcome model $\mu(d, X) \equiv E[Y \mid D = d, X]$. These are not directly of interest, but they serve as nuisance functions and can be estimated using ML algorithms. The target parameter of interest is the mean potential outcome $\theta_d$. We now combine the two previous approaches and make use of both the propensity score $p_d(X)$ and the outcome model $\mu(d, X)$. Consider the following moment function $\psi_d \equiv \psi(\underbrace{Y, D, X}_{\text{data}};\ \underbrace{\theta_d}_{\text{target}},\ \underbrace{\mu, p_d}_{\text{nuisance}})$ for estimating $\theta_d = E[Y(d)]$, where $\Delta = \theta_1 - \theta_0$ is our object of interest:

$$\psi_d = \frac{I\{D = d\} \cdot [Y - \mu(d, X)]}{p_d(X)} + \mu(d, X) - \theta_d.$$

This moment function satisfies $E[\psi_d] = 0$. It is also doubly robust (James M Robins and Rotnitzky (1995)): if either the propensity score $p_d(X)$ or the outcome model $\mu(d, X)$ is correct, then the moment condition still identifies $\theta_d = E[Y(d)]$. A key property of this moment function $\psi_d$ is Neyman orthogonality: in the vicinity of the true nuisance functions $\mu$ and $p_d$ and the true target parameter $\theta_d$, the moment function does not change (to first order).² Chernozhukov et al. (2018) showed that the asymptotic behavior of $\hat\theta_d$ can be studied using the following expansion:

$$\sqrt{n}(\hat\theta_d - \theta_d) = \underbrace{a^*}_{\text{approx. Gaussian}} + \underbrace{b^*}_{\text{regularization bias}} + \underbrace{c^*}_{\text{overfitting bias}}.$$

1. Goldsmith-Pinkham, Hull, and Kolesár (2022) provide a detailed discussion of this problem, which they call contamination bias and which mainly arises in the case of a multi-valued discrete treatment.
2. Formally, $\frac{\partial}{\partial r} E[\psi(Y, D, X; \theta_d, \eta_0 + r(\eta - \eta_0))]\big|_{r=0} = 0$, where $\eta = (\mu, p_d)$ and the subscript 0 denotes the true value.

[Figure 1.2: Visualization of the 7-fold cross-fitting procedure. The dataset is split randomly into 7 parts. For each split, only one fold (approximately 1/7 of the whole dataset) is used for the estimation of the target parameter, while the rest of the dataset is used for the estimation of the nuisance parameters $\hat\mu(d, X)$ and $\hat p_d(X)$, which are used as plug-in estimators. Eventually, the estimator $\hat\theta_d = \frac{1}{n}\sum_{k=1}^{7}\sum_{i=1}^{n_k} \hat\theta^k_{di}$ is calculated as an average across these 7 folds.]
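The following is a minimal sketch, in Python with scikit-learn, of a cross-fitted estimator of $\theta_d$ based on the Neyman-orthogonal moment function $\psi_d$ above (with a generic number of folds rather than the 7 of the figure). The learner choices, trimming, standard-error formula and names are illustrative assumptions of this sketch, not the implementation used in the papers.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def dml_mean_potential_outcome(y, d, X, treat=1, n_folds=5, trim=0.01, seed=0):
    """Cross-fitted estimate of theta_d = E[Y(d)] based on the orthogonal score
    psi_d = 1{D=d} * (Y - mu(d,X)) / p_d(X) + mu(d,X) - theta_d."""
    scores = np.zeros(len(y))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        # Nuisance functions are fitted on the training folds only ...
        mu = RandomForestRegressor(n_estimators=200, random_state=seed)
        mu.fit(X[train][d[train] == treat], y[train][d[train] == treat])
        ps = RandomForestClassifier(n_estimators=200, random_state=seed)
        ps.fit(X[train], d[train])
        # ... and evaluated on the held-out fold (cross-fitting).
        p_hat = ps.predict_proba(X[test])[:, list(ps.classes_).index(treat)]
        p_hat = np.clip(p_hat, trim, 1 - trim)     # trim extreme propensity scores
        mu_hat = mu.predict(X[test])
        scores[test] = (d[test] == treat) * (y[test] - mu_hat) / p_hat + mu_hat
    theta_hat = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(y))      # standard error of the mean
    return theta_hat, se

# The ATE estimate Delta = theta_1 - theta_0 is the difference of two such calls:
# ate = dml_mean_potential_outcome(y, d, X, 1)[0] - dml_mean_potential_outcome(y, d, X, 0)[0]
```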
If $\hat\theta_d$ is based on a Neyman-orthogonal moment function, then the second term $b^*$ vanishes under mild conditions on the quality of the nuisance parameter estimators.³ This is because in the asymptotic expansion of $\sqrt{n}(\hat\theta_d - \theta_d)$ there appears a product of the estimation errors $\hat p_d - p_d$ and $\hat\mu - \mu$, so even if the ML estimators of $p_d$ and $\mu$ converge slowly, their product converges faster than the square root of $n$, which is what makes the regularization term $b^*$ go to zero. This is key for achieving root-n consistency of the estimator.

Removing overfitting bias

The last term $c^*$ arises because the same dataset is used for the estimation of the nuisance parameters $\mu, p_d$ and also the target parameter $\theta_d$. An easy solution is to split the data sample randomly in half: use one half to construct the estimators $\hat\mu, \hat p_d$ and the other half to estimate $\hat\theta_d$ using $\hat\mu, \hat p_d$ as inputs. A more efficient variant is the cross-fitting procedure, which switches the roles of the subsamples used for estimating the nuisance functions and the target parameter and then averages the effect estimates over the parts. This procedure is depicted in Figure 1.2.⁴

Overview

The Double Machine Learning framework of Chernozhukov et al. (2018) provides a general framework for estimating low-dimensional parameters without bias in a high-dimensional setup, thus (at least partially) addressing the problem of econometric model specification. Using cross-fitting, if the moment function used for estimation is Neyman-orthogonal, we obtain a square-root-n consistent estimator that is asymptotically normally distributed, making statistical inference straightforward. This is especially suitable for treatment effect estimation. The following essays leverage the usefulness of the DML framework and adapt it to two specific situations: the first is mediation analysis and the second is dynamic treatment effects.

3. More precisely, they are satisfied if the ML estimators converge at rate $o(n^{-1/4})$, which is not a strong requirement and is satisfied for many commonly used ML estimators under specific conditions, such as approximate sparsity of $\eta_0$ for the lasso or well-approximability of $\eta_0$ with trees for random forests; see Chernozhukov et al. (2018) for more examples.
4. See e.g. Section 3 in Chernozhukov et al. (2018) for a formal definition.

1.1 Causal mediation analysis with double machine learning (Farbmacher, Huber, Lafférs, Langen, and Spindler 2022)

In many situations, we wish not only to estimate a treatment effect but also to discover and quantify different causal mechanisms. Mediation analysis attempts to disentangle the various channels through which a treatment operates. There may be a direct effect of the treatment but also an indirect effect, that is, an effect that operates through a variable called a mediator. As an example, we may be interested in whether the effect of education on health operates through employment or through health behaviour. If the mediator is endogenous, such an effect decomposition is not possible without additional assumptions, even if the treatment is fully randomized (Rosenbaum (1984), Robins and Greenland (1992)). In order to identify the causal effect that operates via the mediator, we need to control for potential confounders of the mediator. The early literature is built on highly restrictive linear models (Cochran (1957), Judd and Kenny (1981)).
Later on, semi-parametric and non-parametric models were considered, while identification mostly relied on a selection-on-observables assumption (Robins and Greenland (1992), Pearl (2001), Robins (2003), Petersen, Sinisi, and Laan (2006), VanderWeele (2009)). Within economics, empirical examples include, but are not limited to, Flores and Flores-Lagunes (2009), Heckman, Pinto, and Savelyev (2013), Keele, Tingley, and Yamamoto (2015), Conti, Heckman, and Pinto (2016), Huber (2015) or Huber, Lechner, and Mellace (2017). These papers are all based on a selection-on-observables assumption, which requires a selection of relevant confounders based on institutional knowledge and other theoretical considerations. This necessarily brings some degree of ambiguity to the modelling. The present paper addresses the model uncertainty that is implicit in this covariate choice by employing a data-driven selection from a high-dimensional vector of potential confounders.

The contribution of this paper is that it proves that the efficient score functions for direct and indirect effects of Tchetgen Tchetgen and Shpitser (2012) are Neyman-orthogonal, so that the DML framework of Chernozhukov et al. (2018) can be applied. This approach requires the estimation of the conditional density of the mediator, which is not feasible, or is problematic, if the mediator is a vector of multiple variables. For such cases, we provide an alternative moment condition that is also Neyman-orthogonal, yet avoids conditional mediator density estimation. We provide a simulation study that explores the finite sample behaviour of the suggested estimators and show that they perform well in our simulation design with approximate sparsity. As an empirical illustration, we study the health effects of health insurance coverage. We decompose the total effect into a direct effect and an indirect effect that operates via more regular check-ups. Our results suggest that this is not an important channel of the total effect. The following subsections present the setup, the main results and the conclusions.

1.1.1 Setup

We adopt the potential outcomes framework with $Y$ as the observed outcome, $M$ as the observed mediator, and $D$ as the treatment. The potential outcome is a function of both the treatment and the mediator. For the observed outcome and mediator it holds that $Y = D \cdot Y(1, M(1)) + (1 - D) \cdot Y(0, M(0))$ and $M = D \cdot M(1) + (1 - D) \cdot M(0)$; thus we observe either $Y = Y(1, M(1))$ and $M = M(1)$ if $D = 1$, or $Y = Y(0, M(0))$ and $M = M(0)$ if $D = 0$.

The natural direct effect $\theta(d)$ is the effect of the treatment in the situation where the mediator is fixed at its natural value:

$$\theta(d) = E[Y(1, M(d)) - Y(0, M(d))], \quad d \in \{0, 1\}.$$

The (average) indirect effect $\delta(d)$ equals the difference in mean potential outcomes when switching the potential mediator values while keeping the treatment fixed to block the direct effect:

$$\delta(d) = E[Y(d, M(1)) - Y(d, M(0))], \quad d \in \{0, 1\}.$$

In some cases, another effect may be of interest. The controlled direct effect refers to the effect that arises when the value of the mediator is fixed at a certain level for the entire population (see Pearl (2001) for further discussion):

$$\gamma(m) = E[Y(1, m) - Y(0, m)], \quad m \in \mathcal{M}.$$

The (total) average treatment effect $\Delta = E[Y(1, M(1)) - Y(0, M(0))]$ can be decomposed into the direct and indirect effects:

$$\begin{aligned}
\Delta &= E[Y(1, M(1)) - Y(0, M(0))] \\
&= E[Y(1, M(1)) - Y(0, M(1))] + E[Y(0, M(1)) - Y(0, M(0))] = \theta(1) + \delta(0) \\
&= E[Y(1, M(0)) - Y(0, M(0))] + E[Y(1, M(1)) - Y(1, M(0))] = \theta(0) + \delta(1).
\end{aligned} \tag{1.1}$$
The following assumptions are made in order to identify the effects of interest. The first assumption states that conditioning on $X$ makes the treatment as good as randomly assigned in terms of its influence on both the outcome and the mediator.

Assumption A1 (conditional independence of the treatment): $\{Y(d', m), M(d)\} \perp D \mid X = x$ for all $d', d \in \{0, 1\}$ and $x$ in the support of $X$, where '$\perp$' denotes statistical independence.

The second assumption requires the mediator to be conditionally independent of the potential outcomes given the treatment and the covariates.

Assumption A2 (conditional independence of the mediator): $Y(d', m) \perp M \mid D = d, X = x$ for all $d', d \in \{0, 1\}$ and $m, x$ in the support of $M, X$.

Assumption A2 states that there are no confounders jointly affecting the mediator and the outcome conditional on $D$ and $X$. If $X$ is a vector of pre-treatment variables (in order to avoid the "bad controls" problem, Cinelli, Forney, and Pearl (2020)), this means that there are no post-treatment confounders of the mediator-outcome relation. The validity of such a restriction depends on the empirical context and needs to be scrutinized. It is typically less plausible if the time window between the measurement of the treatment and the mediator is large, as many potential confounders may vary over time. The third assumption requires common support of the conditional treatment probability across treatment states.

Assumption CS (common support): $\Pr(D = d \mid M = m, X = x) > 0$ for all $d \in \{0, 1\}$ and $m, x$ in the support of $M, X$.

The directed acyclic graphs that represent the causal structure and the different effects are visualized in Figure 1.3. It is worth noting that the plausibility of these assumptions depends heavily on the richness of the covariate vector $X$, which makes the high-dimensional setup particularly appealing.

[Figure 1.3: Directed acyclic graphs over $D$, $M$, $X$ and $Y$ under conditional exogeneity given pre-treatment covariates $X$: (a) the total effect $\Delta$, (b) the natural direct effect $\theta(d)$, (c) the indirect effect $\delta(d)$.]

1.1.2 Main results

We introduce additional notation: we use $\mu$ for the outcome model, $p_d$ for the treatment propensity score, which is a function of either $X$ alone or of $M$ and $X$, and $\omega$ for a nested conditional outcome model. These nuisance functions are estimated via ML estimators and then plugged into the moment functions using the cross-fitting algorithm:

$$\begin{aligned}
\mu(D, M, X) &= E[Y \mid D, M, X], \\
p_d(X) &= \Pr(D = d \mid X), \\
p_d(M, X) &= \Pr(D = d \mid M, X), \\
\omega(1 - d, X) &= E\big[\mu(d, M, X) \mid D = 1 - d, X\big].
\end{aligned}$$

Also let $f(m \mid D, X)$ denote the conditional density of $M$ given $D$ and $X$ (if $M$ is discrete, this is $f(m \mid D, X) = \Pr(M = m \mid D, X)$ and integrals need to be replaced by sums). In order to estimate the direct effects $\theta(d)$, the indirect effects $\delta(d)$, and the controlled direct effects $\gamma(m)$, we need to estimate $E[Y(d, M(d))]$, $E[Y(d, M(1 - d))]$ and $E[Y(d, m)]$.
These moment conditions are used in the estimation within the DML framework (with $\mu(d, X) \equiv E[Y \mid D = d, X]$):

$$E[Y(d, M(d))] = E\left[\frac{I\{D = d\} \cdot [Y - \mu(d, X)]}{p_d(X)} + \mu(d, X)\right],$$

$$\begin{aligned}
E[Y(d, M(1-d))] = E\bigg[&\frac{I\{D = d\} \cdot f(M \mid 1-d, X)}{p_d(X) \cdot f(M \mid d, X)} \cdot [Y - \mu(d, M, X)] \\
&+ \frac{I\{D = 1-d\}}{1 - p_d(X)} \cdot \Big(\mu(d, M, X) - \int_{m \in \mathcal{M}} \mu(d, m, X) \cdot f(m \mid 1-d, X)\, dm\Big) \\
&+ \int_{m \in \mathcal{M}} \mu(d, m, X) \cdot f(m \mid 1-d, X)\, dm\bigg],
\end{aligned}$$

$$\begin{aligned}
E[Y(d, M(1-d))] = E\bigg[&\frac{I\{D = d\} \cdot (1 - p_d(M, X))}{p_d(M, X) \cdot (1 - p_d(X))} \cdot [Y - \mu(d, M, X)] \\
&+ \frac{I\{D = 1-d\}}{1 - p_d(X)} \cdot \Big(\mu(d, M, X) - \frac{1}{1 - p_d(X)} \cdot E\big[\mu(d, M, X) \cdot (1 - p_d(M, X)) \,\big|\, X\big]\Big) \\
&+ E\Big[\mu(d, M, X) \cdot \frac{1 - p_d(M, X)}{1 - p_d(X)} \,\Big|\, X\Big]\bigg],
\end{aligned}$$

$$E[Y(d, m)] = E\left[\frac{I\{D = d\} \cdot I\{M = m\} \cdot [Y - \mu(d, m, X)]}{f(m \mid d, X) \cdot p_d(X)} + \mu(d, m, X)\right].$$

There are two formulations for $E[Y(d, M(1-d))]$: the first is the efficient score function of Tchetgen Tchetgen and Shpitser (2012); we prefer the second, as it avoids conditional mediator density estimation. The following statement provides a high-level summary of the results in Farbmacher, Huber, Lafférs, Langen, and Spindler (2022):⁵ under the identifying assumptions A1, A2 and CS, along with various regularity conditions, the K-fold cross-fitting algorithm based on the moment conditions above provides estimators of $E[Y(d, M(d))]$, $E[Y(d, M(1-d))]$ and $E[Y(d, m)]$, and subsequently of $\theta(d)$, $\delta(d)$ and $\gamma(m)$, that are root-n consistent and asymptotically normal.

The proofs are based on the DML framework, and they require showing that the moment conditions are Neyman-orthogonal and that various regularity conditions hold. Most of these are mild technical assumptions, with the exception of those placed on the quality of the ML estimators. Specifically, the ML estimators of the nuisance functions $\mu, p_d$ need to converge at the rate $o(n^{-1/4})$, which is weaker than the usual parametric rate $n^{-1/2}$. Convergence rates of various ML estimators are derived in the literature based on more primitive assumptions; see Chernozhukov et al. (2018) for examples.

We also investigate the finite sample behaviour based on the following simple data generating process:

$$\begin{aligned}
Y &= 0.5 D + 0.5 M + 0.5 D M + X'\beta + U, \\
M &= I\{0.5 D + X'\beta + V > 0\}, \\
D &= I\{X'\beta + W > 0\}, \\
X &\sim N(0, \Sigma), \quad U, V, W \sim N(0, 1) \text{ independently of each other and of } X,
\end{aligned}$$

where $\Sigma_{ij} = 0.5^{|i-j|}$. In the simulation, the sample size was set to two different values, $n = 1000$ and $n = 4000$, with 1000 simulations per data generating process. We chose two different amounts of confounding, $\beta_i = 0.3/i^2$ or $\beta_i = 0.5/i^2$ for $i = 1, \ldots, 200$. The absolute bias was smaller than 0.01 in all our specifications, and quadrupling the sample size cuts the root mean squared error roughly in half, which is compatible with the square-root-n consistency of our main results.

1.1.3 Empirical application

We explored the method in an empirical application based on the National Longitudinal Survey of Youth 1997 data in the United States (Bureau of Labor Statistics, U.S. Department of Labor (2001)). The question of interest was whether the effect of health insurance coverage (D) on general health (Y) is mediated via regular check-ups (M); see Maciosek, Coffield, Flottemesch, Edwards, and Solberg (2010) for a review. Health insurance coverage is a binary variable based on the 2006 interview; the mediator is also a binary variable, based on the 2017 interview, where individuals were asked if they had gone for a routine checkup in 2016. The outcome is self-reported health - an ordinal variable with levels 'excellent', 'very good', 'good', 'fair', and 'poor'.

5. More concretely, Lemma 3.3, Lemma 3.4, Theorem 4.1 and Theorem 4.2.
A wide range of control variables $X$ is based on interviews from 2005 or earlier, so that they cannot be influenced by the treatment. They include demographic characteristics, socio-economic background, education, training, household characteristics, marital status, fertility, received monetary transfers, attitudes, expectations, physical and mental health, nutrition, and physical activity. Altogether we have 755 control variables, of which 593 are dummy variables; 251 of these dummies encode missing information to deal with non-response. The sample size was 7,489. We used post-lasso regression, with three-fold cross-fitting and a trimming threshold for the propensity scores of 2%. The results suggest that health care coverage improves general health: the direct effect on health is positive and statistically significant. The indirect effect is small and insignificant, suggesting that more regular health checkups are not an important channel of this effect.

1.1.4 Conclusions

This paper extended the Double Machine Learning framework to mediation analysis, so that it is possible to explore different causal pathways of treatment effects in a high-dimensional setup. It avoids ad hoc model specification and relies on a data-driven covariate choice. The performance of the new method is supported by a simulation study and illustrated on an example from health economics.

1.2 Evaluating (weighted) dynamic treatment effects by double machine learning (Bodory, Huber, and Lafférs 2022)

In the introductory section, we considered estimation of the effect of a single treatment. In many cases, researchers are interested in the effects of sequences of treatments. After a job training, an applicant may attend a language course; after a surgery, a patient goes through a rehabilitation procedure. The treatment is in many empirically relevant cases a result of a choice (thus non-random), and naive comparisons are not informative about the true causal effects. In practice, we often rely on selection-on-observables assumptions, i.e. that the available information makes the treatment as good as random. In the case of multiple treatments that happen in sequence, we may assume that in each period the treatment assignment is unconfounded given the information available at the time, including the treatments in previous periods - such an assumption is called a sequential conditional independence assumption. The validity of such an assumption heavily depends on the richness of the information that the covariates span. In a situation with many covariates, researchers face the challenge of how to pick the relevant variables. This is typically done using institutional knowledge or expert opinion, so there is always some amount of ambiguity involved. This paper addresses this concern. It provides a data-driven way of choosing the relevant covariates by combining semi-parametrically efficient estimation of dynamic treatment effects (using the efficient score function of Robins, Rotnitzky, and Zhao (1994), J. M. Robins and Rotnitzky (1995)) with the Double Machine Learning (DML) framework of Chernozhukov et al. (2018). We formally prove that our approach fits within the DML framework.

In the early literature on dynamic treatment effects, Robins (1986) proposed a dynamic causal framework called g-computation for recursively modeling outcomes under the sequential conditional independence assumption, initially implemented by parametric maximum likelihood estimation.
Robins (1998) suggested a computationally less expensive alternative representing outcomes in specific treatment states as functions of time-constant covariates only. Given the time-varying confounding, these models need to be reweighted by the inverse of the dynamic treatment propensity scores (Robins, Greenland, and Hu (1999) or Robins, Hernan, and Brumback (2000)). More recently, Lechner (2009) considered inverse probability weighting (IPW) by the dynamic treatment propensity scores alone, while Lechner and Miquel (2010) apply propensity score matching and Blackwell and Strezhnev (2020) direct matching on the covariates. The two papers closest to ours are Lewis and Syrgkanis (2020) and Viviano and Bradic (2021). Lewis and Syrgkanis (2020) provide a DML estimation that can be applied to continuous treatments too, but rely on a more restrictive (partially linear) outcome model. Viviano and Bradic (2021) also provide a method that can be combined with ML estimators, but it is based on covariate balancing (Athey, Imbens, and Wager (2018)) instead of inverse propensity score weighting (as is done in this work). The following subsections present the setup, the main results (including the simulation study and the empirical application) and the conclusions.

1.2.1 Setup

Let $D_t$ and $Y_t$ be the treatment and outcome in period $T = t$, respectively. For instance, $D_1$ and $D_2$ represent the treatments in the first and second periods, respectively, and can take values $d_1, d_2 \in \{0, 1, \ldots, Q\}$, where 0 indicates no treatment and $1, \ldots, Q$ represent the various treatment options. Moreover, $Y_2$ represents the outcome of interest in the second period, after the treatment sequence $D_1$ and $D_2$ has been applied. We use the potential outcome framework (Rubin (1974)): for a specific treatment sequence $\underline{d}_2 \equiv (d_1, d_2)$ with $d_1, d_2 \in \{0, 1, \ldots, Q\}$, let $\underline{D}_2 \equiv (D_1, D_2)$, and let $Y_2(\underline{d}_2)$ denote the potential outcome that would be observed if the treatments were set to that sequence $\underline{d}_2$. We also consider identification for a specific subgroup of interest ($S = 1$), for instance the population treated in the first period. The objects of interest are

$$\Delta(\underline{d}_2, \underline{d}^*_2) = E[Y_2(\underline{d}_2)] - E[Y_2(\underline{d}^*_2)], \qquad \Delta(\underline{d}_2, \underline{d}^*_2 \mid S = 1) = E[Y_2(\underline{d}_2) \mid S = 1] - E[Y_2(\underline{d}^*_2) \mid S = 1].$$

Identification is based on sequential conditional independence assumptions and on a common support assumption. Also, the selection indicator $S$ and the potential outcome are assumed independent conditional on the pre-treatment covariates $X_0$.

Assumption B1 (conditional independence of the first treatment): $Y_2(\underline{d}_2) \perp D_1 \mid X_0$ for $\underline{d}_2 \in \{0, 1, \ldots, Q\}^2$, where '$\perp$' denotes statistical independence.

Assumption B2 (conditional independence of the second treatment): $Y_2(\underline{d}_2) \perp D_2 \mid D_1, X_0, X_1$ for $\underline{d}_2 \in \{0, 1, \ldots, Q\}^2$.

Assumption B3 (common support): $\Pr(D_1 = d_1 \mid X_0) > 0$ and $\Pr(D_2 = d_2 \mid D_1, \underline{X}_1) > 0$ for $d_1, d_2 \in \{0, 1, \ldots, Q\}$.

Assumption B4 (conditional independence of the subgroup indicator): $S \perp Y_2(\underline{d}_2) \mid X_0$ for $\underline{d}_2 \in \{0, 1, \ldots, Q\}^2$.

The causal structure encoded by the identifying assumptions is depicted in Figure 1.4.

[Figure 1.4: Directed acyclic graph over $D_1$, $X_0$, $D_2$, $X_1$ and $Y_2$ under conditional independence of the first treatment given the pre-treatment covariates $X_0$ and conditional independence of the second treatment given the covariates $X_0$ and $X_1$.]

1.2.2 Main results

We introduce further notation for the nuisance functions (with $\underline{X}_1 \equiv (X_0, X_1)$):

$$\begin{aligned}
\mu_{Y_2}(\underline{D}_2, \underline{X}_1) &= E[Y_2 \mid \underline{D}_2, X_0, X_1], \\
p_{d_1}(X_0) &= \Pr(D_1 = d_1 \mid X_0), \\
p_{d_2}(D_1, \underline{X}_1) &= \Pr(D_2 = d_2 \mid D_1, \underline{X}_1), \\
\nu_{Y_2}(\underline{D}_2, X_0) &= \int_{x_1 \in \mathcal{X}_1} E[Y_2 \mid \underline{D}_2, X_0, X_1 = x_1]\, dF_{X_1 = x_1 \mid D_1, X_0}, \\
g(X_0) &= \Pr(S = 1 \mid X_0).
\end{aligned}$$
These are estimated via machine learning estimators and plugged into the following moment conditions, which are shown to satisfy the Neyman-orthogonality property and are thus locally insensitive to mild deviations of the nuisance functions from their true values:

$$E[Y_2(\underline{d}_2)] = E\left[\frac{I\{D_1 = d_1\} \cdot I\{D_2 = d_2\} \cdot [Y_2 - \mu_{Y_2}(\underline{d}_2, \underline{X}_1)]}{p_{d_1}(X_0) \cdot p_{d_2}(d_1, \underline{X}_1)} + \frac{I\{D_1 = d_1\} \cdot [\mu_{Y_2}(\underline{d}_2, \underline{X}_1) - \nu_{Y_2}(\underline{d}_2, X_0)]}{p_{d_1}(X_0)} + \nu_{Y_2}(\underline{d}_2, X_0)\right],$$

$$\begin{aligned}
E[Y_2(\underline{d}_2) \mid S = 1] = E\bigg[&\frac{g(X_0)}{\Pr(S = 1)} \cdot \frac{I\{D_1 = d_1\} \cdot I\{D_2 = d_2\} \cdot [Y_2 - \mu_{Y_2}(\underline{d}_2, \underline{X}_1)]}{p_{d_1}(X_0) \cdot p_{d_2}(d_1, \underline{X}_1)} \\
&+ \frac{g(X_0)}{\Pr(S = 1)} \cdot \frac{I\{D_1 = d_1\} \cdot [\mu_{Y_2}(\underline{d}_2, \underline{X}_1) - \nu_{Y_2}(\underline{d}_2, X_0)]}{p_{d_1}(X_0)} + \frac{S}{\Pr(S = 1)} \cdot \nu_{Y_2}(\underline{d}_2, X_0)\bigg].
\end{aligned}$$

Theorems 1 and 2 in Bodory, Huber, and Lafférs (2022) state that under the identifying assumptions B1-B4 and various regularity conditions, the K-fold cross-fitting algorithm based on the moment conditions above provides estimators of $E[Y_2(\underline{d}_2)]$ and $E[Y_2(\underline{d}_2) \mid S = 1]$, and subsequently of $\Delta(\underline{d}_2, \underline{d}^*_2)$ and $\Delta(\underline{d}_2, \underline{d}^*_2 \mid S = 1)$, that are root-n consistent and asymptotically normal.

The paper provides a simulation study with the following design:

$$\begin{aligned}
Y_2 &= D_1 + D_2 + X_0'\beta_{X_0} + X_1'\beta_{X_1} + U, \\
D_1 &= I\{X_0'\beta_{X_0} + V > 0\}, \\
D_2 &= I\{0.3 D_1 + X_0'\beta_{X_0} + X_1'\beta_{X_1} + W > 0\}, \\
X_0 &\sim N(0, \Sigma_0), \quad X_1 \sim N(0, \Sigma_1), \quad U, V, W \sim N(0, 1), \text{ independently of each other.}
\end{aligned}$$

In this design, the coefficient vectors $\beta_{X_0}$ and $\beta_{X_1}$ determine the degree of confounding. The $i$th element of $\beta_{X_0}$ and $\beta_{X_1}$ was set to $0.4/i^4$ for $i = 1, \ldots, p$, and this polynomial decay is compatible with the approximate sparsity condition suitable for the use of the lasso estimator. Two sample sizes of 2,500 and 10,000 were used, with 1,000 simulations for the smaller and 250 simulations for the larger sample size. The number of covariates $p$ in $X_0$ and $X_1$, respectively, was set to 50, 100, or 500. $\Sigma_0$ and $\Sigma_1$ are defined by setting the covariance of the $i$th and $j$th covariate in $X_0$ or $X_1$ to $0.5^{|i-j|}$. Based on this specification, the degree of confounding is substantial (Bodory, Huber, and Lafférs (2022), Table 1) and could therefore reasonably mimic empirical applications. We used 3-fold cross-fitting with the lasso estimator from the SuperLearner package (Van der Laan, Polley, and Hubbard (2007)). Table 1.1 shows good performance in the simulations for the subpopulation treated in the first period ($S = I(D_1 = 1)$).

covariates  sample size  true effect  absolute bias  standard deviation  average SE  RMSE   coverage in %
ATE on the selected: $\hat\Delta(\underline{d}_2, \underline{d}^*_2, S = 1)$
50          2,500        2            0.027          0.076               0.087       0.081  96.5
50          10,000       2            0.006          0.037               0.043       0.038  95.6
100         2,500        2            0.042          0.079               0.087       0.089  94
100         10,000       2            0.011          0.037               0.043       0.039  96.4
500         2,500        2            0.064          0.075               0.088       0.099  91.5
500         10,000       2            0.019          0.038               0.043       0.043  95.2

Table 1.1: Simulation results based on $\beta_{X_0} = \beta_{X_1} = 0.4/i^4$. Notes: SE and RMSE denote the standard error and the root mean squared error, respectively. Coverage is based on 95% confidence intervals.

1.2.3 Empirical application

The proposed method is applied to evaluate the impact of different sequences of job training provided by the U.S. Job Corps program on the employment probability. The sample used consisted of 11,313 individuals who completed interviews four years after randomization (6,828 assigned to Job Corps, 4,485 randomized out). There are four different treatments: 0 - no instruction (being randomized out), 1 - no instruction (being randomized in), 2 - academic education, 3 - vocational training.
After processing, there are 909 variables in $X_0$ and 1,427 variables in $X_1$, most of them dummy variables, comprising a rich set of information on socio-economic characteristics, labor market history, education, trainings, job search activities, health, crime, and how one learnt about the existence of Job Corps. A random forest was used for the estimation of the nuisance functions. Average treatment effects are estimated in the subsample whose first treatment corresponds to the first treatment in one of the compared sequences. We estimate that a sequence of two vocational trainings provides higher employment in comparison to sequences consisting of academic trainings or no trainings (Table 1.2), in the range of 5 to 10 percentage points.

$\underline{d}_2$  $\underline{d}^*_2$  $\hat E[Y_2(\underline{d}^*_2) \mid S = 1]$  $\hat\Delta(\underline{d}_2, \underline{d}^*_2, S = 1)$  SE     p-value  observations  trimmed
33                  22                   0.76                                        0.1                                                      0.06   0.11     3783          507
33                  21                   0.82                                        0.05                                                     0.03   0.07     3783          43
33                  11                   0.81                                        0.08                                                     0.03   0.02     2346          22

Table 1.2: Effect estimates with a trimming threshold of 0.01. Notes: $\underline{d}_2$ and $\underline{d}^*_2$ indicate the treatment sequences under treatment and non-treatment, respectively. $\hat E[Y_2(\underline{d}^*_2) \mid S = 1]$ denotes the mean potential outcome under non-treatment conditional on $S = 1$, where $S$ is an indicator for the first treatment corresponding to either the first treatment in $\underline{d}_2$ or $\underline{d}^*_2$. $\hat\Delta(\underline{d}_2, \underline{d}^*_2, S = 1)$ provides the ATE estimate, and SE is the standard error.

To assess the validity of these findings, a placebo test was conducted by comparing the effect of two sequences, 00 and 11, where neither sequence involved participation in any training programs. As expected, the estimated effect was very close to zero, with a p-value of 0.92.

1.2.4 Conclusions

This paper showed how the Double Machine Learning framework can be used to study dynamic treatment effects, that is, how to estimate the causal effects of different sequences of treatments based on a selection-on-observables assumption. This alleviates the difficulty of covariate choice in the case of high-dimensional data. The paper demonstrated that the estimators are asymptotically normal and root-n consistent under specific regularity conditions. A simulation study examines the finite sample properties of the estimators, and the methodology is applied to the U.S. Job Corps study.

1.3 Limitations and future research avenues

In the mediation analysis framework, we assumed that the machine learners are "good enough" in the sense that they achieve the desired rate of convergence. One of the nuisance functions is a nested conditional mean, $\omega(1-d, X) = E[\mu(d, M, X) \mid D = 1-d, X]$, which may require special attention.⁶ This is a plug-in estimator, and we use additional data-splitting to estimate $\mu$ and $\omega$ on different subsamples within the cross-fitting algorithm, so that the estimation errors in both stages of the estimation are independent by design. It might be interesting to provide more primitive conditions under which the nested estimator of $\omega$ achieves the rate of convergence required for DML. These would most likely have to be specific to the particular ML method used. Such conditions would make the paper more self-contained and could be of independent interest. This problem may constitute a separate research project on its own, as this area appears to be under-researched.

While the method can choose the set of relevant variables to control for in a data-driven way, it is not an automatic tool for estimating effects; it still requires careful thought in the sense that the (potentially large) list of possible covariates cannot be chosen arbitrarily.

6. A similar situation emerges in the dynamic treatment effects framework for $\nu_{Y_2}$.
The causal structure represented by the directed acyclic graph (DAG) in Figure 1.3 still has to hold true. For instance, we have to make sure that we do not include any "bad controls" (Cinelli, Forney, and Pearl (2020)), e.g. variables that would be influenced by the outcome itself and would introduce spurious associations. This is why we chose variables measured before the treatment itself.

Chapter 2
Model uncertainty and incomplete models

Consider an econometric model in which the parameter of interest is not identified. As a simple example, consider identifying treatment effects under non-random treatment assignment. In such a case, without additional assumptions, it is not possible to determine the mean potential outcomes, as outcomes are only observed in one of the treatment states. Using potential outcomes notation, let $Y$ be the observed outcome, which is one of the potential outcomes: $Y = Y(1)D + Y(0)(1-D)$, where $D$ is a binary treatment. Then the quantities $E[Y(1)]$ and $E[Y(0)]$ are not identified, as they depend on the unobserved quantities $E[Y(1) \mid D = 0]$ and $E[Y(0) \mid D = 1]$:

$$\begin{aligned}
E[Y(1)] &= E[Y(1) \mid D = 1]\Pr(D = 1) + \underbrace{E[Y(1) \mid D = 0]}_{\text{unobserved}}\Pr(D = 0), \\
E[Y(0)] &= \underbrace{E[Y(0) \mid D = 1]}_{\text{unobserved}}\Pr(D = 1) + E[Y(0) \mid D = 0]\Pr(D = 0).
\end{aligned}$$

Depending on the context, we may be willing to make assumptions about $E[Y(1) \mid D = 0]$ and $E[Y(0) \mid D = 1]$. For instance, if the treatment were random, we would have $E[Y(1)] = E[Y \mid D = 1]$ and $E[Y(0)] = E[Y \mid D = 0]$, and therefore the average treatment effect $\Delta = E[Y(1)] - E[Y(0)]$ would be identified. On the other hand, if the only information were that the outcome $Y$ has finite support, $Y \in [y_{\min}, y_{\max}]$, we would only know that

$$\begin{aligned}
E[Y \mid D = 1]\Pr(D = 1) + y_{\min}\Pr(D = 0) &\leq E[Y(1)] \leq E[Y \mid D = 1]\Pr(D = 1) + y_{\max}\Pr(D = 0), \\
y_{\min}\Pr(D = 1) + E[Y \mid D = 0]\Pr(D = 0) &\leq E[Y(0)] \leq y_{\max}\Pr(D = 1) + E[Y \mid D = 0]\Pr(D = 0),
\end{aligned}$$

which translates into bounds $\Delta \in [\Delta_{\min}, \Delta_{\max}]$. This is called an identified set; see Manski (2003) for a book-length presentation and Tamer (2010) for a review of the partial identification literature. The width of the bounds represents the lack of information, in other words, the degree of uncertainty. We could consider different menus of assumptions and explore how the width of the bounds varies, thus better understanding the sources of identification. This may sharpen the discussion, as we can focus more on the assumptions that are important.

Suppose we have a point-identified econometric method and we wish to conduct a sensitivity analysis with respect to the identifying assumptions. Say that the selection-on-observables assumption only holds approximately, and we wish to explore whether a small deviation from this assumption could lead to very different results. Consider the two different scenarios depicted in Figure 2.1. In the first, a relaxation of an identifying assumption leads to a small identified set, so we may conclude that the effect is robust to a violation of this assumption. In the other, we get a very large identified set, highlighting the importance of this particular assumption.

[Figure 2.1: A relaxation of an identifying assumption may lead to a narrow identified set for $\theta$ (left), indicating a robust result, or, on the contrary, to a wide identified set (right), indicating a sensitive one.]
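As a concrete illustration of the worst-case bounds derived above (before any relaxation of point-identifying assumptions is even considered), the following is a minimal sketch in Python; the function name and the assumption of numpy-array inputs are illustrative choices of this sketch.

```python
import numpy as np

def worst_case_ate_bounds(y, d, y_min, y_max):
    """Worst-case (no-assumption) bounds on the ATE Delta = E[Y(1)] - E[Y(0)]
    for a bounded outcome, following the displays above: the unobserved
    counterfactual means are replaced by the extreme values y_min and y_max."""
    d = np.asarray(d, dtype=bool)
    y = np.asarray(y, dtype=float)
    p1 = d.mean()                       # Pr(D = 1)
    p0 = 1.0 - p1                       # Pr(D = 0)
    ey1 = y[d].mean()                   # E[Y | D = 1]
    ey0 = y[~d].mean()                  # E[Y | D = 0]
    ey1_lo, ey1_hi = ey1 * p1 + y_min * p0, ey1 * p1 + y_max * p0   # bounds on E[Y(1)]
    ey0_lo, ey0_hi = y_min * p1 + ey0 * p0, y_max * p1 + ey0 * p0   # bounds on E[Y(0)]
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo                         # [Delta_min, Delta_max]
```

Note that the width of the resulting interval is always $y_{\max} - y_{\min}$, illustrating the Law of Decreasing Credibility quoted in the Introduction: without any assumptions, the data alone cannot narrow the ATE below the range of the outcome.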
The following two essays present methods for calculating such bounds and therefore offer tools for conducting sensitivity analyses. The first paper studies mediation analysis and estimators based on inverse propensity score weighting. The second paper considers the setup of estimating average treatment effects under a sample selection problem, where the outcome variable is not observed for a non-random proportion of the population.

2.1 Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition (Huber and Lafférs 2022)

Similarly to Farbmacher, Huber, Lafférs, Langen, and Spindler (2022), presented in Section 1.1, this paper also studies mediation analysis, where the objective is to decompose a total effect into a direct effect and an effect that is channeled through additional variable(s), called mediator(s). The motivation, review of the literature and different approaches are presented there. This paper, however, has a different objective. While in Farbmacher, Huber, Lafférs, Langen, and Spindler (2022) the focus was on the model uncertainty connected to the variable choice, here we address the sensitivity of the effects to the identifying assumptions. If these assumptions are relaxed, the direct and indirect effects are no longer point identified, and we only get bounds on the effects. It is, however, challenging to come up with relaxations that are easy to interpret and, at the same time, keep the estimation of the bounds technically feasible. This is the contribution of the paper: it provides a computationally feasible method that calculates bounds under relaxed assumptions and thus allows one to conduct sensitivity analysis.

We now add a few notes on the literature on sensitivity analysis within mediation analysis, which is the stream of literature to which this paper contributes. Imai, Keele, and Yamamoto (2010) show how to conduct sensitivity analysis within highly restrictive linear models, with the relaxation parameter being the correlation of the unobserved terms in the mediator and outcome equations. Tchetgen Tchetgen and Shpitser (2011) proposed a semi-parametric framework that uses a "selection bias function", which relates confounders of the mediator-outcome relation to the treatment and allows for multidimensional unobserved confounders. These ideas were further developed in Vansteelandt and VanderWeele (2012). The paper closest to ours in terms of strategy is Hong, Qin, and Yang (2018). They consider weighting estimators, and their core idea is that if some important confounders are omitted, then the weights actually used in the analysis are incorrect. We follow a similar line of reasoning, but we use a different measure to limit the errors in the weights due to confounding. In contrast to their method, we do not provide analytical formulas but only a computational method; our method, however, has the advantage that it allows for simultaneous relaxations of different assumptions. This allows us to better understand the non-robustness of the results to violations of the various identifying assumptions.

The following subsections present the setup and then motivate and explain step by step how the computational method was constructed. The methodology is then applied to a gender wage gap decomposition using the National Longitudinal Survey of Youth 1979 dataset from the United States.

2.1.1 Setup

The setup is very similar to that in Section 1.1.1, where we introduced mediation analysis.
There is an additional layer of complication: the outcome is not observed for the whole sample, which we refer to as a sample selection problem. The identification of direct and indirect effects is possible based on the assumptions of conditional independence of the treatment (A1), conditional independence of the mediator (A2), common support assumptions (CS)¹ (all stated and discussed in Section 1.1.1), and a new assumption postulating that the information in D, X, M is rich enough to capture all the possible confounders of selection S and outcome Y:

Assumption A3 (conditional independence of selection): Y ⊥ S | D = d, M = m, X = x for all d ∈ {0, 1} and m, x in the support of M, X.

1. With an extra assumption that Pr(S = 1|D = d, M = m, X = x) > 0 for d, m, x in their support.

Theorem 1 of Huber and Solovyeva (2020a) states that Assumptions A1-A3 allow identifying the mean potential outcomes by

E[Y(1, M(1))] = E[ Y · D · S / ( Pr(D = 1|X) · Pr(S = 1|D, M, X) ) ],    (2.1)
E[Y(0, M(0))] = E[ Y · (1 − D) · S / ( (1 − Pr(D = 1|X)) · Pr(S = 1|D, M, X) ) ],
E[Y(1, M(0))] = E[ Y · D · S / ( (1 − Pr(D = 1|X)) · Pr(S = 1|D, M, X) ) · ( 1/Pr(D = 1|M, X) − 1 ) ],
E[Y(0, M(1))] = E[ Y · (1 − D) · S / ( Pr(D = 1|X) · Pr(S = 1|D, M, X) ) · ( 1/(1 − Pr(D = 1|M, X)) − 1 ) ].

The direct and indirect effects of interest are obtained as differences between two of the four mean potential outcomes. In order to ease notation, we denote the various propensity scores in (2.1) by p^{A1} = Pr(D = 1|X), p^{A2} = Pr(D = 1|M, X) and p^{A3} = Pr(S = 1|D, M, X). Figure 2.2 represents the causal structure of this problem.

Figure 2.2: Causal paths between D, M, X, S and Y under conditional exogeneity and missing at random given pre-treatment covariates.

Assumptions A1-A3 are arguably very strong. It is natural to consider sensitivity analysis and to ask how important these assumptions are and to what extent the results are driven by particular assumptions. This is the research question of the present paper.

2.1.2 Main results

Suppose that we have doubts about Assumption A3, e.g. that there exists an important confounder U that jointly influences both the decision to work (S) and the wage itself (Y). In such a case the true propensity score q^{A3} = Pr(S = 1|D, M, X, U) is different from the one that we can estimate, p^{A3} = Pr(S = 1|D, M, X). This motivates the approach in this paper: consider all the possible true probabilities q^{A3} that are not too different from the observable probability p^{A3}. Ideally, the distance between the two should be controlled by an interpretable parameter. We consider the following relaxation via a parameter ε^{A3}:

|q^{A3} − p^{A3}| ≤ ε^{A3} √( p^{A3}(1 − p^{A3}) ).

The scaling term √( p^{A3}(1 − p^{A3}) ) serves two purposes. Firstly, it ensures that the distances between the true propensity scores and the estimated propensity scores are comparable across different values of p^{A3}, such as 0.99 or 0.5. Secondly, it is symmetric, which means that deviations from p^{A3} = 0.95 and from p^{A3} = 0.05 are treated similarly.

For the sake of exposition, we now consider the identification and estimation of E[Y(1, M(1))]. Under Assumptions A1-A3, we get that

E[Y(1, M(1))] = E[ Y · D · S / ( Pr(D = 1|X) · Pr(S = 1|D, M, X) ) ] = E[ Y · D · S / ( p^{A1} · p^{A3} ) ],

but if some confounder (or a vector of confounders) U implies violations of A1 and A3, then the true value of E[Y(1, M(1))] is instead

E[Y(1, M(1))] = E[ Y · D · S / ( Pr(D = 1|X, U) · Pr(S = 1|D, M, X, U) ) ] = E[ Y · D · S / ( q^{A1} · q^{A3} ) ],

where q^{A1} = Pr(D = 1|X, U) and ε^{A1} is defined analogously to ε^{A3}.
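Before turning to the resulting optimization problem, the following small sketch (hypothetical helper names; an illustration of the relaxation, not the paper's code) shows the band of admissible true propensity scores implied by a given ε:

```python
import numpy as np

def admissible_band(p_hat, eps):
    """Band of 'true' propensity scores q satisfying
    |q - p_hat| <= eps * sqrt(p_hat * (1 - p_hat)), intersected with [0, 1]."""
    p_hat = np.asarray(p_hat, dtype=float)
    half_width = eps * np.sqrt(p_hat * (1.0 - p_hat))
    return np.clip(p_hat - half_width, 0.0, 1.0), np.clip(p_hat + half_width, 0.0, 1.0)

# The same eps allows a larger absolute deviation near 0.5 than near 0.99:
lo, hi = admissible_band(np.array([0.50, 0.99]), eps=0.1)   # widths 0.10 and about 0.02
```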
It is therefore in principle possible to find bounds on E[Y(1, M(1))] by solving the following (population) optimization problem:

min/max over q^{A1}, q^{A3}:  E[Y(1, M(1))] = E[ Y · D · S / ( q^{A1} · q^{A3} ) ]
s.t.  |q^{A1} − p^{A1}| ≤ ε^{A1} √( p^{A1}(1 − p^{A1}) ),
      |q^{A3} − p^{A3}| ≤ ε^{A3} √( p^{A3}(1 − p^{A3}) ).

The finite sample counterpart is the following optimization problem:

min/max over q^{A1}_i, q^{A3}_i:  [ Σ_{i=1}^n Y_i · D_i · S_i / ( q^{A1}_i · q^{A3}_i ) ] / [ Σ_{i=1}^n D_i · S_i / ( q^{A1}_i · q^{A3}_i ) ]
s.t. for all i:  |q^{A1}_i − p̂^{A1}_i| ≤ ε^{A1} √( p̂^{A1}_i (1 − p̂^{A1}_i) ),
                 |q^{A3}_i − p̂^{A3}_i| ≤ ε^{A3} √( p̂^{A3}_i (1 − p̂^{A3}_i) ),
                 q^{A1}_i ∈ [0, 1],  q^{A3}_i ∈ [0, 1].

It is not immediately clear how this problem could be solved in practice, especially if the sample size is large. Fortunately, after some manipulations, this optimization problem can be shown to be equivalent to the following one, which is a linear program and therefore computationally attractive:

min/max over ω, t:  Σ_{i=1}^n Y_i · D_i · S_i · ω_i
s.t. for all i:  ω_i ≤ t / [ ( p̂^{A1}_i − ε^{A1} √( p̂^{A1}_i (1 − p̂^{A1}_i) ) ) · ( p̂^{A3}_i − ε^{A3} √( p̂^{A3}_i (1 − p̂^{A3}_i) ) ) ],
                 ω_i ≥ t / [ ( p̂^{A1}_i + ε^{A1} √( p̂^{A1}_i (1 − p̂^{A1}_i) ) ) · ( p̂^{A3}_i + ε^{A3} √( p̂^{A3}_i (1 − p̂^{A3}_i) ) ) ],
      Σ_{i=1}^n D_i · S_i · ω_i = 1,  ω_i ≥ 0,  t ≥ 0.

One important challenge remains: how to set the relaxation parameters ε^{A1} and ε^{A3} in a meaningful way. To obtain an interpretable value of ε^{A3}, consider the following approach. Suppose that we remove the most important predictor (e.g. in the sense of the reduction in deviance) in X from a logit regression of S on D, M, X, and denote the resulting predictions by p̂^{A3}_{i,X1}. This leads to the following scaled in-sample differences:

ε^{A3}_{i,X1} = | p̂^{A3}_{i,X1} − p̂^{A3}_i | / √( p̂^{A3}_i (1 − p̂^{A3}_i) ).

One way to set ε^{A3}_{X1} is to take the average of ε^{A3}_{i,X1} over the subpopulation with D_i = 1 and S_i = 1:

ε^{A3}_{X1} = Σ_{i=1}^n D_i · S_i · ε^{A3}_{i,X1} / Σ_{i=1}^n D_i · S_i.

In a similar way, ε^{A3}_{Xj} and ε^{A3}_{Mj} can be calculated based on the omission of the j-th most important predictor from X and M, respectively.
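For given relaxation parameters, the linear program above can be handed to an off-the-shelf solver. The following sketch (assuming SciPy's linprog and hypothetical argument names; it covers the bounds on E[Y(1, M(1))] only and assumes the lower edges of the propensity score bands stay strictly positive) shows one possible implementation, not the paper's actual code:

```python
import numpy as np
from scipy.optimize import linprog

def bounds_ey1m1(y, d, s, p_a1_hat, p_a3_hat, eps_a1, eps_a3):
    """Sketch of the linear program bounding E[Y(1, M(1))] when the A1 and A3
    propensity scores are only known up to |q - p_hat| <= eps * sqrt(p_hat*(1 - p_hat))."""
    y, d, s = (np.asarray(a, dtype=float) for a in (y, d, s))
    p1, p3 = np.asarray(p_a1_hat, dtype=float), np.asarray(p_a3_hat, dtype=float)
    n = len(y)

    s1, s3 = np.sqrt(p1 * (1 - p1)), np.sqrt(p3 * (1 - p3))
    lo = (p1 - eps_a1 * s1) * (p3 - eps_a3 * s3)   # smallest admissible q_A1 * q_A3
    hi = (p1 + eps_a1 * s1) * (p3 + eps_a3 * s3)   # largest admissible q_A1 * q_A3

    # Decision variables x = (omega_1, ..., omega_n, t), all nonnegative.
    c = np.append(y * d * s, 0.0)

    # omega_i - t / lo_i <= 0  and  -omega_i + t / hi_i <= 0 for every observation.
    A_ub = np.zeros((2 * n, n + 1))
    A_ub[:n, :n] = np.eye(n);   A_ub[:n, n] = -1.0 / lo
    A_ub[n:, :n] = -np.eye(n);  A_ub[n:, n] = 1.0 / hi
    b_ub = np.zeros(2 * n)

    # Normalization of the weights: sum_i D_i * S_i * omega_i = 1.
    A_eq = np.append(d * s, 0.0).reshape(1, -1)
    b_eq = np.array([1.0])

    sols = []
    for sign in (1.0, -1.0):   # minimize, then maximize the weighted outcome sum
        res = linprog(sign * c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (n + 1), method="highs")
        sols.append(sign * res.fun)
    return min(sols), max(sols)
```

The dense constraint matrix is used only for readability; for larger samples the same program would be set up in sparse form.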
2.1.3 Empirical application

The paper looks at the decomposition of the U.S. gender wage gap using data from the National Longitudinal Survey of Youth 1979. Huber and Solovyeva (2020b) considered five different wage decomposition techniques to examine the sensitivity of direct/indirect effect estimators; see also Huber (2015) for a discussion of identification issues in mediation analysis. The dataset consists of 6,658 observations; D is gender (0 for female, 1 for male) and Y is the logarithm of the average hourly wage during one year. The selection indicator S is set to one for people who worked at least 1,000 hours (about 80% of the sample). The vector of mediators M consists of variables that were measured after birth and before the treatment itself, such as education, employment variables, occupation, marital status, length of marriage, regional dummies, history of health problems and others. The conditioning covariates in X consist of variables that were determined prior to birth, such as race, religion, year of birth, birth order, parental place of birth and parental education. There still might exist confounders that we do not observe and that may cause violations of the exogeneity assumptions, such as risk preferences, attitudes towards competition, motivation or other socio-psychological factors.

Table 2.1 shows the importance of different predictors in the propensity score estimations. These are used to set the relaxation parameters, which correspond to the average deviation in the propensity score that omitting the 1st, 2nd or 3rd most important regressor from X or M would cause.

Assumption A1, P(D = 1|X)
  Most important X:  1st Mother's educ. missing;  2nd Mother's educ. high school graduate;  3rd Religion missing.
Assumption A2, P(D = 1|M, X)
  Most important M:  1st Farmer or laborer;  2nd Industry: professional services;  3rd Clerical occupation.
  Most important X:  1st White;  2nd Father's educ. college/more;  3rd Mother's educ. missing.
Assumption A3, P(S = 1|D, M, X)
  Most important M:  1st Employed full time;  2nd Employment status: employed;  3rd Operator (machines, transport).
  Most important X:  1st Father's educ. college/more;  2nd Mother's educ. some college;  3rd Protestant.

Table 2.1: Covariates and mediators with the highest predictive power in the propensity score estimations, measured by the change in deviance.

Considering the male (or female) wages as the reference, one can take Δ = θ(0) + δ(1) (or Δ = θ(1) + δ(0)) as the preferred decomposition into direct (unexplained) and indirect (explained) effects; see Sloczynski (2013) for a detailed discussion. Several interesting lessons can be learnt: the bounds are in general not symmetric around the point-identified effect. Also, the most important regressor in terms of prediction in the propensity score estimation need not be the one that leads to the widest bounds. By applying the proposed methods, we find that the omission of a confounder that has the same predictive power as the first or second most important mediator entering the treatment propensity score would render all the natural indirect effects for males insignificant. The choice of the link function does not matter in most of the specifications. Most importantly, the bounds for the natural direct effects (for males and females), i.e. the unexplained component of the gender wage gap decomposition, do not include zero and are highly significant for most of the specifications, except for some of those where an important mediator was missing, which leads to a violation of conditional mediator exogeneity. This provides robust evidence of the existence of an unexplained gender wage gap.

2.1.4 Conclusions

This paper presented a computationally feasible method to study sensitivity to the identifying assumptions in a mediation analysis with a sample selection problem. The method was applied to the NLSY 1979 dataset to study the gender wage gap decomposition (outcomes are only observable for those who work), where the results are sensitive to non-ignorable mediator selection and to sample selection. There are different ways in which such a sensitivity analysis may be conducted. There is a trade-off between an assumption relaxation that is elegant and easy to interpret and a method that remains computationally feasible. This paper attempted to strike a fine balance between the two.

2.2 Sensitivity of the bounds on the ATE in the presence of sample selection (Lafférs and Nedela 2017)

In many empirical applications, the outcome is not observed for a specific subpopulation. Ignoring this may lead to misleading results if the missingness is not random, a problem called sample selection bias. There are different ways in which sample selection bias can be addressed. The earlier literature was built around restrictive parametric models (Heckman (1979)), and typically some exogenous variation in the outcome missingness mechanism is needed to identify causal treatment effects.
Lee (2009), an influential paper, took a different approach. Lacking an external instrument and not willing to impose a parametric structure on the problem, Lee estimated only bounds on the treatment effects instead. This note builds on these results. It shows how bounding the average treatment effect under sample selection (Lee (2009)) can be reformulated as an optimization problem. This gives insights into the original problem, but more importantly it provides an opportunity to conduct sensitivity analysis with respect to violations of the identifying assumptions of treatment exogeneity and monotonicity, which are relaxed in a novel, interpretable manner. The note contributes to the literature at the crossroads of identification and linear programming, which was pioneered by Balke and Pearl (1997) and later explored by many others (e.g. Honoré and Tamer (2006)). Formulating identification as an optimization problem is a principled approach that has also been explored in Lafférs (2019).

2.2.1 Setup and main results

Let Y be the outcome, observed only if S = 1, let D be a binary treatment, and let us use the potential outcome notation (Rubin (1974)). Lee (2009) studied the effect of the U.S. Job Corps training program on wages, where wages are observed only for those who work. The identifying assumptions are:²

(C1) Compatibility with the observed data: Y = S · ( Y1 D + Y0 (1 − D) ) and S = S1 D + S0 (1 − D);
(C2) Treatment exogeneity: D is statistically independent of (Y1, Y0, S1, S0);
(C3) Monotonicity of selection in D: S1 ≥ S0 with probability 1.

2. While (C1) is only implicitly assumed in Lee (2009), for the approach in this paper it is useful to state it explicitly.

Assumption (C2) is satisfied if the treatment is randomized, and Assumption (C3) states that the treatment cannot have a negative effect on selection. In his study, Lee (2009) derived analytical formulas for the quantity

E[ Y1 − Y0 | S1 = 1, S0 = 1 ],

that is, the average treatment effect for the "always-takers", those who would be employed in both the treated and the non-treated state, thus capturing the pure wage effect.

This note views the identification problem as an optimization problem that searches through the space of probability distributions. Let Φ be the set of all probability distributions π of U = (Y1, Y0, S1, S0, D) that satisfy (C1)-(C3):

Φ = { π ∈ Π(U) : π satisfies (C1)-(C3) }.

The problem of bounding the average treatment effect of the always-takers is equivalent to finding the probability distributions π that solve the following optimization problem:

min/max over π ∈ Φ:  E_π[ Y1 − Y0 | S1 = 1, S0 = 1 ],

where the dependence on π was made explicit. Condition (C1) states that π has to be compatible with the observable distribution of (Y, S, D). This optimization problem can be formulated as a linear program, and it replicates the analytical formulas derived in Lee (2009).
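Lee's analytical formulas, which the linear program reproduces, take a simple trimming form. The following minimal sketch (hypothetical variable names; sample analogues under (C1)-(C3), assuming the observed selection rate is higher in the treated group) illustrates them and is not the note's own implementation:

```python
import numpy as np

def lee_bounds(y, d, s):
    """Sample analogues of Lee's (2009) trimming bounds on
    E[Y1 - Y0 | S1 = 1, S0 = 1]; the outcome y is only used where s == 1."""
    y, d, s = (np.asarray(a) for a in (y, d, s))
    q1, q0 = s[d == 1].mean(), s[d == 0].mean()   # Pr(S=1|D=1), Pr(S=1|D=0)
    p = (q1 - q0) / q1                            # share of selected treated units to trim
    y1 = y[(d == 1) & (s == 1)]
    y0_mean = y[(d == 0) & (s == 1)].mean()
    lower = y1[y1 <= np.quantile(y1, 1 - p)].mean() - y0_mean   # trim the top p share
    upper = y1[y1 >= np.quantile(y1, p)].mean() - y0_mean       # trim the bottom p share
    return lower, upper
```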
Suppose that a researcher is concerned about the validity of Assumption (C2). The motivation is that even under randomization there is often a non-response/sample attrition problem. In the dataset of Lee (2009) the non-response rate is over 40%, and if the sample attrition is correlated with the treatment assignment and the outcome, then Assumption (C2) is violated. Furthermore, Assumption (C3) could also be violated if treatment group members who undergo training choose to wait for a better job.

We can relax the exogeneity and monotonicity assumptions using relaxation parameters α_E and α_M in the following way:

(rC2) Relaxed exogeneity: the total variation distance between the distributions π(·|D = 0) and π(·|D = 1) is smaller than α_E: TV( π(·|D = 0), π(·|D = 1) ) ≤ α_E.
(rC3) Relaxed monotonicity: no more than a proportion α_M of the population violates monotonicity: π(S1 ≥ S0) ≥ 1 − α_M.

The total variation distance is the maximum absolute difference between π(A|D = 0) and π(A|D = 1) across all possible events A; therefore (rC2) puts an upper bound on how different these two distributions can be. Reformulating the relaxed problem as a tractable optimization problem is not straightforward, because the share of always-takers π(S1 = 1, S0 = 1) is no longer identified under the relaxed assumptions. It can, however, be treated as an additional unknown free variable in the optimization. With some manipulation and reparametrization, this relaxed problem can also be formulated as a linear program.
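As a small illustration of the quantity constrained in (rC2): for distributions over a common, discretized support, the total variation distance reduces to half of the L1 distance between the probability vectors. The following is a minimal sketch with hypothetical names, not the note's implementation:

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions given as
    probability vectors over the same support: the largest difference in
    probability that the two assign to any event."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# (rC2) with alpha_E = 0.05 would then read:
# total_variation(pi_given_d0, pi_given_d1) <= 0.05
```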
2.2.2 Empirical application

We apply this methodology to the original Lee (2009) dataset and consider relaxations of exogeneity and monotonicity simultaneously. The outcome variable was discretized³ and subsampling was used for confidence bounds (Politis, Romano, and Wolf (1999)). A mild deviation from the identifying assumptions, α_E = α_M = 0.01, doubles the width of the identified set, while larger deviations, α_E = α_M = 0.05, lead to very wide bounds, where the lower bound drops from -1.7% to -29%. This suggests that the results are very sensitive to the identifying assumptions.

3. The discretization for α_E = α_M = 0, where we have analytical formulas, led to an error of order only 10^{-5}.

Table 2.2: Bounds on the average treatment effect of the Job Corps program on log hourly wages for individuals who would be employed regardless of treatment status. Each cell reports [lower bound, upper bound], with 90% confidence bounds in parentheses.

                       α_M = 0.00            α_M = 0.01            α_M = 0.05
α_E = 0.00   [-0.0171, 0.0931]     [-0.0545, 0.1217]     [-0.1286, 0.2036]
             (-0.0252, 0.1043)     (-0.0663, 0.1333)     (-0.1431, 0.2179)
α_E = 0.01   [-0.0539, 0.1270]     [-0.0871, 0.1541]     [-0.1607, 0.2359]
             (-0.0664, 0.1370)     (-0.0985, 0.1667)     (-0.1752, 0.2518)
α_E = 0.05   [-0.1821, 0.2540]     [-0.2113, 0.2807]     [-0.2893, 0.3641]
             (-0.2009, 0.2688)     (-0.2266, 0.2952)     (-0.3090, 0.3796)

2.2.3 Conclusions

We provide a method for sensitivity analysis of bounds on the ATE under sample selection. The assumptions can be relaxed simultaneously using interpretable parameters. The method can be applied whenever Lee (2009) bounds are estimated.

2.3 Limitations and future research avenues

While the methods presented in this chapter can give interesting insights in empirical practice, applied researchers are often hesitant to try new methods, especially if they are computationally based. Providing well-tuned implementations in popular software packages, coupled with detailed documentation, may bring these methods closer to practitioners. The sensitivity method presented in Huber and Lafférs (2022) can readily be applied to any estimator that is based on inverse propensity score weighting, which potentially extends its scope of usefulness. The total variation distance metric that was used to relax exogeneity is both interpretable and computationally convenient, and it has the potential to be used in different contexts too.

Chapter 3
Authorship contribution statements

The author lists are in alphabetical order.

Causal mediation analysis with double machine learning (Farbmacher, Huber, Lafférs, Langen, and Spindler 2022)
Quantitative authorship contribution of Lukáš Lafférs: 20%.
• Corresponding author: Henrika Langen
• Conceptualization: Helmut Farbmacher, Martin Huber, Henrika Langen, Lukáš Lafférs, Martin Spindler
• Methodology: Helmut Farbmacher, Martin Huber, Henrika Langen, Lukáš Lafférs, Martin Spindler
• Software: Martin Huber, Henrika Langen
• Data curation: Martin Huber, Henrika Langen
• Data analysis: Martin Huber, Henrika Langen
• Writing (original draft): Helmut Farbmacher, Martin Huber, Henrika Langen, Lukáš Lafférs, Martin Spindler
• Writing (review): Helmut Farbmacher, Martin Huber, Henrika Langen, Lukáš Lafférs, Martin Spindler

Evaluating (weighted) dynamic treatment effects by double machine learning (Bodory, Huber, and Lafférs 2022)
Quantitative authorship contribution of Lukáš Lafférs: 33%.
• Corresponding author: Lukáš Lafférs
• Conceptualization: Hugo Bodory, Martin Huber, Lukáš Lafférs
• Methodology: Hugo Bodory, Martin Huber, Lukáš Lafférs
• Software: Hugo Bodory, Martin Huber
• Data curation: Hugo Bodory, Martin Huber
• Data analysis: Hugo Bodory
• Writing (original draft): Hugo Bodory, Martin Huber, Lukáš Lafférs
• Writing (review): Hugo Bodory, Martin Huber, Lukáš Lafférs

Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition (Huber and Lafférs 2022)
Quantitative authorship contribution of Lukáš Lafférs: 50%.
• Corresponding author: Lukáš Lafférs
• Conceptualization: Martin Huber, Lukáš Lafférs
• Methodology: Martin Huber, Lukáš Lafférs
• Software: Lukáš Lafférs
• Data curation: Martin Huber
• Data analysis: Lukáš Lafférs
• Writing (original draft): Martin Huber, Lukáš Lafférs
• Writing (review): Martin Huber, Lukáš Lafférs

Sensitivity of the bounds on the ATE in the presence of sample selection (Lafférs and Nedela 2017)
Quantitative authorship contribution of Lukáš Lafférs: 50%.
• Corresponding author: Lukáš Lafférs
• Conceptualization: Lukáš Lafférs, Roman Nedela
• Methodology: Lukáš Lafférs, Roman Nedela
• Software: Lukáš Lafférs, Roman Nedela
• Data analysis: Lukáš Lafférs, Roman Nedela
• Writing (original draft): Lukáš Lafférs, Roman Nedela
• Writing (review): Lukáš Lafférs

Bibliography

Athey, Susan. 2019. "The Impact of Machine Learning on Economics." In The Economics of Artificial Intelligence, 507–552. University of Chicago Press.
Athey, Susan, and Guido W. Imbens. 2019. "Machine learning methods that economists should know about." Annual Review of Economics 11:685–725.
Athey, Susan, Guido W. Imbens, and Stefan Wager. 2018. "Approximate residual balancing: debiased inference of average treatment effects in high dimensions." Journal of the Royal Statistical Society Series B 80:597–623.
Balke, Alexander, and Judea Pearl. 1997. "Bounds on treatment effects from studies with imperfect compliance." Journal of the American Statistical Association 92 (439): 1171–1176.
Blackwell, Matthew, and Anton Strezhnev. 2020. "Telescope Matching for Reducing Model Dependence in the Estimation of the Effects of Time-varying Treatments: An Application to Negative Advertising." Working paper, Harvard University.
Bodory, Hugo, Martin Huber, and Lukáš Lafférs. 2022. "Evaluating (weighted) dynamic treatment effects by double machine learning." The Econometrics Journal 25 (3): 628–648.
Bühlmann, Peter, and Sara Van De Geer. 2011. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
Bureau of Labor Statistics, U.S. Department of Labor. 2001. National Longitudinal Survey of Youth 1979 cohort, 1979-2000 (rounds 1-19). Produced and distributed by the Center for Human Resource Research, The Ohio State University. Columbus, OH.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal 21 (1).
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2020. "A crash course in good and bad controls." Sociological Methods & Research, 00491241221099552.
Claeskens, Gerda, Nils Lid Hjort, et al. 2008. "Model selection and model averaging." Cambridge Books.
Cochran, William G. 1957. "Analysis of Covariance: Its Nature and Uses." Biometrics 13:261–281.
Conti, Gabriella, James J. Heckman, and Rodrigo Pinto. 2016. "The Effects of Two Influential Early Childhood Interventions on Health and Healthy Behaviour." The Economic Journal 126:F28–F65.
Farbmacher, Helmut, Martin Huber, Lukáš Lafférs, Henrika Langen, and Martin Spindler. 2022. "Causal mediation analysis with double machine learning." The Econometrics Journal 25 (2): 277–300.
Flores, Carlos A., and Alfonso Flores-Lagunes. 2009. "Identification and Estimation of Causal Mechanisms and Net Effects of a Treatment under Unconfoundedness." IZA DP No. 4237.
Goldsmith-Pinkham, Paul, Peter Hull, and Michal Kolesár. 2022. Contamination bias in linear regressions. Technical report. National Bureau of Economic Research.
Heckman, James, Rodrigo Pinto, and Peter Savelyev. 2013. "Understanding the Mechanisms Through Which an Influential Early Childhood Program Boosted Adult Outcomes." American Economic Review 103:2052–2086.
Heckman, James J. 1979. "Sample Selection Bias as a Specification Error." Econometrica 47 (1): 153–161.
Hong, Guanglei, Xu Qin, and Fan Yang. 2018. "Weighting-Based Sensitivity Analysis in Causal Mediation Studies." Journal of Educational and Behavioral Statistics 43:32–56.
Honoré, Bo E, and Elie Tamer. 2006. "Bounds on parameters in panel dynamic discrete choice models." Econometrica 74 (3): 611–629.
Huber, Martin. 2015. "Causal pitfalls in the decomposition of wage gaps." Journal of Business and Economic Statistics 33:179–191.
Huber, Martin, and Lukáš Lafférs. 2022. "Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition." Econometric Reviews 41 (10): 1141–1163.
Huber, Martin, Michael Lechner, and Giovanni Mellace. 2017. "Why Do Tougher Caseworkers Increase Employment? The Role of Program Assignment as a Causal Mechanism." The Review of Economics and Statistics 99:180–183.
Huber, Martin, and Anna Solovyeva. 2020a. "Direct and indirect effects under sample selection and outcome attrition." Econometrics 8 (4): 44.
———. 2020b. "On the sensitivity of wage gap decompositions." Journal of Labor Research 41:1–33.
Imai, Kosuke, Luke Keele, and Teppei Yamamoto. 2010. "Identification, Inference and Sensitivity Analysis for Causal Mediation Effects." Statistical Science 25:51–71.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Springer.
Judd, C M, and D A Kenny. 1981. "Process Analysis: Estimating Mediation in Treatment Evaluations." Evaluation Review 5:602–619.
Keele, Luke, Dustin Tingley, and Teppei Yamamoto. 2015.
"Identifying mechanisms behind policy interventions via causal mediation analysis." Journal of Policy Analysis and Management 34:937–963.
Lafférs, Lukáš. 2019. "Identification in models with discrete variables." Computational Economics 53 (2): 657–696.
Lafférs, Lukáš, and Roman Nedela. 2017. "Sensitivity of the bounds on the ATE in the presence of sample selection." Economics Letters 158:84–87.
Lechner, M. 2009. "Sequential Causal Models for the Evaluation of Labor Market Programs." Journal of Business and Economic Statistics 27:71–83.
Lechner, Michael, and Ruth Miquel. 2010. "Identification of the effects of dynamic treatments by sequential conditional independence assumptions." Empirical Economics 39:111–137.
Lee, David S. 2009. "Training, wages, and sample selection: Estimating sharp bounds on treatment effects." The Review of Economic Studies 76 (3): 1071–1102.
Leeb, Hannes, and Benedikt M Pötscher. 2005. "Model selection and inference: Facts and fiction." Econometric Theory 21 (1): 21–59.
Lewis, Greg, and Vasilis Syrgkanis. 2020. "Double/Debiased Machine Learning for Dynamic Treatment Effects." arXiv preprint 2002.07285.
Maciosek, Michael V, Ashley B Coffield, Thomas J Flottemesch, Nichol M Edwards, and Leif I Solberg. 2010. "Greater use of preventive services in US health care could save lives at little or no cost." Health Affairs 29 (9): 1656–1660.
Manski, Charles F. 2003. Partial identification of probability distributions. Vol. 5. Springer.
Mullainathan, Sendhil, and Jann Spiess. 2017. "Machine learning: an applied econometric approach." Journal of Economic Perspectives 31 (2): 87–106.
Newey, Whitney K. 1994. "The asymptotic variance of semiparametric estimators." Econometrica: Journal of the Econometric Society, 1349–1382.
Neyman, Jerzy. 1959. "Optimal asymptotic tests of composite hypotheses." Probability and statistics, 213–234.
Pearl, J. 2001. "Direct and indirect effects." In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 411–420. San Francisco: Morgan Kaufmann.
Pearl, Judea. 2009. Causality. Cambridge University Press.
Petersen, M L, S E Sinisi, and M J van der Laan. 2006. "Estimation of Direct Causal Effects." Epidemiology 17:276–284.
Politis, DN, JP Romano, and M Wolf. 1999. Subsampling. Springer.
Robins, J M. 1986. "A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect." Mathematical Modelling 7:1393–1512.
———. 1998. "Marginal Structural Models." In 1997 Proceedings of the American Statistical Association, Section on Bayesian Statistical Science, 1–10.
———. 2003. "Semantics of causal DAG models and the identification of direct and indirect effects." In Highly Structured Stochastic Systems, edited by P.J. Green, N.L. Hjort, and S. Richardson, 70–81. Oxford: Oxford University Press.
Robins, J M, and Sander Greenland. 1992. "Identifiability and Exchangeability for Direct and Indirect Effects." Epidemiology 3:143–155.
Robins, J M, M A Hernan, and B Brumback. 2000. "Marginal Structural Models and Causal Inference in Epidemiology." Epidemiology 11:550–560.
Robins, J. M., S. Greenland, and F.-C. Hu. 1999. "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome." Journal of the American Statistical Association 94:687–700.
Robins, J. M., A. Rotnitzky, and L.P. Zhao. 1994.
"Estimation of Regression Coefficients When Some Regressors Are Not Always Observed." Journal of the American Statistical Association 89:846–866.
Robins, James M, and Andrea Rotnitzky. 1995. "Semiparametric efficiency in multivariate regression models with missing data." Journal of the American Statistical Association 90 (429): 122–129.
Rosenbaum, P. 1984. "The consequences of adjustment for a concomitant variable that has been affected by the treatment." Journal of the Royal Statistical Society, Series A 147:656–666.
Rubin, D B. 1974. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology 66:688–701.
Sloczynski, T. 2013. "Population average gender effects." IZA Discussion Paper No. 7315.
Tamer, Elie. 2010. "Partial identification in econometrics." Annual Review of Economics 2 (1): 167–195.
Tchetgen Tchetgen, E. J., and I. Shpitser. 2011. "Semiparametric Estimation of Models for Natural Direct and Indirect Effects." Harvard University Biostatistics Working Paper 129.
———. 2012. "Semiparametric theory for causal mediation analysis: Efficiency bounds, multiple robustness, and sensitivity analysis." The Annals of Statistics 40:1816–1845.
Van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. "Super learner." Statistical Applications in Genetics and Molecular Biology 6 (1).
VanderWeele, Tyler J. 2009. "Marginal Structural Models for the Estimation of Direct and Indirect Effects." Epidemiology 20:18–26.
Vansteelandt, S., and T. J. VanderWeele. 2012. "Natural direct and indirect effects on the exposed: effect decomposition under weaker assumptions." Biometrics 68:1019–1027.
Varian, Hal R. 2014. "Big data: New tricks for econometrics." Journal of Economic Perspectives 28 (2): 3–28.
Viviano, Davide, and Jelena Bradic. 2021. "Dynamic covariate balancing: estimating treatment effects over time." arXiv preprint 2103.01280.
Wager, Stefan, and Susan Athey. 2018. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association 113 (523): 1228–1242.

Appendix A
Essays

A.1 Causal mediation analysis with double machine learning (Farbmacher, Huber, Lafférs, Langen, and Spindler 2022)
A.2 Evaluating (weighted) dynamic treatment effects by double machine learning (Bodory, Huber, and Lafférs 2022)
A.3 Bounds on direct and indirect effects under treatment/mediator endogeneity and outcome attrition (Huber and Lafférs 2022)
A.4 Sensitivity of the bounds on the ATE in the presence of sample selection (Lafférs and Nedela 2017)