Econometrics
Endogenous Regressors and Instrumental Variables
Anna Donina
Lecture 7

Endogeneity Problem
• An endogenous variable is one that is correlated with the error term u
• An exogenous variable is one that is uncorrelated with u
• Intuition behind the bias:
  ▪ If an explanatory variable x and the error term ε (written u elsewhere in these slides) are correlated with each other, the OLS estimate attributes to x some of the variation in y that actually came from the error term ε
• In IV regression, we focus on the case in which X is endogenous and there is an instrument, Z, which is exogenous.

Digression on terminology: “Endogenous” literally means “determined within the system.” If X is jointly determined with Y, then a regression of Y on X is subject to simultaneous causality bias. But this definition of endogeneity is too narrow, because IV regression can also be used to address omitted variable (OV) bias and errors-in-variables bias. Thus we use the broader definition of endogeneity above.

Endogeneity Problem
• Omitted variable bias from a variable that is correlated with X but is unobserved and for which there are inadequate control variables;
• Selection bias: an unobservable characteristic influences both the dependent and the independent variables;
• Measurement error bias (X is measured with error);
• Simultaneous causality bias (X causes Y, Y causes X).
All of these problems cause X to be endogenous, E(u|X) ≠ 0

Selection Bias
• Very similar to omitted variable bias;
• We suppose there is some unobservable characteristic that influences both the level of the dependent variable y and the level of the explanatory variable x;
• This unobservable characteristic forms part of the error term ε, causing cov(ε, x) ≠ 0 (in the same manner as an omitted variable);
• Example: surveying only non-smoking mothers when inferring the impact of the number of prenatal visits on the birth weight of children.
• Smoking affects both the number of prenatal visits and the birth weight

Simultaneity
• Occurs in models where variables are jointly determined
• Intuitively: a change in y1i will cause a change in y2i, which in turn will cause y1i to change again
• Technically:
  y1i = α0 + α1y2i + ε1i
  y2i = β0 + β1y1i + ε2i

• The endogeneity problem is endemic in social sciences/economics
  ▪ In many cases important personal variables cannot be observed (examples?)
  ▪ These are often correlated with observed explanatory information
  ▪ In addition, measurement error may also lead to endogeneity
• Solutions to endogeneity problems:
  ▪ Proxy variables method for omitted regressors
  ▪ Fixed effects methods if: 1) panel data is available, 2) endogeneity is time-constant, and 3) regressors are not time-constant
  ▪ Instrumental variables method (IV)
• IV is the most well-known method to address endogeneity problems

Endogeneity Problem

• Answer to the situation when Cov(x, ε) ≠ 0
• An instrumental variable (or instrument) should be a variable z such that
  1. z is uncorrelated with the error term: Cov(z, ε) = 0
  2. z is correlated with the explanatory variable x: Cov(x, z) ≠ 0
• Intuition behind the instrumental variables approach:
  ▪ project the endogenous variable x on the instrument z;
  ▪ this projection is uncorrelated with the error term and can be used as an explanatory variable instead of x

Instrumental Variables (IV)

Yi = β0 + β1Xi + ui
• IV regression breaks X into two parts: a part that might be correlated with u, and a part that is not. By isolating the part that is not correlated with u, it is possible to estimate β1.
• This is done using an instrumental variable, Zi, which is correlated with Xi but uncorrelated with ui.

Instrumental Variables

• Properties of IV with a poor instrumental variable
  ▪ IV may be much more inconsistent than OLS if the instrumental variable is not completely exogenous and only weakly related to x
  ▪ The variance of the IV estimator is always (!) greater than the variance of the OLS estimator!
  ▪ IV is worse than OLS if: [formula not recovered from the slide]
  ▪ There is no problem if the instrumental variable is really exogenous. If it is not, the asymptotic bias is larger the weaker the instrument's correlation with x.

Instrumental Variables

• IV estimation in the multiple regression model
• Conditions for an instrumental variable:
  1) It does not appear in the regression equation
  2) It is uncorrelated with the error term
  3) It is partially correlated with the endogenous explanatory variable
[Equations not recovered: the regression equation with its endogenous and exogenous explanatory variables labeled, and the so-called "reduced form regression"]
• Reduced form regression: in a regression of the endogenous explanatory variable on all exogenous variables, the instrumental variable must have a nonzero coefficient.

Instrumental Variables

Two Stage Least Squares: 2SLS
As the name suggests, TSLS has two stages, i.e. two regressions:
1. Isolate the part of X that is uncorrelated with u by regressing X on Z using OLS:
   $X_i = \pi_0 + \pi_1 Z_i + v_i$   (1)
   • Because Zi is uncorrelated with ui, π0 + π1Zi is uncorrelated with ui. We don’t know π0 or π1, but we have estimated them, so…
   • Compute the predicted values of Xi: $\hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i$
2. Replace Xi by $\hat{X}_i$ in the regression of interest: regress Y on $\hat{X}_i$ using OLS:
   $Y_i = \beta_0 + \beta_1 \hat{X}_i + u_i$   (2)

Two Stage Least Squares: 2SLS
• Because $\hat{X}_i$ is uncorrelated with ui, the first least squares assumption holds for regression (2). (This requires n to be large so that π0 and π1 are precisely estimated.)
• Thus, in large samples, β1 can be estimated by OLS using regression (2)
• The resulting estimator is called the Two Stage Least Squares (TSLS) estimator, $\hat{\beta}_1^{TSLS}$.

Two Stage Least Squares: 2SLS
Suppose Zi satisfies the two conditions for a valid instrument:
1. Instrument relevance: corr(Zi, Xi) ≠ 0
2. Instrument exogeneity: corr(Zi, ui) = 0
Two-stage least squares:
Stage 1: Regress Xi on Zi (including an intercept) and obtain the predicted values $\hat{X}_i$.
Stage 2: Regress Yi on $\hat{X}_i$ (including an intercept); the coefficient on $\hat{X}_i$ is the TSLS estimator, $\hat{\beta}_1^{TSLS}$.
$\hat{\beta}_1^{TSLS}$ is a consistent estimator of β1.
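The two stages described above can be sketched numerically. The following is a minimal illustration, not part of the original slides: the data are simulated, and the names `ols`, `tsls`, `w` and all numeric values (sample size, coefficients, seed) are assumptions chosen only for the demonstration. The unobserved factor `w` enters both X and u, which makes X endogenous, while Z is relevant and exogenous by construction.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients of y on the columns of X (X must already contain a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def tsls(y, x, z):
    """Two-stage least squares with one endogenous regressor x and one instrument z."""
    const = np.ones(len(y))
    Z = np.column_stack([const, z])
    # Stage 1: regress X on Z (with intercept) and keep the predicted values X_hat.
    pi_hat = ols(x, Z)
    x_hat = Z @ pi_hat
    # Stage 2: regress Y on X_hat (with intercept); the slope is the TSLS estimate.
    return ols(y, np.column_stack([const, x_hat]))


# Simulated data (assumption): w is an unobserved factor that enters both x and u.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)                            # instrument
w = rng.normal(size=n)                            # unobserved confounder
x = 1.0 + 0.8 * z + w + 0.5 * rng.normal(size=n)  # relevance: x depends on z
u = w + rng.normal(size=n)                        # exogeneity: u does not depend on z
y = 1.0 + 2.0 * x + u                             # true beta_1 = 2

beta_ols = ols(y, np.column_stack([np.ones(n), x]))
beta_tsls = tsls(y, x, z)
# With one instrument, the TSLS slope equals the simple IV ratio cov(z, y) / cov(z, x).
iv_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print("OLS slope: ", beta_ols[1])
print("TSLS slope:", beta_tsls[1])
print("cov(z, y) / cov(z, x):", iv_ratio)
```

In this design X and u are positively correlated, so the OLS slope should lie noticeably above the true value of 2, while the TSLS slope should be close to 2 and coincide (up to rounding) with the covariance ratio.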
• Estimating the impact of education on the number of children for a sample of women in Botswana (OLS)

Example
• Education may be endogenous: both education and the number of children may be influenced by some unobserved socioeconomic factors
• Omitted variable bias: family background is an unobserved factor that influences both the number of children and years of education
• Finding a possible instrument:
  ▪ Something that explains education
  ▪ But is not correlated with the family background
  ▪ A dummy variable (here: being born in the first half of the year)

Example
• The first condition (the instrument explains education):
  ▪ The school year in Botswana starts in January
  ▪ Thus, women born in the first half of the year start school when they are at least six and a half.
  ▪ Schooling is compulsory until the age of 15
  ▪ Thus, women born in the first half of the year get less education if they leave school at the age of 15.
• The second condition (the instrument is uncorrelated with the error term):
  ▪ Being born in the first half of the year is uncorrelated with the unobserved socioeconomic factors that influence education and the number of children (family background etc.)

Example: Intuition behind the IV
[slide content not recovered]

Example: 2SLS
[slide content not recovered]
• Compare the estimates:
  ▪ OLS: [not recovered]
  ▪ 2SLS: [not recovered]

Example

• Why does Two Stage Least Squares work?
  ▪ All variables in the second stage regression are exogenous, because the endogenous variable has been replaced by a prediction based only on exogenous information;
  ▪ By using the prediction based on exogenous information, the endogenous variable is purged of its endogenous part (the part that is related to the error term)
• Properties of Two Stage Least Squares
  ▪ The standard errors from the OLS second stage regression are wrong, because they are computed from residuals that use the first-stage prediction instead of the observed endogenous variable. However, it is not difficult to compute correct standard errors.
  ▪ If there is one endogenous variable and one instrument, then 2SLS = IV
  ▪ 2SLS estimation can also be used if there is more than one endogenous variable and at least as many instruments

Two Stage Least Squares: 2SLS

Statistical properties of 2SLS/IV estimation
• Under assumptions completely analogous to those for OLS, but conditioning on the instruments z rather than on the regressors x, 2SLS/IV is consistent and asymptotically normal
• 2SLS/IV is typically much less precise, because there is more multicollinearity and less explanatory variation in the second stage regression
• Corrections for heteroscedasticity are analogous to OLS
• 2SLS/IV easily extends to time series and panel data situations

Two Stage Least Squares: 2SLS
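The remarks on standard errors and on the lower precision of 2SLS/IV can be illustrated with a short sketch. It is not taken from the slides: the simulated data, the helper names `fit_ols` and `homoskedastic_se`, and the homoskedastic variance formula with an n − k degrees-of-freedom correction are assumptions made for the example. The key point is that the correct residuals use the observed X, whereas the residuals printed by a manual second-stage OLS use the first-stage prediction.

```python
import numpy as np


def fit_ols(y, X):
    """OLS coefficients of y on the columns of X (X includes a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def homoskedastic_se(resid, X):
    """Standard errors from sigma^2 * (X'X)^{-1}, with sigma^2 = resid'resid / (n - k)."""
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))


# Simulated data (assumption): w makes x endogenous, z is a valid instrument.
rng = np.random.default_rng(1)
n = 20_000
z = rng.normal(size=n)
w = rng.normal(size=n)
x = 1.0 + 0.8 * z + w + 0.5 * rng.normal(size=n)
u = w + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

const = np.ones(n)
X = np.column_stack([const, x])          # observed regressors
Z = np.column_stack([const, z])          # instruments
x_hat = Z @ fit_ols(x, Z)                # first-stage prediction
X_hat = np.column_stack([const, x_hat])  # second-stage regressors

beta_tsls = fit_ols(y, X_hat)

# Naive: residuals of the second-stage OLS, built from the prediction x_hat.
se_naive = homoskedastic_se(y - X_hat @ beta_tsls, X_hat)
# Correct: residuals built from the observed x, same (X_hat'X_hat)^{-1} matrix.
se_correct = homoskedastic_se(y - X @ beta_tsls, X_hat)

# OLS standard errors for comparison (OLS is biased in this design).
beta_ols = fit_ols(y, X)
se_ols = homoskedastic_se(y - X @ beta_ols, X)

print("naive second-stage SE:", se_naive[1])
print("corrected 2SLS SE:    ", se_correct[1])
print("OLS SE:               ", se_ols[1])
```

In this particular simulation the naive second-stage standard error overstates the uncertainty; in other designs it can err in the other direction. The corrected 2SLS standard error is still larger than the OLS one, in line with the statement that 2SLS/IV is typically less precise.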