LECTURE 10
Introduction to Econometrics
Endogeneity
Hieu Nguyen
Fall semester, 2024

A LITTLE REVISION: OLS CLASSICAL ASSUMPTIONS
1. Linearity: the regression model is linear in the parameters (coefficients)
2. Random sampling: the data are a random sample drawn from the population, and each data point follows the population equation
3. No perfect collinearity: no explanatory variable is constant across observations, and no explanatory variable is a perfect linear function of any other explanatory variable(s)
4. Zero conditional mean: the values of the explanatory variables contain no information about the mean of the unobserved factors, so the explanatory variables are uncorrelated with the error term
5. Homoskedasticity: the error term has a constant variance
6. Normality of the error term: the error term is normally distributed

ON PREVIOUS LECTURES
►We discussed what happens if some of the assumptions are violated
►Linearity in the coefficients and no perfect multicollinearity are essential for the definition of the OLS estimator
►Zero mean of the error term is always ensured by the inclusion of an intercept
►Normality of the error term is needed for statistical inference, but it can be shown that if the number of observations is sufficiently high, the OLS estimator is asymptotically normally distributed even if the stochastic error term is not normal
►Heteroskedasticity leads to incorrect statistical inference, but we have studied tests to detect it and techniques to overcome this problem

ON TODAY'S LECTURE
►The assumption of no correlation between the explanatory variables and the error term is crucial
►Variables that are correlated with the error term are called endogenous variables (as opposed to exogenous variables)
►We will show that the estimated coefficients of endogenous variables are inconsistent and biased
►We will explain in which situations we may encounter endogenous variables
►We will define the concept of instrumental variables
►We will derive the 2SLS technique to deal with endogeneity

ENDOGENOUS VARIABLES
►Notation: $x$ is endogenous if $\mathrm{Cov}(x, \varepsilon) \neq 0$ and exogenous if $\mathrm{Cov}(x, \varepsilon) = 0$
►Intuition behind the bias: if an explanatory variable $x$ and the error term $\varepsilon$ are correlated with each other, the OLS estimate attributes to $x$ some of the variation in $y$ that actually came from the error term $\varepsilon$
►Example: analysis of household consumption patterns
  Households with lower income may report higher consumption (because of shame)
►Leads to biased and inconsistent estimates

GRAPHICAL REPRESENTATION
[Figure: true model versus estimated model, plotted against X]

INCONSISTENCY OF ESTIMATES
►We can express
  $\hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon$
►We assume that there exists a finite, invertible matrix $Q$ so that
  $\frac{1}{n} X'X \xrightarrow{n \to \infty} Q$
►It can be shown that
  $\operatorname{plim} \hat{\beta} = \beta + Q^{-1} \operatorname{plim} \frac{1}{n} X'\varepsilon$
►This implies: if the explanatory variables are correlated with the error term, then $\operatorname{plim} \frac{1}{n} X'\varepsilon \neq 0$ and $\hat{\beta}$ does not converge to $\beta$

TYPICAL CASES OF ENDOGENEITY
1. Omitted variable bias
   An explanatory variable is omitted from the equation and becomes part of the error term
2. Selection bias
   An unobservable characteristic influences both the dependent and the explanatory variables
3. Simultaneity
   The causal relationship between the dependent variable and the explanatory variable goes in both directions
4. Measurement error
   Some of the variables are measured with error
►In all 4 cases, the sign of the bias is given by the sign of $\mathrm{Cov}(\varepsilon_i, x_i)$

OMITTED VARIABLE BIAS

SELECTION BIAS
►Example: smoking affects both the number of prenatal visits and the birth weight

SIMULTANEITY
►Occurs in models where variables are jointly determined
  $y_{1i} = \alpha_0 + \alpha_1 y_{2i} + \varepsilon_{1i}$
  $y_{2i} = \beta_0 + \beta_1 y_{1i} + \varepsilon_{2i}$
►Intuitively: a change in $y_{1i}$ will cause a change in $y_{2i}$, which will in turn cause $y_{1i}$ to change again
►Technically: $y_{2i}$ depends on $\varepsilon_{1i}$ through $y_{1i}$, so $\mathrm{Cov}(y_{2i}, \varepsilon_{1i}) \neq 0$

SIMULTANEITY
►Example:
  $Q_{Di} = \alpha_0 + \alpha_1 P_i + \alpha_2 I_i + \varepsilon_{1i}$
  $Q_{Si} = \beta_0 + \beta_1 P_i + \varepsilon_{2i}$
  $Q_{Di} = Q_{Si}$
  where
  $Q_D$ ... quantity demanded
  $Q_S$ ... quantity supplied
  $P$ ... price
  $I$ ... income
►Endogeneity of price: it is determined by the interaction of supply and demand

MEASUREMENT ERROR I
►Measurement error in the dependent variable: we observe $y^*_i = y_i + \nu_i$ instead of $y_i$
►True regression model:
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
►Estimated regression:
  $y^*_i = \beta_0 + \beta_1 x_i + u_i$
  where $u_i = \varepsilon_i + \nu_i$, and so the measurement error $\nu_i$ becomes part of the error term

MEASUREMENT ERROR II
►Classical measurement error in the explanatory variable:
  $x^*_i = x_i + \nu_i$ where $\mathrm{Cov}(\nu_i, x_i) = 0$
►True regression model:
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
►Estimated regression (substituting $x_i = x^*_i - \nu_i$):
  $y_i = \beta_0 + \beta_1 x^*_i + u_i$ where $u_i = \varepsilon_i - \beta_1 \nu_i$
  If $\nu_i$ is also uncorrelated with $\varepsilon_i$, then $\mathrm{Cov}(x^*_i, u_i) = -\beta_1 \mathrm{Var}(\nu_i) \neq 0$, so the observed regressor $x^*_i$ is endogenous

INSTRUMENTAL VARIABLES (IV)
►An instrumental variable must satisfy two conditions:
  1. It explains (is correlated with) the endogenous variable
  2. It is uncorrelated with the error term

INSTRUMENTAL VARIABLES
►Suppose the equation we want to estimate is:
  $y = X\beta + \eta$
►We can have several instruments for several endogenous variables, so we will use the matrix notation $Z$ and $X$
  $X$ denotes the endogenous variable(s)
  $Z$ denotes the instrumental variable(s)
►Assume that we have at least as many instruments as endogenous variables

TWO STAGE LEAST SQUARES
►2SLS is a method of implementing the instrumental variables approach
►Consists of two steps:
  1. Regress the endogenous variables on the instruments, $X = Z\delta + \nu$, and get the predicted values $\hat{X} = Z\hat{\delta}$
  2. Use these predicted values instead of $X$ in the original equation, that is, regress $y$ on $\hat{X}$

TWO STAGE LEAST SQUARES
►The estimate is
  $\hat{\beta}^{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$
►This estimate is consistent, but it has a higher variance than OLS (it is not efficient)
►Intuitively: only the part of the variation in $X$ that is uncorrelated with the error term is used for the estimation.
  This ensures consistency ($\hat{X}$ is uncorrelated with the error term).
  It also raises the variance, because not all variation in $X$ is used.

EXAMPLE
►Estimating the impact of education on the number of children for a sample of women in Botswana
►OLS:

  Source      SS            df      MS
  Model       12243.0295       3    4081.00985
  Residual    9284.14679    4357    2.13085765
  Total       21527.1763    4360    4.93742577

  Number of obs = 4361
  F(3, 4357)    = 1915.20
  Prob > F      = 0.0000
  R-squared     = 0.5687
  Adj R-squared = 0.5684
  Root MSE      = 1.4597

  children    Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
  educ        -.0905755    .0059207     -15.30    0.000    -.102183     -.0789679
  age          .3324486    .0165495      20.09    0.000     .3000032     .364894
  agesq       -.0026308    .0002726      -9.65    0.000    -.0031652    -.0020964
  _cons       -4.138307    .2405942     -17.20    0.000    -4.609994    -3.66662

EXAMPLE
►Education may be endogenous: both education and the number of children may be influenced by some unobserved socioeconomic factors
  Omitted variable bias: family background is an unobserved factor that influences both the number of children and the years of education
►Finding a possible instrument:
  Something that explains education
  But is not correlated with the family background
►A dummy variable frsthalf: equal to 1 if the woman was born in the first half of the year, 0 otherwise

EXAMPLE
►Intuition behind the instrument:
►The first condition - the instrument explains education:
  The school year in Botswana starts in January
  ⇒ Women born in the first half of the year start school when they are at least six and a half.
  Schooling is compulsory till the age of 15
  ⇒ Women born in the first half of the year get less education if they leave school at the age of 15.
►The second condition - the instrument is uncorrelated with the error term:
  Being born in the first half of the year is uncorrelated with the unobserved socioeconomic factors that influence education and the number of children (family background etc.)

EXAMPLE
►2SLS:

  Instrumental variables (2SLS) regression
  Number of obs = 4361
  Wald chi2(3)  = 5300.22
  Prob > chi2   = 0.0000
  R-squared     = 0.5502
  Root MSE      = 1.49

  children    Coef.        Std. Err.    z         P>|z|    [95% Conf. Interval]
  educ        -.1714989    .0531553     -3.23     0.001    -.2756813    -.0673165
  age          .3236052    .0178514     18.13     0.000     .2886171     .3585934
  agesq       -.0026723    .0002796     -9.56     0.000    -.0032202    -.0021244
  _cons       -3.387805    .5478988     -6.18     0.000    -4.461667    -2.313943

  Instrumented: educ
  Instruments: age agesq frsthalf

EXAMPLE
►Compare the estimates from OLS and 2SLS:
►OLS:

  children    Coef.        Std. Err.    t         P>|t|    [95% Conf. Interval]
  educ        -.0905755    .0059207     -15.30    0.000    -.102183     -.0789679

►2SLS:

  children    Coef.        Std. Err.    z         P>|z|    [95% Conf. Interval]
  educ        -.1714989    .0531553     -3.23     0.001    -.2756813    -.0673165

►Is the bias reduced by IV?
►Are these results statistically different?

SUMMARY
►We showed that the estimated coefficients of endogenous variables are inconsistent and biased
►In which situations we may encounter endogenous variables:
  Omitted variable (omitting an important variable that is correlated with an independent variable)
  Selection bias (unobserved factors influencing both the dependent and the independent variable)
  Simultaneity (causality goes both ways)
  Measurement error (in either the dependent or the independent variable)
►We can deal with endogeneity by using instrumental variables (2SLS technique)
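
A minimal simulation sketch of the OLS versus 2SLS comparison summarized above. Nothing in it comes from the slides or from the Botswana data: the data-generating process, the parameter values, and the helper names `ols`, `two_stage_least_squares`, and `simulate` are all assumptions made for illustration. An unobserved factor `w` enters both the regressor `x` and the error term, so $\mathrm{Cov}(x, \varepsilon) > 0$, while the instrument `z` shifts `x` but is independent of the error.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients: the least-squares solution of y = X b."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, X_exog, x_endog, Z_excl):
    """2SLS for one endogenous regressor (hypothetical helper).

    X_exog  : (n, k) exogenous regressors, including the constant
    x_endog : (n,)   endogenous regressor
    Z_excl  : (n,) or (n, m) instruments excluded from the main equation
    """
    # Stage 1: regress the endogenous variable on all exogenous variables
    # and instruments, and keep the fitted values.
    Z = np.column_stack([X_exog, Z_excl])
    x_hat = Z @ ols(Z, x_endog)
    # Stage 2: replace the endogenous variable by its fitted values.
    X_hat = np.column_stack([X_exog, x_hat])
    return ols(X_hat, y)


def simulate(n=200_000, beta=-0.5, seed=0):
    """Return (OLS slope, 2SLS slope) on simulated data (assumed DGP)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)            # unobserved factor (omitted variable)
    z = rng.normal(size=n)            # instrument, independent of w
    x = z + w + rng.normal(size=n)    # endogenous regressor, Var(x) = 3
    eps = w + rng.normal(size=n)      # error term contains w, Cov(x, eps) = 1
    y = 1.0 + beta * x + eps

    const = np.ones((n, 1))
    b_ols = ols(np.column_stack([const, x]), y)
    b_2sls = two_stage_least_squares(y, const, x, z)
    return b_ols[1], b_2sls[1]


if __name__ == "__main__":
    slope_ols, slope_2sls = simulate()
    # In this design the OLS slope converges to beta + Cov(x, eps) / Var(x)
    # = -0.5 + 1/3, while the 2SLS slope converges to beta = -0.5.
    print("OLS slope: ", slope_ols)
    print("2SLS slope:", slope_2sls)
```

In this design the OLS slope is pulled upward, in line with the sign of $\mathrm{Cov}(\varepsilon_i, x_i)$, while the 2SLS slope stays close to the true value. The second-stage standard errors that a naive OLS routine would report are not valid for 2SLS, which is one reason the sketch returns only point estimates.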