LECTURE 8
Introduction to Econometrics
Choosing explanatory variables
November 3, 2017

WHAT WE HAVE LEARNED SO FAR

- We know what a linear regression model is and how its parameters are estimated by OLS
- We know what the properties of the OLS estimator are
- We know how to test single and multiple hypotheses in linear regression models
- We know how to assess the goodness of fit using $R^2$
- We started to talk about the specification of a regression equation

SPECIFICATION OF A REGRESSION EQUATION

Specification consists of choosing:
1. the correct independent variables
2. the correct functional form
3. the correct form of the stochastic error term

- We discussed the choice of functional form in the previous lecture
- We will discuss the choice of independent variables today
- We will study the form of the error term in the next two lectures

ON TODAY'S LECTURE

We will learn that
- omitting a relevant variable from an equation is likely to bias the remaining coefficients
- including an irrelevant variable in an equation leads to higher variance of the estimated coefficients
- our choice should be led by economic theory and confirmed by a set of statistical tools

OMITTED VARIABLES

We omit a variable when we
- forget to include it
- do not have data for it

This misspecification results in
- not having the coefficient for this variable
- biasing the estimated coefficients of the other variables in the equation → omitted variable bias

OMITTED VARIABLES

Where does the omitted variable bias come from?

True model:
$$y_i = \beta x_i + \gamma z_i + u_i$$
Model as it looks when we omit variable $z$:
$$y_i = \beta x_i + \tilde{u}_i$$
implying
$$\tilde{u}_i = \gamma z_i + u_i$$
We assume that $\operatorname{Cov}(u_i, x_i) = 0$, but:
$$\operatorname{Cov}(\tilde{u}_i, x_i) = \operatorname{Cov}(\gamma z_i + u_i, x_i) = \gamma \operatorname{Cov}(z_i, x_i) \neq 0$$
The classical assumption is violated ⇒ biased (and inconsistent) estimate!

OMITTED VARIABLES

For the model with the omitted variable:
$$E(\hat{\beta}^{\text{omitted model}}) = \beta + \text{bias}$$
$$\text{bias} = \gamma \cdot \alpha$$
- Coefficients $\beta$ and $\gamma$ are from the true model $y_i = \beta x_i + \gamma z_i + u_i$
- Coefficient $\alpha$ is from a regression of $z$ on $x$, i.e. $z_i = \alpha x_i + e_i$
- The bias is zero if $\gamma = 0$ or $\alpha = 0$ (not likely to happen)

OMITTED VARIABLES

Intuitive explanation:
- if we leave out an important variable from the regression ($\gamma \neq 0$), the coefficients of the other variables are biased unless the omitted variable is uncorrelated with all included independent variables ($\alpha = 0$)
- the included variables pick up some of the effect of the omitted variable (if they are correlated), and the coefficients of the included variables thus change, causing the bias

Example: what would happen if you estimated a production function with capital only and omitted labor?

OMITTED VARIABLES

Example: estimating the demand for chicken meat in the US (standard errors in parentheses)
$$\hat{Y}_t = 31.5 - \underset{(0.08)}{0.73}\, PC_t + \underset{(0.05)}{0.11}\, PB_t + \underset{(0.02)}{0.23}\, YD_t$$
$$R^2 = 0.986, \quad n = 44$$
- $Y_t$ ... per capita chicken consumption
- $PC_t$ ... price of chicken
- $PB_t$ ... price of beef
- $YD_t$ ... per capita disposable income

OMITTED VARIABLES

When we omit the price of beef:
$$\hat{Y}_t = 32.9 - \underset{(0.08)}{0.70}\, PC_t + \underset{(0.01)}{0.27}\, YD_t$$
$$R^2 = 0.895, \quad n = 44$$
Compare to the true model:
$$\hat{Y}_t = 31.5 - \underset{(0.08)}{0.73}\, PC_t + \underset{(0.05)}{0.11}\, PB_t + \underset{(0.02)}{0.23}\, YD_t$$
$$R^2 = 0.986, \quad n = 44$$
We observe a positive bias in the coefficient of $PC$: it moves from $-0.73$ to $-0.70$ (was it expected?)

OMITTED VARIABLES

Determining the direction of bias: $\text{bias} = \gamma \cdot \alpha$
- $\gamma$ captures the relationship between the omitted variable and the dependent variable (the price of beef and chicken consumption)
  - $\gamma$ is likely to be positive
- $\alpha$ captures the relationship between the omitted variable and the included independent variable (the price of beef and the price of chicken)
  - $\alpha$ is likely to be positive

Conclusion: the bias in the coefficient of the price of chicken is likely to be positive if we omit the price of beef from the equation.
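
The formula bias = γ · α can be checked on simulated data. The following is only a minimal sketch: the parameter values, the sample size and the helper `slope_no_intercept` are illustrative assumptions, not part of the lecture. It generates data from the true model, estimates the regression with z omitted, and compares the result with β + γ · α.

```python
import random

def slope_no_intercept(x, y):
    """OLS slope in y = b*x + error (no intercept, matching the slides' model)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

random.seed(0)
n = 50_000
beta, gamma, alpha = 2.0, 1.5, 0.8   # assumed illustrative values, not from the lecture

x = [random.gauss(0, 1) for _ in range(n)]
z = [alpha * xi + random.gauss(0, 1) for xi in x]            # z_i = alpha * x_i + e_i
y = [beta * xi + gamma * zi + random.gauss(0, 1)             # true model
     for xi, zi in zip(x, z)]

beta_short = slope_no_intercept(x, y)   # regression of y on x only (z omitted)
alpha_hat = slope_no_intercept(x, z)    # auxiliary regression of z on x

print(f"true beta:                    {beta:.3f}")
print(f"estimate with z omitted:      {beta_short:.3f}")
print(f"beta + gamma * alpha:         {beta + gamma * alpha:.3f}")
print(f"implied bias gamma*alpha_hat: {gamma * alpha_hat:.3f}")
```

With γ and α both positive, as argued for the price of beef, the estimate with z omitted should lie above the true β. Setting either `gamma` or `alpha` to zero should make the bias disappear.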
OMITTED VARIABLES

- In reality, we usually do not have the true model to compare with
  - because we do not know what the true model is
  - because we do not have data for some important variable
- We can often recognize the bias if we obtain some unexpected results
- We can prevent omitting variables by relying on the theory
- If we cannot prevent omitting variables, we can at least determine in what way this biases our estimates

IRRELEVANT VARIABLES

- A second type of specification error is including a variable that does not belong to the model
- This misspecification does not cause bias, but it increases the variances of the estimated coefficients of the included variables

IRRELEVANT VARIABLES

True model:
$$y_i = \beta x_i + u_i \tag{1}$$
Model as it looks when we add irrelevant $z$:
$$y_i = \beta x_i + \gamma z_i + \tilde{u}_i \tag{2}$$
We can represent the error term as $\tilde{u}_i = u_i - \gamma z_i$, but since $\gamma = 0$ in the true model, we have $\tilde{u}_i = u_i$ and there is no bias.

IRRELEVANT VARIABLES

True model (standard errors in parentheses):
$$\hat{Y}_t = 31.5 - \underset{(0.08)}{0.73}\, PC_t + \underset{(0.05)}{0.11}\, PB_t + \underset{(0.02)}{0.23}\, YD_t$$
$$R^2 = 0.986, \quad n = 44$$
If we include the interest rate $R_t$ (an irrelevant variable):
$$\hat{Y}_t = 30.0 - \underset{(0.10)}{0.73}\, PC_t + \underset{(0.06)}{0.12}\, PB_t + \underset{(0.03)}{0.22}\, YD_t + \underset{(0.21)}{0.17}\, R_t$$
$$R^2 = 0.987, \quad n = 44$$
We observe that $R_t$ is insignificant and the standard errors of the other coefficients increase.

SUMMARY OF THE THEORY

Bias-efficiency trade-off:

                Omitted variable    Irrelevant variable
    Bias        Yes*                No
    Variance    Decreases*          Increases*

* As long as x and z are correlated

FOUR IMPORTANT SPECIFICATION CRITERIA

Does a variable belong to the equation?
1. Theory: Is the variable's place in the equation unambiguous and theoretically sound? Does intuition tell you it should be included?
2. t-test: Is the variable's estimated coefficient significant in the expected direction?
3. $R^2$: Does the overall fit of the equation improve (enough) when the variable is added to the equation?
4. Bias: Do other variables' coefficients change significantly when the variable is added to the equation?

FOUR IMPORTANT SPECIFICATION CRITERIA

- If all conditions hold, the variable belongs in the equation
- If none of them holds, the variable is irrelevant and can be safely excluded
- If the criteria give contradictory answers, most importance should be attributed to theoretical justification
  - Therefore, if theory (intuition) says that the variable belongs to the equation, we include it (even though its coefficient might be insignificant!)

ESTIMATING PRICE ELASTICITY OF BRAZILIAN COFFEE

Should we include the price of Brazilian coffee (PBC) in the equation? (standard errors in parentheses)
$$COF = 9.3 + \underset{(1.0)}{2.6}\, PT + \underset{(0.0009)}{0.0036}\, Y$$
$$t = 2.6 \ (PT), \quad 4.0 \ (Y) \qquad R^2 = 0.58, \quad n = 25$$
$$COF = 9.1 + \underset{(15.6)}{7.8}\, PBC + \underset{(1.2)}{2.4}\, PT + \underset{(0.0010)}{0.0035}\, Y$$
$$t = 0.5 \ (PBC), \quad 2.0 \ (PT), \quad 3.5 \ (Y) \qquad R^2 = 0.60, \quad n = 25$$
The three criteria do not hold (theory is inconclusive) ⇒ the price of Brazilian coffee does not belong to the equation (Brazilian coffee is price inelastic)

ESTIMATING PRICE ELASTICITY OF BRAZILIAN COFFEE

Really? What if we add the price of Colombian coffee (PCC)?
$$COF = 10.0 + \underset{(4.0)}{8.0}\, PBC - \underset{(2.0)}{5.6}\, PCC + \underset{(1.3)}{2.6}\, PT + \underset{(0.0010)}{0.0030}\, Y$$
$$t = 2.0 \ (PBC), \quad -2.8 \ (PCC), \quad 2.0 \ (PT), \quad 3.0 \ (Y) \qquad R^2 = 0.70, \quad n = 25$$
Compare with the equation without PCC:
$$COF = 9.1 + \underset{(15.6)}{7.8}\, PBC + \underset{(1.2)}{2.4}\, PT + \underset{(0.0010)}{0.0035}\, Y$$
$$t = 0.5 \ (PBC), \quad 2.0 \ (PT), \quad 3.5 \ (Y) \qquad R^2 = 0.60, \quad n = 25$$
The three criteria hold ⇒ the price of Brazilian coffee belongs to the equation!
(Brazilian coffee is price elastic)

THE DANGER OF SPECIFICATION SEARCHES

"If you just torture the data long enough, they will confess."

If too many specifications are tried:
- The final result has the desired properties only by chance
- The statistical significance of the results is overestimated because the estimations of the previous regressions are ignored

How to proceed:
- Keep the number of regressions estimated low
- Focus on theoretical considerations: leave the insignificant variables in the equation if the theory predicts they should be included
- Document all specifications investigated

ADDITIONAL SPECIFICATION TEST

Ramsey's Regression Specification Error Test (RESET)
- makes it possible to detect misspecification: it tells you whether all important variables are included
- unfortunately, it does not reveal the source of the misspecification

There are two forms of this test, both based on similar intuition: if the equation is correctly specified, nothing is missing from the equation and the residuals are white noise.

We will derive the test for the model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$$

RESET I

1. We run the regression $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$
2. We save the predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}$
3. We run the augmented regression $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \gamma_1 \hat{y}_i^2 + \gamma_2 \hat{y}_i^3 + \varepsilon_i$ (more powers of $\hat{y}$ can be included)
4. We test $H_0: \gamma_1 = \gamma_2 = 0$ using a standard F-test.
5. If we reject $H_0$, there is a misspecification problem in our model.

Intuition: if the model is correct, $y$ is well explained by $x_1$ and $x_2$, and the predicted values of $y$ (raised to higher powers) should not be significant.

RESET II

1. We run the regression $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$
2. We save the predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}$ and the residuals $e_i = y_i - \hat{y}_i$
3. We run the regression $e_i = \alpha_0 + \alpha_1 \hat{y}_i + \alpha_2 \hat{y}_i^2 + \varepsilon_i$ (more powers of $\hat{y}$ can be included)
4. We test $H_0: \alpha_1 = \alpha_2 = 0$ using a standard F-test.
5. If we reject $H_0$, there is a misspecification problem in our model.
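
Both forms of the test can be sketched in a few lines. This is only an illustration under stated assumptions: the simulated data, the parameter values, and the helper names `ols`, `f_stat`, `reset_i` and `reset_ii` are all hypothetical, and the small hand-written OLS solver stands in for whatever regression software is actually used. The sketch applies the steps above to a correctly specified model and to one that omits a squared term.

```python
import random

def ols(X, y):
    """OLS via the normal equations X'X b = X'y, solved by Gauss-Jordan elimination.
    Returns (coefficients, fitted values, sum of squared residuals)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return b, fitted, ssr

def f_stat(ssr_r, ssr_u, q, df):
    """Standard F statistic for q restrictions with df residual degrees of freedom."""
    return ((ssr_r - ssr_u) / q) / (ssr_u / df)

def reset_i(x1, x2, y):
    """RESET I: add y_hat^2 and y_hat^3 to the regression, test gamma1 = gamma2 = 0."""
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    _, y_hat, ssr_r = ols(X, y)
    X_aug = [r + [f ** 2, f ** 3] for r, f in zip(X, y_hat)]
    _, _, ssr_u = ols(X_aug, y)
    return f_stat(ssr_r, ssr_u, 2, len(y) - 5)

def reset_ii(x1, x2, y):
    """RESET II: regress the residuals on y_hat and y_hat^2, test alpha1 = alpha2 = 0."""
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    _, y_hat, _ = ols(X, y)
    e = [yi - f for yi, f in zip(y, y_hat)]
    e_bar = sum(e) / len(e)
    ssr_r = sum((ei - e_bar) ** 2 for ei in e)   # restricted model: e_i = alpha0 + error
    _, _, ssr_u = ols([[1.0, f, f ** 2] for f in y_hat], e)
    return f_stat(ssr_r, ssr_u, 2, len(y) - 3)

# Simulated data (assumed values): one correctly specified model, one with an omitted x1^2 term
random.seed(1)
n = 500
x1 = [random.uniform(-1, 1) for _ in range(n)]
x2 = [random.uniform(-1, 1) for _ in range(n)]
u = [random.gauss(0, 0.3) for _ in range(n)]
y_linear = [1 + a + 0.5 * b + e for a, b, e in zip(x1, x2, u)]
y_quadratic = [1 + a + 0.5 * b + 2 * a * a + e for a, b, e in zip(x1, x2, u)]

f1_ok, f1_bad = reset_i(x1, x2, y_linear), reset_i(x1, x2, y_quadratic)
f2_ok, f2_bad = reset_ii(x1, x2, y_linear), reset_ii(x1, x2, y_quadratic)
print(f"RESET I   F: correct model {f1_ok:.2f}, omitted x1^2 {f1_bad:.2f}")
print(f"RESET II  F: correct model {f2_ok:.2f}, omitted x1^2 {f2_bad:.2f}")
```

Each statistic is compared with the critical value of the F distribution with 2 numerator degrees of freedom (roughly 3.0 at the 5% level in large samples). A large value for the model with the omitted squared term signals misspecification, but, as noted above, not its source.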
Intuition: if the model is correct, the residuals should not display any pattern depending on the explanatory variables.

SUMMARY

- An omitted variable causes bias (and decreases variance)
  - the sign of this bias can be predicted
- An included irrelevant variable increases variance (but does not cause bias)
  - such a variable is insignificant in the regression
  - it does not contribute to the overall fit of the regression
- There is a set of criteria that help us to recognize the correct specification
  - these criteria have to be applied with caution: theoretical justification always has priority over statistical properties

Readings: Studenmund Chapter 6, Wooldridge Chapter 9
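
The summary's claim about irrelevant variables (no bias, higher variance) can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration: the parameter values, the number of replications, and the helper names `beta_short` and `beta_long`. In every replication the true model is y = βx + u, while z is correlated with x but has γ = 0.

```python
import random
from statistics import mean, pvariance

def beta_short(x, y):
    """OLS estimate of beta in y = beta*x + u (the true model)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def beta_long(x, z, y):
    """OLS estimate of beta in y = beta*x + gamma*z + u (irrelevant z added)."""
    sxx = sum(a * a for a in x)
    szz = sum(c * c for c in z)
    sxz = sum(a * c for a, c in zip(x, z))
    sxy = sum(a * b for a, b in zip(x, y))
    szy = sum(c * b for c, b in zip(z, y))
    # Solution of the two normal equations for the coefficient on x
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

random.seed(0)
beta, n, reps = 1.0, 50, 2000      # assumed illustrative values
short, long_ = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    z = [xi + 0.5 * random.gauss(0, 1) for xi in x]   # correlated with x, but gamma = 0
    y = [beta * xi + random.gauss(0, 1) for xi in x]
    short.append(beta_short(x, y))
    long_.append(beta_long(x, z, y))

print(f"mean of beta-hat, true model:        {mean(short):.3f}")
print(f"mean of beta-hat, with irrelevant z: {mean(long_):.3f}")
print(f"variance, true model:                {pvariance(short):.4f}")
print(f"variance, with irrelevant z:         {pvariance(long_):.4f}")
```

Both averages should be close to the true β, while the variance of the estimate is visibly larger once z is included. Making z less correlated with x should shrink that gap.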