LECTURE 6
Introduction to Econometrics
Omitted Variables, Multicollinearity & Heteroskedasticity
November 1, 2019

SPECIFICATION ERROR - OMITTED VARIABLES
• We omit a variable when we forget to include it or when we do not have data for it
• This misspecification means we not only lack a coefficient estimate for the omitted variable, but we also bias the estimated coefficients of the other variables in the equation −→ omitted variable bias

OMITTED VARIABLES
• For the model with an omitted variable, the effect of the missing regressor is absorbed by the error term: if the true model is y_i = βx_i + γz_i + u_i but we estimate y_i = βx_i + u*_i, then u*_i = u_i + γz_i, which is correlated with x_i whenever x and z are correlated
• Example: what would happen if you estimated a production function with capital only and omitted labor?
• Example: estimating the demand for chicken meat in the US
  Y_t . . . per capita chicken consumption
  PC_t . . . price of chicken
  PB_t . . . price of beef
  YD_t . . . per capita disposable income
• When we omit the price of beef: [estimated equation omitted], n = 44, R² = 0.895
• Compare to the true model: [estimated equation omitted], n = 44, R² = 0.986
• We observe a positive bias in the coefficient of PC (was it expected?)
• Determining the direction of bias: bias = γ · α
  where γ reflects the relationship between the omitted variable and the dependent variable (the price of beef and chicken consumption); γ is likely to be positive
  where α is the correlation between the omitted variable and the included independent variable (the price of beef and the price of chicken); α is likely to be positive
• Conclusion: the bias in the coefficient of the price of chicken is likely to be positive if we omit the price of beef from the equation
• In reality, we usually do not have the true model to compare with
  because we do not know what the true model is
  because we do not have data for some important variable
• We can often recognize the bias if we obtain some unexpected results
• We can prevent omitting variables by relying on theory
• If we cannot prevent omitting a variable, we can at least determine in which direction this biases our estimates

IRRELEVANT VARIABLES
• A second type of specification error is including a variable that does not belong in the model
• This misspecification does not cause bias, but it increases the variances of the estimated coefficients of the included variables
• True model: y_i = βx_i + u_i   (1)
• Model as it looks when we add an irrelevant z: y_i = βx_i + γz_i + ũ_i   (2)
• We can represent the error term as ũ_i = u_i − γz_i
• but since in the true model γ = 0, we have ũ_i = u_i and there is no bias

SUMMARY OF THE THEORY
• Bias - efficiency trade-off:

              Omitted variable    Irrelevant variable
  Bias        Yes*                No
  Variance    Decreases*          Increases*

  * as long as there is correlation between x and z
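The bias-efficiency trade-off in the table above can be checked numerically. The following is a minimal Monte Carlo sketch (not from the lecture): the variable names, coefficient values, and sample size are chosen purely for illustration, with z relevant and correlated with x, and w irrelevant but also correlated with x.

```python
# Illustrative Monte Carlo: effect of omitting a relevant variable (z) and of
# adding an irrelevant one (w) on the OLS coefficient of x.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000


def slope_on_x(X, y):
    """OLS via least squares; return the coefficient on x (second column of X)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


draws = {"true model": [], "z omitted": [], "irrelevant w added": []}
for _ in range(reps):
    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(size=n)          # relevant, correlated with x
    w = 0.8 * x + rng.normal(size=n)          # irrelevant (true coefficient 0), correlated with x
    y = 1.0 + 2.0 * x + 0.3 * z + rng.normal(size=n)

    const = np.ones(n)
    draws["true model"].append(slope_on_x(np.column_stack([const, x, z]), y))
    draws["z omitted"].append(slope_on_x(np.column_stack([const, x]), y))
    draws["irrelevant w added"].append(slope_on_x(np.column_stack([const, x, z, w]), y))

for name, b in draws.items():
    print(f"{name:20s} mean = {np.mean(b):.3f}   sd = {np.std(b):.3f}")
# Expected pattern with these illustrative parameters: omitting z biases the x
# coefficient upward (mean roughly 2 + 0.3*0.8 = 2.24) but shrinks its sampling
# variance; adding the irrelevant w leaves the mean near 2 but inflates the variance.
```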
ON PREVIOUS LECTURES
► We discussed the specification of a regression equation
► Specification consists of choosing:
  1. the correct independent variables
  2. the correct functional form
  3. the correct form of the stochastic error term

SHORT REVISION
► We talked about the choice of the correct functional form: What are the most common functional forms?
► We studied what happens if we omit a relevant variable: Does omitting a relevant variable cause a bias in the other coefficients?
► We studied what happens if we include an irrelevant variable: Does including an irrelevant variable cause a bias in the other coefficients?
► We defined the four specification criteria that determine if a variable belongs in the equation: Can you list some of these specification criteria?

ON TODAY’S LECTURE
► We will finish the discussion of the choice of independent variables by talking about multicollinearity
► We will start the discussion of the correct form of the error term by talking about heteroskedasticity
► For both of these issues, we will learn
  • what is the nature of the problem
  • what are its consequences
  • how it is diagnosed
  • what remedies are available

PERFECT MULTICOLLINEARITY
► Some explanatory variable is a perfect linear function of one or more other explanatory variables
► Violation of one of the classical assumptions
► The OLS estimate cannot be found
  Intuitively: the estimator cannot distinguish which of the explanatory variables causes the change in the dependent variable if they move together
  Technically: the matrix X'X is singular (not invertible)
► Rare and easy to detect

EXAMPLES OF PERFECT MULTICOLLINEARITY
► Dummy variable trap: inclusion of a dummy variable for each category in a model with an intercept
► Example: wage equation for a sample of individuals who have high-school education or higher:
  wage_i = β1 + β2·highschool_i + β3·university_i + β4·phd_i + e_i
  (the three dummies sum to one for every individual, so they are perfectly collinear with the intercept; see the code sketch at the end of this part)
► Automatically detected by most statistical software packages

IMPERFECT MULTICOLLINEARITY
► Two or more explanatory variables are highly correlated in the particular data set
► The OLS estimate can be found, but it may be very imprecise
  Intuitively: the estimator can hardly distinguish the effects of the explanatory variables if they are highly correlated
  Technically: the matrix X'X is nearly singular, which causes the variance of the estimator to be very large
► Usually referred to simply as “multicollinearity”

CONSEQUENCES OF MULTICOLLINEARITY
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Standard errors of the coefficients increase
   confidence intervals are very large - estimates are less reliable
   t-statistics are smaller - variables may become insignificant

DETECTION OF MULTICOLLINEARITY
► Some multicollinearity exists in every equation - the aim is to recognize when it causes a severe problem
► Multicollinearity can be signaled by the underlying theory, but it is very sample dependent
► We judge the severity of multicollinearity based on the properties of our sample and on the results we obtain
► One simple method: examine the correlation coefficients between explanatory variables; if any of them is very high, we may suspect that the coefficients of these variables are affected by multicollinearity (see the detection sketch below)

REMEDIES FOR MULTICOLLINEARITY
► Drop a redundant variable
  when the variable is not needed to represent the effect on the dependent variable
  in case of severe multicollinearity, it makes no statistical difference which variable is dropped
  the theoretical underpinnings of the model should be the basis for such a decision
► Do nothing
  when multicollinearity does not cause insignificant t-scores or unreliable estimated coefficients
  deletion of a collinear variable can cause specification bias
► Increase the size of the sample
  the confidence intervals are narrower when we have more observations
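The dummy variable trap mentioned above can be demonstrated in a few lines. This is a hypothetical sketch: the tiny data set and the education categories are invented for illustration, and only the rank of the regressor matrix is inspected.

```python
# Dummy variable trap: one dummy per category plus an intercept makes X'X singular.
import numpy as np
import pandas as pd

df = pd.DataFrame({"educ": ["highschool", "university", "phd",
                            "university", "highschool", "phd"]})

# Trap: dummies for all three categories together with an intercept
dummies_all = pd.get_dummies(df["educ"], dtype=float)
X_trap = np.column_stack([np.ones(len(df)), dummies_all.to_numpy()])
print(np.linalg.matrix_rank(X_trap), "<", X_trap.shape[1])   # rank deficient

# Fix: drop one reference category, keeping the intercept
dummies_ok = pd.get_dummies(df["educ"], drop_first=True, dtype=float)
X_ok = np.column_stack([np.ones(len(df)), dummies_ok.to_numpy()])
print(np.linalg.matrix_rank(X_ok), "==", X_ok.shape[1])      # full column rank
```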
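For imperfect multicollinearity, the detection ideas above (pairwise correlations, individual significance) are often complemented by variance inflation factors. The sketch below is illustrative only: the data are simulated, the column names are placeholders, and the VIF cut-off of about 10 is a common rule of thumb rather than part of the lecture.

```python
# Simple multicollinearity diagnostics: correlation matrix and VIFs (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.95 * x1 + 0.1 * rng.normal(size=100),   # nearly collinear with x1
    "x3": rng.normal(size=100),
})

# 1) Pairwise correlations between regressors
print(df.corr().round(3))

# 2) Variance inflation factors (values above roughly 10 are usually taken as a warning sign)
X = sm.add_constant(df)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.to_numpy(), i), 2))
```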
EXAMPLE
► Estimating the demand for gasoline in the U.S.:
  PCON_i . . . petroleum consumption in the i-th state
  TAX_i . . . the gasoline tax rate in the i-th state
  UHM_i . . . urban highway miles within the i-th state
  REG_i . . . motor vehicle registrations in the i-th state
► We suspect multicollinearity between urban highway miles and motor vehicle registrations across states, because states that have a lot of highways tend to also have a lot of motor vehicles
► Therefore, we might run into multicollinearity problems. How do we detect multicollinearity?
  Look at the correlation coefficient: it is indeed huge (0.978)
  Look at the coefficients of the two variables: are they both individually significant? UHM is significant, but REG is not, which further suggests the presence of multicollinearity
► Remedy: try dropping one of the correlated variables

HETEROSKEDASTICITY
► Observations of the error term are drawn from a distribution whose variance is no longer constant:
  Var(ε_i) = σ_i², i = 1, 2, . . . , n
  Note: constant variance (homoskedasticity) means Var(ε_i) = σ² for all i = 1, 2, . . . , n
► Often occurs in data sets in which there is a wide disparity between the largest and smallest observed values
  smaller values are often connected to a smaller variance and larger values to a larger variance (e.g. consumption of households depending on their income level)
► One particular form of heteroskedasticity (the variance of the error term is a function of some observable variable):
  Var(ε_i) = h(x_i), i = 1, 2, . . . , n

CONSEQUENCES OF HETEROSKEDASTICITY
► Violation of one of the classical assumptions
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Estimated standard errors of the coefficients are biased
   a heteroskedastic error term causes the dependent variable to fluctuate in a way that the OLS estimation procedure attributes to the independent variables
   heteroskedasticity biases the t statistics, which leads to unreliable hypothesis testing
   typically, we encounter underestimation of the standard errors, so the t scores are incorrectly too high

DETECTION OF HETEROSKEDASTICITY
► There is a battery of tests for heteroskedasticity
  sometimes, a simple visual analysis of the residuals is sufficient to detect heteroskedasticity
► We will derive a test for the model y_i = β0 + β1·x_i + β2·z_i + ε_i
► The test is based on the analysis of residuals
► The null hypothesis of the test is no heteroskedasticity: E(e_i²) = σ²
  therefore, we will analyse the relationship between e_i² and the explanatory variables

BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
1. Estimate the equation and obtain the residuals e_i
2. Regress the squared residuals on all explanatory variables:
   e_i² = α0 + α1·x_i + α2·z_i + ν_i   (1)
3. Get the R² of this regression and the sample size n
4. Test the joint significance of regression (1): the test statistic nR² is asymptotically χ² with degrees of freedom equal to the number of slope coefficients in (1)
5. If nR² is larger than the χ² critical value, we reject H0 of no heteroskedasticity

WHITE TEST FOR HETEROSKEDASTICITY
1. Estimate the equation and obtain the residuals e_i
2. Regress the squared residuals on all explanatory variables and on the squares and cross-products of all explanatory variables:
   e_i² = α0 + α1·x_i + α2·z_i + α3·x_i² + α4·z_i² + α5·x_i·z_i + ν_i   (2)
3. Get the R² of this regression and the sample size n
4. Test the joint significance of regression (2): the test statistic nR² is asymptotically χ²_k, where k is the number of slope coefficients in (2)
5. If nR² is larger than the χ²_k critical value, we reject H0 of no heteroskedasticity
(A code sketch of both tests follows below.)
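Both tests are available in statsmodels. The following is a minimal sketch under assumed conditions: the data are simulated, the variable names are hypothetical, and the error variance is deliberately made to grow with x so that both tests should reject the null of homoskedasticity.

```python
# Breusch-Pagan and White tests on simulated heteroskedastic data (statsmodels).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1, 10, size=n)
z = rng.uniform(1, 10, size=n)
eps = rng.normal(scale=0.5 * x, size=n)        # error variance grows with x
y = 1.0 + 2.0 * x - 1.0 * z + eps

X = sm.add_constant(np.column_stack([x, z]))
res = sm.OLS(y, X).fit()

# Breusch-Pagan: squared residuals regressed on the explanatory variables, LM = n*R^2
lm, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan: LM = {lm:.2f}, p-value = {lm_pval:.4f}")

# White: also includes squares and cross-products of the regressors
lm_w, lm_w_pval, f_w, f_w_pval = het_white(res.resid, X)
print(f"White:         LM = {lm_w:.2f}, p-value = {lm_w_pval:.4f}")
# Small p-values lead to rejecting H0 of no heteroskedasticity, as expected here.
```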
REMEDIES FOR HETEROSKEDASTICITY
1. Redefining the variables in order to reduce the variance of observations with extreme values
   e.g. by taking logarithms or by scaling some variables
2. Weighted Least Squares (WLS)
   consider the model y_i = β0 + β1·x_i + β2·z_i + ε_i and suppose Var(ε_i) = σ²·z_i²
   it can be shown that if we redefine the model by dividing through by z_i,
   y_i/z_i = β0·(1/z_i) + β1·(x_i/z_i) + β2 + ε_i/z_i,
   it becomes homoskedastic, since Var(ε_i/z_i) = σ²
3. Heteroskedasticity-corrected (robust) standard errors

HETEROSKEDASTICITY-CORRECTED ROBUST ERRORS
► The logic behind it: since heteroskedasticity causes problems with the standard errors of OLS but not with the coefficients, it makes sense to improve the estimation of the standard errors in a way that does not alter the estimates of the coefficients (White, 1980)
► Heteroskedasticity-corrected standard errors are typically larger than the OLS standard errors, thus producing lower t scores
► In panel and cross-sectional data with group-level variables, clustering the standard errors is the preferred answer to heteroskedasticity
► (A short code sketch of WLS and robust standard errors follows the summary.)

SUMMARY
► Multicollinearity
  does not lead to inconsistent estimates, but it makes them lose significance
  if really necessary, it can be remedied by dropping or transforming variables, or by getting more data
► Heteroskedasticity
  does not lead to inconsistent estimates, but it invalidates inference
  can be simply remedied by the use of (clustered) robust standard errors
► Readings:
  Studenmund, Chapters 8 and 10
  Wooldridge, Chapter 8
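As a companion to the remedies for heteroskedasticity above, here is a minimal sketch of WLS and of White-type robust standard errors. It assumes simulated data with Var(ε_i) = σ²·z_i² as in the WLS example; the variable names and the HC1 variant of the robust covariance estimator are illustrative choices, not prescribed by the lecture.

```python
# WLS with weights 1/z^2 and heteroskedasticity-robust (White) standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1, 10, size=n)
z = rng.uniform(1, 10, size=n)
eps = rng.normal(scale=z, size=n)               # Var(eps_i) proportional to z_i^2
y = 1.0 + 2.0 * x - 1.0 * z + eps

X = sm.add_constant(np.column_stack([x, z]))

# Plain OLS: coefficients are fine, but the standard errors are unreliable
ols = sm.OLS(y, X).fit()

# Remedy 2: WLS with weights proportional to 1/Var(eps_i) = 1/z_i^2
wls = sm.WLS(y, X, weights=1.0 / z**2).fit()

# Remedy 3: keep the OLS coefficients, replace the covariance matrix with a robust one
ols_robust = sm.OLS(y, X).fit(cov_type="HC1")

for name, res in [("OLS", ols), ("WLS", wls), ("OLS + robust s.e.", ols_robust)]:
    print(f"{name:18s} beta_x = {res.params[1]:.3f}, s.e. = {res.bse[1]:.3f}")
```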