LECTURE 8
Introduction to Econometrics
Multicollinearity & Heteroskedasticity
Hieu Nguyen
Fall semester, 2024

ON PREVIOUS LECTURES
► We discussed the specification of a regression equation.
► Specification consists of choosing:
  1. the correct independent variables (lecture 7: omitted or irrelevant variables)
  2. the correct functional form (lecture 6)
  3. the correct form of the stochastic error term

ON TODAY’S LECTURE
• Today we continue the discussion of choosing the correct independent variables by talking about multicollinearity.
• We also start the discussion of the correct form of the error term by talking about heteroskedasticity.
• For both of these issues, we will learn:
  • what the nature of the problem is
  • what its consequences are
  • how it is diagnosed
  • what remedies are available

PERFECT MULTICOLLINEARITY
• Some explanatory variable is a perfect linear function of one or more other explanatory variables.
• This violates one of the classical assumptions: the OLS estimate cannot be found.
• Intuitively: the estimator cannot distinguish which of the explanatory variables causes the change in the dependent variable if they move together.
• Technically: the matrix X'X is singular (not invertible).
• Rare and easy to detect; usually an obvious mistake, e.g. including a full set of dummies together with a constant.

EXAMPLES OF PERFECT MULTICOLLINEARITY
Dummy variable trap
• Inclusion of a dummy variable for each category in a model with an intercept.
• Example: wage equation for a sample of individuals who have high-school education or higher:
  wagei = β1 + β2 high schooli + β3 universityi + β4 phdi + ei
• Automatically detected by most statistical software packages.

IMPERFECT MULTICOLLINEARITY
• Two or more explanatory variables are highly correlated in the particular data set.
• The OLS estimate can be found, but it may be very imprecise.
• Intuitively: the estimator can hardly distinguish the effects of the explanatory variables if they are highly correlated.
• Technically: the matrix X'X is nearly singular, which causes the variance of the estimator to be very large.
• Usually referred to simply as “multicollinearity”.

CONSEQUENCES OF MULTICOLLINEARITY
1. Estimates remain unbiased and consistent (estimated coefficients are not affected).
2. Standard errors of the coefficients increase:
  • confidence intervals are very wide, so estimates are less reliable
  • t-statistics are smaller, so variables may become insignificant

DETECTION OF MULTICOLLINEARITY
• Some degree of multicollinearity exists in every equation; the aim is to recognize when it causes a severe problem.
• Multicollinearity can be signaled by the underlying theory, but it is very sample dependent.
• We judge the severity of multicollinearity based on the properties of our sample and on the results we obtain.
• One simple method: examine the correlation coefficients between the explanatory variables (compute their correlation matrix); if some of them are very high, we may suspect that the coefficients of these variables are affected by multicollinearity.

EXAMPLE
• Estimating the demand for gasoline in the U.S.:
  PCONi . . . petroleum consumption in the i-th state
  TAXi . . . the gasoline tax rate in the i-th state
  UHMi . . . urban highway miles within the i-th state
  REGi . . . motor vehicle registrations in the i-th state
• As a first check, we examine the pairwise correlations between the explanatory variables, as sketched below.
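The slides contain no code; the following is a minimal sketch of how the correlation check (and the closely related variance inflation factors) could be computed with Python and statsmodels. The file name gasoline.csv and the column names are hypothetical placeholders for the data described above.

```python
# Minimal sketch (not part of the original slides): diagnosing multicollinearity
# by inspecting pairwise correlations and variance inflation factors (VIFs).
# The file "gasoline.csv" and its columns PCON, TAX, UHM, REG are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("gasoline.csv")
X = sm.add_constant(df[["TAX", "UHM", "REG"]])

# Pairwise correlations between explanatory variables;
# values close to 1 (the lecture finds 0.978 for UHM and REG) signal trouble.
print(df[["TAX", "UHM", "REG"]].corr())

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# explanatory variable j on all the other explanatory variables.
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))

# The regression itself: coefficients stay unbiased, but standard errors
# of the collinear variables are inflated.
print(sm.OLS(df["PCON"], X).fit().summary())
```

A common rule of thumb is that a VIF above roughly 10 indicates severe multicollinearity, although, as the lecture stresses, the judgment is ultimately sample dependent.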
EXAMPLE (continued)
• We suspect multicollinearity between urban highway miles and motor vehicle registrations across states, because states that have a lot of highways might also have a lot of motor vehicles.
• Therefore, we might run into multicollinearity problems. How do we detect multicollinearity?
  • Look at the correlation coefficient between UHM and REG: it is indeed huge (0.978).
  • Look at the coefficients of the two variables. Are they both individually significant? UHM is significant, but REG is not, which further suggests the presence of multicollinearity.
• Remedy: try dropping one of the correlated variables.

REMEDIES FOR MULTICOLLINEARITY
• Drop a redundant variable
  • when the variable is not needed to represent the effect on the dependent variable
  • in case of severe multicollinearity, it makes no statistical difference which variable is dropped
  • the theoretical underpinnings of the model should be the basis for such a decision
• Do nothing
  • when multicollinearity does not cause insignificant t-scores or unreliable estimated coefficients
  • deletion of a collinear variable can cause specification bias
• Increase the size of the sample
  • confidence intervals are narrower when we have more observations
• Transform the multicollinear variables
  • used when all the variables are extremely important on theoretical grounds
  • we can try various transformations, for example:
    1. a combination of the multicollinear variables (e.g. their sum)
    2. first differences (for time series)

HETEROSKEDASTICITY
• The classical assumptions require the error term to have a constant variance across observations (homoskedasticity): Var(εi) = σ² for all i.
• Heteroskedasticity means that the variance of the error term differs across observations: Var(εi) = σi².
• (Figure: scatter plot illustrating an error variance that changes with X.)

CONSEQUENCES OF HETEROSKEDASTICITY
• Violation of one of the classical assumptions.
1. Estimates remain unbiased and consistent (estimated coefficients are not affected).
2. Estimated standard errors of the coefficients are biased:
  • a heteroskedastic error term causes the dependent variable to fluctuate in a way that the OLS estimation procedure attributes to the independent variables
  • heteroskedasticity biases the t-statistics, which leads to unreliable hypothesis testing
  • typically we encounter underestimation of the standard errors, so the t-scores are incorrectly too high

VARIANCE OF OLS UNDER HETEROSKEDASTICITY
• For the simple regression yi = β0 + β1xi + εi with Var(εi) = σi², the variance of the OLS slope estimator is
  Var(β̂1) = Σ (xi − x̄)² σi² / [ Σ (xi − x̄)² ]²
• Only under homoskedasticity (σi² = σ² for all i) does this reduce to the usual formula σ² / Σ (xi − x̄)², which is what the conventional OLS standard errors estimate.

DETECTION OF HETEROSKEDASTICITY
• There is a battery of tests for heteroskedasticity; sometimes a simple visual analysis of the residuals is sufficient to detect it.
• We will derive a test for the model
  yi = β0 + β1xi + β2zi + εi    (1)
• The test is based on an analysis of the residuals ei.
• The null hypothesis of the test is no heteroskedasticity: E(εi²) = σ².
• Therefore, we analyse the relationship between ei² and the explanatory variables.

CHI-SQUARED DISTRIBUTION
• (Figure: density of the chi-squared distribution and the upper-tail critical value used in the tests below.)

WHITE TEST FOR HETEROSKEDASTICITY
1. Estimate model (1) by OLS and obtain the residuals ei.
2. Regress the squared residuals on the explanatory variables, their squares and their cross-products:
   ei² = α0 + α1xi + α2zi + α3xi² + α4zi² + α5xizi + ui    (2)
3. Get the R² of regression (2) and the sample size n.
4. Test the joint significance of (2): the test statistic is nR², which under H0 follows a χ²k distribution, where k is the number of slope coefficients in (2).
5. If nR² is larger than the χ²k critical value, we reject H0 of no heteroskedasticity.

BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
1. Estimate model (1) by OLS and obtain the residuals ei.
2. Regress the squared residuals on the explanatory variables only:
   ei² = α0 + α1xi + α2zi + ui
3. The null hypothesis of no heteroskedasticity is H0: α1 = α2 = 0.
4. Get the R² of this regression and the sample size n.
5. Compute the test statistic nR², which under H0 follows a χ²k distribution, where k is the number of slope coefficients.
6. If nR² is larger than the χ²k critical value, we reject H0 of no heteroskedasticity.

REMEDIES FOR HETEROSKEDASTICITY
1. Redefine the variables in order to reduce the variance of observations with extreme values, e.g. by taking logarithms or by scaling some variables.
2. Weighted Least Squares (WLS)
  • consider the model yi = β0 + β1xi + β2zi + εi and suppose Var(εi) = σ²zi²
  • it can be proved that if we redefine the model as
    yi/zi = β0(1/zi) + β1(xi/zi) + β2 + εi/zi,
    it becomes homoskedastic, since Var(εi/zi) = σ² (a code sketch of the tests and of WLS follows this list)
3. Heteroskedasticity-corrected (robust) standard errors
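A minimal sketch, not part of the slides, of how the two tests and the WLS remedy could be run with statsmodels. The simulated data are my own illustration and assume the simple model yi = β0 + β1xi + β2zi + εi with Var(εi) = σ²zi², exactly as in the WLS remedy above.

```python
# Minimal sketch (not part of the original slides): White test, Breusch-Pagan test,
# and a WLS fit for y = b0 + b1*x + b2*z + e with Var(e) = sigma^2 * z^2.
# All data below are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = rng.uniform(1, 5, size=n)
e = rng.normal(scale=z, size=n)            # heteroskedastic: sd proportional to z
y = 1.0 + 2.0 * x + 0.5 * z + e

X = sm.add_constant(np.column_stack([x, z]))
ols = sm.OLS(y, X).fit()

# White test: the auxiliary regression of e^2 on x, z, x^2, z^2, x*z is built
# internally; the LM statistic n*R^2 is compared with a chi-squared critical value.
lm_w, p_w, _, _ = het_white(ols.resid, X)
print("White LM statistic:", lm_w, "p-value:", p_w)

# Breusch-Pagan test: auxiliary regression of e^2 on x and z only.
lm_bp, p_bp, _, _ = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan LM statistic:", lm_bp, "p-value:", p_bp)

# WLS remedy when Var(e_i) = sigma^2 * z_i^2: weight each observation by 1/z_i^2,
# which is equivalent to dividing the whole equation by z_i as on the slide.
wls = sm.WLS(y, X, weights=1.0 / z**2).fit()
print(wls.params)                          # coefficients; standard errors are now valid
```

The weights argument of sm.WLS is the reciprocal of the error variance, so weights=1/z**2 corresponds exactly to the transformed, homoskedastic equation above.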
HETEROSKEDASTICITY-CORRECTED ROBUST STANDARD ERRORS
• The logic behind them: since heteroskedasticity causes problems with the standard errors of OLS but not with the coefficients, it makes sense to improve the estimation of the standard errors in a way that does not alter the estimates of the coefficients (White, 1980).
• Heteroskedasticity-corrected standard errors are typically larger than the OLS standard errors, thus producing lower t-scores.
• In panel and cross-sectional data with group-level variables, clustering the standard errors is the preferred answer to heteroskedasticity (a short code sketch follows the summary).

SUMMARY
► Multicollinearity does not lead to inconsistent estimates, but it makes them lose significance.
  • If really necessary, it can be remedied by dropping or transforming variables, or by getting more data.
► Heteroskedasticity does not lead to inconsistent estimates, but it invalidates inference.
  • It can be simply remedied by the use of (clustered) robust standard errors.
► Readings: Studenmund, Chapters 8 and 10; Wooldridge, Chapter 8.
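As a closing illustration that is not part of the original slides: a minimal sketch of how White's robust and clustered standard errors can be requested with statsmodels. The simulated data and the group identifiers are hypothetical.

```python
# Minimal sketch (not part of the original slides): heteroskedasticity-robust (HC)
# and clustered standard errors. Coefficients are identical to plain OLS;
# only the standard errors change. Data are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
groups = np.repeat(np.arange(50), 10)      # 50 hypothetical clusters (e.g. states)
e = rng.normal(size=n) * (1 + np.abs(x))   # heteroskedastic errors
y = 1.0 + 2.0 * x + e

X = sm.add_constant(x)
ols = sm.OLS(y, X)

# White (1980) heteroskedasticity-consistent standard errors (HC1 variant).
print(ols.fit(cov_type="HC1").bse)

# Clustered standard errors: allow correlated errors within each group.
print(ols.fit(cov_type="cluster", cov_kwds={"groups": groups}).bse)

# For comparison: the conventional (biased under heteroskedasticity) OLS errors.
print(ols.fit().bse)
```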