Econometrics
Lecture 6: Multicollinearity & Heteroskedasticity
Anna Donina

ON PREVIOUS LECTURES
• We discussed the specification of a regression equation
• Specification consists of choosing:
1. the correct independent variables
2. the correct functional form
3. the correct form of the stochastic error term

SHORT REVISION
• We talked about the choice of the correct functional form:
▪ What are the most common functional forms?
• We studied what happens if we omit a relevant variable:
▪ Does omitting a relevant variable cause a bias in the other coefficients?
• We studied what happens if we include an irrelevant variable:
▪ Does including an irrelevant variable cause a bias in the other coefficients?
• We defined the four specification criteria that determine if a variable belongs to the equation:
▪ Can you list some of these specification criteria?

ON TODAY'S LECTURE
• We will finish the discussion of the choice of independent variables by talking about multicollinearity
• We will start the discussion of the correct form of the error term by talking about heteroskedasticity
• For both of these issues, we will learn
▪ what the nature of the problem is
▪ what its consequences are
▪ how it is diagnosed
▪ what remedies are available

PERFECT MULTICOLLINEARITY
• Some explanatory variable is a perfect linear function of one or more other explanatory variables
• Violation of one of the classical assumptions
• The OLS estimate cannot be found
▪ intuitively: the estimator cannot distinguish which of the explanatory variables causes the change in the dependent variable if they move together
▪ technically: the matrix X'X is singular (not invertible)
• Rare and easy to detect

EXAMPLES OF PERFECT MULTICOLLINEARITY
Dummy variable trap
• Inclusion of a dummy variable for each category in a model with an intercept
• Example: wage equation for a sample of individuals who have high-school education or higher:
wagei = β1 + β2 high_schooli + β3 universityi + β4 phdi + ei
▪ since every individual falls into exactly one of the three categories, the dummies sum to one and are perfectly collinear with the intercept
• Automatically detected by most statistical software packages

IMPERFECT MULTICOLLINEARITY
• Two or more explanatory variables are highly correlated in the particular data set
• The OLS estimate can be found, but it may be very imprecise
▪ intuitively: the estimator can hardly distinguish the effects of the explanatory variables if they are highly correlated
▪ technically: the matrix X'X is nearly singular, which causes the variance of the estimator to be very large
• Usually referred to simply as "multicollinearity"
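As an illustration (not part of the slides), here is a minimal simulation sketch of imperfect multicollinearity: two highly correlated regressors are generated, and the inflated OLS standard errors and variance inflation factors (VIFs) show how the near-singularity of X'X blows up the variance of the estimates. The simulated data and variable names are assumptions for illustration only.

```python
# Minimal sketch: imperfect multicollinearity inflates OLS standard errors.
# Simulated data; variable names are illustrative only.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.02 * rng.normal(size=n)   # x2 is almost a linear function of x1
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print(np.corrcoef(x1, x2)[0, 1])             # correlation close to 1
print(res.bse)                               # large standard errors on both slopes
# Variance inflation factors for the two slope coefficients:
print([variance_inflation_factor(X, i) for i in (1, 2)])
```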
CONSEQUENCES OF MULTICOLLINEARITY
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Standard errors of the coefficients increase
▪ confidence intervals become very large, so the estimates are less reliable
▪ t-statistics become smaller, so variables may become insignificant

DETECTION OF MULTICOLLINEARITY
• Some multicollinearity exists in every equation; the aim is to recognize when it causes a severe problem
• Multicollinearity can be signaled by the underlying theory, but it is very much sample dependent
• We judge the severity of multicollinearity based on the properties of our sample and on the results we obtain
• One simple method: examine the correlation coefficients between the explanatory variables
▪ if some of these correlations are too high, we may suspect that the coefficients of the corresponding variables are affected by multicollinearity

REMEDIES FOR MULTICOLLINEARITY
• Drop a redundant variable
▪ when the variable is not needed to represent the effect on the dependent variable
▪ in case of severe multicollinearity, it makes no statistical difference which variable is dropped
▪ the theoretical underpinnings of the model should be the basis for such a decision
• Do nothing
▪ when multicollinearity does not cause insignificant t-scores or unreliable estimated coefficients
▪ deletion of a collinear variable can cause specification bias
• Increase the size of the sample
▪ confidence intervals are narrower when we have more observations

EXAMPLE
Estimating the demand for gasoline in the U.S.:
PCONi ... petroleum consumption in the i-th state
TAXi ... the gasoline tax rate in the i-th state
UHMi ... urban highway miles within the i-th state
REGi ... motor vehicle registrations in the i-th state

EXAMPLE (continued)
• We suspect multicollinearity between urban highway miles and motor vehicle registrations across states, because states that have a lot of highways might also have a lot of motor vehicles.
• Therefore, we might run into multicollinearity problems. How do we detect multicollinearity?
▪ Look at the correlation coefficient. It is indeed huge (0.978).
▪ Look at the coefficients of the two variables. Are they both individually significant? UHM is significant, but REG is not. This further suggests the presence of multicollinearity.
• Remedy: try dropping one of the correlated variables.

EXAMPLE (continued)
(Regression output for the gasoline demand equation not reproduced.)

HETEROSKEDASTICITY
• Observations of the error term are drawn from a distribution that no longer has a constant variance:
Var(εi) = σi², i = 1, 2, ..., n
Note: constant variance (homoskedasticity) means Var(εi) = σ², i = 1, 2, ..., n
• Often occurs in data sets in which there is a wide disparity between the largest and smallest observed values
▪ smaller values are often connected to smaller variance and larger values to larger variance (e.g. consumption of households depending on their income level)
• One particular form of heteroskedasticity (the variance of the error term is a function of some observable variable):
Var(εi) = h(xi), i = 1, 2, ..., n

HETEROSKEDASTICITY
(Figure: scatter plot of Y against X illustrating a heteroskedastic error term; not reproduced.)
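A small simulation sketch (not from the slides) of the particular form Var(εi) = h(xi): here the error standard deviation is assumed to be proportional to xi, so the spread of the errors grows with x. The functional form h(x) and all numbers are illustrative assumptions.

```python
# Minimal sketch: generating data with Var(eps_i) = sigma^2 * x_i^2.
# Purely illustrative; the form h(x) = sigma^2 * x^2 is an assumption.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
sigma = 0.5
eps = rng.normal(scale=sigma * x)      # error standard deviation grows with x
y = 2.0 + 1.5 * x + eps

# Compare the error spread in the lowest vs. highest quartile of x:
order = np.argsort(x)
low, high = eps[order[: n // 4]], eps[order[-(n // 4):]]
print(low.std(), high.std())           # the second figure is markedly larger
```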
CONSEQUENCES OF HETEROSKEDASTICITY
Violation of one of the classical assumptions
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Estimated standard errors of the coefficients are biased
▪ the heteroskedastic error term causes the dependent variable to fluctuate in a way that the OLS estimation procedure attributes to the independent variables
▪ heteroskedasticity invalidates the t and F statistics, which leads to unreliable hypothesis testing
▪ typically we encounter underestimation of the standard errors, so the t-scores are incorrectly too high
▪ under heteroskedasticity, OLS is no longer the best linear unbiased estimator (BLUE); there may be more efficient linear estimators

DETECTION OF HETEROSKEDASTICITY
• There are formal tests for heteroskedasticity
▪ sometimes a simple visual analysis of the residuals is sufficient to detect it
• We will derive a test for the model yi = β0 + β1xi + β2zi + εi
• The test is based on an analysis of the residuals
• The null hypothesis of the test is no heteroskedasticity: E(εi²) = σ²
▪ therefore, we analyse the relationship between the squared residuals and the explanatory variables

BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
1. Estimate the model by OLS and obtain the residuals ε̂i
2. Compute what Breusch and Pagan call gi:
gi = ε̂i² / σ̂², where σ̂² = (1/n) Σ ε̂i²
3. Estimate the auxiliary regression
gi = α0 + α1x1i + α2x2i + ... + αkxki + vi   (1)
4. Compute the LM test statistic
LM = ½ (TSS − SSR),
where TSS is the sum of squared deviations of the gi from their mean of 1, and SSR is the sum of squared residuals from the auxiliary regression
5. Under the null of homoskedasticity, LM ~ χ²(k), where k is the number of slope coefficients in (1)

WHITE TEST FOR HETEROSKEDASTICITY
1. Estimate the equation and get the residuals ei
2. Regress the squared residuals on all explanatory variables and on the squares and cross-products of all explanatory variables:
ei² = α0 + α1xi + α2zi + α3xi² + α4zi² + α5xizi + vi   (2)
3. Get the R² of this regression and the sample size n
4. Test the joint significance of the regressors in (2) with the test statistic
LM = nR² ~ χ²(k), where k is the number of slope coefficients in (2)
5. If nR² is larger than the χ² critical value, we reject H0 of homoskedasticity

REMEDIES FOR HETEROSKEDASTICITY
1. Redefining the variables
▪ in order to reduce the variance of observations with extreme values, e.g. by taking logarithms or by scaling some variables
2. Weighted Least Squares (WLS)
▪ consider the model yi = β0 + β1xi + β2zi + εi and suppose Var(εi) = σ²zi²
▪ it can be shown that if we redefine the model by dividing it by zi,
yi/zi = β0(1/zi) + β1(xi/zi) + β2 + εi/zi,
the new error term εi/zi has constant variance σ², so the transformed model is homoskedastic
3. Heteroskedasticity-corrected (robust) standard errors, discussed on the next slide
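Below is a hedged sketch of how the Breusch-Pagan test, the White test and the WLS remedy might be run with statsmodels. The data set, variable names and the assumed variance function Var(εi) = σ²zi² are illustrative, not from the lecture; note also that statsmodels implements the studentized (Koenker) variant of the Breusch-Pagan statistic, LM = nR², rather than the ½(TSS − SSR) form on the slide.

```python
# Minimal sketch: Breusch-Pagan test, White test, and WLS with statsmodels.
# Data and variable names are assumptions for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
z = rng.uniform(1.0, 5.0, size=n)
eps = rng.normal(scale=z)                       # Var(eps_i) proportional to z_i^2
y = 1.0 + 0.8 * x - 0.5 * z + eps

X = sm.add_constant(np.column_stack([x, z]))
ols = sm.OLS(y, X).fit()

# Breusch-Pagan: auxiliary regression of squared residuals on the regressors
lm, lm_pval, f, f_pval = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan LM =", lm, "p-value =", lm_pval)

# White: adds squares and cross-products of the regressors automatically
lm_w, lm_w_pval, f_w, f_w_pval = het_white(ols.resid, X)
print("White LM =", lm_w, "p-value =", lm_w_pval)

# WLS remedy under the assumption Var(eps_i) = sigma^2 * z_i^2:
# weights are proportional to the inverse error variance, i.e. 1/z_i^2
wls = sm.WLS(y, X, weights=1.0 / z**2).fit()
print(wls.summary().tables[1])
```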
HETEROSKEDASTICITY-CORRECTED ROBUST ERRORS
The logic behind the method:
▪ since heteroskedasticity causes problems with the standard errors of OLS but not with the coefficients, it makes sense to improve the estimation of the standard errors in a way that does not alter the estimates of the coefficients (White, 1980)
• Heteroskedasticity-corrected standard errors are typically larger than the OLS standard errors, thus producing lower t-scores
• In panel and cross-sectional data with group-level variables, clustering the standard errors is the preferred answer to heteroskedasticity

HETEROSKEDASTICITY-ROBUST INFERENCE AFTER OLS
• All formulas are only valid in large samples
• Formula for the heteroskedasticity-robust OLS standard error, e.g. in the simple regression model yi = β0 + β1xi + εi:
Var̂(β̂1) = Σi (xi − x̄)² ε̂i² / [Σi (xi − x̄)²]²,
and the robust standard error of β̂1 is the square root of this expression
• Using this formula, the usual t-test is valid asymptotically
• The usual F-statistic does not work under heteroskedasticity, but robust versions are available in most software

HETEROSKEDASTICITY-ROBUST INFERENCE AFTER OLS
Example: hourly wage equation (regression output not reproduced).

SUMMARY
Multicollinearity
▪ does not lead to inconsistent estimates, but it makes them lose significance
▪ if really necessary, it can be remedied by dropping or transforming variables, or by getting more data
Heteroskedasticity
▪ does not lead to inconsistent estimates, but it invalidates inference
▪ can be remedied simply by the use of (clustered) robust standard errors
❑ Readings: Studenmund, Chapters 8 and 10; Wooldridge, Chapter 8
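To close, a minimal sketch of the robust and clustered standard errors summarized above, using the covariance options in statsmodels. The data, variable names and group structure are invented for illustration; "HC1" is one of several available robust covariance estimators.

```python
# Minimal sketch: heteroskedasticity-robust and clustered standard errors.
# Data, variable names, and the group structure are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
x = rng.normal(size=n)
groups = rng.integers(0, 20, size=n)            # e.g. 20 states or firms
eps = rng.normal(scale=1.0 + np.abs(x))         # heteroskedastic errors
y = 1.0 + 0.7 * x + eps

X = sm.add_constant(x)
usual = sm.OLS(y, X).fit()                      # conventional OLS standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")       # White-type robust standard errors
clustered = sm.OLS(y, X).fit(cov_type="cluster",
                             cov_kwds={"groups": groups})

# Coefficients are identical across the three fits; only the standard errors differ.
print(usual.bse, robust.bse, clustered.bse)
```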