Introduction to Econometrics
Home assignment # 4 (Suggested solutions)

1. For this exercise, use the data in fertil.gdt file. It contains information about a sample of women in Botswana: their education (variable educ), their age (variable age), and the number of children they have (variable children).

(a) Estimate the model $\text{children} = \beta_0 + \beta_1\,\text{educ} + \varepsilon$, interpret the coefficient $\beta_1$.

(b) Conduct the Breusch-Pagan test for heteroskedasticity. Does its result justify the use of robust standard errors? If yes, reestimate the model using these robust standard errors and comment on the difference with respect to the model estimated by simple OLS.

(c) Redefine the model as $\text{children} = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \varepsilon$ and estimate it.

i. Does age have a significant impact on number of children of Botswana women? State the hypothesis, and compute the test statistics by hand. Interpret the results and compare it to the results of the test in Gretl.

ii. Do you find justification for the inclusion of age in quadratic form? [Hint: State the four specification criteria and argue if they are satisfied for the quadratic in age.]

iii. How does the coefficient $\beta_1$ change when compared to part (a)? Does this signal any bias in the model from part (a)? Where does it come from? Explain the sign of this bias.

(d) Do you think the coefficient $\beta_1$ from the model in part (c) may suffer from some omitted variable bias? What variable(s) could be missing in the model and how would the coefficient $\beta_1$ change if they were included in the regression? (What is the sign of the potential bias in coefficient $\beta_1$?)

(e) Conduct the RESET test of the model from part (c) and interpret the results.
Solution:

(a) The estimation result is:

[Gretl output]

The coefficient $\beta_1$ means that an additional year of education is associated with about 0.2 fewer children per woman, on average. (Obviously, children is not a continuous variable, so this interpretation sounds a little odd, but the statement is about averages.)

(b) The result of the test is:

[Gretl output]

The p-value is essentially zero, so we reject the null hypothesis of homoskedasticity. The error term in the model is heteroskedastic and we should use robust standard errors.
When we reestimate the model using robust standard errors, we get:

[Gretl output]

As expected, the coefficients are unchanged, but the standard errors are larger, showing that they were underestimated by the conventional OLS formula in part (a).
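The test and the robust re-estimation in part (b) can be sketched in code. The example below is only an illustration under stated assumptions: it uses synthetic data (not fertil.gdt), arbitrary coefficient values, and hypothetical helper names (`simple_ols`, `breusch_pagan_lm`, `slope_standard_errors`). It implements the $n R^2$ (Koenker) form of the Breusch-Pagan statistic and White (HC0) standard errors, which may differ from the variants Gretl reports.

```python
import math
import random


def simple_ols(x, y):
    """OLS of y on a constant and x; returns intercept, slope, residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid


def breusch_pagan_lm(x, resid):
    """LM = n * R^2 from regressing squared residuals on x (Koenker form)."""
    e2 = [e ** 2 for e in resid]
    _, _, aux_resid = simple_ols(x, e2)
    e2_bar = sum(e2) / len(e2)
    sst = sum((v - e2_bar) ** 2 for v in e2)
    sse = sum(u ** 2 for u in aux_resid)
    lm = max(len(x) * (1.0 - sse / sst), 0.0)
    # With one regressor, LM ~ chi2(1) under H0; the chi2(1) upper-tail
    # probability is erfc(sqrt(LM / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


def slope_standard_errors(x, resid):
    """Conventional and White (HC0) standard errors of the slope."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    se_ols = math.sqrt(s2 / sxx)
    se_hc0 = math.sqrt(sum((xi - x_bar) ** 2 * e ** 2
                           for xi, e in zip(x, resid))) / sxx
    return se_ols, se_hc0


# Synthetic data (assumption): the error spread grows with x,
# so the errors are heteroskedastic by construction.
random.seed(0)
n_obs = 2000
x = [random.uniform(0, 10) for _ in range(n_obs)]
y = [3.0 - 0.2 * xi + (0.2 + 0.3 * xi) * random.gauss(0, 1) for xi in x]

intercept, slope, resid = simple_ols(x, y)
lm, p_value = breusch_pagan_lm(x, resid)
se_ols, se_hc0 = slope_standard_errors(x, resid)

print("slope:", slope)
print("BP LM statistic:", lm, "p-value:", p_value)
print("conventional SE:", se_ols, "robust SE:", se_hc0)
```

The point estimates are the same in both cases because robust standard errors change only the estimated variance of the coefficients, not the coefficients themselves.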
(c) The estimation result of the redefined model is:

[Gretl output]

i. To test whether age has a significant impact on the number of children of Botswana women, we need an F-test of the joint significance of age and age squared:

$$H_0: \beta_2 = 0 \;\&\; \beta_3 = 0 \quad \text{vs} \quad H_A: \beta_2 \neq 0 \,\vee\, \beta_3 \neq 0$$

$$F = \frac{(SSE_R - SSE_U)/J}{SSE_U/(n-k)} \sim F_{J,\,n-k},$$

$$F = \frac{(18571.78 - 9284.147)/2}{9284.147/(4361-4)} = 2179, \qquad F_{crit}(2,\,4357;\,0.95) = 3.00$$

Since $2179 > 3.00$, we reject the null hypothesis that both coefficients are zero: age, entered in quadratic form, has a significant impact on the number of children. The results of the test in Gretl confirm our conclusion:

[Gretl output]

ii. The specification criteria are: test of significance of the coefficient, change in $R^2$, bias in other coefficients, and theory. The significant coefficients for the age and age squared variables, a sharp rise in $R^2$ and a dramatic change in the coefficient on educ signal that this model performs much better than the one in part (a), and that age in quadratic form should indeed be included in the model.
Moreover, theory (intuition) also suggests that age and age squared should be included in the model: obviously, very young women have fewer children than older women, because they have just started their families. The impact of age will be smaller for older women, because after some age it is biologically impossible to have more children. This justifies the quadratic specification in age, which reflects the concave relationship.

iii. The coefficient $\beta_1$ decreases markedly in absolute value when compared to part (a), which signals that there was an omitted variable bias caused by the omission of the age variable. This bias was negative. It arises because age is a relevant variable for the model (its effect on the dependent variable is positive; see the coefficient in the estimated equation) and because it is correlated with the variable educ (this correlation is negative, which we can verify in Gretl and explain by the possible development of education in Botswana, leading to higher education of new generations compared to old generations). The sign of the bias is the sign of the product of these two relationships (positive times negative), which is why we should expect it to be negative.

(d) We can hypothesize that the coefficient $\beta_1$ from the model in part (c) may still suffer from some omitted variable bias. A variable that could affect the number of children a woman has is one representing the socio-economic characteristics of the woman's family. Since we can suppose that women in a better social and economic situation have fewer children (negative correlation) and more education (positive correlation), we would expect the bias to be negative. Hence, if we had such a variable and included it in the equation, we should expect the coefficient to become even less negative than in part (c) (smaller in absolute value).
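The sign argument for omitted variable bias used in (c) iii and (d) can be illustrated with a small simulation. This is a sketch on synthetic data, not on fertil.gdt: the coefficient values, the age range, the linear (rather than quadratic) age effect, and the helper names `slope` and `residualize` are all assumptions made for illustration. It checks the in-sample identity that the short-regression slope equals the long-regression slope plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one.

```python
import random


def slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def residualize(x, z):
    """Residuals from an OLS regression of x on a constant and z."""
    n = len(x)
    x_bar, z_bar = sum(x) / n, sum(z) / n
    b = slope(z, x)
    return [xi - x_bar - b * (zi - z_bar) for xi, zi in zip(x, z)]


# Synthetic data (assumption): older women have less education,
# and age raises the number of children.
random.seed(1)
n_obs = 2000
age = [random.uniform(15, 49) for _ in range(n_obs)]
educ = [10.0 - 0.15 * (a - 15) + random.gauss(0, 2) for a in age]
children = [1.0 - 0.1 * e + 0.12 * a + random.gauss(0, 1)
            for e, a in zip(educ, age)]

# Short regression: children on educ only (age omitted).
b1_short = slope(educ, children)

# Long regression: children on educ and age, computed by partialling out
# the other regressor (Frisch-Waugh-Lovell).
b1_long = slope(residualize(educ, age), children)
b2_long = slope(residualize(age, educ), children)

# Slope from regressing the omitted variable (age) on the included one (educ).
delta = slope(educ, age)

bias = b1_short - b1_long
print("b1 short:", b1_short, "b1 long:", b1_long)
print("bias:", bias, "b2 * delta:", b2_long * delta)
```

Because the age coefficient is positive and the slope of age on educ is negative here, their product, and therefore the bias, is negative: the short regression overstates the negative effect of education.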
(e) The result of the RESET test of the model from part (c) is:

[Gretl output]

The very low p-value signals that we should reject the null hypothesis that the model is correctly specified. This indicates that some variables are still omitted from our estimation or that our specification has an incorrect functional form.

2. Suppose the following investment model was estimated with quarterly data from 1997-2009 (standard errors in parentheses):

$$\hat{I}_t = \underset{(1.10)}{7.70} + \underset{(0.23)}{0.55}\,Y_t + \underset{(0.12)}{0.63}\,Q_{t2} + \underset{(1.03)}{1.55}\,Q_{t3} + \underset{(0.74)}{2.13}\,Q_{t4}, \qquad n = 64, \quad R^2 = 0.72,$$

where $I_t$ is the investment in period $t$, $Y_t$ is the GDP in period $t$, and the dummy variables $Q_{ti}$ are equal to 1 in the $i$-th quarter and zero otherwise ($i = 2, 3, 4$). Denote the coefficients associated with the dummies $\delta_2$, $\delta_3$ and $\delta_4$.

(a) What restriction on these parameters would lead to the model

$$I_t = \beta_0 + \beta_Y Y_t + \delta q_t + \varepsilon_t,$$

where $q_t = 0, 1, 2, 3$ in the first, second, third and fourth quarters respectively? Briefly discuss this restriction. [Hint: To find the restrictions, compare the coefficients of the two models (restricted and unrestricted) for each quarter.]

(b) Test the restriction if the regression $R^2$ of the restricted model was 0.68.
(c) Explain how you would test for the presence of AR(4) autocorrelation of the error term in this model. Describe all the steps needed to conduct the test, the null and alternative hypotheses, and the test statistic.

Solution:

(a) To find the restrictions, we compare the expected values of $I_t$ in each quarter for both models, the unrestricted one (first equation, left column below) and the restricted one (second equation, right column below):

$$I_t = \alpha + \beta_Y Y_t + \delta_2 Q_{t2} + \delta_3 Q_{t3} + \delta_4 Q_{t4} + \eta_t$$

$$I_t = \beta_0 + \beta_Y Y_t + \delta q_t + \varepsilon_t$$

$$\begin{array}{lll}
\text{1st quarter:} & E[I_t] = \alpha + \beta_Y Y_t & E[I_t] = \beta_0 + \beta_Y Y_t \\
\text{2nd quarter:} & E[I_t] = \alpha + \delta_2 + \beta_Y Y_t & E[I_t] = \beta_0 + \delta + \beta_Y Y_t \\
\text{3rd quarter:} & E[I_t] = \alpha + \delta_3 + \beta_Y Y_t & E[I_t] = \beta_0 + 2\delta + \beta_Y Y_t \\
\text{4th quarter:} & E[I_t] = \alpha + \delta_4 + \beta_Y Y_t & E[I_t] = \beta_0 + 3\delta + \beta_Y Y_t
\end{array}$$

The comparison gives us the following results:

$$\begin{array}{ll}
\text{1st quarter} \Rightarrow & \alpha = \beta_0 \\
\text{2nd quarter} \Rightarrow & \alpha + \delta_2 = \beta_0 + \delta \\
\text{3rd quarter} \Rightarrow & \alpha + \delta_3 = \beta_0 + 2\delta \\
\text{4th quarter} \Rightarrow & \alpha + \delta_4 = \beta_0 + 3\delta
\end{array}$$

Substituting $\alpha = \beta_0$ gives $\delta_2 = \delta$, $\delta_3 = 2\delta$ and $\delta_4 = 3\delta$, which reduces finally to the two restrictions

$$\delta_3 = 2\delta_2, \qquad \delta_4 = 3\delta_2.$$

These restrictions assume a constant difference between consecutive quarters across the year (meaning, e.g., that the difference between the first and the second quarter is the same as between the second and the third).

(b) We test the null hypothesis that these restrictions are valid using the standard F-test over the restricted and unrestricted models. As given in the setup, $R^2_U = 0.72$ and $R^2_R = 0.68$. The number of restrictions is $J = 2$, the number of observations is $n = 64$, and the number of parameters of the unrestricted model is $k = 5$. We construct the F-statistic

$$F = \frac{(R^2_U - R^2_R)/J}{(1 - R^2_U)/(n-k)} = \frac{(0.72 - 0.68)/2}{(1 - 0.72)/(64 - 5)} = 4.2143$$

and compare it to the corresponding 5% critical value $F_{2,59} = 3.1531$. Since $4.2143 > 3.1531$, we reject the null hypothesis that the restrictions are valid.

(c) AR(4) autocorrelation of the error term implies that the error term has the following structure:

$$\varepsilon_t = \rho_1 \varepsilon_{t-1} + \rho_2 \varepsilon_{t-2} + \rho_3 \varepsilon_{t-3} + \rho_4 \varepsilon_{t-4} + u_t$$

We can test for AR(4) autocorrelation of the error term using an analysis of the residuals.
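Since OLS is consistent even under autocorrelation, the residuals are consistent estimates of the stochastic error term and we can thus use them to test for autocorrelation of the error term. The test proceeds as follows:

i. Estimate the original model by OLS and save the residuals $e_t = I_t - \hat{I}_t$.

ii. Estimate the model $e_t = \alpha + \rho_1 e_{t-1} + \rho_2 e_{t-2} + \rho_3 e_{t-3} + \rho_4 e_{t-4} + u_t$ by OLS.

iii. Test if $\rho_1 = \rho_2 = \rho_3 = \rho_4 = 0$ using the standard F-test, with $J = 4$ restrictions and with $n$ and $k$ taken from the auxiliary regression in step ii:

$$H_0: \rho_1 = 0 \;\&\; \rho_2 = 0 \;\&\; \rho_3 = 0 \;\&\; \rho_4 = 0 \quad \text{vs} \quad H_A: \rho_1 \neq 0 \,\vee\, \rho_2 \neq 0 \,\vee\, \rho_3 \neq 0 \,\vee\, \rho_4 \neq 0$$

$$F = \frac{(SSE_R - SSE_U)/J}{SSE_U/(n-k)} \sim F_{J,\,n-k}.$$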
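The three steps above can be sketched as follows. This is a minimal illustration, not the assignment's data: the series `e` is simulated with a seasonal AR(4) coefficient of 0.6 and stands in for the residuals saved in step i, and the helper names (`solve`, `sse_ols`, `ar_f_test`) are hypothetical. The restricted model is the auxiliary regression with only a constant.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sse_ols(rows, y):
    """Sum of squared residuals from OLS of y on the regressors in rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def ar_f_test(e, lags=4):
    """F-test of H0: all lag coefficients are zero in the auxiliary regression."""
    y = e[lags:]
    unrestricted = [[1.0] + [e[t - j] for j in range(1, lags + 1)]
                    for t in range(lags, len(e))]
    restricted = [[1.0] for _ in y]
    sse_u = sse_ols(unrestricted, y)
    sse_r = sse_ols(restricted, y)
    n, k = len(y), lags + 1
    f_stat = ((sse_r - sse_u) / lags) / (sse_u / (n - k))
    return f_stat, lags, n - k


# Simulated stand-in for the saved residuals (assumption): AR(4) with
# rho_4 = 0.6 and the other coefficients zero; the first 50 draws are
# discarded as burn-in.
random.seed(2)
series = [0.0] * 4
for t in range(4, 254):
    series.append(0.6 * series[t - 4] + random.gauss(0, 1))
e = series[54:]

f_stat, df_num, df_den = ar_f_test(e)
print("F statistic:", f_stat, "degrees of freedom:", (df_num, df_den))
# The 5% critical value of F(4, 191) is roughly 2.4 (approximate value).
```

An F statistic above the critical value leads to rejecting the null hypothesis of no autocorrelation up to order four.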