Introduction to Econometrics
Home assignment # 3
(Suggested solutions)
1. Decide if the following claims are true or false (and explain why):
(a) When we add an omitted variable into a regression model, the coeﬃcients of
the remaining variables can change, but they cannot lose signiﬁcance.
(b) Compared with the unconstrained regression, estimation of a least squares regression
under a constraint (say β2 = β3) will result in a higher $R^2$ if the
constraint is true and a lower $R^2$ if it is false.
(c) Since $x^2$ is an exact function of $x$, we will be faced with the exact multicollinearity
if we attempt to use both $x$ and $x^2$ as regressors.
(d) If we reject the null hypothesis of the RESET test, we conclude that our
model is correctly specified.
Solution:
(a) FALSE. If we add an omitted variable into a regression model, the coefficients
of variables that are correlated with that variable will change (they suffered
from omitted variable bias before we added that variable), and they can lose
significance as well. For example, a variable can lose significance because its
true effect on the dependent variable is zero, but the coefficient in the model
with the omitted variable was biased towards a positive effect (it was positive and
significant, and then it lost its significance when the omitted variable was added).
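(b) FALSE. Estimation of a constrained model can never result in a higher $R^2$ than
the unconstrained model, no matter if the constraint is true or false: least squares
minimizes the sum of squared residuals, and a constraint can only raise that minimum
or leave it unchanged. If the constraint is false, the drop in $R^2$
should be larger than if it is true.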
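The argument in (b) can be checked numerically. The sketch below is only an illustration under stated assumptions: the six data points are made up, and instead of β2 = β3 it imposes a simpler hypothetical constraint (a slope fixed at 2 in a one-regressor model). The intercept is re-estimated by least squares in both cases.

```python
# Illustration only: the data and the imposed slope value are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)


def r_squared(slope):
    """R^2 of y = a + slope * x, with the intercept a chosen by least squares."""
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / sst


ols_slope = s_xy / s_xx                   # unconstrained least squares slope
r2_unconstrained = r_squared(ols_slope)
r2_constrained = r_squared(2.0)           # hypothetical constraint: slope = 2

print(f"unconstrained R^2 = {r2_unconstrained:.6f}")
print(f"constrained   R^2 = {r2_constrained:.6f}")
```

Whatever value is imposed on the slope, the constrained $R^2$ cannot exceed the unconstrained one; the two coincide only when the imposed value equals the least squares estimate.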
(c) FALSE. The exact multicollinearity problem arises only if the function linking the
two variables is linear, and $x^2$ is a nonlinear function of $x$.
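A small numerical check of (c), as a sketch: the sample values 1, ..., 10 are an arbitrary assumption. In a model with an intercept, exact multicollinearity between two regressors requires one to be an exact linear function of the other, which shows up as a sample correlation of exactly ±1.

```python
# Illustration only: the sample values 1..10 are an arbitrary assumption.
import math


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


x = [float(i) for i in range(1, 11)]
x_squared = [v ** 2 for v in x]      # nonlinear function of x
x_linear = [2 * v + 3 for v in x]    # exact linear function of x

r_sq = correlation(x, x_squared)
r_lin = correlation(x, x_linear)
print(f"corr(x, x^2)    = {r_sq:.4f}")   # high, but strictly below 1
print(f"corr(x, 2x + 3) = {r_lin:.4f}")  # equal to 1 up to rounding
```

The correlation between $x$ and $x^2$ is high, so the standard errors may be inflated, but the regression can still be estimated; only the linear transformation makes the two regressors indistinguishable.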
(d) FALSE. The null hypothesis of the RESET test is that the model is correctly
specified: all important variables are included in the model and therefore
the residuals are white noise. If we reject the null hypothesis of the
RESET test, the conclusion is that the model is misspecified and some important
variables (for example, nonlinear terms) are not included in the model.
2. Your aim is to estimate how the number of prenatal examinations and several other
characteristics inﬂuence the birth weight of a baby. Your initial hypothesis is that
more responsible pregnant women visit the doctor more often and this leads to healthier
and thus also bigger babies.
(a) In your ﬁrst speciﬁcation, you run the following model:
$$\text{bwght} = \beta_0 + \beta_1\,\text{npvis} + \beta_2\,\text{npvis}^2 + \beta_3\,\text{monpre} + \beta_4\,\text{male} + \varepsilon ,$$
where bwght is the birth weight of the baby (in grams), npvis is the number of
prenatal doctor’s visits, monpre is the month of pregnancy in which the prenatal
care began and male is a dummy, equal to one if the baby is a boy and zero if
it is a girl. You obtain the following results from Stata:¹
  Source   |          SS     df           MS      Number of obs =    1726
           |                                      F(4, 1721)    =    9.70
  Model    |  12848047.5      4   3212011.87      Prob > F      =  0.0000
  Residual |   570003184   1721   331204.639      R-squared     =  0.0220
           |                                      Adj R-squared =  0.0198
  Total    |   582851231   1725   337884.772      Root MSE      =   575.5

  bwght    |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
  npvis    |   53.50974    11.41313    4.69    0.000    31.12468     75.8948
  npvissq  |  -1.173175    .3591552   -3.27    0.001   -1.877601   -.4687481
  monpre   |   30.47033    12.40794    2.46    0.014    6.134091    54.80657
  male     |   76.69243    27.76083    2.76    0.006    22.24391     131.141
  _cons    |   2853.196    101.3073   28.16    0.000    2654.498    3051.895
i. Is there strong evidence that npvissq (stands for $\text{npvis}^2$) should be included
in the model?
ii. How do you interpret the negative coeﬃcient of npvissq?
iii. Holding npvis and monpre fixed, test the hypothesis that newborn boys
weigh 100 grams more than newborn girls (at 95% confidence level).
¹ Stata is statistical software that can be used for econometric purposes. The Stata output
is quite similar to the Gretl output you are familiar with. In particular, Coef. denotes the estimated
coefficients, Std. Err. denotes the standard errors of these coefficients, t denotes the t-statistic of the
test of significance of the coefficients, and P>|t| denotes the corresponding p-value.
(b) A friend of yours, student of medicine, reminds you of the fact that the age of
the parents (especially of the mother) might be a decisive factor for the health
and for the weight of the baby. Therefore, in your second speciﬁcation, you
decide to include in your model also the age of the mother (mage) and of the
father (fage). The results of your estimation are now the following:
  Source   |          SS     df           MS      Number of obs =    1720
           |                                      F(6, 1713)    =    8.25
  Model    |  16270165.8      6    2711694.3      Prob > F      =  0.0000
  Residual |   563258231   1713   328813.912      R-squared     =  0.0281
           |                                      Adj R-squared =  0.0247
  Total    |   579528396   1719   337131.121      Root MSE      =  573.42

  bwght    |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
  npvis    |   52.43859    11.40558    4.60    0.000    30.06826    74.80891
  npvissq  |  -1.138545    .3585648   -3.18    0.002   -1.841816   -.4352743
  monpre   |   34.35661    12.69477    2.71    0.007    9.457725     59.2555
  male     |   74.45482    27.75247    2.68    0.007    20.02252    128.8871
  mage     |   .5285275    4.218069    0.13    0.900   -7.744582    8.801637
  fage     |   8.697342    3.465973    2.51    0.012    1.899357    15.49533
  _cons    |   2592.813    139.6173   18.57    0.000    2318.974    2866.651
i. Comment on the signiﬁcance of the coeﬃcients on mage and fage separately:
are they in line with your friend’s claim?
ii. Test the hypothesis that mage and fage are jointly signiﬁcant (at 95%
conﬁdence level). Is the result in line with your friend’s claim?
iii. How can you reconcile your findings from the two previous questions?
(c) In your third speciﬁcation, you decide to drop fage and you get the following
results:
  Source   |          SS     df           MS      Number of obs =    1726
           |                                      F(5, 1720)    =    8.75
  Model    |  14451685.6      5   2890337.13      Prob > F      =  0.0000
  Residual |   568399545   1720   330464.852      R-squared     =  0.0248
           |                                      Adj R-squared =  0.0220
  Total    |   582851231   1725   337884.772      Root MSE      =  574.86

  bwght    |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
  npvis    |   52.27885    11.41406    4.58    0.000    29.89196    74.66575
  npvissq  |  -1.142647    .3590214   -3.18    0.001   -1.846811   -.4384821
  monpre   |   35.25912    12.58328    2.80    0.005    10.57898    59.93927
  male     |   79.38175    27.75667    2.86    0.004    24.94136    133.8221
  mage     |   -6.91257    3.137972   -2.20    0.028   -13.06721    -.757928
  _cons    |   2648.851    137.2778   19.30    0.000    2379.602      2918.1
Comment on the signiﬁcance of the coeﬃcient on mage, compared to the results
from part (b). Is your ﬁnding in line with your reasoning in part (b)? Does it
conﬁrm your friend’s claim?
(d) Having regained trust in your friend, you consult your results once more with
him. Together, you come up with an interesting question: whether smoking
during pregnancy can aﬀect the weight of the baby. Fortunately, you have at
your disposition the variable cigs, standing for the average number of cigarettes
each woman in your sample smokes per day during the pregnancy, and so you
can include it in your model. However, your friend warns you that women
who smoke during pregnancy are in general less responsible than those who do
not smoke, and that these women also tend to visit the doctor less often. (In
other words, the more a woman smokes, the fewer prenatal doctor’s visits she
has.) This is an important fact that you have to take into consideration while
interpreting your ﬁnal results, which are:
  Source   |          SS     df           MS      Number of obs =    1622
           |                                      F(6, 1615)    =    7.49
  Model    |  14560828.9      6   2426804.81      Prob > F      =  0.0000
  Residual |   523281374   1615   324013.235      R-squared     =  0.0271
           |                                      Adj R-squared =  0.0235
  Total    |   537842203   1621   331796.547      Root MSE      =  569.22

  bwght    |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
  npvis    |   42.43442    11.59582    3.66    0.000    19.68999    65.17885
  npvissq  |  -.8948737    .3624432   -2.47    0.014   -1.605782   -.1839653
  monpre   |   31.77658    12.78156    2.49    0.013    6.706395    56.84676
  male     |   82.39438    28.34937    2.91    0.004    26.78897    137.9998
  mage     |  -6.980738    3.227181   -2.16    0.031   -13.31064   -.6508356
  cigs     |    -10.209    3.398309   -3.00    0.003   -16.87456    -3.54344
  _cons    |   2748.856     141.868   19.38    0.000    2470.591     3027.12
i. Interpret the coeﬃcient on cigs.
ii. What evidence do you ﬁnd that cigs really should be included in the model?
List at least two arguments.
iii. Compare the coeﬃcient on npvis with the one you obtained in part (c). Do
you think there was a bias? If yes, explain where it came from and interpret
its sign.
Solution:
(a) i. The p-value on the coefficient on npvissq is very small (0.001, with t = −3.27),
and hence the variable is strongly significant and should be included in the model.
ii. The negative coefficient on npvissq signals a concave form of the impact of
the number of prenatal doctor’s visits, meaning that there are decreasing
returns to visiting the doctor. A possible explanation is that some number
of visits is beneficial for all pregnant women, but a greater need for visits
could mean that the pregnancy is risky for some reason and the woman
has to go to the doctor more often than usual. Such a woman is also more
likely to have a smaller baby.
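The concavity can be made concrete with the estimates from part (a). The sketch below computes the estimated marginal effect of one more visit, β1 + 2·β2·npvis, and the number of visits at which it turns negative; the function name and the chosen visit counts are illustrative assumptions.

```python
# Coefficients taken from the Stata output in part (a).
b_npvis = 53.50974
b_npvissq = -1.173175


def marginal_effect(npvis):
    """Estimated change in birth weight (grams) for one more visit at npvis."""
    return b_npvis + 2 * b_npvissq * npvis


# The fitted quadratic peaks where the marginal effect is zero.
turning_point = -b_npvis / (2 * b_npvissq)
print(f"turning point: {turning_point:.1f} visits")

for visits in (5, 10, 20, 30):   # arbitrary illustrative values
    print(f"npvis = {visits:2d}: marginal effect = {marginal_effect(visits):7.2f} g")
```

Under these estimates the fitted effect of an additional visit is positive up to roughly 23 visits and negative beyond that, which matches the interpretation that very frequent visits signal a risky pregnancy.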
iii. The hypothesis can be stated as
$$H_0 : \beta_4 = 100 \quad \text{vs} \quad H_A : \beta_4 \neq 100 .$$
The test statistic is
$$t = \frac{\hat{\beta}_4 - 100}{\text{s.e.}(\hat{\beta}_4)} \sim t_{n-k} ,$$
where k = 5 and n = 1726 in this case. Since the test is two-sided, we have to
compare the absolute value of the statistic to the critical value $t_{1721,\,0.975}$.
The statistic is
$$t = \frac{\hat{\beta}_4 - 100}{\text{s.e.}(\hat{\beta}_4)} = \frac{76.69243 - 100}{27.76083} = -0.8396 .$$
With 1721 degrees of freedom, the critical value is practically $t_{\infty,\,0.975} = 1.96$,
and we see that the absolute value of the statistic is smaller than this critical value.
Hence, we cannot reject the null hypothesis at 95% confidence level: the data are
consistent with newborn boys weighing 100 grams more than newborn girls.
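The arithmetic of this test can be reproduced in a few lines. This is a minimal sketch that, like the solution, uses the standard normal critical value as an approximation to the t distribution with 1721 degrees of freedom; the variable names are illustrative.

```python
from statistics import NormalDist

# Estimates for the male dummy from the Stata output in part (a).
coef_male = 76.69243
se_male = 27.76083
ci_low, ci_high = 22.24391, 131.141   # reported 95% confidence interval
hypothesized = 100.0                  # value of the coefficient under H0

t_stat = (coef_male - hypothesized) / se_male
crit = NormalDist().inv_cdf(0.975)    # two-sided test, large-sample approximation
reject = abs(t_stat) > crit

print(f"t = {t_stat:.4f}, critical value = {crit:.2f}, reject H0: {reject}")
print(f"100 inside the 95% interval: {ci_low <= hypothesized <= ci_high}")
```

The same conclusion can be read directly from the Stata table: the hypothesized value 100 lies inside the reported 95% confidence interval for the coefficient on male.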
(b) i. When we look at the p-values of the corresponding coefficients, we see that
whereas fage is significant at 95% confidence level (its p-value is 0.012, so it is
not significant at 99% confidence level), mage is insignificant (p-value 0.900).
This is not in line with our friend’s claim, who says that especially the age
of the mother should be an important factor.
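The t-statistics and p-values for mage and fage can be recomputed from the reported coefficients and standard errors. The sketch below uses the standard normal distribution as a large-sample approximation to the t distribution with 1713 degrees of freedom, so the p-values agree with Stata's only up to rounding.

```python
from statistics import NormalDist

# (coefficient, standard error) from the Stata output in part (b).
estimates = {
    "mage": (0.5285275, 4.218069),
    "fage": (8.697342, 3.465973),
}


def two_sided_p(coef, se):
    """t-statistic and two-sided p-value (normal approximation)."""
    t = coef / se
    return t, 2 * (1 - NormalDist().cdf(abs(t)))


results = {name: two_sided_p(*vals) for name, vals in estimates.items()}
for name, (t, p) in results.items():
    print(f"{name}: t = {t:.2f}, p = {p:.3f}")
```

The p-value for fage falls between 0.01 and 0.05, so the coefficient is significant at the 5% level but not at the 1% level, while the p-value for mage is far above any conventional level.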
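ii. Let us introduce the following notation for the model from this part:
$$\text{bwght} = \beta_0 + \beta_1\,\text{npvis} + \beta_2\,\text{npvis}^2 + \beta_3\,\text{monpre} + \beta_4\,\text{male} + \beta_5\,\text{mage} + \beta_6\,\text{fage} + \varepsilon .$$
Using this notation, the hypothesis that both mage and fage are jointly
significant can be stated as
$$H_0 : \beta_5 = 0 \text{ and } \beta_6 = 0 \quad \text{vs} \quad H_A : \beta_5 \neq 0 \text{ or } \beta_6 \neq 0 .$$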
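When we incorporate the restrictions, we see that the restricted model is
exactly the same as the one estimated in part (a). Hence, we can use the
SSE from part (a) as $SSE_R$ and the SSE from part (b) as $SSE_U$ and
construct the test statistic of the F-test (note that we cannot use the $R^2$
version of the F-test, because $SST_R \neq SST_U$ as the numbers of observations
in the two models are different, 1726 vs 1720). Strictly speaking, the restricted
model should be re-estimated on the same 1720 observations; since only six
observations differ, the SSE from part (a) serves as a close approximation: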
$$F = \frac{(SSE_R - SSE_U)/J}{SSE_U/(n-k)} \sim F_{J,\,n-k} ,$$
where J = 2, n = 1720 and k = 7 in this case. The test statistic is equal
to
$$F = \frac{(570003184 - 563258231)/2}{563258231/(1720 - 7)} = 10.26$$
and it is larger than the corresponding critical value $F_{2,\infty;\,0.95} = 3.00$. Hence,
we can reject the null hypothesis and we conclude that mage and fage are
jointly significant.
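The F-statistic above can be verified directly from the two residual sums of squares. This is a sketch with illustrative variable names; the critical value uses the fact that an F distribution with 2 and infinitely many degrees of freedom is a chi-square with 2 degrees of freedom divided by 2, whose 95% quantile equals −ln(0.05).

```python
import math

# Residual sums of squares from the Stata outputs.
sse_restricted = 570003184      # part (a): model without mage and fage
sse_unrestricted = 563258231    # part (b): model with mage and fage
n, k, J = 1720, 7, 2            # observations, parameters, restrictions

f_stat = ((sse_restricted - sse_unrestricted) / J) / (sse_unrestricted / (n - k))

# 95% critical value of F(2, infinity): chi-square(2) quantile divided by 2.
crit = -math.log(0.05)

print(f"F = {f_stat:.2f}, critical value = {crit:.2f}, reject H0: {f_stat > crit}")
```

The denominator equals the residual mean square reported by Stata in part (b), 328813.912, which is a quick check that the inputs were read off correctly.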
iii. The finding about the joint significance from the second question is not
surprising, since we know already from the first question that fage is individually
significant. If one of the variables is individually significant, the data support
the $H_A$ of the test of the joint significance (at least one of the two coefficients
is nonzero), and so we expect the variables to be jointly significant as well, even
though mage alone is insignificant.
(c) Now, the p-value of the coefficient on mage is low (0.028) and so the coefficient is
significant at 95% confidence level. When we compare this finding to part (b), we realize
that the insignificance of this coefficient in that part was probably caused by a strong
correlation between mage and fage, leading to the multicollinearity problem,
which increases the standard errors and thus decreases the significance of the
coefficients. When we drop fage, the multicollinearity problem is solved and
we see that our friend’s claim was true.
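The change in precision between parts (b) and (c) can be read off the two Stata outputs. The sketch below only compares reported numbers; it does not measure the correlation between mage and fage, which is not available in the output.

```python
# Estimates for mage from the Stata outputs.
coef_with_fage, se_with_fage = 0.5285275, 4.218069        # part (b)
coef_without_fage, se_without_fage = -6.91257, 3.137972   # part (c)

se_ratio = se_with_fage / se_without_fage
t_with_fage = coef_with_fage / se_with_fage
t_without_fage = coef_without_fage / se_without_fage

print(f"standard error ratio (b)/(c): {se_ratio:.2f}")
print(f"t-statistic with fage:    {t_with_fage:.2f}")
print(f"t-statistic without fage: {t_without_fage:.2f}")
```

The standard error of mage is about a third larger when fage is in the model, which is consistent with the multicollinearity explanation. The point estimate also changes considerably between the two specifications, so the standard errors are not the only thing that differs.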
(d) i. The coefficient on cigs tells us that with each additional cigarette smoked
by the pregnant woman on average per day, the weight of the baby is lower
by about 10 grams, ceteris paribus.
ii. We can see from the p-value (0.003) that the coefficient on cigs is strongly significant.
We can also see that the $R^2$ as well as the adjusted $R^2$ are higher than
in the model without this variable in part (c): 0.0271 vs 0.0248 and 0.0235 vs 0.0220,
respectively. Moreover, we see that the
coefficient on npvis has changed quite a lot once we included cigs, which is
a signal of an omitted variable bias in part (c) and a further indication that cigs indeed
should be included in the model.
iii. In part (c), the coefficient on npvis was approximately equal to 52, now
it is equal to 42. This shows there was a positive bias in part (c): the coefficient
was overestimated there. We know that the sign of this bias is the
sign of the product of two correlations: the correlation between the omitted
variable cigs and the variable npvis and the correlation between cigs and
the dependent variable bwght. The correlation between cigs and the dependent
variable bwght is negative, as we can see from the negative coefficient
on cigs in the model estimated in part (d). The correlation between cigs
and npvis is negative, as we learn from our friend (women who smoke tend
to visit the doctor less often). The product of these two correlations is thus
positive and so is the bias in part (c).
Intuitively, we can say that when cigs was omitted, the only variable in our
model that could measure the degree of responsibility of pregnant women was
npvis. Once we include cigs, we can measure separately the
responsibility of going to the doctor and the responsibility of not smoking,
and so the coefficient on npvis reflects only the correct part of this
influence and is not overestimated.
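The direction of the bias can be illustrated with a small simulation. Everything in this sketch is hypothetical: the true effects (1 for visits, −2 for cigarettes), the strength of the negative link between the two regressors (−0.5) and the noise are made-up values chosen only to mimic the signs in the text, not the birth weight data.

```python
import random

random.seed(0)
n = 5000
beta_visits, beta_cigs = 1.0, -2.0   # hypothetical true effects on weight

visits, cigs, weight = [], [], []
for _ in range(n):
    v = random.gauss(0, 1)
    c = -0.5 * v + random.gauss(0, 1)   # smokers tend to visit less often
    w = beta_visits * v + beta_cigs * c + random.gauss(0, 1)
    visits.append(v)
    cigs.append(c)
    weight.append(w)


def ols_slope(x, y):
    """Slope from a simple least squares regression of y on x with intercept."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx


# Regression that omits cigs: the slope on visits absorbs part of its effect.
short_slope = ols_slope(visits, weight)
print(f"true effect of visits: {beta_visits}, estimate without cigs: {short_slope:.2f}")
```

In this setup the estimate without cigs should be close to 1 + (−2)·(−0.5) = 2, that is, biased upward, because a negative effect of the omitted variable combined with its negative relation to the included regressor gives a positive bias.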
3. Suppose that you have a sample of n individuals who apart from their mother tongue
(Czech) can speak English, German, or are trilingual (i.e., all individuals in your
sample speak in addition to their mother tongue at least one foreign language). You
estimate the following model:
wage = β0 + β1educ + β2IQ + β3exper + β4DM + β5Germ + β6Engl + ε ,
where
educ . . . years of education
IQ . . . IQ level
exper . . . years of on-the-job experience
DM . . . dummy, equal to one for males and zero for females
Germ . . . dummy, equal to one for German speakers and zero otherwise
Engl . . . dummy, equal to one for English speakers and zero otherwise
(a) Explain why a dummy equal to one for trilingual people and zero otherwise is
not included in the model.
(b) Explain how you would test for discrimination against females (in the sense that
ceteris paribus females earn less than males). Be speciﬁc: state the hypothesis,
give the test statistic and its distribution.
(c) Explain how you would measure the payoff (in terms of wage) from becoming
trilingual for someone who can already speak (i) English, (ii) German.
(d) Explain how you would test if the inﬂuence of on-the-job experience is greater
for males than for females. Be speciﬁc: specify the model, state the hypothesis,
give the test statistic and its distribution.
Solution:
(a) If we included the dummy for people who are trilingual, we would have the
complete set of dummies in the model (describing all three possible options: German
speaker, English speaker, both foreign languages). Since we have the
intercept in the model, this would lead to perfect multicollinearity.
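The linear dependence can be shown on a toy sample. The sketch assumes, as the solution to part (c) does, that Germ and Engl both equal one for trilingual people; the five individuals and the name tril for the extra dummy are hypothetical.

```python
# Hypothetical sample: everyone speaks German, English, or both.
people = [
    {"germ": 1, "engl": 0},
    {"germ": 0, "engl": 1},
    {"germ": 1, "engl": 1},
    {"germ": 0, "engl": 1},
    {"germ": 1, "engl": 1},
]

# The extra dummy equals one only for people who speak both languages.
for person in people:
    person["tril"] = person["germ"] * person["engl"]

# This linear combination of the three dummies reproduces the constant term.
combination = [p["germ"] + p["engl"] - p["tril"] for p in people]
print(combination)
```

Because Germ + Engl − tril equals one for every individual, the three dummies together replicate the column of ones that belongs to the intercept, so the regressors would be perfectly collinear.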
(b) For women, the dummy DM is equal to 0 and the model stands as follows:
wage = β0 + β1educ + β2IQ + β3exper + β5Germ + β6Engl + ε .
For men, the dummy DM is equal to 1 and the model stands as follows:
wage = β0 + β1educ + β2IQ + β3exper + β4 + β5Germ + β6Engl + ε .
Therefore, ceteris paribus, the diﬀerence between the wage of men and the wage
of women is equal to β4. If this coeﬃcient is positive, then men earn more than
women. Hence, our hypothesis to be tested is
H0 : β4 ≤ 0 vs HA : β4 > 0 .
This leads to a one-sided t-test with the test statistic
$$t = \frac{\hat{\beta}_4}{\text{s.e.}(\hat{\beta}_4)} \sim t_{n-k} ,$$
where k = 7 in this case. When we compute this test statistic, we compare it to
the critical value $t_{n-7,\,0.95}$. If the test statistic is larger than this critical value,
then we reject the $H_0$ at 95% confidence level and we conclude that there is
discrimination against females.
(c) The payoﬀ of a trilingual person is
wage = β0 + β1educ + β2IQ + β3exper + β4DM + β5 + β6 + ε ,
the payoﬀ of a German speaking person is
wage = β0 + β1educ + β2IQ + β3exper + β4DM + β5 + ε ,
and the payoﬀ of an English speaking person is
wage = β0 + β1educ + β2IQ + β3exper + β4DM + β6 + ε .
Hence, by becoming trilingual, a person who can already speak English gains β5
and a person who can already speak German gains β6. If we assume that both
coeﬃcients are positive, this payoﬀ should be positive.
(d) To allow the influence of on-the-job experience to be greater for males than for
females, we have to define a slope coefficient on exper that can be different for males and
for females. We can do so using the following model:
wage = β0+β1educ+β2IQ+β3exper+β4DM+β5Germ+β6Engl+β7exper·DM+ε .
In this case, the impact of on-the-job experience on wage would be β3
for females and β3 + β7 for males. Hence, if β7 is positive, then men gain more
from experience than women, and our hypothesis to be tested is
H0 : β7 ≤ 0 vs HA : β7 > 0 .
This leads to a one-sided t-test with the test statistic
$$t = \frac{\hat{\beta}_7}{\text{s.e.}(\hat{\beta}_7)} \sim t_{n-k} ,$$
where k = 8 in this case. When we compute this test statistic, we compare
it to the critical value $t_{n-8,\,0.95}$. If the test statistic is larger than this critical
value, then we reject the $H_0$ at 95% confidence level and we conclude that the
influence of on-the-job experience is greater for males than for females.
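The decision rule of the one-sided test can be sketched as follows. No estimates are given for this exercise, so the coefficient values and the standard error below are purely hypothetical, and the critical value uses the standard normal distribution as a large-sample approximation to $t_{n-8}$. The same function would apply to the test on β4 in part (b).

```python
from statistics import NormalDist


def one_sided_t_test(coef, se, level=0.05):
    """Test H0: coefficient <= 0 against HA: coefficient > 0."""
    t_stat = coef / se
    crit = NormalDist().inv_cdf(1 - level)   # large-sample critical value
    return t_stat, crit, t_stat > crit


# Hypothetical estimates: slope on exper and on the interaction exper * DM.
b_exper = 0.40
b_exper_dm, se_exper_dm = 0.15, 0.06

slope_females = b_exper                 # effect of experience when DM = 0
slope_males = b_exper + b_exper_dm      # effect of experience when DM = 1

t_stat, crit, reject = one_sided_t_test(b_exper_dm, se_exper_dm)
print(f"slope for females: {slope_females:.2f}, for males: {slope_males:.2f}")
print(f"t = {t_stat:.2f}, critical value = {crit:.3f}, reject H0: {reject}")
```

With these made-up numbers the statistic exceeds the one-sided critical value, so the null hypothesis would be rejected; with a real sample the conclusion depends entirely on the estimated interaction coefficient and its standard error.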