LECTURE 9
Introduction to Econometrics: Choosing Explanatory Variables
November 15, 2016

WHAT WE HAVE LEARNED SO FAR
We know what a linear regression model is and how its parameters are estimated by OLS
We know the properties of the OLS estimator
We know how to test single and multiple hypotheses in linear regression models
We know how to assess the goodness of fit using R²
We have started to talk about the specification of a regression equation

SPECIFICATION OF A REGRESSION EQUATION
Specification consists of choosing:
1. the correct independent variables
2. the correct functional form
3. the correct form of the stochastic error term
We discussed the choice of functional form in the previous lecture
We will discuss the choice of independent variables today
We will study the form of the error term in the next two lectures

ON TODAY'S LECTURE
We will learn that
- omitting a relevant variable from an equation is likely to bias the remaining coefficients
- including an irrelevant variable in an equation leads to higher variance of the estimated coefficients
- our choice of variables should be guided by economic theory and confirmed by a set of statistical tools

OMITTED VARIABLES
We omit a variable when we forget to include it or do not have data for it
This misspecification results in
- not having a coefficient for the omitted variable
- biased estimated coefficients of the other variables in the equation → omitted variable bias

OMITTED VARIABLES
Where does the omitted variable bias come from?
True model: y_i = β x_i + γ z_i + u_i
Model as it looks when we omit the variable z: y_i = β x_i + ũ_i, implying ũ_i = γ z_i + u_i
We assume that Cov(u_i, x_i) = 0, but
Cov(ũ_i, x_i) = Cov(γ z_i + u_i, x_i) = γ Cov(z_i, x_i) ≠ 0
A classical assumption is violated ⇒ the estimate is biased (and inconsistent)!

OMITTED VARIABLES
For the model with the omitted variable:
E(β̂_omitted) = β + bias, where bias = γ · α
Coefficients β and γ are from the true model y_i = β x_i + γ z_i + u_i
Coefficient α is from a regression of z on x, i.e.
z_i = α x_i + e_i
The bias is zero only if γ = 0 or α = 0 (neither is likely in practice)

OMITTED VARIABLES
Intuitive explanation: if we leave out an important variable (γ ≠ 0), the coefficients of the other variables are biased unless the omitted variable is uncorrelated with all included explanatory variables (α = 0)
The included variables pick up some of the effect of the omitted variable (if they are correlated with it), and their coefficients change accordingly, causing the bias
Example: what would happen if you estimated a production function with capital only and omitted labor?

OMITTED VARIABLES
Example: estimating the demand for chicken in the US (standard errors in parentheses)
Ŷ_t = 31.5 − 0.73 (0.08) PC_t + 0.11 (0.05) PB_t + 0.23 (0.02) YD_t    R² = 0.986, n = 44
Y_t … per capita chicken consumption
PC_t … price of chicken
PB_t … price of beef
YD_t … per capita disposable income

OMITTED VARIABLES
When we omit the price of beef:
Ŷ_t = 32.9 − 0.70 (0.08) PC_t + 0.27 (0.01) YD_t    R² = 0.895, n = 44
Compare to the true model:
Ŷ_t = 31.5 − 0.73 (0.08) PC_t + 0.11 (0.05) PB_t + 0.23 (0.02) YD_t    R² = 0.986, n = 44
We observe a positive bias in the coefficient of PC (was it expected?)

OMITTED VARIABLES
Determining the direction of bias: bias = γ · α
γ captures the relationship between the omitted variable and the dependent variable (the price of beef and chicken consumption): γ is likely to be positive
α captures the relationship between the omitted variable and the included independent variable (the price of beef and the price of chicken): α is likely to be positive
Conclusion: the bias in the coefficient of the price of chicken is likely to be positive if we omit the price of beef from the equation.
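The bias formula above (bias = γ · α) is easy to verify by Monte Carlo simulation. Below is a minimal sketch in Python; the parameter values and variable names are illustrative choices of mine, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, gamma, alpha = 2.0, 1.5, 0.8   # true model: y = beta*x + gamma*z + u
n, n_sims = 500, 2000

slopes = []
for _ in range(n_sims):
    x = rng.normal(size=n)
    z = alpha * x + rng.normal(size=n)        # z = alpha*x + e, so z is correlated with x
    y = beta * x + gamma * z + rng.normal(size=n)
    slopes.append(x @ y / (x @ x))            # OLS slope of y on x alone (z omitted)

print(np.mean(slopes))   # close to beta + gamma*alpha = 3.2, far from beta = 2.0
```

The average estimated slope converges to β + γα rather than β, matching the bias formula; setting either γ = 0 (z irrelevant) or α = 0 (z uncorrelated with x) in the simulation makes the bias disappear.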
OMITTED VARIABLES
In reality, we usually do not have the true model to compare with,
- because we do not know what the true model is, or
- because we do not have data for some important variable
We can often recognize the bias when we obtain unexpected results
We can prevent omitting variables by relying on theory
If we cannot prevent omitting a variable, we can at least determine in which direction it biases our estimates

IRRELEVANT VARIABLES
A second type of specification error is including a variable that does not belong in the model
This misspecification does not cause bias, but it increases the variances of the estimated coefficients of the included variables

IRRELEVANT VARIABLES
True model: y_i = β x_i + u_i   (1)
Model as it looks when we add an irrelevant z: y_i = β x_i + γ z_i + ũ_i   (2)
We can write the error term as ũ_i = u_i − γ z_i, but since in the true model γ = 0, we have ũ_i = u_i and there is no bias

IRRELEVANT VARIABLES
True model (standard errors in parentheses):
Ŷ_t = 31.5 − 0.73 (0.08) PC_t + 0.11 (0.05) PB_t + 0.23 (0.02) YD_t    R² = 0.986, n = 44
If we include the interest rate R_t (an irrelevant variable):
Ŷ_t = 30.0 − 0.73 (0.10) PC_t + 0.12 (0.06) PB_t + 0.22 (0.03) YD_t + 0.17 (0.21) R_t    R² = 0.987, n = 44
We observe that R_t is insignificant and that the standard errors of the other variables increase

SUMMARY OF THE THEORY
Bias-efficiency trade-off:
            Omitted variable    Irrelevant variable
Bias        Yes*                No
Variance    Decreases*          Increases*
* as long as x and z are correlated

FOUR IMPORTANT SPECIFICATION CRITERIA
Does a variable belong in the equation?
1. Theory: Is the variable's place in the equation unambiguous and theoretically sound? Does intuition tell you it should be included?
2. t-test: Is the variable's estimated coefficient significant in the expected direction?
3. R²: Does the overall fit of the equation improve (enough) when the variable is added?
4. Bias: Do the other variables' coefficients change significantly when the variable is added to the equation?

FOUR IMPORTANT SPECIFICATION CRITERIA
If all four criteria hold, the variable belongs in the equation
If none of them holds, the variable is irrelevant and can be safely excluded
If the criteria give contradictory answers, the most weight should be given to theoretical justification
Therefore, if theory (intuition) says the variable belongs in the equation, we include it (even though its coefficient might be insignificant!)

ESTIMATING THE PRICE ELASTICITY OF BRAZILIAN COFFEE
Should we include the price of Brazilian coffee (PBC) in the equation? (standard errors in parentheses, t-statistics listed below each equation)
COF = 9.3 + 2.6 (1.0) PT + 0.0036 (0.0009) Y    t = 2.6, 4.0    R² = 0.58, n = 25
COF = 9.1 + 7.8 (15.6) PBC + 2.4 (1.2) PT + 0.0035 (0.0010) Y    t = 0.5, 2.0, 3.5    R² = 0.60, n = 25
The three statistical criteria do not hold (and theory is inconclusive) ⇒ the price of Brazilian coffee does not seem to belong in the equation (Brazilian coffee appears price inelastic)

ESTIMATING THE PRICE ELASTICITY OF BRAZILIAN COFFEE
Really??? What if we add the price of Colombian coffee (PCC)?
COF = 10.0 + 8.0 (4.0) PBC − 5.6 (2.0) PCC + 2.6 (1.3) PT + 0.0030 (0.0010) Y    t = 2.0, −2.8, 2.0, 3.0    R² = 0.70, n = 25
Compare with the equation without PCC:
COF = 9.1 + 7.8 (15.6) PBC + 2.4 (1.2) PT + 0.0035 (0.0010) Y    t = 0.5, 2.0, 3.5    R² = 0.60, n = 25
Now the three criteria hold ⇒ the price of Brazilian coffee belongs in the equation!
(Brazilian coffee is price elastic)

THE DANGER OF SPECIFICATION SEARCHES
"If you just torture the data long enough, they will confess."
If too many specifications are tried:
- the final result has the desired properties only by chance
- the statistical significance of the results is overstated, because the estimations of the previous regressions are ignored
How to proceed:
- keep the number of estimated regressions low
- focus on theoretical considerations: leave insignificant variables in the equation if theory predicts they should be included
- document all specifications investigated

ADDITIONAL SPECIFICATION TEST
Ramsey's Regression Specification Error Test (RESET)
- detects possible misspecification, i.e. it tells you whether some important variable is missing
- unfortunately, it does not identify the source of the misspecification
There are two forms of this test, both based on a similar intuition: if the equation is correctly specified, nothing is missing from it and the residuals are white noise.
We derive the test for the model y_i = β0 + β1 x_i1 + β2 x_i2 + ε_i

RESET I
1. We run the regression y_i = β0 + β1 x_i1 + β2 x_i2 + ε_i
2. We save the fitted values ŷ_i = β̂0 + β̂1 x_i1 + β̂2 x_i2
3. We run the augmented regression y_i = β0 + β1 x_i1 + β2 x_i2 + γ1 ŷ_i² + γ2 ŷ_i³ + ε_i (more powers of ŷ can be included)
4. We test H0: γ1 = γ2 = 0 using a standard F-test
5. If we reject H0, there is a misspecification problem in our model
Intuition: if the model is correct, y is well explained by x1 and x2, and the fitted values of y (raised to higher powers) should not be significant.

RESET II
1. We run the regression y_i = β0 + β1 x_i1 + β2 x_i2 + ε_i
2. We save the fitted values ŷ_i = β̂0 + β̂1 x_i1 + β̂2 x_i2 and the residuals e_i = y_i − ŷ_i
3. We run the regression e_i = α0 + α1 ŷ_i + α2 ŷ_i² + ε_i (more powers of ŷ can be included)
4. We test H0: α1 = α2 = 0 using a standard F-test
5. If we reject H0, there is a misspecification problem in our model
Intuition: if the model is correct, the residuals should not display any pattern that depends on the explanatory variables.

SUMMARY
An omitted variable causes bias (and decreases variance)
- the sign of this bias can often be predicted
An included irrelevant variable increases variance (but does not cause bias)
- such a variable is insignificant in the regression
- it does not contribute to the overall fit of the regression
There is a set of criteria that helps us recognize a correct specification
- these criteria have to be applied with caution: theoretical justification always has priority over statistical properties
Readings: Studenmund Chapter 6, Wooldridge Chapter 9
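As a closing illustration, the RESET I procedure described above can be sketched in a few lines of Python with numpy and scipy. This is a minimal didactic implementation of mine, not the lecture's code; the data are simulated with a quadratic term deliberately left out of the estimated model, so the test should reject:

```python
import numpy as np
from scipy import stats

def ols(X, y):
    """OLS coefficients and residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return b, resid @ resid

def reset_test(y, X, powers=(2, 3)):
    """Ramsey RESET (version I): add powers of fitted values, F-test their joint significance."""
    n, k = X.shape
    b, rss_r = ols(X, y)                         # restricted (original) model
    yhat = X @ b
    X_aug = np.column_stack([X] + [yhat**p for p in powers])
    _, rss_u = ols(X_aug, y)                     # augmented (unrestricted) model
    q = len(powers)                              # number of restrictions (H0: all gammas = 0)
    F = ((rss_r - rss_u) / q) / (rss_u / (n - k - q))
    p = stats.f.sf(F, q, n - k - q)
    return F, p

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# true model contains x1**2, which we leave out of the regressors -> misspecification
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 1.5 * x1**2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

F, p = reset_test(y, X)
print(F, p)   # large F, tiny p-value: the misspecification is detected
```

With a correctly specified model (drop the quadratic term from y), the same test should fail to reject in most samples, since the fitted values raised to higher powers then add no explanatory content.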