LECTURE 12
Introduction to Econometrics
Endogeneity
December 6, 2016
A LITTLE REVISION: OLS CLASSICAL ASSUMPTIONS
1. The regression model is linear in coefficients, is correctly
specified, and has an additive error term
2. The error term has a zero population mean
3. Observations of the error term are uncorrelated with each
other
4. The error term has a constant variance
5. All explanatory variables are uncorrelated with the error
term
6. No explanatory variable is a perfect linear function of any
other explanatory variable(s)
7. The error term is normally distributed
IN PREVIOUS LECTURES
We discussed what happens if some of the assumptions are
violated
Linearity in coefficients and no perfect multicollinearity
are essential for the definition of the OLS estimator
Zero mean of the error term is always ensured by the
inclusion of an intercept
Normality of the error term is needed for statistical
inference, but it can be shown that if the number of
observations is sufficiently large, the OLS estimator is
asymptotically normally distributed even if the
stochastic error term is not normal
Heteroskedasticity and serial correlation lead to incorrect
statistical inference, but we have studied a set of
techniques to overcome this problem
IN TODAY'S LECTURE
The assumption of no correlation between explanatory
variables and the error term is crucial
Variables that are correlated with the error term are called
endogenous variables (as opposed to exogenous variables)
We will show that the estimated coefficients of endogenous
variables are inconsistent and biased
We will explain in which situations we may encounter
endogenous variables
We will define the concept of instrumental variables
We will derive the 2SLS technique to deal with
endogeneity
ENDOGENOUS VARIABLES
Notation: $E[x_i \varepsilon_i] = \operatorname{Cov}(x_i, \varepsilon_i) \neq 0$, or in matrix form $E[X'\varepsilon] \neq 0$
Intuition behind the bias:
If an explanatory variable $x$ and the error term $\varepsilon$ are
correlated with each other, the OLS estimate attributes to $x$
some of the variation in $y$ that actually came from the error
term $\varepsilon$
Example: Analysis of household consumption patterns
Households with lower income may report higher
consumption than they actually have (because of shame)
Leads to inconsistent estimates
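The intuition above can be checked with a small simulation. This is a minimal sketch under illustrative assumptions that are not part of the lecture: the data-generating process, the true slope of 1, and the shared component `c` that makes the regressor and the error correlated are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 1.0  # true slope (assumed for illustration)

# A shared component c enters both x and eps, so Cov(x, eps) = 0.5 and Var(x) = 2
c = rng.normal(size=n)
x = rng.normal(size=n) + c
eps = rng.normal(size=n) + 0.5 * c
y = beta * x + eps

# OLS slope in a regression with one regressor and an intercept: Cov(x, y) / Var(x)
cov_xy = np.cov(x, y)
beta_ols = cov_xy[0, 1] / cov_xy[0, 0]

# In this design the OLS slope converges to beta + Cov(x, eps) / Var(x) = 1 + 0.5 / 2
beta_limit = beta + 0.5 / 2.0
print(beta_ols, beta_limit)
```

Because the covariance between regressor and error is positive in this design, OLS overstates the slope, and increasing the sample size does not remove the gap.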
GRAPHICAL REPRESENTATION
[Figure: plot of Y against X showing the true model and the estimated model]
TYPICAL CASES OF ENDOGENEITY
1. Omitted variable bias
An explanatory variable is omitted from the equation and
becomes part of the error term
2. Selection bias
An unobservable characteristic influences both the
dependent and the explanatory variables
3. Simultaneity
The causal relationship between the dependent variable
and the explanatory variable goes in both directions
4. Measurement error
Some of the variables are measured with error
In all 4 cases, the sign of the bias is given by the sign of
$\operatorname{Cov}(\varepsilon_i, x_i)$
OMITTED VARIABLE BIAS
Studied in Lecture 7
True model: $y_i = \beta x_i + \gamma z_i + u_i$
Model as it looks when we omit variable $z$:
$y_i = \beta x_i + \tilde{u}_i$, implying $\tilde{u}_i = \gamma z_i + u_i$
This gives
$\operatorname{Cov}(\tilde{u}_i, x_i) = \operatorname{Cov}(\gamma z_i + u_i, x_i) = \gamma \operatorname{Cov}(z_i, x_i) \neq 0$
It can be remedied by including the variable in question,
but sometimes we do not have data for it
We can include some proxies for such a variable, but this
may not remove the bias completely and some endogeneity
remains in the equation
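The size of the resulting inconsistency follows from a short derivation. This is a sketch under assumptions not spelled out on the slide: $\operatorname{Cov}(u_i, x_i) = 0$, and all variables are measured as deviations from their means so that the model without an intercept applies.

```latex
\operatorname{plim} \hat{\beta}_{OLS}
  = \frac{\operatorname{Cov}(x_i, y_i)}{\operatorname{Var}(x_i)}
  = \frac{\operatorname{Cov}(x_i, \beta x_i + \gamma z_i + u_i)}{\operatorname{Var}(x_i)}
  = \beta + \gamma \, \frac{\operatorname{Cov}(z_i, x_i)}{\operatorname{Var}(x_i)}
```

The second term is the omitted variable bias: it vanishes only if $\gamma = 0$ or if $z$ is uncorrelated with $x$, and its sign is the sign of $\gamma \operatorname{Cov}(z_i, x_i)$.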
SELECTION BIAS
Very similar to omitted variable bias
We suppose there is some unobservable characteristic that
influences both the level of the dependent variable $y$ and of
the explanatory variable $x$
This unobservable characteristic forms part of the error
term $\varepsilon$, causing $\operatorname{Cov}(\varepsilon, x) \neq 0$ (in the same manner as an
omitted variable)
Example: unobserved ability in the regression estimating
the impact of education on wages
SIMULTANEITY
Occurs in models where variables are jointly determined
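$y_{1i} = \alpha_0 + \alpha_1 y_{2i} + \varepsilon_{1i}$
$y_{2i} = \beta_0 + \beta_1 y_{1i} + \varepsilon_{2i}$
Intuitively: a change in $y_{1i}$ will cause a change in $y_{2i}$, which
will in turn cause $y_{1i}$ to change again
Technically (assuming $\operatorname{Cov}(\varepsilon_{1i}, \varepsilon_{2i}) = 0$):
$$
\begin{aligned}
\operatorname{Cov}(\varepsilon_{1i}, y_{2i}) &= \operatorname{Cov}(\varepsilon_{1i}, \beta_0 + \beta_1 y_{1i} + \varepsilon_{2i}) \\
&= \beta_1 \operatorname{Cov}(\varepsilon_{1i}, y_{1i}) \\
&= \beta_1 \operatorname{Cov}(\varepsilon_{1i}, \alpha_0 + \alpha_1 y_{2i} + \varepsilon_{1i}) \\
&= \beta_1 \left( \alpha_1 \operatorname{Cov}(\varepsilon_{1i}, y_{2i}) + \operatorname{Var}(\varepsilon_{1i}) \right)
\end{aligned}
$$
Solving for the covariance:
$$
\operatorname{Cov}(\varepsilon_{1i}, y_{2i}) = \frac{\beta_1}{1 - \alpha_1 \beta_1} \operatorname{Var}(\varepsilon_{1i}) \neq 0
$$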
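The covariance formula can be verified numerically. The sketch below solves the two equations for their reduced form and compares the simulated covariance with the formula; the parameter values and the assumption of independent standard normal errors are illustrative, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a0, a1 = 1.0, 0.5   # alpha_0, alpha_1 (assumed values)
b0, b1 = 2.0, 0.4   # beta_0, beta_1 (assumed values)

# Independent structural errors, so Cov(e1, e2) = 0 as the derivation requires
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

# Reduced form: solve the two structural equations jointly for y1 and y2
d = 1.0 - a1 * b1
y1 = (a0 + a1 * b0 + e1 + a1 * e2) / d
y2 = (b0 + b1 * a0 + b1 * e1 + e2) / d

# Simulated covariance between the error of equation 1 and its regressor y2
cov_sim = np.cov(e1, y2)[0, 1]

# Formula: beta_1 / (1 - alpha_1 * beta_1) * Var(e1)
cov_formula = b1 / d * e1.var(ddof=1)
print(cov_sim, cov_formula)
```

With these values the formula gives roughly 0.5, so $y_{2i}$ is clearly correlated with $\varepsilon_{1i}$ and OLS on the first equation is inconsistent.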
SIMULTANEITY
Example:
$Q_{Di} = \alpha_0 + \alpha_1 P_i + \alpha_2 I_i + \varepsilon_{1i}$
$Q_{Si} = \beta_0 + \beta_1 P_i + \varepsilon_{2i}$
$Q_{Di} = Q_{Si}$
where
$Q_D$ ... quantity demanded
$Q_S$ ... quantity supplied
$P$ ... price
$I$ ... income
Endogeneity of price: it is determined from the interaction
of supply and demand
MEASUREMENT ERROR I
Measurement error in the dependent variable
Measurement error is correlated with an explanatory
variable
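$y_i^* = y_i + \nu_i$ where $\operatorname{Cov}(\nu_i, x_i) \neq 0$
True regression model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Estimated regression: $y_i^* = \beta_0 + \beta_1 x_i + u_i$ where
$u_i = \varepsilon_i + \nu_i$ and so
$\operatorname{Cov}(x_i, u_i) = \operatorname{Cov}(x_i, \varepsilon_i + \nu_i) = \operatorname{Cov}(\nu_i, x_i) \neq 0$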
Example: analysis of household consumption patterns
(above)
MEASUREMENT ERROR II
Classical measurement error in the explanatory variable
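$x_i^* = x_i + \nu_i$ where $\operatorname{Cov}(\nu_i, x_i) = 0$
True regression model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Estimated regression: $y_i = \beta_0 + \beta_1 x_i^* + u_i$
where $u_i = \varepsilon_i - \beta_1 \nu_i$ and so
$\operatorname{Cov}(x_i^*, u_i) = \operatorname{Cov}(x_i + \nu_i, \varepsilon_i - \beta_1 \nu_i) = -\beta_1 \operatorname{Var}(\nu_i) \neq 0$
Causes attenuation bias (estimated coefficient is smaller in
absolute value than the true one)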
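Attenuation bias is easy to see in a simulation. The sketch below uses illustrative assumptions that are not from the lecture: a true slope of 2, and a measurement error with the same variance as the true regressor, in which case the OLS slope should converge to $\beta_1 \operatorname{Var}(x) / (\operatorname{Var}(x) + \operatorname{Var}(\nu))$, that is, half the true value.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta0, beta1 = 1.0, 2.0   # true coefficients (assumed for illustration)

x = rng.normal(size=n)    # true regressor, Var(x) = 1
nu = rng.normal(size=n)   # classical measurement error, independent of x and eps
eps = rng.normal(size=n)

y = beta0 + beta1 * x + eps
x_star = x + nu           # what the researcher actually observes

# OLS slope of y on the mismeasured regressor: Cov(x*, y) / Var(x*)
c = np.cov(x_star, y)
beta1_hat = c[0, 1] / c[0, 0]

# Limit of the OLS slope in this design: beta1 * Var(x) / (Var(x) + Var(nu))
beta1_limit = beta1 * 1.0 / (1.0 + 1.0)
print(beta1_hat, beta1_limit)
```

The estimate is pulled toward zero but keeps the sign of the true coefficient, which is what "smaller in absolute value" means here.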
INSTRUMENTAL VARIABLES (IV)
Answer to the situation when $\operatorname{Cov}(x, \varepsilon) \neq 0$
An instrumental variable (or instrument) should be a variable
$z$ such that
1. $z$ is uncorrelated with the error term: $\operatorname{Cov}(z, \varepsilon) = 0$
2. $z$ is correlated with the explanatory variable $x$: $\operatorname{Cov}(x, z) \neq 0$
Intuition behind the instrumental variables approach:
project the endogenous variable $x$ on the instrument $z$
this projection is uncorrelated with the error term and can
be used as an explanatory variable instead of $x$
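For one endogenous regressor and one instrument, the role of the two conditions can be made explicit. The derivation below is a sketch for the simple model $y = \beta x + \varepsilon$ (an assumed special case, with variables in deviations from means); the estimator shown is the standard simple IV estimator.

```latex
\hat{\beta}_{IV} = \frac{\widehat{\operatorname{Cov}}(z, y)}{\widehat{\operatorname{Cov}}(z, x)}
\quad\Longrightarrow\quad
\operatorname{plim} \hat{\beta}_{IV}
  = \frac{\operatorname{Cov}(z, \beta x + \varepsilon)}{\operatorname{Cov}(z, x)}
  = \beta + \frac{\operatorname{Cov}(z, \varepsilon)}{\operatorname{Cov}(z, x)}
  = \beta
```

Condition 1 makes the numerator of the last fraction zero, and condition 2 guarantees that its denominator is not zero.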
INSTRUMENTAL VARIABLES
Suppose the equation we want to estimate is:
y = Xβ + η
We can have several instruments for several endogenous
variables - we will use the matrix notation Z and X
X denotes endogenous variable(s)
Z denotes instrumental variable(s)
Assume that we have at least as many instruments as
endogenous variables
TWO STAGE LEAST SQUARES
2SLS is a method of implementing the instrumental variables
approach
Consists of two steps:
1. Regress the endogenous variables on the instruments
$X = Z\delta + \nu$,
get predicted values
$\hat{X} = Z\hat{\delta} = Z (Z'Z)^{-1} Z'X$,
2. Use these predicted values instead of $X$ in the original
equation:
$y = \hat{X}\beta + \eta$
TWO STAGE LEAST SQUARES
The estimate is
$$
\hat{\beta}^{2SLS} = (\hat{X}'\hat{X})^{-1} \hat{X}'y
= \left( X'Z (Z'Z)^{-1} Z'X \right)^{-1} X'Z (Z'Z)^{-1} Z'y
$$
This estimate is consistent, but it has higher variance than
OLS (it is not efficient)
Intuitively:
Only the part of the variation in $X$ that is uncorrelated with the
error term is used for the estimation.
This ensures consistency ($\hat{X}$ is uncorrelated with the error
term).
But it makes the estimate less precise (higher variance of $\hat{\beta}$),
because not all variation in $X$ is used.
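The two steps and the closed-form expression can be sketched in a few lines of NumPy. Everything about the data is an illustrative assumption (one endogenous regressor `x`, one instrument `z`, an unobserved factor `c` that creates the endogeneity, a true slope of 1); the constant is included in both X and Z because it is exogenous.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
beta_true = 1.0  # true slope (assumed for illustration)

z = rng.normal(size=n)                 # instrument
c = rng.normal(size=n)                 # unobserved factor causing endogeneity
x = 0.8 * z + c + rng.normal(size=n)   # endogenous regressor
eps = c + rng.normal(size=n)           # error term, correlated with x through c
y = 2.0 + beta_true * x + eps

const = np.ones(n)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

# Stage 1: regress X on Z and form the fitted values X_hat = Z (Z'Z)^{-1} Z'X
delta_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ delta_hat

# Stage 2: regress y on X_hat
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Closed form: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
A = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
beta_closed = np.linalg.solve(A, b)

# OLS for comparison
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols[1], beta_2sls[1], beta_closed[1])
```

In this design the OLS slope is pushed well above 1, while both 2SLS computations agree with each other and lie close to the true value. Note that standard errors taken from a manually run second stage are not valid, because they are based on the wrong residuals; dedicated IV routines correct for this.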
EXAMPLE
Estimating the impact of education on the number of
children for a sample of women in Botswana
OLS:
      Source |       SS       df       MS              Number of obs =    4361
-------------+------------------------------           F(  3,  4357) = 1915.20
       Model |  12243.0295     3  4081.00985           Prob > F      =  0.0000
    Residual |  9284.14679  4357  2.13085765           R-squared     =  0.5687
-------------+------------------------------           Adj R-squared =  0.5684
       Total |  21527.1763  4360  4.93742577           Root MSE      =  1.4597

    children |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.0905755   .0059207   -15.30   0.000     -.102183   -.0789679
         age |   .3324486   .0165495    20.09   0.000     .3000032     .364894
       agesq |  -.0026308   .0002726    -9.65   0.000    -.0031652   -.0020964
       _cons |  -4.138307   .2405942   -17.20   0.000    -4.609994    -3.66662
EXAMPLE
Education may be endogenous: both education and
number of children may be influenced by some
unobserved socioeconomic factors
Omitted variable bias: family background is an unobserved
factor that influences both the number of children and
years of education
Finding a possible instrument:
Something that explains education
But is not correlated with the family background
A dummy variable
frsthalf = 1 if the woman was born in the first six months of a year,
           0 otherwise
EXAMPLE
Intuition behind the instrument:
The first condition - instrument explains education:
School year in Botswana starts in January
⇒ Thus, women born in the first half of the year start
school when they are at least six and a half.
Schooling is compulsory until the age of 15
⇒ Thus, women born in the first half of the year get less
education if they leave school at the age of 15.
The second condition - instrument is uncorrelated with the
error term:
Being born in the first half of the year is uncorrelated with
the unobserved socioeconomic factors that influence
education and number of children (family background etc.)
EXAMPLE
First-stage regressions

                                                       Number of obs =    4361
                                                       F(  3,  4357) =  175.21
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1077
                                                       Adj R-squared =  0.1070
                                                       Root MSE      =  3.7110

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.1079504   .0420402    -2.57   0.010    -.1903706   -.0255302
       agesq |  -.0005056   .0006929    -0.73   0.466    -.0018641    .0008529
    frsthalf |  -.8522854   .1128296    -7.55   0.000    -1.073489   -.6310821
       _cons |   9.692864   .5980686    16.21   0.000     8.520346    10.86538
EXAMPLE
Instrumental variables (2SLS) regression               Number of obs =    4361
                                                       Wald chi2(3)  = 5300.22
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.5502
                                                       Root MSE      =    1.49

    children |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.1714989   .0531553    -3.23   0.001    -.2756813   -.0673165
         age |   .3236052   .0178514    18.13   0.000     .2886171    .3585934
       agesq |  -.0026723   .0002796    -9.56   0.000    -.0032202   -.0021244
       _cons |  -3.387805   .5478988    -6.18   0.000    -4.461667   -2.313943

Instrumented:  educ
Instruments:   age agesq frsthalf
2SLS
Note that the endogenous variable has to be instrumented
by the instrument and by all other exogenous variables
included in the regression
Think about why:
In the first stage, we run $X = Z\hat{\delta} + \hat{\nu} = \hat{X} + \hat{\nu}$,
True model: $y = X\beta + \varepsilon = (\hat{X} + \hat{\nu})\beta + \varepsilon$
Model estimated in the second stage: $y = \hat{X}\beta + \eta$
This implies: $\eta = \hat{\nu}\beta + \varepsilon$
Including all exogenous variables in the first stage makes
them orthogonal to the residual $\hat{\nu}$ and hence uncorrelated
with the error term $\eta$ in the second stage
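The orthogonality argument can be illustrated numerically. In the sketch below, `w` is a hypothetical exogenous regressor from the main equation and `z` is the excluded instrument; the coefficients are assumed values. The first-stage residual is computed once with `w` included and once without it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

w = rng.normal(size=n)   # exogenous regressor included in the main equation
z = rng.normal(size=n)   # excluded instrument
x = 0.5 * w + 0.8 * z + rng.normal(size=n)   # endogenous regressor

def ols_residual(target, regressors):
    """Residual from an OLS regression of target on the given regressor matrix."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return target - regressors @ coef

const = np.ones(n)
nu_full = ols_residual(x, np.column_stack([const, w, z]))   # w in the first stage
nu_short = ols_residual(x, np.column_stack([const, z]))     # w left out

# Sample cross-moment between w and each first-stage residual
cross_full = w @ nu_full / n
cross_short = w @ nu_short / n
print(cross_full, cross_short)
```

With `w` in the first stage the cross-moment is zero up to rounding error, by the OLS normal equations. Without it, the residual still contains the part of `x` explained by `w`, so `w` would be correlated with the second-stage error $\eta$.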
BACK TO THE EXAMPLE
Compare the estimates from OLS and 2SLS:
OLS:

      Source |       SS       df       MS              Number of obs =    4361
-------------+------------------------------           F(  3,  4357) = 1915.20
       Model |  12243.0295     3  4081.00985           Prob > F      =  0.0000
    Residual |  9284.14679  4357  2.13085765           R-squared     =  0.5687
-------------+------------------------------           Adj R-squared =  0.5684
       Total |  21527.1763  4360  4.93742577           Root MSE      =  1.4597

    children |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.0905755   .0059207   -15.30   0.000     -.102183   -.0789679
         age |   .3324486   .0165495    20.09   0.000     .3000032     .364894
       agesq |  -.0026308   .0002726    -9.65   0.000    -.0031652   -.0020964
       _cons |  -4.138307   .2405942   -17.20   0.000    -4.609994    -3.66662

2SLS:

Instrumental variables (2SLS) regression               Number of obs =    4361
                                                       Wald chi2(3)  = 5300.22
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.5502
                                                       Root MSE      =    1.49

    children |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |  -.1714989   .0531553    -3.23   0.001    -.2756813   -.0673165
         age |   .3236052   .0178514    18.13   0.000     .2886171    .3585934
       agesq |  -.0026723   .0002796    -9.56   0.000    -.0032202   -.0021244
       _cons |  -3.387805   .5478988    -6.18   0.000    -4.461667   -2.313943

Instrumented:  educ
Instruments:   age agesq frsthalf
Is the bias reduced by IV?
Are these results statistically different?
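One rough way to approach the second question is a Hausman-type contrast of the two `educ` coefficients reported above. This is only an illustrative sketch, not a method given in the lecture: using the difference of the squared standard errors as the variance of the contrast assumes that OLS is efficient when education is exogenous (homoskedastic errors), and the statistic is then compared with standard normal critical values.

```python
import math

# Coefficients and standard errors of educ copied from the two outputs above
b_ols, se_ols = -0.0905755, 0.0059207
b_iv, se_iv = -0.1714989, 0.0531553

# Difference between the 2SLS and OLS estimates
diff = b_iv - b_ols

# Standard error of the difference under the assumption that OLS is efficient
# when educ is exogenous: Var(b_iv - b_ols) = Var(b_iv) - Var(b_ols)
se_diff = math.sqrt(se_iv**2 - se_ols**2)

# Compare with standard normal critical values (1.96 at the 5% level)
stat = diff / se_diff
print(round(stat, 2))
```

Under these assumptions the statistic is about -1.53, inside the usual 5% bounds of ±1.96, so the difference between the two estimates would not be judged statistically significant. This agrees with the 2SLS output, whose 95% confidence interval for `educ` contains the OLS point estimate.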
SUMMARY
We showed that the estimated coefficients of endogenous
variables are inconsistent and biased
We explained in which situations we may encounter endogenous
variables
Omitted variable (omitting an important variable that is
correlated with an independent variable)
Selection bias (unobserved factors influencing both
dependent and independent variables)
Simultaneity (causality goes both ways)
Measurement error (in either dependent or independent
variable)
We can deal with endogeneity by using instrumental
variables (2SLS technique)