Panel Data Model
December, 2021

Panel Data
•Panel data are obtained by observing the same person, firm, county, etc. over several periods.
•Unlike in pooled cross sections, the observations for the same cross-sectional unit (panel,
entity, cluster) are in general dependent. Thus cluster-robust statistics that account for
correlation within a panel should be used.

Panel Data
A double subscript distinguishes entities (states) and time periods (years)

i = entity (state), n = number of entities,
so i = 1,…,n

t = time period (year), T = number of time periods
so t =1,…,T

Data:  Suppose we have 1 regressor.  The data are:

(Xit, Yit), i = 1,…,n, t = 1,…,T
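
The double-subscript layout can be sketched as a "long format" array with one row per (i, t) pair. This is only an illustration: the sizes n = 3 and T = 2 and the random placeholder values for X and Y are assumptions, not data from the slides.

```python
import numpy as np

n, T = 3, 2  # hypothetical: 3 entities observed over 2 periods

# Long format: one row per (i, t) pair, sorted by entity and then by period.
entity = np.repeat(np.arange(1, n + 1), T)   # i = 1, 1, 2, 2, 3, 3
period = np.tile(np.arange(1, T + 1), n)     # t = 1, 2, 1, 2, 1, 2

# Placeholder values for the single regressor X and the outcome Y.
rng = np.random.default_rng(0)
X = rng.normal(size=n * T)
Y = rng.normal(size=n * T)

# Columns: entity index, period index, X_it, Y_it.
panel = np.column_stack([entity, period, X, Y])
```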

Panel Data and Causality
•Panel data can be used to control for time-invariant unobserved heterogeneity, and are therefore
widely used in causal research.
•By contrast, cross-sectional data cannot control for time-invariant unobserved heterogeneity, so
they may suffer from larger omitted variable bias than panel data.
•The idea is simple: we take some form of difference over time within each panel, which removes
the time-invariant unobserved heterogeneity.
•Effectively, panel data use the same panel as both treatment group and control group; the
before-and-after comparison removes the time-invariant omitted variables. The limitation of
panel data is that time-varying omitted variables are still present. Overall, though, the
omitted variable bias is typically smaller than with cross-sectional data.

Unobserved Effect Panel Data Model
•Consider a two-period unobserved effect model
$$y_{it} = \beta_0 + \delta_0 d_t + \beta_1 x_{it} + a_i + e_{it} \tag{1}$$
•The subscript $i$ indexes panels, while $t$ indexes periods.
•$a_i$ is time-constant unobserved heterogeneity. $e_{it}$ is the idiosyncratic error, or
time-varying unobserved heterogeneity. $a_i + e_{it}$ is the composite error term.
•$d_t$ is a time dummy, so it is panel-constant and time-varying; you can think of $a_i$ as a
panel dummy, so it is time-constant and panel-varying.

Endogeneity
•The main reason to use panel data is to correct for the endogeneity caused by the unobserved
time-constant effect, i.e.,
$$\operatorname{cov}(x_{it}, a_i) \neq 0 \tag{2}$$
•Given this nonzero covariance, the pooled OLS estimator applied to (1) is inconsistent.
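
The inconsistency of pooled OLS can be illustrated with a small simulation. This is a minimal sketch under assumed values that do not come from the slides: β0 = 1, δ0 = 0.5, β1 = 2, standard normal a_i and e_it, and a regressor built as x_it = a_i + noise so that cov(x_it, a_i) ≠ 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                  # hypothetical number of panels
beta0, delta0, beta1 = 1.0, 0.5, 2.0      # hypothetical true parameters

a = rng.normal(size=(n, 1))               # a_i: time-constant heterogeneity
d = np.array([0.0, 1.0])                  # d_t: 0 in period 1, 1 in period 2
x = a + rng.normal(size=(n, 2))           # x_it is correlated with a_i
e = rng.normal(size=(n, 2))               # idiosyncratic error e_it
y = beta0 + delta0 * d + beta1 * x + a + e

# Pooled OLS: stack all (i, t) rows and ignore a_i.
X = np.column_stack([np.ones(2 * n), np.tile(d, n), x.ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
pooled_beta1 = coef[2]
print(f"pooled OLS slope: {pooled_beta1:.3f} (true beta1 = {beta1})")
```

Under these assumptions var(x_it) = 2 and cov(x_it, a_i) = 1, so the pooled slope should settle near β1 + 1/2 = 2.5 rather than 2, no matter how large n becomes.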

First Difference (FD) Estimator I
•The repeated observations for the same panel make it possible to remove $a_i$ via differencing.
•First write down the regression for period 2 and period 1 explicitly as
$$y_{i,t=2} = \beta_0 + \delta_0 \cdot 1 + \beta_1 x_{i,t=2} + a_i + e_{i,t=2} \tag{3}$$
$$y_{i,t=1} = \beta_0 + \delta_0 \cdot 0 + \beta_1 x_{i,t=1} + a_i + e_{i,t=1} \tag{4}$$
Now it is clear that $a_i$ can be removed by subtracting the second equation from the first one.

First Difference (FD) Estimator II
•So we compute the first time difference for each panel
$$\Delta y_i = y_{i,t=2} - y_{i,t=1} \tag{5}$$
$$\Delta x_i = x_{i,t=2} - x_{i,t=1} \tag{6}$$
$$\Delta e_i = e_{i,t=2} - e_{i,t=1} \tag{7}$$
•Finally, run the regression using the first-differenced data, called the first-difference equation:
$$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta e_i \tag{8}$$
Notice that both $a_i$ and $\beta_0$ disappear. In general, differencing removes all time-constant
variables (such as gender).
•OLS applied to the FD regression (8) yields the so-called first-difference estimator. The FD
estimator is consistent and has a causal interpretation if the regressor in (8) is exogenous, i.e.,
$$\operatorname{cov}(\Delta x_i, \Delta e_i) = 0 \tag{9}$$
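
The first-difference steps can be sketched numerically. The simulated data below are an assumption for illustration only (β0 = 1, δ0 = 0.5, β1 = 2, and x_it = a_i + noise so that pooled OLS would be inconsistent); the estimator itself is simply OLS of ∆y on a constant and ∆x, as in the first-difference equation (8).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000                                  # hypothetical number of panels
beta0, delta0, beta1 = 1.0, 0.5, 2.0      # hypothetical true parameters

a = rng.normal(size=(n, 1))               # a_i, correlated with x_it below
d = np.array([0.0, 1.0])                  # time dummy d_t
x = a + rng.normal(size=(n, 2))
e = rng.normal(size=(n, 2))
y = beta0 + delta0 * d + beta1 * x + a + e

# First differences within each panel: period 2 minus period 1.
dy = y[:, 1] - y[:, 0]
dx = x[:, 1] - x[:, 0]

# OLS of dy on a constant and dx; the constant estimates delta0.
Z = np.column_stack([np.ones(n), dx])
coef, *_ = np.linalg.lstsq(Z, dy, rcond=None)
fd_delta0, fd_beta1 = coef
print(f"FD intercept: {fd_delta0:.3f}, FD slope: {fd_beta1:.3f}")
```

Because a_i cancels in the difference, the slope should be close to the assumed β1 = 2 and the intercept close to the assumed δ0 = 0.5, even though x_it is correlated with a_i.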

Serial Correlation
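•In general the error term in the difference regression (8), $\Delta e_i$, is negatively serially
correlated even when $e_{it}$ is serially uncorrelated.
•For example, if the data have three periods and $e_{it}$ is serially uncorrelated, then
$$\operatorname{cov}(\Delta e_{i3}, \Delta e_{i2}) = \operatorname{cov}(e_{i3} - e_{i2},\; e_{i2} - e_{i1}) = -\operatorname{var}(e_{i2}) < 0$$
where $\Delta e_{it} = e_{it} - e_{i,t-1}$.
•So cluster-robust statistics should be used.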
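
The negative serial correlation can be checked numerically. This sketch assumes e_it is i.i.d. standard normal across three periods (an assumption made for illustration); in that case the two consecutive differences share the term e_i2 with opposite signs, so their correlation should be about -1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000                                 # hypothetical number of panels
e = rng.normal(size=(n, 3))               # serially uncorrelated e_it, t = 1, 2, 3

de2 = e[:, 1] - e[:, 0]                   # differenced error for period 2
de3 = e[:, 2] - e[:, 1]                   # differenced error for period 3

# e_i2 enters de2 with a plus sign and de3 with a minus sign.
corr = np.corrcoef(de2, de3)[0, 1]
print(f"corr(de3, de2) = {corr:.3f}")
```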

Diminishing Variation
Typically, the variation in the differenced independent variable is much smaller than the
variation in the original independent variable, so the FD estimator can be imprecise. As with
the IV estimator, we face a tradeoff between efficiency and unbiasedness.
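
A short sketch of why differencing shrinks the variation. If x has the same variance in both periods and correlation ρ between them, then var(∆x) = 2 var(x)(1 − ρ), which is smaller than var(x) whenever ρ > 1/2. The persistent regressor below (a fixed component plus small period-specific noise) is a hypothetical example, not data from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000                                  # hypothetical number of panels

m = rng.normal(size=(n, 1))               # persistent component of x
x = m + 0.3 * rng.normal(size=(n, 2))     # x_it changes little over time
dx = x[:, 1] - x[:, 0]

var_x = x.var()                           # variation in the level of x
var_dx = dx.var()                         # variation left after differencing
rho = np.corrcoef(x[:, 0], x[:, 1])[0, 1] # persistence of x across periods
print(f"var(x) = {var_x:.3f}, var(dx) = {var_dx:.3f}, rho = {rho:.3f}")
```

With less variation in ∆x, the slope in the differenced regression is estimated with a larger standard error for the same error variance.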

Gretl Command


•The FD estimator can be used to control for time-constant unobserved heterogeneity.
•The FD estimator cannot be used when the regressor of interest is time-constant.
•The FD estimator is imprecise when the regressor changes little over time.

Fixed Effect (FE) Estimator I
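For concreteness let $t = 1, 2, 3$ in the following causal model
$$y_{it} = \beta_0 + \delta_1 d_{1t} + \delta_2 d_{2t} + \beta_1 x_{it} + a_i + e_{it} \tag{10}$$
Note that there are two time dummies in (10) because there are three periods:
$$d_{1t} = \begin{cases} 1 & \text{if } t = 1 \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
$$d_{2t} = \begin{cases} 1 & \text{if } t = 2 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$
so period 3 is the base period.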

Fixed Effect (FE) Estimator II
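Averaging (10) over $t$ for each $i$ leads to the so-called between regression
$$\bar{y}_i = \beta_0 + \delta_1 \bar{d}_1 + \delta_2 \bar{d}_2 + \beta_1 \bar{x}_i + a_i + \bar{e}_i \tag{13}$$
where the time averages are
$$\bar{y}_i = \frac{1}{3} \sum_{t=1}^{3} y_{it} \tag{14}$$
$$\bar{x}_i = \frac{1}{3} \sum_{t=1}^{3} x_{it} \tag{15}$$
$$\bar{e}_i = \frac{1}{3} \sum_{t=1}^{3} e_{it} \tag{16}$$
The average of $a_i$ is $a_i$ itself since it is time-invariant. Note that these are averages over
time within each cross-sectional unit, not averages of a variable over all observations.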

Between Regression
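The OLS estimator applied to the between regression (13) is inconsistent since $a_i$ remains in
the error term and, by (2), is correlated with the regressor:
$$\operatorname{cov}(\bar{x}_i, a_i) \neq 0$$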

Fixed Effect (FE) Estimator III


Gretl Command
The Gretl command:
panel dep-var indep-var(s) --fixed-effects
If you include a time-constant independent variable such as gender, it will be dropped.
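
Why a time-constant regressor is dropped can be seen from the usual way the fixed-effects estimator is computed: subtract each panel's time average from every variable, then run OLS on the demeaned data. The sketch below does this by hand in NumPy on simulated data; it is not a reproduction of Gretl's internals, and the parameter values (β1 = 2, δ1 = 0.3, δ2 = -0.2) and the `gender` variable are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 4000, 3                            # hypothetical panel dimensions
beta1, delta1, delta2 = 2.0, 0.3, -0.2    # hypothetical true parameters

a = rng.normal(size=(n, 1))               # a_i, correlated with x_it below
x = a + rng.normal(size=(n, T))
gender = rng.integers(0, 2, size=(n, 1)) * np.ones((1, T))  # time-constant
d1 = np.tile([1.0, 0.0, 0.0], (n, 1))     # dummy for period 1
d2 = np.tile([0.0, 1.0, 0.0], (n, 1))     # dummy for period 2
e = rng.normal(size=(n, T))
y = 1.0 + delta1 * d1 + delta2 * d2 + beta1 * x + a + e


def within(v):
    """Subtract each panel's time average (one row = one panel)."""
    return v - v.mean(axis=1, keepdims=True)


# OLS on demeaned data; a_i and the constant are removed by the demeaning.
W = np.column_stack([within(x).ravel(), within(d1).ravel(), within(d2).ravel()])
coef, *_ = np.linalg.lstsq(W, within(y).ravel(), rcond=None)
fe_beta1 = coef[0]

# A time-constant regressor is identically zero after demeaning.
gender_within = within(gender)
print(f"FE slope: {fe_beta1:.3f}, max |demeaned gender|: {np.abs(gender_within).max()}")
```

Since the demeaned `gender` column contains only zeros, it carries no information and its coefficient cannot be estimated, which is why such a variable is dropped.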


Three Questions
•Q: Why do we need time dummies?
•A: The time dummies $d_{1t}$ and $d_{2t}$ in (10) control for time-varying but panel-constant
unobserved effects. An example is a national trend: it affects every panel and evolves over time.
•Q: Why do we need panel dummies?
•A: The panel dummy $c_j$ in (22) controls for panel-varying but time-constant unobserved effects.
An example is ability: it varies across persons but remains unchanged over time.
•Q: What if there are time-varying omitted variables?
•A: IV is still needed if there are time-varying omitted variables.

Summary
•The panel data model is useful when the omitted variable is time-invariant.
•The panel data model cannot be used when the key regressor is time-invariant.
•The IV estimator applied to the within regression should be considered when the omitted
variable is time-varying.