Panel Data Model
December, 2021

Panel Data
• Panel data are obtained by observing the same person, firm, county, etc. over several periods.
• Unlike pooled cross sections, the observations for the same cross-sectional unit (panel, entity, cluster) are in general dependent. Cluster-robust statistics that account for correlation within a panel should therefore be used.

Panel Data
A double subscript distinguishes entities (states) and time periods (years):
• $i$ = entity (state), $n$ = number of entities, so $i = 1, \dots, n$
• $t$ = time period (year), $T$ = number of time periods, so $t = 1, \dots, T$
Data: suppose we have one regressor. The data are $(X_{it}, Y_{it})$, $i = 1, \dots, n$, $t = 1, \dots, T$.

Panel Data and Causality
• Panel data can be used to control for time-invariant unobserved heterogeneity, and are therefore widely used in causal research.
• By contrast, cross-sectional data cannot control for time-invariant unobserved heterogeneity, so they may suffer from larger omitted variable bias than panel data.
• The idea is simple: we take some form of difference within each panel, and the time-invariant unobserved heterogeneity drops out.
• Effectively, panel data use the same panel as both treatment group and control group; the before-and-after comparison removes the time-invariant omitted variables.
The limitation of panel data is that time-varying omitted variables are still present. Overall, however, the omitted variable bias is typically smaller than with cross-sectional data.

Unobserved Effect Panel Data Model
• Consider a two-period unobserved effect model
$$y_{it} = \beta_0 + \delta_0 d_t + \beta_1 x_{it} + a_i + e_{it} \tag{1}$$
• The subscript $i$ indexes panels, while $t$ indexes periods.
• $a_i$ is the time-constant unobserved heterogeneity. $e_{it}$ is the idiosyncratic error, or time-varying unobserved heterogeneity. $a_i + e_{it}$ is the composite error term.
• $d_t$ is a time dummy, so it is constant across panels and varies over time; you can think of $a_i$ as a panel dummy, so it is constant over time and varies across panels.
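The claim that taking differences removes time-invariant heterogeneity, while pooling the observations does not, can be illustrated with a small simulation of model (1). This is a minimal sketch under stated assumptions: the parameter values, the normal draws, and the helper names `slope` and `centered` are all hypothetical choices made for illustration, and the time dummy is handled by demeaning within each period, which by the Frisch-Waugh theorem gives the same slope as including the dummy explicitly.

```python
import random

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def centered(v):
    """Subtract the mean of a list from each of its elements."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

rng = random.Random(0)
n = 5000
beta0, delta0, beta1 = 1.0, 0.5, 2.0   # illustrative values, not from the slides

x1, x2, y1, y2 = [], [], [], []
for _ in range(n):
    a = rng.gauss(0, 1)                # a_i: time-constant heterogeneity
    xa = a + rng.gauss(0, 1)           # x_{i,t=1}, correlated with a_i
    xb = a + rng.gauss(0, 1)           # x_{i,t=2}, correlated with a_i
    x1.append(xa)
    x2.append(xb)
    y1.append(beta0 + beta1 * xa + a + rng.gauss(0, 1))            # d_t = 0
    y2.append(beta0 + delta0 + beta1 * xb + a + rng.gauss(0, 1))   # d_t = 1

# Pooled OLS of y on a constant, d_t and x: demeaning within each period
# partials out the constant and the time dummy.
pooled = slope(centered(x1) + centered(x2), centered(y1) + centered(y2))

# Difference between period 2 and period 1 within each panel: a_i cancels.
dx = [b - a for a, b in zip(x1, x2)]
dy = [b - a for a, b in zip(y1, y2)]
diff = slope(dx, dy)

print(f"pooled OLS slope:  {pooled:.3f}")
print(f"differenced slope: {diff:.3f}")
```

In this design $a_i$ enters both $x_{it}$ and the error, so the pooled slope drifts away from the value of `beta1` used to generate the data, while the differenced slope stays close to it.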
Endogeneity
• The main reason to use panel data is to correct for the endogeneity caused by the unobserved time-constant effect, i.e.,
$$\operatorname{cov}(x_{it}, a_i) \neq 0 \tag{2}$$
• Given this nonzero covariance, the pooled OLS estimator applied to (1) is inconsistent.

First Difference (FD) Estimator I
• The repeated observations for the same panel make it possible to remove $a_i$ via differencing.
• First write down the regressions for period 2 and period 1 explicitly as
$$y_{i,t=2} = \beta_0 + \delta_0 \cdot 1 + \beta_1 x_{i,t=2} + a_i + e_{i,t=2} \tag{3}$$
$$y_{i,t=1} = \beta_0 + \delta_0 \cdot 0 + \beta_1 x_{i,t=1} + a_i + e_{i,t=1} \tag{4}$$
Now it is clear that $a_i$ can be removed by subtracting the second equation from the first.

First Difference (FD) Estimator II
• So we compute the first time difference for each panel:
$$\Delta y_i = y_{i,t=2} - y_{i,t=1} \tag{5}$$
$$\Delta x_i = x_{i,t=2} - x_{i,t=1} \tag{6}$$
$$\Delta e_i = e_{i,t=2} - e_{i,t=1} \tag{7}$$
• Finally, run the regression on the first-differenced data, called the first-difference equation:
$$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta e_i \tag{8}$$
Notice that both $a_i$ and $\beta_0$ disappear. In general, differencing removes all time-constant variables (such as gender).
• OLS applied to the FD regression (8) yields the so-called first-difference estimator. The FD estimator is consistent and has a causal interpretation if the regressor in (8) is exogenous, i.e.,
$$\operatorname{cov}(\Delta x_i, \Delta e_i) = 0 \tag{9}$$

Serial Correlation
• In general, the error term in the difference regression (8), $\Delta e_i$, is negatively serially correlated when $e_{it}$ is serially uncorrelated.
• For example, if the data have three periods, then each panel has two differenced errors that share $e_{i,t=2}$ with opposite signs:
$$\operatorname{cov}(e_{i,t=3} - e_{i,t=2},\; e_{i,t=2} - e_{i,t=1}) = -\operatorname{var}(e_{i,t=2}) < 0$$
• So cluster-robust statistics should be used.

Diminishing Variation
Typically, the variation in the differenced independent variable is much smaller than the variation in the original independent variable. An imprecise estimate can therefore be expected from the FD estimator. As with the IV estimator, we face a tradeoff between efficiency and unbiasedness.

Gretl Command
[The command listing for this slide is not preserved.]

• The FD estimator can be used to control for time-constant unobserved heterogeneity.
• The FD estimator cannot be used when the regressor of interest is time-constant.
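• The FD estimator is imprecise when the regressor changes little over time.

Fixed Effect (FE) Estimator I
For concreteness let $t = 1, 2, 3$ in the following causal model:
$$y_{it} = \beta_0 + \delta_1 d_{1t} + \delta_2 d_{2t} + \beta_1 x_{it} + a_i + e_{it} \tag{10}$$
Note that there are two time dummies in (10) because there are three periods:
$$d_{1t} = 1 \text{ if } t = 1, \text{ and } 0 \text{ otherwise} \tag{11}$$
$$d_{2t} = 1 \text{ if } t = 2, \text{ and } 0 \text{ otherwise} \tag{12}$$
so period 3 is the base period.

Fixed Effect (FE) Estimator II
Averaging (10) over the three periods for each panel $i$ leads to the so-called between regression
$$\bar{y}_i = \beta_0 + \delta_1 \bar{d}_1 + \delta_2 \bar{d}_2 + \beta_1 \bar{x}_i + a_i + \bar{e}_i \tag{13}$$
where the time averages are
$$\bar{y}_i = \frac{1}{3} \sum_{t=1}^{3} y_{it} \tag{14}$$
$$\bar{x}_i = \frac{1}{3} \sum_{t=1}^{3} x_{it} \tag{15}$$
$$\bar{e}_i = \frac{1}{3} \sum_{t=1}^{3} e_{it} \tag{16}$$
and $\bar{d}_1 = \bar{d}_2 = 1/3$, since each dummy equals one in exactly one of the three periods. The average of $a_i$ is $a_i$ itself since it is time-invariant. Note that these averages are taken over time separately for each cross-sectional unit, not over all observations of a variable.

Between Regression
The OLS estimator applied to the between regression is inconsistent, since $a_i$ remains in the error term of (13) and, given (2), is correlated with the regressor: $\operatorname{cov}(\bar{x}_i, a_i) \neq 0$.

Fixed Effect (FE) Estimator III
Subtracting the between regression (13) from (10) removes both $a_i$ and $\beta_0$ and gives the within regression
$$y_{it} - \bar{y}_i = \delta_1 (d_{1t} - \bar{d}_1) + \delta_2 (d_{2t} - \bar{d}_2) + \beta_1 (x_{it} - \bar{x}_i) + (e_{it} - \bar{e}_i)$$
OLS applied to the within regression yields the fixed effect (FE) estimator.

Gretl Command
The Gretl command:
    panel dep-var indep-var(s) --fixed-effects
If you include a time-constant independent variable such as gender, it will be dropped.

Three Questions
• Q: Why do we need the time dummies?
• A: The time dummies $d_{1t}$ and $d_{2t}$ in (10) control for unobserved effects that vary over time but are constant across panels. An example is a national trend: it affects every panel and evolves over time.
• Q: Why do we need the panel dummy?
• A: The panel dummy $c_j$ in (22) controls for unobserved effects that vary across panels but are constant over time. An example is ability: it varies across persons but remains unchanged over time.
• Q: What if there are time-varying omitted variables?
• A: IV is still needed if there is a time-varying omitted variable.

Summary
• The panel data model is useful when the omitted variable is time-invariant.
• The panel data model cannot be used when the key regressor is time-invariant.
• The IV estimator applied to the within regression should be considered when the omitted variable is time-varying.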
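The within regression and two of the points above (the between regression is inconsistent, and a time-constant regressor is dropped) can be checked with a short simulation of model (10). This is a minimal sketch, assuming a balanced three-period panel, normal draws, and illustrative parameter values; the helper names are hypothetical. Instead of including the time dummies explicitly, it subtracts period means after the within transformation, which in a balanced panel gives the same slope by the Frisch-Waugh theorem.

```python
import random

def demean_within(panel):
    """Subtract each unit's own time average (the within transformation)."""
    return [[v - sum(row) / len(row) for v in row] for row in panel]

def demean_by_period(panel):
    """Subtract each period's cross-sectional mean; in a balanced panel
    this absorbs the time dummies."""
    n, T = len(panel), len(panel[0])
    means = [sum(row[t] for row in panel) / n for t in range(T)]
    return [[row[t] - means[t] for t in range(T)] for row in panel]

def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def flatten(panel):
    """Stack a list of per-unit rows into one long list."""
    return [v for row in panel for v in row]

rng = random.Random(1)
n, T = 3000, 3
beta0, beta1 = 1.0, 2.0          # illustrative values, not from the slides
delta = [0.5, -0.3, 0.0]         # period effects; period 3 is the base

x, y, gender = [], [], []
for _ in range(n):
    a = rng.gauss(0, 1)                      # a_i, correlated with x_it
    g = float(rng.random() < 0.5)            # a time-constant regressor
    xi = [a + rng.gauss(0, 1) for _ in range(T)]
    yi = [beta0 + delta[t] + beta1 * xi[t] + a + rng.gauss(0, 1)
          for t in range(T)]
    x.append(xi)
    y.append(yi)
    gender.append([g] * T)

# FE (within) estimator: a_i is removed by subtracting unit averages.
xw = demean_by_period(demean_within(x))
yw = demean_by_period(demean_within(y))
fe_slope = slope(flatten(xw), flatten(yw))

# Between regression: a_i stays in the error and is correlated with x-bar.
xbar = [sum(row) / T for row in x]
ybar = [sum(row) / T for row in y]
between_slope = slope(xbar, ybar)

# A time-constant regressor has no within variation left, so it is dropped.
gender_within_ss = sum(v * v for v in flatten(demean_within(gender)))

print(f"FE slope:      {fe_slope:.3f}")
print(f"between slope: {between_slope:.3f}")
print(f"within sum of squares of gender: {gender_within_ss}")
```

The FE slope stays close to the `beta1` used to generate the data, the between slope does not, and the demeaned gender variable is identically zero, which is why software drops it from a fixed effect regression.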