Econometrics
F-Test
Omitted Variables
Nonlinear specifications and dummy variables
Anna Donina
Lecture 5

TESTING MULTIPLE HYPOTHESES REVISITED
• Suppose we have a model $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i$
• Suppose we want to test multiple linear hypotheses in this model
• For example, we want to see if the following restrictions on the coefficients hold jointly: $\beta_1 + \beta_2 = 1$ and $\beta_3 = 0$
• We cannot use a t-test in this case (a t-test can be used for only one hypothesis at a time)
• We will use an F-test

RESTRICTED VS. UNRESTRICTED MODEL
• We can reformulate the model by plugging in the restrictions as if they were true (the model under $H_0$)
• We call this model the restricted model, as opposed to the unrestricted model
• The unrestricted model is $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i$
• Substituting $\beta_2 = 1 - \beta_1$ and $\beta_3 = 0$, the restricted model can be written as $y_i^* = \beta_0 + \beta_1 x_i^* + \varepsilon_i$, where $y_i^* = y_i - x_{i2}$ and $x_i^* = x_{i1} - x_{i2}$

IDEA OF THE F-TEST
• If the restrictions are true, then the restricted model fits the data about as well as the unrestricted model
 ▪ residuals are nearly the same
• If the restrictions are false, then the restricted model fits the data poorly
 ▪ residuals from the restricted model are much larger than those from the unrestricted model
• The idea is thus to compare the residuals from the two models

IDEA OF THE F-TEST
How to compare residuals in the two models?
 ▪ Calculate the sum of squared residuals in each of the two models
 ▪ Test whether the difference between the two sums is (statistically) equal to zero
 ▪ $H_0$: the difference is zero (residuals in the two models are the same, restrictions hold)
 ▪ $H_A$: the difference is positive (residuals in the restricted model are bigger, restrictions do not hold)

F-TEST
The test statistic is defined as
$F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} \sim F_{q,\,n-k-1}$ under $H_0$,
where:
$SSR_r$ . . . sum of squared residuals from the restricted model
$SSR_{ur}$ . . . sum of squared residuals from the unrestricted model
$q$ . . . number of restrictions
$n$ . . . number of observations
$k$ . . . number of explanatory variables in the unrestricted model (slope coefficients, not counting the intercept)

GOODNESS OF FIT MEASURE
• We know that education and experience have a significant influence on wages
• But how important are they in determining wages?
• How much of the difference in wages between people is explained by differences in education and in experience?
• How well does variation in the independent variable(s) explain variation in the dependent variable?
• These are the questions answered by the goodness of fit measure, $R^2$

TOTAL AND EXPLAINED VARIATION
Total variation in the dependent variable: $\sum_{i=1}^{n} (y_i - \bar{y})^2$
Predicted value of the dependent variable = the part that is explained by the independent variables: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ (case of a regression line, for simplicity of notation)
Explained variation in the dependent variable: $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

GOODNESS OF FIT - $R^2$
Denote:
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ . . . total sum of squares
$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ . . . explained sum of squares
$SSR = \sum_{i=1}^{n} e_i^2$ . . . sum of squared residuals
Define the measure of the goodness of fit:
$R^2 = \frac{SSE}{SST} = \frac{\text{Explained variation in } y}{\text{Total variation in } y}$

GOODNESS OF FIT - $R^2$
For models estimated by OLS with an intercept: $0 \le R^2 \le 1$
• $R^2$ tells us what percentage of the total variation in the dependent variable is explained by the variation in the independent variable(s)
 ▪ $R^2 = 0.3$ means that the independent variables can explain 30% of the variation in the dependent variable
• A higher $R^2$ means a better fit of the regression model (not necessarily a better model!)

DECOMPOSING THE VARIANCE
For models with an intercept, $R^2$ can be rewritten using the decomposition of variance.
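The F statistic and the $R^2$ defined above can be sketched numerically. The example below is only an illustration under stated assumptions: the data are simulated so that the restrictions $\beta_1 + \beta_2 = 1$ and $\beta_3 = 0$ hold by construction, and the helper names `solve` and `ols` are hypothetical, written here with the standard library only. It also checks that SST equals SSE plus SSR, which is the decomposition of variance.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(columns, y):
    """OLS with an intercept; returns (fitted values, sum of squared residuals)."""
    rows = [[1.0, *vals] for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, ssr


# Simulated data (assumption): true beta1 = 0.3, beta2 = 0.7, beta3 = 0, so H0 holds.
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 0.3 * a + 0.7 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Unrestricted model: y on x1, x2, x3 (k = 3 explanatory variables).
fitted_ur, ssr_ur = ols([x1, x2, x3], y)

# Restricted model: y* = y - x2 regressed on x* = x1 - x2.
y_star = [yi - b for yi, b in zip(y, x2)]
x_star = [a - b for a, b in zip(x1, x2)]
_, ssr_r = ols([x_star], y_star)

q, k = 2, 3  # two restrictions, three explanatory variables
f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))

# Goodness of fit of the unrestricted model: R2 = SSE / SST.
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((fi - y_bar) ** 2 for fi in fitted_ur)
r2 = sse / sst

print(f"SSR_r = {ssr_r:.2f}, SSR_ur = {ssr_ur:.2f}, F = {f_stat:.3f}")
print(f"SST = {sst:.2f}, SSE + SSR = {sse + ssr_ur:.2f}, R2 = {r2:.3f}")
```

The computed F would then be compared with the critical value of the $F_{q,\,n-k-1}$ distribution; a statistics package would normally supply that value and the p-value.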
VARIANCE DECOMPOSITION AND $R^2$
Variance decomposition: $SST = SSE + SSR$, i.e. $\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum e_i^2$
Intuition: total variation can be divided between the explained variation and the unexplained variation
 ▪ the true value $y_i$ is the sum of the estimated (explained) part $\hat{y}_i$ and the residual $e_i$ (unexplained part)
We can rewrite $R^2$:
$R^2 = \frac{SSE}{SST} = \frac{SST - SSR}{SST} = 1 - \frac{SSR}{SST}$

ADJUSTED $R^2$
• The sum of squared residuals (SSR) never increases (and typically decreases) when additional explanatory variables are introduced in the model, whereas the total sum of squares (SST) remains the same
 ▪ $R^2 = 1 - \frac{SSR}{SST}$ therefore does not decrease if we add explanatory variables
 ▪ Models with more variables automatically have a better fit
• To deal with this problem, we define the adjusted $R^2$:
$R^2_{adj} = 1 - \frac{SSR/(n - k - 1)}{SST/(n - 1)} \le R^2$ ($k$ is the number of explanatory variables, not counting the intercept)
• This measure introduces a "punishment" for including more explanatory variables

OMITTED VARIABLES
We omit a variable when we
 ▪ forget to include it
 ▪ do not have data for it
This misspecification results in
 ▪ not having the coefficient for this variable
 ▪ biasing the estimated coefficients of other variables in the equation → omitted variable bias

OMITTED VARIABLES
• Where does the omitted variable bias come from?
• True model: $y_i = \beta x_i + \gamma z_i + u_i$
• Model as it looks when we omit variable $z$: $y_i = \beta x_i + \tilde{u}_i$, implying $\tilde{u}_i = \gamma z_i + u_i$
• We assume that $Cov(\tilde{u}_i, x_i) = 0$, but: $Cov(\tilde{u}_i, x_i) = Cov(\gamma z_i + u_i, x_i) = \gamma\, Cov(z_i, x_i) \ne 0$ whenever $\gamma \ne 0$ and $x$ and $z$ are correlated
• The classical assumption is violated ⇒ biased (and inconsistent) estimate!

OMITTED VARIABLES
For the model with the omitted variable: $E(\hat{\beta}) = \beta + \gamma\alpha$, so the bias is $\gamma\alpha$
 ▪ Coefficients $\beta$ and $\gamma$ are from the true model $y_i = \beta x_i + \gamma z_i + u_i$
 ▪ Coefficient $\alpha$ is from a regression of $z$ on $x$, i.e. $z_i = \alpha x_i + e_i$
The bias is zero if $\gamma = 0$ or $\alpha = 0$ (not likely to happen)

OMITTED VARIABLES
Intuitive explanation:
 ▪ if we leave out an important variable from the regression ($\gamma \ne 0$), the coefficients of other variables are biased unless the omitted variable is uncorrelated with all included independent variables ($\alpha = 0$)
 ▪ the included variables pick up some of the effect of the omitted variable (if they are correlated), and the coefficients of the included variables thus change, causing the bias
Example: what would happen if you estimated a production function with capital only and omitted labor?

OMITTED VARIABLES
Example: estimating the demand for chicken meat in the US
$Y_t$ . . . per capita chicken consumption
$PC_t$ . . . price of chicken
$PB_t$ . . . price of beef
$YD_t$ . . . per capita disposable income

OMITTED VARIABLES
When we omit the price of beef: (estimated equation not shown), $R^2 = 0.895$, $n = 44$
Compare to the true model: (estimated equation not shown), $R^2 = 0.986$, $n = 44$
We observe positive bias in the coefficient of PC (was it expected?)

OMITTED VARIABLES
Determining the direction of bias: $bias = \gamma \cdot \alpha$
 ▪ $\gamma$ is the coefficient of the omitted variable in the true model; its sign reflects the relationship between the omitted variable and the dependent variable (the price of beef and chicken consumption)
 ▪ $\gamma$ is likely to be positive
 ▪ $\alpha$ is the coefficient from the regression of the omitted variable on the included independent variable; its sign reflects the correlation between the two (the price of beef and the price of chicken)
 ▪ $\alpha$ is likely to be positive
Conclusion: the bias in the coefficient of the price of chicken is likely to be positive if we omit the price of beef from the equation.
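The bias formula above (bias $= \gamma \cdot \alpha$) can be checked with a small simulation. This is a minimal sketch, assuming hypothetical true values $\beta = 1$, $\gamma = 2$, $\alpha = 0.5$ and normally distributed data; none of these numbers come from the lecture. The models have no intercept, matching the notation of the slides.

```python
import random

random.seed(1)
n = 20000
beta, gamma, alpha = 1.0, 2.0, 0.5  # hypothetical true values (assumption)

x = [random.gauss(0, 1) for _ in range(n)]
z = [alpha * xi + random.gauss(0, 1) for xi in x]  # z_i = alpha * x_i + e_i
y = [beta * xi + gamma * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]


def cross(a, b):
    """Sum of element-wise products of two equally long sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Short regression (z omitted): y_i = b * x_i + error, no intercept.
beta_short = cross(x, y) / cross(x, x)

# Long regression (true model): solve the 2x2 normal equations for (beta, gamma).
sxx, sxz, szz = cross(x, x), cross(x, z), cross(z, z)
sxy, szy = cross(x, y), cross(z, y)
det = sxx * szz - sxz ** 2
beta_long = (sxy * szz - sxz * szy) / det
gamma_long = (sxx * szy - sxz * sxy) / det

print(f"short regression slope: {beta_short:.3f}")
print(f"beta + gamma * alpha:   {beta + gamma * alpha:.3f}")
print(f"long regression slope:  {beta_long:.3f}")
```

With these assumed values both $\gamma$ and $\alpha$ are positive, so the short regression overstates $\beta$, while the regression that includes $z$ recovers a slope close to the true $\beta$.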
OMITTED VARIABLES
• In reality, we usually do not have the true model to compare with
 ▪ because we do not know what the true model is
 ▪ because we do not have data for some important variable
• We can often recognize the bias if we obtain unexpected results
• We can prevent omitting variables by relying on theory
• If we cannot prevent omitting variables, we can at least determine in what direction this biases our estimates

IRRELEVANT VARIABLES
A second type of specification error is including a variable that does not belong in the model
This misspecification
 ▪ does not cause bias
 ▪ but it increases the variance of the estimated coefficients of the included variables

IRRELEVANT VARIABLES
• True model: $y_i = \beta x_i + u_i$ (1)
• Model as it looks when we add the irrelevant $z$: $y_i = \beta x_i + \gamma z_i + \tilde{u}_i$ (2)
• We can represent the error term as $\tilde{u}_i = u_i - \gamma z_i$
• But since $\gamma = 0$ in the true model, we have $\tilde{u}_i = u_i$ and there is no bias

SUMMARY OF THE THEORY
Bias-efficiency trade-off:

            Omitted variable    Irrelevant variable
Bias        Yes*                No
Variance    Decreases*          Increases*

* As long as we have correlation between x and z

FOUR IMPORTANT SPECIFICATION CRITERIA
Does a variable belong in the equation?
1. Theory: Is the variable's place in the equation unambiguous and theoretically sound? Does intuition tell you it should be included?
2. t-test: Is the variable's estimated coefficient significant in the expected direction?
3. $R^2$: Does the overall fit of the equation improve (enough) when the variable is added to the equation?
4. Bias: Do other variables' coefficients change significantly when the variable is added to the equation?
FOUR IMPORTANT SPECIFICATION CRITERIA
• If all conditions hold, the variable belongs in the equation
• If none of them holds, the variable is irrelevant and can be safely excluded
• If the criteria give contradictory answers, most importance should be attributed to theoretical justification
 ▪ Therefore, if theory (intuition) says that the variable belongs in the equation, we include it (even though its coefficient might be insignificant!)

NONLINEAR SPECIFICATION
We will discuss different specifications:
 ▪ nonlinear in the dependent and independent variables, and their interpretation
We will define the notion of a dummy variable and show its different uses in linear regression models

NONLINEAR SPECIFICATION
The relationship between the dependent variable and the explanatory variables is not always linear
 ▪ The use of OLS requires that the equation be linear in coefficients
 ▪ However, there is a wide variety of functional forms that are linear in coefficients while being nonlinear in variables!
We have to choose carefully the functional form of the relationship between the dependent variable and each explanatory variable
 ▪ The choice of a functional form should be based on the underlying economic theory and/or intuition
 ▪ Do we expect a curve instead of a straight line? Does the effect of a variable peak at some point and then start to decline?
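To make the distinction between "linear in coefficients" and "linear in variables" concrete, here is a short illustration. The two equations are chosen for illustration only, and the regressor names $w_1$, $w_2$ are hypothetical.

```latex
% Nonlinear in the variable x, but linear in the coefficients:
% defining w_1 = x and w_2 = x^2 gives an ordinary linear regression.
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon
\quad\Longrightarrow\quad
y = \beta_0 + \beta_1 w_1 + \beta_2 w_2 + \varepsilon

% Nonlinear in the coefficient \beta_1: this cannot be estimated by OLS as written.
y = \beta_0 + x^{\beta_1} + \varepsilon
```

In the first case the transformation is applied to the data before estimation, so OLS can be used unchanged.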
LINEAR FORM
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
• Assumes that the effect of the explanatory variable on the dependent variable is constant: $\frac{\partial y}{\partial x_k} = \beta_k$, $k = 1, 2$
• Interpretation: if $x_k$ increases by 1 unit (in which $x_k$ is measured), then $y$ will change by $\beta_k$ units (in which $y$ is measured)
• The linear form is used as the default functional form until strong evidence that it is inappropriate is found

LOG-LOG FORM
$\ln y = \beta_0 + \beta_1 \ln x_1 + \beta_2 \ln x_2 + \varepsilon$
• Assumes that the elasticity of the dependent variable with respect to the explanatory variable is constant: $\frac{\partial \ln y}{\partial \ln x_k} = \frac{\partial y / y}{\partial x_k / x_k} = \beta_k$, $k = 1, 2$
• Interpretation: if $x_k$ increases by 1 percent, then $y$ will change by $\beta_k$ percent
• Before using a double-log model, make sure that there are no negative or zero observations in the data set

EXAMPLE
• Estimating the production function of the Indian sugar industry:
$\ln Q = 2.70 + \underset{(0.14)}{0.59} \ln L + \underset{(0.17)}{0.33} \ln K$
$Q$ . . . output
$L$ . . . labor
$K$ . . . capital employed
Interpretation: if we increase the amount of labor by 1%, the production of sugar will increase by 0.59%, ceteris paribus. Ceteris paribus is a Latin phrase meaning 'other things being equal'.

LOG-LINEAR FORMS
Linear-log form: $y = \beta_0 + \beta_1 \ln x_1 + \beta_2 \ln x_2 + \varepsilon$
 ▪ Interpretation: if $x_k$ increases by 1 percent, then $y$ will change by approximately $\beta_k / 100$ units ($k = 1, 2$)
Log-linear form: $\ln y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
 ▪ Interpretation: if $x_k$ increases by 1 unit, then $y$ will change by approximately $\beta_k \cdot 100$ percent ($k = 1, 2$)

EXAMPLES OF LOG-LINEAR FORMS
Estimating demand for chicken meat: (estimated equation not shown)
$Y$ . . . annual chicken consumption (kg)
$PC$ . . . price of chicken
$PB$ . . . price of beef
$YD$ . . . annual disposable income
Interpretation: an increase in annual disposable income by 1% increases chicken consumption by 0.12 kg per year, ceteris paribus.

EXAMPLES OF LOG-LINEAR FORMS
Estimating the influence of education and experience on wages: (estimated equation not shown)
$wage$ . . . annual wage (USD)
$educ$ . . . years of education
$exper$ . . . years of experience
Interpretation: an increase in education by one year increases the annual wage by 9.8%, ceteris paribus. An increase in experience by one year increases the annual wage by 1%, ceteris paribus.

POLYNOMIAL FORM
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$
• To determine the effect of $x_1$ on $y$, we need to calculate the derivative: $\frac{\partial y}{\partial x_1} = \beta_1 + 2 \beta_2 x_1$
• Clearly, the effect of $x_1$ on $y$ is not constant, but changes with the level of $x_1$
• We might also have higher-order polynomials, e.g.: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 x_1^4 + \varepsilon$

EXAMPLE OF POLYNOMIAL FORM
• The impact of the number of hours of studying on the grade in Econometrics: (estimated equation not shown)
• To determine the effect of hours on grade, calculate the derivative: (derivative not shown)
 ▪ Decreasing returns to hours of studying: more hours imply a higher grade, but the positive effect of an additional hour of studying decreases with more hours

CHOICE OF CORRECT FUNCTIONAL FORM
• The functional form has to be correctly specified in order to avoid biased and inconsistent estimates
 ➢ Remember that one of the OLS assumptions is that the model is correctly specified
• Ideally: the specification is given by the underlying theory of the equation
• In reality: the underlying theory does not give a precise functional form
• In most cases, either the linear form is adequate, or common sense will point out an easy choice from among the alternatives

CHOICE OF CORRECT FUNCTIONAL FORM
Nonlinearity of explanatory variables
 ▪ often approximated by a polynomial form
 ▪ missing higher powers of a variable can be detected as omitted variables
Nonlinearity of the dependent variable
 ▪ harder to detect based on the statistical fit of the regression
 ▪ $R^2$ is not comparable across models in which $y$ is transformed
 ▪ dependent variables are often transformed to log form in order to make their distribution closer to the normal distribution

DUMMY VARIABLES
Dummy variable - takes on the value 0 or 1, depending on a qualitative attribute
Examples of dummy variables: (not shown)

INTERCEPT DUMMY
• A dummy variable included in a regression alone (not interacted with other variables) is an intercept dummy
• It changes the intercept for the subset of data defined by a dummy variable condition:
$y_i = \beta_0 + \beta_1 D_i + \beta_2 x_i + \varepsilon_i$
where $D_i = 1$ if the $i$-th observation meets a particular condition and $D_i = 0$ otherwise
We have
$y_i = (\beta_0 + \beta_1) + \beta_2 x_i + \varepsilon_i$ if $D_i = 1$
$y_i = \beta_0 + \beta_2 x_i + \varepsilon_i$ if $D_i = 0$

INTERCEPT DUMMY
[Figure: two parallel lines in the (X, Y) plane, both with slope $\beta_2$; intercept $\beta_0 + \beta_1$ for $D_i = 1$ and intercept $\beta_0$ for $D_i = 0$]

EXAMPLE
• Estimating the determinants of wages: (estimated equation not shown)
• Interpretation of the dummy variable M: men earn on average $2.156 per hour more than women, ceteris paribus

SLOPE DUMMY
• If a dummy variable is interacted with another variable ($x$), it is a slope dummy
• It changes the relationship between $x$ and $y$ for a subset of data defined by a dummy variable condition:
$y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i \cdot D_i) + \varepsilon_i$
We have
$y_i = \beta_0 + (\beta_1 + \beta_2) x_i + \varepsilon_i$ if $D_i = 1$
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ if $D_i = 0$

SLOPE DUMMY
[Figure: two lines in the (X, Y) plane with the common intercept $\beta_0$; slope $\beta_1 + \beta_2$ for $D_i = 1$ and slope $\beta_1$ for $D_i = 0$]

EXAMPLE
Estimating the determinants of wages: (estimated equation not shown)
Interpretation: men gain on average 17 cents per hour more than women for each additional year of education, ceteris paribus

SLOPE AND INTERCEPT DUMMIES
• Allow for both a different slope and a different intercept for two subsets of data distinguished by a qualitative condition:
$y_i = \beta_0 + \beta_1 D_i + \beta_2 x_i + \beta_3 (x_i \cdot D_i) + \varepsilon_i$
where $D_i = 1$ if the $i$-th observation meets a particular condition and $D_i = 0$ otherwise
We have
$y_i = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) x_i + \varepsilon_i$ if $D_i = 1$
$y_i = \beta_0 + \beta_2 x_i + \varepsilon_i$ if $D_i = 0$

SLOPE AND INTERCEPT DUMMIES
[Figure: two lines in the (X, Y) plane; intercept $\beta_0 + \beta_1$ and slope $\beta_2 + \beta_3$ for $D_i = 1$, intercept $\beta_0$ and slope $\beta_2$ for $D_i = 0$]

DUMMY VARIABLES - MULTIPLE CATEGORIES
• What if a variable defines three or more qualitative attributes?
• Example: level of education - elementary school, high school, and college
• Define and use a set of dummy variables: one dummy equal to 1 for people with high school education and one equal to 1 for people with college education
• Should we also include a third dummy in the regression, equal to 1 for people with elementary education?
 ▪ No, unless we exclude the intercept!
 ▪ Using the full set of dummies together with an intercept leads to perfect multicollinearity (the dummy variable trap)

SUMMARY
• We revisited the F-test and talked about omitted variables
• We discussed different nonlinear specifications of a regression equation and their interpretation
• We defined the concept of a dummy variable and showed its use
❖ Further readings: Studenmund, Chapter 7; Wooldridge, Chapters 6 & 7
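The dummy variable trap described in this lecture can be illustrated with a short sketch. The sample of education levels and the helper name `make_dummies` are hypothetical, chosen only for illustration. With the full set of three dummies, the dummy columns add up to the intercept column in every row, which is exactly the perfect multicollinearity mentioned above.

```python
# Hypothetical sample of education levels (assumption, for illustration only).
levels = ["elementary", "high school", "college",
          "high school", "elementary", "college"]


def make_dummies(values, categories):
    """Return one 0/1 column per listed category."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Correct specification: elementary school is the omitted base category.
dummies = make_dummies(levels, ["high school", "college"])

# Full set of dummies: one column for every category.
full = make_dummies(levels, ["elementary", "high school", "college"])

# The intercept is a column of ones; compare it with the row sums of the full set.
intercept = [1] * len(levels)
row_sums = [sum(col[i] for col in full.values()) for i in range(len(levels))]
trap = row_sums == intercept

print("high school dummy:", dummies["high school"])
print("college dummy:    ", dummies["college"])
print("full set sums to the intercept column:", trap)
```

With the base category left out, the coefficient of each remaining dummy is interpreted relative to the omitted group (here, people with elementary education).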