Regression basics Lukˊaˇs Laffˊers Matej Bel University, Dept. of Mathematics MUNI Brno 30.9.2021 Organization six meetings per 4hrs: 30.9., 1.10., 11.11., 12.11., 2.12., 3.12. three assignments/tests (per 20%) + written exam (40%) lukas.laffers@gmail.com Some math tools random variable, expectation, variance, covariance probability density function, cumulative density function conditional expectation law of large numbers central limit theorem New to R? Here are some resources W. N. Venables, D. M. Smith and the R Core Team. ”An Introduction to R” https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf Wickham, Hadley, and Garrett Grolemund. R for data science: import, tidy, transform, visualize, and model data. ” O’Reilly Media, Inc.”, 2016. https://r4ds.had.co.nz Wickham, Hadley. ”Elegant graphics for data analysis.” Springer. https://ggplot2-book.org Today assumptions of the regression model geometry of linear squares confidence intervals connection to t-test stars and p-values - what do they mean purpose: prediction vs explanation correlated variables weighted regression heteroskedasticity model selection: bias-variance trade-off some practical considerations Linear regression Regression - What is it good for? Prediction (What will happen?) easier, at least for repeated events Explanation (Why did something happen?) difficult, requires deep institutional knowledge and subject matter expertise. These are fundamentally very different objectives. Regression Prediction - easier, at least for repeated events. Explanation - difficult, requires deep institutional knowledge and subject matter expertise. Data Data x- independent variable, regressor, factor, feature y - outcome, dependent variable Least squares Least squares Notation - Univariate regression y = β0 +β1x +ε or yi = β0 +β1xi +εi rewritten in matrix form y = Xβ +ε y =     y1 y2 ... yn    , X =     1 x1 1 x2 ... 1 xn    , β = β0 β1 , ε =     ε1 ε2 ... εn    . Basic linear regression Notation - Multivariate regression y = β0 +β1x1 +β2x2 +...+βpxp +ε or yi = β0 +β1xi +β1xip +...+βpxip +εi rewritten in matrix form y = Xβ +ε y =     y1 y2 ... yn    , X =     1 x11 x12 ... x1p 1 x21 x22 ... x2p ... ... 1 xn1 xn2 ... xnp    , β =       β0 β1 β2 ... βp       , ε =     ε1 ε2 ... εn    . Model - philosophical interlude The purpose of the model is to be useful. It is not to be correct. Model should be right about the relevant parts. It simplifies the parts that it does not aim to model. Ingredients: y,x1,x2,..., - observed random variables ε - unobserved random variables β0,β1,.... - unknown variables we wish to estimate y = β0 +β1x1 +β2x2 +...+βpxp +ε - assumption about how the different quantities are related to each other Minimize sum of squares: n ∑ i=1 (yi −Xiβ)2 = εT ε = (y −Xβ)T (y −Xβ) differentiating via β, we get: XT X ˆβ = XT y, if XT X is invertible, we get ˆβ = (XT X)−1 XT H y = Hy Back to the model Ingredients: y,x1,x2,..., - observed random variables ε - unobserved random variables β0,β1,.... - unknown (fixed!) variables we wish to estimate y = β0 +β1x1 +β2x2 +...+βpxp +ε - assumption about how the different quantities are related to each other ˆβ = (ˆβ0, ˆβ1,...., ˆβp)T - estimator = vector of random variables. It is a function of our data sample, which is random. Our best attempt to recover the true unknown β. Different estimators possess different qualities (bias, variance, robustness). OLS estimator is just one of them (but pretty good). Linear model = simple linear model can model non-linear relationships linear = linear in parameters Also a linear model log(y) = β0 +β1x +β2x2 +ε y = β0xβ1 ε → log(y) = log(β0)+β1 log(x)+log(ε) = β∗ 0 +β∗ 1 x∗ +ε∗ Goodness of fit measure R2 = 1 − ∑(yi − ˆyi)2 ∑(yi − ¯y)2 = 1 − RSS TSS = ESS TSS = ∑(¯y − ˆyi)2 ∑(yi − ¯y)2 , RSS residual sum of squares TSS total sum of squares R2 = (cor(ˆy,y))2 = ∑(¯y − ˆyi)2 ∑(yi − ¯yi)2 . R2 = 0.65 (for a simple linear model) Geometry: It is a simple projection. Goodness of fit? Prediction If you wish to predict well, you’d better explain the variation in y. Explanation Not necessary a problem if you have a small R2 . Projection ˆy = X ˆβ = X XT X −1 XT y ˆβ = X XT X −1 XT P ≡ projection matrix y = Py ˆε = y −X ˆβ = (I −P) M ≡ residual maker matrix y = My symmetric PT = P and MT = M idempotent PP = P and MM = M ˆy ⊥ ˆε Different qualities Ordinary Least Squares estimator geometric interpretation easy analytic formula ˆβ = (XT X)−1 XT y among the linear, unbiased estimator it has the lowest variance (Gauss-Markov theorem) y = Xβ +ε E[ε|X] = 0 Var[ε|X] = σ2 In =⇒ For any unbiased estimator ˜β of β we have var[˜β|X] ≥ σ2 (XT X)−1 Maximum Likelihood Estimator under normal errors (we will discuss latter) Stastical inference Assume that errors are normally distributed: ε ∼ N(0,σ2 I) + y = Xβ +ε =⇒ y ∼ N(Xβ,σ2 I) So we get that* ˆβ = (XT X)−1 XT y ∼ N β,(XT X)−1 σ2 ∗ Var(Ay) = AVar(y)AT =⇒ Var(ˆβ) = (XT X)−1 XT (σ2 I)((XT X)−1 XT )T = (XT X)−1 σ2 Hypothesis tests H0 : βi = 0 can be tested: ti = ˆβi se(ˆβi) ∼ tn−p. where se(ˆβi) is the standard error - sq. root of the diagonal of the matrix (XT X)−1 ˆσ2 ˆσ2 = 1 n−p ∑i ˆε2 i RSS = ∑i ˆε2 i is the residual sum of squares tn−p is the Student’s t-distribution with n −p degrees of freedom Hypothesis tests H0 : βi=βj=0 can be tested: F = (RSSω −RSSΩ)/(2) RSSΩ/(n −p) ∼ F2,n−p. where Ω denotes a large model with p parameters ω denotes a small model with p−2 paramaters (a special case of Ω, the two models are nested) Confidence intervals CIα = [ˆβi −t (α/2) n−p se(ˆβi), ˆβi +t (α/2) n−p se(ˆβi)] Parameter is a fixed value a confidence interval is random interval. CIs for expected/future values ˆy0 = xT 0 ˆβ CI for expected value ˆy0 ±t (α/2) n−p ˆσ xT 0 (XT X)−1x0 CI for future value ˆy0 ±t (α/2) n−p ˆσ 1 +xT 0 (XT X)−1x0 why ”1+” ? var(ˆy0 +ε0) = var(ˆy0)+var(ε0) = xT 0 (XT X)−1 x0σ2 +σ2 = (1+xT 0 (XT X)−1 x0)σ2 CIs for expected/future values : Returns to education. log(wage) = β0 +β1education+β2age+ε an increase of education by one year increases wage by β1 ·100% percent. our model predicts, that for individuals of the same age, an increase of education by 1 extra year is associated with an increase of wage by β1 ·100% percent. Be careful with causal interpretations based on observational data. Interpretation of parameters y = β0 +β1x1 +β2x2 +ε log(y) = β0 +β1x1 +β2x2 +ε y = β0 +β1 log(x1)+β2x2 +ε log(y) = β0 +β1 log(x1)+β2x2 +ε y = β0 +β1x1 +β2x2 +ε [x1 → x1 +δ] =⇒ [y → y +β1δ] our model predicts, that an increase of x1 by one unit is associated with an increase in y by β1 units, if the x2 will not change. log(y) = β0 +β1x1 +β2x2 +ε [x1 → x1 +δ] =⇒ [y → exp(β0 +β1(x1 +δ)+β2x2 +ε)) = y ·exp(β1δ) ≈ y(1 +β1δ)] our model predicts, that an increase in x1by one unit is associated with an increase in y by approximately β1 ·100%, if x2 will not change. y = β0 +β1 log(x1)+β2x2 +ε [x1 → x1 ·(1 +δ)] =⇒ [y → y +β1 log(1 +δ) ≈ y +β1δ] our model predicts, that an increase in x1 by 1% is associated with an increase in y by approximately β1/100 units, if x2 will not change. log(y) = β0 +β1 log(x1)+β2x2 +ε [x1 → x1 ·(1 +δ)] =⇒ [y → y ·exp(β1 log(1 +δ)) = y ·(1 +δ)β1 ≈ y ·(1 +β1δ)] our model predicts, that an increase in x1 by 1% is associated with an increase in y by approximately β1%, if x2 will not change. Interaction terms log(wage) = β0 +β1educ +β2south +β3south ·educ +ε log(wage) = β0 +β1educ +ε log(wage) = β0 +β2 +(β1 +β3)educ +ε Collinearity Why is it a problem? Estimators have a high variance numerically unstable How to detect it (1) Look at the correlation matrix of regressors and look for numbers close to +1 or -1. (2) Running a regression of xi on other regressors, measure of linear fit R2 i is close to 1. (3) Sort eigenvalues of XT X,λ1 ≥ ··· ≥ λp. Condition number κ = λ1 λp ≥ 30 indicates problems. How to quantify the effect of it var(ˆβj) = σ2 1 1 −R2 j 1 ∑n i=1(xij − ¯xj)2 , Collinearity These two models y = β0 +β1x1 +ε y = β0 +β1x1 +β2x2 +ε may lead to a completely different estimates of β1. : Pearl’s Simpson Machine. http://dagitty.net/learn/simpson/index.html Simpson’s Simpson’s paradox Ommitted variable bias log(wage) = β0 +β1education+β2age+ε log(wage) = β∗ 0 +β∗ 1 education+β∗ 2 age+β∗ 3 ability+ε ability = γ0 +γ1education+ε log(wage) = (β∗ 0 +β∗ 3 γ0)+(β∗ 1 +β∗ 3 γ1) β1 education+β∗ 2 age+(β∗ 3 ε +ε) Does not matter as long as ability is either β∗ 3 = 0 - irrelevant γ1 = 0 - uncorrelated with education More on errors: Heteroskedasticity yi = 1 +3xi +xi ·ε The larger |xi| the larger the error But we typically assume σ2 i = const So far we have assumed: var(ε) = σ2 I =      σ2 0 ... 0 0 σ2 ... 0 ... ... 0 0 ... σ2      . But what if var(ε) = σ2 Σ? Σ = SST . Transform back y = Xβ +ε S−1 y = S−1 Xβ +S−1 ε y = X β +ε Variance of the new transformed errors ε is var(ε ) = var(S−1 ε) = S−1 var(ε)S−T = σ2 I. We apply OLS on the transformed data S−1 y a S−1 X. We minimize (y −X β)T (y −X β) = (y −Xβ)T Σ−1 (y −Xβ), which is solved by ˆβW = (XT Σ−1 X)−1 XT Σ−1 y. The variance of this estimator is var(ˆβW ) = σ2 (XT Σ−1 X)−1 . var(ε) =      σ2 1 0 ... 0 0 σ2 2 ... 0 ... ... 0 0 ... σ2 n      =      1 w1 0 ... 0 0 1 w2 ... 0 ... ... 0 0 ... 1 wn      . S =       1√ w1 0 ... 0 0 1√ w2 ... 0 ... ... 0 0 ... 1√ wn       S−1 =      √ w1 0 ... 0 0 √ w2 ... 0 ... ... 0 0 ... √ wn      . Examples: If var(εi) ∝ xi, we make use of wi = x−1 i . If yi are averages based on ni obs, under LLN this is proportional to 1/ni. Hence var(yi) = var(εi) = σ2 /ni, so wi = ni. Example: average wage in different countries. In general we set wi = 1/var(yi). The covariance matrix is generally unknown, it may be estimated in different ways : HR standard errors : HC0: ˆVˆβ = (XT X)−1 ∑i XiXT i ˆε2 i (XT X)−1 HC1: ˆVˆβ = n n−k (XT X)−1 ∑i XiXT i ˆε2 i (XT X)−1 HC2: ˆVˆβ = (XT X)−1 ∑i XiXT i ¯¯ε2 i (XT X)−1 HC3: ˆVˆβ = (XT X)−1 ∑i XiXT i ˜ε2 i (XT X)−1 where (see Hansen 4.10 Residuals) ˆεi = yi −XT i ˆβ is a vector of residuals ¯¯ε = (1 −hii)−1/2ˆε is a vector of standardized residuals hii is the leverage: the diagonal element of residual maker matrix M ˜ε = yi −XT i ˆβ(−i) = (1 −hii)−1ˆε is a vector prediction error ˆβ(−i) is estimated without i-th observation. Weighting? How do we deal with heteroskedasticity? Calculate ˆβW = (XT ˆΣ−1 X)−1 XT ˆΣ−1 y for some estimate ˆΣ Calculate ˆβ = (XT X)−1 XT y and then use ˆΣ to adjust for standard errors Angrist and Pischke (2008, section 3.4.1) argue for the second approach. More efficient if ˆΣ is estimated well. But this requires good model for E[ε2 |X] Estimator for E[ε2 |X] may have bad finite sample properties. Efficiency gains from using ˆβW are typically modest. In case that E[yi|Xi] is not linear, the unweighted ˆβ estimates are at least the best linear minimum mean squared error. Clustered standard errors G groups. Within these groups, errors are allowed to be correlated. But not between the groups Σ =                     Σ1    0 ... 0 ... 0 ... 0    ...    0 ... 0 ... 0 ... 0       0 ... 0 ... 0 ... 0    Σ2 ...    0 ... 0 ... 0 ... 0    ... ...    0 ... 0 ... 0 ... 0       0 ... 0 ... 0 ... 0    ... ΣG                     When should you cluster your standard errors Assignment of a treatment is not on an individual level (but on a different level, say school level, town level etc) Sample is not random but first clusters are sampled and then observations within clusters are sampled. Diagnostics Correct model Incorrect model Model Selection - Occam’s razor ’Among competing hypotheses, the one with the fewest assumptions should be selected.’ John Punch 1639: ’Entities must not be multiplied beyond necessity’ Aristoteles: ’We may assume the superiority ceteris paribus [other things being equal] of the demonstration which derives from fewer postulates or hypotheses.’ Ptolemaus: ’We consider it a good principle to explain the phenomena by the simplest hypothesis possible.’ Madhva: ’To make two suppositions when one is enough is to err by way of excessive supposition’ Isaac Newton: ’We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, as far as possible, assign the same causes.’ Small model vs Large model If we wish to predict, we may prefer larger model even if the smaller one is more parsimonious. When explaining, we prefer smaller models. Automatic model selection based on p-values We add regressors with smallest p-values, until some threshold is met. We remove regressors with the largest p-values, until some threshold is met. The use of any automatic model selection tool is very risky. p-values are not valid as they are the result of multiple testing. Results look better than reality. removal = no association model selection tools cannot replace deep subject matter expertise : Model building If you have too many regressors relative to the sample size (eg. p=50, n=100). You may find some patterns in the data just by chance! : Model selection Freedman, David A., and David A. Freedman. ”A note on screening regression equations.” The American Statistician 37.2 (1983): 152-155. IC = −([FIT ]−[COMPLEXITY]) Akkaike Information criterion (AIC) AIC = −2L(ˆθ)+2p, we prefer models with a small AIC. An alternative to AIC is Bayes Information Criterion (BIC) BIC = −2L(ˆθ)+2p log(n) it penalizes large (more comples) models more. Suppose you have a set of competing models with a similar fit Do they give qualitatively similar results? Do they predict similarly? How difficult/expensive is data collection? Are the model assumptions satisfied? (diagnostic graphs) If we are in a situation, where models that fit data similarly well lead to very different results, it may be that we cannot answer the question of interest. It is intelectually honest to appreciate this uncertainty, no matter how inconvenient/impractical it may be. Model choice should not be data driven only (say looking at the statistical significance). Subject matter expertise is always appreciated. Regression and t-test connection yA ∼ N(µA,σ2 ) yB ∼ N(µB,σ2 ) H0 : µA = µB vs H1 : µA = µB T = (¯yA − ¯yB)−(µA − µB) Sp 1 nA + 1 nB ∼ tnA+nB−2, where S2 p = (nA−1)S2 A+(nB−1)S2 B nA+nB−2 a S2 A a S2 B are sample variances. yi = βAdiA +βBdiB +ε yA ∼ N(βA,σ2 ) and yB ∼ N(βB,σ2 ). These are identical assumptions to the t-test. Linear regression with dummy variables is the same thing as a t-test. Statistical significance Statistical vs. Practical significance (a.k.a. is the effect economically meaningful) More on Statistical significance Statistical significance - p-values and confidence intervals What they are not! Greenland, Sander, et al. ”Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.” European journal of epidemiology 31.4 (2016): 337-350. Statement of American Statistical Association: Wasserstein, Ronald L., and Nicole A. Lazar. ”The ASA statement on p-values: context, process, and purpose.” (2016): 129-133. P-values: Common misconceptions The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that chance alone produced the association. No! The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true. Nope! A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected. Also no! There are 15 more variations in Greenland et al. (2016) Confidence intervals: Common misconceptions The specific 95 % confidence interval presented by a study has a 95 % chance of containing the true effect size. No! An effect size outside the 95 % confidence interval has been refuted (or excluded) by the data. Not this one! If two confidence intervals overlap, the difference between two estimates or studies is not significant. Still no! There are a few more in Greenland et al. (2016) So what are they then?? These quantities are model based. They express a level of uncertainty about a particular statement (a hypothesis) assuming that a particular model is correct! p-value: Assuming that the model assumptions are correct AND assuming that a particular hypothesis is true, then in a repeated sampling setup, you would observe more extreme values of a test-statistic in approximately p.100% cases. p-value is a measure od compatibility of the calculated test statistic with the underlying model assumption and the null hypothesis. So what are they then?? These quantities are model based. They express a level of uncertainty about a particular statement (a hypothesis) assuming that a particular model is correct! 95%CI: Assuming that the model assumptions are correct AND assuming that a particular hypothesis is true, in a repeated sampling, 95% CI is an interval estimate (that is, a random interval), that would cover the true effect in approximately 95% cases. CI summarises the results of a hypothesis tests for multiple effect sizes. Beautiful vis here: https://seeing-theory.brown.edu/frequentist-inference/index.html Misc: Principal components analysis We often wish to reduce the dimension of X. Source: https://www.quora.com/What-is-an-intuitive-explanation-for-PCA Misc: Principal components analysis Find the linear combination of regressors that maximize the variance: var(Xu1) → max subject to uT 1 u1 = 1 then var(Xu2) → max subject to uT 2 u2 = 1 and uT 1 u2 = 1 ··· Xu1 is the first principal component Xu2 is the second principal component are orthogonal by construction compress most information (in terms of variance) into fewer variables may be interpretable may help us to identify ”similar” points Properties PCA examples A few more comments... Ask about the data. How it was collected, handled, updated? Check for data discrepancies, summary statistics. Do plot the data. Yes, always. Talk to the experts. Comment your code, make it easily reproducible. Adhere to coding standards (http://adv-r.had.co.nz/Style.html). Do not alter your source dataset, od any preprocessing in a separate file. Consider interacting two most important regressors. Thank you for your attention! References There are zillions of books when it comes to regression. The choices I made reflect my personal biases. An excellent book on linear regression with applications in R: Faraway, Julian J. Linear models with R. Chapman and Hall/CRC, 2004. Accompanying website: https://julianfaraway.github.io/faraway/LMR/ Very accessible intro to regression: is ch1 in: Adams, Christopher P. Learning Microeconometrics with R. Chapman and Hall/CRC, 2020. Accompanying website: https://sites.google.com/view/microeconometricswithr/home?authuser=0 A practical and very popular book with gear specifically towards economics is Angrist, Joshua D., and J¨orn-Steffen Pischke. Mostly harmless econometrics. Princeton university press, 2008. Reference books (free to download) are two Bruce Hansen’s books: Probability and Econometrics: https://www.ssc.wisc.edu/ bhansen/econometrics/ The explanations are rather brief and succinct (which may not suit everyone). The paper that shows that you can find patterns in a noise: Freedman, David A. ”A note on screening regression equations.” The American Statistician 37.2 (1983): 152-155. When should one cluster standard errors: https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle is non-technical summary of: Abadie, Alberto, et al. When should you adjust standard errors for clustering?. No. w24003. National Bureau of Economic Research, 2017. The classic and book length account on Model selection by the pioneers of the field: Claeskens, Gerda, and Nils Lid Hjort. ”Model selection and model averaging.” Cambridge Books (2008). Discussions on statistical significance: Greenland, Sander, et al. ”Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.” European journal of epidemiology 31.4 (2016): 337-350 and Wasserstein, Ronald L., and Nicole A. Lazar. ”The ASA statement on p-values: context, process, and purpose.” (2016): 129-133. R resources: W. N. Venables, D. M. Smith and the R Core Team. ”An Introduction to R https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf R resources: Wickham, Hadley, and Garrett Grolemund. R for data science: import, tidy, transform, visualize, and model data.” O’Reilly Media, Inc.”, 2016. https://r4ds.had.co.nz R resources: Wickham, Hadley. ”Elegant graphics for data analysis.” Springer. https://ggplot2-book.org For those interested in visualizations I would recommend Jonatan Schwabish’s books: Schwabish, Jonathan. Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks. Columbia University Press, 2021 and Schwabish, Jonathan. Better presentations. Columbia University Press, 2016. Also there is a short article Schwabish, Jonathan A. ”An economist’s guide to visualizing data.” Journal of Economic Perspectives 28.1 (2014): 209-34.