9. Linear regression, correlation and intro to general linear models

Regression and correlation
Both regression and correlation refer to associations between two quantitative variables. In regression, one variable, the predictor, is considered independent and its values are considered not to be random. The other variable, the response, depends on the values of the predictor with a certain level of error variability, i.e. it is a random variable. In correlation, both variables are considered random. Regression and correlation are thus quite different – theoretically. In practice, however, they are numerically identical with respect to both the measure of association and the p-values (type I error probabilities) associated with rejecting the null hypothesis of independence between the two variables.

Linear regression
A linear association between two quantitative variables X and Y, of which Y is a random variable, can be described by the equation:

Y = a + bX + ε

where a and b are the intercept and slope of a linear function, which represents the systematic (deterministic) component of the regression model, and ε is the error (residual) variation representing the stochastic component. ε is assumed to follow a normal distribution with mean = 0. The goal of regression model fitting is to estimate the population slope and intercept from sample data of Y and X; a and b are thus estimates of population parameters. There are multiple approaches to such estimation. Maximum likelihood estimation is the most common; in ordinary regression it gives results numerically identical to least square estimation. We shall discuss least square estimation here, as it is fairly intuitive and will help us understand the relationship with ANOVA.

Least square estimation aims at minimizing the sum of error squares (SSerror), i.e. the sum of squared differences between the fitted and observed values of the response variable (Fig. 9.1). Note that this mechanism is notably similar to that of analysis of variance. In parallel with ANOVA, we can also define the total sum of squares (SStotal) and the regression sum of squares (SSregr). Subsequently, we can calculate mean squares (MS) by dividing each SS by the corresponding DF, with DFtotal = n – 1, DFregr = 1, and DFerror = DFtotal – DFregr = n – 2, where n is the total number of observations. Hence we get:

MSregr = SSregr/DFregr
MSerror = SSerror/DFerror

As in ANOVA, the ratio between the MS can be used in an F-test of the null hypothesis that there is no linear relationship between the two variables:

FDFregr,DFerror = MSregr/MSerror

Rejecting the null hypothesis means that the two variables are linearly related. Note, however, that a non-significant result may also be produced when a relationship exists but is not linear (e.g. when it is quadratic).

Fig. 9.1 Mechanism of least square estimation in regression: definition of squares exemplified with the red data point.

In regression, we are usually interested not only in statistical significance but also in the strength of the association, i.e. the proportion of variability in Y explained by X. That is measured by the coefficient of determination (R2):

R2 = SSregr/SStotal

which can range from 0 (no association) to 1 (a deterministic linear relationship).
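A minimal sketch of these computations in R follows; the vectors x and y are hypothetical simulated data (not an example from the text), so only the mechanics, not the numbers, matter here.

## least-square fit, its F-test and R2 from an lm object
set.seed(1)
x <- runif(30, 0, 10)                  # predictor, treated as fixed
y <- 2 + 0.8 * x + rnorm(30, sd = 1)   # response with normally distributed error

m <- lm(y ~ x)          # least-square estimates of the intercept (a) and slope (b)
anova(m)                # SSregr, SSerror, the corresponding MS and the F-test

ss <- anova(m)[["Sum Sq"]]
ss[1] / sum(ss)         # R2 = SSregr / SStotal
summary(m)$r.squared    # the same value, as reported by summary()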
Alternatively, the so-called adjusted R2 may be used (and is reported by R), which accounts for the fact that the association is computed from samples and not from populations:

adjusted R2 = 1 – MSerror/MStotal

Returning to the regression coefficients: since they are estimates, standard errors of the estimates can be computed. Their significance (i.e. significant difference from zero) may thus be tested by a single sample t-test. The p-value of such a test for the slope (b) is identical to that of the F-test in simple regression with a single predictor. Note that the test of the intercept (reported by R and other statistical software) is irrelevant for the significance of the regression itself; a significant intercept only indicates that the expected value of Y at X = 0 is significantly different from zero.

Regression diagnostics
We have discussed the systematic component of the regression equation. However, the stochastic component is also important, because its properties provide crucial information on the validity of the regression assumptions and thus on the validity of the whole model. The stochastic component of the model, called the model residuals, can be computed as:

ε = Y – a – bX = Y – fitted(Y)

The residuals form a vector with one value per data point. As such, they can be analyzed by descriptive statistics. They may also be standardized by dividing them by their standard deviation. The basic assumptions concerning the residuals are:
1. Residuals should follow the normal distribution.
2. The size of their absolute values should be independent of the fitted values.
3. There should be no obvious trend in the residuals associated with the fitted values; such a trend would indicate non-linearity of the relationship between X and Y.
These assumptions are best evaluated on regression-diagnostics plots (Fig 9.2). In addition, it is worth checking that the regression result is not driven by a single extreme observation (or a few of them), which is shown in the bottom-right plot of Fig 9.2.

Fig 9.2. Regression diagnostics plots. 1. Residuals vs. fitted values indicate potential non-linearity of the relationship (smoothed trend displayed by the red line). 2. The normal Q-Q plot displays the agreement between the normal distribution and the distribution of the residuals (dashed line). 3. The square root of the absolute values of the residuals indicates a potential dependence of the size of the residuals on the fitted values. 4. The residuals vs. leverage plot (https://en.wikipedia.org/wiki/Leverage_(statistics)) detects points with a high influence on the regression parameter estimates (these points have a high Cook's distance; https://en.wikipedia.org/wiki/Cook%27s_distance).

Correlation
Correlation is a symmetric measure of the association between two random variables, of which neither can be considered a predictor or a response. Correlation is most commonly measured by the Pearson correlation coefficient:

r = Σ(Xi – X̄)(Yi – Ȳ) / √[Σ(Xi – X̄)² · Σ(Yi – Ȳ)²], with all sums taken over i = 1, …, n

Its values can range from -1 (absolute negative correlation) to +1 (absolute positive correlation), with r = 0 corresponding to no correlation. r2 then refers to the amount of shared variability. Numerically, the Pearson r2 and the regression R2 have identical values for given data and have basically the same meaning. Pearson r is also an estimate of a population parameter; its significance (i.e. significant difference from zero) can thus be tested by a single sample t-test with n – 2 degrees of freedom.
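This equivalence is easy to check in R; a brief sketch, again with hypothetical simulated vectors x and y:

## Pearson correlation, its test, and the link to regression R2
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.8 * x + rnorm(30, sd = 1)

cor(x, y)                       # Pearson correlation coefficient r
cor.test(x, y)                  # r together with the t-test on n - 2 degrees of freedom
cor(x, y)^2                     # shared variability (r squared) ...
summary(lm(y ~ x))$r.squared    # ... numerically identical to the regression R2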
On correlation and causality
Note that a significant result of a regression on observational data may only be interpreted as correlation (or coincidence), even though one variable is called the predictor and the other the response. Causal explanations imply that a change in the predictor value causes a directional change in the response. Causality may therefore only be tested in manipulative experiments, where the predictor is manipulated. See more details on this in Chapter 6.

Brief introduction to multiple regression and general linear models
In regression, multiple predictors may be used in a model:

Y = a + b1X1 + b2X2 + … + bnXn + ε

provided that the number of predictors is lower than the number of observations – 1. While statistical testing of such models is easy and largely follows the same principles as in simple regression, it is more difficult to decide which predictors to include in the model and which to leave out (i.e. how to find the best model), as some may have significant effects while others may not. This is done by a model selection procedure, details of which are beyond the scope of the present text. Importantly, the categorical predictors of ANOVA may be decomposed into a series of k – 1 quantitative variables (e.g. binary variables containing 0s and 1s), where k is the number of categories. This means that ANOVA and regression are fundamentally identical. In addition, both quantitative and categorical variables may be used as predictors in a single analysis. Statistical models containing a single response variable and possibly multiple quantitative and categorical predictors, and assuming a normal distribution of the residuals, are called general linear models (or simply linear models).

How to do in R
1. Regression (or a linear model): start with the function lm to fit the model and save the lm output into an object:
model.1<-lm(response~predictor)
or
model.2<-lm(response~predictor1+predictor2+…)
anova(model.1) performs analysis of variance of the model (i.e. tests its significance by an F-test). Models may also be compared by anova(model.1, model.2)
summary(model.1) displays a summary of the model, including the t-tests of individual coefficients
resid(model.1) extracts model residuals
predict(model.1) returns predicted values
plot(model.1) plots regression diagnostic plots of the model
2. Pearson correlation coefficient
cor(Var1, Var2) computes just the coefficient value
cor.test(Var1, Var2) computes the coefficient value together with a significance test
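A short worked sketch of a general linear model with both predictor types follows; temp, site and growth are hypothetical simulated variables, not data from the text.

## general linear model combining a quantitative and a categorical predictor
set.seed(2)
temp   <- runif(40, 5, 25)                       # quantitative predictor
site   <- factor(rep(c("A", "B"), each = 20))    # categorical predictor
growth <- 1 + 0.3 * temp + ifelse(site == "B", 2, 0) + rnorm(40, sd = 1)

model.3 <- lm(growth ~ temp + site)   # both predictor types in one model
anova(model.3)                        # sequential F-tests of the two predictors
summary(model.3)                      # coefficient estimates and their t-tests
plot(model.3)                         # regression diagnostic plots

Here R automatically decomposes site into k – 1 binary (dummy) variables, as described above, so the categorical and the quantitative predictor are handled within the same linear model.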