Simple linear regression
E0420 Week 6

Linear Regression — Icebreaker to Machine Learning Algorithms. | by Rishi Kumar | Nerd For Tech | Medium
Simple Linear Regression in Python (From Scratch) | by Aidan Wilson | Towards Data Science

What is it good for?
• Testing associations between:
  • 1 or more independent variables (IVs)
    • Categorical/binary
    • Ordinal
    • Continuous
  • 1 dependent variable (DV, outcome)
• Possibility to add covariates

Goal of regression
• Prediction
  • Use known IV(s) to predict the DV
  • Correlation ≠ causation still applies!
• Explanation
  • Explain the DV's variability by partitioning it into a "chunk" that is explained by the IV and a "chunk" that is left unexplained

How regression works
• Finding a mathematical function, or model, that best describes the association between the variables
• Simple linear regression: a straight line, or linear equation
• The regression line is the line that provides the best possible description of the relationship between X (IV) and Y (DV)
• If the association is not linear, we can also model a quadratic function

Which line is the best fitting?
[Figure: basic scatterplot. Source: https://pythonbasics.org/seaborn-scatterplot/]

Which line?
[Figure: least-squares regression line. Source: Simple Linear Regression in Python (From Scratch) | by Aidan Wilson | Towards Data Science]
• The regression line selected is the one that minimizes the sum of the squared vertical distances to the data points

The regression line
• Regression equation: Yi = β0 + β1Xi + ei
• The slope of the line is given by a constant value used for everyone in the sample, β1
  • How much of a change in Y (DV) is expected for every one-unit increase in X (IV)
  • Unstandardized (B) or standardized (β)
• The point at which the line crosses the Y axis is also a constant, β0
  • Intercept
  • This is also the value of Y (DV) when X (IV) equals zero
• Xi and Yi are variable scores for each observation

FCAT example
• The following data show 2001 FCAT math scores and the percentage of free/reduced lunch students for 10 elementary schools
• What is the IV? And the DV?
% Free Lunch    2001 FCAT Math
9.4             366
57.2            302
54.6            327
15.7            330
19.5            330
38.5            335
61.3            308
66.3            318
16.4            335
14.9            340

FCAT example
• The values of B1 and B0 are:
  • B1 = -.618
  • B0 = 350.969
• The regression equation is
  • Y'i = 350.969 - .618(Xi)
• Interpreting the unstandardized regression coefficient:
  • For every 1-percentage-point increase in the free/reduced lunch rate, a .618-point decrease is predicted in FCAT scores

Predicting scores (DV from IV)
• Consider the school with a F/R lunch rate of 38.5
  • Y'i = 350.969 - .618(38.5) = 327.176
• In regression terms, Y' is the predicted score, or predicted value of Y
• Y' would be the same for every school with F/R % (i.e., X) = 38.5
• However, the predictions come with an error!
• How much of an error?
• How well does our line fit the data?
• How much variability in the DV is explained by the IV?
[Figure source: https://www.datasciencecentral.com/r-squared-in-one-picture/]

Explaining variability
• Consider the school with a free/reduced lunch rate (X1) of 9.4 and an FCAT mean (Y1) of 366
• This school was 36.9 FCAT points above the grand mean of Y (329.10)
• This distance of 36.9 points represents school 1's contribution to the Y variability
• It is the goal of the regression procedure to explain this total variation (SSTOTAL)

Graphic of SSTOTAL
[Figure: squared deviations (Yi - Ybar)² of each observation from the mean of Y]
• This model does not include the IV: each school's mean is a function of the grand mean and a unique error

The effect of IV
• Quantification of the shift in scores from the overall mean that can be attributed to the IV
• This shift can be computed for each value of X (the IV)
  • This is found using the predicted Y scores from the regression equation (line)
• The variability attributed to the IV for the entire sample is computed by
  • Squaring each expected distance from the mean (the positive and negative distances would cancel out otherwise)
  • Summing these values across the entire sample (SSREG)

Graphic of SSREG
[Figure: squared deviations (Y'i - Ybar)² of each predicted value from the mean of Y]

Residual (error) variability
• The IV does not completely explain the variation
in Y scores
• The portion of the variation around the mean that is not captured by the IV is called residual variability
• This is defined as the difference between the observed Y values and those predicted by the regression line
  • ei = Yi - Y'i
• The residual variability for the entire sample is computed by
  • Squaring each observation's residual (the positive and negative errors would cancel out otherwise)
  • Summing these values across the entire sample (SSRES)

Graphic of SSRES
[Figure: squared residuals (Yi - Y'i)² around the regression line, with the mean of Y marked]

FCAT example
• The residual for the school with a 9.4% F/R lunch rate would be:
  • e1 = Y1 - Y'1 = 366 - 345.16 = 20.84
• Thus, the school's actual performance was 20.84 FCAT points higher than what would be predicted using % F/R
• 20.84 is the portion of that school's FCAT variation that is not explained by the IV

Summary of FCAT example
• The 1st school's total distance, or variation, from the mean of Y was 36.9
• Of this variation, 16.06 can be attributed to the IV, while 20.84 is unexplained
• Thus, the total variation for this school has been partitioned into two components that sum to the total variation for that school
  • 36.9 = 16.06 + 20.84

Partitioning total variability
• For the entire sample, the total variation in Y can be partitioned into two components:
  • Variability attributed to the IV (SSREG)
  • Variability not accounted for by the IV (SSRES)
• SSTOT = SSREG + SSRES

Graphic of variance partitioning
[Figure: Residual = (Yi - Y'i)²; IV effect = (Y'i - Ybar)²; mean of Y marked]

FCAT example
• The following quantities are obtained from the ANOVA summary table
  • SSTOT = 2858.9
  • SSREG = 1748.826
  • SSRES = 1110.074

Coefficient of determination (R²)
• The total proportion (or %) of the DV variability that is explained by knowing X is called the coefficient of determination
  • R² = SSREG / SSTOT
• FCAT example:
  • R² = 1748.826 / 2858.9 = .612
  • Squaring Pearson's r yields the same value: .782² = .612

R²
[Figure sources: https://journals.plos.org/plosone/article/figure/image?size=large&id=10.1371/journal.pone.0196740.g001 and https://www.datasciencecentral.com/r-squared-in-one-picture/]
O'Brien, R. M.
(2018). A consistent and general modified Venn diagram approach that provides insights into regression analysis. PLoS ONE, 13(5), e0196740.

Significance testing of R²
• RQ: Does the IV (or IVs) account for variability in the DV?
• H0: R² is no larger than 0 (i.e., R² = 0 in the population)
• Test this hypothesis via the F statistic; reject H0 if the F statistic is ≥ the critical F (p ≤ .05)
• F represents a comparison of the variance explained by the IV and the residual variance
  • F = (SSREG / dfREG) / (SSRES / dfRES)
  • dfREG = k, dfRES = N - k - 1 (k = number of IVs, N = sample size)
• The F test tells you if a group of variables are jointly significant

Significance testing of regression (b) coefficients
• Upon finding a significant R² value, determine which IV is contributing most to the significant R²
• Unstandardized regression coefficients are tested using a t statistic
• The t-test tells you if a single variable is statistically significant
• This tests whether or not the slope is different from 0
• H0: β = 0, H1: β ≠ 0

Simple linear regression assumptions
• Linearity: The relationship between X and the mean of Y is linear
• Independent errors: Residuals of observations should be uncorrelated
• Homoscedasticity: The variance of the residuals is the same for any value of X
• Normally distributed errors: Residuals in the model should be random, normally distributed values with a mean of 0

Normality of residuals
[Figure: Modelling Linear Relationships with Randomness Present]

Homoscedasticity

Regression write-up
• The results of the regression analysis showed that extraversion explained 35.8% of the variance (R² = .38, F(2, 55) = 5.56, p < .01) in aggressive tendencies (β = .56, p < .001).

Regression analysis steps
1. Run the analysis in SPSS
2. Check the assumptions
3. Determine the magnitude and significance of R²
4. If R² is significant, determine the magnitude and significance of the regression coefficients (B, β)
5. Interpret R², B, β
6. Write up the results
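
Cross-checking the FCAT example
The FCAT numbers quoted on the slides can be cross-checked outside SPSS. What follows is a minimal sketch in Python, assuming NumPy is installed; the function name fit_simple_regression and all variable names are illustrative and not part of the course material. It fits the least-squares line to the ten schools, partitions the variability into SSREG and SSRES, and computes R² and F using the formulas from the slides. The printed values should land close to the ones reported above (small differences come from rounding B1 to -.618).

```python
import numpy as np

# FCAT example data from the slides:
# x = % free/reduced lunch (IV), y = 2001 FCAT math score (DV)
x = np.array([9.4, 57.2, 54.6, 15.7, 19.5, 38.5, 61.3, 66.3, 16.4, 14.9])
y = np.array([366, 302, 327, 330, 330, 335, 308, 318, 335, 340], dtype=float)


def fit_simple_regression(x, y):
    """Return the least-squares intercept (b0) and slope (b1) of y = b0 + b1 * x.

    Hypothetical helper for illustration; it uses the closed-form solution
    b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1 * xbar.
    """
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    b1 = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1


b0, b1 = fit_simple_regression(x, y)
y_pred = b0 + b1 * x              # Y': predicted score for each school
residuals = y - y_pred            # e_i = Y_i - Y'_i

# Partition the total variability around the mean of Y
ss_tot = np.sum((y - y.mean()) ** 2)       # total variation
ss_reg = np.sum((y_pred - y.mean()) ** 2)  # variation attributed to the IV
ss_res = np.sum(residuals ** 2)            # residual (unexplained) variation

r_squared = ss_reg / ss_tot

# F statistic with df_reg = k and df_res = N - k - 1 (k = 1 IV here)
k = 1
n = len(y)
f_stat = (ss_reg / k) / (ss_res / (n - k - 1))

print(f"B0 = {b0:.3f}, B1 = {b1:.3f}")
print(f"SSTOT = {ss_tot:.3f}, SSREG = {ss_reg:.3f}, SSRES = {ss_res:.3f}")
print(f"R2 = {r_squared:.3f}, F({k}, {n - k - 1}) = {f_stat:.2f}")
```

Because SSTOT = SSREG + SSRES holds for a least-squares fit with an intercept, comparing ss_tot with ss_reg + ss_res is a quick sanity check on any hand calculation.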