Linear regression
Lukáš Lehotský & Petr Ocelík
ESS401 Social Science Methodology / MEB431 Metodologie sociálních věd
6th February 2017

Outline
• Refresh: Pearson's r correlation
• (Simple) linear regression

Refresh: Pearson's r
• Pearson's product-moment correlation coefficient (r).
• Pearson's r measures the strength and direction of the linear relationship between two variables.
• Ranges within [-1, 1]:
  – Perfect positive linear relationship = 1
  – Perfect negative linear relationship = -1
  – No linear relationship = 0
• Its value does not depend on the variables' units.
• It is a sample (aggregative) statistic.

Pearson's r: assumptions
• Normal distribution of X and Y
  – Check with histograms and descriptive statistics.
• Linear relationship between X and Y
  – Check with a scatterplot and a histogram of residuals.
• Homoscedasticity
  – Checked in the same way as the linear relationship.
• Correlation = covariance / (product of the standard deviations of X and Y).

Association vs. causation
• Association does not imply causation! (xkcd.com/552/)

Correlation vs. causation
• X causes Y and Y causes X (bidirectional causation):
  – Democracies trade more; trade, in turn, increases democracy.
• Y causes X (reverse causation):
  – The more firemen are sent to a fire, the more damage is done.
• X and Y are consequences of a common cause:
  – There is a correlation between ice cream consumption and street criminality (both are more prevalent during summer).
• There is no connection between X and Y (coincidence):
  – Any number of meaningless "funny correlations".
• More examples here: http://tinyurl.com/85jfu6y

Models
• All models are wrong; some models are useful (Box 1976).
• Models (not only mathematical ones!) reduce and represent real-world phenomena.

Mean as a model
• The mean is the simplest statistical model: it predicts the same value for every observation.

Statistical models
• We need a mathematical function for statistical prediction. A function maps an input (values of the predictor variable) to an output (values of the outcome variable) according to a specific rule.
  Y = f(X); e.g. Y = 2*X: if X = 2, then Y = 4
• Different functions may be used for different relationships between quantities. [Figure: wikimedia commons]

(Linear) regression
• Regression is a statistical method used to predict scores on an outcome variable based on scores on one or more predictor variables.
• Linear regression: models a linear relationship.
• Bivariate (simple) linear regression: uses only one predictor variable.
• Multivariate (multiple) linear regression: uses more than one predictor variable.

Regression: terminology / notation
• X: cause; independent variable; predictor variable; explanatory variable
• Y: effect; dependent variable; outcome variable; response variable
• α, a, β0, B0: intercept / constant / "alpha"
• β, B, b, m: slope / coefficient / "Beta"
• ε, e: error / residual

Linear relationship
• A relationship where two variables are related in the first degree, i.e. the power of the variables is 1.
• A linear relationship is represented by the formula:
  outcome (dep. var.) = constant + coefficient*predictor + error
  Y = β0 + β1X + ε ; population regression function
  Y = a + bX + e ; sample regression function
  Y' = 0.785 + 0.425*X ; sample regression line (fitted in the example below)
• A linear relationship is graphically represented by a straight line.

Linear regression: assumptions
• Independence of observations (random sampling).
• Normal distribution of Y.
• Linear relationship between X and Y.
• Normal distribution of residuals.
• Homoscedasticity.
• Independence of residuals (over time).
• Applicable to metric level of measurement.
• Sensitive to outliers.
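A minimal R sketch of how these checks might be run in practice. The data frame df and the variables x and y are hypothetical, simulated here only for illustration:

  # Hypothetical, simulated data for illustration only
  set.seed(1)
  df <- data.frame(x = rnorm(50))
  df$y <- 1 + 0.5 * df$x + rnorm(50, sd = 0.5)

  hist(df$x)                                # distribution of X
  hist(df$y)                                # distribution of Y
  plot(df$x, df$y)                          # scatterplot: is the relationship roughly linear?
  cor(df$x, df$y)                           # Pearson's r
  cov(df$x, df$y) / (sd(df$x) * sd(df$y))   # the same value: covariance / product of SDs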
The standard normal distribution
• [Figure: Lehotský 2016; wikimedia commons; stats.stackexchange.com; statistics.leard.com]

Normal distribution of residuals
• [Figure: Draper & Smith 1998]

Independence of residuals
• [Figure: OriginLab 2015]

Fitting a straight line
• [Figure: fitting a straight line to the scatterplot, step by step]

Ordinary least squares
• Ordinary least squares (OLS): estimates the parameters (intercept and slope) of a linear regression model.
• It minimizes the squared vertical distances between the observations (Y) and the straight line (the predicted values Y').
• Residual = (Y - Y')
• ∑ (Y - Y') = 0 ; ∑ (Y - Y')^2 >= 0
• OLS chooses the line that minimizes ∑ (Y - Y')^2.
• Comparison of mean and OLS estimation. [Figure: wikimedia commons]

Linear regression: example
• Assume we have two variables, X and Y.
• To what extent does X explain Y?

  X    Y
  1    1
  2    2
  3    1.3
  4    3.75
  5    2.25

• Statistics for calculating the regression line:

  m(X)   m(Y)   s(X)    s(Y)    r(X, Y)
  3      2.06   1.581   1.072   0.627

• The slope (b): r(X, Y) * (s(Y) / s(X)) ; equivalently ∑(x – m(X))*(y – m(Y)) / ∑(x – m(X))^2
• The intercept (a): m(Y) – b*m(X)
• b = 0.627 * 1.072 / 1.581 = 0.425
• a = 2.06 – 0.425 * 3 = 0.785
• Fitting a straight line by using OLS. [Figure: total / unexplained / explained variation]
• Residual: the difference between the observed value Y and the predicted value Y'.

  X    Y      Y'      Y – Y'    (Y – Y')^2
  1    1      1.210   -0.210    0.044
  2    2      1.635    0.365    0.133
  3    1.3    2.060   -0.760    0.578
  4    3.75   2.485    1.265    1.600
  5    2.25   2.910   -0.660    0.436
  sum                  0.000    2.791

• A model is a representation of the relationship between variables. The linear regression model predicts (models) values of Y based on values of X.
• The model is represented by a formula in the form of a linear equation: Y = a + bX + e; the predicted values are Y' = a + bX.
• Model in the example: Y' = 0.785 + 0.425*X (with ∑(Y – Y')^2 = 2.791).
• R command: lm()

Linear regression: interpretation
• Model in the example: Y' = 0.785 + 0.425*X
• Intercept: the value of Y when X = 0.
• Slope: the change in Y when X increases by 1 unit.
• Error: the unexplained variance of Y.
• What is Y' for X = 2?
  Y' = 0.785 + (0.425)*2
  Y' = 0.785 + 0.850 = 1.635
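A short R sketch that reproduces this example with lm(). The names x, y, and fit are arbitrary; the data are the five points from the table above:

  x <- c(1, 2, 3, 4, 5)
  y <- c(1, 2, 1.3, 3.75, 2.25)
  fit <- lm(y ~ x)                            # OLS fit of the simple linear regression
  coef(fit)                                   # intercept ≈ 0.785, slope ≈ 0.425
  fitted(fit)                                 # predicted values Y'
  resid(fit)                                  # residuals Y - Y'
  sum(resid(fit)^2)                           # sum of squared residuals ≈ 2.791
  predict(fit, newdata = data.frame(x = 2))   # Y' for X = 2 ≈ 1.635
  summary(fit)$r.squared                      # R^2 ≈ 0.39 (see the next section)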
Coefficient of determination
• Unexplained variation = the difference between the observed values of Y and the predicted values Y' (the regression line) = sum of squares of the residuals (SSR).
• Explained variation = the difference between the predicted values Y' and the mean of Y = sum of squares of the model (SSM).
• Total variation = the difference between the observed values of Y and the mean of Y = SSM + SSR = total sum of squares (SST).
• SST = SSM (explained variation) + SSR (unexplained variation).
• The coefficient of determination (R^2) is the proportion of the explained variation (SSM) in the total variation of Y (SST): R^2 = SSM / SST.

Coefficient of determination: example

  Y'      m(Y)   Y' – m(Y)   (Y' – m(Y))^2
  1.210   2.06   -0.850      0.72
  1.635   2.06   -0.425      0.18
  2.060   2.06    0.000      0.00
  2.485   2.06    0.425      0.18
  2.910   2.06    0.850      0.72
  sum (SSM)                  1.81

• SSR = ∑(Y – Y')^2 = 2.791 (from the residual table above).
• SST = SSM + SSR = 1.81 + 2.791 = 4.60
• R^2 = SSM / SST = 1.81 / 4.60 = 0.39 = 39 %

Rationale for multiple regression
• But what if the outcome variable is influenced by more than one predictor variable?
• (This is practically always the case...)
• E.g.: income can be predicted by completed years of education and gender.

Idea of statistical control

Statistical control: confounding effect
• Confounding effect: a third variable affects the relationship between the predictor(s) and the outcome variable.
• A confounder is a variable that correlates both with the predictor(s) and with the outcome variable.
• E.g.: the relationship between income (predictor) and risk of heart attack (outcome) may be confounded by age (confounder) (Wu 2010).

Multiple regression: assumptions
• Independence of observations (random sampling)
• Normal distribution of Y
• Linear relationship between X and Y
• Normal distribution of residuals
• Homoscedasticity (the variance of the error is constant)
• Independence of residuals (over time)
• No high collinearity between predictors

Collinearity
• Collinearity (multicollinearity) = two or more predictors are correlated.
• Compare r(X, Z) = 0 (the predictors X and Z are unrelated) with r(X, Z) > 0.9 (the predictors largely overlap in what they explain about Y).
• A correlation matrix of the IVs is a simple diagnostic.

Multiple linear relationship
• We add further coefficient*predictor terms to the formula:
  outcome (dependent variable) = constant + coefficient1*predictor1 + coefficient2*predictor2 + error
  Y = β0 + β1X1 + β2X2 + ε ; population regression function
  Y = a + b1X1 + b2X2 + e ; sample regression function
  Y' = 0.785 + 0.425*X1 + 0.132*X2 ; sample regression line (illustrative values)

Fitting a plane
• [Figure: www.ck12.org]

Slope in multiple regression
• The slope gives the change in the outcome variable for a one-unit change in the predictor, while controlling for the other predictors in the model.
• E.g.: what is the effect of education (predictor) on income (outcome variable) when we control for age (predictor)?
  income <- 6000 + 500*education + 100*age
• Interpretation: for each one-unit change in education (e.g. one year), income changes on average by 500 units (i.e. 500 Kč), provided age does not change. (See the R sketch below.)

Interpretations
• If the coefficients are statistically significant:
  – If X and Z are uncorrelated → reduction to the bivariate slopes (X and Z are independent of each other).
  – If X correlates with Y more strongly than Z → the effect of X is stronger (while controlling for Z).
  – If Z correlates with Y more strongly than X → the effect of Z is stronger (while controlling for X).
  – If X and Z are (almost) perfectly correlated → the denominator is close to 0 and the resulting values approach infinity (non-interpretable) → the problem of collinearity (reduction to one variable).
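A minimal R sketch of a multiple regression with two predictors, loosely following the education/age/income illustration above. The data are simulated, and the coefficients 6000, 500, and 100 are the illustrative values from the slide, not estimates from real data:

  # Simulated data for illustration only
  set.seed(2)
  n <- 200
  education <- rnorm(n, mean = 14, sd = 2)      # years of completed education
  age <- rnorm(n, mean = 40, sd = 10)           # age in years
  income <- 6000 + 500 * education + 100 * age + rnorm(n, sd = 2000)

  fit2 <- lm(income ~ education + age)   # multiple linear regression
  summary(fit2)                          # each slope is interpreted holding the other predictor constant
  cor(cbind(education, age))             # simple collinearity diagnostic: correlation matrix of the predictors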
Conclusions
• Linear regression allows us to go beyond measuring associations:
  – Prediction
  – Statistical control
• Models are always imprecise!
  – They involve reduction as well as measurement.
• Extensions of the regression framework:
  – Logistic regression (binary categorical outcome variable)
  – Multinomial logistic regression (multiple-category outcome variable)
  – Ordinal regression (ordinal outcome variable)
  – etc.
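For the binary-outcome extension mentioned above, the analogous R call uses glm() with a binomial family. A sketch with hypothetical, simulated variables (vote, education, age), only to show the form of the call:

  # Hypothetical, simulated binary outcome for illustration only
  set.seed(3)
  n <- 200
  education <- rnorm(n, mean = 14, sd = 2)
  age <- rnorm(n, mean = 40, sd = 10)
  p <- plogis(-8 + 0.4 * education + 0.05 * age)   # probability of the outcome
  vote <- rbinom(n, size = 1, prob = p)            # binary (0/1) outcome

  fit3 <- glm(vote ~ education + age, family = binomial)   # logistic regression
  summary(fit3)                                            # coefficients are on the log-odds scale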