Correlation and linear regression
Petr Ocelík
ESS401 Social Science Methodology
20th October 2015

Outline
• Correlation
• Simple linear regression
• Correlation and linear regression in R

Correlation
• Pearson's product-moment correlation coefficient (r).
• Correlation measures the strength of the linear relationship between two variables.
• Ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation).
• 0 indicates no systematic linear relationship between the variables.
• Its value does not depend on the variables' units.
• It is a sample statistic.

Correlation
• Assumptions and limitations:
  – Normal distribution of X and Y.
  – Linear relationship between X and Y.
  – Homoscedasticity.
  – Sensitive to outliers.

The standard normal distribution (figure)

Correlation
• Normal distribution of X and Y:
  – Histograms and descriptive statistics.
• Linear relationship between X and Y:
  – Scatterplot.
  – Histogram of residuals.
• Homoscedasticity:
  – Same checks as for the linear relationship.

Correlation vs. causation
• Correlation does not imply causation.
• Correlation is a necessary but not a sufficient condition for causation.

Correlation vs. causation
• General patterns:
  – X causes Y and Y causes X (bidirectional causation):
    • Democracies trade more, therefore trade increases democracy.
  – Y causes X (reverse causation):
    • The more firefighters are sent to a fire, the more damage is done.
  – X and Y are consequences of a common cause:
    • There is a correlation between ice cream consumption and street criminality (both are more prevalent during summer).
  – There is no connection between X and Y (coincidence):
    • Any number of meaningless "funny correlations".

Correlation: example
• Assume we have two variables: X and Y.
• What is the correlation (r) of these two variables?

X   Y
1   0
2   1
1   4
6   8
7   4

• Correlation = covariance divided by the product of the standard deviations of X and Y (i.e. the square root of the product of their variances).

• First: we calculate the variance of each variable.
• mean(X) = 3.4; mean(Y) = 3.4
• R command: var()
• s^2(X) = 33.2 / 4 = 8.3; s^2(Y) = 39.2 / 4 = 9.8

X     (x – m)      dev.    dev.^2
1     (1 – 3.4)    -2.4     5.76
2     (2 – 3.4)    -1.4     1.96
1     (1 – 3.4)    -2.4     5.76
6     (6 – 3.4)     2.6     6.76
7     (7 – 3.4)     3.6    12.96
sum                 0      33.2

Y     (y – m)      dev.    dev.^2
0     (0 – 3.4)    -3.4    11.56
1     (1 – 3.4)    -2.4     5.76
4     (4 – 3.4)     0.6     0.36
8     (8 – 3.4)     4.6    21.16
4     (4 – 3.4)     0.6     0.36
sum                 0      39.2

• Second: we calculate the covariance of the variables.
• Covariance is the sum of the cross-products of the deviations of the two variables, divided by n – 1.
• cov(X, Y) = 24.2 / 4 = 6.05
• R command: cov()

(x – m)      (y – m)      cross-prod.
(1 – 3.4)    (0 – 3.4)      8.16
(2 – 3.4)    (1 – 3.4)      3.36
(1 – 3.4)    (4 – 3.4)     -1.44
(6 – 3.4)    (8 – 3.4)     11.96
(7 – 3.4)    (4 – 3.4)      2.16
sum  0        0            24.2

• Third: we divide the covariance of X and Y by the square root of the product of the variances of X and Y.
  – r = cov(X, Y) / sqrt(var(X) * var(Y))
  – r = 6.05 / sqrt(8.3 * 9.8) = 0.67
  – R command: cor()
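The same calculation can be reproduced step by step in R. The following is a minimal sketch using the example data from this slide; the object names x and y are only illustrative:

x <- c(1, 2, 1, 6, 7)   # example variable X
y <- c(0, 1, 4, 8, 4)   # example variable Y

# Step 1: sample variances (sum of squared deviations divided by n - 1)
var(x)   # 8.3
var(y)   # 9.8

# Step 2: sample covariance (sum of cross-products of deviations divided by n - 1)
cov(x, y)   # 6.05

# Step 3: correlation = covariance / sqrt(product of the variances)
cov(x, y) / sqrt(var(x) * var(y))   # ~0.67
cor(x, y)                           # same value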
(Linear) regression
• Regression is a statistical method used to predict scores on an outcome variable based on scores on one or more predictor variables.
• Linear regression: models a linear relationship.
• Bivariate (simple) linear regression: uses only one predictor variable.
• Multivariate (multiple) linear regression: uses more than one predictor variable.

Regression: terminology / notation

X                       Y
cause                   effect
independent variable    dependent variable
predictor variable      outcome variable
explanatory variable    response variable

intercept (constant, alpha)    α, a, β0, B0
slope (coefficient, beta)      β, B, b, m
error / residual               ε, e

Linear regression: assumptions
• Independence of observations (random sampling).
• Normal distribution of Y.
• Linear relationship between X and Y.
• Normal distribution of residuals.
• Homoscedasticity.
• Independence of residuals (over time).
• Applicable to continuous variables.
• Sensitive to outliers.

Normal distribution of residuals (figure; Draper & Smith 1998)

Independence of residuals (figure; OriginLab 2015)

Linear relationship
• A relationship where two variables are related in the first degree, i.e. the power of the variables is 1.
• A linear relationship is represented by the formula Y = a + bX:
  – Y = β0 + β1X + ε ; population regression function
  – Y = a + bX + e ; sample regression function
  – Y' = 0.785 + 0.425*X ; sample regression line (fitted in the example below)
• A linear relationship is graphically represented by a straight line.

Fitting a straight line (series of figures)

Ordinary least squares
• Ordinary least squares (OLS) estimates the parameters (intercept and slope) of a linear regression model.
• It minimizes the squared vertical distances between the observations (Y) and the straight line (the predicted values Y').
• Residual = Y – Y'
• ∑ (Y – Y') = 0 ; ∑ (Y – Y')^2 >= 0
• OLS chooses the line that minimizes ∑ (Y – Y')^2.

Ordinary least squares (figures: comparison of mean and OLS estimation)

Linear regression: example
• Assume we have two variables: X and Y.
• To what extent does X explain Y?

X   Y
1   1
2   2
3   1.3
4   3.75
5   2.25

Linear regression: example
• Statistics for calculating the regression line:
  – The slope (b): r(X, Y) * s(Y) / s(X)
  – The intercept (a): m(Y) – b*m(X)
• b = 0.627 * 1.072 / 1.581 = 0.425
• a = 2.06 – 0.425 * 3 = 0.785

m(X)   m(Y)   s(X)    s(Y)    r(X, Y)
3      2.06   1.581   1.072   0.627

Linear regression: example
• Fitting a straight line by using OLS (figure).

Total / unexplained / explained variation (figure)

Linear regression: example
• Residual: the difference between the observed value Y and the predicted value Y'.

X    Y      Y'      Y – Y'    (Y – Y')^2
1    1      1.210   -0.210    0.044
2    2      1.635    0.365    0.133
3    1.3    2.060   -0.760    0.578
4    3.75   2.485    1.265    1.600
5    2.25   2.910   -0.660    0.436
sum                  0        2.791

Linear regression: example
• A model is a representation of the relationship between variables. A linear regression model predicts (models) the values of Y based on the values of X.
• The model is represented by a formula in the form of a linear equation: Y = a + bX + e, with predicted values Y' = a + bX.
• Model in the example: Y' = 0.785 + 0.425*X, with a residual sum of squares of 2.791.
• R command: lm()

Linear regression: interpretation
• Model in the example: Y' = 0.785 + 0.425*X
• Intercept: the predicted value of Y when X = 0.
• Slope: the change in the predicted value of Y when X increases by 1 unit.
• Error: the unexplained variation in Y.
• What is Y' for X = 2?
  – Y' = 0.785 + 0.425*2
  – Y' = 0.785 + 0.850 = 1.635
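The example fit can be reproduced in R as well. The sketch below follows the hand calculation from the slides and then checks it with lm(); the object names x, y, and fit are only illustrative:

x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 1.3, 3.75, 2.25)

# Slope and intercept from the summary statistics
b <- cor(x, y) * sd(y) / sd(x)   # 0.425
a <- mean(y) - b * mean(x)       # 0.785

# The same parameters estimated by OLS
fit <- lm(y ~ x)
coef(fit)        # intercept ~0.785, slope ~0.425
residuals(fit)   # Y - Y' for each observation

# Predicted value Y' for X = 2
predict(fit, newdata = data.frame(x = 2))   # ~1.635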
Coefficient of determination
• The coefficient of determination (R^2) is the proportion of the explained variation in Y (SSM) to the total variation in Y (SST): R^2 = SSM / SST.
• SST = SSM (explained variation) + SSR (unexplained variation).

Coefficient of determination
• Unexplained variation = the differences between the observed values Y and the predicted values Y' (the regression line) = sum of squares of the residuals (SSR).
• Explained variation = the differences between the predicted values Y' and the mean of Y = sum of squares of the model (SSM).
• Total variation = the differences between the observed values Y and the mean of Y = SSM + SSR = total sum of squares (SST).
• Explained variation (%) = SSM / SST = coefficient of determination = R^2.

Coefficient of determination: example
• SST = SSM + SSR = 1.81 + 2.79 = 4.60
• R^2 = SSM / SST = 1.81 / 4.60 = 0.39 = 39 %

Y'      mean Y   (Y' – mY)   (Y' – mY)^2
1.210   2.06     -0.850      0.72
1.635   2.06     -0.425      0.18
2.060   2.06      0          0
2.485   2.06      0.425      0.18
2.910   2.06      0.850      0.72
sum (SSM)                    1.81

Y      Y'      Y – Y'    (Y – Y')^2
1      1.210   -0.210    0.044
2      1.635    0.365    0.133
1.3    2.060   -0.760    0.578
3.75   2.485    1.265    1.600
2.25   2.910   -0.660    0.436
sum (SSR)                2.791
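The decomposition can be checked in R with a few lines. This is a small sketch on the same example data; it recomputes SSR, SSM, and SST and compares the result with the R^2 reported for the fitted model (object names are only illustrative):

x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 1.3, 3.75, 2.25)
fit <- lm(y ~ x)
y_hat <- fitted(fit)   # predicted values Y'

SSR <- sum((y - y_hat)^2)         # unexplained variation, ~2.79
SSM <- sum((y_hat - mean(y))^2)   # explained variation, ~1.81
SST <- SSM + SSR                  # total variation, ~4.60

SSM / SST                # ~0.39
summary(fit)$r.squared   # same value from the fitted model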