Regression basics
Lukˊaˇs Laﬀˊers
Matej Bel University, Dept. of Mathematics
MUNI Brno
30.9.2021
Organization
six meetings per 4hrs: 30.9., 1.10., 11.11., 12.11., 2.12., 3.12.
three assignments/tests (per 20%) + written exam (40%)
lukas.laffers@gmail.com
Some math tools
random variable, expectation, variance, covariance
probability density function, cumulative density function
conditional expectation
law of large numbers
central limit theorem
New to R?
Here are some resources
W. N. Venables, D. M. Smith and the R Core Team. ”An Introduction to
R”
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
Wickham, Hadley, and Garrett Grolemund. R for data science: import,
tidy, transform, visualize, and model data. ” O’Reilly Media, Inc.”, 2016.
https://r4ds.had.co.nz
Wickham, Hadley. ”Elegant graphics for data analysis.” Springer.
https://ggplot2-book.org
Today
assumptions of the regression model
geometry of linear squares
conﬁdence intervals
connection to t-test
stars and p-values - what do they mean
purpose: prediction vs explanation
correlated variables
weighted regression
heteroskedasticity
model selection: bias-variance trade-oﬀ
some practical considerations
Linear regression
Regression - What is it good for?
Prediction
(What will happen?)
easier, at least for repeated events
Explanation
(Why did something happen?)
diﬃcult, requires deep institutional
knowledge and subject matter
expertise.
These are fundamentally very diﬀerent objectives.
Regression
Prediction - easier, at least for repeated events.
Explanation - diﬃcult, requires deep institutional knowledge and
subject matter expertise.
Data
Data
x- independent variable,
regressor, factor, feature
y - outcome, dependent
variable
Least squares
Least squares
Notation - Univariate regression
y = β0 +β1x +ε
or
yi = β0 +β1xi +εi
rewritten in matrix form
y = Xβ +ε
y =




y1
y2
...
yn



, X =




1 x1
1 x2
...
1 xn



, β =
β0
β1
, ε =




ε1
ε2
...
εn



.
Basic linear regression
Notation - Multivariate regression
y = β0 +β1x1 +β2x2 +...+βpxp +ε
or
yi = β0 +β1xi +β1xip +...+βpxip +εi
rewritten in matrix form
y = Xβ +ε
y =




y1
y2
...
yn



, X =




1 x11 x12 ... x1p
1 x21 x22 ... x2p
... ...
1 xn1 xn2 ... xnp



, β =






β0
β1
β2
...
βp






, ε =




ε1
ε2
...
εn



.
Model - philosophical interlude
The purpose of the model is to be useful.
It is not to be correct.
Model should be right about the relevant parts.
It simpliﬁes the parts that it does not aim to model.
Ingredients:
y,x1,x2,..., - observed random variables
ε - unobserved random variables
β0,β1,.... - unknown variables we wish to estimate
y = β0 +β1x1 +β2x2 +...+βpxp +ε - assumption about how the diﬀerent
quantities are related to each other
Minimize sum of squares:
n
∑
i=1
(yi −Xiβ)2
= εT
ε = (y −Xβ)T
(y −Xβ)
diﬀerentiating via β, we get:
XT
X ˆβ = XT
y,
if XT
X is invertible, we get
ˆβ = (XT
X)−1
XT
H
y = Hy
Back to the model
Ingredients:
y,x1,x2,..., - observed random variables
ε - unobserved random variables
β0,β1,.... - unknown (ﬁxed!) variables we wish to estimate
y = β0 +β1x1 +β2x2 +...+βpxp +ε - assumption about how the diﬀerent
quantities are related to each other
ˆβ = (ˆβ0, ˆβ1,...., ˆβp)T
- estimator = vector of random variables.
It is a function of our data sample, which is random.
Our best attempt to recover the true unknown β.
Diﬀerent estimators possess diﬀerent qualities (bias, variance,
robustness). OLS estimator is just one of them (but pretty good).
Linear model = simple
linear model can model non-linear relationships
linear = linear in parameters
Also a linear model
log(y) = β0 +β1x +β2x2
+ε
y = β0xβ1
ε → log(y) = log(β0)+β1 log(x)+log(ε) = β∗
0 +β∗
1 x∗
+ε∗
Goodness of ﬁt measure
R2
= 1 −
∑(yi − ˆyi)2
∑(yi − ¯y)2
= 1 −
RSS
TSS
=
ESS
TSS
=
∑(¯y − ˆyi)2
∑(yi − ¯y)2
,
RSS residual sum of squares
TSS total sum of squares
R2
= (cor(ˆy,y))2
=
∑(¯y − ˆyi)2
∑(yi − ¯yi)2
.
R2
= 0.65 (for a simple linear model)
Geometry: It is a simple projection.
Goodness of ﬁt?
Prediction If you wish to predict well, you’d better explain the variation
in y.
Explanation Not necessary a problem if you have a small R2
.
Projection
ˆy = X ˆβ = X XT
X
−1
XT
y
ˆβ
= X XT
X
−1
XT
P ≡ projection matrix
y = Py
ˆε = y −X ˆβ = (I −P)
M ≡ residual maker matrix
y = My
symmetric PT
= P and MT
= M
idempotent PP = P and MM = M
ˆy ⊥ ˆε
Diﬀerent qualities Ordinary Least Squares estimator
geometric interpretation
easy analytic formula
ˆβ = (XT
X)−1
XT
y
among the linear, unbiased estimator it has the lowest variance
(Gauss-Markov theorem)
y = Xβ +ε
E[ε|X] = 0
Var[ε|X] = σ2
In
=⇒ For any unbiased estimator ˜β of β we have var[˜β|X] ≥ σ2
(XT
X)−1
Maximum Likelihood Estimator under normal errors (we will discuss
latter)
Stastical inference
Assume that errors are normally distributed:
ε ∼ N(0,σ2
I) + y = Xβ +ε =⇒ y ∼ N(Xβ,σ2
I)
So we get that*
ˆβ = (XT
X)−1
XT
y ∼ N β,(XT
X)−1
σ2
∗
Var(Ay) = AVar(y)AT
=⇒ Var(ˆβ) = (XT
X)−1
XT
(σ2
I)((XT
X)−1
XT
)T
= (XT
X)−1
σ2
Hypothesis tests
H0 : βi = 0 can be tested:
ti =
ˆβi
se(ˆβi)
∼ tn−p.
where
se(ˆβi) is the standard error - sq. root of the diagonal of the matrix
(XT
X)−1 ˆσ2
ˆσ2
= 1
n−p ∑i ˆε2
i
RSS = ∑i ˆε2
i is the residual sum of squares
tn−p is the Student’s t-distribution with n −p degrees of freedom
Hypothesis tests
H0 : βi=βj=0 can be tested:
F =
(RSSω −RSSΩ)/(2)
RSSΩ/(n −p)
∼ F2,n−p.
where
Ω denotes a large model with p parameters
ω denotes a small model with p−2 paramaters (a special case of Ω, the
two models are nested)
Conﬁdence intervals
CIα
= [ˆβi −t
(α/2)
n−p se(ˆβi), ˆβi +t
(α/2)
n−p se(ˆβi)]
Parameter is a ﬁxed value a conﬁdence interval is random interval.
CIs for expected/future values
ˆy0 = xT
0
ˆβ
CI for expected value
ˆy0 ±t
(α/2)
n−p ˆσ xT
0 (XT X)−1x0
CI for future value
ˆy0 ±t
(α/2)
n−p ˆσ 1 +xT
0 (XT X)−1x0
why ”1+” ?
var(ˆy0 +ε0) = var(ˆy0)+var(ε0) = xT
0 (XT
X)−1
x0σ2
+σ2
=
(1+xT
0 (XT
X)−1
x0)σ2
CIs for expected/future values
: Returns to education.
log(wage) = β0 +β1education+β2age+ε
an increase of education by one year increases wage by β1 ·100% percent.
our model predicts, that for individuals of the same age, an increase of
education by 1 extra year is associated with an increase of wage by β1 ·100%
percent.
Be careful with causal interpretations based on observational data.
Interpretation of parameters
y = β0 +β1x1 +β2x2 +ε
log(y) = β0 +β1x1 +β2x2 +ε
y = β0 +β1 log(x1)+β2x2 +ε
log(y) = β0 +β1 log(x1)+β2x2 +ε
y = β0 +β1x1 +β2x2 +ε
[x1 → x1 +δ] =⇒ [y → y +β1δ]
our model predicts, that an increase of x1 by one unit is associated with an
increase in y by β1 units, if the x2 will not change.
log(y) = β0 +β1x1 +β2x2 +ε
[x1 → x1 +δ] =⇒ [y → exp(β0 +β1(x1 +δ)+β2x2 +ε)) =
y ·exp(β1δ) ≈ y(1 +β1δ)]
our model predicts, that an increase in x1by one unit is associated with an
increase in y by approximately β1 ·100%, if x2 will not change.
y = β0 +β1 log(x1)+β2x2 +ε
[x1 → x1 ·(1 +δ)] =⇒ [y → y +β1 log(1 +δ) ≈ y +β1δ]
our model predicts, that an increase in x1 by 1% is associated with an increase in
y by approximately β1/100 units, if x2 will not change.
log(y) = β0 +β1 log(x1)+β2x2 +ε
[x1 → x1 ·(1 +δ)] =⇒ [y → y ·exp(β1 log(1 +δ)) = y ·(1 +δ)β1 ≈ y ·(1 +β1δ)]
our model predicts, that an increase in x1 by 1% is associated with an increase in
y by approximately β1%, if x2 will not change.
Interaction terms
log(wage) = β0 +β1educ +β2south +β3south ·educ +ε
log(wage) = β0 +β1educ +ε
log(wage) = β0 +β2 +(β1 +β3)educ +ε
Collinearity
Why is it a problem?
Estimators have a high variance
numerically unstable
How to detect it
(1) Look at the correlation matrix of regressors and look for numbers close
to +1 or -1.
(2) Running a regression of xi on other regressors, measure of linear ﬁt R2
i
is close to 1.
(3) Sort eigenvalues of XT
X,λ1 ≥ ··· ≥ λp. Condition number κ = λ1
λp
≥ 30
indicates problems.
How to quantify the eﬀect of it
var(ˆβj) = σ2 1
1 −R2
j
1
∑n
i=1(xij − ¯xj)2
,
Collinearity
These two models
y = β0 +β1x1 +ε
y = β0 +β1x1 +β2x2 +ε
may lead to a completely diﬀerent estimates of β1.
: Pearl’s Simpson Machine.
http://dagitty.net/learn/simpson/index.html
Simpson’s Simpson’s paradox
Ommitted variable bias
log(wage) = β0 +β1education+β2age+ε
log(wage) = β∗
0 +β∗
1 education+β∗
2 age+β∗
3 ability+ε
ability = γ0 +γ1education+ε
log(wage) = (β∗
0 +β∗
3 γ0)+(β∗
1 +β∗
3 γ1)
β1
education+β∗
2 age+(β∗
3 ε +ε)
Does not matter as long as ability is either
β∗
3 = 0 - irrelevant
γ1 = 0 - uncorrelated with education
More on errors: Heteroskedasticity
yi = 1 +3xi +xi ·ε
The larger |xi| the larger
the error
But we typically assume
σ2
i = const
So far we have assumed:
var(ε) = σ2
I =





σ2
0 ... 0
0 σ2
... 0
...
...
0 0 ... σ2





.
But what if
var(ε) = σ2
Σ?
Σ = SST
.
Transform back
y = Xβ +ε
S−1
y = S−1
Xβ +S−1
ε
y = X β +ε
Variance of the new transformed errors ε is
var(ε ) = var(S−1
ε) = S−1
var(ε)S−T
= σ2
I.
We apply OLS on the transformed data S−1
y a S−1
X.
We minimize
(y −X β)T
(y −X β) = (y −Xβ)T
Σ−1
(y −Xβ),
which is solved by
ˆβW = (XT
Σ−1
X)−1
XT
Σ−1
y.
The variance of this estimator is
var(ˆβW ) = σ2
(XT
Σ−1
X)−1
.
var(ε) =





σ2
1 0 ... 0
0 σ2
2 ... 0
...
...
0 0 ... σ2
n





=





1
w1
0 ... 0
0 1
w2
... 0
...
...
0 0 ... 1
wn





.
S =






1√
w1
0 ... 0
0 1√
w2
... 0
...
...
0 0 ... 1√
wn






S−1
=





√
w1 0 ... 0
0
√
w2 ... 0
...
...
0 0 ...
√
wn





.
Examples:
If var(εi) ∝ xi, we make use of wi = x−1
i .
If yi are averages based on ni obs, under LLN this is proportional to 1/ni.
Hence var(yi) = var(εi) = σ2
/ni, so wi = ni. Example: average wage in
diﬀerent countries.
In general we set wi = 1/var(yi).
The covariance matrix is generally unknown, it may be estimated in diﬀerent
ways : HR standard errors :
HC0: ˆVˆβ
= (XT
X)−1
∑i XiXT
i
ˆε2
i (XT
X)−1
HC1: ˆVˆβ
= n
n−k
(XT
X)−1
∑i XiXT
i
ˆε2
i (XT
X)−1
HC2: ˆVˆβ
= (XT
X)−1
∑i XiXT
i
¯¯ε2
i (XT
X)−1
HC3: ˆVˆβ
= (XT
X)−1
∑i XiXT
i
˜ε2
i (XT
X)−1
where (see Hansen 4.10 Residuals)
ˆεi = yi −XT
i
ˆβ is a vector of residuals
¯¯ε = (1 −hii)−1/2ˆε is a vector of standardized residuals
hii is the leverage: the diagonal element of residual maker matrix M
˜ε = yi −XT
i
ˆβ(−i) = (1 −hii)−1ˆε is a vector prediction error
ˆβ(−i) is estimated without i-th observation.
Weighting?
How do we deal with heteroskedasticity?
Calculate ˆβW = (XT ˆΣ−1
X)−1
XT ˆΣ−1
y for some estimate ˆΣ
Calculate ˆβ = (XT
X)−1
XT
y and then use ˆΣ to adjust for standard errors
Angrist and Pischke (2008, section 3.4.1) argue for the second approach.
More eﬃcient if ˆΣ is estimated well.
But this requires good model for
E[ε2
|X]
Estimator for E[ε2
|X] may have bad
ﬁnite sample properties.
Eﬃciency gains from using ˆβW are
typically modest.
In case that E[yi|Xi] is not linear, the
unweighted ˆβ estimates are at least
the best linear minimum mean
squared error.
Clustered standard errors
G groups. Within these groups, errors are allowed to be correlated.
But not between the groups
Σ =




















Σ1



0 ... 0
...
0 ... 0


 ...



0 ... 0
...
0 ... 0






0 ... 0
...
0 ... 0


 Σ2 ...



0 ... 0
...
0 ... 0



...
...



0 ... 0
...
0 ... 0






0 ... 0
...
0 ... 0


 ... ΣG




















When should you cluster your standard errors
Assignment of a treatment is not on an individual level (but on a
diﬀerent level, say school level, town level etc)
Sample is not random but ﬁrst clusters are sampled and then
observations within clusters are sampled.
Diagnostics
Correct model Incorrect model
Model Selection - Occam’s razor
’Among competing hypotheses, the one with the fewest assumptions should
be selected.’
John Punch 1639: ’Entities must not be multiplied beyond necessity’
Aristoteles: ’We may assume the superiority ceteris paribus [other things
being equal] of the demonstration which derives from fewer postulates or
hypotheses.’
Ptolemaus: ’We consider it a good principle to explain the phenomena by
the simplest hypothesis possible.’
Madhva: ’To make two suppositions when one is enough is to err by way of
excessive supposition’
Isaac Newton: ’We are to admit no more causes of natural things than such
as are both true and suﬃcient to explain their appearances. Therefore, to
the same natural eﬀects we must, as far as possible, assign the same causes.’
Small model vs Large model
If we wish to predict, we may prefer larger model even if the smaller
one is more parsimonious.
When explaining, we prefer smaller models.
Automatic model selection based on p-values
We add regressors with smallest p-values, until some threshold is met.
We remove regressors with the largest p-values, until some threshold is
met.
The use of any automatic model selection tool is very risky.
p-values are not valid as they are the result of multiple testing. Results
look better than reality.
removal = no association
model selection tools cannot replace deep subject matter expertise :
Model building
If you have too many regressors relative to the sample size (eg. p=50,
n=100).
You may ﬁnd some patterns in the data just by chance!
: Model selection
Freedman, David A., and David A. Freedman. ”A note on screening
regression equations.” The American Statistician 37.2 (1983): 152-155.
IC = −([FIT ]−[COMPLEXITY])
Akkaike Information criterion (AIC)
AIC = −2L(ˆθ)+2p,
we prefer models with a small AIC.
An alternative to AIC is Bayes Information Criterion (BIC)
BIC = −2L(ˆθ)+2p log(n)
it penalizes large (more comples) models more.
Suppose you have a set of competing models with a similar ﬁt
Do they give qualitatively similar results?
Do they predict similarly?
How diﬃcult/expensive is data collection?
Are the model assumptions satisﬁed? (diagnostic graphs)
If we are in a situation, where models that ﬁt data
similarly well lead to very diﬀerent results, it may be
that we cannot answer the question of interest.
It is intelectually honest to appreciate this
uncertainty, no matter how inconvenient/impractical
it may be.
Model choice should not be data driven only (say looking at the statistical
signiﬁcance).
Subject matter expertise is always appreciated.
Regression and t-test connection
yA ∼ N(µA,σ2
)
yB ∼ N(µB,σ2
)
H0 : µA = µB vs H1 : µA = µB
T =
(¯yA − ¯yB)−(µA − µB)
Sp
1
nA
+ 1
nB
∼ tnA+nB−2,
where S2
p =
(nA−1)S2
A+(nB−1)S2
B
nA+nB−2
a S2
A a S2
B are sample variances.
yi = βAdiA +βBdiB +ε
yA ∼ N(βA,σ2
) and yB ∼ N(βB,σ2
).
These are identical assumptions to the t-test.
Linear regression with dummy variables is the same thing as a t-test.
Statistical signiﬁcance
Statistical vs. Practical signiﬁcance (a.k.a. is the eﬀect economically
meaningful)
More on Statistical signiﬁcance
Statistical signiﬁcance - p-values and conﬁdence intervals
What they are not!
Greenland, Sander, et al. ”Statistical tests, P values, conﬁdence intervals, and
power: a guide to misinterpretations.” European journal of epidemiology
31.4 (2016): 337-350.
Statement of American Statistical Association:
Wasserstein, Ronald L., and Nicole A. Lazar. ”The ASA statement on
p-values: context, process, and purpose.” (2016): 129-133.
P-values: Common misconceptions
The P value for the null hypothesis is the probability that chance alone
produced the observed association; for example, if the P value for the
null hypothesis is 0.08, there is an 8 % probability that chance alone
produced the association.
No!
The P value is the probability that the test hypothesis is true; for example,
if a test of the null hypothesis gave P = 0.01, the null hypothesis
has only a 1 % chance of being true; if instead it gave P = 0.40, the null
hypothesis has a 40 % chance of being true.
Nope!
A signiﬁcant test result (P ≤ 0.05) means that the test hypothesis is false
or should be rejected.
Also no!
There are 15 more variations in Greenland et al. (2016)
Conﬁdence intervals: Common misconceptions
The speciﬁc 95 % conﬁdence interval presented by a study has a 95 %
chance of containing the true eﬀect size.
No!
An eﬀect size outside the 95 % conﬁdence interval has been refuted (or
excluded) by the data.
Not this one!
If two conﬁdence intervals overlap, the diﬀerence between two estimates
or studies is not signiﬁcant.
Still no!
There are a few more in Greenland et al. (2016)
So what are they then??
These quantities are model based. They express a level of uncertainty about
a particular statement (a hypothesis) assuming that a particular model is
correct!
p-value: Assuming that the model assumptions are correct AND assuming
that a particular hypothesis is true, then in a repeated sampling setup, you
would observe more extreme values of a test-statistic in approximately
p.100% cases.
p-value is a measure od compatibility of the calculated test statistic with the
underlying model assumption and the null hypothesis.
So what are they then??
These quantities are model based. They express a level of uncertainty about
a particular statement (a hypothesis) assuming that a particular model is
correct!
95%CI: Assuming that the model assumptions are correct AND assuming
that a particular hypothesis is true, in a repeated sampling, 95% CI is an
interval estimate (that is, a random interval), that would cover the true eﬀect
in approximately 95% cases.
CI summarises the results of a hypothesis tests for multiple eﬀect sizes.
Beautiful vis here:
https://seeing-theory.brown.edu/frequentist-inference/index.html
Misc: Principal components analysis
We often wish to reduce the dimension of X.
Source: https://www.quora.com/What-is-an-intuitive-explanation-for-PCA
Misc: Principal components analysis
Find the linear combination of regressors that maximize the variance:
var(Xu1) → max subject to uT
1 u1 = 1
then
var(Xu2) → max subject to uT
2 u2 = 1 and uT
1 u2 = 1
···
Xu1 is the ﬁrst principal component
Xu2 is the second principal component
are orthogonal by construction
compress most information (in terms of variance) into fewer variables
may be interpretable
may help us to identify ”similar” points
Properties PCA examples
A few more comments...
Ask about the data. How it was collected, handled, updated?
Check for data discrepancies, summary statistics.
Do plot the data. Yes, always.
Talk to the experts.
Comment your code, make it easily reproducible.
Adhere to coding standards (http://adv-r.had.co.nz/Style.html).
Do not alter your source dataset, od any preprocessing in a separate ﬁle.
Consider interacting two most important regressors.
Thank you for your attention!
References
There are zillions of books when it comes to regression. The choices I made reﬂect my personal biases.
An excellent book on linear regression with applications in R: Faraway, Julian J. Linear models with R. Chapman and Hall/CRC, 2004. Accompanying website:
https://julianfaraway.github.io/faraway/LMR/
Very accessible intro to regression: is ch1 in: Adams, Christopher P. Learning Microeconometrics with R. Chapman and Hall/CRC, 2020. Accompanying website:
https://sites.google.com/view/microeconometricswithr/home?authuser=0
A practical and very popular book with gear speciﬁcally towards economics is Angrist, Joshua D., and J¨orn-Steﬀen Pischke. Mostly harmless econometrics. Princeton
university press, 2008.
Reference books (free to download) are two Bruce Hansen’s books: Probability and Econometrics: https://www.ssc.wisc.edu/ bhansen/econometrics/ The
explanations are rather brief and succinct (which may not suit everyone).
The paper that shows that you can ﬁnd patterns in a noise: Freedman, David A. ”A note on screening regression equations.” The American Statistician 37.2 (1983):
152-155.
When should one cluster standard errors:
https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle is non-technical
summary of: Abadie, Alberto, et al. When should you adjust standard errors for clustering?. No. w24003. National Bureau of Economic Research, 2017.
The classic and book length account on Model selection by the pioneers of the ﬁeld: Claeskens, Gerda, and Nils Lid Hjort. ”Model selection and model averaging.”
Cambridge Books (2008).
Discussions on statistical signiﬁcance: Greenland, Sander, et al. ”Statistical tests, P values, conﬁdence intervals, and power: a guide to misinterpretations.” European
journal of epidemiology 31.4 (2016): 337-350 and Wasserstein, Ronald L., and Nicole A. Lazar. ”The ASA statement on p-values: context, process, and purpose.”
(2016): 129-133.
R resources: W. N. Venables, D. M. Smith and the R Core Team. ”An Introduction to R https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
R resources: Wickham, Hadley, and Garrett Grolemund. R for data science: import, tidy, transform, visualize, and model data.” O’Reilly Media, Inc.”, 2016.
https://r4ds.had.co.nz
R resources: Wickham, Hadley. ”Elegant graphics for data analysis.” Springer. https://ggplot2-book.org
For those interested in visualizations I would recommend Jonatan Schwabish’s books: Schwabish, Jonathan. Better Data Visualizations: A Guide for Scholars,
Researchers, and Wonks. Columbia University Press, 2021 and Schwabish, Jonathan. Better presentations. Columbia University Press, 2016. Also there is a short
article Schwabish, Jonathan A. ”An economist’s guide to visualizing data.” Journal of Economic Perspectives 28.1 (2014): 209-34.