Stano Pekár
0 2 4 6 8 10 12 14
0.00.10.20.30.40.5
x
y
R
1) R Environment
Exploratory Data Analysis
Regression models
2) The first example
Systematic components
3) Stochastic components
Analyses of continual measurements
4) Analyses of continual measurements II
Analyses of counts
5) Analyses of counts II
Analyses of proportions
• very fast due to use of computers
• chose statistical models that approach data characters
This course
• focuses on regression models in a broad sense
• only on linear models
• with only one response variable (univariate methods)
• with independent observations
Response variable
• (dependent) is the variable whose variation we aim to understand,
the variable that we measure, it goes on ordinate
- continuous measurement, count, proportion (y)
Explanatory variable
• (independent) is the variable that we manipulate (select levels),
interested to what extent is variation in response associated with
variation in explanatory variables, displayed on abscissa
- numeric: continuous or discrete measurements (x) .. covariate
- categorical .. a factor (A, B) with two or more levels (A1, A2, .. B1,
B2, ..)
Stano Pekár
0 2 4 6 8 10 12 14
0.00.10.20.30.40.5
x
y
R
• software packages that include GLM
• environment for the manipulation of objects
- data manipulation, calculation and graphical display
- a high-level programming language
• combination of S (developed at AT&T Bell Laboratories and
forms the basis of the S-PLUS systems) and Scheme languages
• initially written by Gentleman & Ihaka (1996), nowadays with
many contributors (R Development Core Team)
- includes about 30 standard packages
- available 2000 additional packages
• user-unfriendly (limited pull-down menus)
- based on commands
- pull-down menus only for basic commands
Pros
• freeware
• one of the largest statistical systems
• open environment with more dynamic development than other
systems
• whereas Statistica or SAS will give copious output, R will
provide minimal output
• makes you think about the analysis
Cons
• no warranty
• user-unfriendly
• software available from www.r-project.com
•data from
•https://www.press.muni.cz/edicni-rady-munipress/moderni-analyza-biologickych-dat-1
+ - * / > <
== equal
!= not equal
<= less than or equal
^ power
• logical values T .. TRUE, F .. FALSE
Functions
• trigonometric
sin, cos, tan, asin, acos, atan
• logarithmic: log, log2, log10
• sqrt, exp, abs, sum, prod
•seq, c, which, length, cbind, xbind, matrix
• names are case sensitive
• “_” is not allowed to use
• vectors: numeric, character, logical
• arguments (in parentheses): use their names or without at
specified order
• centring: to subtract mean
• scaling: to divide by SD
Created in R:
• use data.frame, rep, factor, levels, relevel
• export: write.table
Imported:
- from Excel via clipboard
dat <- read.delim("clipboard")
• data matrix:
- number of columns = number of variables
- first row contains names of variables (names without blank
spaces)
• each row corresponds to an observation (trial, etc.)
• factors levels can be names or coded as numbers
• all columns must have the same number of rows
• missing data are assigned as NA
- is.na
- $
attach(dat)
names(dat)
soil field distance amount
moist pasture 12 0.22
moist pasture 22 0.11
moist pasture 43 0.29
moist pasture 23 0.33
moist rape 32 0.19
moist rape 67 0.39
moist rape 54 0.18
moist rape NA 0.29
dry pasture 11 1.16
dry pasture 33 1.03
dry pasture 45 1.11
dry pasture NA 1.33
dry rape 55 1.02
dry rape 41 1.23
dry rape 14 1.05
dry rape 27 1.12
• a visual (tabular or graphical) analysis of the data
Important to
- check errors
- get an idea of the result
- suggest a model
- check assumptions for use of desired methods
- set hypotheses
- look for unexpected trends
• use expected values and variation
• E(y), μ: theoretical long-term average of a variable
- one of a few characteristics of a distribution
- for discrete distributions E(y) might not be a possible value
- estimate of E(y) is mean ... mean
- a robust estimate for asymmetric distributions is median: ...
median
- another robust estimate is trimmed mean: mean where α*n
observations are removed from each tail ... mean(y, trim=)
Example
Find mean, median, and mean trimmed by 10% of the amount
variable.
• Var(y), σ2: a theoretical measure of the variability
• minimum and maximum ... range, min, max
• quantiles (0, 25, 50, 75, 100%) ... quantile
• estimate of Var(y) is s2... var
• standard deviation (s) ... sd
• standard error of the mean ...
Example
Find variance, standard deviation, range and standard error of
the mean for amount.
n
s
SEM 
• of a parameter (mean): if large number of samples is taken from a
population then α% of intervals will contain mean
• based on quantiles of the t distribution qt
- lower CI95
- upper CI95
• for asymmetric distributions CI95 is estimated on transformed
values  asymmetric intervals
Example
Find 95% confidence intervals of mean for amount.
SEMty  ,975.0
SEMty  ,975.0
1 n
• basic summaries (min, max, Q25, Q75, median, mean) for all
variables.. summary
• summary table for data with explanatory variable(s) .. tapply
• to count frequencies .. table
Example
Make a table of means for SOIL and FIELD, and table of SEM for
FIELD.
• see demo(graphics) or demo(image)
• graphs
- basic: plot
- advanced: xyplot (library lattice)
• to get all graphic parameters: ?par
• to split window to subplots: par(mfrow)
• to add legend .. legend
• graph window size: x11
plot
Argument Values
type= Style: "n" (empty), "p" (scatter), "l" (lines),
"b" (both), "h" (vertical)
las= Style of axes values: 0 (parallel), 1 (horizontal)
2 (perpendicular), 3 (vertical)
xlab,ylab= Text of axes labels: "..."
cex.lab= Size of axes labels: 1,..
xlim,ylim= Range of axes: c(min, max)
cex.axis= Size of axes values: 1,..
log= Logarithmic scale of x, y or xy
main= Text of title: "..."
main.cex= Size of title: 1,..
1
2
3
4
5
6
7
+
16
17
18
19
20
points
Argument Values
pch= Type of symbols: 0,..,18, "letters"
cex= Size of symbols: 1, ...
col= Colour: 1, 2, 3, 4, 5, 6, 7, 8
font= Font type: 1, 2, 3, 4
• to study distribution of a numeric (response) variable
• histogram .. hist
• stem-and-leaf plot .. stem
• q-q plots to compare distribution of two variables
- compare a single variable with normal: qqnorm
- compare distributions of two variables: qqplot
- to add diagonal line: qqline
Example
Make q-q plot of data from
normal distribution.
Deviations from normal distribution
• for data with continuous explanatory variables
• to produce plots with points: plot
Example
Make scatterplot of distance on amount without and with
different points for two levels of SOIL.
• when there are categorical explanatory variables
• argument notch for boxes with CI95 for median
Example
Make boxplot of amount for SOIL without and with notches.
• for data with both categorical and continuous explanatory
variables
• xyplot from library lattice
Example
Make panel scatterplot regression plot of distance against amount
for SOIL.
• for data with two categorical explanatory variables
• to plot means of two factors (A, B) connected by lines
.. interaction.plot
Example
Make interaction plot of SOIL and FIELD for amount.
• when data are counts or proportions
- data are arranged in a matrix or table
• barplot: beside, legend
Example
Make barplot of SOIL and FIELD for amount.
• when data include several continuous explanatory variables
• pairs produces matrix of all possible plots
• when data include 2 continuous explanatory variables
• wireframe (lattice) produces 3-dimensional plot
• use ggplot2 package
- ggplot
Example
Make violinplot of SOIL and amount and dotplot of SOIL and
amount .
1
2
3
4
5
6
lines
Argument Values
x,y= Coordinates: c( ..,..)
lty= Line type: 1,...,6
col= Colour: 1, 2, 3, 4, 5, 6, 7, 8
lwd= Width: 1, ..
• final plot of estimated models
use visreg package
• lines connects points specified by coordinates
• abline produces line specified by intercept and slope
Example
Make lineplots for the following models:
inverse
exponential logarithmic
power logistic
squareroot
quadratic
inverse squareroot
x
y
1

x
y
1

xy 
)log(xy 
3
xy 
x
ey 
2
006.01.06.0 xxy 
x
e
y 3.0
1
1



• includes systematic and stochastic components
iii xy  
y = a+bx + ε
Deterministic model Statistical model
y
x
y = a+bx
y
x0 0
• assumptions of the stochastic component:
= variance is equal = homoscedastic model
To find real model we need to estimate its parameters: α, β, σ2
as a, b, s2 so that we get
),0(~ 2
 Ni
iiii
 ,0),(cor 
00 )(ˆ bxaxy 
• extension of the systematic component
Simple regression 1-way ANOVA
iii xy   ijjij Ay  
iiy   iiy  
0 0
  kk xxxy ..2211
Linear model (LM) has a general form
x can include: u2, u1/2, log(u), exp(u), sin(u), factors
= model is linear in parameters when it includes only linear
combinations of parameters
Some nonlinear relationships can be linearised
• log-transformation of both sides:
linear predictor

 
bxayeey ibxa
)log(
 bxazyz )log(
• eε has lognormal distribution while ε has normal distribution
• y has heterogenous variance z has homogenous variance
• eε is multiplicative while ε is additive
• curved relationship becomes linear
Other nonlinear relationships can not be linearised
use Nonlinear regression
)1( x
ey 
 

kk xxxf   ..)( 2211
ondistributiy ~
kk xxx   ..2211
),0(~ 2
 Ny 
• extension of the stochastic component
- we model transformed expected value of y
f(μ) .. link function
For example,
),(~ 2
Ny
GLM has 3 components:
• link function
• linear predictor
• distribution family
- Gaussian (normal), Gamma, Inverse Gaussian, Poisson,
Quasipoisson, Binomial, Quasibinomial, Quasi
• measure of fit is deviance not sum of squares
- null deviance = SST
- residual deviance = SSE
- ANODEV table = ANOVA table
• a useful simplification of the reality
- should include important aspects for which it is being made and
ignore aspects that we are not interested in
• Principle of parsimony: Simpler model is better if it explains
study phenomenon as good as complicated model.
G. E. P. Box: „All models are wrong. But some of them are
useful.“
Bottom -up or forward selection
• building up a model by adding available variables
Top-down or backward selection
• reducing maximal (saturated) model
1. Fit maximal model- all main effects and interactions
2. Remove insignificant interactions and main effects
3. Group together similar factor levels
4. Check diagnostic plots
5. Alter model if necessary
6. Achieve minimal adequate model
- contains only terms in which all parameters are significantly
different
• to assess model quality and assumptions
- study of both systematic and stochastic components
- we can never prove that model is adequate
Residuals
should not
• make trends when plotted against explanatory or response variables
• be heteroscedastic
• have unusual distribution
• be interdependent
Checking assumptions
• informal using plots - plot produces 6 plots
• formal using tests
),0(~ 2
 Ni
iiii
 ,0),(cor 
• raw (LM) or deviance (GLM) residuals against fitted values
• curved pattern suggests lack of polynomial term
• q-q plot of standardised (LM) standardised deviance (GLM)
residuals
• data from other than normal distribution can not have normally
distributed residuals
• when the pattern is “J” or “S” shaped change link function or
transform the variable
• plot of standardised (LM) standardised deviance (GLM)
residuals against fitted/predicted values
• when variance increases with the mean use Poisson or gamma
distribution or log transformation
• plot of Cook’s distance for each observation shows the influence
of individual observations on the model fit
• values of influential observations are close to 1 and higher
• residuals versus leverage
• omit influential observations or transform the explanatory
variables (using log, power, reciprocal)
• dependence on continual explanatory variable
- using standardised (LM) or Pearson residues (GLM)
• serial dependence if explanatory variable is time or space
Stano Pekár
0 2 4 6 8 10 12 14
0.00.10.20.30.40.5
x
y
R
Background
Nutritional quality of the diet affects growth of organisms in
various ways. To find optimal diet for cockroaches the following
experiments was performed.
Design
Effect of five diet types (control, lipid1, lipid2, protein1,
protein2) was tested on body weight [g] of cockroaches. For each
diet type there were 17 observations.
Biological hypothesis
Is nutritional quality of the diet affecting size of organisms?
Statistical hypotheses
H0: Weight is similar among diet groups.
HA: Weight is significantly different among diet groups.
Prediction:
Protein-enriched diet should lead to highest weight.
Variables
DIET: control, lipid1, lipid2, protein1, protein2
weight
Model:
EDA:
Analysis:
• compare individual differences between factor levels
• comparisons are valid only if a factor is significant
Options:
- Apriori contrasts (before analysis)
- Posteriori simplification (after analysis)
- Multiple comparisons (after analysis)
- apriori contrasts are preferred to avoid excess of significant results
For a model
a contrast will be
where Aj .. mean value of a level, wj .. contrast coefficient
Creating contrasts
- levels lumped together get the same sign
- levels contrasted get opposite sign
- levels excluded get 0
.. so that sum of each contrast
ijjij Ay 


J
j
jj AwK
1
0
1

J
j
jw
Contrasts are arranged in a matrix
• only k-1 (k .. number of levels) contrasts are orthogonal, i.e. each
level (combination) is compared only once
... products of any two contrasts = 0
• specified by function contrasts prior to analysis
Pre-specified contrasts:
• Treatment (default in R) - compare specific level with the
reference level
• Helmert - compare specific level with the average of previous
levels
• Sum - compare specific level with the grand mean
• Textbook - compare each level with 0
• levels of a factor are compared using Wald statistics from ouput
• similar factor levels are the grouped together
• test each grouping by anova
• compare the final model with the first one
• post hoc tests
• Bonferroni correction – applied to non-orthogonal contrasts
• Dunn test, Scheffe test, Tukey HSD test
• comparison by means of confidence intervals
We should check as many aspects as possible
• use diagnostic plots
• use formal tests:
- Bartlett test to compare variances
- Shapiro-Wilk test of normality
Diagnosis:
Stano Pekár
0 2 4 6 8 10 12 14
0.00.10.20.30.40.5
x
y
R
Explanatory variable(s) Method
Continuous Regression
Categorical ANOVA
Continuous and categorical ANCOVA
• the same explanatory variable can be taken once as continuous
other time as categorical: e.g. two levels of concentration
• continuous variable allows interpolation and extrapolation
Key to methods:
iii bxay 
Linear predictor can include various terms:
- intercept .. α estimated as a
- linear term .. βx with b as coefficient of linear trend
- quadratic term .. γx2 with c as coefficient of quadratic trend
- cubic term .. τx3 with t as coefficient of cubic trend
- main effect .. A
- interaction between factors .. A:B
- interaction between continuous variables x1:x2
- linear interaction .. A:x
- quadratic interaction .. A:x2
• simple regression ... 1 explanatory variable
• multiple regression .. 2 and more explanatory variables
General linear predictor of multiple regression
α .. intercept
βk .. linear coefficients of xk
x .. may represent polynomic functions (x3), interactions (x1.x2)
- rule of thumb: less than n/3 parameters in model at any time
- number of combinations of explanatory variables will often
exceed the number of data so we can not include all terms
kk xxx   ..2211
Simplification
• linear predictor with 2 explanatory variables (x1, x2) should include
all main effects, all interactions, and quadratic terms
with estimates a, b1, b2, c1,c2, d
Nested models are:
• 5 parameters (a, b1, b2, c1, c2),
at least c1 and c2 are significantly
different
21
2
22
2
112211 xxxxxx  
• 4 parameters (a, b1, b2, c1), at least c1 is significantly different
• 3 parameters (a, b1, b2), at least b1 and b2 are significantly
different
If one explanatory variable (x2) turns out to be insignificant:
• 3 parameters (a, b, c), at least c is significantly different
• 2 parameters (a, b), at least b is significantly different
• 1 parameter (a) that is significantly different
• 1-way ANOVA .. 1 factor
• 2-way ANOVA .. 2 factors
• k-way ANOVA .. k factors
• k-way ANOVA might be with our without interactions
Given 2 categorical variables A and B each with 2 levels (A1, A2,
and B1, B2) model with treatment contrasts is
α .. mean of A1B1, Ai and Bj .. main effects, A:Bij .. interaction
ijji BABA :
• 4 parameters (A1B1, A2B1–A1B1, A1B2–A1B1 a A2B2–A1B2): interaction
is significant
• 3 parameters (A1B1, A2B1–A1B1, B2–B1): only A and B are significant
• 2 parameters (B1, B2–B1): only B is significant
• 1 parameter (grand mean): null model
• combination of regression and ANOVA
• continuous variable = covariate
Given 1 factor (Aj) and 1 covariate (x) linear predictor is:
α .. intercept, Aj .. effect of factor, β .. slope, δ .. effect of interaction
Given 1 categorical variable A with 2 levels (A1, A2) and 1 continual
x, the linear predictor will be
22
xxxxA jjj  
xxA jj  
• 6 parameters - 2 intercepts (a1, a2–a1), 2 slopes (b1, b2–b1), 3
quadratic (c1, c2–c1) - interaction A:x2 is significant
• 4 parameters - 2 intercepts (a1, a2–a1), 2 slopes (b1, b2–b1) interaction
A:x is significant, but quadratic terms are not significant
• 4 parameters - 2 intercepts (a1, a2–a1), 1 slope (b), 1 quadratic
(c) - interactions A:x2 and A:x are not significant, but A and
quadratic terms are significant
• 3 parameters - 2 intercepts (a1, a2–a1), 1 slope (b) - only main
effects (A and x) are significant
• Further simplification  1-way ANOVA or simple regression
response variable ~ explanatory variable(s)
• Operators:
- on left side any mathematical operator can be used
- on the right side only few:
+ .. add
- .. delete
: .. interaction
* .. all terms
1 .. intercept
I .. interpreter that translates operators into mathematical meaning
/ .. nested
| .. conditioned
Model formula Description
y ~ 1 Null model
y ~ x Linear model with
1 explanatory variable
log(y) ~ x -1 Linear model with
1 explanatory variable, without intercept
and with log-transformed response
y ~ x + I(x^2) Quadratic model with 1
y ~ poly(x,2) explanatory variable
y ~ x1 + x2 Linear model with
2 explanatory variables
 )( if
ii xf  )(
ii x )log(
2
)( iii xxf  
iii xxf 2211)(  
Model formula Description .
y ~ A*B*C 3-way ANOVA with
y ~ A +B+C+A:B three main effects,
+ A:C+B:C+A:B:C two 2-way interactions
and one 3-way interaction
y ~ (A+B+C)^2 3-way ANOVA with
only three 2-way interactions
y ~ x*A 1-way ANCOVA ijijij xxAf  )(
jkikij
kjiijk
CBCABA
CBAf
:::
)(

 
ijk
jkikij
kjiijk
CBA
CBCABA
CBAf
::
:::
)(


 
Stano Pekár
0 2 4 6 8 10 12 14
0.00.10.20.30.40.5
x
y
R
• choose distribution if using GLM
• there are many distributions but only some are available for GLM
• decision should be based upon theoretical models or previous
experience
Response variable can be
• continuous measurements
• counts
• proportions
iii bxay 
• bell-shaped, symmetric around mean
• mean = median = modus
• parameters: μ, σ2
- s2 is independent of mean
f(x)
s2
y y
• measurements that can be made with infinite precision
• positive real values
• asymmetric, skewed to the right
• variance increases with mean at quadratic trend
• after logarithmic transformation variances are similar
• positive real values
• asymmetric, skewed to the right
• variance increases with mean at a quadratic trend
• positive real values
• used to model diffusion processes, dispersion in ecology
• variance increases steeply with mean
• discrete values, made of integers
• asymmetric, skewed to the right
• variance is equal to expected value
- variance increases with mean
• discrete values, made of integers
• asymmetric, strongly skewed to the right
• variance is larger than expected value
- variance increases with mean at a parabolic trend






 p
p
1
log
• arise when we counts events (y) from a whole population (n)
• p .. relative frequency = y/n
• we study only qualitative character of an event not its
quantitative aspect
• p is an estimate of a theoretical value π
• based on logit transformation
• measurements (y) are integers of n independent trials
• π .. a single parameter showing probability of event occurrence
• 0  π  1
• variance of π is maximal at 0.5
• is not any distribution
• specifies expected value and the relationship between
expected value and variance
• mixture of available settings
0 2 4 6 8 10 12 14
0.00.10.20.30.40.5
x
y
R
Stano Pekár
 response variable is continuous
• measurements of length, width, distance, concentration, pH, etc.
• data are real numbers
• distribution is symmetric (-, +)
• parameters: μ, σ2 independent of each other
• t-test (t.test) to compare one or two means
• Linear model (lm) to study effect of categorical and
continuous variables
- inference is exact, reliable for each n
• GLM (glm) to study effect of categorical and continuous
variables
- Gaussian family (default)
- link: identity
- inference is asymptotic, valid only for large n
glm(formula, family=Gaussian)
Background
The number of grains in ears affects the yield of cereals.
Design
On 20 plots mean number of seeds per oat ear was estimated.
Then at harvest the yield [t/ha] for each plot was estimated.
Hypotheses
Is number of seeds related to the yield?
What is the predictive model of this relationship?
Variables
grain
yield
• check for curvature by fitting a separate quadratic term for
continuous explanatory variables
• quadratic model - a simple description of non-monotonous trend
• use either poly(x,2) or x + I(x^2)
  2
xxy
• remove insignificant interactions
- begin with the higher order terms because main effects are marginal
to interactions
- intercept is marginal to slope and both are marginal to the quadratic
term
• remove insignificant main effects
Criteria
• test (F or χ2) and a given p-value (anova)
• Akaike Information Criterion (AIC):
- the more there are parameters in the model the better fit but worse
explanatory power of the model
- the lower AIC the better model
pLogLik 22AIC 
Background
Sexual size dimorphism may increases with ambient temperature in
spiders.
n
2

Weighting
• to increase/decrease effect of some measurements
• only positive values are allowed
• instead of least squares weighted least squares are used
Design
Males and females of Zodarion spiders were sampled on 13 sites
with a different temperature [ºC]. Of the average size of males and
females a size ratio was calculated for each site. The number of
individuals varied between sites (2 to 62 specimens).
Hypotheses
Is there relationship between the ratio and the temperature?
What is the model?
Variables
temp
number
ratio
Background
The carcinogenic disease is related to the production of toxins by
certain bacteria in the body of patients. Presence of toxins can be
used as an indicator of certain carcinogenic disease.
Design
In a clinical study, the amount of a toxin [units/μl] produced by four
bacteria species was measured in patients with two carcinogenic and
two non-carcinogenic diseases. For each disease there were 20
patients. In each patient only a single bacterial toxin was measured
so there were 5 replications per bacteria species.
Hypotheses
Is the amount of toxin similar for four bacteria species and four
diseases?
If not what is the difference?
Which species can be used as an indicator?
Variables
SPECIES:bacterA, bacterB,
bacterC, bacterD
DIAGNOSIS:carc.rectum,
carc.intestine,
apendicitis, skin.absces
toxin
• anova uses Type I Sum of Squares
- sequential assessment of effects according to the given order
- at first main effects are assessed then interactions
- in orthogonal the order is not important
- if data are unortogonal it is more appropriate to use Type III SS
Orthogonality
• independent variables are orthogonal - effects are straightforward
• correlated variables are unorthogonal - effects are complicated
- when there are missing values or unequal number of observations
per treatment
Background
Rate of population increase is a function of temperature in
ectotherms, such as mites. A model of the relationship is essential
for the control of mite pests.
Design
In the lab, population increase of two pest mite species was studied
at 11 temperatures between 10 and 35 °C. The rate of increase was
estimated using formula for exponential population growth. For
each temperature a single measurement for each species was
available.
Hypotheses
Did temperature affect the rate of increase?
Was the rate similar for both species?
What is the model of the relationship?
Variables
GENUS: genA, genB
temp
rate
Background
Yield of cereals is determined by a number of variables. To predict
yield with high accuracy, various effects have to be studied.
Design
On 100 plots, the yield of wheat [t/ha] was estimated together with
six other variables: 1. number of overwintering plants, 2. number of
ears/m2, 3. pH of soil, 4. content of phosphorus [mg/kg], 5. content
of potassium [mg/kg], 6. content of magnesium [mg/kg].
Hypotheses
Did any of six variables affect the yield?
If so which ones?
What is the model for prediction of yield?
Variables
winter
ears
pH
P
K
Mg
yield
• When two or more variables show correlation
• PCA can reduce dimensionability of variables – use pCA
scores instead