Stano Pekár 0 2 4 6 8 10 12 14 0.00.10.20.30.40.5 x y R 1) R Environment Exploratory Data Analysis Regression models 2) The first example Systematic components 3) Stochastic components Analyses of continual measurements 4) Analyses of continual measurements II Analyses of counts 5) Analyses of counts II Analyses of proportions • very fast due to use of computers • chose statistical models that approach data characters This course • focuses on regression models in a broad sense • only on linear models • with only one response variable (univariate methods) • with independent observations Response variable • (dependent) is the variable whose variation we aim to understand, the variable that we measure, it goes on ordinate - continuous measurement, count, proportion (y) Explanatory variable • (independent) is the variable that we manipulate (select levels), interested to what extent is variation in response associated with variation in explanatory variables, displayed on abscissa - numeric: continuous or discrete measurements (x) .. covariate - categorical .. a factor (A, B) with two or more levels (A1, A2, .. B1, B2, ..) Stano Pekár 0 2 4 6 8 10 12 14 0.00.10.20.30.40.5 x y R • software packages that include GLM • environment for the manipulation of objects - data manipulation, calculation and graphical display - a high-level programming language • combination of S (developed at AT&T Bell Laboratories and forms the basis of the S-PLUS systems) and Scheme languages • initially written by Gentleman & Ihaka (1996), nowadays with many contributors (R Development Core Team) - includes about 30 standard packages - available 2000 additional packages • user-unfriendly (limited pull-down menus) - based on commands - pull-down menus only for basic commands Pros • freeware • one of the largest statistical systems • open environment with more dynamic development than other systems • whereas Statistica or SAS will give copious output, R will provide minimal output • makes you think about the analysis Cons • no warranty • user-unfriendly • software available from www.r-project.com •data from •https://www.press.muni.cz/edicni-rady-munipress/moderni-analyza-biologickych-dat-1 + - * / > < == equal != not equal <= less than or equal ^ power • logical values T .. TRUE, F .. FALSE Functions • trigonometric sin, cos, tan, asin, acos, atan • logarithmic: log, log2, log10 • sqrt, exp, abs, sum, prod •seq, c, which, length, cbind, xbind, matrix • names are case sensitive • “_” is not allowed to use • vectors: numeric, character, logical • arguments (in parentheses): use their names or without at specified order • centring: to subtract mean • scaling: to divide by SD Created in R: • use data.frame, rep, factor, levels, relevel • export: write.table Imported: - from Excel via clipboard dat <- read.delim("clipboard") • data matrix: - number of columns = number of variables - first row contains names of variables (names without blank spaces) • each row corresponds to an observation (trial, etc.) • factors levels can be names or coded as numbers • all columns must have the same number of rows • missing data are assigned as NA - is.na - $ attach(dat) names(dat) soil field distance amount moist pasture 12 0.22 moist pasture 22 0.11 moist pasture 43 0.29 moist pasture 23 0.33 moist rape 32 0.19 moist rape 67 0.39 moist rape 54 0.18 moist rape NA 0.29 dry pasture 11 1.16 dry pasture 33 1.03 dry pasture 45 1.11 dry pasture NA 1.33 dry rape 55 1.02 dry rape 41 1.23 dry rape 14 1.05 dry rape 27 1.12 • a visual (tabular or graphical) analysis of the data Important to - check errors - get an idea of the result - suggest a model - check assumptions for use of desired methods - set hypotheses - look for unexpected trends • use expected values and variation • E(y), μ: theoretical long-term average of a variable - one of a few characteristics of a distribution - for discrete distributions E(y) might not be a possible value - estimate of E(y) is mean ... mean - a robust estimate for asymmetric distributions is median: ... median - another robust estimate is trimmed mean: mean where α*n observations are removed from each tail ... mean(y, trim=) Example Find mean, median, and mean trimmed by 10% of the amount variable. • Var(y), σ2: a theoretical measure of the variability • minimum and maximum ... range, min, max • quantiles (0, 25, 50, 75, 100%) ... quantile • estimate of Var(y) is s2... var • standard deviation (s) ... sd • standard error of the mean ... Example Find variance, standard deviation, range and standard error of the mean for amount. n s SEM  • of a parameter (mean): if large number of samples is taken from a population then α% of intervals will contain mean • based on quantiles of the t distribution qt - lower CI95 - upper CI95 • for asymmetric distributions CI95 is estimated on transformed values  asymmetric intervals Example Find 95% confidence intervals of mean for amount. SEMty  ,975.0 SEMty  ,975.0 1 n • basic summaries (min, max, Q25, Q75, median, mean) for all variables.. summary • summary table for data with explanatory variable(s) .. tapply • to count frequencies .. table Example Make a table of means for SOIL and FIELD, and table of SEM for FIELD. • see demo(graphics) or demo(image) • graphs - basic: plot - advanced: xyplot (library lattice) • to get all graphic parameters: ?par • to split window to subplots: par(mfrow) • to add legend .. legend • graph window size: x11 plot Argument Values type= Style: "n" (empty), "p" (scatter), "l" (lines), "b" (both), "h" (vertical) las= Style of axes values: 0 (parallel), 1 (horizontal) 2 (perpendicular), 3 (vertical) xlab,ylab= Text of axes labels: "..." cex.lab= Size of axes labels: 1,.. xlim,ylim= Range of axes: c(min, max) cex.axis= Size of axes values: 1,.. log= Logarithmic scale of x, y or xy main= Text of title: "..." main.cex= Size of title: 1,.. 1 2 3 4 5 6 7 + 16 17 18 19 20 points Argument Values pch= Type of symbols: 0,..,18, "letters" cex= Size of symbols: 1, ... col= Colour: 1, 2, 3, 4, 5, 6, 7, 8 font= Font type: 1, 2, 3, 4 • to study distribution of a numeric (response) variable • histogram .. hist • stem-and-leaf plot .. stem • q-q plots to compare distribution of two variables - compare a single variable with normal: qqnorm - compare distributions of two variables: qqplot - to add diagonal line: qqline Example Make q-q plot of data from normal distribution. Deviations from normal distribution • for data with continuous explanatory variables • to produce plots with points: plot Example Make scatterplot of distance on amount without and with different points for two levels of SOIL. • when there are categorical explanatory variables • argument notch for boxes with CI95 for median Example Make boxplot of amount for SOIL without and with notches. • for data with both categorical and continuous explanatory variables • xyplot from library lattice Example Make panel scatterplot regression plot of distance against amount for SOIL. • for data with two categorical explanatory variables • to plot means of two factors (A, B) connected by lines .. interaction.plot Example Make interaction plot of SOIL and FIELD for amount. • when data are counts or proportions - data are arranged in a matrix or table • barplot: beside, legend Example Make barplot of SOIL and FIELD for amount. • when data include several continuous explanatory variables • pairs produces matrix of all possible plots • when data include 2 continuous explanatory variables • wireframe (lattice) produces 3-dimensional plot • use ggplot2 package - ggplot Example Make violinplot of SOIL and amount and dotplot of SOIL and amount . 1 2 3 4 5 6 lines Argument Values x,y= Coordinates: c( ..,..) lty= Line type: 1,...,6 col= Colour: 1, 2, 3, 4, 5, 6, 7, 8 lwd= Width: 1, .. • final plot of estimated models use visreg package • lines connects points specified by coordinates • abline produces line specified by intercept and slope Example Make lineplots for the following models: inverse exponential logarithmic power logistic squareroot quadratic inverse squareroot x y 1  x y 1  xy  )log(xy  3 xy  x ey  2 006.01.06.0 xxy  x e y 3.0 1 1    • includes systematic and stochastic components iii xy   y = a+bx + ε Deterministic model Statistical model y x y = a+bx y x0 0 • assumptions of the stochastic component: = variance is equal = homoscedastic model To find real model we need to estimate its parameters: α, β, σ2 as a, b, s2 so that we get ),0(~ 2  Ni iiii  ,0),(cor  00 )(ˆ bxaxy  • extension of the systematic component Simple regression 1-way ANOVA iii xy   ijjij Ay   iiy   iiy   0 0   kk xxxy ..2211 Linear model (LM) has a general form x can include: u2, u1/2, log(u), exp(u), sin(u), factors = model is linear in parameters when it includes only linear combinations of parameters Some nonlinear relationships can be linearised • log-transformation of both sides: linear predictor    bxayeey ibxa )log(  bxazyz )log( • eε has lognormal distribution while ε has normal distribution • y has heterogenous variance z has homogenous variance • eε is multiplicative while ε is additive • curved relationship becomes linear Other nonlinear relationships can not be linearised use Nonlinear regression )1( x ey     kk xxxf   ..)( 2211 ondistributiy ~ kk xxx   ..2211 ),0(~ 2  Ny  • extension of the stochastic component - we model transformed expected value of y f(μ) .. link function For example, ),(~ 2 Ny GLM has 3 components: • link function • linear predictor • distribution family - Gaussian (normal), Gamma, Inverse Gaussian, Poisson, Quasipoisson, Binomial, Quasibinomial, Quasi • measure of fit is deviance not sum of squares - null deviance = SST - residual deviance = SSE - ANODEV table = ANOVA table • a useful simplification of the reality - should include important aspects for which it is being made and ignore aspects that we are not interested in • Principle of parsimony: Simpler model is better if it explains study phenomenon as good as complicated model. G. E. P. Box: „All models are wrong. But some of them are useful.“ Bottom -up or forward selection • building up a model by adding available variables Top-down or backward selection • reducing maximal (saturated) model 1. Fit maximal model- all main effects and interactions 2. Remove insignificant interactions and main effects 3. Group together similar factor levels 4. Check diagnostic plots 5. Alter model if necessary 6. Achieve minimal adequate model - contains only terms in which all parameters are significantly different • to assess model quality and assumptions - study of both systematic and stochastic components - we can never prove that model is adequate Residuals should not • make trends when plotted against explanatory or response variables • be heteroscedastic • have unusual distribution • be interdependent Checking assumptions • informal using plots - plot produces 6 plots • formal using tests ),0(~ 2  Ni iiii  ,0),(cor  • raw (LM) or deviance (GLM) residuals against fitted values • curved pattern suggests lack of polynomial term • q-q plot of standardised (LM) standardised deviance (GLM) residuals • data from other than normal distribution can not have normally distributed residuals • when the pattern is “J” or “S” shaped change link function or transform the variable • plot of standardised (LM) standardised deviance (GLM) residuals against fitted/predicted values • when variance increases with the mean use Poisson or gamma distribution or log transformation • plot of Cook’s distance for each observation shows the influence of individual observations on the model fit • values of influential observations are close to 1 and higher • residuals versus leverage • omit influential observations or transform the explanatory variables (using log, power, reciprocal) • dependence on continual explanatory variable - using standardised (LM) or Pearson residues (GLM) • serial dependence if explanatory variable is time or space Stano Pekár 0 2 4 6 8 10 12 14 0.00.10.20.30.40.5 x y R Background Nutritional quality of the diet affects growth of organisms in various ways. To find optimal diet for cockroaches the following experiments was performed. Design Effect of five diet types (control, lipid1, lipid2, protein1, protein2) was tested on body weight [g] of cockroaches. For each diet type there were 17 observations. Biological hypothesis Is nutritional quality of the diet affecting size of organisms? Statistical hypotheses H0: Weight is similar among diet groups. HA: Weight is significantly different among diet groups. Prediction: Protein-enriched diet should lead to highest weight. Variables DIET: control, lipid1, lipid2, protein1, protein2 weight Model: EDA: Analysis: • compare individual differences between factor levels • comparisons are valid only if a factor is significant Options: - Apriori contrasts (before analysis) - Posteriori simplification (after analysis) - Multiple comparisons (after analysis) - apriori contrasts are preferred to avoid excess of significant results For a model a contrast will be where Aj .. mean value of a level, wj .. contrast coefficient Creating contrasts - levels lumped together get the same sign - levels contrasted get opposite sign - levels excluded get 0 .. so that sum of each contrast ijjij Ay    J j jj AwK 1 0 1  J j jw Contrasts are arranged in a matrix • only k-1 (k .. number of levels) contrasts are orthogonal, i.e. each level (combination) is compared only once ... products of any two contrasts = 0 • specified by function contrasts prior to analysis Pre-specified contrasts: • Treatment (default in R) - compare specific level with the reference level • Helmert - compare specific level with the average of previous levels • Sum - compare specific level with the grand mean • Textbook - compare each level with 0 • levels of a factor are compared using Wald statistics from ouput • similar factor levels are the grouped together • test each grouping by anova • compare the final model with the first one • post hoc tests • Bonferroni correction – applied to non-orthogonal contrasts • Dunn test, Scheffe test, Tukey HSD test • comparison by means of confidence intervals We should check as many aspects as possible • use diagnostic plots • use formal tests: - Bartlett test to compare variances - Shapiro-Wilk test of normality Diagnosis: Stano Pekár 0 2 4 6 8 10 12 14 0.00.10.20.30.40.5 x y R Explanatory variable(s) Method Continuous Regression Categorical ANOVA Continuous and categorical ANCOVA • the same explanatory variable can be taken once as continuous other time as categorical: e.g. two levels of concentration • continuous variable allows interpolation and extrapolation Key to methods: iii bxay  Linear predictor can include various terms: - intercept .. α estimated as a - linear term .. βx with b as coefficient of linear trend - quadratic term .. γx2 with c as coefficient of quadratic trend - cubic term .. τx3 with t as coefficient of cubic trend - main effect .. A - interaction between factors .. A:B - interaction between continuous variables x1:x2 - linear interaction .. A:x - quadratic interaction .. A:x2 • simple regression ... 1 explanatory variable • multiple regression .. 2 and more explanatory variables General linear predictor of multiple regression α .. intercept βk .. linear coefficients of xk x .. may represent polynomic functions (x3), interactions (x1.x2) - rule of thumb: less than n/3 parameters in model at any time - number of combinations of explanatory variables will often exceed the number of data so we can not include all terms kk xxx   ..2211 Simplification • linear predictor with 2 explanatory variables (x1, x2) should include all main effects, all interactions, and quadratic terms with estimates a, b1, b2, c1,c2, d Nested models are: • 5 parameters (a, b1, b2, c1, c2), at least c1 and c2 are significantly different 21 2 22 2 112211 xxxxxx   • 4 parameters (a, b1, b2, c1), at least c1 is significantly different • 3 parameters (a, b1, b2), at least b1 and b2 are significantly different If one explanatory variable (x2) turns out to be insignificant: • 3 parameters (a, b, c), at least c is significantly different • 2 parameters (a, b), at least b is significantly different • 1 parameter (a) that is significantly different • 1-way ANOVA .. 1 factor • 2-way ANOVA .. 2 factors • k-way ANOVA .. k factors • k-way ANOVA might be with our without interactions Given 2 categorical variables A and B each with 2 levels (A1, A2, and B1, B2) model with treatment contrasts is α .. mean of A1B1, Ai and Bj .. main effects, A:Bij .. interaction ijji BABA : • 4 parameters (A1B1, A2B1–A1B1, A1B2–A1B1 a A2B2–A1B2): interaction is significant • 3 parameters (A1B1, A2B1–A1B1, B2–B1): only A and B are significant • 2 parameters (B1, B2–B1): only B is significant • 1 parameter (grand mean): null model • combination of regression and ANOVA • continuous variable = covariate Given 1 factor (Aj) and 1 covariate (x) linear predictor is: α .. intercept, Aj .. effect of factor, β .. slope, δ .. effect of interaction Given 1 categorical variable A with 2 levels (A1, A2) and 1 continual x, the linear predictor will be 22 xxxxA jjj   xxA jj   • 6 parameters - 2 intercepts (a1, a2–a1), 2 slopes (b1, b2–b1), 3 quadratic (c1, c2–c1) - interaction A:x2 is significant • 4 parameters - 2 intercepts (a1, a2–a1), 2 slopes (b1, b2–b1) interaction A:x is significant, but quadratic terms are not significant • 4 parameters - 2 intercepts (a1, a2–a1), 1 slope (b), 1 quadratic (c) - interactions A:x2 and A:x are not significant, but A and quadratic terms are significant • 3 parameters - 2 intercepts (a1, a2–a1), 1 slope (b) - only main effects (A and x) are significant • Further simplification  1-way ANOVA or simple regression response variable ~ explanatory variable(s) • Operators: - on left side any mathematical operator can be used - on the right side only few: + .. add - .. delete : .. interaction * .. all terms 1 .. intercept I .. interpreter that translates operators into mathematical meaning / .. nested | .. conditioned Model formula Description y ~ 1 Null model y ~ x Linear model with 1 explanatory variable log(y) ~ x -1 Linear model with 1 explanatory variable, without intercept and with log-transformed response y ~ x + I(x^2) Quadratic model with 1 y ~ poly(x,2) explanatory variable y ~ x1 + x2 Linear model with 2 explanatory variables  )( if ii xf  )( ii x )log( 2 )( iii xxf   iii xxf 2211)(   Model formula Description . y ~ A*B*C 3-way ANOVA with y ~ A +B+C+A:B three main effects, + A:C+B:C+A:B:C two 2-way interactions and one 3-way interaction y ~ (A+B+C)^2 3-way ANOVA with only three 2-way interactions y ~ x*A 1-way ANCOVA ijijij xxAf  )( jkikij kjiijk CBCABA CBAf ::: )(    ijk jkikij kjiijk CBA CBCABA CBAf :: ::: )(     Stano Pekár 0 2 4 6 8 10 12 14 0.00.10.20.30.40.5 x y R • choose distribution if using GLM • there are many distributions but only some are available for GLM • decision should be based upon theoretical models or previous experience Response variable can be • continuous measurements • counts • proportions iii bxay  • bell-shaped, symmetric around mean • mean = median = modus • parameters: μ, σ2 - s2 is independent of mean f(x) s2 y y • measurements that can be made with infinite precision • positive real values • asymmetric, skewed to the right • variance increases with mean at quadratic trend • after logarithmic transformation variances are similar • positive real values • asymmetric, skewed to the right • variance increases with mean at a quadratic trend • positive real values • used to model diffusion processes, dispersion in ecology • variance increases steeply with mean • discrete values, made of integers • asymmetric, skewed to the right • variance is equal to expected value - variance increases with mean • discrete values, made of integers • asymmetric, strongly skewed to the right • variance is larger than expected value - variance increases with mean at a parabolic trend        p p 1 log • arise when we counts events (y) from a whole population (n) • p .. relative frequency = y/n • we study only qualitative character of an event not its quantitative aspect • p is an estimate of a theoretical value π • based on logit transformation • measurements (y) are integers of n independent trials • π .. a single parameter showing probability of event occurrence • 0  π  1 • variance of π is maximal at 0.5 • is not any distribution • specifies expected value and the relationship between expected value and variance • mixture of available settings 0 2 4 6 8 10 12 14 0.00.10.20.30.40.5 x y R Stano Pekár  response variable is continuous • measurements of length, width, distance, concentration, pH, etc. • data are real numbers • distribution is symmetric (-, +) • parameters: μ, σ2 independent of each other • t-test (t.test) to compare one or two means • Linear model (lm) to study effect of categorical and continuous variables - inference is exact, reliable for each n • GLM (glm) to study effect of categorical and continuous variables - Gaussian family (default) - link: identity - inference is asymptotic, valid only for large n glm(formula, family=Gaussian) Background The number of grains in ears affects the yield of cereals. Design On 20 plots mean number of seeds per oat ear was estimated. Then at harvest the yield [t/ha] for each plot was estimated. Hypotheses Is number of seeds related to the yield? What is the predictive model of this relationship? Variables grain yield • check for curvature by fitting a separate quadratic term for continuous explanatory variables • quadratic model - a simple description of non-monotonous trend • use either poly(x,2) or x + I(x^2)   2 xxy • remove insignificant interactions - begin with the higher order terms because main effects are marginal to interactions - intercept is marginal to slope and both are marginal to the quadratic term • remove insignificant main effects Criteria • test (F or χ2) and a given p-value (anova) • Akaike Information Criterion (AIC): - the more there are parameters in the model the better fit but worse explanatory power of the model - the lower AIC the better model pLogLik 22AIC  Background Sexual size dimorphism may increases with ambient temperature in spiders. n 2  Weighting • to increase/decrease effect of some measurements • only positive values are allowed • instead of least squares weighted least squares are used Design Males and females of Zodarion spiders were sampled on 13 sites with a different temperature [ºC]. Of the average size of males and females a size ratio was calculated for each site. The number of individuals varied between sites (2 to 62 specimens). Hypotheses Is there relationship between the ratio and the temperature? What is the model? Variables temp number ratio Background The carcinogenic disease is related to the production of toxins by certain bacteria in the body of patients. Presence of toxins can be used as an indicator of certain carcinogenic disease. Design In a clinical study, the amount of a toxin [units/μl] produced by four bacteria species was measured in patients with two carcinogenic and two non-carcinogenic diseases. For each disease there were 20 patients. In each patient only a single bacterial toxin was measured so there were 5 replications per bacteria species. Hypotheses Is the amount of toxin similar for four bacteria species and four diseases? If not what is the difference? Which species can be used as an indicator? Variables SPECIES:bacterA, bacterB, bacterC, bacterD DIAGNOSIS:carc.rectum, carc.intestine, apendicitis, skin.absces toxin • anova uses Type I Sum of Squares - sequential assessment of effects according to the given order - at first main effects are assessed then interactions - in orthogonal the order is not important - if data are unortogonal it is more appropriate to use Type III SS Orthogonality • independent variables are orthogonal - effects are straightforward • correlated variables are unorthogonal - effects are complicated - when there are missing values or unequal number of observations per treatment Background Rate of population increase is a function of temperature in ectotherms, such as mites. A model of the relationship is essential for the control of mite pests. Design In the lab, population increase of two pest mite species was studied at 11 temperatures between 10 and 35 °C. The rate of increase was estimated using formula for exponential population growth. For each temperature a single measurement for each species was available. Hypotheses Did temperature affect the rate of increase? Was the rate similar for both species? What is the model of the relationship? Variables GENUS: genA, genB temp rate Background Yield of cereals is determined by a number of variables. To predict yield with high accuracy, various effects have to be studied. Design On 100 plots, the yield of wheat [t/ha] was estimated together with six other variables: 1. number of overwintering plants, 2. number of ears/m2, 3. pH of soil, 4. content of phosphorus [mg/kg], 5. content of potassium [mg/kg], 6. content of magnesium [mg/kg]. Hypotheses Did any of six variables affect the yield? If so which ones? What is the model for prediction of yield? Variables winter ears pH P K Mg yield • When two or more variables show correlation • PCA can reduce dimensionability of variables – use PCA scores instead