1 analysis of biologícdX daíta Institute of Botany and Zoology Masaryk University, Brno Stano Pekár ©inflM 2) The first example Systematic components 3) Stochastic components Analyses of continual measurements 4) Analyses of continual measurements II Analyses of counts 1) R Environment Exploratory Data Analysis Regression models 5) Analyses of counts II Analyses of proportions Crawley M. 2002. Statistical computing. An Introduction to Data Analysis using S-Plus. Wiley & Sons. Crawley M. 2005. Statistics: An Introduction Using R. Wiley. Dalgaard P. 2004. Introductory Statistics with R. Springer. Faraway J.J. 2005. Linear Models with R. Chapman & Hall/CRC. Fox J. 2002. An R and S-Plus Companion to Applied Regression. Sage. Maindonald J. & Braun J. 2003. Data Analysis and Graphics Using R. Cambridge. Pekar S. & Brabec M. 2009. Modern Analysis of Biological Data.l. Generalised Linear Models in R. Scientia, Praha. [in Czech] http ://w ww. scientia.cz/katalog. asp?idk= 16 l&nav=&id=l 931 • very fast due to use of computers • chose statistical models that approach data characters This course • focuses on regression models in a broad sense • only on linear models • with only one response variable (univariate methods) • with independent observations Response variable • (dependent) is the variable whose variation we aim to understand, the variable that we measure, it goes on ordinate - continuous measurement, count, proportion (y) Explanatory variable • (independent) is the variable that we manipulate (select levels), interested to what extent is variation in response associated with variation in explanatory variables, displayed on abscissa - numeric: continuous or discrete measurements (x) .. covariate - categorical.. a factor (A, B) with two or more levels (Al9 A2, ..Bl9 B2, ..) £ŕafisttc3\ • environment for the manipulation of objects - data manipulation, calculation and graphical display - a high-level programming language • combination of S (developed at AT&T Bell Laboratories and forms the basis of the S-PLUS systems) and Scheme languages • initially written by Gentleman & Ihaka (1996), nowadays with many contributors (R Development Core Team) - includes about 30 standard packages - available 2000 additional packages • user-unfriendly (limited pull-down menus) - based on commands - pull-down menus only for basic commands Pros • freeware • one of the largest statistical systems • open environment with more dynamic development than other systems • whereas Statistica or SAS will give copious output, R will provide minimal output • makes you think about the analysis Cons • no warranty • user-unfriendly OjBQSMDH • available from www.r-project.com • copy data to C:\Program Files\R\R-2.9.0\MABD • install sciplot package + -*/>< == equal ! = not equal <= less than or equal A power • logical values T .. TRUE, F .. FALSE Functions • trigonometric sin, cos, tan, asin, acos, atan • logarithmic: log, log2, loglO •sqrt, exp, abs, sum, prod •seq, c, which, length, cbind, xbind, matrix • names are case sensitive • is not allowed to use • avoid using names: break, c, C, D, diff, else, F, FALSE, for, function, I, if, in Inf, mean, NA, NaN, next, NULL, pi, q, range, rank, repeat, s, sd, t, T, tree, TRUE, var, while • vectors: numeric, character, logical • arguments (in parentheses): use their names or without at specified order • centring: to subtract mean • scaling: to divide by SD Created in R: • use data. frame, rep, factor, levels, relevel • export: write. table Imported: - from Excel via clipboard dat <- read.delim("clipboard") or via TXT file dat <- read.delim("c: \\MABDWmetal.txt") • data matrix: - number of columns = number of variables - first row contains names of variables (names without blank spaces) • each row corresponds to an observation (trial, etc.) • factors levels can be names or coded as numbers • all columns must have the same number of rows • missing data are assigned as NA - is.na -$ attach(dat) names(dat) soil field dis tance amount moist pasture 12 0.22 moist pasture 22 0.11 moist pasture 43 0.29 moist pasture 23 0.33 moist rape 32 0.19 moist rape 67 0.39 moist rape 54 0.18 moist rape NA 0.29 dry pasture 11 1.16 dry pasture 33 1.03 dry pasture 45 1.11 dry pasture NA 1.33 dry rape 55 1.02 dry rape 41 1.23 dry rape 14 1.05 dry rape 27 1.12 • a visual (tabular or graphical) analysis of the data Important to - check errors - get an idea of the result - suggest a model - check assumptions for use of desired methods - set hypotheses - look for unexpected trends • use expected values and variation • E(y), ju: theoretical long-term average of a variable - one of a few characteristics of a distribution - for discrete distributions E(y) might not be a possible value - estimate of E(y) is mean ... mean - a robust estimate for asymmetric distributions is median: ... median - another robust estimate is trimmed mean: mean where a*n observations are removed from each tail... mean (y, trim=) Example Find mean, median, and mean trimmed by 10% of the amount variable. Data: metal.txt • Var(y), a2: a theoretical measure of the variability in a variable • minimum and maximum ... range, min, max • quantiles (0, 25, 50, 75, 100%) ... quantile • estimate of Var(y) is s2... var • standard deviation (s) ... sd • standard error of the mean ... WSk 5= Example Find variance, standard deviation, range and standard error of the mean for amount. • of a parameter (mean): if large number of samples is taken froma population then a% of intervals will contain mean • based on quantiles of the t distribution qt - lower CL - upper CI95 • for asymmetric distributions CI95 is estimated on transformed values —» asymmetric intervals • from model objects use function conf int Example Find 95% confidence intervals of mean for amount. • basic summaries (min, max, Q25, Q75, median, mean) for all variables., summary • summary table for data with explanatory variable(s) .. tapply • to count frequencies .. table Example Make a summary table, table of replications for FIELD, table of means for SOIL and FIELD, and table of SEM for FIELD. • see demo (graphics) or demo (image) • graphs - basic: plot - advanced: xyplot (library lattice) • to get all graphic parameters: ?par • to split window to subplots: par (mf row) • to add legend .. legend • graph window size: xll plot Argument Values type= Style: "n" (empty), "p" (scatter), "1" (lines), "b" (both), "h" (vertical) las= Style of axes values: 0 (parallel), 1 (horizontal) 2 (perpendicular), 3 (vertical) xlab,ylab= Text of axes labels: cex.lab= Size of axes labels: 1, . . xlim, ylim= Range of axes: c(min, max) cex.axis= Size of axes values: 1, . . log= Logarithmic scale of x, y or xy main= Text of title: main.cex= Size of title: 1, . . points Values_ Type of symbols: 0, . ., 18, "letters" Size of symbols: 1, ... Colour: , , 3, , 5, , 7, 8 Font type: 1, 2, 3, 4 • to study distribution of a numeric (response) variable • histogram ..hist • stem-and-leaf plot.. stem • q-q plots to compare distribution of two variables - compare a single variable with normal: qqnorm - compare distributions of two variables: qqplot - to add diagonal line: qqline Example Make histogram and q-q plot of distance. Deviations from normal distribution Asymmetric, right-skewed Symmetric with extended tails Symmetric with heavy tails for data with continuous explanatory variables to produce plots with points: plot Example Make scatterplot of distance on amount without and with different points for two levels of SOIL. • when there are categorical explanatory variables • function ..plot - central line represents median - box is Q25 and Q75 - whiskers are 1.5 times interquartile range - circles are outliers • argument notch for boxes with CI95 for median - if median of one level falls outside notches of another level, there is likely significant difference Example Make boxplot of amount for SOIL without and with notches. • for data with both categorical and continuous explanatory variables • xyplot from library lattice - separate plots for each level of a factor: y ~ x|A - several types: type="r" .. regression plot Example Make panel scatterplot regression plot of distance against amount for SOIL. • for data with two categorical explanatory variables • to plot means of two factors (A, B) connected by lines .. interaction .plot - A is plotted on axis x - B is in the legend • visual assessment of interaction between factors A *B ovA:B - two factors can affect response additively or multiplicatively - additive effect: parallel lines - multiplicative effect: crossed lines Example Make interaction plot of SOIL and FIELD for amount. Itr plot • when data are counts or proportions - data are arranged in a matrix or table • barplot: beside, legend Example Make barplot of SOIL and FIELD for amount. when data include several continuous explanatory variables pairs produces matrix of all possible plots when data include 2 continuous explanatory variables wireframe (lattice) produces 3-dimensional plot • to display error bars use vertical lines or sciplot package • plot empirical means and errors - bargraph.CI - lineplot.CI Example Make barplot of SOIL and amount and line plot of SOIL and FIELD and amount. • final plot of estimated models • lines connects points specified by coordinates • abline produces line specified by intercept and slope includes systematic and stochastic components • assumptions of the stochastic component: = variance is equal = homoscedastic model To find real model we need to estimate its parameters: a, p, a2 as a, b, s1 so that we get • extension of the systematic component Simple regression 1-way ANOVA Linear model (LM) has a general form y = a + ßlxl+ ß2x2 +.. + ßkxk + £ linear predictor x can include: u2, um, log(w), exp(w), sin(w), factors = model is linear in parameters when it includes only linear combinations of parameters Some nonlinear relationships can be linearised • log-transformation of both sides: • e£ has lognormal distribution while s has normal distribution • y has heterogenous variance z has homogenous variance • e£ is multiplicative while s is additive • curved relationship becomes linear Other nonlinear relationships can not be linearised use Nonlinear regression extension of the stochastic component we model transformed expected value of y f([i).. link function For example, ■ GLM has 3 components: • link function • linear predictor • distribution family - Gaussian (normal), Gamma, Inverse Gaussian, Poisson, Quasipoisson, Binomial, Quasibinomial, Quasi • measure of fit is deviance not sum of squares - null deviance = SST - residual deviance = SSE - ANODEV table = ANOVA table CBcdcdcQ aDCDiIca • a useful simplification of the reality - should include important aspects for which it is being made and ignore aspects that we are not interested in - like a good map • Principle of parsimony: Simpler model is better if it explains study phenomenon as good as complicated model. G. E. P. Box: „A11 models are wrong. But some of them are useful." Bottom -up or forward selection • building up a model by adding available variables Top-down or backward selection • reducing maximal (saturated) model 1. Fit maximal model- all main effects and interactions 2. Remove insignificant interactions and main effects 3. Group together similar factor levels 4. Check diagnostic plots 5. Alter model if necessary 6. Achieve minimal adequate model - contains only terms in which all parameters are significantly different • to assess model quality and assumptions - study of both systematic and stochastic components - we can never prove that model is adequate should not • make trends when plotted against explanatory or response variables • be heteroscdeastic • have unusual distribution • be interdependent Checking assumptions • informal using plots - plot produces 6 plots • formal using tests • raw (LM) or deviance (GLM) residuals against fitted values • curved pattern suggests lack of polynomial term • q-q plot of standardised (LM) standardised deviance (GLM) residuals • data from other than normal distribution can not have normally distributed residuals • when the pattern is "J" or "S" shaped change link function or transform the variable iardised deviance (GL1 values he mean use Poisson oi n • plot of Cook's distance for each observation shows the influence of individual observations on the model fit • values of influential observations are close to 1 and higher • check for errors in the data • omit influential observations or transform the explanatory variables (using log, power, reciprocal) • dependence on continual explanatory variable - using standardised (LM) or Pearson residues (GLM) • serial dependence if explanatory variable is time or space Independence on x Serial independence Serial dependence Background Nutritional quality of the diet affects growth of organisms in various ways. To find optimal diet for cockroaches the following experiments was performed. Design Effect of five diet types (control, lipidl, lipid2, proteinl, protein2) was tested on body weight [g] of male and female cockroaches. For each diet 10 females and 7 males were used. Their body weight [g] was recorded before and after the experiment. Hypotheses Is weight influenced by the diet type? If so which diet resulted in largest weight? Is weight on diets similar for males and females? Variables DIET, control, lipid 1, lipid2, protein 1, protein2 SEX: male, female start weight Data cockroach.txt AN OVA Tab It • anova uses Type I Sum of Squares - sequential assessment of effects according to the given order - at first main effects are assessed then interactions - in orthogonal the order is not important - if data are unortogonal it is more appropriate to use Type III SS Ortogonality • independent variables are orthogonal - effects are straightforward • correlated variables are unorthogonal - effects are complicated - when there are missing values or unequal number of observations per treatment (OMfljaffl® mm • check for curvature by fitting a separate quadratic term for continuous explanatory variables • quadratic model - a simple description of nonmonotonous trend • use either poly (x, 2) or x + I(xA2) • remove insignificant interactions - begin with the higher order terms because main effects are marginal to interactions - intercept is marginal to slope and both are marginal to the quadratic term • remove insignificant main effects Criteria • test (F or %2) and a given p-value (anova) • Akaike Information Criterion (AIC): - the more there are parameters in the model the better fit but worse explanatory power of the model - the lower AIC the better model • compare individual differences between factor levels • comparisons are valid only if a factor is significant Options: - Apriori contrasts (before analysis) - Posteriori simplification (after analysis) - Multiple comparisons (after analysis) - apriori contrasts are preferred to avoid excess of significant results where A- .. mean value of a level, w}.. contrast coefficient Creating contrasts - levels lumped together get the same sign - levels contrasted get opposite sign - levels excluded get 0 Contrasts are arranged in a matrix • only k-l (k .. number of levels) contrasts are orthogonal, i.e. each level (combination) is compared only once ... products of any two contrasts = 0 • specified by function contrasts prior to analysis Pre-specified contrasts: • Treatment (default in R) - compare specific level with the reference level • Helmert - compare specific level with the average of previous levels • Sum - compare specific level with the grand mean • Textbook - compare each level with 0 • levels of a factor are compared using Wald statistics from ouput • similar factor levels are the grouped together • test each grouping by anova • compare the final model with the first one We should check as many aspects as possible • use diagnostic plots • use formal tests: - Bartlett test to compare variances - Shapiro-Wilk test of normality Analysis dat<-read.delim(" cockroach.txt"); attach(dat); names(dat) plot(diet,weight) interaction.plot(diet,sex,weight) library(lattice) xyplot(weight-start|diet,groups=sex,pch=l:2) ml<-lm(weight-diet*sex*start) anova(ml) m2<-lm(weight-diet*sex*poly(start, 2 ) ) anova(ml,m2) m3<-update(ml,~.- diet:sex:start) anova(ml,m3) anova(m3) m4<-update(m3,~.- diet:start) anova(m4) m5<-update(m4,~.- sex:start) anova(m5) m6<-update(m5,~.- diet:sex) anova(m6) m7<-update(m6,~.- start) anova(m7) m8<-update(m7,~.- sex) anova(m8) summary(m8) levels(diet) contrasts(diet)<-cbind(c(1,-1/4,-1/4,-1/4,-1/4),c(0,-1/2,-1/2, 1/2 , 1/2 ) , c (0,0,0,1/2,-1/2),c(0,-1/2, 1/2, 0,0) ) contrasts(diet) summary(lm(weight-diet)) contrasts(diet)<-'contr.helmert' summary(lm(weight-diet)) contrasts(diet)<-'contr.sum' summary(lm(weight-diet)) dietK-diet levels(dietl) levels(dietl)[4:5]<-"prot" levels(dietl) contrasts(dietl)<-'contr.treatment' m9<-lm(weight-diet1) anova(m8,m9) diet2<-dietl levels(diet2)[2:3]<-"lipid" ml0<-lm(weight-diet2) anova(m9,mlO) summary(mlO) diet3<-diet2 levels(diet3)[2:3]<-"other" mll<-lm(weight-diet3) anova(mlO,mll) anova(mlO,ml) plot(mlO,which=l:4) Shapiro.test(resid(mlO)) library(sciplot) lineplot.CI(diet2,weight,ylab="Weightxlab="Diet") • the same explanatory variable can be taken once as continuous other time as categorical: e.g. two levels of concentration • continuous variable allows interpolation and extrapolation Key to methods: Explanatory variable(s) Method Continuous Regression Categorical ANOVA Continuous and categorical ANCOVA Linear predictor can include various terms: - intercept.. a estimated as a - linear term .. ßx with b as coefficient of linear trend - quadratic term .. yx2 with c as coefficient of quadratic trend - cubic term .. tx3 with t as coefficient of cubic trend - main effect.. A - interaction between factors .. A:B - interaction between continuous variables - linear interaction .. A:x - quadratic interaction .. A:x2 • simple regression ... 1 explanatory variable • multiple regression .. 2 and more explanatory variabels General linear predictor of multiple regression a .. intercept ßk .. linear coefficients of xk x .. may represent polynomic functions (x3), interactions (xvx2) - rule of thumb: less than n/3 parameters in model at any time - number of combinations of explanatory variables will often exceed the number of data so we can not include all terms Simplification • linear predictor with 2 explanatory variables (xl9 x2) should include all main effects, all interactions, and quadratic terms a + ßxxx + ß2x2 + yxxx + y2x2 + Sxx with estimates a, bv b2, cl9c29 d Nested models are: • 5 parameters (a, bv b2, cv c2), at least c7 and c2 are significantly different • 4 para] • 3 para] differen , b2, Cj), at least q is significa , b2), at least Z?x and Z?2 are sig1 fferent tly a+b^+b-pc.. If one explanatory variable (x2) turns out to be insignificant: • 3 parameters (a, b, c), at least c is significantly different • 2 parameters (a, b), at least b is significantly different • 1 parameter (a) that is significantly different AN OVA • 1-way ANOVA .. 1 factor • 2-way ANOVA .. 2 factors • k-way ANOVA .. k factors • k-way ANOVA might be with our without interactions Given 2 categorical variables A and B each with 2 levels (Al9A and BA, B2) model with treatment contrasts is a .. mean of AxBl9 A{ and B .. main effects, A\B- .. interaction • 4 parameters (AXBX, A2B1-A1B1, A1B2-A1B1 a.A2B2-A1B2): interaction is significant • 3 parameters (AXBX, A2B1-A1B1, B2-B{): only A and B are significant • 2 parameters (Bv B2-B^): only B is significant • 1 parameter (grand mean): null model • combination of regression and ANOVA • continuous variable = covariate Given 1 factor (A:) and 1 covariate (x) linear predictor is: a .. intercept, A- .. effect of factor, P .. slope, 8 .. effect of interaction Given 1 categorical variable A with 2 levels (Av A2) and 1 continual x, the linear predictor will be • 6 parameters - 2 intercepts (av a2-a^), 2 slopes (bp b2-b1), 3 quadratic (q, c^c^) - interaction A:x2 is significant • 4 parameters - 2 intercepts (av a2-a^), 2 slopes (bp b^-b^) -interaction A:x is significant, but quadratic terms are not significant • 4 parameters - 2 intercepts (av a2-al), 1 slope (b), 1 quadratic (c) - interactions A:x2 and A:x are not significant, but A and quadratic terms are significant • 3 parameters - 2 intercepts (av a2-al), 1 slope (b) - only main effects (A and x) are significant • Further simplification —» 1-way ANOVA or simple regression response variable ~ explanatory variable^) • Operators: - on left side any mathematical operator can be used - on the right side only few: + .. add - .. delete : .. interaction * .. all terms 1 .. intercept I .. interpreter that translates operators into mathematical meaning / .. nested | .. conditioned Model formula Description y ~ 1 Null model y ~ x Linear model with 1 explanatory variable log(y) ~ x -1 Linear model with 1 explanatory variable, without intercept and with log-transformed response y ~ x + I(xA2) Quadratic model with 1 y ~ poly(x,2) explanatory variable y ~ xl + x2 Linear model with — 2 explanatory variables El Model formula Description y - A*B*C 3-way ANOVA with y ~ A +B+C+A:B three main effects, + A: C+B: C+A: B: C two 2-way interactions and one 3-way interaction y - (A+B+C)A2 3-way ANOVA with only three 2-way interactions f(iuijk) = a + Ai+Bj+C J + A:Bij+A:Cik+B:Cjk +A:B:C ijk f(Mijk) = a+Ai+Bj+Ck + A:Bij+A:Cik+B:Cjk ~ x*A 1-way ANCOVA f(jilj) = a + Aj+jSxl+SJx, • choose distribution if using GLM • there are many distributions but only some are available for GLM • decision should be based upon theoretical models or previous experience (BimniiffliiB onasmanmi^ i 1 • measurements that can be made with infinite precision ■■■iM u AMU bell-shaped, symmetric around mea mean = median = modus parameters: |i, o2 s2 is independent of mean D3.+::^/+-C • discrete values, made of integers • asymmetric, skewed to the right • variance increases with mean at quadratic trend • after logarithmic transformation variances are similar • positive real values • asymmetric, skewed to the right • variance increases with mean at a quadratic trend • Inverse Gaussian distribution - used to model diffusion processes - variance increases steeply with mean (BaXDQDGS • discrete values, made of integers • asymmetric, skewed to the right • variance is equal to expected value - variance increases with mean • discrete values, made of integers • asymmetric, strongly skewed to the right • variance is larger than expected value - variance increases with mean at a parabolic trend • arise when we counts events (y) from a whole population (n) • p .. relative frequency = yln • we study only qualitative character of an event not its quantitative aspect • p is an estimate of a theoretical value n • based on logit transformation measurements (y 7ia single parai 0<7T< 1 variance of n is n it mm ■ response variable is continuous • measurements of length, width, distance, concentration, pH, etc. • data are real numbers • distribution is symmetric (-00, +00) • parameters: ju, o1 independent of each other • t-test (t. test) to compare one or two means • Linear model (lm) to study effect of categorical and continuous variables - inference is exact, reliable for each n • GLM (glm) to study effect of categorical and continuous variables - Gaussian family (default) - link: identity - inference is asymptotic, valid only for large n glm(formula, fami1y=Gaus s ian) Background The number of grains in ears affects the yield of cereals. Design On 20 plots mean number of seeds per oat ear was estimated. Then at harvest the yield [t/ha] for each plot was estimated. Hypotheses Is number of seeds related to the yield? What is the predictive model of this relationship? Variables grain yield Data oat.txt Analysis dat<-read.delim("oat.txt"); attach(dat); names(dat) plot(grain,yield) ml<-lm(yield-poly(grain, 2 ) ) summary(ml) m2<-lm(yield-grain) summary(m2) m3<-update(m2,-.-1) summary(m3) AIC(m2,m3) plot(grain,yield,xlim=c(0,30),ylim=c(0,6)) abline(m2) abline(m3,lty=2) legend(18,3,c("m2","m3"),lty=l:2) plot(m2,which=l) sr<-rstandard(m2); plot(grain,sr) 0.07 617+c(-0.0111,0.0111)*qt(0.97 5,19) 2*(1-pt((0.1-0.07617)/0.0111,18)) lii|ijfii ltp§j§jti Weighting • to increase/decrease effect of some measurements • only positive values are allowed • instead of least squares weighted least squares are used Background Sexual size dimorphism may increases with ambient temperature in spiders. Design Males and females of Zodarion spiders were sampled on 13 sites with a different temperature [°C]. Of the average size of males and females a size ratio was calculated for each site. The number of individuals varied between sites (2 to 62 specimens). Hypotheses Is there relationship between the ratio and the temperature? What is the model? Variables temp number ratio Data zodarion.txt Analysis dat<-read.delim("zodarion.txt"); attach(dat); names(dat) plot(temp,ratio) ml<-lm(ratio-poly(temp,2)) summary(ml) m2<-lm(ratio-temp) summary(m2) m3<-update(m2,weights=number) summary(m3) plot(temp,ratio,xlab="Temperature",ylab="Size ratio") abline(m2) abline(m3,lty=2) legend(6,1.15,c("m2","m3"),lty=l:2) Yield of cereals is determined by a number of variables. To predict yield with high accuracy, various effects have to be studied. On 100 plots, the yield of wheat [t/ha] was estimated together with six other variables: 1. number of overwintering plants, 2. number of ears/m2, 3. pH of soil, 4. content of phosphorus [mg/kg], 5. content of potassium [mg/kg], 6. content of magnesium [mg/kg]. Hypotheses Did any of six variables affect the yield? If so which ones? What is the model for prediction of yield? Variables winter ears pH P K Mg yield Data wheat.txt Analysis dat<-read.delim("wheat.txt"); attach(dat); names(dat) pairs(yield~winter+ears+pH+P+K+Mg,panel=panel.smooth) ml<-lm(yield-(winter+ears+pH+P+K+Mg)A2 + I(winterA2)+1(earsA2)+1(pHA2)+ I(PA2)+1(KA2)+1 (MgA2) ) anova(ml) m2<-step(ml) anova(m2) summary(m2,corr=T) wl<-scale(winter); el<-scale(ears); pHl<-scale(pH) PK-scale(P); KK-scale (K) ; MgK-scale (Mg) m3<-lm(yield-(wl+el+pHl+Pl+Kl+Mgl)A2 + I(wlA2)+1 (elA2)+1 (pHlA2) + I(P1A2)+1(K1A2)+1(MglA2)) anova(m3) m4<-update(m3,~.-wl:pHl); anova(m4) m5<-update(m4,~.-el:Kl); anova(m5) m26<-lm(yield~wl+pHl+Kl+I(pHlA2)) anova(m2 6) mean(winter);sd(winter) mean(pH);sd(pH) mean(K);sd(K) summary(m2 6,corr=T) plot(m26,which=l) plot(m26,which=2) plot(m26,which=3) sr<-rstandard(m2 6) plot(wl,sr) plot (pHl,sr) plot(Kl,sr) range(winter) range(K) plot (pH,yield,type="n") phh<-seq(4,8,0.2) yl<-8.71416+0.28494* (162-275 .6) /50 . 94 -0.01134*(phh- 5.852)/0. 381- 0.1888* ( (phh-5.852)/0.381)^2 -0 .09666* (60-106.7)/40. 39 lines (phh,yl) y2<-8.71416+0.28494*(162-275 . 6)/50.94 -0.01134*(phh- 5.852)/0. 381- 0.1888* ( (phh-5.852)/0.381)^2 -0 .09666* (320-106.7)/40 .39 lines (phh,y2,lty=2) y3<-8.71416+0.28494*(400-275 . 6)/50.94 -0.01134*(phh- 5.852)/0. 381- 0.1888*((phh-5.852)/0.381)^2 -0 .09666* (320-106.7)/40 .39 lines(phh,y3,lty=3) legend(6,9.5,c("w=l62,K=60", "w=162,K= 320", "w=4 00,K= 320"),lty = 1:3) Background The carcinogenic disease is related to the production of toxins by certain bacteria in the body of patients. Presence of toxins can be used as an indicator of certain carcinogenic disease. Design In a clinical study, the amount of a toxin [units/|il] produced by four bacteria species was measured in patients with two carcinogenic and two non-carcinogenic diseases. For each disease there were 20 patients. In each patient only a single bacterial toxin was measured so there were 5 replications per bacteria species. Hypotheses Is the amount of toxin similar for four bacteria species and four diseases? If not what is the difference? Which species can be used as an indicator? Variables SPECIES:bactevA, bacterB, bacterC, bacterD DIAGNOSIS:carc.vectum, carc.intestine, apendicitis, skin.absces toxin Data bacteria.txt Analysis dat<-read.delim("bacteria.txt"); attach(dat); names(dat) interaction.plot(species,diagnosis,toxin) ml<-lm(toxin-species*diagnosis) anova(ml) summary(ml) tapply(predict(ml),list(species,diagnosis),mean) diagnosisl<-c(rep("care",40),rep("non",40)) diagnosisl<-factor(diagnosisl) m2<-lm(toxin-species*diagnosisi) anova(ml,m2) interaction.plot(species,diagnosisl,toxin) speciesl<-species levels(speciesl) levels(speciesl)[2:3]<-"bacterBC" m3<-lm(toxin-species1*diagnosisi) anova(m2,m3) levels(speciesl) levels(speciesl)[c(1,3)]<-"bacterAD" m4<-lm(toxin-species1*diagnosisi) anova(m3,m4) anova(m4) summary(m4) anova(m4,ml) plot(m4,which=l) plot(m4,which=2) both<-paste(species1,diagnosisl) both<-factor(both) m5<-lm(toxin~both-l) summary(m5) confint(m5) interaction.plot(species1,diagnosisl,toxin,type="p", pch=l:2,ylim=c(1,2),ylab="Toxin amount",xlab="Species",legend=F) legend(1.5,1.9,c("Care","Non"),pch=l:2) lines(c(l,l),c(1.85,1.96)) lines(c(l,l),c(1.09,1.2) ) lines(c(2,2),c(1.35,1.46)) lines(c(2,2),c(1.07,1.18)) Background Rate of population increase is a function of temperature in ectotherms, such as mites. A model of the relationship is essential for the control of mite pests. ^^^^^^^^^^^^ Design In the lab, population increase of two pest mite species was studied at 11 temperatures between 10 and 35 °C. The rate of increase was estimated using formula for exponential population growth. For each temperature a single measurement for each species was available. Hypotheses Did temperature affect the rate of increase? Was the rate similar for both species? What is the model of the relationship? Variables GENUS: genA, genB temp rate Data mite.txt Analysis dat<-read.delim("mite.txt"); attach(dat); names(dat) plot(temp,rate,type="n") points(temp[genus=="genA"],rate[genus=="genA"]) points(temp[genus=="genB"],rate[genus=="genB"],pch=l6) ml<-lm(rate-poly(temp,3)*genus) anova(ml) m2<-lm(rate-poly(temp,3)+genus) anova(m2) m3<-lm(rate~temp+I(temp^2)+1(temp^3)) summary(m3) m4<-lm(rate~temp+I(temp^2)) summary(m4) plot(temp,rate,xlab="Temperature",ylab="Rate") x<-seq(from=0,to=4 0,by=0.1) lines(x,predict(m4,list(temp=x))) ci<-predict(m4,list(temp=x),se.fit=T) names(ci) ciU<-ci$fit+qt(.975,19)*ci$se.fit ciL<-ci$fit+qt(.025,19)*ci$se.fit lines(x,ciL,lty=3) lines(x,ciU,lty=3) ■ Gamma and lognormal data arise: • precise measurements of small quantities (concentration), weight, time, etc. • measurements are continuous - non-negative values and zeros are not allowed - distribution is skewed to the right • logarithmic transformation of measurements will homogenise variance and adjust asymmetry of distribution • moments - 2 parameters (jutr, otT) - while on log scale variance is independent of mean, on original scale variance is a function of expected mean • used to model inverse polynomials moments - 2 parameters (ju, cp) E(y) = /u\ \Var(y) = 1 - underdispersion: variance is smaller —» q> < 1 • causes: - if the distribution is aggregated - if counts are not independent - lack of important variables, etc. - suspicious data • solution: use quasipoisson family • this will influence SE of parameter estimates - if q> > 1 then SE will be larger -if q> < 1 then SE will be smaller • without correction for overdispersion there would be too many false positive results (in favour of HA) • when using quasipoisson %2- and z- tests have to change to F- and t- tests It liinsskR Background -~--T-w ^w^www Abundance of carabid beetles in cereals depends on abiotic and biotic factors. If we understand how abiotic factors influence abundance of carabids then we can adapt certain management practices to increase the abundance when needed. Design In the field, on 21 wheat plots the abundance of carabid beetles was studied by means of pitfall traps. At every site average day temperature [°C] and average sun activity [W/m2] was recorded. 64950663 Hypotheses Was abundance of beetles affected by any of the two variables? If so what is the model of the relationship? Variables temp sun abun Data carabid.txt Analysis dat<-read.delim("carabid.txt"); attach(dat) ; names(dat) pairs(abun~temp+sun,panel=panel.smooth) ml<-glm(abun~temp*sun,family=poisson) summary(ml) m2<-update(ml,family=quasipoisson) anova(m2,test="Fn) plot(m2,which=l) plot(m2,which=4) pr<-resid(m2,type="pearson") plot (sun,pr) plot(temp,pr) abun[21] m3<-glm(abun-temp*sun,poisson,subset=-21) anova(m3,test="Chi") m4<-update(m3,~.-temp:sun) anova(m4,test="Chi" ) summary(m4) (75.292-22.836)/75.292 range(sun) range(temp) xyz<-expand.grid(sun=seq(900,3500,50),temp= seq(9,30,0.5)) xyz$density<-as.vector(predict(m4,xyz,type= "response")) library(lattice) wireframe(density~sun+temp,xyz) Some predators use conditional strategies to catch prey. The use of strategy often depends on the characteristics of prey. Design In the field, it was observed which of three strategies spiders used to capture prey. For each trial, size (two size classes) and movement (slow or fast) of prey was recorded. Altogether 88 trials were observed. Hypotheses Is use of strategy influenced by prey size and its movement? If so which prey is captured by strategy A, B and C? Variables PREY: fast, slow SIZE: large, small STRATEGY: stratA, stratB, stratC freq Data predator.txt Analysis dat<-read.delim("predator.txt"); attach(dat); names(dat) interaction.plot(strategy,prey, freq) interaction.plot(strategy,size,freq) ml<-glm(freq-strategy* size*prey,family=poisson) summary(ml) anova(ml,test="Chi") m2<-update(ml,-.-strategy:size:prey) anova(m2,test="Chi") m3<-update(m2,-.-strategy:prey) anova(m3,test="Chi") summary(m3) attacks<-tapply(predict(m3,type="response"),list(size,strategy),mean) attacks both<-paste(strategy,size) m4<-glm(freq-factor(both)-1,poisson) summary(m4) exp (confint(m4)) barplot(attacks,beside=T,ylab="No. of attacks", xlab="Strategy", legend.text=c("large","small"),ylim=c(0,25)) lines(c(1.5,1.5),c(7,16.3)) lines(c(2.5,2.5),c(14.4,26.9)) lines(c(4.5,4.5),c(5.5,13.8)) lines(c(5.5,5.5),c(0.6,4.6)) lines(c(7.5,7.5) ,c(0.4,3.9)) lines(c(8.5,8.5),c(0.03,2.2)) yj/ialyse* ©d ©@m$m> \\\\ Stano Pekár ■ NB is a parametric alternative to Poisson model with overdispersion • distribution of y is strongly asymmetric with many zeros • NB has two parameters, ju and 6 • moments: - 6 is aggregation parameter (0,oo) - if 6 > 1 .. random distribution, 6 < 1 .. aggregated distribution - 6 can be estimated from glm. nb (formula) from MASS library • links: log (default) sqrt identity • begin with Poisson model, if overdispersion is large switch to glm. nb 1*wiyAN0VA Background Grain beetles are serious pests in grain stores. They may occur not only in the grain but also in crevices of corridors. It is essential to know where they occur before control methods are applied. Design Density of grain beetles was surveyed in a grain store by means of sticky traps. Traps were installed in two places: 25 traps in the corridors and 25 traps in the grain. After few days number of beetles was recorded. Hypotheses Is density of beetles similar on both places? If not how different it is? Variables PLACE: floor, grain density Data beetle.txt Analysis dat<-read.delim("beetle.txt"); attach(dat); names(dat) plot(place,density) table(density) tapply(density,place,mean) ml<-glm(density-place,family=quasipoisson) anova(ml,test="F") summary(ml) plot(ml,which=l) tapply(density,place,var)/tapply(density,place,mean) tapply(density,place,function(x) mean(x)A2/(var(x)-mean(x))) library(MASS) m2<-glm.nb(density-place) anova(m2) summary(m2) plot(m2,which=l) exp(confint(m5)) barplot(tapply(predict(m2,type="response"),place,mean),ylab="Density", ylim=c(0,200)) lines(c(0.7,0.7),c(49.6,197.2)) lines(c(1.9,1.9) ,c(9.7,38.9)) ■ Binomial data arise: • when we count response to a certain stimulus —» dose-response studies • whenever we record whether an event has occurred or not within a known population (n) • events: death, birth, germination, attack, consumption, reaction, etc. • there are no classical replications - records are clustered to p or q distribution is bounded [0 < p < 1] variance is not constant, maximal when p moments wam^^mi rhhi = q = 0.5 estimated parameters are on logit scale (-<», +oo) logistic model will always asymptote at 0 and 1 - predicted values are then always within [0, 1] • inverse function to logit is ani-logit where Q is a parameter estimate odds ratio • Exact binomial test (binom. test) to compare a single proportion • Proportion test (prop. test) to compare two proportions • Contingency tables (xtabs) to study effect of factors • Logistic regression to study effect of continuous predictors • Standard regression (lm) can be used after transformation - angular transformation - can predict values out of bounds (negative or >1) • Binomial GLM (glm) to study effect of both factorial and continuous predictors Data format: • Binomial distribution ... individuals within a group are homogenous - two vectors (y, n-y) or (y, n) of integers • Bernoulli (binary) distribution ... individuals within a group are heterogenous, each characterised by a continuous character -n=l - single vector of O's or l's 5& I w£m$?mffl$M arises when dispersion parameter cp - overdispersion: variance is larger —» cp > 1 - underdispersion: variance is smaller —» ^ < 1 • causes: - if the model is mispecified - lacks important explanatory variables - relative frequency is not constant within a group • solution: use quasibinomial family in which variance is estimated as ^^^^^^^^^M instead of this will influence SE of parameter estimates if q> > 1 then SE will be larger if cp < 1 then SE will be smaller changes P values • when using quasibinomial %2- and z- tests have to change to F- and t- tests , A Regression Background Production of eggsac is influenced by a number of variables, such as body size, i.e. amount of consumed food. For an experimental study we need to be able to predict probability of production at a range of body sizes. Design ^^^Hh^^I In the laboratory, production of eggsacs was studied in a spider with a variable body size [mm]. As the body size was measured with the precision of 0.5 mm, all 160 individuals were classified into size classes each containing 15 to 30 specimens. Females that produced eggsac were recorded. Hypotheses • Is eggsac production related to the body size? • If it is what is the shape of the relationship? • What is the model that can be used to predict eggsac production for spider sizes of 3-12 mm? Variables: body n eggs Data spider.txt Analysis dat<-read.delim("spider.txt"); attach(dat); names(dat) p<-eggs/n plot(body,p) tr<-asin(sqrt(p)) ml<-lm(tr~body+I(body^2),weights=n) summary(ml) m2<-update(ml,~.-I(body^2)) summary(m2) x<-seq(0,12,by=0.1) plot (body,tr) lines(x,predict(ml,list(body=x))) abline(m2,lty=2) legend(3,1.5,c("ml","m2"),lty=l:2) plot(body,p,xlim=c(3,12),ylim=c(0,1)) lines (x,sin(predict(ml,list(body=x)))^2) lines(x,sin(predict(m2,list(body=x)))^2,lty: =2) legend(5,0.4,c("ml","m2"),lty=l:2) y<-cbind(eggs,n-eggs) m3<-glm(y~body+I(body^2),family=binomial) summary(m3) m4<-update(m3,~ .-I(body^2)) plot(body,p,xlim=c(3,12),ylim=c(0,1)) lines(x,predict(m3,list(body=x),type="response")) lines(x,predict(m4,list(body=x),type="response"),lty=2) legend(5,0.7,c("m3","m4"),lty=l:2) summary(m4) m5<-update(m4,family=quasibinomial) summary(m5) anova(m5,test="F") , A tali MCOVA Background Synthetic insecticides often have a species-specific efficiency. The recommended doses or concentrations then have to adjusted. Design In the laboratory an effect of an insecticide on the mortality of two aphid species was studied. The insecticide was applied at 6 concentrations [ppm]. Each concentration was tested on 30 individuals of both aphid species. Hypotheses • Is mortality affected by the concentration? • Was the efficiency similar for both species? • What is the LC50 (i.e. 50% lethal concentration) for both species? Variables: SPECIES: A, B cone n dead Data aphid.txt Analysis dat<-read.delim("aphid.txt"); attach(dat); names(dat) p<-dead/n plot(conc,p,type="n") text(conc,p,labels=as.character(species)) y<-cbind(dead,n-dead) ml<-glm(y~log(cone)*species,binomial) anova(ml,test="Chi") m2<-update(ml,~.-log(cone):species) anova(m2,test=nChin) summary(m2) plot(m2,which=l) pr<-resid(m2,type="pearson"); plot(log(cone),pr) plot(log(cone),p,type="n",xlab="Log(Concentration)",ylab="Mortality") x<-seq(-3,2,0.1) A<-1/(1+exp(-1.382 5-1.2 328*x)); lines(x,A) B<-1/(1+exp(-1.382 5+2.2117-1.2 328*x)); lines(x,B,lty=2) legend(1,0.3,c("A","B"),lty=l:2) m3<-glm(y~species+log(cone)-1,binomial) summary(m3) library(MASS) dose.p(m3,cf=c(l,3),p=0.5) dose.p(m3,cf=c(2,3),p=0.5) Granivorous ants collect various seeds and bring them into nest. Sympatrically occurring species may show trophic niche partitioning related to the size of collected seeds. Seed preference of two ant species was studied in the laboratory. Each of 25 ants of both species was offered seeds of variable size expressed as its weight [mg]. Response of ants was classified as "yes" or "no" if it took or refused to take a seed, respectively. Hypotheses • Is acceptance related to the seed size? • Did both species have similar preference for seed sizes? • If not what is the threshold size of seeds for both species? (The threshold size is defined as a size that is accepted with higher than 90% probability) Variables: SPECIES'. specA, specB seed take Data ant.txt Analysis dat<-read.delim("ant.txt"); attach(dat); names(dat) library(lattice) xyplot(take-seed|species) ml<-glm(take-seed*species,family=binomial) summary(ml) anova(ml,test="Chi") m2<-glm(take-log(seed)*species,binomial) AIC(ml,m2) plot(seed,take,type="n",xlab="Seed weight",ylab="Transported") x<-seq(0,3,0.01) A<-1/(1+exp(-4.012+8.3 64*x)); lines(x,A) B<-1/(1+exp(-4.012+10.957+(8.3 64-19.14 7)*x));lines(x,B,lty=2) legend(1.5,0.8,c("specA","specB"),lty=l:2) (log(0.9/0.1)-4.012)/-8.34 6 (log(0.9/0.1)-4.012+10.957)/(-8.34 6+19.14 7)