Introduction to Biostatistics
Jakub Těšitel
Brno, 2021

1. Philosophical foundations of empirical science and statistics

You are all students of science. But have you ever thought about what the aim of science actually is? Probably, we can all agree that the aim of science is to increase human knowledge. But how is this done? We may think that adding new pieces of knowledge to what is already known is the process of science. These new pieces come in the form of universal statements (laws, theories) describing natural processes. Some scientific disciplines, including biology, use data or experience to increase current knowledge and are thus called empirical sciences. Intuitively, we may assume that the new pieces of knowledge are first collected as newly gathered experience or data (singular observations, statements), from which theories and hypotheses (universal statements) are built. Statistics would then be the language of empirical science used to summarize the data and to infer universal statements from the singular ones. This approach to empirical science is called induction. Although intuitive, it is not the approach we use in modern science to increase knowledge.

We may also agree that only true universal statements or theories represent a real addition to knowledge and may be used to infer correct causal explanations. So we should aim at truth, which should be an essential aspect of our scientific work. But how do science and scientists recognize the truth of their theories? This is not an easy task. Truth can be defined as a correspondence of statements with the facts¹. But the question is how to measure such correspondence. There are two apparent ways:

1. We can believe authorities who may issue a judgement on this. The authorities may be of various kinds: experienced scientists, distinguished professors, priests, or books written by them (note that this is well compatible with the accumulative process of science described above), or
2. We can believe that truth is manifest - that truth is revealed by reason and everybody (who is not ignorant) can see it.

The first way was largely applied in the Middle Ages, with the church, priests and the Bible as the authorities and ultimate source of truth. This led to a long-term stagnation of science and a few guys burnt at the stake. The second approach stems from the Renaissance thinking revolting against the dogmatic doctrines of the church. It was the foundation of many great discoveries made since the Renaissance. Unfortunately, there is also a devil hidden in this approach to truth. It lies in the fact that if truth is manifest, then those who cannot see it are either ignorant, or worse, pursue some evil intentions. Declaring itself the only science-based approach to society and politics, the Marxist-Leninist doctrine largely relies on the belief that its truth is obvious, which also provided justification for the ubiquitous cruel handling of its opponents whenever possible.²,³

1 Facts (i.e. for instance measurements) are (usually) considered true. There is always some sort of measurement error, but that is mostly negligible. Reporting false facts is unacceptable. It is basically cheating, which, if it occurs, has a great negative effect on knowledge, because challenging published facts is something that is rarely done.
2 Note here that if the conflict between Renaissance thinkers such as Galileo Galilei or Giordano Bruno and the church is viewed as a fight between two views on truth, both of which may lead to evil ends, you may reconsider the outright negative view of the representatives of the Inquisition. Nevertheless, burning your opponents at the stake is not an acceptable means of discussion in any case.
3 A strange mix of both approaches to truth is still largely applied in secondary education in some countries (e.g. Czechia). Textbooks and the teachers' knowledge may be used as the ultimate authority for truth. At the same time, students are punished for making mistakes (by low grades) because truth is manifest; if they cannot see it, they are considered ignorant and as such deserve the punishment.

Box 1. Misleading empirical experience
1. The ancient Greek philosopher Anaximandros (c. 610 - c. 546 BC) was the first to identify the Earth as an individual celestial body and presented the first cosmology. This was a great achievement of human reason. However, he supposed the Earth to be of barrel shape because he could only see the flat world around him - as we actually do.
Life of Anaximandros on the barrel-shaped Earth.
2. Jean-Baptiste Lamarck (1744-1829) formulated the first comprehensive evolutionary theory based on his naturalist experience with adaptations of organisms to their environment. He asserted that organisms adapt to their environment by adjustments of their bodies and that these changes are inherited by the offspring. This is very intuitive but was demonstrated to be false by a long series of experimental tests.

It seems that we have a problem with truth and need to find a way out of it. The solution of the problem was summarized by the philosopher of science Karl R. Popper (1902-1994). Popper states that although truth exists and we should pursue it, we can never be sure that our theories are true. This is because we are prone to make mistakes in the interpretation of what our senses tell us. This view is not that novel, as K. R. Popper himself refers to ancient Greek philosophers, some of whom had identified this paradox of truth. One illustrative account of this is the story of the prisoners in a cave contained in Plato's Republic. It is the story of prisoners who are kept in a cave from the very beginning of their lives and have their heads fixed so that they look at a wall. A fire is located far behind them, and persons and objects pass between the fire and the prisoners' backs, casting shadows on the wall, which the prisoners can see. Then, as Plato says (through the speech of Socrates): "To them, I said, the truth would be literally nothing but the shadows of the images." In this writing, Plato also declares that we are like these prisoners. This may seem strange, as we tend to believe that what we see is real, but consider e.g. the recent observation of gravitational waves. We observe them with super-complicated and ultra-sensitive devices and can only see their shadows (nobody can see them directly).

Although we can only see shadows of reality, these shadows still contain some information. We can actually use this information to make estimates about reality and, more importantly, to demonstrate our universal statements false. The ability to demonstrate some theories and hypotheses false is the principal strength of empirical science. This leads to the rejection of theories demonstrated not to be true, while those for which falsifying evidence is not (yet) available are retained.
If a theory is rejected on the basis of falsifying evidence, a new one can be suggested to replace it, but note that this new theory is never produced by an "objective" process based on the data. Instead, it is produced by subjective human reasoning (which, however, aims to formulate a theory that is not in conflict with objective facts). In summary, experience can tell us that a theory is wrong, but no experience can prove the truth of a theory (note that we actually do not use the word "proof" in the terminology of empirical science). Consider e.g. the universal statement "All plants are green". No number of observed green plants can prove it true. Instead, the observation of e.g. a single non-green parasitic Orobanche (Fig. 1.1) is enough to demonstrate that it is false.

Fig. 1.1. Non-green parasitic plant Orobanche lutea.

Our approach to doing science is thus not based on induction. Instead, it is hypothetical-deductive: we formulate hypotheses and from them deduce how the world should look if the hypotheses were true. If such predictions can be quantified, their (dis)agreement with reality can be measured by statistics. The use of statistics is, however, not limited to hypothesis testing. We also use statistics for data exploration and for parameter estimates.

Finally, you may wonder how biostatistics differs from statistics in general. Well, there is no fundamental theoretical difference; biostatistics refers to the application of statistical tools in biological disciplines. Biostatistics generally acknowledges that biologists mostly fear maths, so the mathematical roots of statistics are not discussed in detail and complicated formulae are avoided wherever possible.

Literature
Plato: Republic (Book VII)
Popper KR: Conjectures and Refutations
Popper KR: The Logic of Scientific Discovery
https://en.wikipedia.org/wiki/Anaximander
https://en.wikipedia.org/wiki/Jean-Baptiste_Lamarck

2. Data exploration and data types

If you have some data, say a variable describing observations of 100 objects (e.g. tail length of 100 rats), you may wish to explore these values to be able to say something about them. That is, you may wish to describe the data using descriptive statistics. The data are here:

[1] 4.57 5.69 4.49 6.09 5.46 6.28 4.90 5.80 4.39 4.32 4.85 4.05 6.36 3.10 5.30 3.74 5.45 4.08
[19] 4.97 3.31 4.71 5.49 6.37 5.32 5.31 5.20 2.29 3.91 4.09 5.59 6.85 3.56 6.13 3.73 6.41 4.01
[37] 4.77 5.84 6.37 6.49 5.27 5.26 5.92 5.27 4.17 7.00 4.73 5.26 5.17 3.76 7.03 6.79 5.94 7.42
[55] 5.87 5.61 5.25 4.45 4.41 7.27 5.53 5.69 3.59 5.47 5.69 3.63 2.03 5.65 3.36 3.60 5.39 3.90
[73] 5.82 3.17 3.73 4.81 4.70 4.71 5.02 5.61 2.99 3.96 3.28 4.99 5.30 5.23 6.06 6.31 5.60 5.85
[91] 5.15 4.62 5.79 5.36 3.89 4.35 5.26 3.76 4.68 5.77

First, we need to know the size of the data, i.e. the number of observations (n). Here n = 100. Second, we are interested in the central tendency, i.e. a middle value around which the data are located. This is provided by the median, which is the middle value⁴ of the dataset ordered from the lowest to the highest value. Here med = 5.24. Third, we need to know the spread of the data. A simple characteristic is the range (minimum and maximum). Here min = 2.03 and max = 7.42. However, the minima and maxima may be affected by outliers and extremes. While it is useful to know them, we may also prefer some more robust characteristics. These are provided by quartiles. Quartiles are the 25% and 75% quantiles.
An XX% quantile is a value below which XX% of the observations lie. In our case, the first quartile (25%) = 4.15 and the third quartile (75%) = 5.71. The second (50%) quartile is the median.

These descriptive statistics can be summarized graphically in the form of a boxplot. That is very useful for comparisons between different datasets (e.g. a comparison of mouse tail length with a similar dataset on rats):

Fig. 2.1. Boxplot displaying tail length of mice and rats. The bold lines in boxes represent medians, boxes represent quartiles (i.e. 25% and 75% quantiles) and the lines extending from the box boundaries (whiskers) represent the range or non-outlier range of values, whichever is smaller. The non-outlier range is defined as the interval between (25% quantile - 1.5 x interquartile range) and (75% quantile + 1.5 x interquartile range). Any point outside this interval is considered an outlier and is depicted separately.⁵

Another useful type of plot is the histogram. The histogram is very useful for displaying data distributions (but less so for comparisons between different datasets). To plot a histogram, values of the variable are assigned to intervals (also called bins). The number of observations (frequency) within each bin is then plotted in the graph.

Fig. 2.2. Histogram of mouse tail length.

4 Note that if n is even and the two values closest to the middle are not equal, the median is computed as their arithmetic mean.
5 This is a very detailed description of a boxplot. Usually it can be briefer. Still, I was forced to make it this detailed by the editor of one paper I published.

Types of data

The data on mouse tail length we have explored are called data on a ratio scale. Several other types of data can be defined on the basis of their properties. These are summarized in Table 2.1. In ratio-scale and interval data, a further distinction can be made between continuous and discrete data, but that makes little difference for practical computation.

Table 2.1. Summary of data types: definition and properties.
Data type | Criteria | Possible math. operations | Examples | Object class in R
Ratio scale data | constant intervals between values, meaningful zero | +, -, *, / | length, mass, temperature in K | numeric
Interval scale data | constant intervals between values, zero not meaningful | +, - | temperature in °C | numeric
Ordinal data (also called semiquantitative) | variable intervals between values | comparison of values | exam grades, Braun-Blanquet cover | numeric (but may require conversion)
Categorical data | non-numeric values | none | colors, sex, species identity | character, factor

Categorical variables cannot be explored by the methods described above. Instead, frequencies of individual categories can be summarized in a table, or a barplot can be used to illustrate the data graphically. Consider e.g. 163 bean plant individuals with flowers of three colors: white, red, purple.

Fig. 2.3. Barplot of frequencies of flower colors in the bean dataset.

How to do in R
Obtaining R: download from https://cloud.r-project.org/ and install
Obtaining RStudio: download from https://rstudio.com/products/rstudio/download/#download and install
Installing an R package: install.packages("package")
Loading an R package for use: library(package)
Importing data:
1. Using the clipboard: read.delim("clipboard") with decimal point format, or read.delim2("clipboard") with decimal comma format
2. Directly from Excel: install and load the package readxl, then read_xlsx("file")
Size of data: function length
Median: function median
Range: function range
Minimum: function min
Maximum: function max
Quartiles: function quantile; with default settings it produces 5 values: min, lower quartile, median, upper quartile, maximum
Boxplot: function boxplot; supports the formula notation, i.e. response variable ~ classifying variable
Histogram: function hist
Barplot: function barplot; requires frequencies to be provided e.g. by table or tapply
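The following is a minimal sketch pulling these functions together; the object names (rat.tail, tails, flower.col) are hypothetical, and the mouse values are made up for illustration only:

rat.tail <- c(4.57, 5.69, 4.49, 6.09, 5.46, 6.28, 4.90, 5.80, 4.39, 4.32)  # first 10 values of the rat data above

length(rat.tail)    # number of observations (n)
median(rat.tail)    # central tendency
range(rat.tail)     # minimum and maximum
quantile(rat.tail)  # min, 1st quartile, median, 3rd quartile, max

hist(rat.tail, xlab = "Tail length (cm)", main = "")   # histogram of the distribution

# boxplot comparing two groups stored in one data frame (formula notation)
tails <- data.frame(length = c(rat.tail, rat.tail - 1),
                    species = rep(c("Rat", "Mouse"), each = length(rat.tail)))
boxplot(length ~ species, data = tails, ylab = "Tail length (cm)")

# barplot of counts of a categorical variable
flower.col <- factor(c("purple", "red", "white", "red", "purple", "white", "red"))
barplot(table(flower.col), xlab = "Flower color", ylab = "Frequency")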
3. Probability, distribution, parameter estimates and likelihood

Random variable and probability distribution

Imagine tossing a coin. Before you make a toss, you don't know the result and you cannot affect the outcome. The set of future outcomes generated by such a process is called a random variable. Randomness does not mean that you do not know anything about the possible outcomes of this process. You know the two possible outcomes that can be produced and also the expectation of getting one or the other (assuming that the coin is "fair"). A random variable can thus be described by its properties. This description of the process generating the random variable is then indicative of the expectations of individual future observations, i.e. probabilities.

We are not limited to a single observation but can consider a series of them. Then, it makes sense to ask e.g. what the probability is of getting fewer than 40 heads in 100 tosses. If we do not fix the value to 40 but instead study the probabilities for all possible values (here from 0 to 100), we can define the probability associated with each value as:

pi = P(X < xi)

where pi is the probability of observing a value lower than a given value xi. Then we can construct the probability distribution function, defined as the cumulative sum of the probabilities of the individual values:

F(xi) = Σ x<xi P(X = x)

Fig. 4.2. Boxplot displaying the values of the variable y2 for individual categories of type.1. Note the non-symmetric distributions and the outliers.

How to do in R
function boxplot applied on the formula numeric~factor produces the boxplot
Useful tips: The parameter col="grey" makes the boxplot more readable/elegant; other colors may be used. The midpoints of the boxes are located at integer numbers on the x-axis (unless changed by parameters).

Modern alternatives to boxplots

Boxplots have many advantages, which makes them a standard plot type for displaying associations between a categorical and a quantitative variable. However, there are also some issues. One obvious issue is that they do not display the mean values. In addition, they may provide misleading results if the underlying distribution is e.g. bimodal. For these reasons, alternative plot types called the bean plot and the violin plot were developed. While useful, their use is still rather limited in the biological community. For more information, see e.g. https://cran.r-project.org/web/packages/beanplot/vignettes/beanplot.pdf.

Fig. 4.3. Beanplot displaying the values and densities of the variable y2 for individual categories of type.1. The dotted line, bold lines and short narrow lines indicate the global mean, group means and individual observations, respectively.
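As an illustration, a minimal bean plot sketch is given below. It assumes the example data frame xy.2 with the numeric variable y2 and the factor type.1 used in the figures of this chapter, and the beanplot package referenced above:

install.packages("beanplot")   # only needed once
library(beanplot)

# bean plot of y2 for each level of type.1 (assumes the data frame xy.2 exists)
beanplot(y2 ~ type.1, data = xy.2, col = "grey", xlab = "type.1", ylab = "y2")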
Barplots mostly display means of quantitative variables, in particular differences between means of individual categories (factor levels). To judge differences between means, it is necessary to also display a characteristic of the uncertainty of the mean estimate or of variability. Therefore, barplots are usually supplemented with error bars displaying standard errors, confidence intervals or standard deviations. Of these, the best general choice is probably the confidence interval, which indicates the range of values within which the population mean lies with 95% probability (more on that in chapter 7). In any case, a specification of the error bars (what they display) must always be included in the graph caption. The strength of barplots is that they allow judging differences between means. However, this comes with a substantial loss of information: barplots do not display the distribution at all and may even be misleading in this respect. The y-axis range in barplots should always start at 0. Displaying negative values (or a combination of negative and positive values) may be awkward. Barplots may also be used to display counts, where they display the raw data without any loss of information.

Fig. 4.4. Barplot displaying mean values of the variable y2 for individual categories of type.1. Error bars indicate 1 standard error.

How to do in R
function barplot requires a numeric vector of bar heights (i.e. means)
Error bars are supplied by the function arrows:
arrows(x0=x.coords, y0=means-err.b, y1=means+err.b, code=3, length=0.05)
where err.b is the error bar parameter (standard error or confidence interval). It may also be the range, but then the interval is not symmetric.
The midpoints of the bars are not located at integer numbers on the x-axis. To get their coordinates, you need to save the output of barplot in a vector, like x.coords<-barplot(x); this plots the barplot and saves the midpoints in the x.coords vector.
Useful tips: The barplot function does not allow the formula input and data parameter. The parameter col="grey" makes the barplot more readable/elegant; other colors may be used.
R does not have a dedicated function for plotting a barplot with error bars. I have made one for you: see barplotN.R in IS (Study Materials/Learning Materials/Rfunctions). The barplotN function also implements automatic calculation of means; just the classifying factor and the numeric variable need to be supplied. The type of error bars is specified by parameters.
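A minimal sketch of the barplot + arrows workflow described above, assuming the group means and their standard errors have already been computed (the values below are hypothetical):

# hypothetical group means and standard errors for three categories
means <- c(4.2, 5.1, 5.8)
err.b <- c(0.30, 0.25, 0.40)

# barplot returns the x-coordinates of the bar midpoints
x.coords <- barplot(means, names.arg = c("a", "b", "c"), col = "grey",
                    ylim = c(0, 7), ylab = "y2")

# error bars of +/- 1 standard error; angle = 90 draws flat caps instead of arrowheads
arrows(x0 = x.coords, y0 = means - err.b, y1 = means + err.b,
       code = 3, angle = 90, length = 0.05)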
Dotchart

Dotcharts are closely similar to barcharts. They are also very suitable for comparisons between means. However, they display negative values better and allow adjustment of the y-axis range. Thus, they are generally considered a superior option to barplots.

Fig. 4.5. Dotchart displaying mean values of the variable y2 for individual categories of type.1. Error bars indicate 1 standard error.

How to do in R
function dotchart requires a numeric vector of means. Using this function is however quite awkward. A better option is to use a simple plot:
plot(1:N, means)
where N is the number of categories and means is the vector containing the means.
Error bars are supplied by the function arrows:
arrows(x0=1:N, y0=means-err.b, y1=means+err.b, code=3, length=0.05)
where err.b is the error bar parameter (standard error or confidence interval). It may also be the range, but then the interval is not symmetric.
Useful tip: You may change the point symbol of the mean by the parameter pch; pch = 16 creates a filled point, pch = 15 makes a filled box.
Because R does not have a dedicated function for plotting a dotchart with error bars, I have made one for you: see dotchartN.R in IS (Study Materials/Learning Materials/Rfunctions). The dotchartN function also implements automatic calculation of means; just the classifying factor and the numeric variable need to be supplied. The type of error bars is specified by parameters.

Scatterplot

The scatterplot is a simple point-based plot illustrating the association between two quantitative variables. The points in a scatterplot usually represent the original data; thus there is little loss of information, if any. The scatterplot is perfect for exploring the interdependence between two variables. A regression line (with confidence intervals) may also be added to the raw scatterplot to visualize a regression model (see chapter 10 for details).

Fig. 4.6. Scatterplots displaying the relationships between x and y (a) and x and y2 with indication of assignment into categories of type.2 (b). Note the different types of relationships between the variables: linear (a) and exponential (b).

How to do in R
a scatterplot is produced by the function plot(x, y) if both x and y are numeric variables. Alternatively, plot also accepts the formula and data parameters.
Useful tips: With large data, or data where values are limited to a few integers, overlapping points may occur in a scatterplot, which are not visible. A solution to this is to use a semitransparent point color. The number of overlapping points is then indicated by the color intensity. A semitransparent color is specified by the parameter alpha in the color-specifying function rgb, e.g. col=rgb(0.2,0.2,0.2,alpha=0.5) produces semitransparent grey.
The function scatterplot of the car package provides many additional functionalities for enhanced scatterplots, like adding a regression line, the possibility to draw categorized scatterplots (i.e. with different types of points), etc. However, its default settings are not very nice. My friend Pavel Fibich has also scripted a scatterplot function which plots a scatterplot together with a regression line and its confidence intervals. It is called lmconf and is available in IS (Study Materials/Learning Materials/Rfunctions).
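A minimal sketch of the semitransparent-color tip above, using simulated data (the variables x and y here are generated only for illustration):

# simulated data with many overlapping points
x <- rnorm(2000)
y <- x + rnorm(2000)

# semitransparent grey points; darker areas indicate more overlap
plot(x, y, pch = 16, col = rgb(0.2, 0.2, 0.2, alpha = 0.5))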
General tips for graph creation/adjustments in R
1. Exporting graphs is best done by saving them as separate files. These files may contain raster or vector graphics (see e.g. https://vector-conversions.com/vectorizing/raster_vs_vector.html for an explanation).
a. vectors: functions pdf, svg (svg can easily be post-processed in Inkscape, https://inkscape.org/)
b. rasters: functions png, jpeg
c. the general syntax is e.g.:
pdf("file.name.pdf", width.in.inches, height.in.inches) # in rasters, the width and height are specified in pixels; in addition, the raster resolution (in dpi) can also be specified
plot(x,y)
dev.off() # closes the file and saves it to disk
2. Graphical parameters are set by the function par
a. ?par provides info on all graphical parameters, which are also used in other functions like plot
b. Parameter setting done by par affects the plot which is produced afterwards, e.g. par(mar=c(2,2,2,2)) sets all plot margins to 2 text lines. A graph produced by plot afterwards will have such margins.
Most important parameters used directly in par:
mfcol, mfrow: a vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device, by columns (mfcol) or rows (mfrow), respectively.
mar: a numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.
Most important graphical parameters used mostly in other functions (like plot):
xlim: the x limits (x1, x2) of the plot. The same with ylim for the y-axis.
xlab, ylab: axis labels
main: graph headline
cex: a numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. This starts as 1 when a device is opened, and is reset when the layout is changed.
las: numeric in {0,1,2,3}; the style of axis labels. 0: always parallel to the axis [default], 1: always horizontal, 2: always perpendicular to the axis, 3: always vertical.
pch: plotting 'character', i.e. the symbol to use. This can either be a single character or an integer code for one of a set of graphics symbols.
[Chart of the plotting symbols for pch codes 0-25.]
lty: line type (2 for dashed)
lwd: line width
log: produces a log-scaled axis; use log='x', log='y' or log='xy' for the horizontal, vertical or both axes, respectively.
3. Other useful functions
a. legend: this function adds a legend to an existing plot. It is very useful e.g. in scatterplots with multiple point types.
b. text: adds text to a specified place within the plotting region; mtext adds text onto the graph margins.

5. Hypothesis testing and pattern detection; goodness-of-fit test

Scientific statements

In chapter 1, I explained that science consists of theories, and these comprise hypotheses. Scientists formulate these hypotheses as universal statements describing the world, but they never know whether a hypothesis is true; they can only learn that it is false when it is rejected based on empirical evidence. This makes science an infinite process of searching for truth, which we hopefully approach but never know whether we have reached.

Let's now return to the term universal statement I used in the previous paragraph and in chapter 1, because it is crucial for understanding how empirical science works and how hypothesis testing proceeds. Statements describing the world can be classified into two classes:

1. Universal statements generally apply to all objects concerned. E.g. "All (adult) swans are white" is a universal statement. This can be converted into a negative form: "Swans of colors other than white do not exist." You can see that universal statements prohibit certain patterns or events (e.g. observing a black swan here); therefore, they have the form of "natural laws". They can also be used to make predictions. If the white swan hypothesis is true, the next swan I see will be white (and this does not depend on how many white swans I saw before). A universal statement cannot be verified, i.e. confirmed to be true. We would need to inspect the color of all swans living on the Earth (and in the Universe) to do so, and even if we did, we could never be sure that the next baby swan hatching from an egg would not turn out to be other than white at adulthood. By contrast, it is very easy to reject such a universal statement based on empirical evidence. Observing only a single swan of a color other than white is sufficient for that.

2. Singular statements are asserted only about specific objects. E.g. "The swan I see is white." Such a statement refers to a particular swan and does not predict anything about other swans. A specific class of singular statements are existential statements, which can be derived from singular ones. The fact that I see a white swan (singular statement) can be used to infer that there is at least one swan that is white, i.e. white swans do exist. Based on the previous paragraph, you would probably not consider such an observation anything novel, since it agrees with the universal statement on white swans.
However, seeing a single black swan (Fig. 5.1) changes the situation completely. It means that at least one black swan exists and that the universal statement on white swans is not true. In general terms, this existential statement rejects the universal statement. To sum up, a scientific hypothesis must have the form of a universal statement in order to have predictive power, which we need to explain patterns in nature. Hypotheses cannot be verified but can be rejected by empirical existential statements which are in conflict with their predictions.

Fig. 5.1. A black swan in Perth (Western Australia).

Hypotheses and their testing

Empirical science is essentially the process of hypothesis testing, which means searching for conflicts between the predictions of hypotheses and collected/measured data. Once a hypothesis is rejected, a new hypothesis can be formulated to replace the old one. Note here that there is no "objective" way to formulate new hypotheses - they are rather genuine guesses. An important implication of this is that, for every scientific theory or hypothesis, it should be possible to define singular observations that, if they exist, would reject it. This means that each scientific hypothesis must be falsifiable. Universal statements that are not falsifiable may be components of art, religion, or pseudoscience but definitely not of science. Various conspiracy theories also belong to this class. These statements need not be only dogmatic; they may also be tautological. An example is the recently published theory of stability-based sorting in evolution (https://www.ncbi.nlm.nih.gov/pubmed/28899756), a "theory" which says that evolution operates with stability, i.e. organisms and traits which are more stable persist for longer. The problem is that long persistence is a synonym for stability. Thus, this theory says, "What is stable is stable" - not very surprising. The authors declare the theory to explain everything (see the ending of the abstract), and this is indeed true. Still, the problem is that the theory neither produces any useful predictions nor can be tested by empirical data.

If we select only hypotheses that are falsifiable and can be considered scientific statements, we may discover that there are multiple theories without any conflict with the data. It is natural to ask which one to choose over the others. Here, we should use the Occam's razor principle (https://en.wikipedia.org/wiki/Occam%27s_razor) and use the simplest (and also most universal and most easily falsifiable) hypothesis available. This is also termed the "minimum adequate model", i.e. choose the model with the minimal number of parameters that adequately matches the data.

Pattern detection

Biological and ecological systems display high complexity arising from an interplay among complicated biochemical processes, evolutionary history, and ecological interactions. As a result, quite a large proportion of the research is exploratory, aiming at discovering effects that have not been anticipated yet. Therefore, no previous theory could have informed us about them, or such information on the absence of an effect would be just redundant. These are special cases of hypothesis testing, which can be called pattern detection. In pattern detection tests, we test the universal statement that the effect under investigation is zero (e.g. there is no correlation between two quantitative variables).
Rejecting such a statement (the null hypothesis) means that our observations are significantly different from what could be observed just by chance, i.e. we demonstrate the significance of a singular statement - and this can consequently be used to formulate a new universal hypothesis.

Hypothesis testing with statistics

In statistics, we work with numbers and probabilities. Therefore, we do not record clear-cut evidence to reject a hypothesis as in the example with the swans. In other words, even improbable events may happen by chance, and their observation may not be sufficient evidence to reject a hypothesis. A general statistical testing procedure involves the computation of a test statistic. The test statistic measures the discrepancy between the prediction of the null hypothesis and the data, also considering the strength of the evidence based on the number of observations. The test statistic is a random variable which follows a particular theoretical distribution if the null hypothesis is true. As a result, the probability of observing the actual data, or data that differ even more from the null hypothesis expectation, can be quantified. If this probability (called the p-value) is below a certain threshold, we can justify the rejection of the null hypothesis.

The probability of observing specific data under the null hypothesis can be very low but never zero. As a result, we are left with uncertainty concerning whether we made the right decision when rejecting or retaining the null hypothesis. In general, we may either take the right decision or make an error (Table 5.1).

Table 5.1. Possible outcomes of hypothesis testing by statistical tests. H0 = null hypothesis.
                              Reality: H0 is true    Reality: H0 is false
Our decision: reject H0       Type I error           OK
Our decision: not reject H0   OK                     Type II error

Two types of error can be made, of which the type I error is more harmful because it means rejection of a null hypothesis which is actually true. This is called false positive evidence. It is misleading and may even obscure the scientific research of a given topic. By contrast, the type II error (false negative) is typically invisible to anybody except the researcher himself, because results not rejecting the null hypothesis are usually not published. Statistical tools can precisely control the probability of making a type I error by setting an a priori limit for the p-value. This limit, called the level of significance (α), is typically set to α = 0.05 (5%). If the p-value resulting from the testing is higher than that, the null hypothesis cannot be rejected. Note here that such a non-significant result does not mean that the null hypothesis is true. Non-significant results indicate absence of evidence, not evidence of absence of an effect.

Concerning the type II error (the probability of which is denoted β), statistical inference is less informative. It can be quantified in some controlled experiments, but its precise value is not of particular interest. Instead, a useful concept is the power of the test, which equals 1 - β; its relative rather than absolute size is what matters. The power of the test increases with sample size and with increasing α, i.e. if the tester accepts an elevated risk of type I error.

Goodness-of-fit test

Let's have a look at an example of a statistical test. One of the most basic statistical tests is the goodness-of-fit test (sometimes inappropriately called the chi-square test, after the name of the test statistic).
It is particularly suitable for testing frequencies (counts) of categorical data, although the χ² distribution is quite universal and approximates e.g. the very general likelihood ratio. The formula is:

χ² = Σ (O - E)² / E

where O indicates the observed frequencies and E the frequencies expected under the null hypothesis. The sum runs over all categories under investigation. The χ² value is subsequently compared with the corresponding χ² distribution to determine the p-value. There are many χ² distributions that differ in the number of degrees of freedom (DF; Fig. 5.2). The DF is a more general concept common to all statistical tests, as it quantifies the size of the data and/or the complexity of the model. Here, it is important to know that for an ordinary goodness-of-fit test: DF = number of categories - 1.

Fig. 5.2. Probability densities of two χ² distributions differing in the number of degrees of freedom (DF = 1 and DF = 5). Dashed lines indicate cut-off values for 0.05 probabilities on the upper tail.

Goodness-of-fit test example

A typical application of the goodness-of-fit test is in genetics, as demonstrated in the following example: You are a geneticist interested in testing the Mendelian rules. To do so, you cross red- and white-flowering bean plants. Red is dominant and white recessive, so in the F1 generation you only get red-flowering individuals. You cross these and get 44 red-flowering and 4 white-flowering individuals in the F2 generation. What can you say about the universal validity of the second Mendelian rule (which predicts a 3:1 ratio between dominant and recessive phenotypes) at the level of significance α = 0.05?

First, you need to calculate the expected frequencies. These are:
Ered = 48 x 3/4 = 36
Ewhite = 48 x 1/4 = 12
Then, the computation of the test statistic follows:
χ² = (44 - 36)²/36 + (4 - 12)²/12 = 7.11
DF = 1
p(χ² = 7.11, DF = 1) = 0.0077

Conclusion (to be written in the text): Heredity in our bean-crossing system is significantly different from the second Mendelian rule (χ² = 7.11, DF = 1, p = 0.0077). As a result, the second Mendelian rule is not universally true. Here you can see that our experiment produced a singular statement on the numbers of bean plants. The statistics translated this into an existential statement that at least one (our) genetic system exists which does not follow the Mendelian rule. This was then used to reject the universal statement.

How to do in R
Goodness-of-fit test: chisq.test
The parameter x is used for inputting the observed frequencies.
The parameter p is used for inputting the null hypothesis-derived probabilities.
Example with output:
chisq.test(x=c(44,4), p=c(3/4,1/4))

Chi-squared test for given probabilities
data: c(44, 4)
X-squared = 7.1111, df = 1, p-value = 0.007661

Probabilities of the χ² distribution can be computed by pchisq (do not forget to set lower.tail=F to get the p-value).
pchisq(7.11, df=1, lower.tail = F)
[1] 0.007665511
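The same computation can also be carried out step by step, mirroring the formula above; this is a minimal sketch using the observed and expected frequencies of the bean example:

observed <- c(44, 4)
expected <- c(36, 12)
chi2 <- sum((observed - expected)^2 / expected)   # 7.11
pchisq(chi2, df = 1, lower.tail = FALSE)          # p = 0.0077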
6. Contingency tables - association of two (or more) categorical variables

Contingency tables - introduction

Contingency tables are tables that summarize frequencies (counts) of two (or more) categorical variables. Their analysis allows testing the (in)dependence between the variables. Table 6.1 is a contingency table summarizing frequencies of people with different eye and hair colors.

Table 6.1. Contingency table of two variables, eye and hair color, with basic frequency statistics (marginal sums and grand total).
                          Hair color
Eye color        black    brown    blonde    marginal sums
blue             12       45       14        71
brown            51       256      84        391
marginal sums    63       301      98        grand total: 462

Basic analysis by goodness-of-fit test

The association between the variables (i.e. the null hypothesis stating that the variables are independent) can be tested by a goodness-of-fit test. This is a universal approach suitable for tables of any size and dimensions, but its explanatory power is limited. For a goodness-of-fit test, we need the expected frequencies under the null hypothesis, which are calculated on the basis of probability theory: P(event 1 and event 2) = P(event 1) x P(event 2) if the two events are independent. In contingency tables, this can be used to calculate the expected frequencies from the ratios of the corresponding marginal totals to the grand total. For instance, the expected probability of observing a blue-eyed and black-haired person in Table 6.1 can be calculated as P(blueE and blackH) = 63/462 x 71/462 = 0.02096. Multiplying this probability by the grand total then gives the expected frequency: Freq(e) = 0.02096 x 462 = 9.68. The same approach can be used to calculate the expected frequencies in all cells; nowadays, this is done automatically by software. The goodness-of-fit test can consequently be computed (in the same way as described in chapter 5). Note, however, that the number of degrees of freedom is determined as

DF = (number of rows - 1) x (number of columns - 1)

In our example (Table 6.1): We did not find a significant association between eye and hair color (χ² = 0.785, DF = 2, p = 0.6755).

The goodness-of-fit test does not provide much more information on the result than the significance of the association. Still, in the case of a significant result, it may make sense to also report the differences between observed and expected frequencies (i.e. the residuals) or their standardized values (residuals divided by the square root of the corresponding expected frequencies) as supplementary information. In particular, standardized residuals are helpful as they indicate which combinations show an excess or deficiency of observations and thus cause the association between the variables.
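A minimal sketch of this analysis in R; the matrix below simply re-enters the observed frequencies of Table 6.1 (the object names are illustrative):

# observed frequencies from Table 6.1: rows = eye color, columns = hair color
eye.hair <- matrix(c(12, 45, 14,
                     51, 256, 84),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(eye = c("blue", "brown"),
                                   hair = c("black", "brown", "blonde")))

test.res <- chisq.test(eye.hair)   # X-squared = 0.785, df = 2, p = 0.676
test.res$expected                  # expected frequencies, e.g. 9.68 for blue eyes / black hair
test.res$residuals                 # (O - E) / sqrt(E)
test.res$stdres                    # standardized residuals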

2x2 tables and their analysis

These tables represent a particular, and the simplest, case of contingency tables (Table 6.2).

Table 6.2. Structure of a 2x2 table.
                  Var2 level 1    Var2 level 2
Var1 level 1      f11             f12             R1
Var1 level 2      f21             f22             R2
                  C1              C2              n

Their simplicity allows additional statistics to be computed that express how tight the association between the two variables is. The most important of these is the phi-coefficient:

φ = (f11 f22 - f12 f21) / √(R1 R2 C1 C2)

Another statistic commonly computed for 2x2 tables is the odds ratio (OR), estimated from the cell frequencies as OR = (f11 f22) / (f12 f21). OR = 1 indicates independence between the two variables, OR < 1 a negative association and OR > 1 a positive association. OR is a population parameter. The computation summarized above is actually its maximum-likelihood estimation procedure. As a result, an OR estimate has an associated standard error and confidence intervals (i.e. intervals within which the population OR lies with 95% probability). A confidence interval directly indicates significance: if the confidence interval of an OR contains 1, the OR is not significantly different from 1, and thus independence between the two variables cannot be rejected.

A worked example

Malaria is a dangerous disease widespread in tropical areas. It is caused by protozoans of the genus Plasmodium and transmitted by mosquitos. Preventing the infection is possible by taking prophylaxis, i.e. a treatment which blocks the disease after a mosquito bite. This is only possible for short-term journeys to malaria areas, since the prophylaxis drugs are not safe for long-term use. Here we asked whether the prophylaxis is efficient and whether there is a significant difference between two prophylaxis types. The data are summarized in Table 6.3.

Table 6.3. Frequencies of travelers to the tropics infected by malaria (or not) and the anti-malaria prophylaxis they used.
Prophylaxis       Infected by malaria    Frequency
none (control)    0                      40
none (control)    1                      94
doxycycline       0                      130
doxycycline       1                      80
lariam            0                      180
lariam            1                      15

Note here that a contingency table can also have the form of a table with individual factor combinations and the corresponding frequencies. This is actually a bit better for computation than the cross-tabulated form. The goodness-of-fit test demonstrates that there is a significant association between the two variables: Chisq = 137.45, df = 2, p-value = 1.42e-30. A summary of odds ratios then follows. Two odds ratios are produced, comparing the second and third levels to the first one (here the control). The "lower" and "upper" values indicate the limits of the confidence intervals. We can see that both types of prophylaxis are associated with significantly decreased infection rates.

          infected
prophylax    0        p0     1         p1    oddsratio      lower      upper      p.value
  control   40 0.1142857    94 0.49735450   1.00000000         NA         NA           NA
  doxy     130 0.3714286    80 0.42328042   0.26186579 0.16479825 0.41610692 6.790312e-09
  lariam   180 0.5142857    15 0.07936508   0.03546099 0.01862937 0.06749997 8.847446e-34

To compare just the two prophylaxis types, we can select just the corresponding part of the data for the analysis (specifying this by square brackets in R). The result shows that taking Lariam is associated with a significantly lower infection rate than taking doxycycline.

          infected
prophylax    0        p0     1        p1   oddsratio      lower     upper      p.value
  doxy     130 0.4193548    80 0.8421053   1.0000000         NA        NA           NA
  lariam   180 0.5806452    15 0.1578947   0.1354167 0.07462922 0.2457171 1.531487e-13

In a paper/thesis, the result can be summarized as Table 6.4.

Table 6.4. Summary of a contingency table analysis testing the association between malaria prophylaxis and infection. Overall test of independence: χ² = 137.45, df = 2, p < 10⁻⁶.
                           Odds ratio    lower 95% conf. limit    upper 95% conf. limit    p
Lariam vs. none            0.035         0.019                    0.067                    < 10⁻⁶
doxycycline vs. none       0.262         0.165                    0.416                    < 10⁻⁶
Lariam vs. doxycycline     0.135         0.075                    0.246                    < 10⁻⁶
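A minimal sketch reproducing this analysis, assuming the data are entered in the long format of Table 6.3 (the object and level names are illustrative; the epitab function comes from the epitools package mentioned below):

# the malaria data in the long format of Table 6.3
malaria <- data.frame(
  prophylax = rep(c("control", "doxy", "lariam"), each = 2),
  infected  = rep(c(0, 1), times = 3),
  freq      = c(40, 94, 130, 80, 180, 15))

# cross-tabulate and test independence
malaria.tab <- xtabs(freq ~ prophylax + infected, data = malaria)
chisq.test(malaria.tab)                   # X-squared = 137.45, df = 2

# odds ratios with 95% confidence intervals (control is the reference level)
library(epitools)                         # install.packages("epitools") if needed
epitab(malaria.tab, method = "oddsratio")

# comparing only the two prophylaxis types
epitab(malaria.tab[c("doxy", "lariam"), ], method = "oddsratio")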
Coincidence and causality

Note here that a significant result of a contingency table analysis indicates a significant association. This can be caused either by coincidence or by causality. Causality means that if we manipulate one variable, the other also changes, i.e. one variable has a direct effect on the other. By contrast, coincidence may happen due to another variable affecting the two variables analyzed. In such a case, manipulation of one variable does not induce a change in the other variable. In the malaria example, the travelers using prophylaxis are simultaneously more likely to use mosquito repellents, which is known to strongly decrease the infection risk. Therefore, if somebody from the no-prophylaxis travelers decided to take prophylaxis, it might have a much lower effect than our analysis suggests. People in general like causal explanations (and expect them). As a result, an association is frequently interpreted as a causal relationship, which is inappropriate. An association may only suggest causality at best, which can consequently be demonstrated by a manipulative experiment. In our case, this would mean selecting a group of people, assigning them randomly into three groups according to prophylaxis, sending them to the tropics and seeing what happens. In this particular case, however, such research would not be approved by an ethics committee.

How to do in R
1. Chi-square analysis of contingency tables
Option 1: apply chisq.test to a matrix containing the frequencies.
Option 2: If the data are formatted in a data frame as in Table 6.3, they can be converted to a contingency table by the function xtabs:
data.table<-xtabs(freq~var1+var2, data=data.frame)
chisq.test can then be applied to the contingency table. If its result is saved in an object, test.res<-chisq.test(data.table), running test.res$stdres can then be used to display the standardized residuals.
2. Phi-coefficient
function phi (package psych) applied to a 2x2 matrix
3. Odds ratios
function epitab (package epitools) applied to a contingency table produced by xtabs. Square brackets can be used to select the levels to compare.

7. The t-distribution, confidence intervals, and t-tests

The t-distribution

For any fixed value X, a t-value can be computed from a sample of a quantitative random variable using this formula:

t = (X - x̄) / SE

where x̄ is the sample mean and SE is its associated standard error. Recall here that x̄ is the estimate of the population mean and SE quantifies its accuracy. As a result, the t-value represents an estimate of the difference between X and the population mean (expressed in units of standard error). Because x̄ is a random variable, the t-value is also a random variable, and its probability distribution is called the t-distribution. Its shape is closely similar to Z (the standard normal distribution). In contrast to Z, the t-distribution has a single parameter - the number of degrees of freedom, which equals the number of observations in a given sample minus 1. In fact, t approaches Z asymptotically for high df (Fig. 7.1). Similarly to the normal distribution, the t-distribution is symmetric, and its two tails must be considered when computing probabilities (Fig. 7.2).

Fig. 7.1. Probability density plot of t-distributions with different DF and their comparison to the standard normal distribution (Z).

Fig. 7.2. The t-distribution (df = 5) with its two tails and the 2.5% and 97.5% quantiles.

Confidence intervals for the mean value and the single sample t-test

The t-distribution can be used to compute confidence intervals (CI), i.e. intervals within which the population mean value lies with a certain probability (usually 95%).
The confidence limits (CL) within which the CI lies are determined using these formulae:

CLlow = x̄ + t(df, p = 0.025) x SE
CLhigh = x̄ + t(df, p = 0.975) x SE

where t(df, p) equals the 2.5% or 97.5% probability quantile of the t-distribution with the given df. These intervals can be used as error bars in barplots or dotcharts. In fact, they represent the best option to be used like this (in contrast to the standard error or 2 x standard error). Confidence intervals can also be used to determine whether the population mean differs significantly from a given value: a value lying outside the CI is significantly different (at the 5% level of significance), while a value lying inside is not. This is closely associated with the single sample t-test, which tests the null hypothesis that a value X equals the population mean. Using the formula for the t-value and the DF, the t-test determines the probability of the type I error associated with rejection of such a hypothesis.

Student t-test

If means can be compared with an a priori given value, two means of different samples should also be comparable. This is done by a two-sample t-test¹, which quantifies the uncertainty about the values of both means considered:

t = (x̄1 - x̄2) / SEx̄1-x̄2

where x̄1 and x̄2 are the arithmetic means of the two samples and SEx̄1-x̄2 is the standard error of their difference. The SEx̄1-x̄2 is computed using the following formula:

SEx̄1-x̄2 = √(sp²/n1 + sp²/n2)

where sp² is the pooled variance of the two samples, and n1 and n2 are the sample sizes of the two samples. Pooling the variance like this is only possible if the two variances are equal. Correspondingly, the equality of the population variances, called homogeneity of variance, is one of the t-test assumptions. In addition, the t-test assumes that the samples come from populations that are distributed normally. There is also the universal assumption that the individual observations are independent. The t-test is relatively robust to violations of the assumptions about the homogeneity of variance and normality (i.e. their moderate violation does not produce strongly biased test outcomes). If the variances are not equal, the Welch approximation of the t-test (Welch t-test) can be used instead of the original Student t-test. A slightly modified formula is used for the t-value computation, and also the degrees of freedom are approximated (as a result, the df is usually not an integer). Note here that the Welch t-test is used by default in R. In the original (two-sample) Student t-test, the DF is determined as

DF = n1 - 1 + n2 - 1

where n1 is the size of sample 1 and n2 is the size of sample 2.

Paired t-test

The paired t-test is used to analyze data composed of paired observations. For instance, a difference in length between the left and right arms of people would be analyzed using a paired t-test. The null hypothesis in this case is that the difference within a pair is zero. In fact, a paired t-test is fully equivalent to a single sample t-test comparing the distribution of within-pair differences with zero. The number of degrees of freedom is determined as DF = n - 1, because there is just one sample (of paired values) in a paired t-test.

1 Also called Student's t-test after its inventor William Sealy Gosset (1876-1937), who used the pen name Student.

How to do in R
1. t-distribution computations
functions pt and qt are available. For instance, qt(0.025, df) gives the t-quantile which, multiplied by the standard error, gives the difference between the lower confidence limit and the mean.
2. t-test
Function t.test. For two samples, the best way is to use a classifying factor and the response variable in two columns. Then, t.test(response~factor) can be used. But t.test(sample1, sample2) is also okay.
Important parameters:
var.equal - switches between the Welch and Student variants. Defaults to FALSE (Welch).
mu - the a priori null value of the difference (relevant mainly for a single sample test).
paired - TRUE specifies a paired t-test analysis.
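A minimal t-test sketch with made-up data (all values and object names below are hypothetical illustrations):

# two samples in the long format: one response column, one classifying factor
tails <- data.frame(
  length  = c(5.1, 4.8, 5.6, 5.0, 4.7, 6.2, 6.0, 5.8, 6.4, 6.1),
  species = factor(rep(c("mouse", "rat"), each = 5)))

t.test(length ~ species, data = tails)                    # Welch t-test (default); output includes the 95% CI
t.test(length ~ species, data = tails, var.equal = TRUE)  # classical Student t-test

# paired t-test, e.g. left vs. right arm length measured on the same people
left  <- c(72.1, 69.8, 75.0, 71.2)
right <- c(72.4, 70.1, 74.8, 71.9)
t.test(left, right, paired = TRUE)

# 2.5% quantile of the t-distribution with 9 df, as used for confidence limits
qt(0.025, df = 9)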
8. F-test and F-distribution, analysis of variance (ANOVA)

F-test

Normally distributed data can be described by two parameters - the mean and the variance. We discussed testing the difference in the mean between two samples in the previous chapter. However, it is also possible to test whether two samples come from populations with the same variance, i.e. the null hypothesis stating:

σ1² = σ2²

As usual for population parameters, we do not know the σ², but they can be estimated by s² (the sample variances). A comparison between sample variances is then made by the F-test, whose test statistic is a simple ratio of the sample variances. The F statistic follows the F-distribution, the shape of which is defined by two degrees of freedom - the DF of the numerator and the DF of the denominator. These are found as n1 - 1 and n2 - 1 (i.e. the number of observations in the corresponding sample minus 1). When reporting test results in a text, both DFs must be reported (usually as subscripts). For instance, "variances significantly differed between green and red apples (F20,25 = 2.52, p = 0.015)".

Fig. 8.1. Probability density plot of F-distributions with different DFs.

Analysis of variance (ANOVA)

The F-test is rarely used to test differences in variances between two samples, because hypotheses on variance are not common. However, the F-test has a crucial application in the analysis of variance. In chapter 7, we discussed comparisons between the means of two samples using a t-test. A natural question, however, arises - what if we have more than two samples? We might try using pair-wise comparisons between each pair of them. That would, however, lead to multiple non-independent tests and result in an inflated type I error probability¹. Therefore, we use analysis of variance (ANOVA) to solve such problems. ANOVA tests a null hypothesis on the means of multiple samples, which states that the population means are equal, i.e.

μ1 = μ2 = ... = μk

The mechanism of ANOVA is based on decomposing the total variability into two components: 1. a systematic component corresponding to differences between groups, and 2. an error (or residual) component corresponding to differences within groups. These differences are measured as squares. For each observation in the dataset, its total square (measuring the difference between its value and the overall mean), effect square (measuring the difference between the corresponding group mean and the overall mean), and error square (measuring the difference between the value and the corresponding group mean) can be calculated (Fig. 8.2).

Fig. 8.2. Mechanism of ANOVA: definition of squares exemplified with the red data point.

Subsequently, we can sum the square statistics over the whole dataset and get the sums of squares (SS): SStotal, SSeffect, SSerror. We can further calculate mean squares (MS) by dividing the SS by the corresponding DF, with DFtotal = n - 1, DFeffect = k - 1, and DFerror = DFtotal - DFeffect, where n is the total number of observations and k the number of categories.

1 This comes from the fact that if individual tests are performed at α = 0.05, then the probability of making a type I error in 2 tests (i.e. making an error in at least one of the tests) is p = 0.05 + 0.05 - 0.05² = 0.0975.
Hence we get:

MSeffect = SSeffect / DFeffect
MSerror = SSerror / DFerror

And now it comes: the mean squares are actually variances. As a result, we can use an F-test to test the null hypothesis that MSeffect is lower than or equal to MSerror. Such a test is equivalent to the test of the null hypothesis stating that all the means are equal:

F(DFeffect, DFerror) = MSeffect / MSerror

The corresponding p-value is then found based on a comparison with the F-distribution, as in an ordinary F-test. Note that rejecting the null hypothesis means that at least one of the means is significantly different from at least one other. Besides the p-value, it is also possible to compute the proportion of variability explained by the groups:

r² = SSeffect / SStotal

A typical report of an ANOVA result in the text then reads: "Means were significantly different among the groups (r² = 0.70, F2,12 = 14.63, p = 0.0006)."

ANOVA assumptions

The application of ANOVA assumes that i. the samples come from normally distributed populations and ii. the variances are equal among the groups. These assumptions can be checked by analysis of the residuals, as they can be restated as requirements for i. normal distribution and ii. constant variance of the residuals. There are formal tests of normality, such as the Shapiro-Wilk test, but their use is problematic, as they test the null hypothesis that a given sample comes from a normal distribution. The tests are more powerful (likely to reject the null) if there are many observations, but in that case ANOVA is rather robust to moderate violations of the assumption. By contrast, the formal tests of normality fail to identify the most problematic cases, when the assumptions are not met and the number of observations is low, because in such cases their power is low. Instead, I highly recommend a visual check of the residuals. In particular, a scatterplot of standardized residuals and a normal quantile-quantile (QQ) plot are informative about possible problems with the ANOVA assumptions. Details on how to use these plots to assess the ANOVA assumptions are very nicely explained here: https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57

Post-hoc comparisons

When we get a significant result in ANOVA (and only in such a case!), we may be further interested in seeing which mean is different from which. Statistical theory does not provide much help here; however, some pragmatic tools were developed for this purpose. These are based on the principle of pair-wise comparisons (similar to a series of pair-wise two-sample t-tests), which control the inflation of the type I error probability by adjusting the p-values upwards. An example of such a test is the Tukey honest significant difference test (Tukey HSD). Results of these tests are frequently summarized in plots by letter indices, with different letters indicating significant differences (Fig. 8.3).

Fig. 8.3. Dotchart displaying means and 95% confidence intervals for the means of the three samples. Means significantly different from each other at α = 0.05 are denoted by different letters (based on the Tukey HSD test).

How to do in R
1. F-test, F-distribution
function var.test; pf and qf for F-distribution probabilities
2. ANOVA
Function aov - accepts the formula syntax. Note that the predictor must be a factor; otherwise a linear regression is fitted (which is incorrect, but no warning is given).
summary(aov.object) displays the ANOVA table with SS, MS, F and p.
plot(aov.object) displays the diagnostic plots for checking the ANOVA assumptions. See https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57 for a detailed explanation. Note that this webpage refers to more general regression diagnostics, which we will discuss in upcoming classes. It is, however, basically the same except for diagnostic plot #4, which you may ignore for now.
3. Post-hoc test
TukeyHSD(aov.object) - produces just the differences between groups. Letters as in Fig. 8.3 must be produced manually.
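A minimal ANOVA sketch with made-up data (the data frame growth and its variables are hypothetical illustrations):

# hypothetical data: a numeric response and a factor with three levels
growth <- data.frame(
  biomass   = c(4.1, 4.5, 3.9, 4.3, 5.2, 5.6, 5.1, 5.4, 6.3, 6.8, 6.1, 6.5),
  treatment = factor(rep(c("a", "b", "c"), each = 4)))

aov.object <- aov(biomass ~ treatment, data = growth)
summary(aov.object)    # ANOVA table with SS, MS, F and p
plot(aov.object)       # diagnostic plots for checking the assumptions
TukeyHSD(aov.object)   # pair-wise post-hoc comparisons with adjusted p-values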
9. Linear regression, correlation and intro to general linear models

Regression and correlation
Both regression and correlation refer to associations between two quantitative variables. One variable, the predictor, is considered independent in regression, and its values are considered not to be random. The other variable, the response, depends on the values of the predictor with a certain level of error variability, i.e. it is a random variable. In the case of correlation, both variables are considered random. Regression and correlation are thus quite different - theoretically. In practice, however, they are numerically identical concerning both the measure of association and the p-values (type I error probabilities) associated with rejecting the null hypothesis of independence between the two variables.

Linear regression
A linear association between two quantitative variables X and Y, of which Y is a random variable, can be described by the equation:
Y = a + bX + ε
where a and b are the intercept and slope of a linear function, respectively. These represent the systematic (deterministic) component of the regression model, while ε is the error (residual) variation representing the stochastic component. ε is assumed to follow the normal distribution with mean = 0. The goal of regression model fitting is to estimate the population slope and intercept from sample data of Y and X; a and b are thus estimates of population parameters. There are multiple approaches to conduct such estimation. Maximum-likelihood estimation is most common, and it provides numerically identical results to least-square estimation in ordinary regression. We shall discuss least-square estimation here, as it is fairly intuitive and will help us to understand the relationship with ANOVA.
The least-square estimation aims at minimizing the sum of error squares (SSerror), i.e. the squares of the differences between fitted and observed values of the response variable (Fig. 9.1). Note that this mechanism is notably similar to that of analysis of variance. In parallel with ANOVA, we can also define the total sum of squares (SStotal) and the regression sum of squares (SSregr). Subsequently, we can calculate mean squares (MS) by dividing SS by the corresponding DF, with DFtotal = n - 1, DFregr = 1, and DFerror = DFtotal - DFregr = n - 2, where n is the total number of observations. Hence, we get:
MSregr = SSregr/DFregr
MSerror = SSerror/DFerror
As in ANOVA, the ratio between the MS can be used in an F-test of the null hypothesis that there is no linear relationship between the two variables:
FDFregr,DFerror = MSregr/MSerror
Rejecting the null hypothesis means that the two variables are linearly related. Note, however, that a non-significant result may also be produced in cases when the relationship exists but is not linear (e.g. when it is quadratic).
Fig. 9.1 Mechanism of least-square estimation in regression: definition of squares exemplified with the red data point.
In regression, we are usually interested in statistical significance, and the strength of the association, i.e.
the proportion of variability in Y explained by X. That is measured by the coefficient of determination (R²):
R² = SSregr/SStotal
which can range from 0 (no association) to 1 (a deterministic linear relationship). Alternatively, the so-called adjusted R² may be used (and is reported by R), which accounts for the fact that the association is computed from samples and not from populations:
adjusted R² = 1 - MSerror/MStotal
Coming back to the regression coefficients - the fact that these are estimates means that the associated errors of such estimates may be computed. Their significance (i.e. significant difference from zero) may thus be tested by a single-sample t-test. The p-value of such a test for the slope (b) is identical to that of the F-test in simple regression with a single predictor. Note that the test of the intercept (reported by R or other statistical software) is irrelevant for the significance of the regression itself. A significant intercept only indicates that the expected value of Y at X = 0 is significantly different from zero.

Regression diagnostics
We have discussed the systematic component of the regression equation. However, the stochastic component is also important. This is because its properties provide crucial information on the validity of the regression assumptions and thus the validity of the whole model. The stochastic component of the model, called the model residuals, can be computed using the equation:
ε = Y - a - bX = Y - fitted(Y)
Residuals form a vector with one value for each data point. As such, they can be analyzed by descriptive statistics. They may also be standardized by dividing them by their standard deviation. The basic assumptions concerning the residuals are:
1. Residuals should follow the normal distribution.
2. The size of their absolute value should be independent of the fitted value.
3. There should be no obvious trend in the residuals associated with the fitted values, which would indicate non-linearity of the relationship between X and Y.
These assumptions are best evaluated on a regression-diagnostics plot (Fig 9.2). In addition, it may be worth checking that the regression result is not driven by a single extreme observation (or a few of these), which is provided on the bottom-right plot in Fig 9.2.
Fig 9.2. Regression diagnostics plots. 1. Residuals vs. fitted values indicate potential non-linearity of the relationship (smoothed trend displayed by the red line). 2. The normal Q-Q plot displays the agreement between the normal distribution and the distribution of residuals (dashed line). 3. The square root of the absolute value of the residuals indicates a potential correlation between the size of the residuals and the fitted values. 4. The residuals vs. leverage (https://en.wikipedia.org/wiki/Leverage_(statistics)) plot detects points which have a strong influence on the regression parameter estimates (these points have a high Cook's distance; https://en.wikipedia.org/wiki/Cook%27s_distance).
See also the detailed explanation of regression diagnostics here: https://arc.lib.montana.edu/book/statistics-with-r-textbook/item/57
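A minimal sketch of fitting and checking a simple linear regression in R follows; the variable names (fertilizer, yield) and all numbers are made up for illustration only.

# simulated data: 40 observations of a predictor and a response
set.seed(2)
fertilizer <- runif(40, 2, 14)
yield <- 100 + 30 * fertilizer + rnorm(40, 0, 40)

model.1 <- lm(yield ~ fertilizer)   # fit the regression model
summary(model.1)    # coefficients with t-tests, R2, adjusted R2, overall F-test
anova(model.1)      # the same F-test presented as an ANOVA-style table

resid(model.1)[1:5]     # the first few residuals
par(mfrow = c(2, 2)); plot(model.1); par(mfrow = c(1, 1))   # diagnostic plots as in Fig 9.2

cor.test(fertilizer, yield)   # Pearson correlation of the same two variables

Note that the p-value of cor.test equals the p-value of the slope t-test (and of the overall F-test), illustrating the numerical equivalence of regression and correlation mentioned above.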
Correlation
Correlation is a symmetric measure of the association between two random variables, of which neither can be considered a predictor or a response. Correlation is most commonly measured by the Pearson correlation coefficient:
r = Σ(Xi - mean(X))(Yi - mean(Y)) / sqrt(Σ(Xi - mean(X))² Σ(Yi - mean(Y))²)
Its values can range from -1 (absolute negative correlation) to +1 (absolute positive correlation), with r = 0 corresponding to no correlation. r² then refers to the amount of shared variability. Numerically, the Pearson r² and the regression R² have identical values for given data and basically the same meaning. Pearson r is also an estimate of a population parameter; its significance (i.e. significant difference from zero) can thus be tested by a single-sample t-test with n - 2 degrees of freedom.

On correlation and causality
Note that a significant result of a regression on observational data may only be interpreted as a correlation (or coincidence), even though one variable is called the predictor and the other the response. Causal explanations imply that a change in the predictor value causes a directional change in the response. Causality may, therefore, only be tested in manipulative experiments, where the predictor is manipulated. See more details on this in Chapter 6.

How to do in R
1. Regression (or a linear model): start with function lm to fit the model and save the lm output into an object: model.1<-lm(response~predictor) or model.2<-lm(response~predictor1+predictor2+...). anova(model.1) performs analysis of variance of the model (i.e. tests its significance by an F-test). Models may also be compared by anova(model.1, model.2). summary(model.1) displays a summary of the model, including the t-tests of individual coefficients. resid(model.1) extracts the model residuals. predict(model.1) returns the predicted values. plot(model.1) plots the regression diagnostic plots of the model.
2. Pearson correlation coefficient: cor(Var1, Var2) computes just the coefficient value; cor.test(Var1, Var2) computes the coefficient value together with a significance test.

10. When assumptions are violated - data transformation and non-parametric methods

Log-normally distributed data
The log-normal distribution is very common in many kinds of biological data. Log-normal variables are random variables whose logarithm follows the normal distribution. As a result, they may range from the zero limit (excluding zero itself) to plus infinity - which is pretty realistic e.g. for dimensions, mass, time, etc. In contrast to the symmetric normal distribution, log-normal variables are positively skewed and display a positive correlation between the mean and the variance (Fig. 10.1). A straightforward suggestion for such data is to apply a log-transformation to the values to obtain normally distributed variables (Figs 10.1, 10.2, Table 10.1). ANOVA applied on non-transformed and transformed data provides quite different results (Table 10.1).
Fig. 10.1. Example of a log-normal variable: effect of job on the length of phone calls. The left panel shows the boxplot on the ordinary linear scale, while the right panel shows the same values on a log-scaled y-axis.
Table 10.1. Summaries of ANOVA applied on non-transformed and transformed data displayed in Fig. 10.1.
Analysis          r²     F     DF    p
non-transformed   0.26   4.13  3,36  0.013
log-transformed   0.42   8.72  3,36  0.0002
Fig. 10.2. Diagnostic plots of ANOVA models applied on non-transformed (upper row) and log-transformed data (lower row). Note the improved normal fit on the QQ plot and the homogeneity of variances after transformation (Residuals vs. Fitted and Scale-Location plots).
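A minimal sketch of the same comparison in R, using simulated log-normal data; the names (calls, job, length) and all numbers are invented for illustration.

# simulated log-normal data: call length for three jobs
set.seed(3)
calls <- data.frame(
  job    = factor(rep(c("lawyer", "manager", "student"), each = 12)),
  length = rlnorm(36, meanlog = rep(c(2.0, 2.4, 1.6), each = 12), sdlog = 0.6)
)

aov.raw <- aov(length ~ job, data = calls)        # ANOVA on the non-transformed response
aov.log <- aov(log(length) ~ job, data = calls)   # ANOVA on the log-transformed response
summary(aov.raw)
summary(aov.log)

# diagnostic plots; compare the QQ and Scale-Location plots of the two models
par(mfrow = c(2, 2)); plot(aov.log); par(mfrow = c(1, 1))

# boxplot with a log-scaled y-axis, as in the right panel of Fig. 10.1
boxplot(length ~ job, data = calls, log = "y")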
Note that log-transformation is not a simple utility procedure; it also affects the interpretation of the analysis. Log-transformation changes the scale from additive to multiplicative, i.e. we test the null hypothesis stating that the ratio between the population means is 1 (instead of the difference being 0). We also consider different means - analysis on the log-scale implies testing the geometric means on the original scale. The same applies to regression coefficients, which become relative rather than absolute numbers, e.g. the slope indicates how many times the response variable will change with a unit change in the predictor. An example of log-transformation in linear regression is displayed in Figs 10.3, 10.4 and Table 10.2.
Log-transformation is sometimes used for data which are not log-normally distributed but just positively skewed. Such data may contain zeros and thus are not directly log-transformable; instead, a log(x + constant) transformation must be used. Alternatively, a square-root transformation may be considered for such data. Note that the analysis results do not depend on the logarithm used - natural and decadic logarithms are used most frequently. Just be careful to be consistent and use the same logarithm throughout the analysis.
Fig. 10.3. Example of a regression with a log-normal variable: how the grain yield of maize depends on the amount of fertilizer applied. The left panel shows the scatterplot on the ordinary linear scale, while the right panel shows the same values on a log-scaled y-axis.
Table 10.2. ANOVA tables of linear models fitted on non-transformed and transformed data displayed in Fig. 10.3.
Analysis          R²     F      DF    p
non-transformed   0.10   11.0   1,98  0.0013
log-transformed   0.14   16.05  1,98  0.0001
Fig. 10.4. Diagnostic plots of linear models fitted on non-transformed (upper row of plots) and log-transformed data (lower row of plots). Note the improved normal fit on the QQ plot and the improved homogeneity of variances after transformation (Scale-Location plot).

Non-parametric tests
Some distributions cannot be approximated by the normal distribution, and simple transformations may not help. This applies e.g. to many data on the ordinal scale, such as school grades, subjective rankings etc. For such cases, non-parametric tests were developed (Table 10.3). These tests replace the original values by their ranks (order) and use the resulting rank values to test differences in central tendencies (which are not precisely means) between the samples. These tests are still based on the assumption that the samples come from the same distribution (which, however, is quite reasonable).
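A minimal sketch of the non-parametric tests listed in Table 10.3 below, applied to made-up ordinal data (all names and values are invented):

# simulated ordinal data (e.g. subjective rankings on a 1-5 scale)
set.seed(4)
grades.A <- sample(1:5, 20, replace = TRUE)
grades.B <- sample(1:5, 20, replace = TRUE, prob = c(1, 1, 2, 3, 3)/10)

wilcox.test(grades.A, grades.B)                   # Mann-Whitney U test for two independent samples
wilcox.test(grades.A, grades.B, paired = TRUE)    # Wilcoxon test, if the observations were paired

group <- factor(rep(c("A", "B", "C"), each = 20))
score <- c(grades.A, grades.B, sample(1:5, 20, replace = TRUE))
kruskal.test(score ~ group)                       # non-parametric counterpart of one-way ANOVA

cor.test(grades.A, grades.B, method = "spearman") # Spearman rank correlation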
Table 10.3. List of parametric tests and their non-parametric counterparts together with appropriate R functions.
Parametric test       Non-parametric test     R function
two-sample t-test     Mann-Whitney U test     wilcox.test
paired t-test         Wilcoxon test           wilcox.test with parameter paired=T
one-way ANOVA         Kruskal-Wallis test*    kruskal.test
Pearson correlation   Spearman correlation    cor.test with parameter method="spearman"
* The Dunn test may be used for post-hoc comparisons (function dunnTest in package FSA).

Permutation tests
Permutation tests represent valuable alternatives to parametric or non-parametric tests. First, a statistic of the difference from the null hypothesis (between samples) is defined. That may be the raw or relative difference, or an F-ratio if multiple groups are analyzed. This statistic is computed for the observed data (observed statistic). Subsequently, the values of the response variable are repeatedly permuted (reshuffled), and the same statistic is computed in each permutation. The p-value is then determined by the formula:
p = (x + 1) / (nperm + 1)
where x is the number of permutations in which the test statistic was higher than the observed test statistic, and nperm is the total number of permutations.

How to do in R
1. Log-scaling of a graph axis: parameter log='axis to be log-scaled', i.e. mostly log='y'.
2. Log-transformation: function log for the natural logarithm, log10 for the decadic one.
3. Non-parametric tests: see Table 10.3.
4. Permutation tests are available in library coin: a. permutation-based ANOVA: function oneway_test; b. permutation-based correlation: spearman_test. Both functions require the parameter distribution=approximate(B=number of permutations) to be set; B is usually set to 999 or 9999.

11. Brief introduction to multi-way ANOVA, multiple regression and general linear models

Multiple regression and interaction
In regression, multiple predictors may be used in the model:
Y = a + b1X1 + b2X2 + ... + bnXn + ε
Predictors may be both quantitative and categorical variables. This is based on the fact that categorical variables may be decomposed into k - 1 binary (0/1) variables (where k is the number of categories/levels). In general, the maximum number of predictors is limited by the degrees of freedom in the model: the complexity of the model, measured by the model degrees of freedom, may never exceed the total df (i.e. the number of observations - 1).
Models containing two or more predictors may also contain interaction terms:
Y = a + b1X1 + b2X2 + c1X1X2 + ... + ε
An interaction means that the dependence of the response variable on one predictor (X1) depends on the value of the second predictor (X2). Interactions are typically tested in multi-way ANOVA, where even higher-order interactions can be considered. An interaction may be positive (i.e. the value of the response is higher than expected from the additive sum of the main effects; in such a case c1 > 0; Fig. 11.1) or negative (the response value is lower than the additive sum; c1 < 0). The interaction is formally notated by × (Alt + 0215), i.e. Y ~ X1 + X2 + X1 × X2. In R, an interaction may be represented by "*", which indicates both additive and interaction effects, or by ":", which indicates just the interaction term. Note that 1. testing the interaction is very common in manipulative experiments and 2. interaction does not mean correlation between predictors. As you will see later, correlation between predictors is a serious problem which, among other issues, prevents a reasonable assessment of interactive effects.
Fig. 11.1. Interaction plot showing positive interactive effects of fertilizer application and watering on plant growth. The interaction is directly visible from the graph as non-parallel lines connecting the mean values.
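A minimal sketch of fitting and displaying an interaction in R, with simulated data mimicking the fertilizer × watering example of Fig. 11.1 (all names and numbers are invented):

# simulated two-factor design with a positive interaction
set.seed(5)
dat <- expand.grid(
  fertilizer = factor(c("control", "fertilized")),
  watering   = factor(c("dry", "watered")),
  rep        = 1:10
)
mu <- with(dat, 10 + 2 * (fertilizer == "fertilized") + 3 * (watering == "watered") +
                 4 * (fertilizer == "fertilized") * (watering == "watered"))
dat$growth <- mu + rnorm(nrow(dat), 0, 2)

# '*' fits main effects plus the interaction; 'fertilizer:watering' would denote the interaction only
model.int <- lm(growth ~ fertilizer * watering, data = dat)
anova(model.int)     # sequential F-tests including the interaction term

# interaction plot as in Fig. 11.1
with(dat, interaction.plot(fertilizer, watering, growth))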
Testing of linear models and their terms
Statistical significance testing of linear models (as whole predictor structures) using e.g. an F-test is easy and largely follows the same principles as in simple regression. It is more difficult to decide which predictors to include in the model and which not. Finding the best model is done by a model selection procedure, which aims at finding a model that contains only the predictors with a significant effect on the response, while those whose effect is non-significant (i.e. which do not contribute to the predictive power of the model) are left out. Such models are called minimum adequate models. Philosophically, they are based upon the principle of Occam's razor or parsimony (https://en.wikipedia.org/wiki/Occam%27s_razor). Statistical methods are very efficient when applied to model testing and/or comparisons between models. However, there are no universal guidelines which could be used for model/predictor selection in all cases. In models with few candidate predictors, it is possible to fit all possible models and select the one with the highest explanatory power. Frequently (but certainly not always), simple testing of the significance of individual predictors (which is based on a statistical comparison between models excluding and including a given predictor) can also be used. For efficient model selection, we need 1. a measure of model quality (or of quality comparison between models) and 2. a strategy for building the model.

Measures of model quality
There are several measures of model quality or parameters for model comparison.
F-test: the F-test may not only be used to test the significance of a model but also to test whether one model is significantly better than another. It works generally well for models with up to a moderate number of observations (~200). With large n, almost all predictors tend to be significant, even if they explain very little variability in the response.
R²: the proportion of explained variation is a property of the model itself and is easy to interpret. For model comparisons, its main disadvantage is that the addition of more predictors always increases R², even if the added predictor has little effect. As such, it is not suitable for comparing models of different complexity.
AIC (Akaike information criterion; https://en.wikipedia.org/wiki/Akaike_information_criterion): this measure is derived from information theory and allows straightforward comparisons of model quality. Models with a lower AIC value are better. The AIC is computed using the following formula:
AIC = 2k - 2log(L)
where k is the number of parameters of the model and log(L) is the log-likelihood of the model.
Likelihood ratio: the likelihood ratio is a very general approach which can be used to compare many types of models. It is based on the principle that the logarithm of the likelihood ratio (which numerically equals the difference between the log-likelihoods) multiplied by 2 follows the χ² distribution; thus a goodness-of-fit test may be used for comparing models differing in their numbers of df.

Model building strategies
There are several options how to build a model. Theoretically, the best way would be to fit all possible models and choose the best fitting one based e.g. on AIC.
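Before turning to the building strategies themselves, here is a minimal sketch of how the quality measures above can be obtained in R. The built-in swiss dataset and the two nested models are chosen purely for illustration; they are not part of the course data.

# two nested candidate models on the built-in swiss data
model.1 <- lm(Fertility ~ Education, data = swiss)
model.2 <- lm(Fertility ~ Education + Agriculture, data = swiss)

anova(model.1, model.2)          # F-test: is the larger model significantly better?
summary(model.2)$r.squared       # R2 (always increases when predictors are added)
summary(model.2)$adj.r.squared   # adjusted R2
AIC(model.1, model.2)            # the model with the lower AIC is better
logLik(model.2)                  # the log-likelihood entering the AIC formula above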
However, the number of possible models can be very large (it increases with the number of predictors and the complexity of the interaction terms), and fitting all of them may be demanding in terms of computing power (with the increasing availability of big data, even with current fast computers). Therefore, it may be useful to use a pragmatic approach to model building. There are two reasonable approaches, each of which has its advantages and disadvantages - forward and backward selection.
Forward selection starts with the null (intercept-only) model. The next step involves testing every model containing a single predictor against the null model. Such comparisons are indicative of the explanatory power of the individual predictors, and on this basis the best-fitting predictor (judged by R² or AIC) can be added to the model. In the next step, the model containing the selected predictor is used as the null model against which the other predictors are tested, and so on, until no significant candidate predictor is left. With two or more predictors in the model, interactions between the predictors may also be tested for inclusion in the model. Advantages of this approach are its intuitiveness and the possibility to use a large number of candidate predictors (though the multiple-testing issue should be considered here). However, there are also disadvantages, including an often high risk of selecting a non-optimal model due to constraints of the procedure. Still, forward selection is a reasonable choice for observational data, in particular when a large number of predictors is available.
Backward selection uses the opposite strategy - first, a saturated model is fitted (i.e. a model containing all candidate predictors together with all their interactions; these may be limited up to a specified order). Non-significant terms are then removed from the model one by one, starting with the poorest predictors (again measured e.g. by AIC). Note that in the case of a significant interaction, the main effects are retained in the model even if they are not significant themselves (if the same model was built up in a forward manner, such interactions would never be tested).

Correlation between predictors
Correlation between predictors is a serious issue in multiple regression analysis. This issue concerns observational data, because in experimental studies we should use an experimental design which ensures the independence of the tested predictors. The problem is that if there are two inter-correlated candidate predictors to be included in the model, one of them may be included just by chance (because it may look slightly better with the given data). The other predictor will then never be included in the model, because its effect is already accounted for by the first one. Depending on the actual data, either of the two predictors may be included while the other is left out. Such inconsistency may lead to very different conclusions, even if the relationships between the variables are the same and the data differ only slightly. Such cases are quite common in nature; e.g. soil pH and Ca concentration represent a common case in ecological studies. Unfortunately, none of the model-building strategies or model quality measures can control this. However, a detailed exploration of the associations between the predictors themselves and between individual predictors and the response may be useful. As part of this exploration, we may first analyze marginal or simple effects - i.e. effects of a given predictor on the response which ignore the effects of other variables.
These are simple linear regressions (or one-way ANOVAs) and are indicative of the correlation structure in the study system. Conversely, partial or conditional effects (i.e. the unique effects of individual predictors) can be computed by testing a given predictor against a model containing all the other predictors. Such effects are greatly affected by predictor inter-correlation, but if significant, they may really point to the mechanisms underlying the correlations. Computing marginal and partial effects is then part of a more general approach called variation partitioning. With this approach, you can describe the correlation structure among the predictors (or, frequently, groups of predictors) and quantify their unique or shared effects on the response variable.

How to do in R
1. Fitting a model - function lm (see chapter 9 for the basics). Individual predictors are included on the predictor side of the formula, separated either by + (additive effects) or by * (additive and interactive effects).
2. Testing candidate predictors to be included in the model - function add1; e.g. add1(lm.model, scope =~predictor1*predictor2, ...). The parameter test is then used to specify the model quality criterion. AIC is always displayed; for ordinary linear models, it makes sense to ask for an F-test by setting test="F".
3. Testing predictors to be removed from the model - function drop1. Its use is similar to add1, just the parameter scope is not specified.
4. Changing the model structure - function update; adding a predictor: new.model<-update(old.model, .~.+added.predictor); removing a predictor: new.model<-update(old.model, .~.-removed.predictor). update can also be used to change other parameters of a model.
5. Comparison of model quality - function anova; e.g. anova(model1, model2) compares the two models.
6. Testing individual terms - anova(lm.model) displays sequential F-tests for the individual terms. Sequential testing means that the order of the predictors affects the results (unless the predictors are perfectly independent - orthogonal). summary(lm.model) displays detailed model statistics - an F-test of the whole model and t-tests of the individual regression coefficients. These t-tests are not sequential and thus are independent of the term order in the model.
7. Model coefficients may be called by function coef - i.e. coef(lm.model).
8. Model residuals may be called by function resid - i.e. resid(lm.model).
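Finally, a minimal sketch of forward and backward selection with add1, drop1 and update. The built-in swiss data and the choice of only two candidate predictors are used purely for illustration; a real analysis would typically involve more candidates.

# candidate predictors: Education and Agriculture (plus their interaction)
null.model <- lm(Fertility ~ 1, data = swiss)                        # intercept-only model
full.model <- lm(Fertility ~ Education * Agriculture, data = swiss)  # saturated model

# forward selection: test candidate predictors against the current model
add1(null.model, scope = ~ Education * Agriculture, test = "F")      # F-tests and AIC of the candidates
model.f <- update(null.model, . ~ . + Education)                     # add the best-fitting predictor
add1(model.f, scope = ~ Education * Agriculture, test = "F")         # repeat until nothing significant is left

# backward selection: start from the saturated model and drop the poorest terms one by one
drop1(full.model, test = "F")

# sequential tests of the terms and the coefficient t-tests of the selected model
anova(model.f)
summary(model.f)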