5. Hypothesis testing and goodness-of-fit test

Scientific statements

In chapter 1, I explained that science consists of theories and these comprise hypotheses. Scientists formulate these hypotheses as universal statements describing the world, but they never know whether a hypothesis is true until it is rejected based on the empirical evidence. This makes science an infinite process of searching for the truth, which we hopefully approach but never know whether we have reached.

Let's now return to the term universal statement, which I used in the previous paragraph and in chapter 1, as it is crucial for understanding how empirical science works and how hypothesis testing proceeds. Statements describing the world can be classified into two classes:

1. Universal statements apply generally to all objects concerned. E.g. "All (adult) swans are white" is a universal statement. It can be converted to a negative form: "Swans of a color other than white do not exist." You can see that universal statements prohibit certain patterns or events (e.g. observing a black swan here); therefore, they have the form of "natural laws". They can also be used to make predictions: if the white-swan hypothesis is true, the next swan I see will be white (and this does not depend on how many white swans I saw before). A universal statement cannot be verified, i.e. confirmed to be true. We would need to inspect the color of all swans living on the Earth (and in the Universe) to do so (which is not very realistic), and even if we did, we could never be sure that the next baby swan hatching from an egg would not turn out to be other than white at adulthood. By contrast, it is very easy to reject such a universal statement on the basis of empirical evidence: observing a single swan of a color other than white is sufficient.

2. Singular statements are asserted only about specific objects. E.g. "The swan I see is white." Such a statement refers to a particular swan and does not predict anything about other swans. A specific class of singular statements are existential statements, which can be derived from singular ones. The fact that I see a white swan (a singular statement) can be used to infer that there is at least one swan which is white, i.e. that white swans do exist. Based on the previous paragraph, you would probably not consider such a statement particularly novel, since it agrees with the universal statement about white swans. However, seeing a single black swan (Fig. 5.1) completely changes the situation. It means that at least one black swan exists and that the universal statement about white swans is not true. In general terms, this existential statement rejected the universal statement.

To sum up, scientific hypotheses must have the form of universal statements in order to have predictive power, which we need to explain patterns in nature. They cannot be verified, but they can be rejected by empirical existential statements which are in conflict with the prediction the hypothesis makes.

Fig. 5.1 A black swan in Perth (Western Australia).

Hypotheses and their testing

Empirical science is largely a process of hypothesis testing. This means searching for conflicts between the predictions of hypotheses and collected/measured data. Once a hypothesis is rejected, a new hypothesis can be formulated to replace the old one. Note here that there is no "objective" way to formulate new hypotheses – they are rather genuine guesses.
An important implication of this is that for every scientific theory or hypothesis, it should be possible to define singular observations which, if they exist, would reject it. This means that each scientific hypothesis must be falsifiable. Universal statements that are not falsifiable may be components of art, religion or pseudoscience, but definitely not of science. Various conspiracy theories also belong to this class. Such statements need not be only dogmatic; they may also be tautological. An example is the recently published theory of stability-based sorting in evolution (https://www.ncbi.nlm.nih.gov/pubmed/28899756), a "theory" which says that evolution operates with stability, i.e. organisms and traits which are more stable persist for longer. The problem is that long persistence is a synonym for stability. So in fact the theory says "What is stable is stable". Not very surprising. The authors declare that the theory explains everything (see the end of the abstract), and this is indeed true, but the problem is that the theory neither produces any useful predictions nor can be tested by empirical data.

If we select only hypotheses which are falsifiable, and as such can be considered scientific statements, we may discover that there are multiple theories without any conflict with the data. It is natural to ask which one to choose over the others. Here, we should use the Occam's razor principle (https://en.wikipedia.org/wiki/Occam%27s_razor) and use the simplest (and also most universal and most easily falsifiable) hypothesis available. This is also termed the "minimum adequate model" – i.e. choose the model with the minimum number of parameters which fits the data adequately.

Note on specifics of biology and ecology

Biological and ecological systems display high complexity arising from an interplay among complicated biochemical processes, evolutionary history and ecological interactions. As a result, quite a large proportion of the research is exploratory, aiming to discover effects which have not been anticipated yet. Therefore, no previous theory could have informed us about them, or such information on the absence of an effect would be simply redundant. In these cases, we test a universal statement that the effect under investigation is zero.

Hypothesis testing with statistics

In statistics, we work with numbers and probabilities. Therefore, we do not record clear-cut evidence to reject a hypothesis as in the example with the swans. In other words, even improbable events do happen by chance, and observing them may not be sufficient evidence to reject a hypothesis. A general statistical testing procedure involves the computation of a test statistic. This statistic measures the discrepancy between the prediction of the null hypothesis and the data, taking into account also the strength of the evidence based on the number of observations. The test statistic is a random variable following a certain theoretical distribution. As a result, the probability of observing the actual data, or data that differ even more from the null hypothesis expectation, can be quantified. If this probability (called the p-value) is below a certain threshold, we can justify rejection of the null hypothesis.

The probability of observing certain data under the null hypothesis can be very low, but never zero. As a result, we are left with uncertainty about whether we made the right decision when rejecting or retaining the null hypothesis. We may either make the right decision or make an error (Table 5.1).

Table 5.1. Possible outcomes of hypothesis testing by statistical tests. H0 = null hypothesis.

                                Reality
Our decision                    H0 is true        H0 is false
Reject H0                       Type I error      OK
Do not reject H0                OK                Type II error

Two types of error can be made, of which the type I error is the more harmful, as it means rejection of a null hypothesis which is true. Such false positive evidence is misleading and may obscure the scientific research of a given topic. By contrast, a type II error (false negative) is typically invisible to anybody except the researchers themselves, because results not rejecting the null hypothesis are not published.

Statistical tools can quite precisely control the probability of making a type I error by setting an a-priori limit for the p-value. Typically, this limit, called the level of significance (α), is set to α = 0.05 (5%). If the p-value resulting from the test is higher than that, the null hypothesis cannot be rejected. Note here that such a non-significant result does not mean that the null hypothesis is true. Non-significant results are indicative of absence of evidence, not of evidence of absence of an effect.

Concerning the type II error (the probability of which is denoted β), statistical inference is less informative. It can be quantified in some controlled experiments, but its precise value is not of particular interest. Instead, a useful concept is the power of the test, which equals 1 – β; its relative rather than absolute size is what matters. Power of the test increases with sample size and with increasing α, i.e. if the tester accepts an elevated risk of type I error.
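To make the meaning of α and of power more concrete, here is a minimal simulation sketch in R (an illustration of my own, not part of the original example; it uses the chisq.test function introduced later in this chapter). Data are repeatedly generated either under a true null hypothesis (a 3:1 ratio) or under an alternative, and the proportion of rejections at α = 0.05 is counted.

# Sketch: type I error rate and power of a goodness-of-fit test, estimated by simulation
set.seed(1)                     # for reproducibility
alpha <- 0.05                   # level of significance
reject_rate <- function(prob_true, n = 48, n_sim = 10000) {
  p_vals <- replicate(n_sim, {
    counts <- c(rmultinom(1, size = n, prob = prob_true))  # simulated observed frequencies
    chisq.test(x = counts, p = c(3/4, 1/4))$p.value        # test against the 3:1 null
  })
  mean(p_vals < alpha)          # proportion of rejections
}
reject_rate(c(3/4, 1/4))        # null hypothesis true: rejection rate close to 0.05 (type I error rate)
reject_rate(c(0.6, 0.4))        # null hypothesis false: rejection rate estimates the power (1 - beta)

Increasing n in this sketch raises the second rejection rate, in line with the statement that power grows with sample size.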
Goodness-of-fit test

Let's have a look at an example of a statistical test. One of the most basic statistical tests is called the goodness-of-fit test (sometimes, inappropriately, the chi-square test, following the name of the test statistic). It is particularly suitable for testing frequencies (counts) of categorical data, though the χ² distribution is quite universal and approximates e.g. the distribution of the very general likelihood ratio. The formula is:

χ² = Σ (O – E)² / E

where O indicates the observed frequencies, E indicates the frequencies expected under the null hypothesis, and the sum runs over all categories under investigation. The χ² value is subsequently compared with the corresponding χ² distribution to determine the p-value. There are many χ² distributions, which differ in the number of degrees of freedom (DF; Fig. 5.2). The DF is a more general concept common to all statistical tests, as it quantifies the size of the data and/or the complexity of the model. Here, it is important to know that for the ordinary goodness-of-fit test: DF = number of categories – 1.

Fig. 5.2 Probability densities of two χ² distributions differing in the number of degrees of freedom. Dashed lines indicate cut-off values for 0.05 probabilities on the upper tail.
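The cut-off values of the kind indicated in Fig. 5.2 are quantiles of the χ² distribution. As a small sketch (the DF values below are arbitrary examples, not necessarily those plotted in the figure), they can be obtained in R:

qchisq(0.05, df = 1, lower.tail = FALSE)   # cut-off for DF = 1: about 3.84
qchisq(0.05, df = 5, lower.tail = FALSE)   # cut-off for DF = 5: about 11.07
# Any chi-square value above the cut-off corresponds to p < 0.05 on the upper tail:
pchisq(3.9, df = 1, lower.tail = FALSE)    # about 0.048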
Goodness-of-fit test example

A typical application of the goodness-of-fit test is in genetics, as demonstrated in the following example. You are a geneticist interested in testing the Mendelian rules. To do so, you cross red- and white-flowering bean plants. Red is dominant and white recessive, so in the F1 generation you only get red-flowering individuals. You cross these and get 44 red-flowering and 4 white-flowering individuals in the F2 generation. What can you say about the universal validity of the second Mendelian rule (which predicts a 3:1 ratio between dominant and recessive phenotypes) at the level of significance α = 0.05?

First, you need to calculate the expected frequencies. These are:

Ered = 48 × 3/4 = 36
Ewhite = 48 × 1/4 = 12

Then, computation of the test statistic follows:

χ² = (44 – 36)²/36 + (4 – 12)²/12 = 7.11
DF = 1
p(χ² = 7.11, DF = 1) = 0.0077

Conclusion (to be written in the text): Heredity in our bean-crossing system is significantly different from the second Mendelian rule (χ² = 7.11, DF = 1, p = 0.0077). As a result, the second Mendelian rule is not universally true.

Here you can see that our experiment produced a singular statement about the numbers of bean plants. This was translated by the statistics into an existential statement that at least one genetic system (ours) exists which does not follow the Mendelian rule. This was then used to reject the universal statement.

How to do in R

Goodness-of-fit test: chisq.test
Parameter x is used for inputting the observed frequencies.
Parameter p is used for inputting the null hypothesis-derived probabilities.

Example with output:

chisq.test(x=c(44,4), p=c(3/4,1/4))

        Chi-squared test for given probabilities

data:  c(44, 4)
X-squared = 7.1111, df = 1, p-value = 0.007661

Probabilities of the χ² distribution can be computed by pchisq (do not forget to set lower.tail=F to get the p-value).

pchisq(7.11, df=1, lower.tail = F)
[1] 0.007665511
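As a cross-check (a small sketch of my own, not part of the original output), the same numbers can be reproduced step by step from the formula above:

O    <- c(44, 4)                     # observed frequencies (red, white)
E    <- sum(O) * c(3/4, 1/4)         # expected frequencies under the 3:1 null: 36 and 12
chi2 <- sum((O - E)^2 / E)           # test statistic: 7.1111
pchisq(chi2, df = length(O) - 1, lower.tail = FALSE)   # p-value: 0.007661, as reported by chisq.test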