5. Hypothesis testing and pattern detection; goodness-of-fit test

Scientific statements

In chapter 1, I explained that science consists of theories, and these comprise hypotheses. Scientists formulate these hypotheses as universal statements describing the world, but they can never know whether a hypothesis is true; they can only learn that it is false once it is rejected based on empirical evidence. This makes science an infinite process of searching for truth, which we hopefully approach but never know whether we have reached.

Let’s now return to the term universal statement, which I used in the previous paragraph and in chapter 1, because it is crucial for understanding how empirical science works and how hypothesis testing proceeds. Statements describing the world can be classified into two classes:

1. Universal statements generally apply to all objects concerned. E.g. “All (adult) swans are white” is a universal statement. This can be converted to a negative form: “Swans of other colors than white do not exist.” You can see that universal statements prohibit certain patterns or events (e.g. observing a black swan here); therefore, they have the form of “natural laws”. They can also be used to make predictions. If the white swan hypothesis is true, the next swan I see will be white (and this does not depend on how many white swans I saw before). A universal statement cannot be verified, i.e. confirmed to be true. We would need to inspect the color of all swans living on the Earth (and in the Universe) to do so, and even if we did, we could never be sure that the next baby swan hatching from an egg would not be a color other than white at adulthood. By contrast, it is very easy to reject such a universal statement based on empirical evidence. Observing a single swan of a color other than white is sufficient for that.

2. Singular statements are asserted only on specific objects. E.g. “The swan I see is white.” Such a statement refers to a particular swan and does not predict anything about other swans. A specific class of statements are existential statements, which can be derived from singular ones. The fact that I see a white swan (singular statement) can be used to infer that there is at least one swan that is white, i.e. white swans do exist. Based on the previous paragraph, you would probably not consider this anything novel since it agrees with the universal statement on white swans. However, seeing a single black swan (Fig. 5.1) changes the situation completely. It means that at least one black swan exists and that the universal statement on white swans is not true. In general terms, this existential statement rejects the universal statement.

To sum up, a scientific hypothesis must have the form of a universal statement in order to have predictive power, which we need to explain patterns in nature. Such hypotheses cannot be verified but can be rejected by empirical existential statements that conflict with the prediction of the hypothesis.

Fig. 5.1 A black swan in Perth (Western Australia).

Hypotheses and their testing

Empirical science is essentially the process of hypothesis testing, which means searching for conflicts between the predictions of hypotheses and collected/measured data. Once a hypothesis is rejected, a new hypothesis can be formulated to replace the old one. Note here that there is no “objective” way to formulate new hypotheses; they are rather genuine guesses.

An important implication is that for every scientific theory or hypothesis it should be possible to define singular observations that, if they exist, would reject it. This means that each scientific hypothesis must be falsifiable. Universal statements that are not falsifiable may be components of art, religion, or pseudoscience but definitely not of science. Various conspiracy theories also belong to this class.
These statements need not be only dogmatic; they may also be tautological. An example is the recently published theory of stability-based sorting in evolution (https://www.ncbi.nlm.nih.gov/pubmed/28899756), a “theory” which says that evolution operates with stability, i.e. organisms and traits which are more stable persist for longer. The problem is that long persistence is a synonym for stability. Thus, this theory says, “What is stable is stable”, which is not very surprising. The authors declare the theory to explain everything (see the ending of the abstract), and this is indeed true. Still, the problem is that the theory neither produces any useful predictions nor can be tested by empirical data.

If we select only hypotheses that are falsifiable and can be considered scientific statements, we may discover that there are multiple theories without any conflicts with the data. It is natural to ask which one to choose over the others. Here, we should apply the principle of Occam’s razor (https://en.wikipedia.org/wiki/Occam%27s_razor) and use the simplest (and also most universal and most easily falsifiable) hypothesis available. This is also termed the “minimum adequate model”, i.e. choose the model with the minimal number of parameters that adequately matches the data.

Pattern detection

Biological and ecological systems display high complexity arising from an interplay among complicated biochemical processes, evolutionary history, and ecological interactions. As a result, quite a large proportion of the research is exploratory, aiming at discovering effects that were not anticipated yet. Therefore, no previous theory could have informed about them, or a statement on the absence of such effects would be simply redundant. These are special cases of hypothesis testing, which can be called pattern detection. In pattern detection tests, we test the universal statement that the effect under investigation is zero (e.g. there is no correlation between two quantitative variables). Rejecting such a statement (the null hypothesis) means that our observations are significantly different from what could be observed just by chance, i.e. we demonstrate the significance of a singular statement, and this can consequently be used to formulate a new universal hypothesis.

Hypothesis testing with statistics

In statistics, we work with numbers and probabilities. Therefore, we do not record clear-cut evidence to reject a hypothesis as in the example with swans. In other words, even unlikely events may happen by chance, and their observation may not be sufficient evidence to reject a hypothesis.

A general statistical testing procedure involves the computation of a test statistic. The test statistic measures the discrepancy between the prediction of the null hypothesis and the data, also considering the strength of the evidence based on the number of observations. The test statistic is a random variable, which follows a particular theoretical distribution if the null hypothesis is true. As a result, the probability of observing the actual data, or data that differ even more from the null hypothesis expectation, can be quantified. If this probability (called the p-value) is below a certain threshold, we can justify the rejection of the null hypothesis.

The probability of observing specific data under the null hypothesis can be very low but never zero. As a result, we are left with uncertainty concerning whether we made the right decision when rejecting or retaining the null hypothesis. In general, we may either make the right decision or make an error (Table 5.1).

Table 5.1. Possible outcomes of hypothesis testing by statistical tests. H0 = null hypothesis.

Our decision     | Reality: H0 is true | Reality: H0 is false
Reject H0        | Type I error        | OK
Do not reject H0 | OK                  | Type II error

Two types of error can be made, of which type I error is more harmful because it means rejection of a null hypothesis which is actually true.
This is called false positive evidence. It is misleading and may even obscure the scientific research of a given topic. By contrast, type II error (false negative) is typically invisible to anybody except the researcher, because results not rejecting the null hypothesis are usually not published.

Statistical tools can precisely control the probability of making a type I error by setting an a priori limit for the p-value. This limit, called the level of significance (α), is typically set to α = 0.05 (5%). If the p-value resulting from the testing is higher than that, the null hypothesis cannot be rejected. Note here that such a non-significant result does not mean that the null hypothesis is true. Non-significant results indicate the absence of evidence, not evidence of the absence of an effect.

Concerning type II error (the probability of which is denoted β), statistical inference is less informative. It can be quantified in some controlled experiments, but its precise value is not of particular interest. Instead, a useful concept is the power of the test, which equals 1 - β, and in particular its relative rather than absolute size. The power of the test increases with sample size and with increasing α, i.e. if the tester accepts an elevated risk of type I error.

Goodness-of-fit test

Let’s have a look at an example of a statistical test. One of the most basic statistical tests is the goodness-of-fit test (sometimes inappropriately called the chi-square test after the name of the test statistic). It is particularly suitable for testing frequencies (counts) of categorical data, although the χ² distribution is quite universal and approximates e.g. the very general likelihood ratio. The formula is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O indicates observed frequencies and E indicates frequencies expected under the null hypothesis. The sum is taken over all categories under investigation.
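The formula can be followed step by step in R. This is only a minimal sketch: the counts and the null-hypothesis probabilities are hypothetical values chosen for illustration, and the variable names (observed, null_prob, expected, chi_sq) are not part of any R function.

```r
# Hypothetical observed counts in three categories (illustration only)
observed <- c(30, 50, 20)

# Hypothetical probabilities of the categories under the null hypothesis
null_prob <- c(0.25, 0.5, 0.25)

# Expected frequencies: total count split according to the null hypothesis
expected <- sum(observed) * null_prob

# Test statistic: (O - E)^2 / E summed over all categories
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq
```

Each category contributes its own squared deviation scaled by the expected frequency, so categories with small expected counts weigh more heavily in the sum.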
The χ² value is subsequently compared with the corresponding χ² distribution to determine the p-value. There are many χ² distributions that differ in the number of degrees of freedom (DF; Fig. 5.2). The DF is a more general concept common to all statistical tests as it quantifies the size of the data and/or the complexity of the model. Here, it is important to know that for the ordinary goodness-of-fit test:

DF = number of categories - 1

Fig. 5.2 Probability densities of two χ² distributions differing in the number of degrees of freedom. Dashed lines indicate cut-off values for 0.05 probabilities on the upper tail.

Goodness-of-fit test example

A typical application of the goodness-of-fit test is in genetics, as demonstrated in the following example:

You are a geneticist interested in testing the Mendelian rules. To do so, you cross red and white flowering bean plants. Red is dominant and white recessive, so in the F1 generation, you only get red flowering individuals. You cross these and get 44 red flowering and 4 white flowering individuals in the F2 generation. What can you say about the universal validity of the second Mendelian rule (which predicts a 3:1 ratio between dominant and recessive phenotypes) at the level of significance α = 0.05?

First, you need to calculate the expected frequencies. These are:

E(red) = 48 × 3 / 4 = 36
E(white) = 48 × 1 / 4 = 12

Then, the computation of the test statistic follows:

χ² = (44 - 36)² / 36 + (4 - 12)² / 12 = 7.11
DF = 1
p(χ² = 7.11, DF = 1) = 0.0077

Conclusion (to be written in the text): Heredity in our bean-crossing system is significantly different from the second Mendelian rule (χ² = 7.11, DF = 1, p = 0.0077). As a result, the second Mendelian rule is not universally true.

Here you can see that our experiment produced a singular statement on the number of bean plants. The statistics translated this into an existential statement that at least one (our) genetic system exists which does not follow the Mendelian rule.
This was then used to reject the universal statement.

How to do in R

Goodness-of-fit test: chisq.test
Parameter x is used for inputting the observed frequencies.
Parameter p is used for inputting the null hypothesis-derived probabilities.

Example with output:

    chisq.test(x = c(44, 4), p = c(3/4, 1/4))

        Chi-squared test for given probabilities

    data:  c(44, 4)
    X-squared = 7.1111, df = 1, p-value = 0.007661

Probabilities of the χ² distribution can be computed by pchisq (do not forget to set lower.tail = FALSE to get the p-value, i.e. the upper-tail probability).

    pchisq(7.11, df = 1, lower.tail = FALSE)
    [1] 0.007665511
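The hand calculation from the bean example can also be retraced in R and checked against chisq.test. The following is a minimal sketch rather than a prescribed workflow: the variable names are chosen here for illustration, and the use of qchisq to obtain the 0.05 cut-off value is an addition that is not part of the original how-to.

```r
# Observed F2 counts from the bean-crossing example
observed <- c(red = 44, white = 4)

# Probabilities predicted by the 3:1 Mendelian ratio (null hypothesis)
null_prob <- c(red = 3/4, white = 1/4)

# Expected frequencies under the null hypothesis
expected <- sum(observed) * null_prob

# Test statistic and degrees of freedom (number of categories - 1)
chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1

# p-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cut-off value of the test statistic for alpha = 0.05
critical <- qchisq(0.95, df = df)

# The same test done by chisq.test; results are stored in the returned object
test <- chisq.test(x = observed, p = null_prob)

# Compare the manual calculation with the built-in test
c(manual = chi_sq, builtin = unname(test$statistic))
c(manual = p_value, builtin = test$p.value)

# TRUE means the null hypothesis is rejected at alpha = 0.05
chi_sq > critical
```

Comparing the test statistic with the cut-off value and comparing the p-value with α lead to the same decision; the p-value is usually what gets reported in the text.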