5. Hypothesis testing and pattern detection; goodness-of-fit test
Scientific statements
In chapter 1, I explained that science consists of theories, and these comprise hypotheses.
Scientists formulate these hypotheses as universal statements describing the world, but they
can never know that a hypothesis is true; they can only find out that it is false when it is
rejected based on empirical evidence. This makes science an infinite process of searching for
truth, which we hopefully approach without ever knowing whether we have reached it.
Let’s now return to the term universal statement, which I used in the previous paragraph
and in chapter 1, because it is crucial for understanding how empirical science works and
how hypothesis testing proceeds. Statements describing the world can be classified into two
classes:
1. Universal statements generally apply to all objects concerned. E.g. “All (adult) swans
are white” is a universal statement. This can be converted to a negative form: “Swans
of other colors than white do not exist.” You can see that the universal statements
prohibit certain patterns or events (e.g. observing a black swan here); therefore, they
have the form of “natural laws”. They can also be used to make predictions. If the
white swan hypothesis is true, the next swan I will see will be white (and this is not
dependent on how many white swans I saw before). A universal statement cannot be
verified, i.e. confirmed to be true. We would need to inspect the color of all swans
living on the Earth (and in the Universe) to do so, and even if we did, we could never
be sure that the next baby swan hatching from an egg would not be a color other than
white at adulthood. By contrast, it is very easy to reject such a universal statement
based on empirical evidence. Observing only a single swan of another color than
white is sufficient for that.
2. Singular statements are asserted only about specific objects. E.g. “The swan I see is
white.” Such a statement refers to a particular swan and does not predict anything
about other swans. A specific class of singular statements are existential statements
that can be derived from singular ones. The fact that I see a white swan (singular
statement) can be used to infer that there is at least one swan that is white, i.e. white
swans do exist. Based on the previous paragraph, you would probably not consider
this novel since it agrees with the universal statement on white swans. However,
seeing a single black swan (Fig 5.1) changes the situation completely. It means that at
least one black swan exists and that the universal statement on white swans is not
true. In general terms, this existential statement rejects the universal statement.
To sum up, a scientific hypothesis must have the form of a universal statement in order to
have predictive power, which we need to explain patterns in nature. Such hypotheses cannot
be verified but can be rejected by empirical existential statements that conflict with the
predictions of the hypothesis.
Fig. 5.1 A black swan in Perth (Western Australia).
Hypotheses and their testing
Empirical science is essentially the process of hypothesis testing, which means searching for
conflicts between predictions of hypotheses and collected/measured data. Once a
hypothesis is rejected, a new hypothesis can be formulated to replace the old one. Note
here that there is no “objective” way to formulate new hypotheses – they are rather genuine
guesses.
An important implication of this is that it should be possible to define singular
observations for every scientific theory or hypothesis that, if they exist, would reject it. This
means that each scientific hypothesis must be falsifiable. Universal statements that are not
falsifiable may be components of art, religion, or pseudoscience but definitely not of science.
Various conspiracy theories also belong to this class. These statements need not be only
dogmatic; they may also be tautological. An example is the recently published theory
of stability-based sorting in evolution (https://www.ncbi.nlm.nih.gov/pubmed/28899756), a
“theory” which says that evolution operates with stability, i.e. organisms and traits which are
more stable persist for longer. The problem is that long persistence is a synonym for
stability. Thus, this theory says, “What is stable is stable” - not very surprising. The authors
declare the theory to explain everything (see the ending of the abstract), and this is indeed
true. Still, the problem is that the theory neither produces any useful predictions nor can be
tested by empirical data.
If we select only hypotheses that are falsifiable and can be considered scientific statements,
we may discover that there are multiple theories without any conflicts with the data. It is
natural to ask which one to choose over the others. Here, we should apply the
Occam’s razor (https://en.wikipedia.org/wiki/Occam%27s_razor) principle and choose the
simplest (and also most universal and most easily falsifiable) hypothesis available. This is also
termed the “minimum adequate model”, i.e. choose the model with the minimal number of
parameters that adequately matches the data.
Pattern detection
Biological and ecological systems display high complexity arising from an interplay among
complicated biochemical processes, evolutionary history, and ecological interactions. As a
result, quite a large proportion of research is exploratory, aiming at discovering effects
that have not been anticipated yet. No previous theory could therefore have predicted
them, and a theory merely stating the absence of an effect would be redundant. Such studies
are special cases of hypothesis testing, which can be called pattern detection. In pattern
detection tests, we test the universal statement that the effect under investigation is zero
(e.g. there is no correlation between two quantitative variables). Rejecting such a statement
(null hypothesis) means that our observations are significantly different from what could be
observed just by chance, i.e. we demonstrate the significance of a singular statement, and
this can consequently be used to formulate a new universal hypothesis.
Hypothesis testing with statistics
In statistics, we work with numbers and probabilities. Therefore, we do not record clear-cut
evidence to reject a hypothesis as in the example with swans. In other words, even unlikely
events may happen by chance, and their observation may not be sufficient evidence to
reject a hypothesis.
A general statistical testing procedure involves the computation of a test statistic. The test
statistic measures the discrepancy between the prediction of the null hypothesis and the
data, also considering the strength of the evidence based on the number of observations.
The test statistic is a random variable, which follows a particular theoretical distribution if
the null hypothesis is true. As a result, the probability of observing the actual data or data
that differ even more from the null hypothesis expectation can be quantified. If this
probability (called the p-value) is below a certain threshold, we can justify the rejection of
the null hypothesis.
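The procedure described above can be sketched with a small simulation. This is only an illustration under assumed numbers: a hypothetical experiment with 100 coin tosses yielding 62 heads, a null hypothesis of a fair coin, and the absolute deviation from the expected count as a simple test statistic. The null distribution is approximated by simulation rather than taken from theory.

```r
set.seed(1)                        # make the simulation reproducible
n_tosses <- 100                    # hypothetical sample size
observed_heads <- 62               # hypothetical observation
expected_heads <- n_tosses * 0.5   # prediction of H0 (fair coin)

# Test statistic: discrepancy between the data and the H0 prediction
observed_stat <- abs(observed_heads - expected_heads)

# Distribution of the test statistic when H0 is true (approximated by simulation)
null_heads <- rbinom(100000, size = n_tosses, prob = 0.5)
null_stat <- abs(null_heads - expected_heads)

# p-value: share of null outcomes at least as extreme as the observed one
p_sim <- mean(null_stat >= observed_stat)
p_sim
```

The simulated p-value should be close to the exact one returned by binom.test(62, 100).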
The probability of observing specific data under the null hypothesis can be very low but
never zero. As a result, we are left with uncertainty concerning whether we made the right
decision when rejecting or retaining the null hypothesis. In general, we may either make the
right decision or make an error (Table 5.1).
Table 5.1. Possible outcomes of hypothesis testing by statistical tests. H0 = null hypothesis

                          Reality
Our decision      H0 is true       H0 is false
Reject H0         Type I error     OK
Not reject H0     OK               Type II error
Two types of error can be made, of which type I error is more harmful because it means
rejecting a null hypothesis that is actually true. This is called false positive evidence. It is
misleading and may even obscure the scientific research on a given topic. By contrast, type II
error (false negative) is typically invisible to anybody except the researcher because results
not rejecting the null hypothesis are usually not published. Statistical tools can precisely
control the probability of making a type I error by setting an a priori limit for the p-value.
This limit, called the level of significance (α), is typically set to α = 0.05 (5%). If the p-value
resulting from the test is higher than that, the null hypothesis cannot be rejected.
Note here that such a non-significant result does not mean that the null hypothesis is true.
Non-significant results indicate the absence of evidence, not evidence of the absence of
an effect.
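How the level of significance controls type I error can be illustrated by simulation. The sketch below is an assumption-laden example, not part of the original text: it repeatedly generates two independent normal variables (so the null hypothesis of zero correlation is true by construction), tests the correlation with cor.test, and counts how often the null hypothesis is rejected at α = 0.05.

```r
set.seed(2)             # make the simulation reproducible
alpha <- 0.05           # level of significance
n_experiments <- 5000   # number of simulated experiments (arbitrary choice)

# Each experiment: 30 observations of two independent variables, so H0 is true
p_values <- replicate(n_experiments, cor.test(rnorm(30), rnorm(30))$p.value)

# Share of experiments in which the true H0 is wrongly rejected (type I error)
type1_rate <- mean(p_values < alpha)
type1_rate
```

The resulting proportion should lie close to α, up to simulation noise.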
Concerning type II error (the probability of which is denoted β), statistical inference is less
informative. It can be quantified in some controlled experiments, but its precise value is not
of particular interest. Instead, a useful concept is the power of the test, which equals 1 – β,
and its relative rather than absolute size. The power of the test increases with sample size
and with increasing α, i.e. if the tester accepts an elevated risk of type I error.
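The relative behavior of power can be explored with a simulation sketch. Everything specific here is assumed for illustration: a true correlation of 0.3 between two normal variables, sample sizes of 20 and 100, and the helper function simulate_p, which is hypothetical and not part of any package.

```r
set.seed(3)   # make the simulation reproducible

# Hypothetical helper: one experiment with a true correlation rho, returns the p-value
simulate_p <- function(n, rho = 0.3) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # population correlation equals rho
  cor.test(x, y)$p.value
}

p_small <- replicate(2000, simulate_p(20))    # small samples
p_large <- replicate(2000, simulate_p(100))   # large samples

# Power = share of experiments in which the false H0 is rejected
power_table <- c(
  n20_alpha05  = mean(p_small < 0.05),
  n100_alpha05 = mean(p_large < 0.05),
  n100_alpha01 = mean(p_large < 0.01)
)
power_table
```

In such a simulation, power should be higher for the larger sample and, at a given sample size, higher for the less strict level of significance.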
Goodness-of-fit test
Let’s have a look at an example of a statistical test. One of the most basic statistical tests is
the goodness-of-fit test (sometimes inappropriately called the chi-square test after the name
of the test statistic). It is particularly suitable for testing frequencies (counts) of categorical
data, although the χ² distribution is quite universal and approximates e.g. the distribution
of the very general likelihood ratio statistic.
The formula is:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where O indicates observed frequencies and E indicates frequencies expected under the
null hypothesis. The sum is taken over all categories under investigation.
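The formula can be written as a short R function. This is a minimal sketch; the function name chi_square_stat and the counts are hypothetical, chosen only to show the arithmetic for three categories with equal expected frequencies.

```r
# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_square_stat <- function(observed, expected) {
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Hypothetical counts in three categories; equal frequencies expected under H0
observed <- c(18, 22, 20)
expected <- rep(sum(observed) / 3, 3)

chi_square_stat(observed, expected)   # (4 + 4 + 0) / 20 = 0.4
```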
The χ² value is subsequently compared with the corresponding χ² distribution to determine
the p-value. There are many χ² distributions that differ in the number of degrees of freedom
(DF; Fig 5.2). The DF is a more general concept common to all statistical tests as it quantifies
the size of the data and/or complexity of the model. Here, it is important to know that for
an ordinary goodness-of-fit test:
DF = number of categories – 1.
Fig. 5.2 Probability densities of two χ² distributions differing in the number of degrees of
freedom. Dashed lines indicate cut-off values for 0.05 probabilities on the upper tail.
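The cut-off values for a given upper-tail probability can be obtained in R with qchisq. The degrees of freedom used below are illustrative choices and are not necessarily those plotted in the figure.

```r
alpha <- 0.05
df_values <- c(1, 2, 5)   # illustrative degrees of freedom

# Critical value: the chi-square value with probability alpha in the upper tail
critical <- qchisq(1 - alpha, df = df_values)
round(critical, 3)

# Check: the upper-tail probability beyond each critical value equals alpha
pchisq(critical, df = df_values, lower.tail = FALSE)
```

A χ² value larger than the critical value for the relevant DF corresponds to a p-value below α.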
Goodness-of-fit test example
A typical application of the goodness-of-fit test is in genetics, as demonstrated in the
following example:
You are a geneticist interested in testing the Mendelian rules. To do so, you cross red and
white flowering bean plants. Red is dominant and white recessive, so in the F1 generation,
you only get red flowering individuals. You cross these and get 44 red flowering and 4 white
flowering individuals in the F2 generation. What can you say about the universal validity of
the second Mendelian rule (which predicts 3:1 ratio between dominant and recessive
phenotypes) at the level of significance α = 0.05?
First, you need to calculate the expected frequencies. These are:
$E_{red} = 48 \times 3/4 = 36$
$E_{white} = 48 \times 1/4 = 12$
Then, the computation of the test statistic follows:
$\chi^2 = (44 - 36)^2/36 + (4 - 12)^2/12 = 7.11$
DF = 1
p(χ² = 7.11, DF = 1) = 0.0077
Conclusion (to be written in the text): Heredity in our bean-crossing system is significantly
different from the second Mendelian rule (χ² = 7.11, DF = 1, p = 0.0077). As a result, the
second Mendelian rule is not universally true.
Here you can see that our experiment produced a singular statement on the number of bean
plants. The statistics translated this into an existential statement that at least one (our)
genetic system exists which does not follow the Mendelian rule. This was then used to reject
the universal statement.
How to do it in R
Goodness-of-fit test: chisq.test
Parameter x is used for inputting the observed frequencies.
Parameter p is used for inputting the probabilities derived from the
null hypothesis.
Example with output:
chisq.test(x=c(44,4), p=c(3/4,1/4))
Chi-squared test for given probabilities
data: c(44, 4)
X-squared = 7.1111, df = 1, p-value = 0.007661
Probabilities of the χ² distribution can be computed by pchisq (do
not forget to set lower.tail=FALSE to get the p-value).
pchisq(7.11, df=1, lower.tail = FALSE)
[1] 0.007665511
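The individual numbers of the bean example can also be taken from the object returned by chisq.test, which may be convenient when writing up the result. The sketch below uses the counts from the example above; the variable names are arbitrary.

```r
observed <- c(red = 44, white = 4)
result <- chisq.test(x = observed, p = c(3/4, 1/4))

result$expected    # expected frequencies under H0 (36 red, 12 white)
result$statistic   # chi-square value
result$parameter   # degrees of freedom
result$p.value     # p-value

# The same statistic computed directly from the formula
by_hand <- sum((observed - result$expected)^2 / result$expected)
by_hand
```

Comparing by_hand with result$statistic is a quick way to confirm that the test was set up with the intended expected frequencies.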