5 Introduction to hypothesis testing

A statistical hypothesis test is a method of making statistical decisions using experimental data. These decisions are almost always made using null-hypothesis tests, i.e. tests that answer the question: "Assuming that the null hypothesis is true, what is the probability of observing a value of the test statistic that is at least as extreme as the value actually observed?" The critical region of a hypothesis test is the set of all outcomes which, if they occur, cause the null hypothesis to be rejected and the alternative hypothesis accepted.

One use of hypothesis testing is deciding whether experimental results contain enough information to cast doubt on conventional wisdom. This means that we often introduce the null hypothesis as the proposition we actually wish to disprove. For instance, using experimental data a researcher wants to examine the "new wisdom" that passive smoking is detrimental to health. The null hypothesis is then that passive smoking is not detrimental to health, and the alternative hypothesis states that passive smoking is detrimental to health. Hypothesis testing aims at the conclusion "reject" or "do not reject" the null hypothesis.

Definition 5.1 Let X1, . . . , Xn be a random sample from a distribution L(θ), where the parameter θ is unknown, let h(θ) be a parametric function and let c ∈ R be a constant.

(i.) H0 : h(θ) = c versus H1 : h(θ) ≠ c. The assertion H0 : h(θ) = c is called a simple null hypothesis, the assertion H1 : h(θ) ≠ c is called a composite two-sided alternative hypothesis. We speak about a two-tailed test.

(ii.) H0 : h(θ) ≥ c versus H1 : h(θ) < c. The assertion H0 : h(θ) ≥ c is called a composite right-sided null hypothesis, the assertion H1 : h(θ) < c is called a composite left-sided alternative hypothesis. We speak about a left-tailed test.

(iii.)
H0 : h(θ) ≤ c versus H1 : h(θ) > c. The assertion H0 : h(θ) ≤ c is called a composite left-sided null hypothesis, the assertion H1 : h(θ) > c a composite right-sided alternative hypothesis. We speak about a right-tailed test.

Testing H0 versus H1 is a decision procedure based on the random sample X1, . . . , Xn which leads to one of two results: to reject the null hypothesis in favor of the alternative one, or not to reject it (which is not the same as to accept it). The criterion for rejecting or not rejecting the null hypothesis on the basis of sample evidence is not a guarantee of arriving at a correct conclusion. Let us now consider in detail the kinds of error that could be made.

Definition 5.2 Testing H0 versus H1 we may commit an error of one of two types:
A Type I error is made when H0 is incorrectly rejected, though H0 is in fact true. The symbol α stands for the probability of making a Type I error.
A Type II error is made by failing to reject H0, though H0 is in fact false. The symbol β stands for the probability of making a Type II error.

                 H0 is not rejected                            H0 is rejected
  H0 is true     correct decision                              Type I error
                 P(H0 is not rejected | H0 is true) = 1 − α    P(H0 is rejected | H0 is true) = α
  H0 is false    Type II error                                 correct decision
                 P(H0 is not rejected | H0 is false) = β       P(H0 is rejected | H0 is false) = 1 − β

The probability α of making a Type I error is called the significance level of the test. The value 1 − β is the probability that a false null hypothesis is correctly rejected; it is called the power of the test.

[Statisticians wish the power 1 − β of a test to be as great as possible and simultaneously the level of significance α to be as small as possible. However, decreasing the probability α increases the probability β, and the power of the test goes down. The standard practice is first to fix a low level of significance α and then to use a test statistic (if several test statistics are available) that makes the probability β as small as possible.
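The two error probabilities can be estimated by simulation. The following sketch is a hypothetical Python illustration (not part of the text): it estimates α and the power 1 − β for a right-tailed z-test of H0 : μ = 0 against H1 : μ > 0, with known σ = 1, n = 25, rejecting when the test statistic exceeds the 0.95-quantile 1.645 of N(0, 1).

```python
import random

random.seed(1)
n, trials = 25, 20000
z_crit = 1.645  # 0.95-quantile of N(0, 1), so alpha = 0.05

def z_stat(mu_true):
    """Draw a sample of size n from N(mu_true, 1) and return the
    z statistic for H0: mu = 0, i.e. (xbar - 0) / (sigma / sqrt(n))."""
    sample = [random.gauss(mu_true, 1) for _ in range(n)]
    return (sum(sample) / n) * n ** 0.5  # sigma = 1

# Type I error rate: H0 true (mu = 0); proportion of rejections estimates alpha
alpha_hat = sum(z_stat(0.0) > z_crit for _ in range(trials)) / trials

# Power: H0 false (mu = 0.5); proportion of rejections estimates 1 - beta
power_hat = sum(z_stat(0.5) > z_crit for _ in range(trials)) / trials

print(round(alpha_hat, 2))  # close to 0.05
print(round(power_hat, 1))  # close to 0.8 (exact power is about 0.804)
```

Raising z_crit (a smaller α) visibly lowers power_hat, which is the trade-off between α and β described above.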
For a given significance level α, the test with the greatest power 1 − β is called the most powerful test. In statistical testing, the only way to reduce the probabilities of both kinds of error at the same time is to increase the sample size, assuming that the test statistic used is the best that can be devised.]

Remark 5.3 There are three methods of testing the hypothesis H0 versus H1 at a level of significance α:
a) the classical method using a critical region,
b) the confidence interval method,
c) the p-value method.
It is essential to understand the first one; the others may be easily derived from it.

Definition 5.4 Let X1, . . . , Xn be a random sample and let the rejection or non-rejection of H0 be decided by the numerical realization of the statistic T0 = T0(X1, . . . , Xn). Then T0 is called the test statistic. (It is a random variable whose value varies from sample to sample.) Consider the set of all realizations of the test statistic T0. This set can be partitioned into two subsets. The critical region, sometimes called the region of rejection, is the subset such that if the value of the test statistic falls in it, the null hypothesis is rejected. It is denoted by W. Similarly, the region of nonrejection is the subset such that if the value of the test statistic falls in it, the null hypothesis is not rejected. It is denoted by V. The boundary between the rejection and nonrejection regions is called the critical value; it is determined by prior information concerning the distribution of the test statistic and by the specification of the alternative hypothesis, and it is tabulated.

[The test statistic serves as an indicator of the discrepancy between the tested null hypothesis and the observed data. A realization of the test statistic in the critical region means that this discrepancy is no longer acceptable.]
If the numerical realization t0 of the test statistic T0 falls within the critical region W, then H0 is rejected at the level of significance α in favor of the alternative hypothesis H1, which is accepted. This is the real negation of H0. If the numerical realization t0 of the test statistic T0 falls within the nonrejection region V, then H0 is not rejected at the level of significance α. But this does not mean that H0 is true; we merely have no arguments for its rejection.

The probabilities of the Type I and Type II errors can be expressed as follows:
P(T0 ∈ W | H0 holds) = α
P(T0 ∈ V | H1 holds) = β

Remark 5.5 The values of the test statistic which give evidence in favor of the alternative hypothesis are concentrated in the critical region. According to the form of the alternative hypothesis, the list of corresponding critical regions (for common test statistics) follows:

W1 = ⟨tmin, Kα/2(T)⟩ ∪ ⟨K1−α/2(T), tmax⟩ for a two-tailed test, where H1 : h(θ) ≠ c,
W2 = ⟨K1−α(T), tmax⟩ for a right-tailed test, where H1 : h(θ) > c,
W3 = ⟨tmin, Kα(T)⟩ for a left-tailed test, where H1 : h(θ) < c,

where tmin stands for the minimal value the test statistic T0 may take, tmax for the maximal value the test statistic T0 may take, and Kα(T) for the α-quantile of the test statistic T0.

The procedure for testing a hypothesis can be succinctly described in the following steps:
* State the null hypothesis and the alternative hypothesis. (It is common practice to choose as the null hypothesis that claim which more nearly represents the status quo, and as the alternative hypothesis that claim which represents the "new wisdom".)
* Choose the level of significance α. The common levels of significance are α = 0.05, 0.01 or 0.1, the most common being α = 0.05.
* Select the test statistic and determine its distribution under the null hypothesis.
* Determine the critical region.
* Draw a sample and evaluate the test statistic.
* Make a decision and reach a conclusion.
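The steps above can be sketched in code. The example below is a hypothetical illustration with made-up data: a right-tailed z-test of H0 : μ ≤ 100 versus H1 : μ > 100 with known σ = 15, where the critical region is W2 = ⟨K0.95(T), ∞) = ⟨1.645, ∞) because T0 ~ N(0, 1) under H0.

```python
import math

# Hypothetical data: n = 36 measurements with known sigma = 15
sample_mean, n, sigma = 104.5, 36, 15.0
mu0 = 100.0      # H0: mu <= 100 versus H1: mu > 100 (right-tailed test)
alpha = 0.05
z_crit = 1.645   # K_{1-alpha}(T): 0.95-quantile of N(0, 1)

# Test statistic T0 = (xbar - mu0) / (sigma / sqrt(n)), ~ N(0, 1) under H0
t0 = (sample_mean - mu0) / (sigma / math.sqrt(n))
print(round(t0, 2))  # 1.8

# Decision: reject H0 iff t0 falls in the critical region W2
reject = t0 > z_crit
print("reject H0" if reject else "do not reject H0")  # reject H0
```

Here t0 = 1.8 > 1.645, so t0 ∈ W2 and H0 is rejected at the level α = 0.05.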
Now let us move on to the confidence interval method of hypothesis testing. The selection of an appropriate pivotal statistic (when constructing confidence intervals) and the selection of an appropriate test statistic (when testing) are inter-related. This relation will be demonstrated in seminars.

Theorem 5.6 Let (d, h) be an empirical 100(1 − α)% confidence interval for the parametric function h(θ). If this interval contains the constant c, then H0 is not rejected at the significance level α. If it does not contain the constant c, then H0 is rejected at the level α.
1. In the case of a two-tailed test, where H1 : h(θ) ≠ c, we form the two-sided confidence interval (d, h).
2. In the case of a left-tailed test, where H1 : h(θ) < c, we form the right-sided confidence interval (−∞, h).
3. In the case of a right-tailed test, where H1 : h(θ) > c, we form the left-sided confidence interval (d, ∞).

When testing with statistical software, it is common that the input does not require a level of significance and the output produces a so-called p-value instead. The principle of the p-value is analogous to the significance level α, but generally it gives more information about the test result.

Definition 5.7 The smallest level of significance at which the given sample observations and the related value of the test statistic lead to rejection of the null hypothesis is called the p-value.

Theorem 5.8 Consider hypothesis testing at significance level α. The decision about the null hypothesis is made as follows:
If p-value ≤ α, then we reject H0. [Thus the realization of the test statistic for which the p-value was calculated falls within the critical region corresponding to the significance level α.]
If p-value > α, then we do not reject H0. [Thus the realization of the test statistic for which the p-value was calculated does not fall within the critical region corresponding to the significance level α.]
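Theorem 5.6 can be illustrated with the same kind of hypothetical sketch: for a two-tailed z-test of H0 : μ = c versus H1 : μ ≠ c with known σ, we form the two-sided confidence interval for μ and check whether it contains c.

```python
import math

# Hypothetical data: known sigma, two-tailed test of H0: mu = 100 vs H1: mu != 100
sample_mean, n, sigma = 104.5, 36, 15.0
c = 100.0
alpha = 0.05
z = 1.96  # 0.975-quantile of N(0, 1)

# Two-sided 100(1 - alpha)% confidence interval (d, h) for mu
half_width = z * sigma / math.sqrt(n)
d, h = sample_mean - half_width, sample_mean + half_width
print(round(d, 1), round(h, 1))  # 99.6 109.4

# Theorem 5.6: reject H0 at level alpha iff c lies outside (d, h)
reject = not (d < c < h)
print("reject H0" if reject else "do not reject H0")  # do not reject H0
```

Note that c = 100 lies inside (99.6, 109.4), so the two-tailed test does not reject H0, even though the right-tailed test on the same data does; the two tests answer different alternative hypotheses.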
According to the form of the alternative hypothesis we select one of the following ways of calculating the p-value:

p = 2 min{P(T0 ≤ t0), P(T0 ≥ t0)} for a two-tailed test, where H1 : h(θ) ≠ c,
p = P(T0 ≥ t0) for a right-tailed test, where H1 : h(θ) > c,
p = P(T0 ≤ t0) for a left-tailed test, where H1 : h(θ) < c.

The same decision is reached regardless of which of the three methods is used.
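As a final hypothetical sketch, all three p-value formulas can be evaluated for a test statistic with T0 ~ N(0, 1) under H0; the standard library's math.erf gives the normal CDF, so no statistical package is assumed.

```python
import math

def norm_cdf(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

t0 = 1.8     # realization of the test statistic; T0 ~ N(0, 1) under H0
alpha = 0.05

# Right-tailed test (H1: h(theta) > c): p = P(T0 >= t0)
p_right = 1.0 - norm_cdf(t0)
# Left-tailed test (H1: h(theta) < c): p = P(T0 <= t0)
p_left = norm_cdf(t0)
# Two-tailed test (H1: h(theta) != c): p = 2 min{P(T0 <= t0), P(T0 >= t0)}
p_two = 2.0 * min(p_left, p_right)

print(round(p_right, 4))  # 0.0359
print(round(p_two, 4))    # 0.0719
print(p_right <= alpha)   # True: the right-tailed test rejects H0
```

With α = 0.05 the right-tailed p-value 0.0359 ≤ α leads to rejection, matching the critical-region decision (t0 = 1.8 > 1.645), while the two-tailed p-value 0.0719 > α does not.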