Statistics for Computer Sciences
Lecture 10 to Lecture 12: Testing of Statistical Hypotheses
Stanislav Katina
Institute of Mathematics and Statistics, Masaryk University
Honorary Research Fellow, The University of Glasgow
December 2, 2015

Testing of Statistical Hypotheses
Null and alternative hypothesis
◮ a 'hypothesis' is a theory which is assumed to be true unless evidence is obtained which indicates otherwise
◮ 'null' means 'nothing', and the term 'null hypothesis' (H0) means a 'theory of no change' – that is, 'no change' from what would be expected from past experience
◮ the 'alternative hypothesis' (H1) means a 'theory of change' – that is, 'change' from what would be expected from past experience
◮ the procedure used to decide between these two opposite theories is called a 'hypothesis test' or sometimes a 'significance test'
◮ one-tail test – a test in which the alternative hypothesis proposes a change in the parameter in only one direction, an increase or a decrease
◮ two-tail test – a test in which the alternative hypothesis proposes a change in the parameter in either direction

Testing of Statistical Hypotheses
Test statistic, rejection and acceptance region, critical value and quantile
◮ the test statistic is calculated from the sample – its value is used to decide whether the null hypothesis should be rejected
◮ the rejection (or critical) region gives the values of the test statistic for which the null hypothesis is rejected
◮ the acceptance region gives the values of the test statistic for which the null hypothesis is not rejected
◮ the boundary value(s) of the rejection region is (are) called the critical value(s) or quantile(s)
◮ the significance level α of a test gives the probability of the test statistic falling in the rejection region when the null hypothesis is true

Testing of Statistical Hypotheses
Hypothesis testing procedure
◮ a hypothesis is a statement about a population parameter based on a sample from this population
◮ H0 and H1 are two complementary hypotheses in a hypothesis testing problem
◮ a hypothesis testing procedure or hypothesis test is a rule that specifies for which sample values the decision is made to accept the null hypothesis as true, and for which sample values H0 is rejected
◮ the subset of the sample space for which H0 will be rejected is called the rejection region (critical region)
◮ the complement of the rejection region is called the acceptance region

Testing of Statistical Hypotheses
Four possibilities
Four choices:
A H0 is true – our decision is to reject H0
B H0 is true – our decision is not to reject H0
C H1 is true – our decision is not to reject H0
D H1 is true – our decision is to reject H0
Decision-reality table:
decision / reality      H0 is true       H0 is not true
to reject H0            Type I error     true decision
not to reject H0        true decision    Type II error

Testing of Statistical Hypotheses
Four possibilities
Four choices:
A) Pr(A) = Pr(Type I error) ≤ α [significance level]
B) Pr(B) ≥ 1 − α [coverage probability, confidence coefficient (level)]
C) Pr(C) = Pr(Type II error) ≤ β
D) Pr(D) ≥ 1 − β [power]
Four choices (formalised):
A) α ≥ Pr(Type I error) = Pr(reject H0 | H0 is true)
B) 1 − α ≤ Pr(don't reject H0 | H0 is true)
C) β ≥ Pr(Type II error) = Pr(don't reject H0 | H0 isn't true)
D) 1 − β ≤ Pr(reject H0 | H0 isn't true)

Testing of Statistical Hypotheses
Empirical 100 × (1 − α)% confidence intervals for parameter θ
Relationship of confidence interval and statistical test:
◮ empirical 100 × (1 − α)% confidence interval (CI) for parameter θ
◮ α-level hypothesis test about θ
Three types of intervals:
◮ two-tailed CI – Pr(LB(X) < θ < UB(X)) = 1 − α
◮ one-tailed (right-tailed) CI – Pr(θ < UB∗(X)) = 1 − α
◮ one-tailed (left-tailed) CI – Pr(LB∗(X) < θ) = 1 − α
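To make the three interval types concrete, here is a minimal R sketch for the mean of N(µ, σ²) with σ known; the simulated sample and the values of µ, σ and α are illustrative assumptions, not taken from the lecture.

set.seed(1)                                  # hypothetical sample
sigma <- 2; n <- 25; alpha <- 0.05
x <- rnorm(n, mean = 10, sd = sigma)
xbar <- mean(x); se <- sigma/sqrt(n)
xbar + c(-1, 1)*qnorm(1 - alpha/2)*se        # two-tailed CI: (LB, UB)
c(-Inf, xbar + qnorm(1 - alpha)*se)          # right-tailed: Pr(theta < UB*) = 1 - alpha
c(xbar - qnorm(1 - alpha)*se, Inf)           # left-tailed: Pr(LB* < theta) = 1 - alpha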
Testing of Statistical Hypotheses
Acceptance region
Definition (Acceptance region of H0)
Let X be a random variable with a certain distribution (probabilistic model) dependent on a parameter θ ∈ Θ, and let g(θ) be a parametric function. We test the null hypothesis H01 : g(θ) = g(θ0) against the two-sided alternative H11 : g(θ) ≠ g(θ0). Let (LB, UB) be an interval estimate of the parametric function g(θ) with coverage probability 1 − α. Then AIS,1 = {LB, UB; g(θ0) ∈ (LB, UB)} is the acceptance region of the test of H01 against H11 at significance level α. If we test H02 : g(θ) ≤ g(θ0) against the one-sided (right) alternative H12 : g(θ) > g(θ0), and if LB∗ is a lower estimate of g(θ) with coverage probability 1 − α, then AIS,2 = {LB∗; LB∗ < g(θ0)} is the acceptance region of the test of H02 against H12 at significance level α. If we test H03 : g(θ) ≥ g(θ0) against the one-sided (left) alternative H13 : g(θ) < g(θ0), and if UB∗ is an upper estimate of g(θ) with coverage probability 1 − α, then AIS,3 = {UB∗; UB∗ > g(θ0)} is the acceptance region of the test of H03 against H13 at significance level α.

Testing of Statistical Hypotheses
Rejection region
Definition (Rejection (critical) region of H0)
Let X be a random variable with a certain distribution (probabilistic model) dependent on a parameter θ ∈ Θ, and let g(θ) be a parametric function. We test the null hypothesis H01 : g(θ) = g(θ0) against the two-sided alternative H11 : g(θ) ≠ g(θ0). Let (LB, UB) be an interval estimate of the parametric function g(θ) with coverage probability 1 − α. Then WIS,1 = {LB, UB; g(θ0) ∉ (LB, UB)} is the critical region of the test of H01 against H11 at significance level α. If we test H02 : g(θ) ≤ g(θ0) against the one-sided (right) alternative H12 : g(θ) > g(θ0), and if LB∗ is a lower estimate of g(θ) with coverage probability 1 − α, then WIS,2 = {LB∗; LB∗ ≥ g(θ0)} is the critical region of the test of H02 against H12 at significance level α. If we test H03 : g(θ) ≥ g(θ0) against the one-sided (left) alternative H13 : g(θ) < g(θ0), and if UB∗ is an upper estimate of g(θ) with coverage probability 1 − α, then WIS,3 = {UB∗; UB∗ ≤ g(θ0)} is the critical region of the test of H03 against H13 at significance level α.

Testing of Statistical Hypotheses
Test criterion
Definition (Test criterion)
A test criterion is a test statistic T0 = T0(X1, X2, . . . , Xn) with a known (asymptotic) distribution if H0 is true. The set of possible values of T0 is divided into two subsets, the acceptance region of H0 (notation A) and the critical region of H0 (notation W). These two regions are separated by the critical values tα/2 and t1−α/2, resp. tα and t1−α (for the particular H0 and H1), of the distribution of the test statistic T0 (if H0 is true).
Definition (Confidence interval)
A confidence interval (CI) is a type of interval estimate of a population parameter θ. It is an observed, often called empirical, interval (i.e. it is calculated from the observations) that should include the value of the unobservable parameter θ if the experiment is repeated. The frequency with which the observed interval contains the parameter is determined by the confidence coefficient 1 − α (i.e. confidence level, coverage probability).
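The acceptance regions above reduce to checking whether g(θ0) lies inside the interval estimate. The helper below is a hypothetical illustration of AIS,1 in R; the function name and arguments are ours, not from the lecture.

ci.test <- function(LB, UB, g.theta0) {
  # A_IS,1: do not reject H0 iff g(theta0) lies inside (LB, UB)
  if (LB < g.theta0 && g.theta0 < UB) "do not reject H0" else "reject H0"
}
ci.test(LB = 9.2, UB = 10.8, g.theta0 = 10)  # "do not reject H0"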
Testing of Statistical Hypotheses
To carry out a hypothesis test
Step 1 define the null and alternative hypotheses (H0 and H1)
Step 2 decide on a significance level α = 0.1, 0.05, 0.01
Step 3 calculate the test statistic (test criterion) T0
Step 4 determine the critical value(s)
Step 5 decide on the outcome of the test (reject/don't reject H0) in one of the following ways:
◮ based on the critical region W = WT (observed test statistic t0 = tobs and critical values tα/2 and t1−α/2, resp. tα and t1−α),
◮ based on the critical region WIS, i.e. the empirical confidence interval (and g(θ0)),
◮ based on the p-value.
Step 6 state the conclusion in words

Testing of Statistical Hypotheses
To carry out a hypothesis test – based on test statistic and critical value
Definition (Testing based on critical region W)
Rejecting H0. If the observed test statistic (realisation of the test statistic) t0 of the test statistic T0 falls within the critical region W (equivalently, it is not in the acceptance region A), H0 is rejected at significance level α, i.e. we have sufficient evidence to reject H0.
Not rejecting H0. If the observed test statistic t0 of the test statistic T0 falls within the acceptance region A (equivalently, it is not in the critical region W), H0 is not rejected at significance level α, i.e. we do not have sufficient evidence to reject H0.
Let tmin be the smallest possible value of the test criterion T0 and tmax the largest possible value of the test criterion T0, with tα denoting the upper-tail critical value of T0; then
1. two-sided alternative – critical region W1 = (tmin, t1−α/2) ∪ (tα/2, tmax),
2. one-sided (right) alternative – critical region W2 = (tα, tmax),
3. one-sided (left) alternative – critical region W3 = (tmin, t1−α).

Testing of Statistical Hypotheses
To carry out a hypothesis test – based on CI
Definition (Testing based on CI)
Rejecting H0: If g(θ0) is not within the CI, H0 is rejected at the significance level α, i.e. we have sufficient evidence to reject H0.
Not rejecting H0: If g(θ0) is within the CI, H0 is not rejected at the significance level α, i.e. we do not have sufficient evidence to reject H0.
Relationship of confidence interval and statistical test:
◮ hypothesis testing ≡ CIs
◮ α-level hypothesis test ≡ 100 × (1 − α)% CI
◮ one-tail test ≡ one-sided CI (left-sided CI ≡ right-sided alternative, right-sided CI ≡ left-sided alternative)
◮ two-tail test ≡ two-sided CI
◮ parameter(s) ∈ CI ≡ don't reject H0
◮ parameter(s) ∉ CI ≡ reject H0

Testing of Statistical Hypotheses
To carry out a hypothesis test – based on p-value (observed significance level)
Definition (Testing based on p-value)
The minimal significance level α (for some test statistic T0) at which H02 : g(θ) ≤ g(θ0) is rejected (tested against H12 : g(θ) > g(θ0)) is called the observed significance level or p-value, i.e.
p-value = αobs = sup_{θ∈Θ0} Pr(T(X1, X2, . . . , Xn) ≥ T(x1, x2, . . . , xn); θ).
This can be written less formally as
p-value = Pr(a test statistic equal to or greater than the observed one | H0 is true).
The closer αobs is to zero, the smaller is the probability that the test statistic T(X1, X2, . . . , Xn) produces (under H0) a value equal to or more extreme than the observed one, while this probability is higher under H1. Therefore, the p-value can be understood as an indicator of the credibility of H0.
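As a worked sketch of Steps 1–6, consider a one-sample test of H0 : µ = µ0 against a two-sided alternative for N(µ, σ²) with σ known, carried out in all three equivalent ways; the data and the value µ0 = 10 are hypothetical, not from the lecture.

set.seed(1)
n <- 25; sigma <- 2; mu0 <- 10; alpha <- 0.05   # Steps 1-2: H0: mu = mu0, level alpha
x <- rnorm(n, mean = 10.9, sd = sigma)          # hypothetical sample
t0 <- (mean(x) - mu0)/(sigma/sqrt(n))           # Step 3: test statistic
crit <- qnorm(1 - alpha/2)                      # Step 4: critical value
abs(t0) > crit                                  # Step 5a: TRUE -> t0 in W, reject H0
mean(x) + c(-1, 1)*crit*sigma/sqrt(n)           # Step 5b: reject if mu0 outside the CI
2*min(pnorm(t0), 1 - pnorm(t0))                 # Step 5c: reject if p-value < alpha

All three decision rules necessarily agree.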
Testing of Statistical Hypotheses
To carry out a hypothesis test – based on p-value (observed significance level)
◮ Usually, if αobs < α = 0.05, there is sufficient evidence to reject H0 and the result of the test is statistically significant.
◮ If αobs > α = 0.1, there is not sufficient evidence to reject H0 and the result of the test is not statistically significant.
◮ Values between 0.05 and 0.1 should be taken as reference points in a broad sense. As αobs gets closer to either boundary point of the interval ⟨0.05, 0.1), this is taken as increasing evidence for one or the other alternative.
◮ Situations with αobs ∈ ⟨0.05, 0.1) are usually the most difficult to handle, and the result is then marginally statistically significant.

Testing of Statistical Hypotheses
To carry out a hypothesis test – based on p-value (observed significance level)
Wording of the results of a statistical test:
range for p-value    stars of significance    wording of the result
⟨0, 0.001)           ***                      extremely highly statistically significant
⟨0.001, 0.01)        **                       highly statistically significant
⟨0.01, 0.05)         *                        statistically significant
⟨0.05, 0.1)          ·                        marginally statistically significant
⟨0.1, 1⟩                                      non-significant

Testing of Statistical Hypotheses
To carry out a hypothesis test – based on p-value (observed significance level)
Interpretation of p-values:
◮ p-value < 0.001: the prevalence of the estimated effect is smaller than one to one thousand (the odds of the estimated effect are smaller than 1 : 999) if the effect is not present in the population (the presence of such an effect is highly improbable if the effect is not present in the population – and – the presence of such an effect is highly probable if the effect is present in the population)
◮ p-value < 0.01: the prevalence of the estimated effect is smaller than one to one hundred (the odds of the estimated effect are smaller than 1 : 99) if the effect is not present in the population (the presence of such an effect is very improbable if the effect is not present in the population – and – the presence of such an effect is very probable if the effect is present in the population)
◮ p-value < 0.05: the prevalence of the estimated effect is smaller than five to one hundred (the odds of the estimated effect are smaller than 5 : 95, i.e. 1 : 19) if the effect is not present in the population (the presence of such an effect is sufficiently improbable if the effect is not present in the population – and – the presence of such an effect is sufficiently probable if the effect is present in the population)
◮ p-value ≥ 0.05: the prevalence of the estimated effect is five to one hundred or greater (5% or more)
◮ p-value = k, k ∈ ⟨0.05, 1⟩: the prevalence of the estimated effect is 100 × k to one hundred (100 × k % or more)

Testing of Statistical Hypotheses
To carry out a hypothesis test – based on p-value (observed significance level)
How is the p-value (mostly) calculated?
1. two-sided alternative –
p-value = 2 min(Pr(T0 ≤ t0 | H0), Pr(T0 ≥ t0 | H0)),
e.g. for the normal and Student distribution of the test statistic (symmetric distributions) and for the χ²_df and F_{df1,df2} distribution of the test statistic (asymmetric distributions), or
p-value = min(Pr(T0 ≤ t0 | H0), Pr(T0 ≥ t0 | H0)),
e.g. for the χ²_df and F_{df1,df2} distribution of the test statistic (asymmetric distributions)
2. one-sided (right) alternative – p-value = Pr(T0 ≥ t0 | H0)
3. one-sided (left) alternative – p-value = Pr(T0 ≤ t0 | H0)
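These formulas translate directly into R. The sketch below assumes a test statistic with a χ² null distribution with df degrees of freedom (the asymmetric case); the values of t0 and df are hypothetical.

t0 <- 7.3; df <- 3                            # hypothetical observed statistic
2*min(pchisq(t0, df), 1 - pchisq(t0, df))     # two-sided alternative, 2*min(...) version
min(pchisq(t0, df), 1 - pchisq(t0, df))       # two-sided alternative, min(...) version
1 - pchisq(t0, df)                            # one-sided (right) alternative
pchisq(t0, df)                                # one-sided (left) alternative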
Testing of Statistical Hypotheses
On a philosophical level
◮ distinction between 'rejecting H0' and 'accepting H1'
◮ 'rejecting H0' – implies nothing about which state the experimenter is accepting, only that the state defined by H0 is being rejected
◮ distinction between 'accepting H0' and 'not rejecting H0'
◮ 'accepting H0' – the experimenter is willing to assert the state of nature specified by H0
◮ 'not rejecting H0' – the experimenter really does not believe H0 but does not have the evidence to reject it

Testing of Statistical Hypotheses
Conservative and liberal test and CI
Definition (Conservative and liberal test)
A test whose actual/observed significance level is smaller than the nominal significance level α is called conservative (the test should theoretically reject H0 "quickly", but in reality the opposite holds, i.e. the test rejects "slowly"). A test whose actual/observed significance level is greater than the nominal significance level α is called liberal (the test should theoretically reject H0 "slowly", but in reality the opposite holds, i.e. the test rejects "quickly").
Definition (Conservative and liberal CI)
A CI whose actual/real coverage probability is greater than the nominal coverage probability 1 − α is called conservative (i.e. the probability that θ0 is within the CI is greater than expected). A CI whose actual/real coverage probability is smaller than the nominal coverage probability 1 − α is called liberal (i.e. the probability that θ0 is within the CI is smaller than expected).

Testing of Statistical Hypotheses
Likelihood ratio – generalised relative likelihood
Two types of hypotheses:
1. simple hypothesis – H0 : θ = θ0 against H1 : θ ≠ θ0; the simple likelihood ratio is equal to
λ(x) = L(θ0|x) / sup_{θ∈Θ} L(θ|x) = L(θ0|x) / L(θ̂|x),
where λ(X) is the test statistic, θ̂ is the maximum likelihood estimate of θ, and L(θ|x) is continuous for all x.
2. composite hypothesis – H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1; the generalised likelihood ratio is equal to
λ(x) = sup_{θ∈Θ0} L(θ|x) / sup_{θ∈Θ} L(θ|x).

Testing of Statistical Hypotheses
Likelihood ratio test statistic
The subsets of Θ, Θ0 and Θ1, remain the same after a monotone transformation of λ(x), i.e. the statistical tests before and after the transformation are equivalent. Therefore, the likelihood ratio test statistic is equal to
ULR = −2 ln λ(X).
Its realisation, the observed likelihood ratio test statistic, is equal to uLR = −2 ln λ(x), where uLR ∈ ⟨0, ∞).

Testing of Statistical Hypotheses
Three test statistics
Geometrical interpretation:
1. ULR – measures the properly standardised difference between the log-likelihoods at θ̂ and θ0 (i.e. in the direction of the y axis)
2. UW – measures the properly standardised absolute value of the difference between θ̂ and θ0 (in the direction of the x axis)
3. US – measures the properly standardised slope of the log-likelihood at θ0
Example (normal distribution)
Let X ∼ N(µ, σ²), where σ² is known, and test H0 : θ = θ0 against H1 : θ ≠ θ0, where θ = µ. Then
1. ULR = −2(l(θ0|X) − l(θ̂|X)) = −Σ_{i=1}^n (Xi − X̄)²/σ² + Σ_{i=1}^n (Xi − µ0)²/σ² = n(X̄ − µ0)²/σ²,
2. UW = (X̄ − µ0)² I(µ̂) = n(X̄ − µ0)²/σ²,
3. US = (S(µ0))²/I(µ0) = (n(X̄ − µ0)/σ²)²/(n/σ²) = n(X̄ − µ0)²/σ².
All three test statistics are equal, i.e. ULR = UW = US.
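The equality ULR = UW = US in the normal example can be checked numerically with the R sketch below; the sample and the value µ0 = 10 are hypothetical.

set.seed(1)
sigma <- 2; mu0 <- 10                          # hypothetical known sigma and H0 value
x <- rnorm(25, mean = 10.8, sd = sigma); n <- length(x)
ULR <- -2*(sum(dnorm(x, mu0, sigma, log = TRUE)) -
           sum(dnorm(x, mean(x), sigma, log = TRUE)))  # -2(l(mu0|X) - l(muhat|X))
UW <- (mean(x) - mu0)^2 * (n/sigma^2)                  # (muhat - mu0)^2 I(muhat)
US <- (n*(mean(x) - mu0)/sigma^2)^2 / (n/sigma^2)      # S(mu0)^2 / I(mu0)
c(ULR, UW, US)                                         # all three coincide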
Testing of Statistical Hypotheses
Three test statistics
If θ is a scalar, the three test statistics are defined as (∼D denoting the asymptotic distribution if H0 is true):
1. ULR = −2(l(θ0|X) − l(θ̂|X)) ∼D χ²₁,
2. UW = (θ̂ − θ0)² I(θ̂) ∼D χ²₁ and, equivalently, UW^{1/2} = ZW ∼D N(0, 1),
3. US = (S(θ0))²/I(θ0) ∼D χ²₁ and, equivalently, US^{1/2} = ZS ∼D N(0, 1).
If θ is a vector, the three test statistics are defined as:
1. ULR = −2(l(θ0|X) − l(θ̂|X)) ∼D χ²_k,
2. UW = (θ̂ − θ0)ᵀ I(θ̂)(θ̂ − θ0) ∼D χ²_k,
3. US = (S(θ0))ᵀ (I(θ0))⁻¹ S(θ0) ∼D χ²_k.

Testing of Statistical Hypotheses
Three test statistics and related confidence intervals
If θ is a scalar, three confidence intervals are defined as follows:
1. the likelihood ratio empirical (1 − α) × 100% CI for θ is defined as
CS_{1−α} = {θ : ULR(θ) < χ²₁(α)}, where ULR(θ) = −2 ln [L(θ|x)/L(θ̂|x)],
2. the Wald empirical (1 − α) × 100% CI for θ is defined based on the pivot (pivotal statistic) Tpiv = UW(θ),
3. the score empirical (1 − α) × 100% CI for θ is defined based on the pivot Tpiv = US(θ).
If θ is a vector, the CIs can be generalised to a confidence set CS_{1−α}.
◮ If k = 2, CS_{1−α} is a confidence ellipse.
◮ If k > 2, CS_{1−α} is a confidence ellipsoid.
Additionally, if k = 1, CS_{1−α} is a confidence interval.

Testing of Statistical Hypotheses
Confidence intervals
The Wald empirical (1 − α) × 100% CI for θ is defined as
(d, h) = (θ̂ − tα/2 SE[θ̂], θ̂ + tα/2 SE[θ̂]),
where the critical value tα/2 depends on the choice of θ (i.e. on the distribution of θ̂).
The likelihood ratio empirical (1 − α) × 100% CI for θ is defined by its lower and upper bounds as k% cut-offs of the standardised relative log-likelihood as follows:
Pr(L(θ|x)/L(θ̂|x) > cα) = Pr(−2 ln [L(θ|x)/L(θ̂|x)] < −2 ln cα) = 1 − α,
where cα = e^{−χ²₁(α)/2}. Then
◮ if 1 − α = 0.95, then cα = 0.1465001 ≈ 0.15 (15% cut-off),
◮ if 1 − α = 0.90, then cα = 0.2585227 ≈ 0.26 (26% cut-off),
◮ if 1 − α = 0.99, then cα = 0.0362452 ≈ 0.04 (4% cut-off).

Testing of Statistical Hypotheses
Likelihood confidence intervals – bisection method
Bisection method
Let θ01, θ02 ∈ ⟨θL, θU⟩ with f(θ01)f(θ02) < 0, where f(·) is continuous with at least one root within the interval ⟨θ01, θ02⟩ and
f(θ) = −2 ln [L(θ|x)/L(θ̂|x)] − χ²₁(α) = 0.
If the first derivative of f(·) has constant sign, then exactly one root θ∗ ∈ ⟨θ01, θ02⟩ of f(θ) = 0 exists. The iterative process is defined as follows:
1. initialisation step – starting point θ⁽⁰⁾ = (θ01 + θ02)/2 and i = 1,
2. updating equations – the substitution of the boundaries θ01 and θ02 is defined as
⟨θi1, θi2⟩ = ⟨θ(i−1),1, θ⁽ⁱ⁻¹⁾⟩ if f(θ(i−1),1)f(θ⁽ⁱ⁻¹⁾) < 0, and ⟨θi1, θi2⟩ = ⟨θ⁽ⁱ⁻¹⁾, θ(i−1),2⟩ if f(θ(i−1),1)f(θ⁽ⁱ⁻¹⁾) > 0;
if f(θ⁽ⁱ⁻¹⁾) = 0, the process ends; otherwise set θ⁽ⁱ⁾ = (θi1 + θi2)/2, increase i by one, and repeat step 2 until convergence.

Testing of Statistical Hypotheses
Likelihood confidence intervals – Brent-Dekker method
Example (Brent-Dekker method)
Let X ∼ Bin(N, p), where N = 10 and n = x = 8. Estimate the boundaries of the empirical 100 × (1 − α)% CI for (1) p and (2) the odds p/(1 − p). The empirical CIs are of two types, (A) likelihood and (B) Wald. Draw the log-likelihood function and its quadratic approximation with the lower and upper boundaries of the CIs.
Solution (partial)
Wald empirical 100 × (1 − α)% CI for p: p̂ = 8/10 = 0.8; SE[p̂] = √(p̂(1 − p̂)/N) = 0.13;
(d, h) = (p̂ − uα/2 SE[p̂], p̂ + uα/2 SE[p̂]) = (0.55, 1.05).
Likelihood empirical 100 × (1 − α)% CI for p:
CS_{1−α} = {p : −2 ln [L(p|x)/L(p̂|x)] ≤ 3.84}, where (d, h) = (0.50, 0.96).
Wald empirical 100 × (1 − α)% CI for g(p): g(p̂) = ln [p̂/(1 − p̂)] = ln(0.8/0.2) = 1.39;
∂g(p)/∂p = 1/p + 1/(1 − p); SE[g(p̂)] = SE[p̂](1/p̂ + 1/(1 − p̂)) = √(p̂(1 − p̂)/N)(1/p̂ + 1/(1 − p̂)) = √(1/n + 1/(N − n)) = 0.79.
Then (dg, hg) = (−0.16, 2.94) and, back-transformed, (d, h) = (0.46, 0.95).
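A minimal R sketch of the bisection iteration applied to this binomial example (x = 8, N = 10), searching for the lower boundary of the likelihood CI for p; the starting bracket ⟨0.4, p̂⟩ and the tolerance are our choices, not from the lecture.

x <- 8; N <- 10; phat <- x/N
f <- function(p)                              # f(p) = -2 ln[L(p|x)/L(phat|x)] - chi^2_1(alpha)
  -2*(dbinom(x, N, p, log = TRUE) - dbinom(x, N, phat, log = TRUE)) - qchisq(0.95, df = 1)
th1 <- 0.4; th2 <- phat                       # f(th1) > 0, f(th2) < 0: one root inside
for (i in 1:100) {
  mid <- (th1 + th2)/2
  if (f(th1)*f(mid) < 0) th2 <- mid else th1 <- mid  # keep the sign-changing half
  if (th2 - th1 < 1e-8) break
}
(th1 + th2)/2                                 # lower boundary, approx. 0.5010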
Testing of Statistical Hypotheses
Likelihood confidence intervals – Brent-Dekker method
x <- 8; N <- 10
probs <- seq(0.4, 0.99, length = 1000)          # grid of p values
like <- dbinom(x, N, probs)                     # likelihood L(p|x)
rellike <- like/max(like)                       # relative likelihood L(p|x)/L(phat|x)
relloglike <- -2*log(rellike)                   # -2 ln relative likelihood
cutoff <- exp(-1/2*qchisq(0.95, df = 1))        # 0.1465001
like.CI.p <- range(probs[rellike > cutoff])     # 0.5009910 0.9634234
cutoff <- qchisq(0.95, df = 1)                  # 3.841459
like.CI.p <- range(probs[relloglike < cutoff])  # 0.5009910 0.9634234
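The code above locates the boundaries by grid search. The Brent-Dekker method itself can be sketched with base R's uniroot(), which implements Brent's root-finding algorithm; the bracketing intervals below are our choices.

x <- 8; N <- 10; phat <- x/N
f <- function(p)                              # zero at the likelihood CI boundaries
  -2*(dbinom(x, N, p, log = TRUE) - dbinom(x, N, phat, log = TRUE)) - qchisq(0.95, df = 1)
uniroot(f, c(0.40, phat))$root                # lower boundary, approx. 0.5010
uniroot(f, c(phat, 0.999))$root               # upper boundary, approx. 0.9634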