CHAPTER 2

Two-Way Contingency Tables

Table 2.1 cross classifies a sample of Americans according to their gender and their opinion about an afterlife. For the females in the sample, for instance, 435 said they believed in an afterlife and 147 said they did not or were undecided. For such data, we might study whether an association exists between gender and belief in an afterlife. Is one sex more likely than the other to believe in an afterlife, or is belief in an afterlife independent of gender?

Analyzing associations is at the heart of most multivariate statistical analyses. This chapter deals with associations between two categorical variables. We introduce parameters that describe the association and present inferential methods for those parameters.

Many applications involve comparing two groups with respect to the relative numbers of observations in two categories. For Table 2.1, one might compare the proportions of males and females who believe in an afterlife. For such data, Section 2.2 presents methods for analyzing differences and ratios of proportions. Section 2.3 presents another measure, the odds ratio, that plays a key role for several methods discussed in this text. Sections 2.4 and 2.5 describe large-sample significance tests about whether an association exists between two categorical variables; Section 2.4 presents tests for nominal variables, and Section 2.5 presents an alternative test for ordinal variables. Section 2.6 discusses small-sample analyses. First, Section 2.1 introduces terminology and notation.

2.1 PROBABILITY STRUCTURE FOR CONTINGENCY TABLES

Categorical data consist of frequency counts of observations occurring in the response categories. Let X and Y denote two categorical variables, X having I levels and Y having J levels. We display the IJ possible combinations of outcomes in a rectangular table having I rows for the categories of X and J columns for the categories of Y. The cells of the table represent the IJ possible outcomes.

Table 2.1  Cross Classification of Belief in Afterlife by Gender

                     Belief in Afterlife
Gender      Yes     No or Undecided
Females     435     147
Males       375     134

Source: Data from 1991 General Social Survey.

A table of this form in which the cells contain frequency counts of outcomes is called a contingency table. A contingency table that cross classifies two variables is called a two-way table; one that cross classifies three variables is called a three-way table, and so forth. A two-way table having I rows and J columns is called an I × J (read I-by-J) table. Table 2.1, for instance, is a 2 × 2 table.

2.1.1 Joint, Marginal, and Conditional Probabilities

Probability distributions for contingency tables relate to the sampling scheme, as we shall discuss in Section 2.1.4. We first present the fundamental types of probabilities for two-way contingency tables.

Suppose first that each subject in a sample is randomly chosen from some population of interest, and then classified on two categorical responses, X and Y. Let π_ij = P(X = i, Y = j) denote the probability that (X, Y) falls in the cell in row i and column j. The probabilities {π_ij} form the joint distribution of X and Y. They satisfy Σ_ij π_ij = 1.

The marginal distributions are the row and column totals of the joint probabilities. These are denoted by {π_i+} for the row variable and {π_+j} for the column variable, where the subscript "+" denotes the sum over the index it replaces.
For instance, for 2 × 2 tables, π_1+ = π_11 + π_12 and π_+1 = π_11 + π_21.

We use similar notation for samples, with Roman p in place of Greek π. For instance, {p_ij} denotes the sample joint distribution. These are the sample cell proportions. The cell counts are denoted by {n_ij}, with n = Σ_ij n_ij denoting the total sample size. The cell proportions and cell counts are related by p_ij = n_ij / n.

2.2 COMPARING PROPORTIONS IN TWO-BY-TWO TABLES

Many comparisons of two groups on a binary response yield a 2 × 2 table. Let π_1 and π_2 denote the "success" probabilities in rows 1 and 2, estimated by the sample proportions p_1 and p_2 from independent binomial samples of sizes N_1 and N_2. The difference p_1 - p_2 estimates π_1 - π_2 and has estimated standard error

    σ̂(p_1 - p_2) = √[ p_1(1 - p_1)/N_1 + p_2(1 - p_2)/N_2 ].   (2.2.1)

A large-sample 100(1 - α)% confidence interval for π_1 - π_2 is

    (p_1 - p_2) ± z_{α/2} σ̂(p_1 - p_2),   (2.2.2)

where z_{α/2} denotes the standard normal percentile having right-tail probability equal to α/2 (e.g., for a 95% interval, α = .05 and z_{.025} = 1.96).

2.2.2 Aspirin and Heart Attacks Example

Table 2.3 is taken from a report on the relationship between aspirin use and myocardial infarction (heart attacks) by the Physicians' Health Study Research Group at Harvard Medical School. The Physicians' Health Study was a five-year randomized study testing whether regular intake of aspirin reduces mortality from cardiovascular disease. Every other day, physicians participating in the study took either one aspirin tablet or a placebo. The study was blind: the physicians in the study did not know which type of pill they were taking.

Table 2.3  Cross Classification of Aspirin Use and Myocardial Infarction (MI)

                 Myocardial Infarction
Group      Yes     No        Total
Placebo    189     10,845    11,034
Aspirin    104     10,933    11,037

Source: Preliminary Report: Findings from the Aspirin Component of the Ongoing Physicians' Health Study, N. Engl. J. Med., 318: 262-264 (1988).

We treat the two rows in Table 2.3 as independent binomial samples. Of the N_1 = 11,034 physicians taking placebo, 189 suffered myocardial infarction (MI) over the course of the study, a proportion of p_1 = 189/11,034 = .0171. Of the N_2 = 11,037 physicians taking aspirin, 104 suffered MI, a proportion of p_2 = .0094. The sample difference of proportions is .0171 - .0094 = .0077. From (2.2.1), this difference has an estimated standard error of

    √[ (.0171)(.9829)/11,034 + (.0094)(.9906)/11,037 ] = .0015.

A 95% confidence interval for the true difference π_1 - π_2 is .0077 ± 1.96(.0015), or .008 ± .003, or (.005, .011). Since this interval contains only positive values, we conclude that π_1 - π_2 > 0; that is, π_1 > π_2, so taking aspirin appears to diminish the risk of MI.

2.2.3 Relative Risk

A difference between two proportions of a certain fixed size may have greater importance when both proportions are near 0 or 1 than when they are near the middle of the range. Consider a comparison of two drugs on the proportion of subjects who have adverse reactions when using the drug. The difference between .010 and .001 is the same as the difference between .410 and .401, namely .009. The first difference seems more noteworthy, since ten times as many subjects have adverse reactions with one drug as with the other. In such cases, the ratio of proportions is also a useful descriptive measure.

In 2 × 2 tables, the relative risk is the ratio of the "success" probabilities for the two groups,

    π_1 / π_2.   (2.2.3)

It can be any nonnegative real number. The proportions .010 and .001 have a relative risk of .010/.001 = 10.0, whereas the proportions .410 and .401 have a relative risk of .410/.401 = 1.02. A relative risk of 1.00 occurs when π_1 = π_2; that is, when the response is independent of the group. Two groups with sample proportions p_1 and p_2 have a sample relative risk of p_1/p_2. Its sampling distribution can be highly skewed unless the sample sizes are quite large, so its confidence interval formula is rather complex (Problem 2.12).
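As a numerical check, the following minimal Python sketch reproduces the difference of proportions, the standard error (2.2.1), and the Wald interval (2.2.2) for Table 2.3. It assumes NumPy and SciPy are available; it is an illustration of the formulas, not code from the study report.

```python
import numpy as np
from scipy import stats

# Table 2.3: MI counts and group sizes for (placebo, aspirin)
mi = np.array([189, 104])
sizes = np.array([11_034, 11_037])

p = mi / sizes                               # sample proportions p1, p2
diff = p[0] - p[1]                           # difference of proportions
se = np.sqrt(np.sum(p * (1 - p) / sizes))    # standard error, equation (2.2.1)

z = stats.norm.ppf(0.975)                    # z_{alpha/2} for a 95% interval
ci = (diff - z * se, diff + z * se)          # Wald interval, equation (2.2.2)

print(f"difference = {diff:.4f}, SE = {se:.4f}")
print(f"95% CI for pi1 - pi2: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"sample relative risk = {p[0] / p[1]:.2f}")
```

Running this reproduces the interval (.005, .011) and the sample relative risk of 1.82 discussed next.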
For Table 2.3, the sample relative risk is p_1/p_2 = .0171/.0094 = 1.82. The sample proportion of MI cases was 82% higher for the group taking placebo. Using computer software (SAS PROC FREQ), we find that a 95% confidence interval for the true relative risk is (1.43, 2.30). We can be 95% confident that, after five years, the proportion of MI cases for physicians taking placebo is between 1.43 and 2.30 times the proportion of MI cases for physicians taking aspirin. The confidence interval for the relative risk indicates that the risk of MI is at least 43% higher for the placebo group.

The confidence interval (.005, .011) for the difference of proportions makes it seem as if the two groups differ by a trivial amount, but the relative risk shows that the difference may have important public health implications. Using the difference of proportions alone to compare two groups can be somewhat misleading when the proportions are both close to zero.

It is sometimes informative to compute also the ratio of "failure" probabilities, (1 - π_1)/(1 - π_2). This takes a different value than the ratio of the success probabilities. When one of the two outcomes has small probability, normally one computes the ratio of the probabilities for that outcome.

2.3 THE ODDS RATIO

We next present another measure of association for 2 × 2 contingency tables, called the odds ratio. This is a fundamental parameter for models presented in later chapters.

In 2 × 2 tables, the probability of "success" is π_1 in row 1 and π_2 in row 2. Within row 1, the odds of success are defined to be

    odds_1 = π_1 / (1 - π_1).

For instance, if π_1 = .75, then the odds of success equal .75/.25 = 3. The odds are nonnegative, with value greater than 1.0 when a success is more likely than a failure. When odds = 4.0, a success is four times as likely as a failure. The probability of success is .8, the probability of failure is .2, and the odds equal .8/.2 = 4. We then expect to observe four successes for every one failure. When odds = 1/4, a failure is four times as likely as a success; we expect to observe one success for every four failures.

Within row 2, the odds of success equal

    odds_2 = π_2 / (1 - π_2).

In either row, the success probability is the following function of the odds,

    π = odds / (odds + 1).

For instance, when odds = 4, then π = 4/(4 + 1) = .8.

When the conditional distributions are identical in the two rows (i.e., π_1 = π_2), the odds satisfy odds_1 = odds_2. The variables are then independent. The ratio of odds from the two rows,

    θ = odds_1 / odds_2 = [π_1/(1 - π_1)] / [π_2/(1 - π_2)],   (2.3.1)

is called the odds ratio. Whereas the relative risk is a ratio of two probabilities, the odds ratio θ is a ratio of two odds.

2.3.1 Properties of the Odds Ratio

The odds ratio can equal any nonnegative number. When X and Y are independent, π_1 = π_2, so that odds_1 = odds_2 and θ = odds_1/odds_2 = 1. The value θ = 1 corresponding to independence serves as a baseline for comparison. Odds ratios on each side of 1 reflect certain types of associations. When 1 < θ < ∞, the odds of success are higher in row 1 than in row 2. For instance, when θ = 4, the odds of success in row 1 are four times the odds of success in row 2. Thus, subjects in row 1 are more likely to have successes than are subjects in row 2; that is, π_1 > π_2. When 0 < θ < 1, a success is less likely in row 1 than in row 2; that is, π_1 < π_2. Values of θ farther from 1.0 in a given direction represent stronger levels of association.
An odds ratio of 4 is farther from independence than an odds ratio of 2, and an odds ratio of 0.25 is farther from independence than an odds ratio of 0.50.

Two values for θ represent the same level of association, but in opposite directions, when one value is the inverse of the other. When θ = 0.25, for instance, the odds of success in row 1 are 0.25 times the odds of success in row 2, or equivalently 1/0.25 = 4.0 times as high in row 2 as in row 1. When the order of the rows is reversed or the order of the columns is reversed, the new value of θ is the inverse of the original value. This ordering is usually arbitrary, so whether we get 4.0 or 0.25 for the odds ratio is simply a matter of how we label the rows and columns.

The odds ratio does not change value when the orientation of the table reverses so that the rows become the columns and the columns become the rows. The same value occurs when we treat the columns as the response variable and the rows as the explanatory variable, or the rows as the response variable and the columns as the explanatory variable. Since the odds ratio treats the variables symmetrically, it is unnecessary to identify one classification as a response variable in order to calculate it. By contrast, the relative risk requires this, and its value also depends on whether we apply it to the first or second response category.

When both variables are responses, the odds ratio can be defined using joint probabilities as

    θ = (π_11/π_12) / (π_21/π_22) = (π_11 π_22) / (π_12 π_21).   (2.3.2)

The odds ratio is also called the cross-product ratio, since it equals the ratio of the products π_11 π_22 and π_12 π_21 of cell probabilities from diagonally opposite cells.

The sample odds ratio equals the ratio of the sample odds in the two rows,

    θ̂ = [p_1/(1 - p_1)] / [p_2/(1 - p_2)] = (n_11/n_12) / (n_21/n_22) = (n_11 n_22) / (n_12 n_21).   (2.3.3)

For the standard sampling schemes, this is the ML estimator of the true odds ratio.

2.3.2 Odds Ratio for Aspirin Study

To illustrate the odds ratio, we revisit Table 2.3 from Section 2.2.2 on aspirin use and myocardial infarction (MI). For the physicians taking placebo, the estimated odds of MI equal n_11/n_12 = 189/10,845 = 0.0174. The value 0.0174 means there were 1.74 "yes" responses for every 100 "no" responses. The estimated odds equal 104/10,933 = 0.0095 for those taking aspirin, or 0.95 "yes" responses per every 100 "no" responses.

The sample odds ratio equals θ̂ = 0.0174/0.0095 = 1.832. This also equals the cross-product ratio (189)(10,933)/[(10,845)(104)]. The estimated odds of MI for physicians taking placebo equal 1.832 times the estimated odds for physicians taking aspirin. The estimated odds were 83% higher for the placebo group.

2.3.3 Inference for Odds Ratios and Log Odds Ratios

For small to moderate sample sizes, the sampling distribution of the odds ratio is highly skewed. When θ = 1, for instance, θ̂ cannot be much smaller than θ (since θ̂ ≥ 0), but it could be much larger with nonnegligible probability.

Because of this skewness, statistical inference for the odds ratio uses an alternative but equivalent measure: its natural logarithm, log(θ). Independence corresponds to log(θ) = 0; that is, an odds ratio of 1.0 is equivalent to a log odds ratio of 0.0. An odds ratio of 2.0 has a log odds ratio of 0.7. The log odds ratio is symmetric about zero, in the sense that reversal of rows or reversal of columns changes its sign.
Two values for log(θ) that are the same except for sign, such as log(2.0) = 0.7 and log(0.5) = -0.7, represent the same level of association. Doubling a log odds ratio corresponds to squaring an odds ratio. For instance, log odds ratios of 2(0.7) = 1.4 and 2(-0.7) = -1.4 correspond to odds ratios of 2² = 4 and 0.5² = 0.25.

The log transform of the sample odds ratio, log θ̂, has a less skewed sampling distribution that is closer to normality. Its large-sample approximating normal distribution has a mean of log θ and a standard deviation, referred to as an asymptotic standard error and denoted by ASE, of

    ASE(log θ̂) = √( 1/n_11 + 1/n_12 + 1/n_21 + 1/n_22 ).   (2.3.4)

The ASE value decreases as the cell counts increase. Because this sampling distribution is closer to normality, it is best to construct confidence intervals for log θ and then transform back (i.e., take antilogs, using the exponential function) to form a confidence interval for θ. A large-sample confidence interval for log θ is

    log θ̂ ± z_{α/2} ASE(log θ̂).

Exponentiating the endpoints of this confidence interval yields one for θ.

For Table 2.3, the natural log of θ̂ equals log(1.832) = 0.605. The ASE (2.3.4) of log θ̂ equals (1/189 + 1/10,933 + 1/10,845 + 1/104)^{1/2} = 0.123. For the population this sample represents, a 95% confidence interval for log θ equals 0.605 ± 1.96(0.123), or (0.365, 0.846). The corresponding confidence interval for θ is [exp(0.365), exp(0.846)] = (1.44, 2.33). Since the confidence interval for θ does not contain 1.0, the true odds of MI seem different for the two groups. The interval predicts that the odds of MI are at least 44% higher for subjects taking placebo than for subjects taking aspirin. The endpoints of the interval are not equally distant from θ̂ = 1.83, because the sampling distribution of θ̂ is skewed to the right.

The sample odds ratio θ̂ equals 0 or ∞ if any n_ij = 0, and it is undefined if both entries in a row or column are zero. The slightly amended estimator

    θ̃ = [(n_11 + 0.5)(n_22 + 0.5)] / [(n_12 + 0.5)(n_21 + 0.5)],

corresponding to adding 0.5 to each cell count, does not have this problem. It is preferred when the cell counts are very small or any zero cell counts occur. In that case, the ASE formula (2.3.4) replaces {n_ij} by {n_ij + 0.5}. For Table 2.3, θ̃ = (189.5)(10,933.5)/[(10,845.5)(104.5)] = 1.828 is close to θ̂ = 1.832, since no cell count is especially small.

2.3.4 Relationship Between Odds Ratio and Relative Risk

A sample odds ratio of 1.83 does not mean that p_1 is 1.83 times p_2; that would be the interpretation of a relative risk of 1.83, since that measure deals with proportions rather than odds. Instead, θ̂ = 1.83 means that the odds value p_1/(1 - p_1) is 1.83 times the odds value p_2/(1 - p_2). From (2.3.3) and from the sample analog of definition (2.2.3),

    Odds ratio = [p_1/(1 - p_1)] / [p_2/(1 - p_2)] = Relative risk × [(1 - p_2)/(1 - p_1)].

When the proportion of successes is close to zero for both groups, the fraction in the last term of this expression equals approximately 1.0. The odds ratio and relative risk then take similar values. Table 2.3 illustrates this similarity. For each group, the sample proportion of MI cases is close to zero. Thus, the sample odds ratio of 1.83 is similar to the sample relative risk of 1.82 obtained in Section 2.2.3. In such a case, an odds ratio of 1.83 does mean that p_1 is about 1.83 times p_2. This relationship between the odds ratio and the relative risk is useful.
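The interval of Section 2.3.3 and the closeness of the odds ratio to the relative risk can be verified with a similar minimal sketch (again assuming NumPy and SciPy; the variable names are ours, not the text's):

```python
import numpy as np
from scipy import stats

# Table 2.3 cell counts: rows = (placebo, aspirin), cols = (MI yes, MI no)
n11, n12, n21, n22 = 189, 10_845, 104, 10_933

odds_ratio = (n11 * n22) / (n12 * n21)           # equation (2.3.3)
log_or = np.log(odds_ratio)
ase = np.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)     # equation (2.3.4)

z = stats.norm.ppf(0.975)
ci = np.exp([log_or - z * ase, log_or + z * ase])  # back-transform the endpoints

rel_risk = (n11 / (n11 + n12)) / (n21 / (n21 + n22))

print(f"odds ratio = {odds_ratio:.3f}, log OR = {log_or:.3f}, ASE = {ase:.3f}")
print(f"95% CI for theta: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"relative risk = {rel_risk:.2f}")
```

This prints the odds ratio 1.832, the interval (1.44, 2.33), and the relative risk 1.82, illustrating how close the two measures are when both proportions are near zero.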
For some data sets, calculation of the relative risk is not possible, yet one can calculate the odds ratio and use it to approximate the relative risk. Table 2.4 is an example of this type. These data refer to a study that investigated the relationship between myocardial infarction and smoking. The first column refers to 262 young and middle-aged women (age < 69) admitted to 30 coronary care units in northern Italy with acute MI during the period 1983-1988. Each case was matched with two control patients admitted to the same hospitals with other acute disorders. The controls fall in the second column of the table. All subjects were classified according to whether they had ever been smokers. The "yes" group consists of women who were current smokers or ex-smokers, whereas the "no" group consists of women who never were smokers. We refer to this variable as smoking status.

Table 2.4  Cross Classification of Smoking Status and Myocardial Infarction (MI)

Ever Smoker    MI Cases    Controls
Yes            172         173
No             90          346

Source: A. Gramenzi et al., J. Epidemiol. and Commun. Health, 43: 214-217 (1989). Reprinted with permission of BMJ Publishing Group.

We would normally regard MI as a response variable and smoking status as an explanatory variable. In this study, however, the marginal distribution of MI is fixed by the sampling design, there being two controls for each case. The outcome measured for each subject is whether she ever was a smoker. The study, which uses a retrospective design to "look into the past," is called a case-control study. Such studies are common in health-related applications, for instance, to ensure a sufficiently large sample of subjects having the disease studied.

We might wish to compare ever-smokers with nonsmokers in terms of the proportion who suffered MI. These proportions refer to the conditional distribution of MI, given smoking status. We cannot estimate such proportions for this data set. For instance, about a third of the sample suffered MI, but only because the study matched each MI case with two controls; it does not make sense to use 1/3 as an estimate of the probability of MI. We can compute proportions in the reverse direction, for the conditional distribution of smoking status, given myocardial infarction status. For women suffering MI, the proportion who ever were smokers was 172/262 = .656, while it was 173/519 = .333 for women who had not suffered MI.

When the sampling design is retrospective, one can construct conditional distributions for the explanatory variable, within levels of the fixed response. It is usually not possible to estimate the probability of the response outcome of interest, or to compute the difference of proportions or relative risk for that outcome. Using Table 2.4, for instance, we cannot estimate the difference between nonsmokers and ever-smokers in the probability of suffering MI. We can compute the odds ratio, however. This is because the odds ratio takes the same value when it is defined using the conditional distribution of X given Y as it does when defined (as in (2.3.1)) using the distribution of Y given X; that is, it treats the variables symmetrically. The odds ratio is determined by the conditional distributions in either direction, and can be calculated even if we have a study design that measures a response on X within each level of Y.

In Table 2.4, the sample odds ratio is [.656/(1 - .656)]/[.333/(1 - .333)] = (172)(346)/[(173)(90)] = 3.82.
The estimated odds of ever being a smoker were about 2 for the MI cases (i.e., .656/.344) and about 1/2 for the controls (i.e., .333/.667), yielding an odds ratio of about 2/(1/2) = 4.

We noted that when the probability that Y = 1 is small for each value of X, the odds ratio and relative risk take similar values. Even if we can estimate only conditional probabilities of X given Y, if we expect P(Y = 1 | X) to be small, then we can use the sample odds ratio to provide a rough indication of the relative risk. For Table 2.4, we cannot estimate the relative risk of MI or the difference of proportions suffering MI. Since the probability of young or middle-aged women suffering MI is probably small regardless of smoking status, however, the odds ratio value of 3.82 is also a rough estimate of the relative risk. We estimate that women who ever smoked were nearly four times as likely to suffer MI as women who never smoked.

In Table 2.4, it makes sense to treat each column, rather than each row, as a binomial sample. Because of the matching that occurs in case-control studies, however, the binomial samples in the two columns are dependent rather than independent. Each observation in column 1 is naturally paired with two of the observations in column 2. Chapter 9 presents specialized methods for analyzing dependent binomial samples.

2.3.5 Types of Observational Studies*

By contrast to the study summarized by Table 2.4, imagine a study where we follow a sample of women for the next 20 years, observing the rates of MI for smokers and nonsmokers. Such a sampling design is prospective. There are two types of prospective studies. In cohort studies, the subjects make their own choice about which group to join (e.g., whether to be a smoker), and we simply observe in future time who suffers MI. In clinical trials, we randomly allocate subjects to the two groups of interest, such as in the aspirin study described in Section 2.2.2, again observing in future time who suffers MI. Yet another approach, a cross-sectional design, samples women and classifies them simultaneously on the group classification and their current response. As in a case-control study, we can then get the data at once, rather than waiting for future events.

Case-control, cohort, and cross-sectional studies are called observational studies. We observe who chooses each group and who has the outcome of interest. By contrast, a clinical trial is an experimental study, the investigator having control over which subjects enter each group, for instance, which subjects take aspirin and which take placebo. Clinical trials have fewer potential pitfalls, because of the use of randomization, but observational studies are often more practical for biomedical and social science research.

2.4 CHI-SQUARED TESTS OF INDEPENDENCE

We next show how to test the null hypothesis (H_0) that cell probabilities equal certain fixed values {π_ij}. For a sample of size n with cell counts {n_ij}, the values {μ_ij = n π_ij} are called expected frequencies. They represent the values of the expectations {E(n_ij)} when H_0 is true. This notation refers to two-way tables, but similar notions apply to multiway tables or to a set of counts for a single categorical variable.

To illustrate, for n flips of a coin, let π denote the probability of a head and 1 - π the probability of a tail on each flip. The null hypothesis that the coin is balanced corresponds to π = 1 - π = .5.
The expected frequency of heads equals μ = nπ = n/2, which also equals the expected frequency of tails. If H_0 is true, we expect to observe about half heads and half tails.

We compare sample cell counts to the expected frequencies to judge whether the data contradict H_0. If H_0 is true for a two-way table, n_ij should be close to μ_ij in each cell. The larger the differences {n_ij - μ_ij}, the stronger the evidence against H_0. The test statistics used to make such comparisons have large-sample chi-squared distributions.

2.4.1 Pearson Statistic and the Chi-Squared Distribution

The Pearson chi-squared statistic for testing H_0 is

    X² = Σ (n_ij - μ_ij)² / μ_ij.   (2.4.1)

It was proposed in 1900 by Karl Pearson, the British statistician known also for the Pearson product-moment correlation, among his many contributions. This statistic takes its minimum value of zero when all n_ij = μ_ij. For a fixed sample size, greater differences between {n_ij} and {μ_ij} produce larger X² values and stronger evidence against H_0.

Since larger X² values are more contradictory to H_0, the P-value of the test is the null probability that X² is at least as large as the observed value. The X² statistic has approximately a chi-squared distribution for large sample sizes. It is difficult to specify what "large" means, but {μ_ij ≥ 5} is sufficient. The P-value is the chi-squared right-hand tail probability above the observed X² value.

The chi-squared distribution is specified by its degrees of freedom, denoted by df. The mean of the chi-squared distribution equals df, and its standard deviation equals √(2df). As df increases, the distribution concentrates around larger values and is more spread out. It is defined only for nonnegative values and is skewed to the right, but becomes more bell-shaped (normal) as df increases. Figure 2.1 displays the shapes of chi-squared densities having df = 1, 5, 10, and 20. The df value equals the difference between the number of parameters in the alternative and null hypotheses, as explained later in this section.

Figure 2.1  Examples of chi-squared distributions.

2.4.2 Likelihood-Ratio Statistic

An alternative statistic for testing H_0 results from the likelihood-ratio method for significance tests. The test determines the parameter values that maximize the likelihood function under the assumption that H_0 is true. It also determines the values that maximize it under the more general condition that H_0 may or may not be true. The test is based on the ratio of the maximized likelihoods,

    Λ = (maximum likelihood when parameters satisfy H_0) / (maximum likelihood when parameters are unrestricted).

This ratio cannot exceed 1. If the maximized likelihood is much larger when the parameters are not forced to satisfy H_0, then the ratio Λ is far below 1 and there is strong evidence against H_0.

The test statistic for a likelihood-ratio test equals -2 log(Λ). This value is nonnegative, and "small" values of Λ yield "large" values of -2 log(Λ). The reason for the log transform is to yield an approximate chi-squared sampling distribution. For two-way contingency tables, this statistic simplifies to the formula

    G² = 2 Σ n_ij log(n_ij / μ_ij).   (2.4.2)

The statistic G² is called the likelihood-ratio chi-squared statistic. Like the Pearson statistic, G² takes its minimum value of 0 when all n_ij = μ_ij, and larger values provide stronger evidence against H_0.
Though the Pearson X² and likelihood-ratio G² provide separate test statistics, they share many properties and commonly yield the same conclusions. When H_0 is true and the sample cell counts are large, the two statistics have the same chi-squared distribution, and their numerical values are similar. Each statistic has advantages and disadvantages, which we allude to later in this section and in Sections 7.3.1 and 7.4.3.

2.4.3 Tests of Independence

In two-way contingency tables, the null hypothesis of statistical independence of two responses has the form

    H_0: π_ij = π_i+ π_+j for all i and j.

The marginal probabilities then specify the joint probabilities. To test H_0, we identify μ_ij = n π_ij = n π_i+ π_+j as the expected frequency. Here, μ_ij is the expected value of n_ij assuming independence. Usually, {π_i+} and {π_+j} are unknown, as is this expected value. We estimate the expected frequencies by substituting sample proportions for the unknown probabilities, giving

    μ̂_ij = n p_i+ p_+j = n (n_i+/n)(n_+j/n) = n_i+ n_+j / n.

The {μ̂_ij} are called estimated expected frequencies. They have the same row and column totals as the observed counts, but they display the pattern of independence.

For testing independence in I × J contingency tables, the Pearson and likelihood-ratio statistics equal

    X² = Σ (n_ij - μ̂_ij)² / μ̂_ij,    G² = 2 Σ n_ij log(n_ij / μ̂_ij).   (2.4.3)

Their large-sample chi-squared distributions have df = (I - 1)(J - 1). This means the following: Under H_0, {π_i+} and {π_+j} determine the cell probabilities. There are I - 1 nonredundant row probabilities; since they sum to 1, the first I - 1 determine the last one through π_I+ = 1 - (π_1+ + ··· + π_{I-1,+}). Similarly, there are J - 1 nonredundant column probabilities, for a total of (I - 1) + (J - 1) parameters. The alternative hypothesis does not specify the IJ cell probabilities. They are then solely constrained to sum to 1, so there are IJ - 1 nonredundant parameters. The value for df is the difference between the number of parameters under the alternative and null hypotheses, or

    (IJ - 1) - [(I - 1) + (J - 1)] = IJ - I - J + 1 = (I - 1)(J - 1).

2.4.4 Gender Gap Example

We illustrate chi-squared tests of independence using Table 2.5, from the 1991 General Social Survey. The variables are gender and party identification. Subjects indicated whether they identified more strongly with the Democratic or Republican party or as Independents.

Table 2.5  Cross Classification of Party Identification by Gender

                     Party Identification
Gender      Democrat       Independent    Republican     Total
Females     279 (261.4)    73 (70.7)      225 (244.9)    577
Males       165 (182.6)    47 (49.3)      191 (171.1)    403
Total       444            120            416            980

Note: Estimated expected frequencies for hypothesis of independence in parentheses.
Source: Data from 1991 General Social Survey.

Table 2.5 also contains estimated expected frequencies for H_0: independence. For instance, the first cell has μ̂_11 = n_1+ n_+1/n = (577)(444)/980 = 261.4. The chi-squared test statistics are X² = 7.01 and G² = 7.00, based on df = (I - 1)(J - 1) = (2 - 1)(3 - 1) = 2. The reference chi-squared distribution has a mean of df = 2 and a standard deviation of √(2df) = √4 = 2, so a value of 7.0 is fairly far out in the right-hand tail. Each statistic has a P-value of .03. This evidence of association would be rather unusual if the variables were truly independent. Both test statistics suggest that party identification and gender are associated.
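These statistics are easy to reproduce in standard software. A minimal Python sketch, assuming SciPy is installed, computes X², G², and the estimated expected frequencies for Table 2.5 (chi2_contingency computes G² when passed its log-likelihood option):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Table 2.5: rows = (females, males), cols = (Democrat, Independent, Republican)
table = np.array([[279, 73, 225],
                  [165, 47, 191]])

x2, p_x2, dof, expected = chi2_contingency(table, correction=False)
g2, p_g2, _, _ = chi2_contingency(table, correction=False,
                                  lambda_="log-likelihood")

print(f"X^2 = {x2:.2f} (P = {p_x2:.3f}), G^2 = {g2:.2f} (P = {p_g2:.3f}), df = {dof}")
print("estimated expected frequencies:\n", expected.round(1))
```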
" Most major statistical software packages have routines for calculating X2, G1, and their P-values. These P-values are approximations for true P-values, since the j chi-squared distribution is an approximation for the true sampling distribution. Thus, it would be overly optimistic for us to report P-values to the 4 or 5 decimal places that software provides them. IS we are lucky, the P-value approximation is good to the second decimal place, so it makes more sense to report it as .03 (or, at best, .028) rather than .02837. In any case, a P-value simply summarizes the strength of evidence. against the null hypothesis, and accuracy to two or three decimal places is sufficient for this purpose. 2.4.5 Residuals A test statistic andits P-value simply describe the evidence against the null hypothesis. A cell-by-cell comparison of observed and estimated expected frequencies helps us better unäe^taŕíd fee nature of the évíäence.jLarger differences Jéhreenj^y^icf^ tend to occur for ceils Maf have liřger.expecte.^ «,y — "jEy is insufficient. For the test of independence, useful;cell residuals^.have the fôrrn....... ""'"'"...... Ttij ~ tyj y/ßij(\ - pi+)i\ ~P+j) (2.4.4) These are caUe$ädjusľed residuals. \ When the null hypothesis is true, each adjusted residual has a large-sample standard 32 TWO-WAY CONTINGENCY TABLES Table 2.6 Adjusted Residuals (in Parentheses) for Testing Independence in Table 2.5 Party Identification Gender Democrat Independent Republican Females Males 279 (2.29) 165 (-2.29) 73 (0.46) 47 (-0.46) 225 (-2.62) 191 (2.62) indicates lack of lit of Ho in that cell. Table 2.6 shows the adjusted residuals for testing independence in Table 2.5. For the first cell, for instance, nu ~ 279andju.n = 261.4. The first row and first column marginal proportions equal p1+ = 577/980 = .589 and p+1 = 444/980 = .453. Substituting into (2.4.4), the adjusted residual for this cell equals 279-261.4 ^261.4(1 - .589)(1 - .453) = 2.29. This cell shows a greater discrepancy between «n and ßn than one would expect if the variables were truly independent. Table 2.6 shows large positive residuals for female Democrats and male Republi-;. cans, and large negative residuals for female Republicans and male Democrats^ThuSj V 11 ^^Ei^íS sigrnfip.antly.more..femaleJJernocrats „and male Repubjucaris and fewerje- faj in one cell, the reverse must happen in the other cell. The differences íi1; — fílý and n2j - ßzj have the same magnitude but different signs, implying the same pattern for their adjusted residuals. 2.4.6 Partitioning Chi-Squared Chi-squared statistics have a reproductive property. If one chi-squared statistic has df = df\ and a separate, independent, chi-squared statistic has df — df2, then then-sum is chi-squared with df = df\ + df2. For instance, if we had a table of form Table 2.5 for college-educated subjects and a separate one for subjects not having a college education, the sum of the X2 values or the sum of the G2 values from the two tables would be a chi-squared statistic with df = 2 + 2 = 4. CHI-SQUARED TESTS OF INDEPENDENCE 33 Similarly, chi-squared statistics having df > 1 can be broken into components with fewer degrees of freedom. For instance, a statistic having df — 2 can be partitioned into two independent components each having d f = 1. Another supplement to a test of independence partitions its chi-squared test statistic so that the components represent certain aspects of the association. 
2.4.6 Partitioning Chi-Squared

Chi-squared statistics have a reproductive property. If one chi-squared statistic has df = df_1 and a separate, independent chi-squared statistic has df = df_2, then their sum is chi-squared with df = df_1 + df_2. For instance, if we had a table of the form of Table 2.5 for college-educated subjects and a separate one for subjects not having a college education, the sum of the X² values or the sum of the G² values from the two tables would be a chi-squared statistic with df = 2 + 2 = 4.

Similarly, chi-squared statistics having df > 1 can be broken into components with fewer degrees of freedom. For instance, a statistic having df = 2 can be partitioned into two independent components each having df = 1. Another supplement to a test of independence partitions its chi-squared test statistic so that the components represent certain aspects of the association. A partitioning may show that an association primarily reflects differences between certain categories or groupings of categories.

We illustrate with a partitioning of G² for testing independence in 2 × J tables. The test statistic then has df = (J - 1), and we partition it into J - 1 components. The jth component is G² for testing independence in a 2 × 2 table, where the first column combines columns 1 through j of the original table, and the second column is column j + 1 of the original table. That is, G² for testing independence in a 2 × J table equals the sum of a G² statistic that compares the first two columns, plus a G² statistic for the 2 × 2 table that combines the first two columns and compares them to the third column, and so on, up to a G² statistic for the 2 × 2 table that combines the first J - 1 columns and compares them to the last column. Each component G² statistic has df = 1.

Consider again Table 2.5. The first two columns of this table form a 2 × 2 table with cell counts, by row, of (279, 73 / 165, 47). For this component table, G² = 0.16, with df = 1. Of those subjects who identify either as Democrats or Independents, there is little evidence of a difference between females and males in the relative numbers in the two categories. We form the second 2 × 2 table by combining these columns and comparing them to the Republican column, giving the table with rows (279 + 73, 225 / 165 + 47, 191) = (352, 225 / 212, 191). This table has G² = 6.84, based on df = 1. There is strong evidence of a difference between females and males in the relative numbers identifying as Republican instead of Democrat or Independent. Note that 0.16 + 6.84 = 7.00; that is, the sum of these G² components equals G² for the test of independence for the complete 2 × 3 table. This overall statistic primarily reflects differences between genders in choosing between Republicans and Democrats/Independents.

It might seem more natural to compute G² for separate 2 × 2 tables that pair each column with a particular one, say the last. Though this is a reasonable way to investigate association in many data sets, these component statistics are not independent and do not sum to G² for the complete table. Certain rules determine ways of forming tables so that chi-squared partitions exactly, but they are beyond the scope of this text (see, e.g., Agresti (1990), p. 53, for rules and references). A necessary condition is that the G² values for the component tables sum to G² for the original table. The G² statistic has exact partitionings. The overall Pearson X² statistic does not equal the sum of the X² values for the separate tables in a partition. However, it is valid to use the X² statistics for the separate tables in the partition; they simply do not provide an exact algebraic partitioning of the X² statistic for the overall table.
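The partition just described can be verified numerically. In the sketch below, the helper g2 is our own hypothetical function implementing (2.4.3), under the assumption that all cell counts are positive:

```python
import numpy as np

def g2(table):
    """Likelihood-ratio statistic G^2 for independence, equation (2.4.3)."""
    table = np.asarray(table, dtype=float)
    mu = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return 2 * np.sum(table * np.log(table / mu))

full = np.array([[279, 73, 225],
                 [165, 47, 191]])

# Component 1: Democrat vs. Independent
comp1 = full[:, [0, 1]]
# Component 2: (Democrat + Independent) combined vs. Republican
comp2 = np.column_stack([full[:, 0] + full[:, 1], full[:, 2]])

print(f"G2 components: {g2(comp1):.2f} + {g2(comp2):.2f} "
      f"= {g2(comp1) + g2(comp2):.2f}; full table: {g2(full):.2f}")
```

The two components sum exactly to the full-table G², whereas the analogous X² components would not.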
2.4.7 Comments on Chi-Squared Tests

Chi-squared tests of independence, like any significance tests, have serious limitations. They simply indicate the degree of evidence for an association. They are rarely adequate for answering all questions we have about a data set. Rather than relying solely on results of these tests, one should study the nature of the association. It is sensible to decompose chi-squared into components, study residuals, and estimate parameters such as odds ratios that describe the strength of association.

The X² and G² chi-squared tests also have limitations in the types of data sets for which they are applicable. For instance, they require large samples. The sampling distributions of X² and G² get closer to chi-squared as the sample size n increases, relative to the number of cells IJ. The convergence is quicker for X² than G². The chi-squared approximation is often poor for G² when n/IJ < 5. When I or J is large, it can be decent for X² when some expected frequencies are as small as 1. Section 7.4.3 provides further guidelines, but these are not crucial since small-sample procedures are available whenever we question whether n is sufficiently large. Section 2.6 discusses these.

The {μ̂_ij = n_i+ n_+j / n} used in X² and G² depend on the row and column marginal totals, but not on the order in which the rows and columns are listed. Thus, X² and G² do not change value with arbitrary reorderings of rows or of columns. This means that these tests treat both classifications as nominal. We ignore some information when we use them to test independence between ordinal classifications. When at least one variable is ordinal, more powerful tests of independence usually exist. The next section presents such a test.

2.5 TESTING INDEPENDENCE FOR ORDINAL DATA

The chi-squared test of independence using test statistic X² or G² treats both classifications as nominal. When the rows and/or the columns are ordinal, test statistics that utilize the ordinality are usually more appropriate.

2.5.1 Linear Trend Alternative to Independence

When the row variable X and the column variable Y are ordinal, a "trend" association is quite common. As the level of X increases, responses on Y tend to increase toward higher levels, or responses on Y tend to decrease toward lower levels. One can use a single parameter to describe such an ordinal trend association. The most common analysis assigns scores to categories and measures the degree of linear trend or correlation.

We next present a test statistic that is sensitive to positive or negative linear trends in the relationship between X and Y. It utilizes correlation information in the data. Let u_1 ≤ u_2 ≤ ··· ≤ u_I denote scores for the rows, and let v_1 ≤ v_2 ≤ ··· ≤ v_J denote scores for the columns. The scores have the same ordering as the category levels and are said to be monotone. The scores reflect distances between categories, with greater distances between categories treated as farther apart.

The sum Σ_ij u_i v_j n_ij, which weights cross-products of scores by the frequency of their occurrence, relates to the covariation of X and Y. For the chosen scores, the Pearson product-moment correlation r between X and Y equals the standardization of
Large values contradict independence, so, as with X2 and G2, the P-value is the right-tail probability above the observed value. The square root, M = s/n — lr, has approximately a standard normal null distribution. It applies to directional alternatives, such as positive correlation between the classifications. Tests using M1 treat the variables symmetrically. If one interchanges the rows with the columns and their scores in an / X / table, M2 takes identical value for the corresponding J XI table. 2.5.2 Alcohol and Infant Malformation Example Table 2.7 refers to a prospective study of maternal drinking and congenital malformations. After the first three months of pregnancy, the women in the sample completed a questionnaire about alcohol consumption. Following childbirth, observations were recorded on presence or absence of congenital sex organ malformations. Alcohol Table 2.7 Infant Malformation and Mother's Alcohol Consumption Percentage Adjusted Consumption Absent Present Total Present Residual 0 17,066 48 17,114 0.28 -0.18 <1 14,464 38 14,502 0.26 -0.71 1-2 788 5 793 0.63 1.84 3-5 126 1 127 0.79 1.06 >6 37 . 1 38 2.63 2.71 Source: B. I. Graubard and E. L. Korn, Biometrics 43:471-476 (1987). Reprinted with permission of the Biometrie Society. \ l^uin*----------— 36 TWO-WAY CONTINGENCY TABLES consumption, measured as average number of drinks per day, is an explanatory variable with ordered categories. Malformation, the response variable, is nominal. When a variable is nominal but has only two categories, statistics (such as M2) that treat the variable as ordinal are still valid. For instance, we could artificially regard malformation as ordinal, treating "absent" as "low" and "present" as "high." Any choice of two scores yields the same value of M2, and we simply use 0 for "absent" and 1 for "present." Table 2.7 has a mixture of very small, moderate, and extremely large counts. Even though the sample size is large (n = 32,574), in such cases the actual sampling distributions of X2 or G2 may not be close to chi-squared. For these data, having df = 4, G2 = 6.2 (P = .19) and X2 = 12.1 (P = .02), so they provide mixed signals. In any case, they ignore the ordinality of alcohol consumption. Table 2.7 lists the percentage of malformation cases at each level of alcohol consumption. These percentages show roughly an increasing trend. The first two are similar and the next two are also similar, however, and any of the last three percentages changes dramatically with the addition or deletion of one malformation case. Table 2.7 also reports adjusted residuals for the "present" category in this table. They are negative at low levels of alcohol consumption and positive at high levels of consumption, though most are small, and they also change substantially with slight changes in the data. The sample percentages and the adjusted residuals both suggest a possible tendency for malformations to be more likely at higher levels of alcohol consumption. The ordinal test statistic M2 requires scores for levels of alcohol consumption. It seems sensible to use scores that are midpoints of the categories; that is, v\ = 0, v2 = 0.5, v3 = 1.5, v4 = 4.0, v5 = 7.0, the last score being somewhat arbitrary. One can calculate r and M2 using software (e.g., PROC FREQ in SAS; see Table A.2 in the Appendix). The sample correlation between alcohol consumption and malformation is r = .014, and M2 = (32,573)(.014)2 = 6.6. The P-value of .01 suggests strong evidence of a nonzero correlation. 
The standard normal statistic M = 2.56 has P = .005 for the one-sided alternative of a positive correlation.

For the chosen scores, the correlation value of .014 seems weak. However, r has limited use as a descriptive measure for tables, such as this one, that are highly discrete and unbalanced. Future chapters present tests such as M² as part of a model-based analysis. For instance, Section 4.2 presents a model in which the probability of malformation changes linearly according to alcohol consumption. Model-based approaches yield estimates of the size of the effect as well as smoothed estimates of cell probabilities. These estimates are more informative than mere significance tests.

2.5.3 Extra Power with Ordinal Test

For testing independence, X² and G² refer to the most general alternative hypothesis possible, whereby cell probabilities exhibit any type of statistical dependence. Their df value of (I - 1)(J - 1) reflects an alternative hypothesis that has (I - 1)(J - 1) more parameters than the null hypothesis. These statistics are designed to detect any type of pattern for the additional parameters. In achieving this generality, they sacrifice sensitivity for detecting particular patterns.

When the row and column variables are ordinal, one can attempt to describe the association using a single extra parameter. For instance, the test statistic M² is based on a correlation measure of linear trend. When a test statistic refers to a single parameter, it has df = 1. When the association truly has a positive or negative trend, the ordinal test using M² has a power advantage over the tests based on X² or G². Since df equals the mean of the chi-squared distribution, a relatively large M² value based on df = 1 falls farther out in its right-hand tail than a comparable value of X² or G² based on df = (I - 1)(J - 1); falling farther out in the tail produces a smaller P-value. When there truly is a linear trend, M² tends to be of similar size to X² or G², so it tends to yield smaller P-values and hence greater power. In attempting to detect any type of dependence, the X² and G² statistics lose power relative to statistics designed to detect a particular type of dependence, if that type of dependence truly occurs.

Another advantage of chi-squared tests having small df values relates to the accuracy of chi-squared approximations. For small to moderate sample sizes, the true sampling distributions tend to be closer to chi-squared when df is smaller. When several cell counts are small, the chi-squared approximation is likely to be worse for X² or G² than it is for M².

2.5.4 Choice of Scores

For most data sets, the choice of scores has little effect on the results. Different choices of monotone scores usually give similar results. This may not happen, however, when the data are very unbalanced, such as when some categories have many more observations than other categories. Table 2.7 illustrates this. For the equally-spaced row scores (1, 2, 3, 4, 5), the test statistic equals M² = 1.83, giving a much weaker conclusion (P = .18).

The magnitudes of r and M² do not change with transformations of the scores that maintain the same relative spacings between the categories. For instance, scores (1, 2, 3, 4, 5) yield the same correlation as scores (0, 1, 2, 3, 4) or (2, 4, 6, 8, 10) or (10, 20, 30, 40, 50).

An alternative approach avoids the responsibility of selecting scores and uses the data to form them automatically.
Specifically, one assigns ranks to the subjects and uses them as the category scores. For all subjects in a category, one assigns the average of the ranks that would apply for a complete ranking of the sample from 1 to n. These are called midranks.

We illustrate by assigning midranks to the levels of alcohol consumption in Table 2.7. The 17,114 subjects at level 0 for alcohol consumption share ranks 1 through 17,114. We assign to each of them the average of these ranks, which is the midrank (1 + 17,114)/2 = 8557.5. The 14,502 subjects at level <1 for alcohol consumption share ranks 17,115 through 17,114 + 14,502 = 31,616, for a midrank of (17,115 + 31,616)/2 = 24,365.5. Similarly, the midranks for the last three categories are 32,013.0, 32,473.0, and 32,555.5. These scores yield M² = 0.35 and a weaker conclusion yet (P = .55).

Why does this happen? Adjacent categories having relatively few observations necessarily have similar midranks. For instance, the midranks (8557.5, 24,365.5, 32,013.0, 32,473.0, 32,555.5) for Table 2.7 are similar for the final three categories, since those categories have considerably fewer observations than the first two categories. A consequence is that this scoring scheme treats alcohol consumption level 1-2 (category 3) as much closer to consumption level ≥6 (category 5) than to consumption level 0 (category 1). This seems inappropriate. It is usually better to use one's judgment by selecting scores that reflect distances between categories. When uncertain about this choice, perform a sensitivity analysis. Select two or three "sensible" choices and check that the results are similar for each. Equally-spaced scores often provide a reasonable compromise when the category labels do not suggest any obvious choices, such as the categories (liberal, moderate, conservative) for political philosophy.

When X and Y are both ordinal, one can use midrank scores for each. The M² statistic is then sensitive to detecting nonzero values of a nonparametric form of correlation called Spearman's rho. Alternative ordinal tests for I × J tables utilize versions of other ordinal association measures. For instance, gamma and Kendall's tau-b are contingency table generalizations of the ordinal measure called Kendall's tau. The sample value of any such measure divided by its standard error has a large-sample standard normal distribution for testing independence, and the square of the statistic is chi-squared with df = 1. Like the test based on M², these tests share the potential power advantage that results from using a single parameter to describe the association.

2.5.5 Trend Tests for I-by-2 and 2-by-J Tables

We now study how M² utilizes the sample data when X or Y has only two levels. Suppose the row variable X is an explanatory variable, and the column variable Y is a response variable. When X is binary, the table has size 2 × J. Tables of this size occur in comparisons of two groups, such as when the rows represent two treatments.

Using scores (u_1 = 0, u_2 = 1) for the levels of X in this case, we see that the covariation measure Σ_ij u_i v_j n_ij on which M² is based simplifies to Σ_j v_j n_2j. This term sums the scores on Y for all subjects in row 2. Divided by the number of subjects in row 2, it gives the mean score for that row. In fact, when the columns (Y) are ordinal with scores {v_j}, the M² statistic for 2 × J tables is directed toward detecting differences between the two row means of the scores on Y.
In testing independence using M², small P-values suggest that the true difference in row means is nonzero. When we use midrank scores for Y, the test for 2 × J tables is sensitive to differences in mean ranks for the two rows. This test is called the Wilcoxon or Mann-Whitney test. Most nonparametric statistics texts present this test for fully-ranked response data, whereas the 2 × J table is an extended case in which sets of subjects at the same level of Y are tied and use midranks. The large-sample version of that nonparametric test uses a standard normal z statistic. The square of the z statistic is equivalent to M², using arbitrary scores (such as 0, 1) for the rows and midranks for the columns.

Tables of size I × 2, such as Table 2.7, have a binary response variable rather than a binary explanatory variable. It is then natural to focus on how the proportion classified in a given response category of Y varies across the levels of X. For ordinal X with monotone row scores and arbitrary scores for the two columns, M² focuses on detecting a linear trend in this proportion and relates to models presented in Section 4.2. In testing independence using M², small P-values suggest that the slope for this linear trend is nonzero. This I × 2 version of the ordinal test is called the Cochran-Armitage trend test.

2.5.6 Nominal-Ordinal Tables

The test statistic (2.5.1) treats both classifications as ordinal. When one variable (say X) is nominal but has only two categories, we can still use it. When X is nominal with more than two categories, it is inappropriate, and we use a different statistic. It is based on calculating a mean response on the ordinal variable in each row and considering the variation among the row means. The statistic is rather complex computationally, and we defer discussion of it to Section 7.3.6. It has a large-sample chi-squared distribution with df = (I - 1). When I = 2, it is identical to M², which then compares the two row means.

2.6 EXACT INFERENCE FOR SMALL SAMPLES

The confidence intervals and tests presented so far in this chapter are large-sample methods. As the sample size n grows, the cell counts grow, and "chi-squared" statistics such as X², G², and M² have distributions that are more nearly chi-squared. When the sample size is small, one can perform inference using exact distributions rather than large-sample approximations. This section discusses exact inference for two-way contingency tables.

2.6.1 Fisher's Exact Test

We first study the 2 × 2 case. The null hypothesis of independence corresponds to an odds ratio of θ = 1. A small-sample probability distribution for the cell counts is defined for the set of tables having the same row and column totals as the observed data. Under Poisson, binomial, or multinomial sampling assumptions for the cell counts, the distribution that applies to this restricted set of tables fixing the row and column totals is called the hypergeometric.

For given row and column marginal totals, the value of n_11 determines the other three cell counts. Thus, the hypergeometric formula expresses probabilities for the four cell counts in terms of n_11 alone. When θ = 1, the probability of a particular value n_11 for that count equals
    P(n_11) = [ C(n_1+, n_11) C(n_2+, n_+1 - n_11) ] / C(n, n_+1),   (2.6.1)

where the binomial coefficients equal

    C(a, b) = a! / [ b!(a - b)! ].

To test independence, the P-value is the sum of hypergeometric probabilities for outcomes at least as favorable to the alternative hypothesis as the observed outcome. We illustrate for H_a: θ > 1. Given the marginal totals, tables having larger n_11 values also have larger sample odds ratios θ̂ = (n_11 n_22)/(n_12 n_21), and hence provide stronger evidence in favor of this alternative. The P-value equals the right-tail hypergeometric probability that n_11 is at least as large as the observed value. This test for 2 × 2 tables, proposed by the eminent British statistician R. A. Fisher in 1934, is called Fisher's exact test.

2.6.2 Fisher's Tea Taster

To illustrate this test in his 1935 text, The Design of Experiments, Fisher described the following experiment: A colleague of Fisher's at Rothamsted Experiment Station near London claimed that, when drinking tea, she could distinguish whether milk or tea was added to the cup first. To test her claim, Fisher designed an experiment in which she tasted eight cups of tea. Four cups had milk added first, and the other four had tea added first. She was told there were four cups of each type, so that she should try to select the four that had milk added first. The cups were presented to her in random order. Table 2.8 shows a potential result of the experiment.

Table 2.8  Fisher's Tea-Tasting Experiment

                    Guess Poured First
Poured First    Milk    Tea    Total
Milk            3       1      4
Tea             1       3      4
Total           4       4

We conduct Fisher's exact test of H_0: θ = 1 against H_a: θ > 1. The null hypothesis states that Fisher's colleague's guess was independent of the actual order of pouring; the alternative hypothesis reflects her claim, predicting a positive association between true order of pouring and her guess. For this experimental design, the column margins are identical to the row margins (4, 4), since she knew that four cups had milk added first. Both marginal distributions are naturally fixed.

The null distribution of n_11 is the hypergeometric distribution defined for all 2 × 2 tables having row and column margins (4, 4). The potential values for n_11 are (0, 1, 2, 3, 4). The observed table, three correct guesses of the four cups having milk added first, has null probability

    P(3) = [ C(4, 3) C(4, 1) ] / C(8, 4) = [4!/(3!)(1!)][4!/(1!)(3!)] / [8!/(4!)(4!)] = 16/70 = .229.

The only table that is more extreme, for the alternative H_a: θ > 1, consists of four correct guesses. It has n_11 = n_22 = 4 and n_12 = n_21 = 0, and a probability of P(4) = 1/70 = .014. Table 2.9 summarizes the possible values of n_11 and their probabilities.

The P-value for the one-sided alternative H_a: θ > 1 equals the right-tail probability that n_11 is at least as large as observed; that is, P = P(3) + P(4) = .243. This is not much evidence against the null hypothesis of independence. The experiment did not establish an association between the actual order of pouring and the guess. Of course, it is difficult to show effects with such a small sample. If the tea taster had guessed all cups correctly (i.e., n_11 = 4), the observed result would have been the most extreme possible in the right-hand tail of the hypergeometric distribution; then, P = P(4) = .014, giving some reason to believe her claim. For the potential n_11 values, Table 2.9 shows P-values for the alternative H_a: θ > 1.
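Modern software automates this test. A minimal SciPy sketch reproduces the one-sided P-value for Table 2.8, both via fisher_exact and directly from the hypergeometric right-tail probability:

```python
from scipy.stats import fisher_exact, hypergeom

# Table 2.8: rows = poured first (milk, tea), cols = guess (milk, tea)
table = [[3, 1],
         [1, 3]]

# One-sided Fisher exact test, Ha: theta > 1
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"sample odds ratio = {odds_ratio:.1f}, one-sided P = {p_value:.3f}")

# The same P-value from the hypergeometric distribution directly:
# n11 has a hypergeometric distribution (8 cups, 4 with milk first, 4 selected)
p_right_tail = hypergeom.sf(2, 8, 4, 4)   # P(n11 >= 3)
print(f"P(n11 >= 3) = {p_right_tail:.3f}")
```

Both computations give P = .243, matching the hand calculation above.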
2.6.3 P-values and Type I Error Probabilities

The two-sided alternative Hₐ: θ ≠ 1 refers to the general alternative of statistical dependence used in chi-squared tests. Its exact P-value is usually defined as the two-tailed sum of the probabilities of tables no more likely than the observed table. To calculate it, one adds the hypergeometric probabilities of all outcomes y for the first cell count for which P(y) ≤ P(n₁₁), where n₁₁ is the observed count. For Table 2.8, summing all probabilities that are no greater than the probability P(3) = .229 of the observed table gives P = P(0) + P(1) + P(3) + P(4) = .486. When the row or column marginal totals are equal, the hypergeometric distribution is symmetric, and the two-sided P-value doubles the one-sided one.

Table 2.9 Hypergeometric Distribution for Tables with Margins of Table 2.8

    n₁₁    Probability    P-value    X²
    0         .014         1.000     8.0
    1         .229          .986     2.0
    2         .514          .757     0.0
    3         .229          .243     2.0
    4         .014          .014     8.0

Note: P-value refers to right-tail probability for one-sided alternative.

An alternative two-sided P-value sums the probabilities of those tables for which the Pearson X² statistic is at least as large as the observed value. That is, it uses the exact small-sample distribution of X² rather than its large-sample chi-squared distribution. Table 2.9 shows the X² values for the five tables having the margins of Table 2.8. The statistic can assume only three distinct values, so its highly discrete distribution is far from the continuous chi-squared distribution. Figure 2.2 plots this exact small-sample distribution of X². It equals 0.0 with probability .514, 2.0 with probability .458, and 8.0 with probability .028. The observed table has X² = 2.0, and the P-value equals the null probability of a value this large or larger, or .458 + .028 = .486. For these data, this P-value based on X² is identical to the one based solely on probabilities.

[Figure 2.2 Exact distribution of Pearson X² for Table 2.8.]

Computations for the hypergeometric distribution are rather messy. One can sidestep this distribution and approximate the exact P-value for X² by obtaining a P-value from the chi-squared distribution for an adjustment of the Pearson statistic using the Yates continuity correction. There is no longer any reason to use this approximation, however, since modern software makes it possible to conduct Fisher's exact test even for fairly large samples with hypergeometric P-values based on the X² or probability criteria.

For small samples, the exact distribution (2.6.1) is highly discrete, in the sense that n₁₁ can assume relatively few values. The P-value also has a small number of possible values. For Table 2.8, it can assume five values for the one-sided test and three values for the two-sided test. This has an impact on error rates in hypothesis testing. Suppose we make a formal decision about the null hypothesis using a supposed Type I error probability such as .05. That is, we reject the null hypothesis if the P-value is less than or equal to .05. Because of the test's discreteness, it is usually not possible to achieve that level exactly. For the one-sided alternative, the tea-tasting experiment yields a P-value below .05 only when n₁₁ = 4, in which case P = .014. When H₀ is true, the probability of this outcome is .014, so the actual Type I error probability would be .014, not .05. The test is said to be conservative, since the actual error rate is smaller than the intended one. (The approximation of exact tests using the Yates continuity correction is also conservative.)
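To make the discreteness concrete, the following sketch (Python with SciPy; the helper names are our own) enumerates the five tables with margins (4, 4), recovers the exact distribution of X² and the two-sided P-value of .486, and confirms that the actual one-sided Type I error rate at nominal level .05 is only .014:

    from scipy.stats import hypergeom

    n, n1p, np1 = 8, 4, 4
    prob = {k: hypergeom.pmf(k, n, n1p, np1) for k in range(5)}

    def pearson_x2(n11):
        # All four cells follow from n11 when the margins are (4, 4);
        # every expected cell count is 4 * 4 / 8 = 2.
        cells = [n11, n1p - n11, np1 - n11, n - n1p - np1 + n11]
        return sum((c - 2.0) ** 2 / 2.0 for c in cells)

    x2_obs = pearson_x2(3)                                    # 2.0
    p_two_sided = sum(p for k, p in prob.items()
                      if pearson_x2(k) >= x2_obs)             # .486

    # One-sided P-value at each n11, and the actual Type I error rate:
    right_tail = lambda k: sum(prob[j] for j in range(k, 5))
    alpha_actual = sum(prob[k] for k in range(5)
                       if right_tail(k) <= 0.05)              # .014, not .05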
This illustrates an awkwardness with formal decision-making at "sacred" levels such as .05 when the test statistic is discrete. For test statistics having a continuous distribution, the P-value has a uniform null distribution over the interval [0, 1]. That is, P is equally likely to fall anywhere between 0 and 1, so the probability that P falls below a fixed level α equals α, and the expected value of P is .5. For test statistics having discrete distributions, the null distribution of the P-value is discrete and has expected value greater than .5. For instance, for the one-sided test with the tea-tasting data, the P-value equals .014 with probability P(4) = .014, it equals .243 with probability P(3) = .229, and so forth; from Table 2.9, the expected value of the P-value is

$$\sum_P P \times \mathrm{Prob}(P) = .014(.014) + .243(.229) + .757(.514) + .986(.229) + 1.0(.014) = .685.$$

In this average sense, P-values for discrete distributions tend to be too large.

To diminish the conservativeness of tests for discrete data, one can use a slightly different definition of P-value. The mid P-value equals half the probability of the observed result, plus the probability of more extreme results. It has a null expected value of .5, the same as the regular P-value for continuous variates. For the tea-tasting data, with an observed value of 3 for n₁₁, the one-sided mid P-value equals P(3)/2 + P(4) = .229/2 + .014 = .129, compared to .243 for the ordinary P-value. The mid P-value for the two-sided test based on the X² statistic equals P(X² = 2)/2 + P(X² = 8) = .458/2 + .028 = .257, compared to .486 for the ordinary P-value.

Unlike an exact test with ordinary P-value, a test using the mid P-value does not guarantee that the Type I error rate falls below a fixed value (see Problem 2.27). However, it usually performs well and is less conservative than Fisher's exact test. For either P-value, rather than reducing the data to the extreme binary decision (reject H₀, do not reject H₀), it is better simply to report the P-value, using it as a measure of the weight of evidence against the null hypothesis.

In Table 2.8, both margins are naturally fixed. When only one set is fixed, such as when row totals are fixed with independent binomial samples, alternative exact tests exist that are less conservative than Fisher's exact test. These are beyond the scope of this text, but the reader can refer to a recent article by R. Berger and D. Boos (J. Am. Statist. Assoc., 1994, p. 1012).
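A minimal sketch of the mid P-value calculation for the one-sided tea-tasting test (mid_p_right_tail is our own helper, not a library routine):

    from scipy.stats import hypergeom

    def mid_p_right_tail(n11_obs, n, n1p, np1):
        """Half the probability of the observed n11, plus the probability
        of all larger (more extreme toward theta > 1) values."""
        upper = min(n1p, np1)
        pmf = [hypergeom.pmf(k, n, n1p, np1) for k in range(upper + 1)]
        return 0.5 * pmf[n11_obs] + sum(pmf[n11_obs + 1:])

    print(round(mid_p_right_tail(3, 8, 4, 4), 3))   # 0.129 = .229/2 + .014

The same idea gives the two-sided version: halve the probability of the observed X² value and add the probabilities of larger values.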
The difference between the nominal and true levels can be considerable when the sample size is small.

2.6.4 Small-Sample Confidence Interval for Odds Ratio

Exact inference is not limited to testing. One can also construct small-sample confidence intervals for the odds ratio. They correspond to a generalization of Fisher's exact test that tests an arbitrary value, H₀: θ = θ₀. A 95% confidence interval contains all values of θ₀ for which the exact test of H₀: θ = θ₀ yields P > .05; that is, for which one would not reject the null hypothesis at the .05 level. As happens with exact tests, the discreteness makes these confidence intervals conservative. The true confidence level can be no smaller than the nominal one, but it may actually be considerably larger. For instance, a nominal 95% confidence interval may have true confidence level 98%. Moreover, the true level is unknown.

To reduce the conservativeness, one can construct the interval corresponding to the test using a mid P-value. The confidence interval consists of all θ₀ values for which the mid P-value exceeds .05. This interval is shorter. Though its actual confidence level is not guaranteed to be at least the nominal level, it tends to be close to that level. Computations for either of these types of confidence intervals are complex and require specialized software (e.g., StatXact, Cytel Software, Cambridge, MA). For the tea-tasting data (Table 2.8), the "exact" 95% confidence interval for the true odds ratio equals (0.21, 626.17). The interval based on the test using the mid P-value equals (0.31, 308.55). Both intervals are very wide, because the sample size is so small.

2.6.5 Exact Tests of Independence for Larger Tables*

Exact tests of independence for tables of size larger than 2 × 2 use a multivariate version of the hypergeometric distribution. This distribution also applies to the set of all tables having the same row and column margins as the observed table. The exact tests are not practical to compute by hand or calculator but are feasible using computers. One selects a test statistic that describes the distance of the observed data from H₀. One then computes the probability of the set of tables for which the test statistic is at least as great as the observed one. For instance, for nominal variables, one could use X² as the test statistic. The P-value is then the null probability that X² is at least as large as the observed value, the calculation being done using the exact distribution rather than the large-sample chi-squared distribution.

Recently developed software makes exact tests feasible for tables for which large-sample approximations are invalid. The software StatXact performs many exact inferences for categorical data. To illustrate, Table 2.10 is a 3 × 9 table having many zero entries and small counts. For it, X² = 22.3 with df = 16. The chi-squared approximation for the distribution of X² gives P = .13. Because the cell counts are so small, the validity of this approximation is suspect. Using StatXact to generate the exact sampling distribution of X², we obtain an exact P-value of .001, quite different from the result using the large-sample approximation.

Table 2.10 Example of 3 × 9 Table for Small-Sample Test

    0  7  0  0  0  0  0  1  1
    1  1  1  1  1  1  1  0  0
    0  8  0  0  0  0  0  0  0

For another example, we return to the analysis in Section 2.5 of Table 2.7, on the potential effect of maternal alcohol consumption on infant sex organ malformation. For testing independence, the values of X² = 12.1 and G² = 6.2 yield P-values from a chi-squared distribution with df = 4 of .02 and .19, respectively. Because of the imbalance in the table counts and the presence of some small counts, we could instead use exact tests for these statistics. The P-values using the exact distributions of X² and G² are .03 and .13, respectively. These are closer together but still give differing evidence about the association.
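Readers without StatXact can approximate such an exact conditional P-value by Monte Carlo: sample tables having the observed margins (here by randomly permuting column labels against row labels, which draws from the multivariate hypergeometric) and record how often X² is at least as large as observed. The sketch below, in Python/NumPy, is our own rough approximation, not StatXact's algorithm; the seed and simulation size are arbitrary choices:

    import numpy as np

    table = np.array([
        [0, 7, 0, 0, 0, 0, 0, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0],
        [0, 8, 0, 0, 0, 0, 0, 0, 0],
    ])                                           # Table 2.10

    def pearson_x2(t):
        expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
        return ((t - expected) ** 2 / expected).sum()

    rng = np.random.default_rng(2023)            # arbitrary seed
    rows = np.repeat(np.arange(3), table.sum(axis=1))
    cols = np.repeat(np.arange(9), table.sum(axis=0))
    x2_obs = pearson_x2(table)                   # about 22.3 (df = 16)

    n_sim, hits = 10_000, 0
    for _ in range(n_sim):
        sim = np.zeros_like(table)
        np.add.at(sim, (rows, rng.permutation(cols)), 1)
        hits += pearson_x2(sim) >= x2_obs - 1e-9
    p_mc = (hits + 1) / (n_sim + 1)              # should be near the exact .001

Up to Monte Carlo error, p_mc estimates the same conditional P-value that an exhaustive enumeration over all tables with these margins would give.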
The columns of Table 2.7 are ordinal, and Section 2.5 presented a large-sample ordinal test based on a statistic M² (formula 2.5.1) that assigns scores to rows and columns. For ordinal data, exact tests exist using this statistic or using M for one-sided alternatives. For the one-sided alternative of a positive association, the exact P-value equals .02 for the midpoint scores (0, 0.5, 1.5, 4, 7), .10 for the equally spaced scores (0, 1, 2, 3, 4), and .29 for the midrank scores. For these data, the result depends greatly on the choice of scores.

PROBLEMS

2.1. A Swedish study considered the effect of low-dose aspirin on reducing the risk of stroke and heart attacks among people who have already suffered a stroke (Lancet 338: 1345-1349 (1991)). Of 1360 patients, 676 were randomly assigned to the aspirin treatment (one low-dose tablet a day) and 684 to a placebo treatment. During a follow-up period averaging about three years, the number of deaths due to myocardial infarction was 18 for the aspirin group and 28 for the placebo group.

a. Calculate and interpret the difference of proportions, relative risk of death, and the odds ratio.

b. Conduct an inferential analysis for these data. Interpret results.

2.2. In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals .001304 for current smokers and .000121 for