Applied Categorical & Nonnormal Data Analysis

Contingency Tables

A table that cross-classifies two variables is called a two-way contingency table. It is also known as a cross tabulation, or crosstabs for short. If each of the two variables has two levels, the table is a 2x2. If one variable has three levels and the other has five, it is a 3x5 table. We will start off by looking at a 2x2 table.

Observed Frequencies

The following table gives a representation of the observed frequencies of a 2x2 contingency table.

           |    column variable
   row var |     col 1      col 2 |    Total
-----------+----------------------+---------
     row 1 |       n11        n12 |      n1+
     row 2 |       n21        n22 |      n2+
-----------+----------------------+---------
     Total |       n+1        n+2 |        n

Here is what the observed frequencies look like for an example using myocardial infarction and the use of aspirin.

           | myocardial infarction
     group |       yes         no |    Total
-----------+----------------------+---------
   placebo |       189      10845 |    11034
   aspirin |       104      10933 |    11037
-----------+----------------------+---------
     Total |       293      21778 |    22071

The values in the body of the table represent the joint distribution, and the values around the edges represent the marginal distributions.

Observed Proportions

Here is a representation of the observed proportions, which can also be treated as probabilities.

           |    column variable
   row var |     col 1      col 2 |    Total
-----------+----------------------+---------
     row 1 |       p11        p12 |      p1+
     row 2 |       p21        p22 |      p2+
-----------+----------------------+---------
     Total |       p+1        p+2 |      1.0

The observed proportions for our example look like this:

           | myocardial infarction
     group |       yes         no |    Total
-----------+----------------------+---------
   placebo |     .0086      .4914 |    .4999
   aspirin |     .0047      .4954 |    .5001
-----------+----------------------+---------
     Total |     .0133      .9867 |   1.0000

Relative Risk

The relative risk in a 2x2 table is the ratio of the "success" probabilities for the two groups, that is, the ratio of the within-group proportions of success. For the MI example, it looks like this:

  RR = (189/11034)/(104/11037) = .0171/.0094 = 1.82

In this example, the sample proportion of myocardial infarction was 82% higher for the placebo group. If you take the reciprocal of the relative risk, the value is .55; the proportion of myocardial infarction was 45% lower for the aspirin group.

Odds Ratio

Before we can talk about odds ratios we need to define odds.

  odds = p/(1 - p)

Theoretically, odds can run from 0 to positive infinity. When the odds equal one, the probability of success is equal to the probability of failure. When the odds are less than one, the probability of success is less than the probability of failure. And, when the odds are greater than one, the probability of success is greater than the probability of failure.

An odds ratio is exactly what it sounds like, the ratio of two odds:

  OR = [p1/(1 - p1)] / [p2/(1 - p2)]

This is not the only way to compute the odds ratio. It is easier to compute it as a ratio of the cross products of either the frequencies or the proportions:

  OR = (n11*n22)/(n12*n21) = (189*10933)/(10845*104) = 1.83

When the odds ratio equals 1, the odds for group 1 are the same as the odds for group 2. When the odds ratio is greater than 1, the odds for group 1 are greater than the odds for group 2. When the odds ratio is less than 1, the reverse is true. The farther the odds ratio gets from 1 in either direction, the stronger the association between the variables. In this example, the odds of a myocardial infarction are 83% higher for the placebo group. If you take the reciprocal of the odds ratio, the value is .546. Thus, the odds of myocardial infarction were about 45% lower for the aspirin group than for the placebo group.
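As a quick check (not part of the original notes), the relative risk and odds ratio above can be reproduced in Stata. The display lines simply redo the arithmetic from the table; the csi and cci lines are a sketch using the immediate epitab commands, treating the placebo group as the "exposed" group, which should report the same relative risk and odds ratio along with confidence intervals.

* relative risk: ratio of the within-group proportions of MI
display (189/11034)/(104/11037)        // about 1.82

* odds ratio: cross-product ratio of the cell frequencies
display (189*10933)/(10845*104)        // about 1.83

* the same quantities from the immediate epitab commands
csi 189 104 10845 10933
cci 189 104 10845 10933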
Odds ratios are invariant when the orientation of the rows and columns is reversed, that is, when the table is transposed. The odds ratios are also invariant to certain changes in the marginal frequencies. For example, if you were to multiply each of the frequencies in the table by a constant, c, the odds ratio would remain unchanged. The same is true if you multiply the frequencies in one row by one constant and the frequencies in the other row by a different constant.

Relation of Relative Risk to Odds Ratio

When p1 and p2 are both very small, the value of the odds ratio is close to that of the relative risk. In any case, the odds ratio can be obtained from the relative risk by the following formula:

  OR = RR * (1 - p2)/(1 - p1)

where p1 and p2 are the "success" probabilities for the two groups. This is useful because there are times when it isn't possible to estimate the relative risk directly.

Conditional Probabilities

Conditional probabilities are the probabilities of an event given that some other event has occurred. In our MI example, the conditional probabilities for the groups are:

           | myocardial infarction
     group |       yes         no |    Total
-----------+----------------------+---------
   placebo |     .0171      .9829 |   1.0000
   aspirin |     .0094      .9906 |   1.0000
-----------+----------------------+---------
     Total |     .0133      .9867 |   1.0000

Recall that,

           | myocardial infarction
     group |       yes         no |    Total
-----------+----------------------+---------
   placebo |     .0086      .4914 |    .4999
   aspirin |     .0047      .4954 |    .5001
-----------+----------------------+---------
     Total |     .0133      .9867 |   1.0000

Thus, the conditional probability of a myocardial infarction for the placebo group is .0171, while for the aspirin group it is .0094. Two variables are said to be independent when the conditional distributions of one are identical at each level of the other. In this example, the conditional distributions are not identical.

Expected Frequencies

Here are the expected frequencies for our example given independence of group and myocardial infarction.

           | myocardial infarction
     group |       yes          no   |    Total
-----------+-------------------------+---------
   placebo |   146.480   10887.520   |    11034
   aspirin |   146.520   10890.480   |    11037
-----------+-------------------------+---------
     Total |       293       21778   |    22071

Note that the marginal frequencies are the same as in the table of observed frequencies. This is the case because the expected frequencies are obtained from the marginal distributions of the observed frequencies. For example, the expected frequency of 146.480 is obtained as follows:

  eij = (ni+)(n+j)/n++ = 11034*293/22071 = 146.480

where ni+ is the frequency for the ith row, n+j is the frequency for the jth column, and n++ is the total frequency for the entire table. What this means is that, when the two variables are independent, the joint distribution is determined by the marginal distributions of the variables. This property is just a variation of the rule for the joint probability of independent events, P(A & B) = P(A)*P(B).

Chi-Squared Statistic

In two-way contingency tables, chi-squared is used to test the independence of the two marginal variables. The chi-squared test is often called a goodness-of-fit test but is perhaps better thought of as a badness-of-fit test, because a large value of chi-squared indicates a bad fit between the observed and expected frequencies. There are two commonly computed chi-squared statistics: the Pearson chi-squared (χ2) and the likelihood-ratio chi-squared (G2), with degrees of freedom = (I-1)(J-1), where I is the number of rows and J is the number of columns. Asymptotically, χ2 and G2 are equivalent. However, in finite samples there can be a considerable difference between the values of these two statistics.
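As a quick check of the expected frequency and the two chi-squared statistics for the MI table (not part of the original notes), the arithmetic can be reproduced in Stata with display and the immediate tabi command. Both statistics should come out near 25 on 1 degree of freedom, so the hypothesis of independence is clearly rejected here.

* conditional probability of MI for the placebo group
display 189/11034                      // about .0171

* expected frequency for the placebo/yes cell under independence
display 11034*293/22071                // 146.480

* Pearson and likelihood-ratio chi-squared from the raw cell frequencies
tabi 189 10845 \ 104 10933, chi2 lrchi2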
Stata Examples

use http://www.gseis.ucla.edu/courses/data/hsb2

tabulate ses prog, all

           |         type of program
       ses |   general   academic   vocation |     Total
-----------+---------------------------------+----------
       low |        16         19         12 |        47
    middle |        20         44         31 |        95
      high |         9         42          7 |        58
-----------+---------------------------------+----------
     Total |        45        105         50 |       200

          Pearson chi2(4) =  16.6044   Pr = 0.002
 likelihood-ratio chi2(4) =  16.7830   Pr = 0.002
               Cramer's V =   0.2037
                    gamma =   0.0109  ASE = 0.097
          Kendall's tau-b =   0.0069  ASE = 0.062

tabulate ses prog, cell nofreq

           |         type of program
       ses |   general   academic   vocation |     Total
-----------+---------------------------------+----------
       low |      8.00       9.50       6.00 |     23.50
    middle |     10.00      22.00      15.50 |     47.50
      high |      4.50      21.00       3.50 |     29.00
-----------+---------------------------------+----------
     Total |     22.50      52.50      25.00 |    100.00

tabulate ses prog, row nofreq

           |         type of program
       ses |   general   academic   vocation |     Total
-----------+---------------------------------+----------
       low |     34.04      40.43      25.53 |    100.00
    middle |     21.05      46.32      32.63 |    100.00
      high |     15.52      72.41      12.07 |    100.00
-----------+---------------------------------+----------
     Total |     22.50      52.50      25.00 |    100.00

tabchi ses prog

          observed frequency
          expected frequency

----------------------------------------
          |       type of program
      ses |  general  academic  vocation
----------+-----------------------------
      low |       16        19        12
          |   10.575    24.675    11.750
          |
   middle |       20        44        31
          |   21.375    49.875    23.750
          |
     high |        9        42         7
          |   13.050    30.450    14.500
----------------------------------------

          Pearson chi2(4) =  16.6044   Pr = 0.002
 likelihood-ratio chi2(4) =  16.7830   Pr = 0.002

tabchi ses prog, raw pearson cont adjust noo noe

          raw residual
          Pearson residual
          contribution to chi-square
          adjusted residual

----------------------------------------
          |       type of program
      ses |  general  academic  vocation
----------+-----------------------------
      low |    5.425    -5.675     0.250
          |    1.668    -1.142     0.073
          |    2.783     1.305     0.005
          |    2.167    -1.895     0.096
          |
   middle |   -1.375    -5.875     7.250
          |   -0.297    -0.832     1.488
          |    0.088     0.692     2.213
          |   -0.466    -1.666     2.371
          |
     high |   -4.050    11.550    -7.500
          |   -1.121     2.093    -1.970
          |    1.257     4.381     3.879
          |   -1.511     3.604    -2.699
----------------------------------------

          Pearson chi2(4) =  16.6044   Pr = 0.002
 likelihood-ratio chi2(4) =  16.7830   Pr = 0.002

A note about tetrachoric correlations

Tetrachoric correlations measure the association between two dichotomous variables by estimating the correlation between their associated latent variables. The tabulate command includes an estimate of phi, a measure of association between dichotomous variables; in the 2x2 case, Stata labels phi as "Cramer's V." The same coefficient can be obtained by computing a standard Pearson correlation between the two variables. The tetrac command (findit tetrac) available from ATS uses an approximation of the tetrachoric correlation due to Edwards (1957): let

  α = ad/bc

where a, b, c, and d are the four cell frequencies, then

  r = (α^(π/4) - 1)/(α^(π/4) + 1)

The tetrachoric correlations are often larger than the phi coefficients for the same variables.
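Before turning to the tetrachoric example, here is a quick arithmetic check (not in the original notes) of the residuals tabchi reports, reproducing the four values for the low/general cell with display. The raw residual is O - E, the Pearson residual is (O - E)/sqrt(E), the contribution to chi-square is (O - E)^2/E, and the adjusted residual divides O - E by sqrt(E*(1 - row proportion)*(1 - column proportion)).

* low/general cell: observed 16, expected 10.575; row total 47, column total 45, n = 200
display 16 - 10.575                                           // raw residual, 5.425
display (16 - 10.575)/sqrt(10.575)                            // Pearson residual, about 1.668
display (16 - 10.575)^2/10.575                                // contribution to chi-square, about 2.783
display (16 - 10.575)/sqrt(10.575*(1 - 47/200)*(1 - 45/200))  // adjusted residual, about 2.167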
use http://www.gseis.ucla.edu/courses/data/tetra

tabulate hon sci, all

           |          sci
       hon |         0          1 |     Total
-----------+----------------------+----------
         0 |       111         36 |       147
         1 |        22         31 |        53
-----------+----------------------+----------
     Total |       133         67 |       200

          Pearson chi2(1) =  20.2150   Pr = 0.000
 likelihood-ratio chi2(1) =  19.4693   Pr = 0.000
               Cramer's V =   0.3179
                    gamma =   0.6258  ASE = 0.103
          Kendall's tau-b =   0.3179  ASE = 0.072

corr hon sci
(obs=200)

             |      hon      sci
-------------+------------------
         hon |   1.0000
         sci |   0.3179   1.0000

tetrac hon sci
(obs=200)

Approximate Tetrachoric Correlations

              hon      sci
    hon    1.0000
    sci    0.5204   1.0000

tetrac female schtyp ses hon sci
(obs=200)

Approximate Tetrachoric Correlations

           female   schtyp      ses      hon      sci
 female    1.0000
 schtyp   -0.0331   1.0000
    ses   -0.2844  -0.5840   1.0000
    hon    0.2504   0.0365   0.0837   1.0000
    sci   -0.2616  -0.1434   0.2996   0.5204   1.0000
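As a sanity check (not part of the original notes), the Edwards approximation given above reproduces the 0.5204 that tetrac reports for hon and sci. Using a = 111, b = 36, c = 22, and d = 31 from the table above:

* Edwards (1957) approximation by hand
scalar alpha = (111*31)/(36*22)                    // ad/bc, about 4.34
display (alpha^(_pi/4) - 1)/(alpha^(_pi/4) + 1)    // about 0.5204

For comparison, phi here is the 0.3179 reported by both corr and tabulate. If your copy of Stata includes the official tetrachoric command, it estimates the tetrachoric correlation by maximum likelihood and may give a slightly different value than this approximation.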