A MODIFIED MULTIPLE REGRESSION APPROACH TO THE ANALYSIS OF DICHOTOMOUS VARIABLES*

LEO A. GOODMAN
University of Chicago

American Sociological Review 1972, Vol. 37 (February):28-46

To illustrate the models and methods of the present article, we shall reanalyze data from the famous study The American Soldier by Stouffer et al. (1949), subsequently analyzed by Coleman (1964), Zeisel (1968), and Theil (1970). The methods we present reveal how the odds pertaining to a given dichotomized variable (e.g., the odds that a soldier would prefer a Northern to a Southern camp assignment) are related to other dichotomized variables (e.g., (a) the soldier's race, (b) his region of origin, (c) his present camp location). The usual regression analysis methods do not suit the case considered here, where the dependent variable is the odds pertaining to a given dichotomous variable; nor do they suit the case where the dependent variable is a proportion pertaining to the dichotomous variable. This article presents some relatively elementary models and methods suitable for analyzing the odds (or a proportion) pertaining to the given dichotomous dependent variable. Applying these models and methods to the data referred to above, new insights are obtained.

Let us begin by describing the data that will be analyzed here for illustrative purposes. These data, which are based on the earlier data first presented by Stouffer et al. (1949), appear in Table 1 below. This four-way table cross-classifies soldiers by the following four dichotomous variables: (A) race (Negro or white); (B) region of origin (North or South); (C) location of present camp (North or South); and (D) preference as to camp location (North or South). Table 1 shows that for, say, a Negro Northerner in a Northern camp, the odds are 387 to 36 that he will prefer a Northern camp. This table also shows that for, say, a white Southerner in a Southern camp, the odds are 91 to 869 that he will prefer a Northern camp. In the next section, we present a model that describes quantitatively how these and the other odds in Table 1 are affected by (A) race, (B) region of origin, and (C) present camp location.
We show how to test whether the model fits the data, and we measure how well the model fits using an index that is analogous to the usual multiple correlation coefficient of regression analysis. We also show how to assess the statistical significance of the contribution made by certain parameters in the model, and we measure the contribution's magnitude with indices that are analogous to the usual partial and multiple-partial correlation coefficients of regression analysis.

*This research was supported in part by Research Contract No. NSF GS 2818 from the Division of the Social Sciences of the National Science Foundation. For helpful comments, the author is indebted to R. D. Bock, S. Haberman, P. F. Lazarsfeld and A. Stinchcombe.

With the model that will be described in the next section, and with the more general model that follows it, we can estimate how the odds for preferring a Northern camp are changed by the "main effects" of race, region of origin, and present camp location, as well as by certain "interaction effects" among these variables. With each model considered here we can also estimate what the expected frequencies in Table 1 would be if the model were true. We can compare these estimated expected frequencies with the corresponding observed frequencies to determine whether the model fits the data. For the model that will be described in the next section, Table 2 gives the expected frequencies estimated under the assumption that the model is true. Under this model, the estimated odds are 390.64 to 32.36 that the Negro Northerner in a Northern camp will prefer a Northern camp; and the estimated odds are 91.74 to 868.26 that the white Southerner in a Southern camp will prefer a Northern camp. When we compare the corresponding entries in Tables 1 and 2 (by methods that will be described later herein), we find that the model fits the data well. Using the expected frequencies estimated under the given model (see Table 2), we can also estimate the expected proportion preferring a Northern camp (as well as the odds referred to above), for the individuals in each row in Table 2, under the assumption that the model is true.

Table 1. Cross-Classification of Soldiers With Respect to Four Dichotomized Variables: (A) Race, (B) Region of Origin, (C) Location of Present Camp, and (D) Preference as to Camp Location

    Variable A   Variable B          Variable C                  Variable D: Number of Soldiers Preferring Camp*
    Race         Region of Origin    Location of Present Camp    In North    In South
    Negro        North               North                          387          36
    Negro        North               South                          876         250
    Negro        South               North                          383         270
    Negro        South               South                          381        1712
    White        North               North                          955         162
    White        North               South                          874         510
    White        South               North                          104         176
    White        South               South                           91         869

*The numbers in this table were recalculated from the percentage table in Stouffer et al. (1949, p. 553). These numbers are consistent with the percentages in the 1949 table, but they may differ somewhat from the actual observed frequencies due to rounding of the percentages. A related percentage table was given also by Coleman (1964, p. 198) and Theil (1970, p. 104). "Preference for camp in North" includes (a) those who prefer to move to a specific camp located in the North, and (b) those whose present camp is in the North and who prefer to stay there. Similarly, "preference for camp in South" includes (a) those who prefer to move to a specific camp located in the South, and (b) those whose present camp is in the South and who prefer to stay there.
The models considered herein, which describe how the odds for preferring a Northern camp are changed by certain specified "main effects" and "interaction effects," can also be used to describe how the proportion preferring a Northern camp is changed by these effects.

In order to understand how the dependent variable (preference as to camp location) is related to the other three variables (race, region of origin, and present camp location), we began with the four-way table (Table 1). Is it necessary to use a four-way table to describe how the dependent variable is related to the other three variables, or can this relationship be summarized adequately using the information contained in tables of smaller dimension (e.g., two-way and/or three-way tables)? The methods presented in the present article can be used to answer this question. To estimate the relationship between the dependent variable and the other three variables, the model that will be presented in the next section will actually use only the information contained in (1) the two-way table describing the relationship between the dependent variable and race, and (2) the three-way table describing the relationship between the dependent variable, region of origin, and present camp location. Since that model fits the data well, we find that the relationship between the dependent variable and the other three variables can be summarized adequately using only the information contained in the particular two-way and three-way tables noted above. This topic will be discussed more fully later herein when Table 5 is presented.

We propose to analyze the data in Table 1 by methods quite different from those used in earlier analyses of these data. Our model fits the data better than Coleman's (1964), and we present a more parsimonious explanation of these data than he does. With the estimated parameters in our model, we can explain, in a more comprehensive and compact way, various interesting features of these data noted by Zeisel (1968). Some of the models considered in the present article are related to those in Theil (1970), but the methods we use are easier to apply than his. In the final section herein, we shall compare our methods more fully with the earlier ones.

Table 2. Estimate of the Expected Frequencies in the Four-Way Contingency Table (Table 1), Under the Model in Which the Odds for Preferring a Northern Camp Depend on Race, Region of Origin, Location of Present Camp, and on the Interaction Between Region of Origin and Location of Present Camp

    Variable A   Variable B          Variable C                  Variable D: Number of Soldiers Preferring Camp
    Race         Region of Origin    Location of Present Camp    In North    In South
    Negro        North               North                          390.64       32.36
    Negro        North               South                          879.31      246.69
    Negro        South               North                          376.79      276.21
    Negro        South               South                          380.26     1712.74
    White        North               North                          951.36      165.64
    White        North               South                          870.69      513.31
    White        South               North                          110.21      169.79
    White        South               South                           91.74      868.26

A MODEL FOR ANALYZING THE ODDS

The symbols A, B, C, and D denote the four dichotomized variables in the four-way table (Table 1): (A) race, (B) region of origin, (C) location of present camp, and (D) preference as to camp location. For variable A, we use the numbers 1 and 2 to denote Negro and white. For variables B, C, and D, we use the numbers 1 and 2 to denote North and South. Each of Table 1's sixteen cells can be designated (i, j, k, l), where i = 1 or 2; j = 1 or 2; k = 1 or 2; l = 1 or 2.
For example, the entry 387 is in cell (1, 1, 1, 1), a case where variables A, B, C, D all take on the value 1; the entry 36 is in cell (1, 1, 1, 2), with variables A, B, and C taking on the value 1 and variable D the value 2; the entry 876 is in cell (1, 1, 2, 1), with variables A, B, and D taking on the value 1 and variable C the value 2; the entry 250 is in cell (1, 1, 2, 2), with variables A and B taking on the value 1 and variables C and D the value 2.

Let f_ijkl denote the observed frequency in cell (i, j, k, l) of Table 1. For example, f_1111 = 387, f_1112 = 36, f_1121 = 876, f_1122 = 250, etc. Note that each row of Table 1 can be described by the triplet (i, j, k). For example, the first row is (1, 1, 1); the second, (1, 1, 2); etc. Let n_ijk denote the total observed frequency in row (i, j, k). In other words, we can write n_ijk as

    n_ijk = f_ijk1 + f_ijk2.    (1)

For example, n_111 = 423, n_112 = 1126, etc. For those in row (i, j, k), the observed odds in favor of a preference for a Northern camp (i.e., the odds that variable D will take on the value 1) can be written as

    ω_ijk = f_ijk1 / f_ijk2.    (2)

For example, ω_111 = 10.75, ω_112 = 3.50, etc. In other words, when variables A, B, and C all take on the value 1, the odds are 10.75 to 1 that variable D will take on that value. When variables A and B take on the value 1 and variable C the value 2, the odds are 3.50 to 1 that variable D will take on the value 1.

For row (i, j, k) in Table 1, let p_ijk denote that row's observed proportion of observations for which variable D takes on the value 1. In other words, we can write p_ijk as

    p_ijk = f_ijk1 / n_ijk.    (3)

For example, p_111 = .91, p_112 = .78, etc. We also let q_ijk denote the observed proportion of observations in row (i, j, k) for which variable D takes on the value 2. Thus,

    q_ijk = f_ijk2 / n_ijk = 1 - p_ijk.    (4)

From (2)-(4), we see that

    ω_ijk = p_ijk / q_ijk.    (5)

We can also express the p_ijk and q_ijk in terms of the observed odds ω_ijk:

    p_ijk = ω_ijk / (1 + ω_ijk),    q_ijk = 1 / (1 + ω_ijk).    (6)

From (3) and (6) we see that the observed frequencies f_ijkl can be expressed in terms of the observed odds ω_ijk and the n_ijk:

    f_ijk1 = n_ijk ω_ijk / (1 + ω_ijk),    f_ijk2 = n_ijk / (1 + ω_ijk).    (7)

Let F_ijkl denote the expected frequency in cell (i, j, k, l) under some specified model. For example, for the model referred to at the end of the preceding section, we see from Table 2 that F_1111 and F_1112 are estimated as 390.64 and 32.36, respectively. (The calculation of the entries in Table 2 will be commented upon later herein, after we have presented the material in Table 5.) Letting Ω_ijk denote the odds based on the expected frequencies, we see that

    Ω_ijk = F_ijk1 / F_ijk2.    (8)

Formula (8) corresponds to (2). In addition, corresponding to (7), we have the following:

    F_ijk1 = n_ijk Ω_ijk / (1 + Ω_ijk),    F_ijk2 = n_ijk / (1 + Ω_ijk).    (9)

(For a related matter, see (44) later herein.) Thus, from the F_ijkl, we can calculate the "expected odds" Ω_ijk; and from the Ω_ijk and n_ijk, we can calculate the F_ijkl. Our models will express the Ω_ijk in terms of a set of parameters that describe the "main effects" of variables A, B, and C, and certain "interaction effects" among these variables, in a way that is somewhat analogous to the corresponding effects in the usual analysis of variance model.
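For readers who wish to reproduce these quantities, the following short sketch (in Python; it is ours, not part of the original article) computes the row totals, observed odds, and observed proportions of formulas (1)-(5) directly from the Table 1 frequencies. The dictionary name and layout are ours.

    # Sketch (not from the original article): the row totals n_ijk, observed odds
    # w_ijk, and observed proportions p_ijk of formulas (1)-(5), computed from the
    # Table 1 counts.  Rows are keyed by (i, j, k): i = race, j = region of origin,
    # k = present camp location, with 1 = Negro/North and 2 = white/South.
    table1 = {
        (1, 1, 1): (387, 36),   (1, 1, 2): (876, 250),
        (1, 2, 1): (383, 270),  (1, 2, 2): (381, 1712),
        (2, 1, 1): (955, 162),  (2, 1, 2): (874, 510),
        (2, 2, 1): (104, 176),  (2, 2, 2): (91, 869),
    }

    for (i, j, k), (f1, f2) in sorted(table1.items()):
        n = f1 + f2            # formula (1): n_ijk = f_ijk1 + f_ijk2
        w = f1 / f2            # formula (2): omega_ijk = f_ijk1 / f_ijk2
        p = f1 / n             # formula (3): p_ijk = f_ijk1 / n_ijk
        print(f"row ({i},{j},{k}): n = {n:5d}, odds = {w:6.2f}, proportion = {p:.2f}")

Running it prints, for row (1, 1, 1), a total of 423, odds of 10.75, and a proportion of .91, matching the values quoted above.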
In the present section, we shall present a particular model that fits the data (Table 1) well; and in the next section, we shall present a more general model, namely, a "saturated model" for analyzing the odds, that can help determine the various "unsaturated models" that should be examined further.1 Our analysis of this saturated model led us to the particular unsaturated model that we shall present now. For expository reasons we present the unsaturated model first.

1. The saturated model, which we present in the next section, can also be described as a full model or an unrestricted model. The unsaturated models can also be described as restricted models. The various models we consider, which assume that the expected odds Ω_ijk are subject to certain multiplicative main and interaction effects (see, e.g., formulas (10) and (29)), are quite different from models of the kind appearing in, for example, Coleman (1964) and Boudon (1968). For further comment, see the final section of the present article.

Consider the following model:

    Ω_ijk = γ γ^A_i γ^B_j γ^C_k γ^BC_jk,    (10)

where

    γ^A_1 = 1/γ^A_2,  γ^B_1 = 1/γ^B_2,  γ^C_1 = 1/γ^C_2,
    γ^BC_11 = γ^BC_22 = 1/γ^BC_12 = 1/γ^BC_21.    (11)

The parameters γ, γ^A_i, γ^B_j, and γ^C_k describe the "main effects" on Ω_ijk of the general mean2 and of variables A, B, and C, respectively; and the parameter γ^BC_jk describes the "interaction effect" of variables B and C on Ω_ijk.3

2. Since γ is somewhat analogous to the main effect of the general mean in the usual model for the analysis of variance (i.e., the constant term in that model), we refer to γ as the main effect of the general mean on the Ω_ijk. γ actually equals the geometric mean of the Ω_ijk corresponding to the eight possible values of (i, j, k) obtained when i = 1, 2; j = 1, 2; k = 1, 2. For further details, see formula (12) below.

3. The relationship between the model described above by (10)-(11) and the usual model for the analysis of variance will be clarified when we discuss formulas (20)-(22).

Formula (10) describes the effects of the parameters on Ω_ijk, expressing Ω_ijk explicitly in terms of the model's parameters. These parameters can also be expressed explicitly in terms of the Ω_ijk. From (10)-(11), we obtain the following expressions for the parameters in terms of the Ω_ijk:

    γ = [Π_{i=1,2} Π_{j=1,2} Π_{k=1,2} Ω_ijk]^{1/8},    (12)

    γ^A_1 = [Ω_1jk / Ω_2jk]^{1/2}  (for j = 1, 2; k = 1, 2)
          = [Π_{j=1,2} Π_{k=1,2} (Ω_1jk / Ω_2jk)]^{1/8},    (13)

    γ^B_1 = [Π_{i=1,2} Π_{k=1,2} (Ω_i1k / Ω_i2k)]^{1/8},    (14)

    γ^C_1 = [Π_{i=1,2} Π_{j=1,2} (Ω_ij1 / Ω_ij2)]^{1/8},    (15)

    γ^BC_11 = [(Ω_i11 Ω_i22) / (Ω_i12 Ω_i21)]^{1/4}  (for i = 1, 2)
            = [Π_{i=1,2} (Ω_i11 Ω_i22) / (Ω_i12 Ω_i21)]^{1/8}.    (16)

From (12), we see that γ is actually the geometric mean of the eight Ω_ijk. From (13), we see that γ^A_1 is the square root of the odds-ratio Ω_1jk/Ω_2jk. From (12)-(16), we see that all the γ parameters can be expressed in terms of the Ω_ijk.4 Since Table 2 presents the estimated values of the F_ijkl (under model (10)), we can use these to estimate first the Ω_ijk (see (8)) and second the γ parameters (see (12)-(16)). Table 3 gives the estimated values of the γ parameters.

4. The relationship between formulas (12)-(16) and certain formulas in the analysis of variance will be clarified when we discuss formulas (23)-(27).

To emphasize the fact that the odds Ω_ijk pertain to variable D, and that the γ parameters describe the main and interaction effects on these odds, we could replace the symbols Ω_ijk, γ, γ^A_i, γ^B_j, γ^C_k, γ^BC_jk in (10)-(16) by Ω^D_ijk, γ^D, γ^AD_i, γ^BD_j, γ^CD_k, γ^BCD_jk, respectively. This notation was used in Table 3 and later in Table 4, where each of the above parameters is identified by its superscript.
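The geometric-mean expressions (12)-(16) are easy to evaluate mechanically. The sketch below (ours, not the article's) assumes the odds are held in a dictionary keyed by (i, j, k); applied to the odds formed from the Table 2 fitted frequencies via formula (8), it reproduces, to rounding, the γ column of Table 3.

    # Sketch (not from the article): the gamma parameters of model (10) estimated
    # from a full set of eight odds via the geometric-mean formulas (12)-(16).
    # `odds[(i, j, k)]` is assumed to hold Omega_ijk.
    from math import prod

    def gamma_parameters(odds):
        cells = [(i, j, k) for i in (1, 2) for j in (1, 2) for k in (1, 2)]
        gamma    = prod(odds[c] for c in cells) ** (1 / 8)                      # (12)
        gamma_A1 = prod(odds[(1, j, k)] / odds[(2, j, k)]
                        for j in (1, 2) for k in (1, 2)) ** (1 / 8)             # (13)
        gamma_B1 = prod(odds[(i, 1, k)] / odds[(i, 2, k)]
                        for i in (1, 2) for k in (1, 2)) ** (1 / 8)             # (14)
        gamma_C1 = prod(odds[(i, j, 1)] / odds[(i, j, 2)]
                        for i in (1, 2) for j in (1, 2)) ** (1 / 8)             # (15)
        gamma_BC11 = prod(odds[(i, 1, 1)] * odds[(i, 2, 2)]
                          / (odds[(i, 1, 2)] * odds[(i, 2, 1)])
                          for i in (1, 2)) ** (1 / 8)                           # (16)
        return gamma, gamma_A1, gamma_B1, gamma_C1, gamma_BC11

    # Odds formed from the Table 2 estimates via formula (8):
    fitted_odds = {(1, 1, 1): 390.64 / 32.36,  (1, 1, 2): 879.31 / 246.69,
                   (1, 2, 1): 376.79 / 276.21, (1, 2, 2): 380.26 / 1712.74,
                   (2, 1, 1): 951.36 / 165.64, (2, 1, 2): 870.69 / 513.31,
                   (2, 2, 1): 110.21 / 169.79, (2, 2, 2): 91.74 / 868.26}
    print([round(g, 2) for g in gamma_parameters(fitted_odds)])
    # roughly [1.31, 1.45, 3.45, 2.14, 0.86], the gamma column of Table 3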
From Table 3, we see that the estimated main effect of each variable (A, B, C) is positive (i.e., the estimates of γ^AD_1, γ^BD_1, γ^CD_1 are all larger than 1); but the estimated interaction effect between variables B and C is negative (i.e., the estimate of γ^BCD_11 is less than 1). This means, among other things, that the estimated effect on Ω^D_ijk of being a Northerner in a Northern camp is less positive (due to the multiplicative factor of 0.86 pertaining to γ^BCD_11) than might be surmised simply by combining the main effect of being a Northerner with the main effect of being in a Northern camp. More precisely, after taking account of the model's various main effects, we must multiply the estimate of Ω^D_ijk by the factor 0.86 for a Northerner located in a Northern camp, to account for the interaction effect γ^BCD_11 between variables B and C (i.e., the effect on Ω^D_ijk of the interaction between region of origin and present camp location).5

5. From the relationship between γ^BC_11, γ^BC_22, γ^BC_12 and γ^BC_21 described by formula (11), we see that, after taking account of the various main effects in the model, the estimate of the expected odds Ω^D_ijk favoring a preference for a Northern camp must be multiplied by the factor 0.86 for those whose region of origin is the same as their present camp location (viz., Northerners in a Northern and Southerners in a Southern camp); and it must be divided by the factor 0.86 for those whose region of origin differs from their present camp location (viz., Northerners in a Southern and Southerners in a Northern camp). For further comments, see the final section herein.

Table 3. Estimate of the Main Effects and Interaction Effects of the Three Variables (A, B, C) on the Odds Ω_ijk Pertaining to Variable D in the Four-Way Contingency Table (Table 1), Under Models (10) and (20)

    Variable    γ Effects in Model (10)    β Effects in Model (20)
    D                  1.31                       .27
    AD                 1.45                       .37
    BD                 3.45                      1.24
    CD                 2.14                       .76
    BCD                0.86                      -.15

By applying the numerical values of Table 3 to formula (10), we see, for example, that

    Ω̂^D_111 = (1.31)(1.45)(3.45)(2.14)(0.86) = 12.07.    (17a)

(All calculations in this paper were carried out to more significant digits than are reported here.) Further insight into the meaning of the γ parameters can be gained by noting how the estimated values of these parameters affect the estimate of Ω^D_ijk, for i = 1, 2; j = 1, 2; k = 1, 2. By applying the numerical values of Table 3 to formulas (10)-(11), we find that the Ω̂^D_ijk can be estimated by (17a) and as follows:

    Ω̂^D_211 = (1.31)(1/1.45)(3.45)(2.14)(0.86) = 5.74,    (17b)
    Ω̂^D_121 = (1.31)(1.45)(1/3.45)(2.14)(1/0.86),    (17c)
    Ω̂^D_112 = (1.31)(1.45)(3.45)(1/2.14)(1/0.86),    (17d)

et cetera. Comparing (17a) with (17b), we see the effect of γ^AD_1. Comparing (17a) with (17c), we see the effect of γ^BD_1 and γ^BCD_11. Comparing (17a) with (17d), we see the effect of γ^CD_1 and γ^BCD_11.

We used the superscript D in the preceding two paragraphs to emphasize the fact that the odds Ω_ijk pertain to variable D, and that the γ parameters describe the main and interaction effects on these odds. To simplify notation we will delete this superscript hereafter, in all but one section of the paper.

Formula (10) expresses Ω_ijk as a product of certain main and interaction effect parameters. This formula can also be expressed in an additive form via logarithms.
First, corresponding to Ω_ijk, we let Φ_ijk denote the natural logarithm of Ω_ijk; i.e., we define Φ_ijk as

    Φ_ijk = log Ω_ijk,    (18)

where "log" denotes the natural logarithm. Second, corresponding to formula (10)'s set of parameters (γ, γ^A_i, γ^B_j, γ^C_k, γ^BC_jk), we define a new set as follows:

    β = log γ,  β^A_i = log γ^A_i,  β^B_j = log γ^B_j,  etc.    (19)

Then from (10) and (18)-(19) we see that

    Φ_ijk = β + β^A_i + β^B_j + β^C_k + β^BC_jk.    (20)

From (11) and (19) we see that

    β^A_1 = -β^A_2,  β^B_1 = -β^B_2,  β^C_1 = -β^C_2,
    β^BC_11 = β^BC_22 = -β^BC_12 = -β^BC_21,    (21)

which can also be expressed as follows:

    Σ_{i=1,2} β^A_i = 0,  Σ_{j=1,2} β^B_j = 0,  Σ_{k=1,2} β^C_k = 0,
    Σ_{j=1,2} β^BC_jk = 0 (for k = 1, 2),  Σ_{k=1,2} β^BC_jk = 0 (for j = 1, 2).    (22)

The parameters β, β^A_1, β^B_1, and β^C_1 describe the main effects on Φ_ijk of the general mean6 and of variables A, B, and C; and the parameter β^BC_11 describes the interaction effect of variables B and C on Φ_ijk. The model described by formula (20), which expresses Φ_ijk in terms of five parameters (β, β^A_1, β^B_1, β^C_1, β^BC_11), is equivalent to that described by formula (10), which expresses the corresponding Ω_ijk in terms of the corresponding five parameters (γ, γ^A_1, γ^B_1, γ^C_1, γ^BC_11).

6. Since β is somewhat analogous to the main effect of the general mean in the usual model for the analysis of variance (i.e., the constant term in that model), we refer to β as the main effect of the general mean on Φ_ijk. β actually equals the arithmetic mean of the Φ_ijk corresponding to the eight possible values of (i, j, k) obtained when i = 1, 2; j = 1, 2; and k = 1, 2. For further details, see formula (23) below.

We noted earlier that model (10)'s parameters could be expressed explicitly in terms of the Ω_ijk (see formulas (12)-(16)). Similarly, the parameters in model (20) can be expressed explicitly in terms of the Φ_ijk. From (20)-(22), we obtain the following expressions for these parameters in terms of the Φ_ijk:

    β = [Σ_{i=1,2} Σ_{j=1,2} Σ_{k=1,2} Φ_ijk] / 8,    (23)

    β^A_1 = [Φ_1jk - Φ_2jk] / 2  (for j = 1, 2; k = 1, 2)
          = [Σ_{j=1,2} Σ_{k=1,2} (Φ_1jk - Φ_2jk)] / 8,    (24)

    β^B_1 = [Σ_{i=1,2} Σ_{k=1,2} (Φ_i1k - Φ_i2k)] / 8,    (25)

    β^C_1 = [Σ_{i=1,2} Σ_{j=1,2} (Φ_ij1 - Φ_ij2)] / 8,    (26)

    β^BC_11 = [Φ_i11 + Φ_i22 - Φ_i12 - Φ_i21] / 4  (for i = 1, 2)
            = [Σ_{i=1,2} (Φ_i11 + Φ_i22 - Φ_i12 - Φ_i21)] / 8.    (27)

Formulas (23)-(27) are equivalent to the corresponding formulas (12)-(16).7 Formula (23) states that β is the arithmetic mean of the Φ_ijk (corresponding to the eight possible values of (i, j, k)). Formula (24) states that β^A_1 can be expressed both as one-half the difference Φ_1jk - Φ_2jk (for j = 1, 2; k = 1, 2), and as one-half the arithmetic mean of the differences Φ_1jk - Φ_2jk corresponding to the four possible values of (j, k) obtained when j = 1, 2; k = 1, 2. Formula (25) states that β^B_1 equals one-half the arithmetic mean of the differences Φ_i1k - Φ_i2k corresponding to the four possible values of (i, k) obtained when i = 1, 2; k = 1, 2. Formula (26) can be similarly expressed, and formula (27) also has a somewhat similar interpretation.

7. Indeed, instead of obtaining (23)-(27) from formulas (20)-(22), we could also have obtained (23)-(27) from formulas (12)-(16) and (18)-(19). Similarly, we could have obtained (12)-(16) from formulas (23)-(27), making use of formula (28) below and the fact that Ω_ijk = exp Φ_ijk.

Since Table 2 presents the estimated values of the F_ijkl (under model (10) or the equivalent model (20)), we can use these values to estimate first the Ω_ijk and Φ_ijk (see (8) and (18)) and second the β parameters (see (23)-(27)). Table 3 includes the β parameters' estimated values.
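For completeness, a companion sketch (again ours, not the article's) evaluates the additive contrasts (23)-(27) from a dictionary of odds keyed by (i, j, k), as in the earlier sketch; by formula (19), exponentiating the returned values gives the corresponding γ parameters.

    # Sketch (not from the article): the additive decomposition (23)-(27) of the
    # log-odds Phi_ijk = log(Omega_ijk).
    from math import log

    def beta_parameters(odds):
        phi = {c: log(w) for c, w in odds.items()}                          # (18)
        beta     = sum(phi.values()) / 8                                    # (23)
        beta_A1  = sum(phi[(1, j, k)] - phi[(2, j, k)]
                       for j in (1, 2) for k in (1, 2)) / 8                 # (24)
        beta_B1  = sum(phi[(i, 1, k)] - phi[(i, 2, k)]
                       for i in (1, 2) for k in (1, 2)) / 8                 # (25)
        beta_C1  = sum(phi[(i, j, 1)] - phi[(i, j, 2)]
                       for i in (1, 2) for j in (1, 2)) / 8                 # (26)
        beta_BC11 = sum(phi[(i, 1, 1)] + phi[(i, 2, 2)]
                        - phi[(i, 1, 2)] - phi[(i, 2, 1)]
                        for i in (1, 2)) / 8                                # (27)
        return beta, beta_A1, beta_B1, beta_C1, beta_BC11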
We can also use these values to calculate the γ parameters' estimated values, since the relationship between the β and γ parameters can be expressed by (19) or by the following equivalent set of formulas:8

    γ = exp β,  γ^A_i = exp β^A_i,  γ^B_j = exp β^B_j,  etc.,    (28)

where "exp" denotes the exponential function.9

8. For expository purposes, we discussed the γ before the β parameters. Since Table 3 already provided the γ parameters' estimated values, we could have used them in turn to estimate the β parameters (see (19)). Actually, rather than calculate the estimated β from the estimated γ parameters, calculated earlier from the estimated Ω_ijk (see (12)-(16)), it is easier to calculate the estimated γ from the corresponding estimated β parameters (see (28)), which can be calculated from the estimated Φ_ijk (see (23)-(27)).

9. The exponential function is the inverse of the natural logarithm. Comparison of (19) and (28) should make this point clear. For example, for a given γ value, we can calculate β from (19) using a table of natural logarithms; and for a given β value, we can calculate γ either from (19), with the natural-logarithm table now used in, so to speak, inverted order, or equivalently from (28) using a table of the exponential function.

Earlier we discussed the estimated values of Table 3's γ parameters. Now let us examine the estimated values of β, β^A_1, β^B_1, β^C_1, β^BC_11, also given in Table 3. In line with our earlier discussion of Table 3, in examining the estimated β parameters, we note that the estimated main effect of each variable (A, B, C) is positive (i.e., the estimates of β^A_1, β^B_1, β^C_1, which could have been written as β^AD_1, β^BD_1, β^CD_1, are all positive); but the estimated interaction effect between variables B and C is negative (i.e., the estimate of β^BC_11, which could have been written as β^BCD_11, is negative).10

10. The fact that the estimate of β^BC_11 is negative corresponds to the fact that the estimate of γ^BC_11 is less than 1 (see (19) and (28)). We interpreted this fact in footnote 5. For further comment, see the final section herein.

Later we shall show how to assess the statistical significance of the contribution made by certain parameters (e.g., γ^BC_11) in model (10), and by certain parameters (e.g., β^BC_11) in model (20), and we shall also show how to measure this contribution's magnitude.

Our models express the expected odds Ω_ijk in terms of the γ parameters (see (10) and also (29) below); or they express the expected log-odds Φ_ijk in terms of the β parameters (see (20) and also (35) below). These two forms of expression are equivalent. Since the expected frequencies F_ijkl can be expressed in terms of the Ω_ijk (see (9)), our models can also be used to express the F_ijkl in terms of the γ parameters. In addition, letting P_ijk and Q_ijk denote the expected proportions F_ijk1/n_ijk and F_ijk2/n_ijk, respectively (see (3)-(4)), note that our models can also be used to express P_ijk (and Q_ijk) in terms of these γ parameters.

Before closing this section, we should note the relationship between model (20) and the usual models for (a) the analysis of variance and (b) the analysis of the "logit" pertaining to variable D. Model (20) and the usual model for the three-way analysis of variance may be compared in several ways. (Note that the Φ_ijk in (20) can be presented in a three-way array, while the F_ijkl are presented in a four-way table.) In the usual three-way analysis of variance, one must assume homoscedasticity, i.e., that each observation in the three-way table has the same variance.
On the other hand, for our kind of data, the homoscedasticity assumption would be contradicted in a way that could not be ignored.11 Our data also contradict the assumption in the usual analysis of variance that each observation has a normal distribution.12

11. In the present context, we note that the variance of the observed proportion p_ijk (see (3)) will depend both on the magnitude of n_ijk (see (1)) and on the expected proportion P_ijk = F_ijk1/n_ijk. A similar remark applies to the variance of the observed odds ω_ijk (see (2) and (5)) and to the variance of the logarithm of ω_ijk. (The logarithm of the ω_ijk is of interest here since it corresponds to Φ_ijk in the same sense that ω_ijk corresponds to Ω_ijk; see (2), (8), (18).)

12. On the other hand, when n_ijk is large, the observed proportion p_ijk will be approximately normally distributed (as long as the expected proportion P_ijk differs sufficiently from the extreme values of 0 and 1). A similar remark applies to the observed odds ω_ijk and the logarithm of ω_ijk.

Note also that formulas (20)-(22) and (23)-(27) are similar to formulas appearing in the usual analysis of variance. However, to estimate the β parameters under model (20), we use the estimated values of the expected frequencies F_ijkl under the model (see Table 2) to estimate first Ω_ijk and Φ_ijk (see (8) and (18)); and then we use these estimated values of Φ_ijk in (23)-(27) to estimate the β parameters. In contrast, in the usual analysis of variance (assuming homoscedasticity), the quantity corresponding to the estimated Φ_ijk in formulas (23)-(27) is replaced by the observation in cell (i, j, k); and formulas (24) and (27) are replaced simply by the corresponding expressions on the second line of each of these two formulas.

Now let us consider the usual model for analyzing the logit pertaining to variable D. This logit is usually defined as Φ_ijk/2 (see, e.g., Fisher and Yates 1963). Model (20) states that this logit (multiplied by 2) can be expressed as a sum of the parameters β, β^A_i, β^B_j, β^C_k, β^BC_jk (i.e., the main effects of the general mean and of variables A, B, C, and the interaction effect between variables B and C). We can rewrite this model as a regression model expressing variable D's logit as a linear function of dummy variables pertaining to the main effects of variables A, B, C and the interaction effect between variables B and C; but homoscedasticity cannot be assumed in this model. Later we shall test the statistical significance of the contribution made by certain parameters in this model, and we shall measure the contribution's magnitude by applying methods proposed in Goodman (1970, 1971a). For some related material, see also Dyke and Patterson (1952), Bishop (1969), Theil (1970), and the final section below.

A GENERAL MODEL FOR ANALYZING THE ODDS

Model (10) included the main effect on Ω_ijk of all three variables (A, B, C), but only one of the three possible two-factor interaction effects (viz., γ^BC_jk); and it did not include the three-factor interaction effect (viz., γ^ABC_ijk). This model assumed that γ^AB_ij, γ^AC_ik, and γ^ABC_ijk all equal 1. We shall now consider the model that includes all possible main and interaction effects and that makes no assumptions about which (if any) of these effects equal 1.
Instead of model (10), we now have the following "saturated" model:

    Ω_ijk = γ γ^A_i γ^B_j γ^C_k γ^AB_ij γ^AC_ik γ^BC_jk γ^ABC_ijk,    (29)

where

    γ^A_1 = 1/γ^A_2, ...,  γ^AB_11 = γ^AB_22 = 1/γ^AB_12 = 1/γ^AB_21, ...,
    γ^ABC_111 = γ^ABC_221 = γ^ABC_212 = γ^ABC_122
              = 1/γ^ABC_112 = 1/γ^ABC_121 = 1/γ^ABC_211 = 1/γ^ABC_222.    (30)

Formula (29) describes the effects of the γ parameters on Ω_ijk. It expresses Ω_ijk explicitly in terms of the model's γ parameters. These parameters can also be expressed explicitly in terms of the Ω_ijk. From (29)-(30), we obtain the following expressions for the parameters in terms of the Ω_ijk:13

    γ = [Π_{i=1,2} Π_{j=1,2} Π_{k=1,2} Ω_ijk]^{1/8},    (31)

    γ^A_1 = [Π_{j=1,2} Π_{k=1,2} (Ω_1jk / Ω_2jk)]^{1/8}, ...,    (32)

    γ^AB_11 = [Π_{k=1,2} (Ω_11k Ω_22k) / (Ω_12k Ω_21k)]^{1/8}, ...,    (33)

    γ^ABC_111 = [(Ω_111 Ω_221 Ω_212 Ω_122) / (Ω_112 Ω_121 Ω_211 Ω_222)]^{1/8}.    (34)

13. Formulas (31)-(34) for the saturated model (29)-(30) correspond to formulas (12)-(16) for the unsaturated model (10)-(11).

For the saturated model (29)-(30), we can estimate the γ parameters by formulas (31)-(34), replacing the expected odds Ω_ijk in these formulas by the corresponding observed odds ω_ijk.14 With the saturated model's γ parameters thus estimated, the observed data fit perfectly. (For further comments on this point, see footnote 19 later herein.) Based on Table 1's data, the γ parameters' estimated values are given in Table 4. Note that, for Table 1's data, Table 4's estimated γ's are quite similar to the corresponding quantities of Table 3.

14. In contrast to this procedure for the saturated model, note that for an unsaturated model (e.g., model (10)) we use the estimated values of the expected frequencies F_ijkl under the model (see Table 2) to estimate the Ω_ijk (see (8)); then we can use these estimated values of the Ω_ijk in (31)-(34) to estimate the γ parameters. (When these estimated values of the Ω_ijk are used in (31)-(34), we obtain the same results as when they are used in (12)-(16).) For an unsaturated model (e.g., model (10)), the entries in Table 2 are the maximum-likelihood estimates of the F_ijkl under the model, and they are calculated by an iterative procedure which we shall comment upon later herein, after we have presented the material in Table 5. The observed frequencies f_ijkl are the maximum-likelihood estimates of the F_ijkl under the saturated model, but not under an unsaturated model. Similarly, the observed odds ω_ijk are the maximum-likelihood estimates of the Ω_ijk under the saturated model, but not under an unsaturated model.

Table 4. Estimate of the Main Effects and Interaction Effects of the Three Variables (A, B, C) on the Odds Ω_ijk Pertaining to Variable D in the Four-Way Contingency Table (Table 1), Under the Saturated Models (29) and (35)

    Variable    γ Effects in Model (29)    β Effects in Model (35)    Standardized Value
    D                  1.28                       .25                       6.96
    AD                 1.44                       .37                      10.21
    BD                 3.43                      1.23                      34.36
    CD                 2.10                       .74                      20.65
    ABD                0.96                      -.04                      -1.11
    ACD                1.00                       .00                       0.00
    BCD                0.86                      -.15                      -4.31
    ABCD               0.97                      -.03                      -0.86

Having replaced the unsaturated model (10) with the saturated model (29), we can also replace the unsaturated model (20) with the following saturated model:

    Φ_ijk = β + β^A_i + β^B_j + β^C_k + β^AB_ij + β^AC_ik + β^BC_jk + β^ABC_ijk,    (35)

where

    β^A_1 = -β^A_2, ...,  β^AB_11 = β^AB_22 = -β^AB_12 = -β^AB_21, ...,
    β^ABC_111 = β^ABC_221 = β^ABC_212 = β^ABC_122
              = -β^ABC_112 = -β^ABC_121 = -β^ABC_211 = -β^ABC_222.    (36)

Model (35)-(36) is, of course, equivalent to model (29)-(30).
Similarly, formulas (31)-(34) are equivalent to the following set of formulas:

    β = [Σ_{i=1,2} Σ_{j=1,2} Σ_{k=1,2} Φ_ijk] / 8,    (37)

    β^A_1 = [Σ_{j=1,2} Σ_{k=1,2} (Φ_1jk - Φ_2jk)] / 8, ...,    (38)

    β^AB_11 = [Σ_{k=1,2} (Φ_11k + Φ_22k - Φ_12k - Φ_21k)] / 8, ...,    (39)

    β^ABC_111 = [Φ_111 + Φ_221 + Φ_212 + Φ_122 - Φ_112 - Φ_121 - Φ_211 - Φ_222] / 8.    (40)

For the saturated model (35)-(36) we can estimate the β parameters by formulas (37)-(40), replacing the "expected log-odds" Φ_ijk in these formulas by the corresponding log ω_ijk.15 In addition, the variance of the estimated β parameters can be estimated by the following formula:16

    S²_β̂ = [Σ_{i=1,2} Σ_{j=1,2} Σ_{k=1,2} Σ_{l=1,2} (1/f_ijkl)] / 64.    (41)

By dividing each estimated β parameter by its estimated standard deviation S_β̂, we obtain the corresponding "standardized value" of the estimate. Each standardized value can be used to test whether the corresponding β parameter is nil.17 Table 4 includes the β parameters' estimated values, and their corresponding standardized values.

15. Remarks similar to those in footnotes 8 and 14 would apply here as well.

16. Note should be taken of the fact that the estimation method presented herein for the saturated model can be improved upon by replacing f_ijkl in (41) by f_ijkl + ½, and replacing the ω_ijk that are used in (31)-(34) (or in (37)-(40)) by ω_ijk = (f_ijk1 + ½)/(f_ijk2 + ½). It should also be noted that formula (41) and some of the other results presented herein are applicable both in the case where the observed four-way table (Table 1) describes results obtained for a random sample of individuals cross-classified with respect to the four variables (A, B, C, D), and also in the case where the f_ijk1 and f_ijk2 in row (i, j, k) of Table 1 describe results obtained with respect to variable D for a random sample of n_ijk individuals at levels i, j, k on variables A, B, C, respectively. For further details, see Goodman (1970) and Haberman (1970).

17. The term "standardized value" of a statistic is used here to mean the ratio of the statistic and its estimated standard deviation. The same or similar words have also been used by other writers to denote other things with which the usage here should not be confused. If a particular β parameter is nil, then the standardized value of the corresponding estimated β will be approximately normally distributed with zero mean and unit variance (when the sample size is large). For comments on related matters, see Goodman (1970, 1971a).

By examining the magnitudes of Table 4's standardized values, we find that the model in which β^AB_ij, β^AC_ik, β^ABC_ijk are set equal to zero in (35) should merit consideration. (Recall that these three parameters could also have been written as β^ABD_ij, β^ACD_ik, β^ABCD_ijk, respectively.) But the model obtained when these particular parameters are set equal to zero in (35) is equivalent to model (20). Thus, for Table 1's data, examining the saturated models (29) and (35) leads to models (10) and (20).
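A short sketch (ours, not the article's) of the saturated-model computation just described: it evaluates the contrasts (37)-(40) from the observed log-odds and the common estimated standard deviation (41), and it reproduces, to rounding, the β and standardized-value columns of Table 4. It reuses the table1 dictionary from the first sketch.

    # Sketch (not from the article): saturated-model estimates (37)-(40) and their
    # standardized values, using the observed log-odds and the variance formula (41).
    from math import log, sqrt

    phi = {c: log(f1 / f2) for c, (f1, f2) in table1.items()}   # observed log-odds
    sign = {1: +1, 2: -1}

    def beta(term):
        """Average +/- contrast over the factors named in `term` (e.g. 'AC')."""
        total = 0.0
        for (i, j, k), value in phi.items():
            s = 1
            if "A" in term: s *= sign[i]
            if "B" in term: s *= sign[j]
            if "C" in term: s *= sign[k]
            total += s * value
        return total / 8                                         # formulas (37)-(40)

    # Common estimated standard deviation of each beta, from formula (41).
    s_beta = sqrt(sum(1 / f for pair in table1.values() for f in pair) / 64)

    for term in ["", "A", "B", "C", "AB", "AC", "BC", "ABC"]:
        b = beta(term)
        print(term or "mean", round(b, 2), "standardized:", round(b / s_beta, 2))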
HOW TO TEST WHETHER A MODEL FOR THE ODDS FITS THE DATA

To test whether the hypothesis H described by model (10) fits Table 1's data, we first estimate the expected frequencies F_ijkl under the hypothesis H (see Table 2), and then compare the observed frequency f_ijkl in Table 1 with the corresponding estimate of the F_ijkl in Table 2, by calculating either the usual chi-square goodness-of-fit statistic

    Σ_{i=1,2} Σ_{j=1,2} Σ_{k=1,2} Σ_{l=1,2} (f_ijkl - F̂_ijkl)² / F̂_ijkl,    (42)

or the corresponding chi-square based on the likelihood-ratio statistic; viz.,

    2 Σ_{i=1,2} Σ_{j=1,2} Σ_{k=1,2} Σ_{l=1,2} f_ijkl log [f_ijkl / F̂_ijkl].    (43)

The chi-square value obtained from (42) or (43) can be assessed by comparing its numerical value with the percentiles of the tabulated chi-square distribution. The degrees of freedom for testing hypothesis H will be 8 - 5 = 3 (since (a) there are eight observed odds in Table 1, and (b) there are five γ parameters estimated in model (10)). Using (42), we obtain a goodness-of-fit chi-square value of 1.46, and using (43), a likelihood-ratio chi-square value of 1.45. Since there were three degrees of freedom under H, the model fits the data well.
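The two test statistics are straightforward to compute once the expected frequencies have been estimated. The sketch below (ours) evaluates (42) and (43) for model (10), comparing the Table 1 counts with the Table 2 estimates; the function and variable names are ours.

    # Sketch (not from the article): the goodness-of-fit statistic (42) and the
    # likelihood-ratio statistic (43) for observed vs. model-estimated frequencies.
    from math import log

    def chi_square_statistics(observed, expected):
        pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))        # (42)
        lr = 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)  # (43)
        return pearson, lr

    # Table 1 frequencies and the Table 2 estimates, in the same row-by-row order.
    observed = [387, 36, 876, 250, 383, 270, 381, 1712,
                955, 162, 874, 510, 104, 176, 91, 869]
    expected = [390.64, 32.36, 879.31, 246.69, 376.79, 276.21, 380.26, 1712.74,
                951.36, 165.64, 870.69, 513.31, 110.21, 169.79, 91.74, 868.26]

    print(chi_square_statistics(observed, expected))   # roughly (1.46, 1.45), 3 df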
Model (10) is obtained from the saturated model (29) by making a specific set of its γ parameters equal to one.18 Model (20) is obtained from the saturated model (35) by making a specific set of its β parameters equal to zero. Models obtained this way from saturated models we call "unsaturated."19 Of course, all unsaturated models obtained from (29) or (35) are models for the odds Ω_ijk (or the log-odds Φ_ijk) pertaining to variable D. Thus, all these unsaturated models view the four-way table (Table 1) asymmetrically. In the four-way table for variables (A, B, C, D), we treated variable D as the dependent variable; i.e., we viewed the odds (or log-odds) pertaining to variable D as depending on the level of variables (A, B, C).

18. Indeed, the three degrees of freedom used above to test model (10) correspond to the three γ parameters (viz., γ^AB_ij, γ^AC_ik, γ^ABC_ijk) in (29) that are set equal to one under model (10).

19. The number of degrees of freedom used to test a given unsaturated model will equal the number of γ parameters in (29) that are set equal to one under the unsaturated model. Since none of the γ parameters in (29) are set equal to one under the saturated model (i.e., the number of γ parameters set equal to one is zero), there will be zero degrees of freedom under the saturated model. This corresponds to the fact that the observed data fit perfectly under the saturated model, since it includes all possible main and interaction effects (i.e., all possible γ parameters).

When each individual in a sample is classified by four dichotomous variables (e.g., A, B, C, D), we obtain a four-way table (e.g., Table 1); and, in some contexts, any one of the four variables might be viewed as the dependent variable. For the four-way table, the expected frequencies estimated under a given unsaturated model that treats variable D as the dependent variable (see, e.g., Table 2) will usually differ from the corresponding expected frequencies estimated under a model that treats one of the other variables as the dependent variable. In some contexts, the research worker will know which variable should be treated as the dependent variable; in others, any one of the four might be treated so. In still others, a different point of view would be appropriate. We could, for example, consider the case where none of the variables is the dependent variable, but where all are mutually related in some sense (see, e.g., Goodman, 1970). For the four-way table, Goodman (1970) described in his Table 4 a large class of models that would include as special cases models like our (10) and (20), which treat one variable as the dependent variable, as well as "unsaturated" models of a different kind where none of the variables is treated as the dependent variable but where some or all may be mutually related variables.

Goodman's Table 4 (1970) contains fifty-three different "unsaturated" models in which one of the four variables is treated as the dependent variable and 113 different "unsaturated" models in which none of the four variables is viewed as the dependent variable. For the case where a given variable (say variable D) is the dependent variable, Goodman's Table 4 (1970) lists nineteen different unsaturated models. Earlier herein we considered the case where a given variable is the dependent variable. Our models well suit this case (see (10), (20), (29), (35)). Many readers will find our exposition of this case easier to understand than the exposition of the more general case in Goodman (1970). Nevertheless, the more general models and methods of the earlier article also apply to the special case we considered.

For each unsaturated model of the kind considered herein, and also for other kinds of "unsaturated" models, Goodman's Table 4 (1970) gave the corresponding degrees of freedom when each variable in the contingency table is dichotomous. He also described ways to calculate the degrees of freedom when some variables are polytomous but not necessarily dichotomous. A single computer program can be used to calculate the estimates of the F_ijkl and the corresponding chi-square values (42) and (43), for any set of "unsaturated" models of the kinds considered herein and in Goodman (1970). For related material dealing with such models see, e.g., Bishop (1969), Goodman (1970, 1971a, 1972).

Let us reconsider model (10), which we obtained from the saturated model (29) by making some of its γ parameters equal to one. We can describe this unsaturated model in any of the following equivalent ways: (1) by listing the γ parameters that are included in model (10), viz., γ^D, γ^AD_i, γ^BD_j, γ^CD_k, γ^BCD_jk;20 (2) by listing the γ parameters in (29) that are set equal to one under the model, viz., γ^ABD_ij, γ^ACD_ik, γ^ABCD_ijk; (3) by listing the particular marginal tables that are fitted under the model—a topic we shall now discuss.

20. We return now to the notation used earlier, where the letter D was included in the superscript of each γ parameter to emphasize the fact that the γ parameters describe the main and interaction effects on the odds pertaining to variable D. This notation will facilitate some of our present exposition. This notation's utility will become clearer two paragraphs below.

From our Table 1 we can determine the n_ijk as defined by formula (1). In all unsaturated models obtained from the saturated model (29), the n_ijk are considered fixed; thus in these models the expected frequencies F_ijkl (under the model) will satisfy the following condition:

    F_ijk1 + F_ijk2 = n_ijk.    (44)

By comparing the n_ijk from Table 1 with the estimated value of F_ijk1 + F_ijk2 from Table 2, we see that condition (44) is satisfied. Since the n_ijk describe the three-way marginal table pertaining to variables (A, B, C), we shall use the symbol {ABC} to denote this table. Condition (44) states that the marginal table {ABC} is fitted under the model. In addition to the marginal table {ABC}, two other marginal tables are fitted under model (10); viz., the two-way marginal table {AD} and the three-way marginal table {BCD}. Table 5 gives the three marginal tables fitted under model (10).21 In the preceding paragraph, we explained why the marginal table {ABC} was fitted. Under model (10), we also fit the marginal tables {AD} and {BCD} because the model includes the parameters γ^AD_i and γ^BCD_jk, which pertain to the relationship between variables A and D (as displayed in the marginal table {AD})
and to the joint relationship among variables B, C, and D (as displayed in the marginal table {BCD}).22

21. The four-way contingency table of observed data (Table 1) can be displayed as a 2x2x2x2 table, or an 8x2 table (as in Table 1); similarly, the three-way marginal table {ABC} can be displayed as a 2x2x2 table, or a 4x2 table (as in Table 5), or as an 8x1 table (as we would obtain if we presented it as the marginal of the 8x2 table displayed in Table 1).

22. When the three-way marginal table {BCD} is fitted, then the following two-way marginal tables will fit automatically: {BC}, {BD}, {CD}. Similarly, when a two-way marginal table, say, {AD} is fitted, then the two one-way marginals, {A} and {D}, will fit automatically. Corresponding to the superscript of each γ parameter in model (10) (with the letter D added to each superscript), a marginal table pertaining to that superscript will be included in the set of marginal tables fitted under the model. Under model (10), it will suffice to include {AD} and {BCD} (in addition to table {ABC}) in the set of fitted marginal tables, since then all the marginal tables corresponding to the model's γ parameters (viz., {D}, {AD}, {BD}, {CD}, {BCD}) will actually be fitted.

Table 5. The Three Marginal Tables That Are Fitted When Models (10) and (20) Are Applied to the Four-Way Contingency Table (Table 1)

    I. Table {ABC}
    Variable A   Variable B    Variable C: North    South
    Negro        North                      423      1126
    Negro        South                      653      2093
    White        North                     1117      1384
    White        South                      280       960

    II. Table {AD}
    Variable A   Variable D: North    South
    Negro                      2027     2268
    White                      2024     1717

    III. Table {BCD}
    Variable B   Variable C    Variable D: North    South
    North        North                      1342      198
    North        South                      1750      760
    South        North                       487      446
    South        South                       472     2581

The reader will find that the entries in the three marginal tables in Table 5, which were calculated from Table 1's data, equal the corresponding entries in the three marginal tables calculated from Table 2's estimated F_ijkl. The computer program to which we referred earlier calculated the estimated values in Table 2 (viz., the maximum-likelihood estimates of the expected frequencies F_ijkl under model (10)) by an iterative procedure which ensured that the three marginal tables given in Table 5 would be fitted when Table 2's estimated F_ijkl are used. For further details about the computing procedure, see, for example, the literature cited above.

Although the three marginal tables (viz., {ABC}, {AD}, {BCD}) in Table 5 are fitted under model (10), we noted earlier that the reason for fitting {ABC} in the present context is somewhat different from the reason for fitting {AD} and {BCD}. The marginal table {ABC} is considered to be fixed under model (10); i.e., the n_ijk in (1) and (44) are viewed as constants. Aside from the n_ijk constants, to estimate the F_ijkl under model (10), we use only the information contained in the observed marginal tables {AD} and {BCD}. The above remarks pertain to model (10), but they can be extended in a straightforward way to a wide range of unsaturated models obtained from the saturated model (29) by setting certain specified γ parameters in (29) equal to one.
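The article does not spell out the iterative procedure; a standard algorithm consistent with the description is iterative proportional fitting, which repeatedly rescales a trial table so that each fitted marginal matches the observed one. The sketch below (ours) rests on that assumption; applied to the Table 1 counts keyed by (i, j, k, l), its output agrees with Table 2 up to rounding.

    # Sketch (an assumption on our part): iterative proportional fitting of the
    # expected frequencies under model (10), i.e. fitting the marginals
    # {ABC}, {AD}, and {BCD}.  `f` holds the observed counts keyed by (i, j, k, l).
    def ipf_h1(f, n_iter=50):
        cells = list(f)
        F = {c: 1.0 for c in cells}                      # starting table of 1's

        def margin(table, idx):
            m = {}
            for c, v in table.items():
                key = tuple(c[d] for d in idx)
                m[key] = m.get(key, 0.0) + v
            return m

        for _ in range(n_iter):
            for idx in [(0, 1, 2), (0, 3), (1, 2, 3)]:   # {ABC}, {AD}, {BCD}
                obs, fit = margin(f, idx), margin(F, idx)
                for c in cells:
                    key = tuple(c[d] for d in idx)
                    F[c] *= obs[key] / fit[key]
        return F

Because the margins {AD} and {BCD} are the only data used beyond the fixed {ABC} totals, this also illustrates the point made earlier that the model summarizes the four-way table through those two marginal tables.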
Now let us apply several such unsaturated models to Table 1's data. Table 6 lists the chi-square values (42) and (43) obtained in testing these models.

Table 6. Chi-Square Values for Some Models Pertaining to Table 1

    Model   Fitted Marginals           Degrees of   Likelihood-Ratio   Goodness-of-Fit   γ Parameters Included in the Model
                                       Freedom      Chi-Square         Chi-Square
    H1      {ABC},{AD},{BCD}               3             1.45              1.46          [D],[AD],[BD],[CD],[BCD]
    H2      {ABC},{AD},{BD},{CD}           4            24.96             25.73          [D],[AD],[BD],[CD]
    H3      {ABC},{BCD}                    4           152.65            147.59          [D],[BD],[CD],[BCD]
    H4      {ABC},{BD},{CD}                5           186.36            180.26          [D],[BD],[CD]
    H5      {ABC},{AD},{CD}                5          2286.83           2187.71          [D],[AD],[CD]
    H6      {ABC},{AD},{BD}                5           695.01            727.16          [D],[AD],[BD]
    H7      {ABC},{D}                      7          3111.47           2812.64          [D]
    H8      {ABC},{ACD},{BCD}              2             1.32              1.34          [D],[AD],[BD],[CD],[ACD],[BCD]
    H9      {ABC},{ABD},{BCD}              2             0.68              0.69          [D],[AD],[BD],[CD],[ABD],[BCD]
    H10     {ABC},{ABD},{ACD}              2            17.29             18.73          [D],[AD],[BD],[CD],[ABD],[ACD]
    H11     {ABD},{ACD},{BCD}              2            24.79             25.11          *
    H12     None                          15          5469.88           5989.11          *

    *Models H11 and H12 cannot be expressed in terms of the γ parameters. See related discussion in the present article.

We include both chi-square values (42) and (43) in Table 6; but, in the present context, (43) has some advantages (see, e.g., Goodman 1968, 1970). In the remaining discussion, we shall use only the chi-square value based on (43). Each model in Table 6 is described there by listing the marginal tables fitted under the model. For the sake of brevity, we actually list in Table 6 the "minimal set" of marginal tables fitted under the model, rather than the entire set of marginal tables that will in fact be fitted (see footnote 22 herein and Goodman 1970). For example, for model (10), which is presented as H1 in Table 6, we list in Table 6 the following "minimal set" of marginal tables fitted under the model: {ABC}, {AD}, {BCD}. From this "minimal set" of marginal tables, we find that the following marginal tables will in fact be fitted: {D}, {AD}, {BD}, {CD}, {BCD}, as well as {ABC} and all the marginal tables formed from {ABC}. Variable D is included in five of the marginal tables listed above, and the model under consideration (i.e., model H1 of Table 6) will include the following five γ parameters corresponding to these five marginal tables: γ^D, γ^AD_i, γ^BD_j, γ^CD_k, γ^BCD_jk.

Consider now hypothesis H2 in Table 6. Since this model fits the marginals {ABC}, {AD}, {BD}, {CD}, it will include the following γ parameters: γ^D, γ^AD_i, γ^BD_j, γ^CD_k.23 Similarly, hypothesis H3 in Table 6 fits the marginals {ABC}, {BCD}, and thus that model includes the following γ parameters: γ^D, γ^BD_j, γ^CD_k, γ^BCD_jk. Hypothesis H4 in Table 6 fits the marginals {ABC}, {BD}, {CD}, and thus that model includes the following γ parameters: γ^D, γ^BD_j, γ^CD_k. Let us discuss these and other models in Table 6 further.

23. For the reader who has difficulty determining which γ parameters are included in the model from the description of the model in terms of the marginal tables that are fitted, we include this information in Table 6's final column. In that column, we use the symbols [D], [AD], [BD], ..., to denote γ^D, γ^AD_i, γ^BD_j, ..., respectively.

As we have already noted, model (10) is listed as H1 of Table 6. If we now make γ^BCD_jk equal to 1 in model (10), we get H2 of Table 6. In model H2, the odds pertaining to variable D are expressed in terms of the parameters γ^D, γ^AD_i, γ^BD_j, γ^CD_k (i.e., the main effects of the general mean and of variables A, B, and C). To test whether the parameter γ^BCD_jk in model (10) contributes in a statistically significant way, we can use the difference between the corresponding chi-square values for H2 and H1 as a chi-square statistic with one degree of freedom. (We get the one degree of freedom by subtracting the corresponding degrees of freedom for H2 and H1; i.e., 4 - 3 = 1.) From Table 6's chi-square values for H2 and H1, we see that γ^BCD_jk does contribute to model (10) in a statistically significant way.
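In code, the test just described amounts to a single subtraction; the numerical values below are taken from Table 6, and 3.84 is the usual 5-percent critical value of chi-square with one degree of freedom.

    # Sketch (not from the article): the nested-model test comparing H2 with H1.
    x2_h1, df_h1 = 1.45, 3
    x2_h2, df_h2 = 24.96, 4

    difference = x2_h2 - x2_h1   # 23.51, attributable to the gamma^BCD parameter
    df = df_h2 - df_h1           # 1 degree of freedom
    print(difference, df, difference > 3.84)   # True: the contribution is significant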
If we set γ^AD_i equal to 1 in model (10), we get H3 of Table 6. To test whether the parameter γ^AD_i in model (10) contributes in a statistically significant way, we can use the difference between the corresponding chi-square values for H3 and H1 as a chi-square statistic with one degree of freedom. From Table 6's chi-square values for H3 and H1, we see that γ^AD_i does contribute to model (10) in a statistically significant way.

If we set γ^AD_i and γ^BCD_jk equal to 1 in model (10), we get H4 of Table 6. If we set γ^BD_j and γ^BCD_jk equal to 1 in model (10), we get H5. If we set γ^CD_k and γ^BCD_jk equal to 1 in model (10), we get H6. Comparing the magnitudes of Table 6's corresponding three chi-square values, we see that the worst-fitting model was H5, the next worst H6, and the least poorly fitting H4. In other words, by comparing the three models obtained from H2 by deleting the main effect of one of the variables (A, B, C), we see that γ^BD_j contributes the most.

If we set γ^AD_i, γ^BD_j, γ^CD_k, and γ^BCD_jk equal to 1 in model (10), we get H7 of Table 6. In model H7, the odds pertaining to variable D depend on γ^D (the main effect of the general mean), but are unaffected by the level of variables A, B, and C. In other words, model H7 states that variable D is independent of the joint variable (A, B, C). From the chi-square value for H7 in Table 6, we see that the data contradict this model.

If we set γ^ABD_ij and γ^ABCD_ijk equal to 1 in model (29), we get H8 of Table 6. If we set γ^ACD_ik and γ^ABCD_ijk equal to 1 in model (29), we get H9. If we set γ^BCD_jk and γ^ABCD_ijk equal to 1 in model (29), we get H10. Table 6 shows that H8 and H9 fit the data well, but H10 does not. From the above description of H8 and H9, we can express model H1 as follows: Model H1 states both that H8 is true and that γ^ACD_ik in H8 equals 1. Model H1 also states both that H9 is true and that γ^ABD_ij in H9 equals 1. Thus, if H1 is true, then H8 and H9 will also be true; but H8 and H9 can be true in cases where H1 is not. (H1 implies models H8 and H9.)

Models H11 and H12 of Table 6 differ from H1 to H10 in an important respect. These last two models do not include the marginal {ABC} among the marginals that are fitted under the model. Therefore, the expected frequencies F_ijkl estimated under models H11 and H12 will not satisfy condition (44) (except in some special cases). These two models cannot be expressed as unsaturated models obtained from the saturated model (29), except in cases where condition (44) is satisfied.

Model H12 of Table 6 is easier to describe than H11, so we shall describe it first. Model H12 states that the sixteen cells of Table 1 are equiprobable. From the chi-square value for H12 in Table 6, we see that the data contradict this model. Now let us consider H11 of Table 6. As we noted above, this model cannot be expressed as one in which variable D is the dependent variable, since table {ABC} is not fitted under it. However, since the other three-way marginal tables (viz., {ABD}, {ACD}, {BCD}) are fitted under model H11, we see that H11 is a model in which any one of the other variables (C, B, or A) can be viewed as the dependent variable. (Note that three of the four possible three-way marginal tables are fitted under H11, and also under H8, H9, and H10.)
From the chi-square value for H11 in Table 6, we see that the data contradict this model.

We noted earlier that Goodman's Table 4 (1970) included models in which one of the variables is treated as the dependent variable, and it included other kinds of models as well. To test whether any of these other kinds of models might fit the data in our four-way table, we would first consider H11 of Table 6; for this model assumes only that the F_ijkl are not affected by the three-factor "interaction effect" among the three variables A, B, C, nor by the four-factor "interaction effect" among variables A, B, C, D.24 If this particular model does not fit the data, then the data will also contradict any of the other kinds of "unsaturated" models that do not treat variable D as the dependent variable.25 For examples of data that do not contradict these other kinds of "unsaturated" models, see Goodman (1970, 1971a).

24. Under model H11, the only tables not fitted to the data are the three-way marginal table {ABC} and the four-way table {ABCD}; so the F_ijkl (under model H11) are not affected by the three-factor "interaction effect" among variables A, B, C (as displayed in the marginal table {ABC}) nor by the four-factor "interaction effect" among variables A, B, C, D (as displayed in the four-way table). We use the term "interaction effect" in the preceding sentence, and in the sentence to which this footnote applies, in a way related to but different from the way we used it earlier. Earlier the term referred to the interaction effects of certain variables on the expected odds Ω_ijk pertaining to variable D; whereas above the term refers to the "interaction effects" among certain variables in the four-way table. For further details, see Goodman (1970, 1971a).

25. Except for model H11, any other unsaturated model that does not treat variable D as the dependent variable can be viewed as a model that states both that H11 is true and that some additional "interaction effects" (in addition to the particular three- and four-factor "interaction effects" noted in sentence one of footnote 24) can be set equal to one. Thus, if any other unsaturated model (of the above kind) is true, then H11 will also be true. If H11 is not true, then none of the other unsaturated models (of the above kind) can be true. For related matters, see Goodman (1970, 1971a).

Before closing this section, we should note that some of the material discussed above could be presented in summary form in tables that are somewhat analogous to the usual analysis of variance tables. Table 7 is an example.

Table 7. Analysis of the Variation in the Odds Pertaining to Variable D in the Four-Way Contingency Table (Table 1)

    Source of Variation                                          Degrees of Freedom   Chi-Square          Numerical Value
    1. Total variation due to the "main effects" of variables
       A, B, C and "interaction effects" among these variables          7             X²(H7)                  3111.47
    1a. Due to variation unexplained by model H1                        3             X²(H1)                     1.45
    1b. Due to variation explained by model H1                          4             X²(H7) - X²(H1)         3110.02
    Partition of (1a):
    1a.1. Due to variation unexplained by model H9                      2             X²(H9)                     0.68
    1a.2. Due to variation explained by the γ^ABD_ij
          parameter in model H9                                         1             X²(H1) - X²(H9)            0.77

MULTIPLE AND PARTIAL CORRELATION COEFFICIENTS FOR MODELS FOR THE ODDS

In the usual multiple regression analysis for quantitative variables (predicting
MULTIPLE AND PARTIAL CORRELATION COEFFICIENTS FOR MODELS FOR THE ODDS

In the usual multiple regression analysis for quantitative variables (predicting variable Y from, say, variables X1 and X2), the quantity R²_{Y·X1X2}, which is the square of the multiple correlation coefficient, can be interpreted as follows: it is the relative decrease in Y's "unexplained variation" obtained when comparing the case where X1 and X2 are not used to predict Y with the case where both are used. Similarly, the quantity r²_{YX1·X2}, which is the square of the partial correlation coefficient, can be interpreted as follows: it is the relative decrease in Y's unexplained variation obtained when comparing the case where X2 but not X1 is used to predict Y with the case where both are used. The quantity R²_{Y·X1X2} is sometimes referred to as the coefficient of multiple determination, and the quantity r²_{YX1·X2} can be called the coefficient of partial determination.

Goodman (1970, 1971a) introduced coefficients that are somewhat analogous to the usual coefficients of multiple and partial determination for analyzing the odds pertaining to a given variable in the four-way contingency table. We shall now illustrate their calculation. For a given model in Table 6 (say, model H_i, for i = 1, 2, ..., 12), we shall use the symbol χ²(H_i) to denote its chi-square value.

In the preceding section, we noted, among other things, that the statistic χ²(H2) − χ²(H1) could be used to test whether the parameter γ^{BC}_{jk} in H1 contributed in a statistically significant way.26 To measure the contribution's magnitude, we recommend the following coefficient, which we shall call the coefficient of partial determination between the odds ω_{ijk} and the parameter γ^{BC}_{jk}, when the other γ's in model H1 are taken into account:27

r²_{ωγ_BC} = [χ²(H2) − χ²(H1)]/χ²(H2).  (45)

From Table 6, we see that this coefficient equals .94 for Table 1's data.

26 To facilitate exposition in the preceding section, we included the letter D in the superscript of each γ parameter; e.g., γ^{BC}_{jk} in (10) became γ^{BCD}_{jk}. In the present section, we have no need for this more cumbersome notation and will not include the letter D in the superscript. The reader should, of course, keep in mind that, say, γ^{BC}_{jk} here has the same meaning as γ^{BCD}_{jk} earlier, and that the various γ parameters describe the main and interaction effects on the odds pertaining to variable D.

27 In the subscript of r² in (45), we changed the γ^{BC} notation to the γ_BC notation because of typographical considerations. This simple notational change should not confuse the reader. Similar notational changes will be made in other formulas in this section. Since model H1 includes the γ parameters γ, γ^A_i, γ^B_j, γ^C_k, γ^{BC}_{jk}, we could let r²_{ωγ_BC · γ, γ_A, γ_B, γ_C} denote the coefficient defined by (45). To test whether this coefficient differs significantly from zero, we use the statistic χ²(H2) − χ²(H1), as noted earlier.

We also noted in the preceding section that the statistic χ²(H3) − χ²(H1) could be used to test whether the parameter γ^A_i in H1 contributed in a statistically significant way. As in the preceding paragraph, we shall measure this contribution's magnitude by the following coefficient of partial determination:28

r²_{ωγ_A} = [χ²(H3) − χ²(H1)]/χ²(H3).  (46)

From Table 6, we see that this coefficient equals .99 for Table 1's data.

28 Remarks like those in the second paragraph of footnote 27 can be applied to the coefficients defined by (46)-(49). For example, for (46), we could let r²_{ωγ_A · γ, γ_B, γ_C, γ_BC} denote this coefficient, and we could assess the statistical significance of this coefficient using the statistic χ²(H3) − χ²(H1) noted earlier.
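The arithmetic behind (45) and (46) is elementary, and a small helper makes it explicit. The sketch below is added here and is not the article's; since the only chi-square values recoverable from Table 7 are χ²(H1) = 1.45 and χ²(H9) = 0.68, the printed example is the coefficient for the parameter that distinguishes H9 from H1, offered only as an illustration of the formula rather than as one of the values quoted in the text.

def coefficient_of_partial_determination(chi2_restricted, chi2_full):
    """[chi2(restricted) - chi2(full)] / chi2(restricted), the form shared by
    (45) and (46); "restricted" is the model with the parameter set equal to 1."""
    return (chi2_restricted - chi2_full) / chi2_restricted

# With the deviances of fitted models (as in the earlier sketch) or the values
# in Table 6, the coefficients quoted in the text (.94 for (45), .99 for (46))
# follow in the same way.
print(round(coefficient_of_partial_determination(1.45, 0.68), 2))  # 0.53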
To measure how well model H1 fits the data, we consider the following coefficient, which we call the coefficient of multiple determination between ω and the γ parameters in model H1:

R² = [χ²(H7) − χ²(H1)]/χ²(H7).  (47)

From Table 6, we see that this coefficient equals 1.00 (to two decimal places) for Table 1's data.

We might also consider the following coefficient, which we shall call the coefficient of multiple-partial determination between ω and the parameters γ^A_i and γ^{BC}_{jk} in model H1, when H1's other γ parameters (viz., γ, γ^B_j, γ^C_k) are taken into account:

R²_{ω(γ_A, γ_BC) · γ, γ_B, γ_C} = [χ²(H4) − χ²(H1)]/χ²(H4).  (48)

From Table 6, we see that the coefficient equals .99 (to two decimal places) for Table 1's data. Similarly, we can measure the contribution of γ^B_j and γ^{BC}_{jk} (using H5 rather than H4 in (48)) or the contribution of γ^C_k and γ^{BC}_{jk} (using H6 rather than H4 in (48)).

We can also use the above coefficients to measure the magnitude of the contribution made by the parameters in other models in Table 6. For example, to measure the magnitude of γ^{AC}_{ik} in model H8, we use the following coefficient of partial determination:

r²_{ωγ_AC · H8} = [χ²(H1) − χ²(H8)]/χ²(H1).  (49)

From Table 6, we see that this coefficient equals .09 for Table 1's data.

All of the r² and R² coefficients given by (45)-(49) above take the general form

R² = [χ²(H″) − χ²(H′)]/χ²(H″),  (50)

where the γ parameters in model H″ are also included among the γ parameters in model H′. We could also write each coefficient as follows:

R² = { Σ Σ Σ Σ F′_{ijkl} log [F′_{ijkl}/F″_{ijkl}] } / { Σ Σ Σ Σ f_{ijkl} log [f_{ijkl}/F″_{ijkl}] },  (51)

where the sums are taken over i, j, k, l = 1, 2, and where F′_{ijkl} and F″_{ijkl} denote the expected frequencies estimated under model H′ and model H″, respectively. The expression of the coefficient in the form (51) is somewhat analogous to the expression of the coefficients of multiple and partial determination, in the usual multiple regression analysis, as a ratio of the "explained variation" (when model H′ is used to "explain" the variation that was not explained by H″) to the "unexplained variation" (when model H″ is used).29

29 When H″ is taken as H7 of Table 6, the denominator in (51) (i.e., the "unexplained variation" when model H7 is used) corresponds to the "total variation" in the denominator of the usual coefficient of multiple determination in multiple regression analysis. For related matters, see Goodman (1970).
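Expression (51) can also be computed directly from the frequency tables themselves. In the sketch below (added here, not the article's), F_prime, F_dprime, and f_obs stand for the expected frequencies estimated under H′ and H″ and the observed frequencies, each arranged as a 2x2x2x2 array; by (50), the result should agree with [χ²(H″) − χ²(H′)]/χ²(H″) when the chi-square values are likelihood-ratio statistics.

import numpy as np

def r_squared_51(F_prime, F_dprime, f_obs):
    """The ratio in (51): the sum of F' log(F'/F'') over the sixteen cells,
    divided by the sum of f log(f/F'')."""
    F1 = np.asarray(F_prime, dtype=float)
    F2 = np.asarray(F_dprime, dtype=float)
    f = np.asarray(f_obs, dtype=float)
    return float(np.sum(F1 * np.log(F1 / F2)) / np.sum(f * np.log(f / F2)))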
COMMENTS ON SOME RELATED WORK

As we noted earlier, Coleman's model and methods differ from ours in several ways. His model does not fit the data as well, and his explanation of the data is less parsimonious. Indeed, he observes in his book (1964) that his model differs from the actual data in certain systematic ways, attributing these deviations to a supposed interaction for Negroes (but not whites) between their region of origin and present camp location.30 In contrast, we find that (a) model (10) in the present article fits the data very well; (b) it does not require an ostensible interaction for Negroes (but not whites) of the kind considered by Coleman; (c) it includes an interaction effect γ^{BC}_{jk} between region of origin and present camp location, which applies equally to both Negroes and whites; (d) this interaction effect is statistically significant; and (e) it both reduces the expected odds favoring a preference for a Northern camp for those whose region of origin is the same as their present camp location, and increases these expected odds for those whose region of origin differs from their present camp location, after the various main effects in the model have been taken into account.31

30 Coleman (1964) did not provide methods for including interaction effects in his models, and so could not measure the magnitude of the ostensible interaction to which he referred, nor could he judge whether introducing the ostensible interaction would improve the fit of his model.

31 For further details, see footnote 5. In addition to the effect of γ^{BC}_{jk} described there, the effect can also be described as follows: for the estimated expected odds Ω^D_{ijk} in favor of a Northern camp, the effect on the estimated Ω^D_{ijk} of being at present in a Northern rather than a Southern camp is less for those from the North than for those from the South. Similarly, the effect on the estimated Ω^D_{ijk} of being a Northerner rather than a Southerner is less for those presently in a Northern rather than a Southern camp.

Coleman's article did not show how to test whether his model fit the actual data, nor was he able to measure how well it fit. Furthermore, he did not show how to test the statistical significance of the contribution made by the various parameters in the model, nor could he measure their contribution's magnitude. In addition, the variance of Coleman's estimates of the main effects in his model was larger than it would have been had he used more efficient estimation methods (e.g., maximum-likelihood estimation methods); and his estimates are biased to the extent that his model excluded relevant interaction effects.32

32 The remarks above apply to Coleman's (1964) and Boudon's (1968) articles, except that Boudon's model did allow for interaction effects.

Coleman's model states that the effects on the expected proportions P_{ijk} are linear.33 Applying Coleman's estimation methods to his model, it is possible to obtain clearly incorrect estimates of the P_{ijk} under the model; e.g., estimates of the expected proportions P_{ijk} that are negative or larger than one.34 Furthermore, his model and methods do not take into account the fact that the variance of the observed proportion p_{ijk} will depend on the magnitude of P_{ijk}.35

33 Recall that P_{ijk} = F_{ijk1}/n_{ijk}, using our notation. In contrast to Coleman's, our model states that the expected odds Ω_{ijk} can be expressed in terms of multiplicative effects. In many substantive contexts, it will be more useful to consider multiplicative rather than additive effects. Furthermore, from the point of view of statistical theory, there are a number of reasons for preferring multiplicative models of our kind for analyzing data of the kind presented in Table 1. We shall not pursue these matters further here.

34 Since P_{ijk} denotes an expected proportion, it should not be negative nor larger than one. Therefore, it is undesirable to use models and methods that can lead to estimates of the P_{ijk} that are negative or larger than one.

35 In other words, Coleman implicitly assumes homoscedasticity when, on the contrary, his data violate this assumption.
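Footnote 35 can be made concrete with a two-line computation. Under binomial sampling within each row of Table 1, the variance of the observed proportion preferring a Northern camp is estimated by p(1 − p)/n, so it varies with both the proportion and the row total; the two rows below (counts taken from Table 1) give estimated variances that differ by roughly a factor of five. The computation is an illustration added here, not part of the article.

rows = {
    "Negro, Southern origin, Northern camp": (383, 270),
    "Negro, Southern origin, Southern camp": (381, 1712),
}
for label, (north, south) in rows.items():
    n = north + south
    p = north / n
    # Estimated binomial variance of the observed proportion preferring North.
    print(f"{label}: p = {p:.3f}, n = {n}, est. var(p) = {p * (1 - p) / n:.2e}")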
Some of the limitations of Coleman's approach apply to the usual multiple regression model (and analysis of variance model) if used in the present context. For data of the kind considered in the present article, the assumption of homoscedasticity made in the usual multiple regression model (and in the usual analysis of variance model) would be contradicted in a way that could not be ignored. In addition, as with Coleman's analysis, if one applied the usual multiple regression methods to the model in which the effects on the expected proportions P_{ijk} are linear, one could obtain clearly incorrect estimates of the P_{ijk} under the model, in the sense described above.36

36 The comments in footnotes 33 and 34 are relevant here.

We noted earlier that our data were also analyzed by Zeisel (1968) and Theil (1970). Zeisel described various interesting features of these data. These features can be explained, in a more comprehensive and compact way, in terms of the estimated parameters in model (10) of the present article. For example, from the estimates for model (10) presented in Table 3 herein, we find that the estimated product of the parameters γ and γ^{BC}_{11} is approximately one (more precisely, this product is 1.13), and this single fact can be used to explain the following features of the data: (a) the preference for a Northern camp location among Negro Northerners in Northern camps is approximately equal to the preference for a Southern camp location among white Southerners in Southern camps; and (b) the preference for a Northern camp location among white Northerners in Northern camps is approximately equal to the preference for a Southern camp location among Negro Southerners in Southern camps. (In order to see that this single fact explains features (a) and (b), insert the estimated values of the parameters in model (10); see also the sketch below.) The other interesting features noted by Zeisel can also be explained in similar terms, with one exception. This exception pertains to Zeisel's mention of a supposed effect on camp preference due to the interaction between race and region of origin, among those in Northern camps. Applying the methods of the present article, we find that this ostensible interaction effect is not statistically significant, and there is no need to include it in our model (10).
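The algebra behind this explanation is short. It assumes model (10) has the multiplicative form Ω^D_{ijk} = γ γ^A_i γ^B_j γ^C_k γ^{BC}_{jk}, with the usual constraints that each set of γ parameters multiplies to one over each of its subscripts (so that, e.g., γ^A_2 = 1/γ^A_1 and γ^{BC}_{22} = γ^{BC}_{11}), and with the coding i = 1 for Negro, j = 1 for Northern origin, and k = 1 for Northern camp; these conventions are assumptions of this sketch rather than statements quoted from the article. Then

Ω^D_{111} · Ω^D_{222} = γ² (γ^A_1 γ^A_2)(γ^B_1 γ^B_2)(γ^C_1 γ^C_2)(γ^{BC}_{11} γ^{BC}_{22}) = (γ γ^{BC}_{11})²,

and, in the same way, Ω^D_{211} · Ω^D_{122} = (γ γ^{BC}_{11})². Hence, if the estimated product of γ and γ^{BC}_{11} is approximately one, then the estimated Ω^D_{111} is approximately the reciprocal of the estimated Ω^D_{222}: the odds that a Negro Northerner in a Northern camp prefers a Northern camp are approximately equal to the odds that a white Southerner in a Southern camp prefers a Southern camp, which is feature (a). Feature (b) follows from the second identity in the same way.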
We comment next on the article by Theil (1970). He used the logit model corresponding to (20), but his estimation method and his analysis differed from ours. Theil (1970) used a weighted least-squares procedure, as did Grizzle, Starmer, and Koch (1969) in the same context; whereas all the estimates presented in the present article are maximum-likelihood estimates. In commenting on the weighted least-squares procedure, the Grizzle-Starmer-Koch article notes that estimates obtained by their procedure have a somewhat larger variance than maximum-likelihood estimates (see also Rao 1965); similarly, Theil's estimates have a somewhat larger variance than our maximum-likelihood estimates. We also find that it is harder to use the methods proposed by Theil, and by Grizzle, Starmer, and Koch, than the methods proposed in the present article, when studying the kinds of hypotheses we have discussed for the four-way contingency table (or in Goodman 1970, for the five-way table).37

37 For further details, see Goodman 1971a.

Before closing, we remind the reader that our methods were for the case where a given variable (e.g., variable D) can be viewed as the dependent variable which is affected by the other variables under consideration. Where this is not the case, we refer the reader to the more general techniques presented in, for example, Goodman (1970, 1972).

REFERENCES

Bishop, Y. M. M. 1969 "Full contingency tables, logits, and split contingency tables." Biometrics 25:383-400.
Boudon, R. 1968 "A new look at correlation analysis." In H. M. Blalock, Jr. and A. Blalock (eds.), Methodology in Social Research. New York: McGraw-Hill.
Coleman, J. S. 1964 Introduction to Mathematical Sociology. New York: Free Press.
Dyke, G. V. and H. D. Patterson 1952 "Analysis of factorial arrangements when the data are proportions." Biometrics 8:1-12.
Fisher, R. A. and F. Yates 1963 Statistical Tables for Biological, Agricultural and Medical Research. Sixth Edition. New York: Hafner Publishing Co., Inc.
Goodman, L. A. 1968 "The analysis of cross-classified data: Independence, quasi-independence and interactions in contingency tables with or without missing entries." Journal of the American Statistical Association 63:1091-1131.
Goodman, L. A. 1970 "The multivariate analysis of qualitative data: Interactions among multiple classifications." Journal of the American Statistical Association 65:226-256.
Goodman, L. A. 1971a "The analysis of multidimensional contingency tables: Stepwise procedures and direct estimation methods for building models for multiple classifications." Technometrics 13:33-61.
Goodman, L. A. 1971b "Partitioning of chi-square, analysis of marginal contingency tables, and estimation of expected frequencies in multidimensional contingency tables." Journal of the American Statistical Association 66:339-344.
Goodman, L. A. 1972 "A general model for the analysis of surveys." American Journal of Sociology 77 (in press).
Grizzle, J. E., C. F. Starmer, and G. G. Koch 1969 "Analysis of categorical data by linear models." Biometrics 25:489-504.
Haberman, S. J. 1970 "The general log-linear model." Ph.D. thesis, University of Chicago.
Lazarsfeld, P. F. 1971 "Regression analysis with dichotomous attributes." Unpublished manuscript.
Rao, C. R. 1965 "Criteria of estimation in large samples." Pp. 345-362 in Contributions to Statistics. New York: Pergamon Press.
Stouffer, S. A., E. A. Suchman, L. C. DeVinney, S. A. Star, and R. M. Williams, Jr. 1949 The American Soldier: Adjustment During Army Life. Studies in Social Psychology in World War II, Vol. 1. Princeton, N.J.: Princeton University Press.
Theil, H. 1970 "On the estimation of relationships involving qualitative variables." American Journal of Sociology 76:103-154.
Zeisel, H. 1968 Say It with Figures. Fifth Edition, Revised. New York: Harper & Row.