Amos Tversky and Daniel Kahneman

"Suppose you have run an experiment on 20 subjects, and have obtained a significant result which confirms your theory (z = 2.23, p < .05, two-tailed). You now have cause to run an additional group of 10 subjects. What do you think the probability is that the results will be significant, by a one-tailed test, separately for this group?"

If you feel that the probability is somewhere around .85, you may be pleased to know that you belong to a majority group. Indeed, that was the median answer of two small groups who responded to a questionnaire distributed at meetings of the Mathematical Psychology Group and of the American Psychological Association.

On the other hand, if you feel that the probability is around .48, you belong to a minority. Only 9 of our 84 respondents gave answers between .40 and .60. However, .48 happens to be a much more reasonable estimate than .85.¹

Apparently, most psychologists have an exaggerated belief in the likelihood of successfully replicating an obtained finding. People seem to regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. Consequently, they expect any two samples drawn from a particular population to be more similar to one another and to the population than sampling theory predicts, at least for small samples.

¹ The required estimate can be interpreted in several ways. One possible approach is to follow common research practice, where a value obtained in one study is taken to define a plausible alternative to the null hypothesis. The probability requested in the question can then be interpreted as the power of the second test (i.e., the probability of obtaining a significant result in the second sample) against the alternative hypothesis defined by the result of the first sample. In this approach, one would compute the power of the test against the hypothesis that the population mean equals the mean of the first sample. Since the size of the second sample is half that of the first, the computed probability of obtaining a significant result (z ≥ 1.645) is only .473. A theoretically justifiable alternative is to interpret the requested probability within a Bayesian framework and compute it relative to some appropriately selected prior distribution. Assuming a uniform prior, the desired posterior probability is .478. Clearly, if the prior distribution favors the null hypothesis, as is often the case, the posterior probability will be even smaller.

This chapter originally appeared in Psychological Bulletin, 1971, 76, 105-110. Copyright © 1971 by the American Psychological Association. Reprinted by permission.

The tendency to regard a sample as a representation is manifest in a wide variety of situations. When subjects are instructed to generate a random sequence of hypothetical tosses of a fair coin, for example, they produce sequences where the proportion of heads in any short segment stays far closer to .50 than the laws of chance would predict (Tune, 1964). Thus, each segment of the response sequence is highly representative of the "fairness" of the coin. Similar effects are observed when subjects successively predict events in a randomly generated series, as in probability learning experiments (Estes, 1964) or in other sequential games of chance.
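The first calculation described in Footnote 1 can be reproduced in a few lines. The sketch below, written in Python as an illustration, treats the first-sample result z = 2.23 (N = 20) as defining the alternative hypothesis, so the implied standardized effect per observation is 2.23/√20, and computes the power of a one-tailed .05 test on the second sample of 10:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

z1, n1, n2 = 2.23, 20, 10      # first-sample result and the two sample sizes
d = z1 / sqrt(n1)              # standardized effect implied by the first sample
expected_z2 = d * sqrt(n2)     # expected z in the smaller second sample
z_crit = norm.inv_cdf(0.95)    # one-tailed .05 criterion, z >= 1.645

power = 1 - norm.cdf(z_crit - expected_z2)
print(round(power, 3))         # 0.473, matching the footnote
```

The Bayesian figure of .478 quoted in the footnote requires integrating over a prior and is not reproduced here.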
Subjects act as if every segment of the random sequence must reflect the true proportion: if the sequence has strayed from the population proportion, a corrective bias in the other direction is expected. This has been called the gambler's fallacy. The heart of the gambler's fallacy is a misconception of the fairness of the laws of chance. The gambler feels that the fairness of the coin entitles him to expect that any deviation in one direction will soon be cancelled by a corresponding deviation in the other. Even the fairest of coins, however, given the limitations of its memory and moral sense, cannot be as fair as the gambler expects it to be. This fallacy is not unique to gamblers. Consider the following example:

The mean IQ of the population of eighth graders in a city is known to be 100. You have selected a random sample of 50 children for a study of educational achievements. The first child tested has an IQ of 150. What do you expect the mean IQ to be for the whole sample?

The correct answer is 101. A surprisingly large number of people believe that the expected IQ for the sample is still 100. This expectation can be justified only by the belief that a random process is self-correcting. Idioms such as "errors cancel each other out" reflect the image of an active self-correcting process. Some familiar processes in nature obey such laws: a deviation from a stable equilibrium produces a force that restores the equilibrium. The laws of chance, in contrast, do not work that way: deviations are not canceled as sampling proceeds, they are merely diluted.

The law of large numbers guarantees that very large samples will indeed be highly representative of the population from which they are drawn. If, in addition, a self-corrective tendency were at work, then small samples should also be highly representative and similar to one another. People's intuitions about random sampling appear to satisfy a law of small numbers, which asserts that the law of large numbers applies to small numbers as well.

Consider a hypothetical scientist who lives by the law of small numbers. How would his belief affect his scientific work? Assume our scientist studies phenomena whose magnitude is small relative to uncontrolled variability, that is, the signal-to-noise ratio in the messages he receives from nature is low.
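The arithmetic behind the answer of 101 is worth making explicit: the first child's score of 150 is fixed, while the remaining 49 children are still expected to average 100, because the initial deviation is diluted rather than canceled. A minimal sketch:

```python
# Expected mean of the whole sample of 50, given that the first child
# scored 150 and the remaining 49 are a fresh random draw from a
# population whose mean is 100 (deviations dilute, they do not cancel).
population_mean = 100
first_score = 150
n = 50

expected_total = first_score + (n - 1) * population_mean
expected_mean = expected_total / n
print(expected_mean)  # 101.0
```

Believing the answer is 100 amounts to expecting the remaining 49 children to average slightly below 100 to "compensate," which is exactly the gambler's fallacy.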
Our scientist could be a meteorologist, a pharmacologist, or perhaps a psychologist.

If he believes in the law of small numbers, the scientist will have exaggerated confidence in the validity of conclusions based on small samples. To illustrate, suppose he is engaged in studying which of two toys infants will prefer to play with. Of the first five infants studied, four have shown a preference for the same toy. Many a psychologist will feel some confidence at this point that the null hypothesis of no preference is false. Fortunately, such a conviction is not a sufficient condition for journal publication, although it may do for a book. By a quick computation, our psychologist will discover that the probability of a result as extreme as the one obtained is as high as 3/8 under the null hypothesis.

To be sure, the application of statistical hypothesis testing to scientific inference is beset with serious difficulties. Nevertheless, the computation of significance levels (or likelihood ratios, as a Bayesian might prefer) forces the scientist to evaluate the obtained effect in terms of a valid estimate of sampling variance rather than in terms of his subjective biased estimate. Statistical tests, therefore, protect the scientific community against overly hasty rejections of the null hypothesis (i.e., Type I error) by policing its many members who would rather live by the law of small numbers. On the other hand, there are no comparable safeguards against the risk of failing to confirm a valid research hypothesis (i.e., Type II error).

Imagine a psychologist who studies the correlation between need for achievement and grades. When deciding on sample size, he may reason as follows: "What correlation do I expect? r = .35. What N do I need to make the result significant? (Looks at table.) N = 33. Fine, that's my sample." The only flaw in this reasoning is that it ignores sampling variation: if the population correlation is in fact .35, the sample correlation is about as likely to fall below the critical value as above it, so the study has roughly an even chance of yielding a significant result.

In a detailed investigation of statistical power, J. Cohen (1962, 1969) has provided plausible definitions of large, medium, and small effects and an extensive set of computational aids to the estimation of power for a variety of statistical tests. In the normal test for a difference between two means, for example, a difference of .25σ is small, a difference of .50σ is medium, and a difference of 1σ is large, according to the proposed definitions. The mean IQ difference between clerical and semiskilled workers is a medium effect. In an ingenious study of research practice, J. Cohen (1962) reviewed all the statistical analyses published in one volume of the Journal of Abnormal and Social Psychology, and computed the likelihood of detecting each of the three sizes of effect. The average power was .18 for the detection of small effects, .48 for medium effects, and .83 for large effects. If psychologists typically expect medium effects and select sample size as in the above example, the power of their studies should indeed be about .50.

Cohen's analysis shows that the statistical power of many psychological studies is ridiculously low. This is a self-defeating practice: it makes for frustrated scientists and inefficient research.
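The shape of Cohen's result can be illustrated with a normal approximation for the two-sample test. The group size of 32 below is a hypothetical choice for illustration, not a figure from Cohen's survey (his averages of .18, .48, and .83 were computed over many heterogeneous published analyses); the point is simply that with samples of this order, a medium effect of .50σ yields power near .50:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed normal test for a difference
    between two means, with effect size d in sigma units."""
    ncp = d * sqrt(n_per_group / 2)       # expected value of the z statistic
    z_crit = norm.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    return 1 - norm.cdf(z_crit - ncp)     # the other-tail term is negligible

# Cohen's small / medium / large effects at a hypothetical n = 32 per group:
for d in (0.25, 0.50, 1.00):
    print(d, round(power_two_sample(d, 32), 2))  # ~0.17, ~0.52, ~0.98
```

A medium effect is detected only about half the time, exactly the .50 power attributed above to the psychologist who picks N so that the expected effect would be just significant.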
The investigator who tests a valid hypothesis but fails to obtain significant results cannot help but regard nature as untrustworthy or even hostile. Furthermore, as Overall (1969) has shown, the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of invalid rejections of the null hypothesis among published results.

Because considerations of statistical power are of particular importance in the design of replication studies, we probed attitudes concerning replication in our questionnaire.

Suppose one of your doctoral students has completed a difficult and time-consuming experiment on 40 animals. He has scored and analyzed a large number of variables. His results are generally inconclusive, but one before-after comparison yields a highly significant t = 2.70, which is surprising and could be of major theoretical significance. Considering the importance of the result, its surprisal value, and the number of analyses that your student has performed, would you recommend that he replicate the study before publishing? If you recommend replication, how many animals would you urge him to run?

Among the psychologists to whom we put these questions there was overwhelming sentiment favoring replication: it was recommended by 66 out of 75 respondents, and the median recommended replication sample was 20 animals. It is instructive to consider the likely outcome of this advice. If the mean and the variance in the second sample are actually identical to those in the first, the resulting value of t will be only about 1.9, and, following the reasoning of Footnote 1, the student's chance of obtaining a significant result in the replication is only slightly above one-half (for p = .05, one-tail test). Since we had anticipated that a replication sample of 20 would appear reasonable to our respondents, we added the following question:

Assume that your unhappy student has in fact repeated the initial study with 20 additional animals, and has obtained an insignificant result in the same direction, t = 1.24. What would you recommend now? Check one: [the numbers in parentheses refer to the number of respondents who checked each answer]

(a) He should pool the results and publish his conclusion as fact. (0)
(b) He should report the results as a tentative finding. (26)
(c) He should run another group of animals. (21)
(d) He should try to find an explanation for the difference between the two groups. (30)

Note that regardless of one's confidence in the original finding, its credibility is surely enhanced by the replication. Not only is the experimental effect in the same direction in the two samples, but the magnitude of the effect in the replication is fully two-thirds of that in the original study. In view of the sample size (20) that our respondents recommended, the replication was about as successful as one is entitled to expect. The distribution of responses, however, reflects continued skepticism concerning the student's finding following the recommended replication. This unhappy state of affairs is a typical consequence of insufficient statistical power.
In contrast to Responses (b) and (c), which can be justified on some grounds, the most popular response, Response (d), is indefensible. We doubt that the same answer would have been obtained had the respondents realized that the difference between the two studies does not even approach significance. (If the variances of the two samples are equal, t for the difference is .53.) In the absence of a statistical test, our respondents followed the representation hypothesis: because the difference between the two samples was larger than they expected, they viewed it as worthy of explanation. However, the attempt to "find an explanation for the difference between the two groups" is in all probability an exercise in explaining noise.

Altogether, our respondents evaluated the replication rather harshly. This follows from the representation hypothesis: if we expect all samples to be very similar to one another, then almost all replications of a valid hypothesis should be significant. The harshness of the criterion for successful replication is manifest in the responses to the following question:

An investigator has reported a result that you consider implausible. He ran 15 subjects, and reported a significant value, t = 2.46. Another investigator has attempted to duplicate his procedure, and he obtained a nonsignificant value of t with the same number of subjects. The direction was the same in both sets of data. You are reviewing the literature. What is the highest value of t in the second set of data that you would describe as a failure to replicate?

The majority of our respondents regarded t = 1.70 as a failure to replicate. If the data of two such studies (t = 2.46 and t = 1.70) are pooled, the value of t for the combined data is about 3.00 (assuming equal variances). Thus, we are faced with a paradoxical state of affairs, in which the same data that would increase our confidence in the finding when viewed as part of the original study, shake our confidence when viewed as an independent study. This double standard is particularly disturbing since, for many reasons, replications are usually considered as independent studies, and hypotheses are often evaluated by listing confirming and disconfirming reports.

Contrary to a widespread belief, a case can be made that a replication sample should often be larger than the original. The decision to replicate a once obtained finding often expresses a great fondness for that finding and a desire to see it accepted by a skeptical community. Since that community unreasonably demands that the replication be independently significant, or at least that it approach significance, one must run a large sample. To illustrate, if the unfortunate doctoral student whose thesis was discussed earlier assumes the validity of his initial result (t = 2.70, N = 40), and if he is willing to accept a risk of only .10 of obtaining a t lower than 1.70, he should run approximately 50 animals in his replication study.
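The figure of "approximately 50 animals" can be reproduced with a rough calculation that treats the t statistic as approximately normal (a simplification, adequate at these sample sizes). The initial result t = 2.70 with N = 40 implies a standardized effect of 2.70/√40 ≈ .43 per animal; the replication size n is then chosen so that the chance of falling below t = 1.70 is at most .10:

```python
from math import sqrt, ceil
from statistics import NormalDist

norm = NormalDist()

t1, n1 = 2.70, 40
d = t1 / sqrt(n1)            # effect size implied by the initial study
t_target = 1.70              # the replication should reach at least this t
risk = 0.10                  # accepted chance of falling short

# Want P(d*sqrt(n) + noise >= t_target) >= 1 - risk, with noise ~ N(0, 1),
# i.e. d*sqrt(n) - t_target >= z at the .90 quantile (about 1.28).
z = norm.inv_cdf(1 - risk)
n = ceil(((t_target + z) / d) ** 2)
print(n)                     # 49, i.e. approximately 50 animals
```

Rerunning the same calculation with the weaker initial result mentioned next (t = 2.20, N = 40) pushes the required sample to roughly 75.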
With a somewhat weaker initial result (t = 2.20, N = 40), the size of the replication sample required for the same power rises to about 75.

That the effects discussed thus far are not limited to hypotheses about means and variances is demonstrated by the responses to the following question:

You have run a correlational study, scoring 20 variables on 100 subjects. Twenty-seven of the 190 correlation coefficients are significant at the .05 level; and 9 of these are significant beyond the .01 level. The mean absolute level of the significant correlations is .31, and the pattern of results is very reasonable on theoretical grounds. How many of the 27 significant correlations would you expect to be significant again, in an exact replication of the study, with N = 40?

With N = 40, a correlation of about .31 is required for significance at the .05 level. This is the mean of the significant correlations in the original study. Thus, only about half of the originally significant correlations (i.e., 13 or 14) would remain significant with N = 40. In addition, of course, the coefficients in the replication are bound to differ from those in the original study, so even this figure is generous. Apparently, our respondents expect more than the mere duplication of the original statistics in the replication; they expect a duplication of the significance of results, with little regard for sample size. This expectation follows from the representation hypothesis; even a perfectly representative replication sample, however, could not be counted on for generating such a result.

The expectation that patterns of results replicate almost in their entirety provides the rationale for a common, though much deplored, practice. The investigator who computes all correlations between three indexes of anxiety and three indexes of dependency will often report and interpret with great confidence the single significant correlation. His confidence in the shaky finding stems from his belief that the obtained correlation matrix is highly representative and readily replicable.

In review, we have seen that the believer in the law of small numbers practices science as follows:

1. He gambles his research hypotheses on small samples without realizing that the odds against him are unreasonably high. He overestimates power.
2. He has undue confidence in early trends (e.g., the data of the first few subjects) and in the stability of observed patterns (e.g., the number and identity of significant results). He overestimates significance.
3. In evaluating replications, his or others', he has unreasonably high expectations about the replicability of significant results. He underestimates the breadth of confidence intervals.
4. He rarely attributes a deviation of results from expectations to sampling variability, because he finds a causal "explanation" for any discrepancy. Thus, he has little opportunity to recognize sampling variation in action. His belief in the law of small numbers, therefore, will forever remain intact.

Our questionnaire elicited considerable evidence for the prevalence of the belief in the law of small numbers.² Our typical respondent is a believer, regardless of the group to which he belongs: there were practically no differences between the median responses of the groups we sampled.
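The critical correlation of "about .31" for N = 40 follows from the standard t test for a correlation coefficient, t = r·√(N − 2)/√(1 − r²) with N − 2 degrees of freedom, inverted at the critical t. A sketch (the two-tailed .05 quantile of t with 38 df, about 2.024, is hard-coded here to keep the example self-contained):

```python
from math import sqrt

# Critical r for a two-tailed .05 test of a correlation with N = 40:
# invert t = r * sqrt(N - 2) / sqrt(1 - r**2) at the critical t value.
N = 40
df = N - 2
t_crit = 2.024                   # two-tailed .05 quantile of t, 38 df
r_crit = t_crit / sqrt(df + t_crit ** 2)
print(round(r_crit, 2))          # 0.31, the mean significant r in the study
```

Since the originally significant correlations average exactly this critical value, roughly half of them sit below it, which is why only 13 or 14 of the 27 can be expected to remain significant.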
² W. Edwards (1968, p. 25) has argued that people fail to extract sufficient certainty from probabilistic data. Our respondents, in contrast, can hardly be described as conservative: in accord with the representation hypothesis, they tend to extract more certainty from the data than the data, in fact, contain.

In everyday research, the believer in the law of small numbers has little chance of discovering his error, since any surprising deviation is all too easily "explained." Corrective experiences are those that provide neither motive nor opportunity for spurious explanation. Thus, a student in a statistics course may draw repeated samples of given size from a population, and learn the effect of sample size on sampling variability from personal observation. We are far from certain, however, that expectations can be corrected in this manner, since related biases, such as the gambler's fallacy, survive considerable contradictory evidence.

Even if the bias cannot be unlearned, students can learn to recognize its existence and take the necessary precautions. Since the teaching of statistics is not short on admonitions, a warning about biased statistical intuitions may not be out of place. The obvious precaution is computation. The believer in the law of small numbers has incorrect intuitions about significance level, power, and confidence intervals. Significance levels are usually computed and reported, but power and confidence limits are not. Perhaps they should be.

Explicit computation of power, relative to some reasonable hypothesis, for instance, J. Cohen's (1962, 1969) small, medium, and large effects, should surely be carried out before any study is done. Such computations will often lead to the realization that there is simply no point in running the study unless, for example, sample size is multiplied by four. We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis. In addition, computations of power are essential to the interpretation of negative results, that is, failures to reject the null hypothesis. Because readers' intuitive estimates of power are likely to be wrong, the publication of computed values does not appear to be a waste of either readers' time or journal space.
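The remark that a study may be pointless "unless sample size is multiplied by four" reflects the √n scaling of the test statistic: quadrupling n doubles the expected z. A sketch, using a normal approximation for a two-tailed two-sample test and a small effect of .25σ, with hypothetical group sizes of 25 and 100 chosen for illustration:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def approx_power(d, n_per_group, alpha=0.05):
    # Normal approximation to the power of a two-tailed two-sample test.
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - d * sqrt(n_per_group / 2))

d = 0.25                          # a small effect, in sigma units
for n in (25, 100):               # quadrupling the per-group sample size
    print(n, round(approx_power(d, n), 2))  # ~0.14, then ~0.42

# Quadrupling n doubles d * sqrt(n/2), raising power from ~.14 to ~.42;
# computing this before running the study reveals the hopeless case.
```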
In the early psychological literature, the convention prevailed of reporting, for example, a sample mean as M ± PE, where PE is the probable error (i.e., the 50% confidence interval around the mean). This convention was later abandoned in favor of the hypothesis-testing formulation. A confidence interval, however, provides a useful index of sampling variability, and it is precisely this variability that we tend to underestimate.

The emphasis on significance levels tends to obscure a fundamental distinction between the size of an effect and its statistical significance. Regardless of sample size, the size of an effect in one study is a reasonable estimate of the size of the effect in replication. In contrast, the estimated significance level in a replication depends critically on sample size. Unrealistic expectations concerning the replicability of significance levels may be corrected if the distinction between size and significance is clarified, and if the computed size of observed effects is routinely reported.

The true believer in the law of small numbers commits his multitude of sins against the logic of statistical inference in good faith. The representation hypothesis describes a cognitive or perceptual bias, which operates regardless of motivational factors. The hasty rejection of the null hypothesis is gratifying, and the premature rejection of a cherished hypothesis is aggravating; yet the true believer is subject to both. His intuitive expectations are governed by a consistent misperception of the world rather than by opportunistic wishful thinking. Given some editorial prodding, he may be willing to regard his statistical intuitions with proper suspicion and replace impression formation by computation whenever possible.