Preface

This book is based on an important principle: It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it. Learning first what you can do will help you to work more easily and effectively.

This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation.

Examples, NOT case histories

The book does not exist to make the case that exploratory data analysis is useful. Rather it exists to expose its readers and users to a considerable variety of techniques for looking more effectively at one's data. The examples are not intended to be complete case histories. Rather they show isolated techniques in action on real data. The emphasis is on general techniques, rather than specific problems.

A basic problem about any body of data is to make it more easily and effectively handleable by minds—our minds, her mind, his mind. To this general end:

o anything that makes a simpler description possible makes the description more easily handleable.
o anything that looks below the previously described surface makes the description more effective.

So we shall always be glad (a) to simplify description and (b) to describe one layer deeper. In particular:

o to be able to say that we looked one layer deeper, and found nothing, is a definite step forward—though not as far as to be able to say that we looked deeper and found thus-and-such.
o to be able to say that "if we change our point of view in the following way . . . things are simpler" is always a gain—though not quite as much as to be able to say "if we don't bother to change our point of view (some other) things are equally simple".

Thus, for example, we regard learning that log pressure is almost a straight line in the negative reciprocal of absolute temperature as a real gain, as compared to saying that pressure increases with temperature at an evergrowing rate. Equally, we regard being able to say that a batch of values is roughly symmetrically distributed on a log scale as much better than to say that the raw values have a very skew distribution.

In rating ease of description, after almost any reasonable change of point of view, as very important, we are essentially asserting a belief in quantitative knowledge—a belief that most of the key questions in our world sooner or later demand answers to "by how much?" rather than merely to "in which direction?".

Consistent with this view, we believe, is a clear demand that pictures based on exploration of data should force their messages upon us. Pictures that emphasize what we already know—"security blankets" to reassure us—are frequently not worth the space they take. Pictures that have to be gone over with a reading glass to see the main point are wasteful of time and inadequate of effect. The greatest value of a picture is when it forces us to notice what we never expected to see.

We shall not try to say why specific techniques are the ones to use. Besides pressures of space and time, there are specific reasons for this. Many of the techniques are less than ten years old in their present form—some will improve noticeably. And where a technique is very good, it is not at all certain that we yet know why it is.

We have tried to use consistent techniques wherever this seemed reasonable, and have not worried where it didn't. Apparent consistency speeds learning and remembering, but ought not to be allowed to outweigh noticeable differences in performance.

In summary, then, we:

o leave most interpretations of results to those who are experts in the subject-matter field involved.
o present techniques, not case histories.
o regard simple descriptions as good in themselves.
o feel free to ask for changes in point of view in order to gain such simplicity.
o demand impact from our pictures.
o regard every description (always incomplete!) as something to be lifted off and looked under (mainly by using residuals).
o regard consistency from one technique to another as desirable, not essential.

Confirmation

The principles and procedures of what we call confirmatory data analysis are both widely used and one of the great intellectual products of our century. In their simplest form, these principles and procedures look at a sample—and at what that sample has told us about the population from which it came—and assess the precision with which our inference from sample to population is made. We can no longer get along without confirmatory data analysis. But we need not start with it.

The best way to understand what CAN be done is no longer—if it ever was—to ask what things could, in the current state of our skill techniques, be confirmed (positively or negatively). Even more understanding is lost if we consider each thing we can do to data only in terms of some set of very restrictive assumptions under which that thing is best possible—assumptions we know we CANNOT check in practice.

Exploration AND confirmation

Once upon a time, statisticians only explored. Then they learned to confirm exactly—to confirm a few things exactly, each under very specific circumstances. As they emphasized exact confirmation, their techniques inevitably became less flexible. The connection of the most used techniques with past insights was weakened. Anything to which a confirmatory procedure was not explicitly attached was decried as "mere descriptive statistics", no matter how much we had learned from it.

Today, the flexibility of (approximate) confirmation by the jackknife makes it relatively easy to ask, for almost any clearly specified exploration, "How far is it confirmed?"

Today, exploratory and confirmatory can—and should—proceed side by side. This book, of course, considers only exploratory techniques, leaving confirmatory techniques to other accounts.

Relation to the preliminary edition

The preliminary edition of Exploratory Data Analysis appeared in three volumes, represented the results of teaching and modifying three earlier versions, and had limited circulation. Complete restructuring and revision was followed by further major changes after the use of the structure and much of the material in an American Statistical Association short course. The present volume contains:

o those techniques from the first preliminary volume that seemed to deserve careful attention.
o a selection of techniques from the second preliminary volume.
o a few techniques from the third preliminary volume.
o some techniques (especially in chapters 7, 8, and 17) that did not appear in the preliminary edition at all.

It is to be hoped that the preliminary edition will reappear in microfiche form.

About the problems

The teacher needs to be careful about assigning problems. Not too many, please. They are likely to take longer than you think. The number supplied is to accommodate diversity of interest, not to keep everyone busy.

Besides the length of our problems, both teacher and student need to realize that many problems do not have a single "right answer". There can be many ways to approach a body of data. Not all are equally good. For some bodies of data this may be clear, but for others we may not be able to tell from a single body of data which approach is preferred. Even several bodies of data about very similar situations may not be enough to show which approach should be preferred.

Accordingly, it will often be quite reasonable for different analysts to reach somewhat different analyses.

Yet more—to unlock the analysis of a body of data, to find the good way or ways to approach it, may require a key, whose finding is a creative act. Not everyone can be expected to create the key to any one situation. And, to continue to paraphrase Barnum, no one can be expected to create a key to each situation he or she meets.

To learn about data analysis, it is right that each of us try many things that do not work—that we tackle more problems than we make expert analyses of. We often learn less from an expertly done analysis than from one where, by not trying something, we missed—at least until we were told about it—an opportunity to learn more. Each teacher needs to recognize this in grading and commenting on problems.

Precision

The teacher who heeds these words and admits that there need be no one correct approach may, I regret to contemplate, still want whatever is done to be digit-perfect. (Under such a requirement, the writer should still be able to pass the course, but it is not clear whether he would get an "A".) One does, from time to time, have to produce digit-perfect, carefully checked results, but forgiving techniques that are not too disturbed by unusual data are also, usually, little disturbed by SMALL arithmetic errors. The techniques we discuss here have been chosen to be forgiving. It is to be hoped, then, that small arithmetic errors will take little off the problem's grades, leaving severe penalties for larger errors, either of arithmetic or of concept.

Acknowledgments

It is a pleasure to acknowledge support, guidance, cooperation, and hard work.

Both the Army Research Office (Durham), through a contract with Princeton University, and the Bell Telephone Laboratories have supported the writing financially.

Charles P. Winsor taught the writer, during the 1940's, many things about data analysis that were not in the books.
Although its formal beginning came after his death, this book owes much to S. S. Wilks, whose leadership for statistics in Princeton made possible the gaining of the insights on which it is based.

Careful reading of earlier versions by friends and colleagues, especially David Hoaglin and Leonard Steinberg, was most helpful, as were comments by those who have taught the course at various institutions. As noted above, student reaction led to many changes.

Frederick Mosteller took his editorial responsibilities very seriously; the reader owes him thanks for many improvements.

The arithmetic is much more nearly correct because of the work of Ms. Agelia Mellros.

Careful and skilled typing, principally by Mrs. Mary E. Bittrich and by Mrs. Elizabeth LaJeunesse Dutka (earlier versions), with significant contributions by Mrs. Glennis Cohen and Mrs. Eileen Olshewski, has been of vital importance.

The cooperative attitude and judgment of the Addison-Wesley staff, particularly that of Roger Drumm (without whose long-continued encouragement this book might not be a reality), Mary Cafarella (production editor), Marshall Henrichs (designer), and Richard Morton (illustrator), were of great help.

Princeton, New Jersey
Christmas, 1976
John W. Tukey