Statistical graphics: Mapping the pathways of science
Howard Wainer; Paul F Velleman
Annual Review of Psychology; 2001; 52, ProQuest Medical Library pg. 305
Annu. Rev. Psychol. 2001. 52:305-35
Copyright © 2001 by Annual Reviews. All rights reserved

Statistical Graphics: Mapping the Pathways of Science

Howard Wainer
Educational Testing Service, Princeton, New Jersey 08541; e-mail: hwainer@ets.org

Paul F. Velleman
Cornell University, Ithaca, NY 14853; e-mail: pfv2@cornell.edu

Key Words: linking, slicing, brushing, EDA, rotating plots, dynamic display, interactive displays, multivariate analysis

Abstract: This chapter traces the evolution of statistical graphics starting with its departure from the common noun structure of Cartesian determinism, through William Playfair's revolutionary grammatical shift to graphs as proper nouns, and alights on the modern conception of graph as an active participant in the scientific process of discovery. The ubiquitous availability of data, software, and cheap, high-powered computing, when coupled with the broad acceptance of the ideas in Tukey's 1977 treatise on exploratory data analysis, has yielded a fundamental change in the way that the role of statistical graphics is thought of within science: as a dynamic partner and guide to the future rather than as a static monument to the discoveries of the past. We commemorate and illustrate this development while pointing readers to the new tools available and providing some indications of their potential.
CONTENTS
INTRODUCTION: Graphs as Nouns, from Common to Proper
THE NEXT GRAPHICAL REVOLUTION: Graphs as Dynamic Colleagues
  Conversational Graphics
  The Absurdity of Graphing Data
  Multiple Dimensions
  Time as a Dimension
  Kinds of Interaction
  The Illusion of Three Dimensions
  What Is an Interesting View?
  Seeing Patterns in Rotating Plots
  Rotation and Color as an Additional Dimension
  Four Variables and More
  Practical Multivariate Graphics
CONCLUSIONS AND LIMITATIONS

0066-4308/01/0201-0305$14.00
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

INTRODUCTION: Graphs as Nouns, from Common to Proper

Graphic displays abounded in ancient times. For example, a primitive coordinate system of intersecting horizontal and vertical lines that enabled a precise placement of data points was used by Nilotic surveyors as early as 1400 BC. A more refined coordinate system was used by Hipparchus (ca. 140 BC), whose terms for the coordinate axes translate into Latin as longitudo and latitudo, to locate points in the heavens. Somewhat later, Roman surveyors used a coordinate grid to lay out their towns on a plane that was defined by two axes. The decimanus lines ran from east to west, and the cardo lines ran north to south (Smith 1925).
There are many other examples of special-purpose coordinate systems in wide use before the end of the first millennium: musical notation placed on horizontal running lines was in use as early as the ninth century (Apel 1944), and the chessboard was invented in seventh-century India. Costigan-Eaves & Macdonald-Ross (in preparation) found what appears to be one of the earliest examples of printed graph paper, dating from about 1680. Large sheets of paper engraved with a grid were apparently printed to aid in designing and communicating the shapes of the hulls of ships.

Both Beniger & Robyn (1978) and Funkhouser (1937) describe Descartes' 1637 development of a coordinate system as an important intellectual milestone in the path toward statistical graphics. We join Biderman (1978) in interpreting this in exactly the opposite way: that it was an intellectual impediment that took a century and a half and William Playfair's (1759-1823) eclectic mind to overcome. Because natural science originated within natural philosophy, it favored a rational rather than empirical approach to scientific inquiry. Such an outlook was antithetical to the more empirical modern approach to science that does not disdain the atheoretical plotting of data points with the goal of investigating suggestive patterns. Graphs in existence before Playfair (with some notable exceptions discussed below) grew out of the same rationalist tradition that yielded Descartes' coordinate geometry, that is, the plotting of curves on the basis of an a priori mathematical expression (e.g. Oresme's "pipes" on the first page of the Padua edition of his 1486 Tractatus de latitudinibus formarum, often cited as an early example; see Figure 1).
This notion is supported by statements like that of Luke Howard, a prolific grapher of data in the late eighteenth and early nineteenth century who, as late as ...

Footnote 1. This material is classed in the "collection" category of the British Library with the entry, "A collection of engraved sheets of squared paper, whereon are traced in pencil or ink the curves or sweeps of the hulls of sundry men-of-war."

Footnote 2. Clagett (1968) argued convincingly that this work was not written by Oresme, but probably by Jacobus de Sancto Martino, one of his followers, in about 1390: yet another instance of how surprisingly often eponymous referencing is an indication only of who did not do it (Stigler 1980).

[Figure 1: first page of the Tractatus de latitudinibus formarum (Latin text with diagrams); image not reproduced.]

Figure 2. Robert Plot's (1685) "History of the Weather" recording of the daily barometric pressure in Oxford for the year 1684 (based on the original work of Martin Lister). [Image not reproduced.]

Figure 3. A redrafting of Christiaan Huygens' 1669 curve showing how many people out of a hundred survive between the ages of infancy to 86. [Data from John Graunt's (1662) Natural and Political Observations on the Bills of Mortality.] [Image not reproduced; the horizontal axis runs from age 0 to 86.]

... interpolations. The letters on the chart are related to an associated discussion on how to construct a life expectancy chart from this one, that is, analyzing a set of data to yield deeper insights into the subject. Christiaan constructed such a chart and indicated that it was more interesting from a scientific point of view; Figure 3, he felt, was more helpful in wagering.

There was a smattering of other examples of empirically based graphs that appeared in the century between Huygens' letter and the publication of Playfair's Commercial and Political Atlas (1786), for although some graphic forms were available before Playfair, they were rarely used to plot empirical information. Biderman (1978) argued that this was because there was an antipathy toward such a use as a scientific approach. This suggestion was supported by such statements as that made by Luke Howard.
However, at least sometimes when data were available (e.g. Pliny's astronomical data, Graunt's survival data, Plot's weather data, and several other admirable uses) they were plotted. Perhaps part of the exponential increase in the use of graphics since the beginning of the nineteenth century is merely concomitant to the exponential growth in the availability of data. Of course, there might also be a symbiosis, in that the availability of graphic devices for analyzing data encouraged data gathering. Whatever the reasons, Playfair was at the cusp of an explosion in data gathering, and his graphic efforts appear causal. He played an important role in that explosion.

Footnote. There are many other graphical devices contained in the 22-volume Oeuvres Completes (1888-1950) to be explored by anyone with fluency in ancient Dutch, Latin, and French. Incidentally, Huygens' graphical work on the pendulum proved to him that a pendulum's oscillations would be isochronic regardless of its amplitude. This discovery led him to actually build the first clock based on this principle.

The consensus of scholars, well phrased by P Costigan-Eaves & M Macdonald-Ross (in preparation), is that until Playfair "many of the graphic devices used were the result of a formal and highly deductive science.... This world view was more comfortable with an arm-chair, rationalistic approach to problem-solving which usually culminated in elegant mathematical principles" often associated with elegant geometrical diagrams. The empirical approach to problem solving, a critical driving force for data collection, was slow to get started. However, the empirical approach began to demonstrate remarkable success in solving problems, and with improved communications, the news of these successes, and hence the popularity of the associated graphic tools, began to spread quickly.
We are accustomed to intellectual diffusion taking place from the natural and physical sciences into the social sciences; certainly that is the direction taken for both calculus and the scientific method. However, statistical graphics in particular, and statistics in general, went the reverse route. Although, as we have seen, there were applications of data-based graphics in the natural sciences, it was only after Playfair applied them within the social sciences that their popularity began to accelerate. Playfair should be credited with producing the first chartbook of social statistics; indeed, publishing an atlas that contained not a single map is one indication of his belief in the methodology (to say nothing of his chutzpah).

Playfair's work was immediately admired, but emulation, at least in Britain, took a little longer (graphic use started up on the continent a bit sooner). Interestingly, one of Playfair's earliest emulators was the banker S Tertius Galton (the father of Francis Galton, and hence the biological grandfather of modern statistics) who, in 1813, published a multiline time series chart of the money in circulation, rates of foreign exchange, and prices of bullion and wheat. The relatively slower diffusion of the graphical method back into the natural sciences provides additional support for the hypothesized bias against empiricism there. The newer social sciences, having no such tradition and faced with both problems to solve and relevant data, were quicker to see the potential of Playfair's methods.

Playfair's graphical inventions and adaptations look contemporary. He invented the statistical bar chart out of desperation because he lacked the time series data required to draw a line showing the trade with Scotland, and so used bars to symbolize the cross-sectional character of the data he did have. Playfair acknowledged Priestley's (1769) priority in this form, although Priestley used bars to symbolize the life spans of historical figures in a time line.

Footnote 5. The first encyclopedia in English appeared in 1704. The number of scientific periodicals began a rapid expansion at the end of the eighteenth century; between 1780 and 1789, 20 new journals appeared, and between 1790 and 1800, 25 more (McKie 1972).

Footnote 6. Biderman (1978, 1990) pointed out that, ironically, Galton's chart predicted the financial crisis of 1831 that created a ruinous run on his own bank.

Playfair's role was crucial for several reasons, but not for his development of the graphic recording of data; others preceded him in that. Indeed, in 1805 he pointed out that as a child his brother John had him keep a graphic record of temperature readings. However, Playfair was in a remarkable position. Because of his close relationship with his brother and his connections with James Watt, he was on the periphery of science. He was close enough to know the value of the graphical methods, but sufficiently detached in his own interests to apply them in a very different arena, that of economics and finance. These areas, then as now, tend to attract a larger audience than matters of science, and Playfair was adept at self-promotion. [For more about the remarkable life and accomplishments of William Playfair (including the fascinating story of his attempted blackmail of Lord Archibald Douglas) the interested reader is referred to Spence & Wainer (1997, 2000), Wainer (1996) and Wainer & Spence (1997).]

In a review of Playfair's 1786 Atlas, which appeared in The Political Herald, Dr. Gilbert Stuart wrote, "The new method in which accounts are stated in this work, has attracted very general notice.
The propriety and expediency of all men, who have any interest in the nation, being acquainted with the general outlines, and the great facts relating to our commerce are unquestionable; and this is the most commodious, as well as accurate mode of effecting this object, that has hitherto been thought of ...To each of his charts the author has added observations (which) ...in general are just and shrewd; and sometimes profound... Very considerable applause is certainly due to this invention; as a new, distinct, and easy mode of conveying information to statesmen and merchants..." Such wholehearted approval rarely greets any scientific development. Playfair's adaptation of graphic methods to matters of general interest provided an enormous boost to the popularity of statistical graphics.

Footnote 7. Priestley's use of the bar as a metaphor is somewhat different from Playfair's in that the data were not really statistical. Moreover, Priestley was not the first to construct a graphical time-line; in 1753 the French physician Jacques Barbeu-Dubourg produced a graphic in the form of a 54-foot-long scroll, configured in a way not unlike a Torah, that contains thumbnail sketches of famous people from The Creation to 1750 (see Wainer 1998 for a fuller story).

THE NEXT GRAPHICAL REVOLUTION: Graphs as Dynamic Colleagues

"Eppur si muove!" Galileo (c. 1622)

Footnote 8. "And yet it moves!"

For almost 200 years, from 1786 and the publication of Playfair's Atlas until 1977 and the publication of Tukey's Exploratory Data Analysis, the use of graphics within science remained static. Statistical graphics became widely used to communicate information, to decorate and enliven scientific presentations, and to store information.
Their use as the principal tool in the exploration of quantitative phenomena also grew in fits and starts, but sentiments analogous to Luke Howard's were still voiced.

Tukey's Exploratory Data Analysis changed things. Suddenly terms like data snooping, data dredging, and the currently trendy "data mining" were no longer pejorative. Coupled with the scientific acceptability, even desirability, of the clever plotting of data points in the search for suggestive patterns, was the ubiquitous appearance of cheap, powerful computing. This manuscript is being prepared on a $2000 computer more powerful than any institutional mainframe available when Tukey's book was published. Although most of its MIPS are wastefully idle, they can be called upon whenever needed.

However, the computer revolution does not stop with machinery (although it is surely powered by it). Enormous data sets, on varied topics, are readily available. A CD-ROM or two can provide you with the results of the decennial census or the entire National Assessment of Educational Progress. Through the Cochrane Collaboration the results of 250,000 different random assignment medical experiments are immediately accessible for scrutiny and meta-analysis. Soon all three billion pieces of the human genome will be available to serve as biology's analog to the periodic table. And then there is "the web," overflowing with data (and nondata).

Software for data analysis and visualization, when added to the assets of powerful computing and extensive data, completes the scientific triumvirate. Studies that were either too expensive, too tedious, or too difficult can now be done with the click of a mouse. It is this ease of manipulation that characterizes the latest transformation of graphics in scientific inquiry. The graph is no longer a static object to be carefully constructed and enshrined for further study. It is a dynamic partner in the investigation.
The rest of this chapter focuses on some of the new dynamic tools that are available for examining data. We ignore the set of useful tools for data exposition that were described 20 years ago in an earlier incarnation of this chapter (Wainer & Thissen 1981) and instead refer interested readers to that review. There are many more ways to display data badly (Wainer 1997, Chapt. 1) than there are to display data well, that is, to say what you mean about the data clearly and grammatically. Whereas the earlier chapter on graphical data analysis discussed clear, grammatical presentation of data, including methods that are resistant to influence by outliers, the balance of this chapter discusses how to hold a conversation about your data with a data display.

Footnote 9. Data mining, which usually implies fitting a very complex general model to an enormous data set, still seems to deserve critical scrutiny. Bert Green (personal communication) characterizes data mining as being akin to the Ganzwelt of the nineteenth century psychophysicists; sooner or later you begin to see things, whether or not anything is really there.

The key to conversational graphics is the recognition of a graph as dynamic and malleable. During the course of a good conversation, each party changes, learns, and grows. A good conversation about data is much the same: We may see something new in the data that leads us to want to view it in a new way. By viewing the data in many different but consistent ways, we have a greater chance of noting patterns, relationships, and exceptions. As the conversation leads us to a new point of view, we understand the data differently.

Conversational Graphics

Data graphics have evolved from depicting numbers, to depicting variables (e.g. distributions), to depicting relationships among variables.
At each stage, however, the communication has been in only one direction: from the graph (or graph maker) to the viewer. But, as computers have taken over almost all graph drawing for data, we have come to realize the possibility of interacting with graphs, of holding a conversation with a graph in an attempt to mutually achieve greater insight. We have come to realize the extraordinary enhancement that such interaction brings to the understanding of data through graphs.

There is good experimental evidence that we learn better through interaction. Such "active learning" is almost a fad among educators, but the principle that interacting with something new aids in understanding is sound.

Graphs that interact with the viewer first appeared in the early 1970s with projects such as PRIM-9 (Fisherkeller et al 1974), the first multidimensional rotating scatterplot, and early experiments with plot brushing at AT&T Bell Labs (Becker & Cleveland 1984). It is only with the wide availability of powerful desktop computers, however, that they have become widely available. Various kinds of real-time interaction can be found in many statistics programs, although few offer all of the methods we discuss here. However, each method has usually been discussed on its own. We attempt here to bring together discussions of interactive graphics and provide unifying principles and insights.

The Absurdity of Graphing Data

The Nobel Laureate Eugene P Wigner (1960), in his address commemorating the opening of the Courant Institute, remarked on the unusual effectiveness of mathematics in science. He pointed out that "mathematics works so often in science that it is disquieting. It is like a man with a large key ring and a sequence of doors to open who finds that after choosing a key at random each door opens on the first or second try. Sooner or later you begin to doubt the relationship between the keys and the locks. So it is with mathematics and science."
Why should the universe operate in such a way that human mathematics accurately describes it? It is with the same sense of wonder that we ask the identical question about graphical display, for graphs of data are based on the somewhat absurd notion that we can usefully represent data values whose meaning relates to units of measurement in the real world by arbitrarily assigning them a position in space, a color, a symbol, or a behavior. Moreover, although the data values themselves have no position, color, symbol, or behavior, an appropriate assignment will not only allow us to perceive patterns and relationships that might not otherwise be evident, but will meaningfully relate to the original measurement units.

Just as the unusual effectiveness of mathematics in science suggests something about the universal truth of mathematics, the unusual effectiveness of graphs for communication with humans suggests fundamental truths about human perception. In his Silliman Lectures, Jacob Bronowski (1978) notes that human perceptual abilities evolved along with our species and are thus optimized for certain survival-enhancing perceptions. We see edges well. We see straight lines and understand their relative slopes easily. We can compare areas and sizes visually unless distracted by an illusion of depth and volume. We are well-equipped to see smooth, physically-appropriate motion and we implicitly understand trajectories. As a result, Bronowski points out, we see the world the way we look rather than the way it looks, which constrains what we perceive. Data graphics, however, must take account of how we look and what we will see. Properly designed graphics use human perceptual abilities wisely.
Thus, well-planned layouts, straight lines, starkly different colors, areas of simple shapes, and smooth motion facilitate understanding and perception in graphs. More generally, modern graphics take advantage of human perception by constraining the points and symbols representing the data to behave with a "cartoon reality" that obeys reasonable laws. These laws include the principle that elements in a graph move smoothly (not jumping from place to place), that they have a consistent color, shape, and selection state, and that the mapping of numeric value to physical plot attribute is consistent and shows an appropriate association (e.g. the well-established "area principle," which holds that the perceived size of a plot element should correspond to the magnitude of the value displayed).

In fact, the wise use of these principles makes it possible for modern statistical graphics to display greater complexity than humans can easily understand otherwise. Well-designed statistical displays enable analysts to understand relationships among four, five, or even more variables, certainly more than three-dimensional (3-D) creatures are usually comfortable manipulating in Cartesian coordinates.

Multiple Dimensions

Traditional graphics are limited by the two-dimensional page or screen on which they appear. It is difficult to display more than two variables, and nearly impossible to display more than three clearly. [The now famous Minard graph depicting Napoleon's disastrous invasion of Russia (see Wainer 1997, p. 64) is remarkable precisely because it surpasses these limits so gracefully.] The world is not bivariate. The challenge of understanding multivariate relationships makes graphs that can help in this understanding particularly useful.
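The "area principle" mentioned above can be made concrete with a small sketch. It assumes circular plot symbols and takes perceived size to mean area, so the radius must grow with the square root of the value; the function name `symbol_radius`, the `max_radius` parameter, and the three data values are illustrative assumptions, not taken from any particular graphics package.

```python
import math


def symbol_radius(value, max_value, max_radius=20.0):
    """Radius of a circular symbol whose AREA is proportional to value.

    Hypothetical helper: the largest value is drawn at max_radius and
    every other symbol is scaled so that area, not radius, tracks the data.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)


# Made-up data values: each is four times the previous one.
values = [1.0, 4.0, 16.0]
radii = [symbol_radius(v, max(values)) for v in values]
areas = [math.pi * r ** 2 for r in radii]
# radii -> [5.0, 10.0, 20.0]: each radius doubles while each area quadruples.
```

Scaling the radius directly by the value would make the largest symbol look 256 times bigger than the smallest for a 16-fold difference in the data, which is the kind of distortion the principle is meant to prevent.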
Time as a Dimension

Because the human eye tracks smooth motion well, motion can be an effective display dimension. Physicists have told us for a century that time should be regarded as a dimension along with the three spatial dimensions. Designers of data graphics have now taken this admonition to heart, although not in the sense that Einstein had in mind. Rather, it is possible to use motion to show how a relationship that has been graphed changes as some other term is modified. The ability of a graph to change in real time, in response to viewer action, can display relationships among variables in ways that are perceived by most viewers as naturally as the mapping of value to physical location on a bivariate plot.

One use of this capability that has become relatively common is the display of three variables in a three-dimensional scatterplot, whose structure is displayed by rotating it smoothly on the computer screen. Even though the display in fact shows successive projections of the point cloud on the screen, the illusion of a three-dimensional display seen in rotation is compelling.

Another use of such animation is to show a display changing as a parameter is altered. For example, the analyst might control the value of an exponent used in the reexpression of one variable by sliding an on-screen control with the mouse. Simultaneously, a display of the residuals from a regression analysis can be updated, smoothly changing as the reexpression changes. Some animations of this sort show residuals becoming more homoskedastic as an appropriate reexpression is found. Others might show a single data value drifting away from the others and becoming an outlier, vividly revealing the sensitivity of that particular value to the parameter change.

Yet another use of animation shows the relationship between two variables as a third variable is added smoothly to the model or otherwise modified.
Such methods display an interaction effect, an aspect of statistical modeling that is notoriously difficult to understand and display, but that nevertheless is of great importance in discerning the truth about multivariate relationships.

To achieve this perceptually comfortable mapping, changes over time must follow their own rules of consistency. Displays must change smoothly and must keep up with mouse-based controls. (A delay of as little as 0.1 second can make the display appear to lag behind the mouse and destroy the illusion of physical reality.) Other rules usually lean toward simplification. For example, despite a number of attempts to simulate three dimensions accurately on a computer screen with perspective and shadow, most viewers are more comfortable with a flat projection of a three-dimensional pseudoreality onto the screen, which then moves to show the third dimension. Such a view of the data is much like the view through a telescope at some distance, in which the depth of field is lost. It also corresponds to the mathematical operation of projecting from higher dimensions onto lower dimensions, an operation fundamental to most multivariate statistics.

Such a display sacrifices all cues about the direction of rotation. Some viewers can reverse the illusion, switching the perception that the frontmost points are moving to the left (and the rearward points to the right) with the perception that the motions are reversed. Interestingly, the two displays are equivalent in data analysis content, so the ambiguity has no important consequences.

Kinds of Interaction

Modern data-display software provides several kinds of interaction with data graphics and some underlying principles that support them.
All of these methods assume that what we are seeing shows the data from many points of view and in many different ways but continues to preserve the data's central reality and consistency. The displays observe the principle of "linking," in which multiple arrays of related data are consistent in how the data are displayed, in particular in the use of color, symbol, and highlighting of points. Changes in one view of the data alter all other views simultaneously, preserving the illusion that, for example, the color of a datapoint is the same regardless of how it is viewed.

Selecting is a fundamental operation because selected points stand out from the background of other points. Selected points and regions are usually highlighted by becoming brighter, by becoming slightly larger, by changing color, or by filling in open spaces. The unselected points are displayed as well, providing a context for the selected points. It is thus easy to see whether, for example, the selected points cluster together consistently or show a trend that differs from the background trend.

Linking shows each case consistently across several displays. When a case is selected in one plot, all views of that case are selected immediately and highlighted so that the selection can be seen. The selected case stands out from the other cases in each window, so its relationship to them becomes clearer, making it easy to see conditional relationships. Clusters of points in one display can be selected to see whether they appear as a group in other views of the data or whether the observed clustering is a local feature. Linking makes it easy to answer questions such as

1. Is this extreme point also extreme in any other view of the data?
2. Do the points in this part of the histogram cluster on other variables?
3. Is the relationship between these two variables the same for each of the groups in this pie chart?
4. Does the pattern shown in this rotating plot correspond to any patterns shown in other views of the data?

These questions require sophisticated and complex statistical calculations to answer numerically but are easy to investigate with linked plots.

More fundamentally, linking treats each case as an object with a graphic reality. Just as real world objects have a shape, location, and color, graphic representations of data values benefit from having a consistent existence. Thus, graphing programs can also link plot symbols and colors. Each case is drawn in all of the plots with the same symbol (where symbols are appropriate) and in the same color (where colors are possible).

Linking also makes possible the interactive actions brush and slice. These actions have emerged as fundamental parts of the conversation that data analysts can hold with graphic displays of their data. Brushing and slicing can reveal joint patterns and relationships among many variables. Thus, they are actions appropriate for multivariate analysis.

Plot brushing was developed initially by statisticians at AT&T Bell Labs (Becker & Cleveland 1984) as a way to work with scatterplot matrices and is still offered in that specialized form by some statistics programs. Other programs generalize brushing beyond that isolated framework, making the plot brush a tool that works in any appropriate display. Brushing focuses attention on a selected subset of points while showing them against the background of the rest of the points. Each kind of display can offer an appropriate way to define the selected subset. The simplest case is brushing a scatterplot, in which a rectangle (whose size and shape can be controlled by the analyst) is dragged over the plot by mouse movements.
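A minimal sketch of the linking and rectangular-brush ideas described above may help. It assumes that the selection state is stored once, with the cases themselves, so that every view reads the same highlighting. The class `LinkedData`, its methods, and the five-car data set are hypothetical illustrations and do not reproduce the design of any statistics program named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedData:
    """Cases shared by every view; the selection lives here, not in a plot."""
    columns: dict                      # variable name -> list of values, one per case
    selected: set = field(default_factory=set)

    def brush(self, xvar, yvar, xmin, xmax, ymin, ymax):
        """Select the cases whose (xvar, yvar) point falls inside the rectangle."""
        xs, ys = self.columns[xvar], self.columns[yvar]
        self.selected = {
            i for i, (x, y) in enumerate(zip(xs, ys))
            if xmin <= x <= xmax and ymin <= y <= ymax
        }
        return self.selected

    def view(self, var):
        """(value, highlighted) pairs for any other display of the same cases."""
        return [(v, i in self.selected) for i, v in enumerate(self.columns[var])]


# Made-up data for five cars (illustrative only).
cars = LinkedData({
    "weight": [2.0, 2.5, 3.0, 3.5, 4.0],
    "mpg": [34, 30, 24, 20, 16],
    "drive_ratio": [3.9, 3.7, 3.1, 2.9, 2.6],
})

# Brush the heavy, low-mileage corner of the weight-by-mpg scatterplot ...
cars.brush("weight", "mpg", xmin=2.9, xmax=4.1, ymin=15, ymax=25)

# ... and the same cases are highlighted in a display of a third variable.
drive_ratio_view = cars.view("drive_ratio")
```

Because each call to `brush` replaces the selection, cases that leave the rectangle lose their highlighting, which matches the brush behavior described in the text; a persistent selection would instead accumulate indices.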
Points covered by the rectangle are highlighted in the scatterplot and in all other displays simultaneously. One can usually define brushes of different sizes and shapes; a tall, thin brush, for example, selects small, local parts of an x-axis variable. The highlighted points in other plots show the patterns and distributions conditional on the selected slice of points.

By contrast, selecting points in a dotplot focuses attention on a subrange of the brushed variable and shows where those points reside in other displays. Such a strip of values, in effect, conditions on the selected subrange of the brushed variable, and moving it shows the effects of changing the conditioning.

One can even brush bars in a histogram, watching the corresponding selection in other displays. More subtly, the effects of brushing can link into a histogram. Experience has shown that the best display for this is a highlighted "subset histogram" shown against the background of the full data histogram.

By selecting points in a rotating plot, you can orient the rotation to identify a key dimension or to isolate a subgroup.

A slicing tool selects points in vertical or horizontal slices of a plot. The tool slices right to left, left to right, top to bottom, or bottom to top, according to its initial direction. In contrast to a plot brush, in which points leaving the brush lose their highlighting, points selected by a slicing tool are selected as the tool passes their position and remain selected unless you reverse direction and drag back over them.

Brushing and slicing help to answer questions such as

1. Do the same cases seem to be in roughly the same places in each plot?
2. Is there any trend in sales from east to west?
3. Which variables change systematically as I move along this principal dimension in a rotating plot?
4. How does the relationship between the gas mileage and weight of cars change as drive ratio increases?

Brushing and slicing are based on the principle that by emphasizing the common identity of cases in multiple displays, we can help analysts relate several displays to one another. They do not add information that is not already in the displays; rather, they provide easier access to that information.

80 Companies Slicing Example

As an example of how slicing can help, consider the scatter plot of Log(Assets) versus Log(Market Value) for 80 companies drawn randomly from the Forbes 500 (Figure 4, in which the original data were in millions of dollars). We see three interesting features:

1. There is clearly a tendency for companies with greater market value to have greater assets (see the regression line in Figure 4);
2. There are about seven companies with a market value of about a billion dollars that have lower than expected assets; and
3. There seems to be a string of companies, of varying market values, that have unusually large assets.

Figure 4 Scatter plot of Log(Assets) against Log(Market Value) for 80 companies drawn at random from the Forbes 500. A least squares regression line is drawn in. (Horizontal axis: Log(Market Value), marked from 2.25 to 4.50.)

It seems sensible to look at the residuals from the overall trend (item 1). These are shown, as a function of predicted value, in Figure 5. We next use our slicing