- - - (1980). "We need both exploratory and confirmatory," The American Statistician, 34, 23-25.

Displays

John D. Emerson
Middlebury College

David C. Hoaglin
Harvard University and Abt Associates

The most common data structure is a b[...] structure of data may have characteristics [...] or studying the numbers. The stem-and[...] the numbers graphically in a way that [...] features of the data. This basic but ve[...] most widely used; we call upon it in later [...] batches and to examine residuals.

The stem-and-leaf display enables us [...] notice such features as:

  How nearly symmetric the batch is.
  How spread out the numbers are.
  Whether a few values are far removed [...]
  Whether there are concentrations of [...]
  Whether there are gaps in the data.

In exposing the analyst to these f[...] display has much in common with its clo[...] the digits of the data values themselves i[...] display offers advantages in some situations [...] easier to construct, and it takes a major [...]

1A. THE BASIC DISPLAY

The stem-and-leaf display (Tukey, 1970, 1972) provides a flexible and effective technique for starting to look at a batch or sample of data. The most significant digits of the data values themselves do most of the work of sorting the batch into numerical order and displaying it.

To explain the display and how one constructs it, we begin with an example. Royston and Abrams (1980) give the mean menstrual cycle length and the mean preovulatory basal body temperature for 21 healthy women who were using natural family planning. We reproduce these data in Table 1-1 and show one stem-and-leaf display for the cycle lengths in Figure 1-1. If we follow the first data value, 22.9, we see that it appears in the display as 22|9.

To construct this simplest form of stem-and-leaf display, we proceed as follows. First we choose a suitable pair of adjacent digits in the data; in the example, these are the ones digit and the tenths digit.
Next we split each data value between these two digits:

  data value 22.9  ->  split 22|9  ->  stem 22 and leaf 9

Then we allocate a separate line in the [...] leading digits (the stem); for Figure 1-1 [...] from 22 to 31. Finally we write down [...] each data value on the line corresponding [...]

The finished display, Figure 1-2, inclu[...] are in units of .1 day, as well as a column [...] shortly) to the left of the stems. When [...] sorted, the initial display will not usually [...] As an option, the final display can then [...] automatically when a computer produces the display (Velleman and Hoaglin, 1981).

[...] because the interaction between analyst and display is a personal one, we face considerations of taste and esthetics, particularly in choosing the number of intervals or the interval width. As often happens in exploratory data analysis, we aim to provide flexibility through a number of variations. We begin with the basic stem-and-leaf display and the steps in its construction, and then we add variations. For historical interest, we reproduce a close predecessor of the stem-and-leaf display. We also briefly pursue the connections with sorting to put the data in order. Choosing the number of lines (or the interval width) provides a point of contact with work (some of it more theoretical) on these choices for histograms.

Table 1-1. [...]

  Woman   Mean cycle length (days)
    3       26.6
    4       26.8
    5       26.9
    6       26.9
    7       27.5
    8       27.6
    9       27.6
   10       28.0
   11       28.4
   12       28.4
   13       28.5
   14       28.8
   15       28.8
   16       29.4
   17       29.9
   18       30.0
   19       30.3
   20       31.2
   21       31.8

Source: J. P. Royston [...] (1980). "An objective me[...] shift in basal body tem[...] Biometrics, 36, 217-224 ([...] 221, used by permission).

[Figure 1-1 and Figure 1-2. Stem-and-leaf displays for mean menstrual cycle length.]

In overall appearance the display resembles a histogram with an interval width of 1 day; the leaves add numerical detail, and in this instance they preserve all the information in the data. Figures 1-1 and 1-2 show that two-thirds of the women have mean cycle length between 26.3 and 28.8 days. All but one of the other third have longer mean cycle lengths.
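The construction steps described above (choose a pair of adjacent digits, split each value there, allocate one line per stem, and write each leaf on its line) can be sketched in a few lines of Python. This is only an illustrative sketch, not the program of Velleman and Hoaglin (1981). The function name stem_and_leaf and its unit parameter are our own, the code assumes nonnegative data, and the list of 21 cycle lengths combines the values in Table 1-1 with the two smallest values mentioned in the text (22.9 and 26.3).

```python
from collections import defaultdict

# Mean cycle lengths in days. The values 22.9 and 26.3 are taken from the
# running text; the remaining 19 come from Table 1-1.
CYCLE_LENGTHS = [22.9, 26.3, 26.6, 26.8, 26.9, 26.9, 27.5, 27.6, 27.6, 28.0,
                 28.4, 28.4, 28.5, 28.8, 28.8, 29.4, 29.9, 30.0, 30.3, 31.2,
                 31.8]


def stem_and_leaf(values, unit=0.1):
    """Return the lines of a display with one line per stem.

    Leaves are single digits in multiples of `unit` (hypothetical parameter).
    Assumes nonnegative values that are whole multiples of `unit`.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        # round() only guards against floating-point error (22.9 -> 229).
        scaled = int(round(value / unit))
        stem, leaf = divmod(scaled, 10)   # 229 -> stem 22, leaf 9
        leaves[stem].append(leaf)

    first, last = min(leaves), max(leaves)
    lines = []
    for stem in range(first, last + 1):   # empty stems still get a line
        leaf_text = " ".join(str(d) for d in leaves[stem])
        lines.append(f"{stem:>3} | {leaf_text}".rstrip())
    return lines


if __name__ == "__main__":
    print("\n".join(stem_and_leaf(CYCLE_LENGTHS)))
```

Sorting the values first means the leaves on each line come out in order, which corresponds to the optional final step of ordering the leaves.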
One woman, who appears unusual, has a mean cycle length of only 22.9 days. We regard such an unusual data point as an outlier, and we might well treat it separately from the other data points. If we do, then a stem-and-leaf display for the remaining cycle lengths can give more detail by choosing the stems differently. We return to this example in Section 1B.

Depths of Data Values

A data value can be assigned a rank by counting in from each end of the ordered batch. For example, in Figure 1-2, 26.3 has rank 2 when counting up from 22.9 and rank 20 when counting down from 31.8. The depth of the data value is the smaller of these two ranks, 2 in the example. Because a number of summary values (such as the median and the quartiles or fourths, see Chapter 2) can easily be defined in terms of their depths, it is helpful to present a set of depths with the display.

Except for one middle line, the number in the depth column is the maximum depth associated with data values on that line. Thus the depth of 29.4 is 6. The "middle line" includes the median, and the depth column shows in parentheses the number of leaves on this line. In the example, this number is 6. (When the batch size is even and the median falls between lines, we do not need this special feature.)

If the display has been prepared by hand, adding the count on the middle line an[...] lines provides a simple check that no da[...] example, 9 + 6 + 6 = 21, the total num[...] (Incidentally, care is needed when determining [...] median: in the example one might tend [...] depth of 30.3.) Chapters 2 [...] 3 use t[...] summaries of data.

Organizing the Display

An effective choice of the number of [...] involves the number of data values in the [...] as well as some judgment. To get start[...] lines:

  $L = [10 \times \log_{10} n]$,

where n is the number of data values and [x] is the largest integer not exceeding x. This rule seems to give v[...] displays over the range $20 \le n \le 300$, w[...] example, which has n = 21, it gives

  $L = [10 \times \log_{10} 21] = [10 \times 1.32] = 13$.
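The depth column and the rule for the number of lines can both be computed mechanically. The sketch below shows one way to do it, under assumptions of ours: the function names max_lines and depth_column are hypothetical, the input to depth_column is simply the number of leaves on each line (listed from the lowest stem to the highest), and a line with no leaves receives a blank depth entry.

```python
import math


def max_lines(n):
    """Suggested number of lines, L = [10 * log10(n)], where [x] is the
    largest integer not exceeding x."""
    return math.floor(10 * math.log10(n))


def depth_column(counts):
    """Depth-column entries for a display, one string per line.

    `counts` holds the number of leaves on each line, lowest stem first.
    The line containing the median shows its own count in parentheses;
    every other nonempty line shows the largest depth found on it.
    Blank entries for empty lines are an assumption of this sketch.
    """
    n = sum(counts)
    entries = []
    seen = 0                      # leaves on all lower lines
    for count in counts:
        up = seen + count         # rank of this line's top value, counting up
        down = n - seen           # rank of this line's bottom value, counting down
        seen = up
        if count == 0:
            entries.append("")
        elif 2 * up > n and 2 * down > n:
            entries.append(f"({count})")   # middle line: holds the median
        else:
            entries.append(str(min(up, down)))
    return entries


if __name__ == "__main__":
    # Leaf counts for the cycle-length display, stems 22 through 31.
    print(depth_column([1, 0, 0, 0, 5, 3, 6, 2, 2, 2]))
    print(max_lines(21))
```

When the batch size is even and the two middle values fall on different lines, no line satisfies the middle-line test, so every entry is an ordinary depth, in keeping with the parenthetical remark above.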
Thus unless we decide to treat the value [...] the 10-line stem-and-leaf display of Figu[...] way to do this uses a power of 10 as the interval width. (We discuss other interval widths in Section 1B.) Thus we divide R, the range of the batch, by L and round the quotient up, if necessary, to the nearest power of 10. In the example, the range R = 31.8 - 22.9 = 8.9 and L = 13, so that R/L = .68. Rounding up to the nearest power of 10 gives the value 1 as the interval width. This is the value used in Figures 1-1 and 1-2, which actually require only 10 lines.

1B. SOME VARIATIONS

A segment of stems for the basic stem-and-leaf display might look like this:

  0
  1
  2
  3

and each line may receive leaves 0 through 9. But sometimes this format is too crowded, having too many leaves per line. One effective response is to split lines and repeat each stem, putting leaves 0 through 4 on the * line and 5 through 9 on the . line. In such a display, the interval width is 5 times a power of 10.

EXAMPLE: HARDNESS OF ALUMINUM DIE CASTINGS (n = 30)

Shewhart (1931, p. 42) gives the hardness of 60 aluminum die castings. For the first 30 of these, the data values and two stem-and-leaf displays appear in Figure 1-3. Splitting to get two lines per s[...]

Figure 1-3. [...]

Data: Hardness of aluminum die [...]

  53.0  70.2  84.3  55.3  7[...]
  82.5  67.3  69.5  73.0  5[...]
  74.4  54.1  77.8  52.4  6[...]
  55.7  70.5  87.5  50.7  7[...]

Displays: (unit = 1)

  Depths
    11     5 | 0 1 2 3 3 3 4 5 5 5 9
    (5)    6 | 3 4 7 9 9
    14     7 | 0 0 1 2 3 4 7 8
     6     8 | 2 2 4 5 7 9

  [second display truncated]

[...] Economic Control of Quality of Manufactu[...] Nostrand, Inc. [data from Table 3 (specimen[...]

[...] digits that come after those that serve [...] low-order digits of data values are prefe[...] This practice makes it easier to recover [...]ing to a leaf in the display. Thus the [...] truncated at the decimal point and ap[...] display; the values at 55.7 are not roun[...] data values, we simply locate the three [...] first two digits are 55.

Sometimes the display is still too cr[...] too straggly with one line per stem at t[...]

[Figure 1-4 panel titles: "Stem-and-leaf: Times at which patients' ob[...]"; column headings: Depths, Censoring.]

Figure 1-4.
A stem-and-leaf display with fiv[...] (1982). "Nonparametric estimation for parti[...] data," Biometrics, 38, 417-431 (data from Ta[...]

[...] devote Chapter 7) are centered at 0, and [...] signs. When we start to display such da[...] stem. The only tricky detail is that value[...] or both the -0 stem and the +0 stem [...] roughly equally between the two.

To illustrate this variation, we use the [...] (by the "resistant line" method discu[...]

[Figure 1-4 display: five lines per stem (*, t, f, s, .); leaves not legible.]

EXAMPLE: TUMOR PROGRESSION IN PATIENTS WITH GLIOBLASTOMA

Dinse (1982) gives the times until tumor progression for 172 patients with glioblastoma, a brain tumor. The times for 83 of the patients are censored times in that the tumors for these patients had not progressed at the time the study was concluded. These times are indicated by "+" in the data portion of Figure 1-4, which also shows them in a stem-and-leaf display with 5 lines per stem. The display has a total of 19 lines, and each line has width 2 months. Note that for this example, $L = [10 \times \log_{10} 83] = 19$ lines, $R = 37 - 1 = 36$ months, and $R/L = 36/19 = 1.89$. When this quotient is rounded up to 1, 2, or 5 times the nearest power of 10, we obtain $2 \times 10^{0}$, or 2, as the interval width. This is consistent with the width adopted for the display.

In this display we immediately see the asymmetry of the main part of the data, the clump of values from 30 months to 37 months, and the single value at 25 months. The large number of small censoring times most likely indicates that these patients have not been participating in the study long enough to suffer a progression of their tumors. The practice of recording these times in whole months, so that times within the first month are recorded as 1, explains why there are no values at 0, and it may mean that progression times and censoring times are both rounded up.
The times at 25 months and beyond do not fit into a regular pattern with the rest of the censoring times; they may represent an initial bulge in the rate of entry of patients into the study.

[...] with leaves 0 and 1 on the * line, 2 (two) and 3 (three) on the t line, 4 (four) and 5 (five) on the f line, 6 (six) and 7 (seven) on the s line, and 8 and 9 on the . line. As a reminder in starting to place leaves, the three lettered lines contain leaves whose words begin with that letter. Here the interval width is 2 times a power of 10.

Another variation in the stem-and-leaf display accommodates data that include both positive and negative values. Generally, residuals (to which we [...]

[Figure 1-6. Two stem-and-leaf dis[...] Panel titles: "One Line per Stem wit[...]" and "Two Lines per Stem wi[...]"; each panel has n = 21 and a "low" line.]

[...] such lines in parentheses, as well as sep[...]mas. The first display in Figure 1-6 does [...] lines in Figure 1-2, we focus more attention [...] expense of no longer showing just how fa[...]

Figure 1-5. Stem-and-leaf display of residuals from a line for mean basal body temperature against mean cycle length.

  unit = .01°C

  -3 | 0
  -2 | 0
  -1 | 3 5 5 8
  -0 | 0 2 4 7 9
   0 | 3 6 7 9
   1 | 1 2 5
   2 | 0 3 8

Figure 1-5 shows the stem-and-leaf display, with the residuals in units of .01°C. We notice some tendency for the values to pile up around zero, and the value at -.30 attracts some attention. Together with the one at +.28, it probably deserves a closer look.

In Figure 1-5 and in other stem-and-leaf displays that involve both positive and negative values, we have chosen, perhaps arbitrarily, to have the data values increase from the top of the display toward the bottom. We could equally well handle plus and minus by having the entries increase from bottom to top, and others may prefer this direction. In any one book or paper, however, we usually adopt one of these directions.
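For data with both signs, the display needs a -0 stem as well as a +0 stem. The following sketch shows one way to build such a display with values increasing from top to bottom. It rests on assumptions of ours: the name signed_stem_and_leaf is hypothetical, the values are integers already expressed in display units, exact zeros are assigned alternately to the -0 and +0 stems so that they are shared roughly equally, and the leaves on each line are listed in order of increasing magnitude. The sample values are illustrative only.

```python
def signed_stem_and_leaf(values):
    """Return display lines for integer values of either sign.

    Each value is split into a stem (tens and above) and a leaf (ones digit)
    of its magnitude; the sign decides which set of stems receives the leaf.
    """
    negative, positive = {}, {}
    zero_goes_negative = True          # alternate zeros between -0 and +0
    for value in values:
        stem, leaf = divmod(abs(value), 10)
        if value < 0:
            side = negative
        elif value > 0:
            side = positive
        else:
            side = negative if zero_goes_negative else positive
            zero_goes_negative = not zero_goes_negative
        side.setdefault(stem, []).append(leaf)

    lines = []
    if negative:
        # Most negative stem first, ending with the -0 line.
        for stem in range(max(negative), -1, -1):
            leaf_text = " ".join(str(d) for d in sorted(negative.get(stem, [])))
            label = f"-{stem}"
            lines.append(f"{label:>3} | {leaf_text}".rstrip())
    if positive:
        # Then the +0 line and the increasingly positive stems.
        for stem in range(0, max(positive) + 1):
            leaf_text = " ".join(str(d) for d in sorted(positive.get(stem, [])))
            lines.append(f"{stem:>3} | {leaf_text}".rstrip())
    return lines


if __name__ == "__main__":
    # Illustrative values in units of .01 (not taken from any table).
    sample = [-30, -20, -18, -15, -15, -13, -9, -7, -4, -2, 0,
              3, 6, 7, 9, 11, 12, 15, 20, 23, 28]
    print("\n".join(signed_stem_and_leaf(sample)))
```

Reversing the returned list gives the alternative orientation, with entries increasing from bottom to top.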
Resistant methods are little affected by a small fraction of unusual data values and are an important part of exploratory data analysis. Thus it is unwise for the scale of a stem-and-leaf display to depend on the largest and smallest data values. (The stem-and-leaf display of the mean cycle lengths in Figure 1-2 is clearly influenced by the unusually small value at 22.9 days.) Instead, we often begin by setting aside any unusual data values, and we then base the choice of scale for the display on the rest of the data. (One rule of thumb for setting aside low and high values appears in Section 2C.) We list those outlying values on the lines labeled "low" and "high," beyond the set of stems. To emphasize further the separate treatment of these values, we may place parentheses around the lists. A comma after each [...]

[Depth column from a display with n = 21: 1, 2, 6, (5), 10, 6, 3]

1C. AN HISTORICAL NOTE

[Figure 1-7. Dudley's transcription of data i[...] [...]tion of Industrial Measurements by John W[...] McGraw-Hill Book Company, Inc. Used w[...] Company. Note in figure: "When necessary, allow two o[...] hand column."]

The role and development of graph[...]ceived considerable attention in recent y[...] contributed several novel displays. Fienb[...] use of graphical methods and includes [...]tions. [See also Wainer and Thissen (198[...]

[...] L and we round R/L up to .5, giving a display with two lines per stem. The second part of Figure 1-6 shows the result, which probably has too many lines for some viewers. Both displays list the 22.9 separately, and the choice between them is a matter of taste.

With its variations, the stem-and-leaf display has proved to be a versatile technique for the analyst's first look at a batch of numbers. The three ways of factoring 10 (1 × 10, 2 × 5, and 5 × 2) provide adequate control over scaling (1, 2, or 5 lines per stem) especially when combined with "low" and "high" lines for unusual data values.
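The scaling rule (divide the range R by the number of lines L, then round the quotient up to 1, 2, or 5 times a power of 10) is easy to mechanize. The sketch below is a minimal illustration; the function name choose_width and the lookup table are hypothetical, and the floating-point comparison is deliberately naive.

```python
import math

# Multiplier of the power of 10 -> lines per stem. A width of 2 times a
# power of 10 means five lines per stem; a width of 5 times a power of 10
# means two lines per stem.
LINES_PER_STEM = {1: 1, 2: 5, 5: 2, 10: 1}


def choose_width(data_range, max_lines):
    """Round data_range / max_lines up to 1, 2, or 5 times a power of 10.

    Returns (interval width, lines per stem). Assumes data_range > 0.
    """
    quotient = data_range / max_lines
    power = math.floor(math.log10(quotient))
    for multiplier in (1, 2, 5, 10):
        width = multiplier * 10.0 ** power
        if width >= quotient:
            return width, LINES_PER_STEM[multiplier]
    raise ValueError("no width found; check that data_range is positive")


if __name__ == "__main__":
    print(choose_width(36, 19))            # glioblastoma censoring times
    print(choose_width(31.8 - 22.9, 13))   # all 21 cycle lengths
```

With R = 36 and L = 19, the quotient 1.89 rounds up to a width of 2, which corresponds to five lines per stem, as in the glioblastoma example. If unusual values are set aside on "low" and "high" lines first, the range passed to the function should be that of the remaining data.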
Setting aside potential outliers in this way often leads to a more detailed and more effective display; it also focuses attention on the unusual data values so that we do what we can to probe their surrounding circumstances for clues.

The histogram, the better known relative of the stem-and-leaf display, has been in use for many years. Beniger and Robyn (1978) trace the origin of the term to Karl Pearson in 1895, and they mention earlier examples going back to the bar chart published by William Playfair in 1786 in his Commercial and Political Atlas. Of course, the primary difference between the histogram and the stem-and-leaf display is the use of a digit from each data value to form the display.

Without any systematic search for possible predecessors of the stem-and-leaf display, we report finding a digit-based display (Figure 1-7) in the text by Dudley (1946, p. 22). In the way it groups matching "leaves" together and spreads the groups out over the line, this technique serves more as a sorting and tallying device than as a semigraphic display. Still, its similarity to the stem-and-leaf display emphasizes the convenience of working with digits from the data values.

1D. SORTING

Because many exploratory techniques wo[...] the simplest way of gaining resistance[...]sion of sorting, the mechanical process o[...] (usually, from smallest to largest). Re[...] details may skip this section without los[...]

In hand work, as we mentioned in Se[...] leaf display accomplishes much of the [...] remains is to rearrange the leaves on ea[...] "bucket sort" or "radix sort," with [...] "buckets." The skeleton structure of stem[...] the data establishes the range, and then