entries. This statistic gives us the range of the remaining 80% of data entries. We can also compute the 5% trimmed range by deleting from the computation the 5% of values that are the farthest away from the mean. We will also see in a forthcoming section something called a box-plot, which shows us graphically both the full range and the range of the central 50% of the data after you have disregarded the top 25% and the bottom 25%. This last range is called the interquartile range, the distance between the first and third quartiles, which are the values that split the data into four equal parts. These various notions of the range do not use the exact values of all the data in their computation. The following measures do.

STANDARD DEVIATION

The most important measure is the standard deviation. To explain what it is we must first define some simpler notions, such as the deviation from the mean. For an individual data entry x_i, the deviation from the mean is the distance that separates it from the mean. If we want to write it in symbols, we will have to use two different symbols, depending on whether we have a sample or a population.

For a sample, the deviation from the mean is written: (x_i − x̄)

For a population, the deviation from the mean is written: (x_i − μ)

The list of all deviations from the mean may give us a good impression of how spread out the data is.

Example

Consider the following distribution, representing the grades out of ten of a group of 14 students:

4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10

Here the mean is given by 105/14 = 7.5. The deviations from the mean are given in Table 3.6.

Table 3.6 Calculation of the deviations from the mean

Data entry x_i     Deviation from the mean: (x_i − x̄)
4                  4 − 7.5 = −3.5
5                  5 − 7.5 = −2.5
5                  5 − 7.5 = −2.5
6                  6 − 7.5 = −1.5
7                  7 − 7.5 = −0.5
7                  7 − 7.5 = −0.5
8                  8 − 7.5 = 0.5
8                  8 − 7.5 = 0.5
8                  8 − 7.5 = 0.5
9                  9 − 7.5 = 1.5
9                  9 − 7.5 = 1.5
9                  9 − 7.5 = 1.5
10                 10 − 7.5 = 2.5
10                 10 − 7.5 = 2.5

But that list may be long. We want to summarize it and end up with a single numerical value that constitutes a measure of how dispersed the data is. We could take the mean of all these deviations. If you perform the computation for the mean deviation, you will get a mean deviation equal to zero (do the computation yourself on the preceding example). This is no accident. Indeed, we can easily show that the mean of these deviations is necessarily zero, as the positive deviations are cancelled out by the negative deviations.

The mathematical proof (which is given only for those who are interested and which can be ignored otherwise) goes like this:

Sum of all deviations from the mean = Σ(x_i − x̄) = Σx_i − Σx̄ = n·x̄ − n·x̄ = 0

(Explanation: Recall that the sum of all entries is equal to n times the mean, and that the mean, in the second summation, is counted n times. This is why we get n times the mean twice, once with a positive sign and once with a negative sign.)

We thus conclude that the deviations from the mean always add up to zero, and therefore we cannot summarize them by finding their mean.
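You can check this cancellation on the example above. The following short Python sketch (used here purely as a hand calculator outside SPSS; the variable names are arbitrary) recomputes the deviations of Table 3.6 and shows that they add up to zero:

    # Grades out of ten of the 14 students in the example
    grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

    mean = sum(grades) / len(grades)           # 105 / 14 = 7.5
    deviations = [x - mean for x in grades]    # the second column of Table 3.6

    print(deviations)       # [-3.5, -2.5, -2.5, ..., 2.5, 2.5]
    print(sum(deviations))  # 0.0 -- the positive and negative deviations cancel out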
The way around this difficulty is the following: we will square the deviations, and then take their mean. By squaring the deviations, we get rid of the negative signs, and the positive and negative deviations do not cancel out any more. This operation changes their magnitude, however, and gives an erroneous impression about the real dispersion of the data, since the deviations are all squared. This distortion will be corrected by taking the square root of the result, which brings it back to an order of magnitude similar to the original deviations. In summary, we end up with the following calculation.

Standard deviation for a population, denoted by the symbol σ:

σ = √( Σ(x_i − μ)² / N )

In the case of a sample, μ will be replaced by x̄ and N will be replaced not by n, but by n − 1. The reason why we write n − 1 instead of n is due to some of the mathematical properties of the standard deviation. It can be proven that using n − 1 in the formula gives a better prediction of the standard deviation of a population when we know that of the sample.

Conclusion: the standard deviation for a sample, denoted by the symbol s, is given by:

s = √( Σ(x_i − x̄)² / (n − 1) )

The standard deviation (often written st. dev.) is the most powerful measure of dispersion for quantitative data. It will permit us to do very sophisticated descriptions of various distributions. All the calculations of statistical inference are also made possible by the use of the standard deviation.

VARIANCE

Another useful measure is the variance, which is defined as the square of the standard deviation. It is thus given by:

variance of a sample = s²
variance of a population = σ²

THE COEFFICIENT OF VARIATION

Finally, we can define the coefficient of variation. To explain the use of this measure, suppose you have two distributions having the means and standard deviations given below:

Distribution 1: mean = 30, st. dev. = 3
Distribution 2: mean = 150, st. dev. = 3

In one case the center of the distribution is 30, indicating that the data entries fall in a certain range around the value 30; their magnitude is around 30. In the other case, the mean is 150, indicating that the data entries fall in a range around the value 150 and have an average magnitude of 150. Although they have the same dispersion (measured by the standard deviation), the relative importance of the dispersion is not the same in the two cases because the magnitude of the data is different. In one case the entries revolve around the value 30, and the standard deviation is equal to 3/30, that is, 10% of the average value of the entries. In the other case, the entries revolve around the value 150 and the standard deviation is about 3/150, that is, 2% of the average value of the entries, a value which denotes a smaller relative variation. There is a way to assess the relative importance of the variation among the entries, by comparing this variation with the mean. The measure is called the coefficient of variation.

The coefficient of variation is defined as the standard deviation divided by the mean, and multiplied by 100 to turn it into a percentage. The formula is thus:

Coefficient of variation: CV = (σ / μ) × 100

This measure will only be used occasionally.
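To make these formulas concrete, here is a small Python sketch (again a hand check outside SPSS, using only the standard library) that computes the sample standard deviation, the variance and the coefficient of variation for the 14 grades of the earlier example:

    import math

    grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]
    n = len(grades)
    mean = sum(grades) / n                       # x-bar = 7.5

    # Square the deviations, average them over n - 1, then take the square root.
    s = math.sqrt(sum((x - mean) ** 2 for x in grades) / (n - 1))
    variance = s ** 2
    cv = (s / mean) * 100                        # coefficient of variation, in %

    print(s, variance, cv)   # approximately 1.91, 3.65 and 25.5
    # The standard library function statistics.stdev(grades) computes the same
    # sample standard deviation (the one that divides by n - 1).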
Measures of Position

Measures of position are used for quantitative variables, measured at the numerical scale level. They could sometimes be used for variables measured at the ordinal level. They provide us with a way of determining how one individual entry compares with all the others.

The simplest measure of position is the quartile. If you list your entries in ascending order according to size, the quartiles are the values that split the ranked population into four equal groups. Twenty-five percent of the population has a score less than or equal to the 1st quartile (Q1), 50% has a score less than the 2nd quartile (Q2), and 75% has a score less than the 3rd quartile (Q3). Recall that we have seen earlier a measure of dispersion called the interquartile range, which is the difference between Q3 and Q1. Figure 3.1 illustrates the way the quartiles divide the ordered list of units in a sample or in a population.

Figure 3.1 The quartiles are obtained by ordering the individuals in the population by increasing rank, and then splitting it into four equal parts, each containing 25% of the population. The quartiles are the values that separate these four parts

In a similar way, we can define the deciles: they split the ranked population into ten equal groups. If a data entry falls in the first decile, it means that its score is among the lowest 10%. If it is in the 10th decile, it means it is among the top 10%.

The most common measure of position, however, is the percentile rank. The data is arranged by order of size (recall it must be quantitative) and divided into 100 equal groups. The numerical values that separate these 100 groups are called percentiles. The percentile rank of a data entry is the rank of the percentile group this entry falls into. For example, if you are told that your percentile rank in a national exam is 83, this means that you fall within the 83rd percentile. Your grade is just above that of 82% of the population, and just below that of 17% of the population. You will learn in the SPSS session how to display the percentile ranks of the data entries.

You may have realized by now the connection between the median and the various measures of position, since the median divides your ranked population into two equal groups. The median is equal to the 50th percentile. It is also equal to the 5th decile, and of course the 2nd quartile.
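As an illustration, the sketch below uses Python's statistics module to mirror these definitions on the 14 grades (statistical packages, including SPSS, may use slightly different interpolation rules, so the exact cut points can differ a little from one program to another):

    import statistics

    grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

    # Quartiles: the three values that split the ordered data into four equal groups.
    q1, q2, q3 = statistics.quantiles(grades, n=4)
    print(q1, q2, q3)                   # Q1, Q2 and Q3
    print(q3 - q1)                      # the interquartile range (Q3 - Q1)
    print(statistics.median(grades))    # equals Q2: the median is the 50th percentile

    # Deciles: the nine values that split the ordered data into ten equal groups.
    print(statistics.quantiles(grades, n=10))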
[Bar chart figures: Marital Status (missing, married, widowed, divorced, separated, never married); Age Categories (18-29, 30-39, 40-49, 50-…)]

… the number of people who speak a given language. You can choose to have the Y-axis represent percentages instead of counts. The chart shown in Figure 3.2 represents the percentages of the various marital categories.

The variable on the X-axis could also be a quantitative variable that has been grouped into a small number of categories. For instance, we could have agecat4 as the variable on the X-axis. The bars would then represent the number of people found in each of the four age categories. In this kind of bar graph, you must be careful about the range (that is, the length of the interval) of each of the categories. If the categories are intervals that do not have the same length, you may get the wrong impression that one group is more numerous than the other, such as with the group of people who are 50 years old or more in the chart shown in Figure 3.3. However, this group (50 years and older) spans a range of ages which is much wider than the other groups: close to 40 years (from 50 years to 89 years exactly). If we regroup the respondents into age categories that are equal or almost equal, we get the chart in Figure 3.4. This bar chart is a much better representation of the distribution of ages than the previous one.

Figure 3.4 A bar chart where the category 50+ years has been broken down into four categories (X-axis: age in seven categories, 18-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81+)

In a clustered bar chart, each column is subdivided into several columns representing the categories of a second variable. For instance, each column could be split in two, for men and for women. Figure 3.5 provides an example of a clustered bar chart where the height of the columns represents the number of people in each category. In a clustered bar chart, it is generally preferable to display the percentages of the various categories rather than their frequencies. Look for instance at the clustered bar chart displayed in Figure 3.5. We see that in every category, women are more numerous than men. This is so because the sample as a whole contains more women. This chart does not allow us to assess how the percentages of men and women compare in each category. If we display the percentages rather than the frequencies (the count), we get the chart illustrated in Figure 3.6.

Figure 3.5 A clustered bar chart where the height of the columns represents the number of people in each category (X-axis: Marital Status; legend: Respondent's Sex, male/female)

[Figure 3.6: the same clustered bar chart displaying percentages; legend: Respondent's Sex]

… you want to highlight the quantity associated with every category on the X-axis. A bar chart where the vertical axis does not start at 0 can be very misleading, for if the columns are truncated at their base, the differences in height between them can appear to be more important than they really are. Consequently, as a general rule, bar charts should start at zero and should not be truncated at their base. Finally, it should be said that bar charts could also be presented horizontally, by interchanging the X- and Y-axes.

Pie Charts

Pie charts (Figure 3.8) are most useful when you want to illustrate proportions, rather than actual quantities. They show the relative importance of the various categories of the variable. In SPSS you have the option of including missing values as a slice in the pie, or excluding them and dividing the pie among valid answers. The details of how to do that are explained in Lab 5.

Pie charts are better suited when we want to convey the way a fixed amount of resources is allocated among various uses. For instance, the way a budget is spent over various categories of items is best represented by a pie chart. When the emphasis is on the amount of money spent on each budget item, rather than on the way the budget is allocated, a bar chart is more suggestive. However, both bar charts and pie charts are appropriate to represent the distribution of a nominal variable, and there is no clear-cut line of demarcation that would tell us which of the two is preferable.

Figure 3.8 Pie chart illustrating the proportion of people who own a home as compared to those who pay rent (slices: owns home, pays rent, other). One of the options in the pie chart command allows you to either include or exclude the category of missing answers. In this diagram it has been excluded from the graph

Histograms

Histograms are useful when the variable is quantitative. The data are usually grouped into classes, or intervals, and then the frequency of each class is represented by the height of a bar.
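These charting principles are built into SPSS's menus, but they can also be illustrated outside SPSS. The following Python sketch (using the third-party matplotlib library, with invented counts and ages chosen purely for illustration) draws a bar chart whose vertical axis starts at zero and a histogram whose classes have equal width:

    import matplotlib.pyplot as plt

    # Hypothetical counts for a nominal variable (values invented for illustration).
    categories = ["married", "widowed", "divorced", "separated", "never married"]
    counts = [820, 140, 230, 60, 450]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Bar chart: one bar per category; the vertical axis starts at zero,
    # so the differences in height are not exaggerated.
    ax1.bar(categories, counts)
    ax1.set_ylim(bottom=0)
    ax1.set_ylabel("Count")
    ax1.set_title("Bar chart of a nominal variable")

    # Histogram: a quantitative variable grouped into classes of equal width;
    # the height of each bar is the frequency of that class.
    ages = [23, 27, 31, 34, 35, 38, 41, 44, 45, 47, 52, 55, 58, 61, 64, 67, 71, 74, 78, 83]
    ax2.hist(ages, bins=range(20, 91, 10))   # classes 20-29, 30-39, ..., 80-89
    ax2.set_xlabel("Age")
    ax2.set_ylabel("Frequency")
    ax2.set_title("Histogram of a quantitative variable")

    plt.tight_layout()
    plt.show()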