INTERPRETING QUANTITATIVE DATA WITH SPSS

to back with horizontal bars pointing in opposite directions, where each bar represents a five-year span. This type of histogram is called a population pyramid. In a population pyramid, the last class is left open. Usually it is the '80 years and more' class, as shown in Figure 3.10.

FREQUENCY POLYGONS AND DENSITY CURVES

If we join all the midpoints at the top of the columns in a histogram, we get what is called a frequency polygon. The polygon shows the general pattern of the distribution. Imagine now a frequency polygon drawn on a histogram with a very large number of columns. We could redraw it as a smooth curve, called a density curve (Figure 3.11). A density curve is drawn in such a way that the total area under it is equal to 1. The area under the curve between any two values then gives the exact proportion of the data that falls between these two values.

Figure 3.11 A density curve can be thought of as the curve resulting from joining the midpoints at the top of the various bars of a histogram with a large number of classes

We can now be more specific about the definition of the mode for a quantitative variable. If the variable is represented by a histogram, the mode is the class with the highest frequency. If it is represented by a density curve, the mode is the x-value that corresponds to the highest point on the density curve.

HISTOGRAM OR BAR CHART?

When we have a quantitative variable that has been grouped into a small number of categories, we can represent it either by a histogram or by a bar chart. But which of the two representations is better? It depends on what we want to convey. To explain this point, consider a situation where we have the variable Age represented by seven categories as shown in Figure 3.4. If we want to convey how the ages of the sample studied are distributed over the whole range of ages, the histogram shown in Figure 3.9 is better.
But if we want to show how the various age groups are divided among men and women, or among married vs. unmarried individuals, a clustered bar chart allows us to do that, as shown in Figure 3.12. A histogram would not permit us to juxtapose corresponding categories of age groups for men and women. In Figure 3.12, we see the distribution separately for men and women, and we can determine that women are more represented in the older categories, as they tend to live longer than men.

UNIVARIATE DESCRIPTIVE STATISTICS

Figure 3.12 A clustered bar chart allows us to show the pattern of ages separately for men and for women. It is appropriate for a quantitative variable grouped into a small number of categories (horizontal axis: Age into 7 categories, 18-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81+; legend: Respondent's Sex, Male and Female)

Notice that the vertical axis represents the percentages, not the frequencies. If we made it represent the frequencies instead, we would still get the same general shape, but we would not be able to determine whether men or women are more represented in a given class, as the overall number of women in this sample is greater than the overall number of men. In almost every age category, we would therefore find more women, not because a higher percentage of women (as opposed to men) fall into that category, but because there are more women in the sample as a whole.

Box plots

Box plots are very useful to show how the values of a quantitative variable are distributed. The box plot indicates the minimum and maximum values, and the three quartiles. The central 50% of the data (the 2nd and 3rd quarters) are represented as a shaded solid box, whereas the first and last quarters are represented by thin lines. The box plot automatically gives the five-number summary of the data: the minimum, the 1st quartile, the median (which is the 2nd quartile), the 3rd quartile, and the maximum. In symbols, the five-number summary is given by: Min, Q1, Median, Q3, Max.
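The five-number summary can also be computed outside SPSS. The following is a minimal sketch in Python, not the procedure SPSS itself uses: the function name five_number_summary and the list of ages are hypothetical, and the quartiles are obtained with the 'inclusive' interpolation method of Python's statistics module, which is only one of several conventions for computing quartiles. Other software may return slightly different quartile values for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    # The 'inclusive' method treats the data as the whole population and
    # interpolates linearly between neighbouring values when needed.
    # This is an assumed convention; quartile definitions vary by package.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Hypothetical ages of nine college students, including one extreme value.
ages = [18, 19, 20, 21, 22, 23, 25, 30, 69]
minimum, q1, median, q3, maximum = five_number_summary(ages)
print("Min, Q1, Median, Q3, Max:", minimum, q1, median, q3, maximum)
print("Interquartile range:", q3 - q1)
```

The box of the box plot spans Q1 to Q3, so its length is the interquartile range printed on the last line.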
The box plot is shown in Figure 3.13.

Figure 3.17 Symmetric and skewed distributions (panels: symmetric distribution, positively skewed distribution, negatively skewed distribution)

How can we know that a distribution is skewed? The first indication is the histogram: the tail end of the histogram is longer on one side than on the other. We can also see that a distribution is skewed through its numerical features: the mean is different from the median. When the distribution is positively skewed, the mean is larger than the median, as it is pushed by the extreme values toward the longer tail. For negatively skewed distributions, the mean is smaller than the median. Therefore, a mean larger than the median tells us that the extreme values on the higher end of the distribution are much larger than the bulk of the data in the distribution, pulling the mean toward the positive side. This is illustrated by the numerical example given in the section on the median, where one extreme value (60) pulls the mean up but does not affect the median. Therefore, when a distribution is highly skewed, the median is usually a better representative of the center of the data than the mean.

Kurtosis

This is a measure of the degree of peakedness of the curve. It tells you whether the curve representing the distribution tends to be very peaked, with a high proportion of data entries clustered near the center, or rather flat, with data spread out over a wide range. A normal distribution has a kurtosis equal to 0. A positive value indicates that the data is clustered around the center, and that the curve is highly peaked. A negative value indicates that the data is spread out, and that the curve is flatter than a normal curve. Figure 3.18 shows three curves with zero, positive, and negative kurtosis respectively.
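To make the two shape measures concrete, here is a minimal Python sketch that compares the mean with the median and computes a skewness coefficient and a kurtosis coefficient from the central moments of the data. The function name shape_measures and the data are hypothetical. The formulas are the simple population-moment versions, with 3 subtracted from the kurtosis so that a normal curve scores 0 as stated above; statistical packages often apply small-sample corrections, so their output may differ slightly from these values.

```python
def shape_measures(values):
    """Return (mean, median, skewness, kurtosis) for a list of numbers."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(data) / n
    middle = n // 2
    median = data[middle] if n % 2 else (data[middle - 1] + data[middle]) / 2
    # Central moments of order 2, 3 and 4 (population versions, dividing by n).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")
    skewness = m3 / m2 ** 1.5
    # Subtracting 3 places the normal curve at a kurtosis of 0.
    kurtosis = m4 / m2 ** 2 - 3
    return mean, median, skewness, kurtosis


# Hypothetical data: one extreme value on the high end.
mean, median, skewness, kurtosis = shape_measures([2, 3, 4, 5, 60])
print("mean:", mean, "median:", median)
print("skewness is positive:", skewness > 0)
```

In this example the single extreme value pulls the mean well above the median, and the skewness coefficient is positive, in line with the rule that a mean larger than the median signals a positively skewed distribution.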
Figure 3.18 Illustration of zero, positive and negative kurtosis (panels: a curve with kurtosis = 0, a curve with positive kurtosis, a curve with negative kurtosis)

Methodological Issues

Although they seem to be simple, descriptive measures can be tricky to use. We would like to point out here some of the pitfalls and difficulties associated with their use.

The Definition of the Categories over which the Counting is Done

Suppose I say that the passing rate in a given class is 82%. In another college, a colleague tells me that his passing rate is 95%. Before concluding that his passing rate is much higher, I have to make sure that we are defining the passing rate in the same way. I may define the passing rate as the number of students who pass a course compared to those who were registered at the beginning of the semester. If he defines it the same way, we can make meaningful comparisons. But if he defines it as the number of students who pass the course compared to the number registered at the end of the semester, we cannot make a meaningful comparison. This is so because all the students who dropped out would not be taken into account in his calculation, whereas they would be taken into account in mine. A careful definition of the categories used to define a concept is therefore important. Such problems arise when we define the unemployment rate in various countries, or even wealth. The conclusion is that careful attention should be given to the way categories are defined when comparing the statistics that refer to different populations.

Outliers

Outliers are values that are unusually large or unusually small in a distribution. They have to be examined carefully to determine if they are the result of an error of measurement, or a typing error, or whether they actually represent an extreme case.
For instance, the value 69 in the column of the variable age for college students could be a typing error, but it could also represent the interesting case of a retired person who decided to pursue a college program. Even if they represent an extreme case, it may be desirable to disregard extreme values in some of the statistical computations. When producing a Box Plot diagram, SPSS excludes the outliers from the computation, and prints them above or below the box plot. An option allows users to have the case number printed next to the dot representing the outlier, so as to be able to identify the case and examine it more closely.

Summary

We have seen in this chapter the various measures used to summarize the data pertaining to a single variable as well as the various types of charts that could be used to illustrate the distribution. You should keep in mind one fundamental point: the level of measurement used for the variable determines which measures and graphs are appropriate. It does not make sense, for example, to compute the mean of the variable when the level of measurement is nominal, that is, when the variable is qualitative.

There are three types of univariate descriptive measures:
• measures of central tendency,
• measures of dispersion, and
• measures of position.

Measures of central tendency, also called measures of the center, tell us the values around which most of the data is found. They give us an order of magnitude of the data, allowing comparisons across populations and subgroups within a population. They include the mean, the median, and the mode. The mean should not be used when the variable is qualitative.

Measures of dispersion are an indication of how spread out the data is. They are mostly used for quantitative data. The most important ones are the range, the interquartile range, the variance, and the standard deviation.
Measures of position tell us how one particular data entry is situated in comparison to the others. The percentile rank is one such measure. Other measures include the quartiles and the deciles.

In addition to these measures, we have seen the weighted mean. When calculating it, the various entries are multiplied by a weight, which is a positive number between 0 and 1. All the weights add up to 1. The weighted mean is used when the numbers that are averaged have been calculated over populations of unequal size. For instance, if you have the birth rates in all Canadian provinces and you want to find the average birth rate for Canada as a whole, you must weight these numbers by the demographic importance of every province. The weighted mean is also used when you want to increase or decrease the relative importance of the numbers you are averaging, as is done when finding the average grade over exams that do not count for the same percentage in the final grade.

When categories are involved (either because the variable is qualitative, or when quantitative values have been grouped) we can find ratios, percentages, and proportions of the groups corresponding to the categories.

The general shape of a distribution is analyzed in terms of symmetry or skewness, and in terms of kurtosis (the degree to which the curve is peaked).

The comparison of the mean and the median is very useful. Recall the following:
• If the distribution is very skewed, the median is a better representative of the center of the data, as the extreme values tend to pull the mean towards one side of the curve. The median is not affected by extreme values.
• If the mean is larger than the median, the distribution is positively skewed.
• If the mean is smaller than the median, the distribution is negatively skewed.
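The weighted mean recalled in this summary can be illustrated with a short Python sketch. The function name weighted_mean and all the numbers are hypothetical. The sketch accepts either weights that already add up to 1 (such as the shares of exams in a final grade) or raw sizes (such as population counts), and rescales them so that they add up to 1 before averaging.

```python
def weighted_mean(values, weights):
    """Return the mean of values, each multiplied by its relative weight."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and of equal length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must not be negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Rescale so that every weight lies between 0 and 1 and all add up to 1.
    normalized = [w / total for w in weights]
    return sum(v * w for v, w in zip(values, normalized))


# Hypothetical exam grades counting for 20%, 30% and 50% of the final grade.
final_grade = weighted_mean([80, 60, 90], [0.2, 0.3, 0.5])
print("final grade:", round(final_grade, 2))

# Hypothetical rates measured in two groups of unequal size (1000 and 3000).
overall_rate = weighted_mean([10.0, 20.0], [1000, 3000])
print("overall rate:", round(overall_rate, 2))
```

In the second example the larger group counts three times as much as the smaller one, so the overall rate lies closer to the larger group's rate than the simple average of the two rates would.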
As for the graphical representation of a distribution, recall again that the level of measurement of the variable determines what kind of chart is appropriate. Bar charts and pie charts are appropriate when the data is qualitative, or measured at the nominal or ordinal levels. Quantitative data (whether measured at the ordinal or numerical scale levels) could also be represented by bar charts or pie charts if the values have been grouped into a small number of categories. The essential difference between pie charts and bar charts is that in the former, the emphasis is on the relative importance of each category as compared to the other categories, whereas in the latter, the emphasis is on the size of each category. However, there is no clear-cut distinction between the two, and if one is appropriate, the other is usually appropriate also, even if the emphasis is slightly different. The great advantage of bar charts is that they allow comparisons between the distributions of subgroups, with the help of clustered bar charts.

Quantitative variables are better represented through histograms. A specific type of histogram is the population pyramid, which is a standard tool in demography. Line charts are most suited to represent the variation of a quantity across time.

In all kinds of charts, truncating the Y-axis is sometimes done to zoom in on the variations of the variables and to represent them in a more detailed way. However, we should be aware of the fact that truncating the Y-axis may also convey a mistaken impression that the variations of the variable are more important than they are in reality.
Keywords

Univariate
Bivariate
Measures of central tendency
Measures of dispersion
Measures of position
Mean
Trimmed mean
Weighted mean
Median
Frequencies
Cumulative frequencies
Valid percent
Range
Trimmed range
Interquartile range
Deviation from the mean
Standard deviation
Variation ratio
Ratios
Proportions
Bar graph
Clustered bar graph
Pie chart
Histogram
Frequency polygon
Line chart
Box plot
Mode
Coefficient of variation
Five-number summary
Modal category
Quartiles
Symmetry
Majority
Deciles
Skewness
Plurality
Percentiles
Kurtosis
Percentile rank
Outliers

Suggestions for Further Reading

Devore, Jay and Peck, Roxy (1997) Statistics: The Exploration and Analysis of Data (3rd edn). Belmont, Albany: Duxbury Press.
Harnett, Donald H. and Murphy, James L. (1993) Statistical Analysis for Business and Economics. Don Mills, Ontario: Addison-Wesley Publishers.
Trudel, Robert and Antonius, Rachad (1991) Méthodes quantitatives appliquées aux sciences humaines. Montréal: CEC.
Wonnacott, Thomas H. and Wonnacott, Ronald J. (1977) Introductory Statistics (3rd edn). New York: John Wiley and Sons.

EXERCISES

3.1 Complete the following sentences:
(a) Three types of measures are useful to summarize a numerical distribution. They are ______, ______ and ______.
(b) The most frequent value in a distribution is called ______.
(c) When the values of the distribution are grouped into classes, the mode is the ______ with the highest frequency.
(d) When there are two classes that are bigger than the ones immediately next to them, the distribution is called ______.
(e) If the modal class includes more than 50% of the population, we say that it constitutes the ______. Otherwise, we simply talk of a ______.
(f) The median falls ______ of the ordered list of entries. ______% of the data are less than or equal to the median, and ______% are larger than or equal to it.
(g) The mean of a numerical distribution is equal to the ______ of all entries divided by ______.
(h) The mathematical measure used to find the mean when the entries do not have the same relative importance is called ______.