INTERPRETING QUANTITATIVE DATA WITH SPSS
to back with horizontal bars pointing in opposite directions, where each bar represents a five-year span. This type of histogram is called a population pyramid. In a population pyramid, the last class is left open. Usually it is the '80 years and more' class, as shown in Figure 3.10.
FREQUENCY POLYGONS AND DENSITY CURVES
If we join all the midpoints at the top of the columns in a histogram, we get what is called a frequency polygon. The polygon shows the general pattern of the distribution. Imagine now a frequency polygon drawn on a histogram with a very large number of columns. We could redraw it as a smooth curve, called a density curve (Figure 3.11). A density curve is drawn in such a way that the total area under it is equal to 1. The area under the curve between any two values then gives the exact proportion of the data that falls between these two values. We can now be more specific about the definition of the mode for a quantitative variable.
Figure 3.11 A density curve can be thought of as the curve resulting from joining the midpoints at the top of the various bars of a histogram with a large number of classes
If the variable is represented by a histogram, the mode is the class with the highest frequency. If it is represented by a density curve, the mode is the x-value that corresponds to the highest point on the density curve.
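The two ideas above, a total area of 1 and the modal class, can be checked on a small example. The following Python sketch is only an illustration: the ages are invented, and the class width of ten years is an assumption made for the example.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only.
ages = [19, 22, 23, 25, 27, 31, 34, 35, 36, 38, 39, 42, 44, 47, 53, 58, 64]
class_width = 10  # assumed class width: 10-19, 20-29, 30-39, ...

# Frequency of each class, keyed by the lower bound of the class.
counts = Counter((age // class_width) * class_width for age in ages)

# Height of each bar on a density scale:
# proportion of the data in the class, divided by the class width.
n = len(ages)
density = {low: count / n / class_width for low, count in counts.items()}

# The total area of the bars (height x width) is 1, as for a density curve.
total_area = sum(height * class_width for height in density.values())

# The modal class is the class with the highest frequency.
modal_low = max(counts, key=counts.get)
modal_class = (modal_low, modal_low + class_width - 1)

print(total_area, modal_class)
```

With these invented values, the modal class is the one that runs from 30 to 39, since it contains six of the seventeen ages.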
HISTOGRAM OR BAR CHART?
When we have a quantitative variable that has been grouped into a small number of categories, we can represent it either by a histogram or by a bar chart. But which of the two representations is better? It depends on what we want to convey. To explain this point, consider a situation where we have the variable Age represented by seven categories as shown in Figure 3.4. If we want to convey how the ages of the sample studied are distributed over the whole range of ages, the histogram shown in Figure 3.9 is better. But if we want to show how the various age groups are divided among men and women, or among married vs. unmarried individuals, a clustered bar chart allows us to do that, as shown in Figure 3.12. A histogram would not permit us to juxtapose corresponding categories of age groups for men and women.
In Figure 3.12, we see the distribution separately for men and women, and we can determine that women are more represented in the older categories, as they tend to live
UNIVARIATE   DESCRIPTIVE   STATISTICS
[Figure 3.12: clustered bar chart. Legend: Respondent's Sex (Male, Female). Horizontal axis: Age into 7 categories (18-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81+).]
Figure 3.12    A clustered bar chart allows us to show the pattern of ages separately for men and for women. It is appropriate for a quantitative variable grouped into a small number of categories
longer than men. Notice that the vertical axis represents the percentages, not the frequencies. If we made it represent the frequencies instead, we would still get the same general shape, but we would not be able to determine whether men or women are more represented in a given class, as the overall number of women in this sample is greater than the overall number of men. In almost every age category, we would therefore find more women, not because a higher percentage of women (as opposed to men) fall into that category, but because there are more women in the sample as a whole.
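A small numerical sketch may make this point concrete. The counts below are invented for illustration (they are not the data behind Figure 3.12), and the four age groups are a simplification.

```python
# Hypothetical counts per age group, invented for illustration only.
men = {"18-30": 40, "31-40": 30, "41-50": 20, "51+": 10}     # 100 men in all
women = {"18-30": 60, "31-40": 45, "41-50": 30, "51+": 15}   # 150 women in all

def percentages(counts):
    """Convert the frequencies of one group into percentages of that group."""
    total = sum(counts.values())
    return {group: 100 * count / total for group, count in counts.items()}

men_pct = percentages(men)
women_pct = percentages(women)

for group in men:
    print(group, men[group], women[group], men_pct[group], women_pct[group])
```

In this invented sample, every age group contains more women than men, yet the percentages are identical for the two sexes. A chart of frequencies would suggest a difference that is due only to the larger number of women in the sample.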
Box Plots
Box plots are very useful to show how the values of a quantitative variable are distributed. The box plot indicates the minimum and maximum values, and the three quartiles. The central 50% of the data (the 2nd and 3rd quarters) are represented as a shaded solid box, whereas the first and last quarters are represented by thin lines.
The box plot automatically gives the five-number summary of the data: the minimum, the 1st quartile, the median (which is the 2nd quartile), the 3rd quartile, and the maximum.
In symbols, the five-number summary is given by: Min, Q1, Median, Q3, Max. The box plot is shown in Figure 3.13.
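As a rough illustration, the five-number summary can be computed with a few lines of Python. The ages are invented, and the quartiles are obtained with the default method of the standard `statistics.quantiles` function; other programs, SPSS included, may use slightly different conventions for quartiles, so small differences are possible.

```python
from statistics import quantiles

# Hypothetical ages of eleven respondents, invented for illustration only.
ages = [18, 21, 24, 27, 30, 35, 41, 48, 56, 63, 75]

# quantiles(..., n=4) returns the three cut points Q1, median (Q2) and Q3.
q1, median, q3 = quantiles(ages, n=4)

# Min, Q1, Median, Q3, Max
five_number_summary = (min(ages), q1, median, q3, max(ages))
print(five_number_summary)
```

The box of the box plot would run from Q1 to Q3, with a line at the median, and the thin lines would extend toward the minimum and the maximum.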
[Figure 3.13: box plot. Axis label: Age of Respondent.]
Figure 3.13    The box plot representing the variable Age of respondent. We can read off directly the five-number summary of the distribution
Box plots can also be used to represent several similar variables on the same graph, allowing comparisons. You could also split a population into several separate groups (such as men and women) and have a separate box plot for each group, drawn in the same graph, next to each other, to permit comparisons. This is illustrated in Figure 3.14 where five box plots of respondents' income are drawn, for the various groups defined by educational level. This figure illustrates clearly how the income varies with the highest level of education attained. We should note, however, that this data comes from a file where the income is not measured as a continuous scale variable, but is coded into 21 categories, and that the 22nd category is made up, as explained earlier in this chapter. More details are found in Lab 5.
Line Charts
Line charts are most useful to represent the variation of a quantitative variable over time. The X-axis represents the time line, and the Y-axis represents some quantitative variable. For example, the variable could be the number of students enrolled in a given program, or the inflation rate, or the market value of a given portfolio of stocks. The line chart would show how the variable increases or decreases as time goes by. A common mistake, sometimes made intentionally, consists in not showing
[Figure 3.14: box plots of respondents' income by RS Highest Degree. Categories and group sizes: Less than HS (N = 106), High school (N = 528), Junior college (N = 73), Bachelor (N = 191), Graduate (N = 95), DK (N = 1).]
Figure 3.14    Although the income is coded into 22 categories and not given as a dollar amount, the comparison of the income for each educational level gives us a good idea of how incomes vary as a function of education
[Figure 3.15: line chart. Title: Unemployment rate, seasonally adjusted.]
Figure 3.15    A line chart showing the variation over time of the unemployment rate in Canada. The reader should be aware of the fact that the Y-axis does not start at zero, which may give the impression that the variations are greater than they really are. An awareness of this can guard us against misinterpretations. (Source: Statistics Canada)

the zero level of the quantity, or in drawing the Y-axis shorter than it should be. This procedure has the effect of giving the impression that the variations are bigger than they really are, but at the same time it allows us to see the variations in the graph in greater detail. When it is necessary to show a shorter Y-axis, this should be indicated by an interruption in the line representing the Y-axis. Figure 3.15 provides an example of a line chart. Here you can see that the Y-axis does not start at zero, giving the impression that the variations are much bigger than they really are. However, this is justified by the fact that it allows us to see the variation in unemployment rates in great detail, and by the fact that this is an increasingly standard practice, which means that readers should be aware of the resulting distortion and interpret what they see accordingly.
The General Shape of a Distribution
In addition to the measures explained above, we could describe the general shape of the distribution of a quantitative variable by looking at two of its features: symmetry and kurtosis.
Symmetry
The first characteristic to look at is symmetry. A distribution is said to be symmetric if the mean splits its histogram into two equal halves, which are mirror images of each other. A typical symmetric distribution is the normal distribution. It is a bell-shaped distribution that follows a very specific pattern, and occurs in a wide range of situations. It is represented by the curve of Figure 3.16. It will be studied later on. In a symmetric distribution, the mean and the median are equal. If the distribution is also unimodal, then the mean, the median, and the mode are all equal. This is true of normal distributions.
Figure 3.16    An example of a symmetric distribution. This one is the normal distribution, which will be studied in Chapter 5
If a distribution is symmetric and unimodal, the mean is a good representative of the center. However, it often happens that a distribution is not symmetric. We then say
that it is skewed. That means that one side of the graph of the distribution is stretched more than the other. We say that it is positively skewed if it is stretched on the right side, and negatively skewed if it is stretched on the left side. Figure 3.17 illustrates the difference between symmetric distributions and skewed distributions. SPSS allows you to compute a statistic called skewness, which is a measure of how skewed a distribution is. A normal curve has a skewness of 0. If the skewness is larger than 1, the shape starts to look significantly different from that of a normal curve.
[Figure 3.17: three curves. Panel titles: Symmetric distribution (mean = median = mode), Positively skewed distribution, Negatively skewed distribution.]
Figure 3.17    Symmetric and skewed distributions
How can we know that a distribution is skewed? The first indication is the histogram: the tail end of the histogram is longer on one side than on the other. We can also see that a distribution is skewed through its numerical features: the mean is different from the median. When the distribution is positively skewed, the mean is larger than the median, as it is pushed by the extreme values toward the longer tail. For negatively skewed distributions, the mean is smaller than the median. Therefore, a mean larger than the median tells us that the extreme values on the higher end of the distribution are much larger than the bulk of the data in the distribution, pulling the mean toward the positive side. This is illustrated by the numerical example given in the section on the median, where one extreme value (60) pulls the mean up but does not affect the median. Therefore, when a distribution is highly skewed, the median is usually a better representative of the center of the data than the mean.
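The effect of one extreme value on the mean, the median, and the skewness can be sketched as follows. The two small data sets are invented (they are not the example from the section on the median), and the `skewness` function is our own helper. It assumes the adjusted sample formula that statistical packages commonly report; the SPSS documentation should be consulted for the exact definition it uses.

```python
from statistics import mean, median, stdev

def skewness(values):
    """Sample skewness (adjusted Fisher-Pearson coefficient), assumed here."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

# Hypothetical data, invented for illustration only.
symmetric = [3, 4, 5, 6, 7]
skewed = [3, 4, 5, 6, 60]   # the largest value is replaced by an extreme one

for data in (symmetric, skewed):
    print(mean(data), median(data), skewness(data))
```

In the first list the mean and the median are both 5 and the skewness is 0. In the second list the median is still 5, but the mean rises to 15.6 and the skewness is well above 1, which signals a positively skewed distribution.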
Kurtosis
This is a measure of the degree of peakedness of the curve. It tells you whether the curve representing the distribution tends to be very peaked, with a high proportion of data entries clustered near the center, or rather flat, with data spread out over a wide range. A normal distribution has a kurtosis equal to 0. A positive value indicates that the data is clustered around the center, and that the curve is highly peaked. A negative value indicates that the data is spread out, and that the curve is flatter than a normal curve. Figure 3.18 shows three curves with zero, positive, and negative kurtosis respectively.
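A short sketch can show the sign of the kurtosis for a flat and for a peaked set of values. Both data sets are invented, and the `excess_kurtosis` helper is our own. It assumes the usual sample formula for kurtosis measured relative to the normal curve (so that a normal distribution gives 0); the exact formula used by SPSS is not given in this chapter.

```python
from statistics import mean, stdev

def excess_kurtosis(values):
    """Sample kurtosis relative to the normal curve (assumed formula)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical data, invented for illustration only.
flat = list(range(1, 21))              # values spread evenly from 1 to 20
peaked = [10] * 16 + [0, 5, 15, 20]    # most values clustered at the center

print(excess_kurtosis(flat), excess_kurtosis(peaked))
```

The evenly spread values give a negative kurtosis, while the values clustered at the center, with only a few entries far from it, give a positive kurtosis.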
Methodological Issues
Although they seem to be simple, descriptive measures can be tricky to use. We would like to point out here some of the pitfalls and difficulties associated with their use.
[Figure 3.18: three curves. Panel titles: A curve with kurtosis = 0, A curve with positive kurtosis, A curve with negative kurtosis.]
Figure 3.18    Illustration of zero, positive and negative kurtosis
The Definition of the Categories over which the Counting is Done
Suppose I say that the passing rate in a given class is 82%. In another college, a colleague tells me that his passing rate is 95%. Before concluding that his passing rate is much higher, I have to make sure that we are defining the passing rate in the same way. I may define the passing rate as the number of students who pass a course compared to those who were registered at the beginning of the semester. If he defines it the same way, we can make meaningful comparisons. But if he defines it as the number of students who pass the course compared to the number registered at the end of the semester, we cannot make a meaningful comparison. This is so because all the students who dropped out would not be taken into account in his calculation, whereas they would be taken into account in mine. A careful definition of the categories used to define a concept is therefore important. Such problems arise when we define the unemployment rate in various countries, or even wealth. The conclusion is that careful attention should be given to the way categories are defined when comparing the statistics that refer to different populations.
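To see how much the definition matters, consider a single hypothetical class, with invented numbers, for which the passing rate is computed in both ways.

```python
# Hypothetical class, invented for illustration only.
registered_at_start = 50
dropped_out = 7
passed = 41

# Definition 1: students who pass, compared to those registered at the beginning.
rate_from_start = 100 * passed / registered_at_start

# Definition 2: students who pass, compared to those still registered at the end.
registered_at_end = registered_at_start - dropped_out
rate_from_end = 100 * passed / registered_at_end

print(rate_from_start, round(rate_from_end, 1))
```

The same class has a passing rate of 82% under the first definition and of about 95% under the second. The difference comes entirely from whether the students who dropped out are counted.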
Outliers
Outliers are values that are unusually large or unusually small in a distribution. They have to be examined carefully to determine if they are the result of an error of measurement, or a typing error, or whether they actually represent an extreme case. For instance, the value 69 in the column of the variable age for college students could be a typing error, but it could also represent the interesting case of a retired person who decided to pursue a college program. Even if they represent an extreme case, it may be desirable to disregard extreme values in some of the statistical computations. When producing a box plot, SPSS excludes the outliers from the computation, and prints them above or below the box plot. An option allows users to have the case number printed next to the dot representing the outlier, so as to be able to identify the case and examine it more closely.
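The chapter does not state the rule used to decide that a value is an outlier. A common convention, assumed in the sketch below, is to flag any value that lies more than 1.5 box lengths (1.5 times the interquartile range) below the 1st quartile or above the 3rd quartile. The ages are invented, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which may differ slightly from the method used by SPSS.

```python
from statistics import quantiles

# Hypothetical ages of college students, invented for illustration only.
ages = [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 69]

# Quartiles; the box of the box plot runs from q1 to q3.
q1, _median, q3 = quantiles(ages, n=4)
box_length = q3 - q1  # the interquartile range

# Assumed convention: flag values more than 1.5 box lengths away from the box.
lower_fence = q1 - 1.5 * box_length
upper_fence = q3 + 1.5 * box_length

# Keep the case number (position in the list, starting at 1) with each flagged value.
outliers = [(case, age) for case, age in enumerate(ages, start=1)
            if age < lower_fence or age > upper_fence]
print(outliers)
```

Here only the age 69, case number 11, is flagged. Whether it is a typing error or a genuine extreme case still has to be decided by examining the case itself.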
Summary
We have seen in this chapter the various measures used to summarize the data pertaining to a single variable as well as the various types of charts that could be used to illustrate the distribution. You should keep in mind one fundamental point: the level of measurement used for the variable determines which measures and graphs are appropriate. It does not make sense, for example, to compute the mean of the variable when the level of measurement is nominal, that is, when the variable is qualitative. There are three types of univariate descriptive measures:
•    measures of central tendency,
•    measures of dispersion, and
•    measures of position.
Measures of central tendency, also called measures of the center, tell us the values around which most of the data is found. They give us an order of magnitude of the data, allowing comparisons across populations and subgroups within a population. They include the mean, the median, and the mode. The mean should not be used when the variable is qualitative.
Measures of dispersion are an indication of how spread out the data is. They are mostly used for quantitative data. The most important ones are the range, the interquartile range, the variance, and the standard deviation.
Measures of position tell us how one particular data entry is situated in comparison to the others. The percentile rank is one such measure. Other measures include the quartiles and the deciles.
In addition to these measures, we have seen the weighted mean. When calculating it, the various entries are multiplied by a weight, which is a positive number between 0 and 1. All the weights add up to 1. The weighted mean is used when the numbers that are averaged have been calculated over populations of unequal size. For instance, if you have the birth rates in all Canadian provinces and you want to find the average birth rate for Canada as a whole, you must weight these numbers by the demographic importance of every province. The weighted mean is also used when you want to increase or decrease the relative importance of the numbers you are averaging, as is done when finding the average grade over exams that do not count for the same percentage in the final grade.
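Both uses of the weighted mean can be sketched in a few lines of Python. All the numbers are invented for illustration: the grades and their weights, and the three regions with their rates and populations (they are not actual provincial figures).

```python
def weighted_mean(values, weights):
    """Sum of each value multiplied by its weight; the weights must add up to 1."""
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("the weights must add up to 1")
    return sum(value * weight for value, weight in zip(values, weights))

# Exams that do not count for the same percentage of the final grade.
grades = [70, 80, 90]
exam_weights = [0.2, 0.3, 0.5]
final_grade = weighted_mean(grades, exam_weights)

# Rates calculated over populations of unequal size (hypothetical regions).
rates = [10.0, 12.0, 15.0]                 # births per 1,000 inhabitants
populations = [6_000_000, 3_000_000, 1_000_000]
total = sum(populations)
population_weights = [p / total for p in populations]
overall_rate = weighted_mean(rates, population_weights)

print(final_grade, overall_rate)
```

The final grade is about 83, whereas the simple mean of the three grades is 80. The overall rate is about 11.1 per 1,000, whereas the simple mean of the three rates would be about 12.3, because the simple mean gives the smallest region the same importance as the largest one.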
When categories are involved (either because the variable is qualitative, or when quantitative values have been grouped) we can find ratios, percentages, and proportions of the groups corresponding to the categories.
The general shape of a distribution is analyzed in terms of symmetry or skewness, and in terms of kurtosis (the degree to which the curve is peaked).
The comparison of the mean and the median is very useful. Recall the following:
If the distribution is very skewed, the median is a better representative of the center of the data, as the extreme values tend to pull the mean towards one side of the curve. The median is not affected by extreme values.
If the mean is larger than the median, the distribution is positively skewed. If the mean is smaller than the median, the distribution is negatively skewed.
As for the graphical representation of a distribution, recall again that the level of measurement of the variable determines what kind of chart is appropriate. Bar charts and pie charts are appropriate when the data is qualitative, or measured at the nominal or ordinal levels. Quantitative data (whether measured at the ordinal or numerical scale levels) could also be represented by bar charts or pie charts if the values have been grouped into a small number of categories.
The essential difference between pie charts and bar charts is that in the former, the emphasis is on the relative importance of each category as compared to the other categories, whereas in the latter, the emphasis is on the size of each category. However, there is no clear-cut distinction between the two, and if one is appropriate, the other is usually appropriate also, even if the emphasis is slightly different. The great advantage of bar charts is that they allow comparisons between the distributions of subgroups, with the help of clustered bar charts.
Quantitative variables are better represented through histograms. A specific type of histogram is the population pyramid, which is a standard tool in demography.
Line charts are most suited to represent the variation of a quantity across time.
In all kinds of charts, truncating the Y-axis is sometimes done to zoom in on the variations of the variables and to represent them in a more detailed way. However, we should be aware of the fact that truncating the Y-axis may also convey a mistaken impression that the variations of the variable are more important than they are in reality.
Keywords
Univariate
Bivariate
Measures of central tendency
Measures of dispersion
Measures of position
Mean
Trimmed mean
Weighted mean
Median
Frequencies
Cumulative frequencies
Valid percent
Range
Trimmed range
Interquartile range
Deviation from the mean
Standard deviation
Variation ratio
Ratios
Proportions
Bar graph
Clustered bar graph
Pie chart
Histogram
Frequency polygon
Line chart
Box plot
Mode
Modal category
Majority
Plurality
Coefficient of variation
Quartiles
Deciles
Percentiles
Percentile rank
Five-number summary
Symmetry
Skewness
Kurtosis
Outliers
Suggestions for Further Reading
Devore, Jay and Peck, Roxy (1997) Statistics, the Exploration and Analysis of Data (3rd edn). Belmont, Albany: Duxbury Press.
Harnett, Donald H. and Murphy, James L. (1993) Statistical Analysis for Business and Economics. Don Mills, Ontario: Addison-Wesley Publishers.
Trudel, Robert and Antonius, Rachad (1991) Méthodes quantitatives appliquées aux sciences humaines. Montreal: CEC.
Wonnacott, Thomas H. and Wonnacott, Ronald J. (1977) Introductory Statistics (3rd edn). New York: John Wiley and Sons.
EXERCISES
3.1    Complete the following sentences:
(a)   Three types of measures are useful to summarize a numerical distribution. They are ____________, ____________ and ____________.
(b)   The most frequent value in a distribution is called ____________.
(c)   When the values of the distribution are grouped into classes, the mode is the ____________ with the highest frequency.
(d)   When there are two classes that are bigger than the ones immediately next to them, the distribution is called ____________.
(e)   If the modal class includes more than 50% of the population, we say that it constitutes the ____________. Otherwise, we simply talk of a ____________.
(f)   The median falls ____________ of the ordered list of entries. ______% of the data are less than or equal to the median, and ______% are larger than or equal to it.
(g)   The mean of a numerical distribution is equal to the ____________ of all entries divided by ____________.
(h)   The mathematical measure used to find the mean when the entries do not have the same relative importance is called ____________.