UNIVARIATE   DESCRIPTIVE STATISTICS
This chapter explains how data concerning one variable can be summarized and described, with tables and with simple charts and diagrams. After studying this chapter, the student should know:
•     the basic types of univariate descriptive measures;
•     how the level of measurement determines the descriptive measures to be used;
•     how to interpret these descriptive measures;
•     how to read a frequency table;
•     the differences in the significance and the uses of the mean and the median;
•     how to interpret the mean when a quantitative variable is coded;
•     how to describe the shape of a distribution (symmetry; skewness);
•     how to present data (frequency tables; charts);
•     what weighted means are and when to use them.
Data files contain a lot of information that must be summarized in order to be useful. If we look for instance at the variable age in the data file GSS93 subset that comes with the SPSS package, we will find 1500 entries, giving us the age of every individual in the sample. If we examine the ages of men and women separately, we cannot determine, by looking simply at the raw data, whether men of this sample tend to be older than women or whether it is the other way around. We would need to know, let us say, that the average age of men is 23 years and of women 20 years to make a comparison. The average is a descriptive measure.
Descriptive statistics aim at describing a situation by summarizing information in a way that highlights the important numerical features of the data. Some of the information is lost as a result. A good summary captures the essential aspects of the data and the most relevant ones. It summarizes it with the help of numbers, usually organized into tables, but also with the help of charts and graphs that give a visual representation of the distributions.
In this chapter, we will be looking at one variable at a time. Measures that concern one variable are called univariate measures. We will examine bivariate measures, those measures that concern two variables together, in Chapter 8.
There are three important types of univariate descriptive measures:
•     measures of central tendency,
•     measures of dispersion, and
•     measures of position.
Measures of central tendency (sometimes called measures of the center) answer the question: What are the categories or numerical values that represent the bulk of the data in the best way? Such measures will be useful for comparing various groups within a population, or seeing whether a variable has changed over time. Measures of central tendency include the mean (which is the technical term for average), the median, and the mode.
Measures of dispersion answer the question: How spread out is the data? Is it mostly concentrated around the center, or spread out over a large range of values? Measures of dispersion include the standard deviation, the variance, the range (there are several variants of the range, such as the interquartile range) and the coefficient of variation.
Measures of position answer the question: How is one individual entry positioned with respect to all the others? Or how does one individual score on a variable in comparison with the others? If you want to know whether you are part of the top 5% of a math class, you must use a measure of position. Measures of position include percentiles, deciles, and quartiles.
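The idea of a measure of position can be sketched in a few lines of Python. This is only an illustration: the grades are hypothetical, the function name is ours, and the convention used here (the percentage of entries at or below a given score) is one of several in common use.

```python
def percent_at_or_below(scores, value):
    """Percentage of entries that are less than or equal to `value`."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

# Hypothetical grades for a math class of 20 students (illustration only).
grades = [52, 55, 58, 60, 61, 63, 65, 66, 68, 70,
          71, 73, 75, 77, 79, 81, 84, 88, 92, 97]

for grade in (92, 97):
    position = percent_at_or_below(grades, grade)
    # Under this convention, a student is in the top 5% when more than
    # 95% of the class scores at or below his or her grade.
    print(grade, position, position > 95)
```

With 20 students, only the single best grade falls in the top 5% under this convention.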
Other measures. In addition to these measures, we can compute the frequencies of certain subgroups of the population, as well as certain ratios and proportions that help us compare their relative importance. This is particularly useful when the variable is qualitative, or when it is quantitative but its values have been grouped into categories.
The various descriptive measures that can be used in a specific situation depend on whether the variable is qualitative or quantitative. When the variable is quantitative, we can look at the general shape of the distribution, to see whether it is symmetric (that is, the values are distributed in a similar way on both sides of the center) or skewed (that is, lacking symmetry), and whether it is rather flat or rather peaked (a characteristic called kurtosis).
Finally, we can make use of charts to convey a visual impression of the distribution of the data. It is very easy to produce colorful outputs with any statistical software. It is important, however, to choose the appropriate chart, one that is meaningful and that conveys the most important properties of the data. This is not always easy, and you will have to pay attention to the way an appropriate chart is chosen, a choice that depends on the level of measurement of the variable.
INTERPRETING QUANTITATIVE DATA WITH SPSS
It is very important to realize that the statistical measures used to describe the data pertaining to a variable depend on the level of measurement used. If a variable is measured at the nominal scale, you can compute certain measures and not others. Therefore you should pay attention to the conditions under which a measure could be used; otherwise you will end up computing numerical values that are meaningless.
Measures of Central Tendency
For Qualitative Variables
The best way to describe the data that corresponds to a qualitative variable is to show the frequencies of its various categories, which are a simple count of how many individuals fall into each category. You could then work out this count as a percentage of the total number of units in the sample. When you ask for the frequencies, SPSS automatically calculates the percentages as well, and it does it twice: the percentage with respect to the total number of people in the sample, and the percentage with respect to the valid answers only, called valid percent in the SPSS outputs. Let us say that the percentage of people who answered Yes to a question is 40% of the total. If only half the people had answered, this percentage would correspond to 80% of the valid answers. In other words, although 40% of the people answered Yes, they still constitute 80% of those who answered. SPSS gives you both percentages (the total percentage and the valid percentage) and you have to decide which one is more significant in a particular situation.
For instance, Table 3.1 summarizes the answers to a question about the legalization of marijuana, in a survey given to a sample of 1500 individuals.
Table 3.1 A frequency table, showing the frequencies of the various categories, as well as the percentage and valid percentage they represent in the sample
Should Marijuana Be Made Legal
		Frequency	Percent	Valid Percent
Valid	Legal	211	14.1	22.7
	Not legal	719	47.9	77.3
	Total valid	930	62.0	100.0
Missing		570	38.0	
Total		1500	100.0	
Table 3.1 tells us that the sample included 1500 individuals, but that we have the answers to that question for 930 individuals only. The percentage of positive answers can be calculated either out of the total number of people in the sample, giving 14.1% as shown in the Percent column, or out of the number of people for whom we have answers, giving 22.7% as shown in the Valid Percent column. Which percentage is the most useful? It depends on the reason for the missing answers. If people did not answer because the question was asked of only a subset of the sample, the valid percentage is easier to interpret. But if 570 people abstained because they did not want to let their opinion be known, it is more difficult to interpret the resulting figures. A good analysis should include a discussion of the missing answers when their proportion is as large as it is in this example.
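The two percentages of Table 3.1 can be recomputed from the raw counts. The short Python sketch below is only an illustration of the arithmetic, not of what SPSS does internally; the function name `frequency_table` is ours.

```python
def frequency_table(counts, missing):
    """Return {category: (frequency, percent, valid percent)}.

    `counts` holds the valid answers; `missing` is the number of
    individuals for whom no answer was recorded.
    """
    valid_total = sum(counts.values())       # people who answered
    sample_total = valid_total + missing     # everybody in the sample
    return {
        category: (n,
                   round(100 * n / sample_total, 1),   # Percent
                   round(100 * n / valid_total, 1))    # Valid Percent
        for category, n in counts.items()
    }

# Counts taken from Table 3.1.
table = frequency_table({"Legal": 211, "Not legal": 719}, missing=570)
for category, (n, percent, valid_percent) in table.items():
    print(f"{category:<10} {n:>5} {percent:>6} {valid_percent:>6}")
```

The only difference between the two columns is the denominator: 1500 for the Percent column, 930 for the Valid Percent column.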
Table 3.1 comes from the SPSS output. When we write a statistical report, we do not include all the columns in that table. Most of the time, you would choose either the valid percentage (which is the preferred solution) or the total percentage, but rarely both, unless you want to discuss specifically the difference between these two percentages. The cumulative percentage is only used for ordinal or quantitative variables, and even then is included only if you plan to discuss it.
To describe the center of the distribution of a qualitative variable, you must determine which category includes the biggest concentration of data. This is called the mode. The mode for a qualitative variable is the category that has the highest frequency (sometimes called modal category).
The modal category could include more than 50% of the data. In this case we say that this category includes the majority of individuals. If the modal category includes less than 50% of the data, we say that it constitutes a plurality. We can illustrate this by the following situations concerning the votes in an election.
First situation:      Party A      54% of the votes
                      Party B      21% of the votes
                      Party C      25% of the votes.
Here we could say that Party A won the election with a majority. Compare with the following situation.
Second situation:     Party A      44% of the votes
                      Party B      31% of the votes
                      Party C      25% of the votes.
Here we can say that Party A won the election with a plurality of votes, but without a majority. If Parties B and C formed a coalition, they could defeat Party A. For this reason, some countries include in their electoral law a provision that, should the winning candidate or the winning party get less than the absolute majority of votes (50% + 1), a second round should take place among those candidates who are at the top of the list, so as to end up with a winner having more than 50% of the votes.
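The distinction between a majority and a plurality can be expressed as a small Python sketch. The function name `describe_modal_category` is ours, and the vote shares are those of the two situations above.

```python
def describe_modal_category(shares):
    """Return the modal category and whether it is a majority or a plurality.

    `shares` maps each category to its percentage of the total.
    """
    modal = max(shares, key=shares.get)          # highest frequency
    kind = "majority" if shares[modal] > 50 else "plurality"
    return modal, kind

first_situation = {"Party A": 54, "Party B": 21, "Party C": 25}
second_situation = {"Party A": 44, "Party B": 31, "Party C": 25}

print(describe_modal_category(first_situation))
print(describe_modal_category(second_situation))
```

In both situations Party A is the modal category; only in the first does it hold more than half of the votes.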
A good description of the distribution of a qualitative variable should include a mention of the modal category, but it should also include a discussion of the pattern
of the distribution of individuals across the various categories. Concrete examples will be given in the last section of this chapter.
For Quantitative Variables
Quantitative variables allow us a lot more possibilities. The most useful measures of central tendency are the mean and the median. We will also see how and when to use the mode. The mean of a quantitative variable is defined as the sum of all entries divided by their number.
In symbolic terms,
the mean of a sample is written as $\bar{x} = \frac{\sum x_i}{n}$, and
the mean of a population is written as $\mu_x = \frac{\sum x_i}{N}$.
These symbols are read as follows:
$\bar{x}$ is read as x bar, and it stands for the mean of a sample for variable X.
$\mu_x$ is read as mu x, and it stands for the mean of a population. The subscript x refers to the variable X.
$x_i$ is read as x i. It refers to all the entries of your data that pertain to the variable X, which are labeled $x_1$, $x_2$, $x_3$, etc.
$\Sigma$ is read as sigma. When followed by $x_i$, it means: add all the $x_i$'s, letting i range over all possible values, that is, from 1 to n (for a sample) or from 1 to N (for a population).
n is the size of the sample, that is, the number of units that are in it.
N is the size of the population.
You may have noticed that we use different symbols for a population and for a sample, to indicate clearly whether we are talking about a population or a sample. We do not always need to write the subscript x in $\mu_x$. We do it only when several variables are involved, and when we want to keep track of which of the variables we are talking about. In such a situation we would use $\mu_x$, $\mu_y$, and $\mu_z$ to refer to the mean of the population for the variables x, y, and z respectively. Notice that in the formula for the mean of a population, we have written a capital N to refer to the size of the population rather than the small n used for the size of a sample.
The mean is very useful to compare various populations, or to see how a variable evolves over time. But it can be very misleading if the population is not homogeneous. Imagine a group of five people whose hourly wages are: $10, $20, $45, $60 and $65 an hour. The average hourly wage would be:

$$\frac{10 + 20 + 45 + 60 + 65}{5} = \$40 \text{ an hour.}$$
But if the last participant was an international lawyer who charged $400 an hour of consultancy, the average would have been $107 an hour (you can compute it yourself), which is well above what four out of the five individuals make, and would be a misrepresentation of the center of the data.
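The two averages can be checked with a few lines of Python, using the standard `statistics` module. The variable names are ours; the wages are those of the example.

```python
from statistics import mean

wages = [10, 20, 45, 60, 65]              # hourly wages in dollars
wages_with_lawyer = [10, 20, 45, 60, 400] # last person replaced by the lawyer

print(mean(wages))              # sum 200 divided by 5
print(mean(wages_with_lawyer))  # sum 535 divided by 5

# How many people earn less than the second mean?
below = sum(1 for w in wages_with_lawyer if w < mean(wages_with_lawyer))
print(below, "out of", len(wages_with_lawyer))
```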
In order to avoid this problem, we can compute the trimmed mean: you first eliminate the most extreme values, and then you compute the mean of the remaining ones. But you must indicate how much you have trimmed. In SPSS, one of the procedures produces a 5% trimmed mean, which means that you disregard the 5% of the data that are farthest away from the center, and then you compute the mean of the remaining data entries.
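A trimmed mean can be sketched in Python as follows. This is a minimal illustration under one stated assumption: the same proportion of entries is dropped from each end of the ordered list. The function name `trimmed_mean` is ours, and the exact rule applied by SPSS may differ in its details.

```python
def trimmed_mean(values, proportion):
    """Mean of the entries left after dropping `proportion` of them
    from EACH end of the sorted list (an assumption of this sketch)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)     # entries dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

wages = [10, 20, 45, 60, 400]

# With only five entries, a 5% trim removes nothing.
print(trimmed_mean(wages, 0.05))
# A 20% trim drops the smallest and the largest wage.
print(round(trimmed_mean(wages, 0.20), 2))
```

With a 20% trim, the $400 wage no longer pulls the result upward, which is why the amount trimmed must always be reported.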
The mean has a mathematical property that will be used later on. Starting from the definition of the mean, which states that $\bar{x} = \frac{\sum x_i}{n}$, we can conclude, by multiplying both sides by n, that:
$$n\bar{x} = \sum x_i$$
In plain language, this states that the sum of all entries is equal to n times the mean.
We will discuss all the limitations and warnings concerning the mean in a later section on methodological issues.
THE MEAN OF DATA GROUPED INTO CLASSES
When we are given numerical data that is grouped into classes, and we do not know the exact value of every single entry, we can still compute the mean of the distribution by using the midpoint of every class. What we get is not the exact mean, but it is the closest guess of the mean that is available. If the classes are not too wide, the value obtained by using the midpoints is not that different from the value that would have resulted from the individual data.
Consider one of the intervals, i, with frequency $f_i$ and midpoint $x_i$. The exact sum of all the entries in that class is not known, but we can approximate it using the midpoint. Thus, instead of the sum of the individual entries (not known) we will count the midpoint of the class $f_i$ times. We obtain the following formula.
$$\text{Mean for grouped data} = \frac{\sum f_i x_i}{n}$$
Here, n is the number of all entries in the sample. It is therefore equal to the sum of the class frequencies, that is, the sum of the number of individuals in the various classes. The formula can thus be rewritten as
$$\text{Mean for grouped data} = \frac{\sum f_i x_i}{\sum f_i}$$
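The formula for grouped data can be tried out on a small example. In the Python sketch below, the age classes and their frequencies are hypothetical, and the function name `grouped_mean` is ours.

```python
def grouped_mean(classes):
    """Approximate mean of data grouped into classes.

    `classes` is a list of (lower limit, upper limit, frequency).
    Every entry of a class is replaced by the midpoint of that class.
    """
    weighted_sum = sum(f * (low + high) / 2 for low, high, f in classes)
    n = sum(f for _, _, f in classes)      # sum of the class frequencies
    return weighted_sum / n

# Hypothetical age classes (illustration only).
ages = [(20, 30, 5), (30, 40, 10), (40, 50, 5)]
print(grouped_mean(ages))   # midpoints 25, 35 and 45
```

Here the midpoints 25, 35 and 45 are counted 5, 10 and 5 times respectively, giving 700 divided by 20.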
INTERPRETATION OF THE MEAN WHEN THE VARIABLE IS CODED
We often have data files where a quantitative variable is not given in its original form, but coded into a small number of categories. For instance, the variable Respondent's Income could be given in the form shown in Table 3.2.
Table 3.2 Example of a quantitative variable that is coded into 21 categories, with a 22nd category for those who refused to answer
Category	Code
Less than $1000	1
$1000-2999	2
$3000-3999	3
$4000-4999	4
$5000-5999	5
$6000-6999	6
$7000-7999	7
$8000-9999	8
$10,000-12,499	9
$12,500-14,999	10
$15,000-17,499	11
$17,500-19,999	12
$20,000-22,499	13
$22,500-24,999	14
$25,000-29,999	15
$30,000-34,999	16
$35,000-39,999	17
$40,000-49,999	18
$50,000-59,999	19
$60,000-74,999	20
$75,000 and more	21
Refused to answer	22
Thus, we would not know the exact income of a respondent. We would only know the category he or she falls into.
This kind of measuring scale poses a challenge. If we compute the mean with SPSS, we will not get the mean income. We will get the mean code, because it is the codes that are used to perform the computations. There is a data file that comes with SPSS where the income is coded in this way. This data file contains information about 1500 respondents, including information on the income bracket they fall into, coded as shown in Table 3.2. When we exclude the 22nd category, which consists of the people who refused to answer this question, the computation of the mean with SPSS produces the following result:
Mean = 12.35
What is the use of this number? It is not a dollar amount! If we look at Table 3.2, we see that the code 12 stands for an income of between $17,500 a year and $20,000 a year (with that last number excluded from the category). To interpret this number, we should first translate it into a dollar amount (it can be done with a simple rule). But even without transforming it into the dollar amount it corresponds to, we could use the mean code for comparisons. For instance, we will see in Lab 3 that if we compute the mean income separately for men and women, we get
Mean income for men: 13.9
Mean income for women: 10.9
(excluding the category of people who refused to answer).
Although the mean code does not tell us exactly the mean income for men and women, it still tells us that there is a big difference between men and women for that variable. Rounding the two mean codes to 14 and 11, Table 3.2 tells us that the code 14 corresponds to the income bracket $22,500-24,999, while the code 11 represents the income bracket $15,000-17,499. We can conclude that the difference in income between men and women, for that sample, is roughly $7,500 a year.
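The text does not spell out the rule for translating a mean code into a dollar amount. One possible rule, shown here purely as an assumption, is to interpolate linearly between the midpoints of the two neighboring income brackets of Table 3.2. The names `BOUNDS`, `midpoint` and `code_to_dollars` are ours.

```python
import math

# Lower limits of the brackets coded 1 to 20 in Table 3.2, followed by the
# upper limit of bracket 20. Code 21 is open-ended and is left out.
BOUNDS = [0, 1000, 3000, 4000, 5000, 6000, 7000, 8000, 10000, 12500,
          15000, 17500, 20000, 22500, 25000, 30000, 35000, 40000,
          50000, 60000, 75000]

def midpoint(code):
    """Midpoint, in dollars, of the bracket with the given code (1 to 20)."""
    return (BOUNDS[code - 1] + BOUNDS[code]) / 2

def code_to_dollars(mean_code):
    """Assumed rule: linear interpolation between bracket midpoints."""
    lower_code = math.floor(mean_code)
    if not 1 <= lower_code < 20:
        raise ValueError("this sketch only handles mean codes from 1 up to 20")
    fraction = mean_code - lower_code
    step = midpoint(lower_code + 1) - midpoint(lower_code)
    return midpoint(lower_code) + fraction * step

# The mean code 12.35 lies between the brackets coded 12 and 13.
print(round(code_to_dollars(12.35)))
```

Other rules are possible (for instance, interpolating between the lower limits of the brackets), and they would give somewhat different dollar amounts.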
We see that when the variables are coded, the interpretation of the mean requires us to translate the value obtained into what it stands for. For quantitative variables coded this way, it may also be useful to find the frequencies of the various categories, as we did for nominal variables. For the example at hand, we would get Table 3.3 as shown.
The conclusion of the preceding discussion is that when we have an ordinal variable with few categories, or even a quantitative variable that has been recoded into a small number of categories, it may be useful to compute the frequency table of the various categories, in addition to the mean and other descriptive measures.
Weighted Means
Consider the following situation: you want to find the average grade in an exam for two classes of students. The first class averaged 40 out of 50 in the exam, and the second class averaged 46 out of 50. If you put the two classes together, you cannot conclude that the average is 43. This is so because the classes may have different numbers of students. Suppose the first class has 20 students, and the second one 40 students. In other words, we have the data shown in Table 3.4.
To compute the average grade for the two classes taken together, we do not need to know the individual scores of each student. Indeed, we have seen before that a sum of n scores is equal to its average times n. We will use this to obtain the formula shown below for weighted means.
Table 3.3 Frequencies of the various income categories for the variable Income
Respondent's income	Frequency	Valid Percent
Less than $1000	26	2.6
$1000-2999	36	3.6
$3000-3999	30	3.0
$4000-4999	24	2.4
$5000-5999	23	2.3
$6000-6999	23	2.3
$7000-7999	15	1.5
$8000-9999	[illegible]	[illegible]
$10,000-12,499	55	5.5
$12,500-14,999	54	5.4
$15,000-17,499	64	6.4
$17,500-19,999	58	5.8
$20,000-22,499	55	5.5
$22,500-24,999	61	6.1
$25,000-29,999	[illegible]	[illegible]
$30,000-34,999	83	8.4
$35,000-39,999	54	5.4
$40,000-49,999	66	6.6
$50,000-59,999	38	3.8
$60,000-74,999	23	2.3
$75,000+	44	4.4
Refused to answer	47	4.7
Total	994	100.0
Missing	506	
Grand Total	1500	

Table 3.4 Two classes of different size and the mean grade in each
	Average Grade out of 50	Number of Students
Class A	40	20
Class B	46	40

The mean for the two classes taken together can be written as
$$\frac{\text{Sum of all scores in class A} + \text{Sum of all scores in class B}}{\text{Total number of students}}$$
The sum of all scores in class A can be replaced by the average score (40) times 20, since there are 20 students in this class. And the sum of all scores in class B can be replaced also by its average score (46) times 40, since this class includes 40 students. The equation for the mean becomes:
$$\frac{(40 \times 20) + (46 \times 40)}{60}$$
This can now be written as:
mean of the two classes combined = 40 × (20/60) + 46 × (40/60)
or again as:
mean of the two classes combined = 40 × (1/3) + 46 × (2/3)
The last formula is important: we see that the average grade of class A is multiplied by the weight of class A, which is its relative importance in the total population. Class A forms 1/3 of the total population (20 students out of 60) and class B 2/3 of the total (40 students out of 60). The underlying formula is:
Average grade for the two classes: $40 \times w_1 + 46 \times w_2$
The $w_i$'s are called the weights of the various classes. In this case, the weight is an expression of the number of people in each class compared to the total population of the two classes.
The general formula is as follows.
If you have n values $x_1, x_2, x_3, \ldots$ etc.,
each having the corresponding weights $w_1, w_2, w_3, \ldots$ etc.,
the weighted mean is given by $x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n$.
The weights are positive numbers and must add up to 1. That is:
$$w_1 + w_2 + w_3 + \cdots + w_n = 1.$$
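The general formula can be applied to the two classes discussed above. The Python sketch below is an illustration only; the function name `weighted_mean` is ours, and the weights are the class sizes divided by the total number of students.

```python
def weighted_mean(values, weights):
    """Sum of each value times its weight. Weights must be positive
    and add up to 1."""
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must add up to 1")
    return sum(x * w for x, w in zip(values, weights))

averages = [40, 46]          # average grade of class A and of class B
sizes = [20, 40]             # number of students in each class

weights = [size / sum(sizes) for size in sizes]   # 1/3 and 2/3
combined = weighted_mean(averages, weights)
print(round(combined, 2))
```

The result agrees with the direct computation (40 × 20 + 46 × 40) divided by 60, that is, 2640 divided by 60, or 44. It is closer to 46 than to 40 because class B is the larger class.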
The weights are not always a reflection of the size of the various groups involved. If you are computing the weighted average of your grades during your college studies, the weights could be proportional to the credits given to each course, or they could be an expression of the importance of the course in a given program of studies. A Faculty of Medicine may weight the grades of its candidates by giving a bigger weight to Chemistry and Biology than Art History, for instance.
Example
A buyer wants to evaluate several houses she has seen. She attributes a score out of ten to each house on each of the following items: size, location, internal design, and quality of construction. Any house having a score less than 5 on any item would not be acceptable. The resulting scores for three houses that are seen as acceptable on all grounds are recorded in Table 3.5. The buyer does not attribute the same importance to each item. The size of the house is the most important quality. The quality of the construction is also very important, but not as important. The buyer attributes a weight to each item, which reflects the importance of that item for her. The weights are given in the last column.
Table 3.5 Scores given to three houses on four items, and their weights
Item	House A	House B	House C	Weight of item
Size	10	7	6	0.4
Location	5	9	10	0.1
Internal design	6	5	8	0.2
Quality of construction	7	9	7	0.3
We can now calculate the weighted average score for each house, using the formula for weighted means given above.
For house A: weighted mean score: 10 × 0.4 + 5 × 0.1 + 6 × 0.2 + 7 × 0.3 = 7.8
For house B: weighted mean score: 7 × 0.4 + 9 × 0.1 + 5 × 0.2 + 9 × 0.3 = 7.4
For house C: weighted mean score: 6 × 0.4 + 10 × 0.1 + 8 × 0.2 + 7 × 0.3 = 7.1
We see that house A obtained the highest weighted score. The total, unweighted score of house C is higher than that of house A. But because the items do not all have the same importance, house A ended up having a higher weighted score.
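The comparison between weighted and unweighted scores can be reproduced with a short Python sketch. The scores and weights are those used in the three computations above; the variable names are ours.

```python
# Weight of each item for the buyer (the weights add up to 1).
weights = {"size": 0.4, "location": 0.1,
           "internal design": 0.2, "quality of construction": 0.3}

# Scores out of ten, as used in the computations above.
scores = {
    "House A": {"size": 10, "location": 5,
                "internal design": 6, "quality of construction": 7},
    "House B": {"size": 7, "location": 9,
                "internal design": 5, "quality of construction": 9},
    "House C": {"size": 6, "location": 10,
                "internal design": 8, "quality of construction": 7},
}

results = {}
for house, items in scores.items():
    unweighted_total = sum(items.values())
    weighted_score = sum(items[item] * w for item, w in weights.items())
    results[house] = (unweighted_total, round(weighted_score, 1))
    print(house, unweighted_total, round(weighted_score, 1))

best_weighted = max(results, key=lambda h: results[h][1])
best_unweighted = max(results, key=lambda h: results[h][0])
print(best_weighted, best_unweighted)
```

The house with the highest unweighted total is not the one with the highest weighted score, because the item on which house A does best (size) carries the largest weight.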
THE MEDIAN AND THE MODE
The median is another measure of central tendency for quantitative variables. It is defined as the value that sits right in the middle of all data entries when they are listed in ascending order. If the number of entries is odd, there will be one data entry right in the middle. If the number of entries is even, we will have two data entries in the middle, and the median in this case will be their average. Here are two examples.
Case 1: variable X   2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13
Case 2: variable Y   2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13
For the variable X we have 13 entries. The value 5 sits in the middle, with six entries equal to or smaller than it, and six entries equal to or larger. The median for X is thus 5. But for variable Y, we have 12 entries. There are therefore two entries in the middle of the ordered list, not just one. The median will be the average of the two, that is (5 + 6) ÷ 2 = 5.5.
The median is not sensitive to extreme values. Suppose, for instance, that the entries for variable X were: 2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 60. Although the last entry is very large compared to the others, it does not affect the median, which is still 5. The mean, however, would have been affected (compute it yourself for the two situations and see how different it would be). For this reason, the median is a better representative of the center when there are extremely large values on one side of it. But the mean is more useful for statistical computations, as we will see in the coming sections.
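The medians and means discussed above can be checked with the standard `statistics` module of Python. The variable names are ours; the entries are those of variables X and Y.

```python
from statistics import mean, median

x = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13]   # 13 entries (odd)
y = [2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13]      # 12 entries (even)
x_extreme = x[:-1] + [60]                        # last entry replaced by 60

print(median(x))           # the 7th entry of the ordered list
print(median(y))           # average of the 6th and 7th entries
print(median(x_extreme))   # unchanged by the extreme value

print(round(mean(x), 2))          # 86 divided by 13
print(round(mean(x_extreme), 2))  # 133 divided by 13
```

Replacing 13 by 60 leaves the median untouched, while the mean rises from about 6.6 to about 10.2.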
Half the population has a score that is lower than or equal to the median, and the other half has a score larger than the median or equal to it. This way of formulating the median is very useful in situations where the distribution is skewed (such as the distribution of income) or in situations where time is involved, especially when processes have not been completed by everybody, as illustrated below.
Examples of the use of the median
•    We are told that the average age at first marriage for a population is 22 years for women, and 25 for men. The median for women is 21, and for men it is 24. This means that by the time they reached 21 years of age, half the women in this population were married. For men, half of them were married by the age of 24.
•    In a study of the time taken by immigrants to find a job, 500 new immigrants who arrived at least three years ago are interviewed. The mean cannot be found because some of them have not found a regular or full-time job yet. But it is found that the median time taken for them to find a regular, full-time job was 18 months for men, and 5 months for women. This means that by the 18th month after arrival, 50% of the men had found a job. Women were faster in finding regular full-time jobs: 50% had a job within 5 months of their date of arrival.
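The second example can be sketched in Python. The durations below are hypothetical, and treating an unfinished search as an infinitely long one is a device of this sketch, used only to place those individuals at the end of the ordered list.

```python
import math
from statistics import median

# Hypothetical months taken to find a job; None means "not found yet".
months_to_job = [2, 3, 5, 5, 8, 12, 18, None, None]

# Place the unfinished searches at the far end of the ordered list.
durations = [math.inf if m is None else m for m in months_to_job]

# More than half of the people have found a job, so the middle entry
# of the ordered list is a real duration and the median is known.
print(median(durations))

# The mean cannot be computed: the sum involves unknown durations.
print(sum(durations) / len(durations))
```

The median is available as soon as more than half of the individuals have completed the process, whereas the mean requires every duration to be known.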
Because the median involves only the ordered list of data entries, it can be used if the quantitative variable is measured at the ordinal level. But if the number of categories is small, the median is not very useful.
The mode can also be used for quantitative variables. When the values are grouped into classes, the mode is defined as it is for qualitative variables: it is the class that has the highest frequency. But the mean and median remain the best descriptive measures for quantitative variables. If the variable is continuous and the values have not been grouped into classes, the mode is the value at which a peak occurs in the graph representing the distribution.
COMPARISON OF THE MEAN AND THE MEDIAN
Both the mean and the median are measures of central tendency of a distribution, that is, they give us a central value around which the other values are found. They are therefore very useful for comparing different samples, or different populations, or samples with a population, or a given population at different moments in time (to see how it has evolved). However, each of the mean and the median has its advantages and drawbacks.
The mean takes into account every single value that occurs in the data. Therefore, it is sensitive to every value. A single very large value can boost the mean up if the number of entries is not very large. For instance, if one worker in a group of 20 workers won $1 million with a lottery ticket, the average wealth of those 20 would look artificially high. The median is not sensitive to every single value. In a distribution where the largest value is changed from 60 to 600, the median would not change. The mean would.
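A quick check of this claim, using the 13 entries listed earlier for variable X and Python's standard `statistics` module. The variable names are illustrative only.

```python
import statistics

entries = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 60]
changed = entries[:-1] + [600]   # largest value changed from 60 to 600

for data in (entries, changed):
    # The median stays at 5; the mean moves from about 10.23 to about 51.77.
    print(statistics.median(data), round(statistics.mean(data), 2))
```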
It follows from these remarks that the mean is a more sophisticated measure, because it takes every value into account. Indeed, it is the mean that is used to compute the standard deviation, which is a measure of dispersion that will be seen below. However, in situations where the distribution is not very symmetric, and where there are some extreme values on only one side of the distribution, the mean will tend to be shifted towards the extreme values, whereas the median will stay close to the bulk of the data. Therefore, whenever the distribution is highly skewed, the median is a better representative of the center of the distribution than the mean. This is true for variables such as income or wealth, where the distribution among individuals in a country, and also worldwide, is highly skewed. For such a variable, the median is a more accurate representative of the central tendency of the distribution.
Measures of Dispersion
For Qualitative Variables
There are not many measures of dispersion for qualitative variables. One of the measures we can compute is the variation ratio. It tells us whether a large proportion of data is concentrated in the modal category, or whether it is spread out over the other categories. The variation ratio is defined as
$$\text{variation ratio} = \frac{\text{number of entries not in the modal class}}{\text{total number of entries}}$$
It is a number between zero and one. If this ratio is close to zero, it indicates a great homogeneity, almost every unit being in the modal class. The farther it is from zero, the greater the dispersion of the data over the other categories. Like many other measures, this one is easy to interpret when doing comparisons. For instance, if we compare the sizes of the various linguistic groups in two cities where several languages are spoken, we can use the variation ratio to assess the degree of heterogeneity in each city. Here is an example.
City A
Linguistic groups	Percentage
[entries illegible in the source]

City B
Linguistic groups	Percentage
French speaking	28%
English speaking	40%
Chinese speaking	20%
Other	12%
Total	100%
The variation ratio for city A would be [figures illegible in the source], and for city B it would be (28 + 20 + 12)/100 = 0.60, showing that city A is a little more heterogeneous than city B.
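The calculation for city B can be sketched as follows. The function name `variation_ratio` is hypothetical, and the percentages are those read from the example for city B (the label "Other" for the fourth group is an assumption).

```python
def variation_ratio(counts):
    """Share of entries that fall outside the modal category."""
    total = sum(counts.values())
    modal = max(counts.values())      # size of the largest category
    return (total - modal) / total

# City B percentages as read from the example ("Other" is an assumed label).
city_b = {"French speaking": 28, "English speaking": 40,
          "Chinese speaking": 20, "Other": 12}
print(variation_ratio(city_b))  # 0.6
```

The same function works with raw counts instead of percentages, since only the proportions matter.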
For Quantitative Variables
There are many ways of measuring the dispersion for quantitative variables. The simplest is the range, but we also have various forms of restricted range, the deviation from the mean, the standard deviation, the variance and finally the coefficient of variation. Let us go through these measures one at a time.
RANGE
The range is the simplest way of measuring how spread out the data is. You simply subtract the smallest entry from the largest one and add 1, and this tells you the size of the interval over which the data is spread out. For example, you would describe a range of values for the variable Age as follows:
In this sample, the youngest person is 16 years old and the oldest 89, spanning a range of 74 years (89 - 16 + 1).
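As a small sketch of this computation in Python: the function name `inclusive_range` is hypothetical, and the list of ages is invented except for the youngest (16) and oldest (89) values taken from the example.

```python
def inclusive_range(values):
    """Largest entry minus smallest entry, plus 1 (the convention used here)."""
    return max(values) - min(values) + 1

ages = [16, 23, 35, 41, 58, 89]   # hypothetical ages; only 16 and 89 come from the text
print(inclusive_range(ages))      # 74
```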
But we may have extreme values that give a misleading impression about the dispersion of the data. For instance, suppose that a retired person decided to enroll in one of our classes. We could then say that the ages of the students in this class range from 17 years up to 69 years, but that would be misleading, as the great majority of students are somewhere between 17 years old and maybe 23 or 24 years old. We can therefore introduce variants of the notion of range. The 10% trimmed range, for instance, computes the range of values after we have dropped 10% of the data at each end: the 10% largest entries and the 10% smallest entries. This statistic gives us the range of the remaining 80% of data entries. We can also compute the 5% trimmed range by deleting from the computation the 5% of values that are the farthest away from the mean. We will also see in a forthcoming section something called a box-plot, that shows us graphically both the full range, and the range of the central 50% of the data after you have disregarded the top 25% and the bottom 25%. This last range is called the interquartile range, the distance between the first and third quartiles, which are the values that split the data into four equal parts.
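The 10% trimmed range and the interquartile range can be sketched in Python, reusing the 13 entries given earlier for variable X. The helper `trimmed` is hypothetical. Note also that several conventions exist for locating quartiles in a small data set; `statistics.quantiles` is used here with its default method, and other software, SPSS included, may report slightly different values.

```python
import statistics

def trimmed(values, fraction):
    """Drop the given fraction of entries from each end of the ordered list."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)     # number of entries dropped at each end
    return ordered[k:len(ordered) - k]

entries = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 60]

kept = trimmed(entries, 0.10)            # drops one entry at each end here
print(max(kept) - min(kept) + 1)         # 10% trimmed range

q1, q2, q3 = statistics.quantiles(entries, n=4)
print(q3 - q1)                           # interquartile range
```

Dropping the extreme value 60 shrinks the range considerably, which is exactly the point of the restricted ranges.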
These various notions of the range do not use the exact values of all the data in their computation. The following measures do.
STANDARD DEVIATION
The most important measure is the standard deviation. To explain what it is we must first define some simpler notions such as the deviation from the mean. For an individual data entry $x_i$, the deviation from the mean is the difference between that entry and the mean (positive if the entry is above the mean, negative if it is below). If we want to write it in symbols, we will have to use two different symbols, depending on whether we have a sample or a population.
For a sample, the deviation from the mean is written: $(x_i - \bar{x})$
For a population, the deviation from the mean is written: $(x_i - \mu)$
The list of all deviations from the mean may give us a good impression of how spread out the data is.
Example
Consider the following distribution, representing the grades out of ten of a group of 14 students:
4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10
Here the mean is given by 105/14 = 7.5. The deviations from the mean are given in Table 3.6.
But that list may be long. We want to summarize it, and end up with a single numerical value that constitutes a measure of how dispersed the data is. We could take the mean of all these deviations. If you perform the computation for the mean deviation, you will get a mean deviation equal to zero (do the computation yourself on the preceding example). This is no accident. Indeed, we can easily show that the mean of these deviations is necessarily zero, as the positive deviations are cancelled out by the negative deviations.
Table 3.6   Calculation of the deviations from the mean
Data entry $x_i$	Deviation from the mean: $(x_i - \bar{x})$
4	4 - 7.5 = -3.5
5	5 - 7.5 = -2.5
5	5 - 7.5 = -2.5
6	6 - 7.5 = -1.5
7	7 - 7.5 = -0.5
7	7 - 7.5 = -0.5
8	8 - 7.5 = 0.5
8	8 - 7.5 = 0.5
8	8 - 7.5 = 0.5
9	9 - 7.5 = 1.5
9	9 - 7.5 = 1.5
9	9 - 7.5 = 1.5
10	10 - 7.5 = 2.5
10	10 - 7.5 = 2.5
The mathematical proof (which is given only for those who are interested and which can be ignored otherwise) goes like this: Sum of all deviations from the mean =
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$$
(Explanation: Recall that the sum of all entries is equal to n times the mean, and that the mean, in the second summation, is counted n times. This is why we get n times the mean twice, once with a positive sign, and once with a negative sign.)
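The same result can be checked numerically on the grades of the example. This is only a verification sketch in Python; the variable names are illustrative.

```python
grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

mean = sum(grades) / len(grades)            # 105 / 14 = 7.5
deviations = [x - mean for x in grades]     # the second column of Table 3.6

print(deviations)
print(sum(deviations))                      # 0.0
```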
We thus conclude that the deviations from the mean always add up to zero, and therefore we cannot summarize them by finding their mean. The way around this difficulty is the following: we will square the deviations, and then take their mean. By squaring the deviations, we get rid of the negative signs, and the positive and negative deviations do not cancel out any more. This operation changes their magnitude, however, and gives an erroneous impression about the real dispersion of data, since the deviations are all squared. This distortion will be corrected by taking the square root of the result, which brings it back to an order of magnitude similar to the original deviations. In summary, we end up with the following calculation:
Standard deviation for a population, denoted by the symbol $\sigma$:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
In the case of a sample, $\mu$ will be replaced by $\bar{x}$ and $N$ will be replaced not by $n$, but by $n - 1$. The reason why we write $n - 1$ instead of $n$ is due to some of the mathematical properties of the standard deviation. It can be proven that using $n - 1$ in the formula gives a better estimate of the standard deviation of a population when we know only that of the sample.
Conclusion: the standard deviation for a sample, denoted by the symbol $s$, is given by:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The standard deviation (often written st.dev.) is the most powerful measure of dispersion for quantitative data. It will permit us to do very sophisticated descriptions of various distributions. All the calculations of statistical inference are also made possible by the use of the standard deviation.
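The steps described above (square the deviations, average them, take the square root) can be sketched in Python on the grades of the earlier example. The function `std_dev` is a hypothetical helper written for this illustration; Python's `statistics.pstdev` and `statistics.stdev` compute the same two quantities.

```python
import math
import statistics

grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

def std_dev(values, sample=True):
    """Square root of the averaged squared deviations (n - 1 for a sample, N for a population)."""
    n = len(values)
    mean = sum(values) / n
    squared = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared / (n - 1 if sample else n))

print(round(std_dev(grades, sample=False), 2))  # population formula: 1.84
print(round(std_dev(grades, sample=True), 2))   # sample formula: 1.91
```

The sample value is slightly larger because the sum of squared deviations is divided by 13 rather than 14.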
VARIANCE
Another useful measure is the variance, which is defined as the square of the standard deviation. It is thus given by
variance of a sample = $s^2$ or
variance of a population = $\sigma^2$
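A short check, on the same grades as before, that the variance is indeed the square of the standard deviation. The functions used come from Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

sample_variance = statistics.variance(grades)        # divides by n - 1
population_variance = statistics.pvariance(grades)   # divides by N

# Each pair of printed values agrees (up to rounding).
print(round(sample_variance, 3), round(statistics.stdev(grades) ** 2, 3))
print(round(population_variance, 3), round(statistics.pstdev(grades) ** 2, 3))
```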
THE COEFFICIENT OF VARIATION
Finally, we can define the coefficient of variation. To explain the use of this measure, suppose you have two distributions having the means and standard deviations given below:
Distribution 1	mean = 30	st. dev. = 3
Distribution 2	mean = 150	st. dev. = 3
In one case the center of the distribution is 30, indicating that the data entries fall in a certain range around the value 30. Their magnitude is around 30. In the other case, the mean is 150, indicating that the data entries fall in a range around the value 150 and have an average magnitude of 150. Although they have the same dispersion (measured by the standard deviation), the relative importance of the dispersion is not the same in the two cases because the magnitude of the data is different. In one case the entries revolve around the value 30, and the standard deviation is equal to 3/30, that is, 10% of the average value of the entries. In the other case, the entries revolve around the value 150 and the standard deviation is equal to 3/150, that is, 2% of the average value of the entries, a value which denotes a smaller relative variation.
There is a way to assess the relative importance of the variation among the entries, by comparing this variation with the mean. The measure is called the coefficient of variation. The coefficient of variation is defined as the standard deviation divided by the mean, and multiplied by 100 to turn it into a percentage. The formula is thus:
$$\text{Coefficient of variation: } CV = \frac{\sigma}{\mu} \times 100$$
This measure will only be used occasionally.
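Applied to the two distributions above, the formula gives the 10% and 2% mentioned in the text. The function name below is a hypothetical helper, shown only as a sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean

print(coefficient_of_variation(3, 30))    # 10.0
print(coefficient_of_variation(3, 150))   # 2.0
```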
Measures of Position
Measures of position are used for quantitative variables, measured at the numerical scale level. They could sometimes be used for variables measured at the ordinal level. They provide us with a way of determining how one individual entry compares with all the others. The simplest measure of position is the quartile. If you list your entries in an ascending order according to size, the quartiles are the values that split the ranked population into four equal groups. Twenty-five percent of the population has a score less than or equal to the 1st quartile ($Q_1$), 50% has a score less than or equal to the 2nd quartile ($Q_2$), and 75% has a score less than or equal to the 3rd quartile ($Q_3$). Recall that we have seen earlier a measure of dispersion called the interquartile range, which is the difference between $Q_3$ and $Q_1$. Figure 3.1 illustrates the way the quartiles divide the ordered list of units in a sample or in a population.
[Diagram: four consecutive blocks, each labelled "25% of the population", with the quartiles marked at the boundaries between them]
Figure 3.1     The quartiles are obtained by ordering the individuals in the population by increasing rank, and then splitting it into four equal parts. The quartiles are the values that separate these four parts
In a similar way, we can define the deciles: they split the ranked population into ten equal groups. If a data entry falls in the first decile it means that its score is among the lowest 10%. If it is in the 10th decile it means it is among the top 10%.
The most common measure of position, however, is the percentile rank. The data is arranged by order of size (recall it must be quantitative) and divided into 100 equal groups. The numerical values that separate these 100 groups are called percentiles. The percentile rank of a data entry is the rank of the percentile group this entry falls into. For example, if you are told that your percentile rank in a national exam is 83, this means that you fall within the 83rd percentile. Your grade is just above that of 82% of the population, and just below that of 17% of the population. You will learn in the SPSS session how to display the percentile ranks of the data entries.
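One simple way to compute such a rank is sketched below. This is an assumption-laden illustration, not the procedure used by SPSS: the function `percentile_rank` is hypothetical, it counts the share of scores strictly below the given score, and the exam scores are invented.

```python
import math

def percentile_rank(score, all_scores):
    """Rank (1 to 100) of the percentile group that `score` falls into."""
    below = sum(1 for s in all_scores if s < score)
    share_below = 100 * below / len(all_scores)   # percentage of scores below
    return min(100, math.floor(share_below) + 1)

scores = list(range(1, 201))          # hypothetical exam scores 1..200
print(percentile_rank(165, scores))   # 83: above 82% of the scores
```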
You may have realized by now the connection between the median and the various measures of position, since the median divides your ranked population into two equal groups. The median is equal to the 50th percentile. It is also equal to the 5th decile, and of course the 2nd quartile.
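These equalities can be verified with Python's standard `statistics` module on the 13 entries used earlier for variable X. The cut points are computed with the module's default method; the variable names are illustrative.

```python
import statistics

entries = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 60]

median = statistics.median(entries)
second_quartile = statistics.quantiles(entries, n=4)[1]        # 2nd of 3 cut points
fifth_decile = statistics.quantiles(entries, n=10)[4]          # 5th of 9 cut points
fiftieth_percentile = statistics.quantiles(entries, n=100)[49] # 50th of 99 cut points

# All four values coincide.
print(median, second_quartile, fifth_decile, fiftieth_percentile)
```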
[Bar chart; X-axis: Marital Status (Missing, married, widowed, divorced, separated, never married)]
[Bar chart; X-axis: Age Categories (18-29, 30-39, 40-49, 50+)]
[...] people who speak a given language. You can choose to have the Y-axis represent percentages instead of counts. The chart shown in Figure 3.2 represents the percentages of the various marital categories.
[Bar chart; X-axis: Age into 7 categories (18-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81+)]
Figure 3.4    A bar chart where the category 50+ years has been broken down into four categories
The variable on the X-axis could also be a quantitative variable that has been grouped into a small number of categories. For instance, we could have agecat4 as the variable on the X-axis. The bars would then represent the number of people found in each of the four age categories. In this kind of bar graph, you must be careful about the range (that is, the length of the interval) of each of the categories. If the categories are intervals that do not have the same length, you may get the wrong impression that one group is more numerous than the other, such as with the group of people who are 50 years old or more in the chart shown in Figure 3.3.
However, this group (50 years and older) spans a range of ages which is much wider than the other groups: close to 40 years (from 50 years to 89 years exactly). If we regroup the respondents into age categories that are equal or almost equal, we get the chart in Figure 3.4.
This bar chart is a much better representation of the distribution of ages than the previous one.
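The effect of unequal category widths can be seen on a small invented data set. In this sketch the ages are hypothetical; only the category boundaries follow the two charts discussed above, and `count_by_category` is an illustrative helper.

```python
from collections import Counter

def count_by_category(ages, categories):
    """Count ages per (label, low, high) category, bounds inclusive."""
    counts = Counter()
    for age in ages:
        for label, low, high in categories:
            if low <= age <= high:
                counts[label] += 1
                break
    return counts

# Hypothetical ages, for illustration only.
ages = [19, 24, 28, 33, 37, 42, 45, 48, 52, 57, 63, 68, 74, 79, 85]

unequal = [("18-29", 18, 29), ("30-39", 30, 39), ("40-49", 40, 49), ("50+", 50, 89)]
equal = [("18-30", 18, 30), ("31-40", 31, 40), ("41-50", 41, 50), ("51-60", 51, 60),
         ("61-70", 61, 70), ("71-80", 71, 80), ("81+", 81, 89)]

print(dict(count_by_category(ages, unequal)))   # "50+" dominates: 7 people
print(dict(count_by_category(ages, equal)))     # no category exceeds 3 people
```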
In a clustered bar chart, each column is subdivided into several columns representing the categories of a second variable. For instance, each column could be split in two, for men and for women. Figure 3.5 provides an example of a clustered bar chart where the height of the columns represents the number of people in each category.
In a clustered bar chart, it is generally preferable to display the percentages of the various categories rather than their frequencies. Look for instance at the clustered bar chart displayed in Figure 3.5. We see that in every category, women are more numerous than men. This is so because the sample as a whole contains more women. This chart does not allow us to assess how the percentages of men and women compare in each category. If we display the percentages rather than the frequencies (the count), we get the chart illustrated in Figure 3.6.
[Clustered bar chart; X-axis: Marital Status (Missing, married, widowed, divorced, separated, never married); legend: Respondent's Sex (Male, Female)]
Figure 3.5    A clustered bar chart where the height of the columns represents the number of people in each category
[Clustered bar chart; X-axis: Marital Status (Missing, married, widowed, divorced, separated, never married); legend: Respondent's Sex (Male, Female)]
Figure 3.6    A clustered bar chart displaying the percentages rather than the frequencies


We see now that percentage-wise, there are a lot more women who are widows than men who are widowers. In that sample, it also happens that the divorced women are slightly more numerous than the divorced men (divorced women whose ex-husband has died are not counted in the Widowed category but in the Divorced category). Although the sample used here is not necessarily representative of the whole American population, it does illustrate a social reality: as in many other societies, women tend to live longer than men. Therefore, the percentage of women in the categories Widowed and Divorced is larger than the percentage of men, and consequently lower than the percentage of men in all other categories, even if their numbers are bigger.
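The difference between comparing counts and comparing percentages can be sketched with invented figures. The counts below are hypothetical (they are not the GSS93 values) and were chosen so that women are more numerous than men in every category, as in the sample described above; `percentages_within` is an illustrative helper.

```python
# Hypothetical counts, for illustration only (not the GSS93 figures).
counts = {
    "Male":   {"married": 348, "widowed": 30, "divorced": 78, "never married": 144},
    "Female": {"married": 400, "widowed": 128, "divorced": 120, "never married": 152},
}

def percentages_within(group_counts):
    """Convert the counts of one group into percentages of that group's total."""
    total = sum(group_counts.values())
    return {category: round(100 * n / total, 1) for category, n in group_counts.items()}

for sex, group_counts in counts.items():
    print(sex, percentages_within(group_counts))
```

With these figures, women outnumber men in every category, yet the percentage of women who are married (50%) is lower than the percentage of men who are married (58%), while the percentage who are widowed is higher (16% against 5%).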
In a stacked bar chart, rather than being adjacent, the split columns are stacked one on top of the other, as shown in Figure 3.7.
[Stacked bar chart; X-axis: Marital Status (Missing, married, widowed, divorced, separated, never married); legend: Respondent's Sex (Male, Female)]
Figure 3.7    A stacked bar chart
The advantage of a stacked bar chart, as opposed to a clustered bar chart, is that it shows the overall importance of the categories (married, widowed, etc.), while at the same time showing how they are broken down into the categories of another variable such as Sex.
Bar charts are most adequate when you want to highlight the quantity associated with every category on the X-axis. A bar chart where the vertical axis does not start at 0 can be very misleading, for if the columns are truncated at their base, the differences in height between them can appear to be more important than they really are. Consequently, as a general rule, bar charts should start at zero and should not be truncated from their base.
Finally, it should be said that bar charts could also be presented horizontally, by interchanging the X- and Y-axes.
Pie Charts
Pie charts (Figure 3.8) are most useful when you want to illustrate proportions, rather than actual quantities. They show the relative importance of the various categories of the variable. In SPSS you have the option of including missing values as a slice in the pie, or excluding them and dividing the pie among valid answers. The details of how to do that are explained in Lab 5. Pie charts are better suited when we want to convey the way a fixed amount of resources is allocated among various uses. For instance, the way a budget is spent over various categories of items is best represented by a pie chart. When the emphasis is on the amount of money spent on each budget item, rather than on the way the budget is allocated, a bar chart is more suggestive. However, both bar charts and pie charts are appropriate to represent the distribution of a nominal variable, and there is no clear-cut line of demarcation that would tell us which of the two is preferable.
[Pie chart; slices: owns home, pays rent, other]
Figure 3.8    Pie chart illustrating the proportion of people who own a home as compared to those who pay rent. One of the options in the pie chart command allows you to either include or exclude the category of missing answers. In this diagram it has been excluded from the graph
Histograms
Histograms are useful when the variable is quantitative. The data are usually grouped into classes, or intervals, and then the frequency of each class is represented by a bar. The bars in a histogram are adjacent, and not separated as in a bar chart, because the numerical values are continuously increasing. For instance, if you draw the histogram of the variable Respondent's Age (Figure 3.9), you will see the pattern of the distribution of the individuals of the sample across the various categories. Contrary to a bar chart, which is used for a qualitative variable, the columns of the histogram cannot be switched around. You can switch around the categories of a variable measured at the nominal level, but not those of an ordinal or quantitative variable.
[Histogram; X-axis: Age of Respondent; Std. Dev = 17.42, Mean = 46.2, N = 1495.00]
Figure 3.9    [...] of the histogram for the variable Age
When producing a histogram with SPSS, the program automatically selects the number of classes (usually no more than 15) and divides the range of values accordingly into intervals of equal size. In the histogram shown in Figure 3.9, the midpoints of the classes are shown on the graph. They are:
20, 25, 30, 35, etc.
Therefore, the class limits (that is, the cut-points between one class and the next) are the values in between: 22.5, 27.5, 32.5, etc. We can infer that the lower limit of the first class is 17.5 years, and the upper limit of the last class is 92.5 years.
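The inference from midpoints to class limits can be sketched in Python. The last midpoint (90) is an assumption deduced from the stated upper limit of 92.5; the variable names are illustrative.

```python
midpoints = list(range(20, 95, 5))        # 20, 25, ..., 90 (last value assumed)
width = midpoints[1] - midpoints[0]       # equal class width: 5 years

# Each class extends half a width on either side of its midpoint.
limits = [(m - width / 2, m + width / 2) for m in midpoints]

print(len(midpoints))        # 15 classes
print(limits[0][0])          # 17.5, lower limit of the first class
print(limits[-1][1])         # 92.5, upper limit of the last class
```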