INTERPRETING QUANTITATIVE DATA WITH SPSS
Mean for grouped data
INTERPRETATION OF THE MEAN WHEN THE VARIABLE IS CODED
We often have data files where a quantitative variable is not given in its original form, but coded into a small number of categories. For instance, the variable Respondent's Income could be given in the form shown in Table 3.2.
Table 3.2 Example of a quantitative variable that is coded into 21 categories, with a 22nd category for those who refused to answer
Category	Code
Less than $1000	1
$1000-2999	2
$3000-3999	3
$4000-4999	4
$5000-5999	5
$6000-6999	6
$7000-7999	7
$8000-9999	8
$10,000-12,499	9
$12,500-14,999	10
$15,000-17,499	11
$17,500-19,999	12
$20,000-22,499	13
$22,500-24,999	14
$25,000-29,999	15
$30,000-34,999	16
$35,000-39,999	17
$40,000-49,999	18
$50,000-59,999	19
$60,000-74,999	20
$75,000 and more	21
Refused to answer	22
Thus, we would not know the exact income of a respondent. We would only know the category he or she falls into.
This kind of measuring scale poses a challenge. If we compute the mean with SPSS, we will not get the mean income. We will get the mean code, because it is the codes that are used to perform the computations. There is a data file that comes with SPSS where the income is coded in this way. This data file contains information about 1500 respondents, including information on the income bracket they fall into, coded as shown in Table 3.2. When we exclude the 22nd category, which consists of the people who refused to answer this question, the computation of the mean with SPSS produces the following result:
Mean = 12.35
UNIVARIATE DESCRIPTIVE STATISTICS
What is the use of this number? It is not a dollar amount! If we look at Table 3.2, we see that the code 12 stands for an income of between $17,500 a year and $20,000 a year (with that last number excluded from the category). To interpret this number, we should first translate it into a dollar amount (it can be done with a simple rule). But even without transforming it into the dollar amount it corresponds to, we could use the mean code for comparisons. For instance, we will see in Lab 3 that if we compute the mean income separately for men and women, we get
Mean income for men: 13.9
Mean income for women: 10.9
(excluding the category of people who refused to answer).
Although the mean code does not tell us exactly the mean income for men and women, it still tells us that there is a big difference between men and women for that variable. Table 3.2 tells us that the code 13 corresponds to the income bracket $20,000-22,499 and the code 14 to $22,500-24,999, while the code 10 represents the income bracket $12,500-14,999 and the code 11 the bracket $15,000-17,499. Each of these brackets is $2,500 wide and the two means are three codes apart, so we can conclude that the difference in income between men and women, for that sample, is roughly $7,500 a year.
We see that when the variables are coded, the interpretation of the mean requires us to translate the value obtained into what it stands for. For quantitative variables coded this way, it may also be useful to find the frequencies of the various categories, as we did for nominal variables. For the example at hand, we would get Table 3.3 as shown.
The conclusion of the preceding discussion is that when we have an ordinal variable with few categories, or even a quantitative variable that has been recoded into a small number of categories, it may be useful to compute the frequency table of the various categories, in addition to the mean and other descriptive measures.
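As a rough illustration outside SPSS, the two computations discussed in this section (the mean of the codes with the refusals excluded, and the frequency table of the categories) might be sketched in Python as follows. The ten codes are invented for the example and are not taken from the SPSS data file; only the meaning of code 22 comes from Table 3.2.

```python
from collections import Counter
from statistics import mean

REFUSED = 22  # code for "Refused to answer" in Table 3.2

# Hypothetical income codes for ten respondents (illustrative values only).
codes = [3, 9, 12, 12, 14, 16, 18, 22, 22, 12]

# Exclude the refusals before averaging, as described in the text.
valid = [c for c in codes if c != REFUSED]

mean_code = mean(valid)        # a mean *code*, not a dollar amount
frequencies = Counter(valid)   # number of respondents in each category

print(f"Mean code: {mean_code:.2f}")
for code in sorted(frequencies):
    print(f"code {code}: {frequencies[code]} respondent(s)")
```

The mean printed here is still a code and has to be read against Table 3.2 before it says anything about dollars.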
Weighted Means
Consider the following situation: you want to find the average grade in an exam for two classes of students. The first class averaged 40 out of 50 in the exam, and the second class averaged 46 out of 50. If you put the two classes together, you cannot conclude that the average is 43. This is so because the classes may have different numbers of students. Suppose the first class has 20 students, and the second one 40 students. In other words, we have the data shown in Table 3.4.
To compute the average grade for the two classes taken together, we do not need to know the individual scores of each student. Indeed, we have seen before that a sum of n scores is equal to its average times n. We will use this to obtain the formula shown below for weighted means.
The mean for the two classes taken together can be written as
Table 3.3 Frequencies of the various income categories for the variable Income
Respondent's income	Frequency	Valid Percent
LT $1000	26	2.6
$1000-2999	36	3.6
$3000-3999	30	3.0
$4000-4999	24	2.4
$5000-5999	23	2.3
$6000-6999	23	2.3
$7000-7999	15	1.5
$8000-9999	11	1.1
$10,000-12,499	55	5.5
$12,500-14,999	54	5.4
$15,000-17,499	64	6.4
$17,500-19,999	58	5.8
$20,000-22,499	55	5.5
$22,500-24,999	61	6.1
$25,000-29,999	[illegible]	[illegible]
$30,000-34,999	83	8.4
$35,000-39,999	54	5.4
$40,000-49,999	66	6.6
$50,000-59,999	38	3.8
$60,000-74,999	23	2.3
$75,000+	44	4.4
Refused to answer	47	4.7
Total	[illegible]	100.0
Missing	506
Grand Total	1500
Table 3.4 Two classes of different size and the mean grade in each
	Average Grade out of 50	Number of Students
Class A	40	20
Class B	46	40
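\[ \text{mean of the two classes combined} = \frac{\text{sum of all scores in class A} + \text{sum of all scores in class B}}{\text{total number of students}} \]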
The sum of all scores in class A can be replaced by the average score (40) times 20, since there are 20 students in this class. And the sum of all scores in class B can be replaced also by its average score (46) times 40, since this class includes 40 students. The equation for the mean becomes:
\[ \text{mean of the two classes combined} = \frac{(40 \times 20) + (46 \times 40)}{60} \]
This can now be written as:
mean of the two classes combined = 40 x (20/60) + 46 x (40/60)


or again as:
mean of the two classes combined = 40 x (1/3) + 46 x (2/3)
The last formula is important: we see that the average grade of class A is multiplied by the weight of class A, which is its relative importance in the total population. Class A forms 1/3 of the total population (20 students out of 60) and class B 2/3 of the total (40 students out of 60). The underlying formula is:
Average grade for the two classes: \(40 \times w_1 + 46 \times w_2\)
The w's are called the weights of the various classes. In this case, the weight is an expression of the number of people in each class compared to the total population of the two classes.
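The two routes described above (rebuilding the sums of scores, or multiplying each average by its weight) can be checked numerically. The following is only a small Python sketch using the figures of Table 3.4; the variable names are chosen for the illustration.

```python
# Averages and class sizes from Table 3.4.
averages = {"A": 40, "B": 46}
sizes = {"A": 20, "B": 40}

total_students = sum(sizes.values())  # 20 + 40 = 60

# Route 1: rebuild each class's sum of scores (average x size), then divide.
combined_from_sums = sum(averages[c] * sizes[c] for c in averages) / total_students

# Route 2: multiply each average by its weight (the class's share of all students).
weights = {c: sizes[c] / total_students for c in sizes}
combined_from_weights = sum(averages[c] * weights[c] for c in averages)

print(combined_from_sums)
print(round(combined_from_weights, 10))
```

Both routes give 44, not the 43 that a simple average of 40 and 46 would suggest, because class B counts twice as much as class A.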
The general formula is as follows.
If you have n values \(x_1, x_2, x_3, \ldots\), etc.,
each having the corresponding weights \(w_1, w_2, w_3, \ldots\), etc.,
the weighted mean is given by
\[ x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n \]
The weights are positive numbers and must add up to 1. That is:
\[ w_1 + w_2 + w_3 + \cdots + w_n = 1. \]
The weights are not always a reflection of the size of the various groups involved. If you are computing the weighted average of your grades during your college studies, the weights could be proportional to the credits given to each course, or they could be an expression of the importance of the course in a given program of studies. A Faculty of Medicine may weight the grades of its candidates by giving a bigger weight to Chemistry and Biology than Art History, for instance.
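When the weights are proportional to credits, they first have to be rescaled so that they add up to 1. A minimal Python sketch of that step is given here; the three grades and credit values are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical courses: (grade out of 100, credits). Illustrative values only.
courses = [(80, 3), (70, 4), (90, 2)]

total_credits = sum(credits for _, credits in courses)

# Turn the credits into weights that add up to 1.
weights = [credits / total_credits for _, credits in courses]

weighted_average = sum(grade * w for (grade, _), w in zip(courses, weights))
simple_average = sum(grade for grade, _ in courses) / len(courses)

print(round(weighted_average, 2), simple_average)
```

The weighted average comes out below the simple average of 80 because the course with the most credits has the lowest grade.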
Example
A buyer wants to evaluate several houses she has seen. She attributes a score out of ten to each house on each of the following items: size, location, internal design, and quality of construction. Any house having a score less than 5 on any item would not be acceptable. The resulting scores for three houses that are seen as acceptable on all grounds are recorded in Table 3.5. The buyer does not

attribute the same importance to each item. The size of the house is the most important quality. The quality of the construction is also very important, but not as important. The buyer attributes a weight to each item, which reflects the importance of that item for her. The weights are given in the last column.
Table 3.5 Scores given to three houses on four items, and their weights
Item	House A	House B	House C	Weight of item
Size	10	7	6	0.4
Location	5	9	10	0.1
Internal design	6	5	8	0.2
Quality of construction	7	9	7	0.3
We can now calculate the weighted average score for each house, using the formula for weighted means given above.
For house A: weighted mean score: 10 x 0.4 + 5 x 0.1 + 6 x 0.2 + 7 x 0.3 = 7.8
For house B: weighted mean score: 7 x 0.4 + 9 x 0.1 + 5 x 0.2 + 9 x 0.3 = 7.4
For house C: weighted mean score: 6 x 0.4 + 10 x 0.1 + 8 x 0.2 + 7 x 0.3 = 7.1
We see that house A obtained the highest weighted score. The total, unweighted score of house C is higher than that of house A. But because the items do not all have the same importance, house A ended up having a higher weighted score.
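The house comparison can be reproduced with a small general-purpose function. This is a sketch only: the function name `weighted_mean` and its checks are choices made for the illustration, and the scores are the ones used in the three calculations above.

```python
import math

def weighted_mean(values, weights):
    """Return the sum of x_i * w_i; the weights must be positive and add up to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must add up to 1")
    return sum(x * w for x, w in zip(values, weights))

# Item order: size, location, internal design, quality of construction.
weights = [0.4, 0.1, 0.2, 0.3]
houses = {
    "A": [10, 5, 6, 7],
    "B": [7, 9, 5, 9],
    "C": [6, 10, 8, 7],
}

for name, scores in houses.items():
    # Weighted score next to the plain, unweighted total.
    print(name, round(weighted_mean(scores, weights), 2), sum(scores))
```

The unweighted totals are 28, 30 and 31 for houses A, B and C, so the ranking is reversed once the weights are applied.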
THE MEDIAN AND THE MODE
The median is another measure of central tendency for quantitative variables. It is defined as the value that sits right in the middle of all data entries when they are listed in ascending order. If the number of entries is odd, there will be one data entry right in the middle. If the number of entries is even, we will have two data entries in the middle, and the median in this case will be their average. Here are two examples.
Case 1: variable X   2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13
Case 2: variable Y   2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13
For the variable X we have 13 entries. The value 5 sits in the middle, with six entries equal or smaller than it, and six entries equal or larger. The median for X is thus 5. But for variable Y, we have 12 entries. There are therefore two entries in the middle of the ordered list, not just one. The median will be the average of the two, that is (5 + 6) ÷ 2 = 5.5.
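The rule for odd and even numbers of entries can be written out step by step. The Python sketch below uses a hand-written helper (the name `middle_value` is chosen for the illustration) and compares it with the `median` function of Python's standard `statistics` module.

```python
from statistics import median

x = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13]  # 13 entries (odd)
y = [2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13]     # 12 entries (even)

def middle_value(entries):
    """Sort the entries, then take the middle one or the average of the two middle ones."""
    ordered = sorted(entries)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd: a single middle entry
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average of the two

print(middle_value(x), median(x))
print(middle_value(y), median(y))
```

Both approaches give 5 for X and 5.5 for Y.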
The median is not sensitive to extreme values. Suppose, for instance, that the entries for variable X were: 2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 60. Although the last
entry is very large compared to the others, it does not affect the median, which is still 5. The mean, however, would have been affected (compute it yourself for the two situations and see how different it would be). For this reason, the median is a better representative of the center when there are extremely large values on one side of it. But the mean is more useful for statistical computations, as we will see in the coming sections.
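The comparison suggested in the parentheses can be carried out in a few lines. This sketch relies only on Python's standard `statistics` module and the two lists of entries for X given in the text.

```python
from statistics import mean, median

x_original = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13]
x_extreme = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 60]  # last entry replaced by 60

for label, data in (("original", x_original), ("with 60", x_extreme)):
    print(f"{label}: median = {median(data)}, mean = {mean(data):.2f}")
```

The median stays at 5 in both cases, while the mean moves from about 6.62 to about 10.23.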
Half the population has a score that is lower than or equal to the median, and the other half has a score larger than the median or equal to it. This way of formulating the median is very useful in situations where the distribution is skewed (such as the distribution of income) or in situations where time is involved, especially when processes have not been completed by everybody, as illustrated below.
Examples of the use of the median
•    We are told that the average age at first marriage for a population is 22 years for women, and 25 for men. The median for women is 21, and for men it is 24. This means that by the time they reached 21 years of age, half the women in this population were married. For men, half of them were married by the age of 24.
•    In a study of the time taken by immigrants to find a job, 500 new immigrants who arrived at least three years ago are interviewed. The mean cannot be found because some of them have not found a regular or full-time job yet. But it is found that the median time taken for them to find a regular, full-time job was 18 months for men, and 5 months for women. This means that by the 18th month after arrival, 50% of the men had found a job. Women were faster in finding regular full-time jobs: 50% had a job within 5 months of their date of arrival.
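The second situation, where some people have not yet completed the process, can be imitated with a toy data set. In the Python sketch below, the nine waiting times are invented, and representing "no job yet" by `math.inf` is merely one convenient convention assumed for the illustration.

```python
import math
from statistics import median

# Hypothetical months-to-first-job for nine immigrants; math.inf marks the
# people who had not yet found a job when they were interviewed.
months = [2, 4, 5, 7, 9, 14, 20, math.inf, math.inf]

found = [m for m in months if m != math.inf]

# The median only needs the ordered list, so the unfinished cases simply
# sit at the top end of it.
median_months = median(months)

# Averaging only those who found a job ignores the unfinished cases and
# therefore understates the typical waiting time.
mean_of_finished = sum(found) / len(found)

print(median_months, round(mean_of_finished, 2))
```

This works only as long as fewer than half of the cases are unfinished; otherwise the middle of the ordered list itself falls among the people still waiting.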
Because the median involves only the ordered list of data entries, it can be used if the quantitative variable is measured at the ordinal level. But if the number of categories is small, the median is not very useful.
The mode can also be used for quantitative variables. When the values are grouped into classes, the mode is defined as it is for qualitative variables: it is the class that has the highest frequency. But the mean and median remain the best descriptive measures for quantitative variables. If the variable is continuous and the values have not been grouped into classes, the mode is the value at which a peak occurs in the graph representing the distribution.
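For values that are listed individually or grouped into classes, finding the mode amounts to finding the highest frequency. A brief Python sketch follows; the list reuses the entries of variable X from the median examples, while the class frequencies are hypothetical.

```python
from statistics import multimode

# Ungrouped values: the entries of variable X used earlier.
x = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13]
print(multimode(x))  # most frequent value(s), returned as a list

# Grouped values: class label -> frequency (hypothetical figures).
class_frequencies = {"0-9": 4, "10-19": 11, "20-29": 7}
modal_class = max(class_frequencies, key=class_frequencies.get)
print(modal_class)
```

`multimode` returns a list because several values can share the highest frequency; here the value 5 occurs three times and is the only mode.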
COMPARISON OF THE MEAN AND THE MEDIAN
Both the mean and the median are measures of central tendency of a distribution, that is, they give us a central value around which the other values are found. They are therefore very useful for comparing different samples, or different populations,
or samples with a population, or a given population at different moments in time to see how it has evolved. However, each of the mean and the median has its advantages and drawbacks.
The mean takes into account every single value that occurs in the data. Therefore, it is sensitive to every value. A single very large value can boost the mean up if the number of entries is not very large. For instance, if one worker in a group of 20 workers won $1 million in a lottery, the average wealth of those 20 would look artificially high. The median is not sensitive to every single value. In a distribution where the largest value is changed from 60 to 600, the median would not change. The mean would.
It follows from these remarks that the mean is a more sophisticated measure, because it takes every value into account. Indeed, it is the mean that is used to compute the standard deviation, which is a measure of dispersion that will be seen below. However, in situations where the distribution is not very symmetric, and where there are some extreme values on only one side of the distribution, the mean will tend to be shifted towards the extreme values, whereas the median will stay close to the bulk of the data. Therefore, whenever the distribution is highly skewed, the median is a better representative of the center of the distribution than the mean. This is true for variables such as income or wealth, where the distribution among individuals in a country, and also worldwide, is highly skewed. For such a variable, the median is a more accurate representative of the central tendency of the distribution.
Measures of Dispersion
For Qualitative Variables
There are not many measures of dispersion for qualitative variables. One of the measures we can compute is the variation ratio. It tells us whether a large proportion of data is concentrated in the modal category, or whether it is spread out over the other categories. The variation ratio is defined as
\[ \text{variation ratio} = \frac{\text{number of entries not in the modal class}}{\text{total number of entries}} \]
It is a positive number smaller than one. If this ratio is close to zero, it indicates a great homogeneity, almost every unit being in the modal class. The farther it is from zero, the greater the dispersion of the data over the other categories. Like many other measures, this one is easy to interpret when doing comparisons. For instance, if we compare the sizes of the various linguistic groups in two cities where several languages are spoken, we can use the variation ratio to assess the degree of heterogeneity in each city. Here is an example.
City	Linguistic groups	Percentage
City A	[illegible]	[illegible]
City B	French speaking	28%
	English speaking	40%
	Chinese speaking	20%
	[illegible]	12%
	Total	100%
The variation ratio for city A would be ([illegible] + 20 + [illegible])/100 = 0.6[illegible], and for city B it would be (28 + 20 + 12)/100 = 0.60, showing that city A is a little more heterogeneous than city B.
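The definition of the variation ratio translates directly into a short function. The Python sketch below applies it to the City B percentages as far as they can be read from the calculation above (28, 20 and 12 outside a modal group of 40); the function name is chosen for the illustration.

```python
def variation_ratio(frequencies):
    """Share of entries that fall outside the modal (most frequent) class."""
    total = sum(frequencies)
    return (total - max(frequencies)) / total

# City B percentages; 40 is the modal group, the other three are non-modal.
city_b = [28, 40, 20, 12]
print(variation_ratio(city_b))

# A fully homogeneous population has a ratio of 0.
print(variation_ratio([100, 0, 0, 0]))
```

Because only the size of the modal class and the total matter, the function works equally well with raw counts or with percentages.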
For Quantitative Variables
There are many ways of measuring the dispersion for quantitative variables. The simplest is the range, but we also have various forms of restricted range, we have the deviation from the mean, the standard deviation, the variance and finally the coefficient of variation. Let us go through these measures one at a time.
RANGE
The range is the simplest way of measuring how spread out the data is. You simply subtract the smallest entry from the largest one and add 1, and this tells you the size of the interval over which the data is spread out. For example, you would describe a range of values for the variable Age as follows:
In this sample, the youngest person is 16 years old and the oldest 89, spanning a range of 74 years (89 - 16 + 1).
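The range as defined here (largest entry minus smallest entry, plus 1) can be sketched in Python as follows. The list of ages is hypothetical apart from its smallest and largest values, which are the 16 and 89 of the example.

```python
def inclusive_range(values):
    """Largest entry minus smallest entry, plus 1, as defined in the text."""
    return max(values) - min(values) + 1

# Hypothetical ages; only the youngest (16) and the oldest (89) affect the result.
ages = [16, 23, 35, 47, 62, 89]
print(inclusive_range(ages))
```

This prints 74, matching the calculation above, and it makes plain that the entries between the two extremes play no part in the result.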
But we may have extreme values that give a misleading impression about the dispersion of the data. For instance, suppose that a retired person decided to enroll in one of our classes. We could then say that the ages of the students in this class range from [illegible] years up to 69 years, but that would be misleading, as the great majority of students are somewhere between 17 years old and maybe 23 or 24 years old. We can therefore introduce variants of the notion of range. One such variant, for instance, computes the range of values after we have dropped 10% of the data at each end: the 10% largest entries and the 10% smallest