INTERPRETING QUANTITATIVE DATA WITH SPSS

Mean for grouped data

INTERPRETATION OF THE MEAN WHEN THE VARIABLE IS CODED

We often have data files where a quantitative variable is not given in its original form, but coded into a small number of categories. For instance, the variable Respondent's Income could be given in the form shown in Table 3.2.

Table 3.2 Example of a quantitative variable that is coded into 21 categories, with a 22nd category for those who refused to answer

Category              Code
Less than $1000          1
$1000-2999               2
$3000-3999               3
$4000-4999               4
$5000-5999               5
$6000-6999               6
$7000-7999               7
$8000-9999               8
$10,000-12,499           9
$12,500-14,999          10
$15,000-17,499          11
$17,500-19,999          12
$20,000-22,499          13
$22,500-24,999          14
$25,000-29,999          15
$30,000-34,999          16
$35,000-39,999          17
$40,000-49,999          18
$50,000-59,999          19
$60,000-74,999          20
$75,000 and more        21
Refused to answer       22

Thus, we would not know the exact income of a respondent. We would only know the category he or she falls into. This kind of measuring scale poses a challenge. If we compute the mean with SPSS, we will not get the mean income. We will get the mean code, because it is the codes that are used to perform the computations.

There is a data file that comes with SPSS where the income is coded in this way. This data file contains information about 1500 respondents, including information on the income bracket they fall into, coded as shown in Table 3.2. When we exclude the 22nd category, which consists of the people who refused to answer this question, the computation of the mean with SPSS produces the following result:

Mean = 12.35

What is the use of this number? It is not a dollar amount! If we look at Table 3.2, we see that the code 12 stands for an income between $17,500 a year and $20,000 a year (with that last number excluded from the category). To interpret this number, we should first translate it into a dollar amount (it can be done with a simple rule).
But even without transforming it into the dollar amount it corresponds to, we could use the mean code for comparisons. For instance, we will see in Lab 3 that if we compute the mean income separately for men and women, we get

Mean income for men: 13.9
Mean income for women: 10.9

(excluding the category of people who refused to answer). Although the mean code does not tell us exactly the mean income for men and women, it still tells us that there is a big difference between men and women for that variable. Table 3.2 tells us that the code 14 corresponds to the income bracket $22,500-25,000, while the code 10 represents the income bracket $12,500-15,000. We can conclude that the difference in income between men and women, for that sample, is roughly $10,000 a year.

We see that when the variables are coded, the interpretation of the mean requires us to translate the value obtained into what it stands for. For quantitative variables coded this way, it may also be useful to find the frequencies of the various categories, as we did for nominal variables. For the example at hand, we would get Table 3.3 as shown.

The conclusion of the preceding discussion is that when we have an ordinal variable with few categories, or even a quantitative variable that has been recoded into a small number of categories, it may be useful to compute the frequency table of the various categories, in addition to the mean and other descriptive measures.

Weighted Means

Consider the following situation: you want to find the average grade in an exam for two classes of students. The first class averaged 40 out of 50 in the exam, and the second class averaged 46 out of 50. If you put the two classes together, you cannot conclude that the average is 43. This is so because the classes may have different numbers of students. Suppose the first class has 20 students, and the second one 40 students. In other words, we have the data shown in Table 3.4.
Table 3.3 Frequencies of the various income categories for the variable Income

Respondent's income     Frequency   Valid Percent
Less than $1000             26           2.6
$1000-2999                  36           3.6
$3000-3999                  30           3.0
$4000-4999                  24           2.4
$5000-5999                  23           2.3
$6000-6999                  23           2.3
$7000-7999                  15           1.5
$8000-9999                 [...]        [...]
$10,000-12,499              55           5.5
$12,500-14,999              54           5.4
$15,000-17,499              64           6.4
$17,500-19,999              58           5.8
$20,000-22,499              55           5.5
$22,500-24,999              61           6.1
$25,000-29,999             [...]        [...]
$30,000-34,999              83           8.4
$35,000-39,999              54           5.4
$40,000-49,999              66           6.6
$50,000-59,999              38           3.8
$60,000-74,999              23           2.3
$75,000+                    44           4.4
Refused to answer           47           4.7
Total                      994         100.0
Missing                    506
Grand Total               1500

Table 3.4 Two classes of different size and the mean grade in each

           Average Grade out of 50   Number of Students
Class A             40                       20
Class B             46                       40

To compute the average grade for the two classes taken together, we do not need to know the individual scores of each student. Indeed, we have seen before that a sum of n scores is equal to its average times n. We will use this to obtain the formula shown below for weighted means. The mean for the two classes taken together can be written as

\[
\text{mean} = \frac{\text{Sum of all scores in class A} + \text{Sum of all scores in class B}}{60}
\]

The sum of all scores in class A can be replaced by the average score (40) times 20, since there are 20 students in this class. And the sum of all scores in class B can be replaced by its average score (46) times 40, since this class includes 40 students. The equation for the mean becomes:

\[
\text{mean} = \frac{(40 \times 20) + (46 \times 40)}{60}
\]

This can now be written as:

mean of the two classes combined = 40 × (20/60) + 46 × (40/60)

or again as:

mean of the two classes combined = 40 × (1/3) + 46 × (2/3)

The last formula is important: we see that the average grade of class A is multiplied by the weight of class A, which is its relative importance in the total population.
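As an illustrative check of the arithmetic above, the following minimal Python sketch recomputes the combined mean of the two classes in two ways: from the reconstructed sums of scores, and as a weighted mean with weights 20/60 and 40/60. The function name `combined_mean` and the variable names are hypothetical, chosen only for this illustration; exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction


def combined_mean(group_means, group_sizes):
    """Mean of several groups pooled together, from each group's mean and size."""
    # The sum of the scores in a group equals its mean times its size.
    total_score = sum(m * n for m, n in zip(group_means, group_sizes))
    return Fraction(total_score, sum(group_sizes))


means = [40, 46]   # average grade of class A and class B (Table 3.4)
sizes = [20, 40]   # number of students in class A and class B

pooled = combined_mean(means, sizes)

# Same result written as a weighted mean: each weight is the class's
# share of the total population.
weights = [Fraction(n, sum(sizes)) for n in sizes]
weighted = sum(m * w for m, w in zip(means, weights))

print(pooled, weights, weighted)
```

Both routes give 44, not the 43 obtained by simply averaging 40 and 46.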
Class A forms 1/3 of the total population (20 students out of 60) and class B 2/3 of the total (40 students out of 60). The underlying formula is:

\[
\text{Average grade for the two classes} = 40 \times w_1 + 46 \times w_2
\]

The \(w_i\) are called the weights of the various classes. In this case, the weight is an expression of the number of people in each class compared to the total population of the two classes.

The general formula is as follows. If you have n values \(x_1, x_2, x_3, \ldots\), each having the corresponding weights \(w_1, w_2, w_3, \ldots\), the weighted mean is given by

\[
x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n
\]

The weights are positive numbers and must add up to 1. That is:

\[
w_1 + w_2 + w_3 + \cdots + w_n = 1
\]

The weights are not always a reflection of the size of the various groups involved. If you are computing the weighted average of your grades during your college studies, the weights could be proportional to the credits given to each course, or they could be an expression of the importance of the course in a given program of studies. A Faculty of Medicine may weight the grades of its candidates by giving a bigger weight to Chemistry and Biology than to Art History, for instance.

Example

A buyer wants to evaluate several houses she has seen. She attributes a score out of ten to each house on each of the following items: size, location, internal design, and quality of construction. Any house having a score less than 5 on any item would not be acceptable. The resulting scores for three houses that are seen as acceptable on all grounds are recorded in Table 3.5. The buyer does not attribute the same importance to each item. The size of the house is the most important quality. The quality of the construction is also very important, but not as important as the size. The buyer attributes a weight to each item, which reflects the importance of that item for her. The weights are given in the last column.
Table 3.5 Scores given to three houses on four items, and their weights

Item                       House A   House B   House C   Weight of item
Size                          10         7         6          0.4
Location                       5         9        10          0.1
Internal design                6         5         8          0.2
Quality of construction        7         9         7          0.3

We can now calculate the weighted average score for each house, using the formula for weighted means given above.

For house A: weighted mean score = 10 × 0.4 + 5 × 0.1 + 6 × 0.2 + 7 × 0.3 = 7.8
For house B: weighted mean score = 7 × 0.4 + 9 × 0.1 + 5 × 0.2 + 9 × 0.3 = 7.4
For house C: weighted mean score = 6 × 0.4 + 10 × 0.1 + 8 × 0.2 + 7 × 0.3 = 7.1

We see that house A obtained the highest weighted score. The total, unweighted score of house C (31) is higher than that of house A (28). But because the items do not all have the same importance, house A ended up having a higher weighted score.

THE MEDIAN AND THE MODE

The median is another measure of central tendency for quantitative variables. It is defined as the value that sits right in the middle of all data entries when they are listed in ascending order. If the number of entries is odd, there will be one data entry right in the middle. If the number of entries is even, we will have two data entries in the middle, and the median in this case will be their average. Here are two examples.

Case 1: variable X    2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13
Case 2: variable Y    2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13

For the variable X we have 13 entries. The value 5 sits in the middle, with six entries equal or smaller than it, and six entries equal or larger. The median for X is thus 5. But for variable Y, we have 12 entries. There are therefore two entries in the middle of the ordered list, not just one. The median will be the average of the two, that is, (5 + 6)/2 = 5.5.

[...] had a job within 5 months of their date of arrival.

Because the median involves only the ordered list of data entries, it can be used if the quantitative variable is measured at the ordinal level. But if the number of categories is small, the median is not very useful.
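The definition of the median given above can be sketched in a few lines of Python. This is only an illustration: the function name `median` is a choice made here, and the two lists are a reading of the entries for X (13 values) and Y (12 values) in the two cases above.

```python
def median(values):
    """Middle value of the ordered list; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of entries: exactly one entry sits in the middle.
        return ordered[mid]
    # Even number of entries: average the two entries in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


x = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13]  # Case 1, 13 entries
y = [2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13]     # Case 2, 12 entries

print(median(x))
print(median(y))
```

For X the single middle entry is 5; for Y the two middle entries are 5 and 6, so the median is 5.5.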
The mode can also be used for quantitative variables. When the values are grouped into classes, the mode is defined as it is for qualitative variables: it is the class that has the highest frequency. But the mean and median remain the best descriptive measures for quantitative variables. If the variable is continuous and the values have not been grouped into classes, the mode is the value at which a peak occurs in the graph representing the distribution.

COMPARISON OF THE MEAN AND THE MEDIAN

Both the mean and the median are measures of central tendency of a distribution, that is, they give us a central value around which the other values are found. They are therefore very useful for comparing different samples, or different populations, or samples with a population, or a given population at different moments in time (to see how it has evolved). However, each of the mean and the median has its advantages and drawbacks.

The mean takes into account every single value that occurs in the data. Therefore, it is sensitive to every value. A single very large value can boost the mean if the number of entries is not very large. For instance, if one worker in a group of 20 workers won $1 million in a lottery, the average wealth of those 20 would look artificially high.

The median is not sensitive to every single value. In a distribution where the largest value is changed from 60 to 600, the median would not change. The mean would.

It follows from these remarks that the mean is a more sophisticated measure, because it takes every value into account.
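The remark about changing the largest value from 60 to 600 can be checked with a small Python sketch. The five-value sample below is hypothetical and does not come from the text; only the change from 60 to 600 is taken from the passage. The functions `mean` and `median` are those of Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical sample (not from the text); its largest value is 60.
sample = [20, 30, 40, 50, 60]

# Same sample with the largest value changed from 60 to 600.
altered = sample[:-1] + [600]

print("original:", mean(sample), median(sample))
print("altered: ", mean(altered), median(altered))
```

With this sample the median stays at 40 in both cases, while the mean moves from 40 to 148.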
Indeed, it is the mean that is used to compute the standard deviation, which is a measure of dispersion that will be seen below. However, in situations where the distribution is not very symmetric, and where there are some extreme values on only one side of the distribution, the mean will tend to be shifted towards the extreme values, whereas the median will stay close to the bulk of the data. Therefore, whenever the distribution is highly skewed, the median is a better representative of the center of the distribution than the mean. This is true for variables such as income or wealth, where the distribution among individuals in a country, and also worldwide, is highly skewed. For such a variable, the median is a more accurate representative of the central tendency of the distribution.

Measures of Dispersion

For Qualitative Variables

There are not many measures of dispersion for qualitative variables. One of the measures we can compute is the variation ratio. It tells us whether a large proportion of data is concentrated in the modal category, or whether it is spread out over the other categories. The variation ratio is defined as

\[
\text{variation ratio} = \frac{\text{number of entries not in the modal class}}{\text{total number of entries}}
\]

It is a non-negative number smaller than one. If this ratio is close to zero, it indicates a great homogeneity, almost every unit being in the modal class. The farther it is from zero,
the greater the dispersion of the data over the other categories. Like many other measures, this one is easy to interpret when doing comparisons. For instance, if we compare the sizes of the various linguistic groups in two cities where several languages are spoken, we can use the variation ratio to assess the degree of heterogeneity in each city. Here is an example.

[Table: percentage of each linguistic group, including the English-speaking and Chinese-speaking groups, in City A and City B]

The variation ratio for city A would be ([...] + 20 + [...])/100 = 0.6[...], and for city B it would be (28 + 20 + 12)/100 = 0.60, showing that city A is a little more heterogeneous than city B.

For Quantitative Variables

There are many ways of measuring the dispersion for quantitative variables. The simplest is the range, but we also have various forms of restricted range, the deviation from the mean, the standard deviation, the variance and finally the coefficient of variation. Let us go through these measures one at a time.

RANGE

The range is the simplest way of measuring how spread out the data is. You simply subtract the smallest entry from the largest one and add 1, and this tells you the size of the interval over which the data is spread out. For example, you would describe a range of values for the variable Age as follows: in this sample, the youngest person is 16 years old and the oldest 89, spanning a range of 74 years (89 - 16 + 1).

But we may have extreme values that give a misleading impression about the dispersion of the data. For instance, suppose that a retired person decided to enroll in one of our classes.
We could then say that the ages of the students in this class range from [...] years up to 69 years, but that would be misleading, as the great majority of students are somewhere between 17 years old and maybe 23 or 24 years old. We can therefore introduce variants of the notion of range. The [...] range, for instance, computes the range of values after we have dropped 10% of the data at each end: the 10% largest entries and the 10% smallest