UNIVARIATE DESCRIPTIVE STATISTICS

This chapter explains how data concerning one variable can be summarized and described, with tables and with simple charts and diagrams.

After studying this chapter, the student should know:
• the basic types of univariate descriptive measures;
• how the level of measurement determines the descriptive measures to be used;
• how to interpret these descriptive measures;
• how to read a frequency table;
• the differences in the significance and the uses of the mean and the median;
• how to interpret the mean when a quantitative variable is coded;
• how to describe the shape of a distribution (symmetry; skewness);
• how to present data (frequency tables; charts);
• what weighted means are and when to use them.

Data files contain a lot of information that must be summarized in order to be useful. If we look for instance at the variable age in the data file GSS93 subset that comes with the SPSS package, we will find 1500 entries, giving us the age of every individual in the sample. If we examine the ages of men and women separately, we cannot determine, by looking simply at the raw data, whether men of this sample tend to be older than women or whether it is the other way around. We would need to know, let us say, that the average age of men is 23 years and of women 20 years to make a comparison. The average is a descriptive measure.

Descriptive statistics aim at describing a situation by summarizing information in a way that highlights the important numerical features of the data. Some of the information is lost as a result. A good summary captures the essential aspects of the data and the most relevant ones. It summarizes the data with the help of numbers, usually organized into tables, but also with the help of charts and graphs that give a visual representation of the distributions.

In this chapter, we will be looking at one variable at a time.
Measures that concern one variable are called univariate measures. We will examine bivariate measures, those measures that concern two variables together, in Chapter 8.

There are three important types of univariate descriptive measures:
• measures of central tendency,
• measures of dispersion, and
• measures of position.

Measures of central tendency (sometimes called measures of the center) answer the question: What are the categories or numerical values that represent the bulk of the data in the best way? Such measures will be useful for comparing various groups within a population, or seeing whether a variable has changed over time. Measures of central tendency include the mean (which is the technical term for average), the median, and the mode.

Measures of dispersion answer the question: How spread out is the data? Is it mostly concentrated around the center, or spread out over a large range of values? Measures of dispersion include the standard deviation, the variance, the range (there are several variants of the range, such as the interquartile range) and the coefficient of variation.

Measures of position answer the question: How is one individual entry positioned with respect to all the others? Or how does one individual score on a variable in comparison with the others? If you want to know whether you are part of the top 5% of a math class, you must use a measure of position. Measures of position include percentiles, deciles, and quartiles.

Other measures. In addition to these measures, we can compute the frequencies of certain subgroups of the population, as well as certain ratios and proportions that help us compare their relative importance. This is particularly useful when the variable is qualitative, or when it is quantitative but its values have been grouped into categories.

The various descriptive measures that can be used in a specific situation depend on whether the variable is qualitative or quantitative.
When the variable is quantitative, we can look at the general shape of the distribution, to see whether it is symmetric (that is, the values are distributed in a similar way on both sides of the center) or skewed (that is, lacking symmetry), and whether it is rather flat or rather peaked (a characteristic called kurtosis).

Finally, we can make use of charts to convey a visual impression of the distribution of the data. It is very easy to produce colorful outputs with any statistical software. It is important, however, to choose the appropriate chart, one that is meaningful and that conveys the most important properties of the data. This is not always easy, and you will have to pay attention to the way an appropriate chart is chosen, a choice that depends on the level of measurement of the variable.

It is very important to realize that the statistical measures used to describe the data pertaining to a variable depend on the level of measurement used. If a variable is measured at the nominal scale, you can compute certain measures and not others. Therefore you should pay attention to the conditions under which a measure could be used; otherwise you will end up computing numerical values that are meaningless.

Measures of Central Tendency

For Qualitative Variables

The best way to describe the data that corresponds to a qualitative variable is to show the frequencies of its various categories, which are a simple count of how many individuals fall into each category. You could then work out this count as a percentage of the total number of units in the sample. When you ask for the frequencies, SPSS automatically calculates the percentages as well, and it does it twice: the percentage with respect to the total number of people in the sample, and the percentage with respect to the valid answers only, called valid percent in the SPSS outputs.
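The two percentages just described can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure itself: the ten answers are hypothetical, and the function name `frequency_table` and the use of `None` to mark a missing answer are assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample of 10 respondents; None marks a missing answer.
answers = ["Yes"] * 4 + ["No"] * 1 + [None] * 5


def frequency_table(values):
    """Return {category: (count, percent, valid_percent)}.

    percent is computed out of all units in the sample,
    valid_percent out of the non-missing answers only.
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    return {
        category: (count, 100 * count / total, 100 * count / len(valid))
        for category, count in counts.items()
    }


table = frequency_table(answers)
for category, (count, percent, valid_percent) in table.items():
    print(f"{category}: n={count}, percent={percent:.1f}, "
          f"valid percent={valid_percent:.1f}")
```

With these hypothetical answers, the four Yes answers are 40% of the whole sample but 80% of the valid answers, because half of the respondents did not answer.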
Let us say that the percentage of people who answered Yes to a question is 40% of the total. If only half the people had answered, this percentage would correspond to 80% of the valid answers. In other words, although 40% of the people answered Yes, they still constitute 80% of those who answered. SPSS gives you both percentages (the total percentage and the valid percentage) and you have to decide which one is more significant in a particular situation. For instance, Table 3.1 summarizes the answers to a question about the legalization of marijuana, in a survey given to a sample of 1500 individuals.

Table 3.1 A frequency table, showing the frequencies of the various categories, as well as the percentage and valid percentage they represent in the sample

Should Marijuana Be Made Legal

                       Frequency   Percent   Valid Percent
Valid     Legal            211       14.1         22.7
          Not legal        719       47.9         77.3
          Total            930       62.0        100.0
Missing                    570       38.0
Total                     1500      100.0

Table 3.1 tells us that the sample included 1500 individuals, but that we have the answers to that question for 930 individuals only. The percentage of positive answers can be calculated either out of the total number of people in the sample, giving 14.1% as shown in the Percent column, or out of the number of people for whom we have answers, giving 22.7% as shown in the Valid Percent column. Which percentage is the most useful? It depends on the reason for the missing answers. If people did not answer because the question was asked of only a subset of the sample, the valid percentage is easier to interpret. But if 570 people abstained because they do not want to let their opinion be known, it is more difficult to interpret the resulting figures. A good analysis should include a discussion of the missing answers when their proportion is as important as it is in this example.

Table 3.1 comes from the SPSS output. When we write a statistical report, we do not include all the columns in that table.
Most of the time, you would choose either the valid percentage (which is the preferred solution) or the total percentage, but rarely both, unless you want to discuss specifically the difference between these two percentages. The cumulative percentage is only used for ordinal or quantitative variables, and even then is included only if you plan to discuss it.

To describe the center of the distribution of a qualitative variable, you must determine which category includes the biggest concentration of data. This is called the mode.
The mode for a qualitative variable is the category that has the highest frequency (sometimes called the modal category).

The modal category could include more than 50% of the data. In this case we say that this category includes the majority of individuals. If the modal category includes less than 50% of the data, we say that it constitutes a plurality. We can illustrate this by the following situations concerning the votes in an election.

First situation:
Party A   54% of the votes
Party B   21% of the votes
Party C   25% of the votes

Here we could say that Party A won the election with a majority. Compare with the following situation.

Second situation:
Party A   44% of the votes
Party B   31% of the votes
Party C   25% of the votes

Here we can say that Party A won the election with a plurality of votes, but without a majority. If Parties B and C formed a coalition, they could defeat Party A.
For this reason, some countries include in their electoral law a provision that, should the winning candidate or the winning party get less than the absolute majority of votes (50% + 1), a second round should take place among those candidates who are at the top of the list, so as to end up with a winner having more than 50% of the votes.

A good description of the distribution of a qualitative variable should include a mention of the modal category, but it should also include a discussion of the pattern of the distribution of individuals across the various categories. Concrete examples will be given in the last section of this chapter.

For Quantitative Variables

Quantitative variables allow us a lot more possibilities. The most useful measures of central tendency are the mean and the median. We will also see how and when to use the mode.

The mean of a quantitative variable is defined as the sum of all entries divided by their number.

In symbolic terms, the mean of a sample is written as

\[ \bar{x} = \frac{\sum x_i}{n} \]

and the mean of a population is written as

\[ \mu_x = \frac{\sum x_i}{N} \]

These symbols are read as follows:
• \(\bar{x}\) is read as x bar, and it stands for the mean of a sample for variable X.
• \(\mu_x\) is read as mu x, and it stands for the mean of a population. The subscript x refers to the variable X.
• \(x_i\) is read as x i. It refers to all the entries of your data that pertain to the variable X, which are labeled \(x_1, x_2, x_3\), etc.
• \(\Sigma\) is read as sigma. When followed by \(x_i\), it means: add all the \(x_i\)'s, letting i range over all possible values, that is, from 1 to n (for a sample) or from 1 to N (for a population).
• n is the size of the sample, that is, the number of units that are in it.
• N is the size of the population.

You may have noticed that we use different symbols for a population and for a sample, to indicate clearly whether we are talking about a population or a sample. We do not always need to write the subscript x in \(\mu_x\).
We do it only when several variables are involved, and when we want to keep track of which of the variables we are talking about. In such a situation we would use \(\mu_x\), \(\mu_y\), and \(\mu_z\) to refer to the mean of the population for the variables x, y, and z respectively. Notice that in the formula for the mean of a population, we have written a capital N to refer to the size of the population rather than the small n used for the size of a sample.

The mean is very useful to compare various populations, or to see how a variable evolves over time. But it can be very misleading if the population is not homogeneous. Imagine a group of five people whose hourly wages are: $10, $20, $45, $60 and $65 an hour. The average hourly wage would be:

\[ \frac{10 + 20 + 45 + 60 + 65}{5} = 40 \text{ dollars an hour.} \]

But if the last participant was an international lawyer who charged $400 an hour of consultancy, the average would have been $107 an hour (you can compute it yourself), which is well above what four out of the five individuals make, and would be a misrepresentation of the center of the data. In order to avoid this problem, we can compute the trimmed mean: you first eliminate the most extreme values, and then you compute the mean of the remaining ones. But you must indicate how much you have trimmed. In SPSS, one of the procedures produces a 5% trimmed mean, which means that you disregard the 5% of the data that are farthest away from the center, and then you compute the mean of the remaining data entries.

The mean has a mathematical property that will be used later on. Starting from the definition of the mean, which states that

\[ \bar{x} = \frac{\sum x_i}{n} \]

we can conclude, by multiplying both sides by n, that:

\[ n\,\bar{x} = \sum x_i \]

In plain language, this states that the sum of all entries is equal to n times the mean.

We will discuss all the limitations and warnings concerning the mean in a later section on methodological issues.
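The wage example and the property "sum of all entries = n times the mean" can be checked with a short Python sketch. The function `trimmed_mean` below is a hypothetical helper written for this illustration: it assumes that trimming means dropping the same fraction of entries from each end of the ordered list, which may differ from the exact rule SPSS applies for its 5% trimmed mean. A fraction of 0.2 is used only because the list has five entries.

```python
from statistics import mean

wages = [10, 20, 45, 60, 65]               # hourly wages from the text
wages_with_lawyer = [10, 20, 45, 60, 400]  # last wage replaced by $400


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the entries at each end.

    Assumption for illustration: symmetric trimming by rank.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)       # entries dropped per end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)


print(mean(wages))                           # 40
print(mean(wages_with_lawyer))               # 107
print(trimmed_mean(wages_with_lawyer, 0.2))  # mean of 20, 45 and 60

# The sum of all entries equals n times the mean.
print(len(wages) * mean(wages) == sum(wages))  # True
```

Dropping one entry at each end removes both the $10 and the $400 wage, so the trimmed mean stays close to what most of the group earns.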
THE MEAN OF DATA GROUPED INTO CLASSES

When we are given numerical data that is grouped into classes, and we do not know the exact value of every single entry, we can still compute the mean of the distribution by using the midpoint of every class. What we get is not the exact mean, but it is the closest guess of the mean that is available. If the classes are not too wide, the value obtained by using the midpoints is not that different from the value that would have resulted from the individual data.

Consider one of the intervals i with frequency \(f_i\) and midpoint \(x_i\). The exact sum of all the entries in that class is not known, but we can approximate it using the midpoint. Thus, instead of the sum of the individual entries (not known) we will count the midpoint of the class \(f_i\) times. We obtain the following formula.

\[ \text{Mean for grouped data} = \frac{\sum f_i x_i}{n} \]

Here, n is the number of all entries in the sample. It is therefore equal to the sum of the class frequencies, that is, the sum of the number of individuals in the various classes. The formula can thus be rewritten as

\[ \text{Mean for grouped data} = \frac{\sum f_i x_i}{\sum f_i} \]

INTERPRETATION OF THE MEAN WHEN THE VARIABLE IS CODED

We often have data files where a quantitative variable is not given in its original form, but coded into a small number of categories. For instance, the variable Respondent's Income could be given in the form shown in Table 3.2.
Table 3.2 Example of a quantitative variable that is coded into 21 categories, with a 22nd category for those who refused to answer

Category              Code
Less than $1000         1
$1000-2999              2
$3000-3999              3
$4000-4999              4
$5000-5999              5
$6000-6999              6
$7000-7999              7
$8000-9999              8
$10,000-12,499          9
$12,500-14,999         10
$15,000-17,499         11
$17,500-19,999         12
$20,000-22,499         13
$22,500-24,999         14
$25,000-29,999         15
$30,000-34,999         16
$35,000-39,999         17
$40,000-49,999         18
$50,000-59,999         19
$60,000-74,999         20
$75,000 and more       21
Refused to answer      22

Thus, we would not know the exact income of a respondent. We would only know the category he or she falls into. This kind of measuring scale poses a challenge. If we compute the mean with SPSS, we will not get the mean income. We will get the mean code, because it is the codes that are used to perform the computations. There is a data file that comes with SPSS where the income is coded in this way. This data file contains information about 1500 respondents, including information on the income bracket they fall into, coded as shown in Table 3.2. When we exclude the 22nd category, which consists of the people who refused to answer this question, the computation of the mean with SPSS produces the following result:

Mean = 12.35

What is the use of this number? It is not a dollar amount! If we look at Table 3.2, we see that the code 12 stands for an income of between $17,500 a year and $20,000 a year (with that last number excluded from the category). To interpret this number, we should first translate it into a dollar amount (it can be done with a simple rule). But even without transforming it into the dollar amount it corresponds to, we could use the mean code for comparisons. For instance, we will see in Lab 3 that if we compute the mean income separately for men and women, we get

Mean income for men: 13.9
Mean income for women: 10.9

(excluding the category of people who refused to answer).
Although the mean code does not tell us exactly the mean income for men and women, it still tells us that there is a big difference between men and women for that variable. Table 3.2 tells us that the code 14 corresponds to the income bracket $22,500-25,000, while the code 10 represents the income bracket $12,500-15,000. We can conclude that the difference in income between men and women, for that sample, is roughly around $10,000 a year. We see that when the variables are coded, the interpretation of the mean requires us to translate the value obtained into what it stands for.

For quantitative variables coded this way, it may also be useful to find the frequencies of the various categories, as we did for nominal variables. For the example at hand, we would get Table 3.3 as shown.

The conclusion of the preceding discussion is that when we have an ordinal variable with few categories, or even a quantitative variable that has been recoded into a small number of categories, it may be useful to compute the frequency table of the various categories, in addition to the mean and other descriptive measures.

Weighted Means

Consider the following situation: you want to find the average grade in an exam for two classes of students. The first class averaged 40 out of 50 in the exam, and the second class averaged 46 out of 50. If you put the two classes together, you cannot conclude that the average is 43. This is so because the classes may have different numbers of students. Suppose the first class has 20 students, and the second one 40 students. In other words, we have the data shown in Table 3.4.

To compute the average grade for the two classes taken together, we do not need to know the individual scores of each student. Indeed, we have seen before that a sum of n scores is equal to its average times n. We will use this to obtain the formula shown below for weighted means.
Table 3.3 Frequencies of the various income categories for the variable Income

Respondent's income      Frequency      Valid Percent
Less than $1000              26              2.6
$1000-2999                   36              3.6
$3000-3999                   30              3.0
$4000-4999                   24              2.4
$5000-5999                   23              2.3
$6000-6999                   23              2.3
$7000-7999                   15              1.5
$8000-9999              [illegible]      [illegible]
$10,000-12,499               55              5.5
$12,500-14,999               54              5.4
$15,000-17,499               64              6.4
$17,500-19,999               58              5.8
$20,000-22,499               55              5.5
$22,500-24,999               61              6.1
$25,000-29,999          [illegible]      [illegible]
$30,000-34,999               83              8.4
$35,000-39,999               54              5.4
$40,000-49,999               66              6.6
$50,000-59,999               38              3.8
$60,000-74,999               23              2.3
$75,000+                     44              4.4
Refused to answer            47              4.7
Total                       994            100.0
Missing                     506
Grand total                1500

Table 3.4 Two classes of different size and the mean grade in each

            Average grade out of 50     Number of students
Class A               40                        20
Class B               46                        40

The mean for the two classes taken together can be written as

\[ \text{mean} = \frac{\text{Sum of all scores in class A} + \text{Sum of all scores in class B}}{\text{total number of students}} \]

The sum of all scores in class A can be replaced by the average score (40) times 20, since there are 20 students in this class. And the sum of all scores in class B can be replaced also by its average score (46) times 40, since this class includes 40 students. The equation for the mean becomes:

\[ \text{mean} = \frac{(40 \times 20) + (46 \times 40)}{60} \]

This can now be written as:

mean of the two classes combined = 40 x (20/60) + 46 x (40/60)

or again as:

mean of the two classes combined = 40 x (1/3) + 46 x (2/3) = 44

The last formula is important: we see that the average grade of class A is multiplied by the weight of class A, which is its relative importance in the total population. Class A forms 1/3 of the total population (20 students out of 60) and class B 2/3 of the total (40 students out of 60). The underlying formula is:

\[ \text{Average grade for the two classes} = 40 \times w_1 + 46 \times w_2 \]

The \(w_i\)'s are called the weights of the various classes.
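The combined mean of the two classes can be verified numerically. The following Python sketch uses only the averages and class sizes given in the text; the variable names are chosen for the illustration.

```python
# Class averages (out of 50) and class sizes for classes A and B.
averages = [40, 46]
sizes = [20, 40]

total_students = sum(sizes)                        # 60
weights = [size / total_students for size in sizes]  # 1/3 and 2/3

# Weighted mean: each class average multiplied by its weight.
weighted_mean = sum(avg * w for avg, w in zip(averages, weights))

# Same quantity obtained from the total of all scores.
combined_mean = sum(avg * size for avg, size in zip(averages, sizes)) / total_students

# Naive average of the two class averages, which ignores class sizes.
simple_average = sum(averages) / len(averages)

print(round(weighted_mean, 6), combined_mean, simple_average)
```

The weighted mean and the mean computed from the total of all scores agree at 44, whereas the naive average of the two class averages gives 43, because it treats the small class and the large class as if they had the same size.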
In this case, the weight is an expression of the number of people in each class compared to the total population of the two classes.

The general formula is as follows. If you have n values \(x_1, x_2, x_3, \ldots\), each having the corresponding weights \(w_1, w_2, w_3, \ldots\), the weighted mean is given by

\[ x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n \]

The weights are positive numbers and must add up to 1. That is:

\[ w_1 + w_2 + w_3 + \cdots + w_n = 1 \]

The weights are not always a reflection of the size of the various groups involved. If you are computing the weighted average of your grades during your college studies, the weights could be proportional to the credits given to each course, or they could be an expression of the importance of the course in a given program of studies. A Faculty of Medicine may weight the grades of its candidates by giving a bigger weight to Chemistry and Biology than Art History, for instance.

Example

A buyer wants to evaluate several houses she has seen. She attributes a score out of ten to each house on each of the following items: size, location, internal design, and quality of construction. Any house having a score less than 5 on any item would not be acceptable. The resulting scores for three houses that are seen as acceptable on all grounds are recorded in Table 3.5. The buyer does not attribute the same importance to each item. The size of the house is the most important quality. The quality of the construction is also very important, but not as important as the size. The buyer attributes a weight to each item, which reflects the importance of that item for her. The weights are given in the last column.
Table 3.5 Scores given to three houses on four items, and their weights

Item                       House A   House B   House C   Weight of item
Size                          10         7         6          0.4
Location                       5         9        10          0.1
Internal design                6         5         8          0.2
Quality of construction        7         9         7          0.3

We can now calculate the weighted average score for each house, using the formula for weighted means given above.

For house A: weighted mean score: 10 x 0.4 + 5 x 0.1 + 6 x 0.2 + 7 x 0.3 = 7.8
For house B: weighted mean score: 7 x 0.4 + 9 x 0.1 + 5 x 0.2 + 9 x 0.3 = 7.4
For house C: weighted mean score: 6 x 0.4 + 10 x 0.1 + 8 x 0.2 + 7 x 0.3 = 7.1

We see that house A obtained the highest weighted score. The total, unweighted score of house C (31) is higher than that of house A (28). But because the items do not all have the same importance, house A ended up having a higher weighted score.

THE MEDIAN AND THE MODE

The median is another measure of central tendency for quantitative variables. It is defined as the value that sits right in the middle of all data entries when they are listed in ascending order.

If the number of entries is odd, there will be one data entry right in the middle. If the number of entries is even, we will have two data entries in the middle, and the median in this case will be their average. Here are two examples.

Case 1: variable X    2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13
Case 2: variable Y    2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13

For the variable X we have 13 entries. The value 5 sits in the middle, with six entries equal or smaller than it, and six entries equal or larger. The median for X is thus 5.

But for variable Y, we have 12 entries. There are therefore two entries in the middle of the ordered list, not just one. The median will be the average of the two, that is, (5 + 6)/2 = 5.5.

[...] had a job within 5 months of their date of arrival.

Because the median involves only the ordered list of data entries, it can be used if the quantitative variable is measured at the ordinal level. But if the number of categories is small, the median is not very useful.
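The rule for the median (middle entry when the number of entries is odd, average of the two middle entries when it is even) can be sketched in Python as follows. The function is written out for illustration; Python's standard `statistics.median` applies the same rule.

```python
def median(values):
    """Middle value of the sorted entries; average of the two
    middle values when the number of entries is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


x = [2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 11, 13, 13]  # Case 1: 13 entries
y = [2, 3, 4, 4, 5, 5, 6, 7, 8, 11, 13, 13]     # Case 2: 12 entries

print(median(x))  # 5
print(median(y))  # 5.5
```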
The mode can also be used for quantitative variables. When the values are grouped into classes, the mode is defined as it is for qualitative variables: it is the class that has the highest frequency. But the mean and median remain the best descriptive measures for quantitative variables. If the variable is continuous and the values have not been grouped into classes, the mode is the value at which a peak occurs in the graph representing the distribution.

COMPARISON OF THE MEAN AND THE MEDIAN

Both the mean and the median are measures of central tendency of a distribution, that is, they give us a central value around which the other values are found. They are therefore very useful for comparing different samples, or different populations, or samples with a population, or a given population at different moments in time to see how it has evolved. However, each of the mean and the median has its advantages and drawbacks.

The mean takes into account every single value that occurs in the data. Therefore, it is sensitive to every value. A single very large value can boost the mean up if the number of entries is not very large. For instance, if one worker in a group of 20 workers won $1 million in a lottery, the average wealth of those 20 would look artificially high.

The median is not sensitive to every single value. In a distribution where the largest value is changed from 60 to 600, the median would not change. The mean would.

It follows from these remarks that the mean is a more sophisticated measure, because it takes every value into account.
Indeed, it is the mean that is used to compute the standard deviation, which is a measure of dispersion that will be seen below. However, in situations where the distribution is not very symmetric, and where there are some extreme values on only one side of the distribution, the mean will tend to be shifted towards the extreme values, whereas the median will stay close to the bulk of the data. Therefore, whenever the distribution is highly skewed, the median is a better representative of the center of the distribution than the mean. This is true for variables such as income or wealth, where the distribution among individuals in a country, and also worldwide, is highly skewed. For such a variable, the median is a more accurate representative of the central tendency of the distribution.

Measures of Dispersion

For Qualitative Variables

There are not many measures of dispersion for qualitative variables. One of the measures we can compute is the variation ratio. It tells us whether a large proportion of data is concentrated in the modal category, or whether it is spread out over the other categories.

The variation ratio is defined as

\[ \text{variation ratio} = \frac{\text{number of entries not in the modal class}}{\text{total number of entries}} \]

It is a number between zero and one (it can equal zero, but it is always smaller than one). If this ratio is close to zero, it indicates a great homogeneity, almost every unit being in the modal class. The farther it is from zero,
the greater the dispersion of the data over the other categories. Like many other measures, this one is easy to interpret when doing comparisons. For instance, if we compare the sizes of the various linguistic groups in two cities where several languages are spoken, we can use the variation ratio to assess the degree of heterogeneity in each city. Here is an example.

[Table: percentage of the population in each linguistic group (including English-speaking and Chinese-speaking groups) for City A and City B, each column totalling 100%; the individual entries are not legible in the source.]

The variation ratio for city A would be ([illegible] + 20 + 10)/100, a value slightly above 0.60 whose last digit is not legible in the source, and for city B it would be (28 + 20 + 12)/100 = 0.60, showing that city A is a little more heterogeneous than city B.

For Quantitative Variables

There are many ways of measuring the dispersion for quantitative variables. The simplest is the range, but we also have various forms of restricted range, the deviation from the mean, the standard deviation, the variance and finally the coefficient of variation. Let us go through these measures one at a time.

RANGE

The range is the simplest way of measuring how spread out the data is. You simply subtract the smallest entry from the largest one and add 1, and this tells you the size of the interval over which the data is spread out. For example, you would describe a range of values for the variable Age as follows:

In this sample, the youngest person is 16 years old and the oldest 89, spanning a range of 74 years (89 - 16 + 1).

But we may have extreme values that give a misleading impression about the dispersion of the data. For instance, suppose that a retired person decided to enroll in one of our classes.
We could then say that the ages of the students in this class range from 17 years up to 69 years, but that would be misleading, as the great majority of students are somewhere between 17 years old and maybe 23 or 24 years old. For this reason, we introduce variants of the notion of range. One variant, for instance, computes the range of values after we have dropped 10% of the data at each end: the 10% largest entries and the 10% smallest entries. This statistic gives us the range of the remaining 80% of data entries. We can also compute the 5% trimmed range by deleting from the computation the 5% of values that are the farthest away from the mean.

We will also see in a forthcoming section something called a box-plot, that shows us graphically both the full range, and the range of the central 50% of the data after you have disregarded the top 25% and the bottom 25%. This last range is called the interquartile range, the distance between the first and third quartiles, which are the values that split the data into four equal parts.

These various notions of the range do not use the exact values of all the data in their computation. The following measures do.

STANDARD DEVIATION

The most important measure is the standard deviation. To explain what it is we must first define some simpler notions such as the deviation from the mean.

For an individual data entry \(x_i\), the deviation from the mean is the distance that separates it from the mean. If we want to write it in symbols, we will have to use two different symbols, depending on whether we have a sample or a population.

For a sample, the deviation from the mean is written: \((x_i - \bar{x})\)
For a population, the deviation from the mean is written: \((x_i - \mu)\)

The list of all deviations from the mean may give us a good impression of how spread out the data is.
Example

Consider the following distribution, representing the grades out of ten of a group of 14 students:

4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10

Here the mean is given by 105/14 = 7.5. The deviations from the mean are given in Table 3.6.

But that list may be long. We want to summarize it, and end up with a single numerical value that constitutes a measure of how dispersed the data is. We could take the mean of all these deviations. If you perform the computation for the mean deviation, you will get a mean deviation equal to zero (do the computation yourself on the preceding example). This is no accident. Indeed, we can easily show that the mean of these deviations is necessarily zero, as the positive deviations are cancelled out by the negative deviations.

Table 3.6 Calculation of the deviations from the mean

Data entry $x_i$    Deviation from the mean: $(x_i - \bar{x})$
 4                   4 - 7.5 = -3.5
 5                   5 - 7.5 = -2.5
 5                   5 - 7.5 = -2.5
 6                   6 - 7.5 = -1.5
 7                   7 - 7.5 = -0.5
 7                   7 - 7.5 = -0.5
 8                   8 - 7.5 = 0.5
 8                   8 - 7.5 = 0.5
 8                   8 - 7.5 = 0.5
 9                   9 - 7.5 = 1.5
 9                   9 - 7.5 = 1.5
 9                   9 - 7.5 = 1.5
10                  10 - 7.5 = 2.5
10                  10 - 7.5 = 2.5

The mathematical proof (which is given only for those who are interested and which can be ignored otherwise) goes like this:

Sum of all deviations from the mean $= \sum (x_i - \bar{x}) = \sum x_i - \sum \bar{x} = n\bar{x} - n\bar{x} = 0$

(Explanation: Recall that the sum of all entries is equal to n times the mean, and that the mean, in the second summation, is counted n times. This is why we get n times the mean twice, once with a positive sign, and once with a negative sign.)

We thus conclude that the deviations from the mean always add up to zero, and therefore we cannot summarize them by finding their mean. The way around this difficulty is the following: we will square the deviations, and then take their mean. By squaring the deviations, we get rid of the negative signs, and the positive and negative deviations do not cancel out any more.
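The cancellation of the deviations, and the effect of squaring them, can be checked directly on the 14 grades of the example. The following is a minimal sketch in Python; the variable names are illustrative choices, not part of the original text.

```python
# Grades out of ten for the 14 students in the example.
grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

mean = sum(grades) / len(grades)            # 105 / 14

# Deviation of each entry from the mean (the second column of Table 3.6).
deviations = [x - mean for x in grades]

# Positive and negative deviations cancel out, so their sum is zero.
sum_of_deviations = sum(deviations)

# Squaring removes the signs, so the squared deviations no longer cancel.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

print(mean)               # 7.5
print(sum_of_deviations)  # 0.0
print(sum_of_squares)     # 47.5
```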
This operation changes their magnitude, however, and gives an erroneous impression about the real dispersion of the data, since the deviations are all squared. This distortion will be corrected by taking the square root of the result, which brings it back to an order of magnitude similar to the original deviations. In summary, we end up with the following calculation:

Standard deviation for a population, denoted by the symbol $\sigma$:

$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$

In the case of a sample, $\mu$ will be replaced by $\bar{x}$ and $N$ will be replaced not by $n$, but by $n - 1$. The reason why we write $n - 1$ instead of $n$ is due to some of the mathematical properties of the standard deviation. It can be proven that using $n - 1$ in the formula gives a better estimate of the standard deviation of a population when we know that of the sample.

Conclusion: the standard deviation for a sample, denoted by the symbol $s$, is given by:

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

The standard deviation (often written st. dev.) is the most powerful measure of dispersion for quantitative data. It will permit us to do very sophisticated descriptions of various distributions. All the calculations of statistical inference are also made possible by the use of the standard deviation.

VARIANCE

Another useful measure is the variance, which is defined as the square of the standard deviation. It is thus given by

variance of a sample $= s^2$ or variance of a population $= \sigma^2$

THE COEFFICIENT OF VARIATION

Finally, we can define the coefficient of variation. To explain the use of this measure, suppose you have two distributions having the means and standard deviations given below:

Distribution 1    mean = 30     st. dev. = 3
Distribution 2    mean = 150    st. dev. = 3

In one case the center of the distribution is 30, indicating that the data entries fall in a certain range around the value 30. Their magnitude is around 30.
In the other case, the mean is 150, indicating that the data entries fall in a range around the value 150 and have an average magnitude of 150. Although they have the same dispersion (measured by the standard deviation), the relative importance of the dispersion is not the same in the two cases because the magnitude of the data is different. In one case the entries revolve around the value 30, and the standard deviation is equal to 3/30, that is, 10% of the average value of the entries. In the other case, the entries revolve around the value 150 and the standard deviation is 3/150, that is, 2% of the average value of the entries, a value which denotes a smaller relative variation.

There is a way to assess the relative importance of the variation among the entries, by comparing this variation with the mean. The measure is called the coefficient of variation. The coefficient of variation is defined as the standard deviation divided by the mean, and multiplied by 100 to turn it into a percentage. The formula is thus:

Coefficient of variation $CV = \dfrac{\sigma}{\mu} \times 100$

This measure will only be used occasionally.

Measures of Position

Measures of position are used for quantitative variables, measured at the numerical scale level. They could sometimes be used for variables measured at the ordinal level. They provide us with a way of determining how one individual entry compares with all the others.

The simplest measure of position is the quartile. If you list your entries in ascending order according to size, the quartiles are the values that split the ranked population into four equal groups. Twenty-five percent of the population has a score less than or equal to the 1st quartile ($Q_1$), 50% has a score less than the 2nd quartile ($Q_2$), and 75% has a score less than the 3rd quartile ($Q_3$). Recall that we have seen earlier a measure of dispersion called the interquartile range, which is the difference between $Q_1$ and $Q_3$.
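As a hedged illustration, the dispersion measures defined above and the quartiles can be computed with Python's standard `statistics` module on the 14 grades used earlier. The helper `coefficient_of_variation` is a hypothetical name introduced here. Note also that several conventions exist for computing quartiles, so the values returned by `statistics.quantiles` (used here with its default settings) may differ slightly from those reported by SPSS or obtained by hand.

```python
import math
import statistics

grades = [4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]


def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


mean = statistics.mean(grades)

# Population formulas divide the sum of squared deviations by N.
pop_variance = statistics.pvariance(grades)
pop_std_dev = statistics.pstdev(grades)

# Sample formulas divide by n - 1 instead.
sample_variance = statistics.variance(grades)
sample_std_dev = statistics.stdev(grades)

# The two distributions from the text: same st. dev., different means.
cv_1 = coefficient_of_variation(3, 30)     # 10.0
cv_2 = coefficient_of_variation(3, 150)    # 2.0

# Quartiles under the module's default convention, and the interquartile range.
q1, q2, q3 = statistics.quantiles(grades, n=4)
iqr = q3 - q1

print(round(pop_std_dev, 2), round(sample_std_dev, 2))
print(cv_1, cv_2)
print(q1, q2, q3, iqr)
```

The sample standard deviation comes out slightly larger than the population one, because the same sum of squared deviations is divided by 13 rather than 14.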
Figure 3.1 illustrates the way the quartiles divide the ordered list of units in a sample or in a population.

[Figure 3.1: the ordered list divided into four groups, each containing 25% of the population, with the quartiles marking the boundaries between the groups.]

Figure 3.1 The quartiles are obtained by ordering the individuals in the population by increasing rank, and then splitting it into four equal parts. The quartiles are the values that separate these four parts

In a similar way, we can define the deciles: they split the ranked population into ten equal groups. If a data entry falls in the first decile it means that its score is among the lowest 10%. If it is in the 10th decile it means it is among the top 10%.

The most common measure of position, however, is the percentile rank. The data is arranged by order of size (recall it must be quantitative) and divided into 100 equal groups. The numerical values that separate these 100 groups are called percentiles. The percentile rank of a data entry is the rank of the percentile group this entry falls into. For example, if you are told that your percentile rank in a national exam is 83, this means that you fall within the 83rd percentile. Your grade is just above that of 82% of the population, and just below that of 17% of the population. You will learn in the SPSS session how to display the percentile ranks of the data entries.

You may have realized by now the connection between the median and the various measures of position, since the median divides your ranked population into two equal groups. The median is equal to the 50th percentile. It is also equal to the 5th decile, and of course the 2nd quartile.

[Bar chart: Marital Status (Missing, married, widowed, divorced, separated, never married).]

[Bar chart: Age Categories (18-29, 30-39, 40-49, 50+).]

[...] people who speak a given language. You can choose to have the Y-axis represent percentages instead of counts.
The chart shown in Figure 3.2 represents the percentages of the various marital categories.

[Figure 3.4: bar chart of "Age into 7 categories" (18-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81+).]

Figure 3.4 A bar chart where the category 50+ years has been broken down into four categories

The variable on the X-axis could also be a quantitative variable that has been grouped into a small number of categories. For instance, we could have agecat4 as the variable on the X-axis. The bars would then represent the number of people found in each of the four age categories. In this kind of bar graph, you must be careful about the range (that is, the length of the interval) of each of the categories. If the categories are intervals that do not have the same length, you may get the wrong impression that one group is more numerous than the other, such as with the group of people who are 50 years old or more in the chart shown in Figure 3.3. However, this group (50 years and older) spans a range of ages which is much wider than the other groups: close to 40 years (from 50 years to 89 years exactly). If we regroup the respondents into age categories that are equal or almost equal, we get the chart in Figure 3.4. This bar chart is a much better representation of the distribution of ages than the previous one.

In a clustered bar chart, each column is subdivided into several columns representing the categories of a second variable. For instance, each column could be split in two, for men and for women. Figure 3.5 provides an example of a clustered bar chart where the height of the columns represents the number of people in each category.

In a clustered bar chart, it is generally preferable to display the percentages of the various categories rather than their frequencies. Look for instance at the clustered bar chart displayed in Figure 3.5. We see that in every category, women are more numerous than men. This is so because the sample as a whole contains more women. This chart does not allow us to assess how the percentages of men and women compare in each category. If we display the percentages rather than the frequencies (the count), we get the chart illustrated in Figure 3.6.

[Figure 3.5: clustered bar chart of Marital Status (Missing, widowed, separated, married, divorced, never married) by Respondent's Sex (Male, Female).]

Figure 3.5 A clustered bar chart where the height of the columns represents the number of people in each category

[Figure 3.6: clustered bar chart of Marital Status by Respondent's Sex, displaying percentages rather than counts.]

[...] you want to highlight the quantity associated with every category on the X-axis. A bar chart where the vertical axis does not start at 0 can be very misleading, for if the columns are truncated at their base, the differences in height between them can appear to be more important than they really are. Consequently, as a general rule, bar charts should start at zero and should not be truncated from their base.

Finally, it should be said that bar charts could also be presented horizontally, by interchanging the X- and Y-axes.

Pie Charts

Pie charts (Figure 3.8) are most useful when you want to illustrate proportions, rather than actual quantities. They show the relative importance of the various categories of the variable. In SPSS you have the option of including missing values as a slice in the pie, or excluding them and dividing the pie among valid answers. The details of how to do that are explained in Lab 5.

Pie charts are better suited when we want to convey the way a fixed amount of resources is allocated among various uses. For instance, the way a budget is spent over various categories of items is best represented by a pie chart. When the emphasis is on the amount of money spent on each budget item, rather than on the way the budget is allocated, a bar chart is more suggestive.
However, both bar charts and pie charts are appropriate to represent the distribution of a nominal variable, and there is no clear-cut line of demarcation that would tell us which of the two is preferable.

[Figure 3.8: pie chart with slices "owns home", "pays rent" and "other".]

Figure 3.8 Pie chart illustrating the proportion of people who own a home as compared to those who pay rent. One of the options in the pie chart command allows you to either include or exclude the category of missing answers. In this diagram it has been excluded from the graph

Histograms

Histograms are useful when the variable is quantitative. The data are usually grouped into classes, or intervals, and then the frequency of each class is represented