UNIVARIATE DESCRIPTIVE STATISTICS

This chapter explains how data concerning one variable can be summarized and described, with tables and with simple charts and diagrams. After studying this chapter, the student should know:

• the basic types of univariate descriptive measures;
• how the level of measurement determines the descriptive measures to be used;
• how to interpret these descriptive measures;
• how to read a frequency table;
• the differences in the significance and the uses of the mean and the median;
• how to interpret the mean when a quantitative variable is coded;
• how to describe the shape of a distribution (symmetry; skewness);
• how to present data (frequency tables; charts);
• what weighted means are and when to use them.

Data files contain a lot of information that must be summarized in order to be useful. If we look for instance at the variable age in the data file GSS93 subset that comes with the SPSS package, we will find 1500 entries, giving us the age of every individual in the sample. If we examine the ages of men and women separately, we cannot determine, by looking simply at the raw data, whether men of this sample tend to be older than women or whether it is the other way around. We would need to know, let us say, that the average age of men is 23 years and of women 20 years to make a comparison. The average is a descriptive measure.

Descriptive statistics aim at describing a situation by summarizing information in a way that highlights the important numerical features of the data. Some of the information is lost as a result. A good summary captures the essential aspects of the data and the most relevant ones. It summarizes the data with the help of numbers, usually organized into tables, but also with the help of charts and graphs that give a visual representation of the distributions.

In this chapter, we will be looking at one variable at a time.
Measures that concern one variable are called univariate measures. We will examine bivariate measures, those measures that concern two variables together, in Chapter 8.

There are three important types of univariate descriptive measures:

• measures of central tendency,
• measures of dispersion, and
• measures of position.

Measures of central tendency (sometimes called measures of the center) answer the question: What are the categories or numerical values that represent the bulk of the data in the best way? Such measures will be useful for comparing various groups within a population, or seeing whether a variable has changed over time. Measures of central tendency include the mean (which is the technical term for average), the median, and the mode.

Measures of dispersion answer the question: How spread out is the data? Is it mostly concentrated around the center, or spread out over a large range of values? Measures of dispersion include the standard deviation, the variance, the range (there are several variants of the range, such as the interquartile range) and the coefficient of variation.

Measures of position answer the question: How is one individual entry positioned with respect to all the others? Or how does one individual score on a variable in comparison with the others? If you want to know whether you are part of the top 5% of a math class, you must use a measure of position. Measures of position include percentiles, deciles, and quartiles.

Other measures. In addition to these measures, we can compute the frequencies of certain subgroups of the population, as well as certain ratios and proportions that help us compare their relative importance. This is particularly useful when the variable is qualitative, or when it is quantitative but its values have been grouped into categories.

The various descriptive measures that can be used in a specific situation depend on whether the variable is qualitative or quantitative.
When the variable is quantitative, we can look at the general shape of the distribution, to see whether it is symmetric (that is, the values are distributed in a similar way on both sides of the center) or skewed (that is, lacking symmetry), and whether it is rather flat or rather peaked (a characteristic called kurtosis).

Finally, we can make use of charts to convey a visual impression of the distribution of the data. It is very easy to produce colorful outputs with any statistical software. It is important, however, to choose the appropriate chart, one that is meaningful and that conveys the most important properties of the data. This is not always easy, and you will have to pay attention to the way an appropriate chart is chosen, a choice that depends on the level of measurement of the variable.

INTERPRETING QUANTITATIVE DATA WITH SPSS

It is very important to realize that the statistical measures used to describe the data pertaining to a variable depend on the level of measurement used. If a variable is measured at the nominal scale, you can compute certain measures and not others. Therefore you should pay attention to the conditions under which a measure could be used; otherwise you will end up computing numerical values that are meaningless.

Measures of Central Tendency

For Qualitative Variables

The best way to describe the data that corresponds to a qualitative variable is to show the frequencies of its various categories, which are a simple count of how many individuals fall into each category. You could then work out this count as a percentage of the total number of units in the sample. When you ask for the frequencies, SPSS automatically calculates the percentages as well, and it does it twice: the percentage with respect to the total number of people in the sample, and the percentage with respect to the valid answers only, called valid percent in the SPSS outputs.
Let us say that the percentage of people who answered Yes to a question is 40% of the total. If only half the people had answered, this percentage would correspond to 80% of the valid answers. In other words, although 40% of the people answered Yes, they still constitute 80% of those who answered. SPSS gives you both percentages (the total percentage and the valid percentage) and you have to decide which one is more significant in a particular situation.

For instance, Table 3.1 summarizes the answers to a question about the legalization of marijuana, in a survey given to a sample of 1500 individuals.

Table 3.1 A frequency table, showing the frequencies of the various categories, as well as the percentage and valid percentage they represent in the sample

Should Marijuana Be Made Legal

                      Frequency   Percent   Valid Percent
Valid     Legal             211      14.1            22.7
          Not legal         719      47.9            77.3
          Total             930      62.0           100.0
Missing                     570      38.0
Total                      1500     100.0

Table 3.1 tells us that the sample included 1500 individuals, but that we have the answers to that question for 930 individuals only. The percentage of positive answers can be calculated either out of the total number of people in the sample, giving 14.1% as shown in the Percent column, or out of the number of people for whom we have answers, giving 22.7% as shown in the Valid Percent column. Which percentage is the most useful? It depends on the reason for the missing answers. If people did not answer because the question was asked of only a subset of the sample, the valid percentage is easier to interpret. But if 570 people abstained because they do not want to let their opinion be known, it is more difficult to interpret the resulting figures. A good analysis should include a discussion of the missing answers when their proportion is as large as it is in this example.

Table 3.1 comes from the SPSS output. When we write a statistical report, we do not include all the columns in that table.
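The two percentage columns of Table 3.1 can be recomputed from the frequencies alone. The sketch below does this in Python rather than in SPSS. The function name `frequency_table` and the dictionary layout are illustrative assumptions and are not part of any SPSS output.

```python
def frequency_table(counts, missing):
    """Return {category: (frequency, percent, valid_percent)}.

    counts  -- dict mapping each category to its frequency (valid answers)
    missing -- number of cases with no answer
    (Hypothetical helper, written only to illustrate the two percentages.)
    """
    valid_total = sum(counts.values())   # people who answered
    total = valid_total + missing        # everyone in the sample
    table = {}
    for category, freq in counts.items():
        percent = 100 * freq / total              # share of the whole sample
        valid_percent = 100 * freq / valid_total  # share of valid answers only
        table[category] = (freq, round(percent, 1), round(valid_percent, 1))
    return table


# Frequencies taken from Table 3.1
table_3_1 = frequency_table({"Legal": 211, "Not legal": 719}, missing=570)
for category, (freq, pct, valid_pct) in table_3_1.items():
    print(f"{category:<10} {freq:>5} {pct:>6} {valid_pct:>6}")
```

The only difference between the two columns is the denominator: 1500 for the Percent column, 930 for the Valid Percent column.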
Most of the time, you would choose either the valid percentage (which is the preferred solution) or the total percentage, but rarely both, unless you want to discuss specifically the difference between these two percentages. The cumulative percentage is only used for ordinal or quantitative variables, and even then is included only if you plan to discuss it.

To describe the center of the distribution of a qualitative variable, you must determine which category includes the biggest concentration of data. This is called the mode.

The mode for a qualitative variable is the category that has the highest frequency (sometimes called modal category).

The modal category could include more than 50% of the data. In this case we say that this category includes the majority of individuals. If the modal category includes less than 50% of the data, we say that it constitutes a plurality. We can illustrate this by the following situations concerning the votes in an election.

First situation:
Party A   54% of the votes
Party B   21% of the votes
Party C   25% of the votes

Here we could say that Party A won the election with a majority. Compare with the following situation.

Second situation:
Party A   44% of the votes
Party B   31% of the votes
Party C   25% of the votes

Here we can say that Party A won the election with a plurality of votes, but without a majority. If Parties B and C formed a coalition, they could defeat Party A.
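The distinction between a majority and a plurality can be checked mechanically. The following is a minimal sketch in Python, applied to the two election situations. The function name `modal_category` is hypothetical. Treating a share of exactly 50% as a plurality is an assumption made here, since the text only covers the cases above and below 50%.

```python
def modal_category(shares):
    """Return (category, kind), where kind is 'majority' or 'plurality'.

    shares -- dict mapping each category to its percentage of the data
    (Hypothetical helper; if two categories tie, the first one listed wins.)
    """
    mode = max(shares, key=shares.get)  # category with the highest share
    # Assumption: exactly 50% is not counted as a majority.
    kind = "majority" if shares[mode] > 50 else "plurality"
    return mode, kind


first = modal_category({"Party A": 54, "Party B": 21, "Party C": 25})
second = modal_category({"Party A": 44, "Party B": 31, "Party C": 25})
print(first)   # ('Party A', 'majority')
print(second)  # ('Party A', 'plurality')
```

In both situations Party A is the modal category. Only the first gives it more than half of the votes.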
For this reason, some countries include in their electoral law a provision that, should the winning candidate or the winning party get less than the absolute majority of votes (50% + 1), a second round should take place among those candidates who are at the top of the list, so as to end up with a winner having more than 50% of the votes.

A good description of the distribution of a qualitative variable should include a mention of the modal category, but it should also include a discussion of the pattern of the distribution of individuals across the various categories. Concrete examples will be given in the last section of this chapter.

For Quantitative Variables

Quantitative variables allow us a lot more possibilities. The most useful measures of central tendency are the mean and the median. We will also see how and when to use the mode.

The mean of a quantitative variable is defined as the sum of all entries divided by their number.

In symbolic terms, the mean of a sample is written as

\[ \bar{x} = \frac{\sum x_i}{n} \]

and the mean of a population is written as

\[ \mu_x = \frac{\sum x_i}{N} \]

These symbols are read as follows:

\( \bar{x} \) is read as x bar, and it stands for the mean of a sample for variable X.
\( \mu_x \) is read as mu x, and it stands for the mean of a population. The subscript x refers to the variable X.
\( x_i \) is read as x i. It refers to all the entries of your data that pertain to the variable X, which are labeled \( x_1 \), \( x_2 \), \( x_3 \), etc.
\( \Sigma \) is read as sigma. When followed by \( x_i \), it means: add all the \( x_i \)'s, letting i range over all possible values, that is, from 1 to n (for a sample) or from 1 to N (for a population).
n is the size of the sample, that is, the number of units that are in it.
N is the size of the population.

You may have noticed that we use different symbols for a population and for a sample, to indicate clearly whether we are talking about a population or a sample. We do not always need to write the subscript x in \( \mu_x \).
We do it only when several variables are involved, and when we want to keep track of which of the variables we are talking about. In such a situation we would use \( \mu_x \), \( \mu_y \), and \( \mu_z \) to refer to the mean of the population for the variables x, y, and z respectively. Notice that in the formula for the mean of a population, we have written a capital N to refer to the size of the population rather than the small n used for the size of a sample.

The mean is very useful to compare various populations, or to see how a variable evolves over time. But it can be very misleading if the population is not homogeneous. Imagine a group of five people whose hourly wages are: $10, $20, $45, $60 and $65 an hour. The average hourly wage would be:

\[ \frac{10 + 20 + 45 + 60 + 65}{5} = 40 \]

that is, $40 an hour. But if the last participant was an international lawyer who charged $400 an hour of consultancy, the average would have been $107 an hour (you can compute it yourself), which is well above what four out of the five individuals make, and would be a misrepresentation of the center of the data. In order to avoid this problem, we can compute the trimmed mean: you first eliminate the most extreme values, and then you compute the mean of the remaining ones. But you must indicate how much you have trimmed. In SPSS, one of the procedures produces a 5% trimmed mean, which means that you disregard the 5% of the data that are farthest away from the center, and then you compute the mean of the remaining data entries.

The mean has a mathematical property that will be used later on. Starting from the definition of the mean, which states that

\[ \bar{x} = \frac{\sum x_i}{n} \]

we can conclude, by multiplying both sides by n, that:

\[ n\bar{x} = \sum x_i \]

In plain language, this states that the sum of all entries is equal to n times the mean. We will discuss all the limitations and warnings concerning the mean in a later section on methodological issues.
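The wage example can be verified with a few lines of Python. This is a sketch only. The helper `trimmed_mean` drops a fixed number of values from each end of the sorted data, which is a simplifying assumption made for a five-person group. It is not the exact procedure SPSS uses for its 5% trimmed mean.

```python
def mean(values):
    """Sum of all entries divided by their number."""
    return sum(values) / len(values)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values.

    Hypothetical helper: k is chosen by the analyst and must be reported.
    """
    if 2 * k >= len(values):
        raise ValueError("k is too large for this data set")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


wages = [10, 20, 45, 60, 65]              # the original group
wages_with_lawyer = [10, 20, 45, 60, 400]  # last person replaced by the lawyer

print(mean(wages))                         # 40.0
print(mean(wages_with_lawyer))             # 107.0
print(trimmed_mean(wages_with_lawyer, 1))  # mean of 20, 45 and 60
print(sum(wages) == len(wages) * mean(wages))  # True: sum equals n times the mean
```

Dropping one value from each end removes the $400 outlier, and the trimmed mean falls back to a value close to the original $40.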
THE MEAN OF DATA GROUPED INTO CLASSES

When we are given numerical data that is grouped into classes, and we do not know the exact value of every single entry, we can still compute the mean of the distribution by using the midpoint of every class. What we get is not the exact mean, but it is the closest guess of the mean that is available. If the classes are not too wide, the value obtained by using the midpoints is not that different from the value that would have resulted from the individual data.

Consider one of the intervals i with frequency \( f_i \) and midpoint \( x_i \). The exact sum of all the entries in that class is not known, but we can approximate it using the midpoint. Thus, instead of the sum of the individual entries (not known) we will count the midpoint of the class \( f_i \) times. We obtain the following formula.

\[ \text{Mean for grouped data} = \frac{\sum f_i x_i}{n} \]

Here, n is the number of all entries in the sample. It is therefore equal to the sum of the class frequencies, that is, the sum of the number of individuals in the various classes. The formula can thus be rewritten as