Descriptive statistics Petr Ocelík ESS401 Social Science Methodology 13th October 2015 Outline • Measures of central tendency, position, and variability. • Graphic displays of descriptive statistics. • R introduction: cont’d. Descriptive statistics • The purpose is to summarize data. • Quantitative variables have two key features: – The center of the data – a typical observation. – The variability of the data – the spread around the center. Notation Central tendency • The statistics that describe the center of a frequency distribution for a quantitative variable. • Shows a typical observation/case. • Most common measures: mean, mode, and median. Central tendency: mean • Arithmetic mean • Properties: – Center of gravity of a distribution. – Can be used only for metric scales. – Strongly influenced by outliers. Central tendency: mode – Value that occurs most frequently in the sample. – Applicable at all levels of measurement. – Used mainly for highly discrete variables such as categorical data. – {“catholic”, “Muslim”, “Hindu”, “catholic”, “catholic”, “Muslim”, “catholic”, “catholic”} – {1, 2, 3, 1, 1, 2, 1, 1} – {“agree”, “agree”, “disagree”, “agree”, “neutral”, “disagree”, “disagree”, “disagree”, “agree”} – {1, 1, -1, 1, 0, -1, -1, -1, 1} – Years of education. Central tendency: median – Observation that is in the middle of the ordered sample (between 50th bottom and 50th upper percentile). – Splits data into two parts with equal # of observations. – For even sized samples: average value of the two middle observations. – Applicable at least at ordinal level. Central tendency: median – Identification of median: (n + 1) / 2 ; n = # of observations in the data – Odd numbered n: {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} – Median = (13 + 1)/2 = 7th position = 5 – Even numbered n: {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10} – Median = (12 + 1)/2 = 6.5th position = (6th + 7th position)/2 = (3 + 5)/2 = 4 Central tendency: median Set 1 8 9 10 11 12 Set 2 8 9 10 11 100 Set 3 0 9 10 10 10 Set 4 8 9 10 100 100 Finlan & Agresti 2009: 43 Central tendency • Mode • Median • Mean • {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} Central tendency • Mode • Median • Mean • {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} Position • The measures of central tendency are not sufficient for description of data for a quantitative variable. • Does not describe the spread of the data. • Position measures: describe the point at which a given percentage of the data fall below or above that point. Position: percentile • Percentile. The pth percentile is the point such that p% of the observations fall below that point and (and 100 - p)% fall above it. – E.g. 89th percentile = indicates a point where 89% of observations lie below and 11% lie above it. – Median is a 50th percentile. – “Standard” percentiles: (25, 50, 75), or (10, 25, 50, 75, 90). Position: IQR • Interquartile range – Difference between the values of observations at 75% (upper quartile) and 25% (lower quartile). – Shows spread of middle half of the observations. {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} Median = (13 + 1)/2 = 7th observation = 5 Q1 = (6 + 1)/2 = 3.5th observation = (2 + 2)/2 = 2 Q2 = (6 + 1)/2 = 3.5th observation = (6 + 7)/2 = 6.5 IQR = Q3 – Q1 IQR = 6.5 – 2 = 4.5 Position: quartile • Quartile – Values of observations at 25% (Q1), 50% (Q2), and 75% (Q3) of a distribution. {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} Q1 (25 %) = 2 Q2 (50 %) = 5 Q3 (75 %) = 6.5 Measures of center and position: R commands mean(data) mode does not have standard R function median(data) range(data) IQR(data, na.rm=F) quantile(data, c(0.25, 0.5, 0.75)) Variability • The measures of central tendency are not sufficient for description of data for a quantitative variable. • Does not describe the spread of the data. • Variability measures: describe the deviations of the data from a measure of center (such as mean). – With exception of a range. Variability Finlan & Agresti 2009: 46 Variability: range • Range: difference between largest and smallest value. • The simplest measure of variability. • Does not describe deviations from the mean. {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} Range = 39 – 1 = 38 Variability Finlan & Agresti 2009: 47 Variability: deviation • Deviation – Difference between value of observation and mean. {1, 1, 2, 2, 3, 3, 5, 6, 6, 6, 7, 10, 39} (1 - 7), (1 - 7), (2 - 7), … , (39 - 7) -6, -6, -5, -5, -4, -4, -2, -1, -1, -1, 0, 3, 32 Variability: deviation • Deviation – Difference between value of observation and mean. – Positive deviation: observation value > mean – Negative deviation: observation value < mean – Zero deviation: observation value = mean. – Since sum of deviations = 0, the absolute values or the squares are used in measures that use deviations. Variability: variance • Mean is usually not very indicative for data dispersion: {4, 4, 6, 6}; mean = 5; s^2 = 1.33 {0, 0, 10, 10}; mean = 5; s^2 = 33.33 • Therefore we need other measures such as variance (s^2). Variability: variance • Variance – Squared mean deviation from mean. population = {1, 3, 6, 10} ¼ * ((1 - 5)^2 + (3 - 5)^2 + (6 - 5)^2 + (10 - 5)^2) ¼ * ((-4)^2 + (-2)^2 + 1^2 + 5^2) ¼ * (16 + 4 + 1 + 25) = ¼ * 46 = 11.5 Variability: variance • Variance – Squared approximate mean deviation from mean. sample = {1, 3, 6, 10} 1/3 * ((1 - 5)^2 + (3 - 5)^2 + (6 - 5)^2 + (10 - 5)^2) 1/3 * ((-4)^2 + (-2)^2 + 1^2 + 5^2) 1/3 * (16 + 4 + 1 + 25) = 1/3 * 46 = 15.33 Variability: standard deviation • Standard deviation – Measure of average deviation. – Typical distance of observation from the mean. – Sensitive to outliers. sample = {1, 3, 6, 10} s^2 = 15.33 s = sqrt(15.33) = 3.92 Variability: standard deviation • Properties – s >= 0 – s = 0 only when all observations have same value. – The greater variability about mean, the larger s. – If data are rescaled, the s is rescaled as well. – E.g. if we rescale s of annual income in $ = 34,000 to thousands of $ = 34, the s also changes by factor of 100 from 11,800 to 11.8. Variability: standard deviation • Interpretation – Scale dependent. – E.g. assume that average amount of points received in this course is 50 points graded on a scale 0 to 60. – s = 0 extremely unlikely (no differences in performance). – As well as s > 20 (huge differences in performance). Variability: dimensionless measures • Coefficient of variability – Allows comparisons across different distributions (units, means, …). – Applicable only to ratio scale. • Z-score – Standardized measure of variability. – Express variation in standard deviations instead of original metric. Variability: dimensionless measures • Coefficient of variability – Proportion of std. dev. on the mean value. – Allows to compare variability of different data sets. – mean = 80, std. dev. = 12, CV = 12 / 80 = 0.15 – mean = 50, std. dev. = 20, CV = 20 / 50 = 0.40 Variability: dimensionless measures • Z-score – Shows a distance of an observation in # of standard deviations from the mean. – For bell-shaped distributions very unlikely to have values larger than 3 std. deviations from the mean. – Data = {1, 3, 6, 10} ; mean = 5 ; s = 3.92 – Z-score for 2nd case: (3 - 5) / 3.92 = - 0.51 – 0.51 * 3.92 = 1.99 ; 1.99 + 3 ~ 5 – Z-scores = (-1.02, -0.51, 0.26, 1.28) Measures of variability: R commands range(data) var(data) sd(data) scale(data) = z-scores sd(data) / mean(data) = coefficient of variability Frequency distribution • Frequency distribution: table or visual display of the frequency of variable values. 155-160 3 160-165 2 165-170 9 170-175 7 175-180 10 180-185 5 185-190 5 190-195 1 195-200 0 Frequency distribution • Absolute frequency: # of the observations of a category. • Relative frequency: proportion of the observations of a category over total # of observations. • Percentage: proportion multiplied by 100. 155-160 3 0.07 7% 160-165 2 0.05 5% 165-170 9 0.21 21% 170-175 7 0.17 17% 175-180 10 0.24 24% 180-185 5 0.12 12% 185-190 5 0.12 12% 190-195 1 0.02 2% 195-200 0 0 0% Bar chart • The columns are positioned over values of categorical variable (U.S. states). • The height of the column indicates the value of the variable (per capita income). Histogram • The columns are positioned over a values of quantitative variable. • The column label can be single value or range of values. • The height of the column indicates the value of the variable. Boxplot • Splits data into quartiles (position measure). • Box: from Q1 to Q3. • Median (Q2): line within the box. • Whiskers: indicate the range from: – Q1 to smallest non-outlier. – Q3 to largest non-outlier. • Outlier > 1.5 * (Q3 – Q1) from Q1 or Q3 • Outliers are represented separately.