© 2002-2008 The Trustees of Indiana University
Univariate Analysis and Normality Test
http://www.indiana.edu/~statmath

Indiana University
University Information Technology Services

Univariate Analysis and Normality Test Using SAS, Stata, and SPSS*

Hun Myoung Park, Ph.D.

© 2002-2008
Last modified on November 2008

University Information Technology Services
Center for Statistical and Mathematical Computing
Indiana University
410 North Park Avenue, Bloomington, IN 47408
(812) 855-4724 (317) 278-4740
http://www.indiana.edu/~statmath

* The citation of this document should read: “Park, Hun Myoung. 2008. Univariate Analysis and Normality Test Using SAS, Stata, and SPSS. Working Paper. The University Information Technology Services (UITS) Center for Statistical and Mathematical Computing, Indiana University.” http://www.indiana.edu/~statmath/stat/all/normality/index.html

This document summarizes graphical and numerical methods for univariate analysis and normality testing, and illustrates how to apply them using SAS 9.1, Stata 10 special edition, and SPSS 16.0.

1. Introduction
2. Graphical Methods
3. Numerical Methods
4. Testing Normality Using SAS
5. Testing Normality Using Stata
6. Testing Normality Using SPSS
7. Conclusion

1. Introduction

Descriptive statistics provide important information about the variables to be analyzed. The mean, median, and mode measure the central tendency of a variable. Measures of dispersion include the variance, standard deviation, range, and interquartile range (IQR). Researchers may draw a histogram, stem-and-leaf plot, or box plot to see how a variable is distributed.

Statistical methods are based on various underlying assumptions. One common assumption is that a random variable is normally distributed. In many statistical analyses, normality is conveniently assumed without any empirical evidence or test.
But normality is critical in many statistical methods. When this assumption is violated, interpretation and inference may not be reliable or valid.

Figure 1. Comparing the Standard Normal and a Bimodal Probability Distribution
(Two density plots on a horizontal axis from -5 to 5. Panels: Standard Normal Distribution; Bimodal Distribution.)

The t-test and ANOVA (Analysis of Variance) compare group means, assuming that the variable of interest follows a normal probability distribution. Otherwise, these methods do not make much sense. Figure 1 illustrates the standard normal probability distribution and a bimodal distribution. How can you compare the means of these two random variables?

There are two ways of testing normality (Table 1). Graphical methods visualize the distributions of random variables or the differences between an empirical distribution and a theoretical distribution (e.g., the standard normal distribution). Numerical methods present summary statistics such as skewness and kurtosis, or conduct statistical tests of normality. Graphical methods are intuitive and easy to interpret, while numerical methods provide objective ways of examining normality.

Table 1. Graphical Methods versus Numerical Methods

                Graphical Methods                     Numerical Methods
Descriptive     Stem-and-leaf plot, (skeletal) box    Skewness, kurtosis
                plot, dot plot, histogram
Theory-driven   P-P plot, Q-Q plot                    Shapiro-Wilk and Shapiro-Francia tests
                                                      Kolmogorov-Smirnov test (Lilliefors test)
                                                      Anderson-Darling/Cramer-von Mises tests
                                                      Jarque-Bera test, Skewness-Kurtosis test

Graphical and numerical methods are either descriptive or theory-driven. A dot plot and a histogram, for instance, are descriptive graphical methods, while skewness and kurtosis are descriptive numerical methods.
The P-P and Q-Q plots are theory-driven graphical methods for testing normality, whereas the Shapiro-Wilk W and Jarque-Bera tests are theory-driven numerical methods.

Figure 2. Histograms of Normally and Non-normally Distributed Variables
(Left panel: A Normally Distributed Variable (N=500), randomly drawn from the standard normal distribution (Seed=1,234,567). Right panel: A Non-normally Distributed Variable (N=164), per capita gross national income in 2005 ($1,000).)

Three variables are employed here. The first variable is the unemployment rate of Illinois, Indiana, and Ohio in 2005. The second variable includes 500 observations that were randomly drawn from the standard normal distribution. This variable is supposed to be normally distributed with mean 0 and variance 1 (left plot in Figure 2). An example of a non-normal distribution is the per capita gross national income (GNI) in 2005 of 164 countries in the world. GNIP is severely skewed to the right and is unlikely to be normally distributed (right plot in Figure 2). See the Appendix for details.

2. Graphical Methods

Graphical methods visualize the distribution of a random variable and compare the distribution to a theoretical one using plots. These methods are either descriptive or theory-driven. The former are based on the empirical data alone, whereas the latter consider both empirical and theoretical distributions.

2.1 Descriptive Plots

Frequently used descriptive plots include the stem-and-leaf plot, dot plot, (skeletal) box plot, and histogram. When N is small, a stem-and-leaf plot and a dot plot are useful for summarizing continuous or event count data. Figures 3 and 4 respectively present a stem-and-leaf plot and a dot plot of the unemployment rate of three states.

Figure 3.
Stem-and-Leaf Plot of Unemployment Rate of Illinois, Indiana, Ohio

-> state = IL
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
   3. | 7889
   4* | 011122344
   4. | 556666666677778888999
   5* | 0011122222333333344444
   5. | 5555667777777888999
   6* | 000011222333444
   6. | 555579
   7* | 0033
   7. |
   8* | 0
   8. | 8

-> state = IN
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
   3* | 1
   3. | 89
   4* | 012234
   4. | 566666778889999
   5* | 00000111222222233344
   5. | 555666666777889
   6* | 002222233344
   6. | 5666677889
   7* | 1113344
   7. | 67
   8* | 14

-> state = OH
Stem-and-leaf plot for rate (Rate)
rate rounded to nearest multiple of .1
plot in units of .1
   3* | 8
   4* | 014577899
   5* | 01223333445556667778888888999
   6* | 001111122222233444446678899
   7* | 01223335677
   8* | 1223338
   9* | 99
  10* | 1
  11* |
  12* |
  13* | 3

Figure 4. Dot Plot of Unemployment Rate of Illinois, Indiana, Ohio
(Panels: Illinois (N=102), Indiana (N=92), Ohio (N=88). Source: Bureau of Labor Statistics; Indiana Business Research Center, http://www.stats.indiana.edu/)

A box plot presents the minimum, 25th percentile (1st quartile), 50th percentile (median), 75th percentile (3rd quartile), and maximum in a box and lines.¹ Outliers, if any, appear outside the (adjacent) minimum and maximum lines. As such, a box plot effectively summarizes these major percentiles. If a variable is normally distributed, its 25th and 75th percentiles are symmetric about the median, and its median and mean coincide at the center of the box.²

In Figure 5, you should see outliers in Illinois and Ohio that affect the shapes of the corresponding boxes. By contrast, the Indiana unemployment rate does not have outliers, and its symmetric box suggests that the rate is approximately normally distributed.

Figure 5.
Box Plots of Unemployment Rates of Illinois, Indiana, and Ohio
(Panels: Illinois (N=102), Indiana (N=92), Ohio (N=88). Vertical axis: Unemployment Rate (%). Source: Bureau of Labor Statistics; Indiana Business Research Center, http://www.stats.indiana.edu/)

The histogram shows the proportion of total observations that falls into each category (interval) and is appropriate when N is large (Figure 6).

Figure 6. Histograms of Unemployment Rates of Illinois, Indiana and Ohio
(Panels: Illinois (N=102), Indiana (N=92), Ohio (N=88). Source: Bureau of Labor Statistics; Indiana Business Research Center, http://www.stats.indiana.edu/)

¹ The first quartile cuts off the lowest 25 percent of data; the second quartile, the median, cuts the data set in half; and the third quartile cuts off the lowest 75 percent or highest 25 percent of data. See http://en.wikipedia.org/wiki/Quartile
² SAS reports the mean as “+” between the (adjacent) minimum and maximum lines.

2.2 Theory-driven Plots

P-P and Q-Q plots are considered here. The probability-probability plot (P-P plot or percent plot) compares the empirical cumulative distribution function of a variable with a specific theoretical cumulative distribution function (e.g., the standard normal distribution function). In Figure 7, Ohio appears to deviate more from the fitted line than Indiana.

Figure 7.
P-P Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
(Panels: 2005 Indiana Unemployment Rate (N=92 Counties); 2005 Ohio Unemployment Rate (N=88 Counties). Horizontal axis: Empirical P[i] = i/(N+1). Source: Bureau of Labor Statistics)

Similarly, the quantile-quantile plot (Q-Q plot) compares the ordered values of a variable with the quantiles of a specific theoretical distribution (e.g., the normal distribution). If the two distributions match, the points on the plot will form a linear pattern passing through the origin with a unit slope. P-P and Q-Q plots are used to see how well a theoretical distribution models the empirical data. In Figure 8, Indiana appears to have a smaller variation in its unemployment rate than Ohio. By contrast, Ohio appears to have a wider range of outliers in the upper extreme.

Figure 8. Q-Q Plots of Unemployment Rates of Indiana and Ohio (Year 2005)
(Panels: 2005 Indiana Unemployment Rate (N=92 Counties); 2005 Ohio Unemployment Rate (N=88 Counties). Vertical axis: Unemployment Rate in 2005. Horizontal axis: Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles. Source: Bureau of Labor Statistics)

Detrended normal P-P and Q-Q plots depict the actual deviations of data points from the straight horizontal line at zero. The absence of any specific pattern in a detrended plot indicates normality of the variable. SPSS can generate detrended P-P and Q-Q plots.

3. Numerical Methods

Graphical methods, although visually appealing, do not provide objective criteria to determine normality of variables.
Interpretations are thus a matter of judgment. Numerical methods use descriptive statistics and statistical tests to examine normality.

3.1 Descriptive Statistics

Measures of dispersion such as the variance reveal how observations of a random variable deviate from their mean. The second central moment is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

Skewness is the third standardized moment and measures the degree of asymmetry of a probability distribution. If skewness is greater than zero, the distribution is skewed to the right, having more observations on the left.

$$\frac{E[(x-\mu)^3]}{\sigma^3} = \frac{\sum (x_i-\bar{x})^3}{s^3 (n-1)} = \frac{\sqrt{n-1}\,\sum (x_i-\bar{x})^3}{\left[\sum (x_i-\bar{x})^2\right]^{3/2}}$$

Kurtosis, based on the fourth central moment, measures the heaviness of the tails or “peakedness” of a probability distribution.

$$\frac{E[(x-\mu)^4]}{\sigma^4} = \frac{\sum (x_i-\bar{x})^4}{s^4 (n-1)} = \frac{(n-1)\sum (x_i-\bar{x})^4}{\left[\sum (x_i-\bar{x})^2\right]^{2}}$$

Figure 9. Probability Distributions with Different Kurtosis
(Three density plots on a horizontal axis from -5 to 5. Panels: Kurtosis < 3; Kurtosis = 3; Kurtosis > 3.)

If the kurtosis of a random variable is less than three (or if kurtosis-3 is less than zero), the distribution has thinner tails and a flatter peak than a normal distribution (first plot in Figure 9).³ By contrast, kurtosis larger than 3 indicates a sharper peak and heavier tails (last plot). A normally distributed random variable should have skewness and kurtosis near zero and three, respectively (second plot in Figure 9).
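The variance, skewness, and kurtosis formulas of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_moments` and the toy data are hypothetical and not part of the original analysis, and the code uses the n-1 forms as written in this section. SAS, Stata, and SPSS apply their own adjustments, so their reported values may differ.

```python
import math


def sample_moments(xs):
    """Return (variance, skewness, kurtosis) using the n-1 forms of this section.

    variance = sum((x - mean)^2) / (n - 1)
    skewness = sqrt(n - 1) * sum((x - mean)^3) / [sum((x - mean)^2)]^(3/2)
    kurtosis = (n - 1) * sum((x - mean)^4) / [sum((x - mean)^2)]^2
    """
    n = len(xs)
    mean = sum(xs) / n
    ss2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    ss3 = sum((x - mean) ** 3 for x in xs)  # sum of cubed deviations
    ss4 = sum((x - mean) ** 4 for x in xs)  # sum of fourth-power deviations
    variance = ss2 / (n - 1)
    skewness = math.sqrt(n - 1) * ss3 / ss2 ** 1.5
    kurtosis = (n - 1) * ss4 / ss2 ** 2
    return variance, skewness, kurtosis


# Hypothetical toy data: symmetric, so skewness is zero.
variance, skewness, kurtosis = sample_moments([1, 2, 3, 4, 5])
print(variance, skewness, kurtosis)
```

A symmetric sample gives zero skewness, and subtracting 3 from the returned kurtosis gives the "kurtosis-3" scale discussed above.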
 state |    N      mean   median    max   min   variance   skewness   kurtosis
-------+----------------------------------------------------------------------
    IL |  102  5.421569     5.35    8.8   3.7   .8541837   .6570033   3.946029
    IN |   92  5.641304      5.5    8.4   3.1   1.079374   .3416314   2.785585
    OH |   88    6.3625      6.1   13.3   3.8   2.126049   1.665322   8.043097
-------+----------------------------------------------------------------------
 Total |  282  5.786879     5.65   13.3   3.1   1.473955    1.44809   8.383285
------------------------------------------------------------------------------

In short, skewness and kurtosis show how the distribution of a variable deviates from a normal distribution. These statistics are based on the empirical data.

3.2 Theory-driven Statistics

The numerical methods of testing normality include the Kolmogorov-Smirnov (K-S) D test (Lilliefors test), Shapiro-Wilk test, Anderson-Darling test, and Cramer-von Mises test (SAS Institute 1995).⁴ The K-S D test and Shapiro-Wilk W test are commonly used. The K-S, Anderson-Darling, and Cramer-von Mises tests are based on the empirical distribution function (EDF), which is defined for a set of N independent observations x1, x2, …, xn with a common distribution function F(x) (SAS 2004).

Table 2. Numerical Methods of Testing Normality

Test                 Statistic   N Range       Dist.    SAS    Stata       SPSS
Jarque-Bera          χ²                        χ²(2)    -      -
Skewness-Kurtosis    χ²          9≤N           χ²(2)    -      .sktest
Shapiro-Wilk         W           7≤N≤2,000     -        YES    .swilk      YES
Shapiro-Francia      W'          5≤N≤5,000     -        -      .sfrancia
Kolmogorov-Smirnov   D                         EDF      YES    *           YES
Cramer-von Mises     W²                        EDF      YES    -
Anderson-Darling     A²                        EDF      YES    -

* The Stata .ksmirnov command is not used for testing normality.

The Shapiro-Wilk W is the ratio of the best estimator of the variance to the usual corrected sum of squares estimator of the variance (Shapiro and Wilk 1965).⁵ The statistic is positive and less than or equal to one. A value close to one indicates normality.

³ SAS and SPSS produce (kurtosis - 3), while Stata returns the kurtosis.
SAS uses its weighted kurtosis formula with the degrees of freedom adjusted. So, if N is small, SAS, Stata, and SPSS may report different kurtosis.
⁴ The UNIVARIATE and CAPABILITY procedures have the NORMAL option to produce four statistics.
⁵ The W statistic was constructed by considering the regression of ordered sample values on corresponding expected normal order statistics, which for a sample from a normally distributed population is linear (Royston 1982). Shapiro and Wilk’s (1965) original W statistic is valid for sample sizes between 3 and 50, but Royston extended the test by developing a transformation of the null distribution of W to approximate normality throughout the range between 7 and 2000.

The W statistic requires that the sample size is greater than or equal to 7 and less than or equal to 2,000 (Shapiro and Wilk 1965).⁶

$$W = \frac{\left(\sum a_i x_{(i)}\right)^2}{\sum (x_i - \bar{x})^2}$$

where a’=(a1, a2, …, an) = $m'V^{-1}\left[m'V^{-1}V^{-1}m\right]^{-1/2}$, m’=(m1, m2, …, mn) is the vector of expected values of standard normal order statistics, V is the n by n covariance matrix, x’=(x1, x2, …, xn) is a random sample, and x(1) < x(2) < …

                                           |t|     0.0346
Sign           M        -28    Pr >= |M|   0.0138
Signed Rank    S      -6523    Pr >= |S|   0.0435

Tests for Normality

Test                  --Statistic---    -----p Value-----
Shapiro-Wilk          W     0.995564    Pr < W      0.168
Kolmogorov-Smirnov    D     0.026891    Pr > D     >0.150
Cramer-von Mises      W-Sq  0.083351    Pr > W-Sq   0.195
Anderson-Darling      A-Sq  0.540894    Pr > A-Sq   0.171

Quantiles (Definition 5)

Quantile         Estimate
100% Max      2.511694336
99%           2.055464409
95%           1.530450397
90%           1.215210586
75% Q3        0.612538495
50% Median   -0.119592165
25% Q1       -0.805191028
10%          -1.413548051
5%           -1.794057126
1%           -2.219479314
0% Min       -2.837417522

Extreme Observations

-------Lowest-------    -------Highest------
      Value      Obs          Value      Obs
-2.83741752       29     2.14897641      119
-2.59039285      204     2.21109349      340
-2.47829639       73     2.42113892      325
-2.39126554      391     2.42171307      139
-2.24047386      393     2.51169434      332

4.1.2 Graphical Methods

The stem-and-leaf plot and box plot, produced by the UNIVARIATE procedure, illustrate that the variable is normally distributed (Figure 11). The locations of the first quartile, mean, median, and third quartile indicate a bell-shaped distribution. Note that the mean -.0951 and median -.1196 are very close.

Figure 11. Stem-and-Leaf Plot and Box Plot of a Normally Distributed Variable

                Histogram                          #   Boxplot
 2.75+*                                            1      |
     .**                                           4      |
     .********                                    23      |
     .****************                            46      |
     .***********************                     68   +-----+
     .***************************                 80   |     |
     .***************************************    116   *--+--*
     .**********************                      64   +-----+
     .*******************                         56      |
     .*********                                   27      |
     .*****                                       13      |
-2.75+*                                            2      |
      ----+----+----+----+----+----+----+----
      * may represent up to 3 counts

The normal probability plot available in UNIVARIATE shows a straight line, implying the normality of the randomly drawn variable (Figure 12).

Figure 12. Normal Probability Plot of a Normally Distributed Variable
(SAS line-printer plot titled "Normal Probability Plot". Vertical axis from -2.75 to 2.75; horizontal axis from -2 to +2. The observed points (*) follow the reference line (+) closely over the whole range.)

The P-P and Q-Q plots below show that the data points do not deviate seriously from the fitted line. They consistently indicate that the variable is normally distributed.

Figure 13. P-P Plot and Q-Q Plot of a Normally Distributed Variable
(P-P plot: cumulative distribution of random against Normal (Mu=-0.095, Sigma=1.0033). Q-Q plot: random against normal quantiles.)

4.1.3 Numerical Methods

The mean of -.0951 is very close to 0 and the variance is almost 1. The skewness and kurtosis-3 are respectively -.0204 and -.3988, indicating an almost normal distribution. However, these descriptive statistics do not provide conclusive information about normality.

SAS provides four different statistics for testing normality. The Shapiro-Wilk W of .9956 does not reject the null hypothesis that the variable is normally distributed (p=.168). Similarly, the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests do not reject the null hypothesis. Since the number of observations is less than 2,000, however, the Shapiro-Wilk W test is the appropriate one for this case.

The Jarque-Bera test also indicates the normality of the randomly drawn variable (p=.1875). Note that -.3988 is kurtosis-3.

$$500\left[\frac{(-0.0203721)^2}{6} + \frac{(-0.3988198)^2}{24}\right] = 3.3482776 \sim \chi^2(2)$$

Consequently, we can safely conclude that the randomly drawn variable is normally distributed.

4.2 A Non-normally Distributed Variable

Let us examine per capita gross national income as an example of a non-normally distributed variable. See the appendix for details about this variable.

4.2.1 SAS Output of Descriptive Statistics

This section employs the UNIVARIATE procedure to compute descriptive statistics and perform normality tests. The variable has mean 8.9646 and median 2.7650, which are substantially different. The variance of 184.0577 is extremely large.
PROC UNIVARIATE DATA=masil.gnip NORMAL PLOT;
   VAR gnip;
   QQPLOT gnip /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
   HISTOGRAM / NORMAL(COLOR=MAROON W=4) CFILL = BLUE CFRAME = LIGR;
RUN;

The UNIVARIATE Procedure
Variable: GNIP

Moments

N                       164    Sum Weights             164
Mean              8.9645732    Sum Observations 1470.19001
Std Deviation    13.5667877    Variance         184.057728
Skewness         2.04947469    Kurtosis         3.60816725
Uncorrected SS   43181.0356    Corrected SS     30001.4096
Coeff Variation  151.337798    Std Error Mean   1.05938813

Basic Statistical Measures

    Location                    Variability
Mean     8.964573     Std Deviation          13.56679
Median   2.765000     Variance              184.05773
Mode     1.010000     Range                  65.34000
                      Interquartile Range     7.72500

Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------
Student's t    t  8.462029    Pr > |t|    <.0001
Sign           M        82    Pr >= |M|   <.0001
Signed Rank    S      6765    Pr >= |S|   <.0001

Tests for Normality

Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W     0.663114    Pr < W     <0.0001
Kolmogorov-Smirnov    D     0.284426    Pr > D     <0.0100
Cramer-von Mises      W-Sq  4.346966    Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  22.23115    Pr > A-Sq  <0.0050

Quantiles (Definition 5)

Quantile      Estimate
100% Max        65.630
99%             59.590
95%             38.980
90%             32.600
75% Q3           8.680
50% Median       2.765
25% Q1           0.955
10%              0.450
5%               0.370
1%               0.290
0% Min           0.290

Extreme Observations

----Lowest----    ----Highest----
Value      Obs    Value       Obs
 0.29      164    46.32         5
 0.29      163    47.39         4
 0.31      162    54.93         3
 0.33      161    59.59         2
 0.34      160    65.63         1

4.2.2 Graphical Methods

The stem-and-leaf plot, box plot, and normal probability plots all indicate that the variable is not normally distributed (Figure 14). Most observations are highly concentrated on the left side of the distribution. See the stem-and-leaf plot and box plot in Figure 14.

Figure 14.
Stem-and-Leaf Plot, Box Plot, and Normal Probability Plot

                Histogram                       #   Boxplot
67.5+*                                          1      *
    .
    .*                                          1      *
    .*                                          1      *
    .*                                          2      *
    .*                                          3      *
    .**                                         6      *
    .**                                         5      0
    .**                                         5      0
    .*                                          2      0
    .**                                         6      |
    .***                                        7      |
    .******                                    17   +--+--+
 2.5+************************************     108   *-----*
     ----+----+----+----+----+----+----+
     * may represent up to 3 counts

(SAS line-printer plot titled "Normal Probability Plot". Vertical axis from 2.5 to 67.5; horizontal axis from -2 to +2. The observed points (*) depart strongly from the reference line (+), with most points clustered at the bottom of the plot.)

The following P-P and Q-Q plots show that the data points deviate seriously from the fitted line (Figure 15).

Figure 15. P-P Plot and Q-Q Plot of a Non-normally Distributed Variable
(P-P plot: cumulative distribution of GNIP against Normal (Mu=8.9646, Sigma=13.567). Q-Q plot: GNIP against normal quantiles.)

4.2.3 Numerical Methods

Per capita gross national income has a mean of 8.9646 and a large variance of 184.0577. Its skewness and kurtosis-3 are 2.0495 and 3.6082, respectively, indicating that the variable is highly skewed to the right with a high peak and heavy tails.

It is not surprising that the Shapiro-Wilk test rejects the null hypothesis; W is .6631 and the p-value is less than .0001. The Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests report similar results. Finally, the Jarque-Bera test returns 203.7717, which rejects the null hypothesis of normality at the .05 level (p<.0001).

$$164\left[\frac{2.04947469^2}{6} + \frac{3.60816725^2}{24}\right] = 203.77176 \sim \chi^2(2)$$

In sum, we can conclude that per capita gross national income is not normally distributed.
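The two Jarque-Bera statistics reported in Sections 4.1.3 and 4.2.3 can be re-derived from the SAS skewness and kurtosis-3 values with a short Python sketch. The helper name `jarque_bera` is an illustrative assumption, not a SAS function. The p-value uses the fact that the upper-tail probability of a chi-squared distribution with 2 degrees of freedom is exp(-x/2), so no statistics library is needed.

```python
import math


def jarque_bera(n, skewness, excess_kurtosis):
    """Jarque-Bera statistic n*(S^2/6 + K^2/24), where K is kurtosis-3.

    Returns the statistic and its chi-squared(2) upper-tail p-value,
    which has the closed form exp(-statistic / 2).
    """
    statistic = n * (skewness ** 2 / 6 + excess_kurtosis ** 2 / 24)
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


# Values copied from the SAS output: N, skewness, and kurtosis-3.
jb_normal, p_normal = jarque_bera(500, -0.0203721, -0.3988198)
jb_gnip, p_gnip = jarque_bera(164, 2.04947469, 3.60816725)

print(f"normal: JB = {jb_normal:.4f}, p = {p_normal:.4f}")
print(f"gnip:   JB = {jb_gnip:.4f}, p = {p_gnip:.3g}")
```

Because SAS reports kurtosis-3 directly, the values are passed in unchanged; a kurtosis on Stata's scale would need 3 subtracted first.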
5. Testing Normality Using Stata

In Stata, you have to use individual commands to get specific statistics or draw various plots. This section contrasts normally distributed and non-normally distributed variables using graphical and numerical methods.

5.1 Graphical Methods

A histogram is the most widely used graphical method. The histograms of the normally and non-normally distributed variables are presented in the introduction. The Stata .histogram command is followed by a variable name and options. The normal option adds a normal density curve to the histogram.

. histogram normal, normal
. histogram gnip, normal

Let us draw a stem-and-leaf plot using the .stem command. The stem-and-leaf plot of the randomly drawn variable shows a bell-shaped distribution (Figure 16).

. stem normal

Figure 16. Stem-and-Leaf Plot of a Normally Distributed Variable

Stem-and-leaf plot for normal
normal rounded to nearest multiple of .01
plot in units of .01
 -28* | 4
 -27* |
 -26* |
 -25* | 9
 -24* | 8
 -23* | 9
 -22* | 40
 -21* | 93221
 -20* | 8650
 -19* | 8842
 -18* | 875200
 -17* | 94
 -16* | 9987550
 -15* | 97643320
 -14* | 87755432110
 -13* | 98777655433210
 -12* | 8866666433210
 -11* | 987774332210
 -10* | 875322
  -9* | 88887665542210
  -8* | 99988777533110
  -7* | 77766544100
  -6* | 998332
  -5* | 99988877654433221110
  -4* | 9998766655444433321
  -3* | 88766654433322221100
  -2* | 999988766555544433322111100
  -1* | 8888777776655544433222221110
  -0* | 99887776655433333111
   0* | 01233344445669
   1* | 0111222333445666778
   2* | 0001234444556889999
   3* | 1133444556667899
   4* | 014455667777
   5* | 00112334556888
   6* | 0001123668899
   7* | 00233466799999
   8* | 1122334667889
   9* | 012445666778889
  10* | 1133457799
  11* | 1222334445689
  12* | 122233489
  13* | 26889
  14* | 2777799
  15* | 00112459
  16* | 1347
  17* | 02467
  18* | 358
  19* | 03556
  20* |
  21* | 5
  22* | 1
  23* |
  24* | 22
  25* | 1

By contrast, per capita gross national income is highly skewed to the right, with most observations below $10,000 (Figure 17).

. stem gnip

Figure 17. Stem-and-Leaf Plot of a Non-normally Distributed Variable

Stem-and-leaf plot for gnip
gnip rounded to nearest multiple of .1
plot in units of .1
  0** | 03,03,03,03,03,03,03,03,04,04,04,04,04,04,04,04,04,04,05,05, ... (64)
  0** | 21,22,23,23,23,24,24,24,24,25,25,25,26,26,26,27,28,28,28,28, ... (34)
  0** | 44,45,45,46,46,47,48,48,50,50,50,52,53,55,59
  0** | 62,68,71,71,73,76,79
  0** | 81,82,83,91,91
  1** | 00,04,07,09,18
  1** | 36
  1** | 44,58
  1** | 62,65,74
  1** | 86,97
  2** |
  2** | 38
  2** | 40,54
  2** | 60,75,77,78
  2** |
  3** | 00
  3** | 22,26
  3** | 46,48,57
  3** | 66,70,75,76
  3** | 90
  4** | 02,11
  4** | 37
  4** |
  4** | 63,74
  4** |
  5** |
  5** |
  5** | 49
  5** |
  5** | 96
  6** |
  6** |
  6** | 56

The .dotplot command generates a dot plot, very similar to the stem-and-leaf plot, in descending order (Figure 18).

. dotplot normal
. dotplot gnip

Figure 18. Dot Plots of Normally and Non-normally Distributed Variables
(Panels: A Normally Distributed Variable (N=500); A Non-normally Distributed Variable (N=164). Horizontal axis: Frequency.)

The .graph box command draws a box plot. In the left plot of Figure 19, the shaded box represents the 25th percentile, median, and 75th percentile, which are symmetrically arranged. The right plot has an asymmetric box with many outliers beyond the adjacent maximum line.

. graph box normal
. graph box gnip

Figure 19. Box Plots of Normally and Non-normally Distributed Variables
(Panels: A Normally Distributed Variable (N=500); A Non-normally Distributed Variable (N=164).)

The .pnorm command produces a standardized normal P-P plot.
The left plot shows almost no deviation from the line, while the right depicts an s-shaped curve that is largely deviated from the fitted line. In Stata, a P-P plot has the cumulative distribution of an empirical variable on the x axis and the theoretical normal distribution on the y axis.9 .pnorm normal .pnorm gnip 9 In SAS, these distributions are located reversely. © 2002-2008 The Trustees of Indiana University Univariate Analysis and Normality Test: 23 http://www.indiana.edu/~statmath Figure 20. P-P plots of Normally and Non-normally Distributed Variables0.000.250.500.751.00 NormalF[(normal-m)/s] 0.00 0.25 0.50 0.75 1.00 Empirical P[i] = i/(N+1) A Normally Distributed Variable (N=500) 0.000.250.500.751.00 NormalF[(gnip-m)/s] 0.00 0.25 0.50 0.75 1.00 Empirical P[i] = i/(N+1) A Non-normally Distributed Variable (N=164) The .qnorm command produces a standardized normal Q-Q plot. The following Q-Q plots show a similar pattern that P-P plots do (Figure 21). In the right plot, data points are systematically deviated from the straight fitted line. .qnorm normal .qnorm gnip Figure 21. Q-Q plots of Normally and Non-normally Distributed Variables -1.794057-.11959221.53045 -4-2024 -.0950725 1.555212-1.745357 -4 -2 0 2 4 Inverse Normal Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles A Normally Distributed Variable (N=500) .372.76538.98 -200204060 8.964573 31.27995-13.35081 -20 0 20 40 Inverse Normal Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles A Non-normally Distributed Variable (N=164) 5.2 Numerical Methods Let us first get summary statistics using the .summarize command. The detail option lists various statistics including mean, standard deviation, minimum, and maximum. Skewness and kurtosis of a randomly drawn variable are respectively close to 0 and 3, implying normality. Per capital gross national income has large skewness of 2.03 and kurtosis of 6.46, being skewed to the right with a high peak and flat tails. . 
summarize normal, detail normal ------------------------------------------------------------- Percentiles Smallest 1% -2.219479 -2.837418 5% -1.794057 -2.590393 © 2002-2008 The Trustees of Indiana University Univariate Analysis and Normality Test: 24 http://www.indiana.edu/~statmath 10% -1.413548 -2.478296 Obs 500 25% -.805191 -2.391266 Sum of Wgt. 500 50% -.1195922 Mean -.0950725 Largest Std. Dev. 1.003302 75% .6125385 2.211093 90% 1.215211 2.421139 Variance 1.006614 95% 1.53045 2.421713 Skewness -.0203109 99% 2.055464 2.511694 Kurtosis 2.593181 . sum gnip, detail gnip ------------------------------------------------------------- Percentiles Smallest 1% .29 .29 5% .37 .29 10% .45 .31 Obs 164 25% .955 .33 Sum of Wgt. 164 50% 2.765 Mean 8.964573 Largest Std. Dev. 13.56679 75% 8.68 47.39 90% 32.6 54.93 Variance 184.0577 95% 38.98 59.59 Skewness 2.030682 99% 59.59 65.63 Kurtosis 6.462734 The .tabstat command is vary useful to produce descriptive statistics in a table form. The column(variable)option lists statistics vertically (in table rows). The command for the variable normal is skipped. . 
tabstat gnip, stats(n mean sum max min range sd var semean skewness kurtosis /// median p1 p5 p10 p25 p50 p75 p90 p95 p99 iqr q) column(variable) stats | normal ---------+---------- N | 500 mean | -.0950725 sum | -47.53624 max | 2.511694 min | -2.837418 range | 5.349112 sd | 1.003302 variance | 1.006614 se(mean) | .044869 skewness | -.0203109 kurtosis | 2.593181 p50 | -.1195922 p1 | -2.219479 p5 | -1.794057 p10 | -1.413548 p25 | -.805191 p50 | -.1195922 p75 | .6125385 p90 | 1.215211 p95 | 1.53045 p99 | 2.055464 iqr | 1.41773 p25 | -.805191 p50 | -.1195922 p75 | .6125385 -------------------- stats | gnip ---------+---------- N | 164 mean | 8.964573 sum | 1470.19 max | 65.63 min | .29 range | 65.34 sd | 13.56679 variance | 184.0577 se(mean) | 1.059388 skewness | 2.030682 kurtosis | 6.462734 p50 | 2.765 p1 | .29 p5 | .37 p10 | .45 p25 | .955 p50 | 2.765 p75 | 8.68 p90 | 32.6 p95 | 38.98 p99 | 59.59 iqr | 7.725 p25 | .955 p50 | 2.765 p75 | 8.68 -------------------- Now let us conduct statistical tests of normality. Stata provide three testing methods: ShapiroWilk test, Shapiro-Francia test, and Skewness-Kurtosis test. The .swilk and .sfrancia commands respectively conduct the Shapiro-Wilk and Shapiro-Francia tests. Both tests do not © 2002-2008 The Trustees of Indiana University Univariate Analysis and Normality Test: 25 http://www.indiana.edu/~statmath reject normality of the randomly drawn variable and reject normality of per capita gross national income. . swilk normal Shapiro-Wilk W test for normal data Variable | Obs W V z Prob>z -------------+------------------------------------------------- normal | 500 0.99556 1.492 0.962 0.16804 . sfrancia normal Shapiro-Francia W' test for normal data Variable | Obs W' V' z Prob>z -------------+------------------------------------------------- normal | 500 0.99645 1.273 0.541 0.29412 . 
swilk gnip

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
        gnip |    164    0.66322     42.309     8.530    0.00000

. sfrancia gnip

                  Shapiro-Francia W' test for normal data

    Variable |    Obs       W'          V'        z       Prob>z
-------------+--------------------------------------------------
        gnip |    164    0.66365     45.790     7.413    0.00001

Stata's .sktest command conducts the Skewness-Kurtosis (S-K) test, which is conceptually similar to the Jarque-Bera test. The noadjust option suppresses the empirical adjustment made by Royston (1991). The following S-K tests do not reject normality of the randomly drawn variable at the .05 level but, surprisingly, reject the null hypothesis at the .1 level (p = .0850).

. sktest normal

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
      normal |      0.851         0.027         4.93         0.0850

. sktest normal, noadjust

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)      chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
      normal |      0.851         0.027         4.93         0.0850

Like the Shapiro-Wilk and Shapiro-Francia tests, both S-K tests below reject, at the .01 significance level, the null hypothesis that per capita gross national income is normally distributed.

. sktest gnip

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
        gnip |      0.000         0.000        55.33         0.0000

.
sktest gnip, noadjust

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)      chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
        gnip |      0.000         0.000        75.39         0.0000

The Jarque-Bera statistic of normal is 3.4823 = 500*(-.0203109^2/6+(2.593181-3)^2/24), which is not large enough to reject the null hypothesis (p ≈ .1753). The Jarque-Bera statistic of per capita gross national income is 194.6489 = 164*(2.030682^2/6+(6.462734-3)^2/24). This large chi-squared value rejects the null hypothesis (p < .0001). The Jarque-Bera test appears to be more reliable than the Stata S-K test (see Table 4). In conclusion, both graphical and numerical methods provide sufficient evidence that per capita gross national income is not normally distributed.

6. Testing Normality Using SPSS

SPSS has the DESCRIPTIVES and FREQUENCIES commands to produce descriptive statistics. DESCRIPTIVES is usually applied to continuous variables, but FREQUENCIES can also produce various descriptive statistics in addition to frequency tables. The IGRAPH command draws histograms and box plots. The PPLOT command produces (detrended) P-P and Q-Q plots.

The EXAMINE command can produce both descriptive statistics and various plots, such as a stem-and-leaf plot, histogram, box plot, (detrended) P-P plot, and (detrended) Q-Q plot. EXAMINE also performs the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality.

6.1 A Normally Distributed Variable

DESCRIPTIVES summarizes interval or continuous variables, and FREQUENCIES reports frequency tables of discrete variables along with summary statistics. The /STATISTICS subcommand in both commands specifies the statistics to be produced.
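Because the /STATISTICS subcommand can report skewness and kurtosis, the Jarque-Bera statistic computed by hand at the end of Section 5 can be reproduced from any such output. The following is a minimal Python sketch, not part of the original SAS, Stata, or SPSS workflow; the function name jarque_bera is hypothetical. It assumes the kurtosis passed in is the raw value that Stata reports (3 for a normal distribution), and it uses the fact that the upper-tail probability of a chi-squared distribution with 2 degrees of freedom is exp(-x/2).

```python
import math


def jarque_bera(n, skewness, kurtosis):
    """Return (JB statistic, p-value) from skewness and raw kurtosis.

    `kurtosis` is the raw value (3 for a normal distribution), as Stata
    reports it; subtracting 3 gives the excess kurtosis used in the formula.
    """
    jb = n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)
    # Under the null hypothesis JB is asymptotically chi-squared with 2 df,
    # whose survival function has the closed form exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Values copied from the Stata output in Section 5.
jb_normal, p_normal = jarque_bera(500, -0.0203109, 2.593181)
jb_gnip, p_gnip = jarque_bera(164, 2.030682, 6.462734)

print(f"normal: JB = {jb_normal:.4f}, p = {p_normal:.4f}")
print(f"gnip:   JB = {jb_gnip:.4f}, p = {p_gnip:.3g}")
```

With the Stata values above, the sketch should agree with the hand computations of 3.4823 and 194.6489. If the input comes from SAS or SPSS, which report kurtosis-3, add 3 before calling the function.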
The following DESCRIPTIVES command reports the number of observations, sum, mean, variance, and standard deviation of normal.[10] The mean of -.10 and standard deviation of 1 are consistent with a draw from the standard normal distribution, although these two statistics alone do not establish normality.

DESCRIPTIVES VARIABLES=normal
  /STATISTICS=MEAN SUM STDDEV VARIANCE.

Descriptive Statistics
                        N      Sum      Mean    Std. Deviation   Variance
normal                500   -47.54    -.0951        1.00330        1.007
Valid N (listwise)    500

The following FREQUENCIES command produces various statistics of normal, a frequency table, and a histogram.[11] Since normal is continuous, its frequency table is long and thus skipped here. The /HISTOGRAM subcommand draws a histogram, which is the same as the one the GRAPH command in Section 6.1.1 produces.

FREQUENCIES VARIABLES=normal
  /NTILES= 4
  /STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM SEMEAN MEAN MEDIAN MODE SUM SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM
  /ORDER= ANALYSIS.

[10] In order to execute this command, open a syntax window, copy and paste the syntax into the window, and then click the Run menu. Alternatively, click Analyze → Descriptive Statistics → Descriptives and provide a variable of interest.
[11] Click Analyze → Descriptive Statistics → Frequencies and then specify statistics using the Statistics option.

Statistics
normal
N                         Valid      500.000
                          Missing       .000
Mean                                   -.095
Std. Error of Mean                      .045
Median                                 -.120
Mode                                  -2.837(a)
Std. Deviation                         1.003
Variance                               1.007
Skewness                               -.020
Std. Error of Skewness                  .109
Kurtosis                               -.399
Std. Error of Kurtosis                  .218
Range                                  5.349
Minimum                               -2.837
Maximum                                2.512
Sum                                  -47.536
Percentiles               25           -.807
                          50           -.120
                          75            .613
a. Multiple modes exist. The smallest value is shown

The variable has a mean of -.10 and approximately unit variance (1.007). The median of -.120 is very close to the mean. The kurtosis-3 is -.399 and the skewness is -.020.
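Note that the SPSS kurtosis-3 of -.399 is not simply the Stata kurtosis of 2.593181 minus 3, which would be -.407. The figures are consistent with SPSS applying the usual small-sample (bias-adjusted) formulas while Stata reports moment-based values; this is an inference from the arithmetic, not something stated in the outputs. The Python sketch below, with hypothetical function names, converts the Stata values from Section 5 to the adjusted form under that assumption.

```python
import math


def adjusted_skewness(n, g1):
    """Bias-adjusted sample skewness from moment-based skewness g1."""
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def adjusted_excess_kurtosis(n, b2):
    """Bias-adjusted excess kurtosis from moment-based kurtosis b2 (normal = 3)."""
    g2 = b2 - 3  # moment-based excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


# Moment-based values for the variable normal, copied from the Stata output.
n, stata_skewness, stata_kurtosis = 500, -0.0203109, 2.593181

print(round(adjusted_skewness(n, stata_skewness), 3))
print(round(adjusted_excess_kurtosis(n, stata_kurtosis), 3))
```

The gap between the two conventions shrinks as N grows, so with 500 observations the difference only shows up in the third decimal place.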
6.1.1 Graphical Methods

Like the /HISTOGRAM subcommand of FREQUENCIES, the GRAPH command draws a histogram of the variable normal (left plot in Figure 22).[12]

GRAPH /HISTOGRAM=normal.

The IGRAPH command can produce a similar histogram (right plot in Figure 22), but its syntax is considerably more verbose.[13] Both histograms report a mean of -.1 and a standard deviation of 1 in the top right corner and suggest that the variable is normally distributed.

IGRAPH /VIEWNAME='Histogram'
  /X1 = VAR(normal) TYPE = SCALE
  /Y = $count
  /COORDINATE = VERTICAL
  /X1LENGTH=3.0 /YLENGTH=3.0 /X2LENGTH=3.0
  /CHARTLOOK='NONE'
  /Histogram SHAPE = HISTOGRAM CURVE = OFF X1INTERVAL AUTO X1START = 0.

Figure 22. Histogram of a Normally Distributed Variable

[12] Click Graphs → Legacy Dialogs → Histogram.
[13] Click Graphs → Legacy Dialogs → Interactive → Histogram.

The EXAMINE command can produce descriptive statistics as well as a stem-and-leaf plot and a box plot (Figures 23 and 24).[14] The /PLOT subcommand with STEMLEAF and BOXPLOT draws two plots that closely resemble the histogram in Figure 22.

EXAMINE VARIABLES=normal
  /PLOT BOXPLOT STEMLEAF
  /COMPARE GROUP
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

Figure 23. Stem-and-Leaf Plot of a Normally Distributed Variable

normal Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     2.00       -2 .  &
    13.00       -2 .  00111&
    27.00       -1 .  555566678899
    56.00       -1 .  000111111222222333333344444
    64.00       -0 .  555555555666777778888888999999
   116.00       -0 .  000000000011111111111111222222222222233333333334444444444
    80.00        0 .  000000011111111122222222233333333444444
    68.00        0 .  555555556666677777778888889999999
    46.00        1 .  000001111112222334444
    23.00        1 .  55566778899
     4.00        2 .  4&
     1.00        2 .  &

 Stem width:      1.00
 Each leaf:       2 case(s)

 & denotes fractional leaves.

Figure 24. Box Plot of a Normally Distributed Variable

[14] Click Analyze → Descriptive Statistics → Explore, and then include the variable you want to examine.

Both extremes (i.e., the minimum and maximum) and the 25th, 50th, and 75th percentiles are symmetrically arranged in the box plot.

EXAMINE also produces a histogram, a normal Q-Q plot, and a detrended normal Q-Q plot using the HISTOGRAM and NPPLOT options (Figure 25).[15] NPPLOT conducts normality tests and draws the two Q-Q plots.

EXAMINE VARIABLES=normal
  /PLOT HISTOGRAM NPPLOT
  /COMPARE GROUP
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

Figure 25. Q-Q and Detrended Q-Q Plots of a Normally Distributed Variable

[15] In the Explore dialog box, choose the Plots option and then check the Normality plots with tests option.

The PPLOT command produces P-P and Q-Q plots as well.[16] The /TYPE subcommand chooses either a P-P or a Q-Q plot, and /DIST specifies a probability distribution (e.g., the standard normal distribution). The following PPLOT command draws normal P-P and detrended normal P-P plots (Figure 26); the output of other descriptive statistics is skipped here.

PPLOT
  /VARIABLES=normal
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=P-P
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=NORMAL.

Figure 26. P-P and Detrended P-P Plots of a Normally Distributed Variable

The following PPLOT command draws normal Q-Q and detrended normal Q-Q plots of the variable (see Figure 25).

PPLOT
  /VARIABLES=normal
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=Q-Q
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=NORMAL.

Neither the P-P nor the Q-Q plots show a significant deviation from the fitted line. As in Stata, the normal Q-Q plot and detrended Q-Q plot have observed quantiles on the X axis and normal quantiles on the Y axis.
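To make the /FRACTION=BLOM setting more concrete, the sketch below computes the points of a normal Q-Q plot in Python using Blom's proportion estimate, (r - 3/8)/(n + 1/4), for the observation of rank r. It is an illustration rather than a reproduction of PPLOT: the function names and the five-value sample are hypothetical, the normal distribution is fitted with the sample mean and standard deviation (an assumption about what /DIST=NORMAL does when no parameters are given), and ties are not averaged as /TIES=MEAN would do.

```python
from statistics import NormalDist, mean, stdev


def blom_fraction(rank, n):
    """Blom's estimate of the cumulative proportion for a given rank."""
    return (rank - 0.375) / (n + 0.25)


def normal_qq_points(values):
    """Return (observed, expected normal) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Fit the reference normal distribution from the sample itself.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [
        (observed, fitted.inv_cdf(blom_fraction(rank, n)))
        for rank, observed in enumerate(ordered, start=1)
    ]


# Hypothetical symmetric sample, used only to show the shape of the output.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
for observed, expected in normal_qq_points(sample):
    print(f"{observed:6.2f}  {expected:6.3f}")
```

Each pair puts the observed value first and the expected normal value second, matching the axis arrangement described above. Points from a normally distributed variable should fall close to a straight line, and a detrended plot shows the difference between the two coordinates instead.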
6.1.2 Numerical Methods

EXAMINE has the /PLOT NPPLOT subcommand to test normality of a variable. The following command produces descriptive statistics (/STATISTICS DESCRIPTIVES), lists outliers (EXTREME), draws a normal Q-Q plot (/PLOT NPPLOT), and performs the Kolmogorov-Smirnov and Shapiro-Wilk tests.

EXAMINE VARIABLES=normal
  /PLOT NPPLOT
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

[16] In SPSS 16.0, you may not see P-P and Q-Q under the Graphs menu, which were available in previous versions.

Case Processing Summary
                               Cases
               Valid           Missing           Total
             N    Percent    N    Percent    N    Percent
normal      500   100.0%     0      .0%     500   100.0%

Descriptives
                                                     Statistic   Std. Error
normal   Mean                                          -.0951      .04487
         95% Confidence Interval    Lower Bound        -.1832
         for Mean                   Upper Bound        -.0069
         5% Trimmed Mean                               -.0933
         Median                                        -.1196
         Variance                                       1.007
         Std. Deviation                               1.00330
         Minimum                                        -2.84
         Maximum                                         2.51
         Range                                           5.35
         Interquartile Range                             1.42
         Skewness                                       -.020       .109
         Kurtosis                                       -.399       .218

Extreme Values
                          Case Number    Value
normal   Highest    1         332         2.51
                    2         139         2.42
                    3         325         2.42
                    4         340         2.21
                    5         119         2.15
         Lowest     1          29        -2.84
                    2         204        -2.59
                    3          73        -2.48
                    4         391        -2.39
                    5         393        -2.24

Since N is less than 2,000, we read the Shapiro-Wilk statistic and do not reject the null hypothesis of normality (p = .168). SPSS reports the same Kolmogorov-Smirnov statistic of .027 as SAS, but it provides an adjusted p-value of .200, a bit larger than the .150 that SAS reports.

Tests of Normality
           Kolmogorov-Smirnov(a)            Shapiro-Wilk
           Statistic    df     Sig.      Statistic    df     Sig.
normal       .027      500    .200(*)      .996      500    .168
a. Lilliefors Significance Correction
*. This is a lower bound of the true significance.

6.2 A Non-normally Distributed Variable

Let us consider per capita gross national income, which is not normally distributed.
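As background for the Kolmogorov-Smirnov figures in the Tests of Normality tables of this section, the statistic is the largest vertical distance between the empirical distribution function and a normal distribution fitted with the sample mean and standard deviation. The Python sketch below computes only that distance; it does not reproduce the Lilliefors-corrected significance, which requires special tables or simulation. The function name and both samples are illustrative assumptions (the skewed sample borrows a few gnip values from the Stata output in Section 5 and is not the full data set).

```python
from statistics import NormalDist, mean, stdev


def ks_normal_statistic(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    distance = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so the gap
        # must be checked on both sides of the jump.
        distance = max(distance, i / n - cdf, cdf - (i - 1) / n)
    return distance


# Illustrative samples only: a symmetric one and a heavily right-skewed one.
symmetric = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
skewed = [0.29, 0.31, 0.33, 0.45, 0.955, 2.765, 8.68, 32.6, 65.63]

d_symmetric = ks_normal_statistic(symmetric)
d_skewed = ks_normal_statistic(skewed)
print(f"symmetric sample: D = {d_symmetric:.3f}")
print(f"skewed sample:    D = {d_skewed:.3f}")
```

The skewed sample yields a much larger distance than the symmetric one, which mirrors the contrast between the statistics of .027 for normal and .284 for gnip reported by SPSS.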
6.2.1 Graphical Methods

The following EXAMINE command produces the histogram, stem-and-leaf plot, and box plot of a non-normally distributed variable, gnip. The stem-and-leaf plot is skipped here.

EXAMINE VARIABLES=gnip
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

Figure 27 illustrates that the distribution is heavily skewed to the right and that there are many outliers beyond the extreme line in the box plot (right plot). The median and the 25th percentile are close to each other.

Figure 27. Histogram and Box Plot of a Non-normally Distributed Variable

Figure 28 presents the P-P and detrended P-P plots, where the data points deviate significantly from the straight fitted line.

PPLOT
  /VARIABLES=gnip
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=P-P
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=NORMAL.

Figure 28. P-P and Detrended P-P Plots of a Non-normally Distributed Variable

The Q-Q and detrended Q-Q plots also show a significant deviation from the fitted line (Figure 29).

PPLOT
  /VARIABLES=gnip
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=Q-Q
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=NORMAL.

Figure 29. Q-Q and Detrended Q-Q Plots of a Non-normally Distributed Variable

6.2.2 Numerical Methods

The descriptive statistics of gnip indicate that the variable is not normally distributed. There is a large gap between the mean of 8.9646 and the median of 2.7650. The skewness and kurtosis-3 are 2.049 and 3.608, respectively. The variable appears severely skewed to the right with a higher peak and flat tails. The following tables are the output of the above EXAMINE command.

Case Processing Summary
                               Cases
               Valid           Missing           Total
             N    Percent    N    Percent    N    Percent
gnip        164   100.0%     0      .0%     164   100.0%

Descriptives
                                                     Statistic   Std. Error
gnip     Mean                                          8.9646     1.05939
         95% Confidence Interval    Lower Bound        6.8727
         for Mean                   Upper Bound       11.0565
         5% Trimmed Mean                               7.1877
         Median                                        2.7650
         Variance                                     184.058
         Std. Deviation                              13.56679
         Minimum                                          .29
         Maximum                                        65.63
         Range                                          65.34
         Interquartile Range                             7.92
         Skewness                                       2.049       .190
         Kurtosis                                       3.608       .377

Extreme Values
                          Case Number    Value
gnip     Highest    1           1        65.63
                    2           2        59.59
                    3           3        54.93
                    4           4        47.39
                    5           5        46.32
         Lowest     1         164          .29
                    2         163          .29
                    3         162          .31
                    4         161          .33
                    5         160          .34(a)
a. Only a partial list of cases with the value .34 are shown in the table of lower extremes.

Tests of Normality
           Kolmogorov-Smirnov(a)            Shapiro-Wilk
           Statistic    df     Sig.      Statistic    df     Sig.
gnip         .284      164    .000         .663      164    .000
a. Lilliefors Significance Correction

The Shapiro-Wilk test rejects the null hypothesis of normality at the .05 level. The Jarque-Bera test also rejects the null hypothesis with a large statistic of 204. Its computation is skipped (see section 4.2.3). Based on the consistent results from both graphical and numerical methods, we can conclude that the variable gnip is not normally distributed.

7. Conclusion

Univariate analysis is the first step of data analysis once a data set is ready. Various descriptive statistics provide valuable basic information about variables, which is used to determine the appropriate analysis methods. Normality is commonly assumed in many statistical and economic methods, although in practice it is often assumed for convenience without any empirical test. Violation of this assumption may result in unreliable inferences and misleading interpretations.

There are graphical and numerical methods for conducting univariate analysis and normality tests (Table 1).
Graphical methods produce various plots, such as a stem-and-leaf plot, histogram, and P-P plot, that are intuitive and easy to interpret. Some are descriptive and others are theory-driven.

Numerical methods compute a variety of measures of central tendency and dispersion, such as the mean, median, quantiles, variance, and standard deviation. Skewness and kurtosis provide clues to the normality of a variable. If skewness and kurtosis-3 are close to zero, the variable may be normally distributed. Keep in mind that SAS and SPSS report kurtosis-3, while Stata returns kurtosis itself. If the skewness of a variable is larger than 0, the variable is skewed to the right with many observations on the left of the distribution; a negative skewness indicates many observations on the right. If kurtosis-3 is greater than 0 (or kurtosis is greater than 3), the distribution has a high peak and flat tails (third plot in Figure 8). If kurtosis is smaller than 3, the variable has a low peak and thick tails (first plot in Figure 9).

In addition to these descriptive statistics, there are formal ways to perform normality tests. The Shapiro-Wilk and Shapiro-Francia tests are appropriate when N is less than 2,000 and 5,000, respectively. The Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests are recommended when N is large. The Jarque-Bera test, although not supported by most statistical software packages, is a consistent method of normality testing.

The SAS UNIVARIATE and CAPABILITY procedures provide a variety of descriptive statistics and normality testing methods, including the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests (Table 5). These procedures produce a stem-and-leaf plot, box plot, histogram, P-P plot, and Q-Q plot as well. Stata has various commands for univariate analysis and graphics. In particular, Stata supports the Shapiro-Francia test, a modification of the Shapiro-Wilk test, and the skewness-kurtosis test.
But there is no command to conduct the Kolmogorov-Smirnov test for normality in Stata. SPSS can produce detrended P-P and Q-Q plots, and perform the Shapiro-Wilk and Kolmogorov-Smirnov tests with the Lilliefors significance correction.

Appendix A: Data Sets

This document uses the following three variables.

1. Unemployment Rate of Illinois, Indiana, and Ohio in 2005

This unemployment rate is provided by the Bureau of Labor Statistics. Actual data were downloaded from http://www.stats.indiana.edu/, Indiana Business Research Center of the Kelley School of Business, Indiana University.

. tabstat rate, stat(mean sd p25 median p75 skewness kurtosis) by(state)

Summary for variables: rate
     by categories of: state

 state |      mean        sd       p25       p50       p75  skewness  kurtosis
-------+----------------------------------------------------------------------
    IL |  5.421569  .9242206       4.7      5.35         6  .6570033  3.946029
    IN |  5.641304  1.038929       4.9       5.5      6.35  .3416314  2.785585
    OH |    6.3625  1.458098       5.5       6.1      6.95  1.665322  8.043097
-------+----------------------------------------------------------------------
 Total |  5.786879  1.214066         5      5.65       6.4   1.44809  8.383285
------------------------------------------------------------------------------

2. A Randomly Drawn Variable

This variable includes 500 observations that were randomly drawn from the standard normal distribution with a seed of 1,234,567. The SAS RANNOR() function was used as the random number generator.

%LET n=500;
%LET dataset=n500;

DATA masil.&dataset;
   seed=1234567;
   DO i=1 TO &n;
      normal=RANNOR(seed);
      OUTPUT;
   END;
RUN;

.
tabstat normal, stat(mean sd p25 median p75 skewness kurtosis)

    variable |      mean        sd       p25       p50       p75  skewness  kurtosis
-------------+----------------------------------------------------------------------
      normal | -.0950725  1.003302  -.805191 -.1195922  .6125385 -.0203109  2.593181
------------------------------------------------------------------------------------

3. Per Capita Gross National Income in 2005

This data set includes per capita gross national incomes of 164 countries in the world that are provided by the World Bank (http://web.worldbank.org/).

. tabstat gnip, stat(mean sd p25 median p75 skewness kurtosis)

    variable |      mean        sd       p25       p50       p75  skewness  kurtosis
-------------+----------------------------------------------------------------------
        gnip |  8.964573  13.56679      .955     2.765      8.68  2.030682  6.462734
------------------------------------------------------------------------------------

References

Bera, Anil K., and Carlos M. Jarque. 1981. "Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals: Monte Carlo Evidence." Economics Letters, 7(4): 313-318.

D'Agostino, Ralph B., Albert Belanger, and Ralph B. D'Agostino, Jr. 1990. "A Suggestion for Using Powerful and Informative Tests of Normality." American Statistician, 44(4): 316-321.

Jarque, Carlos M., and Anil K. Bera. 1980. "Efficient Tests for Normality, Homoscedasticity and Serial Independence of Regression Residuals." Economics Letters, 6(3): 255-259.

Jarque, Carlos M., and Anil K. Bera. 1987. "A Test for Normality of Observations and Regression Residuals." International Statistical Review, 55(2): 163-172.

Mitchell, Michael N. 2004. A Visual Guide to Stata Graphics. College Station, TX: Stata Press.

Royston, J. P. 1982. "An Extension of Shapiro and Wilk's W Test for Normality to Large Samples." Applied Statistics, 31(2): 115-124.

Royston, J. P. 1983. "A Simple Method for Evaluating the Shapiro-Francia W' Test of Non-Normality." Statistician, 32(3) (September): 297-300.

Royston, J. P. 1991. "Comment on sg3.4 and an Improved D'Agostino test." Stata Technical Bulletin, 3: 13-24.

Royston, P. J. 1992. "Approximating the Shapiro-Wilk W-Test for Non-normality." Statistics and Computing, 2: 117-119.

SAS Institute. 1995. SAS/QC Software: Usage and Reference I and II. Cary, NC: SAS Institute.

SAS Institute. 2004. SAS 9.1.3 Procedures Guide Volume 4. Cary, NC: SAS Institute.

Shapiro, S. S., and M. B. Wilk. 1965. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, 52(3/4) (December): 591-611.

Shapiro, S. S., and R. S. Francia. 1972. "An Approximate Analysis of Variance Test for Normality." Journal of the American Statistical Association, 67(337) (March): 215-216.

SPSS Inc. 2007. SPSS 16.0 Command Syntax Reference. Chicago, IL: SPSS Inc.

Stata Press. 2007. Stata Base Reference Manual Release 10. College Station, TX: Stata Press.

Stata Press. 2007. Stata Graphics Reference Manual Release 10. College Station, TX: Stata Press.

Acknowledgements

I am grateful to Jeremy Albright and Kevin Wilhite at the UITS Center for Statistical and Mathematical Computing for comments and suggestions.

Revision History

• 2002 First draft.
• 2006.11 Revision with new data.
• 2008.11 Revision with new versions of software packages.