CHAPTER 9
Linear Regression and Correlation

9.1 LINEAR RELATIONSHIPS
9.2 LEAST SQUARES PREDICTION EQUATION
9.3 THE LINEAR REGRESSION MODEL
9.4 MEASURING LINEAR ASSOCIATION: THE CORRELATION
9.5 INFERENCES FOR THE SLOPE AND CORRELATION
9.6 MODEL ASSUMPTIONS AND VIOLATIONS
9.7 CHAPTER SUMMARY

Chapter 8 presented methods for analyzing association between categorical response and explanatory variables. This chapter presents methods for analyzing quantitative response and explanatory variables. Table 9.1 shows data from the Statistical Abstract of the United States for the 50 states and the District of Columbia (D.C.) on the following:

• Murder rate: The number of murders per 100,000 people in the population
• Violent crime rate: The number of murders, forcible rapes, robberies, and aggravated assaults per 100,000 people in the population
• Percentage of the population with income below the poverty level
• Percentage of families headed by a single parent

For these quantitative variables, violent crime rate and murder rate are natural response variables. We'll treat the poverty rate and percentage of single-parent families as explanatory variables for these responses as we study methods for analyzing relationships between quantitative variables in this chapter and in some exercises. The text Web site contains two data sets on these and other variables that we will also analyze in exercises in this and later chapters.

We analyze three different, but related, aspects of such relationships:

1. We investigate whether there is an association between the variables by testing the hypothesis of statistical independence.
2. We study the strength of their association using the correlation measure of association.
3. We estimate a regression equation that predicts the value of the response variable from the value of the explanatory variable. For instance, such an equation predicts a state's murder rate using the percentage of its population living below the poverty level.

The analyses are collectively called a regression analysis. Section 9.1 shows how to use a straight line for the regression equation, and Section 9.2 shows how to use data to estimate the line. Section 9.3 introduces the linear regression model, which takes into account variability of the data about the regression line. Section 9.4 uses the correlation and its square to describe the strength of association. Section 9.5 presents statistical inference for a regression analysis. The final section takes a closer look at assumptions and potential pitfalls in using regression.

TABLE 9.1: Statewide Data Used to Illustrate Regression Analyses

           Violent  Murder  Poverty  Single            Violent  Murder  Poverty  Single
    State  Crime    Rate    Rate     Parent     State  Crime    Rate    Rate     Parent
    AK       761     9.0      9.1     14.3      MT       178     3.0     14.9     10.8
    AL       780    11.6     17.4     11.5      NC       679    11.3     14.4     11.1
    AR       593    10.2     20.0     10.7      ND        82     1.7     11.2      8.4
    AZ       715     8.6     15.4     12.1      NE       339     3.9     10.3      9.4
    CA      1078    13.1     18.2     12.5      NH       138     2.0      9.9      9.2
    CO       567     5.8      9.9     12.1      NJ       627     5.3     10.9      9.6
    CT       456     6.3      8.5     10.1      NM       930     8.0     17.4     13.8
    DE       686     5.0     10.2     11.4      NV       875    10.4      9.8     12.4
    FL      1206     8.9     17.8     10.6      NY      1074    13.3     16.4     12.7
    GA       723    11.4     13.5     13.0      OH       504     6.0     13.0     11.1
    HI       261     3.8      8.0      9.1      OK       635     8.4     19.9     11.1
    IA       326     2.3     10.3      9.0      OR       503     4.6     11.8     11.3
    ID       282     2.9     13.1      9.5      PA       418     6.8     13.2      9.6
    IL       960    11.4     13.6     11.5      RI       402     3.9     11.2     10.8
    IN       489     7.5     12.2     10.8      SC      1023    10.3     18.7     12.3
    KS       496     6.4     13.1      9.9      SD       208     3.4     14.2      9.1
    KY       463     6.6     20.4     10.6      TN       766    10.2     19.6     11.2
    LA      1062    20.3     26.4     14.9      TX       762    11.9     17.4     11.8
    MA       805     3.9     10.7     10.9      UT       301     3.1     10.7     10.0
    MD       998    12.7      9.7     12.0      VA       372     8.3      9.7     10.3
    ME       126     1.6     10.7     10.6      VT       114     3.6     10.0     11.0
    MI       792     9.8     15.4     13.0      WA       515     5.2     12.1     11.7
    MN       327     3.4     11.6      9.9      WI       264     4.4     12.6     10.4
    MO       744    11.3     16.1     10.9      WV       208     6.9     22.2      9.4
    MS       434    13.5     24.7     14.7      WY       286     3.4     13.3     10.8
                                                DC      2922    78.5     26.4     22.1

9.1 LINEAR RELATIONSHIPS

Notation for Response and Explanatory Variables

Let y denote the response variable and let x denote the explanatory variable. We shall analyze how values of y tend to change from one subset of the population to another, as defined by values of x. For categorical variables, we did this by comparing the conditional distributions of y at the various categories of x, in a contingency table. For quantitative variables, a mathematical formula describes how the conditional distribution of y varies according to the value of x. Such a formula might describe how y = statewide murder rate varies according to the level of x = percent below the poverty level. Does the murder rate tend to be higher for states that have higher poverty levels?

Linear Functions

Any particular formula might provide a good description or a poor one of how y relates to x. This chapter introduces the simplest type of formula—a straight line. For it, y is said to be a linear function of x.

Linear Function
The formula y = α + βx expresses observations on y as a linear function of observations on x. The formula has a straight-line graph with slope β (beta) and y-intercept α (alpha).

EXAMPLE 9.1 Example of a Linear Function

The formula y = 3 + 2x is a linear function. It has the form y = α + βx with α = 3 and β = 2. The y-intercept equals 3 and the slope equals 2. Each real number x, when substituted into the formula y = 3 + 2x, yields a distinct value for y. For instance, x = 0 has y = 3 + 2(0) = 3, and x = 1 has y = 3 + 2(1) = 5. Figure 9.1 plots this function. The horizontal axis, the x-axis, lists the possible values of x. The vertical axis, the y-axis, lists the possible values of y. The axes intersect at the point where x = 0 and y = 0, called the origin. ■

FIGURE 9.1: Graph of the Straight Line y = 3 + 2x. The y-intercept is 3 and the slope is 2.

Interpreting the y-Intercept and Slope

At x = 0, the equation y = α + βx simplifies to y = α + β(0) = α. Thus, the constant α in this equation is the value of y when x = 0. Now, points on the y-axis have x = 0, so the line has height α at the point of its intersection with the y-axis. Because of this, α is called the y-intercept. The straight line y = 3 + 2x intersects the y-axis at α = 3, as Figure 9.1 shows.

The slope β equals the change in y for a one-unit increase in x. That is, for two x-values that differ by 1.0 (such as x = 0 and x = 1), the y-values differ by β. For the line y = 3 + 2x, y = 3 at x = 0 and y = 5 at x = 1. These y-values differ by β = 5 − 3 = 2. Two x-values that are 10 units apart differ by 10β in their y-values. For example, when x = 0, y = 3, and when x = 10, y = 3 + 2(10) = 23, and 23 − 3 = 20 = 10β.
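For readers who like to experiment, here is a minimal Python sketch (ours, not part of the original text) that evaluates the linear function y = 3 + 2x from Example 9.1 and confirms these slope interpretations numerically.

    # Evaluate the linear function y = 3 + 2x from Example 9.1.
    def line(x, alpha=3.0, beta=2.0):
        # Value of the linear function y = alpha + beta*x.
        return alpha + beta * x

    for x in range(0, 4):
        # Each one-unit increase in x changes y by the slope, beta = 2.
        print(f"x = {x}: y = {line(x):.0f}")

    # Two x-values 10 units apart differ by 10*beta = 20 in their y-values.
    print(line(10) - line(0))   # prints 20.0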
Figure 9.2 portrays the interpretation of the y-intercept and slope.

To draw the straight line, we find any two separate pairs of (x, y) values on the graph and then draw the line through the points. To illustrate, let's use the points just discussed: (x = 0, y = 3) and (x = 1, y = 5). The point on the graph with (x = 0, y = 3) is three units up the y-axis. To find the point with (x = 1, y = 5), we start at the origin (x = 0, y = 0) and move one unit to the right on the x-axis and five units upward parallel to the y-axis (see Figure 9.1). After plotting the two points, drawing the straight line through the two points graphs the function y = 3 + 2x.

FIGURE 9.2: Graph of the Straight Line y = α + βx. The y-intercept is α and the slope is β.

EXAMPLE 9.2 Straight Lines for Predicting Violent Crime

For the 50 states, consider the variables y = violent crime rate and x = poverty rate. We'll see that the straight line y = 210 + 25x approximates their relationship. The y-intercept equals 210. This represents the violent crime rate at poverty rate x = 0 (unfortunately, there are no such states). The slope equals 25. When the percentage with income below the poverty level increases by 1, the violent crime rate increases by about 25 crimes a year per 100,000 population.

By contrast, if instead x = percentage of the population living in urban areas, the straight line approximating the relationship is y = 26 + 8x. The slope of 8 is smaller than the slope of 25 when poverty rate is the predictor. An increase of 1 in the percent below the poverty level corresponds to a greater change in the violent crime rate than an increase of 1 in the percent urban. Figure 9.3 shows the lines relating the violent crime rate to poverty rate and to urban residence. Generally, the larger the absolute value of β, the steeper the line. ■

If β is positive, then y increases as x increases—the straight line goes upward, like the two lines just mentioned. Then large values of y occur with large values of x, and small values of y occur with small values of x. When a relationship between two variables follows a straight line with β > 0, the relationship is said to be positive.

If β is negative, then y decreases as x increases. The straight line then goes downward, and the relationship is said to be negative. For instance, the equation y = 1756 − 16x, which has slope −16, approximates the relationship between y = violent crime rate and x = percentage of residents who are high school graduates. For each increase of 1.0 in the percent who are high school graduates, the violent crime rate decreases by about 16. Figure 9.3 also shows this line.

When β = 0, the graph is a horizontal line. The value of y is constant and does not vary as x varies. If two variables are independent, with the value of y not depending on the value of x, a straight line with β = 0 represents their relationship. The line y = 800 shown in Figure 9.3 is an example of a line with β = 0.

FIGURE 9.3: Graphs of Lines Showing Positive Relationships (β > 0), a Negative Relationship (β < 0), and Independence (β = 0)
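The sign and size of the slope can also be checked numerically. The following Python sketch (illustrative only; the three equations are the approximations quoted in Example 9.2 and the text above) evaluates each line at two x-values one unit apart and shows that the predictions change by the slope: +25, +8, and −16.

    # Three prediction lines for y = violent crime rate, from the text.
    def poverty_line(x):  return 210 + 25 * x    # x = poverty rate (positive slope)
    def urban_line(x):    return 26 + 8 * x      # x = percent urban (smaller positive slope)
    def hs_line(x):       return 1756 - 16 * x   # x = percent high school grads (negative slope)

    for f in (poverty_line, urban_line, hs_line):
        # The difference between predictions at x = 11 and x = 10 equals the slope.
        print(f.__name__, f(11) - f(10))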
Models Are Simple Approximations for Reality

As Section 7.3 explained, a model is a simple approximation for the relationship between variables in the population. The linear function is the simplest mathematical function. It provides the simplest model for the relationship between two quantitative variables. For a given value of x, the model y = α + βx predicts a value for y. The better these predictions tend to be, the better the model.

As we mentioned in Section 3.4 and will explain further at the beginning of Chapter 10, association does not imply causation. For example, consider the interpretation of the slope in Example 9.2 of "When the percentage with income below the poverty level increases by 1, the violent crime rate increases by about 25 crimes a year per 100,000 population." This does not mean that if we had the ability to go to a state and increase the percentage of people living below the poverty level from 10% to 11%, we could expect the number of crimes to increase in the next year by 25 crimes per 100,000 people. It merely means that based on current data, if one state had a 10% poverty rate and another had an 11% poverty rate, we'd predict that the state with the higher poverty rate would have 25 more crimes per year per 100,000 people.

But, as we'll see in Section 9.3, a sensible model is actually a bit more complex than the one we've presented so far.

9.2 LEAST SQUARES PREDICTION EQUATION

Using sample data, we can estimate the linear model. The process treats α and β in the equation y = α + βx as unknown parameters and estimates them. The estimated linear function then provides predicted y-values at fixed values for x.

A Scatterplot Portrays the Data

The first step of model fitting is to plot the data, to reveal whether a model with a straight-line trend makes sense. The data values (x, y) for any one subject form a point relative to the x- and y-axes. A plot of the n observations as n points is called a scatterplot.

EXAMPLE 9.3 Scatterplot for Statewide Murder Rate and Poverty

For Table 9.1, let x = poverty rate and y = murder rate. To check whether a straight line approximates the relationship well, we first construct a scatterplot for the 51 observations. Figure 9.4 shows this plot.

FIGURE 9.4: Scatterplot for y = Murder Rate and x = Percentage of Residents below the Poverty Level, for 50 States and D.C. Box plots are shown for murder rate to the left of the scatterplot and for poverty rate below the scatterplot.

Each point in Figure 9.4 portrays the values of poverty rate and murder rate for a given state. For Maryland, for instance, the poverty rate is x = 9.7, and the murder rate is y = 12.7. Its point (x, y) = (9.7, 12.7) has coordinate 9.7 for the x-axis and 12.7 for the y-axis. This point is labeled MD in Figure 9.4.

Figure 9.4 indicates that the trend of points seems to be approximated well by a straight line. Notice, though, that one point is far removed from the rest. This is the point for the District of Columbia (D.C.). For it, the murder rate was much higher than for any state. This point lies far from the overall trend. Figure 9.4 also shows box plots for these variables. They reveal that D.C. is an extreme outlier on murder rate. In fact, it falls 6.5 standard deviations above the mean. We shall see that outliers can have a serious impact on a regression analysis. ■
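To construct a scatterplot like Figure 9.4 yourself, something like the following Python sketch works. The file name "crime.csv" and its column names are hypothetical stand-ins for however you have stored the Table 9.1 data; they are not from the text.

    # Scatterplot of murder rate versus poverty rate (cf. Figure 9.4).
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("crime.csv")      # hypothetical file holding the Table 9.1 data
    plt.scatter(data["poverty"], data["murder"])
    plt.xlabel("Poverty Rate")
    plt.ylabel("Murder Rate")
    plt.show()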
The scatterplot provides a visual check of whether a relationship is approximately linear. When the relationship seems highly nonlinear, it is not sensible to use a straight-line model. Figure 9.5 illustrates such a case. This figure shows a negative relationship over part of the range of x-values, and a positive relationship over the rest. These cancel each other out using a straight-line model. For such data, a different model, presented in Section 14.5, is appropriate.

FIGURE 9.5: A Nonlinear Relationship, for Which It Is Inappropriate to Use a Straight Line Model

Prediction Equation

When the scatterplot suggests that the model y = α + βx is realistic, we use the data to estimate this line. The notation ŷ = a + bx represents a sample equation that estimates the linear model. In the sample equation, the y-intercept (a) estimates the y-intercept α of the model and the slope (b) estimates the slope β. Substituting a particular x-value into a + bx provides a value, denoted by ŷ, that predicts y at that value of x.

The sample equation ŷ = a + bx is called the prediction equation, because it provides a prediction ŷ for the response variable at any value of x. The prediction equation is the best straight line, falling closest to the points in the scatterplot, in a sense discussed later in this section. The formulas for a and b in the prediction equation ŷ = a + bx are

    b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,    a = ȳ − bx̄.

If an observation has both x- and y-values above their means, or both x- and y-values below their means, then (x − x̄)(y − ȳ) is positive. The slope estimate b tends to be positive when most observations are like this, that is, when points with large x-values also tend to have large y-values and points with small x-values tend to have small y-values. Figure 9.4 is an example of such a case.

We shall not dwell on these formulas or even illustrate how to use them, as they are messy for hand calculation. Anyone who does any serious regression modeling uses a computer or a calculator that has these formulas programmed. To use statistical software, you supply the data file and usually select the regression method from a menu. The appendix at the end of the text provides details.

EXAMPLE 9.4 Predicting Murder Rate from Poverty Rate

For the 51 observations on y = murder rate and x = poverty rate in Table 9.1, SPSS software provides the results shown in Table 9.2. Murder rate has ȳ = 8.7 and s = 10.7, indicating that it is probably highly skewed to the right. The box plot for murder rate in Figure 9.4 shows that the extreme outlying observation for D.C. contributes to this.

TABLE 9.2: Part of SPSS Printout for Fitting Linear Regression Model to Observations for 50 States and D.C. on x = Percent in Poverty and y = Murder Rate

    Variable     Mean      Std Deviation
    MURDER        8.727       10.718
    POVERTY      14.259        4.584

                   B          Std. Error
    (Constant)   -10.1364      4.1206
    POVERTY        1.3230      0.2754

The estimates of α and β are listed under the heading "B," the symbol SPSS uses to denote an estimated regression coefficient. The estimated y-intercept is a = −10.14, listed opposite "(Constant)." The estimate of the slope is b = 1.32, listed opposite the variable name of which it is the coefficient in the prediction equation, "POVERTY." Therefore, the prediction equation is ŷ = a + bx = −10.14 + 1.32x.

The slope b = 1.32 is positive. So the larger the poverty rate, the larger is the predicted murder rate.
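The formulas are easy to apply directly. Here is a short Python sketch (ours, not the text's) that computes a and b from the formulas above, using only the first four states of Table 9.1 for brevity; with all 51 observations, the same formulas reproduce a = −10.14 and b = 1.32 from Table 9.2.

    # Least squares estimates: b = sum((x-xbar)(y-ybar)) / sum((x-xbar)^2), a = ybar - b*xbar.
    import numpy as np

    x = np.array([9.1, 17.4, 20.0, 15.4])   # poverty rates for AK, AL, AR, AZ
    y = np.array([9.0, 11.6, 10.2, 8.6])    # murder rates for AK, AL, AR, AZ

    xbar, ybar = x.mean(), y.mean()
    b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    a = ybar - b * xbar
    print(f"prediction equation: yhat = {a:.2f} + {b:.2f}x")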
The value 1.32 indicates that an increase of 1 in the percentage living below the poverty level corresponds to an increase of 1.32 in the predicted murder rate. Similarly, an increase of 10 in the poverty rate corresponds to a 10(1.32) = 13.2-unit increase in predicted murder rate.

If one state has a 12% poverty rate and another has a 22% poverty rate, for example, the predicted annual number of murders per 100,000 population is 13.2 higher in the second state than the first state. Since the mean murder rate is 8.7, it seems that poverty rate is an important predictor of murder rate. This differential of 13 murders per 100,000 population translates to 130 per million or 1300 per 10 million population. If the two states each had populations of 10 million, the one with the higher poverty rate would be predicted to have 1300 more murders per year. ■

Effect of Outliers on the Prediction Equation

Figure 9.6 plots the prediction equation from Example 9.4 over the scatterplot. The diagram shows that the observation for D.C. is a regression outlier—it falls quite far from the trend that the rest of the data follow. This observation seems to have a substantial effect. The line seems to be pulled up toward it and away from the center of the general trend of points.

FIGURE 9.6: Prediction Equations Relating Murder Rate and Percentage in Poverty, with and without D.C. Observation

Let's now refit the line using the observations for the 50 states but not the one for D.C. Table 9.3 shows that the prediction equation equals ŷ = −0.86 + 0.58x. Figure 9.6 also shows this line, which passes more directly through the 50 points. The slope is 0.58, compared to 1.32 when the observation for D.C. is included. The one outlying observation has the impact of more than doubling the slope!

TABLE 9.3: Part of Printout for Fitting Linear Model to 50 States (but not D.C.) on x = Percent in Poverty and y = Murder Rate

                  Sum of
                  Squares    df    Mean Square
    Regression    307.342     1      307.34
    Residual      470.406    48        9.80
    Total         777.749    49

                  Unstandardized
                  Coefficients B
    (Constant)        -.857
    POVERTY            .584

         MURDER     PREDICT    RESIDUAL
    1     9.0000     4.4599      4.5401
    2    11.6000     9.3091      2.2909
    3    10.2000    10.8281     -0.6281
    4     8.6000     8.1406      0.4594

An observation is called influential if removing it results in a large change in the prediction equation. Unless the sample size is large, an observation can have a strong influence on the slope if its x-value is low or high compared to the rest of the data and if it is a regression outlier.

In summary, the line for the data set including D.C. seems to distort the relationship for the 50 states. It seems wiser to use the equation based on data for the 50 states alone rather than to use a single equation both for the 50 states and D.C. This line for the 50 states better represents the overall trend. In reporting these results, we would note that the murder rate for D.C. falls outside this trend, being much larger than this equation predicts.
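The influence of a single regression outlier is easy to demonstrate by simulation. The following Python sketch uses synthetic data, not the actual state data: it fits a slope to a roughly linear cloud of points, then adds one extreme point far above the trend (in the spirit of the D.C. observation) and refits.

    # Synthetic demonstration: one influential observation can change the slope sharply.
    import numpy as np

    def ls_slope(x, y):
        # Least squares slope b = sum((x-xbar)(y-ybar)) / sum((x-xbar)^2).
        return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    rng = np.random.default_rng(0)
    x = np.linspace(8, 20, 20)
    y = 0.6 * x + rng.normal(0, 1, x.size)   # points scattered around a line with slope 0.6

    print(round(ls_slope(x, y), 2))          # close to 0.6

    x2 = np.append(x, 26.4)                  # one added point with a high x-value...
    y2 = np.append(y, 78.5)                  # ...and a y-value far above the trend
    print(round(ls_slope(x2, y2), 2))        # slope is pulled up substantially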
Prediction Errors Are Called Residuals

The prediction equation ŷ = −0.86 + 0.58x predicts murder rates using x = poverty rate. For the sample data, a comparison of the actual murder rates to the predicted values checks the goodness of the prediction equation. For example, Massachusetts had x = 10.7 and y = 3.9. The predicted murder rate (ŷ) at x = 10.7 is ŷ = −0.86 + 0.58(10.7) = 5.4. The prediction error is the difference between the actual y-value of 3.9 and the predicted value of 5.4, or y − ŷ = 3.9 − 5.4 = −1.5. The prediction equation overestimates the murder rate by 1.5. Similarly, for Louisiana, x = 26.4 and ŷ = −0.86 + 0.58(26.4) = 14.6. The actual murder rate is y = 20.3, so the prediction is too low. The prediction error is y − ŷ = 20.3 − 14.6 = 5.7. The prediction errors are called residuals.

Residual
For an observation, the difference between an observed value and the predicted value of the response variable, y − ŷ, is called the residual.

Table 9.3 shows the murder rates, the predicted values, and the residuals for the first four states in the data file. A positive residual results when the observed value y is larger than the predicted value ŷ, so y − ŷ > 0. A negative residual results when the observed value is smaller than the predicted value. The smaller the absolute value of the residual, the better is the prediction, since the predicted value is closer to the observed value.

FIGURE 9.7: Prediction Equation and Residuals. A residual is a vertical distance between a point and the prediction line.

In a scatterplot, the residual for an observation is the vertical distance between its point and the prediction line. Figure 9.7 illustrates this. For example, the observation for Louisiana is the point with (x, y) coordinates (26.4, 20.3). The prediction is represented by the point (26.4, 14.6) on the prediction line, obtained by substituting x = 26.4 into the prediction equation ŷ = −0.86 + 0.58x. The residual is the difference between the observed and predicted points, which is the vertical distance y − ŷ = 20.3 − 14.6 = 5.7.

Prediction Equation Has Least Squares Property

Each observation has a residual. If the prediction line falls close to the points in the scatterplot, the residuals are small. We summarize the size of the residuals by the sum of their squared values. This quantity, denoted by SSE, equals

    SSE = Σ(y − ŷ)².

In other words, the residual is computed for every observation in the sample, each residual is squared, and then SSE is the sum of these squares. The symbol SSE is an abbreviation for sum of squared errors. This terminology refers to the residual being a measure of prediction error from using ŷ to predict y. Some software (such as SPSS) calls SSE the residual sum of squares. It describes the variation of the data around the prediction line.

The better the prediction equation, the smaller the residuals tend to be and, hence, the smaller SSE tends to be. Any particular equation has corresponding residuals and a value of SSE. The prediction equation specified by the formulas for the estimates a and b of α and β has the smallest value of SSE out of all possible linear prediction equations.

Least Squares Estimates
The least squares estimates a and b are the values that provide the prediction equation ŷ = a + bx for which the residual sum of squares, SSE = Σ(y − ŷ)², is a minimum.

The prediction line ŷ = a + bx is called the least squares line, because it is the one with the smallest sum of squared residuals. If we square the residuals (such as those in Table 9.3) for the least squares line ŷ = −0.86 + 0.58x and then sum them, we get

    SSE = Σ(y − ŷ)² = (4.54)² + (2.29)² + ⋯ = 470.4.

This value is smaller than the value of SSE for any other straight-line predictor, such as ŷ = −0.88 + 0.60x. In this sense, the data fall closer to this line than to any other line.
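The least squares property can be verified numerically: compute SSE for the least squares line and for any perturbed line, and the least squares line always has the smaller value for the data it was fitted to. A minimal Python sketch with illustrative data:

    # Verify the least squares property: SSE is minimized by the least squares line.
    import numpy as np

    x = np.array([9.1, 17.4, 20.0, 15.4, 18.2])   # illustrative x-values
    y = np.array([9.0, 11.6, 10.2, 8.6, 13.1])    # illustrative y-values

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    def sse(a0, b0):
        # Sum of squared residuals for the line yhat = a0 + b0*x.
        return np.sum((y - (a0 + b0 * x)) ** 2)

    print(sse(a, b))          # the minimum
    print(sse(a + 0.5, b))    # any other line gives a larger SSE
    print(sse(a, b + 0.05))   # larger again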
Software for regression lists the value of SSE. Table 9.3 reports it in the "Sum of Squares" column, in the row labeled "Residual." In some software, such as SAS, this is labeled as "Error" in the sum of squares column.

Besides making the errors as small as possible in this summary sense, the least squares line

• Has some positive residuals and some negative residuals, but the sum (and mean) of the residuals equals 0
• Passes through the point (x̄, ȳ)

The first property tells us that the too-low predictions are balanced by the too-high predictions. Just as deviations of observations from their mean ȳ satisfy Σ(y − ȳ) = 0, so is the prediction equation defined so that Σ(y − ŷ) = 0. The second property tells us that the line passes through the center of the data.

9.3 THE LINEAR REGRESSION MODEL

For the model y = α + βx, each value of x corresponds to a single value of y. Such a model is said to be deterministic. It is unrealistic in social science research, because we do not expect all subjects who have the same x-value to have the same y-value. Instead, the y-values vary.

For example, let x = number of years of education and y = annual income. The subjects having x = 12 years of education do not all have the same income, because income is not completely dependent upon education. Instead, a probability distribution describes annual income for individuals with x = 12. This distribution refers to the variability in the y-values at a fixed value of x, so it is a conditional distribution. A separate conditional distribution applies for those with x = 13 years of education, and others apply for those with other values of x. Each level of education has its own conditional distribution of income. For example, the mean of the conditional distribution of income would likely be higher at higher levels of education.

A probabilistic model for the relationship allows for variability in y at each value of x. We now show how a linear function is the basis for a probabilistic model.

Linear Regression Function

A probabilistic model uses α + βx to represent the mean of the y-values, rather than y itself, as a function of x. For a given value of x, α + βx represents the mean of the conditional distribution of y for subjects having that value of x.

Expected Value of y
Let E(y) denote the mean of a conditional distribution of y. The symbol E represents expected value, which is another term for the mean.

We now use the equation

    E(y) = α + βx

to model the relationship between x and the mean of the conditional distribution of y. For y = annual income, in dollars, and x = number of years of education, suppose E(y) = −5000 + 3000x. For instance, those having a high school education (x = 12) have a mean income of E(y) = −5000 + 3000(12) = 31,000 dollars. The model states that the mean income is 31,000, rather than stating that every subject with x = 12 has income 31,000 dollars. The model allows different subjects having x = 12 to have different incomes.

An equation of the form E(y) = α + βx that relates values of x to the mean of the conditional distribution of y is called a regression function.

Regression Function
A regression function is a mathematical function that describes how the mean of the response variable changes according to the value of an explanatory variable.
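A small simulation can make the idea of conditional distributions concrete. The Python sketch below is a hypothetical illustration: the normal shape and the conditional standard deviation of 10,000 are our assumptions, not values from the text. It generates incomes for many subjects with x = 12 years of education, which vary around the mean E(y) = −5000 + 3000(12) = 31,000.

    # Simulate a conditional distribution of y at a fixed x, centered at E(y) = alpha + beta*x.
    import numpy as np

    alpha, beta = -5000.0, 3000.0
    sigma = 10_000.0                      # assumed conditional standard deviation
    rng = np.random.default_rng(1)

    x = 12
    incomes = alpha + beta * x + rng.normal(0.0, sigma, size=100_000)
    print(round(incomes.mean()))          # close to E(y) = 31,000
    print(round(incomes.std()))           # close to sigma = 10,000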
The function E(y) = α + βx is called a linear regression function, because it uses a straight line to relate the mean of y to the values of x. The y-intercept α and the slope β are called the regression coefficients for the linear regression function.

In practice, the parameters of the linear regression function are unknown. Least squares provides the sample prediction equation ŷ = a + bx. At a fixed value of x, ŷ = a + bx estimates the mean of y for all subjects in the population having that value of x.

Describing Variation about the Regression Line

The linear regression model has an additional parameter, σ, describing the standard deviation of the conditional distribution of y at each value of x.

9.4 MEASURING LINEAR ASSOCIATION: THE CORRELATION

The correlation, denoted by r, is a standardized version of the slope; as shown later in this section, r = (sx/sy)b, where sx and sy are the sample standard deviations of x and y. The correlation is valid only when a straight line is a sensible model for the relationship. Since r is proportional to the slope of a linear prediction equation, it measures the strength of the linear association between x and y.

• −1 ≤ r ≤ 1. The correlation, unlike the slope b, must fall between −1 and +1. The reason will be seen later in the section.
• r has the same sign as the slope b. Since r equals b multiplied by the ratio of two (positive) standard deviations, the sign is preserved. Thus, r > 0 when the variables are positively related, and r < 0 when the variables are negatively related.
• r = 0 for those lines having b = 0. When r = 0, there is not a linear increasing or linear decreasing trend in the relationship.
• r = ±1 when all the sample points fall exactly on the prediction line. These correspond to perfect positive and negative linear associations. There is then no prediction error when the prediction equation ŷ = a + bx predicts y.
• The larger the absolute value of r, the stronger the linear association. Variables with a correlation of −0.80 are more strongly linearly associated than variables with a correlation of 0.40. Figure 9.10 shows scatterplots having various values for r.

FIGURE 9.10: Scatterplots for Different Correlations

• The correlation, unlike the slope b, treats x and y symmetrically. The prediction equation using y to predict x has the same correlation as the one using x to predict y.
• The value of r does not depend on the variables' units. For example, if y is the number of murders per 1,000,000 population instead of per 100,000 population, we obtain the same value of r = 0.63. Also, when murder rate predicts poverty rate, the correlation is the same as when poverty rate predicts murder rate, r = 0.63 in both cases.

The correlation is useful for comparing associations for variables having different units. Another potential predictor for murder rate is the mean number of years of education completed by adult residents in the state. Poverty rate and education have different units, so a one-unit change in poverty rate is not comparable to a one-unit change in education. Their slopes from the separate prediction equations are not comparable. The correlations are comparable. Suppose the correlation of murder rate with education is −0.30. Since the correlation of murder rate with poverty rate is 0.63, and since 0.63 > |−0.30|, murder rate is more strongly associated with poverty rate than with education.

Many properties of the correlation are similar to those of the ordinal measure of association gamma (Section 8.5). It falls between −1 and +1, it is symmetric, and larger absolute values indicate stronger associations.
We emphasize that the correlation describes linear relationships. For curvilinear relationships, the best-fitting prediction line may be completely or nearly horizontal, in which case b = 0 and r = 0. See Figure 9.11. A low absolute value for r does not then imply that the variables are unassociated, but that the association is not linear.

FIGURE 9.11: Scatterplot for Which r = 0, Even Though There Is a Strong Curvilinear Relationship. A curvilinear function fits these data well, whereas the least squares line ŷ = a + bx has b = 0.

Correlation Implies Regression toward the Mean

Another interpretation of the correlation relates to its standardized slope property. We can rewrite the equality r = (sx/sy)b as

    sx·b = r·sy.

Now the slope b is the change in ŷ for a one-unit increase in x. An increase in x of sx units has a predicted change of sx·b units. (For instance, if sx = 10, an increase of 10 units in x corresponds to a change in ŷ of 10b.) See Figure 9.12. Since sx·b = r·sy, an increase of sx in x corresponds to a predicted change of r standard deviations in the y-values. The larger the absolute value of r, the stronger the association, in the sense that a standard deviation change in x corresponds to a greater proportion of a standard deviation change in y.

FIGURE 9.12: An Increase of sx Units in x Corresponds to a Predicted Change of r·sy Units in y

EXAMPLE 9.8 Child's Height Regresses toward the Mean

The British scientist Sir Francis Galton discovered the basic ideas of regression and correlation in the 1880s. After multiplying each female height by 1.08 to account for gender differences, he noted that the correlation between x = parent height (the average of father's and mother's height) and y = child's height is about 0.5. From the property just discussed, a standard deviation change in parent height corresponds to half a standard deviation change in child's height. For parents of average height, the child's height is predicted to be average. If, on the other hand, parent height is a standard deviation above average, the child is predicted to be half a standard deviation above average. If parent height is two standard deviations below average, the child is predicted to be one standard deviation below average (because the correlation is 0.5).

Since r is less than 1, a y-value is predicted to be fewer standard deviations from its mean than x is from its mean. Tall parents tend to have tall children, but on the average not quite so tall. For instance, if you consider all fathers with height 7 feet, perhaps their sons average 6 feet 5 inches—taller than average, but not so extremely tall; if you consider all fathers with height 5 feet, perhaps their sons average 5 feet 5 inches—shorter than average, but not so extremely short. In each case, Galton pointed out the regression toward the mean. This is the origin of the name for regression analysis. ■

For x = poverty rate and y = murder rate for the 50 states, the correlation is r = 0.63. So a standard deviation increase in the poverty rate corresponds to a predicted 0.63 standard deviation increase in murder rate. By contrast, r = 0.37 between the poverty rate and the violent crime rate. This association is weaker. A standard deviation increase in poverty rate corresponds to a smaller change in the predicted violent crime rate than in the predicted murder rate (in standard deviation units).
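The identity behind these statements, r = (sx/sy)b, is easy to confirm numerically. A Python sketch with illustrative data:

    # Verify that the correlation equals the standardized slope, r = (s_x/s_y) * b.
    import numpy as np

    x = np.array([9.1, 17.4, 20.0, 15.4, 18.2, 9.9])
    y = np.array([9.0, 11.6, 10.2, 8.6, 13.1, 5.8])

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    r = np.corrcoef(x, y)[0, 1]

    print(r)
    print((x.std(ddof=1) / y.std(ddof=1)) * b)   # same value as r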
r-Squared: Proportional Reduction in Prediction Error

A related measure of association summarizes how well x can predict y. If we can predict y much better by substituting x-values into the prediction equation ŷ = a + bx than without knowing the x-values, the variables are judged to be strongly associated. This measure of association has four elements:

• Rule 1 for predicting y without using x.
• Rule 2 for predicting y using information on x.
• A summary measure of prediction error for each rule, E1 for errors by rule 1 and E2 for errors by rule 2.
• The difference in the amount of error with the two rules, E1 − E2. Converting this reduction in error to a proportion provides the definition

    Proportional reduction in error = (E1 − E2)/E1.

Rule 1 (Predicting y without using x): The best predictor is ȳ, the sample mean.

Rule 2 (Predicting y using x): When the relationship between x and y is linear, the prediction equation ŷ = a + bx provides the best predictor of y. For each subject, substituting the x-value into this equation provides the predicted value of y.

Prediction Errors: The prediction error for each subject is the difference between the observed and predicted values of y. The prediction error using rule 1 is y − ȳ, and the prediction error using rule 2 is y − ŷ, the residual. For each predictor, some prediction errors are positive, some are negative, and the sum of the errors equals 0. We summarize the prediction errors by their sum of squared values,

    E = Σ(observed y-value − predicted y-value)².

For rule 1, the predicted values all equal ȳ. The total prediction error equals

    E1 = Σ(y − ȳ)².

This is the total sum of squares of the y-values about their mean. We denote this by TSS. For rule 2, the predicted values are the ŷ-values from the prediction equation. The total prediction error equals

    E2 = Σ(y − ŷ)².

We have denoted this by SSE, called the sum of squared errors or the residual sum of squares. When x and y have a strong linear association, the prediction equation provides predictions (ŷ) that are much better than ȳ, in the sense that the sum of squared prediction errors is substantially less. Figure 9.13 shows graphical representations of the two predictors and their prediction errors. For rule 1, the same prediction (ȳ) applies for the value of y, regardless of the value of x. For rule 2, the prediction changes as x changes, and the prediction errors tend to be smaller.

FIGURE 9.13: Graphical Representation of Rule 1 and Total Sum of Squares E1 = TSS = Σ(y − ȳ)², Rule 2 and Residual Sum of Squares E2 = SSE = Σ(y − ŷ)²

Definition of Measure: The proportional reduction in error from using the linear prediction equation instead of ȳ to predict y is

    r² = (E1 − E2)/E1 = (TSS − SSE)/TSS = [Σ(y − ȳ)² − Σ(y − ŷ)²] / Σ(y − ȳ)².

It is called r-squared, or sometimes the coefficient of determination.

The notation r² is used for this measure because, in fact, the proportional reduction in error equals the square of the correlation r. We don't need to use the sums of squares in its definition to find r², as we can square the correlation. Its defining formula is useful for interpreting r², but it is not needed for its calculation.

EXAMPLE 9.9 r² for Murder Rate and Poverty Rate

The correlation between poverty rate and murder rate for the 50 states is r = 0.629. Therefore, r² = (0.629)² = 0.395. For predicting murder rate, the linear prediction equation ŷ = −0.86 + 0.58x has 39.5% less error than ȳ. Software for regression routinely provides tables that contain the sums of squares that compose r².
For example, part of Table 9.3 showed

                  Sum of Squares
    Regression       307.342
    Residual         470.406
    Total            777.749

The sum of squared errors using the prediction equation is SSE = Σ(y − ŷ)² = 470.4, and the total sum of squares is TSS = Σ(y − ȳ)² = 777.7. Thus,

    r² = (TSS − SSE)/TSS = (777.7 − 470.4)/777.7 = 307.3/777.7 = 0.395.

In practice, it is unnecessary to perform this computation, since software reports r or r² or both. ■

Properties of r-Squared

The properties of r² follow directly from those of the correlation r or from its definition in terms of the sums of squares.

• Since −1 ≤ r ≤ 1, r² falls between 0 and 1.
• The minimum possible value for SSE is 0, in which case r² = TSS/TSS = 1. For SSE = 0, all sample points must fall exactly on the prediction line. In that case, there is no prediction error using x to predict y. This condition corresponds to r = ±1.
• When the least squares slope b = 0, the y-intercept a equals ȳ (because a = ȳ − bx̄, which equals ȳ when b = 0). Then ŷ = ȳ for all x. The two prediction rules are then identical, so that SSE = TSS and r² = 0.
• Like the correlation, r² measures the strength of linear association. The closer r² is to 1, the stronger the linear association, in the sense that the more effective the least squares line ŷ = a + bx is compared to ȳ in predicting y.
• r² does not depend on the units of measurement, and it takes the same value when x predicts y as when y predicts x.

Sums of Squares Describe Conditional and Marginal Variability

To summarize, the correlation r falls between −1 and +1. It indicates the direction of the association, positive or negative, through its sign. It is a standardized slope, equaling the slope when x and y are equally spread out. A one standard deviation change in x corresponds to a predicted change of r standard deviations in y. The square of the correlation has a proportional reduction in error interpretation related to predicting y using ŷ = a + bx rather than ȳ.

The total sum of squares, TSS = Σ(y − ȳ)², summarizes the variability of the observations on y, since this quantity divided by n − 1 is the sample variance s²y of the y-values. Similarly, SSE = Σ(y − ŷ)² summarizes the variability around the prediction equation, which refers to variability for the conditional distributions. When r² = 0.39, the variability in y using x to make the predictions (via the prediction equation) is 39% less than the overall variability of the y-values. Thus, the r² result is often expressed as "the poverty rate explains 39% of the variability in murder rate" or "39% of the variance in murder rate is explained by its linear relationship with the poverty rate." Roughly speaking, the variance of the conditional distribution of murder rate for a given poverty rate is 39% smaller than the variance of the marginal distribution of murder rate.

This interpretation has the weakness, however, that variability is summarized by the variance. Many statisticians find r² to be less useful than r, because (being based on sums of squares) it uses the square of the original scale of measurement. It's easier to interpret the original scale than a squared scale. This is also the advantage of the standard deviation over the variance. When two variables are strongly associated, the variation in the conditional distributions is considerably less than the variation in the marginal distribution. Figure 9.9 illustrated this.
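The proportional reduction in error computation is one line of arithmetic. A short sketch using the sums of squares from Table 9.3:

    # r-squared as a proportional reduction in error: r2 = (TSS - SSE) / TSS.
    TSS = 777.749   # total sum of squares, sum((y - ybar)^2), from Table 9.3
    SSE = 470.406   # residual sum of squares, sum((y - yhat)^2), from Table 9.3

    r2 = (TSS - SSE) / TSS
    print(round(r2, 3))   # 0.395, agreeing with r = 0.629 squared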
9.5 INFERENCES FOR THE SLOPE AND CORRELATION

Sections 9.1-9.3 showed how a linear regression model can represent the form of relationships between quantitative variables. Section 9.4 used the correlation and its square to describe the strength of the association. These parts of a regression analysis are descriptive. We now present inferential methods for the regression model.

A test of whether the two quantitative variables are statistically independent has the same purpose as the chi-squared test for categorical variables. A confidence interval for the slope of the regression equation or the correlation tells us about the size of the effect. These inferences enable us to judge whether the variables are associated and to estimate the direction and strength of the association.

Assumptions for Statistical Inference

Statistical inferences for regression make the following assumptions:

• The study used randomization, such as a simple random sample in a survey.
• The mean of y is related to x by the linear equation E(y) = α + βx.
• The conditional standard deviation σ is identical at each x-value.
• The conditional distribution of y at each value of x is normal.

The second assumption states that the linear regression function is valid. The assumption about a common σ is one under which the least squares estimates are the best possible estimates of the regression coefficients.² The assumption about normality assures that the test statistic for a test of independence has a t sampling distribution. In practice, none of these assumptions is ever satisfied exactly. In the final section of the chapter we'll see that the important assumptions are the first two.

²Under the assumptions of normality with common σ, the least squares estimates are also the maximum likelihood estimates.

Test of Independence

Under the assumptions, the variables are statistically independent when the slope of the regression line is 0, so the null hypothesis of independence is H0: β = 0. The alternative hypothesis can be two-sided, Ha: β ≠ 0, or one-sided, Ha: β > 0 or Ha: β < 0, to predict the direction of the association. The test statistic equals

    t = b/se,

where se is the standard error of the sample slope b.

The form of the test statistic is the usual one for a t or z test. We take the estimate b of the parameter β, subtract the null hypothesis value (β = 0), and divide by the standard error of the estimate b. Under the assumptions, this test statistic has the t sampling distribution with df = n − 2. The degrees of freedom are the same as the df of the conditional standard deviation estimate s.

The formula for the standard error of b is

    se = s / √(Σ(x − x̄)²),    where    s = √(SSE/(n − 2)).

This depends on the point estimate s of the standard deviation of the conditional distributions of y. The smaller s is, the more precisely b estimates β. A small s occurs when the data points show little variability about the prediction equation. Also, the standard error of b is inversely related to Σ(x − x̄)², the sum of squares of the observed x-values about their mean. This sum increases, and hence b estimates β more precisely, as the sample size n increases. (The se also decreases when the x-values are more highly spread out, but the researcher usually has no control over this except in designed experiments.)

The P-value for Ha: β ≠ 0 is the two-tail probability from the t distribution. Software provides the P-value. For large df, recall that the t distribution is similar to the standard normal, so the P-value can be approximated using the normal probability table.

EXAMPLE 9.10 Regression for Selling Price of Homes

What affects the selling price of a house? Table 9.4 shows observations on home sales in Gainesville, Florida, in fall 2006. This table shows data for 8 homes.
The entire file for 100 home sales is the "house selling price" data file at the text Web site. Variables listed are selling price (in dollars), size of house (in square feet), annual taxes (in dollars), number of bedrooms, number of bathrooms, and whether the house is newly built. For now, we use only the data on y = selling price and x = size of house.

TABLE 9.4: Selling Prices and Related Factors for a Sample of Home Sales in Gainesville, Florida

    Home   Selling Price   Size   Taxes   Bedrooms   Bathrooms   New
    1         279,900      2048   3104       4           2       no
    2         145,500       912   1173       2           1       no
    3         237,700      1654   3076       4           2       no
    4         200,000      2088   1608       3           2       no
    5         159,900      1477   1454       3           3       no
    6         499,900      3153   2997       3           2       yes
    7         265,500      1355   4054       3           2       no
    8         289,900      2075   3002       3           2       yes

Note: For the complete file for 100 homes, see the text Web site.

Since these 100 observations come from one city alone, we cannot use them to make inferences about the relationship between x and y in general. We treat them as a random sample of a conceptual population of home sales in this market in order to analyze how these variables seem to be related.

Figure 9.15 shows a scatterplot, which displays a strong positive trend. The model E(y) = α + βx seems appropriate. Some of the points at high levels of size are regression outliers, however, and one point falls quite far below the overall trend. We discuss this abnormality in Section 14.5, which introduces an alternative model that does not assume constant variability around the regression line.

FIGURE 9.15: Scatterplot and Prediction Equation for y = Selling Price (in Dollars) and x = Size of House (in Square Feet)

Table 9.5 shows part of an SPSS printout for a regression analysis. The prediction equation is ŷ = −50,926 + 126.6x. The predicted selling price increases by b = 126.6 dollars for an increase in size of a square foot. Figure 9.15 also superimposes the prediction equation over the scatterplot. In SPSS, "Beta" denotes the estimated standardized regression coefficient. For the regression model of this chapter, this is the correlation; it is not to be confused with the population slope, β, which is unknown.

TABLE 9.5: Information from SPSS Printout for Regression Analysis of y = Selling Price and x = Size of House

             N     Mean        Std. Deviation
    price    100   155331.00   101262.21
    size     100   1629.28     666.94

                  Sum of Squares   df   Mean Square
    Regression    7.057E+11         1   7.057E+11
    Residual      3.094E+11        98   3157352537
    Total         10.15E+11        99

    R Square   Std. Error of the Estimate
    .695       56190.324

                  Unstandardized    Standardized
                  Coefficients      Coefficients
                  B          Std. Error    Beta    t       Sig.
    (Constant)    -50926.3   14896.373             -3.42   .001
    size          126.594    8.468          .834   14.95   .000

Table 9.5 reports that the standard error of the slope estimate is se = 8.47. This is listed under "Std. Error" for the size predictor. This value estimates the variability in sample slope values that would result from repeatedly selecting random samples of 100 house sales in Gainesville and calculating prediction equations.

For testing independence, H0: β = 0, the test statistic is

    t = b/se = 126.6/8.47 = 14.95,

shown in the last row of Table 9.5. Since n = 100, its degrees of freedom are df = n − 2 = 98. This is an extremely large test statistic. The P-value, listed in Table 9.5 under the heading "Sig", is 0.000 to three decimal places. This refers to the two-sided alternative Ha: β ≠ 0. It is the two-tailed probability of a t statistic at least as large in absolute value as the absolute value of the observed |t| = 14.95, presuming H0 is true.
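The test can be reproduced from the summary numbers alone. A Python sketch using scipy (the numbers are those reported in Table 9.5):

    # t test of H0: beta = 0 from b, se, and n.
    from scipy import stats

    b, se, n = 126.594, 8.468, 100
    t = b / se                               # t = 14.95
    df = n - 2                               # df = 98
    p_two_sided = 2 * stats.t.sf(abs(t), df)
    print(round(t, 2), p_two_sided)          # the P-value is essentially 0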
Table 9.6 shows part of a SAS printout for the same analysis. The two-sided P-value, listed under the heading "Pr > |t|," is <0.0001 to four decimal places. (It is actually 0.0000000... to a huge number of decimal places, but SAS reports it this way rather than 0.0000 so you don't think the P-value is exactly 0.)

TABLE 9.6: Part of a SAS Printout for Regression Analysis of Selling Price and Size of House

                            Sum of        Mean
    Source             DF   Squares       Square
    Model               1   7.05729E11    7.05729E11
    Error              98   3.09420E11    3157352537
    Corrected Total    99   1.01515E12

    Root MSE   56190.3

                      Parameter   Standard
    Variable     DF   Estimate    Error      t Value   Pr > |t|
    Intercept     1   -50926      14896      -3.42     0.0009
    size          1   126.59411   8.46752    14.95     <.0001

Both the SAS and SPSS printouts also contain a standard error and t test for the y-intercept. We won't use this information, since rarely is there any reason to test the hypothesis that a y-intercept equals 0. For this example, the y-intercept does not have any interpretation, since houses of size x = 0 do not exist.

In summary, there is extremely strong evidence that size of house has a positive effect on selling price. On the average, selling price increases as size increases. This is no surprise. Indeed, we would be shocked if these variables were independent. For these data, estimating the size of the effect is more relevant than testing whether it exists.

Confidence Interval for the Slope

A small P-value for H0: β = 0 suggests that the regression line has a nonzero slope. We should be more concerned with the size of the slope β than with knowing merely that it is not 0. A confidence interval for β has the formula

    b ± t(se).

The t-score is the value from Table B, with df = n − 2, for the desired confidence level. The form of the interval is similar to the confidence interval for a mean (Section 5.3), namely, take the estimate of the parameter and add and subtract a t multiple of the standard error. The se is the same as the se in the test about β.

EXAMPLE 9.11 Estimating the Slope for House Selling Prices

For the data on x = size of house and y = selling price, b = 126.6 and se = 8.47. The parameter β refers to the change in the mean selling price (in dollars) for each 1-square-foot increase in size. For a 95% confidence interval, we use the t.025 value for df = n − 2 = 98, which is t.025 = 1.984. (Table B shows t.025 = 1.984 for df = 100.) The interval is

    b ± t.025(se) = 126.6 ± 1.984(8.47) = 126.6 ± 16.8, or (110, 143).

We can be 95% confident that β falls between 110 and 143. The mean selling price increases by between $110 and $143 for a 1-square-foot increase in house size. ■
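The same summary numbers give the confidence interval, with scipy supplying the t-score. Before rounding, the interval is about (109.8, 143.4), which the text rounds to (110, 143).

    # 95% confidence interval for the slope: b +/- t.025 * se, with df = n - 2.
    from scipy import stats

    b, se, df = 126.594, 8.468, 98
    t025 = stats.t.ppf(0.975, df)          # about 1.984
    print(b - t025 * se, b + t025 * se)    # about (109.8, 143.4)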
The 95% confidence interval for /3 is (110, 143), so the 95% confidence interval for 100,8 has endpoints 100(110) = 11,100 and 100(143) = 14,300. We infer that the mean selling price increases by at least $11,100 and at most $14,300, for a 100-square-foot increase in house size. For example, assuming that the linear regression model is valid, we conclude that the mean is between $11,100 and $14,300 higher for houses of 1700 square feet than for houses of 1600 square feet. Reading the Computer Printout Let's take a closer look at the printouts in Tables 9.5 and 9.6. They contain some information we have not yet discussed. For instance, in the sum of squares table, the sum of squared errors (SSE) is 3.094 times 1011. This is a huge number because the y-values are very large and their deviations are squared. The estimated conditional standard deviation of y is s = ^SSE/(n - 2) = 56,190. SAS labels this "Root MSE" for square root of the mean square error. SPSS misleadingly labels it "Std. Error of the Estimate." This is a poor label, because s refers to a conditional standard deviation of selling prices (for a fixed house size), not a standard error of a statistic. The sum of squares table also reports the total sum of squares, TSS = 2(y - v)3 = 10.15 X 10u. From this value and SSE, ^TSS-SSE=Q695 TSS This is the proportional reduction in error in using house size to predict selling price. Since the slope of the prediction equation is positive, the correlation is the positive square root of this value, or 0.834. A strong positive association exists between these variables. The total sum of squares TSS partitions into two parts, the sum of squared errors, SSE = 3.094 X 10u, and the difference between TSS and SSE, TSS - SSE = 7.057 x 1011, This difference is the numerator of the iz measure. SPSS calls this the regression sum of squares. SAS calls it the model sum of squares. It represents the amount of the total variation TSS in y that is explained by x in using the least squares line. The ratio of this sum of squares to TSS equals r. The table of sums of squares has an associated list of degrees of freedom values. The degrees of freedom for the total sum of squares TSS = 2(}' - y)2isn - 1 = 99, since TSS refers to variability in the marginal distribution of y, which has sample T 282 Chapter 9 Linear Regression and Correlation variance — m y = ' c+D association '=f,-äf = "> _ TSS-SSE tss and quantitative predictors. Chapter 14 introduces models for more complex relationships, such as nonlinear ones. Finally, Chapter 15 presents regression models for categorical response variables. Before discussing these multivariate models, however, we introduce in the next chapter some new concepts that help us to understand and interpret multivariate relationships. PROBLEMS Practicing the Basics 9.1. For the following variables in a regression analysis, which variable more naturally plays the role of x (explanatory variable) and which plays the role of y (response variable)? (a) College grade point average (GPA) and high school GPA (b) Number of children and mother's education level (c) Annual income and number of years of education (d) Annual income and assessed value of home 9.2. Sketch plots of the following prediction equations, for values of x between 0 and 10: (a) y = 7 + 0.5.Y (b) y = 7 + x (c) y = 1 - x 0.5x (e) (f) y = 7 l = 1 y '■ 9.3. Anthropologists often try to reconstruct information using parti al human remains at burial sites. 
For instance, after finding a femur (thighbone), they may want to predict how tall an individual was. An equation they use to do this is ŷ = 61.4 + 2.4x, where ŷ is the predicted height and x is the length of the femur, both in centimeters.³
(a) Identify the y-intercept and slope of the equation. Interpret the slope.
(b) A femur found at a particular site has length of 50 cm. What is the predicted height of the person who had that femur?

9.4. The OECD (Organization for Economic Cooperation and Development) consists of 20 advanced, industrialized countries. For these nations,⁴ the prediction equation relating y = child poverty rate in 2000 to x = social expenditure as a percent of gross domestic product is ŷ = 22 − 1.3x. The y-values ranged from 2.8% (Finland) to 21.9% (U.S.). The x-values ranged from 2% (U.S.) to 16% (Denmark).
(a) Interpret the y-intercept and the slope.
(b) Find the predicted poverty rates for the U.S. and for Denmark.
(c) The correlation is −0.79. Interpret.

9.5. Look at Figure 2 in www.ajph.org/cgi/reprint/93/4/652?ck=nck, a scatterplot for U.S. states with correlation 0.53 between x = child poverty rate and y = child mortality rate. Approximate the y-intercept and slope of the prediction equation shown there.

9.6. A study⁵ of mail survey response rate patterns of the elderly found a prediction equation relating x = age (between about 60 and 90) and y = percentage of subjects responding of ŷ = 90.2 − 0.6x.
(a) Interpret the slope.
(b) Find the predicted response rate for a (i) 60-year-old, (ii) 90-year-old.

9.7. For recent UN data from 39 countries on y = per capita carbon dioxide emissions (metric tons per

³S. Junger, Vanity Fair, October 1999.
⁴Source: Figure 8H in www.stateofworkingamerica.org.
⁵D. Kaldenberg et al., Public Opinion Quarterly, Vol. 58, 1994, p. 68.