10 Simple linear regression

This chapter is concerned with relations among variables such as demand and supply relations, cost functions, production functions and many others. Deterministic relations, characterized by a function $y = f(x)$, come mostly from the natural sciences. A relation between $X$ and $Y$ is deterministic if each element of the domain is paired with exactly one element of the range. In economic situations deterministic relations are very rare; we usually deal with stochastic (probabilistic) relations, which are more realistic for most real-world situations. A relation between $X$ and $Y$ is said to be stochastic if for each value of $X$ there is a whole probability distribution of values of $Y$. Thus for any given value of $X$ the variable $Y$ may assume some specific value (or fall within some specific interval) with a probability smaller than one and greater than zero; its value is affected by a random disturbance. Regression analysis deals with stochastic relations. Its aims are a) to determine the form of a function which may describe the dependence of $Y$ on $X$ and b) to estimate the parameters of the selected function.

ad a) To determine the form of the function we may start from logical analysis (i.e. follow some economic theory), or the form may be suggested by a two-dimensional diagram (scatter plot). (The latter approach can be used only when the dependent (response) variable is a function of just one independent (explanatory, predictor) variable.) Commonly used forms of regression functions are:

* regression line: $E(Y|x) = \beta_0 + \beta_1 x$
* regression parabola: $E(Y|x) = \beta_0 + \beta_1 x + \beta_2 x^2$
* regression polynomial of degree $p$: $E(Y|x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p$
* regression hyperbola: $E(Y|x) = \beta_0 + \beta_1 \frac{1}{x}$
* regression logarithmic function: $E(Y|x) = \beta_0 + \beta_1 \ln x$

Each of the listed regression functions is a simple linear regression function. (The term linear regression function is used if the function is linear with respect to the parameters $\beta_0, \beta_1, \beta_2, \dots$. It is said to be simple if the dependent variable is a function of just one independent variable; otherwise it is said to be multiple.)

ad b) The unknown parameters $\beta_0, \beta_1, \beta_2, \dots$ are estimated so as to fit the data set of $n$ pairs of observed values $(x_1, y_1), \dots, (x_n, y_n)$. The method of least squares is the commonly used estimation method.

10.1 Specification of the classical simple linear regression model

A model consists of the regression equation and the basic assumptions. Let us begin with the equation:

$Y = \beta_0 + \beta_1 f_1(x) + \dots + \beta_p f_p(x) + \varepsilon$

where:
$Y$ is a dependent random variable, which is observable,
$x$ is an independent non-stochastic variable, which is observable,
$\varepsilon$ is a random error, which accounts for the random factors and is unobservable,
$\beta_0 + \beta_1 f_1(x) + \dots + \beta_p f_p(x)$ is the theoretical regression function with unknown parameters $\beta_0, \beta_1, \dots, \beta_p$.

For $n$ observations the regression equation can be written as

$y_1 = \beta_0 + \beta_1 f_1(x_1) + \dots + \beta_p f_p(x_1) + \varepsilon_1$
$\vdots$
$y_i = \beta_0 + \beta_1 f_1(x_i) + \dots + \beta_p f_p(x_i) + \varepsilon_i$
$\vdots$
$y_n = \beta_0 + \beta_1 f_1(x_n) + \dots + \beta_p f_p(x_n) + \varepsilon_n$

The subscript $i = 1, \dots, n$ refers to the $i$-th observation. Observations on $X$ and $Y$ can be made over time, in which case we speak of time-series data, or they can be made over individuals, objects, or geographical areas, in which case we speak of cross-section data.
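To make the stochastic character of the model tangible before stating the assumptions, here is a minimal simulation sketch in Python (an illustration only, assuming numpy is available; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ and the grid of $x$ values are purely hypothetical). For every fixed $x$ the observed $Y$ equals the value of the regression function plus a random disturbance, so repeated observations at the same $x$ scatter around $E(Y|x)$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative "true" parameters of a regression line E(Y|x) = beta0 + beta1*x
beta0, beta1, sigma = 2.0, 0.5, 1.0

n = 50
x = np.linspace(1.0, 25.0, n)           # non-stochastic explanatory variable
eps = rng.normal(0.0, sigma, size=n)    # random errors epsilon_i ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps             # observed values of the dependent variable

# For a fixed value of x the variable Y has a whole distribution of values:
print(beta0 + beta1 * x[0] + rng.normal(0.0, sigma, size=5))
```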
Assumptions about the random errors $\varepsilon_i$, $i = 1, \dots, n$:

a) $E(\varepsilon_i) = 0$ [the errors have zero mean, i.e. they are not systematic]
b) $D(\varepsilon_i) = \sigma^2 > 0$ [each observation is made with equal precision]
c) $C(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ [there is no linear relationship between the errors]
d) $\varepsilon_i \sim N(0, \sigma^2)$ [the errors are normally distributed]

Violations of some basic assumptions are shown in the following pictures. In the first one there is a violation of assumption b) and we speak of heteroskedasticity of the random errors; in the second one there is a violation of assumption c) and we speak of autocorrelation of the random errors.

[Figure: scatter plots of the observed pairs $(x, y)$ together with the regression function; left panel: heteroskedasticity, right panel: autocorrelation.]

The following pictures show weak and strong linear dependence under the basic assumptions:

[Figure: scatter plots of $(x, y)$; left panel: weak linear dependence (low fit), right panel: strong linear dependence (strong fit).]

Once the mathematical form of the relation is specified, the unknown parameters $\beta_0, \beta_1, \dots, \beta_p$ have to be estimated.

10.2 The least squares estimators of regression parameters and the notation

$b_0, b_1, \dots, b_p$   estimators of the regression parameters $\beta_0, \beta_1, \dots, \beta_p$
$b_0 + b_1 f_1(x) + \dots + b_p f_p(x)$   empirical (sample) regression function
$\hat{y}_i = b_0 + b_1 f_1(x_i) + \dots + b_p f_p(x_i)$   regression estimate of the $i$-th value of the random variable $Y$
$e_i = y_i - \hat{y}_i$   $i$-th residual
$S_E = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2$   residual sum of squares
$s^2 = \frac{S_E}{n-p-1}$   estimator of the variance $\sigma^2$
$S_R = \sum_{i=1}^n (\hat{y}_i - m_2)^2$, where $m_2 = \frac{1}{n}\sum_{i=1}^n y_i$   regression sum of squares
$S_T = \sum_{i=1}^n (y_i - m_2)^2$   total sum of squares [It holds that $S_T = S_R + S_E$.]
$ID^2 = \frac{S_R}{S_T} = 1 - \frac{S_E}{S_T}$   coefficient of determination [$ID^2 \in \langle 0, 1 \rangle$]

[The coefficient of determination is a measure of "goodness of fit"; it is simply the proportion of the variation of $Y$ that can be attributed to the variation of $X$ and describes how well the sample regression function fits the observed data. A zero value of $ID^2$ indicates the poorest and a unit value the best fit that can be attained.]

10.3 The method of least squares

The purpose of the least squares method is to find estimators $b_0, b_1, \dots, b_p$ of the regression parameters $\beta_0, \beta_1, \dots, \beta_p$ such that the sum of squared residuals is as small as possible. (The regression estimates then fit the data "best".) Thus

$S(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left[y_i - \beta_0 - \beta_1 f_1(x_i) - \dots - \beta_p f_p(x_i)\right]^2 \rightarrow \min$

We have to minimize the function $S(\beta_0, \beta_1, \dots, \beta_p)$, which depends only on the unknown parameters of the regression model. The procedure is as follows:

1. Differentiate $S(\beta_0, \beta_1, \dots, \beta_p)$ with respect to each regression parameter.
2. Set each derivative equal to zero. This leads to a system of $p+1$ equations in $p+1$ unknowns. These equations are generally known as the least squares normal equations.
3. Solving the least squares normal equations we obtain the desired estimators $b_0, b_1, \dots, b_p$ of the regression parameters $\beta_0, \beta_1, \dots, \beta_p$.

The least squares normal equations have the form:

$\beta_0 \sum 1 + \beta_1 \sum f_1 + \beta_2 \sum f_2 + \dots + \beta_p \sum f_p = \sum y_i$
$\beta_0 \sum f_1 + \beta_1 \sum f_1^2 + \beta_2 \sum f_1 f_2 + \dots + \beta_p \sum f_1 f_p = \sum y_i f_1$
$\vdots$
$\beta_0 \sum f_p + \beta_1 \sum f_p f_1 + \beta_2 \sum f_p f_2 + \dots + \beta_p \sum f_p^2 = \sum y_i f_p$

where the symbol $\sum$ stands for $\sum_{i=1}^n$ and the symbol $f_j$ stands for $f_j(x_i)$, so that, for example, $\sum f_j = \sum_{i=1}^n f_j(x_i)$. The values of $\beta_0, \beta_1, \dots, \beta_p$ which solve the least squares normal equations are denoted $b_0, b_1, \dots, b_p$.
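As a numerical illustration of the procedure above, the following Python sketch (an illustration only, assuming numpy; the helper name normal_equations_fit and the simulated parabola data are hypothetical) assembles the sums $\sum f_j f_k$ and $\sum y_i f_j$ and solves the resulting normal equations for the estimates $b_0, b_1, \dots, b_p$.

```python
import numpy as np

def normal_equations_fit(x, y, funcs):
    """Solve the least squares normal equations for the model
    E(Y|x) = beta0 + beta1*f1(x) + ... + betap*fp(x)."""
    n = len(x)
    # Columns: 1, f1(x_i), ..., fp(x_i)
    F = np.column_stack([np.ones(n)] + [f(x) for f in funcs])
    A = F.T @ F            # matrix of the sums  sum f_j * f_k
    rhs = F.T @ y          # right-hand sides    sum y_i * f_j
    return np.linalg.solve(A, rhs)   # estimates b0, b1, ..., bp

# Example: a regression parabola fitted to simulated data (illustrative values)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 40)
y = 1.0 - 2.0 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, size=x.size)
print(normal_equations_fit(x, y, [lambda t: t, lambda t: t**2]))
```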
Example 10.4
Consider the regression line and find the estimators $b_0, b_1$ of the parameters $\beta_0, \beta_1$. (The basic assumptions of the classical regression model are assumed to be satisfied.)

Solution
The estimates $b_0, b_1$ can be obtained from the least squares normal equations:

$b_0 \sum_{i=1}^n 1 + b_1 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i$
$b_0 \sum_{i=1}^n x_i + b_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i x_i$

The solution is

$b_0 = \frac{\sum_{i=1}^n y_i \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i \sum_{i=1}^n y_i x_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

$b_1 = \frac{n \sum_{i=1}^n y_i x_i - \sum_{i=1}^n y_i \sum_{i=1}^n x_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

Thus the estimated sample regression line is $\hat{y} = b_0 + b_1 x$. (Notice that $b_0, b_1$ are random variables; they depend on the realizations $(x_i, y_i)$, while the parameters $\beta_0, \beta_1$ are constants.)

10.5 The matrix notation of the classical linear regression model and its solution

The model

$y_i = \beta_0 + \beta_1 f_1(x_i) + \dots + \beta_p f_p(x_i) + \varepsilon_i, \quad i = 1, \dots, n,$

can be expressed in matrix notation as $y = X\beta + \varepsilon$, that is

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & f_1(x_1) & \dots & f_p(x_1) \\ 1 & f_1(x_2) & \dots & f_p(x_2) \\ \vdots & & & \vdots \\ 1 & f_1(x_n) & \dots & f_p(x_n) \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

Notation:
$y$   a column vector of observed values of the dependent random variable $Y$
$X$   a matrix of observed values of the regressors [we assume the rank $h(X) = p + 1 < n$, i.e. the columns of $X$ are linearly independent]
$\beta$   a column vector of regression parameters
$\varepsilon$   a column vector of random errors

The assumptions of the model can be rewritten as $\varepsilon \sim N_n(0, \sigma^2 I)$.

As stated in 10.3, the estimators $b_0, b_1, \dots, b_p$ of the regression parameters $\beta_0, \beta_1, \dots, \beta_p$ can be obtained by solving the least squares normal equations. In matrix notation:

$X'X\beta = X'y$   the least squares normal equations
$b = (X'X)^{-1}X'y$   the least squares estimators (LSE, frequently called OLS, ordinary least squares)
$\hat{y} = Xb$   a vector of regression estimates
$e = y - \hat{y}$   a vector of residuals

10.6 Properties of the least squares estimator $b = (X'X)^{-1}X'y$

1. The estimator $b$ is linear; it is a linear combination of the random vector $y$.
2. The estimator $b$ is unbiased; it holds that $E(b) = \beta$.
3. The estimator $b$ has the variance-covariance matrix $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$.
4. The estimator $b$ is normally distributed with mean vector $\beta$ and variance-covariance matrix $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$, thus $b \sim N_{p+1}(\beta, \sigma^2 (X'X)^{-1})$; the normality follows from $\varepsilon \sim N_n(0, \sigma^2 I)$ and the first property.
5. The estimator $b$ is the best linear unbiased estimator (BLUE) of the vector $\beta$.

Remark 10.7
The last property is known as the Gauss-Markov theorem. "Best" means that if $b^*$ is any other linear unbiased estimator, then $\mathrm{var}(b) \leq \mathrm{var}(b^*)$. [$\mathrm{var}(b^*) - \mathrm{var}(b)$ is a positive semi-definite matrix.]

Since we know the distribution of the vector of estimators $b$, we may proceed with statistical inferences about the regression parameters $\beta$. However, the parameter $\sigma^2$, which enters the variance-covariance matrix $\mathrm{var}(b)$, is unknown. Thus we have to obtain its estimator and, consequently, estimators of the variances of the elements of $b$. The variance-covariance matrix has the form

$\mathrm{var}(b) = \begin{pmatrix} \mathrm{var}(b_0) & \mathrm{cov}(b_0, b_1) & \dots & \mathrm{cov}(b_0, b_p) \\ \mathrm{cov}(b_1, b_0) & \mathrm{var}(b_1) & \dots & \mathrm{cov}(b_1, b_p) \\ \vdots & & & \vdots \\ \mathrm{cov}(b_p, b_0) & \mathrm{cov}(b_p, b_1) & \dots & \mathrm{var}(b_p) \end{pmatrix} = \sigma^2 (X'X)^{-1}$

Thus the variances $D(b_j)$, $j = 0, 1, \dots, p$, are the diagonal elements of the matrix $\sigma^2 (X'X)^{-1}$. Recall (from 10.2) that $s^2 = \frac{S_E}{n-p-1}$ is an unbiased estimator of the parameter $\sigma^2$. Thus the matrix $s^2 (X'X)^{-1}$ estimates the variance-covariance matrix $\mathrm{var}(b)$ and its diagonal elements estimate the variances $D(b_j)$.
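The matrix formulas above translate directly into code. The sketch below (an illustration only, assuming numpy; the helper name ols is ours, not a standard routine) computes $b = (X'X)^{-1}X'y$, the residuals, the estimate $s^2$ of $\sigma^2$ and the estimated variance-covariance matrix $s^2 (X'X)^{-1}$.

```python
import numpy as np

def ols(X, y):
    """Least squares estimation in matrix form (sketch of 10.5 and 10.6).
    X must contain a leading column of ones and have linearly independent columns."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y              # b = (X'X)^{-1} X'y
    y_hat = X @ b                      # vector of regression estimates
    e = y - y_hat                      # vector of residuals
    n, k = X.shape                     # k = p + 1
    s2 = (e @ e) / (n - k)             # unbiased estimator of sigma^2
    var_b = s2 * XtX_inv               # estimated variance-covariance matrix of b
    return b, y_hat, e, s2, var_b

# Example: regression line with illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones(x.size), x])
b, y_hat, e, s2, var_b = ols(X, y)
print(b, s2)
```

In numerical practice the explicit inverse $(X'X)^{-1}$ is usually avoided in favour of a stable routine such as np.linalg.lstsq; the explicit form is kept here only to mirror the textbook notation.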
The following notation is used:
$v_{jj}$   the $j$-th diagonal element of the matrix $(X'X)^{-1}$
$s_{b_j} = s\sqrt{v_{jj}}$   the standard error of $b_j$

10.8 The confidence intervals for the regression parameters

The statistic $T_j = \frac{b_j - \beta_j}{s_{b_j}}$ follows the t-distribution $t(n-p-1)$ for $j = 0, 1, \dots, p$. Thus the limits of the $100(1-\alpha)\%$ confidence interval for $\beta_j$ are calculated as

$b_j \pm s_{b_j}\, t_{1-\alpha/2}(n-p-1)$

10.9 Test of significance of single parameters (separate t-tests)

At the significance level $\alpha$ we test, for $j = 0, 1, \dots, p$,

$H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$.

The null hypothesis asserts that the vector $y$ is not influenced by the $j$-th column of the matrix $X$. If the null is rejected, it is concluded that the parameter $\beta_j$ is relevant in our model. The test statistic $T_j = \frac{b_j}{s_{b_j}}$ follows the distribution $t(n-p-1)$ if $H_0$ is true. The critical region is

$W = (-\infty; -t_{1-\alpha/2}(n-p-1)\rangle \cup \langle t_{1-\alpha/2}(n-p-1); \infty)$

10.10 Test of significance of regression (the overall F-test)

At the significance level $\alpha$ we test:

$H_0: (\beta_1, \beta_2, \dots, \beta_p) = (0, 0, \dots, 0)$ versus $H_1: (\beta_1, \beta_2, \dots, \beta_p) \neq (0, 0, \dots, 0)$.

$H_0$ is a more extensive hypothesis stating that none of the explanatory variables has an influence on $Y$. If $H_0$ is true, then the variation of $Y$ from observation to observation is not affected by changes in any of the explanatory variables but is purely random, $Y = \beta_0 + \varepsilon$.

The test statistic $F = \frac{S_R/p}{S_E/(n-p-1)}$ follows the distribution $F(p, n-p-1)$ if $H_0$ is true. The critical region is

$W = \langle F_{1-\alpha}(p, n-p-1); \infty)$

The F-test results are usually presented in an ANOVA table:

Source of variability    Sum of squares    Degrees of freedom    Mean square        Test statistic
regression model         $S_R$             $p$                   $S_R/p$            $\frac{S_R/p}{S_E/(n-p-1)}$
error                    $S_E$             $n-p-1$               $S_E/(n-p-1)$
total                    $S_T$             $n-1$
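The quantities of 10.8, 10.9 and 10.10 can be computed from the same ingredients. The following sketch (an illustration only, assuming numpy and scipy are available; the helper name inference is hypothetical) returns the confidence limits $b_j \pm s_{b_j} t_{1-\alpha/2}(n-p-1)$, the separate t statistics with their p-values, and the overall F statistic with its p-value.

```python
import numpy as np
from scipy import stats

def inference(X, y, alpha=0.05):
    """Confidence intervals, separate t-tests and the overall F-test
    for a model whose design matrix X has a leading column of ones."""
    n, k = X.shape
    p = k - 1
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    SE = e @ e                               # residual sum of squares
    ST = np.sum((y - y.mean()) ** 2)         # total sum of squares
    SR = ST - SE                             # regression sum of squares (S_T = S_R + S_E)
    s2 = SE / (n - p - 1)
    sb = np.sqrt(s2 * np.diag(XtX_inv))      # standard errors s_{b_j}

    t_q = stats.t.ppf(1 - alpha / 2, n - p - 1)
    ci = np.column_stack([b - t_q * sb, b + t_q * sb])   # 100(1-alpha)% intervals

    T = b / sb                               # t statistics for H0: beta_j = 0
    t_pvals = 2 * stats.t.sf(np.abs(T), n - p - 1)

    F = (SR / p) / (SE / (n - p - 1))        # overall F statistic
    F_pval = stats.f.sf(F, p, n - p - 1)
    return ci, T, t_pvals, F, F_pval
```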