11 Introduction to correlation analysis

When processing data we are often interested in the relationship between two variables and, if they are not independent, in the strength of that relationship. The relationship between two interval-scaled or ratio-scaled random variables is studied by correlation analysis. (Regression analysis and correlation analysis address similar tasks. Regression analysis treats one dependent variable and one or more independent variables; correlation analysis measures the strength of the relationship between two variables that are treated symmetrically.) In this chapter only linear relationships are treated and the bivariate normal distribution is assumed.

Remark 11.1 Let us recall the definition and properties of the correlation coefficient. With $E$ denoting the expected value, $D$ the variance and $C$ the covariance,

$$R(X, Y) = \begin{cases} E\left[\dfrac{X - E(X)}{\sqrt{D(X)}} \cdot \dfrac{Y - E(Y)}{\sqrt{D(Y)}}\right] & \text{for } \sqrt{D(X)}\,\sqrt{D(Y)} > 0, \\ 0 & \text{otherwise.} \end{cases}$$

The properties:

1. $R(X, Y) = \dfrac{C(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}$ for $\sqrt{D(X)}\,\sqrt{D(Y)} > 0$, and $R(X, Y) = 0$ otherwise.
2. $R(X, X) = 1$ for $D(X) \neq 0$, and $R(X, X) = 0$ otherwise.
3. $R(X, Y) = R(Y, X)$.
4. $-1 \le R(X, Y) \le 1$.
5. If $R(X, Y) = 1$, then there exist constants $a, b \in \mathbb{R}$, $b > 0$, such that $P(Y = a + bX) = 1$. If $R(X, Y) = -1$, then there exist constants $a, b \in \mathbb{R}$, $b < 0$, such that $P(Y = a + bX) = 1$.
6. $R(a + bX, c + dY) = \operatorname{sgn}(bd)\, R(X, Y)$.
7. If the random variables $X, Y$ are independent, then $R(X, Y) = 0$. (The converse does not hold in general!)

If the relationship between two variables is linear, the correlation coefficient is a direct indicator of its strength: the closer $|R(X, Y)|$ is to 1, the stronger the relationship between $X$ and $Y$. Positive values of the correlation coefficient correspond to a positive slope (increasing linear dependence); negative values correspond to a negative slope (decreasing linear dependence). If the random variables are independent, the correlation coefficient equals zero. [It can also be zero for some kinds of non-linear dependence!]
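Two of the properties in Remark 11.1 can be checked numerically. The following is a minimal sketch in plain Python, not part of the original notes: the helper `pearson` and the data values are illustrative assumptions. It treats a finite list of points as equally likely outcomes, so the averages play the role of $E$, $D$ and $C$.

```python
import math


def pearson(xs, ys):
    """Correlation coefficient R(X, Y) when the points (x_i, y_i) are equally likely."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # convention from the definition: R = 0 when a variance is zero
    return cov / math.sqrt(var_x * var_y)


# Property 7, converse fails: Y = X^2 is fully determined by X,
# yet the correlation is zero because X is symmetric about 0.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r_quadratic = pearson(xs, ys)

# Property 6: R(a + bX, c + dY) = sgn(bd) * R(X, Y).
us = [1.0, 2.0, 4.0, 7.0, 11.0]  # made-up illustrative values
vs = [2.0, 1.0, 5.0, 6.0, 9.0]   # made-up illustrative values
r_uv = pearson(us, vs)
r_flipped = pearson([3 - 2 * u for u in us], [1 + 5 * v for v in vs])  # b = -2, d = 5

print(r_quadratic, r_uv, r_flipped)
```

Here `r_quadratic` is zero although $Y$ is a function of $X$, and `r_flipped` equals `-r_uv` because $bd < 0$.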
The population correlation coefficient is usually unknown, since the distribution of the random vector $(X, Y)$ is usually unknown, but it can be estimated by the sample correlation coefficient.

Definition 11.2 Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a random sample from a bivariate distribution. Let $M_1, M_2$ be the sample means, let

$$S_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M_1)^2, \qquad S_2^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - M_2)^2$$

be the sample variances and let

$$S_{12} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M_1)(Y_i - M_2)$$

be the sample covariance. Then

$$R_{12} = \frac{S_{12}}{S_1 S_2} \quad \text{for } S_1 S_2 > 0$$

is called the sample correlation coefficient. If $S_1$ or $S_2$ is equal to zero, the sample correlation coefficient is not defined.

Remark 11.3 The sample correlation coefficient $R_{12}$ is not an unbiased estimator of the population correlation coefficient $R(X, Y)$, but for $n > 30$ the bias is negligible. The properties of the sample correlation coefficient $R_{12}$ parallel those of the population correlation coefficient $R(X, Y)$.

In the following text the random sample $(X_1, Y_1), \dots, (X_n, Y_n)$ is assumed to come from a bivariate normal distribution.

Theorem 11.4 Let the random vector $(X, Y)$ follow a bivariate normal distribution. Then the random variables $X$ and $Y$ are independent if and only if the correlation coefficient $\rho = R(X, Y) = 0$. [For the bivariate normal distribution, independence and uncorrelatedness are equivalent.]

Theorem 11.5 Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a random sample from a bivariate normal distribution and let $\rho = 0$. Then the statistic

$$T = \frac{R_{12} \sqrt{n - 2}}{\sqrt{1 - R_{12}^2}}$$

follows the Student t-distribution with $n - 2$ degrees of freedom. This statistic $T$ is used to test the hypothesis that the random variables $X, Y$ are independent.

Theorem 11.6 Given a random sample from a bivariate normal distribution, at the significance level $\alpha$ the null hypothesis $H_0: \rho = 0$ is rejected in favour of the alternative hypothesis $H_1$ if the test statistic

$$T = \frac{R_{12} \sqrt{n - 2}}{\sqrt{1 - R_{12}^2}}$$

falls within the critical region $W$.
The critical region depends on the form of the alternative hypothesis, where $t_p(n-2)$ denotes the $p$-quantile of the Student t-distribution with $n - 2$ degrees of freedom:

for the two-tailed test, $H_1: \rho \neq 0$: $W = (-\infty, -t_{1-\alpha/2}(n-2)] \cup [t_{1-\alpha/2}(n-2), \infty)$
for the left-tailed test, $H_1: \rho < 0$: $W = (-\infty, -t_{1-\alpha}(n-2)]$
for the right-tailed test, $H_1: \rho > 0$: $W = [t_{1-\alpha}(n-2), \infty)$

Example 11.7 The scores of eight randomly drawn students in two subjects are recorded.

Student     1   2   3   4   5   6   7   8
Subject 1  80  50  36  58  42  60  56  68
Subject 2  65  60  35  39  48  44  48  61

At the significance level 0.05 test the hypothesis that the results in the two subjects are not positively correlated.

Solution: seminar session.

The hypothesis $H_0: \rho = 0$ served to test the independence of two normal variables. Now we are interested in the strength of the linear relationship. The test statistic for the following test about the correlation coefficient is built from a particular function of the sample correlation coefficient $R_{12}$, given by the following theorem.

Theorem 11.8 Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a random sample from a bivariate normal distribution with correlation coefficient $R(X, Y) = \rho$. The statistic

$$Z = \frac{1}{2} \ln \frac{1 + R_{12}}{1 - R_{12}}$$

is called the Fisher $R_{12}$-to-$z$ transformation, and its approximate expected value and variance are

$$E(Z) = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho} + \frac{\rho}{2(n-1)}, \qquad D(Z) = \frac{1}{n-3}.$$

The standardized statistic

$$U = \frac{Z - E(Z)}{\sqrt{D(Z)}}$$

then follows approximately the standard normal distribution $N(0, 1)$.

Theorem 11.9 Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a random sample from a bivariate normal distribution with correlation coefficient $R(X, Y) = \rho$. Let $R_{12}$ be the sample correlation coefficient, let $Z = \frac{1}{2} \ln \frac{1 + R_{12}}{1 - R_{12}}$ be the Fisher $R_{12}$-to-$z$ transformation and let $c \in (-1, 1)$ be a given constant. At the significance level $\alpha$ the null hypothesis $H_0: \rho = c$ is rejected in favour of the alternative hypothesis $H_1$ if the test statistic

$$U = \frac{Z - \frac{1}{2} \ln \frac{1 + c}{1 - c} - \frac{c}{2(n-1)}}{\sqrt{\frac{1}{n-3}}}$$

falls within the critical region $W$.
The critical region depends on the form of the alternative hypothesis, where $u_p$ denotes the $p$-quantile of the standard normal distribution:

for the two-tailed test, $H_1: \rho \neq c$: $W = (-\infty, -u_{1-\alpha/2}] \cup [u_{1-\alpha/2}, \infty)$
for the left-tailed test, $H_1: \rho < c$: $W = (-\infty, -u_{1-\alpha}]$
for the right-tailed test, $H_1: \rho > c$: $W = [u_{1-\alpha}, \infty)$

Example 11.10 The iron content was determined by two analytical methods in a sample of 600 iron ore specimens; the sample correlation coefficient was $R_{12} = 0.85$. The technical literature states that the correlation coefficient between the two methods is $\rho = 0.9$. At the significance level 0.05 test $H_0: \rho = 0.9$ against $H_1: \rho \neq 0.9$.

Solution: seminar session.

The statistic $U$ can also be used to find confidence intervals for $\rho$. First the limits for the quantity $\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ are derived; then these limits are transformed into limits for $\rho$ using the hyperbolic tangent.

Theorem 11.11 Let the assumptions of Theorem 11.9 hold. Then the $100(1 - \alpha)\%$ confidence interval for the expression $\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ has the form

$$\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho} \in \left( Z - \frac{u_{1-\alpha/2}}{\sqrt{n-3}},\; Z + \frac{u_{1-\alpha/2}}{\sqrt{n-3}} \right)$$

with approximate probability $1 - \alpha$. The $100(1 - \alpha)\%$ confidence interval for the parameter $\rho$ has the form

$$\rho \in \left( \operatorname{tgh}\left( Z - \frac{u_{1-\alpha/2}}{\sqrt{n-3}} \right),\; \operatorname{tgh}\left( Z + \frac{u_{1-\alpha/2}}{\sqrt{n-3}} \right) \right)$$

with approximate probability $1 - \alpha$.

Remark 11.12 $\operatorname{tgh}(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ for $x \in \mathbb{R}$.

Example 11.13 A human resources officer of a particular firm is interested in the relationship between the number of days of absence due to illness per year (variable $Y$) and the age of the employee (variable $X$). Data on 10 randomly drawn employees were therefore collected.

Employee          1   2   3   4   5   6   7   8   9  10
Age X            27  61  37  23  46  58  29  36  64  40
Absence days Y   15   6  10  18   9   7  14  11   5   8

Under the assumption that $(X, Y)$ follows a bivariate normal distribution, carry out the following tasks:
a) Calculate the sample correlation coefficient.
b) At the significance level 0.05 test the hypothesis that $X$ and $Y$ are independent.
c) Determine the 95% confidence interval for the correlation coefficient $\rho$.

Solution: seminar session.

Remark 11.14 We may have two sample correlation coefficients $R_{12}$, $R'_{12}$ corresponding to samples from two independent bivariate normal distributions.
The question to be asked is: "Do both of these sample correlation coefficients represent populations having the same true value of the correlation coefficient, $\rho = \rho'$?" The following theorem addresses this question.

Theorem 11.15 Let two independent bivariate normal samples of sizes $n$ and $n'$ with correlation coefficients $\rho$, $\rho'$ be given. Let $R_{12}$, $R'_{12}$ be the sample correlation coefficients and $Z$, $Z'$ the corresponding Fisher transformations. At the significance level $\alpha$ the null hypothesis $H_0: \rho = \rho'$ is rejected in favour of the alternative hypothesis $H_1$ if the test statistic

$$U = \frac{Z - Z'}{\sqrt{\frac{1}{n-3} + \frac{1}{n'-3}}}$$

falls within the critical region $W$. The critical region depends on the form of the alternative hypothesis:

for the two-tailed test, $H_1: \rho \neq \rho'$: $W = (-\infty, -u_{1-\alpha/2}] \cup [u_{1-\alpha/2}, \infty)$
for the left-tailed test, $H_1: \rho < \rho'$: $W = (-\infty, -u_{1-\alpha}]$
for the right-tailed test, $H_1: \rho > \rho'$: $W = [u_{1-\alpha}, \infty)$

Example 11.16 A medical study observed the concentrations of substances A and B in the urine of patients with a particular kidney illness. In a sample of 100 healthy individuals the sample correlation coefficient between the concentrations of A and B was 0.65. In a sample of 142 individuals with the kidney illness the sample correlation coefficient was 0.37. At the significance level 0.05 test the hypothesis that the true correlation coefficients are equal.

Solution: seminar session.
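The two-tailed test of Theorem 11.15 can be sketched in a few lines of Python, using the figures of Example 11.16 as input. This is an illustrative sketch rather than the seminar solution: the function names `fisher_z` and `compare_correlations` are assumptions introduced here, and the normal quantile is taken from the standard library class `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher transformation Z = 0.5 * ln((1 + r) / (1 - r)) of a sample correlation r."""
    return 0.5 * math.log((1 + r) / (1 - r))


def compare_correlations(r1, n1, r2, n2, alpha=0.05):
    """Two-tailed test of H0: rho = rho' for two independent bivariate normal samples.

    Returns the statistic U, the quantile u_{1 - alpha/2}, and whether U lies in W.
    """
    u = (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    u_crit = NormalDist().inv_cdf(1 - alpha / 2)  # quantile u_{1 - alpha/2} of N(0, 1)
    return u, u_crit, abs(u) >= u_crit            # W is closed at its endpoints


# Inputs from Example 11.16: healthy group (n = 100, r = 0.65), ill group (n' = 142, r' = 0.37).
u, u_crit, reject = compare_correlations(0.65, 100, 0.37, 142)
print(f"U = {u:.3f}, critical value = {u_crit:.3f}, reject H0: {reject}")
```

With these inputs the statistic comes out at roughly $U \approx 2.92$, above $u_{0.975} \approx 1.96$, so under the stated assumptions the sketch indicates rejecting $H_0$ at the 0.05 level. The one-tailed variants would compare $U$ with $\pm u_{1-\alpha}$ instead.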