11 Introduction to correlation analysis
When processing data we are often interested in the relationship between two variables. If they are
not independent, we are also interested in the strength of that relationship. The relationship between two
interval-scaled or ratio-scaled random variables is studied by correlation analysis. (Regression
analysis and correlation analysis address similar tasks. Regression analysis works with
one dependent variable and one or more independent variables; correlation analysis
measures the strength of the relationship between two variables of equal standing.)
In this chapter only linear relationships are treated and the bivariate normal distribution is assumed.
Remark 11.1
Let us recall the definition and properties of the correlation coefficient.
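$$
R(X, Y) =
\begin{cases}
E\left[\dfrac{X - E(X)}{\sqrt{D(X)}} \cdot \dfrac{Y - E(Y)}{\sqrt{D(Y)}}\right] & \text{for } \sqrt{D(X)}\,\sqrt{D(Y)} > 0, \\[2ex]
0 & \text{otherwise.}
\end{cases}
$$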
The properties:
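1. $R(X, Y) = \begin{cases} \dfrac{C(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}} & \text{for } \sqrt{D(X)}\,\sqrt{D(Y)} > 0, \\ 0 & \text{otherwise} \end{cases}$
2. $R(X, X) = \begin{cases} 1 & \text{for } D(X) \neq 0, \\ 0 & \text{otherwise} \end{cases}$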
3. R(X, Y ) = R(Y, X)
4. $-1 \le R(X, Y) \le 1$
5. If $R(X, Y) = 1$, then there exist constants $a, b \in \mathbb{R}$, $b > 0$, such that $P(Y = a + bX) = 1$;
if $R(X, Y) = -1$, then there exist constants $a, b \in \mathbb{R}$, $b < 0$, such that $P(Y = a + bX) = 1$.
6. R(a + bX, c + dY ) = sgn(bd)R(X, Y )
7. If the random variables X, Y are independent, then R(X, Y ) = 0.
(The reverse implication does not hold in general!)
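The failure of the reverse implication in property 7 can be checked on a small discrete example. The following Python sketch assumes a hypothetical random variable X that takes the values -1, 0, 1 with equal probability and sets Y = X². Both the distribution and the variable names are illustrative choices, not part of the text above.

```python
from fractions import Fraction

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X**2,
# so Y is completely determined by X (a non-linear dependence).
values = [-1, 0, 1]
p = Fraction(1, 3)  # probability of each value of X

e_x = sum(p * x for x in values)          # E(X)
e_y = sum(p * x**2 for x in values)       # E(Y) = E(X^2)
e_xy = sum(p * x * x**2 for x in values)  # E(XY) = E(X^3)

covariance = e_xy - e_x * e_y             # C(X, Y)

# Y = 0 happens exactly when X = 0, so P(X = 0, Y = 0) = P(X = 0) = 1/3,
# while P(X = 0) * P(Y = 0) = 1/9.
p_joint = p
p_product = p * p

print(covariance)            # 0, hence R(X, Y) = 0
print(p_joint == p_product)  # False, so X and Y are not independent
```

Exact fractions are used so that the zero covariance is not blurred by rounding.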
If the relationship between two variables is linear, the correlation coefficient is a
suitable indicator of the strength of this relationship. The closer the value of |R(X, Y )| is to 1, the
stronger the linear relationship between X and Y. Positive values of the correlation coefficient correspond to
a positive slope (positive linear dependence). Negative values of the correlation coefficient
correspond to a negative slope (negative linear dependence). If the random variables are independent,
then the correlation coefficient is equal to zero. [It can also be zero in the case of some non-linear
dependence!]
The population correlation coefficient is usually unknown, since the distribution of the random
vector (X, Y ) is usually unknown. It can, however, be estimated by the sample correlation coefficient.
Definition 11.2
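Let the random sample $\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}, \dots, \begin{pmatrix} X_n \\ Y_n \end{pmatrix}$ follow a bivariate distribution. Let $M_1, M_2$ be the sample means, let
$$
S_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M_1)^2, \qquad
S_2^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - M_2)^2
$$
be the sample variances and let
$$
S_{12} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M_1)(Y_i - M_2)
$$
be the sample covariance. Then
$$
R_{12} = \frac{S_{12}}{S_1 \cdot S_2} \qquad \text{for } S_1 \cdot S_2 > 0
$$
is called the sample correlation coefficient. If $S_1$ or $S_2$ is equal to zero, the sample correlation coefficient is not
defined.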
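Definition 11.2 translates almost line by line into code. The following Python sketch is one possible implementation; the function name `sample_correlation` and the small data set are hypothetical and serve only as an illustration.

```python
import math


def sample_correlation(xs, ys):
    """Return R12 = S12 / (S1 * S2) for paired observations xs, ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    m1 = sum(xs) / n  # sample mean M1
    m2 = sum(ys) / n  # sample mean M2
    s1_sq = sum((x - m1) ** 2 for x in xs) / (n - 1)  # sample variance S1^2
    s2_sq = sum((y - m2) ** 2 for y in ys) / (n - 1)  # sample variance S2^2
    s12 = sum((x - m1) * (y - m2) for x, y in zip(xs, ys)) / (n - 1)  # S12
    if s1_sq == 0 or s2_sq == 0:
        raise ValueError("R12 is not defined when S1 or S2 is zero")
    return s12 / (math.sqrt(s1_sq) * math.sqrt(s2_sq))


# Hypothetical data lying exactly on a line with positive slope.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
r = sample_correlation(xs, ys)
print(round(r, 6))
```

For data lying exactly on an increasing line the result is 1 up to rounding error; reversing the order of `ys` gives -1.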
Remark 11.3
The sample correlation coefficient $R_{12}$ is not an unbiased estimator of the population correlation coefficient
$R(X, Y)$, but for $n > 30$ the bias is negligible. The properties of the sample correlation coefficient
$R_{12}$ parallel those of the population correlation coefficient $R(X, Y)$.
In the following text the random sample $\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}, \dots, \begin{pmatrix} X_n \\ Y_n \end{pmatrix}$ is assumed to come from a bivariate normal distribution.
Theorem 11.4
Let the random vector $(X, Y)$ follow a bivariate normal distribution. Then the random variables X and
Y are independent if and only if the correlation coefficient $\rho = R(X, Y) = 0$. [For the bivariate
normal distribution, independence and uncorrelatedness are equivalent.]
Theorem 11.5
Let $\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}, \dots, \begin{pmatrix} X_n \\ Y_n \end{pmatrix}$ be a random sample from a bivariate normal distribution and let $\rho = 0$. Then the
statistic
$$
T = \frac{R_{12} \sqrt{n - 2}}{\sqrt{1 - R_{12}^2}}
$$
follows the Student t-distribution with $n - 2$ degrees of freedom. The statistic T is used to test
the hypothesis that the random variables X, Y are independent.
Theorem 11.6
For a random sample from a bivariate normal distribution, the null hypothesis $H_0: \rho = 0$ is rejected
at the significance level $\alpha$ in favour of the alternative hypothesis $H_1$ if the test statistic
$$
T = \frac{R_{12} \sqrt{n - 2}}{\sqrt{1 - R_{12}^2}}
$$
falls within the critical region W. The critical regions corresponding to the form of the alternative
hypothesis are:
for the two-tailed test $H_1: \rho \neq 0$: $W = (-\infty, -t_{1-\alpha/2}(n-2)] \cup [t_{1-\alpha/2}(n-2), \infty)$
for the left-tailed test $H_1: \rho < 0$: $W = (-\infty, -t_{1-\alpha}(n-2)]$
for the right-tailed test $H_1: \rho > 0$: $W = [t_{1-\alpha}(n-2), \infty)$
Example 11.7
The scores of eight randomly drawn students in two subjects were recorded.
Student      1   2   3   4   5   6   7   8
Subject 1   80  50  36  58  42  60  56  68
Subject 2   65  60  35  39  48  44  48  61
At the significance level 0.05, test the hypothesis that the results in the two subjects are not
positively correlated.
Solution
seminar session
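One possible way to carry out the test of Example 11.7 numerically is sketched below in Python. The sketch makes several assumptions: the second and third data rows are taken as the scores in the first and second subject, the task is read as $H_0: \rho = 0$ against the right-tailed alternative $H_1: \rho > 0$, and the quantile $t_{0.95}(6) \approx 1.943$ is taken from statistical tables. The variable names are illustrative.

```python
import math

subject_1 = [80, 50, 36, 58, 42, 60, 56, 68]
subject_2 = [65, 60, 35, 39, 48, 44, 48, 61]
n = len(subject_1)

m1 = sum(subject_1) / n
m2 = sum(subject_2) / n

# The factor 1/(n-1) cancels in R12, so plain sums of squares suffice.
sxx = sum((x - m1) ** 2 for x in subject_1)
syy = sum((y - m2) ** 2 for y in subject_2)
sxy = sum((x - m1) * (y - m2) for x, y in zip(subject_1, subject_2))

r12 = sxy / math.sqrt(sxx * syy)
t_stat = r12 * math.sqrt(n - 2) / math.sqrt(1 - r12 ** 2)

T_CRIT = 1.943  # assumed table value of t_{0.95}(n - 2) for n - 2 = 6
reject_h0 = t_stat >= T_CRIT  # right-tailed critical region

print(f"R12 = {r12:.3f}, T = {t_stat:.3f}, reject H0: {reject_h0}")
```

Under these assumptions $R_{12}$ is about 0.67 and T is about 2.19, which lies in the right-tailed critical region, so $H_0$ would be rejected at the 0.05 level.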
The hypothesis $H_0: \rho = 0$ was used to test the independence of two normal variables. Now we are
interested in the strength of the linear relationship. The test statistic of the following test about the correlation
coefficient is built from a particular function of the sample correlation coefficient $R_{12}$, given
by the following theorem.
Theorem 11.8
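Let $\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}, \dots, \begin{pmatrix} X_n \\ Y_n \end{pmatrix}$ be a random sample from a bivariate normal distribution with correlation coefficient
$R(X, Y) = \rho$. The statistic
$$
Z = \frac{1}{2} \ln \frac{1 + R_{12}}{1 - R_{12}}
$$
is called the Fisher $R_{12}$-to-$z$ transformation. Its approximate expected value and variance are
$$
E(Z) = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho} + \frac{\rho}{2(n-1)}, \qquad
D(Z) = \frac{1}{n-3}.
$$
The standardized statistic
$$
U = \frac{Z - E(Z)}{\sqrt{D(Z)}}
$$
then approximately follows the distribution $N(0, 1)$.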
Theorem 11.9
Let $\begin{pmatrix} X_1 \\ Y_1 \end{pmatrix}, \dots, \begin{pmatrix} X_n \\ Y_n \end{pmatrix}$ be a random sample from a bivariate normal distribution with correlation coefficient
$R(X, Y) = \rho$. Let $R_{12}$ be the sample correlation coefficient, let $Z = \frac{1}{2} \ln \frac{1 + R_{12}}{1 - R_{12}}$ be the Fisher $R_{12}$-to-$z$
transformation and let $c \in (-1, 1)$ be a given constant.
At the significance level $\alpha$ the null hypothesis $H_0: \rho = c$ is rejected in favour of the alternative hypothesis
$H_1$ if the test statistic
$$
U = \frac{Z - \dfrac{1}{2} \ln \dfrac{1 + c}{1 - c} - \dfrac{c}{2(n-1)}}{\sqrt{\dfrac{1}{n-3}}}
$$
falls within the critical region W. The critical regions corresponding to the form of the alternative
hypothesis are (here $u_p$ denotes the $p$-quantile of $N(0, 1)$):
for the two-tailed test $H_1: \rho \neq c$: $W = (-\infty, -u_{1-\alpha/2}] \cup [u_{1-\alpha/2}, \infty)$
for the left-tailed test $H_1: \rho < c$: $W = (-\infty, -u_{1-\alpha}]$
for the right-tailed test $H_1: \rho > c$: $W = [u_{1-\alpha}, \infty)$
Example 11.10
The iron content was determined in an iron ore sample of size 600 by two analytical methods;
the sample correlation coefficient was $R_{12} = 0.85$. The technical literature states that the correlation
coefficient between the two methods is $\rho = 0.9$. At the significance level 0.05, test
$H_0: \rho = 0.9$ against $H_1: \rho \neq 0.9$.
Solution
seminar session
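The test of Example 11.10 can be sketched in Python as follows. This is only an illustration of the statistic U from Theorem 11.9. It relies on the identity that the Fisher transformation equals the inverse hyperbolic tangent (`math.atanh`) and on `statistics.NormalDist` from the standard library for the normal quantile; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 600       # sample size
r12 = 0.85    # sample correlation coefficient
c = 0.9       # hypothesised value of rho under H0
alpha = 0.05  # significance level

# Fisher transformation: atanh(r) = 0.5 * ln((1 + r) / (1 - r))
z = math.atanh(r12)

# Approximate expected value of Z under H0 and the standardized statistic U
expected_z = math.atanh(c) + c / (2 * (n - 1))
u_stat = (z - expected_z) / math.sqrt(1 / (n - 3))

# Two-tailed critical region uses the quantile u_{1 - alpha/2}
u_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(u_stat) >= u_crit

print(f"U = {u_stat:.3f}, critical value = {u_crit:.3f}, reject H0: {reject_h0}")
```

With these inputs U is about -5.3, well inside the two-tailed critical region, so the hypothesis $\rho = 0.9$ would be rejected.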
The statistic U can be used to find confidence intervals for $\rho$. First the limits for the quantity
$\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ are derived; these limits are then transformed into limits for $\rho$ using the hyperbolic tangent.
Theorem 11.11
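Let the assumptions of Theorem 11.9 hold. Then the $100(1 - \alpha)\%$ confidence interval
• for the expression $\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ has the form
$$
\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho} \in \left( Z - \frac{u_{1-\alpha/2}}{\sqrt{n-3}},\; Z + \frac{u_{1-\alpha/2}}{\sqrt{n-3}} \right)
$$
with approximate probability $1 - \alpha$;
• for the parameter $\rho$ has the form
$$
\rho \in \left( \operatorname{tgh}\!\left( Z - \frac{u_{1-\alpha/2}}{\sqrt{n-3}} \right),\; \operatorname{tgh}\!\left( Z + \frac{u_{1-\alpha/2}}{\sqrt{n-3}} \right) \right)
$$
with approximate probability $1 - \alpha$.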
Remark 11.12
$\operatorname{tgh}(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ for $x \in \mathbb{R}$.
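The hyperbolic tangent appears here because it is the inverse of the Fisher transformation. A short derivation, writing $r$ for a value in $(-1, 1)$ and $z$ for its transform (both symbols are used only for this illustration):

$$
z = \frac{1}{2} \ln \frac{1 + r}{1 - r}
\;\Longrightarrow\;
e^{2z} = \frac{1 + r}{1 - r}
\;\Longrightarrow\;
r = \frac{e^{2z} - 1}{e^{2z} + 1} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \operatorname{tgh}(z).
$$

Since the hyperbolic tangent is an increasing function, applying it to both limits of an interval for $\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ preserves the inequalities and yields an interval for $\rho$.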
Example 11.13
An officer in the human resources department of a firm is interested in the relationship between
the number of days of absence due to illness per year (variable Y ) and the age of the employee (variable X).
Data on 10 randomly drawn employees were therefore collected.
Employee              1   2   3   4   5   6   7   8   9  10
Age (X)              27  61  37  23  46  58  29  36  64  40
Days of absence (Y)  15   6  10  18   9   7  14  11   5   8
Under the assumption that $\begin{pmatrix} X \\ Y \end{pmatrix}$ follows a bivariate normal distribution, carry out the following tasks:
a) Calculate the sample correlation coefficient.
b) At the significance level 0.05, test the hypothesis that X and Y are independent.
c) Determine the 95% confidence interval for the correlation coefficient $\rho$.
Solution
seminar session
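The three tasks of Example 11.13 can be sketched in Python as follows. The sketch assumes that the second data row contains the ages (X) and the third the days of absence (Y), that task b) is the two-tailed test of Theorem 11.6, and that the quantile $t_{0.975}(8) \approx 2.306$ is taken from statistical tables. The variable names are illustrative.

```python
import math
from statistics import NormalDist

age = [27, 61, 37, 23, 46, 58, 29, 36, 64, 40]  # X
absence = [15, 6, 10, 18, 9, 7, 14, 11, 5, 8]   # Y
n = len(age)

m1 = sum(age) / n
m2 = sum(absence) / n
sxx = sum((x - m1) ** 2 for x in age)
syy = sum((y - m2) ** 2 for y in absence)
sxy = sum((x - m1) * (y - m2) for x, y in zip(age, absence))

# a) sample correlation coefficient (the factor 1/(n-1) cancels)
r12 = sxy / math.sqrt(sxx * syy)

# b) two-tailed t-test of H0: rho = 0
t_stat = r12 * math.sqrt(n - 2) / math.sqrt(1 - r12 ** 2)
T_CRIT = 2.306  # assumed table value of t_{0.975}(n - 2) for n - 2 = 8
reject_independence = abs(t_stat) >= T_CRIT

# c) 95% confidence interval for rho via the Fisher transformation
z = math.atanh(r12)
half_width = NormalDist().inv_cdf(0.975) / math.sqrt(n - 3)
lower = math.tanh(z - half_width)
upper = math.tanh(z + half_width)

print(f"a) R12 = {r12:.3f}")
print(f"b) T = {t_stat:.3f}, reject independence: {reject_independence}")
print(f"c) 95% CI for rho: ({lower:.3f}, {upper:.3f})")
```

Under these assumptions the sample correlation is strongly negative (about -0.93), independence would be rejected, and the confidence interval lies entirely below zero.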
Remark 11.14
We may have two sample correlation coefficients $R_{12}$, $R_{12}^{*}$ corresponding to two independent bivariate
normal distributions. The question to be asked is: "Do both of these sample correlation coefficients
represent populations having the same true value of the correlation coefficient, $\rho = \rho^{*}$?" The following
theorem addresses this question.
Theorem 11.15
Two independent bivariate normal samples of sizes $n$ and $n^{*}$ with correlation coefficients $\rho$, $\rho^{*}$ are
given. Let $R_{12}$, $R_{12}^{*}$ be the sample correlation coefficients and let $Z$, $Z^{*}$ be the corresponding Fisher
transformations.
At the significance level $\alpha$ the null hypothesis $H_0: \rho = \rho^{*}$ is rejected in favour of the alternative
hypothesis $H_1$ if the test statistic
$$
U = \frac{Z - Z^{*}}{\sqrt{\dfrac{1}{n-3} + \dfrac{1}{n^{*}-3}}}
$$
falls within the critical region W. The critical regions corresponding to the form of the alternative
hypothesis are:
for the two-tailed test $H_1: \rho \neq \rho^{*}$: $W = (-\infty, -u_{1-\alpha/2}] \cup [u_{1-\alpha/2}, \infty)$
for the left-tailed test $H_1: \rho < \rho^{*}$: $W = (-\infty, -u_{1-\alpha}]$
for the right-tailed test $H_1: \rho > \rho^{*}$: $W = [u_{1-\alpha}, \infty)$
Example 11.16
A medical study observed the concentrations of substances A and B in the urine of patients with
a particular kidney illness. In a sample of 100 healthy individuals the sample correlation coefficient
between the concentrations of A and B was 0.65. In a sample of 142 individuals with the kidney
illness the sample correlation coefficient was 0.37. At the significance level 0.05, test the hypothesis
that the true correlation coefficients are equal.
Solution
seminar session
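The comparison of the two correlation coefficients in Example 11.16 can be sketched in Python as follows. The sketch assumes the two-tailed form of the test from Theorem 11.15 and uses `math.atanh` as the Fisher transformation; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n_healthy, r_healthy = 100, 0.65  # healthy individuals
n_ill, r_ill = 142, 0.37          # individuals with the kidney illness
alpha = 0.05

# Fisher transformations of both sample correlation coefficients
z_healthy = math.atanh(r_healthy)
z_ill = math.atanh(r_ill)

# Test statistic for H0: both population correlation coefficients are equal
u_stat = (z_healthy - z_ill) / math.sqrt(1 / (n_healthy - 3) + 1 / (n_ill - 3))

# Two-tailed critical region uses the quantile u_{1 - alpha/2}
u_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(u_stat) >= u_crit

print(f"U = {u_stat:.3f}, critical value = {u_crit:.3f}, reject H0: {reject_h0}")
```

With these inputs U is about 2.9, which exceeds the critical value of about 1.96, so the hypothesis of equal correlation coefficients would be rejected.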