12 The relationship between two variables on the nominal
scale and the ordinal scale
At the Nominal scale one uses codes assigned to objects as labels; for example, "method of payment"
can be generally categorized as (1) payment on account, (2) payment in cash, (3) payment by cheque
and it has three categories (levels). For this scale or category some valid operations are equivalence
and set membership. This is also called a categorical variable. The first part of this chapter is
aimed to perform tests which determine whether two categorical variables are independent or not;
consequently we may be interested in the degree of association between considered two variables
which can be assessed by some coefficients.
The relationship between ordinal variables may also be represented in contingency tables, though
this is less often done since we have more efficient tests for ordinal variables, which are performed in
the second part of this chapter. In ordinal scale type, the numbers assigned to objects represent the
rank order (1st, 2nd, 3rd etc.) of the entities assessed. An example of ordinal measurement is the
results of a horse race, which say only which horses arrived first, second, third, etc. but include no
information about times. The central tendency of an ordinal variable can be represented by its mode
or its median, but the mean cannot be defined. For this scale the linear order is the valid relation.
The relationship between two nominal variables and related tests
Definition 12.1
Let X, Y be nominal random variables, where X has r categories: x[1], . . . , x[r] and Y has s categories:
y[1], . . . , y[s]. Consider the random sample X1
Y1
, . . ., Xn
Yn
from the distribution of X
Y
. Let us denote
as njk the joint frequency of the pair of categories (x[j], y[k]). The table of joint frequencies njk,
j = 1, . . . , r; k = 1, . . . , s is called contingency table.
y[k] y[1] . . . y[s] nj
x[j] njk
x[1] n11 . . . n1s n1
...
...
...
...
x[r] nr1 . . . nrs nr
nk n1 . . . ns n
The frequencies nj =
s
k=1
njk, nk =
r
j=1
njk are called marginal frequencies.
Theorem 12.2
Let us consider test: H0 : Variables X, Y are independent against H1 : Variables X, Y are not
independent. If H0 is true then the test statistic
K =
r
j=1
s
k=1
njk -
njnk
n
2
njnk
n
 2
((r - 1)(s - 1)).
[K is said to follow asymptotic 2
distribution with (r - 1)(s - 1) degrees of freedom.]
At the asymptotic significance level  the null hypothesis of independence is rejected in favor of
alternative hypothesis H1, if the realization of the test statistic satisfies the condition K > 2
1-((r -
1)(s - 1)). Thus the critical region W = 2
1-((r - 1)(s - 1)), ).
59
Remark 12.3
The statistic K is distributed approximately as 2
with (r - 1)(s - 1) degrees of freedom provided
that for the expression
njnk
n
it holds:
at least in 80% of cases
njnk
n
 5
at most in 20% of cases
njnk
n
 2.
If not then the pooling of appropriate categories (to attain large expected frequencies
njnk
n
) is
recommended.
Remark 12.4
Let pjk = P(X = x[j]  Y = y[k]) pj =
s
k=1
pjk pk =
r
j=1
pjk
The variables X and Y are independent if and only if the multiplicative relationship pjk = pj  pk
is true. The test statistic of independence of X and Y follows the idea that the difference between
the joint frequencies and the expected joint frequencies, if independence is really true, should be
"very small". As the marginal distributions pj, pk are usually unknown, we estimate them through
marginal frequencies: ^pj =
nj
n
and ^pk = nk
n
. Thus the expected joint frequencies n  pj  pk can be
estimated by frequencies n 
nj
n
 nk
n
=
njnk
n
.
Great differences between joint frequencies and estimated frequencies, which are expected under
independence, bring evidence against the null hypothesis. Thus the critical region is concentrated at
upper tail of 2
distribution. (That is this is an upper tailed test only.)
To determine the degrees of freedom we must reduce the number r  s of summands in double
sum
r
j=1
s
k=1
with respect to the conditions for both marginal distributions:
r
j=1
pj = 1 and
s
k=1
pk = 1.
Thus the first sum has r - 1 independent summands and the second sum has s - 1 independent
summands. Hence the double sum has (r - 1)  (s - 1) independent summands. 2
Definition 12.5
The degree of association between two nominal random variables X, Y is measured by Cramer's
coefficient
V = K
n(m-1)
, where m = min{r, s}. 2
Cramer's coefficient is a monotone function of the statistic K. It's values range from 0 (corresponding
to no association between the variables) to 1 (complete association).
Example 12.6
A sociological survey processed data about 360 students: the social origin and the type of school were
recorded. The results of the survey are as shown in the table below:
Social origin I II III IV nj
Type of schoole njk
university 50 30 10 50 140
polytechnic 30 50 20 10 110
economic 10 20 30 50 110
nk 90 100 60 110 360
At the asymptotic significance level 0,05 carry out the test that the variables type of school and
social origin are independent. Then determine the degree of association.
Solution
seminar session
Definition 12.7
The special case of 2 × 2 contingency table is called fourfold table; thus r = s = 2 and the joint
60
frequencies are commonly denoted as follows: n11 = a; n12 = b; n21 = c; n22 = d.
y[k] y[1] y[2] nj
x[j] njk
x[1] a b a + b
x[2] c d c + d
nk a + c b + d n 2
There are three available independence tests of the fourfold tables:
1.) Asymptotic 2
test.
This test suffers from the disadvantage that if below stated conditions are not satisfied, then
the pooling of categories is not possible and the test statistic is not distributed approximately
as chi2
.
2.) Asymptotic odds ratio test.
This test is based on the test statistic OR (odds ratio) which can be instrumental to measure
the degree of association. It can be used for sufficiently large frequencies.
3.) Fisher's exact test.
If the assumptions of previously mentioned tests do not hold, Fisher's test can be used. Character
of this test is discrete.
Theorem 12.8
Testing independence between two nominal variables in fourfold tables the test statistic K from the
theorem 12.2 can be rearranged into the form
K =
n(ad - bc)2
(a + b)(c + d)(a + c)(b + d)
If H0 is true then K  2
(1).
Remark 12.9
The statistic K is distributed approximately as 2
with 1 degree of freedom provided that following
conditions hold: a + b > 5; c + d > a+c
3
.
Example 12.10
Consider 135 applicants for particular university education. Suppose one random variable is the
impression upon entrance examination committee and the other random variable is the faculty entrance.
At the asymptotic level 0.05 carry out the test that the entrance and the impression are not
associated.
impression good bad nj
entrance njk
yes 17 11 28
no 39 58 97
nk 56 69 125
Solution
seminar session
61
Remark 12.11
The fourfold tabs can be treated in a different way based on following idea. Particular experiment,
which has two outcomes, can be carried in two groups. Thus the appropriate scheme is 2 × 2 contingency
table, where X has two categories: success and failure and Y has two categories: group I
and group II. (These groups might be men and women, an experimental group and a control group,
or any other dichotomous classification.)
Group I II nj
Outcomes njk
success a b a + b
failure c d c + d
nk a + c b + d n
The odds for column I are a
c
and for column II are b
d
. If the "group" does not have an impact on the
outcome of an experiment then the ratio of the two odds
a
c
b
d
is equal to one.
Definition 12.12
Considering the fourfold table the statistic OR =
a
c
b
d
= ad
bc
is called odds ratio.
The constant o = p11p22
p12p21
is called theoretic odds ratio.
Remark 12.13
If the variables X, Y are independent then pjk = pjpk and theoretic odds ratio o = 1. The further
from unit o is, the greater the dependence is. Under the condition that values in tab are non-zero,
the value of o is within the interval (0, ). Thus o values are not distributed symmetrically with
respect to one and log odds ratios ln o and ln OR are used.
Theorem 12.14
Let us consider the fourfold table for two nominal random variables X, Y .
The statistic U = ln OR-ln o1
a
+ 1
b
+ 1
c
+ 1
d
 N(0, 1).
At the asymptotic significance level  the null hypothesis
H0 : ln o = 0 [which is equivalent with independence of X, Y ] is rejected in favor of alternative
hypothesis H1, if the realization of the test statistic
U = ln OR1
a
+ 1
b
+ 1
c
+ 1
d
falls within the critical region W. According to the form of the alternative hypothesis
the list of corresponding critical regions follows :
pro oboustr. alt. H1 : ln o = 0 je W = (-, -u1-/2 u1-/2, )
pro levostr. alt. H1 : ln o < 0 je W = (-, -u1pro
pravostr. alt. H1 : ln o > 0 je W = u1-, ) 2
Notice that ln o > 0 implies that the event is more likely in the first group. ln o < 0 implies that
the event is less likely in the first group.
Theorem 12.15
Let us consider the fourfold table for two nominal random variables X, Y .
The asymptotic 100(1 - )% confidence interval for the theoretic odds ratio has the limits :
d = eln OR-
1
a
+ 1
b
+ 1
c
+ 1
d
u1-/2
h = eln OR+
1
a
+ 1
b
+ 1
c
+ 1
d
u1-/2
2
Null hypothesis of independence between X, Y [which is equivalent with o = 1] is rejected if the
62
asymptotic confidence interval for the theoretic odds ratio does not cover the value 1.
Example 12.16
Using the data from 12.10 calculate and interpret the odds ratio, construct the asymptotic confidence
interval for the theoretic odds ratio and test hypothesis that the faculty entrance and impression upon
committee are non-associated.
Solution
seminar session
Remark 12.17
An elaborated description of Fisher's exact test exceeds this course. Just short remark: this test of
independence is exact and it can therefore be used regardless of the sample characteristics; it is based
on odds ratios and could be one-tailed as well as two-tailed.
The relationship between two ordinal variables and related tests
The version of correlation performed in the 11th chapter applies to those cases where the values of X
and of Y are both measured on an equal- interval scale. It is also possible to apply the apparatus of
linear correlation to cases where X and Y are measured on a merely ordinal scale. When applied to
ordinal data, the measure of correlation is spoken of as the Spearman's rank correlation coefficient.
It assesses how well an arbitrary monotonic function could describe the relationship between two
variables, without making any other assumptions about the particular nature of the relationship
between the variables.
Definition 12.18
Let X, Y be ordinal random variables. Consider the random sample X1
Y1
, . . . , Xn
Yn
from the continuous
distribution of the vector X
Y
. Let Ri stands for the rank of Xi and Qi stands for the rank of
Yi; i = 1, 2, . . ., n. The statistic
rS = 1 -
6
n(n2 - 1)
n
i=1
(Ri - Qi)2
,
serves as a measure of the rank-order correlation between X and Y and is called Spearman's rank
correlation coefficient.
Remark 12.19
The values of Spearman's rank correlation coefficient are from the interval -1, 1 , where +1 corresponds
to the perfect positive relationship; -1 to the perfect negative relationship and 0 to no relationship
(monotonic). (We are speaking of a positive relationship in case "the more of X, the more of
Y "; and in case "the more of X, the less of Y " we are speaking of a negative relationship between
the two variables.) rS is the classic correlation coefficient applied on ranks Ri, Qi instead of original
variables Xi, Yi, thus from 11.2 follows 12.18. This formula is derived under the assumption of the
continuous distribution of the vector X
Y
that is only rankings without ties may occur. And finally
if the assumption of bivariate normality is not met in tests from the 11th chapter, Spearman's rank
correlation coefficient may be used.
Theorem 12.20
Let X, Y be ordinal random variables. Consider the random sample X1
Y1
, . . . , Xn
Yn
from the continuous
distribution of the vector X
Y
At the significance level  the null hypothesis H0 : "There is no
relationship between X and Y " is rejected in favor of the alternative hypothesis H1 if the realization
63
of the test statistic Spearman's rank correlation coefficient rS falls within the critical region W.
According to the form of the alternative hypothesis the list of corresponding critical regions follows:
two-tailed test H0 : There exist relationship betw. X and Y. W = -1, -rS,1-(n) rS,1-(n), 1
left-tailed test H1 : The relationship betwen X, Y is negative. W = -1, -rS,1-2(n)
right-tailed test H1 : The relationship betwen X, Y is positive. W = rS,1-2(n), 1
where rS,1-(n) is tabulated critical value for given  and usually n = 5, 6, . . ., 30. (For larger size of
random sample there are asymptotic statistics.)
Theorem 12.21
Let the assumptions and formulation of the null hypothesis from 12.20 hold. Further let n > 30 and
H0 is true. Then the test statistic U0 follows the standard normal distribution
U0 = rS

n - 1  N(0, 1)
and the critical region has the form W = (-, -u1-/2 u1-/2, ). Hypothesis about no
relationship between X, Y is rejected in two-tailed test if the realization of U0  W.
Example 12.22
Conditions of seven patients after particular surgery were assessed by two physicians. The highest
score obtained that patient, whose condition was most serious.
patient's index 1 2 3 4 5 6 7
The 1st physician's assessment 4 1 6 5 3 2 7
The 2nd physician's assessment 4 2 5 6 1 3 7
Calculate the Spearman's rank correlation coefficient rS and at the confidence level 0.05 carry out
the test that there is no relationship between considered assessments.
Solution
seminar session
64