STATISTICAL ASSOCIATION

The purpose of this chapter is to explain the basic meaning of statistical association with its important features (link, tendency, prediction, and strength), and then to see how statistical association is detected and measured depending on the level of measurement of the variables involved. The interpretation of statistical association as a qualitative relationship between the variables (explanation, possible causal factor, spurious association or other) is briefly discussed.

After studying this chapter, the student should know:

• the concept of statistical association and the fundamental aspects of a statistical association (link, tendency, prediction, strength);
• how to analyze association, depending on the measurement level of the variables;
• how to produce and read a two-way table manually and with SPSS;
• how to produce and interpret a coefficient of correlation and a scatter plot;
• how to compare the means of various subgroups on a variable;
• how to interpret a regression line, estimated scores, and errors in estimation;
• how to use the regression equation to predict a dependent variable;
• the difference between a statistical association and a relationship between variables;
• how to distinguish between the notions of explanation, causal factor, and spurious relationship.

The concept of statistical association is fundamental in research methodology. This concept allows us to formulate a clear notion of a link between variables when we notice that the scores of one individual on two different variables may somehow be related. But what do we mean by the word related? And how do we decide whether scores are related or not? Does it have to apply to every individual? Are there exceptions to such relationships? What is the real meaning of statistical association? Does it mean that one factor is the cause of the other? The notion of statistical association is quite abstract and it may be fuzzy for now.
But we will gradually develop a detailed understanding of what it means. Let us start with several examples.

• A teacher may notice that students who have good grades in mathematics tend to have good grades in physics as well.
• A doctor may notice that her female patients tend to be more resistant to certain kinds of infections than her male patients.
• A market study may demonstrate that people who like classical music tend to appreciate going to the opera more than those who do not like classical music do.

What do these statements mean exactly? Let us examine the first of our examples, which deals with the relationship between grades in mathematics and in physics. Suppose we have a class with the grades listed in Table 8.1.

Table 8.1
Student number   Grade in mathematics   Grade in physics
 1               75                     77
 2               …                      66
 3               45                     52
 4               …                      51
 5               …                      89
 6               …                      73
 7               …                      …
 8               …                      …
 9               …                      79
10               74                     72
11               …                      73
12               …                      71
13               …                      …
14               …                      …
15               …                      …
16               …                      …
17               …                      …
18               …                      …
19               …                      …
20               67                     …
21               73                     75

If we were to plot a scatter diagram of these grades in the two disciplines, we would get Figure 8.1.

Figure 8.1 Grades in mathematics and physics for a high school class (horizontal axis: Mathematics)

INTERPRETING QUANTITATIVE DATA WITH SPSS

Each dot represents one individual; the position of the dot with respect to the X-axis gives the grade of the individual in mathematics, and its position with respect to the Y-axis gives his or her grade in physics. Now we can identify several features in this diagram:

• When an individual scores low in mathematics, he/she tends to score low in physics as well.
• When he/she scores high in mathematics, he/she tends to score high in physics.
• Individuals whose score is close to the average in mathematics also tend to score close to the average in physics.
• The preceding remarks reflect a tendency and not a rule.
You may have noticed that we always say that individuals who score in a certain way in mathematics tend to score in a certain way in physics. We can see that one individual does not fit the pattern outlined above, as this individual has a high grade in mathematics but a low grade in physics. This is why we talk about a tendency and not a rule.
• The notion of prediction is very important when we have a statistical association. If we know that somebody got a good grade in mathematics, we can predict, without knowing it, that his grade in physics is likely to be high. We see from the diagram above that we are right most of the time, but not all the time. Some individuals do not fit the pattern. This is why we use words like 'is likely to'. Predictions based on statistical association include a certain amount of error, in the sense that the predicted score differs from the real score by a certain amount, which is called the error. Such predictions also include a certain amount of risk, in the sense that there is a chance we are completely off track (as would be the case if we tried to predict the grade in physics of the individual who got a good grade in math but a poor grade in physics).
• The notions of dependent and independent variables are used in this context. The dependent variable is what is to be explained, or what is to be predicted. The independent variable is the explanatory variable, or the variable used to make the prediction. In the example of the grades, the grade in mathematics is the independent variable and the grade in physics is the dependent variable. These two notions are not intrinsic to the variables, and the positions of dependent and independent variable could be interchanged, as we may want to see whether the grade in physics predicts the grade in mathematics with some accuracy.
• There are ways of measuring how strong an association is.
The notion of strength of an association is related to that of prediction: if an association is strong, predictions based on it will tend to be good and will involve a small error. But if the association is weak, predictions based on it will often be way out and involve large errors.
• The real concern here is to see whether there is some deep reason why people who perform well in mathematics also tend to perform well in physics. In some cases such a deep relationship exists, and in some others the statistical association is not indicative of a deep relation. Settling the issue of the existence of a relationship between variables is the real reason why we study statistical association. For the time being, let us remember that the existence of a statistical association is not a sufficient reason to say that there is a deep link between two variables.

The features outlined above express the essence of the notion of statistical association. But what if the variables are not quantitative? What does statistical association mean then? We will have to develop this notion separately for the various levels of measurement, and then draw some general conclusions. We will start by examining the case of two quantitative variables more closely.

The Case of Two Quantitative Variables

Let us suppose we have two quantitative variables, such as the grades of a class of students in mathematics and in physics in the example given above. We will denote the first one by X and the second one by Y. The grades of the various individuals in mathematics will be referred to as x_1, x_2, x_3, etc. and in physics as y_1, y_2, y_3, etc. When we want to talk about an individual in general, without saying which case this is, we will use the letter i. The situation is summarized in Table 8.2.

Table 8.2
Variable name          Symbol used   Entries are denoted by   General entry denoted by
Grade in Mathematics   X             x_1, x_2, x_3, etc.      x_i
Grade in Physics       Y             y_1, y_2, y_3, etc.      y_i

Now we can start looking at the situation in more detail. Suppose the first student in the list has obtained 75 out of 100 in mathematics, and 77 out of 100 in physics, that is, x_1 = 75 and y_1 = 77. This individual will be represented by the dot whose coordinates are (75, 77).

By looking at the scatter diagram shown in Figure 8.1, we can see a pattern. All the dots tend to fall on or near a straight line, called the regression line, shown in Figure 8.2. This regression line represents the trend displayed by the dots. It can be described precisely by a mathematical equation (shown at the top of the diagram). It can be used to predict the expected score in physics if the score of an individual in mathematics is known. On the diagram, you can see that somebody who scores 85 in mathematics is expected to score around 82 in physics: this is what the regression line suggests visually. If we want to calculate that predicted score more precisely, we can use the mathematical equation shown in the diagram, replacing x by the value 85. In this equation, y is the predicted value corresponding to a grade x in mathematics. This is what we get:

y = 11.523 + 0.83757x

Figure 8.2 Grades in mathematics and physics for a high school class (horizontal axis: Mathematics)

If we replace x by 85 we get:

predicted value of y = 11.523 + 0.83757 (85) ≈ 82.72

or 83 if we round to the nearest whole number. You will notice that this is the predicted value. It is the expected score of the individual. Thus, the regression line and its equation allow us to predict the score in physics of an individual whose score in mathematics is known. Some individuals' real score will be slightly above or slightly below the expected value. In one of the cases shown in Figure 8.2, the expected score will be very different from the real score: this is the case of the individual represented by the dot on the lower right of the diagram.

But how good are these predictions generally?
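The prediction step described above, and the error it leaves for an individual whose real score is known, can be sketched in a few lines of Python. This is only an illustration: the two coefficients are copied from the regression equation quoted in the text, while the function names `predict_physics` and `prediction_error` are hypothetical, and taking the error as real score minus predicted score is an assumed sign convention.

```python
# Coefficients copied from the regression equation quoted in the text.
INTERCEPT = 11.523
SLOPE = 0.83757


def predict_physics(math_grade):
    """Expected physics grade for a given mathematics grade."""
    return INTERCEPT + SLOPE * math_grade


def prediction_error(math_grade, real_physics_grade):
    """Real score minus predicted score (assumed sign convention)."""
    return real_physics_grade - predict_physics(math_grade)


# Prediction for a student who scored 85 in mathematics.
print(round(predict_physics(85), 2))

# Error for the first student in the list, whose grades are (75, 77).
print(round(prediction_error(75, 77), 2))
```

For the first student, the predicted physics grade is a little over 74 while the real grade is 77, so the error is under three points: this individual lies close to the regression line.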
Can we measure how good they are? The answer is yes. To understand it, consider the situation of one individual, illustrated by Figure 8.3. If the individual is far away from the regression line, using the regression line for prediction will yield a large error. But if the individual is close to the regression line, the error in predicting his or her y-score will be small.

Figure 8.3 Individual number i with scores (x_i, y_i)

When we consider the whole population from the point of view of prediction, we get six types of situations shown in Figure 8.4, diagrams (a) to (f).

Figure 8.4 Diagrams (a) to (f)

In diagram (a), the points that form the scatter diagram and that represent individuals are all found to be close to the regression line. In this case, when the y-scores of individuals are predicted from their x-scores, the predictions tend to be generally good. We say in this case that the correlation between the variable X and the variable Y is strong. We used here the word correlation to refer to the statistical association. Indeed, correlation is the term to use when the variables are quantitative. Thus, the statistical association between quantitative variables is called a correlation.

In diagram (b), the points are not that close. We can still predict the y-score of an individual from his or her x-score, but the errors in prediction will tend to be larger than they were in diagram (a). In such a case, we say that the association between X and Y is not very strong.

In diagram (c), we see that the points are scattered far away from the regression line. People with high scores on the variable X ...

r = 0       The correlation is null. Knowing the value of x does not tell us anything about the likely value of y.
r = -0.09   Very weak negative correlation. Poor prediction of y on the basis of knowing x. As r takes larger negative values, the negative correlation gets stronger, the points tend to be closer to the regression line and the predictions are increasingly better.
r = -0.64   The correlation is negative and strong. The points are fairly close to the regression line and the predictions based on it tend to be good.
r = -1      The correlation is perfect and negative. All the points fall exactly on the regression line.

In a survey conducted in a large company, 300 employees were asked whether they socialized with their peers at work at a high level or at a low level, and whether they were planning to look for another job. Their answers were compiled in Table 8.4. Every rectangle in the table is called a cell. The numbers in the cells refer to the frequency of each category, and are called observed frequencies.
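To make the idea of cells and observed frequencies concrete, the sketch below tallies a handful of answers into a two-way table. It is an illustration only: the eight responses are made up and are not the survey data of Table 8.4, and the names `responses` and `two_way_table` are hypothetical.

```python
from collections import Counter

# Made-up answers (NOT the survey data of Table 8.4). Each pair is
# (level of socializing at work, plans to look for another job).
responses = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("high", "no"),
    ("low", "yes"), ("low", "yes"), ("low", "no"), ("low", "yes"),
]


def two_way_table(pairs):
    """Count how many individuals fall into each (row, column) cell."""
    return Counter(pairs)


cells = two_way_table(responses)

# Print the observed frequencies, one row per level of socializing.
rows = ["high", "low"]
cols = ["yes", "no"]
print(f"{'socializing':<12}" + "".join(f"{c:>6}" for c in cols) + f"{'total':>7}")
for r in rows:
    counts = [cells[(r, c)] for c in cols]
    print(f"{r:<12}" + "".join(f"{n:>6}" for n in counts) + f"{sum(counts):>7}")
```

Each count stored in `cells` plays the role of one observed frequency, and the counts over all cells add up to the number of individuals who answered.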