general analysis to determine whether there are overall group differences along these five measures; (2) look at the scale-by-scale analyses of group differences produced in the output and interpret the results accordingly; (3) select contrasts that test the hypothesis that second and third years will score higher than first years on all scales; (4) select tests that compare all groups to each other and briefly compare these results with the contrasts; and (5) carry out a separate analysis in which you test whether a combination of the measures can successfully discriminate the groups (comment only briefly on this analysis). Include only those scales that revealed group differences for the contrasts. How do the results help you to explain the findings of your initial analysis? The data are in the file psychology.dat.

Answers can be found on the companion website.

Further reading

Bray, J. H., & Maxwell, S. E. (1985). Multivariate analysis of variance. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-054. Newbury Park, CA: Sage. (This monograph on MANOVA is superb: I cannot recommend anything better.)

Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analysis. Psychological Bulletin, 105(2), 302-308.

Interesting real research

Marzillier, S. L., & Davey, G. C. L. (2005). Anxiety and disgust: Evidence for a unidirectional relationship. Cognition and Emotion, 19(5), 729-750.

17 Exploratory factor analysis

FIGURE 17.1 Me at Niagara Falls in 1998. I was in the middle of writing the first edition of the SPSS version of this book at the time. Note how fresh-faced I look.

Thanks to my initial terrible teaching experiences, I developed an obsession with over-preparing for classes, and I wrote detailed handouts with worked examples. Through my girlfriend at the time I met Dan Wright (who was in my department but sadly moved to Florida). He had published a statistics book of his own and was helping his publishers to sign up new authors. On the basis that my handouts were quirky, and that I was too young to realize that writing a textbook at the age of 23 was academic suicide (really, textbooks take a long time to write and they are not at all valued compared to research articles), I was duly signed up. The commissioning editor was a man constantly on the verge of spontaneously combusting with intellectual energy. He can start a philosophical debate about literally anything: should he ever be trapped in an elevator he will be compelled to attempt to penetrate the occupants' minds with probing arguments that the elevator doesn't exist, that they don't exist, and that their entrapment is an illusory construct generated by their erroneous beliefs in the physical world. Ultimately, though, he'd still be a man trapped in an elevator (with several exhausted corpses). A combination of his unfaltering self-confidence, my fear of social interactions with people I don't know, and my utter bemusement that anyone would want me to write a book made me incapable of saying anything sensible to him. Ever. He must have thought that he had signed up an imbecile. He was probably right. (I find him less intimidating since thinking up the elevator scenario.) The trouble with agreeing to write books is that you then have to write them.
For the next two years or so I found myself trying to juggle my research, a lectureship at the University of London, and writing a book. Had I been writing a book on heavy metal it would have been fine because all of the information was moshing away in my memory waiting to stage-dive out. Sadly, however, I had agreed to write a book on something that I knew nothing about: statistics. I soon discovered that writing the book was like doing a factor analysis: in factor analysis we take a lot of information (variables) and the R program effortlessly reduces this mass of confusion into a simple message (fewer variables) that is easier to digest. The program does this (sort of) by filtering out the bits of the information overload that we don't need to know about. It takes a few seconds. Similarly, my younger self took a mass of information about statistics that I didn't understand and filtered it down into a simple message that I could understand. I became a living, breathing factor analysis ... except that, unlike R, it took me two years and some considerable effort.

17.2. When to use factor analysis ©

In the social sciences we are often trying to measure things that cannot directly be measured (so-called latent variables). For example, management researchers (or psychologists even) might be interested in measuring 'burnout', which is when someone who has been working very hard on a project (a book, for example) for a prolonged period of time suddenly finds themselves devoid of motivation and inspiration, and wants to repeatedly headbutt their computer screaming 'please Mike, unlock the door, let me out of the basement, I need to feel the soft warmth of sunlight on my skin!'. You can't measure burnout directly: it has many facets. However, you can measure different aspects of burnout: you could get some idea of motivation, stress levels, whether the person has any new ideas, and so on. Having done this, it would be helpful to know whether these differences reflect a single variable. Put another way, are these different variables driven by the same underlying variable? This chapter will look at factor analysis (and principal component analysis) - a technique for identifying groups or clusters of variables. This technique has three main uses: (1) to understand the structure of a set of variables (e.g., pioneers of intelligence research such as Spearman and Thurstone used factor analysis to try to understand the structure of the latent variable 'intelligence'); (2) to construct a questionnaire to measure an underlying variable (e.g., you might design a questionnaire to measure burnout); and (3) to reduce a data set to a more manageable size while retaining as much of the original information as possible (e.g., we saw in Chapter 7 that multicollinearity can be a problem in multiple regression, and factor analysis can be used to solve this problem by combining variables that are collinear). Through this chapter we'll discover what factors are, how we find them, and what they tell us (if anything) about the relationship between the variables we've measured.

17.3. Factors ©

If we measure several variables, or ask someone several questions about themselves, the correlation between each pair of variables (or questions) can be arranged in what's known as an R-matrix. An R-matrix is just a correlation matrix: a table of correlation coefficients between variables (in fact, we saw small versions of these matrices in Chapter 6). The diagonal elements of an R-matrix are all ones because each variable will correlate perfectly with itself.
The off-diagonal elements are the correlation coefficients between pairs of variables, or questions.¹ The existence of clusters of large correlation coefficients between subsets of variables suggests that those variables could be measuring aspects of the same underlying dimension. These underlying dimensions are known as factors (or latent variables). By reducing a data set from a group of interrelated variables into a smaller set of factors, factor analysis achieves parsimony by explaining the maximum amount of common variance in a correlation matrix using the smallest number of explanatory constructs.

There are numerous examples of the use of factor analysis in the social sciences. The trait theorists in psychology used factor analysis endlessly to assess personality traits. Most readers will be familiar with the extroversion-introversion and neuroticism traits measured by Eysenck (1953). Most other personality questionnaires are based on factor analysis - notably Cattell's (1966a) 16 personality factors questionnaire - and these inventories are frequently used for recruiting purposes in industry (and even by some religious groups). However, although factor analysis is probably most famous for being adopted by psychologists, its use is by no means restricted to measuring dimensions of personality. Economists, for example, might use factor analysis to see whether productivity, profits and workforce can be reduced down to an underlying dimension of company growth.

Let's put some of these ideas into practice by imagining that we wanted to measure different aspects of what might make a person popular. We could administer several measures that we believe tap different aspects of popularity. So, we might measure a person's social skills (Social Skills), their selfishness (Selfish), how interesting others find them (Interest), the proportion of time they spend talking about the other person during a conversation (Talk1), the proportion of time they spend talking about themselves (Talk2), and their propensity to lie to people (the Liar scale). We can then calculate the correlation coefficient for each pair of variables and create an R-matrix. Figure 17.2 shows this matrix, in which the significant correlation coefficients are shown in bold type. It is clear that there are two clusters of interrelating variables. Therefore, these variables might be measuring some common underlying dimension. The amount that someone talks about the other person during a conversation seems to correlate highly with both the level of social skills and how interesting the other finds that person. Also, social skills correlate well with how interesting others perceive a person to be. These relationships indicate that the better your social skills, the more interesting and talkative you are likely to be. However, there is a second cluster of variables. The amount that people talk about themselves within a conversation correlates with how selfish they are and how much they lie. Being selfish also correlates with the degree to which a person tells lies. In short, selfish people are likely to lie and talk about themselves.

In factor analysis we strive to reduce this R-matrix down into its underlying dimensions by looking at which variables seem to cluster together in a meaningful way.

¹ This matrix is called an R-matrix (or just R) because it contains correlation coefficients, and r usually denotes Pearson's correlation (see Chapter 6) - the r turns into a capital letter when it denotes a matrix. Given that this software program is called R, this is slightly confusing, so be careful: it should be obvious when I'm talking about the program and when I'm talking about the correlation matrix, and when it's not, I'll tell you.
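In R itself, an R-matrix like the one in Figure 17.2 (below) is simply the output of cor(). The following sketch simulates six variables purely to show the mechanics - the popularity data are not supplied with the book, so the variable names and values here are illustrative only and the simulated correlations will be close to zero rather than matching the figure.

# A minimal sketch of building an R-matrix; 'popularity' is a made-up data frame.
set.seed(1)
popularity <- data.frame(Talk1 = rnorm(100), SocialSkills = rnorm(100),
                         Interest = rnorm(100), Talk2 = rnorm(100),
                         Selfish = rnorm(100), Liar = rnorm(100))
round(cor(popularity), 3)   # Pearson correlations between every pair of variables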
FIGURE 17.2 An R-matrix (significant correlations in the original figure are shown in bold)

                 Talk 1   Social Skills   Interest   Talk 2   Selfish    Liar
Talk 1            1.000
Social Skills      .772        1.000
Interest           .646         .879       1.000
Talk 2             .074        -.120        .054      1.000
Selfish           -.131         .031       -.101       .441     1.000
Liar               .068         .012        .110       .361      .277    1.000

(In the figure, the first three variables - Talk 1, Social Skills and Interest - are circled as one cluster, labelled Factor 1; Talk 2, Selfish and Liar are circled as a second cluster, labelled Factor 2.)

This data reduction is achieved by looking for variables that correlate highly with a group of other variables, but do not correlate with variables outside of that group. In this example, there appear to be two clusters that fit the bill. The first factor seems to relate to general sociability, whereas the second factor seems to relate to the way in which a person treats others socially (we might call it 'consideration'). It might, therefore, be assumed that popularity depends not only on your ability to socialize, but also on whether you are genuine towards others.

17.3.1. Graphical representation of factors ©

Factors (not to be confused with independent variables in factorial ANOVA) are statistical entities that can be visualized as classification axes along which measurement variables can be plotted. In plain English, this statement means that if you imagine factors as being the axes of a graph, then we can plot variables along these axes. The coordinates of variables along each axis represent the strength of relationship between that variable and each factor. Figure 17.3 shows such a plot for the popularity data (in which there were only two factors).

FIGURE 17.3 Example of a factor plot (horizontal axis: Sociability; vertical axis: Consideration; both axes range from -1 to 1)

The first thing to notice is that for both factors, the axis line ranges from -1 to 1, which are the outer limits of a correlation coefficient. Therefore, the position of a given variable depends on its correlation with the two factors. The circles represent the three variables that correlate highly with factor 1 (Sociability: horizontal axis) but have a low correlation with factor 2 (Consideration: vertical axis). Conversely, the triangles represent variables that correlate highly with consideration to others but have a low correlation with sociability. From this plot, we can tell that selfishness, the amount a person talks about themselves and their propensity to lie all contribute to a factor that could be called consideration of others. Conversely, how much a person takes an interest in other people, how interesting they are and their level of social skills contribute to a second factor, sociability. This diagram therefore supports the structure that was apparent in the R-matrix. Of course, if a third factor existed within these data it could be represented by a third axis (creating a 3-D graph). It should also be apparent that if more than three factors exist in a data set, then a graph cannot represent them all.

If each axis on the graph represents a factor, then the variables that go to make up a factor can be plotted according to the extent to which they relate to a given factor. The coordinates of a variable, therefore, represent its relationship to the factors. In an ideal world a variable should have a large coordinate for one of the axes, and low coordinates for any other factors. This scenario would indicate that this particular variable relates to only one factor. Variables that have large coordinates on the same axis are assumed to measure different aspects of some common underlying dimension.
The coordinate of a variable along a classification axis is known as a factor loading. The factor loading can be thought of as the Pearson correlation between a factor and a variable (see Jane Superbrain Box 17.1). From what we know about interpreting correlation coefficients (see section 6.5.4.3) it should be clear that if we square the factor loading we obtain a measure of the substantive importance of a particular variable to a factor.

17.3.2. Mathematical representation of factors ©

The axes drawn in Figure 17.3 are straight lines and so can be described mathematically by the equation of a straight line. Therefore, factors can also be described in terms of this equation.

SELF-TEST: What is the equation of a straight line?

$$Y_i = b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i \tag{17.1}$$

You'll notice that there is no intercept in the equation, the reason being that the lines intersect at zero (hence the intercept is also zero). The bs in the equation represent the factor loadings. Sticking with our example of popularity, we found that there were two factors underlying this construct: general sociability and consideration. We can, therefore, construct an equation that describes each factor in terms of the variables that have been measured. The equations are as follows:

$$\begin{aligned}
\text{Sociability}_i &= b_1\text{Talk1}_i + b_2\text{Social Skills}_i + b_3\text{Interest}_i + b_4\text{Talk2}_i + b_5\text{Selfish}_i + b_6\text{Liar}_i + \varepsilon_i\\
\text{Consideration}_i &= b_1\text{Talk1}_i + b_2\text{Social Skills}_i + b_3\text{Interest}_i + b_4\text{Talk2}_i + b_5\text{Selfish}_i + b_6\text{Liar}_i + \varepsilon_i
\end{aligned} \tag{17.2}$$

Notice that the equations are identical in form: they both include all of the variables that were measured. However, the values of b in the two equations will be different (depending on the relative importance of each variable to the particular factor). In fact, we can replace each value of b with the coordinate of that variable on the graph in Figure 17.3 (i.e., replace the values of b with the factor loadings). The resulting equations are as follows:

$$\begin{aligned}
\text{Sociability}_i &= 0.87\,\text{Talk1}_i + 0.96\,\text{Social Skills}_i + 0.92\,\text{Interest}_i + 0.00\,\text{Talk2}_i - 0.10\,\text{Selfish}_i + 0.09\,\text{Liar}_i + \varepsilon_i\\
\text{Consideration}_i &= 0.01\,\text{Talk1}_i - 0.03\,\text{Social Skills}_i + 0.04\,\text{Interest}_i + 0.82\,\text{Talk2}_i + 0.75\,\text{Selfish}_i + 0.70\,\text{Liar}_i + \varepsilon_i
\end{aligned} \tag{17.3}$$

Observe that, for the sociability factor, the values of b are high for Talk1, Social Skills and Interest. For the remaining variables (Talk2, Selfish and Liar) the values of b are very low (close to 0). This tells us that three of the variables are very important for that factor (the ones with high values of b) and three are very unimportant (the ones with low values of b). We saw that this point is true because of the way that three variables clustered highly on the factor plot. The point to take on board here is that the factor plot and these equations represent the same thing: the factor loadings in the plot are simply the b-values in these equations (but see Jane Superbrain Box 17.1).
For the second factor (inconsideration to others) the opposite pattern can be seen, in that Talk2, Selfish and Liar all have high values of b, whereas the remaining three variables have b-values close to 0. In an ideal world, variables would have very high b-values for one factor and very low b-values for all other factors.

These factor loadings can be placed in a matrix in which the columns represent each factor and the rows represent the loadings of each variable on each factor. For the popularity data this matrix would have two columns (one for each factor) and six rows (one for each variable). This matrix, usually denoted A, is given by:

$$A = \begin{pmatrix} 0.87 & 0.01 \\ 0.96 & -0.03 \\ 0.92 & 0.04 \\ 0.00 & 0.82 \\ -0.10 & 0.75 \\ 0.09 & 0.70 \end{pmatrix}$$

To understand what the matrix means, try relating the elements to the loadings in equation (17.3). For example, the top row represents the first variable, Talk1, which had a loading of .87 for the first factor (Sociability) and a loading of .01 for the second factor (Consideration). This matrix is called the factor matrix or component matrix (if doing principal components analysis) - see Jane Superbrain Box 17.1 to find out about the different forms of this matrix.

The major assumption in factor analysis is that these algebraic factors represent real-world dimensions, the nature of which must be guessed at by inspecting which variables have high loadings on the same factor. So, psychologists might believe that factors represent dimensions of the psyche, education researchers might believe they represent abilities, and sociologists might believe they represent races or social classes. However, it is an extremely contentious point whether this assumption is tenable, and some believe that the dimensions derived from factor analysis are real only in the statistical sense - and are real-world fictions.

JANE SUPERBRAIN 17.1
What's the difference between a pattern matrix and a structure matrix? ©

Throughout my discussion of factor loadings I've been deliberately vague. Sometimes I've said that these loadings can be thought of as the correlation between a variable and a given factor, then at other times I've described these loadings in terms of regression coefficients (b). Now, it should be obvious from what we discovered in Chapters 6 and 7 that correlation coefficients and regression coefficients are quite different things, so what the hell am I on about: shouldn't I make up my mind what the loadings actually are?

Well, in vague terms (the best terms for my brain) both correlation coefficients and regression coefficients represent the relationship between a variable and a linear model in some sense, so the key take-home message is that factor loadings tell us about the relative contribution that a variable makes to a factor. As long as you understand that much, you have no problems. However, the factor loadings in a given analysis can be both correlation coefficients and regression coefficients. Soon we'll discover that the interpretation of factor analysis is helped greatly by a technique known as rotation. Without going into details, there are two types: orthogonal and oblique rotation (see section 17.3.9).
When orthogonal rotation is used, any underlying factors are assumed to be independent, and the factor loading is the correlation between the factor and the variable, but is also the regression coefficient. Put another way, the values of the correlation coefficients are the same as the values of the regression coefficients. However, there are situations in which the underlying factors are assumed to be related or correlated to each other. In these situations, oblique rotation is used and the resulting correlations between variables and factors will differ from the corresponding regression coefficients. In this case, there are, in effect, two different sets of factor loadings: the correlation coefficients between each variable and factor (which are put in the factor structure matrix) and the regression coefficients for each variable on each factor (which are put in the factor pattern matrix). These coefficients can have quite different interpretations (see Graham, Guthrie, & Thompson, 2003).

17.3.3. Factor scores ©

A factor can be described in terms of the variables measured and their relative importance for that factor (represented by the value of b). Therefore, having discovered which factors exist, and estimated the equation that describes them, it should be possible to also estimate a person's score on a factor, based on their scores for the constituent variables. These scores are known as factor scores. As such, if we wanted to derive a score of sociability for a particular person, we could place their scores on the various measures into equation (17.3). This method is known as a weighted average. In fact, this method is overly simplistic and rarely used, but it is probably the easiest way to explain the principle. For example, imagine the six scales all range from 1 to 10 and that someone scored the following: Talk1 (4), Social Skills (9), Interest (8), Talk2 (6), Selfish (8), and Liar (6). We could put these values into equation (17.3) to get a score for this person's sociability and their consideration to others:

Sociability = 0.87 Talk1 + 0.96 Social Skills + 0.92 Interest + 0.00 Talk2 - 0.10 Selfish + 0.09 Liar
            = (0.87 × 4) + (0.96 × 9) + (0.92 × 8) + (0.00 × 6) - (0.10 × 8) + (0.09 × 6)
            = 19.22

Consideration = 0.01 Talk1 - 0.03 Social Skills + 0.04 Interest + 0.82 Talk2 + 0.75 Selfish + 0.70 Liar
              = (0.01 × 4) - (0.03 × 9) + (0.04 × 8) + (0.82 × 6) + (0.75 × 8) + (0.70 × 6)
              = 15.21
                                                                                   (17.4)

The resulting scores of 19.22 and 15.21 reflect the degree to which this person is sociable and their inconsideration to others, respectively. This person scores higher on sociability than inconsideration. However, the scales of measurement used will influence the resulting scores, and if different variables use different measurement scales, then factor scores for different factors cannot be compared. As such, this method of calculating factor scores is poor and more sophisticated methods are usually used.

17.3.3.1. The regression method ©

There are several sophisticated techniques for calculating factor scores that use factor score coefficients as weights in equation (17.1) rather than using the factor loadings. The form of the equation remains the same, but the bs in the equation are replaced with these factor score coefficients. Factor score coefficients can be calculated in several ways. The simplest way is the regression method. In this method the factor loadings are adjusted to take account of the initial correlations between variables; in doing so, differences in units of measurement and variable variances are stabilized.

To obtain the matrix of factor score coefficients (B) we multiply the matrix of factor loadings by the inverse (R⁻¹) of the original correlation or R-matrix.
You might remember from the previous chapter that matrices cannot be divided (see section 16.4.4.1). Therefore, if we want to divide by a matrix it cannot be done directly and instead we multiply by its inverse. So, by multiplying the matrix of factor loadings by the inverse of the correlation matrix we are, conceptually speaking, dividing the factor loadings by the correlation coefficients. The resulting factor score matrix, therefore, represents the relationship between each variable and each factor, taking into account the original relationships between pairs of variables. As such, this matrix represents a purer measure of the unique relationship between variables and factors.

The matrices for the popularity data are shown below. The resulting matrix of factor score coefficients, B, comes from the R (the program) output. The matrices R and A can be multiplied by hand to get the matrix B, and those familiar with matrix algebra - or who have consulted Namboodiri (1984) or Stevens (2002) - might like to verify the result (see Oliver Twisted). To get the same degree of accuracy as R you should work to at least five decimal places:

$$B = R^{-1}A = \begin{pmatrix} 0.343 & 0.006 \\ 0.376 & -0.020 \\ 0.362 & 0.020 \\ 0.000 & 0.473 \\ -0.037 & 0.437 \\ 0.039 & 0.405 \end{pmatrix}$$

The pattern of the loadings is the same for the factor score coefficients: that is, the first three variables have high loadings for the first factor and low loadings for the second, whereas the pattern is reversed for the last three variables. The difference is only in the actual value of the weightings, which are smaller because the correlations between variables are now accounted for. These factor score coefficients can be used to replace the b-values in equation (17.2):

Sociability = 0.343 Talk1 + 0.376 Social Skills + 0.362 Interest + 0.000 Talk2 - 0.037 Selfish + 0.039 Liar
            = (0.343 × 4) + (0.376 × 9) + (0.362 × 8) + (0.000 × 6) - (0.037 × 8) + (0.039 × 6)
            = 7.59

Consideration = 0.006 Talk1 - 0.020 Social Skills + 0.020 Interest + 0.473 Talk2 + 0.437 Selfish + 0.405 Liar
              = (0.006 × 4) - (0.020 × 9) + (0.020 × 8) + (0.473 × 6) + (0.437 × 8) + (0.405 × 6)
              = 8.768
                                                                                   (17.5)

Equation (17.5) shows how these coefficient scores are used to produce two factor scores for this person. In this case, the participant had the same scores on each variable as were used in equation (17.4). The resulting scores are much more similar than when the factor loadings were used as weights because the different variances among the six variables have now been controlled for. The fact that the values are very similar reflects the fact that this person not only scores highly on variables relating to sociability, but is also inconsiderate (i.e., they score highly on both factors). This technique for producing factor scores ensures that the resulting scores have a mean of 0 and a variance equal to the squared multiple correlation between the estimated factor scores and the true factor values. However, the downside of the method is that the scores can correlate with other factor scores from a different orthogonal factor.
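For readers who would rather let R do the matrix algebra, here is a sketch of the calculation B = R⁻¹A using solve(). The matrices are typed in from Figure 17.2 and the loading matrix A given earlier, so treat the output as illustrative: small differences from the rounded coefficients in equation (17.5) are to be expected.

# Correlation matrix R (Figure 17.2) and loading matrix A, typed in by hand.
R <- matrix(c( 1.000,  0.772,  0.646,  0.074, -0.131,  0.068,
               0.772,  1.000,  0.879, -0.120,  0.031,  0.012,
               0.646,  0.879,  1.000,  0.054, -0.101,  0.110,
               0.074, -0.120,  0.054,  1.000,  0.441,  0.361,
              -0.131,  0.031, -0.101,  0.441,  1.000,  0.277,
               0.068,  0.012,  0.110,  0.361,  0.277,  1.000), nrow = 6, byrow = TRUE)

A <- matrix(c( 0.87,  0.01,
               0.96, -0.03,
               0.92,  0.04,
               0.00,  0.82,
              -0.10,  0.75,
               0.09,  0.70), nrow = 6, byrow = TRUE)

B <- solve(R) %*% A        # solve(R) returns the inverse of R, so B = R^-1 A
round(B, 3)                # factor score coefficients (compare with equation 17.5)

person <- c(4, 9, 8, 6, 8, 6)   # one person's scores on the six measures
t(B) %*% person                 # their two factor scores (compare with equation 17.5)

Using solve() rather than inverting R by hand also avoids the five-decimal-place bookkeeping mentioned above.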
OLIVER TWISTED
Please Sir, can I have some more ... matrix algebra?

'The Matrix', enthuses Oliver, 'that was a good film. I want to dress in black and glide through the air as though time has stood still. Maybe the matrix of factor scores is as cool as the film.' I think you might be disappointed, Oliver, but we'll give it a shot. The matrix calculations of factor scores are detailed in the additional material for this chapter on the companion website. Be afraid, be very afraid ...

17.3.3.2. Uses of factor scores ©

There are several uses of factor scores. First, if the purpose of the factor analysis is to reduce a large set of data into a smaller subset of measurement variables, then the factor scores tell us an individual's score on this subset of measures. Therefore, any further analysis can be carried out on the factor scores rather than the original data. For example, we could carry out a t-test to see whether females are significantly more sociable than males, using the factor scores for sociability. A second use is in overcoming collinearity problems in regression. If, following a multiple regression analysis, we have identified sources of multicollinearity then the interpretation of the analysis is questioned (see section 7.7.2.3). In this situation, we can carry out a principal components analysis on the predictor variables to reduce them down to a subset of uncorrelated factors. The variables causing the multicollinearity will combine to form a factor. If we then rerun the regression but using the factor scores as predictor variables then the problem of multicollinearity should vanish (because the variables are now combined into a single factor).

By now, you should have some grasp of the concept of what a factor is, how it is represented graphically, how it is represented algebraically, and how we can calculate composite scores representing an individual's 'performance' on a single factor. I have deliberately restricted the discussion to a conceptual level, without delving into how we actually find these mythical beasts known as factors. The next section will look at how we find factors. Specifically, we will examine different types of method, look at the maths behind one method (principal components), investigate the criteria for determining whether factors are important, and discover how to improve the interpretation of a given solution.

17.3.4. Choosing a method ©

The first thing you need to know is that there are several methods for unearthing factors in your data. The method you choose will depend on what you hope to do with the analysis. Tinsley and Tinsley (1987) give an excellent account of the different methods available. There are two things to consider: whether you want to generalize the findings from your sample to a population and whether you are exploring your data or testing a specific hypothesis. This chapter describes techniques for exploring data using factor analysis. Testing hypotheses about the structures of latent variables and their relationships to each other requires considerable complexity and can be done with packages such as sem or lavaan in R.² Those interested in hypothesis testing techniques (known as confirmatory factor analysis) are advised to read Pedhazur and Schmelkin (1991, Chapter 23) for an introduction.

² The sem package is the more straightforward, but is slightly less capable of handling unusual situations than lavaan (sem was written by John Fox, who also wrote R Commander).

Assuming we want to explore our data, we then need to consider whether we want to apply our findings to the sample collected (descriptive method) or to generalize our findings to a population (inferential methods). When factor analysis was originally developed it was assumed that it would be used to explore data to generate future hypotheses. As such, it was assumed that the technique would be applied to the entire population of interest. Therefore, certain techniques assume that the sample used is the population, and so results cannot be extrapolated beyond that particular sample.
Principal components analysis is an example of one of these techniques, as is principal factors analysis (principal axis factoring). Principal components analysis and principal factors analysis are the preferred methods and usually result in similar solutions (see section 17.3.6). When these methods are used, conclusions are restricted to the sample collected and generalization of the results can be achieved only if analysis using different samples reveals the same factor structure.

Another approach has been to assume that participants are randomly selected and that the variables measured constitute the population of variables in which we're interested. By assuming this, it is possible to develop techniques from which the results can be generalized from the sample participants to a larger population. However, a constraint is that any findings hold true only for the set of variables measured (because we've assumed this set constitutes the entire population of variables). Techniques in this category include the maximum-likelihood method (see Harman, 1976) and Kaiser's alpha factoring. The choice of method depends largely on what generalizations, if any, you want to make from your data.³

17.3.5. Communality ©

Before continuing, it is important that you understand some basic things about the variance within an R-matrix. It is possible to calculate the variability in scores (the variance) for any given measure (or variable). You should be familiar with the idea of variance by now and comfortable with how it can be calculated (if not, see Chapter 2). The total variance for a particular variable will have two components: some of it will be shared with other variables or measures (common variance) and some of it will be specific to that measure (unique variance). We tend to use the term unique variance to refer to variance that can be reliably attributed to only one measure. However, there is also variance that is specific to one measure but not reliably so; this variance is called error or random variance. The proportion of common variance present in a variable is known as the communality. As such, a variable that has no specific variance (or random variance) would have a communality of 1; a variable that shares none of its variance with any other variable would have a communality of 0.

In factor analysis we are interested in finding common underlying dimensions within the data and so we are primarily interested only in the common variance. Therefore, when we run a factor analysis it is fundamental that we know how much of the variance present in our data is common variance. This presents us with a logical impasse: to do the factor analysis we need to know the proportion of common variance present in the data, yet the only way to find out the extent of the common variance is by carrying out a factor analysis. There are two ways to approach this problem. The first is to assume that all of the variance is common variance. As such, we assume that the communality of every variable is 1. By making this assumption we merely transpose our original data into constituent linear components (known as principal components analysis).⁴

⁴ It should be made clear at this point that principal component analysis is not in fact the same as factor analysis. This doesn't stop idiots like me from discussing them as though they are, but more on that later.

The second approach is to estimate the amount of common variance by estimating communality values for each variable.
There are various methods of estimating communalities, but the most widely used (including alpha factoring) is to use the squared multiple correlation (SMC) of each variable with all others. So, for the popularity data, imagine you ran a multiple regression using one measure (Selfish) as the outcome and the other five measures as predictors: the resulting multiple R² (see section 7.6.2) would be used as an estimate of the communality for the variable Selfish. This second approach is used in factor analysis. These estimates allow the factor analysis to be done. Once the underlying factors have been extracted, new communalities can be calculated that represent the multiple correlation between each variable and the factors extracted. Therefore, the communality is a measure of the proportion of variance explained by the extracted factors.

17.3.6. Factor analysis vs. principal components analysis ©

The elements of the eigenvectors are the weights of each variable on the variate (see equation (16.5)). These values are the factor loadings described earlier. The eigenvalue associated with each eigenvector provides a single indicator of the substantive importance of each variate (or component). The basic idea is that we retain factors with relatively large eigenvalues and ignore those with relatively small eigenvalues.

The eigenvalue for a factor can also be calculated by summing the squares of the loadings on that factor. This isn't much use if you're calculating a factor analysis, because you need to calculate the eigenvalues to calculate the loadings. But it can be a useful way to help understand the eigenvalues: the higher the loadings on a factor, the more of the variance in the variables that the factor explains.

In summary, component analysis works in a similar way to MANOVA. We begin with a matrix representing the relationships between variables. The linear components (also called variates, or factors) of that matrix are then calculated by determining the eigenvalues of the matrix. These eigenvalues are used to calculate eigenvectors, the elements of which provide the loading of a particular variable on a particular factor (i.e., they are the b-values in equation (17.1)). The eigenvalue is also a measure of the substantive importance of the eigenvector with which it is associated.
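As a brief aside, the eigenvalues and eigenvectors described above can be obtained in R with the eigen() function. The object R below stands for any correlation matrix - for example, the one typed in for the earlier factor score sketch, or cor() applied to your own data frame - so this is a sketch rather than an analysis from the text.

eigenInfo <- eigen(R)
eigenInfo$values         # eigenvalues, one per linear component, largest first
eigenInfo$vectors[, 1]   # elements of the first eigenvector (weights for component 1)
sum(eigenInfo$values)    # for a correlation matrix this equals the number of variables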
17.3.8. Factor extraction: eigenvalues and the scree plot ©

How many factors should I extract?

Not all factors are retained in an analysis, and there is debate over the criterion used to decide whether a factor is statistically important. I mentioned above that the eigenvalues associated with a variate indicate the substantive importance of that factor. Therefore, it seems logical that we should retain only factors with large eigenvalues. Retaining factors is known as factor extraction. How do we decide whether or not an eigenvalue is large enough to represent a meaningful factor? Well, one technique advocated by Cattell (1966b) is to plot a graph of each eigenvalue (Y-axis) against the factor with which it is associated (X-axis). This graph is known as a scree plot (because it looks like a rock face with a pile of debris, or scree, at the bottom). I mentioned earlier that it is possible to obtain as many factors as there are variables and that each has an associated eigenvalue. By graphing the eigenvalues, the relative importance of each factor becomes apparent. Typically there will be a few factors with quite high eigenvalues, and many factors with relatively low eigenvalues, and so this graph has a very characteristic shape: there is a sharp descent in the curve followed by a tailing off (see Figure 17.4). Cattell (1966b) argued that the cut-off point for selecting factors should be at the point of inflexion of this curve. The point of inflexion is where the slope of the line changes dramatically: so, in Figure 17.4, imagine drawing a straight line that summarizes the vertical part of the plot and another that summarizes the horizontal part (the blue dashed lines); then the point of inflexion is the data point at which these two lines meet. In both examples in Figure 17.4 the point of inflexion occurs at the third data point (factor); therefore, we would extract two factors. Thus, you retain (or extract) only factors to the left of the point of inflexion (and do not include the factor at the point of inflexion itself).⁵ With a sample of more than 200 participants, the scree plot provides a fairly reliable criterion for factor selection (Stevens, 2002).

Although scree plots are very useful, factor selection should not be based on this criterion alone. Kaiser (1960) recommended retaining all factors with eigenvalues greater than 1. This criterion is based on the idea that the eigenvalues represent the amount of variation explained by a factor and that an eigenvalue of 1 represents a substantial amount of variation. Jolliffe (1972, 1986) reports that Kaiser's criterion is too strict and suggests the third option of retaining all factors with eigenvalues greater than .7. The difference between how many factors are retained using Kaiser's method compared to Jolliffe's can be dramatic.

You might well wonder how the methods compare. Generally speaking, Kaiser's criterion overestimates the number of factors to retain (see Jane Superbrain Box 17.2), but there is some evidence that it is accurate when the number of variables is less than 30 and the resulting communalities (after extraction) are all greater than .7. Kaiser's criterion can also be accurate when the sample size exceeds 250 and the average communality is greater than or equal to .6. In any other circumstances you are best advised to use a scree plot provided the sample size is greater than 200 (see Stevens, 2002, for more detail).

⁵ Actually, in his original paper, Cattell advised including the factor at the point of inflexion as well because it is 'desirable to include at least one common error factor as a "garbage can"'. The idea is that this factor represents an error factor. However, in practice this garbage can factor is rarely retained; also it is thought that it is better to retain too few than too many factors, so most people do not retain the factor at the point of inflexion.

FIGURE 17.4 Examples of scree plots for data that probably have two underlying factors (eigenvalues plotted against component number).
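A scree plot can be drawn by hand from the eigenvalues, as in this minimal sketch; myData is a placeholder for any data frame of numeric variables, not a data set supplied at this point in the chapter.

eigenValues <- eigen(cor(myData))$values
plot(eigenValues, type = "b", xlab = "Component Number", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)    # Kaiser's criterion: eigenvalues greater than 1
abline(h = 0.7, lty = 3)  # Jolliffe's more liberal criterion
# The psych package also provides scree() for a ready-made version of this plot.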
However, to discover what common variance really * between variables we must decide which factors are meaningful and discard any that fl|t0° tr'V'a't0 consider. Therefore, we discard some information. The factors we retain ■ot explain all of the variance in the data (because we have discarded sonic informa-| and so the communalities after extraction will always be less than 1. The factors 0 not maP perfectly onto the original variables - thev m«-i--nt in the Hat- « ^ ciu^u some mtorma- ~~u«.? aner extraction will always be less than 1. The factors t map perfectly onto the original variables - they merely reflect the common Present in the data. If the communalities represent a loss of information then they ^ajn.ant statistics. The closer the communalities are to 1, the better our factors are teat"^1 ? or^8^nal data- It is logical that the greater the number of factors retained, ■Un r ' 6 communauties wdl be (because less information is discarded); therefore, the les are g°°d indices of whether too few factors have been retained. In fact, with 764 DISCOVERING STATISTICS USING R generalized least-squares factor analysis and maximum-likelihood factor analysis you can get a statistical measure of the goodness of fit of the factor solution (see the next chapter for more on goodness-of-fit tests). This basically measures the proportion of variance that the factor solution explains (so can be thought of as comparing communalities before and after extraction). As a final word of advice, your decision on how many factors to extract will depend also on why you're doing the analysis; for example, if you're trying to overcome multicol-linearity problems in regression, then it might be better to extract too many factors than too few. JANE SUPERBRAIN 17.2 How many factors do I retain? (D The discussion of factor extraction in the text is somewhat simplified. In fact, there are fundamental problems with Kaiser's criterion (Nunnally & Bernstein, 1994; Preacher & MacCallum, 2003). For one thing an eigenvalue of 1 means different things in different analyses: with 100 variables it means that a factor explains 1% of the variance, but with 10 variables it means that a factor explains 10% of the variance. Clearly, these two situations are very different and a single rule that covers both is inappropriate. An eigenvalue of 1 also means only that the factor explains as much variance as a variable, which rather defeats the original intention of the analysis to reduce j variables down to 'more substantive' underlying factors j (Nunnally & Bernstein, 1994). Consequently, Kaiser's criterion often overestimates the number of factors. On this I basis Jolliffe's criterion is even worse (a factor explains less variance than a variable!). There are other ways to determine how many fac- ] tors to retain, but they are more complex (which is why j I'm discussing them outside of the main text). The best is j probably parallel analysis (Horn, 1965). Essentially each i eigenvalue (which represents the size of the factor) is com- J pared against an eigenvalue for the corresponding factor in many randomly generated data sets that have the same characteristics as the data being analysed. In doing so, j each eigenvalue is being compared to an eigenvalue frofl a data set that has no underlying factors. This is a bit like asking whether our observed factor is bigger than a non-existing factor. Factors that are bigger than their 'random'] counterparts are retained. 
17.3.9. Improving interpretation: factor rotation ©

Once factors have been extracted, it is possible to calculate the degree to which variables load onto these factors (i.e., to calculate the loading of each variable on each factor). Generally, most variables will load highly on the most important factor and have small loadings on all others, which makes interpretation difficult; so a technique called factor rotation is used to discriminate between factors. If a factor is a classification axis along which variables can be plotted, then factor rotation effectively rotates these factor axes such that variables are loaded maximally on only one factor. Figure 17.5 demonstrates how this process works using an example in which there are only two factors. Imagine that a sociologist was interested in classifying university lecturers as a demographic group. She discovered that two underlying dimensions best describe this group: alcoholism and achievement (go to any academic conference and you'll see that academics drink heavily). The first factor, alcoholism, has a cluster of variables associated with it (dark blue circles), and these could be measures such as the number of units drunk in a week, dependency and obsessive personality. The second factor, achievement, also has a cluster of variables associated with it (light blue circles), and these could be measures relating to salary, job status and number of research publications. Initially, the full lines represent the factors, and by looking at the coordinates it should be clear that the light blue circles have high loadings for factor 2 (they are a long way up this axis) and medium loadings for factor 1 (they are not very far up this axis). Conversely, the dark blue circles have high loadings for factor 1 and medium loadings for factor 2. By rotating the axes (dashed lines), we ensure that both clusters of variables are intersected by the factor to which they relate most. So, after rotation, the loadings of the variables are maximized on one factor (the factor that intersects the cluster) and minimized on the remaining factor(s). If an axis passes through a cluster of variables, then these variables will have a loading of approximately zero on the opposite axis. If this idea is confusing, then look at Figure 17.5 and think about the values of the coordinates before and after rotation (this is best achieved by turning the book when you look at the rotated axes).

There are two types of rotation that can be done. The first is orthogonal rotation, and the left-hand side of Figure 17.5 represents this method. In Chapter 10 we saw that the term orthogonal means unrelated, and in this context it means that we rotate factors while keeping them independent, or unrelated. Before rotation, all factors are independent (i.e., they do not correlate at all) and orthogonal rotation ensures that the factors remain uncorrelated. That is why in Figure 17.5 the axes are turned while remaining perpendicular.⁶ The other form of rotation is oblique rotation. The difference with oblique rotation is that the factors are allowed to correlate (hence, the axes of the right-hand diagram of Figure 17.5 do not remain perpendicular).

The choice of rotation depends on whether there is a good theoretical reason to suppose that the factors should be related or independent (but see my later comments on this), and also how the variables cluster on the factors before rotation.
On the first point, we might not expect alcoholism to be completely independent of achievement (after all, high achievement leads to high stress, which can lead to the drinks cabinet!). Therefore, on theoretical grounds, we might choose oblique rotation. On the second point, Figure 17.5 demonstrates how the positioning of clusters is important in determining how successful the rotation will be (note the position of the light blue circles). Specifically, if an orthogonal rotation were carried out on the right-hand diagram it would be considerably less successful in maximizing loadings than the oblique rotation that is displayed. One approach is to run the analysis using both types of rotation. Pedhazur and Schmelkin (1991) suggest that if the oblique rotation demonstrates a negligible correlation between the extracted factors then it is reasonable to use the orthogonally rotated solution. If the oblique rotation reveals a correlated factor structure, then the orthogonally rotated solution should be discarded. In any case, an oblique rotation should be used only if there are good reasons to suppose that the underlying factors could be related in theoretical terms.

FIGURE 17.5 Schematic representations of factor rotation. The left graph displays orthogonal rotation whereas the right graph displays oblique rotation (see text for more details). θ is the angle through which the axes are rotated.

⁶ This term means that the axes are at right angles to one another.

The mathematics behind factor rotation is complex (especially oblique rotation). However, in oblique rotation, because each factor can be rotated by different amounts, a factor transformation matrix, Λ, is needed. The factor transformation matrix is a square matrix and its size depends on how many factors were extracted from the data. If two factors are extracted then it will be a 2 × 2 matrix, but if four factors are extracted then it becomes a 4 × 4 matrix. The values in the factor transformation matrix consist of sines and cosines of the angle of axis rotation (θ). This matrix is multiplied by the matrix of unrotated factor loadings, A, to obtain a matrix of rotated factor loadings. For the case of two factors the factor transformation matrix would be:

$$\Lambda = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

Therefore, you should think of this matrix as representing the angle through which the axes have been rotated, or the degree to which factors have been rotated. The angle of rotation necessary to optimize the factor solution is found in an iterative way (see R's Souls' Tip 8.1) and different methods can be used.

17.3.9.1. Choosing a method of factor rotation ©

The R function that we will use has four methods of orthogonal rotation (varimax, quartimax, BentlerT and geominT) and five methods of oblique rotation (oblimin, promax, simplimax, BentlerQ and geominQ). These methods differ in how they rotate the factors and, therefore, the resulting output depends on which method you select. The most important orthogonal rotations are quartimax and varimax. Quartimax rotation attempts to maximize the spread of factor loadings for a variable across all factors. Therefore, interpreting variables becomes easier. However, this often results in lots of variables loading highly on a single factor. Varimax is the opposite in that it attempts to maximize the dispersion of loadings within factors.
Therefore, it tries to load a smaller number of variables highly onto each factor, resulting in more interpretable clusters of factors. For a first analysis, you should probably select varimax because it is a good general approach that simplifies the interpretation of factors. The two important oblique rotations are promax and oblimin. Promax is a faster procedure designed for very large data sets. (If you are interested in adjustments that can be made to these rotations, other rotations, and even hand rotations, you can consult the GPArotation package, which the psych package uses to perform rotation.)

In theory, the exact choice of rotation will depend largely on whether or not you think that the underlying factors should be related. If you expect the factors to be independent then you should choose one of the orthogonal rotations (I recommend varimax). If, however, there are theoretical grounds for supposing that your factors might correlate, then direct oblimin should be selected. In practice, there are strong grounds to believe that orthogonal rotations are a complete nonsense for naturalistic data, and certainly for any data involving humans (can you think of any psychological construct that is not in any way correlated with some other psychological construct?). As such, some argue that orthogonal rotations should never be used.

17.3.9.2. Substantive importance of factor loadings ©

Once a factor structure has been found, it is important to decide which variables make up which factors. Earlier I said that the factor loadings were a gauge of the substantive importance of a given variable to a given factor. Therefore, it makes sense that we use these values to place variables with factors. It is possible to assess the statistical significance of a factor loading (after all, it is simply a correlation coefficient or regression coefficient); however, there are various reasons why this option is not as easy as it seems (see Stevens, 2002, p. 393). Typically, researchers take a loading of an absolute value of more than 0.3 to be important. However, the significance of a factor loading will depend on the sample size. Stevens (2002) produced a table of critical values against which loadings can be compared. To summarize, he recommends that for a sample size of 50 a loading of 0.722 can be considered significant, for 100 the loading should be greater than 0.512, for 200 it should be greater than 0.364, for 300 it should be greater than 0.298, for 600 it should be greater than 0.21, and for 1000 it should be greater than 0.162. These values are based on an alpha level of .01 (two-tailed), which allows for the fact that several loadings will need to be tested (see Stevens, 2002, for further detail). Therefore, in very large samples, small loadings can be considered statistically meaningful. (R can provide significance tests of factor loadings, but these get rather complex and are rarely used. By applying Stevens's guidelines you should gain some insight into the structure of variables and factors.)

The significance of a loading gives little indication of the substantive importance of a variable to a factor. This value can be found by squaring the factor loading to give an estimate of the amount of variance in a factor accounted for by a variable (like R²). In this respect Stevens (2002) recommends interpreting only factor loadings with an absolute value greater than 0.4 (which explain around 16% of the variance in the variable).
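To make the rotation options concrete, here is a hedged sketch using principal() from the psych package (the oblique options additionally require the GPArotation package to be installed). The data object myData and the choice of two factors are placeholders rather than values taken from the text.

library(psych)
pcVarimax <- principal(myData, nfactors = 2, rotate = "varimax")  # orthogonal rotation
pcOblimin <- principal(myData, nfactors = 2, rotate = "oblimin")  # oblique rotation
print.psych(pcVarimax, cut = 0.4, sort = TRUE)  # hide loadings below .4 (Stevens, 2002)

The cut argument simply suppresses small loadings in the printed output, which makes it easier to apply the guidelines above when deciding which variables belong to which factor.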
17.4. Research example ©

One of the uses of factor analysis is to develop questionnaires: after all, if you want to measure an ability or trait, you need to ensure that the questions asked relate to the construct that you intend to measure. I have noticed that a lot of students become very stressed about R. Therefore I wanted to design a questionnaire to measure a trait that I termed 'R anxiety'. I decided to devise a questionnaire to measure various aspects of students' anxiety towards learning R. I generated questions based on interviews with anxious and non-anxious students and came up with 23 possible questions to include. Each question was a statement followed by a five-point Likert scale ranging from 'strongly disagree' through 'neither agree nor disagree' to 'strongly agree'. The questionnaire is printed in Figure 17.6.

FIGURE 17.6 The R anxiety questionnaire (RAQ). Each item is rated on a five-point scale: SD = Strongly Disagree, D = Disagree, N = Neither, A = Agree, SA = Strongly Agree.
1  Statistics make me cry
2  My friends will think I'm stupid for not being able to cope with R
3  Standard deviations excite me
4  I dream that Pearson is attacking me with correlation coefficients
5  I don't understand statistics
6  I have little experience of computers
7  All computers hate me
8  I have never been good at mathematics
9  My friends are better at statistics than me
10 Computers are useful only for playing games
11 I did badly at mathematics at school
12 People try to tell you that R makes statistics easier to understand but it doesn't
13 I worry that I will cause irreparable damage because of my incompetence with computers
14 Computers have minds of their own and deliberately go wrong whenever I use them
15 Computers are out to get me
16 I weep openly at the mention of central tendency
17 I slip into a coma whenever I see an equation
18 R always crashes when I try to use it
19 Everybody looks at me when I use R
20 I can't sleep for thoughts of eigenvectors
21 I wake up under my duvet thinking that I am trapped under a normal distribution
22 My friends are better at R than I am
23 If I am good at statistics people will think I am a nerd

The questionnaire was designed to predict how anxious a given individual would be about learning how to use R. What's more, I wanted to know whether anxiety about R could be broken down into specific forms of anxiety. In other words, what latent variables contribute to anxiety about R? With a little help from a few lecturer friends I collected 2571 completed questionnaires (at this point it should become apparent that this example is fictitious). The data are stored in the file RAQ.dat. Load this file into R and have a look at the data. We know that in R, cases (or people's data) are typically stored in rows and variables are stored in columns, and so this layout is consistent with past chapters. The second thing to notice is that there are 23 variables labelled Q01 to Q23.

OLIVER TWISTED
Please Sir, can I have some more ... questionnaires?

'I'm going to design a questionnaire to measure one's propensity to pick a pocket or two', says Oliver, 'but how would I go about doing it?'
OLIVER TWISTED
Please Sir, can I have some more ... questionnaires?

'I'm going to design a questionnaire to measure one's propensity to pick a pocket or two', says Oliver, 'but how would I go about doing it?' You'd read the useful information about the dos and don'ts of questionnaire design in the additional material for this chapter on the companion website, that's how. Rate how useful it is on a Likert scale from 1 = not useful at all, to 5 = very useful.

17.4.1. Sample size ©

Correlation coefficients fluctuate from sample to sample, much more so in small samples than in large. Therefore, the reliability of factor analysis is also dependent on sample size. Much has been written about the necessary sample size for factor analysis, resulting in many 'rules of thumb'. The common rule is to suggest that a researcher has at least 10-15 participants per variable. Although I've heard this rule bandied about on numerous occasions, its empirical basis is unclear (although Nunnally, 1978, did recommend having 10 times as many participants as variables). Kass and Tinsley (1979) recommended having between 5 and 10 participants per variable up to a total of 300 (beyond which test parameters tend to be stable regardless of the participant to variable ratio). Indeed, Tabachnick and Fidell (2007) agree that 'it is comforting to have at least 300 cases for factor analysis' (p. 613), and Comrey and Lee (1992) class 300 as a good sample size, 100 as poor and 1000 as excellent.

Fortunately, recent years have seen empirical research done in the form of experiments using simulated data (so-called Monte Carlo studies). Arrindell and van der Ende (1985) used real-life data to investigate the effect of different participant to variable ratios. They concluded that changes in this ratio made little difference to the stability of factor solutions. Guadagnoli and Velicer (1988) found that the most important factors in determining reliable factor solutions were the absolute sample size and the absolute magnitude of factor loadings. In short, they argue that if a factor has four or more loadings greater than .6 then it is reliable regardless of sample size. Furthermore, factors with 10 or more loadings greater than .40 are reliable if the sample size is greater than 150. Finally, factors with a few low loadings should not be interpreted unless the sample size is 300 or more. MacCallum, Widaman, Zhang, and Hong (1999) have shown that the minimum sample size or sample to variable ratio depends on other aspects of the design of the study. In short, their study indicated that as communalities become lower the importance of sample size increases. With all communalities above .6, relatively small samples (less than 100) may be perfectly adequate. With communalities in the .5 range, samples between 100 and 200 can be good enough provided there are relatively few factors each with only a small number of indicator variables. In the worst scenario of low communalities (well below .5) and a larger number of underlying factors they recommend samples above 500. What's clear from this work is that a sample of 300 or more will probably provide a stable factor solution, but that a wise researcher will measure enough variables to adequately measure all of the factors that theoretically they would expect to find.

Another alternative is to use the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (Kaiser, 1970). The KMO can be calculated for individual and multiple variables and represents the ratio of the squared correlation between variables to the squared partial correlation between variables. The KMO statistic varies between 0 and 1.
A value of 0 indicates that the sum of partial correlations is large relative to the sum of correlations, indicating diffusion in the pattern of correlations (hence, factor analysis is likely to be inappropriate). A value close to 1 indicates that patterns of correlations are relatively compact and so factor analysis should yield distinct and reliable factors. Kaiser (1974) recommends accepting values greater than .5 as barely acceptable (values below this should lead you to either collect more data or rethink which variables to include). Furthermore, values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb (Hutcheson & Sofroniou, 1999).

17.4.2. Correlations between variables ©

One heuristic for dealing with variables that correlate too highly with others is simply to eliminate one of each offending pair (r > .8). The problem with a heuristic such as this is that the effect of two variables correlating with r = .9 might be less than the effect of, say, three variables that all correlate at r = .6. In other words, eliminating such highly correlating variables might not be getting at the cause of the multicollinearity (Rockwell, 1975). Multicollinearity can be detected by looking at the determinant of the R-matrix, denoted |R| (see Jane Superbrain Box 17.3). One simple heuristic is that the determinant of the R-matrix should be greater than 0.00001. If you have reason to believe that the correlation matrix has multicollinearity then you could look through the correlation matrix for variables that correlate very highly (R > .8) and consider eliminating one of the variables (or more, depending on the extent of the problem).

JANE SUPERBRAIN 17.3
What is the determinant? ®

The determinant of a matrix is an important diagnostic tool in factor analysis, but the question of what it is is not easy to answer because it has a mathematical definition and I'm not a mathematician. Rather than pretending that I understand the maths, all I'll say is that a good explanation of how the determinant is derived can be found at http://mathworld.wolfram.com. However, we can bypass the maths and think about the determinant conceptually.

The way that I think of the determinant is as describing the 'area' of the data. In Jane Superbrain Box 16.2 we saw the two diagrams below. At the time I used these to describe eigenvectors and eigenvalues (which describe the shape of the data). The determinant is related to eigenvalues and eigenvectors, but instead of describing the height and width of the data it describes the overall area. So, in the left diagram below, the determinant of those data would represent the area inside the dashed ellipse. These variables have a low correlation so the determinant (area) is big; the biggest value it can be is 1. In the right diagram, the variables are perfectly correlated or singular, and the ellipse (dashed line) has been squashed down to basically a straight line. In other words, the opposite sides of the ellipse have actually met each other and there is no distance between them at all. Put another way, the area, or determinant, is zero. Therefore, the determinant tells us whether the correlation matrix is singular (determinant is 0), or if all variables are completely unrelated (determinant is 1), or somewhere in between.
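To see the 'area' idea numerically, here is a small sketch (ours, not from the text) comparing the determinant of a 2 x 2 correlation matrix when two variables are nearly uncorrelated, highly correlated, and perfectly correlated (singular):

# Determinant shrinks from 1 towards 0 as the correlation grows
r_low  <- matrix(c(1, 0.1, 0.1, 1), nrow = 2)   # nearly uncorrelated
r_high <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)   # highly correlated
r_sing <- matrix(c(1, 1, 1, 1),     nrow = 2)   # singular (perfect correlation)

det(r_low)    # 0.99: large 'area'
det(r_high)   # 0.19: squashed ellipse
det(r_sing)   # 0: no area at all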
Factor analysis in R Commander is a little limited. If you look through the menus, you'll see only one kind of extraction (maximum likelihood) and, although this is a good method when it works, it often doesn't work. Understanding why it didn't work and what to do about it is difficult (and the solution is often to just use a different sort of extraction). For this reason, we don't recommend factor analysis with R Commander.

17.6. Running the analysis with R ©

17.6.1. Packages used in this chapter ©

There are several packages we will use in this chapter. You will need the packages corpcor, GPArotation (for rotating) and psych (for the factor analysis). If you don't have these packages installed you'll need to install them and load them.

install.packages("corpcor"); install.packages("GPArotation"); install.packages("psych")

Then you need to load the packages by executing these commands:

library(corpcor); library(GPArotation); library(psych)

17.6.2. Initial preparation and analysis ©

To run a factor analysis or a principal components analysis you can either use the raw data or you can calculate a correlation matrix and use that. If you have a massive number of cases (and by massive, I mean at least 100,000, and probably closer to 1,000,000) you're better off calculating a correlation matrix first, and then factor-analysing that. If you don't have a massive number of cases, it doesn't matter which you do. It's also worth noting at this stage that sometimes the analysis doesn't work, usually because the correlation matrix that you're trying to analyse is weird (R's Souls' Tip 17.1).

First, we'll load the data into a dataframe called raqData. Set your working directory to the location of the file (see section 3.4.4) and execute:

raqData<-read.delim("raq.dat", header = TRUE)

We want to include all of the variables in our data set in our factor analysis. We can calculate the correlation matrix, using the cor() function (see Chapter 6):

raqMatrix<-cor(raqData)

7 Note that there is an h in the polychor function; that's because we're calculating polychoric correlations. It is part of the polycor package, which calculates polychoric and polyserial correlations. (Also note that it's written by John Fox, who wrote several other packages we use in this book.)
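Footnote 7 mentions polychoric correlations, which are appropriate when Likert items are treated as ordinal rather than continuous. A minimal sketch (ours, not from the text), assuming the polycor package is installed, of computing one such correlation for a pair of items:

# install.packages("polycor")   # if the package is not already installed
library(polycor)

# Polychoric correlation between two ordinal items (note the 'h' in polychor)
polychor(raqData$Q01, raqData$Q02)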
R's Souls' Tip 17.1: Warning messages about a non-positive definite matrix ©

On rare occasions, you might have a non-positive definite matrix. When you have this, R will give unhelpful warnings, such as:

Warning messages:
1: In log(det(m.inv.r)) : NaNs produced
2: In log(det(r)) : NaNs produced

What R is trying to tell you, in its own friendly way, is that the determinant of the R (correlation) matrix is negative, and hence it cannot find the log of the determinant ('NaN' is R's way of saying 'not a number'). This problem is usually described as a non-positive definite matrix.

What is a non-positive definite matrix? As we have seen, factor analysis works by looking at your correlation matrix. This matrix has to be 'positive definite' for the analysis to work. What does that mean in plain English? It means lots of horrible things mathematically (e.g., the eigenvalues and determinant of the matrix have to be positive) and about the best explanation I've seen is at http://www2.gsu.edu/~mkteer/npdmatri.html. In more basic terms, factors are like lines floating in space, and eigenvalues measure the length of those lines. If your eigenvalue is negative then it means that the length of your line/factor is negative too. It's a bit like me asking you how tall you are, and you responding 'I'm minus 175 cm tall'. That would be nonsense. By analogy, if a factor has negative length, then that too is nonsense. When R decomposes the correlation matrix to look for factors, if it comes across a negative eigenvalue it starts thinking 'oh dear, I've entered some weird parallel universe where the usual rules of maths no longer apply and things can have negative lengths, and this probably means that time runs backwards, my mum is my dad, my sister is a dog, my head is a fish, and my toe is a frog called Gerald'. It still has a go at producing results, but those results probably won't make much sense. (We'd like it if it said 'these results are probably nonsense', rather than being a bit subtle about it, so you have to be really careful.) Things like the KMO test and the determinant rely on a positive definite matrix; if you don't have one they can't be computed.

Why have I got a non-positive definite matrix? The most likely answer is that you have too many variables and too few cases of data, which makes the correlation matrix a bit unstable. It could also be that you have too many highly correlated items in your matrix (singularity, for example, tends to mess things up). In any case it means that your data are bad, naughty data, and not to be trusted; if you let them loose then you have only yourself to blame for the consequences.

What can I do? Other than cry, there's not that much you can do. You could try to limit your items, or selectively remove items (especially highly correlated ones) to see if that helps. Collecting more data can help too. There are some mathematical fudges you can do, but they're not as tasty as vanilla fudge and they are hard to implement easily.

R was not square, finding R from data
$chisq
[1] 19334.49

$p.value
[1] 0

$df
[1] 253

Output 17.2

Next we'd also like the KMO. None of the packages in R currently have a straightforward way to calculate the KMO. However, one of the nice things about R is that people can write programs to do anything that R doesn't currently do, and G. Jay Kerns, from Youngstown State University (see http://tolstoy.newcastle.edu.au/R/e2/help/07/08/22816.html), has written one called kmo(), which calculates the KMO and a variety of other things. The function itself is easy to use manually (see Oliver Twisted), but because it is not part of a package we have included it in our DSUR package so that you can use it directly (assuming you have loaded the DSUR package). You can use the function by simply entering the name of your dataframe into it and executing:

kmo(raqData)

The results of the KMO test are shown in Output 17.3. We came across the KMO statistic in section 17.4.1 and saw that Kaiser (1974) recommends a bare minimum of .5 and that values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb (Hutcheson & Sofroniou, 1999). For these data the overall value is .93, which falls into the range of being superb (or 'marvellous' as the report puts it), so we should be confident that the sample size and the data are adequate for factor analysis.

OLIVER TWISTED
Please Sir, can I have some more ... kmo?

'Stop spanking my monkey!', cries an hysterical Oliver, 'it's never done you any harm, and it's orange.' I was talking about the Kaiser-Meyer-Olkin test, Oliver. 'Oh, sorry', he says with a sigh of relief, 'I thought KMO stood for Kill My Orang-utan'.
Erm, OK, Oliver has finally lost the plot, which I'm fairly sure is what you'll do if you inspect the kmo() function on the companion website. Although we have included it in our DSUR package, you can also copy it and execute it manually.

KMO can be calculated for multiple and individual variables. The value of KMO should be above the bare minimum of .5 for all variables (and preferably higher) as well as overall. The KMO values for individual variables are produced by the kmo() function too. For these data all values are well above .5, which is good news. If you find any variables with values below .5 then you should consider excluding them from the analysis (or run the analysis with and without that variable and note the difference). Removal of a variable affects the KMO statistics, so if you do remove a variable be sure to rerun the kmo() function on the new data.

$overall
[1] 0.9302245

$report
[1] "The KMO test yields a degree of common variance marvelous."

$individual
   Q01    Q02    Q03    Q04    Q05    Q06    Q07    Q08    Q09    Q10    Q11    Q12
0.9297 0.8748 0.9510 0.9553 0.9601 0.8913 0.9417 0.8713 0.8337 0.9487 0.9059 0.9548
   Q13    Q14    Q15    Q16    Q17    Q18    Q19    Q20    Q21    Q22    Q23
0.9482 0.9672 0.9404 0.9336 0.9306 0.9479 0.9407 0.8891 0.9293 0.8784 0.7664

Output 17.3

Finally, we'd like the determinant of the correlation matrix. To find the determinant, we use the det() function, into which we place the name of a correlation matrix. We have computed this matrix already for the current data (raqMatrix) so we can execute:

det(raqMatrix)

If we hadn't already created the matrix, we could get the determinant by putting the cor() function for the raw data into the det() function:

det(cor(raqData))

Either method produces the same value:

[1] 0.0005271037

This value is greater than the necessary value of 0.00001 (see section 17.5). As such, our determinant does not seem problematic. After checking the determinant, you can, if necessary, eliminate variables that you think are causing the problem. In summary, all questions in the RAQ correlate reasonably well with all others and none of the correlation coefficients are excessively large; therefore, we won't eliminate any questions at this stage.

CRAMMING SAM'S TIPS: Preliminary analysis
* Scan the correlation matrix; look for variables that don't correlate with any other variables, or correlate very highly (r = .9) with one or more other variables. In factor analysis, check that the determinant of this matrix is bigger than 0.00001; if it is then multicollinearity isn't a problem.
* Check the KMO and Bartlett's test; the KMO statistic should be greater than .5 as a bare minimum; if it isn't, collect more data. Bartlett's test of sphericity should be significant (the significance value should be less than .05).

17.6.3. Factor extraction using R ©

R also displays the eigenvalues in terms of the proportion of variance explained. Factor 1 explains 7.29 units of variance out of a possible 23 (the number of factors), so as a proportion this is 7.29/23 = 0.32; this is the value that R reports. We can convert these proportions to percentages by multiplying by 100; so, factor 1 explains 32% of the total variance.

8 Some of them are very, very slightly different from zero; for example, question 2 has a uniqueness reported as -3.1e-15, which means -0.0000000000000031. This is caused by a rounding error (R rounds variables to only 15 decimal places).
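The eigenvalues discussed here come from an initial extraction model in which as many components as variables are retained; later commands in the chapter refer to this model as pc1 and to changing nfactors from 23 to 4, so a minimal sketch of that step (the exact call is our reconstruction, inferred from those later references) would be:

# Initial extraction: as many components as variables, no rotation
# (the name pc1 and nfactors = 23 are inferred from how the model is used later on)
pc1 <- principal(raqData, nfactors = 23, rotate = "none")
pc1            # loadings, communalities (h2), uniquenesses (u2) and SS loadings
pc1$values     # the eigenvalues, used for the scree plot below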
It should be clear that the first few factors explain relatively large amounts of variance (especially factor 1) whereas subsequent factors explain only small amounts of variance. The eigenvalues show us that four components (or factors) have eigenvalues greater than 1, suggesting that we extract four components if we use Kaiser's criterion. By Jolliffe's criterion (retain factors with eigenvalues greater than 0.7) we should retain 10 factors, but there is little to recommend this criterion over Kaiser's. We should also consider the scree plot. As mentioned above, the eigenvalues are stored in a variable called pc1$values, and we can draw a quick scree plot using the plot() function, by executing:

plot(pc1$values, type = "b")

This command simply plots the eigenvalues (y) against the factor number (x). By default, the plot() function will plot points (type = "p"). We want to see a line so that we can look at the trend (we could ask for this by specifying type = "l"), but ideally we want to look at both a line and points on the same graph, which is why we specify type = "b".

Figure 17.7 shows the scree plot; I show it once as R produces it and then again with lines showing a plateau and (what I consider to be) the point of inflexion. This curve is difficult to interpret because it begins to tail off after three factors, but there is another drop after four factors before a stable plateau is reached. Therefore, we could probably justify retaining either two or four factors. Given the large sample, it is probably safe to assume Kaiser's criterion. The evidence from the scree plot and from the eigenvalues suggests a four-component solution may be the best.

Now that we know how many components we want to extract, we can rerun the analysis, specifying that number. To do this, we use an identical command to the previous one, but we change nfactors = 23 to be nfactors = 4 because we now want only four factors (we should also change the name of the resulting model so that we don't overwrite the previous one):

pc2 <- principal(raqData, nfactors = 4, rotate = "none")
pc2 <- principal(raqMatrix, nfactors = 4, rotate = "none")

The first command is to run the analysis from the raw data and the second is if you're using the correlation matrix. In both cases the commands create a model called pc2 that is the same as before except that we've extracted only four factors (not 23). We can look at this model by executing its name:

pc2

FIGURE 17.7 Scree plot from principal components analysis of the RAQ data. The second plot shows the point of inflexion at the fourth component.

Output 17.5 shows the second principal components model. Again, the output contains the unrotated factor loadings, but only for the first four factors. Notice that these are unchanged from the previous factor loading matrix. Also notice that the eigenvalues (SS loadings), proportions of variance explained and cumulative proportion of variance explained are also unchanged (except now there are only four of them, because we only have four components). However, the communalities (the h2 column) and uniquenesses (the u2 column) are changed. Remember that the communality is the proportion of common variance within a variable (see section 17.3.4). Principal components analysis works on the initial assumption that all variance is common; therefore, before extraction the communalities are all 1. In effect, all of the variance associated with a variable is assumed to be common variance.
Once factors have been extracted, we have a better idea of how much variance is, in reality, common. The communalities in the output reflect this common variance. So, for example, we can say that 43% of the variance associated with question 1 is common, or shared, variance. Another way to look at these communalities is in terms of the proportion of variance explained by the underlying factors. Before extraction, there were as many factors as there are variables, so all variance is explained by the factors and communalities are all 1. However, after extraction some of the factors are discarded and so some information is lost. The retained factors cannot explain all of the variance present in the data, but they can explain some. The amount of variance in each variable that can be explained by the retained factors is represented by the communalities after extraction.

Now that we have the communalities, we can go back to Kaiser's criterion to see whether we still think that four factors should have been extracted. In section 17.3.8 we saw that Kaiser's criterion is accurate when there are fewer than 30 variables and communalities after extraction are greater than .7, or when the sample size exceeds 250 and the average communality is greater than .6. Of the communalities in Output 17.5, only one exceeds .7. The average of these communalities can be found by adding them up and dividing by the number of communalities (11.573/23 = .503). So, on both grounds Kaiser's rule may not be accurate. However, in this instance we should consider the huge sample that we have, because the research into Kaiser's criterion gives recommendations for much smaller samples. It's also worth remembering that we have already inspected the scree plot, which should be a good guide in a sample as large as ours. However, given the ambiguity in the scree plot (there was also a case for retaining only two factors) you might like to rerun the analysis specifying that R extract only two factors and compare the results.

Principal Components Analysis
Call: principal(r = raq, nfactors = 4, rotate = "none")
Standardized loadings based upon correlation matrix
       PC1   PC2   PC3   PC4   h2   u2
Q01   0.59  0.18 -0.22  0.12 0.43 0.57
Q02  -0.30  0.55  0.15  0.01 0.41 0.59
Q03  -0.63  0.29  0.21 -0.07 0.53 0.47
Q04   0.63  0.14 -0.15  0.15 0.47 0.53
Q05   0.56  0.10 -0.07  0.14 0.34 0.66
Q06   0.56  0.10  0.57 -0.05 0.65 0.35
Q07   0.69  0.04  0.25  0.10 0.55 0.45
Q08   0.55  0.40 -0.32 -0.42 0.74 0.26
Q09  -0.28  0.63 -0.01  0.10 0.48 0.52
Q10   0.44  0.03  0.36 -0.10 0.33 0.67
Q11   0.65  0.25 -0.21 -0.40 0.69 0.31
Q12   0.67 -0.05  0.05  0.25 0.51 0.49
Q13   0.67  0.08  0.28 -0.01 0.54 0.46
Q14   0.66  0.02  0.20  0.14 0.49 0.51
Q15   0.59  0.01  0.12 -0.11 0.38 0.62
Q16   0.68  0.01 -0.14  0.08 0.49 0.51
Q17   0.64  0.33 -0.21 -0.34 0.68 0.32
Q18   0.70  0.03  0.30  0.13 0.60 0.40
Q19  -0.43  0.39  0.10 -0.01 0.34 0.66
Q20   0.44 -0.21 -0.40  0.30 0.48 0.52
Q21   0.66 -0.06 -0.19  0.28 0.55 0.45
Q22  -0.30  0.47 -0.12  0.38 0.46 0.54
Q23  -0.14  0.37 -0.02  0.51 0.41 0.59

                PC1  PC2  PC3  PC4
SS loadings    7.29 1.74 1.32 1.23
Proportion Var 0.32 0.08 0.06 0.05
Cumulative Var 0.32 0.39 0.45 0.50

Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 253 and the objective function was 7.55
The degrees of freedom for the model are 167 and the objective function was 1.03
The number of observations was 2571 with Chi Square = 2634.37 with prob < 0
Fit based upon off diagonal values = 0.96

Output 17.5

There's another thing that we can look at to see if we've extracted the correct number of factors: this is the reproduced correlation matrix and the difference between the reproduced correlation matrix and the correlation matrix in the data. The reproduced correlations are obtained with the factor.model() function. The factor.model() function needs to know the factor loading matrix. The factor loading matrix is labelled as an object called loadings in the principal components model; therefore we can access it by specifying pc2$loadings (which translates as 'the loadings object associated with the pc2 model'). Therefore, we can get the reproduced correlations by executing:

factor.model(pc2$loadings)

The difference between the reproduced and actual correlation matrices is referred to as the residuals, and these are obtained with the factor.residuals() function. You again need to provide the factor loading matrix but also the correlation matrix to which you want to compare it (in this case the original correlation matrix, raqMatrix). We can, therefore, obtain the residuals by executing:

factor.residuals(raqMatrix, pc2$loadings)

        Q01     Q02     Q03     Q04     Q05     Q06     Q07     Q08     Q09
Q01   0.435  -0.112  -0.372   0.447   0.376   0.218   0.366   0.412  -0.042
Q04   0.447  -0.134  -0.399   0.469   0.399   0.278   0.419   0.390  -0.073
Q05   0.376  -0.122  -0.345   0.399   0.343   0.273   0.380   0.312  -0.080
Q06   0.218  -0.033  -0.200   0.278   0.273   0.654   0.528   0.183  -0.108
Q07   0.366  -0.148  -0.373   0.419   0.380   0.528   0.545   0.267  -0.161
Q08   0.412   0.002  -0.270   0.390   0.312   0.183   0.267   0.739   0.055
Q09  -0.042   0.430   0.352  -0.073  -0.080  -0.108  -0.161   0.055   0.484
...

Output 17.6

Output 17.6 shows an edited version of the reproduced correlation matrix that was requested using the factor.model() function in the first table. The diagonal of this matrix contains the communalities after extraction for each variable (you can check the values against Output 17.5). Output 17.7 contains an extract from the matrix of residuals: the difference between the fitted model and the real data. The diagonal of this matrix is the uniquenesses.
        Q01     Q02     Q03     Q04     Q05     Q06     Q07     Q08     Q09
Q01   0.565   0.013   0.035  -0.011   0.027  -0.001  -0.061  -0.081  -0.050
Q02   0.013   0.586  -0.062   0.022   0.003  -0.041  -0.011  -0.052  -0.115
Q03   0.035  -0.062   0.470   0.019   0.035  -0.027  -0.009   0.011  -0.052
Q04  -0.011   0.022   0.019   0.531   0.002   0.000  -0.010  -0.041  -0.051
Q05   0.027   0.003   0.035   0.002   0.657  -0.016  -0.041  -0.044  -0.016
Q06  -0.001  -0.041  -0.027   0.000  -0.016   0.346  -0.014   0.040  -0.005
Q07  -0.061  -0.011  -0.009  -0.010  -0.041  -0.014   0.455   0.030   0.033
Q08  -0.081  -0.052   0.011  -0.041  -0.044   0.040   0.030   0.261  -0.039
Q09  -0.050  -0.115  -0.052  -0.051  -0.016  -0.005   0.033  -0.039   0.516
...

Output 17.7

The correlations in the reproduced matrix differ from those in the R-matrix because they stem from the model rather than the observed data. If the model were a perfect fit to the data then we would expect the reproduced correlation coefficients to be the same as the original correlation coefficients. Therefore, to assess the fit of the model we can look at the differences between the observed correlations and the correlations based on the model. For example, if we take the correlation between questions 1 and 2, the correlation based on the observed data is -.099 (taken from Output 17.1). The correlation based on the model is -.112, which is slightly higher. We can calculate the difference as follows:

residual = r_observed - r_from model
residual_Q1Q2 = (-0.099) - (-0.112) = 0.013

You should notice that this difference is the value quoted in Output 17.7 for questions 1 and 2. Therefore, Output 17.7 contains the differences between the observed correlation coefficients and the ones predicted from the model. For a good model these values will all be small.
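As a quick check, a minimal sketch (ours, not the book's worked answer) that reproduces this single residual in R from the objects created above:

# Observed correlation minus model-implied (reproduced) correlation for Q01 and Q02
reproduced <- factor.model(pc2$loadings)      # model-implied correlation matrix
raqMatrix[1, 2] - reproduced[1, 2]            # Q01 with Q02; should be about 0.013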
This statistic is given at the bottom of the main output (Output 17.5) as: Fit based upon off diagonal values = 0.96 Values over 0.95 are often considered indicators of good fit, and as our value is 0.96, this indicates that four factors are sufficient. There are many other ways of looking at residuals, which we'll now explore. We couldn't find an R function to do these other things, but we will write one as we go along.,9 A simple approach to residuals is just to say that we want the residuals to be small. In fact, we want most values to be less than 0.05. We can work out how many residuals are large by this criterion fairly easily in R First, we need to extract the residuals into a new object. We need to do this because at the moment the matrix of residuals is symmetrical (so the residuals are repeated above and below the diagonal of the matrix), and also the diagonal of the matrix does not contain residuals. First let's create an object called residuals that contains the factor residuals by executing: residuals<-factor.residuals(raqMatrix, pc2$loadings) We can then extract the upper triangle of this matrix using the upper.tri() function. This has the effect of extracting only the elements above the diagonal (so we discard the diagonal elements and the elements below the diagonal): residuals<-as.matrix(residuals[upper.tri(residuals)]) This command re-creates the object residuals by using only the upper triangle of the original matrix. The as.matrixQ function just makes sure that the residuals are stored as a matrix (they're actually stored as a single column of data). We now have an object called residuals that contains the residuals stored in a column. This is handy because it makes it easy to calculate various things. For example, if we want to know how many large residuals there are (i.e., residuals with absolute values greater than 0.05) then we can execute: large.resid<-abs(residuals) > 0.05 Mhich uses the abs() function to first compute the absolute value of the column of residuals ■his is so we ignore whether the residual is positive or negative). The > 0.05 in the command means that large.resid will be TRUE (or 1) if the residual is greater than 0.05, and f, (°r 0) h the residual is less than or equal to 0.05. We end up with a column the same ength as the matrix of factor residuals but containing values of TRUE (if the residual is rge) or FALSE (if it is small). We can then use the sum() function to add up the number l°f TRUE responses in the matrix: ^""Clarge.resid) 785 ^ has over 3000 packages. For relatively simple things, it's often easier to write a small function yourself than gy to find wheth 'you how, because we're your friends. ""■.iher a function already exists. Or, you can find a friend that can write a function for you. We will 786 FIGURE 17.8 Histogram of the model residuals DISCOVERING STATISTICS USING R The result is 91. If we want to know this as a proportion of the total number of residuals we can simply execute: sumClarge.resid)/nrow(residuals) Executing this command will return the number of large residuals {sum(large.resid)) divided by the total number of residuals: nrowsQ tells us how many items (i.e., residuals) there are in total. This will return a value of 0.3596, or 36%. There are no hard and fast rules about what proportion of residuals should be below 0.05; however, if more than 50% are greater than 0.05 you probably have grounds for concern. For our data, we have 36% so we need not worry. Another way to look at the residuals is to look at their mean. 
Rather than looking at the mean, we should square the residuals, find the mean, and then find the square root. This is the root-mean-square residual. Again, this is easy to calculate from our residuals object. We can execute: sqrt(mean(restdualsA2)) This command squares each item in the residuals object (residuals ^ 2), then uses the meanQ function to compute the mean of these squared residuals. The sqrt() function is then used to compute the square root of that mean. The resulting value is 0.055, that's our mean residual. A little lower would have been nice, but this is not dreadful. If this were much higher (say 0.08) we might want to consider extracting more factors. Finally, it's worth looking at the distributions of the residuals - we expect the residuals to be approximately normally distributed - if there are any serious outliers, even if the other values are all good, we should probably look further into that. We can again use our residuals object to plot a quick histogram using the histQ function: hist(residuals) Figure 17.8 shows the histogram of the residuals. They do seem approximately normal and there are no outliers. We could wrap these commands up in a nice function called residualM stats () so that we can use it again in other factor analyses (R's Souls' Tip 17.3). Histogram of residuals 80 -i 60 u c o = 40 0) 20 H -0.20 -0.15 -0.10 -0.05 0.00 residuals 0.05 0.10 0.15 CHAPTER 17 EXPLORATORY FACTOR ANALYSIS R's Souls' Tip 17.3 787 Creating a residual.statsf) function © We saw (in R's Souls' Tip 6.2) that you can write your own functions in R. If we wanted to wrap all of the factor analysis residual commands into a function we can do this fairly easily by executing: residual.stats<-function(matrix){ residuals<-as.matrix(matrix[upper.tri(matrix)]) large.resid<-abs(residuals) > 0.05 numberLargeResids<-sum(large.resid) propLargeResid<-numberLargeResids/nrow(residuals) rmsr<-sqrt(mean(residualsA2)) cat("Root means squared residual = ", rmsr, "\n") cat("Number of absolute residuals > 0.05 = ", numberLargeResids, "\n") cat("Proportion of absolute residuals > 0.05 = ", propLargeResid, "\n") hist(residuals) The first line creates the function by naming it residual.stats and telling it to expect a matrix as input. The commands within { } are explained within the main text: they extract the residuals from the matrix entered into the function, compute the number (numberLargeResids) and proportion (propLargeResid) of absolute values greater than 0.05, compute the root mean squared residual (rmsr), and plot a histogram. The commands using the cat() function simply specify the text and values to appear in the output. Having executed the function, we could use it on our residual matrix in one of two ways. First, we could calculate the residual matrix using the factor.residualsQ function, and label the resulting matrix resids. Then pop this matrix into the residual.statsf) function: resids <- factor.residuals(raqMatrix, pc2$loadings) residual.stats(resids) The second way is to combine these steps and calculate the residuals matrix directly inside the residual.statsf) function: residual.stats(factor.residuals(raqMatrix, pc2$loadings)) l The output would be as follows (and the histogram in Figure 17.8): Root means squared residual = 0.05549286 Number of absolute residuals > 0.05 = 91 Proportion of absolute residuals > 0.05 = 0.3596838 CRAMMING SAM'S TIPS Factor extraction • To decide how many factors to extract, look at the eigenvalues and the scree plot. 
CRAMMING SAM'S TIPS: Factor extraction
* To decide how many factors to extract, look at the eigenvalues and the scree plot. If you have fewer than 30 variables then using eigenvalues greater than 1 is OK (Kaiser's criterion) as long as your communalities are all over .7. Likewise, if your sample size exceeds 250 and the average of the communalities is .6 or greater then this is also fine. Alternatively, with 200 or more participants the scree plot can be used.
* Check the residuals and make sure that fewer than 50% have absolute values greater than 0.05, and that the model fit is greater than 0.90.

17.6.4. Rotation ©

We have already seen that the interpretability of factors can be improved through rotation. Rotation maximizes the loading of each variable on one of the extracted factors while minimizing the loading on all other factors. This process makes it much clearer which variables relate to which factors. Rotation works through changing the absolute values of the variables while keeping their differential values constant. I've discussed the various rotation options in section 17.3.9.1, but, to summarize, the exact choice of rotation will depend on whether or not you think that the underlying factors should be related. If there are theoretical grounds to think that the factors are independent (unrelated) then you should choose one of the orthogonal rotations (I recommend varimax). However, if theory suggests that your factors might correlate then one of the oblique rotations (oblimin or promax) should be selected.

17.6.4.1. Orthogonal rotation (varimax) ©

To carry out a varimax rotation, we change the rotate option in the principal() function from "none" to "varimax" (we could also exclude it altogether because varimax is the default if the option is not specified):

pc3 <- principal(raqData, nfactors = 4, rotate = "varimax")
pc3 <- principal(raqMatrix, nfactors = 4, rotate = "varimax")

The first command is to run the analysis from the raw data and the second is if you're using the correlation matrix. In both cases the commands create a model called pc3 that is the same as the previous model (pc2) except that we have used varimax rotation on the model. We can look at this model by executing its name:

pc3

Output 17.8 shows the first part of the rotated component matrix (also called the rotated factor matrix), which is a matrix of the factor loadings for each variable on each factor. This matrix contains the same information as the component matrix in Output 17.5, except that it is calculated after rotation. Notice that the loadings have changed, but the h2 (communality) and u2 (uniqueness) columns have not. Rotation changes the factors to distribute the variance differently, but it cannot account for more or less variance in the variables than it could before rotation. Also notice that the eigenvalues (SS loadings) have changed. One of the aims of rotation is to even up the eigenvalues; however, the sum of the eigenvalues (and the proportion of variance accounted for) cannot change during rotation.

Interpreting the factor loading matrix is a little complex, and we can make it easier by using the print.psych() function. This does two things: first, it removes loadings that are below a certain value that we specify (by using the cut option); and second, it reorders the items to try to put them into their factors, which we request using the sort option. Generally you should be very careful with the cut-off value - if you think that a loading of .4 will be interesting, you should use a lower cut-off (say, .3), because you don't want to miss a loading that was .39. Execute this command:

print.psych(pc3, cut = 0.3, sort = TRUE)

This command prints the factor loading matrix associated with the model pc3, but displaying only loadings above .3 (cut = 0.3) and sorting items by the size of their loadings (sort = TRUE).

Principal Components Analysis
Call: principal(r = raqData, nfactors = 4, residuals = TRUE, rotate = "varimax")
Execute this command: print.psych(pc3, cut = 0.3, sort = TRUE) This command prints the factor loading matrix associated with the model pc3, but disp a? ing only loadings above .3 (cut = 0.3) and sorting items by the size of their loadings , = TRUE). Principal Components Analysis Call: principal(r = raqData, nfactors "varimax") 4, residuals = TRUE, ro täte CHAPTER 17 EXPLORATORY FACTOR ANALYSIS Standardized loadings based RC3 RC1 RC4 qOI 0.24 0.50 0.36 q02 -0.01 -0.34 0.07 q03 -0.20 -0.57 -0.18 Q04 0.32 0.52 0.31 Q05 0.32 0.43 0.24 Q06 0.80 -0.01 0.10 Q07 0.64 0.33 0.16 ■ upon correlation matrix RC2 h2 u2 0.06 0.43 0.57 0.54 0.41 0.59 0.37 0.53 0.47 0.04 0.47 0.53 0 . 01 0.34 0.66 0.07 0.65 0.35 0.08 0.55 0.45 SS loadings Output 17.8 RC3 RC1 RC4 rc2 3-73 3.34 2.55 I.95 The resulting matrix is in Output 17.9. Compare this matrix to the unrotated solution (Output 17.5). Before rotation, most variables loaded highly on the first factor and the remaining factors didn't really get a look in. However, the rotation of the factor structure has clarified things considerably: there are four factors and variables load very highly onto only one factor (with the exception of one question). The suppression of loadings less than .3 and ordering variables by loading size also make interpretation considerably easier (because you don't have to scan the matrix to identify substantive loadings). The next step is to look at the content of questions that load onto the same factor to try to identify common themes. If the mathematical factor produced by the analysis represents some real-world construct then common themes among highly loading questions can help us identify what the construct might be. The questions that load highly on factor 1 are Q6 (I have little experience of computers) with the highest loading of .80, Q18 (R always crashes when I try to use it), Q13 (I worry I will cause irreparable damage ...), Q7 (All computers hate me), Q14 (Computers have minds of their own ...), Q10 (Computers are only for games), and Q15 (Computers are out to get me) with the lowest loading of .46. All these items seem to relate to using computers or R. Therefore we might label this factor fear of computers. Looking at factor 2, we have Q20 (Everybody looks at me when I use R), with a loading of .68, Q2i (I wake up under my duvet ...), Q3 (Standard deviations excite me),10 Q12 (People try to tell you that R makes statistics easier ...), Q4 (I dream that Pearson is attacking me), Q16 (I weep openly at the mention of central tendency), Ql (Statistics makes me cry) and Q5 (I don't understand statistics), with the lowest loading of .52 - this item also loads modetately on some of the other factors. The questions that load highly on factor 2 1 all seem to relate to different aspects of statistics; therefore, we might label this factor fear U>f statistics. Principal Components Analysis Call: principal(r = raqData, nfactors = 4, rotate = "varimax") ^Standardized loadings based upon correlation matrix L item RC3 RC1 RC4 RC2 006 § 010 015 020 item 6 18 13 I 7 14 10 15 20 21 RC3 0.80 0.68 0. 
Principal Components Analysis
Call: principal(r = raqData, nfactors = 4, rotate = "varimax")
Standardized loadings based upon correlation matrix
     item   RC3   RC1   RC4   RC2   h2   u2
Q06     6  0.80                    0.65 0.35
Q18    18  0.68                    0.60 0.40
Q13    13  0.65                    0.54 0.46
Q07     7  0.64  0.33              0.55 0.45
Q14    14  0.58                    0.49 0.51
Q10    10  0.55                    0.33 0.67
Q15    15  0.46                    0.38 0.62
Q20    20        0.68              0.48 0.52
Q21    21  0.33  0.66              0.55 0.45
Q03     3       -0.57        0.37  0.53 0.47
Q12    12  0.47  0.52              0.51 0.49
Q04     4  0.32  0.52  0.31        0.47 0.53
Q16    16  0.33  0.51  0.31        0.49 0.51
Q01     1        0.50  0.36        0.43 0.57
Q05     5  0.32  0.43              0.34 0.66
Q08     8              0.83        0.74 0.26
Q17    17              0.75        0.68 0.32
Q11    11              0.75        0.69 0.31
Q09     9                    0.65  0.48 0.52
Q22    22                    0.65  0.46 0.54
Q23    23                    0.59  0.41 0.59
Q02     2       -0.34        0.54  0.41 0.59
Q19    19       -0.37        0.43  0.34 0.66

                RC3  RC1  RC4  RC2
SS loadings    3.73 3.34 2.55 1.95
Proportion Var 0.16 0.15 0.11 0.08
Cumulative Var 0.16 0.31 0.42 0.50

Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 253 and the objective function was 7.55
The degrees of freedom for the model are 167 and the objective function was 1.03
The number of observations was 2571 with Chi Square = 2634.37 with prob < 0
Fit based upon off diagonal values = 0.96

Output 17.9

Factor 3 has only three items loading on it: Q8 (I have never been good at mathematics), Q17 (I slip into a coma when I see an equation), and Q11 (I did badly at mathematics at school). The three questions that load highly on factor 3 all seem to relate to mathematics; therefore, we might label this factor fear of mathematics. Finally, the questions that load highly on factor 4 are Q9 (My friends are better at statistics than me), Q22 (My friends are better at R), Q2 (My friends will think I'm stupid) and Q19 (Everybody looks at me). All these items contain some component of social evaluation from friends; therefore, we might label this factor peer evaluation.

This analysis seems to reveal that the initial questionnaire, in reality, is composed of four subscales: fear of computers, fear of statistics, fear of maths and fear of negative peer evaluation. There are two possibilities here. The first is that the RAQ failed to measure what it set out to (namely, R anxiety) but does measure some related constructs. The second is that these four constructs are sub-components of R anxiety; however, the factor analysis does not indicate which of these possibilities is true.

17.6.4.2. Oblique rotation ©

When we did the orthogonal rotation, we told R that we expected the components that we extracted to be uncorrelated. This was a bit of a strange thing to say. All of our factors relate to fear: fear of computers, fear of statistics, fear of negative peer evaluation and fear of mathematics. It's likely that these will be correlated: people with a fear of one of these things might well have a fear of the others. If this is the case, an oblique rotation is called for.
Factor 1 seems to represent fear of computers, factor 2 represents fear of peer evaluation, factor 3 represents fear of statistics and factor 4 represents fear of mathematics. Principal Components Analysis Call: principal(r = raqData, nfactors = 4, rotate = "oblimin") Standardized loadings based upon correlation matrix item TCI TC4 TC3 TC2 h2 u2 0.65 0.35 0.60 0.40 0.55 0.45 .54 0.46 .33 0.67 .49 0.51 .51 0.49 38 0.62 74 0.26 69 0.31 0.32 0.52 0.45 0 .47 0.53 0.51 Q06 Q18 Q07 Q13 Q10 Q14 012 Q15 008 Qll Q17 Q20 021 003 Q04 016 QOl 005 Q22 009 I item 6 18 7 13 10 14 12 15 87 .70 . 64 . 64 57 57 45 TC2 0.40 0.43 11 17 20 21 I 3 I 4 16 II I 5 22 I 9 23 [ 2 19 0.90 0.78 0.78 0 0 -0 0. . 71 60 51 41 0.33 0 0 0.34 41 40 .36 -0. -0.35 0.65 0.63 0.61 0.51 0.38 0 0 0 0 0. 0. 0. 0.68 0.48 0.55 0.53 0.47 0.49 0.43 0.34 0.46 0.48 0.41 0.41 0.34 .57 . 66 54 52 59 59 0.66 might have fear of other things. If this is the case an oblique rotation is called for. TCI TC4 TC3 TC2 ■padings 3.90 2.88 2.94 1.85 ■portion Var j . 17 0.13 0.13 0.08 ^■ative Var 0.17 0.29 0.42 0.50 factor correlations of TC4 TC3 TC2 0.36 -0.18 0.31 -0.10 1.00 -0.17 -0.17 1.00 TCI ^Oo 3.44 >-36 »•18 0.44 1.00 0.31 -o.io 792 DISCOVERING STATISTICS USING R Test of the hypothesis that 4 factors are sufficient. The degrees of freedom for the null model are 253 and the objective function was 7.55 The degrees of freedom for the model are 167 and the objective function was 1.03 The number of observations was 2571 with Chi Square = 2634.37 with prob < 0 Fit based upon off diagonal values = 0.96 Output 17.10 Also in this output you'll find a correlation matrix between the factors. This matrix contains the correlation coefficients between factors - R didn't bother to show this to us when it did an orthogonal rotation, because the correlations were all zero. Factor 2 (TC2) has little relationship with any other factors (the correlation coefficients are low), but all other factors are interrelated to some degree (notably TC3 with both TCI and TC4, and TC4 with TCI). The fact that these correlations exist tell us that the constructs measured can be interrelated. If the constructs were independent then we would expect oblique rotation to provide an identical solution to an orthogonal rotation and the component correlation matrix should be an identity matrix (i.e., all factors have correlation coefficients of 0). Therefore, this final matrix gives us a guide to whether it is reasonable to I assume independence between factors: for these data it appears that we cannot assume independence. Therefore, the results of the orthogonal rotation should not be trusted: the obliquely rotated solution is probably more meaningful. When an oblique rotation is conducted the factor matrix is split into two matrices: the pattern matrix and the structure matrix (see Jane Superbrain Box 17.1). For orthogonal rota* tion these matrices are the same. The pattern matrix contains the factor loadings and is comparable to the factor matrix that we interpreted for the orthogonal rotation. The struc« ture matrix takes into account the relationship between factors (in fact it is a product of the I pattern matrix and the matrix containing the correlation coefficients between factors). Most researchers interpret the pattern matrix, because it is usually simpler; however, there are situJ ations in which values in the pattern matrix are suppressed because of relationships between the factors. Therefore, the structure matrix is a useful double-check and Graham et al. 
(2003) recommend reporting both (with some useful examples of why this can be important).

Getting the structure matrix out of R is a little bit more complex than getting the pattern matrix. You need to multiply the factor loading matrix by the correlation matrix of the factors. We've come across the loadings: these are called pc4$loadings. The correlations of the factors are called the Phi (Greek letter φ, which rhymes with pie) and so are stored in pc4$Phi. Given that we have these two matrices, we can get the structure matrix by multiplying them; however, this is not a regular multiplication, this is a matrix multiplication, so instead of writing * we write %*%. The structure matrix is therefore given by executing:

pc4$loadings %*% pc4$Phi

The kind of people that write R think that this is straightforward, but we realize it's not, especially when you're starting out. Also, doing this calculation produces a rather unfriendly looking structure matrix that isn't sorted by the size of factor loadings. In DSUR we've written a function for you, called factor.structure(); you can source it from our package. The function takes this general form:

factor.structure(pcModel, cut = 0.2, decimals = 2)

All you need to do is enter the name of the principal components model into the function and execute. Just like the print.psych() function, we have included an option (cut) so you can specify a value below which you don't want to see the loading (the default is .2). There is also an option, decimals, that allows you to change the number of decimal places you see (the default is 2). For our current model we could execute:

factor.structure(pc4, cut = 0.3)

Output 17.11 shows the structure matrix. The picture becomes more complicated in the structure matrix because, with the exception of factor 2, several variables load quite highly onto more than one factor. This has occurred because of the relationship between factors 1 and 3 and between factors 3 and 4. This example should highlight why the pattern matrix is preferable for interpretative reasons: because it contains information about the unique contribution of a variable to a factor.

       TC1   TC4   TC3   TC2
Q06   0.78
Q18   0.76
Q13   0.72
Q07   0.72
Q14   0.67
Q12   0.60
Q10   0.56
Q15   0.55
...

Output 17.11 (edited)

On a theoretical level the dependence between our factors does not cause concern; we might expect a fairly strong relationship between fear of maths, fear of statistics and fear of computers. Generally, the less mathematically and technically minded people struggle with statistics. However, we would not expect these constructs to correlate with fear of peer evaluation (because this construct is more socially based). In fact, this factor is the one that correlates fairly badly with all others - so on a theoretical level, things have turned out rather well!

CRAMMING SAM'S TIPS: Interpretation
* If you've conducted orthogonal rotation then look at the table labelled rotated component matrix. For each variable, note the component for which the variable has the highest loading. Also, for each component, note the variables that load highly onto it (by 'high' I mean loadings should be above .4 when you ignore the plus or minus sign). Try to make sense of what the factors represent by looking for common themes in the items that load onto them.
* If you've conducted oblique rotation then calculate and look at the pattern matrix. For each variable, note the component for which the variable has the highest loading. Also, for each component, note the variables that load highly onto it (by 'high' I mean loadings should be above .4 when you ignore the plus or minus sign). Double-check what you find by doing the same thing for the structure matrix. Try to make sense of what the factors represent by looking for common themes in the items that load onto them.
For each variable, note the component for which the variable has the highest loading. Also, for each component, note the variables that load highly onto it (by 'high' I mean loadings should be above .4 when you ignore the plus or minus sign). Try to make sense of what the factors represent by looking for common themes in the items that load onto them. If you've conducted oblique rotation then calculate and look at the pattern matrix. For each variable, note the component for which the variable has the highest loading. Also, for each component, note the variables that load highly onto it (by 'high' I mean loadings should be above .4 when you ignore the plus or minus sign). Double-check what you find by doing the same thing for the structure matrix. Try to make sense of what the factors represents by looking for common themes in the items that load onto them. By setting the scores option to TRUE the factor scores are added to the principal component model in an object called scores; therefore, we can access these scores by using pc5$scores (which translates as the scores object attached to the model pcS that we just created). To view the factor scores, you could execute: pc5$scores However, there are rather a lot of them (2571 actually), so let's look at the first 10 rows, by using the head() function and executing: head(pc5$scores, 10) ^ SELF-TEST s Using what you learnt in Chapter 6, or Section 17 6 2, calculate the correlation matrix for the factor scores. Compare this to the correlations of the factors in Output 17.10. [l, [2, [3, [4 [5 [6 [7 [8 [9 [10 ] TCI 37296709 63334164 39712768 78741595 04425942 -1.70018648 0.66139239 0.59491329 -2.34971189 0.93504597 0 0 0 -0 0 TC4 1.8808424 0.2374679 -0.1056263 0.2956628 0.6815179 0.2091685 0 . 4224096 0 .4060248 -3 .6134797 0 .2285419 TC3 0.95979596 0 .29090777 -0.09333769 -0 .77703307 0.59786611 0 . 02784164 1.52552021 1.06465956 -1.42999472 0.96735727 TC2 0 .3910711 -0.3504080 0.9249353 0 .2605666 -0.6912687 0.6653081 -0.9805434 -1.0932598 -0.5443773 -1.5712753 Output 17.12 he USflfl Output 17.12 shows the factor scores for the first 10 participants. Factor scores can in this way to assess the relative fear of one person compared to another. We can also use scores in regression when groups of predictors correlate so highly that there is multicolline Before we can do any analysis with our factor scores, we need to add the factor s< into our dataframe. To do this, we use the cbindQ function, which we have used nufflfl times before: raqDcta <- cbind(raqData, pc5$scores) SELF-TEST Can you think of another way of obtaining the structure matrix (the correlations between factors and items) now you've learned about factor scores? 17.6.6. Summary © To sum up, the analyses revealed four underlying scales in our questionnaire that may, or may not, relate to genuine sub-components of R anxiety. It also seems as though an obliquely rotated solution was preferred due to the interrelationships between factors. The use of factor analysis is purely exploratory; it should be used only to guide future hypotheses, or to inform researchers about patterns within data sets. A great many decisions are left to the researcher using factor analysis, and I urge you to make informed decisions, rather than basing decisions on the outcomes you would like to get. In section 17.9 we consider whether or not our scale is reliable. 17.7. 
17.6.6. Summary ©

To sum up, the analyses revealed four underlying scales in our questionnaire that may, or may not, relate to genuine sub-components of R anxiety. It also seems as though an obliquely rotated solution was preferred due to the interrelationships between factors. The use of factor analysis is purely exploratory; it should be used only to guide future hypotheses, or to inform researchers about patterns within data sets. A great many decisions are left to the researcher using factor analysis, and I urge you to make informed decisions rather than basing decisions on the outcomes you would like to get. In section 17.8 we consider whether or not our scale is reliable.

17.7. How to report factor analysis ©

As with any analysis, when reporting factor analysis we need to provide our readers with enough information to form an informed opinion about our data. As a bare minimum we should be very clear about our criteria for extracting factors and the method of rotation used. We must also produce a table of the rotated factor loadings of all items and flag (in bold) values above a criterion level (I would personally choose .40, but I discussed the various criteria you could use in section 17.3.9.2). You should also report the percentage of variance that each factor explains and possibly the eigenvalue too. Table 17.1 shows an example of such a table for the RAQ data; note that I have also reported the sample size in the title. You could also consider including the matrix of correlations between items (from which someone could reproduce the analysis) and some information on sample size adequacy. For this example we might write something like this:

A principal components analysis (PCA) was conducted on the 23 items with orthogonal rotation (varimax). The Kaiser-Meyer-Olkin measure verified the sampling adequacy for the analysis, KMO = .93 ('superb' according to Kaiser, 1974), and all KMO values for individual items were > .77, which is well above the acceptable limit of .5. Bartlett's test of sphericity, χ²(253) = 19,334, p < .001, indicated that correlations between items were sufficiently large for PCA. An initial analysis was run to obtain eigenvalues for each component in the data. Four components had eigenvalues over Kaiser's criterion of 1 and in combination explained 50.32% of the variance. The scree plot was slightly ambiguous and showed inflexions that would justify retaining both two and four components. Given the large sample size, and the convergence of the scree plot and Kaiser's criterion on four components, four components were retained in the final analysis. Table 17.1 shows the factor loadings after rotation. The items that cluster on the same components suggest that component 1 represents a fear of computers, component 2 a fear of statistics, component 3 a fear of maths and component 4 peer evaluation concerns.
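In case you're wondering where the KMO, Bartlett and eigenvalue figures in a write-up like this come from, here is a minimal sketch (assuming the raqData dataframe used throughout the chapter, with the 23 questionnaire items in its first 23 columns):

raqMatrix <- cor(raqData[, 1:23])       # correlations between the 23 items
cortest.bartlett(raqMatrix, n = 2571)   # Bartlett's test of sphericity
KMO(raqMatrix)                          # Kaiser-Meyer-Olkin statistics (psych's KMO(); the chapter's kmo() helper reports the same measure)
eigen(raqMatrix)$values                 # eigenvalues, for Kaiser's criterion and the scree plot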
Table 17.1   Summary of exploratory factor analysis results for the R anxiety questionnaire (N = 2571)

[The table lists the rotated factor loadings of the 23 items on the four factors (Fear of computers, Fear of statistics, Fear of maths, Peer evaluation), followed by each factor's eigenvalue, percentage of variance explained and Cronbach's α. The items are:]

I have little experience of computers
R always crashes when I try to use it
I worry that I will cause irreparable damage because of my incompetence with computers
All computers hate me
Computers have minds of their own and deliberately go wrong whenever I use them
Computers are useful only for playing games
Computers are out to get me
I can't sleep for thoughts of eigenvectors
I wake up under my duvet thinking that I am trapped under a normal distribution
Standard deviations excite me
People try to tell you that R makes statistics easier to understand but it doesn't
I dream that Pearson is attacking me with correlation coefficients
I weep openly at the mention of central tendency
Statistics makes me cry
I don't understand statistics
I have never been good at mathematics
I slip into a coma whenever I see an equation
I did badly at mathematics at school
My friends are better at statistics than me
My friends are better at R than I am
If I'm good at statistics my friends will think I'm a nerd
My friends will think I'm stupid for not being able to cope with R
Everybody looks at me when I use R

Note: Factor loadings over .40 appear in bold.

Finally, if you have used oblique rotation you should consider reporting a table of both the structure and pattern matrix, because the loadings in these tables have different interpretations (see Jane Superbrain Box 17.1).
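To pull the numbers for a table like Table 17.1 out of R, one option (a sketch rather than the book's own code; pcFinal is a placeholder name for whichever rotated model you are reporting) is:

print.psych(pcFinal, cut = 0.4, sort = TRUE)   # rotated loadings above .40, sorted by size, plus % of variance
pcFinal$values                                 # eigenvalues for each component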
LABCOAT LENI'S REAL RESEARCH 17.1   World wide addiction? ©

Nichols, L. A., & Nicki, R. (2004). Psychology of Addictive Behaviors, 18(4), 381-384.

The Internet is now a household tool. In 2007 it was estimated that around 179 million people worldwide used the Internet (over 100 million of those were in the USA and Canada). From the increasing popularity (and usefulness) of the Internet has emerged a new phenomenon: Internet addiction. This is now a serious and recognized problem, but until very recently it was very difficult to research this topic because there was not a psychometrically sound measure of Internet addiction. That is, until Laura Nichols and Richard Nicki developed the Internet Addiction Scale, IAS (Nichols & Nicki, 2004). (Incidentally, while doing some research on this topic I encountered an Internet addiction recovery website that I won't name but that offered a whole host of resources that would keep you online for ages, such as questionnaires, an online support group, videos, articles, a recovery blog and podcasts. It struck me that this was a bit like having a recovery centre for heroin addiction where the addict arrives to be greeted by a nice-looking counsellor who says 'there's a huge pile of heroin in the corner over there, just help yourself'.)

Anyway, Nichols and Nicki developed a 36-item questionnaire to measure Internet addiction. It contained items such as 'I have stayed on the Internet longer than I intended to' and 'My grades/work have suffered because of my Internet use', which could be responded to on a 5-point scale (Never, Rarely, Sometimes, Frequently, Always). They collected data from 207 people to validate this measure. The data from this study are in the file Nichols & Nicki (2004).dat. The authors dropped two items because they had low means and variances, and dropped three others because of relatively low correlations with other items. They performed a principal components analysis on the remaining 31 items. Labcoat Leni wants you to run some descriptive statistics to work out which two items were dropped for having low means/variances, then inspect a correlation matrix to find the three items that were dropped for having low correlations. Finally, he wants you to run a principal components analysis on the data. Answers are in the additional material on the companion website (or look at the original article).

17.8. Reliability analysis ©

17.8.1. Measures of reliability ©

If you're using factor analysis to validate a questionnaire, it is useful to check the reliability of your scale.

SELF-TEST   Thinking back to Chapter 1, what are reliability and test-retest reliability?

Reliability means that a measure (or in this case questionnaire) should consistently reflect the construct that it is measuring. One way to think of this is that, other things being equal, a person should get the same score on a questionnaire if they complete it at two different points in time (we have already discovered that this is called test-retest reliability). So, someone who is terrified of statistics and who scores highly on our RAQ should score similarly highly if we tested them a month later (assuming they hadn't gone into some kind of statistics-anxiety therapy in that month). Another way to look at reliability is to say that two people who are the same in terms of the construct being measured should get the same score. So, if we took two people who were equally statistics-phobic, then they should get more or less identical scores on the RAQ. Likewise, if we took two people who loved statistics, they should both get equally low scores. It should be apparent that if we took someone who loved statistics and someone who was terrified of it, and they got the same score on our questionnaire, then it wouldn't be an accurate measure of statistical anxiety.

In statistical terms, the usual way to look at reliability is based on the idea that individual items (or sets of items) should produce results consistent with the overall questionnaire. So, if we take someone scared of statistics, then their overall score on the RAQ will be high; if the RAQ is reliable, then if we randomly select some items from it the person's score on those items should also be high. The simplest way to do this in practice is to use split-half reliability. This method randomly splits the data set into two. A score for each participant is then calculated based on each half of the scale. If a scale is very reliable, a person's score on one half of the scale should be the same (or similar) to their score on the other half: therefore, across several participants, scores from the two halves of the questionnaire should correlate perfectly (well, very highly). The correlation between the two halves is the statistic computed in the split-half method, with large correlations being a sign of reliability.
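As a minimal sketch of the idea (an illustration rather than the book's own code; it assumes the 23 RAQ items occupy the first 23 columns of raqData and, for simplicity, ignores the reverse-phrased item 3), one arbitrary split could be checked like this:

oddItems <- raqData[, seq(1, 23, by = 2)]    # one arbitrary half of the items
evenItems <- raqData[, seq(2, 22, by = 2)]   # the other half
halfCorr <- cor(rowMeans(oddItems), rowMeans(evenItems))
halfCorr                                     # correlation between the two halves
(2 * halfCorr)/(1 + halfCorr)                # Spearman-Brown corrected estimate

(The psych package also has a splitHalf() function that does this sort of thing far more thoroughly.)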
The problem with this method is that there are several ways in which a set of data can be split into two, so the results could be a product of the way in which the data were split. To overcome this problem, Cronbach (1951) came up with a measure that is loosely equivalent to splitting the data in two in every possible way and computing the correlation coefficient for each split. The average of these values is equivalent to Cronbach's alpha, α, which is the most common measure of scale reliability.11 Cronbach's α is:

α = (N² × average Cov) / (Σ s²_item + Σ Cov_item)     (17.6)

which may look complicated, but actually isn't. The first thing to note is that for each item on our scale we can calculate two things: the variance within the item, and the covariance between a particular item and any other item on the scale. Put another way, we can construct a variance-covariance matrix of all the items. In this matrix the diagonal elements will be the variance within a particular item, and the off-diagonal elements will be the covariances between pairs of items. The top half of the equation is simply the number of items (N) squared multiplied by the average covariance between items (the average of the off-diagonal elements in the aforementioned variance-covariance matrix). The bottom half is just the sum of all the item variances and item covariances (i.e., the sum of everything in the variance-covariance matrix).

There is a standardized version of the coefficient too, which essentially uses the same equation except that correlations are used rather than covariances, and the bottom half of the equation uses the sum of the elements in the correlation matrix of items (including the ones that appear on the diagonal of that matrix). The normal alpha is appropriate when items on a scale are summed to produce a single score for that scale (the standardized alpha is not appropriate in these cases). The standardized alpha is useful, though, when items on a scale are standardized before being summed.

11 Although this is the easiest way to conceptualize Cronbach's α, whether or not it is exactly equal to the average of all possible split-half reliabilities depends on exactly how you calculate the split-half reliability. If you use the Spearman-Brown formula, which takes no account of item standard deviations, then Cronbach's α will be equal to the average split-half reliability only when the item standard deviations are equal; otherwise α will be smaller than the average. However, if you use a formula for split-half reliability that does account for item standard deviations (such as Flanagan, 1937; Rulon, 1939), then α will always equal the average split-half reliability (see Cortina, 1993).

17.8.2. Interpreting Cronbach's α (some cautionary tales ...) ©

The value of α usually quoted as 'acceptable' is in the range of .7 to .8, although this guideline needs to be treated with caution because α depends on the number of items on the scale: the more items there are, the larger α will tend to be, regardless of whether the scale measures one thing or several. Cortina (1993) demonstrated this: even for a scale made up of two clearly separate factors, given enough items with reasonable intercorrelations (around .5), α can reach values around and above .7 (.65 to .84). These results compellingly show that α should not be used as a measure of 'unidimensionality'. Indeed, Cronbach (1951) suggested that if several factors exist then the formula should be applied separately to the items relating to different factors. In other words, if your questionnaire has subscales, α should be applied separately to these subscales.

The final warning is about items that have a reverse phrasing. For example, in the RAQ data that we used in the factor analysis part of this chapter, we had one item (question 3) that was phrased the opposite way around to all the other items. The item was 'standard deviations excite me'.
Compare this to any other item and you'll see that it requires the opposite response. For example, item 1 is 'statistics make me cry'. Now, if you don't like statistics then you'll strongly agree with this statement and so will get a score of 5 on our scale. For item 3, if you don't like statistics then standard deviations are unlikely to excite you, so you'll strongly disagree and get a score of 1 on the scale. These reverse-phrased items are important for reducing response bias: participants actually have to read the items in case some are phrased the other way around. For factor analysis this reverse phrasing doesn't matter: all that happens is that you get a negative factor loading for any reversed items (in fact, look at Output 17.10 and you'll see that item 3 has a negative factor loading).

Eek! My alpha is negative: is that correct?

However, in reliability analysis these reverse-scored items do make a difference. To see why, think about the equation for Cronbach's α. In this equation, the top half incorporates the average covariance between items. If an item is reverse-phrased then it will have a negative relationship with the other items, so the covariances between this item and the other items will be negative. The average covariance is the sum of the covariances divided by the number of covariances, and by including a bunch of negative values we reduce the sum of covariances, and hence we also reduce Cronbach's α, because the top half of the equation gets smaller. In extreme cases it is even possible to get a negative value for Cronbach's α, simply because the magnitude of the negative covariances is bigger than the magnitude of the positive ones. A negative Cronbach's α doesn't make much sense, but it does happen, and if it does, ask yourself whether you included any reverse-phrased items.

17.8.3. Reliability analysis with R Commander ©

R Commander has a menu for obtaining reliability estimates, but the alpha() function in the psych package is more flexible, so that's the one we use.

17.8.4. Reliability analysis using R ©

Let's test the reliability of the RAQ using the data in RAQ.dat. Remember also that I said we should conduct reliability analysis on any subscales individually. If we use the results from our orthogonal rotation, then we have four subscales:

1 Subscale 1 (Fear of computers): items 6, 7, 10, 13, 14, 15, 18
2 Subscale 2 (Fear of statistics): items 1, 3, 4, 5, 12, 16, 20, 21
3 Subscale 3 (Fear of mathematics): items 8, 11, 17
4 Subscale 4 (Peer evaluation): items 2, 9, 19, 22, 23

(Don't forget that question 3 has a negative sign; we'll need to remember to deal with that.)

First, we'll create four new data sets containing the items for each subscale. We don't need to do that, but it saves a lot of typing later on. We can create these data sets by simply selecting the appropriate columns of the full dataframe (raqData), as described in section 3.9.1:

computerFear <- raqData[, c(6, 7, 10, 13, 14, 15, 18)]
statisticsFear <- raqData[, c(1, 3, 4, 5, 12, 16, 20, 21)]
mathFear <- raqData[, c(8, 11, 17)]
peerEvaluation <- raqData[, c(2, 9, 19, 22, 23)]

Each command takes the raqData dataframe and retains all of the rows (hence no command before the comma) and only the columns specified in the c() function after the comma. For example, the first command creates an object called computerFear that contains only columns 6, 7, 10, 13, 14, 15 and 18 of the dataframe raqData.
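Before handing the work over to a function, it can be reassuring to see equation (17.6) in action. The following is a minimal sketch (my own illustration, assuming the computerFear dataframe we just created) that computes α directly from the variance-covariance matrix of the fear of computers items:

covMat <- cov(computerFear)                 # variance-covariance matrix of the seven items
nItems <- ncol(computerFear)                # N in equation (17.6)
meanCov <- mean(covMat[upper.tri(covMat)])  # average covariance between pairs of items
(nItems^2 * meanCov)/sum(covMat)            # equation (17.6): should match the raw alpha reported below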
Reliability analysis is done with the alpha() function, which is found in the psych package. You might have a problem here, because there is also a function in ggplot2 called alpha(), and if ggplot2 was loaded after psych, its version will take priority. This was covered in R's Souls' Tip 3.4, but to remind you, if you get the wrong alpha() function you can specify the package using:

psych::alpha()

An additional complication that we need to deal with is that pesky item 3, which is negatively scored. We can do one of two things with this item: we can reverse the variable in the data set, or we can tell alpha() that it is negative, using the keys option. This latter option is better because we leave the initial data unchanged (which is useful because we don't get into awkward situations in which we save the data and then can't recall at a later date whether the data contain the reverse-scored or the original scores). To use the keys option we give alpha() a vector of 1s and -1s that matches the number of variables in the data set, using a 1 for a positively scored item and a -1 for a negatively scored item. So for computerFear, which has only positively scored items, we would use:

keys = c(1, 1, 1, 1, 1, 1, 1)

but for statisticsFear, which has item 3 (the negatively scored item) as its second item, we would use:

keys = c(1, -1, 1, 1, 1, 1, 1, 1)

For three of our four subscales we don't need to use the keys option because all items are positively scored, but for statisticsFear we do. To use the alpha() function we simply input the name of the dataframe for each subscale and, where necessary, include the keys option. Therefore, we could run the reliability analysis for our four subscales by executing:

alpha(computerFear)
alpha(statisticsFear, keys = c(1, -1, 1, 1, 1, 1, 1, 1))
alpha(mathFear)
alpha(peerEvaluation)

17.8.5. Interpreting the output ©

Output 17.13 shows the results of this basic reliability analysis for the fear of computing subscale. First, and perhaps most important, the value of alpha at the very top is Cronbach's α: the overall reliability of the scale (you should look at the raw alpha; the raw and standardized values are usually very similar, though). To reiterate, we're looking for values in the range of .7 to .8 (or thereabouts), bearing in mind what we've already noted about effects from the number of items. In this case α is slightly above .8, and is certainly in the region indicated by Kline (1999), so this probably indicates good reliability. Along with alpha, there is a measure labelled G6, short for Guttman's lambda 6; this can be calculated from the squared multiple correlation (hence it's labelled smc).12 The average_r is the average inter-item correlation (from which we can calculate the standardized alpha).

Also in this top section are some scale characteristics. If we calculated someone's score by taking the average of all of their items (which is the same as adding up the scores and dividing by the number of items), we would have a variable with an overall mean of 3.4 and a standard deviation of 0.71.13

12 Fact fiends might be interested to know that Guttman came up with Cronbach's alpha before Cronbach, and called it lambda 3.

13 You can test this by running: describe(apply(raqData[, c(6, 7, 10, 13, 14, 15, 18)], 1, mean)), which gives you a mean of 3.42 and sd = 0.71.
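Incidentally, alpha() returns a list, so the pieces of output discussed below can also be pulled out directly rather than read off the printed output (the component names here are the ones psych's alpha() uses; treat this as a sketch and inspect str() on the object if in doubt):

compRel <- psych::alpha(computerFear)
compRel$total           # overall raw and standardized alpha, G6(smc) and average_r
compRel$alpha.drop      # the 'reliability if an item is dropped' table
compRel$item.stats      # item statistics, including r.cor and r.drop
compRel$response.freq   # response frequencies for each item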
Next, we get a table giving the statistics for the scale if we deleted each item in turn. The values in the column labelled raw_alpha are the values of the overall α if that item isn't included in the calculation. As such, they reflect the change in Cronbach's α that would be seen if a particular item were deleted. The overall α is .82, and so all values in this column should be around that same value. What we're actually looking for is values of alpha greater than the overall α: if the deletion of an item increases Cronbach's α, then deleting that item improves reliability (remember that, other things being equal, scales with more items are more reliable, so removing an item should normally lower alpha). Therefore, any items with values of α in this column greater than the overall α may need to be deleted from the scale to improve its reliability. None of the items here would substantially affect reliability if they were deleted: none of them increases alpha by being deleted. This table also contains the standardized alpha, the G6 and the mean inter-item correlation that you would get if each item were removed.

The next table in the output is labelled Item statistics. The values in the column labelled r are the correlations between each item and the total score from the questionnaire - sometimes called item-total correlations. There's a problem with this statistic: the item is included in the total. That is, if we correlate item 6 with the mean of all items, we're partly correlating the item with itself, so of course it will correlate. We can correct this by correlating each item with all of the other items. Two versions of this are presented, r.cor and r.drop. r.cor is a little complex, so we won't go into it (the help file for alpha() explains it); r.drop is the correlation of the item with the scale total if that item isn't included in the total. Sometimes this is called the item-rest correlation (because it's how the item correlates with the rest of the items) and sometimes it's called the corrected item-total correlation.

Reliability analysis
Call: alpha(x = computerFear)

  raw_alpha std.alpha G6(smc) average_r mean   sd
       0.82      0.82    0.81       0.4  3.4 0.71

Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q06      0.79      0.79    0.77      0.38
Q07      0.79      0.79    0.77      0.38
Q10      0.82      0.82    0.80      0.44
Q13      0.79      0.79    0.77      0.39
Q14      0.80      0.80    0.77      0.39
Q15      0.81      0.81    0.79      0.41
Q18      0.79      0.78    0.76      0.38

Item statistics
       n    r r.cor r.drop mean   sd
Q06 2571 0.74  0.68   0.62  3.8 1.12
Q07 2571 0.73  0.68   0.62  3.1 1.10
Q10 2571 0.57  0.44   0.40  3.7 0.88
Q13 2571 0.73  0.67   0.61  3.6 0.95
Q14 2571 0.70  0.64   0.58  3.1 1.00
Q15 2571 0.64  0.54   0.49  3.2 1.01
Q18 2571 0.76  0.72   0.65  3.4 1.05

Non missing response frequency for each item
[proportion of people giving each response (1-5) to each item; the miss column shows that no responses were missing]

Output 17.13

In a reliable scale all items should correlate with the total. So, we're looking for items that don't correlate with the overall score from the scale: if any of these values of r.drop are less than about .3 then we've got problems, because it means that a particular item does not correlate very well with the scale overall. Items with low correlations may have to be dropped. For these data, all items have corrected item-total correlations above .3, which is encouraging. The table also shows the mean and standard deviation of each item.
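If you want to convince yourself what r.drop means, a quick hedged check (assuming the items keep their Q06, Q07, ... column names) is to correlate an item with the sum of the remaining items:

restScore <- rowSums(computerFear[, names(computerFear) != "Q06"])  # total of the other six items
cor(computerFear$Q06, restScore)   # item-rest correlation: should be close to Q06's r.drop of .62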
The final table in the alpha() output is a table of response frequencies. It tells us what percentage of people gave each response to each of the items. This is useful for checking that everyone in your sample is not giving the same response: an item on which everyone (or almost everyone) gives the same response will almost certainly have poor reliability statistics. As a final point, it's worth noting that if items do need to be removed at this stage then you should rerun your factor analysis as well, to make sure that the deletion of the item has not affected the factor structure.

OK, let's move on to the fear of statistics subscale (items 1, 3, 4, 5, 12, 16, 20 and 21). I won't go through the R output in detail again, but it is shown in Output 17.14. The overall α is .82, and none of the items would increase the reliability if they were deleted. The values in the column labelled r.drop are again all above .3, which is good. In all, this indicates that all items are positively contributing to the overall reliability. The overall α is also excellent because, at .82, it is above .8 and indicates good reliability.

Reliability analysis
Call: alpha(x = statisticsFear, keys = c(1, -1, 1, 1, 1, 1, 1, 1))

  raw_alpha std.alpha G6(smc) average_r mean  sd
       0.82      0.82    0.81      0.37  3.1 0.5

[reliability if an item is dropped, item statistics and response frequency tables: no item's raw_alpha exceeds the overall .82, and every value of r.drop is above .3]

Output 17.14

Reliability analysis
Call: alpha(x = statisticsFear)

  raw_alpha std.alpha G6(smc) average_r mean  sd
       0.61      0.64    0.71      0.18  3.1 0.5

Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q01      0.52      0.56    0.64      0.15
Q03      0.80      0.80    0.79      0.37
Q04      0.50      0.55    0.64      0.15
Q05      0.52      0.57    0.66      0.16
Q12      0.52      0.56    0.65      0.15
Q16      0.51      0.55    0.63      0.15
Q20      0.56      0.60    0.68      0.18
Q21      0.50      0.55    0.63      0.15

Item statistics
       n     r r.cor r.drop mean   sd
Q01 2571  0.68  0.62   0.51  3.6 0.83
Q03 2571 -0.37 -0.64  -0.55  3.4 1.08
Q04 2571  0.69  0.65   0.53  3.2 0.95
Q05 2571  0.65  0.57   0.47  3.3 0.96
Q12 2571  0.67  0.62   0.50  2.8 0.92
Q16 2571  0.70  0.66   0.53  3.1 0.92
Q20 2571  0.55  0.45   0.35  2.4 1.04
Q21 2571  0.70  0.66   0.54  2.8 0.98

Non missing response frequency for each item
       1    2    3    4    5 miss
Q01 0.02 0.07 0.29 0.52 0.11    0
Q03 0.03 0.17 0.34 0.26 0.19    0
Q04 0.05 0.17 0.36 0.37 0.05    0
Q05 0.04 0.18 0.29 0.43 0.06    0
Q12 0.09 0.23 0.46 0.20 0.02    0
Q16 0.06 0.16 0.42 0.33 0.04    0
Q20 0.22 0.37 0.25 0.15 0.02    0
Q21 0.09 0.29 0.34 0.26 0.02    0

Output 17.15

Output 17.15 shows what happens if we forget to reverse-score item 3 and run the reliability analysis on the original data (i.e., without item 3 being reverse scored by using the keys option). Note that the overall α is considerably lower (.61 rather than .82). Also, note that this item has a negative item-total correlation (which is a good way to spot whether you have a potential reverse-scored item in the data that hasn't been reverse scored). Finally, note that for item 3 the α if the item is deleted is .8: if this item were deleted then the reliability would improve from about .6 to about .8.
This, I hope, illustrates that failing to reverse-score items that have been phrased oppositely to the other items on the scale will mess up your reliability analysis.

Moving swiftly on to the fear of maths subscale (items 8, 11 and 17), Output 17.16 shows the output from the analysis. As with the previous two subscales, the overall α is around .8, which indicates good reliability. The values of alpha if an item is deleted indicate that none of the items would increase the reliability if deleted, because all of the values in this column are less than the overall reliability of .82. The values of the corrected item-total correlations (r.drop) are again all above .3, which is good.

Reliability analysis
Call: alpha(x = mathFear)

  raw_alpha std.alpha G6(smc) average_r mean   sd
       0.82      0.82    0.75       0.6  3.7 0.75

Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q08      0.74      0.74    0.59      0.59
Q11      0.74      0.74    0.59      0.59
Q17      0.77      0.77    0.63      0.63

Item statistics
       n    r r.cor r.drop mean   sd
Q08 2571 0.86  0.76   0.68  3.8 0.87
Q11 2571 0.86  0.75   0.68  3.7 0.88
Q17 2571 0.85  0.72   0.65  3.5 0.88

Non missing response frequency for each item
       1    2    3    4    5 miss
Q08 0.03 0.06 0.19 0.58 0.15    0
Q11 0.02 0.06 0.22 0.53 0.16    0
Q17 0.03 0.10 0.27 0.52 0.08    0

Output 17.16

Finally, if you run the analysis for the final subscale of peer evaluation, you should get Output 17.17. Unlike the previous subscales, the overall α is quite low at .57, and although this is in keeping with what Kline says we should expect for this kind of social science data, it is well below the other scales. The values of alpha if an item is dropped indicate that none of the items would increase the reliability if deleted, because all of the values in this column are less than the overall reliability of .57. The values of r.drop are all around .3, and for item 23 the value is actually below .3, which indicates fairly poor internal consistency and identifies item 23 as a potential problem. The scale has five items, compared to seven, eight and three on the other scales, so its reduced reliability is not going to be dramatically affected by the number of items (in fact, it has more items than the fear of maths subscale). If you look at the items on this subscale, they cover quite diverse themes of peer evaluation, and this might explain the relative lack of consistency. This might lead us to rethink this subscale.

Reliability analysis
Call: alpha(x = peerEvaluation)

  raw_alpha std.alpha G6(smc) average_r mean   sd
       0.57      0.57    0.53      0.21  3.4 0.65

Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q02      0.52      0.52    0.45      0.21
Q09      0.48      0.48    0.41      0.19
Q19      0.52      0.53    0.46      0.22
Q22      0.49      0.49    0.43      0.19
Q23      0.56      0.57    0.50      0.25

Item statistics
       n    r r.cor r.drop mean   sd
Q02 2571 0.61  0.45   0.34  4.4 0.85
Q09 2571 0.66  0.53   0.39  3.2 1.26
Q19 2571 0.60  0.42   0.32  3.7 1.10
Q22 2571 0.64  0.50   0.38  3.1 1.04
Q23 2571 0.53  0.31   0.24  2.6 1.04

Non missing response frequency for each item
       1    2    3    4    5 miss
Q02 0.01 0.04 0.08 0.31 0.56    0
Q09 0.08 0.28 0.23 0.20 0.20    0
Q19 0.02 0.15 0.22 0.33 0.29    0
Q22 0.05 0.26 0.34 0.26 0.10    0
Q23 0.12 0.42 0.27 0.12 0.06    0

Output 17.17
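Out of curiosity (this is a what-if sketch, not an analysis reported in the chapter), you could see how the peer evaluation subscale behaves without item 23; remember that if you really did drop items you would need to rerun the factor analysis too:

peerEvalReduced <- peerEvaluation[, names(peerEvaluation) != "Q23"]   # drop item 23
psych::alpha(peerEvalReduced)   # does reliability improve without the weakest item?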
CRAMMING SAM'S TIPS   Reliability

• Reliability is really the consistency of a measure.
• Reliability analysis can be used to measure the consistency of a questionnaire.
• Remember to deal with reverse-scored items; use the keys option when you run the analysis.
• Run separate reliability analyses for all subscales of your questionnaire.
• Cronbach's α indicates the overall reliability of a questionnaire, and values around .8 are good (or .7 for ability tests and suchlike).
• The raw alpha when an item is dropped tells you whether removing that item would improve the overall reliability: values greater than the overall reliability indicate that removing the item will improve the overall reliability of the scale. Look for items that dramatically increase the value of α.
• If you do remove items, rerun your factor analysis to check that the factor structure still holds!

17.9. Reporting reliability analysis ©

You can report the reliability of each subscale simply by quoting the value of Cronbach's α; note that there is no zero before the decimal point because α cannot be bigger than 1 (assuming you are following APA format):

• The fear of computers, fear of statistics and fear of maths subscales of the RAQ all had high reliabilities, all Cronbach's α = .82. However, the fear of negative peer evaluation subscale had relatively low reliability, Cronbach's α = .57.

However, the most common way to report a reliability analysis when it follows a factor analysis is to report the values of Cronbach's α as part of the table of factor loadings. For example, in Table 17.1 notice that in the last row of the table I have quoted the value of Cronbach's α for each subscale in turn.

What have I discovered about statistics? ©

This chapter has made us tiptoe along the craggy rock face that is factor analysis. This is a technique for identifying clusters of variables that relate to each other. One of the difficult things with statistics is realizing that they are subjective: many books (this one included, I suspect) create the impression that statistics are like a cookbook and that if you follow the instructions you'll get a nice tasty chocolate cake (yum!). Factor analysis, perhaps more than any other test in this book, illustrates how incorrect this is. The world of statistics is full of arbitrary rules that we probably shouldn't follow (.05 being the classic example), and nearly all of the time, whether you realize it or not, we should act upon our own discretion. So, if nothing else, I hope you've discovered enough to give you sufficient discretion about factor analysis to act upon!

We saw that the first stage of factor analysis is to scan your variables to check that they relate to each other to some degree but not too strongly. The factor analysis itself has several stages: check some initial issues (e.g., sample size adequacy), decide how many factors to retain, and finally decide which items load on which factors (and try to make sense of the meaning of the factors). Having done all that, you can consider whether the items you have are reliable measures of what you're trying to measure.

We also discovered that at the age of 23 I took it upon myself to become a living homage to the digestive system. I furiously devoured articles and books on statistics (some of them I even understood), I mentally chewed over them, I broke them down with the stomach acid of my intellect, I stripped them of their goodness and nutrients, I compacted them down, and after about two years I forced the smelly brown remnants of those intellectual meals out of me in the form of a book. I was mentally exhausted at the end of it; 'It's a good job I'll never have to do that again', I thought.

R packages used in this chapter

corpcor
GPArotation
psych
matrix () c() cat() cbind() cor() cortest.bartlett() det() factor, model () factor.residuals() factor.structureO ggpiot2() Key terms that I've Alpha factoring Bartlett's test of sphericity Common variance Communality Component matrix Confirmatory factor analysis (CFA) Cronbach's a Direct obllmln Extraction Factor Factor analysis Factor loading Factor matrix Factor scores Factor transformation matrix, Kaiser's criterion histO kmofj meanO nrowO plotO polychorO principal print.psychO residual.statsO round 0 sqrtO sumO upper.triO discovered Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy Latent variable Oblique rotation Orthogonal rotation Pattern matrix Principal components analysis (PCA) Promax Quartlmax Random variance Rotation Scree plot Singularity Split-half reliability Structure matrix Unique variance Varimax Smart Alex's tasks _ • Task 1: The University of Sussex is constantly seeking to employ the hest p possible as lecturers (no, really, it is). Anyway, they wanted to revise a question based on Bland's theory of research methods lecturers. This theory predicts t aw irch methods lecturers should have four characteristics: (1) a Pr0^°"n^B ',-"'"'ni for experimental design; (3) a love of teachITj characteristics W reseai— of statistics; (2) an e enthusiasm tor e*Pc"1"-". skiUs. These cnai*-~ ifl- be related (i.e., ^ . h umversity revise Revised (TOSSE) already Scientific Experiments - Re lu„ 'Teaching of Statistics to CHAPTER 17 EXPLORATORY FACTOR ANALYSIS gave this questionnaire to 239 research methods lecturers around the world to see if it supported Bland's theory. The questionnaire is in Figure 17.9, and the data are in TOSSE.R.dat. Conduct a factor analysis (with appropriate rotation) to see the factor structure of the data. © SD = Strongly Disagree, D = Disagree, N = Neither, A = Agree, SA = Strongly Agree 809 I 10 11 12 13 14 15 16 17 18 19 20 21 22 24 SD D N A SA I once woke up in the middle of a vegetable patch hugging a turnip that I'd mistakenly dug up thinking it was Roy's largest root O o o o o If I had a big gun I'd shoot all the students I have to teach o o o o o I memorize probability values for the F-distribution o o o o o I worship at the shrine of Pearson o o o o o I still live with my mother and have little personal hygiene o o o o Teaching others makes me want to swallow a large bottle of bleach o o o o o because the pain of my burning oesophagus would be light relief in comparison Helping others to understand sums of squares is a great feeling o o o o o I like control conditions o o o o o I calculate three ANOVAs in my head before getting out of bed every o o* o o o morning I could spend all day explaining statistics to people o o o o o I like it when people tell me I've helped them to understand factor o o o o o rotation People fall asleep as soon as I open my mouth to speak o o o o o Designing experiments is fun o o o o 0 I'd rather think about appropriate dependent variables than go to the pub o o o o o I soil my pants with excitement at the mere mention of factor analysis o . 
16 Thinking about whether to use repeated or independent measures thrills me
17 I enjoy sitting in the park contemplating whether to use participant observation in my next experiment
18 Standing in front of 300 people in no way makes me lose control of my bowels
19 I like to help students
20 Passing on knowledge is the greatest gift you can bestow on an individual
21 Thinking about Bonferroni corrections gives me a tingly feeling in my groin
22 I quiver with excitement when thinking about designing my next experiment
23 I often spend my spare time talking to the pigeons ... and even they die of boredom
24 I tried to build myself a time machine so that I could go back to the 1930s and follow Fisher around on my hands and knees licking the floor on which he'd just trodden
25 I love teaching
26 I spend lots of time helping students
27 I love teaching because students have to pretend to like me or they'll get bad marks
28 R is my only friend

FIGURE 17.9   The Teaching of Statistics for Scientific Experiments - Revised (TOSSE-R)

• Task 2: Dr Sian Williams (University of Brighton) devised a questionnaire to measure organizational ability. She predicted five factors to do with organizational ability: (1) preference for organization; (2) goal achievement; (3) planning approach; (4) acceptance of delays; and (5) preference for routine. These dimensions are theoretically independent. Williams's questionnaire (Figure 17.10) contains 28 items using a 7-point Likert scale (1 = strongly disagree, 4 = neither, 7 = strongly agree). She gave it to 239 people. Run a principal components analysis on the data in Williams.dat. ©

Answers can be found on the companion website.

1  I like to have a plan to work to in everyday life
2  I feel frustrated when things don't go to plan
3  I get most things done in a day that I want to
4  I stick to a plan once I have made it
5  I enjoy spontaneity and uncertainty
6  I feel frustrated if I can't find something I need
7  I find it difficult to follow a plan through
8  I am an organized person
9  I like to know what I have to do in a day
10 Disorganized people annoy me
11 I leave things to the last minute
12 I have many different plans relating to the same goal
13 I like to have my documents filed and in order
14 I find it easy to work in a disorganized environment
15 I make 'to do' lists and achieve most of the things on it
16 My workspace is messy and disorganized
17 I like to be organized
18 Interruptions to my daily routine annoy me
19 I feel that I am wasting my time
20 I forget the plans I have made
21 I prioritize the things I have to do
22 I like to work in an organized environment
23 I feel relaxed when I don't have a routine
24 I set deadlines for myself and achieve them
25 I change rather aimlessly from one activity to another during the day
26 I have trouble organizing the things I have to do
27 I put tasks off to another day
28 I feel restricted by schedules and plans

FIGURE 17.10   Williams's organizational ability questionnaire
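If you want a hedged starting point for these tasks (not the answers, which are on the companion website), the data files can be read and an initial unrotated PCA inspected like this, assuming, as elsewhere in the book, that the .dat files are tab-delimited:

tosseData <- read.delim("TOSSE.R.dat", header = TRUE)
williamsData <- read.delim("Williams.dat", header = TRUE)
pcInitial <- principal(williamsData, nfactors = ncol(williamsData), rotate = "none")
plot(pcInitial$values, type = "b")   # scree plot of eigenvalues to help decide how many components to keep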
Further reading

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104. (A very readable paper on Cronbach's α.)

Dunteman, G. E. (1989). Principal components analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-069. Newbury Park, CA: Sage. (This monograph is high level but comprehensive.)

Pedhazur, E., & Schmelkin, L. (1991). Measurement, design and analysis. Hillsdale, NJ: Erlbaum. (Chapter 22 is an excellent introduction to the theory of factor analysis.)

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. (Chapter 13 is a technical but wonderful overview of factor analysis.)

Interesting real research

Nichols, L. A., & Nicki, R. (2004). Development of a psychometrically sound internet addiction scale: A preliminary step. Psychology of Addictive Behaviors, 18(4), 381-384.