16 Repeated measures ANOVA

Contents
16.1 Chapter overview
16.2 Modeling correlated or repeated measures
16.3 ANOVA with repeated measures
16.4 Combining independent and repeated measures: mixed ANOVA designs
16.5 Comparisons, contrasts and simple effects with repeated measures
16.6 MANOVA
16.7 ANCOVA with repeated measures
16.8 R code for Chapter 16
16.9 Notes on SPSS syntax for Chapter 16
16.10 Bibliography and further reading

16.1 Chapter overview

This chapter introduces repeated measures and mixed measures ANOVA as methods for dealing with correlated measures arising from paired, repeated or matched designs. Advantages and disadvantages of repeated measures models are considered, with focus on the increased statistical power arising when individual differences are prevalent, and on the assumptions of sphericity and multisample sphericity. Later sections provide a brief overview of several related models including MANOVA, repeated measures ANCOVA and analysis of gain scores.

16.2 Modeling correlated or repeated measures

A typical least squares regression model assumes measures are independent. For measures to be independent, each observation in the analysis should carry no information about the value of any other observation. Specifically, it should provide no additional information over and above that already accounted for by the structure of the model (e.g., factor or covariate values). If people are measured only once and randomly allocated to the conditions of an experiment this assumption is likely to be reasonable.1 It is often plausible even if random assignment is not possible. On the other hand, if people are measured more than once, the responses for a given person are almost certainly correlated. This gives rise to what is termed a 'repeated measures' or 'within-subjects' design. Correlated measures can also arise by design or because of the way the world is structured (e.g., if people are tested in pairs, or participants are sampled from pre-existing groups). For field research with human participants, this kind of clustering is probably the norm rather than the exception. People have a tendency to create and associate within groups in many aspects of their lives: families, classrooms, schools or teams being good examples.

Correlated measures have historically been regarded as nuisance variables to be eliminated or controlled by careful experimental design (not least because independent measures are easier to work with). Correlated measures can also be viewed as an asset. A repeated measures or matched design might be selected because the resulting model has greater statistical power or precision. It is also increasingly common, as statistical models for correlated measures have become more sophisticated, that the clusters or correlated measures themselves are a focus of research. A researcher may be interested in the performance of individual children in a school, but they may also be interested in the performance of different schools. The main focus of this chapter is on ANOVA models with repeated measures on one or more factors. Multilevel regression models (which attempt to model the clustered structure of the data directly) are considered in a subsequent chapter.

16.3 ANOVA with repeated measures

Correlated measures present an obvious problem for a least squares regression model that assumes residuals are sampled from an independent, normal population of errors.
In the simplest repeated measures models there are only two measurement points: the familiar paired design. By using the difference between pairs as the outcome Y it is possible to treat the differences as independent observations (e.g., in a one sample t test). This won't work if there are more than two levels on a factor or for a factorial design, but a similarly creative solution is feasible. In a one-way independent measures design, the ANOVA parameterization of the model (with the usual sum-to-zero constraint for τ_j) is:

$y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

What if you represent the deviations of each participant from the grand mean in a similar way? This gives a repeated measures model of the form:

$y_{ij} = \mu + \pi_i + \tau_j + \varepsilon_{ij}$    Equation 16.1

In this model both τ_j and π_i must each sum to zero. Figure 16.1 shows the deviations of participant and level means for a trivial repeated measures ANOVA model with three levels and only two participants. The π_i term represents the deviation of each person's own mean (over all J levels of the factor) from the grand mean (after stripping out the average effect of the factor τ). At first glance this looks almost exactly like a two-way independent measures ANOVA model incorporating only main effects (and no interaction). An important difference is that whereas the treatment τ is a fixed factor, π is a random factor.2,3 Participants are viewed as randomly sampled from an infinite population, whereas the levels of the treatment are considered completely representative of the population of interest. The variance accounted for by the π_i term is due to systematic differences between participants: individual differences in Y (often termed between-subject variance). The shared subscript i of the participant and error terms indicates what is happening. In one-way independent measures ANOVA the variance not captured by the factor is subsumed within the ε_ij term, whereas in the repeated measures design it is split between ε_ij and π_i.

[Figure 16.1 Deviations from the grand mean for a one-way repeated measures ANOVA, with three time points and two participants]

This model is implausible because it assumes that the effect of individual differences is exactly identical for all levels of the treatment. The model can be improved by permitting individual differences to vary across levels of τ. Allowing individual differences to vary in this way requires adding an interaction term:

$y_{ij} = \mu + \pi_i + \tau_j + \tau\pi_{ij} + \varepsilon_{ij}$    Equation 16.2

This type of model is the one implemented in repeated measures ANOVA analysis, and if a subject-by-treatment interaction is present it will, in theory, produce more accurate tests of the factor (see Howell, 2002).4 Estimating this model in a repeated measures design is challenging because a typical application has exactly one measurement per participant in each combination of the levels of the fixed factor.5 With only one observation per cell for the π × τ interaction, the ε_ij and τπ_ij terms cannot be estimated separately (and are said to be completely 'aliased'). This is not a problem for the test of the fixed factor (because its error term is formed from estimates of both the ε_ij and τπ_ij terms). Problems do arise for the test of the π_i subject term. Although a large F ratio provides evidence of systematic individual differences, a small F ratio is ambiguous (Kirk, 1995; Howell, 2002). It could occur because π_i is small relative to ε_ij or because τπ_ij is large relative to π_i (e.g., if π_i ≪ τπ_ij).
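To make the structure of Equation 16.2 concrete, the model can be simulated directly. The following is a minimal sketch in R; the variance components and object names are made up for illustration:

```r
# Simulate data from the non-additive model of Equation 16.2:
# y_ij = mu + pi_i + tau_j + taupi_ij + eps_ij
set.seed(42)
n <- 30                                      # participants
J <- 3                                       # levels of the repeated measures factor
mu <- 100                                    # grand mean
tau <- c(-5, 0, 5)                           # fixed treatment effects (sum to zero)
pi_i <- rnorm(n, sd = 10)                    # random subject effects
taupi <- matrix(rnorm(n * J, sd = 4), n, J)  # subject-by-treatment interaction
eps <- matrix(rnorm(n * J, sd = 5), n, J)    # experimental error
y <- mu + outer(pi_i, tau, "+") + taupi + eps
# With one observation per cell, taupi and eps cannot be separated in the
# analysis: only their sum is visible in the residual error term.
```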
It is possible to test whether the additive or non-additive interaction model is appropriate (see Kirk, 1995), but the choice of model ought to be decided a priori (Howell, 2002). The assumption of zero subject-by-treatment interaction is probably unjustified in research with human participants. To understand why, imagine a memory experiment comparing recall of own-race and other-race faces. If the subject-by-face interaction term were excluded from the model, this would be equivalent to assuming that individual differences in memory for faces are unaffected by the race of the face. This is highly implausible (e.g., it is contradicted by research suggesting that lifetime exposure to other-race faces influences ability to remember them). It is hard to come up with examples where one would be confident that individual differences are constant over all levels of a fixed factor unless individual differences or factor effects are themselves negligible. If τπ_ij is zero there will be no appreciable loss of power through adopting the non-additive model.

What about the test of individual differences? The mere fact that you have decided to adopt a repeated measures model implies that you believe a priori that individual differences are present. A null hypothesis significance test (NHST) - particularly one with low statistical power - should not reverse that belief. Therefore a test of the π_i subject effect is only rarely of interest to researchers (and rarely reported or interpreted).

16.3.1 Advantages of repeated measures designs

A repeated measures design has one principal advantage over an independent measures design. In situations where both designs are feasible, it produces more accurate estimates of the fixed factor F ratio. This follows from the way in which the error term is constructed (and the logic of expected mean squares explained in Box 16.1). The error term in an independent measures design is influenced by at least two different sources of variation in the population. These are experimental error and systematic between-subject variation (i.e., individual differences):

$\sigma^2_{\varepsilon} = \sigma^2_{error} + \sigma^2_{subjects}$    Equation 16.3

In a repeated measures design, the individual differences can be separated out from the error term to get a purer estimate of error (i.e., $\sigma^2_{\varepsilon} = \sigma^2_{error}$). This is possible because multiple observations for each person allow $\sigma^2_{subjects}$ to be estimated from the variation between them. Having one observation per person per level of a factor (the state of affairs in an independent measures design) makes it impossible to untangle systematic individual differences from other sources of error. You can't tell if someone has scored high through chance or because their true score in the population is high.

There may be other advantages to repeated measures designs. One is practical: fewer participants need to be recruited to a study to obtain N data points. This is sometimes helpful if participants are scarce or difficult to recruit (because it allows a researcher to maximize the value of each participant's contribution). A further advantage of the repeated measures model is that the same equations also apply to matched or stratified designs. The calculations do not require that the same person be measured several times, just that observations are correlated with a particular structure. Equivalent correlations between measures arise for observations matched at the individual level (e.g., if each person is matched with a similar person in one or more comparison or control conditions).
The main difference between matched and repeated measures designs is in their interpretation. A repeated measures design (assuming random or otherwise representative sampling) generalizes to all members of the population sampled, while a matched design generalizes only to the population of matched sets of individuals (Kirk, 1995). The matched design and repeated measures design thus reach similar conclusions only to the extent that matching is representative of the wider population. The difficulty of finding matches for some people might introduce bias into the sampling. Kirk also discusses the relative merits of matching (stratification) and ANCOVA for increasing statistical power. ANOVA with matched observations tends to be superior for low correlations between the confounding variable and the response, while ANCOVA is superior for high correlations. In practice, the difficulty of obtaining good matches makes ANCOVA more popular. In addition, ANCOVA usually assumes a linear effect of the covariate. Matching is a better strategy if the effect is curvilinear (Maris, 1998).

These characteristics imply certain disadvantages of repeated measures designs. The principal disadvantage is that, because measurements are typically spread over time, they may be influenced by the order in which they are obtained. If the order of measurement is controlled (e.g., by randomization or by counterbalancing) order effects can be reduced and perhaps eliminated, though this is only possible if the fixed factors can be manipulated by the experimenter. If order effects are controlled the study is a true experiment. Randomizing or counterbalancing the order of repeated measurements is essential if you want to infer that an experimental manipulation is causing differences in the outcome.

Order effects come in many different flavors (e.g., practice effects, fatigue effects or carry-over effects) and are a particularly dangerous source of confounding. If the experimenter is not able to control the order of measurement it may be possible to reduce the influence of order effects (e.g., building in breaks to reduce fatigue effects) but not eliminate them. These additional controls are also useful for counterbalanced or randomized orders of testing, where they should reduce error variance. It is also useful to model order of testing as a covariate. This strategy may increase statistical power in the same way that a covariate can in ANCOVA.

A final disadvantage of repeated measures designs is the increased complexity of the analysis and the assumptions of the model. Repeated measures ANOVA models can be inflexible. Analysis of an unbalanced design (in which some repeated measures are missing) is a good illustration. Multiple imputation provides a potential solution, if data are missing at random (MAR) or missing completely at random (MCAR).6 A more flexible alternative for missing outcome data (one that overcomes these and other ANOVA limitations) is to use a multilevel model.

Box 16.1 Expected mean squares in one-way ANOVA

The smaller error term for a repeated measures design leads to tests with greater statistical power. Consider a pair of one-way designs with exactly the same sampling strategy and experimental manipulations: one with an independent measures design and one with a repeated measures design.
All other things being equal, the test of the treatment (fixed factor) in each experiment would address the same null hypothesis: $H_0: \mu_1 = \mu_2 = \cdots = \mu_J$ (i.e., $\sigma^2_{factor} = 0$). The F ratio for the independent measures design is an estimate of the following ratio in the population:

$F_{independent} = \dfrac{\sigma^2_{factor} + \sigma^2_{error} + \sigma^2_{subjects}}{\sigma^2_{error} + \sigma^2_{subjects}}$

The true population variance associated with systematic differences between levels of the fixed factor $\sigma^2_{factor}$, systematic individual differences $\sigma^2_{subjects}$, and experimental error $\sigma^2_{error}$ will be unknown, but imagine that they are 15, 10 and 5 respectively. The expected value of the F ratio for the values stated earlier will be:

$E(F_{independent}) = \dfrac{15 + 10 + 5}{10 + 5} = \dfrac{30}{15} = 2$

Doing the same arithmetic for the repeated measures design gives:

$E(F_{repeated}) = \dfrac{\sigma^2_{factor} + \sigma^2_{error}}{\sigma^2_{error}} = \dfrac{15 + 5}{5} = \dfrac{20}{5} = 4$

This illustrates why a repeated measures design has greater statistical power, and results in a more sensitive test of H0. A crucial feature of the expected F ratio is that, regardless of the design, F = 1 if the population effect of the fixed factor is zero. The numerator and denominator of the statistics are identical when $\sigma^2_{factor} = 0$. The increased power of the repeated measures design therefore depends on the presence of systematic individual differences between the participants. The larger the population variance attributable to individual differences, the more sensitive the repeated measures design is (relative to an independent measures design). As the increased sensitivity to the effects of the factor is a consequence of the way that error variance is estimated, the statistical power advantage also applies to any inferential tool where the precision of estimation and hence estimation of error variance is fundamental (e.g., confidence intervals, as well as likelihood and Bayesian inference).

Up to this point, a number of technical points about expected F ratios have been glossed over (Howell, 2002; Kirk, 1995). A more detailed presentation of expected mean squares for one-way independent measures ANOVA takes the form:

$E(MS_{factor}) = \sigma^2_{error} + n\sigma^2_{factor}$
$E(MS_{error}) = \sigma^2_{error}$

The expected mean squares for one-way repeated measures ANOVA depend on the presence of subject-by-treatment interactions. As the complete absence of subject-by-treatment variance is somewhat implausible, it is usual to adopt a structural model with interactions included. The mean squares for this structural model are:

$E(MS_{subjects}) = \sigma^2_{error} + J\sigma^2_{subjects}$
$E(MS_{factor}) = \sigma^2_{error} + \sigma^2_{subjects \times factor} + n\sigma^2_{factor}$
$E(MS_{error}) = \sigma^2_{error} + \sigma^2_{subjects \times factor}$

Huynh and Feldt suggested that their estimate ε̃_hf should be preferred when ε is thought to be greater than .75. They also argued that it keeps the Type I error rate closer to nominal α than ε̂_gg does. Subsequent authors have tended to stick to this view, advocating ε̂_gg when ε is thought to be close to its lower bound and ε̃_hf when ε is thought to be high (e.g., greater than .75 or .80).
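As an illustration of how these estimates are derived from the sample covariance matrix of the repeated measures, here is a minimal sketch in R (the function names are my own; scores are assumed to be held in an n × J matrix Y):

```r
# Greenhouse-Geisser epsilon, computed from the covariance matrix of the J levels
gg_epsilon <- function(Y) {
  S <- cov(Y)                      # sample covariance matrix of the levels
  J <- ncol(S)
  num <- (J * (mean(diag(S)) - mean(S)))^2
  den <- (J - 1) *
    (sum(S^2) - 2 * J * sum(rowMeans(S)^2) + J^2 * mean(S)^2)
  num / den
}

# Huynh-Feldt epsilon for a one-group repeated measures design: a less biased
# adjustment that can exceed 1 (software often truncates it to 1 when applied)
hf_epsilon <- function(Y) {
  n <- nrow(Y); J <- ncol(Y)
  e <- gg_epsilon(Y)
  (n * (J - 1) * e - 2) / ((J - 1) * (n - 1 - (J - 1) * e))
}
```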
An important issue is how to decide whether sphericity is violated in the first place. One option is to perform a test of the null hypothesis that ε = 1. The best-known such test (implemented in a number of software packages) is Mauchly's sphericity test. This kind of test of assumptions addresses a largely irrelevant hypothesis. What matters is the degree of violation rather than its presence. Furthermore, the Mauchly test is neither robust to violations of normality nor high in statistical power. It is therefore wise to ignore the procedure entirely. Instead, focus on ε̂_gg and ε̃_hf. The average of these conservative and liberal estimates provides a reasonable guide to the extent of any sphericity violation. If this average is close to its lower bound then the Greenhouse-Geisser correction is a safe choice. If the average is around .75 or better ε̃_hf should offer greater power (with only modest risk of Type I error inflation). If both estimates are close to 1 (e.g., .95 or above) the uncorrected F test is defensible (with at most modest Type I error inflation). These decisions should be tempered by the relative cost of Type I or Type II errors. If Type I errors are considered the more costly of the two then ε̂_gg is a good choice. If statistical power is the main concern then ε̃_hf should be preferred (unless the degree of sphericity violation is severe). There is a surprising amount published on the relative merits of different sphericity corrections. Recent work has focused more on the relative power of different procedures (see Kirk, 1995).

Two further strategies for dealing with sphericity ought to be considered. Probably the best strategy is to avoid tests of multiple df effects altogether. Repeated measures contrasts have 1 df and therefore make it possible to avoid sphericity concerns altogether. A second strategy is to use a type of multivariate analysis of variance (MANOVA) known as profile analysis.8 This MANOVA approach will, under some conditions, provide more powerful tests than epsilon-corrected F ratios (Kirk, 1995). The precise conditions under which this occurs depend, in a somewhat unpredictable way, on the form of the population covariance matrix. A crude survey of the literature finds that MANOVA tends to provide greater power (relative to corrections using ε̂_gg) when n and the number of repeated measures are large or if ε is close to its lower bound. With small n and few repeated measures experience suggests that MANOVA only rarely reports statistical significance when the Greenhouse-Geisser test does not.9

Example 16.1 Uppal (2006) investigated young children's ability to recognize different emotions. Particular interest focused on the ability to recognize and discriminate pride from other emotions (e.g., happiness or surprise). She showed 90 children (aged between seven and nine years) several sets of pictures, each showing actors expressing either pride, happiness or surprise. For each set the children were asked to point to the picture that expressed one of the three emotions (as cued by the experimenter). Subsequent examples will refer to this data set as the pride data.

Mean accuracy for the three picture types (averaging over other conditions of the experiment not considered here) was 68.1% for pride, 71.1% for happiness and 78.9% for surprise. Table 16.3 summarizes the output for one-way repeated measures ANOVA with emotion as the fixed factor. The main effect of emotion is not statistically significant, F(2, 178) = 1.82, MSe = 1,544, p = .165. No differences are detected in average accuracy of recognizing the three emotions.

Table 16.3 Table for one-way ANOVA on the pride data

Source                        df     SS        MS      F
Subjects                      89     241,262
Emotion                       2      5,616     2,808   1.82
Error (emotion × subjects)    178    274,801   1,544
Total                         269

In this case there is little reason to worry about sphericity. Even though there are more than two levels on the repeated measures factor, a sphericity correction would only decrease statistical significance (i.e., make the non-significant p value less significant). For completeness, ε̂_gg = .9879 and ε̃_hf = 1.0102. Both suggest no violation of sphericity, and the upward bias of ε̃_hf is evident (as it exceeds the maximum value of the parameter it is estimating).
The Greenhouse-Geisser correction (if implemented) would produce an almost identical main effect: F(1.98, 175.84) = 1.82, p = .166 (although no correction is warranted when both ε estimates are so close to the upper bound).

The main effect of emotion could also be tested via likelihood or Bayesian inference. AIC for the main effect model is 2808.2, while AIC for the intercept-only model (modeling accuracy of each emotion by the grand mean) is 2807.9. ΔAIC = 0.4, equivalent to a likelihood ratio of 1.2 in favor of the grand mean model.

16.3.5 Confidence intervals for repeated measures ANOVA

Calculating confidence intervals (CIs) for a repeated measures ANOVA presents additional difficulties that do not arise in an independent measures design. The standard approach in an independent measures design is to plot error bars around each mean using a pooled standard error. This is useful for depicting the precision with which each mean is measured. These can be adjusted to support inference about differences between means (and for other factors such as multiple testing). The same approach is problematic for repeated measures designs because CIs based on individual means (with or without a pooled error term) may appear to be excessively wide, as they incorporate variance due to systematic individual differences. This variance is eliminated from tests of differences between means in repeated measures designs. Loftus and Masson (1994) argue that a better approach, for ANOVA, is to plot CIs that similarly exclude individual differences. These intervals tend to be much narrower (at least when individual differences are substantial) and therefore give a clearer indication of the presence of systematic differences between means.

Loftus and Masson (ibid.) proposed methods for calculating CIs for repeated measures designs using the error term from the corresponding ANOVA. Being based on the same error term, the width of a CI for an individual mean is related to the CI of a difference between means by the familiar factor of √2. Although these interval estimates are fairly widely employed, the Loftus-Masson approach has a number of drawbacks (Cousineau, 2005; Morey, 2008). One is that for effects with multiple df the pooled error term will be inappropriate if sphericity (equality of variances of differences between means) is not met. A further problem is that they can be awkward to calculate. Cousineau (2005) proposed a simple alternative that is equivalent to the Loftus-Masson method when paired designs are employed. Morey (2008) showed that the intervals proposed by Cousineau tend to be too narrow (and explained how to correct for this problem).

Like the Loftus-Masson approach, the Cousineau-Morey method attempts to strip out individual differences from the calculation of the interval estimate. Where the methods differ is that the Cousineau-Morey interval removes individual differences directly from the data prior to analysis. This is achieved by participant mean centering (subtracting the mean of each participant from their observed scores). While this strips out the individual differences, it also alters the mean score per condition. The remedy for this problem is to add the grand mean back on to each score.
This process of participant mean centering followed by addition of the grand mean is termed normalizing (Loftus and Masson, 1994; Masson and Loftus, 2003).10 Normalizing relocates all condition effects relative to the grand mean rather than participant means (and therefore relative to an average participant).

The discussion below assumes a one-way repeated measures ANOVA design with J levels (representing J different experimental conditions). If y_ij is the score of the ith participant in condition j, and ȳ_i is the mean of participant i across all J levels, normalized scores can be expressed as:

$y_{norm_{ij}} = y_{ij} - \bar{y}_i + \bar{y}_{grand}$

Why does computing a CI based on the normalized scores lead to intervals that are too narrow? It happens because adding the grand mean to all values induces a positive correlation between the levels that wasn't there before (Morey, 2008). The degree of positive correlation is related to the number of levels J (being largest when J = 2 and decreasing as J rises). As the normalized scores are positively correlated, the estimate of error variance computed from them is lower by a factor of (J − 1)/J than it would be for the original scores (and hence the CI is too narrow). This factor can be used to correct the CI computed from normalized scores. Thus Morey (2008) suggests computing a CI of the form:

$\bar{y}_{norm_j} \pm t_{n-1} \sqrt{\dfrac{J}{J-1}} \, \hat{\sigma}_{\bar{y}_{norm_j}}$
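A minimal sketch of the Cousineau-Morey procedure in R (assuming raw scores in an n × J matrix Y; the function name is my own):

```r
# Cousineau-Morey confidence intervals for a one-way repeated measures design
cm_ci <- function(Y, conf = 0.95) {
  n <- nrow(Y); J <- ncol(Y)
  # normalize: remove participant means, then add back the grand mean
  Y_norm <- Y - rowMeans(Y) + mean(Y)
  se <- apply(Y_norm, 2, sd) / sqrt(n)   # per-condition SE of normalized scores
  # Morey's correction inflates the half-width by sqrt(J / (J - 1))
  half <- qt(1 - (1 - conf) / 2, df = n - 1) * se * sqrt(J / (J - 1))
  cbind(mean = colMeans(Y), lower = colMeans(Y) - half,
        upper = colMeans(Y) + half)
}
```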
When simple effects are calculated in a mixed design, a suitable error term can be formed by pooling the subjects and interaction error strata:

$MS_{error} = \dfrac{SS_{subjects(A)} + SS_{B \times subjects(A)}}{df_{subjects(A)} + df_{B \times subjects(A)}}$

For multiple df interactions, interaction contrasts can also be calculated. These can be generated along the same lines as those for an interaction contrast in an independent measures design. The contrast weights for the interaction effect define patterns of means that differ between groups (e.g., differential linear trends). The difficulty is deciding on what error term to use. If sphericity is not violated, the error term for the interaction in the mixed ANOVA can be employed. If sphericity is violated, but the design balanced, the error df could be adjusted using ε̂_gg or ε̃_hf. If sphericity is violated and the design unbalanced, the contrast could be tested via a multilevel model.

Example 16.5 Figure 16.3 shows a complex interaction among nine cell means. The interaction effect has 4 df and could be decomposed into one or more interaction contrasts. A hypothesis of particular interest in this case is how well the interaction is explained by a 2 × 2 interaction in which pride is better recognized from torso alone while happiness is better recognized from face alone. A reasonable choice of contrast weights for this hypothesis would be:

            Both   Face   Torso   Σ
Pride       0      −1     +1      0
Happiness   0      +1     −1      0
Surprise    0      0      0       0
Σ           0      0      0       0

Note that having absolute weights summing to two would have retained the original percentage scale. The next step in the analysis would normally be to create a double-centered table of contrast weights (see Key Concept 15.1). By happenstance, the initial weights are already double-centered (with both row and column marginals summing to zero). The contrast weights can then be multiplied by the cell means from Table 16.6 to give the contrast score:

            Both   Face         Torso       Σ
Pride       0      −1 × 50      +1 × 80     30
Happiness   0      +1 × 83.3    −1 × 40     43.3
Surprise    0      0            0           0
Σ           0      33.3         40          73.3

SS_contrast is derived from the contrast score of 73.3, and is:

$SS_{contrast} = \dfrac{C^2}{\sum_j w_j^2 / n} = \dfrac{73.3^2}{(1 + 1 + 1 + 1)/30} = 40{,}296.7$

The sphericity assumption was not violated in the earlier analysis. For this reason, it is reasonable to use the emotion × subjects (condition) MS_error term from the original ANOVA. This is 1,325, and so F = 40,296.7/1,325, and the contrast could be reported as: F(1, 174) = 30.4, MSe = 1,325, p < .01. As is often the case for a contrast, the explanatory power of the interaction contrast is of greater interest than the test. SS for the interaction is 44,231 and therefore the interaction contrast explains about 91% of the interaction effect (r²_alerting = .91). This is equivalent to a correlation of .95 between contrast weights and the residualized cell means (r_alerting = .95). Although there is quite a bit going on in Figure 16.3, much of the variation is down to main effects. Of the variation that remains, the vast majority can be explained in terms of pride being harder to identify from facial expression than from body posture, and happiness being harder to identify from body posture than facial expression.

16.5.3 Effect size

Repeated measures and mixed designs are especially difficult to obtain appropriate standardized effect size metrics for. Many commonly calculated quantities (e.g., partial η² or g) are not comparable to similar metrics obtained from independent measures designs. It is a good idea to compare effects from different designs using unstandardized measures (e.g., simple mean differences) as a first - and possibly only - step. Standardized effect size metrics need to take into account factors that may distort the standardizer (the variance or standard deviation used to scale the effect). An important contributor to the standardizer in an independent measures design is individual differences (which in a repeated measures design are estimated separately from other sources of variance).

The generalized effect size measures of Olejnik and Algina (2003) provide a starting point for 'design neutral' standardized effect size metrics. Their approach is to calculate generalized statistics that treat repeated measures equivalently to independent measures designs. One proviso is that, for statistical power or sample size estimation, the appropriate metric is one that matches the design of the study being planned. In theory η²_G can be calculated with the formulas for other factorial designs by treating subjects as an additional measured factor. In practice, ANOVA software rarely provides the SS for such a calculation in a convenient format. Following Olejnik and Algina, calculating the required denominator for η²_G by subtraction is suggested. The goal is to exclude all manipulated factors or interactions with only manipulated factors (except the one under consideration). An indicator variable I is used to designate whether the effect under consideration is a manipulated factor (I = 1) or a measured factor (I = 0).

$\eta^2_G = \dfrac{SS_{effect}}{SS_{total} + I \times SS_{effect} - \sum_{manip} SS_{manip}}$    Equation 16.11

This formula excludes manipulated factors from the denominator (adding the focal effect back in only if the focal effect is a manipulated factor). Interactions with measured factors are considered measured factors. Repeated measures fixed factors tend to be manipulated factors, though it may be reasonable to treat them as measured factors in some situations. Olejnik and Algina (2003) also extend ω²_G (generalized omega-squared) to designs with repeated measures factors. The correct formulas can become rather complex and the simplest solution is to refer to Tables 2, 3 and 4 of their paper.
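Equation 16.11 is straightforward to apply once the relevant SS are extracted from the ANOVA output. A sketch in R (the function name is my own; the SS values anticipate Example 16.6 below):

```r
# Generalized eta-squared by subtraction (Equation 16.11)
eta_sq_g <- function(ss_effect, ss_total, ss_manip, manipulated = TRUE) {
  I <- as.numeric(manipulated)  # indicator: 1 = manipulated, 0 = measured factor
  ss_effect / (ss_total + I * ss_effect - sum(ss_manip))
}

# Interaction effect, treating both factors as manipulated
eta_sq_g(44231, 521678.2, ss_manip = c(44231, 5616, 26060))  # approximately .090
```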
Example 16.6 In the two-way mixed ANOVA for the pride data, the two factors are the emotion to be recognized and the experimental condition (whether the pictures showed face, torso or both face and torso). The experimental condition is a canonical example of a manipulated variable. Emotion is manipulated by the experimenter, and for comparisons with other experiments might be considered as such. For other purposes - for example to gauge impact on everyday performance - it may be considered a measured variable (because expressions of happiness, pride and surprise are a routine part of everyday experience). Treating both variables as manipulated factors, η²_G for the interaction is:

$\eta^2_G = \dfrac{SS_{effect}}{SS_{total} + I \times SS_{effect} - \sum_{manip} SS_{manip}} = \dfrac{44{,}231}{521{,}678.2 + 1 \times 44{,}231 - (44{,}231 + 5{,}616 + 26{,}060)} = .090$

The interaction would account for around 9% of the total sample variance in an equivalent independent measures design. Replacing SS_effect with SS_contrast allows versions of η² to be calculated for a contrast. Thus η²_G for the interaction contrast is:

$\eta^2_G = \dfrac{SS_{contrast}}{SS_{total} + I \times SS_{contrast} - \sum_{manip} SS_{manip}} = \dfrac{40{,}296.7}{521{,}678.2 + 1 \times 40{,}296.7 - (40{,}296.7 + 5{,}616 + 26{,}060)} = .083$

In practice it is usually better to evaluate contrasts relative to the effect they are decomposed from, rather than in terms of total variance. The contrast explains about 91% of the interaction effect and about 8% of sample variance in an equivalent independent measures design.

16.6 MANOVA

Multivariate analysis of variance (MANOVA) is a technique of potential interest whenever correlated measurements are obtained. MANOVA is designed for applications in which several correlated outcome variables or DVs (dependent variables) are measured. An important application for repeated measures designs is a form of MANOVA called profile analysis. This does not assume sphericity and sometimes has greater statistical power than an epsilon-corrected ANOVA. In this application, already considered in passing, the repeated measures are treated as correlated outcome variables (which, in a sense, they are). Knowing in advance whether the MANOVA tests are more powerful than the corrected ANOVA analysis is far from simple. It would be possible to estimate the statistical power of each technique if the population covariance matrix Σ were known. Obtaining a good enough estimate of Σ (e.g., in a pilot study) to determine the relative power of the two approaches is likely to be difficult (but see Miles, 2003). Estimates of variances and covariances from small samples tend to be very imprecise. The MANOVA approach will tend to have greater power when many repeated measures are taken, when the degree of sphericity violation is large and when sample sizes are large (but there are departures from this general trend). For designed experiments with small n and few repeated measurements, and certainly for designs with two levels on the repeated measures factors, ANOVA should probably be preferred. In unbalanced mixed designs MANOVA tests are not robust to violations of multisample sphericity. They may also be problematic for interaction tests in balanced designs (Keselman et al., 2001; Olson, 1974). Keselman et al. (2001) also argue against combining MANOVA with epsilon corrections as this can produce unpredictable results. For mixed designs with imbalance, a multilevel model is recommended.

Two further applications of MANOVA, both controversial, need to be addressed.
The first is the use of MANOVA to screen for effects prior to ANOVA. This is sometimes employed when a researcher has collected a number of different outcome measures. The second is to increase the statistical power to detect effects when correlated outcome measures are collected. Both practices should generally be avoided. Arguments against the MANOVA approach are part of a broader aversion both to screening tests and tests of multiple df effects.

Using MANOVA to screen for effects is analogous to using omnibus F tests prior to post hoc tests of all pairwise comparisons. If the omnibus H0 is true (i.e., there are no differences between means for any of the outcome variables in the population) then the MANOVA test of an effect (e.g., a main effect of factor A) protects subsequent ANOVA tests on the separate outcome variables. This assumes that a researcher does not perform any further tests of differences, a rule that is not always adhered to (Huberty and Morris, 1989). Only very rarely should a researcher adopt this practice. Most research is conducted on the premise that some effects are likely to exist. Only rarely is the omnibus null hypothesis plausible.13 It is more likely that a partial null hypothesis is true; that there are non-zero effects for some outcomes and zero or negligible effects for others (Huberty and Morris, 1989; Jaccard and Guillamo-Ramos, 2002).

Jaccard and Guillamo-Ramos (2002) provide a very clear illustration of the problem. Imagine a clinical study with one main outcome variable (Y1 = depression) and four secondary outcomes (Y2 to Y5 measuring anxiety, self-confidence and so forth). There may be a substantial effect for Y1 but negligible or zero effects for Y2 to Y5. If the Y1 effect is large enough, the overall MANOVA main effect might be statistically significant. Subsequent ANOVA tests on the secondary outcomes would then be unprotected with respect to familywise error (considering tests of the same treatment effect on different outcomes as the family). Jaccard and Guillamo-Ramos argue that modified Bonferroni corrections to separate univariate tests of secondary outcomes are a better solution (though the primary outcome should rarely if ever be corrected in this way). An alternative strategy is to report tests of all outcomes without correction. This may be sensible if the outcomes are correlated (though a powerful correction such as the Westfall procedure may be preferred). Reporting unmodified effects may be reasonable if the correlations between outcomes are positive and substantial, provided care is taken not to conceal the true extent of testing when communicating the results.

Using MANOVA to screen for effects prior to ANOVA is generally a very bad idea. It will usually lead either to inadequate Type I error protection or to decreased statistical power. The latter occurs when the partial null is true, but the omnibus null is not rejected. This is a consequence of the screening test itself lacking statistical power (see Zimmerman, 2004). The power issue is subtle, and will be addressed shortly. The main issue is that the test of the omnibus null hypothesis lacks focus relative to the tests of individual outcomes such as Y1 or Y2.

It has already been hinted that the MANOVA omnibus tests will lack statistical power, but this isn't quite true. Think about the rationale for using MANOVA to provide more powerful tests. In Jaccard and Guillamo-Ramos's example there were five outcome measures all likely to
be correlated with successful treatment for depression. In a small study, none of the individual outcomes might reach statistical significance, but all might show an effect in the right direction. Could you not use MANOVA to analyze the whole set of outcome variables for a more sensitive test? This seems like a good idea, but MANOVA will not always provide a more powerful test. The power of the omnibus test in MANOVA depends on sample sizes, effect sizes and the pattern of correlations between the outcomes (Cole et al., 1994). Interestingly, Cole et al. show that if outcome variables have high positive correlations (a situation quite likely where the outcomes are repeated measures) MANOVA will not always have high power (depending on the mix of effect sizes). With the right mix of effect sizes and correlations, MANOVA can have greater power than univariate ANOVA. However, other approaches may also have good power. In fact, the most obvious choice of outcome in some studies is probably just to average the variables (e.g., taking their mean or the mean of their z scores depending on whether they have the same or different scales). This approach seems particularly attractive when there are positive correlations between outcome measures and fairly consistent effects across those measures.

Many misapplications of MANOVA stem from a lack of understanding of how MANOVA works. There are many good introductions to MANOVA (e.g., Field, 2009), but most focus on calculation and basic interpretation. It will help to explore the simple case of a two-group design with J outcome variables Y1 to YJ. A fundamental (and potentially surprising) characteristic of MANOVA is that it is not an analysis of the separate outcome variables at all. It is an analysis of a linear transformation or combination of the outcomes that we will refer to as Yc. Technical details of the mathematics of the combination are given by Grayson (2004), but the basic form of Yc is similar to a contrast:

$Y_c = c_1 Y_1 + \cdots + c_J Y_J$    Equation 16.12

The weights or coefficients (c1 to cJ) for the combination are chosen to maximize the value of a one-way ANOVA F statistic for the differences between the group means on Yc.14 It is important to remember that this maximizes the F ratio of the combination for the effect being considered. Different effects in the same design nearly always end up with different weights. MANOVA therefore is really a 'disguised' ANOVA with a transformed Y variable. Grayson points out that Type I protection (for the omnibus H0) is obtained because an ANOVA test of an individual outcome such as Y1 is also a linear combination of the full set of Y variables (one in which the weights are one for Y1 and zero for all other outcomes):

$Y_1 = 1 \times Y_1 + 0 \times Y_2 + \cdots + 0 \times Y_J$

If the omnibus null were true and this 'combination' was statistically significant by chance alone (the definition of a Type I error), then the combination that maximizes the F statistic, Yc, would also be statistically significant. What gives cause for caution about MANOVA (at least as it is routinely applied) is that the linear combination that maximizes an F ratio is an inherently atheoretical approach. There is no guarantee that this linear combination (the discriminant function) is interpretable. Grayson (ibid.) provides several plausible examples of simple data sets that do not have an interpretable structure (or at least not one that makes any kind of theoretical sense).
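The point that MANOVA analyzes a weighted combination of the outcomes can be seen directly in R. A sketch with made-up data for two groups and two correlated outcomes; MASS::lda() is used to extract the discriminant weights (the c's of Equation 16.12):

```r
library(MASS)
set.seed(1)
group <- factor(rep(c("a", "b"), each = 40))
Y1 <- rnorm(80, mean = ifelse(group == "a", 10, 12))  # outcome with a group effect
Y2 <- 0.5 * Y1 + rnorm(80)                            # correlated secondary outcome

# Omnibus MANOVA test of the group effect on both outcomes
summary(manova(cbind(Y1, Y2) ~ group), test = "Pillai")

# Yc is the discriminant function: these weights maximize the group separation
# (and hence the F ratio) for the combination, not for either outcome alone
lda(group ~ Y1 + Y2)$scaling
```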
If a researcher has a hypothesis about the differences in a particular outcome measure, it does not seem like a good idea to test that hypothesis using a different outcome measure (whether that outcome measure is theoretically interpretable or not). A number of experts advise against MANOVA if you really are interested in the differences between the means of individual outcome measures (Huberty and Morris, 1989; Jaccard and Guillamo-Ramos, 2002; Grayson, 2004). If you have an a priori reason to think that some linear combination of several outcomes is meaningful, then a better approach might be to combine the outcomes yourself (e.g., using an average or weighted average).

Huberty and Morris (1989) discuss legitimate research questions for MANOVA. These rest on whether the inter-relationships between outcome measures themselves are of interest to a researcher. In particular, MANOVA will sometimes throw up theoretically meaningful linear combinations. Although this is true, other approaches to this problem should be considered. If the primary interest is in combinations of predictors that best discriminate different outcomes then discriminant analysis may be appropriate. Multilevel regression models can also be extended to deal with multiple outcome variables in what is termed a multivariate multilevel model (Hox, 2010). The multilevel approach is attractive because it may permit explicit modeling of the correlations between outcome variables and can handle missing outcomes.

MANOVA provides two additional causes for concern, one widely known and the other less so. The widely known issue is that MANOVA produces several different, rival test statistics. With only two outcome variables they all reduce to the same statistic, Hotelling's T².15 With more than two outcomes Wilks' Λ (lambda),16 Pillai's trace, Hotelling's trace and Roy's largest root are usually provided by MANOVA software. Olson (1976) recommends Pillai's trace as the most robust to violations of MANOVA assumptions, but Wilks' Λ is also popular (see Field, 2009). The final cause for concern is in relation to effect size. A number of MANOVA effect size metrics have been developed, but none seem particularly useful. For instance, eta-squared variants can be derived from Wilks' Λ (a measure of unexplained sample variance for Yc), but have unattractive properties. Because Yc is maximized separately for each test, the total proportion of variance explained by all effects can exceed one, and will not strictly be comparable even within an analysis. As different studies will capitalize on sampling variability to determine Yc, MANOVA effect sizes are also not strictly comparable between studies (and this is true also for unstandardized differences in Yc).

In summary, MANOVA has a potential application in pure repeated measures analyses (the focus here). It sometimes provides more powerful tests than epsilon-corrected tests when sphericity is violated. It is less useful for mixed designs, but may be appropriate for tests of main effects in balanced designs. However, as the power advantage of MANOVA is not consistent, switching to a multilevel model is recommended. The multilevel model approach has the flexibility to mimic both ANOVA and MANOVA analyses and to relax constraints inherent in both models (e.g., sphericity and multisample sphericity).
The common strategy of MANOVA followed by univariate ANOVA is inappropriate for testing multivariate hypotheses and should be replaced by genuinely multivariate approaches (Huberty and Morris, 1989; Enders, 2003).

16.7 ANCOVA with repeated measures

ANCOVA with repeated measures expands the familiar repeated measures designs to include a covariate. This approach is most appropriate for randomized experimental designs with a continuous confounding variable, but can also be applied to non-experimental designs (provided the usual cautions about using the covariate as a 'statistical control' are borne in mind). Adding a covariate appears to be a relatively harmless process, but can end up being rather messy. A one-way design with a single covariate would take the form:

$y_{ij} = \mu + b(C_i - \mu_C) + \pi_i + \tau_j + \tau\pi_{ij} + \varepsilon_{ij}$    Equation 16.13

The covariate C in Equation 16.13 is centered (by subtracting its mean μ_C) and takes the same value for each of the j = 1 to J observations of a given participant. Both these points turn out to be very important.

As it turns out, this form of pure repeated measures ANCOVA design may be uninteresting. In Equation 16.13, the covariate is measured between participants; there is one covariate score for each of the participants. Variation in the covariate equates to individual differences between participants. If the covariate was not present, this variation would get absorbed by the subjects term of the repeated measures analysis. If the covariate varied across repeated measures (often termed a time-varying covariate) then the model would also be inappropriate, because it would fail to capture important variation across observations indexed by i and j. Time-varying covariates can be dealt with in a number of ways, but multilevel regression models provide an elegant solution. The main motivation to add a time-stable covariate to a pure repeated measures model therefore differs from an independent measures design. The reason for including the covariate should be because its effect is of substantive interest, or because you are interested in testing covariate-by-treatment interactions.

Time-stable covariates are more interesting in mixed designs, because they potentially increase sensitivity to independent measures effects. Even so, there may still be advantages to separating out the independent measures analysis from the repeated measures analysis. In the basic one-way ANCOVA design above, it would perhaps be easier to run a one-way repeated measures ANOVA and a separate regression (or correlation) between outcome and covariate. One reason for this is that the way repeated measures ANCOVA is implemented in some software is problematic. A stronger reason to use a repeated measures ANCOVA model, in both mixed and pure repeated measures designs, is to include covariate-by-treatment interactions for the repeated measures factor. In some studies this will not be imperative, but in studies with measured factor by manipulated factor interactions it is strongly advisable (Yzerbyt et al., 2004).

In the earlier presentation of factorial ANCOVA, centering of the covariate was a convenience. This is not so for tests of main effects of repeated measures factors in ANCOVA. For repeated measures analyses a number of packages use difference coding (as in MANOVA profile analyses). Difference coding strips out individual differences from the SS calculation on the repeated measures factor (because the means of differences between levels rather than the means of the levels are being compared).
Delaney and Maxwell (1981) explain how adjusting the difference scores using a covariate messes up this calculation. One way to understand what happens is to realize that the difference scores have already eliminated the average effect of the covariate from the SS. A further adjustment for the covariate mean of each group confuses matters. It would, in effect, adjust the SS for the repeated measures main effect for the difference between the mean of the covariate and zero. This is not something you would usually want to do (ibid.). Once a covariate is added, any tests of repeated measures main effects are no longer interpretable in isolation (though the overall model, including the prediction equation, is not compromised). This is a little like the problem with product terms between uncentered main effects in moderated multiple regression. The solution is the same: center the covariates prior to adding them to the ANOVA. For the same reasons that apply in moderated multiple regression, interactions between covariates and other predictors are unaffected. Adding any new predictor or set of predictors to a model allows you to test the effect of the predictor using standard model comparison approaches (e.g., F tests or ΔAIC).

As a general approach, repeated measures ANCOVA can be considered a mixed ANOVA in which the independent measures effect is a continuous covariate with a single df. This produces an ANOVA table resembling Table 16.5. Additional covariates and product terms to test covariate-by-factor interactions (moderator effects) can be added to the model. The product terms should be computed using the centered covariates and it is best to compute and add the centered covariate and product terms to the model yourself (unless you are certain that your software handles them correctly).

Two broad strategies for repeated measures ANCOVA are recommended. One is the full repeated measures ANCOVA with all repeated measures factors, all independent measures and all covariates of interest. If you adopt this strategy it is advisable that you use effect coding (or the equivalent ANOVA parameterization) and center all covariates. If covariates are not centered, the repeated measures main effects may be uninterpretable. An alternative strategy is to conduct two separate analyses (see Rutherford, 2001). In the first analysis, no covariates are included but all factors of interest (independent or repeated measures) and their interactions are present. This first analysis is used only to assess interactions between and main effects of repeated measures factors. The second analysis adds all covariates of interest and any covariate-by-factor interactions. The second model can be used to assess the remaining effects (i.e., any that are not pure repeated measures effects). The former 'global' strategy is probably the safest method, provided all covariates are centered. The alternative 'two phase' method is useful if you want to estimate effects at a value of a covariate other than its mean. This is useful, for example, in developmental trajectory analysis (Thomas et al., 2009). It is also useful for obtaining contrasts and simple effects for repeated measures factors or interactions between repeated measures factors that do not need covariate adjustment.
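A sketch of the 'global' strategy using base R. The data frame and variable names here are hypothetical: a long-form data frame pride_long with columns subject, emotion, cov and accuracy is assumed:

```r
# Use effect (sum-to-zero) coding, as advised for the global strategy
options(contrasts = c("contr.sum", "contr.poly"))

# Center the time-stable covariate by hand before forming product terms
pride_long$cov_c <- pride_long$cov - mean(pride_long$cov)

# Repeated measures ANCOVA with a covariate-by-factor interaction;
# Error() declares the subject and subject-by-emotion error strata
fit <- aov(accuracy ~ cov_c * emotion + Error(subject / emotion),
           data = pride_long)
summary(fit)
```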
16.7.1 ANCOVA, pre-post designs and gain scores

A pre-post design is a repeated measures design in which measurements of an outcome variable are taken before an intervention or experimental manipulation (the pre-test or baseline) and again afterwards (the post-test). When paired measures from a pre-post design are analyzed, a popular strategy is to simplify the analysis using gain scores. A gain score is the change in outcome between the pre-test and post-test (i.e., post-test score minus pre-test score). For a repeated measures design with more than two time points it is possible to generalize gain scores as change relative to baseline. Here the baseline score is subtracted from all repeated measures prior to ANOVA analysis (and is equivalent to calculating separate gain scores for each post-test measurement).

Analysis of gain scores isn't the only option. Think about a two independent group design in which the aim of the study is to determine whether the change between pre-test and post-test scores differs between the groups. Two other alternatives could be selected, one more interesting than the other. First, one could use mixed measures ANOVA with pre-test and post-test scores as levels of the repeated measures factor. This is a fairly uninteresting alternative, because the F ratio from the ANOVA interaction is equivalent to the independent t test of the difference in gain scores between groups (t² = F). Analysis of gain scores (or analysis of change relative to baseline) and ANOVA of the raw scores are equivalent with respect to the test of the differential change in outcome. The second alternative is to use the pre-test or baseline score as a covariate in the analysis. This models the change between the pre-test and post-test outcome in a very different way. To illustrate what is going on, we'll adapt the approach of Wright (2006) and present the equations for ANOVA and ANCOVA as regression models where group is a dummy coded predictor X. In the gain score model (equivalent to ANOVA) the regression model is:

$gain_i = post_i - pre_i = b_0 + b_1 x_i + e_i$    Equation 16.14

The corresponding ANCOVA model is:

$post_i = b_0 + b_2 pre_i + b_1 x_i + e_i$    Equation 16.15

Comparing the two will be easier if the gain score model is rearranged in terms of the post-test scores (by adding the pre-test scores to both sides). This produces:

$post_i = b_0 + pre_i + b_1 x_i + e_i$    Equation 16.16

The crucial difference between Equation 16.15 and Equation 16.16 is that the ANCOVA model estimates an additional parameter: the slope of the pre-test scores b2. In the gain score model, the slope of the pre-test scores is fixed and therefore the value of b2 is implicitly assumed to be one. Having presented the models in this way, it should be clear that the two models can lead to different conclusions, because the tests of the differences between groups (the tests of b1) are not identical. The most famous demonstration of this is Lord's paradox (Lord, 1967). Lord showed that two groups could show no difference when compared using gain scores, but a large difference when comparing the means adjusted for their pre-test score. Lord used an artificial example, but the 'paradox' can also be found in real data sets (Wainer and Brown, 2004; Wright and London, 2009).
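The two models are easy to compare in R. A minimal sketch, assuming a hypothetical data frame dat with columns pre, post and a numeric group dummy or covariate x:

```r
# Gain score model (Equation 16.14): the slope of pre is implicitly fixed at 1
gain_model <- lm(I(post - pre) ~ x, data = dat)

# ANCOVA model (Equation 16.15): the slope of pre (b2) is estimated from the data
ancova_model <- lm(post ~ pre + x, data = dat)

# The two tests of x can disagree, and even point in opposite directions
# (Lord's paradox)
coef(summary(gain_model))["x", ]
coef(summary(ancova_model))["x", ]
```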
In the Wright and London example (using data from London et al., 2009), children's recall for an event (a magic show) was measured two weeks after the event and again ten months later. The age of the children at the start of the study varied from about five to nine years. Aside from the predictor X being a continuous covariate (the age in months of a child at the start of the study), this is identical to the earlier example. One of the research questions was whether the change in recall differed for younger and older children. The analysis of gain scores leads to the following prediction equation:

post = 3.061 + pre − 0.085 × age

The ANCOVA analysis gives a very different outcome:

post = −1.729 + 0.105 × pre + 0.044 × age

The effects are not only different, but they lie in opposite directions. Which model is correct? The literature on Lord's paradox is substantial and not easy to summarize. It turns out that there is no definitive answer. Because the models differ, each of them addresses a different hypothesis. The correct answer depends on the context of the study and its objective. That stated, there are ways to approach an answer. A number of commentators have pointed out that if you don't care what causes the difference in post-test scores (as may occur in some applied work) it may be reasonable just to focus on the observed differences using gain scores (Wright, 2006). More likely, a researcher will care about the relationships between predictors and the outcome measure. Wainer's (1991) position is that the choice of model depends on untestable assumptions about the relationship between pre-test and post-test scores (the parameter estimated by b2). If b2 = 1 in the population, the gain score approach will be correct, but if b2 ≠ 1 then ANCOVA is appropriate. If b2 = 1 then you are assuming that, on average, the mean of the post-test and pre-test scores would be the same in the population if the effects of the other predictors were not present. It is the counterfactual nature of the assumption that makes it untestable; it depends on information not present in the sample. As Wainer (ibid., p. 149) puts it: 'The very nature of untestable assumptions means that there is no statistical procedure that can be counted on to settle this issue. The answer to this question must come from other sources.'

Returning to Wright and London's example, would you expect the post-test recall to be on average the same (all other things equal) as pre-test recall? The simple answer is no. After ten months you'd expect all the children to have forgotten some of the information. This suggests the ANCOVA model is preferable. It is the more plausible of the two models, as it predicts older children remember more than younger children at the post-test (conditional on the initial level of recall). However, this is not necessarily the 'correct' answer. Even in this apparently clear-cut example Wright and London (2009) contrive a scenario in which the gain score model may be appropriate. What if an investigator has to select either an older or younger child to interview immediately after a crime (with the other child being interviewed much later)? The gain score model is useful here because it suggests that the older child should be interviewed first, because a delay would result in a larger absolute reduction in recall relative to the younger child.17 Although the scenario is contrived it illustrates how great care should be taken in selecting the correct model. In addition, the usual ANCOVA concerns about the linearity of covariate effects and the absence of interactions apply (ibid.).
In this case the linearity assumption is implausible and it might be sensible to assume an approximate power law relationship, though the precise nature of the forgetting curve is a matter of considerable debate (e.g., see Lansdale and Baguley, 2008).

Even though specifying the correct model can be hard, some guidelines are available. If the predictor of interest (e.g., group) is manipulated by the experimenter and therefore can be assigned at random, then all approaches produce unbiased tests, though ANCOVA tends to have greater statistical power (Wright, 2006). If pre-test score is confounded with the predictor of interest then ANCOVA can produce unbiased estimates of the de-confounded effects (though this is not necessarily the question of interest). ANCOVA is usually preferred in simple experimental and quasi-experimental studies looking to understand the effects of individual predictors. Maris (1998) also argues that ANCOVA is preferable if the pre-test is used to determine the assignment to the groups being compared. If the mechanism of assignment to groups is unknown (e.g., because pre-existing 'groups' are measured) this presents a problem. These are the cases where it is necessary not only to know the precise hypothesis you wish to test, but also to try and work out which of the untestable assumptions about the model is most plausible. This issue is important because selecting the wrong model for the research question may introduce bias.

Some broad conclusions are possible. For randomized designs and situations where the pre-test determines assignment to groups, ANCOVA has greater statistical power and is unbiased (Maris, 1998; Jamieson, 2004; Van Breukelen, 2006). For non-randomized studies where assignment to groups is not based on pre-test scores, it is argued that ANCOVA may have greater bias (Jamieson, 2004; Van Breukelen, 2006). However, recent work comparing randomized and non-randomized studies suggests that this is not inevitable (Cook et al., 2008; Shadish et al., 2008). It appears that the key factor is being able to include predictors likely to have caused differential baseline performance, rather than just the usual range of predictors used for matching or statistical control (e.g., easy to measure demographic factors such as age or sex). The argument here is closely related to that for dealing with missing data (e.g., drop out) by including predictors of missingness. This recent work is consistent with that of Senn (2006), who demonstrates that in situations where ANCOVA is biased it is difficult to design studies to detect a causal effect of a treatment in which change scores would not also be biased. It is sensible to take steps to remove as many sources of bias as possible (e.g., at the design stage or by inclusion of covariates). The question of which statistical model is best has no simple answer and will depend on the design and context of the study, with work on how best to select an analysis ongoing (e.g., Dinh and Yang, 2011; Cousens et al., 2011).

Box 16.2 Structuring repeated measures data: long form versus broad form

To run any repeated measures ANOVA it is necessary to structure the data in a way that preserves the relationship between repeated observations and the units being observed. To keep the explanation that follows manageable, assume that the repeated measures are on human participants and that the outcome is measured at two time points (Time 1 and Time 2).
Repeated measures analyses are tricky to deal with because different software requires data structured in different ways. Independent measures data for regression and related analyses are usually represented in a data file or spreadsheet as follows:

Participant   Predictor   Outcome
P1            8.3         29
P2            6.1         12
P3            8.5         23

Each variable (a covariate or grouping variable) is a distinct column and each participant a separate row. These rows are often termed 'cases'. The default repeated measures data structure in many packages is to represent data in what is sometimes called broad form:

Participant   Predictor   Time 1   Time 2
P1            8.3         29       33
P2            6.1         12       19
P3            8.5         23       20

This arrangement preserves the one case (row) per participant property of the independent measures data structure. It differs in the important respect that each repeated measures outcome is defined as a separate variable. Software that uses the broad form therefore has to have a method of linking the outcome variables in some way (e.g., requiring the user to define the appropriate columns as a repeated measures or within subjects factor). An obvious alternative structure is to represent the data in long form:

Participant   Predictor   Time   Outcome
P1            8.3         1      29
P1            8.3         2      33
P2            6.1         1      12
P2            6.1         2      19
P3            8.5         1      23
P3            8.5         2      20

The long form violates the property that the data from a single participant are restricted to a single case, but retains the property of the standard independent measures structure that the outcome is described by observations in a single column. The broad form is popular because it provides a more efficient summary of the data (i.e., one with less repetition). However, for some regression models - especially multilevel models - it is easier to work with data in long form.

For multilevel data structures, the long form extends very easily to represent three, four or more levels. An important advantage is that it allows you to represent both cross-classified and nested data structures. In a nested design, observations are clustered uniquely within other measurement units. Here participants are nested within groups (with no participant in more than one group):

Group   Participant   Outcome
G1      P1            17
G1      P1            9
G1      P2            8
G1      P2            5
G2      P3            11
G2      P3            12

In a cross-classified design, lower-level observations are not clustered uniquely within higher-level units. A fully crossed repeated measures structure is an extreme version where both measurement units are clustered within each other. Here the same four items in the experiment (I1 to I4) are clustered with every participant, though it would be just as valid to state that all n participants are clustered within every item:

Participant   Item   Outcome
P1            I1     10
P1            I2     11
P1            I3     7
P1            I4     9
P2            I1     15
P2            I2     8

Switching between the two formats can be time-consuming and is prone to error. Both SPSS and R have commands or functions for switching between the two data structures. For small data sets you may prefer to rearrange data in a spreadsheet such as Excel (because it is easy to check the results visually). In larger data sets it is better to use software to rearrange the data, but it is vital to check descriptive and other statistics to make sure the data are structured correctly.
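In R, one option for this conversion is the base reshape() function; the following is a minimal sketch using the toy data from the tables above (the column names are illustrative):

# Broad form of the toy data shown in the box
dat.broad <- data.frame(participant = c('P1', 'P2', 'P3'),
                        predictor = c(8.3, 6.1, 8.5),
                        time1 = c(29, 12, 23),
                        time2 = c(33, 19, 20))

# Convert to long form: one row per participant-by-time observation
dat.long <- reshape(dat.broad, direction = 'long',
                    varying = c('time1', 'time2'), v.names = 'outcome',
                    timevar = 'time', idvar = 'participant')
dat.long[order(dat.long$participant), ]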
16.8 R code for Chapter 16

16.8.1 One-way ANOVA with repeated measures (Example 16.1)

There are a number of different approaches to running repeated measures ANOVA models in R. There are pros and cons to each approach. Repeated measures ANOVA models with balanced data is one area in which the power and flexibility of R can be annoying; there are several very different methods to fit the same model (each with its own quirks).

For a basic model it is possible to use aov() by using the Error() function to specify the correct error terms to use in the analysis. It is also necessary to code a participant or subject variable as a factor for use within the Error() function. This approach uses the long form of data rather than the broad form (used in most other software). The long form of the pride data from Example 16.1 is in a file named pride_longs.csv while the broad form is in prides.csv. To run a one-way repeated measures ANOVA using aov() try the following code:

pride.long <- read.csv('pride_longs.csv')
pride.long$participant <- as.factor(pride.long$participant)
pride.anov <- aov(accuracy ~ emotion + Error(participant/emotion), pride.long)
summary(pride.anov)

The participant identifier can also be defined as a factor within the command:

pride.anov <- aov(accuracy ~ emotion + Error(factor(participant)/emotion), pride.long)
summary(pride.anov)

The Error() function is required to get R to recognize the repeated measures structure and select the correct error terms. The participant/emotion argument indicates that the emotion factor is fully nested within participants. Had the original data used number and letter combinations (e.g., P1, P2, etc.), R would have converted participant to a factor by default.

There are several ways to convert between the long and broad form (see Box 16.2). The following code uses the stack() and unstack() functions, which work well for simple cases. More complex data sets should use reshape(). The next example turns the long form into the broad form and then turns the broad form back again. The default column names for stack() output are 'values' and 'ind', so the last command just updates these. Note that the column orders in pride.long2 are different from pride.long. The broad form is handy for getting the means or SDs of the repeated measures conditions.

pride.broad <- unstack(pride.long, accuracy ~ emotion)
colMeans(pride.broad)
sapply(pride.broad, sd)
pride.long2 <- stack(pride.broad)
names(pride.long2) <- c('accuracy', 'emotion')

The car package provides a range of powerful functions for running repeated measures ANOVAs and related analyses (and includes Greenhouse-Geisser and Huynh-Feldt corrections). The ez package provides a more user-friendly ('SPSS-like') way to access some car functions. ez will automatically load car and several other packages that it needs. Here is how to run a one-way repeated measures ANOVA using the ezANOVA() function:

library(ez)
ezANOVA(data=pride.long, dv=.(accuracy), wid=.(participant), within=.(emotion))

ezANOVA() has several useful features. As well as providing sphericity corrections it converts the subject term defined by wid to a factor automatically. In addition, car and therefore ez defaults to using hierarchical (Type II) sums of squares, whereas aov() defaults to sequential (Type I) sums of squares. ez also attempts to calculate generalized η² (ges in the output) by treating subjects SS as a measured factor (other measured factors can be named using the observed argument). For one-way repeated measures designs this produces sensible output (equivalent to η²). Two additional methods could be mentioned at this point.
One is to fit a MANOVA model using the car package, though this offers no advantage over the ez approach at this stage. The final method is to use a multilevel model (sometimes called a linear mixed model). This approach is more versatile than repeated measures ANOVA, but will produce equivalent results for a balanced design if the model is set up in the appropriate way. The example here uses the lme() function in the nlme package (part of the base distribution for R), although lmer() from lme4 could also be used (but it has slightly different syntax).

library(nlme)
lme.fit <- lme(accuracy ~ emotion, random = ~1|participant, pride.long)
anova(lme.fit)

Further explanation of this and related functions will be provided in Chapter 18. One advantage of this approach is that if they are fitted by maximum likelihood, R will calculate AIC and related statistics for repeated measures models using lme() or lmer(). These are unavailable from aov() objects fitted with Error(). The default fitting method for lme() is restricted maximum likelihood (REML). This produces inferences that are identical to repeated measures ANOVA in a completely balanced design with a fully nested data structure. To compare AIC (or AICc or BIC) for fixed effects it is necessary to switch to maximum likelihood methods (so that the log-likelihood is estimated correctly). The following commands compare the one-way model with emotion as a factor against the intercept-only model (in which all emotions have the same mean level of accuracy):

io.ml <- lme(accuracy ~ 1, random = ~1|participant, pride.long, method='ML')
ow.ml <- lme(accuracy ~ emotion, random = ~1|participant, pride.long, method='ML')
AIC(io.ml, ow.ml)
delta.aic <- AIC(ow.ml) - AIC(io.ml)
exp(delta.aic/2)

16.8.2 Plotting repeated measures CIs (Example 16.2)

The Loftus-Masson CIs in Table 16.4 can be obtained using functions provided by Baguley (2011). Similar code is available from Wright (2007), who also provides a bootstrap version. The functions in Baguley (2011) support plotting of difference-adjusted Cousineau-Morey CIs and CIs from multilevel models with different covariance structures (for both one-way repeated measures and two-way mixed designs). Afshartous and Preston (2010) provide R code extending Goldstein-Healy intervals (Goldstein and Healy, 1995) to dependent designs.

The examples here use the Baguley (2011) functions. The functions for one-way repeated measures analyses take data in broad form (e.g., pride.broad from the preceding section). The first function, lm.ci(), gives an unadjusted Loftus-Masson CI:

lm.ci(pride.broad)

The second function, cm.ci(), gives a Cousineau-Morey interval. Its default is a difference-adjusted CI (in which overlapping CIs correspond to a CI for a difference in means that includes zero), using the formula in Equation 16.7. To get an unadjusted CI use the call:

cm.ci(pride.broad, difference = FALSE)

The plot in Figure 16.2 uses the two.tiered.ci() function. The default is to plot the inner tier as a difference-adjusted Cousineau-Morey CI and the outer tier as 95% CIs for individual means with a covariance matrix estimate that does not assume sphericity:

two.tiered.ci(pride.broad, ylab = 'Percentage accuracy', xlab = 'Emotion', grid=TRUE)

16.8.3 ANOVA with mixed measures (Example 16.3)

In Example 16.1, an independent measures factor with three levels was ignored for the pride data. This manipulated whether children saw pictures of expressions with the face, torso or both face and torso visible.
Including the additional grouping factor presents no difficulty once the full data set is loaded.

pride.long <- read.csv('pride_long.csv')
pride.mixed <- aov(accuracy ~ emotion*condition + Error(factor(participant)/emotion), pride.long)
summary(pride.mixed)

To obtain the cell means and other summary statistics for data in long form the ezStats() function from the ez package can be used. This takes more or less the same format as ezANOVA().

ezStats(data=pride.long, dv=.(accuracy), wid=.(participant), within=.(emotion), between=.(condition))

The output includes n, mean and SD per cell as well as Fisher's least significant difference (though this is not suitable for designs with repeated measures factors). To obtain sphericity corrections you can again use ezANOVA().

ezANOVA(data=pride.long, dv=.(accuracy), wid=.(participant), within=.(emotion), between=.(condition))

The lme() function also works, though the car package output reported via ez is more informative.

lme.fit <- lme(accuracy ~ emotion*condition, random = ~1|participant, pride.long)
anova(lme.fit)

To plot something similar to Figure 16.3, interaction.plot() could be used:

interaction.plot(pride.long$emotion, pride.long$condition, pride.long$accuracy, xlab='Emotion', ylab='Percentage accuracy', legend = FALSE)

A prettier, color plot could be specified using the ez package ezPlot() function. This function returns an object that can be edited further using the ggplot2 package.

ezPlot(data=pride.long, wid=.(participant), dv=.(accuracy), within=.(emotion), between=.(condition), x=.(emotion), split=.(condition), do_bars=FALSE)

The x argument defines which of the repeated (within) or independent (between) factors is on the x-axis and split determines whether to split by levels of another factor (as separate lines). Setting do_bars=TRUE adds error bars based on Fisher's LSD (not desirable in this case). Baguley (2011) includes a two-tiered error bar function for mixed designs based on two.tiered.ci(). Below, reshape() is used to get the full pride data set into broad form with the participant ID stripped out, so that the grouping variable is the first column in the new data frame pride.broad2. Shown here is a basic plot that can be edited or relabeled:

pride.broad2 <- reshape(pride.long, idvar = 'participant', direction = 'wide', timevar = 'emotion', v.names = 'accuracy')[2:5]
two.tiered.mixed(pride.broad2, group.var='first', lines=TRUE)

16.8.4 Contrasts on a repeated measures factor (Example 16.4)

A contrast is a form of weighted comparison of means. With pure repeated measures designs, contrasts can be run as paired t tests between weighted means. To illustrate this, Example 16.4 runs a contrast for the pride data set. Working with the broad form of the data set, first extract two vectors, one for the accuracy of the pride emotion and one for the mean of happiness and surprise. The t test can then be run comparing the means of these vectors.

pride.mean <- pride.broad$pride
nonpride.mean <- (pride.broad$happiness + pride.broad$surprise)/2
t.test(nonpride.mean, pride.mean, paired=TRUE)

As an alternative to the t test or 95% CI, the t statistic or mean and SE can be used to calculate Bayes factors or likelihood intervals if desired (using functions from Chapter 11).

JZS.prior.BF(5.30, 90)
t.lik.int(6.94444, 5.33, 90)

There is nothing special about repeated measures contrasts - any ANOVA contrast that outputs a t statistic can be used to calculate a Bayes factor or a likelihood interval.
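The same logic extends to arbitrary weights: any set of weights summing to zero can be applied to each participant's scores and the resulting contrast scores tested with a one-sample t test. A minimal sketch using the pride.broad data from above (the weights here reverse the sign of the comparison just run, so the t statistic matches up to sign):

w <- c(1, -0.5, -0.5)    # pride versus the mean of happiness and surprise
scores <- as.matrix(pride.broad[, c('pride', 'happiness', 'surprise')]) %*% w
t.test(drop(scores))     # one-sample t test on per-participant contrast scores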
16.8.5 Interaction contrasts with repeated measures (Example 16.5)

Interaction contrasts for pure repeated measures effects can be calculated using the t test approach described earlier. This avoids sphericity problems (possibly with some loss of power), but does not extend to mixed measures interaction terms. One way to deal with mixed effects is to adopt the approach described in Example 16.5. To calculate an interaction contrast it is necessary to decide on the interaction weights and it is usually best to work with them in matrix form. For instance, the matrix for the contrast in Example 16.5 would be:

cont <- matrix(c(0,0,0,1,-1,0,-1,1,0), 3, 3)

The interaction residuals (the cell means after sweeping out the main effects) could be obtained in several ways. However, R can provide them directly using the model.tables() function. Using the interaction model from the aov() command fitted earlier (pride.mixed) the residuals are given in the table for the emotion:condition interaction.

model.tables(pride.mixed)
int.resids <- model.tables(pride.mixed)$tables$'emotion:condition'

The contrast score is the summed product of the interaction residuals multiplied by the contrast weights, while SS_contrast is obtained from the contrast score and n per cell.

c.score <- sum(cont*int.resids)
c.score
n <- 30
ss.contrast <- c.score^2/(sum(cont^2)/n)
ss.contrast

SS_contrast = 40333.33 and is slightly larger than in the worked example (because of rounding error). r_alerting and r²_alerting are:

r.alerting <- cor(as.vector(int.resids), as.vector(cont))
r2.alerting <- r.alerting^2
c(r.alerting, r2.alerting)

ms.error <- 1325
F <- ss.contrast/ms.error
F
pf(F, 1, 174, lower.tail = FALSE)

As the F ratio is very large the p value is tiny, though it is more interesting to focus on r²_alerting. This suggests that the interaction contrast accounts for most of the variance of the interaction effect.

It is also possible to adapt the cell means approach described in Chapter 15 to mixed and repeated measures models. These analyses are run as multilevel models. First, the cell means interaction model needs to be fitted using lme().

library(nlme)
pride.cmm <- lme(accuracy ~ 0 + emotion:condition, random = ~1|participant, pride.long)
pride.cmm

The glht() function isn't the most flexible option for running the contrast. Instead, we'll use estimable() from the package gmodels. This is used for calculating 'estimable functions' (linear functions of model parameters), of which contrasts are a special case. The function is more flexible than glht(), but doesn't directly support corrections for multiple testing. The key advantage of estimable() is that it takes input from a wide range of model objects plus a contrast matrix. To reduce confusion, the contrast matrix works with named parameters (with unnamed parameters set to zero). The following example grabs the names required for the contrast from the model object and adds them to the contrast matrix. Because a cell means model has been fitted, the names are those of the nine cell means (e.g., emotionhappiness:conditionboth is the cell mean for the happiness emotion with both torso and face presented).

library(gmodels)
labels <- names(coef(pride.cmm))[1:9]
contr <- matrix(c(0,0,0,1,-1,0,-1,1,0), 1, 9, dimnames=list('contrast', labels))
estimable(pride.cmm, contr)

The reported t statistic is 5.517036. Its squared value is 30.43769, identical, allowing for rounding error, to the 30.44025 obtained for F by the other method.
The function provides a 95% CI if a conf.int=.95 argument is added, but these are on the incorrect scale. The following adjustment will rescale the contrast score, SE and CI without also (incorrectly) rescaling the df:

estimable(pride.cmm, contr, conf.int=.95)[1:2]/2
estimable(pride.cmm, contr, conf.int=.95)[6:7]/2

The multilevel approach has several advantages. It can be extended to deal with violations of sphericity and multisample sphericity or to more complex models (e.g., with unbalanced designs or time-varying covariates). The resulting models will not always produce statistics with an exact t or F distribution (though in this case, the statistics are known to be exact if the assumptions are met).

16.8.6 Generalized η² measures for repeated measures (Example 16.6)

Calculations for generalized η² are not easy to automate, because the decision over whether to consider variables as manipulated or measured is slightly subjective (and may vary with context). Using R can make the calculations easier - particularly for the total SS.

N <- 30*3*3
ss.tot <- var(pride.long$accuracy)*(N-1)

To get the generalized measures then involves manually adjusting the denominator using information from the ANOVA output:

ss.effect <- 44231
ges <- ss.effect / (ss.tot + ss.effect - sum(ss.effect + 5616 + 26060))
ges.contrast <- ss.contrast / (ss.tot + ss.contrast - sum(ss.effect + 5616 + 26060))
c(ges, ges.contrast)

Note that ezANOVA() also provides η²_G output for the interaction. Its output of 0.09026790 matches that calculated in Example 16.6 - indicating that it is treating both the repeated measures factor and the independent measures factor as manipulated factors. This is debatable: emotions vary naturally in our environment, so the emotion (repeated measures) factor could be considered measured. If so, η²_G for the interaction should be:

ges <- ss.effect/(ss.tot - 26060)
ges

This can be checked with ezANOVA(), where the observed argument allows you to override the defaults and treat a fixed factor as measured rather than manipulated.

ezANOVA(data=pride.long, dv=.(accuracy), wid=.(participant), within=.(emotion), between=.(condition), observed=.(emotion), detailed=TRUE)

16.8.7 R packages

Bates, D. M., Maechler, M., and Bolker, B. M. (2011) lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-39.
Fox, J., and Weisberg, S. (2010) An R Companion to Applied Regression (2nd edn). Thousand Oaks, CA: Sage.
Lawrence, M. A. (2011) ez: Easy analysis and visualization of factorial experiments. R package version 3.0-0.
Pinheiro, J., Bates, D. M., DebRoy, S., Sarkar, D., and the R Core team (2011) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-98.
Warnes, G. R., et al. (2011) gmodels: Various R programming tools for model fitting. R package version 2.15.1.

16.9 Notes on SPSS syntax for Chapter 16

16.9.1 ANOVA with repeated measures (Examples 16.1 and 16.3)

Repeated measures ANOVA analyses in SPSS use the general linear model glm command (which can also run unianova commands using the same syntax).

SPSS data file: pride_rm.sav

GLM pride happiness surprise
  /WSFACTOR=emotion 3
  /METHOD=SSTYPE(2)
  /PRINT=DESCRIPTIVE
  /WSDESIGN=emotion.

The statement pride happiness surprise defines the variables making up the within subjects (repeated measures) factors and /WSFACTOR=emotion 3 tells SPSS that there is one repeated measures factor with three levels. The /PRINT subcommand requests descriptive statistics.
Fitting a mixed ANOVA is straightforward with the same command:

GLM pride happiness surprise BY condition
  /WSFACTOR=emotion 3
  /METHOD=SSTYPE(2)
  /PRINT=DESCRIPTIVE
  /WSDESIGN=emotion.

A covariate could be added with a WITH statement. Additional repeated measures or independent measures factors can also be added. For instance, a four-way mixed ANOVA with two repeated measures factors and two independent measures factors (a and b) could be run as:

GLM c1d1 c1d2 c2d1 c2d2 BY a b
  /WSFACTOR=factor1 2 Polynomial factor2 2 Polynomial
  /METHOD=SSTYPE(2)
  /WSDESIGN=factor1 factor2 factor1*factor2
  /DESIGN=a b a*b.

The repeated measures factors each have two levels and are defined across the variables c1d1 c1d2 c2d1 c2d2. The independent measures factors are defined using BY a b and the model is specified in the /WSDESIGN and /DESIGN subcommands. The Polynomial statement in the /WSFACTOR definition tells SPSS to run polynomial contrasts on the repeated measures factors and their interactions. These models can also be fitted using the SPSS multilevel modeling mixed command. As well as being able to relax the sphericity assumption for these models, it is also possible to obtain AIC, AICc and BIC.

16.9.2 Repeated measures CIs (Example 16.2)

Wright (2007) provides SPSS syntax for running Loftus-Masson CIs, while Cousineau (2005) provides SPSS syntax for the uncorrected CI using normalized data. To obtain the corrected Cousineau-Morey intervals for the pride data you can also adjust nominal confidence (in this case to 98.2%) to obtain the correct width for a 95% CI when n = 30 and J = 3 (see Baguley, 2011). The following syntax obtains the normalized data from the participant means and the grand mean (72.6851851851852). The latter is calculated from the participant means using descriptives.

SPSS data file: pride_rm.sav

compute pmeans=(pride+happiness+surprise)/3.
descriptives variables=pmeans /statistics=mean.
compute n_pride = pride - pmeans + 72.6851851851852.
compute n_happiness = happiness - pmeans + 72.6851851851852.
compute n_surprise = surprise - pmeans + 72.6851851851852.
execute.
graph /errorbar(ci 98.2) = n_pride n_happiness n_surprise.

16.9.3 Contrasts for repeated measures designs (Examples 16.4 and 16.5)

Many repeated measures contrasts can be run using the paired t test commands. The following commands compute a weighted contrast for the pride versus other emotion contrast in Example 16.4.

SPSS data file: pride_rm.sav

COMPUTE pride_v_other=pride-(happiness+surprise)/2.
EXECUTE.
T-TEST /TESTVAL=0
  /VARIABLES=pride_v_other
  /CRITERIA=CI(.95).

This is a very versatile method for running contrasts on repeated measures. It avoids a pooled error term, but may sacrifice power if sphericity is true. The contrast can also be run using GLM, by changing the default within-subject contrasts:

GLM pride happiness surprise
  /WSFACTOR = emotion 3 special ( 1   1   1
                                 -1  .5  .5
                                  0  -1   1).

This gives an F test of the contrast by default. With three levels there are two orthogonal contrasts by default (usually linear and quadratic polynomials specified by Polynomial). The contrasts are set out as a 'matrix' over several lines to reveal the structure (though this is not strictly necessary). The first row 1 1 1 defines the intercept, the second the contrast of interest and the third is an arbitrary contrast orthogonal to the contrast of interest (which happens to compare the happiness and surprise means).
The default repeated and mixed measures ANOVA output (see the mixed ANOVA syntax above) includes polynomial contrasts for the repeated measures factors and interaction contrasts for effects involving repeated measures factors. Unfortunately, these are only really interpretable if the repeated measures factor is ordered (e.g., time points), which it is not for the pride data. In principle, it is possible to get SPSS to run custom interaction contrasts for mixed designs, but the difficulty of setting up the contrast coefficients often makes it easier to carry out the analysis by another route (e.g., by hand).

16.10 Bibliography and further reading

Keselman, H. J., Algina, J., and Kowalchuk, R. K. (2001) The Analysis of Repeated Measures Designs: A Review. British Journal of Mathematical and Statistical Psychology, 54, 1-20.
Kirk, R. E. (1995) Experimental Design (3rd edn). Belmont, CA: Brooks/Cole.