Chapter 6. Control Problems in Experimental Research

Preview & Chapter Objectives

In Chapter 5 you learned the essentials of the experimental method: manipulating an independent variable, controlling everything else, and measuring the dependent variable. In this chapter we will begin by examining two general types of experimental design, one in which different groups of participants contribute data for different levels of the independent variable (between-subjects design) and one in which the same participants contribute data to all the levels of the independent variable (within-subjects design). As you are about to learn, there are special advantages associated with each approach, but there are also problems that have to be carefully controlled: the problem of equivalent groups for between-subjects designs, and problems of sequence for within-subjects designs. The last third of the chapter addresses the issue of bias and the ways of controlling it. When you finish this chapter, you should be able to:

Discriminate between-subjects designs from within-subjects designs.
Understand how random assignment can solve the equivalent groups problem in between-subjects designs.
Understand when matching should be used instead of random assignment when attempting to create equivalent groups.
Distinguish between progressive and carryover effects in within-subjects designs, and understand why counterbalancing normally works better with the former than with the latter.
Describe the various forms of counterbalancing for situations in which participants are tested once per condition and more than once per condition.
Describe the specific types of between- and within-subjects designs that occur in research in developmental psychology, and understand the problems associated with each.
Describe how experimenter bias can occur and how it can be controlled.
Describe how participant bias can occur and how it can be controlled.

In his landmark experimental psychology text, just after introducing his now famous distinction between independent and dependent variables, R. S. Woodworth emphasized the importance of control in experimental research. As he put it, "[w]hether one or more independent variables are used, it remains essential that all other conditions be constant. Otherwise you cannot connect the effect observed with any definite cause. The psychologist must expect to encounter difficulties in meeting this requirement." (Woodworth, 1938, p. 3). Some of these difficulties we've already seen. The general problem of confounding and the specific threats to internal validity discussed in the previous chapter are basically problems of controlling extraneous factors. In this chapter, we'll look at some other aspects of maintaining control: the problem of creating equivalent groups in experiments involving separate groups of participants, the problem of sequence effects in experiments in which participants are tested several times, and problems resulting from biases held by both experimenters and research participants.

Recall that any independent variable must have a minimum of two levels. At the very least, an experiment will compare condition A with condition B. Those who participate in the study might be placed in level A, level B, or both. If they receive either A or B but not both, the design is a between-subjects design, so named because the comparison of levels A and B will be a contrast between two different groups of individuals. On the other hand, if each participant receives both levels A and B, you could say that both levels exist within each individual; hence, this design is called a within-subjects design (or, sometimes, a repeated-measures design). Let's examine each approach.

Between-Subjects Designs

Between-subjects designs are sometimes used because they must be used. If the independent variable is a subject variable, for instance, there is usually no choice.
A study comparing introverts with extroverts requires two different groups of people. Unless the researcher could round up some multiple personalities, introverted in one personality and extroverted in another, there is no alternative but to compare two different groups. One of the few times a subject variable won't be a between-subjects variable is when behaviors occurring at two different ages are being compared, and the same persons are studied at two different times in their lives. Another possibility is when marital status is the subject variable, and the same people are studied before and after a marriage or a divorce. Most of the time, however, using a subject variable means that a between-subjects design will be used.

Using a between-subjects design is also unavoidable in some studies that use certain manipulated independent variables. That is, it is sometimes the case that when people participate in one level of an independent variable, the experience gained there will make it impossible for them to participate in other levels. This often happens in social psychological research and most research involving deception. Consider an experiment on the effects of the physical attractiveness of a defendant on recommended sentence length by Sigall and Ostrove (1975). They gave college students descriptions of a crime and asked them to recommend a jail sentence for the woman convicted of it. There were two separate between-subjects manipulated independent variables. One was the type of crime: either a burglary in which "Barbara Helm" broke into a neighbor's apartment and stole $2,200 (a fair amount of money in 1975), or a swindle in which Barbara "ingratiated herself to a middle-aged bachelor and induced him to invest $2,200 in a nonexistent corporation" (Sigall & Ostrove, 1975, p. 412). The other manipulated variable was Barbara's attractiveness.
Some participants saw a photo of her in which she was very attractive, others saw a photo of an unattractive Barbara (the same woman posed for both photos), and a control group did not see any photo. The interesting result was that when the crime was burglary, attractiveness paid. Attractive Barbara got a lighter sentence on average (2.8 years) than unattractive (5.2) or control (5.1) Barbara. However, the opposite happened when the crime was swindling. Apparently thinking that Barbara was using her good looks to commit the crime, participants gave attractive Barbara a harsher sentence (5.5 years) than they gave the unattractive (4.4) or control (4.4) woman.

You can see why it was necessary to run this study with between-subjects variables. For those participating in the Attractive-Barbara-Swindle condition, for example, the experience would certainly affect them and make it impossible for them to "start fresh" in, say, the Unattractive-Barbara-Burglary condition. In some studies, participating in one condition makes it impossible for the same person to be in a second condition. Sometimes, it is essential that each condition include uninformed participants.

While the advantage of a between-subjects design is that each participant enters the study fresh, and naive with respect to the procedures to be tested, the prime disadvantage is that large numbers of people may need to be recruited, tested, and debriefed. Hence, the researcher invests a great deal of energy in this type of design. My doctoral dissertation on memory involved five different experiments requiring between-subjects factors; more than 600 students trudged in and out of my lab before the project was finished!

Another disadvantage of between-subjects designs is that differences between the conditions could be due to the independent variables, but they might also be due to differences between the two groups.
To deal with this potential confound, deliberate steps must be taken to create what are called equivalent groups. These groups are equal to each other in every important way except for the levels of the independent variable. The number of equivalent groups in a between-subjects study corresponds exactly to the number of different conditions in the study, with one group of participants tested in each condition.

There are two common techniques for creating equivalent groups in a between-subjects experiment. The ideal approach is to use random assignment. A second strategy is to use matching.

Random Assignment

First, be sure you understand that random assignment and random selection are not the same. Random selection, to be described in Chapter 12 (pp. xx), is a procedure for getting volunteers to come into your study. As you will learn, it is a process designed to produce a sample of individuals who reflect the broader population, and it is a common strategy in research using surveys. Random assignment is a method for placing participants, once selected for a study, into the different groups. When random assignment is used, every person volunteering for the study has an equal chance of being placed in any of the groups being formed.

The goal of random assignment is to take individual difference factors that could influence the study and spread them evenly throughout the different groups. For instance, suppose you're comparing two presentation rates in a simple memory study. Further suppose that anxious participants won't do as well on your memory task as nonanxious participants, but you as the researcher are unaware of that fact. Some subjects are shown a word list at a rate of 2 seconds per word; others at 4 seconds per word. The prediction is that recall will be better for the 4-second group. Here are some hypothetical data that such a study might produce. Each number refers to the number of words recalled out of a list of 30.
After each subject number, I've placed an "A" or an "R" in parentheses as a way of telling you which participants are anxious and which are relaxed. Data for the anxious people are shaded. If you look carefully at these data, you'll see that the three anxious participants in each group did worse than their five relaxed peers. Because there are an equal number of anxious participants in each group, however, the dampening effect of anxiety on recall is about the same for both groups. Thus, the main comparison of interest, the difference in presentation rates, is preserved: an average of 15 words for the 2-second group and 19 for the 4-second group.

Random assignment won't guarantee placing an equal number of anxious participants in each group, but in general the procedure has the effect of spreading potential confounds evenly among the different groups. This is especially true when large numbers of individuals are being assigned to each group. In fact, the greater the number of participants involved, the greater the chance that random assignment will work to create equivalent groups of them. If groups are equivalent and if all else is adequately controlled, then you are in that enviable position of being able to say that your independent variable was responsible if you find differences between your groups.

You might think the actual process of random assignment would be fairly simple: just use a table of random numbers to assign each arriving participant to a group or, in the case of a two-group study, flip a coin. Unfortunately, however, the result of such a procedure is that your groups will almost certainly contain different numbers of people. In the worst-case scenario, imagine you are doing a study using 20 participants divided into two groups of 10. You decide to flip a coin as each volunteer arrives: heads, they're in group A; tails, group B. But what if the coin comes up heads all 20 times?
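
The coin-flip worry can be put in numbers, and one remedy (shuffling the conditions within successive blocks, so that every condition is used once before any is repeated) fits in a few lines. The following Python sketch is illustrative only: the function names prob_equal_split and block_randomize are invented for this example and do not come from any particular randomization program.

```python
import random
from math import comb


def prob_equal_split(n_flips):
    """Probability that n_flips fair coin flips give exactly half heads."""
    return comb(n_flips, n_flips // 2) / 2 ** n_flips


def block_randomize(conditions, n_blocks, rng=None):
    """Build an assignment schedule from independently shuffled blocks.

    Each block holds every condition exactly once, so the groups are
    the same size after each completed block. (Hypothetical helper.)
    """
    if rng is None:
        rng = random.Random()
    schedule = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)          # random order within this block
        schedule.extend(block)
    return schedule


# With 20 coin flips, an exact 10/10 split happens less than 1 time in 5.
print(round(prob_equal_split(20), 3))   # 0.176
# All 20 heads is possible but has a chance of roughly one in a million.
print(0.5 ** 20)

# Ten blocks of two conditions always give 10 participants per group.
schedule = block_randomize(["A", "B"], n_blocks=10, rng=random.Random(0))
print(schedule.count("A"), schedule.count("B"))   # 10 10
```

Passing a seeded random.Random instance, as in the last lines, makes a schedule reproducible; leaving rng out produces a fresh schedule on each run.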

To complete a random assignment of participants to conditions in a way that guarantees an equal number of participants per group, a researcher can use block randomization, a procedure ensuring that each condition of the study has a participant randomly assigned to it before any condition is repeated a second time. Each "block" contains all of the conditions of the study in a randomized order. This can be done by hand, using a table of random numbers, but in actual practice researchers typically rely on a simple computer program to generate a sequence of conditions meeting the requirements of block randomization; you can find one at http://www.randomizer.org/.

When only a small number of subjects are available for your experiment, random assignment can sometimes fail to create equivalent groups. The following example shows you how this might happen. Let's take the same study of the effect of presentation rate on memory, used earlier, and assume that the data you just examined reflect an outcome in which random assignment happened to work. That is, there was an exact balance of five relaxed and three anxious people in each group. However, it is possible that random assignment could place all six of the anxious participants in one of the groups. This is unlikely, but it could occur (just as it's remotely possible for a perfectly fair coin to come up heads 10 times in a row). If it did, this might happen (see footnote 1):

[Table: hypothetical recall scores by participant for the 2-second rate and 4-second rate groups]

Footnote 1. This same pattern of results could occur if an experimenter failed to randomly assign and naively tested the first eight people to sign up in the 2-second rate group and the next eight people in the other group. It is conceivable that the more anxious students would delay volunteering to participate, increasing the chances of their being placed in the 4-second group.

This outcome, of course, is totally different from the first example. Instead of concluding that recall was better for a slower presentation rate (as in the earlier example), the researcher in this case could not reject the null hypothesis (17 = 17) and would wonder what happened. After all, participants were randomly assigned, and the researcher's prediction about better recall for a slower presentation rate certainly makes sense. So what went wrong?

What happened was that random assignment inadvertently created two decidedly nonequivalent groups: one made up entirely of relaxed people and one mostly including anxious folks. A 4-second rate probably does produce better recall, but the true difference was wiped out in this study because the mean for the 2-second group was inflated by the relatively high scores of the relaxed participants and the 4-second group's mean was suppressed because of the anxiety effect. Another way of saying this is that the failure of random assignment to create equivalent groups probably led to a Type II error (presentation rate really does affect recall; this study just failed to find the effect). To repeat what was mentioned earlier, the chance of random assignment working to create equivalent groups increases as sample size increases.

To deal with the problem of equivalent groups in a situation such as this, a matching procedure could be used. In matching, participants are grouped together on some trait such as anxiety level, and then distributed randomly to the different groups in the experiment. In the memory study, "anxiety level" would be called a matching variable. Individuals in the memory experiment would be given some reliable and valid measure of anxiety, those with similar scores would be paired together, and one person in each pair would be randomly placed in the group getting the 2-second rate and the other would be put into the group with the 4-second rate.
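
The pairing-then-randomizing logic just described can be sketched in Python. This is a minimal illustration under stated assumptions, not a standard routine: the function name matched_assignment and its parameters are hypothetical, and the scores used are the ten GPAs listed in Table 6.1 rather than anxiety scores.

```python
import random
from statistics import mean


def matched_assignment(scores, n_groups=2, rng=None):
    """Assign participants to groups after matching on one variable.

    scores maps a participant ID to a score on the matching variable.
    Participants are ranked by score, cut into clusters of n_groups
    adjacent scores, and each cluster is randomly split across groups.
    (Hypothetical helper written for this example.)
    """
    if rng is None:
        rng = random.Random()
    ranked = sorted(scores, key=scores.get)      # ascending order of score
    if len(ranked) % n_groups != 0:
        raise ValueError("participants must divide evenly into groups")
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        cluster = ranked[start:start + n_groups]  # adjacent scores
        rng.shuffle(cluster)                      # random within the cluster
        for group, participant in zip(groups, cluster):
            group.append(participant)
    return groups


# GPAs for the ten volunteers in Table 6.1
gpas = {"S1": 3.24, "S2": 3.91, "S3": 2.71, "S4": 2.05, "S5": 2.62,
        "S6": 2.45, "S7": 3.85, "S8": 3.12, "S9": 2.91, "S10": 2.21}

group1, group2 = matched_assignment(gpas, rng=random.Random(1))
for name, members in (("Group 1", group1), ("Group 2", group2)):
    print(name, members, round(mean(gpas[s] for s in members), 2))
```

Whichever way the random split falls within each pair, the two group means cannot differ by more than the average gap between paired scores, which is what makes the groups closely comparable on the matching variable.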
As an illustration of exactly how to accomplish matching in a two-group experiment, you should work through the example in Table 6.1.

Matching sometimes is used when the number (N) of participants is small, and random assignment is therefore risky and might yield nonequivalent groups. In order to undertake matching, however, two important conditions must be met. First, you must have good reason to believe that the matching variable will have a predictable effect on the outcome of the study. That is, you must be confident that the matching variable is correlated with the dependent variable. This was the case in our hypothetical memory study: anxiety clearly reduced recall. When there is a high correlation between the matching variable and the dependent variable, the statistical techniques for evaluating matched-groups designs are sensitive to differences between the groups. On the other hand, if matching is done when there is a low correlation between the matching variable and the dependent variable, the chances of finding a true difference between the groups decline. So it is important to be careful when picking matching variables.

A second important condition for matching is that there must be some reasonable way of measuring or identifying participants on the matching variable. In some studies, participants must be tested on the matching variable first, then assigned to groups, and then put through the experimental procedure. Depending on the circumstances, this might require bringing participants into the lab on two separate occasions, which can create logistical problems.
Also, the initial testing on the matching variable might give participants an indication of the study's purpose, thereby introducing bias into the study. The simplest matching situations occur when the matching variables are constructs that can be determined without directly testing the participants (e.g., Grade Point Average scores or IQ from school records), or by matching on the dependent variable itself. That is, in a memory study, participants could be given an initial memory test, then matched on their performance, and then assigned to 2-second and 4-second groups. Their preexisting memory ability would thereby be under control and the differences in performance could be attributed to the presentation rate.

One practical difficulty with matching concerns the number of matching variables to use. In a memory study, should I match the groups for anxiety level? What about intelligence level? What about education level? You can see that some judgment is required here, for matching is difficult to accomplish with more than one matching variable, and often results in having to eliminate participants because close matches sometimes cannot be made. The problem of deciding on and measuring matching variables is one reason why research psychologists generally prefer to make the effort to recruit enough volunteers to use random assignment, even when they might suspect that some extraneous variable correlates with the dependent variable. In memory research, for instance, researchers are seldom concerned about anxiety levels, intelligence, or education level. They simply make the groups large enough and assume that random assignment will distribute these potentially confounding factors evenly throughout the conditions of the study.

TABLE 6.1 How to Use a Matching Procedure

In a study on problem solving requiring two different groups, a researcher is concerned that a participant's academic skills may correlate highly with performance on the problems to be used in the experiment. The participants are college students, so the researcher decides to match the two groups on grade point average (GPA). That is, deliberate steps will be taken to ensure that the two groups are equivalent to each other in academic ability, as reflected in their average GPAs. Here's how it is done:

Step 1. Get a score for each person on the matching variable. That's easy in this case because it simply means retrieving GPA data from the Registrar (with the students' consent of course). In other cases of matching, the matching variable must be determined by pretesting participants on the variable; this can mean bringing participants to the lab twice, which can be inconvenient (another reason why researchers like random assignment). Suppose there will be 10 volunteers (Ss) in the study, 5 per group. Here are their GPAs:

    S1: 3.24     S6:  2.45
    S2: 3.91     S7:  3.85
    S3: 2.71     S8:  3.12
    S4: 2.05     S9:  2.91
    S5: 2.62     S10: 2.21

Step 2. Arrange the GPAs in ascending order:

    S4:  2.05    S9: 2.91
    S10: 2.21    S8: 3.12
    S6:  2.45    S1: 3.24
    S5:  2.62    S7: 3.85
    S3:  2.71    S2: 3.91

Step 3. Create five pairs of scores, with each pair consisting of quantitatively adjacent GPA scores.

    Pair 1: 2.05 and 2.21
    Pair 2: 2.45 and 2.62
    Pair 3: 2.71 and 2.91
    Pair 4: 3.12 and 3.24
    Pair 5: 3.85 and 3.91

Step 4. For each pair, randomly assign one participant to Group 1 and the other to Group 2. Here's one possible outcome:

                 Group 1    Group 2
                 2.05       2.21
                 2.62       2.45
                 2.91       2.71
                 3.12       3.24
                 3.85       3.91
    mean GPA:    2.91       2.90

Now the study can proceed with some assurance that the two groups will be equivalent to each other (2.91 is virtually the same as 2.90) in terms of academic ability.

Note. If more than two groups are being tested, the matching procedure is the same up to and including step 2. In step 3, instead of creating pairs of scores, the researcher creates clusters equal to the number of groups needed. Then in step 4, the participants in each cluster are randomly assigned to the multiple groups.

Self Test 6.1

1. What is the defining feature of a between-subjects design? What is the main control problem that must be solved with this type of design?
2. Sal wishes to see if the type of font used when printing a document will influence comprehension of the material in the document. He thinks about matching on "verbal fluency." What two conditions must be in effect before this matching can occur?

Within-Subjects Designs

As mentioned at the start of the chapter, each participant is exposed to each level of the independent variable in a within-subjects design. Because everyone in this type of study is measured several times, you will sometimes see this procedure described as a repeated-measures design. One practical advantage of this design should be obvious: fewer people need to be recruited. If you have a study comparing two conditions and you want to test 20 people in condition 1, you'll need to recruit 40 people for a between-subjects study, but only 20 for a within-subjects study.

Within-subjects designs are sometimes the only reasonable choice. In experiments in such areas as physiological psychology and sensation and perception, comparisons often are made between conditions that require just a brief amount of time to test but might demand extensive preparation. For example, a perceptual study using the Müller-Lyer illusion might vary the orientations of the lines to see if the illusion is especially strong when presented vertically (see Figure 6.1). The task might involve showing the illusion on a computer screen and asking the participant to press a key that changes the length of one of the lines.
Participants are told to adjust the line until both lines seem to be the same length. Any one trial might take no more than 5 seconds, so it would be absurd to make the "illusion orientation" variable a between-subjects factor and use someone for a fraction of a minute. Instead, it makes more sense to make the orientation variable a within-subjects factor and give each participant a sequence of trials to cover all levels of the variable (and probably duplicate each level several times). And unlike the attractive/unattractive Barbara Helm study, serving in one condition would not make it impossible to serve in another.

FIGURE 6.1 Set of four Müller-Lyer illusions: (a) horizontal, (b) 45° left, (c) 45° right, (d) vertical.

One of psychology's oldest areas of research is in psychophysics, the study of sensory thresholds (e.g., a modern application is a hearing test). In a typical psychophysics study, subjects are asked to judge whether or not they can detect some stimulus or whether two stimuli are equal or different. Each situation requires a large number of trials and comparisons to be made within the same individual. Hence, psychophysics studies typically use just a few participants and measure them repeatedly. Research Example 5, which you will soon encounter, uses this strategy.

A within-subjects design might also be necessary when volunteers are scarce because the entire population of interest is small. Studying astronauts or people with special expertise (e.g., world-class chess players) are just two examples. Of course, there are times when, even with a limited population, the design may require a between-subjects manipulation. Evaluating the effects of a new form of therapy for those suffering from a rare form of psychopathology requires comparing those in therapy with others in a control group not being treated.

Besides convenience, another advantage of within-subjects designs is that they eliminate the equivalent groups problem that occurs with between-subjects designs. Recall from Chapter 4 that an inferential statistical analysis comparing two groups compares the variability between experimental conditions with the variability within each condition. Variability between conditions could be due to (a) the independent variable, (b) other systematic variance resulting from confounding, and/or (c) nonsystematic error variance. Even with random assignment, a significant portion of the error variance in a between-subjects design results from individual differences between subjects in the different groups. But in a within-subjects design, any between-condition individual difference variance disappears. Let's look at a concrete example.

Suppose you are comparing two golf balls for distance. You recruit 10 professional golfers and randomly assign them to two groups of 5. After warming up, each golfer hits one ball or the other. Here are the results:

    Pros in the                   Pros in the
    First Group    Golf Ball 1    Second Group    Golf Ball 2
    Pro 1          255            Pro 6           269
    Pro 2          261            Pro 7           266
    Pro 3          248            Pro 8           260
    Pro 4          256            Pro 9           273
    Pro 5          245            Pro 10          257
    M              253.00         M               265.00
    SD             6.44           SD              6.52

There are several things to note here. First, there is some variability within each group, as reflected in the standard deviation for each group. This is error variance due to individual differences within each group and to other random factors. Second, there is apparently an overall difference between the groups. The pros in the second group hit their ball farther than the pros in the first group. Why? Three possibilities:

a. Chance; perhaps this is not a statistically significant difference, and even if it is, there's a 5% chance that it is a Type I error if the null hypothesis is actually true.
b. The golf ball; perhaps the brand of golf ball hit by the second group simply goes farther (this, of course, is the research hypothesis).
c. Individual differences; maybe the golfers in the second group are stronger or more skilled than those in the first group.

The chances that the third possibility is a major problem are reduced by the procedures for creating equivalent groups described earlier. Using random assignment or matching allows you to be reasonably sure that the second group of golfers is approximately equal to the first group in ability, strength, and so on. Despite that, however, it is still possible that some of the difference between these groups can be traced back to the individual differences between the two groups. This problem simply does not occur in a within-subjects design. Suppose you repeated the study but used just the first five golfers, and each pro hits ball 1, and then ball 2. Now the table looks like this:

    Pros in the
    First Group    Golf Ball 1    Golf Ball 2
    Pro 1          255            269
    Pro 2          261            266
    Pro 3          248            260

Of the three possible explanations for the differences in the first set of data, explanation c can be eliminated for the second set. In the first set, the difference in the first row between the 255 and the 269 could be due to chance, the difference between the balls, or individual differences between pro 1 and pro 6. In the second set, there is no second group of golfers, so the third possibility is gone. Thus, in a within-subjects design, individual differences are eliminated from the estimate of the amount of variability between conditions. Statistically, this means that, in a within-subjects design, an inferential analysis will be more sensitive to small differences between means than will be the case for a between-subjects design.

But wait.
Are you completely satisfied that in the second case the differences between the first set of scores and the second set could be due only to (a) chance factors and/or (b) the superiority of the second ball? Are you thinking that perhaps pro 1 actually changed in some way between hitting ball 1 and hitting ball 2? Although it's unlikely that the golfer will add 20 pounds of muscle between swings, what if some kind of practice or warm-up effect was operating? Or perhaps the pro detected a slight malfunction in his swing at ball 1 and corrected it for ball 2. Or perhaps the wind changed. In short, with a within-subjects design, a major problem is that once a participant has completed the first part of a study the experience or altered circumstances could influence performance in later parts of the study. The problem is referred to as a sequence or order effect, and it can operate in several ways.

First, trial 1 might affect the participant in some way so that performance on trial 2 is steadily improved, as in the example of a practice effect. On the other hand, sometimes repeated trials produce gradual fatigue or boredom, and performance steadily declines from trial to trial. These two effects can both be referred to as progressive effects because it is assumed that performance changes steadily (progressively) from trial to trial. Also, some particular sequences might produce effects that are different from those of other sequences, what could be called a carryover effect. Thus, in a study with two basic conditions, experiencing condition A before condition B might affect the person much differently than experiencing B before A. For example, suppose you were studying the effects of noise on a problem-solving task using a within-subjects design. Let's say that participants will be trying to solve anagram problems (rearrange letters to form words) under some time pressure.
In condition A, they have to solve the anagrams while distracting noises come from the next room, and these noises are presented randomly and therefore are unpredictable. In condition B, the same total amount of noise occurs; however, it is not randomly presented but instead occurs in predictable patterns. If you put the people in condition A first (unpredictable noise), and then in B (predictable noise), they will probably do poorly in A (most people do). This poor performance might discourage them and carry over to condition B. They should do better in B, but as soon as the noise begins, they might say to themselves, "Here we go again," and perhaps not try as hard. On the other hand, if you run condition B first, with the predictable noise, your subjects might do reasonably well (most people do), and some of the confidence might carry over to the second part of the study. When they then encounter condition A, they might do better than you would ordinarily expect. Thus, performance in condition A might be much worse in the sequence A-B than in the sequence B-A, and a similar problem would occur for condition B. In short, the sequence in which the conditions are presented, independently of any practice or fatigue effects, might influence the study's outcome. In studies where carryover effects might be suspected, researchers often switch to a between-subjects design. Indeed, studies comparing predictable and unpredictable noise typically put people in two different groups.

The Problem of Controlling Sequence Effects

The normal way to control sequence effects in a within-subjects design is to use more than one sequence, a strategy known as counterbalancing. As I will elaborate later, the procedure works better for progressive effects than for carryover effects. There are two general categories of counterbalancing, depending on whether participants are tested in each experimental condition just one time or are tested more than once per condition.
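
A small numerical sketch may help show why using both sequences handles a progressive effect more readily than a carryover effect. Everything in the Python code below is assumed for illustration: the "true" scores (10 for condition A, 12 for condition B), the size of the practice effect, and the carryover amounts are invented numbers, and condition_means is a hypothetical helper rather than a standard analysis.

```python
def condition_means(true_scores, orders, position_effect, carryover):
    """Average observed score per condition across the given sequences.

    true_scores: condition -> score when no sequence effect operates
    orders: list of sequences, e.g. [("A", "B"), ("B", "A")]
    position_effect: amount added at each serial position (progressive)
    carryover: (previous condition, current condition) -> amount added
    """
    totals = {cond: 0.0 for cond in true_scores}
    counts = {cond: 0 for cond in true_scores}
    for order in orders:
        previous = None
        for position, cond in enumerate(order):
            score = true_scores[cond] + position_effect[position]
            score += carryover.get((previous, cond), 0.0)
            totals[cond] += score
            counts[cond] += 1
            previous = cond
    return {cond: totals[cond] / counts[cond] for cond in true_scores}


true_scores = {"A": 10, "B": 12}      # assumed: B truly beats A by 2
practice = [0, 2]                     # assumed: +2 on whatever comes second

# One sequence only: practice inflates B, so the difference looks like 4.
print(condition_means(true_scores, [("A", "B")], practice, {}))
# {'A': 10.0, 'B': 14.0}

# Both sequences: practice helps A and B equally, the difference is 2 again.
both = [("A", "B"), ("B", "A")]
print(condition_means(true_scores, both, practice, {}))
# {'A': 11.0, 'B': 13.0}

# Assumed asymmetric carryover: A first hurts B by 3, B first helps A by 1.
asymmetric = {("A", "B"): -3, ("B", "A"): 1}
print(condition_means(true_scores, both, [0, 0], asymmetric))
# {'A': 10.5, 'B': 10.5}
```

In the last case both sequences were used, yet the assumed true difference of 2 vanishes, because the effect of going second depends on which condition came first and so does not average out.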
Testing Once per Condition

In some experiments, participants will be tested in each of the conditions but tested only once per condition. Consider, for example, an interesting study by Reynolds (1992) on the ability of chess players to recognize the level of expertise in other chess players. He recruited 15 chess players with different degrees of expertise from various clubs in New York City and asked them to look at six different chess games that were said to be in progress (i.e., about 20 moves into the game). On each trial, the players examined the board of an in-progress game (they were told to assume that the pair of players of each game were of equal ability) and estimated the skill level of the players according to a standard rating system. The games were deliberately set up to reflect different levels of player expertise. Reynolds found that the more highly skilled of the 15 chess players made more accurate estimates of the ability reflected in the board setups they examined than did the less skilled players.

You'll recognize the design of the Reynolds study as including a within-subjects variable. Each of the 15 participants examined all six games. Also, you can see that it made sense for each game to be evaluated just one time by each player. Hence, Reynolds was faced with the question of how to control for any sequence effects that might be present. He certainly didn't want all 15 participants to see the six games in exactly the same order. How might he have proceeded?

Complete Counterbalancing

Whenever participants are tested once per condition in a within-subjects design, one solution to the sequence problem is to use complete counterbalancing. This means that every possible sequence will be used at least once. The total number of sequences needed can be determined by calculating X!, where X is the number of conditions, and "!" stands for the mathematical calculation of a "factorial."
For example, if a study has three conditions (A, B, and C), there are six possible sequences that can be used:

A B C     B A C
A C B     C A B
B C A     C B A

The problem with complete counterbalancing is that as the number of levels of the independent variable increases, the number of sequences needed grows very rapidly (factorially). There are 6 sequences needed for three conditions, but simply adding a fourth condition creates a need for 24 sequences (4 x 3 x 2 x 1). As you can guess, complete counterbalancing was not possible in Reynolds's study unless he recruited many more than 15 chess players. In fact, with six different games (i.e., conditions), he would need to find 6! or 720 players to cover all of the possible sequences. Clearly, Reynolds used a different strategy.

Partial Counterbalancing

Whenever a subset of the total number of sequences is used, the result is called partial counterbalancing. This was Reynolds's solution; he simply took a random sample of the 720 possible sequences by ensuring that "the order of presentation [was] randomized for each subject" (Reynolds, 1992, p. 411). Sampling from the population of sequences is a common strategy whenever there are fewer participants available than possible sequences or when there are a fairly large number of conditions.

Reynolds sampled from the total number of sequences, but he could have chosen another approach that is used sometimes: the balanced Latin square. This device gets its name from an ancient Roman puzzle about arranging Latin letters in a matrix so that each letter appears only once in each row and each column (Kirk, 1968). The Latin square strategy is more sophisticated than simply choosing a random subset of the whole.
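Before turning to the Latin square, the counting behind complete counterbalancing and the random sampling of sequences used in partial counterbalancing can be sketched in Python. This is only an illustration: the function names `complete_counterbalancing` and `random_orders`, the letter labels for the six chess games, and the seed are assumptions, and the sketch does not claim to reproduce how Reynolds actually randomized his orders.

```python
import itertools
import math
import random


def complete_counterbalancing(conditions):
    """Return every possible order of the conditions (X! sequences)."""
    return list(itertools.permutations(conditions))


def random_orders(conditions, n_participants, seed=None):
    """Partial counterbalancing: draw one random order per participant."""
    rng = random.Random(seed)  # a seed only makes the sketch repeatable
    orders = []
    for _ in range(n_participants):
        order = list(conditions)
        rng.shuffle(order)  # shuffle a fresh copy for each participant
        orders.append(order)
    return orders


# Three conditions yield 3! = 6 sequences.
three_condition_sequences = complete_counterbalancing("ABC")
n_three = len(three_condition_sequences)

# The number of sequences needed climbs quickly with more conditions.
n_four = math.factorial(4)
n_six = math.factorial(6)

# Fifteen participants, six games: each person gets an independently
# randomized order, a random sample from the 720 possible sequences.
chess_orders = random_orders("ABCDEF", n_participants=15, seed=1)

print(n_three, n_four, n_six, len(chess_orders))
```

Because each order is drawn independently, nothing guarantees that every game appears equally often in every position; that is the guarantee the balanced Latin square adds.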
With a perfectly balanced Latin square, you are assured that (a) every condition of the study occurs equally often in every sequential position, and (b) every condition precedes and follows every other condition exactly once. Work through Table 6.2 to see how to construct the following 6 x 6 Latin square. Think of each letter as one of the six games inspected by Reynolds's chess players.

A B F C E D
B C A D F E
C D B E A F
D E C F B A
E F D A C B
F A E B D C

Follow condition A (chess game A) through the square to see how it meets the two requirements listed in the preceding paragraph. First, condition A occurs in each of the six sequential positions (first in the first row, third in the second row, etc.). Second, A is followed by each of the other letters exactly one time. From the top row to the bottom, (1) A is followed by B, D, F, nothing, C, and E, and (2) A is preceded by nothing, C, E, B, D, and F. The same is true for each of the other letters. To use the 6 x 6 Latin square, one randomly assigns each of the six conditions of the experiment (six different chess games for Reynolds) to one of the six letters, A through F.

When using Latin squares, it is necessary for the number of participants to be equal to or be a multiple of the number of rows in the square. The fact that Reynolds had 15 participants in his study tells you that he didn't use a Latin square. If he had added three more chess players, giving him an N of 18, he could have randomly assigned three players to each of the six rows of the square (3 x 6 = 18).

Testing More Than Once per Condition

In the Reynolds study, it made no sense to ask the chess players to look at any of the six games more than once.
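Returning briefly to the balanced Latin square: the construction rule laid out in Table 6.2 can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation. The function names (`balanced_latin_square`, `is_balanced`, `assign_to_rows`), the hypothetical player labels, and the seed are all introduced here for illustration, and the construction is limited to an even number of conditions, as the table's note requires.

```python
import random


def balanced_latin_square(n_conditions):
    """Build a balanced Latin square for an even number of conditions."""
    if n_conditions < 2 or n_conditions % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # Step 1: first row follows the pattern A, B, X, C, X-1, D, ...
    # (alternating between the next unused low and high letters).
    first_row = [0]
    low, high = 1, n_conditions - 1
    take_low = True
    while len(first_row) < n_conditions:
        if take_low:
            first_row.append(low)
            low += 1
        else:
            first_row.append(high)
            high -= 1
        take_low = not take_low
    # Steps 2 and 3: each later row replaces every letter with the next one
    # in the alphabet, wrapping from the last letter back to A.
    rows = [[(c + shift) % n_conditions for c in first_row]
            for shift in range(n_conditions)]
    return ["".join(chr(ord("A") + c) for c in row) for row in rows]


def is_balanced(square):
    """Check both requirements: positions and immediate precedence."""
    n = len(square)
    letters = set(square[0])
    rows_ok = all(len(row) == n and set(row) == letters for row in square)
    # (a) every letter appears once in every sequential position (column)
    columns_ok = all({row[pos] for row in square} == letters for pos in range(n))
    # (b) every ordered pair of adjacent letters occurs exactly once
    pairs = [(row[i], row[i + 1]) for row in square for i in range(n - 1)]
    pairs_ok = len(set(pairs)) == n * (n - 1)
    return rows_ok and columns_ok and pairs_ok


def assign_to_rows(participants, square, seed=None):
    """Step 4 (in part): randomly give an equal number of people to each row."""
    if len(participants) % len(square) != 0:
        raise ValueError("participants must be a multiple of the number of rows")
    shuffled = list(participants)
    random.Random(seed).shuffle(shuffled)
    return {person: square[i % len(square)] for i, person in enumerate(shuffled)}


square = balanced_latin_square(6)
players = [f"player_{i}" for i in range(1, 19)]  # hypothetical N of 18
assignment = assign_to_rows(players, square, seed=0)
for row in square:
    print(" ".join(row))
```

Calling `assign_to_rows` with 15 players raises an error, which mirrors the point that 15 participants cannot be spread evenly over six rows. Randomly mapping the actual conditions onto the letters A through F is left out of the sketch.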
TABLE 6.2 Building a Balanced 6 x 6 Latin Square

In a balanced Latin square, every condition of the study occurs equally often in every sequential position, and every condition precedes and follows every other condition exactly once. Here's how to build a 6 x 6 square.

Step 1. Build the first row. It is fixed according to this general rule:

A  B  X  C  X-1  D  X-2  E  X-3  F, etc.

where A refers to the first condition of the study and "X" refers to the letter symbolizing the final condition of the experiment. To build the 6 x 6 square, this first row would substitute:

X = the sixth letter of the alphabet → F
X - 1 = the fifth letter → E

Therefore, the first row would be

A B F (subbing for "X") C E (subbing for "X - 1") D

Step 2. Build the second row. Directly below each letter of row 1, place in row 2 the letter that is next in the alphabet. The only exception is the F. Under that letter, return to the first of the six letters and place the letter A. Thus:

A B F C E D
B C A D F E

Step 3. Build the remaining four rows following the step 2 rule. Thus, the final 6 x 6 square is:

A B F C E D
B C A D F E
C D B E A F
D E C F B A
E F D A C B
F A E B D C

Step 4. Take the six conditions of the study and randomly assign them to the letters A through F to determine the actual sequence of conditions for each row. Assign an equal number of participants to each row.

Note. This procedure works whenever there is an even number of conditions. If the number of conditions is odd, two squares will be needed: one created using the above procedure, and a second that is an exact reversal of the square created with the above procedure. For more details, see Winer, Brown, and Michaels (1994).

Similarly, if participants in a memory experiment are asked to study and recall four lists of words, with the order of the lists determined by a 4 x 4 Latin square, they will seldom be asked to study and recall any particular list a second time unless the researcher is specifically interested in the effects of repeated
trials on memory. However, in many studies it is reasonable, even necessary, for participants to experience each condition more than one time. This often happens in research in sensation and perception, for instance. A look back at Figure 6.1 provides an example.

Suppose you were conducting a study in which you wanted to see if participants would be more affected by the illusion when it was presented vertically than when shown horizontally or at a 45° angle. Four conditions of the study are assigned to the letters A-D:

A = horizontal
B = 45° to the left
C = 45° to the right
D = vertical

Participants in the study are shown the illusion on a computer screen and have to make adjustments to the lengths of the parallel lines until they perceive that the lines are equal. The four conditions could be presented to people according to one of two basic procedures.

Reverse Counterbalancing

When using reverse counterbalancing, the experimenter simply presents the conditions in one order, and then presents them again in the reverse order. In the illusion case, the order would be A-B-C-D, then D-C-B-A. If the researcher desires to have the participant perform the task more than twice per condition, and this is common in perception research, this sequence could be repeated as many times as necessary. Hence, if you wanted each participant to adjust each of the four illusions of Figure 6.1 six separate times, and you decided to use reverse counterbalancing, participants would see the illusions in this sequence:

A-B-C-D-D-C-B-A-A-B-C-D-D-C-B-A-A-B-C-D-D-C-B-A

Reverse counterbalancing was used in one of psychology's most famous studies, completed in the 1930s by J. Ridley Stroop. You've probably tried the Stroop task yourself: when shown color names printed in the wrong colors, you were asked to name the color rather than read the word. That is, when shown the word "RED" printed in blue ink, the correct response is "blue," not "red."
Stroop's study is a classic example of a particular type of design described in the next chapter, so you will be learning more about his work when you encounter Box 7.1 (p. 239).[2]

[2] Although reverse counterbalancing normally occurs when participants are tested more than once per condition, the principle can also be applied in a within-subjects design in which participants see each condition only once. Thus, if a within-subjects study has six different conditions, each tested only once per person, half of the participants could get the sequence A-B-C-D-E-F, while the remaining participants experience the reverse order (F-E-D-C-B-A).

Block Randomization

A second way to present a sequence of conditions when each condition is presented more than once is to use block randomization, the same procedure outlined earlier in the context of how to assign participants randomly to groups in a between-subjects experiment. The basic rule is that every condition occurs once before any condition is repeated a second time. Within each block, the order of conditions is randomized. This strategy eliminates the possibility that participants can predict what is coming next, a problem that can occur with reverse counterbalancing. To use the illusions example again (Figure 6.1), participants would encounter all four conditions in a randomized order, then all four again but in a block with a new randomized order, and so on for as many blocks of four as needed. A reverse counterbalancing would look like this:

A-B-C-D-D-C-B-A

A block randomization procedure might produce either of these two sequences (among others):

B-C-D-A-C-A-D-B or C-A-B-D-A-B-D-C

To give you a sense of how block randomization works in an actual within-subjects experiment employing many trials, consider the following auditory perception study by Carello, Anderson, and Kunkler-Peck (1998).
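Before turning to that study, the two procedures just compared can be sketched in Python for the four illusion conditions. The function names `reverse_counterbalance` and `block_randomize` and the seed are assumptions made for this illustration; the sketch simply follows the two rules as stated above and is not taken from any particular experiment's software.

```python
import random


def reverse_counterbalance(conditions, trials_per_condition):
    """Present the conditions forward, then reversed, repeated as needed."""
    if trials_per_condition % 2 != 0:
        raise ValueError("each forward-plus-reverse pass gives two trials per condition")
    forward = list(conditions)
    one_pass = forward + forward[::-1]  # e.g., A-B-C-D then D-C-B-A
    return one_pass * (trials_per_condition // 2)


def block_randomize(conditions, n_blocks, seed=None):
    """Every condition occurs once per block; order is shuffled within blocks."""
    rng = random.Random(seed)  # a seed only makes the sketch repeatable
    sequence = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)  # new random order for each block
        sequence.extend(block)
    return sequence


# Six trials per illusion with reverse counterbalancing (24 trials total).
reverse_sequence = "-".join(reverse_counterbalance("ABCD", 6))

# Six blocks of four with block randomization (also 24 trials total).
block_sequence = block_randomize("ABCD", n_blocks=6, seed=3)

print(reverse_sequence)
print("-".join(block_sequence))
```

The reverse-counterbalanced sequence is fully predictable once a participant notices the pattern, whereas the block-randomized one changes from block to block while still giving every condition the same number of trials.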
Research Example 5: Counterbalancing with Block Randomization

Our ability to localize sound has been known for a long time; under normal circumstances, we are quite adept at identifying the location from which a sound originates. What interested Carello and her research team was whether people could identify something about the physical size of an object simply by hearing it drop on the floor. She devised the apparatus pictured in Figure 6.2 to examine the question. Participants heard a wooden dowel hit the floor, and then tried to judge its length. They made their response by adjusting the distance between the edge of the desk they were sitting at and a movable vertical surface during a "trial," which was defined as having the same dowel dropped five times in a row from a given height. During the five drops, participants were encouraged to move the wall back and forth until they were comfortable with their decision about the dowel's size. In the first of two experiments, the within-subjects independent variable was the length of the dowel, and there were seven levels (30, 45, 60, 75, 90, 105, and 120 cm). Each participant ...

FIGURE 6.2 The experimental setup for Carello, Anderson, & Kunkler-Peck (1998). After hearing a rod drop, participants adjusted the distance between the edge of their desk and the vertical surface facing them to match what they perceived to be the length of the rod.

Self Test 6.2

1. What is the defining feature of a within-subjects design? What is the main control problem that must be solved with this type of design?
2. If your IV has 6 levels, each tested just once per subject, why are you more likely to use partial counterbalancing instead of complete counterbalancing?
3. If participants are going to be tested more than one time for each level of the IV, what two forms of counterbalancing may be used?
Control Problems in Developmental Research

As you have learned, the researcher must weigh several factors when deciding whether to use a between-subjects design or a within-subjects design. There are some additional considerations for researchers in developmental psychology, where two specific varieties of these designs occur. These methods are known as cross-sectional and longitudinal designs.

You've seen these terms before if you have taken a course in developmental or child psychology. Research in these areas includes age as the prime variable; after all, the name of the game in developmental psychology is to discover how we change as we grow older. A cross-sectional study takes a between-subjects approach. A cross-sectional study comparing the language performance of 3-, 4-, and 5-year-old children would use three different groups of children. A longitudinal study, on the other hand, studies a single group over a period of time; it takes a within-subjects or repeated-measures approach. The same language study would measure language behavior in a group of 3-year-olds, and then study these same children when they turned 4 and 5.

The obvious advantage of the cross-sectional approach to the experiment on language is time; such a study might take a month to complete. If done as a longitudinal study, it would take 3 years. However, a potentially serious difficulty with some cross-sectional studies is a special form of the problem of nonequivalent groups and involves what are known as cohort effects. A cohort is a group of people born at about the same time. If you are studying three age groups, they differ not just in chronological age but also in terms of the environments in which they were raised.
The problem is not especially noticeable when comparing 3-, 4-, and 5-year-olds, but what if you're interested in whether intelligence declines with age and decide to compare groups aged 30, 50, and 70? You might indeed find a decline with age, but does it mean that intelligence gradually decreases with age, or might the differences relate to the very different life histories of the three groups? For example, the 70-year-olds went to school during the Great Depression, the 50-year-olds were educated during the post-World War II boom, and the 30-year-olds were raised on TV. These factors could bias the results. Indeed, this outcome has occurred. Early research on the effects of age on IQ suggested that significant declines occurred, but these studies were cross-sectional (e.g., Miles, 1933). Subsequent longitudinal studies revealed a very different pattern (Schaie, 1988). For example, verbal abilities show very little decline, especially if the person remains verbally active (moral: use it or lose it).

While cohort effects can plague cross-sectional studies, longitudinal studies also have problems, most notably with attrition (refer back to Chapter 5, p. 189). If a large number of participants drop out of the study, the group completing it may be very different from the group starting it. Referring to the age and IQ example, if people stay healthy, they may remain more active intellectually than if they are sick all of the time. If they are chronically ill, they may die before a study is completed, leaving a group that may be generally more intelligent than the group starting the study. There are also potential ethical problems in longitudinal studies. As people develop and mature, they might change their attitudes about their willingness to participate. Most researchers doing longitudinal research recognize that informed consent is an ongoing process, not a one-time event.
Ethically sensitive researchers will periodically renew the consent process in long-term studies, perhaps every few years (Fischman, 2000).

In trying to balance cohort and attrition problems, some researchers use a strategy that combines cross-sectional with longitudinal studies, a design referred to as a cohort sequential design. In such a study, a group of subjects will be selected and retested every few years, and then additional cohorts will be selected every few years and also retested over time. To take a simple example, suppose you wished to examine the effects of aging on memory, comparing ages 55, 60, and 65. In the study's first year, you would recruit a group of 55-year-olds. Then every five years after that, you would recruit new groups of 55-year-olds, and retest those who had been recruited earlier. Schematically, the design for a study that began in the year 1960 and lasted for 30 years would look like this (the numbers in the matrix refer to the age of the subjects at any given testing point):

                      Year of the Study
Cohort #   1960   1965   1970   1975   1980   1985   1990
   1        55     60     65
   2               55     60     65
   3                      55     60     65
   4                             55     60     65
   5                                    55     60     65

So in 1960, you have a group of 55-year-olds that you test. Then in 1965, these same people (now 60 years old) would be retested, along with a new group of 55-year-olds. By the third testing year (1970), you have cohorts for all three age groups. As you can see, combining the data in each of the diagonals would give you an overall comparison between those aged 55, 60, and 65. Comparing the data in the rows enables a comparison of overall differences between cohorts. In actual practice, these designs are more complicated, because researchers will typically start the first year of the study with a range of ages. But the diagram gives you the basic idea.

Perhaps the best-known example of this type of sequential design is a long series of studies by K. Warner Schaie (2005), known as the Seattle Longitudinal Study.
It began in 1956, designed to examine age-related changes in various mental abilities. The initial cohort had 500 people in it, ranging in age from their early 20s to their late 60s (as of 2005, 38 of these subjects were still in the study, 49 years later!). The study has added a new cohort at 7-year intervals ever since 1956 and has recently reached the 50-year mark. In all, about 6,000 people have participated. In general, Schaie and his team have found that performance on mental ability tasks declines slightly with age, but with no serious losses before age 60, and the losses can be reduced by good physical health and lots of crossword puzzles. Concerning cohort effects, they have found that overall performance has been progressively better for those born more recently. Presumably, those born later in the twentieth century have had the advantages of better education, better nutrition, and so on.

The length of Schaie's Seattle project is impressive, but the world's record for perseverance in a repeated-measures study occurred in what is arguably the most famous longitudinal study of all time. Before continuing, read Box 6.1, which chronicles the epic tale of Lewis Terman's study of gifted children.

... school, but a group of 444 were in junior or senior high school (sample numbers from Minton, 1988). Their average IQ score was 150, which put the group roughly in the top 1% of the population. Each child was given an extensive battery of tests and questionnaires by the team of graduate students assembled by Terman. By the time the initial testing was complete, each child had a file about 100 pages long (Minton, 1988)! The results of the first analysis of the group were published in more than 600 pages as the Mental and Physical Traits of a Thousand Gifted Children (Terman, 1925). Terman intended to do just a brief follow-up study, but the project took on a life of its own.
The sample was retested in the late 1920s (Burks, Jensen, & Terman, 1930), and additional follow-up studies during Terman's lifetime were published 25 (Terman & Oden, 1947) and 35 (Terman & Oden, 1959) years after the initial testing. Following Terman's death, the project was taken over by Robert Sears, a member of the gifted group and a well-known psychologist in his own right. In the foreword to the 35-year follow-up, Sears wrote: "On actuarial grounds, there is considerable likelihood that the last of Terman's Gifted Children will not have yielded his last report to the files before the year 2010!" (Terman & Oden, 1959, p. ix). Between 1960 and 1986, Sears produced five additional follow-up studies of the group, and he was working on a book-length study of the group as they aged when he died in 1989 (Cronbach, Hastorf, Hilgard, & Maccoby, 1990). The book was eventually published as The Gifted Group in Later Maturity (Holahan, Sears, & Cronbach, 1995).

There are three points worth making about this mega-longitudinal study. First, Terman's work shattered the stereotype of the gifted child as someone who was brilliant but socially retarded and prone to burnout early in life. Rather, the members of his group as a whole were both brilliant and well adjusted, and they became successful as they matured. By the time they reached maturity, "the group had produced thousands of scientific papers, 60 nonfiction books, 33 novels, 375 short stories, 230 patents, and numerous radio and television shows, works of art, and musical compositions" (Hothersall, 1990, p. 353). Second, the data collected by Terman's team continue to be a source of rich archival information for modern researchers. For instance, studies have been published on the careers of the gifted females in Terman's group (Tomlinson-Keasy, 1990), and on the predictors of longevity in the group (Friedman et al., 1995). Third, Terman's follow-up studies are incredible from the methodological standpoint of a longitudinal study's typical nemesis: attrition.
The following figures (taken from Minton, 1988) are the percentage of living participants who participated in the first three follow-ups:

After 10 years: 92%
After 25 years: 98%
After 35 years: 93%

These are remarkably high numbers and reflect the intense loyalty that Terman and his group had for each other. Members of the group referred to themselves as "Termites," and some even wore termite jewelry (Hothersall, 1990). Terman corresponded with hundreds of his participants and genuinely cared for his special people. After all, the group represented the type of person Terman believed held the key to America's future.

Problems with Biasing

Because humans are always the experimenters and usually the participants in psychology research, there is the chance that the results of a study could be influenced by some human "bias," a preconceived expectation about what is to happen in an experiment. These biases take several forms but fall into two broad categories: those affecting experimenters and those affecting research participants. These two forms of bias often interact.

Experimenter Bias

The Clever Hans case (Chapter 3, pp. 96-98) is often used to illustrate the influence of experimenter bias on the outcome of some study. Hans's trainer, knowing the outcome to the question "What is 3 times 3?," sent subtle head-nodding cues that were read by the apparently intelligent horse. Similarly, experimenters testing hypotheses sometimes may inadvertently do something that leads participants to behave in ways that confirm the hypothesis. Although the stereotype of the scientist is that of an objective, dispassionate, even mechanical person, the truth is that researchers can become rather emotionally involved in their research. It's not difficult to see how a desire to confirm some strongly held hypothesis might lead an unwary experimenter to behave in such a way as to influence the outcome of the study.
For one thing, biased experimenters might treat the research participants in the various conditions differently. Robert Rosenthal developed one procedure demonstrating this. Participants in one of his studies (e.g., Rosenthal & Fode, 1963a) were shown a set of photographs of faces and asked to make some judgment about the people pictured in them. For example, they might be asked to rate each photo on how successful the person seemed to be, with the interval scale ranging from -10 (total failure) to +10 (totally successful). All participants saw the same photos and made the same judgments. The independent variable was experimenter expectancy. Some experimenters were led to believe that most subjects would give people the benefit of the doubt and rate the pictures positively; other experimenters were told to expect negative ratings. Interestingly enough, the experimenter's expectancies typically produced effects on the subjects' rating behavior, even though the pictures were identical for both groups. How can this be?

According to Rosenthal (1966), experimenters can innocently communicate their expectancies in a number of subtle ways. For instance, on the person perception task, the experimenter holds up a picture while the participant rates it. If the experimenter is expecting a "+8" and the person says "-3," how might the experimenter act? With a slight frown perhaps? How might the participant read the frown? Might he or she try a "+7" on the next trial to see if this could elicit a smile or a nod from the experimenter? In general, could it be that experimenters in this situation, without even being aware of it, are subtly shaping the responses of their participants? Does this remind you of Clever Hans?

Rosenthal has even shown that experimenter expectancies can be communicated to subjects in animal research. For instance, rats learn mazes faster for experimenters
For instance, rats learn mazes faster for experimenters Problems with Biasing who think their animals have been bred for maze-running ability than for those expecting their rats to be "maze-dull" (Rosenthal & Fode, 1963b). The rats, of course, are randomly assignedto the experimenters and are equal in ability. The key factor here seems to be that experimenters expecting their rats to be "maze-bright" treat them better; for example, they handle them more, a behavior known to affect learning. It should be noted that some of the Rosenthal research has been criticized on statisticalgrounds and for interpreting the results as being due to expectancy when they may have been due to something else. For example, Barber (1976) raised questions about the statistical conclusion validity of some of Rosenthal's work. In at least one study, according to Barber, 3 of 20 experimenters reversed the expectancy results, getting data the opposite of the expectancies created for them. Rosenthal omitted these experimenters from the analysis and obtained a significant difference for the remaining 17 experimenters. With all 20 experimenters included in the analysis, however, the hfference &sappeared. Barber also contends that, in the animal studies, some of the results occurred because experimenters simply fudged the data (e.g., misrecording maze errors). Another difficulty with the Rosenthal studies is that h s procedures don't match what normally occurs in experiments; most experimenters test all of the participants in all conditions of the experiment, not just those participating in one of the conditions. Hence, Rosenthal's results might overestimate the amount of biasing that occurs. 
Despite these reservations, the experimenter expectancy effect cannot be ignored; it has been replicated in a variety of situations and by many researchers other than Rosenthal and his colleagues (e.g., Word, Zanna, & Cooper, 1974). Furthermore, experimenters can be shown to influence the outcomes of studies in ways other than through their expectations. The behavior of participants can be affected by the experimenter's race and gender, as well as by demeanor, friendliness, and overall attitude (Adair, 1973). An example of the latter is a study by Fraysse and Desprels-Fraysse (1990), who found that preschoolers' performance on a cognitive classification task could be influenced by experimenter attitude. The children performed significantly better with "caring" than with "indifferent" experimenters.

Controlling for Experimenter Bias

It is probably impossible to eliminate experimenter effects completely. Experimenters cannot be turned into machines. However, one strategy to reduce bias is to mechanize procedures as much as possible. For instance, it's not hard to remove a frowning or smiling experimenter from the person perception task. With modern computer technology, participants can be shown photos on a screen and asked to make their responses with a key press while the experimenter is in a different room entirely.

Similarly, procedures for testing animals automatically have been available since the 1920s, even to the extent of eliminating human handling completely. E. C. Tolman didn't wait for computers to come along before inventing "a self-recording maze with an automatic delivery table" (Tolman, Tryon, & Jeffries, 1929). The "delivery table" was so called because it "automatically delivers each rat into the entrance of the maze and 'collects' him at the end without the mediation of the experimenter.
Objectivity of scoring is insured by the use of a device which automatically records his path through the maze" (Tryon, 1929, p. 73). Today such automation is routine. Recall from Chapter 4 the study of rats in the radial maze, in which rat "macrochoices" and "microchoices" were confirmed by videotaping each animal's performance and defining those two constructs in terms of easily verifiable behaviors (Brown, 1992). Furthermore, computers make it easy to present instructions and stimuli to participants while also keeping track of data.

Experimenters can mechanize many procedures, to some degree at least, but the experimenter will be interacting with every participant nonetheless. Hence, it is important for experimenters to be given some training in how to be experimenters, and for the experiments to have highly detailed descriptions of the sequence of steps that experimenters should follow in every research session. These descriptions are called research protocols.

Another strategy for controlling for experimenter bias is to use what is called a double blind procedure. This means simply that experimenters are kept in the dark (blind) about what to expect of participants in a particular testing session. As a result, neither the experimenters nor the participants know which condition is being tested, hence the designation "double." A double blind can be accomplished when the principal investigator sets up the experiment but a colleague (usually a graduate student) actually collects the data. Double blinds are not always possible, of course, as illustrated by the Dutton and Aron (1974) study you read about in Chapter 3. As you recall, female experimenters arranged to encounter men either on a suspension bridge swaying 230 feet over a river or on a solid bridge 10 feet over the same river. It would be a bit difficult to prevent those experimenters from knowing which condition of the study was being tested!
On the other hand, many studies lend themselves to a procedure in which experimenters are blind to which condition is in effect. Research Example 6, which could increase the stock price of Starbucks, is a good example.

Research Example 6: Using a Double Blind

There is considerable evidence that as we age, we become less efficient cognitively in the afternoon. Also, older adults are more likely to describe themselves as "morning persons" (I am writing this on an early Saturday morning, so I think I'll get it right). Ryan, Hatfield, and Hofstetter (2002) wondered if the cognitive decline, as the day wears on, could be neutralized by America's favorite drug, caffeine. They recruited 40 seniors, all 65 or older and self-described as (a) morning types and (b) moderate users of caffeine, and placed them into either a caffeine group or a decaf group (using Starbucks "house blends"). They were then given a standardized memory test on two different occasions, once at 8:00 a.m. and once at 4:00 p.m. The study was a double blind because the experimenters administering the memory tests did not know which participants had ingested caffeine, and the seniors did not know which type of coffee they were drinking. And to test for the adequacy of the control procedures, the researchers completed a clever "manipulation check" (you will learn more about this concept in a few paragraphs). At the end of the study, during debriefing, they asked the participants to guess whether they had been drinking the real stuff or the decaf. The accuracy of the seniors' responses was at chance level. In fact, most guessed incorrectly that they had been given regular coffee during one testing session and decaf at the other.

The researchers also did a nice job of incorporating some of the other control procedures you learned about in this chapter.
For instance, the seniors were randomly assigned to the two different groups, and this random assignment seemed to produce the desired equivalent groups: the groups were indistinguishable in terms of age, education level, and average daily intake of caffeine. Also, counterbalancing was used to ensure that half of the seniors were tested first in the morning, then the afternoon, while the other half were tested in the sequence afternoon-morning.

The results? Time of day did not seem to affect a short-term memory task, but it had a significant effect on a more difficult longer-term task in which seniors learned some information, then had a 20-minute delay, then tried to recall the information, and then completed a recognition test for that same information. And caffeine prevented the decline for this more demanding task. On both the delayed recall and the delayed recognition tasks, seniors scored equally well in the morning sessions. In the afternoon sessions, however, those ingesting caffeine still did well, but the performance of those taking decaf declined. On the delayed recall task, for instance, here are the means (max score = 16). Also, remember from Chapter 4 (pp. 137-138) that, when reporting descriptive statistics, it is important to report not just a measure of central tendency (mean), but also an indication of variability. So, in parentheses after each mean below, notice that I have included the standard deviations (SD).

Morning with caffeine -> 11.8 (SD = 2.9)
Morning with decaf -> 11.0 (SD = 2.7)
Afternoon with caffeine -> 11.7 (SD = 2.8)
Afternoon with decaf -> 8.9 (SD = 3.0)

So, if the word gets out about this study, the average age of Starbucks' clients might start to go up, starting around 3:00 in the afternoon. Of course, they will need to avoid the decaf.

Participant Bias

People participating in psychological research cannot be expected to respond like machines. They are humans who know they are in an experiment.
Presumably they have been told about the general nature of the research during the informed consent process, but in deception studies they also know they haven't been told everything. Furthermore, even if there is no deception in a study, participants may not believe it; after all, they are in a "psychology experiment," and aren't psychologists always trying to "psychoanalyze" people? In short, participant bias can occur in several ways, depending on what participants are expecting and what they believe their role should be in the study.

When behavior is affected by the knowledge that one is in an experiment and is therefore important to the study's success, the phenomenon is sometimes called the Hawthorne effect, after a famous series of studies of worker productivity. To understand the origins of this term, you should read Box 6.2 before continuing. You may be surprised to learn that most historians believe the Hawthorne effect has been misnamed and that the data of the original study were distorted for political reasons.

[Box 6.2, recoverable portion]

[...] The fact is that of the five original assemblers, two [...] the room for insubordination and low output. One was said to have [...] (Bramel & Friend, 1981). (Remember, the Soviet Union was brand new in the 1920s, and the "red menace" was a threat to industrial America, resulting in things like a fear of labor unions.) Of the two replacements, one was especially talented and enthusiastic and quickly became the group leader. She apparently was selected because she "held the record as the fastest relay-assembler in the regular department" (Gillespie, 198[?], p. 122). Her efforts contributed mightily to the high level of productivity.

A second problem with interpreting the relay room data is a simple statistical problem. In the famous 12th period, productivity was recorded as output per week rather than output per hour, yet workers were putting in an extra 6 hours per week compared to the previous test period. If the more appropriate output per hour is used, productivity actually declined slightly (Bramel & Friend, 1981). Also, the women were apparently angry about the change, but afraid to complain lest they be removed from the test room, thereby losing bonus money. Lastly, it could have been that in some of the Hawthorne experiments, increased worker productivity could have been simply the result of feedback about performance, along with rewards for productivity (Parsons, 1974).

Historians argue that events must be understood within their entire political/economic/institutional context, and the Hawthorne studies are no exception. Painting a glossy picture of workers unaffected by specific working conditions and more concerned with being considered special [...] the human relations movement in industry and led corporations to emphasize the humane management of employees [...] some historians (e.g., Bramel & Friend, 1981) believe [...]

[End of Box 6.2]

Most research participants, in the spirit of trying to help the experimenter and contribute meaningful results, take on the role of the good subject, first described by Orne (1962). There are exceptions, of course, but, in general, participants tend to be very cooperative, to the point of persevering through repetitive and boring tasks, all in the name of psychological science. Furthermore, if participants can figure out the hypothesis, they may try to behave in such a way that confirms it.
Orne used the term demand characteristics to refer to those aspects of the study that reveal the hypotheses being tested. If these features are too obvious to participants, they no longer act naturally and it becomes difficult to interpret the results. Did participants behave as they normally would, or did they come to understand the hypothesis and behave so as to make it come true? Orne demonstrated how demand characteristics can influence a study's outcome by recruiting students for a so-called sensory deprivation experiment (Orne & Scheibe, 1964). He assumed that participants told that they were in such an experiment would expect the experience to be stressful and might respond accordingly. This indeed occurred. Participants who sat for four hours in a small but comfortable room showed signs of stress only if (a) they signed a form releasing the experimenter from any liability in case anything happened to them, and (b) the room included a "panic button" that could be pressed if they felt too stressed by the deprivation. Control participants were given no release form to sign, no panic button to press, and no expectation that their senses were being deprived. They did not react adversely.

The possibility that demand characteristics are operating has an impact on decisions about whether to opt for between- or within-subjects designs. Participants serving in all of the conditions of a study have a greater opportunity to figure out the hypothesis(es). Hence, demand characteristics are potentially more troublesome in within-subjects designs than in between-subjects designs. For both types of designs, demand characteristics are especially devastating if they affect some conditions but not others, thereby introducing a confound.

Besides being good subjects (i.e., trying to confirm the hypothesis), participants wish to be perceived as competent, creative, emotionally stable, and so on.
The belief that they are being evaluated in the experiment produces what Rosenberg (1969) called evaluation apprehension. Participants want to be evaluated positively, so they may behave as they think the ideal person should behave. This concern over how one is going to look and the desire to help the experimenter often lead to the same behavior among participants, but sometimes the desire to create a favorable impression and the desire to be a good subject conflict. For example, in a helping behavior study, astute participants might guess that they are in the condition of the study designed to reduce the chances that help will be offered. On the other hand, altruism is a valued, even heroic, behavior. The pressure to be a good subject and support the hypothesis pulls the participant toward nonhelping, but evaluation apprehension makes the individual want to help. At least one study has suggested that when participants are faced with a choice between confirming the hypothesis and being evaluated positively, the latter is the more powerful motivator (Rosnow, Goodstadt, Suls, & Gitter, 1973).

Controlling for Participant Bias

The primary strategy for controlling participant bias is to reduce demand characteristics to the minimum. One way of accomplishing this, of course, is through deception. As we've seen in Chapter 2, the primary purpose of deception is to induce participants to behave more naturally than they otherwise might. A second strategy, normally found in drug studies, is to use a placebo control group (see Chapter 7, pp. 256-257). This procedure allows for a comparison between those actually getting some treatment (e.g., a drug) and those who think they are getting the treatment but aren't. If the people in both groups behave identically, the effects can be attributed to participant expectations of the treatment's effects. You have probably already recognized that the caffeine study you just read (Research Example 6) used this kind of logic.
A second way to check for the presence of demand characteristics is to do what is sometimes called a manipulation check. This can be accomplished during debriefing by asking participants in a deception study to indicate what they believe the true hypothesis to be (the "good subject" might feign ignorance though). This was accomplished in Research Example 6 by asking participants to guess whether they had been given caffeine in their coffee or not. Manipulation checks can also be done during an experiment. Sometimes a random subset of participants in each condition will be stopped in the middle of a procedure and asked about the clarity of the instructions, what they think is going on, and so on. Manipulation checks are also used to see if some procedure is producing the effect it is supposed to produce. For example, if some procedure is supposed to make people feel anxious (e.g., telling participants to expect shock), a sample of participants might be stopped in the middle of the study and assessed for level of anxiety.

A final way of avoiding demand characteristics is to conduct field research. If participants are unaware that they are in a study, they are unlikely to spend any time thinking about research hypotheses and reacting to demand characteristics. Of course, field studies have problems of their own, as you recall from the discussion of informed consent in Chapter 2 and of privacy invasion in Chapter 3 (pp. 83-84).

Although I stated earlier that most research participants play the role of "good subjects," this is not uniformly true, and some differences exist between those who truly volunteer and are interested in the experiment and those who are more reluctant and less interested.
For instance, true volunteers tend to be slightly more intelligent and have a higher need for social approval (Adair, 1973). Differences between volunteers and nonvolunteers can be a problem when college students are asked to serve as participants as part of a course requirement; some students are more enthusiastic volunteers than others. Furthermore, a "semester effect" can operate. The true volunteers, those really interested in participating, sign up earlier in the semester than the reluctant volunteers. Therefore, if you ran a study with two groups, and Group 1 was tested in the first half of the semester and Group 2 in the second half, the differences found could be due to the independent variable, but they also could be due to differences between the true volunteers who sign up first and the reluctant volunteers who wait as long as they can. Can you think of a way to control for this problem? If the concept "block randomization" occurs to you, and you say to yourself "this will distribute the conditions of the study equally throughout the duration of the semester," then you've accomplished something in this chapter. Well done.

✓ Self Test 6.3
1. Unlike most longitudinal studies, Terman's study of gifted children did not experience which control problem?
2. Why does a double blind procedure control for experimenter bias?
3. How can a demand characteristic influence the outcome of a study?

To close out this chapter, read Box 6.3, which concerns the ethical obligations of those participating in psychological research. The list of responsibilities you'll find there is based on the assumption that research should be a collaborative effort between experimenters and participants. We've seen that experimenters must follow the APA ethics code. In Box 6.3 you'll learn that participants have some responsibilities too.
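The block randomization idea raised in the semester-effect discussion above can be sketched in a few lines of Python. This is only an illustration, not a procedure taken from the text: the function name `block_randomize`, the group labels, the number of sign-ups, and the seed are all assumptions made for the example. Each block contains every condition exactly once in shuffled order, so early and late sign-ups end up spread evenly across conditions.

```python
import random

def block_randomize(conditions, n_blocks, seed=None):
    """Return an assignment order made of n_blocks shuffled blocks.

    Every block holds each condition exactly once, so the conditions
    stay balanced across the whole run of sign-ups.
    """
    rng = random.Random(seed)  # local generator; the seed is only for reproducibility
    order = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)     # random order within the block
        order.extend(block)
    return order

# Hypothetical example: two groups, 20 participants listed in sign-up order.
schedule = block_randomize(["Group 1", "Group 2"], n_blocks=10, seed=1)
first_half, second_half = schedule[:10], schedule[10:]

# Each half of the semester gets the same number of Group 1 slots.
print(first_half.count("Group 1"), second_half.count("Group 1"))  # prints: 5 5
```

Because every block of two sign-ups contains one slot per group, the eager early volunteers and the reluctant late ones are divided equally between the groups, whatever the shuffle happens to produce.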
Research Participants Have Responsibilities Too

The APA ethics code spells out the responsibilities that researchers have to those who participate in their experiments. Participants have a right to expect that the guidelines will be followed and, if not, there should be a clear process for registering complaints. But what about the subjects? What are their obligations? An article by Jim Korn in the journal Teaching of Psychology (1988) outlines the basic rights that college students have when they participate in research, but it also lists the responsibilities of those who volunteer. They include

✓ Being responsible about scheduling by showing up for their appointments with researchers and arriving on time
✓ Being cooperative and acting professionally by giving their best and most honest effort
✓ Listening carefully to the experimenter during the informed consent and instructions phases and asking questions if they are not sure what to do
✓ Respecting any request by the researcher to avoid discussing the research with others until all the data have been collected
✓ Being active during the debriefing process by helping the researcher understand the phenomenon being studied

The assumption underlying this list is that research should be a collaborative effort between experimenters and participants. Korn's suggestion that participants take a more assertive role in making research more collaborative is a welcome one. This assertiveness, however, must be accompanied by enlightened experimenting that values and probes for the insights that participants have about what might be going on in a study. An experimenter who simply "runs a subject" and records the data is ignoring valuable information.
Chapter Summary

In the last two chapters you have learned about the essential features of experimental research and some of the control problems that must be faced by those who wish to do research in psychology. We've now completed the necessary groundwork for introducing the various kinds of experimental designs used to test the effects of independent variables. So, let the designs begin!

Between-Subjects Designs

In between-subjects designs, individuals participate in just one of the experiment's conditions; hence, each condition in the study involves a different group of participants. Such a design is usually necessary when subject variables (e.g., gender) are being studied or when being in one condition of the experiment changes participants in ways that make it impossible for them to be in another condition. With between-subjects designs, the main difficulty is creating groups that are essentially equivalent to each other on all factors except for the independent variable.

The Problem of Creating Equivalent Groups

The preferred method of creating equivalent groups in between-subjects designs is random assignment. Random assignment has the effect of spreading unforeseen confounding factors evenly throughout the different groups, thereby eliminating their damaging influence. The chance of random assignment working effectively increases as the number of participants per group increases. If few participants are available, if some factor (e.g., intelligence) correlates highly with the dependent variable, and if that factor can be assessed without difficulty before the experiment begins, then equivalent groups can be formed by using a matching procedure.

Within-Subjects Designs

When each individual participates in all of the study's conditions, the study is using a within-subjects (or repeated-measures) design. For these designs, participating in one condition might affect how participants behave in other conditions.
That is, sequence (or order) effects can occur, and they can produce confounded results if not controlled. Sequence effects include progressive effects (they gradually accumulate, as in fatigue) and carryover effects (one sequence of conditions might produce effects different from another sequence).

The Problem of Controlling Sequence Effects

Sequence effects are controlled by various counterbalancing procedures, all of which ensure that the different conditions are tested in more than one sequence. When participants serve in each condition of the study just once, complete (all possible sequences used) or partial (a sample of different sequences or a Latin square) counterbalancing will be used. When participants serve in each condition more than once, reverse counterbalancing or block randomization can be used. Asymmetric transfer can occur when carryover effects are present; such transfer reduces the effectiveness of counterbalancing.

Control Problems in Developmental Research

In developmental psychology, the major independent variable is age, a subject variable. If age is studied between subjects, the design is referred to as a cross-sectional design. It has the advantage of efficiency, but cohort effects can occur, a special form of the problem of nonequivalent groups. If age is a within-subjects variable, the design is called a longitudinal design and attrition can be a problem. The two strategies can be combined in a cohort sequential design: new cohorts are selected every few years and each cohort is tested longitudinally.

Problems with Biasing

The results of research in psychology can be biased by experimenter expectancy effects. These can lead the experimenter to treat participants in various conditions in different ways, making the results impossible to interpret. Such effects can be reduced by automating the procedures and using double blind control procedures. Participant bias also occurs.
Participants might confirm the researcher's hypothesis if demand characteristics suggest to them the true purpose of a study, or they might behave in unusual ways simply because they know they are in an experiment. Demand characteristics are usually controlled through varying degrees of deception, and the extent of participant bias can be evaluated through the use of a manipulation check.

1. Under what circumstances would a between-subjects design be preferred over a within-subjects design?
2. Under what circumstances would a within-subjects design be preferred over a between-subjects design?
3. How does random selection differ from random assignment, and what is the purpose of the latter?
4. As a means of creating equivalent groups, when is matching most likely to be used?
5. Distinguish between progressive effects and carryover effects, and explain why counterbalancing might be more successful with the former than the latter.
6. In a taste test, Joan is asked to evaluate four dry white wines for taste: wines A, B, C, and D. In what sequence would they be tasted if (a) reverse counterbalancing or (b) block randomization were being used? How many sequences would be required if the researcher used complete counterbalancing?
7. What are the defining features of a Latin square and when is one likely to be used?
8. What specific control problems exist in developmental psychology with (a) cross-sectional studies and (b) longitudinal studies?
9. What is a cohort sequential design, and how does it improve on cross-sectional and longitudinal designs?
10. Describe an example of a study that illustrates experimenter bias. How might such bias be controlled?
11. What are demand characteristics and how might they be controlled?
12. What is a Hawthorne effect and what is the origin of the term?

Applications Exercises

Exercise 6.1: Between-Subject or Within-Subject?

Think of a study that might test each of the following hypotheses.
For each, indicate whether you think the independent variable should be a between- or a within-subjects variable or whether either approach would be reasonable. Explain your decision in each case.

1. A neuroscientist hypothesizes that damage to the primary visual cortex is permanent in older animals.
2. A sensory psychologist predicts that it is easier to distinguish slightly different shades of gray under daylight than under fluorescent light.
3. A clinical psychologist thinks that phobias are best cured by repeatedly exposing the person to the feared object and not allowing the person to escape until the person realizes that the object really is harmless.
4. A developmental psychologist predicts cultural differences in moral development.
5. A social psychologist believes people will solve problems more creatively when in groups than when alone.
6. A cognitive psychologist hypothesizes that spaced practice of verbal information will lead to greater retention than massed practice.
7. A clinician hypothesizes that people with an obsessive-compulsive disorder will be easier to hypnotize than people with a phobic disorder.
8. An industrial psychologist predicts that worker productivity will increase if the company introduces flextime scheduling (i.e., work 8 hours, but start and end at different times).

Exercise 6.2: Constructing a Balanced Latin Square

A memory researcher wishes to compare long-term memory for a series of word lists as a function of whether the person initially studies either four lists or eight lists. Help the investigator in the planning stages of this project by constructing the two needed Latin squares, a 4 x 4 and an 8 x 8, using the procedure outlined in Table 6.2.

Exercise 6.3: Random Assignment and Matching

A researcher investigates the effectiveness of an experimental weight-loss program.
Sixteen volunteers will participate, half assigned to the experimental program and half placed in a control group. In a study such as this, it would be good if the average weights of the subjects in the two groups were approximately equal at the start of the experiment. Here are the weights, in pounds, for the 16 subjects before the study begins.

First, use a matching procedure as the method to form the two groups (experimental and control), and then calculate the average weight per group. Second, assign participants to the groups again, this time using random assignment (cut out 16 small pieces of paper, write one of the weights on each, then draw them out of a hat to form the two groups). Again, calculate the average weight per group after the random assignment has occurred. Compare your results to those of the rest of the class: are the average weights for the groups closer to each other with matching or with random assignment? In a situation such as this, what do you conclude about the relative merits of matching and random assignment?

Answers to the Self Tests:

✓ 6.1.
1. There is a minimum of two separate groups of subjects tested in the study, one group for each level of the IV; the problem of equivalent groups.
2. Sal must have a reason to expect verbal fluency to correlate with his dependent variable; he must also have a good way to measure verbal fluency.

✓ 6.2.
1. Each subject participates in each level of the IV; sequence effects.
2. With 6 levels of the IV, complete counterbalancing requires a minimum of 720 subjects (6 x 5 x 4 x 3 x 2 x 1), which could be impractical.
3. Reverse counterbalancing, or block randomization.

✓ 6.3.
1. Attrition.
2. If the experimenter does not know which subjects are in each of the groups in the study, the experimenter cannot behave in a way that reflects bias.
3. If subjects know what is expected of them, they might be "good subjects" and not behave naturally.
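The counterbalancing arithmetic in the answers above can be checked with a short Python sketch. The helper names `n_complete_orders` and `reverse_counterbalance` are hypothetical, chosen only for this illustration, and the four condition labels are borrowed from the wine-tasting review question; nothing here is a procedure given in the text.

```python
from itertools import permutations
from math import factorial

def n_complete_orders(n_conditions):
    """Number of sequences needed for complete counterbalancing (n!)."""
    return factorial(n_conditions)

def reverse_counterbalance(conditions):
    """One pass through the conditions followed by the same pass reversed."""
    conditions = list(conditions)
    return conditions + conditions[::-1]

labels = ["A", "B", "C", "D"]  # assumed labels, as in the wine-tasting question

# Six levels of the IV: 6 x 5 x 4 x 3 x 2 x 1 sequences.
print(n_complete_orders(6))              # 720

# Enumerating every ordering of four conditions gives the same count as 4!.
all_orders = list(permutations(labels))
print(len(all_orders) == n_complete_orders(len(labels)))  # True

# Reverse counterbalancing presents the conditions forward, then backward.
print(reverse_counterbalance(labels))
```

The factorial grows quickly, which is why complete counterbalancing becomes impractical beyond a handful of conditions and partial methods such as a Latin square are used instead.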