Control Problems
in Experimental Research
Preview & Chapter Objectives
In Chapter 5 you learned the essentials of the experimental method: manipulating an
independent variable, controlling everything else, and measuring the dependent
variable. In this chapter we will begin by examining two general types of experimental
design, one in which different groups of participants contribute data for
different levels of the independent variable (between-subjects design) and one in
which the same participants contribute data to all the levels of the independent variable
(within-subjects design). As you are about to learn, there are special advantages
associated with each approach, but there are also problems that have to be carefully
controlled: the problem of equivalent groups for between-subjects designs, and
problems of sequence for within-subjects designs. The last third of the chapter addresses
the issue of bias and the ways of controlling it. When you finish this chapter,
you should be able to:
Chapter 6. Control Problems in Experimental Research
Discriminate between-subjects designs from within-subjects designs.
Understand how random assignment can solve the equivalent groups problem in
between-subjects designs.
Understand when matching should be used instead of random assignment when
attempting to create equivalent groups.
Distinguish between progressive and carryover effects in within-subjects designs,
and understand why counterbalancing normally works better with the former
than with the latter.
Describe the various forms of counterbalancing for situations in which participants
are tested once per condition and more than once per condition.
Describe the specific types of between- and within-subjects designs that occur in
research in developmental psychology, and understand the problems associated
with each.
Describe how experimenter bias can occur and how it can be controlled.
Describe how participant bias can occur and how it can be controlled.
In his landmark experimental psychology text, just after introducing his now famous
distinction between independent and dependent variables, R. S. Woodworth
emphasized the importance of control in experimental research. As he put it,
"[w]hether one or more independent variables are used, it remains essential that
all other conditions be constant. Otherwise you cannot connect the effect observed
with any definite cause. The psychologist must expect to encounter difficulties in
meeting this requirement." (Woodworth, 1938, p. 3). Some of these difficulties we've
already seen. The general problem of confounding and the specific threats to internal
validity discussed in the previous chapter are basically problems of controlling
extraneous factors. In this chapter, we'll look at some other aspects of maintaining
control: the problem of creating equivalent groups in experiments involving separate
groups of participants, the problem of sequence effects in experiments in which
participants are tested several times, and problems resulting from biases held by both
experimenters and research participants.
Recall that any independent variable must have a minimum of two levels. At the
very least, an experiment will compare condition A with condition B. Those who
participate in the study might be placed in level A, level B, or both. If they receive
either A or B but not both, the design is a between-subjects design, so named
because the comparison of levels A and B will be a contrast between two different
groups of individuals. On the other hand, if each participant receives both levels A
and B, you could say that both levels exist within each individual; hence, this design
is called a within-subjects design (or, sometimes, a repeated-measures design).
Let's examine each approach.
Between-Subjects Designs
Between-subjects designs are sometimes used because they must be used. If the independent
variable is a subject variable, for instance, there is usually no choice.
A study comparing introverts with extroverts requires two different groups of
people. Unless the researcher could round up some multiple personalities, introverted
in one personality and extroverted in another, there is no alternative but
to compare two different groups. One of the few times a subject variable won't
be a between-subject variable is when behaviors occurring at two different ages
are being compared, and the same persons are studied at two different times in
their lives. Another possibility is when marital status is the subject variable, and
the same people are studied before and after a marriage or a divorce. Most of the
time, however, using a subject variable means that a between-subjects design will be
used.
Using a between-subjects design is unavoidable in some studies that use certain
manipulated independent variables. That is, it is sometimes the case that when
people participate in one level of an independent variable, the experience gained
there will make it impossible for them to participate in other levels. This often
happens in social psychological research and most research involving deception.
Consider an experiment on the effects of the physical attractiveness of a defendant
on recommended sentence length by Sigall and Ostrove (1975). They gave college
students descriptions of a crime and asked them to recommend a jail sentence for
the woman convicted of it. There were two separate between-subjects manipulated
independent variables. One was the type of crime: either a burglary in which
"Barbara Helm" broke into a neighbor's apartment and stole $2,200 (a fair amount of
money in 1975), or a swindle in which Barbara "ingratiated herself to a middle-aged
bachelor and induced him to invest $2,200 in a nonexistent corporation" (Sigall &
Ostrove, 1975, p. 412). The other manipulated variable was Barbara's attractiveness.
Some participants saw a photo of her in which she was very attractive, others saw
a photo of an unattractive Barbara (the same woman posed for both photos), and a
control group did not see any photo. The interesting result was that when the crime
was burglary, attractiveness paid. Attractive Barbara got a lighter sentence on average
(2.8 years) than unattractive (5.2) or control (5.1) Barbara. However, the opposite
happened when the crime was swindling. Apparently thinking that Barbara was
using her good looks to commit the crime, participants gave attractive Barbara a
harsher sentence (5.5 years) than they gave the unattractive (4.4) or control (4.4)
woman.
You can see why it was necessary to run this study with between-subjects variables.
For those participating in the Attractive-Barbara-Swindle condition, for example,
the experience would certainly affect them and make it impossible for them to
"start fresh" in, say, the Unattractive-Barbara-Burglary condition. In some studies,
participating in one condition makes it impossible for the same person to be in a
second condition. Sometimes, it is essential that each condition include uninformed
participants.
While the advantage of a between-subjects design is that each participant enters
the study fresh, and naive with respect to the procedures to be tested, the prime
disadvantage is that large numbers of people may need to be recruited, tested, and
debriefed. Hence, the researcher invests a great deal of energy in this type of design.
My doctoral dissertation on memory involved five different experiments requiring
between-subjects factors; more than 600 students trudged in and out of my lab
before the project was finished!
The Problem of Creating Equivalent Groups
Another disadvantage of between-subjects designs is that differences between
the conditions could be due to the independent variables, but they might also be
due to differences between the two groups. To deal with this potential confound,
deliberate steps must be taken to create what are called equivalent groups. These
groups are equal to each other in every important way except for the levels of the
independent variable. The number of equivalent groups in a between-subjects study
corresponds exactly to the number of different conditions in the study, with one
group of participants tested in each condition.
There are two common techniques for creating equivalent groups in a between-subjects
experiment. The ideal approach is to use random assignment. A second
strategy is to use matching.
Random Assignment
First, be sure you understand that random assignment and random selection are
not the same. Random selection, to be described in Chapter 12 (pp. xx), is a
procedure for getting volunteers to come into your study. As you will learn, it
is a process designed to produce a sample of individuals who reflect the broader
population, and it is a common strategy in research using surveys. Random assignment
is a method for placing participants, once selected for a study, into the
different groups. When random assignment is used, every person volunteering
for the study has an equal chance of being placed in any of the groups being
formed.
The goal of random assignment is to take individual difference factors that could
influence the study and spread them evenly throughout the different groups. For
instance, suppose you're comparing two presentation rates in a simple memory
study. Further suppose that anxious participants won't do as well on your memory
task as nonanxious participants, but you as the researcher are unaware of that fact.
Some subjects are shown a word list at a rate of 2 seconds per word; others at
4 seconds per word. The prediction is that recall will be better for the 4-second
group. Here are some hypothetical data that such a study might produce. Each
number refers to the number of words recalled out of a list of 30. After each subject
number, I've placed an "A" or an "R" in parentheses as a way of telling you which
participants are anxious and which are relaxed. Data for the anxious people are
shaded.
If you look carefully at these data, you'll see that the three anxious participants
in each group did worse than their five relaxed peers. Because there are an equal
number of anxious participants in each group, however, the dampening effect of
anxiety on recall is about the same for both groups. Thus, the main comparison of
interest, the difference in presentation rates, is preserved: an average of 15 words
for the 2-second group and 19 for the 4-second group.
Random assignment won't guarantee placing an equal number of anxious participants
in each group, but in general the procedure has the effect of spreading potential
confounds evenly among the different groups. This is especially true when large
numbers of individuals are being assigned to each group. In fact, the greater the number
of participants involved, the greater the chance that random assignment will work
to create equivalent groups of them. If groups are equivalent and if all else is adequately
controlled, then you are in that enviable position of being able to say that your
independent variable was responsible if you find differences between your groups.
You might think the actual process of random assignment would be fairly simple: just
use a table of random numbers to assign each arriving participant to a group or, in
the case of a two-group study, flip a coin. Unfortunately, however, the result of such a
procedure is that your groups will almost certainly contain different numbers of people.
In the worst-case scenario, imagine you are doing a study using 20 participants divided
into two groups of 10. You decide to flip a coin as each volunteer arrives: heads,
they're in group A; tails, group B. But what if the coin comes up heads all 20 times?
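To put rough numbers on the coin-flip problem, here is a minimal sketch that assumes a perfectly fair coin and 20 independent flips. The function names are mine, chosen for illustration. It computes the chance of ending up with exactly 10 people per group and the chance of the all-heads worst case.

```python
from math import comb


def prob_equal_split(n_participants: int) -> float:
    """Chance that fair coin flips put exactly half the people in each group."""
    if n_participants % 2:
        return 0.0  # an odd number can never split evenly
    return comb(n_participants, n_participants // 2) / 2 ** n_participants


def prob_all_heads(n_participants: int) -> float:
    """Chance that every single flip comes up heads."""
    return 0.5 ** n_participants


p_equal = prob_equal_split(20)   # 184756 / 1048576, a bit under 18%
p_worst = prob_all_heads(20)     # 1 / 1048576, under one in a million
print(f"P(10 and 10) = {p_equal:.3f}")
print(f"P(20 heads)  = {p_worst:.2e}")
```

Under these assumptions the all-heads outcome is vanishingly rare, but an exactly even split is also the exception rather than the rule.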
To complete a random assignment of participants to conditions in a way that
guarantees an equal number of participants per group, a researcher can use block
randomization, a procedure ensuring that each condition of the study has a participant
randomly assigned to it before any condition is repeated a second time.
Each "block" contains all of the conditions of the study in a randomized order.
This can be done by hand, using a table of random numbers, but in actual practice
researchers typically rely on a simple computer program to generate a sequence of
conditions meeting the requirements of block randomization; you can find one at
http://www.randomizer.org/.
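If you are curious what such a program might look like, the following is a minimal sketch of block randomization, not the code behind any particular website. The function name, the condition labels, and the seed are all assumptions made for illustration.

```python
import random


def block_randomize(conditions, n_blocks, seed=None):
    """Return an assignment sequence built from randomized blocks.

    Each block contains every condition exactly once, in a shuffled order,
    so no condition is repeated until all conditions have been used.
    """
    rng = random.Random(seed)  # a seed makes the schedule reproducible
    sequence = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence


# Two conditions and 10 blocks: 20 arriving participants, 10 per group.
schedule = block_randomize(["A", "B"], n_blocks=10, seed=1)
print(schedule)
print(schedule.count("A"), schedule.count("B"))
```

The first arriving participant gets the first condition in the list, the second gets the next, and so on. Because every block is complete, the group sizes come out equal whenever the study stops at the end of a block.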
When only a small number of subjects are available for your experiment, random
assignment can sometimes fail to create equivalent groups. The following example
shows you how this might happen. Let's take the same study of the effect of presentation
rate on memory, used earlier, and assume that the data you just examined reflect
an outcome in which random assignment happened to work. That is, there was an
exact balance of five relaxed and three anxious people in each group. However, it is
possible that random assignment could place all six of the anxious participants in one
of the groups. This is unlikely, but it could occur (just as it's remotely possible for a
perfectly fair coin to come up heads 10 times in a row). If it did, this might happen:¹
This outcome, of course, is totally different from the first example. Instead of
concluding that recall was better for a slower presentation rate (as in the earlier
example), the researcher in this case could not reject the null hypothesis (17 = 17)
and would wonder what happened. After all, participants were randomly assigned,
and the researcher's prediction about better recall for a slower presentation rate
certainly makes sense. So what went wrong?
What happened was that random assignment inadvertently created two decidedly
nonequivalent groups: one made up entirely of relaxed people and one mostly
including anxious folks. A 4-second rate probably does produce better recall, but
the true difference was wiped out in this study because the mean for the 2-second
group was inflated by the relatively high scores of the relaxed participants and the
4-second group's mean was suppressed because of the anxiety effect. Another way
of saying this is that the failure of random assignment to create equivalent groups
probably led to a Type II error (presentation rate really does affect recall; this study
just failed to find the effect). To repeat what was mentioned earlier, the chance of
random assignment working to create equivalent groups increases as sample size
increases.
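How unlikely is the lopsided outcome? As a rough check, the sketch below assumes 16 participants (6 anxious, 10 relaxed) split at random into two groups of 8, as in the memory example, and counts the possible splits directly. The function name and its parameters are illustrative.

```python
from math import comb


def prob_k_anxious_in_first_group(k, n_anxious=6, n_relaxed=10, group_size=8):
    """Chance that exactly k anxious people land in the first group
    when group_size people are drawn at random from everyone."""
    total = n_anxious + n_relaxed
    ways = comb(n_anxious, k) * comb(n_relaxed, group_size - k)
    return ways / comb(total, group_size)


# Perfect balance: three anxious participants in each group.
p_balanced = prob_k_anxious_in_first_group(3)
# All six anxious participants in the same group (either one).
p_lopsided = prob_k_anxious_in_first_group(6) + prob_k_anxious_in_first_group(0)

print(f"P(3 and 3)          = {p_balanced:.3f}")
print(f"P(all 6 in a group) = {p_lopsided:.4f}")
```

With these numbers the perfectly balanced split occurs about 39% of the time and the completely lopsided one well under 1% of the time. Most random assignments fall somewhere in between.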
[Table: hypothetical recall scores by participant for the 2-second and 4-second rate groups when random assignment fails; only fragments of the first row survive (15 at the 2-second rate; S9 (R), 23 at the 4-second rate).]
¹This same pattern of results could occur if an experimenter failed to randomly assign and naively tested the
first eight people to sign up in the 2-second rate group and the next eight people in the other group. It is
conceivable that the more anxious students would delay volunteering to participate, increasing the chances
of their being placed in the 4-second group.
To deal with the problem of equivalent groups in a situation such as this, a
matching procedure could be used. In matching, participants are grouped together
on some trait such as anxiety level, and then distributed randomly to the different
groups in the experiment. In the memory study, "anxiety level" would be called a
matching variable. Individuals in the memory experiment would be given some
reliable and valid measure of anxiety, those with similar scores would be paired
together, and one person in each pair would be randomly placed in the group
getting the 2-second rate and the other would be put into the group with the
4-second rate. As an illustration of exactly how to accomplish matching in a two-group
experiment, you should work through the example in Table 6.1.
Matching sometimes is used when the number (N) of participants is small, and
random assignment is therefore risky and might yield nonequivalent groups. In order
to undertake matching, however, two important conditions must be met. First,
you must have good reason to believe that the matching variable will have a predictable
effect on the outcome of the study. That is, you must be confident that
the matching variable is correlated with the dependent variable. This was the case
in our hypothetical memory study: anxiety clearly reduced recall. When there is
a high correlation between the matching variable and the dependent variable, the
statistical techniques for evaluating matched-groups designs are sensitive to differences
between the groups. On the other hand, if matching is done when there is
a low correlation between the matching variable and the dependent variable, the
chances of finding a true difference between the groups decline. So it is important
to be careful when picking matching variables.
A second important condition for matching is that there must be some reasonable
way of measuring or identifying participants on the matching variable. In
some studies, participants must be tested on the matching variable first, then assigned
to groups, and then put through the experimental procedure. Depending
on the circumstances, this might require bringing participants into the lab on two
separate occasions, which can create logistical problems. Also, the initial testing on
the matching variable might give participants an indication of the study's purpose,
thereby introducing bias into the study. The simplest matching situations occur when
the matching variables are constructs that can be determined without directly testing
the participants (e.g., Grade Point Average scores or IQ from school records), or by
matching on the dependent variable itself. That is, in a memory study, participants
could be given an initial memory test, then matched on their performance, and then
assigned to 2-second and 4-second groups. Their preexisting memory ability would
thereby be under control and the differences in performance could be attributed to
the presentation rate.
One practical difficulty with matching concerns the number of matching variables
to use. In a memory study, should I match the groups for anxiety level? What about
intelligence level? What about education level? You can see that some judgment is
required here, for matching is difficult to accomplish with more than one matching
variable, and often results in having to eliminate participants because close matches
sometimes cannot be made. The problem of deciding on and measuring matching
variables is one reason why research psychologists generally prefer to make the
effort to recruit enough volunteers to use random assignment, even when they
might suspect that some extraneous variable correlates with the dependent variable.
In memory research, for instance, researchers are seldom concerned about anxiety
levels, intelligence, or education level. They simply make the groups large enough
and assume that random assignment will distribute these potentially confounding
factors evenly throughout the conditions of the study.
TABLE 6.1 How to Use a Matching Procedure
In a study on problem solving requiring two different groups, a researcher is concerned that a
participant's academic skills may correlate highly with performance on the problems to be
used in the experiment. The participants are college students, so the researcher decides to
match the two groups on grade point average (GPA). That is, deliberate steps will be taken to
ensure that the two groups are equivalent to each other in academic ability, as reflected in
their average GPAs. Here's how it is done:
Step 1. Get a score for each person on the matching variable. That's easy in this case because
it simply means retrieving GPA data from the Registrar (with the students' consent
of course). In other cases of matching, the matching variable must be determined
by pretesting participants on the variable; this can mean bringing participants to
the lab twice, which can be inconvenient (another reason why researchers like random
assignment).
Suppose there will be 10 volunteers (Ss) in the study, 5 per group. Here are their
GPAs:
S1: 3.24 S6: 2.45
S2: 3.91 S7: 3.85
S3: 2.71 S8: 3.12
S4: 2.05 S9: 2.91
S5: 2.62 S10: 2.21
Step 2. Arrange the GPAs in ascending order:
S4: 2.05 S9: 2.91
S10: 2.21 S8: 3.12
S6: 2.45 S1: 3.24
S5: 2.62 S7: 3.85
S3: 2.71 S2: 3.91
Step 3. Create five pairs of scores, with each pair consisting of
quantitatively adjacent GPA scores.
Pair 1: 2.05 and 2.21
Pair 2: 2.45 and 2.62
Pair 3: 2.71 and 2.91
Pair 4: 3.12 and 3.24
Pair 5: 3.85 and 3.91
Step 4. For each pair, randomly assign one participant to Group 1 and the other to Group 2.
Here's one possible outcome:
Group 1 Group 2
2.05 2.21
2.62 2.45
2.91 2.71
3.12 3.24
3.85 3.91
mean GPA: 2.91 2.90
Now the study can proceed with some assurance that the two groups will be equivalent to
each other (2.91 is virtually the same as 2.90) in terms of academic ability.
Note. If more than two groups are being tested, the matching procedure is the same up to and
including step 2. In step 3, instead of creating pairs of scores, the researcher creates clusters equal in size to
the number of groups needed. Then in step 4, the participants in each cluster are randomly assigned
to the multiple groups.
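The four steps in Table 6.1 can also be sketched in a few lines of code. This is one possible implementation, not a standard routine: the function names and the seed are assumptions, and the GPAs are the ten values from the table. Because step 4 is random, a given run need not reproduce the particular outcome shown in the table.

```python
import random

# Step 1: a score on the matching variable for each participant (Table 6.1).
gpas = {"S1": 3.24, "S2": 3.91, "S3": 2.71, "S4": 2.05, "S5": 2.62,
        "S6": 2.45, "S7": 3.85, "S8": 3.12, "S9": 2.91, "S10": 2.21}


def matched_groups(scores, n_groups=2, seed=None):
    """Form n_groups groups that are matched on the given scores."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)          # Step 2: ascending order
    if len(ranked) % n_groups:
        raise ValueError("participants must divide evenly into clusters")
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):    # Step 3: adjacent clusters
        cluster = ranked[start:start + n_groups]
        rng.shuffle(cluster)                         # Step 4: random assignment
        for group, participant in zip(groups, cluster):
            group.append(participant)
    return groups


def mean_gpa(group):
    """Average GPA of the participants in one group."""
    return sum(gpas[s] for s in group) / len(group)


group_1, group_2 = matched_groups(gpas, seed=0)
print(group_1, round(mean_gpa(group_1), 2))
print(group_2, round(mean_gpa(group_2), 2))
```

Setting `n_groups` to 3 or more handles the case described in the note, as long as the number of participants divides evenly into clusters.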
Self Test 6.1
1. What is the defining feature of a between-subjects design? What is the main
control problem that must be solved with this type of design?
2. Sal wishes to see if the type of font used when printing a document will influence
comprehension of the material in the document. He thinks about matching on
"verbal fluency." What two conditions must be in effect before this matching can
occur?
Within-Subjects Designs
As mentioned at the start of the chapter, each participant is exposed to each level of
the independent variable in a within-subjects design. Because everyone in this type
of study is measured several times, you will sometimes see this procedure described
as a repeated-measures design. One practical advantage of this design should be
obvious: fewer people need to be recruited. If you have a study comparing two
conditions and you want to test 20 people in condition 1, you'll need to recruit 40
people for a between-subjects study, but only 20 for a within-subjects study.
Within-subjects designs are sometimes the only reasonable choice. In experiments
in such areas as physiological psychology and sensation and perception, comparisons
often are made between conditions that require just a brief amount of time to test
but might demand extensive preparation. For example, a perceptual study using the
Müller-Lyer illusion might vary the orientations of the lines to see if the illusion is
especially strong when presented vertically (see Figure 6.1). The task might involve
showing the illusion on a computer screen and asking the participant to press a key
that changes the length of one of the lines. Participants are told to adjust the line
until both lines seem to be the same length. Any one trial might take no more
than 5 seconds, so it would be absurd to make the "illusion orientation" variable
a between-subjects factor and use someone for a fraction of a minute. Instead, it
makes more sense to make the orientation variable a within-subjects factor and give
each participant a sequence of trials to cover all levels of the variable (and probably
duplicate each level several times). And unlike the attractive/unattractive Barbara
Helm study, serving in one condition would not make it impossible to serve in
another.
FIGURE 6.1 Set of four Müller-Lyer illusions: (a) horizontal, (b) 45° left, (c) 45° right, (d) vertical.
One of psychology's oldest areas of research is in psychophysics, the study of
sensory thresholds (e.g., a modern application is a hearing test). In a typical psychophysics
study, subjects are asked to judge whether or not they can detect some
stimulus or whether two stimuli are equal or different. Each situation requires a
large number of trials and comparisons to be made within the same individual.
Hence, psychophysics studies typically use just a few participants and measure them
repeatedly. Research Example 5, which you will soon encounter, uses this strategy.
A within-subjects design might also be necessary when volunteers are scarce
because the entire population of interest is small. Studying astronauts or people with
special expertise (e.g., world-class chess players) are just two examples. Of course,
there are times when, even with a limited population, the design may require a
between-subjects manipulation. Evaluating the effects of a new form of therapy for
those suffering from a rare form of psychopathology requires comparing those in
therapy with others in a control group not being treated.
Besides convenience, another advantage of within-subjects designs is that they
eliminate the equivalent groups problem that occurs with between-subjects designs.
Recall from Chapter 4 that an inferential statistical analysis comparing two
groups weighs the variability between experimental conditions against the variability
within each condition. Variability between conditions could be due to (a)
the independent variable, (b) other systematic variance resulting from confounding,
and/or (c) nonsystematic error variance. Even with random assignment, a significant
portion of the error variance in a between-subjects design results from individual
differences between subjects in the different groups. But in a within-subjects design,
any between-condition individual difference variance disappears. Let's look at
a concrete example.
Suppose you are comparing two golf balls for distance. You recruit 10 professional
golfers and randomly assign them to two groups of 5. After warming up, each
golfer hits one ball or the other. Here are the results:
Pros in the                  Pros in the
First Group   Golf Ball 1    Second Group   Golf Ball 2
Pro 1         255            Pro 6          269
Pro 2         261            Pro 7          266
Pro 3         248            Pro 8          260
Pro 4         256            Pro 9          273
Pro 5         245            Pro 10         257
M             253.00         M              265.00
SD            6.44           SD             6.52
There are several things to note here. First, there is some variability within each
group, as reflected in the standard deviation for each group. This is error variance
due to individual differences within each group and to other random factors.
Second, there is apparently an overall difference between the groups. The pros in
the second group hit their ball farther than the pros in the first group. Why? Three
possibilities:
a. Chance; perhaps this is not a statistically significant difference, and even if it is,
there's a 5% chance that it is a Type I error if the null hypothesis is actually true.
b. The golf ball; perhaps the brand of golf ball hit by the second group simply goes
farther (this, of course, is the research hypothesis).
c. Individual differences; maybe the golfers in the second group are stronger or
more skilled than those in the first group.
The chances that the third possibility is a major problem are reduced by the procedures
for creating equivalent groups described earlier. Using random assignment
or matching allows you to be reasonably sure that the second group of golfers is
approximately equal to the first group in ability, strength, and so on. Despite that,
however, it is still possible that some of the difference between these groups can be
traced back to the individual differences between the two groups. This problem
simply does not occur in a within-subjects design. Suppose you repeated the study
but used just the first five golfers, and each pro hits ball 1, and then ball 2. Now the
table looks like this:
Pros in the
First Group   Golf Ball 1    Golf Ball 2
Pro 1         255            269
Pro 2         261            266
Pro 3         248            260
Of the three possible explanations for the differences in the first set of data, explanation
c (individual differences) can be eliminated for the second set. In the first set, the difference in
the first row between the 255 and the 269 could be due to chance, the difference
between the balls, or individual differences between pro 1 and pro 6. In the second
set, there is no second group of golfers, so the third possibility is gone. Thus, in a
within-subjects design, individual differences are eliminated from the estimate of the
amount of variability between conditions. Statistically, this means that, in a within-subjects
design, an inferential analysis will be more sensitive to small differences
between means than will be the case for a between-subjects design.
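To make the sensitivity point concrete, here is a rough sketch that runs the same ten distances through two t statistics: one treating them as two independent groups of five, and one treating them as five golfers who each hit both balls. The pairing of each ball 1 distance with the ball 2 distance in the same row is an assumption made purely for illustration, and the function names are mine.

```python
from math import sqrt
from statistics import mean, stdev

ball_1 = [255, 261, 248, 256, 245]
ball_2 = [269, 266, 260, 273, 257]


def independent_t(x, y):
    """t for two separate groups (between-subjects analysis)."""
    standard_error = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(y) - mean(x)) / standard_error


def paired_t(x, y):
    """t for the same people measured twice (within-subjects analysis).

    Assumes x[i] and y[i] come from the same golfer.
    """
    differences = [b - a for a, b in zip(x, y)]
    return mean(differences) / (stdev(differences) / sqrt(len(differences)))


t_between = independent_t(ball_1, ball_2)
t_within = paired_t(ball_1, ball_2)
print(f"between-subjects t = {t_between:.2f}")
print(f"within-subjects t  = {t_within:.2f}")
```

The mean difference is 12 yards in both analyses. The paired version divides it by a smaller error term because stable differences among golfers drop out of the difference scores, so its t comes out roughly twice as large under this assumed pairing.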
But wait. Are you completely satisfied that in the second case the differences
between the first set of scores and the second set could be due only to (a) chance
factors and/or (b) the superiority of the second ball? Are you thinking that perhaps
pro 1 actually changed in some way between hitting ball 1 and hitting ball 2?
Although it's unlikely that the golfer will add 20 pounds of muscle between swings,
what if some kind of practice or warm-up effect was operating? Or perhaps the pro
detected a slight malfunction in his swing at ball 1 and corrected it for ball 2. Or
perhaps the wind changed. In short, with a within-subjects design, a major problem
is that once a participant has completed the first part of a study the experience or
altered circumstances could influence performance in later parts of the study. The
problem is referred to as a sequence or order effect, and it can operate in several
ways.
First, trial 1 might affect the participant in some way so that performance on trial
2 is steadily improved, as in the example of a practice effect. On the other hand,
sometimes repeated trials produce gradual fatigue or boredom, and performance
steadily declines from trial to trial. These two effects can both be referred to as
progressive effects because it is assumed that performance changes steadily (progressively)
from trial to trial. Also, some particular sequences might produce effects
that are different from those of other sequences, what could be called a carryover
effect. Thus, in a study with two basic conditions, experiencing condition A before
condition B might affect the person much differently than experiencing B before A.
For example, suppose you were studying the effects of noise on a problem-solving
task using a within-subjects design. Let's say that participants will be trying to solve
anagram problems (rearrange letters to form words) under some time pressure. In
condition A, they have to solve the anagrams while distracting noises come from
the next room, and these noises are presented randomly and therefore are unpredictable.
In condition B, the same total amount of noise occurs; however, it is not
randomly presented but instead occurs in predictable patterns. If you put the people
in condition A first (unpredictable noise), and then in B (predictable noise), they will
probably do poorly in A (most people do). This poor performance might discourage
them and carry over to condition B. They should do better in B, but as soon as the
noise begins, they might say to themselves, "Here we go again," and perhaps not
try as hard. On the other hand, if you run condition B first, with the predictable
noise, your subjects might do reasonably well (most people do), and some of the
confidence might carry over to the second part of the study. When they then encounter
condition A, they might do better than you would ordinarily expect. Thus,
performance in condition A might be much worse in the sequence A-B than in
the sequence B-A, and a similar problem would occur for condition B. In short,
the sequence in which the conditions are presented, independently of any practice
or fatigue effects, might influence the study's outcome. In studies where carryover
effects might be suspected, researchers often switch to a between-subjects design.
Indeed, studies comparing predictable and unpredictable noise typically put people
in two different groups.
The Problem of Controlling Sequence Effects
The normal way to control sequence effects in a within-subjects design is to use more
than one sequence, a strategy known as counterbalancing. As I will elaborate later,
the procedure works better for progressive effects than for carryover effects. There
are two general categories of counterbalancing, depending on whether participants
are tested in each experimental conditionjust one time or are tested more than once
per condition.
Testing Once per Condition
In some experiments, participants will be tested in each of the conditions but tested
only once per condition. Consider, for example, an interesting study by Reynolds
(1992) on the ability of chess players to recognize the level of expertise in other
chess players. He recruited 15 chess players with different degrees of expertise from
various clubs in New York City and asked them to look at six different chess games
that were said to be in progress (i.e., about 20 moves into the game). On each trial,
the players examined the board of an in-progress game (they were told to assume
that the pair of players of each game were of equal ability) and estimated the skill level
of the players according to a standard rating system. The games were deliberately
set up to reflect different levels of player expertise. Reynolds found that the more
highly skilled of the 15 chess players made more accurate estimates of the ability
reflected in the board setups they examined than did the less skilled players.
You'll recognize the design of the Reynolds study as including a within-subjects
variable. Each of the 15 participants examined all six games. Also, you can see that
it made sense for each game to be evaluated just one time by each player. Hence,
Reynolds was faced with the question of how to control for any sequence effects
that might be present. He certainly didn't want all 15 participants to see the six
games in exactly the same order. How might he have proceeded?
Complete Counterbalancing
Whenever participants are tested once per condition in a within-subjects design,
one solution to the sequence problem is to use complete counterbalancing. This
means that every possible sequence will be used at least once. The total number of
sequences needed can be determined by calculating X!, where X is the number of
conditions, and "!" stands for the mathematical calculation of a "factorial." For example,
if a study has three conditions, there are six possible sequences that can be used
(3! = 3 x 2 x 1 = 6). The six sequences in a study with conditions A, B, and C would be
A B C B A C
A C B C A B
B C A C B A
The problem with complete counterbalancing is that as the number of levels of the
independent variable increases, the possible sequences that will be needed increase
exponentially. There are 6 sequences needed for three conditions, but simply adding
a fourth condition creates a need for 24 sequences (4 x 3 x 2 x 1). As you can guess,
complete counterbalancing was not possible in Reynolds' study unless he recruited
many more than 15 chess players. In fact, with six different games (i.e., conditions),
he would need to find 6! or 720 players to cover all of the possible sequences. Clearly,
Reynolds used a different strategy.
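To make the arithmetic concrete, here is a minimal Python sketch that lists every sequence for a set of conditions and shows how quickly X! grows. The function name complete_counterbalancing is simply a label chosen for this illustration; it does not come from Reynolds or any other source.

```python
from itertools import permutations
from math import factorial


def complete_counterbalancing(conditions):
    """Return every possible order of the conditions as a list of strings."""
    return [" ".join(order) for order in permutations(conditions)]


# Three conditions give the six sequences listed in the text.
three_conditions = complete_counterbalancing("ABC")
for sequence in three_conditions:
    print(sequence)

# The number of sequences needed (X!) for 3 to 6 conditions.
for x in range(3, 7):
    print(x, "conditions need", factorial(x), "sequences")
```

With six conditions the list holds 720 sequences, which is why a sample of 15 participants cannot cover them all.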
Partial Counterbalancing
Whenever a subset of the total number of sequences is used, the result is called partial
counterbalancing. This was Reynolds's solution; he simply took a random
sample of the 720 possible sequences by ensuring that "the order of presentation
[was] randomized for each subject" (Reynolds, 1992, p. 411). Sampling from the
population of sequences is a common strategy whenever there are fewer participants
available than possible sequences or when there are a fairly large number of
conditions.
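One way to picture this kind of sampling is to shuffle the conditions independently for each participant. The sketch below is only an illustration under assumed details: the function name random_orders and the seed value are hypothetical, and nothing here is meant to reproduce the exact procedure Reynolds followed.

```python
import random


def random_orders(conditions, n_participants, seed=None):
    """Give each participant an independently shuffled order of the conditions."""
    rng = random.Random(seed)  # a seed makes the orders reproducible
    orders = []
    for _ in range(n_participants):
        order = list(conditions)
        rng.shuffle(order)  # each shuffle samples one of the possible sequences
        orders.append(order)
    return orders


# Six games (A through F) and 15 participants, as in the Reynolds example.
orders = random_orders("ABCDEF", 15, seed=1)
for participant, order in enumerate(orders, start=1):
    print(participant, "-".join(order))
```

Every participant still sees all six conditions exactly once; only the order varies from person to person.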
Reynolds sampled from the total number of sequences, but he could have chosen
another approach that is used sometimes: the balanced Latin square. This device
gets its name from an ancient Roman puzzle about arranging Latin letters in a matrix
so that each letter appears only once in each row and each column (Kirk, 1968). The
Latin square strategy is more sophisticated than simply choosing a random subset
of the whole. With a perfectly balanced Latin square, you are assured that (a) every
condition of the study occurs equally often in every sequential position, and (b) every
condition precedes and follows every other condition exactly once. Work through
Table 6.2 to see how to construct the following 6 x 6 Latin square. Think of each
letter as one of the six games inspected by Reynolds's chess players.
A B F C E D
B C A D F E
C D B E A F
D E C F B A
E F D A C B
F A E B D C
Track condition A (chess game A) through the square to see how it meets
the two requirements listed in the preceding paragraph. First, condition A occurs
in each of the six sequential positions (first in the first row, third in the second row,
etc.). Second, A is followed by each of the other letters exactly one time. From the
top row to the bottom, (1) A is followed by B, D, F, nothing, C, and E, and (2) A is
preceded by nothing, C, E, B, D, and F. The same is true for each of the other letters.
To use the 6 x 6 Latin square, one randomly assigns each of the six conditions of
the experiment (six different chess games for Reynolds) to one of the six letters, A
through F.
When using Latin squares, it is necessary for the number of participants to be equal
to or be a multiple of the number of rows in the square. The fact that Reynolds had
15 participants in his study tells you that he didn't use a Latin square. If he had added
three more chess players, giving him an N of 18, he could have randomly assigned
three players to each of the six rows of the square (3 x 6 = 18).
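The construction of a balanced square can also be sketched in a few lines of Python. This is a minimal illustration for an even number of conditions, assuming the first row follows the pattern first, second, last, third, next-to-last, and so on, with each later row shifting every letter one step through the alphabet. The function names balanced_latin_square and is_balanced are labels invented for this example.

```python
def balanced_latin_square(n):
    """Build an n x n balanced Latin square of letters (n must be even)."""
    if n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row as positions: 0, 1, n-1, 2, n-2, 3, ... (A, B, last, C, ...).
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Each later row moves every letter one step forward, wrapping to A.
    rows = [[(position + shift) % n for position in first] for shift in range(n)]
    return [[chr(ord("A") + position) for position in row] for row in rows]


def is_balanced(square):
    """Check both requirements: positions are balanced and each ordered pair occurs once."""
    n = len(square)
    letters = set(square[0])
    columns_ok = all({row[i] for row in square} == letters for i in range(n))
    pairs = [(row[i], row[i + 1]) for row in square for i in range(n - 1)]
    pairs_ok = len(pairs) == len(set(pairs)) == n * (n - 1)
    return columns_ok and pairs_ok


square = balanced_latin_square(6)
for row in square:
    print(" ".join(row))
print("balanced:", is_balanced(square))
```

For six conditions the sketch reproduces the square shown above, and the check confirms that every letter appears once in each position and that every letter immediately follows every other letter exactly once.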
Testing More Than Once per Condition
In the Reynolds study, it made no sense to ask the chess players to look at any of
the six games more than once. Similarly, if participants in a memory experiment are
asked to study and recall four lists of words, with the order of the lists determined by
a 4 x 4 Latin square, they will seldom be asked to study and recall any particular list
a second time unless the researcher is specifically interested in the effects of repeated
TABLE 6.2 Building a Balanced 6 x 6 Latin Square
In a balanced Latin square, every condition of the study occurs equally often in every sequential
position, and every condition precedes and follows every other condition exactly once.
Here's how to build a 6 x 6 square.
Step 1. Build the first row. It is fixed according to this general rule:
A B "X" C "X - 1" D "X - 2" E "X - 3" F, etc.
where A refers to the first condition of the study and "X" refers to the letter symbolizing
the final condition of the experiment. To build the 6 x 6 square, this first
row would substitute:
X = the sixth letter of the alphabet → F
X - 1 = the fifth letter → E
Therefore, the first row would be
A B F (subbing for "X") C E (subbing for "X - 1") D
Step 2. Build the second row. Directly below each letter of row 1, place in row 2 the letter
that is next in the alphabet. The only exception is the F. Under that letter, return to
the first of the six letters and place the letter A. Thus:
A B F C E D
B C A D F E
Step 3. Build the remaining four rows following the step 2 rule. Thus, the final 6 x 6 square is:
A B F C E D
B C A D F E
C D B E A F
D E C F B A
E F D A C B
F A E B D C
Step 4. Take the six conditions of the study and randomly assign them to the letters A
through F to determine the actual sequence of conditions for each row. Assign an
equal number of participants to each row.
Note. This procedure works whenever there is an even number of conditions. If the number of
conditions is odd, two squares will be needed: one created using the above procedure, and a second
an exact reversal of the square created with the above procedure. For more details, see Winer,
Brown, and Michaels (1994).
trials on memory. However, in many studies it is reasonable, even necessary, for
participants to experience each condition more than one time. This often happens
in research in sensation and perception, for instance. A look back at Figure 6.1
provides an example.
Suppose you were conducting a study in which you wanted to see if participants
would be more affected by the illusion when it was presented vertically than when
shown horizontally or at a 45° angle. Four conditions of the study are assigned to
the letters A-D:
A = horizontal
B = 45° to the left
C = 45° to the right
D = vertical
Participants in the study are shown the illusion on a computer screen and have
to make adjustments to the lengths of the parallel lines until they perceive that the
lines are equal. The four conditions could be presented to people according to one
of two basic procedures.
Reverse Counterbalancing
When using reverse counterbalancing, the experimenter simply presents the
conditions in one order, and then presents them again in the reverse order. In the
illusion case, the order would be A-B-C-D, then D-C-B-A. If the researcher desires
to have the participant perform the task more than twice per condition, and this is
common in perception research, this sequence could be repeated as many times as
necessary. Hence, if you wanted each participant to adjust each of the four illusions
of Figure 6.1 six separate times, and you decided to use reverse counterbalancing,
participants would see the illusions in this sequence:
A-B-C-D-D-C-B-A-A-B-C-D-D-C-B-A-A-B-C-D-D-C-B-A
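The same sequence can be generated mechanically. Below is a minimal Python sketch; the function name reverse_counterbalance is a label chosen for this illustration, and it assumes an even number of trials per condition so that every forward pass is paired with a reverse pass.

```python
def reverse_counterbalance(conditions, trials_per_condition):
    """Present the conditions forward, then in reverse, repeating as needed."""
    if trials_per_condition % 2 != 0:
        raise ValueError("each forward pass needs a matching reverse pass")
    forward = list(conditions)
    one_cycle = forward + forward[::-1]  # A-B-C-D followed by D-C-B-A
    return one_cycle * (trials_per_condition // 2)


# Four illusion orientations, each adjusted six times.
sequence = reverse_counterbalance("ABCD", 6)
print("-".join(sequence))
```

Each condition appears six times, and within every forward-reverse cycle the average position of each condition is the same, which is the point of the procedure.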
Reverse counterbalancing was used in one of psychology's most famous studies,
completed in the 1930s by J. Ridley Stroop. You've probably tried the Stroop task
yourself: when shown color names printed in the wrong colors, you were asked to
name the color rather than read the word. That is, when shown the word "RED"
printed in blue ink, the correct response is "blue," not "red." Stroop's study is a classic
example of a particular type of design described in the next chapter, so you will be
learning more about his work when you encounter Box 7.1 (pp. 239).²
Block Randomization
A second way to present a sequence of conditions when each condition is presented
more than once is to use block randomization, the same procedure outlined earlier
in the context of how to assign participants randomly to groups in a between-subjects
experiment. The basic rule is that every condition occurs once before any condition
is repeated a second time. Within each block, the order of conditions is randomized.
² Although reverse counterbalancing normally occurs when participants are tested more than once per condition,
the principle can also be applied in a within-subjects design in which participants see each condition
only once. Thus, if a within-subjects study has six different conditions, each tested only once per person,
half of the participants could get the sequence A-B-C-D-E-F, while the remaining participants experience
the reverse order (F-E-D-C-B-A).
This strategy eliminates the possibility that participants can predict what is coming
next, a problem that can occur with reverse counterbalancing.
To use the illusions example again (Figure 6.1), participants would encounter all
four conditions in a randomized order, then all four again but in a block with a
new randomized order, and so on for as many blocks of four as needed. A reverse
counterbalancing would look like this:
A-B-C-D-D-C-B-A
A block randomization procedure might produce either of these two sequences
(among others):
B-C-D-A-C-A-D-B or C-A-B-D-A-B-D-C
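Sequences like these are easy to produce by shuffling the conditions separately for each block. The following Python sketch is one way to do it; the function name block_randomize and the seed value are assumptions made for the illustration.

```python
import random


def block_randomize(conditions, n_blocks, seed=None):
    """Shuffle the full set of conditions anew for each block and join the blocks."""
    rng = random.Random(seed)  # a seed makes the sequence reproducible
    sequence = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)  # every condition occurs once before any repeats
        sequence.extend(block)
    return sequence


# Two blocks of the four illusion orientations.
trials = block_randomize("ABCD", 2, seed=0)
print("-".join(trials))
```

Because each block is shuffled independently, a participant cannot work out what is coming next from the order of the preceding block.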
To give you a sense of how block randomization works in an actual within-subjects
experiment employing many trials, consider the following auditory perception study
by Carello, Anderson, and Kunkler-Peck (1998).
Research Example 5-Counterbalancing with Block Randomization
Our ability to localize sound has been known for a long time; under normal circumstances, we
are quite adept at identifying the location from which a sound originates.
What interested Carello and her research team was whether people could identify
something about the physical size of an object simply by hearing it drop on the floor.
She devised the apparatus pictured in Figure 6.2 to examine the question. Participants
heard a wooden dowel hit the floor, and then tried to judge its length. They
made their response by adjusting the distance between the edge of the desk they
were sitting at and a movable vertical surface during a "trial," which was defined
as having the same dowel dropped five times in a row from a given height. During
the five drops, participants were encouraged to move the wall back and forth until
they were comfortable with their decision about the dowel's size. In the first of two
experiments, the within-subjects independent variable was the length of the dowel,
and there were seven levels (30, 45, 60, 75, 90, 105, and 120 cm). Each participant
FIGURE 6.2 The experimental setup for
Carello, Anderson, & Kunkler-Peck (1998).
After hearing a rod drop, participants adjusted
the distance between the edge of their desk and
the vertical surface facing them to match what
they perceived to be the length of the rod.
Self Test 6.2
1. What is the defining feature of a within-subjects design? What is the main control
problem that must be solved with this type of design?
2. If your IV has 6 levels, each tested just once per subject, why are you more likely
to use partial counterbalancing instead of complete counterbalancing?
3. If participants are going to be tested more than one time for each level of the IV,
what two forms of counterbalancing may be used?
Control Problems in Developmental Research
As you have learned, the researcher must weigh several factors when deciding
whether to use a between-subjects design or a within-subjects design. There are
some additional considerations for researchers in developmental psychology, where
two specific varieties of these designs occur. These methods are known as cross-sectional
and longitudinal designs.
You've seen these terms before if you have taken a course in developmental or
child psychology. Research in these areas includes age as the prime variable; after
all, the name of the game in developmental psychology is to discover how we change
as we grow older. A cross-sectional study takes a between-subjects approach. A
cross-sectional study comparing the language performance of 3-, 4-, and 5-year-old
children would use three different groups of children. A longitudinal study, on
the other hand, studies a single group over a period of time; it takes a within-subjects
or repeated-measures approach. The same language study would measure language
behavior in a group of 3-year-olds, and then study these same children when they
turned 4 and 5.
The obvious advantage of the cross-sectional approach to the experiment on
language is time; such a study might take a month to complete. If done as a longitudinal
study, it would take 3 years. However, a potentially serious difficulty with some
cross-sectional studies is a special form of the problem of nonequivalent groups and
involves what are known as cohort effects. A cohort is a group of people born
at about the same time. If you are studying three age groups, they differ not just
simply in chronological age but also in terms of the environments in which they
were raised. The problem is not especially noticeable when comparing 3-, 4-, and
5-year-olds, but what if you're interested in whether intelligence declines with age
and decide to compare groups aged 30, 50, and 70? You might indeed find a decline
with age, but does it mean that intelligence gradually decreases with age, or might the
differences relate to the very different life histories of the three groups? For example,
the 70-year-olds went to school during the Great Depression, the 50-year-olds were
educated during the post-World War II boom, and the 30-year-olds were raised on
TV. These factors could bias the results. Indeed, this outcome has occurred. Early
research on the effects of age on IQ suggested that significant declines occurred,
but these studies were cross-sectional (e.g., Miles, 1933). Subsequent longitudinal
studies revealed a very different pattern (Schaie, 1988). For example, verbal abilities
show very little decline, especially if the person remains verbally active (moral: use
it or lose it).
While cohort effects can plague cross-sectional studies, longitudinal studies also
have problems, most notably with attrition (refer back to Chapter 5, p. 189). If a
large number of participants drop out of the study, the group completing it may be
very different from the group starting it. Referring to the age and IQ example, if
people stay healthy, they may remain more active intellectually than if they are sick
all of the time. If they are chronically ill, they may die before a study is completed,
leaving a group that may be generally more intelligent than the group starting the
study. There are also potential ethical problems in longitudinal studies. As people
develop and mature, they might change their attitudes about their willingness to
participate. Most researchers doing longitudinal research recognize that informed
consent is an ongoing process, not a one-time event. Ethically sensitive researchers
will periodically renew the consent process in long-term studies, perhaps every few
years (Fischman, 2000).
In trying to balance cohort and attrition problems, some researchers use a strategy
that combines cross-sectional with longitudinal studies, a design referred to as a
cohort sequential design. In such a study, a group of subjects will be selected
and retested every few years, and then additional cohorts will be selected every few
years and also retested over time. To take a simple example, suppose you wished to
examine the effects of aging on memory, comparing ages 55, 60, and 65. In the
study's first year, you would recruit a group of 55-year-olds. Then every five years
after that, you would recruit new groups of 55-year-olds, and retest those who had
been recruited earlier. Schematically, the design for a study that began in the year
1960 and lasted for 30 years would look like this (the numbers in the matrix refer
to the age of the subjects at any given testing point):
                    Year of the Study
Cohort #   1960   1965   1970   1975   1980   1985   1990
   1        55     60     65
   2               55     60     65
   3                      55     60     65
   4                             55     60     65
   5                                    55     60     65
So in 1960, you have a group of 55-year-olds that you test. Then in 1965, these same
people (now 60 years old) would be retested, along with a new group of 55-year-olds.
By year 3, you have cohorts for all three age groups. As you can see, combining
the data in each of the diagonals would give you an overall comparison between
those aged 55, 60, and 65. Comparing the data in the rows enables a comparison
of overall differences between cohorts. In actual practice, these designs are more
complicated, because researchers will typically start the first year of the study with
a range of ages. But the diagram gives you the basic idea. Perhaps the best-known
example of this type of sequential design is a long series of studies by K. Warner
Schaie (2005), known as the Seattle Longitudinal Study. It began in 1956, designed
to examine age-related changes in various mental abilities. The initial cohort had
500 people in it, ranging in age from their early 20s to their late 60s (as of 2005,
38 of these subjects were still in the study, 49 years later!). The study has added a
new cohort at 7-year intervals ever since 1956 and has recently reached the 50-year
mark. In all, about 6,000 people have participated. In general, Schaie and his team
have found that performance on mental ability tasks declines slightly with age, but
with no serious losses before age 60, and the losses can be reduced by good physical
health and lots of crossword puzzles. Concerning cohort effects, they have found
that overall performance has been progressively better for those born more recently.
Presumably, those born later in the twentieth century have had the advantages of
better education, better nutrition, and so on.
The length of Schaie's Seattle project is impressive, but the world's record for perseverance in
a repeated-measures study occurred in what is arguably the most famous
longitudinal study of all time. Before continuing, read Box 6.1, which chronicles
the epic tale of Lewis Terman's study of gifted children.
school, but a group of 444 were in junior or senior high school (sample numbers
from Minton, 1988). Their average IQ score was 150, which put the group roughly
in the top 1% of the population. Each child was given an extensive battery of tests and
questionnaires by the team of graduate students assembled by Terman. By the time
the initial testing was complete, each child had a file about 100 pages long (Minton,
1988)! The results of the first analysis of the group were published in more than 600
pages as the Mental and Physical Traits of a Thousand Gifted Children (Terman, 1925).
Terman intended to do just a brief follow-up study, but the project took on a
life of its own. The sample was retested in the late 1920s (Burks, Jensen, & Terman,
1930), and additional follow-up studies during Terman's lifetime were published 25
(Terman & Oden, 1947) and 35 (Terman & Oden, 1959) years after the initial testing.
Following Terman's death, the project was taken over by Robert Sears, a member of the
gifted group and a well-known psychologist in his own right. In the foreword to the
35-year follow-up, Sears wrote: "On actuarial grounds, there is considerable likelihood
that the last of Terman's Gifted Children will not have yielded his last report to the
files before the year 2010!" (Terman & Oden, 1959, p. ix). Between 1960 and 1986,
Sears produced five additional follow-up studies of the group, and he was working
on a book-length study of the group as they aged when he died in 1989 (Cronbach,
Hastorf, Hilgard, & Maccoby, 1990). The book was eventually published as The Gifted
Group in Later Maturity (Holahan, Sears, & Cronbach, 1995).
There are three points worth making about this mega-longitudinal study. First,
Terman's work shattered the stereotype of the gifted child as someone who was brilliant
but socially retarded and prone to burnout early in life. Rather, the members of his
group as a whole were both brilliant and well adjusted and they became successful as
they matured. By the time they reached maturity, "the group had produced thousands
of scientific papers, 60 nonfiction books, 33 novels, 375 short stories, 230 patents,
and numerous radio and television shows, works of art, and musical compositions"
(Hothersall, 1990, p. 353). Second, the data collected by Terman's team continues to be
a source of rich archival information for modern researchers. For instance, studies have
been published on the careers of the gifted females in Terman's group (Tomlinson-Keasy,
1990), and on the predictors of longevity in the group (Friedman et al., 1995).
Third, Terman's follow-up studies are incredible from the methodological standpoint
of a longitudinal study's typical nemesis-attrition. The following figures (taken from
Minton, 1988) are the percentage of living participants who participated in the first
three follow-ups:
After 10 years: 92%
After 25 years: 98%
After 35 years: 93%
These are remarkably high numbers and reflect the intense loyalty that Terman and
his group had for each other. Members of the group referred to themselves as "Termites,"
and some even wore termite jewelry (Hothersall, 1990). Terman corresponded
with hundreds of his participants and genuinely cared for his special people. After all,
the group represented the type of person Terman believed held the key to America's
future.
Problems with Biasing
Because humans are always the experimenters and usually the participants in psychology
research, there is the chance that the results of a study could be influenced
by some human "bias," a preconceived expectation about what is to happen in an
experiment. These biases take several forms but fall into two broad categories: those
affecting experimenters and those affecting research participants. These two
forms of bias often interact.
Experimenter Bias
The Clever Hans case (Chapter 3, pp. 96-98) is often used to illustrate the influence
of experimenter bias on the outcome of some study. Hans's trainer, knowing
the outcome to the question "What is 3 times 3?," sent subtle head-nodding cues
that were read by the apparently intelligent horse. Similarly, experimenters testing
hypotheses sometimes may inadvertently do something that leads participants to behave
in ways that confirm the hypothesis. Although the stereotype of the scientist
is that of an objective, dispassionate, even mechanical person, the truth is that researchers
can become rather emotionally involved in their research. It's not difficult
to see how a desire to confirm some strongly held hypothesis might lead an unwary
experimenter to behave in such a way as to influence the outcome of the study.
For one thing, biased experimenters might treat the research participants in the
various conditions differently. Robert Rosenthal developed one procedure demonstrating
this. Participants in one of his studies (e.g., Rosenthal & Fode, 1963a) were
shown a set of photographs of faces and asked to make some judgment about the
people pictured in them. For example, they might be asked to rate each photo on
how successful the person seemed to be, with the interval scale ranging from -10
(total failure) to +10 (totally successful). All participants saw the same photos and
made the same judgments. The independent variable was experimenter expectancy.
Some experimenters were led to believe that most subjects would give people the
benefit of the doubt and rate the pictures positively; other experimenters were told
to expect negative ratings. Interestingly enough, the experimenter's expectancies
typically produced effects on the subjects' rating behavior, even though the pictures
were identical for both groups. How can this be?
According to Rosenthal (1966), experimenters can innocently communicate their
expectancies in a number of subtle ways. For instance, on the person perception task,
the experimenter holds up a picture while the participant rates it. If the experimenter
is expecting a "+8" and the person says "-3," how might the experimenter act? With
a slight frown, perhaps? How might the participant read the frown? Might he
or she try a "+7" on the next trial to see if this could elicit a smile or a nod from the
experimenter? In general, could it be that experimenters in this situation, without
even being aware of it, are subtly shaping the responses of their participants? Does
this remind you of Clever Hans?
Rosenthal has even shown that experimenter expectancies can be communicated
to subjects in animal research. For instance, rats learn mazes faster for experimenters
who think their animals have been bred for maze-running ability than for those
expecting their rats to be "maze-dull" (Rosenthal & Fode, 1963b). The rats, of
course, are randomly assigned to the experimenters and are equal in ability. The key
factor here seems to be that experimenters expecting their rats to be "maze-bright"
treat them better; for example, they handle them more, a behavior known to affect
learning.
It should be noted that some of the Rosenthal research has been criticized on statistical grounds
and for interpreting the results as being due to expectancy when they
may have been due to something else. For example, Barber (1976) raised questions
about the statistical conclusion validity of some of Rosenthal's work. In at least one
study, according to Barber, 3 of 20 experimenters reversed the expectancy results,
getting data the opposite of the expectancies created for them. Rosenthal omitted
these experimenters from the analysis and obtained a significant difference for the
remaining 17 experimenters. With all 20 experimenters included in the analysis,
however, the difference disappeared. Barber also contends that, in the animal studies,
some of the results occurred because experimenters simply fudged the data (e.g.,
misrecording maze errors). Another difficulty with the Rosenthal studies is that his
procedures don't match what normally occurs in experiments; most experimenters
test all of the participants in all conditions of the experiment, not just those participating
in one of the conditions. Hence, Rosenthal's results might overestimate the
amount of biasing that occurs.
Despite these reservations, the experimenter expectancy effect cannot be ignored;
it has been replicated in a variety of situations and by many researchers other than
Rosenthal and his colleagues (e.g., Word, Zanna, & Cooper, 1974). Furthermore,
experimenters can be shown to influence the outcomes of studies in ways other
than through their expectations. The behavior of participants can be affected by the
experimenter's race and gender, as well as by demeanor, friendliness, and overall attitude
(Adair, 1973). An example of the latter is a study by Fraysse and Desprels-Fraysse
(1990), who found that preschoolers' performance on a cognitive classification task
could be influenced by experimenter attitude. The children performed significantly
better with "caring" than with "indifferent" experimenters.
Controlling for Experimenter Bias
It is probably impossible to eliminate experimenter effects completely. Experimenters
cannot be turned into machines. However, one strategy to reduce bias
is to mechanize procedures as much as possible. For instance, it's not hard to remove
a frowning or smiling experimenter from the person perception task. With modern
computer technology, participants can be shown photos on a screen and asked to
make their responses with a key press while the experimenter is in a different room
entirely.
Similarly, procedures for testing animals automatically have been available since
the 1920s, even to the extent of eliminating human handling completely. E. C.
Tolman didn't wait for computers to come along before inventing "a self-recording
maze with an automatic delivery table" (Tolman, Tryon, & Jeffries, 1929). The
"delivery table" was so called because it "automatically delivers each rat into the
entrance of the maze and 'collects' him at the end without the mediation of the
experimenter. Objectivity of scoring is insured by the use of a device which automatically
records his path through the maze" (Tryon, 1929, p. 73). Today such
automation is routine. Recall from Chapter 4 the study of rats in the radial maze,
in which rat "macrochoices" and "microchoices" were confirmed by videotaping
each animal's performance and defining those two constructs in terms of easily verifiable
behaviors (Brown, 1992). Furthermore, computers make it easy to present
instructions and stimuli to participants while also keeping track of data.
Experimenters can mechanize many procedures, to some degree at least, but the
experimenter will be interacting with every participant nonetheless. Hence, it is
important for experimenters to be given some training in how to be experimenters,
and for the experiments to have highly detailed descriptions of the sequence of steps
that experimenters should follow in every research session. These descriptions are
called research protocols.
Another strategy for controlling for experimenter bias is to use what is called
a double blind procedure. This means simply that experimenters are kept in the
dark (blind) about what to expect of participants in a particular testing session. As
a result, neither the experimenters nor the participants know which condition is
being tested-hence the designation "double." A double blind can be accomplished
when the principal investigator sets up the experiment but a colleague (usually a
graduate student) actually collects the data. Double blinds are not always possible,
of course, as illustrated by the Dutton and Aron (1974) study you read about in
Chapter 3. As you recall, female experimenters arranged to encounter men either
on a suspension bridge swaying 230 feet over a river or on a solid bridge 10 feet
over the same river. It would be a bit difficult to prevent those experimenters from
knowing which condition of the study was being tested! On the other hand, many
studies lend themselves to a procedure in which experimenters are blind to which
condition is in effect. Research Example 6, which could increase the stock price of
Starbucks, is a good example.
Research Example 6-Using a Double Blind
There is considerable evidence that as we age, we become less efficient cognitively in
the afternoon. Also, older adults are more likely to describe themselves as "morning
persons" (I am writing this on an early Saturday morning, so I think I'll get it
right). Ryan, Hatfield, and Hofstetter (2002) wondered if the cognitive decline, as
the day wears on, could be neutralized by America's favorite drug-caffeine. They
recruited 40 seniors, all 65 or older and self-described as (a) morning types and
(b) moderate users of caffeine, and placed them into either a caffeine group or a
decaf group (using Starbucks "house blends"). They were then given a standardized
memory test on two different occasions, once at 8:00 a.m. and once at 4:00 p.m.
The study was a double blind because the experimenters administering the memory
tests did not know which participants had ingested caffeine, and the seniors did not
know which type of coffee they were drinking. And to test for the adequacy of
the control procedures, the researchers completed a clever "manipulation check"
(you will learn more about this concept in a few paragraphs). At the end of the
study, during debriefing, they asked the participants to guess whether they had been
drinking the real stuff or the decaf. The accuracy of the seniors' responses was at
chance level. In fact, most guessed incorrectly that they had been given regular coffee
during one testing session and decaf at the other.
The researchers also did a nice job of incorporating some of the other control
procedures you learned about in this chapter. For instance, the seniors were randomly
assigned to the two different groups, and this random assignment seemed to produce
the desired equivalent groups-the groups were indistinguishable in terms of age,
education level, and average daily intake of caffeine. Also, counterbalancing was used
to insure that half of the seniors were tested first in the morning, then the afternoon,
while the other half were tested in the sequence afternoon-morning.
The results? Time of day did not seem to affect a short-term memory task, but it
had a significant effect on a more difficult longer-term task in which seniors learned
some information, then had a 20-minute delay, then tried to recall the information,
and then completed a recognition test for that same information. And caffeine
prevented the decline for this more demanding task. On both the delayed recall and
the delayed recognition tasks, seniors scored equally well in the morning sessions.
In the afternoon sessions, however, those ingesting caffeine still did well, but the
performance of those taking decaf declined. On the delayed recall task, for instance,
here are the means (max score = 16). Also, remember from Chapter 4 (pp. 137-138)
that, when reporting descriptive statistics, it is important to report not just a measure
of central tendency (mean), but also an indication of variability. So, in parentheses
after each mean below, notice that I have included the standard deviations (SD).
Morning with caffeine → 11.8 (SD = 2.9)
Morning with decaf → 11.0 (SD = 2.7)
Afternoon with caffeine → 11.7 (SD = 2.8)
Afternoon with decaf → 8.9 (SD = 3.0)
So, if the word gets out about this study, the average age of Starbucks' clients might
start to go up, starting around 3:00 in the afternoon. Of course, they will need to
avoid the decaf.
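The pattern in these means can be made concrete with a little arithmetic. The sketch below is only an illustration: it takes the four reported means and standard deviations for delayed recall, computes the morning-to-afternoon change for each group, and then a rough standardized difference between the two afternoon means. The pooled-SD formula assumes equal group sizes, which the description of the study (40 seniors, two groups) suggests but does not state, and the function and variable names are made up for this example.

```python
import math

# Reported delayed-recall means and SDs (max score = 16)
means = {
    ("morning", "caffeine"): (11.8, 2.9),
    ("morning", "decaf"): (11.0, 2.7),
    ("afternoon", "caffeine"): (11.7, 2.8),
    ("afternoon", "decaf"): (8.9, 3.0),
}

def change(group):
    """Afternoon mean minus morning mean for one group."""
    return means[("afternoon", group)][0] - means[("morning", group)][0]

def standardized_difference(cell_a, cell_b):
    """Difference in means divided by a pooled SD (assumes equal n per cell)."""
    (m_a, sd_a), (m_b, sd_b) = means[cell_a], means[cell_b]
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (m_a - m_b) / pooled_sd

caffeine_change = change("caffeine")
decaf_change = change("decaf")
afternoon_gap = standardized_difference(
    ("afternoon", "caffeine"), ("afternoon", "decaf")
)

print(f"Caffeine group change: {caffeine_change:+.1f}")
print(f"Decaf group change: {decaf_change:+.1f}")
print(f"Afternoon caffeine vs. decaf, in pooled SD units: {afternoon_gap:.2f}")
```

The caffeine group barely changes from morning to afternoon (about 0.1 point), while the decaf group drops by about 2.1 points, leaving an afternoon gap of a little under one pooled standard deviation. This is descriptive arithmetic only; it is not a substitute for the significance tests the researchers ran.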
Participant Bias
People participating in psychological research cannot be expected to respond like
machines. They are humans who know they are in an experiment. Presumably
they have been told about the general nature of the research during the informed
consent process, but in deception studies they also know they haven't been told
everything. Furthermore, even if there is no deception in a study, participants may not
believe it-after all, they are in a "psychology experiment," and aren't psychologists
always trying to "psychoanalyze" people? In short, participant bias can occur in
several ways, depending on what participants are expecting and what they believe
their role should be in the study. When behavior is affected by the knowledge
that one is in an experiment and is therefore important to the study's success, the
phenomenon is sometimes called the Hawthorne effect, after a famous series of
studies of worker productivity. To understand the origins of this term, you should
read Box 6.2 before continuing. You may be surprised to learn that most historians
believe the Hawthorne effect has been misnamed and that the data of the original
study were distorted for political reasons.
[...] women were to be [...] room, the fact is that of the five original assemblers, two had to be [...] the room for insubordination and low output. One was said to have [...] (Bramel & Friend, 1981). (Remember, the Soviet Union was brand new in the 1920s, and the "red menace" was a threat to industrial America, resulting in things like a fear of labor unions.) Of the two replacements, one was especially talented and enthusiastic and quickly became the group leader. She apparently was selected because she "held the record as the fastest relay-assembler in the regular department" (Gillespie, 1988, p. 122). Her efforts contributed mightily to the high level of productivity.
A second problem with interpreting the relay data is a simple statistical problem. In the famous 12th period, productivity was recorded as output per week rather than output per hour, yet workers were putting in an extra 6 hours per week compared to the previous test period. If the more appropriate output per hour is used, productivity actually declined slightly (Bramel & Friend, 1981). Also, the women were apparently angry about the change, but afraid to complain lest they be removed from the test room, thereby losing bonus money. Lastly, it could have been that in some of the Hawthorne experiments, increased worker productivity could have been simply the result of feedback about performance, and rewards for productivity (Parsons, 1974).
Historians argue that events must be understood within their entire political/economic/institutional context, and the Hawthorne studies are no exception. Painting a glossy picture of workers unaffected by specific working conditions and more concerned with being considered special [...] in the human relations movement in industry and led corporations to [...] humane management of employees [...]. [...] such a [...] helps to [...] power at the level of management and impede efforts at [...], which [...] historians (e.g., Bramel & Friend, 1981) believe were the true motives behind [...].
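The statistical point in Box 6.2 about output per week versus output per hour is easy to check with arithmetic. The numbers in the sketch below are invented purely for illustration (the box reports only that the workweek was about 6 hours longer in the 12th period); they are not the Hawthorne records, and the names are hypothetical.

```python
def output_per_hour(weekly_output, hours_per_week):
    """Convert a weekly total into an hourly rate."""
    return weekly_output / hours_per_week

# Hypothetical figures, NOT the actual Hawthorne data.
previous_period = {"weekly_output": 2520, "hours": 42}
twelfth_period = {"weekly_output": 2784, "hours": 48}  # 6 more hours per week

weekly_change = twelfth_period["weekly_output"] - previous_period["weekly_output"]
hourly_change = (
    output_per_hour(twelfth_period["weekly_output"], twelfth_period["hours"])
    - output_per_hour(previous_period["weekly_output"], previous_period["hours"])
)

print(f"Change in output per week: {weekly_change:+d}")    # +264
print(f"Change in output per hour: {hourly_change:+.1f}")  # -2.0
```

With these made-up numbers the weekly total rises while the hourly rate falls, which is the kind of reversal Bramel and Friend (1981) describe for the 12th period.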
Most research participants, in the spirit of trying to help the experimenter and
contribute meaningful results, take on the role of the good subject, first described
by Orne (1962). There are exceptions, of course, but, in general, participants tend to
be very cooperative, to the point of persevering through repetitive and boring tasks,
all in the name of psychological science. Furthermore, if participants can figure out
the hypothesis, they may try to behave in such a way that confirms it. Orne used
the term demand characteristics to refer to those aspects of the study that reveal
the hypotheses being tested. If these features are too obvious to participants, they no
longer act naturally and it becomes difficult to interpret the results. Did participants
behave as they normally would or did they come to understand the hypothesis and
behave so as to make it come true?
Orne demonstrated how demand characteristics can influence a study's outcome
by recruiting students for a so-called sensory deprivation experiment (Orne &
Scheibe, 1964). He assumed that participants told that they were in such an experiment
would expect the experience to be stressful and might respond accordingly.
This indeed occurred. Participants who sat for four hours in a small but comfortable
room showed signs of stress only if (a) they signed a form releasing the experimenter
from any liability in case anything happened to them, and (b) the room included
a "panic button" that could be pressed if they felt too stressed by the deprivation.
Control participants were given no release form to sign, no panic button to press, and
no expectation that their senses were being deprived. They did not react adversely.
The possibility that demand characteristics are operating has an impact on decisions
about whether to opt for between- or within-subjects designs. Participants
serving in all of the conditions of a study have a greater opportunity to figure out the
hypothesis(es). Hence, demand characteristics are potentially more troublesome in
within-subjects designs than in between-subjects designs. For both types of designs,
demand characteristics are especially devastating if they affect some conditions but
not others, thereby introducing a confound.
Besides being good subjects (i.e., trying to confirm the hypothesis), participants
wish to be perceived as competent, creative, emotionally stable, and so on. The belief
that they are being evaluated in the experiment produces what Rosenberg (1969)
called evaluation apprehension. Participants want to be evaluated positively, so
they may behave as they think the ideal person should behave. This concern over
how one is going to look and the desire to help the experimenter often leads to the
same behavior among participants, but sometimes the desire to create a favorable
impression and the desire to be a good subject conflict. For example, in a helping
behavior study, astute participants might guess that they are in the condition of
the study designed to reduce the chances that help will be offered. On the other
hand, altruism is a valued, even heroic, behavior. The pressure to be a good subject
and support the hypothesis pulls the participant toward nonhelping, but evaluation
apprehension makes the individual want to help. At least one study has suggested that
when participants are faced with the option of confirming the hypothesis and being
evaluated positively, the latter is the more powerful motivator (Rosnow, Goodstadt,
Suls, & Gitter, 1973).
Controlling for Participant Bias
The primary strategy for controlling participant bias is to reduce demand characteristics
to the minimum. One way of accomplishing this, of course, is through
deception. As we've seen in Chapter 2, the primary purpose of deception is to induce
participants to behave more naturally than they otherwise might. A second strategy,
normally found in drug studies, is to use a placebo control group (see Chapter 7,
pp. 256-257). This procedure allows for a comparison between those actually getting
some treatment (e.g., a drug) and those who think they are getting the treatment but
aren't. If the people in both groups behave identically, the effects can be attributed
to participant expectations of the treatment's effects. You have probably already recognized
that the caffeine study you just read (Research Example 6) used this kind
of logic.
A second way to check for the presence of demand characteristics is to do what
is sometimes called a manipulation check. This can be accomplished during debriefing
by asking participants in a deception study to indicate what they believe
the true hypothesis to be (the "good subject" might feign ignorance though). This
was accomplished in Research Example 6 by asking participants to guess whether
they had been given caffeine in their coffee or not. Manipulation checks can also
be done during an experiment. Sometimes a random subset of participants in each
condition will be stopped in the middle of a procedure and asked about the clarity
of the instructions, what they think is going on, and so on. Manipulation checks
are also used to see if some procedure is producing the effect it is supposed to produce.
For example, if some procedure is supposed to make people feel anxious (e.g.,
telling participants to expect shock), a sample of participants might be stopped in
the middle of the study and assessed for level of anxiety.
A final way of avoiding demand characteristics is to conduct field research. If
participants are unaware that they are in a study, they are unlikely to spend any
time thinking about research hypotheses and reacting to demand characteristics. Of
course, field studies have problems of their own, as you recall from the discussion of
informed consent in Chapter 2 and of privacy invasion in Chapter 3 (pp. 83-84).
Although I stated earlier that most research participants play the role of "good
subjects," this is not uniformly true, and some differences exist between those who
truly volunteer and are interested in the experiment and those who are more reluctant
and less interested. For instance, true volunteers tend to be slightly more
intelligent and have a higher need for social approval (Adair, 1973). Differences between
volunteers and nonvolunteers can be a problem when college students are
asked to serve as participants as part of a course requirement; some students are
more enthusiastic volunteers than others. Furthermore, a "semester effect" can operate.
The true volunteers, those really interested in participating, sign up earlier in
the semester than the reluctant volunteers. Therefore, if you ran a study with two
groups, and Group 1 was tested in the first half of the semester and Group 2 in
the second half, the differences found could be due to the independent variable,
but they also could be due to differences between the true volunteers who sign up
first and the reluctant volunteers who wait as long as they can. Can you think of a
way to control for this problem? If the concept "block randomization" occurs to
you, and you say to yourself "this will distribute the conditions of the study equally
throughout the duration of the semester," then you've accomplished something in
this chapter. Well done.
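The block randomization answer to the semester effect can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name, the seed, and the example of two groups with 10 participants each are all assumptions made for the sketch. Participants are taken in sign-up order, and within each successive block every condition appears exactly once in a shuffled order.

```python
import random

def block_randomize(conditions, n_blocks, seed=None):
    """Return a testing order in which every block contains each
    condition exactly once, in a freshly shuffled order."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule

# Hypothetical study: two groups, 10 participants each, tested in sign-up order.
order = block_randomize(["Group 1", "Group 2"], n_blocks=10, seed=42)
first_half, second_half = order[:10], order[10:]

# Each half of the semester gets the same number of participants per group.
print(first_half.count("Group 1"), second_half.count("Group 1"))  # 5 5
```

Because every block of two sign-ups contains one participant from each group, early (eager) and late (reluctant) volunteers end up spread evenly across both conditions, whatever the particular shuffle turns out to be.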
✓ Self Test 6.3
1. Unlike most longitudinal studies, Terman's study of gifted children did not experience
which control problem?
2. Why does a double blind procedure control for experimenter bias?
3. How can a demand characteristic influence the outcome of a study?
To close out this chapter, read Box 6.3, which concerns the ethical obligations of
those participating in psychological research. The list of responsibilities you'll find
there is based on the assumption that research should be a collaborative effort between
experimenters and participants. We've seen that experimenters must follow the
APA ethics code. In Box 6.3 you'll learn that participants have some responsibilities
too.
Research Participants Have Responsibilities Too
The APA ethics code spells out the responsibilities that researchers have
to those who participate in their experiments.
Participants have a right to expect that the guidelines will
be followed and, if not, there should be a clear process for
registering complaints. But what about the subjects? What
are their obligations?
An article by Jim Korn in the journal Teaching of Psychology (1988) outlines the basic
rights that college students have when they participate in research, but it also lists the
responsibilities of those who volunteer. They include
✓ Being responsible about scheduling by showing up for their appointments with
researchers and arriving on time
✓ Being cooperative and acting professionally by giving their best
and most honest effort
✓ Listening carefully to the experimenter during the informed consent
and instructions phases and asking questions if they are not
sure what to do
✓ Respecting any request by the researcher to avoid discussing
the research with others until all the data have been collected
✓ Being active during the debriefing process by helping the researcher
understand the phenomenon being studied
The assumption underlying this list is that research should be a collaborative effort
between experimenters and participants. Korn's suggestion that participants take a
more assertive role in making research more collaborative is a welcome one. This
assertiveness, however, must be accompanied by enlightened experimenting that values
and probes for the insights that participants have about what might be going on in a
study. An experimenter who simply "runs a subject" and records the data is ignoring
valuable information.
Chapter Summary
In the last two chapters you have learned about the essential features of experimental
research and some of the control problems that must be faced by those who
wish to do research in psychology. We've now completed the necessary groundwork
for introducing the various kinds of experimental designs used to test the effects of
independent variables. So, let the designs begin!
Between-Subjects Designs
In between-subjects designs, individuals participate in just one of the experiment's
conditions; hence, each condition in the study involves a different group of participants.
Such a design is usually necessary when subject variables (e.g., gender) are
being studied or when being in one condition of the experiment changes participants
in ways that make it impossible for them to be in another condition. With
between-subjects designs, the main difficulty is creating groups that are essentially
equivalent to each other on all factors except for the independent variable.
The Problem of Creating Equivalent Groups
The preferred method of creating equivalent groups in between-subjects designs
is random assignment. Random assignment has the effect of spreading unforeseen
confounding factors evenly throughout the different groups, thereby eliminating
their damaging influence. The chance of random assignment working effectively
increases as the number of participants per group increases. If few participants are
available, if some factor (e.g., intelligence) correlates highly with the dependent
variable, and if that factor can be assessed without difficulty before the experiment
begins, then equivalent groups can be formed by using a matching procedure.
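The matching procedure summarized above can be sketched as follows. This is one simple way to do it under stated assumptions, not the only one: participants are ranked on the matching variable, adjacent scores are paired, and each pair is then split at random between the two groups. The pretest scores and all names here are hypothetical.

```python
import random

def matched_assignment(scores, seed=None):
    """Pair participants with adjacent scores, then randomly split each pair.
    With an odd number of scores, the highest score is left unassigned."""
    rng = random.Random(seed)
    ranked = sorted(scores)
    group_a, group_b = [], []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # chance decides which member goes to which group
        group_a.append(pair[0])
        group_b.append(pair[1])
    return group_a, group_b

def mean(values):
    return sum(values) / len(values)

# Hypothetical pretest scores for 8 participants (not data from any study).
pretest = [98, 120, 105, 131, 112, 101, 125, 109]
group_a, group_b = matched_assignment(pretest, seed=1)
print(mean(group_a), mean(group_b))
```

Because each pair contributes one member to each group, the two group means cannot drift far apart on the matching variable, which is the point of matching when the number of participants is small.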
Within-Subjects Designs
When each individual participates in all of the study's conditions, the study is using
a within-subjects (or repeated-measures) design. For these designs, participating
in one condition might affect how participants behave in other conditions. That
is, sequence or order effects can occur, both of which can produce confounded
results if not controlled. Sequence effects include progressive effects (they gradually
accumulate, as in fatigue) and carryover effects (one sequence of conditions might
produce effects different from another sequence).
The Problem of Controlling Sequence Effects
Sequence effects are controlled by various counterbalancing procedures, all of which
ensure that the different conditions are tested in more than one sequence. When
participants serve in each condition of the study just once, complete (all possible
sequences used) or partial (a sample of different sequences or a Latin square) counterbalancing will be
used. When participants serve in each condition more than once,
reverse counterbalancing or block randomization can be used. Asymmetric transfer
can occur when carryover effects are present; such transfer reduces the effectiveness
of counterbalancing.
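Two of the counterbalancing procedures summarized above are easy to illustrate. The sketch below, with hypothetical function names and condition labels, lists every order needed for complete counterbalancing (the number of orders is the factorial of the number of conditions) and builds a reverse counterbalanced sequence for the case where each condition is tested more than once.

```python
from itertools import permutations
from math import factorial

def complete_counterbalancing(conditions):
    """Every possible order of the conditions (n! sequences)."""
    return [list(order) for order in permutations(conditions)]

def reverse_counterbalancing(conditions):
    """Present the conditions in one order, then again in the reverse order."""
    return list(conditions) + list(reversed(conditions))

conditions = ["A", "B", "C", "D"]
all_orders = complete_counterbalancing(conditions)

print(len(all_orders))                       # 24
print(reverse_counterbalancing(conditions))  # ['A', 'B', 'C', 'D', 'D', 'C', 'B', 'A']
print(factorial(6))                          # 720
```

The rapid growth of the factorial is why complete counterbalancing becomes impractical beyond a handful of conditions and partial counterbalancing is used instead.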
Control Problems in Developmental Research
In developmental psychology, the major independent variable is age, a subject variable.
If age is studied between subjects, the design is referred to as a cross-sectional
design. It has the advantage of efficiency, but cohort effects can occur, a special
form of the problem of nonequivalent groups. If age is a within-subjects variable,
the design is called a longitudinal design and attrition can be a problem. The two
strategies can be combined in a cohort sequential design-selecting new cohorts
every few years and testing each cohort longitudinally.
Problems with Biasing
The results of research in psychology can be biased by experimenter expectancy
effects. These can lead the experimenter to treat participants in various conditions
in different ways, making the results impossible to interpret. Such effects can be
reduced by automating the procedures and using double blind control procedures.
Participant bias also occurs. Participants might confirm the researcher's hypothesis
if demand characteristics suggest to them the true purpose of a study or they
might behave in unusual ways simply because they know they are in an experiment.
Demand characteristics are usually controlled through varying degrees of deception
and the extent of participant bias can be evaluated through the use of a manipulation
check.
1. Under what circumstances would a between-subjects design be preferred over
a within-subjects design?
2. Under what circumstances would a within-subjects design be preferred over a
between-subjects design?
3. How does random selection differ from random assignment, and what is the
purpose of the latter?
4. As a means of creating equivalent groups, when is matching most likely to be
used?
5. Distinguish between progressive effects and carryover effects, and explain why
counterbalancing might be more successful with the former than the latter.
6. In a taste test, Joan is asked to evaluate four dry white wines for taste: wines A, B,
C, and D. In what sequence would they be tasted if (a) reverse counterbalancing
or (b) block randomization were being used? How many sequences would be
required if the researcher used complete counterbalancing?
7. What are the defining features of a Latin square and when is one likely to be
used?
8. What specific control problems exist in developmental psychology with (a)
cross-sectional studies and (b) longitudinal studies?
9. What is a cohort sequential design, and how does it improve on cross-sectional
and longitudinal designs?
10. Describe an example of a study that illustrates experimenter bias. How might
such bias be controlled?
11. What are demand characteristics and how might they be controlled?
12. What is a Hawthorne effect and what is the origin of the term?
Exercise 6.1-Between-Subject or Within-Subject?
Think of a study that might test each of the following hypotheses. For each, indicate
whether you think the independent variable should be a between- or a within-subjects
variable or whether either approach would be reasonable. Explain your
decision in each case.
1. A neuroscientist hypothesizes that damage to the primary visual cortex is permanent
in older animals.
2. A sensory psychologist predicts that it is easier to distinguish slightly different
shades of gray under daylight than under fluorescent light.
3. A clinical psychologist thinks that phobias are best cured by repeatedly exposing
the person to the feared object and not allowing the person to escape until the
person realizes that the object really is harmless.
4. A developmental psychologist predicts cultural differences in moral development.
5. A social psychologist believes people will solve problems more creatively when
in groups than when alone.
6. A cognitive psychologist hypothesizes that spaced practice of verbal information
will lead to greater retention than massed practice.
7. A clinician hypothesizes that people with an obsessive-compulsive disorder will
be easier to hypnotize than people with a phobic disorder.
8. An industrial psychologist predicts that worker productivity will increase if the
company introduces flextime scheduling (i.e., work 8 hours, but start and end
at different times).
Exercise 6.2-Constructing a Balanced Latin Square
A memory researcher wishes to compare long-term memory for a series of word
lists as a function of whether the person initially studies either four lists or eight
lists. Help the investigator in the planning stages of this project by constructing the
two needed Latin squares, a 4 x 4 and an 8 x 8, using the procedure outlined in
Table 6.2.
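For checking work on this exercise, here is a hedged sketch of one standard construction of a balanced Latin square for an even number of conditions. Table 6.2 is not reproduced in this section, so it is an assumption that its procedure matches this one: the first row runs first, second, last, third, next-to-last, and so on, and each later row shifts every entry forward by one, wrapping around. The function names are hypothetical, and the demonstration uses six conditions rather than the four or eight the exercise asks for.

```python
def balanced_latin_square(n):
    """Balanced Latin square for an even number of conditions, labeled 0..n-1."""
    if n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row: 0, 1, n-1, 2, n-2, 3, ...
    first_row = [0]
    low, high = 1, n - 1
    while len(first_row) < n:
        first_row.append(low)
        low += 1
        if len(first_row) < n:
            first_row.append(high)
            high -= 1
    # Each later row adds 1 (wrapping around) to the row above it.
    return [[(value + shift) % n for value in first_row] for shift in range(n)]

def is_balanced(square):
    """True if every ordered pair of different conditions is adjacent exactly once."""
    n = len(square)
    pairs = [(row[i], row[i + 1]) for row in square for i in range(n - 1)]
    return len(pairs) == len(set(pairs)) == n * (n - 1)

square = balanced_latin_square(6)
for row in square:
    print(" ".join("ABCDEF"[value] for value in row))
print(is_balanced(square))  # True
```

The `is_balanced` check captures the two defining features: each condition appears once in every row and column position, and each condition immediately precedes every other condition exactly once.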
Exercise 6.3-Random Assignment and Matching
A researcher investigates the effectiveness of an experimental weight-loss program.
Sixteen volunteers will participate, half assigned to the experimental program and
half placed in a control group. In a study such as this, it would be good if the average
weights of the subjects in the two groups were approximately equal at the start of the
experiment. Here are the weights, in pounds, for the 16 subjects before the study
begins.
First, use a matching procedure as the method to form the two groups (experimental
and control), and then calculate the average weight per group. Second, assign
participants to the groups again, this time using random assignment (cut out 16
small pieces of paper, write one of the weights on each, then draw them out of a
hat to form the two groups). Again, calculate the average weight per group after the
random assignment has occurred. Compare your results to those of the rest of the
class-are the average weights for the groups closer to each other with matching or
with random assignment? In a situation such as this, what do you conclude about
the relative merits of matching and random assignment?
Answers to the Self Tests:
✓ 6.1.
1. There is a minimum of two separate groups of subjects tested in the study,
one group for each level of the IV; the problem of equivalent groups.
2. Sal must have a reason to expect verbal fluency to correlate with his dependent
variable; he must also have a good way to measure verbal fluency.
✓ 6.2.
1. Each subject participates in each level of the IV; sequence effects
2. With 6 levels of the IV, complete counterbalancing requires a minimum of
720 subjects (6 x 5 x 4 x 3 x 2 x 1), which could be impractical.
3. Reverse counterbalancing, or block randomization.
✓ 6.3.
1. Attrition.
2. If the experimenter does not know which subjects are in each of the groups
in the study, the experimenter cannot behave in a way that reflects bias.
3. If subjects know what is expected of them, they might be "good subjects"
and not behave naturally.