CHAPTER 4

Bayes' Rule

4.1 Bayes' Rule .......................................................... 52
    4.1.1 Derived from Definitions of Conditional Probability ...... 53
    4.1.2 Intuited from a Two-Way Discrete Table ................... 54
    4.1.3 The Denominator as an Integral over Continuous Values .... 56
4.2 Applied to Models and Data ........................................ 56
    4.2.1 Data Order Invariance .................................... 59
    4.2.2 An Example with Coin Flipping ............................ 60
4.3 The Three Goals of Inference ...................................... 63
    4.3.1 Estimation of Parameter Values ........................... 63
    4.3.2 Prediction of Data Values ................................ 63
    4.3.3 Model Comparison ......................................... 64
    4.3.4 Why Bayesian Inference Can Be Difficult .................. 67
    4.3.5 Bayesian Reasoning in Everyday Life ...................... 68
4.4 R Code ............................................................ 69
    4.4.1 R Code for Figure 4.1 .................................... 69
4.5 Exercises ......................................................... 71

I'll love you forever in every respect
(I'll marginalize all your glaring defects)
But if you could change some to be more like me
I'd love you today unconditionally.

If you see that there are clouds, what is the probability that soon there will be rain? If you know that it is raining, by hearing it patter on the roof, what is the probability that there are clouds? Notice that p(clouds | rain) is not equal to p(rain | clouds).

Doing Bayesian Data Analysis: A Tutorial with R and BUGS. DOI: 10.1016/B978-0-12-381485-2.00004-3 ©2011, Elsevier Inc. All rights reserved.
If someone smiles at you, what is the probability that they love you? If someone loves you, what is the probability that they will smile at you? Notice that p(smile | love) is not equal to p(love | smile). Let's consider an example for which we can determine specific numbers. Suppose I have a standard deck of playing cards, which has 52 cards altogether. There are four suits: hearts, diamonds, clubs, and spades. Within each suit, there are 13 values: ace, two, three, ..., ten, jack, queen, and king. I shuffle the cards and draw one at random without showing it to you. I look at the card, and tell you (truthfully) that it is a queen. Given that you know it is a queen, what is the probability that it is a heart? Think about it a moment: There are four queens in the deck, and only one of them is a heart. So the probability that the card is a heart is 1/4. We can write this as a conditional probability: p(♥ | Q) = 1/4. Now I put the card back into the deck and reshuffle. I draw another card from the deck, and this time I tell you that it is a heart. Given that you know it is a heart, what is the probability that it is a queen? Think about it a moment: There are 13 hearts in the deck, and only one of them is a queen. So the probability that the card is a queen is 1/13. We can write this as a conditional probability: p(Q | ♥) = 1/13. Notice that p(Q | ♥) is not equal to p(♥ | Q). Despite the inequality, the reversed conditional probabilities must be related somehow, right? Answer: Yes! What Bayes' rule tells us is the relationship between the two conditional probabilities. 4.1 BAYES' RULE Thomas Bayes (1702-1761) was a reputable mathematician and Presbyterian minister in England. His famous theorem was published posthumously in 1764. The simple rule that relates conditional probabilities has vast ramifications for statistical inference, and therefore as long as his name is attached to the rule, we'll continue to see his name in textbooks.
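These two conditional probabilities can be checked by brute-force enumeration of the deck. The following R sketch (not code from the book) builds the 52 cards and counts:

```r
# Build the 52-card deck as all suit/value combinations:
deck <- expand.grid( suit  = c("hearts","diamonds","clubs","spades") ,
                     value = c("ace",2:10,"jack","queen","king") )

# p(heart | queen): among the four queens, what fraction are hearts?
queens <- deck[ deck$value == "queen" , ]
pHeartGivenQueen <- mean( queens$suit == "hearts" )   # 1/4

# p(queen | heart): among the thirteen hearts, what fraction are queens?
hearts <- deck[ deck$suit == "hearts" , ]
pQueenGivenHeart <- mean( hearts$value == "queen" )   # 1/13
```

Counting within the conditioning set (queens, or hearts) is exactly what the definition of conditional probability prescribes.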
A crucial application of Bayes' rule is to determine the probability of a model when given a set of data. What the model itself provides is the probability of the data, given specific parameter values and the model structure. We use Bayes' rule to get from the probability of the data, given the model, to the probability of the model, given the data. This process will be explained during the course of this chapter and, indeed, during the rest of this book. There is another branch of statistics, called null hypothesis significance testing (NHST), that relies on the probability of data given the model and does not use Bayes' rule. Chapter 11 describes NHST and its perils. This approach is often identified with another towering figure from England who lived about 200 years later than Bayes, named Ronald Fisher (1890-1962). His name, or at least the first letter of his last name, is immortalized in the most common statistic used in NHST, the F-ratio.¹ It is curious and reassuring that the overwhelmingly dominant approach of the 20th century (i.e., NHST) is giving way in the 21st century to a Bayesian approach that had its genesis in the 18th century. 4.1.1 Derived from Definitions of Conditional Probability Recall from the definition of conditional probability, back in Equations 3.11 and 3.12 on p. 45, that p(y | x) = p(y,x)/p(x). In words, the definition simply says that the probability of y given x is the probability that they happen together relative to the probability that x happens at all. We used this definition quite naturally when computing the conditional probabilities for the example, presented earlier, regarding hearts and queens in a deck of cards. Now we just do some very simple algebraic manipulations. First, multiply both sides of p(y | x) = p(y,x)/p(x) by p(x) to get p(y | x)p(x) = p(y,x). Second, notice that we can do the analogous manipulation starting with p(x | y) = p(y,x)/p(y) to get p(x | y)p(y) = p(y,x).
Now we have two different expressions equal to p(y,x), so we know those expressions equal each other: p(y | x)p(x) = p(x | y)p(y). Divide both sides of that last expression by p(x) to arrive at

p(y | x) = p(x | y) p(y) / p(x)

But we are not done yet, because we can rewrite the denominator in terms of p(x | y) also. Toward that goal, recall that p(x) = Σ_y p(x,y). That was back on p. 44, if you're keeping score. We also know that p(x,y) = p(x | y)p(y). Combining those equations yields

p(x) = Σ_y p(x | y) p(y)

For the coin-flipping example, the probability of getting a head on the next flip is

p(y = H) = Σ_θ p(y = H | θ) p(θ)
         = p(y = H | θ = 0.25) p(θ = 0.25) + p(y = H | θ = 0.50) p(θ = 0.50) + p(y = H | θ = 0.75) p(θ = 0.75)
         = 0.25 × 0.25 + 0.50 × 0.50 + 0.75 × 0.25
         = 0.5

and the probability of getting a tail is computed analogously to be p(y = T) = 0.5. Notice that the predictions are probabilities of each possible data value, given the current model beliefs. If we want to predict a particular point value for the next datum, instead of a distribution across all possible data values, it is typical to use the mean (i.e., expected value) of the predicted data distribution. Thus, the predicted value is ŷ = ∫ dy y p(y). This integral only makes sense if y is on a continuum. If y is nominal, like the result of a coin flip, then the most probable value can be used as "the" predicted value. The decision to use the mean of the predicted values as our single best prediction, instead of, say, the mode or median, relies implicitly on the costs of being wrong and the benefits of being correct. These costs and benefits, called the utilities, are considered in more advanced treatments of Bayesian decision theory. For our purposes, we will default to the mean, purely for convenience. 4.3.3 Model Comparison You may recall from earlier discussion (p. 58) that Bayes' rule is also useful for comparing models. Equation 4.8 indicated that the posterior beliefs in the models involve the evidences of the models.
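The arithmetic above can be verified in a couple of lines of R (a minimal sketch, using the three candidate biases and the triangular prior from the example):

```r
# Candidate coin biases and their prior probabilities from the example:
theta  <- c( 0.25 , 0.50 , 0.75 )
pTheta <- c( 0.25 , 0.50 , 0.25 )

# The probability of a head given bias theta is just theta, so the
# predicted probability of a head marginalizes over the prior beliefs:
pHead <- sum( theta * pTheta )         # 0.25*0.25 + 0.50*0.50 + 0.75*0.25
pTail <- sum( (1 - theta) * pTheta )   # computed analogously
```

Both sums come out to 0.5, matching the computation in the text.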
Notice that in this third goal (i.e., model comparison), the evidence term appears again, just as it appeared for the goals of parameter estimation and data prediction. One of the nice features of Bayesian model comparison is that there is an automatic accounting for model complexity when assessing the degree to which we should believe in the model. This might be best explained with an example. Recall the coin-flipping example discussed earlier, illustrated in Figure 4.1 and reproduced in the left side of Figure 4.2. In that example, we supposed that the bias θ could take on only three possible values. This restriction made the model rather simple. We could instead entertain a more complex model that allows for many more possible values of θ. One such model is illustrated in the right side of Figure 4.2. This model has 63 possible values of θ instead of only 3.

FIGURE 4.2 A simple model in the left column and a complex model in the right column. The prior and posterior distributions indicate probability masses at discrete candidate values of θ. The same data (D = 3H, 9T) are addressed by both models. The evidence p(D | M_simple) for the simple model is displayed as p(D) in the lower-left panel (p(D) = 0.000416), and the evidence p(D | M_complex) for the complex model is displayed as p(D) in the lower-right panel (p(D) = 0.000392). In this case, the data are such that the simple model is favored. The R code that generated these graphs is in Section 4.4.1 (BayesUpdate.R).
The shape of the prior beliefs in the complex model follows the same triangular shape as in the simple model; there is highest belief in θ = 0.50, with lesser belief in more extreme values. The complex model has many more available values for θ, and so it has much more opportunity to fit arbitrary data sets. For example, if a sequence of coin flips has 37% heads, the simple model does not have a θ value very close to that outcome, but the complex model does. On the other hand, for θ values that are in both the simple and complex models, the prior probability on those values in the simple model is much higher than in the complex model. Because there are so many possibilities in the complex model, the prior beliefs have to get spread out, very shallowly, over a larger range of possibilities. This can be seen in Figure 4.2 by inspecting the numerical scales on the vertical axes of the prior beliefs. The scale on the simple model is much larger than the scale on the complex model. Therefore, if the actual data we observe happen to be well accommodated by a θ value in the simple model, we will believe in the simple model more than the complex model, because the prior on that θ value in the simple model is so high. Figure 4.2 shows a case in which this happens. The data have 25% heads, so the evidence in the simple model is larger than the evidence in the complex model. The complex model has its prior spread too thin for us to believe in it as much as we believe in the simple model. The complex model can be the winner if the data are not adequately fit by the simple model. For example, consider a case in which the observed data have just 1 head and 11 tails. None of the θ values in the simple model is close to this outcome. But the complex model does have some θ values near the observed proportion, even though there is not a strong belief in those values.
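The evidences being compared here can be computed directly. The following R sketch (not the book's BayesUpdate.R, though it uses the same triangular prior construction) evaluates p(D | M) for the 3-value and 63-value models:

```r
# Evidence p(D|M) for a grid model with nThetaVals candidate biases and
# a triangular prior, given data with nHeads heads and nTails tails:
evidence <- function( nThetaVals , nHeads , nTails ) {
  theta  <- (1:nThetaVals) / (nThetaVals + 1)
  pTheta <- pmin( theta , 1 - theta )   # triangular shape
  pTheta <- pTheta / sum( pTheta )      # normalize so beliefs sum to 1
  sum( theta^nHeads * (1 - theta)^nTails * pTheta )
}

evSimple3H  <- evidence( 3  , 3 , 9 )   # simple model, data 3H 9T
evComplex3H <- evidence( 63 , 3 , 9 )   # complex model, data 3H 9T
evSimple1H  <- evidence( 3  , 1 , 11 )  # simple model, data 1H 11T
evComplex1H <- evidence( 63 , 1 , 11 )  # complex model, data 1H 11T
```

For 3 heads and 9 tails the simple model's evidence is larger (about 0.000416 versus 0.000392, as in Figure 4.2), but for 1 head and 11 tails the complex model wins, as described in the text.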
Figure 4.3 shows that the simple model has less evidence in this situation, and we have stronger belief in the complex model. The evidence for a model, p(D | M), is not particularly meaningful as an absolute magnitude for a single model. The evidence is most meaningful only in the context of the Bayes factor, p(D | M1)/p(D | M2), which is the relative evidence for two models, when considering an observed data set D.² Regardless of which model wins, the winning model does not need to be a good model of the data. The model comparison process merely tells us about the relative evidence for each model. The winning model is better than the other models in the competition, but the winning model might merely be less bad than the horrible competitors. In later chapters we will explore ways to assess whether the winning model is actually a viable model of the data. We will see in Chapter 10 that Bayesian model comparison is "really" just a case of Bayesian parameter estimation, in which a parameter that indexes the models is estimated. The individual model parameters depend on the indexical parameter, and thus the scheme involves a hierarchy of dependencies. Hierarchical models are introduced in Chapter 9. The fact that model comparison is a case of parameter estimation is mentioned here only to fend off any mistaken impression that parameter estimation and model comparison are fundamentally different.

² The Bayes factor, p(D | M1)/p(D | M2), is quite different than considering evidences of a single model for different candidate data sets. Specifically, p(D1 | M)/p(D2 | M) is not a Bayes factor and is not discussed here.
FIGURE 4.3 A simple model in the left column and a complex model in the right column. The prior and posterior distributions indicate probability masses at discrete candidate values of θ. The same data (D = 1H, 11T) are addressed by both models. The evidence p(D | M_simple) for the simple model is displayed as p(D) in the lower-left panel (p(D) = 0.00276), and the evidence p(D | M_complex) for the complex model is displayed as p(D) in the lower-right panel (p(D) = 0.00366). In this case, the data are such that the complex model is favored. The R code that generated these graphs is in Section 4.4.1 (BayesUpdate.R).

4.3.4 Why Bayesian Inference Can Be Difficult All three goals involve the denominator of Bayes' formula (i.e., the evidence), which usually means computing a difficult integral. There are a few ways out of this difficulty. The traditional way is to use likelihood functions with "conjugate" prior functions. A prior function that is conjugate to the likelihood function simply makes the posterior function come out with the same functional form as the prior. That is, the math works out nicely. If this method doesn't work, an alternative is to approximate the actual functions with other functions that are easier to work with, and then show that the approximation is reasonably good under typical conditions. But this method is still pure, analytical mathematics. Yet another method is to numerically approximate the integral. When the parameter space is small, then it can be covered with a comb or grid of points and the integral can be computed by exhaustively summing across that grid. But when the parameter space gets even moderately large, there are too many grid points, and therefore other methods must be used.
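Here is a minimal R sketch of the grid idea: approximate the evidence integral ∫ p(D|θ) p(θ) dθ by summing over a dense comb of θ values. For illustration the prior is taken here to be uniform on [0,1] (an assumption made because the exact evidence is then known in closed form, so the approximation can be checked):

```r
# Grid approximation of the evidence for z heads in N flips,
# with a uniform prior density p(theta) = 1 on [0,1]:
z <- 3 ; N <- 12
theta  <- seq( 0 , 1 , length.out = 1001 )  # dense comb of theta values
dTheta <- theta[2] - theta[1]               # spacing of the comb
prior  <- rep( 1 , length(theta) )          # uniform prior density
likelihood <- theta^z * (1 - theta)^(N - z)
pD_grid <- sum( likelihood * prior ) * dTheta

# Exact evidence under the uniform prior, via the beta function:
pD_exact <- beta( z + 1 , N - z + 1 )       # equals 1/2860 here
```

The grid sum agrees with the exact value to many decimal places; the difficulty described in the text arises because such exhaustive combs become infeasible in higher-dimensional parameter spaces.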
A large class of random-sampling methods, collectively referred to as Markov chain Monte Carlo (MCMC) methods, has been developed that can numerically approximate probability distributions even for large parameter spaces. It is the development of these MCMC methods that has allowed Bayesian statistical methods to gain practical use. The next major part of this book explains these various methods in some detail. For applications to complex situations, we will ultimately focus on MCMC methods. Another potential difficulty of Bayesian inference is determining a reasonable prior. What distribution of beliefs should we start with, over all possible parameter values or over competing models? This question may seem daunting, but in practice it is typically addressed in a straightforward manner. As we will discuss more in Chapter 11, it is actually advantageous and rational to start with an explicit prior. Prior beliefs should influence rational inference from data, because the role of new data is to modify our beliefs from whatever they were without the new data. Prior beliefs are not capricious, idiosyncratic, and unknowable; instead, they are based on publicly agreed facts and theories. Prior beliefs used in data analysis must be admissible to a skeptical scientific audience. When scientists disagree about prior beliefs, the analysis can be conducted with various priors to assess the robustness of the posterior against changes in the prior. Or the priors can be mixed together into a joint prior, with the posterior thereby incorporating the uncertainty in the prior. In summary, for most applications, specification of the prior turns out to be technically unproblematic, although it is conceptually very important to understand the consequences of one's assumptions about the prior. Thus, the main reason that Bayesian analysis can be difficult is the computation of the evidence, and that computation is made tractable in many situations by MCMC methods.
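To convey the flavor of MCMC, here is a bare-bones Metropolis sampler in R (a toy sketch, not the book's method; the uniform prior, the proposal width of 0.1, and the chain length are arbitrary choices for illustration). It draws samples from the posterior of a coin's bias without ever computing the evidence, because the evidence cancels in the acceptance ratio:

```r
set.seed(47405)
z <- 3 ; N <- 12                       # data: 3 heads in 12 flips
# Unnormalized posterior: likelihood times a uniform prior on (0,1).
target <- function(theta) {
  if ( theta <= 0 || theta >= 1 ) return( 0 )
  theta^z * (1 - theta)^(N - z)
}
nSteps <- 50000
chain <- numeric( nSteps )
chain[1] <- 0.5                        # arbitrary starting point
for ( t in 2:nSteps ) {
  proposal <- chain[t-1] + rnorm( 1 , 0 , 0.1 )   # random-walk proposal
  acceptProb <- min( 1 , target(proposal) / target(chain[t-1]) )
  chain[t] <- if ( runif(1) < acceptProb ) proposal else chain[t-1]
}
postMean <- mean( chain )
```

With a uniform prior the exact posterior mean is (z+1)/(N+2) = 4/14 ≈ 0.286, and the chain's mean lands close to it; later chapters develop this idea properly.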
4.3.5 Bayesian Reasoning in Everyday Life 4.3.5.1 Holmesian Deduction Despite the difficulty of exact Bayesian inference in complex mathematical models, the essence of Bayesian reasoning is frequently used in everyday life. One example has been immortalized in the words of Sherlock Holmes to his friend Dr. Watson: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" (Arthur Conan Doyle, The Sign of Four, 1890, Chapter 6). This reasoning is actually a consequence of Bayesian belief updating, as expressed in Equation 4.4. Let me restate it this way: "How often have I said to you that when p(D | θ_i) = 0 for all i ≠ j, then, no matter how small the prior p(θ_j) > 0 is, the posterior p(θ_j | D) must equal one." Somehow it sounds better the way Holmes said it. The intuition behind Holmes's deduction is clear, though: When we reduce belief in some possibilities, we necessarily increase our belief in the remaining possibilities (if our set of possibilities exhausts all conceivable options). Thus, according to Holmesian deduction, when the data make some options less believable, we increase belief in the other options. 4.3.5.2 Judicial Exoneration The reverse of Holmes's logic is also commonplace. For example, when an objet d'art is found fallen from its shelf, our prior beliefs may indict the house cat, but when the visiting toddler is seen dancing next to the shelf, then the cat is exonerated. This downgrading of a hypothesis is sometimes called explaining away of a possibility by verifying a different one. This sort of exoneration also follows from Bayesian belief updating: When p(D | θ_j) is higher, then, even if p(D | θ_i) is unchanged for all i ≠ j, p(θ_i | D) is lower. This logic of exoneration is based on competition of mutually exclusive possibilities: If the culprit is suspect A, then suspect B is exonerated.
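Both patterns of reasoning can be checked numerically. The following R sketch uses made-up priors and likelihoods (the numbers are illustrative, not from the text) over four mutually exclusive, exhaustive hypotheses:

```r
# Prior beliefs over four mutually exclusive hypotheses
# (hypothesis 1 is strongly favored a priori, like the house cat):
prior <- c( 0.970 , 0.010 , 0.010 , 0.010 )

# Holmesian deduction: the data are impossible under all but the last
# hypothesis, so its posterior must be 1, however small its prior.
likelihood <- c( 0 , 0 , 0 , 0.05 )
posterior <- likelihood * prior / sum( likelihood * prior )   # 0 0 0 1

# Judicial exoneration: raising the likelihood of one hypothesis
# (the toddler) lowers the posterior of the others (the cat),
# even though their own likelihoods are unchanged.
likelihood2 <- c( 0.2 , 0.2 , 0.2 , 0.9 )
posterior2 <- likelihood2 * prior / sum( likelihood2 * prior )
```

After the second update, posterior2[1] is below prior[1]: hypothesis 1 has been partially explained away.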
Holmesian deduction and judicial exoneration are both expressions of the essence of Bayesian reasoning: We have a space of beliefs that are mutually exclusive and exhaust all possibilities. Therefore, if the data cause us to decrease belief in some possibilities, we must increase belief in other possibilities (as said Holmes), or, if the data cause us to increase belief in some possibilities, we must decrease belief in other possibilities (as in exoneration). What Bayes' rule tells us is exactly how much to shift our beliefs across the available possibilities. 4.4 R CODE 4.4.1 R Code for Figure 4.1 Several new commands are used in this program. When you encounter a puzzling command in an R program, it usually helps to try the help command. For example, when perusing this code, you'll come across the command matrix. To find out about the syntax and usage of this command, do this: At the R command line, type help("matrix") and you'll get some clues about how it works. Then experiment with the command at the interactive command line until you're confident about what its various arguments do. For example, try typing at the command line: matrix( 1:6 , nrow=2 , ncol=3 , byrow=TRUE ) Then try matrix( 1:6 , nrow=2 , ncol=3 , byrow=FALSE ) The listing that follows facilitates referring to specific lines of the code when you have enthusiastic conversations about it at parties. Mac users: If you are running R under MacOS instead of in a Windows emulator such as WINE, you will need to change all the windows() commands to quartz(). Later in the book, when we use BUGS, there is no Mac equivalent and you must run the programs under WINE or Windows. (BayesUpdate.R) # Theta is the vector of candidate values for the parameter theta. # nThetaVals is the number of candidate theta values. # To produce the examples in the book, set nThetaVals to either 3 or 63.
nThetaVals = 3
# Now make the vector of theta values:
Theta = seq( from = 1/(nThetaVals+1) , to = nThetaVals/(nThetaVals+1) ,
             by = 1/(nThetaVals+1) )

# pTheta is the vector of prior probabilities on the theta values.
pTheta = pmin( Theta , 1-Theta ) # Makes a triangular belief distribution.
pTheta = pTheta / sum( pTheta )  # Makes sure that beliefs sum to 1.

# Specify the data. To produce the examples in the book, use either
# Data = c(1,1,1,0,0,0,0,0,0,0,0,0) or Data = c(1,0,0,0,0,0,0,0,0,0,0,0).
Data = c(1,1,1,0,0,0,0,0,0,0,0,0)
nHeads = sum( Data == 1 )
nTails = sum( Data == 0 )

# Compute the likelihood of the data for each value of theta:
pDataGivenTheta = Theta^nHeads * (1-Theta)^nTails

# Compute the posterior:
pData = sum( pDataGivenTheta * pTheta )
pThetaGivenData = pDataGivenTheta * pTheta / pData # This is Bayes' rule!

# Plot the results.
windows(7,10) # create window of specified width,height inches.
layout( matrix( c(1,2,3) , nrow=3 , ncol=1 , byrow=FALSE ) ) # 3x1 panels
par(mar=c(3,3,1,0))   # number of margin lines: bottom,left,top,right
par(mgp=c(2,1,0))     # which margin lines to use for labels
par(mai=c(0.5,0.5,0.3,0.1)) # margin size in inches: bottom,left,top,right

# Plot the prior:
plot( Theta , pTheta , type="h" , lwd=3 , main="Prior" ,
      xlim=c(0,1) , xlab=bquote(theta) ,
      ylim=c(0,1.1*max(pThetaGivenData)) , ylab=bquote(p(theta)) ,
      cex.axis=1.2 , cex.lab=1.5 , cex.main=1.5 )

# Plot the likelihood:
plot( Theta , pDataGivenTheta , type="h" , lwd=3 , main="Likelihood" ,
      xlim=c(0,1) , xlab=bquote(theta) ,
      ylim=c(0,1.1*max(pDataGivenTheta)) ,
      ylab=bquote(paste("p(D|",theta,")")) ,
      cex.axis=1.2 , cex.lab=1.5 , cex.main=1.5 )
text( .55 , .85*max(pDataGivenTheta) , cex=2.0 ,
      bquote( "D=" * .(nHeads) * "H," * .(nTails) * "T" ) , adj=c(0,.5) )

# Plot the posterior:
plot( Theta , pThetaGivenData , type="h" , lwd=3 , main="Posterior" ,
      xlim=c(0,1) , xlab=bquote(theta) ,
      ylim=c(0,1.1*max(pThetaGivenData)) ,
      ylab=bquote(paste("p(",theta,"|D)")) ,
      cex.axis=1.2 , cex.lab=1.5 , cex.main=1.5 )
text( .55 , .85*max(pThetaGivenData) , cex=2.0 ,
      bquote( "p(D)=" * .(signif(pData,3)) ) , adj=c(0,.5) )

4.5 EXERCISES

Exercise 4.1. [Purpose: Application of Bayes' rule to disease diagnosis, to see the important role of prior probabilities.] Suppose that in the general population, the probability of having a particular rare disease is 1 in 1000. We denote the true presence or absence of the disease as the value of a parameter, θ, that can have the value θ = ⌢ if the disease is present, or the value θ = ⌣ if the disease is absent. The base rate of the disease is therefore denoted p(θ = ⌢) = 0.001. This is our prior belief that a person selected at random has the disease. Suppose that there is a test for the disease that has a 99% hit rate, which means that if a person has the disease, then the test result is positive 99% of the time. We denote a positive test result as D = + and a negative test result as D = −. The observed test result is a bit of data that we will use to modify our belief about the value of the underlying disease parameter. The hit rate is expressed as p(D = + | θ = ⌢) = 0.99. The test also has a false alarm rate of 5%. This means that 5% of the time when the disease is not present, the test falsely indicates that the disease is present. We denote the false alarm rate as p(D = + | θ = ⌣) = 0.05. Suppose we sample a person at random from the population, administer the test, and it comes up positive. What is the posterior probability that the person has the disease? Mathematically expressed, we are asking, what is p(θ = ⌢ | D = +)? Before determining the answer from Bayes' rule, generate an intuitive answer and see if your intuition matches the Bayesian answer. Most people have an intuition that the probability of having the disease is near the hit rate of the test (which in this case is 0.99).
Hint: The following table of conjoint probabilities might help you understand the possible combinations of events. (The following table is a specific case of Table 4.2, p. 57.) The prior probabilities of the disease are on the bottom marginal. When we know that the test result is positive, we restrict our attention to the row marked D = +.

        θ = ⌢                        θ = ⌣
D = +   p(D = +, θ = ⌢)              p(D = +, θ = ⌣)              p(D = +)
        = p(D = + | θ = ⌢) p(θ = ⌢)  = p(D = + | θ = ⌣) p(θ = ⌣)
D = −   p(D = −, θ = ⌢)              p(D = −, θ = ⌣)              p(D = −)
        = p(D = − | θ = ⌢) p(θ = ⌢)  = p(D = − | θ = ⌣) p(θ = ⌣)
        p(θ = ⌢)                     p(θ = ⌣)

Caveat regarding interpreting the results: Remember that here we have assumed that the person was selected at random from the population; there were no other symptoms that motivated getting the test.

Exercise 4.2. [Purpose: Iterative application of Bayes' rule, to see how posterior probabilities change with inclusion of more data.] Continuing from the previous exercise, suppose that the same randomly selected person as in the previous exercise is retested after the first test comes back positive, and on the retest the result is negative. Now what is the probability that the person has the disease? Hint: For the prior probability of the retest, use the posterior computed from the previous exercise. Also notice that p(D = − | θ = ⌢) = 1 − p(D = + | θ = ⌢) and p(D = − | θ = ⌣) = 1 − p(D = + | θ = ⌣).

Exercise 4.3. [Purpose: To get an intuition for the previous results by using "natural frequency" and "Markov" representations.] (A) Suppose that the population consists of 100,000 people. Compute how many people should fall into each cell of the table in the hint shown in Exercise 4.1. To compute the expected frequency of people in a cell, just multiply the cell probability by the size of the population.
To get you started, a few of the cells of the frequency table are filled in here:

        θ = ⌢                      θ = ⌣
D = +   freq(D = +, θ = ⌢)         freq(D = +, θ = ⌣)         freq(D = +)
        = p(D=+|θ=⌢) p(θ=⌢) N      = p(D=+|θ=⌣) p(θ=⌣) N      = p(D=+) N
        = 99
D = −   freq(D = −, θ = ⌢)         freq(D = −, θ = ⌣)         freq(D = −)
        = p(D=−|θ=⌢) p(θ=⌢) N      = p(D=−|θ=⌣) p(θ=⌣) N      = p(D=−) N
        = 1
        freq(θ = ⌢) = p(θ=⌢) N     freq(θ = ⌣) = p(θ=⌣) N     N = 100,000
        = 100                      = 99,900

Notice the frequencies on the lower margin of the table. They indicate that out of 100,000 people, only 100 have the disease, whereas 99,900 do not have the disease. These marginal frequencies instantiate the prior probability that p(θ = ⌢) = 0.001. Notice also the cell frequencies in the column θ = ⌢, which indicate that of 100 people with the disease, 99 have a positive test result and 1 has a negative test result. These cell frequencies instantiate the hit rate of 0.99. Your job for this part of the exercise is to fill in the frequencies of the remaining cells of the table.

(B) Take a good look at the frequencies in the table you just computed for the previous part. These are the so-called natural frequencies of the events, as opposed to the somewhat unintuitive expression in terms of conditional probabilities (Gigerenzer & Hoffrage, 1995). From the cell frequencies alone, determine the proportion of people who have the disease, given that their test result is positive. Before computing the exact answer arithmetically, first give a rough intuitive answer merely by looking at the relative frequencies in the row D = +. Does your intuitive answer match the intuitive answer you provided back in Exercise 4.1? Probably not. Your intuitive answer here is probably much closer to the correct answer. Now compute the exact answer arithmetically. It should match the result from applying Bayes' rule in Exercise 4.1.
(C) Now we'll consider a related representation of the probabilities in terms of natural frequencies, which is especially useful when we accumulate more data. Krauss, Martignon, & Hoffrage (1999) called this type of representation a Markov representation. Suppose now we start with a population of N = 10,000,000 people. We expect 99.9% of them (i.e., 9,990,000) not to have the disease, and just 0.1% (i.e., 10,000) to have the disease. Now consider how many people we expect to test positive. Of the 10,000 people who have the disease, 99% (i.e., 9900) will be expected to test positive. Of the 9,990,000 people who do not have the disease, 5% (i.e., 499,500) will be expected to test positive. Now consider retesting everyone who has tested positive on the first test. How many of them are expected to show a negative result on the retest? Use this diagram to compute your answer:

                    N = 10,000,000
        ×p(θ = ⌢) ↙                ↘ ×p(θ = ⌣)
          10,000                       9,990,000
      ↓ ×p(D = + | θ = ⌢)          ↓ ×p(D = + | θ = ⌣)
          ______                       ______
      ↓ ×p(D = − | θ = ⌢)          ↓ ×p(D = − | θ = ⌣)
          ______                       ______

When computing the frequencies for the empty boxes, be careful to use the proper conditional probabilities.

(D) Use the diagram in the previous part to answer this question: What proportion of people who test positive at first and then negative on retest actually have the disease? In other words, of the total number of people at the bottom of the diagram in the previous part (those are the people who tested positive then negative), what proportion of them are in the left branch of the tree? How does the result compare with your answer to Exercise 4.2?

Exercise 4.4. [Purpose: To see a hands-on example of data-order invariance.] Consider again the disease and diagnostic test of the previous two exercises. Suppose that a person selected at random from the population gets the test and it comes back negative. Compute the probability that the person has the disease. The person then is retested, and on the second test the result is positive. Compute the probability that the person has the disease.
How does the result compare with your answer to Exercise 4.2?

Exercise 4.5. [Purpose: An application of Bayes' rule to neuroscience, to infer cognitive function from brain activation.] Cognitive neuroscientists investigate which areas of the brain are active during particular mental tasks. In many situations, researchers observe that a certain region of the brain is active and infer that a particular cognitive function is therefore being carried out. Poldrack (2006) cautioned that such inferences are not necessarily firm and need to be made with Bayes' rule in mind. Poldrack (2006) reported the following frequency table of previous studies that involved any language-related task (specifically phonological and semantic processing) and whether or not a particular region of interest (ROI) in the brain was activated:

                 Language Study    Not Language Study
Activated             166                  199
Not activated         703                 2154

Suppose that a new study is conducted and finds that the ROI is activated. If the prior probability that the task involves language processing is 0.5, what is the posterior probability, given that the ROI is activated? (Hint: Poldrack (2006) reports that it is 0.69. Your job is to derive this number.)

Exercise 4.6. [Purpose: To make sure you really understand what is being shown in Figure 4.1.] Derive the posterior distribution in Figure 4.1 by hand. The prior has p(θ = 0.25) = 0.25, p(θ = 0.50) = 0.50, and p(θ = 0.75) = 0.25. The data consist of a specific sequence of flips with three heads and nine tails, so p(D | θ) = θ³(1 − θ)⁹. Hint: Check that your posterior probabilities sum to 1.

Exercise 4.7. [Purpose: For you to see, hands on, that p(D) lives in the denominator of Bayes' rule.] Compute p(D) in Figure 4.1 by hand. Hint: Did you notice that you already computed p(D) in the previous exercise?
CHAPTER 18

Metric Predicted Variable with One Nominal Predictor

18.1 Bayesian Oneway ANOVA..................................................492
18.1.1 The Hierarchical Prior...................................................... 493
18.1.2 Doing It with R and BUGS................................................. 495
18.1.3 A Worked Example.......................................................... 497
18.2 Multiple Comparisons........................................................502
18.3 Two-Group Bayesian ANOVA and the NHST t Test................506
18.4 R Code.............................................................................507
18.4.1 Bayesian Oneway ANOVA................................................ 507
18.5 Exercises..........................................................................512

Familywise error rates breed rumors of incest,
Hounding for quarry in multiple t tests.
Barking at research, poor dog got run over;
Should have done Bayesian oneway ANOVA.

In this chapter we consider situations with a metric predicted variable and a nominally scaled predictor variable. These cases occur frequently in real-world research. For example, we may want to predict weight loss (a metric variable) as a function of which diet the person follows (e.g., low-carb, vegetarian, or low-fat). As another example, we may want to predict severity of psychosis (measured on a metric scale) as a function of which antipsychotic drug the person takes. Or we may want to predict income as a function of political party affiliation. This combination of predicted and predictor scale types occurs in the first row, fourth cell, of Table 14.1 (p. 385). In traditional NHST analyses, these situations are addressed by "oneway analysis of variance" (ANOVA).
The term oneway refers to the fact that a single nominal variable is being used as the predictor. The phrase analysis of variance refers to the fact that the overall variance across all the data is decomposed (i.e., analyzed) into two parts: variance within the levels of the nominal predictor and variance between the levels of the nominal predictor. The variance within levels of the nominal predictor is called noise or error (i.e., variability that cannot be predicted by the predictor). The complementary variance between the levels of the nominal predictor is called the effect of the predictor. Usually we do the research with the goal of detecting an effect, which means that we would like the magnitude of the variance between levels to be large compared to the noise within levels. The ratio of variance between to variance within is called the F-ratio. In the Bayesian approach, we rarely if ever refer to the F-ratio. But because the model we use is based on the model of traditional ANOVA, we will refer to our analysis as Bayesian ANOVA or sometimes BANOVA.

18.1 BAYESIAN ONEWAY ANOVA

The basic idea of oneway ANOVA was introduced in Section 14.1.6.1, p. 368. The predictor is a variable measured on a nominal scale. For example, if income is predicted as a function of political party affiliation, notice that the predictor has nominal levels such as libertarian, green, democratic, republican, and so on. We denote the predictor variable as x⃗, which is a vector with one component per nominal level. For example, suppose that the predictor is political party affiliation, with Green as level 1, Democrat as level 2, Republican as level 3, Libertarian as level 4, and Other as level 5. Then Democrat is represented as x⃗ = (0,1,0,0,0), and Libertarian is represented as x⃗ = (0,0,0,1,0). Political party affiliation is being treated here as a categorical label only, with no ordering along a liberal-conservative scale.
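This dummy coding, combined with the baseline-plus-deflection model developed next, can be sketched in a few lines. This is an illustration only, not the book's code (which is in R and BUGS), and the baseline and deflection values below are hypothetical:

```python
# One-hot (dummy) coding of a nominal predictor, and the predicted value
# as baseline plus deflection, with the deflections summing to zero.
levels = ["Green", "Democrat", "Republican", "Libertarian", "Other"]

def one_hot(level):
    """Return the vector x with a 1 in the component for this level."""
    return [1 if name == level else 0 for name in levels]

beta0 = 50.0                          # baseline (hypothetical units)
beta = [-2.0, 1.0, 3.0, 0.5, -2.5]    # hypothetical deflections; sum to zero
assert abs(sum(beta)) < 1e-9          # the sum-to-zero constraint

def predicted(x):
    # central tendency = beta0 + dot(beta, x)
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

print(one_hot("Democrat"))             # [0, 1, 0, 0, 0]
print(predicted(one_hot("Democrat")))  # 51.0
```

Because exactly one component of x is 1, the dot product simply picks out that level's deflection and adds it to the baseline.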
The formal model indicates how to derive the predicted value from the predictor. The idea is that there is a baseline quantity of the predicted variable, and each level of the predictor indicates a deflection above or below that baseline. We will denote the baseline value of the prediction as β0. The deflection for the jth level of the predictor is denoted βj. When the predictor has value x⃗ = (..., xj, ...), then the predicted value is

    μ = β0 + β⃗ · x⃗ = β0 + Σj βj xj    (18.1)

where the notation β⃗ · x⃗ denotes the dot product of the vectors. In Equation 18.1, the coefficient βj indicates how much μ changes when x⃗ changes from neutral to level j. In other words, βj indicates how much μ changes when x⃗ changes from all xj = 0 to xj = 1. The baseline is constrained such that the deflections sum to zero across the levels of x⃗:

    Σj βj = 0    (18.2)

The expression of the model in Equation 18.1 is not complete without the constraint in Equation 18.2. Examples were shown in Figure 14.4, p. 370, and it is worth the effort to go now to that figure for a quick review.

The predicted value, μi, in Equation 18.1 is for the central tendency in the data. The data themselves are assumed to be randomly generated around that central tendency. As usual, we will assume a normal distribution, yi ~ N(μi, τ), where τ is the precision of the normal distribution. As discussed in previous chapters, if the data have outliers, a t distribution may be used instead.

18.1.1 The Hierarchical Prior

Our primary interest is in estimating the deflection parameters, βj, for each level of x⃗. We could just put a separate prior on each parameter and estimate them separately from each other. It is typical, however, that the levels of x⃗ are not utterly unrelated to each other, and therefore data from one level may inform estimates in another level. For example, the deflections for republicans, libertarians, and greens can inform an estimate of the deflection for democrats.
Thus, if the deflection for libertarians is +1.0, for republicans is +0.5, and for greens is −1.0, then the deflection for democrats should be somewhere in that general range, and not out at, say, −12.0. At the least, we might have prior beliefs that the deflections for most levels of x⃗ may be small, with only a few deflections being large, and therefore we can let the various levels mutually inform each other's estimates based on this structural assumption.

The form of the hierarchical model for oneway BANOVA is displayed in Figure 18.1 (Gelman, 2005, 2006). In the upper middle of the diagram is the normal distribution that describes the distribution of deflections βj across levels of x⃗. This normal distribution has a mean at zero, reflecting the fact that the deflections are constrained to fall both above and below the baseline, because they must sum to zero. Importantly, the precision of this normal distribution, τβ, is estimated, not preset at a constant. Thus, if many of the levels of x⃗ have a small deflection in the data, then the precision τβ is estimated to be high, and this in turn shrinks the estimates of other βj. The prior for τβ derives from the recommendation of Gelman (2006). First, the precision is converted to standard deviation: σβ = 1/√τβ.

18.2 MULTIPLE COMPARISONS

Suppose we compare a placebo group with a single drug group, six subjects per group, using an NHST t test. We declare the group means significantly different when the magnitude of t exceeds the critical value of 2.23, because that would happen only 5% of the time by chance alone. Now consider an expanded experiment, in which there is a placebo treatment and four distinct drugs, for a total of five treatment groups. According to the null hypothesis, the five treatment groups have identical distributions of body temperatures (normally distributed with equal means and variances). However, because of random sampling in any particular experiment, some treatment samples will have higher or lower mean temperatures than other treatment samples. Suppose that before we collect any real data, we plan to compare the placebo group (group 1) with each of the four drug groups (i.e., we plan four pairwise comparisons).
Each of these comparisons might yield a fairly large difference merely by chance, even when there is truly no difference in the underlying distributions. We can determine how often these chance extremes happen by running a Monte Carlo simulation. For each simulated experiment, we randomly sample six scores from each of the five groups, and compute the t values of each of the four comparisons. The simulated experiment is repeated many times. For each candidate tcrit, we see what proportion of simulated experiments had a comparison that exceeded that critical value. The middle curve of Figure 18.4 shows the result. Notice that at any given value of tcrit, there is now a much higher probability that the simulated experiment will have at least one comparison with larger t. In particular, to keep the false alarm rate down to 5%, tcrit must be about 2.95 instead of 2.23.

If we did not plan only four tests but instead decided to compare every group with every other group, then we would have even more opportunity for false alarms. With five groups, there are 10 pairwise comparisons. If we simulate experiments from equal distributions as before, but this time consider all 10 t values, the probability of false alarm is higher yet, as shown in the right curve of Figure 18.4. The critical value has risen even higher, to approximately 3.43.¹

Now, suppose we actually run the experiment. We randomly assign 30 people to the five groups, six people per group. The first group gets the placebo, and the other four groups get the corresponding four drugs. We are careful to make this a double-blind experiment: Neither the subjects nor experimenters know who is getting which treatment. Moreover, no one knows whether any other person is even in the experiment. We collect the data. Our first question is to compare the placebo and the first drug (i.e., group 1 versus group 2). We compute the t statistic for the data from the two groups and find that t = 2.95.
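The Monte Carlo procedure described above is easy to sketch. The simulation below is an illustration, not the book's code (which is in R): it samples five groups of six scores from identical normal distributions, and compares the 95th percentile of |t| for a single comparison against that of the maximum |t| over the four planned comparisons.

```python
import math
import random

random.seed(1)

def t_stat(g1, g2):
    # Pooled two-sample t statistic.
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((v - m1) ** 2 for v in g1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in g2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

n_exp = 4000
single, planned4 = [], []
for _ in range(n_exp):
    # Null hypothesis: all five groups come from the same distribution.
    groups = [[random.gauss(0, 1) for _ in range(6)] for _ in range(5)]
    ts = [abs(t_stat(groups[0], groups[g])) for g in range(1, 5)]
    single.append(ts[0])       # one comparison: group 1 vs group 2
    planned4.append(max(ts))   # worst of the four planned comparisons

def pct95(xs):
    return sorted(xs)[int(0.95 * len(xs))]

# The critical value that keeps the false alarm rate at 5% grows when
# four comparisons each get a chance to produce an extreme t.
print(round(pct95(single), 2))    # near 2.23
print(round(pct95(planned4), 2))  # near 2.95
```

With more simulated experiments the two percentiles converge toward the 2.23 and roughly 2.95 critical values quoted in the text.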
Do we decide that the two treatments had significantly different effects? The answer, bizarrely, depends on the intentions of the person we ask. Suppose, for instance, that we handed the data from the first two groups to a research assistant, who is asked to test for a difference between groups. The assistant runs a t test and finds t = 2.95, declaring it to be highly significant because it greatly exceeds the critical value of 2.23 for a two-group t test. Suppose, on the other hand, that we handed the data from all five groups to a different research assistant, who is asked to compare the first group against each of the other four. This assistant runs a t test of group 1 versus group 2 and finds t = 2.95, declaring it to be marginally significant because it just squeezes past the critical value of 2.95 for these four planned comparisons. Suppose, on yet another hand, that we handed the data from all five groups to a different research assistant, who is told to conduct all pairwise comparisons post hoc because we have no strong hypotheses about which treatments will have beneficial or detrimental or neutral effects. This assistant runs a t test of group 1 versus group 2 and finds t = 2.95, declaring it to be not significant because it fails to exceed the critical value of 3.43 that is used for post hoc pairwise comparisons.

¹For a discussion of various correction procedures and when to use them, see Figure 5.1 of Maxwell & Delaney (2004). If you must learn NHST methods, this is an excellent resource.

Notice that regardless of which assistant analyzed the data, the t value for the two groups stayed the same because the data of the two groups stayed the same. Indeed, the data were completely uninfluenced by the intentions of the analyst. So why should the interpretation of the data be influenced by the intentions of the analyst? It shouldn't.
If you believe that the interpretation should be influenced by the intentions of the analyst, how do you determine the intentions of the analyst? Did the analyst truly plan only those particular comparisons, or did the analyst really plan others but jettison them once the data were in? Or did the analyst actually plan fewer comparisons but realize later that additional comparisons should be made to address other theoretical issues? Or did the analyst actually plan to include two other treatment groups in the study but then not actually include those groups in the analysis because of administrative errors committed during the data collection? Or what if the experiment was planned by a team of people, some of whom planned some comparisons and others of whom planned other comparisons? Conclusion: Establishing the true intentions of the analyst is not only pointless, it is also impossible.

Multiple comparisons are not a problem in a Bayesian analysis (e.g., Gelman, Hill, & Yajima, 2009). The posterior distribution is a fixed entity in high-dimensional parameter space, and making comparisons between groups is simply examining that posterior distribution from different perspectives or margins. The posterior does not change when new comparisons come to mind. The posterior is not immune to spurious coincidences of rogue data, of course. False alarms are mitigated, however, by incorporating prior knowledge into the structure of the model. The estimates of the groups are mutually informative via estimation of higher-level structure, and shrinkage of estimates across groups attenuates false alarms. The attenuation of false alarms is governed by the data, not by unknowable intentions.

18.3 TWO-GROUP BAYESIAN ANOVA AND THE NHST t TEST

The idea behind an NHST t test is simple: We have two groups, each with a mean. We compute the difference of the means and standardize that difference relative to the standard deviation of the scores within the groups.
The resulting standardized difference is called the t value. We want to know whether the observed t value is significantly different from zero, so we compare the t value to a sampling distribution of t values (Gosset, 1908). The sampling distribution assumes that the intention of the researcher was to stop when there were exactly N1 values observed for the first group, and exactly N2 values observed for the second group.

The t test is a special case of NHST ANOVA when there are only two groups. More specifically, when the two groups are assumed to have equal variances in the underlying population, then the t value squared equals the F value in two-group ANOVA. (And what's an F value, you may ask? The F value is the summary statistic used in NHST ANOVA to express how much the groups differ from each other. It's the ratio of the variance between group means, relative to the variance within groups.)

In typical applications of BANOVA, the prior on the between-group variance is only mildly informed. In this case, a BANOVA on two groups imposes little shrinkage on the group estimates because there are so few groups. It is only when several groups "gang up" that they strongly inform the estimate of the variation between groups and therefore constrain the estimates of other groups. When the prior on the variance within groups is also vague, the results of a two-group BANOVA closely agree with the results of an NHST t test. Exercise 18.1 has you explore this correspondence.

18.4 R CODE

18.4.1 Bayesian Oneway ANOVA

(ANOVAonewayBRugs.R)

graphics.off()
rm(list=ls(all=TRUE))
fnroot = "ANOVAonewayBrugs"
library(BRugs)  # Kruschke, J. K. (2010). Doing Bayesian data analysis:
                # A Tutorial with R and BUGS. Academic Press / Elsevier.
#------------------------------------------------------------------------------
# THE MODEL.

modelstring = "
# BUGS model specification begins here...
model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dnorm( mu[i] , tau )
    mu[i] <- a0 + a[x[i]]
  }
  tau <- pow( sigma , -2 )
  sigma ~ dunif(0,10)  # y values are assumed to be standardized
  a0 ~ dnorm(0,0.001)  # y values are assumed to be standardized
  for ( j in 1:NxLvl ) {
    a[j] ~ dnorm( 0.0 , atau )
  }
  atau <- 1 / pow( aSD , 2 )
  aSD <- abs( aSDunabs ) + .1
  aSDunabs ~ dt( 0 , 0.001 , 2 )
}
# ... end BUGS model specification
" # close quote for modelstring

# Write model to a file, and send to BUGS:
writeLines(modelstring, con="model.txt")
modelCheck( "model.txt" )

#------------------------------------------------------------------------------
# THE DATA.

# Specify data source:
dataSource = c( "McDonaldSK1991" , "SolariLS2008" , "Random" )[1]

# Load the data:
if ( dataSource == "McDonaldSK1991" ) {
  fnroot = paste( fnroot , dataSource , sep="" )
  datarecord = read.table( "McDonaldSK1991data.txt" , header=T ,
                           colClasses=c("factor","numeric") )
  y = as.numeric(datarecord$Size)
  Ntotal = length(datarecord$Size)
  x = as.numeric(datarecord$Group)
  xnames = levels(datarecord$Group)
  NxLvl = length(unique(datarecord$Group))
  contrastList = list( BIGvSMALL = c(-1/3,-1/3,1/2,-1/3,1/2) ,
                       ORE1vORE2 = c(1,-1,0,0,0) ,
                       ALAvORE = c(-1/2,-1/2,1,0,0) ,
                       NPACvORE = c(-1/2,-1/2,1/2,1/2,0) ,
                       USAvRUS = c(1/3,1/3,1/3,-1,0) ,
                       FINvPAC = c(-1/4,-1/4,-1/4,-1/4,1) ,
                       ENGvOTH = c(1/3,1/3,1/3,-1/2,-1/2) ,
                       FINvRUS = c(0,0,0,-1,1) )
}
if ( dataSource == "SolariLS2008" ) {
  fnroot = paste( fnroot , dataSource , sep="" )
  datarecord = read.table( "SolariLS2008data.txt" , header=T ,
                           colClasses=c("factor","numeric") )
  y = as.numeric(datarecord$Acid)
  Ntotal = length(datarecord$Acid)
  x = as.numeric(datarecord$Type)
  xnames = levels(datarecord$Type)
  NxLvl = length(unique(datarecord$Type))
  contrastList = list( G3vOTHER = c(-1/8,-1/8,1,-1/8,-1/8,-1/8,-1/8,-1/8,-1/8) )
}
if ( dataSource == "Random" ) {
  fnroot = paste( fnroot , dataSource , sep="" )
  #set.seed(47405)
  ysdtrue = 4.0
  a0true = 100
  atrue = c( 2 , -2 ) # sum to zero
  npercell = 8
  datarecord = matrix( 0 , ncol=2 , nrow=length(atrue)*npercell )
  colnames(datarecord) = c("y","x")
  rowidx = 0
  for ( xidx in 1:length(atrue) ) {
    for ( subjidx in 1:npercell ) {
      rowidx = rowidx + 1
      datarecord[rowidx,"x"] = xidx
      datarecord[rowidx,"y"] = ( a0true + atrue[xidx] + rnorm(1,0,ysdtrue) )
    }
  }
  datarecord = data.frame( y=datarecord[,"y"] , x=as.factor(datarecord[,"x"]) )
  y = as.numeric(datarecord$y)
  Ntotal = length(y)
  x = as.numeric(datarecord$x)
  xnames = levels(datarecord$x)
  NxLvl = length(unique(x))
  # Construct list of all pairwise comparisons, to compare with NHST TukeyHSD:
  contrastList = NULL
  for ( g1idx in 1:(NxLvl-1) ) {
    for ( g2idx in (g1idx+1):NxLvl ) {
      cmpVec = rep(0,NxLvl)
      cmpVec[g1idx] = -1
      cmpVec[g2idx] = 1
      contrastList = c( contrastList , list( cmpVec ) )
    }
  }
}

# Specify the data in a form that is compatible with BRugs model, as a list:
ySDorig = sd(y)
yMorig = mean(y)
z = ( y - yMorig ) / ySDorig
datalist = list(
  y = z ,
  x = x ,
  Ntotal = Ntotal ,
  NxLvl = NxLvl
)
# Get the data into BRugs:
modelData( bugsData( datalist ) )

#------------------------------------------------------------------------------
# INITIALIZE THE CHAINS.

# Autocorrelation within chains is large, so use several chains to reduce
# degree of thinning.
# But we still have to burn-in all the chains, which takes more time
# with more chains (on serial CPUs).
nchain = 5
modelCompile( numChains = nchain )

if ( F ) {
  modelGenInits() # often won't work for diffuse prior
} else {
  # initialization based on data
  theData = data.frame( y=datalist$y , x=factor(x,labels=xnames) )
  a0 = mean( theData$y )
  a = aggregate( theData$y , list( theData$x ) , mean )[,2] - a0
  ssw = aggregate( theData$y , list( theData$x ) ,
                   function(x){var(x)*(length(x)-1)} )[,2]
  sp = sqrt( sum( ssw ) / length( theData$y ) )
  genInitList <- function() {
    return(
      list(
        a0 = a0 ,
        a = a ,
        sigma = sp ,
        aSDunabs = sd(a)
      )
    )
  }
  for ( chainIdx in 1 : nchain ) {
    modelInits( bugsInits( genInitList ) )
  }
}

#------------------------------------------------------------------------------
# RUN THE CHAINS.

# burn in
BurnInSteps = 10000
modelUpdate( BurnInSteps )
# actual samples
samplesSet( c( "a0" , "a" , "sigma" , "aSD" ) )
stepsPerChain = ceiling(5000/nchain)
thinStep = 750
modelUpdate( stepsPerChain , thin=thinStep )

#------------------------------------------------------------------------------
# EXAMINE THE RESULTS.

source("plotChains.R")
source("plotPost.R")

checkConvergence = T
if ( checkConvergence ) {
  sumInfo = plotChains( "a0" , saveplots=T , filenameroot=fnroot )
  sumInfo = plotChains( "a" , saveplots=T , filenameroot=fnroot )
  sumInfo = plotChains( "sigma" , saveplots=T , filenameroot=fnroot )
  sumInfo = plotChains( "aSD" , saveplots=T , filenameroot=fnroot )
}

# Extract and plot the SDs:
sigmaSample = samplesSample("sigma")
aSDSample = samplesSample("aSD")
windows()
layout( matrix(1:2,nrow=2) )
par( mar=c(3,1,2.5,0) , mgp=c(2,0.7,0) )
plotPost( sigmaSample , xlab="sigma" , main="Cell SD" , breaks=30 )
plotPost( aSDSample , xlab="aSD" , main="a SD" , breaks=30 )
dev.copy2eps(file=paste(fnroot,"SD.eps",sep=""))

# Extract a values:
a0Sample = samplesSample( "a0" )
chainLength = length(a0Sample)
aSample = array( 0 , dim=c( datalist$NxLvl , chainLength ) )
for ( xidx in 1:datalist$NxLvl ) {
  aSample[xidx,] = samplesSample( paste("a[",xidx,"]",sep="") )
}

# Convert to zero-centered b values:
mSample = array( 0 , dim=c( datalist$NxLvl , chainLength ) )
for ( stepIdx in 1:chainLength ) {
  mSample[,stepIdx] = ( a0Sample[stepIdx] + aSample[,stepIdx] )
}
b0Sample = apply( mSample , 2 , mean )
bSample = mSample - matrix(rep( b0Sample , NxLvl ) , nrow=NxLvl , byrow=T)
# Convert from standardized b values to original scale b values:
b0Sample = b0Sample * ySDorig + yMorig
bSample = bSample * ySDorig

# Plot b values:
windows(datalist$NxLvl*2.75,2.5)
layout( matrix( 1:datalist$NxLvl , nrow=1 ) )
par( mar=c(3,1,2.5,0) , mgp=c(2,0.7,0) )
for ( xidx in 1:datalist$NxLvl ) {
  plotPost( bSample[xidx,] , breaks=30 ,
            xlab=bquote( beta*1[.(xidx)] ) ,
            main=paste("x:",xnames[xidx]) )
}
dev.copy2eps(file=paste(fnroot,"b.eps",sep=""))

# Display contrast analyses
nContrasts = length( contrastList )
if ( nContrasts > 0 ) {
  nPlotPerRow = 5
  nPlotRow = ceiling(nContrasts/nPlotPerRow)
  nPlotCol = ceiling(nContrasts/nPlotRow)
  windows(3.75*nPlotCol,2.5*nPlotRow)
  layout( matrix(1:(nPlotRow*nPlotCol),nrow=nPlotRow,ncol=nPlotCol,byrow=T) )
  par( mar=c(4,0.5,2.5,0.5) , mgp=c(2,0.7,0) )
  for ( cIdx in 1:nContrasts ) {
    contrast = matrix( contrastList[[cIdx]] , nrow=1 ) # make it a row matrix
    incIdx = contrast!=0
    histInfo = plotPost( contrast %*% bSample , compVal=0 , breaks=30 ,
                         xlab=paste( round(contrast[incIdx],2) , xnames[incIdx] ,
                                     c(rep("+",sum(incIdx)-1),"") ,
                                     collapse=" " ) ,
                         cex.lab = 1.0 ,
                         main=paste( "X Contrast:" , names(contrastList)[cIdx] ) )
  }
  dev.copy2eps(file=paste(fnroot,"xContrasts.eps",sep=""))
}

#------------------------------------------------------------------------------
# Do NHST ANOVA and t tests:
theData = data.frame( y=y , x=factor(x,labels=xnames) )
aovresult = aov( y ~ x , data = theData ) # NHST ANOVA
cat("\n-------------------------------------------------------------------\n\n")
print( summary( aovresult ) )
cat("\n-------------------------------------------------------------------\n\n")
print( model.tables( aovresult , "means" ) , digits=4 )
windows()
boxplot( y ~ x , data = theData )
cat("\n-------------------------------------------------------------------\n\n")
print( TukeyHSD( aovresult , "x" , ordered = FALSE ) )
windows()
plot( TukeyHSD( aovresult , "x" ) )
if ( T ) {
  for ( xIdx1 in 1:(NxLvl-1) ) {
    for ( xIdx2 in (xIdx1+1):NxLvl ) {
      cat("\n---------------------------------------------------------------\n\n")
      cat( "xIdx1 = " , xIdx1 , ", xIdx2 = " , xIdx2 ,
           ", M2-M1 = " , mean(y[x==xIdx2])-mean(y[x==xIdx1]) , "\n" )
      print( t.test( y[x==xIdx2] , y[x==xIdx1] , var.equal=T ) ) # t test
    }
  }
}
cat("\n-------------------------------------------------------------------\n\n")

18.5 EXERCISES

Exercise 18.1. [Purpose: To notice that Bayesian ANOVA with two groups tends to agree with an NHST t test.] The BRugs program of Section 18.4.1 (ANOVAonewayBRugs.R) allows you to specify random data. It executes a Bayesian ANOVA, and at the end of the program it also conducts an NHST ANOVA and t tests (using R's aov and t.test functions). Run the program ten times with different random data by commenting out the set.seed command. Specify ysdtrue = 4.0, atrue = c(2,-2) (which implies two groups because there are two deflections), and npercell = 8. For each run, record, by hand, (1) how much of the posterior difference between means falls on one side of zero (see the posterior histogram with the main title "X Contrast" and x axis labeled "-1 1 + 1 2"), (2) whether the 95% HDI excludes zero, and (3) the confidence interval and p value of the NHST t test. Do the t test and the BANOVA usually agree in their decisions about whether the group means are different?

Exercise 18.2. [Purpose: To understand the influence of the prior in Bayesian ANOVA.] In the model section of the BRugs program of Section 18.4.1 (ANOVAonewayBRugs.R), and correspondingly in the diagram of Figure 18.1, there are several constants that determine the prior. These constants include the mean value of the baseline (M0 in the diagram), the precision on the baseline (T0 in the diagram), the precision of the folded-t distribution (Tt in the diagram), and the upper value of the uniform distribution on σy (Hσy in the diagram). Because the data are standardized, M0 should be set at zero, and T0 can be modest (not terribly small). Hσy also can be set to a modest value because the data are standardized. But what about the precision of the folded-t distribution, Tt? This constant modulates the degree of shrinkage: A large value of Tt indicates prior knowledge that the groups do not differ much, and it imposes a high degree of shrinkage that must be overcome by the data. Run the program on the mussel data using a small value of Tt, such as 1.0E-6, and a large value of Tt, such as 1000. Are the results very different? Discuss which prior value might be appropriate.

Exercise 18.3. [Purpose: To understand Bayesian ANOVA without assuming equal variances.] Modify the program in Section 18.4.1 (ANOVAonewayBRugs.R) so that it allows a different variance for each group, with the different variances coming from a hyperdistribution that has its precision informed by the data. In other words, instead of assuming the same τy (= 1/σy²) for all the levels of x, we allow each group to have its own variance. Denote the precision of the jth group as τj, analogous to the deflection βj. Just as the group deflections are assumed to come from a higher-level distribution, we will assume that the group SDs come from a higher-level distribution. Because SDs must be non-negative, use a gamma density for the higher-level distribution. The gamma distribution has two parameters for which you need to establish a prior. See the right side of Figure 16.11 for guidance. Corresponding code is offered in a hint, below.
Run the program on the mussel muscle data. Are the conclusions about the group means any different than when assuming equal variances across groups? Hint regarding the conclusion: The posteriors on the group means are only a little different in this case, because the group variances are roughly the same. But because the group variances are less constrained when they are all allowed to be different, they are less certain. Therefore, the group means are a little less certain, and thus the differences of means are a little less certain. Programming hints: Here are some code snippets, showing the model specification and chain initialization.

(ANOVAonewayNonhomogvarBrugs.R)

model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dnorm( mu[i] , tau[x[i]] )
    mu[i] <- a0 + a[x[i]]
  }
  a0 ~ dnorm(0,0.001)
  for ( j in 1:NxLvl ) {
    a[j] ~ dnorm( 0.0 , atau )
    tau[j] ~ dgamma( sG , rG )
  }
  sG <- pow(m,2)/pow(d,2)
  rG <- m/pow(d,2)
  m ~ dgamma(1,1)
  d ~ dgamma(1,1)
  atau <- 1 / pow( aSD , 2 )
  aSD <- abs( aSDunabs ) + .1
  aSDunabs ~ dt( 0 , 0.001 , 2 )
}

(ANOVAonewayNonhomogvarBrugs.R)

# initialization based on data
theData = data.frame( y=datalist$y , x=factor(x,labels=xnames) )
a0 = mean( theData$y )
a = aggregate( theData$y , list( theData$x ) , mean )[,2] - a0
tau = 1/(aggregate( theData$y , list( theData$x ) , sd )[,2])^2
genInitList <- function() {
  return(
    list(
      a0 = a0 ,
      a = a ,
      tau = tau ,
      m = mean( tau ) ,
      d = sd( tau ) ,
      aSDunabs = sd(a)
    )
  )
}
for ( chainIdx in 1 : nchain ) {
  modelInits( bugsInits( genInitList ) )
}

CHAPTER 19

Metric Predicted Variable with Multiple Nominal Predictors

19.1 Bayesian Multifactor ANOVA..............................................516
19.1.1 Interaction of Nominal Predictors........................................ 517
19.1.2 The Hierarchical Prior...................................................... 519
19.1.3 An Example in R and BUGS............................................... 520
19.1.4 Interpreting the Posterior.................................................. 522
19.1.5 Noncrossover Interactions, Rescaling, and Homogeneous Variances... 528
19.2 Repeated Measures, a.k.a. Within-Subject Designs................531
19.2.1 Why Use a Within-Subject Design? And Why Not?................... 533
19.3 R Code.............................................................................535
19.3.1 Bayesian Two-Factor ANOVA............................................ 535
19.4 Exercises..........................................................................544

Sometimes I wonder just how it could be, that
Factors aligned so you'd end up with me.
All of the priors made everyone think, that
Our interaction was destined to shrink.

In this chapter we consider situations with a metric predicted variable and multiple nominal predictor variables. For example, we might want to predict income (a metric variable) on the basis of political party affiliation (a nominal variable) and ethnicity (another nominal variable). Or we may want to predict response time (a metric variable) on the basis of hand used for the response (a nominal value: dominant hand or nondominant hand) and modality of stimulus (another nominal value: visual, auditory, or tactile). These situations are modeled by the cell in the first row and last column of Table 14.1, p. 385. In traditional NHST, this situation is known as multifactor ANOVA. We use the same underlying model, but without reference to F sampling distributions; instead we use hierarchical priors that provide additional structural constraints.
Multifactor ANOVA is a straightforward extension of the model presented in the previous chapter, but with a new concept of interaction between nominal variables. Just as multiple regression considered interaction of metric predictors, multifactor ANOVA considers interaction of nominal predictors.

19.1 BAYESIAN MULTIFACTOR ANOVA

Recall from the previous chapter that in oneway ANOVA, we describe the effect of each level of the predictor as a deflection away from an overall baseline, where the baseline is the central tendency across all levels of the predictor. In multifactor ANOVA, the same idea applies to two or more predictors, and the deflections resulting from each predictor are added. We'll use notation analogous to the previous chapter, but with extra subscripts to indicate the different predictors, just as we used in multiple regression on continuous predictors. The mathematical notation was introduced as a case of the generalized linear model in Section 14.1.6.2, p. 370. Suppose we have two nominal predictors, cleverly denoted x⃗1 and x⃗2. These predictor vectors can only take on values of (1,0,0,...), (0,1,0,...), and so on, with the jth component having the value 1 when the predictor has its jth nominal level. When the effects of the two predictors are additive, the predicted tendency is as follows:

    ŷ = β0 + β⃗1 · x⃗1 + β⃗2 · x⃗2

To make the parameter values unique, we include the constraints

    Σj β1,j = 0  and  Σk β2,k = 0

Those equations repeat Equations 14.7 and 14.8. In words, the value β0 establishes the overall baseline from which the predictors indicate deflections. When predictor x⃗1 has value x1,j, a deflection of β1,j is added to the baseline, and when predictor x⃗2 has value x2,k, a deflection of β2,k is also added to the baseline. The deflections may be negative.
Indeed, across all levels of the predictors, the constraints demand as much negative deflection as positive deflection, so that the deflections sum to zero for each predictor.

19.1.1 Interaction of Nominal Predictors

The effect of two predictors may be nonadditive, in which case we say that there is an "interaction" of the predictors. For example, if a flame is put under a hot-air balloon, its levity will increase. And if hydrogen is added to a balloon, its levity will increase. But if hydrogen and flame are added to a balloon, there is a nonadditive interaction, such that levity is not increased.

Figure 19.1 displays a simple interaction. Both predictors have only two levels. The abscissa groups the two levels of predictor $\vec{x}_1$, and the shading of the bars indicates the two levels of predictor $\vec{x}_2$. All three panels of Figure 19.1 show the same data, but the nature of the interaction is highlighted differently in each panel. In the left panel of Figure 19.1, the dashed parallelogram indicates the best additive model for the data. The dashed lines indicate the average change when the levels of the predictors change. The vertical arrows highlight the nonadditive deflections, away from the additive average, that constitute the interaction. Notice that the arrows sum to zero across each edge of the parallelogram. Thus, the interaction components do not change the average deflections of each predictor.

FIGURE 19.1 An interaction of nominal variables $\vec{x}_1$ and $\vec{x}_2$, parsed three ways. Panel titles: "Deflection from additive," "Effect of x1 depends on x2," and "Effect of x2 depends on x1." The left panel emphasizes that the interaction involves a nonadditive, torsion-like deflection away from the additive model, as indicated by the arrows. The middle panel shows the same data, with lines that emphasize that the effect of $\vec{x}_1$ depends on the value of $\vec{x}_2$. The right panel again shows the same data, but with lines that emphasize that the effect of $\vec{x}_2$ depends on the value of $\vec{x}_1$.

The middle and right panels of Figure 19.1 highlight different interpretations of the interaction. The middle panel shows that the effect of $\vec{x}_1$, that is, the amount that $y$ changes when $\vec{x}_1$ changes, depends on the level of $\vec{x}_2$: When $\vec{x}_2 = \langle 1,0 \rangle$, there is only a small change in $y$ when $\vec{x}_1$ changes, but when $\vec{x}_2 = \langle 0,1 \rangle$, there is a larger change in $y$ when $\vec{x}_1$ changes. The right panel makes the same point but with the roles of the predictors reversed: When $\vec{x}_1 = \langle 1,0 \rangle$, the effect of $\vec{x}_2$ is to decrease $y$, but when $\vec{x}_1 = \langle 0,1 \rangle$, the effect of $\vec{x}_2$ is to increase $y$.

The average deflection from baseline due to a predictor is called the main effect of the predictor. The main effects of the predictors correspond to the dashed lines in the left panel of Figure 19.1. When there is nonadditive interaction between predictors, the effect of one predictor depends on the level of the other predictor. The deflection from baseline for a predictor, at a fixed level of the other predictor, is called the simple effect of the predictor at that level of the other predictor. When there is interaction, the simple effects do not equal the main effect.

It may be edifying to compare Figure 19.1, which shows interaction of nominal predictors, with Figure 17.8, p. 470, which shows interaction of metric predictors. The essential notion of interaction is the same in both cases: Interaction is the nonadditive portion of the prediction, and interaction means that the effect of one predictor depends on the level of the other predictor. The mathematical formalism for nonadditive interactions was introduced in Section 14.1.6.3, p. 371, and is repeated here.
The nonadditive components, indicated by the vertical arrows in Figure 19.1, are denoted $\beta_{1\times 2,j,k}$, which means the interaction of predictors 1 and 2 (denoted $1 \times 2$) at level $j$ of predictor 1 and level $k$ of predictor 2. The formal expression merely expands the additive model by including the interaction. Recall from Equations 14.9 and 14.10 that the model with interaction term can be written as
$$ y = \beta_0 + \vec{\beta}_1 \cdot \vec{x}_1 + \vec{\beta}_2 \cdot \vec{x}_2 + \vec{\beta}_{1\times 2} \cdot \vec{x}_{1\times 2} $$
with the constraints
$$ \sum_{j=1}^{J_1} \beta_{1,j} = 0 \quad\text{and}\quad \sum_{k=1}^{J_2} \beta_{2,k} = 0 \quad\text{and}\quad \sum_{j=1}^{J_1} \beta_{1\times 2,j,k} = 0 \;\;\forall k \quad\text{and}\quad \sum_{k=1}^{J_2} \beta_{1\times 2,j,k} = 0 \;\;\forall j $$
In those last equations, the symbol "$\forall$" means "for all." In words, the last two equations simply mean that the interaction deflections sum to zero along every level of the two predictors. A graphic example of this was presented in the left panel of Figure 19.1, which shows that the heights of the arrows sum to zero along every edge of the parallelogram.

Our goal is to estimate the additive and interactive deflections, based on the observed data. It is important to understand that the observed data are not the bars in Figure 19.1; instead, the data are swarms of points at various heights near the heights of the bars. The bars represent the central tendency of the data at each combination of the predictors. Thus, what the equations above actually predict is the central tendency $\mu$ at each combination of predictors, and the data are typically modeled as being normally distributed around $\mu$.

19.1.2 The Hierarchical Prior

The complete generative model of the data is shown in Figure 19.2. It might look daunting, but it really is merely the diagram for oneway ANOVA in Figure 18.1, with the hyperprior replicated for each predictor and interaction.

FIGURE 19.2 Hierarchical dependencies for model of two-way Bayesian ANOVA. Compare with Figure 18.1.
The lowest level of Figure 19.2 indicates that the observed data points, $y_i$, are distributed normally around the predicted value $\mu_i$. Moving upward in the diagram, the arrow impinging on $\mu_i$ indicates that the predicted value is baseline plus additive deflection due to each predictor plus interactive deflection due to the combination of predictors. The upper levels of the diagram indicate prior structural assumptions about the deflections. We assume that the deflections produced by a predictor are centered at zero, and we allow the variance (i.e., precision) of the deflections to be estimated from the data. Thus, if most of the deflections are small, the estimated variance is small, and the hyperdistribution creates shrinkage in the estimates of other deflections.

A key conceptual aspect of the hyperdistributions is that they apply separately to the different predictors and interactions. In other words, there is not just one hyperdistribution that governs all deflections for all predictors and interactions. This division of generative structure reflects a prior assumption that the magnitude of the effect of one predictor might not be informative regarding the magnitude of the effect of a different predictor. But within a predictor, the magnitude of deflection produced by one level may inform the magnitude of deflection produced by other levels of that same predictor.¹

As was assumed in the case of oneway ANOVA, we will assume homogeneity of variance: The variability of the observed data is the same within each combination of predictors. This is indicated in Figure 19.2 by the single parameter $\sigma_y$ that is used in the likelihood function, regardless of the values of the predictors. As before, there are two reasons for this assumption. First, the assumption is a natural simplification in multiple regression on metric predictors, and ANOVA can be construed as a special case of multiple regression.
Second, the assumption of equal variances is made in NHST ANOVA, and we will also make it here in BANOVA to facilitate comparing across the techniques. But there is no requirement in BANOVA to assume equal variances. If the situation suggests that different levels of the predictors produce radically different variances in the data, then the hierarchical prior can allow different variances.

19.1.3 An Example in R and BUGS

Figure 19.3 shows the mean annual salaries of faculty in four departments at three levels of seniority. The four departments are business finance, counseling and educational psychology, chemistry, and theater. These departments are the nominal levels of a predictor denoted $\vec{x}_1$. The three levels of seniority are full professor, associate professor, and assistant professor. Assistant professors are usually within 7 years after completing their doctoral or postdoctoral studies. Associate professors are usually within about 10 years of their doctoral or postdoctoral studies.

¹ By analogy to multiple regression, if there are many predictors included in a model, it is reasonable in principle to include a higher-level distribution across predictors, such that the estimated variance of one predictor informs the estimated variance of another predictor. This would be especially useful if the application includes many nominal predictors, each with many levels. Such applications are rare.

FIGURE 19.3 Mean annual salaries of faculty in four departments (theData$x1: BFIN, CEDP, CHEM, THTR) at three levels of seniority (theData$x2: FT1, FT2, FT3).
Full professors are anywhere from 10 to 40 years post graduate school. Although seniority could, and perhaps should, be treated as an ordinal variable, we will treat it as a nominal predictor, denoted $\vec{x}_2$. A glance at the means suggests that there are effects of department and of seniority. There appears also to be an interaction, meaning that the change in salary due to seniority depends on the department. Our goal is to estimate the baseline salary, the main effect of department membership, the main effect of seniority, and the interaction of department and seniority.

The display of the means in Figure 19.3 obscures the fact that different combinations of department and seniority had different numbers of data points. In other words, the number of associate professors in business finance was not necessarily equal to the number of full professors in theater. In traditional NHST ANOVA, this sort of "unbalanced" design can cause serious computational difficulties (e.g., Maxwell & Delaney, 2004, pp. 320-343). But Bayesian ANOVA has no problem with unbalanced designs.

The model of Figure 19.2 was implemented in R and BRugs and is listed in Section 19.3.1 (ANOVAtwowayBRugs.R). Several tricks for running the model in BUGS are described in that section, before the program listing. The essentials, however, are much like the oneway ANOVA model of the previous chapter. The results are shown in Figure 19.4. (The means and HDI limits are displayed with only three significant digits, but more precise values can be obtained directly from the program.) The top-left histogram shows that the baseline for these four departments is 111,381. Notice, however, that most of the data fall below this baseline because the overall data are skewed by the much higher salaries in one department.
For salaries in the department of chemistry, the fourth histogram in the top row indicates that 2164 should be subtracted from the baseline. For salaries of assistant professors, the first histogram in the bottom row indicates that 20,100 should be subtracted from the baseline. Thus, for assistant professors in the department of chemistry, the linearly predicted salary is 111,381 − 2164 − 20,100 = 89,117. But there is a notable nonlinear interaction component for that combination: The fourth histogram of the bottom row shows that 10,938 must be subtracted from the linear combination to get the mean estimate for that combination, namely, 78,179.

19.1.4 Interpreting the Posterior

In most applications, we are interested not only in estimation of effects for each group, but we are also interested in deciding whether two groups are credibly different. Just as we compared groups in oneway ANOVA in the previous chapter, we can compare groups in multifactor ANOVA. The top and middle rows of Figure 19.5 show selected contrasts of levels of the main effects. We may ask whether there is a credible difference in salaries, on average, between business finance (BFIN) and counseling and educational psychology (CEDP). The top-left histogram indicates that the average difference is about $122,000, and the 95% HDI falls far from zero. We may also ask whether there is a credible difference in salaries, on average, between CEDP and theater (THTR). The top-right histogram indicates that the average difference is about $7780, but the 95% HDI spans zero, which indicates that the uncertainty in the estimated difference is fairly large relative to the estimated difference itself. The middle row of Figure 19.5 shows contrasts regarding levels of seniority: There is a credible difference between full professors (FT1) and associate professors (FT2), and between FT2 and assistant professors (FT3).
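The cell estimate computed at the beginning of this section is simple arithmetic, and can be checked in R; the four numbers are posterior means read off Figure 19.4:

```r
# Reconstructing the estimate for assistant professors (FT3) in chemistry
# from the posterior means reported in Figure 19.4:
b0    = 111381   # baseline
bChem = -2164    # main-effect deflection for CHEM
bFT3  = -20100   # main-effect deflection for FT3
bInt  = -10938   # interaction deflection for the CHEM, FT3 cell
b0 + bChem + bFT3          # linear (additive) prediction: 89117
b0 + bChem + bFT3 + bInt   # including the interaction: 78179
```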
FIGURE 19.4 Posterior distribution for data in Figure 19.3. Baseline ($\beta_0$) is shown in upper left. Remainder of top row is main effect of $\vec{x}_1$ (department). Remainder of left column is main effect of $\vec{x}_2$ (seniority). Remaining cells show the interaction effects.

FIGURE 19.5 Selected contrasts for posterior in Figure 19.4. The values for the means and HDI limits are rounded to three leading digits. The x-axis labels of the bottom row are obscured because they exceed the boundaries of the plot.

It is important to understand that the main effects of department and seniority are average effects, when the other factors are collapsed. For example, the contrast between FT2 and FT3 (middle row, right panel of Figure 19.5) is the average difference between FT2 and FT3, collapsed across all departments. But if you look at the data in Figure 19.3, you can see that the difference between FT2 and FT3 is not the same in every department: There is a fairly large difference in CHEM, but a very small difference in BFIN. The effect of changing from FT2 to FT3 depends on the department, which means that there is interaction. Main effects must be interpreted and described cautiously when there are interactions. It would be a mistake to say that "the" difference between FT2 and FT3 is 13,800. Instead, that is the average difference across departments. The actual difference within any particular department might be quite different. Similarly, it would be a mistake to say that "the" effect of FT3 is to subtract 20,100 from the baseline, because the effect of seniority interacts with department.

19.1.4.1 Combining Metric and Nominal Predictors: ANCOVA

Consider again the salary data in Figure 19.3. You can see that the mean salary for FT1's in chemistry is much higher than in theater. This difference might be attributable solely to being in one department or the other. But the difference might also be attributable to some other factor, such as years on the job. In other words, the FT1's in chemistry might happen to have been employed for decades, whereas the FT1's in theater might happen to be relatively young.
If we had the age of each employee, or, better yet, the number of years that the employee had been at the current level of seniority, then we could include that information as an additional predictor of salary. We could then assess whether department membership contributed any predictiveness beyond number of years on the job. When a nominal predictor, such as department membership, is combined with a metric predictor, such as years on the job, the model is sometimes referred to as analysis of covariance, or ANCOVA. The metric predictor is the "covariate."

Programming ANCOVA in BUGS is a trivial combination of the models we've used for linear regression and ANOVA. Denote the nominal group membership for individual $i$ as xNom[i], and denote the metric covariate value as xMet[i]. Then the core of the BUGS model specification is

    mu[i] <- a0 + a[ xNom[i] ] + bMet * xMet[i]
    y[i] ~ dnorm( mu[i] , tau )

where a[] is the deflection of each group from baseline, and bMet is the regression coefficient on the covariate. As in standard ANOVA, the deflections a[] and intercept a0 should be transformed so that the deflections sum to zero.

When initializing the chains for ANCOVA in BUGS, it can help to start at the maximum likelihood estimate (MLE). The lm() function in R provides the MLE. If we type

    xNom = factor( xNom )            # makes xNom a "factor"
    lmInfo = lm( y ~ xNom + xMet )

then lmInfo is a list of information about the linear model that "best" fits the data. The best-fitting coefficients are stored in lmInfo$coef. The first components of lmInfo$coef are the deflections for the levels of xNom, and the last component of lmInfo$coef is the slope coefficient for xMet (because of the ordering of the variables in the call to lm()). The deflections are parameterized relative to the first component, however. To convert to the parameterization that we use, in which the deflections sum to zero around a baseline, we can use the following code:

    NxNomLevels = length( levels( xNom ) )  # number of levels of xNom
    # Next line adds first xNom component to other xNom components:
    a = c( lmInfo$coef[1] ,
           lmInfo$coef[1] + lmInfo$coef[2:NxNomLevels] )
    a0Init = mean( a )
    aInit = a - mean( a )

ANCOVA models can also involve a separate slope for every level of the nominal predictor. Then the core of the BUGS model specification is

    mu[i] <- a0 + a[ xNom[i] ] + ( bMet + bMetI[ xNom[i] ] ) * xMet[i]
    y[i] ~ dnorm( mu[i] , tau )

where bMet + bMetI[ xNom[i] ] is the group-specific regression coefficient on the covariate for the xNom[i]-th group. The coefficient bMet is the overall slope, and the deflection bMetI[ xNom[i] ] adjusts the slope steeper or shallower for each group. To make the slopes identifiable, the group-specific deflections are constrained to sum to zero: $\sum_{\text{xNom}} \text{bMetI[xNom]} = 0$. Just as in multiple regression, a hyperprior can be put on the group-specific slopes, whereby the group-specific slopes come from a normal (or t) distribution, and the precision of that distribution is itself estimated.

The ANCOVA model, with a distinct intercept and slope for each group, closely resembles the model for repeated-measures simple linear regression in Section 16.3, p. 433. The model in that section had a distinct intercept and slope for each subject. If the subject variable in that model is considered to be a nominal predictor analogous to xNom here, then that model is essentially equivalent to the one used here. The two model expressions are different, however, in how naturally they generalize to situations with more predictors.
The formulation in the present section uses the general ANOVA formulation for the group-specific coefficients (i.e., deflections that sum to zero) and therefore generalizes naturally to situations with multiple nominal predictors.

Additional information about non-Bayesian ANCOVA can be found in a variety of other sources. A brief Bayesian treatment can be found in the book by Ntzoufras (2009), but beware that the formulation there uses no hyperprior on the nominal or metric coefficients, and the method used there to implement the sum-to-zero constraint cannot be used with hyperpriors, as was discussed previously on p. 497.

19.1.4.2 Interaction Contrasts

Just as we can ask whether differences among particular levels of predictors are credible, we can ask whether interactions among particular combinations of predictors are credible. Consider again the data in Figure 19.3. The difference between full professors (FT1) and assistant professors (FT3) appears to be large in the chemistry department (CHEM) but smaller in the theater department (THTR). Is the simple effect of seniority bigger in chemistry than it is in theater? In other words, is (CHEM.FT1 − CHEM.FT3) − (THTR.FT1 − THTR.FT3) credibly nonzero? This sort of difference of differences is called an interaction contrast. In general, an interaction contrast is constructed by taking any set of contrast coefficients on $\vec{x}_1$, and any set of contrast coefficients on $\vec{x}_2$, and computing their outer product. The outer product was described in Section 8.8.1 (BernTwoGrid.R), p. 178. Formally, the outer product of two vectors is denoted by the symbol "$\otimes$." To provide an example of an interaction contrast as an outer product of main-effect contrasts, we will recast the one we are presently considering, namely, (CHEM.FT1 − CHEM.FT3) − (THTR.FT1 − THTR.FT3), in generic notation. Notice that CHEM is level 3 of predictor 1, and hence can be written as $\vec{x}_{1,3}$.
Writing the other components in the same fashion, the interaction contrast is $(\vec{x}_{1,3}.\vec{x}_{2,1} - \vec{x}_{1,3}.\vec{x}_{2,3}) - (\vec{x}_{1,4}.\vec{x}_{2,1} - \vec{x}_{1,4}.\vec{x}_{2,3})$. That can be algebraically rearranged to highlight the coefficients on the particular combinations:
$$ (+1)\,\vec{x}_{1,3}.\vec{x}_{2,1} + (-1)\,\vec{x}_{1,3}.\vec{x}_{2,3} + (-1)\,\vec{x}_{1,4}.\vec{x}_{2,1} + (+1)\,\vec{x}_{1,4}.\vec{x}_{2,3} $$
Those highlighted coefficients can be obtained as the outer product of main-effect contrasts, namely, the contrast $\vec{c}_1 = \langle 0, 0, +1, -1 \rangle$, which expresses CHEM minus THTR, and the contrast $\vec{c}_2 = \langle +1, 0, -1 \rangle$, which expresses FT1 minus FT3:
$$ \vec{c}_1 \otimes \vec{c}_2 \;=\; \begin{pmatrix} 0 \\ 0 \\ +1 \\ -1 \end{pmatrix} \begin{pmatrix} +1 & 0 & -1 \end{pmatrix} \;=\; \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ +1 & 0 & -1 \\ -1 & 0 & +1 \end{pmatrix} $$
where the rows correspond to the levels of $\vec{x}_1$ (BFIN, CEDP, CHEM, THTR) and the columns correspond to the levels of $\vec{x}_2$ (FT1, FT2, FT3). Notice that the coefficients in the matrix match the highlighted coefficients in the difference of differences that was expressed a few sentences previously.

The posterior of this interaction contrast is shown in the bottom-left histogram of Figure 19.5. The mean of 34,600 indicates that the difference between FT1 and FT3 is about 34,600 greater for CHEM than for THTR. The 95% HDI clearly excludes zero, indicating that this interaction contrast is credibly nonzero.

Interaction contrasts can involve "complex" comparisons just as simply as pairwise comparisons. For example, suppose we are interested in comparing BFIN against the average of the other nonbusiness departments, specifically for a contrast between FT1 and lesser ranks. This interaction contrast is expressed as $\langle +1, -1/3, -1/3, -1/3 \rangle \otimes \langle +1, -1/2, -1/2 \rangle$. The posterior of this contrast is shown in the bottom-right histogram of Figure 19.5. (The label on the x axis exceeds the margins of the figure because there are 12 combinations of levels involved in the contrast specification.) The result suggests that there is considerable uncertainty in the larger difference, between FT1 and other ranks, in BFIN than in other departments. Therefore, we would not want to conclude that the interaction contrast is credibly nonzero.
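Both contrast matrices just discussed can be computed with R's outer-product operator, %o%:

```r
# Interaction-contrast coefficients as outer products of main-effect
# contrasts. Row order: BFIN, CEDP, CHEM, THTR; column order: FT1, FT2, FT3.
c1 = c( 0 , 0 , +1 , -1 )        # CHEM minus THTR
c2 = c( +1 , 0 , -1 )            # FT1 minus FT3
contrastMat = c1 %o% c2          # same as outer( c1 , c2 )
contrastMat                      # 4 x 3 matrix of cell coefficients
# The "complex" comparison: BFIN versus the average of the other
# departments, crossed with FT1 versus the average of lesser ranks:
c3 = c( +1 , -1/3 , -1/3 , -1/3 )
c4 = c( +1 , -1/2 , -1/2 )
complexMat = c3 %o% c4
```

Every row and every column of such a matrix sums to zero, which is what makes it a pure interaction contrast: it is insensitive to the baseline and to the main effects.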
Exercise 19.2 gives you hands-on practice with specification of interaction contrasts.

19.1.5 Noncrossover Interactions, Rescaling, and Homogeneous Variances

When interpreting interactions, it can be important to consider the scale on which the data are measured. This is because an interaction means nonadditive effects when measured in the current scale. If the data are nonlinearly transformed to a different scale, then the nonadditivity can also change. Consider an example, using utterly fictional numbers merely for illustration. Suppose the average salary of Democratic women is 10 monetary units, for Democratic men it's 12 units, for Republican women it's 15 units, and for Republican men it's 18 units. These data indicate that there is a nonadditive interaction of political party and gender, because the change in pay from women to men is 2 units for Democrats, but 3 units for Republicans. Another way of describing the interaction is to notice that the change in pay from Democrats to Republicans is 5 units for women but 6 units for men. A researcher might be tempted to interpret the interaction as indicating some extra advantage attained by Republican men, or some special disadvantage suffered by Democratic women. But such an interpretation may be inappropriate, because a mere rescaling of the data makes the interaction disappear, as will be described next.

Salary raises and comparisons are often measured by percentages and ratios, not by additive or subtractive differences. Consider the salary data in percentage terms. Among Democrats, men make 20% more than women. Among Republicans, the men again make 20% more than the women. Among women, Republicans make 50% more than Democrats. Among men, Republicans again make 50% more than Democrats.
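These ratios, and the equal differences that appear under a logarithmic transformation, are easy to verify in R (the salary figures are the fictional ones from the example above):

```r
# The fictional salary example; units are arbitrary monetary units.
salary = matrix( c( 10 , 12 ,     # Democrat:   women , men
                    15 , 18 ) ,   # Republican: women , men
                 nrow = 2 , byrow = TRUE ,
                 dimnames = list( c("Dem","Rep") , c("W","M") ) )
salary[ , "M"] / salary[ , "W"]   # men/women ratio: 1.2 for both parties
salary["Rep", ] / salary["Dem", ] # Rep/Dem ratio: 1.5 for both genders
logSalary = log10( salary )
logSalary["Dem","M"] - logSalary["Dem","W"]  # about 0.079; same for Rep
logSalary["Rep","W"] - logSalary["Dem","W"]  # about 0.176; same for men
```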
In these ratio terms, then, there is no interaction of gender and political party: Change from female to male predicts a 20% increase in salary regardless of party, and change from Democrat to Republican predicts a 50% increase in salary regardless of gender.

Equal ratios are transformed to equal distances by a logarithmic transformation. If we measure salary in terms of the logarithm of monetary units, then the salary of Democratic women is log10(10) = 1.000, the salary of Democratic men is log10(12) = 1.079, the salary of Republican women is log10(15) = 1.176, and the salary of Republican men is log10(18) = 1.255. With this logarithmic scaling, the increase in salary from women to men is 0.079 for both parties, and the increase from Democrat to Republican is 0.176 for both genders. In other words, when salary is measured on a logarithmic scale, there is no interaction of gender and political party.

It may seem strange to measure salary on a logarithmic scale, but there are many situations for which the scale is arbitrary. The pitch of a sound can be measured in terms of frequency (i.e., cycles per second) or in terms of perceived pitch, which is essentially the logarithm of the frequency. The magnitude of an earthquake can be measured by its energy or by its value on the Richter scale, which is the logarithm of energy. The pace of a dragster on a racetrack can be measured by the average speed during the run or by the duration from start to finish (which is the reciprocal of average speed). Thus, measurement scales are not unique and are instead determined by convention.

The general issue is illustrated in Figure 19.6. Suppose that predictor $\vec{x}_1$ has two levels, as does predictor $\vec{x}_2$. Suppose we have three data points at each combination of levels, yielding 12 data points altogether. The means at each combination of levels are shown in the top-left graph of Figure 19.6.
You can see that there is an interaction, with the effect of $\vec{x}_1$ being bigger when $x_2 = 2$ than when $x_2 = 1$. But this interaction goes away when the data are transformed by taking the logarithm, as shown in the lower-left graph. Each individual data point was transformed, and then the means in each cell were computed. Of course, the transformation can go the other way: Data with no interaction, as in the lower-left plot, can be made to have an interaction when they are rescaled as in the upper-left plot, via an exponential transformation.

The transformability from interaction to noninteraction is only possible for noncrossover interactions. This terminology is merely a description of the graph: The lines do not cross over each other (and they have the same sign slope). In this situation, the y axis can have different portions stretched or shrunken so that the lines become parallel. If, however, the lines cross, as in the middle column of Figure 19.6, then there is no way to uncross the lines merely by stretching or shrinking intervals of the y axis. The right column of Figure 19.6 shows the same data as the middle column, but it is plotted with the roles of $x_1$ and $x_2$ exchanged. When plotted this way, the lines do not cross, but they do have opposite-sign slopes (i.e., one slope is positive and the other slope is negative). There is no way that stretching or shrinking the y axis can change the signs of the slopes, hence the interaction cannot be removed merely by transforming the data. Because these data have crossing lines when plotted as in the middle column, they are said to have a crossover interaction even when they are plotted as in the right column. (Test your understanding: Is the interaction in Figure 19.1 a crossover interaction?)

FIGURE 19.6 Top row shows means of original data; bottom row shows means of logarithmically transformed data. Left column shows a noncrossover interaction; middle and right columns show the same crossover interaction plotted against $x_1$ or $x_2$.

It is important to note that the transformation applies to individual raw data values, not to the means of the conditions. A consequence of transforming the data, therefore, is changes in the variances of the data within each condition. For example, suppose one condition has data values of 100, 110, and 120, whereas a second condition has data values of 1100, 1110, and 1120. For both conditions, the variance is 66.7 (i.e., there is homogeneity of variance). When the data are logarithmically transformed, the variance of the first group becomes 1.05e-3, but the variance of the second group becomes two orders of magnitude smaller, namely, 1.02e-5 (i.e., there is not homogeneity of variance).

Therefore, when applying the hierarchical model of Figure 19.2, we must be aware that it assumes homogeneity of variance. If we transform the data, we are changing the variances within the levels of the predictors. The transformed variances might or might not be fairly homogeneous. If they are not, then either the data should be transformed in such a way as to respect homogeneity of variance, or the model should be changed to allow unequal variances. The models we have been using also assume a normal likelihood function, which means that the data at any level of the predictors should be normally distributed. When the data are transformed to a different scale, the shape of their distribution also changes. If the distributions become radically non-normal, it may be misleading to use a model with a normal likelihood function. For a discussion of these issues, review Section 15.1.4, p. 399.

In summary, this section has made two main points.
First, if you have a noncrossover interaction, be careful what you claim about it. A noncrossover interaction merely means nonadditivity in the scale you are using. If this scale is the only meaningful scale, or if this scale is the overwhelmingly dominant scale used in that field of research, then you can cautiously interpret the nonadditive interaction with respect to that scale. But if transformed scales are reasonable, then keep in mind that nonadditivity is scale specific, and there might be no interaction in a different scale. With a crossover interaction, however, no rescaling can undo the interaction. Second, nonlinear transformations change the within-cell variances and the shapes of the within-cell distributions. Be sure that the model you are using is appropriate to the homogeneity or nonhomogeneity of variances in the data, and to the shapes of the distributions, on whatever scale you are using. Exercise 19.1 has you consider these issues "hands on."

19.2 REPEATED MEASURES, A.K.A. WITHIN-SUBJECT DESIGNS

In many situations, a single "subject" contributes data to multiple levels of the predictors. For example, suppose we are interested in how quickly people can press a button in response to a stimulus onset. The stimulus could appear in the visual modality as a light, or in the auditory modality as a tone. The subject could respond with his or her dominant hand or with the nondominant hand. Thus, there are two nominal predictors, namely modality and hand. The new aspect is that a single subject contributes data to all combinations of the predictors. On many successive trials, the subject gets either a tone or a light and is instructed to respond with either the dominant or nondominant hand. Because every subject is measured many times, this situation is sometimes called a repeated measures design. Because the levels of the predictors change within subjects, this situation is also called a within-subject design.
I favor the latter terminology because it more explicitly connotes the essential aspect of the design, that the same subject contributes data in more than one condition. Within-subject designs are contrasted with between-subject designs, in which different subjects contribute data to different levels of the predictors.

When every subject contributes many data points to every combination of predictors, then the model of the situation is a straightforward extension of the models we've already considered. We merely add "subject" as another predictor in the model, with each individual subject being a level of the predictor. If there is one predictor other than subject, the model becomes

y = β0 + β1·x1 + βS·xS + β1×S·x1×S

This is exactly the two-predictor model we have already considered, with the second predictor being subject. When there are two predictors other than subject, the model becomes

y = β0 + β1·x1 + β2·x2 + βS·xS + β1×2·x1×2 + β1×S·x1×S + β2×S·x2×S + β1×2×S·x1×2×S

This model includes all the two-way interactions of the factors, plus the three-way interaction. Again, subject merely plays the role of the third predictor.

The preceding model, which includes all the high-order interactions with subject, is fine in principle but may be overkill in practice. Unless you have specific theoretical motivations to seek out and interpret high-order interactions of subject with other predictors, there is little reason to model them, and there is difficulty making sense of them even if you did model them. Instead, if you have many data points from each subject in every cell, an alternative approach is to apply a Bayesian ANOVA model to each subject's data, and then put a higher-order prior across the subject parameter estimates, so that different subjects mutually inform each other's estimates and provide a stable group-level estimate.
Thus, every subject has a baseline, β0,S, and there is a higher-order, group-level prior on the distribution of β0,S across subjects. Each predictor also has subject-specific estimates, with the effect of the jth level of predictor 1 denoted β1S,j. Each of these effect parameters has a higher, group-level prior across subjects. (This was the modeling approach taken for repeated measures in simple linear regression in Section 16.3, p. 433.) Finally, the group-level effects have a hyperprior that provides shrinkage on the effects of a predictor. In other words, the shrinkage prior, on effects of a predictor, is set at the group level, not at the subject level.

There are other situations, however, in which each subject contributes only one datum to a combination of the other predictors. For example, in the case of the response-time study described earlier, perhaps we have only the mean response time of the subject in each combination of hand and modality. As another example, suppose the value to be predicted is IQ, as measured by a lengthy exam, with one predictor being noisy versus quiet exam environment and the other predictor being paper versus computerized exam format. Although it is conceivable that subjects could be repeatedly tested in each condition, it would be challenging enough to get people to sit through all four combinations even once. Thus, each subject would contribute one value to each condition.

In the situation when each subject contributes only one datum per condition, the models described earlier, with all the interaction terms, are not identifiable, meaning that there are more parameters than data points. The simplest case of this situation is trying to estimate the mean and variance of a normal distribution from a single data point. A Bayesian analysis can still be conducted, but there will be high uncertainty in the parameter estimates, governed largely by the priors.
Therefore, instead of attempting to estimate all the interactions of subjects with other predictors, we assume a simpler model in which the only influence of subjects is on the baseline:

y = β0 + β1·x1 + β2·x2 + β1×2·x1×2 + βS·xS

In other words, we assume a main effect of subject but no interaction of subject with other predictors. In this model, the subject effect (deflection) is constant across treatments, and the treatment effects (deflections) are constant across subjects. Notice that the model makes no requirement that every subject contributes a datum to every condition. Indeed, the model allows zero or more than one datum per subject per condition. As mentioned earlier, the computations in Bayesian ANOVA make no assumptions or requirements that the design is "balanced." If you do have many observations per subject in every combination of predictors, then one of the previously described models may be considered.

19.2.1 Why Use a Within-Subject Design? And Why Not?

The primary reason to use a within-subject design is that you can achieve much greater precision in the estimates of the effects than in a between-subject design. For example, suppose you are interested in measuring the effect on response time of using the dominant versus nondominant hand. Suppose there is a population of four subjects from whom you could measure data. If we could measure every subject in every condition, we would know that for the first subject, his or her response times for dominant and nondominant hands are 300 and 320 msec. For the second subject, the response times are 350 and 370. For the third subject, the response times are 400 and 420, and for the fourth subject, the response times are 450 and 470. Thus, for every subject, the difference between dominant and nondominant hands is 20 msec. Suppose we have the resources to measure only two data points in each condition. We measure response times from the dominant hands of two subjects.
Should we measure response times from the nondominant hands of the same two subjects, or the nondominant hands of two other subjects? If we measure from the same two subjects, then the estimated effect for each subject is 20 msec, and we have high certainty in the magnitude of the effect. If we measure from two other subjects, then the estimated effect of dominant versus nondominant hand is the average of the first two subjects versus the average of the second two subjects, and the difference is badly affected by random sampling. The between-subject design yields lower precision in the estimate of the effect. Exercise 19.3 has you examine, hands on, a case of this improvement in precision.

Because of the gain in precision, it is desirable to use within-subject designs. But there are many dangers of within-subject designs that need to be considered before they are applied in any particular situation. The key problem is that, in most situations, when you measure the subject you change the subject, and therefore subsequent measurements are not measuring the same subject. The simplest examples of this are mere fatigue or generic practice effects. In measures of response time, if you measure repeatedly from the same subject, you will find improvement over the first several trials because of the subject gaining practice with the task, but after a while, as the subject tires, there will be a decline in performance. The problem is that if you measure the dominant hand in the early trials and the nondominant hand in the later trials, then the effect of practice or fatigue will contaminate the effect of handedness. The repeated measurement process affects and contaminates the measure that is supposed to be a signature of the predictor.

Practice and fatigue effects can be overcome by randomly distributing and repeating the conditions throughout the repeated measures, if the practice and fatigue effects influence all conditions equally.
Thus, if practice improves both the dominant and nondominant hand by 50 msec, then the difference between dominant and nondominant hands is unaffected by practice. But practice might affect the nondominant hand much more than the dominant hand. You can imagine that in complex designs with many predictors, each with many levels, it can become difficult to justify an assumption that repeated measures have comparable effects on all conditions.

Worse yet, in some situations there can be differential carryover effects from one condition to the next. For example, having just experienced practice in the visual modality with the nondominant hand might improve subsequent performance in the auditory modality with the nondominant hand, but it might not improve subsequent performance in the visual modality with the dominant hand. Thus, the carryover effect is different for different subsequent conditions.

When you suspect strong differential carryover effects, you may be able to explicitly manipulate the ordering of the conditions and measure the carryover effects, but this might be impractical, depending on the specifics of your situation. In this case, you must revert to a between-subject design and simply include many subjects to attenuate between-subject noise.

In general, all the models we have been using assume independence of observations: The probability of the combined data is the product of the probabilities of the individual data points. When we use repeated measures, this assumption is much less easy to justify. On the one hand, when we repeatedly flip a coin, we might be safe to assume that its underlying bias does not change much from one flip to the next. But, on the other hand, when we repeatedly test the response time of a human subject, it is less easy to justify an assumption that the underlying response time remains unaffected by the previous trial.
Researchers will often make the assumption of independence merely as an approximation of convenience, hoping that by arranging conditions randomly across many repeated measures, the differential carryover effects will be minimized.

19.3 R CODE

19.3.1 Bayesian Two-Factor ANOVA

Several implementation details of the program are the same as the one-way ANOVA program of the previous chapter:

■ Data are normalized so that prior constants can be more generic.
■ Initialization of chains is based on the data. It is important to do this; otherwise burn-in can take forever.
■ Because there is nasty autocorrelation, we use a large thinning constant and we also use multiple chains. For a reminder of the issues of burn-in and thinning, see Section 23.2, p. 623.

A new detail arises in how the uncentered parameter estimates are recentered to respect the sum-to-zero constraints. The uncentered estimates from BUGS are a0, a1[], a2[], and a1a2[,]. By definition of the ANOVA model, the predicted mean of cell i,j is

m[i,j] = a0 + a1[i] + a2[j] + a1a2[i,j]

We use these predicted means to construct the zero-centered parameters. First, b0 is the mean across all the predicted means:

b0 = mean( m[,] )

Then the main-effect deflections are the marginal means minus the overall mean:

b1[i] = mean( m[i,] ) - b0
b2[j] = mean( m[,j] ) - b0

It is easy (honest!) to check that those deflections do indeed sum to zero (i.e., sum( b1[] ) = 0 and sum( b2[] ) = 0). Finally, the interaction deflections are the residuals after the additive effect of b1 and b2 is taken into account:

b1b2[i,j] = m[i,j] - ( b0 + b1[i] + b2[j] )

Again, it is easy to check that the rows and columns of b1b2[,] all sum to zero.

In the data section of the program, one option is to load data from the article of Qian & Shen (2007).
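The recentering arithmetic just described can be illustrated on a toy numerical example. This is a sketch, not part of the book's program, and the cell means m are made up for illustration:

```r
# Toy cell means m[i,j] for a 2x3 design, to illustrate recentering:
m = matrix( c( 5,  7,  9 ,
               6, 10, 14 ) , nrow=2 , byrow=TRUE )
b0 = mean( m )            # overall mean of the cell means
b1 = rowMeans( m ) - b0   # main-effect deflections for factor 1
b2 = colMeans( m ) - b0   # main-effect deflections for factor 2
# Interaction deflections: residuals after the additive part.
b1b2 = m - ( b0 + outer( b1 , rep(1,3) ) + outer( rep(1,2) , b2 ) )
# The deflections sum to zero, and the interaction deflections sum to
# zero along every row and every column:
sum( b1 ) ; sum( b2 )
rowSums( b1b2 ) ; colSums( b1b2 )
```

Running this confirms the sum-to-zero constraints that the text invites you to check.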
The program here uses a hierarchical structure similar to that used by Qian & Shen (2007), but their program did not recenter the parameters as is done here. It may be instructive to compare the results of the program here with the results reported by Qian & Shen (2007).

BUGS for many factors. The program that follows applies only for cases of two nominal predictors. If you have many nominal predictors, along with their two-way, three-way, and higher-order interactions, it becomes unwieldy to explicitly and separately name all the deflection parameters. Instead, it can be more elegant to use a technique of dummy coding, whereby we essentially revert back to using vectors for coding the values of the predictors instead of integer indices. That is, x1 = level 2 is coded by the "dummy" vector (0,1,0,...) instead of by the integer index 2. Interactions are represented by matrices of dummy codes that have been flattened into vectors. For an example of programming this technique in BUGS, see Ntzoufras (2009, Ch. 6). Unfortunately, those examples do not incorporate the higher-level prior structure emphasized in Figure 19.2.

(ANOVAtwowayBRugs.R)

graphics.off()
rm(list=ls(all=TRUE))
fnroot = "ANOVAtwowayBrugs"
library(BRugs)  # Kruschke, J. K. (2010). Doing Bayesian data analysis:
                # A Tutorial with R and BUGS. Academic Press / Elsevier.
#------------------------------------------------------------------
# THE MODEL.

modelstring = "
# BUGS model specification begins here...
model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dnorm( mu[i] , tau )
    mu[i] <- a0 + a1[x1[i]] + a2[x2[i]] + a1a2[x1[i],x2[i]]
  }
  #
  tau <- pow( sigma , -2 )
  sigma ~ dunif(0,10) # y values are assumed to be standardized
  #
  a0 ~ dnorm(0,0.001) # y values are assumed to be standardized
  #
  for ( j1 in 1:Nx1Lvl ) { a1[j1] ~ dnorm( 0.0 , a1tau ) }
  a1tau <- 1 / pow( a1SD , 2 )
  a1SD <- abs( a1SDunabs ) + .1
  a1SDunabs ~ dt( 0 , 0.001 , 2 )
  #
  for ( j2 in 1:Nx2Lvl ) { a2[j2] ~ dnorm( 0.0 , a2tau ) }
  a2tau <- 1 / pow( a2SD , 2 )
  a2SD <- abs( a2SDunabs ) + .1
  a2SDunabs ~ dt( 0 , 0.001 , 2 )
  #
  for ( j1 in 1:Nx1Lvl ) { for ( j2 in 1:Nx2Lvl ) {
    a1a2[j1,j2] ~ dnorm( 0.0 , a1a2tau )
  } }
  a1a2tau <- 1 / pow( a1a2SD , 2 )
  a1a2SD <- abs( a1a2SDunabs ) + .1
  a1a2SDunabs ~ dt( 0 , 0.001 , 2 )
}
# ... end BUGS model specification
" # close quote for modelstring

# Write model to a file, and send to BUGS:
writeLines(modelstring,con="model.txt")
modelCheck( "model.txt" )

#------------------------------------------------------------------
# THE DATA.

# Specify data source:
dataSource = c( "QianS2007" , "Salary" , "Random" , "Ex19.3" )[4]

# Load the data:
if ( dataSource == "QianS2007" ) {
  fnroot = paste( fnroot , dataSource , sep="" )
  datarecord = read.table( "QianS2007SeaweedData.txt" , header=TRUE ,
                           sep="," )
  # Logistic transform the COVER value:
  # Used by Appendix 3 of QianS2007 to replicate Ramsey and Schafer (2002).
  datarecord$COVER = -log( ( 100 / datarecord$COVER ) - 1 )
  y = as.numeric(datarecord$COVER)
  x1 = as.numeric(datarecord$TREAT)
  x1names = levels(datarecord$TREAT)
  x2 = as.numeric(datarecord$BLOCK)
  x2names = levels(datarecord$BLOCK)
  Ntotal = length(y)
  Nx1Lvl = length(unique(x1))
  Nx2Lvl = length(unique(x2))
}

(intervening lines of the program omitted)

      dim=c(NROW(contrast),NCOL(contrast),chainLength) )
  contrastLab = ""
  for ( x1idx in 1:Nx1Lvl ) { for ( x2idx in 1:Nx2Lvl ) {
    if ( contrast[x1idx,x2idx] != 0 ) {
      contrastLab = paste( contrastLab , "+" ,
                           signif(contrast[x1idx,x2idx],2) ,
                           x1names[x1idx] , x2names[x2idx] )
    }
  } }
  histInfo = plotPost( apply( contrastArr * b1b2Sample , 3 , sum ) ,
                       compVal=0 , breaks=30 , xlab=contrastLab ,
                       cex.lab = 0.75 ,
                       main=paste( names(x1x2contrastList)[cIdx] ) )
}
dev.copy2eps(file=paste(fnroot,"x1x2Contrasts.eps",sep=""))

# Do NHST ANOVA:
theData = data.frame( y=y , x1=factor(x1,labels=x1names) ,
                      x2=factor(x2,labels=x2names) )
windows()
interaction.plot( theData$x1 , theData$x2 , theData$y , type="b" )
dev.copy2eps(file=paste(fnroot,"DataPlot.eps",sep=""))
aovresult = aov( y ~ x1 * x2 , data = theData )
cat( "\n------------------------------------------------------------------\n\n" )
print( summary( aovresult ) )
cat( "\n------------------------------------------------------------------\n\n" )
print( model.tables( aovresult , type = "effects", se = TRUE ) , digits=3 )
cat( "\n------------------------------------------------------------------\n\n" )
#=============================================================================

19.4 EXERCISES

Exercise 19.1. [Purpose: To inspect an interaction for transformed data.] Consider the data plotted in Figure 19.3, p. 521.

(A) Is the interaction a crossover interaction or not? Briefly explain your answer.
(B) Suppose we are interested in salaries thought of in terms of percentage (i.e., ratio) differences rather than additive differences. Therefore we take the logarithm, base 10, of the individual salaries (the R code has this option built into the data section, where the salary data are loaded). Run the analysis on the transformed data, producing the results and contrasts analogous to those in Figures 19.4 and 19.5. Do any of the conclusions change?

(C) Examine the within-cell variances in the original and in the transformed data. (Hint: Try using the aggregate function on the data. As a guide, see how the function is used to initialize a1a2. Instead of applying the mean to the aggregated data, apply the standard deviation. The result is the within-cell standard deviations. Are they all roughly the same?) Do the original or the transformed data better respect the assumption of homogeneous variances?

Exercise 19.2. [Purpose: To investigate a case of two-factor Bayesian ANOVA.] In the data specification of the program in Section 19.3.1 (ANOVAtwowayBRugs), you can load data from Qian & Shen (2007), regarding how quickly seaweed regenerates when in the presence of different types of grazers. Data were collected from eight different tidal areas of the Oregon coast; this predictor (x2) is referred to as the Block. In each of the eight Blocks, there were six different combinations of seaweed grazers established by the researchers. This predictor (x1) had six levels: control, with no grazers allowed; only small fish allowed; small and large fish allowed; only limpets allowed; limpets and small fish allowed; and all three grazers allowed. The predicted variable was the percentage of the plot covered by regenerated seaweed, logarithmically transformed.

(A) Load the data and run the program. You will find that there are too many levels of the two predictors to fit all the posterior histograms into a single multipanel display.
Therefore, modify the plotPost.R program so that it produces only the mean and HDI limits, marked by a horizontal bar with a circle at the mean (without a histogram), and perhaps without a main title. Name your program something other than plotPost.R, and use it in the plotting section at the end of the program instead of plotPost.R. Show your results. (A secondary goal of this part of the exercise is to give you experience modifying graphics in R to suit your own purposes.) Hints: There are many ways to do this, but here are some options. To suppress plotting of the histogram, just put this argument in the hist function: plot=F. To suppress a title on a plot, just use the argument main="". To adjust the font size, specify the "character expansion": cex for text, cex.lab for axis labels, and so forth. To reduce the margins around a plot, so there is more room for the plot itself, try variations of these margin specifications: par( mar=c(2,0.5,1,0.5) , mgp=c(0.5,0,0) ). The par command needs to be called before the plots are made.

(B) The program already includes contrasts that consider whether there is an effect of small fish, an effect of large fish, and an effect of limpets. What conclusions do you reach from the posteriors of these contrasts?

(C) Construct a contrast of the average of Blocks 3 and 4 versus the average of Blocks 1 and 2. Show your specification, the graph of the posterior on the contrast, and state your conclusion.

(D) Is the effect of limpets different in Block 6 than in Block 7? To answer this question, construct an interaction contrast using an outer product (Hint: refer to the already-coded L_effect for the contrast that specifies the effect of limpets). Is the effect of small fish different in Blocks 1 and 7 than in Blocks 3 and 5? For both questions, show the contrast vectors that you constructed, show the posterior of the contrast, and state your conclusion.
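For part (D), the outer-product construction of an interaction contrast can be sketched as follows. The weight vectors below are hypothetical stand-ins, not the program's actual L_effect specification; in the real program, x1 has the six grazer levels and x2 has the eight Blocks:

```r
# Interaction contrast: a (hypothetical) limpet-effect contrast over the
# six x1 levels, crossed with Block 6 versus Block 7 over the eight
# x2 levels. The outer product yields a 6x8 matrix of cell weights.
x1contrast = c( -1/3 , -1/3 , -1/3 , 1/3 , 1/3 , 1/3 ) # hypothetical
x2contrast = c( 0 , 0 , 0 , 0 , 0 , 1 , -1 , 0 )       # Block 6 vs Block 7
contrast = outer( x1contrast , x2contrast )
# Every row and column of weights sums to zero, as an interaction
# contrast must; the posterior of the contrast is then the weighted sum
# of the interaction deflections at each step in the chain.
rowSums( contrast ) ; colSums( contrast )
```

The same outer-product idiom, with different weight vectors, answers the small-fish question in part (D).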
Exercise 19.3. [Purpose: To notice that within-subject designs can be more sensitive (hence more powerful) than between-subject designs.] Consider these data:

x1 x2   y  S     x1 x2   y  S
 1  1 101  1      2  1 105  1
 1  1 102  2      2  1 107  2
 1  1 103  3      2  1 106  3
 1  1 105  4      2  1 108  4
 1  1 104  5      2  1 109  5
 1  2 104  1      2  2 109  1
 1  2 105  2      2  2 108  2
 1  2 107  3      2  2 110  3
 1  2 106  4      2  2 111  4
 1  2 108  5      2  2 112  5

Note: The table is split into two halves so it fits the page more compactly. The continuation of the first column appears in the fifth column, the continuation of the second column appears in the sixth column, and so forth.

(A) Ignoring the last column, which indicates the subject who generated the data, conduct a Bayesian ANOVA using x1 and x2 as predictors of y. Show the code you used to load the data, and show the resulting posterior histograms of β0, β1,j, β2,k, and β1×2,j,k. Also show the posterior of the contrast β1,2 − β1,1 (i.e., the marginal difference between levels 1 and 2 of factor 1, also called the main effect of factor 1) and the posterior of the contrast β2,2 − β2,1 (i.e., the marginal difference between levels 1 and 2 of factor 2, also called the main effect of factor 2).

(B) Now include the subject as a predictor by expanding the model to include a deflection from baseline due to subject. (Do not include any subject interaction terms.) Again show the posteriors of the β's requested in the previous part. Are the certainties on the estimates and contrasts different than in the previous part? In what way, and why? (Hint regarding the answer: Figure 19.7 shows posterior histograms for the main effect of factor 2, when the data are considered to be between subject or within subject. Notice that the means are essentially the same in both histograms, but the uncertainties are very different!)
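Here is one way to enter the table's data directly in R. This is a sketch for orientation only; the program itself loads these data via the "Ex19.3" option of its data-specification section:

```r
# Exercise 19.3 data: y is the predicted value, x1 and x2 are the
# nominal predictors, and S is the subject who generated each datum.
y  = c( 101,102,103,105,104 , 104,105,107,106,108 ,
        105,107,106,108,109 , 109,108,110,111,112 )
x1 = rep( c(1,2) , each=10 )
x2 = rep( rep( c(1,2) , each=5 ) , 2 )
S  = rep( 1:5 , 4 )
# Each subject's difference between levels of factor 1, averaged over
# x2, is nearly constant across subjects, which foreshadows why the
# within-subject analysis yields much tighter estimates:
d = ( tapply( y[x1==2] , S[x1==2] , mean )
      - tapply( y[x1==1] , S[x1==1] , mean ) )
```

Inspecting d shows per-subject effects clustered tightly around 4, despite baselines that differ across subjects by several units.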
Programming hints: The model specification without a subject factor is

mu[i] <- a0 + a1[x1[i]] + a2[x2[i]] + a1a2[x1[i],x2[i]]

but with a subject factor becomes

mu[i] <- a0 + a1[x1[i]] + a2[x2[i]] + a1a2[x1[i],x2[i]] + aS[S[i]]

[Figure 19.7 appears here.] FIGURE 19.7 For Exercise 19.3. Left panel: Posterior for difference between levels of factor 2 when data are considered to be between subject. Right panel: Posterior for difference between levels of factor 2 when data are considered to be within subject. Notice that the means are (essentially) the same in both histograms, but the uncertainties are very different!

where S[i] is the subject number for the ith datum, and aS[] are the deflections from baseline for each subject. You must, of course, specify a prior on aS[] analogous to the prior on a1[]. The conversion of the a[] values to zero-centered b[] values proceeds analogously to what was explained at the beginning of Section 19.3.1 (ANOVAtwowayBRugs.R). The code merely needs to be expanded to include the additional subject factor. Here is a guide (ANOVAtwowayBRugsWithinSubj.R):

# Convert the a values to zero-centered b values.
# m12Sample is predicted cell means at every step in MCMC chain:
m12Sample = array( 0 , dim=c( datalist$Nx1Lvl , datalist$Nx2Lvl ,
                              datalist$NSLvl , chainLength ) )
for ( stepIdx in 1:chainLength ) {
  for ( a1idx in 1:Nx1Lvl ) {
    for ( a2idx in 1:Nx2Lvl ) {
      for ( aSidx in 1:NSLvl ) {
        m12Sample[ a1idx , a2idx , aSidx , stepIdx ] = (
          a0Sample[stepIdx]
          + a1Sample[a1idx,stepIdx]
          + a2Sample[a2idx,stepIdx]
          + a1a2Sample[a1idx,a2idx,stepIdx]
          + aSSample[aSidx,stepIdx] )
} } } }
# b0Sample is mean of the cell means at every step in chain:
b0Sample = apply( m12Sample , 4 , mean )
# b1Sample is deflections of factor 1 marginal means from b0Sample:
b1Sample = ( apply( m12Sample , c(1,4) , mean )
             - matrix(rep( b0Sample ,Nx1Lvl),nrow=Nx1Lvl,byrow=T) )
# b2Sample is deflections of factor 2 marginal means from b0Sample:
b2Sample = ( apply( m12Sample , c(2,4) , mean )
             - matrix(rep( b0Sample ,Nx2Lvl),nrow=Nx2Lvl,byrow=T) )
# bSSample is deflections of factor S marginal means from b0Sample:
bSSample = ( apply( m12Sample , c(3,4) , mean )
             - matrix(rep( b0Sample ,NSLvl),nrow=NSLvl,byrow=T) )
# linpredSample is linear combination of the marginal effects:
linpredSample = 0*m12Sample
for ( stepIdx in 1:chainLength ) {
  for ( a1idx in 1:Nx1Lvl ) {
    for ( a2idx in 1:Nx2Lvl ) {
      for ( aSidx in 1:NSLvl ) {
        linpredSample[a1idx,a2idx,aSidx,stepIdx] = (
          b0Sample[stepIdx]
          + b1Sample[a1idx,stepIdx]
          + b2Sample[a2idx,stepIdx]
          + bSSample[aSidx,stepIdx] )
} } } }
# b1b2Sample is the interaction deflection, i.e., the difference
# between the cell means and the linear combination:
b1b2Sample = apply( m12Sample - linpredSample , c(1,2,4) , mean )
# Convert from standardized b values to original scale b values:
b0Sample = b0Sample * ySDorig + yMorig
b1Sample = b1Sample * ySDorig
b2Sample = b2Sample * ySDorig
bSSample = bSSample * ySDorig
b1b2Sample = b1b2Sample * ySDorig

Exercise 19.4. [Purpose: To conduct a power analysis for Bayesian ANOVA, for within-subject versus between-subject designs.] Conduct power analyses for the between-subject and within-subject versions of the previous exercise.
Specifically, suppose the goal is for the 95% HDI of the contrast on factor 2 to have a width of 2.0 or less. Conduct a retrospective power analysis for this goal, for the within-subject version and the between-subject version. Caution: This exercise demands a lot of programming and could be time consuming, but the results drive home the point that within-subject designs can be more powerful than between-subject designs.
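For concreteness, the goal check at the heart of such a power analysis might look like the sketch below. HDIofMCMC here is a minimal stand-in for the book's utility function of the same name, and contrastSample is a made-up stand-in for an actual MCMC sample of the factor-2 contrast:

```r
# Compute the 95% HDI of a posterior sample and test whether its width
# meets the goal of 2.0 or less.
HDIofMCMC = function( sampleVec , credMass=0.95 ) {
  sortedPts = sort( sampleVec )
  ciIdxInc = floor( credMass * length( sortedPts ) )
  nCIs = length( sortedPts ) - ciIdxInc
  # Width of every interval containing credMass of the points;
  # the HDI is the narrowest such interval:
  ciWidth = sortedPts[ (1:nCIs) + ciIdxInc ] - sortedPts[ 1:nCIs ]
  HDImin = sortedPts[ which.min( ciWidth ) ]
  HDImax = sortedPts[ which.min( ciWidth ) + ciIdxInc ]
  c( HDImin , HDImax )
}
set.seed(47405)
contrastSample = rnorm( 10000 , mean=3 , sd=0.4 ) # hypothetical chain
HDIlim = HDIofMCMC( contrastSample )
goalAchieved = ( HDIlim[2] - HDIlim[1] ) <= 2.0
```

In the power analysis proper, this check is repeated across many simulated data sets, and the proportion of simulations with goalAchieved equal to TRUE estimates the power.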