Resampling Techniques: Introduction to Bootstrap Techniques with R

Ugarte, M. D.
Departamento de Estadística e I.O., Universidad Pública de Navarra, Pamplona, Spain
E-mail: lola@unavarra.es

Material from the book Probability and Statistics with R by Ugarte, Militino, and Arnholt. Chapman and Hall/CRC, 2008.

Brno, December 2007. Lola Ugarte.

Outline
* Introduction
* R Tools
* Bootstrap Paradigm
* Bootstrap Estimate of Standard Error
* Bootstrap Estimate of Bias
* Bootstrap Confidence Intervals

Introduction
The term bootstrapping is due to Efron (1979) and is an allusion to a German legend about Baron Münchhausen, who was able to lift himself out of a swamp by pulling himself up by his own hair. In later versions he used his own boot straps to pull himself out of the sea, which gave rise to the term bootstrapping. As improbable as it may seem, taking samples from the original data and using these resamples to calculate statistics can actually give more accurate answers than using the single original sample to calculate an estimate of a parameter.

Introduction (cont.)
In fact, resampling methods require fewer assumptions than traditional parametric methods and generally give more accurate answers. The price to pay is that bootstrap methods are computationally intensive. However, today's computers are many times faster than those of a generation ago. The fundamental concept in bootstrapping is the building of a sampling distribution for a particular statistic by resampling from the data at hand. In this sense, bootstrap methods can be either parametric or nonparametric; here, attention is focused exclusively on the nonparametric bootstrap.

Introduction (cont.)
Bootstrap methods offer the practitioner valuable tools for dealing with complex problems.
Even though resampling procedures rely on the power of the computer to perform simulations, they are based on old statistical principles such as populations, parameters, samples, sampling variation, pivotal quantities, and confidence intervals. For most students, the idea of a sampling distribution for a particular statistic is completely abstract; however, once work begins with the bootstrap distribution, the bootstrap analog of the sampling distribution, its concreteness promotes a conceptual understanding of the more abstract sampling distribution.

R Tools
R is free statistical software with many possibilities that allows the practitioner to use bootstrap techniques very easily. There are two important R packages:
* bootstrap, by Efron and Tibshirani (1993) (ported to R from S-PLUS by Friedrich Leisch).
* boot, by Angelo Canty (ported to R from S-PLUS by B. D. Ripley).
The boot package provides functions and data sets from the book Bootstrap Methods and Their Applications by Davison and Hinkley (1997).

The Bootstrap Paradigm
Suppose a random sample $X = (X_1, X_2, \ldots, X_n)$ is taken from an unknown probability distribution $F$, and the values $x = (x_1, x_2, \ldots, x_n)$ are observed. Using $x$, the parameter $\theta = t(F)$ is to be estimated. The traditional approach to estimating $\theta$ is to make some assumptions about the population structure and to derive the sampling distribution of $\hat{\theta}$ based on these assumptions. This, of course, assumes that the derivation of the sampling distribution of the statistic of interest has either been done or that the individual who needs to do the deriving has the mathematical acumen to do so. Often, the use of the bootstrap will be preferable to extensive mathematical calculations.

Bootstrap Paradigm (cont.)
In the bootstrap paradigm, the original sample, $x$, takes the place the population holds in the traditional approach.
Subsequently, a random sample of size $n$ is drawn from $x$ with replacement. The resampled values are called a bootstrap sample and are denoted $x^*$. Sampling with replacement means that after we randomly draw an observation from the original sample we put it back before drawing the next observation: think of drawing a number from a hat, then putting it back before drawing again. That is, given $x = \{4, 5, 6, 2, 8, 12\}$, one possible bootstrap sample $x^*$ might be $x^* = \{6, 6, 5, 12, 2, 8\}$. Some values from the original sample $x$ may appear once, more than once, or not at all in the bootstrap sample $x^*$.

Bootstrap Paradigm (cont.)
Remember that the star notation indicates that $x^*$ is not the original data set $x$ but rather a random sample of size $n$ drawn with replacement from $x$. In the classical approach, the sampling distribution of a statistic is built by collecting the values of the statistic from many samples. The bootstrap distribution of a statistic collects its values from many resamples. These values are used to estimate properties of the statistic of interest $s(x) = \hat{\theta}$.

Bootstrap Paradigm (cont.)
The fundamental bootstrap assumption is that the sampling distribution of the statistic under the unknown probability distribution $F$ may be approximated by the sampling distribution of $\hat{\theta}^*$ under the empirical probability distribution $\hat{F}$. Remember that the empirical probability distribution puts probability $1/n$ on each value $x_i$.

Bootstrap Paradigm (cont.)
The process of creating a bootstrap sample $x^*$ and a bootstrap estimate $\hat{\theta}^*$ of the parameter of interest is repeated $B$ times. The $B$ bootstrap estimates of $\theta$, the $\hat{\theta}^*$s, are subsequently used to estimate specific properties of the bootstrap sampling distribution of $\hat{\theta}^*$. There are a total of $\binom{2n-1}{n}$ distinct bootstrap samples.
Yet a reasonable estimate of the standard error of $\hat{\theta}^*$, $\hat{\sigma}_{\hat{\theta}^*} = \hat{se}_B$, can be achieved with only $B = 200$ bootstrap replications in most problems. For confidence intervals and quantile estimation, $B$ generally should be at least 999.

Bootstrap Estimate of Standard Error
Under the fundamental bootstrap assumption, we may write

$se_F(\hat{\theta}) = \sqrt{var_F(\hat{\theta})} \approx se_{\hat{F}}(\hat{\theta}^*)$

The algorithm described below allows us to calculate a good numerical approximation of $se_{\hat{F}}(\hat{\theta}^*)$.

Bootstrap Estimate of Standard Error
Drawing bootstrap samples is easily done on a computer. We just need a procedure that selects $n$ integers at random from $1$ to $n$, each with probability $1/n$: $i_1, \ldots, i_n$. The bootstrap sample corresponding to a single drawing is $x^* = (x_{i_1}, x_{i_2}, \ldots, x_{i_n})$.

In R the function sample does this task:

x <- c(10, 15, 25, 37, 48, 23, 44, 19, 32, 20)
set.seed(30)  # to reproduce the same result
indices <- sample(1:10, replace = TRUE)
indices
[1]  1  5  4  5  4  2  9  3 10  2
x.asterisco <- x[indices]
x.asterisco
[1] 10 48 37 48 37 15 32 25 20 15

Also (and easier):

set.seed(30)
sample(c(10, 15, 25, 37, 48, 23, 44, 19, 32, 20), replace = TRUE)
[1] 10 48 37 48 37 15 32 25 20 15

Bootstrap Estimate of Standard Error
* The bootstrap algorithm works by drawing independent bootstrap samples and calculating the corresponding statistic on each of them. The bootstrap standard error of a statistic is the standard deviation of the bootstrap distribution of that statistic.
* The result is called the bootstrap standard error and is denoted $\hat{se}_B$, where $B$ is the number of replications.
* To apply the bootstrap idea we must start with a statistic that estimates the parameter we are interested in.
We usually come up with a suitable statistic by appealing to another principle that we often apply without thinking: the plug-in principle.

THE PLUG-IN PRINCIPLE
To estimate a parameter, a quantity that describes the population, use the statistic that is the corresponding quantity for the sample.
* For example, to estimate $\mu$ we use $\bar{x}$, and to estimate the population standard deviation $\sigma$ we use the sample standard deviation $s$.
* The bootstrap idea itself is a form of the plug-in principle: substitute the sample data for the population, then draw samples (resamples) to mimic the process of building a sampling distribution.

Bootstrap Estimate of Standard Error
The general procedure for estimating the standard error of $\hat{\theta}^*$ is:
1. Generate $B$ independent bootstrap samples $x^{*1}, x^{*2}, \ldots, x^{*B}$, each consisting of $n$ values drawn with replacement from the original sample.
2. Compute the statistic of interest for each bootstrap sample $b$: $\hat{\theta}^*(b) = s(x^{*b})$, $b = 1, \ldots, B$.
3. Estimate the standard error of $\hat{\theta}$ by computing the sample standard deviation of the bootstrap replications $\hat{\theta}^*(b)$, $b = 1, 2, \ldots, B$:

$\hat{se}_B = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*(b) - \bar{\theta}^* \right)^2 \right]^{1/2}$, where $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*(b)$.

FIGURE 10.16: Graphical representation of the bootstrap. [Figure: from the empirical distribution $\hat{F}$, bootstrap samples of size $n$, $x^{*1}, x^{*2}, \ldots, x^{*B}$, yield bootstrap replications $\hat{\theta}^*(b) = s(x^{*b})$ and the estimate $\hat{se}_B$.]

Bootstrap Estimate of Standard Error
* The number of replications needed to calculate the bootstrap standard error rarely exceeds 200 (Efron and Tibshirani, 1993).
* The limit of $\hat{se}_B$ as $B$ goes to infinity is the ideal bootstrap estimate of $se_F(\hat{\theta})$.
* The fact that $\hat{se}_B$ is approximately equal to $se_{\hat{F}}(\hat{\theta}^*)$ when $B$ goes to infinity is similar to saying that the empirical standard deviation is approximately equal to the population standard deviation when the number of replications increases.
* The ideal bootstrap estimate $se_{\hat{F}}(\hat{\theta}^*)$ and its numerical approximation $\hat{se}_B$ are called nonparametric bootstrap estimates because they are based on $\hat{F}$, a nonparametric estimator of $F$.

Bootstrap Standard Error: Example
We generate 10 values from a N(0, 1). Think of the sample mean $\bar{X}$ as an estimator of $\mu$ (in this case the true value of $\mu$ is known and equal to 0). We know theoretically that the standard error of the sample mean is $se(\bar{X}) = \sqrt{\sigma^2/n}$, and the corresponding estimator of this standard error is $\hat{se}(\bar{X}) = \sqrt{S^2/n}$. Then the true standard error of $\bar{X}$ is $\sqrt{1/10} = 0.3162$. Let us calculate a bootstrap numerical approximation using B = 200 replications. The student should repeat this little experiment generating a sample of size 100. Fix the seed at 10 using the command set.seed(10) to be able to reproduce the results.

SOLUTION

set.seed(10)
n <- 10
muestra.original <- rnorm(n)
muestra.original
[1]  0.01874617 -0.18425254 -1.37133055 -0.59916772  0.29454513  0.38979430
[7] -1.20807618 -0.36367602 -1.62667268 -0.25647839
sqrt(var(muestra.original)/n)  # with the theoretical formula
[1] 0.2213258
B <- 200
muestras.bootstrap <- matrix(0, B, n)
estadistico.boot <- array(0, B)
i <- 1
while (i < (B + 1)) {
  muestras.bootstrap[i, ] <- sample(muestra.original, replace = TRUE)
  estadistico.boot[i] <- mean(muestras.bootstrap[i, ])
  i <- i + 1
}
error.estandar <- sd(estadistico.boot)
error.estandar
[1] 0.2059542

Bootstrap Estimate of Bias
A statistic is biased as an estimate of a population parameter if its sampling distribution is not centered at the true value of the parameter. We may check bias by seeing whether the bootstrap distribution of the statistic is centered at the value of the statistic for the original sample. More precisely, the bias of $\hat{\theta} = s(X)$ is the difference between the expected value of $\hat{\theta}$ and the true parameter value $\theta = t(F)$.
$\mathrm{Bias}(s(X) \mid F) = E_F[s(X)] - t(F)$

Bootstrap Estimate of Bias (cont.)
We may use the bootstrap to estimate the bias of any estimator $\hat{\theta}$ by writing $\hat{F}$ instead of $F$ in the previous expression:

$\mathrm{Bias}(s(X) \mid \hat{F}) = E_{\hat{F}}[s(X^*)] - t(\hat{F})$

In words, the bootstrap estimate of bias is the difference between the mean of the bootstrap distribution and the value of the statistic in the original sample. We calculate the bootstrap bias of $s(X)$ using $B$ resamples of the original sample:

$\widehat{\mathrm{Bias}}_B[s(X)] = \bar{\theta}^* - \hat{\theta}$, where $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*(b)$

Bootstrap Confidence Intervals
* With estimates of the standard error (standard deviation) and bias of some statistic of interest, various types of confidence intervals for the parameter $\theta$ can be constructed.
* Although exact confidence intervals for specific problems can be computed, most confidence intervals are approximate.
* The most common confidence interval for a parameter $\theta$ when $\hat{\theta}$ follows either a normal or approximately normal distribution is

$C.I._{1-\alpha}(\theta) = [\hat{\theta} - z_{1-\alpha/2}\,\hat{\sigma}_{\hat{\theta}},\; \hat{\theta} + z_{1-\alpha/2}\,\hat{\sigma}_{\hat{\theta}}]$

Bootstrap Confidence Intervals (cont.)
The student may remember that this interval is easily obtained from the distribution of $\hat{\theta} - \theta$ using the pivotal quantity

$\frac{\hat{\theta} - \theta}{\sqrt{var(\hat{\theta})}} \sim N(0, 1)$

The confidence interval described above works well when the distribution of $\hat{\theta} - \theta$ is normal, at least approximately, but this is not always the case. In some cases we may know the approximate normality holds but have difficulties deriving the variance of the estimator.

Bootstrap Confidence Intervals
* The confidence interval that we will call normal is a slight modification of the traditional CI that incorporates both a bootstrap adjustment for bias and a bootstrap estimate of the standard error. The normal confidence interval is calculated as

$C.I._{1-\alpha}(\theta) = [\hat{\theta} - \widehat{\mathrm{Bias}}_B(\hat{\theta}) - z_{1-\alpha/2}\,\hat{se}_B(\hat{\theta}),\; \hat{\theta} - \widehat{\mathrm{Bias}}_B(\hat{\theta}) + z_{1-\alpha/2}\,\hat{se}_B(\hat{\theta})]$

* To use this confidence interval it is convenient to check normality, for instance with a normal qq-plot of $\hat{\theta}^*_1 = s(x^{*1}), \ldots, \hat{\theta}^*_B = s(x^{*B})$.
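The bias estimate and the normal interval above can be sketched in a few lines of R. This is only an illustrative computation, assuming the statistic is the sample mean and reusing the ten values from the sample slide (any numeric vector would do):

```r
# Bootstrap bias and "normal" confidence interval for the mean,
# computed directly from the formulas on this slide.
set.seed(1)
x <- c(10, 15, 25, 37, 48, 23, 44, 19, 32, 20)
B <- 999
theta.hat <- mean(x)
# B bootstrap replications of the statistic
theta.star <- replicate(B, mean(sample(x, replace = TRUE)))
bias.B <- mean(theta.star) - theta.hat  # mean of bootstrap dist minus original estimate
se.B <- sd(theta.star)                  # bootstrap standard error
z <- qnorm(1 - 0.05/2)
normal.ci <- c(theta.hat - bias.B - z * se.B,
               theta.hat - bias.B + z * se.B)
round(normal.ci, 3)
```

A quick qqnorm(theta.star) shows whether the normality assumption behind this interval is tenable.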
Bootstrap Confidence Intervals
The basic bootstrap confidence interval is based on the idea that the quantity $\hat{\theta}^* - \hat{\theta}$ has roughly the same distribution as $\hat{\theta} - \theta$, so it is possible to approximate the percentiles of $\hat{\theta} - \theta$ by the percentiles of $\hat{\theta}^* - \hat{\theta}$:

$P\left( \hat{\theta}^*_{((B+1)\alpha/2)} - \hat{\theta} \le \hat{\theta} - \theta \le \hat{\theta}^*_{((B+1)(1-\alpha/2))} - \hat{\theta} \right) = 1 - \alpha$

And then

$P\left( 2\hat{\theta} - \hat{\theta}^*_{((B+1)(1-\alpha/2))} \le \theta \le 2\hat{\theta} - \hat{\theta}^*_{((B+1)\alpha/2)} \right) = 1 - \alpha$

$C.I._{1-\alpha}(\theta) = \left[ 2\hat{\theta} - \hat{\theta}^*_{((B+1)(1-\alpha/2))},\; 2\hat{\theta} - \hat{\theta}^*_{((B+1)\alpha/2)} \right]$

Bootstrap Confidence Intervals
* The bootstrap-t interval or Studentized interval is based on the idea of replacing the approximation of $Z = (\hat{\theta} - \theta)/\hat{se}(\hat{\theta})$ by the standard normal distribution $N(0,1)$ with a bootstrap approximation $Z^*(b) = (\hat{\theta}^*(b) - \hat{\theta})/\hat{se}^*(b)$, where $\hat{se}^*(b)$ is the standard error estimated within the $b$-th bootstrap sample.
* We generate $B$ bootstrap samples and compute $Z^*(b)$. The $p$ percentile of the distribution of $Z$ is approximated by the $(B+1)p$ percentile of the $Z^*(b)$ values.
* The interval takes the form

$C.I._{1-\alpha}(\theta) = \left[ \hat{\theta} - z^*_{((B+1)(1-\alpha/2))}\,\hat{\sigma}_{\hat{\theta}},\; \hat{\theta} - z^*_{((B+1)\alpha/2)}\,\hat{\sigma}_{\hat{\theta}} \right]$

* The notation $z^*_{(\mathrm{integer})}$ is used to denote the (integer)-th of the $B$ sorted $Z^*$ values. The values of $B$ and $\alpha$ are generally chosen so that $(B+1)\cdot\alpha/2$ is an integer. In cases where $(B+1)\cdot\alpha/2$ is not an integer, interpolation can be used. (Note that different programs use different interpolation techniques.)

Bootstrap Confidence Intervals
* The percentile confidence interval is based on the quantiles of the $B$ bootstrap replications of $s(X)$. Specifically, the $(1-\alpha)$ percentile confidence interval for $\theta$ uses the $\alpha/2$ and $1-\alpha/2$ quantiles of the $\hat{\theta}^*$ values to create a $(1-\alpha)\cdot 100\%$ confidence interval for $\theta$.
$C.I._{1-\alpha}(\theta) = \left[ \hat{\theta}^*_{((B+1)\alpha/2)},\; \hat{\theta}^*_{((B+1)(1-\alpha/2))} \right]$

Bootstrap Confidence Intervals
* At this point, a reasonable question is which confidence interval is recommended for general use, since the normal confidence interval is based on large-sample properties, the bootstrap-t confidence interval is not recommended if the bootstrap distribution is not normal or shows substantial bias, and the percentile and basic bootstrap confidence interval formulas give different answers when the distribution of $\hat{\theta}^*$ is skewed.
* In fact, the answer is to use none of the confidence intervals discussed thus far. The bootstrap confidence interval procedure recommended for general use is the BCa method, which stands for bias-corrected and accelerated.
* The bottom line is that there are theoretical reasons to prefer the BCa confidence interval over the normal, percentile, and basic bootstrap confidence intervals.

Bootstrap Confidence Intervals (cont.)
* It is possible that both the percentile and basic methods provide confidence intervals not centered around the true value of the parameter. Only if the bootstrap distribution is symmetric around $\hat{\theta}$ do all of the methods provide the same results. Otherwise, the BCa CI corrects for bias and skewness.
* The underlying idea of the BCa CI is to assume that there exists a transformation of $\hat{\theta}$ whose distribution is normal, with mean and standard error depending on $\theta$. One then derives an interval for the transformed parameter and back-transforms the confidence limits to obtain an interval for $\theta$.
* The most interesting thing is that it is possible to calculate the interval without knowing the explicit form of the transformation by using the bootstrap.
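The basic, percentile, and bootstrap-t formulas can all be read off the sorted replications. The sketch below is illustrative only (simulated right-skewed data, statistic = mean, B chosen so that $(B+1)\alpha/2$ is an integer); the Studentized limits are written in the reflected form used by boot.ci:

```r
# Basic, percentile, and bootstrap-t intervals for the mean, by hand.
set.seed(2)
x <- rexp(30, rate = 1/8)       # illustrative right-skewed sample
n <- length(x)
B <- 999; alpha <- 0.05
lo <- (B + 1) * alpha/2          # 25th ordered value
hi <- (B + 1) * (1 - alpha/2)    # 975th ordered value
theta.hat <- mean(x)
se.hat <- sd(x)/sqrt(n)
theta.star <- numeric(B); z.star <- numeric(B)
for (b in 1:B) {
  xb <- sample(x, replace = TRUE)
  theta.star[b] <- mean(xb)
  # studentize each replication with its own standard error
  z.star[b] <- (mean(xb) - theta.hat)/(sd(xb)/sqrt(n))
}
theta.star <- sort(theta.star); z.star <- sort(z.star)
percentile.ci <- c(theta.star[lo], theta.star[hi])
basic.ci <- c(2*theta.hat - theta.star[hi], 2*theta.hat - theta.star[lo])
t.ci <- c(theta.hat - z.star[hi]*se.hat, theta.hat - z.star[lo]*se.hat)
percentile.ci; basic.ci; t.ci
```

For skewed data the three intervals differ visibly, which is exactly the point made above.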
Bootstrap Confidence Intervals
To compute a BCa interval for $\theta$, $C.I._{1-\alpha}(\theta) = [\hat{\theta}^*_{lower},\; \hat{\theta}^*_{upper}]$, first compute the bias correction factor $z$, where

$z = \Phi^{-1}\left( \frac{1}{B} \sum_{b=1}^{B} I\left( \hat{\theta}^*(b) \le \hat{\theta} \right) \right)$

$\Phi^{-1}$ is the inverse of the cumulative distribution function of the standard normal distribution and $I$ is the indicator function.
* Provided the estimated bootstrap distribution of $\hat{\theta}^*$ is symmetric with respect to $\hat{\theta}$, and if $\hat{\theta}$ is unbiased, then $\frac{1}{B} \sum_{b=1}^{B} I(\hat{\theta}^*(b) \le \hat{\theta})$ will be close to 0.5, and the bias correction factor $z$ will be close to zero since $\Phi^{-1}(0.5) = 0$. Using R, qnorm(0.5) returns 0.

Bootstrap Confidence Intervals (cont.)
Next, we compute the skewness correction factor

$a = \frac{\sum_{i=1}^{n} \left( \bar{\theta}_{(\cdot)} - \hat{\theta}_{(-i)} \right)^3}{6 \left[ \sum_{i=1}^{n} \left( \bar{\theta}_{(\cdot)} - \hat{\theta}_{(-i)} \right)^2 \right]^{3/2}}$

where $\hat{\theta}_{(-i)}$ is the value of $\hat{\theta} = s(X)$ when the $i$-th value is deleted from the sample of $n$ values and $\bar{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(-i)}$. Using $z$ and $a$, we compute

$\alpha_1 = \Phi\left( z + \frac{z + z_{\alpha/2}}{1 - a(z + z_{\alpha/2})} \right)$ and $\alpha_2 = \Phi\left( z + \frac{z + z_{1-\alpha/2}}{1 - a(z + z_{1-\alpha/2})} \right)$

The BCa confidence interval for $\theta$ is

$C.I._{1-\alpha}(\theta) = \left[ \hat{\theta}^*_{((B+1)\alpha_1)},\; \hat{\theta}^*_{((B+1)\alpha_2)} \right]$

Example
The data set Times contains interarrival times of cars passing Junction 13 on Saturday 23 March 1985.

> hist(Times, prob = TRUE)
> mean(Times)
[1] 7.8
> sd(Times)
[1] 7.871402
> lamb <- 1/mean(Times)
> x <- seq(0, 35, length = 800)
> f <- lamb * exp(-lamb * x)
> lines(x, f, lwd = 1)

[Figure: histogram of Times on (0, 35) with the fitted exponential density overlaid]

Example (cont.)
* An approximate Poisson process for the number of cars passing Junction 13 on Saturday 23 March 1985 seems to be present, with $\bar{x} = 1/\hat{\lambda} = 7.8$.
* Recall that the waiting time between outcomes in a Poisson process has an exponential distribution. Note that the interarrival times seem to be fit well by an exponential density with $\lambda = 1/7.8$.
* Recall that the mean and standard deviation of the exponential are both $1/\lambda$.

Example (cont.)
The bootstrap confidence intervals are now constructed for the mean using the function boot.
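Before turning to the boot package, the BCa recipe (bias factor $z$, acceleration $a$, adjusted levels $\alpha_1$, $\alpha_2$) can be sketched by hand. This is an illustrative implementation on simulated data, not the Times data and not the boot package's code; boot.ci computes the same kind of interval with type = "bca", using a slightly different ordered-value rule, so small numerical differences are expected:

```r
# BCa interval for the mean, following the z, a, alpha1, alpha2 formulas.
set.seed(3)
x <- rexp(25, rate = 1/8)   # illustrative sample; stands in for real data
n <- length(x)
B <- 1999; alpha <- 0.05
theta.hat <- mean(x)
theta.star <- replicate(B, mean(sample(x, replace = TRUE)))
# bias-correction factor z: proportion of replications below the estimate
z <- qnorm(mean(theta.star <= theta.hat))
# acceleration factor a from the jackknife (leave-one-out) means
theta.jack <- sapply(1:n, function(i) mean(x[-i]))
d <- mean(theta.jack) - theta.jack
a <- sum(d^3) / (6 * sum(d^2)^(3/2))
# adjusted tail levels alpha1 and alpha2
z.lo <- qnorm(alpha/2); z.hi <- qnorm(1 - alpha/2)
alpha1 <- pnorm(z + (z + z.lo)/(1 - a*(z + z.lo)))
alpha2 <- pnorm(z + (z + z.hi)/(1 - a*(z + z.hi)))
bca.ci <- quantile(theta.star, c(alpha1, alpha2))
bca.ci
```

When the bootstrap distribution is symmetric and unbiased, $z \approx 0$ and $a \approx 0$, and the interval collapses to the ordinary percentile interval.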
To use the package boot we first need to build the following function:

library(boot)
times.fun <- function(data, i) {
  media <- mean(data[i])              # mean of each bootstrap sample
  n <- length(i)
  v <- (n - 1) * var(data[i]) / n^2   # variance of the sample mean;
                                      # needed only for the bootstrap-t CI
  c(media, v)
}

Example (cont.)
Set the number of bootstrap replications B to 9999 and generate the bootstrap distribution of $\bar{X}$, denoted $t^*$ when using the boot package. Note that in boot the argument R is the number of bootstrap replications, which we denoted B before, so R is set equal to B. A random seed value of 10 is used so the reader can reproduce the results.

B <- 9999
set.seed(10)
b.obj <- boot(Times, times.fun, R = B)
> b.obj

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Times, statistic = times.fun, R = B)

Bootstrap Statistics :
      original        bias    std. error
t1*  7.80000  -0.01249375   1.2149888   # mean
t2*  1.51025  -0.04141159   0.4712177   # variance of the mean

Example (cont.)
* Let us now represent the bootstrap distribution of the mean ($t^*$); we will observe from the histogram and the normal qq-plot that the distribution is slightly skewed to the right.
* Hence, there will be small differences between the alternative confidence intervals.
* Type in R: plot(b.obj)

[Figure: histogram of $t^*$ and normal quantile-quantile plot of the bootstrap replications]

Example (cont.)
Next, use the function boot.ci on the object b.obj to create the five types of bootstrapped confidence intervals.
> boot.ci(b.obj)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates

CALL :
boot.ci(boot.out = b.obj)

Intervals :
Level      Normal              Basic             Studentized
95%   ( 5.431, 10.194 )   ( 5.275, 10.050 )   ( 5.681, 11.070 )

Level     Percentile            BCa
95%   ( 5.550, 10.325 )   ( 5.800, 10.700 )
Calculations and Intervals on Original Scale

Exact Confidence Interval
Since $2n\bar{X}/\theta \sim \chi^2_{2n}$, we may write

$P\left( \chi^2_{2n;\alpha/2} \le \frac{2n\bar{X}}{\theta} \le \chi^2_{2n;1-\alpha/2} \right) = 1 - \alpha$

From the above expression, it follows that

$P\left( \frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}} \le \theta \le \frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}} \right) = 1 - \alpha$

Then a $1-\alpha$ confidence interval for $\theta$ is

$C.I._{1-\alpha}(\theta) = \left[ \frac{2n\bar{X}}{\chi^2_{2n;1-\alpha/2}},\; \frac{2n\bar{X}}{\chi^2_{2n;\alpha/2}} \right]$

Exact Confidence Interval
In R:

> lower <- 2*length(Times)*mean(Times)/qchisq(0.975, 2*length(Times))
> upper <- 2*length(Times)*mean(Times)/qchisq(0.025, 2*length(Times))
> intervalo <- round(c(lower, upper), 3)
> intervalo
[1]  5.852 10.918

Comparison of Results

Method             Lower Limit   Upper Limit
Normal                   5.431        10.194
Basic                    5.275        10.050
t or Studentized         5.681        11.070
Percentile               5.550        10.325
BCa                      5.800        10.700
Exact                    5.852        10.918

More Bibliography
Books:
* An Introduction to the Bootstrap, B. Efron and R. J. Tibshirani, Chapman and Hall, 1993.
* Bootstrap Methods and Their Application, A. C. Davison and D. V. Hinkley, Cambridge University Press, 1997.
* Randomization, Bootstrap and Monte Carlo Methods in Biology, B. F. J. Manly, Chapman and Hall, 1997.