Statistics in Computer Science
Seminar Exercises
Stanislav Katina, Mojmir Vinkler
October 15, 2016

1 Location characteristics

Exercise 1 (Code Vectorization). Implement two versions of the mean function in R. The first one, mean_slow = function(x), should use a for loop to sum the numbers; the second one, mean_fast = function(x), should use the built-in function sum. Generate a very long vector of uniformly distributed random numbers and compare the performance of both functions and of the built-in function mean.
Hints. Use runif(...) for generating random numbers and system.time({...}) for profiling. A sketch of one possible solution is given at the end of this section.

Realizations will be denoted $x_1, x_2, \ldots, x_n$ and sorted realizations $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. We then define the following estimators of location characteristics (sample location characteristics):

• sample minimum $X_{\min}$, with realization $x_{\min} = x_{(1)}$;
• sample maximum $X_{\max}$, with realization $x_{\max} = x_{(n)}$;
• sample (arithmetic) mean $\bar X$, with realization
\[
\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{j=1}^{n_j} x_j f_j, \qquad n_j \le n,
\]
where $f_j$ are the frequencies (counts) of the distinct values $x_j$ and $n = \sum_j f_j$;
• sample mode $X_{\mathrm{mod}}$, whose realization $x_{\mathrm{mod}}$ is the most common value (for a discrete variable it is the value $x$ at which the probability function attains its maximum; for a continuous variable it is the value $x$ at which the density attains its maximum);
• sample median $\tilde X$ (a robust estimator of location), with realization
\[
\tilde x = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even;} \end{cases}
\]
the distribution is symmetric if $\bar x = \tilde x = x_{\mathrm{mod}}$, positively (right) skewed if $\bar x > \tilde x > x_{\mathrm{mod}}$, and negatively (left) skewed if $\bar x < \tilde x < x_{\mathrm{mod}}$;
• sample quartiles, of which there are three:
  – the sample first (lower) quartile $Q_1$, with realization $x_{0.25}$, a value that splits off the lowest 25 % of the data from the highest 75 %, $\Pr[x_{\min}, x_{0.25}] = \Pr[X \le x_{0.25}] = \frac{1}{4}$, $\Pr[x_{0.25}, x_{\max}] = \Pr[X \ge x_{0.25}] = \frac{3}{4}$;
  – the sample second quartile (median) $Q_2$, with realization $x_{0.5} = \tilde x$, a value that splits off the lowest 50 % of the data from the highest 50 %, $\Pr[x_{\min}, x_{0.5}] = \Pr[X \le x_{0.5}] = \frac{1}{2}$, $\Pr[x_{0.5}, x_{\max}] = \Pr[X \ge x_{0.5}] = \frac{1}{2}$;
  – the sample third (upper) quartile $Q_3$, with realization $x_{0.75}$, a value that splits off the lowest 75 % of the data from the highest 25 %, $\Pr[x_{\min}, x_{0.75}] = \Pr[X \le x_{0.75}] = \frac{3}{4}$, $\Pr[x_{0.75}, x_{\max}] = \Pr[X \ge x_{0.75}] = \frac{1}{4}$;
• sample deciles $X_k$, with realizations $x_k$, which split the data into ten buckets, i.e. $k/10$ of the data lie below the $k$-th decile and $(10-k)/10$ lie above it, where $k \in \{0, 1, \ldots, 10\}$;
• sample percentile $X_p$ (read as the $100p$-th percentile), with realization $x_p$ defined as
\[
x_p = \begin{cases} x_{(k+1)} & \text{for } k < np, \\ \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right) & \text{for } k = np, \end{cases}
\]
where $k = \lfloor np \rfloor$ is the floor of $np$;
• sample five-number summary $(X_{\min}, Q_1, Q_2, Q_3, X_{\max})^T$, with realizations $(x_{\min}, x_{0.25}, x_{0.50}, x_{0.75}, x_{\max})^T$.

Robust location characteristics (resistant to outliers) are
• the sample $\gamma$-truncated arithmetic average $\bar X_g$, with realization $\bar x_g$ calculated as
\[
\bar x_g = \frac{1}{n - 2g}\left(x_{(g+1)} + x_{(g+2)} + \ldots + x_{(n-g)}\right),
\]
where $g = \lfloor \gamma n \rfloor$ and $\gamma = 0.1, 0.2$; more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-truncated average to become arbitrarily large or small relative to the original (the breakdown point of $\bar x_g$ is therefore $\gamma$);
• the sample $\gamma$-winsorized arithmetic average $\bar X_w$, with realization $\bar x_w$ defined as
\[
\bar x_w = \frac{1}{n}\left((g+1)\,x_{(g+1)} + x_{(g+2)} + \ldots + (g+1)\,x_{(n-g)}\right);
\]
more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-winsorized average to become arbitrarily large or small relative to the original (the breakdown point of $\bar x_w$ is therefore $\gamma$).

Note on the breakdown point: the breakdown point represents the number of observations that must be changed in order to significantly change the value of a location characteristic. For the $\gamma$-truncated and $\gamma$-winsorized arithmetic averages it is $\gamma n$ observations, for the median $n/2$ observations, and for the simple arithmetic average changing just one observation is enough (that is why we say the arithmetic average is very sensitive to outliers).
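Exercise 1 asks for a comparison of a loop-based and a vectorized mean. A minimal sketch of one possible solution is given below; the vector length 1e7 is an arbitrary illustrative choice.

mean_slow <- function(x) {
  s <- 0
  for (xi in x) {        # explicit loop over every element
    s <- s + xi
  }
  s / length(x)
}

mean_fast <- function(x) {
  sum(x) / length(x)     # vectorized built-in sum()
}

x <- runif(1e7)          # a very long vector of uniform random numbers

system.time({ mean_slow(x) })  # loop version, typically the slowest
system.time({ mean_fast(x) })  # vectorized version
system.time({ mean(x) })       # built-in mean() for comparison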
Exercise 2 (height of 10-year-old girls). We have $n = 12$ heights (in cm) of randomly sampled 10-year-old girls, sorted by height (the order of $x_{(i)}$ is denoted $r_i$; in case of tied values, $r_i$ is calculated as the average of their order numbers).

Table 1: Sorted realizations $x_{(i)}$ and their orders $r_i$ for the heights of 10-year-old girls

  i       1    2    3    4    5    6    7    8    9    10    11    12
  x_(i)   131  132  135  141  141  141  141  142  143  146   146   151
  r_i     1    2    3    5.5  5.5  5.5  5.5  8    9    10.5  10.5  12

Then $\bar x \doteq 140.83$, $\tilde x = \frac{1}{2}\left(x_{(6)} + x_{(7)}\right) = 141$, $x_{0.25} = \frac{1}{2}\left(x_{(3)} + x_{(4)}\right) = 138$, where $k = 12 \times 0.25 = 3$, and $Q_3 = x_{0.75} = \frac{1}{2}\left(x_{(9)} + x_{(10)}\right) = 144.5$, where $k = 12 \times 0.75 = 9$.

Write functions for the calculation of all location characteristics and verify them against the values above. Do not use built-in functions for location characteristics such as mean, quantile, etc. Use $\gamma = 0.1$ for the truncated and winsorized arithmetic averages.
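For Exercise 2, the functions might be sketched as follows; they follow the definitions from Section 1 (the percentile rule with $k = \lfloor np \rfloor$ and $g = \lfloor \gamma n \rfloor$), avoid mean, median and quantile, and the names my_mean, my_quantile, my_trimmed_mean and my_winsorized_mean are only illustrative.

heights <- c(131, 132, 135, 141, 141, 141, 141, 142, 143, 146, 146, 151)

my_mean <- function(x) sum(x) / length(x)

# sample 100p-th percentile with k = floor(n * p)
my_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  k <- floor(n * p)
  if (k == n * p) (x[k] + x[k + 1]) / 2 else x[k + 1]
}

my_median <- function(x) my_quantile(x, 0.5)

# gamma-truncated average: drop the g smallest and g largest values
my_trimmed_mean <- function(x, gamma = 0.1) {
  x <- sort(x)
  n <- length(x)
  g <- floor(gamma * n)
  sum(x[(g + 1):(n - g)]) / (n - 2 * g)
}

# gamma-winsorized average: weight x_(g+1) and x_(n-g) by (g + 1)
my_winsorized_mean <- function(x, gamma = 0.1) {
  x <- sort(x)
  n <- length(x)
  g <- floor(gamma * n)
  ((g + 1) * x[g + 1] + sum(x[(g + 2):(n - g - 1)]) + (g + 1) * x[n - g]) / n
}

my_mean(heights)            # 140.8333
my_median(heights)          # 141
my_quantile(heights, 0.25)  # 138
my_quantile(heights, 0.75)  # 144.5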
2 Spread (variability) characteristics

We then define the following estimators of spread (variability) characteristics (sample spread characteristics):

• sample variance $S^2$, with realization
\[
s^2 = s^2_{n-1} = s^2_x = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2;
\]
under a linear transformation the sample variance changes as $s^2_y = s^2_{a+bx} = b^2 s^2_x$, i.e. the variance of a shifted and rescaled variable $y$ equals the square of the scale times the variance of the original variable $x$, because
\[
s^2_y = s^2_{a+bx} = \frac{1}{n-1}\sum_{i=1}^{n} \left(a + b x_i - \overline{a + b x}\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(a + b x_i - (a + b \bar x)\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(b (x_i - \bar x)\right)^2 = b^2 s^2_x;
\]
• sample standard deviation $S$, with realization $s = s_{n-1} = s_x = \sqrt{s^2_x}$; under a linear transformation the standard deviation changes as $s_y = s_{a+bx} = |b|\,s_x$;
• coefficient of variation $V_k$, with realization $v_k = s_x / \bar x$, a normalized form of the standard deviation (the inverse of the signal-to-noise ratio; the ratio of variability to the mean); it is usually reported in percentage points, i.e. $100 \times (s_x/\bar x)\,\%$, and can be used only for realizations with positive values;
• sample variance of the arithmetic average $S^2_{\bar X}$, with realization $s^2_{\bar x} = s^2_x / n$;
• sample standard deviation of the arithmetic average (sample standard error) $S_{\bar X}$, with realization $s_{\bar x} = s_x / \sqrt{n}$;
• sample skewness $B_1$, with realization
\[
b_1 = \frac{n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^3}{\left(n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar x)^3}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{3/2}};
\]
the distribution is symmetric if $b_1 = 0$, positively skewed (the density is steeper on the left side than on the right side) if $b_1 > 0$, and negatively skewed (the density is steeper on the right side than on the left side) if $b_1 < 0$;
• sample kurtosis $B_2$, with realization
\[
b_2 = \frac{n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^4}{\left(n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{2}} - 3 = \frac{n\,\sum_{i=1}^{n} (x_i - \bar x)^4}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{2}} - 3;
\]
the distribution is normal (mesokurtic) if $b_2 = 0$, pointy (leptokurtic) if $b_2 > 0$, and flat (platykurtic) if $b_2 < 0$;
• sample sum of squares $SS = \sum_{i=1}^{n} (X_i - \bar X)^2$, with realization $SS_{\mathrm{obs}} = \sum_{i=1}^{n} (x_i - \bar x)^2$;
• sample sum of absolute deviations $SAD = \sum_{i=1}^{n} |X_i - X_{0.5}|$, with realization $SAD_{\mathrm{obs}} = \sum_{i=1}^{n} |x_i - x_{0.5}|$;
• sample arithmetic average deviation $MAD = \frac{1}{n}\sum_{i=1}^{n} |X_i - X_{0.5}|$, with realization $MAD_{\mathrm{obs}} = SAD_{\mathrm{obs}}/n$;
• sample range $D = X_{\max} - X_{\min}$, with realization $d_{\mathrm{obs}} = x_{\max} - x_{\min}$;
• sample interquartile range $D_Q = Q_3 - Q_1$, with realization $d_Q = x_{0.75} - x_{0.25}$; the distribution is (between the quartiles) symmetric if $x_{0.75} - x_{0.50} = x_{0.50} - x_{0.25}$, positively skewed if $x_{0.75} - x_{0.50} > x_{0.50} - x_{0.25}$, and negatively skewed if $x_{0.75} - x_{0.50} < x_{0.50} - x_{0.25}$;
• sample decile range $D_D = X_{0.9} - X_{0.1}$, with realization $d_D = x_{0.9} - x_{0.1}$;
• sample percentile range $D_P = X_{0.99} - X_{0.01}$, with realization $d_P = x_{0.99} - x_{0.01}$.

Robust spread (variability) characteristics are
• the sample $\gamma$-truncated variance $S^2_g$, with realization $s^2_g$ calculated as
\[
s^2_g = \frac{1}{n - 2g - 1}\sum_{i=g+1}^{n-g} \left(x_{(i)} - \bar x_g\right)^2;
\]
more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-truncated variance to become arbitrarily large or small relative to the original $s^2$ (the breakdown point of $s^2_g$ is $\gamma$); it holds that $s^2_g < s^2$ because truncation removes outliers;
• the sample $\gamma$-winsorized variance $S^2_w$, with realization $s^2_w$; more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-winsorized variance to become arbitrarily large or small relative to the original $s^2$ (the breakdown point of $s^2_w$ is $\gamma$); it holds that $s^2_w < s^2$ because winsorization pulls outliers closer to the mean;
• the sample quartile coefficient of variation $V_{k,Q} = (Q_3 - Q_1)/Q_2$, with realization
\[
v_{k,Q} = \frac{x_{0.75} - x_{0.25}}{x_{0.50}}.
\]
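Before the remaining robust characteristics and Exercise 3, here is a possible sketch of the basic spread characteristics defined above (variance, standard deviation, skewness and kurtosis), again written without the corresponding built-in functions; the heights vector reuses the data from Exercise 2 and the my_* names are only illustrative.

heights <- c(131, 132, 135, 141, 141, 141, 141, 142, 143, 146, 146, 151)

my_var <- function(x) {
  m <- sum(x) / length(x)
  sum((x - m)^2) / (length(x) - 1)   # s^2 with denominator n - 1
}

my_sd <- function(x) sqrt(my_var(x))

# sample skewness b1 = sqrt(n) * sum(d^3) / (sum(d^2))^(3/2), d = x - mean
my_skewness <- function(x) {
  n <- length(x)
  d <- x - sum(x) / n
  sqrt(n) * sum(d^3) / sum(d^2)^(3/2)
}

# sample (excess) kurtosis b2 = n * sum(d^4) / (sum(d^2))^2 - 3
my_kurtosis <- function(x) {
  n <- length(x)
  d <- x - sum(x) / n
  n * sum(d^4) / sum(d^2)^2 - 3
}

my_var(heights)       # sample variance s^2
my_sd(heights)        # sample standard deviation s
my_skewness(heights)  # b1
my_kurtosis(heights)  # b2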
Other robust spread characteristics, characterized by modified boundaries, are
• the sample robust minimum and maximum ("inner boundaries") $X^*_{\min} = B_D = Q_1 - 1.5\,D_Q$ and $X^*_{\max} = B_H = Q_3 + 1.5\,D_Q$, with realizations defined as
\[
x^*_{\min} = b_D = x_{0.25} - 1.5\,(x_{0.75} - x_{0.25}), \qquad x^*_{\max} = b_H = x_{0.75} + 1.5\,(x_{0.75} - x_{0.25});
\]
values outside these boundaries are considered suspicious, potential outliers;
• the sample robust minimum and maximum ("outer boundaries") defined as $B^*_D = Q_1 - 3(Q_3 - Q_1)$ and $B^*_H = Q_3 + 3(Q_3 - Q_1)$, with realizations $b^*_D = x_{0.25} - 3\,d_Q$ and $b^*_H = x_{0.75} + 3\,d_Q$;
  – if there are any $x_i < b^*_D \vee x_i > b^*_H$, we call them distant values (the symbol $\vee$ means "or" and the symbol $\wedge$ means "and");
  – if $x_i \in [b^*_D, b_D) \vee (b_H, b^*_H]$, these are outer values;
  – if $x_i \in [b_D, b_H]$, these are inner values, or values close to the median;
  – the normal distribution has the property $B_H - B_D = (Q_3 + 1.5\,D_Q) - (Q_1 - 1.5\,D_Q) = 4\,D_Q \doteq 4.2$; the probability that $x_i \notin [B_D, B_H]$ is then 0.04;
• the sample robust skewness $B_{1Q}$ and $B_{1O}$ and their variances under asymptotic normality $\mathrm{Var}_{as}(B_{1\cdot})$, where $\cdot = Q$ or $O$, with realizations defined as
  – the quartile skewness
\[
b_{1Q} = \frac{(x_{0.75} - x_{0.50}) - (x_{0.50} - x_{0.25})}{x_{0.75} - x_{0.25}}, \qquad \mathrm{Var}_{as}(b_{1Q}) = 1.84,
\]
  – the octile skewness
\[
b_{1O} = \frac{(x_{0.875} - x_{0.50}) - (x_{0.50} - x_{0.125})}{x_{0.875} - x_{0.125}}, \qquad \mathrm{Var}_{as}(b_{1O}) = 1.15.
\]

Exercise 3 (height of 10-year-old girls). Calculate all spread characteristics for the sample of heights of 10-year-old girls.

3 Basics of Probability

Exercise 4 (Simple random sample). In a simple random sample of size $n$ from a population of finite size $N$, each element has an equal probability of being chosen. If we avoid choosing any member of the population more than once, we call it a simple random sample without replacement, i.e. an $n$-combination without replacement from the $N$ members of the set $M$ (Dalgaard 2008). If we put a member back into the population after choosing it, we talk about a simple random sample with replacement, i.e. an $n$-combination with replacement from the $N$ members of the set $M$. Let $M$ be a set with $N = 10$ elements from which we want to choose $n = 3$ elements (a) without replacement and (b) with replacement. How many combinations are there? What do these combinations look like if $M = \{1, 2, \ldots, 10\}$? Do the same for $N = 100$, $n = 30$ and the set $M = \{1, 2, \ldots, 100\}$.

Solution without code:
(a) The number of combinations is $\binom{N}{n}$. If $N = 10$ and $n = 3$, then $\binom{N}{n} = \frac{N!}{(N-n)!\,n!} = \binom{10}{3} = 120$.
(b) The number of combinations with replacement is $\binom{N+n-1}{n}$. If $N = 10$ and $n = 3$, then $\binom{N+n-1}{n} = \frac{(N+n-1)!}{(N-1)!\,n!} = \binom{10+3-1}{3} = 220$. If $N = 100$ and $n = 30$, then $\binom{N+n-1}{n} = \binom{100+30-1}{30} \doteq 2.009491 \times 10^{29}$.

Hints. choose(n,k), combn(n,k) (requires the library utils), sample(x=..., size=..., replace=...)

Exercise 5 (Simple random sample). A group of people is labeled by identification numbers (ID) from 1 to 30. Choose (a) 5 people out of 30 at random without replacement, (b) 5 people out of 30 at random with replacement, and finally (c) 5 people out of 30 at random without replacement, where people with IDs 28 to 30 have a 4× higher probability of being chosen than people with IDs 1 to 27.
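For Exercises 4 and 5, one possible sketch using choose, combn and sample is shown below; the probability weights in part (c) are passed through the prob argument of sample, and set.seed is used only so that the draws are reproducible.

# Exercise 4: numbers of combinations
choose(10, 3)             # 120 combinations without replacement
choose(10 + 3 - 1, 3)     # 220 combinations with replacement
choose(100 + 30 - 1, 30)  # approx. 2.009491e+29

# listing the combinations themselves (feasible only for small N)
combn(1:10, 3)            # each column is one 3-element combination

# Exercise 5: sampling IDs 1 to 30
set.seed(1)                              # reproducibility only
sample(1:30, size = 5, replace = FALSE)  # (a) without replacement
sample(1:30, size = 5, replace = TRUE)   # (b) with replacement
w <- c(rep(1, 27), rep(4, 3))            # IDs 28 to 30 are 4x more likely
sample(1:30, size = 5, replace = FALSE, prob = w)  # (c) weighted, without replacement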
Exercise 6 (Normal distribution). Let $X$ be a random variable (it could represent, for example, adult height) and assume it is normally distributed with parameters $\mu$ (expectation, or mean) and $\sigma^2$ (variance), which is written as $X \sim N(\mu, \sigma^2)$, with $\mu = 140.83$ and $\sigma^2 = 33.79$. The normal distribution is a probability model for this random variable. Calculate the probability
\[
\Pr(a < X < b) = \Pr(X < b) - \Pr(X < a) = F_X(b) - F_X(a),
\]
where $a = \mu - k\sigma$, $b = \mu + k\sigma$, $k = 1, 2, 3$. (Note that $\Pr(a < X < b) = \Pr(a \le X \le b)$, because the probability of a single point (here $a$ or $b$) is zero for continuous random variables, i.e. $\Pr(X = a) = \Pr(X = b) = 0$; this does not apply to discrete random variables.) Write a function that takes the parameters $\mu$, $\sigma$, $a$ and $b$ and calculates the probability $\Pr(a < X < b)$.

Partial solution:
$a = \mu - \sigma = 135.0171$, $b = \mu + \sigma = 146.6429$, $\Pr(|X - \mu| > \sigma) = 0.3173$, $\Pr(|X - \mu| < \sigma) = 1 - 0.3173 = 0.6827$,
$a = \mu - 2\sigma = 129.2042$, $b = \mu + 2\sigma = 152.4558$, $\Pr(|X - \mu| > 2\sigma) = 0.0455$, $\Pr(|X - \mu| < 2\sigma) = 1 - 0.0455 = 0.9545$,
$a = \mu - 3\sigma = 123.3913$, $b = \mu + 3\sigma = 158.2687$, $\Pr(|X - \mu| > 3\sigma) = 0.0027$, $\Pr(|X - \mu| < 3\sigma) = 1 - 0.0027 = 0.9973$.

mu  <- 0
sig <- 1
bin <- seq(mu - 3 * sig, mu + 3 * sig, by = sig)
pnorm(bin[7]) - pnorm(bin[1])  # 0.9973002
pnorm(bin[6]) - pnorm(bin[2])  # 0.9544997
pnorm(bin[5]) - pnorm(bin[3])  # 0.6826895

The probabilities 68.27 − 95.45 − 99.73 (in %) are called the empirical rule, or the 3-sigma rule.

Exercise 7 (Normal distribution). Let $X \sim N(\mu, \sigma^2)$, where $\mu = 150$ and $\sigma^2 = 6.25$. Calculate $a = \mu - x_{1-\alpha/2}\,\sigma$ and $b = \mu + x_{1-\alpha/2}\,\sigma$ so that $\Pr(a \le X \le b) = 1 - \alpha$ equals 0.90, 0.95 and 0.99. The value $x_{1-\alpha/2}$ is a quantile of the standardized normal distribution, i.e. $\Pr\left(Z = \frac{X-\mu}{\sigma} < x_{1-\alpha/2}\right) = 1 - \alpha/2$, $Z \sim N(0, 1)$. Similarly to the previous exercise, write a function that takes the parameters $\mu$, $\sigma$ and $\alpha$ and returns the values $a$ and $b$ as a vector.
Hints. qnorm(alpha)

This gives us the 90 − 95 − 99 rule (the so-called adjusted 3-sigma rule). We used the property
\[
\Pr\left(u_{\alpha/2} < Z < u_{1-\alpha/2}\right) = \Phi\left(u_{1-\alpha/2}\right) - \Phi\left(u_{\alpha/2}\right) = 1 - \alpha,
\]
where $\Phi$ is the cumulative distribution function of the standardized normal distribution and in general $\alpha \in (0, 1/2)$; in this exercise we used $\alpha = 0.1$, $0.05$ and $0.01$.

Exercise 8 (Interactive Normal Distribution). Create an interactive Shiny application that uses the function defined in Exercise 6 and shows the probability $\Pr(X > 0)$, where $\mu$ and $\sigma$ can be set interactively by the user.
Hints. Download the folder normalplot-nographics from the study materials. Run it from the R console with the command runApp('normalplot-nographics', display.mode="showcase") from the parent directory of normalplot-nographics (use the function setwd to set your working directory if needed). Use code from the previous examples in server.R to finish the exercise.

Listing 1: ui.R

# if missing, install with the command install.packages('shiny')
library(shiny)

# Define UI
shinyUI(fluidPage(

  # Application title
  titlePanel("Normal Distribution"),

  # Numeric inputs for the parameters sigma and mu
  numericInput("sig",
               "sigma:",
               min = 0.1,
               max = 3,
               step = 0.1,
               value = 1),
  numericInput("mu",
               "mean:",
               min = -4,
               max = 4,
               value = 0,
               step = 0.5),

  textOutput("myTextOutput")
))

Listing 2: server.R

library(shiny)

# Define server logic
shinyServer(function(input, output) {

  output$myTextOutput <- renderText({
    # TODO: use your code from the previous exercise to show probabilities
    paste0('My sigma is ', input$sig)
  })
})
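One way to finish Exercise 8 is to replace the TODO in server.R with the probability calculation, for example via pnorm as in Exercise 6; the sketch below is only one possibility, not the official solution from the study materials.

library(shiny)

# Define server logic
shinyServer(function(input, output) {

  output$myTextOutput <- renderText({
    # Pr(X > 0) for X ~ N(mu, sigma^2) with the user-supplied parameters
    p <- 1 - pnorm(0, mean = input$mu, sd = input$sig)
    paste0("Pr(X > 0) = ", round(p, 4))
  })
})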
Exercise 9 (Binomial Distribution). Assume that the number of people preferring treatment A over treatment B follows a binomial distribution with parameters $p$ (the probability of success) and $N$ (the number of independent trials), denoted $\mathrm{Bin}(N, p)$, where $N = 20$ and $p = 0.5$, i.e. people prefer both treatments equally. (a) What is the probability that 16 or more patients will prefer treatment A over treatment B? (b) What is the probability that 16 or more, or 4 or fewer, patients will prefer treatment A over treatment B?

Solution without code:
(a)
\[
\Pr(X \ge 16) = 1 - \Pr(X < 16) = 1 - \Pr(X \le 15) = 1 - \sum_{i:\,x_i \le 15} \Pr(X = x_i) = 1 - \sum_{i:\,x_i \le 15} \binom{N}{x_i} p^{x_i} (1-p)^{N - x_i} = 1 - \sum_{i:\,x_i \le 15} \binom{20}{x_i} 0.5^{x_i} (1 - 0.5)^{20 - x_i} = 0.0059.
\]
(b)
\[
\Pr(X \le 4 \vee X \ge 16) = 1 - \sum_{i:\,x_i \le 15} \Pr(X = x_i) + \sum_{i:\,x_i \le 4} \Pr(X = x_i) = 0.012.
\]
This probability is twice the previous one, since $\mathrm{Bin}(N, 0.5)$ is symmetric around its mean $Np = 10$.

Hints. pbinom(x, size=..., prob=...) gives you the probability $\Pr(X \le x)$. How do you get the probability $\Pr(X \ge x)$ using this function?

Exercise 10 (Binomial Distribution). Assume that $\Pr(\text{swirl}) = 0.533 = p_1$ is the probability of having the dermatological pattern swirl on the right thumb in the male population and $\Pr(\text{other}) = 0.467 = p_2$ is the probability of other patterns on the right thumb in the same population. The random variable $X$ represents the number of swirls and $Y$ the number of other patterns, where $X \sim \mathrm{Bin}(N, p_1)$ and $Y \sim \mathrm{Bin}(N, p_2)$. Calculate
1. $\Pr(X \le 120)$ if $N = 300$,
2. $\Pr(Y \le 120)$ if $N = 300$.

Exercise 11 (Normal approximation of the binomial distribution). Let $\Pr(\text{man}) = 0.515$ be the proportion of men in a population and $\Pr(\text{woman}) = 0.485$ the proportion of women. Let $X$ represent the number of men and $Y$ the number of women. Under the assumption of the model $\mathrm{Bin}(N, p)$, calculate
1. $\Pr(X \le 3)$ if $N = 5$,
2. $\Pr(X \le 5)$ if $N = 10$,
3. $\Pr(X \le 25)$ if $N = 50$.
Compare these probabilities with those approximated by the normal distribution $N(Np, Npq)$, where $q = 1 - p$. (Approximation means "similar but not exactly equal": we approximate one distribution with a different one that has certain advantages over the approximated one, or we approximate data with some distribution that describes the data with the help of easily interpretable parameters.)
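For Exercises 9 to 11, a possible sketch of the exact binomial calculations and of the normal approximation is given below; note that pnorm expects the standard deviation sqrt(Npq), not the variance Npq, and the helper name binom_vs_normal is only illustrative.

# Exercise 9: Bin(20, 0.5)
1 - pbinom(15, size = 20, prob = 0.5)   # (a) Pr(X >= 16) = 0.0059
(1 - pbinom(15, size = 20, prob = 0.5)) +
  pbinom(4, size = 20, prob = 0.5)      # (b) Pr(X <= 4 or X >= 16) = 0.012

# Exercise 10: dermatological patterns, N = 300
pbinom(120, size = 300, prob = 0.533)   # Pr(X <= 120)
pbinom(120, size = 300, prob = 0.467)   # Pr(Y <= 120)

# Exercise 11: exact binomial probability vs. normal approximation N(Np, Npq)
binom_vs_normal <- function(x, N, p) {
  exact  <- pbinom(x, size = N, prob = p)
  approx <- pnorm(x, mean = N * p, sd = sqrt(N * p * (1 - p)))
  c(exact = exact, normal_approx = approx)
}

binom_vs_normal(3,  N = 5,  p = 0.515)
binom_vs_normal(5,  N = 10, p = 0.515)
binom_vs_normal(25, N = 50, p = 0.515)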