Statistics in Computer Science Seminar Exercises Stanislav Katina, Mojmir Vinkler November 26, 2016 1 Location characteristics Exercise 1 (Code Vectorization). Implement two versions of mean function in R - thee first one is mean_slow = function(x) and use for loop to sum numbers. The second version is mean_fast = function(x) and use built-in function sum to sum numbers. Generate very long random vector of uniformly distributed random numbers and compare performance of both functions and also of built-in mean function. Hints. Use runif(...) for generating random numbers and system.time({...}) for profiling. Realizations will be denoted as x1, x2, . . . , xn, sorted realizations as x(1) ≤ x(2) ≤ . . . ≤ x(n). Then we define following estimations of location characteristics (sample location characteristics): • sample minimum Xmin, with realization xmin = x(1); • sample maximum Xmax, with realization xmax = x(n); • sample (arithmetic) mean X, with realization x = 1 n n i=1 xi = 1 n nj j=1 xjfj, nj n, where fj are frequencies (counts) of xj and n = j fj; • sample mode Xmod, with realization xmod is the most common value (in case of discrete variable it is value x in which probability function has its maximum; in case of continuous variable it is value x in which density function has its maximum); • sample median X(robust estimation of location), with realization x = x(n+1 2 ) if n is odd, 1 2 x(n 2 ) + x(n 2 +1) if n is even; distribution is symmetric, if x = x = xmod, distribution is positively skewed (right), if x > x > xmod and distribution is negatively skewed (left), if x < x < xmod; 1 • sample percentile Xp (read as 100p-percentile), with realization xp defined as xp = x(k+1) for k = np, 1 2 x(k) + x(k+1) for k = np, where k = np , which is floor of np; • sample quartiles there are three – sample first (lower) quartile Q1, with realization x0.25 is a value that splits off the lowest 25% of data from the highest 75%, Pr [xmin, x0.25] = Pr [X ≤ x0.25] = 1 4 , Pr [x0.25, xmax] = Pr [X ≥ x0.25] = 3 4 ; – sample second quartile (median) Q2, with realization x0.5 = x is a value that splits off the lowest 50% of data from the highest 50%,, Pr [xmin, x0.5] = Pr [X ≤ x0.5] = 1 2 , Pr [x0.5, xmax] = Pr [X ≥ x0.5] = 1 2 ; – sample third (upper) quartile Q3, with realization x0.75 is a value that splits off the lowest 75% of data from the highest 25%,, Pr [xmin, x0.75] = Pr [X ≤ x0.75] = 3 4 , Pr [x0.75, xmax] = Pr [X ≥ x0.75] = 1 4 ; • sample deciles Xk, with realizations xk splits data to ten buckets, i.e. k/10 of data are lower than a decile and (10 − k)/10 are higher, where k ∈ {0, 1, . . . , 10}; • sample five-number summary (Xmin, Q1, Q2, Q3, Xmax)T , with realizations (xmin, x0.25, x0.50, x0.75, xmax)T . Robust location characteristics (resistant to outliers) are • sample γ-truncated arithmetic average Xg, with realization xg that is calculated as xg = 1 n − 2g x(g+1) + x(g+2) + . . . + x(n−g) , where g = γn , γ = 0.1, 0.2. More than γ100 % observations must be replaced for the γ-truncated average to become large or small relative to the original [1 breakdown point xg is therefore γ], 1 Breakdown point represents number of observations we need to significantly change value of location characteristics. For γ-truncated and γ-winsorized arithmetic average it is γn observations, for median n/2 observations and for simple arithmetic average changing just one observation is enough (that’s the reason we say that arithmetic average is very sensitive to outliers). 2 • sample γ-winsorized arithmetic average Xw, with realization xw is defined as xw = 1 n (g + 1)x(g+1) + x(g+2) + . . . + (g + 1)x(n−g) . More than γ100 % must be replaced for the γ-winsorized average to become large or small relative to the original [breakdown point xw is therefore γ]. Exercise 2 (height of 10-year old girls). Let’s have n = 12 heights (in cm) of randomly sampled 10-year old girls sorted by height (order denoted as ri for x(i); in case the values are equal, ri is calculated as average of their order numbers). Table 1: Sorted realizations xi and their order ri for heights of 10-year old girls i 1 2 3 4 5 6 7 8 9 10 11 12 x(i) 131 132 135 141 141 141 141 142 143 146 146 151 ri 1 2 3 5.5 5.5 5.5 5.5 8 9 10.5 10.5 12 Then x · = 140.83, x = 1 2 x(6) + x(7) = 141, x0.25 = 1 2 x(3) + x(4) = 138, where k = 12 × 0.25 = 3, Q3 = x0.75 = 1 2 x(9) + x(10) = 144.5, where k = 12 × 0.75 = 9. Write functions for calculation of all location characteristics. Verify your functions on characteristics above. Don’t use built-in functions for location characteristics such as mean, quantile, etc. Use γ = 0.1 for truncated and winsorized arithmetic averages. 2 Spread (variability) characteristics Then we define following estimations of spread (variability) characteristics (sample spread characteristics): • sample variance S2 , with realization s2 = s2 n−1 = s2 x = 1 n − 1 n i=1 (xi − x)2 ; under linear transformation sample variance changes like this2 s2 y = s2 a+bx = b2 s2 x, i.e. s2 y = s2 a+bx = 1 n − 1 n i=1 a + bxi − a + bx 2 = 1 n − 1 n i=1 (a + bxi − (a + bx))2 = 1 n − 1 n i=1 (b (xi − x))2 = b2 s2 x; 2 Equation tells us that variance of shifted and rescaled variable y is equal to square of scale multiplied by variance of original variable x. 3 • sample standard deviation S, with realization s = sn−1 = sx = s2 x; under linear transformation standard deviation changes like this sy = sa+bx = |b| sx, • coefficient of variation Vk, with realization vk represents normalized form of standard deviation (inversion of signal-to-noise ratio; fraction of variability to mean) vk = sx x ; it is usually denoted in percentage points, i.e. 100 × (sx/x) % and can be used only for realizations with positive values; • sample variance of arithmetic average S2 X , with realization s2 x = s2 x n ; • sample standard deviation of arithmetic average (sample standard error) SX, with realization sx = sx √ n ; • sample skewness B1, with realization b1 = n−1 n i=1 (xi − x)3 n−1 n i=1 (xi − x)2 3/2 = √ n n i=1 (xi − x)3 n i=1 (xi − x)2 3/2 , distribution is symmetric, if b1 = 0, positive skewness (density on the left side is steeper than the right side), if b1 > 0 a negative skewness (density on the right side is steeper than the left side), if b1 < 0; • sample kurtosis B2, with realization b2 = n−1 n i=1 (xi − x)4 n−1 n i=1 (xi − x)2 2 − 3 = n n i=1 (xi − x)4 n i=1 (xi − x)2 2 − 3, distribution is normal (mesokurtic), if b2 = 0, pointy (leptokurtic), if b2 > 0 and flat (platykurtic), if b2 < 0; • sample sum of squares SS = n i=1 Xi − X 2 , with realization SSobs = n i=1 (xi − x)2 , 4 • sample sum of absolute deviation SAD = n i=1 |Xi − X0.5|, with realization SADobs = n i=1 |xi − x0.5| ; • sample arithmetic average deviation MAD = 1 n n i=1 |Xi −X0.5|, with realization MADobs = SADobs/n; • sample range D = Xmax − Xmin, with realization dobs = xmax − xmin; • sample interquartile range DQ = Q3 − Q1, with realization dQ = x0.75 − x0.25; distribution is (between quartiles) symmetric, if x0.75 − x0.50 = x0.50 − x0.25, positively skewed, if x0.75 −x0.50 > x0.50 −x0.25 and negatively skewed, if x0.75 −x0.50 < x0.50 −x0.25; • sample decile range DD = X0.9 − X0.1, with realization dD = x0.9 − x0.1; • sample percentile range DP = X0.99 − X0.01, with realization dP = x0.99 − x0.01. Robust spread characteristics (variability) are • sample γ-truncated variance S2 tg, with realization s2 tg calculated as s2 tg = 1 n − 2g − 1 n−g i=g+1 x(i) − xtg 2 ; more than γ100 % must be replaced so that γ-truncated variance changes to large or small relative to the original s2 [breakdown point s2 tg is γ]; it applies that s2 tg < s2 because truncating removes outliers; • sample γ-winsorized variance S2 wg, with realizations s2 wg calculated as s2 wg = 1 n − 1 (g + 1)(x(g+1) − xwg)2 + (x(g+2) − xwg)2 + . . . + (g + 1)(x(n−g) − xwg)2 ; more than γ100 % must be replaced so that gamma-winsorized variance changes to large or small relative to the original s2 [breakdown point s2 wg is γ]; it applies that s2 wg < s2 because winsorization pulls outliers closer to the mean; 5 • sample quartile coefficient of variation Vk,Q = (Q3 −Q1)/Q2, with realization vk,Q calculated as vk,Q = x0.75 − x0.25 x . Other robust spread characteristics characterized by modified boundaries are • sample robust minimum and maximum (“inner boundaries”) X∗ min = BL = Q1 − 1.5DQ and X∗ max = BU = Q1 + 1.5DQ, with realizations defined as x∗ min = bL = x0.25 − 1.5 (x0.75 − x0.25) , x∗ max = bU = x0.75 + 1.5 (x0.75 − x0.25) , values outside of boundaries are considered to be suspicious, potential outliers; • sample robust minimum and maximum (“outer boundaries”) defined as B∗ U = Q1 − 3(Q3 − Q1), B∗ U = Q3 + 3(Q3 − Q1), with realizations b∗ L = x0.25 − 3dQ, b∗ U = x0.75 + 3dQ; – if there are any xi < b∗ L ∨ xi > b∗ U , we call them distant values3 , – if xi ∈ b∗ L, bL) ∨ (bU , b∗ U , these are outer values, – if xi ∈ bL, bU , these are inner values or values close to median; – normal distribution has these properties BU − BL = Q3 + 1.5DQ − Q1 + 1.5DQ = 4DQ · = 4.2; probability of xi /∈ BL, BU is then 0.04; • sample robust skewness B1Q and B1O and their variance under asymptotic normality B1·, where · = Q or O, with realizations defined as – quartile skewness b1Q = (x0.75 − x0.50) − (x0.50 − x0.25) x0.75 − x0.25 , V aras(b1Q) = 1.84, – octile skewness b1O = (x0.875 − x0.50) − (x0.50 − x0.125) x0.875 − x0.125 , V aras(b1O) = 1.15. Exercise 3 (height of 10-year old girls). Calculate all spread characteristics for the sample with heights of 10-year old girls. 3 Symbol ∨ means “or” and symbol ∧ means “and”. 6 3 Basics of Probability Exercise 4 (Simple random sample). In a simple random sample of size n from population of finite size N, each element has an equal probability of being chosen. If we avoid choosing any member of the population more than once, we call it simple random sample without replacement4 .(Dalgaard 2008). If we put a member back to population after choosing it, we talk about simple random sample with replacement5 . Let’s have a set M with N = 10 elements and we want to choose n = 3 elements (a) without replacement and (b) with replacement. How many combinations there are? How do these combinations look like if M = {1, 2, . . . , 10}? Do the same for N = 100, n = 30 and set M = {1, 2, . . . , 100}. Solution without code: (a) Number of combinations is N n . If N = 10 and n = 3, then N n = N! (N−n)!n! = 10 3 = 120. (b) Number of combinations with replacement is N+n−1 n . If N = 10 and n = 3, then N+n−1 n = (N+n−1)! (N−1)!n! = 10+3−1 3 = 220. If N = 100 a n = 30, then N+n−1 n = 100+30−1 30 = 2.009491 × 1029 . Hints. choose(n,k), combn(n,k)6 , sample(x=..., size=..., replace=...) Exercise 5 (Simple random sample). A group of people are labeled by their identification numbers (ID) from 1 to 30. Choose (a) randomly 5 people out of 30 without replacement, (b) randomly 5 people out of 30 with replacement and finally (c) randomly 5 people out of 30 without replacement, where people with ID between 28 and 30 have 4× higher probability of being chosen than people with ID between 1 and 27. Repeat the sampling 100 times, concatenate all samples and use table to get frequencies and confirm that IDs 28 to 30 are indeed sampled more frequently. Hints. rep(1, k), sample(..., prob=p) Exercise 6 (Normal distribution). Let X be a random variable (it could represent for example adult height) and let’s assume it is normally distributed with parameters µ (expectation or mean) and σ2 (standard deviation) which is written as X ∼ N(µ, σ2 ), µ = 140.83, σ2 = 33.79. Normal distribution represents a probability distribution model for this random variable. Calculate probability Pr (a < X < b) = Pr (X < b) − Pr (X < a) = FX (b) − FX (a), where a = µ − kσ, b = µ + kσ, k = 1, 2, 3.7 Plot probability density function and fill area between points a and b and label axes x and y as shown in figure 1. Partial solution: a = µ − σ = 135.0171, b = µ + σ = 146.6429, Pr (|X − µ| > σ) = 0.3173, Pr (|X − µ| < σ) = 1 − 0.3173 = 0.6827, a = µ − 2σ = 129.2042, b = µ + 2σ = 152.4558, Pr (|X − µ| > 2σ) = 0.0455, Pr (|X − µ| < 2σ) = 1 − 0.0455 = 0.9545, 4 n-combination without replacement from N members of set M. 5 n-combination with replacement from N members of set M. 6 requires library utils 7 Note that Pr (a < X < b) = Pr (a ≤ X ≤ b) because probability of a point (here a and b) is zero for continuous random variables, i.e Pr(a) = Pr(b) = 0. This does not apply to discrete random variables. 7 a = µ − 3σ = 123.3913, b = µ + 3σ = 158.2687, Pr (|X − µ| > 3σ) = 0.0027, Pr (|X − µ| < 3σ) = 1 − 0.0027 = 0.9973. 1 mu <- 0 2 sig <- 1 3 bin <- seq(mu-3*sig,mu+3*sig,by=sig) 4 pnorm(bin[7]) - pnorm(bin[1]) # 0.9973002 5 pnorm(bin[6]) - pnorm(bin[2]) # 0.9544997 6 pnorm(bin[5]) - pnorm(bin[3]) # 0.6826895 Probabilities 68.27 − 95.45 − 99.73 are called (empirical) rule or 3-sigma rule. Hints. Similar example for exponential distribution. 1 # probability density curve 2 x = seq(0, 4, by=0.01) 3 y = dexp(x) 4 5 # tell R we will plot 3 graphs in one row 6 par(mfrow=c(1,3)) 7 8 # plot three graphs 9 for(p in c(0.2, 0.1, 0.05)){ 10 # plot probability density as a line 11 plot(x, y, type=’l’, xlab=paste0(’Area under the curve = ’, p), ylab=’density’) 12 13 # fill area under the curve 14 p_quantile = qexp(1-p) 15 xx = seq(4, p_quantile, by=-0.01) 16 pol_x = c(p_quantile, 4, xx) 17 pol_y = c(0, 0, dexp(xx)) 18 polygon(pol_x, pol_y, col = "grey") 19 } Exercise 7 (Normal distribution). Let X ∼ N(µ, σ2 ), where µ = 150, σ2 = 6.25. Calculate a = µ − x1−α 2 σ and b = µ + x1−α 2 σ so that Pr (a ≤ X ≤ b) = 1 − α is equal to 0.90, 0.95 a 0.99. Value x1−α is a quantile of standardized normal distribution, i.e. Pr(Z = X−µ σ < x1−α) = 1 − α, Z ∼ N(0, 1). Plot density function, fill area between points a and b and label axes x a y as shown in the following figure. Hints. qnorm(alpha) This gives us rule 90 − 95 − 99 (so called adjusted 3-sigma rule). We used property Pr uα/2 < Z < u1−α/2 = Φ u1−α/2 − Φ uα/2 = 1 − α, where Φ is cumulative distribution function of normal distribution and in general α ∈ (0, 1/2); in the exercise we used α = 0.1, 0.05 a 0.01. See figure 2. Exercise 8 (Interactive Normal Distribution). Create an interactive Shiny application showing probability density function of normal distribution N(µ, σ2 ), where µ = 0 and σ can be interactively set by user. Fill area between points a and b, where a = µ−kσ, b = µ+kσ, k = 1, 2, 3 and make k interactive too. Finally, add graph title showing probability Pr (a ≤ X ≤ b). 8 Figure 1: 3-sigma rule; density curve with colored area under this curve between corresponding quantiles on x-axis; volume of the area is equal to the probability of realization of random variable between these quantiles Figure 2: Adjusted 3-sigma rule; density curve with colored area under this curve between corresponding quantiles on x-axis; volume of the area is equal to the probability of realization of random variable between these quantiles Hints. Installing shiny To install shiny on your computer simply run install.packages(’shiny’) in R. If you’re installing it on a faculty computer, you have to specify library too and also load jsonlite package 1 # create a directory called "rlib" somewhere in your profile 2 libpath = ’path to your rlib directory’ 3 4 # install shiny package (need only once) 5 install.packages("shiny", lib=libpath) 6 7 # load shiny package (need to run each time you start R) 8 library("shiny", lib=libpath) 9 library("jsonlite", lib=libpath) 9 Hints. Sample app Create a folder normalPlot and copy following files ui.R and server.R into it (find these files in study materials). Run it from R console with command runApp(’normalPlot’, display .mode="showcase") from parent directory of normalPlot (use function setwd to set your working directory if needed). Insert your code from previous exercises into function renderPlot in server.R to finish the exercise. Listing 1: ui.R 1 # if missing, install with command ‘install.packages(’shiny’)‘ 2 library(shiny) 3 4 # Define UI 5 shinyUI(fluidPage( 6 7 # Application title 8 titlePanel("Normal Distribution"), 9 10 # Sidebar with a slider input for the number of bins 11 sliderInput("sig", 12 "sigma:", 13 min = 0.1, 14 max = 3, 15 value = 1), 16 sliderInput("k", 17 "k:", 18 min = 1, 19 max = 3, 20 value = 1, 21 step = 1), 22 23 plotOutput(’normalPlot’) 24 )) Listing 2: server.R 1 library(shiny) 2 3 # Define server logic 4 shinyServer(function(input, output) { 5 6 output$normalPlot <- renderPlot({ 7 # insert your code from previous exercises here and modify it so that 8 # it uses values input$sig and input$k from ui.R 9 plot(c(1,2,3), c(1,2,3), main=paste(’Sigma:’, input$sig)) 10 }) 11 }) Exercise 9 (Interactive Normal Distribution 2). Build similar app for Exercise 7, but let user choose if he wants to display normal distribution or exponential distribution and let him choose α too, where Pr (a ≤ X ≤ b) = 1 − α for normal distribution and Pr (X ≤ b) = 1 − α for exponential distribution. Furthermore, add sliders for µ and σ if user selects normal distribution or slider λ if user selects exponential distribution. 10 Figure 3: Screenshot from interactive Shiny application with normal distribution. Your plot should have fixed x-axis limits (xlim). Also set reasonable min/max values for sliders. Hints. Use control widgets conditionalPanel and selectInput in ui.R. Exercise 10 (Binomial Distribution). Let’s assume that number of people preferring treatment A over treatment B follows a binomial distribution with parameters p (probability of success) and N (number of independent trials) denoted Bin (N, p), where N = 20, p = 0.5, i.e. people prefer both treatments equally. (a) What is the probability that 16 and more patients will prefer treatment A over treatment B? (b) What is the probability that 16 and more or 4 or less patients will prefer treatment A over treatment B? Solution without code: (a) Pr(X ≥ 16) = 1 − Pr(X < 16) = 1 − Pr(X ≤ 15) = 1 − i:xi≤15 Pr (X = xi) = 1 − i:xi≤15 N xi pxi (1 − p)N−xi = 1 − i:xi≤15 20 xi 0.5xi (1 − 0.5)20−xi = 0.0059. (b) Pr(X ≤ 4, X ≥ 16) = 1 − i:xi≤15 Pr (X = xi) + i:xi≤4 Pr (X = xi) = 0.012. This probability is twice the previous one since Bin(N, 0.5) is symmetric around 0.5. Hints. pbinom(x, size=..., prob=...) gives you probability Pr(X ≤ x). How do you get probability Pr(X ≥ x) using this function? Exercise 11 (Binomial Distribution). Let’s assume that Pr(swirl) = 0.533 = p1 is the probability of having dermatological pattern swirl on right thumb of male population and Pr(other) = 0.467 = p2 is the probability of other patterns on right thumb of the same population. Random variable X represents number of swirls and Y number of other patterns, where X ∼ Bin(N, p1) a Y ∼ Bin(N, p2). Calculate 11 1. Pr(X ≤ 120) if N = 300 2. Pr(Y ≤ 120) if N = 300 Compare the results with your intuition (e.g. what do you think is the probability of getting less than 120 heads if you flip a coin 300 times?) Exercise 12 (Moments of Bernoulli distribution). Random variable X has Bernoulli distribution (X ∼ Ber(p)) if Pr(X = 1) = 1 − Pr(X = 0) = 1 − q = p. Derive its expected value EX = k kPr(X = k) and variance VarX = E[(X − EX)2 ] = E[X2 ] − (EX)2 Exercise 13 (Moments of Binomial distribution). Random variable X has Binomial distribution (X ∼ Bin(N, p)) if Pr(X = k) = N k pk (1 − p)N−k = N! k!(N − k)! pk (1 − p)N−k for k ∈ {0, . . . , N}. Derive its expected value EX = k kPr(X = k) and variance VarX = E[(X − EX)2 ] = E[X2 ] − (EX)2 Hints. The hard way involves Binomial theorem, for the easy way you need to realize relationship between Binomial and Bernoulli random variable and use results from previous exercise. Exercise 14 (Normal approximation of binomial distribution). 8 Let Pr(man) = 0.515 be a proportion of men in population and Pr(women) = 0.485 proportion of women. Let X represent number of men and Y number of women. Under the assumption of model Bin(N, p) calculate 1. Pr(X ≤ 3) if N = 5 2. Pr(X ≤ 5) if N = 10 3. Pr(X ≤ 25) if N = 50. Compare these probabilities with those approximated by normal distribution N(Np, Npq). Plot density function of normal distribution and superimpose it with probability distribution of binomial distribution. Plot cumulative distribution function of normal distribution and superimpose it with cumulative distribution function for binomial distribution. See figure 4. Hints. par(mfrow=c(2, 3)), plot(..., type=’h’) plot function creates a new graph, to add points or lines to an existing graph, use commands points or lines. These functions take the same arguments as plot. 8 Approximation means ”similar but not exactly equal”, i.e. we approximate some distribution with a different one (that has certain advantages over the approximated one) or we approximate data with some distribution (that describes data with help of easily interpretable parameters) 12 Figure 4: Approximating binomial distribution with normal distribution for p = 0.515, N = 5, 10 and 50; scatter plot superimposed with density (first row) and cumulative distribution function (second row). Exercise 15 (Levelplots / Contingency tables). Analyze data/reputation.data for potential security risks. Plot heatmap of Risk vs Reliability for all IPs (see figure 5). Then plot the same risk-reliability heatmap, but grouped by SimpleType. SimpleType has the same values as Type, but all values with multiple types (multiple types are separated by ;) are replaced by ”Multiples” value (see figure 6). Try removing dominating type ”Scanning Host” to get more insights into other types. Hints. Simple heatmap 1. Load data into variable av and check their correctness using head(av) and str(av). 2. Create a frequency table for heatmap 1 rr.df = data.frame(table(av$Risk, av$Reliability)) 2 colnames(rr.df) <- c("Risk", "Reliability", "Freq") 13 3. Plot level plot using following command (experiment with other options) 1 library(lattice) 2 levelplot(Freq~Risk*Reliability, 3 data=rr.df, main="Risk ~ Reliability", ylab="Reliability", xlab = "Risk", 4 shrink = c(0.5, 1), 5 col.regions = colorRampPalette(c("#F5F5F5", "#01665E"))(20)) Hints. Heatmap grouped by type 1. Create a new column SimpleType that is the same as Type and convert it from factor variable to character with as.character. Then replace all values containing ; with value ”Multiples” (use grep to select these rows). 2. Create frequency table like before and change the first argument of levelplot to levelplot(Freq ~Reliability*Risk|Simpletype, ...). Figure 5: Heatmap for reputation data. Exercise 16 (Bivariate normal distribution). Random vector (X, Y )T has bivariate normal distribution N2 (µ, Σ) , where µ = (µ1, µ2)T and Σ = σ2 1 ρσ1σ2 ρσ1σ2 σ2 2 , 14 Figure 6: Heatmap for reputation data grouped by Type with all multiple types merged into one value ”Multiples”. with density f (x, y) = 1 2π σ2 1σ2 2 (1 − ρ2) exp − 1 2 (1 − ρ2) (x−µ1)2 σ2 1 − 2ρ(x−µ1)(y−µ2) σ1σ2 + (y−µ2)2 σ2 2 , where (x, y)T ∈ R2 , µj ∈ R1 , σ2 j > 0, j = 1, 2, ρ ∈ −1, 1 are parameters, then θ = (µ1, µ2, σ2 1, σ2 2, ρ). Expression in the exponent can be also written as − 1 2 x − µ1 y − µ2 T σ2 1 ρσ1σ2 ρσ1σ2 σ2 2 −1 x − µ1 y − µ2 , marginal distributions9 are X ∼ N (µ1, σ2 1) and Y ∼ N (µ2, σ2 2), ρ is a correlation coeffi- cient10 . 9 Marginal distribution is a distribution of marginal random variable. Marginal distribution of multivariate normal distribution is again normal, which is a very useful property. 10 From this example it is clear that to sufficiently describe bivariate normal distribution we need 5 pa- 15 Exercise 17 (Bivariate normal distribution). (1) Create a function bnorm(x,y,mu1,mu2,sigma1 ,sigma2,corr) returning density of a bivariate normal distribution. Use function properties to test that it works correctly. (2) Plot density of bivariate normal distribution N2 (µ, Σ) using function image() and superimpose it with contour graph of the same distribution using function contour(). (3) Plot density of bivariate normal distribution N2 (µ, Σ) using function persp(). Cut density into 12 intervals, where values in these intervals correspond to colors terrain.colors(12). Use following parameters (a) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0; (b) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0.5; (c) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1.2, ρ = 0.5. See figure 5 for correct solution. Hints. 1. create a function bnorm(x,y,mu1,mu2,sigma1,sigma2,corr) returning density of bivariate normal distribution 2. to test your implementation, use properties such as symmetry, the fact that function value goes to zero as you go further away from mean, location of the maximum, etc. 3. create vectors x and y with values from -3 to 3 and length n and their cartesian product (xi, yj), i = 1, . . . , n, j = 1, . . . , n (represented as either n2 × 2 matrix or two vectors of length n2 ) 4. apply your function bnorm on the cartesian and reshape it to a matrix Z with dimensions n × n 5. use x, y and Z in image function to plot density You don’t have to use Greek letters in axis labels (use mu 1 instead of µ1), but if you want to, look up expression function. For coloring of persp look at the last example in persp documentation and modify it to your needs. Exercise 18 (Kernel density estimation). Simulate N samples from given distribution and plot them as + symbols on x-axis together with their kernel density estimation. Add theoretical normal distribution (resp. exponential distribution) with parameters estimated from data. Do it for (a) N = 50 (b) N = 1000 and following distributions 1. X ∼ N(0, 1) 2. X ∼ pN(−3, 1) + (1 − p)N(3, 1) is a mixture of normals, where p = 0.3 3. X ∼ Exp(λ), where λ = 3. Plot theoretical exponential distribution with estimated parameter ˆλ = 1 ¯X = N N i=1 Xi . rameters, i.e. mean and variance for marginal distributions of random variables X and Y and correlation coefficient ρ = ρ(X, Y ) describing the strength of linear relationship of X and Y . 16 Figure 7: Density of bivariate normal distribution with different parameters (first row – contour graph, second row – perpective 3D graph displayed as surface); the larger absolute value of ρ, the more are contours different from circles (they are changing into ellipses); as the difference between σ1 and σ2 is getting larger, we say that difference in variability of X1 and X2 is increasing See figure 8 for correct solution. Hints. To get the same results as in the figure, call set.seed(5) at the beginning of your script. Useful functions: density(..., from=, to=, n=), plot(..., lty=2), points(..., pch=3) Exercise 19 (Bivariate normal distribution). Simulating pseudorandom numbers from N2 (µ, Σ) can be done with in several ways: 1) library library(MASS) and function mvrnorm(); 2) library library(mvtnorm) and function rmvnorm(); 3) function rnorm() and this algorithm – let X1 ∼ N(0, 1) and X2 ∼ N(0, 1); then (Y1, Y2) ∼ N2 (µ, Σ), where µ = (µ1, µ2)T is vector of means and σ2 1, σ2 2 and ρ are parameters of covariance matrix Σ, where the strength of linear relationship Y1 and Y2 is given by magnitude and sign of ρ; Y1 = σ1X1 + µ1 a Y2 = σ2(ρX1 + 1 − ρ2X2) + µ2. Simulate pseudorandom numbers Y1 and Y2 from N2 (µ, Σ) using the first method. Calculate kernel density estimation of (Y1, Y2)T using function kde2d(). Plot it using function image() and superimpose it with contour graph of bivariate normal density of N2 (µ, Σ) 17 Figure 8: Kernel density estimation of several distributions (first row n = 50; second row n = 1000) using function contour().Use following parameter in simulations (a) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0; (1) n = 50 a (2) n = 1000; (b) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1, ρ = 0.5; (1) n = 50 a (2) n = 1000; (c) µ1 = 0, µ2 = 0, σ1 = 1, σ2 = 1.2, ρ = 0.5; (1) n = 50 a (2) n = 1000. See figure 11 for correct solution. Hints. kde2d(..., n=100, lims=c(-3, 3, -3, 3)) Exercise 20 (Mixture of two normal distributions). Simulate pseudorandom numbers (1) from mixture of normal distributions pN2 (µ1, Σ1) + (1 − p)N2 (µ2, Σ2) where θ1 = (µ11, µ12, σ2 11, σ2 12, ρ1, µ21, µ22, σ2 21, σ2 22, ρ2)T and (2) from bivaraite normal distribution N2 (µ, Σ), where parameters represent combined vector of means a combined covariance matrix. i.e. θ2 = (µ1, µ2, σ2 1, σ2 2, ρ)T . For (1) calculate kernel density estimation of (X, Y )T using function kde2d(). (a) Plot theoretical density (2) using function image() and superimpose it with contour graph of theoretical density (2) using function contour(). (b) Plot theoretical density (1) using function image() and superimpose it with contour graph 18 Figure 9: Density of bivariate normal distribution (first row n = 50; second row n = 1000) of theoretical density (1) using function contour(). (c) Plot kernel density estimation of (1) using function image() and superimpose it with contour graph of theoretical density (1) using function contour(). Cut density into 12 intervals, where values in these intervals correspond to colors terrain.colors(12). For simulation use these parameters θ1 = (−1.2, −1.2, 1, 1, 0, 1, 1, 1, 1, 0)T , n = 100 and p = 0.5. Use parameters from simulation for case (1), i.e. θ1 = θ1 = (µ11, µ12, σ2 11, σ2 12, ρ1, µ21, µ22, σ2 21, σ2 22, ρ2)T = (−1.2, −1.2, 1, 1, 0, 1, 1, 1, 1, 0)T . For case (2) estimate parameters from simulated data θ2 = (µ1, µ2, σ2 1, σ2 2, ρ)T . See figure 8 for correct solution. Hints. For parameter estimations in case (2) use functions mean, std and cor. Exercise 21 (Poisson distribution). We have a data (Greenwood and Yule 1920) with number of injuries of factory workers in the following table n 0 1 2 3 4 ≥ 5 mn 447 132 42 21 3 2 where n is number of injuries and mn number of workers with n injuries. 19 Figure 10: Combined density of bivariate normal distribution (left), mixture density of two bivariate normal distributions (middle) and kernel density estimation superimposed by mixture density of two bivariate normal distributions (right) Calculate expected number of worker injuries under the assumption that random variable X representing injuries has Poisson distribution with parameter λ = n nmn n mn , i.e. X ∼ Poiss(λ). Create a table with mn and expected mn and display it (figure 11). Hints. Create a dataframe with two columns and proper row names, then print it. data. frame(mn=..., expected_mn=..., row.names=...). Figure 11: Observed and expected frequencies of Poisson distribution. Exercise 22 (Binomial distribution). In a study from 1889 based on medical records in Saxony professor Geissler (1889) recorded distribution of boys in families. The study included M = 6115 families with N = 12 children with following number of boys (n stands for number of boys and mn number of families with n boys) n 0 1 2 3 4 5 6 7 8 9 10 11 12 mn 3 24 104 286 670 1033 1343 1112 829 478 181 45 7 Calculate expected mn under the assumption that number of boys X in families follow binomial distribution π = N n=0 nmn NM and N = 12 (i.e. X ∼ Bin(N, π)). Compare expected and observed frequencies - do you see any difference? Display mn and expected mn in a table and visualize both observed and expected frequencies in one graph (see figure 10 or use your own imagination). Calculate probability that family will have ≥ 4 and ≤ 6 boys from theoretical distribution (i.e. Pr(4 ≤ X ≤ 6)). 20 Figure 12: Observed and expected frequencies of Binomial distribution Hints. Add legend with legend(’topleft’, c(’observed’, ’expected’), col=c(’black’, ’red’ ), lty=1, bty=’n’). For stairs graph use argument type=’S’ or type=’s’ Exercise 23 (Approximating Binomial distribution by Poisson). If each of 50 million people in Italy drives a car next week (independently), then probability of dying in an accident will be 0.000002, where number of fatalities has binomial distribution Bin(50mil, 0.000002). Approximate this Binomial distribution with Poisson that has the same mean. Plot both probabilities for values {50, 51, . . . , 150}. Exercise 24 (Contingency table). 356 people have been polled on their smoking status (Smoke) and their socioeconomic status (SES). For each person it was determined whether or not they are current smokers, former smokers, or have never smoked. Also, for each person their socioeconomic status was determined (low, middle, or high). The data file smoker.csv (available from study materials) contains only two columns - Smoke and SES. Load this data into and create contingency table with table() function. These observations (after converting to probability) form joint distribution Pr(X = x, Y = y), where X is a discrete random variable for smoker and Y for SES. Calculate marginal distributions Pr(X = x) = y Pr(X = x, Y = y), x ∈ {current, former, never} and Pr(Y = y), y ∈ {high, low, middle}. Also calculate and compare Pr(X = x|Y = high) with Pr(X = x|Y = low) and Pr(Y = y|X = smoker). Create a table with expected frequencies using assumption that X and Y are independent (figure ). Visualize both tables (observed and expected) with mosaicplot function (figure ). 21 Hints. To calculate marginal probabilities use either special table functions margin.table and prop.table or simply rowSums and colSums. Figure 13: Observed and expected frequencies for contingency table Figure 14: Mosaic plot of contingency table Exercise 25 (Number of contingency tables). How many contingency tables of size 4 × 2 there are for N = 20? Exercise 26 (Multinomial distribution). We have following random variables (1) socioeconomic status (high - H, low - Lo), (2) political affiliation (democrat - D, republican - R) and (3) political philosophy (libral - Li, convservatism - C). Let’s denote their interactions like this: X1 (H-D-Li), X2 (H-D-C), X3 (H-R-Li), X4 (H-R-C), X5 (Lo-D-Li), X6 (Lo-D-C), X7 (Lo-R-Li) a X8 (Lo-R-C). We have random sample of size N = 50. Probabilities pj are in the following table D-Li D-C R-Li R-C all H 0.12 0.12 0.04 0.12 0.4 Lo 0.18 0.18 0.06 0.18 0.6 all 0.30 0.30 0.10 0.30 1.0 Calculate Var[X1], Var[X3], Cov [X1, X3], Cor [X1, X3] and expected frequencies Npj, j = 1, 2, . . . , 8. 22 Hints. X = (X1, X2, . . . , X8) ∼ Mult(N, p), where N = 50, p = (p1, p2, . . . , p8)T , we know that Xj ∼ Bin(N, pj), pj are given by table above j = 1, 2, . . . , 8. Exercise 27 (Product-multinomial distribution). Let’s have the same probabilities as in previous exercise, but now with two separate random samples - first with size N1 = 30 from group H and the second with size N2 = 20 from group Lo. Denote interactions of random variables X11 = X1|1 (H-D-Li), X12 = X2|1 (H-D-C), X13 = X3|1 (H-R-Li), X14 = X4|1 (H-R-C), X21 = X1|2 (Lo-D-Li), X22 = X2|2 (Lo-D-C), X23 = X3|2 (Lo-R-Li) and X24 = X4|2 (Lo-R-C), where X1 = (X11, X12, X13, X14)T and X2 = (X21, X22, X23, X24)T . Then X = (X1, X2) has product-multinomial distribution with K = 2, N1 = 30, J1 = 4, N2 = 20, J2 = 4. Notation Xj|k, where j = 1, 2, 3, 4 and k = 1, 2 highlights the fact, that distribution is conditioned by socioeconomic status (high – H, low – Lo), i.e. distribution in table columns is conditioned by its row. Realizations Xj|k are denoted as nj|k = nkj, similarly probabilities Xj|k = Xkj as pj|k = pkj. Calculate conditional probabilities pj|k, expected frequencies Nkpkj, Var[X13], Cov [X21, X23] and Cor [X11, X23]. Exercise 28 (Binomial distribution, simulation study). Generate M = 1000 pseudo-random numbers from distribution Bin(5, 0.5). Create a table with observed and theoretical probabilities (for n = 0, 1, . . . , 5). r 0 1 2 3 4 5 observed probabilities 0.031 0.158 0.302 0.324 0.161 0.024 theoretical probabilities 0.031 0.156 0.312 0.312 0.156 0.031 Superimpose histogram of simulated numbers with theoretical density function. See figure 15. Hints. For histogram plot use breaks and probability arguments. Exercise 29 (Normal distribution, simulation study). If random variable X ∼ N(150, 6.25), what distribution has its arithmetic average Xn? Verify your result using simulations for n = 30. Make 5000 simulations (each simulation involves generating 30 realizations of X) and for each simulation calculate arithmetic average xm, m = 1, 2, . . . , M, where M = 5000. Superimpose their histogram (normalized) with theoretical density function for Xn you derived. Calculate Pr(Xn > 151) from simulated data and compare this result with theoretical (expected) probability. See figure 16. Exercise 30 (Normal distribution, simulation study II). Let X ∼ N(µ1, σ2 1) a Y ∼ N(µ2, σ2 2). Then Xn1 − Y n2 ∼ N(µ1 − µ2, σ2 1 n1 + σ2 2 n2 ). Use µ1 = 100, σ1 = 10, µ2 = 50, σ2 = 9 and (a) n1 = 4, n2 = 5, (b) n1 = 100, n2 = 81. Make M = 1000 simulations and for each calculate difference xm − ym, m = 1, 2, . . . , M. Superimpose histogram (normalized) of the differences with theoretical density function of difference Xn1 −Y n2 . For both cases (a) and (b) calculate Pr(Xn1 − Y n2 ) < 52 from simulated data and compare this result with theoretical probability. 23 Figure 15: Histogram superimposed with theoretical distribution of Bin(5, 0.5) 4 Likelihood Exercise 31 (Maximum-likelihood derivation). Let X ∼ Poiss(λ) and x1, . . . , xn are its i.i.d. realizations. Analytically derive its log-likelihood function l(λ|x). Recall that l(λ|x) = ln L(λ|x) = ln n i=1 f(xi) = n i=1 ln f(xi) . Exercise 32 (Maximum-likelihood, Poisson distribution). Let X ∼ Poiss(λ). Simulate realizations x1, . . . , xn. Use n = 100 and λ = 4. Plot log-likelihood function of Poisson distribution l(λ|x) = n i=1 xi ln λ − nλ. for λ ∈ [2, 6]. Find maximum likelihood estimate of λ and mark it in graph with dashed line. See figure 17. Hints. Start with writing a function loglikehood = function(x, lambda) that calculates loglikelihood for single λ. Then in a for loop or using sapply function calculate log-likelihood for λ ∈ [2, 6]. To find maximum of a function, either use formula for MLE from the lecture or find it empirically from vector of log-likelihoods using function which.max(lambdas). To plot a vertical line use abline(v=..., lty=’dashed’). 24 Figure 16: Histogram superimposed with theoretical distribution of N(150, 6.25/n) Exercise 33 (Maximum-likelihood, Poisson distribution). Use data from exercise 21 for log-likelihood function from the previous exercise and plot it. Note that the log-likelihood function will have to be slightly modified to work with counts of realizations from exercise 21. See figure 17. Exercise 34 (Maximum-likelihood, Binomial distribution). Let X ∼ Bin(N, p) and x its realization11 . Formulate its likelihood function L(p|x) and log-likelihood function l(p|x). Use log-likelihood function to derive maximum-likelihood estimate ˆp = argmax p l(p|x) of parameter p and Fisher information I(ˆp) = − ∂2 ∂p2 l(p|x)|p=ˆp. Exercise 35 (Quadratic approximation of log-likelihood function). Let X ∼ Bin(N, p) and x its realization. Plot scaled log-likelihood function of binomial distribution with p on x-axis and scaled log-likelihood l∗ (p|x) = l(p|x) − max(l(p|x)) on y-axis. Compare l∗ (p|x) with its quadratic approximation calculated using Taylor approximation ln(L(p|x) L(p|x) ) ≈ −1 2 I(p)(p − p)2 . Use (a) x = 8, N = 10, (b) x = 80, N = 100 and (c) n = 800, N = 1000 with reasonable ranges for both axes. See figure 18. Hints. Use lines(..., lty="dashed") for the dashed line. 11 We work with just one realization, so L(p|x) = 1 i=1 f(x|p). You would get the same results if you worked with N realizations with Bernoulli distribution Ber(p). 25 Figure 17: Left: Log-likelihood function for simulated data from Poiss(λ = 4). Right: Log-likelihood function for number of injuries of factory workers. 5 Hypothesis Testing Exercise 36 (Hypothesis testing). Let X ∼ Bin(N, p) be a random variable with N = 1000 and p = 0.5. Make 100000 simulations x1, . . . , x100000 and for each simulation calculate maximum likelihood estimate of p, i.e. ˆp = x N . Plot histogram of ˆp and add vertical line at value 0.534. Calculate percentage of ˆp that are higher than 0.534. Exercise 37 (Hypothesis testing). Out of 1000 newborn children, 534 of them were boys and 466 girls. Test null hypothesis that probabilities of having a boy or a girl are equal. Calculate p-value, confidence interval, test statistic and critical region. Use Wald statistic and Likelihood statistic. Hints. Use uniroot to solve inequality in confidence interval for Likelihood statistic. 26 Figure 18: Compare scaled log-likelihood function (full line) with its quadratic approximation (dashed line). Figure 19: Histogram of simulations of ˆp with line at 0.534. 27