Statistics in Computer Science
Seminar Exercises
Stanislav Katina, Mojmir Vinkler
October 15, 2016

1 Location characteristics

Exercise 1 (Code Vectorization). Implement two versions of the mean function in R. The first one, mean_slow = function(x), should use a for loop to sum the numbers; the second one, mean_fast = function(x), should use the built-in function sum. Generate a very long vector of uniformly distributed random numbers and compare the performance of both functions and of the built-in function mean.
Hints. Use runif(...) for generating random numbers and system.time({...}) for profiling. A sketch of one possible solution is given at the end of this section.

Realizations will be denoted $x_1, x_2, \ldots, x_n$ and sorted realizations $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. We then define the following estimators of location characteristics (sample location characteristics):

• sample minimum $X_{\min}$, with realization $x_{\min} = x_{(1)}$;
• sample maximum $X_{\max}$, with realization $x_{\max} = x_{(n)}$;
• sample (arithmetic) mean $\bar X$, with realization
\[
\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{j=1}^{n_j} x_j f_j, \qquad n_j \le n,
\]
where $f_j$ are the frequencies (counts) of the distinct values $x_j$ and $n = \sum_j f_j$;
• sample mode $X_{\mathrm{mod}}$, whose realization $x_{\mathrm{mod}}$ is the most common value (for a discrete variable it is the value $x$ at which the probability function attains its maximum; for a continuous variable it is the value $x$ at which the density attains its maximum);
• sample median $\tilde X$ (a robust estimator of location), with realization
\[
\tilde x = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even;} \end{cases}
\]
the distribution is symmetric if $\bar x = \tilde x = x_{\mathrm{mod}}$, positively (right) skewed if $\bar x > \tilde x > x_{\mathrm{mod}}$, and negatively (left) skewed if $\bar x < \tilde x < x_{\mathrm{mod}}$;
• sample quartiles, of which there are three:
  – the sample first (lower) quartile $Q_1$, with realization $x_{0.25}$, a value that splits off the lowest 25 % of the data from the highest 75 %, $\Pr[x_{\min}, x_{0.25}] = \Pr[X \le x_{0.25}] = \frac{1}{4}$, $\Pr[x_{0.25}, x_{\max}] = \Pr[X \ge x_{0.25}] = \frac{3}{4}$;
  – the sample second quartile (median) $Q_2$, with realization $x_{0.5} = \tilde x$, a value that splits off the lowest 50 % of the data from the highest 50 %, $\Pr[x_{\min}, x_{0.5}] = \Pr[X \le x_{0.5}] = \frac{1}{2}$, $\Pr[x_{0.5}, x_{\max}] = \Pr[X \ge x_{0.5}] = \frac{1}{2}$;
  – the sample third (upper) quartile $Q_3$, with realization $x_{0.75}$, a value that splits off the lowest 75 % of the data from the highest 25 %, $\Pr[x_{\min}, x_{0.75}] = \Pr[X \le x_{0.75}] = \frac{3}{4}$, $\Pr[x_{0.75}, x_{\max}] = \Pr[X \ge x_{0.75}] = \frac{1}{4}$;
• sample deciles $X_k$, with realizations $x_k$, which split the data into ten buckets, i.e. $k/10$ of the data lie below the $k$-th decile and $(10-k)/10$ lie above it, where $k \in \{0, 1, \ldots, 10\}$;
• sample percentile $X_p$ (read as the $100p$-th percentile), with realization $x_p$ defined as
\[
x_p = \begin{cases} x_{(k+1)} & \text{for } k < np, \\ \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right) & \text{for } k = np, \end{cases}
\]
where $k = \lfloor np \rfloor$ is the floor of $np$;
• sample five-number summary $(X_{\min}, Q_1, Q_2, Q_3, X_{\max})^T$, with realizations $(x_{\min}, x_{0.25}, x_{0.50}, x_{0.75}, x_{\max})^T$.

Robust location characteristics (resistant to outliers) are
• the sample $\gamma$-truncated arithmetic average $\bar X_g$, with realization $\bar x_g$ calculated as
\[
\bar x_g = \frac{1}{n - 2g}\left(x_{(g+1)} + x_{(g+2)} + \ldots + x_{(n-g)}\right),
\]
where $g = \lfloor \gamma n \rfloor$ and $\gamma = 0.1, 0.2$; more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-truncated average to become arbitrarily large or small relative to the original (the breakdown point of $\bar x_g$ is therefore $\gamma$);
• the sample $\gamma$-winsorized arithmetic average $\bar X_w$, with realization $\bar x_w$ defined as
\[
\bar x_w = \frac{1}{n}\left((g+1)\,x_{(g+1)} + x_{(g+2)} + \ldots + (g+1)\,x_{(n-g)}\right);
\]
more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-winsorized average to become arbitrarily large or small relative to the original (the breakdown point of $\bar x_w$ is therefore $\gamma$).

Note on the breakdown point: the breakdown point represents the number of observations that must be changed in order to significantly change the value of a location characteristic. For the $\gamma$-truncated and $\gamma$-winsorized arithmetic averages it is $\gamma n$ observations, for the median $n/2$ observations, and for the simple arithmetic average changing just one observation is enough (that is why we say the arithmetic average is very sensitive to outliers).
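Exercise 1 asks for a comparison of a loop-based and a vectorized mean. A minimal sketch of one possible solution is given below; the vector length 1e7 is an arbitrary illustrative choice.

mean_slow <- function(x) {
  s <- 0
  for (xi in x) {        # explicit loop over every element
    s <- s + xi
  }
  s / length(x)
}

mean_fast <- function(x) {
  sum(x) / length(x)     # vectorized built-in sum()
}

x <- runif(1e7)          # a very long vector of uniform random numbers

system.time({ mean_slow(x) })  # loop version, typically the slowest
system.time({ mean_fast(x) })  # vectorized version
system.time({ mean(x) })       # built-in mean() for comparison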
Exercise 2 (height of 10-year-old girls). We have $n = 12$ heights (in cm) of randomly sampled 10-year-old girls, sorted by height (the order of $x_{(i)}$ is denoted $r_i$; in case of tied values, $r_i$ is calculated as the average of their order numbers).

Table 1: Sorted realizations $x_{(i)}$ and their orders $r_i$ for the heights of 10-year-old girls

  i       1    2    3    4    5    6    7    8    9    10    11    12
  x_(i)   131  132  135  141  141  141  141  142  143  146   146   151
  r_i     1    2    3    5.5  5.5  5.5  5.5  8    9    10.5  10.5  12

Then $\bar x \doteq 140.83$, $\tilde x = \frac{1}{2}\left(x_{(6)} + x_{(7)}\right) = 141$, $x_{0.25} = \frac{1}{2}\left(x_{(3)} + x_{(4)}\right) = 138$, where $k = 12 \times 0.25 = 3$, and $Q_3 = x_{0.75} = \frac{1}{2}\left(x_{(9)} + x_{(10)}\right) = 144.5$, where $k = 12 \times 0.75 = 9$.

Write functions for the calculation of all location characteristics and verify them against the values above. Do not use built-in functions for location characteristics such as mean, quantile, etc. Use $\gamma = 0.1$ for the truncated and winsorized arithmetic averages.
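For Exercise 2, the functions might be sketched as follows; they follow the definitions from Section 1 (the percentile rule with $k = \lfloor np \rfloor$ and $g = \lfloor \gamma n \rfloor$), avoid mean, median and quantile, and the names my_mean, my_quantile, my_trimmed_mean and my_winsorized_mean are only illustrative.

heights <- c(131, 132, 135, 141, 141, 141, 141, 142, 143, 146, 146, 151)

my_mean <- function(x) sum(x) / length(x)

# sample 100p-th percentile with k = floor(n * p)
my_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  k <- floor(n * p)
  if (k == n * p) (x[k] + x[k + 1]) / 2 else x[k + 1]
}

my_median <- function(x) my_quantile(x, 0.5)

# gamma-truncated average: drop the g smallest and g largest values
my_trimmed_mean <- function(x, gamma = 0.1) {
  x <- sort(x)
  n <- length(x)
  g <- floor(gamma * n)
  sum(x[(g + 1):(n - g)]) / (n - 2 * g)
}

# gamma-winsorized average: weight x_(g+1) and x_(n-g) by (g + 1)
my_winsorized_mean <- function(x, gamma = 0.1) {
  x <- sort(x)
  n <- length(x)
  g <- floor(gamma * n)
  ((g + 1) * x[g + 1] + sum(x[(g + 2):(n - g - 1)]) + (g + 1) * x[n - g]) / n
}

my_mean(heights)            # 140.8333
my_median(heights)          # 141
my_quantile(heights, 0.25)  # 138
my_quantile(heights, 0.75)  # 144.5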
2 Spread (variability) characteristics

We then define the following estimators of spread (variability) characteristics (sample spread characteristics):

• sample variance $S^2$, with realization
\[
s^2 = s^2_{n-1} = s^2_x = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2;
\]
under a linear transformation the sample variance changes as $s^2_y = s^2_{a+bx} = b^2 s^2_x$, i.e. the variance of a shifted and rescaled variable $y$ equals the square of the scale times the variance of the original variable $x$, because
\[
s^2_y = s^2_{a+bx} = \frac{1}{n-1}\sum_{i=1}^{n} \left(a + b x_i - \overline{a + b x}\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(a + b x_i - (a + b \bar x)\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(b (x_i - \bar x)\right)^2 = b^2 s^2_x;
\]
• sample standard deviation $S$, with realization $s = s_{n-1} = s_x = \sqrt{s^2_x}$; under a linear transformation the standard deviation changes as $s_y = s_{a+bx} = |b|\,s_x$;
• coefficient of variation $V_k$, with realization $v_k = s_x / \bar x$, a normalized form of the standard deviation (the inverse of the signal-to-noise ratio; the ratio of variability to the mean); it is usually reported in percentage points, i.e. $100 \times (s_x/\bar x)\,\%$, and can be used only for realizations with positive values;
• sample variance of the arithmetic average $S^2_{\bar X}$, with realization $s^2_{\bar x} = s^2_x / n$;
• sample standard deviation of the arithmetic average (sample standard error) $S_{\bar X}$, with realization $s_{\bar x} = s_x / \sqrt{n}$;
• sample skewness $B_1$, with realization
\[
b_1 = \frac{n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^3}{\left(n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar x)^3}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{3/2}};
\]
the distribution is symmetric if $b_1 = 0$, positively skewed (the density is steeper on the left side than on the right side) if $b_1 > 0$, and negatively skewed (the density is steeper on the right side than on the left side) if $b_1 < 0$;
• sample kurtosis $B_2$, with realization
\[
b_2 = \frac{n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^4}{\left(n^{-1}\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{2}} - 3 = \frac{n\,\sum_{i=1}^{n} (x_i - \bar x)^4}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{2}} - 3;
\]
the distribution is normal (mesokurtic) if $b_2 = 0$, pointy (leptokurtic) if $b_2 > 0$, and flat (platykurtic) if $b_2 < 0$;
• sample sum of squares $SS = \sum_{i=1}^{n} (X_i - \bar X)^2$, with realization $SS_{\mathrm{obs}} = \sum_{i=1}^{n} (x_i - \bar x)^2$;
• sample sum of absolute deviations $SAD = \sum_{i=1}^{n} |X_i - X_{0.5}|$, with realization $SAD_{\mathrm{obs}} = \sum_{i=1}^{n} |x_i - x_{0.5}|$;
• sample arithmetic average deviation $MAD = \frac{1}{n}\sum_{i=1}^{n} |X_i - X_{0.5}|$, with realization $MAD_{\mathrm{obs}} = SAD_{\mathrm{obs}}/n$;
• sample range $D = X_{\max} - X_{\min}$, with realization $d_{\mathrm{obs}} = x_{\max} - x_{\min}$;
• sample interquartile range $D_Q = Q_3 - Q_1$, with realization $d_Q = x_{0.75} - x_{0.25}$; the distribution is (between the quartiles) symmetric if $x_{0.75} - x_{0.50} = x_{0.50} - x_{0.25}$, positively skewed if $x_{0.75} - x_{0.50} > x_{0.50} - x_{0.25}$, and negatively skewed if $x_{0.75} - x_{0.50} < x_{0.50} - x_{0.25}$;
• sample decile range $D_D = X_{0.9} - X_{0.1}$, with realization $d_D = x_{0.9} - x_{0.1}$;
• sample percentile range $D_P = X_{0.99} - X_{0.01}$, with realization $d_P = x_{0.99} - x_{0.01}$.

Robust spread (variability) characteristics are
• the sample $\gamma$-truncated variance $S^2_g$, with realization $s^2_g$ calculated as
\[
s^2_g = \frac{1}{n - 2g - 1}\sum_{i=g+1}^{n-g} \left(x_{(i)} - \bar x_g\right)^2;
\]
more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-truncated variance to become arbitrarily large or small relative to the original $s^2$ (the breakdown point of $s^2_g$ is $\gamma$); it holds that $s^2_g < s^2$ because truncation removes outliers;
• the sample $\gamma$-winsorized variance $S^2_w$, with realization $s^2_w$; more than $\gamma \cdot 100\,\%$ of the observations must be replaced for the $\gamma$-winsorized variance to become arbitrarily large or small relative to the original $s^2$ (the breakdown point of $s^2_w$ is $\gamma$); it holds that $s^2_w < s^2$ because winsorization pulls outliers closer to the mean;
• the sample quartile coefficient of variation $V_{k,Q} = (Q_3 - Q_1)/Q_2$, with realization
\[
v_{k,Q} = \frac{x_{0.75} - x_{0.25}}{x_{0.50}}.
\]
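Before the remaining robust characteristics and Exercise 3, here is a possible sketch of the basic spread characteristics defined above (variance, standard deviation, skewness and kurtosis), again written without the corresponding built-in functions; the heights vector reuses the data from Exercise 2 and the my_* names are only illustrative.

heights <- c(131, 132, 135, 141, 141, 141, 141, 142, 143, 146, 146, 151)

my_var <- function(x) {
  m <- sum(x) / length(x)
  sum((x - m)^2) / (length(x) - 1)   # s^2 with denominator n - 1
}

my_sd <- function(x) sqrt(my_var(x))

# sample skewness b1 = sqrt(n) * sum(d^3) / (sum(d^2))^(3/2), d = x - mean
my_skewness <- function(x) {
  n <- length(x)
  d <- x - sum(x) / n
  sqrt(n) * sum(d^3) / sum(d^2)^(3/2)
}

# sample (excess) kurtosis b2 = n * sum(d^4) / (sum(d^2))^2 - 3
my_kurtosis <- function(x) {
  n <- length(x)
  d <- x - sum(x) / n
  n * sum(d^4) / sum(d^2)^2 - 3
}

my_var(heights)       # sample variance s^2
my_sd(heights)        # sample standard deviation s
my_skewness(heights)  # b1
my_kurtosis(heights)  # b2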
Other robust spread characteristics, characterized by modified boundaries, are
• the sample robust minimum and maximum ("inner boundaries") $X^*_{\min} = B_D = Q_1 - 1.5\,D_Q$ and $X^*_{\max} = B_H = Q_3 + 1.5\,D_Q$, with realizations defined as
\[
x^*_{\min} = b_D = x_{0.25} - 1.5\,(x_{0.75} - x_{0.25}), \qquad x^*_{\max} = b_H = x_{0.75} + 1.5\,(x_{0.75} - x_{0.25});
\]
values outside these boundaries are considered suspicious, potential outliers;
• the sample robust minimum and maximum ("outer boundaries") defined as $B^*_D = Q_1 - 3(Q_3 - Q_1)$ and $B^*_H = Q_3 + 3(Q_3 - Q_1)$, with realizations $b^*_D = x_{0.25} - 3\,d_Q$ and $b^*_H = x_{0.75} + 3\,d_Q$;
  – if there are any $x_i < b^*_D \vee x_i > b^*_H$, we call them distant values (the symbol $\vee$ means "or" and the symbol $\wedge$ means "and");
  – if $x_i \in [b^*_D, b_D) \vee (b_H, b^*_H]$, these are outer values;
  – if $x_i \in [b_D, b_H]$, these are inner values, or values close to the median;
  – the normal distribution has the property $B_H - B_D = (Q_3 + 1.5\,D_Q) - (Q_1 - 1.5\,D_Q) = 4\,D_Q \doteq 4.2$; the probability that $x_i \notin [B_D, B_H]$ is then 0.04;
• the sample robust skewness $B_{1Q}$ and $B_{1O}$ and their variances under asymptotic normality $\mathrm{Var}_{as}(B_{1\cdot})$, where $\cdot = Q$ or $O$, with realizations defined as
  – the quartile skewness
\[
b_{1Q} = \frac{(x_{0.75} - x_{0.50}) - (x_{0.50} - x_{0.25})}{x_{0.75} - x_{0.25}}, \qquad \mathrm{Var}_{as}(b_{1Q}) = 1.84,
\]
  – the octile skewness
\[
b_{1O} = \frac{(x_{0.875} - x_{0.50}) - (x_{0.50} - x_{0.125})}{x_{0.875} - x_{0.125}}, \qquad \mathrm{Var}_{as}(b_{1O}) = 1.15.
\]

Exercise 3 (height of 10-year-old girls). Calculate all spread characteristics for the sample of heights of 10-year-old girls.

3 Basics of Probability

Exercise 4 (Simple random sample). In a simple random sample of size $n$ from a population of finite size $N$, each element has an equal probability of being chosen. If we avoid choosing any member of the population more than once, we call it a simple random sample without replacement, i.e. an $n$-combination without replacement from the $N$ members of the set $M$ (Dalgaard 2008). If we put a member back into the population after choosing it, we talk about a simple random sample with replacement, i.e. an $n$-combination with replacement from the $N$ members of the set $M$. Let $M$ be a set with $N = 10$ elements from which we want to choose $n = 3$ elements (a) without replacement and (b) with replacement. How many combinations are there? What do these combinations look like if $M = \{1, 2, \ldots, 10\}$? Do the same for $N = 100$, $n = 30$ and the set $M = \{1, 2, \ldots, 100\}$.

Solution without code:
(a) The number of combinations is $\binom{N}{n}$. If $N = 10$ and $n = 3$, then $\binom{N}{n} = \frac{N!}{(N-n)!\,n!} = \binom{10}{3} = 120$.
(b) The number of combinations with replacement is $\binom{N+n-1}{n}$. If $N = 10$ and $n = 3$, then $\binom{N+n-1}{n} = \frac{(N+n-1)!}{(N-1)!\,n!} = \binom{10+3-1}{3} = 220$. If $N = 100$ and $n = 30$, then $\binom{N+n-1}{n} = \binom{100+30-1}{30} \doteq 2.009491 \times 10^{29}$.

Hints. choose(n,k), combn(n,k) (requires the library utils), sample(x=..., size=..., replace=...)

Exercise 5 (Simple random sample). A group of people is labeled by identification numbers (ID) from 1 to 30. Choose (a) 5 people out of 30 at random without replacement, (b) 5 people out of 30 at random with replacement, and finally (c) 5 people out of 30 at random without replacement, where people with IDs 28 to 30 have a 4× higher probability of being chosen than people with IDs 1 to 27.
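For Exercises 4 and 5, one possible sketch using choose, combn and sample is shown below; the probability weights in part (c) are passed through the prob argument of sample, and set.seed is used only so that the draws are reproducible.

# Exercise 4: numbers of combinations
choose(10, 3)             # 120 combinations without replacement
choose(10 + 3 - 1, 3)     # 220 combinations with replacement
choose(100 + 30 - 1, 30)  # approx. 2.009491e+29

# listing the combinations themselves (feasible only for small N)
combn(1:10, 3)            # each column is one 3-element combination

# Exercise 5: sampling IDs 1 to 30
set.seed(1)                              # reproducibility only
sample(1:30, size = 5, replace = FALSE)  # (a) without replacement
sample(1:30, size = 5, replace = TRUE)   # (b) with replacement
w <- c(rep(1, 27), rep(4, 3))            # IDs 28 to 30 are 4x more likely
sample(1:30, size = 5, replace = FALSE, prob = w)  # (c) weighted, without replacement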
Exercise 6 (Normal distribution). Let $X$ be a random variable (it could represent, for example, adult height) and assume it is normally distributed with parameters $\mu$ (expectation, or mean) and $\sigma^2$ (variance), which is written as $X \sim N(\mu, \sigma^2)$, with $\mu = 140.83$ and $\sigma^2 = 33.79$. The normal distribution is a probability model for this random variable. Calculate the probability
\[
\Pr(a < X < b) = \Pr(X < b) - \Pr(X < a) = F_X(b) - F_X(a),
\]
where $a = \mu - k\sigma$, $b = \mu + k\sigma$, $k = 1, 2, 3$. (Note that $\Pr(a < X < b) = \Pr(a \le X \le b)$, because the probability of a single point (here $a$ or $b$) is zero for continuous random variables, i.e. $\Pr(X = a) = \Pr(X = b) = 0$; this does not apply to discrete random variables.) Write a function that takes the parameters $\mu$, $\sigma$, $a$ and $b$ and calculates the probability $\Pr(a < X < b)$.

Partial solution:
$a = \mu - \sigma = 135.0171$, $b = \mu + \sigma = 146.6429$, $\Pr(|X - \mu| > \sigma) = 0.3173$, $\Pr(|X - \mu| < \sigma) = 1 - 0.3173 = 0.6827$,
$a = \mu - 2\sigma = 129.2042$, $b = \mu + 2\sigma = 152.4558$, $\Pr(|X - \mu| > 2\sigma) = 0.0455$, $\Pr(|X - \mu| < 2\sigma) = 1 - 0.0455 = 0.9545$,
$a = \mu - 3\sigma = 123.3913$, $b = \mu + 3\sigma = 158.2687$, $\Pr(|X - \mu| > 3\sigma) = 0.0027$, $\Pr(|X - \mu| < 3\sigma) = 1 - 0.0027 = 0.9973$.

mu  <- 0
sig <- 1
bin <- seq(mu - 3 * sig, mu + 3 * sig, by = sig)
pnorm(bin[7]) - pnorm(bin[1])  # 0.9973002
pnorm(bin[6]) - pnorm(bin[2])  # 0.9544997
pnorm(bin[5]) - pnorm(bin[3])  # 0.6826895

The probabilities 68.27 − 95.45 − 99.73 (in %) are called the empirical rule, or the 3-sigma rule.

Exercise 7 (Normal distribution). Let $X \sim N(\mu, \sigma^2)$, where $\mu = 150$ and $\sigma^2 = 6.25$. Calculate $a = \mu - x_{1-\alpha/2}\,\sigma$ and $b = \mu + x_{1-\alpha/2}\,\sigma$ so that $\Pr(a \le X \le b) = 1 - \alpha$ equals 0.90, 0.95 and 0.99. The value $x_{1-\alpha/2}$ is a quantile of the standardized normal distribution, i.e. $\Pr\left(Z = \frac{X-\mu}{\sigma} < x_{1-\alpha/2}\right) = 1 - \alpha/2$, $Z \sim N(0, 1)$. Similarly to the previous exercise, write a function that takes the parameters $\mu$, $\sigma$ and $\alpha$ and returns the values $a$ and $b$ as a vector.
Hints. qnorm(alpha)

This gives us the 90 − 95 − 99 rule (the so-called adjusted 3-sigma rule). We used the property
\[
\Pr\left(u_{\alpha/2} < Z < u_{1-\alpha/2}\right) = \Phi\left(u_{1-\alpha/2}\right) - \Phi\left(u_{\alpha/2}\right) = 1 - \alpha,
\]
where $\Phi$ is the cumulative distribution function of the standardized normal distribution and in general $\alpha \in (0, 1/2)$; in this exercise we used $\alpha = 0.1$, $0.05$ and $0.01$.

Exercise 8 (Interactive Normal Distribution). Create an interactive Shiny application that uses the function defined in Exercise 6 and shows the probability $\Pr(X > 0)$, where $\mu$ and $\sigma$ can be set interactively by the user.
Hints. Download the folder normalplot-nographics from the study materials. Run it from the R console with the command runApp('normalplot-nographics', display.mode="showcase") from the parent directory of normalplot-nographics (use the function setwd to set your working directory if needed). Use code from the previous examples in server.R to finish the exercise.

Listing 1: ui.R

# if missing, install with the command install.packages('shiny')
library(shiny)

# Define UI
shinyUI(fluidPage(

  # Application title
  titlePanel("Normal Distribution"),

  # Numeric inputs for the parameters sigma and mu
  numericInput("sig",
               "sigma:",
               min = 0.1,
               max = 3,
               step = 0.1,
               value = 1),
  numericInput("mu",
               "mean:",
               min = -4,
               max = 4,
               value = 0,
               step = 0.5),

  textOutput("myTextOutput")
))

Listing 2: server.R

library(shiny)

# Define server logic
shinyServer(function(input, output) {

  output$myTextOutput <- renderText({
    # TODO: use your code from the previous exercise to show probabilities
    paste0('My sigma is ', input$sig)
  })
})
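One way to finish Exercise 8 is to replace the TODO in server.R with the probability calculation, for example via pnorm as in Exercise 6; the sketch below is only one possibility, not the official solution from the study materials.

library(shiny)

# Define server logic
shinyServer(function(input, output) {

  output$myTextOutput <- renderText({
    # Pr(X > 0) for X ~ N(mu, sigma^2) with the user-supplied parameters
    p <- 1 - pnorm(0, mean = input$mu, sd = input$sig)
    paste0("Pr(X > 0) = ", round(p, 4))
  })
})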
Exercise 9 (Binomial Distribution). Assume that the number of people preferring treatment A over treatment B follows a binomial distribution with parameters $p$ (the probability of success) and $N$ (the number of independent trials), denoted $\mathrm{Bin}(N, p)$, where $N = 20$ and $p = 0.5$, i.e. people prefer both treatments equally. (a) What is the probability that 16 or more patients will prefer treatment A over treatment B? (b) What is the probability that 16 or more, or 4 or fewer, patients will prefer treatment A over treatment B?

Solution without code:
(a)
\[
\Pr(X \ge 16) = 1 - \Pr(X < 16) = 1 - \Pr(X \le 15) = 1 - \sum_{i:\,x_i \le 15} \Pr(X = x_i) = 1 - \sum_{i:\,x_i \le 15} \binom{N}{x_i} p^{x_i} (1-p)^{N - x_i} = 1 - \sum_{i:\,x_i \le 15} \binom{20}{x_i} 0.5^{x_i} (1 - 0.5)^{20 - x_i} = 0.0059.
\]
(b)
\[
\Pr(X \le 4 \vee X \ge 16) = 1 - \sum_{i:\,x_i \le 15} \Pr(X = x_i) + \sum_{i:\,x_i \le 4} \Pr(X = x_i) = 0.012.
\]
This probability is twice the previous one, since $\mathrm{Bin}(N, 0.5)$ is symmetric around its mean $Np = 10$.

Hints. pbinom(x, size=..., prob=...) gives you the probability $\Pr(X \le x)$. How do you get the probability $\Pr(X \ge x)$ using this function?

Exercise 10 (Binomial Distribution). Assume that $\Pr(\text{swirl}) = 0.533 = p_1$ is the probability of having the dermatological pattern swirl on the right thumb in the male population and $\Pr(\text{other}) = 0.467 = p_2$ is the probability of other patterns on the right thumb in the same population. The random variable $X$ represents the number of swirls and $Y$ the number of other patterns, where $X \sim \mathrm{Bin}(N, p_1)$ and $Y \sim \mathrm{Bin}(N, p_2)$. Calculate
1. $\Pr(X \le 120)$ if $N = 300$,
2. $\Pr(Y \le 120)$ if $N = 300$.

Exercise 11 (Normal approximation of the binomial distribution). Let $\Pr(\text{man}) = 0.515$ be the proportion of men in a population and $\Pr(\text{woman}) = 0.485$ the proportion of women. Let $X$ represent the number of men and $Y$ the number of women. Under the assumption of the model $\mathrm{Bin}(N, p)$, calculate
1. $\Pr(X \le 3)$ if $N = 5$,
2. $\Pr(X \le 5)$ if $N = 10$,
3. $\Pr(X \le 25)$ if $N = 50$.
Compare these probabilities with those approximated by the normal distribution $N(Np, Npq)$, where $q = 1 - p$. (Approximation means "similar but not exactly equal": we approximate one distribution with a different one that has certain advantages over the approximated one, or we approximate data with some distribution that describes the data with the help of easily interpretable parameters.)
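For Exercises 9 to 11, a possible sketch of the exact binomial calculations and of the normal approximation is given below; note that pnorm expects the standard deviation sqrt(Npq), not the variance Npq, and the helper name binom_vs_normal is only illustrative.

# Exercise 9: Bin(20, 0.5)
1 - pbinom(15, size = 20, prob = 0.5)   # (a) Pr(X >= 16) = 0.0059
(1 - pbinom(15, size = 20, prob = 0.5)) +
  pbinom(4, size = 20, prob = 0.5)      # (b) Pr(X <= 4 or X >= 16) = 0.012

# Exercise 10: dermatological patterns, N = 300
pbinom(120, size = 300, prob = 0.533)   # Pr(X <= 120)
pbinom(120, size = 300, prob = 0.467)   # Pr(Y <= 120)

# Exercise 11: exact binomial probability vs. normal approximation N(Np, Npq)
binom_vs_normal <- function(x, N, p) {
  exact  <- pbinom(x, size = N, prob = p)
  approx <- pnorm(x, mean = N * p, sd = sqrt(N * p * (1 - p)))
  c(exact = exact, normal_approx = approx)
}

binom_vs_normal(3,  N = 5,  p = 0.515)
binom_vs_normal(5,  N = 10, p = 0.515)
binom_vs_normal(25, N = 50, p = 0.515)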