Part III. Basics of probability theory

CHAPTER 3: BASICS OF PROBABILITY THEORY
prof. Jozef Gruska, IV054

PROBABILITY INTUITIVELY

Intuitively, the probability of an event E is the ratio of the number of favorable elementary events involved in E to the number of all possible elementary events.

Pr(E) = (number of favorable elementary events involved in E) / (number of all possible elementary events)

Any probabilistic statement must refer to an underlying probability space - a space whose elements have an assigned probability.

PROBABILITY SPACES

A probability space is defined in terms of a sample space Ω (often with an algebraic structure - for example, the outcomes of a die toss) and a probability measure (probability distribution) defined on Ω. Subsets of a sample space Ω are called events. Elements of Ω are referred to as elementary events.

Intuitively, the sample space represents the set of all possible outcomes of a probabilistic experiment - for example, of a die toss. An event represents a collection (a subset) of possible outcomes.

Intuitively - again, the probability of an event E is the ratio of the number of (favorable) elementary events involved in E to the number of all possible elementary events.

PROBABILITY THEORY

Probability theory took almost 300 years to develop from the intuitive ideas of Pascal, Fermat and Huygens, around 1650, to the currently accepted axiomatic definition of probability (due to A. N. Kolmogorov in 1933).

AXIOMATIC APPROACH - I.

Axiomatic approach: A probability distribution on a set Ω is any function Pr : 2^Ω → [0, 1] satisfying the following axioms (of Kolmogorov):
1 Pr({x}) ≥ 0 for any elementary event x;
2 Pr(Ω) = 1;
3 Pr(A ∪ B) = Pr(A) + Pr(B) if A, B ⊆ Ω, A ∩ B = ∅.

Example: Probabilistic experiment - die tossing; elementary events - outcomes of the die toss; probability distribution - {p1, p2, p3, p4, p5, p6}, Σ_{i=1}^{6} pi = 1.

In general, a sample space is an arbitrary set. However, we often need (wish) to consider only some family of all possible events of 2^Ω. The fact that not every collection of events leads to a well-defined probability space leads to the concepts presented on the next slide.

AXIOMATIC APPROACH - II.

Definition: A σ-field (Ω, F) consists of a sample space Ω and a collection F of subsets of Ω satisfying the following conditions:
1 ∅ ∈ F;
2 ε ∈ F ⇒ ε̄ ∈ F (where ε̄ = Ω − ε is the complement of ε);
3 ε1, ε2, . . . ∈ F ⇒ (ε1 ∪ ε2 ∪ . . .) ∈ F.

Consequence: A σ-field is closed under countable unions and intersections.

Definition: A probability measure (distribution) Pr : F → R≥0 on a σ-field (Ω, F) is a function satisfying the following conditions:
1 If ε ∈ F, then 0 ≤ Pr(ε) ≤ 1.
2 Pr[Ω] = 1.
3 For mutually disjoint events ε1, ε2, . . . it holds Pr(⋃_i εi) = Σ_i Pr(εi).

Definition: A probability space (Ω, F, Pr) consists of a σ-field (Ω, F) with a probability measure Pr defined on it.

PROBABILITIES and their PROPERTIES - I.

Properties:
Pr(ε̄) = 1 − Pr(ε);
Pr(ε1 ∪ ε2) = Pr(ε1) + Pr(ε2) − Pr(ε1 ∩ ε2);
Pr(⋃_{i≥1} εi) ≤ Σ_{i≥1} Pr(εi).

Definition: The conditional probability of an event ε1 given an event ε2 is defined by
Pr[ε1|ε2] = Pr[ε1 ∩ ε2] / Pr[ε2]  if Pr[ε2] > 0.
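These definitions are easy to experiment with on a finite sample space. The following is a minimal Python sketch, assuming a finite Ω represented as a dictionary of elementary-event probabilities; the helper names prob and cond_prob are illustrative and not part of the slides.

from fractions import Fraction

# A finite probability space: elementary events of a fair die with their probabilities.
omega = {outcome: Fraction(1, 6) for outcome in range(1, 7)}
assert sum(omega.values()) == 1          # axiom 2: Pr(Omega) = 1

def prob(event):
    """Pr(E): sum of the probabilities of the elementary events in E (axiom 3)."""
    return sum(omega[x] for x in event)

def cond_prob(e1, e2):
    """Pr(e1 | e2) = Pr(e1 & e2) / Pr(e2), defined only when Pr(e2) > 0."""
    assert prob(e2) > 0
    return prob(e1 & e2) / prob(e2)

even = {2, 4, 6}
at_least_4 = {4, 5, 6}
print(prob(even | at_least_4))       # 2/3 = Pr(even) + Pr(at_least_4) - Pr(intersection)
print(cond_prob(even, at_least_4))   # 2/3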
Theorem (Law of total probability): Let ε1, ε2, . . . , εk be a partition of the sample space Ω. Then for any event ε
Pr[ε] = Σ_{i=1}^{k} Pr[ε|εi] · Pr[εi].

PROBABILITIES and their PROPERTIES - II.

Theorem (Bayes' Rule/Law):
(a) Pr(ε1) · Pr(ε2|ε1) = Pr(ε2) · Pr(ε1|ε2)  - basic equality
(b) Pr(ε2|ε1) = Pr(ε2) · Pr(ε1|ε2) / Pr(ε1)  - simple version
(c) Pr[ε0|ε] = Pr[ε0 ∩ ε] / Pr[ε] = Pr[ε|ε0] · Pr[ε0] / Σ_{i=1}^{k} Pr[ε|εi] · Pr[εi]  - extended version

Definition (Independence):
1 Two events ε1, ε2 are called independent if Pr(ε1 ∩ ε2) = Pr(ε1) · Pr(ε2).
2 A collection of events {εi | i ∈ I} is independent if for all subsets S ⊆ I
Pr(⋂_{i∈S} εi) = Π_{i∈S} Pr[εi].

MODERN (BAYESIAN) INTERPRETATION of BAYES RULE

Bayes' rule for the entire process of learning from evidence has the form
Pr[ε1|ε] = Pr[ε1 ∩ ε] / Pr[ε] = Pr[ε|ε1] · Pr[ε1] / Σ_{i=1}^{k} Pr[ε|εi] · Pr[εi].

In modern terms the last equation says that Pr[ε1|ε], the probability of a hypothesis ε1 (given the information ε), equals Pr(ε1), our initial estimate of its probability, times Pr[ε|ε1], the probability of the new piece of information under the hypothesis ε1, divided by the sum of the probabilities of the data under all possible hypotheses εi.

TWO BASIC INTERPRETATIONS of PROBABILITY

In the frequentist interpretation, probability is defined with respect to a large number of trials, each producing one outcome from a set of possible outcomes - the probability of an event A, Pr(A), is the proportion of trials producing an outcome in A.

In the Bayesian interpretation, probability measures a degree of belief. Bayes' theorem then links the degree of belief in a proposition before and after receiving additional evidence that the proposition holds.

EXAMPLE 1

Let us toss two fair dice, one after another, and let
ε1 be the event that the sum of both tosses is ≥ 10,
ε2 be the event that the first toss gives 5.
What are Pr(ε1), Pr(ε2), Pr(ε1|ε2), Pr(ε1 ∩ ε2)?

Pr(ε1) = 6/36, Pr(ε2) = 1/6, Pr(ε1|ε2) = 2/6, Pr(ε1 ∩ ε2) = 2/36.

EXAMPLE 2

Three coins are given - two fair ones and a third one on which heads lands with probability 2/3 - but we do not know which coin is the biased one. Suppose that in an experiment we flip all three coins and the first two come up heads while the third comes up tails. What is the probability that the first coin is the biased one?

Let εi be the event that the i-th coin is biased and B be the event that the three coin flips came up heads, heads, tails. Before flipping the coins we have Pr(εi) = 1/3 for all i. After flipping the coins we have
Pr(B|ε1) = Pr(B|ε2) = (2/3) · (1/2) · (1/2) = 1/6,
Pr(B|ε3) = (1/2) · (1/2) · (1/3) = 1/12,
and using Bayes' law we have
Pr(ε1|B) = Pr(B|ε1) · Pr(ε1) / Σ_{i=1}^{3} Pr(B|εi) · Pr(εi) = (1/6 · 1/3) / (1/6 · 1/3 + 1/6 · 1/3 + 1/12 · 1/3) = 2/5.

Therefore, the outcome of the three coin flips increases the likelihood that the first coin is biased from 1/3 to 2/5.
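The posterior in Example 2 can be checked mechanically with the extended version of Bayes' rule. A minimal sketch, assuming the three-coin setup above; the function name posterior is illustrative.

from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' rule (extended version): Pr(eps_i | B) for every hypothesis eps_i."""
    total = sum(p * l for p, l in zip(priors, likelihoods))   # Pr(B), by the law of total probability
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Example 2: which of the three coins is biased, given the flips (heads, heads, tails)?
priors = [Fraction(1, 3)] * 3                                     # Pr(eps_i) before the experiment
likelihoods = [Fraction(1, 6), Fraction(1, 6), Fraction(1, 12)]   # Pr(B | eps_i)
print(posterior(priors, likelihoods))                             # [2/5, 2/5, 1/5]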
THEOREM

Let A and B be two events with Pr(B) ≠ 0. Events A and B are independent if and only if Pr(A|B) = Pr(A).

Proof: Assume that A and B are independent and Pr(B) ≠ 0. By definition we have Pr(A ∩ B) = Pr(A) · Pr(B) and therefore
Pr(A|B) = Pr(A ∩ B) / Pr(B) = Pr(A) · Pr(B) / Pr(B) = Pr(A).
Assume now that Pr(A|B) = Pr(A) and Pr(B) ≠ 0. Then
Pr(A) = Pr(A|B) = Pr(A ∩ B) / Pr(B),
and multiplying by Pr(B) we get Pr(A ∩ B) = Pr(A) · Pr(B), and so A and B are independent.

SUMMARY

The notion of conditional probability, of A given B, was introduced in order to get an instrument for analyzing an experiment A when one has partial information B about the outcome of the experiment A before the experiment has finished.

We say that two events A and B are independent if the probability of A is equal to the probability of A given B.

Other fundamental instruments for the analysis of probabilistic experiments are random variables, as functions from the sample space to R, and the expectation of a random variable, as the weighted average of the values of the random variable.

MONTY HALL PARADOX

Let us assume that you see three doors D1, D2 and D3 and you know that behind one door is a car and behind the other two are goats. You have a chance to choose one door. If it is the door with the car behind it, the car is yours; if it is a door with a goat behind it, you will have to milk the goat for one year. Which door will you choose?

Let us now assume that you have chosen the door D1. Now comes a moderator who knows where the car is and opens one of the doors D2 or D3, say D2, and you see that a goat is behind it. Let us assume that you now get a chance to change your choice of the door. Should you do that?

Let C1 denote the event that the car is behind the door D1, and C3 the event that the car is behind the door D3. Let M2 denote the event that the moderator opens the door D2. Let us assume that the moderator chooses a door at random if goats are behind both doors he could open. In such a case we have
Pr[C1] = 1/3 = Pr[C3], Pr[M2|C1] = 1/2, Pr[M2|C3] = 1.
Then it holds (the case that the car is behind D2 contributes nothing to Pr[M2], since the moderator never opens the door hiding the car)
Pr[C1|M2] = Pr[M2|C1]·Pr[C1] / Pr[M2] = Pr[M2|C1]·Pr[C1] / (Pr[M2|C1]·Pr[C1] + Pr[M2|C3]·Pr[C3]) = (1/6) / (1/6 + 1/3) = 1/3.
Similarly
Pr[C3|M2] = Pr[M2|C3]·Pr[C3] / Pr[M2] = Pr[M2|C3]·Pr[C3] / (Pr[M2|C1]·Pr[C1] + Pr[M2|C3]·Pr[C3]) = (1/3) / (1/6 + 1/3) = 2/3.
Switching to the door D3 therefore doubles the chance of winning the car.

RANDOM VARIABLES - INFORMAL APPROACH

A random variable is a function defined on the elementary events of a probability space.

Example: In the case of two tosses of a fair six-sided die, the value of a random variable V can be the sum of the spots on the two rolls. The value of V can therefore be an integer from the interval [2, 12].

A random variable V with n potential values v1, v2, . . . , vn is characterized by a probability distribution p = (p1, p2, . . . , pn), where pi is the probability that V takes the value vi.

The concept of a random variable is one of the most important of modern science and technology.

DISTRIBUTION and DENSITY FUNCTIONS 1/2

Definition: A random variable X is a real-valued function over the sample space Ω [of a σ-field (Ω, F)], that is X : Ω → R, such that for all x ∈ R: {ω ∈ Ω | X(ω) ≤ x} ∈ F.

Definition: The distribution function FX : R → [0, 1] of a random variable X is defined as FX(x) = Pr[X ≤ x]. The density function pX : R → [0, 1] of a random variable X is defined as pX(x) = Pr[X = x].
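The two functions just defined are easy to compute for the informal example above. A minimal sketch, assuming the random variable V = sum of two fair dice; the helper names density and distribution are illustrative.

from itertools import product
from fractions import Fraction

# Sample space: ordered outcomes of two tosses of a fair six-sided die.
omega = list(product(range(1, 7), repeat=2))

def density(X):
    """The density function p_X: p_X(x) = Pr[X = x]."""
    p = {}
    for w in omega:
        p[X(w)] = p.get(X(w), Fraction(0)) + Fraction(1, len(omega))
    return p

def distribution(X, x):
    """The distribution function F_X: F_X(x) = Pr[X <= x]."""
    return sum(pr for value, pr in density(X).items() if value <= x)

V = lambda w: w[0] + w[1]          # the sum of the spots on the two rolls
print(density(V)[7])               # Pr[V = 7] = 6/36 = 1/6
print(distribution(V, 4))          # Pr[V <= 4] = (1 + 2 + 3)/36 = 1/6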
DISTRIBUTION and DENSITY FUNCTIONS 2/2

Example: Let Ω0 be the set of all possible outcomes of throwing three fair six-sided dice simultaneously. |Ω0| = ??? Let X(ω) (ω ∈ Ω0) be the sum of the numbers on the dice. For the density function pX(x) it holds:

x:      3      4      5       6       7       8       9      . . .
pX(x): 1/216  3/216  6/216  10/216  15/216  21/216  25/216   . . .

Definition: The joint distribution function FX,Y : R × R → [0, 1] of random variables X and Y is defined by
FX,Y(x, y) = Pr[{X ≤ x} ∧ {Y ≤ y}].
The joint density function PrX,Y : R × R → [0, 1] of random variables X and Y is defined by
PrX,Y(x, y) = Pr[{X = x} ∧ {Y = y}].

INDEPENDENCE of RANDOM VARIABLES

Definition: Two random variables X, Y are called independent random variables if for all x, y ∈ R
PrX,Y(x, y) = Pr[X = x] · Pr[Y = y].

EXPECTATION - MEAN of RANDOM VARIABLES

Definition: The expectation (mean or expected value) E[X] of a random variable X is defined as
E[X] = Σ_{ω∈Ω} X(ω) · Pr(ω).

Properties of the mean:
E[X + Y] = E[X] + E[Y].
E[c · X] = c · E[X].
E[X · Y] = E[X] · E[Y], if X, Y are independent.

The first of the above equalities is known as linearity of expectation. It can be extended to a finite number of random variables X1, . . . , Xn:
E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi],
and also to any countable set of random variables X1, X2, . . .: if Σ_{i=1}^{∞} E[|Xi|] < ∞, then Σ_{i=1}^{∞} |Xi| < ∞ and E[Σ_{i=1}^{∞} Xi] = Σ_{i=1}^{∞} E[Xi].

EXPECTATION VALUES

For any random variable X let RX be the set of values of X. Using RX one can show that
E[X] = Σ_{x∈RX} x · Pr(X = x).
Using that, one can show that for any reals a, b it holds
E[aX + b] = Σ_{x∈RX} (ax + b) · Pr(X = x) = a · Σ_{x∈RX} x · Pr(X = x) + b · Σ_{x∈RX} Pr(X = x) = a · E[X] + b.
The above relation is called weak linearity of expectation.

JENSEN's INEQUALITY

Example: We show that for any random variable X it holds
E[X²] ≥ (E[X])².
Indeed, let Y = (X − E[X])². Then
0 ≤ E[Y] = E[(X − E[X])²] = E[X² − 2X·E[X] + (E[X])²] = E[X²] − 2·E[X]·E[X] + (E[X])² = E[X²] − (E[X])².

This can be generalised as follows.

Theorem (Jensen's inequality - I): If f is a convex function, then E[f(X)] ≥ f(E[X]).

It holds also

Theorem (Jensen's inequality - II): If f is a concave function, then f(E[X]) ≥ E[f(X)].

EXAMPLE on JENSEN's INEQUALITIES

Suppose we flip n fair coins and want to get a lower bound on E[X²], where X is the number of heads. Since the function f : x → x² is convex, Jensen's inequality says
E[X²] ≥ (E[X])² = n²/4,
which is a pretty good result, because the exact value is E[X²] = n²/4 + n/4, as can easily be found using generating functions.

On the other hand, since lg x is a concave function, Jensen's inequality gives an upper bound for the same experiment:
E[lg X] ≤ lg E[X] = lg(n/2) = lg n − 1,
which is pretty close to the exact value.

INDICATOR VARIABLES

A random variable X is said to be an indicator variable if X takes on only the values 1 and 0. For any set A ⊂ S one can define an indicator variable XA that takes value 1 on A and 0 on S − A, where (S, Pr) is the underlying probability space. It holds:
E[XA] = Σ_{s∈S} XA(s) · Pr({s}) = Σ_{s∈A} XA(s) · Pr({s}) + Σ_{s∈S−A} XA(s) · Pr({s}) = Σ_{s∈A} 1 · Pr({s}) + Σ_{s∈S−A} 0 · Pr({s}) = Σ_{s∈A} Pr({s}) = Pr(A).
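A quick numerical check of the coin-flip example and of E[XA] = Pr(A): a minimal sketch, assuming n = 10 independent fair coins; the variable names are illustrative.

from fractions import Fraction
from math import comb

n = 10
# Density of X = number of heads in n fair coin flips: Pr(X = k) = C(n, k) / 2^n.
pX = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

EX  = sum(k * p for k, p in pX.items())
EX2 = sum(k * k * p for k, p in pX.items())
print(EX, EX2)                       # 5 and 55/2, i.e. n/2 and n^2/4 + n/4
print(EX2 >= EX**2)                  # Jensen's inequality: E[X^2] >= (E[X])^2

# Indicator variable X_A of the event A = "at least 8 heads": E[X_A] = Pr(A).
pr_A = sum(p for k, p in pX.items() if k >= 8)
E_XA = sum((1 if k >= 8 else 0) * p for k, p in pX.items())
print(pr_A == E_XA)                  # True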
VARIANCE and STANDARD DEVIATION

Definition: For a random variable X the variance VX and the standard deviation σX are defined by
VX = E((X − EX)²),  σX = √VX.
Since
E((X − EX)²) = E(X² − 2X·EX + (EX)²) = E(X²) − 2(EX)² + (EX)² = E(X²) − (EX)²,
it holds
VX = E(X²) − (EX)².

Example: Let Ω = {1, 2, . . . , 10}, Pr(i) = 1/10, X(i) = i; Y(i) = i − 1 if i ≤ 5 and Y(i) = i + 1 otherwise. Then
EX = EY = 5.5, E(X²) = (1/10)·Σ_{i=1}^{10} i² = 38.5, E(Y²) = 44.5; VX = 8.25, VY = 14.25.

TWO RULES

For a random variable X and a real number c it holds
V(cX) = c²·V(X);  σ(cX) = |c|·σ(X).
For independent random variables X and Y it holds
V(X + Y) = V(X) + V(Y);  σ(X + Y) = √(V(X) + V(Y)).

MOMENTS

Definition: For k ∈ N the k-th moment m^k_X and the k-th central moment µ^k_X of a random variable X are defined as follows:
m^k_X = E[X^k],  µ^k_X = E((X − EX)^k).
The mean of a random variable X is sometimes denoted by µX = m^1_X and its variance by µ^2_X.

EXAMPLE I

Each week there is a lottery that always sells 100 tickets. One of the tickets wins 100 million, all other tickets win nothing. What is better: to buy two tickets in one week (Strategy I), or one ticket in each of two different weeks (Strategy II)? Or is neither of the two strategies better than the other?

EXAMPLE II

With Strategy I we win (in millions)
0 with probability 0.98
100 with probability 0.02
With Strategy II we win (in millions)
0 with probability 0.9801 = 0.99 · 0.99
100 with probability 0.0198 = 2 · 0.01 · 0.99
200 with probability 0.0001 = 0.01 · 0.01
The variance of Strategy I is 196; the variance of Strategy II is 198.

PROBABILITY GENERATING FUNCTION

The probability density function of a random variable X whose values are natural numbers can be represented by the following probability generating function (PGF):
GX(z) = Σ_{k≥0} Pr(X = k) · z^k.

Main properties:
GX(1) = 1;
EX = Σ_{k≥0} k · Pr(X = k) = Σ_{k≥0} Pr(X = k) · (k · 1^{k−1}) = G'X(1).
Since it holds
E(X²) = Σ_{k≥0} k² · Pr(X = k) = Σ_{k≥0} Pr(X = k) · (k·(k−1)·1^{k−2} + k·1^{k−1}) = G''X(1) + G'X(1),
we have
VX = G''X(1) + G'X(1) − (G'X(1))².

AN INTERPRETATION

Sometimes one can think of the expectation E[Y] of a random variable Y as the "best guess" or the "best prediction" of the value of Y. It is the "best guess" in the sense that among all constants m the expectation E[(Y − m)²] is minimal when m = E[Y].

WHY ARE PGF USEFUL?

Main reason: For many important probability distributions their PGFs are very simple and easy to work with. For example, for the uniform distribution on the set {0, 1, . . . , n − 1} the PGF has the form
Un(z) = (1/n)·(1 + z + . . . + z^{n−1}) = (1/n) · (1 − z^n)/(1 − z).
A problem arises with the case z = 1, where the closed form is undefined.
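A minimal numerical sketch of these PGF properties: it represents a PGF by the list of its coefficients Pr(X = k) and recovers EX and VX from the derivatives at z = 1; the helper name pgf_derivatives_at_1 is illustrative.

from fractions import Fraction

def pgf_derivatives_at_1(coeffs):
    """For G(z) = sum_k p_k z^k: return G(1), G'(1) = sum_k k*p_k, G''(1) = sum_k k(k-1)*p_k."""
    g  = sum(coeffs)
    g1 = sum(k * p for k, p in enumerate(coeffs))
    g2 = sum(k * (k - 1) * p for k, p in enumerate(coeffs))
    return g, g1, g2

# Uniform distribution on {0, 1, ..., n-1}: U_n(z) = (1/n)(1 + z + ... + z^(n-1)).
n = 6
g, g1, g2 = pgf_derivatives_at_1([Fraction(1, n)] * n)
EX = g1                              # (n - 1)/2 = 5/2
VX = g2 + g1 - g1**2                 # (n^2 - 1)/12 = 35/12
print(g, EX, VX)                     # 1, 5/2, 35/12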
WAY OUT

If G(z) = Σ_{n≥0} g_n·z^n is any power series that converges for at least one value of z with |z| > 1, then this property is shared also by G'(z) = Σ_{n≥0} n·g_n·z^{n−1} and by G''(z), G'''(z), . . . . By Taylor's theorem we have
G(1 + t) = G(1) + (G'(1)/1!)·t + (G''(1)/2!)·t² + . . . .
Therefore
Un(1 + t) = (1/n)·((1 + t)^n − 1)/t = (1/n)·C(n, 1) + (1/n)·C(n, 2)·t + (1/n)·C(n, 3)·t² + . . . + (1/n)·C(n, n)·t^{n−1},
where C(n, k) denotes the binomial coefficient, and consequently
Un(1) = 1,  U'n(1) = (n − 1)/2,  U''n(1) = (n − 1)(n − 2)/3.

PROPERTIES of GENERATING FUNCTIONS

Property 1: If X1, . . . , Xk are independent random variables with PGFs G1(z), . . . , Gk(z), then the random variable Y = Σ_{i=1}^{k} Xi has as its PGF the function
G(z) = Π_{i=1}^{k} Gi(z).

Property 2: Let X1, X2, . . . be a sequence of independent random variables with the same PGF GX(z). If Y is a random variable with PGF GY(z) and Y is independent of all Xi, then the random variable S = X1 + . . . + XY has as its PGF the function
GS(z) = GY(GX(z)).

IMPORTANT DISTRIBUTIONS

Two important distributions are connected with experiments, called Bernoulli trials, that have two possible outcomes:
success, with probability p;
failure, with probability q = 1 − p.
Coin tossing is an example of a Bernoulli trial.

1. Let the values of a random variable X be the number of trials needed to obtain a success. Then
Pr(X = k) = q^{k−1}·p.
Such a probability distribution is called the geometric distribution and such a variable a geometric random variable. It holds
EX = 1/p,  VX = q/p²,  G(z) = pz/(1 − qz).

2. Let the values of a random variable Y be the number of successes in n trials. Then
Pr(Y = k) = C(n, k)·p^k·q^{n−k}.
Such a probability distribution is called the binomial distribution, and it holds
EY = np,  VY = npq,  G(z) = (q + pz)^n,
and also
EY² = n(n − 1)·p² + np.

BERNOULLI DISTRIBUTION

Let X be a binary random variable (usually called a Bernoulli or indicator random variable) that takes value 1 with probability p and value 0 with probability q = 1 − p. Then it holds
E[X] = p,  VX = pq,  G(z) = q + pz.

BINOMIAL DISTRIBUTION revisited

Let X1, . . . , Xn be independent random variables having the Bernoulli distribution with the common parameter p. The random variable X = X1 + X2 + . . . + Xn has the so-called binomial distribution, denoted B(n, p), with the density function
B(k, n, p) = Pr(X = k) = C(n, k)·p^k·q^{n−k}.

POISSON DISTRIBUTION

Let λ ∈ R>0. The Poisson distribution with the parameter λ is the probability distribution with the density function
p(x) = λ^x·e^{−λ}/x!  for x = 0, 1, 2, . . .,  and p(x) = 0 otherwise.
For large n the Poisson distribution is a good approximation of the binomial distribution B(n, λ/n).
Properties of a Poisson random variable X:
E[X] = λ,  VX = λ,  G(z) = e^{λ(z−1)}.

EXPECTATION + VARIANCE OF SUMS OF RANDOM VARIABLES

Let Sn = Σ_{i=1}^{n} Xi, where each Xi is a random variable that takes on value 1 with probability p and value 0 with probability 1 − p = q. It clearly holds
E(Xi) = p,  E(Xi²) = p,
E(Sn) = E(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} E(Xi) = np,
E(Sn²) = E((Σ_{i=1}^{n} Xi)²) = E(Σ_{i=1}^{n} Xi² + Σ_{i≠j} Xi·Xj) = Σ_{i=1}^{n} E(Xi²) + Σ_{i≠j} E(Xi·Xj),
and therefore, since the Xi are pairwise independent (as is the case here), E(Xi·Xj) = E(Xi)·E(Xj) = p². Hence
E(Sn²) = np + 2·C(n, 2)·p² = np + n(n − 1)·p² = np(1 − p) + n²·p² = n²·p² + npq,
VAR[Sn] = E(Sn²) − (E(Sn))² = n²·p² + npq − n²·p² = npq.
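A minimal simulation sketch checking these formulas, assuming n independent Bernoulli(p) trials; the seed, n, p and the number of trials are arbitrary choices of the sketch.

import random
from statistics import mean, pvariance

random.seed(0)
n, p, trials = 20, 0.3, 200_000

# Simulate S_n = X_1 + ... + X_n for independent Bernoulli(p) variables X_i.
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]

print(mean(samples), n * p)                   # empirical mean      vs  E(S_n) = np = 6.0
print(pvariance(samples), n * p * (1 - p))    # empirical variance  vs  VAR[S_n] = npq = 4.2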
MOMENT INEQUALITIES

The following inequality, and several of its special cases, play a very important role in the analysis of randomized computations. Let X be a random variable that takes on values x with probability p(x).

Theorem: For any λ > 0 and any k ∈ N the so-called k-th moment inequality holds:
Pr[|X| > λ] ≤ E(|X|^k) / λ^k.

Proof of the above inequality:
E(|X|^k) = Σ_x |x|^k·p(x) ≥ Σ_{|x|>λ} |x|^k·p(x) ≥ λ^k·Σ_{|x|>λ} p(x) = λ^k·Pr[|X| > λ].

Two important special cases - I.1

Two important special cases of the moment inequality Pr[|X| > λ] ≤ E(|X|^k)/λ^k:

Case 1: k → 1, λ → λ·E(|X|):
Pr[|X| ≥ λ·E(|X|)] ≤ 1/λ    (Markov's inequality)

Case 2: k → 2, X → X − E(X), λ → λ·√V(X):
Pr[|X − E(X)| ≥ λ·√V(X)] ≤ E((X − E(X))²) / (λ²·V(X)) = V(X) / (λ²·V(X)) = 1/λ²    (Chebyshev's inequality)

Another variant of Chebyshev's inequality is
Pr[|X − E(X)| ≥ λ] ≤ V(X)/λ²,
and this is one of the main reasons why variance is used.

Two important special cases - I.2

The following generalization of the moment inequality is also of importance:

Theorem: If g(x) is non-negative and non-decreasing on [0, ∞), then
Pr[|X| > λ] ≤ E(g(|X|)) / g(λ).
As a special case, namely for g(x) = e^{tx} with t > 0, we get
Pr[|X| > λ] ≤ E(e^{t|X|}) / e^{tλ}    (basic Chernoff's inequality).

Chebyshev's inequalities are used to show that the values of a random variable lie close to its average with high probability. The bounds they provide are also called concentration bounds. Better bounds can usually be obtained using the Chernoff bounds discussed in Chapter 5.

FLIPPING COINS: EXAMPLES on CHEBYSHEV INEQUALITIES

Let X be the sum of n independent fair coins and let Xi be the indicator variable for the event that the i-th coin comes up heads. Then
E(Xi) = 1/2, E(X) = n/2, Var[Xi] = 1/4 and Var[X] = Σ_i Var[Xi] = n/4.
Chebyshev's inequality Pr[|X − E(X)| ≥ λ] ≤ V(X)/λ² for λ = n/2 gives
Pr[X = n] ≤ Pr[|X − n/2| ≥ n/2] ≤ (n/4) / (n/2)² = 1/n.

Let now n = 2^m − 1 for some m and let Y1, . . . , Ym be independent fair 0-1 random variables. For each non-empty subset S of {1, . . . , m}, let XS be the exclusive OR of all Yi for i ∈ S. Then the variables XS are pairwise independent and each XS has variance 1/4. The same Chebyshev analysis as above for independent coin flips, now with X = Σ_S XS, gives
Pr[|X − n/2| ≥ n/2] ≤ 1/n.

THE INCLUSION-EXCLUSION PRINCIPLE

Let A1, A2, . . . , An be events - not necessarily disjoint. The Inclusion-Exclusion principle, which also has a variety of applications, states that
Pr(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} Pr(Ai) − Σ_{i<j} Pr(Ai ∩ Aj) + Σ_{i<j<k} Pr(Ai ∩ Aj ∩ Ak) − . . . + (−1)^{n+1}·Pr(A1 ∩ A2 ∩ . . . ∩ An).
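A minimal sketch verifying the inclusion-exclusion principle for three concrete events over the two-dice sample space; the chosen events are arbitrary examples.

from fractions import Fraction
from itertools import product, combinations

omega = set(product(range(1, 7), repeat=2))           # two fair dice
prob = lambda event: Fraction(len(event), len(omega))

A1 = {w for w in omega if w[0] + w[1] >= 10}          # sum at least 10
A2 = {w for w in omega if w[0] == w[1]}               # doubles
A3 = {w for w in omega if w[0] == 6}                  # first die shows 6
events = [A1, A2, A3]

lhs = prob(A1 | A2 | A3)

# Right-hand side: alternating sum over all non-empty subsets of the events.
rhs = Fraction(0)
for r in range(1, len(events) + 1):
    for subset in combinations(events, r):
        rhs += (-1) ** (r + 1) * prob(set.intersection(*subset))

print(lhs == rhs, lhs)                                # True 13/36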