Part III. Basics of probability theory

CHAPTER 3: BASICS OF PROBABILITY THEORY
prof. Jozef Gruska, IV054

PROBABILITY INTUITIVELY

Intuitively, the probability of an event E is the ratio of the number of favorable elementary events involved in E to the number of all possible elementary events.

Pr(E) = (number of favorable elementary events involved in E) / (number of all possible elementary events)

Any probabilistic statement must refer to an underlying probability space - a space whose elements have an assigned probability.

PROBABILITY SPACES

A probability space is defined in terms of a sample space Ω (often with an algebraic structure - for example, the outcomes of a die toss) and a probability measure (probability distribution) defined on Ω. Subsets of a sample space Ω are called events. Elements of Ω are referred to as elementary events.

Intuitively, the sample space represents the set of all possible outcomes of a probabilistic experiment - for example, of a die toss. An event represents a collection (a subset) of possible outcomes.

Intuitively - again, the probability of an event E is the ratio of the number of (favorable) elementary events involved in E to the number of all possible elementary events.

PROBABILITY THEORY

Probability theory took almost 300 years to develop from the intuitive ideas of Pascal, Fermat and Huygens, around 1650, to the currently accepted axiomatic definition of probability (due to A. N. Kolmogorov in 1933).

AXIOMATIC APPROACH - I.

Axiomatic approach: A probability distribution on a set Ω is any function Pr : 2^Ω → [0, 1] satisfying the following axioms (of Kolmogorov):
1 Pr({x}) ≥ 0 for any elementary event x;
2 Pr(Ω) = 1;
3 Pr(A ∪ B) = Pr(A) + Pr(B) if A, B ⊆ Ω, A ∩ B = ∅.

Example: Probabilistic experiment - die tossing; elementary events - outcomes of the die toss; probability distribution - {p1, p2, p3, p4, p5, p6}, Σ_{i=1}^{6} pi = 1.

In general, a sample space is an arbitrary set. However, we often need (wish) to consider only some family of all possible events of 2^Ω. The fact that not every collection of events leads to a well-defined probability space leads to the concepts presented on the next slide.

AXIOMATIC APPROACH - II.

Definition: A σ-field (Ω, F) consists of a sample space Ω and a collection F of subsets of Ω satisfying the following conditions:
1 ∅ ∈ F;
2 ε ∈ F ⇒ ε̄ ∈ F (where ε̄ = Ω − ε is the complement of ε);
3 ε1, ε2, . . . ∈ F ⇒ (ε1 ∪ ε2 ∪ . . .) ∈ F.

Consequence: A σ-field is closed under countable unions and intersections.

Definition: A probability measure (distribution) Pr : F → R≥0 on a σ-field (Ω, F) is a function satisfying the following conditions:
1 If ε ∈ F, then 0 ≤ Pr(ε) ≤ 1.
2 Pr[Ω] = 1.
3 For mutually disjoint events ε1, ε2, . . . it holds Pr(⋃_i εi) = Σ_i Pr(εi).

Definition: A probability space (Ω, F, Pr) consists of a σ-field (Ω, F) with a probability measure Pr defined on it.

PROBABILITIES and their PROPERTIES - I.

Properties:
Pr(ε̄) = 1 − Pr(ε);
Pr(ε1 ∪ ε2) = Pr(ε1) + Pr(ε2) − Pr(ε1 ∩ ε2);
Pr(⋃_{i≥1} εi) ≤ Σ_{i≥1} Pr(εi).

Definition: The conditional probability of an event ε1 given an event ε2 is defined by
Pr[ε1|ε2] = Pr[ε1 ∩ ε2] / Pr[ε2]  if Pr[ε2] > 0.
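These definitions are easy to experiment with on a finite sample space. The following is a minimal Python sketch, assuming a finite Ω represented as a dictionary of elementary-event probabilities; the helper names prob and cond_prob are illustrative and not part of the slides.

from fractions import Fraction

# A finite probability space: elementary events of a fair die with their probabilities.
omega = {outcome: Fraction(1, 6) for outcome in range(1, 7)}
assert sum(omega.values()) == 1          # axiom 2: Pr(Omega) = 1

def prob(event):
    """Pr(E): sum of the probabilities of the elementary events in E (axiom 3)."""
    return sum(omega[x] for x in event)

def cond_prob(e1, e2):
    """Pr(e1 | e2) = Pr(e1 & e2) / Pr(e2), defined only when Pr(e2) > 0."""
    assert prob(e2) > 0
    return prob(e1 & e2) / prob(e2)

even = {2, 4, 6}
at_least_4 = {4, 5, 6}
print(prob(even | at_least_4))       # 2/3 = Pr(even) + Pr(at_least_4) - Pr(intersection)
print(cond_prob(even, at_least_4))   # 2/3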
Theorem (Law of total probability): Let ε1, ε2, . . . , εk be a partition of the sample space Ω. Then for any event ε
Pr[ε] = Σ_{i=1}^{k} Pr[ε|εi] · Pr[εi].

PROBABILITIES and their PROPERTIES - II.

Theorem (Bayes' Rule/Law):
(a) Pr(ε1) · Pr(ε2|ε1) = Pr(ε2) · Pr(ε1|ε2)  - basic equality
(b) Pr(ε2|ε1) = Pr(ε2) · Pr(ε1|ε2) / Pr(ε1)  - simple version
(c) Pr[ε0|ε] = Pr[ε0 ∩ ε] / Pr[ε] = Pr[ε|ε0] · Pr[ε0] / Σ_{i=1}^{k} Pr[ε|εi] · Pr[εi]  - extended version

Definition (Independence):
1 Two events ε1, ε2 are called independent if Pr(ε1 ∩ ε2) = Pr(ε1) · Pr(ε2).
2 A collection of events {εi | i ∈ I} is independent if for all subsets S ⊆ I
Pr(⋂_{i∈S} εi) = Π_{i∈S} Pr[εi].

MODERN (BAYESIAN) INTERPRETATION of BAYES RULE

Bayes' rule for the entire process of learning from evidence has the form
Pr[ε1|ε] = Pr[ε1 ∩ ε] / Pr[ε] = Pr[ε|ε1] · Pr[ε1] / Σ_{i=1}^{k} Pr[ε|εi] · Pr[εi].

In modern terms the last equation says that Pr[ε1|ε], the probability of a hypothesis ε1 (given the information ε), equals Pr(ε1), our initial estimate of its probability, times Pr[ε|ε1], the probability of the new piece of information under the hypothesis ε1, divided by the sum of the probabilities of the data under all possible hypotheses εi.

TWO BASIC INTERPRETATIONS of PROBABILITY

In the frequentist interpretation, probability is defined with respect to a large number of trials, each producing one outcome from a set of possible outcomes - the probability of an event A, Pr(A), is the proportion of trials producing an outcome in A.

In the Bayesian interpretation, probability measures a degree of belief. Bayes' theorem then links the degree of belief in a proposition before and after receiving additional evidence that the proposition holds.

EXAMPLE 1

Let us toss two fair dice, one after another, and let
ε1 be the event that the sum of both tosses is ≥ 10,
ε2 be the event that the first toss gives 5.
What are Pr(ε1), Pr(ε2), Pr(ε1|ε2), Pr(ε1 ∩ ε2)?

Pr(ε1) = 6/36, Pr(ε2) = 1/6, Pr(ε1|ε2) = 2/6, Pr(ε1 ∩ ε2) = 2/36.

EXAMPLE 2

Three coins are given - two fair ones and a third one on which heads lands with probability 2/3 - but we do not know which coin is the biased one. Suppose that in an experiment we flip all three coins and the first two come up heads while the third comes up tails. What is the probability that the first coin is the biased one?

Let εi be the event that the i-th coin is biased and B be the event that the three coin flips came up heads, heads, tails. Before flipping the coins we have Pr(εi) = 1/3 for all i. After flipping the coins we have
Pr(B|ε1) = Pr(B|ε2) = (2/3) · (1/2) · (1/2) = 1/6,
Pr(B|ε3) = (1/2) · (1/2) · (1/3) = 1/12,
and using Bayes' law we have
Pr(ε1|B) = Pr(B|ε1) · Pr(ε1) / Σ_{i=1}^{3} Pr(B|εi) · Pr(εi) = (1/6 · 1/3) / (1/6 · 1/3 + 1/6 · 1/3 + 1/12 · 1/3) = 2/5.

Therefore, the outcome of the three coin flips increases the likelihood that the first coin is biased from 1/3 to 2/5.
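The posterior in Example 2 can be checked mechanically with the extended version of Bayes' rule. A minimal sketch, assuming the three-coin setup above; the function name posterior is illustrative.

from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' rule (extended version): Pr(eps_i | B) for every hypothesis eps_i."""
    total = sum(p * l for p, l in zip(priors, likelihoods))   # Pr(B), by the law of total probability
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Example 2: which of the three coins is biased, given the flips (heads, heads, tails)?
priors = [Fraction(1, 3)] * 3                                     # Pr(eps_i) before the experiment
likelihoods = [Fraction(1, 6), Fraction(1, 6), Fraction(1, 12)]   # Pr(B | eps_i)
print(posterior(priors, likelihoods))                             # [2/5, 2/5, 1/5]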
THEOREM

Let A and B be two events with Pr(B) ≠ 0. Events A and B are independent if and only if Pr(A|B) = Pr(A).

Proof: Assume that A and B are independent and Pr(B) ≠ 0. By definition we have Pr(A ∩ B) = Pr(A) · Pr(B) and therefore
Pr(A|B) = Pr(A ∩ B) / Pr(B) = Pr(A) · Pr(B) / Pr(B) = Pr(A).
Assume now that Pr(A|B) = Pr(A) and Pr(B) ≠ 0. Then
Pr(A) = Pr(A|B) = Pr(A ∩ B) / Pr(B),
and multiplying by Pr(B) we get Pr(A ∩ B) = Pr(A) · Pr(B), and so A and B are independent.

SUMMARY

The notion of conditional probability, of A given B, was introduced in order to get an instrument for analyzing an experiment A when one has partial information B about the outcome of the experiment A before the experiment has finished.

We say that two events A and B are independent if the probability of A is equal to the probability of A given B.

Other fundamental instruments for the analysis of probabilistic experiments are random variables, as functions from the sample space to R, and the expectation of a random variable, as the weighted average of the values of the random variable.

MONTY HALL PARADOX

Let us assume that you see three doors D1, D2 and D3 and you know that behind one door is a car and behind the other two are goats. You have a chance to choose one door. If it is the door with the car behind it, the car is yours; if it is a door with a goat behind it, you will have to milk the goat for one year. Which door will you choose?

Let us now assume that you have chosen the door D1. Now comes a moderator who knows where the car is and opens one of the doors D2 or D3, say D2, and you see that a goat is behind it. Let us assume that you now get a chance to change your choice of the door. Should you do that?

Let C1 denote the event that the car is behind the door D1, and C3 the event that the car is behind the door D3. Let M2 denote the event that the moderator opens the door D2. Let us assume that the moderator chooses a door at random if goats are behind both doors he could open. In such a case we have
Pr[C1] = 1/3 = Pr[C3], Pr[M2|C1] = 1/2, Pr[M2|C3] = 1.
Then it holds (the case that the car is behind D2 contributes nothing to Pr[M2], since the moderator never opens the door hiding the car)
Pr[C1|M2] = Pr[M2|C1]·Pr[C1] / Pr[M2] = Pr[M2|C1]·Pr[C1] / (Pr[M2|C1]·Pr[C1] + Pr[M2|C3]·Pr[C3]) = (1/6) / (1/6 + 1/3) = 1/3.
Similarly
Pr[C3|M2] = Pr[M2|C3]·Pr[C3] / Pr[M2] = Pr[M2|C3]·Pr[C3] / (Pr[M2|C1]·Pr[C1] + Pr[M2|C3]·Pr[C3]) = (1/3) / (1/6 + 1/3) = 2/3.
Switching to the door D3 therefore doubles the chance of winning the car.

RANDOM VARIABLES - INFORMAL APPROACH

A random variable is a function defined on the elementary events of a probability space.

Example: In the case of two tosses of a fair six-sided die, the value of a random variable V can be the sum of the spots on the two rolls. The value of V can therefore be an integer from the interval [2, 12].

A random variable V with n potential values v1, v2, . . . , vn is characterized by a probability distribution p = (p1, p2, . . . , pn), where pi is the probability that V takes the value vi.

The concept of a random variable is one of the most important of modern science and technology.

DISTRIBUTION and DENSITY FUNCTIONS 1/2

Definition: A random variable X is a real-valued function over the sample space Ω [of a σ-field (Ω, F)], that is X : Ω → R, such that for all x ∈ R: {ω ∈ Ω | X(ω) ≤ x} ∈ F.

Definition: The distribution function FX : R → [0, 1] of a random variable X is defined as FX(x) = Pr[X ≤ x]. The density function pX : R → [0, 1] of a random variable X is defined as pX(x) = Pr[X = x].
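The two functions just defined are easy to compute for the informal example above. A minimal sketch, assuming the random variable V = sum of two fair dice; the helper names density and distribution are illustrative.

from itertools import product
from fractions import Fraction

# Sample space: ordered outcomes of two tosses of a fair six-sided die.
omega = list(product(range(1, 7), repeat=2))

def density(X):
    """The density function p_X: p_X(x) = Pr[X = x]."""
    p = {}
    for w in omega:
        p[X(w)] = p.get(X(w), Fraction(0)) + Fraction(1, len(omega))
    return p

def distribution(X, x):
    """The distribution function F_X: F_X(x) = Pr[X <= x]."""
    return sum(pr for value, pr in density(X).items() if value <= x)

V = lambda w: w[0] + w[1]          # the sum of the spots on the two rolls
print(density(V)[7])               # Pr[V = 7] = 6/36 = 1/6
print(distribution(V, 4))          # Pr[V <= 4] = (1 + 2 + 3)/36 = 1/6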
DISTRIBUTION and DENSITY FUNCTIONS 2/2

Example: Let Ω0 be the set of all possible outcomes of throwing three fair six-sided dice simultaneously. |Ω0| = ??? Let X(ω) (ω ∈ Ω0) be the sum of the numbers on the dice. For the density function pX(x) it holds:

x:      3      4      5       6       7       8       9      . . .
pX(x): 1/216  3/216  6/216  10/216  15/216  21/216  25/216   . . .

Definition: The joint distribution function FX,Y : R × R → [0, 1] of random variables X and Y is defined by
FX,Y(x, y) = Pr[{X ≤ x} ∧ {Y ≤ y}].
The joint density function PrX,Y : R × R → [0, 1] of random variables X and Y is defined by
PrX,Y(x, y) = Pr[{X = x} ∧ {Y = y}].

INDEPENDENCE of RANDOM VARIABLES

Definition: Two random variables X, Y are called independent random variables if for all x, y ∈ R
PrX,Y(x, y) = Pr[X = x] · Pr[Y = y].

EXPECTATION - MEAN of RANDOM VARIABLES

Definition: The expectation (mean or expected value) E[X] of a random variable X is defined as
E[X] = Σ_{ω∈Ω} X(ω) · Pr(ω).

Properties of the mean:
E[X + Y] = E[X] + E[Y].
E[c · X] = c · E[X].
E[X · Y] = E[X] · E[Y], if X, Y are independent.

The first of the above equalities is known as linearity of expectation. It can be extended to a finite number of random variables X1, . . . , Xn:
E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi],
and also to any countable set of random variables X1, X2, . . .: if Σ_{i=1}^{∞} E[|Xi|] < ∞, then Σ_{i=1}^{∞} |Xi| < ∞ and E[Σ_{i=1}^{∞} Xi] = Σ_{i=1}^{∞} E[Xi].

EXPECTATION VALUES

For any random variable X let RX be the set of values of X. Using RX one can show that
E[X] = Σ_{x∈RX} x · Pr(X = x).
Using that, one can show that for any reals a, b it holds
E[aX + b] = Σ_{x∈RX} (ax + b) · Pr(X = x) = a · Σ_{x∈RX} x · Pr(X = x) + b · Σ_{x∈RX} Pr(X = x) = a · E[X] + b.
The above relation is called weak linearity of expectation.

JENSEN's INEQUALITY

Example: We show that for any random variable X it holds
E[X²] ≥ (E[X])².
Indeed, let Y = (X − E[X])². Then
0 ≤ E[Y] = E[(X − E[X])²] = E[X² − 2X·E[X] + (E[X])²] = E[X²] − 2·E[X]·E[X] + (E[X])² = E[X²] − (E[X])².

This can be generalised as follows.

Theorem (Jensen's inequality - I): If f is a convex function, then E[f(X)] ≥ f(E[X]).

It holds also

Theorem (Jensen's inequality - II): If f is a concave function, then f(E[X]) ≥ E[f(X)].

EXAMPLE on JENSEN's INEQUALITIES

Suppose we flip n fair coins and want to get a lower bound on E[X²], where X is the number of heads. Since the function f : x → x² is convex, Jensen's inequality says
E[X²] ≥ (E[X])² = n²/4,
which is a pretty good result, because the exact value is E[X²] = n²/4 + n/4, as can easily be found using generating functions.

On the other hand, since lg x is a concave function, Jensen's inequality gives an upper bound for the same experiment:
E[lg X] ≤ lg E[X] = lg(n/2) = lg n − 1,
which is pretty close to the exact value.

INDICATOR VARIABLES

A random variable X is said to be an indicator variable if X takes on only the values 1 and 0. For any set A ⊂ S one can define an indicator variable XA that takes value 1 on A and 0 on S − A, where (S, Pr) is the underlying probability space. It holds:
E[XA] = Σ_{s∈S} XA(s) · Pr({s}) = Σ_{s∈A} XA(s) · Pr({s}) + Σ_{s∈S−A} XA(s) · Pr({s}) = Σ_{s∈A} 1 · Pr({s}) + Σ_{s∈S−A} 0 · Pr({s}) = Σ_{s∈A} Pr({s}) = Pr(A).
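A quick numerical check of the coin-flip example and of E[XA] = Pr(A): a minimal sketch, assuming n = 10 independent fair coins; the variable names are illustrative.

from fractions import Fraction
from math import comb

n = 10
# Density of X = number of heads in n fair coin flips: Pr(X = k) = C(n, k) / 2^n.
pX = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

EX  = sum(k * p for k, p in pX.items())
EX2 = sum(k * k * p for k, p in pX.items())
print(EX, EX2)                       # 5 and 55/2, i.e. n/2 and n^2/4 + n/4
print(EX2 >= EX**2)                  # Jensen's inequality: E[X^2] >= (E[X])^2

# Indicator variable X_A of the event A = "at least 8 heads": E[X_A] = Pr(A).
pr_A = sum(p for k, p in pX.items() if k >= 8)
E_XA = sum((1 if k >= 8 else 0) * p for k, p in pX.items())
print(pr_A == E_XA)                  # True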
VARIANCE and STANDARD DEVIATION

Definition: For a random variable X the variance VX and the standard deviation σX are defined by
VX = E((X − EX)²),  σX = √VX.
Since
E((X − EX)²) = E(X² − 2X·EX + (EX)²) = E(X²) − 2(EX)² + (EX)² = E(X²) − (EX)²,
it holds
VX = E(X²) − (EX)².

Example: Let Ω = {1, 2, . . . , 10}, Pr(i) = 1/10, X(i) = i; Y(i) = i − 1 if i ≤ 5 and Y(i) = i + 1 otherwise. Then
EX = EY = 5.5, E(X²) = (1/10)·Σ_{i=1}^{10} i² = 38.5, E(Y²) = 44.5; VX = 8.25, VY = 14.25.

TWO RULES

For a random variable X and a real number c it holds
V(cX) = c²·V(X);  σ(cX) = |c|·σ(X).
For independent random variables X and Y it holds
V(X + Y) = V(X) + V(Y);  σ(X + Y) = √(V(X) + V(Y)).

MOMENTS

Definition: For k ∈ N the k-th moment m^k_X and the k-th central moment µ^k_X of a random variable X are defined as follows:
m^k_X = E[X^k],  µ^k_X = E((X − EX)^k).
The mean of a random variable X is sometimes denoted by µX = m^1_X and its variance by µ^2_X.

EXAMPLE I

Each week there is a lottery that always sells 100 tickets. One of the tickets wins 100 million, all other tickets win nothing. What is better: to buy two tickets in one week (Strategy I), or one ticket in each of two different weeks (Strategy II)? Or is neither of the two strategies better than the other?

EXAMPLE II

With Strategy I we win (in millions)
0 with probability 0.98
100 with probability 0.02
With Strategy II we win (in millions)
0 with probability 0.9801 = 0.99 · 0.99
100 with probability 0.0198 = 2 · 0.01 · 0.99
200 with probability 0.0001 = 0.01 · 0.01
The variance of Strategy I is 196; the variance of Strategy II is 198.

PROBABILITY GENERATING FUNCTION

The probability density function of a random variable X whose values are natural numbers can be represented by the following probability generating function (PGF):
GX(z) = Σ_{k≥0} Pr(X = k) · z^k.

Main properties:
GX(1) = 1;
EX = Σ_{k≥0} k · Pr(X = k) = Σ_{k≥0} Pr(X = k) · (k · 1^{k−1}) = G'X(1).
Since it holds
E(X²) = Σ_{k≥0} k² · Pr(X = k) = Σ_{k≥0} Pr(X = k) · (k·(k−1)·1^{k−2} + k·1^{k−1}) = G''X(1) + G'X(1),
we have
VX = G''X(1) + G'X(1) − (G'X(1))².

AN INTERPRETATION

Sometimes one can think of the expectation E[Y] of a random variable Y as the "best guess" or the "best prediction" of the value of Y. It is the "best guess" in the sense that among all constants m the expectation E[(Y − m)²] is minimal when m = E[Y].

WHY ARE PGF USEFUL?

Main reason: For many important probability distributions their PGFs are very simple and easy to work with. For example, for the uniform distribution on the set {0, 1, . . . , n − 1} the PGF has the form
Un(z) = (1/n)·(1 + z + . . . + z^{n−1}) = (1/n) · (1 − z^n)/(1 − z).
A problem arises with the case z = 1, where the closed form is undefined.
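A minimal numerical sketch of these PGF properties: it represents a PGF by the list of its coefficients Pr(X = k) and recovers EX and VX from the derivatives at z = 1; the helper name pgf_derivatives_at_1 is illustrative.

from fractions import Fraction

def pgf_derivatives_at_1(coeffs):
    """For G(z) = sum_k p_k z^k: return G(1), G'(1) = sum_k k*p_k, G''(1) = sum_k k(k-1)*p_k."""
    g  = sum(coeffs)
    g1 = sum(k * p for k, p in enumerate(coeffs))
    g2 = sum(k * (k - 1) * p for k, p in enumerate(coeffs))
    return g, g1, g2

# Uniform distribution on {0, 1, ..., n-1}: U_n(z) = (1/n)(1 + z + ... + z^(n-1)).
n = 6
g, g1, g2 = pgf_derivatives_at_1([Fraction(1, n)] * n)
EX = g1                              # (n - 1)/2 = 5/2
VX = g2 + g1 - g1**2                 # (n^2 - 1)/12 = 35/12
print(g, EX, VX)                     # 1, 5/2, 35/12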
WAY OUT

If G(z) = Σ_{n≥0} g_n·z^n is any power series that converges for at least one value of z with |z| > 1, then this property is shared also by G'(z) = Σ_{n≥0} n·g_n·z^{n−1} and by G''(z), G'''(z), . . . . By Taylor's theorem we have
G(1 + t) = G(1) + (G'(1)/1!)·t + (G''(1)/2!)·t² + . . . .
Therefore
Un(1 + t) = (1/n)·((1 + t)^n − 1)/t = (1/n)·C(n, 1) + (1/n)·C(n, 2)·t + (1/n)·C(n, 3)·t² + . . . + (1/n)·C(n, n)·t^{n−1},
where C(n, k) denotes the binomial coefficient, and consequently
Un(1) = 1,  U'n(1) = (n − 1)/2,  U''n(1) = (n − 1)(n − 2)/3.

PROPERTIES of GENERATING FUNCTIONS

Property 1: If X1, . . . , Xk are independent random variables with PGFs G1(z), . . . , Gk(z), then the random variable Y = Σ_{i=1}^{k} Xi has as its PGF the function
G(z) = Π_{i=1}^{k} Gi(z).

Property 2: Let X1, X2, . . . be a sequence of independent random variables with the same PGF GX(z). If Y is a random variable with PGF GY(z) and Y is independent of all Xi, then the random variable S = X1 + . . . + XY has as its PGF the function
GS(z) = GY(GX(z)).

IMPORTANT DISTRIBUTIONS

Two important distributions are connected with experiments, called Bernoulli trials, that have two possible outcomes:
success, with probability p;
failure, with probability q = 1 − p.
Coin tossing is an example of a Bernoulli trial.

1. Let the values of a random variable X be the number of trials needed to obtain a success. Then
Pr(X = k) = q^{k−1}·p.
Such a probability distribution is called the geometric distribution and such a variable a geometric random variable. It holds
EX = 1/p,  VX = q/p²,  G(z) = pz/(1 − qz).

2. Let the values of a random variable Y be the number of successes in n trials. Then
Pr(Y = k) = C(n, k)·p^k·q^{n−k}.
Such a probability distribution is called the binomial distribution, and it holds
EY = np,  VY = npq,  G(z) = (q + pz)^n,
and also
EY² = n(n − 1)·p² + np.

BERNOULLI DISTRIBUTION

Let X be a binary random variable (usually called a Bernoulli or indicator random variable) that takes value 1 with probability p and value 0 with probability q = 1 − p. Then it holds
E[X] = p,  VX = pq,  G(z) = q + pz.

BINOMIAL DISTRIBUTION revisited

Let X1, . . . , Xn be independent random variables having the Bernoulli distribution with the common parameter p. The random variable X = X1 + X2 + . . . + Xn has the so-called binomial distribution, denoted B(n, p), with the density function
B(k, n, p) = Pr(X = k) = C(n, k)·p^k·q^{n−k}.

POISSON DISTRIBUTION

Let λ ∈ R>0. The Poisson distribution with the parameter λ is the probability distribution with the density function
p(x) = λ^x·e^{−λ}/x!  for x = 0, 1, 2, . . .,  and p(x) = 0 otherwise.
For large n the Poisson distribution is a good approximation of the binomial distribution B(n, λ/n).
Properties of a Poisson random variable X:
E[X] = λ,  VX = λ,  G(z) = e^{λ(z−1)}.

EXPECTATION + VARIANCE OF SUMS OF RANDOM VARIABLES

Let Sn = Σ_{i=1}^{n} Xi, where each Xi is a random variable that takes on value 1 with probability p and value 0 with probability 1 − p = q. It clearly holds
E(Xi) = p,  E(Xi²) = p,
E(Sn) = E(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} E(Xi) = np,
E(Sn²) = E((Σ_{i=1}^{n} Xi)²) = E(Σ_{i=1}^{n} Xi² + Σ_{i≠j} Xi·Xj) = Σ_{i=1}^{n} E(Xi²) + Σ_{i≠j} E(Xi·Xj),
and therefore, since the Xi are pairwise independent (as is the case here), E(Xi·Xj) = E(Xi)·E(Xj) = p². Hence
E(Sn²) = np + 2·C(n, 2)·p² = np + n(n − 1)·p² = np(1 − p) + n²·p² = n²·p² + npq,
VAR[Sn] = E(Sn²) − (E(Sn))² = n²·p² + npq − n²·p² = npq.
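A minimal simulation sketch checking these formulas, assuming n independent Bernoulli(p) trials; the seed, n, p and the number of trials are arbitrary choices of the sketch.

import random
from statistics import mean, pvariance

random.seed(0)
n, p, trials = 20, 0.3, 200_000

# Simulate S_n = X_1 + ... + X_n for independent Bernoulli(p) variables X_i.
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]

print(mean(samples), n * p)                   # empirical mean      vs  E(S_n) = np = 6.0
print(pvariance(samples), n * p * (1 - p))    # empirical variance  vs  VAR[S_n] = npq = 4.2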
MOMENT INEQUALITIES

The following inequality, and several of its special cases, play a very important role in the analysis of randomized computations. Let X be a random variable that takes on values x with probability p(x).

Theorem: For any λ > 0 and any k ∈ N the so-called k-th moment inequality holds:
Pr[|X| > λ] ≤ E(|X|^k) / λ^k.

Proof of the above inequality:
E(|X|^k) = Σ_x |x|^k·p(x) ≥ Σ_{|x|>λ} |x|^k·p(x) ≥ λ^k·Σ_{|x|>λ} p(x) = λ^k·Pr[|X| > λ].

Two important special cases - I.1

Two important special cases of the moment inequality Pr[|X| > λ] ≤ E(|X|^k)/λ^k:

Case 1: k → 1, λ → λ·E(|X|):
Pr[|X| ≥ λ·E(|X|)] ≤ 1/λ    (Markov's inequality)

Case 2: k → 2, X → X − E(X), λ → λ·√V(X):
Pr[|X − E(X)| ≥ λ·√V(X)] ≤ E((X − E(X))²) / (λ²·V(X)) = V(X) / (λ²·V(X)) = 1/λ²    (Chebyshev's inequality)

Another variant of Chebyshev's inequality is
Pr[|X − E(X)| ≥ λ] ≤ V(X)/λ²,
and this is one of the main reasons why variance is used.

Two important special cases - I.2

The following generalization of the moment inequality is also of importance:

Theorem: If g(x) is non-negative and non-decreasing on [0, ∞), then
Pr[|X| > λ] ≤ E(g(|X|)) / g(λ).
As a special case, namely for g(x) = e^{tx} with t > 0, we get
Pr[|X| > λ] ≤ E(e^{t|X|}) / e^{tλ}    (basic Chernoff's inequality).

Chebyshev's inequalities are used to show that the values of a random variable lie close to its average with high probability. The bounds they provide are also called concentration bounds. Better bounds can usually be obtained using the Chernoff bounds discussed in Chapter 5.

FLIPPING COINS: EXAMPLES on CHEBYSHEV INEQUALITIES

Let X be the sum of n independent fair coins and let Xi be the indicator variable for the event that the i-th coin comes up heads. Then
E(Xi) = 1/2, E(X) = n/2, Var[Xi] = 1/4 and Var[X] = Σ_i Var[Xi] = n/4.
Chebyshev's inequality Pr[|X − E(X)| ≥ λ] ≤ V(X)/λ² for λ = n/2 gives
Pr[X = n] ≤ Pr[|X − n/2| ≥ n/2] ≤ (n/4) / (n/2)² = 1/n.

Let now n = 2^m − 1 for some m and let Y1, . . . , Ym be independent fair 0-1 random variables. For each non-empty subset S of {1, . . . , m}, let XS be the exclusive OR of all Yi for i ∈ S. Then the variables XS are pairwise independent and each XS has variance 1/4. The same Chebyshev analysis as above for independent coin flips, now with X = Σ_S XS, gives
Pr[|X − n/2| ≥ n/2] ≤ 1/n.

THE INCLUSION-EXCLUSION PRINCIPLE

Let A1, A2, . . . , An be events - not necessarily disjoint. The Inclusion-Exclusion principle, which also has a variety of applications, states that
Pr(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} Pr(Ai) − Σ_{i<j} Pr(Ai ∩ Aj) + Σ_{i<j<k} Pr(Ai ∩ Aj ∩ Ak) − . . . + (−1)^{n+1}·Pr(A1 ∩ A2 ∩ . . . ∩ An).
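A minimal sketch verifying the inclusion-exclusion principle for three concrete events over the two-dice sample space; the chosen events are arbitrary examples.

from fractions import Fraction
from itertools import product, combinations

omega = set(product(range(1, 7), repeat=2))           # two fair dice
prob = lambda event: Fraction(len(event), len(omega))

A1 = {w for w in omega if w[0] + w[1] >= 10}          # sum at least 10
A2 = {w for w in omega if w[0] == w[1]}               # doubles
A3 = {w for w in omega if w[0] == 6}                  # first die shows 6
events = [A1, A2, A3]

lhs = prob(A1 | A2 | A3)

# Right-hand side: alternating sum over all non-empty subsets of the events.
rhs = Fraction(0)
for r in range(1, len(events) + 1):
    for subset in combinations(events, r):
        rhs += (-1) ** (r + 1) * prob(set.intersection(*subset))

print(lhs == rhs, lhs)                                # True 13/36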