Part VI. Basic techniques II: tail probability inequalities

Chapter 6. BASIC TECHNIQUES: CONCENTRATION BOUNDS

Some general, yet quite sharp, concentration bounds are derived in this chapter and their use is illustrated. In particular, we derive so-called tail probability bounds: bounds on the probability that the value of a random variable differs much from its mean.

At first we determine bounds for the random variable $X = \sum_{i=1}^{n} X_i$, where all $X_i$ are binary random variables with a Bernoulli distribution. That is, each $X_i$ can be seen as a coin toss with $\Pr[X_i = 1] = p_i$ and $\Pr[X_i = 0] = 1 - p_i$. Such coin tosses are also referred to as Poisson trials, and as Bernoulli trials if all $p_i$ are identical. (Observe that in the special case $p_1 = p_2 = \dots = p_n = p$ the variable $X$ has the binomial distribution.)

At the end we deal with special sequences of dependent random variables, called martingales, and with tail bounds for martingales, which will then also be applied to the occupancy problem.

BASIC PROBLEM and METHODS - I.

If we want to get tight bounds on how much the value of a random variable $X$ differs from its mean, a useful trick is to pick some non-negative function $f(X)$ such that (a) we can calculate $E[f(X)]$, and (b) $f$ grows slowly enough that only large values of $X$ produce huge values of $f(X)$. This way, by applying Markov's inequality to $f(X)$, we can get good probability bounds on huge deviations of $X$ from its mean.

BASIC PROBLEM and METHODS - II.

The above approach is often used to show that $X$ lies close to $E[X]$ with reasonably high probability. Of large importance is the case where $X$ is a sum of random variables. For the case that these random variables are independent we derive the so-called Chernoff bound. For the case that they are dependent, but form a so-called martingale, we get the so-called Azuma-Hoeffding bound.

BASIC PROBLEM of the ANALYSIS of RANDOMIZED ALGORITHMS

What is the probability that $X = \sum_{i=1}^{n} X_i$ deviates from its mean $E[X] = \mu = \sum_{i=1}^{n} p_i$ by a fixed factor? Namely, let $\delta > 0$:
(1) What is the probability that $X$ is larger than $(1 + \delta)\mu$?
(2) What is the probability that $X$ is smaller than $(1 - \delta)\mu$?

Notation: For a random variable $X$ and a fixed $t > 0$, $E[e^{tX}]$ is called the moment generating function of $X$:
\[ E[e^{tX}] = \sum_{k \ge 0} \frac{t^k E[X^k]}{k!}. \]
The very important Chernoff bounds on the sum of independent Poisson trials are obtained by considering the moment generating function of $X$.

CHERNOFF BOUNDS - I

Theorem: Let $X_1, X_2, \dots, X_n$ be independent Poisson trials such that, for $1 \le i \le n$, $\Pr[X_i = 1] = p_i$, where $0 < p_i < 1$. Then for $X = \sum_{i=1}^{n} X_i$, $\mu = E[X] = \sum_{i=1}^{n} p_i$, and any $\delta > 0$,
\[ \Pr[X > (1 + \delta)\mu] < \left( \frac{e^{\delta}}{(1 + \delta)^{1+\delta}} \right)^{\mu}. \quad (1) \]

Proof: For any $t \in \mathbb{R}_{>0}$,
\[ \Pr[X > (1 + \delta)\mu] = \Pr\big[e^{tX} > e^{t(1+\delta)\mu}\big]. \]
By applying Markov's inequality to the right-hand side we get
\[ \Pr[X > (1 + \delta)\mu] < \frac{E[e^{tX}]}{e^{t(1+\delta)\mu}} \]
(the inequality is strict). Observe that
\[ E[e^{tX}] = E\big[e^{t \sum_{i=1}^{n} X_i}\big] = E\left[\prod_{i=1}^{n} e^{tX_i}\right] = \prod_{i=1}^{n} E[e^{tX_i}], \]
and therefore
\[ \Pr[X > (1 + \delta)\mu] < \frac{\prod_{i=1}^{n} E[e^{tX_i}]}{e^{t(1+\delta)\mu}}. \]

CHERNOFF BOUNDS - II.

Since $E[e^{tX_i}] = p_i e^t + (1 - p_i)$, we have
\[ \Pr[X > (1 + \delta)\mu] < \frac{\prod_{i=1}^{n} \big(p_i e^t + 1 - p_i\big)}{e^{t(1+\delta)\mu}} = \frac{\prod_{i=1}^{n} \big(1 + p_i (e^t - 1)\big)}{e^{t(1+\delta)\mu}}. \]
By applying the inequality $1 + x < e^x$ with $x = p_i (e^t - 1)$,
\[ \Pr[X > (1 + \delta)\mu] < \frac{\prod_{i=1}^{n} e^{p_i (e^t - 1)}}{e^{t(1+\delta)\mu}} = \frac{e^{\sum_{i=1}^{n} p_i (e^t - 1)}}{e^{t(1+\delta)\mu}} = \frac{e^{(e^t - 1)\mu}}{e^{t(1+\delta)\mu}}. \]
Taking $t = \ln(1 + \delta)$ we get our Theorem (the basic Chernoff bound), that is,
\[ \Pr[X > (1 + \delta)\mu] < \left( \frac{e^{\delta}}{(1 + \delta)^{1+\delta}} \right)^{\mu}. \quad (2) \]
Observe the three tricks that have been used in the above proof!
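Before moving on, the bound (1) can be checked empirically. The following Python sketch (illustrative only; the parameters, sample sizes and function names are ours, not from the lecture) simulates sums of independent Poisson trials and compares the observed tail frequency with the bound:

```python
import math
import random

def chernoff_upper(mu, delta):
    # Right-hand side of bound (1): (e^delta / (1+delta)^(1+delta))^mu
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def empirical_tail(ps, delta, runs=20_000):
    # Estimate Pr[X > (1+delta)*mu] for X = sum of independent Poisson trials
    mu = sum(ps)
    threshold = (1 + delta) * mu
    hits = sum(1 for _ in range(runs)
               if sum(1 for p in ps if random.random() < p) > threshold)
    return hits / runs

ps = [0.2] * 100                      # 100 trials, mu = 20
for delta in (0.25, 0.5, 1.0):
    print(delta, empirical_tail(ps, delta), chernoff_upper(sum(ps), delta))
```

The empirical frequency should stay below the bound; for small $\delta$ the bound is visibly loose, which is consistent with it being a worst-case estimate over all choices of the $p_i$.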
COROLLARIES

From the above Chernoff bound the following corollaries can be derived.

Corollary: Let $X_1, X_2, \dots, X_n$ be independent Poisson trials such that, for $1 \le i \le n$, $\Pr[X_i = 1] = p_i$, where $0 < p_i < 1$. Then for $X = \sum_{i=1}^{n} X_i$ and $\mu = E[X] = \sum_{i=1}^{n} p_i$ it holds:
1. For $0 < \delta < 1.81$: $\Pr[X > (1 + \delta)\mu] \le e^{-\mu\delta^2/3}$.
2. For $0 \le \delta \le 4.11$: $\Pr[X \ge (1 + \delta)\mu] \le e^{-\mu\delta^2/4}$.
3. For $R \ge 6\mu$: $\Pr[X \ge R] \le 2^{-R}$. (3)

EXAMPLE I - SOCCER GAME OUTCOMES

Notation: $F^{+}(\mu, \delta) = \left( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right)^{\mu}$ is the right-hand side of inequality (1) above.

Example: A soccer team STARS wins each game with probability 1/3. Assuming that the outcomes of different games are independent, we derive an upper bound on the probability that STARS win more than half out of $n$ games. Let
\[ X_i = \begin{cases} 1 & \text{if STARS win the } i\text{-th game,} \\ 0 & \text{otherwise,} \end{cases} \qquad Y_n = \sum_{i=1}^{n} X_i. \]
By applying the last theorem we get, for $\mu = n/3$ and $\delta = 1/2$,
\[ \Pr\left[Y_n > \frac{n}{2}\right] < F^{+}\left(\frac{n}{3}, \frac{1}{2}\right) < (0.965)^n, \]
which is exponentially small in $n$.

SECOND TYPE of CHERNOFF BOUNDS

The previous theorem bounds the probability that $X = \sum X_i$ deviates above its expectation $\mu$, i.e. it bounds $\Pr[X > (1 + \delta)\mu]$. The next theorem bounds the probability that $X$ deviates below its expectation $\mu$, i.e. it bounds $\Pr[X < (1 - \delta)\mu]$.

Theorem: Let $X_1, X_2, \dots, X_n$ be independent Poisson trials such that, for $1 \le i \le n$, $\Pr[X_i = 1] = p_i$, where $0 < p_i < 1$. Then for $X = \sum_{i=1}^{n} X_i$, $\mu = E[X] = \sum_{i=1}^{n} p_i$, and $0 < \delta \le 1$,
\[ \Pr[X < (1 - \delta)\mu] < e^{-\mu\delta^2/2}. \]

Proof: For any positive real $t$,
\[ \Pr[X < (1 - \delta)\mu] = \Pr[-X > -(1 - \delta)\mu] = \Pr\big[e^{-tX} > e^{-t(1-\delta)\mu}\big]. \]
By applying Markov's inequality,
\[ \Pr[X < (1 - \delta)\mu] < \frac{E[e^{-tX}]}{e^{-t(1-\delta)\mu}} = \frac{\prod_{i=1}^{n} E[e^{-tX_i}]}{e^{-t(1-\delta)\mu}} = \frac{\prod_{i=1}^{n} \big(p_i e^{-t} + 1 - p_i\big)}{e^{-t(1-\delta)\mu}} = \frac{\prod_{i=1}^{n} \big(1 + p_i (e^{-t} - 1)\big)}{e^{-t(1-\delta)\mu}}. \]
By applying the inequality $1 + x < e^x$ we get
\[ \Pr[X < (1 - \delta)\mu] < \frac{e^{\sum_{i=1}^{n} p_i (e^{-t} - 1)}}{e^{-t(1-\delta)\mu}} = \frac{e^{(e^{-t} - 1)\mu}}{e^{-t(1-\delta)\mu}}, \]
and if we take $t = \ln \frac{1}{1-\delta}$, then
\[ \Pr[X < (1 - \delta)\mu] < \left( \frac{e^{-\delta}}{(1 - \delta)^{1-\delta}} \right)^{\mu}. \quad (4) \]
Since $(1 - \delta)\ln(1 - \delta) \ge -\delta + \delta^2/2$ for $0 < \delta \le 1$, the bound (4) implies
\[ \Pr[X < (1 - \delta)\mu] < e^{-\mu\delta^2/2}. \]

Combining part 1 of the Corollary above with this lower-tail bound (note that $e^{-\mu\delta^2/2} \le e^{-\mu\delta^2/3}$), it follows:

Corollary: For $0 < \delta < 1$,
\[ \Pr(|X - \mu| \ge \delta\mu) \le 2e^{-\mu\delta^2/3}. \quad (5) \]

EXAMPLE - COIN TOSSING

Let $X$ be the number of heads in a sequence of $n$ independent fair coin flips. An application of the bound (5), for $\mu = n/2$ and $\delta = \sqrt{\frac{6 \ln n}{n}}$, gives
\[ \Pr\left[\left|X - \frac{n}{2}\right| \ge \frac{1}{2}\sqrt{6 n \ln n}\right] \le 2 e^{-\frac{1}{3} \cdot \frac{n}{2} \cdot \frac{6 \ln n}{n}} = \frac{2}{n}. \]
This implies that the concentration of the number of heads around the mean $n/2$ is very tight: with high probability the deviations from the mean are on the order of $O(\sqrt{n \ln n})$.
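A quick simulation (our illustrative sketch, not part of the lecture; $n$ and the number of runs are arbitrary) shows how rarely the deviation $\frac{1}{2}\sqrt{6 n \ln n}$ from the coin-tossing example is exceeded:

```python
import math
import random

n, runs = 10_000, 300
dev = 0.5 * math.sqrt(6 * n * math.log(n))    # threshold from the bound above
exceed = 0
for _ in range(runs):
    heads = sum(random.getrandbits(1) for _ in range(n))
    if abs(heads - n / 2) >= dev:
        exceed += 1
print("observed:", exceed / runs, " guaranteed bound:", 2 / n)
```

For $n = 10000$ the threshold is about 372 while the standard deviation of the count is only 50 (more than seven standard deviations), so exceeding it is essentially impossible; the guaranteed bound $2/n = 0.0002$ is itself already tiny.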
CHEBYSHEV versus CHERNOFF

Let $X$ be again the number of heads in a sequence of $n$ independent fair coin flips. Let us consider the probability of having either more than $3n/4$ or fewer than $n/4$ heads. Chebyshev's inequality gives us
\[ \Pr\left[\left|X - \frac{n}{2}\right| \ge \frac{n}{4}\right] \le \frac{4}{n}. \]
On the other side, using the Chernoff bound (5) we have
\[ \Pr\left[\left|X - \frac{n}{2}\right| \ge \frac{n}{4}\right] \le 2 e^{-\frac{1}{3} \cdot \frac{n}{2} \cdot \frac{1}{4}} = 2 e^{-n/24}. \]
Chernoff's method therefore gives an exponentially smaller upper bound than the one obtained using Chebyshev's inequality.

SOCCER GAMES REVISITED

Notation (the lower-tail bound function): $F^{-}(\mu, \delta) = e^{-\mu\delta^2/2}$.

Example: Assume that the probability that the STARS team wins a game is 3/4. What is the probability that in $n$ games STARS lose more than $n/2$ games? In such a case $\mu = 0.75n$, $\delta = 1/3$, and for $Y_n = \sum_{i=1}^{n} X_i$ we have
\[ \Pr\left[Y_n < \frac{n}{2}\right] < F^{-}\left(0.75n, \frac{1}{3}\right) < (0.9592)^n, \]
and therefore the probability decreases exponentially fast in $n$.

TWO-SIDED BOUNDS

By combining the two previous bounds we get
\[ \Pr[|X - \mu| \ge \delta\mu] \le 2 e^{-\mu\delta^2/3}, \]
and if we want this bound to be at most some $\varepsilon$, then we get
\[ \Pr\left[|X - \mu| \ge \sqrt{3\mu \ln(2/\varepsilon)}\right] \le \varepsilon, \]
provided $\varepsilon \ge 2 e^{-\mu/3}$ (so that the corresponding $\delta$ is at most 1).

Proof: If $\varepsilon = 2 e^{-\mu\delta^2/3}$, then
\[ \sqrt{3\mu \ln(2/\varepsilon)} = \sqrt{3\mu \ln\big(e^{\mu\delta^2/3}\big)} = \sqrt{3\mu \cdot \mu\delta^2/3} = \sqrt{\mu^2\delta^2} = \mu\delta. \]

NEW QUESTION

New question: Given $\varepsilon$, how large does $\delta$ have to be in order that $\Pr[X > (1 + \delta)\mu] < \varepsilon$? To deal with such and related questions, the following definitions/notations are introduced.

Definition: $\Delta^{+}(\mu, \varepsilon)$ is the number such that $F^{+}(\mu, \Delta^{+}(\mu, \varepsilon)) = \varepsilon$; $\Delta^{-}(\mu, \varepsilon)$ is the number such that $F^{-}(\mu, \Delta^{-}(\mu, \varepsilon)) = \varepsilon$.

In other words, a deviation of $\delta = \Delta^{+}(\mu, \varepsilon)$ suffices to keep $\Pr[X > (1 + \delta)\mu]$ below $\varepsilon$ (irrespective of the values of $n$ and the $p_i$'s).

EXAMPLE and ESTIMATIONS

There is a way to derive $\Delta^{-}(\mu, \varepsilon)$ explicitly. Indeed, by taking the inequality $\Pr[X < (1 - \delta)\mu] < e^{-\mu\delta^2/2}$ and setting $e^{-\mu\delta^2/2} = \varepsilon$ we get
\[ \Delta^{-}(\mu, \varepsilon) = \sqrt{\frac{2 \ln(1/\varepsilon)}{\mu}}, \quad (6) \]
because $\Delta^{-}(\mu, \varepsilon)$ is the number such that $F^{-}(\mu, \Delta^{-}(\mu, \varepsilon)) = \varepsilon$.

Example: Let $p_i = 0.75$. How large must $\delta$ be so that $\Pr[X < (1 - \delta)\mu] < n^{-5}$? From (6) it follows
\[ \delta = \Delta^{-}(0.75n, n^{-5}) = \sqrt{\frac{10 \ln n}{0.75n}} = \sqrt{\frac{13.3 \ln n}{n}}. \]

SOME OTHER USEFUL ESTIMATIONS

1. $F^{+}(\mu, \delta) < \left[e/(1 + \delta)\right]^{(1+\delta)\mu}$.
2. If $\delta > 2e - 1$, then $F^{+}(\mu, \delta) < 2^{-(1+\delta)\mu}$.
3. $\Delta^{+}(\mu, \varepsilon) < \frac{\lg(1/\varepsilon)}{\mu} - 1$.
4. If $\delta \le 2e - 1$, then $F^{+}(\mu, \delta) < e^{-\mu\delta^2/4}$ and $\Delta^{+}(\mu, \varepsilon) < \sqrt{\frac{4 \ln(1/\varepsilon)}{\mu}}$.

SUMMARY of NOTATION

Let us summarize the basic relations concerning the values
\[ F^{+}(\mu, \delta) = \left( \frac{e^{\delta}}{(1 + \delta)^{1+\delta}} \right)^{\mu} \quad \text{and} \quad F^{-}(\mu, \delta) = e^{-\mu\delta^2/2}, \]
as well as $\Delta^{+}(\mu, \varepsilon)$ and $\Delta^{-}(\mu, \varepsilon)$. It holds:
\[ \Pr[X > (1 + \delta)\mu] < F^{+}(\mu, \delta), \qquad \Pr[X < (1 - \delta)\mu] < F^{-}(\mu, \delta), \]
\[ \Pr\big[X > (1 + \Delta^{+}(\mu, \varepsilon))\mu\big] < F^{+}(\mu, \Delta^{+}(\mu, \varepsilon)) = \varepsilon, \]
\[ \Pr\big[X < (1 - \Delta^{-}(\mu, \varepsilon))\mu\big] < F^{-}(\mu, \Delta^{-}(\mu, \varepsilon)) = \varepsilon. \]
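All four quantities are straightforward to compute numerically. Below is a small helper sketch (the function names are ours); $\Delta^{-}$ has the closed form (6), while $\Delta^{+}$ is found by bisection, using the fact that $F^{+}(\mu, \cdot)$ is strictly decreasing for $\delta > 0$:

```python
import math

def F_plus(mu, delta):
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def F_minus(mu, delta):
    return math.exp(-mu * delta * delta / 2)

def Delta_minus(mu, eps):
    # Closed form (6): solve F_minus(mu, delta) = eps for delta
    return math.sqrt(2 * math.log(1 / eps) / mu)

def Delta_plus(mu, eps, tol=1e-12):
    # Solve F_plus(mu, delta) = eps by bisection; F_plus decreases in delta
    lo, hi = 0.0, 1.0
    while F_plus(mu, hi) > eps:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if F_plus(mu, mid) > eps else (lo, mid)
    return hi

mu, eps = 20.0, 1e-3
print("Delta+ =", Delta_plus(mu, eps), " Delta- =", Delta_minus(mu, eps))
```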
EXAMPLE 2 - MONTE CARLO METHOD - I

In this example we illustrate how the Chernoff bound helps us show that a simple Monte Carlo algorithm can be used to approximate the number π through sampling. The term Monte Carlo method refers to a broad collection of tools for estimating various values through sampling and simulation. Monte Carlo methods are used extensively in all areas of the physical sciences and in technology.

MONTE CARLO ESTIMATION OF π - I.

Let $(X, Y)$ be a point chosen randomly in the $2 \times 2$ square centered at $(0, 0)$. This is equivalent to choosing $X$ and $Y$ independently and uniformly from the interval $[-1, 1]$. Let $Z$ be the random variable that has value 1 if the point $(X, Y)$ lies in the circle of radius 1 centered at $(0, 0)$, and value 0 otherwise. Clearly,
\[ \Pr(Z = 1) = \frac{\pi}{4}. \]
If we perform such an experiment $m$ times, let $Z_i$ be the value of $Z$ at the $i$-th run and $W = \sum_{i=1}^{m} Z_i$, then
\[ E[W] = E\left[\sum_{i=1}^{m} Z_i\right] = \sum_{i=1}^{m} E[Z_i] = \frac{m\pi}{4}, \]
and therefore $W' = (4/m)W$ is a natural estimator of π.

MONTE CARLO ESTIMATION OF π - II.

How good is this estimator? Since $E[W] = m\pi/4$, an application of the two-sided Chernoff bound (5), i.e. of
\[ \Pr(|X - \mu| \ge \delta\mu) \le 2 e^{-\mu\delta^2/3} \quad \text{for } 0 < \delta < 1, \]
gives
\[ \Pr(|W' - \pi| \ge \varepsilon\pi) = \Pr\left(\left|W - \frac{m\pi}{4}\right| \ge \frac{\varepsilon m\pi}{4}\right) = \Pr(|W - E[W]| \ge \varepsilon E[W]) \le 2 e^{-m\pi\varepsilon^2/12}. \]
Therefore, by taking $m$ sufficiently large, we can get an arbitrarily good approximation of π with arbitrarily high probability.
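A direct implementation of this estimator (a minimal sketch; the sample size m and the accuracy eps are arbitrary choices of ours):

```python
import math
import random

def estimate_pi(m):
    # W counts random points of the square [-1,1]^2 falling into the unit circle
    w = sum(1 for _ in range(m)
            if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1)
    return 4 * w / m                      # the estimator W' = (4/m) * W

m, eps = 100_000, 0.01
print("estimate:", estimate_pi(m))
print("Pr[relative error >=", eps, "] <=",
      2 * math.exp(-m * math.pi * eps ** 2 / 12))
```

For these values the Chernoff guarantee is about 0.15; quadrupling m squares it down, while the typical observed error is much smaller than the guarantee.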
A CASE STUDY - ROUTING ON HYPERCUBES

Networks are modeled by graphs: processors are represented by nodes and communication links by edges. We assume the principle of synchronous communication: in each step, each link can carry one packet $(i, X, d(i))$, where $i$ is the source node, $X$ is the data and $d(i)$ is the destination node.

Permutation routing on an $n$-processor network: the nodes are $1, 2, \dots, n$; node $i$ wants to send a packet to node $d(i)$, where $d(1), d(2), \dots, d(n)$ is a permutation of $1, 2, \dots, n$.

Problem: How many steps are necessary and sufficient to route an arbitrary permutation?

A route for a packet is a sequence of edges the packet has to follow from its source to its destination. A routing algorithm for the permutation routing problem has to specify a route for each packet. A packet may occasionally have to wait at a node because the next edge on its route is "busy", transmitting another packet at that moment. We assume each node contains one queue for each edge. A routing algorithm must therefore also specify a queueing discipline.

OBLIVIOUS ROUTING ALGORITHMS

Oblivious routing algorithms are routing algorithms in which the route followed by a packet from a source node $i$ to a destination $d(i)$ depends on $i$ and $d(i)$ only (and not on the other $d(j)$, $j \ne i$). The following theorem gives a limit on the performance of oblivious algorithms.

Theorem: For any deterministic oblivious permutation routing algorithm on a network of $n$ nodes, each of out-degree $d$, there is an instance of permutation routing requiring $\Omega\big(\sqrt{n/d}\big)$ steps.

Example: Consider a $d$-dimensional hypercube $H_d$ and the left-to-right (bit-fixing) routing: any packet with destination node $d(i)$ is sent from its current node $n_i$ to the node $n_j$ such that the binary representation of $n_j$ differs from the binary representation of $n_i$ in the leftmost bit in which $n_i$ and $d(i)$ differ.

Example: Consider the permutation routing in $H_{10}$ given by the "reverse" mapping $b_1 \dots b_{10} \to b_{10} \dots b_1$. Observe that if the left-to-right routing strategy is used, then all packets from the nodes $b_1 b_2 b_3 b_4 b_5 00000$ have to go through the node $0000000000$. Since $2^5$ packets must pass through a single node of out-degree 10, left-to-right routing on the hypercube $H_d$ sometimes requires $\Omega\big(\sqrt{2^d}/d\big)$ steps.

RANDOMIZED ROUTING

We now show a randomized (oblivious) routing algorithm whose expected number of steps is asymptotically far smaller than $\sqrt{2^d}/d$. Notation: $N = 2^d$.

Phase 1: Pick for each packet $v_i$ a random intermediate destination $\sigma(i)$ from $\{1, \dots, N\}$, and let the packet $v_i$ travel first to the node $\sigma(i)$.
Phase 2: Let the packet $v_i$ travel next from $\sigma(i)$ to its final destination $d(i)$.
(In both phases the deterministic left-to-right oblivious routing is used.)

Queueing discipline: FIFO for each edge. (Actually, any queueing discipline is good, provided that at each step at which a queue is nonempty, some packet from it is ready to travel.)
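The left-to-right (bit-fixing) routes used in both phases are easy to generate. The sketch below (ours; nodes are encoded as integers whose binary representations are read left to right, highest bit first) prints the nodes a packet visits during both phases:

```python
import random

def bit_fix_route(src, dst, d):
    # Left-to-right routing on H_d: repeatedly flip the leftmost bit
    # (highest position) in which the current node differs from dst.
    route, cur = [src], src
    for pos in reversed(range(d)):
        if (cur ^ dst) & (1 << pos):
            cur ^= 1 << pos
            route.append(cur)
    return route

d = 10
src, dst = 0b1101100000, 0b0000011011      # the "reverse" mapping example
sigma = random.randrange(2 ** d)            # Phase 1: random intermediate node
full = bit_fix_route(src, sigma, d) + bit_fix_route(sigma, dst, d)[1:]
print([format(v, "010b") for v in full])
```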
Question: How many steps are needed before a packet $v_i$ reaches its destination? (Let us consider at first only Phase 1.) Let $\rho_i$ denote the route of the packet $v_i$. It clearly holds: the number of steps taken by $v_i$ is equal to the length of $\rho_i$, which is at most $d$, plus the number of steps for which $v_i$ is queued at intermediate nodes of $\rho_i$.

Fact: For any two packets $v_i, v_j$ there is at most one queue $q$ such that $v_i$ and $v_j$ are both in the queue $q$ at the same time.

Lemma: Let the route of a packet $v_i$ follow the sequence of edges $\rho_i = (e_1, e_2, \dots, e_k)$. Let $S$ be the set of packets (other than $v_i$) whose routes pass through at least one of the edges $\{e_1, \dots, e_k\}$. Then the delay incurred by the packet $v_i$ is at most $|S|$.

Proof: A packet in $S$ is said to leave $\rho_i$ at that time step at which it traverses an edge of $\rho_i$ for the last time. If a packet is ready to follow an edge $e_j$ at time $t$, we define its lag at time $t$ to be $t - j$. Clearly, the lag of $v_i$ is initially 0, and the total delay of $v_i$ is its lag when it traverses the last edge $e_k$ of the route $\rho_i$. We show now that at each step at which the lag of $v_i$ increases by 1, the increase can be charged to a distinct member of $S$.

If the lag of $v_i$ reaches a value $l + 1$, some packet in $S$ leaves $\rho_i$ with lag $l$. (When the lag of $v_i$ increases from $l$ to $l + 1$, there must be at least one packet from $S$ that wishes to traverse the same edge as $v_i$ at that time step; thus $S$ contains at least one packet whose lag is $l$.) Let $t'$ be the last step at which any packet in $S$ has lag $l$; then there is a packet $v \in S$ ready to follow an edge $e_{j'}$ at time $t' = l + j'$. We show that some packet of $S$ leaves $\rho_i$ at $t'$; this establishes the Lemma by the Fact above. Since $v$ is ready to follow $e_{j'}$ at $t'$, some packet $\omega$ (which may be $v$ itself) in $S$ follows the edge $e_{j'}$ at $t'$. Now $\omega$ has to leave $\rho_i$ at $t'$; otherwise $\omega$ would be ready to follow the edge $e_{j'+1}$ at time $t' + 1$ with lag $l$, contradicting the choice of $t'$. We charge to $\omega$ the increase in the lag of $v_i$ from $l$ to $l + 1$; since $\omega$ leaves $\rho_i$, it will never be charged again. Thus, each member of $S$ is charged for at most one delay, which proves the lemma.

PROOF CONTINUATION - I.

Let $H_{ij}$ be the random variable defined as follows:
\[ H_{ij} = \begin{cases} 1 & \text{if the routes } \rho_i \text{ and } \rho_j \text{ share at least one edge,} \\ 0 & \text{otherwise.} \end{cases} \]
The total delay incurred by a packet $v_i$ is at most $\sum_{j=1}^{N} H_{ij}$. Since the routes of different packets are chosen independently and randomly, the $H_{ij}$'s are independent Poisson trials for $j \ne i$. Thus, to bound the delay of the packet $v_i$ from above using the Chernoff bound, it suffices to obtain an upper bound on $E\big[\sum_{j=1}^{N} H_{ij}\big]$.

For an edge $e$ of the hypercube, let the random variable $T(e)$ denote the number of routes that pass through $e$. Fix any route $\rho_i = (e_{i,1}, e_{i,2}, \dots, e_{i,k})$, $k \le d$. Then
\[ \sum_{j=1}^{N} H_{ij} \le \sum_{j=1}^{k} T(e_{i,j}) \implies E\left[\sum_{j=1}^{N} H_{ij}\right] \le \sum_{j=1}^{k} E[T(e_{i,j})]. \]

PROOF CONTINUATION - II.

It can be shown that $E[T(e)]$ is the same for every edge $e$ of the hypercube. The expected length of a route $\rho_i$ is $d/2$, so the expected total route length, summed over all packets, is $Nd/2$. The number of (directed) edges of the hypercube is $Nd$, and therefore
\[ E[T(e_{i,j})] \le \frac{Nd/2}{Nd} = \frac{1}{2} \quad \text{for all } i, j. \]
Therefore
\[ E\left[\sum_{j=1}^{N} H_{ij}\right] \le \frac{k}{2} \le \frac{d}{2}. \]
By the Chernoff bound for $\delta > 2e - 1$ (estimation 2 above), $\Pr[X > (1 + \delta)\mu] < 2^{-(1+\delta)\mu}$, applied with $X = \sum_{j=1}^{N} H_{ij}$, $\delta = 11$ and $\mu = d/2$, we get that the probability that $\sum_{j=1}^{N} H_{ij}$ exceeds $6d$ is less than $2^{-6d}$. The total number of packets is $N = 2^d$; hence the probability that any of the $N$ packets experiences a delay exceeding $6d$ is less than $2^d \cdot 2^{-6d} = 2^{-5d}$.

PROOF CONTINUATION - III.

By adding the length of the route (at most $d$) to the delay we get:

Theorem: With probability at least $1 - 2^{-5d}$, every packet reaches its intermediate destination in Phase 1 in $7d$ or fewer steps.

The routing scheme for Phase 2 can be seen as the scheme for Phase 1 "run backwards". Therefore the probability that any packet fails to reach its target in either phase is less than $2 \cdot 2^{-5d}$. To summarize:

Theorem: With probability at least $1 - 2 \cdot 2^{-5d}$, every packet reaches its destination in $14d$ or fewer steps.
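The key quantity $\sum_j H_{ij}$ can also be examined empirically. The following self-contained sketch (ours, for illustration; the dimension and trial count are arbitrary) draws random Phase 1 routes, counts how many of them share an edge with the route of packet 0, and compares the observed mean with the bound $d/2$:

```python
import random

def route_edges(src, dst, d):
    # Edge set of the left-to-right bit-fixing route from src to dst
    edges, cur = set(), src
    for pos in reversed(range(d)):
        if (cur ^ dst) & (1 << pos):
            edges.add((cur, cur ^ (1 << pos)))
            cur ^= 1 << pos
    return edges

d, N, trials = 8, 2 ** 8, 500
total = 0
for _ in range(trials):
    routes = [route_edges(i, random.randrange(N), d) for i in range(N)]
    # H_0j = 1 iff route j shares an edge with route 0
    total += sum(1 for j in range(1, N) if not routes[0].isdisjoint(routes[j]))
print("empirical E[sum_j H_0j]:", total / trials, " bound d/2 =", d / 2)
```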
WIRING PROBLEM - I.

Global wiring in gate arrays. A gate array is a $\sqrt{n} \times \sqrt{n}$ array of $n$ gates. A net is a pair of gates to be connected by a wire. In Manhattan wiring, wires can run vertically and horizontally only.

Global wiring problem I ($GWP_W$): given a set of nets and an integer $W$, we need to specify, if possible, the set of gates each wire should pass through in connecting its end-points, in such a way that no more than $W$ nets pass through any boundary between adjacent gates.

Global wiring problem II: For a boundary $b$ between two gates in the array, let $W_S(b)$ be the number of wires that pass through $b$ in a solution $S$ to the global wiring problem. Notation: $W_S = \max_b W_S(b)$. Goal: to find $S$ such that $W_S$ is minimal.

WIRING PROBLEM - II.

We will consider only so-called one-bend Manhattan routing, in which the direction of a wire is changed at most once. Problem: how to decide, for each net, which of the two one-bend connections to use (first horizontally and then vertically, or first vertically and then horizontally) in order to get a wiring $S$ with minimal $W_S$.

REFORMULATION of the WIRING PROBLEM

The global wiring problem can be reformulated as a 0-1 linear programming problem. For the net $i$ we use two binary variables $x_{i0}, x_{i1}$:
$x_{i0} = 1$ if and only if the $i$-th route runs first horizontally and then vertically;
$x_{i1} = 1$ if and only if the $i$-th route runs first vertically and then horizontally.

Notation: $T_{b0} = \{ i \mid \text{net } i \text{ passes through } b \text{ when } x_{i0} = 1 \}$ and $T_{b1} = \{ i \mid \text{net } i \text{ passes through } b \text{ when } x_{i1} = 1 \}$.

The corresponding 0-1 linear programming problem: minimize $W$, where
\[ x_{i0}, x_{i1} \in \{0, 1\} \quad \text{for each net } i, \quad (7) \]
\[ x_{i0} + x_{i1} = 1 \quad \text{for each net } i, \quad (8) \]
\[ \sum_{i \in T_{b0}} x_{i0} + \sum_{i \in T_{b1}} x_{i1} \le W \quad \text{for all } b. \quad (9) \]
The last condition requires that at most $W$ wires cross the boundary $b$. Denote by $W_0$ the minimum obtained this way.

TRICK - I.

1. Solve the corresponding rational linear programming problem with $x_{i0}, x_{i1} \in [0, 1]$ instead of (7). This trick is called linear relaxation. Denote by $\hat{x}_{i0}, \hat{x}_{i1}$, $1 \le i \le n$, the solutions of the above rational linear programming problem, and let $\hat{W}$ be the value of the objective function for this solution. Obviously, $W_0 \ge \hat{W}$.

2. Apply the technique called randomized rounding: independently for each $i$, set $x_{i0}$ to 1 (and $x_{i1}$ to 0) with probability $\hat{x}_{i0}$, and set $x_{i0}$ to 0 (and $x_{i1}$ to 1) with probability $\hat{x}_{i1} = 1 - \hat{x}_{i0}$.

The idea of randomized rounding is to interpret the fractional solutions provided by the linear program as probabilities for the rounding process.

TRICK - II.

A nice property of randomized rounding is that if the fractional value of a variable is close to 0 (or to 1), then this variable is likely to be set to 0 (or to 1, respectively).

Theorem: If $0 < \varepsilon < 1$, then with probability at least $1 - \varepsilon$ the global wiring $S$ produced by randomized rounding satisfies the inequalities
\[ W_S \le \hat{W}\left(1 + \Delta^{+}\left(\hat{W}, \frac{\varepsilon}{2n}\right)\right) \le W_0\left(1 + \Delta^{+}\left(W_0, \frac{\varepsilon}{2n}\right)\right). \]

Proof: We show that, following the rounding process, with probability at least $1 - \varepsilon$ no more than $\hat{W}\big(1 + \Delta^{+}\big(\hat{W}, \frac{\varepsilon}{2n}\big)\big)$ wires pass through any boundary. This will be done by showing, for any boundary $b$, that the probability that $W_S(b) > \hat{W}\big(1 + \Delta^{+}\big(\hat{W}, \frac{\varepsilon}{2n}\big)\big)$ is at most $\frac{\varepsilon}{2n}$. Since a $\sqrt{n} \times \sqrt{n}$ array has at most $2n$ boundaries, one then sums this probability of failure over all boundaries $b$ to get an upper bound of $\varepsilon$ on the overall failure probability.

TRICK - III.

Let $b$ be a boundary. The solution of the rational linear program satisfies its constraints, therefore we have
\[ \sum_{i \in T_{b0}} \hat{x}_{i0} + \sum_{i \in T_{b1}} \hat{x}_{i1} \le \hat{W}. \]
The number of wires passing through $b$ in the solution $S$ is
\[ W_S(b) = \sum_{i \in T_{b0}} x_{i0} + \sum_{i \in T_{b1}} x_{i1}. \]
The variables $x_{i0}$ and $x_{i1}$ are Poisson trials with probabilities $\hat{x}_{i0}$ and $\hat{x}_{i1}$; in addition, $x_{i0}$ and $x_{i1}$ are each independent of $x_{j0}$ and $x_{j1}$ for $i \ne j$. Therefore $W_S(b)$ is a sum of independent Poisson trials, and
\[ E[W_S(b)] = \sum_{i \in T_{b0}} E[x_{i0}] + \sum_{i \in T_{b1}} E[x_{i1}] = \sum_{i \in T_{b0}} \hat{x}_{i0} + \sum_{i \in T_{b1}} \hat{x}_{i1} \le \hat{W}. \]
Since $\Delta^{+}\big(\hat{W}, \frac{\varepsilon}{2n}\big)$ is such that
\[ \Pr\left[W_S(b) > \hat{W}\left(1 + \Delta^{+}\left(\hat{W}, \frac{\varepsilon}{2n}\right)\right)\right] \le \frac{\varepsilon}{2n}, \]
the theorem follows.
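A minimal sketch of the rounding step itself (ours; it assumes the fractional values $\hat{x}_{i0}$ are already available from an LP solver, which is not shown):

```python
import random

def randomized_rounding(x_hat):
    # x_hat[i] is the fractional LP value of x_i0 for net i.
    # Choosing route 0 with probability x_hat[i] gives E[x_i0] = x_hat[i],
    # so each boundary load W_S(b) becomes a sum of independent Poisson
    # trials whose mean equals the fractional load, as used in TRICK - III.
    return [0 if random.random() < p else 1 for p in x_hat]

x_hat = [0.9, 0.5, 0.1, 0.7]        # hypothetical fractional solution
print(randomized_rounding(x_hat))    # e.g. [0, 1, 1, 0]
```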
HOEFFDING INEQUALITY

The problem with the Chernoff bounds above is that they work only for 0-1 random variables. Hoeffding's inequality is another concentration bound based on moment generating functions; it applies to any sum of bounded independent random variables with mean 0.

Theorem: Let $X_1, \dots, X_n$ be independent random variables with $E[X_i] = 0$ and $|X_i| \le c_i$ for all $i$. Then for all $t > 0$,
\[ \Pr\left[\sum_{i=1}^{n} X_i \ge t\right] \le e^{-\frac{t^2}{2 \sum_{i=1}^{n} c_i^2}}. \]
In the case that the $X_i$ are dependent but form a so-called martingale, Hoeffding's inequality can be generalized, and we get the so-called Azuma-Hoeffding inequality.

MARTINGALES

Martingales are very special sequences of random variables that arise in numerous applications, such as gambling problems or random walks. These sequences have various interesting properties, and for them powerful techniques exist to derive special Chernoff-like tail bounds. Martingales can be very useful in showing that the values of a random variable $V$ are sharply concentrated around its expectation $E[V]$.

Martingales originally referred to systems of betting in which a player increases his stake (usually by doubling) each time he loses a bet. For the analysis of randomized algorithms it is of large importance that, as a general rule of thumb, most things that work for sums of independent random variables also work for martingales.

MARTINGALES - MAIN DEFINITION

Definition: A sequence of random variables $Z_0, Z_1, \dots$ is a martingale with respect to a sequence $X_0, X_1, \dots$ if, for all $n \ge 0$, the following conditions hold:
1. $Z_n$ is a function of $X_0, X_1, \dots, X_n$;
2. $E[|Z_n|] < \infty$;
3. $E[Z_{n+1} \mid X_0, \dots, X_n] = Z_n$.
A sequence of random variables $Z_0, Z_1, \dots$ is called a martingale if it is a martingale with respect to itself, that is, if $E[|Z_n|] < \infty$ and $E[Z_{n+1} \mid Z_0, \dots, Z_n] = Z_n$.

EXAMPLE

Let us have a gambler who plays a sequence of fair games. Let $X_i$ be the amount the gambler wins in the $i$-th game (negative if he loses), and let $Z_i$ be the gambler's total winnings at the end of the $i$-th game. Because each game is fair, we have $E[X_i] = 0$ and
\[ E[Z_{i+1} \mid X_1, X_2, \dots, X_i] = Z_i + E[X_{i+1}] = Z_i. \]
Thus $Z_1, Z_2, \dots, Z_n$ is a martingale with respect to the sequence $X_1, X_2, \dots, X_n$.

DOOB MARTINGALES

A Doob martingale is a martingale constructed using the following general scheme. Let $X_0, X_1, \dots, X_n$ be a sequence of random variables, and let $Y$ be another random variable with $E[|Y|] < \infty$. Then the sequence
\[ Z_i = E[Y \mid X_0, \dots, X_i], \quad i = 0, 1, \dots, n, \]
is a martingale with respect to $X_0, X_1, \dots, X_n$, since
\[ E[Z_{i+1} \mid X_0, \dots, X_i] = E\big[E[Y \mid X_0, \dots, X_{i+1}] \mid X_0, \dots, X_i\big] = E[Y \mid X_0, \dots, X_i] = Z_i. \]
Here we have used the fact that $E[V \mid W] = E[E[V \mid U, W] \mid W]$ for any three random variables $U, V, W$.

REMINDER - CONDITIONAL EXPECTATION

Definition: It is natural and useful to define the conditional expectation of a random variable $Y$ conditioned on an event $E$ by
\[ E[Y \mid E] = \sum_y y \Pr(Y = y \mid E). \]

Example: Let us roll two fair dice independently; let $X_i$ be the number that shows on the $i$-th die and let $X$ be the sum of the numbers on both dice. Then
\[ E[X \mid X_1 = 3] = \sum_x x \Pr(X = x \mid X_1 = 3) = \sum_{x=4}^{9} x \cdot \frac{1}{6} = \frac{13}{2}, \]
\[ E[X_1 \mid X = 5] = \sum_{x=1}^{4} x \Pr(X_1 = x \mid X = 5) = \sum_{x=1}^{4} x \cdot \frac{\Pr(X_1 = x \cap X = 5)}{\Pr(X = 5)} = \frac{5}{2}. \]

Definition: For two random variables $Y$ and $Z$, $E[Y \mid Z]$ is defined to be the random variable $f(Z)$ that takes on the value $E[Y \mid Z = z]$ when $Z = z$.

Theorem: For any random variables $Y, Z$ it holds that $E[Y] = E[E[Y \mid Z]]$.

A USEFUL FACT

For random variables $X, Y$ it holds that
\[ E[E[X \mid Y]] = E[X]. \]
In words: what you expect to expect $X$ to be after learning $Y$ is the same as what you now expect $X$ to be.

Proof:
\[ E[X \mid Y = y] = \sum_x x \Pr[X = x \mid Y = y] = \sum_x x \frac{\Pr[x, y]}{\Pr_Y[y]}, \]
and therefore
\[ E[E[X \mid Y]] = \sum_y \Pr_Y[y] \sum_x x \frac{\Pr[x, y]}{\Pr_Y[y]} = \sum_x \sum_y x \Pr[x, y] = E[X]. \]
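The dice computations and the tower property are easy to verify by brute-force enumeration (an illustrative sketch of ours):

```python
from fractions import Fraction
from itertools import product

dice = list(product(range(1, 7), repeat=2))      # all 36 equally likely rolls

def cond_exp(value, event):
    # E[value | event] by direct enumeration over equally likely outcomes
    hits = [o for o in dice if event(o)]
    return Fraction(sum(value(o) for o in hits), len(hits))

print(cond_exp(lambda o: o[0] + o[1], lambda o: o[0] == 3))     # 13/2
print(cond_exp(lambda o: o[0], lambda o: o[0] + o[1] == 5))     # 5/2

# Tower property E[E[X|Y]] = E[X] with X = sum of dice, Y = first die:
tower = sum(cond_exp(lambda o: o[0] + o[1], lambda o, y=y: o[0] == y)
            for y in range(1, 7)) / 6
print(tower)                                                     # 7 = E[X]
```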
EXAMPLE - "EDGE EXPOSURE" MARTINGALE

Let $G_{n,p}$ be a random graph, let its $m$ possible edges be labeled in some arbitrary order, and let
\[ X_j = \begin{cases} 1 & \text{if there is an edge of } G_{n,p} \text{ in the } j\text{-th edge slot,} \\ 0 & \text{otherwise.} \end{cases} \]
Consider now any finite-valued function $F$ over graphs; for example, let $F(G)$ be the size of the largest independent set in $G$. Let
\[ Z_0 = E[F(G)] \quad \text{and} \quad Z_i = E[F(G) \mid X_1, \dots, X_i], \quad i = 1, \dots, m. \]
Then the sequence $Z_0, Z_1, \dots, Z_m$ is a Doob martingale; it represents the conditional expectation of $F(G)$ as the presence or absence of each edge is revealed, one edge at a time.

STOPPING TIME

A stopping time corresponds to a strategy for stopping a sequence of steps (say, of gambles) that is based only on the outcomes seen so far. Examples of rules for which the decision to stop gambling is a stopping time:
1. the first time the gambler wins 5 games in a row;
2. the first time the gambler has either won or lost 1000 dollars;
3. the first time the gambler wins 4 games in a row.
The rule "the last time the gambler wins 4 games in a row" is not a stopping time, since deciding it requires knowledge of future outcomes.

MARTINGALE STOPPING THEOREM

Theorem: If $Z_0, Z_1, \dots$ is a martingale with respect to $X_1, X_2, \dots$ and $T$ is a stopping time for $X_1, X_2, \dots$, then
\[ E[Z_T] = E[Z_0] \]
whenever one of the following conditions holds:
1. there is a constant $c$ such that $|Z_i| \le c$ for all $i$, that is, the $Z_i$ are bounded;
2. $T$ is bounded;
3. $E[T] < \infty$ and there is a constant $c$ such that $E[|Z_{i+1} - Z_i| \mid X_1, \dots, X_i] < c$.

EXAMPLE - GAMBLER'S PROBLEM

Consider a sequence of independent fair games, where in each round the player either wins or loses one euro, each with probability 1/2. Let $Z_0 = 0$, let $X_i$ be the amount won in the $i$-th game, and let $Z_i$ be the total amount won after $i$ games. Let us assume that the player quits the game when he has either lost $l_1$ euros or won $l_2$ euros (for given $l_1, l_2$). What is the probability that the player wins $l_2$ euros before losing $l_1$ euros?

GAMBLER'S PROBLEM - ANSWER

Let $T$ be the first time at which the gambler has either won $l_2$ or lost $l_1$ euros. $T$ is a stopping time for the sequence $X_1, X_2, \dots$, and the sequence $Z_0, Z_1, \dots$ is a martingale. Since the values of the $Z_i$ up to time $T$ are bounded (they lie between $-l_1$ and $l_2$), the martingale stopping theorem can be applied. Therefore, we have
\[ E[Z_T] = 0. \]
Let now $p$ be the probability that the gambler quits playing after winning $l_2$ euros. Then
\[ E[Z_T] = l_2 p - l_1(1 - p) = 0, \]
and therefore
\[ p = \frac{l_1}{l_1 + l_2}. \]
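A quick simulation (our sketch; the limits $l_1, l_2$ are arbitrary) agrees with this answer:

```python
import random

def win_probability(l1, l2, runs=100_000):
    # Fair +1/-1 random walk from 0, stopped at -l1 (ruin) or +l2 (win)
    wins = 0
    for _ in range(runs):
        z = 0
        while -l1 < z < l2:
            z += 1 if random.getrandbits(1) else -1
        wins += (z == l2)
    return wins / runs

l1, l2 = 3, 7
print(win_probability(l1, l2), "vs l1/(l1+l2) =", l1 / (l1 + l2))   # both ~0.3
```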
ELECTION PROBLEM

Suppose candidates A and B run for election; at the end, A gets $v_A$ votes, B gets $v_B$ votes, and $v_B < v_A$. Let us assume that the votes are counted in a random order. What is the probability that candidate A is always ahead during the counting process?

Let $n = v_A + v_B$ and let $S_k$ be the number of votes by which A is leading after $k$ votes have been counted (negative if A is trailing). Clearly $S_n = v_A - v_B$. For $0 \le k \le n - 1$ we define
\[ X_k = \frac{S_{n-k}}{n-k}. \]
It can be shown, after some calculation, that the sequence $X_0, X_1, \dots, X_{n-1}$ forms a martingale. Note that this sequence relates to the counting process in backward order: $X_0$ is a function of $S_n$, and so on.

ELECTION PROBLEM - RESULT

Let $T$ be the minimum $k$ such that $X_k = 0$ if such a $k$ exists, and $T = n - 1$ otherwise. $T$ is a bounded stopping time, and therefore the martingale stopping theorem gives
\[ E[X_T] = E[X_0] = \frac{E[S_n]}{n} = \frac{v_A - v_B}{v_A + v_B}. \]

Case 1: Candidate A leads throughout the count. In such a case all $S_{n-k}$, and therefore all $X_k$, are positive, so $T = n - 1$ and $X_T = X_{n-1} = S_1 = 1$ (the first counted vote must be for A).

Case 2: Candidate A does not lead throughout the count. Since the lead changes by one vote at a time and ends positive, if candidate B is ever ahead or tied there must be some $k \le n - 1$ with $S_{n-k} = 0$ and hence $X_k = 0$. In this case $T = k \le n - 1$ and $X_T = 0$.

We therefore have
\[ E[X_T] = \frac{v_A - v_B}{v_A + v_B} = 1 \cdot \Pr(\text{Case 1}) + 0 \cdot \Pr(\text{Case 2}), \]
and therefore the probability of Case 1, in which candidate A leads throughout the count, is
\[ \frac{v_A - v_B}{v_A + v_B}. \]

AZUMA-HOEFFDING INEQUALITY

Perhaps the main importance of the martingale concept for the analysis of randomized algorithms is due to various special Chernoff-type inequalities that can be applied even in cases where the underlying random variables are not independent.

Theorem: Let $X_0, X_1, \dots, X_n$ be a martingale such that, for every $k$,
\[ |X_k - X_{k-1}| \le c_k \]
for some constants $c_k$. Then, for all $t \ge 0$ and any $\lambda > 0$,
\[ \Pr(|X_t - X_0| \ge \lambda) \le 2 e^{-\lambda^2 / \left(2 \sum_{k=1}^{t} c_k^2\right)}. \]

EXAMPLE - PATTERN MATCHING - I.

Let $S = (s_1, \dots, s_n)$ be a string of symbols chosen independently at random from an $s$-ary alphabet $\Sigma$. Let $P = (p_1, \dots, p_k)$ be a fixed string (pattern) of $k$ characters from $\Sigma$. Let $F_{P,S}$ be the number of occurrences of the pattern $P$ in the random string $S$. Clearly,
\[ E[F_{P,S}] = (n - k + 1)\left(\frac{1}{s}\right)^k. \]
We now use a Doob martingale and the Azuma-Hoeffding inequality to show that, if $k$ is relatively small with respect to $n$, then the number of occurrences of the pattern $P$ in $S$ is highly concentrated around its mean. Let $Z_0 = E[F_{P,S}]$ and, for $1 \le i \le n$, let $Z_i = E[F_{P,S} \mid s_1, \dots, s_i]$. The sequence $Z_0, \dots, Z_n$ is a Doob martingale, and $Z_n = F_{P,S}$.

EXAMPLE - PATTERN MATCHING - II.

Since each character of the string can participate in no more than $k$ possible occurrences of the pattern, for any $0 \le i < n$ we have $|Z_{i+1} - Z_i| \le k$; in other words, the value of $s_{i+1}$ can affect the value of $F_{P,S}$ by at most $k$. Hence
\[ \big|E[F_{P,S} \mid s_1, \dots, s_{i+1}] - E[F_{P,S} \mid s_1, \dots, s_i]\big| = |Z_{i+1} - Z_i| \le k. \]
By the Azuma-Hoeffding inequality,
\[ \Pr(|F_{P,S} - E[F_{P,S}]| \ge \varepsilon) = \Pr(|Z_n - Z_0| \ge \varepsilon) \le 2 e^{-\varepsilon^2 / (2nk^2)}. \]

WAITING TIMES for PATTERNS

Problem: Let us suppose we flip a fair coin until some fixed pattern appears. What is the expected number of coin flips until this happens? Example: we flip coins until we see HTHH.

Suppose that $x_1 x_2 \dots x_k$ is the pattern we want to see. Let us imagine we have an army of gamblers, with a new one showing up before each coin flip. Each gambler starts by betting 1 dollar that the next coin flip will be $x_1$. If she wins, she bets her 2 dollars that the next coin flip will be $x_2$, continuing to play double-or-nothing until she either loses (and is out of her initial 1 dollar) or wins her last bet on $x_k$ (and is up $2^k - 1$ dollars). Because each gambler's winnings form a martingale, so does their sum, and so the expected total winnings of all gamblers up to the stopping time $\tau$, at which our pattern occurs for the first time, is 0.

These facts will now be used to compute $E[\tau]$. When we stop at time $\tau$, we have one gambler who has won $2^k - 1$. Other gamblers may still be in play: for each $i$ with $x_1 \dots x_i = x_{k-i+1} \dots x_k$ (that is, whenever the length-$i$ prefix of the pattern equals its length-$i$ suffix), there is a gambler with net winnings $2^i - 1$. All remaining gamblers are at $-1$, having lost their initial dollar. Let $\chi_i = 1$ if $x_1 \dots x_i = x_{k-i+1} \dots x_k$, and 0 otherwise. Then, using the stopping time theorem,
\[ E[X_\tau] = E\left[-\tau + \sum_{i=1}^{k} \chi_i 2^i\right] = -E[\tau] + \sum_{i=1}^{k} \chi_i 2^i = 0, \]
and therefore
\[ E[\tau] = \sum_{i=1}^{k} \chi_i 2^i. \]
Examples: if the pattern is HTHH (HHHH) [THHH], then $E[\tau]$ equals 18 (30) [16].
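The formula for $E[\tau]$ is easy to evaluate, and a simulation (our illustrative sketch) confirms the three quoted values:

```python
import random

def expected_wait(pattern):
    # E[tau] = sum of 2^i over those i where the length-i prefix
    # of the pattern equals its length-i suffix (chi_i = 1)
    k = len(pattern)
    return sum(2 ** i for i in range(1, k + 1) if pattern[:i] == pattern[k - i:])

def simulated_wait(pattern, runs=20_000):
    total = 0
    for _ in range(runs):
        window, flips = "", 0
        while not window.endswith(pattern):
            window += random.choice("HT")
            flips += 1
        total += flips
    return total / runs

for p in ("HTHH", "HHHH", "THHH"):
    print(p, expected_wait(p), round(simulated_wait(p), 1))   # 18, 30, 16
```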
EXAMPLE - POLYA'S URN SCHEME

Consider an urn that initially contains $b$ black balls and $w$ white balls. Let a sequence of random selections from this urn be performed, where at each step the chosen ball is replaced by $c$ balls of the same color. If $X_i$ denotes the fraction of black balls in the urn after the $i$-th trial, then the sequence $X_0, X_1, \dots$ is a martingale.

EXAMPLE - OCCUPANCY PROBLEM

Suppose that $m$ balls are thrown randomly into $n$ bins and let $Z$ denote the number of bins that remain empty at the end. For $0 \le t \le m$, let $Z_t$ be the expectation, at time $t$ (after $t$ balls have been thrown), of the number of bins that are empty at time $m$. The sequence of random variables $Z_0, Z_1, \dots, Z_m$ is a martingale, with $Z_0 = E[Z]$ and $Z_m = Z$.

SOME ESTIMATIONS

Kolmogorov-Doob inequality: Let $X_0, X_1, \dots$ be a martingale. Then for any $\lambda > 0$,
\[ \Pr\left[\max_{0 \le i \le n} X_i \ge \lambda\right] \le \frac{E[|X_n|]}{\lambda}. \]

Azuma inequality: Let $X_0, X_1, \dots$ be a martingale sequence such that $|X_k - X_{k-1}| \le c_k$ for each $k$. Then for all $t \ge 0$ and any $\lambda > 0$,
\[ \Pr[|X_t - X_0| \ge \lambda] \le 2 \exp\left(-\frac{\lambda^2}{2 \sum_{k=1}^{t} c_k^2}\right). \]

Corollary: Let $X_0, X_1, \dots$ be a martingale sequence such that $|X_k - X_{k-1}| \le c$ for each $k$, where $c$ is independent of $k$. Then, for all $t \ge 0$ and any $\lambda > 0$,
\[ \Pr\big[|X_t - X_0| \ge \lambda c \sqrt{t}\big] \le 2 e^{-\lambda^2/2}. \]

OCCUPANCY PROBLEM REVISITED

Let us have $m$ balls thrown randomly into $n$ bins and let $Z$ denote the number of bins that remain empty. The Azuma inequality allows one to show that
\[ \mu = E[Z] = n\left(1 - \frac{1}{n}\right)^m \approx n e^{-m/n} \]
and, for $\lambda > 0$,
\[ \Pr[|Z - \mu| \ge \lambda] \le 2 \exp\left(-\frac{\lambda^2 (n - 1/2)}{n^2 - \mu^2}\right). \]

APPENDIX

EXERCISES

1. Which is larger, $e^{\pi}$ or $\pi^{e}$ (where $e$ is the base of the natural logarithm)?
Hint 1: There exists a one-line proof of the correct relation.
Hint 2: Use the inequality $e^x > 1 + x$ with $x = \pi/e - 1$.
Solution: $e^{\pi/e - 1} > 1 + \pi/e - 1 = \pi/e$ implies $e^{\pi/e} > \pi$, and hence $e^{\pi} > \pi^{e}$.
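For the record, a one-line numeric check (illustrative only):

```python
import math
print(math.e ** math.pi, ">", math.pi ** math.e)   # 23.1407... > 22.4592...
```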