Probability
PA154 Jazykové modelování (1.2)
Pavel Rychlý
pary@fi.muni.cz
February 23, 2017
Source: Introduction to Natural Language Processing (600.465)
Jan Hajic, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

Experiments & Sample Spaces
■ Experiment, process, test, ...
■ Set of possible basic outcomes: sample space Ω (Czech: základní prostor obsahující možné výsledky)
 ► coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
 ► yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
 ► lottery (|Ω| large, on the order of many millions of tickets)
 ► # of traffic accidents somewhere per year (Ω = N)
 ► spelling errors (Ω = Σ*), where Σ is an alphabet and Σ* is the set of all possible strings over that alphabet
 ► missing word (|Ω| = vocabulary size)

Events
■ Event (Czech: jev) A is a set of basic outcomes
■ Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space, Czech: jevové pole)
 ► Ω is the certain event (jistý jev), ∅ is the impossible event (nemožný jev)
■ Example:
 ► experiment: three coin tosses
 ► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
 ► count the cases with exactly two tails: then A = {HTT, THT, TTH}
 ► all heads: A = {HHH}

Probability
■ Repeat the experiment many times, record how many times a given event A occurred ("count" c_1).
■ Do this whole series many times; remember all the c_i's.
■ Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
■ Call this constant the probability of A. Notation: p(A)

Estimating Probability
■ Remember: ... close to an unknown constant.
■ We can only estimate it:
 ► from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): p(A) = c_1/T_1
 ► otherwise, take the weighted average of all c_i/T_i (or, if the data allows, simply treat the set of series as a single long series). This is the best estimate.
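The frequency-ratio view of probability above can be illustrated with a minimal Python sketch (the simulation setup is my own, not from the slides): run the three-coin-toss experiment many times and watch the ratio c/T approach a constant.

```python
import random

def estimate_two_tails(num_runs, rng):
    # Run the three-coin-toss experiment num_runs times and count
    # the outcomes that fall in A = {exactly two tails}; the ratio
    # count/num_runs is the frequency estimate of p(A).
    count = 0
    for _ in range(num_runs):
        tosses = [rng.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:
            count += 1
    return count / num_runs

rng = random.Random(0)          # fixed seed for reproducibility
estimate = estimate_two_tails(100_000, rng)
print(estimate)                 # close to the true value 3/8 = 0.375
```

With more runs the estimate fluctuates less around 3/8, which is exactly the "close to some constant value" observation the slide describes.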
Example
■ Recall our example:
 ► experiment: three coin tosses
 ► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
 ► count the cases with exactly two tails: A = {HTT, THT, TTH}
■ Run the experiment 1000 times (i.e. 3000 tosses)
■ Counted: 386 cases with two tails (HTT, THT or TTH)
■ estimate: p(A) = 386/1000 = .386
■ Run again: 373, 399, 382, 355, 372, 406, 359
 ► p(A) = .379 (weighted average), or simply 3032/8000
■ Uniform distribution assumption: p(A) = 3/8 = .375

Basic Properties
■ Basic properties:
 ► p: 2^Ω → [0, 1]
 ► p(Ω) = 1
 ► Disjoint events: p(∪_i A_i) = Σ_i p(A_i)
■ NB: axiomatic definition of probability: take the above three conditions as axioms
■ Immediate consequences:
 ► p(∅) = 0
 ► p(Ā) = 1 − p(A)
 ► A ⊆ B ⇒ p(A) ≤ p(B)
 ► Σ_{a∈Ω} p(a) = 1

Joint and Conditional Probability
■ p(A, B) = p(A ∩ B)
■ p(A|B) = p(A, B) / p(B)
■ Estimating from counts:
 ► p(A|B) = p(A, B) / p(B) = (c(A ∩ B)/T) / (c(B)/T) = c(A ∩ B) / c(B)
[figure: Venn diagram of two overlapping events A and B]

Bayes Rule
■ p(A, B) = p(B, A), since p(A ∩ B) = p(B ∩ A)
 ► therefore p(A|B) p(B) = p(B|A) p(A), and therefore:
 ► p(A|B) = p(B|A) × p(A) / p(B)

Independence
■ Can we compute p(A, B) from p(A) and p(B)?
■ Recall from the previous foil:
 ► p(A|B) = p(B|A) × p(A) / p(B)
 ► p(A|B) × p(B) = p(B|A) × p(A)
 ► p(A, B) = p(B|A) × p(A)
■ ... we're almost there: how does p(B|A) relate to p(B)?
 ► p(B|A) = p(B) iff A and B are independent
■ Example: two coin tosses; the weather today and the weather on March 4th, 1789
■ Any two events for which p(B|A) = p(B)!

Chain Rule
■ p(A_1, A_2, A_3, A_4, ..., A_n) = p(A_1|A_2, A_3, A_4, ..., A_n) × p(A_2|A_3, A_4, ..., A_n) × p(A_3|A_4, ..., A_n) × ... × p(A_{n−1}|A_n) × p(A_n)
■ this is a direct consequence of the Bayes rule.
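The conditional-probability and Bayes-rule identities above can be checked exhaustively on the three-toss sample space. A small Python sketch (the choice of event B, "first toss is a tail", is my own illustration, not from the slides):

```python
from itertools import product
from fractions import Fraction

# Full sample space of three coin tosses, under the uniform assumption.
omega = ["".join(t) for t in product("HT", repeat=3)]

def p(event):
    # Probability of an event (a set of basic outcomes): |event| / |omega|.
    return Fraction(len(event), len(omega))

A = {o for o in omega if o.count("T") == 2}   # exactly two tails
B = {o for o in omega if o[0] == "T"}         # first toss is a tail (illustrative)

# Conditional probability from the definition: p(A|B) = p(A ∩ B) / p(B).
p_A_given_B = p(A & B) / p(B)

# Bayes rule: p(A|B) = p(B|A) * p(A) / p(B).
p_B_given_A = p(B & A) / p(A)
bayes = p_B_given_A * p(A) / p(B)
assert p_A_given_B == bayes                   # the two routes agree exactly

print(p_A_given_B)                            # 1/2
```

Using `Fraction` keeps every ratio exact, so the equality between the definitional route and the Bayes-rule route holds with no floating-point tolerance.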
The Golden Rule of Classic Statistical NLP
■ Interested in an event A given B (where it is not easy, practical or desirable to estimate p(A|B) directly); take the Bayes rule, and maximize over all As:
 ► argmax_A p(A|B) = argmax_A p(B|A) × p(A) / p(B) = argmax_A (p(B|A) × p(A))
 ► ... as p(B) is constant when changing As

Random Variables
■ a random variable is a function X : Ω → Q
 ► in general Q = R^n, typically R
 ► easier to handle real numbers than real-world events
■ a random variable is discrete if Q is countable (i.e. also if finite)
■ Example: die: natural "numbering" {1, ..., 6}; coin: {0, 1}
■ Probability distribution:
 ► p_X(x) = p(X = x) =df p(A_x), where A_x = {a ∈ Ω : X(a) = x}
 ► often just p(x), if it is clear from context what X is

Expectation; Joint and Conditional Distributions
■ the expectation is the mean of a random variable (weighted average)
 ► E(X) = Σ_{x∈X(Ω)} x · p_X(x)
■ Example: one six-sided die: 3.5; two dice (sum): 7
■ Joint and conditional distribution rules:
 ► analogous to probabilities of events
■ Bayes: p_{X|Y}(x|y) = p_{Y|X}(y|x) · p_X(x) / p_Y(y), or in even simpler notation: p(x|y) = p(y|x) · p(x) / p(y)
■ Chain rule: p(w, x, y, z) = p(z) · p(y|z) · p(x|y, z) · p(w|x, y, z)

Standard Distributions
■ Binomial (discrete)
 ► outcome: 0 or 1 (thus binomial)
 ► make n trials
 ► interested in the (probability of the) number of successes r
■ Must be careful: it's not uniform!
 ► p_b(r|n) = C(n, r) / 2^n (for equally likely outcomes)
 ► C(n, r) counts how many possibilities there are for choosing r objects out of n: C(n, r) = n! / ((n − r)! r!)

Continuous Distributions
■ The normal distribution ("Gaussian"):
 ► p_norm(x|μ, σ) = (1 / (σ √(2π))) · exp(−(x − μ)^2 / (2σ^2))
■ where:
 ► μ is the mean (the x-coordinate of the peak) (default 0)
 ► σ is the standard deviation (default 1)
■ other: hyperbolic, t
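The binomial formula and the expectation rule above can be tied back to the coin-toss example with a short Python sketch (function names are mine, not from the slides):

```python
from math import comb

def p_binomial_fair(r, n):
    # Probability of exactly r successes in n fair (p = 1/2) trials:
    # p_b(r|n) = C(n, r) / 2^n, matching the slide's formula for
    # equally likely outcomes.
    return comb(n, r) / 2**n

# Two tails in three tosses: C(3, 2) / 2^3 = 3/8, as on the Example slide.
assert p_binomial_fair(2, 3) == 3 / 8

# The pmf sums to 1 over all possible success counts.
assert sum(p_binomial_fair(r, 10) for r in range(11)) == 1.0

# Expectation E(X) = sum_x x * p_X(x): one fair six-sided die gives 3.5.
die_expectation = sum(x for x in range(1, 7)) / 6
print(die_expectation)  # 3.5
```

Note how p_b(2|3) recovers the uniform-assumption value 3/8 = .375 from the earlier slide, and the die expectation matches the 3.5 quoted under Expectation.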