Probability
PA154 Language Modeling (1.2)
Pavel Rychlý
pary@fi.muni.cz
February 20, 2024
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Experiments & Sample Spaces
■ Experiment, process, test, ...
■ Set of possible basic outcomes: sample space Ω (Czech: základní prostor, the space containing the possible outcomes)
  ■ coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
  ■ yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
  ■ lottery (|Ω| ≅ 10^7 .. 10^12)
  ■ # of traffic accidents somewhere per year (Ω = N)
  ■ spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over that alphabet
  ■ missing word (|Ω| = vocabulary size)

Events
■ Event (Czech: jev) A is a set of basic outcomes
■ Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space, Czech: jevové pole)
■ Ω is the certain event (Czech: jistý jev), ∅ is the impossible event (Czech: nemožný jev)
■ Example:
  ■ experiment: three coin tosses
  ■ Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  ■ cases with exactly two tails: A = {HTT, THT, TTH}
  ■ all heads: A = {HHH}

Probability
■ Repeat the experiment many times, record how many times a given event A occurred ("count" c₁).
■ Do this whole series many times; remember all the cᵢ's.
■ Observation: if repeated really many times, the ratios cᵢ/Tᵢ (where Tᵢ is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
■ Call this constant the probability of A. Notation: p(A)

Estimating Probability
■ Remember: ... close to an unknown constant. We can only estimate it:
  ■ from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): p(A) ≅ c₁/T₁
  ■ otherwise, take the weighted average of all the cᵢ/Tᵢ (or, if the data allows, simply treat the set of series as one single long series). This is the best estimate.

Example
■ Recall our example:
  ■ experiment: three coin tosses
  ■ Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  ■ cases with exactly two tails: A = {HTT, THT, TTH}
■ Run the experiment 1000 times (i.e., 3000 tosses)
■ Counted: 386 cases with exactly two tails (HTT, THT, or TTH)
■ estimate: p(A) = 386/1000 = .386
■ Run again: 373, 399, 382, 355, 372, 406, 359
■ p(A) = .379 (weighted average), or simply 3032/8000
■ Uniform distribution assumption: p(A) = 3/8 = .375

Basic Properties
■ Basic properties:
  ■ p: 2^Ω → [0, 1]
  ■ p(Ω) = 1
  ■ Disjoint events: p(∪ Aᵢ) = Σᵢ p(Aᵢ)
■ NB: the axiomatic definition of probability takes the above three conditions as axioms
■ Immediate consequences:
  ■ p(∅) = 0
  ■ p(Ā) = 1 − p(A)
  ■ A ⊆ B ⇒ p(A) ≤ p(B)
  ■ Σ_{a∈Ω} p(a) = 1

Joint and Conditional Probability
■ p(A, B) = p(A ∩ B)
■ p(A|B) = p(A, B) / p(B)
■ Estimating from counts:
  p(A|B) = p(A, B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)

Bayes Rule
■ p(A, B) = p(B, A), since p(A ∩ B) = p(B ∩ A)
■ therefore p(A|B) p(B) = p(B|A) p(A), and therefore:
■ Bayes Rule: p(A|B) = p(B|A) × p(A) / p(B)
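Before moving on to independence, here is a minimal Python sketch (not part of the original slides) that makes the estimation procedure concrete: it simulates the three-toss experiment from the Example slide, assuming a fair coin, and estimates p(A) for A = "exactly two tails" both from a single series and as the pooled average over several series. The function name run_series is illustrative, not from the source.

```python
import random

def run_series(num_experiments, rng):
    """Run num_experiments three-toss experiments; count those with exactly two tails."""
    count = 0
    for _ in range(num_experiments):
        tosses = [rng.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:  # event A = {HTT, THT, TTH}
            count += 1
    return count

rng = random.Random(0)  # fixed seed so the run is reproducible

# Single series: estimate p(A) as c1 / T1
c1 = run_series(1000, rng)
print(f"single-series estimate: {c1 / 1000:.3f}")

# Eight equal-sized series: the weighted average equals pooled counts / pooled trials
counts = [run_series(1000, rng) for _ in range(8)]
print(f"pooled estimate over 8000 experiments: {sum(counts) / 8000:.3f}")
print(f"uniform-distribution value: {3 / 8}")
```

With a fair coin the estimates cluster around 3/8 = .375, mirroring the 386/1000 and 3032/8000 figures on the Example slide.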
Independence
■ Can we compute p(A, B) from p(A) and p(B)?
■ Recall from the previous foil: p(A|B) × p(B) = p(B|A) × p(A), hence p(A, B) = p(B|A) × p(A)
■ ... we're almost there: how does p(B|A) relate to p(B)?
■ p(B|A) = p(B) iff A and B are independent
■ Example: two coin tosses; the weather today and the weather on March 4th, 1789
■ Any two events for which p(B|A) = p(B)!

Chain Rule
■ p(A₁, A₂, A₃, A₄, ..., Aₙ) = p(A₁|A₂, A₃, A₄, ..., Aₙ) × p(A₂|A₃, A₄, ..., Aₙ) × p(A₃|A₄, ..., Aₙ) × ... × p(Aₙ₋₁|Aₙ) × p(Aₙ)
■ this is a direct consequence of the Bayes rule.

The Golden Rule of Classic Statistical NLP
■ Interested in an event A given B, where it is not easy, practical, or desirable to estimate p(A|B) directly: take Bayes rule, maximize over all As:
  argmax_A p(A|B) = argmax_A p(B|A) × p(A) / p(B) = argmax_A (p(B|A) × p(A))
■ ... as p(B) is constant when changing As

Random Variables
■ a random variable is a function X : Ω → Q
  ■ in general Q = Rⁿ, typically R
  ■ easier to handle real numbers than real-world events
■ a random variable is discrete if Q is countable (i.e., also if finite)
■ Example: die: natural "numbering" [1, 6]; coin: {0, 1}
■ Probability distribution:
  ■ p_X(x) = p(X = x) =df p(A_x), where A_x = {a ∈ Ω : X(a) = x}
  ■ often just p(x) if it is clear from context what X is

Expectation; Joint and Conditional Distributions
■ Expectation is the mean of a random variable (a weighted average):
  E(X) = Σ_{x ∈ X(Ω)} x · p_X(x)
■ Example: one six-sided die: 3.5; two dice (sum): 7
■ Joint and conditional distribution rules:
  ■ analogous to probabilities of events
  ■ Bayes: p_{X|Y}(x, y), in simpler notation p_{XY}(x|y), or even simpler p(x|y):
    p(x|y) = p(y|x) · p(x) / p(y)

Standard Distributions
■ Binomial (discrete)
  ■ outcome: 0 or 1 (thus binomial)
  ■ make n trials
  ■ interested in the (probability of the) number of successes r
  ■ Must be careful: it's not uniform!
  ■ p_b(r|n) = (n choose r) / 2ⁿ (for equally likely outcomes)
  ■ (n choose r) counts how many possibilities there are for choosing r objects out of n:
    (n choose r) = n! / ((n − r)! r!)

Continuous Distributions
■ The normal distribution ("Gaussian"):
  p_norm(x|μ, σ) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²))
■ where:
  ■ μ is the mean (the x-coordinate of the peak) (0 for the standard normal)
  ■ σ is the standard deviation (1 for the standard normal)
■ other: hyperbolic, t
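As a check on the two distributions above, the following sketch (again not from the slides; function names are illustrative) evaluates the binomial formula p_b(r|n) = (n choose r) / 2ⁿ and the Gaussian density directly. Note that p_b(2|3) reproduces the 3/8 from the coin-toss example.

```python
import math

def binomial_equally_likely(r, n):
    """p_b(r|n) = C(n, r) / 2^n for n trials with equally likely 0/1 outcomes."""
    return math.comb(n, r) / 2 ** n

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density p_norm(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(binomial_equally_likely(2, 3))  # 0.375 = 3/8: exactly two tails in three tosses
print(normal_pdf(0.0))                # peak of the standard normal: 1/sqrt(2*pi) ~ 0.3989
```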