MUNI FI Experiments & Sample Spaces Probability PA154 Language Modeling (1.2) Pavel Rychlý pary@fi.muni.cz February 16, 2023 Experiment, process, test,... Set of possible basic outcomes: sample space f! základní prostor obsahující možné výsledky) ■ coin toss (íí = {head.taiL}), die (íí = {1..6}) ■ yes/no opinion poLL, quality test (bad/good) (fi = {0,1}) ■ Lottery (|íí |= 107..1012) # of traffic accidents somewhere per year (fi = N) ■ spelling errors (fi = Z*), where Z is an apLhabet, and Z* is set of possibLe strings over such aLphabet ■ missing word (|fi = vocabuLary size) Source: Introduction to Natural Language Processing (600.465) Jan Hajic.CS Dept.,Johns Hopkins Univ. www.cs.jn u.edu/~najic ^avel Rycniy ■ Probability ■ Fepruary 16,2023 Events Probability Event jev) A is a set of basic outcomes Usually A c f!, and all A e 2n (the event space, jevové pole) ■ fi is the certain event jistý jev), 0 is the impossible event nemožný jev) Example: ■ experiment: three times coin toss ■ fi = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} ■ count cases with exactLy two taiLs: then ■ A = {HTT, THT, TTH} ■ aLL heads: ■ A = {HHH} Repeat experiment many times, record how many times a given event A occured ("count" Ci_). Do this whole series many times; remember all c,s. Q Observation: if repeated really many times, the ratios of (where T-, is the number of experiments run in the i-th series) are close to some (unknown but) constant value. Call this constanta probability of A. Notation: p(A) ^avel Rycniy ■ Probability ■ Fepruary 16,2023 ^avel Rycniy ■ Probability ■ Fepruary 16,2023 Estimating Probability Example Remember: ...close to an unknown constant. We can only estimate it: ■ from a singLe series (typicaL case, as mostLy the outcome of a series is given to us we cannot repeat the experiment): ■ otherwise, take the weighted average of aLL y- (or, if the data aLLows, simpLy Look at the set of series as if it is a singLe Long series). This is the best estimate. Recall our example: ■ experiment: three times coin toss ■ n = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} ■ count cases with exactLy two taiLs: A = {HTT, THT, TTH] Run an experiment 1000 times (i.e. 3000 tosses) Counted: 386 cases with two tails (HTT, THT or TTH) estimate: p(A) = 386/1000 = .386 Run again: 373,399, 382, 355, 372,406, 359 ■ p(A) = .379 (weighted average) or simpLy 3032/8000 Uniform distribution assumption: p(A) = 3/8 = .375 ^avel Rycniy ■ Probability ■ Fepruary 16,2023 5/16 ^avel Rycniy ■ Probability ■ Fepruary 16,2023 6/16 Basic Properties Joint and Conditional Probability Basic properties: ■ p: 2n -> [0,1] ■ P(fi) = 1 ■ Disjoint events: p(u A,) = X),-p(A/) NB: axiomatic depniton of probability: take the above three conditions as axioms Immediate consequences: ■ P(0) = O ■ p(A) = 1 - p(A) ■ A c B p(A) < P(B) p{A,B)=p{Ac\B) P{A,B) p{A\B) P(B) Estimating form counts: c(A n fi) p(A, fi) _ T _ c(A n fi) P(A\B) , P(B) m T c(B) ^avel Rychlý ■ Probability - February 16,2023 ^avel Rychlý ■ Probability ■ February 16,2023 Bayes Rule ■ p(A,B) = p(B,A) since p(A n B) = p(B n A) ■ therefore p(A\B)p(B) = p(B\A)p(A), and therefore: Bayes Rule P(A\B) : P(B\A) x p(A) 08 Independence ■ Can we compute p(A,B) from p(A) and p(B)? ■ Recall from previous foil: p(A\B) p(B\A) x p(A) P(B) p(A\B) x p(B) = p(B\A) x p(A) p(A,B) = p(B\A)xp(A) ...we're almost there: how p{B\A) relates to p(B)? ■ p(B|A) = p(B) iff A and B are independent Example: two coin tosses, weather today and weather on March 4th 1789; Any two events for which p(B|A) = P(B)! ^avel Rychlý ■ Probability ■ February 16,2023 ^avel Rychlý ■ Probability ■ February 16,2023 Chain Rule The Golden Rule of Classic Statistical NLP p{A1,A2,Ai,Ali,...,An) = p{A1\A2,A1„AA,... ,A„) x p{A2\Ai,A4,... ,A„)x xp(/3|/4, ...,/„) x ■■■ x p(/„_i|/„) x p(A„) this is a direct consequence of the Bayes rule. Interested in an event A given B (where it is not easy or practical or desirable) to estimate p{A\B): take Bayes rule, max over all Bs: P{B\A) x p{A) argmaxAp(A\B) = argmaxA- P(B) argmaxA(p(B\A) x p{A)) ..as p(B) is constant when changing As ^avel Rychlý ■ Probability - February 16,2023 11/16 ^avel Rychlý ■ Probability ■ February 16,2023 12/16 Random Variables Expectation Joint and Conditional Distributions is a function X : Q. -> 0 ■ in general. 0 = R", typicaLLy R ■ easier to handLe reaL numbers than reaL-worLd events random variable is discrete if 0 is countable (i.e. also if finite) Example: die: natural "numbering" [1,6], coin: {0,1} Probability distribution: ■ px[x) = p(X = x) =df p(Ax) where Ax = {a e Q : X{a) = x} ■ often just p(x) if it is dear from context what X is ^avel Rychlý ■ Probability ■ February 16,2023 Standard Distributions Binomial (discrete) ■ outcome: 0 or 1 (thus b/nomiaL) ■ make n triaLs ■ interested in the (probability of) numbers of successes r Must be careful: it's not uniform! (") pb(r\n) = (for equally likely outcome) (") counts how many possibilities there are for choosing r objects out of n; n = n- is a mean of a random variable (weighted average) ■ E{X) = Exex(n)xMx) Example: one six-sided die: 3.5, two dice (sum): 7 Joint and Conditional distribution rules: ■ anaLogousto probability of events Bayes:pX|y(x,y) —notation —even simpler notation p(x\y) : p{y\x)-p{x) p(y) Chain rule:ip(w,x,y,z) — p(z).p(y\z).p(x\y,z).p(w\x,y,z) ^avel Rychlý ■ Probability ■ February 16,2023 Continuous Distributions The normal distribution ("Gaussian") Pnorm(x\ß,v) = exp 2(72 2-K where: ■ ju is the mean (x-coordinate of the peak) (0) ■ a is the standard deviation (1) M- other: hyperbolic, t ^avel Rychlý ■ Probability - February 16,2023 15/16 ^avel Rychlý ■ Probability ■ February 16,2023 16/16