MUNI
FI
Experiments & Sample Spaces
Probability
PA154 Language Modeling (1.2)
Pavel Rychlý
pary@fi.muni.cz February 16, 2023
Experiment, process, test,...
Set of possible basic outcomes: sample space f! základní prostor obsahující možné výsledky)
■ coin toss (íí = {head.taiL}), die (íí = {1..6})
■ yes/no opinion poLL, quality test (bad/good) (fi = {0,1})
■ Lottery (|íí |= 107..1012)
# of traffic accidents somewhere per year (fi = N)
■ spelling errors (fi = Z*), where Z is an apLhabet, and Z* is set of possibLe strings over such aLphabet
■ missing word (|fi = vocabuLary size)
Source: Introduction to Natural Language Processing (600.465) Jan Hajic.CS Dept.,Johns Hopkins Univ. www.cs.jn u.edu/~najic
^avel Rycniy ■ Probability ■ Fepruary 16,2023
Events
Probability
Event jev) A is a set of basic outcomes
Usually A c f!, and all A e 2n (the event space, jevové pole)
■ fi is the certain event jistý jev), 0 is the impossible event nemožný jev)
Example:
■ experiment: three times coin toss
■ fi = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
■ count cases with exactLy two taiLs: then
■ A = {HTT, THT, TTH}
■ aLL heads:
■ A = {HHH}
Repeat experiment many times, record how many times a given
event A occured ("count" Ci_).
Do this whole series many times; remember all c,s.
Q
Observation: if repeated really many times, the ratios of
(where T-, is the number of experiments run in the i-th series) are close to some (unknown but) constant value. Call this constanta probability of A. Notation: p(A)
^avel Rycniy ■ Probability ■ Fepruary 16,2023
^avel Rycniy ■ Probability ■ Fepruary 16,2023
Estimating Probability
Example
Remember: ...close to an unknown constant. We can only estimate it:
■ from a singLe series (typicaL case, as mostLy the outcome of a series is given to us we cannot repeat the experiment):
■ otherwise, take the weighted average of aLL y- (or, if the data
aLLows, simpLy Look at the set of series as if it is a singLe Long series).
This is the best estimate.
Recall our example:
■ experiment: three times coin toss
■ n = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
■ count cases with exactLy two taiLs: A = {HTT, THT, TTH] Run an experiment 1000 times (i.e. 3000 tosses) Counted: 386 cases with two tails (HTT, THT or TTH) estimate: p(A) = 386/1000 = .386
Run again: 373,399, 382, 355, 372,406, 359
■ p(A) = .379 (weighted average) or simpLy 3032/8000 Uniform distribution assumption: p(A) = 3/8 = .375
^avel Rycniy ■ Probability ■ Fepruary 16,2023
5/16
^avel Rycniy ■ Probability ■ Fepruary 16,2023
6/16
Basic Properties
Joint and Conditional Probability
Basic properties:
■ p: 2n -> [0,1]
■ P(fi) = 1
■ Disjoint events: p(u A,) = X),-p(A/)
NB: axiomatic depniton of probability: take the above three conditions as axioms Immediate consequences:
■ P(0) = O
■ p(A) = 1 - p(A)
■ A c B    p(A) < P(B)
p{A,B)=p{Ac\B) P{A,B)
p{A\B)
P(B)
Estimating form counts:
c(A n fi)
p(A, fi) _      T      _ c(A n fi)
P(A\B) ,
P(B)
m
T
c(B)
^avel Rychlý ■ Probability - February 16,2023
^avel Rychlý ■ Probability ■ February 16,2023
Bayes Rule
■ p(A,B) = p(B,A) since p(A n B) = p(B n A)
■ therefore p(A\B)p(B) = p(B\A)p(A), and therefore:
Bayes Rule
P(A\B) :
P(B\A) x p(A)
08
Independence
■ Can we compute p(A,B) from p(A) and p(B)?
■ Recall from previous foil:
p(A\B)
p(B\A) x p(A)
P(B)
p(A\B) x p(B) = p(B\A) x p(A) p(A,B) = p(B\A)xp(A)
...we're almost there: how p{B\A) relates to p(B)?
■ p(B|A) = p(B) iff A and B are independent Example: two coin tosses, weather today and weather on March 4th 1789;
Any two events for which p(B|A) = P(B)!
^avel Rychlý ■ Probability ■ February 16,2023
^avel Rychlý ■ Probability ■ February 16,2023
Chain Rule
The Golden Rule of Classic Statistical NLP
p{A1,A2,Ai,Ali,...,An) =
p{A1\A2,A1„AA,... ,A„) x p{A2\Ai,A4,... ,A„)x
xp(/3|/4, ...,/„) x ■■■ x p(/„_i|/„) x p(A„)
this is a direct consequence of the Bayes rule.
Interested in an event A given B (where it is not easy or practical or desirable) to estimate p{A\B): take Bayes rule, max over all Bs:
P{B\A) x p{A)
argmaxAp(A\B) = argmaxA-
P(B)
argmaxA(p(B\A) x p{A))
..as p(B) is constant when changing As
^avel Rychlý ■ Probability - February 16,2023
11/16
^avel Rychlý ■ Probability ■ February 16,2023
12/16
Random Variables
Expectation
Joint and Conditional Distributions
is a function X : Q. -> 0
■ in general. 0 = R", typicaLLy R
■ easier to handLe reaL numbers than reaL-worLd events
random variable is discrete if 0 is countable (i.e. also if finite) Example: die: natural "numbering" [1,6], coin: {0,1} Probability distribution:
■ px[x) = p(X = x) =df p(Ax) where Ax = {a e Q : X{a) = x}
■ often just p(x) if it is dear from context what X is
^avel Rychlý ■ Probability ■ February 16,2023
Standard Distributions
Binomial (discrete)
■ outcome: 0 or 1 (thus b/nomiaL)
■ make n triaLs
■ interested in the (probability of) numbers of successes r Must be careful: it's not uniform!
(")
pb(r\n) =      (for equally likely outcome)
(") counts how many possibilities there are for choosing r objects out of n;
n = n-
is a mean of a random variable (weighted average)
■ E{X) = Exex(n)xMx)
Example: one six-sided die: 3.5, two dice (sum): 7 Joint and Conditional distribution rules:
■ anaLogousto probability of events Bayes:pX|y(x,y)
—notation —even simpler notation
p(x\y) :
p{y\x)-p{x) p(y)
Chain rule:ip(w,x,y,z) — p(z).p(y\z).p(x\y,z).p(w\x,y,z)
^avel Rychlý ■ Probability ■ February 16,2023
Continuous Distributions
The normal distribution ("Gaussian")
Pnorm(x\ß,v) = exp
2(72
2-K
where:
■ ju is the mean (x-coordinate of the peak) (0)
■ a is the standard deviation (1)
		
	M-	
other: hyperbolic, t
^avel Rychlý ■ Probability - February 16,2023
15/16
^avel Rychlý ■ Probability ■ February 16,2023
16/16