Probability
PA154 Language Modeling (1.2)
Pavel Rychlý
pary@fi.muni.cz February 16, 2023
Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic
Experiments & Sample Spaces
Experiment, process, test,...
Set of possible basic outcomes: sample space Q základní prostor obsahující možné výsledky)
■ coin toss (O = {head, tail}), die (O = {1..6})
■ yes/no opinion poll, quality test (bad/good) (O = {0,1})
■ lottery     \oě 10r..1012)
■ # of traffic accidents somewhere per year (O = N)
■ spelling errors (O = Z*), where Z is an aplhabet, and Z* is set of possible strings over such alphabet
■ missing word (|0 |= vocabulary size)
Pavel Rychlý • Probability • February 16, 2023
2/16
Events
■ Event jev) A is a set of basic outcomes
■ Usually AcQ, and all Ae2Q (the event space, jevové pole)
■ O is the certain event jistý jev), 0 is the impossible event nemožný jev)
■ Example:
■ experiment: three times coin toss
■ ft = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
■ count cases with exactly two tails: then
■ A = {HTT, THT, TTH}
■ all heads:
■ A= {HHH}
Pavel Rychlý • Probability • February 16, 2023
3/16
Probability
■ Repeat experiment many times, record how many times a given event A occured ("count" C\).
■ Do this whole series many times; remember all qs.
■ Observation: if repeated really many times, the ratios of
(where 7} is the number of experiments run in the i-th series) are dose to some (unknown but) constant value.
■ Call this constant a probability of A. Notation: p(A)
Pavel Rychlý • Probability • February 16, 2023
4/16
Estimating Probability
Remember: ...dose to an unknown constant. We can only estimate it:
■ from a single series (typical case, as mostly the outcome of a series is given to us we cannot repeat the experiment):
M = £
'1
Q
■ otherwise, take the weighted average of all — (or, if the data
allows, simply look at the set of series as if it is a single long series).
This is the best estimate.
Pavel Rychlý • Probability • February 16, 2023
Example
■ Recall our example:
■ experiment: three times coin toss
■ ft = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
■ count cases with exactly two tails: A = {HTT, THT, TTH}
■ Run an experiment 1000 times (i.e. 3000 tosses)
■ Counted: 386 cases with two tails (HTT, THT or TTH)
■ estimate: p(A) = 386/1000 = .386
■ Run again: 373, 399, 382, 355, 372,406, 359
■ p(A) = .379 (weighted average) or simply 3032/8000
■ Uniform distribution assumption: p(A) = 3/8 = .375
Pavel Rychlý • Probability • February 16, 2023
6/16
Basic Properties
■ Basic properties:
■ p: 2a -> [0,1]
■ P(«) = 1
■ Disjoint events: p(U A,) = ^,p(A,)
■ NB: axiomatic definiton of probability: take the above three conditions as axioms
■ Immediate consequences:
■ P(0) = 0
> p(A) = l-p(A)
■ ACB^ p(A) < P(B)
■ EoGnP(a) = 1
Pavel Rychlý • Probability • February 16, 2023
7/16
Joint and Conditional Probability
■ p(A, B) = p(A n B)
p(A,B)
p(A\B) =
Piß)
Estimating form counts:
p(A\B) =
P(A, B) P(B)
c(A n B)
T
cjß)
T
c(A n B) c(B)
Pavel Rychlý • Probability • February 16, 2023
8/16
Bayes Rule
p(A,B) = p(B,A) since p(A n B) = p(B n A)
■ therefore p(A\B)p(B) = p(B\A)p(A), and therefore:
Bayes Rule
p(A\B) =
p{B\A) x p{A) Piß)
Pavel Rychlý • Probability • February 16, 2023
9/16
Independence
Can we compute p(A,B) from p(A) and p(B)? RecaLL from previous foil:
p{A\B) xp(B)=p{B\A) xp(A)
p(A,B)=p(B\A)xp(A)
...we're almost there: how p{B\A) relates to p(B)? ■ p(B|A) = p(B) iff A and B are independent
Example: two coin tosses, weather today and weather on March 4th 1789;
Any two events for which p(B|A) = P(B)!
Pavel Rychlý • Probability • February 16, 2023
10/16
Chain Rule
p{Aí,A2,Al,A4,...,An) =
p(A1\A2,Al,A4, ...,An)x p{A2\Al,A4, ...,An)x
xp(Ai\A4, ...,An)x-- - x p{An_1\An) x p(An)
■ this is a direct consequence of the Bayes rule.
Pavel Rychlý • Probability • February 16,2023
11/16
The Golden Rule of Classic Statistical NLP
Interested in an event A given B (where it is not easy or practical or desirable) to estimate p(A\B):
take Bayes rule, max over all Bs:
p{B\A) x p{A)
argmaxAp(A\B) = argmaxA
piß)
argmaxA(p(B\A) x p (Ä))
...as p(B) is constant when changing As
Pavel Rychlý • Probability • February 16, 2023
12/16
Random Variables
■ is a function X : ft —>► 0
■ in general 0 = /?", typically /?
■ easier to handle real numbers than real-world events
■ random variable is discrete if 0 is countable (i.e. also if finite)
■ Example: die: natural "numbering" [1,6], coin: {0,1}
■ Probability distribution:
■ px{x) = p{X = x) =df p{Ax) where Ax = {a e ft : X(a) = x} m often just p{x) if it is clear from context what X is
Pavel Rychlý • Probability • February 16, 2023
13/16
Expectation
Joint and Conditional Distributions
is a mean of a random variable (weighted average)
■ E(X) = J2x€X(Q)XMX)
Example: one six-sided die: 3.5, two dice (sum): 7 Joint and Conditional distribution rules:
■ analogous to probability of events
Bayes:pX|y(x,y)
—notation Pxy(x\y) —even simpler notation
p{x\y) =
p(y\x).p(x)
p(y)
Pavel Rychlý • Probability • February 16, 2023
14/16
Standard Distributions
■ Binomial (discrete)
■ outcome: 0 or 1 (thus b/nomial)
■ make n trials
■ interested in the (probability of) numbers of successes r
■ Must be careful: it's not uniform!
H
■ Pb(r\n) =      (for equally Likely outcome)
■ (") counts how many possibilities there are for choosing r objects out of n\
n - r)\r\
Pavel Rychlý • Probability • February 16, 2023
15/16
Continuous Distributions
The normal distribution ("Gaussian")
-(x-aO2
Pnorm{x\ld, a) = exp
2a2
where:
■ fi is the mean (x-coordinate of the peak) (0)
■ a is the standard deviation (1)
other: hyperbolic, t
Pavel Rychlý • Probability • February 16, 2023
16/16