Probability
PA154 Jazykové modelování (1.2)
Pavel Rychlý
pary@fi.muni.cz
February 23, 2017
Source: Introduction to Natural Language Processing (600.465)
Jan Hajic, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

Experiments & Sample Spaces
■ Experiment, process, test, ...
■ Set of possible basic outcomes: sample space Ω (Czech: základní prostor obsahující možné výsledky)
 ► coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
 ► yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
 ► lottery (|Ω| large, on the order of many millions of tickets)
 ► # of traffic accidents somewhere per year (Ω = N)
 ► spelling errors (Ω = Σ*), where Σ is an alphabet and Σ* is the set of all possible strings over that alphabet
 ► missing word (|Ω| = vocabulary size)

Events
■ Event (Czech: jev) A is a set of basic outcomes
■ Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space, Czech: jevové pole)
 ► Ω is the certain event (jistý jev), ∅ is the impossible event (nemožný jev)
■ Example:
 ► experiment: three coin tosses
 ► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
 ► count the cases with exactly two tails: then A = {HTT, THT, TTH}
 ► all heads: A = {HHH}

Probability
■ Repeat the experiment many times, record how many times a given event A occurred ("count" c_1).
■ Do this whole series many times; remember all the c_i's.
■ Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
■ Call this constant the probability of A. Notation: p(A)

Estimating Probability
■ Remember: ... close to an unknown constant.
■ We can only estimate it:
 ► from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): p(A) = c_1/T_1
 ► otherwise, take the weighted average of all c_i/T_i (or, if the data allows, simply treat the set of series as a single long series). This is the best estimate.
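The frequency-ratio view of probability above can be illustrated with a minimal Python sketch (the simulation setup is my own, not from the slides): run the three-coin-toss experiment many times and watch the ratio c/T approach a constant.

```python
import random

def estimate_two_tails(num_runs, rng):
    # Run the three-coin-toss experiment num_runs times and count
    # the outcomes that fall in A = {exactly two tails}; the ratio
    # count/num_runs is the frequency estimate of p(A).
    count = 0
    for _ in range(num_runs):
        tosses = [rng.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:
            count += 1
    return count / num_runs

rng = random.Random(0)          # fixed seed for reproducibility
estimate = estimate_two_tails(100_000, rng)
print(estimate)                 # close to the true value 3/8 = 0.375
```

With more runs the estimate fluctuates less around 3/8, which is exactly the "close to some constant value" observation the slide describes.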
Example
■ Recall our example:
 ► experiment: three coin tosses
 ► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
 ► count the cases with exactly two tails: A = {HTT, THT, TTH}
■ Run the experiment 1000 times (i.e. 3000 tosses)
■ Counted: 386 cases with two tails (HTT, THT or TTH)
■ estimate: p(A) = 386/1000 = .386
■ Run again: 373, 399, 382, 355, 372, 406, 359
 ► p(A) = .379 (weighted average), or simply 3032/8000
■ Uniform distribution assumption: p(A) = 3/8 = .375

Basic Properties
■ Basic properties:
 ► p: 2^Ω → [0, 1]
 ► p(Ω) = 1
 ► Disjoint events: p(∪_i A_i) = Σ_i p(A_i)
■ NB: axiomatic definition of probability: take the above three conditions as axioms
■ Immediate consequences:
 ► p(∅) = 0
 ► p(Ā) = 1 − p(A)
 ► A ⊆ B ⇒ p(A) ≤ p(B)
 ► Σ_{a∈Ω} p(a) = 1

Joint and Conditional Probability
■ p(A, B) = p(A ∩ B)
■ p(A|B) = p(A, B) / p(B)
■ Estimating from counts:
 ► p(A|B) = p(A, B) / p(B) = (c(A ∩ B)/T) / (c(B)/T) = c(A ∩ B) / c(B)
[figure: Venn diagram of two overlapping events A and B]

Bayes Rule
■ p(A, B) = p(B, A), since p(A ∩ B) = p(B ∩ A)
 ► therefore p(A|B) p(B) = p(B|A) p(A), and therefore:
 ► p(A|B) = p(B|A) × p(A) / p(B)

Independence
■ Can we compute p(A, B) from p(A) and p(B)?
■ Recall from the previous foil:
 ► p(A|B) = p(B|A) × p(A) / p(B)
 ► p(A|B) × p(B) = p(B|A) × p(A)
 ► p(A, B) = p(B|A) × p(A)
■ ... we're almost there: how does p(B|A) relate to p(B)?
 ► p(B|A) = p(B) iff A and B are independent
■ Example: two coin tosses; the weather today and the weather on March 4th, 1789
■ Any two events for which p(B|A) = p(B)!

Chain Rule
■ p(A_1, A_2, A_3, A_4, ..., A_n) = p(A_1|A_2, A_3, A_4, ..., A_n) × p(A_2|A_3, A_4, ..., A_n) × p(A_3|A_4, ..., A_n) × ... × p(A_{n−1}|A_n) × p(A_n)
■ this is a direct consequence of the Bayes rule.
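The conditional-probability and Bayes-rule identities above can be checked exhaustively on the three-toss sample space. A small Python sketch (the choice of event B, "first toss is a tail", is my own illustration, not from the slides):

```python
from itertools import product
from fractions import Fraction

# Full sample space of three coin tosses, under the uniform assumption.
omega = ["".join(t) for t in product("HT", repeat=3)]

def p(event):
    # Probability of an event (a set of basic outcomes): |event| / |omega|.
    return Fraction(len(event), len(omega))

A = {o for o in omega if o.count("T") == 2}   # exactly two tails
B = {o for o in omega if o[0] == "T"}         # first toss is a tail (illustrative)

# Conditional probability from the definition: p(A|B) = p(A ∩ B) / p(B).
p_A_given_B = p(A & B) / p(B)

# Bayes rule: p(A|B) = p(B|A) * p(A) / p(B).
p_B_given_A = p(B & A) / p(A)
bayes = p_B_given_A * p(A) / p(B)
assert p_A_given_B == bayes                   # the two routes agree exactly

print(p_A_given_B)                            # 1/2
```

Using `Fraction` keeps every ratio exact, so the equality between the definitional route and the Bayes-rule route holds with no floating-point tolerance.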
The Golden Rule of Classic Statistical NLP
■ Interested in an event A given B (where it is not easy, practical or desirable to estimate p(A|B) directly); take the Bayes rule, and maximize over all As:
 ► argmax_A p(A|B) = argmax_A p(B|A) × p(A) / p(B) = argmax_A (p(B|A) × p(A))
 ► ... as p(B) is constant when changing As

Random Variables
■ a random variable is a function X : Ω → Q
 ► in general Q = R^n, typically R
 ► easier to handle real numbers than real-world events
■ a random variable is discrete if Q is countable (i.e. also if finite)
■ Example: die: natural "numbering" {1, ..., 6}; coin: {0, 1}
■ Probability distribution:
 ► p_X(x) = p(X = x) =df p(A_x), where A_x = {a ∈ Ω : X(a) = x}
 ► often just p(x), if it is clear from context what X is

Expectation; Joint and Conditional Distributions
■ the expectation is the mean of a random variable (weighted average)
 ► E(X) = Σ_{x∈X(Ω)} x · p_X(x)
■ Example: one six-sided die: 3.5; two dice (sum): 7
■ Joint and conditional distribution rules:
 ► analogous to probabilities of events
■ Bayes: p_{X|Y}(x|y) = p_{Y|X}(y|x) · p_X(x) / p_Y(y), or in even simpler notation: p(x|y) = p(y|x) · p(x) / p(y)
■ Chain rule: p(w, x, y, z) = p(z) · p(y|z) · p(x|y, z) · p(w|x, y, z)

Standard Distributions
■ Binomial (discrete)
 ► outcome: 0 or 1 (thus binomial)
 ► make n trials
 ► interested in the (probability of the) number of successes r
■ Must be careful: it's not uniform!
 ► p_b(r|n) = C(n, r) / 2^n (for equally likely outcomes)
 ► C(n, r) counts how many possibilities there are for choosing r objects out of n: C(n, r) = n! / ((n − r)! r!)

Continuous Distributions
■ The normal distribution ("Gaussian"):
 ► p_norm(x|μ, σ) = (1 / (σ √(2π))) · exp(−(x − μ)^2 / (2σ^2))
■ where:
 ► μ is the mean (the x-coordinate of the peak) (default 0)
 ► σ is the standard deviation (default 1)
■ other: hyperbolic, t
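The binomial formula and the expectation rule above can be tied back to the coin-toss example with a short Python sketch (function names are mine, not from the slides):

```python
from math import comb

def p_binomial_fair(r, n):
    # Probability of exactly r successes in n fair (p = 1/2) trials:
    # p_b(r|n) = C(n, r) / 2^n, matching the slide's formula for
    # equally likely outcomes.
    return comb(n, r) / 2**n

# Two tails in three tosses: C(3, 2) / 2^3 = 3/8, as on the Example slide.
assert p_binomial_fair(2, 3) == 3 / 8

# The pmf sums to 1 over all possible success counts.
assert sum(p_binomial_fair(r, 10) for r in range(11)) == 1.0

# Expectation E(X) = sum_x x * p_X(x): one fair six-sided die gives 3.5.
die_expectation = sum(x for x in range(1, 7)) / 6
print(die_expectation)  # 3.5
```

Note how p_b(2|3) recovers the uniform-assumption value 3/8 = .375 from the earlier slide, and the die expectation matches the 3.5 quoted under Expectation.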