Probabilistic Classification
Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin

Probabilistic Classification – Idea

Imagine that I look out of a window and see a bird: it is black, approx. 25 cm long, and has a rather yellow beak. My daughter asks: What kind of bird is this? My usual answer: This is probably a kind of blackbird (kos černý in Czech).

Here "probably" means that out of my extensive catalogue of four kinds of birds that I am able to recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say that the largest proportion of birds with similar features I have ever seen were blackbirds.

The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory

A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
Experiment: roll one die once. Sample space: Ω = {1, ..., 6}.

Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
f(ω) ∈ [0, 1] for all ω ∈ Ω, and Σ_{ω∈Ω} f(ω) = 1.
If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, ..., 6}.

An event is any subset E of Ω. The probability of a given event E ⊆ Ω is defined as
P(E) = Σ_{ω∈E} f(ω).
Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.

Basic laws: P(Ω) = 1, P(∅) = 0; for disjoint events A, B we have P(A ∪ B) = P(A) + P(B); P(Ω \ A) = 1 − P(A).

Conditional Probability and Independence

P(A | B) is the probability of A given B (assuming P(B) > 0), defined by
P(A | B) = P(A ∩ B) / P(B).
(We assume that B is all and only the information known.)

A fair die: what is the probability that 3 is rolled given that an odd number is rolled? ... and given that an even number is rolled?

The law of total probability: let A be an event and B1, ..., Bn pairwise disjoint events such that Ω = B1 ∪ ... ∪ Bn. Then
P(A) = Σ_{i=1}^n P(A ∩ Bi) = Σ_{i=1}^n P(A | Bi) · P(Bi).

A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A and B are independent iff P(A | B) = P(A).

Random Variables

A random variable X is a function X : Ω → R.
A die: X : {1, ..., 6} → {0, 1} such that X(n) = n mod 2.

A probability mass function (pmf) of X is a function p defined by p(x) := P(X = x). Often P(X) is used to denote the pmf of X.

Random Vectors

A random vector is a function X : Ω → R^d. We write X = (X1, ..., Xd), where Xi is the random variable returning the i-th component of X.

A joint probability mass function of X is
p_X(x1, ..., xd) := P(X1 = x1 ∧ ... ∧ Xd = xd),
i.e., p_X gives the probability of every combination of values.

Often P(X1, ..., Xd) denotes the joint pmf of X1, ..., Xd. That is, P(X1, ..., Xd) stands for the probabilities P(X1 = x1 ∧ ... ∧ Xd = xd) for all possible combinations of x1, ..., xd.

The probability mass function p_Xi of each Xi is called the marginal probability mass function. We have
p_Xi(xi) = P(Xi = xi) = Σ_{(x1,...,x_{i−1},x_{i+1},...,x_d)} p_X(x1, ..., xd).
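To make the definitions above concrete, here is a minimal Python sketch (my illustration, not part of the lecture) that encodes the fair die, computes the probability of the event "an odd number is rolled", derives the pmf of the parity random variable X(n) = n mod 2, and evaluates the conditional probability asked about above.

```python
from fractions import Fraction
from collections import defaultdict

# Fair die: sample space Omega and outcome probabilities f(w).
omega = [1, 2, 3, 4, 5, 6]
f = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(E) = sum of f(w) over w in E."""
    return sum(f[w] for w in event)

# Event "an odd number is rolled": E = {1, 3, 5}.
E = {1, 3, 5}
print(prob(E))  # 1/2

# Random variable X(n) = n mod 2 and its pmf p(x) = P(X = x).
def X(n):
    return n % 2

pmf = defaultdict(Fraction)
for w in omega:
    pmf[X(w)] += f[w]
print(dict(pmf))  # parity 0 and parity 1 each get probability 1/2

# Conditional probability: P(3 is rolled | an odd number is rolled).
print(prob({3}) / prob(E))  # 1/3
```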
Random Vectors – Example

Let Ω be a space of colored geometric shapes that are divided into two categories (positive and negative). Assume a random vector X = (Xcolor, Xshape, Xcat) where
Xcolor : Ω → {red, blue}, Xshape : Ω → {circle, square}, Xcat : Ω → {pos, neg}.

The joint pmf is given by the following tables:

positive:   circle   square
red         0.2      0.02
blue        0.02     0.01

negative:   circle   square
red         0.05     0.3
blue        0.2      0.2

Random Vectors – Example

The probability of every event can be calculated by summing the appropriate entries:

P(red ∧ circle) = P(Xcolor = red ∧ Xshape = circle)
  = P(red ∧ circle ∧ positive) + P(red ∧ circle ∧ negative)
  = 0.2 + 0.05 = 0.25

P(red) = 0.2 + 0.02 + 0.05 + 0.3 = 0.57

Thus all conditional probabilities can also be computed:

P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.2 / 0.25 = 0.8

Conditional Probability Mass Functions

We often have to deal with a pmf of a random vector X conditioned on the values of a random vector Y, i.e., we are interested in P(X = x | Y = y) for all x and y.

We write P(X | Y) to denote the pmf of X conditioned on Y. Technically, P(X | Y) is a function which takes a possible value x of X and a possible value y of Y and returns P(X = x | Y = y).

This allows us to say, e.g., that two variables X1 and X2 are independent conditioned on Y by writing
P(X1, X2 | Y) = P(X1 | Y) · P(X2 | Y).
Technically this means that for all possible values x1 of X1, all possible values x2 of X2, and all possible values y of Y we have
P(X1 = x1 ∧ X2 = x2 | Y = y) = P(X1 = x1 | Y = y) · P(X2 = x2 | Y = y).

Bayesian Classification

Let Ω be a sample space (a universe) of all objects that can be classified. We assume a probability P on Ω. A training set will be a subset of Ω randomly sampled according to P.

Let Y be the random variable for the category, which takes values in {y1, ..., ym}. Let X be the random vector describing the n features of a given instance, i.e., X = (X1, ..., Xn). Denote by xk the possible values of X, and by xij the possible values of Xi.

Bayes classifier: given a vector of feature values xk,
C_Bayes(xk) := argmax_{i∈{1,...,m}} P(Y = yi | X = xk).
Intuitively, C_Bayes assigns xk to the most probable category it might be in.

Bayesian Classification – Example

Imagine a conveyor belt with apples and apricots. A machine is supposed to correctly distinguish apples from apricots based on their weight and diameter. That is, Y takes values in {apple, apricot} and X = (Xweight, Xdiam).

Assume that we are given a fruit that weighs 40 g and has a 5 cm diameter. The Bayes classifier compares
P(Y = apple | X = (40 g, 5 cm)) with P(Y = apricot | X = (40 g, 5 cm))
and selects the more probable category given the features.

Optimality of the Bayes Classifier

Let C be an arbitrary classifier, that is, a function that assigns to every xk a class out of {y1, ..., ym}. Slightly abusing notation, we use C also to denote the random variable which assigns a category to every instance. (Technically this is the composition C ∘ X of C and X, which first determines the features using X and then classifies according to C.)

Define the error of the classifier C by E_C = P(Y ≠ C).

Theorem
The Bayes classifier C_Bayes minimizes E_C, that is, E_{C_Bayes} = min over all classifiers C of E_C.

Optimality of the Bayes Classifier

E_C = Σ_{i=1}^m P(Y ≠ yi ∧ C = yi)
    = 1 − Σ_{i=1}^m P(Y = yi ∧ C = yi)
    = 1 − Σ_{i=1}^m Σ_{xk} P(Y = yi ∧ C = yi | X = xk) · P(X = xk)
    = 1 − Σ_{xk} P(X = xk) · Σ_{i=1}^m P(Y = yi ∧ C = yi | X = xk)
    = 1 − Σ_{xk} P(X = xk) · P(Y = C(xk) | X = xk)

(Here the last equality follows from the fact that C is determined by xk.)

Choosing C(xk) = C_Bayes(xk) = argmax_{i∈{1,...,m}} P(Y = yi | X = xk) maximizes P(Y = C(xk) | X = xk) and thus minimizes E_C.
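Since the toy example above specifies the full joint pmf, the Bayes classifier can be evaluated directly by brute force. The following Python sketch (my own illustration, not from the lecture) stores the two tables and classifies a feature vector by arg max of P(Y = yi | X = xk); for (red, circle) it recovers the value 0.8 computed earlier.

```python
# Joint pmf P(color, shape, category) taken from the two tables above.
joint = {
    ("red",  "circle", "pos"): 0.20, ("red",  "square", "pos"): 0.02,
    ("blue", "circle", "pos"): 0.02, ("blue", "square", "pos"): 0.01,
    ("red",  "circle", "neg"): 0.05, ("red",  "square", "neg"): 0.30,
    ("blue", "circle", "neg"): 0.20, ("blue", "square", "neg"): 0.20,
}
categories = ["pos", "neg"]

def posterior(color, shape, cat):
    """P(Y = cat | X = (color, shape)) computed from the joint pmf."""
    evidence = sum(joint[(color, shape, c)] for c in categories)  # P(X = xk)
    return joint[(color, shape, cat)] / evidence

def bayes_classifier(color, shape):
    """C_Bayes(xk) = argmax_y P(Y = y | X = xk)."""
    return max(categories, key=lambda c: posterior(color, shape, c))

print(posterior("red", "circle", "pos"))  # 0.8
print(bayes_classifier("red", "circle"))  # pos
print(bayes_classifier("red", "square"))  # neg
```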
Practical Use of the Bayes Classifier

The crucial problem: how to compute P(Y = yi | X = xk)?

Given no other assumptions, this requires a table giving the probability of each category for each possible vector of feature values, which is impossible to estimate accurately from a reasonably-sized training set.

Concretely, if Y, X1, ..., Xn are all binary, we need 2^n numbers to specify P(Y = 0 | X = xk), one for each possible xk. (Note that we do not need to specify P(Y = 1 | X = xk) = 1 − P(Y = 0 | X = xk).)

This is a bit better than the 2^(n+1) − 1 entries needed to specify the complete joint pmf P(Y, X1, ..., Xn). However, it is still too large for most classification problems.

Let's Look at It the Other Way Round

Theorem (Bayes, 1764)
P(A | B) = P(B | A) · P(A) / P(B)

Proof.
P(A | B) = P(A ∩ B) / P(B) = (P(A ∩ B) / P(A)) · (P(A) / P(B)) = P(B | A) · P(A) / P(B)

Bayesian Classification

Determine the category for xk by finding the yi maximizing
P(Y = yi | X = xk) = P(Y = yi) · P(X = xk | Y = yi) / P(X = xk)
  = P(Y = yi) · P(X = xk | Y = yi) / Σ_{j=1}^m P(X = xk | Y = yj) · P(Y = yj).
Here the second equality follows from the law of total probability.

So in order to build the classifier we need to compute:
the prior P(Y = yi) for every yi,
the conditionals P(X = xk | Y = yi) for every xk and yi.

Estimating the Prior and Conditionals

P(Y = yi) can easily be estimated from data: given a set of p training examples of which ni are in category yi, we set P(Y = yi) = ni / p.

If the dimension of the feature vector is small, P(X = xk | Y = yi) can be estimated from data similarly to P(Y = yi). Unfortunately, for higher-dimensional data too many examples are needed to estimate all P(X = xk | Y = yi) (there are too many xk's).

So where is the advantage of using the Bayes theorem? We introduce independence assumptions about the features!

Generative Probabilistic Models

Assume a simple (usually unrealistic) probabilistic method by which the data was generated. For classification, assume that each category yi has a different parametrized generative model for P(X = xk | Y = yi).

Maximum Likelihood Estimation (MLE): set the parameters to maximize the probability that the model produced the given training data.

More concretely: if Mλ denotes a model with parameter values λ, and Dk is the training data for the k-th category, find the model parameters λk for category k that maximize the likelihood of Dk:
λk = argmax_λ P(Dk | Mλ)

Generative Probabilistic Models – Simple Example

First, let us illustrate the generative model on a simple example. Consider two binary features
Xcolor : Ω → {red, blue}, Xshape : Ω → {circle, square}
and two classes {pos, neg}. There are 2^3 = 8 possible combinations of feature values and classes.

We assume that within each category the features are distributed independently:
P(Xcolor, Xshape | pos) = P(Xcolor | pos) · P(Xshape | pos)
P(Xcolor, Xshape | neg) = P(Xcolor | neg) · P(Xshape | neg)

So we have to estimate four numbers (parameters):
P(red | pos), P(circle | pos), P(red | neg), P(circle | neg).
(As opposed to six when we want to completely specify the joint conditional pmfs.)

Generative Probabilistic Models – Simple Example

Given p training examples, assume that p+ of them are positive, p− of them are negative, and that
in p+_red of the positive examples the color is red,
in p+_circle of the positive examples the shape is circle,
in p−_red of the negative examples the color is red,
in p−_circle of the negative examples the shape is circle.

Then the MLE estimate P̄ of P is
P̄(red | pos) = p+_red / p+,  P̄(circle | pos) = p+_circle / p+,
P̄(red | neg) = p−_red / p−,  P̄(circle | neg) = p−_circle / p−.

Now, e.g.,
P̄(red ∧ circle | neg) = P̄(red | neg) · P̄(circle | neg) = (p−_red / p−) · (p−_circle / p−).

Note that if in reality the features are dependent, then the joint pmf cannot be obtained by such an estimate!
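The four counting estimates above are straightforward to compute in code. Below is a minimal Python sketch (mine, not the lecture's); the training data and variable names are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter

# Hypothetical training data: (color, shape, class) triples.
data = [
    ("red", "circle", "pos"), ("red", "circle", "pos"), ("blue", "circle", "pos"),
    ("red", "square", "neg"), ("blue", "square", "neg"), ("red", "circle", "neg"),
]

def mle_estimates(examples):
    """MLE of P(red | class) and P(circle | class) by simple counting."""
    n = Counter(label for _, _, label in examples)  # p+ and p-
    n_red = Counter(label for color, _, label in examples if color == "red")
    n_circle = Counter(label for _, shape, label in examples if shape == "circle")
    return {label: {"red": n_red[label] / n[label],
                    "circle": n_circle[label] / n[label]} for label in n}

print(mle_estimates(data))
# i.e. P(red | pos) = 2/3, P(circle | pos) = 1, P(red | neg) = 2/3, P(circle | neg) = 1/3
```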
Naive Bayes

We assume that the features of an instance are (conditionally) independent given the category:
P(X | Y) = P(X1, ..., Xn | Y) = Π_{i=1}^n P(Xi | Y)

Therefore, we only need to specify P(Xi | Y), that is, P(Xi = xij | Y = yk) for each possible pair of a feature value xij and a class yk.

Note that if Y and all Xi are binary (values in {0, 1}), this requires specifying only 2n parameters:
P(Xi = 1 | Y = 1) and P(Xi = 1 | Y = 0) for each Xi,
since P(Xi = 0 | Y) = 1 − P(Xi = 1 | Y). Compare this with the 2^n parameters needed without any independence assumptions.

Naive Bayes – Example

                   positive   negative
P(Y)               0.5        0.5
P(small | Y)       0.4        0.4
P(medium | Y)      0.1        0.2
P(large | Y)       0.5        0.4
P(red | Y)         0.9        0.3
P(blue | Y)        0.05       0.3
P(green | Y)       0.05       0.4
P(square | Y)      0.05       0.4
P(triangle | Y)    0.05       0.3
P(circle | Y)      0.9        0.3

Is (medium, red, circle) positive?

                   positive   negative
P(Y)               0.5        0.5
P(medium | Y)      0.1        0.2
P(red | Y)         0.9        0.3
P(circle | Y)      0.9        0.3

Denote xk = (medium, red, circle).

P(pos | X = xk) = P(pos) · P(medium | pos) · P(red | pos) · P(circle | pos) / P(X = xk)
  = 0.5 · 0.1 · 0.9 · 0.9 / P(X = xk) = 0.0405 / P(X = xk)

P(neg | X = xk) = P(neg) · P(medium | neg) · P(red | neg) · P(circle | neg) / P(X = xk)
  = 0.5 · 0.2 · 0.3 · 0.3 / P(X = xk) = 0.009 / P(X = xk)

Apparently,
P(pos | X = xk) = 0.0405 / P(X = xk) > 0.009 / P(X = xk) = P(neg | X = xk),
so we classify xk as positive.
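The computation above is easy to mirror in code. Here is a minimal Python sketch (my illustration, not from the lecture) that stores the conditional probability table of the example and classifies a feature vector by comparing the unnormalized scores P(Y = y) · Π_i P(Xi = xi | Y = y); the common factor P(X = xk) is ignored because it does not affect the arg max.

```python
# Naive Bayes parameters from the example table above.
prior = {"pos": 0.5, "neg": 0.5}
cond = {
    "pos": {"small": 0.4, "medium": 0.1, "large": 0.5,
            "red": 0.9, "blue": 0.05, "green": 0.05,
            "square": 0.05, "triangle": 0.05, "circle": 0.9},
    "neg": {"small": 0.4, "medium": 0.2, "large": 0.4,
            "red": 0.3, "blue": 0.3, "green": 0.4,
            "square": 0.4, "triangle": 0.3, "circle": 0.3},
}

def nb_score(features, label):
    """Unnormalized posterior: P(Y = label) * prod_i P(Xi = xi | Y = label)."""
    score = prior[label]
    for value in features:
        score *= cond[label][value]
    return score

def nb_classify(features):
    """Return the class with the highest unnormalized posterior."""
    return max(prior, key=lambda label: nb_score(features, label))

xk = ("medium", "red", "circle")
print(nb_score(xk, "pos"), nb_score(xk, "neg"))  # approx. 0.0405 and 0.009
print(nb_classify(xk))                           # pos
```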
Estimating Probabilities (In General)

Normally, probabilities are estimated from observed frequencies in the training data (see the previous example). Let there be nk training examples in class yk, and let nijk of these nk examples have the value of Xi equal to xij. Then we put
P̄(Xi = xij | Y = yk) = nijk / nk.

A problem: if, by chance, a rare value xij of a feature Xi never occurs in the training data, we get
P̄(Xi = xij | Y = yk) = 0 for all k ∈ {1, ..., m}.
But then P̄(X = xk) = 0 for every xk containing the value xij for Xi, and thus P̄(Y = yk | X = xk) is not well defined. Moreover, P̄(Y = yk) · P̄(X = xk | Y = yk) = 0 (for all yk), so even this cannot be used for classification.

Probability Estimation Example

Training data:

Size    Color   Shape      Class
small   red     circle     pos
large   red     circle     pos
small   red     triangle   neg
large   blue    circle     neg

Learned probabilities:

                    positive   negative
P̄(Y)                0.5        0.5
P̄(small | Y)        0.5        0.5
P̄(medium | Y)       0          0
P̄(large | Y)        0.5        0.5
P̄(red | Y)          1          0.5
P̄(blue | Y)         0          0.5
P̄(green | Y)        0          0
P̄(square | Y)       0          0
P̄(triangle | Y)     0          0.5
P̄(circle | Y)       1          0.5

Note that P̄(medium ∧ red ∧ circle) = 0. So what is P̄(pos | medium ∧ red ∧ circle)?

Smoothing

To account for estimation from small samples, probability estimates are adjusted or smoothed. Laplace smoothing using an m-estimate works as if each feature value had additionally been observed with a prior probability p in a virtual sample of size m (here m and p are parameters of the smoothing; m plays the role of an equivalent sample size). We get
P̄(Xi = xij | Y = yk) = (nijk + m·p) / (nk + m).
(Recall that nk is the number of training examples of class yk, and nijk is the number of training examples of class yk for which the i-th feature Xi has the value xij.)

Laplace Smoothing Example

Assume the training set contains 10 positive examples:
4 small, 0 medium, 6 large.

Estimate the parameters as follows (with m = 1 and p = 1/3):
P̄(small | positive) = (4 + 1/3) / (10 + 1) ≈ 0.394
P̄(medium | positive) = (0 + 1/3) / (10 + 1) ≈ 0.03
P̄(large | positive) = (6 + 1/3) / (10 + 1) ≈ 0.576
(We get P̄(small ∨ medium ∨ large | positive) = 0.394 + 0.03 + 0.576 = 1.)

Continuous Features

Ω may be (potentially) continuous and Xi may take a continuum of values in R. The probabilities are then computed using a probability density p : R → R+ instead of a pmf. A random variable X : Ω → R has a density p : R → R+ if for every interval [a, b] we have
P(a ≤ X ≤ b) = ∫_a^b p(x) dx.

Usually, P(Xi | Y = yk) is used to denote the density of Xi conditioned on Y = yk. The densities P(Xi | Y = yk) are usually estimated using Gaussian densities as follows: estimate the mean µik and the standard deviation σik from the training data, and then put
P̄(Xi | Y = yk) = (1 / (σik √(2π))) · exp(−(Xi − µik)² / (2σik²)).

Comments on Naive Bayes

Tends to work well despite the rather strong assumption of conditional independence of the features. Experiments show it to be quite competitive with other classification methods. Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.

Directly constructs a hypothesis from parameter estimates that are calculated from the training data. Consistency with the training data is not guaranteed. Typically handles noise well. Missing values are easy to deal with (simply average over all missing values in the feature vectors).

Bayes Classifier vs MAP vs MLE

Recall that the Bayes classifier chooses the category as follows:
C_Bayes(xk) = argmax_{i∈{1,...,m}} P(Y = yi | X = xk) = argmax_{i∈{1,...,m}} P(Y = yi) · P(X = xk | Y = yi) / P(X = xk)

As P(X = xk) does not depend on yi, the Bayes classifier is equivalent to the maximum a posteriori probability (MAP) rule:
C_MAP(xk) = argmax_{i∈{1,...,m}} P(Y = yi) · P(X = xk | Y = yi)

If we do not care about the prior (or assume it is uniform), we may use the maximum likelihood estimate (MLE) rule:
C_MLE(xk) = argmax_{i∈{1,...,m}} P(X = xk | Y = yi)
(Intuitively, we maximize the probability that the data xk have been generated by the model of category yi.)

Bayesian Networks (Basic Information)

In Naive Bayes we have assumed that all features X1, ..., Xn are conditionally independent given the category. This is usually not realistic; e.g., the variables "rain" and "grass wet" are (usually) strongly dependent. What if we put some dependencies back, but now in a well-defined sense?

Bayesian networks are a graphical model that uses a directed acyclic graph to specify dependencies among variables.

Bayesian Networks – Example

(The network in the slide's figure has five binary variables C, S, W, B, A; the factorization below corresponds to the edges C → W, C → B, S → B, and B → A.) Now, e.g., the joint pmf factorizes as
P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B).

Now we may, e.g., infer the probability P(C = T | A = T) that we sit in a bad chair given that our back aches. We have to store only 10 numbers, as opposed to 2^5 − 1 if the whole joint pmf were stored explicitly.

Bayesian Networks – Learning & Naive Bayes

Many algorithms have been developed for learning:
the structure of the graph of the network,
the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.

Can you express the naive Bayes model for Y, X1, ..., Xn using a Bayesian network?
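To make the factorization concrete, here is a minimal Python sketch (not from the lecture) that stores the ten parameters of such a network and computes P(C = T | A = T) by brute-force enumeration of all 2^5 assignments. Only the structure comes from the slide; the concrete CPT numbers below are hypothetical, chosen purely for illustration.

```python
from itertools import product

T, F = True, False

# The ten stored parameters of the network (hypothetical numbers for illustration).
p_c = 0.30                                   # P(C = T)
p_s = 0.50                                   # P(S = T)
p_w = {T: 0.80, F: 0.10}                     # P(W = T | C)
p_b = {(T, T): 0.90, (T, F): 0.70,           # P(B = T | C, S)
       (F, T): 0.40, (F, F): 0.05}
p_a = {T: 0.85, F: 0.10}                     # P(A = T | B)

def bern(p_true, value):
    """P(X = value) for a binary variable with P(X = T) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(c, s, w, b, a):
    """P(C, S, W, B, A) = P(C) P(S) P(W | C) P(B | C, S) P(A | B)."""
    return (bern(p_c, c) * bern(p_s, s) * bern(p_w[c], w)
            * bern(p_b[(c, s)], b) * bern(p_a[b], a))

def query(target, evidence):
    """P(target | evidence) by summing the joint over all 2^5 assignments."""
    num = den = 0.0
    for c, s, w, b, a in product([T, F], repeat=5):
        world = {"C": c, "S": s, "W": w, "B": b, "A": a}
        if all(world[v] == val for v, val in evidence.items()):
            p = joint(c, s, w, b, a)
            den += p
            if all(world[v] == val for v, val in target.items()):
                num += p
    return num / den

# Probability that we sit in a bad chair given that our back aches.
print(query({"C": T}, {"A": T}))
```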