Probabilistic Classification
Based on the ML lecture by Raymond J. Mooney, University of Texas at Austin

Probabilistic Classification – Idea
Imagine that I look out of a window and see a bird: it is black, approx. 25 cm long, and has a rather yellow beak.
My daughter asks: What kind of bird is this?
My usual answer: This is probably a kind of blackbird (kos černý in Czech).
Here "probably" means that out of my extensive catalogue of four kinds of birds that I am able to recognize, "blackbird" gets the highest degree of belief based on the features of this particular bird. Frequentists might say that the largest proportion of birds with similar features I have ever seen were blackbirds.
The degree of belief (Bayesians), or the relative frequency (frequentists), is the probability.

Basic Discrete Probability Theory
A finite or countably infinite set Ω of possible outcomes; Ω is called the sample space.
Experiment: Roll one die once. Sample space: Ω = {1, . . . , 6}
Each element ω of Ω is assigned a "probability" value f(ω); here f must satisfy
    f(ω) ∈ [0, 1] for all ω ∈ Ω,
    Σ_{ω∈Ω} f(ω) = 1.
If the die is fair, then f(ω) = 1/6 for all ω ∈ {1, . . . , 6}.
An event is any subset E of Ω. The probability of an event E ⊆ Ω is defined as
    P(E) = Σ_{ω∈E} f(ω)
Let E be the event that an odd number is rolled, i.e., E = {1, 3, 5}. Then P(E) = 1/2.
Basic laws: P(Ω) = 1, P(∅) = 0; for disjoint events A, B we have P(A ∪ B) = P(A) + P(B); and P(Ω \ A) = 1 − P(A).

Conditional Probability and Independence
P(A | B) is the probability of A given B (assume P(B) > 0), defined by
    P(A | B) = P(A ∩ B) / P(B)
(We assume that B is all and only the information known.)
A fair die: what is the probability that 3 is rolled assuming that an odd number is rolled? ... and assuming that an even number is rolled?
The law of total probability: Let A be an event and B1, . . . , Bn pairwise disjoint events such that Ω = ∪_{i=1}^{n} Bi. Then
    P(A) = Σ_{i=1}^{n} P(A ∩ Bi) = Σ_{i=1}^{n} P(A | Bi) · P(Bi)
A and B are independent if P(A ∩ B) = P(A) · P(B). It is easy to show that if P(B) > 0, then A, B are independent iff P(A | B) = P(A).

Random Variables
A random variable X is a function X : Ω → R.
A die: X : {1, . . . , 6} → {0, 1} such that X(n) = n mod 2.
A probability mass function (pmf) of X is a function p defined by
    p(x) := P(X = x)
Often P(X) is used to denote the pmf of X.

Random Vectors
A random vector is a function X : Ω → R^d. We write X = (X1, . . . , Xd), where Xi is the random variable returning the i-th component of X.
A joint probability mass function of X is
    p_X(x1, . . . , xd) := P(X1 = x1 ∧ · · · ∧ Xd = xd).
I.e., p_X gives the probability of every combination of values. Often, P(X1, . . . , Xd) denotes the joint pmf of X1, . . . , Xd. That is, P(X1, . . . , Xd) stands for the probabilities P(X1 = x1 ∧ · · · ∧ Xd = xd) for all possible combinations of x1, . . . , xd.
The probability mass function p_{Xi} of each Xi is called the marginal probability mass function. We have
    p_{Xi}(xi) = P(Xi = xi) = Σ_{(x1,...,xi−1,xi+1,...,xd)} p_X(x1, . . . , xd)
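The die questions above (the chance of rolling a 3 given an odd roll, resp. an even roll) can be answered with a few lines of code. This is an illustrative sketch, not part of the lecture; the event sets and the helper names P and P_cond are ad hoc choices.

```python
from fractions import Fraction

# Fair die: every outcome has probability 1/6.
f = {w: Fraction(1, 6) for w in range(1, 7)}

def P(event):
    """Probability of an event (a set of outcomes) under the pmf f."""
    return sum(f[w] for w in event)

def P_cond(a, b):
    """Conditional probability P(A | B) = P(A ∩ B) / P(B), assuming P(B) > 0."""
    return P(a & b) / P(b)

odd, even, three = {1, 3, 5}, {2, 4, 6}, {3}

print(P_cond(three, odd))    # 1/3 -- probability of a 3 given an odd roll
print(P_cond(three, even))   # 0   -- a 3 cannot occur if the roll is even

# Law of total probability over the partition {odd, even}.
assert P(three) == P_cond(three, odd) * P(odd) + P_cond(three, even) * P(even)
```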
Random Vectors – Example
Let Ω be a space of colored geometric shapes that are divided into two categories (positive and negative). Assume a random vector X = (Xcolor, Xshape, Xcat) where
    Xcolor : Ω → {red, blue}, Xshape : Ω → {circle, square}, Xcat : Ω → {pos, neg}.
The joint pmf is given by the following tables:

    positive:   circle   square          negative:   circle   square
    red         0.2      0.02            red         0.05     0.3
    blue        0.02     0.01            blue        0.2      0.2

The probability of every possible event can be calculated by summing the appropriate probabilities.
    P(red ∧ circle) = P(Xcolor = red ∧ Xshape = circle)
                    = P(red ∧ circle ∧ positive) + P(red ∧ circle ∧ negative)
                    = 0.2 + 0.05 = 0.25
    P(red) = 0.2 + 0.02 + 0.05 + 0.3 = 0.57
Thus all conditional probabilities can also be computed:
    P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.2 / 0.25 = 0.8

Conditional Probability Mass Functions
We often have to deal with a pmf of a random vector X conditioned on the values of a random vector Y. I.e., we are interested in P(X = x | Y = y) for all x and y.
We write P(X | Y) to denote the pmf of X conditioned on Y. Technically, P(X | Y) is a function which takes a possible value x of X and a possible value y of Y and returns P(X = x | Y = y).
This allows us to say, e.g., that two variables X1 and X2 are independent conditioned on Y by
    P(X1, X2 | Y) = P(X1 | Y) · P(X2 | Y)
Technically this means that for all possible values x1 of X1, all possible values x2 of X2, and all possible values y of Y we have
    P(X1 = x1 ∧ X2 = x2 | Y = y) = P(X1 = x1 | Y = y) · P(X2 = x2 | Y = y)
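For concreteness, the three numbers above (0.25, 0.57, 0.8) can be recomputed mechanically from the joint table. This is an illustrative Python sketch, not part of the lecture; the dictionary layout and the helper P are ad hoc choices.

```python
# Joint pmf from the tables above, keyed by (color, shape, category).
joint = {
    ("red",  "circle", "pos"): 0.2,  ("red",  "square", "pos"): 0.02,
    ("blue", "circle", "pos"): 0.02, ("blue", "square", "pos"): 0.01,
    ("red",  "circle", "neg"): 0.05, ("red",  "square", "neg"): 0.3,
    ("blue", "circle", "neg"): 0.2,  ("blue", "square", "neg"): 0.2,
}

def P(predicate):
    """Probability of the event described by a predicate over (color, shape, category)."""
    return sum(p for outcome, p in joint.items() if predicate(*outcome))

p_red_circle = P(lambda c, s, y: c == "red" and s == "circle")   # 0.25
p_red        = P(lambda c, s, y: c == "red")                     # 0.57
p_pos_given  = P(lambda c, s, y: (c, s, y) == ("red", "circle", "pos")) / p_red_circle
print(p_red_circle, p_red, p_pos_given)   # 0.25 0.57 0.8 (up to float rounding)
```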
Bayesian Classification
Let Ω be a sample space (a universe) of all objects that can be classified. We assume a probability P on Ω. A training set will be a subset of Ω randomly sampled according to P.
Let Y be the random variable for the category, which takes values in {y1, . . . , ym}.
Let X be the random vector describing the n features of a given instance, i.e., X = (X1, . . . , Xn). Denote by xk the possible values of X, and by xij the possible values of Xi.
Bayes classifier: Given a vector of feature values xk,
    CBayes(xk) := y_j where j = arg max_{i∈{1,...,m}} P(Y = yi | X = xk)
Intuitively, CBayes assigns xk to the most probable category it might be in.

Bayesian Classification – Example
Imagine a conveyor belt with apples and apricots. A machine is supposed to correctly distinguish apples from apricots based on their weight and diameter.
That is, Y takes values in {apple, apricot} and X = (Xweight, Xdiam).
Assume that we are given a fruit that weighs 40 g and has a 5 cm diameter. The Bayes classifier compares P(Y = apple | X = (40 g, 5 cm)) with P(Y = apricot | X = (40 g, 5 cm)) and selects the more probable category given the features.
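The decision rule above is just an arg max over categories. Below is a minimal sketch (not from the lecture) in which `posterior` is a hypothetical stand-in for however P(Y = yi | X = xk) is obtained, and the numbers in the toy usage are invented.

```python
from typing import Callable, Sequence

def bayes_classify(x, classes: Sequence[str],
                   posterior: Callable[[str, object], float]) -> str:
    """Return the class y maximizing the posterior P(Y = y | X = x)."""
    return max(classes, key=lambda y: posterior(y, x))

# Toy usage for the fruit example; the posterior values are purely illustrative.
toy_posterior = {
    ("apple",   (40, 5)): 0.3,
    ("apricot", (40, 5)): 0.7,
}
print(bayes_classify((40, 5), ["apple", "apricot"],
                     lambda y, x: toy_posterior[(y, x)]))   # -> "apricot"
```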
Optimality of the Bayes Classifier
Let C be an arbitrary classifier, that is, a function that assigns to every xk a class out of {y1, . . . , ym}.
Slightly abusing notation, we use C to also denote the random variable which assigns a category to every instance. (Technically this is the composition C ◦ X of C and X, which first determines the features using X and then classifies according to C.)
Define the error of the classifier C by
    EC = P(Y ≠ C)

Theorem
The Bayes classifier CBayes minimizes EC, that is,
    E_{CBayes} = min over all classifiers C of EC

To see this, compute:
    EC = Σ_{i=1}^{m} P(Y = yi ∧ C ≠ yi)
       = 1 − Σ_{i=1}^{m} P(Y = yi ∧ C = yi)
       = 1 − Σ_{i=1}^{m} Σ_{xk} P(Y = yi ∧ C = yi | X = xk) · P(X = xk)
       = 1 − Σ_{xk} P(X = xk) Σ_{i=1}^{m} P(Y = yi ∧ C = yi | X = xk)
       = 1 − Σ_{xk} P(X = xk) · P(Y = C(xk) | X = xk)
(Here the last equality follows from the fact that C is determined by xk.)
Choosing C(xk) = CBayes(xk) = y_j where j = arg max_{i∈{1,...,m}} P(Y = yi | X = xk) maximizes P(Y = C(xk) | X = xk) for every xk and thus minimizes EC.

Practical Use of Bayes Classifier
The crucial problem: How to compute P(Y = yi | X = xk)?
Given no other assumptions, this requires a table giving the probability of each category for each possible vector of feature values, which is impossible to estimate accurately from a reasonably sized training set.
Concretely, if Y, X1, . . . , Xn are all binary, we need 2^n numbers to specify P(Y = 0 | X = xk) for each possible xk. (Note that we do not need to specify P(Y = 1 | X = xk) = 1 − P(Y = 0 | X = xk).)
This is a bit better than the 2^{n+1} − 1 entries needed to specify the complete joint pmf P(Y, X1, . . . , Xn). However, it is still too large for most classification problems.
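Returning to the optimality theorem: for a small discrete problem it can be verified by brute force. The sketch below (not from the lecture) enumerates every possible classifier over a made-up joint distribution of two binary features and two classes and checks that none has a smaller error than the Bayes classifier.

```python
from itertools import product

# Made-up joint distribution P(X = x, Y = y) over two binary features and two classes.
joint = {
    ((0, 0), "a"): 0.20, ((0, 0), "b"): 0.05,
    ((0, 1), "a"): 0.10, ((0, 1), "b"): 0.15,
    ((1, 0), "a"): 0.05, ((1, 0), "b"): 0.20,
    ((1, 1), "a"): 0.15, ((1, 1), "b"): 0.10,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

def error(classifier):
    """E_C = P(Y != C(X)) under the joint distribution above."""
    return sum(p for (x, y), p in joint.items() if classifier[x] != y)

# Bayes classifier: for each x pick the class maximizing P(Y = y | X = x),
# i.e. (equivalently) the joint probability P(X = x, Y = y).
bayes = {x: max(ys, key=lambda y: joint[(x, y)]) for x in xs}

# Enumerate all |ys|^|xs| classifiers and check that none beats the Bayes classifier.
all_errors = [error(dict(zip(xs, choice))) for choice in product(ys, repeat=len(xs))]
assert min(all_errors) >= error(bayes) - 1e-12
print(round(error(bayes), 2), round(min(all_errors), 2))   # 0.3 0.3
```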
Let's Look at It the Other Way Round
Theorem (Bayes, 1764)
    P(A | B) = P(B | A) · P(A) / P(B)
Proof. Assuming P(A) > 0 and P(B) > 0,
    P(A | B) = P(A ∩ B) / P(B) = [P(A ∩ B) / P(A)] · [P(A) / P(B)] = P(B | A) · P(A) / P(B)

Bayesian Classification
Determine the category for xk by finding yi maximizing
    P(Y = yi | X = xk) = P(Y = yi) · P(X = xk | Y = yi) / P(X = xk)
So in order to build the classifier we need to compute:
    the prior P(Y = yi) for every yi,
    the conditionals P(X = xk | Y = yi) for every xk and yi.

Estimating the Prior and Conditionals
P(Y = yi) can easily be estimated from data: given a set of p training examples of which ni are in category yi, we set P(Y = yi) = ni / p.
If the dimension of the features is small, P(X = xk | Y = yi) can be estimated from data similarly to P(Y = yi). Unfortunately, for higher-dimensional data too many examples are needed to estimate all P(X = xk | Y = yi) (there are too many xk's). So where is the advantage of using the Bayes theorem?
We introduce independence assumptions about the features!

Naive Bayes
We assume that the features of an instance are (conditionally) independent given the category:
    P(X | Y) = P(X1, . . . , Xn | Y) = Π_{i=1}^{n} P(Xi | Y)
Therefore, we only need to specify P(Xi | Y), that is, P(Xi = xij | Y = yk) for each possible pair of a feature value xij and a class yk.
Note that if Y and all Xi are binary (values in {0, 1}), this requires specifying only 2n parameters: P(Xi = 1 | Y = 1) and P(Xi = 1 | Y = 0) for each Xi, since P(Xi = 0 | Y) = 1 − P(Xi = 1 | Y). Compare this to the 2^n parameters needed without any independence assumptions.

Naive Bayes – Example
                      positive   negative
    P(Y)              0.5        0.5
    P(small | Y)      0.4        0.4
    P(medium | Y)     0.1        0.2
    P(large | Y)      0.5        0.4
    P(red | Y)        0.9        0.3
    P(blue | Y)       0.05       0.3
    P(green | Y)      0.05       0.4
    P(square | Y)     0.05       0.4
    P(triangle | Y)   0.05       0.3
    P(circle | Y)     0.9        0.3
Is (medium, red, circle) positive?
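The naive Bayes decision rule itself is only a few lines of code. The following sketch is not from the lecture; the table layout (a prior per class plus one dictionary per feature) and the function name are ad hoc choices, and it works in log space rather than multiplying the probabilities directly.

```python
import math

def naive_bayes_classify(x, prior, cond):
    """Pick the class y maximizing P(Y = y) * prod_i P(X_i = x_i | Y = y).

    prior: dict  class -> P(Y = class)
    cond:  list of dicts, one per feature, mapping (value, class) -> P(X_i = value | Y = class)
    Assumes all probabilities are strictly positive (see the smoothing slides below).
    """
    def score(y):
        # Sum of logs instead of a product of many small numbers, to avoid underflow.
        return math.log(prior[y]) + sum(math.log(c[(xi, y)]) for xi, c in zip(x, cond))
    return max(prior, key=score)
```

Working in log space leaves the arg max unchanged; with the table above plugged in, this picks "positive" for (medium, red, circle), matching the computation on the next slide.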
Denote xk = (medium, red, circle). The relevant rows of the table are:
                      positive   negative
    P(Y)              0.5        0.5
    P(medium | Y)     0.1        0.2
    P(red | Y)        0.9        0.3
    P(circle | Y)     0.9        0.3
Then
    P(pos | X = xk) = P(pos) · P(medium | pos) · P(red | pos) · P(circle | pos) / P(X = xk)
                    = 0.5 · 0.1 · 0.9 · 0.9 / P(X = xk) = 0.0405 / P(X = xk)
    P(neg | X = xk) = P(neg) · P(medium | neg) · P(red | neg) · P(circle | neg) / P(X = xk)
                    = 0.5 · 0.2 · 0.3 · 0.3 / P(X = xk) = 0.009 / P(X = xk)
Apparently,
    P(pos | X = xk) = 0.0405 / P(X = xk) > 0.009 / P(X = xk) = P(neg | X = xk)
So we classify xk as positive.

Estimating Probabilities (In General)
Normally, probabilities are estimated from observed frequencies in the training data (see the previous example).
Let us have nk training examples in class yk, of which nijk have the value of Xi equal to xij. Then we put
    ¯P(Xi = xij | Y = yk) = nijk / nk.
A problem: If, by chance, a rare value xij of a feature Xi never occurs in the training data, we get ¯P(Xi = xij | Y = yk) = 0 for all k ∈ {1, . . . , m}. But then ¯P(X = xk) = 0 for any xk containing the value xij for Xi, and thus ¯P(Y = yk | X = xk) is not well defined. Moreover, ¯P(Y = yk) · ¯P(X = xk | Y = yk) = 0 (for all yk), so even this cannot be used for classification.

Probability Estimation Example
Training data:
    Size    Color   Shape      Class
    small   red     circle     pos
    large   red     circle     pos
    small   red     triangle   neg
    large   blue    circle     neg
Learned probabilities:
                       positive   negative
    ¯P(Y)              0.5        0.5
    ¯P(small | Y)      0.5        0.5
    ¯P(medium | Y)     0          0
    ¯P(large | Y)      0.5        0.5
    ¯P(red | Y)        1          0.5
    ¯P(blue | Y)       0          0.5
    ¯P(green | Y)      0          0
    ¯P(square | Y)     0          0
    ¯P(triangle | Y)   0          0.5
    ¯P(circle | Y)     1          0.5
Note that ¯P(medium ∧ red ∧ circle) = 0. So what is ¯P(pos | medium ∧ red ∧ circle)?

Smoothing
To account for estimation from small samples, the probability estimates are adjusted, or smoothed.
Laplace smoothing using an m-estimate works as if each feature value had been observed with a prior probability p in a virtual sample of size m (recall that m is the number of classes). We get
    ¯P(Xi = xij | Y = yk) = (nijk + m·p) / (nk + m)
(Recall that nk is the number of training examples of class yk, and nijk is the number of training examples of class yk for which the i-th feature Xi has the value xij.)
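The m-estimate above transcribes directly into code. This is a minimal sketch (not from the lecture); the function name and the counts in the usage line are illustrative, taking 2 examples of a class and p = 1/3 for a three-valued feature.

```python
def m_estimate(n_ijk: int, n_k: int, m: int, p: float) -> float:
    """Smoothed estimate of P(X_i = x_ij | Y = y_k).

    n_ijk: number of class-y_k examples with feature value x_ij
    n_k:   number of class-y_k examples
    m:     virtual sample size (the number of classes on these slides)
    p:     prior probability assigned to the feature value
    """
    return (n_ijk + m * p) / (n_k + m)

# Even with a zero count the estimate stays positive, so products no longer collapse to 0:
print(m_estimate(0, 2, m=2, p=1/3))   # (0 + 2/3) / (2 + 2) ≈ 0.167
```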
Laplace Smoothing Example
Assume the training set contains 10 positive examples:
    4 small, 0 medium, 6 large
Estimate the parameters as follows (m = 2 and p = 1/3):
    ¯P(small | positive)  = (4 + 2/3) / (10 + 2) = 0.389
    ¯P(medium | positive) = (0 + 2/3) / (10 + 2) = 0.056
    ¯P(large | positive)  = (6 + 2/3) / (10 + 2) = 0.556

Continuous Features
Ω may be (potentially) continuous, and Xi may take a continuum of values in R.
The probabilities are then computed using a probability density p : R → R+ instead of a pmf. A random variable X : Ω → R has a density p : R → R+ if for every interval [a, b] we have
    P(a ≤ X ≤ b) = ∫_a^b p(x) dx
Usually, P(Xi | Y = yk) is used to denote the density of Xi conditioned on Y = yk.
The densities P(Xi | Y = yk) are usually estimated using Gaussian densities as follows: estimate the mean µik and the standard deviation σik from the training data, then put
    ¯P(Xi | Y = yk) = (1 / (σik √(2π))) · exp(−(Xi − µik)² / (2σik²))

Comments on Naive Bayes
Tends to work well despite the rather strong assumption of conditional independence of features.
Experiments show it to be quite competitive with other classification methods. Even if the probabilities are not accurately estimated, it often picks the correct maximum-probability category.
Directly constructs a hypothesis from parameter estimates that are calculated from the training data.
Typically handles noise well.
Missing values are easy to deal with (simply average over all missing values in feature vectors).
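The Gaussian class-conditional estimate above can be sketched as follows (not part of the lecture). The sample weights in the usage line are invented, and σ is the maximum-likelihood estimate (dividing by the sample size).

```python
import math

def gaussian_conditional(values):
    """Fit a Gaussian to the values of one feature within one class and
    return the estimated density x -> P(X_i = x | Y = y_k)."""
    mu = sum(values) / len(values)
    # Maximum-likelihood (biased) estimate of the standard deviation.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    def density(x):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return density

# Illustrative weights (in grams) of apples from the earlier conveyor-belt example:
p_weight_given_apple = gaussian_conditional([150.0, 170.0, 160.0, 180.0])
print(p_weight_given_apple(165.0))   # a density value, not a probability
```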
Bayes Classifier vs MAP vs MLE
Recall that the Bayes classifier chooses the category as follows:
    CBayes(xk) = arg max_{yi} P(Y = yi | X = xk)
               = arg max_{yi} P(Y = yi) · P(X = xk | Y = yi) / P(X = xk)
As the denominator P(X = xk) does not depend on yi, the Bayes classifier is equivalent to the Maximum A Posteriori Probability (MAP) rule:
    CMAP(xk) = arg max_{yi} P(Y = yi) · P(X = xk | Y = yi)
If we do not care about the prior (or assume it is uniform), we may use the Maximum Likelihood Estimate (MLE) rule:
    CMLE(xk) = arg max_{yi} P(X = xk | Y = yi)
(Intuitively, we maximize the probability that the data xk have been generated in the category yi.)

Generative Probabilistic Models
Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
For classification, assume that each category yi has a different parametrized generative model for P(X = xk | Y = yi).
Maximum Likelihood Estimation (MLE): Set the parameters to maximize the probability that the model produced the given training data.
More concretely: if Mλ denotes a model with parameter values λ, and Dk is the training data for the k-th category, find the model parameters λk for category k that maximize the likelihood of Dk:
    λk = arg max_λ P(Dk | Mλ)

Bayesian Networks (Basic Information)
In Naive Bayes we have assumed that all features X1, . . . , Xn are independent (given the category).
This is usually not realistic. E.g., the variables "rain" and "grass wet" are (usually) strongly dependent.
What if we put some dependencies back? (But now in a well-defined sense.)
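A minimal sketch contrasting the MAP and MLE rules (the function names, the prior, and the likelihood values below are illustrative assumptions, not from the lecture): with a sufficiently skewed prior the two rules can disagree.

```python
def map_rule(x, prior, likelihood, classes):
    """MAP: arg max_y P(Y = y) * P(X = x | Y = y)."""
    return max(classes, key=lambda y: prior[y] * likelihood(x, y))

def mle_rule(x, likelihood, classes):
    """MLE: arg max_y P(X = x | Y = y) -- the prior is ignored (or assumed uniform)."""
    return max(classes, key=lambda y: likelihood(x, y))

# Toy numbers: the likelihood favours "apricot", but a strong prior for "apple"
# flips the MAP decision.
prior = {"apple": 0.9, "apricot": 0.1}
lik = lambda x, y: {"apple": 0.2, "apricot": 0.6}[y]
print(mle_rule(None, lik, ["apple", "apricot"]))          # apricot
print(map_rule(None, prior, lik, ["apple", "apricot"]))   # apple (0.9*0.2 > 0.1*0.6)
```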
Bayesian networks are a graphical model that uses a directed acyclic graph to specify dependencies among variables.

Bayesian Networks – Example
Now, e.g.,
    P(C, S, W, B, A) = P(C) · P(S) · P(W | C) · P(B | C, S) · P(A | B)
Now we may, e.g., infer the probability P(C = T | A = T) that we sit in a bad chair given that our back aches.
We have to store only 10 numbers, as opposed to 2^5 − 1 = 31 if the whole joint pmf is stored.

Bayesian Networks – Learning & Naive Bayes
Many algorithms have been developed for learning
    the structure of the graph of the network,
    the conditional probability tables.
The methods are based on maximum-likelihood estimation, gradient descent, etc. Automatic procedures are usually combined with expert knowledge.
Can you express the naive Bayes for Y, X1, . . . , Xn using a Bayesian network?
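As an illustration of the factorization above, the sketch below computes the joint probability of one full assignment and the posterior P(C = T | A = T) by enumeration. Only the structure P(C)·P(S)·P(W|C)·P(B|C,S)·P(A|B) comes from the slide; all conditional probability values are made up for the example.

```python
from itertools import product

# Hypothetical CPTs for binary variables C, S, W, B, A (True/False).
# Only P(· = True | parents) is stored; the complement follows from it.
P_C = 0.2                                          # P(C = T)
P_S = 0.3                                          # P(S = T)
P_W = {True: 0.8, False: 0.1}                      # P(W = T | C)
P_B = {(True, True): 0.9, (True, False): 0.7,
       (False, True): 0.5, (False, False): 0.1}    # P(B = T | C, S)
P_A = {True: 0.6, False: 0.05}                     # P(A = T | B)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(c, s, w, b, a):
    """P(C, S, W, B, A) = P(C) P(S) P(W | C) P(B | C, S) P(A | B)."""
    return (bernoulli(P_C, c) * bernoulli(P_S, s) * bernoulli(P_W[c], w)
            * bernoulli(P_B[(c, s)], b) * bernoulli(P_A[b], a))

# 1 + 1 + 2 + 4 + 2 = 10 stored numbers, versus 2**5 - 1 = 31 for the full joint pmf.
print(joint(True, False, True, True, True))

# Inference by enumeration: P(C = T | A = T).
num = sum(joint(True, s, w, b, True) for s, w, b in product([True, False], repeat=3))
den = sum(joint(c, s, w, b, True) for c, s, w, b in product([True, False], repeat=4))
print(num / den)
```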