1 Text Categorization
Luboš Popelínský
Faculty of Informatics, Masaryk University Brno
popel@fi.muni.cz, www.fi.muni.cz/~popel

2 Based partially on Raymond J. Mooney's course on Machine Learning (http://www.cs.utexas.edu/~mooney/#Teaching, University of Texas at Austin), José María Gómez Hidalgo's ECML/PKDD tutorial (www.fi.muni.cz/~popel/nll), and possibly Tom Mitchell's handouts for his Machine Learning book.

3 Text Categorization and Text Mining
•  Text mining
•  Unsupervised learning
   words: e.g. Latent Semantic Analysis, Key Phrase Extraction
   documents: e.g. Document Clustering, Topic Detection
•  Supervised learning
   words: e.g. Part-of-Speech Tagging, Word Sense Disambiguation
   documents: e.g. Text Categorization/Classification/Filtering, Information Extraction

4 Word Sense Disambiguation
Fui a um casamento. (I went to a wedding.)
Casamento é uma experiência interessante. (Marriage is an interesting experience.)
Ela me contou tudo. (She told me everything.)
Eu contei pelo menos vinte pessoas. (I counted at least twenty people.)
Você pode contar comigo. (You can count on me.)
Isto não está direito. (This is not right.)
Ela está na faculdade de direito. (She is at the faculty of law.)
… fazer …
The same word (casamento, contar, direito, …) carries a different sense in each sentence.

5 Other Applications
Context-sensitive spelling checking – "peace of chocolate" (should be "piece")
Named entity recognition – names, dates, locations, $, @, URLs
Coreference resolution – Carlos não fica na casa. Ele … (Carlos does not stay in the house. He …)
Sentiment classification – positive or negative
Text summarisation

6 Example
[Figure: a toy training set of e-mails labelled spam or legit. Spam messages contain indicative words such as Viagra, lottery, nude, Nigeria, deal, hot, win, $, !!; legit messages contain words such as Friday, exam, computer, May, PM, test, March, science, homework, score.]

7 Example (cont.)
[Figure: the same training set plus a new, unlabelled message containing the words win, lottery, $ and ! – which category, spam or legit, should it receive?]

8 Text Categorization Applications
•  Web pages
   –  recommending
   –  Yahoo-like classification
•  Newsgroups/e-mails
   –  recommending
   –  spam filtering
   –  sentiment analysis for marketing
•  News articles
   –  personalized newspaper
•  Scientific papers
   –  filtering

9 Task
Based on the text of a document in a collection, assign one or more labels to each document.
Usable e.g. for
   topic recognition
   genre detection
   authorship recognition
   author's stance (agree/disagree, for/against)
   language recognition
Benchmarks: 20 Newsgroups, Reuters data

10 Text Categorization
•  The dominant approach: given a set of manually classified (labeled) documents, use ML techniques to induce an automatic classifier for new documents.
•  See [Sebastiani 02] for an in-depth survey.

11 Bag of Words
Documents are represented as bags of words: the structure of a document, including its division into sentences, is ignored.
The classifier's decision is based on word frequencies in positive and negative examples.
Which words should be chosen to reach the best accuracy? Maybe all of them …

12 Text Categorization Methods
•  Representations of text are very high-dimensional (one feature for each word).
•  Vectors are sparse, since most words are rare.
•  For most text categorization tasks, there are many irrelevant and many relevant features.
•  Methods that sum evidence from many or all features (e.g. naïve Bayes, neural nets, SVM) tend to work better than ones that try to isolate just a few relevant features (decision-tree or rule induction).
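To make the bag-of-words representation of slides 11–12 concrete, here is a minimal Python sketch (mine, not from the slides); the regular-expression tokenizer is a simplifying assumption.

import re
from collections import Counter

def tokenize(text):
    # crude tokenizer (an assumption): lowercase, keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    # one sparse term-frequency vector (a Counter) per document;
    # word order and sentence structure are deliberately discarded
    return [Counter(tokenize(d)) for d in documents]

docs = ["Win the lottery now!!", "The exam score is due on Friday."]
print(bag_of_words(docs)[0])
# Counter({'win': 1, 'the': 1, 'lottery': 1, 'now': 1})

Each document thus becomes a point in a space with one dimension per vocabulary word, which is exactly the high-dimensional sparse representation slide 12 describes.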
13 Text Categorization Algorithm
-  Select terms (words, bi-/trigrams, named entities)
-  Perform morphological analysis (lemmatization, tagging); lemmatization = sou, és → ser
-  Term clustering = ser, estar, ficar → artificial term SEF
-  Remove nonsignificant terms (stoplist)
-  Select the terms with the highest discriminative power
-  Learn a classifier
(A toy sketch of these preprocessing steps appears after the summary slide below.)

14 Naïve Bayes for Text
•  Model: a document in a given category is generated as a bag of words by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} according to the probabilities P(wj | ci).
•  Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
   –  Equivalent to a virtual sample in which each word is seen in each category exactly once.

15 Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
   Let Di be the subset of documents in D in category ci
   P(ci) = |Di| / |D|
   Let Ti be the concatenation of all the documents in Di
   Let ni be the total number of word occurrences in Ti
   For each word wj ∈ V
      Let nij be the number of occurrences of wj in Ti
      Let P(wj | ci) = (nij + 1) / (ni + |V|)

16 Text Naïve Bayes Algorithm (Test)
Given a test document X, let n be the number of word occurrences in X.
Return the category
   argmax_{ci ∈ C} P(ci) · ∏_{i=1..n} P(ai | ci)
where ai is the word occurring in the i-th position in X.
(A runnable Python version of this pseudocode follows after the summary slide.)

17 Text Document Filtering
Example: filtering interesting news from a newsgroup.
Idea:
1.  label interesting documents (positive examples)
2.  and uninteresting ones (negative examples)
3.  run a learning algorithm
Also useful for spam detection and for filtering web pages.

18 Applications
Spam filtering
Sentiment analysis
Recommender systems, e.g. to recommend conference papers or web pages
Another example: prediction of stock price movements from news
. . .

19 Text Categorization: Summary
•  Bag of words → vector model
•  Feature/term selection and term extraction
•  Use an ML algorithm – not only a Naïve Bayes classifier, but also SVMs or neural nets
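The preprocessing pipeline of slide 13, sketched in Python. The stoplist, the lemma table and the document-frequency threshold below are toy stand-ins (my assumptions) for real linguistic resources and a real term-selection criterion.

from collections import Counter

STOPLIST = {"a", "o", "um", "uma", "de"}                    # toy stoplist
LEMMAS   = {"sou": "ser", "és": "ser"}                      # lemmatization: sou, és -> ser
CLUSTERS = {"ser": "SEF", "estar": "SEF", "ficar": "SEF"}   # term clustering -> SEF

def preprocess(tokens):
    out = []
    for t in tokens:
        t = LEMMAS.get(t, t)       # morphological analysis (lemmatization)
        t = CLUSTERS.get(t, t)     # replace cluster members by an artificial term
        if t not in STOPLIST:      # remove nonsignificant terms
            out.append(t)
    return out

def select_terms(tokenized_docs, min_df=2):
    # keep terms occurring in at least min_df documents - a crude proxy
    # for "terms with the highest discriminative power"
    df = Counter(t for toks in tokenized_docs for t in set(toks))
    return {t for t, n in df.items() if n >= min_df}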
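The pseudocode of slides 15–16 translated into Python. The sketch assumes documents arrive as token lists, computes the test step in log space to avoid floating-point underflow (a standard change, not in the pseudocode), and skips words outside V, since P(ai | ci) is only defined on the vocabulary.

import math
from collections import Counter

def train_nb(docs, labels):
    # docs: list of token lists; labels: parallel list of category names
    V = {w for d in docs for w in d}
    prior, cond = {}, {}
    for c in set(labels):
        Di = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = len(Di) / len(docs)             # P(ci) = |Di| / |D|
        Ti = Counter(w for d in Di for w in d)     # concatenation of docs in Di
        ni = sum(Ti.values())                      # word occurrences in Ti
        cond[c] = {w: (Ti[w] + 1) / (ni + len(V))  # P(wj|ci) = (nij+1)/(ni+|V|)
                   for w in V}
    return V, prior, cond

def classify_nb(x, V, prior, cond):
    # argmax_ci P(ci) * prod_i P(ai|ci), computed with log-probabilities
    words = [w for w in x if w in V]
    score = lambda c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in words)
    return max(prior, key=score)

Training this on the labelled spam/legit messages of slide 6 and calling classify_nb on the tokens of the new "win lottery $ !" message would answer the question posed on slide 7.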
20 Textual Similarity Metrics
•  Measuring the similarity of two texts is a well-studied problem.
•  Standard metrics are based on a "bag of words" model of a document that ignores word order and syntactic structure.
•  They may involve removing common "stop words" and stemming to reduce words to their root form.
•  The vector-space model from Information Retrieval (IR) is the standard approach.
•  Other metrics (e.g. edit distance) are also used.

21 Graphic Representation
Example:
   D1 = 2T1 + 3T2 + 5T3
   D2 = 3T1 + 7T2 + T3
   Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q drawn as vectors in the three-dimensional term space with axes T1, T2, T3.]
•  Is D1 or D2 more similar to Q?
•  How to measure the degree of similarity? Distance? Angle? Projection?

22 Term Weights: Term Frequency
•  More frequent terms in a document are more important, i.e. more indicative of its topic.
   fij = frequency of term i in document j
•  We may want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
   tfij = fij / maxi{fij}

23 Term Weights: Inverse Document Frequency
•  Terms that appear in many different documents are less indicative of the overall topic.
   dfi = document frequency of term i = number of documents containing term i
   idfi = inverse document frequency of term i = log2(N / dfi)   (N: total number of documents)
•  An indication of a term's discriminative power.
•  The log is used to dampen the effect relative to tf.

24 TF-IDF Weighting
•  A typical combined term-importance indicator is tf-idf weighting:
   wij = tfij · idfi = tfij · log2(N / dfi)
•  A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
•  Many other ways of determining term weights have been proposed.
•  Experimentally, tf-idf has been found to work well.

25 Cosine Similarity Measure
•  Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
   CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σ_{i=1..t} (wij · wiq) / sqrt(Σ_{i=1..t} wij² · Σ_{i=1..t} wiq²)
[Figure: D1, D2 and Q in term space, with angle θ1 between D1 and Q and angle θ2 between D2 and Q.]
   D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
   D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
   Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
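A direct transcription of the weighting formulas of slides 22–24 (tfij = fij / maxi{fij}, idfi = log2(N/dfi), wij = tfij · idfi), assuming documents are given as token lists; a sketch, not a full IR system.

import math
from collections import Counter

def tfidf(tokenized_docs):
    # returns one sparse {term: weight} vector per document
    N = len(tokenized_docs)
    df = Counter(t for toks in tokenized_docs for t in set(toks))   # df_i
    idf = {t: math.log2(N / n) for t, n in df.items()}              # idf_i
    vectors = []
    for toks in tokenized_docs:
        f = Counter(toks)
        max_f = max(f.values())   # frequency of the most common term in the doc
        vectors.append({t: (cnt / max_f) * idf[t] for t, cnt in f.items()})
    return vectors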
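Cosine similarity over such sparse vectors; run on the raw term counts of slide 21, it reproduces the worked values 0.81 and 0.13 from slide 25.

import math

def cos_sim(d, q):
    # CosSim(d, q) = (d . q) / (|d| * |q|) for sparse {term: weight} vectors
    dot  = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = math.sqrt(sum(w * w for w in d.values())) \
         * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

D1 = {"T1": 2, "T2": 3, "T3": 5}
D2 = {"T1": 3, "T2": 7, "T3": 1}
Q  = {"T3": 2}
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))   # 0.81 0.13

Normalizing by the vector lengths is what makes the measure an angle rather than a projection: it is insensitive to document length, which the raw inner product is not.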