‹#› 1 Text Categorization
Luboš Popelínský
Faculty of Informatics, Masaryk University Brno
popel@fi.muni.cz, www.fi.muni.cz/~popel

‹#› 2
Based partially on Raymond J. Mooney's Machine Learning course, http://www.cs.utexas.edu/~mooney/#Teaching (University of Texas at Austin), on Jose Maria Gomes Hidalgo's ECML/PKDD tutorial, www.fi.muni.cz/~popel/nll, and on Tom Mitchell's handouts to his Machine Learning book.

‹#› 3 Example
[Figure: nine training e-mails shown as bags of words – e.g. nude, deal, Nigeria, hot, $, Viagra, lottery, !!, win, Friday, exam, computer, May, PM, test, March, science, homework, score – each labelled with its category, spam or legit.]

‹#› 4 Example (cont.)
[Figure continued: a new e-mail containing the words win, lottery, $, ! – its category (??) is to be predicted from the labelled examples.]

‹#› 5 Text Categorization Applications
•Web pages
–Recommending
–Yahoo-like classification
•Newsgroups/Emails
–Recommending
–Spam filtering
–Sentiment analysis for marketing
•News articles
–Personalized newspaper
•Scientific papers
–Filtering

‹#› 6 Task
•Based on the text of a document in a collection, assign to each document one or more labels.
•Usable e.g. for
–topic recognition
–genre detection
–authorship recognition
–author's stance (agree/disagree, for/against)
–language recognition
•Benchmarks: 20 Newsgroups, Reuters data

‹#› 7 Text Categorization and Text Mining
•Text mining
•Unsupervised learning
–words: e.g. Latent Semantic Analysis, Key Phrase Extraction
–documents: e.g. Document Clustering, Topic Detection
•Supervised learning
–words: e.g. Part-of-Speech Tagging, Word Sense Disambiguation
–documents: e.g. Text Categorization/Classification/Filtering, Information Extraction

‹#› 8 Word Sense Disambiguation
•Fui a um casamento. (I went to a wedding.)
•Casamento é uma experiência interessante. (Marriage is an interesting experience.)
•Ela me contou tudo. (She told me everything.)
•Eu contei pelo menos vinte pessoas. (I counted at least twenty people.)
•Você pode contar comigo. (You can count on me.)
•Isto não está direito. (This is not right.)
•Ela está na faculdade de direito. (She is at the law school.)
•… fazer …

‹#› 9 Other Applications
•Context-sensitive spelling checking – e.g. "peace of chocolate" (for "piece of chocolate")
•Named entity recognition – name, date, location, $, @, URL
•Coreference resolution – Carlos não fica na casa. Ele … (Carlos does not stay in the house. He …)
•Sentiment classification – positive or negative
•Text summarisation

‹#› 10 Text Categorization
•The dominant approach is:
–given a set of manually classified (labelled) documents,
–use ML techniques to induce an automatic classifier of new documents.
•See [Sebastiani 02] for an in-depth survey.

‹#› 11 Bag of Words
•Documents are represented as bags of words
–the structure of documents, including sentences, is ignored.
•The decision of a classifier is based on word frequencies in positive and negative examples.
•Which words should be chosen to reach the best accuracy? Maybe all …

‹#› 12 Text Categorization Methods
•Representations of text are very high-dimensional (one feature for each word).
•Vectors are sparse since most words are rare.
•For most text categorization tasks, there are many irrelevant and many relevant features.
•Methods that sum evidence from many or all features (e.g. naïve Bayes, neural nets, SVM) tend to work better than ones that try to isolate just a few relevant features (decision trees or rule induction).

‹#› 13 Text Categorization Algorithm
–Select terms (words, bi-/trigrams, named entities)
–Perform morphological analysis (lemmatization, tagging); lemmatization = sou, és → ser
–Term clustering = ser, estar, ficar → artificial term SEF
–Remove nonsignificant terms (stoplist)
–Select the terms with the highest discriminative power
–Learn a classifier

‹#› 14 The Vector-Space Model
•Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
•These "orthogonal" terms form a vector space.
–Dimension = t = |vocabulary|
•Each term i in a document or query j is given a real-valued weight wij.
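The term-selection steps above (tokenize, drop stoplist words, keep the most discriminative terms) can be sketched as follows. This is a minimal illustration only: the tokenizer, the tiny stoplist, and the use of raw frequency as a stand-in for discriminative power are all simplifying assumptions, and the lemmatization and term-clustering steps are omitted.

```python
import re
from collections import Counter

# Hypothetical, tiny stoplist for illustration.
STOPLIST = {"a", "the", "of", "to", "and", "is", "in"}

def select_terms(document, top_k=5):
    """Tokenize, remove stoplist words, and keep the top_k most frequent terms."""
    tokens = re.findall(r"[a-z]+", document.lower())
    terms = [t for t in tokens if t not in STOPLIST]
    return [term for term, _ in Counter(terms).most_common(top_k)]

doc = "The lottery is a win: win the lottery and win a prize!"
print(select_terms(doc, top_k=3))  # → ['win', 'lottery', 'prize']
```

In a real system the frequency ranking would be replaced by a proper term-scoring criterion (e.g. information gain or chi-square), and a morphological analyser would map inflected forms to lemmas first.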
•Both documents and queries are expressed as t-dimensional vectors:
–dj = (w1j, w2j, …, wtj)

‹#› 15 Document Collection
•A collection of n documents can be represented in the vector-space model by a term-document matrix.
•An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

        T1    T2    …   Tt
  D1    w11   w21   …   wt1
  D2    w12   w22   …   wt2
  :      :     :         :
  Dn    w1n   w2n   …   wtn

‹#› 16 Naïve Bayes for Text
•A document in a given category is modeled as a bag of words generated by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} according to the probabilities P(wj | ci).
•Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
–Equivalent to a virtual sample of seeing each word in each category exactly once.

‹#› 17 Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)

‹#› 18 Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category
    argmax over ci ∈ C of  P(ci) · ∏ (i = 1 … n) P(ai | ci)
where ai is the word occurring in the i-th position in X.

‹#› 19 Text Categorization: Summary
•Bag of words
•Vector model
•Feature/term selection and term extraction
•Naïve Bayes classifier, SVM, neural nets
•Applications

‹#› 20 Textual Similarity Metrics
•Measuring the similarity of two texts is a well-studied problem.
•Standard metrics are based on a "bag of words" model of a document that ignores word order and syntactic structure.
•May involve removing common "stop words" and stemming to reduce words to their root form.
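The Naïve Bayes train/test procedure above can be sketched directly from its pseudocode. The training documents are made-up toy examples in the spirit of the spam/legit e-mails from the opening slides; the classifier uses Laplace smoothing, (nij + 1) / (ni + |V|), and computes the argmax in log space to avoid underflow.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, category). Returns priors, P(w|c), and vocabulary."""
    vocab = {w for words, _ in docs for w in words}
    cats = {c for _, c in docs}
    prior, cond = {}, {}
    for c in cats:
        cat_docs = [words for words, cc in docs if cc == c]
        prior[c] = len(cat_docs) / len(docs)          # P(ci) = |Di| / |D|
        counts = Counter(w for words in cat_docs for w in words)
        n_c = sum(counts.values())                    # total word occurrences in Ti
        # Laplace smoothing: P(wj | ci) = (nij + 1) / (ni + |V|)
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return prior, cond, vocab

def classify_nb(x, prior, cond, vocab):
    """Return argmax_ci P(ci) * prod_i P(ai | ci), computed in log space."""
    def score(c):
        s = math.log(prior[c])
        for w in x:
            if w in vocab:                            # skip words unseen in training
                s += math.log(cond[c][w])
        return s
    return max(prior, key=score)

train = [(["win", "lottery", "$"], "spam"),
         (["viagra", "hot", "deal"], "spam"),
         (["exam", "friday", "score"], "legit"),
         (["homework", "test", "march"], "legit")]
prior, cond, vocab = train_nb(train)
print(classify_nb(["win", "lottery"], prior, cond, vocab))  # → spam
```

With only four training documents the probabilities are dominated by the smoothing term, but the decision already follows the evidence: "win" and "lottery" each occur in spam and never in legit, so the spam likelihood wins.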
•The vector-space model from Information Retrieval (IR) is the standard approach.
•Other metrics (e.g. edit distance) are also used.

‹#› 21 Graphic Representation
•Example:
–D1 = 2T1 + 3T2 + 5T3
–D2 = 3T1 + 7T2 + T3
–Q = 0T1 + 0T2 + 2T3
[Figure: the three vectors plotted in the three-dimensional term space spanned by T1, T2, T3.]
•Is D1 or D2 more similar to Q?
•How to measure the degree of similarity? Distance? Angle? Projection?

‹#› 22 Term Weights: Term Frequency
•More frequent terms in a document are more important, i.e. more indicative of the topic.
–fij = frequency of term i in document j
•May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
–tfij = fij / maxi{fij}

‹#› 23 Term Weights: Inverse Document Frequency
•Terms that appear in many different documents are less indicative of the overall topic.
–dfi = document frequency of term i = number of documents containing term i
–idfi = inverse document frequency of term i = log2(N / dfi)
–(N: total number of documents)
•An indication of a term's discrimination power.
•The log is used to dampen the effect relative to tf.

‹#› 24 TF-IDF Weighting
•A typical combined term-importance indicator is tf-idf weighting:
–wij = tfij · idfi = tfij · log2(N / dfi)
•A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
•Many other ways of determining term weights have been proposed.
•Experimentally, tf-idf has been found to work well.

‹#› 25 Cosine Similarity Measure
•Cosine similarity measures the cosine of the angle between two vectors.
•Inner product normalized by the vector lengths.
–D1 = 2T1 + 3T2 + 5T3: CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
–D2 = 3T1 + 7T2 + 1T3: CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
–Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q in term space; the angle between Q and D1 is smaller than between Q and D2.]
•D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
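The worked example with D1, D2 and Q can be checked with a few lines of code. This is a direct computation of the inner product and its length-normalized version (cosine similarity) for the three vectors given on the slides; nothing beyond the slide's own numbers is assumed.

```python
import math

def dot(u, v):
    """Inner product of two term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))

def cos_sim(u, v):
    """Inner product normalized by the vector lengths."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

D1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))  # → 0.81
print(round(cos_sim(D2, Q), 2))  # → 0.13
print(dot(D1, Q), dot(D2, Q))    # → 10 2
```

The output reproduces the slide's claim: by cosine similarity D1 beats D2 by a factor of about 6 (0.81 vs 0.13), while by raw inner product the factor is only 5 (10 vs 2), because the inner product is not corrected for vector length.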
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σ (i = 1 … t) wij · wiq / √( Σ (i = 1 … t) wij² · Σ (i = 1 … t) wiq² )

‹#› 26 Text Document Filtering
•Example: filtering interesting news from a newsgroup.
•Idea:
1. Label interesting documents (positive examples).
2. Label uninteresting ones (negative examples).
3. Run a learning algorithm.
•Also useful for spam detection and for filtering web pages.
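The filtering idea above can be sketched with the cosine measure from the previous slides. The slide does not fix a particular learner, so this sketch uses a simple centroid-based (Rocchio-style) rule as one possible choice: average the positive and negative example vectors and assign a new document to the side whose centroid it is closer to in cosine similarity. The four-term vocabulary and all counts are made up for illustration.

```python
import math

# Hypothetical vocabulary for a toy newsgroup filter.
VOCAB = ["python", "code", "lottery", "win"]

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def centroid(vectors):
    """Component-wise mean of a list of term-count vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(VOCAB))]

positive = [[3, 2, 0, 0], [2, 1, 0, 1]]   # labelled interesting posts
negative = [[0, 0, 4, 3], [0, 1, 2, 2]]   # labelled uninteresting posts
pos_c, neg_c = centroid(positive), centroid(negative)

def interesting(doc):
    """Classify a new post by its nearest class centroid."""
    return cos_sim(doc, pos_c) > cos_sim(doc, neg_c)

print(interesting([2, 2, 0, 0]))  # → True  (looks like the positive posts)
print(interesting([0, 0, 3, 1]))  # → False (looks like the negative posts)
```

Any of the classifiers from the earlier slides (naïve Bayes, SVM, neural nets) could be substituted for the centroid rule; the labelling-then-learning workflow is the same.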