‹#› 1 Text Categorization
Luboš Popelínský
Faculty of Informatics, Masaryk University Brno
popel@fi.muni.cz, www.fi.muni.cz/~popel

‹#› 2
Based partially on Raymond J. Mooney's Machine Learning course, http://www.cs.utexas.edu/~mooney/#Teaching (University of Texas at Austin), on Jose Maria Gomes Hidalgo's ECML/PKDD tutorial, www.fi.muni.cz/~popel/nll, and on Tom Mitchell's handouts to his Machine Learning book.

‹#› 3 Example
[Figure: nine training e-mails shown as bags of words – e.g. nude, deal, Nigeria, hot, $, Viagra, lottery, !!, win, Friday, exam, computer, May, PM, test, March, science, homework, score – each labelled with its category, spam or legit.]

‹#› 4 Example (cont.)
[Figure continued: a new e-mail containing the words win, lottery, $, ! – its category (??) is to be predicted from the labelled examples.]

‹#› 5 Text Categorization Applications
•Web pages
–Recommending
–Yahoo-like classification
•Newsgroups/Emails
–Recommending
–Spam filtering
–Sentiment analysis for marketing
•News articles
–Personalized newspaper
•Scientific papers
–Filtering

‹#› 6 Task
•Based on the text of a document in a collection, assign to each document one or more labels.
•Usable e.g. for
–topic recognition
–genre detection
–authorship recognition
–author's stance (agree/disagree, for/against)
–language recognition
•Benchmarks: 20 Newsgroups, Reuters data

‹#› 7 Text Categorization and Text Mining
•Text mining
•Unsupervised learning
–words: e.g. Latent Semantic Analysis, Key Phrase Extraction
–documents: e.g. Document Clustering, Topic Detection
•Supervised learning
–words: e.g. Part-of-Speech Tagging, Word Sense Disambiguation
–documents: e.g. Text Categorization/Classification/Filtering, Information Extraction

‹#› 8 Word Sense Disambiguation
•Fui a um casamento. (I went to a wedding.)
•Casamento é uma experiência interessante. (Marriage is an interesting experience.)
•Ela me contou tudo. (She told me everything.)
•Eu contei pelo menos vinte pessoas. (I counted at least twenty people.)
•Você pode contar comigo. (You can count on me.)
•Isto não está direito. (This is not right.)
•Ela está na faculdade de direito. (She is at the law school.)
•… fazer …

‹#› 9 Other Applications
•Context-sensitive spelling checking – e.g. "peace of chocolate" (for "piece of chocolate")
•Named entity recognition – name, date, location, $, @, URL
•Coreference resolution – Carlos não fica na casa. Ele … (Carlos does not stay in the house. He …)
•Sentiment classification – positive or negative
•Text summarisation

‹#› 10 Text Categorization
•The dominant approach is:
–given a set of manually classified (labelled) documents,
–use ML techniques to induce an automatic classifier of new documents.
•See [Sebastiani 02] for an in-depth survey.

‹#› 11 Bag of Words
•Documents are represented as bags of words
–the structure of documents, including sentences, is ignored.
•The decision of a classifier is based on word frequencies in positive and negative examples.
•Which words should be chosen to reach the best accuracy? Maybe all …

‹#› 12 Text Categorization Methods
•Representations of text are very high-dimensional (one feature for each word).
•Vectors are sparse since most words are rare.
•For most text categorization tasks, there are many irrelevant and many relevant features.
•Methods that sum evidence from many or all features (e.g. naïve Bayes, neural nets, SVM) tend to work better than ones that try to isolate just a few relevant features (decision trees or rule induction).

‹#› 13 Text Categorization Algorithm
–Select terms (words, bi-/trigrams, named entities)
–Perform morphological analysis (lemmatization, tagging); lemmatization = sou, és → ser
–Term clustering = ser, estar, ficar → artificial term SEF
–Remove nonsignificant terms (stoplist)
–Select the terms with the highest discriminative power
–Learn a classifier

‹#› 14 The Vector-Space Model
•Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
•These "orthogonal" terms form a vector space.
–Dimension = t = |vocabulary|
•Each term i in a document or query j is given a real-valued weight wij.
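The term-selection steps above (tokenize, drop stoplist words, keep the most discriminative terms) can be sketched as follows. This is a minimal illustration only: the tokenizer, the tiny stoplist, and the use of raw frequency as a stand-in for discriminative power are all simplifying assumptions, and the lemmatization and term-clustering steps are omitted.

```python
import re
from collections import Counter

# Hypothetical, tiny stoplist for illustration.
STOPLIST = {"a", "the", "of", "to", "and", "is", "in"}

def select_terms(document, top_k=5):
    """Tokenize, remove stoplist words, and keep the top_k most frequent terms."""
    tokens = re.findall(r"[a-z]+", document.lower())
    terms = [t for t in tokens if t not in STOPLIST]
    return [term for term, _ in Counter(terms).most_common(top_k)]

doc = "The lottery is a win: win the lottery and win a prize!"
print(select_terms(doc, top_k=3))  # → ['win', 'lottery', 'prize']
```

In a real system the frequency ranking would be replaced by a proper term-scoring criterion (e.g. information gain or chi-square), and a morphological analyser would map inflected forms to lemmas first.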
•Both documents and queries are expressed as t-dimensional vectors:
–dj = (w1j, w2j, …, wtj)

‹#› 15 Document Collection
•A collection of n documents can be represented in the vector-space model by a term-document matrix.
•An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

        T1    T2    …   Tt
  D1    w11   w21   …   wt1
  D2    w12   w22   …   wt2
  :      :     :         :
  Dn    w1n   w2n   …   wtn

‹#› 16 Naïve Bayes for Text
•A document in a given category is modeled as a bag of words generated by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} according to the probabilities P(wj | ci).
•Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
–Equivalent to a virtual sample of seeing each word in each category exactly once.

‹#› 17 Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)

‹#› 18 Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category
    argmax over ci ∈ C of  P(ci) · ∏ (i = 1 … n) P(ai | ci)
where ai is the word occurring in the i-th position in X.

‹#› 19 Text Categorization: Summary
•Bag of words
•Vector model
•Feature/term selection and term extraction
•Naïve Bayes classifier, SVM, neural nets
•Applications

‹#› 20 Textual Similarity Metrics
•Measuring the similarity of two texts is a well-studied problem.
•Standard metrics are based on a "bag of words" model of a document that ignores word order and syntactic structure.
•May involve removing common "stop words" and stemming to reduce words to their root form.
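The Naïve Bayes train/test procedure above can be sketched directly from its pseudocode. The training documents are made-up toy examples in the spirit of the spam/legit e-mails from the opening slides; the classifier uses Laplace smoothing, (nij + 1) / (ni + |V|), and computes the argmax in log space to avoid underflow.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, category). Returns priors, P(w|c), and vocabulary."""
    vocab = {w for words, _ in docs for w in words}
    cats = {c for _, c in docs}
    prior, cond = {}, {}
    for c in cats:
        cat_docs = [words for words, cc in docs if cc == c]
        prior[c] = len(cat_docs) / len(docs)          # P(ci) = |Di| / |D|
        counts = Counter(w for words in cat_docs for w in words)
        n_c = sum(counts.values())                    # total word occurrences in Ti
        # Laplace smoothing: P(wj | ci) = (nij + 1) / (ni + |V|)
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return prior, cond, vocab

def classify_nb(x, prior, cond, vocab):
    """Return argmax_ci P(ci) * prod_i P(ai | ci), computed in log space."""
    def score(c):
        s = math.log(prior[c])
        for w in x:
            if w in vocab:                            # skip words unseen in training
                s += math.log(cond[c][w])
        return s
    return max(prior, key=score)

train = [(["win", "lottery", "$"], "spam"),
         (["viagra", "hot", "deal"], "spam"),
         (["exam", "friday", "score"], "legit"),
         (["homework", "test", "march"], "legit")]
prior, cond, vocab = train_nb(train)
print(classify_nb(["win", "lottery"], prior, cond, vocab))  # → spam
```

With only four training documents the probabilities are dominated by the smoothing term, but the decision already follows the evidence: "win" and "lottery" each occur in spam and never in legit, so the spam likelihood wins.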
•The vector-space model from Information Retrieval (IR) is the standard approach.
•Other metrics (e.g. edit distance) are also used.

‹#› 21 Graphic Representation
•Example:
–D1 = 2T1 + 3T2 + 5T3
–D2 = 3T1 + 7T2 + T3
–Q = 0T1 + 0T2 + 2T3
[Figure: the three vectors plotted in the three-dimensional term space spanned by T1, T2, T3.]
•Is D1 or D2 more similar to Q?
•How to measure the degree of similarity? Distance? Angle? Projection?

‹#› 22 Term Weights: Term Frequency
•More frequent terms in a document are more important, i.e. more indicative of the topic.
–fij = frequency of term i in document j
•May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
–tfij = fij / maxi{fij}

‹#› 23 Term Weights: Inverse Document Frequency
•Terms that appear in many different documents are less indicative of the overall topic.
–dfi = document frequency of term i = number of documents containing term i
–idfi = inverse document frequency of term i = log2(N / dfi)
–(N: total number of documents)
•An indication of a term's discrimination power.
•The log is used to dampen the effect relative to tf.

‹#› 24 TF-IDF Weighting
•A typical combined term-importance indicator is tf-idf weighting:
–wij = tfij · idfi = tfij · log2(N / dfi)
•A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
•Many other ways of determining term weights have been proposed.
•Experimentally, tf-idf has been found to work well.

‹#› 25 Cosine Similarity Measure
•Cosine similarity measures the cosine of the angle between two vectors.
•Inner product normalized by the vector lengths.
–D1 = 2T1 + 3T2 + 5T3: CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
–D2 = 3T1 + 7T2 + 1T3: CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
–Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q in term space; the angle between Q and D1 is smaller than between Q and D2.]
•D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
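The worked example with D1, D2 and Q can be checked with a few lines of code. This is a direct computation of the inner product and its length-normalized version (cosine similarity) for the three vectors given on the slides; nothing beyond the slide's own numbers is assumed.

```python
import math

def dot(u, v):
    """Inner product of two term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))

def cos_sim(u, v):
    """Inner product normalized by the vector lengths."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

D1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))  # → 0.81
print(round(cos_sim(D2, Q), 2))  # → 0.13
print(dot(D1, Q), dot(D2, Q))    # → 10 2
```

The output reproduces the slide's claim: by cosine similarity D1 beats D2 by a factor of about 6 (0.81 vs 0.13), while by raw inner product the factor is only 5 (10 vs 2), because the inner product is not corrected for vector length.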
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σ (i = 1 … t) wij · wiq / √( Σ (i = 1 … t) wij² · Σ (i = 1 … t) wiq² )

‹#› 26 Text Document Filtering
•Example: filtering interesting news from a newsgroup.
•Idea:
1. Label interesting documents (positive examples).
2. Label uninteresting ones (negative examples).
3. Run a learning algorithm.
•Also useful for spam detection and for filtering web pages.
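The filtering idea above can be sketched with the cosine measure from the previous slides. The slide does not fix a particular learner, so this sketch uses a simple centroid-based (Rocchio-style) rule as one possible choice: average the positive and negative example vectors and assign a new document to the side whose centroid it is closer to in cosine similarity. The four-term vocabulary and all counts are made up for illustration.

```python
import math

# Hypothetical vocabulary for a toy newsgroup filter.
VOCAB = ["python", "code", "lottery", "win"]

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def centroid(vectors):
    """Component-wise mean of a list of term-count vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(VOCAB))]

positive = [[3, 2, 0, 0], [2, 1, 0, 1]]   # labelled interesting posts
negative = [[0, 0, 4, 3], [0, 1, 2, 2]]   # labelled uninteresting posts
pos_c, neg_c = centroid(positive), centroid(negative)

def interesting(doc):
    """Classify a new post by its nearest class centroid."""
    return cos_sim(doc, pos_c) > cos_sim(doc, neg_c)

print(interesting([2, 2, 0, 0]))  # → True  (looks like the positive posts)
print(interesting([0, 0, 3, 1]))  # → False (looks like the negative posts)
```

Any of the classifiers from the earlier slides (naïve Bayes, SVM, neural nets) could be substituted for the centroid rule; the labelling-then-learning workflow is the same.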