Unsupervised Detection of Anomalous Text
Pavel Veselý

Source
● "Unsupervised Detection of Anomalous Text" by David Guthrie – PhD thesis at the University of Sheffield, 2008

Terminology
● Anomaly – a deviation from the common rule, type, arrangement, or form
● Texts – documents or segments of documents (in this thesis: segments of 100, 500 and 1,000 words)
● Unsupervised detection – detection without any labelled examples of normal or anomalous text

Motivation
● Detecting plagiarism without access to the source text
  – Koppel, Seidman – Automatically Identifying Pseudepigraphic Texts
  – Klára Kufová – Anomaly Detection in Text
● Obtaining a more homogeneous text set
  – Guthrie, Guthrie, Wilks – An Unsupervised Approach for the Detection of Outliers in Corpora
● Off-topic posts in forums

Process
1. Represent texts as numerical vectors
2. Measure the distance of each vector from the rest of the set
3. Set a threshold for what counts as an anomaly

Text representation
● Numerical vector of 166 features
1. Simple Surface Features (19)
2. Readability Measures (7)
3. Obscurity of Vocabulary Features (7)
4. Part of Speech and Syntax Features (11)
5. Rank Features (8)
6. Emotional Tone Features (114)

Simple Surface Features I
1. Average sentence length
2. Average word length
3. Average number of syllables per word
4. Percentage of all words that have 3 or more syllables
5. Percentage of all words that have only 1 syllable
6. Percentage of long sentences (more than 15 words)
7. Percentage of short sentences (fewer than 8 words)
8. Percentage of sentences that are questions
9. Percentage of all characters that are punctuation characters
10. Percentage of all characters that are semicolons

Simple Surface Features II
11. Percentage of all characters that are commas
12. Percentage of all words that have 6 or more letters
13. Number of word types divided by the number of word tokens (type/token ratio)
14. Percentage of words that are subordinating conjunctions (then, until, while, since, etc.)
15. Percentage of words that are coordinating conjunctions (but, so, or, etc.)
16. Percentage of sentences that begin with a subordinating or coordinating conjunction
17. Percentage of words that are articles
18. Percentage of words that are prepositions
19. Percentage of words that are pronouns

Readability Measures
1. Flesch-Kincaid Reading Ease
2. Flesch-Kincaid Grade Level
3. Gunning-Fog Index
4. Coleman-Liau Formula
5. Automated Readability Index
6. Lix Formula
7. SMOG Index

Obscurity of Vocabulary Usage
● Number of words in the text whose relative frequency in a large corpus (Gigaword) places them in the:
1. Top 1,000 words
2. Top 5,000 words
3. Top 10,000 words
4. Top 50,000 words
5. Top 100,000 words
6. Top 200,000 words
7. Top 300,000 words
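To make the surface and readability features concrete, here is a minimal Python sketch (not from the thesis) that computes a handful of them for one text segment. It assumes simple regex tokenization and a naive vowel-group syllable counter; the well-known Flesch Reading Ease formula is used for the readability example, while the thesis relies on its own tooling and the full list of measures above.

```python
import re

def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels, at least 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def surface_features(text):
    """Compute a few surface / readability features for a non-empty text segment."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)

    return {
        "avg_sentence_length": words_per_sentence,
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "pct_3plus_syllable_words": 100 * sum(s >= 3 for s in syllables) / len(words),
        "pct_long_sentences": 100 * sum(
            len(re.findall(r"[A-Za-z']+", s)) > 15 for s in sentences) / len(sentences),
        # Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence
                               - 84.6 * syllables_per_word,
    }
```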
Part of Speech and Syntax Features
1. Percentage of words that are adjectives
2. Percentage of words that are adverbs
3. Percentage of words that are interrogative words (who, what, where, when, etc.)
4. Percentage of words that are nouns
5. Percentage of words that are verbs
6. Ratio of the number of adjectives to nouns
7. Percentage of words that are proper nouns
8. Percentage of words that are numbers (i.e. cardinal, ordinal, nouns such as dozen, thousands, etc.)
9. Diversity of POS tri-grams:
   POS trigram diversity = (number of different POS trigrams / total number of POS trigrams) × 100

Rank Features
1. Distribution of the POS tri-gram list
2. Distribution of the POS bi-gram list
3. Distribution of the POS list
4. Distribution of the Articles list
5. Distribution of the Prepositions list
6. Distribution of the Conjunctions list
7. Distribution of the Pronouns list
8. Distribution of the Adverbs list

General Inquirer Dictionary
● Captures the sentiment of a text
● Quantifies the connotative meaning of isolated words
● 13,000 root words mapped into 114 categories
  – most words are assigned to more than one category
  – the two largest categories are 'positive' (1,915 words) and 'negative' (2,291 words)

Measuring the distances
● ClustDist – a distance based on average linkage clustering
● SDEDist – the Stahel-Donoho Estimator distance
● PCout – the weights calculated by the PCout algorithm
● MeanComp – distance from the mean of all other segments in the data
● TxtCompDist – method developed by the authors that uses the distance from the textual complement

SDEDist
● Projecting onto a direction gives a scalar distance to the center of all observations
● Find the direction that, when used for projection, gives the maximum distance
● Infinitely many possible directions – the problem is finding a good set of directions to try
● SDEDist(x, V) = max over directions a of |xᵀa − median(Va)| / mad(Va)

TxtCompDist
● Distance from the textual complement (the union of the remaining texts)
● Designed by the authors
● Makes better use of features that require larger texts (POS trigrams, adverb preferences)
● TxtCompDist(x, V) = d(x, c_x), where c_x is the feature vector of the textual complement of x

Experiment Data
● Artificial data set
  – Documents of 51 segments, 1 of which is anomalous
  – Created as a random bag of segments from 2 sources (50 segments from one source, 1 from the other)
● Segments of length 100, 500 and 1,000 words

Authorship Tests
● 8 Victorian authors

Fact versus Opinion
● Factual news versus editorials

Genre Difference
● Newswire versus the Anarchist Cookbook

Machine Translation
● English newswire versus Chinese newswire translated by Google Translate (2008)

Experiment Conclusions I
● Finding the anomaly as the top-1 candidate
  – Random chance – 2%
  – Average on 100-word segments – 32%
  – Average on 1,000-word segments – 68%
● Best on Google Translate (2008) – 96%
● Works very well on anomalies that differ in style or genre

Experiment Conclusions II
● Best metric – TxtCompDist
● The Stahel-Donoho-based method is a close second

Precision and Recall
● Setting a threshold on the anomaly score
● A compromise between precision and recall

Thresholding
● Fix precision at 100% and maximize recall

Feature Selection
● Based on the ability to differentiate anomalies
● The best features are the same across all experiments:
1. Gunning-Fog Index
2. Number of passive sentences
3. Flesch-Kincaid Reading Ease
4. Percentage of sentences over 15 words
● The worst features differ, but are mostly sentiment-based

Real Data Experiments
● TODO
● Klára
● Guthrie – corpora

Summary
● TODO
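To illustrate step 2 of the process (measuring how far each segment lies from the rest of the set), here is a minimal MeanComp-style sketch, not taken from the thesis: each segment's standardized feature vector is compared against the mean of all the other segments. The helper `feature_vector` is hypothetical, standing in for the 166-feature representation described above; TxtCompDist, the thesis's best-performing measure, would instead recompute the features over the concatenated complement text rather than averaging the vectors.

```python
import numpy as np

def mean_comp_distances(X):
    """MeanComp-style anomaly scores: Euclidean distance of each segment's
    feature vector from the mean of all *other* segments.

    X: (n_segments, n_features) array, one row per text segment.
    Features are standardized first so that no single feature dominates."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    scores = np.empty(len(X))
    for i in range(len(X)):
        others = np.delete(X, i, axis=0)          # the complement of segment i
        scores[i] = np.linalg.norm(X[i] - others.mean(axis=0))
    return scores

# Usage: flag the single most anomalous segment, or threshold the scores.
# X = np.vstack([feature_vector(seg) for seg in segments])   # 166-dim vectors
# scores = mean_comp_distances(X)
# anomaly = scores.argmax()
```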