Unsupervised Detection of Anomalous Text
Part 2
Jozef Štyrák

Contents
● Quick Recap
● Experimental Results
● IS and MUNI Applications

About
● Written by David Guthrie
● Doctoral thesis at the University of Sheffield, 2008

What is an Anomaly?
● “something that deviates from what is standard, normal or expected”
● Anomalies in text at the segment level

Problem Definition
● Unsupervised anomaly detection
● Text is represented by a set of features
● Text is split into segments; each segment gets a score measuring how much it deviates from what is normal for the particular document

Used Techniques
● ClustDist
● SDEDist
● Pcout
● MeanComp
● TxtCompDist (textual complement)
● Baseline (random choice)

TxtCompDist Algorithm
● Measures the distance from the textual complement
  – x: feature vector for the segment
  – c_x: feature vector for the segment's complement
● Novel method designed by the author
● Compared with MeanComp, makes better use of ranked-list features (POS trigrams, adverbs, ...)

TxtCompDist(x, V) = d(x, c_x)

Stahel-Donoho Estimator Distance
● Idea: find projections of the data that maximize an observation's distance from the center of the observations
● Problem: finding the set of projection vectors a

SDEDist(x, V) = max_a |x^T a − median(V a)| / mad(V a)

Experimental Setup
● Artificially created test document
  – 50 normal segments
  – 1 anomalous segment
● Output: list of segments ranked by how anomalous they are with respect to the whole test document

Types of Anomalies
● Authorship anomalies
  – 8 Victorian authors
● Factual writing vs. opinion writing anomalies
  – Opinion columns
● Subversive article anomalies
  – Newswire vs. Anarchist Cookbook
● Machine translation anomalies
  – Chinese news translated into English

Authorship Anomalies

Fact vs. Opinion
● Opinion pieces (editorials, opinion columns) inserted into a factual article (Gigaword Corpus)

Newswire vs. Anarchist Cookbook
● Anarchist Cookbook
  – Recipes for explosives, instructions for building various devices, ...
● Difference in genre

Machine Translation Anomalies
● Translations produced with Google Translate

Conclusions from Experiments [1]
● Best results for detecting anomalies based on differences in genre or style
  – Identification of machine-translated text
  – Newswire vs. Anarchist Cookbook
● Identifying the anomaly at rank 1 (Top 1) is difficult
  – 96% probability for the MT task (large segment size)
  – 2% probability by chance (1 anomalous segment among 51)

Conclusions from Experiments [2]
● Best results for TxtCompDist
● SDEDist: higher time cost

Precision & Recall
● What is the probability that a given segment is anomalous?
● Defining a threshold on the anomaly score
  – Maximal precision (100%)
  – Best recall possible

Precision & Recall [2]

Feature Selection
● Which features help identify an anomaly and which do not
● Score: difference in feature values between anomalous and normal segments
● Results
  – Emotional features are the least effective
  – The most effective features are essentially the same for all anomaly types; the least effective features differ

Most Effective Features
● Gunning-Fog Index
● Percentage of passive sentences
● Flesch-Kincaid Reading Ease
● Percentage of sentences longer than 15 words
● Lix Formula

Least Effective Features
● Words of economic, commercial, industrial orientation
● Terms denoting kinship
● Words for non-work social rituals
● Words concerned with fetching or carrying
● Words for places occurring in nature

Summary of Conclusions
● Variations in text are viewed as outliers
● Best method: comparing each segment with its textual complement
● Accuracy increases with segment size
● The easiest anomalies to identify are anomalies in genre or style
● Usefulness of stylistic features, word distributions, different readability measures, ...

IS and MUNI Applications
● Thesis and essay plagiarism
● Discussion forums
● Log entries

Thesis and Essay Plagiarism
● Looking for segments from external sources
● Authorship anomalies
● Identification of machine-translated text
  – Does not have to be plagiarism

Discussion Forums
● Looking for irrelevant posts
● Problems with data
  – Sentence boundaries
  – Special symbols (math, chemistry, ...)
  – Length varies
● Anomalies in content, not in style

Log Entries
● E.g. student behaviour
● Stream processing
● Many similar entries
● Short length of entries

Thank You for Your Attention
Questions?

Appendix [1] – Authorship
Appendix [2] – Fact vs. Opinion
Appendix [3] – Anarchist Cookbook
Appendix [4] – MT
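
The two scores defined on the TxtCompDist Algorithm and Stahel-Donoho Estimator Distance slides can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names `txt_comp_dist` and `sde_dist`, the use of raw feature counts, the Euclidean distance chosen for d, the random unit vectors used to approximate the maximum over a, and the synthetic 50 + 1 segment data are all assumptions made for illustration.

```python
import numpy as np


def txt_comp_dist(counts):
    """Score each segment by its distance from its textual complement.

    counts: (n_segments, n_features) array of raw feature counts.
    Assumption: d is the Euclidean distance between relative frequencies.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=0)
    scores = np.empty(len(counts))
    for i, seg in enumerate(counts):
        comp = total - seg                 # counts over all other segments
        x = seg / max(seg.sum(), 1.0)      # feature vector of the segment
        c = comp / max(comp.sum(), 1.0)    # feature vector of its complement
        scores[i] = np.linalg.norm(x - c)  # d(x, c_x)
    return scores


def sde_dist(V, n_projections=500, seed=0):
    """Approximate SDEDist for every row of V.

    Assumption: the maximum over projection vectors a is approximated
    by sampling random unit vectors instead of solving for the best one.
    """
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(V.shape[1], n_projections))
    A /= np.linalg.norm(A, axis=0)         # one unit vector per column
    P = V @ A                              # column j holds V a_j
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    mad[mad == 0] = np.finfo(float).eps    # avoid division by zero
    return np.max(np.abs(P - med) / mad, axis=1)


# Synthetic test document: 50 normal segments followed by 1 anomalous one.
rng = np.random.default_rng(42)
normal = rng.multinomial(1000, [0.4, 0.3, 0.2, 0.05, 0.05], size=50)
anomalous = rng.multinomial(1000, [0.05, 0.05, 0.2, 0.3, 0.4], size=1)
counts = np.vstack([normal, anomalous])
freqs = counts / counts.sum(axis=1, keepdims=True)

# Rank segments from most to least anomalous under each score.
print("TxtCompDist ranking (top 3):", np.argsort(-txt_comp_dist(counts))[:3])
print("SDEDist ranking (top 3):", np.argsort(-sde_dist(freqs))[:3])
```

Sampling more projection vectors gives a closer approximation of the maximum at a higher running time, which is in line with the higher time cost noted for SDEDist in the conclusions.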