Unsupervised Detection of
Anomalous Text
Part 2
Jozef Štyrák
Contents
● Quick Recap
● Experimental Results
● IS and MUNI Applications
About
● Written by David Guthrie
● Doctoral thesis at the University of
Sheffield, 2008
What is an Anomaly?
● “something that deviates from what is
standard, normal or expected”
● Anomalies in text on segment level
Problem Definition
● Unsupervised Anomaly Detection
● Text is represented by a set of features
● Text is split into segments; each segment
gets a score that measures how much it
deviates from what is normal for the
particular document
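A minimal sketch of the two preparation steps above: splitting a document into segments and turning each segment into a feature vector. The function names, the fixed segment size and the three toy features are illustrative assumptions, not the feature set used in the thesis.

```python
import re

def split_into_segments(text, words_per_segment=100):
    """Split text into consecutive segments of at most words_per_segment words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_segment])
            for i in range(0, len(words), words_per_segment)]

def simple_features(segment):
    """Toy feature vector (assumed): avg word length, type/token ratio, comma rate."""
    words = re.findall(r"[A-Za-z']+", segment)
    if not words:
        return [0.0, 0.0, 0.0]
    avg_word_len = sum(len(w) for w in words) / len(words)
    type_token = len({w.lower() for w in words}) / len(words)
    comma_rate = segment.count(",") / len(words)
    return [avg_word_len, type_token, comma_rate]

# One feature vector per segment; a detection method then scores each vector.
vectors = [simple_features(s) for s in split_into_segments("a b c d e", 2)]
```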
Used Techniques
● ClustDist
● SDEDist
● PCOut
● MeanComp
● TxtCompDist – textual complement
● Baseline – choosing randomly
TxtCompDist Algorithm
● Measures distance from the textual complement
– x – feature vector for the segment
– c_x – feature vector for the segment's complement
● Novel method designed by the author
● Compared with MeanComp, makes better use of
ranked-list features (POS trigrams, adverbs, ...)
$$\mathrm{TxtCompDist}(x, V) = d(x, c_x)$$
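A minimal sketch of the idea: each segment is scored by the distance between its own feature vector and the feature vector of the rest of the document. The slides do not say which distance d is used, so Euclidean distance is an assumption here, and the `features` helper is a hypothetical two-feature stand-in.

```python
import math
import re

def features(text):
    """Hypothetical toy features: average word length, average sentence length."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return [0.0, 0.0]
    return [sum(len(w) for w in words) / len(words),
            len(words) / len(sentences)]

def txt_comp_dist(segments):
    """Score each segment by its distance from its textual complement."""
    scores = []
    for i, segment in enumerate(segments):
        # The complement is everything in the document except this segment.
        complement = " ".join(segments[:i] + segments[i + 1:])
        # Euclidean distance is an assumption; d is not specified on the slide.
        scores.append(math.dist(features(segment), features(complement)))
    return scores

segments = [
    "The cat sat. The dog ran.",
    "A man ate. A boy hid.",
    "Extraordinarily complicated terminology permeates bureaucratic "
    "documentation everywhere.",
]
scores = txt_comp_dist(segments)
most_anomalous = max(range(len(scores)), key=lambda i: scores[i])
```

The complement's features are computed on the complement text as a whole, not averaged over per-segment vectors, which is where this approach differs from a mean-based comparison.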
Stahel-Donoho Estimator Distance
● Idea: find projections of the data that
maximize an observation's distance from the
center of the observations
● Problem: finding the set of projection vectors a
$$\mathrm{SDEDist}(x, V) = \max_{a} \frac{x^{T} a - \mathrm{median}(V a)}{\mathrm{mad}(V a)}$$
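A minimal sketch of how this distance might be approximated. Because the maximum over all directions a cannot be searched exhaustively, the sketch samples random unit vectors; the sampling scheme, the number of directions and the function names are assumptions, not the procedure from the thesis.

```python
import random
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def sde_dist(x, V, n_directions=500, seed=0):
    """Approximate Stahel-Donoho distance of x from the rows of V."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_directions):
        # Random direction, normalised to unit length.
        a = [rng.gauss(0.0, 1.0) for _ in x]
        norm = sum(c * c for c in a) ** 0.5
        a = [c / norm for c in a]
        projected = [sum(v * c for v, c in zip(row, a)) for row in V]
        spread = mad(projected)
        if spread == 0:
            continue  # degenerate projection, skip it
        x_proj = sum(v * c for v, c in zip(x, a))
        # abs() covers both a and -a, so it matches the maximum over all directions.
        best = max(best, abs(x_proj - statistics.median(projected)) / spread)
    return best

# 25 grid points plus one far-away observation.
V = [[float(i % 5), float(i // 5)] for i in range(25)] + [[50.0, 50.0]]
outlier_score = sde_dist(V[-1], V)
inlier_score = sde_dist(V[12], V)
```

Every candidate direction requires projecting all of V and computing two medians, which gives a feel for why the method is comparatively expensive.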
Experimental Setup
● Artificially created Test Document
– 50 normal segments
– 1 anomalous segment
● Output: list of segments ranked by how anomalous
they are with respect to the whole test document
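A minimal sketch of this setup: one anomalous segment is inserted at a random position among the normal segments, and the evaluation checks where it lands in the ranking. The helper names and the random insertion position are assumptions made for illustration.

```python
import random

def build_test_document(normal_segments, anomalous_segment, seed=0):
    """Insert one anomalous segment at a random position among normal ones."""
    rng = random.Random(seed)
    segments = list(normal_segments)
    position = rng.randrange(len(segments) + 1)
    segments.insert(position, anomalous_segment)
    return segments, position

def rank_of_anomaly(scores, position):
    """1-based rank of the anomalous segment, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(position) + 1

normal = [f"normal segment {i}" for i in range(50)]
segments, position = build_test_document(normal, "anomalous segment")

# Placeholder scores standing in for the output of a detection method.
scores = [0.0] * len(segments)
scores[position] = 1.0
rank = rank_of_anomaly(scores, position)
```

With 51 segments, a method that ranks at random puts the anomaly first with probability 1/51, roughly 2%.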
Types of Anomalies
● Authorship anomalies
– 8 Victorian authors
● Factual writing vs. opinion writing anomalies
– Opinion columns
● Subversive article anomalies
– Newswire vs. Anarchist Cookbook
● Machine translation anomalies
– Chinese news translated into English
Authorship Anomalies
Fact vs. Opinion
● Opinion text (editorials, opinion columns) inserted
into factual articles (Gigaword Corpus)
Newswire vs. Anarchist Cookbook
● Anarchist Cookbook – recipes for explosives,
instructions on how to build various devices, …
● Difference in genre
Machine Translation Anomalies
● Translations produced with Google Translate
Conclusions from Experiments [1]
● Best results for detecting anomalies based on
difference in genre or style
– Identification of machine translated text
– Newswire vs. Anarchist Cookbook
● Ranking the anomaly first (Top 1) is a hard task
– Achieved in 96% of cases on the MT task (large segments)
– About 2% by chance (1 of 51 segments)
Conclusions from Experiments [2]
● Best results for TxtCompDist
● SDEDist: higher time cost
Precision & Recall
● What is the probability that a given segment is
anomalous?
● Defining a threshold on the anomaly score
– Maximal precision (100%)
– Best recall possible
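A minimal sketch of one way to read this slide: choose the lowest score threshold at which every segment at or above it is truly anomalous (precision 100%), then report the recall obtained at that threshold. The function name and the labelled score lists are hypothetical.

```python
def max_precision_threshold(scores, labels):
    """Lowest threshold with 100% precision, plus the recall it achieves.

    scores: anomaly score per segment
    labels: True where the segment is truly anomalous
    """
    normal = [s for s, y in zip(scores, labels) if not y]
    anomalous = [s for s, y in zip(scores, labels) if y]
    highest_normal = max(normal, default=float("-inf"))
    # Only anomalies scoring above every normal segment can be kept.
    above = [s for s in anomalous if s > highest_normal]
    if not above:
        return None, 0.0
    return min(above), len(above) / len(anomalous)

scores = [0.9, 0.8, 0.5, 0.4, 0.3]
labels = [True, True, False, True, False]
threshold, recall = max_precision_threshold(scores, labels)
```

In this toy example the anomaly scoring 0.4 falls below the highest normal score, so it is missed and recall stays under 100%.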
Precision & Recall [2]
Feature Selection
● Which features help us to identify an anomaly
and which don't
● Score: difference between feature values for
anomalous segments and normal segments
● Results
– Least effective are emotional features
– Most effective features are basically the same for
all anomaly types; least effective features differ
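A minimal sketch of such a score, assuming it is the absolute difference between a feature's mean value over anomalous segments and over normal segments. The slides do not give the exact formula, so this definition and the function name are assumptions.

```python
from statistics import mean

def feature_scores(normal_vectors, anomalous_vectors, names):
    """Rank features by |mean over anomalous - mean over normal| (assumed score)."""
    scores = {}
    for j, name in enumerate(names):
        normal_mean = mean([v[j] for v in normal_vectors])
        anomalous_mean = mean([v[j] for v in anomalous_vectors])
        scores[name] = abs(anomalous_mean - normal_mean)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical values for two features over a handful of segments.
normal = [[1.0, 5.0], [3.0, 5.0]]
anomalous = [[10.0, 5.5]]
ranked = feature_scores(normal, anomalous, ["feature_a", "feature_b"])
```

Raw differences are only comparable across features that share a scale, so in practice the values would likely need to be standardised first.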
Most Effective Features
● Gunning-Fog Index
● Percentage of passive sentences
● Flesch-Kincaid Reading Ease
● Percentage of sentences longer than 15 words
● Lix Formula
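A minimal sketch of two of these features. Lix is computed with its usual definition (average sentence length plus the percentage of words longer than six letters). The regex-based sentence and word splitting is a simplifying assumption, and the helper names are illustrative.

```python
import re

def _sentences(text):
    """Naive sentence split on ., ! and ? (an assumption, not a real tokenizer)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def _words(text):
    return re.findall(r"[A-Za-z']+", text)

def lix(text):
    """Lix = words per sentence + 100 * (words longer than 6 letters) / words."""
    words = _words(text)
    sentences = _sentences(text)
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100.0 * long_words / len(words)

def pct_long_sentences(text, min_words=15):
    """Percentage of sentences with more than min_words words."""
    sentences = _sentences(text)
    if not sentences:
        return 0.0
    long_count = sum(1 for s in sentences if len(_words(s)) > min_words)
    return 100.0 * long_count / len(sentences)

sample = "The cat sat. Elephants wandered."
sample_lix = lix(sample)
sample_pct = pct_long_sentences(sample)
```

The Gunning-Fog and Flesch-based scores additionally need syllable counts, which is why they are left out of this sketch.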
Least Effective Features
● Words of economic, commercial, industrial
orientation
● Terms denoting kinship
● Words for non-work social rituals
● Words concerned with fetching or carrying
● Words for places occurring in nature
Summary of Conclusions
● Variations in text viewed as outliers
● Best method: comparing each segment with its
textual complement
● Accuracy increases with larger segments
● The easiest anomalies to identify are
anomalies in genre or style
● Usefulness of stylistic features, word
distributions, different readability measures, …
IS and MUNI Applications
● Thesis and essay plagiarism
● Discussion Forum
● Log Entries
Thesis and Essay Plagiarism
● Looking for segments from external sources
● Authorship anomalies
● Identification of machine translated text
– Does not have to be plagiarism
Discussion Forums
● Looking for irrelevant posts
● Problems with data
– Sentence boundaries
– Special symbols (math, chemistry,...)
– Length varies
● Anomalies in content, not in style
Log Entries
● e.g. behaviour of students
● Stream processing
● Many similar entries
● Short length of entries
Thank You for Your Attention
Questions?
Appendix [1] – Authorship
Appendix [2] – Fact vs. Opinion
Appendix [3] – Anarchist Cookbook
Appendix [4] – MT