PA153: Stylometric analysis of texts using machine learning techniques
Jan Rygl rygl@fi.muni.cz
NLP Centre, Faculty of Informatics, Masaryk University
Dec 7, 2016
Stylometry
Stylometry is the application of the study of linguistic style.
Study of linguistic style:
• Find out text features.
• Define the author's writeprint.
Applications:
• Define the author (person, nationality, age group, ...).
• Filter out text features not usable by the selected application.
Examples of application:
• Authorship recognition
  • Legal documents (verify the author of a last will)
  • False reviews (cluster accounts by real authors)
  • Public security (find authors of anonymous illegal documents and threats)
  • School essays authorship verification (co-authorship)
  • Supportive authentication, biometrics (e-learning)
• Age detection (paedophile recognition on children's web sites)
• Author's mother language prediction (public security)
• Mental disease symptom detection (health prevention)
• HR applications (find out personal traits from text)
• Automatic translation recognition
Stylometry analysis techniques
• ideological and thematic analysis
  (historical documents, literature)
• documentary and factual evidence
  (inquisition in the Middle Ages, libraries)
• language and stylistic analysis
  • manual (legal, public security and literary applications)
  • semi-automatic (same as above)
  • automatic (false reviews and generally all online stylometry applications)
Stylometry
Verification
Definition
• decide if two documents were written by the same author category (1v1)
• decide if a document was written by the signed author category (1vN)
Examples
• The Shakespeare authorship question
• The verification of wills
Mendenhall, T. C. 1887.
The Characteristic Curves of Composition.
Science Vol 9: 237-49.
• The first algorithmic analysis
• Calculating and comparing histograms of word lengths
(Candidates: Oxford, Bacon, Derby, Marlowe)
http://en.wikipedia.org/wiki/File:ShakespeareCandidates1.jpg
Stylometry
Attribution
Definition
• find out an author category of a document
• candidate authors' categories can be known (e.g. age groups, healthy/unhealthy person)
• problems with unknown candidate author categories are hard (e.g. online authorship, all clustering tasks)
Examples
• Anonymous e-mails
• Judiciary
• Falsified police testimonies
Morton, A. Q. Word Detective Proves the Bard wasn't Bacon. Observer, 1976.
• Evidence in courts of law in Britain, the U.S., Australia
• Expert analysis of courtroom discourse, e.g. testing "patterns of deceit" hypotheses
Stylometry
NLP Centre stylometry research
Authorship Recognition Tool
• Ministry of the Interior of the CR within the project VF20102014003
• Best security research award by the Minister of the Interior
Small projects (bachelor and diploma theses, papers)
• detection of automatic translation, gender detection, ...
Text Miner
• multilingual stylometry tool + many other features not related to stylometry
• authorship, mother language, age, gender, social group detection
Updated definition
techniques that allow us to find out information about the authors of texts on the basis of an automatic linguistic analysis
Stylometry process steps
1. data acquisition - obtain and preprocess data
2. feature extraction methods - get features from texts
3. machine learning - train and tune classifiers
4. interpretation of results - make machine learning reasoning readable by humans
• Enron e-mail corpus
• Blog corpus (Koppel, M., Effects of Age and Gender on Blogging)
Manually annotated corpora
• UCNK school essays
Techniques
Data acquisition - preprocessing
Tokenization, morphological annotation and disambiguation
• morphological analysis and disambiguation of a Czech example sentence (word → lemma, morphological tag), e.g.:
  Jde → jít, k5eAaImIp3nS
  spor → spor, k1gInSc1
  mezi → mezi, k7c7
  Severem → sever, k1gInSc7
  a → a, k8xC
  Jihem → jih, k1gInSc7
Techniques
Selection of feature extraction methods
Categories
• Morphological
• Syntactic
• Vocabulary
• Other
Analyse the problem and select only suitable features. Combine with automatic feature selection techniques (entropy); a sketch follows below.
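A minimal sketch of such entropy-based selection using scikit-learn's mutual information scorer; the feature matrix X and author labels y below are random placeholders for real stylometric features.

# Entropy-based feature selection (mutual information) with scikit-learn;
# X and y are hypothetical stand-ins for a document-by-feature matrix
# and author labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.RandomState(0)
X = rng.rand(100, 50)          # 100 documents, 50 stylometric features
y = rng.randint(0, 4, 100)     # 4 candidate authors

# Keep the 10 features that share the most information with the labels.
selector = SelectKBest(mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # (100, 10)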
Techniques
Tuning of feature extraction methods
Tuning process
Divide the data into three independent sets:
• Tuning set (generate stopwords, part-of-speech n-grams, ...)
• Training set (train a classifier)
• Test set (evaluate a classifier)
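A minimal sketch of the three-way split with scikit-learn; the documents, labels and the 60/20/20 proportions are invented placeholders.

# Three-way split (tuning / training / test) using train_test_split;
# `documents` and `labels` are hypothetical texts and author ids.
from sklearn.model_selection import train_test_split

documents = ["text %d" % i for i in range(100)]
labels = [i % 4 for i in range(100)]

# First split off the tuning set, then divide the rest into train/test.
docs_rest, docs_tune, y_rest, y_tune = train_test_split(
    documents, labels, test_size=0.2, random_state=42)
docs_train, docs_test, y_train, y_test = train_test_split(
    docs_rest, y_rest, test_size=0.25, random_state=42)

# Tuning set: derive stopword lists, POS n-grams, ...
# Training set: fit the classifier.  Test set: evaluate it once.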
Techniques
Features examples
Word length statistics
• Count and normalize frequencies of selected word lengths (e.g. 1-15 characters)
• Modification: word-length frequencies are influenced by adjacent frequencies in the histogram, e.g. 1: 30%, 2: 70%, 3: 0% is more similar to 1: 70%, 2: 30%, 3: 0% than to 1: 0%, 2: 60%, 3: 40%
Sentence length statistics
• Count and normalize frequencies of
  • words per sentence
  • characters per sentence
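A minimal sketch of both length statistics in Python; the regex tokenizer and sentence splitter are simplifying assumptions, not the preprocessing used in the lecture.

# Word-length and sentence-length histogram features.
import re
import numpy as np

def word_length_features(text, max_len=15):
    """Normalized frequencies of word lengths 1..max_len."""
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    counts = np.zeros(max_len)
    for w in words:
        if 1 <= len(w) <= max_len:
            counts[len(w) - 1] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def sentence_length_features(text, bins=(5, 10, 20, 40)):
    """Normalized histogram of words per sentence, bucketed by `bins`."""
    sentences = re.split(r"[.!?]+", text)
    lengths = [len(re.findall(r"\w+", s)) for s in sentences if s.strip()]
    counts = np.zeros(len(bins) + 1)
    for n in lengths:
        counts[np.searchsorted(bins, n)] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

print(word_length_features("A dispute between the North and the South."))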
Techniques
Features examples
Stopwords
• Count the normalized frequency for each word from a stopword list
• Stopword ≈ general word whose semantic meaning is not important, e.g. prepositions, conjunctions, ...
• the stopwords ten, by, člověk, že are the most frequent in the selected five texts of Karel Čapek
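A minimal sketch of stopword-frequency features; the tiny stopword list and tokenizer are illustrative only, not the resources used in the lecture.

# Stopword frequency features, normalized by document length.
import re
import numpy as np

STOPWORDS = ["ten", "by", "člověk", "že", "a", "se"]   # hypothetical list

def stopword_features(text):
    """Frequency of each stopword, normalized by the number of tokens."""
    tokens = [t.lower() for t in re.findall(r"\w+", text, flags=re.UNICODE)]
    counts = np.array([tokens.count(sw) for sw in STOPWORDS], dtype=float)
    return counts / len(tokens) if tokens else counts

print(stopword_features("Že by to byl ten člověk, že by se na to díval?"))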
Wordclass (bigrams) statistics
• Count and normalize frequencies of wordclasses (wordclass bigrams)
• a verb is followed by a noun with the same frequency in the selected five texts of Karel Čapek
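A minimal sketch of wordclass-bigram frequencies, assuming sentences have already been converted to sequences of wordclass tags (k1, k5, ...) by the tagger; the input data are invented.

# Normalized frequencies of adjacent wordclass pairs.
from collections import Counter

def wordclass_bigram_features(tag_sentences):
    """Normalized frequencies of adjacent wordclass pairs."""
    counts = Counter()
    for tags in tag_sentences:
        for first, second in zip(tags, tags[1:]):
            counts[(first, second)] += 1
    total = float(sum(counts.values()))
    return {bigram: n / total for bigram, n in counts.items()} if total else {}

print(wordclass_bigram_features([["k5", "k1", "k7", "k1"], ["k1", "k5", "k1"]]))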
Morphological tags statistics
• Count and normalize frequencies of selected morphological tags
• the frequency of the grammatical gender (genus) tag is the most consistent one in the selected five texts of Karel Čapek
Word repetition
• Analyse which words or wordclasses are frequently repeated within a sentence
• nouns, verbs and pronouns are the most repetitive in the selected five texts of Karel Čapek
Techniques
Features examples
Extract features using SET (Syntactic Engineering Tool)
(Figure: syntactic tree of the Czech sentence "Verifikujeme autorství se syntaktickou analýzou" - "We verify authorship with syntactic analysis".)
• syntactic trees have a similar depth in the selected five texts of Karel Čapek
Techniques
Features examples
Other stylometric features
• typography (number of dots, spaces, emoticons, ...)
• errors
• vocabulary richness
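A minimal sketch of such "other" features: typography counts and a type-token ratio as a rough vocabulary-richness measure; the regex tokenizer and emoticon pattern are simplifying assumptions.

# Typography and vocabulary richness features.
import re

def other_features(text):
    tokens = [t.lower() for t in re.findall(r"\w+", text, flags=re.UNICODE)]
    return {
        "dots": text.count("."),
        "spaces": text.count(" "),
        "emoticons": len(re.findall(r"[:;]-?[)(DP]", text)),
        # type-token ratio as a rough measure of vocabulary richness
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }

print(other_features("Well... this is nice :-) Really nice."))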
Techniques
Features examples
Implementation
import numpy as np

features = (u'kA', u'kY', u'kI', u'k?', u'k0', u'k1', u'k2',
            u'k3', u'k4', u'k5', u'k6', u'k7', u'k8', u'k9')

def document_to_features(self, document):
    """Transform document to a tuple of float features.

    @return: tuple of n float feature values, n = |get_features|
    """
    features = np.zeros(self.features_count)
    sentences = self.get_structure(document, mode='tag')
    for sentence in sentences:
        for tag in sentence:
            # count only wordclass tags (k1, k2, ..., kA, ...)
            if tag and tag[0] == u'k':
                key = self.tag_to_index.get(tag[:2])
                if key is not None:   # index 0 is a valid feature
                    features[key] += 1.0
    total = np.sum(features)
    if total > 0:
        return features / total
    return features
Tools
• use frameworks over your own implementation (ML is HW-consuming and needs to be optimized)
• the programming language doesn't matter, but high-level languages can be better (readability is important and performance is not affected - ML frameworks usually use C libraries)
• for Python, a good choice is scikit-learn (http://scikit-learn.org)
Machine learning tuning
• try different machine learning techniques (Support Vector Machines, Random Forests, Neural Networks)
• use grid search/random search/other heuristic searches to find optimal parameters (use cross-validation on the train data)
• but start with the fast and easy-to-configure ones (Naive Bayes, Decision Trees)
• feature selection (more is not better)
• make experiments replicable (use a random seed), repeat experiments with different seeds to check their performance
• always implement a baseline algorithm (random answer, constant answer); a sketch of this workflow follows below
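A minimal sketch of this advice with scikit-learn (baseline, fast first classifier, grid search with cross-validation); the data, models and parameter grid are placeholders.

# Baseline, quick classifier and tuned classifier, all cross-validated.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.RandomState(42)            # fixed seed => replicable
X = rng.rand(200, 20)
y = rng.randint(0, 2, 200)

# Always report a baseline (constant answer) first.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
quick = cross_val_score(GaussianNB(), X, y, cv=5)   # fast, easy to configure

# Then tune a stronger model with grid search on the training data.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(baseline.mean(), quick.mean(), grid.best_params_)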
Techniques
Machine learning tricks
Replace feature values by ranking of feature values
• Book: long coherent text
• Blog: medium-length text
• E-mail: short noisy text
• Different "document conditions" are considered
• Attribution: replace similarity by ranking of the author against other authors
• Verification: select random similar documents from corpus and replace similarity by ranking of the document against these selected documents
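A minimal sketch of the ranking trick using SciPy; the similarity scores are made up for illustration.

# Replace raw similarity scores, which are sensitive to document length
# and genre, by the rank of each candidate.
from scipy.stats import rankdata

# similarity of the questioned document to each candidate author
similarities = [0.31, 0.78, 0.55, 0.42]

# higher similarity => better (lower) rank
ranks = rankdata([-s for s in similarities], method="min")
print(ranks)   # [4. 1. 2. 3.] -> candidate 2 is ranked first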
Techniques
Interpretation of results
Machine learning readable
Explanation of ML reasoning can be important. We can:
1. not interpret the data at all (then we can't enforce any consequences)
2. use one classifier per feature category and use the feature category results as a partially human-readable solution
3. use ML techniques which can be interpreted:
   • Linear classifiers: each feature f has a weight w(f) and a document value val(f); the decision is Σ_{f ∈ F} w(f) · val(f) > threshold (see the sketch below)
   • Extensions of black-box classifiers, e.g. for random forests https://github.com/janrygl/treeinterpreter
4. use another statistical module not connected to ML at all
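A minimal sketch of interpreting a linear classifier via per-feature contributions w(f) · val(f); the data and feature names are placeholders, and LogisticRegression stands in for any linear model.

# Per-feature contributions of a linear classifier's decision.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = rng.randint(0, 2, 100)
names = ["word_length", "stopword_freq", "sentence_length"]

clf = LogisticRegression().fit(X, y)
document = X[0]
contributions = clf.coef_[0] * document          # w(f) * val(f) per feature
for name, c in sorted(zip(names, contributions), key=lambda p: -abs(p[1])):
    print("%s: %+.3f" % (name, c))
print("decision value:", contributions.sum() + clf.intercept_[0])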
Performance (Czech texts)
Balanced accuracy: Current (CS) → Desired (EN)
Verification:
(Text types considered: books, essays, newspapers, blogs, letters, e-mails, discussions, SMS.)
• books, essays: 95% → 99%
• blogs, articles: 70% → 90%
Attribution (depends on the number of candidates, comparison on blogs):
• up to 4 candidates: 80% → 95%
• up to 100 candidates: 40% → 60%
Clustering:
• the evaluation metric depends on the scenario (50-60%)
Results
I want to try it myself
How to start
• Select a problem
• Collect data (gender detection data are easy to find - crawl a dating service)
• Preprocess the texts (remove HTML, tokenize)
• Write a few feature extraction methods
• Use an ML framework to classify the data (a minimal sketch follows below)
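A minimal end-to-end sketch of these steps with scikit-learn; the texts, labels and character n-gram features are toy choices for illustration, not the lecture's setup.

# Toy end-to-end pipeline: texts -> features -> classifier -> prediction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I really enjoyed the match yesterday!",
         "The defendant hereby declares the following.",
         "Best game ever, see you at the stadium!!",
         "The parties agree to the terms stated herein."]
labels = ["fan", "lawyer", "fan", "lawyer"]      # toy author categories

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),  # simple features
    LogisticRegression())
pipeline.fit(texts, labels)
print(pipeline.predict(["See you at the game!"]))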
Results
I want to try it really quick
Quick start
Style & Identity Recognizer
https://github.com/janrygl/sir
• In development, but functional.
• Contains data from dating services.
• Contains feature extractors.
• Uses free RFTagger for morphology tagging.
Results
Development at FI
Text Miner
• more languages,
• more feature extractors,
• more machine learning experiments,
• better visualization,
• and much more
Thank you for your attention