Plagiarism - PAN 2011
PA164 Machine learning and natural language processing
Miroslav Hlaváček (autumn 2012)

Reasons for new evaluation framework
● few papers focus on plagiarism in text documents
● most work uses a small corpus (most often on the order of 10^3)
● lack of objective and general performance evaluation methods
● availability: authorship issues
● lack of focus on information retrieval

Plagiarism cases in text
● long vs. short
● unobfuscated vs. obfuscated
● obfuscated: simulated vs. artificial (both have advantages and disadvantages)
● PAN-PC-10 corpus
  ● intrinsic (30%) vs. external
  ● intra-topic vs. inter-topic

Obfuscation examples

PAN-PC-10 corpus overview

Evaluation - measures
● sets S (plagiarism cases) and R (detections)
● precision (micro vs. macro version)
● recall (micro vs. macro version)
● granularity, in the range [1, |R|]
● plagdet

Evaluation - results
● testing corpus for each year
● plagdet score
● winner: 500 Euro from Yahoo!

Sources
● http://www.webis.de/research/events/pan-11
● http://www.uni-weimar.de/medien/webis/publications/papers/stein_2010p.pdf

Thank you for your attention.
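
Supplementary note on the evaluation measures

The measures listed under "Evaluation - measures" can be sketched in a few lines of Python. This is a simplified illustration, not the official PAN evaluation script. It assumes that every plagiarism case in S and every detection in R is a half-open character interval (start, end) inside a single suspicious document, and it ignores the source-document side of each case. It uses the macro-averaged versions of precision and recall and combines them as plagdet = F1 / log2(1 + granularity), following the framework paper listed in the sources. The names `covered_chars` and `plagdet` are hypothetical helpers introduced for this example.

```python
import math


def covered_chars(span, others):
    """Count the characters of `span` covered by the union of `others`."""
    start, end = span
    covered = set()
    for o_start, o_end in others:
        covered.update(range(max(start, o_start), min(end, o_end)))
    return len(covered)


def plagdet(cases, detections):
    """Simplified plagdet for one document (assumption: 1-D char intervals).

    cases      -- the set S of true plagiarism cases, as (start, end)
    detections -- the set R of reported detections, as (start, end)
    """
    if not cases or not detections:
        return 0.0

    # Macro precision: average fraction of each detection that is truly plagiarized.
    precision = sum(
        covered_chars(r, cases) / (r[1] - r[0]) for r in detections
    ) / len(detections)

    # Macro recall: average fraction of each case that was detected.
    recall = sum(
        covered_chars(s, detections) / (s[1] - s[0]) for s in cases
    ) / len(cases)

    if precision + recall == 0:
        return 0.0

    # Granularity: average number of detections per detected case (at least 1).
    hits = [
        sum(1 for r in detections if covered_chars(s, [r]) > 0) for s in cases
    ]
    hits = [h for h in hits if h > 0]
    granularity = sum(hits) / len(hits) if hits else 1.0

    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

A detector that reports one case as a single exact detection gets a score of 1. Reporting the same case as two adjacent pieces keeps precision and recall at 1 but raises granularity to 2, so the score falls to 1 / log2(3). This is how plagdet penalizes fragmented detections.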