Miroslav Hlaváček, 359995, autumn 2012

Plagiarism - PAN 2011
International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN)
“PA164 Machine learning and natural language processing” course essay

Abstract:

The web page (source 1) dedicated to the plagiarism part of PAN 2011 describes an evaluation framework designed specifically for the purposes of PAN's text plagiarism detection task, together with the results of the 2011 participants. The PAN plagiarism detection “framework” consists of a corpus designed as a learning corpus for plagiarism detection and of performance measures that allow an objective comparison of plagiarism detectors.

Reasons for building the new corpus PAN-PC-10 (compared with the academic papers available in 2010):
• few papers focused on plagiarism in text documents (most papers dealt with plagiarism in source code)
• most papers refer to small corpora (most often on the order of 10^3 documents, because the documents were collected locally)
• lack of objective and general evaluation methods
• availability and authorship issues (approval is needed from both the author and the plagiarist)
• lack of focus on the information retrieval aspect (a plagiarism case should be detected only once and in its full length)

The methods of building the corpus and of inserting plagiarism cases into it are also discussed. Plagiarism cases can then be classified from several points of view: long vs. short; intra-topic vs. inter-topic (the documents are clustered into several topics); intrinsic (no external knowledge is used and the detector tries to identify stylistic discrepancies within a suspicious document) vs. external; and unobfuscated (“copy-paste style”) vs. obfuscated (the plagiarized passage keeps the same meaning, although different words or a different word ordering is used).

Several methods for generating obfuscated plagiarism were used (examples can be found in source 2). These methods can be divided into:
• simulated – the text is rewritten by a human who is paid for the task (via Amazon’s Mechanical Turk); problems: the right amount of payment per task has to be determined, these plagiarists were usually well educated, …
• artificial – the text is generated by a computer (random text operations, semantic word variation, POS-preserving word shuffling); the problems are connected with the fact that the computer does not understand the text (a sketch of one such operation is given after the sources below)

An overview of the PAN-PC-10 corpus can be found in source 2. For the evaluation, new metrics were proposed (an illustrative computation is also given after the sources below):
• granularity – determines whether a plagiarism case is detected only once (granularity 1, the best possible value) or several times; it is the average number of detections (from the detection set R) per detected real plagiarism case (from the case set S), so the worst value is reached when many detections all denote one and the same real case; granularity therefore measures performance on the information retrieval part of the plagiarism detection task
• plagdet – a new performance measure that combines recall, precision, and granularity: plagdet = F_α / log2(1 + granularity), where F_α is the F-measure of precision and recall with parameter α; the logarithm of granularity is used to reduce its influence to a reasonable level

The web page (source 1) lists the results, ranked by plagdet score, for both tasks: intrinsic plagiarism detection (30% of the corpus) and external plagiarism detection. The evaluation corpus is different in each year of PAN, which means that the winning performance may vary between years as well. The winner was awarded a cash prize of 500 Euro. The best results for the year 2011 are:

Sources:
1. http://www.webis.de/research/events/pan-11
2. http://www.uni-weimar.de/medien/webis/publications/papers/stein_2010p.pdf
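
A minimal Python sketch of the “random text operations” obfuscation strategy mentioned above: a copied passage is degraded by randomly shuffling, deleting, and re-inserting words so that verbatim matching fails while most of the wording survives. The function name, the operation mix, and the parameters are my own illustrative assumptions, not the actual PAN-PC-10 generator (described in source 2).

import random

def random_text_obfuscation(text, n_ops=10, seed=None):
    # Apply n_ops random word-level edit operations to the passage.
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_ops):
        op = rng.choice(("shuffle", "delete", "insert"))
        i = rng.randrange(len(words))
        if op == "shuffle" and len(words) > 1:
            j = rng.randrange(len(words))
            words[i], words[j] = words[j], words[i]  # swap two words
        elif op == "delete" and len(words) > 1:
            del words[i]
        else:  # re-insert a word drawn from the passage itself
            words.insert(i, rng.choice(words))
    return " ".join(words)

print(random_text_obfuscation(
    "the quick brown fox jumps over the lazy dog", n_ops=4, seed=1))

Because the operations ignore grammar and meaning, the output is recognizably degraded; this is exactly the weakness noted above, that the computer does not understand the text.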
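
A minimal Python sketch of the evaluation measures described above. It uses a simplified character-level (micro-averaged) formulation of precision and recall; the span representation, function names, and example values are illustrative assumptions, not the official PAN evaluation tool.

from math import log2

def chars(span):
    # Character positions covered by a (start, end) span.
    return set(range(*span))

def evaluate(cases, detections, alpha=0.5):
    case_chars = [chars(s) for s in cases]
    det_chars = [chars(r) for r in detections]

    plag = set().union(*case_chars) if case_chars else set()
    found = set().union(*det_chars) if det_chars else set()

    # Character-level recall and precision (a micro-averaged
    # simplification of the measures defined in source 2).
    recall = len(plag & found) / len(plag) if plag else 0.0
    precision = len(plag & found) / len(found) if found else 0.0

    # Granularity: average number of detections per detected case;
    # 1.0 is the best possible value (each case reported exactly once).
    detected = [s for s in case_chars if any(s & r for r in det_chars)]
    gran = (sum(sum(1 for r in det_chars if s & r) for s in detected)
            / len(detected)) if detected else 1.0

    # plagdet = F_alpha / log2(1 + granularity): the logarithm damps
    # the influence of granularity on the combined score.
    f = (1 / (alpha / precision + (1 - alpha) / recall)
         if precision and recall else 0.0)
    return precision, recall, gran, f / log2(1 + gran)

# One real case (characters 100-200) reported as two separate
# detections: recall and precision are perfect, but granularity 2
# lowers plagdet to 1 / log2(3), roughly 0.63.
print(evaluate(cases=[(100, 200)], detections=[(100, 150), (150, 200)]))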