Topic Similarity in Information Retrieval
Examples and Experience of NLP Centre and LEMMA Projects

Petr Sojka
Laboratory of Electronic and Multimedia Applications and Natural Language Processing Centre, Faculty of Informatics
Masaryk University, Brno, Czech Republic
sojka@fi.muni.cz

PV211 Intro to Information Retrieval: LDA
Petr Sojka: Topic Similarity in Information Retrieval, May 3rd, 2016

Coping with Information Overload by Filtering of Big Data
Life is searching: group similar items and narrow the focus of the search in [your] Big Data.
Similarity types: from plagiarism (similarity on n-grams, narrative similarity, evolved into http://theses.cz) to thematic, topical similarity.

Prehistoric Example: Project Ottův Slovník naučný, 1998
Levels of content processing: strings → words and collocations → semantics (word meaning) → information (knowledge).
Grabbing the essence (content) of documents: topical modelling.

Topical Similarity in Digital Mathematics Library
2005, GVP, Radim Řehůřek and Jan Pomikálek
2006, gensim, different machine learning methods such as Random Projections, TFIDF word weighting, Latent Semantic Indexing/Analysis, Latent Dirichlet Allocation
50,000+ full texts on http://dml.cz

Leading Edge Example: Automated Meaning Picking from Texts

Probabilistic Topical Modelling: Latent Dirichlet Allocation
topic: weighted list of words
document: weighted list of topics

Topical Modelling: Latent Dirichlet Allocation II
All topics are computed automatically from document corpora.

Content Similarity Results in EuDML
Within the European Digital Mathematics Library (EuDML), an EU CIP-ICT-PSP project, we have developed and delivered technology for similarity (gensim), document conversions (Braille) and accessibility (math OCR), and NLP content normalization (Mathml2text).

Math Search Interface EuDML
Demo of math search in EuDML

Digital Library Service Architecture and Workflow (EuDML)
Document engineering and workflows including [Math] OCR.

Digital Library Service Architecture and Workflow (DML-CZ)

Data Visualization and Representation

Award Winning Topic Similarity Framework gensim
Semantic similarity indexing and search of big data (a continuous stream of data).
Client (search) and server (indexing) architecture.
Developed by NLPlab postgraduate student Radim Řehůřek (awarded in the Česká hlava competition in 2011).
Leading edge machine learning methods implemented.
Used in 60+ local, EU or worldwide projects, 260+ citations.
Typical deployment and fine-tuning scenario: expressing data as words (features) → configuration of topic modelling of features → setting of gensim methods and tuning parameters → usage in an application with a proper visualization interface.
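The deployment scenario on the gensim slide (data expressed as word features, a weighting or topic model over those features, then use in a similarity application) can be illustrated with a minimal sketch. It uses only the Python standard library and does not call gensim: plain TF-IDF weighting with cosine similarity stands in for whichever gensim model a real deployment would configure. The function names `tokenize`, `tfidf_vectors` and `cosine`, and the three-sentence corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Express a document as lowercase word features (a deliberately naive tokenizer)."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse {word: weight} TF-IDF vector per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus (an assumption, not data from the slides).
corpus = [
    "topic models group similar documents",
    "similar documents share a topic",
    "braille conversion improves accessibility",
]

vectors = tfidf_vectors([tokenize(d) for d in corpus])
similarities = [[cosine(u, v) for v in vectors] for u in vectors]

# Rank the other documents by similarity to the first one.
ranking = sorted(range(1, len(corpus)), key=lambda i: similarities[0][i], reverse=True)
```

In this toy corpus the first two documents share weighted words and so score above zero, while the third shares no vocabulary with the first and scores exactly zero. Swapping the weighting step for a topic model changes the vectors but leaves the similarity and ranking steps as they are.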
Teaching Laboratory Built on Constructivist Principles
Most work is done by the students themselves, using agile techniques (XP).

Conclusions and Mutual Research Interests
similarity by topical modelling, document filtering and visualization
semantic (meaning) computations and modelling of natural language texts (natural NLP)
personal research interests: random walking for disambiguation, math (tree) indexing and similarity

That's it!
Yes, we can!
Credits: Jiří Franek (illustrations)

Links
NLP Centre: http://nlp.fi.muni.cz/
Topical modelling: https://mir.fi.muni.cz/gensim/
Math Information Retrieval: https://mir.fi.muni.cz
DML-CZ project: http://dml.cz, http://project.dml.cz
EuDML project: http://eudml.cz, http://project.eudml.cz
LEMMA: http://www.fi.muni.cz/lemma/
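Appendix sketch for the Latent Dirichlet Allocation slides: they describe a topic as a weighted list of words and a document as a weighted list of topics. The short example below shows how these two lists combine into a word distribution for one document. All topic names, words and weights are invented for illustration; in real LDA they are inferred automatically from the corpus. The helper `word_probability` is a hypothetical name, not part of any library.

```python
# Hypothetical topics: each is a weighted list of words (weights sum to 1).
topics = {
    "algebra": {"group": 0.5, "ring": 0.3, "field": 0.2},
    "retrieval": {"query": 0.5, "index": 0.3, "field": 0.2},
}

# Hypothetical document: a weighted list of topics (weights sum to 1).
document = {"algebra": 0.75, "retrieval": 0.25}


def word_probability(word, document, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0) for name, weight in document.items()
    )


vocabulary = sorted({w for words in topics.values() for w in words})
distribution = {w: word_probability(w, document, topics) for w in vocabulary}
```

Because both lists are probability distributions, the resulting word weights again sum to one. A word such as "field", which belongs to both topics, collects probability from each of them.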