The European Digital Mathematics Library: An Overview of Math Specific Technologies

Petr Sojka
Masaryk University, Faculty of Informatics, Brno, Czech Republic

National Institute of Informatics, Tokyo, June 24th, 2013, 1:30PM

Outline and take-home message
1. Pictorial overview
2. Motivation, vision of WDML, PubMed Central for Mathematics
3. Data aggregation from local DMLs
4. Conversions
5. Search
6. Similarity
7. Conclusions

Towards the dream of math-aware WDML: EuDML

Information overload in a globalized scientific world

Mathematics should follow other sciences (HEP, PMC,…)
The European Digital Mathematics Library: EuDML

‘Bottom up’ deployment towards EU or worldwide scale

EuDML: from local data collections to the virtual DL

From paper to digital workflow

Retro-digitization, accessible digital library development

Experiences from project DML-CZ for EuDML (Brno, CZ)
EuDML: new approaches to math document retrieval

New approaches to math-aware similarity, clustering and accessibility

Tools for automated math extraction from PDF

Yes, you can!: accessible math, search, visibility, scalability,… TeX

End of talk overview

History of the dream: vision of WDML as PubMed 4 Math
In the beginning was the vision of all mathematical knowledge, peer reviewed and verified (100,000,000 pages), engineered into a one-stop e-shop/DL.
The AMS supported an NSF preparation grant (in 2003) for WDML, the Worldwide Digital Mathematics Library, planned to be funded by the Moore Foundation ($100,000,000 requested). The application was not successful.
Publishers started massive digitization themselves.
Other attempts at the European level (FP5, FP6) were not successful either.

Vision of European Digital Mathematics Library
Finally, a three-year project for the European Digital Mathematics Library, EuDML (programme EU CIP-ICT-PSP, type Pilot B, EU contribution 1.6 MEur, only 50% of the total budget), February 2010–January 2013.
The strategy of EuDML was:
• to master the technology, develop tools and offer them;
• the concept of a moving wall, to motivate and engage commercial publishers without an Open Access business model;
• to collect data from existing local or publishers’ digital libraries into a ‘one-stop shop’ and achieve critical mass in the domain → a ‘must/me too’ effect, as with PubMed Central.

EuDML as a virtual library portal
EuDML provides a virtual library based on data from smaller data providers, DLs and publishers:
One portal: European Digital Mathematics Library

Aggregation of data from building bricks of regional repositories
14 data and technology providers plus associated partners such as ZMath, Göttingen library,…
DML content providers mostly serve publishers’ or regional, more or less established DML repositories: the Czech Digital Mathematics Library DML-CZ, NUMDAM, DML-PL, DML-PT, DML-GR, DML-BG, DML-ES,…
Aggregation via the standard OAI-PMH protocol (OAI servers run by data providers).
The EuDML metadata schema was borrowed from NLM (heavily funded by the US NIH), as it also allows math-awareness (e.g. math stored both in TeX and MathML) and fully fledged reference lists.
Innovation, rather than research.
Example: DML-CZ, with 30,000+ papers (300,000+ pages). For more, see (who, what, browse, browse similar, how to search).

Challenges of Math handling: OCR, indexing, search…
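The OAI-PMH aggregation described above can be illustrated with a minimal Python sketch. It only builds a `ListRecords` request URL and parses a response for record identifiers and the resumption token. The base URL, the sample response and the function names are hypothetical and are not taken from the EuDML code base; `oai_dc` is the metadata prefix every OAI-PMH repository has to support, whereas EuDML itself harvests an NLM-based schema.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH server."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (list of record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    path = ".//oai:record/oai:header/oai:identifier"
    identifiers = [e.text for e in root.iterfind(path, OAI_NS)]
    token = root.find(".//oai:resumptionToken", OAI_NS)
    next_token = token.text if token is not None and token.text else None
    return identifiers, next_token


# Hypothetical, heavily shortened response used instead of a network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://dml.example.org/oai", "oai_dc"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A harvester would repeat the request with the returned resumption token until the server stops sending one.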
Take care! “God is in the details.” (Mies van der Rohe)

Data heterogeneity, specificity: no free lunch to unify

Document accessibility 4 DML processing challenges
Conversions (inversion of authoring + typesetting) are needed from:
• born-digital period: typesetting by TeX with export of [meta]data into the digital library: maxTract
• retro-digital period: scanning, geometrical transformations (BookRestorer), OCR (FineReader, InftyReader), two-layer PDF
• retro-born-digital period: incomplete .tex or .dvi data, bad formats, bitmap fonts of low resolution: finally Tesseract and
[Figure: document processing workflow with three input paths (retro-digitised, retro-born digital, born digital); labels include printed document, image capture, image extraction, OCR, PDF/PS, TeX, analysis, conversion, refinement, data and metadata.]

From PDF to MathML (via LaTeX)
Most fulltext is available as PDF only, often as low-quality scanned volume pages.
Aggregation via IP-protected OAI-PMH, including the PDFs behind the moving wall.
The workflow is based, in the case of:
• born-digital PDFs: on maxTract, otherwise on PDFBox (plain text);
• bitmap PDFs: on Infty, otherwise on Tesseract (no math).

Infty from Fukuoka
Run in parallel in Brno, Grenoble and Lisbon to speed up.
Almost 200K papers (more than 1M pages, and still running).
Working with prof. Suzuki to improve further (automation, support for Russian, LaTeX driver,…).
Automated only: no time (and money) to fix OCR errors.
MathML output is used for [internal] indexing and similarity computations only, not for metadata or export.

maxTract from Birmingham

maxTract from Birmingham II: adding accessibility
Adding accessibility to mathematical documents on multiple levels:
• access to content for print-impaired users, such as those with visual impairments, dyslexia or dyspraxia
• output compatible with web browsers, screen readers and tools such as copy and paste, which is achieved by enriching the regular text with mathematical markup.
The output can also be used directly, within the limits of the presentation MathML produced, as machine-readable mathematical input to software systems such as Mathematica or Maple.
On EuDML, 10k+ fulltexts are served, mostly for reading in Chrome (HTML5 output) and/or Adobe Acrobat Reader (as multiple-layer PDFs [no tagged PDFs yet]).

Metadata and conversions: MathML and LaTeX!
Data heterogeneity, plethora of formats, validation and conversions:
• world of authors: LaTeX, TeX notation of mathematics
• world of applications/data exchange: XML, MathML
REPOX engine (by IST Lisbon) to remap different metadata formats to a single representation.
Metadata on the web (W3C standards): MathML, WAI-ARIA (Web Accessibility Initiative – Accessible Rich Internet Applications), WCAG (Web Content Accessibility Guidelines) 2.0.
Big volumes → high automation to save costs: converting to MathML (via Tralics) to allow discoverability and indexing (formulae similarity search).
130+K fulltexts with MathML, Infty still running…

Why Search?
Vast amounts of [moving] content in digital libraries: from browsing to search; from static links to indirect search links, or even semantic search.
Searching is a crucial part of accessibility and exploration of the great ideas around, carved into 0s and 1s.
Pragmatic decisions on math indexing level: presentation vs. content vs. semantic.
In EuDML, the first step: scalable presentation (structural) indexing, with methods (tree indexing and weighting) extendable to content or semantic indexing.

Why Math Search (MIR)?
A picture is worth a thousand words.
“A math formula is worth hundreds of words.” (Ross Moore)
There are papers with more formulae than plain text.
Precision vs. recall optimisation: optimizing recall is better for exploratory searching (we have not opted for precision as the holy grail at the moment).

Motivation for MSE (including formulae) – cont.
Prof. James Davenport, CEIC member and MKM 2011 PC chair, on a panel at the EuDML workshop in Bertinoro, in reply to the question “What functionality and incentives would make a working mathematician log in and use a modern DML such as EuDML?”:
“Math formulae search.”

Why is math search more relevant now than ever?
• Allowing formulae in queries helps to disambiguate and narrow the search. Sometimes the only difference among a set of notions/keywords is in a math formula.
• Example 1: knowing the solution of a partial differential equation in L1(C3), is there one in L2(C5)?
• Example 2: historians may want to follow the history of a (class of) formula(s) across languages and vocabularies (e.g. the same objects studied/used by physicists and mathematicians under different names).
• Imagine your favourite math textbook ebook being [TeX]-search aware, e.g. your search app supports math formulae search.

We did not start from scratch
Compare google.com/search?q=Einstein with a math-aware search for Einstein+$E=mc^2$ over arXiv.
Existing systems – pros and cons
• MathDex: formerly MathFind
  * seven-digit NSF grant by Design Science (Robert Miner)
  * Lucene based, indexing n-grams of presentation MathML
  * pioneering conversion effort
• EgoMath and EgoMath2: based on the full-text web search system Egothor
  * presentation MathML for indexing
  * idea of formulae augmentation, α-equivalence algorithms and relevance calculation
• LaTeXSearch: MSE offered by Springer
  * closed source
  * only for LaTeX math; approximate matching based on strings
  * no formulae structure matching
  * small database: 3 million formulae from ‘random’ sources
• LeActiveMath: indexing string tokens from OMDoc with OpenMath semantic notation
  * only for documents authored for the LeActiveMath learning environment
• DLMF: only for documents authored for DLMF in special markup
  * equation search
• MathWeb Search: semantic approach – uses substitution trees – not based on full-text searching
  * supports Content MathML and OpenMath
  * problem with acquiring semantic data

MIaS – Math Indexer and Searcher
• math-aware, full-text based search engine
• joins textual and mathematical querying
• MathML or TeX input
MSE overall design
[Figure: MIaS architecture. Labels: input document, document handler, input query, canonicalization, math processing (tokenization, unification), indexer, index, searcher, text terms, query results; indexing and searching paths built on Lucene Core.]
Preprocessing into canonicalized presentation MathML

Math indexing design
[Figure: the same architecture with a canonicalized input document; the math processing stage is expanded into ordering, tokenization, variables unification, constants unification and weighting, used for both indexing and searching.]

Example
Math processing stages: ordering, tokenization, variables unification, constants unification, weighting.
Indexing, input $xy + y^3$:
• tokenization: $xy + y^3$, $xy$, $y^3$, $x$, $y$, $3$, $+$
• variables unification adds: $id_1 id_2 + id_2^3$, $id_1 id_2$, $id_1^3$
• constants unification adds: $xy + y^{const}$, $y^{const}$, $id_1 id_2 + id_2^{const}$, $id_1^{const}$
Searching, query $xy + y^2$:
• variables unification adds: $id_1 id_2 + id_2^2$
• constants unification adds: $xy + y^{const}$, $id_1 id_2 + id_2^{const}$
Match! The query and the indexed formula share $xy + y^{const}$ and $id_1 id_2 + id_2^{const}$.
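The unification example above can be reproduced with a small Python sketch. It is a simplification under stated assumptions, not the MIaS implementation (which is written in Java and works on presentation MathML): formulae are hypothetical nested tuples `(operator, operand, ...)`, variables are strings, constants are integers, the ordering step is omitted, operators are not indexed as separate tokens, and every subformula is unified.

```python
def show(node):
    """Render a formula tree as a fully parenthesized string."""
    if isinstance(node, tuple):
        op, *args = node
        return "(" + f" {op} ".join(show(a) for a in args) + ")"
    return str(node)


def subformulae(node):
    """Tokenization: yield the formula and all of its subtrees."""
    yield node
    if isinstance(node, tuple):
        for child in node[1:]:
            yield from subformulae(child)


def unify_variables(node, names=None):
    """Rename variables to id1, id2, ... in order of first appearance."""
    names = {} if names is None else names
    if isinstance(node, tuple):
        return (node[0],) + tuple(unify_variables(c, names) for c in node[1:])
    if isinstance(node, str):
        return names.setdefault(node, f"id{len(names) + 1}")
    return node


def unify_constants(node):
    """Replace every numeric constant by the placeholder 'const'."""
    if isinstance(node, tuple):
        return (node[0],) + tuple(unify_constants(c) for c in node[1:])
    return "const" if isinstance(node, int) else node


def variants(formula):
    """Original, variable-unified, constant-unified and doubly unified forms."""
    unified = unify_variables(formula)
    forms = (formula, unified, unify_constants(formula), unify_constants(unified))
    return {show(f) for f in forms}


def index_terms(formula):
    """Terms stored in the index: variants of every subformula."""
    terms = set()
    for sub in subformulae(formula):
        terms |= variants(sub)
    return terms


def query_terms(formula):
    """Terms searched for: variants of the whole query formula only."""
    return variants(formula)


indexed = ("+", ("*", "x", "y"), ("^", "y", 3))  # x y + y^3
query = ("+", ("*", "x", "y"), ("^", "y", 2))    # x y + y^2

if __name__ == "__main__":
    for term in sorted(index_terms(indexed) & query_terms(query)):
        print("match:", term)
```

The exact query string is not in the index, so the match comes only from the constant-unified forms, which is the behaviour the slide illustrates.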
Formula processing example – subformulae weighting
• input: ($a + b^{2+c}$, 0.125)
• ordering: ($a + b^{c+2}$, 0.125)
• tokenization: ($a$, 0.0875), ($+$, 0.0875), ($b^{c+2}$, 0.0875), ($b$, 0.06125), ($c+2$, 0.06125), ($c$, 0.042875), ($+$, 0.042875), ($2$, 0.042875)
• variables unification: ($id_1 + id_2^{id_3+2}$, 0.1), ($id_1^{id_2+2}$, 0.07), ($id_1+2$, 0.0343)
• constants unification: ($a + b^{c+const}$, 0.0625), ($b^{c+const}$, 0.04375), ($c+const$, 0.030625), ($id_1 + id_2^{id_3+const}$, 0.05), ($id_1^{id_2+const}$, 0.035), ($id_1+const$, 0.01715)

Implementation
• Java
• Lucene 3.1.0, now switching to Lucene/Solr 4
• The mathematical part implements Lucene’s Tokenizer interface – it can be integrated into any Lucene-based system
• MIaS4Solr plugin was created for use in Solr
• Textual content is processed by StandardAnalyzer
• easily deployable in a Java/Lucene-based system or as a web service

Search demonstration
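Returning to the subformulae weighting example: most of the numbers on that slide are reproduced by multiplying the formula weight by 0.7 per nesting level, by 0.8 for variables unification and by 0.5 for constants unification. These three coefficients are inferred from the slide values and are an assumption, not parameters stated in the talk; the base weight 0.125 is taken from the slide as given, and the slide's value for $id_1+2$ does not follow this simple rule. A minimal Python sketch of the inferred rule:

```python
import math

# Coefficients inferred from the slide numbers (assumptions, see lead-in).
LEVEL_COEF = 0.7   # applied once per nesting level below the whole formula
VAR_COEF = 0.8     # applied when variables are unified
CONST_COEF = 0.5   # applied when constants are unified


def term_weight(base, depth, vars_unified=False, consts_unified=False):
    """Weight of an indexed term derived from a formula with weight `base`."""
    weight = base * LEVEL_COEF ** depth
    if vars_unified:
        weight *= VAR_COEF
    if consts_unified:
        weight *= CONST_COEF
    return weight


if __name__ == "__main__":
    base = 0.125  # weight of the whole formula, as on the slide
    for depth in range(4):
        print(depth, round(term_weight(base, depth), 6))
    print(round(term_weight(base, 0, vars_unified=True, consts_unified=True), 6))
```

The effect of such a scheme is that an exact match of the whole formula scores highest, while matches on nested or unified variants still count but contribute less.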
Formulae search demonstration comments
Demo web interface: http://aura.fi.muni.cz:8085/webmias/
• MathML/TeX input (Tralics [2] for conversion to MathML [?])
• Canonicalization of the query – problems with UMCL library [1]
• Matched document snippet generation
• MathJax for nicer math rendering and better portability
• SnuggleTeX for on-the-fly as-you-type rendering
All up and ready on the EuDML system.

Searching (semantically) similar papers
Exploration of a DML: browsing (semantically) similar papers
Semantic search via topic modeling: Latent Semantic Indexing, Latent Dirichlet Allocation

Leading Edge Example: Automated Meaning Picking from Texts

Probabilistic Topical Modeling: Latent Dirichlet Allocation
• topic: weighted list of words
• document: weighted list of topics
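Once each document is a weighted list of topics, browsing similar papers reduces to comparing topic vectors. The sketch below shows one common way to do that, cosine similarity, in plain Python. The topic names, weights and paper identifiers are made up for illustration, and the code does not train a topic model and does not use gensim, the library the talk names for its similarity technology.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {topic: weight}."""
    dot = sum(weight * v.get(topic, 0.0) for topic, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(doc_id, docs):
    """Other documents, ordered from most to least similar to doc_id."""
    scored = [
        (cosine(docs[doc_id], vector), other)
        for other, vector in docs.items()
        if other != doc_id
    ]
    return [other for _, other in sorted(scored, reverse=True)]


# Hypothetical topic: a weighted list of words.
TOPICS = {
    "pde": {"equation": 0.4, "boundary": 0.3, "solution": 0.3},
    "algebra": {"group": 0.5, "ring": 0.3, "module": 0.2},
}

# Hypothetical documents: weighted lists of topics.
DOCS = {
    "paper_a": {"pde": 0.8, "algebra": 0.2},
    "paper_b": {"pde": 0.7, "algebra": 0.3},
    "paper_c": {"pde": 0.1, "algebra": 0.9},
}

if __name__ == "__main__":
    print(most_similar("paper_a", DOCS))
```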
Topical Modeling: Latent Dirichlet Allocation II
• all topics computed automatically from document corpora

Content Similarity Results in EuDML
We have developed and delivered technology for similarity (gensim), document conversions (to Braille or to text: MathML2text) and math content normalization.
Different formulae representations for similarity computation.

Summary
• EuDML is up and running, with several novel math-aware approaches developed and in use
• verified complex workflow and proven technologies and tools for DML
• scalable solution for math formulae search researched, implemented, tested and integrated into the current version of the EuDML system!
• MIR/MIaS project pages – https://mir.fi.muni.cz/
• math-aware methods for document similarity (MathML2text, gensim)
• a lot more (e.g. PDF size reduction of 62% of the original, already CCITT-G4 compressed, PDFs, etc.)

Future work
• DML workshop series: join us at DML 2013, c/o CICM, Bath, UK, in July 2013
• Activities towards WDML (Sloan funding,…)
• EuDML initiative consortium, further sustainability solutions (grant proposal writing).
• Improved MathML canonicalization and new preprocessing filters; search developed and evaluated using the EuDML math query database of intentions.
• Addition of Content MathML tree indexing.
• NTCIR-11!

Acknowledgments and questions?
Acknowledgements: EuDML project (funding), ELIAS (trip here), EuDML colleagues, and authors and contributors of tools used.

Archambault, D., Moço, V.: Canonical MathML to Simplify Conversion of MathML to Braille Mathematical Notations. In: Miesenberger, K., Klaus, J., Zagler, W., Karshmer, A. (eds.) Computers Helping People with Special Needs, Lecture Notes in Computer Science, vol. 4061, pp. 1191–1198. Springer Berlin / Heidelberg (2006)
Grimm, J.: Producing MathML with Tralics. In: Sojka [4], pp. 105–117
MREC – Mathematical REtrieval Collection
Sojka, P. (ed.): Towards a Digital Mathematics Library. Masaryk University, Paris, France (Jul 2010)
Sojka, P., Líška, M.: Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In: Davenport, J.H., Farmer, W., Urban, J., Rabe, F. (eds.) Proceedings of CICM Conference 2011 (Calculemus/MKM). Lecture Notes in Artificial Intelligence, LNAI, vol. 6824, pp. 228–243. Springer-Verlag, Berlin, Germany (Jul 2011)
Líška, M., Sojka, P., Růžička, M.: Similarity Search for Mathematics: Masaryk University team at the NTCIR-10 Math Task.
In: Proceedings of the 10th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Math Pilot Task, pp. 686–691. NII, Tokyo (2013)
Formánek, D., Líška, M., Růžička, M., Sojka, P.: Normalization of digital mathematics library content. In: Davenport, J., Jeuring, J., Lange, C., Libbrecht, P. (eds.) 24th OpenMath Workshop, 7th Workshop on Mathematical User Interfaces (MathUI), and Intelligent Computer Mathematics Work in Progress, CEUR Workshop Proceedings, no. 921, pp. 91–103. Aachen (2012)
Sojka, P., Líška, M.: The Art of Mathematics Retrieval. In: Hardy, M.R.B., Tompa, F.Wm. (eds.) Proceedings of the 2011 ACM Symposium on Document Engineering, pp. 57–60. ACM, Mountain View, CA, USA (2011). ISBN 978-1-4503-0863-2
Sylwestrzak, W., Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: EuDML – Towards the European Digital Mathematics Library. In: Sojka [4], pp. 11–24
Líška, M., Sojka, P., Růžička, M., Mravec, P.: Web Interface and Collection for Mathematical Retrieval. In: Sojka, P., Bouche, T. (eds.) Proceedings of DML 2011, pp. 77–84. Masaryk University, Bertinoro, Italy (Jul 2011)

Credits for LDA pictures go to David M. Blei.
Credits for illustrations go to Jiří Franek.