From Minds to Pixels and Back
Petr Sojka
Faculty of Informatics, MU, Brno
October 21st, 2008

Part One: From Minds to Pixels
  1 Minds to Pixels: Publishing, Content and Form Separation
  2 Competing Patterns: Hyphenation Pattern Generation
  3 Thai Segmentation
  4 Summary of Contributions (Part One)

Content and form
  "Discover the outer logic of the typography in the inner logic of the text." -- Robert Bringhurst
  Document = content + form.
  Content should be marked up in the author's terms and in the notions of the domain language.
  Form (appearance) should reflect the design and use graphical means consistently (sameness).
  The possible forms of a document are constrained by the output device (paper, LCD monitor, PDA).
  Single-source publishing allows structured aggregation of content and form markup and cost-effective maintenance.

Single-source publishing from author's markup

Content vs. visual markup
    &\elevenit I\kern.7ptllustrations by\cr
    &DU\kern-1ptANE BIBBY\cr
    \noalign{\vfill}
    &\setbox0=\hbox{\manual77}%
    \setbox2=\hbox to\wd0{\hss\manual6\hss}%
    \raise2.3mm\box2\kern-\wd0\box0\cr % A-W logo
    &ADDISON\kern.1em--WESLEY\cr
    &PUBLISHING COMP\kern-.13emANY\kern-1.5mm\cr
  OK? NO! (at least not for single-source publishing for multiple outputs)

Single-source publishing example I: Math textbook
  From one properly marked-up source, multiple output versions:
    optimized for print (PDF)
    optimized for LCD screen (PDF)
    optimized for web browser (portable HTML)
    optimized for web browser (scalable XML+MathML)
    ...

Single-source publishing example II: FI yellow book
  From one properly marked-up source, multiple output versions:
    optimized for print (PDF)
    optimized for LCD screen (PDF)
    optimized for web delivery (searchable via is.muni.cz)
    ...

Hyphenation task
  "pattern ORIGIN Middle English patron `something serving as a model', from Old French. The change in sense is from the idea of patron giving an example to be copied. Metathesis in the second syllable occurred in the 16th cent. By 1700 patron ceased to be used of things, and the two forms became differentiated in sense." (NODE, 1998 edition)
  Hyphenation is separated from content.
  There are "long-distance" dependencies.
  discreteness: a small change in the input can cause a fundamental change in the output
  ambiguity: o-blít, ob-lít; na-rval, nar-val; po-drobit, pod-robit; wach-stube, wachs-tube; ...
  hard generalization, exceptions, exceptions of exceptions, ...

The method (competing patterns)
  The way to perfection (space and time minimization): instead of one big set of patterns, a decomposition into several layered approximations (subpatterns):
    p1 (positive subpatterns),
    p2 (negative subpatterns--exceptions to p1),
    p3 (positive subpatterns covering what has not been covered by p1 and p2),
    ...
        h y p h e n a t i o n
    p1           1n a
    p1               1t i o
    p2            n2a t
    p2                 2i o
    p2        h e2n
    p3  h y3p h
    p4        h e n a4
    p5        h e n5a t
        h0y3p0h0e2n5a4t2i0o0n
        h y-p h e n-a t i o n
  How to generate the patterns?

Techniques of pattern generation
  Stratification technique: eliminating unnecessary training examples speeds up learning.
  Bootstrapping technique: iterative bootstrapping for corpus tagging and error correction.
  Fine-tuning of the final pattern generation parameters: the parameters of the learning process let us tune the size and quality of the patterns (the general problem of finding a minimal set of full-coverage patterns belongs to the NP optimization class of problems).

Stratified sampling technique
  "A large body of information can be comprehended reasonably well by studying more or less random portions of the data. The technical term for this approach is stratified sampling." Knuth, 1991
  Example of a stratification rule, e.g. for the hyphenation task:
    1 only every 7th (actually, every 17th worked as well) derived word form from the full list is added to the PATGEN input list, with the exceptions that
    2 every stem must be accompanied by at least one derived form, and
    3 every derived form with overlapping prefixes has to be present in the PATGEN input list as well, and
    4 only one word with the prefixes ne (which negates almost any word) and nej (which forms superlatives) is included.

Parameters of pattern generation
  Heuristic for accepting (adding) a pattern at a given level:
    $\text{good} \cdot \text{good\_weight} - \text{bad} \cdot \text{bad\_weight} \ge \text{threshold}$

  Table: Liang's patterns for English (hyphen.tex)
  level  length  param    hyphens        % correct  % wrong  # patterns
  1      2-3     1 2 20   67604  14156   76.6       16.0     + 458
  2      3-4     2 1 8     7407  11942   68.2        2.5     + 509
  3      4-5     1 4 7    13198    551   83.2        3.1     + 985
  4      5-6     3 2 1     1010   2730   82.0        0.0     +1647
  5      5-8     1 4       1320   6428   89.3        0.0     +1320
  4447 patterns, 1 hour CPU (PDP-10), total size 27667 B

PATGEN statistics for Czech hyphenation
  Table: Standard Czech hyphenation with Liang's parameters for English
  level  length  param    % correct  % wrong  # patterns
  1      2-3     1 2 20   96.95      14.97    + 855
  2      3-4     2 1 8    94.33       0.47    +1706
  3      4-5     1 4 7    98.28       0.56    +1033
  4      5-6     3 2 1    98.22       0.01    +2028
  Size: 32 kB

  Table: Standard Czech hyphenation with improved (size-optimized) strategy
  level  length  param    % correct  % wrong  # patterns
  1      1-3     1 2 20   97.41      23.23    + 605
  2      2-4     2 1 8    85.98       0.31    + 904
  3      3-5     1 4 7    98.40       0.78    +1267
  4      4-6     3 2 1    98.26       0.01    +1665
  Size: 23 kB

Czech/Slovak hyphenation
          # of words  # of hyphenation points
                      Correct             Wrong        Missed
  Czech   372562      1019686 (98.26 %)   39 (0.01 %)  18086 (1.74 %)
  Slovak  333139      1025450 (98.53 %)   34 (0.01 %)  15273 (1.47 %)
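
The acceptance heuristic from the "Parameters of pattern generation" slide can be written as a one-line predicate. This is a minimal sketch, assuming that `good` and `bad` count the positions a candidate pattern would mark correctly and incorrectly in the training word list; the function name and the example counts are hypothetical, and only the parameter triple (1, 2, 20) is taken from level 1 of the tables.

```python
def accept_pattern(good, bad, good_weight, bad_weight, threshold):
    """Return True if the candidate pattern passes the selection heuristic.

    good, bad: how often the candidate marks a position correctly / wrongly
    (assumed meaning; the counts used below are invented for illustration).
    """
    return good * good_weight - bad * bad_weight >= threshold


# Level-1 parameters from the tables: good_weight=1, bad_weight=2, threshold=20.
LEVEL_1 = (1, 2, 20)

# Hypothetical candidates: 30 correct hits with 4 or 6 wrong ones.
print(accept_pattern(30, 4, *LEVEL_1))  # 30*1 - 4*2 = 22, accepted
print(accept_pattern(30, 6, *LEVEL_1))  # 30*1 - 6*2 = 18, rejected
```

Raising `bad_weight` or `threshold` makes a level more conservative (fewer, safer patterns); lowering them buys coverage at the price of more wrong positions that the next level has to inhibit.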

  Table: Standard Czech hyphenation with improved (% of correct optimized) strategy
  level  length  param   % correct  % wrong  # patterns
  1      1-3     1 5 1   95.43      6.84     +2261
  2      1-3     1 5 1   95.84      1.17     +1051
  3      2-5     1 3 1   99.69      1.24     +3255
  4      2-5     1 3 1   99.63      0.09     +1672
  Size: 40 kB

  Table: Czech hyphenation of compound words (Liang's parameters, but allowing patterns of length 1 in level 1)
  level  length  param    % correct  % wrong  # patterns
  1      1-3     1 2 20   72.97      14.32    + 300
  2      2-4     2 1 8    69.32       3.09    + 450
  3      3-5     1 4 7    84.09       4.02    + 870
  4      4-6     3 2 1    82.61       0.33    +2625
  Size: 25 kB

Czech hyphenation of compounds
  Table: Czech hyphenation of compound words (% of correct slightly optimized)
  level  length  param    % correct  % wrong  # patterns
  1      1-3     1 2 20   72.97      14.32    + 300
  2      2-4     2 1 8    69.32       3.09    + 450
  3      3-5     1 4 3    90.82       4.24    +3014
  4      4-6     3 2 1    89.07       0.36    +2770
  Size: 40 kB

  Table: Czech hyphenation of compound words (% of correct optimized, but % of wrong and size increase)
  level  length  param   % correct  % wrong  # patterns
  1      1-3     1 5 1   64.35      5.34     +1415
  2      2-4     1 5 1   67.10      1.88     +1261
  3      3-5     1 3 1   97.94      5.39     +8239
  4      4-6     1 3 1   97.91      1.14     +2882
  Size: 84 kB

Compression with patterns
  Once full recall is reached, the method may be viewed as compression: 6,000,000 hyphenated Czech words (about 40 MB) can be stored in patterns of 40 kB, a 1:1000 ratio.
  In addition, the hyphenation of a word is found in time that is constant with respect to the dictionary size (the patterns are stored in a packed trie data structure).
  100,000+ hyphenated words per second on a modern PC, in tens of kB of space.
  The key is representing the problem as competing patterns (longer patterns beat shorter patterns as exceptions): a hierarchy of exceptions.
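
How the competing patterns from the "hyphenation" example interact can be sketched in a few lines of Python. This is an illustration only, not the TeX or PATGEN implementation: it keeps the patterns in a plain dictionary instead of a packed trie, ignores word-boundary markers and minimum fragment lengths, and the names `parse_pattern` and `hyphenate` are made up for this sketch. At every inter-letter position the highest level wins; an odd value allows a break, an even value forbids it.

```python
def parse_pattern(pattern):
    """Split e.g. 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]).

    levels[k] is the value at the position just before letter k;
    the last entry is the position after the final letter.
    """
    letters = ""
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters += ch
            levels.append(0)
    return letters, levels


def hyphenate(word, patterns):
    """Return (hyphenated word, list of inter-letter levels)."""
    table = dict(parse_pattern(p) for p in patterns)
    levels = [0] * (len(word) + 1)
    # Try every substring; a real implementation walks a trie instead.
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            sub_levels = table.get(word[start:end])
            if sub_levels is None:
                continue
            for k, level in enumerate(sub_levels):
                # Competition: the highest level at a position wins.
                levels[start + k] = max(levels[start + k], level)
    pieces, last = [], 0
    for i in range(1, len(word)):
        if levels[i] % 2 == 1:  # odd level = break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces), levels


# The eight patterns shown on the slide "The method (competing patterns)".
PATTERNS = ["1na", "1tio", "n2at", "2io", "he2n", "hy3ph", "hena4", "hen5at"]

result, levels = hyphenate("hyphenation", PATTERNS)
print(result)        # hy-phen-ation
print(levels[1:-1])  # [0, 3, 0, 0, 2, 5, 4, 2, 0, 0]
```

The printed levels reproduce the line h0y3p0h0e2n5a4t2i0o0n from the slide: the level-1 break before "n" is inhibited by he2n, and the break after "hen" is restored by the longer level-5 pattern hen5at.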

Thai segmentation
  Our testbed for pattern application.
  Thai: 44 consonants, 28 vowels.
  No explicit syllable, word, or sentence boundaries in paragraphs. No punctuation.
  When typesetting, we need to know
    at least word (and sentence) boundaries to break lines
    tag for a web browser
  Even native Thai speakers do not agree: is a compound word one word or more?

Thai segmentation patterns development
  Training on the available Orchid corpus.
  Evaluation measures:
    $\text{Precision} = \frac{\#\,\text{found well}}{\#\,\text{found well} + \#\,\text{bad}}$
    $\text{Recall} = \frac{\#\,\text{found well}}{\#\,\text{found well} + \#\,\text{missed}}$
  A segment is correct iff both its start and its end are correctly predicted.
  In addition, the two are combined into a single measure:
    $\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

Results of Thai segmentation patterns generation (8000 paragraphs from Orchid corpus)
  level  length  param   % correct  % wrong  # patterns  utf size (kB)
  1      1-5     1 6 1   97.92      4.86     +15443      161
  2      2-6     1 1 1   96.53      0.65     + 2596      196
  3      3-11    1 3 1   99.57      0.79     + 3448      267
  4      4-12    4 1 1   97.87      0.03     +  953      286
  5      9-19    1 3 1   99.68      0.12     + 2468      364
  6      10-20   1 1 1   99.67      0.04     +  129      368

Thai patterns generation results
  Nearly 100% precision on the training set.
  Training sets of 4,000, 6,000, and 8,000 paragraphs.
  Tested on previously unseen text:
  # par.  # good   # bad   # missed  prec. %  recall %  F-score
  4000    139788   11231   15529     92.56    90.00     91.26
  6000     98243    7951    9432     92.51    91.24     91.87
  8000     46361    3358    3703     93.25    92.60     92.92

Thai segmentation patterns in Emacs

Contributions summary
  Formal definition of competing patterns. We have described and developed a new approach to language engineering based on the theory of covering and inhibiting patterns.
  New approaches to competing pattern generation. We have verified the plausibility and usefulness of bootstrapping and stratification techniques for machine learning of the pattern generation process. We have related our new techniques to those used so far--the results improve significantly with the new approach.
  Properties of the pattern generation process. We have shown that reaching size-optimality of the pattern generation process is an NPO problem; however, it is possible to achieve full data recall and precision on the given data with the heuristics presented.

Contributions summary (cont.)
  New approach to the Thai text segmentation problem. We show that an algorithm using competing patterns learnt from segmented Thai text returns better results than current methods for this task.
  Thai segmentation patterns. New patterns for the Thai segmentation problem were generated from data in the ORCHID corpus.
  New Czech and Slovak hyphenation patterns. The new hyphenation patterns for Czech and Slovak give a much better performance than the previous ones, and are in practical use in distributions of text processing systems ranging from TeX, Scribus, and OpenOffice.org to Microsoft Word.
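
The evaluation measures defined for the Thai segmentation experiments can be recomputed from the raw counts of good, bad, and missed segments. A minimal sketch: the function name is an assumption made for this example, and the input counts are those of the 4,000-paragraph row of the results table.

```python
def segmentation_scores(good, bad, missed):
    """Return (precision, recall, F-score) in percent.

    good:   segments found with both boundaries correct
    bad:    segments proposed by the patterns but wrong
    missed: reference segments the patterns did not find
    """
    precision = good / (good + bad)
    recall = good / (good + missed)
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score


# Counts from the row for the 4,000-paragraph training set.
p, r, f = segmentation_scores(good=139788, bad=11231, missed=15529)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 92.56 90.00 91.26
```

The F-score is the harmonic mean of precision and recall, so it stays close to the lower of the two values and cannot be inflated by a pattern set that is strong on only one of them.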

Contributions summary (cont.)
  New patterns for specific tasks. Patterns for specific tasks demanded in the areas of computer typesetting and NLP were developed--phonetic hyphenation, universal syllabic hyphenation, and the possibility of using context-sensitive patterns for disambiguation tasks were shown.
  The foundation for new pattern generation algorithms. The object-oriented design of the OPATGEN program for pattern generation allows easy experimentation with new pattern generation heuristics.
  Usage of the methodology for partial morphological disambiguation. We have shown that the methodology of competing patterns can be used for partial disambiguation tasks. Experiments showed an improved performance for the partial morphological disambiguation of Czech.
  And last but not least: an ecological contribution--saving a lot of trees by better hyphenation patterns.

From Pixels to Minds (Digitization, Tagging)
  5 What and why? Better than Google Scholar for mathematical peer reviewed literature; bott
  6 DML-CZ overview. DML-CZ workflow: preparation, scanning, metadata, OCR, indexing, delivery
  7 MSC. Mathematical Subject Classification
  8 Publishing. Born-digital (retro-born-digital) paper handling
  9 OCR. DML-CZ Optical Character Recognition: (Fine+Infty)Reader++
  10 Summary. Summary, Conclusions, Bibliography

From pixels to minds? Digitization needed
  The need to digitize.
  Google Scholar for peer-reviewed mathematics, but better!
  Vision of a World Digital Math Library (WDML) that will bring the enduring mathematical legacy to researchers worldwide.
  High-quality, checked content, crosslinked via the reviewing databases Zentralblatt MATH or Mathematical Reviews (more than 2,500,000 reviewed articles).
  An estimated 100,000,000 pages in total only: with clever storage they fit on one portable disk (EUR 200) today, yet they cannot be read within a (wo)man's entire life.
  250,000 distinct authors (minds) submitted papers for review in the mathematical sciences in the last decade.

Digital Mathematics Library - motivations
  Publish or perish - publication growth, but reviewers are hard to find.
  Using global bibliographical citation analysis and ranking to tackle information overload (# of references in The Collection of Computer Science Bibliographies):

Going digital - Digital Mathematics Library
  Going digital increases impact (citation scores) [Giles 1999]: authors put preprints on the web, and publishers are eager to be indexed by search engines (50% of traffic comes from there): Google Scholar, Citeseer.
  - persistence of author's information on the web
  + ad surrogate ad fontes
  + implications of digital access: from factography to the art of posing questions.
  - (W)DML!

(W)DML Initiatives
  NUMDAM   Numérisation de documents anciens mathématiques.
  ERAM     The Jahrbuch Project--Electronic Research Archive for Mathematics (1868-1942): "Jahrbuch über die Fortschritte der Mathematik"
  JSTOR    (AMS journals)
  EMANI    Electronic mathematical archiving network (Cornell, SUB Göttingen, MathDoc, Tsinghua University Library)
  RusDML   Russian DML (2,000,000 pages of papers in Zbl refereed journals)
  DML-CZ   Digital Mathematical Library of mathematical literature published in the Czech and Slovak Republics.

Specifics of Mathematical Publications
  Review databases where entries are classified according to the Math Subject Classification Scheme (MSC 2000).
  Zentralblatt MATH (more than 2,000,000 entries drawn from more than 2,300 serials and journals).
  Jahrbuch über die Fortschritte der Mathematik (JFM), covering the period 1868-1942 (200,000 entries digitized in ERAM).
  MathSciNet: 2,329,742 publications (May 20th, 2008); 80,000 new items and 60,000 reviews added each year; 1,799 journals covered; links to 501,123 original articles; 11,304 active reviewers; 428,680 authors indexed. Since 1940.
  Papers 50 years old or even older are frequently cited.

Google Scholar vs. MR/Zbl
  http://scholar.google.com/scholar?q=Antonin Kucera
  http://www.ams.org/mathscinet/search/publications.html?pg1=IID&s1=695584
  http://www.ams.org/mathscinet/pdf/1992331.pdf?pg1=IID&s1=695584&r=16
  Author and institution disambiguation:
  http://www.ams.org/mathscinet/search/institution.html?code=CZ_MASC
  See the difference?
  Hyperlinking is needed for computing the H-index; high-quality metadata is needed for its robustness, etc.

The Goal: Bottom-up way to WDML--DML-CZ
  Czech Academy of Sciences grant (program Information Society) 2005-2009: full (retro)digitization of 50,000 pages of mathematical literature per year.
  We do not want to reinvent the wheel (scanning, text OCR).
  Research part:
    1) gradual enhancement of the digital material by `knowledge enhancing' filters on markup-rich XML data;
    2) new methods for (semantic) text processing tested on the available data.
  IPR part: sharing/delivery (economic models for knowledge sharing due to interests of content owners/publishers).

What to digitize in DML-CZ?
  7-8 Czech and Slovak math journals, 100-200 monographs, textbooks, and conference proceedings, in total about 250,000 pages:
  Czechoslovak Mathematical Journal (30,000 pages to scan, 7,000 are already born digital). Published by the Academy of Sciences of CR, distributed partially by Springer. Founded as Časopis pro pěstování matematiky in 1872, under the current name since 1951. 272 pages quarterly.
  Applications of Mathematics (20,000/5,000). Published by the Academy of Sciences of CR. Founded in 1956 (as Aplikace matematiky). 80 pages bimonthly.
  Archivum Mathematicum (2,000/4,000). Masaryk University in Brno.
  Mathematica Bohemica and Archivum Mathematicum are already partially digitized in Göttingen, ...
  Copyright issues are crucial.

DML-CZ workflow steps

Top-level DML-CZ workflow overview (simplified)

Preparation
  document selection: by quality, but grey literature too.
  preparation: acquisition of documents for scanning.
  copyright: negotiation with publishers (or even authors?)
  In what order? What is important when signing a digitization contract?
  Current trends in the EU: paying for the rights to digitize, and paying the authors' rights organizations for everything not older than 70 years :-(. Following NUMDAM :-).
  "I have worked for the digital math library in different committees since 1992, and now I am tired of this topic. The main obstacles are of legal nature (misuse of copyright laws by big commercial publishers), and we missed some opportunities along the way." Peter Michor

Scanning
  Floods in Bohemia three years ago. Many manuscripts were under water, and were frozen (put into the refrigerator). The workflow for the process of defrosting includes scanning (Library of the Academy of Sciences, Jenštejn near Prague, capacity of 40,000 pages per month or more!).
  parameters: 600 dpi, 4-bit depth.
  scanning facilities: Digibook RGB 10000, A1 color book scanner; two Zeutschel OS 7000 book scanners, A2 B/W.
  software: Book Restorer to make the scanned pages uniform (white space around the text body, ...); the Sirius system for archival storage of scanned materials (they are put on CDs as TIFFs).

Optical character recognition
  Text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1.
  Errors in math: methods for separating text OCR from mathematics OCR.
  Math: Infty system (Suzuki et al., Japan): 1) layout analysis, 2) character recognition, 3) structure analysis of math expressions, and 4) manual error correction.
  Multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc).
  Quality assurance--quality matters most! 99%+ accuracy for text, 96%+ for mathematics.

Metadata and Image Enhancements/Processing
  metadata standards: choice of standards (MODS, METS).
  metadata acquisition: Zbl/MR, OCR tagging, [retyping].
  image enhancements: TIFF, PDF, jbig2 compression as a measure of quality.
  semantic processing: document markup enhancement, semantic processing, document classification, citation linking, document clustering, indexing.
  References and full texts are metadata as well; English titles and MSC are mandatory. OAI-PMH export.

Metadata editor
  http://editor.dml.cz
  Web-based client-server tool, developed from scratch (ICS MU, in Python) for metadata import, editing, and checking.

Document markup enhancement methods
  context-dependent mapping from visual to logical markup
  algorithms of language identification (bi-gram and tri-gram based, at paragraph or even sentence level)
  document classification, metrics, ontology construction, comparison with the AMS 2000 classification
  semiautomatic bibliography markup and metrics, global mathematics citation index, "MathRank"
  document clustering (for visualization, ...), identification of near duplicates

Visualization

Presentation
  visualization techniques: `lost in hyperspace fear', visualization of document clustering, Visual Browser (different user's eyes).
  delivery: customised digital library system DSpace (open source, created at MIT) for delivery of final articles and search. Manakin interface.

Visualization in Visual Browser

Delivery
  web portal: unique and persistent URLs: Digital Object Identifier DOI (URN? PURL? ...)
  interfaces to other services: OAI-PMH harvesting, bibitem export, Googlebot optimization
  indexing, search relevance: Lucene, customized for math. (Experiments with Manatee and EDBM-2 (Zbl, NUMDAM))?
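
An OAI-PMH harvesting interface is queried with plain HTTP GET requests, so a harvester mostly needs to assemble the right query string. The following is a small sketch of building a ListRecords request; the base URL is a placeholder and not the actual DML-CZ endpoint, and the function name is invented for this example. The argument names `verb`, `metadataPrefix`, and `set`, and the `oai_dc` metadata format, are defined by the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is whatever endpoint the repository exposes
    (a placeholder is used below, not a real service).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting
    return base_url + "?" + urlencode(params)


# Placeholder endpoint for illustration only.
print(list_records_url("https://example.org/oai"))
# https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester would fetch this URL, parse the returned XML, and repeat the request with the `resumptionToken` the repository hands back until the list is exhausted.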
Paper Classification
- every math journal paper today is classified by the MSC (five-character alphanumeric code) taxonomy
- one primary, several secondary MSC
- useful for search narrowing, clustering, document distance basis
- old papers were not classified when published or reviewed

Mathematical Paper Classification and Categorization
We thrive in information-thick worlds because of our marvelous and everyday capacity to select, edit, single out, structure, highlight, group, pair, merge, harmonize, synthesize, focus, organize, condense, reduce, boil down, choose, categorize, catalog, classify, list, abstract, scan, look into, idealize, isolate, discriminate, distinguish, screen, pigeonhole, pick over, sort, integrate, blend, inspect, filter, lump, skip, smooth, chunk, average, approximate, cluster, aggregate, outline, summarize, itemize, review, dip into, flip through, browse, glance into, leaf through, skim, refine, enumerate, glean, synopsize, winnow the wheat from the chaff and separate the sheep from the goats.
-- Edward R. Tufte
- every math journal paper today is classified by the MSC (five-character alphanumeric code) taxonomy (tree)
- one primary, several secondary
- useful for search narrowing
- MSC 1991, MSC 2000, MSC 2010
- old papers were not classified when published or reviewed

Automated MSC classification experiment
To date (March 2008), the digitized part contains 369 volumes of 14 journals and book collections: 1,493 issues, 11,742 articles on 177,615 pages. From NUMDAM, we got another 15,767 full texts of articles (in simple XML format) for an experiment.
- several different languages
- trained on papers with one primary MSC
- NLP lab's GVP project code as basis

Automated MSC machine learning
- tokenization and lemmatization: the first part of the preprocessing relates to how the text is split into tokens (words): alphabetic, lowercase, Krovetz stemmer, lemmatization, bi-gram tokenization;
- feature selectors: how to choose the tokens that discriminate best: χ², mutual information (MI-score);
- feature amount: how many features are needed to classify best: 500, 2,000 or 20,000 features;
- term weighting: how the features will be weighted (tf-idf variants and weight normalizations: atc (augmented term frequency), bnn and nnn);
- classifiers: Naive Bayes (NB), k-Nearest Neighbours (kNN), Support Vector Machines (SVM), Artificial Neural Nets (ANN);
- threshold estimators: how to choose the category status of the classifier based on a threshold: fixed or s-cut strategy for threshold setting;
- evaluation and confidence estimation: how results are measured and how the confidence in them is estimated: Receiver Operating Characteristic (ROC), Normalized Cross Entropy (NCE).

GVP Framework for comparing learning methods
The two differently colored curves correspond to the chosen learning methods (k-NN, Naive Bayes in the legend on the right). From the colors below the chosen function values, one immediately sees which combination (at the bottom) of preprocessing methods leads to which particular value.

Dependency of performance on the number of examples per class limit
From the three curves one can see that by increasing the threshold of minimum category size one gets better results in every aspect (color square combination at the bottom).
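The χ² feature selector listed in the machine learning pipeline above can be sketched as follows. This is a minimal illustration of the textbook 2x2 contingency-table statistic, not the GVP project code; the function names and the toy documents are assumptions made for the example.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/category 2x2 contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category without the term
    n00: documents outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, category, k):
    """Return the k terms with the highest chi-square score for one category.

    docs is a list of token lists, labels the primary category of each doc.
    """
    in_cat = sum(1 for y in labels if y == category)
    out_cat = len(labels) - in_cat
    df_in, df_out = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Document frequency: count each term once per document.
        (df_in if y == category else df_out).update(set(tokens))
    scores = {}
    for term in set(df_in) | set(df_out):
        n11, n10 = df_in[term], df_out[term]
        scores[term] = chi_square(n11, n10, in_cat - n11, out_cat - n10)
    # Sort by descending score, ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data (hypothetical): two papers under MSC 20, two under MSC 53.
docs = [["group", "semigroup"], ["group", "ring"],
        ["manifold", "curvature"], ["manifold", "group"]]
labels = ["20", "20", "53", "53"]
print(select_features(docs, labels, "20", 2))
```

Note that χ² scores association in either direction, so a term that reliably signals absence from a category ranks as high as one that signals membership.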
Classifiers' learning methods comparison by F1 measure
SVM and kNN perform almost identically, while NB lags behind. The major influence is due to the threshold on minimum category size.

Detail of MSC-sorted documents' similarity matrix
Matrix computed by LSA for top-level MSC code 20-xx Group theory and generalizations. The white lower right square corresponds to the 20Mxx Semigroups subject papers. We can see strong similarity of 20Mxx to 20.92 Semigroups, general theory and 20.93 Semigroups, structure and classification (white lower left and upper right rectangles).

Metadata from born-digital papers
- main idea: the metadata and semantic information available are exported as a side effect of publishing printed journal issues, with only minimal additional costs (by requiring proper tagging).
- references, full text for searching
- minimal changes in the workflow
- Archivum Mathematicum pilot project.

Pilot project of Archivum Mathematicum
- inspired by CEDRAM
- papers in LaTeX with AMS styles, references in BibTeX.
- new style files by Michal Růžička
- automated typesetting, page numbering, EMIS web page generation, ...
- use of configurable Tralics converter to XML
- high automation by program make
- automated import to DML-CZ
- first 3 issues already available

How to Find? Search!
- an entry gate to the digitized papers is search
- full text searching, searching for in-text references
- search and exchange of mathematical formulas in MathML, OpenMath: project Mathdex
- due to the massive size of digitized material, the only way is very good OCR, including math.

Existing OCR Systems
- Not to reinvent the wheel: trial of several OCR engines.
- No single OCR system gave acceptable results: high error rate, working only for specific purposes (plain English text); direct use was not possible.
- FineReader by ABBYY gave good results for (even multilingual) text, and allows for typeface learning.
- InftyReader (www.inftyproject.org) is the only available solution for structural math recognition.
- No off-the-shelf solution.

Our OCR Solution
- combining both: using FineReader and InftyReader in a pipeline to let each system do what it is good at, then `vote'
- top-level (Java) program to automate the process and fix some deficiencies
- out-of-the-box setup unusable: fine-tuning and gradually enhancing the OCR procedure and program parameters so that OCR results would be acceptable for DML-CZ purposes
- trying to improve the results further by close cooperation with the team of Prof. Suzuki (Infty Project leader, Kyushu University, Japan; wait for the next talk), and hopefully with the efforts of other (retrodigitization) projects.

[Figure: DML-CZ OCR Workflow Diagram]

DML-CZ OCR Workflow: middle level of details I
- Choosing the testbed data (30,000 pages of CMJ since 1951).
- Scanning at 600 DPI, 4-bit depth (soft binarization advantage).
- Lookup for hot typefaces used in CMJ.
- Training the FineReader (FR) 8.0 OCR engine for the fonts used.
- Training the Lingua::Ident Perl module for identification of the languages used in CMJ (EN, RU, F, GE, CZ, SK): a very reliable statistical method based on character bigram and trigram counts.
- FR scanning using a general setup profile (no specific language vocabulary used).
- Evaluating the language of the scanned block.
- Calling FR to scan for the 2nd time with a profile appropriate to the recognized language(s).
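The language identification step of the workflow relies on character bigram and trigram counts. The sketch below shows the general idea in Python; it is not the Lingua::Ident API, and the function names, the add-one smoothing and the one-sentence training samples are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def ngrams(text, n_values=(2, 3)):
    """Yield character bigrams and trigrams of a whitespace-normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    for n in n_values:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def train(samples):
    """samples maps a language code to training text; returns n-gram counts."""
    return {lang: Counter(ngrams(text)) for lang, text in samples.items()}


def identify(models, text):
    """Pick the language whose add-one-smoothed model scores the text highest."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values()) + len(vocab) + 1
        score = sum(math.log((counts[g] + 1) / total) for g in ngrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Tiny illustrative training texts; a real model needs far more data.
models = train({
    "en": "the theorem is proved in the following section and the proof is given",
    "cs": "věta je dokázána v následující části a důkaz je uveden",
})
print(identify(models, "the proof of the theorem"))
```

In the workflow above, the recognized language of each block then selects the FineReader profile for the second pass.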
DML-CZ OCR Workflow: middle level of details II
- Export the result as layered PDF (+FineReader XML).
- Importing this PDF by InftyReader.
- InftyReader recognition and storing the result as Infty Markup Language IML (XML+MathML) and LaTeX.
- Running (our Java) program IMLCorrector to fix some InftyReader deficiencies in IML.
- Running (our Java) program OCRJoiner to compare characters in bounding boxes by FR and InftyReader and store the final result in IML.
- Use the resulting files in the further DML-CZ workflow.

OCR XML Postprocessing
... check s ... is transformed to ... \v{s} ...
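The postprocessing step that turns a "check" accent over s into \v{s} can be illustrated as follows. The actual IML input is elided on the slide, so this sketch assumes precomposed Unicode letters as input and is not the IMLCorrector code; the accent table and function name are hypothetical and cover only a few marks.

```python
import unicodedata

# Combining mark -> TeX accent macro (small illustrative subset).
TEX_ACCENTS = {
    "\u030c": "\\v",   # caron (hacek, "check")
    "\u0301": "\\'",   # acute
    "\u0308": '\\"',   # diaeresis
    "\u030a": "\\r",   # ring above
}


def to_tex(text):
    """Replace accented letters with TeX accent macros where the table allows."""
    out = []
    for ch in text:
        # NFD splits a precomposed letter into base letter + combining mark.
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in TEX_ACCENTS:
            out.append("%s{%s}" % (TEX_ACCENTS[decomposed[1]], decomposed[0]))
        else:
            out.append(ch)
    return "".join(out)


print(to_tex("Růžička"))
```

Letters outside the table pass through unchanged, so the function can be applied to whole text blocks.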
DML-CZ OCR Workflow Implementation Gory Details
Contact me, no secrets, no patents!

Evaluation
Type of errors: T (text), D (diacritics), M (mathematics), L (layout)
Steps: 1 (FR1), 2 (FR2), 3 (Infty), 4 (OCRJoiner), 5 (IMLCorrector)

Step    T    D    M    L
1      10    0  224   82
2       4    0  170   78
3       4    0  168   71
4      14    0   24   15
5      14    0   24   15

DML-CZ OCR Results

Picture   FR 1     FR 2     FR8.0 PE  IR       IR fixed
1         84.99%   88.03%   88.46%    97.48%   97.48%
2         86.93%   88.76%   88.07%    98.97%   98.97%
3         89.19%   92.35%   91.53%    99.18%   99.18%
4         93.40%   93.52%   95.78%    99.15%   99.19%
5         91.09%   91.62%   92.15%    99.87%   99.87%
6         79.46%   80.05%   82.25%    99.61%   99.61%
7         92.59%   93.39%   93.71%    99.09%   99.09%
8         91.33%   91.33%   98.30%    98.18%   98.61%
Average   88.65%   89.90%   91.23%    98.97%   99.02%

OCR: Conclusions
- less than 1% error rate (counting all types of errors).
- still room for improvement (better text/math separation and Unicode support in InftyReader)
- still room for better robustness and precision
- several bachelor (Vystrčil) and diploma theses (Panák, Mudrák) using FR SDK

Contributions (part two)
Workflow of bulk retro-digitization. We have described and developed a new complex digitization workflow, fine-tuned to the specifics of the mathematical community. Procedures were engineered with respect to maximal quality of results, scalability, maximal automation, efficiency and effectiveness of processing large volumes of text and graphics.
The optimization of optical character recognition. We have verified and implemented OCR technology based on automated multi-phase character recognition. We have evaluated the technology to reach less than 1% character error rate, counting even errors in character font type and size.

Contributions (part two, cont.)
A new framework for retro-born-digital documents. We have designed and implemented a workflow for processing journal issues from the period for which (semi)final data are available in electronic form. Procedures for conversion of the (meta)data needed for a digital library were developed, and data for Archivum Mathematicum from the period 1992-2007 were prepared as a testbed.
The foundation for born-digital document processing. We have designed and implemented a workflow for processing born-digital mathematical journal issues in such a way that all metadata needed for a digital library are secured and exported during preparation of the printed issue. The workflow is applied by the production team of Archivum Mathematicum, published by Masaryk University, and respects all of today's demands of search engine optimization.
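The slides do not say how the "less than 1% character error rate" was computed. One common definition, assumed here purely for illustration, divides the Levenshtein (edit) distance between the OCR output and a ground-truth transcription by the length of the ground truth. The function names are hypothetical, and this simple version ignores the font type and size errors that the evaluation above also counted.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, recognized):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, recognized) / len(reference)


# The classic "rn" read as "m" confusion costs two edits here.
print(character_error_rate("Brno", "Bmo"))
```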
Contributions (part two, cont.)
Contributions to the design of a digital library of mathematical papers. A design of the procedures realized in the digital mathematics library.
The solution of automated classification of mathematical documents. Machine learning approach to the classification of mathematical papers according to the widely accepted Mathematics Subject Classification.

Summary and conclusions
We should experiment; we should try out new things; we should tinker with technology and find better ways to communicate.
-- John Ewing (2002)
- Technology of competing patterns development: methods of stratification, bootstrapping and multi-level generation shown on numerous segmentation problems; results significantly outperformed previous ones (e.g. those used in everyday TeX installations), lots of trees saved :-).
- Hyphenation methods for several languages are in everyday use; the method was applied with success to the problem of Thai segmentation.
- Single-source publishing shown to be very effective for delivering documents for different output devices and needs.

Summary and Conclusion (cont.)
- Methodology for digitization of 40,000+ pages of Ottův slovník naučný and 200,000+ pages of mathematical content in DML-CZ: http://dml.cz/ and http://project.dml.cz/.
- Machine learning methods for automated classification and similarity of mathematical papers.
- Collection of 14 papers (out of 70+, coauthored with David Antoš, Mirek Bartošek, Han The Thanh, Jan Holeček, Martin Lhoták, Zuzana Nevěřilová, Jiří Rákosník, Radim Řehůřek, Michal Růžička, Martin Šárfy, Jiří Zlatuška). 50+ citations.
S. Lawrence, C. L. Giles, and K. Bollacker: Digital Libraries and Autonomous Citation Indexing. Computer, June 1999, pp. 67-71.
M. Bartošek, M. Lhoták, J. Rákosník, P. Sojka, M. Šárfy: DML-CZ: The Objectives and the First Steps. Book chapter in a forthcoming book by A.K. Peters Ltd., 2008, pp. 69-79.
Eisenbud: World Digital Mathematics Library. A presentation to the Gordon and Betty Moore Foundation, August 19, 2004.
R. Řehůřek, P. Sojka: Automated Classification and Categorization of Mathematical Knowledge. In: Intelligent Computer Mathematics [Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008], LNCS/LNAI 5144, Springer, pp. 543-557.
P. Sojka: DML-CZ: From Scanned Image to Knowledge Sharing. In: Klaus Tochtermann, Hermann Maurer (Eds): Proceedings of KSR @ I-Know 2005, 5th International Conference on Knowledge Management, pp. 664-672, June 29 - July 1, 2005, Graz.
P. Sojka, J. Rákosník: From Pixels and Minds to the Mathematical Knowledge in a Digital Library. DML 2008, pp. 17-27, Birmingham, UK.
P. Sojka, M. Růžička: Single-source publishing in multiple formats for different output devices. TUGboat, 29(1):118-124. ISSN 0896-3207. January 2008.
M. Suzuki, F. Tamari, R. Fukuda, S. Uchida and T. Kanahori: INFTY: An integrated OCR system for mathematical documents. Proceedings of DocEng 2003, Grenoble, France.
A. Shapiro: TouchGraph LLC at SourceForge, 2004. Available from: http://touchgraph.sourceforge.net/.
E. Tufte: Envisioning Information. Graphics Press, 1990.

Coauthors: Thanks!
David Antoš, Mirek Bartošek, Han The Thanh, Jan Holeček, Martin Lhoták, Zuzana Nevěřilová, Jiří Rákosník, Radim Řehůřek, Michal Růžička, Martin Šárfy, Jiří Zlatuška