Transfer Learning of Slavic Syllabification for Hyphenation Patterns
Ondřej Sojka
Faculty of Informatics, Masaryk University
October 16, 2024

Contents
- Why this problem
- Introduction to Hyphenation Patterns
- Approach
- Methodology
- Transfer of hyphens
- Conclusion
- Bibliography

Section 1 Why this problem

Section 2 Introduction to Hyphenation Patterns

Patterns (of hyphenation) that compete with each other [1]
- A pattern is a substring with a piece of information about hyphenation between characters: hy3ph he2n n2at hen5at
- Odd numbers permit hyphenation, even numbers forbid it.
- Patterns are as short as possible so that they are as general as possible (new compound words, etc.).
- Patterns compete with each other: instead of one big set of patterns, they are decomposed into layered sets generated in levels:
  p1: hyphenating patterns generated in level 1
  p2: inhibiting patterns (exceptions to p1)
  p3: hyphenating patterns that cover what "p1 ∧ ¬p2" has not covered
  …

Hyphenation lookup: an instance of the dictionary problem

  word    h y p h e n a t i o n
  p1      1na
  p1      1tion
  p2      n2at
  p2      2io
  p2      he2n
  p3      hy3ph
  p4      hena4
  p5      hen5at
  result  h0y3p0h0e2n5a4t2i0o0n

  key → data
  hy-phen-ation → 2 6
  … → …

- The solution to the dictionary problem: for the key part (the word), store the data part (its division).
- Given an already hyphenated word list of a language (a dictionary), how do we generate the patterns?
- Liang's task was: fewer than 5,000 patterns and less than 30,000 bytes per language in the format file (RAM during a TeX run).
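The lookup on the "Hyphenation lookup" slide can be sketched in a few lines of Python. This is a minimal illustration, not the TeX implementation: the function names (`parse_pattern`, `hyphenate`, `interleave`) are made up for this sketch, only the eight patterns shown on the slide are used, and the word-boundary markers and minimum fragment lengths that real hyphenators apply are left out. At every inter-letter position the highest value of any matching pattern wins, and an odd winner permits a break.

```python
import re

# The eight patterns shown on the slide (levels p1..p5).
PATTERNS = ["1na", "1tion", "n2at", "2io", "he2n", "hy3ph", "hena4", "hen5at"]


def parse_pattern(pattern):
    """Split 'hen5at' into its letters 'henat' and inter-letter values [0, 0, 0, 5, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0  # number of letters seen so far = index of the current gap
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns):
    """Return (levels, hyphenated word); levels[i] is the value in the gap before word[i]."""
    levels = [0] * (len(word) + 1)
    for letters, values in map(parse_pattern, patterns):
        start = word.find(letters)
        while start != -1:
            # Competing patterns: the maximum value at each gap wins.
            for offset, value in enumerate(values):
                levels[start + offset] = max(levels[start + offset], value)
            start = word.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(1, len(word)):
        if levels[i] % 2 == 1:  # odd values permit a hyphen, even values forbid it
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return levels, "-".join(pieces)


def interleave(word, levels):
    """Render the word with the winning value between each pair of letters."""
    out = [word[0]]
    for i in range(1, len(word)):
        out.append(str(levels[i]))
        out.append(word[i])
    return "".join(out)


levels, result = hyphenate("hyphenation", PATTERNS)
print(interleave("hyphenation", levels))  # h0y3p0h0e2n5a4t2i0o0n
print(result)                             # hy-phen-ation
```

The odd values end up at positions 2 and 6, which is the "2 6" stored as the data part for the key on the slide.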
hyphen.tex generation by patgen (Liang, 1983) [1]

  level  parameters  patterns  good    bad     good %  bad %
  1      1 2 20 (4)  458       67,604  14,156  76.6%   16.0%
  2      2 1 8 (4)   509       7,407   11,942  68.2%   2.5%
  3      1 4 7 (5)   985       13,198  551     83.2%   3.1%
  4      3 2 1 (6)   1647      1,010   2,730   82.0%   0.0%
  5      1 ∞ 4 (8)   1320      6,428   0       89.3%   0.0%

- A total of 4,919 patterns were obtained in hyphen.tex (27,860 bytes) from Webster's Pocket dictionary (30,000+ words only).
- Suffix-compressed packed trie occupying 5,943 locations, with 181 outputs (less than 1% of the original word list).
- The patterns find 89.3% of the hyphens in the dictionary.
- 109 passes through the dictionary are needed.
- Generation required about 1 hour of CPU time on PDP-11.

tex-hyphen [3]
https://hyphenation.org is the canonical source of hyphenation patterns for most software:
- TeX
- web browsers
- LibreOffice
- Android (Kindle too!)
- …

Section 3 Approach

[haɪfəˈneɪʃənˌ]
- The quality of patterns is inconsistent across Slavic languages.
- Pronunciation, on which syllabic hyphenation is based, is quite similar.
- Patterns for some languages are really good.
- We can do better.

Pronunciation similar, orthography different
  Пра-га
  Pra-ha
  Pra-ga

International Phonetic Alphabet
  ˌɪntərˈnæʃənəl fəˈnɛtɪk ˈælfəˌbɛt

Anti-goals
- Impose my opinions, as a non-native speaker, on the resulting patterns (I am not qualified to do so).
- Improve patterns that are already good.

Goals
- Use transfer learning to improve patterns for languages with no or subpar current patterns.
- Develop and deploy a methodology for pattern development through transfer learning across several languages in one language family.

Section 4 Methodology

[Figure: workflow diagram. Nodes: Wikipedia dataset (cs wiki, sh wiki, pl wiki, ...); hyphenated wordlists (cs.wlh, sh.wlh, pl.wlh, ...); IPA hyphenated wordlists (cs.ipa.wlh, sh.ipa.wlh, pl.ipa.wlh, ...); weights (w1, w2, w3, ..., wi); joint IPA patterns; new IPA hyphenated wordlists (cs.ipa.wlh, sh.ipa.wlh, pl.ipa.wlh, ...); single language patterns (cs.pat, sh.pat, pl.pat, ...)]

Source wordlists
- As far as I know, clean single-language wordlists are hard to acquire.
- They were previously provided (for Czech and Slovak) by Lexical Computing, which is now unwilling to do so.
- Reproducibility is important ⇒ cleaned Wikipedia.
- Colloquial terms are not represented.

Transfer of hyphens to IPA
- espeak-ng [2] is used to generate the IPA.
- It is consistent across 127 languages.
- The transfer is not trivial!
Transfer of hyphens
- Task: shro - maž - ďo - va - cí + shrˈomaʒɟˌovatsiː ⇒ shrˈo - maʒ - ɟˌo - va - tsiː
- The IPA of a character depends on the surrounding characters.
- Where do we put the hyphens?

Transfer of hyphens
  GCATGCG
  GATTACA
  ---
  GCAT-GCG
  G-ATTACA
Needleman-Wunsch, an algorithm for global alignment.

Generation of joint IPA patterns
[Figure: workflow diagram excerpt: weights (w1, w2, w3, ..., wi) and joint IPA patterns]
- The weights of the IPA-hyphenated wordlists are crucial to well-performing final patterns.
- They are optimized according to ground-truth source hyphenation data.
- Patterns can learn IPA well: good 99.81 %, bad 0.28 %, missing 0.19 %.
- The challenge is not to overfit; the patterns can infer the language and reproduce the original errors.
- This won't fix out-of-distribution samples; that is an anti-goal.

Source hyphenated wordlist data
- We need ground truth to optimize the weights.
- We need ground truth to validate (separate from the optimization of weights!).
- We will probably use native speakers (preferably linguists) for this.
- Very few language institutes provide hyphenated words.
- Few dictionaries provide hyphenation.
- There is a severe lack of definitively correctly hyphenated words.
- Do you know a good source of hyphenated words for your language?
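The hyphen-transfer task from the "Transfer of hyphens" slides (orthographic hyphenation plus an IPA transcription in, hyphenated IPA out) can be illustrated with a small Needleman-Wunsch alignment in Python. This is only a sketch under stated assumptions, not the workflow's actual code: the scoring values, the `MODIFIERS` set, the diacritic folding, the tie-breaking order in the traceback, and all function names are choices made for this example. The talk itself only names the alignment algorithm.

```python
import unicodedata

# Assumption: stress and length marks have no orthographic counterpart,
# so leaving them unaligned costs nothing.
MODIFIERS = set("ˈˌː")
LETTER_GAP = -1  # assumed cost of a letter with no IPA symbol


def fold(ch):
    """Strip diacritics so that, e.g., 'í' compares equal to 'i'."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def pair_score(letter, symbol):
    """Assumed scores: +2 for a (folded) match, -1 for a mismatch."""
    return 2 if fold(letter) == fold(symbol) else -1


def ipa_gap(symbol):
    """Cost of an IPA symbol with no letter."""
    return 0 if symbol in MODIFIERS else -1


def align(letters, ipa):
    """Global alignment; returns pairs (letter index or None, IPA index or None)."""
    n, m = len(letters), len(ipa)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + LETTER_GAP
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + ipa_gap(ipa[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(letters[i - 1], ipa[j - 1]),
                score[i - 1][j] + LETTER_GAP,
                score[i][j - 1] + ipa_gap(ipa[j - 1]),
            )
    # Traceback. On ties an unaligned IPA symbol is preferred, so a letter
    # ends up paired with the earliest candidate symbol (c with t in "ts").
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if j > 0 and score[i][j] == score[i][j - 1] + ipa_gap(ipa[j - 1]):
            pairs.append((None, j - 1))
            j -= 1
        elif i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + pair_score(letters[i - 1], ipa[j - 1])
        ):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        else:
            pairs.append((i - 1, None))
            i -= 1
    pairs.reverse()
    return pairs


def transfer_hyphens(hyphenated, ipa):
    """Copy each hyphen to just before the IPA symbol aligned with the next letter."""
    hyphenated = unicodedata.normalize("NFC", hyphenated)
    ipa = unicodedata.normalize("NFC", ipa)
    letters = hyphenated.replace("-", "")
    breaks, count = set(), 0  # indices of letters that start a new syllable
    for ch in hyphenated:
        if ch == "-":
            breaks.add(count)
        else:
            count += 1
    out = []
    for i, j in align(letters, ipa):
        if i is not None and i in breaks and out:
            out.append("-")
        if j is not None:
            out.append(ipa[j])
    return "".join(out)


print(transfer_hyphens("shro-maž-ďo-va-cí", "shrˈomaʒɟˌovatsiː"))
# shrˈo-maʒ-ɟˌo-va-tsiː
```

With these particular scores the example from the slide comes out as shown, but the result depends on the tie-breaking: if the traceback preferred the diagonal move, "c" would pair with "s" and the last hyphen would land inside "ts". That sensitivity is one concrete reason the transfer is not trivial.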
Generation of joint IPA patterns
[Figure: workflow diagram excerpt: weights (w1, w2, w3, ..., wi) and joint IPA patterns]
- The weights of the IPA-hyphenated wordlists are crucial to well-performing final patterns.
- They are optimized according to ground-truth source hyphenation data.
- To avoid a grid search in the parameter (weight) space, train a surrogate model and sample the weights to evaluate.

Transfer of hyphens from IPA to original
- Input: new IPA hyphenated wordlists (cs.ipa.wlh, sh.ipa.wlh, pl.ipa.wlh, ...).
- The approach is similar to the transfer from the original to IPA.

Final single-language patterns
- Output: single language patterns (cs.pat, sh.pat, pl.pat, ...).
- Easy to generate.
- Hard to evaluate in the absence of reliable ground truth:
  - at least two native speakers hyphenate words; where they match, the hyphenation is considered good enough;
  - compute the probability of improvement with the new patterns; if p > 0.95, propose them for inclusion into tex-hyphen [3].

[Figure: workflow diagram, repeated from the start of the Methodology section]

Section 5 Conclusion

Results on a validation wordlist of size 15714: which one is best?
1. 13106 good, 4609 bad, 26574 missed
2. 19394 good, 7745 bad, 20286 missed
3. 15091 good, 4951 bad, 24589 missed
4. 25210 good, 13154 bad, 14470 missed
5. 19308 good, 7620 bad, 20372 missed

Candidates for results 1 to 4 (first shown shuffled): current Ukrainian patterns, transfer from 100 % Slovak, transfer from 100 % Ukrainian, transfer from 100 % Russian.

Answer:
1. transfer from 100 % Russian
2. transfer from 100 % Ukrainian
3. transfer from 100 % Slovak
4. current Ukrainian patterns
5. approx. 1:1 sk:uk mix

Results
- There is reason to believe that just through transfer, we can improve the patterns!
- Arguably this is the garbage in, garbage out approach, because those are terrible results.
- So we can transfer, but ideally we would like to get something in between the original and the transferred patterns for better coverage.
- Obviously we can grid-search various weight combinations, but can we be smarter about it?

More than weights to tweak!
- 18183 good, 7857 bad, 21497 missed: German 8-level parameters
- 10276 good, 3514 bad, 29404 missed: custom correctoptimized
- 12595 good, 3850 bad, 27085 missed: custom sizeoptimized

Results
- It is feasible to significantly improve at least the current Polish, Croatian, Serbian, and Ukrainian patterns.
- The approach is applicable to other language families.
- A reproducible workflow has been released [4].

Section 6 Bibliography

[1] Franklin M. Liang. "Word Hy-phen-a-tion by Com-put-er." PhD thesis. Stanford University, Aug. 1983, p. 44. URL: https://tug.org/docs/liang/liang-thesis.pdf.
[2] Jonathan Reynolds. eSpeak NG. Version 1.50. 2016. URL: https://github.com/espeak-ng/espeak-ng.
[3] Arthur Rosendahl and Mojca Miklavec. TeX hyphenation patterns. Accessed 2024-07-16. 2023. URL: http://hyphenation.org/tex.
[4] Ondřej Sojka and Petr Sojka. patterns workflow repository. URL: https://github.com/tensojka/patterns.