Low-Resource Machine Translation
Edoardo Signoroni

Introduction
● MT is the task of translating a sentence from a source language into the corresponding sentence in the target language.
● Nowadays, it is done with neural machine translation systems trained on parallel corpora.
● Main issues:
– Linguistic ambiguity, e.g. "It's raining cats and dogs."
– DATA SCARCITY

What are LRLs?
● There are >7000 living languages, plus varieties, dialects, slangs, code-switching, code-mixing, ... and more.
● Most of these are "Left-Behinds", or low-resource languages, since the biggest MT system online supports a grand total of 243 (or 3.47%).
Blasi et al. (2022), Joshi et al. (2020)

What are LRLs?
Joshi et al. (2020) define LRLs in incremental classes:
– "Have exceptionally limited resources, and have rarely been considered in language technologies."
– "Have some unlabelled data; however, collecting labelled data is challenging."
– "A small set of labelled datasets has been collected, and language support communities are there to support the language."
[Table: Joshi et al. classification, "adapted" from Ranathunga et al. (2023) and Joshi et al. (2020), showing where Italian and Czech fall.]

What are LRLs?
● For MT, there is no standard definition. Usually a pair is LR if the size of its parallel corpora is <500k sentences, and extremely LR below 100k pairs.
● If no data is available, we enter the zero-shot setting.
● For scale: the WMT22 deu-dsb data is 40k sentences / 500k words; for comparison, The Good Soldier Švejk is 200k words and the New Testament 185k words.

Why work on LRLs?
- decreasing the digital divide http://labs.theguardian.com/digital-language-divide/
- dealing with inequalities of information access and production
- mitigating cross-cultural biases
- deploying NLP technologies for underrepresented languages
- understanding cross-linguistic differences
- preserving linguistic diversity: ~3000 languages (43%) are endangered; 90% of all languages will be extinct within 100 years; in the best-case scenario, only 50% will survive, and just 10% are considered safe during the next century https://www.endangeredlanguages.com/about_importance/
Given this variability, always highlight clearly the languages you are working on (Bender Rule & Data Statements).

How is it done?
● Currently, the state of the art for HRLs is NMT.
● Several approaches have been proposed for LRLs, too.
● But most of the impact can be obtained with careful and clever use of the data we have.
Ranathunga et al. (2023)

How is it done? - Current Methods
● Multilingual NMT transfer learning is the current state of the art.
● The best results are obtained by using data from related HRL pairs and fine-tuning pre-trained NMT models on the related data or on the small amount of LRL text available (a minimal fine-tuning sketch follows the Language Relatedness slide below).
● There are still issues with performance and equitable access.
[Diagram: a multilingual ENG>INDIC MT system trained on ENG>BEN, ENG>MNI, ENG>LUS, ... data is fine-tuned on ENG>ASM data to obtain a fine-tuned ENG>ASM MT system, translating e.g. "Hello, there!" into Assamese.]

Language Relatedness
● It is beneficial to use related languages for transfer between HRLs and LRLs.
● However, the extent of this is not clear. Which kind of relatedness is the most helpful?
– Genealogical? မြန်မဘသ (Burmese, Tibeto-Burman) > মৈতৈলোন (Manipuri, Tibeto-Burman)
– Typological? हिन्दी (Hindi, Indo-Aryan, SOV) > মৈতৈলোন (Manipuri, SOV)
– Writing system? বাংলা (Bengali, Bengali script) > মৈতৈলোন (Manipuri, Bengali script)
● How can we better leverage and disentangle these factors?
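To make the transfer-learning recipe above concrete, here is a minimal fine-tuning sketch. It is not the exact setup from these slides: the checkpoint (facebook/nllb-200-distilled-600M), the NLLB language codes (eng_Latn, asm_Beng), the file eng-asm.train.jsonl, and all hyperparameters are illustrative assumptions.

```python
# Minimal transfer-learning sketch: fine-tune a pre-trained multilingual NMT
# model on a small LRL parallel corpus. Model name, language codes, file name,
# and hyperparameters are illustrative assumptions, not the setup from the talk.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/nllb-200-distilled-600M"   # pre-trained multilingual NMT
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          src_lang="eng_Latn", tgt_lang="asm_Beng")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Small eng-asm parallel corpus, one {"src": ..., "tgt": ...} object per line.
data = load_dataset("json", data_files={"train": "eng-asm.train.jsonl"})

def preprocess(batch):
    # Tokenize source and target sides; the target ids become the labels.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

train = data["train"].map(preprocess, batched=True,
                          remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-eng-asm-ft",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=3,
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In practice, the parent model or parent data would be chosen along the relatedness dimensions discussed above (genealogy, typology, writing system).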
An Example: WMT23 Indic LR MT
● 4 low-resource Indic languages (asm, kha, lus, mni) <> English.
● Collated training datasets on a same-script basis (asm & mni; kha & lus), and for all languages together.
● Trained systems on the collated data, and fine-tuned child systems for the single directions.
● This was the best option for kha>eng and lus>eng; mni>eng was better with the same-script parent.

Zero-shot and Relatedness
● Most of the time, no data for the LRL is available → zero-shot.
● Fine-tuning a pre-trained LLM with data from a related language helps (e.g. data from other Slavic languages when translating into Silesian).
● However, the internal, fine-grained relatedness of the languages, or their presence in the pre-training data, seems not to matter.

Tokenization
● An MT system is a sequence-to-sequence model, which takes words as input and generates words as output.
● Thus it needs a vocabulary of tokens: words, in the simplest implementation.
● Dealing with morphological variants and variation leads to huge vocabulary sizes and to out-of-vocabulary words, not seen in training.

Tokenization
● Text is segmented into subwords with data-driven iterative algorithms.
● These subwords are combined together to deal with unknown words, but they still struggle with complex morphology, non-standard forms, linguistic diversity, ...
● Character, hybrid, token-free, and even pixel-level approaches have been proposed to overcome such challenges.

Tokenization Impact on NMT
● Tokenization impacts the quality of downstream NMT, especially for LRLs, so choosing its parameters carefully is crucial.

An Example: WMT22 LR MT
● 2 LRLs: Lower & Upper Sorbian.
● By using our custom HFT tokenizer to obtain more frequent, and thus better represented, tokens, we outperformed the default BPE approach using only the given LR parallel corpora (40k and 450k sentence pairs).

Tokenization Impact on NMT
As we set a higher value for the vocabulary size, we get:
● fewer characters;
● slightly more subwords, and more mixed-use tokens at first;
● more full words;
● but also lower quality.
A more "balanced" mix of characters, subwords, and words generalizes better to unseen data than a word-heavy vocabulary.

Tokenization Impact on NMT
As we set a higher value for the vocabulary size, we naturally get longer tokens and shorter lines:
250: A m b▁ as s ad or M s . N i k k i H▁ ▁ ▁ al e y , U n▁ it ed S t▁ at es P▁ er m an ent R e p▁ res ent at ive to▁ the▁ U n▁ it ed N▁ ation s
500: A m b▁ ass ad or M s . N i k k i H▁ ▁ ▁ al e y , Un▁ it ed S t▁ at es P▁ er m an ent R▁ ep res ent at ive to▁ the▁ Un▁ it ed N▁ ations
1k: A m b▁ ass ad or M s . N i k k i H▁ ▁ ▁ al e y , Un▁ ited St▁ ates P▁ er m an ent R▁ ep res ent ative to▁ the▁ Un▁ ited N▁ ations
2k: A▁ mb ass ad or M s . N▁ ▁ ik k i H▁ ale y , Un▁ ited St▁ ates P▁ er man ent Rep▁ res ent ative to▁ the▁ Un▁ ited N▁ ations
4k: Amb▁ ass ad or M s . N▁ ▁ ik k i H▁ ale y , Un▁ ited States▁ P▁ er man ent Rep▁ res ent ative to▁ the▁ Un▁ ited N▁ ations
8k: Ambassador▁ Ms▁ . N▁ ikk i H▁ ale y , United▁ States▁ Per▁ man ent Rep▁ res ent ative to▁ the▁ United▁ Nations▁
16k: Ambassador▁ Ms▁ . N▁ ikk i H▁ ale y , United▁ States▁ Permanent▁ Repres▁ ent ative to▁ the▁ United▁ Nations▁
32k: Ambassador▁ Ms▁ . N▁ ikk i Haley▁ , United▁ States▁ Permanent▁ Represent▁ ative to▁ the▁ United▁ Nations▁
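The vocabulary-size effect shown above can be reproduced with any data-driven subword tokenizer. The sketch below uses SentencePiece BPE as a stand-in for the custom HFT tokenizer mentioned earlier (it is not that tokenizer); the training file corpus.dsb.txt and the chosen sizes are assumptions.

```python
# Sketch of how vocabulary size changes subword segmentation.
# Assumptions: SentencePiece BPE as a stand-in for the HFT tokenizer from the
# talk, and a hypothetical plain-text training file corpus.dsb.txt.
import sentencepiece as spm

sentence = ("Ambassador Ms. Nikki Haley, United States Permanent "
            "Representative to the United Nations")

for vocab_size in (250, 1000, 4000, 16000):
    spm.SentencePieceTrainer.train(
        input="corpus.dsb.txt",           # small LR corpus, one sentence per line
        model_prefix=f"bpe_{vocab_size}",
        vocab_size=vocab_size,
        model_type="bpe",
        character_coverage=1.0,
    )
    sp = spm.SentencePieceProcessor(model_file=f"bpe_{vocab_size}.model")
    pieces = sp.encode(sentence, out_type=str)
    # Larger vocabularies yield longer pieces and shorter token sequences.
    print(vocab_size, len(pieces), " ".join(pieces))
```

Larger vocabularies produce fewer, longer pieces per sentence, which is exactly the trade-off between word-heavy and more "balanced" vocabularies discussed above.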
Smaller Vocabularies
● Pre-trained models use huge vocabularies to account for all of the training data, and they require heavy computational resources.
● If carefully tuned, "traditional" trained-from-scratch systems can achieve meaningful representations at a fraction of the computational size and cost, even in extremely LR conditions.
● In particular, smaller vocabulary sizes most often lead to:
– better MT performance
– smaller model size
– faster training times

Automated Metrics for LR MT
● Automated metrics allow for low-cost, fast comparison of systems.
● Two types are relevant for LR MT:
– Lexical overlap metrics compare the sequence similarity between the proposed translation and one or more references: BLEU (Papineni et al. 2002), ChrF (Popović 2015, 2017).
– Neural metrics are LMs fine-tuned on human judgements that predict a score from a given input of source, translation, and reference: COMET, xCOMET (Rei et al. 2020).

Automated Metrics for LR MT
● While neural metrics are the state of the art, they perform poorly for LRLs.
● Fine-tuning COMET models to LRLs was shown to be promising: IndicCOMET (Sai B et al. 2023); AfriCOMET (Wang et al. 2023).
● If this is not possible, ChrF(++) was deemed the best back-off metric (a minimal scoring sketch follows the conclusions below).

Some Conclusions
● Working on LRLs is important for several linguistic, social, and democratic reasons.
● Multilingual NMT approaches involving transfer learning are currently the state of the art for LRL MT,
● but they still have various issues regarding their performance and equitable access.
● Careful tuning of the parameters and clever use of the training data go a long way towards alleviating the problems of LR MT.
● Some best practices, such as clearly highlighting the LRLs studied and using fitting metrics to evaluate the output of MT, are also important.
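As referenced in the metrics slides above, here is a minimal scoring sketch for the lexical-overlap metrics. It assumes the sacreBLEU implementation; the hypotheses and references are invented examples. Neural metrics such as COMET are distributed separately (e.g. in the unbabel-comet package) and are not shown here.

```python
# Minimal evaluation sketch with lexical-overlap metrics (BLEU and chrF++).
# Assumption: sacreBLEU as the implementation; hypotheses/references are toy data.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = [
    "The ambassador spoke at the United Nations .",
    "It is raining cats and dogs .",
]
# One reference stream (one reference per hypothesis); more streams can be added.
references = [[
    "The ambassador gave a speech at the United Nations .",
    "It is pouring rain .",
]]

bleu = BLEU()
chrf = CHRF(word_order=2)   # word_order=2 gives chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
```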