Beyond Parallel Corpora
Philipp Koehn
22 October 2024

data and machine learning

Supervised and Unsupervised
• We framed machine translation as a supervised machine learning task
  – training examples with labels
  – here: input sentences with translations
  – structured prediction: output has to be constructed in several steps
• Unsupervised learning
  – training examples without labels
  – here: just sentences in the input language
  – we will also look at using just sentences in the output language
• Semi-supervised learning
  – some labeled training data
  – some unlabeled training data (usually more)
• Self-training
  – make predictions on unlabeled training data
  – use the predicted labels as supervised training data

Transfer Learning
• Learning from data similar to our task
• Other language pairs
  – first, train a model on a different language pair
  – then, train on the targeted language pair
  – or: train jointly on both
• Multi-task training
  – train on a related task first
  – e.g., part-of-speech tagging
• Share some or all of the components

using monolingual data

Using Monolingual Data
• Language model
  – trained on large amounts of target language data
  – better fluency of output
• Key to the success of statistical machine translation
• Neural machine translation
  – integrate a neural language model into the model
  – create artificial data with backtranslation

Adding a Language Model
• Train a separate language model
• Add it as conditioning context to the decoder
• Recall the state progression in the decoder
  – decoder state s_i
  – embedding of previous output word E y_{i-1}
  – input context c_i
    s_i = f(s_{i-1}, E y_{i-1}, c_i)
• Add the hidden state of the neural language model s_i^{LM}
    s_i = f(s_{i-1}, E y_{i-1}, c_i, s_i^{LM})
• Pre-train the language model
• Leave its parameters fixed during translation model training

Refinements
• Balance the impact of the language model vs. the translation model
• Learn a scaling factor (gate)
    gate_i^{LM} = f(s_i^{LM})
• Use it to scale the values of the language model state
    s̄_i^{LM} = gate_i^{LM} × s_i^{LM}
• Use this scaled language model state for the decoder state
    s_i = f(s_{i-1}, E y_{i-1}, c_i, s̄_i^{LM})

backtranslation

Back Translation
• Monolingual data is parallel data that misses its other half
• Let's synthesize that half
[Figure: a reverse system translates target-language monolingual data into synthetic source sentences, which train the final system]

Back Translation
• Steps
  1. train a system in the reverse translation direction
  2. use this system to translate target-side monolingual data
     → synthetic parallel corpus
  3. combine the generated synthetic parallel data with real parallel data to build the final system
• Roughly equal amounts of synthetic and real data
• Useful method for domain adaptation
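To make the data flow of these three steps concrete, here is a minimal Python sketch. train_model and translate are hypothetical stand-ins for a real NMT toolkit (e.g., fairseq or Marian); only the way real and synthetic data are combined is meant literally.

```python
# Minimal sketch of the back-translation recipe above.
# train_model() is a hypothetical stand-in for a real NMT toolkit;
# it is assumed to return a model with a .translate(sentence) method.

def backtranslation(parallel, mono_target):
    """parallel: list of (src, tgt) sentence pairs;
    mono_target: list of target-language sentences."""
    # Step 1: train a reverse system (target -> source) on the real parallel data
    reverse_system = train_model([(tgt, src) for src, tgt in parallel])

    # Step 2: translate target-side monolingual data into the source language
    synthetic = [(reverse_system.translate(tgt), tgt) for tgt in mono_target]

    # Step 3: combine synthetic and real data, roughly in equal amounts
    synthetic = synthetic[:len(parallel)]
    return train_model(parallel + synthetic)
```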
Iterative Back Translation
• Quality of the backtranslation system matters
• Build a better backtranslation system ... with backtranslation
[Figure: back system 1 creates synthetic data for back system 2, which creates synthetic data for the final system]

Iterative Back Translation
• Example: a better system for backtranslation matters

  German–English        Back    Final
  no back-translation     –     29.6
  10k iterations*       10.6    29.6 (+0.0)
  100k iterations*      21.0    31.1 (+1.5)
  convergence           23.7    32.5 (+2.9)
  re-back-translation   27.9    33.6 (+4.0)

  * = limited training of the back-translation system

Variants
• Copy target
  – if there is no good neural machine translation system to start with
  – just copy the target language text to the source side
• Forward translation
  – synthesize training data in the same direction as training
  – self-training (inferior but sometimes successful)

Round Trip Training
• We could iterate through the steps of
  – train system
  – create synthetic corpus
• Dual learning: train models in both directions together
  – translation models F → E and E → F
  – take a sentence f
  – translate it into a sentence e′
  – translate that back into a sentence f′
  – training objective: f should match f′
• This setup could be fooled by just copying (e′ = f)
  ⇒ score e′ with a language model for language E,
    add the language model score as a cost to the training objective

Round Trip Training
[Figure: sentences f and e cycle between MT F→E and MT E→F, scored by language models LM E and LM F]

monolingual pretraining

Low Resource Language Pairs
• Problem: not enough parallel data to even train a proper encoder or decoder
• Idea: use monolingual data
  – ... in the source language → initialize the encoder
  – ... in the target language → initialize the decoder
• How do we present monolingual data in training?

Masked Training
• Replace some input word sequences with a mask token (30% of words)
• Train a model MASKED → TEXT on both source and target text

  Why did the chicken cross the road?
  ⇑
  Why did <mask> chicken <mask> the road?

Reordering Sentences
• Reorder sentences (each training example has 3 sentences)

  Why did the chicken cross the road? The chicken wanted to get to the other side. There are some delicious sunflower seeds.
  ⇓
  The chicken wanted to get <mask> other <mask>. <mask> are some delicious <mask> seeds. Why did <mask> chicken <mask> the road?
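These two noising operations (masking and sentence reordering) are easy to state in code. A minimal sketch, assuming whitespace tokenization, a 30% masking rate, and <mask> as the mask symbol; the actual token and span distribution are model-specific (mBART, for instance, operates on subword units).

```python
import random

MASK = "<mask>"  # assumed mask symbol; the real token is model-specific

def noise(sentences, mask_rate=0.3):
    """Build the corrupted input for MASKED -> TEXT training:
    shuffle the sentences, then replace ~30% of words by the mask token.
    Adjacent masked words collapse into a single mask, so whole spans
    disappear, as in the slide examples above."""
    shuffled = random.sample(sentences, len(sentences))  # reorder sentences
    out = []
    for sentence in shuffled:
        for word in sentence.split():
            if random.random() < mask_rate:
                if not out or out[-1] != MASK:
                    out.append(MASK)
            else:
                out.append(word)
    return " ".join(out)

sentences = ["Why did the chicken cross the road?",
             "The chicken wanted to get to the other side.",
             "There are some delicious sunflower seeds."]
print(noise(sentences))  # corrupted input; the original text is the target
```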
Example: mBART
“Multilingual Denoising Pre-training for Neural Machine Translation” (Liu et al., 2020)
• 25 languages: from 55 billion words of English to 56 million words of Burmese
• Followed by training on parallel data
⇒ Helps with low-resource languages
  (but not for language pairs with more than 20 million sentence pairs of parallel data)

unsupervised machine translation

Monolingual Embedding Spaces
[Figure: English embeddings (dog, cat, lion) and German embeddings (Hund, Katze, Löwe) form similarly shaped point sets]
• Embedding spaces for different languages have a similar shape
• Intuition: the relationship between dog, cat, and lion holds independently of language
• How can we rotate the triangle to match up?

Matching Embedding Spaces
[Figure: after rotation, dog/Hund, cat/Katze, and lion/Löwe overlap in a shared space]
• Seed lexicon of identically spelled words, numbers, names
• Adversarial training: a discriminator predicts the language [Conneau et al., 2018]
• Match matrices of word similarity scores: Vecmap [Artetxe et al., 2018]

Bilingual Lexicon Induction
[Figure: monolingual data F and E → embeddings F and E → induced bilingual dictionary]
• Given a shared embedding space
⇒ matching points in the space = word translations
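One common way to compute the rotation is orthogonal Procrustes over a seed lexicon, which has a closed-form SVD solution; Vecmap-style methods iterate this with re-induced dictionaries. A minimal numpy sketch, where rows i of X and Y hold the embeddings of the i-th seed pair:

```python
import numpy as np

def align(X, Y):
    """Orthogonal Procrustes: find the rotation W minimizing ||X W - Y||_F.
    X, Y: (n_seed, dim) embeddings of seed translation pairs, row-aligned."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                      # optimal orthogonal map

def induce_lexicon(src_emb, tgt_emb, W):
    """Map all source embeddings into the target space and take the
    nearest target neighbor (by cosine) as the induced translation."""
    mapped = src_emb @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)  # best target word per source word
```

Real systems refine this: the induced dictionary is used to re-estimate W iteratively, and retrieval typically uses CSLS rather than plain cosine to counter hubness.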
Inferred Translation Model
• Translation model
  – induced word translations → statistical phrase translation table
    (translation probabilities derived from embedding similarity)
• Language model
  – target-side monolingual data → estimate a statistical n-gram language model
⇒ Statistical phrase-based machine translation system
[Figure: embeddings F and E → induced bilingual dictionary → translation model; monolingual data E → language model; together they form a statistical MT model]

Synthetic Training Data
• Create a synthetic parallel corpus
  – monolingual text in the source language
  – translate it with the inferred system: translations in the target language
[Figure: the statistical MT model translates monolingual data F into synthetic parallel data]

Iterate
• Iterate
  – predict data: generate translations for the monolingual corpus
  – predict model: estimate a model from the synthetic data
  – iterate this process, alternating between language directions
• Increasingly use a neural machine translation model to synthesize data
[Figure: statistical MT model → synthetic parallel data → neural MT model, fed by monolingual data F and E]

multiple language pairs

Multiple Language Pairs
• There are more than two languages in the world
• We may want to build systems for many language pairs
• Typical: train separate models for each
• Alternative: joint training

Multiple Input Languages
• Example
  – German–English
  – French–English
• Concatenate the training data
• The joint model benefits from exposure to more English data
• Shown beneficial in low-resource conditions
• Do the input languages have to be related? (maybe not)

Multiple Output Languages
• Example
  – French–English
  – French–Spanish
• Concatenate the training data
• Given a French input sentence, how do we specify the output language?
• Indicate the output language with a special tag
  [ENGLISH] N’y a-t-il pas ici deux poids, deux mesures?
  ⇒ Is this not a case of double standards?
  [SPANISH] N’y a-t-il pas ici deux poids, deux mesures?
  ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot Translation
• Example
  – German–English
  – French–English
  – French–Spanish
• We want to translate
  – German–Spanish
[Figure: one multilingual MT model connecting English, French, German, and Spanish]

Zero Shot
• Train on
  – German–English
  – French–English
  – French–Spanish
• Specify the translation
  [SPANISH] Messen wir hier nicht mit zweierlei Maß?
  ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot: Hype
[Figure: press coverage]

Zero Shot: Reality
• Bridged: pivot translation Portuguese → English → Spanish
• Models 1 and 2: zero-shot training
• Model 2 + incremental training: use of some training data in the language pair

Massively Multilingual Training
• Scaling up multilingual machine translation to more languages
  – many-to-English
  – English-to-many
  – many-to-many
• Mainly motivated by improving low-resource language pairs
• Move towards larger models

Translation Quality for 103 Languages
[Figure: translation quality across 103 languages (source: Google)]

Gains with Multilingual Training
[Figure: gains with multilingual training (source: Google)]

Romanization
[Figure: romanization (source: USC/ISI)]

Many-to-Many
• 7.5 billion sentences for 100 languages (mined from web-crawled data)
• Model with 15 billion parameters
• Improvements especially for low-resource languages

Even Bigger: NLLB (2022)
• No Language Left Behind: 200 languages
• Hand-translated test set: Flores-200
• Uses diverse data sources
  – public parallel data
  – translations created by professional translators
  – sentence pairs based on sentence embedding similarity
  – monolingual data for
    ∗ monolingual pre-training
    ∗ back-translation
    ∗ self-training
• Models of different scale (up to 54B parameters), publicly released

Different Amounts of Data per Language
• High-resource language pairs are undertrained
• Low-resource language pairs are overtrained
⇒ Oversample low-resource language pairs
• Data selection probability p_l for language pair l, based on corpus sizes D_k:
    p_l = (D_l / Σ_k D_k)^{1/T}
• Curriculum training: adding low-resource data only in later training stages
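A quick worked version of this sampling formula, with made-up corpus sizes and an illustrative temperature: T = 1 keeps the raw data proportions, while larger T flattens the distribution towards uniform, oversampling low-resource pairs. Note that the exponentiated values must be renormalized to form a probability distribution.

```python
def sampling_probs(corpus_sizes, T=5.0):
    """p_l = (D_l / sum_k D_k)^(1/T), renormalized to sum to one."""
    total = sum(corpus_sizes.values())
    weights = {l: (D / total) ** (1.0 / T) for l, D in corpus_sizes.items()}
    Z = sum(weights.values())
    return {l: w / Z for l, w in weights.items()}

# illustrative (made-up) corpus sizes in sentence pairs
sizes = {"de-en": 40_000_000, "ro-en": 600_000, "my-en": 50_000}
print(sampling_probs(sizes, T=1.0))  # proportional: de-en dominates
print(sampling_probs(sizes, T=5.0))  # flattened: low-resource oversampled
```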
Interference
• Many languages in the same representation space
• Beneficial: shared cognates, numbers, names, ...
• Harmful: a lot of accidental overlap in tokens that have different meanings
  – die — common German determiner
  – die — different meaning in English
• What can be done to avoid harmful interference?

Language-Specific Components
• Various design choices
  – language-specific encoder
  – language-specific decoder
  – language-specific adapter components
• Example: “Condensing Multilingual Knowledge with Lightweight Language-Specific Modules” (Xu et al., 2023)
  – language-specific parameters
  – shared parameters
  – self-distillation method to condense everything into the shared parameters

Mixture of Experts
• Conditional compute
• A gating mechanism decides which feed-forward block (expert) to use
• Allows scaling to many more parameters without increasing computational cost
  (a minimal routing sketch is given at the end of these notes)

document-level translation

The Importance of Document-Level Context
• Pronouns
  – I bought a table. It is pretty.
  – Ich kaufte einen Tisch. Er/sie/es ist schön.
• Better disambiguation
  – I have a lot of numbers. I still need to make the table.
• Terminological consistency

Why Not Document-Level Translation?
• The entire infrastructure is focused on the sentence level
  – training data available as sentence pairs
  – metrics defined at the sentence level
  – APIs typically operate at the sentence level
• This is slowly changing
  – scaling up transformers for multi-sentence translation [Junczys-Dowmunt et al., 2019]
  – document-level metrics, e.g., CTXPRO [Wicks et al., 2023]
  – release of training data in document-aligned format,
    e.g., Europarl, News Commentary, Paracrawl [Wicks et al., 2024]

questions?
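As promised on the mixture-of-experts slide above, a minimal sketch of the routing logic in numpy. This is not a real transformer layer: actual MoE layers use learned softmax gates, often top-2 routing, and load-balancing losses; only the conditional-compute idea is shown.

```python
import numpy as np

def moe_layer(x, experts, gate_W):
    """Top-1 gating: route each token to a single expert feed-forward block.
    x: (tokens, dim) activations; experts: list of callables (n, dim) -> (n, dim);
    gate_W: (dim, n_experts) gating weights (untrained here).
    Parameters grow with the number of experts, but only one expert
    runs per token, so per-token compute stays constant."""
    choice = (x @ gate_W).argmax(axis=1)   # pick one expert per token
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = expert(x[mask])    # only the chosen expert computes
    return out

# toy usage: 4 experts, each a random linear map
rng = np.random.default_rng(0)
dim, n_experts = 8, 4
experts = [lambda h, W=rng.standard_normal((dim, dim)): h @ W
           for _ in range(n_experts)]
gate_W = rng.standard_normal((dim, n_experts))
tokens = rng.standard_normal((5, dim))
print(moe_layer(tokens, experts, gate_W).shape)  # (5, 8)
```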