Beyond Parallel Corpora
Philipp Koehn
29 October 2020

data and machine learning

Supervised and Unsupervised
• We framed machine translation as a supervised machine learning task
– training examples with labels
– here: input sentences with their translations
– structured prediction: the output has to be constructed in several steps
• Unsupervised learning
– training examples without labels
– here: just sentences in the input language
– we will also look at using just sentences in the output language
• Semi-supervised learning
– some labeled training data
– some unlabeled training data (usually more)
• Self-training
– make predictions on unlabeled training data
– use the predicted labels as supervised training data

Transfer Learning
• Learning from data similar to our task
• Other language pairs
– first, train a model on a different language pair
– then, train on the targeted language pair
– or: train jointly on both
• Multi-task training
– train on a related task first
– e.g., part-of-speech tagging
• Share some or all of the components

using monolingual data

Using Monolingual Data
• Language model
– trained on large amounts of target language data
– better fluency of the output
• Key to the success of statistical machine translation
• Neural machine translation
– integrate a neural language model into the model
– create artificial data with back-translation

Adding a Language Model
• Train a separate language model
• Add it as conditioning context to the decoder
• Recall the state progression in the decoder
– decoder state $s_i$
– embedding of the previous output word $E y_{i-1}$
– input context $c_i$
$$s_i = f(s_{i-1}, E y_{i-1}, c_i)$$
• Add the hidden state of the neural language model $s_i^{LM}$
$$s_i = f(s_{i-1}, E y_{i-1}, c_i, s_i^{LM})$$
• Pre-train the language model
• Leave its parameters fixed during translation model training

Refinements
• Balance the impact of the language model vs. the translation model
• Learn a scaling factor (gate)
$$\text{gate}_i^{LM} = f(s_i^{LM})$$
• Use it to scale the values of the language model state
$$\bar{s}_i^{LM} = \text{gate}_i^{LM} \times s_i^{LM}$$
• Use this scaled language model state for the decoder state
$$s_i = f(s_{i-1}, E y_{i-1}, c_i, \bar{s}_i^{LM})$$

Back Translation
• Monolingual data is parallel data that misses its other half
• Let's synthesize that half
[Figure: target-side monolingual data is translated by a reverse system; the resulting synthetic pairs train the final system]

Back Translation
• Steps (sketched in code below)
1. train a system in the reverse translation direction
2. use this system to translate target-side monolingual data → synthetic parallel corpus
3. combine the synthetic parallel data with the real parallel data to build the final system
• Roughly equal amounts of synthetic and real data
• Useful method for domain adaptation
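As a concrete illustration of these steps, here is a minimal sketch in Python. The `translate_reverse` stub stands in for a trained reverse-direction NMT system, and the toy sentence pairs are made-up assumptions, not data from the slides.

```python
import random

def translate_reverse(sentence: str) -> str:
    """Stub for the reverse-direction system (target -> source).
    In practice this would call a trained NMT model."""
    return f"<synthetic source for: {sentence}>"

# Real parallel data: list of (source, target) pairs (hypothetical toy data).
parallel = [("ein Haus", "a house"), ("ein Auto", "a car")]

# Target-side monolingual data (input to step 2).
mono_tgt = ["a tree", "a river"]

# Step 2: back-translate monolingual target sentences into synthetic sources.
synthetic = [(translate_reverse(t), t) for t in mono_tgt]

# Step 3: combine synthetic and real data (roughly equal amounts) and shuffle.
training_data = parallel + synthetic
random.shuffle(training_data)

for src, tgt in training_data:
    print(src, "|||", tgt)
```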
Iterative Back Translation
• Quality of the back-translation system matters
• Build a better back-translation system ... with back-translation
[Figure: back system 1 produces data to train back system 2, which produces the data for the final system]

Iterative Back Translation
• Example German–English (scores of the back-translation and final systems):

                        Back    Final
  no back-translation    –      29.6
  *10k iterations       10.6    29.6 (+0.0)
  *100k iterations      21.0    31.1 (+1.5)
  convergence           23.7    32.5 (+2.9)
  re-back-translation   27.9    33.6 (+4.0)

  * = limited training of the back-translation system

Round Trip Training
• We could iterate through the steps of
– train system
– create synthetic corpus
• Dual learning: train models in both directions together
– translation models F → E and E → F
– take a sentence f
– translate it into a sentence e′
– translate that back into a sentence f′
– training objective: f should match f′
• This setup could be fooled by just copying (e′ = f)
⇒ score e′ with a language model for language E; add the language model score as a cost to the training objective

Round Trip Training
[Figure: f → MT F→E → e′ → MT E→F → f′, with language models LM E and LM F scoring the intermediate and reconstructed sentences]

Variants
• Copy target
– if there is no good neural machine translation system to start with
– just copy target language text to the source side
• Forward translation
– synthesize training data in the same direction as training
– self-training (inferior but sometimes successful)

unsupervised machine translation

Monolingual Embedding Spaces
[Figure: dog, cat, lion form a triangle in the English embedding space; Hund, Katze, Löwe form a similarly shaped triangle in the German space]
• Embedding spaces for different languages have a similar shape
• Intuition: the relationship between dog, cat, and lion holds independent of language
• How can we rotate the triangle to match up?
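One standard answer, anticipating the methods on the next slide, is to learn an orthogonal rotation from a seed lexicon (orthogonal Procrustes) and then read off translations as nearest neighbors. This is a minimal numpy sketch; the tiny 2-dimensional embeddings are made-up toy values, not real trained vectors.

```python
import numpy as np

# Toy 2-d "embeddings" for seed word pairs (hypothetical values).
# Rows align: dog<->Hund, cat<->Katze, lion<->Löwe.
X = np.array([[1.0, 0.0],    # dog   (English space)
              [0.8, 0.6],    # cat
              [0.0, 1.0]])   # lion
Y = np.array([[0.0, 1.0],    # Hund  (German space)
              [-0.6, 0.8],   # Katze
              [-1.0, 0.0]])  # Löwe

# Orthogonal Procrustes: find the rotation W minimizing ||X W - Y||
# subject to W being orthogonal. Solution: W = U V^T from SVD(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X @ W   # English vectors rotated into the German space

# Induce word translations as nearest neighbors by cosine similarity.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

en_words = ["dog", "cat", "lion"]
de_words = ["Hund", "Katze", "Löwe"]
for i, w in enumerate(en_words):
    best = max(range(len(de_words)), key=lambda j: cosine(mapped[i], Y[j]))
    print(w, "->", de_words[best])
```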
Matching Embedding Spaces
[Figure: after rotation, the two triangles overlap — Hund on dog, Katze on cat, Löwe on lion]
• Seed lexicon of identically spelled words, numbers, names
• Adversarial training: a discriminator predicts the language [Conneau et al., 2018]
• Match matrices of word similarity scores: Vecmap [Artetxe et al., 2018]

Inferred Translation Model
• Translation model
– induced word translations (nearest neighbors of the mapped embeddings)
→ statistical phrase translation table (translation probabilities from similarity scores)
• Language model
– target-side monolingual data
→ estimate a statistical n-gram language model
⇒ statistical phrase-based machine translation system

Synthetic Training Data
• Create a synthetic parallel corpus
– monolingual text in the source language
– translate it with the inferred system: translations in the target language
• Recall: EM algorithm
– predict data: generate translations for the monolingual corpus
– predict model: estimate a model from the synthetic data
– iterate this process, alternating between language directions
• Increasingly, a neural machine translation model is used to synthesize the data

multiple language pairs

Multiple Language Pairs
• There are more than two languages in the world
• We may want to build systems for many language pairs
• Typical: train separate models for each
• Alternative: joint training

Multiple Input Languages
• Example
– German–English
– French–English
• Concatenate the training data
• The joint model benefits from exposure to more English data
• Shown beneficial in low resource conditions
• Do the input languages have to be related? (maybe not)

Multiple Output Languages
• Example
– French–English
– French–Spanish
• Concatenate the training data
• Given a French input sentence, how do we specify the output language?
• Indicate the output language with a special tag
[ENGLISH] N'y a-t-il pas ici deux poids, deux mesures? ⇒ Is this not a case of double standards?
[SPANISH] N'y a-t-il pas ici deux poids, deux mesures? ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot Translation
• Example
– German–English
– French–English
– French–Spanish
• We want to translate
– German–Spanish
[Figure: a single multilingual MT system connecting German and French inputs to English and Spanish outputs]

Zero Shot
• Train on
– German–English
– French–English
– French–Spanish
• Specify the translation with the output language tag
[SPANISH] Messen wir hier nicht mit zweierlei Maß? ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?
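A minimal sketch of how such target-language tags might be added when concatenating training corpora. The corpus layout and the helper are assumptions for illustration; the slides only establish the [ENGLISH]/[SPANISH] tagging convention, and the sentence pairs are the examples shown above.

```python
# Each corpus: (target language tag, list of (source, target) pairs).
corpora = [
    ("[ENGLISH]", [("N'y a-t-il pas ici deux poids, deux mesures?",
                    "Is this not a case of double standards?")]),
    ("[SPANISH]", [("N'y a-t-il pas ici deux poids, deux mesures?",
                    "¿No puede verse con toda claridad que estamos "
                    "utilizando un doble rasero?")]),
]

# Concatenate, prepending the output-language tag to each source sentence.
# At test time the same tag selects the output language (also zero-shot).
training_data = [
    (f"{tag} {src}", tgt)
    for tag, pairs in corpora
    for src, tgt in pairs
]

for src, tgt in training_data:
    print(src, "=>", tgt)
```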
Zero Shot: Hype
[Figure: press coverage of zero-shot translation]

Zero Shot: Reality
[Figure: translation results for the following setups]
• Bridged: pivot translation Portuguese → English → Spanish
• Model 1 and 2: zero-shot training
• Model 2 + incremental training: use of some training data in the language pair

Sharing Components
• So far: a generic neural machine translation model
• Maybe better: separate systems with shared components
– encoder shared between models with the same input language
– decoder shared between models with the same output language
– attention mechanism shared across all models
• Sharing = same parameters, updated by training on any language pair
• No need to mark the output language

Massively Multilingual Training
• Scaling up multilingual machine translation to more languages
– many-to-English
– English-to-many
– many-to-many
• Mainly motivated by improving low-resource language pairs
• Move towards larger models

Translation Quality for 103 Languages (source: Google)
[Figure]

Gains with Multilingual Training (source: Google)
[Figure]

Romanization (source: USC/ISI)
[Figure]

Many-to-Many
• 7.5 billion sentences for 100 languages (mined from web-crawled data)
• Model with 15 billion parameters
• Improvements especially for low-resource languages

multi-task training

Related Tasks
• Our translation models: generic sequence-to-sequence models
• The same model can be used for many other tasks
– sentiment detection
– grammar correction
– semantic inference
– summarization
– question answering
– speech recognition
• For all these tasks, we need to learn basic properties of language
– word embeddings
– contextualized word representations in the encoder
– language model aspects of the decoder
• Why re-invent the wheel each time?

Training on Related Tasks
• Train a model on several tasks
• Maybe shared and task-specific components (see the sketch below)
• The system learns general facts about language
– informed by many different tasks
– useful for many different tasks
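A minimal PyTorch sketch of shared and task-specific components, assuming two toy tasks on top of one shared sentence encoder: sentiment detection (one label per sentence) and part-of-speech tagging (one label per word). All sizes and the choice of tasks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=64,
                 num_sentiments=3, num_pos_tags=17):
        super().__init__()
        # Shared components: updated by training on either task.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Task-specific heads.
        self.sentiment_head = nn.Linear(hidden, num_sentiments)
        self.pos_head = nn.Linear(hidden, num_pos_tags)

    def forward(self, token_ids, task):
        states, final = self.encoder(self.embed(token_ids))
        if task == "sentiment":
            # One prediction per sentence, from the final encoder state.
            return self.sentiment_head(final.squeeze(0))
        if task == "pos":
            # One prediction per word, from each encoder state.
            return self.pos_head(states)
        raise ValueError(task)

model = MultiTaskModel()
batch = torch.randint(0, 1000, (2, 5))   # 2 sentences, 5 tokens each
print(model(batch, "sentiment").shape)   # torch.Size([2, 3])
print(model(batch, "pos").shape)         # torch.Size([2, 5, 17])
```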
Pre-Training Word Embeddings
• Let us keep it simple...
• Neural machine translation models use word embeddings
– encoding of input words
– encoding of output words
• Word embeddings can be trained on vast amounts of monolingual data
⇒ pre-train word embeddings and initialize the model with them
• Not very successful so far
– monolingual word embeddings are trained on language model objectives
– for machine translation, different similarity aspects may matter more
– e.g., teacher and teaching are similar for MT, but not for a language model

Pre-Training the Encoder and Decoder
• Pre-training other components of the translation model
• Decoder
– a language model, informed by the input context
– pre-train as a language model on monolingual data
– input context vector set to zero
• Encoder
– also structured like a language model (however, not optimized to predict the following words)
– pre-train as a language model on monolingual data

Monolingual Pre-Training
• Initial training of the neural machine translation model on monolingual data
• Replace some input word sequences with a mask token (30% of words)
• Train the model MASKED → TEXT on both source and target text
• Reorder sentences (each training example has 3 sentences)
• Example (masked, reordered input above; restored original text below):
  Advanced NLP techniques master class ” how ” 3rd : 18 Results 40 of 729
  ⇓
  3rd grade : 18 Advanced NLP techniques master class ” how to with clients ” Results 1 – 40 of 729

Multi-Task Training
• Multiple end-to-end tasks that share common aspects
– need to encode an input word sequence
– produce an output word sequence
• May have very different input/output
– sentiment detection: output is a sentiment value
– part-of-speech tagging: output is a tag sequence
– syntactic parsing: output is a recursive parse structure (may be linearized)
– semantic parsing: output is a logical form, database query, or AMR
– grammar correction: input is error-prone text
– question answering: needs to also be informed by a knowledge base
– speech recognition: input is a sequence of acoustic features
• Input and output in the same language, may be mostly copied
– grammar correction, automatic post-editing
– question answering, semantic inference

Multi-Task Training
• Train a single model for all tasks
• Positive results with joint training of
– part-of-speech tagging
– named entity recognition
– syntactic parsing
– semantic analysis
• Tasks may share just some components
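Returning to the masked monolingual pre-training objective above, here is a minimal sketch of how such MASKED → TEXT training examples might be constructed. The 30% rate and 3-sentence reordering come from the slide; the `<MASK>` token string, whitespace tokenization, and helper name are illustrative assumptions.

```python
import random

MASK = "<MASK>"

def make_masked_example(sentences, mask_rate=0.3, seed=None):
    """Build one MASKED -> TEXT training example from a few sentences:
    shuffle the sentence order, then replace ~30% of words with <MASK>."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)                 # sentence reordering objective
    words = " ".join(shuffled).split()
    n_mask = max(1, int(len(words) * mask_rate))
    for i in rng.sample(range(len(words)), n_mask):
        words[i] = MASK                   # word masking objective
    source = " ".join(words)
    target = " ".join(sentences)          # model must restore the original
    return source, target

src, tgt = make_masked_example(
    ["3rd grade : 18",
     "Advanced NLP techniques master class",
     "Results 1 - 40 of 729"], seed=0)
print(src)
print(tgt)
```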