Beyond Parallel Corpora
Philipp Koehn
29 October 2020

data and machine learning

Supervised and Unsupervised
• We framed machine translation as a supervised machine learning task
– training examples with labels
– here: input sentences with their translations
– structured prediction: the output has to be constructed in several steps
• Unsupervised learning
– training examples without labels
– here: just sentences in the input language
– we will also look at using just sentences in the output language
• Semi-supervised learning
– some labeled training data
– some unlabeled training data (usually more)
• Self-training
– make predictions on unlabeled training data
– use the predicted labels as supervised training data

Transfer Learning
• Learning from data similar to our task
• Other language pairs
– first, train a model on a different language pair
– then, train on the targeted language pair
– or: train jointly on both
• Multi-task training
– train on a related task first
– e.g., part-of-speech tagging
• Share some or all of the components

using monolingual data

Using Monolingual Data
• Language model
– trained on large amounts of target language data
– better fluency of the output
• Key to the success of statistical machine translation
• Neural machine translation
– integrate a neural language model into the model
– create artificial data with back-translation

Adding a Language Model
• Train a separate language model
• Add it as conditioning context to the decoder
• Recall the state progression in the decoder
– decoder state $s_i$
– embedding of the previous output word $E y_{i-1}$
– input context $c_i$
$$s_i = f(s_{i-1}, E y_{i-1}, c_i)$$
• Add the hidden state of the neural language model $s_i^{LM}$
$$s_i = f(s_{i-1}, E y_{i-1}, c_i, s_i^{LM})$$
• Pre-train the language model
• Leave its parameters fixed during translation model training

Refinements
• Balance the impact of the language model vs. the translation model
• Learn a scaling factor (gate)
$$\text{gate}_i^{LM} = f(s_i^{LM})$$
• Use it to scale the values of the language model state
$$\bar{s}_i^{LM} = \text{gate}_i^{LM} \times s_i^{LM}$$
• Use this scaled language model state for the decoder state
$$s_i = f(s_{i-1}, E y_{i-1}, c_i, \bar{s}_i^{LM})$$

Back Translation
• Monolingual data is parallel data that misses its other half
• Let's synthesize that half
[Figure: target-side monolingual data is translated by a reverse system; the resulting synthetic pairs train the final system]

Back Translation
• Steps (sketched in code below)
1. train a system in the reverse translation direction
2. use this system to translate target-side monolingual data → synthetic parallel corpus
3. combine the synthetic parallel data with the real parallel data to build the final system
• Roughly equal amounts of synthetic and real data
• Useful method for domain adaptation
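As a concrete illustration of these steps, here is a minimal sketch in Python. The `translate_reverse` stub stands in for a trained reverse-direction NMT system, and the toy sentence pairs are made-up assumptions, not data from the slides.

```python
import random

def translate_reverse(sentence: str) -> str:
    """Stub for the reverse-direction system (target -> source).
    In practice this would call a trained NMT model."""
    return f"<synthetic source for: {sentence}>"

# Real parallel data: list of (source, target) pairs (hypothetical toy data).
parallel = [("ein Haus", "a house"), ("ein Auto", "a car")]

# Target-side monolingual data (input to step 2).
mono_tgt = ["a tree", "a river"]

# Step 2: back-translate monolingual target sentences into synthetic sources.
synthetic = [(translate_reverse(t), t) for t in mono_tgt]

# Step 3: combine synthetic and real data (roughly equal amounts) and shuffle.
training_data = parallel + synthetic
random.shuffle(training_data)

for src, tgt in training_data:
    print(src, "|||", tgt)
```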
Iterative Back Translation
• Quality of the back-translation system matters
• Build a better back-translation system ... with back-translation
[Figure: back system 1 produces data to train back system 2, which produces the data for the final system]

Iterative Back Translation
• Example German–English (scores of the back-translation and final systems):

                        Back    Final
  no back-translation    –      29.6
  *10k iterations       10.6    29.6 (+0.0)
  *100k iterations      21.0    31.1 (+1.5)
  convergence           23.7    32.5 (+2.9)
  re-back-translation   27.9    33.6 (+4.0)

  * = limited training of the back-translation system

Round Trip Training
• We could iterate through the steps of
– train system
– create synthetic corpus
• Dual learning: train models in both directions together
– translation models F → E and E → F
– take a sentence f
– translate it into a sentence e′
– translate that back into a sentence f′
– training objective: f should match f′
• This setup could be fooled by just copying (e′ = f)
⇒ score e′ with a language model for language E; add the language model score as a cost to the training objective

Round Trip Training
[Figure: f → MT F→E → e′ → MT E→F → f′, with language models LM E and LM F scoring the intermediate and reconstructed sentences]

Variants
• Copy target
– if there is no good neural machine translation system to start with
– just copy target language text to the source side
• Forward translation
– synthesize training data in the same direction as training
– self-training (inferior but sometimes successful)

unsupervised machine translation

Monolingual Embedding Spaces
[Figure: dog, cat, lion form a triangle in the English embedding space; Hund, Katze, Löwe form a similarly shaped triangle in the German space]
• Embedding spaces for different languages have a similar shape
• Intuition: the relationship between dog, cat, and lion holds independent of language
• How can we rotate the triangle to match up?
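One standard answer, anticipating the methods on the next slide, is to learn an orthogonal rotation from a seed lexicon (orthogonal Procrustes) and then read off translations as nearest neighbors. This is a minimal numpy sketch; the tiny 2-dimensional embeddings are made-up toy values, not real trained vectors.

```python
import numpy as np

# Toy 2-d "embeddings" for seed word pairs (hypothetical values).
# Rows align: dog<->Hund, cat<->Katze, lion<->Löwe.
X = np.array([[1.0, 0.0],    # dog   (English space)
              [0.8, 0.6],    # cat
              [0.0, 1.0]])   # lion
Y = np.array([[0.0, 1.0],    # Hund  (German space)
              [-0.6, 0.8],   # Katze
              [-1.0, 0.0]])  # Löwe

# Orthogonal Procrustes: find the rotation W minimizing ||X W - Y||
# subject to W being orthogonal. Solution: W = U V^T from SVD(X^T Y).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X @ W   # English vectors rotated into the German space

# Induce word translations as nearest neighbors by cosine similarity.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

en_words = ["dog", "cat", "lion"]
de_words = ["Hund", "Katze", "Löwe"]
for i, w in enumerate(en_words):
    best = max(range(len(de_words)), key=lambda j: cosine(mapped[i], Y[j]))
    print(w, "->", de_words[best])
```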
Matching Embedding Spaces
[Figure: after rotation, the two triangles overlap — Hund on dog, Katze on cat, Löwe on lion]
• Seed lexicon of identically spelled words, numbers, names
• Adversarial training: a discriminator predicts the language [Conneau et al., 2018]
• Match matrices of word similarity scores: Vecmap [Artetxe et al., 2018]

Inferred Translation Model
• Translation model
– induced word translations (nearest neighbors of the mapped embeddings)
→ statistical phrase translation table (translation probabilities from similarity scores)
• Language model
– target-side monolingual data
→ estimate a statistical n-gram language model
⇒ statistical phrase-based machine translation system

Synthetic Training Data
• Create a synthetic parallel corpus
– monolingual text in the source language
– translate it with the inferred system: translations in the target language
• Recall: EM algorithm
– predict data: generate translations for the monolingual corpus
– predict model: estimate a model from the synthetic data
– iterate this process, alternating between language directions
• Increasingly, a neural machine translation model is used to synthesize the data

multiple language pairs

Multiple Language Pairs
• There are more than two languages in the world
• We may want to build systems for many language pairs
• Typical: train separate models for each
• Alternative: joint training

Multiple Input Languages
• Example
– German–English
– French–English
• Concatenate the training data
• The joint model benefits from exposure to more English data
• Shown beneficial in low resource conditions
• Do the input languages have to be related? (maybe not)

Multiple Output Languages
• Example
– French–English
– French–Spanish
• Concatenate the training data
• Given a French input sentence, how do we specify the output language?
• Indicate the output language with a special tag
[ENGLISH] N'y a-t-il pas ici deux poids, deux mesures? ⇒ Is this not a case of double standards?
[SPANISH] N'y a-t-il pas ici deux poids, deux mesures? ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot Translation
• Example
– German–English
– French–English
– French–Spanish
• We want to translate
– German–Spanish
[Figure: a single multilingual MT system connecting German and French inputs to English and Spanish outputs]

Zero Shot
• Train on
– German–English
– French–English
– French–Spanish
• Specify the translation with the output language tag
[SPANISH] Messen wir hier nicht mit zweierlei Maß? ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?
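A minimal sketch of how such target-language tags might be added when concatenating training corpora. The corpus layout and the helper are assumptions for illustration; the slides only establish the [ENGLISH]/[SPANISH] tagging convention, and the sentence pairs are the examples shown above.

```python
# Each corpus: (target language tag, list of (source, target) pairs).
corpora = [
    ("[ENGLISH]", [("N'y a-t-il pas ici deux poids, deux mesures?",
                    "Is this not a case of double standards?")]),
    ("[SPANISH]", [("N'y a-t-il pas ici deux poids, deux mesures?",
                    "¿No puede verse con toda claridad que estamos "
                    "utilizando un doble rasero?")]),
]

# Concatenate, prepending the output-language tag to each source sentence.
# At test time the same tag selects the output language (also zero-shot).
training_data = [
    (f"{tag} {src}", tgt)
    for tag, pairs in corpora
    for src, tgt in pairs
]

for src, tgt in training_data:
    print(src, "=>", tgt)
```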
Zero Shot: Hype
[Figure: press coverage of zero-shot translation]

Zero Shot: Reality
[Figure: translation results for the following setups]
• Bridged: pivot translation Portuguese → English → Spanish
• Model 1 and 2: zero-shot training
• Model 2 + incremental training: use of some training data in the language pair

Sharing Components
• So far: a generic neural machine translation model
• Maybe better: separate systems with shared components
– encoder shared between models with the same input language
– decoder shared between models with the same output language
– attention mechanism shared across all models
• Sharing = same parameters, updated by training on any language pair
• No need to mark the output language

Massively Multilingual Training
• Scaling up multilingual machine translation to more languages
– many-to-English
– English-to-many
– many-to-many
• Mainly motivated by improving low-resource language pairs
• Move towards larger models

Translation Quality for 103 Languages (source: Google)
[Figure]

Gains with Multilingual Training (source: Google)
[Figure]

Romanization (source: USC/ISI)
[Figure]

Many-to-Many
• 7.5 billion sentences for 100 languages (mined from web-crawled data)
• Model with 15 billion parameters
• Improvements especially for low-resource languages

multi-task training

Related Tasks
• Our translation models: generic sequence-to-sequence models
• The same model can be used for many other tasks
– sentiment detection
– grammar correction
– semantic inference
– summarization
– question answering
– speech recognition
• For all these tasks, we need to learn basic properties of language
– word embeddings
– contextualized word representations in the encoder
– language model aspects of the decoder
• Why re-invent the wheel each time?

Training on Related Tasks
• Train a model on several tasks
• Maybe shared and task-specific components (see the sketch below)
• The system learns general facts about language
– informed by many different tasks
– useful for many different tasks
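A minimal PyTorch sketch of shared and task-specific components, assuming two toy tasks on top of one shared sentence encoder: sentiment detection (one label per sentence) and part-of-speech tagging (one label per word). All sizes and the choice of tasks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=64,
                 num_sentiments=3, num_pos_tags=17):
        super().__init__()
        # Shared components: updated by training on either task.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Task-specific heads.
        self.sentiment_head = nn.Linear(hidden, num_sentiments)
        self.pos_head = nn.Linear(hidden, num_pos_tags)

    def forward(self, token_ids, task):
        states, final = self.encoder(self.embed(token_ids))
        if task == "sentiment":
            # One prediction per sentence, from the final encoder state.
            return self.sentiment_head(final.squeeze(0))
        if task == "pos":
            # One prediction per word, from each encoder state.
            return self.pos_head(states)
        raise ValueError(task)

model = MultiTaskModel()
batch = torch.randint(0, 1000, (2, 5))   # 2 sentences, 5 tokens each
print(model(batch, "sentiment").shape)   # torch.Size([2, 3])
print(model(batch, "pos").shape)         # torch.Size([2, 5, 17])
```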
Pre-Training Word Embeddings
• Let us keep it simple...
• Neural machine translation models use word embeddings
– encoding of input words
– encoding of output words
• Word embeddings can be trained on vast amounts of monolingual data
⇒ pre-train word embeddings and initialize the model with them
• Not very successful so far
– monolingual word embeddings are trained on language model objectives
– for machine translation, different similarity aspects may matter more
– e.g., teacher and teaching are similar for MT, but not for a language model

Pre-Training the Encoder and Decoder
• Pre-training other components of the translation model
• Decoder
– a language model, informed by the input context
– pre-train as a language model on monolingual data
– input context vector set to zero
• Encoder
– also structured like a language model (however, not optimized to predict the following words)
– pre-train as a language model on monolingual data

Monolingual Pre-Training
• Initial training of the neural machine translation model on monolingual data
• Replace some input word sequences with a mask token (30% of words)
• Train the model MASKED → TEXT on both source and target text
• Reorder sentences (each training example has 3 sentences)
• Example (masked, reordered input above; restored original text below):
  Advanced NLP techniques master class ” how ” 3rd : 18 Results 40 of 729
  ⇓
  3rd grade : 18 Advanced NLP techniques master class ” how to with clients ” Results 1 – 40 of 729

Multi-Task Training
• Multiple end-to-end tasks that share common aspects
– need to encode an input word sequence
– produce an output word sequence
• May have very different input/output
– sentiment detection: output is a sentiment value
– part-of-speech tagging: output is a tag sequence
– syntactic parsing: output is a recursive parse structure (may be linearized)
– semantic parsing: output is a logical form, database query, or AMR
– grammar correction: input is error-prone text
– question answering: needs to also be informed by a knowledge base
– speech recognition: input is a sequence of acoustic features
• Input and output in the same language, may be mostly copied
– grammar correction, automatic post-editing
– question answering, semantic inference

Multi-Task Training
• Train a single model for all tasks
• Positive results with joint training of
– part-of-speech tagging
– named entity recognition
– syntactic parsing
– semantic analysis
• Tasks may share just some components
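Returning to the masked monolingual pre-training objective above, here is a minimal sketch of how such MASKED → TEXT training examples might be constructed. The 30% rate and 3-sentence reordering come from the slide; the `<MASK>` token string, whitespace tokenization, and helper name are illustrative assumptions.

```python
import random

MASK = "<MASK>"

def make_masked_example(sentences, mask_rate=0.3, seed=None):
    """Build one MASKED -> TEXT training example from a few sentences:
    shuffle the sentence order, then replace ~30% of words with <MASK>."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)                 # sentence reordering objective
    words = " ".join(shuffled).split()
    n_mask = max(1, int(len(words) * mask_rate))
    for i in rng.sample(range(len(words)), n_mask):
        words[i] = MASK                   # word masking objective
    source = " ".join(words)
    target = " ".join(sentences)          # model must restore the original
    return source, target

src, tgt = make_masked_example(
    ["3rd grade : 18",
     "Advanced NLP techniques master class",
     "Results 1 - 40 of 729"], seed=0)
print(src)
print(tgt)
```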