Beyond Parallel Corpora
Philipp Koehn
22 October 2024

data and machine learning

Supervised and Unsupervised
• We framed machine translation as a supervised machine learning task
  – training examples with labels
  – here: input sentences with translations
  – structured prediction: output has to be constructed in several steps
• Unsupervised learning
  – training examples without labels
  – here: just sentences in the input language
  – we will also look at using just sentences in the output language
• Semi-supervised learning
  – some labeled training data
  – some unlabeled training data (usually more)
• Self-training
  – make predictions on unlabeled training data
  – use the predicted labels as supervised training data

Transfer Learning
• Learning from data similar to our task
• Other language pairs
  – first, train a model on a different language pair
  – then, train on the targeted language pair
  – or: train jointly on both
• Multi-task training
  – train on a related task first
  – e.g., part-of-speech tagging
• Share some or all of the components

using monolingual data

Using Monolingual Data
• Language model
  – trained on large amounts of target language data
  – better fluency of output
• Key to the success of statistical machine translation
• Neural machine translation
  – integrate a neural language model into the model
  – create artificial data with backtranslation

Adding a Language Model
• Train a separate language model
• Add it as conditioning context to the decoder
• Recall the state progression in the decoder
  – decoder state s_i
  – embedding of previous output word E y_{i-1}
  – input context c_i
    s_i = f(s_{i-1}, E y_{i-1}, c_i)
• Add the hidden state of the neural language model s_i^{LM}
    s_i = f(s_{i-1}, E y_{i-1}, c_i, s_i^{LM})
• Pre-train the language model
• Leave its parameters fixed during translation model training

Refinements
• Balance the impact of the language model vs. the translation model
• Learn a scaling factor (gate)
    gate_i^{LM} = f(s_i^{LM})
• Use it to scale the values of the language model state
    s̄_i^{LM} = gate_i^{LM} × s_i^{LM}
• Use this scaled language model state for the decoder state
    s_i = f(s_{i-1}, E y_{i-1}, c_i, s̄_i^{LM})

backtranslation

Back Translation
• Monolingual data is parallel data that misses its other half
• Let's synthesize that half
[Figure: a reverse system translates target-language monolingual data into synthetic source sentences, which train the final system]

Back Translation
• Steps
  1. train a system in the reverse translation direction
  2. use this system to translate target-side monolingual data
     → synthetic parallel corpus
  3. combine the generated synthetic parallel data with real parallel data to build the final system
• Roughly equal amounts of synthetic and real data
• Useful method for domain adaptation
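To make the data flow of these three steps concrete, here is a minimal Python sketch. train_model and translate are hypothetical stand-ins for a real NMT toolkit (e.g., fairseq or Marian); only the way real and synthetic data are combined is meant literally.

```python
# Minimal sketch of the back-translation recipe above.
# train_model() is a hypothetical stand-in for a real NMT toolkit;
# it is assumed to return a model with a .translate(sentence) method.

def backtranslation(parallel, mono_target):
    """parallel: list of (src, tgt) sentence pairs;
    mono_target: list of target-language sentences."""
    # Step 1: train a reverse system (target -> source) on the real parallel data
    reverse_system = train_model([(tgt, src) for src, tgt in parallel])

    # Step 2: translate target-side monolingual data into the source language
    synthetic = [(reverse_system.translate(tgt), tgt) for tgt in mono_target]

    # Step 3: combine synthetic and real data, roughly in equal amounts
    synthetic = synthetic[:len(parallel)]
    return train_model(parallel + synthetic)
```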
Iterative Back Translation
• Quality of the backtranslation system matters
• Build a better backtranslation system ... with backtranslation
[Figure: back system 1 creates synthetic data for back system 2, which creates synthetic data for the final system]

Iterative Back Translation
• Example: a better system for backtranslation matters

  German–English        Back    Final
  no back-translation     –     29.6
  10k iterations*       10.6    29.6 (+0.0)
  100k iterations*      21.0    31.1 (+1.5)
  convergence           23.7    32.5 (+2.9)
  re-back-translation   27.9    33.6 (+4.0)

  * = limited training of the back-translation system

Variants
• Copy target
  – if there is no good neural machine translation system to start with
  – just copy the target language text to the source side
• Forward translation
  – synthesize training data in the same direction as training
  – self-training (inferior but sometimes successful)

Round Trip Training
• We could iterate through the steps of
  – train system
  – create synthetic corpus
• Dual learning: train models in both directions together
  – translation models F → E and E → F
  – take a sentence f
  – translate it into a sentence e′
  – translate that back into a sentence f′
  – training objective: f should match f′
• This setup could be fooled by just copying (e′ = f)
  ⇒ score e′ with a language model for language E,
    add the language model score as a cost to the training objective

Round Trip Training
[Figure: sentences f and e cycle between MT F→E and MT E→F, scored by language models LM E and LM F]

monolingual pretraining

Low Resource Language Pairs
• Problem: not enough parallel data to even train a proper encoder or decoder
• Idea: use monolingual data
  – ... in the source language → initialize the encoder
  – ... in the target language → initialize the decoder
• How do we present monolingual data in training?

Masked Training
• Replace some input word sequences with a mask token (30% of words)
• Train a model MASKED → TEXT on both source and target text

  Why did the chicken cross the road?
  ⇑
  Why did <mask> chicken <mask> the road?

Reordering Sentences
• Reorder sentences (each training example has 3 sentences)

  Why did the chicken cross the road? The chicken wanted to get to the other side. There are some delicious sunflower seeds.
  ⇓
  The chicken wanted to get <mask> other <mask>. <mask> are some delicious <mask> seeds. Why did <mask> chicken <mask> the road?
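These two noising operations (masking and sentence reordering) are easy to state in code. A minimal sketch, assuming whitespace tokenization, a 30% masking rate, and <mask> as the mask symbol; the actual token and span distribution are model-specific (mBART, for instance, operates on subword units).

```python
import random

MASK = "<mask>"  # assumed mask symbol; the real token is model-specific

def noise(sentences, mask_rate=0.3):
    """Build the corrupted input for MASKED -> TEXT training:
    shuffle the sentences, then replace ~30% of words by the mask token.
    Adjacent masked words collapse into a single mask, so whole spans
    disappear, as in the slide examples above."""
    shuffled = random.sample(sentences, len(sentences))  # reorder sentences
    out = []
    for sentence in shuffled:
        for word in sentence.split():
            if random.random() < mask_rate:
                if not out or out[-1] != MASK:
                    out.append(MASK)
            else:
                out.append(word)
    return " ".join(out)

sentences = ["Why did the chicken cross the road?",
             "The chicken wanted to get to the other side.",
             "There are some delicious sunflower seeds."]
print(noise(sentences))  # corrupted input; the original text is the target
```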
Example: mBART
“Multilingual Denoising Pre-training for Neural Machine Translation” (Liu et al., 2020)
• 25 languages: from 55 billion words of English to 56 million words of Burmese
• Followed by training on parallel data
⇒ Helps with low-resource languages
  (but not for language pairs with more than 20 million sentence pairs of parallel data)

unsupervised machine translation

Monolingual Embedding Spaces
[Figure: English embeddings (dog, cat, lion) and German embeddings (Hund, Katze, Löwe) form similarly shaped point sets]
• Embedding spaces for different languages have a similar shape
• Intuition: the relationship between dog, cat, and lion holds independently of language
• How can we rotate the triangle to match up?

Matching Embedding Spaces
[Figure: after rotation, dog/Hund, cat/Katze, and lion/Löwe overlap in a shared space]
• Seed lexicon of identically spelled words, numbers, names
• Adversarial training: a discriminator predicts the language [Conneau et al., 2018]
• Match matrices of word similarity scores: Vecmap [Artetxe et al., 2018]

Bilingual Lexicon Induction
[Figure: monolingual data F and E → embeddings F and E → induced bilingual dictionary]
• Given a shared embedding space
⇒ matching points in the space = word translations
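One common way to compute the rotation is orthogonal Procrustes over a seed lexicon, which has a closed-form SVD solution; Vecmap-style methods iterate this with re-induced dictionaries. A minimal numpy sketch, where rows i of X and Y hold the embeddings of the i-th seed pair:

```python
import numpy as np

def align(X, Y):
    """Orthogonal Procrustes: find the rotation W minimizing ||X W - Y||_F.
    X, Y: (n_seed, dim) embeddings of seed translation pairs, row-aligned."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                      # optimal orthogonal map

def induce_lexicon(src_emb, tgt_emb, W):
    """Map all source embeddings into the target space and take the
    nearest target neighbor (by cosine) as the induced translation."""
    mapped = src_emb @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)  # best target word per source word
```

Real systems refine this: the induced dictionary is used to re-estimate W iteratively, and retrieval typically uses CSLS rather than plain cosine to counter hubness.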
Inferred Translation Model
• Translation model
  – induced word translations → statistical phrase translation table
    (translation probabilities derived from embedding similarity)
• Language model
  – target-side monolingual data → estimate a statistical n-gram language model
⇒ Statistical phrase-based machine translation system
[Figure: embeddings F and E → induced bilingual dictionary → translation model; monolingual data E → language model; together they form a statistical MT model]

Synthetic Training Data
• Create a synthetic parallel corpus
  – monolingual text in the source language
  – translate it with the inferred system: translations in the target language
[Figure: the statistical MT model translates monolingual data F into synthetic parallel data]

Iterate
• Iterate
  – predict data: generate translations for the monolingual corpus
  – predict model: estimate a model from the synthetic data
  – iterate this process, alternating between language directions
• Increasingly use a neural machine translation model to synthesize data
[Figure: statistical MT model → synthetic parallel data → neural MT model, fed by monolingual data F and E]

multiple language pairs

Multiple Language Pairs
• There are more than two languages in the world
• We may want to build systems for many language pairs
• Typical: train separate models for each
• Alternative: joint training

Multiple Input Languages
• Example
  – German–English
  – French–English
• Concatenate the training data
• The joint model benefits from exposure to more English data
• Shown beneficial in low-resource conditions
• Do the input languages have to be related? (maybe not)

Multiple Output Languages
• Example
  – French–English
  – French–Spanish
• Concatenate the training data
• Given a French input sentence, how do we specify the output language?
• Indicate the output language with a special tag
  [ENGLISH] N’y a-t-il pas ici deux poids, deux mesures?
  ⇒ Is this not a case of double standards?
  [SPANISH] N’y a-t-il pas ici deux poids, deux mesures?
  ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot Translation
• Example
  – German–English
  – French–English
  – French–Spanish
• We want to translate
  – German–Spanish
[Figure: one multilingual MT model connecting English, French, German, and Spanish]

Zero Shot
• Train on
  – German–English
  – French–English
  – French–Spanish
• Specify the translation
  [SPANISH] Messen wir hier nicht mit zweierlei Maß?
  ⇒ ¿No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot: Hype
[Figure: press coverage]

Zero Shot: Reality
• Bridged: pivot translation Portuguese → English → Spanish
• Models 1 and 2: zero-shot training
• Model 2 + incremental training: use of some training data in the language pair

Massively Multilingual Training
• Scaling up multilingual machine translation to more languages
  – many-to-English
  – English-to-many
  – many-to-many
• Mainly motivated by improving low-resource language pairs
• Move towards larger models

Translation Quality for 103 Languages
[Figure: translation quality across 103 languages (source: Google)]

Gains with Multilingual Training
[Figure: gains with multilingual training (source: Google)]

Romanization
[Figure: romanization (source: USC/ISI)]

Many-to-Many
• 7.5 billion sentences for 100 languages (mined from web-crawled data)
• Model with 15 billion parameters
• Improvements especially for low-resource languages

Even Bigger: NLLB (2022)
• No Language Left Behind: 200 languages
• Hand-translated test set: Flores-200
• Uses diverse data sources
  – public parallel data
  – translations created by professional translators
  – sentence pairs based on sentence embedding similarity
  – monolingual data for
    ∗ monolingual pre-training
    ∗ back-translation
    ∗ self-training
• Models of different scale (up to 54B parameters), publicly released

Different Amounts of Data per Language
• High-resource language pairs are undertrained
• Low-resource language pairs are overtrained
⇒ Oversample low-resource language pairs
• Data selection probability p_l for language pair l, based on corpus sizes D_k:
    p_l = (D_l / Σ_k D_k)^{1/T}
• Curriculum training: adding low-resource data only in later training stages
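A quick worked version of this sampling formula, with made-up corpus sizes and an illustrative temperature: T = 1 keeps the raw data proportions, while larger T flattens the distribution towards uniform, oversampling low-resource pairs. Note that the exponentiated values must be renormalized to form a probability distribution.

```python
def sampling_probs(corpus_sizes, T=5.0):
    """p_l = (D_l / sum_k D_k)^(1/T), renormalized to sum to one."""
    total = sum(corpus_sizes.values())
    weights = {l: (D / total) ** (1.0 / T) for l, D in corpus_sizes.items()}
    Z = sum(weights.values())
    return {l: w / Z for l, w in weights.items()}

# illustrative (made-up) corpus sizes in sentence pairs
sizes = {"de-en": 40_000_000, "ro-en": 600_000, "my-en": 50_000}
print(sampling_probs(sizes, T=1.0))  # proportional: de-en dominates
print(sampling_probs(sizes, T=5.0))  # flattened: low-resource oversampled
```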
Interference
• Many languages in the same representation space
• Beneficial: shared cognates, numbers, names, ...
• Harmful: a lot of accidental overlap in tokens that have different meanings
  – die — common German determiner
  – die — different meaning in English
• What can be done to avoid harmful interference?

Language-Specific Components
• Various design choices
  – language-specific encoder
  – language-specific decoder
  – language-specific adapter components
• Example: “Condensing Multilingual Knowledge with Lightweight Language-Specific Modules” (Xu et al., 2023)
  – language-specific parameters
  – shared parameters
  – self-distillation method to condense everything into the shared parameters

Mixture of Experts
• Conditional compute
• A gating mechanism decides which feed-forward block (expert) to use
• Allows scaling to many more parameters without increasing computational cost
  (a minimal routing sketch is given at the end of these notes)

document-level translation

The Importance of Document-Level Context
• Pronouns
  – I bought a table. It is pretty.
  – Ich kaufte einen Tisch. Er/sie/es ist schön.
• Better disambiguation
  – I have a lot of numbers. I still need to make the table.
• Terminological consistency

Why Not Document-Level Translation?
• The entire infrastructure is focused on the sentence level
  – training data available as sentence pairs
  – metrics defined at the sentence level
  – APIs typically operate at the sentence level
• This is slowly changing
  – scaling up transformers for multi-sentence translation [Junczys-Dowmunt et al., 2019]
  – document-level metrics, e.g., CTXPRO [Wicks et al., 2023]
  – release of training data in document-aligned format,
    e.g., Europarl, News Commentary, Paracrawl [Wicks et al., 2024]

questions?
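As promised on the mixture-of-experts slide above, a minimal sketch of the routing logic in numpy. This is not a real transformer layer: actual MoE layers use learned softmax gates, often top-2 routing, and load-balancing losses; only the conditional-compute idea is shown.

```python
import numpy as np

def moe_layer(x, experts, gate_W):
    """Top-1 gating: route each token to a single expert feed-forward block.
    x: (tokens, dim) activations; experts: list of callables (n, dim) -> (n, dim);
    gate_W: (dim, n_experts) gating weights (untrained here).
    Parameters grow with the number of experts, but only one expert
    runs per token, so per-token compute stays constant."""
    choice = (x @ gate_W).argmax(axis=1)   # pick one expert per token
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = expert(x[mask])    # only the chosen expert computes
    return out

# toy usage: 4 experts, each a random linear map
rng = np.random.default_rng(0)
dim, n_experts = 8, 4
experts = [lambda h, W=rng.standard_normal((dim, dim)): h @ W
           for _ in range(n_experts)]
gate_W = rng.standard_normal((dim, n_experts))
tokens = rng.standard_normal((5, dim))
print(moe_layer(tokens, experts, gate_W).shape)  # (5, 8)
```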