Low-Resource Machine Translation
Edoardo Signoroni

Introduction
● MT is the task of translating a sentence from a source language into the corresponding sentence in the target language.
● Nowadays, it is done with neural machine translation systems trained on parallel corpora.
● Main issues:
– Linguistic ambiguity, e.g. "It's raining cats and dogs."
– DATA SCARCITY

What are LRLs?
● There are >7000 living languages, plus varieties, dialects, slangs, code-switching, code-mixing, ... and more.
● Most of these are "Left-Behinds", or low-resource languages, since the biggest MT system online supports a grand total of 243 (or 3.47%).
Blasi et al. (2022), Joshi et al. (2020)

What are LRLs?
Joshi et al. (2020) define LRLs in incremental classes:
– "Have exceptionally limited resources, and have rarely been considered in language technologies."
– "Have some unlabelled data; however, collecting labelled data is challenging."
– "A small set of labelled datasets has been collected, and language support communities are there to support the language."
[Table: Joshi et al. classification, "adapted" from Ranathunga et al. (2023) and Joshi et al. (2020), showing where Italian and Czech fall.]

What are LRLs?
● For MT, there is no standard definition. Usually a pair is LR if the size of its parallel corpora is <500k sentences, and extremely LR below 100k pairs.
● If no data is available, we enter the zero-shot setting.
● For scale: the WMT22 deu-dsb data is 40k sentences / 500k words; for comparison, The Good Soldier Švejk is 200k words and the New Testament 185k words.

Why work on LRLs?
- decreasing the digital divide http://labs.theguardian.com/digital-language-divide/
- dealing with inequalities of information access and production
- mitigating cross-cultural biases
- deploying NLP technologies for underrepresented languages
- understanding cross-linguistic differences
- preserving linguistic diversity: ~3000 languages (43%) are endangered; 90% of all languages will be extinct within 100 years; in the best-case scenario, only 50% will survive, and just 10% are considered safe during the next century https://www.endangeredlanguages.com/about_importance/
Given this variability, always highlight clearly the languages you are working on (Bender Rule & Data Statements).

How is it done?
● Currently, the state of the art for HRLs is NMT.
● Several approaches have been proposed for LRLs, too.
● But most of the impact can be obtained with careful and clever use of the data we have.
Ranathunga et al. (2023)

How is it done? - Current Methods
● Multilingual NMT transfer learning is the current state of the art.
● The best results are obtained by using data from related HRL pairs and fine-tuning pre-trained NMT models on the related data or on the small amount of LRL text available (a minimal fine-tuning sketch follows the Language Relatedness slide below).
● There are still issues with performance and equitable access.
[Diagram: a multilingual ENG>INDIC MT system trained on ENG>BEN, ENG>MNI, ENG>LUS, ... data is fine-tuned on ENG>ASM data to obtain a fine-tuned ENG>ASM MT system, translating e.g. "Hello, there!" into Assamese.]

Language Relatedness
● It is beneficial to use related languages for transfer between HRLs and LRLs.
● However, the extent of this is not clear. Which kind of relatedness is the most helpful?
– Genealogical? မြန်မဘသ (Burmese, Tibeto-Burman) > মৈতৈলোন (Manipuri, Tibeto-Burman)
– Typological? हिन्दी (Hindi, Indo-Aryan, SOV) > মৈতৈলোন (Manipuri, SOV)
– Writing system? বাংলা (Bengali, Bengali script) > মৈতৈলোন (Manipuri, Bengali script)
● How can we better leverage and disentangle these factors?
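To make the transfer-learning recipe above concrete, here is a minimal fine-tuning sketch. It is not the exact setup from these slides: the checkpoint (facebook/nllb-200-distilled-600M), the NLLB language codes (eng_Latn, asm_Beng), the file eng-asm.train.jsonl, and all hyperparameters are illustrative assumptions.

```python
# Minimal transfer-learning sketch: fine-tune a pre-trained multilingual NMT
# model on a small LRL parallel corpus. Model name, language codes, file name,
# and hyperparameters are illustrative assumptions, not the setup from the talk.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/nllb-200-distilled-600M"   # pre-trained multilingual NMT
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          src_lang="eng_Latn", tgt_lang="asm_Beng")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Small eng-asm parallel corpus, one {"src": ..., "tgt": ...} object per line.
data = load_dataset("json", data_files={"train": "eng-asm.train.jsonl"})

def preprocess(batch):
    # Tokenize source and target sides; the target ids become the labels.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

train = data["train"].map(preprocess, batched=True,
                          remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-eng-asm-ft",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=3,
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

In practice, the parent model or parent data would be chosen along the relatedness dimensions discussed above (genealogy, typology, writing system).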
An Example: WMT23 Indic LR MT
● 4 low-resource Indic languages (asm, kha, lus, mni) <> English.
● Collated training datasets on a same-script basis (asm & mni; kha & lus), and for all languages together.
● Trained systems on the collated data, and fine-tuned child systems for the single directions.
● This was the best option for kha>eng and lus>eng; mni>eng was better with the same-script parent.

Zero-shot and Relatedness
● Most of the time, no data for the LRL is available → zero-shot.
● Fine-tuning a pre-trained LLM with data from a related language helps (e.g. data from other Slavic languages when translating into Silesian).
● However, the internal, fine-grained relatedness of the languages, or their presence in the pre-training data, seems not to matter.

Tokenization
● An MT system is a sequence-to-sequence model, which takes words as input and generates words as output.
● Thus it needs a vocabulary of tokens: words, in the simplest implementation.
● Dealing with morphological variants and variation leads to huge vocabulary sizes and to out-of-vocabulary words, not seen in training.

Tokenization
● Text is segmented into subwords with data-driven iterative algorithms.
● These subwords are combined together to deal with unknown words, but they still struggle with complex morphology, non-standard forms, linguistic diversity, ...
● Character, hybrid, token-free, and even pixel-level approaches have been proposed to overcome such challenges.

Tokenization Impact on NMT
● Tokenization impacts the quality of downstream NMT, especially for LRLs, so choosing its parameters carefully is crucial.

An Example: WMT22 LR MT
● 2 LRLs: Lower & Upper Sorbian.
● By using our custom HFT tokenizer to obtain more frequent, and thus better represented, tokens, we outperformed the default BPE approach using only the given LR parallel corpora (40k and 450k sentence pairs).

Tokenization Impact on NMT
As we set a higher value for the vocabulary size, we get:
● fewer characters;
● slightly more subwords, and more mixed-use tokens at first;
● more full words;
● but also lower quality.
A more "balanced" mix of characters, subwords, and words generalizes better to unseen data than a word-heavy vocabulary.

Tokenization Impact on NMT
As we set a higher value for the vocabulary size, we naturally get longer tokens and shorter lines:
250: A m b▁ as s ad or M s . N i k k i H▁ ▁ ▁ al e y , U n▁ it ed S t▁ at es P▁ er m an ent R e p▁ res ent at ive to▁ the▁ U n▁ it ed N▁ ation s
500: A m b▁ ass ad or M s . N i k k i H▁ ▁ ▁ al e y , Un▁ it ed S t▁ at es P▁ er m an ent R▁ ep res ent at ive to▁ the▁ Un▁ it ed N▁ ations
1k: A m b▁ ass ad or M s . N i k k i H▁ ▁ ▁ al e y , Un▁ ited St▁ ates P▁ er m an ent R▁ ep res ent ative to▁ the▁ Un▁ ited N▁ ations
2k: A▁ mb ass ad or M s . N▁ ▁ ik k i H▁ ale y , Un▁ ited St▁ ates P▁ er man ent Rep▁ res ent ative to▁ the▁ Un▁ ited N▁ ations
4k: Amb▁ ass ad or M s . N▁ ▁ ik k i H▁ ale y , Un▁ ited States▁ P▁ er man ent Rep▁ res ent ative to▁ the▁ Un▁ ited N▁ ations
8k: Ambassador▁ Ms▁ . N▁ ikk i H▁ ale y , United▁ States▁ Per▁ man ent Rep▁ res ent ative to▁ the▁ United▁ Nations▁
16k: Ambassador▁ Ms▁ . N▁ ikk i H▁ ale y , United▁ States▁ Permanent▁ Repres▁ ent ative to▁ the▁ United▁ Nations▁
32k: Ambassador▁ Ms▁ . N▁ ikk i Haley▁ , United▁ States▁ Permanent▁ Represent▁ ative to▁ the▁ United▁ Nations▁
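The vocabulary-size effect shown above can be reproduced with any data-driven subword tokenizer. The sketch below uses SentencePiece BPE as a stand-in for the custom HFT tokenizer mentioned earlier (it is not that tokenizer); the training file corpus.dsb.txt and the chosen sizes are assumptions.

```python
# Sketch of how vocabulary size changes subword segmentation.
# Assumptions: SentencePiece BPE as a stand-in for the HFT tokenizer from the
# talk, and a hypothetical plain-text training file corpus.dsb.txt.
import sentencepiece as spm

sentence = ("Ambassador Ms. Nikki Haley, United States Permanent "
            "Representative to the United Nations")

for vocab_size in (250, 1000, 4000, 16000):
    spm.SentencePieceTrainer.train(
        input="corpus.dsb.txt",           # small LR corpus, one sentence per line
        model_prefix=f"bpe_{vocab_size}",
        vocab_size=vocab_size,
        model_type="bpe",
        character_coverage=1.0,
    )
    sp = spm.SentencePieceProcessor(model_file=f"bpe_{vocab_size}.model")
    pieces = sp.encode(sentence, out_type=str)
    # Larger vocabularies yield longer pieces and shorter token sequences.
    print(vocab_size, len(pieces), " ".join(pieces))
```

Larger vocabularies produce fewer, longer pieces per sentence, which is exactly the trade-off between word-heavy and more "balanced" vocabularies discussed above.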
Smaller Vocabularies
● Pre-trained models use huge vocabularies to account for all of the training data, and they require heavy computational resources.
● If carefully tuned, "traditional" trained-from-scratch systems can achieve meaningful representations at a fraction of the computational size and cost, even in extremely LR conditions.
● In particular, smaller vocabulary sizes most often lead to:
– better MT performance
– smaller model size
– faster training times

Automated Metrics for LR MT
● Automated metrics allow for low-cost, fast comparison of systems.
● Two types are relevant for LR MT:
– Lexical overlap metrics compare the sequence similarity between the proposed translation and one or more references: BLEU (Papineni et al. 2002), ChrF (Popović 2015, 2017).
– Neural metrics are LMs fine-tuned on human judgements that predict a score from a given input of source, translation, and reference: COMET, xCOMET (Rei et al. 2020).

Automated Metrics for LR MT
● While neural metrics are the state of the art, they perform poorly for LRLs.
● Fine-tuning COMET models to LRLs was shown to be promising: IndicCOMET (Sai B et al. 2023); AfriCOMET (Wang et al. 2023).
● If this is not possible, ChrF(++) was deemed the best back-off metric (a minimal scoring sketch follows the conclusions below).

Some Conclusions
● Working on LRLs is important for several linguistic, social, and democratic reasons.
● Multilingual NMT approaches involving transfer learning are currently the state of the art for LRL MT,
● but they still have various issues regarding their performance and equitable access.
● Careful tuning of the parameters and clever use of the training data go a long way towards alleviating the problems of LR MT.
● Some best practices, such as clearly highlighting the LRLs studied and using fitting metrics to evaluate the output of MT, are also important.
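As referenced in the metrics slides above, here is a minimal scoring sketch for the lexical-overlap metrics. It assumes the sacreBLEU implementation; the hypotheses and references are invented examples. Neural metrics such as COMET are distributed separately (e.g. in the unbabel-comet package) and are not shown here.

```python
# Minimal evaluation sketch with lexical-overlap metrics (BLEU and chrF++).
# Assumption: sacreBLEU as the implementation; hypotheses/references are toy data.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = [
    "The ambassador spoke at the United Nations .",
    "It is raining cats and dogs .",
]
# One reference stream (one reference per hypothesis); more streams can be added.
references = [[
    "The ambassador gave a speech at the United Nations .",
    "It is pouring rain .",
]]

bleu = BLEU()
chrf = CHRF(word_order=2)   # word_order=2 gives chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
```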