Current Challenges
Philipp Koehn 5 November 2020
Philipp Koehn
Machine Translation: Current Challenges
5 November 2020
WMT 2016
[Figure: human evaluation score (y-axis) plotted against BLEU (x-axis, 18 to 36) for WMT 2016 systems, with neural MT and statistical MT systems marked separately. Systems shown: uedin-nmt, metamind, uedin-syntax, NYU-UMONTREAL, ONLINE-B, PROMT-RULE-BASED, KIT-LIMSI, CAMBRIDGE, ONLINE-A, JHU-SYNTAX, JHU-PBMT, UEDIN-PBMT, ONLINE-F, ONLINE-G.]
(in 2017 barely any statistical machine translation submissions)
2017: Google: "Near Human Quality"
[Figure: translation quality by translation model, comparing phrase-based (PBMT), neural (GNMT), and human translation on a scale where 6 is a perfect translation, for English > Spanish, English > French, English > Chinese, Spanish > English, French > English, and Chinese > English.]
2018: More Hype
Microsoft Research Achieves Human Parity For Chinese English Translation
Written by Sue Gee Wednesday, 21 March 2018
Researchers in Microsoft's labs in Beijing and in Redmond and Washington have developed an AI machine translation system that can translate with the same accuracy as a human from Chinese to English.
SDL Cracks Russian to English Neural Machine Translation
Global Enterprises to Capitalize on Near Perfect Russian to English Machine Translation as SDL Sets New Industry Standard
90% of the system's output labelled as perfect by professional Russian-English translators
Just Better Fluency?
[Figure: two bar charts comparing ONLINE-B and UEDIN-NMT on CS→EN, DE→EN, RO→EN, and RU→EN (axis ticks 60 to 100). Adequacy: +1%. Fluency: +13%.]
(from: Sennrich and Haddow, 2017)
Challenges
• Lack of training data
• Domain mismatch
• Rare words
• Word alignment
• Beam search
• Noise
• Control over output
• Interpretability
lack of training data
Amount of Training Data
Corpus Size (English Words)
English-Spanish systems trained on 0.4 million to 385.7 million words
Translation Examples
Source    A Republican strategy to counter the re-election of Obama
1/1024    Un órgano de coordinación para el anuncio de libre determinación
1/512     Lista de una estrategia para luchar contra la elección de hojas de Ohio
1/256     Explosion realiza una estrategia divisiva de luchar contra las elecciones de autor
1/128     Una estrategia republicana para la eliminación de la reelección de Obama
1/64      Estrategia siria para contrarrestar la reelección del Obama .
1/32+     Una estrategia republicana para contrarrestar la reelección de Obama
domain mismatch
Domain Mismatch
(rows: training data, columns: test domain; each cell: NMT / SMT)
System      Law           Medical       IT            Koran         Subtitles
All Data    30.5 / 32.8   45.1 / 42.2   35.3 / 44.7   17.9 / 17.9   26.4 / 20.8
Law         31.1 / 34.4   12.1 / 18.2    3.5 /  6.9    1.3 /  2.2    2.8 /  6.0
Medical      3.9 / 10.2   39.4 / 43.5    2.0 /  8.5    0.6 /  2.0    1.4 /  5.8
IT           1.9 /  3.7    6.5 /  5.3   42.1 / 39.8    1.8 /  1.6    3.9 /  4.7
Koran        0.4 /  1.8    0.0 /  2.1    0.0 /  2.3   15.9 / 18.8    1.0 /  5.5
Subtitles    7.0 /  9.9    9.3 / 17.8    9.2 / 13.6    9.0 /  8.4   25.9 / 22.1
Translation Examples
Source Schaue um dich herum.
Ref. Look around you.
All NMT: Look around you. SMT: Look around you.
Law NMT: Sughum gravecorn. SMT: In order to implement dich Schaue .
Medical NMT: EMEA / MB / 049 / 01-EN-Final Work progamme for 2002 SMT: Schaue by dich around .
IT NMT: Switches to paused. SMT: To Schaue by itself .
Koran NMT: Take heed of your own souls. SMT: And you see.
Subtitles NMT: Look around you. SMT: Look around you .
rare words
Rare Words
• More frequent in training → more likely to get right in test
• Let's measure this
• One problem
— frequency measured for input words
— translation correctness measured for output words
Translation Accuracy for Input Words
• Generate word alignment between input and output words
• Look up count of input word in training
• Link to output word via word alignment
• Check if it is also in the reference translation
• A lot of tedious special cases
— one-to-many alignment, only some output words in reference
— input word not aligned to any target word
— many-to-one alignment
— output word occurs multiple times in output or reference sentence
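The procedure above can be sketched as follows. This is a minimal illustration, not the evaluation code behind the slides. The function name `accuracy_by_count`, the alignment format (pairs of input and output positions), and the handling of the special cases are assumptions: unaligned input words are skipped, a one-to-many aligned word counts as correct only if all of its output words are in the reference, and each reference word can be credited only once.

```python
from collections import Counter, defaultdict

def accuracy_by_count(source, output, reference, alignment, train_counts):
    """Return {training count: (correct, total)} for one sentence.

    source, output, reference: lists of tokens.
    alignment: iterable of (source_index, output_index) pairs.
    train_counts: mapping from input word to its count in the training data.
    """
    # Group aligned output positions by input position.
    aligned = defaultdict(list)
    for s, o in alignment:
        aligned[s].append(o)

    # Multiset of reference words, so a repeated output word is credited
    # only as often as it occurs in the reference.
    ref_left = Counter(reference)

    stats = defaultdict(lambda: [0, 0])
    for s, word in enumerate(source):
        if s not in aligned:
            continue  # assumption: unaligned input words are skipped
        needed = Counter(output[o] for o in aligned[s])
        # Assumption: correct only if every aligned output word is still
        # available in the reference.
        correct = all(ref_left[w] >= n for w, n in needed.items())
        if correct:
            ref_left -= needed
        bucket = train_counts.get(word, 0)
        stats[bucket][0] += int(correct)
        stats[bucket][1] += 1
    return {count: tuple(v) for count, v in stats.items()}
```

Summing the `(correct, total)` pairs over a test set and grouping the counts into bins gives accuracy as a function of training frequency.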
Count vs. Accuracy
word alignment
Word Alignment
[Figure: alignment matrix with the German sentence "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ." as rows and the English words as columns, with alignment scores in the cells.]
Word Alignment?
[Figure: alignment matrix with the English sentence "the relationship between Obama and Netanyahu has been stretched for years" as rows and the German words as columns, with alignment scores in the cells.]
beam search
[Figure: plot with beam size on the x-axis (1, 2, 4, 8, 12, 20, 30, 50, 100, 200, 500, 1,000).]
noisy data
Noise in Training Data
• Crawled parallel data from the web (very noisy)
SMT NMT
WMT17 24.0 27.2
+ Paracrawl 25.2 (+1.2) 17.3 (-9.9)
(German-English, 90m words each of WMT17 and Crawl data)
Raw crawl data   5%           10%          20%          50%          100%
  NMT            27.4 (+0.2)  26.6 (-0.6)  24.7 (-2.5)  20.9 (-6.3)  17.3 (-9.9)
  SMT            24.2 (+0.2)  24.2 (+0.2)  24.4 (+0.4)  24.8 (+0.8)  25.2 (+1.2)
• Corpus cleaning methods [Xu and Koehn, EMNLP 2017] give improvements
Types of Noise
• Misaligned sentences
• Disfluent language (from MT, bad translations)
• Wrong language data (e.g., French in German-English corpus)
• Untranslated sentences
• Short segments (e.g., dictionaries)
• Mismatched domain
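Some of these noise types can be flagged with simple heuristics. The sketch below is only an illustration and is not the cleaning method of Xu and Koehn. The function name `looks_noisy` and the thresholds `min_words` and `max_ratio` are assumptions chosen for the example. Wrong-language, disfluent, and out-of-domain data are not covered, since they need a language identifier or a language model.

```python
def looks_noisy(src, tgt, min_words=3, max_ratio=3.0):
    """Return a label for a suspicious sentence pair, or None.

    The heuristics and thresholds are hypothetical examples.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    # An empty side carries no translation information.
    if not src_tokens or not tgt_tokens:
        return "empty"
    # Untranslated sentence: the target is a copy of the source.
    if src.strip().lower() == tgt.strip().lower():
        return "untranslated"
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    # Short segment, for example a dictionary entry.
    if longer < min_words:
        return "short"
    # Very different lengths hint at a misaligned pair.
    if longer / shorter > max_ratio:
        return "length mismatch"
    return None
```

Whether a flagged pair should be dropped is a separate decision; the function only labels it.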
Mismatched Sentences
• Artificially created by randomly shuffling sentence order
• Added to existing parallel corpus in different amounts
         5%           10%          20%          50%          100%
  NMT                                           26.1 (-1.1)  25.3 (-1.9)
  SMT    24.0 (-0.0)  24.0 (-0.0)  23.9 (-0.1)  23.9 (-0.1)  23.4 (-0.6)
• Bigger impact on NMT than SMT
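The construction of this noise can be sketched as follows. This is a minimal sketch under assumptions: the function name `add_misaligned_noise`, the fixed random seed, and counting the noise amount in sentence pairs relative to the size of the clean corpus are illustrative choices, not details taken from the slides.

```python
import random

def add_misaligned_noise(clean_pairs, noise_pairs, amount, seed=0):
    """Append misaligned pairs to a clean parallel corpus.

    clean_pairs, noise_pairs: lists of (source, target) sentence pairs.
    amount: noise size as a fraction of the clean corpus (0.05 for 5%).
    """
    rng = random.Random(seed)
    n = min(int(len(clean_pairs) * amount), len(noise_pairs))
    sample = rng.sample(noise_pairs, n)
    sources = [s for s, _ in sample]
    targets = [t for _, t in sample]
    # Shuffle only the target side; a few pairs may stay matched by chance.
    rng.shuffle(targets)
    return clean_pairs + list(zip(sources, targets))
```

Calling it with `amount` set to 0.05, 0.1, 0.2, 0.5, and 1.0 would produce the five training conditions.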
Misordered Words
• Artificially created by randomly shuffling words in each sentence
           5%           10%          20%          50%          100%
Source
  NMT                                             26.6 (-0.6)  25.5 (-1.7)
  SMT      24.0 (-0.0)  23.6 (-0.4)  23.9 (-0.1)  23.6 (-0.4)  23.7 (-0.3)
Target
  NMT                                             26.7 (-0.5)  26.1 (-1.1)
  SMT      24.0 (-0.0)  24.0 (-0.0)  23.4 (-0.6)  23.2 (-0.8)  22.9 (-1.1)
• Similar impact on NMT and SMT, worse for source reshuffle
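Word-order noise of this kind can be generated with a few lines. The sketch below is an assumption-laden illustration: the function name `shuffle_words`, the `side` parameter, whitespace tokenization, and the fixed seed are not taken from the slides.

```python
import random

def shuffle_words(pairs, side="source", seed=0):
    """Shuffle the words of one side of each (source, target) pair."""
    rng = random.Random(seed)
    noisy = []
    for src, tgt in pairs:
        # Assumption: tokens are separated by whitespace.
        tokens = (src if side == "source" else tgt).split()
        rng.shuffle(tokens)
        shuffled = " ".join(tokens)
        if side == "source":
            noisy.append((shuffled, tgt))
        else:
            noisy.append((src, shuffled))
    return noisy
```

The other side of each pair is left untouched, which matches the separate source and target conditions in the table.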
Untranslated Sentences
           5%           10%           20%           50%           100%
Source
  NMT      17.6 (-9.8)  11.2 (-16.0)   5.6 (-21.6)   3.2 (-24.0)   3.2 (-24.0)
  SMT      23.8 (-0.2)  23.9 (-0.1)   23.8 (-0.2)   23.4 (-0.6)   21.1 (-2.9)
Target
  NMT      27.2 (-0.0)  27.0 (-0.2)   26.7 (-0.5)   26.8 (-0.4)   26.9 (-0.3)
Wrong Language
           5%           10%          20%          50%          100%
fr source
  NMT      26.9 (-0.3)  26.8 (-0.4)  26.8 (-0.4)  26.8 (-0.4)  26.8 (-0.4)
  SMT      24.0 (-0.0)  23.9 (-0.1)  23.9 (-0.1)  23.9 (-0.1)  23.8 (-0.2)
fr target
  NMT      26.7 (-0.5)  26.6 (-0.6)  26.7 (-0.5)  26.2 (-1.0)  25.0 (-2.2)
  SMT      24.0 (-0.0)  23.9 (-0.1)  23.8 (-0.2)  23.5 (-0.5)  23.4 (-0.6)
• Surprisingly robust, maybe due to domain mismatch of French data
Short Sentences
           5%           10%          20%          50%
1-2 words
  NMT      27.1 (-0.1)  26.5 (-0.7)  26.7 (-0.5)
  SMT      24.1 (+0.1)  23.9 (-0.1)  23.8 (-0.2)
1-5 words
  NMT      27.8 (+0.6)  27.6 (+0.4)  [illegible]  26.6 (-0.6)
  SMT      24.2 (+0.2)  24.5 (+0.5)  24.5 (+0.5)  24.2 (+0.2)
• No harm done
control over output
Specifying Decoding Constraints
• Overriding the decisions of the decoder
• Why?
⇒ translations have to follow strict terminology
⇒ rule-based translation of dates, quantities, etc.
XML Schema
The router is a model Psy X500 Pro .
• The XML tags specify to the decoder that
- the word router to be translated as Router
- The router is, to be translated before the rest ()
- brand name Psy X500 Pro to be translated as a unit (, )
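To make the three kinds of constraints concrete, the sketch below reads them out of a marked-up sentence. Everything about the markup is an assumption for illustration: the tag names `x`, `wall`, and `zone`, the `translation` attribute, and the position of the wall are modeled loosely on Moses-style XML input and are not taken from the slide. The function name `read_constraints` is hypothetical as well.

```python
import xml.etree.ElementTree as ET

def read_constraints(marked_up):
    """Split a marked-up sentence into tokens and decoding constraints.

    Token positions are 0-based; spans are half-open (start, end).
    """
    # Wrap in a root element so the fragment parses as XML.
    root = ET.fromstring(f"<s>{marked_up}</s>")
    tokens = (root.text or "").split()
    constraints = []
    for node in root:
        start = len(tokens)
        tokens.extend((node.text or "").split())
        end = len(tokens)
        if node.tag == "wall":
            # Everything before this position is translated first.
            constraints.append(("wall", start))
        elif node.tag == "zone":
            # The span is translated as a unit.
            constraints.append(("zone", start, end))
        elif "translation" in node.attrib:
            # The span must be translated as the given string.
            constraints.append(
                ("translation", start, end, node.attrib["translation"])
            )
        tokens.extend((node.tail or "").split())
    return tokens, constraints

# Hypothetical markup for the example sentence.
example = ('The <x translation="Router">router</x> is <wall/> a model '
           '<zone>Psy X500 Pro</zone> .')
tokens, constraints = read_constraints(example)
```

A decoder that honors such constraints would consult the returned list while expanding hypotheses; that part is outside this sketch.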
Formal Constraints
• Subtitles
— translation has to fit into space on screen (may have to be shortened)
— input and output broken up into lines
• Speech translation
— input often not well-formed
— real time translation: start while sentence is spoken
— subtitles: have to be readable in limited time
— dubbing: sync up with video of speaker's mouth movement
• Poetry
— meter
— rhyme
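The subtitle constraints (space on screen, line breaks, reading time) are easy to check mechanically. The sketch below is one possible check, not a standard. The function name `fit_subtitle` and the limits of 42 characters per line, 2 lines, and 17 characters per second are assumptions picked for the example.

```python
import textwrap

def fit_subtitle(text, max_chars=42, max_lines=2, duration_s=None, max_cps=17.0):
    """Break a subtitle into lines and report whether it fits.

    duration_s, if given, must be positive; it is the time on screen.
    All limits are hypothetical defaults.
    """
    # Break at word boundaries so no line exceeds max_chars.
    lines = textwrap.wrap(text, width=max_chars)
    fits_space = len(lines) <= max_lines
    fits_time = True
    if duration_s is not None:
        # Reading speed in characters per second.
        fits_time = len(text) / duration_s <= max_cps
    return lines, fits_space and fits_time
```

A translation that fails the check would have to be shortened, which is the point where the constraint feeds back into translation itself.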
questions?