Current Challenges
Philipp Koehn
5 November 2020

Philipp Koehn Machine Translation: Current Challenges 5 November 2020

WMT 2016
[Figure: scatter plot of WMT 2016 systems, human score (vertical axis) against BLEU (horizontal axis, 18 to 36). Legend: Neural MT, Statistical MT. Labeled systems: uedin-nmt, metamind, uedin-syntax, NYU-UMONTREAL, ONLINE-B, PROMT-RULE-BASED, KIT-LIMSI, CAMBRIDGE, KIT, ONLINE-A, JHU-SYNTAX, JHU-PBMT, UEDIN-PBMT, ONLINE-F, ONLINE-G]
(in 2017 barely any statistical machine translation submissions)

2017: Google: "Near Human Quality"
[Figure: bar chart of translation quality by translation model (human, neural (GNMT), phrase-based (PBMT)) on a scale up to 6 (perfect translation), for English > Spanish, English > French, English > Chinese, Spanish > English, French > English, Chinese > English]

2018: More Hype
• Microsoft Research Achieves Human Parity For Chinese English Translation
  Written by Sue Gee, Wednesday, 21 March 2018
  Researchers in Microsoft's labs in Beijing and in Redmond and Washington have developed an AI machine translation system that can translate with the same accuracy as a human from Chinese to English.
• SDL Cracks Russian to English Neural Machine Translation
  Global Enterprises to Capitalize on Near Perfect Russian to English Machine Translation as SDL Sets New Industry Standard
  90% of the system's output labelled as perfect by professional Russian-English translators

Just Better Fluency?
[Figure: two bar charts comparing ONLINE-B and UEDIN-NMT on CS→EN, DE→EN, RO→EN, RU→EN (vertical axis 60 to 100). Panel titles: Adequacy +1%, Fluency +13%]
(from: Sennrich and Haddow, 2017)

Challenges
• Lack of training data
• Domain mismatch
• Rare words
• Word alignment
• Beam search
• Noise
• Control over output
• Interpretability

lack of training data

Amount of Training Data
[Figure: horizontal axis: Corpus Size (English Words)]
English-Spanish systems trained on 0.4 million to 385.7 million words

Translation Examples
Source: A Republican strategy to counter the re-election of Obama

Fraction of data | Translation
1/1024           | Un órgano de coordinación para el anuncio de libre determinación
1/512            | Lista de una estrategia para luchar contra la elección de hojas de Ohio
1/256            | Explosion realiza una estrategia divisiva de luchar contra las elecciones de autor
1/128            | Una estrategia republicana para la eliminación de la reelección de Obama
1/64             | Estrategia siria para contrarrestar la reelección del Obama .
1/32+            | Una estrategia republicana para contrarrestar la reelección de Obama

domain mismatch

Domain Mismatch
(two scores per cell: NMT left, SMT right, as on the deck's other charts)

System    | Law       | Medical   | IT        | Koran     | Subtitles
All Data  | 30.5 32.8 | 45.1 42.2 | 35.3 44.7 | 17.9 17.9 | 26.4 20.8
Law       | 31.1 34.4 | 12.1 18.2 |  3.5  6.9 |  1.3  2.2 |  2.8  6.0
Medical   |  3.9 10.2 | 39.4 43.5 |  2.0  8.5 |  0.6  2.0 |  1.4  5.8
IT        |  1.9  3.7 |  6.5  5.3 | 42.1 39.8 |  1.8  1.6 |  3.9  4.7
Koran     |  0.4  1.8 |  0.0  2.1 |  0.0  2.3 | 15.9 18.8 |  1.0  5.5
Subtitles |  7.0  9.9 |  9.3 17.8 |  9.2 13.6 |  9.0  8.4 | 25.9 22.1

Translation Examples
Source: Schaue um dich herum.
Ref.: Look around you.
All
  NMT: Look around you.
  SMT: Look around you.
Law
  NMT: Sughum gravecorn.
  SMT: In order to implement dich Schaue .
Medical
  NMT: EMEA / MB / 049 / 01-EN-Final Work progamme for 2002
  SMT: Schaue by dich around .
IT
  NMT: Switches to paused.
  SMT: To Schaue by itself .
Koran
  NMT: Take heed of your own souls.
  SMT: And you see.
Subtitles
  NMT: Look around you.
  SMT: Look around you .

rare words

Rare Words
• More frequent in training → more likely to get right in test
• Let's measure this
• One problem
  - frequency measured for input words
  - translation correctness measured for output words

Translation Accuracy for Input Words
• Generate word alignment between input and output words
• Look up count of input word in training
• Link to output word via word alignment
• Check if it is also in the reference translation
• A lot of tedious special cases
  - one-to-many alignment, only some output words in reference
  - input word not aligned to any target word
  - many-to-one alignment
  - output word occurs multiple times in output or reference sentence

Count vs. Accuracy

word alignment

Word Alignment
[Figure: word alignment matrix with numeric cell values; German words: die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt]

Word Alignment?
[Figure: word alignment matrix with numeric cell values; English words: the relationship between Obama and Netanyahu has been stretched for years]

beam search

[Figure: horizontal axis: Beam Size, with ticks at 1, 2, 4, 8, 12, 20, 30, 50, 100, 200, 500, 1,000]

noisy data

Noise in Training Data
• Crawled parallel data from the web (very noisy)

            | SMT         | NMT
WMT17       | 24.0        | 27.2
+ Paracrawl | 25.2 (+1.2) | 17.3 (-9.9)
(German-English, 90m words each of WMT17 and Crawl data)

Scores by amount of added data; changes relative to the WMT17 baseline (NMT 27.2, SMT 24.0) are given in parentheses where legible on the slide; ? marks a value that is not legible.

Raw crawl data | 5%          | 10%         | 20%         | 50%         | 100%
NMT            | 27.4 (+0.2) | 26.6 (-0.6) | 24.7 (-2.5) | 20.9 (-6.3) | 17.3 (-9.9)
SMT            | 24.2 (+0.2) | 24.2 (+0.2) | 24.4 (+0.4) | 24.8 (+0.8) | 25.2 (+1.2)

• Corpus cleaning methods [Xu and Koehn, EMNLP 2017] give improvements

Types of Noise
• Misaligned sentences
• Disfluent language (from MT, bad translations)
• Wrong language data (e.g., French in German-English corpus)
• Untranslated sentences
• Short segments (e.g., dictionaries)
• Mismatched domain

Mismatched Sentences
• Artificially created by randomly shuffling sentence order
• Added to existing parallel corpus in different amounts

    | 5%          | 10%         | 20%         | 50%         | 100%
NMT | ?           | ?           | ?           | 26.1        | 25.3
SMT | 24.0 (-0.0) | 24.0 (-0.0) | 23.9 (-0.1) | 23.9 (-0.1) | 23.4 (-0.6)

• Bigger impact on NMT (green, left) than SMT (blue, right)

Misordered Words
• Artificially created by randomly shuffling words in each sentence

           | 5%          | 10%         | 20%         | 50%         | 100%
Source NMT | ?           | ?           | ?           | 26.6 (-0.6) | 25.5
Source SMT | 24.0 (-0.0) | 23.6 (-0.4) | 23.9 (-0.1) | 23.6 (-0.4) | 23.7
Target NMT | ?           | ?           | ?           | 26.7 (-0.5) | 26.1 (-1.1)
Target SMT | 24.0 (-0.0) | 24.0 (-0.0) | 23.4 (-0.6) | 23.2 (-0.8) | 22.9 (-1.1)

• Similar impact on NMT and SMT, worse for source reshuffle

Untranslated Sentences

           | 5%          | 10%          | 20%         | 50%         | 100%
Source NMT | 17.6 (-9.8) | 11.2 (-16.0) | 5.6 (-21.6) | 3.2 (-24.0) | 3.2 (-24.0)
Source SMT | 23.8 (-0.2) | 23.9 (-0.1)  | 23.8 (-0.2) | 23.4 (-0.6) | 21.1 (-2.9)
Target NMT | 27.2 (-0.0) | 27.0 (-0.2)  | 26.7 (-0.5) | 26.8 (-0.4) | 26.9 (-0.3)
Target SMT | ?           | ?            | ?           | ?           | ?

Wrong Language

              | 5%          | 10%         | 20%         | 50%         | 100%
fr source NMT | 26.9 (-0.3) | 26.8 (-0.4) | 26.8 (-0.4) | 26.8 (-0.4) | 26.8 (-0.4)
fr source SMT | 24.0 (-0.0) | 23.9 (-0.1) | 23.9 (-0.1) | 23.9 (-0.1) | 23.8 (-0.2)
fr target NMT | 26.7 (-0.5) | 26.6 (-0.6) | 26.7 (-0.5) | 26.2 (-1.0) | 25.0 (-2.2)
fr target SMT | 24.0 (-0.0) | 23.9 (-0.1) | 23.8 (-0.2) | 23.5 (-0.5) | 23.4

• Surprisingly robust, maybe due to domain mismatch of French data

Short Sentences

              | 5%          | 10%         | 20%         | 50%
1-2 words NMT | 27.1 (-0.1) | 26.5 (-0.7) | 26.7 (-0.5) | ?
1-2 words SMT | 24.1 (+0.1) | 23.9 (-0.1) | 23.8 (-0.2) | ?
1-5 words NMT | 27.8 (+0.6) | 27.6 (+0.4) | ?           | 26.6 (-0.6)
1-5 words SMT | 24.2 (+0.2) | 24.5 (+0.5) | 24.5 (+0.5) | 24.2 (+0.2)

• No harm done

control over output

Specifying Decoding Constraints
• Overriding the decisions of the decoder
• Why?
  ⇒ translations have to follow strict terminology
  ⇒ rule-based translation of dates, quantities, etc.

XML Schema
The router is a model Psy X500 Pro .
• The XML tags specify to the decoder that
  - the word router to be translated as Router
  - The router is, to be translated before the rest ()
  - brand name Psy X500 Pro to be translated as a unit (, )

Formal Constraints
• Subtitles
  - translation has to fit into space on screen (may have to be shortened)
  - input and output broken up into lines
• Speech translation
  - input often not well-formed
  - real time translation: start while sentence is spoken
  - subtitles: have to be readable in limited time
  - dubbing: sync up with video of speaker's mouth movement
• Poetry
  - meter
  - rhyme

questions?
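
Backup sketch for the "Translation Accuracy for Input Words" slide: a minimal Python illustration of the four listed steps (word alignment, training count lookup, link to the output word, reference check). The function name `accuracy_by_count`, the data layout, and the treatment of special cases are assumptions made for this sketch, not the procedure actually used for the "Count vs. Accuracy" plot. In particular, unaligned input words are simply skipped, and an input word counts as correct only if every output word aligned to it appears somewhere in the reference.

```python
from collections import Counter, defaultdict


def accuracy_by_count(examples, train_counts):
    """Translation accuracy of input words, grouped by training count.

    examples: iterable of (source_tokens, output_tokens, reference_tokens,
              alignment), where alignment is a set of
              (source_index, output_index) pairs (hypothetical layout).
    train_counts: mapping from input word to its count in the training data.
    Returns {training_count: fraction of input words translated correctly}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for src, out, ref, alignment in examples:
        ref_words = set(ref)
        # Step 1: the word alignment is given; index it by input position.
        links = defaultdict(list)
        for s, o in alignment:
            links[s].append(o)
        for s, word in enumerate(src):
            aligned = links.get(s)
            if not aligned:
                continue  # assumption: unaligned input words are skipped
            # Step 2: look up the count of the input word in training.
            count = train_counts.get(word, 0)
            # Steps 3 and 4: follow the alignment to the output words and
            # check them against the reference (assumption: all must match).
            correct = all(out[o] in ref_words for o in aligned)
            totals[count] += 1
            hits[count] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}


# Toy data, invented for illustration only.
train_counts = Counter("das haus ist klein das haus".split())
examples = [(
    ["das", "haus", "ist", "winzig"],     # input
    ["the", "house", "is", "tiny"],       # system output
    ["the", "house", "is", "small"],      # reference
    {(0, 0), (1, 1), (2, 2), (3, 3)},     # word alignment
)]
result = accuracy_by_count(examples, train_counts)
```

Grouping the resulting counts into frequency bins would give a curve of the kind the "Count vs. Accuracy" slide refers to; handling the remaining special cases from the slide (many-to-one alignment, repeated output words) is left out here.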