Introduction
Philipp Koehn
4 September 2018

Philipp Koehn Machine Translation 4 September 2018

Administrativa
• Class web site: http://www.mt-class.org/jhu/
• Graduate section: Tuesdays and Thursdays, 1:30-2:45, Ames 234
• Instructor: Philipp Koehn
• TAs: Huda Khayrallah, Brian Thompson, Tanay Agarwal
• Grading
  – five programming assignments (12% each)
  – final project (30%)
  – in-class presentation: language in ten minutes (10%)

Why Take This Class?
• Close look at an artificial intelligence problem
• Practical introduction to natural language processing
• Introduction to deep learning for structured prediction

Textbook
Neural Machine Translation
Philipp Koehn
Center for Speech and Language Processing
Department of Computer Science
Johns Hopkins University
1st public draft August 7, 2015
2nd public draft (arxiv) September 22, 2017
3rd draft September 25, 2017

some history

An Old Idea
Warren Weaver on translation as code breaking (1947):
When I look at an article in Russian, I say: "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode".

Early Efforts and Disappointment
• Excited research in 1950s and 1960s
  1954 Georgetown experiment
  Machine could translate 250 words and 6 grammar rules
• 1966 ALPAC report:
  – only $20 million spent on translation in the US per year
  – no point in machine translation

Rule-Based Systems
• Rule-based systems
  – build dictionaries
  – write transformation rules
  – refine, refine, refine
• Météo system for weather forecasts (1976)
• Systran (1968), Logos and Metal (1980s)

"have" :=
  if subject(animate) and object(owned-by-subject)
    then translate to "kade... aahe"
  if subject(animate) and object(kinship-with-subject)
    then translate to "laa... aahe"
  if subject(inanimate)
    then translate to "madhye... aahe"

Statistical Machine Translation
• 1980s: IBM
• 1990s: increased research
• Mid 2000s: Phrase-Based MT (Moses, Google)
• Around 2010: commercial viability

Neural Machine Translation
• Late 2000s: successful use of neural models for computer vision
• Since mid 2010s: neural network models for machine translation
• 2016: Neural machine translation the new state of the art

Hype
[Figure: hype versus reality on a timeline from 1950 to 2010, marking the Georgetown experiment, expert systems / 5th generation AI, statistical MT, and neural MT]

how good is machine translation?

Machine Translation: Chinese
[Figure: example output, not recoverable]

Machine Translation: French
[Figure: example output, not recoverable]

A Clear Plan
[Figure: Source and Target connected at increasing depth by Lexical Transfer, Syntactic Transfer, Semantic Transfer, and Interlingua; labels Analysis and Generation]

Learning from Data
[Figure: Training: Training Data (parallel corpora, monolingual corpora, dictionaries) and Linguistic Tools produce a Statistical Machine Translation System. Using: the Statistical Machine Translation System turns Source Text into a Translation]

why is that a good plan?
Word Translation Problems
• Words are ambiguous
  He deposited money in a bank account with a high interest rate.
  Sitting on the bank of the Mississippi, a passing ship piqued his interest.
• How do we find the right meaning, and thus translation?
• Context should be helpful

Syntactic Translation Problems
• Languages have different sentence structure
  das     behaupten   sie     wenigstens
  this    claim       they    at least
  the                 she
• Convert from object-verb-subject (OVS) to subject-verb-object (SVO)
• Ambiguities can be resolved through syntactic analysis
  – the meaning "the" of "das" not possible (not a noun phrase)
  – the meaning "she" of "sie" not possible (subject-verb agreement)

Semantic Translation Problems
• Pronominal anaphora
  I saw the movie and it is good.
• How to translate "it" into German (or French)?
  – "it" refers to "movie"
  – "movie" translates to "Film"
  – "Film" has masculine gender
  – ergo: "it" must be translated into masculine pronoun "er"
• We are not handling this very well [Le Nagard and Koehn, 2010]

Semantic Translation Problems
• Coreference
  Whenever I visit my uncle and his daughters, I can’t decide who is my favorite cousin.
• How to translate "cousin" into German? Male or female?
• Complex inference required

Semantic Translation Problems
• Discourse
  Since you brought it up, I do not agree with you.
  Since you brought it up, we have been working on it.
• How to translate "since"? Temporal or conditional?
• Analysis of discourse structure: a hard problem

Learning from Data
• What is the best translation?
  Sicherheit → security     14,516
  Sicherheit → safety       10,015
  Sicherheit → certainty       334
• Counts in European Parliament corpus
• Phrasal rules
  Sicherheitspolitik → security policy          1580
  Sicherheitspolitik → safety policy              13
  Sicherheitspolitik → certainty policy            0
  Lebensmittelsicherheit → food security          51
  Lebensmittelsicherheit → food safety          1084
  Lebensmittelsicherheit → food certainty          0
  Rechtssicherheit → legal security              156
  Rechtssicherheit → legal safety                  5
  Rechtssicherheit → legal certainty             723

Learning from Data
• What is most fluent?
  a problem for translation     13,000
  a problem of translation      61,600
  a problem in translation      81,700
  a translation problem        235,000
• Hits on Google

Learning from Data
• What is most fluent?
  police disrupted the demonstration       2,140
  police broke up the demonstration       66,600
  police dispersed the demonstration      25,800
  police ended the demonstration             762
  police dissolved the demonstration       2,030
  police stopped the demonstration       722,000
  police suppressed the demonstration      1,400
  police shut down the demonstration       2,040

where are we now?

Word Alignment
[Figure: word alignment matrix between "michael geht davon aus , dass er im haus bleibt" and "michael assumes that he will stay in the house"]

Phrase-Based Model
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
• Workhorse of today’s statistical machine translation

Syntax-Based Translation
[Figure: German parse tree for "Sie will eine Tasse Kaffee trinken" (tags PPER VAFIN ART NN NN VVINF) and the English tree for "she wants to drink a cup of coffee", built up in steps ➊ to ➏]

Semantic Translation
• Abstract meaning representation [Knight et al., ongoing]
  (w / want-01
    :agent (b / boy)
    :theme (l / love
      :agent (g / girl)
      :patient b))
• Generalizes over equivalent syntactic constructs (e.g., active and passive)
• Defines semantic relationships
  – semantic roles
  – co-reference
  – discourse relations
• In a very preliminary stage

Neural Model
[Figure: neural translation model with layers Input Word Embeddings, Left-to-Right Recurrent NN, Right-to-Left Recurrent NN, Attention, Input Context, Hidden State, Output Word Predictions, Given Output Words, Error, and Output Word Embedding; example sentences "the house is big ." and "das Haus ist groß ,"]

what is it good for?

what is it good enough for?

Why Machine Translation?
Assimilation: reader initiates translation, wants to know content
• user is tolerant of inferior quality
• focus of majority of research (GALE program, etc.)
Communication: participants don’t speak same language, rely on translation
• users can ask questions when something is unclear
• chat room translations, hand-held devices
• often combined with speech recognition, IWSLT campaign
Dissemination: publisher wants to make content available in other languages
• high demands for quality
• currently almost exclusively done by human translators

Problem: No Single Right Answer
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport’s security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport’s security is the responsibility of the Israeli security officials.
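
To make the disagreement between the translations above concrete, here is a minimal sketch that counts how many word-level edits separate two of them. It is an illustration only, not something taken from the slides: the function names (tokenize, word_edit_distance, edit_rate) are hypothetical, and the measure is a simplification. Edit-rate metrics such as TER also allow block shifts, which this sketch leaves out.

```python
def tokenize(sentence):
    """Lowercase, split on whitespace, and strip simple punctuation (assumed scheme)."""
    return [w.strip(".,").lower() for w in sentence.split() if w.strip(".,")]


def word_edit_distance(hyp_words, ref_words):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn hyp_words into ref_words (Levenshtein distance on words)."""
    # prev[j] = distance between the hypothesis prefix seen so far and ref_words[:j]
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        curr = [i]  # turning i hypothesis words into an empty reference: i deletions
        for j, ref_word in enumerate(ref_words, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,                      # delete hyp_word
                curr[j - 1] + 1,                  # insert ref_word
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def edit_rate(hypothesis, reference):
    """Word edits divided by reference length (simplified: no shift operation)."""
    hyp_words, ref_words = tokenize(hypothesis), tokenize(reference)
    return word_edit_distance(hyp_words, ref_words) / len(ref_words)


if __name__ == "__main__":
    a = "Israel is in charge of the security at this airport."
    b = "Israel is responsible for safety work at this airport."
    print(word_edit_distance(tokenize(a), tokenize(b)))  # 5 word edits
    print(round(edit_rate(a, b), 2))
```

Two translations that a human would accept as equally good still differ in about half of their words here, which is why comparing system output against a single reference understates quality.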
Quality
HTER    assessment
0%
        publishable
10%
        editable
20%
30%
        gistable
40%
        triagable
50%
(scale developed in preparation of DARPA GALE programme)

Applications
HTER    assessment     application examples
0%
        publishable    Seamless bridging of language divide
                       Automatic publication of official announcements
10%
        editable       Increased productivity of human translators
20%
                       Access to official publications
                       Multi-lingual communication (chat, social networks)
30%
        gistable       Information gathering
                       Trend spotting
40%
        triagable      Identifying relevant documents
50%

Current State of the Art
HTER    assessment     language pairs and domains
0%
        publishable    French-English restricted domain
                       French-English technical document localization
10%
        editable       French-English news stories
                       German-English news stories
20%
30%
        gistable       Swahili–English news stories
40%
        triagable      Uyghur–English news stories
50%
(informal rough estimates by presenter)

Thank You
questions?