Introduction
Philipp Koehn
4 September 2018
Philipp Koehn Machine Translation 4 September 2018
Administrativa
• Class web site: http://www.mt-class.org/jhu/
• Graduate section: Tuesdays and Thursdays, 1:30-2:45, Ames 234
• Instructor: Philipp Koehn
• TAs: Huda Khayrallah, Brian Thompson, Tanay Agarwal
• Grading
– five programming assignments (12% each)
– final project (30%)
– in-class presentation: language in ten minutes (10%)
Why Take This Class?
• Close look at an artificial intelligence problem
• Practical introduction to natural language processing
• Introduction to deep learning for structured prediction
Textbook
Neural Machine Translation
Philipp Koehn
Center for Speech and Language Processing
Department of Computer Science
Johns Hopkins University
1st public draft
August 7, 2015
2nd public draft (arxiv)
September 22, 2017
3rd draft
September 25, 2017
some history
An Old Idea
Warren Weaver on translation as code breaking (1947):
When I look at an article in Russian, I say:
"This is really written in English,
but it has been coded in some strange symbols.
I will now proceed to decode."
Early Efforts and Disappointment
• Enthusiastic research in the 1950s and 1960s
– 1954 Georgetown experiment: the machine could translate using a vocabulary of 250 words and 6 grammar rules
• 1966 ALPAC report:
– only $20 million spent on translation in the US per year
– no point in machine translation
Rule-Based Systems
• Rule-based systems
– build dictionaries
– write transformation rules
– refine, refine, refine
• Météo system for weather forecasts (1976)
• Systran (1968), Logos and Metal (1980s)
"have" :=
  if subject(animate) and object(owned-by-subject)
    then translate to "kade... aahe"
  if subject(animate) and object(kinship-with-subject)
    then translate to "laa... aahe"
  if subject(inanimate)
    then translate to "madhye... aahe"
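The transformation rule for "have" can be sketched as a small function. This is only an illustration of how such hand-written rules behave: the function name, its parameters, and the error for uncovered cases are assumptions, not part of any of the systems named above.

```python
def translate_have(subject_animate: bool, object_relation: str = "") -> str:
    """Pick the target pattern for English "have" using the three rules above.

    The parameter names and the string labels for the object relation are
    hypothetical; they mirror the conditions written in the rule.
    """
    if subject_animate and object_relation == "owned-by-subject":
        return "kade... aahe"
    if subject_animate and object_relation == "kinship-with-subject":
        return "laa... aahe"
    if not subject_animate:
        return "madhye... aahe"
    # No rule fires: a rule-based system needs yet another hand-written rule.
    raise ValueError("no rule covers this case")


print(translate_have(True, "owned-by-subject"))
print(translate_have(True, "kinship-with-subject"))
print(translate_have(False))
```

The uncovered case (animate subject, some other relation) shows why the workflow ends in "refine, refine, refine".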
Statistical Machine Translation
• 1980s: IBM
• 1990s: increased research
• Mid 2000s: Phrase-Based MT (Moses, Google)
• Around 2010: commercial viability
Neural Machine Translation
• Late 2000s: successful use of neural models for computer vision
• Since mid 2010s: neural network models for machine translation
• 2016: Neural machine translation the new state of the art
Hype
[Figure: Hype versus Reality on a timeline from 1950 to 2010, annotated with the Georgetown experiment, Expert systems / 5th generation AI, Statistical MT, and Neural MT.]
how good is machine translation?
Machine Translation: Chinese
Machine Translation: French
A Clear Plan
[Figure: translation pyramid between Source and Target. Analysis climbs from the source side and Generation descends to the target side; transfer can happen at increasing depth (Lexical Transfer, Syntactic Transfer, Semantic Transfer), with Interlingua at the top.]
Learning from Data
[Figure: Training Data (parallel corpora, monolingual corpora, dictionaries) and Linguistic Tools are used to train a Statistical Machine Translation System; using the trained system, Source Text is turned into a Translation.]
why is that a good plan?
Word Translation Problems
• Words are ambiguous
He deposited money in a bank account
with a high interest rate.
Sitting on the bank of the Mississippi,
a passing ship piqued his interest.
• How do we find the right meaning, and thus translation?
• Context should be helpful
Syntactic Translation Problems
• Languages have different sentence structure
das       behaupten  sie       wenigstens
this/the  claim      they/she  at least
• Convert from object-verb-subject (OVS) to subject-verb-object (SVO)
• Ambiguities can be resolved through syntactic analysis
– das cannot mean "the" here (it is not part of a noun phrase)
– sie cannot mean "she" here (subject-verb agreement)
Semantic Translation Problems
• Pronominal anaphora
I saw the movie and it is good.
• How to translate it into German (or French)?
– it refers to movie
– movie translates to Film
– Film has masculine gender
– ergo: it must be translated into masculine pronoun er
• We are not handling this very well [Le Nagard and Koehn, 2010]
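The chain of reasoning above (antecedent, its translation, the gender of that translation, the matching pronoun) can be sketched as two table lookups. The toy lexicon and the function name are hypothetical, and the sketch assumes the antecedent of "it" has already been identified, which is the hard part in practice.

```python
# Hypothetical toy lexicon: English noun -> (German noun, grammatical gender).
NOUN_LEXICON = {
    "movie": ("Film", "masculine"),
}

# German nominative third-person singular pronouns by grammatical gender.
PRONOUN_BY_GENDER = {
    "masculine": "er",
    "feminine": "sie",
    "neuter": "es",
}


def translate_it(antecedent: str) -> str:
    """Translate "it" given the English noun it refers to."""
    _german_noun, gender = NOUN_LEXICON[antecedent]
    return PRONOUN_BY_GENDER[gender]


print(translate_it("movie"))  # er
```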
Semantic Translation Problems
• Coreference
Whenever I visit my uncle and his daughters,
I can’t decide who is my favorite cousin.
• How to translate cousin into German? Male or female?
• Complex inference required
Semantic Translation Problems
• Discourse
Since you brought it up, I do not agree with you.
Since you brought it up, we have been working on it.
• How to translate since? Temporal or conditional?
• Analysis of discourse structure — a hard problem
Learning from Data
• What is the best translation?
Sicherheit → security 14,516
Sicherheit → safety 10,015
Sicherheit → certainty 334
• Counts in European Parliament corpus
• Phrasal rules
Sicherheitspolitik → security policy 1580
Sicherheitspolitik → safety policy 13
Sicherheitspolitik → certainty policy 0
Lebensmittelsicherheit → food security 51
Lebensmittelsicherheit → food safety 1084
Lebensmittelsicherheit → food certainty 0
Rechtssicherheit → legal security 156
Rechtssicherheit → legal safety 5
Rechtssicherheit → legal certainty 723
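One simple way to turn such counts into a translation choice is relative frequency: divide each count by the total for its source word. The sketch below uses the counts listed above; the function names are illustrative, and relative frequency is an assumed estimator here, not necessarily what any particular system uses.

```python
from collections import defaultdict

# (source, target) -> count, copied from the lists above.
COUNTS = {
    ("Sicherheit", "security"): 14516,
    ("Sicherheit", "safety"): 10015,
    ("Sicherheit", "certainty"): 334,
    ("Sicherheitspolitik", "security policy"): 1580,
    ("Sicherheitspolitik", "safety policy"): 13,
    ("Sicherheitspolitik", "certainty policy"): 0,
    ("Lebensmittelsicherheit", "food security"): 51,
    ("Lebensmittelsicherheit", "food safety"): 1084,
    ("Lebensmittelsicherheit", "food certainty"): 0,
    ("Rechtssicherheit", "legal security"): 156,
    ("Rechtssicherheit", "legal safety"): 5,
    ("Rechtssicherheit", "legal certainty"): 723,
}


def translation_probabilities(counts):
    """Relative-frequency estimate: count(source, target) / count(source)."""
    totals = defaultdict(int)
    for (source, _target), n in counts.items():
        totals[source] += n
    return {
        (source, target): n / totals[source]
        for (source, target), n in counts.items()
    }


def best_translation(counts, source):
    """Return the most frequent target phrase for a source word."""
    candidates = {t: n for (s, t), n in counts.items() if s == source}
    return max(candidates, key=candidates.get)


probabilities = translation_probabilities(COUNTS)
for word in ("Sicherheit", "Lebensmittelsicherheit", "Rechtssicherheit"):
    choice = best_translation(COUNTS, word)
    print(word, "->", choice, round(probabilities[(word, choice)], 3))
```

The same word Sicherheit ends up as security, safety, or certainty depending on the compound, which is the point of counting phrases rather than single words.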
Learning from Data
• What is most fluent?
a problem for translation 13,000
a problem of translation 61,600
a problem in translation 81,700
a translation problem 235,000
• Hits on Google
Learning from Data
• What is most fluent?
police disrupted the demonstration 2,140
police broke up the demonstration 66,600
police dispersed the demonstration 25,800
police ended the demonstration 762
police dissolved the demonstration 2,030
police stopped the demonstration 722,000
police suppressed the demonstration 1,400
police shut down the demonstration 2,040
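Using hit counts as a rough proxy for fluency amounts to ranking the candidate phrasings by frequency. The sketch below does this for the two lists above; the function names are illustrative, and raw counts are only a crude stand-in for a proper language model.

```python
# Phrase -> hit count, copied from the lists above.
TRANSLATION_HITS = {
    "a problem for translation": 13_000,
    "a problem of translation": 61_600,
    "a problem in translation": 81_700,
    "a translation problem": 235_000,
}

POLICE_HITS = {
    "police disrupted the demonstration": 2_140,
    "police broke up the demonstration": 66_600,
    "police dispersed the demonstration": 25_800,
    "police ended the demonstration": 762,
    "police dissolved the demonstration": 2_030,
    "police stopped the demonstration": 722_000,
    "police suppressed the demonstration": 1_400,
    "police shut down the demonstration": 2_040,
}


def rank_by_fluency(hit_counts):
    """Sort candidate phrasings from most to least frequent."""
    return sorted(hit_counts, key=hit_counts.get, reverse=True)


def relative_frequency(hit_counts):
    """Share of all hits that each candidate receives."""
    total = sum(hit_counts.values())
    return {phrase: n / total for phrase, n in hit_counts.items()}


for hits in (TRANSLATION_HITS, POLICE_HITS):
    best = rank_by_fluency(hits)[0]
    print(best, round(relative_frequency(hits)[best], 3))
```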
where are we now?
Word Alignment
[Figure: word alignment matrix between the English sentence "michael assumes that he will stay in the house" and the German sentence "michael geht davon aus , dass er im haus bleibt".]
Phrase-Based Model
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
• Workhorse of today’s statistical machine translation
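The three steps (segment, translate, reorder) can be sketched with a toy phrase table. Everything here is a simplification under stated assumptions: the table entries are hypothetical, segmentation is greedy longest-match, and the reordering is supplied by hand, whereas a real decoder searches over segmentations and orderings using model scores.

```python
# Hypothetical toy phrase table (German -> English). Real tables are
# learned from parallel text and hold many scored options per phrase.
PHRASE_TABLE = {
    ("michael",): "michael",
    ("geht", "davon", "aus"): "assumes",
    ("dass",): "that",
    ("er",): "he",
    ("im", "haus"): "in the house",
    ("bleibt",): "will stay",
}


def segment(words, table, max_len=3):
    """Greedy longest-match segmentation into phrases found in the table."""
    phrases, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if candidate in table:
                phrases.append(candidate)
                i += n
                break
        else:
            raise KeyError(f"no phrase covers {words[i]!r}")
    return phrases


def translate(sentence, table, order=None):
    """Segment, translate each phrase, then emit phrases in the given order."""
    phrases = segment(sentence.split(), table)
    english = [table[p] for p in phrases]
    if order is None:
        order = range(len(english))  # monotone: keep the source order
    return " ".join(english[i] for i in order)


source = "michael geht davon aus dass er im haus bleibt"
print(translate(source, PHRASE_TABLE))
print(translate(source, PHRASE_TABLE, order=[0, 1, 2, 3, 5, 4]))
```

Without reordering, the output keeps the German verb-final order; swapping the last two phrases gives the English word order.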
Syntax-Based Translation
[Figure: pair of syntax trees for the German sentence "Sie will eine Tasse Kaffee trinken" (tags PPER VAFIN ART NN NN VVINF, with NP, VP, and S nodes) and the English sentence "she wants to drink a cup of coffee", with numbered markers ➊ to ➏.]
Semantic Translation
• Abstract meaning representation [Knight et al., ongoing]
(w / want-01
   :agent (b / boy)
   :theme (l / love
      :agent (g / girl)
      :patient b))
• Generalizes over equivalent syntactic constructs
(e.g., active and passive)
• Defines semantic relationships
– semantic roles
– co-reference
– discourse relations
• In a very preliminary stage
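The graph above can be stored as a list of triples, which makes the co-reference visible: the variable b fills a role in both want-01 and love. This is only one possible encoding; the triple layout and the helper names are assumptions for illustration.

```python
# The graph above as (source, relation, target) triples.
AMR = [
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("l", "instance", "love"),
    ("g", "instance", "girl"),
    ("w", "agent", "b"),
    ("w", "theme", "l"),
    ("l", "agent", "g"),
    ("l", "patient", "b"),
]


def concept(triples, variable):
    """Look up the concept that a variable stands for."""
    for source, relation, target in triples:
        if source == variable and relation == "instance":
            return target
    raise KeyError(variable)


def roles_of(triples, variable):
    """Map each semantic role of a node to the concept that fills it."""
    return {
        relation: concept(triples, target)
        for source, relation, target in triples
        if source == variable and relation != "instance"
    }


def reentrant_variables(triples):
    """Variables that fill more than one role (co-reference)."""
    fills = {}
    for _source, relation, target in triples:
        if relation != "instance":
            fills[target] = fills.get(target, 0) + 1
    return sorted(v for v, n in fills.items() if n > 1)


print(roles_of(AMR, "l"))
print(reentrant_variables(AMR))
```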
Neural Model
[Figure: neural translation model with the layers Input Word Embeddings, Left-to-Right Recurrent NN, Right-to-Left Recurrent NN, Attention, Input Context, Hidden State, Output Word Predictions, Error (against the Given Output Words), and Output Word Embedding. Example sentences: "the house is big ." and "das Haus ist groß ,".]
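The Attention and Input Context layers can be sketched in a few lines: score every input position against the current decoder state, normalize the scores, and take the weighted average of the input states. The dot-product scoring and all vectors below are assumptions for illustration; the figure does not say which scoring function is used.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, input context vector).

    Scoring is a plain dot product, which is an assumption of this sketch.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical two-dimensional states for three input words.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 3.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

In the model pictured, each encoder state would combine the left-to-right and right-to-left recurrent states for one input word.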
what is it good for?
what is it good enough for?
Why Machine Translation?
Assimilation — reader initiates translation, wants to know content
• user is tolerant of inferior quality
• focus of majority of research (GALE program, etc.)
Communication — participants don’t speak same language, rely on translation
• users can ask questions, when something is unclear
• chat room translations, hand-held devices
• often combined with speech recognition, IWSLT campaign
Dissemination — publisher wants to make content available in other languages
• high demands for quality
• currently almost exclusively done by human translators
Problem: No Single Right Answer
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport’s security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport’s security is the responsibility of the Israeli security officials.
Quality
HTER  assessment
0%
      publishable
10%
      editable
20%
30%   gistable
40%   triagable
50%
(scale developed in preparation of DARPA GALE programme)
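HTER (human-targeted translation edit rate) is the number of edits a human needs to turn the machine output into an acceptable translation, divided by the length of that post-edited translation. The sketch below is a simplification: it counts only word insertions, deletions, and substitutions, whereas the full metric also treats a block shift as a single edit. The post-edited sentence is a hypothetical example.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(
                previous[j] + 1,         # delete a hypothesis word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def hter(mt_output, post_edited):
    """Edits divided by the number of words in the post-edited text."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited text must not be empty")
    return edit_distance(hyp, ref) / len(ref)


mt_output = "Israel is in charge of the security at this airport."
post_edited = "Israel is in charge of security at this airport."  # hypothetical
print(f"{hter(mt_output, post_edited):.1%}")
```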
Applications
HTER  assessment   application examples
0%                 Seamless bridging of language divide
      publishable  Automatic publication of official announcements
10%
      editable     Increased productivity of human translators
20%                Access to official publications
                   Multi-lingual communication (chat, social networks)
30%   gistable     Information gathering
                   Trend spotting
40%   triagable    Identifying relevant documents
50%
Current State of the Art
HTER  assessment   language pairs and domains
0%                 French-English restricted domain
      publishable  French-English technical document localization
10%                French-English news stories
      editable     German-English news stories
20%
30%   gistable     Swahili-English news stories
40%   triagable    Uyghur-English news stories
50%
(informal rough estimates by presenter)
Thank You
questions?