PA153
Vít Baisa

MACHINE TRANSLATION

We consider only technical / specialized texts: web pages, technical manuals, scientific documents and papers, leaflets and catalogues, legal texts and, in general, texts from specific domains. Nuances on the different language levels of literary texts are out of scope of current MT systems.

MACHINE TRANSLATION: ISSUES

In practice, MT output is always revised; we distinguish pre-editing and post-editing. MT systems make different types of errors than humans.

Mistakes characteristic of human translators:
- wrong prepositions (I am in school)
- missing determiners (I saw man)
- wrong tense (Viděl jsem: I was seeing), ...

Errors in meaning are characteristic of computers:
- Kiss me honey. → Polib mi med. (the endearment "honey" translated literally as the foodstuff)

Costa, Ângela, et al. "A linguistically motivated taxonomy for Machine Translation error analysis." Machine Translation 29.2 (2015): 127–161.

DIRECT METHODS FOR IMPROVING MT QUALITY

Limit the input to:
- a sublanguage (indicative sentences)
- a domain (informatics)
- a document type (patents)
Text pre-processing (e.g. manual syntactic analysis).

CLASSIFICATION BASED ON APPROACH

- rule-based, knowledge-based (RBMT, KBMT)
- transfer with interlingua
- statistical machine translation (SMT)
- hybrid machine translation (HMT, HyTran)
- neural networks

VAUQUOIS'S TRIANGLE

MOTIVATION IN THE 21ST CENTURY

- translation of web pages for gisting (getting the main message)
- methods for substantially speeding up human translation (translation memories)
- cross-language extraction of facts and search for information
- instant translation of e-communication
- translation on mobile devices

RULE-BASED MT

STATISTICAL MACHINE TRANSLATION

SMT SCHEME

PARALLEL CORPORA I

- the basic data source for SMT
- available sources: ~10–100 M words; size depends heavily on the language pair
- multilingual web pages (online newspapers)
- paragraph and sentence alignment needed

PARALLEL CORPORA II

- Europarl: 11 languages, 40 M words
- OPUS: parallel texts of various origin: open subtitles, UI localizations
- Acquis Communautaire: EU legal documents (20 languages)
- Hansards: 1.3 M pairs of text chunks from the official records of the Canadian Parliament
- EUR-Lex
- comparable corpora, ...

SENTENCE ALIGNMENT

Sentences in parallel corpora are not always in a 1:1 ratio. Tools: Church-Gale alignment, hunalign.

alignment    P
1:1          0.89
1:0, 0:1     0.0099
2:1, 1:2     0.089
2:2          0.011

SMT: NOISY CHANNEL PRINCIPLE

Claude Shannon (1948): self-correcting codes transmitted through noisy channels, based on information about the original data and the errors made in the channel. Used for MT, ASR, OCR. Optical character recognition is erroneous, but we can estimate what was damaged in a text (with a language model); typical errors: l ↔ 1 ↔ I, rn ↔ m, etc.

$e^* = \arg\max_e p(e|f) = \arg\max_e \frac{p(e)\,p(f|e)}{p(f)} = \arg\max_e p(e)\,p(f|e)$

SMT COMPONENTS I

Language model:
- how we get $p(e)$ for any string $e$
- the more $e$ looks like proper language, the higher $p(e)$ should be
- issue: what is $p(e)$ for an unseen $e$?
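A minimal sketch of how the two components combine under the noisy-channel decomposition above: the decoder picks the candidate $e$ maximizing $p(e)\,p(f|e)$. All sentences and probabilities below are toy values invented for illustration, not real model estimates.

```python
# Noisy-channel scoring sketch: choose the candidate translation e that
# maximizes p(e) * p(f|e). All entries below are made-up toy values.

# language model p(e): how much e looks like proper English
lm = {
    "kiss me honey": 0.004,
    "kiss me darling": 0.006,
}

# translation model p(f|e): how well e explains the source sentence f
tm = {
    ("polib mě, miláčku", "kiss me honey"): 0.3,
    ("polib mě, miláčku", "kiss me darling"): 0.4,
}

def best_translation(f, candidates):
    """argmax_e p(e) * p(f|e) over a finite candidate set."""
    return max(candidates, key=lambda e: lm.get(e, 0.0) * tm.get((f, e), 0.0))

print(best_translation("polib mě, miláčku", ["kiss me honey", "kiss me darling"]))
```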
SMT COMPONENTS II

Translation model:
- for $e$ and $f$, compute $p(f|e)$
- the more $f$ looks like a proper translation of $e$, the higher $p(f|e)$

SMT COMPONENTS III

Decoding algorithm:
- based on the TM and LM, find a sentence $e$ as the best translation of $f$
- as fast as possible and with as little memory as possible
- prune unpromising hypotheses, but do not lose any valid translations

LANGUAGE MODELS

WHAT IS IT GOOD FOR?

What is the probability of an utterance $s$?
- I go to home vs. I go home

What is the next, most probable word?
- Ke snídani jsem měl celozrnný ... ("For breakfast I had wholegrain ...")
- { chléb > pečivo > zákusek > mléko > babičku } (bread > pastry > dessert > milk > grandma)

CHOMSKY WAS WRONG

Colorless green ideas sleep furiously vs. Furiously sleep ideas green colorless.
A LM assigns a higher $p$ to the first! (Mikolov, 2012)

GENERATING RANDOM TEXT

To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have Every enter now severally so, let. (unigrams)

Sweet prince, Falstaff shall die. Harry of Monmouth's grave. This shall forbid it should be branded, if renown made it empty. (trigrams)

Can you guess the author of the original text?

CBLM

MAXIMUM LIKELIHOOD ESTIMATION

$p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\sum_w \mathrm{count}(w_1, w_2, w)}$

(the, green, *): 1,748× in EuroParl

w       count   p(w)
paper   801     0.458
group   640     0.367
light   110     0.063
party   27      0.015
ecu     21      0.012

LM QUALITY

We need to compare the quality of various LMs. Two approaches: extrinsic and intrinsic evaluation. A good LM should assign a higher probability to a good(-looking) text than to an incorrect text. For a fixed test text we can compare various LMs.

ENTROPY

Shannon, 1949
- the expected value (average) of the information contained in a message
- information viewed as the negative of the logarithm of the probability distribution
- events that always occur do not communicate information
- pure randomness has the highest entropy (uniform distribution: $\log_2 n$)

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$

PERPLEXITY

$PP = 2^{H(p_{LM})}$

$PP(W) = p(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$

A good LM should not waste $p$ on improbable phenomena. The lower the entropy, the better; the lower the perplexity, the better. → Maximizing the probability of the test text = minimizing perplexity.

WHAT INFLUENCES LM QUALITY?

- size of the training data
- order of the language model
- smoothing, interpolation, back-off

LARGE LM - N-GRAM COUNTS

How many unique n-grams are in a corpus?

order     types        singletons
unigram   86,700       33,447 (38.6%)
bigram    1,948,935    1,132,844 (58.1%)
trigram   8,092,798    6,022,286 (74.4%)
4-gram    15,303,847   13,081,621 (85.5%)
5-gram    19,882,175   18,324,577 (92.2%)

Taken from Europarl with 30 million tokens.

ZERO FREQUENCY, OOV, RARE WORDS

- probabilities must always be non-zero, otherwise perplexity cannot be measured
- maximum likelihood estimation is bad at this
- training data: work on Tuesday/Friday/Wednesday
- test data: work on Sunday → p(Sunday | work on) = 0
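The standard remedy is smoothing (listed above under what influences LM quality). A minimal sketch of add-one (Laplace) smoothing for a bigram model; the toy training data mirrors the work on Tuesday/Friday/Wednesday example:

```python
from collections import Counter

# Add-one (Laplace) smoothing sketch for a bigram LM: every count is
# incremented by 1, so an unseen bigram such as "work on sunday" gets
# a small non-zero probability. Toy training data, illustration only.
tokens = "work on tuesday work on friday work on wednesday".split()

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(unigrams)  # vocabulary size; handling OOV words would enlarge this

def p_add_one(w_prev, w):
    """p(w | w_prev) with add-one smoothing."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_add_one("on", "friday"))  # seen bigram
print(p_add_one("on", "sunday"))  # unseen bigram, still non-zero
```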
NEURAL NETWORK LANGUAGE MODELS

- neural networks are an old approach (1940s), only recently applied successfully to LM
- 2003: Bengio et al. (feed-forward NNLM)
- 2012: Mikolov (RNN)
- key concept: distributed representations of words (vs. the 1-of-V, one-hot representation)
- trending right now

RECURRENT NEURAL NETWORK

- Tomáš Mikolov (VUT)
- the hidden layer feeds back into itself
- shown to beat n-grams by a large margin

WORD EMBEDDINGS

- distributional semantics with vectors
- skip-gram, CBOW (continuous bag-of-words)

EMBEDDINGS IN MT

TRANSLATION MODELS

LEXICAL TRANSLATION

A standard lexicon does not contain information about how frequently the translations of the individual meanings of a word are used.

key → klíč, tónina, klávesa

How often is each translation actually used?

key → klíč (0.7), tónina (0.18), klávesa (0.11)

probability distribution $p_f$: $\sum_e p_f(e) = 1$, $\forall e : 0 \le p_f(e) \le 1$

EM ALGORITHM - INITIALIZATION

EM ALGORITHM - FINAL PHASE

IBM MODELS

IBM-1 does not take context into account and can neither add nor skip words. Each of the following models adds something to the previous one.

- IBM-1: lexical translation
- IBM-2: + absolute alignment model
- IBM-3: + fertility model
- IBM-4: + relative alignment model
- IBM-5: + further tuning

WORD ALIGNMENT MATRIX

WORD ALIGNMENT ISSUES

PHRASE-BASED TRANSLATION MODEL

Phrases are motivated statistically, not linguistically. German "am" is seldom translated with the single English "to"; cf. (fun (with (the game))).

ADVANTAGES OF PBTM

- translates n:m words; the word is not a suitable translation unit for many language pairs
- models learn to translate longer phrases
- simpler: no fertility, no NULL token, etc.

PHRASE EXTRACTION

EXTRACTED PHRASES

English          German
michael          michael
assumes          geht davon aus / geht davon aus ,
in the           im
house            haus
assumes that     geht davon aus , dass
that he          dass er / , dass er
in the house     im haus

PHRASE-BASED MODEL OF SMT

$e^* = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1) \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \ldots e_{i-1})$

DECODING

Given a language model $p_{LM}$ and a translation model $p(f|e)$, we need to find the translation with the highest probability among an exponential number of possible translations. Heuristic search methods are used; they are not guaranteed to find the best translation. Errors in translations are caused by 1) the decoding process, when the best translation is not found owing to the heuristics, or 2) the models, when the best translation according to the probability functions is not the best possible one.

EXAMPLE OF NOISE-INDUCED ERRORS (GOOGLE TRANSLATE)

Rinneadh clárúchán an úsáideora yxc a eiteach go rathúil.
→ The user registration yxc made a successful rejection.

Rinneadh clárúchán an úsáideora qqq a eiteach go rathúil.
→ Qqq made registration a user successfully refused.

Changing only the unknown user name (yxc → qqq) changes the structure of the entire output.

NEURAL NETWORK MACHINE TRANSLATION

- very close to the state of the art (PBSMT)
- a problem: variable-length input and output
- learning to translate and align at the same time
- a hot topic (2014, 2015): LISA

NN MODELS IN MT

SUMMARY VECTOR FOR SENTENCES

MT QUALITY EVALUATION

fluency, adequacy, intelligibility

AUTOMATIC TRANSLATION EVALUATION

- advantages: speed, cost
- disadvantages: do we really measure the quality of translation?
- gold standard: manually prepared reference translations
- a candidate $c$ is compared with reference translations $r_i$
- the paradox of automatic evaluation: the task corresponds to a situation where students assess their own exams: how do they know where they made a mistake?
- various approaches: n-grams shared between $c$ and $r_i$, edit distance, ... (see the sketch below)
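A minimal sketch of the shared-n-gram approach: count how many of the candidate's n-grams also occur in the reference, with clipping so a repeated candidate n-gram cannot be credited more times than it appears in the reference. The example sentences are assumed from Koehn's textbook illustration, which the precision/recall numbers on the next slide appear to follow.

```python
from collections import Counter

def ngrams(words, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def shared_ngrams(candidate, reference, n):
    """Clipped count of candidate n-grams that also occur in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    return sum(min(c, ref[g]) for g, c in cand.items())

reference = "israeli officials are responsible for airport security"
candidate = "israeli officials responsibility of airport safety"

print(shared_ngrams(candidate, reference, 1))  # 3 shared unigrams
print(shared_ngrams(candidate, reference, 2))  # 1 shared bigram
```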
RECALL AND PRECISION ON WORDS

precision = correct / output-length = 3/6 = 50%
recall = correct / reference-length = 3/7 = 43%
f-score = 2 × (precision × recall) / (precision + recall) = 2 × (.5 × .43) / (.5 + .43) = 46%

RECALL AND PRECISION: SHORTCOMINGS

metric      system A   system B
precision   50%        100%
recall      43%        100%
f-score     46%        100%

These metrics do not capture wrong word order.

BLEU

- the standard metric (2001), IBM, Papineni
- n-gram match between the reference and candidate translations
- precision is calculated for 1-, 2-, 3- and 4-grams, plus a brevity penalty

$\mathrm{BLEU} = \min\left(1, \frac{\text{output-length}}{\text{reference-length}}\right) \left(\prod_{i=1}^{4} \mathrm{precision}_i\right)^{\frac{1}{4}}$

BLEU: AN EXAMPLE

metric               system A   system B
precision (1-gram)   3/6        6/6
precision (2-gram)   1/5        4/5
precision (3-gram)   0/4        2/4
precision (4-gram)   0/3        1/3
brevity penalty      6/7        6/7
BLEU                 0%         52%

METEOR

- aligns hypotheses to one or more references
- exact, stem (morphology), synonym (WordNet) and paraphrase matches
- various scores, including WMT ranking and NIST adequacy
- extended support for English, Czech, German, French, Spanish, and Arabic
- high correlation with human judgments

EVALUATION OF EVALUATION METRICS

Correlation of automatic evaluation with manual evaluation.

EUROMATRIX

EUROMATRIX II
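To make the BLEU formula concrete, a minimal sketch that recomputes system B's score from the example table, with the precisions and brevity penalty exactly as on the slide:

```python
# BLEU sketch following the formula above: brevity penalty times the
# geometric mean of the 1- to 4-gram precisions. The numbers are
# system B's values from the example table.
precisions = [6/6, 4/5, 2/4, 1/3]  # 1- to 4-gram precision
output_len, reference_len = 6, 7

brevity_penalty = min(1.0, output_len / reference_len)

geo_mean = 1.0
for p in precisions:
    geo_mean *= p
geo_mean **= 1 / len(precisions)

bleu = brevity_penalty * geo_mean
print(f"BLEU = {bleu:.2f}")  # 0.52, matching the slide
```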