PLIN019 - Machine Translation
Statistical machine translation
Vít Baisa

Introduction

Introduction to SMT
► rule-based systems are motivated by linguistics
► SMT is inspired by information theory and statistics
► currently many companies focus on SMT: Google, IBM, Microsoft
► 50 million webpages are translated by SMT daily
► gisting: we don't always need an exact translation, sometimes the gist of a text is enough - the most frequent use of SMT on the Internet

SMT scheme
[Figure: Spanish/English bilingual text → statistical analysis → translation model p(s|e); English text → statistical analysis → language model p(e); decoding algorithm: argmax p(e) · p(s|e).]

Tools for SMT
► GIZA++: training of IBM models, word alignment (HMM)
► SRILM: language model training
► IRST: large language model training
► Moses: phrase-based decoder, model training
► Pharaoh: predecessor of Moses
► Thot: phrase model training
► SAMT: tree-based models

Data for SMT - (parallel) corpora
► Linguistic Data Consortium (LDC): parallel corpora for Arabic-English, Chinese-English, etc.; Gigaword corpus (English, 7 billion words)
► Europarl: collection of parliamentary proceedings of the European Parliament (11 languages, 40 M words)
► OPUS: parallel texts of various origin: subtitles, software localizations of user interfaces
► Acquis Communautaire: legal documents of the EU (20 languages)
► Hansards: 1.3 M pairs of text chunks from the official records of the Canadian Parliament

Regular events in SMT
Annual evaluations of SMT quality: test set preparation, manual evaluation campaigns, etc.
► NIST: National Institute of Standards and Technology; the oldest and most prestigious; evaluation of Arabic, Chinese
► IWSLT: International Workshop on Spoken Language Translation; speech translation, Asian languages
► WMT: Workshop on Statistical Machine Translation; mainly between European languages

Words
► for SMT (in the vast majority of systems) the basic unit = a word
► systems usually work with lowercased input; the original case can be restored in post-processing
► the word "the" makes up 7% of English texts
► the top 10 words (by frequency) make up 30% of all texts
► troublemakers: typos, numbers, proper names, loanwords

Sentences
► syntactic structure differs across languages
► insertion of function words typical for a given language (the, punctuation)
► rearranging: er wird mit uns gehen -> he will go with us
► some phenomena cannot be handled on the sentence level: anaphora
► document level: the theme (topic) can help with WSD
► we are not supposed to translate bat as pálka (Czech: a baseball/cricket bat) in a text about cave animals

Parallel corpora
► the basic data source for SMT
► freely available sources are about 10 to 100 M words large
► size depends heavily on the language pair
► multilingual webpages (online newspapers)
► a problem with paragraph and document alignment, ...
► comparable corpora: texts from the same domain which are not translations of each other: New York Times - Le Monde
► Kapradí - corpus of Shakespeare's plays by various translators (FI+FF)
► InterCorp - manually aligned fiction books (CNK, FF UK)

Sentence alignment
► sometimes sentences are not in a 1:1 ratio in corpora
► some languages do not explicitly delimit sentence boundaries (Thai)
► It is small, but cozy. - Es ist klein. Aber es ist gemütlich.

  alignment    p
  1:1          0.89
  1:0, 0:1     0.0099
  2:1, 1:2     0.089
  2:2          0.011
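The alignment-type probabilities above can drive a simple length-based sentence aligner in the spirit of the Gale-Church method. The following is a minimal sketch, not the algorithm used by any particular tool: the priors come from the table above, while the length model (a crude log-ratio penalty instead of the original normal-distribution model) is an assumption made here only for illustration.

```python
import math

# Prior probabilities of alignment types (from the table above).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def length_cost(src_len, tgt_len):
    """Crude stand-in for a length model: penalize character-length
    ratios far from 1 (an assumption for illustration only)."""
    if src_len == 0 or tgt_len == 0:
        return 3.0
    return abs(math.log(src_len / tgt_len)) * 5.0

def align(src, tgt):
    """Align two lists of sentences; returns a list of (src_chunk, tgt_chunk)."""
    INF = float("inf")
    # cost[i][j] = best cost of aligning src[:i] with tgt[:j]
    cost = [[INF] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    cost[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] - math.log(prior) + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Trace back the best path.
    pairs, i, j = [], len(src), len(tgt)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(pairs))

print(align(["It is small, but cozy."],
            ["Es ist klein.", "Aber es ist gemütlich."]))
```

On the example sentence pair above, the 1:2 alignment wins, matching the intended alignment.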
Probability and statistics basics for SMT

Zipf's law
[Figure: log-log plot of word frequency f against rank r.]
r = rank, f = word frequency, c = constant; r × f = c

Probability distribution
► a graph of probability values for the elementary outcomes of a random variable
► uniform: roll of a die, coin toss (discrete variable)
► binomial: repeated tosses (quincunx): b(k; n, p) = C(n, k) p^k (1 − p)^(n − k)
► normal (Gaussian): continuous variable

Statistics
► random variable, probability function, etc.
► we have data and want to find the distribution that describes the data best
► law of large numbers: the more data we have, the better we are able to estimate its probability distribution
► e.g.: rolls of a loaded die; calculation of π
► independent variables: ∀x, y : p(x, y) = p(x) · p(y)
► joint probability: roll of a die and a coin toss: p(6, heads)
► conditional probability: p(y|x) = p(x, y) / p(x); for independent variables: p(y|x) = p(y)

Conditional probability - Shannon's game
The probability distribution for the next letter in a text depends on the previous letters. Some letters carry more information than others.

Bayes' rule
p(x|y) = p(y|x) · p(x) / p(y)
► example with the die
► p(x) is the prior, p(x|y) the posterior, p(y|x) the likelihood

SMT - noisy channel principle
Designed by Claude Shannon (1948) for self-correcting codes: correcting coded signals transmitted through noisy channels, based on information about the original data (probabilities) and about the types of errors made in the channel. Example with OCR: optical character recognition is error-prone, but we can estimate what was damaged in a text (with a language model); typical errors: 1 ↔ l, rn ↔ m, etc.

e* = argmax_e p(e|f) = argmax_e p(e) · p(f|e) / p(f) = argmax_e p(e) · p(f|e)

SMT components
Language model:
► how do we get p(e) for any string e?
► the more e looks like proper language, the higher p(e)
► issue: what is p(e) for an unseen e?
Translation model:
► for e and f, compute p(f|e)
► the more e looks like a proper translation of f, the higher p(f|e)
Decoding algorithm:
► based on the TM and LM, find the sentence e that is the best translation of f, as fast as possible and with as little memory as possible

Language models
Noam Chomsky, 1969: "But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term."
Fred Jelinek, 1988, IBM: "Anytime a linguist leaves the group the recognition rate goes up."
What is the probability of the utterance s? Ke snídani jsem měl celozrnný ... ("For breakfast I had whole-grain ...")
chléb (bread) > pečivo (pastry) > zákusek (dessert) > mléko (milk) > babičku (grandma)

Language models - probability of a sentence
► we look for the probability of the following word
► a LM is a probability distribution over all possible word sequences

Probability of word sequences
p_LM(včera jsem jel do Brna) ("yesterday I went to Brno")
p_LM(včera jel do Brna jsem)
p_LM(jel jsem včera do Brna)
A good LM assigns higher probability to fluent word orders.
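A minimal sketch of how such word orders can be ranked, anticipating the n-gram models described below: a bigram model with maximum-likelihood estimates and the chain rule under the Markov assumption. The toy training sentences and counts are invented for illustration.

```python
from collections import Counter

# Toy training data (invented for illustration).
corpus = [
    "<s> včera jsem jel do Brna </s>",
    "<s> jel jsem do Brna </s>",
    "<s> včera jsem jel domů </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(w, prev):
    """MLE bigram probability p(w | prev); returns 0.0 for unseen events."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_lm(sentence):
    """Chain rule with the Markov assumption: p(w1..wn) ~ prod_i p(wi | wi-1)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        p *= p_bigram(w, prev)
    return p

for s in ["včera jsem jel do Brna", "včera jel do Brna jsem"]:
    print(f"p_LM({s}) = {p_lm(s):.6f}")
```

The fluent order gets a nonzero probability while the scrambled one drops to zero because of an unseen bigram, which is exactly the problem smoothing (below) addresses.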
Language models for fluency and WSD
► LMs help ensure fluent output (proper word order)
► LMs help with WSD in general cases
► if a word is polysemous, we can choose the most frequent translation (pen -> pero)
► in special cases this cannot be used, but
► LMs help with WSD using the context
► p_LM(I go home) > p_LM(I go house)

N-gram models
► the n-gram is a very useful concept in NLP
► it relies on statistical observation of the input data
► two applications in machine translation:
► what follows after I go? home is more frequent than house
► I go to home × I go home
► language (random) generation:
To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have Every enter now severally so, let. [1-grams]
Sweet prince, Falstaff shall die. Harry of Monmouth's grave. This shall forbid it should be branded, if renown made it empty. [3-grams]
Can you guess the author of the original text?

N-gram models - naive approach
W = w_1, w_2, ..., w_n
How do we compute p(W)? Occurrences of the whole sequence W in data, normalized by the data size. For the majority of W we will not find any occurrence in our data. The goal is to generalize the observed properties of the (training) data, which is usually sparse (sparse data).

p(chléb | ke snídani jsem měl celozrnný) = count(ke snídani jsem měl celozrnný chléb) / count(ke snídani jsem měl celozrnný)

Markov's chain, Markov's assumption
p(W), where W is a sequence of words; we model the probability word by word using the chain rule:

p(w_1, w_2, ..., w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_1, ..., w_{n-1})

The Markov assumption limits the history to the previous n − 1 words (e.g. for a trigram model p(w_i|w_{i−2}, w_{i−1})).

Smoothing
For n-grams unseen in the training data, maximum likelihood estimation gives p(w) = 0, but we need a nonzero probability for unseen data as well. It must hold: ∀w : p(w) > 0. The issue is more serious for higher-order models. Smoothing: an attempt to adjust the real counts of n-grams towards the counts expected in arbitrary other data (different corpora).

Add-one smoothing (Laplace)
Maximum likelihood estimation assigns p based on the count c:
p = c / n
Add-one smoothing uses c + 1:
p = (c + 1) / (n + v)
where n is the number of n-grams in the data and v is the number of all possible n-grams. That is quite inaccurate, since all permutations may outnumber the real (possible) n-grams by several orders of magnitude. Europarl has 139,000 unique words, so 19 billion possible bigrams. In fact it has only 53 million tokens, so at most 53 million bigrams. Why? This smoothing overvalues unseen n-grams.

Add-α smoothing
We do not add 1 but a smaller α, which can be estimated so that the smoothing is as fair and balanced as possible:
p = (c + α) / (n + αv)
α can be obtained experimentally: we try several different values and find the best one using perplexity. Usually it is very small (0.000X).

Deleted estimation
We can find unseen n-grams in another corpus. N-grams contained in one corpus but not in the other help us estimate the general amount of unseen n-grams. E.g. bigrams not occurring in the training corpus but present a million times in the other corpus (given that the number of all possible bigrams is 7.5 billion) will each occur approximately 10^6 / 7.5 × 10^9 ≈ 0.00013 times on average.

Good-Turing smoothing
We need to adjust the occurrence counts (frequencies) of n-grams in a corpus so that they correspond to their general occurrence in texts. We use the frequency of frequencies: N_r is the number of distinct n-grams which occur exactly r times. The frequency of hapax legomena (singletons in the data) is used to estimate the unseen data:

r* = (r + 1) · N_{r+1} / N_r

In particular, for n-grams not present in our corpus (r = 0) we have

r* = (0 + 1) · N_1 / N_0 = 0.00015

where N_1 = 1.1 × 10^6 and N_0 = 7.5 × 10^9 (Europarl).
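A small sketch of the Good-Turing adjustment using the Europarl frequencies of frequencies quoted above and in the table below; the function itself is a generic illustration, not the code of any particular toolkit.

```python
def good_turing(r, freq_of_freq):
    """Adjusted count r* = (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]

# Frequencies of frequencies for Europarl bigrams (values from the text):
# N_0 = number of possible but unseen bigrams, N_1 = singletons, N_2 = doubletons.
N = {0: 7_514_941_065, 1: 1_132_844, 2: 263_611}

print(good_turing(0, N))  # ~0.00015: expected count of an unseen bigram
print(good_turing(1, N))  # ~0.465: adjusted count of a singleton bigram
```

The two printed values reproduce the first rows of the table that follows.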
Example of Good-Turing smoothing (Europarl)

  r     N_r              r*
  0     7,514,941,065    0.00015
  1     1,132,844        0.46539
  2     263,611          1.40679
  3     123,615          2.38767
  4     73,788           3.33753
  5     49,254           4.36967
  6     35,869           5.32929
  8     21,693           7.43798
  10    14,880           9.31304
  20    4,546            19.54487

Comparing smoothing methods (Europarl)

  method          perplexity
  add-one         382.2
  add-α           113.2
  deleted est.    113.4
  Good-Turing     112.9

Interpolation and back-off
The previous methods treat all unseen n-grams the same. Consider the trigrams
nádherná červená řepa
nádherná červená mrkev
(nádherná = wonderful, červená = red, řepa = beet, mrkev = carrot)
Although neither occurs in our training data, the former trigram should probably be more probable. We can use the probabilities of lower-order models, for which we do have the necessary data:
červená řepa
červená mrkev
nádherná červená

Interpolation
p_I(w_3|w_1 w_2) = λ_1 p(w_3) + λ_2 p(w_3|w_2) + λ_3 p(w_3|w_1 w_2)
If we have enough data, we can trust the higher-order models more and assign a higher weight to the corresponding n-grams. p_I is a probability distribution, so the following must hold:
∀λ_n : 0 ≤ λ_n ≤ 1 and Σ_n λ_n = 1

Large LM - n-gram counts
How many unique n-grams are there in a corpus?

  order     unique        singletons
  unigram   86,700        33,447 (38.6%)
  bigram    1,948,935     1,132,844 (58.1%)
  trigram   8,092,798     6,022,286 (74.4%)
  4-gram    15,303,847    13,081,621 (85.5%)
  5-gram    19,882,175    18,324,577 (92.2%)

Taken from Europarl with 30 million tokens.

Lexical translation
A standard lexicon does not contain information about the frequency of translations of the individual meanings of a word.
key = klíč, tónina, klávesa
How often are the individual translations actually used?
key -> klíč (0.7), tónina (0.18), klávesa (0.12)
We need a lexical probability distribution p_f with the properties:
Σ_e p_f(e) = 1 and ∀e : 0 ≤ p_f(e) ≤ 1
p_klíč(key) ? p_mrkev(carrot)

Word alignment, alignment function
Translations differ in the number of words and in word order. SMT uses an alignment function a : j -> i, where j is the position of a word in the target sentence (Czech) and i is a position in the source sentence (English).

  i: 1    2      3   4      5
     the  castle is  very   old
     ten  hrad   je  velmi  starý
  j: 1    2      3   4      5

a is a function, therefore for each word w_e from the target sentence there is exactly one word w_f from the source sentence.

Word alignment - more examples
► different word order:
it was written here
bylo to zde napsané
a: 1 -> 2, 2 -> 1, 3 -> 4, 4 -> 3
► different number of words:
jsem maličký
i am very small
a: 1 -> 1, 2 -> 1, 3 -> 2, 4 -> 2
► no translational equivalent:
have you got it ?
máš to ?
a: 1 -> 1, 2 -> 4, 3 -> 5
► the opposite case; we add the token NULL at position 0:
NULL laugh
smát se
a: 1 -> 1, 2 -> 0

IBM model 1
p_f for whole sentences is not available. We split the translation into several steps and use p_f for words. This approach is called generative modeling. The translation model IBM-1 is defined as follows:

p(e, a|f) = ε / (l_f + 1)^{l_e} · Π_{j=1}^{l_e} t(e_j | f_{a(j)})

where e = (e_1, ..., e_{l_e}) is the target sentence, f = (f_1, ..., f_{l_f}) the source sentence, l_e the target sentence length, l_f the source sentence length, and ε a normalizing constant which makes the resulting product a proper probability distribution. (l_f + 1)^{l_e} is the number of all possible alignments between e and f (l_f is increased by 1 due to NULL); t is the lexical translation probability function.

Translation probability computation
To compute p(e, a|f) we need to know the value of t for all words. The basic resource for SMT will be used: a parallel corpus with aligned sentences. Unfortunately we do not have word-level alignment (CEMAT). That is the task of word alignment, and it is time for the expectation-maximization (EM) algorithm.
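To make the IBM-1 formula concrete, the following sketch evaluates p(e, a|f) for a single sentence pair, given a hand-set (hypothetical) t-table; ε is set to 1 for simplicity. In practice t is not given but estimated with EM, as described next.

```python
# Hypothetical lexical translation probabilities t(e | f) (hand-set, not trained).
t = {
    ("the", "das"): 0.7, ("house", "haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
}

def ibm1_p_e_a_f(e, f, a, epsilon=1.0):
    """p(e, a | f) = epsilon / (lf + 1)^le * prod_j t(e_j | f_a(j)).
    f[0] is the NULL token; a[j] gives the source position for target word j."""
    lf = len(f) - 1          # exclude NULL from the source sentence length
    le = len(e)
    p = epsilon / (lf + 1) ** le
    for j, ej in enumerate(e):
        p *= t.get((ej, f[a[j]]), 0.0)
    return p

f = ["NULL", "das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
a = [1, 2, 3, 4]             # each target word aligned to the source word at the same position
print(ibm1_p_e_a_f(e, f, a))  # 1/(5**4) * 0.7*0.8*0.8*0.4 ≈ 0.000287
```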
EM algorithm
1. Initialize the model (usually with a uniform distribution).
2. Apply the model to the data (expectation step): we compute
p(a|e, f) = p(e, a|f) / p(e|f), where p(e|f) = Σ_a p(e, a|f)
3. Adjust the model according to the data (maximization step): we collect expected counts of a target word w_e being translated from a source word w_f (function c), weighted by the alignment probabilities from the previous step:
c(w_e|w_f; e, f) = Σ_a p(a|e, f) Σ_{j=1}^{l_e} δ(w_e, e_j) δ(w_f, f_{a(j)})
where δ(x, y) = 1 if x = y, else 0.
4. Repeat the E and M steps until the model stops improving.
(A code sketch of this procedure for IBM Model 1 follows after the IBM model overview below.)

Translation probability from the EM algorithm
The resulting translation probability is computed from c:

t(w_e|w_f) = Σ_{(e,f)} c(w_e|w_f; e, f) / Σ_{w_e} Σ_{(e,f)} c(w_e|w_f; e, f)

EM algorithm - initialization, EM algorithm - final phase
[Figure: word alignments of ... la maison ..., ... la maison bleu ..., ... la fleur ... with ... the house ..., ... the blue house ..., ... the flower ...; converged values p(la|the) = 0.453, p(le|the) = 0.334, p(maison|house) = 0.876, p(bleu|blue) = 0.563.]

IBM models
IBM model 1 is very simple. It does not take context into account and cannot insert or drop words. All alignments have the same probability. Each of the following models adds something to the previous one.
► IBM-1: lexical translation
► IBM-2: + absolute alignment model
► IBM-3: + fertility model
► IBM-4: + relative alignment model
► IBM-5: + treats shortcomings of the previous models

IBM-2
In IBM-1, all translations differing only in word order have the same probability. IBM-2 adds an explicit model of alignment, the alignment probability distribution:
a(i|j, l_e, l_f)
where i is the source word position and j the target word position.

IBM-2 - two translation steps
Translation is split into two steps. In the first, lexical units are translated; in the second, words are rearranged according to the alignment model:
klein ist das Haus -> (lexical translation step) -> small is the house -> (alignment step) -> the house is small

IBM-2
The first step is the same as in IBM-1; t(e|f) is used. The function a, with probability distribution a, points in the opposite direction to the translation direction. Both distributions are combined into the formula for IBM-2:

p(e, a|f) = ε Π_{j=1}^{l_e} t(e_j | f_{a(j)}) a(a(j) | j, l_e, l_f)

p(e|f) = Σ_a p(e, a|f) = ε Π_{j=1}^{l_e} Σ_{i=0}^{l_f} t(e_j | f_i) a(i | j, l_e, l_f)

IBM model 3
Models 1 and 2 do not handle situations where one word is translated to two or more words, or is not translated at all. IBM-3 solves this with fertility, which is modelled with a probability distribution n(φ|f): for each source word f, the distribution n expresses how many words f is usually translated to.
n(0|ja) = 0.999
n(1|king) = 0.997
n(2|zum) = 0.25

NULL token insertion
If we want to translate properly into a target language which uses words without translational equivalents in the source, we have to handle insertion of the NULL token. n(x|NULL) is not used, since NULL insertion depends on the sentence length. Instead we add another step, NULL insertion, to the translation process. We use p_1 and p_0 = 1 − p_1, where p_1 stands for the probability of inserting a NULL token after any word in the sentence.

[Figure: translation of "ich gehe ja nicht zum haus" in four steps: fertility (ich gehe nicht zum zum haus), NULL insertion (ich NULL gehe nicht zum zum haus), lexical translation (I do go not to the house), distortion (I do not go to the house).]

IBM-3 - distortion
The last step is almost the same as the second step of IBM-2 and is modelled with the distortion probability distribution
d(j|i, l_e, l_f)
which models positions in the opposite direction: for each source word at position i it models the position j of the target word. The translation process from the previous figure may differ slightly (see the next figure).

[Figure: the same four steps for "ich gehe ja nicht zum haus", but the lexical translation step yields "I do go not the to house", which the distortion step reorders to "I do not go to the house".]
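Returning to the EM procedure for IBM Model 1 sketched above, here is a minimal illustration on a toy corpus. It uses the standard simplification for IBM-1, where the sum over alignments factorizes per target word, and it omits the NULL token for brevity; the sentence pairs are invented for illustration.

```python
from collections import defaultdict

# Toy parallel corpus of (source f, target e) sentence pairs; NULL omitted for brevity.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# 1. Initialization: uniform t(e | f).
e_vocab = {e for _, es in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for iteration in range(10):
    count = defaultdict(float)   # expected counts c(e | f)
    total = defaultdict(float)   # normalizers per source word f
    # 2. Expectation step: distribute each target word's count over source words.
    for fs, es in corpus:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)   # equivalent to summing over alignments
            for f in fs:
                frac = t[(e, f)] / norm
                count[(e, f)] += frac
                total[f] += frac
    # 3. Maximization step: re-estimate t from the expected counts.
    t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})

# After a few iterations, das->the, haus->house, buch->book, ein->a dominate.
for pair in [("the", "das"), ("house", "haus"), ("book", "buch")]:
    print(pair, round(t[pair], 3))
```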
IBM-4, IBM-5
IBM-4: The problem with distortion is sparse data for long sentences. IBM-4 introduces relative distortion, where the rearrangement of word positions depends on particular previous words. It comes from the assumption that we translate in phrases which are moved together, and that some rearrangements are more frequent than others (English: ADJ NOUN, French: NOUN ADJ, etc.).
IBM-5: This model fixes other shortcomings of the previous models. E.g. it takes care of situations where two different source words could end up in the same target position.

Word alignment matrix
[Figure: word alignment matrix.]

Word alignment issues
[Figure: examples of problematic word alignments.]

Phrase-based translation model
The state of the art in SMT. Not only single words but whole phrases (n-grams, word sequences) can be translated in one step.
natuerlich -> of course
spass am spiel -> fun with the game
Phrases are not linguistically motivated, only statistically. German am on its own is seldom translated as the English with; the statistically significant context spass am helps towards a better translation. A linguistic phrase structure would be different: (fun (with (the game))).

Advantages of PBTM
► we often translate n : m words; a single word is not the ideal unit of translation
► translating word sequences helps resolve some ambiguities
► models can learn to translate longer and longer phrases
► a simpler model: no fertility, no NULL token, etc.

Phrase-based model - formula
The translation probability p(f|e) is split into phrases:

p(f_1^I | e_1^I) = Π_{i=1}^{I} φ(f_i | e_i) d(start_i − end_{i−1} − 1)

The sentence f is split into I phrases f_i; all segmentations have the same probability. The function φ is the translation probability for phrases. The function d is the distance-based reordering model; reordering is modelled relative to the previous phrase: start_i is the position of the first word of the phrase in f which is translated to the i-th phrase of e, and end_{i−1} the position of the last word of the phrase translated to the (i−1)-th phrase.

Distance-based reordering model
Minimal reordering is preferred: the bigger the reordering (measured on the source side), the less preferred the operation is.
[Figure: example reordering distances d = 0, d = −3, d = 1, d = 2 over source positions 1-7.]

Building the phrase matrix
We take the word alignments from the EM algorithm and then look for consistent phrases. Phrases f and e are consistent with an alignment A if all words f_1, ..., f_n in phrase f which have an alignment in A are aligned with words e_1, ..., e_n in phrase e, and vice versa.
[Figure: three example phrase pairs: consistent, inconsistent, consistent.]

Phrase extraction
[Figure: word alignment matrix for "michael geht davon aus , dass er im haus bleibt" and "michael assumes that he will stay in the house".]
Extracted phrases:
michael - michael
assumes - geht davon aus / geht davon aus ,
that - dass / , dass
he - er
will stay - bleibt
in the - im
house - haus
michael assumes - michael geht davon aus / michael geht davon aus ,
assumes that - geht davon aus , dass
assumes that he - geht davon aus , dass er
that he - dass er / , dass er
in the house - im haus
michael assumes that - michael geht davon aus , dass

Phrase translation probability estimation

φ(f_i | e_i) = count(e_i, f_i) / Σ_{f_j} count(e_i, f_j)

Phrase-based model of SMT

e* = argmax_e Π_{i=1}^{I} φ(f_i | e_i) d(start_i − end_{i−1} − 1) · Π_{i=1}^{|e|} p_LM(e_i | e_1, ..., e_{i−1})

Decoding
Given a language model p_LM and a translation model p(f|e), we need to find the translation with the highest probability out of an exponential number of possible translations. Heuristic search methods are used, so it is not guaranteed that we find the best translation.
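Before turning to the search itself, the following sketch scores a single translation hypothesis under the phrase-based model above: a product (here a sum of logs) of phrase translation probabilities, the distance-based reordering penalty d, and a language model score. The phrase table, the bigram LM values and the exponential form d(x) = α^|x| of the reordering penalty are all assumptions for illustration; the slides only say that d is distance-based.

```python
import math

# Toy phrase table phi(f_phrase | e_phrase) and bigram LM (invented values).
phrase_table = {
    ("he", "er"): 0.8, ("does not", "ja nicht"): 0.4,
    ("go", "geht"): 0.3, ("home", "nach hause"): 0.7,
}
lm_bigram = {("<s>", "he"): 0.3, ("he", "does"): 0.2, ("does", "not"): 0.6,
             ("not", "go"): 0.3, ("go", "home"): 0.4, ("home", "</s>"): 0.5}

def d(x, alpha=0.6):
    """Assumed distance-based reordering penalty: alpha^|x|."""
    return alpha ** abs(x)

def score(hypothesis):
    """Score one hypothesis: a list of (e_phrase, f_phrase, f_start, f_end)."""
    logp = 0.0
    prev_end = 0
    words = ["<s>"]
    for e_phrase, f_phrase, start, end in hypothesis:
        logp += math.log(phrase_table[(e_phrase, f_phrase)])   # translation model
        logp += math.log(d(start - prev_end - 1))               # reordering model
        prev_end = end
        words += e_phrase.split()
    words.append("</s>")
    for prev, w in zip(words, words[1:]):                       # language model
        logp += math.log(lm_bigram.get((prev, w), 1e-4))
    return logp

# "er geht ja nicht nach hause" -> "he does not go home"
# source positions: er=1 geht=2 ja=3 nicht=4 nach=5 hause=6
hyp = [("he", "er", 1, 1), ("does not", "ja nicht", 3, 4),
       ("go", "geht", 2, 2), ("home", "nach hause", 5, 6)]
print(f"log score = {score(hyp):.2f}")
```

The decoder's job, discussed next, is to find the hypothesis maximizing this score without enumerating all of them.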
Errors in translations are caused either by 1) the decoding process, where the best translation is not found owing to the heuristics, or 2) the models, where the translation that is best according to the probability functions is not actually the best translation.

Phrase-wise sentence translation
[Figure: phrase-by-phrase translation of "er geht ja nicht nach hause", building the target sentence left to right: er -> he, ..., nach hause -> home.]
In each step of the translation we compute preliminary probability scores from the translation, reordering and language models.

Search space of translation hypotheses
[Figure: lattice of translation options for the words and spans of "er geht ja nicht nach hause", including he, it is, he will be, is, are, goes, go, yes, of course, not, do not, does not, is not, after, to, according to, in, house, home, chamber, at home, return home, ...]
We are dealing with an exponential space of all possible translations. We need to limit this space using various methods.

Hypothesis construction, beam search
[Figure: incremental construction of translation hypotheses for "er geht ja nicht nach hause".]

Beam search
Beam search is based on breadth-first search. At each level of the search tree it generates all children of the nodes on that level and sorts them according to various heuristics. It stores only a limited number of the best states at each level (the beam width); only these states are investigated further. The wider the beam, the fewer children (descendants) are thrown away (the fewer branches are pruned). With an unlimited width it becomes the plain breadth-first search algorithm. The width determines memory consumption. The best final state might never be found, since it can be pruned somewhere in the middle of the process.
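A minimal sketch of beam search as described above: hypotheses are expanded level by level, sorted by score, and only the best beam_width states survive at each level. The scoring function and the expansion step here are toy placeholders; in a real decoder they would come from the translation, reordering and language models.

```python
import heapq

def beam_search(initial, expand, score, is_final, beam_width=5, max_steps=20):
    """Generic beam search: keep only the best `beam_width` states per level.
    `expand(state)` returns successor states, `score(state)` a value to maximize,
    `is_final(state)` marks complete hypotheses."""
    beam = [initial]
    best_final = None
    for _ in range(max_steps):
        candidates = []
        for state in beam:
            for nxt in expand(state):
                if is_final(nxt):
                    if best_final is None or score(nxt) > score(best_final):
                        best_final = nxt
                else:
                    candidates.append(nxt)
        if not candidates:
            break
        # Pruning: keep only the beam_width highest-scoring states.
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return best_final

# Toy usage: "translate" source positions left to right by picking one option per
# position (options and log-scores are invented for illustration).
options = {0: [("he", -0.2), ("it", -0.9)],
           1: [("goes", -0.5), ("does not go", -0.4)],
           2: [("home", -0.3), ("to the house", -0.8)]}

def expand(state):
    words, logp, pos = state
    return [(words + [w], logp + lp, pos + 1) for w, lp in options.get(pos, [])]

result = beam_search(initial=([], 0.0, 0),
                     expand=expand,
                     score=lambda s: s[1],
                     is_final=lambda s: s[2] == len(options),
                     beam_width=2)
print(" ".join(result[0]), result[1])
```

With a narrow beam the search is fast but, exactly as noted above, a hypothesis that would have become the best final state can be pruned along the way.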