PLIN019 - Machine Translation
Statistical machine translation
Vít Baisa

Introduction

Introduction to SMT
► rule-based systems are motivated by linguistics
► SMT is inspired by information theory and statistics
► currently many companies focus on SMT: Google, IBM, Microsoft
► 50 million webpages are translated by SMT daily
► gisting: we don't always need an exact translation, sometimes the gist of a text is enough - the most frequent use of SMT on the Internet

SMT scheme
[Figure: Spanish/English bilingual text → statistical analysis → translation model p(s|e); English text → statistical analysis → language model p(e); decoding algorithm: argmax p(e) · p(s|e).]

Tools for SMT
► GIZA++: training of IBM models, word alignment (HMM)
► SRILM: language model training
► IRST: large language model training
► Moses: phrase-based decoder, model training
► Pharaoh: predecessor of Moses
► Thot: phrase model training
► SAMT: tree-based models

Data for SMT - (parallel) corpora
► Linguistic Data Consortium (LDC): parallel corpora for Arabic-English, Chinese-English, etc.; Gigaword corpus (English, 7 billion words)
► Europarl: collection of parliamentary proceedings of the European Parliament (11 languages, 40 M words)
► OPUS: parallel texts of various origin: subtitles, software localizations of user interfaces
► Acquis Communautaire: legal documents of the EU (20 languages)
► Hansards: 1.3 M pairs of text chunks from the official records of the Canadian Parliament

Regular events in SMT
Annual evaluations of SMT quality: test set preparation, manual evaluation campaigns, etc.
► NIST: National Institute of Standards and Technology; the oldest and most prestigious; evaluation of Arabic, Chinese
► IWSLT: International Workshop on Spoken Language Translation; speech translation, Asian languages
► WMT: Workshop on Statistical Machine Translation; mainly between European languages

Words
► for SMT (in the vast majority of systems) the basic unit = a word
► systems usually work with lowercased input; the original case can be restored in post-processing
► the word "the" makes up 7% of English texts
► the top 10 words (by frequency) make up 30% of all texts
► troublemakers: typos, numbers, proper names, loanwords

Sentences
► syntactic structure differs across languages
► insertion of function words typical for a given language (the, punctuation)
► rearranging: er wird mit uns gehen -> he will go with us
► some phenomena cannot be handled on the sentence level: anaphora
► document level: the theme (topic) can help with WSD
► we are not supposed to translate bat as pálka (Czech: a baseball/cricket bat) in a text about cave animals

Parallel corpora
► the basic data source for SMT
► freely available sources are about 10 to 100 M words large
► size depends heavily on the language pair
► multilingual webpages (online newspapers)
► a problem with paragraph and document alignment, ...
► comparable corpora: texts from the same domain which are not translations of each other: New York Times - Le Monde
► Kapradí - corpus of Shakespeare's plays by various translators (FI+FF)
► InterCorp - manually aligned fiction books (CNK, FF UK)

Sentence alignment
► sometimes sentences are not in a 1:1 ratio in corpora
► some languages do not explicitly delimit sentence boundaries (Thai)
► It is small, but cozy. - Es ist klein. Aber es ist gemütlich.

  alignment    p
  1:1          0.89
  1:0, 0:1     0.0099
  2:1, 1:2     0.089
  2:2          0.011
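The alignment-type probabilities above can drive a simple length-based sentence aligner in the spirit of the Gale-Church method. The following is a minimal sketch, not the algorithm used by any particular tool: the priors come from the table above, while the length model (a crude log-ratio penalty instead of the original normal-distribution model) is an assumption made here only for illustration.

```python
import math

# Prior probabilities of alignment types (from the table above).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def length_cost(src_len, tgt_len):
    """Crude stand-in for a length model: penalize character-length
    ratios far from 1 (an assumption for illustration only)."""
    if src_len == 0 or tgt_len == 0:
        return 3.0
    return abs(math.log(src_len / tgt_len)) * 5.0

def align(src, tgt):
    """Align two lists of sentences; returns a list of (src_chunk, tgt_chunk)."""
    INF = float("inf")
    # cost[i][j] = best cost of aligning src[:i] with tgt[:j]
    cost = [[INF] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    cost[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] - math.log(prior) + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Trace back the best path.
    pairs, i, j = [], len(src), len(tgt)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(pairs))

print(align(["It is small, but cozy."],
            ["Es ist klein.", "Aber es ist gemütlich."]))
```

On the example sentence pair above, the 1:2 alignment wins, matching the intended alignment.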
Probability and statistics basics for SMT

Zipf's law
[Figure: log-log plot of word frequency f against rank r.]
r = rank, f = word frequency, c = constant; r × f = c

Probability distribution
► a graph of probability values for the elementary outcomes of a random variable
► uniform: roll of a die, coin toss (discrete variable)
► binomial: repeated tosses (quincunx): b(k; n, p) = C(n, k) p^k (1 − p)^(n − k)
► normal (Gaussian): continuous variable

Statistics
► random variable, probability function, etc.
► we have data and want to find the distribution that describes the data best
► law of large numbers: the more data we have, the better we are able to estimate its probability distribution
► e.g.: rolls of a loaded die; calculation of π
► independent variables: ∀x, y : p(x, y) = p(x) · p(y)
► joint probability: roll of a die and a coin toss: p(6, heads)
► conditional probability: p(y|x) = p(x, y) / p(x); for independent variables: p(y|x) = p(y)

Conditional probability - Shannon's game
The probability distribution for the next letter in a text depends on the previous letters. Some letters carry more information than others.

Bayes' rule
p(x|y) = p(y|x) · p(x) / p(y)
► example with the die
► p(x) is the prior, p(x|y) the posterior, p(y|x) the likelihood

SMT - noisy channel principle
Designed by Claude Shannon (1948) for self-correcting codes: correcting coded signals transmitted through noisy channels, based on information about the original data (probabilities) and about the types of errors made in the channel. Example with OCR: optical character recognition is error-prone, but we can estimate what was damaged in a text (with a language model); typical errors: 1 ↔ l, rn ↔ m, etc.

e* = argmax_e p(e|f) = argmax_e p(e) · p(f|e) / p(f) = argmax_e p(e) · p(f|e)

SMT components
Language model:
► how do we get p(e) for any string e?
► the more e looks like proper language, the higher p(e)
► issue: what is p(e) for an unseen e?
Translation model:
► for e and f, compute p(f|e)
► the more e looks like a proper translation of f, the higher p(f|e)
Decoding algorithm:
► based on the TM and LM, find the sentence e that is the best translation of f, as fast as possible and with as little memory as possible

Language models
Noam Chomsky, 1969: "But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term."
Fred Jelinek, 1988, IBM: "Anytime a linguist leaves the group the recognition rate goes up."
What is the probability of the utterance s? Ke snídani jsem měl celozrnný ... ("For breakfast I had whole-grain ...")
chléb (bread) > pečivo (pastry) > zákusek (dessert) > mléko (milk) > babičku (grandma)

Language models - probability of a sentence
► we look for the probability of the following word
► a LM is a probability distribution over all possible word sequences

Probability of word sequences
p_LM(včera jsem jel do Brna) ("yesterday I went to Brno")
p_LM(včera jel do Brna jsem)
p_LM(jel jsem včera do Brna)
A good LM assigns higher probability to fluent word orders.
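A minimal sketch of how such word orders can be ranked, anticipating the n-gram models described below: a bigram model with maximum-likelihood estimates and the chain rule under the Markov assumption. The toy training sentences and counts are invented for illustration.

```python
from collections import Counter

# Toy training data (invented for illustration).
corpus = [
    "<s> včera jsem jel do Brna </s>",
    "<s> jel jsem do Brna </s>",
    "<s> včera jsem jel domů </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(w, prev):
    """MLE bigram probability p(w | prev); returns 0.0 for unseen events."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_lm(sentence):
    """Chain rule with the Markov assumption: p(w1..wn) ~ prod_i p(wi | wi-1)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        p *= p_bigram(w, prev)
    return p

for s in ["včera jsem jel do Brna", "včera jel do Brna jsem"]:
    print(f"p_LM({s}) = {p_lm(s):.6f}")
```

The fluent order gets a nonzero probability while the scrambled one drops to zero because of an unseen bigram, which is exactly the problem smoothing (below) addresses.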
Language models for fluency and WSD
► LMs help ensure fluent output (proper word order)
► LMs help with WSD in general cases
► if a word is polysemous, we can choose the most frequent translation (pen -> pero)
► in special cases this cannot be used, but
► LMs help with WSD using the context
► p_LM(I go home) > p_LM(I go house)

N-gram models
► the n-gram is a very useful concept in NLP
► it relies on statistical observation of the input data
► two applications in machine translation:
► what follows after I go? home is more frequent than house
► I go to home × I go home
► language (random) generation:
To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have Every enter now severally so, let. [1-grams]
Sweet prince, Falstaff shall die. Harry of Monmouth's grave. This shall forbid it should be branded, if renown made it empty. [3-grams]
Can you guess the author of the original text?

N-gram models - naive approach
W = w_1, w_2, ..., w_n
How do we compute p(W)? Occurrences of the whole sequence W in data, normalized by the data size. For the majority of W we will not find any occurrence in our data. The goal is to generalize the observed properties of the (training) data, which is usually sparse (sparse data).

p(chléb | ke snídani jsem měl celozrnný) = count(ke snídani jsem měl celozrnný chléb) / count(ke snídani jsem měl celozrnný)

Markov's chain, Markov's assumption
p(W), where W is a sequence of words; we model the probability word by word using the chain rule:

p(w_1, w_2, ..., w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_1, ..., w_{n-1})

The Markov assumption limits the history to the previous n − 1 words (e.g. for a trigram model p(w_i|w_{i−2}, w_{i−1})).

Smoothing
For n-grams unseen in the training data, maximum likelihood estimation gives p(w) = 0, but we need a nonzero probability for unseen data as well. It must hold: ∀w : p(w) > 0. The issue is more serious for higher-order models. Smoothing: an attempt to adjust the real counts of n-grams towards the counts expected in arbitrary other data (different corpora).

Add-one smoothing (Laplace)
Maximum likelihood estimation assigns p based on the count c:
p = c / n
Add-one smoothing uses c + 1:
p = (c + 1) / (n + v)
where n is the number of n-grams in the data and v is the number of all possible n-grams. That is quite inaccurate, since all permutations may outnumber the real (possible) n-grams by several orders of magnitude. Europarl has 139,000 unique words, so 19 billion possible bigrams. In fact it has only 53 million tokens, so at most 53 million bigrams. Why? This smoothing overvalues unseen n-grams.

Add-α smoothing
We do not add 1 but a smaller α, which can be estimated so that the smoothing is as fair and balanced as possible:
p = (c + α) / (n + αv)
α can be obtained experimentally: we try several different values and find the best one using perplexity. Usually it is very small (0.000X).

Deleted estimation
We can find unseen n-grams in another corpus. N-grams contained in one corpus but not in the other help us estimate the general amount of unseen n-grams. E.g. bigrams not occurring in the training corpus but present a million times in the other corpus (given that the number of all possible bigrams is 7.5 billion) will each occur approximately 10^6 / 7.5 × 10^9 ≈ 0.00013 times on average.

Good-Turing smoothing
We need to adjust the occurrence counts (frequencies) of n-grams in a corpus so that they correspond to their general occurrence in texts. We use the frequency of frequencies: N_r is the number of distinct n-grams which occur exactly r times. The frequency of hapax legomena (singletons in the data) is used to estimate the unseen data:

r* = (r + 1) · N_{r+1} / N_r

In particular, for n-grams not present in our corpus (r = 0) we have

r* = (0 + 1) · N_1 / N_0 = 0.00015

where N_1 = 1.1 × 10^6 and N_0 = 7.5 × 10^9 (Europarl).
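A small sketch of the Good-Turing adjustment using the Europarl frequencies of frequencies quoted above and in the table below; the function itself is a generic illustration, not the code of any particular toolkit.

```python
def good_turing(r, freq_of_freq):
    """Adjusted count r* = (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]

# Frequencies of frequencies for Europarl bigrams (values from the text):
# N_0 = number of possible but unseen bigrams, N_1 = singletons, N_2 = doubletons.
N = {0: 7_514_941_065, 1: 1_132_844, 2: 263_611}

print(good_turing(0, N))  # ~0.00015: expected count of an unseen bigram
print(good_turing(1, N))  # ~0.465: adjusted count of a singleton bigram
```

The two printed values reproduce the first rows of the table that follows.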
Example of Good-Turing smoothing (Europarl)

  r     N_r              r*
  0     7,514,941,065    0.00015
  1     1,132,844        0.46539
  2     263,611          1.40679
  3     123,615          2.38767
  4     73,788           3.33753
  5     49,254           4.36967
  6     35,869           5.32929
  8     21,693           7.43798
  10    14,880           9.31304
  20    4,546            19.54487

Comparing smoothing methods (Europarl)

  method          perplexity
  add-one         382.2
  add-α           113.2
  deleted est.    113.4
  Good-Turing     112.9

Interpolation and back-off
The previous methods treat all unseen n-grams the same. Consider the trigrams
nádherná červená řepa
nádherná červená mrkev
(nádherná = wonderful, červená = red, řepa = beet, mrkev = carrot)
Although neither occurs in our training data, the former trigram should probably be more probable. We can use the probabilities of lower-order models, for which we do have the necessary data:
červená řepa
červená mrkev
nádherná červená

Interpolation
p_I(w_3|w_1 w_2) = λ_1 p(w_3) + λ_2 p(w_3|w_2) + λ_3 p(w_3|w_1 w_2)
If we have enough data, we can trust the higher-order models more and assign a higher weight to the corresponding n-grams. p_I is a probability distribution, so the following must hold:
∀λ_n : 0 ≤ λ_n ≤ 1 and Σ_n λ_n = 1

Large LM - n-gram counts
How many unique n-grams are there in a corpus?

  order     unique        singletons
  unigram   86,700        33,447 (38.6%)
  bigram    1,948,935     1,132,844 (58.1%)
  trigram   8,092,798     6,022,286 (74.4%)
  4-gram    15,303,847    13,081,621 (85.5%)
  5-gram    19,882,175    18,324,577 (92.2%)

Taken from Europarl with 30 million tokens.

Lexical translation
A standard lexicon does not contain information about the frequency of translations of the individual meanings of a word.
key = klíč, tónina, klávesa
How often are the individual translations actually used?
key -> klíč (0.7), tónina (0.18), klávesa (0.12)
We need a lexical probability distribution p_f with the properties:
Σ_e p_f(e) = 1 and ∀e : 0 ≤ p_f(e) ≤ 1
p_klíč(key) ? p_mrkev(carrot)

Word alignment, alignment function
Translations differ in the number of words and in word order. SMT uses an alignment function a : j -> i, where j is the position of a word in the target sentence (Czech) and i is a position in the source sentence (English).

  i: 1    2      3   4      5
     the  castle is  very   old
     ten  hrad   je  velmi  starý
  j: 1    2      3   4      5

a is a function, therefore for each word w_e from the target sentence there is exactly one word w_f from the source sentence.

Word alignment - more examples
► different word order:
it was written here
bylo to zde napsané
a: 1 -> 2, 2 -> 1, 3 -> 4, 4 -> 3
► different number of words:
jsem maličký
i am very small
a: 1 -> 1, 2 -> 1, 3 -> 2, 4 -> 2
► no translational equivalent:
have you got it ?
máš to ?
a: 1 -> 1, 2 -> 4, 3 -> 5
► the opposite case; we add the token NULL at position 0:
NULL laugh
smát se
a: 1 -> 1, 2 -> 0

IBM model 1
p_f for whole sentences is not available. We split the translation into several steps and use p_f for words. This approach is called generative modeling. The translation model IBM-1 is defined as follows:

p(e, a|f) = ε / (l_f + 1)^{l_e} · Π_{j=1}^{l_e} t(e_j | f_{a(j)})

where e = (e_1, ..., e_{l_e}) is the target sentence, f = (f_1, ..., f_{l_f}) the source sentence, l_e the target sentence length, l_f the source sentence length, and ε a normalizing constant which makes the resulting product a proper probability distribution. (l_f + 1)^{l_e} is the number of all possible alignments between e and f (l_f is increased by 1 due to NULL); t is the lexical translation probability function.

Translation probability computation
To compute p(e, a|f) we need to know the value of t for all words. The basic resource for SMT will be used: a parallel corpus with aligned sentences. Unfortunately we do not have word-level alignment (CEMAT). That is the task of word alignment, and it is time for the expectation-maximization (EM) algorithm.
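To make the IBM-1 formula concrete, the following sketch evaluates p(e, a|f) for a single sentence pair, given a hand-set (hypothetical) t-table; ε is set to 1 for simplicity. In practice t is not given but estimated with EM, as described next.

```python
# Hypothetical lexical translation probabilities t(e | f) (hand-set, not trained).
t = {
    ("the", "das"): 0.7, ("house", "haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
}

def ibm1_p_e_a_f(e, f, a, epsilon=1.0):
    """p(e, a | f) = epsilon / (lf + 1)^le * prod_j t(e_j | f_a(j)).
    f[0] is the NULL token; a[j] gives the source position for target word j."""
    lf = len(f) - 1          # exclude NULL from the source sentence length
    le = len(e)
    p = epsilon / (lf + 1) ** le
    for j, ej in enumerate(e):
        p *= t.get((ej, f[a[j]]), 0.0)
    return p

f = ["NULL", "das", "haus", "ist", "klein"]
e = ["the", "house", "is", "small"]
a = [1, 2, 3, 4]             # each target word aligned to the source word at the same position
print(ibm1_p_e_a_f(e, f, a))  # 1/(5**4) * 0.7*0.8*0.8*0.4 ≈ 0.000287
```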
EM algorithm
1. Initialize the model (usually with a uniform distribution).
2. Apply the model to the data (expectation step): we compute
p(a|e, f) = p(e, a|f) / p(e|f), where p(e|f) = Σ_a p(e, a|f)
3. Adjust the model according to the data (maximization step): we collect expected counts of a target word w_e being translated from a source word w_f (function c), weighted by the alignment probabilities from the previous step:
c(w_e|w_f; e, f) = Σ_a p(a|e, f) Σ_{j=1}^{l_e} δ(w_e, e_j) δ(w_f, f_{a(j)})
where δ(x, y) = 1 if x = y, else 0.
4. Repeat the E and M steps until the model stops improving.
(A code sketch of this procedure for IBM Model 1 follows after the IBM model overview below.)

Translation probability from the EM algorithm
The resulting translation probability is computed from c:

t(w_e|w_f) = Σ_{(e,f)} c(w_e|w_f; e, f) / Σ_{w_e} Σ_{(e,f)} c(w_e|w_f; e, f)

EM algorithm - initialization, EM algorithm - final phase
[Figure: word alignments of ... la maison ..., ... la maison bleu ..., ... la fleur ... with ... the house ..., ... the blue house ..., ... the flower ...; converged values p(la|the) = 0.453, p(le|the) = 0.334, p(maison|house) = 0.876, p(bleu|blue) = 0.563.]

IBM models
IBM model 1 is very simple. It does not take context into account and cannot insert or drop words. All alignments have the same probability. Each of the following models adds something to the previous one.
► IBM-1: lexical translation
► IBM-2: + absolute alignment model
► IBM-3: + fertility model
► IBM-4: + relative alignment model
► IBM-5: + treats shortcomings of the previous models

IBM-2
In IBM-1, all translations differing only in word order have the same probability. IBM-2 adds an explicit model of alignment, the alignment probability distribution:
a(i|j, l_e, l_f)
where i is the source word position and j the target word position.

IBM-2 - two translation steps
Translation is split into two steps. In the first, lexical units are translated; in the second, words are rearranged according to the alignment model:
klein ist das Haus -> (lexical translation step) -> small is the house -> (alignment step) -> the house is small

IBM-2
The first step is the same as in IBM-1; t(e|f) is used. The function a, with probability distribution a, points in the opposite direction to the translation direction. Both distributions are combined into the formula for IBM-2:

p(e, a|f) = ε Π_{j=1}^{l_e} t(e_j | f_{a(j)}) a(a(j) | j, l_e, l_f)

p(e|f) = Σ_a p(e, a|f) = ε Π_{j=1}^{l_e} Σ_{i=0}^{l_f} t(e_j | f_i) a(i | j, l_e, l_f)

IBM model 3
Models 1 and 2 do not handle situations where one word is translated to two or more words, or is not translated at all. IBM-3 solves this with fertility, which is modelled with a probability distribution n(φ|f): for each source word f, the distribution n expresses how many words f is usually translated to.
n(0|ja) = 0.999
n(1|king) = 0.997
n(2|zum) = 0.25

NULL token insertion
If we want to translate properly into a target language which uses words without translational equivalents in the source, we have to handle insertion of the NULL token. n(x|NULL) is not used, since NULL insertion depends on the sentence length. Instead we add another step, NULL insertion, to the translation process. We use p_1 and p_0 = 1 − p_1, where p_1 stands for the probability of inserting a NULL token after any word in the sentence.

[Figure: translation of "ich gehe ja nicht zum haus" in four steps: fertility (ich gehe nicht zum zum haus), NULL insertion (ich NULL gehe nicht zum zum haus), lexical translation (I do go not to the house), distortion (I do not go to the house).]

IBM-3 - distortion
The last step is almost the same as the second step of IBM-2 and is modelled with the distortion probability distribution
d(j|i, l_e, l_f)
which models positions in the opposite direction: for each source word at position i it models the position j of the target word. The translation process from the previous figure may differ slightly (see the next figure).

[Figure: the same four steps for "ich gehe ja nicht zum haus", but the lexical translation step yields "I do go not the to house", which the distortion step reorders to "I do not go to the house".]
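Returning to the EM procedure for IBM Model 1 sketched above, here is a minimal illustration on a toy corpus. It uses the standard simplification for IBM-1, where the sum over alignments factorizes per target word, and it omits the NULL token for brevity; the sentence pairs are invented for illustration.

```python
from collections import defaultdict

# Toy parallel corpus of (source f, target e) sentence pairs; NULL omitted for brevity.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# 1. Initialization: uniform t(e | f).
e_vocab = {e for _, es in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for iteration in range(10):
    count = defaultdict(float)   # expected counts c(e | f)
    total = defaultdict(float)   # normalizers per source word f
    # 2. Expectation step: distribute each target word's count over source words.
    for fs, es in corpus:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)   # equivalent to summing over alignments
            for f in fs:
                frac = t[(e, f)] / norm
                count[(e, f)] += frac
                total[f] += frac
    # 3. Maximization step: re-estimate t from the expected counts.
    t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})

# After a few iterations, das->the, haus->house, buch->book, ein->a dominate.
for pair in [("the", "das"), ("house", "haus"), ("book", "buch")]:
    print(pair, round(t[pair], 3))
```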
IBM-4, IBM-5
IBM-4: The problem with distortion is sparse data for long sentences. IBM-4 introduces relative distortion, where the rearrangement of word positions depends on particular previous words. It comes from the assumption that we translate in phrases which are moved together, and that some rearrangements are more frequent than others (English: ADJ NOUN, French: NOUN ADJ, etc.).
IBM-5: This model fixes other shortcomings of the previous models. E.g. it takes care of situations where two different source words could end up in the same target position.

Word alignment matrix
[Figure: word alignment matrix.]

Word alignment issues
[Figure: examples of problematic word alignments.]

Phrase-based translation model
The state of the art in SMT. Not only single words but whole phrases (n-grams, word sequences) can be translated in one step.
natuerlich -> of course
spass am spiel -> fun with the game
Phrases are not linguistically motivated, only statistically. German am on its own is seldom translated as the English with; the statistically significant context spass am helps towards a better translation. A linguistic phrase structure would be different: (fun (with (the game))).

Advantages of PBTM
► we often translate n : m words; a single word is not the ideal unit of translation
► translating word sequences helps resolve some ambiguities
► models can learn to translate longer and longer phrases
► a simpler model: no fertility, no NULL token, etc.

Phrase-based model - formula
The translation probability p(f|e) is split into phrases:

p(f_1^I | e_1^I) = Π_{i=1}^{I} φ(f_i | e_i) d(start_i − end_{i−1} − 1)

The sentence f is split into I phrases f_i; all segmentations have the same probability. The function φ is the translation probability for phrases. The function d is the distance-based reordering model; reordering is modelled relative to the previous phrase: start_i is the position of the first word of the phrase in f which is translated to the i-th phrase of e, and end_{i−1} the position of the last word of the phrase translated to the (i−1)-th phrase.

Distance-based reordering model
Minimal reordering is preferred: the bigger the reordering (measured on the source side), the less preferred the operation is.
[Figure: example reordering distances d = 0, d = −3, d = 1, d = 2 over source positions 1-7.]

Building the phrase matrix
We take the word alignments from the EM algorithm and then look for consistent phrases. Phrases f and e are consistent with an alignment A if all words f_1, ..., f_n in phrase f which have an alignment in A are aligned with words e_1, ..., e_n in phrase e, and vice versa.
[Figure: three example phrase pairs: consistent, inconsistent, consistent.]

Phrase extraction
[Figure: word alignment matrix for "michael geht davon aus , dass er im haus bleibt" and "michael assumes that he will stay in the house".]
Extracted phrases:
michael - michael
assumes - geht davon aus / geht davon aus ,
that - dass / , dass
he - er
will stay - bleibt
in the - im
house - haus
michael assumes - michael geht davon aus / michael geht davon aus ,
assumes that - geht davon aus , dass
assumes that he - geht davon aus , dass er
that he - dass er / , dass er
in the house - im haus
michael assumes that - michael geht davon aus , dass

Phrase translation probability estimation

φ(f_i | e_i) = count(e_i, f_i) / Σ_{f_j} count(e_i, f_j)

Phrase-based model of SMT

e* = argmax_e Π_{i=1}^{I} φ(f_i | e_i) d(start_i − end_{i−1} − 1) · Π_{i=1}^{|e|} p_LM(e_i | e_1, ..., e_{i−1})

Decoding
Given a language model p_LM and a translation model p(f|e), we need to find the translation with the highest probability out of an exponential number of possible translations. Heuristic search methods are used, so it is not guaranteed that we find the best translation.
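Before turning to the search itself, the following sketch scores a single translation hypothesis under the phrase-based model above: a product (here a sum of logs) of phrase translation probabilities, the distance-based reordering penalty d, and a language model score. The phrase table, the bigram LM values and the exponential form d(x) = α^|x| of the reordering penalty are all assumptions for illustration; the slides only say that d is distance-based.

```python
import math

# Toy phrase table phi(f_phrase | e_phrase) and bigram LM (invented values).
phrase_table = {
    ("he", "er"): 0.8, ("does not", "ja nicht"): 0.4,
    ("go", "geht"): 0.3, ("home", "nach hause"): 0.7,
}
lm_bigram = {("<s>", "he"): 0.3, ("he", "does"): 0.2, ("does", "not"): 0.6,
             ("not", "go"): 0.3, ("go", "home"): 0.4, ("home", "</s>"): 0.5}

def d(x, alpha=0.6):
    """Assumed distance-based reordering penalty: alpha^|x|."""
    return alpha ** abs(x)

def score(hypothesis):
    """Score one hypothesis: a list of (e_phrase, f_phrase, f_start, f_end)."""
    logp = 0.0
    prev_end = 0
    words = ["<s>"]
    for e_phrase, f_phrase, start, end in hypothesis:
        logp += math.log(phrase_table[(e_phrase, f_phrase)])   # translation model
        logp += math.log(d(start - prev_end - 1))               # reordering model
        prev_end = end
        words += e_phrase.split()
    words.append("</s>")
    for prev, w in zip(words, words[1:]):                       # language model
        logp += math.log(lm_bigram.get((prev, w), 1e-4))
    return logp

# "er geht ja nicht nach hause" -> "he does not go home"
# source positions: er=1 geht=2 ja=3 nicht=4 nach=5 hause=6
hyp = [("he", "er", 1, 1), ("does not", "ja nicht", 3, 4),
       ("go", "geht", 2, 2), ("home", "nach hause", 5, 6)]
print(f"log score = {score(hyp):.2f}")
```

The decoder's job, discussed next, is to find the hypothesis maximizing this score without enumerating all of them.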
Errors in translations are caused either by 1) the decoding process, where the best translation is not found owing to the heuristics, or 2) the models, where the translation that is best according to the probability functions is not actually the best translation.

Phrase-wise sentence translation
[Figure: phrase-by-phrase translation of "er geht ja nicht nach hause", building the target sentence left to right: er -> he, ..., nach hause -> home.]
In each step of the translation we compute preliminary probability scores from the translation, reordering and language models.

Search space of translation hypotheses
[Figure: lattice of translation options for the words and spans of "er geht ja nicht nach hause", including he, it is, he will be, is, are, goes, go, yes, of course, not, do not, does not, is not, after, to, according to, in, house, home, chamber, at home, return home, ...]
We are dealing with an exponential space of all possible translations. We need to limit this space using various methods.

Hypothesis construction, beam search
[Figure: incremental construction of translation hypotheses for "er geht ja nicht nach hause".]

Beam search
Beam search is based on breadth-first search. At each level of the search tree it generates all children of the nodes on that level and sorts them according to various heuristics. It stores only a limited number of the best states at each level (the beam width); only these states are investigated further. The wider the beam, the fewer children (descendants) are thrown away (the fewer branches are pruned). With an unlimited width it becomes the plain breadth-first search algorithm. The width determines memory consumption. The best final state might never be found, since it can be pruned somewhere in the middle of the process.
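A minimal sketch of beam search as described above: hypotheses are expanded level by level, sorted by score, and only the best beam_width states survive at each level. The scoring function and the expansion step here are toy placeholders; in a real decoder they would come from the translation, reordering and language models.

```python
import heapq

def beam_search(initial, expand, score, is_final, beam_width=5, max_steps=20):
    """Generic beam search: keep only the best `beam_width` states per level.
    `expand(state)` returns successor states, `score(state)` a value to maximize,
    `is_final(state)` marks complete hypotheses."""
    beam = [initial]
    best_final = None
    for _ in range(max_steps):
        candidates = []
        for state in beam:
            for nxt in expand(state):
                if is_final(nxt):
                    if best_final is None or score(nxt) > score(best_final):
                        best_final = nxt
                else:
                    candidates.append(nxt)
        if not candidates:
            break
        # Pruning: keep only the beam_width highest-scoring states.
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return best_final

# Toy usage: "translate" source positions left to right by picking one option per
# position (options and log-scores are invented for illustration).
options = {0: [("he", -0.2), ("it", -0.9)],
           1: [("goes", -0.5), ("does not go", -0.4)],
           2: [("home", -0.3), ("to the house", -0.8)]}

def expand(state):
    words, logp, pos = state
    return [(words + [w], logp + lp, pos + 1) for w, lp in options.get(pos, [])]

result = beam_search(initial=([], 0.0, 0),
                     expand=expand,
                     score=lambda s: s[1],
                     is_final=lambda s: s[2] == len(options),
                     beam_width=2)
print(" ".join(result[0]), result[1])
```

With a narrow beam the search is fast but, exactly as noted above, a hypothesis that would have become the best final state can be pruned along the way.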