Phrase-Based Models
Philipp Koehn
18 September 2018

Philipp Koehn  Machine Translation: Phrase-Based Models  18 September 2018

Motivation
• Word-Based Models translate words as atomic units
• Phrase-Based Models translate phrases as atomic units
• Advantages:
  - many-to-many translation can handle non-compositional phrases
  - use of local context in translation
  - the more data, the longer phrases can be learned
• "Standard Model", used by Google Translate and others

Phrase-Based Model
Figure: natuerlich hat John spass am spiel → of course John has fun with the game
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered

Phrase Translation Table
• Main knowledge source: table with phrase translations and their probabilities
• Example: phrase translations for natuerlich

  Translation        Probability $\phi(\bar{e}|\bar{f})$
  of course          0.5
  naturally          0.3
  of course ,        0.15
  , of course ,      0.05

Real Example
Phrase translations for den Vorschlag learned from the Europarl corpus:

  English            $\phi(\bar{e}|\bar{f})$     English            $\phi(\bar{e}|\bar{f})$
  the proposal       0.6227                      the suggestions    0.0114
  's proposal        0.1068                      the proposed       0.0114
  a proposal         0.0341                      the motion         0.0091
  the idea           0.0250                      the idea of        0.0091
  this proposal      0.0227                      the proposal ,     0.0068
  proposal           0.0205                      its proposal       0.0068
  of the proposal    0.0159                      it                 0.0068
  the proposals      0.0159                      ...                ...

  - lexical variation (proposal vs suggestions)
  - morphological variation (proposal vs proposals)
  - included function words (the, a, ...)
  - noise (it)

Linguistic Phrases?
• Model is not limited to linguistic phrases (noun phrases, verb phrases, prepositional phrases, ...)
• Example non-linguistic phrase pair: spass am → fun with the
• Prior noun often helps with translation of preposition
• Experiments show that limitation to linguistic phrases hurts quality

modeling

Noisy Channel Model
We would like to integrate a language model
Bayes rule:

$$\operatorname{argmax}_e \, p(e|f) = \operatorname{argmax}_e \frac{p(f|e)\, p(e)}{p(f)} = \operatorname{argmax}_e \, p(f|e)\, p(e)$$

Noisy Channel Model
Figure: Source (source model $p(S)$) → message S → Channel (channel model $p(R|S)$) → message R → Receiver
• Applying Bayes rule is also called the noisy channel model
  - we observe a distorted message R (here: a foreign string f)
  - we have a model of how the message is distorted (here: translation model)
  - we have a model of which messages are probable (here: language model)
  - we want to recover the original message S (here: an English string e)

More Detail
Bayes rule:

$$e_{best} = \operatorname{argmax}_e \, p(e|f) = \operatorname{argmax}_e \, p(f|e)\, p_{LM}(e)$$

  - translation model $p(f|e)$
  - language model $p_{LM}(e)$

Decomposition of the translation model:

$$p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)\; d(start_i - end_{i-1} - 1)$$

  - phrase translation probability $\phi$
  - reordering probability $d$

Distance-Based Reordering
Figure: foreign positions 1 to 7 mapped to English phrases, annotated with the reordering distance d of each phrase

  phrase    translates    movement              distance
  1         1-3           start at beginning     0
  2         6             skip over 4-5         +2
  3         4-5           move back over 4-6    -3
  4         7             skip over 6           +1

Scoring function: $d(x) = \alpha^{|x|}$, exponential with distance

training

Learning a Phrase Translation Table
• Task: learn the model from a parallel corpus
• Three stages:
  - word alignment: using IBM models or other method
  - extraction of phrase pairs
  - scoring phrase pairs

Word Alignment

Extracting Phrase Pairs
Figure: word alignment matrix for "michael assumes that he will stay in the house" and "michael geht davon aus , dass er im haus bleibt"
extract phrase pair consistent with word alignment: assumes that / geht davon aus , dass

Consistent
Figure: three example phrase pairs
  - consistent: ok
  - inconsistent: violated, one alignment point outside
  - consistent: ok, unaligned word is fine
All words of the phrase pair have to align to each other.

Consistent
Phrase pair $(\bar{e}, \bar{f})$ is consistent with an alignment $A$ if all words $f_1, ..., f_n$ in $\bar{f}$ that have alignment points in $A$ have these with words $e_1, ..., e_n$ in $\bar{e}$, and vice versa:

$$\begin{aligned}
(\bar{e}, \bar{f}) \text{ consistent with } A \Leftrightarrow\;
& \forall e_i \in \bar{e} : (e_i, f_j) \in A \Rightarrow f_j \in \bar{f} \\
\text{AND }\; & \forall f_j \in \bar{f} : (e_i, f_j) \in A \Rightarrow e_i \in \bar{e} \\
\text{AND }\; & \exists e_i \in \bar{e}, f_j \in \bar{f} : (e_i, f_j) \in A
\end{aligned}$$

Phrase Pair Extraction
Figure: word alignment matrix for "michael assumes that he will stay in the house" and "michael geht davon aus , dass er im haus bleibt"
Smallest phrase pairs:
  michael - michael
  assumes - geht davon aus / geht davon aus ,
  that - dass / , dass
  he - er
  will stay - bleibt
  in the - im
  house - haus
Unaligned words (here: the German comma) lead to multiple translations

Larger Phrase Pairs
  michael assumes - michael geht davon aus / michael geht davon aus ,
  assumes that - geht davon aus , dass
  assumes that he - geht davon aus , dass er
  that he - dass er / , dass er
  in the house - im haus
  michael assumes that - michael geht davon aus , dass
  michael assumes that he - michael geht davon aus , dass er
  michael assumes that he will stay in the house - michael geht davon aus , dass er im haus bleibt
  assumes that he will stay in the house - geht davon aus , dass er im haus bleibt
  that he will stay in the house - dass er im haus bleibt / , dass er im haus bleibt
  he will stay in the house - er im haus bleibt
  will stay in the house - im haus bleibt

Scoring Phrase Translations
• Phrase pair extraction: collect all phrase pairs from the data
• Phrase pair scoring: assign probabilities to phrase translations
• Score by relative frequency:

$$\phi(\bar{f}|\bar{e}) = \frac{\text{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}_i} \text{count}(\bar{e}, \bar{f}_i)}$$

EM Training of the Phrase Model
• We presented a heuristic set-up to build the phrase translation table (word alignment, phrase extraction, phrase scoring)
• Alternative: align phrase pairs directly with the EM algorithm
  - initialization: uniform model, all $\phi(\bar{e}, \bar{f})$ are the same
  - expectation step:
    * estimate likelihood of all possible phrase alignments for all sentence pairs
  - maximization step:
    * collect counts for phrase pairs $(\bar{e}, \bar{f})$, weighted by alignment probability
    * update phrase translation probabilities $p(\bar{e}, \bar{f})$
• However: method easily overfits (learns very large phrase pairs, spanning entire sentences)

Size of the Phrase Table
• Phrase translation table typically bigger than corpus
  ... even with limits on phrase lengths (e.g., max 7 words)
  → Too big to store in memory?
• Solution for training
  - extract to disk, sort, construct for one source phrase at a time
• Solutions for decoding
  - on-disk data structures with index for quick look-ups
  - suffix arrays to create phrase pairs on demand

advanced modeling

Weighted Model
Described standard model consists of three sub-models
  - phrase translation model $\phi(\bar{f}|\bar{e})$
  - reordering model $d$
  - language model $p_{LM}(e)$

$$e_{best} = \operatorname{argmax}_e \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)\; d(start_i - end_{i-1} - 1) \prod_{i=1}^{|e|} p_{LM}(e_i|e_1...e_{i-1})$$

Some sub-models may be more important than others
Add weights $\lambda_\phi$, $\lambda_d$, $\lambda_{LM}$

$$e_{best} = \operatorname{argmax}_e \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)^{\lambda_\phi}\; d(start_i - end_{i-1} - 1)^{\lambda_d} \prod_{i=1}^{|e|} p_{LM}(e_i|e_1...e_{i-1})^{\lambda_{LM}}$$

Log-Linear Model
• Such a weighted model is a log-linear model:

$$p(x) = \exp \sum_{i=1}^{n} \lambda_i h_i(x)$$

• Our feature functions
  - number of feature functions $n = 3$
  - random variable $x = (e, f, start, end)$
  - feature function $h_1 = \log \phi$
  - feature function $h_2 = \log d$
  - feature function $h_3 = \log p_{LM}$

Weighted Model as Log-Linear Model

$$p(e, a|f) = \exp\Big( \lambda_\phi \sum_{i=1}^{I} \log \phi(\bar{f}_i|\bar{e}_i) + \lambda_d \sum_{i=1}^{I} \log d(a_i - b_{i-1} - 1) + \lambda_{LM} \sum_{i=1}^{|e|} \log p_{LM}(e_i|e_1...e_{i-1}) \Big)$$

More Feature Functions
• Bidirectional alignment probabilities: $\phi(\bar{e}|\bar{f})$ and $\phi(\bar{f}|\bar{e})$
• Rare phrase pairs have unreliable phrase translation probability estimates
  → lexical weighting with word translation probabilities
Figure: word alignment example for the English phrase "does not assume"

$$\text{lex}(\bar{e}|\bar{f}, a) = \prod_{i=1}^{\text{length}(\bar{e})} \frac{1}{|\{j \mid (i,j) \in a\}|} \sum_{\forall (i,j) \in a} w(e_i|f_j)$$

More Feature Functions
• Language model has a bias towards short translations
  → word count: $wc(e) = \log |e|^{\omega}$
• We may prefer finer or coarser segmentation
  → phrase count: $pc(e) = \log |I|^{\rho}$
• Multiple language models
• Multiple translation models
• Other knowledge sources

reordering

Lexicalized Reordering
• Distance-based reordering model is weak
  → learn reordering preference for each phrase pair
• Three orientation types: (m) monotone, (s) swap, (d) discontinuous

$$\text{orientation} \in \{m, s, d\}$$
$$p_o(\text{orientation}|\bar{f}, \bar{e})$$

Learning Lexicalized Reordering
• Collect orientation information during phrase pair extraction
  - if a word alignment point to the top left exists → monotone
  - if a word alignment point to the top right exists → swap
  - if neither a word alignment point to the top left nor to the top right exists → neither monotone nor swap → discontinuous

Learning Lexicalized Reordering
• Estimation by relative frequency

$$p_o(\text{orientation}) = \frac{\sum_{\bar{f}} \sum_{\bar{e}} \text{count}(\text{orientation}, \bar{e}, \bar{f})}{\sum_{o} \sum_{\bar{f}} \sum_{\bar{e}} \text{count}(o, \bar{e}, \bar{f})}$$

• Smoothing with unlexicalized orientation model $p(\text{orientation})$ to avoid zero probabilities for unseen orientations

$$p_o(\text{orientation}|\bar{f}, \bar{e}) = \frac{\sigma\, p(\text{orientation}) + \text{count}(\text{orientation}, \bar{e}, \bar{f})}{\sigma + \sum_{o} \text{count}(o, \bar{e}, \bar{f})}$$

operation sequence model

A Critique: Phrase Segmentation is Arbitrary
• If multiple segmentations are possible, why choose one over the other?
  Figure: two alternative segmentations of "spass am spiel"
• When to choose larger phrase pairs or multiple shorter phrase pairs?
  Figure: three alternative segmentations of "spass am spiel"
• None of this has been properly addressed

A Critique: Strong Independence Assumptions
• Lexical context considered only within phrase pairs
  spass am → fun with
• No context considered between phrase pairs
  ? spass am ? → ? fun with ?
• Some phrasal context considered in lexicalized reordering model
  ... but not based on the identity of neighboring phrases

Segmentation? Minimal Phrase Pairs
Figure: two alignments between "natürlich hat John Spaß am Spiel" and "of course John has fun with the game", one with larger phrase pairs and one with minimal phrase pairs

Independence? Consider Sequence of Operations

  Operation                               Source / target generated so far
  o1   Generate(natürlich, of course)     natürlich / of course
  o2   Insert Gap
  o3   Generate(John, John)               natürlich [gap] John / of course John
  o4   Jump Back (1)
  o5   Generate(hat, has)                 natürlich hat John / of course John has
  o6   Jump Forward                       natürlich hat John / of course John has
  o7   Generate(Spaß, fun)                natürlich hat John Spaß / of course John has fun
  o8   Generate(am, with)
  o9   GenerateTargetOnly(the)            natürlich hat John Spaß am / of course John has fun with the
  o10  Generate(Spiel, game)              natürlich hat John Spaß am Spiel / of course John has fun with the game

Operation Sequence Model
• Operations
  - generate (phrase translation)
  - generate target only
  - generate source only
  - insert gap
  - jump back
  - jump forward
• N-gram sequence model over operations, e.g., 5-gram model:

$$p(o_1)\; p(o_2|o_1)\; p(o_3|o_1, o_2)\; ...\; p(o_{10}|o_6, o_7, o_8, o_9)$$

In Practice
• Operation Sequence Model used as additional feature function
• Significant improvements over phrase-based baseline
  → State-of-the-art systems include such a model

Summary
• Phrase Model
• Training the model
  - word alignment
  - phrase pair extraction
  - phrase pair scoring
  - EM training of the phrase model
• Log linear model
  - sub-models as feature functions
  - lexical weighting
  - word and phrase count features
• Lexicalized reordering model
• Operation sequence model
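
Worked example: phrase pair extraction
The consistency criterion from the "Consistent" and "Phrase Pair Extraction" slides can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any particular toolkit. The function name `extract_phrase_pairs`, the `max_len` parameter, and the 0-based alignment points are assumptions; the alignment itself is read off the smallest phrase pairs listed for the michael example.

```python
def extract_phrase_pairs(e_words, f_words, alignment, max_len=7):
    """Return all phrase pairs consistent with a word alignment.

    alignment is a set of (english_index, foreign_index) pairs, 0-based.
    """
    aligned_f = {j for _, j in alignment}
    pairs = set()
    for e_start in range(len(e_words)):
        for e_end in range(e_start, min(e_start + max_len, len(e_words))):
            # Foreign positions linked to the English span.
            f_points = [j for i, j in alignment if e_start <= i <= e_end]
            if not f_points:
                continue  # at least one alignment point is required
            f_start, f_end = min(f_points), max(f_points)
            # Violation: a foreign word inside the span aligns outside it.
            if any(f_start <= j <= f_end and not e_start <= i <= e_end
                   for i, j in alignment):
                continue
            # Unaligned foreign words at the edges may be included or not.
            fs = f_start
            while fs >= 0 and (fs == f_start or fs not in aligned_f):
                fe = f_end
                while fe < len(f_words) and (fe == f_end or fe not in aligned_f):
                    if fe - fs < max_len:
                        pairs.add((" ".join(e_words[e_start:e_end + 1]),
                                   " ".join(f_words[fs:fe + 1])))
                    fe += 1
                fs -= 1
    return pairs


# Alignment assumed from the slide's smallest phrase pairs (comma unaligned).
e_words = "michael assumes that he will stay in the house".split()
f_words = "michael geht davon aus , dass er im haus bleibt".split()
alignment = {(0, 0), (1, 1), (1, 2), (1, 3), (2, 5), (3, 6),
             (4, 9), (5, 9), (6, 7), (7, 7), (8, 8)}
pairs = extract_phrase_pairs(e_words, f_words, alignment, max_len=10)
```

With this alignment, "will" alone is rejected because "bleibt" is also aligned to "stay", while the unaligned comma yields both "dass" and ", dass" for "that".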
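
Worked example: scoring by relative frequency
The relative-frequency estimate from the "Scoring Phrase Translations" slide divides the count of a phrase pair by the total count of its English side. The sketch below assumes a list of extracted (English, foreign) pairs with one entry per extraction event; the function name `score_phrase_pairs` and the toy counts are made up for illustration and are not taken from Europarl.

```python
from collections import Counter


def score_phrase_pairs(extracted_pairs):
    """Map (f_phrase, e_phrase) to phi(f|e) = count(e, f) / sum_f' count(e, f')."""
    pair_counts = Counter(extracted_pairs)
    e_totals = Counter()
    for (e_phrase, _), count in pair_counts.items():
        e_totals[e_phrase] += count
    return {(f_phrase, e_phrase): count / e_totals[e_phrase]
            for (e_phrase, f_phrase), count in pair_counts.items()}


# Toy extraction events (hypothetical counts).
events = [("the proposal", "den Vorschlag")] * 3 + [("the proposal", "der Vorschlag")]
phi = score_phrase_pairs(events)
```

The probabilities for a fixed English phrase sum to one, which is what the denominator in the slide's formula guarantees.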
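
Worked example: distance-based reordering
The distances in the "Distance-Based Reordering" table follow from start_i - end_{i-1} - 1, with end_0 = 0. The sketch below recomputes them from the foreign spans in the table. The function names and the value alpha = 0.5 are assumptions chosen only for illustration; the slides do not give a value for alpha.

```python
def reordering_distances(foreign_spans):
    """Distances start_i - end_{i-1} - 1 for phrases in English output order.

    foreign_spans holds (start, end) foreign positions, 1-based and inclusive.
    """
    distances = []
    prev_end = 0  # nothing has been translated before the first phrase
    for start, end in foreign_spans:
        distances.append(start - prev_end - 1)
        prev_end = end
    return distances


def distortion_score(x, alpha=0.5):
    """Scoring function d(x) = alpha ** |x| (alpha is an assumed value)."""
    return alpha ** abs(x)


# Foreign spans from the slide: phrases cover 1-3, 6, 4-5, 7.
spans = [(1, 3), (6, 6), (4, 5), (7, 7)]
distances = reordering_distances(spans)
scores = [distortion_score(x) for x in distances]
```

For the slide's example this gives the distances 0, +2, -3, +1, so only monotone steps (distance 0) receive the full score of 1.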
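
Worked example: weighted model as log-linear model
The three feature functions from the "Log-Linear Model" slide can be combined as exp of a weighted sum of logs. This is a minimal sketch under stated assumptions: the function name `log_linear_score`, the dictionary keys for the weights, and the toy probabilities are hypothetical, and alpha = 0.5 is again an assumed value for the reordering model.

```python
import math


def log_linear_score(phrase_probs, distances, lm_probs, weights, alpha=0.5):
    """exp(l_phi * sum log phi + l_d * sum log d + l_lm * sum log p_LM)."""
    h_phi = sum(math.log(p) for p in phrase_probs)        # h1 = log phi
    h_d = sum(abs(x) * math.log(alpha) for x in distances)  # h2 = log alpha^|x|
    h_lm = sum(math.log(p) for p in lm_probs)             # h3 = log p_LM
    return math.exp(weights["phi"] * h_phi
                    + weights["d"] * h_d
                    + weights["lm"] * h_lm)


# Toy values (hypothetical): two phrases, one reordering step, one LM factor.
uniform = {"phi": 1.0, "d": 1.0, "lm": 1.0}
no_reordering = {"phi": 1.0, "d": 0.0, "lm": 1.0}
score_uniform = log_linear_score([0.5, 0.25], [0, 1], [0.5], uniform)
score_no_d = log_linear_score([0.5, 0.25], [0, 1], [0.5], no_reordering)
```

With all weights set to 1 the score is the plain product of the sub-model probabilities; setting a weight to 0 removes that sub-model's influence.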
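
Worked example: lexicalized reordering orientation
The orientation rule from the "Learning Lexicalized Reordering" slides can be sketched as a lookup of two alignment points. The sketch assumes 0-based (english_index, foreign_index) points and reads "top left" as the previous English word aligned to the foreign word just before the phrase, and "top right" as the previous English word aligned to the foreign word just after it. Sentence-initial phrases are not treated specially here, which a real system would likely handle. Function names and sigma = 0.5 are assumptions.

```python
def orientation(alignment, e_start, f_start, f_end):
    """Classify a phrase pair as monotone (m), swap (s), or discontinuous (d)."""
    if (e_start - 1, f_start - 1) in alignment:
        return "m"
    if (e_start - 1, f_end + 1) in alignment:
        return "s"
    return "d"


def smoothed_orientation_prob(count, total, prior, sigma=0.5):
    """(sigma * p(orientation) + count) / (sigma + total count for the pair)."""
    return (sigma * prior + count) / (sigma + total)


# Alignment assumed for the michael example (German comma unaligned).
alignment = {(0, 0), (1, 1), (1, 2), (1, 3), (2, 5), (3, 6),
             (4, 9), (5, 9), (6, 7), (7, 7), (8, 8)}
he_er = orientation(alignment, 3, 6, 6)             # "he" / "er"
in_the_house = orientation(alignment, 6, 7, 8)      # "in the house" / "im haus"
will_stay = orientation(alignment, 4, 9, 9)         # "will stay" / "bleibt"
unseen = smoothed_orientation_prob(count=0, total=0, prior=0.6)
```

For a phrase pair that was never observed, the smoothed estimate falls back to the unlexicalized prior, which is the purpose of the smoothing term.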