PA153
Vít Baisa

MACHINE TRANSLATION

We consider only technical / specialized texts: web pages, technical manuals, scientific documents and papers, leaflets and catalogues, legal texts and, in general, texts from specific domains. Nuances on the different language levels of literary texts are out of scope of current MT systems.

MACHINE TRANSLATION: ISSUES

In practice, MT output is always revised; we distinguish pre-editing and post-editing. MT systems make different types of errors than humans.

Mistakes characteristic of human translators:
- wrong prepositions (I am in school)
- missing determiners (I saw man)
- wrong tense (Viděl jsem: I was seeing), ...

Errors in meaning are characteristic of computers:
- Kiss me honey. → Polib mi med. (the endearment "honey" translated literally as the foodstuff)

Costa, Ângela, et al. "A linguistically motivated taxonomy for Machine Translation error analysis." Machine Translation 29.2 (2015): 127–161.

DIRECT METHODS FOR IMPROVING MT QUALITY

Limit the input to:
- a sublanguage (indicative sentences)
- a domain (informatics)
- a document type (patents)
Text pre-processing (e.g. manual syntactic analysis).

CLASSIFICATION BASED ON APPROACH

- rule-based, knowledge-based (RBMT, KBMT)
- transfer with interlingua
- statistical machine translation (SMT)
- hybrid machine translation (HMT, HyTran)
- neural networks

VAUQUOIS'S TRIANGLE

MOTIVATION IN THE 21ST CENTURY

- translation of web pages for gisting (getting the main message)
- methods for substantially speeding up human translation (translation memories)
- cross-language extraction of facts and search for information
- instant translation of e-communication
- translation on mobile devices

RULE-BASED MT

STATISTICAL MACHINE TRANSLATION

SMT SCHEME

PARALLEL CORPORA I

- the basic data source for SMT
- available sources: ~10–100 M words; size depends heavily on the language pair
- multilingual web pages (online newspapers)
- paragraph and sentence alignment needed

PARALLEL CORPORA II

- Europarl: 11 languages, 40 M words
- OPUS: parallel texts of various origin: open subtitles, UI localizations
- Acquis Communautaire: EU legal documents (20 languages)
- Hansards: 1.3 M pairs of text chunks from the official records of the Canadian Parliament
- EUR-Lex
- comparable corpora, ...

SENTENCE ALIGNMENT

Sentences in parallel corpora are not always in a 1:1 ratio. Tools: Church-Gale alignment, hunalign.

alignment    P
1:1          0.89
1:0, 0:1     0.0099
2:1, 1:2     0.089
2:2          0.011

SMT: NOISY CHANNEL PRINCIPLE

Claude Shannon (1948): self-correcting codes transmitted through noisy channels, based on information about the original data and the errors made in the channel. Used for MT, ASR, OCR. Optical character recognition is erroneous, but we can estimate what was damaged in a text (with a language model); typical errors: l ↔ 1 ↔ I, rn ↔ m, etc.

$e^* = \arg\max_e p(e|f) = \arg\max_e \frac{p(e)\,p(f|e)}{p(f)} = \arg\max_e p(e)\,p(f|e)$

SMT COMPONENTS I

Language model:
- how we get $p(e)$ for any string $e$
- the more $e$ looks like proper language, the higher $p(e)$ should be
- issue: what is $p(e)$ for an unseen $e$?
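A minimal sketch of how the two components combine under the noisy-channel decomposition above: the decoder picks the candidate $e$ maximizing $p(e)\,p(f|e)$. All sentences and probabilities below are toy values invented for illustration, not real model estimates.

```python
# Noisy-channel scoring sketch: choose the candidate translation e that
# maximizes p(e) * p(f|e). All entries below are made-up toy values.

# language model p(e): how much e looks like proper English
lm = {
    "kiss me honey": 0.004,
    "kiss me darling": 0.006,
}

# translation model p(f|e): how well e explains the source sentence f
tm = {
    ("polib mě, miláčku", "kiss me honey"): 0.3,
    ("polib mě, miláčku", "kiss me darling"): 0.4,
}

def best_translation(f, candidates):
    """argmax_e p(e) * p(f|e) over a finite candidate set."""
    return max(candidates, key=lambda e: lm.get(e, 0.0) * tm.get((f, e), 0.0))

print(best_translation("polib mě, miláčku", ["kiss me honey", "kiss me darling"]))
```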
SMT COMPONENTS II

Translation model:
- for $e$ and $f$, compute $p(f|e)$
- the more $f$ looks like a proper translation of $e$, the higher $p(f|e)$

SMT COMPONENTS III

Decoding algorithm:
- based on the TM and LM, find a sentence $e$ as the best translation of $f$
- as fast as possible and with as little memory as possible
- prune unpromising hypotheses, but do not lose any valid translations

LANGUAGE MODELS

WHAT IS IT GOOD FOR?

What is the probability of an utterance $s$?
- I go to home vs. I go home

What is the next, most probable word?
- Ke snídani jsem měl celozrnný ... ("For breakfast I had wholegrain ...")
- { chléb > pečivo > zákusek > mléko > babičku } (bread > pastry > dessert > milk > grandma)

CHOMSKY WAS WRONG

Colorless green ideas sleep furiously vs. Furiously sleep ideas green colorless.
A LM assigns a higher $p$ to the first! (Mikolov, 2012)

GENERATING RANDOM TEXT

To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have Every enter now severally so, let. (unigrams)

Sweet prince, Falstaff shall die. Harry of Monmouth's grave. This shall forbid it should be branded, if renown made it empty. (trigrams)

Can you guess the author of the original text?

CBLM

MAXIMUM LIKELIHOOD ESTIMATION

$p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\sum_w \mathrm{count}(w_1, w_2, w)}$

(the, green, *): 1,748× in EuroParl

w       count   p(w)
paper   801     0.458
group   640     0.367
light   110     0.063
party   27      0.015
ecu     21      0.012

LM QUALITY

We need to compare the quality of various LMs. Two approaches: extrinsic and intrinsic evaluation. A good LM should assign a higher probability to a good(-looking) text than to an incorrect text. For a fixed test text we can compare various LMs.

ENTROPY

Shannon, 1949
- the expected value (average) of the information contained in a message
- information viewed as the negative of the logarithm of the probability distribution
- events that always occur do not communicate information
- pure randomness has the highest entropy (uniform distribution: $\log_2 n$)

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$

PERPLEXITY

$PP = 2^{H(p_{LM})}$

$PP(W) = p(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$

A good LM should not waste $p$ on improbable phenomena. The lower the entropy, the better; the lower the perplexity, the better. → Maximizing the probability of the test text = minimizing perplexity.

WHAT INFLUENCES LM QUALITY?

- size of the training data
- order of the language model
- smoothing, interpolation, back-off

LARGE LM - N-GRAM COUNTS

How many unique n-grams are in a corpus?

order     types        singletons
unigram   86,700       33,447 (38.6%)
bigram    1,948,935    1,132,844 (58.1%)
trigram   8,092,798    6,022,286 (74.4%)
4-gram    15,303,847   13,081,621 (85.5%)
5-gram    19,882,175   18,324,577 (92.2%)

Taken from Europarl with 30 million tokens.

ZERO FREQUENCY, OOV, RARE WORDS

- probabilities must always be non-zero, otherwise perplexity cannot be measured
- maximum likelihood estimation is bad at this
- training data: work on Tuesday/Friday/Wednesday
- test data: work on Sunday → p(Sunday | work on) = 0
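The standard remedy is smoothing (listed above under what influences LM quality). A minimal sketch of add-one (Laplace) smoothing for a bigram model; the toy training data mirrors the work on Tuesday/Friday/Wednesday example:

```python
from collections import Counter

# Add-one (Laplace) smoothing sketch for a bigram LM: every count is
# incremented by 1, so an unseen bigram such as "work on sunday" gets
# a small non-zero probability. Toy training data, illustration only.
tokens = "work on tuesday work on friday work on wednesday".split()

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(unigrams)  # vocabulary size; handling OOV words would enlarge this

def p_add_one(w_prev, w):
    """p(w | w_prev) with add-one smoothing."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_add_one("on", "friday"))  # seen bigram
print(p_add_one("on", "sunday"))  # unseen bigram, still non-zero
```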
NEURAL NETWORK LANGUAGE MODELS

- neural networks are an old approach (1940s), only recently applied successfully to LM
- 2003: Bengio et al. (feed-forward NNLM)
- 2012: Mikolov (RNN)
- key concept: distributed representations of words (vs. the 1-of-V, one-hot representation)
- trending right now

RECURRENT NEURAL NETWORK

- Tomáš Mikolov (VUT)
- the hidden layer feeds back into itself
- shown to beat n-grams by a large margin

WORD EMBEDDINGS

- distributional semantics with vectors
- skip-gram, CBOW (continuous bag-of-words)

EMBEDDINGS IN MT

TRANSLATION MODELS

LEXICAL TRANSLATION

A standard lexicon does not contain information about how frequently the translations of the individual meanings of a word are used.

key → klíč, tónina, klávesa

How often is each translation actually used?

key → klíč (0.7), tónina (0.18), klávesa (0.11)

probability distribution $p_f$: $\sum_e p_f(e) = 1$, $\forall e : 0 \le p_f(e) \le 1$

EM ALGORITHM - INITIALIZATION

EM ALGORITHM - FINAL PHASE

IBM MODELS

IBM-1 does not take context into account and can neither add nor skip words. Each of the following models adds something to the previous one.

- IBM-1: lexical translation
- IBM-2: + absolute alignment model
- IBM-3: + fertility model
- IBM-4: + relative alignment model
- IBM-5: + further tuning

WORD ALIGNMENT MATRIX

WORD ALIGNMENT ISSUES

PHRASE-BASED TRANSLATION MODEL

Phrases are motivated statistically, not linguistically. German "am" is seldom translated with the single English "to"; cf. (fun (with (the game))).

ADVANTAGES OF PBTM

- translates n:m words; the word is not a suitable translation unit for many language pairs
- models learn to translate longer phrases
- simpler: no fertility, no NULL token, etc.

PHRASE EXTRACTION

EXTRACTED PHRASES

English          German
michael          michael
assumes          geht davon aus / geht davon aus ,
in the           im
house            haus
assumes that     geht davon aus , dass
that he          dass er / , dass er
in the house     im haus

PHRASE-BASED MODEL OF SMT

$e^* = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1) \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \ldots e_{i-1})$

DECODING

Given a language model $p_{LM}$ and a translation model $p(f|e)$, we need to find the translation with the highest probability among an exponential number of possible translations. Heuristic search methods are used; they are not guaranteed to find the best translation. Errors in translations are caused by 1) the decoding process, when the best translation is not found owing to the heuristics, or 2) the models, when the best translation according to the probability functions is not the best possible one.

EXAMPLE OF NOISE-INDUCED ERRORS (GOOGLE TRANSLATE)

Rinneadh clárúchán an úsáideora yxc a eiteach go rathúil.
→ The user registration yxc made a successful rejection.

Rinneadh clárúchán an úsáideora qqq a eiteach go rathúil.
→ Qqq made registration a user successfully refused.

Changing only the unknown user name (yxc → qqq) changes the structure of the entire output.

NEURAL NETWORK MACHINE TRANSLATION

- very close to the state of the art (PBSMT)
- a problem: variable-length input and output
- learning to translate and align at the same time
- a hot topic (2014, 2015): LISA

NN MODELS IN MT

SUMMARY VECTOR FOR SENTENCES

MT QUALITY EVALUATION

fluency, adequacy, intelligibility

AUTOMATIC TRANSLATION EVALUATION

- advantages: speed, cost
- disadvantages: do we really measure the quality of translation?
- gold standard: manually prepared reference translations
- a candidate $c$ is compared with reference translations $r_i$
- the paradox of automatic evaluation: the task corresponds to a situation where students assess their own exams: how do they know where they made a mistake?
- various approaches: n-grams shared between $c$ and $r_i$, edit distance, ... (see the sketch below)
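A minimal sketch of the shared-n-gram approach: count how many of the candidate's n-grams also occur in the reference, with clipping so a repeated candidate n-gram cannot be credited more times than it appears in the reference. The example sentences are assumed from Koehn's textbook illustration, which the precision/recall numbers on the next slide appear to follow.

```python
from collections import Counter

def ngrams(words, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def shared_ngrams(candidate, reference, n):
    """Clipped count of candidate n-grams that also occur in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    return sum(min(c, ref[g]) for g, c in cand.items())

reference = "israeli officials are responsible for airport security"
candidate = "israeli officials responsibility of airport safety"

print(shared_ngrams(candidate, reference, 1))  # 3 shared unigrams
print(shared_ngrams(candidate, reference, 2))  # 1 shared bigram
```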
RECALL AND PRECISION ON WORDS

precision = correct / output-length = 3/6 = 50%
recall = correct / reference-length = 3/7 = 43%
f-score = 2 × (precision × recall) / (precision + recall) = 2 × (.5 × .43) / (.5 + .43) = 46%

RECALL AND PRECISION: SHORTCOMINGS

metric      system A   system B
precision   50%        100%
recall      43%        100%
f-score     46%        100%

These metrics do not capture wrong word order.

BLEU

- the standard metric (2001), IBM, Papineni
- n-gram match between the reference and candidate translations
- precision is calculated for 1-, 2-, 3- and 4-grams, plus a brevity penalty

$\mathrm{BLEU} = \min\left(1, \frac{\text{output-length}}{\text{reference-length}}\right) \left(\prod_{i=1}^{4} \mathrm{precision}_i\right)^{\frac{1}{4}}$

BLEU: AN EXAMPLE

metric               system A   system B
precision (1-gram)   3/6        6/6
precision (2-gram)   1/5        4/5
precision (3-gram)   0/4        2/4
precision (4-gram)   0/3        1/3
brevity penalty      6/7        6/7
BLEU                 0%         52%

METEOR

- aligns hypotheses to one or more references
- exact, stem (morphology), synonym (WordNet) and paraphrase matches
- various scores, including WMT ranking and NIST adequacy
- extended support for English, Czech, German, French, Spanish, and Arabic
- high correlation with human judgments

EVALUATION OF EVALUATION METRICS

Correlation of automatic evaluation with manual evaluation.

EUROMATRIX

EUROMATRIX II
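To make the BLEU formula concrete, a minimal sketch that recomputes system B's score from the example table, with the precisions and brevity penalty exactly as on the slide:

```python
# BLEU sketch following the formula above: brevity penalty times the
# geometric mean of the 1- to 4-gram precisions. The numbers are
# system B's values from the example table.
precisions = [6/6, 4/5, 2/4, 1/3]  # 1- to 4-gram precision
output_len, reference_len = 6, 7

brevity_penalty = min(1.0, output_len / reference_len)

geo_mean = 1.0
for p in precisions:
    geo_mean *= p
geo_mean **= 1 / len(precisions)

bleu = brevity_penalty * geo_mean
print(f"BLEU = {bleu:.2f}")  # 0.52, matching the slide
```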