Evaluation
Philipp Koehn
19 September 2023


Evaluation

• How good is a given machine translation system?
• Hard problem, since many different translations acceptable
  → semantic equivalence / similarity
• Evaluation metrics
  – subjective judgments by human evaluators
  – automatic evaluation metrics
  – task-based evaluation, e.g.:
    – how much post-editing effort?
    – does information come across?


Ten Translations of a Chinese Sentence

Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport's security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport's security is the responsibility of the Israeli security officials.

(a typical example from the 2001 NIST evaluation set)


adequacy and fluency


Adequacy and Fluency

• Human judgement
  – given: machine translation output
  – given: source and/or reference translation
  – task: assess the quality of the machine translation output
• Metrics
  Adequacy: Does the output convey the same meaning as the input sentence?
            Is part of the message lost, added, or distorted?
  Fluency:  Is the output good fluent English?
            This involves both grammatical correctness and idiomatic word choices.


Fluency and Adequacy: Scales

  Adequacy              Fluency
  5  all meaning        5  flawless English
  4  most meaning       4  good English
  3  much meaning       3  non-native English
  2  little meaning     2  disfluent English
  1  none               1  incomprehensible


Annotation Tool

(screenshot not reproduced)


Hands On: Judge Translations

• Rank according to adequacy and fluency on a 1-5 scale (5 is best)
  – Source: L'affaire NSA souligne l'absence totale de débat sur le renseignement
  – Reference: NSA Affair Emphasizes Complete Lack of Debate on Intelligence
  – System1: The NSA case underscores the total lack of debate on intelligence
  – System2: The case highlights the NSA total absence of debate on intelligence
  – System3: The matter NSA underlines the total absence of debates on the piece of information


Hands On: Judge Translations

• Rank according to adequacy and fluency on a 1-5 scale (5 is best)
  – Source: N'y aurait-il pas comme une vague hypocrisie de votre part ?
  – Reference: Is there not an element of hypocrisy on your part?
  – System1: Would it not as a wave of hypocrisy on your part?
  – System2: Is there would be no hypocrisy like a wave of your hand?
  – System3: Is there not as a wave of hypocrisy from you?


Hands On: Judge Translations

• Rank according to adequacy and fluency on a 1-5 scale (5 is best)
  – Source: La France a-t-elle bénéficié d'informations fournies par la NSA concernant des opérations terroristes visant nos intérêts ?
  – Reference: Has France benefited from the intelligence supplied by the NSA concerning terrorist operations against our interests?
  – System1: France has benefited from information supplied by the NSA on terrorist operations against our interests?
  – System2: Has the France received information from the NSA regarding terrorist operations aimed our interests?
  – System3: Did France profit from furnished information by the NSA concerning of the terrorist operations aiming our interests?


Evaluators Disagree

• Histogram of adequacy judgments by different human evaluators
  (five histograms, one per evaluator, over the scores 1–5; y-axis 10%–30% of judgments)
  (from WMT 2006 evaluation)


Measuring Agreement between Evaluators

• Kappa coefficient

    K = (p(A) − p(E)) / (1 − p(E))

  – p(A): proportion of times that the evaluators agree
  – p(E): proportion of time that they would agree by chance
    (5-point scale → p(E) = 1/5)

• Example: Inter-evaluator agreement in WMT 2007 evaluation campaign

  Evaluation type   P(A)   P(E)   K
  Fluency           .400   .2     .250
  Adequacy          .380   .2     .226


Ranking Translations

• Task for evaluator: Is translation X better than translation Y?
  (choices: better, worse, equal)
• Evaluators are more consistent:

  Evaluation type    P(A)   P(E)   K
  Fluency            .400   .2     .250
  Adequacy           .380   .2     .226
  Sentence ranking   .582   .333   .373


Ways to Improve Consistency

• Evaluate fluency and adequacy separately
• Normalize scores
  – use 100-point scale with "analog" ruler
  – normalize mean and variance of evaluators
• Check for bad evaluators (e.g., when using Amazon Turk)
  – repeat items
  – include reference
  – include artificially degraded translations


Goals for Evaluation Metrics

Low cost: reduce time and money spent on carrying out evaluation
Tunable: automatically optimize system performance towards metric
Meaningful: score should give intuitive interpretation of translation quality
Consistent: repeated use of metric should give same results
Correct: metric must rank better systems higher


Other Evaluation Criteria

When deploying systems, considerations go beyond quality of translations

Speed: we prefer faster machine translation systems
Size: fits into memory of available machines (e.g., handheld devices)
Integration: can be integrated into existing workflow
Customization: can be adapted to user's needs


automatic metrics


Automatic Evaluation Metrics

• Goal: computer program that computes the quality of translations
• Advantages: low cost, tunable, consistent
• Basic strategy
  – given: machine translation output
  – given: human reference translation
  – task: compute similarity between them


Precision and Recall of Words

  SYSTEM A:  Israeli officials responsibility of airport safety
  REFERENCE: Israeli officials are responsible for airport security

• Precision
    correct / output-length = 3/6 = 50%
• Recall
    correct / reference-length = 3/7 = 43%
• F-measure
    (precision × recall) / ((precision + recall) / 2) = (.5 × .43) / ((.5 + .43) / 2) = 46%
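
The word-overlap numbers above are easy to reproduce. Below is a minimal Python sketch (not the tooling behind the slides) that treats both sentences as bags of words, so word order is ignored; it reproduces the System A figures.

```python
from collections import Counter

def word_overlap_scores(system: str, reference: str) -> dict:
    """Precision, recall, and F-measure over bags of words (word order ignored)."""
    sys_counts = Counter(system.split())
    ref_counts = Counter(reference.split())
    # Clipped matches: each reference word can only be matched once.
    correct = sum((sys_counts & ref_counts).values())
    precision = correct / sum(sys_counts.values())
    recall = correct / sum(ref_counts.values())
    f_measure = (precision * recall) / ((precision + recall) / 2) if correct else 0.0
    return {"precision": precision, "recall": recall, "f-measure": f_measure}

reference = "Israeli officials are responsible for airport security"
system_a = "Israeli officials responsibility of airport safety"
print(word_overlap_scores(system_a, reference))
# precision 3/6 = 0.50, recall 3/7 ~ 0.43, f-measure ~ 0.46
```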

Precision and Recall

  SYSTEM A:  Israeli officials responsibility of airport safety
  SYSTEM B:  airport security Israeli officials are responsible
  REFERENCE: Israeli officials are responsible for airport security

  Metric      System A   System B
  precision   50%        100%
  recall      43%        100%
  f-measure   46%        100%

  flaw: no penalty for reordering


Word Error Rate

• Minimum number of editing steps to transform output to reference
    match: words match, no cost
    substitution: replace one word with another
    insertion: add word
    deletion: drop word
• Levenshtein distance

    WER = (substitutions + insertions + deletions) / reference-length


Example

  (edit-distance matrices aligning System A and System B to the reference; not reproduced)

  Metric                  System A   System B
  word error rate (WER)   57%        71%


BLEU

• N-gram overlap between machine translation output and reference translation
• Compute precision for n-grams of size 1 to 4
• Add brevity penalty (for too short translations)

    BLEU = min(1, output-length / reference-length) × ( ∏_{i=1..4} precision_i )^(1/4)

• Typically computed over the entire corpus, not single sentences


Example

  SYSTEM A:  Israeli officials responsibility of airport safety
  SYSTEM B:  airport security Israeli officials are responsible
  REFERENCE: Israeli officials are responsible for airport security

  (matching n-grams are highlighted on the slide: a 2-gram and a 1-gram match for System A,
   a 4-gram and a 2-gram match for System B)

  Metric              System A   System B
  precision (1gram)   3/6        6/6
  precision (2gram)   1/5        4/5
  precision (3gram)   0/4        2/4
  precision (4gram)   0/3        1/3
  brevity penalty     6/7        6/7
  BLEU                0%         52%


Multiple Reference Translations

• To account for variability, use multiple reference translations
  – n-grams may match in any of the references
  – closest reference length used
• Example

  SYSTEM:     Israeli officials responsibility of airport safety
  REFERENCES: Israeli officials are responsible for airport security
              Israel is in charge of the security at this airport
              The security work for this airport is the responsibility of the Israel government
              Israeli side was in charge of the security of this airport

  (matching n-grams against any of the references are highlighted on the slide)


METEOR: Flexible Matching

• Partial credit for matching stems
    SYSTEM:    Jim went home
    REFERENCE: Joe goes home
• Partial credit for matching synonyms
    SYSTEM:    Jim walks home
    REFERENCE: Joe goes home
• Use of paraphrases


Critique of Automatic Metrics

• Ignore relevance of words
  (names and core concepts more important than determiners and punctuation)
• Operate on local level
  (do not consider overall grammaticality of the sentence or sentence meaning)
• Scores are meaningless
  (scores very test-set specific, absolute value not informative)
• Human translators score low on BLEU
  (possibly because of higher variability, different word choices)
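
The BLEU definition above can likewise be sketched in a few lines. The sketch below follows the slide's simplified formulation (clipped n-gram precisions for n = 1..4, their geometric mean, and min(1, output-length / reference-length) as the brevity penalty); standard toolkits such as sacreBLEU differ in details, for example the exponential brevity penalty and corpus-level aggregation. It reproduces the worked example.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(system: str, reference: str, max_n: int = 4) -> float:
    """BLEU as defined on the slide: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a simple length-ratio brevity penalty."""
    sys_tok, ref_tok = system.split(), reference.split()
    product = 1.0
    for n in range(1, max_n + 1):
        sys_counts = Counter(ngrams(sys_tok, n))
        ref_counts = Counter(ngrams(ref_tok, n))
        correct = sum((sys_counts & ref_counts).values())  # clipped n-gram matches
        total = max(sum(sys_counts.values()), 1)           # n-grams in the output
        product *= correct / total
    brevity_penalty = min(1.0, len(sys_tok) / len(ref_tok))
    return brevity_penalty * product ** (1.0 / max_n)

reference = "Israeli officials are responsible for airport security"
print(bleu("airport security Israeli officials are responsible", reference))  # ~0.52
print(bleu("Israeli officials responsibility of airport safety", reference))  # 0.0
```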

Evaluation of Evaluation Metrics

• Automatic metrics are low cost, tunable, consistent
• But are they correct?
  → Yes, if they correlate with human judgement


Correlation with Human Judgement

(figure not reproduced)


Pearson's Correlation Coefficient

• Two variables: automatic score x, human judgment y
• Multiple systems (x1, y1), (x2, y2), ...
• Pearson's correlation coefficient r_xy:

    r_xy = Σ_i (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y)

• Note:
    mean      x̄ = (1/n) Σ_{i=1..n} x_i
    variance  s_x² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²


Evidence of Shortcomings of Automatic Metrics

Post-edited output vs. statistical systems (NIST 2005)

  (scatter plot of human adequacy score against BLEU score)


Evidence of Shortcomings of Automatic Metrics

Rule-based vs. statistical systems

  (scatter plot of human adequacy and fluency scores against BLEU score
   for SMT System 1, SMT System 2, and a rule-based system, Systran)


Metric Research

• Active development of new metrics
  – syntactic similarity
  – semantic equivalence or entailment
  – metrics targeted at reordering
  – trainable metrics
  – etc.
• Evaluation campaigns that rank metrics
  (using Pearson's correlation coefficient)


chrF++

• chrF: Character n-gram F-score (e.g., 6-grams)
• Some nice properties
  – partial credit for morphological variants
  – more credit for longer (content) words than for shorter (function) words
• chrF++: also add F-measure on words and word bigrams to the scoring


Trained Metrics: COMET

• Goal: automatic metric that correlates with human judgment
• More than a decade of evaluation campaigns for machine translation metrics
  has produced a lot of human judgment data
• Make it a machine learning problem
  – input: machine translation, reference translation
  – output: human annotation score
• COMET: trained neural model for evaluation


Automatic Metrics: Conclusions

• Automatic metrics essential tool for system development
• Not fully suited to rank systems of different types
• Evaluation metrics still open challenge
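
Pearson's r, defined a few slides above and used by the metric-ranking campaigns, is also straightforward to compute. A minimal sketch follows; the BLEU and adequacy numbers are invented purely for illustration, not taken from any evaluation.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient (equivalent to the slide's formula;
    the (n-1) factors of the sample standard deviations cancel out)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# One (automatic score, human judgment) pair per system; made-up toy numbers.
bleu_scores  = [0.22, 0.25, 0.28, 0.31]
human_scores = [2.9, 3.1, 3.4, 3.5]
print(pearson(bleu_scores, human_scores))  # close to 1.0 for this toy data
```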

other evaluation methods


Task-Oriented Evaluation

• Machine translation is a means to an end
• Does machine translation output help accomplish a task?
• Example tasks
  – producing high-quality translations by post-editing machine translation
  – information gathering from foreign language sources


Post-Editing Machine Translation

• Measuring time spent on producing translations
  – baseline: translation from scratch
  – post-editing machine translation
  But: time consuming, depends on the skills of the translator and post-editor
• Metrics inspired by this task
  – TER: based on number of editing steps
    Levenshtein operations (insertion, deletion, substitution) plus movement
    (see the edit-distance sketch at the end of this section)
  – HTER: manually construct reference translation for output, apply TER
    (very time consuming, used in DARPA GALE program 2005-2011)


Content Understanding Tests

• Given machine translation output, can a monolingual target-side speaker
  answer questions about it?
  1. basic facts: who? where? when? names, numbers, and dates
  2. actors and events: relationships, temporal and causal order
  3. nuance and author intent: emphasis and subtext
• Very hard to devise questions
• Sentence editing task (WMT 2009–2010)
  – person A edits the translation to make it fluent
    (with no access to source or reference)
  – person B checks if the edit is correct
    → did person A understand the translation correctly?
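
TER and HTER, like WER earlier, are built on word-level edit distance; TER additionally allows moving whole blocks of words, which the minimal sketch below leaves out. Assuming simple whitespace tokenization, the plain Levenshtein core already reproduces the WER numbers from the earlier example.

```python
def wer(system: str, reference: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions + insertions
    + deletions) between output and reference, divided by the reference length."""
    sys_tok, ref_tok = system.split(), reference.split()
    # dist[i][j]: edit distance between the first i output words and the first j reference words
    dist = [[0] * (len(ref_tok) + 1) for _ in range(len(sys_tok) + 1)]
    for i in range(len(sys_tok) + 1):
        dist[i][0] = i
    for j in range(len(ref_tok) + 1):
        dist[0][j] = j
    for i in range(1, len(sys_tok) + 1):
        for j in range(1, len(ref_tok) + 1):
            cost = 0 if sys_tok[i - 1] == ref_tok[j - 1] else 1  # match or substitution
            dist[i][j] = min(dist[i - 1][j] + 1,                 # drop an output word
                             dist[i][j - 1] + 1,                 # add a reference word
                             dist[i - 1][j - 1] + cost)
            # (TER would also consider moving a whole block of words as one edit.)
    return dist[-1][-1] / len(ref_tok)

reference = "Israeli officials are responsible for airport security"
print(wer("Israeli officials responsibility of airport safety", reference))  # 4/7, about 57%
print(wer("airport security Israeli officials are responsible", reference))  # 5/7, about 71%
```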