BERTScore
Marek Kadlčík, 485294

Problem statement
Given: machine translation output and a reference translation
Compute: a reasonable similarity score between the two
(The same problem also appears in automatic image captioning, generative question answering, ...)

Reminder of existing solutions
● BLEU score
● word error rate
● precision and recall (or F1) of individual words
● METEOR
● ...

What makes a metric good?
● agreement with human judgement
● computational speed

BERTScore algorithm
1. embeddings_1 ← BERT(reference translation)
2. embeddings_2 ← BERT(machine-translated sentence)
3. C ← cosine similarity matrix, i.e. C[i, j] = cos_similarity(embeddings_1[i], embeddings_2[j])
4. recall ← take the max in each row and average the results
5. precision ← take the max in each column and average the results
6. return F1(recall, precision)
(A minimal code sketch of these steps follows at the end of this deck.)

Example
Reference translation: The weather is cold today.
Machine translation: It is freezing today.
recall = avg(0.713, 0.515, 0.858, 0.796, 0.913)
precision = avg(...)
(The authors also try a variant with importance weighting of words - not all words are equally important.)

Properties
● not as fast as simple metrics (computing BERTScore requires a forward pass through BERT)
● high agreement (~0.95 correlation) with human judgement
For a detailed analysis of the agreement with human judgement, see the Experimental setup and Results sections of the original paper.

Implementations
Authors' implementation (PyTorch):
● github: https://github.com/Tiiiger/bert_score
● pypi: https://pypi.org/project/bert-score/
Hugging Face metric:
● https://huggingface.co/metrics/bertscore

Sources
https://arxiv.org/pdf/1904.09675.pdf
https://jlibovicky.github.io/2019/05/01/MT-Weekly-BERTScore.html
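
Bonus: a minimal implementation sketch
The sketch below follows the six algorithm steps from the earlier slide, using the Hugging Face transformers library and plain PyTorch. It is a simplification, not the official implementation: the model choice (bert-base-uncased) and the use of the last hidden layer are assumptions, and the real package additionally selects a tuned layer and supports idf weighting.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    # Contextual embeddings for every token; taking the last hidden
    # layer is an assumption (the official package picks a tuned layer).
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, dim)
    # Drop [CLS] and [SEP] so only real word pieces get matched.
    return hidden[1:-1]

def bert_score(reference, candidate):
    e1 = embed(reference)   # rows of C: reference tokens
    e2 = embed(candidate)   # columns of C: candidate tokens
    # Cosine similarity matrix: normalize, then one matrix product.
    e1 = e1 / e1.norm(dim=-1, keepdim=True)
    e2 = e2 / e2.norm(dim=-1, keepdim=True)
    C = e1 @ e2.T
    recall = C.max(dim=1).values.mean()     # best match per reference token
    precision = C.max(dim=0).values.mean()  # best match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()

print(bert_score("The weather is cold today.", "It is freezing today."))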
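
For real use, the authors' pip package handles layer selection, idf weighting, and baseline rescaling out of the box. A usage sketch based on its README (interface as documented at the time of writing):

# pip install bert-score
from bert_score import score

candidates = ["It is freezing today."]
references = ["The weather is cold today."]

# lang="en" selects a default English model per the package README.
P, R, F1 = score(candidates, references, lang="en")
print(f"precision={P.mean().item():.3f} "
      f"recall={R.mean().item():.3f} "
      f"F1={F1.mean().item():.3f}")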