Natural Language Processing with Deep Learning
CS224N/Ling284
Yann Dubois
Lecture 11: Benchmarking and Evaluation

Lecture overview
• Different reasons for measuring performance
• Text classification / closed-ended evaluation
• Text generation / open-ended evaluation
• Automatic evaluation
• Human evaluation
• Current evaluations of LLMs
• Issues and challenges with evaluation

Benchmarks and evaluations drive progress
• Benchmarks are how we measure and drive the progress of the field
• [Figure: model performance on the MMLU benchmark]

Two major types of evaluations
• Closed-ended evaluations
• Open-ended evaluations

Closed-ended evaluation

Closed-ended tasks
• Limited number of potential answers
• Often one or just a few correct answers
• Enables automatic evaluation, as in standard ML (a toy accuracy sketch appears after the human-evaluation notes below)

Closed-ended tasks: example benchmarks
• Sentiment analysis: SST / IMDB / Yelp …
• Entailment: SNLI
• Named entity recognition: CoNLL-2003
• Part-of-speech tagging: PTB
• Coreference resolution: WSC
• Question answering: SQuAD 2.0

Closed-ended multi-task benchmark: SuperGLUE
• An attempt to measure "general language capabilities"

Examples from SuperGLUE
Covers a number of different tasks:
• BoolQ, MultiRC (reading comprehension)
• CB, RTE (entailment)
• COPA (cause and effect)
• ReCoRD (QA + reasoning)
• WiC (meaning of words in context)
• WSC (coreference)

Open-ended evaluation

Open-ended tasks
• Long generations with too many possible correct answers to enumerate, so we can't use standard ML metrics
• There are now better and worse answers (not just right and wrong)
• Examples:
  • Summarization: CNN-DM / Gigaword
  • Translation: WMT
  • Instruction following: Chatbot Arena / AlpacaEval / MT-Bench

Types of evaluation methods for text generation
• Content overlap metrics
• Model-based metrics
• Human evaluations
Running example: Ref: "They walked to the grocery store ."  Gen: "The woman went to the hardware store ."
(Some slides repurposed from Asli Celikyilmaz's EMNLP 2020 tutorial)

Content overlap metrics
• Compute a score that indicates the lexical similarity between generated and gold-standard (human-written) text
• Fast and efficient
• N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.), computed as precision or recall over shared n-grams
• Not ideal, but often still reported for translation and summarization
• Ref: "They walked to the grocery store ."  Gen: "The woman went to the hardware store ." (a unigram-overlap sketch appears after the human-evaluation notes below)

A simple failure case
n-gram overlap metrics have no concept of semantic relatedness!
Prompt: "Are you enjoying the CS224N lectures?"  Gold answer: "Heck yes !"
• "Yes !": score 0.67
• "You know it !": score 0.25
• "Yup .": score 0 (false negative: a good answer with no lexical overlap)
• "Heck no !": score 0.67 (false positive: a contradictory answer with high overlap)

Reference-free evals
• Reference-based evaluation:
  • Compare a human-written reference to model outputs
  • Used to be the 'standard' evaluation for most NLP tasks
  • Examples: BLEU, ROUGE, BERTScore, etc.
• Reference-free evaluation:
  • Have a model give a score directly
  • No human reference needed
  • Was nonstandard – now becoming popular with GPT-4
  • Examples: AlpacaEval, MT-Bench

Human evaluations
• Automatic metrics fall short of matching human decisions
• Human evaluation is the most important form of evaluation for text generation
• It is the gold standard when developing new automatic metrics
  • New automated metrics must correlate well with human evaluations!

Human evaluations
• Ask humans to evaluate the quality of generated text
• Overall, or along some specific dimension:
  • fluency
  • coherence / consistency
  • factuality and correctness
  • commonsense
  • style / formality
  • grammaticality
  • redundancy
• For details, see Celikyilmaz, Clark, and Gao (2020)
• Note: don't compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions!
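The "Closed-ended tasks" slides above note that a fixed answer set enables automatic evaluation with standard ML metrics. Below is a minimal, purely illustrative sketch of that idea in Python; the `accuracy` helper and the sentiment labels are invented for this example (real benchmarks such as SST or SuperGLUE ship their own gold labels and official scoring scripts, and QA benchmarks like SQuAD additionally report exact match and token-level F1).

```python
def accuracy(predictions, labels):
    """Fraction of examples where the predicted label equals the gold label."""
    assert len(predictions) == len(labels)
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

# Invented sentiment-classification outputs, purely for illustration.
gold = ["positive", "negative", "positive", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(accuracy(pred, gold))  # 0.75
```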
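To make the content-overlap metrics and the "Heck yes !" failure case concrete, here is a minimal sketch of clipped unigram precision, recall, and F1. It is a bare-bones stand-in for BLEU-1 / ROUGE-1, not the exact metric behind the slide's numbers (real BLEU adds a brevity penalty and higher-order n-grams, and ROUGE variants differ), and the `unigram_overlap` helper is invented here.

```python
from collections import Counter

def unigram_overlap(reference: str, generated: str):
    """Clipped unigram precision, recall, and F1 between two strings.

    Whitespace tokenization, lowercasing, no stemming or brevity penalty:
    just enough to illustrate n-gram overlap scoring.
    """
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Each generated token counts at most as often as it appears in the reference.
    overlap = sum((ref_counts & gen_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Slide example: lexically similar but semantically different sentences still overlap.
ref = "They walked to the grocery store ."
gen = "The woman went to the hardware store ."
print(unigram_overlap(ref, gen))  # roughly (0.50, 0.57, 0.53)

# Failure case: overlap with the gold answer rewards a contradictory reply.
gold = "Heck yes !"
for candidate in ["Yes !", "You know it !", "Yup .", "Heck no !"]:
    p, r, f1 = unigram_overlap(gold, candidate)
    print(f"{candidate!r:16} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The semantically correct "Yup ." scores 0 while the contradictory "Heck no !" scores 0.67, which is exactly the false-negative / false-positive problem the slide highlights.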
Human evaluation: Issues
• Human judgments are regarded as the gold standard
• But human evaluation also has issues:
  • Slow
  • Expensive
  • Inter-annotator disagreement, especially on subjective dimensions (a Cohen's kappa sketch appears at the end of these notes)
  • Intra-annotator disagreement across time
  • Not reproducible
  • Measures precision, not recall (annotators can judge the outputs they see, but not what the model fails to generate)
  • Biases/shortcuts if incentives are not aligned (annotators maximizing $/hour)
• "just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought."

Human evaluation: Issues (continued)
• Challenges when running a human evaluation:
  • How do you describe the task?
  • How do you show the task to the humans?
  • What metric do you use?
  • Selecting the annotators
  • Monitoring the annotators: time, accuracy, …

Reference-free eval: chatbots
• How do we evaluate something like ChatGPT?
• There are so many different use cases that it is hard to evaluate
• The responses are also long-form text, which is even harder to evaluate

Side-by-side ratings
• Have people interact with two models side by side and give a thumbs-up vs. thumbs-down rating

What's missing with side-by-side human eval?
• It is the current gold standard for evaluating chat LLMs, but:
• External validity
  • Typing random questions into a head-to-head website may not be representative of real use
• Cost
  • Human annotation takes a large, community-wide effort
  • New models take a long time to benchmark
  • Only notable models get benchmarked

Lowering the costs: use an LM evaluator
• Use an LM as a reference-free evaluator (a pairwise LM-judge sketch appears at the end of these notes)
• Surprisingly high correlation with human judgments
• Common versions: AlpacaEval, MT-Bench

AlpacaFarm: human agreement
• LM evaluators are ~100x cheaper and ~100x faster than human annotators, with higher agreement with the human majority vote than a single human annotator
• Note: the same idea can also be used for RLAIF!

Evaluation: Takeaways
• Closed-ended tasks
  • Think about what you evaluate (diversity, difficulty)
• Open-ended tasks
  • Content overlap metrics (useful for low-diversity settings)
  • Chatbot evals – very difficult! Selecting the right examples / evaluation is an open problem
• Challenges
  • Consistency (hard to know if we're evaluating the right thing)
  • Contamination (can we trust the numbers?)
  • Biases
• In many cases, the best judge of output quality is YOU!
  • Look at your model generations. Don't just rely on numbers!
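One way to quantify the "inter-annotator disagreement" issue from the human-evaluation slides is chance-corrected agreement such as Cohen's kappa. The sketch below is illustrative only: the `cohens_kappa` helper and the fluency labels are invented, and real studies often use more than two annotators and measures such as Krippendorff's alpha.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa between two annotators: observed agreement corrected for chance."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    expected = sum(counts_a[label] / n * counts_b[label] / n
                   for label in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

# Two invented annotators rating the same 8 generations for fluency.
a = ["fluent", "fluent", "not", "fluent", "not", "fluent", "fluent", "not"]
b = ["fluent", "not",    "not", "fluent", "not", "fluent", "not",    "not"]
print(cohens_kappa(a, b))  # ~0.53: moderate agreement despite 75% raw agreement
```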
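The "Lowering the costs" slide suggests using an LM as a reference-free, pairwise judge in the spirit of AlpacaEval and MT-Bench: show the judge two responses to the same instruction, ask which is better, and report a win rate. The sketch below is a mock-up of that idea, not the actual AlpacaEval or MT-Bench implementation; the prompt template, the `pairwise_win_rate` function, and the toy `dummy_judge` are all assumptions, and in practice `judge` would wrap an API call to a strong model such as GPT-4.

```python
import random
from typing import Callable

# Simplified pairwise-judge prompt (real AlpacaEval / MT-Bench templates differ).
JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction: {instruction}

Response A: {output_a}

Response B: {output_b}

Which response is better? Answer with a single letter: A or B."""


def pairwise_win_rate(examples, judge: Callable[[str], str]) -> float:
    """Fraction of instructions on which the judge prefers model A over model B.

    Randomizing which model is shown as "A" mitigates the judge's position bias.
    """
    wins = 0
    for ex in examples:
        swap = random.random() < 0.5
        a, b = (ex["model_b"], ex["model_a"]) if swap else (ex["model_a"], ex["model_b"])
        prompt = JUDGE_TEMPLATE.format(instruction=ex["instruction"], output_a=a, output_b=b)
        verdict = judge(prompt).strip().upper()[:1]
        wins += (verdict == "B") if swap else (verdict == "A")
    return wins / len(examples)


def dummy_judge(prompt: str) -> str:
    """Stand-in judge for illustration only: prefers the longer response."""
    response_a = prompt.split("Response A:")[1].split("Response B:")[0]
    response_b = prompt.split("Response B:")[1].split("Which response")[0]
    return "A" if len(response_a.strip()) >= len(response_b.strip()) else "B"


examples = [
    {"instruction": "Explain overfitting in one sentence.",
     "model_a": "Overfitting is when a model memorizes its training data and fails to generalize.",
     "model_b": "It's bad."},
]
print(pairwise_win_rate(examples, dummy_judge))  # 1.0 with this toy judge
```

The same win-rate bookkeeping applies to the side-by-side human ratings slide; swapping the human for an LM judge is what makes AlpacaEval-style evaluation roughly 100x cheaper and faster.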