Natural Language Processing with Deep Learning
CS224N/Ling284
Yann Dubois
Lecture 11: Benchmarking and Evaluation

Lecture overview
• Different reasons for measuring performance
• Text classification / closed-ended evaluation
• Text generation / open-ended evaluation
• Automatic evaluation
• Human evaluation
• Current evaluations of LLMs
• Issues and challenges with evaluation

Benchmarks and evaluations drive progress
• Benchmarks are how we measure and drive the progress of the field
• [Figure: model performance on the MMLU benchmark]

Two major types of evaluations
• Closed-ended evaluations
• Open-ended evaluations

Closed-ended evaluation

Closed-ended tasks
• Limited number of potential answers
• Often one or just a few correct answers
• Enables automatic evaluation, as in standard ML (a toy accuracy sketch appears after the human-evaluation notes below)

Closed-ended tasks: example benchmarks
• Sentiment analysis: SST / IMDB / Yelp …
• Entailment: SNLI
• Named entity recognition: CoNLL-2003
• Part-of-speech tagging: PTB
• Coreference resolution: WSC
• Question answering: SQuAD 2.0

Closed-ended multi-task benchmark: SuperGLUE
• An attempt to measure "general language capabilities"

Examples from SuperGLUE
Covers a number of different tasks:
• BoolQ, MultiRC (reading comprehension)
• CB, RTE (entailment)
• COPA (cause and effect)
• ReCoRD (QA + reasoning)
• WiC (meaning of words in context)
• WSC (coreference)

Open-ended evaluation

Open-ended tasks
• Long generations with too many possible correct answers to enumerate, so we can't use standard ML metrics
• There are now better and worse answers (not just right and wrong)
• Examples:
  • Summarization: CNN-DM / Gigaword
  • Translation: WMT
  • Instruction following: Chatbot Arena / AlpacaEval / MT-Bench

Types of evaluation methods for text generation
• Content overlap metrics
• Model-based metrics
• Human evaluations
Running example: Ref: "They walked to the grocery store ."  Gen: "The woman went to the hardware store ."
(Some slides repurposed from Asli Celikyilmaz's EMNLP 2020 tutorial)

Content overlap metrics
• Compute a score that indicates the lexical similarity between generated and gold-standard (human-written) text
• Fast and efficient
• N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.), computed as precision or recall over shared n-grams
• Not ideal, but often still reported for translation and summarization
• Ref: "They walked to the grocery store ."  Gen: "The woman went to the hardware store ." (a unigram-overlap sketch appears after the human-evaluation notes below)

A simple failure case
n-gram overlap metrics have no concept of semantic relatedness!
Prompt: "Are you enjoying the CS224N lectures?"  Gold answer: "Heck yes !"
• "Yes !": score 0.67
• "You know it !": score 0.25
• "Yup .": score 0 (false negative: a good answer with no lexical overlap)
• "Heck no !": score 0.67 (false positive: a contradictory answer with high overlap)

Reference-free evals
• Reference-based evaluation:
  • Compare a human-written reference to model outputs
  • Used to be the 'standard' evaluation for most NLP tasks
  • Examples: BLEU, ROUGE, BERTScore, etc.
• Reference-free evaluation:
  • Have a model give a score directly
  • No human reference needed
  • Was nonstandard – now becoming popular with GPT-4
  • Examples: AlpacaEval, MT-Bench

Human evaluations
• Automatic metrics fall short of matching human decisions
• Human evaluation is the most important form of evaluation for text generation
• It is the gold standard when developing new automatic metrics
  • New automated metrics must correlate well with human evaluations!

Human evaluations
• Ask humans to evaluate the quality of generated text
• Overall, or along some specific dimension:
  • fluency
  • coherence / consistency
  • factuality and correctness
  • commonsense
  • style / formality
  • grammaticality
  • redundancy
• For details, see Celikyilmaz, Clark, and Gao (2020)
• Note: don't compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions!
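The "Closed-ended tasks" slides above note that a fixed answer set enables automatic evaluation with standard ML metrics. Below is a minimal, purely illustrative sketch of that idea in Python; the `accuracy` helper and the sentiment labels are invented for this example (real benchmarks such as SST or SuperGLUE ship their own gold labels and official scoring scripts, and QA benchmarks like SQuAD additionally report exact match and token-level F1).

```python
def accuracy(predictions, labels):
    """Fraction of examples where the predicted label equals the gold label."""
    assert len(predictions) == len(labels)
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

# Invented sentiment-classification outputs, purely for illustration.
gold = ["positive", "negative", "positive", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(accuracy(pred, gold))  # 0.75
```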
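To make the content-overlap metrics and the "Heck yes !" failure case concrete, here is a minimal sketch of clipped unigram precision, recall, and F1. It is a bare-bones stand-in for BLEU-1 / ROUGE-1, not the exact metric behind the slide's numbers (real BLEU adds a brevity penalty and higher-order n-grams, and ROUGE variants differ), and the `unigram_overlap` helper is invented here.

```python
from collections import Counter

def unigram_overlap(reference: str, generated: str):
    """Clipped unigram precision, recall, and F1 between two strings.

    Whitespace tokenization, lowercasing, no stemming or brevity penalty:
    just enough to illustrate n-gram overlap scoring.
    """
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Each generated token counts at most as often as it appears in the reference.
    overlap = sum((ref_counts & gen_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Slide example: lexically similar but semantically different sentences still overlap.
ref = "They walked to the grocery store ."
gen = "The woman went to the hardware store ."
print(unigram_overlap(ref, gen))  # roughly (0.50, 0.57, 0.53)

# Failure case: overlap with the gold answer rewards a contradictory reply.
gold = "Heck yes !"
for candidate in ["Yes !", "You know it !", "Yup .", "Heck no !"]:
    p, r, f1 = unigram_overlap(gold, candidate)
    print(f"{candidate!r:16} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The semantically correct "Yup ." scores 0 while the contradictory "Heck no !" scores 0.67, which is exactly the false-negative / false-positive problem the slide highlights.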
Human evaluation: Issues
• Human judgments are regarded as the gold standard
• But human evaluation also has issues:
  • Slow
  • Expensive
  • Inter-annotator disagreement, especially on subjective dimensions (a Cohen's kappa sketch appears at the end of these notes)
  • Intra-annotator disagreement across time
  • Not reproducible
  • Measures precision, not recall (annotators can judge the outputs they see, but not what the model fails to generate)
  • Biases/shortcuts if incentives are not aligned (annotators maximizing $/hour)
• "just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought."

Human evaluation: Issues (continued)
• Challenges when running a human evaluation:
  • How do you describe the task?
  • How do you show the task to the humans?
  • What metric do you use?
  • Selecting the annotators
  • Monitoring the annotators: time, accuracy, …

Reference-free eval: chatbots
• How do we evaluate something like ChatGPT?
• There are so many different use cases that it is hard to evaluate
• The responses are also long-form text, which is even harder to evaluate

Side-by-side ratings
• Have people interact with two models side by side and give a thumbs-up vs. thumbs-down rating

What's missing with side-by-side human eval?
• It is the current gold standard for evaluating chat LLMs, but:
• External validity
  • Typing random questions into a head-to-head website may not be representative of real use
• Cost
  • Human annotation takes a large, community-wide effort
  • New models take a long time to benchmark
  • Only notable models get benchmarked

Lowering the costs: use an LM evaluator
• Use an LM as a reference-free evaluator (a pairwise LM-judge sketch appears at the end of these notes)
• Surprisingly high correlation with human judgments
• Common versions: AlpacaEval, MT-Bench

AlpacaFarm: human agreement
• LM evaluators are ~100x cheaper and ~100x faster than human annotators, with higher agreement with the human majority vote than a single human annotator
• Note: the same idea can also be used for RLAIF!

Evaluation: Takeaways
• Closed-ended tasks
  • Think about what you evaluate (diversity, difficulty)
• Open-ended tasks
  • Content overlap metrics (useful for low-diversity settings)
  • Chatbot evals – very difficult! Selecting the right examples / evaluation is an open problem
• Challenges
  • Consistency (hard to know if we're evaluating the right thing)
  • Contamination (can we trust the numbers?)
  • Biases
• In many cases, the best judge of output quality is YOU!
  • Look at your model generations. Don't just rely on numbers!
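One way to quantify the "inter-annotator disagreement" issue from the human-evaluation slides is chance-corrected agreement such as Cohen's kappa. The sketch below is illustrative only: the `cohens_kappa` helper and the fluency labels are invented, and real studies often use more than two annotators and measures such as Krippendorff's alpha.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa between two annotators: observed agreement corrected for chance."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    expected = sum(counts_a[label] / n * counts_b[label] / n
                   for label in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

# Two invented annotators rating the same 8 generations for fluency.
a = ["fluent", "fluent", "not", "fluent", "not", "fluent", "fluent", "not"]
b = ["fluent", "not",    "not", "fluent", "not", "fluent", "not",    "not"]
print(cohens_kappa(a, b))  # ~0.53: moderate agreement despite 75% raw agreement
```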
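The "Lowering the costs" slide suggests using an LM as a reference-free, pairwise judge in the spirit of AlpacaEval and MT-Bench: show the judge two responses to the same instruction, ask which is better, and report a win rate. The sketch below is a mock-up of that idea, not the actual AlpacaEval or MT-Bench implementation; the prompt template, the `pairwise_win_rate` function, and the toy `dummy_judge` are all assumptions, and in practice `judge` would wrap an API call to a strong model such as GPT-4.

```python
import random
from typing import Callable

# Simplified pairwise-judge prompt (real AlpacaEval / MT-Bench templates differ).
JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction: {instruction}

Response A: {output_a}

Response B: {output_b}

Which response is better? Answer with a single letter: A or B."""


def pairwise_win_rate(examples, judge: Callable[[str], str]) -> float:
    """Fraction of instructions on which the judge prefers model A over model B.

    Randomizing which model is shown as "A" mitigates the judge's position bias.
    """
    wins = 0
    for ex in examples:
        swap = random.random() < 0.5
        a, b = (ex["model_b"], ex["model_a"]) if swap else (ex["model_a"], ex["model_b"])
        prompt = JUDGE_TEMPLATE.format(instruction=ex["instruction"], output_a=a, output_b=b)
        verdict = judge(prompt).strip().upper()[:1]
        wins += (verdict == "B") if swap else (verdict == "A")
    return wins / len(examples)


def dummy_judge(prompt: str) -> str:
    """Stand-in judge for illustration only: prefers the longer response."""
    response_a = prompt.split("Response A:")[1].split("Response B:")[0]
    response_b = prompt.split("Response B:")[1].split("Which response")[0]
    return "A" if len(response_a.strip()) >= len(response_b.strip()) else "B"


examples = [
    {"instruction": "Explain overfitting in one sentence.",
     "model_a": "Overfitting is when a model memorizes its training data and fails to generalize.",
     "model_b": "It's bad."},
]
print(pairwise_win_rate(examples, dummy_judge))  # 1.0 with this toy judge
```

The same win-rate bookkeeping applies to the side-by-side human ratings slide; swapping the human for an LM judge is what makes AlpacaEval-style evaluation roughly 100x cheaper and faster.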