Natural Language Processing with Deep Learning
CS224N/Ling284
Antoine Bosselut
Lecture 12: Neural Language Generation

What is natural language generation?
• Natural language generation (NLG) is a sub-field of natural language processing
• Focused on building systems that automatically produce coherent and useful written or spoken text for human consumption
• NLG systems are already changing the world we live in…

Machine Translation

Dialogue Systems

Summarization
• Document summarization, e-mail summarization (Wang and Cardie, ACL 2013; https://chrome.google.com/webstore/detail/gmail-summarization/), and meeting summarization (http://mogren.one/lic/)

Data-to-Text Generation
(Dusek et al., INLG 2019; Wiseman and Rush, EMNLP 2017; Parikh et al., EMNLP 2020)

Visual Description
Sentences:
1) A girl is eating donuts with a boy in a restaurant
2) A boy and girl sitting at a table with doughnuts.
3) Two kids sitting a coffee shop eating some frosted donuts
4) Two children sitting at a table eating donuts.
5) Two children eat doughnuts at a restaurant table.
Paragraph:
Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside.
(Krause et al., CVPR 2017; Karpathy & Li, CVPR 2015)

Creative Generation
• Stories & narratives (Rashkin et al., EMNLP 2020) and poetry (Ghazvininejad et al., ACL 2017)

What is natural language generation?
• Any task involving text production for human consumption requires natural language generation
• Deep learning is powering next-gen NLG systems!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Basics of natural language generation
• In autoregressive text generation models, at each time step $t$, our model takes in a sequence of tokens $\{y\}_{<t}$ as input and outputs a new token, $\hat{y}_t$
• When generating, each new token $\hat{y}_t$ is appended to the input sequence for the next step

A look at a single step
• At each time step $t$, our model computes a vector of scores for each token in our vocabulary, $S \in \mathbb{R}^{|V|}$:
$S = f(\{y\}_{<t}, \theta)$, where $f(\cdot)$ is your model
• Then, we compute a probability distribution $P$ over $w \in V$ using these scores:
$P(y_t = w \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$
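To make the single step concrete, here is a minimal PyTorch sketch of one autoregressive step. This is not the lecture's code: the tiny GRU stands in for whatever model $f(\cdot)$ you actually use, and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Stand-in pieces of a model f({y}_<t; theta): an embedding layer, a GRU, and
# a projection to one score per vocabulary token. All sizes are illustrative.
vocab_size, embed_dim, hidden_dim = 100, 32, 64
embedding = torch.nn.Embedding(vocab_size, embed_dim)
rnn = torch.nn.GRU(embed_dim, hidden_dim, batch_first=True)
proj = torch.nn.Linear(hidden_dim, vocab_size)

def next_token_distribution(prefix_ids):
    """One autoregressive step: S = f({y}_<t, theta), then P = softmax(S)."""
    emb = embedding(prefix_ids.unsqueeze(0))   # (1, t, embed_dim)
    _, h = rnn(emb)                            # final hidden state: (1, 1, hidden_dim)
    scores = proj(h.squeeze(0))                # S in R^{|V|}, shape (1, |V|)
    return F.softmax(scores, dim=-1)           # P(y_t = w | {y}_<t), sums to 1

prefix = torch.tensor([5, 17, 42])             # token ids standing in for y_1..y_3
p = next_token_distribution(prefix)            # distribution over the next token
```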
Basics: What are we trying to do?
• At inference time, our decoding algorithm defines a function to select a token from this distribution:
$\hat{y}_t = g(P(y_t \mid \{y\}_{<t}))$, where $g(\cdot)$ is your decoding algorithm
• We train the model to minimize the negative log-likelihood of predicting the next token in the sequence:
$\mathcal{L}_t = -\log P(y_t^* \mid \{y^*\}_{<t})$, and we sum $\mathcal{L}_t$ over the entire sequence
• Note: this is just a classification task where each $w \in V$ is a class
• The label at each step is the actual word $y_t^*$ in the training sequence; this token is often called the "gold" or "ground truth" token
• This algorithm is often called "teacher forcing"

Maximum Likelihood Training (i.e., teacher forcing)
• Trained to generate the next word $y_t^*$ given a set of preceding words $\{y^*\}_{<t}$; the loss accumulates one term per step:
$\mathcal{L} = -\log P(y_2^* \mid y_1^*)$
$\mathcal{L} = -(\log P(y_2^* \mid y_1^*) + \log P(y_3^* \mid y_1^*, y_2^*))$
$\mathcal{L} = -(\log P(y_2^* \mid y_1^*) + \log P(y_3^* \mid y_1^*, y_2^*) + \log P(y_4^* \mid y_1^*, y_2^*, y_3^*))$
• For a full sequence of length $T$:
$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^* \mid \{y^*\}_{<t})$
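A hedged sketch of this training objective, reusing the same kind of toy model as above (the TinyLM class and all sizes are stand-ins, not the lecture's code). Note that cross-entropy over shifted targets is exactly the summed negative log-likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in autoregressive model f: token ids -> scores S at every position."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids):                        # ids: (batch, seq_len)
        out, _ = self.rnn(self.embedding(ids))
        return self.proj(out)                      # (batch, seq_len, |V|)

def teacher_forcing_loss(model, gold_ids):
    """L = -sum_t log P(y*_t | y*_<t): gold tokens are always the inputs."""
    inputs, targets = gold_ids[:, :-1], gold_ids[:, 1:]   # labels shifted by one
    logits = model(inputs)                                # scores S at every step
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="sum")

model = TinyLM()
gold = torch.randint(0, 100, (1, 12))     # a stand-in "gold" training sequence
loss = teacher_forcing_loss(model, gold)  # then loss.backward() and an optimizer step
```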
Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Decoding: what is it all about?
• At each time step $t$, our model computes a vector of scores for each token in our vocabulary, $S \in \mathbb{R}^{|V|}$: $S = f(\{y\}_{<t})$, where $f(\cdot)$ is your model
• Then, we compute a probability distribution $P$ over these scores (usually with a softmax function):
$P(y_t = w \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$
• Our decoding algorithm defines a function $g(\cdot)$ to select a token from this distribution; at generation time, previously decoded tokens become part of the context:
$\hat{y}_t = g(P(y_t \mid \{y^*\}, \{\hat{y}\}_{<t}))$

Greedy methods
• Recall: Lecture 7 on Neural Machine Translation…
• Argmax decoding selects the highest-probability token in $P(y_t \mid y_{<t})$:
$\hat{y}_t = \operatorname*{argmax}_{w \in V} P(y_t = w \mid y_{<t})$
• Beam search (discussed in Lecture 7) is also a greedy algorithm, but with a wider search over candidates

Greedy methods get repetitive
Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…
(Holtzman et al., ICLR 2020)

Why does repetition happen? And it keeps going…
[figure omitted: per-timestep probabilities of a repeated phrase under the model (Holtzman et al., ICLR 2020)]

How can we reduce repetition?
• Simple option: a heuristic that forbids repeating n-grams
• More complex options:
  • Minimize embedding distance between consecutive sentences (Celikyilmaz et al., 2018); doesn't help with intra-sentence repetition
  • Coverage loss (See et al., 2017): prevents the attention mechanism from attending to the same words
  • Unlikelihood objective (Welleck et al., 2020): penalize generation of already-seen tokens

Are greedy methods reasonable?
[figure omitted: per-timestep probability of human-written text vs. beam search output (Holtzman et al., ICLR 2020)]

Time to get random: Sampling!
• Sample a token from the distribution of tokens: $\hat{y}_t \sim P(y_t = w \mid \{y\}_{<t})$
• It's random, so you can sample any token! (e.g., "He wanted to go to the ___": restroom, grocery store, airport, pub, gym, bathroom, beach, doctor, hospital, …)

Decoding: Top-k sampling
• Problem: vanilla sampling makes every token in the vocabulary an option
  • Even if most of the probability mass in the distribution covers a limited set of options, the tail of the distribution can be very long
  • Many tokens are probably irrelevant in the current context
  • Why are we giving them individually a tiny chance to be selected, and as a group a high chance to be selected?
• Solution: Top-k sampling (Fan et al., ACL 2018; Holtzman et al., ACL 2018)
  • Only sample from the top k tokens in the probability distribution
  • Common values are k = 5, 10, 20 (but it's up to you!)
  • Increase k for more diverse/risky outputs; decrease k for more generic/safe outputs

Issues with Top-k sampling
• Top-k sampling can cut off too quickly, and it can also cut off too slowly! (Holtzman et al., ICLR 2020)

Decoding: Top-p (nucleus) sampling
• Problem: the probability distributions we sample from are dynamic
  • When the distribution $P_t$ is flatter, a limited k removes many viable options
  • When the distribution $P_t$ is peakier, a high k allows too many options to have a chance of being selected
• Solution: Top-p sampling (Holtzman et al., ICLR 2020)
  • Sample from all tokens in the top p cumulative probability mass (i.e., where mass is concentrated)
  • This effectively varies k depending on the uniformity of $P_t$

Scaling randomness: Softmax temperature
• Recall: on timestep $t$, the model computes a probability distribution $P_t$ by applying the softmax function to a vector of scores $S \in \mathbb{R}^{|V|}$:
$P_t(y_t = w) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$
• You can apply a temperature hyperparameter $\tau$ to the softmax to rebalance $P_t$:
$P_t(y_t = w) = \frac{\exp(S_w / \tau)}{\sum_{w' \in V} \exp(S_{w'} / \tau)}$
• Raise the temperature ($\tau > 1$): $P_t$ becomes more uniform, giving more diverse output (probability is spread around the vocab)
• Lower the temperature ($\tau < 1$): $P_t$ becomes more spiky, giving less diverse output (probability is concentrated on the top words)
• Note: softmax temperature is not a decoding algorithm! It's a technique you can apply at test time, in conjunction with a decoding algorithm (such as beam search or sampling); see the sketch below
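A minimal sketch of these decoding choices operating on a single scores vector (function name, strategy labels, and defaults are illustrative, not canonical):

```python
import torch
import torch.nn.functional as F

def decode_step(scores, strategy="top_p", k=20, p=0.9, temperature=1.0):
    """Select the next token id from vocabulary scores S (shape (|V|,))."""
    probs = F.softmax(scores / temperature, dim=-1)   # temperature rebalances P_t
    if strategy == "argmax":                          # greedy: highest-probability token
        return probs.argmax().item()
    if strategy == "top_k":                           # zero out all but the top k tokens
        top = torch.topk(probs, k)
        probs = torch.zeros_like(probs).scatter_(0, top.indices, top.values)
    elif strategy == "top_p":                         # keep the smallest set covering mass p
        sorted_p, idx = probs.sort(descending=True)
        already_covered = sorted_p.cumsum(0) - sorted_p   # mass before each token
        sorted_p[already_covered >= p] = 0.0              # drop tokens outside the nucleus
        probs = torch.zeros_like(probs).scatter_(0, idx, sorted_p)
    probs = probs / probs.sum()                       # renormalize truncated distribution
    return torch.multinomial(probs, 1).item()         # sample one token id

scores = torch.randn(100)                             # stand-in S for a 100-word vocab
next_id = decode_step(scores, strategy="top_p", p=0.9)
```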
Improving decoding: re-balancing distributions
• Problem: what if I don't trust how well my model's distributions are calibrated? Don't rely on ONLY your model's distribution over tokens
• Solution #1: re-balance $P_t$ using retrieval from n-gram phrase statistics (Khandelwal et al., ICLR 2020)
  • Cache a database of phrases from your training corpus (or some other corpus)
  • At decoding time, search for the most similar phrases in the database
  • Re-balance $P_t$ using the induced distribution $P_{\text{phrase}}$ over the words that follow these phrases
[figure omitted: kNN-LM; a test context ("Obama's birthplace is ?") retrieves nearby training contexts ("Obama was born in", "Obama is a native of"), and the normalized distribution over their target words (Hawaii, Illinois, …) is interpolated with the model's distribution (Khandelwal et al., ICLR 2020)]

Backpropagation-based distribution re-balancing
• Can I re-balance my language model's distribution to encourage other behaviors? Yes!
• Define a model that evaluates the behavior you want (e.g., sentiment, perplexity)
• Use soft token distributions (e.g., Gumbel softmax: $P_t$ with a tiny temperature $\tau$) as inputs to the evaluator
• Backpropagate gradients directly to your language model and update $P_t$ (Dathathri et al., ICLR 2020; Qin et al., EMNLP 2020)
[figure omitted: PPLM; an attribute model $p(a|x)$ scores the LM's continuation of "The chicken tastes", and its gradients update the LM's latents, shifting the output distribution from "ok" toward "delicious" (Dathathri et al., ICLR 2020)]

Improving Decoding: Re-ranking
• Problem: what if I decode a bad sequence from my model?
• Decode a bunch of sequences (10 candidates is a common number, but it's up to you)
• Define a score to approximate the quality of the sequences and re-rank by this score
  • Simplest is to use perplexity!
  • Careful! Remember that repetitive sequences generally get low perplexity, so this score alone can reward degenerate text
• Re-rankers can score a variety of properties: style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more…
• Beware poorly-calibrated re-rankers
• Can use multiple re-rankers in parallel
• A perplexity re-ranking sketch follows below
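A hedged sketch of perplexity-based re-ranking. Here `model` is assumed to be any callable mapping (batch, T) token ids to (batch, T, |V|) scores, such as the TinyLM stand-in defined earlier; function names are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def sequence_perplexity(model, ids):
    """Per-token perplexity of one candidate (ids: (1, T)) under the model."""
    logits = model(ids[:, :-1])                           # scores for each next token
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          ids[:, 1:].reshape(-1), reduction="mean")
    return math.exp(nll.item())

def rerank(model, candidates):
    """Sort decoded candidates best-first by (lowest) perplexity.
    Caveat from above: repetitive text also gets low perplexity, so this
    score alone can prefer degenerate candidates."""
    return sorted(candidates, key=lambda ids: sequence_perplexity(model, ids))
```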
Decoding: Takeaways
• Decoding is still a challenging problem in natural language generation
• The human language distribution is noisy and doesn't reflect simple properties (i.e., probability maximization)
• Different decoding algorithms allow us to inject biases that encourage different properties of coherent natural language generation
• Some of the most impactful advances in NLG of the last few years have come from simple but effective modifications to decoding algorithms
• A lot more work to be done!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Maximum Likelihood Training (i.e., teacher forcing)
• Trained to minimize the negative log-likelihood of the next token $y_t^*$ given the preceding tokens in the sequence $\{y^*\}_{<t}$:
$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^* \mid \{y^*\}_{<t})$

Are greedy decoders bad because of how they're trained?
• Recall the repetitive "unicorns… (UNAM/Universidad Nacional Autónoma de México/…)" continuation from the decoding section (Holtzman et al., ICLR 2020)

Diversity Issues
• Maximum Likelihood Estimation discourages diverse text generation

Unlikelihood Training (Welleck et al., 2020)
• Given a set of undesired tokens $\mathcal{C}$, lower their likelihood in context:
$\mathcal{L}_{UL}^t = -\sum_{y_{\text{neg}} \in \mathcal{C}} \log(1 - P(y_{\text{neg}} \mid \{y^*\}_{<t}))$
• Keep the teacher forcing objective and combine them for the final loss function:
$\mathcal{L}_{MLE}^t = -\log P(y_t^* \mid \{y^*\}_{<t})$
$\mathcal{L}_{ULE}^t = \mathcal{L}_{MLE}^t + \alpha \mathcal{L}_{UL}^t$
• Set $\mathcal{C} = \{y^*\}_{<t}$ and you'll train the model to lower the likelihood of previously-seen tokens!
  • Limits repetition!
  • Increases the diversity of the text you learn to generate!
  • (A sketch of this loss follows below.)
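A minimal token-level sketch of this objective, assuming $\mathcal{C}_t$ is the set of previously-seen gold tokens (a simplification of Welleck et al., 2020, not their released implementation):

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, gold_ids, alpha=1.0):
    """Sketch of L_ULE = L_MLE + alpha * L_UL at the token level.
    logits: (T, |V|) scores at each step; gold_ids: (T,) gold next tokens.
    C_t = gold tokens already seen before step t."""
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, gold_ids, reduction="sum")  # -sum_t log P(y*_t | ...)
    probs = log_probs.exp()
    ul = torch.tensor(0.0)
    for t in range(1, len(gold_ids)):
        neg = gold_ids[:t].unique()                 # previously-seen tokens: C_t
        neg = neg[neg != gold_ids[t]]               # don't penalize the current gold token
        if len(neg):                                # -sum_neg log(1 - P(y_neg | ...))
            ul = ul - torch.log1p(-probs[t, neg].clamp(max=1 - 1e-6)).sum()
    return mle + alpha * ul
```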
Exposure Bias
• Training with teacher forcing leads to exposure bias at generation time
• During training, our model's inputs are gold context tokens from real, human-generated texts:
$\mathcal{L}_{MLE} = -\log P(y_t^* \mid \{y^*\}_{<t})$
• At generation time, our model's inputs are previously decoded tokens:
$\mathcal{L}_{dec} = -\log P(\hat{y}_t \mid \{\hat{y}\}_{<t})$

Exposure Bias Solutions
• Scheduled sampling (Bengio et al., 2015)
  • With some probability p, decode a token and feed that as the next input, rather than the gold token
  • Increase p over the course of training
  • Leads to improvements in practice, but can lead to strange training objectives
• Dataset Aggregation (DAgger; Ross et al., 2011)
  • At various intervals during training, generate sequences from your current model
  • Add these sequences to your training set as additional examples
• Sequence re-writing (Guu*, Hashimoto* et al., 2018)
  • Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses)
  • Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype
• Reinforcement learning: cast your text generation model as a Markov decision process
  • State s is the model's representation of the preceding context
  • Actions a are the words that can be generated
  • Policy $\pi$ is the decoder
  • Rewards r are provided by an external score
  • Learn behaviors by rewarding the model when it exhibits them

REINFORCE: Basics
• Sample a sequence from your model, then score it:
$\mathcal{L}_{RL} = -\sum_{t=1}^{T} r(\hat{y}_t) \log P(\hat{y}_t \mid \{y^*\}; \{\hat{y}\}_{<t})$
• Intuition: next time, increase the probability of the sampled token in the same context… but do it more if the reward function returns a high reward

Reward Estimation
• How should we define a reward function? Just use your evaluation metric!
  • BLEU (machine translation; Ranzato et al., ICLR 2016; Wu et al., 2016)
  • ROUGE (summarization; Paulus et al., ICLR 2018; Celikyilmaz et al., NAACL 2018)
  • CIDEr (image captioning; Rennie et al., CVPR 2017)
  • SPIDEr (image captioning; Liu et al., ICCV 2017)
• Be careful about optimizing for the task as opposed to "gaming" the reward!
  • Evaluation metrics are merely proxies for generation quality!
  • "even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality" – Wu et al., 2016
• What behaviors can we tie to rewards?
  • Cross-modality consistency in image captioning (Ren et al., CVPR 2017)
  • Sentence simplicity (Zhang and Lapata, EMNLP 2017)
  • Temporal consistency (Bosselut et al., NAACL 2018)
  • Utterance politeness (Tan et al., TACL 2018)
  • Paraphrasing (Li et al., EMNLP 2018)
  • Sentiment (Gong et al., NAACL 2019)
  • Formality (Gong et al., NAACL 2019)
• If you can formalize a behavior as a reward function (or train a neural network to approximate it!), you can train a text generation model to exhibit that behavior!

The dark side…
• Need to pretrain a model with teacher forcing before doing RL training; your reward function probably expects coherent language inputs…
• Need to set an appropriate baseline b, subtracted from the reward:
$\mathcal{L}_{RL} = -\sum_{t=1}^{T} (r(\hat{y}_t) - b) \log P(\hat{y}_t \mid \{y^*\}; \{\hat{y}\}_{<t})$
  • Use linear regression to predict it from the state s (Ranzato et al., 2015)
  • Decode a second sequence and use its reward as the baseline (Rennie et al., 2017)
• Your model will learn the easiest way to exploit your reward function
  • Mitigate these shortcuts or hope that's aligned with the behavior you want!
• A REINFORCE sketch follows below
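A hedged REINFORCE sketch using a single sequence-level reward with a baseline; the lecture's formulation allows per-token rewards $r(\hat{y}_t)$, and a scalar reward is a common simplification. The reward and baseline values below are fake stand-ins for a metric score and a baseline estimate.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, sampled_ids, reward, baseline=0.0):
    """L = -(r - b) * sum_t log P(yhat_t | context), for one sampled sequence.
    logits: (T, |V|) scores at each step along the *sampled* sequence;
    sampled_ids: (T,) the tokens that were sampled."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)
    return -(reward - baseline) * chosen.sum()   # high reward pushes samples up

# Stand-in usage: reward/baseline would come from e.g. BLEU/ROUGE and a second
# decoded sequence (Rennie et al., 2017); the numbers here are fake.
logits = torch.randn(8, 100, requires_grad=True)
sampled = torch.randint(0, 100, (8,))
loss = reinforce_loss(logits, sampled, reward=0.7, baseline=0.5)
```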
Training: Takeaways
• Teacher forcing is still the premier algorithm for training text generation models
• Diversity is an issue with sequences generated from teacher-forced models
  • New approaches focus on mitigating the effects of common words
• Exposure bias causes text generation models to lose coherence easily
  • Models must learn to recover from their own bad samples (e.g., scheduled sampling, DAgger)
  • Or not be allowed to generate bad text to begin with (e.g., retrieval + generation)
• Training with RL can allow models to learn behaviors that are challenging to formalize
  • Learning can be very unstable!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Types of evaluation methods for text generation
• Content overlap metrics, model-based metrics, and human evaluations
• Running example: Ref: "They walked to the grocery store." / Gen: "The woman went to the hardware store."
(Some slides repurposed from Asli Celikyilmaz's EMNLP 2020 tutorial)

Content overlap metrics
• Compute a score that indicates the similarity between generated and gold-standard (human-written) text
• Fast, efficient, and widely used
• Two broad categories:
  • N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr)
  • Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr)

Word overlap based metrics (BLEU, ROUGE, METEOR, CIDEr, F1, etc.)
• They're not ideal, even for machine translation
• They get progressively much worse for tasks that are more open-ended than machine translation
  • Worse for summarization, as longer output texts are harder to measure and extractive methods that copy from documents are preferred
  • Much worse for dialogue, which is more open-ended than summarization
  • Much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem like you're getting decent scores!

A simple failure case
• Reference answer to "Are you going to Antoine's incredible CS224N lecture?": "Heck yes!"
• N-gram overlap scores for candidate responses:
  • "You know it!" → 0.61
  • "Yes!" → 0.25
  • "Yup." → 0 (false negative: right meaning, no overlap)
  • "Heck no!" → 0.67 (false positive: wrong meaning, high overlap)
• N-gram overlap metrics have no concept of semantic relatedness! (A scoring sketch follows below.)

A more comprehensive failure analysis
• Word-overlap metrics correlate poorly with human judgments of dialogue quality (Liu et al., EMNLP 2016)
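To make the failure case concrete, here is a tiny unigram-F1 overlap scorer, an illustrative stand-in for BLEU/ROUGE-style metrics rather than any official implementation:

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of unigram precision and recall via clipped overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())          # matched unigram counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("heck yes !", "heck no !"))      # ~0.67: high score, wrong meaning
print(unigram_f1("heck yes !", "yup ."))          # 0.0: zero score, right meaning
```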
Semantic overlap metrics
• SPICE: Semantic Propositional Image Caption Evaluation, an image captioning metric that first parses the reference text into an abstract scene graph representation (Anderson et al., 2016)
• SPIDEr: a combination of semantic graph similarity (SPICE) and an n-gram similarity measure (CIDEr); SPIDEr yields a more complete quality evaluation metric (Liu et al., 2017)
• PYRAMID: incorporates human content-selection variation in summarization evaluation; identifies Summarization Content Units (SCUs) to compare information content in summaries (Nenkova et al., 2007)

Model-based metrics
• Use learned representations of words and sentences to compute semantic similarity between generated and reference texts
• No more n-gram bottleneck, because text units are represented as embeddings!
• Even though the embeddings are pretrained, the distance metrics used to measure similarity can be fixed

Model-based metrics: Word distance functions
• Vector similarity: embedding-based similarity for semantic distance between texts; variants include Embedding Average (Liu et al., 2016), Vector Extrema (Liu et al., 2016), MEANT (Lo, 2017), and YISI (Lo, 2019)
• Word Mover's Distance: measures the distance between two sequences (e.g., sentences, paragraphs) using word embedding similarity matching (Kusner et al., 2015; Zhao et al., 2019)
• BERTSCORE: uses pretrained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020); a sketch of the core matching step follows below
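A minimal sketch of the greedy cosine-matching step at the heart of BERTScore-style metrics. The random embeddings are stand-ins for real encoder outputs, and this omits BERTScore's IDF weighting and score rescaling:

```python
import torch
import torch.nn.functional as F

def greedy_match_f1(ref_emb, cand_emb):
    """Greedy cosine matching between token embeddings (sketch only).
    ref_emb: (n_ref, d) and cand_emb: (n_cand, d) contextual token embeddings."""
    sim = F.normalize(ref_emb, dim=-1) @ F.normalize(cand_emb, dim=-1).T  # cosine matrix
    recall = sim.max(dim=1).values.mean()      # each reference token's best match
    precision = sim.max(dim=0).values.mean()   # each candidate token's best match
    return (2 * precision * recall / (precision + recall)).item()

# Random stand-ins; in practice these come from a pretrained encoder (e.g., BERT).
ref_emb, cand_emb = torch.randn(5, 768), torch.randn(7, 768)
score = greedy_match_f1(ref_emb, cand_emb)
```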
Model-based metrics: Beyond word matching
• Sentence Mover's Similarity: based on Word Mover's Distance; evaluates text in a continuous space using sentence embeddings from recurrent neural network representations (Clark et al., 2019)
• BLEURT: a regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020)

Human evaluations
• Automatic metrics fall short of matching human decisions
• Most important form of evaluation for text generation systems: >75% of generation papers at ACL 2019 included human evaluations
• Gold standard in developing new automatic metrics: new automated metrics must correlate well with human evaluations!
• Ask humans to evaluate the quality of generated text, either overall or along some specific dimension: fluency, coherence/consistency, factuality and correctness, commonsense, style/formality, grammaticality, typicality, redundancy (for details, see Celikyilmaz, Clark, and Gao, 2020)
• Note: don't compare human evaluation scores across differently-conducted studies, even if they claim to evaluate the same dimensions!

Human evaluation: Issues
• Human judgments are regarded as the gold standard, and of course human evaluation is slow and expensive… but are those the only problems?
• No! Conducting human evaluation effectively is very difficult. Humans:
  • are inconsistent
  • can be illogical
  • lose concentration
  • misinterpret your question
  • can't always explain why they feel the way they do

Learning from human feedback
• ADEM: a metric learned from human judgments for dialogue system evaluation in a chatbot setting (Lowe et al., 2017)
• HUSE: Human Unified with Statistical Evaluation; determines the similarity of the output distribution and a human reference distribution (Hashimoto et al., 2019)

Evaluation: Takeaways
• Content overlap metrics provide a good starting point for evaluating the quality of generated text, but they're not good enough on their own
• Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable
• Human judgments are critical: they're the only ones that can directly evaluate factuality (is the model saying correct things?), but humans are inconsistent!
• In many cases, the best judge of output quality is YOU! Look at your model generations; don't just rely on numbers!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Warning: some of the content on the next few slides may be disturbing.

Ethics of text generation systems
• Tay: a chatbot released by Microsoft in 2016
• Within 24 hours, it started making toxic racist and sexist comments
• What went wrong? (https://en.wikipedia.org/wiki/Tay_(bot))

Ethics: Biases in text generation models
• Text generation models are often constructed from pretrained language models
• Language models learn harmful patterns of bias from large language corpora
• When prompted for this information, they repeat negative stereotypes (Sheng et al., EMNLP 2019)

Hidden Biases: Universal adversarial triggers
• The learned behaviors of text generation models are opaque
• Adversarial inputs can trigger VERY toxic content
• These models can be exploited in open-world contexts by ill-intentioned users (Wallace et al., EMNLP 2019)

Hidden Biases: Triggered innocuously
• Pretrained language models can degenerate into toxic text even from seemingly innocuous prompts
• Models should not be deployed without proper safeguards to control for toxic content
• Models should not be deployed without careful consideration of how users will interact with them (Gehman et al., EMNLP Findings 2020)

Ethics: Think about what you're building
• Large-scale pretrained language models allow us to build NLG systems for many new applications
• Does the content we're building a system to automatically generate… really need to be generated? (Zellers et al., NeurIPS 2019)

Concluding Thoughts
• Interacting with natural language generation systems quickly shows their limitations
• Even in tasks with more progress, there are still many improvements ahead
• Evaluation remains a huge challenge: we need better ways of automatically evaluating the performance of NLG systems
• With the advent of large-scale language models, deep NLG research has been reset; it's never been easier to jump into the space!
• One of the most exciting areas of NLP to work in!