Natural Language Processing with Deep Learning
CS224N/Ling284
Xiang Lisa Li
Lecture 12: Neural Language Generation
Adapted from slides by Antoine Bosselut and Chris Manning

Today: Natural Language Generation
1. What is NLG?
2. A review: neural NLG model and training algorithm
3. Decoding from NLG models
4. Training NLG models
5. Evaluating NLG Systems
6. Ethical Considerations

What is natural language generation?
Natural language generation is one side of natural language processing:
NLP = Natural Language Understanding (NLU) + Natural Language Generation (NLG)
NLG focuses on systems that produce fluent, coherent, and useful language output for human consumption.
Deep learning is powering next-gen NLG systems!

Example uses of natural language generation
• Machine translation systems use NLG. Input: utterances in the source language. Output: translated text in the target language.
• Digital assistant (dialogue) systems use NLG. Input: dialogue history. Output: text that responds to / continues the conversation.
• Summarization systems (for research articles, email, meetings, documents) use NLG. Input: long documents. Output: summaries of those documents.

More interesting NLG uses
• Visual description (Krause et al., CVPR 2017)
• Data-to-text, e.g., "Craig finished his eleven NFL seasons with 8,189 rushing yards and 566 receptions for 4,911 receiving yards." (Parikh et al., EMNLP 2020)
• Creative stories (Rashkin et al., EMNLP 2020)

SOTA NLG systems
ChatGPT is an NLG system! It's general purpose and can do many NLG tasks, e.g., chatbot conversation and poetry generation.

Categorization of NLG tasks: the spectrum of open-endedness
At the less open-ended end sit machine translation and summarization.
Source sentence: 当局已经宣布今天是节假日。
Reference translations:
1. Authorities have announced a national holiday today.
2. Authorities have announced that today is a national holiday.
3. Today is a national holiday, announced by the authorities.
The output space is not very diverse.

Moving along the spectrum (task-driven dialogue, then chitchat dialogue), the output space gets more diverse.
Input: Hey, how are you?
Outputs:
1. Good! You?
2. I just heard some exciting news, do you want to hear it?
3. Thx for asking! Barely surviving my hws.

At the far end sits story generation.
Input: Write a story about three little pigs.
Outputs: … (so many options) …
The output space is extremely diverse.

Categorization of NLG tasks
Machine Translation · Summarization · Task-driven Dialog · ChitChat Dialog · Story Generation (ordered from less to more open-ended)
• Open-ended generation: the output distribution still has high freedom.
• Non-open-ended generation: the input mostly determines the output generation.
Remark: one way of formalizing this categorization is by entropy.
These two classes of NLG tasks require different decoding and/or training approaches!

Today: Natural Language Generation
Next: 2. A review: neural NLG model and training algorithm

Basics of natural language generation (review of Lecture 5)
• In autoregressive text generation models, at each time step t, our model takes in a sequence of tokens {y}_{<t} as input and outputs a new token, ŷ_t.
• For model f(·) and vocab V, we get scores S = f({y}_{<t}; θ) ∈ ℝ^{|V|}, which a softmax turns into a next-token distribution:
P(y_t = w | {y}_{<t}) = exp(S_w) / Σ_{w′ ∈ V} exp(S_{w′})
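To make the score-then-softmax step concrete, here is a minimal numpy sketch; the five-word vocabulary and the score vector are hypothetical toy values standing in for a real model's output.

```python
# Turning the model's scores S = f({y}_<t; θ) ∈ ℝ^|V| into a next-token
# distribution with a softmax. Vocabulary and scores are toy stand-ins.
import numpy as np

vocab = ["the", "store", "restroom", "airport", "<eos>"]
S = np.array([2.0, 1.2, 0.3, -0.5, -1.0])   # hypothetical scores for this step

P = np.exp(S - S.max())   # subtract the max for numerical stability
P = P / P.sum()           # P(y_t = w | y_<t) = exp(S_w) / Σ_w' exp(S_w')

print({w: round(float(p), 3) for w, p in zip(vocab, P)})
```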
Basics of natural language generation (review of Lecture 5)
• For non-open-ended tasks (e.g., MT), we typically use an encoder-decoder system, where this autoregressive model serves as the decoder, and we have a separate bidirectional encoder for encoding the inputs.
• For open-ended tasks (e.g., story generation), this autoregressive generation model is often the only component.

Trained one token at a time by maximum likelihood
• The model is trained to maximize the probability of the next token y*_t given the preceding gold tokens {y*}_{<t}:
ℒ = −Σ_{t=1}^{T} log P(y*_t | {y*}_{<t})
• This is a classification task at each time step, trying to predict the actual word y*_t in the training data.
• Doing this is often called "teacher forcing" (because you reset to the ground truth at each time step).
[Diagram: at each step the gold prefix y*_1 … y*_{t−1} is fed in, and the model is trained to predict y*_t.]
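To make teacher forcing concrete, here is a minimal PyTorch sketch; the tiny embedding-LSTM model and the random token batch are illustrative stand-ins, not a real LM or corpus.

```python
# Teacher forcing: at every position the model conditions on the *gold*
# prefix and is trained to classify the next gold token.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

gold = torch.randint(0, vocab_size, (4, 12))   # toy batch of gold sequences y*
inputs, targets = gold[:, :-1], gold[:, 1:]    # shift by one: predict y*_t from y*_<t

states, _ = lstm(embed(inputs))                # inputs are always gold tokens
logits = head(states)                          # scores over the vocab at each step

# L = -Σ_t log P(y*_t | y*_<t): one classification loss per time step
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
```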
Basics of natural language generation (review of Lecture 5)
• At inference time, our decoding algorithm defines a function to select a token from this distribution:
ŷ_t = g(P(y_t | {y}_{<t})), where g(·) is your decoding algorithm.
• The "obvious" decoding algorithm is to greedily choose the highest-probability next token according to the model at each time step.
• While this basic algorithm sort of works, the two main avenues for doing better are to:
1. Improve decoding
2. Improve the training
(Of course, there's also improving your training data or model architecture.)

Today: Natural Language Generation
Next: 3. Decoding from NLG models

Decoding: what is it all about?
• At each time step t, our model computes a vector of scores for each token in our vocabulary, S ∈ ℝ^{|V|}: S = f({y}_{<t}), where f(·) is your model.
• Then, we compute a probability distribution P over these scores with a softmax function:
P(y_t = w | {y}_{<t}) = exp(S_w) / Σ_{w′ ∈ V} exp(S_{w′})
• Our decoding algorithm defines a function to select a token from this distribution:
ŷ_t = g(P(y_t | {y}_{<t})), where g(·) is your decoding algorithm.

How to find the most likely string?
• Recall Lecture 7 on neural machine translation…
• Greedy decoding: select the highest-probability token in P(y_t | y_{<t}):
ŷ_t = argmax_{w ∈ V} P(y_t = w | y_{<t})
• Beam search (also discussed in Lecture 7): likewise aims to find strings that maximize the log-probability, but with wider exploration of candidates.
Overall, maximum-probability decoding is good for low-entropy tasks like MT and summarization!

The most likely string is repetitive for open-ended generation
Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…
(Holtzman et al., ICLR 2020)

Why does repetition happen? It is a self-amplification effect: each repetition raises the probability of repeating again, and it keeps going (Holtzman et al., ICLR 2020). Scale doesn't solve this problem: even a 175-billion-parameter LM still repeats when we decode for the most likely string.

How can we reduce repetition?
Simple option:
• Heuristic: don't repeat n-grams (a minimal sketch follows this list).
More complex options:
• Use a different training objective:
• Unlikelihood objective (Welleck et al., 2020): penalize generation of already-seen tokens.
• Coverage loss (See et al., 2017): prevents the attention mechanism from attending to the same words.
• Use a different decoding objective:
• Contrastive decoding (Li et al., 2022): searches for strings x that maximize log p_largeLM(x) − log p_smallLM(x).
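Here is a minimal sketch of that n-gram blocking heuristic, assuming integer token IDs; before each decoding step, the returned tokens get their probability zeroed out so no n-gram is ever generated twice.

```python
# "Don't repeat n-grams": find tokens that would complete an n-gram
# already present in the generated sequence.
def banned_tokens(generated, n=3):
    """Tokens that, given the last n-1 generated tokens, repeat an n-gram."""
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])        # current (n-1)-token context
    banned = set()
    for i in range(len(generated) - n + 1):     # scan all previous n-grams
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])    # ban the token that followed
    return banned

# After generating [5, 8, 2, 5, 8], token 2 may not follow (5, 8) again:
print(banned_tokens([5, 8, 2, 5, 8], n=3))      # {2}
```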
Is finding the most likely string reasonable for open-ended generation?
[Figure (Holtzman et al., ICLR 2020): per-timestep probability of each generated token; beam-search text stays uniformly high-probability, while human text fluctuates.]
It fails to match the uncertainty distribution of human-generated text.

Time to get random: sampling!
• Sample a token from the distribution of tokens: ŷ_t ∼ P(y_t = w | {y}_{<t})
• It's random, so you can sample any token!
Example: for the prefix "He wanted to go to the", the model's candidates include restroom, grocery store, airport, pub, gym, bathroom, beach, doctor, hospital, …

Decoding: top-k sampling
• Problem: vanilla sampling makes every token in the vocabulary an option.
• Even if most of the probability mass in the distribution is on a limited set of options, the tail of the distribution can be very long and in aggregate have considerable mass (in statistics speak: we have "heavy-tailed" distributions).
• Many tokens are probably really wrong in the current context. Each wrong token individually gets only a tiny chance of being selected, but because there are so many of them, as a group they still get a high chance of being selected.
• Solution: top-k sampling: only sample from the top k tokens in the probability distribution (Fan et al., ACL 2018; Holtzman et al., ACL 2018).
• Common values are k = 50 (but it's up to you!).
• Increasing k yields more diverse but riskier outputs; decreasing k yields safer but more generic outputs.

Issues with top-k sampling
• Top-k sampling can cut off too quickly!
• Top-k sampling can also cut off too slowly!
(Holtzman et al., ICLR 2020)

Decoding: top-p (nucleus) sampling
• Problem: the probability distributions we sample from are dynamic.
• When the distribution P_t is flatter, a limited k removes many viable options.
• When the distribution P_t is peakier, a high k allows too many options a chance of being selected.
• Solution: top-p sampling: sample from all tokens within the top p cumulative probability mass (i.e., where the mass is concentrated).
• This effectively varies k depending on the uniformity of P_t.
[Figure (Holtzman et al., ICLR 2020): the selected nucleus adapts as the shape of P_t changes from timestep to timestep.]

Scaling randomness: temperature
• Recall: on timestep t, the model computes a probability distribution P_t by applying the softmax function to a vector of scores S ∈ ℝ^{|V|}:
P_t(y_t = w) = exp(S_w) / Σ_{w′ ∈ V} exp(S_{w′})
• You can apply a temperature hyperparameter τ to the softmax to rebalance P_t:
P_t(y_t = w) = exp(S_w / τ) / Σ_{w′ ∈ V} exp(S_{w′} / τ)
• Raising the temperature (τ > 1): P_t becomes more uniform, giving more diverse output (probability is spread around the vocab).
• Lowering the temperature (τ < 1): P_t becomes more spiky, giving less diverse output (probability is concentrated on the top words).
Temperature is a hyperparameter for decoding: it can be tuned for both beam search and sampling.
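The three knobs above (top-k, top-p, and temperature) are easy to combine in a single sampling step. Below is a minimal numpy sketch over hypothetical logits; a real decoder would apply the same filtering to each timestep's distribution.

```python
# One decoding step: temperature-rescale the scores, optionally keep only
# the top-k tokens and/or the top-p nucleus, renormalize, then sample.
import numpy as np

def sample_next(logits, temperature=1.0, k=None, p=None, seed=None):
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float) / temperature   # rebalance P_t
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens, most probable first
    if k is not None:
        probs[order[k:]] = 0.0                     # top-k: drop everything past rank k
    if p is not None:
        cut = np.searchsorted(np.cumsum(probs[order]), p) + 1
        probs[order[cut:]] = 0.0                   # top-p: cut the tail once cumulative mass reaches p
    probs /= probs.sum()                           # renormalize what survived
    return rng.choice(len(probs), p=probs)

token = sample_next([2.0, 1.5, 0.3, -1.0, -2.0], temperature=0.8, k=4, p=0.9)
```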
Improving decoding: re-ranking
• Problem: what if I decode a bad sequence from my model?
• Decode a bunch of sequences (10 candidates is a common number, but it's up to you).
• Define a score to approximate the quality of sequences, and re-rank by this score.
• The simplest is to use (low) perplexity. Careful! Remember that repetitive utterances generally get low perplexity.
• Re-rankers can score a variety of properties: style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more…
• Beware poorly-calibrated re-rankers.
• You can compose multiple re-rankers together.

Decoding: takeaways
• Decoding is still a challenging problem in NLG; there's a lot more work to be done!
• Different decoding algorithms allow us to inject biases that encourage different properties of coherent natural language generation.
• Some of the most impactful advances in NLG of the last few years have come from simple but effective modifications to decoding algorithms.

Today: Natural Language Generation
Next: 4. Training NLG models

Is repetition due to how LMs are trained?
Recall the unicorn example above: decoding for the most likely string degenerates into repeating "Universidad Nacional Autónoma de México" over and over (Holtzman et al., ICLR 2020).

Diversity issues
• The MLE-trained model learns a bad mode of the text distribution.

Exposure bias
• Training with teacher forcing leads to exposure bias at generation time.
• During training, our model's inputs are gold context tokens from real, human-generated texts:
ℒ_MLE = −log P(y*_t | {y*}_{<t})
• At generation time, our model's inputs are its own previously decoded tokens:
ℒ_dec = −log P(ŷ_t | {ŷ}_{<t})

Exposure bias solutions
• Scheduled sampling (Bengio et al., 2015):
• With some probability p, decode a token and feed that as the next input, rather than the gold token.
• Increase p over the course of training.
• Leads to improvements in practice, but can lead to strange training objectives. (A minimal sketch follows this list.)
• Dataset aggregation (DAgger; Ross et al., 2011):
• At various intervals during training, generate sequences from your current model.
• Add these sequences to your training set as additional examples.
These are basically variants of the same approach; see: https://nlpers.blogspot.com/2016/03/a-dagger-by-any-other-name-scheduled.html
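As a concrete (toy) version of scheduled sampling, the sketch below trains a one-layer GRU decoder; the model size, data, and schedule for p are illustrative assumptions, not the exact setup of Bengio et al.

```python
# Scheduled sampling: compute the usual next-token loss against the gold
# token, but with probability p feed the model's own sample as the next input.
import random
import torch
import torch.nn as nn

V, H = 50, 16
emb, cell, out = nn.Embedding(V, H), nn.GRUCell(H, H), nn.Linear(H, V)

def scheduled_loss(gold, p):
    h = torch.zeros(1, H)
    prev = torch.tensor([0])            # <bos> token
    loss = 0.0
    for y_star in gold:                 # gold next-token targets y*
        h = cell(emb(prev), h)
        logits = out(h)
        loss = loss + nn.functional.cross_entropy(logits, torch.tensor([y_star]))
        sample = torch.multinomial(logits.softmax(-1), 1).squeeze(1)
        # The key line: sometimes continue from the model's own token.
        prev = sample if random.random() < p else torch.tensor([y_star])
    return loss

# p ramps up over training, e.g. p = min(0.25, 0.05 * epoch)
scheduled_loss(gold=[4, 17, 3, 9], p=0.1).backward()
```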
Exposure bias solutions
• Retrieval augmentation (Guu*, Hashimoto*, et al., 2018):
• Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses).
• Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype; this will still result in a more "human-like" generation.
• Reinforcement learning: cast your text generation model as a Markov decision process:
• State s is the model's representation of the preceding context.
• Actions a are the words that can be generated.
• Policy π is the decoder.
• Rewards r are provided by an external score.
• Learn behaviors by rewarding the model when it exhibits them. (Go study CS 234!)

Reward estimation
• How should we define a reward function? Just use your evaluation metric!
• BLEU (machine translation; Ranzato et al., ICLR 2016; Wu et al., 2016)
• ROUGE (summarization; Paulus et al., ICLR 2018; Celikyilmaz et al., NAACL 2018)
• CIDEr (image captioning; Rennie et al., CVPR 2017)
• SPIDEr (image captioning; Liu et al., ICCV 2017)
• Be careful about optimizing for the task as opposed to "gaming" the reward!
• Evaluation metrics are merely proxies for generation quality!
• "even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality" (Wu et al., 2016)

Reward estimation
• What behaviors can we tie to rewards?
• Cross-modality consistency in image captioning (Ren et al., CVPR 2017)
• Sentence simplicity (Zhang and Lapata, EMNLP 2017)
• Temporal consistency (Bosselut et al., NAACL 2018)
• Utterance politeness (Tan et al., TACL 2018)
• Formality (Gong et al., NAACL 2019)
• Human preference (RLHF): this is the technique behind ChatGPT! (Ziegler et al., 2019; Stiennon et al., 2020)
• Humans rank the generated text based on their preferences.
• Learn a reward function from the human preferences.
See the discussion of RLHF in the next lecture.

Training: takeaways
• Teacher forcing is still the main algorithm for training text generation models.
• Exposure bias causes text generation models to lose coherence easily.
• Models must learn to recover from their own bad samples (e.g., scheduled sampling, DAgger),
• or not be allowed to generate bad text to begin with (e.g., retrieval + generation).
• Training with RL can allow models to learn behaviors that are preferred by human preferences / metrics.

Today: Natural Language Generation
Next: 5. Evaluating NLG Systems

Types of evaluation methods for text generation
• Content overlap metrics
• Model-based metrics
• Human evaluations
Running example:
Ref: They walked to the grocery store .
Gen: The woman went to the hardware store .
(Some slides repurposed from Asli Celikyilmaz's EMNLP 2020 tutorial.)

Content overlap metrics
• Compute a score that indicates the lexical similarity between generated and gold-standard (human-written) text.
• Fast and efficient, and widely used.
• N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.)

Word overlap-based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.)
• They're not ideal, even for machine translation.
• They get progressively much worse for tasks that are more open-ended than machine translation:
• worse for summarization, as longer output texts are harder to measure;
• much worse for dialogue, which is more open-ended than summarization;
• much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem like you're getting decent scores!

A simple failure case
Prompt: Are you enjoying the CS224N lectures?
Reference: Heck yes !
Candidates and overlap scores (as reported on the slide):
• You know it ! → 0.61
• Yes ! → 0.25
• Yup . → 0 (false negative)
• Heck no ! → 0.67 (false positive)
N-gram overlap metrics have no concept of semantic relatedness! (A toy scorer below illustrates the pattern.)
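A toy unigram-precision scorer makes the failure mode easy to see; it won't reproduce the slide's exact numbers (real metrics like BLEU are more involved), but the pattern is the same: the semantically wrong "Heck no !" outscores every correct reply.

```python
# Score = fraction of candidate words that also appear in the reference.
def unigram_overlap(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return sum(w in ref for w in cand) / len(cand)

ref = "Heck yes !"
for cand in ["You know it !", "Yes !", "Yup .", "Heck no !"]:
    print(f"{cand!r}: {unigram_overlap(ref, cand):.2f}")
# 'Yup .' scores 0.00 (false negative); 'Heck no !' scores 0.67 (false positive).
```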
Model-based metrics to capture more semantics
• Use learned representations of words and sentences to compute semantic similarity between generated and reference texts.
• No more n-gram bottleneck, because text units are represented as embeddings!
• The embeddings are pretrained; the distance metrics used to measure similarity can be fixed.

Model-based metrics: word distance functions
• Vector similarity: embedding-based similarity as a semantic distance between texts.
• Embedding average (Liu et al., 2016)
• Vector extrema (Liu et al., 2016)
• MEANT (Lo, 2017)
• YiSi (Lo, 2019)
• Word Mover's Distance: measures the distance between two sequences (e.g., sentences, paragraphs), using word-embedding similarity matching (Kusner et al., 2015; Zhao et al., 2019).
• BERTScore: uses pretrained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020).

Model-based metrics: beyond word matching
• Sentence Mover's Similarity: based on Word Mover's Distance; evaluates text in a continuous space using sentence embeddings from recurrent neural network representations (Clark et al., 2019).
• BLEURT: a regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020).

Evaluating open-ended text generation
• MAUVE computes an information divergence, in a quantized embedding space, between the generated text and the gold reference text (Pillutla et al., 2022).

How to evaluate an evaluation metric? (Liu et al., EMNLP 2016)

Human evaluations
• Automatic metrics fall short of matching human decisions.
• Human evaluation is the most important form of evaluation for text generation systems.
• It is the gold standard in developing new automatic metrics: new automated metrics must correlate well with human evaluations!

Human evaluations
• Ask humans to evaluate the quality of generated text, overall or along some specific dimension:
• fluency
• coherence / consistency
• factuality and correctness
• commonsense
• style / formality
• grammaticality
• typicality
• redundancy
(For details, see Celikyilmaz, Clark, and Gao, 2020.)
• Note: don't compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions!

Evaluating LMs by interacting with them
• Evaluating Human Language Model Interaction (Lee et al., 2022).
• Prior work: a third party evaluates the quality of the output.
• This work: evaluates along all the other axes of human-LM interaction.

Evaluation: takeaways
• Content overlap metrics provide a good starting point for evaluating the quality of generated text, but they're not good enough on their own.
• Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable.
• Human judgments are critical, but humans are inconsistent!
• In many cases, the best judge of output quality is YOU! Look at your model generations; don't just rely on numbers!
• Publicly release large samples of the output of the systems you create!

Today: Natural Language Generation
Next: 6. Ethical Considerations

Warning: some of the content on the next few slides may be disturbing.

Ethics: biases in text generation models
• Text generation models are often constructed from pretrained language models.
• Language models learn harmful patterns of bias from large language corpora.
• When prompted for this information, they repeat negative stereotypes (Sheng et al., EMNLP 2019). (Warning: examples contain sensitive content.)

Hidden biases: universal adversarial triggers
• Adversarial inputs can trigger VERY toxic content.
• These models can be exploited in open-world contexts by ill-intentioned users (Wallace et al., EMNLP 2019). (Warning: examples contain highly sensitive content.)

Hidden biases: triggered innocuously
• Pretrained language models can degenerate into toxic text even from seemingly innocuous prompts.
• Models should not be deployed without proper safeguards to control for toxic content.
• Models should not be deployed without careful consideration of how users will interact with them (Gehman et al., EMNLP Findings 2020). (Warning: examples contain sensitive content.)

Ethics: think about what you're building
• Large-scale pretrained language models allow us to build NLG systems for many new applications.
• Before deploying / publishing NLG models (Zellers et al., NeurIPS 2019):
• check that the model's output is not harmful;
• check that the model is robust to trigger words;
• … and more.

Concluding thoughts
• Interacting with natural language generation systems quickly shows their limitations.
• Even in tasks with more progress, there are still many improvements ahead.
• Evaluation remains a huge challenge: we need better ways of automatically evaluating the performance of NLG systems.
• With the advent of large-scale language models, deep NLG research has been reset: it's never been easier to jump into the space!
• One of the most exciting and fun areas of NLP to work in!