Natural Language Processing with Deep Learning
CS224N/Ling284
Xiang Lisa Li
Lecture 12: Neural Language Generation
Adapted from slides by Antoine Bosselut and Chris Manning

Today: Natural Language Generation
1. What is NLG?
2. A review: neural NLG model and training algorithm
3. Decoding from NLG models
4. Training NLG models
5. Evaluating NLG Systems
6. Ethical Considerations

What is natural language generation?
Natural language generation is one side of natural language processing:
NLP = Natural Language Understanding (NLU) + Natural Language Generation (NLG)
NLG focuses on systems that produce fluent, coherent, and useful language output for human consumption.
Deep learning is powering next-gen NLG systems!

Example uses of natural language generation
• Machine translation systems use NLG. Input: utterances in the source language. Output: translated text in the target language.
• Digital assistant (dialogue) systems use NLG. Input: dialogue history. Output: text that responds to / continues the conversation.
• Summarization systems (for research articles, email, meetings, documents) use NLG. Input: long documents. Output: summaries of those documents.

More interesting NLG uses
• Visual description (Krause et al., CVPR 2017)
• Data-to-text, e.g., "Craig finished his eleven NFL seasons with 8,189 rushing yards and 566 receptions for 4,911 receiving yards." (Parikh et al., EMNLP 2020)
• Creative stories (Rashkin et al., EMNLP 2020)

SOTA NLG systems
ChatGPT is an NLG system! It's general purpose and can do many NLG tasks, e.g., chatbot conversation and poetry generation.

Categorization of NLG tasks: the spectrum of open-endedness
At the less open-ended end sit machine translation and summarization.
Source sentence: 当局已经宣布今天是节假日。
Reference translations:
1. Authorities have announced a national holiday today.
2. Authorities have announced that today is a national holiday.
3. Today is a national holiday, announced by the authorities.
The output space is not very diverse.

Moving along the spectrum (task-driven dialogue, then chitchat dialogue), the output space gets more diverse.
Input: Hey, how are you?
Outputs:
1. Good! You?
2. I just heard some exciting news, do you want to hear it?
3. Thx for asking! Barely surviving my hws.

At the far end sits story generation.
Input: Write a story about three little pigs.
Outputs: … (so many options) …
The output space is extremely diverse.

Categorization of NLG tasks
Machine Translation · Summarization · Task-driven Dialog · ChitChat Dialog · Story Generation (ordered from less to more open-ended)
• Open-ended generation: the output distribution still has high freedom.
• Non-open-ended generation: the input mostly determines the output generation.
Remark: one way of formalizing this categorization is by entropy.
These two classes of NLG tasks require different decoding and/or training approaches!

Today: Natural Language Generation
Next: 2. A review: neural NLG model and training algorithm

Basics of natural language generation (review of Lecture 5)
• In autoregressive text generation models, at each time step t, our model takes in a sequence of tokens {y}_{<t} as input and outputs a new token, ŷ_t.
• For model f(·) and vocab V, we get scores S = f({y}_{<t}; θ) ∈ ℝ^{|V|}, which a softmax turns into a next-token distribution:
P(y_t = w | {y}_{<t}) = exp(S_w) / Σ_{w′ ∈ V} exp(S_{w′})
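To make the score-then-softmax step concrete, here is a minimal numpy sketch; the five-word vocabulary and the score vector are hypothetical toy values standing in for a real model's output.

```python
# Turning the model's scores S = f({y}_<t; θ) ∈ ℝ^|V| into a next-token
# distribution with a softmax. Vocabulary and scores are toy stand-ins.
import numpy as np

vocab = ["the", "store", "restroom", "airport", "<eos>"]
S = np.array([2.0, 1.2, 0.3, -0.5, -1.0])   # hypothetical scores for this step

P = np.exp(S - S.max())   # subtract the max for numerical stability
P = P / P.sum()           # P(y_t = w | y_<t) = exp(S_w) / Σ_w' exp(S_w')

print({w: round(float(p), 3) for w, p in zip(vocab, P)})
```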
Basics of natural language generation (review of Lecture 5)
• For non-open-ended tasks (e.g., MT), we typically use an encoder-decoder system, where this autoregressive model serves as the decoder, and we have a separate bidirectional encoder for encoding the inputs.
• For open-ended tasks (e.g., story generation), this autoregressive generation model is often the only component.

Trained one token at a time by maximum likelihood
• The model is trained to maximize the probability of the next token y*_t given the preceding gold tokens {y*}_{<t}:
ℒ = −Σ_{t=1}^{T} log P(y*_t | {y*}_{<t})
• This is a classification task at each time step, trying to predict the actual word y*_t in the training data.
• Doing this is often called "teacher forcing" (because you reset to the ground truth at each time step).
[Diagram: at each step the gold prefix y*_1 … y*_{t−1} is fed in, and the model is trained to predict y*_t.]
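To make teacher forcing concrete, here is a minimal PyTorch sketch; the tiny embedding-LSTM model and the random token batch are illustrative stand-ins, not a real LM or corpus.

```python
# Teacher forcing: at every position the model conditions on the *gold*
# prefix and is trained to classify the next gold token.
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32
embed = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

gold = torch.randint(0, vocab_size, (4, 12))   # toy batch of gold sequences y*
inputs, targets = gold[:, :-1], gold[:, 1:]    # shift by one: predict y*_t from y*_<t

states, _ = lstm(embed(inputs))                # inputs are always gold tokens
logits = head(states)                          # scores over the vocab at each step

# L = -Σ_t log P(y*_t | y*_<t): one classification loss per time step
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
```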
Basics of natural language generation (review of Lecture 5)
• At inference time, our decoding algorithm defines a function to select a token from this distribution:
ŷ_t = g(P(y_t | {y}_{<t})), where g(·) is your decoding algorithm.
• The "obvious" decoding algorithm is to greedily choose the highest-probability next token according to the model at each time step.
• While this basic algorithm sort of works, the two main avenues for doing better are to:
1. Improve decoding
2. Improve the training
(Of course, there's also improving your training data or model architecture.)

Today: Natural Language Generation
Next: 3. Decoding from NLG models

Decoding: what is it all about?
• At each time step t, our model computes a vector of scores for each token in our vocabulary, S ∈ ℝ^{|V|}: S = f({y}_{<t}), where f(·) is your model.
• Then, we compute a probability distribution P over these scores with a softmax function:
P(y_t = w | {y}_{<t}) = exp(S_w) / Σ_{w′ ∈ V} exp(S_{w′})
• Our decoding algorithm defines a function to select a token from this distribution:
ŷ_t = g(P(y_t | {y}_{<t})), where g(·) is your decoding algorithm.

How to find the most likely string?
• Recall Lecture 7 on neural machine translation…
• Greedy decoding: select the highest-probability token in P(y_t | y_{<t}):
ŷ_t = argmax_{w ∈ V} P(y_t = w | y_{<t})
• Beam search (also discussed in Lecture 7): likewise aims to find strings that maximize the log-probability, but with wider exploration of candidates.
Overall, maximum-probability decoding is good for low-entropy tasks like MT and summarization!

The most likely string is repetitive for open-ended generation
Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…
(Holtzman et al., ICLR 2020)

Why does repetition happen? It is a self-amplification effect: each repetition raises the probability of repeating again, and it keeps going (Holtzman et al., ICLR 2020). Scale doesn't solve this problem: even a 175-billion-parameter LM still repeats when we decode for the most likely string.

How can we reduce repetition?
Simple option:
• Heuristic: don't repeat n-grams (a minimal sketch follows this list).
More complex options:
• Use a different training objective:
• Unlikelihood objective (Welleck et al., 2020): penalize generation of already-seen tokens.
• Coverage loss (See et al., 2017): prevents the attention mechanism from attending to the same words.
• Use a different decoding objective:
• Contrastive decoding (Li et al., 2022): searches for strings x that maximize log p_largeLM(x) − log p_smallLM(x).
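Here is a minimal sketch of that n-gram blocking heuristic, assuming integer token IDs; before each decoding step, the returned tokens get their probability zeroed out so no n-gram is ever generated twice.

```python
# "Don't repeat n-grams": find tokens that would complete an n-gram
# already present in the generated sequence.
def banned_tokens(generated, n=3):
    """Tokens that, given the last n-1 generated tokens, repeat an n-gram."""
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])        # current (n-1)-token context
    banned = set()
    for i in range(len(generated) - n + 1):     # scan all previous n-grams
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])    # ban the token that followed
    return banned

# After generating [5, 8, 2, 5, 8], token 2 may not follow (5, 8) again:
print(banned_tokens([5, 8, 2, 5, 8], n=3))      # {2}
```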
Is finding the most likely string reasonable for open-ended generation?
[Figure (Holtzman et al., ICLR 2020): per-timestep probability of each generated token; beam-search text stays uniformly high-probability, while human text fluctuates.]
It fails to match the uncertainty distribution of human-generated text.

Time to get random: sampling!
• Sample a token from the distribution of tokens: ŷ_t ∼ P(y_t = w | {y}_{<t})
• It's random, so you can sample any token!
Example: for the prefix "He wanted to go to the", the model's candidates include restroom, grocery store, airport, pub, gym, bathroom, beach, doctor, hospital, …

Decoding: top-k sampling
• Problem: vanilla sampling makes every token in the vocabulary an option.
• Even if most of the probability mass in the distribution is on a limited set of options, the tail of the distribution can be very long and in aggregate have considerable mass (in statistics speak: we have "heavy-tailed" distributions).
• Many tokens are probably really wrong in the current context. Each wrong token individually gets only a tiny chance of being selected, but because there are so many of them, as a group they still get a high chance of being selected.
• Solution: top-k sampling: only sample from the top k tokens in the probability distribution (Fan et al., ACL 2018; Holtzman et al., ACL 2018).
• Common values are k = 50 (but it's up to you!).
• Increasing k yields more diverse but riskier outputs; decreasing k yields safer but more generic outputs.

Issues with top-k sampling
• Top-k sampling can cut off too quickly!
• Top-k sampling can also cut off too slowly!
(Holtzman et al., ICLR 2020)

Decoding: top-p (nucleus) sampling
• Problem: the probability distributions we sample from are dynamic.
• When the distribution P_t is flatter, a limited k removes many viable options.
• When the distribution P_t is peakier, a high k allows too many options a chance of being selected.
• Solution: top-p sampling: sample from all tokens within the top p cumulative probability mass (i.e., where the mass is concentrated).
• This effectively varies k depending on the uniformity of P_t.
[Figure (Holtzman et al., ICLR 2020): the selected nucleus adapts as the shape of P_t changes from timestep to timestep.]

Scaling randomness: temperature
• Recall: on timestep t, the model computes a probability distribution P_t by applying the softmax function to a vector of scores S ∈ ℝ^{|V|}:
P_t(y_t = w) = exp(S_w) / Σ_{w′ ∈ V} exp(S_{w′})
• You can apply a temperature hyperparameter τ to the softmax to rebalance P_t:
P_t(y_t = w) = exp(S_w / τ) / Σ_{w′ ∈ V} exp(S_{w′} / τ)
• Raising the temperature (τ > 1): P_t becomes more uniform, giving more diverse output (probability is spread around the vocab).
• Lowering the temperature (τ < 1): P_t becomes more spiky, giving less diverse output (probability is concentrated on the top words).
Temperature is a hyperparameter for decoding: it can be tuned for both beam search and sampling.
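The three knobs above (top-k, top-p, and temperature) are easy to combine in a single sampling step. Below is a minimal numpy sketch over hypothetical logits; a real decoder would apply the same filtering to each timestep's distribution.

```python
# One decoding step: temperature-rescale the scores, optionally keep only
# the top-k tokens and/or the top-p nucleus, renormalize, then sample.
import numpy as np

def sample_next(logits, temperature=1.0, k=None, p=None, seed=None):
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float) / temperature   # rebalance P_t
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens, most probable first
    if k is not None:
        probs[order[k:]] = 0.0                     # top-k: drop everything past rank k
    if p is not None:
        cut = np.searchsorted(np.cumsum(probs[order]), p) + 1
        probs[order[cut:]] = 0.0                   # top-p: cut the tail once cumulative mass reaches p
    probs /= probs.sum()                           # renormalize what survived
    return rng.choice(len(probs), p=probs)

token = sample_next([2.0, 1.5, 0.3, -1.0, -2.0], temperature=0.8, k=4, p=0.9)
```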
Improving decoding: re-ranking
• Problem: what if I decode a bad sequence from my model?
• Decode a bunch of sequences (10 candidates is a common number, but it's up to you).
• Define a score to approximate the quality of sequences, and re-rank by this score.
• The simplest is to use (low) perplexity. Careful! Remember that repetitive utterances generally get low perplexity.
• Re-rankers can score a variety of properties: style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more…
• Beware poorly-calibrated re-rankers.
• You can compose multiple re-rankers together.

Decoding: takeaways
• Decoding is still a challenging problem in NLG; there's a lot more work to be done!
• Different decoding algorithms allow us to inject biases that encourage different properties of coherent natural language generation.
• Some of the most impactful advances in NLG of the last few years have come from simple but effective modifications to decoding algorithms.

Today: Natural Language Generation
Next: 4. Training NLG models

Is repetition due to how LMs are trained?
Recall the unicorn example above: decoding for the most likely string degenerates into repeating "Universidad Nacional Autónoma de México" over and over (Holtzman et al., ICLR 2020).

Diversity issues
• The MLE-trained model learns a bad mode of the text distribution.

Exposure bias
• Training with teacher forcing leads to exposure bias at generation time.
• During training, our model's inputs are gold context tokens from real, human-generated texts:
ℒ_MLE = −log P(y*_t | {y*}_{<t})
• At generation time, our model's inputs are its own previously decoded tokens:
ℒ_dec = −log P(ŷ_t | {ŷ}_{<t})

Exposure bias solutions
• Scheduled sampling (Bengio et al., 2015):
• With some probability p, decode a token and feed that as the next input, rather than the gold token.
• Increase p over the course of training.
• Leads to improvements in practice, but can lead to strange training objectives. (A minimal sketch follows this list.)
• Dataset aggregation (DAgger; Ross et al., 2011):
• At various intervals during training, generate sequences from your current model.
• Add these sequences to your training set as additional examples.
These are basically variants of the same approach; see: https://nlpers.blogspot.com/2016/03/a-dagger-by-any-other-name-scheduled.html
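As a concrete (toy) version of scheduled sampling, the sketch below trains a one-layer GRU decoder; the model size, data, and schedule for p are illustrative assumptions, not the exact setup of Bengio et al.

```python
# Scheduled sampling: compute the usual next-token loss against the gold
# token, but with probability p feed the model's own sample as the next input.
import random
import torch
import torch.nn as nn

V, H = 50, 16
emb, cell, out = nn.Embedding(V, H), nn.GRUCell(H, H), nn.Linear(H, V)

def scheduled_loss(gold, p):
    h = torch.zeros(1, H)
    prev = torch.tensor([0])            # <bos> token
    loss = 0.0
    for y_star in gold:                 # gold next-token targets y*
        h = cell(emb(prev), h)
        logits = out(h)
        loss = loss + nn.functional.cross_entropy(logits, torch.tensor([y_star]))
        sample = torch.multinomial(logits.softmax(-1), 1).squeeze(1)
        # The key line: sometimes continue from the model's own token.
        prev = sample if random.random() < p else torch.tensor([y_star])
    return loss

# p ramps up over training, e.g. p = min(0.25, 0.05 * epoch)
scheduled_loss(gold=[4, 17, 3, 9], p=0.1).backward()
```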
Exposure bias solutions
• Retrieval augmentation (Guu*, Hashimoto*, et al., 2018):
• Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses).
• Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype; this will still result in a more "human-like" generation.
• Reinforcement learning: cast your text generation model as a Markov decision process:
• State s is the model's representation of the preceding context.
• Actions a are the words that can be generated.
• Policy π is the decoder.
• Rewards r are provided by an external score.
• Learn behaviors by rewarding the model when it exhibits them. (Go study CS 234!)

Reward estimation
• How should we define a reward function? Just use your evaluation metric!
• BLEU (machine translation; Ranzato et al., ICLR 2016; Wu et al., 2016)
• ROUGE (summarization; Paulus et al., ICLR 2018; Celikyilmaz et al., NAACL 2018)
• CIDEr (image captioning; Rennie et al., CVPR 2017)
• SPIDEr (image captioning; Liu et al., ICCV 2017)
• Be careful about optimizing for the task as opposed to "gaming" the reward!
• Evaluation metrics are merely proxies for generation quality!
• "even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality" (Wu et al., 2016)

Reward estimation
• What behaviors can we tie to rewards?
• Cross-modality consistency in image captioning (Ren et al., CVPR 2017)
• Sentence simplicity (Zhang and Lapata, EMNLP 2017)
• Temporal consistency (Bosselut et al., NAACL 2018)
• Utterance politeness (Tan et al., TACL 2018)
• Formality (Gong et al., NAACL 2019)
• Human preference (RLHF): this is the technique behind ChatGPT! (Ziegler et al., 2019; Stiennon et al., 2020)
• Humans rank the generated text based on their preferences.
• Learn a reward function from the human preferences.
See the discussion of RLHF in the next lecture.

Training: takeaways
• Teacher forcing is still the main algorithm for training text generation models.
• Exposure bias causes text generation models to lose coherence easily.
• Models must learn to recover from their own bad samples (e.g., scheduled sampling, DAgger),
• or not be allowed to generate bad text to begin with (e.g., retrieval + generation).
• Training with RL can allow models to learn behaviors that are preferred by human preferences / metrics.

Today: Natural Language Generation
Next: 5. Evaluating NLG Systems

Types of evaluation methods for text generation
• Content overlap metrics
• Model-based metrics
• Human evaluations
Running example:
Ref: They walked to the grocery store .
Gen: The woman went to the hardware store .
(Some slides repurposed from Asli Celikyilmaz's EMNLP 2020 tutorial.)

Content overlap metrics
• Compute a score that indicates the lexical similarity between generated and gold-standard (human-written) text.
• Fast and efficient, and widely used.
• N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.)

Word overlap-based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.)
• They're not ideal, even for machine translation.
• They get progressively much worse for tasks that are more open-ended than machine translation:
• worse for summarization, as longer output texts are harder to measure;
• much worse for dialogue, which is more open-ended than summarization;
• much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem like you're getting decent scores!

A simple failure case
Prompt: Are you enjoying the CS224N lectures?
Reference: Heck yes !
Candidates and overlap scores (as reported on the slide):
• You know it ! → 0.61
• Yes ! → 0.25
• Yup . → 0 (false negative)
• Heck no ! → 0.67 (false positive)
N-gram overlap metrics have no concept of semantic relatedness! (A toy scorer below illustrates the pattern.)
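A toy unigram-precision scorer makes the failure mode easy to see; it won't reproduce the slide's exact numbers (real metrics like BLEU are more involved), but the pattern is the same: the semantically wrong "Heck no !" outscores every correct reply.

```python
# Score = fraction of candidate words that also appear in the reference.
def unigram_overlap(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return sum(w in ref for w in cand) / len(cand)

ref = "Heck yes !"
for cand in ["You know it !", "Yes !", "Yup .", "Heck no !"]:
    print(f"{cand!r}: {unigram_overlap(ref, cand):.2f}")
# 'Yup .' scores 0.00 (false negative); 'Heck no !' scores 0.67 (false positive).
```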
Model-based metrics to capture more semantics
• Use learned representations of words and sentences to compute semantic similarity between generated and reference texts.
• No more n-gram bottleneck, because text units are represented as embeddings!
• The embeddings are pretrained; the distance metrics used to measure similarity can be fixed.

Model-based metrics: word distance functions
• Vector similarity: embedding-based similarity as a semantic distance between texts.
• Embedding average (Liu et al., 2016)
• Vector extrema (Liu et al., 2016)
• MEANT (Lo, 2017)
• YiSi (Lo, 2019)
• Word Mover's Distance: measures the distance between two sequences (e.g., sentences, paragraphs), using word-embedding similarity matching (Kusner et al., 2015; Zhao et al., 2019).
• BERTScore: uses pretrained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020).

Model-based metrics: beyond word matching
• Sentence Mover's Similarity: based on Word Mover's Distance; evaluates text in a continuous space using sentence embeddings from recurrent neural network representations (Clark et al., 2019).
• BLEURT: a regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020).

Evaluating open-ended text generation
• MAUVE computes an information divergence, in a quantized embedding space, between the generated text and the gold reference text (Pillutla et al., 2022).

How to evaluate an evaluation metric? (Liu et al., EMNLP 2016)

Human evaluations
• Automatic metrics fall short of matching human decisions.
• Human evaluation is the most important form of evaluation for text generation systems.
• It is the gold standard in developing new automatic metrics: new automated metrics must correlate well with human evaluations!

Human evaluations
• Ask humans to evaluate the quality of generated text, overall or along some specific dimension:
• fluency
• coherence / consistency
• factuality and correctness
• commonsense
• style / formality
• grammaticality
• typicality
• redundancy
(For details, see Celikyilmaz, Clark, and Gao, 2020.)
• Note: don't compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions!

Evaluating LMs by interacting with them
• Evaluating Human Language Model Interaction (Lee et al., 2022).
• Prior work: a third party evaluates the quality of the output.
• This work: evaluates along all the other axes of human-LM interaction.

Evaluation: takeaways
• Content overlap metrics provide a good starting point for evaluating the quality of generated text, but they're not good enough on their own.
• Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable.
• Human judgments are critical, but humans are inconsistent!
• In many cases, the best judge of output quality is YOU! Look at your model generations; don't just rely on numbers!
• Publicly release large samples of the output of the systems you create!

Today: Natural Language Generation
Next: 6. Ethical Considerations

Warning: some of the content on the next few slides may be disturbing.

Ethics: biases in text generation models
• Text generation models are often constructed from pretrained language models.
• Language models learn harmful patterns of bias from large language corpora.
• When prompted for this information, they repeat negative stereotypes (Sheng et al., EMNLP 2019). (Warning: examples contain sensitive content.)

Hidden biases: universal adversarial triggers
• Adversarial inputs can trigger VERY toxic content.
• These models can be exploited in open-world contexts by ill-intentioned users (Wallace et al., EMNLP 2019). (Warning: examples contain highly sensitive content.)

Hidden biases: triggered innocuously
• Pretrained language models can degenerate into toxic text even from seemingly innocuous prompts.
• Models should not be deployed without proper safeguards to control for toxic content.
• Models should not be deployed without careful consideration of how users will interact with them (Gehman et al., EMNLP Findings 2020). (Warning: examples contain sensitive content.)

Ethics: think about what you're building
• Large-scale pretrained language models allow us to build NLG systems for many new applications.
• Before deploying / publishing NLG models (Zellers et al., NeurIPS 2019):
• check that the model's output is not harmful;
• check that the model is robust to trigger words;
• … and more.

Concluding thoughts
• Interacting with natural language generation systems quickly shows their limitations.
• Even in tasks with more progress, there are still many improvements ahead.
• Evaluation remains a huge challenge: we need better ways of automatically evaluating the performance of NLG systems.
• With the advent of large-scale language models, deep NLG research has been reset: it's never been easier to jump into the space!
• One of the most exciting and fun areas of NLP to work in!