Natural Language Processing with Deep Learning
CS224N/Ling284
Antoine Bosselut
Lecture 12: Neural Language Generation

What is natural language generation?
• Natural language generation (NLG) is a sub-field of natural language processing
• Focused on building systems that automatically produce coherent and useful written or spoken text for human consumption
• NLG systems are already changing the world we live in…

Machine Translation

Dialogue Systems

Summarization
• Document summarization, e-mail summarization (Wang and Cardie, ACL 2013; https://chrome.google.com/webstore/detail/gmail-summarization/), and meeting summarization (http://mogren.one/lic/)

Data-to-Text Generation
(Dusek et al., INLG 2019; Wiseman and Rush, EMNLP 2017; Parikh et al., EMNLP 2020)

Visual Description
Sentences:
1) A girl is eating donuts with a boy in a restaurant
2) A boy and girl sitting at a table with doughnuts.
3) Two kids sitting a coffee shop eating some frosted donuts
4) Two children sitting at a table eating donuts.
5) Two children eat doughnuts at a restaurant table.
Paragraph:
Two children are sitting at a table in a restaurant. The children are one little girl and one little boy. The little girl is eating a pink frosted donut with white icing lines on top of it. The girl has blonde hair and is wearing a green jacket with a black long sleeve shirt underneath. The little boy is wearing a black zip up jacket and is holding his finger to his lip but is not eating. A metal napkin dispenser is in between them at the table. The wall next to them is white brick. Two adults are on the other side of the short white brick wall. The room has white circular lights on the ceiling and a large window in the front of the restaurant. It is daylight outside.
(Krause et al., CVPR 2017; Karpathy & Li, CVPR 2015)

Creative Generation
• Stories & narratives (Rashkin et al., EMNLP 2020) and poetry (Ghazvininejad et al., ACL 2017)

What is natural language generation?
• Any task involving text production for human consumption requires natural language generation
• Deep learning is powering next-gen NLG systems!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Basics of natural language generation
• In autoregressive text generation models, at each time step $t$, our model takes in a sequence of tokens $\{y\}_{<t}$ as input and outputs a new token, $\hat{y}_t$
• When generating, each new token $\hat{y}_t$ is appended to the input sequence for the next step

A look at a single step
• At each time step $t$, our model computes a vector of scores for each token in our vocabulary, $S \in \mathbb{R}^{|V|}$:
$S = f(\{y\}_{<t}, \theta)$, where $f(\cdot)$ is your model
• Then, we compute a probability distribution $P$ over $w \in V$ using these scores:
$P(y_t = w \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$
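To make the single step concrete, here is a minimal PyTorch sketch of one autoregressive step. This is not the lecture's code: the tiny GRU stands in for whatever model $f(\cdot)$ you actually use, and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Stand-in pieces of a model f({y}_<t; theta): an embedding layer, a GRU, and
# a projection to one score per vocabulary token. All sizes are illustrative.
vocab_size, embed_dim, hidden_dim = 100, 32, 64
embedding = torch.nn.Embedding(vocab_size, embed_dim)
rnn = torch.nn.GRU(embed_dim, hidden_dim, batch_first=True)
proj = torch.nn.Linear(hidden_dim, vocab_size)

def next_token_distribution(prefix_ids):
    """One autoregressive step: S = f({y}_<t, theta), then P = softmax(S)."""
    emb = embedding(prefix_ids.unsqueeze(0))   # (1, t, embed_dim)
    _, h = rnn(emb)                            # final hidden state: (1, 1, hidden_dim)
    scores = proj(h.squeeze(0))                # S in R^{|V|}, shape (1, |V|)
    return F.softmax(scores, dim=-1)           # P(y_t = w | {y}_<t), sums to 1

prefix = torch.tensor([5, 17, 42])             # token ids standing in for y_1..y_3
p = next_token_distribution(prefix)            # distribution over the next token
```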
Basics: What are we trying to do?
• At inference time, our decoding algorithm defines a function to select a token from this distribution:
$\hat{y}_t = g(P(y_t \mid \{y\}_{<t}))$, where $g(\cdot)$ is your decoding algorithm
• We train the model to minimize the negative log-likelihood of predicting the next token in the sequence:
$\mathcal{L}_t = -\log P(y_t^* \mid \{y^*\}_{<t})$, and we sum $\mathcal{L}_t$ over the entire sequence
• Note: this is just a classification task where each $w \in V$ is a class
• The label at each step is the actual word $y_t^*$ in the training sequence; this token is often called the "gold" or "ground truth" token
• This algorithm is often called "teacher forcing"

Maximum Likelihood Training (i.e., teacher forcing)
• Trained to generate the next word $y_t^*$ given a set of preceding words $\{y^*\}_{<t}$; the loss accumulates one term per step:
$\mathcal{L} = -\log P(y_2^* \mid y_1^*)$
$\mathcal{L} = -(\log P(y_2^* \mid y_1^*) + \log P(y_3^* \mid y_1^*, y_2^*))$
$\mathcal{L} = -(\log P(y_2^* \mid y_1^*) + \log P(y_3^* \mid y_1^*, y_2^*) + \log P(y_4^* \mid y_1^*, y_2^*, y_3^*))$
• For a full sequence of length $T$:
$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^* \mid \{y^*\}_{<t})$
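A hedged sketch of this training objective, reusing the same kind of toy model as above (the TinyLM class and all sizes are stand-ins, not the lecture's code). Note that cross-entropy over shifted targets is exactly the summed negative log-likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in autoregressive model f: token ids -> scores S at every position."""
    def __init__(self, vocab_size=100, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, ids):                        # ids: (batch, seq_len)
        out, _ = self.rnn(self.embedding(ids))
        return self.proj(out)                      # (batch, seq_len, |V|)

def teacher_forcing_loss(model, gold_ids):
    """L = -sum_t log P(y*_t | y*_<t): gold tokens are always the inputs."""
    inputs, targets = gold_ids[:, :-1], gold_ids[:, 1:]   # labels shifted by one
    logits = model(inputs)                                # scores S at every step
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="sum")

model = TinyLM()
gold = torch.randint(0, 100, (1, 12))     # a stand-in "gold" training sequence
loss = teacher_forcing_loss(model, gold)  # then loss.backward() and an optimizer step
```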
Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Decoding: what is it all about?
• At each time step $t$, our model computes a vector of scores for each token in our vocabulary, $S \in \mathbb{R}^{|V|}$: $S = f(\{y\}_{<t})$, where $f(\cdot)$ is your model
• Then, we compute a probability distribution $P$ over these scores (usually with a softmax function):
$P(y_t = w \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$
• Our decoding algorithm defines a function $g(\cdot)$ to select a token from this distribution; at generation time, previously decoded tokens become part of the context:
$\hat{y}_t = g(P(y_t \mid \{y^*\}, \{\hat{y}\}_{<t}))$

Greedy methods
• Recall: Lecture 7 on Neural Machine Translation…
• Argmax decoding selects the highest-probability token in $P(y_t \mid y_{<t})$:
$\hat{y}_t = \operatorname*{argmax}_{w \in V} P(y_t = w \mid y_{<t})$
• Beam search (discussed in Lecture 7) is also a greedy algorithm, but with a wider search over candidates

Greedy methods get repetitive
Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México/Universidad Nacional Autónoma de México…
(Holtzman et al., ICLR 2020)

Why does repetition happen? And it keeps going…
[figure omitted: per-timestep probabilities of a repeated phrase under the model (Holtzman et al., ICLR 2020)]

How can we reduce repetition?
• Simple option: a heuristic that forbids repeating n-grams
• More complex options:
  • Minimize embedding distance between consecutive sentences (Celikyilmaz et al., 2018); doesn't help with intra-sentence repetition
  • Coverage loss (See et al., 2017): prevents the attention mechanism from attending to the same words
  • Unlikelihood objective (Welleck et al., 2020): penalize generation of already-seen tokens

Are greedy methods reasonable?
[figure omitted: per-timestep probability of human-written text vs. beam search output (Holtzman et al., ICLR 2020)]

Time to get random: Sampling!
• Sample a token from the distribution of tokens: $\hat{y}_t \sim P(y_t = w \mid \{y\}_{<t})$
• It's random, so you can sample any token! (e.g., "He wanted to go to the ___": restroom, grocery store, airport, pub, gym, bathroom, beach, doctor, hospital, …)

Decoding: Top-k sampling
• Problem: vanilla sampling makes every token in the vocabulary an option
  • Even if most of the probability mass in the distribution covers a limited set of options, the tail of the distribution can be very long
  • Many tokens are probably irrelevant in the current context
  • Why are we giving them individually a tiny chance to be selected, and as a group a high chance to be selected?
• Solution: Top-k sampling (Fan et al., ACL 2018; Holtzman et al., ACL 2018)
  • Only sample from the top k tokens in the probability distribution
  • Common values are k = 5, 10, 20 (but it's up to you!)
  • Increase k for more diverse/risky outputs; decrease k for more generic/safe outputs

Issues with Top-k sampling
• Top-k sampling can cut off too quickly, and it can also cut off too slowly! (Holtzman et al., ICLR 2020)

Decoding: Top-p (nucleus) sampling
• Problem: the probability distributions we sample from are dynamic
  • When the distribution $P_t$ is flatter, a limited k removes many viable options
  • When the distribution $P_t$ is peakier, a high k allows too many options to have a chance of being selected
• Solution: Top-p sampling (Holtzman et al., ICLR 2020)
  • Sample from all tokens in the top p cumulative probability mass (i.e., where mass is concentrated)
  • This effectively varies k depending on the uniformity of $P_t$

Scaling randomness: Softmax temperature
• Recall: on timestep $t$, the model computes a probability distribution $P_t$ by applying the softmax function to a vector of scores $S \in \mathbb{R}^{|V|}$:
$P_t(y_t = w) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$
• You can apply a temperature hyperparameter $\tau$ to the softmax to rebalance $P_t$:
$P_t(y_t = w) = \frac{\exp(S_w / \tau)}{\sum_{w' \in V} \exp(S_{w'} / \tau)}$
• Raise the temperature ($\tau > 1$): $P_t$ becomes more uniform, giving more diverse output (probability is spread around the vocab)
• Lower the temperature ($\tau < 1$): $P_t$ becomes more spiky, giving less diverse output (probability is concentrated on the top words)
• Note: softmax temperature is not a decoding algorithm! It's a technique you can apply at test time, in conjunction with a decoding algorithm (such as beam search or sampling); see the sketch below
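A minimal sketch of these decoding choices operating on a single scores vector (function name, strategy labels, and defaults are illustrative, not canonical):

```python
import torch
import torch.nn.functional as F

def decode_step(scores, strategy="top_p", k=20, p=0.9, temperature=1.0):
    """Select the next token id from vocabulary scores S (shape (|V|,))."""
    probs = F.softmax(scores / temperature, dim=-1)   # temperature rebalances P_t
    if strategy == "argmax":                          # greedy: highest-probability token
        return probs.argmax().item()
    if strategy == "top_k":                           # zero out all but the top k tokens
        top = torch.topk(probs, k)
        probs = torch.zeros_like(probs).scatter_(0, top.indices, top.values)
    elif strategy == "top_p":                         # keep the smallest set covering mass p
        sorted_p, idx = probs.sort(descending=True)
        already_covered = sorted_p.cumsum(0) - sorted_p   # mass before each token
        sorted_p[already_covered >= p] = 0.0              # drop tokens outside the nucleus
        probs = torch.zeros_like(probs).scatter_(0, idx, sorted_p)
    probs = probs / probs.sum()                       # renormalize truncated distribution
    return torch.multinomial(probs, 1).item()         # sample one token id

scores = torch.randn(100)                             # stand-in S for a 100-word vocab
next_id = decode_step(scores, strategy="top_p", p=0.9)
```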
Improving decoding: re-balancing distributions
• Problem: what if I don't trust how well my model's distributions are calibrated? Don't rely on ONLY your model's distribution over tokens
• Solution #1: re-balance $P_t$ using retrieval from n-gram phrase statistics (Khandelwal et al., ICLR 2020)
  • Cache a database of phrases from your training corpus (or some other corpus)
  • At decoding time, search for the most similar phrases in the database
  • Re-balance $P_t$ using the induced distribution $P_{\text{phrase}}$ over the words that follow these phrases
[figure omitted: kNN-LM; a test context ("Obama's birthplace is ?") retrieves nearby training contexts ("Obama was born in", "Obama is a native of"), and the normalized distribution over their target words (Hawaii, Illinois, …) is interpolated with the model's distribution (Khandelwal et al., ICLR 2020)]

Backpropagation-based distribution re-balancing
• Can I re-balance my language model's distribution to encourage other behaviors? Yes!
• Define a model that evaluates the behavior you want (e.g., sentiment, perplexity)
• Use soft token distributions (e.g., Gumbel softmax: $P_t$ with a tiny temperature $\tau$) as inputs to the evaluator
• Backpropagate gradients directly to your language model and update $P_t$ (Dathathri et al., ICLR 2020; Qin et al., EMNLP 2020)
[figure omitted: PPLM; an attribute model $p(a|x)$ scores the LM's continuation of "The chicken tastes", and its gradients update the LM's latents, shifting the output distribution from "ok" toward "delicious" (Dathathri et al., ICLR 2020)]

Improving Decoding: Re-ranking
• Problem: what if I decode a bad sequence from my model?
• Decode a bunch of sequences (10 candidates is a common number, but it's up to you)
• Define a score to approximate the quality of the sequences and re-rank by this score
  • Simplest is to use perplexity!
  • Careful! Remember that repetitive sequences generally get low perplexity, so this score alone can reward degenerate text
• Re-rankers can score a variety of properties: style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more…
• Beware poorly-calibrated re-rankers
• Can use multiple re-rankers in parallel
• A perplexity re-ranking sketch follows below
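A hedged sketch of perplexity-based re-ranking. Here `model` is assumed to be any callable mapping (batch, T) token ids to (batch, T, |V|) scores, such as the TinyLM stand-in defined earlier; function names are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def sequence_perplexity(model, ids):
    """Per-token perplexity of one candidate (ids: (1, T)) under the model."""
    logits = model(ids[:, :-1])                           # scores for each next token
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          ids[:, 1:].reshape(-1), reduction="mean")
    return math.exp(nll.item())

def rerank(model, candidates):
    """Sort decoded candidates best-first by (lowest) perplexity.
    Caveat from above: repetitive text also gets low perplexity, so this
    score alone can prefer degenerate candidates."""
    return sorted(candidates, key=lambda ids: sequence_perplexity(model, ids))
```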
Decoding: Takeaways
• Decoding is still a challenging problem in natural language generation
• The human language distribution is noisy and doesn't reflect simple properties (i.e., probability maximization)
• Different decoding algorithms allow us to inject biases that encourage different properties of coherent natural language generation
• Some of the most impactful advances in NLG of the last few years have come from simple but effective modifications to decoding algorithms
• A lot more work to be done!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Maximum Likelihood Training (i.e., teacher forcing)
• Trained to minimize the negative log-likelihood of the next token $y_t^*$ given the preceding tokens in the sequence $\{y^*\}_{<t}$:
$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t^* \mid \{y^*\}_{<t})$

Are greedy decoders bad because of how they're trained?
• Recall the repetitive "unicorns… (UNAM/Universidad Nacional Autónoma de México/…)" continuation from the decoding section (Holtzman et al., ICLR 2020)

Diversity Issues
• Maximum Likelihood Estimation discourages diverse text generation

Unlikelihood Training (Welleck et al., 2020)
• Given a set of undesired tokens $\mathcal{C}$, lower their likelihood in context:
$\mathcal{L}_{UL}^t = -\sum_{y_{\text{neg}} \in \mathcal{C}} \log(1 - P(y_{\text{neg}} \mid \{y^*\}_{<t}))$
• Keep the teacher forcing objective and combine them for the final loss function:
$\mathcal{L}_{MLE}^t = -\log P(y_t^* \mid \{y^*\}_{<t})$
$\mathcal{L}_{ULE}^t = \mathcal{L}_{MLE}^t + \alpha \mathcal{L}_{UL}^t$
• Set $\mathcal{C} = \{y^*\}_{<t}$ and you'll train the model to lower the likelihood of previously-seen tokens!
  • Limits repetition!
  • Increases the diversity of the text you learn to generate!
  • (A sketch of this loss follows below.)
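A minimal token-level sketch of this objective, assuming $\mathcal{C}_t$ is the set of previously-seen gold tokens (a simplification of Welleck et al., 2020, not their released implementation):

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, gold_ids, alpha=1.0):
    """Sketch of L_ULE = L_MLE + alpha * L_UL at the token level.
    logits: (T, |V|) scores at each step; gold_ids: (T,) gold next tokens.
    C_t = gold tokens already seen before step t."""
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, gold_ids, reduction="sum")  # -sum_t log P(y*_t | ...)
    probs = log_probs.exp()
    ul = torch.tensor(0.0)
    for t in range(1, len(gold_ids)):
        neg = gold_ids[:t].unique()                 # previously-seen tokens: C_t
        neg = neg[neg != gold_ids[t]]               # don't penalize the current gold token
        if len(neg):                                # -sum_neg log(1 - P(y_neg | ...))
            ul = ul - torch.log1p(-probs[t, neg].clamp(max=1 - 1e-6)).sum()
    return mle + alpha * ul
```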
Exposure Bias
• Training with teacher forcing leads to exposure bias at generation time
• During training, our model's inputs are gold context tokens from real, human-generated texts:
$\mathcal{L}_{MLE} = -\log P(y_t^* \mid \{y^*\}_{<t})$
• At generation time, our model's inputs are previously decoded tokens:
$\mathcal{L}_{dec} = -\log P(\hat{y}_t \mid \{\hat{y}\}_{<t})$

Exposure Bias Solutions
• Scheduled sampling (Bengio et al., 2015)
  • With some probability p, decode a token and feed that as the next input, rather than the gold token
  • Increase p over the course of training
  • Leads to improvements in practice, but can lead to strange training objectives
• Dataset Aggregation (DAgger; Ross et al., 2011)
  • At various intervals during training, generate sequences from your current model
  • Add these sequences to your training set as additional examples
• Sequence re-writing (Guu*, Hashimoto* et al., 2018)
  • Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses)
  • Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype
• Reinforcement learning: cast your text generation model as a Markov decision process
  • State s is the model's representation of the preceding context
  • Actions a are the words that can be generated
  • Policy $\pi$ is the decoder
  • Rewards r are provided by an external score
  • Learn behaviors by rewarding the model when it exhibits them

REINFORCE: Basics
• Sample a sequence from your model, then score it:
$\mathcal{L}_{RL} = -\sum_{t=1}^{T} r(\hat{y}_t) \log P(\hat{y}_t \mid \{y^*\}; \{\hat{y}\}_{<t})$
• Intuition: next time, increase the probability of the sampled token in the same context… but do it more if the reward function returns a high reward

Reward Estimation
• How should we define a reward function? Just use your evaluation metric!
  • BLEU (machine translation; Ranzato et al., ICLR 2016; Wu et al., 2016)
  • ROUGE (summarization; Paulus et al., ICLR 2018; Celikyilmaz et al., NAACL 2018)
  • CIDEr (image captioning; Rennie et al., CVPR 2017)
  • SPIDEr (image captioning; Liu et al., ICCV 2017)
• Be careful about optimizing for the task as opposed to "gaming" the reward!
  • Evaluation metrics are merely proxies for generation quality!
  • "even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality" – Wu et al., 2016
• What behaviors can we tie to rewards?
  • Cross-modality consistency in image captioning (Ren et al., CVPR 2017)
  • Sentence simplicity (Zhang and Lapata, EMNLP 2017)
  • Temporal consistency (Bosselut et al., NAACL 2018)
  • Utterance politeness (Tan et al., TACL 2018)
  • Paraphrasing (Li et al., EMNLP 2018)
  • Sentiment (Gong et al., NAACL 2019)
  • Formality (Gong et al., NAACL 2019)
• If you can formalize a behavior as a reward function (or train a neural network to approximate it!), you can train a text generation model to exhibit that behavior!

The dark side…
• Need to pretrain a model with teacher forcing before doing RL training; your reward function probably expects coherent language inputs…
• Need to set an appropriate baseline b, subtracted from the reward:
$\mathcal{L}_{RL} = -\sum_{t=1}^{T} (r(\hat{y}_t) - b) \log P(\hat{y}_t \mid \{y^*\}; \{\hat{y}\}_{<t})$
  • Use linear regression to predict it from the state s (Ranzato et al., 2015)
  • Decode a second sequence and use its reward as the baseline (Rennie et al., 2017)
• Your model will learn the easiest way to exploit your reward function
  • Mitigate these shortcuts or hope that's aligned with the behavior you want!
• A REINFORCE sketch follows below
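A hedged REINFORCE sketch using a single sequence-level reward with a baseline; the lecture's formulation allows per-token rewards $r(\hat{y}_t)$, and a scalar reward is a common simplification. The reward and baseline values below are fake stand-ins for a metric score and a baseline estimate.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, sampled_ids, reward, baseline=0.0):
    """L = -(r - b) * sum_t log P(yhat_t | context), for one sampled sequence.
    logits: (T, |V|) scores at each step along the *sampled* sequence;
    sampled_ids: (T,) the tokens that were sampled."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)
    return -(reward - baseline) * chosen.sum()   # high reward pushes samples up

# Stand-in usage: reward/baseline would come from e.g. BLEU/ROUGE and a second
# decoded sequence (Rennie et al., 2017); the numbers here are fake.
logits = torch.randn(8, 100, requires_grad=True)
sampled = torch.randint(0, 100, (8,))
loss = reinforce_loss(logits, sampled, reward=0.7, baseline=0.5)
```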
Training: Takeaways
• Teacher forcing is still the premier algorithm for training text generation models
• Diversity is an issue with sequences generated from teacher-forced models
  • New approaches focus on mitigating the effects of common words
• Exposure bias causes text generation models to lose coherence easily
  • Models must learn to recover from their own bad samples (e.g., scheduled sampling, DAgger)
  • Or not be allowed to generate bad text to begin with (e.g., retrieval + generation)
• Training with RL can allow models to learn behaviors that are challenging to formalize
  • Learning can be very unstable!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Types of evaluation methods for text generation
• Content overlap metrics, model-based metrics, and human evaluations
• Running example: Ref: "They walked to the grocery store." / Gen: "The woman went to the hardware store."
(Some slides repurposed from Asli Celikyilmaz's EMNLP 2020 tutorial)

Content overlap metrics
• Compute a score that indicates the similarity between generated and gold-standard (human-written) text
• Fast, efficient, and widely used
• Two broad categories:
  • N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr)
  • Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr)

Word overlap based metrics (BLEU, ROUGE, METEOR, CIDEr, F1, etc.)
• They're not ideal, even for machine translation
• They get progressively much worse for tasks that are more open-ended than machine translation
  • Worse for summarization, as longer output texts are harder to measure and extractive methods that copy from documents are preferred
  • Much worse for dialogue, which is more open-ended than summarization
  • Much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem like you're getting decent scores!

A simple failure case
• Reference answer to "Are you going to Antoine's incredible CS224N lecture?": "Heck yes!"
• N-gram overlap scores for candidate responses:
  • "You know it!" → 0.61
  • "Yes!" → 0.25
  • "Yup." → 0 (false negative: right meaning, no overlap)
  • "Heck no!" → 0.67 (false positive: wrong meaning, high overlap)
• N-gram overlap metrics have no concept of semantic relatedness! (A scoring sketch follows below.)

A more comprehensive failure analysis
• Word-overlap metrics correlate poorly with human judgments of dialogue quality (Liu et al., EMNLP 2016)
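To make the failure case concrete, here is a tiny unigram-F1 overlap scorer, an illustrative stand-in for BLEU/ROUGE-style metrics rather than any official implementation:

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Harmonic mean of unigram precision and recall via clipped overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())          # matched unigram counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("heck yes !", "heck no !"))      # ~0.67: high score, wrong meaning
print(unigram_f1("heck yes !", "yup ."))          # 0.0: zero score, right meaning
```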
Semantic overlap metrics
• SPICE: Semantic Propositional Image Caption Evaluation, an image captioning metric that first parses the reference text into an abstract scene graph representation (Anderson et al., 2016)
• SPIDEr: a combination of semantic graph similarity (SPICE) and an n-gram similarity measure (CIDEr); SPIDEr yields a more complete quality evaluation metric (Liu et al., 2017)
• PYRAMID: incorporates human content-selection variation in summarization evaluation; identifies Summarization Content Units (SCUs) to compare information content in summaries (Nenkova et al., 2007)

Model-based metrics
• Use learned representations of words and sentences to compute semantic similarity between generated and reference texts
• No more n-gram bottleneck, because text units are represented as embeddings!
• Even though the embeddings are pretrained, the distance metrics used to measure similarity can be fixed

Model-based metrics: Word distance functions
• Vector similarity: embedding-based similarity for semantic distance between texts; variants include Embedding Average (Liu et al., 2016), Vector Extrema (Liu et al., 2016), MEANT (Lo, 2017), and YISI (Lo, 2019)
• Word Mover's Distance: measures the distance between two sequences (e.g., sentences, paragraphs) using word embedding similarity matching (Kusner et al., 2015; Zhao et al., 2019)
• BERTSCORE: uses pretrained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020); a sketch of the core matching step follows below
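A minimal sketch of the greedy cosine-matching step at the heart of BERTScore-style metrics. The random embeddings are stand-ins for real encoder outputs, and this omits BERTScore's IDF weighting and score rescaling:

```python
import torch
import torch.nn.functional as F

def greedy_match_f1(ref_emb, cand_emb):
    """Greedy cosine matching between token embeddings (sketch only).
    ref_emb: (n_ref, d) and cand_emb: (n_cand, d) contextual token embeddings."""
    sim = F.normalize(ref_emb, dim=-1) @ F.normalize(cand_emb, dim=-1).T  # cosine matrix
    recall = sim.max(dim=1).values.mean()      # each reference token's best match
    precision = sim.max(dim=0).values.mean()   # each candidate token's best match
    return (2 * precision * recall / (precision + recall)).item()

# Random stand-ins; in practice these come from a pretrained encoder (e.g., BERT).
ref_emb, cand_emb = torch.randn(5, 768), torch.randn(7, 768)
score = greedy_match_f1(ref_emb, cand_emb)
```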
Model-based metrics: Beyond word matching
• Sentence Mover's Similarity: based on Word Mover's Distance; evaluates text in a continuous space using sentence embeddings from recurrent neural network representations (Clark et al., 2019)
• BLEURT: a regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020)

Human evaluations
• Automatic metrics fall short of matching human decisions
• Most important form of evaluation for text generation systems: >75% of generation papers at ACL 2019 included human evaluations
• Gold standard in developing new automatic metrics: new automated metrics must correlate well with human evaluations!
• Ask humans to evaluate the quality of generated text, either overall or along some specific dimension: fluency, coherence/consistency, factuality and correctness, commonsense, style/formality, grammaticality, typicality, redundancy (for details, see Celikyilmaz, Clark, and Gao, 2020)
• Note: don't compare human evaluation scores across differently-conducted studies, even if they claim to evaluate the same dimensions!

Human evaluation: Issues
• Human judgments are regarded as the gold standard, and of course human evaluation is slow and expensive… but are those the only problems?
• No! Conducting human evaluation effectively is very difficult. Humans:
  • are inconsistent
  • can be illogical
  • lose concentration
  • misinterpret your question
  • can't always explain why they feel the way they do

Learning from human feedback
• ADEM: a metric learned from human judgments for dialogue system evaluation in a chatbot setting (Lowe et al., 2017)
• HUSE: Human Unified with Statistical Evaluation; determines the similarity of the output distribution and a human reference distribution (Hashimoto et al., 2019)

Evaluation: Takeaways
• Content overlap metrics provide a good starting point for evaluating the quality of generated text, but they're not good enough on their own
• Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable
• Human judgments are critical: they're the only ones that can directly evaluate factuality (is the model saying correct things?), but humans are inconsistent!
• In many cases, the best judge of output quality is YOU! Look at your model generations; don't just rely on numbers!

Components of NLG Systems
• What is NLG?
• Formalizing NLG: a simple model and training algorithm
• Decoding from NLG models
• Training NLG models
• Evaluating NLG systems
• Ethical considerations

Warning: some of the content on the next few slides may be disturbing.

Ethics of text generation systems
• Tay: a chatbot released by Microsoft in 2016
• Within 24 hours, it started making toxic racist and sexist comments
• What went wrong? (https://en.wikipedia.org/wiki/Tay_(bot))

Ethics: Biases in text generation models
• Text generation models are often constructed from pretrained language models
• Language models learn harmful patterns of bias from large language corpora
• When prompted for this information, they repeat negative stereotypes (Sheng et al., EMNLP 2019)

Hidden Biases: Universal adversarial triggers
• The learned behaviors of text generation models are opaque
• Adversarial inputs can trigger VERY toxic content
• These models can be exploited in open-world contexts by ill-intentioned users (Wallace et al., EMNLP 2019)

Hidden Biases: Triggered innocuously
• Pretrained language models can degenerate into toxic text even from seemingly innocuous prompts
• Models should not be deployed without proper safeguards to control for toxic content
• Models should not be deployed without careful consideration of how users will interact with them (Gehman et al., EMNLP Findings 2020)

Ethics: Think about what you're building
• Large-scale pretrained language models allow us to build NLG systems for many new applications
• Does the content we're building a system to automatically generate… really need to be generated? (Zellers et al., NeurIPS 2019)

Concluding Thoughts
• Interacting with natural language generation systems quickly shows their limitations
• Even in tasks with more progress, there are still many improvements ahead
• Evaluation remains a huge challenge: we need better ways of automatically evaluating the performance of NLG systems
• With the advent of large-scale language models, deep NLG research has been reset; it's never been easier to jump into the space!
• One of the most exciting areas of NLP to work in!