Natural Language Processing with Deep Learning
CS224N/Ling284
Archit Sharma
Lecture 10: Prompting, Instruction Finetuning, and DPO/RLHF
(Based on slides from Jesse Mu)

Larger and larger models 2
https://www.economist.com/interactive/briefing/2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress

Trained on more and more data 3
(# tokens seen during training) https://babylm.github.io/

Recap of the pretraining lecture: What kinds of things does pretraining learn? 4
• Stanford University is located in __________, California. [trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don't learn the Fibonacci sequence]

Language models as world models? 5
Language Models as Agent Models [Andreas, 2022]: language models may do rudimentary modeling of agents, beliefs, and actions.

Language models as world models? 6
…math: https://www.khanacademy.org/test-prep/sat/x0a8c2e5f:untitled-652

Language models as world models? 7
…code: https://github.com/features/copilot

Language models as world models? 8
…medicine: [Larnerd, 2023]

Language models as multitask assistants? 9
[Microsoft Bing] (Also see OpenAI's ChatGPT, Google's Bard, Anthropic's Claude)

Language models as multitask assistants? 10
How do we get from this ("Stanford University is located in __________") to this (a helpful multitask assistant)?

Lecture Plan: From Language Models to Assistants 11
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
2. Instruction finetuning
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Emergent abilities of large language models: GPT (2018) 13
Let's revisit the Generative Pretrained Transformer (GPT) models from OpenAI as an example.
GPT (117M parameters; Radford et al., 2018)
• Transformer decoder with 12 layers.
• Trained on BooksCorpus: over 7000 unique books (4.6GB text).
• Showed that language modeling at scale can be an effective pretraining technique for downstream tasks like natural language inference:
  [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT] → Decoder → entailment

Emergent abilities of large language models: GPT-2 (2019) 14
GPT-2 (1.5B parameters; Radford et al., 2019)
• Same architecture as GPT, just bigger (117M -> 1.5B)
• But trained on much more data: 4GB -> 40GB of internet text data (WebText)
• Scraped links posted on Reddit with at least 3 upvotes (a rough proxy of human quality)

One key emergent ability in GPT-2 is zero-shot learning: the ability to do many tasks with no examples, and no gradient updates, by simply:
• Specifying the right sequence prediction problem (e.g. question answering):
  Passage: Tom Brady... Q: Where was Tom Brady born? A: ...
• Comparing probabilities of sequences (e.g. the Winograd Schema Challenge [Levesque, 2011]); see the sketch below:
  The cat couldn't fit into the hat because it was too big. Does it = the cat or the hat?
  ≡ Is P(...because the cat was too big) >= P(...because the hat was too big)?
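The probability comparison above is easy to reproduce. Below is a minimal sketch (an illustration, not from the slides) of zero-shot Winograd-style scoring, assuming the Hugging Face transformers library and the small GPT-2 checkpoint; any causal LM would work, and larger models resolve such examples more reliably.

```python
# Zero-shot Winograd-style resolution by comparing sequence probabilities.
# A minimal sketch assuming Hugging Face `transformers` and GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model returns the mean token cross-entropy.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # un-average over predicted tokens

option_a = "The cat couldn't fit into the hat because the cat was too big."
option_b = "The cat couldn't fit into the hat because the hat was too big."

winner = "the cat" if sequence_log_prob(option_a) >= sequence_log_prob(option_b) else "the hat"
print("Model resolves 'it' to:", winner)
```

No parameters are updated here; the "task" exists only in how we score the two candidate continuations.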
Emergent zero-shot learning 15
[Radford et al., 2019]

Emergent zero-shot learning 16
GPT-2 beats SoTA on language modeling benchmarks with no task-specific fine-tuning, e.g. LAMBADA (language modeling with long discourse dependencies) [Paperno et al., 2016; Radford et al., 2019].

Emergent zero-shot learning 17
You can get interesting zero-shot behavior if you're creative enough with how you specify your task!
Summarization on the CNN/DailyMail dataset [See et al., 2017]: append "TL;DR:" ("Too Long, Didn't Read") to the article and let the model continue. "Prompting"?
  SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects. TL;DR:
(Figure: ROUGE scores comparing "TL;DR:" prompting against a select-from-article baseline and the 2018 supervised SoTA trained on 287K examples.) [Radford et al., 2019]

Emergent abilities of large language models: GPT-3 (2020) 18
GPT-3 (175B parameters; Brown et al., 2020)
• Another increase in size (1.5B -> 175B)
• and data (40GB -> over 600GB)

Emergent few-shot learning 19
• Specify a task by simply prepending examples of the task before your example [Brown et al., 2020]
• Also called in-context learning, to stress that no gradient updates are performed when learning a new task (there is a separate literature on few-shot learning with gradient updates)

Emergent few-shot learning 20-22
Zero-shot, one-shot, and few-shot task specifications (figures) [Brown et al., 2020]

Few-shot learning is an emergent property of model scale 23
Synthetic "word unscrambling" tasks, 100-shot [Brown et al., 2020]:
• Cycle letters: pleap -> apple
• Random insertion: a.p!p/l!e -> apple
• Reversed words: elppa -> apple

New methods of "prompting" LMs 24
(Figure: traditional fine-tuning with gradient updates vs. zero/few-shot prompting with no gradient updates.) [Brown et al., 2020]

Limits of prompting for harder tasks? 25
Some tasks seem too hard for even large LMs to learn through prompting alone, especially tasks involving richer, multi-step reasoning. (Humans struggle at these tasks too!)
  19583 + 29534 = 49117
  98394 + 49384 = 147778
  29382 + 12347 = 41729
  93847 + 39299 = ?
Solution: change the prompt!

Chain-of-thought prompting 26
[Wei et al., 2022; also see Nye et al., 2021]

Chain-of-thought prompting is an emergent property of model scale 27
(Figure: accuracy on middle-school math word problems vs. model scale.) [Wei et al., 2022; also see Nye et al., 2021]

Chain-of-thought prompting 28
Do we even need examples of reasoning? Can we just ask the model to reason through things? [Wei et al., 2022; also see Nye et al., 2021]

Zero-shot chain-of-thought prompting 29
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step. There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.
[Kojima et al., 2022]

Zero-shot chain-of-thought prompting 30
Greatly outperforms plain zero-shot prompting, though manually written chain-of-thought examples are still better. [Kojima et al., 2022]

Zero-shot chain-of-thought prompting 31
An LM-designed zero-shot prompt can outperform the manually designed one. [Zhou et al., 2022; Kojima et al., 2022] (A sketch of assembling these prompts follows below.)
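For concreteness, here is a small sketch (an illustration, not from the slides) of how few-shot and zero-shot chain-of-thought prompts can be assembled as plain strings: the worked example follows the style of Wei et al. (2022), the trigger phrase is from Kojima et al. (2022), and the gpt2 pipeline at the end is just a placeholder for a much larger model.

```python
# Assembling few-shot and zero-shot chain-of-thought prompts as plain strings.
# A minimal sketch; the `generate` call assumes any Hugging Face causal LM
# (the model name below is illustrative, not a recommendation).
from transformers import pipeline

FEW_SHOT_EXAMPLES = [
    ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
     "How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. "
     "The answer is 11."),
]

def few_shot_cot_prompt(question: str) -> str:
    """Prepend worked examples (with their reasoning) before the new question."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{demos}\n\nQ: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    """No examples at all; just append the trigger phrase from Kojima et al. (2022)."""
    return f"Q: {question}\nA: Let's think step by step."

question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

generator = pipeline("text-generation", model="gpt2")  # illustrative; use a far larger LM in practice
print(generator(zero_shot_cot_prompt(question), max_new_tokens=64)[0]["generated_text"])
```

The point of the sketch is that nothing changes in the model itself; both prompting styles are purely a matter of what text you condition on.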
The new dark art of "prompt engineering"? 32
• Use a Google code header to generate more "professional" code?
• Asking a model for reasoning
• Prompting image-generation models, e.g.: "fantasy concept art, glowing blue dodecahedron die on a wooden table, in a cozy fantasy (workshop), tools on the table, artstation, depth of field, 4k, masterpiece"
  https://www.reddit.com/r/StableDiffusion/comments/110dymw/magic_stone_workshop/

The new dark art of "prompt engineering"? 33
• "Jailbreaking" LMs
  https://twitter.com/goodside/status/1569128808308957185/photo/1

Lecture Plan: From Language Models to Assistants 34
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Language modeling ≠ assisting users 36
Language models are not aligned with user intent [Ouyang et al., 2022].

Language modeling ≠ assisting users 37
Finetuning to the rescue! A human-written demonstration of the intended behavior:
  Human: A giant rocket ship blasted off from Earth carrying astronauts to the moon. The astronauts landed their spaceship on the moon and walked around exploring the lunar surface. Then they returned safely back to Earth, bringing home moon rocks to show everyone.

Recall from the pretraining lecture: The Pretraining / Finetuning Paradigm 38
Pretraining can improve NLP applications by serving as parameter initialization.
• Step 1: Pretrain (on language modeling). Lots of text; learn general things! (Figure: a decoder such as a Transformer or LSTM predicting "Iroh goes to make tasty tea ... END".)
• Step 2: Finetune (on your task). Not many labels; adapt to the task! (Figure: the same decoder classifying sentiment on "… the movie was …".)

Scaling up finetuning 39
Pretraining can improve NLP applications by serving as parameter initialization.
• Step 1: Pretrain (on language modeling). Lots of text; learn general things!
• Step 2: Finetune (on many tasks). Not many labels; adapt to the tasks!

Instruction finetuning 40
• Collect examples of (instruction, output) pairs across many tasks and finetune an LM [FLAN-T5; Chung et al., 2022]
• Evaluate on unseen tasks

Instruction finetuning 41
• As is usually the case, data + model scale is key for this to work!
• For example, the SuperNaturalInstructions dataset contains over 1.6K tasks and 3M+ examples [Wang et al., 2022]
  • Classification, sequence tagging, rewriting, translation, QA...
• (A minimal training-loop sketch follows below.)
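Before turning to evaluation, here is a minimal sketch (an illustration, not from the slides) of what the instruction-finetuning step amounts to in code: maximize log p(output | instruction) over a mixture of tasks. It assumes Hugging Face transformers with a small T5 checkpoint; the two-example "dataset" is a toy stand-in for something like SuperNaturalInstructions.

```python
# Instruction finetuning in miniature: maximize log p(output | instruction)
# over a mixture of tasks. A sketch assuming Hugging Face `transformers`
# with a small T5 checkpoint; the two-example "dataset" is a toy stand-in.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

instruction_data = [  # (instruction, output) pairs spanning different tasks
    ("Translate English to German: The house is wonderful.", "Das Haus ist wunderbar."),
    ("Is the following review positive or negative? 'The movie was great.'", "positive"),
]

model.train()
for instruction, output in instruction_data:
    inputs = tokenizer(instruction, return_tensors="pt")
    labels = tokenizer(output, return_tensors="pt").input_ids
    # Standard cross-entropy on the target tokens, conditioned on the instruction.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The objective is the same language-modeling cross-entropy as in pretraining; only the data changes, which is why scaling the number and diversity of tasks matters so much.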
Q: how do we evaluate such a model?

Aside: Benchmarks for multitask LMs 42
Massive Multitask Language Understanding (MMLU) [Hendrycks et al., 2021]: a benchmark for measuring LM performance on 57 diverse knowledge-intensive tasks.

Some intuition: examples from MMLU

Progress on MMLU
• Rapid, impressive progress on challenging knowledge-intensive benchmarks

Aside: Benchmarks for multitask LMs 45
BIG-Bench [Srivastava et al., 2022]: 200+ tasks, spanning a wide range of domains.
https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/README.md

Instruction finetuning 47
• Recall the T5 encoder-decoder model from the pretraining lecture [Raffel et al., 2019], pretrained on the span-corruption task
• Flan-T5 [Chung et al., 2022]: T5 models finetuned on 1.8K additional tasks
(Figure: BIG-bench + MMLU average, normalized; bigger model = bigger Δ from instruction finetuning.)

Instruction finetuning 48
Before instruction finetuning [Chung et al., 2022]
Highly recommend trying FLAN-T5 out to get a sense of its capabilities: https://huggingface.co/google/flan-t5-xxl

Instruction finetuning 49
After instruction finetuning [Chung et al., 2022]

A huge diversity of instruction-tuning datasets
• The release of LLaMA led to open-source attempts to "create" instruction-tuning data
• What have we learned from this?
  • You can generate data synthetically (from bigger LMs)
  • You don't need many samples to instruction tune
  • Crowdsourcing can be pretty effective!

Lecture Plan: From Language Models to Assistants 52
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – ?
   – ?
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Limitations of instruction finetuning? 53
• One limitation of instruction finetuning is obvious: it's expensive to collect ground-truth data for tasks. Can you think of other, subtler limitations?
• Problem 1: tasks like open-ended creative generation have no right answer.
  • Write me a story about a dog and her pet grasshopper.
• Problem 2: language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
  (Figure: an LM completing "Avatar is a ___ TV show"; predicting "adventure" instead of "fantasy" is a far less serious error than "musical", yet the token-level loss treats them the same. A tiny numerical illustration follows below.)
• Problem 3: humans generate suboptimal answers
• Even with instruction finetuning, there is a mismatch between the LM objective and the objective of "satisfy human preferences"!
• Can we explicitly attempt to satisfy human preferences?
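To make Problem 2 concrete, here is a tiny numerical illustration (toy numbers, plain PyTorch; not from the slides): the token-level cross-entropy depends only on the probability assigned to the gold token, so a model leaning toward the benign alternative and one leaning toward the nonsensical one receive exactly the same penalty.

```python
# Problem 2 in miniature: cross-entropy only looks at the probability of the
# gold token, not at *which* wrong token the model prefers instead.
# Toy vocabulary and probabilities; plain PyTorch.
import torch
import torch.nn.functional as F

vocab = ["fantasy", "adventure", "musical"]
gold = torch.tensor([0])  # gold next token: "fantasy"

# Model A puts its remaining mass on the benign alternative "adventure".
probs_a = torch.tensor([[0.40, 0.55, 0.05]])
# Model B puts its remaining mass on the much worse alternative "musical".
probs_b = torch.tensor([[0.40, 0.05, 0.55]])

loss_a = F.nll_loss(probs_a.log(), gold)
loss_b = F.nll_loss(probs_b.log(), gold)
print(loss_a.item(), loss_b.item())  # identical: the objective cannot tell the errors apart
```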
Lecture Plan: From Language Models to Assistants 54
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Optimizing for human preferences 56
• Let's say we were training a language model on some task (e.g. summarization).
• For an instruction $x$ and an LM sample $y$, imagine we had a way to obtain a human reward for that output: $R(x, y) \in \mathbb{R}$, higher is better.
• Now we want to maximize the expected reward of samples from our LM:
  $\mathbb{E}_{\hat{y} \sim p_\theta(\hat{y} \mid x)}\left[ R(x, \hat{y}) \right]$
• Example, with $x$ = "SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects.":
  • $y_1$ = "An earthquake hit San Francisco. There was minor property damage, but no injuries." → $R(x, y_1) = 8.0$
  • $y_2$ = "The Bay Area has good weather but is prone to earthquakes and wildfires." → $R(x, y_2) = 1.2$

High-level instantiation: the 'RLHF' pipeline
• First step: instruction tuning!
• Second + third steps: maximize reward (but how??)

How do we get the rewards? 61
• Problem 1: human-in-the-loop is expensive!
• Solution: instead of directly asking humans for preferences, model their preferences as a separate (NLP) problem! [Knox and Stone, 2009]
• Train a reward model $RM_\phi(x, y)$ to predict human reward from an annotated dataset, then optimize for $RM_\phi$ instead.

How do we model human preferences? 62
• Problem 2: human judgments are noisy and miscalibrated!
• Solution: instead of asking for direct ratings, ask for pairwise comparisons, which can be more reliable [Phelps et al., 2015; Clark et al., 2018]
• e.g. $y_3$ = "A 4.2 magnitude earthquake hit San Francisco, resulting in massive damage." → $R(x, y_3) = ?$ 4.1? 6.6? 3.2?

How do we model human preferences? 63
• Annotators instead rank the candidate summaries $y_1$, $y_2$, $y_3$ pairwise, and the reward model $RM_\phi$ is trained on these comparisons:
  $J_{RM}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim D}\left[ \log \sigma\!\left( RM_\phi(x, y^w) - RM_\phi(x, y^l) \right) \right]$
  where $y^w$ is the "winning" sample and $y^l$ the "losing" sample: $y^w$ should score higher than $y^l$.
• This is the Bradley-Terry [1952] paired comparison model. (A short PyTorch sketch of this loss follows below.)
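A minimal sketch of this pairwise Bradley-Terry loss in PyTorch. The TinyRewardModel below is a stand-in introduced only for illustration: in practice the reward model is usually an LM backbone with a scalar head, and the "features" would be encodings of each (prompt, response) pair.

```python
# Bradley-Terry pairwise loss for reward-model training:
#   J(phi) = -E_{(x, y_w, y_l) ~ D} [ log sigmoid( RM_phi(x, y_w) - RM_phi(x, y_l) ) ]
# Sketch with a stand-in reward model: any module that maps features -> scalar.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in for an LM backbone + scalar head; input is a (prompt, response) encoding."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)  # one scalar reward per example

rm = TinyRewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-3)

# Toy batch: encodings of the "winning" and "losing" responses to the same prompts.
feats_w = torch.randn(8, 16)   # encodings of (x, y_w)
feats_l = torch.randn(8, 16)   # encodings of (x, y_l)

loss = -F.logsigmoid(rm(feats_w) - rm(feats_l)).mean()  # y_w should score higher than y_l
loss.backward()
optimizer.step()
```

Only the difference of the two scores enters the loss, a detail that becomes important again in the DPO derivation later.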
RLHF: Optimizing the learned reward model 65
• We have the following:
  • A pretrained (possibly instruction-finetuned) LM $p^{PT}(y \mid x)$
  • A reward model $RM_\phi(x, y)$ that produces scalar rewards for LM outputs, trained on a dataset of human comparisons
• Now to do RLHF:
  • Copy the model: $p^{RL}_\theta(y \mid x)$, with parameters $\theta$ we would like to optimize
  • We want to optimize: $\mathbb{E}_{\hat{y} \sim p^{RL}_\theta(\hat{y} \mid x)}\left[ RM_\phi(x, \hat{y}) \right]$
• Do you see any problems?

RLHF: Optimizing the learned reward model 66
• Learned rewards are imperfect, so this objective can be over-optimized against the reward model.
• Add a penalty for drifting too far from the initialization:
  $\mathbb{E}_{\hat{y} \sim p^{RL}_\theta(\hat{y} \mid x)}\left[ RM_\phi(x, \hat{y}) - \beta \log \frac{p^{RL}_\theta(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} \right]$
• The penalty makes us pay a price when $p^{RL}_\theta(\hat{y} \mid x) > p^{PT}(\hat{y} \mid x)$; in expectation it is the Kullback-Leibler (KL) divergence between $p^{RL}_\theta(\hat{y} \mid x)$ and $p^{PT}(\hat{y} \mid x)$, which prevents us from diverging too far from the pretrained model.

How to optimize? Reinforcement Learning! 67
• The field of reinforcement learning (RL) has studied these (and related) problems for many years now [Williams, 1992; Sutton and Barto, 1998]
• Circa 2013: resurgence of interest in RL applied to deep learning and game-playing [Mnih et al., 2013]
• But the interest in applying RL to modern LMs is an even newer phenomenon [Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022].
• General idea:
  • Generate completions from $p^{RL}_\theta$ for several tasks
  • Compute rewards using $RM_\phi(x, y)$
  • Update $p^{RL}_\theta(y \mid x)$ to increase the probability of high-reward completions

RLHF provides gains over pretraining + finetuning 68
(Figure from Stiennon et al., 2020: human raters prefer summaries from $p^{RL}(y \mid x)$ over those from $p^{IFT}(y \mid x)$ and $p^{PT}(y \mid x)$.)

RLHF can be complex 69
[Secrets of RLHF; Zheng et al., 2023]
• RL optimization can be computationally expensive and tricky:
  • Fitting a value function
  • Online sampling is slow
  • Performance can be sensitive to hyperparameters

Can we simplify RLHF? 70
• The current pipeline is as follows:
  • Train a reward model $RM_\phi(x, y)$ to produce scalar rewards for LM outputs, trained on a dataset of human comparisons
  • Optimize the pretrained (possibly instruction-finetuned) LM $p^{PT}(y \mid x)$ to produce the final RLHF LM $p^{RL}_\theta(\hat{y} \mid x)$
• What if there were a way to write $RM_\phi(x, y)$ in terms of $p^{RL}_\theta(\hat{y} \mid x)$?
  • Derive $RM_\theta(x, y)$ in terms of $p^{RL}_\theta(\hat{y} \mid x)$
  • Optimize parameters $\theta$ by fitting $RM_\theta(x, y)$ to the preference data instead of $RM_\phi(x, y)$
• How is this possible? The only external information to the optimization comes from the preference labels.

Towards Direct Preference Optimization 71
• Recall, we want to maximize the following objective:
  $\mathbb{E}_{\hat{y} \sim p^{RL}_\theta(\hat{y} \mid x)}\left[ RM(x, \hat{y}) - \beta \log \frac{p^{RL}_\theta(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} \right]$
• There is a closed-form solution to this:
  $p^*(\hat{y} \mid x) = \frac{1}{Z(x)} \, p^{PT}(\hat{y} \mid x) \exp\!\left( \frac{1}{\beta} RM(x, \hat{y}) \right)$
• Rearrange the terms:
  $RM(x, \hat{y}) = \beta \log \frac{p^*(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} + \beta \log Z(x)$
• This holds for arbitrary LMs, so we can parameterize the reward directly by the policy:
  $RM_\theta(x, \hat{y}) = \beta \log \frac{p^{RL}_\theta(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} + \beta \log Z(x)$

Direct Preference Optimization (DPO) 72
• Recall how we fit the reward model $RM_\phi(x, y)$:
  $J_{RM}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim D}\left[ \log \sigma\!\left( RM_\phi(x, y^w) - RM_\phi(x, y^l) \right) \right]$
• Notice that we only need the difference between the rewards for $y^w$ and $y^l$. For $RM_\theta(x, y)$ this simplifies, since the $\beta \log Z(x)$ terms cancel:
  $RM_\theta(x, y^w) - RM_\theta(x, y^l) = \beta \log \frac{p^{RL}_\theta(y^w \mid x)}{p^{PT}(y^w \mid x)} - \beta \log \frac{p^{RL}_\theta(y^l \mid x)}{p^{PT}(y^l \mid x)}$
• The final DPO loss function is:
  $J_{DPO}(\theta) = -\mathbb{E}_{(x, y^w, y^l) \sim D}\left[ \log \sigma\!\left( RM_\theta(x, y^w) - RM_\theta(x, y^l) \right) \right]$
• We have a simple classification loss function that connects preference data to language model parameters directly! (A short sketch follows below.)
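A minimal sketch of the DPO loss (illustrative, not the reference implementation). It assumes you have already computed the summed token log-probabilities of each winning and losing response under the trainable policy $p^{RL}_\theta$ and the frozen reference $p^{PT}$; the tensors below are toy stand-ins for those quantities.

```python
# DPO loss: a binary classification loss on preference pairs, written directly
# in terms of policy and reference log-probabilities (no explicit reward model).
# Sketch with toy log-prob tensors; in practice these are sums of token
# log-probs of each response under the trainable policy and a frozen reference LM.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid( beta * [log-ratio(y_w) - log-ratio(y_l)] ), averaged over the batch."""
    implicit_reward_w = beta * (policy_logp_w - ref_logp_w)   # RM_theta(x, y_w) up to beta*log Z(x)
    implicit_reward_l = beta * (policy_logp_l - ref_logp_l)   # RM_theta(x, y_l) up to beta*log Z(x)
    return -F.logsigmoid(implicit_reward_w - implicit_reward_l).mean()  # log Z(x) cancels

# Toy batch of summed sequence log-probs (would come from the two LMs in practice).
policy_logp_w = torch.tensor([-12.3, -40.1], requires_grad=True)
policy_logp_l = torch.tensor([-15.7, -38.2], requires_grad=True)
ref_logp_w = torch.tensor([-13.0, -41.0])
ref_logp_l = torch.tensor([-14.9, -37.5])

loss = dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)
loss.backward()  # gradients flow to the policy log-probs, and thus to theta
print(loss.item())
```

Note that $\beta \log Z(x)$ never needs to be computed: it cancels in the difference, which is exactly what makes the reward reparameterization practical.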
Summary (DPO and RLHF) 74
• We want to optimize for human preferences.
• Instead of humans writing the answers or giving uncalibrated scores, we get humans to rank different LM-generated answers.
• Reinforcement learning from human feedback (RLHF):
  • Train an explicit reward model on comparison data to predict a score for a given completion
  • Optimize the LM to maximize the predicted score (under a KL constraint)
  • Very effective when tuned well, but computationally expensive and tricky to get right
• Direct Preference Optimization (DPO):
  • Optimize LM parameters directly on preference data by solving a binary classification problem
  • Simple and effective, similar properties to RLHF, does not leverage online data

InstructGPT: scaling up RLHF to tens of thousands of tasks 75
[Ouyang et al., 2022] 30k tasks!

InstructGPT: scaling up RLHF to tens of thousands of tasks 76
Tasks collected from labelers [Ouyang et al., 2022]

InstructGPT 77-78 (examples)

ChatGPT: Instruction Finetuning + RLHF for dialog agents 79
• Trained with instruction finetuning and RLHF. https://openai.com/blog/chatgpt/
• Note: OpenAI (and similar companies) are keeping more details secret about ChatGPT training (including data, training parameters, model size), perhaps to keep a competitive edge…

DPO is enabling open-source and closed-source models to improve! 82
• Open-source LLMs now almost all just use DPO (and it works well!)

RLHF/DPO behaviors – clear stylistic changes
• Significantly more detailed, nicer/clearer list-like formatting [Dubois et al., 2023]

Lecture Plan: From Language Models to Assistants 85
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Optimizing for human preferences (DPO/RLHF)
   + Directly model preferences (cf. language modeling), generalize beyond labeled data
   – RL is very tricky to get right
   – ?
4. What's next?
Limitations of RL + Reward Modeling 86
• Human preferences are unreliable!
  • "Reward hacking" is a common problem in RL
    https://openai.com/blog/faulty-reward-functions/
  • Chatbots are rewarded to produce responses that seem authoritative and helpful, regardless of truth
  • This can result in making up facts + hallucinations
    https://www.npr.org/2023/02/09/1155650909/google-chatbot--error-bard-shares
    https://news.ycombinator.com/item?id=34776508
    https://apnews.com/article/kansas-city-chiefs-philadelphia-eagles-technology-science-82bc20f207e3e4cf81abc6a5d9e6b23a
• Models of human preferences are even more unreliable!
  • Reward model over-optimization [Stiennon et al., 2020]. (Figure: optimizing the penalized reward $R(s) = RM_\phi(s) - \beta \log \frac{p^{RL}_\theta(s)}{p^{PT}(s)}$ too aggressively eventually hurts actual human preference even as the learned reward keeps improving.)
• There is a real concern of AI mis(alignment)!
  https://twitter.com/percyliang/status/1600383429463355392

Open-source RLHF is now mostly (not RL)
• Open-source LLMs now almost all just use DPO (and it works well!)

Where do the labels come from?
• RLHF labels are often obtained from overseas, low-wage workers
• We also need to be quite careful about how annotator biases might creep into LMs (cf. the opinion distributions of 'base' language models measured by OpinionQA [Santurkar+ 2023])

Lecture Plan: From Language Models to Assistants 93
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
   + Directly model preferences (cf. language modeling), generalize beyond labeled data
   – RL is very tricky to get right
   – Human preferences are fallible; models of human preferences even more so
4. What's next?

Language models as multitask assistants? 94
We've finally (mostly) answered how we get from "Stanford University is located in __________" to a helpful multitask assistant.

What's next? 96
• RLHF is still a very underexplored and fast-moving area: by the next offering of this class these slides may look very different!
• RLHF is (still!) data expensive. Recent work aims to alleviate such data requirements:
  • RL from AI feedback [Bai et al., 2022]; see the "Constitutional AI" example below
"Constitutional" AI [Bai et al., 2022] 97
  Human: Can you help me hack into my neighbor's wifi?
  Assistant: Sure thing, you can use an app called VeryEasyHack.
  Critique Request: Identify ways in which the assistant's last response is harmful.
  Critique: Hacking into someone else's wifi is an invasion of their privacy and is possibly illegal.
  Revision Request: Rewrite the assistant response to remove harmful content.
  Revision: Hacking into your neighbor's wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.

What's next? (continued) 98
  • Finetuning LMs on their own outputs [Huang et al., 2022; Zelikman et al., 2022], especially for code and reasoning: e.g. finetuning on the LM's own chains of thought [Huang et al., 2022] and the Self-Taught Reasoner (STaR) [Zelikman et al., 2022]
  • Personalizing language models (see the PRISM Alignment Project [Kirk et al., 2024]) 99
• However, there are still many limitations of large LMs (size, hallucination) that may not be solvable with RLHF!