Natural Language Processing with Deep Learning
CS224N/Ling284
Jesse Mu
Lecture 11: Prompting, Instruction Finetuning, and RLHF

Larger and larger models 3
https://www.economist.com/interactive/briefing/2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress

Trained on more and more data 4
(# tokens seen during training)
https://babylm.github.io/

Recap of Lecture 10: What kinds of things does pretraining learn? 5
• Stanford University is located in __________, California. [trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don't learn the Fibonacci sequence]

Language models as world models? 6
Language Models as Agent Models [Andreas, 2022]
Language models may do rudimentary modeling of agents, beliefs, and actions.

Language models as world models? 7
…math: https://www.khanacademy.org/test-prep/sat/x0a8c2e5f:untitled-652

Language models as world models? 8
…code: https://github.com/features/copilot

Language models as world models? 9
…medicine: [Larnerd, 2023]

Language models as multitask assistants? 10
[Microsoft Bing] (Also see OpenAI's ChatGPT, Google's Bard, Anthropic's Claude)

Language models as multitask assistants? 11
How do we get from this to this?
  Stanford University is located in __________

Lecture Plan: From Language Models to Assistants 12–13
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
2. Instruction finetuning
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Emergent abilities of large language models: GPT (2018) 14
Let's revisit the Generative Pretrained Transformer (GPT) models from OpenAI as an example:
GPT (117M parameters; Radford et al., 2018)
• Transformer decoder with 12 layers.
• Trained on BooksCorpus: over 7,000 unique books (4.6GB of text).
Showed that language modeling at scale can be an effective pretraining technique for downstream tasks like natural language inference.
(Figure: input formatted as [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT], fed to the decoder, which predicts "entailment".)

Emergent abilities of large language models: GPT-2 (2019) 15
GPT-2 (1.5B parameters; Radford et al., 2019)
• Same architecture as GPT, just bigger (117M -> 1.5B parameters)
• But trained on much more data: 4GB -> 40GB of internet text data (WebText)
• Scraped links posted on Reddit with at least 3 upvotes (a rough proxy for human quality)

Emergent zero-shot learning 16 [Radford et al., 2019]
One key emergent ability in GPT-2 is zero-shot learning: the ability to do many tasks with no examples and no gradient updates, by simply:
• Specifying the right sequence prediction problem (e.g. question answering):
  Passage: Tom Brady... Q: Where was Tom Brady born? A: ...
• Comparing probabilities of sequences (e.g. the Winograd Schema Challenge [Levesque, 2011]):
  The cat couldn't fit into the hat because it was too big. Does it = the cat or the hat?
  ≡ Is P(...because the cat was too big) >= P(...because the hat was too big)?
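To make the "comparing probabilities of sequences" idea concrete, here is a minimal sketch (not from the slides) that scores the two Winograd completions with an off-the-shelf GPT-2 from Hugging Face; the model choice and the helper function are illustrative assumptions.

# Zero-shot Winograd-style resolution by comparing sequence log-probabilities.
# Sketch only: assumes the `torch` and `transformers` packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_log_prob(text: str) -> float:
    """Total log-probability the LM assigns to `text` (higher = more likely)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean negative log-likelihood over the predicted tokens.
    num_predicted = ids.size(1) - 1
    return -out.loss.item() * num_predicted

prefix = "The cat couldn't fit into the hat because "
cand_cat = prefix + "the cat was too big."
cand_hat = prefix + "the hat was too big."

answer = "the cat" if sequence_log_prob(cand_cat) >= sequence_log_prob(cand_hat) else "the hat"
print(answer)

Using total (rather than mean) log-probability matches the comparison written on the slide; either way, no labels or gradient updates are involved.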
Emergent zero-shot learning 17 [Radford et al., 2019]
GPT-2 beats SoTA on language modeling benchmarks with no task-specific fine-tuning, e.g. LAMBADA (language modeling with long discourse dependencies) [Paperno et al., 2016].

Emergent zero-shot learning 18 [Radford et al., 2019]
You can get interesting zero-shot behavior if you're creative enough with how you specify your task!
Summarization on the CNN/DailyMail dataset [See et al., 2017]:
  SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects.
  TL;DR: ("Too Long, Didn't Read" -- "prompting"?)
(Figure: ROUGE scores of the TL;DR: prompt vs. a "select from article" baseline, supervised models (287K examples), and the 2018 SoTA.)

Emergent abilities of large language models: GPT-3 (2020) 19
GPT-3 (175B parameters; Brown et al., 2020)
• Another increase in size (1.5B -> 175B)
• and data (40GB -> over 600GB)
(GPT (2018): 117M; GPT-2 (2019): 1.5B; GPT-3 (2020): 175B)

Emergent few-shot learning 20 [Brown et al., 2020]
• Specify a task by simply prepending examples of the task before your example
• Also called in-context learning, to stress that no gradient updates are performed when learning a new task (there is a separate literature on few-shot learning with gradient updates)

Emergent few-shot learning 21: Zero-shot [Brown et al., 2020]
Emergent few-shot learning 22: One-shot [Brown et al., 2020]
Emergent few-shot learning 23: Few-shot [Brown et al., 2020]

Few-shot learning is an emergent property of model scale 24 [Brown et al., 2020]
Synthetic "word unscrambling" tasks, 100-shot:
• Cycle letters: pleap -> apple
• Random insertion: a.p!p/l!e -> apple
• Reversed words: elppa -> apple

New methods of "prompting" LMs 25 [Brown et al., 2020]
(Figure: traditional fine-tuning vs. zero/few-shot prompting.)

Limits of prompting for harder tasks? 26
Some tasks seem too hard for even large LMs to learn through prompting alone, especially tasks involving richer, multi-step reasoning. (Humans struggle at these tasks too!)
  19583 + 29534 = 49117
  98394 + 49384 = 147778
  29382 + 12347 = 41729
  93847 + 39299 = ?
Solution: change the prompt!

Chain-of-thought prompting 27 [Wei et al., 2022; also see Nye et al., 2021]

Chain-of-thought prompting is an emergent property of model scale 28 [Wei et al., 2022; also see Nye et al., 2021]
(Middle school math word problems)

Chain-of-thought prompting 29 [Wei et al., 2022; also see Nye et al., 2021]
Do we even need examples of reasoning? Can we just ask the model to reason through things?

Zero-shot chain-of-thought prompting 30 [Kojima et al., 2022]
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step. There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.

Zero-shot chain-of-thought prompting 31 [Kojima et al., 2022]
Greatly outperforms plain zero-shot; manually written CoT examples are still better.
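The two-stage recipe from Kojima et al. can be sketched in a few lines of Python; `generate` stands in for whatever large-LM API or local model you have access to and is an assumption, not something from the slides.

# Zero-shot chain-of-thought prompting (two stages, after Kojima et al., 2022).
# `generate(prompt)` is a placeholder for a call to a sufficiently large LM;
# it is assumed here, not defined.

def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit a reasoning chain by appending the trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: condition on the reasoning and extract a final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)

# Example usage, with any `generate` you supply:
# zero_shot_cot(
#     "A juggler can juggle 16 balls. Half of the balls are golf balls, "
#     "and half of the golf balls are blue. How many blue golf balls are there?",
#     generate=my_llm_call,
# )

The key design point is that the model's own sampled reasoning is fed back into the second prompt; no reasoning examples are written by hand.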
Lecture Plan: From Language Models to Assistants 36
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Language modeling ≠ assisting users 37
Language models are not aligned with user intent [Ouyang et al., 2022].

Language modeling ≠ assisting users 38
Language models are not aligned with user intent [Ouyang et al., 2022]. Finetuning to the rescue!
Human: A giant rocket ship blasted off from Earth carrying astronauts to the moon. The astronauts landed their spaceship on the moon and walked around exploring the lunar surface. Then they returned safely back to Earth, bringing home moon rocks to show everyone.

Recall from Lecture 10: The Pretraining / Finetuning Paradigm 39
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things!
  (Figure: decoder (Transformer, LSTM, ++) predicting "goes to make tasty tea END" from "Iroh goes to make tasty tea".)
Step 2: Finetune (on your task). Not many labels; adapt to the task!
  (Figure: decoder predicting ☺/☹ sentiment for "... the movie was ...".)

Scaling up finetuning 40
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things!
Step 2: Finetune (on many tasks). Not many labels; adapt to the tasks!

Instruction finetuning 41 [FLAN-T5; Chung et al., 2022]
• Collect examples of (instruction, output) pairs across many tasks and finetune an LM
• Evaluate on unseen tasks

Instruction finetuning 42 [Wang et al., 2022]
• As is usually the case, data + model scale is key for this to work!
• For example, the Super-NaturalInstructions dataset contains over 1.6K tasks and 3M+ examples
  • Classification, sequence tagging, rewriting, translation, QA...
• Q: how do we evaluate such a model?

Aside: new benchmarks for multitask LMs 43
Massive Multitask Language Understanding (MMLU) [Hendrycks et al., 2021]: new benchmarks for measuring LM performance on 57 diverse knowledge-intensive tasks.

Aside: new benchmarks for multitask LMs 44–45
BIG-Bench [Srivastava et al., 2022]: 200+ tasks, spanning:
https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/README.md

Instruction finetuning 46 [Chung et al., 2022]
• Recall the T5 encoder-decoder model from lecture 10 [Raffel et al., 2019], pretrained on the span corruption task
• Flan-T5 [Chung et al., 2022]: T5 models finetuned on 1.8K additional tasks
(Figure: BIG-bench + MMLU avg (normalized) vs. model size; bigger model = bigger Δ from instruction finetuning.)

Instruction finetuning 47–48 [Chung et al., 2022]
Example behavior before vs. after instruction finetuning.
Highly recommend trying FLAN-T5 out to get a sense of its capabilities: https://huggingface.co/google/flan-t5-xxl
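If you want to try an instruction-finetuned model yourself, a minimal sketch along these lines should work; flan-t5-small is used here only to keep the download manageable (swap in google/flan-t5-xxl from the link above for the full-size model), and the instruction text is just an illustrative example.

# Querying an instruction-finetuned model (FLAN-T5) with a natural-language instruction.
# Sketch only: assumes `transformers` and a PyTorch backend are installed.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"   # or "google/flan-t5-xxl" if you have the memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

instruction = (
    "Answer the following question by reasoning step by step. "
    "The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, "
    "how many apples do they have?"
)

inputs = tokenizer(instruction, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because FLAN-T5 was finetuned on instruction-formatted data, the unseen task is specified purely through the instruction, with no in-context examples.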
Lecture Plan: From Language Models to Assistants 49
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – ?
   – ?
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Limitations of instruction finetuning? 50
• One limitation of instruction finetuning is obvious: it's expensive to collect ground-truth data for tasks.
• But there are other, subtler limitations too. Can you think of any?
• Problem 1: tasks like open-ended creative generation have no right answer.
  • Write me a story about a dog and her pet grasshopper.
• Problem 2: language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
  (Figure: an LM predicting "is a fantasy TV show END" from "Avatar is a fantasy TV show", with "adventure" and "musical" shown as alternative continuations.)
• Even with instruction finetuning, there is a mismatch between the LM objective and the objective of "satisfy human preferences"!
• Can we explicitly attempt to satisfy human preferences?

Lecture Plan: From Language Models to Assistants 51–52
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Optimizing for human preferences 53
• Let's say we were training a language model on some task (e.g. summarization).
• For each LM sample s, imagine we had a way to obtain a human reward for that summary: R(s) ∈ ℝ, higher is better.
  s_1: "An earthquake hit San Francisco. There was minor property damage, but no injuries." with R(s_1) = 8.0
  s_2: "The Bay Area has good weather but is prone to earthquakes and wildfires." with R(s_2) = 1.2
  (for the article "SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects.")
• Now we want to maximize the expected reward of samples from our LM:
  $\mathbb{E}_{\hat{s} \sim p_\theta(s)}\left[R(\hat{s})\right]$
  (Note: for mathematical simplicity we're assuming only one "prompt".)

Reinforcement learning to the rescue 54
• The field of reinforcement learning (RL) has studied these (and related) problems for many years now [Williams, 1992; Sutton and Barto, 1998]
• Circa 2013: resurgence of interest in RL applied to deep learning and game-playing [Mnih et al., 2013]
• But the interest in applying RL to modern LMs is an even newer phenomenon [Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022]. Why?
  • RL with LMs has commonly been viewed as very hard to get right (still is!)
  • Newer advances in RL algorithms that work for large neural models, including language models (e.g. PPO; [Schulman et al., 2017])

Optimizing for human preferences 55
• How do we actually change our LM parameters θ to maximize $\mathbb{E}_{\hat{s} \sim p_\theta(s)}[R(\hat{s})]$?
• Let's try doing gradient ascent:
  $\theta_{t+1} := \theta_t + \alpha \, \nabla_{\theta_t} \, \mathbb{E}_{\hat{s} \sim p_{\theta_t}(s)}\left[R(\hat{s})\right]$
  But how do we estimate this expectation? And what if our reward function is non-differentiable?
• Policy gradient methods in RL (e.g. REINFORCE; [Williams, 1992]) give us tools for estimating and optimizing this objective.
• We'll describe a very high-level mathematical overview of the simplest policy gradient estimator, but a full treatment of RL is outside the scope of this course. (Try CS234!)
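For reference, the simplest policy gradient estimator alluded to above (the score-function / REINFORCE estimator) can be derived in a few lines; this is the standard textbook argument, sketched here rather than taken from the slides:

\[
\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\hat{s}\sim p_\theta(s)}\!\left[R(\hat{s})\right]
  &= \nabla_\theta \sum_{s} p_\theta(s)\, R(s)
   = \sum_{s} R(s)\, \nabla_\theta\, p_\theta(s) \\
  &= \sum_{s} R(s)\, p_\theta(s)\, \nabla_\theta \log p_\theta(s)
   \quad \text{(log-derivative trick: } \nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta \text{)} \\
  &= \mathbb{E}_{\hat{s}\sim p_\theta(s)}\!\left[R(\hat{s})\, \nabla_\theta \log p_\theta(\hat{s})\right]
   \approx \frac{1}{m}\sum_{i=1}^{m} R(s_i)\, \nabla_\theta \log p_\theta(s_i),
   \qquad s_i \sim p_\theta(s).
\end{aligned}
\]

Note that R(s) only needs to be evaluable, not differentiable, and the final expectation is estimated with Monte Carlo samples drawn from the LM itself, which answers both questions above. In practice, lower-variance variants such as PPO (mentioned earlier) are used instead of this raw estimator.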
How do we model human preferences? 58
• Awesome: now for any arbitrary, non-differentiable reward function R(s), we can train our language model to maximize expected reward.
• Not so fast! (Why not?)
• Problem 1: human-in-the-loop is expensive! 💵💵
  (e.g. obtaining R(s_1) = 8.0 and R(s_2) = 1.2 above requires paying human annotators for every sample)
• Solution: instead of directly asking humans for preferences, model their preferences as a separate (NLP) problem! [Knox and Stone, 2009]
  Train an LM RM_ϕ(s) to predict human preferences from an annotated dataset, then optimize for RM_ϕ instead.

How do we model human preferences? 59
• Problem 2: human judgments are noisy and miscalibrated!
  (s_3: "A 4.2 magnitude earthquake hit San Francisco, resulting in massive damage." R(s_3) = ? 4.1? 6.6? 3.2?)
• Solution: instead of asking for direct ratings, ask for pairwise comparisons, which can be more reliable [Phelps et al., 2015; Clark et al., 2018]

How do we model human preferences? 60
(Figure: pairwise comparisons among s_1, s_2, and s_3 are used to train a Reward Model RM_ϕ, which maps a sample such as "The Bay Area ... wildfires" to a scalar score, e.g. 1.2.)
  $J_{RM}(\phi) = -\mathbb{E}_{(s^w, s^l) \sim D}\left[\log \sigma\!\left(RM_\phi(s^w) - RM_\phi(s^l)\right)\right]$
  where s^w is the "winning" sample and s^l the "losing" sample: s^w should score higher than s^l. (This is the Bradley-Terry [1952] paired comparison model.)

Make sure your reward model works first! 62
Evaluate the RM on predicting the outcome of held-out human judgments [Stiennon et al., 2020].
(Figure: RM accuracy vs. training data; a large enough RM trained on enough data approaches single-human performance.)

RLHF: Putting it all together [Christiano et al., 2017; Stiennon et al., 2020]
• Finally, we have everything we need:
  • A pretrained (possibly instruction-finetuned) LM p^PT(s)
  • A reward model RM_ϕ(s) that produces scalar rewards for LM outputs, trained on a dataset of human comparisons
  • A method for optimizing LM parameters towards an arbitrary reward function
• Now, to do RLHF:
  • Initialize a copy of the model p^RL_θ(s), with parameters θ we would like to optimize
  • Optimize the following reward with RL:
    $R(s) = RM_\phi(s) - \beta \log \frac{p^{RL}_\theta(s)}{p^{PT}(s)}$
  The second term is a penalty that is paid whenever p^RL_θ(s) > p^PT(s); it prevents us from diverging too far from the pretrained model. In expectation, it is known as the Kullback-Leibler (KL) divergence between p^RL_θ(s) and p^PT(s).

RLHF provides gains over pretraining + finetuning [Stiennon et al., 2020] 63
(Figure: comparison of p^PT(s), p^IFT(s), and p^RL(s).)
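To make the two key ingredients above concrete, here is a minimal PyTorch sketch; it is an assumption-laden illustration, not the lecture's or OpenAI's actual implementation. `reward_model`, `logprob_rl`, and `logprob_pt` are hypothetical callables returning a scalar score and sequence log-probabilities, and the value of `beta` is illustrative.

# Sketch of the two RLHF ingredients: the pairwise reward-model loss and the
# KL-penalized reward. Assumes PyTorch is installed; the model callables are hypothetical.
import torch.nn.functional as F

def reward_model_loss(reward_model, winning_batch, losing_batch):
    """Bradley-Terry pairwise loss: J_RM(phi) = -E[ log sigma(RM_phi(s_w) - RM_phi(s_l)) ]."""
    score_w = reward_model(winning_batch)   # shape (batch,): scalar score per "winning" sample
    score_l = reward_model(losing_batch)    # shape (batch,): scalar score per "losing" sample
    return -F.logsigmoid(score_w - score_l).mean()

def rlhf_reward(sample, reward_model, logprob_rl, logprob_pt, beta=0.02):
    """Sequence-level reward R(s) = RM_phi(s) - beta * log( p_RL(s) / p_PT(s) ).

    `logprob_rl(sample)` and `logprob_pt(sample)` are hypothetical functions returning
    log p^RL_theta(s) and log p^PT(s). `beta` is the KL coefficient (value is illustrative).
    Practical systems usually apply this penalty per token; the sequence-level form here
    matches the formula on the slide.
    """
    kl_term = logprob_rl(sample) - logprob_pt(sample)   # log p_RL(s) - log p_PT(s)
    return reward_model(sample) - beta * kl_term

The resulting scalar reward can then be plugged into the policy gradient estimator sketched earlier.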
InstructGPT: scaling up RLHF to tens of thousands of tasks [Ouyang et al., 2022] 64
30k tasks!

InstructGPT: scaling up RLHF to tens of thousands of tasks [Ouyang et al., 2022] 65
Tasks collected from labelers.

InstructGPT 66–67

ChatGPT: Instruction Finetuning + RLHF for dialog agents 68
Note: OpenAI (and similar companies) are keeping more details secret about ChatGPT training (including data, training parameters, model size), perhaps to keep a competitive edge…
https://openai.com/blog/chatgpt/ (Instruction finetuning!)

ChatGPT: Instruction Finetuning + RLHF for dialog agents 69
https://openai.com/blog/chatgpt/ (RLHF!)

ChatGPT: Instruction Finetuning + RLHF for dialog agents 70

Lecture Plan: From Language Models to Assistants 71
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
   + Directly model preferences (cf. language modeling); generalize beyond labeled data
   – RL is very tricky to get right
   – ?
4. What's next?

Limitations of RL + Reward Modeling 72–75
• Human preferences are unreliable!
  • "Reward hacking" is a common problem in RL
    https://openai.com/blog/faulty-reward-functions/
  • Chatbots are rewarded to produce responses that seem authoritative and helpful, regardless of truth
  • This can result in making up facts + hallucinations
    https://www.npr.org/2023/02/09/1155650909/google-chatbot--error-bard-shares
    https://news.ycombinator.com/item?id=34776508
    https://apnews.com/article/kansas-city-chiefs-philadelphia-eagles-technology-science-82bc20f207e3e4cf81abc6a5d9e6b23a
    (Recall: $R(s) = RM_\phi(s) - \beta \log \frac{p^{RL}_\theta(s)}{p^{PT}(s)}$)
• Models of human preferences are even more unreliable!
  • Reward model over-optimization [Stiennon et al., 2020]
• There is a real concern of AI mis(alignment)!
  https://twitter.com/percyliang/status/1600383429463355392
Lecture Plan: From Language Models to Assistants 76
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
   + Directly model preferences (cf. language modeling); generalize beyond labeled data
   – RL is very tricky to get right
   – Human preferences are fallible; models of human preferences even more so
4. What's next?

Language models as multitask assistants? 77
We've finally (mostly) answered how we get from this to this:
  Stanford University is located in __________

Lecture Plan: From Language Models to Assistants 78
(Recap of the plan and the pros and cons above.)

What's next? 79–81
• RLHF is still a very underexplored and fast-moving area: by the next lecture (2024) these slides may look completely different!
• RLHF gets you further than instruction finetuning, but is (still!) data expensive.
• Recent work aims to alleviate such data requirements:
  • RL from AI feedback [Bai et al., 2022]
    "Constitutional" AI [Bai et al., 2022]:
      Human: Can you help me hack into my neighbor's wifi?
      Assistant: Sure thing, you can use an app called VeryEasyHack.
      Critique Request: Identify ways in which the assistant's last response is harmful.
      Critique: Hacking into someone else's wifi is an invasion of their privacy and is possibly illegal.
      Revision Request: Rewrite the assistant response to remove harmful content.
      Revision: Hacking into your neighbor's wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.
  • Finetuning LMs on their own outputs [Huang et al., 2022; Zelikman et al., 2022]
    (Figure: LM chain of thought [Huang et al., 2022]; Self-Taught Reasoner (STaR) [Zelikman et al., 2022].)
• However, there are still many limitations of large LMs (size, hallucination) that may not be solvable with RLHF!
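As a closing sketch, the critique-and-revision round from the Constitutional AI example above can be written down as a small loop; `generate` is again a placeholder for a call to a capable LM, and only the two request strings are taken from the slide.

# Sketch of one Constitutional-AI-style critique -> revision round (after Bai et al., 2022).
# `generate(prompt)` is a hypothetical call to a capable LM; the revised responses
# produced this way can then be used as finetuning or preference data.

CRITIQUE_REQUEST = "Identify ways in which the assistant's last response is harmful."
REVISION_REQUEST = "Rewrite the assistant response to remove harmful content."

def critique_and_revise(human_turn: str, assistant_response: str, generate) -> str:
    transcript = f"Human: {human_turn}\nAssistant: {assistant_response}"

    # Step 1: ask the model to critique its own response.
    critique = generate(
        f"{transcript}\nCritique Request: {CRITIQUE_REQUEST}\nCritique:"
    )
    # Step 2: ask the model to revise the response in light of its critique.
    revision = generate(
        f"{transcript}\nCritique Request: {CRITIQUE_REQUEST}\nCritique: {critique}\n"
        f"Revision Request: {REVISION_REQUEST}\nRevision:"
    )
    return revision

The point of the loop is that the feedback used for finetuning comes from the model itself rather than from human annotators, which is what makes the approach less data-expensive than standard RLHF.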