Natural Language Processing with Deep Learning
CS224N/Ling284
Jesse Mu
Lecture 11: Prompting, Instruction Finetuning, and RLHF

Larger and larger models 3
https://www.economist.com/interactive/briefing/2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress

Trained on more and more data 4
(# tokens seen during training)
https://babylm.github.io/

Recap of Lecture 10: What kinds of things does pretraining learn? 5
• Stanford University is located in __________, California. [trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don't learn the Fibonacci sequence]

Language models as world models? 6
Language Models as Agent Models [Andreas, 2022]
Language models may do rudimentary modeling of agents, beliefs, and actions.

Language models as world models? 7
…math: https://www.khanacademy.org/test-prep/sat/x0a8c2e5f:untitled-652

Language models as world models? 8
…code: https://github.com/features/copilot

Language models as world models? 9
…medicine: [Larnerd, 2023]

Language models as multitask assistants? 10
[Microsoft Bing] (Also see OpenAI's ChatGPT, Google's Bard, Anthropic's Claude)

Language models as multitask assistants? 11
How do we get from this to this?
  Stanford University is located in __________

Lecture Plan: From Language Models to Assistants 12–13
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
2. Instruction finetuning
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Emergent abilities of large language models: GPT (2018) 14
Let's revisit the Generative Pretrained Transformer (GPT) models from OpenAI as an example:
GPT (117M parameters; Radford et al., 2018)
• Transformer decoder with 12 layers.
• Trained on BooksCorpus: over 7,000 unique books (4.6GB of text).
Showed that language modeling at scale can be an effective pretraining technique for downstream tasks like natural language inference.
(Figure: input formatted as [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT], fed to the decoder, which predicts "entailment".)

Emergent abilities of large language models: GPT-2 (2019) 15
GPT-2 (1.5B parameters; Radford et al., 2019)
• Same architecture as GPT, just bigger (117M -> 1.5B parameters)
• But trained on much more data: 4GB -> 40GB of internet text data (WebText)
• Scraped links posted on Reddit with at least 3 upvotes (a rough proxy for human quality)

Emergent zero-shot learning 16 [Radford et al., 2019]
One key emergent ability in GPT-2 is zero-shot learning: the ability to do many tasks with no examples and no gradient updates, by simply:
• Specifying the right sequence prediction problem (e.g. question answering):
  Passage: Tom Brady... Q: Where was Tom Brady born? A: ...
• Comparing probabilities of sequences (e.g. the Winograd Schema Challenge [Levesque, 2011]):
  The cat couldn't fit into the hat because it was too big. Does it = the cat or the hat?
  ≡ Is P(...because the cat was too big) >= P(...because the hat was too big)?
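To make the "comparing probabilities of sequences" idea concrete, here is a minimal sketch (not from the slides) that scores the two Winograd completions with an off-the-shelf GPT-2 from Hugging Face; the model choice and the helper function are illustrative assumptions.

# Zero-shot Winograd-style resolution by comparing sequence log-probabilities.
# Sketch only: assumes the `torch` and `transformers` packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_log_prob(text: str) -> float:
    """Total log-probability the LM assigns to `text` (higher = more likely)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `out.loss` is the mean negative log-likelihood over the predicted tokens.
    num_predicted = ids.size(1) - 1
    return -out.loss.item() * num_predicted

prefix = "The cat couldn't fit into the hat because "
cand_cat = prefix + "the cat was too big."
cand_hat = prefix + "the hat was too big."

answer = "the cat" if sequence_log_prob(cand_cat) >= sequence_log_prob(cand_hat) else "the hat"
print(answer)

Using total (rather than mean) log-probability matches the comparison written on the slide; either way, no labels or gradient updates are involved.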
Emergent zero-shot learning 17 [Radford et al., 2019]
GPT-2 beats SoTA on language modeling benchmarks with no task-specific fine-tuning, e.g. LAMBADA (language modeling with long discourse dependencies) [Paperno et al., 2016].

Emergent zero-shot learning 18 [Radford et al., 2019]
You can get interesting zero-shot behavior if you're creative enough with how you specify your task!
Summarization on the CNN/DailyMail dataset [See et al., 2017]:
  SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects.
  TL;DR: ("Too Long, Didn't Read" -- "prompting"?)
(Figure: ROUGE scores of the TL;DR: prompt vs. a "select from article" baseline, supervised models (287K examples), and the 2018 SoTA.)

Emergent abilities of large language models: GPT-3 (2020) 19
GPT-3 (175B parameters; Brown et al., 2020)
• Another increase in size (1.5B -> 175B)
• and data (40GB -> over 600GB)
(GPT (2018): 117M; GPT-2 (2019): 1.5B; GPT-3 (2020): 175B)

Emergent few-shot learning 20 [Brown et al., 2020]
• Specify a task by simply prepending examples of the task before your example
• Also called in-context learning, to stress that no gradient updates are performed when learning a new task (there is a separate literature on few-shot learning with gradient updates)

Emergent few-shot learning 21: Zero-shot [Brown et al., 2020]
Emergent few-shot learning 22: One-shot [Brown et al., 2020]
Emergent few-shot learning 23: Few-shot [Brown et al., 2020]

Few-shot learning is an emergent property of model scale 24 [Brown et al., 2020]
Synthetic "word unscrambling" tasks, 100-shot:
• Cycle letters: pleap -> apple
• Random insertion: a.p!p/l!e -> apple
• Reversed words: elppa -> apple

New methods of "prompting" LMs 25 [Brown et al., 2020]
(Figure: traditional fine-tuning vs. zero/few-shot prompting.)

Limits of prompting for harder tasks? 26
Some tasks seem too hard for even large LMs to learn through prompting alone, especially tasks involving richer, multi-step reasoning. (Humans struggle at these tasks too!)
  19583 + 29534 = 49117
  98394 + 49384 = 147778
  29382 + 12347 = 41729
  93847 + 39299 = ?
Solution: change the prompt!

Chain-of-thought prompting 27 [Wei et al., 2022; also see Nye et al., 2021]

Chain-of-thought prompting is an emergent property of model scale 28 [Wei et al., 2022; also see Nye et al., 2021]
(Middle school math word problems)

Chain-of-thought prompting 29 [Wei et al., 2022; also see Nye et al., 2021]
Do we even need examples of reasoning? Can we just ask the model to reason through things?

Zero-shot chain-of-thought prompting 30 [Kojima et al., 2022]
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step. There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.

Zero-shot chain-of-thought prompting 31 [Kojima et al., 2022]
Greatly outperforms plain zero-shot; manually written CoT examples are still better.
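The two-stage recipe from Kojima et al. can be sketched in a few lines of Python; `generate` stands in for whatever large-LM API or local model you have access to and is an assumption, not something from the slides.

# Zero-shot chain-of-thought prompting (two stages, after Kojima et al., 2022).
# `generate(prompt)` is a placeholder for a call to a sufficiently large LM;
# it is assumed here, not defined.

def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit a reasoning chain by appending the trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: condition on the reasoning and extract a final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)

# Example usage, with any `generate` you supply:
# zero_shot_cot(
#     "A juggler can juggle 16 balls. Half of the balls are golf balls, "
#     "and half of the golf balls are blue. How many blue golf balls are there?",
#     generate=my_llm_call,
# )

The key design point is that the model's own sampled reasoning is fed back into the second prompt; no reasoning examples are written by hand.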
Lecture Plan: From Language Models to Assistants 36
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Language modeling ≠ assisting users 37
Language models are not aligned with user intent [Ouyang et al., 2022].

Language modeling ≠ assisting users 38
Language models are not aligned with user intent [Ouyang et al., 2022]. Finetuning to the rescue!
Human: A giant rocket ship blasted off from Earth carrying astronauts to the moon. The astronauts landed their spaceship on the moon and walked around exploring the lunar surface. Then they returned safely back to Earth, bringing home moon rocks to show everyone.

Recall from Lecture 10: The Pretraining / Finetuning Paradigm 39
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things!
  (Figure: decoder (Transformer, LSTM, ++) predicting "goes to make tasty tea END" from "Iroh goes to make tasty tea".)
Step 2: Finetune (on your task). Not many labels; adapt to the task!
  (Figure: decoder predicting ☺/☹ sentiment for "... the movie was ...".)

Scaling up finetuning 40
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things!
Step 2: Finetune (on many tasks). Not many labels; adapt to the tasks!

Instruction finetuning 41 [FLAN-T5; Chung et al., 2022]
• Collect examples of (instruction, output) pairs across many tasks and finetune an LM
• Evaluate on unseen tasks

Instruction finetuning 42 [Wang et al., 2022]
• As is usually the case, data + model scale is key for this to work!
• For example, the Super-NaturalInstructions dataset contains over 1.6K tasks and 3M+ examples
  • Classification, sequence tagging, rewriting, translation, QA...
• Q: how do we evaluate such a model?

Aside: new benchmarks for multitask LMs 43
Massive Multitask Language Understanding (MMLU) [Hendrycks et al., 2021]: new benchmarks for measuring LM performance on 57 diverse knowledge-intensive tasks.

Aside: new benchmarks for multitask LMs 44–45
BIG-Bench [Srivastava et al., 2022]: 200+ tasks, spanning:
https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/README.md

Instruction finetuning 46 [Chung et al., 2022]
• Recall the T5 encoder-decoder model from lecture 10 [Raffel et al., 2019], pretrained on the span corruption task
• Flan-T5 [Chung et al., 2022]: T5 models finetuned on 1.8K additional tasks
(Figure: BIG-bench + MMLU avg (normalized) vs. model size; bigger model = bigger Δ from instruction finetuning.)

Instruction finetuning 47–48 [Chung et al., 2022]
Example behavior before vs. after instruction finetuning.
Highly recommend trying FLAN-T5 out to get a sense of its capabilities: https://huggingface.co/google/flan-t5-xxl
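If you want to try an instruction-finetuned model yourself, a minimal sketch along these lines should work; flan-t5-small is used here only to keep the download manageable (swap in google/flan-t5-xxl from the link above for the full-size model), and the instruction text is just an illustrative example.

# Querying an instruction-finetuned model (FLAN-T5) with a natural-language instruction.
# Sketch only: assumes `transformers` and a PyTorch backend are installed.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"   # or "google/flan-t5-xxl" if you have the memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

instruction = (
    "Answer the following question by reasoning step by step. "
    "The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, "
    "how many apples do they have?"
)

inputs = tokenizer(instruction, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because FLAN-T5 was finetuned on instruction-formatted data, the unseen task is specified purely through the instruction, with no in-context examples.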
Lecture Plan: From Language Models to Assistants 49
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – ?
   – ?
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Limitations of instruction finetuning? 50
• One limitation of instruction finetuning is obvious: it's expensive to collect ground-truth data for tasks.
• But there are other, subtler limitations too. Can you think of any?
• Problem 1: tasks like open-ended creative generation have no right answer.
  • Write me a story about a dog and her pet grasshopper.
• Problem 2: language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
  (Figure: an LM predicting "is a fantasy TV show END" from "Avatar is a fantasy TV show", with "adventure" and "musical" shown as alternative continuations.)
• Even with instruction finetuning, there is a mismatch between the LM objective and the objective of "satisfy human preferences"!
• Can we explicitly attempt to satisfy human preferences?

Lecture Plan: From Language Models to Assistants 51–52
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
4. What's next?

Optimizing for human preferences 53
• Let's say we were training a language model on some task (e.g. summarization).
• For each LM sample s, imagine we had a way to obtain a human reward for that summary: R(s) ∈ ℝ, higher is better.
  s_1: "An earthquake hit San Francisco. There was minor property damage, but no injuries." with R(s_1) = 8.0
  s_2: "The Bay Area has good weather but is prone to earthquakes and wildfires." with R(s_2) = 1.2
  (for the article "SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects.")
• Now we want to maximize the expected reward of samples from our LM:
  $\mathbb{E}_{\hat{s} \sim p_\theta(s)}\left[R(\hat{s})\right]$
  (Note: for mathematical simplicity we're assuming only one "prompt".)

Reinforcement learning to the rescue 54
• The field of reinforcement learning (RL) has studied these (and related) problems for many years now [Williams, 1992; Sutton and Barto, 1998]
• Circa 2013: resurgence of interest in RL applied to deep learning and game-playing [Mnih et al., 2013]
• But the interest in applying RL to modern LMs is an even newer phenomenon [Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022]. Why?
  • RL with LMs has commonly been viewed as very hard to get right (still is!)
  • Newer advances in RL algorithms that work for large neural models, including language models (e.g. PPO; [Schulman et al., 2017])

Optimizing for human preferences 55
• How do we actually change our LM parameters θ to maximize $\mathbb{E}_{\hat{s} \sim p_\theta(s)}[R(\hat{s})]$?
• Let's try doing gradient ascent:
  $\theta_{t+1} := \theta_t + \alpha \, \nabla_{\theta_t} \, \mathbb{E}_{\hat{s} \sim p_{\theta_t}(s)}\left[R(\hat{s})\right]$
  But how do we estimate this expectation? And what if our reward function is non-differentiable?
• Policy gradient methods in RL (e.g. REINFORCE; [Williams, 1992]) give us tools for estimating and optimizing this objective.
• We'll describe a very high-level mathematical overview of the simplest policy gradient estimator, but a full treatment of RL is outside the scope of this course. (Try CS234!)
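For reference, the simplest policy gradient estimator alluded to above (the score-function / REINFORCE estimator) can be derived in a few lines; this is the standard textbook argument, sketched here rather than taken from the slides:

\[
\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\hat{s}\sim p_\theta(s)}\!\left[R(\hat{s})\right]
  &= \nabla_\theta \sum_{s} p_\theta(s)\, R(s)
   = \sum_{s} R(s)\, \nabla_\theta\, p_\theta(s) \\
  &= \sum_{s} R(s)\, p_\theta(s)\, \nabla_\theta \log p_\theta(s)
   \quad \text{(log-derivative trick: } \nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta \text{)} \\
  &= \mathbb{E}_{\hat{s}\sim p_\theta(s)}\!\left[R(\hat{s})\, \nabla_\theta \log p_\theta(\hat{s})\right]
   \approx \frac{1}{m}\sum_{i=1}^{m} R(s_i)\, \nabla_\theta \log p_\theta(s_i),
   \qquad s_i \sim p_\theta(s).
\end{aligned}
\]

Note that R(s) only needs to be evaluable, not differentiable, and the final expectation is estimated with Monte Carlo samples drawn from the LM itself, which answers both questions above. In practice, lower-variance variants such as PPO (mentioned earlier) are used instead of this raw estimator.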
How do we model human preferences? 58
• Awesome: now for any arbitrary, non-differentiable reward function R(s), we can train our language model to maximize expected reward.
• Not so fast! (Why not?)
• Problem 1: human-in-the-loop is expensive! 💵💵
  (e.g. obtaining R(s_1) = 8.0 and R(s_2) = 1.2 above requires paying human annotators for every sample)
• Solution: instead of directly asking humans for preferences, model their preferences as a separate (NLP) problem! [Knox and Stone, 2009]
  Train an LM RM_ϕ(s) to predict human preferences from an annotated dataset, then optimize for RM_ϕ instead.

How do we model human preferences? 59
• Problem 2: human judgments are noisy and miscalibrated!
  (s_3: "A 4.2 magnitude earthquake hit San Francisco, resulting in massive damage." R(s_3) = ? 4.1? 6.6? 3.2?)
• Solution: instead of asking for direct ratings, ask for pairwise comparisons, which can be more reliable [Phelps et al., 2015; Clark et al., 2018]

How do we model human preferences? 60
(Figure: pairwise comparisons among s_1, s_2, and s_3 are used to train a Reward Model RM_ϕ, which maps a sample such as "The Bay Area ... wildfires" to a scalar score, e.g. 1.2.)
  $J_{RM}(\phi) = -\mathbb{E}_{(s^w, s^l) \sim D}\left[\log \sigma\!\left(RM_\phi(s^w) - RM_\phi(s^l)\right)\right]$
  where s^w is the "winning" sample and s^l the "losing" sample: s^w should score higher than s^l. (This is the Bradley-Terry [1952] paired comparison model.)

Make sure your reward model works first! 62
Evaluate the RM on predicting the outcome of held-out human judgments [Stiennon et al., 2020].
(Figure: RM accuracy vs. training data; a large enough RM trained on enough data approaches single-human performance.)

RLHF: Putting it all together [Christiano et al., 2017; Stiennon et al., 2020]
• Finally, we have everything we need:
  • A pretrained (possibly instruction-finetuned) LM p^PT(s)
  • A reward model RM_ϕ(s) that produces scalar rewards for LM outputs, trained on a dataset of human comparisons
  • A method for optimizing LM parameters towards an arbitrary reward function
• Now, to do RLHF:
  • Initialize a copy of the model p^RL_θ(s), with parameters θ we would like to optimize
  • Optimize the following reward with RL:
    $R(s) = RM_\phi(s) - \beta \log \frac{p^{RL}_\theta(s)}{p^{PT}(s)}$
  The second term is a penalty that is paid whenever p^RL_θ(s) > p^PT(s); it prevents us from diverging too far from the pretrained model. In expectation, it is known as the Kullback-Leibler (KL) divergence between p^RL_θ(s) and p^PT(s).

RLHF provides gains over pretraining + finetuning [Stiennon et al., 2020] 63
(Figure: comparison of p^PT(s), p^IFT(s), and p^RL(s).)
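To make the two key ingredients above concrete, here is a minimal PyTorch sketch; it is an assumption-laden illustration, not the lecture's or OpenAI's actual implementation. `reward_model`, `logprob_rl`, and `logprob_pt` are hypothetical callables returning a scalar score and sequence log-probabilities, and the value of `beta` is illustrative.

# Sketch of the two RLHF ingredients: the pairwise reward-model loss and the
# KL-penalized reward. Assumes PyTorch is installed; the model callables are hypothetical.
import torch.nn.functional as F

def reward_model_loss(reward_model, winning_batch, losing_batch):
    """Bradley-Terry pairwise loss: J_RM(phi) = -E[ log sigma(RM_phi(s_w) - RM_phi(s_l)) ]."""
    score_w = reward_model(winning_batch)   # shape (batch,): scalar score per "winning" sample
    score_l = reward_model(losing_batch)    # shape (batch,): scalar score per "losing" sample
    return -F.logsigmoid(score_w - score_l).mean()

def rlhf_reward(sample, reward_model, logprob_rl, logprob_pt, beta=0.02):
    """Sequence-level reward R(s) = RM_phi(s) - beta * log( p_RL(s) / p_PT(s) ).

    `logprob_rl(sample)` and `logprob_pt(sample)` are hypothetical functions returning
    log p^RL_theta(s) and log p^PT(s). `beta` is the KL coefficient (value is illustrative).
    Practical systems usually apply this penalty per token; the sequence-level form here
    matches the formula on the slide.
    """
    kl_term = logprob_rl(sample) - logprob_pt(sample)   # log p_RL(s) - log p_PT(s)
    return reward_model(sample) - beta * kl_term

The resulting scalar reward can then be plugged into the policy gradient estimator sketched earlier.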
InstructGPT: scaling up RLHF to tens of thousands of tasks [Ouyang et al., 2022] 64
30k tasks!

InstructGPT: scaling up RLHF to tens of thousands of tasks [Ouyang et al., 2022] 65
Tasks collected from labelers.

InstructGPT 66–67

ChatGPT: Instruction Finetuning + RLHF for dialog agents 68
Note: OpenAI (and similar companies) are keeping more details secret about ChatGPT training (including data, training parameters, model size), perhaps to keep a competitive edge…
https://openai.com/blog/chatgpt/ (Instruction finetuning!)

ChatGPT: Instruction Finetuning + RLHF for dialog agents 69
https://openai.com/blog/chatgpt/ (RLHF!)

ChatGPT: Instruction Finetuning + RLHF for dialog agents 70

Lecture Plan: From Language Models to Assistants 71
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
   + Directly model preferences (cf. language modeling); generalize beyond labeled data
   – RL is very tricky to get right
   – ?
4. What's next?

Limitations of RL + Reward Modeling 72–75
• Human preferences are unreliable!
  • "Reward hacking" is a common problem in RL
    https://openai.com/blog/faulty-reward-functions/
  • Chatbots are rewarded to produce responses that seem authoritative and helpful, regardless of truth
  • This can result in making up facts + hallucinations
    https://www.npr.org/2023/02/09/1155650909/google-chatbot--error-bard-shares
    https://news.ycombinator.com/item?id=34776508
    https://apnews.com/article/kansas-city-chiefs-philadelphia-eagles-technology-science-82bc20f207e3e4cf81abc6a5d9e6b23a
    (Recall: $R(s) = RM_\phi(s) - \beta \log \frac{p^{RL}_\theta(s)}{p^{PT}(s)}$)
• Models of human preferences are even more unreliable!
  • Reward model over-optimization [Stiennon et al., 2020]
• There is a real concern of AI mis(alignment)!
  https://twitter.com/percyliang/status/1600383429463355392
Lecture Plan: From Language Models to Assistants 76
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward; generalizes to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
   + Directly model preferences (cf. language modeling); generalize beyond labeled data
   – RL is very tricky to get right
   – Human preferences are fallible; models of human preferences even more so
4. What's next?

Language models as multitask assistants? 77
We've finally (mostly) answered how we get from this to this:
  Stanford University is located in __________

Lecture Plan: From Language Models to Assistants 78
(Recap of the plan and the pros and cons above.)

What's next? 79–81
• RLHF is still a very underexplored and fast-moving area: by the next lecture (2024) these slides may look completely different!
• RLHF gets you further than instruction finetuning, but is (still!) data expensive.
• Recent work aims to alleviate such data requirements:
  • RL from AI feedback [Bai et al., 2022]
    "Constitutional" AI [Bai et al., 2022]:
      Human: Can you help me hack into my neighbor's wifi?
      Assistant: Sure thing, you can use an app called VeryEasyHack.
      Critique Request: Identify ways in which the assistant's last response is harmful.
      Critique: Hacking into someone else's wifi is an invasion of their privacy and is possibly illegal.
      Revision Request: Rewrite the assistant response to remove harmful content.
      Revision: Hacking into your neighbor's wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.
  • Finetuning LMs on their own outputs [Huang et al., 2022; Zelikman et al., 2022]
    (Figure: LM chain of thought [Huang et al., 2022]; Self-Taught Reasoner (STaR) [Zelikman et al., 2022].)
• However, there are still many limitations of large LMs (size, hallucination) that may not be solvable with RLHF!
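As a closing sketch, the critique-and-revision round from the Constitutional AI example above can be written down as a small loop; `generate` is again a placeholder for a call to a capable LM, and only the two request strings are taken from the slide.

# Sketch of one Constitutional-AI-style critique -> revision round (after Bai et al., 2022).
# `generate(prompt)` is a hypothetical call to a capable LM; the revised responses
# produced this way can then be used as finetuning or preference data.

CRITIQUE_REQUEST = "Identify ways in which the assistant's last response is harmful."
REVISION_REQUEST = "Rewrite the assistant response to remove harmful content."

def critique_and_revise(human_turn: str, assistant_response: str, generate) -> str:
    transcript = f"Human: {human_turn}\nAssistant: {assistant_response}"

    # Step 1: ask the model to critique its own response.
    critique = generate(
        f"{transcript}\nCritique Request: {CRITIQUE_REQUEST}\nCritique:"
    )
    # Step 2: ask the model to revise the response in light of its critique.
    revision = generate(
        f"{transcript}\nCritique Request: {CRITIQUE_REQUEST}\nCritique: {critique}\n"
        f"Revision Request: {REVISION_REQUEST}\nRevision:"
    )
    return revision

The point of the loop is that the feedback used for finetuning comes from the model itself rather than from human annotators, which is what makes the approach less data-expensive than standard RLHF.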