Natural Language Processing with Deep Learning
CS224N/Ling284
Archit Sharma
Lecture 10: Prompting, Instruction Finetuning, and DPO/RLHF
(Based on slides from Jesse Mu)

Larger and larger models 2
https://www.economist.com/interactive/briefing/2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress

Trained on more and more data 3
(# tokens seen during training) https://babylm.github.io/

Recap of the pretraining lecture: What kinds of things does pretraining learn? 4
• Stanford University is located in __________, California. [trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don't learn the Fibonacci sequence]

Language models as world models? 5
Language Models as Agent Models [Andreas, 2022]: language models may do rudimentary modeling of agents, beliefs, and actions.

Language models as world models? 6
…math: https://www.khanacademy.org/test-prep/sat/x0a8c2e5f:untitled-652

Language models as world models? 7
…code: https://github.com/features/copilot

Language models as world models? 8
…medicine: [Larnerd, 2023]

Language models as multitask assistants? 9
[Microsoft Bing] (Also see OpenAI's ChatGPT, Google's Bard, Anthropic's Claude)

Language models as multitask assistants? 10
How do we get from this ("Stanford University is located in __________") to this (a helpful multitask assistant)?

Lecture Plan: From Language Models to Assistants 11
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
2. Instruction finetuning
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Emergent abilities of large language models: GPT (2018) 13
Let's revisit the Generative Pretrained Transformer (GPT) models from OpenAI as an example.
GPT (117M parameters; Radford et al., 2018)
• Transformer decoder with 12 layers.
• Trained on BooksCorpus: over 7000 unique books (4.6GB text).
• Showed that language modeling at scale can be an effective pretraining technique for downstream tasks like natural language inference:
  [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT] → Decoder → entailment

Emergent abilities of large language models: GPT-2 (2019) 14
GPT-2 (1.5B parameters; Radford et al., 2019)
• Same architecture as GPT, just bigger (117M -> 1.5B)
• But trained on much more data: 4GB -> 40GB of internet text data (WebText)
• Scraped links posted on Reddit with at least 3 upvotes (a rough proxy of human quality)

One key emergent ability in GPT-2 is zero-shot learning: the ability to do many tasks with no examples, and no gradient updates, by simply:
• Specifying the right sequence prediction problem (e.g. question answering):
  Passage: Tom Brady... Q: Where was Tom Brady born? A: ...
• Comparing probabilities of sequences (e.g. the Winograd Schema Challenge [Levesque, 2011]); see the sketch below:
  The cat couldn't fit into the hat because it was too big. Does it = the cat or the hat?
  ≡ Is P(...because the cat was too big) >= P(...because the hat was too big)?
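The probability comparison above is easy to reproduce. Below is a minimal sketch (an illustration, not from the slides) of zero-shot Winograd-style scoring, assuming the Hugging Face transformers library and the small GPT-2 checkpoint; any causal LM would work, and larger models resolve such examples more reliably.

```python
# Zero-shot Winograd-style resolution by comparing sequence probabilities.
# A minimal sketch assuming Hugging Face `transformers` and GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model returns the mean token cross-entropy.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # un-average over predicted tokens

option_a = "The cat couldn't fit into the hat because the cat was too big."
option_b = "The cat couldn't fit into the hat because the hat was too big."

winner = "the cat" if sequence_log_prob(option_a) >= sequence_log_prob(option_b) else "the hat"
print("Model resolves 'it' to:", winner)
```

No parameters are updated here; the "task" exists only in how we score the two candidate continuations.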
Emergent zero-shot learning 15
[Radford et al., 2019]

Emergent zero-shot learning 16
GPT-2 beats SoTA on language modeling benchmarks with no task-specific fine-tuning, e.g. LAMBADA (language modeling with long discourse dependencies) [Paperno et al., 2016; Radford et al., 2019].

Emergent zero-shot learning 17
You can get interesting zero-shot behavior if you're creative enough with how you specify your task!
Summarization on the CNN/DailyMail dataset [See et al., 2017]: append "TL;DR:" ("Too Long, Didn't Read") to the article and let the model continue. "Prompting"?
  SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects. TL;DR:
(Figure: ROUGE scores comparing "TL;DR:" prompting against a select-from-article baseline and the 2018 supervised SoTA trained on 287K examples.) [Radford et al., 2019]

Emergent abilities of large language models: GPT-3 (2020) 18
GPT-3 (175B parameters; Brown et al., 2020)
• Another increase in size (1.5B -> 175B)
• and data (40GB -> over 600GB)

Emergent few-shot learning 19
• Specify a task by simply prepending examples of the task before your example [Brown et al., 2020]
• Also called in-context learning, to stress that no gradient updates are performed when learning a new task (there is a separate literature on few-shot learning with gradient updates)

Emergent few-shot learning 20-22
Zero-shot, one-shot, and few-shot task specifications (figures) [Brown et al., 2020]

Few-shot learning is an emergent property of model scale 23
Synthetic "word unscrambling" tasks, 100-shot [Brown et al., 2020]:
• Cycle letters: pleap -> apple
• Random insertion: a.p!p/l!e -> apple
• Reversed words: elppa -> apple

New methods of "prompting" LMs 24
(Figure: traditional fine-tuning with gradient updates vs. zero/few-shot prompting with no gradient updates.) [Brown et al., 2020]

Limits of prompting for harder tasks? 25
Some tasks seem too hard for even large LMs to learn through prompting alone, especially tasks involving richer, multi-step reasoning. (Humans struggle at these tasks too!)
  19583 + 29534 = 49117
  98394 + 49384 = 147778
  29382 + 12347 = 41729
  93847 + 39299 = ?
Solution: change the prompt!

Chain-of-thought prompting 26
[Wei et al., 2022; also see Nye et al., 2021]

Chain-of-thought prompting is an emergent property of model scale 27
(Figure: accuracy on middle-school math word problems vs. model scale.) [Wei et al., 2022; also see Nye et al., 2021]

Chain-of-thought prompting 28
Do we even need examples of reasoning? Can we just ask the model to reason through things? [Wei et al., 2022; also see Nye et al., 2021]

Zero-shot chain-of-thought prompting 29
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step. There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.
[Kojima et al., 2022]

Zero-shot chain-of-thought prompting 30
Greatly outperforms plain zero-shot prompting, though manually written chain-of-thought examples are still better. [Kojima et al., 2022]

Zero-shot chain-of-thought prompting 31
An LM-designed zero-shot prompt can outperform the manually designed one. [Zhou et al., 2022; Kojima et al., 2022] (A sketch of assembling these prompts follows below.)
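For concreteness, here is a small sketch (an illustration, not from the slides) of how few-shot and zero-shot chain-of-thought prompts can be assembled as plain strings: the worked example follows the style of Wei et al. (2022), the trigger phrase is from Kojima et al. (2022), and the gpt2 pipeline at the end is just a placeholder for a much larger model.

```python
# Assembling few-shot and zero-shot chain-of-thought prompts as plain strings.
# A minimal sketch; the `generate` call assumes any Hugging Face causal LM
# (the model name below is illustrative, not a recommendation).
from transformers import pipeline

FEW_SHOT_EXAMPLES = [
    ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
     "How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. "
     "The answer is 11."),
]

def few_shot_cot_prompt(question: str) -> str:
    """Prepend worked examples (with their reasoning) before the new question."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{demos}\n\nQ: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    """No examples at all; just append the trigger phrase from Kojima et al. (2022)."""
    return f"Q: {question}\nA: Let's think step by step."

question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

generator = pipeline("text-generation", model="gpt2")  # illustrative; use a far larger LM in practice
print(generator(zero_shot_cot_prompt(question), max_new_tokens=64)[0]["generated_text"])
```

The point of the sketch is that nothing changes in the model itself; both prompting styles are purely a matter of what text you condition on.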
The new dark art of "prompt engineering"? 32
• Use a Google code header to generate more "professional" code?
• Asking a model for reasoning
• Prompting image-generation models, e.g.: "fantasy concept art, glowing blue dodecahedron die on a wooden table, in a cozy fantasy (workshop), tools on the table, artstation, depth of field, 4k, masterpiece"
  https://www.reddit.com/r/StableDiffusion/comments/110dymw/magic_stone_workshop/

The new dark art of "prompt engineering"? 33
• "Jailbreaking" LMs
  https://twitter.com/goodside/status/1569128808308957185/photo/1

Lecture Plan: From Language Models to Assistants 34
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Language modeling ≠ assisting users 36
Language models are not aligned with user intent [Ouyang et al., 2022].

Language modeling ≠ assisting users 37
Finetuning to the rescue! A human-written demonstration of the intended behavior:
  Human: A giant rocket ship blasted off from Earth carrying astronauts to the moon. The astronauts landed their spaceship on the moon and walked around exploring the lunar surface. Then they returned safely back to Earth, bringing home moon rocks to show everyone.

Recall from the pretraining lecture: The Pretraining / Finetuning Paradigm 38
Pretraining can improve NLP applications by serving as parameter initialization.
• Step 1: Pretrain (on language modeling). Lots of text; learn general things! (Figure: a decoder such as a Transformer or LSTM predicting "Iroh goes to make tasty tea ... END".)
• Step 2: Finetune (on your task). Not many labels; adapt to the task! (Figure: the same decoder classifying sentiment on "… the movie was …".)

Scaling up finetuning 39
Pretraining can improve NLP applications by serving as parameter initialization.
• Step 1: Pretrain (on language modeling). Lots of text; learn general things!
• Step 2: Finetune (on many tasks). Not many labels; adapt to the tasks!

Instruction finetuning 40
• Collect examples of (instruction, output) pairs across many tasks and finetune an LM [FLAN-T5; Chung et al., 2022]
• Evaluate on unseen tasks

Instruction finetuning 41
• As is usually the case, data + model scale is key for this to work!
• For example, the SuperNaturalInstructions dataset contains over 1.6K tasks and 3M+ examples [Wang et al., 2022]
  • Classification, sequence tagging, rewriting, translation, QA...
• (A minimal training-loop sketch follows below.)
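Before turning to evaluation, here is a minimal sketch (an illustration, not from the slides) of what the instruction-finetuning step amounts to in code: maximize log p(output | instruction) over a mixture of tasks. It assumes Hugging Face transformers with a small T5 checkpoint; the two-example "dataset" is a toy stand-in for something like SuperNaturalInstructions.

```python
# Instruction finetuning in miniature: maximize log p(output | instruction)
# over a mixture of tasks. A sketch assuming Hugging Face `transformers`
# with a small T5 checkpoint; the two-example "dataset" is a toy stand-in.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

instruction_data = [  # (instruction, output) pairs spanning different tasks
    ("Translate English to German: The house is wonderful.", "Das Haus ist wunderbar."),
    ("Is the following review positive or negative? 'The movie was great.'", "positive"),
]

model.train()
for instruction, output in instruction_data:
    inputs = tokenizer(instruction, return_tensors="pt")
    labels = tokenizer(output, return_tensors="pt").input_ids
    # Standard cross-entropy on the target tokens, conditioned on the instruction.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The objective is the same language-modeling cross-entropy as in pretraining; only the data changes, which is why scaling the number and diversity of tasks matters so much.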
Q: how do we evaluate such a model?

Aside: Benchmarks for multitask LMs 42
Massive Multitask Language Understanding (MMLU) [Hendrycks et al., 2021]: a benchmark for measuring LM performance on 57 diverse knowledge-intensive tasks.

Some intuition: examples from MMLU

Progress on MMLU
• Rapid, impressive progress on challenging knowledge-intensive benchmarks

Aside: Benchmarks for multitask LMs 45
BIG-Bench [Srivastava et al., 2022]: 200+ tasks, spanning a wide range of domains.
https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/README.md

Instruction finetuning 47
• Recall the T5 encoder-decoder model from the pretraining lecture [Raffel et al., 2019], pretrained on the span-corruption task
• Flan-T5 [Chung et al., 2022]: T5 models finetuned on 1.8K additional tasks
(Figure: BIG-bench + MMLU average, normalized; bigger model = bigger Δ from instruction finetuning.)

Instruction finetuning 48
Before instruction finetuning [Chung et al., 2022]
Highly recommend trying FLAN-T5 out to get a sense of its capabilities: https://huggingface.co/google/flan-t5-xxl

Instruction finetuning 49
After instruction finetuning [Chung et al., 2022]

A huge diversity of instruction-tuning datasets
• The release of LLaMA led to open-source attempts to "create" instruction-tuning data
• What have we learned from this?
  • You can generate data synthetically (from bigger LMs)
  • You don't need many samples to instruction tune
  • Crowdsourcing can be pretty effective!

Lecture Plan: From Language Models to Assistants 52
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – ?
   – ?
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Limitations of instruction finetuning? 53
• One limitation of instruction finetuning is obvious: it's expensive to collect ground-truth data for tasks. Can you think of other, subtler limitations?
• Problem 1: tasks like open-ended creative generation have no right answer.
  • Write me a story about a dog and her pet grasshopper.
• Problem 2: language modeling penalizes all token-level mistakes equally, but some errors are worse than others.
  (Figure: an LM completing "Avatar is a ___ TV show"; predicting "adventure" instead of "fantasy" is a far less serious error than "musical", yet the token-level loss treats them the same. A tiny numerical illustration follows below.)
• Problem 3: humans generate suboptimal answers
• Even with instruction finetuning, there is a mismatch between the LM objective and the objective of "satisfy human preferences"!
• Can we explicitly attempt to satisfy human preferences?
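To make Problem 2 concrete, here is a tiny numerical illustration (toy numbers, plain PyTorch; not from the slides): the token-level cross-entropy depends only on the probability assigned to the gold token, so a model leaning toward the benign alternative and one leaning toward the nonsensical one receive exactly the same penalty.

```python
# Problem 2 in miniature: cross-entropy only looks at the probability of the
# gold token, not at *which* wrong token the model prefers instead.
# Toy vocabulary and probabilities; plain PyTorch.
import torch
import torch.nn.functional as F

vocab = ["fantasy", "adventure", "musical"]
gold = torch.tensor([0])  # gold next token: "fantasy"

# Model A puts its remaining mass on the benign alternative "adventure".
probs_a = torch.tensor([[0.40, 0.55, 0.05]])
# Model B puts its remaining mass on the much worse alternative "musical".
probs_b = torch.tensor([[0.40, 0.05, 0.55]])

loss_a = F.nll_loss(probs_a.log(), gold)
loss_b = F.nll_loss(probs_b.log(), gold)
print(loss_a.item(), loss_b.item())  # identical: the objective cannot tell the errors apart
```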
Lecture Plan: From Language Models to Assistants 54
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Optimizing for human preferences (DPO/RLHF)
4. What's next?

Optimizing for human preferences 56
• Let's say we were training a language model on some task (e.g. summarization).
• For an instruction $x$ and an LM sample $y$, imagine we had a way to obtain a human reward for that output: $R(x, y) \in \mathbb{R}$, higher is better.
• Now we want to maximize the expected reward of samples from our LM:
  $\mathbb{E}_{\hat{y} \sim p_\theta(\hat{y} \mid x)}\left[ R(x, \hat{y}) \right]$
• Example, with $x$ = "SAN FRANCISCO, California (CNN) -- A magnitude 4.2 earthquake shook the San Francisco ... overturn unstable objects.":
  • $y_1$ = "An earthquake hit San Francisco. There was minor property damage, but no injuries." → $R(x, y_1) = 8.0$
  • $y_2$ = "The Bay Area has good weather but is prone to earthquakes and wildfires." → $R(x, y_2) = 1.2$

High-level instantiation: the 'RLHF' pipeline
• First step: instruction tuning!
• Second + third steps: maximize reward (but how??)

How do we get the rewards? 61
• Problem 1: human-in-the-loop is expensive!
• Solution: instead of directly asking humans for preferences, model their preferences as a separate (NLP) problem! [Knox and Stone, 2009]
• Train a reward model $RM_\phi(x, y)$ to predict human reward from an annotated dataset, then optimize for $RM_\phi$ instead.

How do we model human preferences? 62
• Problem 2: human judgments are noisy and miscalibrated!
• Solution: instead of asking for direct ratings, ask for pairwise comparisons, which can be more reliable [Phelps et al., 2015; Clark et al., 2018]
• e.g. $y_3$ = "A 4.2 magnitude earthquake hit San Francisco, resulting in massive damage." → $R(x, y_3) = ?$ 4.1? 6.6? 3.2?

How do we model human preferences? 63
• Annotators instead rank the candidate summaries $y_1$, $y_2$, $y_3$ pairwise, and the reward model $RM_\phi$ is trained on these comparisons:
  $J_{RM}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim D}\left[ \log \sigma\!\left( RM_\phi(x, y^w) - RM_\phi(x, y^l) \right) \right]$
  where $y^w$ is the "winning" sample and $y^l$ the "losing" sample: $y^w$ should score higher than $y^l$.
• This is the Bradley-Terry [1952] paired comparison model. (A short PyTorch sketch of this loss follows below.)
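A minimal sketch of this pairwise Bradley-Terry loss in PyTorch. The TinyRewardModel below is a stand-in introduced only for illustration: in practice the reward model is usually an LM backbone with a scalar head, and the "features" would be encodings of each (prompt, response) pair.

```python
# Bradley-Terry pairwise loss for reward-model training:
#   J(phi) = -E_{(x, y_w, y_l) ~ D} [ log sigmoid( RM_phi(x, y_w) - RM_phi(x, y_l) ) ]
# Sketch with a stand-in reward model: any module that maps features -> scalar.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in for an LM backbone + scalar head; input is a (prompt, response) encoding."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)  # one scalar reward per example

rm = TinyRewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-3)

# Toy batch: encodings of the "winning" and "losing" responses to the same prompts.
feats_w = torch.randn(8, 16)   # encodings of (x, y_w)
feats_l = torch.randn(8, 16)   # encodings of (x, y_l)

loss = -F.logsigmoid(rm(feats_w) - rm(feats_l)).mean()  # y_w should score higher than y_l
loss.backward()
optimizer.step()
```

Only the difference of the two scores enters the loss, a detail that becomes important again in the DPO derivation later.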
RLHF: Optimizing the learned reward model 65
• We have the following:
  • A pretrained (possibly instruction-finetuned) LM $p^{PT}(y \mid x)$
  • A reward model $RM_\phi(x, y)$ that produces scalar rewards for LM outputs, trained on a dataset of human comparisons
• Now to do RLHF:
  • Copy the model: $p^{RL}_\theta(y \mid x)$, with parameters $\theta$ we would like to optimize
  • We want to optimize: $\mathbb{E}_{\hat{y} \sim p^{RL}_\theta(\hat{y} \mid x)}\left[ RM_\phi(x, \hat{y}) \right]$
• Do you see any problems?

RLHF: Optimizing the learned reward model 66
• Learned rewards are imperfect, so this objective can be over-optimized against the reward model.
• Add a penalty for drifting too far from the initialization:
  $\mathbb{E}_{\hat{y} \sim p^{RL}_\theta(\hat{y} \mid x)}\left[ RM_\phi(x, \hat{y}) - \beta \log \frac{p^{RL}_\theta(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} \right]$
• The penalty makes us pay a price when $p^{RL}_\theta(\hat{y} \mid x) > p^{PT}(\hat{y} \mid x)$; in expectation it is the Kullback-Leibler (KL) divergence between $p^{RL}_\theta(\hat{y} \mid x)$ and $p^{PT}(\hat{y} \mid x)$, which prevents us from diverging too far from the pretrained model.

How to optimize? Reinforcement Learning! 67
• The field of reinforcement learning (RL) has studied these (and related) problems for many years now [Williams, 1992; Sutton and Barto, 1998]
• Circa 2013: resurgence of interest in RL applied to deep learning and game-playing [Mnih et al., 2013]
• But the interest in applying RL to modern LMs is an even newer phenomenon [Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022].
• General idea:
  • Generate completions from $p^{RL}_\theta$ for several tasks
  • Compute rewards using $RM_\phi(x, y)$
  • Update $p^{RL}_\theta(y \mid x)$ to increase the probability of high-reward completions

RLHF provides gains over pretraining + finetuning 68
(Figure from Stiennon et al., 2020: human raters prefer summaries from $p^{RL}(y \mid x)$ over those from $p^{IFT}(y \mid x)$ and $p^{PT}(y \mid x)$.)

RLHF can be complex 69
[Secrets of RLHF; Zheng et al., 2023]
• RL optimization can be computationally expensive and tricky:
  • Fitting a value function
  • Online sampling is slow
  • Performance can be sensitive to hyperparameters

Can we simplify RLHF? 70
• The current pipeline is as follows:
  • Train a reward model $RM_\phi(x, y)$ to produce scalar rewards for LM outputs, trained on a dataset of human comparisons
  • Optimize the pretrained (possibly instruction-finetuned) LM $p^{PT}(y \mid x)$ to produce the final RLHF LM $p^{RL}_\theta(\hat{y} \mid x)$
• What if there were a way to write $RM_\phi(x, y)$ in terms of $p^{RL}_\theta(\hat{y} \mid x)$?
  • Derive $RM_\theta(x, y)$ in terms of $p^{RL}_\theta(\hat{y} \mid x)$
  • Optimize parameters $\theta$ by fitting $RM_\theta(x, y)$ to the preference data instead of $RM_\phi(x, y)$
• How is this possible? The only external information to the optimization comes from the preference labels.

Towards Direct Preference Optimization 71
• Recall, we want to maximize the following objective:
  $\mathbb{E}_{\hat{y} \sim p^{RL}_\theta(\hat{y} \mid x)}\left[ RM(x, \hat{y}) - \beta \log \frac{p^{RL}_\theta(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} \right]$
• There is a closed-form solution to this:
  $p^*(\hat{y} \mid x) = \frac{1}{Z(x)} \, p^{PT}(\hat{y} \mid x) \exp\!\left( \frac{1}{\beta} RM(x, \hat{y}) \right)$
• Rearrange the terms:
  $RM(x, \hat{y}) = \beta \log \frac{p^*(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} + \beta \log Z(x)$
• This holds for arbitrary LMs, so we can parameterize the reward directly by the policy:
  $RM_\theta(x, \hat{y}) = \beta \log \frac{p^{RL}_\theta(\hat{y} \mid x)}{p^{PT}(\hat{y} \mid x)} + \beta \log Z(x)$

Direct Preference Optimization (DPO) 72
• Recall how we fit the reward model $RM_\phi(x, y)$:
  $J_{RM}(\phi) = -\mathbb{E}_{(x, y^w, y^l) \sim D}\left[ \log \sigma\!\left( RM_\phi(x, y^w) - RM_\phi(x, y^l) \right) \right]$
• Notice that we only need the difference between the rewards for $y^w$ and $y^l$. For $RM_\theta(x, y)$ this simplifies, since the $\beta \log Z(x)$ terms cancel:
  $RM_\theta(x, y^w) - RM_\theta(x, y^l) = \beta \log \frac{p^{RL}_\theta(y^w \mid x)}{p^{PT}(y^w \mid x)} - \beta \log \frac{p^{RL}_\theta(y^l \mid x)}{p^{PT}(y^l \mid x)}$
• The final DPO loss function is:
  $J_{DPO}(\theta) = -\mathbb{E}_{(x, y^w, y^l) \sim D}\left[ \log \sigma\!\left( RM_\theta(x, y^w) - RM_\theta(x, y^l) \right) \right]$
• We have a simple classification loss function that connects preference data to language model parameters directly! (A short sketch follows below.)
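A minimal sketch of the DPO loss (illustrative, not the reference implementation). It assumes you have already computed the summed token log-probabilities of each winning and losing response under the trainable policy $p^{RL}_\theta$ and the frozen reference $p^{PT}$; the tensors below are toy stand-ins for those quantities.

```python
# DPO loss: a binary classification loss on preference pairs, written directly
# in terms of policy and reference log-probabilities (no explicit reward model).
# Sketch with toy log-prob tensors; in practice these are sums of token
# log-probs of each response under the trainable policy and a frozen reference LM.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid( beta * [log-ratio(y_w) - log-ratio(y_l)] ), averaged over the batch."""
    implicit_reward_w = beta * (policy_logp_w - ref_logp_w)   # RM_theta(x, y_w) up to beta*log Z(x)
    implicit_reward_l = beta * (policy_logp_l - ref_logp_l)   # RM_theta(x, y_l) up to beta*log Z(x)
    return -F.logsigmoid(implicit_reward_w - implicit_reward_l).mean()  # log Z(x) cancels

# Toy batch of summed sequence log-probs (would come from the two LMs in practice).
policy_logp_w = torch.tensor([-12.3, -40.1], requires_grad=True)
policy_logp_l = torch.tensor([-15.7, -38.2], requires_grad=True)
ref_logp_w = torch.tensor([-13.0, -41.0])
ref_logp_l = torch.tensor([-14.9, -37.5])

loss = dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)
loss.backward()  # gradients flow to the policy log-probs, and thus to theta
print(loss.item())
```

Note that $\beta \log Z(x)$ never needs to be computed: it cancels in the difference, which is exactly what makes the reward reparameterization practical.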
Summary (DPO and RLHF) 74
• We want to optimize for human preferences.
• Instead of humans writing the answers or giving uncalibrated scores, we get humans to rank different LM-generated answers.
• Reinforcement learning from human feedback (RLHF):
  • Train an explicit reward model on comparison data to predict a score for a given completion
  • Optimize the LM to maximize the predicted score (under a KL constraint)
  • Very effective when tuned well, but computationally expensive and tricky to get right
• Direct Preference Optimization (DPO):
  • Optimize LM parameters directly on preference data by solving a binary classification problem
  • Simple and effective, similar properties to RLHF, does not leverage online data

InstructGPT: scaling up RLHF to tens of thousands of tasks 75
[Ouyang et al., 2022] 30k tasks!

InstructGPT: scaling up RLHF to tens of thousands of tasks 76
Tasks collected from labelers [Ouyang et al., 2022]

InstructGPT 77-78 (examples)

ChatGPT: Instruction Finetuning + RLHF for dialog agents 79
• Trained with instruction finetuning and RLHF. https://openai.com/blog/chatgpt/
• Note: OpenAI (and similar companies) are keeping more details secret about ChatGPT training (including data, training parameters, model size), perhaps to keep a competitive edge…

DPO is enabling open-source and closed-source models to improve! 82
• Open-source LLMs now almost all just use DPO (and it works well!)

RLHF/DPO behaviors – clear stylistic changes
• Significantly more detailed, nicer/clearer list-like formatting [Dubois et al., 2023]

Lecture Plan: From Language Models to Assistants 85
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Optimizing for human preferences (DPO/RLHF)
   + Directly model preferences (cf. language modeling), generalize beyond labeled data
   – RL is very tricky to get right
   – ?
4. What's next?
Limitations of RL + Reward Modeling 86
• Human preferences are unreliable!
  • "Reward hacking" is a common problem in RL
    https://openai.com/blog/faulty-reward-functions/
  • Chatbots are rewarded to produce responses that seem authoritative and helpful, regardless of truth
  • This can result in making up facts + hallucinations
    https://www.npr.org/2023/02/09/1155650909/google-chatbot--error-bard-shares
    https://news.ycombinator.com/item?id=34776508
    https://apnews.com/article/kansas-city-chiefs-philadelphia-eagles-technology-science-82bc20f207e3e4cf81abc6a5d9e6b23a
• Models of human preferences are even more unreliable!
  • Reward model over-optimization [Stiennon et al., 2020]. (Figure: optimizing the penalized reward $R(s) = RM_\phi(s) - \beta \log \frac{p^{RL}_\theta(s)}{p^{PT}(s)}$ too aggressively eventually hurts actual human preference even as the learned reward keeps improving.)
• There is a real concern of AI mis(alignment)!
  https://twitter.com/percyliang/status/1600383429463355392

Open-source RLHF is now mostly (not RL)
• Open-source LLMs now almost all just use DPO (and it works well!)

Where do the labels come from?
• RLHF labels are often obtained from overseas, low-wage workers
• We also need to be quite careful about how annotator biases might creep into LMs (cf. the opinion distributions of 'base' language models measured by OpinionQA [Santurkar+ 2023])

Lecture Plan: From Language Models to Assistants 93
1. Zero-Shot (ZS) and Few-Shot (FS) In-Context Learning
   + No finetuning needed; prompt engineering (e.g. CoT) can improve performance
   – Limits to what you can fit in context
   – Complex tasks will probably need gradient steps
2. Instruction finetuning
   + Simple and straightforward, generalize to unseen tasks
   – Collecting demonstrations for so many tasks is expensive
   – Mismatch between LM objective and human preferences
3. Reinforcement Learning from Human Feedback (RLHF)
   + Directly model preferences (cf. language modeling), generalize beyond labeled data
   – RL is very tricky to get right
   – Human preferences are fallible; models of human preferences even more so
4. What's next?

Language models as multitask assistants? 94
We've finally (mostly) answered how we get from "Stanford University is located in __________" to a helpful multitask assistant.

What's next? 96
• RLHF is still a very underexplored and fast-moving area: by the next offering of this class these slides may look very different!
• RLHF is (still!) data expensive. Recent work aims to alleviate such data requirements:
  • RL from AI feedback [Bai et al., 2022]; see the "Constitutional AI" example below
"Constitutional" AI [Bai et al., 2022] 97
  Human: Can you help me hack into my neighbor's wifi?
  Assistant: Sure thing, you can use an app called VeryEasyHack.
  Critique Request: Identify ways in which the assistant's last response is harmful.
  Critique: Hacking into someone else's wifi is an invasion of their privacy and is possibly illegal.
  Revision Request: Rewrite the assistant response to remove harmful content.
  Revision: Hacking into your neighbor's wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.

What's next? (continued) 98
  • Finetuning LMs on their own outputs [Huang et al., 2022; Zelikman et al., 2022], especially for code and reasoning: e.g. finetuning on the LM's own chains of thought [Huang et al., 2022] and the Self-Taught Reasoner (STaR) [Zelikman et al., 2022]
  • Personalizing language models (see the PRISM Alignment Project [Kirk et al., 2024]) 99
• However, there are still many limitations of large LMs (size, hallucination) that may not be solvable with RLHF!