Natural Language Processing with Deep Learning
CS224N/Ling284
John Hewitt
Lecture 8: Self-Attention and Transformers
Adapted from slides by Anna Goldie, John Hewitt

The Transformer Decoder
• A Transformer decoder is how we'll build systems like language models.
• It's a lot like our minimal self-attention architecture, but with a few more components.
• The embeddings and position embeddings are identical.
• We'll next replace our self-attention with multi-head self-attention.

Recall the Self-Attention Hypothetical Example

Hypothetical Example of Multi-Head Attention

Sequence-Stacked form of Attention
• Let's look at how key-query-value attention is computed, in matrices.
• Let X = [x_1; …; x_n] ∈ ℝ^{n×d} be the concatenation of input vectors.
• First, note that XQ ∈ ℝ^{n×d}, XK ∈ ℝ^{n×d}, XV ∈ ℝ^{n×d}.
• The output is defined as output = softmax(XQ(XK)^⊤)XV ∈ ℝ^{n×d}.
• First, take the query-key dot products in one matrix multiplication: XQ(XK)^⊤ = XQK^⊤X^⊤ ∈ ℝ^{n×n}. All pairs of attention scores!
• Next, softmax, and compute the weighted average with another matrix multiplication: softmax(XQK^⊤X^⊤)XV = output ∈ ℝ^{n×d}.

Multi-headed attention
• What if we want to look in multiple places in the sentence at once?
• For word i, self-attention "looks" where x_i^⊤ Q^⊤ K x_j is high, but maybe we want to focus on different j for different reasons?
• We'll define multiple attention "heads" through multiple Q, K, V matrices.
• Let Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^{d×d/h}, where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
  • output_ℓ = softmax(XQ_ℓ K_ℓ^⊤ X^⊤) XV_ℓ, where output_ℓ ∈ ℝ^{n×d/h}.
• Then the outputs of all the heads are combined:
  • output = [output_1; …; output_h] Y, where Y ∈ ℝ^{d×d}.
• Each head gets to "look" at different things, and construct value vectors differently.

Multi-head self-attention is computationally efficient
• Even though we compute h many attention heads, it's not really more costly.
• We compute XQ ∈ ℝ^{n×d}, and then reshape to ℝ^{n×h×d/h}. (Likewise for XK, XV.)
• Then we transpose to ℝ^{h×n×d/h}; now the head axis is like a batch axis.
• Almost everything else is identical, and the matrices are the same sizes.
• [Figure: the same two matrix multiplications as before, now with a head axis. The query-key dot products XQK^⊤X^⊤ ∈ ℝ^{3×n×n} give 3 sets of all pairs of attention scores (one per head); softmax and the weighted average with XV again give output ∈ ℝ^{n×d}.]

Scaled Dot Product [Vaswani et al., 2017]
• "Scaled dot product" attention aids in training.
• When dimensionality d becomes large, dot products between vectors tend to become large.
• Because of this, inputs to the softmax function can be large, making the gradients small.
• Instead of the self-attention function we've seen:
  output_ℓ = softmax(XQ_ℓ K_ℓ^⊤ X^⊤) XV_ℓ
• We divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):
  output_ℓ = softmax(XQ_ℓ K_ℓ^⊤ X^⊤ / √(d/h)) XV_ℓ

The Transformer Decoder
• Now that we've replaced self-attention with multi-head self-attention, we'll go through two optimization tricks that end up being very important:
  • Residual connections
  • Layer normalization
• In most Transformer diagrams, these are often written together as "Add & Norm".
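To make the sequence-stacked, multi-head computation above concrete before moving on, here is a minimal PyTorch sketch of multi-head scaled dot-product self-attention. The shapes and the reshape/transpose trick follow the slides, but the function and variable names are my own, and the sketch omits masking and dropout.

import torch
import torch.nn.functional as F

def multi_head_self_attention(X, Q, K, V, Y, h):
    """X: (n, d) inputs; Q, K, V: (d, d) projections; Y: (d, d) output matrix; h: number of heads."""
    n, d = X.shape
    # Project, then reshape so the head axis acts like a batch axis: (h, n, d/h)
    q = (X @ Q).reshape(n, h, d // h).transpose(0, 1)
    k = (X @ K).reshape(n, h, d // h).transpose(0, 1)
    v = (X @ V).reshape(n, h, d // h).transpose(0, 1)
    # All pairs of attention scores, one (n, n) matrix per head, scaled by sqrt(d/h)
    scores = q @ k.transpose(-2, -1) / (d // h) ** 0.5   # (h, n, n)
    alpha = F.softmax(scores, dim=-1)                    # each row sums to 1
    out = alpha @ v                                      # (h, n, d/h): weighted average of values
    # Concatenate the heads back together and mix them with Y
    return out.transpose(0, 1).reshape(n, d) @ Y         # (n, d)

x = torch.randn(5, 64)                                   # n = 5 tokens, d = 64
params = [torch.randn(64, 64) for _ in range(4)]         # Q, K, V, Y
print(multi_head_self_attention(x, *params, h=8).shape)  # torch.Size([5, 64])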
The Transformer Encoder: Residual connections [He et al., 2016]
• Residual connections are a trick to help models train better.
• Instead of X^(i) = Layer(X^(i−1)) (where i represents the layer),
• we let X^(i) = X^(i−1) + Layer(X^(i−1)) (so we only have to learn "the residual" from the previous layer).
• The gradient is great through the residual connection; it's 1!
• Bias towards the identity function!
• [Figure: X^(i−1) → Layer → X^(i), without and with the residual (+) connection; loss landscape visualization from Li et al., 2018, on a ResNet, with and without residuals.]

The Transformer Encoder: Layer normalization [Ba et al., 2016]
• Layer normalization is a trick to help models train faster.
• Idea: cut down on uninformative variation in hidden vector values by normalizing to zero mean and unit standard deviation within each layer.
• LayerNorm's success may be due to its normalizing gradients [Xu et al., 2019].
• Let x ∈ ℝ^d be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^{d} x_j; this is the mean; μ ∈ ℝ.
• Let σ = sqrt((1/d) Σ_{j=1}^{d} (x_j − μ)^2); this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned "gain" and "bias" parameters. (Can omit!)
• Then layer normalization computes:
  output = ((x − μ) / (σ + ε)) ∗ γ + β
  (Normalize by the scalar mean and standard deviation; modulate by the learned elementwise gain and bias.)

The Transformer Decoder
• The Transformer Decoder is a stack of Transformer Decoder Blocks.
• Each Block consists of:
  • Self-attention
  • Add & Norm
  • Feed-Forward
  • Add & Norm
• That's it! We've gone through the Transformer Decoder.

The Transformer Encoder
• The Transformer Decoder constrains to unidirectional context, as for language models.
• What if we want bidirectional context, like in a bidirectional RNN?
• This is the Transformer Encoder. The only difference is that we remove the masking in the self-attention. (No masking!)

The Transformer Encoder-Decoder
• Recall that in machine translation, we processed the source sentence with a bidirectional model and generated the target with a unidirectional model.
• For this kind of seq2seq format, we often use a Transformer Encoder-Decoder.
• We use a normal Transformer Encoder.
• Our Transformer Decoder is modified to perform cross-attention to the output of the Encoder.

Cross-attention (details)
• We saw that self-attention is when keys, queries, and values come from the same source.
• In the decoder, we have attention that looks more like what we saw last week.
• Let h_1, …, h_n be output vectors from the Transformer encoder; h_i ∈ ℝ^d.
• Let z_1, …, z_n be input vectors from the Transformer decoder; z_i ∈ ℝ^d.
• Then keys and values are drawn from the encoder (like a memory):
  • k_i = K h_i, v_i = V h_i.
• And the queries are drawn from the decoder: q_i = Q z_i.

Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

Great Results with Transformers [Vaswani et al., 2017]
• First, Machine Translation from the original Transformers paper!
• Not just better Machine Translation BLEU scores; also more efficient to train!
• [Test sets: WMT 2014 English-German and English-French]

Great Results with Transformers [Liu et al., 2018]; WikiSum dataset
• Next, document generation!
• [Table comparing the old standard with models that are Transformers all the way down.]

Great Results with Transformers [Liu et al., 2018]
• Before too long, most Transformers results also included pretraining, a method we'll go over on Thursday.
• Transformers' parallelizability allows for efficient pretraining, and has made them the de facto standard.
• On this popular aggregate benchmark, for example: all top models are Transformer (and pretraining)-based.
• More results Thursday when we discuss pretraining.
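Before moving on to drawbacks, here is a minimal PyTorch sketch of the "Add & Norm" pieces from earlier in this lecture: layer normalization following the μ, σ, γ, β formula on the layer-normalization slide, wrapped around a residual connection. This is an illustrative sketch under those formulas, not any particular library's implementation.

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (..., d). Normalize each vector to zero mean and unit std, then apply gain and bias."""
    mu = x.mean(dim=-1, keepdim=True)                    # mu = (1/d) * sum_j x_j
    sigma = x.std(dim=-1, unbiased=False, keepdim=True)  # sqrt((1/d) * sum_j (x_j - mu)^2)
    return (x - mu) / (sigma + eps) * gamma + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection around a sublayer, followed by layer norm: LN(x + Layer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

d = 8
x = torch.randn(4, d)                                    # 4 word vectors
gamma, beta = torch.ones(d), torch.zeros(d)              # learned gain and bias (here: identity init)
ff = torch.nn.Linear(d, d)                               # stand-in for a feed-forward sublayer
print(add_and_norm(x, ff, gamma, beta).shape)            # torch.Size([4, 8])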
Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

What would we like to fix about the Transformer?
• Quadratic compute in self-attention (today):
  • Computing all pairs of interactions means our computation grows quadratically with the sequence length!
  • For recurrent models, it only grew linearly!
• Position representations:
  • Are simple absolute indices the best we can do to represent position?
  • Relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]

Quadratic computation as a function of sequence length
• One of the benefits of self-attention over recurrence was that it's highly parallelizable.
• However, its total number of operations grows as O(n²d), where n is the sequence length and d is the dimensionality.
  • XQK^⊤X^⊤ ∈ ℝ^{n×n}: we need to compute all pairs of interactions, costing O(n²d).
• Think of d as around 1,000 (though for large language models it's much larger!).
• So, for a single (shortish) sentence, n ≤ 30; n² ≤ 900.
• In practice, we set a bound like n = 512.
• But what if we'd like n ≥ 50,000? For example, to work on long documents?

Work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: can we build models like Transformers without paying the O(n²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020].
  • Key idea: map the sequence length dimension to a lower-dimensional space for values and keys.
  • [Figure: inference time (s) as a function of sequence length / batch size.]

Do we even need to remove the quadratic cost of attention?
• As Transformers grow larger, a larger and larger percent of compute is outside the self-attention portion, despite the quadratic cost.
• In practice, almost no large Transformer language models use anything but the quadratic-cost attention we've presented here.
  • The cheaper methods tend not to work as well at scale.
• So, is there no point in trying to design cheaper alternatives to self-attention?
• Or would we unlock much better models with much longer contexts (>100k tokens?) if we were to do it right?

Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve performance."

Parting remarks
• Pretraining on Tuesday!
• Good luck on assignment 4!
• Remember to work on your project proposal!

Word structure and subword models
• Let's take a look at the assumptions we've made about a language's vocabulary.
• We assume a fixed vocab of tens of thousands of words, built from the training set. All novel words seen at test time are mapped to a single UNK.
• word → vocab mapping (→ embedding):
  • Common words: hat → pizza (index); learn → tasty (index)
  • Variations, misspellings: taaaaasty → UNK (index); laern → UNK (index)
  • Novel items: Transformerify → UNK (index)

Word structure and subword models
• Finite vocabulary assumptions make even less sense in many languages.
• Many languages exhibit complex morphology, or word structure.
  • The effect is more word types, each occurring fewer times.
• Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information. (Tense, mood, definiteness, negation, information about the object, ++)
• Here's a small fraction of the conjugations for ambia – to tell. [Wiktionary]
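To make the fixed-vocabulary failure mode above concrete, here is a tiny Python sketch of the word-to-index mapping; the specific vocabulary, words, and index values are made up for illustration.

# A finite vocabulary built from the training set; everything else collapses to UNK.
vocab = {"<unk>": 0, "hat": 1, "learn": 2, "tasty": 3}

def word_to_index(word, vocab):
    """Common words get their own index; variations, misspellings, and novel items all become UNK."""
    return vocab.get(word, vocab["<unk>"])

for w in ["hat", "learn", "taaaaasty", "laern", "Transformerify"]:
    print(w, "->", word_to_index(w, vocab))   # the last three all map to index 0 (<unk>)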
The byte-pair encoding algorithm
• Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level. (Parts of words, characters, bytes.)
  • The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
  • At training and testing time, each word is split into a sequence of known subwords.
• Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary:
  1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
  2. Using a corpus of text, find the most common adjacent characters "a,b"; add "ab" as a subword.
  3. Replace instances of the character pair with the new subword; repeat until desired vocab size.
• Originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained models.
[Sennrich et al., 2016; Wu et al., 2016]

Word structure and subword models
• Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components.
• In the worst case, words are split into as many subwords as they have characters.
  • Common words: hat → hat; learn → learn
  • Variations, misspellings: taaaaasty → taa## aaa## sty; laern → la## ern##
  • Novel items: Transformerify → Transformer## ify

Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

Motivating word meaning and context
• Recall the adage we mentioned at the beginning of the course:
  "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
• This quote is a summary of distributional semantics, and motivated word2vec. But:
  "… the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (J. R. Firth 1935)
• Consider "I record the record": the two instances of record mean different things.
[Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]

Where we were: pretrained word embeddings
• Circa 2017:
  • Start with pretrained word embeddings (no context!)
  • Learn how to incorporate context in an LSTM or Transformer while training on the task.
• Some issues to think about:
  • The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
  • Most of the parameters in our network are randomly initialized!
• [Figure: "… the movie was …" feeds a network that predicts ŷ; only the word embeddings are pretrained, the rest is not. Recall, movie gets the same word embedding, no matter what sentence it shows up in.]

Where we're going: pretraining whole models
• In modern NLP:
  • All (or almost all) parameters in NLP networks are initialized via pretraining.
  • Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been exceptionally effective at building strong:
  • representations of language
  • parameter initializations for strong NLP models
  • probability distributions over language that we can sample from
• [Figure: the same "… the movie was …" network predicting ŷ, now pretrained jointly; this model has learned how to represent entire sentences through pretraining.]

What can we learn from reconstructing the input?
Stanford University is located in __________, California.

What can we learn from reconstructing the input?
I put ___ fork down on the table.

What can we learn from reconstructing the input?
The woman walked across the street, checking for traffic over ___ shoulder.

What can we learn from reconstructing the input?
I went to the ocean to see the fish, turtles, seals, and _____.

What can we learn from reconstructing the input?
Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___.

What can we learn from reconstructing the input?
Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______.

What can we learn from reconstructing the input?
I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

Pretraining through language modeling [Dai and Le, 2015]
• Recall the language modeling task:
  • Model p_θ(w_t | w_{1:t−1}), the probability distribution over words given their past contexts.
  • There's lots of data for this! (In English.)
• Pretraining through language modeling:
  • Train a neural network to perform language modeling on a large amount of text.
  • Save the network parameters.
• [Figure: a decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]

The Pretraining / Finetuning Paradigm
• Pretraining can improve NLP applications by serving as parameter initialization.
• Step 1: Pretrain (on language modeling). Lots of text; learn general things!
  • [Figure: the network (Transformer, LSTM, ++) predicts "goes to make tasty tea END" from "Iroh goes to make tasty tea".]
• Step 2: Finetune (on your task). Not many labels; adapt to the task!
  • [Figure: the same network (Transformer, LSTM, ++) reads "… the movie was …" and predicts ☺/☹.]

Stochastic gradient descent and pretrain/finetune
• Why should pretraining and finetuning help, from a "training neural nets" perspective?
• Pretraining provides parameters θ̂ by approximating min_θ ℒ_pretrain(θ). (The pretraining loss.)
• Then, finetuning approximates min_θ ℒ_finetune(θ), starting at θ̂. (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to θ̂ during finetuning.
  • So, maybe the finetuning local minima near θ̂ tend to generalize well!
  • And/or, maybe the gradients of finetuning loss near θ̂ propagate nicely!

Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Decoders
  • Language models! What we've seen so far.
  • Nice to generate from; can't condition on future words.
• Encoders
  • Get bidirectional context – can condition on future!
  • How do we train them to build strong representations?
• Encoder-Decoders
  • Good parts of decoders and encoders?
  • What's the best way to pretrain them?
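As a toy illustration of the two-step recipe above (minimize ℒ_pretrain to get θ̂, then minimize ℒ_finetune starting from θ̂), here is a minimal PyTorch sketch. The tiny embedding-plus-linear "model", the random token stream, and the random labels are stand-ins for illustration, not a real Transformer or dataset.

import torch
import torch.nn as nn

# A toy "decoder": embeddings + a linear output layer over a tiny vocabulary.
vocab_size, d = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d), nn.Linear(d, vocab_size))

# Step 1: Pretrain on language modeling: predict each next token (here from the current token only).
tokens = torch.randint(0, vocab_size, (1000,))           # pretend this is "lots of text"
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    logits = model(tokens[:-1])
    loss = nn.functional.cross_entropy(logits, tokens[1:])   # L_pretrain(theta)
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: Finetune on the downstream task, starting from the theta-hat found by pretraining.
classifier = nn.Linear(d, 2)                              # e.g., sentiment: positive / negative
inputs = torch.randint(0, vocab_size, (16, 10))           # a small labeled batch
labels = torch.randint(0, 2, (16,))
opt = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()), lr=1e-4)
features = model[0](inputs).mean(dim=1)                   # reuse the pretrained embeddings
loss = nn.functional.cross_entropy(classifier(features), labels)   # L_finetune(theta)
opt.zero_grad(); loss.backward(); opt.step()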
Pretraining encoders: what pretraining objective to use?
• So far, we've looked at language model pretraining. But encoders get bidirectional context, so we can't do language modeling!
• Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.
  h_1, …, h_T = Encoder(w_1, …, w_T)
  y_i ∼ A h_i + b
• Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM.
• [Figure: input "I [M] to the [M]"; the encoder, with learned A, b, predicts the masked words "went" and "store".]
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
• Devlin et al., 2018 proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT.
• Some more details about Masked LM for BERT:
  • Predict a random 15% of (sub)word tokens.
    • Replace input word with [MASK] 80% of the time
    • Replace input word with a random token 10% of the time
    • Leave input word unchanged 10% of the time (but still predict it!)
  • Why? Doesn't let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)
• [Figure: the Transformer Encoder reads "I pizza to the [M]" (from "I went to the store") and predicts "went" at the replaced position, "to" at the not-replaced position, and "store" at the masked position.]
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
• The pretraining input to BERT was two separate contiguous chunks of text:
  • BERT was trained to predict whether one chunk follows the other or is randomly sampled.
  • Later work has argued this "next sentence prediction" is not necessary.
[Devlin et al., 2018; Liu et al., 2019]

BERT: Bidirectional Encoder Representations from Transformers
• Details about BERT:
  • Two models were released:
    • BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
    • BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
  • Trained on:
    • BooksCorpus (800 million words)
    • English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
  • BERT was pretrained with 64 TPU chips for a total of 4 days.
  • (TPUs are special tensor operation acceleration hardware.)
• Finetuning is practical and common on a single GPU.
  • "Pretrain once, finetune many times."
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
• BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-the-art results on a broad range of tasks.
  • QQP: Quora Question Pairs (detect paraphrase questions)
  • QNLI: natural language inference over question answering data
  • SST-2: sentiment analysis
  • CoLA: corpus of linguistic acceptability (detect whether sentences are grammatical)
  • STS-B: semantic textual similarity
  • MRPC: Microsoft paraphrase corpus
  • RTE: a small natural language inference corpus
[Devlin et al., 2018]

Limitations of pretrained encoders
• Those results looked great! Why not use pretrained encoders for everything?
• If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don't naturally lead to nice autoregressive (1-word-at-a-time) generation methods.
• [Figure: a pretrained encoder fills in "Iroh goes to [MASK] tasty tea" with make/brew/craft; a pretrained decoder generates "goes to make tasty tea END" one word at a time.]
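Before turning to BERT variants, here is a minimal Python sketch of the masking recipe described above (select a random 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged but still predicted). The token and special-symbol conventions here are simplified assumptions, not BERT's exact preprocessing code.

import random

def mask_for_bert(tokens, vocab, mask_token="[MASK]", p_select=0.15):
    """Return (corrupted tokens, prediction targets); targets are None where no loss is taken."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < p_select:            # predict a random 15% of tokens
            targets.append(tok)
            r = random.random()
            if r < 0.8:                           # 80%: replace with [MASK]
                corrupted.append(mask_token)
            elif r < 0.9:                         # 10%: replace with a random vocabulary token
                corrupted.append(random.choice(vocab))
            else:                                 # 10%: leave unchanged (but still predict it!)
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)                  # no loss term at unselected positions
    return corrupted, targets

vocab = ["I", "went", "to", "the", "store", "pizza"]
print(mask_for_bert("I went to the store".split(), vocab))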
Extensions of BERT
• You'll see a lot of BERT variants like RoBERTa, SpanBERT, +++
• Some generally accepted improvements to the BERT pretraining formula:
  • RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
  • SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task.
• [Figure: BERT masks individual subwords of "It's irr## esi## sti## bly good"; SpanBERT masks the contiguous span "irr## esi## sti## bly".]
[Liu et al., 2019; Joshi et al., 2020]

Extensions of BERT
• A takeaway from the RoBERTa paper: more compute, more data can improve pretraining even when not changing the underlying Transformer encoder.
[Liu et al., 2019; Joshi et al., 2020]

Full Finetuning vs. Parameter-Efficient Finetuning
• Finetuning every parameter in a pretrained model works well, but is memory-intensive.
• But lightweight finetuning methods adapt pretrained models in a constrained way.
• Leads to less overfitting and/or more efficient finetuning and inference.
• [Figure: Full Finetuning adapts all parameters of the network (Transformer, LSTM, ++) reading "… the movie was …" and predicting ☺/☹; Lightweight Finetuning trains only a few existing or new parameters.]
[Liu et al., 2019; Joshi et al., 2020]

Parameter-Efficient Finetuning: Prefix-Tuning, Prompt tuning
• Prefix-Tuning adds a prefix of parameters, and freezes all pretrained parameters.
• The prefix is processed by the model just like real words would be.
• Advantage: each element of a batch at inference could run a different tuned model.
• [Figure: learnable prefix parameters prepended to "… the movie was …"; the frozen network (Transformer, LSTM, ++) predicts ☺/☹.]
[Li and Liang, 2021; Lester et al., 2021]

Parameter-Efficient Finetuning: Low-Rank Adaptation
• Low-Rank Adaptation learns a low-rank "diff" between the pretrained and finetuned weight matrices.
  • For a pretrained weight matrix W ∈ ℝ^{d×d}, learn A ∈ ℝ^{d×k} and B ∈ ℝ^{k×d}, and use W + AB in its place.
• Easier to learn than prefix-tuning.
[Hu et al., 2021]

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Decoders
  • Language models! What we've seen so far.
  • Nice to generate from; can't condition on future words.
• Encoders
  • Get bidirectional context – can condition on future!
  • How do we train them to build strong representations?
• Encoder-Decoders
  • Good parts of decoders and encoders?
  • What's the best way to pretrain them?

Pretraining encoder-decoders: what pretraining objective to use?
• For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:
  h_1, …, h_T = Encoder(w_1, …, w_T)
  h_{T+1}, …, h_{2T} = Decoder(w_{T+1}, …, w_{2T}, h_1, …, h_T)
  y_i ∼ A h_i + b, for i > T
• The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.
• [Figure: the encoder reads w_1, …, w_T; the decoder reads w_{T+1}, …, w_{2T} and predicts w_{T+2}, ….]
[Raffel et al., 2018]

Pretraining encoder-decoders: what pretraining objective to use?
• What Raffel et al., 2018 found to work best was span corruption. Their model: T5.
• Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!
• This is implemented in text preprocessing: it's still an objective that looks like language modeling at the decoder side.

Pretraining encoder-decoders: what pretraining objective to use?
• Raffel et al., 2018 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

Pretraining encoder-decoders: what pretraining objective to use?
• A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.
• [Figure: open-domain QA results on NQ (Natural Questions), WQ (WebQuestions), and TQA (TriviaQA) for T5 models with 220 million, 770 million, 3 billion, and 11 billion parameters.]
[Raffel et al., 2018]
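To illustrate the span-corruption preprocessing idea, here is a rough Python sketch that replaces chosen spans with unique placeholders and builds the decoder target from the removed spans. The sentinel format and the hand-picked spans are simplifications for illustration, not T5's exact scheme.

def span_corrupt(tokens, spans):
    """spans: list of (start, length) to remove. Returns (corrupted input, decoder target)."""
    corrupted, target = [], []
    i, sentinel = 0, 0
    for start, length in sorted(spans):
        corrupted += tokens[i:start] + [f"<X{sentinel}>"]           # unique placeholder per span
        target += [f"<X{sentinel}>"] + tokens[start:start + length]  # decode out the removed span
        i, sentinel = start + length, sentinel + 1
    corrupted += tokens[i:]
    return corrupted, target + ["<END>"]

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
print(inp)   # ['Thank', 'you', '<X0>', 'me', 'to', 'your', 'party', '<X1>', 'week']
print(tgt)   # ['<X0>', 'for', 'inviting', '<X1>', 'last', '<END>']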
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Decoders
  • Language models! What we've seen so far.
  • Nice to generate from; can't condition on future words.
  • All the biggest pretrained models are Decoders.
• Encoders
  • Get bidirectional context – can condition on future!
  • How do we train them to build strong representations?
• Encoder-Decoders
  • Good parts of decoders and encoders?
  • What's the best way to pretrain them?

Pretraining decoders
• When using language model pretrained decoders, we can ignore that they were trained to model p(w_t | w_{1:t−1}).
• We can finetune them by training a classifier on the last word's hidden state:
  h_1, …, h_T = Decoder(w_1, …, w_T)
  y ∼ A h_T + b
• where A and b are randomly initialized and specified by the downstream task.
• Gradients backpropagate through the whole network.
• [Figure: the decoder reads w_1, …, w_T and a linear layer (A, b) on the last hidden state predicts ☺/☹. Note how the linear layer hasn't been pretrained and must be learned from scratch.]

Pretraining decoders
• It's natural to pretrain decoders as language models and then use them as generators, finetuning their p_θ(w_t | w_{1:t−1})!
• This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
  • Dialogue (context = dialogue history)
  • Summarization (context = document)
  h_1, …, h_T = Decoder(w_1, …, w_T)
  w_t ∼ A h_{t−1} + b
• where A, b were pretrained in the language model!
• [Figure: the decoder reads w_1, …, w_5 and predicts w_2, …, w_6. Note how the linear layer has been pretrained.]

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
• 2018's GPT was a big success in pretraining a decoder!
  • Transformer decoder with 12 layers, 117M parameters.
  • 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
  • Byte-pair encoding with 40,000 merges.
  • Trained on BooksCorpus: over 7000 unique books.
    • Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
• How do we format inputs to our decoder for finetuning tasks?
• Natural Language Inference: label pairs of sentences as entailing/contradictory/neutral.
  • Premise: The man is in the doorway
  • Hypothesis: The person is near the door
  • Label: entailment
• Radford et al., 2018 evaluate on natural language inference. Here's roughly how the input was formatted, as a sequence of tokens for the decoder:
  [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
• The linear classifier is applied to the representation of the [EXTRACT] token.

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
• GPT results on various natural language inference datasets.

Increasingly convincing generations (GPT-2) [Radford et al., 2019]
• We mentioned how pretrained decoders can be used in their capacities as language models.
• GPT-2, a larger version (1.5B parameters) of GPT trained on more data, was shown to produce relatively convincing samples of natural language.
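As a final sketch, here is roughly how the decoder-as-classifier recipe above might look in PyTorch: format the premise/hypothesis pair as one token sequence ending in [EXTRACT], run a decoder over it, and apply a randomly initialized linear layer (A, b) to the final token's hidden state. The made-up token ids and the LSTM stand-in for the pretrained decoder are placeholders for illustration, not GPT's actual interface.

import torch
import torch.nn as nn

# Stand-in "pretrained decoder" (the slides say Transformer, LSTM, ++): embeddings + an LSTM.
d_model, vocab_size = 64, 1000
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.LSTM(d_model, d_model, batch_first=True)

# Format an NLI pair as one token sequence: [START] premise [DELIM] hypothesis [EXTRACT]
def format_nli(premise_ids, hypothesis_ids, start=1, delim=2, extract=3):
    return torch.tensor([[start] + premise_ids + [delim] + hypothesis_ids + [extract]])

tokens = format_nli([10, 11, 12, 13], [20, 21, 22])      # made-up token ids
hidden, _ = decoder(embed(tokens))                       # h_1, ..., h_T with shape (1, T, d_model)

# Randomly initialized linear head (A, b), applied to the [EXTRACT] (last) token's hidden state.
classifier = nn.Linear(d_model, 3)                       # entailment / contradiction / neutral
logits = classifier(hidden[:, -1, :])                    # y ~ A h_T + b
print(logits.shape)                                      # torch.Size([1, 3])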