Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 9: Pretraining
Adapted from slides by Anna Goldie, John Hewitt, Tatsunori Hashimoto

The pretraining revolution
[Figure: benchmark performance relative to human level over time for MNIST, Switchboard, ImageNet, SQuAD 1.1, SQuAD 2.0, GLUE, and SuperGLUE; on the SQuAD 2.0 leaderboard between Jan '19 and Jan '20, systems such as the BERT finetune baseline (ensemble), BERT + ConvLSTM + MTL + Verifier (ensemble), SAN (ensemble), ALBERT + DAAF + Verifier (ensemble), nlnet, and FPNet overtake the human baseline; the gains come from pretrained language models.]
Pretraining has had a major, tangible impact on how well NLP systems work.

Word structure and subword models
Let's take a look at the assumptions we've made about a language's vocabulary.
We assume a fixed vocab of tens of thousands of words, built from the training set. All novel words seen at test time are mapped to a single UNK.
[Figure: word -> vocab mapping -> embedding. Common words (hat, learn) each map to their own vocab index and embedding; variations (taaaaasty), misspellings (laern), and novel items (Transformerify) all map to UNK. A tweet ("Goooooood Vibesssssss") shows how such variations arise in real text.]

Word structure and subword models
Finite vocabulary assumptions make even less sense in many languages other than English.
• Many languages exhibit complex morphology, or word structure.
• The effect is many more word types, each occurring fewer times.
Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information. (Tense, mood, definiteness, negation, information about the object, ++)
Here's a small fraction of the conjugations for ambia (to tell).
[Table from Wiktionary: the conjugation of -ambia, with positive and negative forms of the infinitive, imperative, habitual, past, present, future, subjunctive, present conditional, past conditional, conditional contrary to fact, gnomic, and perfect paradigms, across persons and noun classes.]
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level. (Parts of words, characters, bytes.)
• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.
Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary:
1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
2. Using a corpus of text, find the most common adjacent characters "a, b"; add "ab" as a subword.
3. Replace instances of the character pair with the new subword; repeat until the desired vocab size is reached.
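A minimal Python sketch of this merge-learning loop (ours, not from the lecture): it assumes the corpus has already been split into words and is given as word counts; real implementations such as subword-nmt or SentencePiece add many efficiency and normalization details.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary.
    Each word is represented as a tuple of symbols plus an end-of-word marker."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy dictionary in the spirit of the worked example that follows
# (the count of 5 for "low" is assumed; the other counts appear in the slide).
print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4))
# e.g. [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
```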
Originally used in NLP for machine translation; now similar methods (WordPiece, SentencePiece) are used in pretrained models like BERT and GPT.

Byte Pair Encoding (BPE) [Sennrich et al., 2016]
Dictionary (word counts): 5 l o w, 2 l o w e r, 6 n e w e s t, 3 w i d e s t
Vocabulary: l, o, w, e, r, n, s, t, i, d
Merge "e s" -> es: 6 n e w es t, 3 w i d es t; vocabulary: l, o, w, e, r, n, s, t, i, d, es
Merge "es t" -> est: 6 n e w est, 3 w i d est; vocabulary: l, o, w, e, r, n, s, t, i, d, es, est

Word structure and subword models
Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components. In the worst case, words are split into as many subwords as they have characters.
word -> vocab mapping -> embedding:
• Common words: hat -> hat; learn -> learn
• Variations: taaaaasty -> taa## aaa## sty
• Misspellings: laern -> la## ern##
• Novel items: Transformerify -> Transformer## ify
Each subword in the mapping gets its own embedding.

Words in writing systems
Writing systems vary in how they represent words - or don't
• No word segmentation (e.g., written Chinese)
• Words (mainly) segmented: This is a sentence with words.
• Clitics/pronouns/agreement?
  • Separated: Je vous ai apporté des bonbons
  • Joined: e.g., Arabic, where a single written word can decompose as so+said+we+it
• Compounds?
  • Separated: life insurance company employee
  • Joined: Lebensversicherungsgesellschaftsangestellter

Below the word in writing systems
Human language writing systems aren't one thing!
• Phonemic (maybe digraphs): jiyawu ngabulu (Wambaya)
• Fossilized phonemic: thorough failure (English)
• Syllabic/moraic: Inuktitut
• Ideographic (syllabic): Chinese
• Combination of the above: Japanese

Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

Motivating word meaning and context
Recall the adage we mentioned at the beginning of the course:
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
This quote is a summary of distributional semantics, and motivated word2vec. But:
"... the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (J. R. Firth 1935)
Consider "I record the record": the two instances of record mean different things.
[Thanks to Yoav Goldberg on Twitter for pointing out the 1935 quote.]

Where we were: pretrained word embeddings
Circa 2015:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.
Some issues to think about:
• The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
• Most of the parameters in our network are randomly initialized!
[Figure: a network over "the movie was ..." in which only the word embeddings are pretrained; everything above them is not pretrained.]
[Recall: movie gets the same word embedding, no matter what sentence it shows up in.]

Where we're going: pretraining whole models
In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been exceptionally effective at building strong:
  • representations of language
  • parameter initializations for strong NLP models
  • probability distributions over language that we can sample from
[Figure: the same network over "the movie was ...", now pretrained jointly from the embeddings up.]
[This model has learned how to represent entire sentences through pretraining.]

What can we learn from reconstructing the input?
• Stanford University is located in ____, California.
• I put ____ fork down on the table.
• The woman walked across the street, checking for traffic over ____ shoulder.
• I went to the ocean to see the fish, turtles, seals, and ____.
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____.
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ____.
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____.

Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
• Model p_θ(w_t | w_{1:t-1}), the probability distribution over words given their past context.
• There's lots of data for this! (In English.)
Pretraining through language modeling:
• Train a neural network (a Transformer, an LSTM, ++) to perform language modeling on a large amount of text.
• Save the network parameters.
[Figure: a decoder reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]

The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things! (e.g., "Iroh goes to make tasty tea ...")
Step 2: Finetune (on your task). Not many labels; adapt to the task! (e.g., "... the movie was ...")

Stochastic gradient descent and pretrain/finetune
Why should pretraining and finetuning help, from a "training neural nets" perspective?
• Pretraining provides parameters θ̂ by approximating min_θ L_pretrain(θ). (The pretraining loss.)
• Then, finetuning approximates min_θ L_finetune(θ), starting at θ̂. (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to θ̂ during finetuning.
  • So, maybe the finetuning local minima near θ̂ tend to generalize well!
  • And/or, maybe the gradients of the finetuning loss near θ̂ propagate nicely!
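A minimal, runnable PyTorch sketch of the two steps above (ours, not the lecture's): `TinyLM`, the random token batches, and the file name "pretrained.pt" are all stand-ins, and a GRU takes the place of a real Transformer decoder just to keep the code short.

```python
import torch
import torch.nn as nn

# Hypothetical tiny language model; a GRU stands in for a Transformer decoder
# purely to keep the sketch short. None of these names come from the lecture.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.backbone = nn.GRU(d, d, batch_first=True)
        self.lm_head = nn.Linear(d, vocab_size)
    def forward(self, tokens):                       # returns hidden states h_1 .. h_T
        h, _ = self.backbone(self.embed(tokens))
        return h

loss_fn = nn.CrossEntropyLoss()
model = TinyLM()

# Step 1: pretrain on language modeling (predict token t from tokens < t).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 1000, (8, 32))             # stand-in for a batch of real text
logits = model.lm_head(model(tokens[:, :-1]))
loss = loss_fn(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss.backward(); opt.step()
torch.save(model.state_dict(), "pretrained.pt")       # save the pretrained parameters (theta-hat)

# Step 2: finetune on a labeled task, starting from theta-hat, with a new task head.
model.load_state_dict(torch.load("pretrained.pt"))
classifier = nn.Linear(64, 2)                          # e.g. positive / negative sentiment
opt = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()), lr=1e-5)
x, y = torch.randint(0, 1000, (8, 32)), torch.randint(0, 2, (8,))
loss = loss_fn(classifier(model(x)[:, -1]), y)         # classify from the last hidden state
loss.backward(); opt.step()                            # gradients flow through the whole network
```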
Where does this data come from?
[Figure: composition of the Pile by category - Academic, Internet, Prose, Dialogue, and Misc - with slices for sources such as USPTO, PubMed Abstracts, PhilPapers, and NIH.]
Model / Training data:
• BERT: BookCorpus, English Wikipedia
• GPT-1: BookCorpus
• GPT-3: CommonCrawl, WebText, English Wikipedia, and 2 book databases ("Books1" and "Books2")
• GPT-3.5+: Undisclosed

Bookcorpus... what's that?
[Screenshot: the Smashwords self-publishing storefront - "Words Published: 32.57 billion; Books Published: 858,759; Free Books: 101,947; Books on Sale: 11,693" - with genre and price filters and individual self-published titles for sale.]
Scraped ebooks from the internet - highly controversial.

Fair use and other concerns
• "Google swallows 11,000 novels to improve AI's conversation": as writers learn that the tech giant has processed their work without permission, the Authors Guild condemns "blatantly commercial use of expressive authorship."
• "Reexamining 'Fair Use' in the Age of AI" (Andrew Myers, Jun 5, 2023): Generative AI claims to produce new language and images, but when those ideas are based on copyrighted material, who gets the credit? A new paper from Stanford University looks for answers.

Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! How do we train them to build strong representations?
• Encoder-decoders: good parts of decoders and encoders? What's the best way to pretrain them?
• Decoders: language models! What we've seen so far. Nice to generate from; can't condition on future words.

Pretraining encoders: what pretraining objective to use?
So far, we've looked at language model pretraining. But encoders get bidirectional context, so we can't do language modeling!
Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.
h_1, ..., h_T = Encoder(w_1, ..., w_T)
y_t ~ A h_t + b
Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM.
[Figure: the encoder reads "I [M] to the [M]" and is trained to predict "went" and "store" at the masked positions.]
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT.
Some more details about Masked LM for BERT:
• Predict a random 15% of (sub)word tokens.
  • Replace input word with [MASK] 80% of the time
  • Replace input word with a random token 10% of the time
  • Leave input word unchanged 10% of the time (but still predict it!)
• Why? Doesn't let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)
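A rough sketch of this masking recipe, assuming the text is already converted to integer token ids and that `mask_id` is the id of [MASK] (103 below is just an example number); real BERT preprocessing also handles special tokens and padding, which are skipped here.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted_input, labels); labels of -100 mean 'no loss at this position'
    (a common convention, e.g. PyTorch's default ignore_index)."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:                    # pick ~15% of positions to predict
            labels[i] = tok                                # loss is computed only here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)   # 10%: replace with a random token
            # else: 10% leave the token unchanged (but still predict it!)
    return inputs, labels

corrupted, labels = mask_for_mlm([2023, 845, 300, 7, 999], vocab_size=30000, mask_id=103)
```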
[Figure: a Transformer encoder reads "I pizza to the [M]", where "went" was replaced with a random token ("pizza"), "to" and "the" were left unchanged, and "store" was masked; the model is trained to predict "went", "to", and "store" at those positions.]
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
The pretraining input to BERT was two separate contiguous chunks of text:
[Figure: the input "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]" is represented as the sum of token embeddings, segment embeddings (chunk A vs. chunk B), and position embeddings.]
• BERT was trained to predict whether one chunk follows the other or is randomly sampled.
• Later work has argued this "next sentence prediction" is not necessary.
[Devlin et al., 2018; Liu et al., 2019]

BERT: Bidirectional Encoder Representations from Transformers
Details about BERT:
• Two models were released:
  • BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
  • BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
• Trained on:
  • BooksCorpus (800 million words)
  • English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
  • BERT was pretrained with 64 TPU chips for a total of 4 days. (TPUs are special tensor operation acceleration hardware.)
• Finetuning is practical and common on a single GPU.
  • "Pretrain once, finetune many times."
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-the-art results on a broad range of tasks.
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: corpus of linguistic acceptability (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Microsoft paraphrase corpus
• RTE: a small natural language inference corpus

System | MNLI-(m/mm) (392k) | QQP (363k) | QNLI (108k) | SST-2 (67k) | CoLA (8.5k) | STS-B (5.7k) | MRPC (3.5k) | RTE (2.5k) | Average
Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0
BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0
OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1
BERT-base | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6
BERT-large | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1
[Devlin et al., 2018]

Limitations of pretrained encoders
Those results looked great! Why not use pretrained encoders for everything?
If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don't naturally lead to nice autoregressive (1-word-at-a-time) generation methods.
[Figure: a pretrained encoder can only fill in "Iroh goes to [MASK] tasty tea" with candidates like make/brew/craft, while a pretrained decoder generates "goes to make tasty tea END" one word at a time from "Iroh goes to make tasty tea".]

Extensions of BERT
You'll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task (sketched below).
[Figure: for "It's irresistibly good" tokenized into subwords, BERT masks scattered individual subwords (e.g., [MASK] irr## esi## sti## [MASK] good), while SpanBERT masks a contiguous span ([MASK] [MASK] [MASK] [MASK] good) and must predict the whole span irr## esi## sti## bly.]
[Liu et al., 2019; Joshi et al., 2020]
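For contrast with the per-token masking sketch above, here is a toy version of masking one contiguous span in the spirit of SpanBERT; the real method samples span lengths and adds a span-boundary objective, both of which are omitted here.

```python
import random

def mask_one_span(token_ids, mask_id, span_len=4):
    """Toy version of SpanBERT-style masking: hide one contiguous span of tokens."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100 = no loss, as before
    start = random.randrange(max(1, len(token_ids) - span_len + 1))
    for i in range(start, min(start + span_len, len(token_ids))):
        labels[i] = token_ids[i]     # every token in the span is predicted...
        inputs[i] = mask_id          # ...and every one of them is hidden from the encoder
    return inputs, labels
```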
Extensions of BERT
A takeaway from the RoBERTa paper: more compute, more data can improve pretraining even when not changing the underlying Transformer encoder.

Model | data | bsz | steps | SQuAD (v1.1/2.0) | MNLI-m | SST-2
RoBERTa with Books + Wiki | 16GB | 8K | 100K | 93.6/87.3 | 89.0 | 95.3
  + additional data (§3.2) | 160GB | 8K | 100K | 94.0/87.7 | 89.3 | 95.6
  + pretrain longer | 160GB | 8K | 300K | 94.4/88.7 | 90.0 | 96.1
  + pretrain even longer | 160GB | 8K | 500K | 94.6/89.4 | 90.2 | 96.4
BERT-large with Books + Wiki | 13GB | 256 | 1M | 90.9/81.8 | 86.6 | 93.7
[Liu et al., 2019; Joshi et al., 2020]

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! How do we train them to build strong representations?
• Encoder-decoders: good parts of decoders and encoders? What's the best way to pretrain them?
• Decoders: language models! What we've seen so far. Nice to generate from; can't condition on future words.

Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:
h_1, ..., h_T = Encoder(w_1, ..., w_T)
h_{T+1}, ..., h_{T+S} = Decoder(w_{T+1}, ..., w_{T+S}, h_1, ..., h_T)
y_i ~ A h_i + b, i > T
The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling, autoregressively predicting and then conditioning on one token at a time.
[Raffel et al., 2019]

Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
Replace different-length spans from the input with unique placeholders; decode out the spans that were removed (see the sketch below).
Original text: Thank you for inviting me to your party last week.
Inputs: Thank you <X> me to your party <Y> week.
Targets: <X> for inviting <Y> last
This is implemented in text preprocessing: it's still an objective that looks like language modeling at the decoder side.
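A rough sketch of this preprocessing step. The sentinel names <extra_id_0>, <extra_id_1>, ... follow T5's released vocabulary but are treated as plain strings here, the spans are chosen by hand rather than sampled, and the real objective also appends a final sentinel to the target.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel in the input, and emit
    '<sentinel> span tokens' pairs as the decoder target (T5-style denoising)."""
    inputs, targets, consumed = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs += tokens[consumed:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + length]
        consumed = start + length
    inputs += tokens[consumed:]
    return inputs, targets

words = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(words, spans=[(2, 2), (8, 1)])
print(" ".join(inp))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(tgt))   # <extra_id_0> for inviting <extra_id_1> last
```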
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

Architecture | Objective | Params | Cost | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo
Encoder-decoder | Denoising | 2P | M | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65
Enc-dec, shared | Denoising | P | M | 82.81 | 18.78 | 80.63 | 70.73 | 26.72 | 39.03 | 27.46
Enc-dec, 6 layers | Denoising | P | M/2 | 80.88 | 18.97 | 77.59 | 68.42 | 26.38 | 38.40 | 26.95
Language model | Denoising | P | M | 74.70 | 17.93 | 61.14 | 55.02 | 25.09 | 35.28 | 25.86
Prefix LM | Denoising | P | M | 81.82 | 18.61 | 78.94 | 68.11 | 26.43 | 37.98 | 27.39
Encoder-decoder | LM | 2P | M | 79.56 | 18.59 | 76.02 | 64.29 | 26.27 | 39.17 | 26.86
Enc-dec, shared | LM | P | M | 79.60 | 18.13 | 76.35 | 63.50 | 26.62 | 39.17 | 27.05
Enc-dec, 6 layers | LM | P | M/2 | 78.67 | 18.26 | 75.32 | 64.06 | 26.13 | 38.42 | 26.89
Language model | LM | P | M | 73.78 | 17.54 | 53.81 | 56.51 | 25.23 | 34.31 | 25.38
Prefix LM | LM | P | M | 79.68 | 17.84 | 76.87 | 64.86 | 26.28 | 37.51 | 26.76

Pretraining encoder-decoders: what pretraining objective to use?
A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.
NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA (all "open-domain" versions).

Model | NQ | WQ | TQA (dev) | TQA (test)
Karpukhin et al. (2020) | 41.5 | 42.4 | 57.9 | -
T5.1.1-Base (220 million params) | 25.7 | 28.2 | 24.2 | 30.6
T5.1.1-Large (770 million params) | 27.3 | 29.5 | 28.5 | 37.2
T5.1.1-XL (3 billion params) | 29.5 | 32.4 | 36.0 | 45.1
T5.1.1-XXL (11 billion params) | 32.8 | 35.6 | 42.9 | 52.5
T5.1.1-XXL + SSM | 35.2 | 42.8 | 51.9 | 61.6
[Raffel et al., 2019]

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! How do we train them to build strong representations?
• Encoder-decoders: good parts of decoders and encoders? What's the best way to pretrain them?
• Decoders: language models! What we've seen so far. Nice to generate from; can't condition on future words. All the biggest pretrained models are decoders.

Pretraining decoders
When using language model pretrained decoders, we can ignore that they were trained to model p(w_t | w_{1:t-1}).
We can finetune them by training a softmax classifier on the last word's hidden state:
h_1, ..., h_T = Decoder(w_1, ..., w_T)
y ~ A h_T + b
where A and b are randomly initialized and specified by the downstream task.
Gradients backpropagate through the whole network.
[Figure: a linear layer (A, b) sits on top of the decoder's last hidden state; note how the linear layer hasn't been pretrained and must be learned from scratch.]

Pretraining decoders
It's natural to pretrain decoders as language models and then use them as generators, finetuning their p_θ(w_t | w_{1:t-1}).
This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)
h_1, ..., h_T = Decoder(w_1, ..., w_T)
w_t ~ A h_{t-1} + b
where A, b were pretrained in the language model!
[Figure: the decoder predicts w_2 ... w_6 from w_1 ... w_5; note how the linear layer has been pretrained.]

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
2018's GPT was a big success in pretraining a decoder!
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges.
• Trained on BooksCorpus: over 7,000 unique books.
  • Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
How do we format inputs to our decoder for finetuning tasks?
Natural Language Inference: label pairs of sentences as entailing/contradictory/neutral.
Premise: The man is in the doorway. Hypothesis: The person is near the door. -> entailment
Radford et al., 2018 evaluate on natural language inference. Here's roughly how the input was formatted, as a sequence of tokens for the decoder:
[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
The linear classifier is applied to the representation of the [EXTRACT] token.

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
GPT results on various natural language inference datasets:
Method | MNLI-m | MNLI-mm | SNLI | SciTail | QNLI | RTE
ESIM + ELMo [44] (5x) | - | - | 89.3 | - | - | -
CAFE [58] (5x) | 80.2 | 79.0 | 89.3 | - | - | -
Stochastic Answer Network [35] (3x) | 80.6 | 80.1 | - | - | - | -
CAFE [58] | 78.7 | 77.9 | 88.5 | 83.3 | - | -
GenSen [64] | 71.4 | 71.3 | - | - | 82.3 | 59.2
Multi-task BiLSTM + Attn [64] | 72.2 | 72.1 | - | - | 82.1 | 61.7
Finetuned Transformer LM (ours) | 82.1 | 81.4 | 89.9 | 88.3 | 88.1 | 56.0
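A small sketch of this setup (ours, not GPT's actual code): the pretrained decoder is not instantiated, so random hidden states stand in for its outputs; only the input formatting and the freshly initialized linear head correspond to what the slide describes.

```python
import torch
import torch.nn as nn

def format_nli(premise, hypothesis):
    # Input format from the slide: [START] premise [DELIM] hypothesis [EXTRACT]
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

text = format_nli("The man is in the doorway", "The person is near the door")

# The pretrained decoder itself is not shown; random hidden states stand in for
# its outputs over the (say) 12 tokens of `text`. Only the linear head is new.
hidden_dim, num_labels = 768, 3                  # entailment / contradiction / neutral
classifier = nn.Linear(hidden_dim, num_labels)   # the randomly initialized A, b
h = torch.randn(1, 12, hidden_dim)               # stand-in for Decoder(w_1, ..., w_T)
logits = classifier(h[:, -1])                    # classify from the [EXTRACT] (last) position
```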
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language models. GPT-2, a larger version (1.5B parameters) of GPT trained on more data, was shown to produce relatively convincing samples of natural language.

Context (human-written): In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
GPT-2: The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Perez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Perez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

GPT-2 language model output (2019)
Prompt (human-written): A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
Model completion: The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief. "The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation," said Tom Hicks, the U.S. Energy Secretary, in a statement. "Our top priority is to secure the theft and ensure it doesn't happen again." The stolen material was taken from the University of Cincinnati's Research Triangle Park nuclear research site, according to a news release from Department officials.

GPT-3, in-context learning, and very large models
So far, we've interacted with pretrained models in two ways:
• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about and take their predictions.
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters. GPT-3 has 175 billion parameters.
ChatGPT/GPT-4/GPT-3.5 Turbo introduced a further instruction-tuning idea that we cover next lecture.

GPT-3, in-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts. The in-context examples seem to specify the task to be performed, and the conditional distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
thanks -> merci
hello -> bonjour
mint -> menthe
otter ->
Output (conditional generations): loutre ...

GPT-3, in-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
[Figure from the GPT-3 paper: the outer loop is learning via SGD during unsupervised pretraining, while each training sequence provides "in-context learning" in its inner loop - e.g., one sequence of arithmetic (5 + 8 = 13, 3 + 4 = 7, 5 + 9 = 14, 9 + 8 = 17, ...), one of spelling corrections (gaot => goat, sakne => snake, brid => bird, fsih => fish, dcuk => duck, ...), and one of translations (thanks => merci, hello => bonjour, mint => menthe, wall => mur, otter => loutre, bread => pain).]
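A minimal illustration of building such a few-shot prompt; nothing here calls a real model, and the claim about the continuation is simply the slide's example.

```python
def few_shot_prompt(examples, query):
    """Turn (input, output) demonstrations plus a new input into one prompt string."""
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")],
    query="otter",
)
print(prompt)
# A large pretrained LM conditioned on this prompt will typically continue with
# " loutre"; no gradient updates are involved, the demonstrations alone specify the task.
```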
Why scale? Scaling laws
• Empirical observation: scaling up models leads to reliable gains in perplexity.
• Scaling can help identify model size / data tradeoffs.
• Modern observation: train a big model that's not fully converged.
• Scaling laws exist for many other interesting architecture decisions.
• Predictable scaling helps us make intelligent decisions about architectures, etc.
[Figure: test loss vs. number of non-embedding parameters (10^3 to 10^9) for models with 1, 2, 3, 6, and >6 layers, and test loss vs. training data set size in characters (log scale); loss falls smoothly and predictably with scale.]

Scaling efficiency: how do we best use our compute?
GPT-3 was 175B parameters and trained on 300B tokens of text.
Roughly, the cost of training a large transformer scales as parameters * tokens.
Did OpenAI strike the right parameter-token balance to get the best model? No.

Model | Size (# parameters) | Training tokens
LaMDA (Thoppilan et al., 2022) | 137 billion | 168 billion
GPT-3 (Brown et al., 2020) | 175 billion | 300 billion
Jurassic (Lieber et al., 2021) | 178 billion | 300 billion
Gopher (Rae et al., 2021) | 280 billion | 300 billion
MT-NLG 530B (Smith et al., 2022) | 530 billion | 270 billion
Chinchilla | 70 billion | 1.4 trillion
This 70B parameter model is better than the much larger other models!

Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

What kinds of things does pretraining teach?
There's increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language. Taking our examples from the start of class:
• Stanford University is located in ____, California. [Trivia]
• I put ____ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ____ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and ____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ____. [some reasoning - this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____. [some basic arithmetic; they don't learn the Fibonacci sequence]
• Models also learn - and can exacerbate - racism, sexism, and all manner of bad biases.

Sometimes it also memorizes copyrighted material
[Headlines: "AI Art Generators Spark Multiple Copyright Lawsuits" (Winston Cho, January 17, 2023) - Getty and a trio of artists sued AI art generators in separate suits accusing the companies of copyright infringement for pilfering their works; "Anthropic fires back at music publishers' AI copyright lawsuit"; "Insights from the Pending Copilot Class Action Lawsuit" (Bloomberg Law, October 4, 2023); "The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work" - millions of articles from The New York Times were used to train chatbots that now compete with it, the lawsuit said.]

Sometimes it learns some things we don't want...
Membership inference lets you recover parts of the training data.
[Figure: a membership inference attack - a (data record, class label) pair is fed to the target model, the target model's prediction is fed to an attack model, and the attack model predicts whether the record was in the training set.]
• Sometimes this training data is semi-private material from the web (addresses, emails).
• It learns the prejudices and biases of human beings who write online.
[Example of memorized text: given the prefix "East Stroudsburg Stroudsburg ...", a model continues with memorized contact details ("... Corporation Seabank Centre Marine Parade Southport ...").]

Three types of architectures for pretraining
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! Good if only doing analysis of text (better than decoders).
• Encoder-decoders: good parts of decoders and encoders? Some evidence they are better for NLU. [Tay et al. 2022, UL2]
• Decoders: language models! What we've seen so far. Scale well. Best to generate from; they have won out as what people build.
[Figure from UL2: 1-shot GEM (XSum, SGD, TOT) average Rouge-L compared across UniLM (EncDec), PrefixLM (EncDec), SpanCorrupt (Dec), GPT-like (Dec), and PrefixLM (Dec) models.]