Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 9: Pretraining
Adapted from slides by Anna Goldie, John Hewitt, Tatsunori Hashimoto

The pretraining revolution
[Figure: benchmark performance relative to human level over time for MNIST, Switchboard, ImageNet, SQuAD 1.1, SQuAD 2.0, GLUE, and SuperGLUE; on the SQuAD 2.0 leaderboard between Jan '19 and Jan '20, systems such as the BERT finetune baseline (ensemble), BERT + ConvLSTM + MTL + Verifier (ensemble), SAN (ensemble), ALBERT + DAAF + Verifier (ensemble), nlnet, and FPNet overtake the human baseline; the gains come from pretrained language models.]
Pretraining has had a major, tangible impact on how well NLP systems work.

Word structure and subword models
Let's take a look at the assumptions we've made about a language's vocabulary.
We assume a fixed vocab of tens of thousands of words, built from the training set. All novel words seen at test time are mapped to a single UNK.
[Figure: word -> vocab mapping -> embedding. Common words (hat, learn) each map to their own vocab index and embedding; variations (taaaaasty), misspellings (laern), and novel items (Transformerify) all map to UNK. A tweet ("Goooooood Vibesssssss") shows how such variations arise in real text.]

Word structure and subword models
Finite vocabulary assumptions make even less sense in many languages other than English.
• Many languages exhibit complex morphology, or word structure.
• The effect is many more word types, each occurring fewer times.
Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information. (Tense, mood, definiteness, negation, information about the object, ++)
Here's a small fraction of the conjugations for ambia (to tell).
[Table from Wiktionary: the conjugation of -ambia, with positive and negative forms of the infinitive, imperative, habitual, past, present, future, subjunctive, present conditional, past conditional, conditional contrary to fact, gnomic, and perfect paradigms, across persons and noun classes.]
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level. (Parts of words, characters, bytes.)
• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.
Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary:
1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
2. Using a corpus of text, find the most common adjacent characters "a, b"; add "ab" as a subword.
3. Replace instances of the character pair with the new subword; repeat until the desired vocab size is reached.
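A minimal Python sketch of this merge-learning loop (ours, not from the lecture): it assumes the corpus has already been split into words and is given as word counts; real implementations such as subword-nmt or SentencePiece add many efficiency and normalization details.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary.
    Each word is represented as a tuple of symbols plus an end-of-word marker."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy dictionary in the spirit of the worked example that follows
# (the count of 5 for "low" is assumed; the other counts appear in the slide).
print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4))
# e.g. [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
```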
Originally used in NLP for machine translation; now similar methods (WordPiece, SentencePiece) are used in pretrained models like BERT and GPT.

Byte Pair Encoding (BPE) [Sennrich et al., 2016]
Dictionary (word counts): 5 l o w, 2 l o w e r, 6 n e w e s t, 3 w i d e s t
Vocabulary: l, o, w, e, r, n, s, t, i, d
Merge "e s" -> es: 6 n e w es t, 3 w i d es t; vocabulary: l, o, w, e, r, n, s, t, i, d, es
Merge "es t" -> est: 6 n e w est, 3 w i d est; vocabulary: l, o, w, e, r, n, s, t, i, d, es, est

Word structure and subword models
Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components. In the worst case, words are split into as many subwords as they have characters.
word -> vocab mapping -> embedding:
• Common words: hat -> hat; learn -> learn
• Variations: taaaaasty -> taa## aaa## sty
• Misspellings: laern -> la## ern##
• Novel items: Transformerify -> Transformer## ify
Each subword in the mapping gets its own embedding.

Words in writing systems
Writing systems vary in how they represent words - or don't
• No word segmentation (e.g., written Chinese)
• Words (mainly) segmented: This is a sentence with words.
• Clitics/pronouns/agreement?
  • Separated: Je vous ai apporté des bonbons
  • Joined: e.g., Arabic, where a single written word can decompose as so+said+we+it
• Compounds?
  • Separated: life insurance company employee
  • Joined: Lebensversicherungsgesellschaftsangestellter

Below the word in writing systems
Human language writing systems aren't one thing!
• Phonemic (maybe digraphs): jiyawu ngabulu (Wambaya)
• Fossilized phonemic: thorough failure (English)
• Syllabic/moraic: Inuktitut
• Ideographic (syllabic): Chinese
• Combination of the above: Japanese

Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

Motivating word meaning and context
Recall the adage we mentioned at the beginning of the course:
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
This quote is a summary of distributional semantics, and motivated word2vec. But:
"... the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (J. R. Firth 1935)
Consider "I record the record": the two instances of record mean different things.
[Thanks to Yoav Goldberg on Twitter for pointing out the 1935 quote.]

Where we were: pretrained word embeddings
Circa 2015:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.
Some issues to think about:
• The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
• Most of the parameters in our network are randomly initialized!
[Figure: a network over "the movie was ..." in which only the word embeddings are pretrained; everything above them is not pretrained.]
[Recall: movie gets the same word embedding, no matter what sentence it shows up in.]

Where we're going: pretraining whole models
In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been exceptionally effective at building strong:
  • representations of language
  • parameter initializations for strong NLP models
  • probability distributions over language that we can sample from
[Figure: the same network over "the movie was ...", now pretrained jointly from the embeddings up.]
[This model has learned how to represent entire sentences through pretraining.]

What can we learn from reconstructing the input?
• Stanford University is located in ____, California.
• I put ____ fork down on the table.
• The woman walked across the street, checking for traffic over ____ shoulder.
• I went to the ocean to see the fish, turtles, seals, and ____.
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____.
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ____.
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____.

Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
• Model p_θ(w_t | w_{1:t-1}), the probability distribution over words given their past context.
• There's lots of data for this! (In English.)
Pretraining through language modeling:
• Train a neural network (a Transformer, an LSTM, ++) to perform language modeling on a large amount of text.
• Save the network parameters.
[Figure: a decoder reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]

The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things! (e.g., "Iroh goes to make tasty tea ...")
Step 2: Finetune (on your task). Not many labels; adapt to the task! (e.g., "... the movie was ...")

Stochastic gradient descent and pretrain/finetune
Why should pretraining and finetuning help, from a "training neural nets" perspective?
• Pretraining provides parameters θ̂ by approximating min_θ L_pretrain(θ). (The pretraining loss.)
• Then, finetuning approximates min_θ L_finetune(θ), starting at θ̂. (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to θ̂ during finetuning.
  • So, maybe the finetuning local minima near θ̂ tend to generalize well!
  • And/or, maybe the gradients of the finetuning loss near θ̂ propagate nicely!
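A minimal, runnable PyTorch sketch of the two steps above (ours, not the lecture's): `TinyLM`, the random token batches, and the file name "pretrained.pt" are all stand-ins, and a GRU takes the place of a real Transformer decoder just to keep the code short.

```python
import torch
import torch.nn as nn

# Hypothetical tiny language model; a GRU stands in for a Transformer decoder
# purely to keep the sketch short. None of these names come from the lecture.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.backbone = nn.GRU(d, d, batch_first=True)
        self.lm_head = nn.Linear(d, vocab_size)
    def forward(self, tokens):                       # returns hidden states h_1 .. h_T
        h, _ = self.backbone(self.embed(tokens))
        return h

loss_fn = nn.CrossEntropyLoss()
model = TinyLM()

# Step 1: pretrain on language modeling (predict token t from tokens < t).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 1000, (8, 32))             # stand-in for a batch of real text
logits = model.lm_head(model(tokens[:, :-1]))
loss = loss_fn(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
loss.backward(); opt.step()
torch.save(model.state_dict(), "pretrained.pt")       # save the pretrained parameters (theta-hat)

# Step 2: finetune on a labeled task, starting from theta-hat, with a new task head.
model.load_state_dict(torch.load("pretrained.pt"))
classifier = nn.Linear(64, 2)                          # e.g. positive / negative sentiment
opt = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()), lr=1e-5)
x, y = torch.randint(0, 1000, (8, 32)), torch.randint(0, 2, (8,))
loss = loss_fn(classifier(model(x)[:, -1]), y)         # classify from the last hidden state
loss.backward(); opt.step()                            # gradients flow through the whole network
```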
Where does this data come from?
[Figure: composition of the Pile by category - Academic, Internet, Prose, Dialogue, and Misc - with slices for sources such as USPTO, PubMed Abstracts, PhilPapers, and NIH.]
Model / Training data:
• BERT: BookCorpus, English Wikipedia
• GPT-1: BookCorpus
• GPT-3: CommonCrawl, WebText, English Wikipedia, and 2 book databases ("Books1" and "Books2")
• GPT-3.5+: Undisclosed

Bookcorpus... what's that?
[Screenshot: the Smashwords self-publishing storefront - "Words Published: 32.57 billion; Books Published: 858,759; Free Books: 101,947; Books on Sale: 11,693" - with genre and price filters and individual self-published titles for sale.]
Scraped ebooks from the internet - highly controversial.

Fair use and other concerns
• "Google swallows 11,000 novels to improve AI's conversation": as writers learn that the tech giant has processed their work without permission, the Authors Guild condemns "blatantly commercial use of expressive authorship."
• "Reexamining 'Fair Use' in the Age of AI" (Andrew Myers, Jun 5, 2023): Generative AI claims to produce new language and images, but when those ideas are based on copyrighted material, who gets the credit? A new paper from Stanford University looks for answers.

Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! How do we train them to build strong representations?
• Encoder-decoders: good parts of decoders and encoders? What's the best way to pretrain them?
• Decoders: language models! What we've seen so far. Nice to generate from; can't condition on future words.

Pretraining encoders: what pretraining objective to use?
So far, we've looked at language model pretraining. But encoders get bidirectional context, so we can't do language modeling!
Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.
h_1, ..., h_T = Encoder(w_1, ..., w_T)
y_t ~ A h_t + b
Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM.
[Figure: the encoder reads "I [M] to the [M]" and is trained to predict "went" and "store" at the masked positions.]
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT.
Some more details about Masked LM for BERT:
• Predict a random 15% of (sub)word tokens.
  • Replace input word with [MASK] 80% of the time
  • Replace input word with a random token 10% of the time
  • Leave input word unchanged 10% of the time (but still predict it!)
• Why? Doesn't let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)
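A rough sketch of this masking recipe, assuming the text is already converted to integer token ids and that `mask_id` is the id of [MASK] (103 below is just an example number); real BERT preprocessing also handles special tokens and padding, which are skipped here.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (corrupted_input, labels); labels of -100 mean 'no loss at this position'
    (a common convention, e.g. PyTorch's default ignore_index)."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:                    # pick ~15% of positions to predict
            labels[i] = tok                                # loss is computed only here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)   # 10%: replace with a random token
            # else: 10% leave the token unchanged (but still predict it!)
    return inputs, labels

corrupted, labels = mask_for_mlm([2023, 845, 300, 7, 999], vocab_size=30000, mask_id=103)
```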
[Figure: a Transformer encoder reads "I pizza to the [M]", where "went" was replaced with a random token ("pizza"), "to" and "the" were left unchanged, and "store" was masked; the model is trained to predict "went", "to", and "store" at those positions.]
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
The pretraining input to BERT was two separate contiguous chunks of text:
[Figure: the input "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]" is represented as the sum of token embeddings, segment embeddings (chunk A vs. chunk B), and position embeddings.]
• BERT was trained to predict whether one chunk follows the other or is randomly sampled.
• Later work has argued this "next sentence prediction" is not necessary.
[Devlin et al., 2018; Liu et al., 2019]

BERT: Bidirectional Encoder Representations from Transformers
Details about BERT:
• Two models were released:
  • BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
  • BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
• Trained on:
  • BooksCorpus (800 million words)
  • English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
  • BERT was pretrained with 64 TPU chips for a total of 4 days. (TPUs are special tensor operation acceleration hardware.)
• Finetuning is practical and common on a single GPU.
  • "Pretrain once, finetune many times."
[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers
BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-the-art results on a broad range of tasks.
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: corpus of linguistic acceptability (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Microsoft paraphrase corpus
• RTE: a small natural language inference corpus

System | MNLI-(m/mm) (392k) | QQP (363k) | QNLI (108k) | SST-2 (67k) | CoLA (8.5k) | STS-B (5.7k) | MRPC (3.5k) | RTE (2.5k) | Average
Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0
BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0
OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1
BERT-base | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6
BERT-large | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1
[Devlin et al., 2018]

Limitations of pretrained encoders
Those results looked great! Why not use pretrained encoders for everything?
If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don't naturally lead to nice autoregressive (1-word-at-a-time) generation methods.
[Figure: a pretrained encoder can only fill in "Iroh goes to [MASK] tasty tea" with candidates like make/brew/craft, while a pretrained decoder generates "goes to make tasty tea END" one word at a time from "Iroh goes to make tasty tea".]

Extensions of BERT
You'll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task (sketched below).
[Figure: for "It's irresistibly good" tokenized into subwords, BERT masks scattered individual subwords (e.g., [MASK] irr## esi## sti## [MASK] good), while SpanBERT masks a contiguous span ([MASK] [MASK] [MASK] [MASK] good) and must predict the whole span irr## esi## sti## bly.]
[Liu et al., 2019; Joshi et al., 2020]
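For contrast with the per-token masking sketch above, here is a toy version of masking one contiguous span in the spirit of SpanBERT; the real method samples span lengths and adds a span-boundary objective, both of which are omitted here.

```python
import random

def mask_one_span(token_ids, mask_id, span_len=4):
    """Toy version of SpanBERT-style masking: hide one contiguous span of tokens."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100 = no loss, as before
    start = random.randrange(max(1, len(token_ids) - span_len + 1))
    for i in range(start, min(start + span_len, len(token_ids))):
        labels[i] = token_ids[i]     # every token in the span is predicted...
        inputs[i] = mask_id          # ...and every one of them is hidden from the encoder
    return inputs, labels
```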
Extensions of BERT
A takeaway from the RoBERTa paper: more compute, more data can improve pretraining even when not changing the underlying Transformer encoder.

Model | data | bsz | steps | SQuAD (v1.1/2.0) | MNLI-m | SST-2
RoBERTa with Books + Wiki | 16GB | 8K | 100K | 93.6/87.3 | 89.0 | 95.3
  + additional data (§3.2) | 160GB | 8K | 100K | 94.0/87.7 | 89.3 | 95.6
  + pretrain longer | 160GB | 8K | 300K | 94.4/88.7 | 90.0 | 96.1
  + pretrain even longer | 160GB | 8K | 500K | 94.6/89.4 | 90.2 | 96.4
BERT-large with Books + Wiki | 13GB | 256 | 1M | 90.9/81.8 | 86.6 | 93.7
[Liu et al., 2019; Joshi et al., 2020]

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! How do we train them to build strong representations?
• Encoder-decoders: good parts of decoders and encoders? What's the best way to pretrain them?
• Decoders: language models! What we've seen so far. Nice to generate from; can't condition on future words.

Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:
h_1, ..., h_T = Encoder(w_1, ..., w_T)
h_{T+1}, ..., h_{T+S} = Decoder(w_{T+1}, ..., w_{T+S}, h_1, ..., h_T)
y_i ~ A h_i + b, i > T
The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling, autoregressively predicting and then conditioning on one token at a time.
[Raffel et al., 2019]

Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.
Replace different-length spans from the input with unique placeholders; decode out the spans that were removed (see the sketch below).
Original text: Thank you for inviting me to your party last week.
Inputs: Thank you <X> me to your party <Y> week.
Targets: <X> for inviting <Y> last
This is implemented in text preprocessing: it's still an objective that looks like language modeling at the decoder side.
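A rough sketch of this preprocessing step. The sentinel names <extra_id_0>, <extra_id_1>, ... follow T5's released vocabulary but are treated as plain strings here, the spans are chosen by hand rather than sampled, and the real objective also appends a final sentinel to the target.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel in the input, and emit
    '<sentinel> span tokens' pairs as the decoder target (T5-style denoising)."""
    inputs, targets, consumed = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs += tokens[consumed:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + length]
        consumed = start + length
    inputs += tokens[consumed:]
    return inputs, targets

words = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(words, spans=[(2, 2), (8, 1)])
print(" ".join(inp))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(tgt))   # <extra_id_0> for inviting <extra_id_1> last
```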
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

Architecture | Objective | Params | Cost | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo
Encoder-decoder | Denoising | 2P | M | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65
Enc-dec, shared | Denoising | P | M | 82.81 | 18.78 | 80.63 | 70.73 | 26.72 | 39.03 | 27.46
Enc-dec, 6 layers | Denoising | P | M/2 | 80.88 | 18.97 | 77.59 | 68.42 | 26.38 | 38.40 | 26.95
Language model | Denoising | P | M | 74.70 | 17.93 | 61.14 | 55.02 | 25.09 | 35.28 | 25.86
Prefix LM | Denoising | P | M | 81.82 | 18.61 | 78.94 | 68.11 | 26.43 | 37.98 | 27.39
Encoder-decoder | LM | 2P | M | 79.56 | 18.59 | 76.02 | 64.29 | 26.27 | 39.17 | 26.86
Enc-dec, shared | LM | P | M | 79.60 | 18.13 | 76.35 | 63.50 | 26.62 | 39.17 | 27.05
Enc-dec, 6 layers | LM | P | M/2 | 78.67 | 18.26 | 75.32 | 64.06 | 26.13 | 38.42 | 26.89
Language model | LM | P | M | 73.78 | 17.54 | 53.81 | 56.51 | 25.23 | 34.31 | 25.38
Prefix LM | LM | P | M | 79.68 | 17.84 | 76.87 | 64.86 | 26.28 | 37.51 | 26.76

Pretraining encoder-decoders: what pretraining objective to use?
A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.
NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA (all "open-domain" versions).

Model | NQ | WQ | TQA (dev) | TQA (test)
Karpukhin et al. (2020) | 41.5 | 42.4 | 57.9 | -
T5.1.1-Base (220 million params) | 25.7 | 28.2 | 24.2 | 30.6
T5.1.1-Large (770 million params) | 27.3 | 29.5 | 28.5 | 37.2
T5.1.1-XL (3 billion params) | 29.5 | 32.4 | 36.0 | 45.1
T5.1.1-XXL (11 billion params) | 32.8 | 35.6 | 42.9 | 52.5
T5.1.1-XXL + SSM | 35.2 | 42.8 | 51.9 | 61.6
[Raffel et al., 2019]

Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! How do we train them to build strong representations?
• Encoder-decoders: good parts of decoders and encoders? What's the best way to pretrain them?
• Decoders: language models! What we've seen so far. Nice to generate from; can't condition on future words. All the biggest pretrained models are decoders.

Pretraining decoders
When using language model pretrained decoders, we can ignore that they were trained to model p(w_t | w_{1:t-1}).
We can finetune them by training a softmax classifier on the last word's hidden state:
h_1, ..., h_T = Decoder(w_1, ..., w_T)
y ~ A h_T + b
where A and b are randomly initialized and specified by the downstream task.
Gradients backpropagate through the whole network.
[Figure: a linear layer (A, b) sits on top of the decoder's last hidden state; note how the linear layer hasn't been pretrained and must be learned from scratch.]

Pretraining decoders
It's natural to pretrain decoders as language models and then use them as generators, finetuning their p_θ(w_t | w_{1:t-1}).
This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)
h_1, ..., h_T = Decoder(w_1, ..., w_T)
w_t ~ A h_{t-1} + b
where A, b were pretrained in the language model!
[Figure: the decoder predicts w_2 ... w_6 from w_1 ... w_5; note how the linear layer has been pretrained.]

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
2018's GPT was a big success in pretraining a decoder!
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges.
• Trained on BooksCorpus: over 7,000 unique books.
  • Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
How do we format inputs to our decoder for finetuning tasks?
Natural Language Inference: label pairs of sentences as entailing/contradictory/neutral.
Premise: The man is in the doorway. Hypothesis: The person is near the door. -> entailment
Radford et al., 2018 evaluate on natural language inference. Here's roughly how the input was formatted, as a sequence of tokens for the decoder:
[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]
The linear classifier is applied to the representation of the [EXTRACT] token.

Generative Pretrained Transformer (GPT) [Radford et al., 2018]
GPT results on various natural language inference datasets:
Method | MNLI-m | MNLI-mm | SNLI | SciTail | QNLI | RTE
ESIM + ELMo [44] (5x) | - | - | 89.3 | - | - | -
CAFE [58] (5x) | 80.2 | 79.0 | 89.3 | - | - | -
Stochastic Answer Network [35] (3x) | 80.6 | 80.1 | - | - | - | -
CAFE [58] | 78.7 | 77.9 | 88.5 | 83.3 | - | -
GenSen [64] | 71.4 | 71.3 | - | - | 82.3 | 59.2
Multi-task BiLSTM + Attn [64] | 72.2 | 72.1 | - | - | 82.1 | 61.7
Finetuned Transformer LM (ours) | 82.1 | 81.4 | 89.9 | 88.3 | 88.1 | 56.0
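A small sketch of this setup (ours, not GPT's actual code): the pretrained decoder is not instantiated, so random hidden states stand in for its outputs; only the input formatting and the freshly initialized linear head correspond to what the slide describes.

```python
import torch
import torch.nn as nn

def format_nli(premise, hypothesis):
    # Input format from the slide: [START] premise [DELIM] hypothesis [EXTRACT]
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

text = format_nli("The man is in the doorway", "The person is near the door")

# The pretrained decoder itself is not shown; random hidden states stand in for
# its outputs over the (say) 12 tokens of `text`. Only the linear head is new.
hidden_dim, num_labels = 768, 3                  # entailment / contradiction / neutral
classifier = nn.Linear(hidden_dim, num_labels)   # the randomly initialized A, b
h = torch.randn(1, 12, hidden_dim)               # stand-in for Decoder(w_1, ..., w_T)
logits = classifier(h[:, -1])                    # classify from the [EXTRACT] (last) position
```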
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language models. GPT-2, a larger version (1.5B parameters) of GPT trained on more data, was shown to produce relatively convincing samples of natural language.

Context (human-written): In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
GPT-2: The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Perez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Perez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

GPT-2 language model output (2019)
Prompt (human-written): A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
Model completion: The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief. "The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation," said Tom Hicks, the U.S. Energy Secretary, in a statement. "Our top priority is to secure the theft and ensure it doesn't happen again." The stolen material was taken from the University of Cincinnati's Research Triangle Park nuclear research site, according to a news release from Department officials.

GPT-3, in-context learning, and very large models
So far, we've interacted with pretrained models in two ways:
• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about and take their predictions.
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters. GPT-3 has 175 billion parameters.
ChatGPT/GPT-4/GPT-3.5 Turbo introduced a further instruction-tuning idea that we cover next lecture.

GPT-3, in-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts. The in-context examples seem to specify the task to be performed, and the conditional distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
thanks -> merci
hello -> bonjour
mint -> menthe
otter ->
Output (conditional generations): loutre ...

GPT-3, in-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient steps simply from examples you provide within their contexts.
[Figure from the GPT-3 paper: the outer loop is learning via SGD during unsupervised pretraining, while each training sequence provides "in-context learning" in its inner loop - e.g., one sequence of arithmetic (5 + 8 = 13, 3 + 4 = 7, 5 + 9 = 14, 9 + 8 = 17, ...), one of spelling corrections (gaot => goat, sakne => snake, brid => bird, fsih => fish, dcuk => duck, ...), and one of translations (thanks => merci, hello => bonjour, mint => menthe, wall => mur, otter => loutre, bread => pain).]
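A minimal illustration of building such a few-shot prompt; nothing here calls a real model, and the claim about the continuation is simply the slide's example.

```python
def few_shot_prompt(examples, query):
    """Turn (input, output) demonstrations plus a new input into one prompt string."""
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("thanks", "merci"), ("hello", "bonjour"), ("mint", "menthe")],
    query="otter",
)
print(prompt)
# A large pretrained LM conditioned on this prompt will typically continue with
# " loutre"; no gradient updates are involved, the demonstrations alone specify the task.
```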
Why scale? Scaling laws
• Empirical observation: scaling up models leads to reliable gains in perplexity.
• Scaling can help identify model size / data tradeoffs.
• Modern observation: train a big model that's not fully converged.
• Scaling laws exist for many other interesting architecture decisions.
• Predictable scaling helps us make intelligent decisions about architectures, etc.
[Figure: test loss vs. number of non-embedding parameters (10^3 to 10^9) for models with 1, 2, 3, 6, and >6 layers, and test loss vs. training data set size in characters (log scale); loss falls smoothly and predictably with scale.]

Scaling efficiency: how do we best use our compute?
GPT-3 was 175B parameters and trained on 300B tokens of text.
Roughly, the cost of training a large transformer scales as parameters * tokens.
Did OpenAI strike the right parameter-token balance to get the best model? No.

Model | Size (# parameters) | Training tokens
LaMDA (Thoppilan et al., 2022) | 137 billion | 168 billion
GPT-3 (Brown et al., 2020) | 175 billion | 300 billion
Jurassic (Lieber et al., 2021) | 178 billion | 300 billion
Gopher (Rae et al., 2021) | 280 billion | 300 billion
MT-NLG 530B (Smith et al., 2022) | 530 billion | 270 billion
Chinchilla | 70 billion | 1.4 trillion
This 70B parameter model is better than the much larger other models!

Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Encoders
   2. Encoder-Decoders
   3. Decoders
4. What do we think pretraining is teaching?

What kinds of things does pretraining teach?
There's increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language. Taking our examples from the start of class:
• Stanford University is located in ____, California. [Trivia]
• I put ____ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ____ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and ____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ____. [some reasoning - this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____. [some basic arithmetic; they don't learn the Fibonacci sequence]
• Models also learn - and can exacerbate - racism, sexism, and all manner of bad biases.

Sometimes it also memorizes copyrighted material
[Headlines: "AI Art Generators Spark Multiple Copyright Lawsuits" (Winston Cho, January 17, 2023) - Getty and a trio of artists sued AI art generators in separate suits accusing the companies of copyright infringement for pilfering their works; "Anthropic fires back at music publishers' AI copyright lawsuit"; "Insights from the Pending Copilot Class Action Lawsuit" (Bloomberg Law, October 4, 2023); "The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work" - millions of articles from The New York Times were used to train chatbots that now compete with it, the lawsuit said.]

Sometimes it learns some things we don't want...
Membership inference lets you recover parts of the training data.
[Figure: a membership inference attack - a (data record, class label) pair is fed to the target model, the target model's prediction is fed to an attack model, and the attack model predicts whether the record was in the training set.]
• Sometimes this training data is semi-private material from the web (addresses, emails).
• It learns the prejudices and biases of human beings who write online.
[Example of memorized text: given the prefix "East Stroudsburg Stroudsburg ...", a model continues with memorized contact details ("... Corporation Seabank Centre Marine Parade Southport ...").]

Three types of architectures for pretraining
The neural architecture influences the type of pretraining, and natural use cases.
• Encoders: get bidirectional context - can condition on future! Good if only doing analysis of text (better than decoders).
• Encoder-decoders: good parts of decoders and encoders? Some evidence they are better for NLU. [Tay et al. 2022, UL2]
• Decoders: language models! What we've seen so far. Scale well. Best to generate from; they have won out as what people build.
[Figure from UL2: 1-shot GEM (XSum, SGD, TOT) average Rouge-L compared across UniLM (EncDec), PrefixLM (EncDec), SpanCorrupt (Dec), GPT-like (Dec), and PrefixLM (Dec) models.]