Transformers
PA154 Language Modeling (10.1)
Pavel Rychlý
pary@fi.muni.cz
April 20, 2023

Multi-layer encoder/decoder
■ Encoder: input sequence → state
■ Decoder: state + sentence delimiter → output
■ Problem: fixed-size state
[Figure: recurrent encoder-decoder, n× recurrent layers over the source embeddings (encoder) and over the target embeddings (decoder)]

Attention
■ each decoder layer has access to all hidden states from the last encoder layer
■ use attention to extract the important parts (a vector)
[Figure: a stack of six encoders feeding a stack of six decoders, translating the input "Je suis étudiant" into the output "I am a student"]

Attention
■ use attention to extract the important parts (a vector)
■ important = similar to "me"
[Figure: attention combining the encoder outputs with the decoder hidden state]

Self-Attention
■ instead of sequential processing
■ attention to previous (and following) tokens
■ fully parallel processing during training
(a minimal attention sketch is appended at the end of these notes)

Transformers
■ Attention is All You Need
■ self-attention in both encoder and decoder
■ masked self-attention and cross-attention (to the encoder) in the decoder
■ http://jalammar.github.io/illustrated-transformer/

Transformer variants
■ using context to compute token/sentence/document embeddings
■ BERT = Bidirectional Encoder Representations from Transformers
■ GPT = Generative Pre-trained Transformer
■ many variants: tokenization, attention, encoder/decoder connections
[Figure: architecture comparison of BERT, OpenAI GPT, and ELMo]

BERT
■ Google
■ encoder only
■ pre-training on raw text
■ masked tokens, is-next-sentence prediction
■ big pre-trained models available
■ domain (task) adaptation
Input: The man went to the [MASK]1 . He bought a [MASK]2 of milk .
Labels: [MASK]1 = store; [MASK]2 = gallon
Sentence A = The man went to the store. Sentence B = He bought a gallon of milk. Label = IsNextSentence
Sentence A = The man went to the store. Sentence B = Penguins are flightless. Label = NotNextSentence

Using pre-trained models
■ (BERT) trained on a huge amount of data
■ fine-tuned on task-specific data
■ or: use the output of BERT as input to a task-specific model (without modifying BERT)
(a sketch of both options is appended at the end of these notes)

GPT
■ OpenAI
■ decoder only
■ pre-training on raw text
■ trained to predict the next token
[Figure: GPT architecture (12× transformer decoder blocks: masked multi-head self-attention, layer norm, feed-forward, layer norm, over text and position embeddings) and input transformations for classification, entailment, similarity, and multiple-choice tasks]
(a next-token prediction sketch is appended at the end of these notes)
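
Appendix: attention sketch
A minimal sketch of the (self-)attention operation from the Self-Attention slide: every position attends to all positions, and positions whose keys are similar to the query ("important = similar to me") receive higher weights. The function names, the single-head simplification, and the use of NumPy are my own illustration; they are not taken from the slides.

import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # scaled dot-product attention: each query attends to all keys;
    # keys similar to the query get high weights
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n_queries, n_keys)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # attention distribution over keys
    return weights @ V                         # weighted sum of values

x = np.random.randn(5, 8)       # 5 tokens, embedding dimension 8
out = attention(x, x, x)        # self-attention: Q, K, V all come from the same sequence
causal = np.tril(np.ones((5, 5), dtype=bool))
dec = attention(x, x, x, mask=causal)  # masked self-attention (decoder): no look-ahead

In a real transformer layer, Q, K and V are separate linear projections of the token embeddings, and several such heads run in parallel (multi-head attention).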
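
Appendix: using a pre-trained model
A sketch of the two options from the "Using pre-trained models" slide: fine-tuning BERT on task-specific data, or keeping BERT frozen and feeding its output into a task-specific model. It uses the Hugging Face transformers library and the bert-base-uncased checkpoint; both are my choice for illustration, not something the slides prescribe.

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Option 1: fine-tune BERT itself on task-specific data
# (a classification head is added on top of BERT; training updates all BERT weights)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Option 2: keep BERT frozen and feed its output into a task-specific model
bert = AutoModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False                         # BERT is not modified
head = torch.nn.Linear(bert.config.hidden_size, 2)  # small task-specific classifier

inputs = tokenizer("Penguins are flightless.", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
logits = head(hidden[:, 0])                         # the [CLS] vector as a sentence embedding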
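
Appendix: next-token prediction
A sketch of next-token prediction with a decoder-only model, as on the GPT slide. It again uses Hugging Face transformers, with the public gpt2 checkpoint as a stand-in; the library and checkpoint are my choice for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The man went to the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits    # (1, seq_len, vocab_size)
next_id = logits[0, -1].argmax()       # most probable next token
print(tokenizer.decode(next_id.item()))

# repeated next-token prediction = text generation
generated = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(generated[0]))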