Transformers
PA154 Language Modeling (11.1)
Pavel Rychlý
pary@fi.muni.cz
May 7, 2024

Encoder-Decoder
■ variable input/output size, not a 1-1 mapping
■ two components:
■ Encoder: variable-length sequence -> fixed-size state
■ Decoder: fixed-size state -> variable-length sequence
[Figure: encoder-decoder block diagram]

Sequence to Sequence Learning
■ Encoder: input sequence -> state
■ Decoder: state + output sequence -> output sequence
[Figure: RNN encoder-decoder trained to translate "Ils regardent ." into "They are watching ."]

Sequence to Sequence: Using the Model
■ Encoder: input sequence -> state
■ Decoder: state + sentence delimiter -> output
[Figure: decoding "Ils regardent ." into "They are watching ." token by token, starting from the sentence delimiter and feeding each output back in]

Multi-layer encoder/decoder
[Figure: n× recurrent layers stacked on top of embeddings of the sources (encoder) and of the targets (decoder)]

Multi-layer encoder/decoder
■ Encoder: input sequence -> state
■ Decoder: state + sentence delimiter -> output
■ Problem: fixed-size state
[Figure: the same multi-layer architecture with a single fixed-size state passed from encoder to decoder]

Attention
■ each decoder layer has access to all hidden states of the last encoder layer
■ use attention to extract the important parts (a vector)
[Figure: a stack of encoders and decoders translating "Je suis étudiant" into "I am a student"]

Attention
■ use attention to extract the important parts (a vector)
■ important = similar to "me" (the current decoder state)
[Figure: the decoder state attending over all encoder outputs]

Self-Attention
■ instead of sequential processing
■ attention to previous (and following) tokens
■ fully parallel processing during training

Transformers
■ Attention is All You Need
■ self-attention in both encoder and decoder
■ masked self-attention and cross-attention (to the encoder) in the decoder
■ http://jalammar.github.io/illustrated-transformer/

Transformers variants
■ using context to compute token/sentence/document embeddings
■ BERT = Bidirectional Encoder Representations from Transformers
■ GPT = Generative Pre-trained Transformer
■ many variants: tokenization, attention, encoder/decoder connections
[Figure: comparison of the BERT and OpenAI GPT architectures]

BERT
■ Google
■ encoder only
■ pre-training on raw text
■ pre-training tasks: masked tokens, is-next-sentence prediction
■ big pre-trained models available
■ domain (task) adaptation

Input: The man went to the [MASK]1 . He bought a [MASK]2 of milk .
Labels: [MASK]1 = store; [MASK]2 = gallon

Sentence A = The man went to the store.
Sentence B = He bought a gallon of milk.
Label = IsNextSentence

Sentence A = The man went to the store.
Sentence B = Penguins are flightless.
Label = NotNextSentence
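The masked-token objective above can be probed directly with a released BERT checkpoint. The following is a minimal sketch, not part of the slides: it assumes the Hugging Face transformers library, PyTorch, and the publicly available bert-base-uncased checkpoint, and fills the two [MASK] positions from the example.

    # Sketch only: masked-token prediction with a pre-trained BERT checkpoint.
    # Assumes: pip install torch transformers
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    text = "The man went to the [MASK]. He bought a [MASK] of milk."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

    # Take the most likely vocabulary item at every [MASK] position.
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    for pos in mask_positions:
        predicted_id = logits[0, pos].argmax().item()
        print(tokenizer.decode([predicted_id]))  # the model's guess for each blank

No fine-tuning is involved here; this is the raw pre-trained encoder, which is exactly the starting point of the next slide.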
Using pre-trained models
■ (BERT) trained on a huge amount of data
■ fine-tuned on task-specific data
■ alternatively, using the output of BERT as the input to a task-specific model (without modifying BERT)

GPT
■ OpenAI
■ decoder only
■ pre-training on raw text
■ trained on prediction of the next token
[Figure: GPT architecture (12× transformer decoder blocks with masked multi-head self-attention) and the input transformations used for classification, entailment, similarity and multiple-choice tasks]
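As a counterpart to the BERT sketch above, the next-token objective can be probed with a released GPT-style checkpoint. Again a minimal sketch, not from the slides, assuming the Hugging Face transformers library, PyTorch, and the publicly available gpt2 checkpoint.

    # Sketch only: next-token prediction with a pre-trained decoder-only model.
    # Assumes: pip install torch transformers
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The man went to the"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

    # The last position carries the distribution over the next token;
    # masked self-attention guarantees it has only seen the prompt.
    next_id = logits[0, -1].argmax().item()
    print(repr(tokenizer.decode([next_id])))

Generation is just this step applied repeatedly, feeding each predicted token back in, mirroring the decoder loop from the sequence-to-sequence slides.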