Transformers
PA154 Language Modeling (10.1)
Pavel Rychlý
pary@fi.muni.cz
April 20, 2023

Multi-layer encoder/decoder
■ Encoder: input sequence → state
■ Decoder: state + sentence delimiter → output
■ Problem: fixed-size state
[Figure: recurrent encoder-decoder, n× recurrent layers over the source embeddings (encoder) and over the target embeddings (decoder)]

Attention
■ each decoder layer has access to all hidden states from the last encoder layer
■ use attention to extract the important parts (a vector)
[Figure: a stack of six encoders feeding a stack of six decoders, translating the input "Je suis étudiant" into the output "I am a student"]

Attention
■ use attention to extract the important parts (a vector)
■ important = similar to "me"
[Figure: attention combining the encoder outputs with the decoder hidden state]

Self-Attention
■ instead of sequential processing
■ attention to previous (and following) tokens
■ fully parallel processing during training
(a minimal attention sketch is appended at the end of these notes)

Transformers
■ Attention is All You Need
■ self-attention in both encoder and decoder
■ masked self-attention and cross-attention (to the encoder) in the decoder
■ http://jalammar.github.io/illustrated-transformer/

Transformer variants
■ using context to compute token/sentence/document embeddings
■ BERT = Bidirectional Encoder Representations from Transformers
■ GPT = Generative Pre-trained Transformer
■ many variants: tokenization, attention, encoder/decoder connections
[Figure: architecture comparison of BERT, OpenAI GPT, and ELMo]

BERT
■ Google
■ encoder only
■ pre-training on raw text
■ masked tokens, is-next-sentence prediction
■ big pre-trained models available
■ domain (task) adaptation
Input: The man went to the [MASK]1 . He bought a [MASK]2 of milk .
Labels: [MASK]1 = store; [MASK]2 = gallon
Sentence A = The man went to the store. Sentence B = He bought a gallon of milk. Label = IsNextSentence
Sentence A = The man went to the store. Sentence B = Penguins are flightless. Label = NotNextSentence

Using pre-trained models
■ (BERT) trained on a huge amount of data
■ fine-tuned on task-specific data
■ or: use the output of BERT as input to a task-specific model (without modifying BERT)
(a sketch of both options is appended at the end of these notes)

GPT
■ OpenAI
■ decoder only
■ pre-training on raw text
■ trained to predict the next token
[Figure: GPT architecture (12× transformer decoder blocks: masked multi-head self-attention, layer norm, feed-forward, layer norm, over text and position embeddings) and input transformations for classification, entailment, similarity, and multiple-choice tasks]
(a next-token prediction sketch is appended at the end of these notes)
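
Appendix: attention sketch
A minimal sketch of the (self-)attention operation from the Self-Attention slide: every position attends to all positions, and positions whose keys are similar to the query ("important = similar to me") receive higher weights. The function names, the single-head simplification, and the use of NumPy are my own illustration; they are not taken from the slides.

import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # scaled dot-product attention: each query attends to all keys;
    # keys similar to the query get high weights
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n_queries, n_keys)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # attention distribution over keys
    return weights @ V                         # weighted sum of values

x = np.random.randn(5, 8)       # 5 tokens, embedding dimension 8
out = attention(x, x, x)        # self-attention: Q, K, V all come from the same sequence
causal = np.tril(np.ones((5, 5), dtype=bool))
dec = attention(x, x, x, mask=causal)  # masked self-attention (decoder): no look-ahead

In a real transformer layer, Q, K and V are separate linear projections of the token embeddings, and several such heads run in parallel (multi-head attention).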
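
Appendix: using a pre-trained model
A sketch of the two options from the "Using pre-trained models" slide: fine-tuning BERT on task-specific data, or keeping BERT frozen and feeding its output into a task-specific model. It uses the Hugging Face transformers library and the bert-base-uncased checkpoint; both are my choice for illustration, not something the slides prescribe.

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Option 1: fine-tune BERT itself on task-specific data
# (a classification head is added on top of BERT; training updates all BERT weights)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Option 2: keep BERT frozen and feed its output into a task-specific model
bert = AutoModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False                         # BERT is not modified
head = torch.nn.Linear(bert.config.hidden_size, 2)  # small task-specific classifier

inputs = tokenizer("Penguins are flightless.", return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state       # (1, seq_len, hidden_size)
logits = head(hidden[:, 0])                         # the [CLS] vector as a sentence embedding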
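
Appendix: next-token prediction
A sketch of next-token prediction with a decoder-only model, as on the GPT slide. It again uses Hugging Face transformers, with the public gpt2 checkpoint as a stand-in; the library and checkpoint are my choice for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The man went to the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits    # (1, seq_len, vocab_size)
next_id = logits[0, -1].argmax()       # most probable next token
print(tokenizer.decode(next_id.item()))

# repeated next-token prediction = text generation
generated = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(generated[0]))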