MUNI FI

Transformers
PA154 Language Modeling (10.1)
Pavel Rychlý
pary@fi.muni.cz
April 20, 2023

Multi-layer encoder/decoder
■ Encoder: input sequence → state
■ Decoder: state + sentence delimiter → output
■ Problem: fixed-size state
[Figure: encoder-decoder network — embedding and recurrent layers (×n) with a fully connected output layer, mapping sources to targets]

Attention
■ each decoder layer has access to all hidden states from the last encoder layer
■ use attention to extract the important parts (a vector)
■ important = similar to "me"
[Figure: attention weights over the encoder's hidden states for the input sequence]

Self-Attention
■ instead of sequential processing, attention to previous (and following) tokens
■ fully parallel processing during training

Transformers
■ Attention is All You Need
■ self-attention in both encoder and decoder
■ masked self-attention in the decoder, plus cross-attention to the encoder outputs
■ http://jalammar.github.io/illustrated-transformer/

Transformers variants
■ using context to compute token/sentence/document embeddings
■ BERT = Bidirectional Encoder Representations from Transformers
■ GPT = Generative Pre-trained Transformer
■ many variants: tokenization, attention, encoder/decoder connections
[Figure: architecture comparison of BERT, OpenAI GPT, and ELMo]

BERT
■ Google
■ encoder only
■ pre-training on raw text: masking tokens, is-next-sentence
■ big pre-trained models available
■ domain (task) adaptation

Input: The man went to the [MASK]1 . He bought a [MASK]2 of milk .
Labels: [MASK]1 = store; [MASK]2 = gallon

Sentence A = The man went to the store .
Sentence B = He bought a gallon of milk .
Label = IsNextSentence

Sentence A = The man went to the store .
Sentence B = Penguins are flightless .
Label = NotNextSentence

Using pre-trained models (BERT)
■ trained on a huge amount of data
■ fine-tuned on task-specific data
■ or: using the output of BERT as an input to a task-specific model (without modification of BERT)

GPT
■ OpenAI
■ decoder only
■ pre-training on raw text
■ trained on prediction of the next token
[Figure: GPT decoder stack — text & position embeddings, transformer blocks with layer norm, prediction/classifier head]
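
To make the attention computation from the Attention and Self-Attention slides concrete, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as formulated in "Attention is All You Need"; the function names and toy dimensions are illustrative, not taken from the lecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked (future) positions
    weights = softmax(scores, axis=-1)         # "important = similar to me"
    return weights @ V                         # weighted sum of the values

# Toy self-attention: 4 tokens, embedding dimension 8; the same sequence
# serves as queries, keys, and values.
x = np.random.randn(4, 8)
causal = np.tril(np.ones((4, 4), dtype=bool))  # decoder-style mask: no look-ahead
out = attention(x, x, x, mask=causal)
print(out.shape)  # (4, 8)
```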
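
The masked-token objective from the BERT slide can be tried with a big pre-trained model used as-is, without modification of BERT. A hedged sketch using the Hugging Face `transformers` library and its `bert-base-uncased` checkpoint — both are assumptions of this example; the slides do not prescribe a particular toolkit.

```python
from transformers import pipeline

# Pre-trained BERT used unchanged for masked-token prediction.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The slide's example sentence with one token masked.
for candidate in unmasker("The man went to the [MASK] and bought a gallon of milk."):
    print(candidate["token_str"], round(candidate["score"], 3))
```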
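
Similarly, GPT-style next-token prediction can be demonstrated with the publicly available GPT-2 checkpoint; again, the choice of library and model is an assumption of this sketch, not something the slides specify.

```python
from transformers import pipeline

# GPT-2: a decoder-only model trained purely on next-token prediction.
generator = pipeline("text-generation", model="gpt2")
result = generator("The man went to the store and", max_new_tokens=20)
print(result[0]["generated_text"])
```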