Machine Translation

Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).
x: L'homme est né libre, et partout il est dans les fers
y: Man is born free, but everywhere he is in chains – Rousseau

The early history of MT: 1950s
• Machine translation research began in the early 1950s, on machines less powerful than high-school calculators (before the term "A.I." was coined!)
• Concurrent with foundational work on automata, formal languages, probabilities, and information theory
• MT was heavily funded by the military, but the systems were basically just simple rule-based systems doing word substitution
• Human language is more complicated than that, and varies more across languages!
• There was little understanding of natural language syntax, semantics, pragmatics
• The problem soon appeared intractable
• 1-minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw

1990s–2010s: Statistical Machine Translation
• Core idea: learn a probabilistic model from data
• Suppose we're translating French → English
• We want to find the best English sentence y, given the French sentence x: argmax_y P(y|x)
• Use Bayes' Rule to break this down into two components to be learned separately: argmax_y P(x|y) P(y)
  • Translation Model P(x|y): models how words and phrases should be translated (fidelity). Learned from parallel data.
  • Language Model P(y): models how to write good English (fluency). Learned from monolingual data.

What happens in translation isn't trivial to model!
1519年600名西班牙人在墨西哥登陆,去征服几百万人口的阿兹特克帝国,初次交锋他们损兵三分之二。
(In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.)
translate.google.com (2009): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.
translate.google.com (2013): 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.
translate.google.com (2015): 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.

1990s–2010s: Statistical Machine Translation
• SMT was a huge research field
• The best systems were extremely complex
  • Hundreds of important details
  • Systems had many separately-designed subcomponents
  • Lots of feature engineering: you needed to design features to capture particular language phenomena
  • Required compiling and maintaining extra resources, like tables of equivalent phrases
  • Lots of human effort to maintain
  • Repeated effort for each language pair!

What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network
• The neural network architecture is called a sequence-to-sequence model (aka seq2seq) and it involves two RNNs

Neural Machine Translation (NMT): the sequence-to-sequence model
• The source sentence (input), e.g. "il a m' entarté", is read by the Encoder RNN; the Decoder RNN produces the target sentence (output).
• The Encoder RNN produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN.
• The Decoder RNN is a Language Model that generates the target sentence, conditioned on the encoding.
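To make the two-RNN structure concrete, here is a minimal sketch of such an encoder-decoder in PyTorch. It is an illustrative simplification, not the course's reference implementation; the module name, dimensions, and single-layer LSTMs are all assumptions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: two RNNs, trained end-to-end."""
    def __init__(self, src_vocab, tgt_vocab, d_emb=256, d_hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, tgt_vocab)   # projects to target-vocab logits

    def forward(self, src_ids, tgt_ids):
        # Encoder RNN produces an encoding of the source sentence;
        # its final state initializes the decoder (the conditioning).
        _, enc_state = self.encoder(self.src_emb(src_ids))
        # Decoder RNN is a language model over the target sentence,
        # conditioned on the source encoding.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)
        return self.out(dec_out)                 # logits for each target position
```

At test time the decoder is run one step at a time, feeding each predicted word back in as the next input, as described next.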
• The diagram on the slide shows test-time behavior: on each step, the decoder's output word (he, hit, me, with, a, pie) is chosen by argmax over the vocabulary and fed back in as the next step's input.

Sequence-to-sequence is versatile!
• The general notion here is an encoder-decoder model
  • One neural network takes input and produces a neural representation
  • Another network produces output based on that neural representation
  • If the input and output are sequences, we call it a seq2seq model
• Sequence-to-sequence is useful for more than just MT
• Many NLP tasks can be phrased as sequence-to-sequence:
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as sequence)
  • Code generation (natural language → Python code)

Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a Conditional Language Model
  • Language Model because the decoder is predicting the next word of the target sentence y
  • Conditional because its predictions are also conditioned on the source sentence x
• NMT directly calculates P(y|x):
  P(y|x) = P(y_1|x) P(y_2|y_1, x) P(y_3|y_1, y_2, x) … P(y_T|y_1, …, y_{T−1}, x)
  where each factor is the probability of the next target word, given the target words so far and the source sentence x
• Question: how do we train an NMT system?
• (Easy) Answer: get a big parallel corpus…
  • But there is now exciting work on "unsupervised NMT", data augmentation, etc.

Training a Neural Machine Translation system
• Feed the source sentence (from the corpus) through the Encoder RNN, and the target sentence (from the corpus), e.g. "he hit me with a pie", through the Decoder RNN.
• On each decoder step t, the model produces a prediction ŷ_t and incurs a loss J_t, the negative log probability of the true next word (e.g. J_1 = −log P("he"), J_4 = −log P("with")).
• The total loss is the average over the target sentence: J = (1/T) Σ_{t=1}^{T} J_t
• Seq2seq is optimized as a single system; backpropagation operates "end-to-end".

Multi-layer deep encoder-decoder machine translation net [Sutskever et al. 2014; Luong et al. 2015]
(Figure: a deep, stacked encoder-decoder translating "Die Proteste waren am Wochenende eskaliert" → "The protests escalated over the weekend". The encoder builds up the meaning of the source sentence; the conditioning passed to the decoder is a bottleneck; the decoder generates the translation, feeding in the last generated word at each step.)
• The hidden states from RNN layer i are the inputs to RNN layer i+1
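To make the training objective concrete, here is a minimal sketch of one training step with teacher forcing, reusing the hypothetical Seq2Seq module sketched above. The vocabulary sizes and the batch layout (2-D tensors of token ids, targets wrapped in start/end tokens) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical setup: src_ids and tgt_ids are (batch, length) tensors of token ids,
# where each target row is [<START>, he, hit, me, with, a, pie, <END>].
model = Seq2Seq(src_vocab=32000, tgt_vocab=32000)
loss_fn = nn.CrossEntropyLoss()            # cross-entropy = negative log probability
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(src_ids, tgt_ids):
    # Teacher forcing: the decoder reads the gold prefix y_1..y_{t-1}
    # and is trained to predict the gold next word y_t.
    logits = model(src_ids, tgt_ids[:, :-1])          # predictions for steps 1..T
    # J = (1/T) sum_t J_t with J_t = -log P(y_t | y_<t, x), averaged over the batch.
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
    loss.backward()                                   # backpropagation, end-to-end
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```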
How do we evaluate Machine Translation?

BLEU (Bilingual Evaluation Understudy)
• BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
  • n-gram precision (usually for 1-, 2-, 3- and 4-grams)
  • Plus a penalty for too-short system translations
• BLEU is useful but imperfect
  • There are many valid ways to translate a sentence
  • So a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation
• You'll see BLEU in detail in Assignment 4!
Source: "BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al., 2002. http://aclweb.org/anthology/P02-1040

Example: a machine translation scored against 4 reference translations [Papineni et al. 2002]
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
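As a rough illustration of how the score is computed, here is a minimal sketch of clipped n-gram precision plus a brevity penalty. It is a simplification of the corpus-level formula in Papineni et al. (2002), assumes a single pre-tokenized candidate, and uses uniform n-gram weights.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions for n = 1..4, times a brevity penalty for short outputs."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: compare candidate length to the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# e.g. bleu("he hit me with a pie".split(),
#           ["he hit me with a pie".split()])  ->  1.0
```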
MT progress over time
(Chart: BLEU scores, roughly 0–45, over 2013–2019 for phrase-based SMT, syntax-based SMT, and Neural MT. [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal; NMT 2019 FAIR on newstest2019])
Sources: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf & http://matrix.statmt.org/

Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
  • More fluent
  • Better use of context
  • Better use of phrase similarities
• A single neural network to be optimized end-to-end
  • No subcomponents to be individually optimized
• Requires much less human engineering effort
  • No feature engineering
  • Same method for all language pairs

Disadvantages of NMT?
Compared to SMT:
• NMT is less interpretable
  • Hard to debug
• NMT is difficult to control
  • For example, you can't easily specify rules or guidelines for translation
  • Safety concerns!

NMT: the first big success story of NLP Deep Learning
Neural Machine Translation went from a fringe research attempt in 2014 to the leading standard method in 2016
• 2014: First seq2seq paper published
• 2016: Google Translate switches from SMT to NMT – and by 2018 everyone has
• This is amazing! SMT systems, built by hundreds of engineers over many years, were outperformed by NMT systems trained by a small group of engineers in a few months

Decoding: greedy decoding
• We saw how to generate (or "decode") the target sentence by taking the argmax on each step of the decoder
• This is greedy decoding (take the most probable word on each step)

Problems with greedy decoding
• Greedy decoding has no way to undo decisions!
  • Input: il a m'entarté (he hit me with a pie)
  • → he ____
  • → he hit ____
  • → he hit a ____ (whoops! no going back now…)
• How do we fix this?

Exhaustive search decoding
• Ideally, we want to find a (length T) translation y that maximizes
  P(y|x) = P(y_1|x) P(y_2|y_1, x) … P(y_T|y_1, …, y_{T−1}, x)
• We could try computing all possible sequences y
  • This means that on each step t of the decoder, we're tracking V^t possible partial translations, where V is the vocab size
  • This O(V^T) complexity is far too expensive!

Beam search decoding
• Core idea: on each step of the decoder, keep track of the k most probable partial translations (which we call hypotheses)
  • k is the beam size (in practice around 5 to 10 in NMT)
• A hypothesis y_1, …, y_t has a score, which is its log probability:
  score(y_1, …, y_t) = log P_LM(y_1, …, y_t | x) = Σ_{i=1}^{t} log P_LM(y_i | y_1, …, y_{i−1}, x)
  • Scores are all negative, and a higher score is better
  • We search for high-scoring hypotheses, tracking the top k on each step
• Beam search is not guaranteed to find the optimal solution
  • But it is much more efficient than exhaustive search!

Beam search decoding: example (beam size k = 2; the blue numbers in the slides are hypothesis scores)
• Start: calculate the probability distribution of the next word after <START>, take the top k words and compute their scores, e.g. he (score = log P_LM(he | <START>) = −0.7) and I (−0.9).
• On each step, for each of the k hypotheses, find the top k next words and calculate their scores, e.g. score(<START> he hit) = log P_LM(hit | <START> he) + (−0.7); of these k² hypotheses, just keep the k with the highest scores.
• In the example, the beam expands step by step through partial translations such as he hit / I was and he hit a / he hit me, and eventually reaches complete hypotheses ending in pie and tart.
• When the top-scoring hypothesis ("he hit me with a pie") is reached, backtrack through the kept hypotheses to obtain the full translation.
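A minimal sketch of the beam-search loop described above. The next_log_probs(prefix) hook is an assumption standing in for the NMT decoder: it should return the model's log-distribution over the next word given the prefix (and, implicitly, the source sentence). Stopping criteria and length normalization are discussed next.

```python
def beam_search(next_log_probs, k=5, max_steps=50, start="<START>", end="<END>"):
    """Core beam-search loop: keep the k highest-scoring partial hypotheses,
    where score = sum of log-probabilities of the words chosen so far."""
    beam = [([start], 0.0)]                       # list of (hypothesis, score)
    completed = []
    for _ in range(max_steps):
        candidates = []
        for hyp, score in beam:
            # Expand each hypothesis with its k most probable next words.
            dist = next_log_probs(hyp)            # {word: log P(word | prefix, x)}
            for word, lp in sorted(dist.items(), key=lambda x: -x[1])[:k]:
                candidates.append((hyp + [word], score + lp))
        # Of the k*k candidates, keep only the k with the highest scores;
        # hypotheses that produced <END> are set aside as complete.
        beam = []
        for hyp, score in sorted(candidates, key=lambda x: -x[1])[:k]:
            (completed if hyp[-1] == end else beam).append((hyp, score))
        if not beam:
            break
    return completed or beam
```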
Beam search decoding: stopping criterion
• In greedy decoding, usually we decode until the model produces an <END> token
  • For example: <START> he hit me with a pie <END>
• In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  • When a hypothesis produces <END>, that hypothesis is complete
  • Place it aside and continue exploring other hypotheses via beam search
• Usually we continue beam search until:
  • We reach timestep T (where T is some pre-defined cutoff), or
  • We have at least n completed hypotheses (where n is a pre-defined cutoff)

Beam search decoding: finishing up
• We have our list of completed hypotheses. How do we select the top one?
• Each hypothesis y_1, …, y_t on our list has a score:
  score(y_1, …, y_t) = Σ_{i=1}^{t} log P_LM(y_i | y_1, …, y_{i−1}, x)
• Problem with this: longer hypotheses have lower scores
• Fix: normalize by length, and use the length-normalized score to select the top one instead:
  (1/t) Σ_{i=1}^{t} log P_LM(y_i | y_1, …, y_{i−1}, x)
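A small sketch of that final selection step, assuming completed hypotheses are (word-list, summed-log-prob) pairs as in the beam-search sketch above:

```python
def select_best(completed):
    """Pick the completed hypothesis with the highest length-normalized score,
    i.e. (1/t) * sum_i log P(y_i | y_<i, x), so longer hypotheses aren't
    penalized just for having more (negative) log-probability terms."""
    return max(completed, key=lambda pair: pair[1] / len(pair[0]))
```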
See also the discussion of sampling-based decoding in the NLG lecture.

Why attention? Sequence-to-sequence: the bottleneck problem
• In the basic seq2seq model, the only thing the Decoder RNN receives from the Encoder RNN is the encoding of the source sentence (the encoder's final hidden state).
• Problems with this architecture? That single encoding needs to capture all information about the source sentence. Information bottleneck!

Attention
• Attention provides a solution to the bottleneck problem
• Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence
• First we will show it via diagrams (no equations), then we will show it with equations

Sequence-to-sequence with attention (diagram walkthrough, source sentence "il a m' entarté")
• On each decoder step, compute attention scores as the dot product between the current decoder hidden state and each encoder hidden state.
• Take a softmax to turn the scores into a probability distribution, the attention distribution. (On the first decoder step of the example, we're mostly focusing on the first encoder hidden state, "he".)
• Use the attention distribution to take a weighted sum of the encoder hidden states: the attention output. The attention output mostly contains information from the hidden states that received high attention.
• Concatenate the attention output with the decoder hidden state, then use it to compute ŷ_1 ("he") as before. Repeating this on every decoder step produces ŷ_2 = hit, ŷ_3 = me, ŷ_4 = with, ŷ_5 = a, ŷ_6 = pie.
• Sometimes we take the attention output from the previous step and also feed it into the decoder (along with the usual decoder input). We do this in Assignment 4.
Attention: in equations
• We have encoder hidden states h_1, …, h_N ∈ ℝ^h
• On timestep t, we have decoder hidden state s_t ∈ ℝ^h
• We get the attention scores e^t for this step:
  e^t = [s_t^⊤ h_1, …, s_t^⊤ h_N] ∈ ℝ^N
• We take softmax to get the attention distribution α^t for this step (this is a probability distribution and sums to 1):
  α^t = softmax(e^t) ∈ ℝ^N
• We use α^t to take a weighted sum of the encoder hidden states to get the attention output a_t:
  a_t = Σ_{i=1}^{N} α_i^t h_i ∈ ℝ^h
• Finally we concatenate the attention output with the decoder hidden state, [a_t; s_t] ∈ ℝ^{2h}, and proceed as in the non-attention seq2seq model

Attention is great!
• Attention significantly improves NMT performance
  • It's very useful to allow the decoder to focus on certain parts of the source
• Attention provides a more "human-like" model of the MT process
  • You can look back at the source sentence while translating, rather than needing to remember it all
• Attention solves the bottleneck problem
  • Attention allows the decoder to look directly at the source; bypass the bottleneck
• Attention helps with the vanishing gradient problem
  • Provides a shortcut to faraway states
• Attention provides some interpretability
  • By inspecting the attention distribution, we can see what the decoder was focusing on (e.g. a soft alignment between "il a m' entarté" and "he hit me with a pie")
  • We get (soft) alignment for free! This is cool because we never explicitly trained an alignment system; the network just learned alignment by itself

There are several attention variants
• We have some values h_1, …, h_N ∈ ℝ^{d_1} and a query s ∈ ℝ^{d_2}
• Attention always involves:
  1. Computing the attention scores e ∈ ℝ^N (there are multiple ways to do this)
  2. Taking softmax to get the attention distribution α: α = softmax(e) ∈ ℝ^N
  3. Using the attention distribution to take a weighted sum of the values: a = Σ_{i=1}^{N} α_i h_i ∈ ℝ^{d_1}, thus obtaining the attention output a (sometimes called the context vector)

Attention variants
There are several ways you can compute e ∈ ℝ^N from h_1, …, h_N ∈ ℝ^{d_1} and s ∈ ℝ^{d_2}:
• Basic dot-product attention: e_i = s^⊤ h_i ∈ ℝ
  • Note: this assumes d_1 = d_2. This is the version we saw earlier.
• Multiplicative attention: e_i = s^⊤ W h_i ∈ ℝ [Luong, Pham, and Manning 2015]
  • Where W ∈ ℝ^{d_2 × d_1} is a weight matrix. Perhaps better called "bilinear attention".
• Reduced-rank multiplicative attention: e_i = s^⊤ U^⊤ V h_i = (U s)^⊤ (V h_i)
  • For low-rank matrices U ∈ ℝ^{k × d_2}, V ∈ ℝ^{k × d_1}, with k ≪ d_1, d_2
• Additive attention: e_i = v^⊤ tanh(W_1 h_i + W_2 s) ∈ ℝ [Bahdanau, Cho, and Bengio 2014]
  • Where W_1 ∈ ℝ^{d_3 × d_1} and W_2 ∈ ℝ^{d_3 × d_2} are weight matrices and v ∈ ℝ^{d_3} is a weight vector
  • d_3 (the attention dimensionality) is a hyperparameter
  • "Additive" is a weird/bad name; it's really using a feed-forward neural net layer
• More information: "Deep Learning for NLP Best Practices", Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention and "Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
• You'll think about the relative advantages/disadvantages of these in Assignment 4!
• Remember this when we look at Transformers next week!
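A minimal sketch of basic dot-product attention for one decoder timestep, following the equations above; the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(s_t, H):
    """One step of dot-product attention.
    s_t: decoder hidden state, shape (h,)
    H:   encoder hidden states stacked as rows, shape (N, h)
    Returns the attention output a_t (h,) and the attention distribution (N,)."""
    e_t = H @ s_t                      # attention scores: e_i = s_t . h_i, shape (N,)
    alpha_t = F.softmax(e_t, dim=0)    # attention distribution (sums to 1)
    a_t = alpha_t @ H                  # weighted sum of encoder states, shape (h,)
    return a_t, alpha_t

# In the seq2seq model, [a_t ; s_t] (concatenation) is then used to predict y_t:
# combined = torch.cat([a_t, s_t], dim=-1)
```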
Attention is a general Deep Learning technique
• We've seen that attention is a great way to improve the sequence-to-sequence model for Machine Translation
• However: you can use attention in many architectures (not just seq2seq) and many tasks (not just MT)
• More general definition of attention: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query
  • We sometimes say that the query attends to the values
  • For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values)
• Intuition:
  • The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
  • Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query)
• Upshot: attention has become the powerful, flexible, general way to do pointer and memory manipulation in all deep learning models. A new idea from after 2010, and it came from NMT!

As of last lecture: recurrent models for (most) NLP!
• Circa 2016, the de facto strategy in NLP was to encode sentences with a bidirectional LSTM (for example, the source sentence in a translation)
• Define your output (parse, sentence, summary) as a sequence, and use an LSTM to generate it
• Use attention to allow flexible access to memory

Today: same goals, different building blocks
• Last week, we learned about sequence-to-sequence problems and encoder-decoder models
• Today, we're not trying to motivate entirely new ways of looking at problems (like Machine Translation)
• Instead, we're trying to find the best building blocks to plug into our models and enable broad progress
  • 2014-2017ish: recurrence, plus lots of trial and error; 2021: ??????

Issues with recurrent models: linear interaction distance
• RNNs are unrolled "left-to-right"
• This encodes linear locality: a useful heuristic, since nearby words often affect each other's meanings (e.g. "tasty pizza")
• Problem: RNNs take O(sequence length) steps for distant word pairs to interact (e.g. in "The chef who … was", the information about "chef" has to pass through O(sequence length) many layers before it can interact with "was")
• This means:
  • It is hard to learn long-distance dependencies (because of gradient problems!)
  • The linear order of words is "baked in"; we already know linear order isn't the right way to think about sentences…

Issues with recurrent models: lack of parallelizability
• Forward and backward passes have O(sequence length) unparallelizable operations
  • GPUs can perform a bunch of independent computations at once!
  • But future RNN hidden states can't be computed in full before past RNN hidden states have been computed
  • This inhibits training on very large datasets!
  (Figure: in the unrolled RNN h_1, h_2, …, h_T, the minimum number of steps before each state can be computed grows linearly with its position.)

If not recurrence, then what? How about attention?
• Attention treats each word's representation as a query to access and incorporate information from a set of values
• We saw attention from the decoder to the encoder; today we'll think about attention within a single sentence
• With attention, the number of unparallelizable operations does not increase with sequence length
• Maximum interaction distance: O(1), since all words interact at every layer!
  (Figure: stacked attention layers over word embeddings; all words attend to all words in the previous layer, with most arrows in the diagram omitted.)

Attention as a soft, averaging lookup table
• We can think of attention as performing fuzzy lookup in a key-value store
• In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
• In attention, the query matches all keys softly, to a weight between 0 and 1. The keys' values are multiplied by the weights and summed.

Self-attention: keys, queries, values from the same sequence
• Let w_{1:n} be a sequence of words in vocabulary V, like "Zuko made his uncle tea"
• For each w_i, let x_i = E w_i, where E ∈ ℝ^{d×|V|} is an embedding matrix
1. Transform each word embedding with weight matrices Q, K, V, each in ℝ^{d×d}:
   q_i = Q x_i (queries),  k_i = K x_i (keys),  v_i = V x_i (values)
2. Compute pairwise similarities between keys and queries; normalize with softmax:
   e_ij = q_i^⊤ k_j,   α_ij = exp(e_ij) / Σ_{j′} exp(e_ij′)
3. Compute the output for each word as a weighted sum of the values:
   o_i = Σ_j α_ij v_j

Barriers and solutions for self-attention as a building block
• Barrier: self-attention doesn't have an inherent notion of order!

Fixing the first self-attention problem: sequence order
• Since self-attention doesn't build in order information, we need to encode the order of the sentence in our keys, queries, and values
• Consider representing each sequence index as a vector: p_i ∈ ℝ^d, for i ∈ {1, 2, …, n}, are position vectors
  • Don't worry about what the p_i are made of yet!
• It's easy to incorporate this info into our self-attention block: just add the p_i to our inputs!
• Recall that x_i is the embedding of the word at index i. The positioned embedding is:
  x̃_i = x_i + p_i
• In deep self-attention networks, we do this at the first layer! (You could concatenate them as well, but people mostly just add…)

Position representation vectors through sinusoids
• Sinusoidal position representations: concatenate sinusoidal functions of varying periods:
  p_i = [ sin(i / 10000^{2·1/d}), cos(i / 10000^{2·1/d}), …, sin(i / 10000^{2·(d/2)/d}), cos(i / 10000^{2·(d/2)/d}) ]
  (Image: heatmap of the encoding over sequence index and dimension, from https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/)
• Pros:
  • Periodicity indicates that maybe "absolute position" isn't as important
  • Maybe it can extrapolate to longer sequences as periods restart!
• Cons:
  • Not learnable; also, the extrapolation doesn't really work!

Position representation vectors learned from scratch
• Learned absolute position representations: let all p_i be learnable parameters! Learn a matrix p ∈ ℝ^{d×n}, and let each p_i be a column of that matrix
• Pros:
  • Flexibility: each position gets to be learned to fit the data
• Cons:
  • Definitely can't extrapolate to indices outside 1, …, n
• Most systems use this!
• Sometimes people try more flexible representations of position:
  • Relative linear position attention [Shaw et al., 2018]
  • Dependency syntax-based position [Wang et al., 2019]
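A minimal sketch of the three self-attention steps above, with learned absolute position embeddings added at the input. This is a single head with no masking yet; the class name, dimensions, and maximum length are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: keys, queries, and values all come from
    the same sequence, plus learned absolute position embeddings."""
    def __init__(self, vocab_size, d=64, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)       # x_i = E w_i
        self.pos = nn.Embedding(max_len, d)          # learned p_i
        self.Q = nn.Linear(d, d, bias=False)
        self.K = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d, d, bias=False)

    def forward(self, word_ids):                     # word_ids: (n,) token indices
        n = word_ids.size(0)
        x = self.emb(word_ids) + self.pos(torch.arange(n))   # x~_i = x_i + p_i
        q, k, v = self.Q(x), self.K(x), self.V(x)    # (n, d) each
        e = q @ k.T                                  # e_ij = q_i . k_j, shape (n, n)
        alpha = F.softmax(e, dim=-1)                 # normalize over j
        return alpha @ v                             # o_i = sum_j alpha_ij v_j
```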
Barriers and solutions for self-attention as a building block
• Barrier: doesn't have an inherent notion of order! → Solution: add position representations to the inputs
• Barrier: no nonlinearities for deep learning magic; it's all just weighted averages

Adding nonlinearities in self-attention
• Note that there are no elementwise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors (why? look at the notes!)
• Easy fix: add a feed-forward network to post-process each output vector:
  m_i = MLP(output_i) = W_2 ReLU(W_1 output_i + b_1) + b_2
• Intuition: the FF network processes the result of attention (in the diagram, each word's self-attention output is passed through the same feed-forward layer before the next self-attention layer)

Barriers and solutions for self-attention as a building block
• Barrier: doesn't have an inherent notion of order! → Solution: add position representations to the inputs
• Barrier: no nonlinearities for deep learning magic; it's all just weighted averages → Solution: apply the same feed-forward network to each self-attention output
• Barrier: need to ensure we don't "look at the future" when predicting a sequence (like in machine translation, or language modeling)

Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can't peek at the future
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting their attention scores to −∞:
  e_ij = q_i^⊤ k_j if j ≤ i,  and  e_ij = −∞ if j > i
• (In the diagram: when encoding each word of "<START> The chef who …", we can only look at words at or before that position; the greyed-out future positions get score −∞.)

Barriers and solutions for self-attention as a building block
• Barrier: doesn't have an inherent notion of order! → Solution: add position representations to the inputs
• Barrier: no nonlinearities for deep learning magic; it's all just weighted averages → Solution: apply the same feed-forward network to each self-attention output
• Barrier: need to ensure we don't "look at the future" when predicting a sequence → Solution: mask out the future by artificially setting attention weights to 0!

Necessities for a self-attention building block:
• Self-attention: the basis of the method
• Position representations: specify the sequence order, since self-attention is an unordered function of its inputs
• Nonlinearities: at the output of the self-attention block, frequently implemented as a simple feed-forward network
• Masking: in order to parallelize operations while not looking at the future; keeps information about the future from "leaking" to the past
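Putting the pieces together, here is a minimal sketch of a decoder-style self-attention block with the three additions above: position embeddings at the input, a causal mask on the attention scores, and a feed-forward network applied to each output. It is a simplification for illustration, not a full Transformer layer (no multiple heads, scaling, residual connections, or layer norm); the names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttentionBlock(nn.Module):
    """Self-attention + position embeddings + future masking + feed-forward net."""
    def __init__(self, vocab_size, d=64, d_ff=256, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_len, d)
        self.Q = nn.Linear(d, d, bias=False)
        self.K = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d, d, bias=False)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, word_ids):                       # word_ids: (n,) token indices
        n = word_ids.size(0)
        x = self.emb(word_ids) + self.pos(torch.arange(n))   # add position vectors
        q, k, v = self.Q(x), self.K(x), self.V(x)
        e = q @ k.T                                    # e_ij = q_i . k_j
        # Mask the future: set e_ij = -inf for j > i so softmax gives weight 0.
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        e = e.masked_fill(mask, float("-inf"))
        o = F.softmax(e, dim=-1) @ v                   # weighted average of values
        return self.ff(o)                              # nonlinearity: m_i = MLP(o_i)
```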