Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 5: Language Models and Recurrent Neural Networks

2. Language Modeling
• Language Modeling is the task of predicting what word comes next:
  the students opened their ______   (e.g., books, laptops, exams, minds)
• More formally: given a sequence of words x^(1), …, x^(t), compute the probability distribution of the next word x^(t+1):
  P(x^(t+1) | x^(t), …, x^(1)),  where x^(t+1) can be any word in the vocabulary V
• A system that does this is called a Language Model

Language Modeling
• You can also think of a Language Model as a system that assigns a probability to a piece of text
• For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is:
  P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × … × P(x^(T) | x^(T−1), …, x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t−1), …, x^(1))
  This is what our LM provides

How to build a neural language model?
• Recall the Language Modeling task:
  • Input: sequence of words x^(1), …, x^(t)
  • Output: prob. dist. of the next word P(x^(t+1) | x^(t), …, x^(1))
• How about a window-based neural model?
• We saw this applied to Named Entity Recognition in Lecture 2:
  [Figure: a window classifier labeling "Paris" as LOCATION in "… museums in Paris are amazing …"]

A fixed-window neural Language Model
  as the proctor started the clock, the students opened their ______
  [Figure: discard the earlier context ("as the proctor started the clock"); keep only a fixed window ("the students opened their")]

A fixed-window neural Language Model
  [Figure: words / one-hot vectors for "the students opened their" → concatenated word embeddings → hidden layer → output distribution over possible next words such as "books", "laptops", "a", "zoo"]

A fixed-window neural Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don't need to store all observed n-grams
Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges W
• Window can never be large enough!
• x^(1) and x^(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.
We need a neural architecture that can process any length input
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model

3. Recurrent Neural Networks (RNN)
A family of neural architectures
Core idea: Apply the same weights W repeatedly
[Figure: an input sequence x^(1), x^(2), x^(3), … (any length) feeding a chain of hidden states h^(0), h^(1), h^(2), …, with optional outputs at each step]

A Simple RNN Language Model
[Figure: the RNN-LM unrolled over "the students opened their": words / one-hot vectors → word embeddings → hidden states (h^(0) is the initial hidden state) → output distribution over next words such as "books", "laptops", "a", "zoo"]
Note: this input sequence could be much longer now!
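To make the diagram above concrete, here is a minimal NumPy sketch of one step of a simple RNN-LM. The parameterization (embedding matrix E, recurrent weights W_h, input weights W_e, output weights U) follows the usual simple-RNN convention; the exact symbols and dimensions are assumptions for illustration, not taken from the slides, and the weights here are random/untrained.

```python
import numpy as np

# Illustrative sizes: vocabulary size V, embedding dim d, hidden dim n
V, d, n = 10000, 64, 128
rng = np.random.default_rng(0)
E   = rng.normal(size=(d, V)) * 0.01   # word embedding matrix (assumed lookup by one-hot)
W_h = rng.normal(size=(n, n)) * 0.01   # applied to the previous hidden state (the SAME W at every step)
W_e = rng.normal(size=(n, d)) * 0.01   # applied to the current word embedding
U   = rng.normal(size=(V, n)) * 0.01   # maps the hidden state to vocabulary scores
b1, b2 = np.zeros(n), np.zeros(V)

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def rnn_lm_step(x_onehot, h_prev):
    """One timestep: embed the word, update the hidden state, predict the next word."""
    e = E @ x_onehot                          # word embedding e^(t)
    h = np.tanh(W_h @ h_prev + W_e @ e + b1)  # hidden state h^(t) (tanh as the nonlinearity)
    y_hat = softmax(U @ h + b2)               # output distribution over the vocabulary
    return h, y_hat

h = np.zeros(n)                               # h^(0), the initial hidden state
for word_id in [3, 41, 59, 26]:               # hypothetical ids for "the students opened their"
    x = np.zeros(V); x[word_id] = 1.0
    h, y_hat = rnn_lm_step(x, h)              # y_hat approximates P(next word | words so far)
```

Because the same W_h and W_e are reused at every step, the loop can run over a sequence of any length with a fixed number of parameters.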
RNN Language Models
RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn't increase for longer input context
• Same weights applied on every timestep, so there is symmetry in how inputs are processed.
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
More on these later

Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
  • i.e., predict the probability distribution of every word, given the words so far
• The loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (one-hot for x^(t+1)):
  J^(t)(θ) = CE(y^(t), ŷ^(t)) = −∑_{w∈V} y_w^(t) log ŷ_w^(t) = −log ŷ^(t)_{x^(t+1)}
• Average this to get the overall loss for the entire training set:
  J(θ) = (1/T) ∑_{t=1}^{T} J^(t)(θ)

Training an RNN Language Model
[Figure: running the RNN-LM over the corpus "the students opened their exams …". At each step, the predicted prob dist is scored against the true next word:
  J^(1)(θ) = negative log prob of "students"
  J^(2)(θ) = negative log prob of "opened"
  J^(3)(θ) = negative log prob of "their"
  J^(4)(θ) = negative log prob of "exams"
  …
  J(θ) = (1/T) (J^(1)(θ) + J^(2)(θ) + J^(3)(θ) + J^(4)(θ) + …)]
Feeding the true next word from the corpus (rather than the model's own prediction) as the next input is called "teacher forcing".

Training an RNN Language Model
• However: computing the loss and gradients across the entire corpus at once is too expensive (memory-wise)!
• In practice, consider x^(1), …, x^(T) as a sentence (or a document)
• Recall: Stochastic Gradient Descent allows us to compute loss and gradients for a small chunk of data, and update.
• Compute loss for a sentence (actually, a batch of sentences), compute gradients and update weights. Repeat on a new batch of sentences.

Backpropagation for RNNs
Question: What's the derivative of J^(t)(θ) w.r.t. the repeated weight matrix W_h?
Answer: "The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears":
  ∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i),  where W_h|_(i) denotes the copy of W_h used at step i
Why? The Multivariable Chain Rule: for f(x(t), y(t)),
  d/dt f(x(t), y(t)) = ∂f/∂x · dx/dt + ∂f/∂y · dy/dt
Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

Training the parameters of RNNs: Backpropagation for RNNs
Question: How do we calculate this?
Answer: Backpropagate over timesteps i = t, …, 0, summing gradients as you go. Applying the multivariable chain rule at each appearance of W_h (and noting that ∂W_h|_(i)/∂W_h = 1) gives the sum above. This algorithm is called "backpropagation through time". [Werbos, P.G., 1988, Neural Networks 1, and others]
In practice, backpropagation is often "truncated" after ~20 timesteps for training efficiency reasons.

Generating with an RNN Language Model ("Generating roll outs")
Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. The sampled output becomes the next step's input.
[Figure: starting from "my", sample "favorite", feed it back in, sample "season", then "is", then "spring" — generating "my favorite season is spring"]
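A minimal sketch of the roll-out loop above, using a tiny untrained PyTorch RNN-LM as a stand-in. The module, its sizes, and the start token are illustrative assumptions; the point is only the sample-then-feed-back loop.

```python
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    """Illustrative RNN-LM: embeddings -> RNN -> vocabulary logits."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, token_id, h):
        e = self.embed(token_id).unsqueeze(1)   # (1, 1, embed_dim): one token, one step
        out, h = self.rnn(e, h)                 # one recurrent step
        return self.out(out[:, -1, :]), h       # logits over the next token, new hidden state

model = TinyRNNLM()
token = torch.tensor([0])                       # hypothetical start token (e.g., "my")
h = None                                        # initial hidden state defaults to zeros
generated = []
for _ in range(5):
    logits, h = model.step(token, h)
    probs = torch.softmax(logits, dim=-1)
    token = torch.multinomial(probs, num_samples=1).squeeze(1)  # sample the next word
    generated.append(token.item())              # the sampled output becomes the next input
```

With a trained model, the same loop produces text in the style of whatever corpus the LM was trained on, as in the examples that follow.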
Generating text with an RNN Language Model
Let's have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:
  Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
• RNN-LM trained on Harry Potter:
  Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
• RNN-LM trained on recipes:
  Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
• RNN-LM trained on paint color names:
  Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network
  This is an example of a character-level RNN-LM (predicts what character comes next)

4. Problems with RNNs: Vanishing and Exploding Gradients

Vanishing gradient intuition
[Figure: backpropagating through the unrolled RNN by the chain rule, e.g. ∂J^(4)/∂h^(1) = ∂J^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1)]
What happens if the intermediate factors ∂h^(t)/∂h^(t−1) are small?
Vanishing gradient problem: when these are small, the gradient signal gets smaller and smaller as it backpropagates further.

Why is vanishing gradient a problem?
Gradient signal from far away is lost because it's much smaller than the gradient signal from close-by.
So, model weights are updated only with respect to near effects, not long-term effects.

Effect of vanishing gradient on RNN-LM
• LM task: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ________"
• To learn from this training example, the RNN-LM needs to model the dependency between "tickets" on the 7th step and the target word "tickets" at the end.
• But if the gradient is small, the model can't learn this dependency
• So, the model is unable to predict similar long-distance dependencies at test time

Why is exploding gradient a problem?
• If the gradient becomes too big, then the SGD update step becomes too big:
  θ_new = θ_old − α ∇_θ J(θ)   (α: learning rate, ∇_θ J(θ): gradient)
• This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
• You think you've found a hill to climb, but suddenly you're in Iowa
• In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

Gradient clipping: solution for exploding gradient
• Gradient clipping: if the norm of the gradient is greater than some threshold, scale it down before applying the SGD update
• Intuition: take a step in the same direction, but a smaller step
• In practice, remembering to clip gradients is important, but exploding gradients are an easy problem to solve
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
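A sketch of the clipping rule described above (rescale the gradient when its norm exceeds a threshold; the threshold value is illustrative). PyTorch's built-in torch.nn.utils.clip_grad_norm_ does the same thing.

```python
import torch

def clip_gradients(parameters, threshold=5.0):
    """If the global gradient norm exceeds the threshold, rescale the gradients
    so the norm equals the threshold: same direction, smaller step."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > threshold:
        for g in grads:
            g.mul_(threshold / total_norm)
    return total_norm

# Typical use inside the training loop, between loss.backward() and optimizer.step():
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # built-in equivalent
#   optimizer.step()
```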
How to fix the vanishing gradient problem?
• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• First off next time: How about an RNN with separate memory which is added to?
  • LSTMs
• And then: Creating more direct and linear pass-through connections in the model
  • Attention, residual connections, etc.

5. Recap
• Language Model: A system that predicts the next word
• Recurrent Neural Network: A family of neural networks that:
  • Take sequential input of any length
  • Apply the same weights on each step
  • Can optionally produce output on each step
• Recurrent Neural Network ≠ Language Model
• We've shown that RNNs are a great way to build a LM (despite some problems)
• RNNs are also useful for much more!

Why should we care about Language Modeling?
• Language Modeling is a benchmark task that helps us measure our progress on predicting language use
• Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
  • Predictive typing
  • Speech recognition
  • Handwriting recognition
  • Spelling/grammar correction
  • Authorship identification
  • Machine translation
  • Summarization
  • Dialogue
  • etc.
• Everything else in NLP has now been rebuilt upon Language Modeling: GPT-3 is an LM!

How to fix the vanishing gradient problem?
• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• Could we design an RNN with separate memory which is added to?

Long Short-Term Memory RNNs (LSTMs)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem of vanishing gradients
• Everyone cites that paper, but really a crucial part of the modern LSTM is from Gers et al. (2000) 💜
• Only started to be recognized as promising through the work of Schmidhuber's student Alex Graves c. 2006
  • Work in which he also invented CTC (connectionist temporal classification) for speech recognition
• But only really became well-known after Hinton brought it to Google in 2013
  • Following Graves having been a postdoc with Hinton
Hochreiter and Schmidhuber, 1997. Long short-term memory. https://www.bioinf.jku.at/publications/older/2604.pdf
Gers, Schmidhuber, and Cummins, 2000. Learning to Forget: Continual Prediction with LSTM. https://dl.acm.org/doi/10.1162/089976600300015015
Graves, Fernandez, Gomez, and Schmidhuber, 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. https://www.cs.toronto.edu/~graves/icml_2006.pdf

Long Short-Term Memory RNNs (LSTMs)
• On step t, there is a hidden state h^(t) and a cell state c^(t)
  • Both are vectors of length n
  • The cell stores long-term information
  • The LSTM can read, erase, and write information from the cell
    • The cell becomes conceptually rather like RAM in a computer
• The selection of which information is erased/written/read is controlled by three corresponding gates
  • The gates are also vectors of length n
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
  • The gates are dynamic: their value is computed based on the current context

Long Short-Term Memory (LSTM)
We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t). On timestep t (all of these are vectors of the same length n):
• Forget gate: controls what is kept vs. forgotten from the previous cell state
• Input gate: controls what parts of the new cell content are written to the cell
• Output gate: controls what parts of the cell are output to the hidden state
• New cell content: this is the new content to be written to the cell
• Cell state: erase ("forget") some content from the last cell state, and write ("input") some new cell content
• Hidden state: read ("output") some content from the cell
• Sigmoid function: all gate values are between 0 and 1
• Gates are applied using the element-wise (or Hadamard) product: ⊙
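Written out, these are the standard LSTM update equations (the rendered formulas did not survive extraction, so this is a reconstruction; the per-gate weight names W_*, U_*, b_* are one common convention and the slides' exact symbols may differ slightly):

```latex
\begin{aligned}
\boldsymbol{f}^{(t)} &= \sigma\!\left(\boldsymbol{W}_f \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_f \boldsymbol{x}^{(t)} + \boldsymbol{b}_f\right) && \text{forget gate}\\
\boldsymbol{i}^{(t)} &= \sigma\!\left(\boldsymbol{W}_i \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_i \boldsymbol{x}^{(t)} + \boldsymbol{b}_i\right) && \text{input gate}\\
\boldsymbol{o}^{(t)} &= \sigma\!\left(\boldsymbol{W}_o \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_o \boldsymbol{x}^{(t)} + \boldsymbol{b}_o\right) && \text{output gate}\\
\tilde{\boldsymbol{c}}^{(t)} &= \tanh\!\left(\boldsymbol{W}_c \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_c \boldsymbol{x}^{(t)} + \boldsymbol{b}_c\right) && \text{new cell content}\\
\boldsymbol{c}^{(t)} &= \boldsymbol{f}^{(t)} \odot \boldsymbol{c}^{(t-1)} + \boldsymbol{i}^{(t)} \odot \tilde{\boldsymbol{c}}^{(t)} && \text{cell state}\\
\boldsymbol{h}^{(t)} &= \boldsymbol{o}^{(t)} \odot \tanh \boldsymbol{c}^{(t)} && \text{hidden state}
\end{aligned}
```

Note that the cell update is additive (f ⊙ c^(t−1) plus new content), not a repeated rewrite of the state; that additive path is what the next slides point to as "the secret".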
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:
[Figure: the LSTM cell (colah's diagram): c^(t−1) and h^(t−1) enter from the left; inside, the gates f_t, i_t, o_t and the new cell content c̃_t are computed — compute the forget gate, forget some cell content, compute the input gate, compute the new cell content, write some new cell content, compute the output gate, output some cell content to the hidden state — and c^(t) and h^(t) exit on the right]
The + sign (the additive cell-state update) is the secret!
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

How does LSTM solve vanishing gradients?
• The LSTM architecture makes it much easier for an RNN to preserve information over many timesteps
  • e.g., if the forget gate is set to 1 for a cell dimension and the input gate is set to 0, then the information of that cell is preserved indefinitely.
  • In contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
  • In practice, you get about 100 timesteps of usable memory rather than about 7
• However, there are alternative ways of creating more direct and linear pass-through connections in models for long-distance dependencies

Is vanishing/exploding gradient just an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.
  • Due to the chain rule / choice of nonlinearity function, the gradient can become vanishingly small as it backpropagates
  • Thus, lower layers are learned very slowly (i.e., are hard to train)
• Another solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)
For example:
• Residual connections aka "ResNet"
  • Also known as skip-connections
  • The identity connection preserves information by default
  • This makes deep networks much easier to train
  "Deep Residual Learning for Image Recognition", He et al, 2015. https://arxiv.org/pdf/1512.03385.pdf

Is vanishing/exploding gradient just an RNN problem?
Other methods:
• Dense connections aka "DenseNet"
  • Directly connect each layer to all future layers!
  "Densely Connected Convolutional Networks", Huang et al, 2017. https://arxiv.org/pdf/1608.06993.pdf
• Highway connections aka "HighwayNet"
  • Similar to residual connections, but the identity connection vs. the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks
  "Highway Networks", Srivastava et al, 2015. https://arxiv.org/pdf/1505.00387.pdf
• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al, 1994]
  "Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
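A minimal sketch of the two connection patterns just described, using an arbitrary two-layer transformation as the "learned" part (the module names and sizes are illustrative, not from the cited papers):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = x + F(x): the identity path preserves information by default."""
    def __init__(self, dim=128):
        super().__init__()
        self.transform = nn.Sequential(      # F: an arbitrary learned transformation
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.transform(x)         # skip connection: gradients flow straight through "+ x"

class HighwayBlock(nn.Module):
    """Highway variant: a dynamic gate t(x) mixes the transformation with the identity."""
    def __init__(self, dim=128):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        t = self.gate(x)                     # elements in (0, 1), computed from the input
        return t * self.transform(x) + (1 - t) * x
```

The residual block always adds the identity; the highway block, like an LSTM gate, learns how much of the identity to keep at each position.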
LSTMs: real-world success
• In 2013–2015, LSTMs started achieving state-of-the-art results
  • Successful tasks include handwriting recognition, speech recognition, machine translation, parsing, and image captioning, as well as language models
  • LSTMs became the dominant approach for most NLP tasks
• Now (2019–2023), Transformers have become dominant for all tasks
  • For example, in WMT (a Machine Translation conference + competition):
    • In WMT 2014, there were 0 neural machine translation systems (!)
    • In WMT 2016, the summary report contains "RNN" 44 times (and these systems won)
    • In WMT 2019: "RNN" 7 times, "Transformer" 105 times
Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf

3. Other RNN uses
RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition
[Figure: tagging "the startled cat knocked over the vase" → DT JJ NN VBN IN DT NN]

RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: an RNN runs over "overall I enjoyed the movie a lot"; its hidden states are turned into a single sentence encoding, which is classified as positive]
How to compute the sentence encoding?
• Basic way: use the final hidden state
• Usually better: take the element-wise max or mean of all hidden states

RNN-LMs can be used to generate text based on other information
e.g., speech recognition, machine translation, summarization
[Figure: an RNN-LM conditioned on input audio generates the transcription "what's the weather"]
This is an example of a conditional language model.
We'll see Machine Translation as an example in much more detail.

4. Bidirectional and Multi-layer RNNs: motivation
Task: Sentiment Classification
[Figure: "the movie was terribly exciting !" → RNN hidden states → element-wise mean/max → sentence encoding → positive]
We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.
These contextual representations only contain information about the left context (e.g., "the movie was"). What about right context?
In this example, "exciting" is in the right context, and this modifies the meaning of "terribly" (from negative to positive).

Bidirectional RNNs
[Figure: a Forward RNN and a Backward RNN both run over "the movie was terribly exciting !"; their hidden states are concatenated at each position]
This contextual representation of "terribly" has both left and right context!

Bidirectional RNNs
On timestep t:
  Forward RNN:   h_fw^(t) = RNN_FW(h_fw^(t−1), x^(t))
  Backward RNN:  h_bw^(t) = RNN_BW(h_bw^(t+1), x^(t))
  Concatenated hidden states: h^(t) = [h_fw^(t); h_bw^(t)]
• RNN_FW / RNN_BW is a general notation meaning "compute one forward step of the RNN" – it could be a simple RNN or an LSTM computation.
• Generally, these two RNNs have separate weights.
• We regard the concatenation h^(t) as "the hidden state" of the bidirectional RNN. This is what we pass on to the next parts of the network.
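A minimal PyTorch sketch of the two-RNN scheme above, with separately parameterized forward and backward RNNs and concatenated states (the class name and sizes are illustrative; it assumes unpadded, equal-length sequences):

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Run a forward RNN and a separately parameterized backward RNN,
    then concatenate their hidden states at each position."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn_fw = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # left-to-right
        self.rnn_bw = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # right-to-left (separate weights)

    def forward(self, token_ids):                         # (batch, seq_len)
        e = self.embed(token_ids)                         # (batch, seq_len, embed_dim)
        h_fw, _ = self.rnn_fw(e)                          # forward states at each position
        h_bw_rev, _ = self.rnn_bw(torch.flip(e, dims=[1]))  # run over the reversed sequence
        h_bw = torch.flip(h_bw_rev, dims=[1])             # re-align backward states to original positions
        return torch.cat([h_fw, h_bw], dim=-1)            # (batch, seq_len, 2*hidden_dim)

# e.g., a sentence encoding as the element-wise mean of the concatenated states:
enc = BiRNNEncoder(vocab_size=10000)
states = enc(torch.randint(0, 10000, (2, 6)))             # 2 sentences of 6 tokens
sentence_vec = states.mean(dim=1)                         # (batch, 2*hidden_dim)
```

In practice you would usually let nn.LSTM(..., bidirectional=True) handle this for you; the manual version is only meant to mirror the equations.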
Bidirectional RNNs: simplified diagram
[Figure: "the movie was terribly exciting !" with a single row of hidden states marked with two-way arrows]
The two-way arrows indicate bidirectionality, and the depicted hidden states are assumed to be the concatenated forward+backward states.

Bidirectional RNNs
• Note: bidirectional RNNs are only applicable if you have access to the entire input sequence
  • They are not applicable to Language Modeling, because in LM you only have left context available.
• If you do have the entire input sequence (e.g., any kind of encoding), bidirectionality is powerful (you should use it by default).
• For example, BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system built on bidirectionality.
  • You will learn more about transformers, including BERT, in a couple of weeks!

Multi-layer RNNs
• RNNs are already "deep" on one dimension (they unroll over many timesteps)
• We can also make them "deep" in another dimension by applying multiple RNNs – this is a multi-layer RNN.
• This allows the network to compute more complex representations
  • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
• Multi-layer RNNs are also called stacked RNNs.

Multi-layer RNNs
[Figure: three RNN layers stacked over "the movie was terribly exciting !"]
The hidden states from RNN layer i are the inputs to RNN layer i+1.

Multi-layer RNNs in practice
• Multi-layer or stacked RNNs allow a network to compute more complex representations – they work better than just having one layer of high-dimensional encodings!
  • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
• High-performing RNNs are usually multi-layer (but aren't as deep as convolutional or feed-forward networks)
• For example: in a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
  • Often 2 layers is a lot better than 1, and 3 might be a little better than 2
  • Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers)
• Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
  • You will learn about Transformers later; they have a lot of skip-like connections
"Massive Exploration of Neural Machine Translation Architectures", Britz et al, 2017. https://arxiv.org/pdf/1703.03906.pdf
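A minimal sketch of stacking, where layer i's hidden states are the inputs to layer i+1 (the class name, layer count, and sizes are illustrative, not prescriptions from the slides):

```python
import torch
import torch.nn as nn

class StackedLSTMEncoder(nn.Module):
    """Illustrative multi-layer (stacked) LSTM encoder."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        sizes = [embed_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            [nn.LSTM(sizes[i], sizes[i + 1], batch_first=True) for i in range(num_layers)]
        )

    def forward(self, token_ids):          # (batch, seq_len)
        x = self.embed(token_ids)
        for lstm in self.layers:
            x, _ = lstm(x)                 # layer i's hidden states feed layer i+1
        return x                           # top layer's hidden states, (batch, seq_len, hidden_dim)

# Equivalently, nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
# stacks the layers internally (and accepts dropout between layers).
```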