Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 5: Language Models and Recurrent Neural Networks

2. Language Modeling
• Language Modeling is the task of predicting what word comes next:
  the students opened their ______   (e.g., books, laptops, exams, minds)
• More formally: given a sequence of words x^(1), …, x^(t), compute the probability distribution of the next word x^(t+1):
  P(x^(t+1) | x^(t), …, x^(1)),  where x^(t+1) can be any word in the vocabulary V
• A system that does this is called a Language Model

Language Modeling
• You can also think of a Language Model as a system that assigns a probability to a piece of text
• For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is:
  P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × … × P(x^(T) | x^(T−1), …, x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t−1), …, x^(1))
  This is what our LM provides

How to build a neural language model?
• Recall the Language Modeling task:
  • Input: sequence of words x^(1), …, x^(t)
  • Output: prob. dist. of the next word P(x^(t+1) | x^(t), …, x^(1))
• How about a window-based neural model?
• We saw this applied to Named Entity Recognition in Lecture 2:
  [Figure: a window classifier labeling "Paris" as LOCATION in "… museums in Paris are amazing …"]

A fixed-window neural Language Model
  as the proctor started the clock, the students opened their ______
  [Figure: discard the earlier context ("as the proctor started the clock"); keep only a fixed window ("the students opened their")]

A fixed-window neural Language Model
  [Figure: words / one-hot vectors for "the students opened their" → concatenated word embeddings → hidden layer → output distribution over possible next words such as "books", "laptops", "a", "zoo"]

A fixed-window neural Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don't need to store all observed n-grams
Remaining problems:
• Fixed window is too small
• Enlarging the window enlarges W
• Window can never be large enough!
• x^(1) and x^(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.
We need a neural architecture that can process any length input
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model

3. Recurrent Neural Networks (RNN)
A family of neural architectures
Core idea: Apply the same weights W repeatedly
[Figure: an input sequence x^(1), x^(2), x^(3), … (any length) feeding a chain of hidden states h^(0), h^(1), h^(2), …, with optional outputs at each step]

A Simple RNN Language Model
[Figure: the RNN-LM unrolled over "the students opened their": words / one-hot vectors → word embeddings → hidden states (h^(0) is the initial hidden state) → output distribution over next words such as "books", "laptops", "a", "zoo"]
Note: this input sequence could be much longer now!
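To make the diagram above concrete, here is a minimal NumPy sketch of one step of a simple RNN-LM. The parameterization (embedding matrix E, recurrent weights W_h, input weights W_e, output weights U) follows the usual simple-RNN convention; the exact symbols and dimensions are assumptions for illustration, not taken from the slides, and the weights here are random/untrained.

```python
import numpy as np

# Illustrative sizes: vocabulary size V, embedding dim d, hidden dim n
V, d, n = 10000, 64, 128
rng = np.random.default_rng(0)
E   = rng.normal(size=(d, V)) * 0.01   # word embedding matrix (assumed lookup by one-hot)
W_h = rng.normal(size=(n, n)) * 0.01   # applied to the previous hidden state (the SAME W at every step)
W_e = rng.normal(size=(n, d)) * 0.01   # applied to the current word embedding
U   = rng.normal(size=(V, n)) * 0.01   # maps the hidden state to vocabulary scores
b1, b2 = np.zeros(n), np.zeros(V)

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def rnn_lm_step(x_onehot, h_prev):
    """One timestep: embed the word, update the hidden state, predict the next word."""
    e = E @ x_onehot                          # word embedding e^(t)
    h = np.tanh(W_h @ h_prev + W_e @ e + b1)  # hidden state h^(t) (tanh as the nonlinearity)
    y_hat = softmax(U @ h + b2)               # output distribution over the vocabulary
    return h, y_hat

h = np.zeros(n)                               # h^(0), the initial hidden state
for word_id in [3, 41, 59, 26]:               # hypothetical ids for "the students opened their"
    x = np.zeros(V); x[word_id] = 1.0
    h, y_hat = rnn_lm_step(x, h)              # y_hat approximates P(next word | words so far)
```

Because the same W_h and W_e are reused at every step, the loop can run over a sequence of any length with a fixed number of parameters.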
RNN Language Models
RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn't increase for longer input context
• Same weights applied on every timestep, so there is symmetry in how inputs are processed.
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
More on these later

Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
  • i.e., predict the probability distribution of every word, given the words so far
• The loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) (one-hot for x^(t+1)):
  J^(t)(θ) = CE(y^(t), ŷ^(t)) = −∑_{w∈V} y_w^(t) log ŷ_w^(t) = −log ŷ^(t)_{x^(t+1)}
• Average this to get the overall loss for the entire training set:
  J(θ) = (1/T) ∑_{t=1}^{T} J^(t)(θ)

Training an RNN Language Model
[Figure: running the RNN-LM over the corpus "the students opened their exams …". At each step, the predicted prob dist is scored against the true next word:
  J^(1)(θ) = negative log prob of "students"
  J^(2)(θ) = negative log prob of "opened"
  J^(3)(θ) = negative log prob of "their"
  J^(4)(θ) = negative log prob of "exams"
  …
  J(θ) = (1/T) (J^(1)(θ) + J^(2)(θ) + J^(3)(θ) + J^(4)(θ) + …)]
Feeding the true next word from the corpus (rather than the model's own prediction) as the next input is called "teacher forcing".

Training an RNN Language Model
• However: computing the loss and gradients across the entire corpus at once is too expensive (memory-wise)!
• In practice, consider x^(1), …, x^(T) as a sentence (or a document)
• Recall: Stochastic Gradient Descent allows us to compute loss and gradients for a small chunk of data, and update.
• Compute loss for a sentence (actually, a batch of sentences), compute gradients and update weights. Repeat on a new batch of sentences.

Backpropagation for RNNs
Question: What's the derivative of J^(t)(θ) w.r.t. the repeated weight matrix W_h?
Answer: "The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears":
  ∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i),  where W_h|_(i) denotes the copy of W_h used at step i
Why? The Multivariable Chain Rule: for f(x(t), y(t)),
  d/dt f(x(t), y(t)) = ∂f/∂x · dx/dt + ∂f/∂y · dy/dt
Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

Training the parameters of RNNs: Backpropagation for RNNs
Question: How do we calculate this?
Answer: Backpropagate over timesteps i = t, …, 0, summing gradients as you go. Applying the multivariable chain rule at each appearance of W_h (and noting that ∂W_h|_(i)/∂W_h = 1) gives the sum above. This algorithm is called "backpropagation through time". [Werbos, P.G., 1988, Neural Networks 1, and others]
In practice, backpropagation is often "truncated" after ~20 timesteps for training efficiency reasons.

Generating with an RNN Language Model ("Generating roll outs")
Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. The sampled output becomes the next step's input.
[Figure: starting from "my", sample "favorite", feed it back in, sample "season", then "is", then "spring" — generating "my favorite season is spring"]
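A minimal sketch of the roll-out loop above, using a tiny untrained PyTorch RNN-LM as a stand-in. The module, its sizes, and the start token are illustrative assumptions; the point is only the sample-then-feed-back loop.

```python
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    """Illustrative RNN-LM: embeddings -> RNN -> vocabulary logits."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, token_id, h):
        e = self.embed(token_id).unsqueeze(1)   # (1, 1, embed_dim): one token, one step
        out, h = self.rnn(e, h)                 # one recurrent step
        return self.out(out[:, -1, :]), h       # logits over the next token, new hidden state

model = TinyRNNLM()
token = torch.tensor([0])                       # hypothetical start token (e.g., "my")
h = None                                        # initial hidden state defaults to zeros
generated = []
for _ in range(5):
    logits, h = model.step(token, h)
    probs = torch.softmax(logits, dim=-1)
    token = torch.multinomial(probs, num_samples=1).squeeze(1)  # sample the next word
    generated.append(token.item())              # the sampled output becomes the next input
```

With a trained model, the same loop produces text in the style of whatever corpus the LM was trained on, as in the examples that follow.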
Generating text with an RNN Language Model
Let's have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:
  Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
• RNN-LM trained on Harry Potter:
  Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
• RNN-LM trained on recipes:
  Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
• RNN-LM trained on paint color names:
  Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network
  This is an example of a character-level RNN-LM (predicts what character comes next)

4. Problems with RNNs: Vanishing and Exploding Gradients

Vanishing gradient intuition
[Figure: backpropagating through the unrolled RNN by the chain rule, e.g. ∂J^(4)/∂h^(1) = ∂J^(4)/∂h^(4) · ∂h^(4)/∂h^(3) · ∂h^(3)/∂h^(2) · ∂h^(2)/∂h^(1)]
What happens if the intermediate factors ∂h^(t)/∂h^(t−1) are small?
Vanishing gradient problem: when these are small, the gradient signal gets smaller and smaller as it backpropagates further.

Why is vanishing gradient a problem?
Gradient signal from far away is lost because it's much smaller than the gradient signal from close-by.
So, model weights are updated only with respect to near effects, not long-term effects.

Effect of vanishing gradient on RNN-LM
• LM task: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ________"
• To learn from this training example, the RNN-LM needs to model the dependency between "tickets" on the 7th step and the target word "tickets" at the end.
• But if the gradient is small, the model can't learn this dependency
• So, the model is unable to predict similar long-distance dependencies at test time

Why is exploding gradient a problem?
• If the gradient becomes too big, then the SGD update step becomes too big:
  θ_new = θ_old − α ∇_θ J(θ)   (α: learning rate, ∇_θ J(θ): gradient)
• This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
• You think you've found a hill to climb, but suddenly you're in Iowa
• In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

Gradient clipping: solution for exploding gradient
• Gradient clipping: if the norm of the gradient is greater than some threshold, scale it down before applying the SGD update
• Intuition: take a step in the same direction, but a smaller step
• In practice, remembering to clip gradients is important, but exploding gradients are an easy problem to solve
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
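A sketch of the clipping rule described above (rescale the gradient when its norm exceeds a threshold; the threshold value is illustrative). PyTorch's built-in torch.nn.utils.clip_grad_norm_ does the same thing.

```python
import torch

def clip_gradients(parameters, threshold=5.0):
    """If the global gradient norm exceeds the threshold, rescale the gradients
    so the norm equals the threshold: same direction, smaller step."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > threshold:
        for g in grads:
            g.mul_(threshold / total_norm)
    return total_norm

# Typical use inside the training loop, between loss.backward() and optimizer.step():
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # built-in equivalent
#   optimizer.step()
```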
How to fix the vanishing gradient problem?
• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• First off next time: How about an RNN with separate memory which is added to?
  • LSTMs
• And then: Creating more direct and linear pass-through connections in the model
  • Attention, residual connections, etc.

5. Recap
• Language Model: A system that predicts the next word
• Recurrent Neural Network: A family of neural networks that:
  • Take sequential input of any length
  • Apply the same weights on each step
  • Can optionally produce output on each step
• Recurrent Neural Network ≠ Language Model
• We've shown that RNNs are a great way to build a LM (despite some problems)
• RNNs are also useful for much more!

Why should we care about Language Modeling?
• Language Modeling is a benchmark task that helps us measure our progress on predicting language use
• Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text:
  • Predictive typing
  • Speech recognition
  • Handwriting recognition
  • Spelling/grammar correction
  • Authorship identification
  • Machine translation
  • Summarization
  • Dialogue
  • etc.
• Everything else in NLP has now been rebuilt upon Language Modeling: GPT-3 is an LM!

How to fix the vanishing gradient problem?
• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• Could we design an RNN with separate memory which is added to?

Long Short-Term Memory RNNs (LSTMs)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the problem of vanishing gradients
• Everyone cites that paper, but really a crucial part of the modern LSTM is from Gers et al. (2000) 💜
• Only started to be recognized as promising through the work of Schmidhuber's student Alex Graves c. 2006
  • Work in which he also invented CTC (connectionist temporal classification) for speech recognition
• But only really became well-known after Hinton brought it to Google in 2013
  • Following Graves having been a postdoc with Hinton
Hochreiter and Schmidhuber, 1997. Long short-term memory. https://www.bioinf.jku.at/publications/older/2604.pdf
Gers, Schmidhuber, and Cummins, 2000. Learning to Forget: Continual Prediction with LSTM. https://dl.acm.org/doi/10.1162/089976600300015015
Graves, Fernandez, Gomez, and Schmidhuber, 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. https://www.cs.toronto.edu/~graves/icml_2006.pdf

Long Short-Term Memory RNNs (LSTMs)
• On step t, there is a hidden state h^(t) and a cell state c^(t)
  • Both are vectors of length n
  • The cell stores long-term information
  • The LSTM can read, erase, and write information from the cell
    • The cell becomes conceptually rather like RAM in a computer
• The selection of which information is erased/written/read is controlled by three corresponding gates
  • The gates are also vectors of length n
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
  • The gates are dynamic: their value is computed based on the current context

Long Short-Term Memory (LSTM)
We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t). On timestep t (all of these are vectors of the same length n):
• Forget gate: controls what is kept vs. forgotten from the previous cell state
• Input gate: controls what parts of the new cell content are written to the cell
• Output gate: controls what parts of the cell are output to the hidden state
• New cell content: this is the new content to be written to the cell
• Cell state: erase ("forget") some content from the last cell state, and write ("input") some new cell content
• Hidden state: read ("output") some content from the cell
• Sigmoid function: all gate values are between 0 and 1
• Gates are applied using the element-wise (or Hadamard) product: ⊙
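Written out, these are the standard LSTM update equations (the rendered formulas did not survive extraction, so this is a reconstruction; the per-gate weight names W_*, U_*, b_* are one common convention and the slides' exact symbols may differ slightly):

```latex
\begin{aligned}
\boldsymbol{f}^{(t)} &= \sigma\!\left(\boldsymbol{W}_f \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_f \boldsymbol{x}^{(t)} + \boldsymbol{b}_f\right) && \text{forget gate}\\
\boldsymbol{i}^{(t)} &= \sigma\!\left(\boldsymbol{W}_i \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_i \boldsymbol{x}^{(t)} + \boldsymbol{b}_i\right) && \text{input gate}\\
\boldsymbol{o}^{(t)} &= \sigma\!\left(\boldsymbol{W}_o \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_o \boldsymbol{x}^{(t)} + \boldsymbol{b}_o\right) && \text{output gate}\\
\tilde{\boldsymbol{c}}^{(t)} &= \tanh\!\left(\boldsymbol{W}_c \boldsymbol{h}^{(t-1)} + \boldsymbol{U}_c \boldsymbol{x}^{(t)} + \boldsymbol{b}_c\right) && \text{new cell content}\\
\boldsymbol{c}^{(t)} &= \boldsymbol{f}^{(t)} \odot \boldsymbol{c}^{(t-1)} + \boldsymbol{i}^{(t)} \odot \tilde{\boldsymbol{c}}^{(t)} && \text{cell state}\\
\boldsymbol{h}^{(t)} &= \boldsymbol{o}^{(t)} \odot \tanh \boldsymbol{c}^{(t)} && \text{hidden state}
\end{aligned}
```

Note that the cell update is additive (f ⊙ c^(t−1) plus new content), not a repeated rewrite of the state; that additive path is what the next slides point to as "the secret".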
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:
[Figure: the LSTM cell (colah's diagram): c^(t−1) and h^(t−1) enter from the left; inside, the gates f_t, i_t, o_t and the new cell content c̃_t are computed — compute the forget gate, forget some cell content, compute the input gate, compute the new cell content, write some new cell content, compute the output gate, output some cell content to the hidden state — and c^(t) and h^(t) exit on the right]
The + sign (the additive cell-state update) is the secret!
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

How does LSTM solve vanishing gradients?
• The LSTM architecture makes it much easier for an RNN to preserve information over many timesteps
  • e.g., if the forget gate is set to 1 for a cell dimension and the input gate is set to 0, then the information of that cell is preserved indefinitely.
  • In contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
  • In practice, you get about 100 timesteps of usable memory rather than about 7
• However, there are alternative ways of creating more direct and linear pass-through connections in models for long-distance dependencies

Is vanishing/exploding gradient just an RNN problem?
• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.
  • Due to the chain rule / choice of nonlinearity function, the gradient can become vanishingly small as it backpropagates
  • Thus, lower layers are learned very slowly (i.e., are hard to train)
• Another solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)
For example:
• Residual connections aka "ResNet"
  • Also known as skip-connections
  • The identity connection preserves information by default
  • This makes deep networks much easier to train
  "Deep Residual Learning for Image Recognition", He et al, 2015. https://arxiv.org/pdf/1512.03385.pdf

Is vanishing/exploding gradient just an RNN problem?
Other methods:
• Dense connections aka "DenseNet"
  • Directly connect each layer to all future layers!
  "Densely Connected Convolutional Networks", Huang et al, 2017. https://arxiv.org/pdf/1608.06993.pdf
• Highway connections aka "HighwayNet"
  • Similar to residual connections, but the identity connection vs. the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks
  "Highway Networks", Srivastava et al, 2015. https://arxiv.org/pdf/1505.00387.pdf
• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al, 1994]
  "Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
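A minimal sketch of the two connection patterns just described, using an arbitrary two-layer transformation as the "learned" part (the module names and sizes are illustrative, not from the cited papers):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = x + F(x): the identity path preserves information by default."""
    def __init__(self, dim=128):
        super().__init__()
        self.transform = nn.Sequential(      # F: an arbitrary learned transformation
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.transform(x)         # skip connection: gradients flow straight through "+ x"

class HighwayBlock(nn.Module):
    """Highway variant: a dynamic gate t(x) mixes the transformation with the identity."""
    def __init__(self, dim=128):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        t = self.gate(x)                     # elements in (0, 1), computed from the input
        return t * self.transform(x) + (1 - t) * x
```

The residual block always adds the identity; the highway block, like an LSTM gate, learns how much of the identity to keep at each position.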
LSTMs: real-world success
• In 2013–2015, LSTMs started achieving state-of-the-art results
  • Successful tasks include handwriting recognition, speech recognition, machine translation, parsing, and image captioning, as well as language models
  • LSTMs became the dominant approach for most NLP tasks
• Now (2019–2023), Transformers have become dominant for all tasks
  • For example, in WMT (a Machine Translation conference + competition):
    • In WMT 2014, there were 0 neural machine translation systems (!)
    • In WMT 2016, the summary report contains "RNN" 44 times (and these systems won)
    • In WMT 2019: "RNN" 7 times, "Transformer" 105 times
Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, http://www.statmt.org/wmt18/pdf/WMT028.pdf

3. Other RNN uses
RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition
[Figure: tagging "the startled cat knocked over the vase" → DT JJ NN VBN IN DT NN]

RNNs can be used as a sentence encoder model
e.g., for sentiment classification
[Figure: an RNN runs over "overall I enjoyed the movie a lot"; its hidden states are turned into a single sentence encoding, which is classified as positive]
How to compute the sentence encoding?
• Basic way: use the final hidden state
• Usually better: take the element-wise max or mean of all hidden states

RNN-LMs can be used to generate text based on other information
e.g., speech recognition, machine translation, summarization
[Figure: an RNN-LM conditioned on input audio generates the transcription "what's the weather"]
This is an example of a conditional language model.
We'll see Machine Translation as an example in much more detail.

4. Bidirectional and Multi-layer RNNs: motivation
Task: Sentiment Classification
[Figure: "the movie was terribly exciting !" → RNN hidden states → element-wise mean/max → sentence encoding → positive]
We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.
These contextual representations only contain information about the left context (e.g., "the movie was"). What about right context?
In this example, "exciting" is in the right context, and this modifies the meaning of "terribly" (from negative to positive).

Bidirectional RNNs
[Figure: a Forward RNN and a Backward RNN both run over "the movie was terribly exciting !"; their hidden states are concatenated at each position]
This contextual representation of "terribly" has both left and right context!

Bidirectional RNNs
On timestep t:
  Forward RNN:   h_fw^(t) = RNN_FW(h_fw^(t−1), x^(t))
  Backward RNN:  h_bw^(t) = RNN_BW(h_bw^(t+1), x^(t))
  Concatenated hidden states: h^(t) = [h_fw^(t); h_bw^(t)]
• RNN_FW / RNN_BW is a general notation meaning "compute one forward step of the RNN" – it could be a simple RNN or an LSTM computation.
• Generally, these two RNNs have separate weights.
• We regard the concatenation h^(t) as "the hidden state" of the bidirectional RNN. This is what we pass on to the next parts of the network.
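A minimal PyTorch sketch of the two-RNN scheme above, with separately parameterized forward and backward RNNs and concatenated states (the class name and sizes are illustrative; it assumes unpadded, equal-length sequences):

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Run a forward RNN and a separately parameterized backward RNN,
    then concatenate their hidden states at each position."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn_fw = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # left-to-right
        self.rnn_bw = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # right-to-left (separate weights)

    def forward(self, token_ids):                         # (batch, seq_len)
        e = self.embed(token_ids)                         # (batch, seq_len, embed_dim)
        h_fw, _ = self.rnn_fw(e)                          # forward states at each position
        h_bw_rev, _ = self.rnn_bw(torch.flip(e, dims=[1]))  # run over the reversed sequence
        h_bw = torch.flip(h_bw_rev, dims=[1])             # re-align backward states to original positions
        return torch.cat([h_fw, h_bw], dim=-1)            # (batch, seq_len, 2*hidden_dim)

# e.g., a sentence encoding as the element-wise mean of the concatenated states:
enc = BiRNNEncoder(vocab_size=10000)
states = enc(torch.randint(0, 10000, (2, 6)))             # 2 sentences of 6 tokens
sentence_vec = states.mean(dim=1)                         # (batch, 2*hidden_dim)
```

In practice you would usually let nn.LSTM(..., bidirectional=True) handle this for you; the manual version is only meant to mirror the equations.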
Bidirectional RNNs: simplified diagram
[Figure: "the movie was terribly exciting !" with a single row of hidden states marked with two-way arrows]
The two-way arrows indicate bidirectionality, and the depicted hidden states are assumed to be the concatenated forward+backward states.

Bidirectional RNNs
• Note: bidirectional RNNs are only applicable if you have access to the entire input sequence
  • They are not applicable to Language Modeling, because in LM you only have left context available.
• If you do have the entire input sequence (e.g., any kind of encoding), bidirectionality is powerful (you should use it by default).
• For example, BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system built on bidirectionality.
  • You will learn more about transformers, including BERT, in a couple of weeks!

Multi-layer RNNs
• RNNs are already "deep" on one dimension (they unroll over many timesteps)
• We can also make them "deep" in another dimension by applying multiple RNNs – this is a multi-layer RNN.
• This allows the network to compute more complex representations
  • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
• Multi-layer RNNs are also called stacked RNNs.

Multi-layer RNNs
[Figure: three RNN layers stacked over "the movie was terribly exciting !"]
The hidden states from RNN layer i are the inputs to RNN layer i+1.

Multi-layer RNNs in practice
• Multi-layer or stacked RNNs allow a network to compute more complex representations – they work better than just having one layer of high-dimensional encodings!
  • The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
• High-performing RNNs are usually multi-layer (but aren't as deep as convolutional or feed-forward networks)
• For example: in a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN
  • Often 2 layers is a lot better than 1, and 3 might be a little better than 2
  • Usually, skip-connections/dense-connections are needed to train deeper RNNs (e.g., 8 layers)
• Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
  • You will learn about Transformers later; they have a lot of skip-like connections
"Massive Exploration of Neural Machine Translation Architectures", Britz et al, 2017. https://arxiv.org/pdf/1703.03906.pdf
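A minimal sketch of stacking, where layer i's hidden states are the inputs to layer i+1 (the class name, layer count, and sizes are illustrative, not prescriptions from the slides):

```python
import torch
import torch.nn as nn

class StackedLSTMEncoder(nn.Module):
    """Illustrative multi-layer (stacked) LSTM encoder."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        sizes = [embed_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            [nn.LSTM(sizes[i], sizes[i + 1], batch_first=True) for i in range(num_layers)]
        )

    def forward(self, token_ids):          # (batch, seq_len)
        x = self.embed(token_ids)
        for lstm in self.layers:
            x, _ = lstm(x)                 # layer i's hidden states feed layer i+1
        return x                           # top layer's hidden states, (batch, seq_len, hidden_dim)

# Equivalently, nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
# stacks the layers internally (and accepts dropout between layers).
```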