Neural Networks: Language Models
Huda Khayrallah
slides by Philipp Koehn
4 October 2017

N-Gram Backoff Language Model
• Previously, we approximated p(W) = p(w_1, w_2, ..., w_n)
• ... by applying the chain rule p(W) = ∏_i p(w_i | w_1, ..., w_{i−1})
• ... and limiting the history (Markov order) p(w_i | w_1, ..., w_{i−1}) ≈ p(w_i | w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1})
• Each p(w_i | w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1}) may not have enough statistics to estimate
  → we back off to p(w_i | w_{i−3}, w_{i−2}, w_{i−1}), p(w_i | w_{i−2}, w_{i−1}), etc., all the way to p(w_i)
  – exact details of backing off get complicated ("interpolated Kneser-Ney")

Refinements
• A whole family of back-off schemes
• Skip n-gram models that may back off to p(w_i | w_{i−2})
• Class-based models p(C(w_i) | C(w_{i−4}), C(w_{i−3}), C(w_{i−2}), C(w_{i−1}))
⇒ We are wrestling here with
  – using as much relevant evidence as possible
  – pooling evidence between words

First Sketch
[Figure: words 1–4 feed into a hidden layer, which predicts word 5]

Representing Words
• Words are represented with a one-hot vector, e.g.,
  – dog = (0,0,0,0,1,0,0,0,0,...)
  – cat = (0,0,0,0,0,0,0,1,0,...)
  – eat = (0,1,0,0,0,0,0,0,0,...)
• That's a large vector!
• Remedies
  – limit to, say, the 20,000 most frequent words; the rest are OTHER
  – place words in √n classes, so each word is represented by
    ∗ 1 class label
    ∗ 1 word-in-class label

Word Classes for Two-Hot Representations
• WordNet classes
• Brown clusters
• Frequency binning
  – sort words by frequency
  – place them in order into classes
  – each class has the same token count
  → very frequent words have their own class
  → rare words share a class with many other words
• Anything goes: assign words randomly to classes

Second Sketch
[Figure: as before, but words 1–4 are represented by class + word-in-class (two-hot) vectors feeding the hidden layer that predicts word 5]

word embeddings

Add a Hidden Layer
[Figure: words 1–4 are each mapped by a shared matrix C into an embedding layer, which feeds the hidden layer that predicts word 5]
• Map each word first into a lower-dimensional real-valued space
• Shared weight matrix C

Details (Bengio et al., 2003)
• Add direct connections from embedding layer to output layer
• Activation functions
  – input→embedding: none
  – embedding→hidden: tanh
  – hidden→output: softmax
• Training
  – loop through the entire corpus
  – update weights based on the mismatch between predicted probabilities and the one-hot vector of the output word
  (a code sketch of this forward pass follows the word embedding slides below)

Word Embeddings
[Figure: a one-hot word vector multiplied by the matrix C yields the word embedding]
• By-product: embedding of a word into continuous space
• Similar contexts → similar embedding
• Recall: distributional semantics

Word Embeddings
[Figure: visualization of word embeddings]

Word Embeddings
[Figure: visualization of word embeddings]
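To make the Bengio-style feed-forward language model above concrete, here is a minimal numpy sketch of its forward pass. The dimensions, the 4-word context, and the weight names are illustrative assumptions rather than the original implementation, and the direct embedding→output connections are left out for brevity.

```python
import numpy as np

# Illustrative sizes; the vocabulary cut-off of 20,000 follows the slides,
# the embedding/hidden sizes and 4-word context are assumptions.
V, E, H, CONTEXT = 20000, 100, 200, 4

rng = np.random.default_rng(0)
C   = rng.normal(0, 0.1, (V, E))            # shared embedding matrix C (one row per word)
W_h = rng.normal(0, 0.1, (CONTEXT * E, H))  # embedding layer -> hidden layer
b_h = np.zeros(H)
W_o = rng.normal(0, 0.1, (H, V))            # hidden layer -> output layer
b_o = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

def predict_next_word(context_ids):
    """p(w_i | w_{i-4}, ..., w_{i-1}) for a context of 4 word ids."""
    x = C[context_ids].reshape(-1)          # look up and concatenate embeddings (no activation)
    h = np.tanh(x @ W_h + b_h)              # embedding -> hidden: tanh
    return softmax(h @ W_o + b_o)           # hidden -> output: softmax over the vocabulary

p = predict_next_word([4, 7, 1, 4])         # hypothetical word ids
print(p.shape, p.sum())                     # (20000,) ~1.0
```

Training would loop through the corpus and back-propagate the difference between this predicted distribution and the one-hot vector of the observed next word, as described on the slide.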
Are Word Embeddings Magic?
• Morphosyntactic regularities (Mikolov et al., 2013)
  – adjectives: base form vs. comparative, e.g., good, better
  – nouns: singular vs. plural, e.g., year, years
  – verbs: present tense vs. past tense, e.g., see, saw
• Semantic regularities
  – clothing is to shirt as dish is to bowl
  – evaluated on human judgment data of semantic similarities

recurrent neural networks

Recurrent Neural Networks
[Figure: word 1 is embedded and combined with a hidden layer H, whose nodes are initialized to 1, to predict word 2]
• Start: predict second word from first
• Mystery layer with nodes all with value 1

Recurrent Neural Networks
[Figure: the hidden layer values are copied over as additional input when predicting word 3 from word 2]

Recurrent Neural Networks
[Figure: and copied over again when predicting word 4 from word 3, and so on]

Training
[Figure: first training example: predict word 2 from word 1, with the hidden layer initialized to ones]
• Process first training example
• Update weights with back-propagation

Training
[Figure: second training example: predict word 3 from word 2 and the copied hidden layer]
• Process second training example
• Update weights with back-propagation
• And so on...
• But: no feedback to previous history

Back-Propagation Through Time
[Figure: the recurrent network unfolded over three time steps]
• After processing a few training examples, update through the unfolded recurrent neural network

Back-Propagation Through Time
• Carry out back-propagation through time (BPTT) after each training example
  – 5 time steps seem to be sufficient
  – the network learns to store information for more than 5 time steps
• Or: update in mini-batches
  – process 10–20 training examples
  – update backwards through all examples
  – removes the need for multiple steps for each training example
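As a companion to the recurrent sketches above, a minimal numpy version of the forward pass: each step predicts the next word from the current word's embedding and the copied-over hidden state, with the hidden state initialized to all ones as in the "mystery layer". Sizes and weight names are made up for illustration, and training (BPTT) is not shown.

```python
import numpy as np

V, E, H = 20000, 100, 200                    # illustrative sizes (assumptions)
rng = np.random.default_rng(0)
C     = rng.normal(0, 0.1, (V, E))           # word embeddings
W_in  = rng.normal(0, 0.1, (E, H))           # embedding -> hidden
W_hh  = rng.normal(0, 0.1, (H, H))           # hidden -> hidden (the recurrence)
W_out = rng.normal(0, 0.1, (H, V))           # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_language_model(word_ids):
    """Return p(next word) after each position of the input sequence."""
    h = np.ones(H)                           # the "mystery layer": all nodes start at value 1
    predictions = []
    for w in word_ids:
        h = np.tanh(C[w] @ W_in + h @ W_hh)  # combine current word with copied-over history
        predictions.append(softmax(h @ W_out))
    return predictions

probs = rnn_language_model([4, 7, 1])        # hypothetical word ids
print(len(probs), probs[-1].shape)           # 3 (20000,)
```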
long short term memory

Vanishing Gradients
• Error is propagated to previous steps
• Updates consider
  – the prediction at that time step
  – the impact on future time steps
• Vanishing gradient: the propagated error disappears

Recent vs. Early History
• Hidden layer plays double duty
  – memory of the network
  – continuous space representation used to predict output words
• Sometimes only recent context is important
    After much economic progress over the years, the country → has
• Sometimes much earlier context is important
    The country which has made much economic progress over the years still → has

Long Short Term Memory (LSTM)
• Design quite elaborate, although not very complicated to use
• Basic building block: LSTM cell
  – similar to a node in a hidden layer
  – but: has an explicit memory state
• How output and memory state change depends on gates
  – input gate: how much new input changes the memory state
  – forget gate: how much of the prior memory state is retained
  – output gate: how strongly the memory state is passed on to the next layer
• Gates are not just open (1) or closed (0), but can be slightly ajar (e.g., 0.2)

LSTM Cell
[Figure: an LSTM cell with input, forget, and output gates; input X from the preceding layer and the cell's state from time t−1 are combined to update the memory m and produce the output h, which feed the next layer (Y) and the LSTM layer at time t]

LSTM Cell (Math)
• Memory and output values at time step t
    memory_t = gate_input × input_t + gate_forget × memory_{t−1}
    output_t = gate_output × memory_t
• The hidden node value h_t passed on to the next layer applies an activation function f
    h_t = f(output_t)
• Input is computed as in a recurrent neural network node
  – given node values for the prior layer x^t = (x_1^t, ..., x_X^t)
  – given values for the hidden layer from the previous time step h^{t−1} = (h_1^{t−1}, ..., h_H^{t−1})
  – the input value is a combination of matrix multiplications with weights w^x and w^h and an activation function g
    input_t = g( Σ_{i=1}^{X} w_i^x x_i^t + Σ_{i=1}^{H} w_i^h h_i^{t−1} )
  (a code sketch of one LSTM step follows the GRU slides below)

Values for Gates
• Gates are very important
• How do we compute their value? → with a neural network layer!
• For each gate a ∈ (input, forget, output)
  – weight matrix W^{xa} to consider node values in the previous layer x^t
  – weight matrix W^{ha} to consider the hidden layer h^{t−1} at the previous time step
  – weight matrix W^{ma} to consider the memory memory^{t−1} at the previous time step
  – activation function h
    gate_a = h( Σ_{i=1}^{X} w_i^{xa} x_i^t + Σ_{i=1}^{H} w_i^{ha} h_i^{t−1} + Σ_{i=1}^{H} w_i^{ma} memory_i^{t−1} )

Training
• LSTMs are trained the same way as recurrent neural networks
• Back-propagation through time
• This all looks very complex, but:
  – all the operations are still based on
    ∗ matrix multiplications
    ∗ differentiable activation functions
  → we can compute gradients for the objective function with respect to all parameters
  → we can compute update functions

What is the Point? (from Tran, Bisazza, Monz, 2016)
• Each node has a memory memory_i independent from its current output h_i
• Memory may be carried through unchanged (gate_input^i = 0, gate_forget^i = 1)
⇒ can remember important features over a long time span (capture long-distance dependencies)

Visualizing Individual Cells
Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"
[Figure: activations of individual cells over text]

Visualizing Individual Cells
[Figure: further examples of individual cell activations]

Gated Recurrent Unit (GRU)
[Figure: a GRU cell with update and reset gates; input X from the preceding layer and the state from time t−1 are combined into a new state, which feeds the next layer (Y) and the GRU layer at time t]

Gated Recurrent Unit (Math)
• Two gates
    update_t = g(W_update · input_t + U_update · state_{t−1} + bias_update)
    reset_t = g(W_reset · input_t + U_reset · state_{t−1} + bias_reset)
• Combination of input and previous state (similar to a traditional recurrent neural network)
    combination_t = f(W · input_t + U · (reset_t ◦ state_{t−1}))
• Interpolation with previous state
    state_t = (1 − update_t) ◦ state_{t−1} + update_t ◦ combination_t
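A minimal numpy sketch of a single LSTM step, following the memory, output, input, and gate equations on the LSTM Cell (Math) and Values for Gates slides above. The layer sizes are made up, the per-cell sums are vectorized into matrix multiplications, and the choice of the sigmoid for the gate activation h and tanh for g and f is an assumption.

```python
import numpy as np

X, H = 100, 200                               # illustrative layer sizes (assumptions)
rng = np.random.default_rng(0)

def init(shape):
    return rng.normal(0, 0.1, shape)

# One weight matrix per connection, vectorizing the per-cell sums of the slides.
Wx_in, Wh_in = init((X, H)), init((H, H))     # input computation: w^x, w^h
gate_weights = {a: (init((X, H)), init((H, H)), init((H, H)))   # W^{xa}, W^{ha}, W^{ma}
                for a in ("input", "forget", "output")}

def sigmoid(z):                               # assumed choice for the gate activation h
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, memory_prev, g=np.tanh, f=np.tanh):
    """One LSTM time step: returns (h_t, memory_t)."""
    def gate(a):
        Wxa, Wha, Wma = gate_weights[a]
        return sigmoid(x_t @ Wxa + h_prev @ Wha + memory_prev @ Wma)   # value in (0, 1)

    input_t  = g(x_t @ Wx_in + h_prev @ Wh_in)                         # candidate input
    memory_t = gate("input") * input_t + gate("forget") * memory_prev  # update memory state
    output_t = gate("output") * memory_t                               # gated output
    return f(output_t), memory_t                                       # h_t goes to the next layer

h, memory = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):             # run over 5 made-up input vectors
    h, memory = lstm_step(x, h, memory)
print(h.shape, memory.shape)                  # (200,) (200,)
```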
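And a corresponding sketch of one GRU step, directly following the update, reset, combination, and interpolation equations above; again, all sizes are illustrative and the sigmoid/tanh choices for g and f are assumptions.

```python
import numpy as np

X, H = 100, 200                               # illustrative sizes (assumptions)
rng = np.random.default_rng(1)

def init(shape):
    return rng.normal(0, 0.1, shape)

W_update, U_update, b_update = init((X, H)), init((H, H)), np.zeros(H)
W_reset,  U_reset,  b_reset  = init((X, H)), init((H, H)), np.zeros(H)
W, U = init((X, H)), init((H, H))             # for the combination of input and previous state

def sigmoid(z):                               # assumed choice for g
    return 1 / (1 + np.exp(-z))

def gru_step(input_t, state_prev, g=sigmoid, f=np.tanh):
    """One GRU time step: returns state_t."""
    update_t = g(input_t @ W_update + state_prev @ U_update + b_update)
    reset_t  = g(input_t @ W_reset  + state_prev @ U_reset  + b_reset)
    combination_t = f(input_t @ W + (reset_t * state_prev) @ U)        # reset scales the old state
    return (1 - update_t) * state_prev + update_t * combination_t      # interpolate old and new

state = np.zeros(H)
for x in rng.normal(size=(5, X)):             # run over 5 made-up input vectors
    state = gru_step(x, state)
print(state.shape)                            # (200,)
```

Note the design difference from the LSTM: the GRU has no separate memory state and no output gate; the update gate directly interpolates between the old state and the new combination.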
deeper models

Deep Learning?
[Figure: input → one hidden layer → output]
• Not much deep learning so far
• Between prediction from input to output: only 1 hidden layer
• How about more hidden layers?

Deep Models
[Figure: a shallow model with one hidden layer, compared to deep stacked and deep transition models with hidden layers 1–3 between input and output]

questions?