Neural Networks: Language Models
Huda Khayrallah
slides by Philipp Koehn
4 October 2017

N-Gram Backoff Language Model
• Previously, we approximated p(W) = p(w_1, w_2, ..., w_n)
• ... by applying the chain rule p(W) = ∏_i p(w_i | w_1, ..., w_{i−1})
• ... and limiting the history (Markov order) p(w_i | w_1, ..., w_{i−1}) ≈ p(w_i | w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1})
• Each p(w_i | w_{i−4}, w_{i−3}, w_{i−2}, w_{i−1}) may not have enough statistics to estimate
  → we back off to p(w_i | w_{i−3}, w_{i−2}, w_{i−1}), p(w_i | w_{i−2}, w_{i−1}), etc., all the way to p(w_i)
  – exact details of backing off get complicated ("interpolated Kneser-Ney")

Refinements
• A whole family of back-off schemes
• Skip n-gram models that may back off to p(w_i | w_{i−2})
• Class-based models p(C(w_i) | C(w_{i−4}), C(w_{i−3}), C(w_{i−2}), C(w_{i−1}))
⇒ We are wrestling here with
  – using as much relevant evidence as possible
  – pooling evidence between words

First Sketch
[Figure: words 1–4 feed into a hidden layer, which predicts word 5]

Representing Words
• Words are represented with a one-hot vector, e.g.,
  – dog = (0,0,0,0,1,0,0,0,0,...)
  – cat = (0,0,0,0,0,0,0,1,0,...)
  – eat = (0,1,0,0,0,0,0,0,0,...)
• That's a large vector!
• Remedies
  – limit to, say, the 20,000 most frequent words; the rest are OTHER
  – place words in √n classes, so each word is represented by
    ∗ 1 class label
    ∗ 1 word-in-class label

Word Classes for Two-Hot Representations
• WordNet classes
• Brown clusters
• Frequency binning
  – sort words by frequency
  – place them in order into classes
  – each class has the same token count
  → very frequent words have their own class
  → rare words share a class with many other words
• Anything goes: assign words randomly to classes

Second Sketch
[Figure: as before, but words 1–4 are represented by class + word-in-class (two-hot) vectors feeding the hidden layer that predicts word 5]

word embeddings

Add a Hidden Layer
[Figure: words 1–4 are each mapped by a shared matrix C into an embedding layer, which feeds the hidden layer that predicts word 5]
• Map each word first into a lower-dimensional real-valued space
• Shared weight matrix C

Details (Bengio et al., 2003)
• Add direct connections from embedding layer to output layer
• Activation functions
  – input→embedding: none
  – embedding→hidden: tanh
  – hidden→output: softmax
• Training
  – loop through the entire corpus
  – update weights based on the mismatch between predicted probabilities and the one-hot vector of the output word
  (a code sketch of this forward pass follows the word embedding slides below)

Word Embeddings
[Figure: a one-hot word vector multiplied by the matrix C yields the word embedding]
• By-product: embedding of a word into continuous space
• Similar contexts → similar embedding
• Recall: distributional semantics

Word Embeddings
[Figure: visualization of word embeddings]

Word Embeddings
[Figure: visualization of word embeddings]
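To make the Bengio-style feed-forward language model above concrete, here is a minimal numpy sketch of its forward pass. The dimensions, the 4-word context, and the weight names are illustrative assumptions rather than the original implementation, and the direct embedding→output connections are left out for brevity.

```python
import numpy as np

# Illustrative sizes; the vocabulary cut-off of 20,000 follows the slides,
# the embedding/hidden sizes and 4-word context are assumptions.
V, E, H, CONTEXT = 20000, 100, 200, 4

rng = np.random.default_rng(0)
C   = rng.normal(0, 0.1, (V, E))            # shared embedding matrix C (one row per word)
W_h = rng.normal(0, 0.1, (CONTEXT * E, H))  # embedding layer -> hidden layer
b_h = np.zeros(H)
W_o = rng.normal(0, 0.1, (H, V))            # hidden layer -> output layer
b_o = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

def predict_next_word(context_ids):
    """p(w_i | w_{i-4}, ..., w_{i-1}) for a context of 4 word ids."""
    x = C[context_ids].reshape(-1)          # look up and concatenate embeddings (no activation)
    h = np.tanh(x @ W_h + b_h)              # embedding -> hidden: tanh
    return softmax(h @ W_o + b_o)           # hidden -> output: softmax over the vocabulary

p = predict_next_word([4, 7, 1, 4])         # hypothetical word ids
print(p.shape, p.sum())                     # (20000,) ~1.0
```

Training would loop through the corpus and back-propagate the difference between this predicted distribution and the one-hot vector of the observed next word, as described on the slide.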
Are Word Embeddings Magic?
• Morphosyntactic regularities (Mikolov et al., 2013)
  – adjectives: base form vs. comparative, e.g., good, better
  – nouns: singular vs. plural, e.g., year, years
  – verbs: present tense vs. past tense, e.g., see, saw
• Semantic regularities
  – clothing is to shirt as dish is to bowl
  – evaluated on human judgment data of semantic similarities

recurrent neural networks

Recurrent Neural Networks
[Figure: word 1 is embedded and combined with a hidden layer H, whose nodes are initialized to 1, to predict word 2]
• Start: predict second word from first
• Mystery layer with nodes all with value 1

Recurrent Neural Networks
[Figure: the hidden layer values are copied over as additional input when predicting word 3 from word 2]

Recurrent Neural Networks
[Figure: and copied over again when predicting word 4 from word 3, and so on]

Training
[Figure: first training example: predict word 2 from word 1, with the hidden layer initialized to ones]
• Process first training example
• Update weights with back-propagation

Training
[Figure: second training example: predict word 3 from word 2 and the copied hidden layer]
• Process second training example
• Update weights with back-propagation
• And so on...
• But: no feedback to previous history

Back-Propagation Through Time
[Figure: the recurrent network unfolded over three time steps]
• After processing a few training examples, update through the unfolded recurrent neural network

Back-Propagation Through Time
• Carry out back-propagation through time (BPTT) after each training example
  – 5 time steps seem to be sufficient
  – the network learns to store information for more than 5 time steps
• Or: update in mini-batches
  – process 10–20 training examples
  – update backwards through all examples
  – removes the need for multiple steps for each training example
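As a companion to the recurrent sketches above, a minimal numpy version of the forward pass: each step predicts the next word from the current word's embedding and the copied-over hidden state, with the hidden state initialized to all ones as in the "mystery layer". Sizes and weight names are made up for illustration, and training (BPTT) is not shown.

```python
import numpy as np

V, E, H = 20000, 100, 200                    # illustrative sizes (assumptions)
rng = np.random.default_rng(0)
C     = rng.normal(0, 0.1, (V, E))           # word embeddings
W_in  = rng.normal(0, 0.1, (E, H))           # embedding -> hidden
W_hh  = rng.normal(0, 0.1, (H, H))           # hidden -> hidden (the recurrence)
W_out = rng.normal(0, 0.1, (H, V))           # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_language_model(word_ids):
    """Return p(next word) after each position of the input sequence."""
    h = np.ones(H)                           # the "mystery layer": all nodes start at value 1
    predictions = []
    for w in word_ids:
        h = np.tanh(C[w] @ W_in + h @ W_hh)  # combine current word with copied-over history
        predictions.append(softmax(h @ W_out))
    return predictions

probs = rnn_language_model([4, 7, 1])        # hypothetical word ids
print(len(probs), probs[-1].shape)           # 3 (20000,)
```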
long short term memory

Vanishing Gradients
• Error is propagated to previous steps
• Updates consider
  – the prediction at that time step
  – the impact on future time steps
• Vanishing gradient: the propagated error disappears

Recent vs. Early History
• Hidden layer plays double duty
  – memory of the network
  – continuous space representation used to predict output words
• Sometimes only recent context is important
    After much economic progress over the years, the country → has
• Sometimes much earlier context is important
    The country which has made much economic progress over the years still → has

Long Short Term Memory (LSTM)
• Design quite elaborate, although not very complicated to use
• Basic building block: LSTM cell
  – similar to a node in a hidden layer
  – but: has an explicit memory state
• How output and memory state change depends on gates
  – input gate: how much new input changes the memory state
  – forget gate: how much of the prior memory state is retained
  – output gate: how strongly the memory state is passed on to the next layer
• Gates are not just open (1) or closed (0), but can be slightly ajar (e.g., 0.2)

LSTM Cell
[Figure: an LSTM cell with input, forget, and output gates; input X from the preceding layer and the cell's state from time t−1 are combined to update the memory m and produce the output h, which feed the next layer (Y) and the LSTM layer at time t]

LSTM Cell (Math)
• Memory and output values at time step t
    memory_t = gate_input × input_t + gate_forget × memory_{t−1}
    output_t = gate_output × memory_t
• The hidden node value h_t passed on to the next layer applies an activation function f
    h_t = f(output_t)
• Input is computed as in a recurrent neural network node
  – given node values for the prior layer x^t = (x_1^t, ..., x_X^t)
  – given values for the hidden layer from the previous time step h^{t−1} = (h_1^{t−1}, ..., h_H^{t−1})
  – the input value is a combination of matrix multiplications with weights w^x and w^h and an activation function g
    input_t = g( Σ_{i=1}^{X} w_i^x x_i^t + Σ_{i=1}^{H} w_i^h h_i^{t−1} )
  (a code sketch of one LSTM step follows the GRU slides below)

Values for Gates
• Gates are very important
• How do we compute their value? → with a neural network layer!
• For each gate a ∈ (input, forget, output)
  – weight matrix W^{xa} to consider node values in the previous layer x^t
  – weight matrix W^{ha} to consider the hidden layer h^{t−1} at the previous time step
  – weight matrix W^{ma} to consider the memory memory^{t−1} at the previous time step
  – activation function h
    gate_a = h( Σ_{i=1}^{X} w_i^{xa} x_i^t + Σ_{i=1}^{H} w_i^{ha} h_i^{t−1} + Σ_{i=1}^{H} w_i^{ma} memory_i^{t−1} )

Training
• LSTMs are trained the same way as recurrent neural networks
• Back-propagation through time
• This all looks very complex, but:
  – all the operations are still based on
    ∗ matrix multiplications
    ∗ differentiable activation functions
  → we can compute gradients for the objective function with respect to all parameters
  → we can compute update functions

What is the Point? (from Tran, Bisazza, Monz, 2016)
• Each node has a memory memory_i independent from its current output h_i
• Memory may be carried through unchanged (gate_input^i = 0, gate_forget^i = 1)
⇒ can remember important features over a long time span (capture long-distance dependencies)

Visualizing Individual Cells
Karpathy et al. (2015): "Visualizing and Understanding Recurrent Networks"
[Figure: activations of individual cells over text]

Visualizing Individual Cells
[Figure: further examples of individual cell activations]

Gated Recurrent Unit (GRU)
[Figure: a GRU cell with update and reset gates; input X from the preceding layer and the state from time t−1 are combined into a new state, which feeds the next layer (Y) and the GRU layer at time t]

Gated Recurrent Unit (Math)
• Two gates
    update_t = g(W_update · input_t + U_update · state_{t−1} + bias_update)
    reset_t = g(W_reset · input_t + U_reset · state_{t−1} + bias_reset)
• Combination of input and previous state (similar to a traditional recurrent neural network)
    combination_t = f(W · input_t + U · (reset_t ◦ state_{t−1}))
• Interpolation with previous state
    state_t = (1 − update_t) ◦ state_{t−1} + update_t ◦ combination_t
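A minimal numpy sketch of a single LSTM step, following the memory, output, input, and gate equations on the LSTM Cell (Math) and Values for Gates slides above. The layer sizes are made up, the per-cell sums are vectorized into matrix multiplications, and the choice of the sigmoid for the gate activation h and tanh for g and f is an assumption.

```python
import numpy as np

X, H = 100, 200                               # illustrative layer sizes (assumptions)
rng = np.random.default_rng(0)

def init(shape):
    return rng.normal(0, 0.1, shape)

# One weight matrix per connection, vectorizing the per-cell sums of the slides.
Wx_in, Wh_in = init((X, H)), init((H, H))     # input computation: w^x, w^h
gate_weights = {a: (init((X, H)), init((H, H)), init((H, H)))   # W^{xa}, W^{ha}, W^{ma}
                for a in ("input", "forget", "output")}

def sigmoid(z):                               # assumed choice for the gate activation h
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, memory_prev, g=np.tanh, f=np.tanh):
    """One LSTM time step: returns (h_t, memory_t)."""
    def gate(a):
        Wxa, Wha, Wma = gate_weights[a]
        return sigmoid(x_t @ Wxa + h_prev @ Wha + memory_prev @ Wma)   # value in (0, 1)

    input_t  = g(x_t @ Wx_in + h_prev @ Wh_in)                         # candidate input
    memory_t = gate("input") * input_t + gate("forget") * memory_prev  # update memory state
    output_t = gate("output") * memory_t                               # gated output
    return f(output_t), memory_t                                       # h_t goes to the next layer

h, memory = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, X)):             # run over 5 made-up input vectors
    h, memory = lstm_step(x, h, memory)
print(h.shape, memory.shape)                  # (200,) (200,)
```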
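And a corresponding sketch of one GRU step, directly following the update, reset, combination, and interpolation equations above; again, all sizes are illustrative and the sigmoid/tanh choices for g and f are assumptions.

```python
import numpy as np

X, H = 100, 200                               # illustrative sizes (assumptions)
rng = np.random.default_rng(1)

def init(shape):
    return rng.normal(0, 0.1, shape)

W_update, U_update, b_update = init((X, H)), init((H, H)), np.zeros(H)
W_reset,  U_reset,  b_reset  = init((X, H)), init((H, H)), np.zeros(H)
W, U = init((X, H)), init((H, H))             # for the combination of input and previous state

def sigmoid(z):                               # assumed choice for g
    return 1 / (1 + np.exp(-z))

def gru_step(input_t, state_prev, g=sigmoid, f=np.tanh):
    """One GRU time step: returns state_t."""
    update_t = g(input_t @ W_update + state_prev @ U_update + b_update)
    reset_t  = g(input_t @ W_reset  + state_prev @ U_reset  + b_reset)
    combination_t = f(input_t @ W + (reset_t * state_prev) @ U)        # reset scales the old state
    return (1 - update_t) * state_prev + update_t * combination_t      # interpolate old and new

state = np.zeros(H)
for x in rng.normal(size=(5, X)):             # run over 5 made-up input vectors
    state = gru_step(x, state)
print(state.shape)                            # (200,)
```

Note the design difference from the LSTM: the GRU has no separate memory state and no output gate; the update gate directly interpolates between the old state and the new combination.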
deeper models

Deep Learning?
[Figure: input → one hidden layer → output]
• Not much deep learning so far
• Between prediction from input to output: only 1 hidden layer
• How about more hidden layers?

Deep Models
[Figure: a shallow model with one hidden layer, compared to deep stacked and deep transition models with hidden layers 1–3 between input and output]

questions?