MUNI
FI
Deep Learning
Natural Language Modelling
PA154 Language Modeling (9.1)
Pavel Rychlý
pary@fi.muni.cz April 13, 2023
deep neural networks
many layers
trained on big data
using advanced hardware: GPU.TPU
supervised, semi-supervised or unsupervised
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
Neuron
basic element of neural networks many inputs (numbers), weights (numbers) activation (transfer) function (threshold) one output: y = 0E/lo wixi + ^)
b,
Activation function
Output
^avel Rychlý ■ Natural Language Mod
Neural Networks
■ Input/Hidden/Output layer
■ Input/output = vector of numbers
■ hidden layer = matrix of parameters (numbers)
^avel Rychlý ■ Natural Language Modelling ■ April 13,2023
Activation Functions
■ crucial component of NN
■ non-linear function
■ many layers without non-linear activation functions are equivalent to single layer (linear combination of inputs)
Sigmoid
= T+t
tanh
tanh(a;)
Leaky ReLU
max(0.1.T, x)
ReLU
max(0, x)
https://en.wikipedia.org/wiki/Activation_function
One-hot representation
words/classes: [0 0 0 1 0 0 0 0]
for each word/class one input
one input activated (1), others deactivated (0)
whole input vector could be large = size of vocabulary (Ri30k)
sequence of words requires sequence of one-hot vectors
first layer transforms words into word embeddings
usually not represented explicitly as vectors, using one single
number
output:
■ one-hot vector during training = expecting one word/cLass probability distribution during usage
■ cannot be represented using singLe number
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
5/25
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
6/25
Training Neural Networks
■ supervised training
■ example: input + result
■ difference between output and expected result (loss function) adjusts weights according to a learning rule
■ backpropagation (feedforward neural networks)
■ gradient of the loss function, stochastic gradient descent (SGD)
Hidden layer(s)
Differences red values
Training Language Models
■ core function of language models: predict the following word:
P(X5|Xi,X2,X3,X4)
■ input: context = x1,x2,x;,x4
■ output: probability distribution of the following word
■ training:
■ input: xi,X2,X3,X4
■ output: X5 (one-hot vector)
■ training data:
■ get any text (copus)
■ extract aLL n-grams
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
Why are NNs better than statistics
Why are NNs used only last 10 years
continues space representation
■ items (wordsAags/cLasses) are not atomic
■ no sparsity probLem
■ don't need to store a LL observed n-grams
■ vectors handLes relations
■ many reaLations, not explicit, unknown
NN can represent any function (if deep enough)
■ structure of the function is not pre-defined
big training data powerful hardware
■ Moore's Law: memory size, processor's speed doubLes every few years
■ matrix processing using GPU, TPU better learning strategies, NN optimizatons
■ Adam, AdaFactor optimizers
■ dropout
■ attention
ready to use libraries/framewoks, datasets
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
^avel Rychlý ■ Natural Language Modelling ■ April 13,2023
A fixed-window neural Language Model
A fixed-window neural Language Model
v = vocabulary size d = embedding size h = hidden layer size Input: IDs of words (sparse representation of one-hot vector) E: embeddings (v x d) W: hidden layer (Ad x h) U: output layer (h x v)
books J laptops
w
oooo oooo oooo oooo""]
the     students   opened their
~m 3.P) 3.(3) 3.(4)
E: embeddings (v x d)
W: hidden layer (Ad x h)
U: output layer (h x v)
A = E(X)
B=f(WA)
Z = soflmax{UB)
books
I laptops
w
oooo oooo  oooo oooo
the     students   opened their
xm     x(2)      x(3) xw
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
11/25
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
12/25
A fixed-window neural Language Model
fixed window is too small enlarging window enlarges W window can never be large enough! x-i and x2 are multiplied by completely different weights in W. No symmetry in how the inputs are [oo processed.
books
laptops
a ,	zoo u
	
	w
the     students   opened their
~m        _(3) _(<i)
Recurrent Neural Network (RNN)
■ dealing with long inputs
■ feedforward NN + internal state (memory)
■ finite impulse RNN: unroll to strictly feedforward NN
■ infinite impulse RNN: directed cyclic graph
■ additional storage managed by NN: gated state/memory
■ backpropagation through time
^—y tu
©
© ©
J w |w
fu fu ( u
©  0 ©
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
Problem of Long-Term Dependencies
■ the clouds are in the sky.
©      ©      ©      © ©
| a |—h, a |—»| a |—«■[ a |—»| a |
©       ©      ©       © ©
■ / grew up in France... I speak fluent French .
t T
~Á~|—H a Ha
a~|~H a H a I
Long short-term memory (LSTM)
■ LSTM unit: cell, input gate, output gate and forget gate
■ cell = memeory
■ gates regulate the flow of information into and out of the cell
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
^avel Rychlý ■ Natural Language Modelling ■ April 13,2023
Hidden state H
Input X,
LSTM - Input
Hidden state
H,_,
Forget
gate F,
Input X,
Input gate
r_t
Input node
Output gate O,
lanh
I      FC layer with a   I   activation function
Copy
Concatenate
FC layer with activation function
Copy
Concatenate
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
17/25
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
18/25
LSTM - Memory
LSTM - Hidden state
Memory cell internal state _
Hidden state H,_,
Forget gate
-0-
Output gate
Input f Input -y" * gate    |       node J
i_I_t
FC layer with activation function
O
Copy
FC layer with activation function
O
Elementwise operator
Copy —?—* Concatenate
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
GRU, BRNN,...
■ Gated recurrent unit (GRU)
■ fewer parameters than LSTM
■ memory = output
■ Bi-directional RNN
■ two hidden layers of opposite directions to the same output
■ hierarchical, multilayer
Encoder-Decoder
variable input/output size, not 1-1 mapping two components
Encoder: variable-length sequence —► fixed size state Decoder: fixed size state —► variable-length sequence
Input
Encoder
State
Decoder
Output
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
^avel Rychlý ■ Natural Language Modelling ■ April 13,2023
Sequence to Sequence
Sequence to Sequence
Learning
■ Encoder: Input sequence -> state
■ Decoder: state + output sequence -> output sequence
Encoder
Decoder
T     f     f     f T
They     are   watching <eo
lis regardent
L JL X
<bos>       lis regardent
Using
Encoder: Input sequence -> state
Decoder: state + sentence delimiter -> output
Encoder
Decoder
lis     regardent <eos>
They     are   watching <eos>
<bos>
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
23/25
^avel Rychly ■ Natural Language Modelling ■ April 13,2023
24/25
Multi-layer encoder/decoder
Encoder
Decoder
FC
n x
Daai irrant		Daai irrant
r\ecurreni		r\ecurreni
x n
Embedding f
Sources
Embedding
~~f
Targets
^avel Rychly ■ Natural Language Modelling ■ April 13,2023