MUNI FI
Deep Learning Natural Language Modelling
PA154 Language Modeling (9.1)
Pavel Rychlý
pary@fi.muni.cz
April 13, 2023

■ deep neural networks
■ many layers
■ trained on big data
■ using advanced hardware: GPU, TPU
■ supervised, semi-supervised or unsupervised

Neuron
■ basic element of neural networks
■ many inputs (numbers), weights (numbers)
■ activation (transfer) function (threshold)
■ one output: y = φ(Σ_i w_i x_i + b)
[figure: a neuron with inputs x_i, weights w_i, bias b, an activation function, and output y]

Neural Networks
■ input / hidden / output layer
■ input/output = vector of numbers
■ hidden layer = matrix of parameters (numbers)

Activation Functions
■ crucial component of NN
■ non-linear function
■ many layers without non-linear activation functions are equivalent to a single layer (linear combination of inputs)
■ Sigmoid: σ(x) = 1 / (1 + e^(-x))
■ tanh: tanh(x)
■ ReLU: max(0, x)
■ Leaky ReLU: max(0.1x, x)
■ https://en.wikipedia.org/wiki/Activation_function

One-hot representation
■ words/classes: [0 0 0 1 0 0 0 0]
■ for each word/class one input
■ one input activated (1), others deactivated (0)
■ whole input vector could be large = size of vocabulary (≈30k)
■ a sequence of words requires a sequence of one-hot vectors
■ first layer transforms words into word embeddings
■ on input, usually not represented explicitly as vectors, but by a single number (the word ID)
■ output:
  ■ one-hot vector during training = expecting one word/class
  ■ probability distribution during usage
  ■ cannot be represented using a single number

Training Neural Networks
■ supervised training
■ example: input + result
■ difference between output and expected result (loss function)
■ adjusts weights according to a learning rule
■ backpropagation (feedforward neural networks)
■ gradient of the loss function, stochastic gradient descent (SGD)
[figure: input layer, hidden layer(s), output layer; differences from the expected values shown in red]

Training Language Models
■ core function of language models: predict the following word: P(x5 | x1, x2, x3, x4)
■ input: context = x1, x2, x3, x4
■ output: probability distribution of the following word
■ training:
  ■ input: x1, x2, x3, x4
  ■ output: x5 (one-hot vector)
■ training data (see the sketch below):
  ■ get any text (corpus)
  ■ extract all n-grams

Why are NNs better than statistics
■ continuous space representation
  ■ items (words/tags/classes) are not atomic
  ■ no sparsity problem
  ■ no need to store all observed n-grams
■ vectors handle relations
  ■ many relations, not explicit, unknown
■ NN can represent any function (if deep enough)
  ■ structure of the function is not pre-defined

Why are NNs used only in the last 10 years
■ big training data
■ powerful hardware
  ■ Moore's law: memory size and processor speed double every few years
  ■ matrix processing using GPU, TPU
■ better learning strategies, NN optimizations
  ■ Adam, AdaFactor optimizers
  ■ dropout
  ■ attention
■ ready-to-use libraries/frameworks, datasets
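The training-data preparation described above ("get any text, extract all n-grams") can be sketched in a few lines of Python. This is only an illustration, assuming a naive whitespace tokenizer and 5-grams (4 context words plus the target); the function name extract_ngrams is not from the lecture.

```python
def extract_ngrams(text: str, n: int = 5):
    """Yield (context, next_word) training pairs for all n-grams in the text."""
    words = text.lower().split()        # naive whitespace tokenization (assumption)
    for i in range(len(words) - n + 1):
        ngram = words[i:i + n]
        yield ngram[:-1], ngram[-1]     # 4-word context, 5th word as the target

corpus = "the students opened their books because the students opened their laptops"
for context, target in extract_ngrams(corpus):
    print(context, "->", target)
# ['the', 'students', 'opened', 'their'] -> books
# ...
```

Each pair is one training example: the context is the input and the target word (as a one-hot vector) is the expected output.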
A fixed-window neural Language Model
■ v = vocabulary size
■ d = embedding size
■ h = hidden layer size
■ input: IDs of words (sparse representation of one-hot vectors)
■ E: embeddings (v x d)
■ W: hidden layer (4d x h)
■ U: output layer (h x v)
■ A = E(x)
■ B = f(W A)
■ Z = softmax(U B)
[figure: the network predicts the word following "the students opened their" (x(1) ... x(4)), e.g. "books" or "laptops"]
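A minimal sketch of this model, assuming PyTorch, tanh as the non-linearity f, and small illustrative sizes; note that nn.Linear also adds bias terms, which the equations above omit. The class name FixedWindowLM and the concrete numbers are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Fixed-window neural LM: A = E(x), B = f(W A), Z = softmax(U B)."""

    def __init__(self, v: int, d: int, h: int, window: int = 4):
        super().__init__()
        self.embed = nn.Embedding(v, d)          # E: v x d
        self.hidden = nn.Linear(window * d, h)   # W: (window*d) x h (plus bias)
        self.output = nn.Linear(h, v)            # U: h x v (plus bias)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, window) integer IDs = sparse one-hot representation
        a = self.embed(word_ids).flatten(start_dim=1)  # A: concatenated embeddings, (batch, window*d)
        b = torch.tanh(self.hidden(a))                 # B = f(W A), here f = tanh
        z = torch.softmax(self.output(b), dim=-1)      # Z: probability distribution over the vocabulary
        return z

# toy sizes, purely illustrative
model = FixedWindowLM(v=30_000, d=64, h=128)
context = torch.randint(0, 30_000, (1, 4))   # e.g. "the students opened their" as word IDs
probs = model(context)
print(probs.shape)                            # torch.Size([1, 30000]), each row sums to 1
```

During training, Z is compared against the one-hot vector of the observed next word (e.g. with a cross-entropy loss) and E, W, U are updated by SGD, as described earlier.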