Introduction to Neural Networks

Philipp Koehn

21 September 2023

Linear Models

• We used before a weighted linear combination of feature values h_j and weights λ_j

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

• Such models can be illustrated as a "network"

Limits of Linearity

• We can give each feature a weight
• But not more complex value relationships, e.g.,
  – any value in the range [0;5] is equally good
  – values over 8 are bad
  – higher than 10 is not worse

XOR

• Linear models cannot model XOR

  [Figure: four points labeled bad / good / good / bad that no single line can separate]

Multiple Layers

• Add an intermediate ("hidden") layer of processing (each arrow is a weight)

  [Figure: network with input layer x, hidden layer h, and output layer y]

• Have we gained anything so far?

Non-Linearity

• Instead of computing a linear combination

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

• Add a non-linear function

  score(λ, d_i) = f(Σ_j λ_j h_j(d_i))

• Popular choices

  tanh(x)
  sigmoid(x) = 1 / (1 + e^−x)
  relu(x) = max(0, x)

  (sigmoid is also called the "logistic function")

Deep Learning

• More layers = deep learning

What Depth Enables

• Each layer is a processing step
• Having multiple processing steps allows complex functions
• Metaphor: NN and computing circuits
  – computer = sequence of Boolean gates
  – neural computer = sequence of layers
• Deep neural networks can implement complex functions, e.g., sorting of input values

example

Simple Neural Network

  [Figure: feed-forward network with two input nodes and a bias unit, two hidden nodes and a hidden-layer bias unit, and one output node; the edge labels are the weights used in the computations below]

• One innovation: bias units (no inputs, always value 1)

Sample Input

• Try out two input values: x0 = 1.0, x1 = 0.0
• Hidden unit computation

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^1.6) = 0.17

Compute Output

• Output unit computation, using the computed hidden values 0.90 and 0.17

  sigmoid(0.90 × 4.5 + 0.17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^−1.17) = 0.76
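As a quick check of these computations, here is a short Python sketch (not part of the slides) that reproduces the forward pass for the input (1.0, 0.0) with the weights from the example network; variable names are chosen for illustration.

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    x0, x1 = 1.0, 0.0                          # sample input

    # hidden layer (weights and biases from the example network)
    h0 = sigmoid(3.7 * x0 + 3.7 * x1 - 1.5)    # -> ~0.90
    h1 = sigmoid(2.9 * x0 + 2.9 * x1 - 4.5)    # -> ~0.17

    # output layer
    y = sigmoid(4.5 * h0 - 5.2 * h1 - 2.0)     # -> ~0.76

    print(round(h0, 2), round(h1, 2), round(y, 2))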
Output for all Binary Inputs

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
      0          0        0.12        0.02      0.18 → 0
      0          1        0.88        0.27      0.74 → 1
      1          0        0.73        0.12      0.74 → 1
      1          1        0.99        0.73      0.33 → 0

• Network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – final layer operation is h0 − h1 (OR, but not AND)
• Power of deep neural networks: chaining of processing steps
  just as: more Boolean circuits → more complex computations possible

why "neural" networks?

The Brain vs. Artificial Neural Networks

• Similarities
  – Neurons, connections between neurons
  – Learning = change of connections, not change of neurons
  – Massive parallel processing
• But artificial neural networks are much simpler
  – computation within neuron vastly simplified
  – discrete time steps
  – typically some form of supervised learning with massive number of stimuli

back-propagation training

Error

  [Figure: the example network with computed values 0.90, 0.17, and 0.76]

• Computed output: y = 0.76
• Correct output: t = 1.0
⇒ How do we adjust the weights?

Key Concepts

• Gradient descent
  – error is a function of the weights
  – we want to reduce the error
  – gradient descent: move towards the error minimum
  – compute gradient → get direction to the error minimum
  – adjust weights towards direction of lower error
• Back-propagation
  – first adjust last set of weights
  – propagate error back to each previous layer
  – adjust their weights

Hidden Layer Update

• In a hidden layer, we do not have a target output value
• But we can compute how much each node contributed to downstream error
• Definition of error term of each node

  δ_j = (t_j − y_j) y′_j

  (y′_j is the derivative of the activation function at node j)

• Back-propagate the error term
  (why this way? there is math to back it up...)

  δ_i = (Σ_j w_{j←i} δ_j) y′_i

• Universal update formula

  ∆w_{j←k} = µ δ_j h_k

Our Example

  [Figure: the example network with nodes labeled A, B (inputs), C (input bias), D, E (hidden), F (hidden bias), G (output)]

• Computed output: y = 0.76
• Correct output: t = 1.0
• Final layer weight updates (learning rate µ = 10)
  – δ_G = (t − y) y′ = (1 − 0.76) × 0.181 = 0.0434
  – ∆w_GD = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.391
  – ∆w_GE = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
  – ∆w_GF = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434
• Updated final layer weights: 4.5 + 0.391 = 4.891, −5.2 + 0.074 = −5.126, −2.0 + 0.434 = −1.566
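To make this arithmetic easy to reproduce, here is a small Python sketch (not part of the slides) of the final-layer updates and one back-propagated hidden update; the variable names are made up for illustration, and the results differ from the slides in the last digit because the slides carry more decimal places in intermediate values.

    # values from the worked example
    h_D, h_E, h_F = 0.90, 0.17, 1.0      # hidden nodes D, E and bias unit F
    y, t, mu = 0.76, 1.0, 10.0           # computed output, target, learning rate

    # error term of output node G: delta = (t - y) * y', with y' = y * (1 - y) for sigmoid
    y_prime = y * (1 - y)                # ~0.18
    delta_G = (t - y) * y_prime          # ~0.044 (slides: 0.0434)

    # universal update formula: delta_w = mu * delta_j * h_k
    print(mu * delta_G * h_D)            # ~0.39  (weight D -> G)
    print(mu * delta_G * h_E)            # ~0.074 (weight E -> G)
    print(mu * delta_G * h_F)            # ~0.44  (weight F -> G)

    # back-propagated error term for hidden node D (its weight to G is 4.5)
    delta_D = (4.5 * delta_G) * h_D * (1 - h_D)
    print(mu * delta_D * 1.0)            # update for the weight from input A (value 1.0)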
Batches

• Each training example yields a set of weight updates ∆w_i
• Batch up several training examples
  – sum up their updates
  – apply sum to model
• Mostly done for speed reasons

computational aspects

Vector and Matrix Multiplications

• Forward computation: s = W h
• Activation function: y = sigmoid(s)
• Error term: δ = (t − y) · sigmoid′(s)
• Propagation of error term: δ_i = (Wᵀ δ_{i+1}) · sigmoid′(s_i)
• Weight updates: ∆W = µ δ hᵀ

GPU

• Neural network layers may have, say, 200 nodes
• Computations such as W h require 200 × 200 = 40,000 multiplications
• Graphics Processing Units (GPUs) are designed for such computations
  – image rendering requires such vector and matrix operations
  – massively multi-core, but lean processing units
  – example: NVIDIA H100 GPU provides 18,432 CUDA cores
• Extensions to C to support programming of GPUs, such as CUDA

Explosion of Deep Learning Toolkits

• University of Montreal: Theano (early, now defunct)
• Google: Tensorflow
• Facebook: Torch, pyTorch
• Microsoft: CNTK
• Amazon: MX-Net
• CMU: Dynet
• AMU/Edinburgh/Microsoft: Marian
• ... and many more

Toolkits

• Machine learning architectures built around computation graphs are very powerful
  – define a computation graph
  – provide data and a training strategy (e.g., batching)
  – toolkit does the rest
  – seamless support of GPUs

Example: PyTorch

• Installation

    pip install torch

• Usage

    import torch

Some Data Types

• PyTorch data type for parameter vectors, matrices etc., called torch.tensor

    W = torch.tensor([[3,4],[2,3]], requires_grad=True, dtype=torch.float)
    b = torch.tensor([-2,-4], requires_grad=True, dtype=torch.float)
    W2 = torch.tensor([5,-5], requires_grad=True, dtype=torch.float)
    b2 = torch.tensor([-2], requires_grad=True, dtype=torch.float)

• Definition of variables includes
  – specification of their basic data type (float)
  – indication to compute gradients (requires_grad=True)
• Input and output

    x = torch.tensor([1,0], dtype=torch.float)
    t = torch.tensor([1], dtype=torch.float)

Computation Graph

• Computation graph

    s = W.mv(x) + b
    h = torch.nn.Sigmoid()(s)
    z = torch.dot(W2, h) + b2
    y = torch.nn.Sigmoid()(z)
    error = 1/2 * (t - y) ** 2

• Note
  – PyTorch sigmoid function: torch.nn.Sigmoid()
  – multiplication between matrix W and vector x is mv
  – multiplication between two vectors W2 and h is torch.dot

Backward Computation

• Here it is:

    error.backward()

• No need to derive gradients: all is done automatically
• We can look up computed gradients

    >>> W2.grad
    tensor([-0.0360, -0.0059])

• Note
  – when you run this code multiple times, gradients accumulate
  – reset them with, e.g., W2.grad.data.zero_()
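Put together, the fragments above form a complete program. The following consolidated sketch repeats the slide code and only adds a loop at the end to reset the gradients; the printed gradient values are the ones shown above.

    import torch

    # parameters: requires_grad=True so gradients are computed for them
    W  = torch.tensor([[3,4],[2,3]], requires_grad=True, dtype=torch.float)
    b  = torch.tensor([-2,-4], requires_grad=True, dtype=torch.float)
    W2 = torch.tensor([5,-5], requires_grad=True, dtype=torch.float)
    b2 = torch.tensor([-2], requires_grad=True, dtype=torch.float)

    # one training example: input and target output
    x = torch.tensor([1,0], dtype=torch.float)
    t = torch.tensor([1], dtype=torch.float)

    # forward computation graph
    s = W.mv(x) + b
    h = torch.nn.Sigmoid()(s)
    z = torch.dot(W2, h) + b2
    y = torch.nn.Sigmoid()(z)
    error = 1/2 * (t - y) ** 2

    # backward computation: gradients for all parameters
    error.backward()
    print(W2.grad)    # tensor([-0.0360, -0.0059])

    # gradients accumulate across backward() calls, so reset them before reuse
    for p in (W, b, W2, b2):
        p.grad.data.zero_()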
Training Data

• Our training set consists of the four examples of the binary XOR operation:

  x   y   x ⊕ y
  0   0     0
  0   1     1
  1   0     1
  1   1     0

• Placed into an array

    training_data = [
        [ torch.tensor([0.,0.]), torch.tensor([0.]) ],
        [ torch.tensor([1.,0.]), torch.tensor([1.]) ],
        [ torch.tensor([0.,1.]), torch.tensor([1.]) ],
        [ torch.tensor([1.,1.]), torch.tensor([0.]) ]
    ]

Training Loop: Forward

    mu = 0.1
    for epoch in range(1000):
        total_error = 0
        for item in training_data:
            x = item[0]
            t = item[1]

            # forward computation
            s = W.mv(x) + b
            h = torch.nn.Sigmoid()(s)
            z = torch.dot(W2, h) + b2
            y = torch.nn.Sigmoid()(z)
            error = 1/2 * (t - y) ** 2
            total_error = total_error + error

Training Loop: Backward and Updates

• Continuation of the inner loop above:

            # backward computation
            error.backward()

            # weight updates
            W.data = W - mu * W.grad.data
            b.data = b - mu * b.grad.data
            W2.data = W2 - mu * W2.grad.data
            b2.data = b2 - mu * b2.grad.data

            W.grad.data.zero_()
            b.grad.data.zero_()
            W2.grad.data.zero_()
            b2.grad.data.zero_()

        print("error: ", total_error/4)

Batch Training

• We computed gradients for each training example and updated the model immediately
• More common: process examples in batches, update after the batch is processed
• Instead of

    error.backward()

• run back-propagation on the accumulated error

    total_error.backward()

Training Data Batch

    x = torch.tensor([ [0.,0.], [1.,0.], [0.,1.], [1.,1.] ])
    t = torch.tensor([ 0., 1., 1., 0. ])

• Change to computation graph (input now a matrix, output a vector)

    s = x.mm(W) + b
    h = torch.nn.Sigmoid()(s)
    z = h.mv(W2) + b2
    y = torch.nn.Sigmoid()(z)

• Convert error vector into a single number

    error = 1/2 * (t - y) ** 2
    mean_error = error.mean()
    mean_error.backward()

Parameter Updates (Optimizer)

• Our code has explicit parameter update computations

    # weight updates
    W.data = W - mu * W.grad.data
    b.data = b - mu * b.grad.data
    W2.data = W2 - mu * W2.grad.data
    b2.data = b2 - mu * b2.grad.data

• But fancier optimizers are typically used (Adam, etc.)
• This requires a more complex implementation (a sketch with a built-in optimizer follows below)
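The following sketch (not from the slides) shows how such an off-the-shelf optimizer could replace the manual updates, reusing the parameters W, b, W2, b2 from the Some Data Types slide and the batch tensors x, t from above; the choice of Adam and the learning rate are illustrative assumptions.

    # assumes W, b, W2, b2 (requires_grad=True) and the batch tensors x, t defined above
    optimizer = torch.optim.Adam([W, b, W2, b2], lr=0.1)   # illustrative optimizer and learning rate

    for epoch in range(1000):
        optimizer.zero_grad()              # reset accumulated gradients
        s = x.mm(W) + b                    # forward pass over the whole batch
        h = torch.nn.Sigmoid()(s)
        z = h.mv(W2) + b2
        y = torch.nn.Sigmoid()(z)
        mean_error = (1/2 * (t - y) ** 2).mean()
        mean_error.backward()              # back-propagation
        optimizer.step()                   # optimizer applies the parameter updates

Compared to the manual version, the update rule lives entirely inside the optimizer object, so switching between SGD, Adam, or others is a one-line change.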
torch.nn.Module

• Neural network model is defined as a class derived from torch.nn.Module

    class ExampleNet(torch.nn.Module):
        def __init__(self):
            super(ExampleNet, self).__init__()
            self.layer1 = torch.nn.Linear(2,2)
            self.layer2 = torch.nn.Linear(2,1)
            self.layer1.weight = torch.nn.Parameter(torch.tensor([[3.,2.],[4.,3.]]))
            self.layer1.bias = torch.nn.Parameter(torch.tensor([-2.,-4.]))
            self.layer2.weight = torch.nn.Parameter(torch.tensor([[5.,-5.]]))
            self.layer2.bias = torch.nn.Parameter(torch.tensor([-2.]))

        def forward(self, x):
            s = self.layer1(x)
            h = torch.nn.Sigmoid()(s)
            z = self.layer2(h)
            y = torch.nn.Sigmoid()(z)
            return y

Optimizer Definition

• Instantiation of neural network object

    net = ExampleNet()

• Optimizer definition

    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

Training Loop

    for iteration in range(1000):
        optimizer.zero_grad()
        out = net.forward( x ).squeeze()   # squeeze (4,1) output to match t of shape (4,)
        error = 1/2 * (t - out) ** 2
        mean_error = error.mean()
        print("error: ", mean_error.data)
        mean_error.backward()
        optimizer.step()

code available on web page for textbook
http://www.statmt.org/nmt-book/
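As a final sanity check (not on the slides), the trained network can be evaluated on all four XOR inputs, reusing the batch tensor x defined earlier:

    # predictions for (0,0), (1,0), (0,1), (1,1) after training
    with torch.no_grad():
        print(net.forward(x).squeeze())    # should be close to 0, 1, 1, 0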