Introduction to Neural Networks

Philipp Koehn

21 September 2023

Linear Models

• We used before a weighted linear combination of feature values h_j and weights λ_j

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

• Such models can be illustrated as a "network"

Limits of Linearity

• We can give each feature a weight
• But not more complex value relationships, e.g.,
  – any value in the range [0;5] is equally good
  – values over 8 are bad
  – higher than 10 is not worse

XOR

• Linear models cannot model XOR

  [Figure: four points labeled bad / good / good / bad that no single line can separate]

Multiple Layers

• Add an intermediate ("hidden") layer of processing (each arrow is a weight)

  [Figure: network with input layer x, hidden layer h, and output layer y]

• Have we gained anything so far?

Non-Linearity

• Instead of computing a linear combination

  score(λ, d_i) = Σ_j λ_j h_j(d_i)

• Add a non-linear function

  score(λ, d_i) = f(Σ_j λ_j h_j(d_i))

• Popular choices

  tanh(x)
  sigmoid(x) = 1 / (1 + e^−x)
  relu(x) = max(0, x)

  (sigmoid is also called the "logistic function")

Deep Learning

• More layers = deep learning

What Depth Enables

• Each layer is a processing step
• Having multiple processing steps allows complex functions
• Metaphor: NN and computing circuits
  – computer = sequence of Boolean gates
  – neural computer = sequence of layers
• Deep neural networks can implement complex functions, e.g., sorting of input values

example

Simple Neural Network

  [Figure: feed-forward network with two input nodes and a bias unit, two hidden nodes and a hidden-layer bias unit, and one output node; the edge labels are the weights used in the computations below]

• One innovation: bias units (no inputs, always value 1)

Sample Input

• Try out two input values: x0 = 1.0, x1 = 0.0
• Hidden unit computation

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^−2.2) = 0.90
  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^1.6) = 0.17

Compute Output

• Output unit computation, using the computed hidden values 0.90 and 0.17

  sigmoid(0.90 × 4.5 + 0.17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^−1.17) = 0.76
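As a quick check of these computations, here is a short Python sketch (not part of the slides) that reproduces the forward pass for the input (1.0, 0.0) with the weights from the example network; variable names are chosen for illustration.

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    x0, x1 = 1.0, 0.0                          # sample input

    # hidden layer (weights and biases from the example network)
    h0 = sigmoid(3.7 * x0 + 3.7 * x1 - 1.5)    # -> ~0.90
    h1 = sigmoid(2.9 * x0 + 2.9 * x1 - 4.5)    # -> ~0.17

    # output layer
    y = sigmoid(4.5 * h0 - 5.2 * h1 - 2.0)     # -> ~0.76

    print(round(h0, 2), round(h1, 2), round(y, 2))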
Output for all Binary Inputs

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
      0          0        0.12        0.02      0.18 → 0
      0          1        0.88        0.27      0.74 → 1
      1          0        0.73        0.12      0.74 → 1
      1          1        0.99        0.73      0.33 → 0

• Network implements XOR
  – hidden node h0 is OR
  – hidden node h1 is AND
  – final layer operation is h0 − h1 (OR, but not AND)
• Power of deep neural networks: chaining of processing steps
  just as: more Boolean circuits → more complex computations possible

why "neural" networks?

The Brain vs. Artificial Neural Networks

• Similarities
  – Neurons, connections between neurons
  – Learning = change of connections, not change of neurons
  – Massive parallel processing
• But artificial neural networks are much simpler
  – computation within neuron vastly simplified
  – discrete time steps
  – typically some form of supervised learning with massive number of stimuli

back-propagation training

Error

  [Figure: the example network with computed values 0.90, 0.17, and 0.76]

• Computed output: y = 0.76
• Correct output: t = 1.0
⇒ How do we adjust the weights?

Key Concepts

• Gradient descent
  – error is a function of the weights
  – we want to reduce the error
  – gradient descent: move towards the error minimum
  – compute gradient → get direction to the error minimum
  – adjust weights towards direction of lower error
• Back-propagation
  – first adjust last set of weights
  – propagate error back to each previous layer
  – adjust their weights

Hidden Layer Update

• In a hidden layer, we do not have a target output value
• But we can compute how much each node contributed to downstream error
• Definition of error term of each node

  δ_j = (t_j − y_j) y′_j

  (y′_j is the derivative of the activation function at node j)

• Back-propagate the error term
  (why this way? there is math to back it up...)

  δ_i = (Σ_j w_{j←i} δ_j) y′_i

• Universal update formula

  ∆w_{j←k} = µ δ_j h_k

Our Example

  [Figure: the example network with nodes labeled A, B (inputs), C (input bias), D, E (hidden), F (hidden bias), G (output)]

• Computed output: y = 0.76
• Correct output: t = 1.0
• Final layer weight updates (learning rate µ = 10)
  – δ_G = (t − y) y′ = (1 − 0.76) × 0.181 = 0.0434
  – ∆w_GD = µ δ_G h_D = 10 × 0.0434 × 0.90 = 0.391
  – ∆w_GE = µ δ_G h_E = 10 × 0.0434 × 0.17 = 0.074
  – ∆w_GF = µ δ_G h_F = 10 × 0.0434 × 1 = 0.434
• Updated final layer weights: 4.5 + 0.391 = 4.891, −5.2 + 0.074 = −5.126, −2.0 + 0.434 = −1.566
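To make this arithmetic easy to reproduce, here is a small Python sketch (not part of the slides) of the final-layer updates and one back-propagated hidden update; the variable names are made up for illustration, and the results differ from the slides in the last digit because the slides carry more decimal places in intermediate values.

    # values from the worked example
    h_D, h_E, h_F = 0.90, 0.17, 1.0      # hidden nodes D, E and bias unit F
    y, t, mu = 0.76, 1.0, 10.0           # computed output, target, learning rate

    # error term of output node G: delta = (t - y) * y', with y' = y * (1 - y) for sigmoid
    y_prime = y * (1 - y)                # ~0.18
    delta_G = (t - y) * y_prime          # ~0.044 (slides: 0.0434)

    # universal update formula: delta_w = mu * delta_j * h_k
    print(mu * delta_G * h_D)            # ~0.39  (weight D -> G)
    print(mu * delta_G * h_E)            # ~0.074 (weight E -> G)
    print(mu * delta_G * h_F)            # ~0.44  (weight F -> G)

    # back-propagated error term for hidden node D (its weight to G is 4.5)
    delta_D = (4.5 * delta_G) * h_D * (1 - h_D)
    print(mu * delta_D * 1.0)            # update for the weight from input A (value 1.0)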
Batches

• Each training example yields a set of weight updates ∆w_i
• Batch up several training examples
  – sum up their updates
  – apply sum to model
• Mostly done for speed reasons

computational aspects

Vector and Matrix Multiplications

• Forward computation: s = W h
• Activation function: y = sigmoid(s)
• Error term: δ = (t − y) · sigmoid′(s)
• Propagation of error term: δ_i = (Wᵀ δ_{i+1}) · sigmoid′(s_i)
• Weight updates: ∆W = µ δ hᵀ

GPU

• Neural network layers may have, say, 200 nodes
• Computations such as W h require 200 × 200 = 40,000 multiplications
• Graphics Processing Units (GPUs) are designed for such computations
  – image rendering requires such vector and matrix operations
  – massively multi-core, but lean processing units
  – example: NVIDIA H100 GPU provides 18,432 CUDA cores
• Extensions to C to support programming of GPUs, such as CUDA

Explosion of Deep Learning Toolkits

• University of Montreal: Theano (early, now defunct)
• Google: Tensorflow
• Facebook: Torch, pyTorch
• Microsoft: CNTK
• Amazon: MX-Net
• CMU: Dynet
• AMU/Edinburgh/Microsoft: Marian
• ... and many more

Toolkits

• Machine learning architectures built around computation graphs are very powerful
  – define a computation graph
  – provide data and a training strategy (e.g., batching)
  – toolkit does the rest
  – seamless support of GPUs

Example: PyTorch

• Installation

    pip install torch

• Usage

    import torch

Some Data Types

• PyTorch data type for parameter vectors, matrices etc., called torch.tensor

    W = torch.tensor([[3,4],[2,3]], requires_grad=True, dtype=torch.float)
    b = torch.tensor([-2,-4], requires_grad=True, dtype=torch.float)
    W2 = torch.tensor([5,-5], requires_grad=True, dtype=torch.float)
    b2 = torch.tensor([-2], requires_grad=True, dtype=torch.float)

• Definition of variables includes
  – specification of their basic data type (float)
  – indication to compute gradients (requires_grad=True)
• Input and output

    x = torch.tensor([1,0], dtype=torch.float)
    t = torch.tensor([1], dtype=torch.float)

Computation Graph

• Computation graph

    s = W.mv(x) + b
    h = torch.nn.Sigmoid()(s)
    z = torch.dot(W2, h) + b2
    y = torch.nn.Sigmoid()(z)
    error = 1/2 * (t - y) ** 2

• Note
  – PyTorch sigmoid function: torch.nn.Sigmoid()
  – multiplication between matrix W and vector x is mv
  – multiplication between two vectors W2 and h is torch.dot

Backward Computation

• Here it is:

    error.backward()

• No need to derive gradients: all is done automatically
• We can look up computed gradients

    >>> W2.grad
    tensor([-0.0360, -0.0059])

• Note
  – when you run this code multiple times, gradients accumulate
  – reset them with, e.g., W2.grad.data.zero_()
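Put together, the fragments above form a complete program. The following consolidated sketch repeats the slide code and only adds a loop at the end to reset the gradients; the printed gradient values are the ones shown above.

    import torch

    # parameters: requires_grad=True so gradients are computed for them
    W  = torch.tensor([[3,4],[2,3]], requires_grad=True, dtype=torch.float)
    b  = torch.tensor([-2,-4], requires_grad=True, dtype=torch.float)
    W2 = torch.tensor([5,-5], requires_grad=True, dtype=torch.float)
    b2 = torch.tensor([-2], requires_grad=True, dtype=torch.float)

    # one training example: input and target output
    x = torch.tensor([1,0], dtype=torch.float)
    t = torch.tensor([1], dtype=torch.float)

    # forward computation graph
    s = W.mv(x) + b
    h = torch.nn.Sigmoid()(s)
    z = torch.dot(W2, h) + b2
    y = torch.nn.Sigmoid()(z)
    error = 1/2 * (t - y) ** 2

    # backward computation: gradients for all parameters
    error.backward()
    print(W2.grad)    # tensor([-0.0360, -0.0059])

    # gradients accumulate across backward() calls, so reset them before reuse
    for p in (W, b, W2, b2):
        p.grad.data.zero_()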
Training Data

• Our training set consists of the four examples of the binary XOR operation:

  x   y   x ⊕ y
  0   0     0
  0   1     1
  1   0     1
  1   1     0

• Placed into an array

    training_data = [
        [ torch.tensor([0.,0.]), torch.tensor([0.]) ],
        [ torch.tensor([1.,0.]), torch.tensor([1.]) ],
        [ torch.tensor([0.,1.]), torch.tensor([1.]) ],
        [ torch.tensor([1.,1.]), torch.tensor([0.]) ]
    ]

Training Loop: Forward

    mu = 0.1
    for epoch in range(1000):
        total_error = 0
        for item in training_data:
            x = item[0]
            t = item[1]

            # forward computation
            s = W.mv(x) + b
            h = torch.nn.Sigmoid()(s)
            z = torch.dot(W2, h) + b2
            y = torch.nn.Sigmoid()(z)
            error = 1/2 * (t - y) ** 2
            total_error = total_error + error

Training Loop: Backward and Updates

• Continuation of the inner loop above:

            # backward computation
            error.backward()

            # weight updates
            W.data = W - mu * W.grad.data
            b.data = b - mu * b.grad.data
            W2.data = W2 - mu * W2.grad.data
            b2.data = b2 - mu * b2.grad.data

            W.grad.data.zero_()
            b.grad.data.zero_()
            W2.grad.data.zero_()
            b2.grad.data.zero_()

        print("error: ", total_error/4)

Batch Training

• We computed gradients for each training example and updated the model immediately
• More common: process examples in batches, update after the batch is processed
• Instead of

    error.backward()

• run back-propagation on the accumulated error

    total_error.backward()

Training Data Batch

    x = torch.tensor([ [0.,0.], [1.,0.], [0.,1.], [1.,1.] ])
    t = torch.tensor([ 0., 1., 1., 0. ])

• Change to computation graph (input now a matrix, output a vector)

    s = x.mm(W) + b
    h = torch.nn.Sigmoid()(s)
    z = h.mv(W2) + b2
    y = torch.nn.Sigmoid()(z)

• Convert error vector into a single number

    error = 1/2 * (t - y) ** 2
    mean_error = error.mean()
    mean_error.backward()

Parameter Updates (Optimizer)

• Our code has explicit parameter update computations

    # weight updates
    W.data = W - mu * W.grad.data
    b.data = b - mu * b.grad.data
    W2.data = W2 - mu * W2.grad.data
    b2.data = b2 - mu * b2.grad.data

• But fancier optimizers are typically used (Adam, etc.)
• This requires a more complex implementation (a sketch with a built-in optimizer follows below)
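The following sketch (not from the slides) shows how such an off-the-shelf optimizer could replace the manual updates, reusing the parameters W, b, W2, b2 from the Some Data Types slide and the batch tensors x, t from above; the choice of Adam and the learning rate are illustrative assumptions.

    # assumes W, b, W2, b2 (requires_grad=True) and the batch tensors x, t defined above
    optimizer = torch.optim.Adam([W, b, W2, b2], lr=0.1)   # illustrative optimizer and learning rate

    for epoch in range(1000):
        optimizer.zero_grad()              # reset accumulated gradients
        s = x.mm(W) + b                    # forward pass over the whole batch
        h = torch.nn.Sigmoid()(s)
        z = h.mv(W2) + b2
        y = torch.nn.Sigmoid()(z)
        mean_error = (1/2 * (t - y) ** 2).mean()
        mean_error.backward()              # back-propagation
        optimizer.step()                   # optimizer applies the parameter updates

Compared to the manual version, the update rule lives entirely inside the optimizer object, so switching between SGD, Adam, or others is a one-line change.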
torch.nn.Module

• Neural network model is defined as a class derived from torch.nn.Module

    class ExampleNet(torch.nn.Module):
        def __init__(self):
            super(ExampleNet, self).__init__()
            self.layer1 = torch.nn.Linear(2,2)
            self.layer2 = torch.nn.Linear(2,1)
            self.layer1.weight = torch.nn.Parameter(torch.tensor([[3.,2.],[4.,3.]]))
            self.layer1.bias = torch.nn.Parameter(torch.tensor([-2.,-4.]))
            self.layer2.weight = torch.nn.Parameter(torch.tensor([[5.,-5.]]))
            self.layer2.bias = torch.nn.Parameter(torch.tensor([-2.]))

        def forward(self, x):
            s = self.layer1(x)
            h = torch.nn.Sigmoid()(s)
            z = self.layer2(h)
            y = torch.nn.Sigmoid()(z)
            return y

Optimizer Definition

• Instantiation of neural network object

    net = ExampleNet()

• Optimizer definition

    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

Training Loop

    for iteration in range(1000):
        optimizer.zero_grad()
        out = net.forward( x ).squeeze()   # squeeze (4,1) output to match t of shape (4,)
        error = 1/2 * (t - out) ** 2
        mean_error = error.mean()
        print("error: ", mean_error.data)
        mean_error.backward()
        optimizer.step()

code available on web page for textbook
http://www.statmt.org/nmt-book/
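As a final sanity check (not on the slides), the trained network can be evaluated on all four XOR inputs, reusing the batch tensor x defined earlier:

    # predictions for (0,0), (1,0), (0,1), (1,1) after training
    with torch.no_grad():
        print(net.forward(x).squeeze())    # should be close to 0, 1, 1, 0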