Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 3: Neural net learning: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)

NER: Binary classification for center word being location
• We do supervised training and want a high score if the center word is a location
• J_t(θ) = σ(s) = 1 / (1 + e^(−s))   (the predicted model probability of the class)
• x = [x_museums  x_in  x_Paris  x_are  x_amazing] ∈ ℝ^(5d)   (the concatenated embeddings of the one-hot words)
• f = some elementwise non-linear function, e.g., logistic, tanh, ReLU

7. Neural computation
• Original McCulloch & Pitts 1943 threshold unit: 𝟏(Wx > θ) = 𝟏(Wx − θ > 0)
• This function has no slope, so no gradient-based learning is possible

Non-linearities, old and new
• logistic (“sigmoid”), tanh, hard tanh, ReLU (a small NumPy sketch of these functions follows the lecture plan below)
• tanh is just a rescaled and shifted sigmoid (2× as steep, range [−1, 1]): tanh(z) = 2·logistic(2z) − 1
• Logistic and tanh are still used (e.g., logistic to get a probability)
• However, for deep networks, the first thing to try now is ReLU(z) = max(z, 0) (Rectified Linear Unit): it trains quickly and performs well due to good gradient backflow
• ReLU has a negative “dead zone” that more recent proposals, such as Leaky ReLU / Parametric ReLU, mitigate
• Swish (arXiv:1710.05941): swish(x) = x · logistic(x)
• GELU (arXiv:1606.08415): GELU(x) = x · P(X ≤ x), X ~ N(0,1) ≈ x · logistic(1.702x)
• GELU is frequently used with Transformers (BERT, RoBERTa, etc.)

Non-linearities (i.e., “f” on the previous slide): Why they’re needed
• Neural networks do function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can’t do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W₁W₂x = Wx
• But with more layers that include non-linearities, they can approximate any complex function!

Remember: Stochastic Gradient Descent
• Update equation, i.e., for each parameter: θ_j^new = θ_j^old − α · ∂J(θ)/∂θ_j^old   (α = step size or learning rate)
• In deep learning, θ includes the data representation (e.g., word vectors) too!
• How can we compute ∇_θ J(θ)?
  1. By hand
  2. Algorithmically: the backpropagation algorithm

Lecture Plan: Gradients by hand and algorithmically
1. Introduction (10 mins)
2. Matrix calculus (35 mins)
3. Backpropagation (35 mins)
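The non-linearity formulas above are easy to check numerically. Below is a minimal NumPy sketch (the function names are ours, chosen for illustration rather than taken from any particular library); it also verifies the tanh/logistic identity.

```python
import numpy as np

def logistic(z):          # the "sigmoid": 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):              # ReLU(z) = max(z, 0)
    return np.maximum(z, 0.0)

def swish(z):             # swish(x) = x * logistic(x)
    return z * logistic(z)

def gelu_approx(z):       # GELU(x) = x * P(X <= x), X ~ N(0,1), approx. x * logistic(1.702 x)
    return z * logistic(1.702 * z)

z = np.linspace(-3.0, 3.0, 7)
# tanh is a rescaled, shifted logistic (2x as steep, range [-1, 1])
assert np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1)
```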
Computing Gradients by Hand
• Matrix calculus: fully vectorized gradients
• “Multivariable calculus is just like single-variable calculus if you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for intuition; recall the first lecture for an example
• The lecture notes and matrix calculus notes cover this material in more detail
• You might also review Math 51, which has an online textbook: http://web.stanford.edu/class/math51/textbook.html

Gradients
• Given a function with 1 output and 1 input, e.g., f(x) = x³
• Its gradient (slope) is its derivative: df/dx = 3x²
• “How much will the output change if we change the input a bit?”
• At x = 1 the output changes about 3 times as much: 1.01³ ≈ 1.03
• At x = 4 it changes about 48 times as much: 4.01³ ≈ 64.48

Gradients
• Given a function with 1 output and n inputs, f(x) = f(x₁, x₂, …, xₙ)
• Its gradient is a vector of partial derivatives with respect to each input: ∂f/∂x = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]

Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs
• Its Jacobian is an m × n matrix of partial derivatives: (∂f/∂x)_ij = ∂f_i/∂x_j

Chain Rule
• For composition of one-variable functions: multiply derivatives
• For functions of multiple variables: multiply Jacobians

Example Jacobian: Elementwise activation function
• h = f(z), with f applied elementwise, has n outputs and n inputs → an n × n Jacobian
• Since h_i depends only on z_i, the Jacobian is diagonal: (∂h/∂z)_ij = f′(z_i) if i = j and 0 otherwise, i.e., ∂h/∂z = diag(f′(z))

Other Jacobians
• Compute these at home for practice! Check your answers with the lecture notes
• ∂/∂x (Wx + b) = W
• ∂/∂b (Wx + b) = I   (the identity matrix)
• ∂/∂u (uᵀh) = hᵀ
• Fine print: hᵀ is the correct Jacobian. Later we discuss the “shape convention”; using it the answer would be h.

Back to our Neural Net!
• x = [x_museums  x_in  x_Paris  x_are  x_amazing]
• Let’s find ∂s/∂b
• Really, we care about the gradient of the loss J_t, but we will compute the gradient of the score s for simplicity

1. Break up equations into simple pieces
• s = uᵀh,  h = f(z),  z = Wx + b,  x is the input
• Carefully define your variables and keep track of their dimensionality!

2. Apply the chain rule
• ∂s/∂b = (∂s/∂h)(∂h/∂z)(∂z/∂b)

3. Write out the Jacobians
• Useful Jacobians from the previous slides: ∂s/∂h = uᵀ,  ∂h/∂z = diag(f′(z)),  ∂z/∂b = I
• So ∂s/∂b = uᵀ diag(f′(z)) I = uᵀ diag(f′(z)), i.e., (u ⊙ f′(z))ᵀ as a row vector
• ⊙ = Hadamard product = element-wise multiplication of 2 vectors to give a vector

Re-using Computation
• Suppose we now want to compute ∂s/∂W
• Using the chain rule again: ∂s/∂W = (∂s/∂h)(∂h/∂z)(∂z/∂W)
• The first two factors are the same as before! Let’s avoid duplicated computation by defining δ = (∂s/∂h)(∂h/∂z) = uᵀ diag(f′(z))
• δ is the upstream gradient (“error signal”) — see the sketch below
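As a sanity check on the Jacobian-form result ∂s/∂b = uᵀ diag(f′(z)), here is a minimal NumPy sketch that compares it against a finite-difference estimate. The sizes, the choice f = logistic, and all variable names are assumptions made for illustration only.

```python
import numpy as np

# Toy network from the slides: s = u^T h,  h = f(z),  z = W x + b
rng = np.random.default_rng(0)
n, m = 4, 6                                   # hidden size n, input size m
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

f  = lambda z: 1.0 / (1.0 + np.exp(-z))       # logistic, applied elementwise
df = lambda z: f(z) * (1.0 - f(z))            # its elementwise derivative

def score(b_vec):
    z = W @ x + b_vec
    return u @ f(z)

# Analytic Jacobian (row vector): ds/db = u^T diag(f'(z)) = (u * f'(z))^T
z = W @ x + b
delta = u * df(z)                             # the "error signal" at z

# Central finite differences, one component of b at a time
eps = 1e-6
numeric = np.array([(score(b + eps * e) - score(b - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(delta, numeric, atol=1e-6)
```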
Derivative with respect to Matrix: Output shape
• What does ∂s/∂W look like?
• 1 output and n·m inputs: a 1 × nm Jacobian?
• That would be inconvenient for the update θ^new = θ^old − α ∇_θ J(θ)
• Instead, we leave pure math and use the shape convention: the shape of the gradient is the shape of the parameters!
• So ∂s/∂W is n × m, like W

Derivative with respect to Matrix
• What is ∂s/∂W?
• δ = ∂s/∂z is going to be in our answer
• The other term should be x, because z = Wx + b
• Answer: ∂s/∂W = δᵀ xᵀ, where δ is the upstream gradient (“error signal”) at z and x is the local input signal (a numerical check of this formula appears after the computation-graph slides below)

Why the Transposes?
• Hacky answer: this makes the dimensions work out!
• Useful trick for checking your work!
• Full explanation in the lecture notes
• Each input goes to each output – you want to get the outer product

Deriving the local input gradient in backprop
• For ∂z/∂W in our equation: ∂s/∂W = δ ∂z/∂W = δ ∂/∂W (Wx + b)
• Let’s consider the derivative of a single weight W_ij
• W_ij only contributes to z_i; for example, W_23 is only used to compute z_2, not z_1
[Figure: a small network with inputs x_1, x_2, x_3 and a bias unit (+1), hidden units h_1 = f(z_1) and h_2 = f(z_2), and output s, highlighting the weight W_23, the bias b_2, and the output weight u_2]
• ∂z_i/∂W_ij = ∂/∂W_ij (W_i· x + b_i) = ∂/∂W_ij Σ_k W_ik x_k = x_j

What shape should derivatives be?
• Similarly, ∂s/∂b = uᵀ diag(f′(z)) is a row vector
• But the shape convention says our gradient should be a column vector, because b is a column vector …
• Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
• We expect answers in the assignment to follow the shape convention
• But Jacobian form is useful for computing the answers

What shape should derivatives be? Two options for working through specific problems:
1. Use Jacobian form as much as possible, and reshape to follow the shape convention at the end
   • This is what we just did; at the end, transpose the row-vector Jacobian to get a column-vector gradient
2. Always follow the shape convention
   • Look at dimensions to figure out when to transpose and/or reorder terms
   • The error message δ that arrives at a hidden layer has the same dimensionality as that hidden layer

3. Backpropagation
• We’ve almost shown you backpropagation
• It’s taking derivatives and using the (generalized, multivariate, or matrix) chain rule
• Other trick: we re-use derivatives computed for higher layers when computing derivatives for lower layers, to minimize computation

Computation Graphs and Backpropagation
• Software represents our neural net equations as a graph
• Source nodes: inputs
• Interior nodes: operations
• Edges pass along the result of the operation
• Computing values forward along the graph is “Forward Propagation”

Backpropagation
• Then go backwards along the edges
• Pass along gradients
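The numerical check referred to above: a minimal NumPy sketch verifying that, under the shape convention, ∂s/∂W = δᵀxᵀ is the n × m outer product of δ and x and agrees with a finite-difference estimate. Again, the sizes, names, and the choice f = logistic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 6
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

f  = lambda z: 1.0 / (1.0 + np.exp(-z))
df = lambda z: f(z) * (1.0 - f(z))

def score(W_mat):
    return u @ f(W_mat @ x + b)

delta = u * df(W @ x + b)          # upstream "error signal" at z, length n
dW = np.outer(delta, x)            # delta^T x^T: outer product, same shape as W
assert dW.shape == W.shape         # the shape convention

# Finite-difference check of a single entry, e.g. W[2, 3] (it only affects z_2)
eps = 1e-6
E = np.zeros_like(W); E[2, 3] = eps
numeric = (score(W + E) - score(W - E)) / (2 * eps)
assert np.isclose(dW[2, 3], numeric, atol=1e-6)
```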
Backpropagation: Single Node
• A node receives an “upstream gradient”; the goal is to pass on the correct “downstream gradient”
• Each node has a local gradient: the gradient of its output with respect to its input
• Chain rule: [downstream gradient] = [upstream gradient] × [local gradient]
• What about nodes with multiple inputs (e.g., *)? Multiple inputs → multiple local gradients, and one downstream gradient per input

An Example
• f(x, y, z) = (x + y) · max(y, z), evaluated at x = 1, y = 2, z = 0
• Forward prop steps: a = x + y = 3, b = max(y, z) = 2, f = a · b = 6
• Local gradients: ∂f/∂a = b = 2, ∂f/∂b = a = 3; ∂a/∂x = 1, ∂a/∂y = 1; ∂b/∂y = 1 (since y > z), ∂b/∂z = 0
• Backward pass, upstream × local = downstream:
  • at the output: 1
  • through *: ∂f/∂b = 1 × 3 = 3, ∂f/∂a = 1 × 2 = 2
  • through max: contribution to y = 3 × 1 = 3, ∂f/∂z = 3 × 0 = 0
  • through +: ∂f/∂x = 2 × 1 = 2, contribution to y = 2 × 1 = 2
• Gradients sum at outward branches: ∂f/∂y = 3 + 2 = 5

Node Intuitions
• + “distributes” the upstream gradient to each summand
• max “routes” the upstream gradient (only the larger input gets it)
• * “switches” the upstream gradient (each input receives it multiplied by the other input’s value)

Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
  • First compute the gradient with respect to one input
  • Then independently compute the gradient with respect to another input
  • Duplicated computation!
• Correct way: compute all the gradients at once
• Analogous to using δ when we computed gradients by hand

Back-Prop in General Computation Graph
• The graph has a single scalar output; a node’s gradient is computed using the gradients of its successors
1. Fprop: visit nodes in topological sort order; compute the value of each node given its predecessors
2. Bprop: initialize the output gradient = 1; visit nodes in reverse order, computing the gradient wrt each node using the gradients wrt its successors
• Done correctly, the big O() complexity of fprop and bprop is the same
• In general, our nets have a regular layer structure, so we can use matrices and Jacobians

Automatic Differentiation
• The gradient computation can be automatically inferred from the symbolic expression of the fprop
• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output
• Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you, but mainly leave the layer/node writer to hand-calculate the local derivative

Backprop Implementations

Implementation: forward/backward API
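A minimal sketch of what such a forward/backward API might look like, run on the worked example f(x, y, z) = (x + y) · max(y, z) from earlier. The class names and structure here are illustrative assumptions, not the course's actual implementation or any framework's API.

```python
class Add:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a + b
    def backward(self, upstream):
        # + "distributes" the upstream gradient to each summand
        return upstream * 1.0, upstream * 1.0

class Mul:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    def backward(self, upstream):
        # * "switches": each input gets upstream times the other input's value
        return upstream * self.b, upstream * self.a

class Max:
    def forward(self, a, b):
        self.a, self.b = a, b
        return max(a, b)
    def backward(self, upstream):
        # max "routes" the upstream gradient to the larger input
        return (upstream, 0.0) if self.a >= self.b else (0.0, upstream)

# f(x, y, z) = (x + y) * max(y, z) at x = 1, y = 2, z = 0
add, mx, mul = Add(), Max(), Mul()
x, y, z = 1.0, 2.0, 0.0
a = add.forward(x, y)                       # a = 3
b = mx.forward(y, z)                        # b = 2
f = mul.forward(a, b)                       # f = 6

df_da, df_db = mul.backward(1.0)            # 2, 3
df_dx, dy_from_add = add.backward(df_da)    # 2, 2
dy_from_max, df_dz = mx.backward(df_db)     # 3, 0
df_dy = dy_from_add + dy_from_max           # gradients sum at outward branches: 5

print(f, df_dx, df_dy, df_dz)               # 6.0 2.0 5.0 0.0
```

Each backward() call multiplies the upstream gradient by that node's local gradients, so the snippet is just the [downstream] = [upstream] × [local] rule applied node by node in reverse topological order.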
Summary
We’ve mastered the core technology of neural nets! 🎉 🎉 🎉
• Backpropagation: recursively (and hence efficiently) apply the chain rule along the computation graph
• [downstream gradient] = [upstream gradient] × [local gradient]
• Forward pass: compute results of operations and save intermediate values
• Backward pass: apply the chain rule to compute gradients