Deep Reinforcement Learning
Tomáš Brázdil, 2016
Based on V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).
Images: left, https://commons.wikimedia.org/wiki/File:Atari2600a.JPG; right, http://www.opobotics.com/.

Deterministic Markov Decision Processes
[Figure: a small deterministic MDP with three states s1, s2, s3, actions a and a′, and transition rewards 5, −3, 1, 10, and 4.]
- a set of states S,
- a set of actions A, where each state is assigned a set of enabled actions,
- a transition function δ : S × A → S,
- a reward function R : S × A → R.

Deterministic Markov Decision Processes
A policy π chooses actions based on the current state (illustrated on the example MDP above).
Notation:
- S_1, S_2, ... where S_t is the t-th visited state,
- A_1, A_2, ... where A_t is the t-th taken action,
- R_1, R_2, ... where R_t is the t-th obtained reward.

Encoding Atari games
States correspond to preprocessed screenshots.
- Original screenshots: 210 × 160 pixels in 128 colors.
- Preprocessing: rescale and crop to 80 × 80, convert to gray-scale, and combine the 4 most recent frames into a single state.
- The states: real vectors of dimension 80 × 80 × 4.
Actions correspond to actions of the player: joystick positions and fire buttons.
Rewards correspond to changes in the game score, squeezed into three values: 1 for a positive change, −1 for a negative change, and 0 otherwise.
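
To make the encoding concrete, here is a minimal Python sketch of the preprocessing described above (assuming OpenCV and NumPy; the function names are illustrative, and the exact crop/rescale details of the paper may differ):

import numpy as np
import cv2

def preprocess_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """210x160 RGB screenshot -> 80x80 gray-scale frame with values in [0, 1]."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (80, 80), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def make_state(last_4_frames) -> np.ndarray:
    """Stack the 4 most recent preprocessed frames into one 80x80x4 state."""
    return np.stack([preprocess_frame(f) for f in last_4_frames], axis=-1)

def clip_reward(score_change: float) -> int:
    """Squeeze a change in game score into the three values -1, 0, 1."""
    return int(np.sign(score_change))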
Return and Value Functions
Definition. The return G is the total discounted reward
  G = Σ_{k=0}^{∞} γ^k R_{k+1},
where 0 < γ < 1 is a discount factor.
Definition. The action-value function q_π(s, a) is the return obtained by starting from state s, taking action a, and then following π.
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies:
  q∗(s, a) = max_π q_π(s, a).
Theorem. Define a policy π∗ which in every s ∈ S chooses an action a ∈ A such that q∗(s, a) = max_{a′} q∗(s, a′). Then q_{π∗}(s, a) = q∗(s, a) for all s ∈ S and a ∈ A (i.e. π∗ is optimal).

Value Iteration
Bellman equation (Bellman, 1957):
  q∗(s, a) = R(s, a) + γ max_{a′} q∗(s′, a′), where s′ = δ(s, a).   (1)
The true optimal values q∗ form the unique solution of this equation.
Value iteration algorithm: start with q_0(s, a) = 0 for all s, a, and iteratively apply the right-hand side of (1):
  q_{k+1}(s, a) = R(s, a) + γ max_{a′} q_k(s′, a′), where s′ = δ(s, a).
Then q∗(s, a) = lim_{k→∞} q_k(s, a).

Deterministic Markov Decision Processes
On the example MDP above, value iteration proceeds as follows:

             q0    q1     q2
  (s1, a)    0     5      5 + γ·10
  (s1, a′)   0     −3     · · ·
  (s2, a)    0     1      · · ·
  (s2, a′)   0     10     · · ·
  (s3, a)    0     4      · · ·

For instance, q_2(s1, a) = R(s1, a) + γ·max{q_1(s2, a), q_1(s2, a′)} = 5 + γ·max{1, 10}.

Criticism
Minor issue: value iteration can be used only if the transition function δ is known.
Major issue: the state/action space is typically huge or infinite:
- Atari games: 128^(84×84×4) = 128^28224 possible states!
- Go: 10^170 states.
- Helicopter control: infinite!
We solve this problem in two steps:
1. Update our approximation of q∗ only for "relevant" state-action pairs (using reinforcement learning).
2. Represent our approximation of q∗ succinctly (using neural networks).

Reinforcement learning (roughly)
The problem: how to learn q∗?
In general:
- Start with a policy π̂ and an estimate Q of q∗.
- While unhappy with the result:
  - simulate π̂ and update the estimate Q based on "experience",
  - update the policy π̂ according to Q.
We need a good rule for learning from experience (exploit your choice of actions), and we need to go through the important parts of the state space (explore the state space).
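
Before moving on to Q-learning, here is a minimal Python sketch of the value-iteration recurrence from the Bellman-equation slide above; the MDP is passed in as dictionaries, the names and the discount factor are illustrative, and note that it needs the full transition function δ, which is exactly the "minor issue" just mentioned:

def value_iteration(delta, reward, actions, gamma=0.9, iterations=100):
    """Value iteration: delta[(s, a)] is the successor state, reward[(s, a)]
    the immediate reward, and actions[s] the actions enabled in s."""
    q = {sa: 0.0 for sa in delta}                    # q_0(s, a) = 0 for all s, a
    for _ in range(iterations):
        q = {(s, a): reward[(s, a)]
                     + gamma * max(q[(delta[(s, a)], b)]
                                   for b in actions[delta[(s, a)]])
             for (s, a) in delta}                    # apply the right-hand side of (1)
    return q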
Q-learning
For exploration, consider the ε-greedy (randomized) policy π̂:
- with probability 1 − ε, choose a = argmax_{a′} Q(s, a′),
- with probability ε, choose an action uniformly at random.
Q-learning algorithm: always follow π̂, and at every time step t update Q by
  Q(S_t, A_t) ← Q(S_t, A_t) + α_t (q∗(S_t, A_t) − Q(S_t, A_t)).
But we do not know q∗(S_t, A_t) ... so we employ a bootstrap estimate,
  q∗(S_t, A_t) ≈ R_t + γ max_a Q(S_{t+1}, a),
and obtain
  Q(S_t, A_t) ← Q(S_t, A_t) + α_t (R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)).
Theorem (Watkins & Dayan, 1992). If S is finite, α_t = 1/t, and every state-action pair is visited infinitely often, then each Q(s, a) converges to q∗(s, a).

Deterministic Markov Decision Processes
On the example MDP above (with a constant learning rate α), the first two Q-learning updates along a simulated run look as follows:

             t = 0   t = 1                  t = 2
  (s1, a)    0       0 + α(5 + γ·0 − 0)     α·5
  (s1, a′)   0       0                      0
  (s2, a)    0       0                      0 + α(1 + γ·α·5 − 0)
  (s2, a′)   0       0                      0
  (s3, a)    0       0                      0

using the update Q(S_t, A_t) ← Q(S_t, A_t) + α (R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)).

Q-learning with Function Approximation
The problem: how to represent Q?
- linear combinations of (manually created) features [typical],
- decision trees,
- SVMs,
- neural networks,
- ...
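
As a baseline, when the state space is small, Q can simply be stored in a table. Here is a minimal Python sketch of the tabular Q-learning algorithm from the previous slides, with illustrative names (the MDP is again given by dictionaries, and a constant learning rate is used for simplicity instead of α_t = 1/t):

import random
from collections import defaultdict

def q_learning(delta, reward, actions, s0, gamma=0.9, alpha=0.1,
               epsilon=0.1, steps=10_000):
    """Tabular Q-learning with an epsilon-greedy policy.
    delta[(s, a)] is the successor state, reward[(s, a)] the immediate reward."""
    Q = defaultdict(float)                 # Q(s, a), implicitly 0 everywhere
    s = s0
    for _ in range(steps):
        if random.random() < epsilon:      # explore: uniformly random action
            a = random.choice(actions[s])
        else:                              # exploit: greedy action w.r.t. Q
            a = max(actions[s], key=lambda b: Q[(s, b)])
        s_next, r = delta[(s, a)], reward[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in actions[s_next])
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # the update rule above
        s = s_next
    return Q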
Neural networks
A neural network is a directed graph of interconnected neurons.
[Figure: a single neuron with inputs x1, ..., xn, weights w1, ..., wn, inner potential ξ, and output y.]
- w1, ..., wn ∈ R are weights,
- x1, ..., xn ∈ R are inputs,
- ξ = Σ_{i=1}^{n} w_i x_i,
- y = σ(ξ).
[Figure: a feed-forward network with an input layer encoding the state s, hidden layers, and one output Q(s, a_i; W) per action a_i.]
- W denotes the weights of all neurons,
- Q(s, a; W) is the Q-value of (s, a) represented by the network.

Q-learning with Neural Networks
Q-learning algorithm: always follow the ε-greedy policy π̂, and at every time step t consider the transition (S_t, A_t, R_t, S_{t+1}):
- Freeze the current weights as W− and fix the "target" value τ := R_t + γ max_a Q(S_{t+1}, a; W−).
- Update the weights W so that Q(S_t, A_t; W) gets closer to τ:
    W ← W + α_t (τ − Q(S_t, A_t; W)) ∇_W Q(S_t, A_t; W).
How is the above rule derived? We want to adjust W to minimize the squared error (here τ is a constant!)
  L(W) = ½ (τ − Q(S_t, A_t; W))²
using gradient descent, W ← W − α ∇_W L(W), where
  ∇_W L(W) = ∇_W ½ (τ − Q(S_t, A_t; W))² = (τ − Q(S_t, A_t; W)) (−∇_W Q(S_t, A_t; W)).
The gradient ∇_W Q(S_t, A_t; W) can be computed using standard backpropagation.

Convolutional Networks
In image processing, the classical MLP has been superseded by convolutional networks.
First introduced in [LeCun et al., 1989d] for handwritten digit recognition.
Combined with powerful GPU-powered computers ⇒ breakthrough in image processing.
Image: D. Silver, UCL Course on RL, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
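
Putting the last two slides together, here is a Python sketch (using PyTorch) of a convolutional Q-network and of one update step with the rule derived above. The class and function names, the architecture, and the hyperparameters are illustrative, not necessarily those of the paper:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 80x80 frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 80x80 -> 19x19
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 19x19 -> 8x8
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):               # x has shape (batch, 4, 80, 80)
        return self.net(x)

def q_update(q_net, q_frozen, s, a, r, s_next, gamma=0.99, lr=1e-4):
    """One update step: move Q(s, a; W) towards tau = r + gamma * max_a' Q(s', a'; W-)."""
    with torch.no_grad():                                   # tau uses the frozen weights W-
        tau = r + gamma * q_frozen(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; W) for the taken actions
    loss = 0.5 * (tau - q_sa).pow(2).mean()                 # L(W) = 1/2 (tau - Q)^2
    q_net.zero_grad()
    loss.backward()                                         # backpropagation gives grad_W L(W)
    with torch.no_grad():
        for p in q_net.parameters():
            p -= lr * p.grad                                # W <- W - alpha * grad_W L(W)
    return loss.item()

Minimizing the squared error by plain gradient descent, as done here, is exactly the update rule W ← W + α (τ − Q) ∇_W Q from the slide; in practice a library optimizer would typically be used instead of the explicit parameter loop.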
DQN
Note that the Q-learning algorithm above adapts the weights at every step. This may be unstable: learning may be slow or may even diverge, since the training samples obtained along simulations are strongly correlated and their distribution changes over time.
Replay memory: store the history of (state, action, reward, newState) tuples in a memory M, and train the network on examples obtained by sampling from M.
Delayed target values: W− is a several-steps-old copy of the weights (previously, W− was the current weight vector).
Both adjustments considerably improve learning (see the results later).

Experiments
Training:
- 49 games, the same network architecture (trained separately for each game),
- ε-greedy strategy with ε annealed linearly from 1.0 to 0.1 over the first million frames,
- trained for 50 million frames (around 38 days of game experience in total).
Evaluation:
- each game played 30 times, for up to 5 minutes each time, with different initial random conditions,
- ε-greedy policy with ε = 0.05,
- a random agent selecting actions at 10 Hz used as a baseline,
- a professional human tester under the same emulator: average reward over 20 episodes of at most 5 minutes each, after around 2 hours of practice on each game.

Image: V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).

Results
Sarsa and Contingency are other reinforcement learning methods. HNeat Best and HNeat Pixel are methods based on evolutionary policy search; these methods use a hand-engineered object detector algorithm that outputs the locations and types of objects on the Atari screen.
Image: D. Silver, UCL Course on RL, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Conceptual Limitations (as opposed to humans)
Prior knowledge:
- Humans: a huge amount, such as intuitive physics and intuitive psychology.
- RL: starts from scratch, which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to start from scratch).
Abstraction and planning:
- Humans: build a rich, abstract model and plan within it.
- RL: brute force, where the correct actions are eventually discovered and internalized into a policy.
Experience acquisition:
- Humans: can figure out what is likely to give rewards without ever actually experiencing the rewarding transition.
- RL: has to actually experience a positive reward.
A. Karpathy, Deep Reinforcement Learning: Pong from Pixels, http://karpathy.github.io/2016/05/31/rl/
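
Before the conclusions, here is a compact Python sketch of the replay memory described in the DQN slide above; the class name, capacity, and batch size are illustrative (the delayed target weights W− would additionally be refreshed by copying the current weights every fixed number of updates):

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 32):
        """Uniformly sampled minibatch; breaks the correlation of consecutive steps."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)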
Conclusions
Current computers can learn to play old computer games at a (super)human level.
The main algorithms used in the solution:
- reinforcement learning,
- convolutional networks.
This is a very active area of research; several solutions better than DQN have recently been presented.

Image: V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).