Deep Reinforcement Learning
Tomáš Brázdil, 2016
Based on V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).
Images: left, https://commons.wikimedia.org/wiki/File:Atari2600a.JPG; right, http://www.opobotics.com/.

Deterministic Markov Decision Processes
[Figure: a small deterministic MDP with three states s1, s2, s3, actions a and a′, and transition rewards 5, −3, 1, 10, and 4.]
- a set of states S,
- a set of actions A, where each state is assigned a set of enabled actions,
- a transition function δ : S × A → S,
- a reward function R : S × A → R.

Deterministic Markov Decision Processes
A policy π chooses actions based on the current state (illustrated on the example MDP above).
Notation:
- S_1, S_2, ... where S_t is the t-th visited state,
- A_1, A_2, ... where A_t is the t-th taken action,
- R_1, R_2, ... where R_t is the t-th obtained reward.

Encoding Atari games
States correspond to preprocessed screenshots.
- Original screenshots: 210 × 160 pixels in 128 colors.
- Preprocessing: rescale and crop to 80 × 80, convert to gray-scale, and combine the 4 most recent frames into a single state.
- The states: real vectors of dimension 80 × 80 × 4.
Actions correspond to actions of the player: joystick positions and fire buttons.
Rewards correspond to changes in the game score, squeezed into three values: 1 for a positive change, −1 for a negative change, and 0 otherwise.
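
To make the encoding concrete, here is a minimal Python sketch of the preprocessing described above (assuming OpenCV and NumPy; the function names are illustrative, and the exact crop/rescale details of the paper may differ):

import numpy as np
import cv2

def preprocess_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """210x160 RGB screenshot -> 80x80 gray-scale frame with values in [0, 1]."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (80, 80), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def make_state(last_4_frames) -> np.ndarray:
    """Stack the 4 most recent preprocessed frames into one 80x80x4 state."""
    return np.stack([preprocess_frame(f) for f in last_4_frames], axis=-1)

def clip_reward(score_change: float) -> int:
    """Squeeze a change in game score into the three values -1, 0, 1."""
    return int(np.sign(score_change))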
Return and Value Functions
Definition. The return G is the total discounted reward
  G = Σ_{k=0}^{∞} γ^k R_{k+1},
where 0 < γ < 1 is a discount factor.
Definition. The action-value function q_π(s, a) is the return obtained by starting from state s, taking action a, and then following π.
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies:
  q∗(s, a) = max_π q_π(s, a).
Theorem. Define a policy π∗ which in every s ∈ S chooses an action a ∈ A such that q∗(s, a) = max_{a′} q∗(s, a′). Then q_{π∗}(s, a) = q∗(s, a) for all s ∈ S and a ∈ A (i.e. π∗ is optimal).

Value Iteration
Bellman equation (Bellman, 1957):
  q∗(s, a) = R(s, a) + γ max_{a′} q∗(s′, a′), where s′ = δ(s, a).   (1)
The true optimal values q∗ form the unique solution of this equation.
Value iteration algorithm: start with q_0(s, a) = 0 for all s, a, and iteratively apply the right-hand side of (1):
  q_{k+1}(s, a) = R(s, a) + γ max_{a′} q_k(s′, a′), where s′ = δ(s, a).
Then q∗(s, a) = lim_{k→∞} q_k(s, a).

Deterministic Markov Decision Processes
On the example MDP above, value iteration proceeds as follows:

             q0    q1     q2
  (s1, a)    0     5      5 + γ·10
  (s1, a′)   0     −3     · · ·
  (s2, a)    0     1      · · ·
  (s2, a′)   0     10     · · ·
  (s3, a)    0     4      · · ·

For instance, q_2(s1, a) = R(s1, a) + γ·max{q_1(s2, a), q_1(s2, a′)} = 5 + γ·max{1, 10}.

Criticism
Minor issue: value iteration can be used only if the transition function δ is known.
Major issue: the state/action space is typically huge or infinite:
- Atari games: 128^(84×84×4) = 128^28224 possible states!
- Go: 10^170 states.
- Helicopter control: infinite!
We solve this problem in two steps:
1. Update our approximation of q∗ only for "relevant" state-action pairs (using reinforcement learning).
2. Represent our approximation of q∗ succinctly (using neural networks).

Reinforcement learning (roughly)
The problem: how to learn q∗?
In general:
- Start with a policy π̂ and an estimate Q of q∗.
- While unhappy with the result:
  - simulate π̂ and update the estimate Q based on "experience",
  - update the policy π̂ according to Q.
We need a good rule for learning from experience (exploit your choice of actions), and we need to go through the important parts of the state space (explore the state space).
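
Before moving on to Q-learning, here is a minimal Python sketch of the value-iteration recurrence from the Bellman-equation slide above; the MDP is passed in as dictionaries, the names and the discount factor are illustrative, and note that it needs the full transition function δ, which is exactly the "minor issue" just mentioned:

def value_iteration(delta, reward, actions, gamma=0.9, iterations=100):
    """Value iteration: delta[(s, a)] is the successor state, reward[(s, a)]
    the immediate reward, and actions[s] the actions enabled in s."""
    q = {sa: 0.0 for sa in delta}                    # q_0(s, a) = 0 for all s, a
    for _ in range(iterations):
        q = {(s, a): reward[(s, a)]
                     + gamma * max(q[(delta[(s, a)], b)]
                                   for b in actions[delta[(s, a)]])
             for (s, a) in delta}                    # apply the right-hand side of (1)
    return q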
Q-learning
For exploration, consider the ε-greedy (randomized) policy π̂:
- with probability 1 − ε, choose a = argmax_{a′} Q(s, a′),
- with probability ε, choose an action uniformly at random.
Q-learning algorithm: always follow π̂, and at every time step t update Q by
  Q(S_t, A_t) ← Q(S_t, A_t) + α_t (q∗(S_t, A_t) − Q(S_t, A_t)).
But we do not know q∗(S_t, A_t) ... so we employ a bootstrap estimate,
  q∗(S_t, A_t) ≈ R_t + γ max_a Q(S_{t+1}, a),
and obtain
  Q(S_t, A_t) ← Q(S_t, A_t) + α_t (R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)).
Theorem (Watkins & Dayan, 1992). If S is finite, α_t = 1/t, and every state-action pair is visited infinitely often, then each Q(s, a) converges to q∗(s, a).

Deterministic Markov Decision Processes
On the example MDP above (with a constant learning rate α), the first two Q-learning updates along a simulated run look as follows:

             t = 0   t = 1                  t = 2
  (s1, a)    0       0 + α(5 + γ·0 − 0)     α·5
  (s1, a′)   0       0                      0
  (s2, a)    0       0                      0 + α(1 + γ·α·5 − 0)
  (s2, a′)   0       0                      0
  (s3, a)    0       0                      0

using the update Q(S_t, A_t) ← Q(S_t, A_t) + α (R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)).

Q-learning with Function Approximation
The problem: how to represent Q?
- linear combinations of (manually created) features [typical],
- decision trees,
- SVMs,
- neural networks,
- ...
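
As a baseline, when the state space is small, Q can simply be stored in a table. Here is a minimal Python sketch of the tabular Q-learning algorithm from the previous slides, with illustrative names (the MDP is again given by dictionaries, and a constant learning rate is used for simplicity instead of α_t = 1/t):

import random
from collections import defaultdict

def q_learning(delta, reward, actions, s0, gamma=0.9, alpha=0.1,
               epsilon=0.1, steps=10_000):
    """Tabular Q-learning with an epsilon-greedy policy.
    delta[(s, a)] is the successor state, reward[(s, a)] the immediate reward."""
    Q = defaultdict(float)                 # Q(s, a), implicitly 0 everywhere
    s = s0
    for _ in range(steps):
        if random.random() < epsilon:      # explore: uniformly random action
            a = random.choice(actions[s])
        else:                              # exploit: greedy action w.r.t. Q
            a = max(actions[s], key=lambda b: Q[(s, b)])
        s_next, r = delta[(s, a)], reward[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in actions[s_next])
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # the update rule above
        s = s_next
    return Q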
Neural networks
A neural network is a directed graph of interconnected neurons.
[Figure: a single neuron with inputs x1, ..., xn, weights w1, ..., wn, inner potential ξ, and output y.]
- w1, ..., wn ∈ R are weights,
- x1, ..., xn ∈ R are inputs,
- ξ = Σ_{i=1}^{n} w_i x_i,
- y = σ(ξ).
[Figure: a feed-forward network with an input layer encoding the state s, hidden layers, and one output Q(s, a_i; W) per action a_i.]
- W denotes the weights of all neurons,
- Q(s, a; W) is the Q-value of (s, a) represented by the network.

Q-learning with Neural Networks
Q-learning algorithm: always follow the ε-greedy policy π̂, and at every time step t consider the transition (S_t, A_t, R_t, S_{t+1}):
- Freeze the current weights as W− and fix the "target" value τ := R_t + γ max_a Q(S_{t+1}, a; W−).
- Update the weights W so that Q(S_t, A_t; W) gets closer to τ:
    W ← W + α_t (τ − Q(S_t, A_t; W)) ∇_W Q(S_t, A_t; W).
How is the above rule derived? We want to adjust W to minimize the squared error (here τ is a constant!)
  L(W) = ½ (τ − Q(S_t, A_t; W))²
using gradient descent, W ← W − α ∇_W L(W), where
  ∇_W L(W) = ∇_W ½ (τ − Q(S_t, A_t; W))² = (τ − Q(S_t, A_t; W)) (−∇_W Q(S_t, A_t; W)).
The gradient ∇_W Q(S_t, A_t; W) can be computed using standard backpropagation.

Convolutional Networks
In image processing, the classical MLP has been superseded by convolutional networks.
First introduced in [LeCun et al., 1989d] for handwritten digit recognition.
Combined with powerful GPU-powered computers ⇒ breakthrough in image processing.
Image: D. Silver, UCL Course on RL, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
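
Putting the last two slides together, here is a Python sketch (using PyTorch) of a convolutional Q-network and of one update step with the rule derived above. The class and function names, the architecture, and the hyperparameters are illustrative, not necessarily those of the paper:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 80x80 frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 80x80 -> 19x19
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 19x19 -> 8x8
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):               # x has shape (batch, 4, 80, 80)
        return self.net(x)

def q_update(q_net, q_frozen, s, a, r, s_next, gamma=0.99, lr=1e-4):
    """One update step: move Q(s, a; W) towards tau = r + gamma * max_a' Q(s', a'; W-)."""
    with torch.no_grad():                                   # tau uses the frozen weights W-
        tau = r + gamma * q_frozen(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; W) for the taken actions
    loss = 0.5 * (tau - q_sa).pow(2).mean()                 # L(W) = 1/2 (tau - Q)^2
    q_net.zero_grad()
    loss.backward()                                         # backpropagation gives grad_W L(W)
    with torch.no_grad():
        for p in q_net.parameters():
            p -= lr * p.grad                                # W <- W - alpha * grad_W L(W)
    return loss.item()

Minimizing the squared error by plain gradient descent, as done here, is exactly the update rule W ← W + α (τ − Q) ∇_W Q from the slide; in practice a library optimizer would typically be used instead of the explicit parameter loop.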
DQN
Note that the Q-learning algorithm above adapts the weights at every step. This may be unstable: learning may be slow or may even diverge, since the training samples obtained along simulations are strongly correlated and their distribution changes over time.
Replay memory: store the history of (state, action, reward, newState) tuples in a memory M, and train the network on examples obtained by sampling from M.
Delayed target values: W− is a several-steps-old copy of the weights (previously, W− was the current weight vector).
Both adjustments considerably improve learning (see the results later).

Experiments
Training:
- 49 games, the same network architecture (trained separately for each game),
- ε-greedy strategy with ε annealed linearly from 1.0 to 0.1 over the first million frames,
- trained for 50 million frames (around 38 days of game experience in total).
Evaluation:
- each game played 30 times, for up to 5 minutes each time, with different initial random conditions,
- ε-greedy policy with ε = 0.05,
- a random agent selecting actions at 10 Hz used as a baseline,
- a professional human tester under the same emulator: average reward over 20 episodes of at most 5 minutes each, after around 2 hours of practice on each game.

Image: V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).

Results
Sarsa and Contingency are other reinforcement learning methods. HNeat Best and HNeat Pixel are methods based on evolutionary policy search; these methods use a hand-engineered object detector algorithm that outputs the locations and types of objects on the Atari screen.
Image: D. Silver, UCL Course on RL, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Conceptual Limitations (as opposed to humans)
Prior knowledge:
- Humans: a huge amount, such as intuitive physics and intuitive psychology.
- RL: starts from scratch, which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to start from scratch).
Abstraction and planning:
- Humans: build a rich, abstract model and plan within it.
- RL: brute force, where the correct actions are eventually discovered and internalized into a policy.
Experience acquisition:
- Humans: can figure out what is likely to give rewards without ever actually experiencing the rewarding transition.
- RL: has to actually experience a positive reward.
A. Karpathy, Deep Reinforcement Learning: Pong from Pixels, http://karpathy.github.io/2016/05/31/rl/
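
Before the conclusions, here is a compact Python sketch of the replay memory described in the DQN slide above; the class name, capacity, and batch size are illustrative (the delayed target weights W− would additionally be refreshed by copying the current weights every fixed number of updates):

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 32):
        """Uniformly sampled minibatch; breaks the correlation of consecutive steps."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)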
Conclusions
Current computers can learn to play old computer games at a (super)human level.
The main algorithms used in the solution:
- reinforcement learning,
- convolutional networks.
This is a very active area of research; several solutions better than DQN have recently been presented.

Image: V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015).