PA230: Reinforcement Learning Petr Novotn´y “Good and evil, reward and punishment, are the only motives to a rational creature; these are the spur and reins whereby all mankind are set on work and guided.” John Locke, Some Thoughts Concerning Education (1693) 1/294 Organizational Information General • Lecture: Thursdays 2-3:40p.m. • Homework: see the interactive syllabus in IS • mainly binary classification (accepted/not accepted) • all your homeworks need to be marked as passed to proceed to exam • can (but do not have to) be done in pairs (pairs can differ across the individual assignments) • for those who passed, the teacher will receive feedback on the general quality of the solutions for each student - can be taken into account when determining the final grade (typically in students’ favor) • Exam: • oral • each attempt counts ? (unlike the Br´azdil system) • in general, knowledge of anything mentioned on the slides can be required, unless explicitly marked with “nex” (like the Br´azdil system) 2/294 Team • Lecturer: Petr Novotn´y • HW team: Martin Kureˇcka V´aclav Nevyhoˇstˇen´y V´ıt Unˇcovsk´y 3/294 Communication Official discord server: https://discord.gg/9mxTgYhcdB • Official communication forum of the course: falls under the university ethical guidelines. • Use your real name for posting (you can set-up an account under your IS email if necessary). 4/294 Reading • Compulsory: • these slides, • material explicitly prescribed by these slides (not much). • Recommended: • Sutton & Barto: Reinforcement Learning: An Introduction (2nd ed.), available at http://incompleteideas.net/book/RLbook2020.pdf • henceforth referenced as “S&B” • slides by David Silver https://www.davidsilver.uk/teaching/ • CMU slides https://www.andrew.cmu.edu/course/10-703/ • more specific literature recommendations will be given for each topic later 5/294 Reinforcement Learning: What, Why, When, How, & Other Questions Types of machine learning • unsupervised • spot ”useful” patterns in data • supervised • given labeled data, predict labels on unlabeled data • reinforcement • agents and decision-making • agency = “the ability to take action or to choose what action to take” (Cambridge Dictionary) 6/294 General RL scheme source: Sutton&Barto, p. 48 Keywords: sequential, dynamic, subject to uncertainty 7/294 RL: Objective and approach • Objective: Design a decision policy (= agent behavior) which prescribes to the agent how to act in different situations (states), typically so as to achieve some goal. • Approach: Start with (± random) behavior and adapt it based on past experience via the law of effect: • actions with good/bad consequences for the agent are more/less likely to be repeated by the agent (within the same context) 8/294 RL in psychology (nex) I.P. Pavlov (1849-1936) classical conditioning E. Thorndike (1874-1949) law of effect J.B. Watson (1878-1958) behaviorist manifesto B.F. Skinner (1904-1990) radical behaviorism, reinforcement, rewards 9/294 Learning by trying & XX dilemma Underlying the RL approach is the idea of learning by trying: • first, act more or less randomly (exploration) • integral part of early human development • continually adapt behavior according to experience and feedback from the environment (exploitation) • strength of feedback ≈ strength of behavior adaptation Balancing exploration and exploitation (XX) is a recurring theme in RL. 10/294 Incomplete history of RL in computer science I “Learning by trying” machines and software, ad hoc approaches: A. 
Turing (1912-1954) 1948: theoretical “pleasure & pain” sys- tem to train computers C. Shannon (1916-2001) 1950: Theseus maze-solving mouse M. Minsky (1927-2016) 1950s: analog neural net machines (SNARCS) And many more. . . Recommended: S&B: Sec. 1.7. 11/294 Incomplete history of RL in computer science II Mathematical foundations of sequential decision making: R. Bellman (1920-1984) R. Howard (b. 1934) • Formalization via Markov decision processes (MDPs) • value iteration (attributed to Bellman, 1957) • policy iteration (attributed to Howard, 1960) 12/294 Incomplete history of RL in computer science III Since late 1980’s: synthesis – learning by trial in MDPs R. Sutton A. Barto Temporal difference learning C. Watkins Q-learning . . . and many more. 13/294 Successes of RL (nex) 14/294 Words of caution (and controversy) (nex) 15/294 Mathematical Foundations of Sequential Decision-Making MDP Example MDP with actions, rewards and transition probabilities. start next passed sleep study: -2 1 study: -2 1 done: 20 1 done: 0 1 procrast.: 4 1 leave: -1 1 2 1 2 procrast.: -1 1 pub: +4 0.3 0.4 0.3 16/294 Markov Decision Process Given a set X, we denote D(X) the set of all probability distributions over X. Definition 1 A Markov decision process (MDP) is a tuple (S, A, p, r) where • S is a set of states, • A is a set of actions, • p: S × A → D(S) is a probabilistic transition function, • r : S × A → R is a reward function. We will shorten p(s, a)(s′ ) to p(s′ | s, a). The p, r can be partial functions: action a is enabled in state s if both p(s, a) and r(s, a) are defined. We denote by A(s) the set of all actions enabled in s. 17/294 Dynamics of MDPs • start in some initial state s0 • MDP evolves in discrete time steps t = 0, 1, 2, 3, . . . • in each time step t, let st be the current state; then: • agent selects action at ∈ A(st ) • the environment responds with next state st+1 ∼ p(st , at ) and with immediate reward rt+1 = r(st , at ) • t is incremented and the process repeats in the same fashion forever Thus, the agent produces a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .. τ is produced randomly (due to p and possibly also agent choices being probabilistic): it is a random variable and so are its components: we define random variables • St = state at time step t • At = action at time step t • Rt = reward received just before entering St 18/294 Policies Definition 2 A history is a finite prefix of a trajectory ending in a state, i.e., an object of type s0, a0, r1, s1, a1, r2, . . . , at−1, rt, st ∈ (S · A · R)∗ S. we denote by last(h) the last state of a history h. Definition 3 A policy is a function π: (S · A · R)∗ S → D(A) which to each history h assigns a probability distribution over A(last(h)). A policy is by definition an infinite object! 19/294 MD policies Definition 4 A policy π is: • memoryless if π(h) = π(h′ ) whenever last(h) = last(h′ ) (we can view memoryless policies as objects of type π: S → D(A)); • deterministic if π(h) always assigns probability 1 to one action, and zero to all others (we can view det. policies of objects of type π: (S · A · R)∗ S → A). Definition 5 A policy π is MD (memoryless deterministic) if it is both memoryless and deterministic. 20/294 Dynamics of MDPs (more precise) Given a distribution I of initial states and a policy π • start in some initial state s0 ∼ I • MDP evolves in discrete time steps t = 0, 1, 2, 3, . . . • in each time step t, let ht be the history produced so far; then: • agent selects action at ∈ A(st ) according to π, i.e. 
at ∼ π(ht ) • the environment responds with next state st+1 ∼ p(st , at ) and with immediate reward rt+1 = r(st , at ); the history is extended by at , rt+1, st+1 • t is incremented and the process repeats in the same fashion forever 21/294 Probability space induced by a policy In particular, each policy π together with a distribution I of initial states induces a probability measure Pπ over the trajectories of the MDP.1 We denote by Eπ the associated expected value (expectation) operator. We denote by Pπ [E | S0 = s] the probability of event E provided that the initial state is fixed to s (and similarly for expectations). Exercise 6 In the “study” MDP, consider an MD policy π s.t. π(start) = study and π(next) = pub. Compute the following quantities: • Pπ [visit pub at least twice | S0 = start] • Pπ [visit pub exactly twice | S0 = start] • Eπ [R1] • Eπ [R3] 1I is typically known from the context and hence omitted from the notation 22/294 Memorylessness In this course, we will almost exclusively focus on memoryless policies. Hence, from now on, policy = memoryless policy. General policies will be referred to as history-dependent policies should the need arise. Why memoryless? Intuition: the Markov property of MDPs: the next step depends only on the current state and on the action performed in the current step. Hence, intuitively, there is no need for a policy to remember the past in order to “play well”. The sufficiency of memoryless policies does not extend to more general/complex decision-making settings (not covered in this course), such as: • partially observable MDPs • non-stationary environments • quantile/risk-aware MDPs, etc. 23/294 Returns (payoffs) Definition 7 Let γ ∈ [0, 1) be a discount factor. For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . we define the discounted return (or payoff) of τ to be the quantity G(τ) = r1 + γ·r2 + γ²·r3 + γ³·r4 + · · · = Σ_{i=0}^{∞} γ^i · ri+1. Equivalently, G = Σ_{i=0}^{∞} γ^i · Ri+1. 24/294 Returns (variants) • Finite horizon (FH): additionally, we are given a decision horizon H ∈ N ∪ {∞}. The return is counted only up to step H: GH = Σ_{i=0}^{H−1} γ^i · Ri+1 For finite H, the discount factor can be 1. H = ∞ corresponds to the original definition. • Episodic returns: In episodic tasks, there is a distinguished set Term ⊆ S of terminal states which is guaranteed to be reached with probability 1 under any policy. We denote by T a random variable denoting the first point in time when we hit a terminal state. We count rewards only up to that time: GT = Σ_{i=0}^{T−1} γ^i · Ri+1 Can be modeled under the original definition by “sink” states. 25/294 Types of returns: discussion • We will typically omit the superscripts since the type of task considered will be known from the context. • We have GH → G (pointwise) as H → ∞. I.e., finite-horizon returns with high enough H approximate the standard (infinite-horizon) case. • In the real world, we typically deal with FH or episodic tasks: we cannot wait an infinite time to learn something from a trajectory. However, the infinite-horizon case can be viewed as a neat mathematical abstraction of FH & episodic tasks, and the classical sequential decision-making theory is most developed for the infinite-horizon case. 26/294 Policy and state values Definition 8 Let π be a policy and s a state. The value of π in state s is the quantity vπ (s) = Eπ [G | S0 = s]. Exercise 9 Discuss the values of MD policies in our running example. Definition 10 The (optimal) value of state s is the quantity v∗ (s) = sup π vπ (s).
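To make the dynamics and Definition 7 concrete, here is a minimal Python sketch (not part of the original slides) that samples a trajectory from a memoryless policy and accumulates its discounted return; the toy MDP and all its numbers are illustrative inventions, not the exact “study” example.

```python
import random

# Illustrative toy MDP: p[s][a] = [(s', prob), ...], rew[s][a] = r(s, a);
# a state with no enabled actions is treated as terminal.
p = {"start": {"work": [("next", 0.5), ("start", 0.5)], "rest": [("start", 1.0)]},
     "next":  {"finish": [("done", 1.0)]},
     "done":  {}}
rew = {"start": {"work": -2.0, "rest": 1.0}, "next": {"finish": 20.0}}

def sample_episode(policy, s0="start", max_steps=100):
    """Roll out a memoryless policy: a_t ~ pi(s_t), s_{t+1} ~ p(. | s_t, a_t)."""
    traj, s = [], s0
    for _ in range(max_steps):
        if not p[s]:                                  # terminal state reached
            break
        a = policy(s)
        s_next = random.choices(*zip(*p[s][a]))[0]    # sample successor state
        traj.append((s, a, rew[s][a]))                # store (s_t, a_t, r_{t+1})
        s = s_next
    return traj

def discounted_return(traj, gamma=0.9):
    """G(tau) = sum_{i >= 0} gamma^i * r_{i+1}  (Definition 7)."""
    return sum(gamma ** i * r for i, (_, _, r) in enumerate(traj))

pi = {"start": "work", "next": "finish"}   # an MD policy as a lookup table
print(discounted_return(sample_episode(pi.get)))
```

Averaging `discounted_return` over many episodes sampled from the same initial state gives a naive estimate of vπ(s0); this is exactly the Monte Carlo idea developed in the model-free part of the course.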
27/294 Optimality Definition 11 Let π be a policy and ε > 0. We say that π is ε-optimal in state s if vπ (s) ≥ v∗ (s) − ε. We say that π is optimal in s is it is 0-optimal in s, i.e. if vπ (s) = v∗ (s). A policy is (ε-)optimal if it is (ε-)optimal in every state. 28/294 Existence of optimal policies Theorem 12: (Classical result, not formally proven here) Let M be a finite MDP (i.e., the state and action sets are finite) with infinite-horizon returns. Then there exists an optimal MD policy. Moreover, an optimal MD policy can be computed in polynomial time. Agent control solved? NO! “Only” works if you can actually construct the MDP model of your environment and fit it into a computer. Otherwise, we use reinforcement learning. 29/294 Exact Planning with Known Model: Value & Policy Iteration Goal of this lecture Algorithms that compute the optimal value vector v∗ and some optimal MD policy π∗ given a full knowledge of an MDP M. 30/294 Polynomial-time algorithm MDPs can be solved by linear programming (LP) maximize ⃗c · ⃗x subject to A · ⃗x ≤ ⃗b • LP can be solved in polynomial time by so-called interior-point algorithms. • However, we typically use other, MDP-specific algorithms: value iteration (VI) and policy iteration (PI). These are not polynomial-time in general, but typically faster on practical instances. • Moreover, most truly RL algorithms can be seen as approximate generalizations of VI or PI (or both). 31/294 Example: Policy evaluation Exercise 13 Consider all four MD policies in our running “pub or study” example that always try to quit when in X state. Compute the values of these policies in the initial state start. 32/294 Policy evaluation equations Theorem 14 For any memoryless policy π and any state s it holds: vπ (s) = a∈A(s) π(a|s) · r(s, a) + γ · s′∈S p(s′ |s, a) · vπ (s′ ) def =qπ(s,a) . 33/294 Bellman optimality equations Theorem 15 The following holds for any state s: v∗ (s) = max a∈A(s) r(s, a) + γ · s′∈S p(s′ |s, a) · v∗ (s′ ) def =q∗(s,a) • Note: the policy evaluation equations are a special case of the Bellman ones: given a policy π, we can consider an MDP Mπ in which there is a single action ∗ enabled in each state and the probability of transition s ∗ → s′ equals a∈A(s) π(a|s) · p(s′ |s, a). Then Mπ mimics the behavior of π in M and Bellman eq’s in Mπ = evaluation equations for π in M. • But these equations are no longer linear! How do we solve them? Is the solution even unique? 34/294 Bellman update operator The right-hand-side (RHS) of the Bellman equations can be viewed as an operator Φ: RS → RS : for any ⃗x ∈ RS , Φ(⃗x) is a vector such that for any state s: Φ(⃗x)(s) def = max a∈A(s) r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗x(s′ ) Exercise 16 In our running example, compute Φ(⃗0). Theorem 15 says that the optimal value vector v∗ is a fixed point of Φ: v∗ = Φ(v∗ ). 35/294 Mathematical hammers for Bellman Lemma 17: (not proven here) For any discount factor γ ∈ [0, 1), the Bellman operator Φ is a contraction, i.e. for any pair of vectors ⃗x, ⃗y it holds ∥Φ(⃗x) − Φ(⃗y)∥∞ ≤ γ · ∥⃗x − ⃗y∥∞. Theorem 18: Banach fixed point theorem (classical calculus, not proven here) A contraction mapping from a complete metric space (in particular, Rn ) to itself has a unique fixed point. 36/294 Exact characterization of v∗ Corollary 19 The optimal value vector is a unique solution of the Bellman optimality equations. In particular, also the policy evaluation equations have a unique solution, equal to vπ . 
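Corollary 19 can be exercised directly in code: for a fixed memoryless policy, the evaluation equations form a linear system, so vπ is obtained by a single direct solve. A minimal sketch, assuming a tabular model stored in NumPy arrays (P[s, a, s'] = p(s'|s, a), R[s, a] = r(s, a), Pi[s, a] = π(a|s); these array names are illustrative, not from the slides):

```python
import numpy as np

def evaluate_policy(P, R, Pi, gamma=0.9):
    """Solve the policy evaluation equations v = r_pi + gamma * P_pi v exactly,
    i.e. the linear system (I - gamma * P_pi) v = r_pi."""
    n = P.shape[0]
    P_pi = np.einsum("sa,sat->st", Pi, P)   # state-to-state transition matrix under pi
    r_pi = np.einsum("sa,sa->s", Pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```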
Since the policy evaluation equations are linear, their solution can be computed by Gaussian elimination. But the general Bellman equations are not linear. How can ve solve them? 37/294 Banach fixpoint theorem (full) Theorem 20: Banach fixed point theorem (full version, not proven here) A contraction mapping Φ from a complete metric space (in particular, Rn ) to itself has a unique fixed point ⃗z. Moreover, ⃗z is the limit of iterative applications of Φ on any initial vector. I.e., for any ⃗x0 ∈ Rn , the sequence ⃗x0, Φ(⃗x0), Φ(Φ(⃗x0)), Φ(3) (⃗x0), . . . converges to ⃗z: z = lim i→∞ Φ(i) (⃗x0) 38/294 Value iteration (VI; Bellman, 1957) Algorithm 1: Value iteration Input: MDP M = (S, A, p, r) Output: Approximation ˜v of v∗ x ← any vector from R|S| ; // typically ⃗0 next ← x; repeat foreach s ∈ S do next(s) ← max a∈A(s) [r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗x(s′ )] Φ(x)(s) ; x ← next until termination condition; Typical term. conditions: • after a fixed no. of iterations (i.e., use for-loop with a fixed bound) • after each component of x changes less then some given ε 39/294 How to use VI By the Banach fixpoint theorem (and Lemma 17), the value of variable x VI converges to v∗ . Can we recognize when is x “close enough” to ⃗x? In the following couple of theorems, let ⃗x0, ⃗x1, ⃗x2, . . . be the sequence of vectors computed by VI, i.e. ⃗x0 is arbitrary and ⃗xi+1 = Φ(⃗xi ) for all i ≥ 0. Theorem 21: Stopping condition (not proven here) For any ε > 0: if ∥⃗xi+1 − ⃗xi ∥∞ ≤ ε · 1 − γ γ , then ∥⃗xi+1 − v∗ ∥∞ ≤ ε 40/294 How to use VI (2) How fast can we get to the point where we are close enough? Theorem 22: Speed of convergence (not proven here) For all i ≥ 0 it holds ∥⃗xn − v∗ ∥∞ ≤ γn 1 − γ · ∥⃗x1 − ⃗x0∥∞. In particular, if we terminate VI after i =     log(ε) + log 1−γ ∥⃗x1−⃗x0∥∞ log(γ)     steps, then its output xi will be an ε-approximation of v∗ . 41/294 How to use VI (3) Can we actually get some optimal values instead of approximations? First, note that VI computes optimal finite-horizon values: Let vi = supπ Eπ [ H i=1 γi−1 · Ri ]. The supremum is over all (i.e., history dependent) policies, since in the FH problem an optimal policy needs to track the number of elapsed (and thus remaining) steps: memory is needed for that. Theorem 23: (Easy but important exercise) If ⃗x0 = ⃗0, then ⃗xH = vH for all H ≥ 0. Moreover, let πH be a deterministic history-dependent policy such that for all 1 ≤ i ≤ H, whenever there are i steps remaining till the horizon, the policy πH selects in state s an action a s.t. a = arg max a∈A(s) [r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗xi−1(s′ )] (with ties broken arbitrarily). Then πH is an optimal H-step policy. 42/294 Greedy policies Can we actually get optimal policy for the inf. horizon problem? Definition 24: ⃗x-greedy policy (very important) Let ⃗x ∈ RS be any vector. A ⃗x-greedy policy is an MD policy π such that in any state s: π(s) = arg max a∈A(s) [r(s, a) + γ · s′∈S p(s′ |s, a) · ⃗x(s′ )]. 43/294 How to use VI (4) Theorem 25: Optimal inf.-horizon policy from VI (not proven here) There is a number N polynomial in size of the MDP and exponential in the binary encoding size of γ such that a policy π that is ⃗xN -greedy is optimal in every state, i.e. vπ = v∗ . Note that once π is computed, vπ can be computed in polynomial time via policy evaluation equations. 
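The ingredients above (the Bellman operator, the stopping rule of Theorem 21, and greedy policy extraction as in Definition 24) assemble into a short VI routine; a sketch over the same hypothetical tabular arrays P[s, a, s'] and R[s, a] as in the evaluation sketch:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Iterate Phi until ||x_{i+1} - x_i||_inf <= eps * (1 - gamma) / gamma
    (Theorem 21), which guarantees ||x_{i+1} - v*||_inf <= eps."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v      # q[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) * x(s')
        v_new = q.max(axis=1)      # Phi(x)
        if np.max(np.abs(v_new - v)) <= eps * (1 - gamma) / gamma:
            return v_new, q.argmax(axis=1)   # eps-accurate values and an x-greedy MD policy
        v = v_new
```

Ties in `argmax` are broken arbitrarily (here: lowest action index).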
Hence, VI can be said to solve MDPs in exponential time (and in polynomial time if the discount factor is assumed to be a fixed constant instead of an input parameter), though the approximate version is typically used in practice. Note: the fact that some policy π is ⃗x-greedy does not mean that vπ ≥ ⃗x! Homework: find a counterexample and post it to Discord. However, for VI it can be shown that if ∥⃗xi+1 − ⃗xi ∥∞ ≤ ε · 1−γ γ (stopping condition from Theorem 21), then an ⃗xi+1-greedy policy is ε-optimal. 44/294 Policy improvement start next passed sleep study: -2 1 study: -2 1 done: 20 1 done: 0 1 procrast.: 4 1 leave: -1 1 2 1 2 procrast.: -1 1 pub: +4 0.3 0.4 0.3 vπ = ( ) let π′ be vπ -greedy vπ′ = ( ) let π′′ be vπ′ -greedy 45/294 Policy improvement theorem Theorem 26: Policy improvement Let π be a policy. If Φ(vπ ) ≥ vπ , then any vπ -greedy policy πg is at least as good as π, i.e. ∀s ∈ S : vπg (s) ≥ vπ (s). Moreover, if Φ(vπ )(s) > vπ (s) for some state s, then also vπg (s′ ) > vπ (s′ ) for some state s′ . 46/294 Returns from a given time step For the proof of PIT and also many times later, we will need the following notation: Definition 27: Important! Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t ∈ N a time step. We define Gt(τ) = H−1 i=t γi−t · ri+1 = rt+1 + γrt+2 + γ2 rt+3 + · · · , where H ∈ N ∪ {∞} or H = T for episodic tasks. We similarly define, for any policy π: Gπ t = Eπ [Gt] = Eπ [ H−1 i=t γi−t · Ri+1]. 47/294 Proof of PIT (setup) We will define a sequence of policies π0, π1, π2, . . . s.t.: • π0 = π • πi behaves as πg (i.e., selects the same actions in same states) for the first i steps, then “switches” back to behave as π: • we also define π∞ = πg We want: vπ∞ (s) ≥ vπ (s) for all s. Not hard to see: vπi → vπ∞ as i → ∞ (πi behaves as π∞ for longer and longer as i increases + discounting). It suffices to show: vπi (s) ≥ vπ (s) for all i ∈ N and all s ∈ S. 48/294 Proof of PIT (induction) vπi (s) ≥ vπ (s) for all i ∈ N and all s ∈ S • i = 0: clear • i > 0: vπi (s) = Eπi [R1 + γR2 + · · · + γi−1 Ri + γi Ri+1 + · · · | S0 = s] = Eπi [R1 + γR2 + · · · + γi−2 Ri−1 | S0 = s] + Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] = Eπi−1 [R1 + γR2 + · · · + γi−2 Ri−1 | S0 = s] + Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] = Eπi−1 [R1 + γR2 + · · · + γi−2 Ri−1 | S0 = s] + Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] Suppose we prove ≥Eπi−1 [γi−1 Ri +γi Ri+1···|S0=s] ≥ Eπi−1 [R1 + γR2 + · · · + γi−2 Ri−1 + γi−1 Ri + γi Ri+1 · · · | S0 = s] IH ≥ vπ (s) 49/294 Proof of PIT (induction, behavior at “reset”) We need: Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] ≥ Eπi−1 [γi−1 Ri + γi Ri+1 · · · | S0 = s]. Eπi [γi−1 Ri + γi Ri+1 · · · | S0 = s] = γi−1 · Eπi [Ri + γRi+1 · · · | S0 = s] = γi−1 · s′∈S Pπi [Si−1 = s′ | S0 = s] · r(s′ , πi (s′ )) + γ · s′′ p(s′′ | s′ , πi (s′ )) · Eπi [Gi | Si = s′′ ] = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] vπ(s′′) = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] vπ(s′′) =Φ(vπ), since πg is vπ-greedy = γi−1 · s′∈S Pπg [Si−1 = s′ | S0 = s] · r(s′ , πg (s′ )) + γ · s′′ p(s′′ | s′ , πg (s′ )) · Eπ [Gi | Si = s′′ ] vπ(s′′) 50/294 Policy iteration (PI; Howard, 1960) Algorithm 2: Policy iteration Input: MDP M = (S, A, p, r) Output: Optimal MD policy π∗ for M, its value vector v∗ π ← arbitrary MD policy ; v ← vπ ; // e.g. 
by solving linear policy evaluation equations while Φ(v) ̸= v do foreach s ∈ S do π(s) ← arg maxa∈A(s)[r(s, a) + γ · s′∈S p(s′ |s, a) · v(s′ )] v ← vπ return π, v 51/294 PI correctness & complexity Theorem 28 Policy iteration terminates after at most exponentially many iterations. Upon termination, it returns an optimal MD policy. Proof: • Optimal upon termination: vπ is a fixpoint of Φ when terminating: optimality follows from Corollary 19. • Terminates: π always stores an MD policy and there are finitely many of these. We will show that no single MD policy appears in more than one iteration of PI. Consider any iteration and let v, v′ be the contents of variable v before and after the iteration. We will show that unless Φ(v) = v, it holds v′ > v, i.e. v′ ≥ v componentwise with strict inequality in some component. Hence, v = vπ strictly increases during PI, so no π can appear twice. 52/294 PI: correctness proof v′ ≥ v: We verify assumptions of PIT: Φ(v) ≥ v. Recall v = vπ . For all s ∈ S: Φ(v)(s) = max a∈A(s) [r(s, a) + γ · s′∈S p(s′ | s, a) · v(s′ )] ≥ r(s, π(s)) + γ · s′∈S p(s′ | s, π(s)) · vπ (s′ ) = vπ (s) = v(s). By PIT, v′ = vπ′ ≥ vπ = v (here π′ is the v-greedy policy). 53/294 PI: correctness proof II It remains to prove that v′ > v or PI terminates. Assume that v′ = v. Then for all s ∈ S: v′ (s) = r(s, π′ (s)) + γ · s′∈S p(s′ | s, π′ (s)) · vπ′ (s′ ) = r(s, π′ (s)) + γ · s′∈S p(s′ | s, π′ (s)) · vπ (s′ ) (assumption) = max a∈A(s) [r(s, a) + γ · s′∈S p(s′ | s, a) · vπ (s′ )] (π′ is v = vπ -greedy) = max a∈A(s) [r(s, a) + γ · s′∈S p(s′ | s, a) · vπ′ (s′ )] (assumption) = Φ(v′ )(s), so PI terminates at this point. Complexity? 54/294 PI&VI: discussion • We know that MDPs have a linear programming (LP) formulation. PI is basically a variant of a simplex method for this LP, using a special pivoting rule. • PI typicaly requires less iterations to converge than VI, though each iteration is more expensive (policy eval.) • Both PI and VI typically work well in practice for MDPs whose explicit transition table fits inside a computer. Which of the two is faster is rather domain-specific. 55/294 PI variants Can we get rid of the expensive policy evaluation by linear system solving? Yes: we can approximate the value of the current policy π by applying VI on the MDP Mπ , for either fixed number of steps or until v does not change much. Often appearing in RL textbooks: 56/294 Policy iteration with approximate evaluation π ← arbitrary MD policy; v ← arbitrary vector; repeat v ← Eval(π, v); foreach s ∈ S do π(s) ← arg maxa∈A(s)[r(s, a) + γ · s′∈S p(s′ |s, a) · v(s′ )] until π has not changed; return π, v Function Eval(π, v): v′ ← v; repeat foreach s ∈ S do v(s) ← v′ (s); v′ (s) ← r(s, π(s)) + γ · s′∈S p(s′ | s, π′ (s)) · v(s′ ) until ∥v − v′ ∥∞ ≤ ε; return v′ 57/294 Convergence of PI variants • The algorithm on previous slide still converges to an optimal policy provided that ε is small enough. • If we replaced the “π not changed condition” with the original “Φ(v) = v” condition, the algorithm might not terminate, since the VI is only guaranteed to reach a true fixpoint in the limit. However, v would still converge to v∗ and thus π would eventually become equal to an optimal policy. • The previous point holds even in the very degenerate case when we do just one iteration of VI per policy evaluation! See next slide. 
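As a reference point for these approximate variants, exact PI (Algorithm 2) is short when written against the hypothetical tabular arrays P[s, a, s'] and R[s, a] used in the earlier sketches, with each evaluation done by a linear solve:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Exact PI: evaluate the current MD policy by solving the linear evaluation
    equations, then make it v-greedy; stop once Phi(v) = v."""
    n_s = R.shape[0]
    pi = np.zeros(n_s, dtype=int)                # arbitrary MD policy
    while True:
        P_pi = P[np.arange(n_s), pi]             # rows p(. | s, pi(s)), shape (S, S)
        r_pi = R[np.arange(n_s), pi]
        v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)   # v = v_pi
        q = R + gamma * P @ v
        if np.allclose(q.max(axis=1), v):        # Phi(v) = v  =>  pi is optimal
            return pi, v
        pi = q.argmax(axis=1)                    # improvement: switch to a v_pi-greedy policy
```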
58/294 Curiously looking approximate PI v ← arbitrary vector; π ← v-greedy MD policy ; repeat foreach s ∈ S do v′ (s) ← r(s, π(s)) + γ · s′∈S p(s′ |s, π(s)) · v(s′ ); v ← v′ ; foreach s ∈ S do π(s) ← arg maxa∈A(s)[r(s, a) + γ · s′∈S p(s′ |s, a) · v(s′ )] until Φ(v) = v; return π, v This is just VI in disguise! 59/294 Generalized policy iteration Evaluate Improve push v towards vπ make π (more) v-greedy Source: Sutton&Barto, p. 87 60/294 Tabular Methods for Model-Free Reinforcement Learning Model-free We will still be working with MDPs. But for a bunch of the following lectures, we will not (necessarily) have access to, e.g.: • a table containing explicit enumeration of all states/actions • a table containing the description of p or r • the ability to compute the probability vector δ(s, a) or the reward signal r(s, a) given s and a (having this = gray-box model of the MDP) 61/294 Sampling from MDP But the MDP is still there “behind the scene”. In particular, we: • know how the states of the MDP look like • (e.g. robot state = all possible output values of its sensors) • know how the actions of the MDP look like • (e.g. robot = all possible signals that can be sent to the actuators) • can, for any s ∈ S, enumerate A(s) • could be weakened, but simplifies things • given s ∈ S and a ∈ A(s), we can sample the next state s′ ∼ p(s, a) and receive the reward r(s, a). 62/294 Sampling from a policy Given an effective representation of a policy π, we can sample a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . by performing, for each t ∈ {0, . . . , T}: • sample at ∼ π(st) • query the environment for st+1 ∼ p(st, at) and rt+1 = r(st, at) • increment t Tabular = value estimates and policies represented as tables (e.g. Q(s, a) for each state s and action a used in s – explicit representation might only be needed for states/actions actually encountered). 63/294 Basic classification of (tabular) RL algorithms Three independent axes: problem on/off updates policy evaluation (value prediction) on-policy Monte Carlo vs. vs.       control off-policy temporal difference 64/294 Assumptions: successor-dependent rewards & episodic tasks Since we do no longer have the knowledge of the transition dynamics p, we cannot freely interchange MDPs with rewards functions of type S × A × S → R and S × A → R via the equation r(s, a) = s′∈S p(s′ | s, a) · r(s, a, s′ ). Hence, to maintain generality (and correspondence to e.g. Gymnasium environments) we will assume reward functions of type S × A × S → R. We will assume episodic returns: each trajectory terminates with probability 1 at some (possibly random) time step T. Termination can be defined e.g. by reaching some terminal state or by running out of some fixed decision horizon (in Gymnasium, this is sometimes called truncation): G = T−1 i=0 γi · Ri+1. Episode = one high-level iteration of an RL algorithm, corresponding of sampling a single trajectory from some policy. 65/294 Monte Carlo Methods MC evaluation Policy evaluation: given an effective representation of a policy π, estimate vπ (or qπ ). Naive Monte Carlo: Sample from π: if {τ1, τ2, . . . , τn} are trajectories (episodes) independently sampled under π from the same initial state s, then 1 n n i=1 G(τi ) → vπ (s) as n → ∞ due to law of large numbers (LLN). But this throws away a lot of valuable information! E.g. what if we want to estimate the whole vπ ? 
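The sampling interface described above maps directly onto the Gymnasium loop mentioned earlier; a minimal sketch (the FrozenLake-v1 environment and the uniformly random policy are placeholders), with truncation ending the episode:

```python
import gymnasium as gym

def rollout(env, policy, gamma=0.99):
    """Sample one episode from a policy; return the trajectory and its discounted return G."""
    s, _ = env.reset()
    episode, G, discount = [], 0.0, 1.0
    while True:
        a = policy(s)                                       # a_t ~ pi(s_t)
        s_next, r, terminated, truncated, _ = env.step(a)   # s_{t+1} ~ p(s_t, a_t), r_{t+1}
        episode.append((s, a, r))
        G += discount * r
        discount *= gamma
        if terminated or truncated:                         # episode ends at time T
            return episode, G
        s = s_next

env = gym.make("FrozenLake-v1")
episode, G = rollout(env, lambda s: env.action_space.sample())
```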
66/294 First-visit MC For each s, we estimate vπ (s) as an average of sample returns Ret(s) which is formed as follows: • initially, Ret(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state s, we identify the first occurrence of s on τ: let this be at timestep t; we add Gt (τ) to Ret(s) start X start next pass. sleep 4 -1 -2 -2 20 Sub-trajectory starting at the first appearance of s can be seen as a trajectory sampled from π when s is the initial state! (Since we consider memoryless π.) Theorem 29 As |Ret(s)| → ∞, the average of Ret(s) converges to vπ (s). Moreover, the average of Ret(s) is an unbiased estimate of vπ (s) (as long as Ret(s) ̸= ∅). 67/294 First-visit MC (pseudocode) Source: Sutton&Barto, p.92 68/294 Every-visit MC For each s, we estimate vπ (s) as an average of sample returns Ret(s) which is formed as follows: • initially, Ret(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state s, and each t such that St (τ) = s we add Gt (τ) to Ret(s) start X start next pass. sleep 2 -1 -2 -2 20 The sample returns added to Ret(s) within the same episode are not independent! Hence, the estimate is biased, though the bias vanishes in the limit: Theorem 30 As |Ret(s)| → ∞, the average of Ret(s) converges to vπ (s). 69/294 MC convergence Optional reading: More on MC estimate bias, variance, and convergence in: Singh, S.P. and Sutton, R.S.: Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning 22:123–158. Kluwer, 1996. (Section 3, particularly 3.3 and onwards, you can skip Theorem 4.) 70/294 Towards MC control Control = computation of “good” policy for a given environment. (Ideally, the policy should get closer to the optimal policy the more episodes we sample.) We know (PIT): given a policy π a vπ -greedy policy is at least as good as π: πg (s) = arg max a∈A(s) s′∈S p(s′ | s, a) · r(s, a, s′ ) + γ · vπ (s′ ) Do we have an algo? There is an issue: 71/294 MC control with q-values Recall: qπ (s, a) def = s′∈S p(s′ | s, a) · r(s, a, s′ ) + γ · vπ (s′ ). Thus, the vπ -greedy policy πg can be defined as: πg (s) = arg max a∈A(s) qπ (s, a) Estimate by MC. 72/294 MC for q-value estimation Analogous to value estimation, e.g. first-visit: For each s, a, we estimate qπ (s, a) as an average of sample returns Ret(s, a) which is formed as follows: • initially, Ret(s, a) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state-action pair (s, a), we identify the first t such that St (τ) = s ∧ At (τ) = a; we add Gt (τ) to Ret(s) start X start next pass. sleep proc.:4 study:-1 study:-2 done:-2 done:20 Similarly for every visit. Convergence guarantees the same as for state values. 73/294 Infinite exploration and exploring starts Issue: MC only estimates qπ (s, a) if: • s guaranteed to be visited with positive probability in each episode • π(a | s) > 0. Definition 31: Infinite exploration A RL algorithm has infinite exploration (IE) if, during the infinite execution of the algorithm, each state-action pair (s, a) is visited infinitely often with probability 1. One way of achieving IE is through exploring starts (ES): each episode begins with (typically uniformly) randomly selected s0 and a0. This is achievable when training, e.g., in simulated environments but might be difficult/impossible in real-world environments. 
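First-visit MC evaluation as code (the every-visit variant differs only in the commented line); a sketch assuming `sample_episode()` returns a list of (s, a, r) triples as in the rollout sketch above:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, gamma=0.99, n_episodes=10_000):
    """Estimate v_pi(s) as the average of returns observed from the first
    occurrence of s in each sampled episode."""
    returns = defaultdict(list)                        # Ret(s)
    for _ in range(n_episodes):
        episode = sample_episode()
        G, G_t = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):      # G_t = r_{t+1} + gamma * G_{t+1}
            G = episode[t][2] + gamma * G
            G_t[t] = G
        first = {}
        for t, (s, _, _) in enumerate(episode):
            first.setdefault(s, t)                     # remember first occurrence only
        for s, t in first.items():
            returns[s].append(G_t[t])
        # every-visit MC: instead append G_t[t] for *every* t with S_t = s
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

The q-value version is identical except that returns are keyed by the pair (s, a).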
74/294 MC control with exploring starts Source: Sutton&Barto, p.99 75/294 IE through ε-soft policies Exploring starts are not always feasible. Alternative: make the sampled policy itself exploratory. Definition 32: ε-soft policy A policy π is ε-soft if for every s ∈ S and every a ∈ A(s) it holds π(a|s) ≥ ε |A(s)| . Definition 33: ε-greedy policy Let v ∈ RS be a value vector. A policy π is v-ε-greedy if for every state s ∈ S there is action a∗ = arg maxa∈A(s) s′∈S p(s′ | s, a) · r(s, a, s′ ) + γ · v(s′ ) such that for any action a ∈ A(s) it holds: π(a|s) =    ε A(s) if a ̸= a∗ 1 − ε + ε A(s) if a = a∗ . Interpretation: with prob. ε: play uniformly at random; with prob. 1 − ε: play greedily. 76/294 ε-softing Definition 34 Let π be a policy. An ε-softing of π is a policy πε defined as follows: in each state s • with probability ε, πε selects an action uniformly at random; • with probability 1 − ε, πε selects a ∼ π(s). I.e., an ε-greedy policy can be alternatively defined as ε-softing of a greedy policy. 77/294 MC control with ε-greedy policies source: Sutton&Barto, p. 101 78/294 Policy iteration for ε-soft policies Theorem 35 Let π be an ε-soft policy and let π′ be a vπ -ε-greedy policy. Than vπ′ ≥ vπ (componentwise). Moreover, the two value vectors are equal if and only if bot π and π′ are optimal among all ε-soft policies; i.e. if, for every state s: vπ (s) = sup ¯π that is ε-soft v ¯π (s). Proof: Required reading: Sutton&Barto, p.101-103. 79/294 Incremental computing of averages Given a sample {n1, n2, . . . , nk+1} and average A = avg({n1, n2, . . . , nk }), how to compute A′ = avg({n1, n2, . . . , nk , nk+1}) without recomputing the average of the whole sample? A′ = k k + 1 · A + nk+1 k + 1 . 80/294 On-policy vs. off-policy • On-policy algorithms: track one “policy variable” π; the policy stored in π is used to interact with the environment (i.e., to sample episodes) and at the same time we learn something about it (e.g. its value vector). • Corresponds to the generalized policy iteration scheme. • All the MC algos we have seen so far. • Off-policy algorithms: track more (typically two) different policy variables: • behavior policy: used to sample episodes • target policy: which we want to learn about 81/294 Off-policy evaluation We are given effective representations of: • a behavior policy β, • a target policy π. The task is to estimate vπ by sampling episodes from β . We cannot sample from π! (E.g. π too risky or expensive to sample from.) Assumptions: • given (s, a), we can effectively compute π(a|s) and β(a|s) (or at least estimate via sampling) • coverage: ∀s ∈ S, a ∈ A(s): if π(a|s) > 0, then also β(a|s) > 0 82/294 Importance sampling Definition 36: Importance ratio Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. The importance-sampling ratio of τ is the quantity ρ(τ) def = Pπ [τ | S0 = s0] Pβ[τ | S0 = s0] = Pπ [A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0] Pβ[A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0] . ρ(τ) can be computed without the knowledge of MDP transition probabilities! ρ(τ) = π(a0 | s0) · p(s1 | s0, a0) · π(a1 | s1) · p(s2 | s1, a1) · · · β(a0 | s0) · p(s1 | s0, a0) · β(a1 | s1) · p(s2 | s1, a1) · · · 83/294 Importance ratio from time t Definition 37 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. By τi..j we denote the subtrajectory of τ starting in time step i and ending in timestep j. By τi.. we denote the suffix of si , ai , ri+1, si+1, ai+1, . . .. 
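Definition 36 in code: because the transition probabilities cancel, ρ can be computed from the trajectory and the two policies alone. In this sketch, `pi_prob` and `beta_prob` are hypothetical callables returning π(a|s) and β(a|s), and the `t` argument anticipates the time-shifted ratio defined next:

```python
def importance_ratio(episode, pi_prob, beta_prob, t=0):
    """rho_t(tau) = prod_{i >= t} pi(a_i | s_i) / beta(a_i | s_i);
    the factors p(s' | s, a) cancel, so no model knowledge is needed."""
    rho = 1.0
    for s, a, _ in episode[t:]:
        rho *= pi_prob(a, s) / beta_prob(a, s)   # coverage guarantees beta(a|s) > 0
    return rho
```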
Definition 38: Importance ratio from time t Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t a time step. The importance-sampling ratio of τ from t is the quantity ρt(τ) def = Pπ [τt.. | S0 = st] Pβ[τt.. | S0 = st] = Pπ [A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st] Pβ[A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st] . 84/294 Off-policy evaluation with importance sampling Theorem 39 For any s ∈ S it holds: Eβ [ρ · G | S0 = s] = vπ (s). Proof: Eβ [ρ · G | S0 = s] = τ Pβ [τ | S0 = s] · ρ(τ) · G(τ) = τ Pβ [τ | S0 = s] · Pπ [τ | S0 = s] Pβ[τ | S0 = s] · G(τ) = τ Pπ [τ | S0 = s] · G(τ) = Eπ [G | S0 = s] = vπ (s). Easily integrates into both first-visit and every visit MC: sample from β and store ρt(τ) · Gt(τ) in Ret(st). 85/294 Weighted importance sampling First-visit variant: for each state s, we keep a set of samples Sam(s). Each sample is a tuple (τ, t) – trajectory and time step. • initially, Sam(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ and each state s, and the smallest t such that St (τ) = s we add (τ, t) to Sam(s) Throughout the algorithm, the value of state s is estimated as WIS(s) = (τ,t)∈Sam(s) ρt(τ) · Gt(τ) (τ,t)∈Sam(s) ρt(τ) Exercise 40 Compare ordinary/weighted importance sampling after single sample. 86/294 Weighted importance sampling – correctness The weighted sampling is clearly a biased estimator. However, the bias vanishes in the limit: Theorem 41 With probability 1: as |Sam(s)| → ∞, we have that WIS(s) → vπ (s). Proof: 87/294 Ordinary vs. weighted sampling source: Sutton&Barto, p. 106 88/294 Importance sampling: summary But ordinary and weighted importance sampling can be adapted to every-visit MC. Bias & Convergence: • First visit: • ordinary IS: unbiased, i.e. also converges • weighted IS: biased, but converges in the limit • Every visit: • both ordinary and weighted: biased (due to EV), but converges in the limit 89/294 Weighted IS: incremental implementation Instead of recomputing the weighted average for each new sample, WIS(s) can be updated by keeping keep just two variables: • V – current value of WIS(s), initially arbitrary • C – the sum of importance ratios, initially 0 Upon arrival of new sample (τ′ , t′ ), we update V , C into new values V ′ , C′ by setting: C′ = C + ρt′ (τ′ ) V ′ = V + ρt′ (τ′ ) C′ · (Gt′ (τ′ ) − V ) . 90/294 Off-policy evaluation with weighted IS 91/294 Off-policy control with weighted IS Required reading: Sutton&Barto, Section 5.7. 92/294 Temporal Difference Methods TD: Motivation Let us first focus on policy evaluation. MC: zero bias (at least in the limit), but potentially high variance: many samples needed to converge. Also, to update estimates, it must wait till the end of each episode. TD methods retain the focus on sampling but combine it with bootstrapping. 93/294 Incremental update notation Definition 42: Notation for updates In the context of RL algorithms will denote by V n (s) (resp. Qn (s, a)) the algorithm’s estimate of vπ (s) (resp. qπ (s, a)) after n-th update of this estimate. 94/294 MC vs. TD(0) update On-policy MC (incremental) update using sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: V n+1 (st) ← (1 − αn)V n (st) + αnGt(τ) = V n (st) + αn · Gt(τ) update target −V n (st) update error , where αn = n/(n + 1). 
TD(0) update in the same situation, with αn “suitably chosen” (possibly constant): V n+1 (st) ← V n (st) + αn · Rt+1(τ) + γ · V n (St+1(τ)) bootstrap −V n (st) 95/294 Policy evaluation with TD(0) source: Sutton&Barto, p. 120 96/294 How can it even work? Really “just” a very asynchronous, sample-based, and “α-dampened” version of value iteration. Eπ [Gt|St = s] = Eπ [Rt+1 + γ · Gt+1 | St = s] = Eπ [Rt+1 | St = s] + γ · Eπ [Gt+1 | St = s] vπ(St+1) . In expectation, the TD(0) update is the same as VI update in Mπ . Thanks to the contractivity of the Bellman operator, VI possesses an error reduction property: after each update, the error of the estimate decreases. Hence, in expectation, the same is true for the TD(0) update. Formal proof of correctness in optional reading: Sutton, R.S.: Learning to Predict by Methods of Temporal Differences. In Machine Learning 3:9–44. Kluwer, 1988. (For MDPs with function approximation.) 97/294 Why TD is natural (Sutton&Barto, p. 122-123) Left: MC. Right TD(0). 98/294 On-policy TD control Recall: • In control setting, we need to estimate q-values of a policy. • On-policy: we sample trajectories according to some policy π and then push value estimates towards qπ . To maintain exploration, the policy π will typically be the ε-Q-greedy policy for some ε > 0, where Q are the current Q values estimates. I.e., throughout the algorithm π(a|s) =    1 − ε + ε A(s) if a = arg maxa′∈A(s) Q(s, a′ ) ε A(s) otherwise. 99/294 SARSA State-Action-Reward-State-Action. Introduced in Rummery, Niranjan: On-Line Q-Learning Using Connectionist Systems (1994). In each episode, sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , sT according to current policy π; for each time step 0 ≤ t ≤ T − 1, perform the following update: Qn+1 (st, at) = Qn (st, at) + αn · rt+1 + γQn (st+1, at+1) − Qn (st, at) The update can be performed immediately when st+1 and at+1 is known (no need to wait for the episode to terminate). After the episode ends, make π ε-Q-greedy. Conforms to the GVI scheme. 100/294 SARSA pseudocode source: Sutton&Barto, p. 130 101/294 GLIE policies Definition 43: GLIE condition A RL algorithm is greedy in the limit (GL) if its behavior policy (=target policy in onpolicy algorithms) converges to a 0-greedy policy with increasing number of episodes. A RL algorithm is GLIE if it is GL and IE (infinitely exploring). Typical ways of ensuring GLIE: • Dynamically adjust ε in ε-greedy policy selection” When selecting action in state s, behave ε-greedily with ε = c n(s) , where 0 < c < 1 is a constant and n(s) is a number of visits to state s over all the episodes so far. • Use Boltzmann (softmax) exploration: π(a | s) = e Q(s,a) η(s) b∈A(s) e Q(s,b) η(s) , where η is a state-dependent and time-varying temperature parameter. We need η to converge to 0 over time, but not too fast (often, η(s) proportional to 1 log(n(s)) ). 102/294 Convergence of SARSA Theorem 44 Consider a GLIE instantiation of SARSA. Moreover, assume that the sequence of learning rates (αn)n∈N satisfies n αn = ∞ and n α2 n < ∞. In this setting, Q converges to q∗ and the behavior policy of SARSA converges to some optimal policy π∗ . For the proof, see optional reading: Singh, Jaakkola, Littman, Szepesv´ari: Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. In Machine Learning 39:287-308. Kluwer, 2000. Note: learning rate can itself be state/action dependent (omitted for conciseness, constant learning rates preferred in practice). 
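Tabular SARSA in the Gymnasium-style interface assumed earlier, with a constant step size and an ε-greedy behavior policy; zeroing the bootstrap on termination is an implementation detail, not something the slides spell out:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy TD control: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    def eps_greedy(s):
        if np.random.rand() < eps:
            return env.action_space.sample()          # explore
        return int(np.argmax(Q[s]))                   # exploit (ties -> lowest index)

    for _ in range(n_episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = eps_greedy(s2)
            target = r + (0.0 if terminated else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])     # SARSA update
            s, a = s2, a2
    return Q
```

Replacing `gamma * Q[s2][a2]` in the target with `gamma * np.max(Q[s2])` turns this into tabular Q-learning, the off-policy method introduced next.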
103/294 Off-policy TD control Surprise: no importance sampling! Recall the SARSA update: Qn+1 (st, at) = Qn (st, at) + αn · rt+1 + γQn (st+1, at+1) − Qn (st, at) It pushes Q towards qπ , where π is the current policy. Idea: push Q directly towards q∗ . We could do this e.g. by a VI-like update: Qn+1 (st, at) = Qn (st, at) + αn · max a∈A(s) s′∈S p(s′ | s, a) r(s, a, s′ ) + γV (s′ ) − Qn (st, at) . Two problems: • We do not calculate v-estimates. (Must somehow replace with Q) • We must get rid of transition probabilities and instead use the sampled at and rt+1. 104/294 Q-learning update Solution: push the max towards the bootstrap. Q-learning (Watkins, 1989): given a sampled trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for every t we update: Qn+1 (st, at) = Qn (st, at) + αn · rt+1 + γ · max a∈A(st+1) Qn (st+1, a) − Qn (st, at) 105/294 Q-learning pseudocode source: Sutton&Bato, p. 131 Off-policy: where is the second policy? 106/294 Q-learning convergence Theorem 45 Consider any Q-learning instantiation with infinite exploration. Assume that the sequence of learning rates (αn)n∈N satisfies n αn = ∞ and n α2 n < ∞. In this setting, Q converges to q∗ . Moreover, if the behavior policy is GL, then it converges to an optimal policy π∗ . Proof in optional reading: Watkins, Dayan: Q-Learning. In Machine Learning 8:279-292. Kluwer, 1992. 107/294 SARSA vs. Q-learning (SB: p. 132) Left: greedy policies learned by SARSA and Q-learning. Right: in-training performance with a 0.1-greedy behavior policy. (Rough) takeaway: Q-learning more aggressive in finding optimal policy, can lead to risky behavior. Possibly advantageous when environment not too stochastic or if in-training performance has less importance (simulator vs. real world). 108/294 Maximization bias in Q-learning Q-learning is “risky” not only due to exploration, but also because it is optimistic in the face of uncertainty. TBD The positive bias only disappears in the limit. 109/294 Double Q-learning Idea: use two independent value estimates Q1, Q2 and decouple action selection from evaluation in the bootstrap. During each update, we randomly select one of these for update, which is also used to select the maximizing action in bootstrap. The other is used as the bootstrap estimate. I.e., in each time step t we perform one of these updates, each with probability 1 2 : either Q1(st, at) = Q1(st, at) + αn · rt+1 + γ · Q2 st+1, (arg max a∈A(st+1) Q1(st+1, a)) − Q1(st, at) or Q2(st, at) = Q2(st, at) + αn · rt+1 + γ · Q1 st+1, (arg max a∈A(st+1) Q2(st+1, a)) − Q2(st, at) . Behavior policy = e.g. ε-greedy w.r.t. Q1 + Q2. 110/294 Double Q-learning pseudocode source: Sutton&Barto, p. 136 111/294 Why Double Q-learning helps 112/294 Double Q-learning: experiment source: Sutton&Barto, p. 135 113/294 Between Monte Carlo and TD: n-Step and λ-Returns MC vs TD(0) update targets Given a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 Update for time step t in: • MC = discounted return from t till the end of trajectory, e.g. for t = 1: r2 + γr3 + γ2 r4 + · · · + γT−2 rT unbiased, but high variance + need the whole trajectory • TD(0) = 1-step reward and then (discounted) bootstrap: r2 + γV (s2) 114/294 n-step return Idea: use n-step discounted return and then bootstrap Definition 46 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and n ∈ N \ {0}. 
An n-step return of τ from time step t is the quantity Gt:t+n(τ) = rt+1 + γ · rt+2 + γ2 · rt+3 + · · · γn−1 rt+n + γn · V (st+n). We also define a Q-estimate-based version: Gt:t+n(τ) = rt+1 + γ · rt+2 + γ2 · rt+3 + · · · γn−1 rt+n + γn · Q(st+n, at+n). (Which of the two is used will be clear from the context.) s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 115/294 n-step TD policy evaluation Similar to TD(0), but using n-step return targets: given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for each 0 ≤ t < T we perform an update V (st) ← V (st) + α[Gt:t+n(τ) − V (st)]. Note 1: for n = 1 we get exactly TD(0). Note 2: for n > 1, we cannot update V (st) directly at step t + 1. We need to obtain rt+1, . . . , rt+n, st+n first, i.e. we can perform the update after step t + n. Note 3: if t + n > T, we truncate the sum in Gt:t+n at rT , i.e. in such a case Gt:t+n = Gt. 116/294 n-step TD policy evaluation: pseudocode source: Sutton&Barto, p.144 117/294 n-step TD policy evaluation: performance 19-state symmetric random walk: source: Sutton&Barto, p.145 118/294 n-step SARSA (on-policy control) Uses Q-value-bootstrapped n-step returns. For a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and for all time steps 0 ≤ t < T we perform an update Q(st, at) ← Q(st, at) + α[Gt:t+n − Q(st, at)]. We sample trajectories according to a policy π that is ε-greedy w.r.t. current Q-estimates: π(a|s) =    1 − ε + ε |A(s)| if a = arg maxa′∈A(s) Q(s, a′ ) (ties broken in principled way) ε |A(s)| otherwise. π is redefined in this way after each episode 119/294 n-step SARSA (pseudocode) source: Sutton&Barto, p. 147 120/294 n-step SARSA (speed of signal propagation) source: Sutton&Barto, p. 147 121/294 n-step Q-learning The Q-learning update for a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and time step t: Q(st, at) ← Q(st, at) + α · [rt+1 + γ · max a∈A(st+1) Q(st+1, a) − Q(st, at)]. Naive extension to n-step returns Q(st, at) ← Q(st, at)+α·[rt+1+γrt+2+· · ·+γt+n−1 ·rt+n+γt+n · max a∈A(st+1) Q(st+n+1, a)−Q(st, at)] does not really correspond to Q-learning, since some of the actions at+1, . . . , at+n might not be Q-greedy (the behavior policy is ε-greedy, so some actions might be exploratory). Hence, we are no longer pushing Q towards the Q-value of an optimal policy. s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 122/294 n-step Q-learning: correct Idea: apply the Q-learning bootstrap at the first occurrence of a non-Q-greedy action. I.e., for each episode: • make π an ε-Q-greedy policy • sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π • for each time step 0 ≤ t < T: • identify the smallest n′ ∈ {t + 1, t + 2, . . . , t + n} such that an′ is not a Q-greedy action • perform the update Q(st , at ) ← Q(st , at ) + α · [rt+1 + γrt+2 + · · · + γn′ −1 rn′ + γn′ · max a∈A(sn′ ) Q(sn′ , a) − Q(st , at )] (if n′ > T, do the standard MC update). s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 123/294 λ-returns: idea By varying n, the n-step returns provide a nice tradeoff between bias and variance (and update speed). But the choice of optimal n is mostly a guesswork. Idea: find a notion of return which combines n-step returns for multiple n’s. E.g., a suitable convex combination of individual n-step returns. This leads to the notion of λ-returns. We will focus only on policy evaluation, though λ-returns can be used also in control. 124/294 λ-returns: definition Recall: Gt:t+n is the n-step return from timestep t. 
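As a quick code refresher before the λ-mixture: the (state-value-bootstrapped) n-step return of Definition 46, truncated when the episode ends early; `V` is a hypothetical mapping from states to current value estimates and `episode` a list of (s, a, r) triples:

```python
def n_step_return(episode, t, n, V, gamma=0.99):
    """G_{t:t+n} = r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n});
    if t + n reaches the end of the episode, the sum is truncated and no bootstrap is added."""
    T = len(episode)
    G = 0.0
    for i in range(t, min(t + n, T)):
        G += gamma ** (i - t) * episode[i][2]      # discounted rewards r_{i+1}
    if t + n < T:
        G += gamma ** n * V[episode[t + n][0]]     # bootstrap on V(s_{t+n})
    return G
```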
Definition 47: λ-return Let λ ∈ [0, 1]. A λ-return from timestep t is the random variable Gλ t = (1 − λ) ∞ n=1 λn−1 Gt:t+n. Note that due to truncation at t + n ≥ T, the λ-return can be more explicitly written as Gλ t = (1 − λ) T−t−1 n=1 λn−1 · Gt:t+n + λT−t−1 · Gt. s0 a0 s1 a1 s2 a2 s3 a3 s4 r1 r2 r3 r4 125/294 λ-return as discounting of n-step returns source: Sutton&Barto, p. 290 126/294 (Forward-view) TD(λ) Like TD(0), but uses λ-returns. Given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . sampled from the evaluated policy π, we perform, for each time step t an update: V (st) ← V (st) + α(Gλ t (τ) − V (st)). Note: • for λ = 0, this is exactly the TD(0) update, • for λ = 1, this is exactly the MC update, • Gλ t (τ) depends on the whole suffix of τt.., hence the update can be only performed at the end of the episode. (We will show a workaround later.) 127/294 TD(λ) vs. n-step TD on 19-state random walk source: Sutton&Barto, p. 291 128/294 Forward vs. backward view TD(λ) Backward-view TD(λ) = an algorithm performing roughly the same updates as Forward-view TD(λ) in an online fashion (V (st) can be updated by time t + 1). Implemented using eligibility traces: state-wise signals that indicate how much is the current state eligible for an update (sort of state-wise modulation of the learning rate). We are more keen to update states that: • appear often along the trajectory (frequency heuristic) • were visited in the recent past (recency heuristic) 129/294 Accumulating eligibility trace Definition 48: (Accumulating) eligibility trace For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., λ ∈ [0, 1], and a state s ∈ S, an (accumulating) eligibility trace is a sequence of values E0(s), E1(s), E2(s), . . . defined inductively as follows: E0(s) = 0 and for t > 0 Et(s) = γ · λ · Et−1(s) + I(St(τ) = s), where I(St(τ) = s) is the indicator of the t-th state of τ being s, i.e. I(St(τ) = s) = 1 if st = s and I(St(τ) = s) = 0 otherwise. source: slides of D. Silver (Model-free prediction) 130/294 Backward-view TD(λ): idea • Et(s) denotes how much is s eligible for an update after playing t-th action along the run (i.e., action at−1). • In time step t, all states with non-zero eligibility signal will have their estimates updated in proportion to the learning rate and the strength of the eligibility signal. • The update target is the standard TD(0) target for time t. I.e., for each timestep t and each state q, we perform the update V (q) ← V (q) + α · Et(q) · rt+1 + γ · V (st+1) − V (st) . 131/294 Backward-view TD(λ): pseudocode Input: policy π to evaluate Output: Estimate V of vπ initialize V arbitrarily repeat s ← sample uniformly (ES) or according to init. distr. initialize E to be uniformly zero while s not terminal do a ← sample from π(s) s′ ← sample from p(s, a) r ← r(s, a, s′ ) foreach q ∈ S (Only q’s visited so far) do E(q) = γ · λ · E(q) + I(q = s) V (q) ← V (q) + α · E(q) · r + γ · V (s′ ) − V (s) s ← s′ until timeout 132/294 Forward vs. backward view If λ = 0, then Et(q) = I(St(τ) = q), i.e. the backward-view update at time point t is V (st) ← V (st) + α · r + γ · V (st+1) − V (st) , while for all states other than st, no update is performed. I.e., backward TD(0) is exactly the same thing as forward TD(0). For general λ the correspondence is more subtle: Theorem 49: Forward-backward view correspondence Assume that in the backward view, all the updates along the trajectory are performed offline, i.e. only after the end of the episode, and in a batch, i.e. 
concurrently, using the pre-episode estimates in right-hand sides. Then, for any λ ∈ (0, 1), this offline backward TD(λ) performs the same updates as forward TD(λ). 133/294 Offline backward and forward view (source: D. Silver slides) Given the batch nature of updates, it suffices to show that the forward update target at time t equals the sum of all updates “triggered” by a visit to state st. 134/294 Online vs. offline backward view on 19-state RW In practice, we want to use the online backward algorithm, which only approximates the forward view. Nevertheless, it performs acceptably: source: Sutton&Barto, p. 295, LEFT: online backward TD(λ), RIGHT: offline forward TD(λ) 135/294 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 • λ-returns and eligibility traces can be generalized to control setting SARSA(λ), Q(λ); optional reading: Sutton&Barto, Sec. 12.7-12.10. • There is a true online backward TD(λ) version. Here, true online=having perfect equivalence with the forward view. However, the equivalence is w.r.t. a more complex notion of λ-return (truncated λ-return) and uses more complex version of eligibility traces than presented here. Outperforms both forward and backward algorithms presented here. Optional reading: van Seijen, Sutton: True Online TD(λ). In Proceedings of ICML’14. 136/294 First Steps Towards Deep RL: Value-Based On-Policy Methods Working with huge MDPs E.g. original Atari games have 160x192 resolution with 128 colors: observable state space of size 27·160·192 = 2215040 (though only a fraction reachable and resolution typically scaled down in benchmarks – however, state typically encompass last 3 frames so as to provide some info on movement). State space can be even continuous (position, velocity,. . . ). Most states will not be seen - we need the ability to generalize from experience to unseen/rarely seen states. 137/294 Huge MDP representation From now on, states of the MDP will be represented by vectors from Rn . The vectorized representation is chosen in a domain-specific manner, e.g.: • Atari = one component per pixel per frame • continuous navigation = agent coordinates, velocity, etc. • small discrete MDPs can be represented by one-hot encoding For simplicity, we will still assume that the action space is discrete, and reasonably small, though many algorithms can be adapted for continuous actions (acceleration, etc). 138/294 Function approximators The value functions have types: vπ , v∗ : Rn → R qπ , q∗ : Rn × A → R. In RL, we need to approximate these functions. Definition 50 A function approximator (FA) for functions of type X → Y is a class of functions f ⊆ Y X parameterized by a some set of parameter vectors Θ ⊆ Rn . Each concrete parameter vector θ ∈ Θ defines a concrete function fθ ∈ f , i.e. f = {fθ | θ ∈ Θ}. For FA f , we often write fθ(x) = f (x, θ) to stress the fact that the output of fθ depends on both the input x and on θ. Hence, FA for type X → y can be itself seen as a function of type X × Θ → Y . 
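Definition 50 instantiated: the linear approximator Vθ(s) = θ⊤s over Θ = R^n (one of the forms listed on the next slide); a minimal sketch, with the one-hot remark showing how the tabular case is recovered:

```python
import numpy as np

class LinearV:
    """Linear FA: V_theta(s) = theta^T s for states s in R^n, with Theta = R^n."""
    def __init__(self, n):
        self.theta = np.zeros(n)
    def value(self, s):
        return float(self.theta @ s)
    def grad(self, s):
        return s           # grad_theta of theta^T s is just the feature vector s

# With one-hot state features, theta holds one entry per state: the tabular case.
```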
139/294 Function approximators in RL Our algorithms will use mainly these types of function approximators: • V : Rn × Θ → R to approximate vπ or vθ • Q : (Rn × A) × Θ → R to approximate qπ or qθ The typical task is to find θ ∈ Θ such that Vθ = V (·, θ) is a “good” approximation for vπ or vθ , and similarly for Qθ. The parametrization Θ will depend on the concrete form of function approximator used. 140/294 Forms of function approximators • tabular • θ = the vector containing the contents of the table • linear • Θ = S = Rn and e.g. Vθ(s) = θ⊤ · s • neural nets • θ = NN weights and biases • decision trees • . . . We require the approximators to be differentiable and to admit a training method suitable for non-stationary data. 141/294 Neural nets (source: slides by T. Br´azdil) 142/294 Policy evaluation with FAs Task: given a policy π and FA V : Rn × Θ → R, find θ s.t. Vθ is “close” to vπ . “Closeness” can be expressed using various loss functions. Typically, we want to minimize the mean squared error (MSE): MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 s∈S µ(s) · (vπ (s) − Vθ(s))2 , where µ is some distribution over states expressing how much do we care about errors in particular states. A local minimum of MSE can be found gradient descent: making successive step in the direction opposite to the gradient of MSE. 143/294 Recall: gradients Definition 51 Given a scalar function f (x1, . . . , xn, θ1, . . . , θm): Rn × Θ → R (where Θ ⊆ Rm ), the gradient of f w.r.t. parameters θ = (θ1, . . . , θm) is the vector function ∇θf = ( ∂f ∂θ1 , . . . , ∂f ∂θm ) of type Rn × Θ → Rm When f is a function approximator defined by a neural net, the value of the gradient ∇θf (x, θ) at a given point (x, θ) = (x1, . . . , xn, θ1, . . . , θm) can be computed by backpropagation (under some usual conditions like smoothness, etc.). 144/294 Gradient descent for policy evaluation To (locally) minimize MSE(vπ , Vθ), it suffices to perform (sufficiently small) steps in the negative direction of the current gradient, i.e., repeatedly perform updates: θ ← θ − α · ∇θMSE(vπ , Vθ) = θ − α · ∇θ 1 2 · Es∼µ vπ (s) − Vθ(s) 2 = θ − α 2 · Es∼µ ∇θ vπ (s) − Vθ(s) 2 = θ + α · Es∼µ vπ (s) − Vθ(s) · ∇θVθ(s) The expected value above is typically impossible to evaluate in practice. Instead we estimate it by samples ⇒ stochastic gradient descent. We typically take µ(s) representing the overall fraction of time spent in s when behaving according to µ. Hence, Es∼µ can be estimated by sampling a trajectory from µ and performing the update for each s on the trajectory in an every-visit fashion. 145/294 Stochastic gradient policy evaluation + MC instantiation We keep sampling trajectories τ from π: s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 For each timestep t we perform the update of parameters θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . Problem: in policy evaluation setting, we do not know vπ (st). Hence, we estimate it using RL targets. The simplest is the Monte Carlo target: estimate st by the discounted return of the sampled trajectory from st, i.e. perform updates of the form θ ← θ + α · Gt(st) − Vθ(st) · ∇θVθ(st) . 146/294 Gradient Monte Carlo policy evaluation: pseudocode Algorithm 3: Gradient MC evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π; foreach t ∈ {0, . . . 
, T − 1} do θ ← θ + α · [Gt(τ) − V (st, θ)] · ∇θV (st, θ) until timeout; 147/294 Semi-gradient TD(0) In the gradient update formula θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . we can also estimate vπ (st) with the TD(0) target: θ ← θ + α · rt+1 + γ · Vθ(st+1) − Vθ(st) · ∇θVθ(st) . This yield the semi-gradient TD(0) policy evaluation algorithm. Why semi-gradient? 148/294 Gradient vs. semi-gradient TD(0) Recall that our ultimate goal is to minimize MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 . The gradient of this loss is ∇θ 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 Es∼µ ∇θ(vπ (s) − Vθ(s))2 , Estimation with sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and substituting vπ (s) with the TD(0) target would yield ∇θMSE(vπ , Vθ) ≈ 1 2 ∇θ rt+1+γVθ(st+1)−Vθ(st) 2 = (rt+1+γVθ(st+1)−Vθ(st))·(γ∇θVθ(st+1)−∇θVθ(st)), different update then semi-gradient TD(0)! However, this full gradient: • is more expensive to compute (2 backpropagations per update); • does not really express TD(0) idea (the update target is not fixed). 149/294 Semi-gradient TD(0): pseudocode Algorithm 4: Semi-gradient TD(0) evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat s ← initial state; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · V (s′ , θ) − V (s, θ)] · ∇θV (s, θ); s ← s′ ; a ← a′ until timeout; 150/294 On-policy control with function approximation Semi-gradient SARSA uses the same idea as TD(0), but with Q-approximator, i.e. Q : Rn × A × Θ → R. Behavior policy = e.g. ε-greedy with respect to the current Q. For a sampled trajectory τ = s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 we perform, in each timestep t, an update θ ← θ + α · rt+1 + γ · Qθ(st+1, at+1) − Qθ(st, at) · ∇θQθ(st, at) . 151/294 Semi-gradient SARSA: pseudocode Algorithm 5: Semi-gradient SARSA Input: FA Q : S × A × Θ → R, step size α Output: Approximation Qθ of q∗ initialize θ arbitrarily; repeat s ← initial state; π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · Qθ(s′ , a′ ) − Qθ(s, a)] · ∇θQθ(s, a); s ← s′ ; a ← a′ until timeout; 152/294 Representing actions in DNNs How to represent actions in the (say, DNN) function approximator Q is largely a domain-dependent engineering choice. If the set of actions A = {a1 , . . . , ak } is discrete and reasonably small, we can consider a net which inputs a state (i.e., n input neurons when S = Rn ) and outputs an |A|-dimensional vector (i.e., one output neuron per action), so that the output of the i-th neuron on input s is interpreted as Q(s, ai ). I.e., in such a case we consider Q to be function of type Q : S × Θ → R|A| . 153/294 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) • forward-view λ-returns • SARSA(λ) update: θ ← θ + α · (1 − λ) ∞ n=1 λn−1 Gt:t+n,θ − Qθ(st , at ) · ∇θQθ(st , at ) • backward-view λ-returns (optional reading: Sutton&Barto, sections 12.2 and 12.7) 154/294 Value-Based Off-Policy Control with Approximators: DQNs and Friends Off-policy methods with function approximation ...are tricky to get right, already in the case of policy evaluation. The training can become very unstable. 
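Before turning to the off-policy case, a minimal sketch of the semi-gradient TD(0) update from Algorithm 4, instantiated for a linear approximator (the helper name is illustrative, not part of the course code):

```python
import numpy as np

def semi_gradient_td0_step(theta, s, r, s_next, gamma, alpha, terminal=False):
    """One semi-gradient TD(0) update for a linear approximator V_theta(s) = theta . s.

    The bootstrap target r + gamma * V_theta(s') is treated as a constant;
    only V_theta(s) is differentiated -- hence 'semi'-gradient.
    """
    v_next = 0.0 if terminal else float(theta @ s_next)
    td_error = r + gamma * v_next - float(theta @ s)
    return theta + alpha * td_error * s   # grad_theta V_theta(s) = s for a linear FA
```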
For on-policy (semi)-gradient methods, one can typically prove convergence to correct/optimal values at least in the case of linear function approximation (though not in the more general case of NN approximators). Off-policy semi-gradient methods, such as: • TD with importance sampling (not covered here), or • Q-learning with function approximators (will be covered a bit later), can diverge already with linear function approximators. 155/294 Divergence examples (high-level) • Baird’s counterexample: semi-gradient TD with importance sampling can diverge in presence of linear FAs • Moreover, the divergence is not due to the instability of (semi)-gradient descent. Tsitsiklis and Van Roy’s counterexample shows divergence even in the case where each update completely replaces the current θ with the optimal θ∗ which minimizes the MSE between Vθ and the TD(0) update target. The problem lies in the off-policy distribution of updates. • Counterexamples explained in optional reading: Sutton&Barto, Sec. 11.2. 156/294 Deadly triad Identified by Sutton&Barto: risk of training instability and divergence steeply rises when combining: • function approximation, • bootstrapping, and • off-policy training. But often we want to do just that. :) Practical solution: Happily do the deadly triad, but use insights from supervised learning to develop additional techniques that help stabilize the training. 157/294 Deep Q-Networks (DQN) 2013 arXiv tech. report, there is also follow-up 2015 Nature paper 158/294 Q-learning with function approximators Same semi-gradient idea as in TD(0), SARSA: adjust θ to bring Qθ(s, a) closer to the fixed Q-learning update target. I.e., for a sampled trajectory s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 and its timestep t, the update is θ ← θ + α · rt+1 + γ · max a∈A(st+1) Qθ(st+1, a) − Qθ(st, at) · ∇θQθ(st, at) . But performing the updates based solely on the current step would be susceptible to instability due to the presence of the deadly triad. 159/294 Deep Q-learning challenges From Mnih et al. Playing Atari with Deep Reinforcement Learning: 160/294 Experience replay Originated in the work of Long-Ji Lin, e.g.: Reinforcement Learning for Robots Using Neural Networks, dissertation, 1993. Definition 52: Experience An experience is a 4-tuple (s, a, r, s′ ) ∈ S × A × S × R interpreted as (”state”, ”action played in it”, ”reward obtained”, ”next state observed”). • DQN does not perform update based only on the current step. Instead, for each sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and each timestep t it: • first stores the one-step experience (st , at , rt+1, st+1) in a data structure B called replay buffer; • then, sample a random minibatch of experiences B ⊆ B of a given minibatch size Bsize • perform a minibatch-gradient-descent update w.r.t. B: compute the gradient of the Q-learning loss for each e ∈ B and then update θ in the direction of an average gradient over the whole B. 161/294 Minibatch update Fix a minibatch B. For each e = (s, a, r, s′ ) ∈ B we compute the gradient of the Q-learning loss ∇θL(θ, e) at point e: ∇θL(θ, e) = ∇θ 1 2 r + γ · max b∈A(s′) Qθ(s′ , b) fixed target − Qθ(s, a) 2 = r + γ · max b∈A(s′) Qθ(s′ , b) =0 if s terminal − Qθ(s, a) · ∇θQθ(s, a) We then perform an update in the direction of average gradient: θ ← θ + α · 1 |B| e∈B ∇θL(θ, e). 
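A minimal sketch of the replay buffer with uniform minibatch sampling described above (illustrative only; prioritization and the gradient computation itself are omitted):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of experiences (s, a, r, s_next, terminal)."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)   # when full, the oldest experience is dropped

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size: int):
        # uniform sampling, as in the original 2013 DQN
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# In the training loop, schematically:
#   buffer.store((s, a, r, s_next, terminal))
#   batch = buffer.sample(B_size)
#   theta <- theta + alpha * (1/|batch|) * sum of per-experience Q-learning gradients
```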
162/294 Experience replay rationale • helps decorrelate the DNN training data • helps to prevent catastrophic forgetting • improves data efficiency via experience re-use Why good match for deep Q-learning? Experience replay is by design off-policy since we train on old data, which were sampled from different policy than the current one. 163/294 Replay buffer implementation The replay buffer B is typically not unbounded, but has a fixed capacity Bsize. Replacement is eventually needed. If B is full, the oldest experience if removed (B = queue). How to sample the minibatches? • Original DQN: uniformly from B. • Alternative: prioritized experience replay: each experience is assigned a priority (several heuristics exist). An experience is sampled into a minibatch with probability proportional to its priority. 164/294 DQN: 2013 pseudocode Algorithm 6: DQN with replay buffer Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, . . . Output: Approximation Qθ of q∗ initialize θ arbitrarily; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e) (see this slide); s ← s′ ; until timeout; 165/294 DQN: 2013 results Mnih et al. (2013, arXiv) 166/294 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). Performing the (minibatch) Q-update looks like supervised learning: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b), where (s, a, r, s′ ) is the processed experience. Performing the (minibatch) Q-update looks like supervised learning, but: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b) the “label” changes with each update! , 167/294 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network Usage: • The target network is used only to compute TD targets when computing losses: ∇θL(θ, e) = r + γ · max b∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ) • At the start, and also in periodic intervals (but not after each update!) the two networks are synchronized by performing ˆθ ← θ. Other than this, ˆθ stays fixed, the gradient steps are only used to update θ (i.e., the main network). 168/294 DQN: 2015 pseudocode Algorithm 7: DQN with replay buffer and target network Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, C, . . . Output: Approximation Qθ of q∗ initialize θ arbitrarily; ˆθ ← θ; counter ← C; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do if counter = 0 then ˆθ ← θ; counter ← C else counter ← counter − 1; π ← policy ε-greedy w.r.t. 
Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e), where ∇θL(θ, e) = r + γ · maxb∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ); s ← s′ ; until timeout; 169/294 DQN: 2015 results 170/294 Engineering behind DQN for Atari: partial observability True state = current state (program counter + variable values) of the game program. We do not see this – only the frames rendered on screen. Solving partially observable environments requires (per some POMDP theory) making decisions based on the whole history of observations. This is computationally demanding (recurrent NNs...). DQN for Atari solves this by feeding the last 4 observed frames into the NN. This is typically enough to deduce the dynamics of the current play. 171/294 DQN: dynamics from limited frame history 172/294 Engineering behind DQN for Atari: the network Inputs four 84x84px images, then 3 convolutional layers, then two fully connected layers. all with ReLU activations. Outputs Q-estimate for each action. source: Mnih et al. (Nature, 2015), details in appendix “Model architecture” 173/294 DQN for Atari: dirty engineering tricks • Preprocessing: 210x160 RGB color images are converted to grayscale and resized to 84x84 resolution. • Frame skipping: the agent only observes and acts in every K-th frame, for the frames in between, the last selected action is repeated without providing the frame to the agent. (In the paper, K = 4.) • Reward clipping: all positive one-step Atari rewards are clipped to +1, all negative ones are clipped to −1 (Atari gives integer rewards). • TD error clipping: for each update, the Q-learning error r + γ · max b ∈ A(s′ )Q(s′ , b) − Q(s, a) is clipped to [−1, 1]. 174/294 DQN for Atari: selected hyperparameters (nex) minibatch size Bsize 32 replay buffer size Bsize 1,000,000 target network update freqeuency C 10,000 discount factor γ 0.99 update frequency (steps between two minibatch updates) 4 learning rate 0.00025 initial ε 1 final ε (linear decay) 0.1 final decay frame 1,000,000 random policy played for init. 50,000 frames max. do-nothing actions at episode start 30 175/294 RAINBOW of heuristics Multitude of heuristics for the improvement of DQN were developed over time. Some of them make sense also in the context of other deep RL algorithms. The RAINBOW agent combines six such heuristics to further improve the DQN performance on Atari games. (In proceedings of AAAI 2018.) 176/294 Rainbow heuristics • dueling networks architecture • double DQN • prioritized experience replay • n-step rewards • distributional learning • noisy networks 177/294 Action advantage Idea: imagine that for some state s, the Q-values of all actions are high. Then s should be in some sense valuable in itself. Definition 53: Advantage Let π be a policy. An advantage function advπ : S × A → R is defined as advπ (s, a) = qπ (s, a) − vπ (s). Dueling architecture splits certain layers of the neural network into two “streams”, one estimating (somethign like) vπ (s) and one estimating (something like) advπ (s, a). The final layer combines these estimates to produce an estimate of qπ (s, a). 178/294 Dueling architecture Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 
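A minimal sketch of the dueling decomposition, using the mean-advantage aggregator that Rainbow adopts (discussed on the following slides); the two head functions are placeholders for the network streams:

```python
import numpy as np

def dueling_q_values(features, value_head, advantage_head):
    """Combine the value stream and the advantage stream into Q-estimates:
    Q(s, a) = V(s) + A(s, a) - mean_b A(s, b)."""
    v = value_head(features)         # scalar state-value estimate
    adv = advantage_head(features)   # vector of per-action advantage estimates
    return v + adv - adv.mean()
```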
179/294 Dueling architecture: theory and training We have Qθ,α,β(s, a) = aggregate(Vθ,α(s), Aθ,β(s, a)), where • θ - convolutional (or other feature extraction) layer parameters • α - value channel parameters • β - advantage channel parameters The whole network Q is trained to estimate qπ (where π is the target policy) using any deep RL algorithm (e.g. DQN, in which case π is the optimal policy). There is nothing new from RL perspective here, all the novelty is inside the network. The factorization into state value and advantage is supposed to help the network “focus” on features that are important to recognize valuable states and features that help us rank actions. 180/294 Dueling architecture: Atari example From Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 181/294 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − max b∈A(s) Aθ,β(s, b), the training then indeed pushes Vθ,α to vπ and Aθ,β to advπ + c where c is some constant. Issues: not differentiable, update sensitive to the value of maximizing action changes. 182/294 Aggregation in Rainbow The point: aggregating layer should anchor the sum of the channels to same baseline value derived non-trivially from the advantages (if all advantages shift up/down, so should the baseline). Rainbow uses mean advantage baseline: Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − 1 |A(s)| b∈A(s) Aθ,β(s, b) pushes the value channel to predict 1 |A(s)| a∈A(s) qπ (s, a). 183/294 Double DQN van Hasselt, Guez, Silver: Deep Reinforcement Learning with Double Q-learning. In proceedings of AAAI 2016. Similar idea to tabular Double Q-learning (use different estimates for selecting maximizing action in bootstrap and for evaluating the bootstrap), but instead of independently updated networks uses main and target networks. I.e., for experience (s, a, r, s′ ), the update is: θ ← θ + α · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) ∇θQθ(s, a), where ˆθ is the parameter vector of the target network. 184/294 Prioritized experience replay (Schaul et al., ICLR’16) Each experience e = (s, a, r, s′ ) in the replay buffer is assigned a priority according to its TD-error pe = |r + γ · max b∈A(s′) Qˆθ(s′ , b) − Qθ(s, a)| + ε (ε > 0 ensures all priorities are positive). The probability of sampling an experience e from the buffer is set to pα e e′∈B pα e′ , where α > 0 is a hyperparameter controlling the degree of prioritization. Prioritization induces bias: the sampled experiences no longer follow the same distribution as sampled trajectories. We can correct this by using importance sampling during updates: θ ← θ + α · 1 |B| · 1 pα e β · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) · ∇θQθ(s, a), where β > 0 determines the degree of IS correction (annealed to 1 during training). 185/294 n-step returns Self-explanatory, use n-step return with Q-learning bootstrap when computing TD target. How to combine with replay buffer? Each experience stores a single step. Solutions: • Store experiences in B sequentially, with each sampled experience, retrieve also the next n − 1 ones (up to episode termination). Requires careful implementation. 
• Naive: each element of B consists of n consecutive experiences (space inefficient). Given consecutive experiences (st, at, rt+1, st+1), (st+1, at+1, rt+2, st+2), . . . , ((st+n−1, at+n−1, rt+n, st+n)), perform update θ ← θ + α · rt+1 + γrt+2 + · · · + ·γn−1 rt+n + γn max b∈A(st+n) Qˆθ(st+n, b) − Qθ(st, at) ∇θQθ(st, at). 186/294 Distributional learning Very rough idea: instead of expected returns, predict (discretized) distribution of returns. Source: Bellemare, Dabney, Munos: A Distributional Perspective on Reinforcement Learning. In proceedings of ICML’17. We still optimize the expected value of the distribution, but the NN processes richer information (neurological inspiration). 187/294 Noisy nets Fortunato et al.: Noisy Networks for Exploration . In proceedings of ICRL’17. An alternative way of achieving exploration (without ε-greedy policies). Replaces linear layers y = W · x + b with noisy layers of the form y = (µw + σw ⊙ εw ) · x + µb + σb ⊙ εb, where matrices µw , σw and vectors µb, σb are learnable, matrix εw and vector εb consist of random noise, and ⊙ represents component-wise multiplication. The loss function of the DQN training is then encapsulated in expectation over the noise. Interesting point: the net can learn to adjust σ’s and thus the degree of exploration over time. 188/294 Rainbow: evaluation and ablations (Hessel et al., 2017) 189/294 Policy Gradient Methods Value-based vs. policy-based methods So far: focus on approximating q∗ via some parameterized estimate Qθ, policy the defined by Qθ, e.g. (ε-)-greedy... In policy gradient methods we work directly with some parameterized representation of a policy πθ, and update θ so as to improve some performance characteristic of πθ (e.g., expected return). In particular, the policy π can be represented by a function approximator πθ : S × Θ → D(A). 190/294 Standard NN + softmax representation πθ(a|s) = eh(s,a,θ) b∈A eh(s,b,θ) , where h(s, a, θ) is the logit (”preference”) of action a 191/294 General policy gradient scheme We want to find θ that maximizes some performance measure (or objective function) J(θ) of πθ. The obvious choice for J is the expected return: J(θ) = vπθ = Eπθ [G], thought some algorithms use a different surrogate objective function. The optimization problem max θ J(θ) can be (locally) solved using gradient ascent: repeatedly perform updates θ ← θ + α · ∇θJ(θ). It is thus necessary to compute or approximate ∇θJ(θ): this is the scope of various policy gradient theorems. 192/294 Gradient of expected return: possible version ∇θJ(θ) = ∇θEπθ [G] = 193/294 Gradient of expected return (cont’d) 194/294 Vanilla MC policy gradient Algorithm 8: Vanilla MC policy gradient Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, learning rate α Output: Approximation πθ of π∗ initialize θ arbitrarily; repeat s0 ∼ initial distribution; generate episode τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , rT , sT using πθ; θ ← θ + α · G(τ) · i t=0 ∇θ log πθ(aj |sj ) until timeout; 195/294 Score function form 196/294 Step-wise gradient ∇θJ(θ) = ∇θEπθ [G] = ∇θEπθ [ ∞ t=0 γt Rt+1] 197/294 Step-wise gradient (cont’d) 198/294 REINFORCE: better MC policy gradient Algorithm 9: REINFORCE (Williams, 1992) Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, learning rate α Output: Approximation πθ of π∗ initialize θ arbitrarily; repeat s0 ∼ initial distribution; generate episode τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . 
, rT , sT using πθ; G ← 0; for t = T − 1 to 0 do G ← rt+1 + γ · G; θ ← θ + α · γt · G · ∇θ log πθ(at|st) until timeout; 199/294 Baseline in policy gradient Theorem 54: Baseline theorem Let bθ(s): S × θ → R be any function. Then for any t: Eπθ bθ(st) · ∇θlogπθ(at|st) = 0. As a consequence ∇θJ(θ) = Eπθ T t=0 γt · Gt · ∇θ log πθ(at|st) = = Eπθ T t=0 γt · Gt−bθ(st) · ∇θ log πθ(at|st) The gradient estimates using baseline have the same expectations as the standard REINFORCE estimate, but might have a lower variance if baseline selected correctly. 200/294 State-value baseline Good choice is b(s) := vπθ (s), reducing the estimate variance by correcting the return for a bias caused by being in a certain state. Problem: we do not know vπθ . Solution: Learn vπθ online using a separate function approximator V , e.g. via gradient (every-visit) Monte Carlo. 201/294 REINFORCE with a state-value baseline Algorithm 10: REINFORCE with baseline Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, value parametrization Vη, learning rates απ, αV for the two approximators Output: Approximation πθ of π∗ initialize θ and η arbitrarily; repeat s0 ∼ initial distribution; generate episode τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , rT , sT using πθ; G ← 0; for t = T − 1 to 0 do G ← rt+1 + γ · G; η ← η + αV · G − Vη(st) · ∇ηVη(st); θ ← θ + απ · γt · G − Vη(st) · ∇θ log πθ(at|st) until timeout; 202/294 REINFORCE: baseline effect experiment source: Sutton&Barto, p. 330 203/294 Advantage estimation in policy gradient Comparing returns to some state-dependent baseline is reminiscent of what happened in the dueling network architecture. In particular, one can derive another form of the policy gradient, showing that ∇θJ(θ) ∝ Es∼π,a∼πθ [qπθ (s, a) · ∇θ log π(a|s)]. Inputting a state-value baseline yields ∇θJ(θ) ∝ Es∼π,a∼πθ [(qπθ (s, a) − vπ (s)) advπ(s,a) ·∇θ log π(a|s)]. Hence, policy-gradient-type algorithms often formulate the update of policy parameters in the form θ ← θ + α · At θ,η(st, at) · ∇θ log πθ(at|st), where At θ,η is some approximator called advantage estimate. E.g. in PG with baseline, At (θ,η) = γt · (Gt − Vη(st)). 204/294 Proof of Baseline theorem 205/294 Actor-critic: Policy gradient with bootstrapping Recall the REINFORCE-with-baseline update: θ ← θ + α · γt · (Gt − Vη(st)) · ∇θ log πθ(at|st) Gt is a possible source of variance: let’s remove it (at the cost of introducing bias) via bootstrapping! E.g. TD(0): θ ← θ + α · γt · (rt + γ · Vη(st+1) − Vη(st)) · ∇θ log πθ(at|st). Here, the value network Vη both estimates the baseline value of the current state, and (via boostrap) the quality of the played action: we call it a critic, while the policy network πθ is called an actor. Note: we can use the same bootstrap to update critic parameters. 
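All of these actor updates reduce to computing ∇θ log πθ(at|st). A minimal sketch for the softmax policy over action preferences, here assuming linear preferences h(s, a, θ) = θ · φ(s, a) for an illustrative feature map φ:

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi_theta(.|s) for preferences h(s, a, theta) = theta . phi(s, a).
    `phi` has shape (|A|, d): one feature vector phi(s, a) per action."""
    h = phi @ theta
    h -= h.max()                       # for numerical stability
    e = np.exp(h)
    return e / e.sum()

def grad_log_pi(theta, phi, a):
    """grad_theta log pi_theta(a|s) = phi(s, a) - sum_b pi_theta(b|s) * phi(s, b)."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

# REINFORCE-with-baseline actor update for timestep t, schematically:
#   theta <- theta + alpha_pi * gamma**t * (G_t - V_eta(s_t)) * grad_log_pi(theta, phi_t, a_t)
```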
206/294 Basic Actor-Critic (AC) algorithm: pseudocode Algorithm 11: One-step AC Input: Black-box MDP M = (S, A, p, r), policy parametrization πθ, value parametrization Vη, learning rates απ, αV for the two approximators Output: Approximation πθ of π∗ initialize θ and η arbitrarily; repeat s ∼ initial distribution; D ← 1; while s is not terminal do a ∼ πθ(s); s′ ∼ p(s, a); δ ← r(s, a, s′ ) + γ · Vη(s′ ) − Vη(s); η ← η + αV · δ · ∇ηVη(st); θ ← θ + απ · D · δ · ∇θ log πθ(at|st); D ← γ · D; s ← s′ until timeout; 207/294 AC: picture 208/294 Comments on AC • The algorithm presented on previous slide is just a basic variant: actor-critic framework covers a wide range of algorithms: soft Actor-Critic (SAC), A2C, A3C, deep deterministic policy gradient (DDPG), twin-delayed DDPG (TD3), PPO (next time),. . . 209/294 Varieties of policy gradient heuristics • Entropy regularization: add to the objective function a term representing the entropy of the policy: we prefer “more” randomized policies to encourage exploration (used e.g. in SAC). • JENTR (θ) = Eπ [G] + β · Es∼πθ [H(π(s))] • Off-policy training by using replay buffer (e.g. in SAC, DDPG, TD3). • Using parallel agents whose gradients are averaged for each update (A2C). • Using n-step (e.g. A2C) or λ-returns in bootstrap (PPO). • Using different J(θ) than just expected return (SAC, TRPO, PPO). Note that the above algorithms typically differ from “vanilla” AC in more aspects then presented above. We shall see soon on the case of PPO. 210/294 Taming Unstable Gradients with Trust Regions: TRPO, PPO Limitations of policy gradient methods • Basic policy gradient methods are prone to large variance of gradient estimates. • These can be mitigated to some degree, e.g. by using n-step or λ-returns in advantage estimation. • Even then, the methods can yield large updates which can destabilize the training. We are never sure the parameter updates will actually lead to an improvement of a policy: 211/294 Sensitivity of policy to parameters For two actions and π(a|s) = 1 1−e−θ a = a1 1 − 1 1−e−θ a = a2, we get Source: slides by Emma Brunskill, Lecture 6, https://web.stanford.edu/class/cs234/modules.html 212/294 Direct policy improvement via surrogate objectives • We will present algorithms that use different performance metric so as increase the chance of policy improvement on update. • Moreover, we will design the performance metric so that its gradient can be easily computed by an automated differentiation tool without the need to derive the formulas manually. 213/294 Overall structure of presented algorithms • The algorithms will generate a sequence of policies π1, π2, π3 such that each policy will be (likely) an improvement over the previous one. • Each step from πi to πi+1 entails finding a policy πi+1 that optimizes some performance metric Lπi that is dependent on the previous policy πi ! (Cf. standard policy gradient: the performance metric J evaluated only the current policy). • I.e. a run of such an algorithm entails solving (using gradient-based methods) multiple optimization problems: one per each policy update. • We will now focus on the single improvement step: we will denote by • π = πθ the previous policy (πi ) • π′ = πθ′ the new policy (πi+1) we seek. θ is treated as a constant, θ′ as variables! 214/294 Roles of advantages in policy improvement Theorem 55 Let θ, θ′ be two parameter vectors and π = πθ, π′ = πθ′ . Then J(θ′ ) − J(θ) = Eτ∼π′ ∞ t=0 γt · advπ (st, at) def =Lπ(π′) . 
215/294 Another loss surrogate To ensure that update from θ to θ′ is an improvement, we want to maximize Lπ(π′ ) = Eτ∼π′ ∞ t=0 γt · advπ (st, at) But we cannot sample from π′ , neither can we easily compute the gradient of the loss by automated differentiation. Trick: the loss function Lπ behaves similarly to the following loss function Lπ for all points π′ that are “close enough” to π: Lπ(π′ ) = Eτ∼π ∞ t=0 γt · advπ (st, at) · π′ (at|st) π(at|st) . 216/294 Closeness of Lπ and Lπ Lπ(π) = Eτ∼π′ ∞ t=0 γt · advπ (st, at) Lπ = Eτ∼π ∞ t=0 γt · advπ (st, at) · π′ (at|st) π(at|st) “Behaves similarly” can be formalized as follows: Theorem 56 It holds Lπ(π) = Lπ(π). Moreover the gradients ∇θ′ Lπ and ∇θ′ Lπ are equal at point π. Proof: the first part is trivial. The second part is technical and requires converting the expectation into expectation-over-states form, see Optional reading: Schulman, Levine, Abbeel, Jordan, Moritz: Trust Region Policy Optimization. In Proceedings of ICML’15. 217/294 Trust Region Methods Hence, the constrained optimization problems maximize Lπ(π′ ) subject to π′ close to π maximize Lπ(π′ ) subject to π′ close to π have approximately the same optimal solutions. We want to solve the first, and will proceed by solving the second. The set of π′ that are “close enough” to π is called a trust region. What “close enough” means differs among algorithms. Exact bounds on the error of the approximation in terms of KL divergence between π and π′ was given in Optional reading: Achiam, Held, Tamar, Abbeel: Constrained Policy Optimization. In Proceedings of ICML’17. 218/294 Optimizing the loss by sampling and automated differentiation Lπ = Eτ∼π ∞ t=0 γt · advπ (st, at) · π′ (at|st) π(at|st) We replace the true advantage advπ with an advantage estimator Aη (neural net, more on that later). Instead of optimizing the true loss Lπ, we optimize a sample loss Lπ: for a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: Lπ(π′ ) ≈ Lπ(π′ ) = ∞ t=0 γt · Aη(τ, t) · π′ (at|st) π(at|st) (More trajectories can be sampled, in which case we optimize the average sample loss over the trajectories.) We then let an automated gradient-based optimizer find π′ = πθ′ maximizing Lπ(π′ ). Note that the only term in Lπ that depends on the optimized parameters θ′ are the likelihood ratios! Hence, the gradient can be computed easily. 219/294 Trust region policy optimization (TRPO) Schulman, Levine, Abbeel, Jordan, Moritz: Trust Region Policy Optimization. In Proceedings of ICML’15. To perform update from π to π′ , theoretical TRPO: • samples a trajectory τ (or a batch of trajectories) from π • uses the trajectory to: • update the advantage estimator Aη (i.e., update its parameters η) • construct the loss Lπ • solves the optimization problem maximize Lπ(π′ ) − β · DKL(π, π′ ) JTRPO (DKL - KL divergence, β - suitable constant) via a black-box gradient-based optimizer. This update loop is performed until timeout. Practical TRPO makes several changes to individual steps of the above scheme. 220/294 Proximal Policy Optimization (PPO) 221/294 PPO vs TRPO PPO follows the same general scheme as TRPO with the following tweaks: • Explicitly specifies how the advantage should be estimated: generalized advantage estimation (essentialy, truncated offline λ-returns, see later). • Tweaks the loss function a bit - uses different way of ensuring that π′ is in the proximity of π. 
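For concreteness, a sketch of the sample surrogate loss assembled from likelihood ratios and fixed advantage estimates (in practice the ratios are produced by the policy network and the gradient w.r.t. θ′ is obtained by automatic differentiation):

```python
import numpy as np

def sample_surrogate_loss(ratios, advantages, gamma):
    """Empirical surrogate  sum_t gamma^t * A_eta(tau, t) * pi'(a_t|s_t) / pi(a_t|s_t)
    for one sampled trajectory; the advantage estimates are treated as constants."""
    discounts = gamma ** np.arange(len(ratios))
    return float(np.sum(discounts * np.asarray(advantages) * np.asarray(ratios)))
```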
PPO comes in two variants depending on how it tweaks the loss function: • Adaptive KL divergence penalty coefficient: like TRPO, but the coefficient β changes adaptively. Empirically does not perform as well as the second method: • Clipped likelihood ratios. Definition 57: Clipping The function CLIP is defined as follows: CLIP(x, a, b) =    x if a ≤ x ≤ b a if x < a b if b < x 222/294 Basic PPO empirical loss For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . denote rt(π, π′ ) = π′ (at|st) π(at|st) The main component of PPO loss/performance metric is: LCLIP π (π′ ) = ∞ t=0 min Aη(τ, t) · rt(π, π′ ) , Aη(τ, t) · CLIP rt(π, π′ ), 1 − ε, 1 + ε (Note that no γt is included in the formula: discounting to some degree implicitly encompassed in the advantage estimation.) 223/294 Constraining improvements but not damages LCLIP π (π′ ) = ∞ t=0 min Aη(τ, t) · rt(π, π′ ) , Aη(τ, t) · CLIP rt(π, π′ ), 1 − ε, 1 + ε source: Schulman et al.: Proximal Policy Optimization Algorithms 224/294 Advantage estimation in PPO PPO uses generalized advantage estimation = an actor-critic framework using λ-return to estimate the q-value. Formally, the advantage estimator is built on a value network Vη : S × Θ → R. Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. An n-step advantage from time step t is the quantity At:t+n η (τ) = rt+1 + γrt+2 + · · · + γn−1 rt+n + γn Vη(st+n) − Vη(st). For a parameter λ ∈ [0, 1], the generalized advantage estimate at time t is GAE(λ,t) η (τ) = T−t n=1 λn · At:t+n η (τ). For a trajectory τ, PPO puts Aη(τ, t) = GAEλ,t η (τ). 225/294 PPO loss extension Apart from LCLIP π , PPO loss consist of two additional terms: 1. The value network Vη and policy network πθ might share parameters (e.g. the same feature extraction layers). In this case, η and θ should be trained together: for a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., the PPO loss will incorporate the empirical value loss LV (η′ ) = T t=0 (Vη′ (st) − Gt(τ) target )2 (instead of sample return, the training target might be e.g. TD(0), or Aη(τ, t) + Vη(st): note that η in the target (current value of the parameter) is a constant). 2. Original PPO also used entropy regularization, adding sample entropy loss: Lentropy (π′ ) = − 1 T T−1 t=0 a π′ (a|st) · log π′ (a|st) 226/294 Total PPO loss Given a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., the current reference policy π, and hyperparameters β1, β2 ∈ R+ , the canonical PPO loss is: LPPO τ (θ′ , η′ ) = LCLIP π (π′ ) − β1 · LV (η′ ) + β2 · Lentropy (π′ ) = T t=0 min Aη(τ, t) · rt(π, π′ ) , Aη(τ, t) · CLIP rt(π, π′ ), 1 − ε, 1 + ε − β1 · T t=0 (Vη′ (st) − Gt(τ) target )2 + β2 · − 1 T T−1 t=0 a π′ (a|st) · log π′ (a|st). Important note: the advantage estimates Aη in LCLIP are evaluated before the loss is constructed and are treated as constants inside LCLIP ! The value network parameters are only considered variable inside the value loss. Usually, a batch of τ is sampled and average loss 1 |batch| · τ∈batch LPPO τ (θ′ , η′ ) is optimized using gradient-based optimizer (eg., ADAM) to yield a policy update. 227/294 PPO high-level pseudocode Algorithm 12: PPO Input: policy network πθ, value network Vη, hyperparameters λ, |B|, β1, β2,. . . 
initialize θ, η; repeat sample a batch B of trajectories from policy π = πθ; foreach τ ∈ B do foreach 0 ≤ t ≤ T − 1 do at(τ) ← GAEλ,t η (τ); define LCLIP τ (π′ ) = T t=0 min at(τ) · rt(π, π′ ) , at(τ) · CLIP(rt(π, π′ ), 1 − ε, 1 + ε) ; define LPPO τ (θ′ , η′ ) = LCLIP π,τ (π′ ) − β1 · LV (η′ ) + β2 · Lentropy (π′ ) define LPPO (θ′ , η′ ) = 1 |B| · τ∈B LPPO τ (θ′ , η′ ); use ADAM or other optimizer to find θ′ , η′ approximately maximizing LPPO (θ′ , η′ ); θ ← θ′ ; η ← η′ until timeout; 228/294 Exploration vs. Exploitation: A Systematic Approach EvE dilemma When you see a good move, look for a better one. Emanuel Lasker (1868-1941), 2nd World Chess Champion • In RL, the algorithms often need to balance exploration of the MDP state space with exploitation: focusing on behavior that worked well in the past. • Typical RL algorithms ensure exploration by ε-greediness, with ε-possibly annealed towards zero over the training. (Other options: noisy nets, entropy regularization,. . . ) • These are rather ad-hoc approaches: the setting of hyperparameters (annealing rate, entropy coefficient) is often highly domain-dependent. • EvE dilemma also appears in domains that are not typically modeled as MDPs: recommender systems, disease treatment plans, or games. • The EvE dilemma is systematically studied using the formalism of multi-armed bandits (MABs). 229/294 One-armed bandit 230/294 Multi-armed bandit (MAB) reward µ(a) arm a µ(b) arm b µ(c) arm c A MAB is given by: • a finite set of arms A; • for each arm a a reward distribution Da with mean µ(a) • the Da and µ(a) are unknown to the player! For simplicity, we will assume that rewards are from the interval [0,1]. 231/294 Dynamics of multi-armed bandits • Interaction proceeds in discrete time steps 1, 2, 3, . . . , T, where T is a termination time which might or might not be known to the player. • In each time step t, we choose some arm at ∈ A to pull and receive reward rt ∼ Dat . = like a one-state MDP with stochastic rewards, but we do not known the reward distributions Di in advance. Our goal is to maximize the expected accumulated reward E[ T t=1 rt] = T t=1 E[rt]. Clearly, the expected reward is maximized by pulling, in each step t, the arm a∗ with maximal mean reward µ∗ (we assume all arms have different mean rewards): µ∗ = max a∈A µ(a) a∗ = arg max a∈A µ(a) However, since we do not know Da’s and µ(a)’s, we cannot a priori determine which arm is optimal! We need to learn something about the reward distributions by exploring individual arms, while trying to maximize the accumulated returns. 232/294 Policies in MABs Definition 58: MAB policy A policy in a multi-armed bandit problem is a function π which to each history a1, r1, a2, r2, . . . , at, rt of actions and resulting rewards assigns a distribution over arms. The policies we will work with typically based their decisions on the following statistics of the history: for each arm a we keep: • Nt(a) – the number of times the arm a was pulled by time t • Rt(a) = 1≤i≤t rt · I(ai = a) – the total reward accumulated by pulling arm a up to time t • ˆµt(a) = Rt (a) Nt (a) – the empirical mean return of arm a Node that if Nt(a) → ∞ as t → ∞, then ˆµt(a) → µ(a). 
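A minimal sketch of a bandit with [0, 1]-valued rewards and the per-arm statistics Nt(a), Rt(a), and the empirical means (Bernoulli reward distributions are an assumption made only for this illustration):

```python
import numpy as np

class BernoulliBandit:
    """Arms with Bernoulli reward distributions; the means are hidden from the player."""

    def __init__(self, means):
        self._means = np.asarray(means)                    # mu(a), unknown to the player

    def pull(self, a: int) -> float:
        return float(np.random.rand() < self._means[a])    # reward in {0, 1}, a subset of [0, 1]

class ArmStats:
    """Running statistics N_t(a), R_t(a) and the empirical mean returns."""

    def __init__(self, n_arms: int):
        self.N = np.zeros(n_arms)
        self.R = np.zeros(n_arms)

    def update(self, a: int, r: float):
        self.N[a] += 1
        self.R[a] += r

    def empirical_means(self):
        return self.R / np.maximum(self.N, 1)              # 0 for arms never pulled
```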
233/294 Example: ε-greedy policy Given ε ∈ [0, 1], an ε-greedy policy selects arm at+1, for any t ≥ 0, as follows: • with probability ε it selects an arm uniformly at random • with probability 1 − ε it selects an arm a such that a = arg max b∈A ˆµt(b) Intuitively, this is sub-optimal policy: even for large t, when all ˆµt(b) should be relatively good approximations of true mean rewards, the policy still selects sub-optimal arms at a constant rate. How to formalize this issue? 234/294 Regret The regret of a policy at time T is the difference between the expected return of the policy and the return we would get by always pulling the optimal arm. Formally Definition 59: Regret The regret of a policy π at time T is the quantity Regretπ(T) = T · µ∗ − Eπ [ T t=1 rt]. Another form of writing the regret: Regretπ(T) = T · µ∗ − a∈A µ(a) · Eπ [NT (a)]. 235/294 Lower bound on the regret of ε-greedy policy We will show that ε-greedy policy has regret linear in T, i.e. asymptotically the worst possible one. Let b be any sub-optimal action. Denote ∆b = µ∗ − µb > 0. Since b has probability at least ε of being pulled in every step, the expected number of times b is pulled in T steps is at least T · ε. Hence, Regretε-greedy(T) = T · µ∗ − a∈A µ(a) · Eε-greedy [NT (a)] = a∈A Eε-greedy [NT (a)] · µ∗ − a∈A µ(a) · Eε-greedy [NT (a)] = a∈A Eε-greedy [NT (a)] · (µ∗ − µ(a)) ≥ Eε-greedy [NT (a)] · ∆b ≥ T · ε · ∆b. 236/294 Logarithmic regret We will now demonstrate a simple policy achieving logarithmic regret. The policy π we will construct requires the advance knowledge of the termination time T and of the the gap between the optimal and second-best arm: ∆min = min b̸=a∗ µ∗ − µ(b) Policy π proceeds in two phases: • Phase 1 (exploration): in the first T1 = log(T)·|A|·4 ∆2 min steps it deterministically cycles through all arms. I.e., if A = {a1 , a2 , . . . , ak }, then in step i it plays arm aj where j = i (mod k) + 1. • Phase 2 (exploitation) at timestep T1 + 1, π identifies action a with maximal empirical mean ˆµT1 (a) and keeps playing this action for the remaining T2 = T − T1 steps. 237/294 Upper-bounding the regret The total regret of π can be decomposed into regrets accumulated in the two phases: Regretπ(T) = T · µ∗ − T t=1 Eπ [rt] = T1 · µ∗ − T1 t=1 Eπ [rt] R1, exploration regret + T2 · µ∗ − T t=T1+1 Eπ [rt] R2, exploitation regret Clearly R1 ≤ 1 · T1 ∈ O(log(T)). R2 depends on whether π correctly classifies the optimal arm at timestep T1 + 1. • If yes, then R2 = 0. • If no, then R2 ≤ T2 ≤ T ∈ O(T). 238/294 Bounding the probability of misclassification We have R2 ≤ T · Pπ [Mis], where Mis is the event that a∗ does not have the maximal empirical mean after T1 steps. By union bound Pπ [Mis] ≤ b̸=a∗ Pπ [ˆµT1 (b) ≥ ˆµT1 (a∗ )]. Clearly, the second-best action, call it ˜a, has the highest likelihood of achieving higher empirical mean than a∗ . I.e., Pπ [Mis] ≤ (|A| − 1) · Pπ [ˆµT1 (˜a) ≥ ˆµT1 (a∗ )]. It thus suffices to bound the probability of ˜a overtaking a∗ in empirical mean. 239/294 Probability of empirical deviation µ(˜a) ˜a µ∗ a∗ For ˜a to overtake a∗ , at least one of the arms must have empirical mean after T1 steps at least ∆min 2 -away from their true mean. Hence, it suffices to bound the probability that an empirical mean deviates too much from the true mean. 240/294 Hoeffding’s inequality for large deviations Theorem 60: Hoeffding’s inequality Let D be a distribution taking values in [0, 1]. Let µ be the mean of D and let ˆµn = n i=1 xi , where each xi is an independent sample from D. 
Then, for any n ∈ N+ and any δ > 0: P |ˆµn − µ| ≥ δ ≤ 2e−2nδ2 Recall T1 ≈ log(T)·|A|·4 ∆2 min . Then, e.g. for a∗ : Pπ |µT1 (a∗ ) − µ∗ | ≥ ∆min 2 ≤ 2e−2 T1 |A| ∆2 min 4 ≈ 2e−2 log(T) = 2 T2 . Hence, Pπ [Mis] ≤ C · 1 T2 for some constant C. It follows that the exploitation regret R2 is ≤ T · Pπ [Mis] = C′ · 1 T for some constant C′ . 241/294 Final bound on regret of π The total regret of π is: Regretπ(T) = R1(T) ∈O(log(T)) + R2(T) ∈O(1) ∈ O(log(T)). The logarithmic regret is the best possible: Theorem 61: Lai and Robbins (”Asymptotically efficient adaptive allocation rules”, 1985) Any policy has regret in Ω(log(T)). The disadvantage of π is that it needs to know both T and ∆min in advance. Knowledge of T can be discarded by using ε-greedy policies with adaptive ε: if εt = min{1, |A| ∆2 min·t }, then the regret is logarithmic (see David Silver’s slides). But there is actually a policy yielding logarithmic regret without the advance knowledge of either T or ∆min. 242/294 Optimism in the face of uncertainty The idea is that under-explored arms should be explored more: they could be better than we currently think (and if not, exploring them should disprove that). We will seek optimistic estimates of µ(a) in the form of upper confidence bounds: we seek an empirical quantity Ut(a) such that with high probability µ(a) ≤ ˆµt(a) + Ut(a) UCBt (a) . The policy will always pick action maximizing UCBt. Moreover, Ut(a) should be as tight as possible given available information. In particular, it should hold that if Nt(a) ≤ Nt(b), then Ut(a) ≤ Ut(b). µ(a) a µ(b) b 243/294 One-sided Hoeffding’s inequality Theorem 62: Hoeffding’s inequality (one-sided) Let D be a distribution taking values in [0, 1]. Let µ be the mean of D and let ˆµn = n i=1 xi , where each xi is an independent sample from D. Then, for any n ∈ N+ and any δ > 0: P µ − ˆµn ≥ δ ≤ e−2nδ2 . Let 1 − p be some required confidence level. We want to find “tight” Ut(a) such that µ∗ ≥ ˆµt(a) + Ut(a) with probability at most p, i.e. P µ∗ − ˆµt(a) ≥ Ut(a) ≤ p. By (one-sided) Hoeffding: P µ∗ − ˆµt(a) ≥ Ut(a) ≤ e−2Nt (a)U2 t (a) . 244/294 Deriving UCB formula From the previous we have: P µ∗ − ˆµt(a) ≥ Ut(a) ≤ e−2Nt (a)U2 t (a) . Performing substitution p = e−2Nt (a)U2 t (a) we derive the expression for Ut(a): Ut(a) = log(1/p) 2Nt(a) For this Ut(a), it holds P[µ∗ − µt(a) ≥ Ut(a)] ≤ p, i.e. P µ∗ ≤ ˆµt(a) + Ut(a) ≥ 1 − p. We want to increase the confidence over time, i.e. p should be a function of t with p → 0 as t → ∞. The standard approach is to put p = 1 tc for a suitable constant c. Then Ut(a) = log(1/p) 2Nt(a) = c log(t) 2Nt(a) = c 2 log t Nt(a) = C · log t Nt(a) 245/294 Summary of UCB policy In each step t, the UCB policy selects the arm with the highest upper confidence bound, i.e. arm a such that a = arg max a∈A UCBt(a) = arg max a∈A ˆµt(a) + C · log t Nt(a) , where C is a hyperparameter known as exploration constant. It can be shown that for [0, 1]-valued rewards, choosing C = √ 2 suffices to achieve logarithmic regret Theorem 63 When the reward distributions are over the interval [0, 1], then for C = √ 2 it holds RegretUCB (T) ∈ O(log T). Proof in optional reading: Auer, Cesa-Bianchi, Fischer: Finite-time Analysis of the Multiarmed Bandit Problem. In Machine Learning (47), 2002. 246/294 Concluding remarks on MABs • UCB-like algorithms with logarithmic regret were developed also for more general cases (e.g. distributions with unbounded support, where typically some shape of the distribution or a known bound on its variance is known). 
• The MAB model introduced here can be generalized in many ways (Bayesian bandits, contextual bandits, adversarial bandits) - rich area of research and applications. Optional reading: Slivkins: Introduction to Multi-Armed Bandits (arXiv). • UCB-like approaches can be used also in RL, e.g. instead of ε-greedy policies use π(s) = arg max a∈A(s) Q(s, a) + C · log N(s) N(s, a) . • Most prominently, this idea is applied in the context of Monte-Carlo tree search algorithms. 247/294 Monte Carlo Tree Search Tree? What tree? 1 2 :2 1 2 :0 1 3 :1 2 3 :-3 ...... ...... ...... ...... ...... ...... ...... ...... ...... • Rooted tree with alternating state nodes and action nodes. • Action nodes equipped with rewards and probability distributions over children. We denote p(a) the probability distribution over children of a. • Policy π in the tree assigns to each state node a distribution over its children. • For simplicity, we consider discount factor γ = 1. 248/294 Trees vs. MDPs The trees we will consider can be considered as special type of MDPs. Due to lack of cycles, all policies can be considered memoryless in the tree. However, the tree MDPs capture full generality of MDPs due to MDP unfolding. Definition 64: Unfolding of MDP For an MDP M = (S, A, p, r), the unfolding of M is a (possibly infinite) tree-shaped MDP Unfold(M) such that: • the states nodes of Unfold(M) are the histories of M (=trajectory prefixes); • if a is an action node that is a child of state node corresponding to history h, then in Unfold(M) the probability of child h, a, s of a is equal to p(s|last(h), a); • the reward function lifted similarly: rUnfold(M)(h, a, (h, a, s)) = rM(last(h), a, s) 249/294 Advantage of tree representation The tree unfolding of an MDP implicitly carries, in each node, the information about the whole history up to the reaching of that node. This can be advantageous when dealing with tasks/environments where the knowledge of history is important, e.g.: • partially observable environments (cf. Atari) • tasks with more complex objectives than maximizing the expected discounted return • adversarial environments (games) - requires distinguishing between player 1 and player 2 nodes (we will see later) In what follows, we will consider trees encoding episodic MDP tasks: the tree is potentially infinite, but the probability of all infinite branches is zero. (For the sake of concreteness, imagine chess where our opponent is playing some fixed known, possibly randomized, strategy.) 250/294 MCTS requirements The tree in which MCTS operates can be either given explicitly (if finite and small enough) or represented by some black-box model which allows for the following: • given a state node node, return a list of all action-node children of node; • given an action node anode, return a sample from p(anode), i.e. sample a child of anode according to the probability distribution specified by the tree; • given an action node anode and its child node, return the reward labeling the edge between the two nodes. The black-box can have a form of a finite explicitly represented MDP, or (in turn) of the sampling-model of such an MDP (as in classical RL). 251/294 Monte Carlo tree search – high-level properties Monte Carlo tree search (MCTS) is an online algorithm for solving (=maximizing the expected return in) tree-shaped MDPs. 
Online = it does not compute a complete policy for the whole MDP; instead given the current node = (current state and the history of states and actions visited so far,) the algorithm computes the (approximately) best action to play in the current step. Terminology: one step of MCTS = performing a computation to determine the best action given the current situation. MCTS can be employed over multiple steps, until a whole trajectory is generated. Thus, MCTS can be seen as an algorithmic representation of a policy. MCTS iteratively builds a structure called search tree, which is a finite sub-tree of the original (possibly infinite) MDP tree. The search tree is a global data structure shared across the steps. 252/294 MCTS master loop Algorithm 13: MCTS master loop Input: Tree-shaped MDP M with root s0 node ← s0 T ← tree with single node (root) s0 and all its child action nodes while node is not terminal do BuildTree(T ) // Modifies T a ← ActualSelect(T ) node′ ∼ p(a) node ← node′ T ← sub-tree of T rooted in node // If node not in T , this is a single-node tree with node as root. 253/294 Building the search tree Building the search tree also involves sampling trajectories from the environment. This is standard in RL: the algorithms we have seen so far were sampling from the environment to compute policies, and the policies themselves can be seen as prescriptions for interacting with (=sampling from) the environment. In contrast, planning/MCTS literature often differentiates between • the actual online actions played by the algorithm when interacting with the “true” environment, i.e. outcomes of ActualSelect function (e.g. moving a chess piece on physical a board); and • simulated actions performed when building the tree (e.g. imagining outcomes of various chess moves when thinking about the next move) The reason for the distinction is that in practice, we might want the simulated actions to be performed in a virtual model of the environment (for sake of sample efficiency), while the actual actions in the “true” environment. We will stick to the nomenclature simulated/actual action to distinguish between actions played during/at the end of individual steps. 254/294 Search tree statistics The BuildTree function maintains several statistics of each node node of the tree (both state and action node): • N(node) – the visit count of node how many times has the node been visited during the run of the algorithm • R(node) – the total return achieved from that node during the run of the algorithm (counting only rewards collected on the paths from node to a leaf) • V (node) = R(node) V (node) – the estimated value of the node (i.e., the estimate of best expected return achievable from the node; for action nodes this includes the immediate reward of that action). Like the whole search tree T , these statistics are global to the whole MCTS algorithm, and are typically not reset between BuildTree calls. The ActualSelect function also uses these statistics. Invariant: all state nodes in T have all there child action nodes also in T 255/294 Building the search three: four phases Input: Tree-shaped MDP M with root s0 node ← s0 T ← tree with single state node (root) s0 and all its (action node) children while node is not terminal do BuildTree(T ) a ← ActualSelect(T ) node′ ∼ p(a) node ← node′ T ← sub-tree of T rooted in node // If node not in T , this is a single-node tree with node as root. Tree building proceeds by repeatedly performing these 4 phases: 1. Search. 2. Expansion. 3. Rollout. 4. Backup. 
The phases are repeated until a predetermined timeout. 256/294 Tree building Algorithm 14: Tree building Procedure BuildTree(T ): repeat (node, anode) ← Search(T ) // Traverse top-down to the ‘‘best’’ node // not in T . Expand(T , node, anode) // Add the discovered node to T . (R, a) ← Rollout(node) // MC estimate of node’s value // via default policy. Backup(T , R, a) // Update statistics on branch from root to node. until timeout 257/294 Search phase • Traverse T from the root to the most promising action leaf. Then sample a state node child of this leaf (not yet in tree). • Traverse = in each state node, select an action and in each action node, sample a successor state node from the transition function of the environment. • Hence on each level, we also need to select “most promising” action in the current node. • Most promising = with best value estimate, but we also take exploration/exploitation dilemma into account. 258/294 Search phase 2 • We treat the decision in each concrete node as a separate multi-armed bandit problem. That is, when traversing a state node s, we select an action node a such that a = arg max a child of s V (a) + C · log N(s) N(a) , where C is an exploration constant. • I.e., when sampling, while in the tree we follow a UCB behavior policy. This approach is called upper confidence bound on trees (UCT), first proposed in optional reading: Kocsis, Szepesv´ari: Bandit-Based Monte Carlo Planning. In proceedings of ECML 2006. 259/294 Search phase: pseudocode Algorithm 15: Search phase Function Search(T ): node ← root of T repeat anode ← arg max a child of node V (a) + C · log N(node) N(a) node ∼ p(anode) until anode is a leaf return node, anode 260/294 Expansion phase Algorithm 16: Expansion phase Procedure Expand(T , node, anode): add node to T as a child of anode N(node) ← 0 R(node) ← 0 V (node) ← 0 // Add all action-node children of node foreach child a of node do add a to T as a child of node N(a) ← 0 R(a) ← 0 V (a) ← 0 (Same initialization used in root.) 261/294 Rollout phase • Perform a Monte Carlo sample of the rest of the trajectory from the newly added node. Record the return obtained. • Since we are now outside of the tree, UCT cannot be used (no statistics). Instead, we follow some fixed behavior policy called default policy. • Typical choice of default policy: in each node, select uniformly from all actions. • Domain knowledge can be used to craft more intricate default policies. 
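Tying the UCT rule of the search phase to code, a minimal sketch of the selection step over the child action nodes (node statistics are kept in dictionaries; names are illustrative):

```python
import math

def uct_select(children, N_parent, V, N, C=math.sqrt(2)):
    """Pick the child action node maximizing V(a) + C * sqrt(log N(parent) / N(a));
    C is the exploration constant, and unvisited children are tried first."""
    def score(a):
        if N[a] == 0:
            return float("inf")
        return V[a] + C * math.sqrt(math.log(N_parent) / N[a])

    return max(children, key=score)
```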
262/294 Rollout phase: pseudocode anode Algorithm 17: Rollout with uniform default policy Function Rollout(node): R ← 0 while node not terminal do sample a uniformly from all children of node if this is the first iteration of the loop then anode ← a node′ ∼ p(a) R ← R + r(node, a, node′ ) node ← node′ return R, anode 263/294 Backup phase R anode For all nodes currently in T traversed in the previous phases, update their statistics using the rollout outcome: Algorithm 18: Backup Procedure Backup(T , anode, R): n ← anode while true do N(n) ← N(n) + 1 R(n) ← R(n) + R V (n) ← R(n)/N(n) if n is the root then break if n is a state node then R ← R + the reward of the edge connecting n to its parent n ← parent(n) 264/294 MCTS Tree building summary Procedure BuildTree(T ): repeat (node, anode) ← Search(T ) Expand(T , node, anode) (R, a) ← Rollout(node) Backup(T , R, a) until timeout 265/294 MCTS: Actual action selection Input: Tree-shaped MDP M with root s0 node ← s0 T ← tree with single state node (root) s0 and all its (action node) children while node is not terminal do BuildTree(T ) a ← ActualSelect(T ) node′ ∼ p(a) node ← node′ T ← sub-tree of T rooted in node // If node not in T , this is a single-node tree with node as root. • Usually just greedy according to value estimates in the root: a = arg max a child of root(T ) V (a) . • Alternative: according to visit count: a = arg max a child of root(T ) N(a), or a ∼ softmax(Na)a child of root(T ) (both used in AlphaZero). 266/294 MCTS with Function Approximators: AlphaZero Towards AlphaZero • AlphaGo Lee (2016): uses lots of domain knowledge, won 4-1 over 9-dan Go champion Lee Sedol • AlphaGo Zero (2017): zero domain knowledge • AlphaZero (2017-2018): general game-playing (and MDP solving) algorithm 267/294 AlphaZero literature AlphaGo Zero: Silver et al.: Mastering the game of Go without human knowledge. In Nature, vol. 550 (2017). • Focuses on Go, but has a rich methodology section explaining design details. AlphaZero: Silver et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. In Science, vol. 362 (2018). • More general, but many designed choices already explained in the AlphaGo paper not covered: need to look into code for details. 268/294 MDP vs. game trees 1 2 :2 1 2 :0 1 3 :1 2 3 :-3 ...... ...... ...... ...... ...... ...... ...... ...... ...... • State nodes alternate between maximizer and minimizer nodes. Maximizer/minimizer player wants to maximize/minimize the expected return achieved in its subtree. For chess-type games, targeted by AlphaZero, simpler models suffice: • Rewards only in terminal states (-1,0,+1). • Deterministic action, each action node has only one successor. In figures, we will sometimes omit action nodes, but pseudocodes will refer to the most general formulation. 269/294 AlphaZero actual play s s′ s′′ ...... ...... ...... ...... • We are in some state node (game position) s. • We call AlphaZero to suggest an action to play. • We play the action in the game and observe the next state s′ . • We wait for the oponent to play an action in his state and observe the next state s′′ . • Repeat. 270/294 AlphaZero agent ...... ...... ...... ...... 
Algorithm 19: AlphaZero single game Input: Game tree with root s0, param’s θ Function AlphaZeroAgent(s0, θ): node ← s0 T ← tree with single node (root) s0 and all its child action nodes while node is not terminal do BuildTree(T , θ) // Modifies T a ← ActualSelect(T ) node′ ∼ p(a) oponent responds by playing a′ node′′ ∼ p(a) node ← node′′ T ← sub-tree of T rooted in node 271/294 AlphaZero = MCTS + function approximation • The BuildTree() function of AlphaZero replaces rollouts with an evaluation of newly discovered nodes via a function approximator. • The function approximator is trained on data generate by repeated plays in the actual environment (i.e., repeated calls of the AlphaZeroAgent() procedure.) AlphaZeroAgent(s0, θ) play Train(θ, τ) generate trajectory τ updated θ Slightly inaccurate AlphaZero training diagram (see later). 272/294 AlphaZero tree & approximator Search tree of AlphaZero differs from plain MCTS in that each action node anode carries an additional attribute: prior probability Pr(anode). Given a state node node, AlphaZero’s approximator predicts these quantities: • πθ(node) ∈ R|A| : the prior probability vector, assigning a probability to each action in node; • vθ(node): the value estimate of node The training will push θ so that • πθ better approximates the policy played by AlphaZero in the actual game; • vθ better approximates the expected return of AlphaZero in the actual game. However, AlphaZero does not play according to πθ: it uses the MCTS mechanism to improve upon πθ. Hence, AlphaZero is often presented as a MCTS-based policy improvement scheme. 273/294 AlphaZero: Tree building Procedure BuildTree(T , θ): AddNoise(T ) repeat (node, anode) ← Search(T ) // Search phase R ← ExpandPredict(T , node, anode, θ) // Expansion + prediction Backup(T , R, node) until timeout πθ, vθ 274/294 AlphaZero: Search phase (theory) Algorithm 20: Search phase Function Search(T ): repeat if node is maximizer’s then sgn ← 1 else sgn ← −1 node ← root of T anode ← arg max a child of node V (a)·sgn+C·Pr(a)· log N(node) N(a) node ∼ p(anode) until anode is a leaf return node, anode Note the self-play implemented by sgn and the modulation of exploration via prior probabilities. 275/294 AlphaZero: Search phase (DeepMind implementation) Algorithm 21: Search phase Function Search(T ): repeat if node is maximizer’s then sgn ← 1 else sgn ← −1 node ← root of T anode ← arg max a child of node V (a)·sgn+C(node)·Pr(a)· N(node) N(a) + 1 node ∼ p(anode) until anode is a leaf return node, anode where C(node) = log (+N(node)+cbase cbase + cinit with cinit = 1.25 and cbase = 19652. 276/294 AlphaZero: Expansion & Prediction πθ, vθ Algorithm 22: Expansion phase Procedure ExpandPredict(T , node, anode, θ): add node to T as a child of anode N(node) ← 0 R(node) ← 0 V (node) ← 0 // Add all action-node children of node foreach child a of node do add a to T as a child of node N(a) ← 0 R(a) ← 0 V (a) ← 0 Pr(a) ← πθ(a) return vθ(node) (Same initialization used in root.) 
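A sketch of the selection score used above. The reconstruction C(node) = log((1 + N(node) + c_base)/c_base) + c_init and the √N(node)/(1 + N(a)) exploration term follow the published AlphaZero pseudocode and should be read as an assumption here, with c_init = 1.25 and c_base = 19652 as on the slide:

```python
import math

def puct_score(V_a, N_a, prior_a, N_parent, sgn, c_init=1.25, c_base=19652):
    """AlphaZero-style score of an action node a under its parent state node:
        sgn * V(a) + C(parent) * Pr(a) * sqrt(N(parent)) / (1 + N(a)),
    where sgn is +1 in maximizer nodes and -1 in minimizer nodes (self-play)."""
    c = math.log((1 + N_parent + c_base) / c_base) + c_init
    return sgn * V_a + c * prior_a * math.sqrt(N_parent) / (1 + N_a)
```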
277/294 AlphaZero: Backup
For all nodes currently in T traversed in the previous phases, update their statistics using the predicted outcome:
Algorithm 23: Backup
Procedure Backup(T, R, node):
    n ← node
    while true do
        N(n) ← N(n) + 1
        R(n) ← R(n) + R
        V(n) ← R(n)/N(n)
        if n is the root then break
        if n is a state node then
            R ← R + the reward of the edge connecting n to its parent
        n ← parent(n)

278/294 Dirichlet noise for root exploration
Procedure BuildTree(T, θ):
    AddNoise(T)
    repeat
        (node, anode) ← Search(T)                  // Search phase
        R ← ExpandPredict(T, node, anode, θ)       // Expansion + prediction
        Backup(T, R, node)
    until timeout
Before the simulations start, we add Dirichlet noise to the prior probabilities in the root node, encouraging exploration. (Dirichlet distribution = a "distribution over discrete distributions".)
I.e., if the root has k child action nodes, we sample a vector ν ∼ Dirichlet(α⃗) and for each action child a of the root we perform:
    Pr(a) ← (1 − ε) · Pr(a) + ε · ν(a),
where α⃗ ∈ R^k_{>0} and ε ∈ (0, 1) are hyperparameters.

279/294 AlphaZero: Actual action selection
Algorithm 24: AlphaZero single game
Input: Game tree with root s0, parameters θ
Function AlphaZeroAgent(s0, θ):
    node ← s0
    T ← tree with single node (root) s0 and all its child action nodes
    while node is not terminal do
        BuildTree(T, θ)   // Modifies T
        a ← ActualSelect(T)
        node′ ∼ p(a)
        opponent responds by playing a′
        node′′ ∼ p(a′)
        node ← node′′
        T ← sub-tree of T rooted in node
The action in actual play is determined by visit counts: a = arg max_{a child of root(T)} N(a), or a ∼ softmax((N(a))_{a child of root(T)}).
The first (greedy) approach is typically used in deployment (e.g. competitive play) or later in training, while softmax is typically used early in training (with the temperature annealed over time to decrease exploration).

280/294 AlphaZero training
[diagram (slightly inaccurate, see later): AlphaZeroAgent(s0, θ) plays and generates a trajectory τ; Train(θ, τ) produces an updated θ]
[diagram (actual scheme): AlphaZeroAgent(s′0, θ) plays and pushes the generated trajectory τ as experience into a Buffer; Train(θ, B) samples experience minibatches from the Buffer and updates θ]

281/294 AlphaZero training details
A play of AlphaZeroAgent(s0, θ) produces a trajectory s0, a0, r1, s1, a1, ..., rT, sT. For each st we also store its statistics and the statistics of its child action nodes.
• For each timestep t:
    • Compute the total return Gt = Σ_{i=t+1}^{T} ri.
    • Compute the target policy πt s.t. πt(a) = N(a) / Σ_{b child of st} N(b).
    • If st is a maximizer state, add (st, πt, Gt) to the buffer.
    • If st is a minimizer state, add (st, πt, −Gt) to the buffer.
• Given a sampled experience e = (s, π, g), θ is updated so as to minimize the loss
    L(θ, e) = (vθ(s) − g)² − Σ_a π(a) · log πθ(s)(a) + c · ||θ||².
We typically perform minibatch updates akin to DQNs (average the updates over a number of sampled experiences). (A small Python sketch of this target construction and the root noise follows the next slide.)

282/294 State representation: chess example
The function approximator consists of feature-extraction layers followed by value and policy heads. The approximator is fed certain information carried by each state node. For chess, each node contains the k last positions along its history. Each position is represented as a set of 8 × 8 feature planes. Most of them represent the positions of pieces, with one plane for each piece type of each player.
[figure: an example chess position and three of its 8 × 8 feature planes — the king plane (K), the bishop plane (B), and a pawn plane (p); each plane has a 1 exactly at the squares occupied by the corresponding pieces]
Additional planes encode castling rights, repetitions, en-passant availability, etc.
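The following minimal Python sketch illustrates the root Dirichlet noise (slide 278) and the construction of training targets and the per-experience loss (slide 281). It is an illustration under assumed conventions only: the helper names (`add_root_noise`, `make_targets`, `loss`) and the buffer format are hypothetical, not taken from the slides or from DeepMind's code.

```python
import numpy as np

def add_root_noise(priors, alpha=0.3, eps=0.25):
    """Mix Dirichlet noise into the root priors: Pr(a) <- (1-eps)*Pr(a) + eps*nu(a)."""
    nu = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - eps) * np.asarray(priors) + eps * nu

def make_targets(states, rewards, visit_counts, maximizer_mask):
    """Turn one self-play game into (state, target policy, signed return) triples.

    rewards[t] is r_{t+1}; visit_counts[t] are the root's child visit counts N(a)
    at time t; maximizer_mask[t] says whether s_t is a maximizer state.
    """
    T = len(rewards)
    targets = []
    for t, s in enumerate(states[:T]):
        G_t = sum(rewards[t:])                     # G_t = sum_{i=t+1}^{T} r_i
        pi_t = np.asarray(visit_counts[t], dtype=float)
        pi_t /= pi_t.sum()                         # pi_t(a) = N(a) / sum_b N(b)
        g = G_t if maximizer_mask[t] else -G_t     # minimizer states store -G_t
        targets.append((s, pi_t, g))
    return targets

def loss(v_pred, policy_pred, g, pi_target, theta_sq_norm, c=1e-4):
    """L = (v - g)^2 - sum_a pi(a) log pi_theta(a) + c*||theta||^2, for one experience."""
    cross_entropy = -float(np.sum(pi_target * np.log(policy_pred + 1e-12)))
    return (v_pred - g) ** 2 + cross_entropy + c * theta_sq_norm
```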
283/294 Action representation: chess example
The policy head outputs 4,672 action logits from which we derive the distribution over moves by applying softmax. Each move is determined by:
• the position of the moved piece (8 · 8 = 64 possibilities),
• 73 possibilities of what to do with the piece:
    • 7 · 8 = 56 "standard moves": a choice of 8 directions (N, NW, W, ..., NE) and advancing by 1–7 squares in the chosen direction,
    • 8 "knight jumps": one of the 8 possible L-shaped jumps,
    • 9 possibilities for pawn underpromotion (to knight, bishop, or rook; each either via a forward move or via one of the two diagonal captures).
Total: 64 · 73 = 4,672 moves. Illegal moves are masked out and the remaining ones renormalized.

284/294 AlphaZero approximator architecture for chess
Body + value and policy heads.
Body:
• Initial convolutional layer with ReLU nonlinearity and batch normalization.
• Followed by 19 residual blocks, each block with 2 convolutional layers (again with ReLU and batch norm.) and a skip connection around them.
• All the above conv. layers apply 256 filters with 3 × 3 kernel size and stride 1.
Value head:
• Single-filter unit-kernel conv. layer with stride 1 (+ ReLU and batch norm.).
• Then a fully connected layer with 256 neurons and ReLU.
• Then a final tanh layer of size 1.
Policy head:
• One more conv. layer as above (incl. ReLU and batch norm.).
• Then a final conv. layer of 73 unit-kernel(?) filters.

285/294 AlphaZero experiments (from the Silver et al. Science paper)
[results figures from the paper not reproduced here]
AZ trained for 700,000 steps, ∼ 8 hours for chess, on 5,000 TPUs.
Note: this was the 2018 version of Stockfish, without NN evaluation. Modern versions of Stockfish use NN evaluation and typically outperform open-source AlphaZero-based chess agents (e.g. LeelaChessZero).

286/294 Limitations of AlphaZero
• MCTS simulations need a (software) simulator of the environment (at the least, we need to be able to sample transitions from an arbitrary given state). Such a simulator might not always be available (the actual environment may be too complex, its dynamics unknown (Atari), ...).
• Moreover, AlphaZero is best suited for "discrete enough" domains (such as chess). Working with frame-like states of Atari is impractical. (E.g., how do we check whether a given frame is already included in the search tree?)
The MuZero algorithm solves both issues by learning a deep model of the environment, including a suitable encoding of states.
Schrittwieser et al.: Mastering Atari, Go, chess and shogi by planning with a learned model. In Nature, vol. 588 (2020).

287/294 MuZero: Main idea
The original MuZero works only for deterministic environments. The high-level structure is similar to AlphaZero. However, the nodes in MCTS simulations are formed by elements of a fixed latent space LS (typically some low-dimensional vector space).
From the actual plays, the algorithm trains the following networks:
• the representation function gθ: S → LS,
• the dynamics function hθ: LS × A → LS × R,
• the value and policy approximators vθ: LS → R and πθ: LS → D(A) (with the same role as in AlphaZero).
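To make the role of the three networks concrete, here is a schematic Python sketch of a single MuZero-style simulation in latent space. It is only an illustration of the idea: the callables `g_theta`, `h_theta`, `v_theta`, `pi_theta` and the `select`/`backup` helpers are hypothetical stand-ins with the signatures listed above, not the actual MuZero code.

```python
class LatentNode:
    """Tree node whose 'state' is an element of the latent space LS."""
    def __init__(self, latent=None, prior=1.0):
        self.latent, self.prior, self.reward = latent, prior, 0.0
        self.children = {}          # action -> LatentNode
        self.N, self.V = 0, 0.0

def expand(node, pi_theta):
    """Attach one child per action; the child's latent state is filled in later by h_theta."""
    for a, p in enumerate(pi_theta(node.latent)):
        node.children[a] = LatentNode(prior=p)

def simulate_once(root_state, g_theta, h_theta, v_theta, pi_theta, select, backup):
    """One MCTS simulation: embed the real state, descend, unroll the learned dynamics, back up."""
    root = LatentNode(latent=g_theta(root_state))           # s~ = g_theta(s)
    expand(root, pi_theta)
    path, node, a = [root], root, None
    while node.children:                                     # descend with the usual (PUCT-style) rule
        a, node = select(node)
        path.append(node)
    parent = path[-2]
    node.latent, node.reward = h_theta(parent.latent, a)    # learned dynamics replaces the simulator
    expand(node, pi_theta)
    backup(path, v_theta(node.latent))                      # back up the predicted value along the path
    return root
```

A real BuildTree would of course keep the root and its statistics across simulations; the sketch re-creates the root only to keep the example short.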
288/294 MuZero: High-level picture
Apart from training (see the next slide for intuition), the "master loop" of a MuZero agent looks similar to AlphaZero's. During actual play:
• In the current state, perform MCTS simulations to determine the best action.
• Play the action.
• Observe the reward and the new state in the actual environment.
• Repeat.
The major difference is in the "MCTS simulation" part, which is performed in the latent space.

289/294 MuZero training intuition (slide by Richard Schwarz)

290/294 MCTS in latent space
Given the current actual state s:
• Embed s into the latent space to get the latent state s̃ = gθ(s).
• Using the dynamics function hθ, build the MCTS search tree from s̃ using the usual UCT approach. The nodes of the tree are elements of the latent space! (The approximators vθ, πθ are also used during the build, as in normal AlphaZero.)
• Use the statistics of the root to select the best action to play in the actual game.

291/294 MuZero experiments
Source: Schrittwieser et al.: Mastering Atari, Go, chess and shogi by planning with a learned model.

292/294 Further developments of MuZero and MCTS
• Extension to stochastic environments: Antonoglou, Schrittwieser, Ozair, Hubert, Silver: Planning in Stochastic Environments with a Learned Model. In Proceedings of ICLR 2022.
• Tricks to increase sample efficiency (e.g. GumbelZero).
• ...
MCTS has formed the basis of practical algorithms in various domains, e.g.:
• AlphaChip
• AlphaGeometry

293/294 Glimpse of the future

294/294