PA230: Reinforcement Learning
Petr Novotný

"Good and evil, reward and punishment, are the only motives to a rational creature; these are the spur and reins whereby all mankind are set on work and guided."
John Locke, Some Thoughts Concerning Education (1693)

Organizational Information

General
• Lecture: Thursdays 2-3:40 p.m.
• Homework: see the interactive syllabus in IS
  • mainly binary classification (accepted/not accepted)
  • all your homeworks need to be marked as passed to proceed to the exam
  • can (but do not have to) be done in pairs (pairs can differ across the individual assignments)
  • for those who passed, the teacher will receive feedback on the general quality of the solutions of each student; this can be taken into account when determining the final grade (typically in the students' favor)
• Exam:
  • oral
  • each attempt counts (?) (unlike the Brázdil system)
  • in general, knowledge of anything mentioned on the slides can be required, unless explicitly marked with "nex" (like the Brázdil system)

Team
• Lecturer: Petr Novotný
• HW team: Martin Kurečka, Václav Nevyhoštěný, Vít Unčovský

Communication
Official Discord server: https://discord.gg/9mxTgYhcdB
• Official communication forum of the course: falls under the university ethical guidelines.
• Use your real name for posting (you can set up an account under your IS email if necessary).

Reading
• Compulsory:
  • these slides,
  • material explicitly prescribed by these slides (not much).
• Recommended:
  • Sutton & Barto: Reinforcement Learning: An Introduction (2nd ed.), available at http://incompleteideas.net/book/RLbook2020.pdf (henceforth referenced as "S&B")
  • slides by David Silver: https://www.davidsilver.uk/teaching/
  • CMU slides: https://www.andrew.cmu.edu/course/10-703/
  • more specific literature recommendations will be given for each topic later

Reinforcement Learning: What, Why, When, How, & Other Questions

Types of machine learning
• unsupervised: spot "useful" patterns in data
• supervised: given labeled data, predict labels on unlabeled data
• reinforcement: agents and decision-making
  • agency = "the ability to take action or to choose what action to take" (Cambridge Dictionary)

General RL scheme
[Figure: the agent-environment interaction loop; source: Sutton & Barto, p. 48]
Keywords: sequential, dynamic, subject to uncertainty

RL: Objective and approach
• Objective: Design a decision policy (= agent behavior) which prescribes to the agent how to act in different situations (states), typically so as to achieve some goal.
• Approach: Start with (± random) behavior and adapt it based on past experience via the law of effect:
  • actions with good/bad consequences for the agent are more/less likely to be repeated by the agent (within the same context)
RL in psychology (nex)
• I.P. Pavlov (1849-1936): classical conditioning
• E. Thorndike (1874-1949): law of effect
• J.B. Watson (1878-1958): behaviorist manifesto
• B.F. Skinner (1904-1990): radical behaviorism, reinforcement, rewards

Learning by trying & the XX dilemma
Underlying the RL approach is the idea of learning by trying:
• first, act more or less randomly (exploration)
  • integral part of early human development
• continually adapt behavior according to experience and feedback from the environment (exploitation)
  • strength of feedback ≈ strength of behavior adaptation
Balancing exploration and exploitation (XX) is a recurring theme in RL.

Incomplete history of RL in computer science I
"Learning by trying" machines and software, ad hoc approaches:
• A. Turing (1912-1954), 1948: theoretical "pleasure & pain" system to train computers
• C. Shannon (1916-2001), 1950: Theseus, a maze-solving mouse
• M. Minsky (1927-2016), 1950s: analog neural net machines (SNARCs)
And many more... Recommended: S&B, Sec. 1.7.

Incomplete history of RL in computer science II
Mathematical foundations of sequential decision making: R. Bellman (1920-1984), R. Howard (b. 1934)
• formalization via Markov decision processes (MDPs)
• value iteration (attributed to Bellman, 1957)
• policy iteration (attributed to Howard, 1960)

Incomplete history of RL in computer science III
Since the late 1980s: synthesis – learning by trial in MDPs
• R. Sutton, A. Barto: temporal difference learning
• C. Watkins: Q-learning
...and many more.

Successes of RL (nex)
[Figures: examples of RL successes.]

Words of caution (and controversy) (nex)
[Figures.]

Mathematical Foundations of Sequential Decision-Making

MDP Example
An MDP with actions, rewards and transition probabilities.
[Figure: the "study" MDP with states start, next, passed, sleep and actions including study (reward -2), done (rewards 20 and 0), procrastinate (rewards 4 and -1), leave (reward -1, two successors with probability 1/2 each), and pub (reward +4, three successors with probabilities 0.3, 0.4, 0.3).]
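For concreteness, a small tabular MDP like the one above can be stored as nested dictionaries; the following is a minimal Python sketch (the concrete entries are illustrative placeholders, not a faithful transcription of the diagram).

# A minimal tabular representation: p[s][a] is a dict {s_next: probability},
# r[s][a] is the immediate reward. The entries below are illustrative placeholders,
# not a transcription of the diagram above.
p = {
    "start": {"study": {"next": 1.0}},
    "next":  {"done":  {"passed": 1.0}},
}
r = {
    "start": {"study": -2.0},
    "next":  {"done":  20.0},
}

def enabled_actions(p, r, s):
    """A(s): actions enabled in s, i.e. those for which both p(s, a) and r(s, a) are defined."""
    return [a for a in p.get(s, {}) if a in r.get(s, {})]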
Markov Decision Process
Given a set X, we denote by D(X) the set of all probability distributions over X.

Definition 1
A Markov decision process (MDP) is a tuple (S, A, p, r) where
• S is a set of states,
• A is a set of actions,
• p: S × A → D(S) is a probabilistic transition function,
• r: S × A → R is a reward function.
We will shorten p(s, a)(s′) to p(s′ | s, a).

The functions p, r can be partial: action a is enabled in state s if both p(s, a) and r(s, a) are defined. We denote by A(s) the set of all actions enabled in s.

Dynamics of MDPs
• start in some initial state s0
• the MDP evolves in discrete time steps t = 0, 1, 2, 3, . . .
• in each time step t, let st be the current state; then:
  • the agent selects an action at ∈ A(st)
  • the environment responds with the next state s_{t+1} ∼ p(st, at) and with the immediate reward r_{t+1} = r(st, at)
  • t is incremented and the process repeats in the same fashion forever
Thus, the agent produces a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .

The trajectory τ is produced randomly (due to p and possibly also the agent's choices being probabilistic): it is a random variable and so are its components. We define the random variables
• St = the state at time step t,
• At = the action at time step t,
• Rt = the reward received just before entering St.

Policies
Definition 2
A history is a finite prefix of a trajectory ending in a state, i.e., an object of the form s0, a0, r1, s1, a1, r2, . . . , a_{t−1}, r_t, s_t ∈ (S · A · R)∗S. We denote by last(h) the last state of a history h.

Definition 3
A policy is a function π: (S · A · R)∗S → D(A) which to each history h assigns a probability distribution over A(last(h)).
A policy is by definition an infinite object!

MD policies
Definition 4
A policy π is:
• memoryless if π(h) = π(h′) whenever last(h) = last(h′) (we can view memoryless policies as objects of type π: S → D(A));
• deterministic if π(h) always assigns probability 1 to one action and zero to all others (we can view deterministic policies as objects of type π: (S · A · R)∗S → A).

Definition 5
A policy π is MD (memoryless deterministic) if it is both memoryless and deterministic.
Dynamics of MDPs (more precise)
Given a distribution I of initial states and a policy π:
• start in some initial state s0 ∼ I
• the MDP evolves in discrete time steps t = 0, 1, 2, 3, . . .
• in each time step t, let ht be the history produced so far; then:
  • the agent selects an action at ∈ A(st) according to π, i.e. at ∼ π(ht)
  • the environment responds with the next state s_{t+1} ∼ p(st, at) and with the immediate reward r_{t+1} = r(st, at); the history is extended by at, r_{t+1}, s_{t+1}
  • t is incremented and the process repeats in the same fashion forever
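A minimal sketch of this interaction loop in Python, assuming the nested-dictionary representation of p and r sketched earlier and a memoryless randomized policy given as pi[s] = {action: probability} (all names here are illustrative, not part of the course material):

import random

def sample_trajectory(p, r, pi, s0, horizon):
    """Roll out s0, a0, r1, s1, ... for `horizon` steps.

    p[s][a] -> {s_next: prob}, r[s][a] -> float,
    pi[s]   -> {a: prob}  (a memoryless, possibly randomized policy).
    """
    traj = [s0]
    s = s0
    for _ in range(horizon):
        actions, action_probs = zip(*pi[s].items())
        a = random.choices(actions, action_probs)[0]     # a_t ~ pi(s_t)
        successors, succ_probs = zip(*p[s][a].items())
        s_next = random.choices(successors, succ_probs)[0]   # s_{t+1} ~ p(. | s_t, a_t)
        traj += [a, r[s][a], s_next]                     # ..., a_t, r_{t+1}, s_{t+1}
        s = s_next
    return traj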
Probability space induced by a policy
In particular, each policy π together with a distribution I of initial states induces a probability measure Pπ over the trajectories of the MDP. (I is typically known from the context and hence omitted from the notation.) We denote by Eπ the associated expected value (expectation) operator.

We denote by Pπ[E | S0 = s] the probability of event E provided that the initial state is fixed to s (and similarly for expectations).

Exercise 6
In the "study" MDP, consider an MD policy π s.t. π(start) = study and π(next) = pub. Compute the following quantities:
• Pπ[visit pub at least twice | S0 = start]
• Pπ[visit pub exactly twice | S0 = start]
• Eπ[R1]
• Eπ[R3]

Memorylessness
In this course, we will almost exclusively focus on memoryless policies. Hence, from now on, policy = memoryless policy. General policies will be referred to as history-dependent policies should the need arise.

Why memoryless? Intuition: the Markov property of MDPs: the next step depends only on the current state and on the action performed in the current step. Hence, intuitively, there is no need for a policy to remember the past so as to "play well". The sufficiency of memoryless policies does not extend to more general/complex decision-making settings (not covered in this course), such as:
• partially observable MDPs
• non-stationary environments
• quantile/risk-aware MDPs, etc.

Returns (payoffs)
Definition 7
Let γ ∈ [0, 1) be a discount factor. For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . we define the discounted return (or payoff) of τ to be the quantity
G(τ) = r1 + γ·r2 + γ^2·r3 + γ^3·r4 + · · · = Σ_{i=0}^{∞} γ^i · r_{i+1}.
Equivalently, G = Σ_{i=0}^{∞} γ^i · R_{i+1}.

Returns (variants)
• Finite horizon (FH): additionally, we are given a decision horizon H ∈ N ∪ {∞}. The return is counted only up to step H:
  G^H = Σ_{i=0}^{H−1} γ^i · R_{i+1}
  For finite H, the discount factor can be 1. H = ∞ corresponds to the original definition.
• Episodic returns: in episodic tasks, there is a distinguished set Term ⊆ S of terminal states which is guaranteed to be reached with probability 1 under any policy. We denote by T a random variable giving the first point in time when we hit a terminal state. We count rewards only up to that time:
  G^T = Σ_{i=0}^{T−1} γ^i · R_{i+1}
  This can be modeled under the original definition by "sink" states.
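A small helper for computing such a finite-horizon/episodic discounted return from a list of rewards r1, . . . , rT (a sketch; the backwards loop uses G_t = r_{t+1} + γ·G_{t+1}):

def discounted_return(rewards, gamma):
    """G = sum_{i=0}^{T-1} gamma^i * r_{i+1} for a finite reward list [r1, ..., rT]."""
    g = 0.0
    for reward in reversed(rewards):     # Horner-style: G_t = r_{t+1} + gamma * G_{t+1}
        g = reward + gamma * g
    return g

# e.g. discounted_return([-2, -2, 20], gamma=0.9) == -2 + 0.9*(-2) + 0.81*20 == 12.4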
Types of returns: discussion
• We will typically omit the superscripts since the type of task considered will be known from the context.
• We have G^H → G (pointwise) as H → ∞, i.e., finite-horizon returns with a high enough H approximate the standard (infinite-horizon) case.
• In the real world, we typically deal with FH or episodic tasks: we cannot wait an infinite amount of time to learn something from a trajectory. However, the infinite-horizon case can be viewed as a neat mathematical abstraction of FH and episodic tasks, and the classical sequential decision-making theory is most developed for the infinite-horizon case.

Policy and state values
Definition 8
Let π be a policy and s a state. The value of π in state s is the quantity v^π(s) = Eπ[G | S0 = s].

Exercise 9
Discuss the values of MD policies in our running example.

Definition 10
The (optimal) value of state s is the quantity v∗(s) = sup_π v^π(s).

Optimality
Definition 11
Let π be a policy and ε > 0. We say that π is ε-optimal in state s if v^π(s) ≥ v∗(s) − ε. We say that π is optimal in s if it is 0-optimal in s, i.e. if v^π(s) = v∗(s).
A policy is (ε-)optimal if it is (ε-)optimal in every state.

Existence of optimal policies
Theorem 12 (classical result, not formally proven here)
Let M be a finite MDP (i.e., the state and action sets are finite) with infinite-horizon returns. Then there exists an optimal MD policy. Moreover, an optimal MD policy can be computed in polynomial time.

Agent control solved? NO! This "only" works if you can actually construct the MDP model of your environment and fit it into a computer. Otherwise, we use reinforcement learning.

Exact Planning with Known Model: Value & Policy Iteration

Goal of this lecture
Algorithms that compute the optimal value vector v∗ and some optimal MD policy π∗ given full knowledge of an MDP M.

Polynomial-time algorithm
MDPs can be solved by linear programming (LP):
maximize ⃗c · ⃗x subject to A · ⃗x ≤ ⃗b
• LP can be solved in polynomial time by so-called interior-point algorithms.
• However, we typically use other, MDP-specific algorithms: value iteration (VI) and policy iteration (PI). These are not polynomial-time in general, but are typically faster on practical instances.
• Moreover, most truly RL algorithms can be seen as approximate generalizations of VI or PI (or both).

Example: Policy evaluation
Exercise 13
Consider all four MD policies in our running "pub or study" example that always try to quit when in the X state. Compute the values of these policies in the initial state start.

Policy evaluation equations
Theorem 14
For any memoryless policy π and any state s it holds:
v^π(s) = Σ_{a∈A(s)} π(a|s) · [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v^π(s′) ],
where the bracketed expression is denoted q^π(s, a).
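Since the evaluation equations are linear in v^π, a finite MDP can be evaluated directly with a linear solver. A sketch using numpy, assuming the dictionary representation of p and r used earlier and a memoryless randomized policy pi[s] = {a: prob}:

import numpy as np

def evaluate_policy(states, p, r, pi, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for v = v^pi (Theorem 14)."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P_pi = np.zeros((n, n))      # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
    r_pi = np.zeros(n)           # r_pi[s]     = sum_a pi(a|s) * r(s, a)
    for s in states:
        for a, pa in pi[s].items():
            r_pi[idx[s]] += pa * r[s][a]
            for s_next, ps in p[s][a].items():
                P_pi[idx[s], idx[s_next]] += pa * ps
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return {s: v[idx[s]] for s in states}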
Bellman optimality equations
Theorem 15
The following holds for any state s:
v∗(s) = max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v∗(s′) ],
where the bracketed expression is denoted q∗(s, a).

• Note: the policy evaluation equations are a special case of the Bellman ones: given a policy π, we can consider an MDP M^π in which there is a single action ∗ enabled in each state and the probability of the transition s →∗ s′ equals Σ_{a∈A(s)} π(a|s) · p(s′|s, a). Then M^π mimics the behavior of π in M, and the Bellman equations in M^π = the evaluation equations for π in M.
• But these equations are no longer linear! How do we solve them? Is the solution even unique?

Bellman update operator
The right-hand side (RHS) of the Bellman equations can be viewed as an operator Φ: R^S → R^S: for any ⃗x ∈ R^S, Φ(⃗x) is the vector such that for any state s:
Φ(⃗x)(s) := max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · ⃗x(s′) ]

Exercise 16
In our running example, compute Φ(⃗0).

Theorem 15 says that the optimal value vector v∗ is a fixed point of Φ: v∗ = Φ(v∗).

Mathematical hammers for Bellman
Lemma 17 (not proven here)
For any discount factor γ ∈ [0, 1), the Bellman operator Φ is a contraction, i.e. for any pair of vectors ⃗x, ⃗y it holds that
∥Φ(⃗x) − Φ(⃗y)∥∞ ≤ γ · ∥⃗x − ⃗y∥∞.

Theorem 18: Banach fixed point theorem (classical calculus, not proven here)
A contraction mapping from a complete metric space (in particular, R^n) to itself has a unique fixed point.

Exact characterization of v∗
Corollary 19
The optimal value vector is the unique solution of the Bellman optimality equations.

In particular, the policy evaluation equations also have a unique solution, equal to v^π. Since the policy evaluation equations are linear, their solution can be computed by Gaussian elimination. But the general Bellman equations are not linear. How can we solve them?
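For later reference, the operator Φ is straightforward to implement; a sketch under the same dictionary representation of p and r (the function name is an illustrative choice):

def bellman_update(states, p, r, gamma, x):
    """One application of the Bellman operator:
    (Phi x)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) * x(s') ]."""
    return {
        s: max(
            r[s][a] + gamma * sum(ps * x[s_next] for s_next, ps in p[s][a].items())
            for a in p[s]
        )
        for s in states
    }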
Banach fixpoint theorem (full)
Theorem 20: Banach fixed point theorem (full version, not proven here)
A contraction mapping Φ from a complete metric space (in particular, R^n) to itself has a unique fixed point ⃗z. Moreover, ⃗z is the limit of iterative applications of Φ to any initial vector, i.e., for any ⃗x0 ∈ R^n, the sequence ⃗x0, Φ(⃗x0), Φ(Φ(⃗x0)), Φ^(3)(⃗x0), . . . converges to ⃗z:
⃗z = lim_{i→∞} Φ^(i)(⃗x0)

Value iteration (VI; Bellman, 1957)
Algorithm 1: Value iteration
Input: MDP M = (S, A, p, r)
Output: approximation ˜v of v∗
x ← any vector from R^|S|   // typically ⃗0
next ← x
repeat
  foreach s ∈ S do
    next(s) ← max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · x(s′) ]   // = Φ(x)(s)
  x ← next
until termination condition

Typical termination conditions:
• after a fixed number of iterations (i.e., use a for-loop with a fixed bound)
• after each component of x changes by less than some given ε

How to use VI
By the Banach fixpoint theorem (and Lemma 17), the value of the variable x in VI converges to v∗. Can we recognize when x is "close enough" to v∗?

In the following couple of theorems, let ⃗x0, ⃗x1, ⃗x2, . . . be the sequence of vectors computed by VI, i.e. ⃗x0 is arbitrary and ⃗x_{i+1} = Φ(⃗x_i) for all i ≥ 0.

Theorem 21: Stopping condition (not proven here)
For any ε > 0: if ∥⃗x_{i+1} − ⃗x_i∥∞ ≤ ε · (1 − γ)/γ, then ∥⃗x_{i+1} − v∗∥∞ ≤ ε.

How to use VI (2)
How fast can we get to the point where we are close enough?
Theorem 22: Speed of convergence (not proven here)
For all n ≥ 0 it holds that
∥⃗x_n − v∗∥∞ ≤ (γ^n / (1 − γ)) · ∥⃗x_1 − ⃗x_0∥∞.
In particular, if we terminate VI after
i = ⌈ ( log(ε) + log( (1 − γ) / ∥⃗x_1 − ⃗x_0∥∞ ) ) / log(γ) ⌉
steps, then its output ⃗x_i will be an ε-approximation of v∗.

How to use VI (3)
Can we actually get some optimal values instead of approximations?
First, note that VI computes optimal finite-horizon values. Let v^H = sup_π Eπ[ Σ_{i=1}^{H} γ^{i−1} · R_i ]. The supremum is over all (i.e., history-dependent) policies, since in the FH problem an optimal policy needs to track the number of elapsed (and thus remaining) steps: memory is needed for that.

Theorem 23 (easy but important exercise)
If ⃗x_0 = ⃗0, then ⃗x_H = v^H for all H ≥ 0.

Moreover, let π_H be a deterministic history-dependent policy such that for all 1 ≤ i ≤ H, whenever there are i steps remaining till the horizon, the policy π_H selects in state s an action a s.t.
a = arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · ⃗x_{i−1}(s′) ]
(with ties broken arbitrarily). Then π_H is an optimal H-step policy.

Greedy policies
Can we actually get an optimal policy for the infinite-horizon problem?
Definition 24: ⃗x-greedy policy (very important)
Let ⃗x ∈ R^S be any vector. An ⃗x-greedy policy is an MD policy π such that in any state s:
π(s) = arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · ⃗x(s′) ].
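A sketch of Algorithm 1 with the stopping condition of Theorem 21, together with extraction of an ⃗x-greedy policy (Definition 24); it reuses the bellman_update sketch from above and the same dictionary representation (names are illustrative):

def value_iteration(states, p, r, gamma, eps):
    """Iterate Phi until ||x_{i+1} - x_i||_inf <= eps * (1 - gamma) / gamma (Theorem 21)."""
    x = {s: 0.0 for s in states}
    while True:
        nxt = bellman_update(states, p, r, gamma, x)
        if max(abs(nxt[s] - x[s]) for s in states) <= eps * (1 - gamma) / gamma:
            return nxt
        x = nxt

def greedy_policy(states, p, r, gamma, x):
    """An x-greedy MD policy (Definition 24), ties broken arbitrarily."""
    return {
        s: max(p[s], key=lambda a: r[s][a]
               + gamma * sum(ps * x[s_next] for s_next, ps in p[s][a].items()))
        for s in states
    }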
How to use VI (4)
Theorem 25: Optimal infinite-horizon policy from VI (not proven here)
There is a number N, polynomial in the size of the MDP and exponential in the binary encoding size of γ, such that a policy π that is ⃗x_N-greedy is optimal in every state, i.e. v^π = v∗.

Note that once π is computed, v^π can be computed in polynomial time via the policy evaluation equations. Hence, VI can be said to solve MDPs in exponential time (and in polynomial time if the discount factor is assumed to be a fixed constant instead of an input parameter), though the approximate version is typically used in practice.

Note: the fact that some policy π is ⃗x-greedy does not mean that v^π ≥ ⃗x! Homework: find a counterexample and post it to Discord. However, for VI it can be shown that if ∥⃗x_{i+1} − ⃗x_i∥∞ ≤ ε · (1 − γ)/γ (the stopping condition from Theorem 21), then an ⃗x_{i+1}-greedy policy is ε-optimal.

Policy improvement
[Figure: the "study" MDP again. In-class exercise: compute v^π, let π′ be v^π-greedy; compute v^{π′}, let π′′ be v^{π′}-greedy; and so on.]

Policy improvement theorem
Theorem 26: Policy improvement
Let π be a policy. If Φ(v^π) ≥ v^π, then any v^π-greedy policy π_g is at least as good as π, i.e.
∀s ∈ S: v^{π_g}(s) ≥ v^π(s).
Moreover, if Φ(v^π)(s) > v^π(s) for some state s, then also v^{π_g}(s′) > v^π(s′) for some state s′.

Returns from a given time step
For the proof of PIT, and also many times later, we will need the following notation:
Definition 27 (important!)
Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t ∈ N a time step. We define
G_t(τ) = Σ_{i=t}^{H−1} γ^{i−t} · r_{i+1} = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + · · · ,
where H ∈ N ∪ {∞}, or H = T for episodic tasks. We similarly define, for any policy π:
G^π_t = Eπ[G_t] = Eπ[ Σ_{i=t}^{H−1} γ^{i−t} · R_{i+1} ].

Proof of PIT (setup)
We will define a sequence of policies π0, π1, π2, . . . s.t.:
• π0 = π,
• πi behaves as π_g (i.e., selects the same actions in the same states) for the first i steps, then "switches" back to behave as π;
• we also define π∞ = π_g.
We want: v^{π∞}(s) ≥ v^π(s) for all s.
Not hard to see: v^{πi} → v^{π∞} as i → ∞ (πi behaves as π∞ for longer and longer as i increases + discounting). It suffices to show: v^{πi}(s) ≥ v^π(s) for all i ∈ N and all s ∈ S.
Proof of PIT (induction)
We prove v^{πi}(s) ≥ v^π(s) for all i ∈ N and all s ∈ S.
• i = 0: clear.
• i > 0:
v^{πi}(s) = Eπi[ R1 + γR2 + · · · + γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
= Eπi[ R1 + γR2 + · · · + γ^{i−2}R_{i−1} | S0 = s ] + Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
= Eπ_{i−1}[ R1 + γR2 + · · · + γ^{i−2}R_{i−1} | S0 = s ] + Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
(the first i−1 rewards depend only on the first i−1 steps, on which both πi and π_{i−1} behave as π_g).
Suppose we prove
Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ] ≥ Eπ_{i−1}[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ].   (*)
Then
v^{πi}(s) ≥ Eπ_{i−1}[ R1 + γR2 + · · · + γ^{i−2}R_{i−1} + γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ] = v^{π_{i−1}}(s) ≥ v^π(s),
where the last inequality is the induction hypothesis (IH).

Proof of PIT (induction, behavior at "reset")
It remains to prove (*):
Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
= γ^{i−1} · Eπi[ R_i + γR_{i+1} + · · · | S0 = s ]
= γ^{i−1} · Σ_{s′∈S} Pπi[ S_{i−1} = s′ | S0 = s ] · ( r(s′, πi(s′)) + γ · Σ_{s″} p(s″ | s′, πi(s′)) · Eπi[ G_i | S_i = s″ ] )
= γ^{i−1} · Σ_{s′∈S} Pπ_g[ S_{i−1} = s′ | S0 = s ] · ( r(s′, π_g(s′)) + γ · Σ_{s″} p(s″ | s′, π_g(s′)) · Eπ[ G_i | S_i = s″ ] )
(πi follows π_g for the first i steps and behaves as π afterwards; note Eπ[ G_i | S_i = s″ ] = v^π(s″))
= γ^{i−1} · Σ_{s′∈S} Pπ_g[ S_{i−1} = s′ | S0 = s ] · Φ(v^π)(s′)   (since π_g is v^π-greedy)
≥ γ^{i−1} · Σ_{s′∈S} Pπ_g[ S_{i−1} = s′ | S0 = s ] · v^π(s′)   (since Φ(v^π) ≥ v^π by the PIT assumption)
= γ^{i−1} · Eπ_{i−1}[ G_{i−1} | S0 = s ] = Eπ_{i−1}[ γ^{i−1} G_{i−1} | S0 = s ]
= Eπ_{i−1}[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ].
Strictness?

Policy iteration (PI; Howard, 1960)
Algorithm 2: Policy iteration
Input: MDP M = (S, A, p, r)
Output: optimal MD policy π∗ for M, its value vector v∗
π ← arbitrary MD policy
v ← v^π   // e.g. by solving the linear policy evaluation equations
while Φ(v) ̸= v do
  foreach s ∈ S do
    π(s) ← arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v(s′) ]
  v ← v^π
return π, v
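A sketch of Algorithm 2 in the same style, reusing the evaluate_policy, bellman_update and greedy_policy sketches from earlier (the exact test Φ(v) = v is replaced by a small floating-point tolerance):

def policy_iteration(states, p, r, gamma):
    """Alternate exact policy evaluation and greedy improvement until Phi(v) = v."""
    # start from an arbitrary MD policy, here: the first enabled action in each state
    policy = {s: next(iter(p[s])) for s in states}
    while True:
        # evaluate_policy expects a randomized policy pi[s] = {a: prob}
        v = evaluate_policy(states, p, r, {s: {policy[s]: 1.0} for s in states}, gamma)
        phi_v = bellman_update(states, p, r, gamma, v)
        if all(abs(phi_v[s] - v[s]) < 1e-12 for s in states):   # Phi(v) = v, up to tolerance
            return policy, v
        policy = greedy_policy(states, p, r, gamma, v)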
PI correctness & complexity
Theorem 28
Policy iteration terminates after at most exponentially many iterations. Upon termination, it returns an optimal MD policy.

Proof:
• Optimal upon termination: v^π is a fixpoint of Φ when terminating; optimality follows from Corollary 19.
• Terminates: π always stores an MD policy and there are finitely many of these. We will show that no single MD policy appears in more than one iteration of PI. Consider any iteration and let v, v′ be the contents of the variable v before and after the iteration. We will show that unless Φ(v) = v, it holds that v′ > v, i.e. v′ ≥ v componentwise with a strict inequality in some component. Hence, v = v^π strictly increases during PI, so no π can appear twice.

PI: correctness proof
v′ ≥ v: We verify the assumption of PIT: Φ(v) ≥ v. Recall v = v^π. For all s ∈ S:
Φ(v)(s) = max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′ | s, a) · v(s′) ]
≥ r(s, π(s)) + γ · Σ_{s′∈S} p(s′ | s, π(s)) · v^π(s′)
= v^π(s) = v(s).
By PIT, v′ = v^{π′} ≥ v^π = v (here π′ is the v-greedy policy).

PI: correctness proof II
It remains to prove that v′ > v or PI terminates. Assume that v′ = v. Then for all s ∈ S:
v′(s) = r(s, π′(s)) + γ · Σ_{s′∈S} p(s′ | s, π′(s)) · v^{π′}(s′)
= r(s, π′(s)) + γ · Σ_{s′∈S} p(s′ | s, π′(s)) · v^π(s′)   (assumption)
= max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′ | s, a) · v^π(s′) ]   (π′ is v = v^π-greedy)
= max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′ | s, a) · v^{π′}(s′) ]   (assumption)
= Φ(v′)(s),
so PI terminates at this point.
Complexity?

PI&VI: discussion
• We know that MDPs have a linear programming (LP) formulation. PI is basically a variant of the simplex method for this LP, using a special pivoting rule.
• PI typically requires fewer iterations to converge than VI, though each iteration is more expensive (policy evaluation).
• Both PI and VI typically work well in practice for MDPs whose explicit transition table fits inside a computer. Which of the two is faster is rather domain-specific.

PI variants
Can we get rid of the expensive policy evaluation by linear system solving?
Yes: we can approximate the value of the current policy π by applying VI on the MDP M^π, either for a fixed number of steps or until v does not change much.
Often appearing in RL textbooks:

Policy iteration with approximate evaluation
π ← arbitrary MD policy; v ← arbitrary vector
repeat
  v ← Eval(π, v)
  foreach s ∈ S do
    π(s) ← arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v(s′) ]
until π has not changed
return π, v

Function Eval(π, v):
  v′ ← v
  repeat
    foreach s ∈ S do
      v(s) ← v′(s)
      v′(s) ← r(s, π(s)) + γ · Σ_{s′∈S} p(s′ | s, π(s)) · v(s′)
  until ∥v − v′∥∞ ≤ ε
  return v′

Convergence of PI variants
• The algorithm on the previous slide still converges to an optimal policy provided that ε is small enough.
• If we replaced the "π has not changed" condition with the original "Φ(v) = v" condition, the algorithm might not terminate, since VI is only guaranteed to reach a true fixpoint in the limit. However, v would still converge to v∗ and thus π would eventually become equal to an optimal policy.
• The previous point holds even in the very degenerate case when we do just one iteration of VI per policy evaluation! See the next slide.

Curiously looking approximate PI
v ← arbitrary vector; π ← v-greedy MD policy
repeat
  foreach s ∈ S do
    v′(s) ← r(s, π(s)) + γ · Σ_{s′∈S} p(s′|s, π(s)) · v(s′)
  v ← v′
  foreach s ∈ S do
    π(s) ← arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v(s′) ]
until Φ(v) = v
return π, v

This is just VI in disguise!

Generalized policy iteration
• Evaluate: push v towards v^π
• Improve: make π (more) v-greedy
[Figure: the generalized policy iteration diagram; source: Sutton&Barto, p. 87]

Tabular Methods for Model-Free Reinforcement Learning

Model-free
We will still be working with MDPs. But for a bunch of the following lectures, we will not (necessarily) have access to, e.g.:
• a table containing an explicit enumeration of all states/actions
• a table containing the description of p or r
• the ability to compute the probability vector p(s, a) or the reward signal r(s, a) given s and a (having this = a gray-box model of the MDP)

Sampling from MDP
But the MDP is still there "behind the scenes". In particular, we:
• know how the states of the MDP look like
  • (e.g. robot state = all possible output values of its sensors)
• know how the actions of the MDP look like
  • (e.g. robot action = all possible signals that can be sent to the actuators)
• can, for any s ∈ S, enumerate A(s)
  • could be weakened, but simplifies things
• given s ∈ S and a ∈ A(s), can sample the next state s′ ∼ p(s, a) and receive the reward r(s, a).
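In code, this sampling access is typically what an environment API provides. A sketch of collecting one episode, assuming a Gymnasium-style environment object with reset()/step() (the env and policy objects are assumptions of this sketch, not something defined on the slides):

def sample_episode(env, policy):
    """Collect one episode s0, a0, r1, s1, ... from a Gymnasium-style environment.

    `policy(s)` returns an action; `env.step(a)` returns
    (s_next, reward, terminated, truncated, info) as in the Gymnasium API.
    """
    states, actions, rewards = [], [], []
    s, _info = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, reward, terminated, truncated, _info = env.step(a)
        states.append(s); actions.append(a); rewards.append(reward)
        done = terminated or truncated
        s = s_next
    states.append(s)                      # final (terminal or truncation) state
    return states, actions, rewards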
Sampling from a policy
Given an effective representation of a policy π, we can sample a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . by performing, for each t ∈ {0, . . . , T}:
• sample at ∼ π(st)
• query the environment for s_{t+1} ∼ p(st, at) and r_{t+1} = r(st, at)
• increment t

Tabular = value estimates and policies represented as tables (e.g. Q(s, a) for each state s and action a used in s – an explicit representation might only be needed for states/actions actually encountered).

Basic classification of (tabular) RL algorithms
Three independent axes:
• problem: policy evaluation (value prediction) vs. control
• on/off: on-policy vs. off-policy
• updates: Monte Carlo vs. temporal difference

Assumptions: successor-dependent rewards & episodic tasks
Since we no longer have knowledge of the transition dynamics p, we cannot freely interchange reward functions of type S × A × S → R and S × A → R via the equation r(s, a) = Σ_{s′∈S} p(s′ | s, a) · r(s, a, s′). Hence, to maintain generality (and correspondence to e.g. Gymnasium environments) we will assume reward functions of type S × A × S → R.

We will assume episodic returns: each trajectory terminates with probability 1 at some (possibly random) time step T. Termination can be defined e.g. by reaching some terminal state or by running out of some fixed decision horizon (in Gymnasium, this is sometimes called truncation):
G = Σ_{i=0}^{T−1} γ^i · R_{i+1}.
Episode = one high-level iteration of an RL algorithm, corresponding to sampling a single trajectory from some policy.

Monte Carlo Methods

MC evaluation
Policy evaluation: given an effective representation of a policy π, estimate v^π (or q^π).
Naive Monte Carlo: sample from π: if {τ1, τ2, . . . , τn} are trajectories (episodes) independently sampled under π from the same initial state s, then
(1/n) Σ_{i=1}^{n} G(τi) → v^π(s) as n → ∞
due to the law of large numbers (LLN).
But this throws away a lot of valuable information! E.g. what if we want to estimate the whole v^π?

First-visit MC
For each s, we estimate v^π(s) as an average of the set of sample returns Ret(s), which is formed as follows:
• initially, Ret(s) = ∅ for all s
• we then sample trajectories until timeout:
  • for each sampled trajectory τ and each state s, we identify the first occurrence of s on τ; let this be at timestep t; we add G_t(τ) to Ret(s)
[Figure: an example trajectory in the running MDP, with the recorded first-visit returns.]

A sub-trajectory starting at the first appearance of s can be seen as a trajectory sampled from π when s is the initial state! (Since we consider memoryless π.)

Theorem 29
As |Ret(s)| → ∞, the average of Ret(s) converges to v^π(s). Moreover, the average of Ret(s) is an unbiased estimate of v^π(s) (as long as Ret(s) ̸= ∅).
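A sketch of first-visit MC evaluation, assuming episodes are given as the (states, actions, rewards) triples produced by a sampler like the one sketched earlier; returns are accumulated backwards via G_t = r_{t+1} + γ·G_{t+1}:

from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate v^pi(s) as the average of the first-visit sample returns Ret(s)."""
    returns = defaultdict(list)                          # Ret(s)
    for states, _actions, rewards in episodes:
        # G[t] = r_{t+1} + gamma * G[t+1], computed backwards along the episode
        G = [0.0] * (len(rewards) + 1)
        for t in reversed(range(len(rewards))):
            G[t] = rewards[t] + gamma * G[t + 1]
        first_visit = {}                                 # s -> first t with S_t = s
        for t, s in enumerate(states[:-1]):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}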
First-visit MC (pseudocode)
[Pseudocode figure; source: Sutton&Barto, p. 92]

Every-visit MC
For each s, we estimate v^π(s) as an average of the set of sample returns Ret(s), which is formed as follows:
• initially, Ret(s) = ∅ for all s
• we then sample trajectories until timeout:
  • for each sampled trajectory τ, each state s, and each t such that S_t(τ) = s, we add G_t(τ) to Ret(s)
[Figure: the same example trajectory, now with every-visit returns recorded.]

The sample returns added to Ret(s) within the same episode are not independent! Hence, the estimate is biased, though the bias vanishes in the limit:
Theorem 30
As |Ret(s)| → ∞, the average of Ret(s) converges to v^π(s).

MC convergence
Optional reading: more on MC estimate bias, variance, and convergence in:
Singh, S.P. and Sutton, R.S.: Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning 22:123-158. Kluwer, 1996. (Section 3, particularly 3.3 and onwards; you can skip Theorem 4.)

Towards MC control
Control = computation of a "good" policy for a given environment. (Ideally, the policy should get closer to the optimal policy the more episodes we sample.)
We know (PIT): given a policy π, a v^π-greedy policy is at least as good as π:
π_g(s) = arg max_{a∈A(s)} Σ_{s′∈S} p(s′ | s, a) · [ r(s, a, s′) + γ · v^π(s′) ]
Do we have an algorithm? There is an issue: computing this greedy policy from v^π requires the transition probabilities p, which we do not have.

MC control with q-values
Recall: q^π(s, a) := Σ_{s′∈S} p(s′ | s, a) · [ r(s, a, s′) + γ · v^π(s′) ].
Thus, the v^π-greedy policy π_g can be defined as:
π_g(s) = arg max_{a∈A(s)} q^π(s, a)
Estimate q^π by MC.

MC for q-value estimation
Analogous to value estimation, e.g. first-visit: for each pair (s, a), we estimate q^π(s, a) as an average of the set of sample returns Ret(s, a), which is formed as follows:
• initially, Ret(s, a) = ∅ for all s, a
• we then sample trajectories until timeout:
  • for each sampled trajectory τ and each state-action pair (s, a), we identify the first t such that S_t(τ) = s ∧ A_t(τ) = a; we add G_t(τ) to Ret(s, a)
[Figure: the example trajectory with the state-action pairs along it.]
Similarly for every-visit. The convergence guarantees are the same as for state values.

Infinite exploration and exploring starts
Issue: MC only estimates q^π(s, a) if:
• s is guaranteed to be visited with positive probability in each episode,
• π(a | s) > 0.

Definition 31: Infinite exploration
An RL algorithm has infinite exploration (IE) if, during the infinite execution of the algorithm, each state-action pair (s, a) is visited infinitely often with probability 1.

One way of achieving IE is through exploring starts (ES): each episode begins with a (typically uniformly) randomly selected s0 and a0. This is achievable when training, e.g., in simulated environments but might be difficult/impossible in real-world environments.

MC control with exploring starts
[Pseudocode figure; source: Sutton&Barto, p. 99]

IE through ε-soft policies
Exploring starts are not always feasible. Alternative: make the sampled policy itself exploratory.
Definition 32: ε-soft policy
A policy π is ε-soft if for every s ∈ S and every a ∈ A(s) it holds that π(a|s) ≥ ε/|A(s)|.

Definition 33: ε-greedy policy
Let v ∈ R^S be a value vector. A policy π is v-ε-greedy if for every state s ∈ S there is an action a∗ = arg max_{a∈A(s)} Σ_{s′∈S} p(s′ | s, a) · [ r(s, a, s′) + γ · v(s′) ] such that for any action a ∈ A(s):
π(a|s) = ε/|A(s)| if a ̸= a∗, and π(a|s) = 1 − ε + ε/|A(s)| if a = a∗.
Interpretation: with probability ε, play uniformly at random; with probability 1 − ε, play greedily.

ε-softing
Definition 34
Let π be a policy. An ε-softing of π is a policy πε defined as follows: in each state s,
• with probability ε, πε selects an action uniformly at random;
• with probability 1 − ε, πε selects a ∼ π(s).
I.e., an ε-greedy policy can alternatively be defined as an ε-softing of a greedy policy.

MC control with ε-greedy policies
[Pseudocode figure; source: Sutton&Barto, p. 101]

Policy iteration for ε-soft policies
Theorem 35
Let π be an ε-soft policy and let π′ be a v^π-ε-greedy policy. Then v^{π′} ≥ v^π (componentwise). Moreover, the two value vectors are equal if and only if both π and π′ are optimal among all ε-soft policies; i.e. if, for every state s:
v^π(s) = sup over ε-soft ¯π of v^{ ¯π}(s).
Proof: required reading: Sutton&Barto, pp. 101-103.
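An on-policy MC control sketch in the spirit of the pseudocode referenced above: maintain Q(s, a) as incremental averages of first-visit returns and behave ε-greedily with respect to the current Q. It reuses the sample_episode sketch from earlier and assumes, for simplicity, the same action set in every state; this is an illustrative sketch, not the book's exact algorithm.

import random
from collections import defaultdict

def epsilon_greedy_action(Q, actions, s, eps):
    """With probability eps act uniformly at random, otherwise greedily w.r.t. Q(s, .)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def mc_control(env, actions, gamma, eps, n_episodes):
    Q = defaultdict(float)                     # q-value estimates
    counts = defaultdict(int)                  # number of returns averaged into Q(s, a)
    behave = lambda s: epsilon_greedy_action(Q, actions, s, eps)
    for _ in range(n_episodes):
        states, acts, rewards = sample_episode(env, behave)
        G = 0.0
        for t in reversed(range(len(rewards))):            # walk the episode backwards
            G = rewards[t] + gamma * G                      # G_t
            sa = (states[t], acts[t])
            if sa not in zip(states[:t], acts[:t]):         # first-visit check
                counts[sa] += 1
                Q[sa] += (G - Q[sa]) / counts[sa]           # incremental average of Ret(s, a)
    return Q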
Incremental computing of averages
Given a sample {n1, n2, . . . , n_{k+1}} and the average A = avg({n1, n2, . . . , nk}), how do we compute A′ = avg({n1, n2, . . . , nk, n_{k+1}}) without recomputing the average of the whole sample?
A′ = (k/(k+1)) · A + n_{k+1}/(k+1).
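The same update in code, written in the equivalent form A ← A + (n_{k+1} − A)/(k + 1) that reappears in many RL update rules (a trivial sketch):

def update_average(avg, count, new_value):
    """Average of k+1 samples, given the average `avg` of the first k=`count` samples."""
    count += 1
    avg += (new_value - avg) / count        # equals (k/(k+1)) * A + n_{k+1}/(k+1)
    return avg, count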
On-policy vs. off-policy • On-policy algorithms: track one “policy variable” π; the policy stored in π is used to interact with the environment (i.e., to sample episodes) and at the same time we learn something about it (e.g. its value vector). • Corresponds to the generalized policy iteration scheme. • All the MC algos we have seen so far. • Off-policy algorithms: track more (typically two) different policy variables: • behavior policy: used to sample episodes • target policy: which we want to learn about 81/189
Off-policy evaluation We are given effective representations of: • a behavior policy β, • a target policy π. The task is to estimate vπ by sampling episodes from β. We cannot sample from π! (E.g. π is too risky or expensive to sample from.) Assumptions: • given (s, a), we can effectively compute π(a|s) and β(a|s) (or at least estimate them via sampling) • coverage: ∀s ∈ S, a ∈ A(s): if π(a|s) > 0, then also β(a|s) > 0 82/189
Importance sampling Definition 36: Importance ratio Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. The importance-sampling ratio of τ is the quantity ρ(τ) := Pπ[τ | S0 = s0] / Pβ[τ | S0 = s0] = Pπ[A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0] / Pβ[A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0]. ρ(τ) can be computed without the knowledge of the MDP transition probabilities: ρ(τ) = (π(a0 | s0) · p(s1 | s0, a0) · π(a1 | s1) · p(s2 | s1, a1) · · ·) / (β(a0 | s0) · p(s1 | s0, a0) · β(a1 | s1) · p(s2 | s1, a1) · · ·), and the transition probabilities cancel out. 83/189
Importance ratio from time t Definition 37 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. By τi..j we denote the subtrajectory of τ starting in time step i and ending in time step j. By τi.. we denote the suffix si, ai, ri+1, si+1, ai+1, . . .. Definition 38: Importance ratio from time t Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t a time step. The importance-sampling ratio of τ from t is the quantity ρt(τ) := Pπ[τt.. | S0 = st] / Pβ[τt.. | S0 = st] = Pπ[A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st] / Pβ[A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st]. 84/189
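A small illustrative sketch of computing ρt(τ): the transition probabilities cancel, so only the two policies are needed. The interfaces pi(a, s) and beta(a, s) returning action probabilities, and the trajectory given as a list of (state, action) pairs, are assumptions made for the sketch.

```python
def importance_ratio(state_action_pairs, pi, beta, t=0):
    """rho_t(tau) = prod_{k >= t} pi(a_k | s_k) / beta(a_k | s_k).

    `state_action_pairs` lists (s_k, a_k) up to (but excluding) the terminal state;
    `pi(a, s)` and `beta(a, s)` return the respective action probabilities.
    """
    rho = 1.0
    for s, a in state_action_pairs[t:]:
        # coverage assumption: beta(a, s) > 0 whenever pi(a, s) > 0
        rho *= pi(a, s) / beta(a, s)
    return rho
```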
Off-policy evaluation with importance sampling Theorem 39 For any s ∈ S it holds: Eβ[ρ · G | S0 = s] = vπ(s). Proof: Eβ[ρ · G | S0 = s] = Σ_τ Pβ[τ | S0 = s] · ρ(τ) · G(τ) = Σ_τ Pβ[τ | S0 = s] · (Pπ[τ | S0 = s] / Pβ[τ | S0 = s]) · G(τ) = Σ_τ Pπ[τ | S0 = s] · G(τ) = Eπ[G | S0 = s] = vπ(s). Easily integrates into both first-visit and every-visit MC: sample from β and store ρt(τ) · Gt(τ) in Ret(st). 85/189
Weighted importance sampling First-visit variant: for each state s, we keep a set of samples Sam(s). Each sample is a tuple (τ, t) – a trajectory and a time step. • initially, Sam(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ, each state s, and the smallest t such that St(τ) = s, we add (τ, t) to Sam(s) Throughout the algorithm, the value of state s is estimated as WIS(s) = Σ_{(τ,t)∈Sam(s)} ρt(τ) · Gt(τ) / Σ_{(τ,t)∈Sam(s)} ρt(τ). Exercise 40 Compare ordinary and weighted importance sampling after a single sample (a small code sketch appears a bit further below). 86/189
Weighted importance sampling – correctness Weighted sampling is clearly a biased estimator. However, the bias vanishes in the limit: Theorem 41 With probability 1: as |Sam(s)| → ∞, we have that WIS(s) → vπ(s). Proof: 87/189
Ordinary vs. weighted sampling source: Sutton&Barto, p. 106 88/189
Importance sampling: summary Both ordinary and weighted importance sampling can be adapted to every-visit MC. Bias & Convergence: • First visit: • ordinary IS: unbiased, i.e. also converges • weighted IS: biased, but converges in the limit • Every visit: • both ordinary and weighted: biased (due to EV), but converge in the limit 89/189
Weighted IS: incremental implementation Instead of recomputing the weighted average for each new sample, WIS(s) can be updated by keeping just two variables: • V – the current value of WIS(s), initially arbitrary • C – the sum of importance ratios, initially 0 Upon arrival of a new sample (τ′, t′), we update V, C into new values V′, C′ by setting: C′ = C + ρt′(τ′) and V′ = V + (ρt′(τ′) / C′) · (Gt′(τ′) − V). 90/189
Off-policy evaluation with weighted IS 91/189
Off-policy control with weighted IS Required reading: Sutton&Barto, Section 5.7. 92/189
Temporal Difference Methods TD: Motivation Let us first focus on policy evaluation. MC: zero bias (at least in the limit), but potentially high variance: many samples are needed to converge. Also, to update estimates, it must wait till the end of each episode. TD methods retain the focus on sampling but combine it with bootstrapping. 93/189
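Relating back to Exercise 40, a small sketch contrasting the ordinary and the weighted IS estimator on the same list of (ρ, G) samples (plain illustrative Python, not course code):

```python
def ois_estimate(samples):
    """Ordinary IS: average of rho * G (unbiased, potentially high variance)."""
    return sum(rho * g for rho, g in samples) / len(samples)

def wis_estimate(samples):
    """Weighted IS: sum(rho * G) / sum(rho) (biased, typically lower variance)."""
    total = sum(rho for rho, _ in samples)
    return sum(rho * g for rho, g in samples) / total if total > 0 else 0.0

# With a single sample, OIS returns rho * G while WIS returns G itself (cf. Exercise 40).
print(ois_estimate([(3.0, 2.0)]), wis_estimate([(3.0, 2.0)]))   # 6.0 2.0
```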
Incremental update notation Definition 42: Notation for updates In the context of RL algorithms, we will denote by V^n(s) (resp. Q^n(s, a)) the algorithm's estimate of vπ(s) (resp. qπ(s, a)) after the n-th update of this estimate. 94/189
MC vs. TD(0) update On-policy MC (incremental) update using a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: V^{n+1}(st) ← (1 − αn) · V^n(st) + αn · Gt(τ) = V^n(st) + αn · (Gt(τ) − V^n(st)), where Gt(τ) is the update target, Gt(τ) − V^n(st) is the update error, and αn = 1/(n + 1). TD(0) update in the same situation, with αn “suitably chosen” (possibly constant): V^{n+1}(st) ← V^n(st) + αn · (Rt+1(τ) + γ · V^n(St+1(τ)) − V^n(st)), where Rt+1(τ) + γ · V^n(St+1(τ)) is the bootstrapped update target. 95/189
Policy evaluation with TD(0) source: Sutton&Barto, p. 120 96/189
How can it even work? Really “just” a very asynchronous, sample-based, and “α-dampened” version of value iteration. Eπ[Gt | St = s] = Eπ[Rt+1 + γ · Gt+1 | St = s] = Eπ[Rt+1 | St = s] + γ · Eπ[Gt+1 | St = s], and the last term equals γ · Eπ[vπ(St+1) | St = s]. In expectation, the TD(0) update is the same as the VI update in Mπ. Thanks to the contractivity of the Bellman operator, VI possesses an error reduction property: after each update, the error of the estimate decreases. Hence, in expectation, the same is true for the TD(0) update. Formal proof of correctness in optional reading: Sutton, R.S.: Learning to Predict by the Methods of Temporal Differences. In Machine Learning 3:9–44. Kluwer, 1988. (For MDPs with function approximation.) 97/189
Why TD is natural (Sutton&Barto, p. 122-123) Left: MC. Right: TD(0). 98/189
On-policy TD control Recall: • In the control setting, we need to estimate q-values of a policy. • On-policy: we sample trajectories according to some policy π and then push value estimates towards qπ. To maintain exploration, the policy π will typically be the ε-Q-greedy policy for some ε > 0, where Q are the current Q-value estimates. I.e., throughout the algorithm π(a|s) = 1 − ε + ε/|A(s)| if a = arg max_{a′∈A(s)} Q(s, a′), and π(a|s) = ε/|A(s)| otherwise. 99/189
SARSA State-Action-Reward-State-Action. Introduced in Rummery, Niranjan: On-Line Q-Learning Using Connectionist Systems (1994). In each episode, sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , sT according to the current policy π; for each time step 0 ≤ t ≤ T − 1, perform the following update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (rt+1 + γ · Q^n(st+1, at+1) − Q^n(st, at)). The update can be performed immediately once st+1 and at+1 are known (no need to wait for the episode to terminate). After the episode ends, make π ε-Q-greedy. Conforms to the GVI scheme. 100/189
SARSA pseudocode source: Sutton&Barto, p. 130 101/189
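A minimal tabular sketch of the SARSA step and the ε-Q-greedy behavior policy described above (the surrounding environment loop is omitted; all names are illustrative):

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # tabular estimate Q[(s, a)], default 0

def epsilon_q_greedy(s, actions, eps=0.1):
    """Sample an action from the eps-Q-greedy policy in state s."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, terminal=False):
    """Push Q(s, a) towards the SARSA target r + gamma * Q(s', a')."""
    bootstrap = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
```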
GLIE policies Definition 43: GLIE condition A RL algorithm is greedy in the limit (GL) if its behavior policy (= target policy in on-policy algorithms) converges to a 0-greedy policy with an increasing number of episodes. A RL algorithm is GLIE if it is GL and IE (infinitely exploring). Typical ways of ensuring GLIE: • Dynamically adjust ε in ε-greedy policy selection: when selecting an action in state s, behave ε-greedily with ε = c / n(s), where 0 < c < 1 is a constant and n(s) is the number of visits to state s over all the episodes so far. • Use Boltzmann (softmax) exploration: π(a | s) = exp(Q(s, a)/η(s)) / Σ_{b∈A(s)} exp(Q(s, b)/η(s)), where η is a state-dependent and time-varying temperature parameter. We need η to converge to 0 over time, but not too fast (often, η(s) is proportional to 1/log(n(s))). 102/189
Convergence of SARSA Theorem 44 Consider a GLIE instantiation of SARSA. Moreover, assume that the sequence of learning rates (αn)n∈N satisfies Σn αn = ∞ and Σn αn² < ∞. In this setting, Q converges to q∗ and the behavior policy of SARSA converges to some optimal policy π∗. For the proof, see optional reading: Singh, Jaakkola, Littman, Szepesvári: Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. In Machine Learning 39:287-308. Kluwer, 2000. Note: the learning rate can itself be state/action dependent (omitted for conciseness; constant learning rates are preferred in practice). 103/189
Off-policy TD control Surprise: no importance sampling! Recall the SARSA update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (rt+1 + γ · Q^n(st+1, at+1) − Q^n(st, at)). It pushes Q towards qπ, where π is the current policy. Idea: push Q directly towards q∗. We could do this e.g. by a VI-like update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (max_{a∈A(st)} Σ_{s′∈S} p(s′ | st, a) · (r(st, a, s′) + γ · V(s′)) − Q^n(st, at)). Two problems: • We do not calculate v-estimates. (Must somehow replace V with Q.) • We must get rid of the transition probabilities and instead use the sampled at and rt+1. 104/189
Q-learning update Solution: push the max into the bootstrap. Q-learning (Watkins, 1989): given a sampled trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for every t we update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (rt+1 + γ · max_{a∈A(st+1)} Q^n(st+1, a) − Q^n(st, at)). 105/189
Q-learning pseudocode source: Sutton&Barto, p. 131 Off-policy: where is the second policy? 106/189
Q-learning convergence Theorem 45 Consider any Q-learning instantiation with infinite exploration. Assume that the sequence of learning rates (αn)n∈N satisfies Σn αn = ∞ and Σn αn² < ∞. In this setting, Q converges to q∗. Moreover, if the behavior policy is GL, then it converges to an optimal policy π∗. Proof in optional reading: Watkins, Dayan: Q-Learning. In Machine Learning 8:279-292. Kluwer, 1992. 107/189
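For comparison with SARSA, a minimal tabular sketch of the Q-learning step just defined; the experience (s, a, r, s') is assumed to come from any sufficiently exploratory behavior policy (e.g. ε-greedy):

```python
def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.99, terminal=False):
    """Push Q(s, a) towards the Q-learning target r + gamma * max_b Q(s', b).

    `Q` is a dict-like table, e.g. collections.defaultdict(float), keyed by (state, action);
    `next_actions` is the (non-empty) action set A(s').
    """
    bootstrap = 0.0 if terminal else max(Q[(s_next, b)] for b in next_actions)
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
```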
SARSA vs. Q-learning (SB: p. 132) Left: greedy policies learned by SARSA and Q-learning. Right: in-training performance with a 0.1-greedy behavior policy. (Rough) takeaway: Q-learning is more aggressive in finding the optimal policy, which can lead to risky behavior. Possibly advantageous when the environment is not too stochastic or when in-training performance matters less (simulator vs. real world). 108/189
Maximization bias in Q-learning Q-learning is “risky” not only due to exploration, but also because it is optimistic in the face of uncertainty. TBD The positive bias only disappears in the limit. 109/189
Double Q-learning Idea: use two independent value estimates Q1, Q2 and decouple action selection from evaluation in the bootstrap. During each update, we randomly select one of the two estimates for the update; this one is also used to select the maximizing action in the bootstrap. The other one is used as the bootstrap estimate. I.e., in each time step t we perform one of these updates, each with probability 1/2: either Q1(st, at) = Q1(st, at) + αn · (rt+1 + γ · Q2(st+1, arg max_{a∈A(st+1)} Q1(st+1, a)) − Q1(st, at)) or Q2(st, at) = Q2(st, at) + αn · (rt+1 + γ · Q1(st+1, arg max_{a∈A(st+1)} Q2(st+1, a)) − Q2(st, at)). Behavior policy = e.g. ε-greedy w.r.t. Q1 + Q2. 110/189
Double Q-learning pseudocode source: Sutton&Barto, p. 136 111/189
Why Double Q-learning helps TBD 112/189
Double Q-learning: experiment source: Sutton&Barto, p. 135 113/189
Between Monte Carlo and TD: n-Step and λ-Returns MC vs TD(0) update targets Given a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., the update target for time step t in: • MC = the discounted return from t till the end of the trajectory, e.g. for t = 1: r2 + γ·r3 + γ²·r4 + · · · + γ^{T−2}·rT – unbiased, but high variance + needs the whole trajectory • TD(0) = the 1-step reward and then the (discounted) bootstrap: r2 + γ·V(s2) 114/189
n-step return Idea: use the n-step discounted return and then bootstrap. Definition 46 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and n ∈ N \ {0}. An n-step return of τ from time step t is the quantity Gt:t+n(τ) = rt+1 + γ·rt+2 + γ²·rt+3 + · · · + γ^{n−1}·rt+n + γ^n·V(st+n). We also define a Q-estimate-based version: Gt:t+n(τ) = rt+1 + γ·rt+2 + γ²·rt+3 + · · · + γ^{n−1}·rt+n + γ^n·Q(st+n, at+n). (Which of the two is used will be clear from the context.) 115/189
n-step TD policy evaluation Similar to TD(0), but using n-step return targets: given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for each 0 ≤ t < T we perform an update V(st) ← V(st) + α·[Gt:t+n(τ) − V(st)]. Note 1: for n = 1 we get exactly TD(0). Note 2: for n > 1, we cannot update V(st) directly at step t + 1. We need to obtain rt+1, . . . , rt+n, st+n first, i.e. we can perform the update only after step t + n. Note 3: if t + n > T, we truncate the sum in Gt:t+n at rT, i.e. in such a case Gt:t+n = Gt. 116/189
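A small sketch of computing the V-bootstrapped target Gt:t+n from one recorded episode, including the truncation from Note 3 (the index conventions are an assumption of the sketch):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_{t:t+n}: n-step discounted reward sum plus the bootstrap gamma^n * V(s_{t+n}).

    `rewards[k]` holds r_{k+1} (reward of step k -> k+1), `values[k]` the current
    estimate V(s_k); T = len(rewards) is the episode length.
    """
    T = len(rewards)
    horizon = min(t + n, T)                               # truncate at the episode end (Note 3)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                                       # bootstrap only if the episode did not end
        g += gamma ** n * values[horizon]
    return g
```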
n-step TD policy evaluation: pseudocode source: Sutton&Barto, p.144 117/189
n-step TD policy evaluation: performance 19-state symmetric random walk: source: Sutton&Barto, p.145 118/189
n-step SARSA (on-policy control) Uses Q-value-bootstrapped n-step returns. For a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and all time steps 0 ≤ t < T, we perform an update Q(st, at) ← Q(st, at) + α·[Gt:t+n − Q(st, at)]. We sample trajectories according to a policy π that is ε-greedy w.r.t. the current Q-estimates: π(a|s) = 1 − ε + ε/|A(s)| if a = arg max_{a′∈A(s)} Q(s, a′) (ties broken in a principled way), and π(a|s) = ε/|A(s)| otherwise. π is redefined in this way after each episode. 119/189
n-step SARSA (pseudocode) source: Sutton&Barto, p. 147 120/189
n-step SARSA (speed of signal propagation) source: Sutton&Barto, p. 147 121/189
n-step Q-learning The Q-learning update for a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and time step t: Q(st, at) ← Q(st, at) + α·[rt+1 + γ·max_{a∈A(st+1)} Q(st+1, a) − Q(st, at)]. The naive extension to n-step returns, Q(st, at) ← Q(st, at) + α·[rt+1 + γ·rt+2 + · · · + γ^{n−1}·rt+n + γ^n·max_{a∈A(st+n)} Q(st+n, a) − Q(st, at)], does not really correspond to Q-learning, since some of the actions at+1, . . . , at+n−1 might not be Q-greedy (the behavior policy is ε-greedy, so some actions might be exploratory). Hence, we are no longer pushing Q towards the Q-value of an optimal policy. 122/189
n-step Q-learning: correct Idea: apply the Q-learning bootstrap at the first occurrence of a non-Q-greedy action. I.e., for each episode: • make π an ε-Q-greedy policy • sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π • for each time step 0 ≤ t < T: • identify the smallest n′ ∈ {t + 1, t + 2, . . . , t + n} such that an′ is not a Q-greedy action • perform the update Q(st, at) ← Q(st, at) + α·[rt+1 + γ·rt+2 + · · · + γ^{n′−t−1}·rn′ + γ^{n′−t}·max_{a∈A(sn′)} Q(sn′, a) − Q(st, at)] (if n′ > T, do the standard MC update). 123/189
λ-returns: idea By varying n, the n-step returns provide a nice tradeoff between bias and variance (and update speed). But the choice of the optimal n is mostly guesswork. Idea: find a notion of return which combines n-step returns for multiple n's. E.g., a suitable convex combination of the individual n-step returns. This leads to the notion of λ-returns.
We will focus only on policy evaluation, though λ-returns can be used also in control. 124/189
λ-returns: definition Recall: Gt:t+n is the n-step return from timestep t. Definition 47: λ-return Let λ ∈ [0, 1]. A λ-return from timestep t is the random variable G^λ_t = (1 − λ) · Σ_{n=1}^{∞} λ^{n−1} · Gt:t+n. Note that due to the truncation at t + n ≥ T, the λ-return can be written more explicitly as G^λ_t = (1 − λ) · Σ_{n=1}^{T−t−1} λ^{n−1} · Gt:t+n + λ^{T−t−1} · Gt. 125/189
λ-return as discounting of n-step returns source: Sutton&Barto, p. 290 126/189
(Forward-view) TD(λ) Like TD(0), but uses λ-returns. Given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . sampled from the evaluated policy π, we perform, for each time step t, an update: V(st) ← V(st) + α·(G^λ_t(τ) − V(st)). Note: • for λ = 0, this is exactly the TD(0) update, • for λ = 1, this is exactly the MC update, • G^λ_t(τ) depends on the whole suffix τt.., hence the update can only be performed at the end of the episode. (We will show a workaround later.) 127/189
TD(λ) vs. n-step TD on 19-state random walk source: Sutton&Barto, p. 291 128/189
Forward vs. backward view TD(λ) Backward-view TD(λ) = an algorithm performing roughly the same updates as forward-view TD(λ) in an online fashion (V(st) can be updated by time t + 1). Implemented using eligibility traces: state-wise signals that indicate how much the current state is eligible for an update (a sort of state-wise modulation of the learning rate). We are more keen to update states that: • appear often along the trajectory (frequency heuristic) • were visited in the recent past (recency heuristic) 129/189
Accumulating eligibility trace Definition 48: (Accumulating) eligibility trace For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., λ ∈ [0, 1], and a state s ∈ S, an (accumulating) eligibility trace is a sequence of values E0(s), E1(s), E2(s), . . . defined inductively as follows: E0(s) = 0 and, for t > 0, Et(s) = γ·λ·Et−1(s) + I(St(τ) = s), where I(St(τ) = s) is the indicator of the t-th state of τ being s, i.e. I(St(τ) = s) = 1 if st = s and I(St(τ) = s) = 0 otherwise. source: slides of D. Silver (Model-free prediction) 130/189
Backward-view TD(λ): idea • Et(s) denotes how much s is eligible for an update after playing the t-th action along the run (i.e., action at−1). • In time step t, all states with a non-zero eligibility signal have their estimates updated in proportion to the learning rate and the strength of the eligibility signal. • The update target is the standard TD(0) target for time t. I.e., for each timestep t and each state q, we perform the update V(q) ← V(q) + α·Et(q)·(rt+1 + γ·V(st+1) − V(st)).
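A minimal sketch of one backward-view TD(λ) step with accumulating traces, following the idea above; V and E are assumed to be collections.defaultdict(float) dictionaries over the states visited so far (compare with the pseudocode on the next slide):

```python
def backward_td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9, terminal=False):
    """Decay all traces, bump the trace of the current state, then apply the
    eligibility-weighted TD(0) update V(q) += alpha * E(q) * delta to every tracked state."""
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]   # TD(0) error of this step
    for q in list(E.keys()):
        E[q] *= gamma * lam                                       # E_t(q) = gamma * lambda * E_{t-1}(q) ...
    E[s] += 1.0                                                   # ... + indicator of the current state
    for q in E:
        V[q] += alpha * E[q] * delta
    return V, E
```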
131/189 Backward-view TD(λ): pseudocode Input: policy π to evaluate Output: Estimate V of vπ initialize V arbitrarily repeat s ← sample uniformly (ES) or according to init. distr. initialize E to be uniformly zero while s not terminal do a ← sample from π(s) s′ ← sample from p(s, a) r ← r(s, a, s′ ) foreach q ∈ S (Only q’s visited so far) do E(q) = γ · λ · E(q) + I(q = s) V (q) ← V (q) + α · E(q) · r + γ · V (s′ ) − V (s) s ← s′ until timeout 132/189 Forward vs. backward view If λ = 0, then Et(q) = I(St(τ) = q), i.e. the backward-view update at time point t is V (st) ← V (st) + α · r + γ · V (st+1) − V (st) , while for all states other than st, no update is performed. I.e., backward TD(0) is exactly the same thing as forward TD(0). 133/189 Forward vs. backward view If λ = 0, then Et(q) = I(St(τ) = q), i.e. the backward-view update at time point t is V (st) ← V (st) + α · r + γ · V (st+1) − V (st) , while for all states other than st, no update is performed. I.e., backward TD(0) is exactly the same thing as forward TD(0). For general λ the correspondence is more subtle: Theorem 49: Forward-backward view correspondence Assume that in the backward view, all the updates along the trajectory are performed offline, i.e. only after the end of the episode, and in a batch, i.e. concurrently, using the pre-episode estimates in right-hand sides. Then, for any λ ∈ (0, 1), this offline backward TD(λ) performs the same updates as forward TD(λ). 133/189 Offline backward and forward view (source: D. Silver slides) Given the batch nature of updates, it suffices to show that the forward update target at time t equals the sum of all updates “triggered” by a visit to state st. 134/189 Online vs. offline backward view on 19-state RW In practice, we want to use the online backward algorithm, which only approximates the forward view. Nevertheless, it performs acceptably: source: Sutton&Barto, p. 295, LEFT: online backward TD(λ), RIGHT: offline forward TD(λ) 135/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. 136/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 136/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 • λ-returns and eligibility traces can be generalized to control setting SARSA(λ), Q(λ); optional reading: Sutton&Barto, Sec. 12.7-12.10. 136/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 • λ-returns and eligibility traces can be generalized to control setting SARSA(λ), Q(λ); optional reading: Sutton&Barto, Sec. 12.7-12.10. • There is a true online backward TD(λ) version. Here, true online=having perfect equivalence with the forward view. However, the equivalence is w.r.t. 
a more complex notion of λ-return (truncated λ-return) and uses a more complex version of eligibility traces than presented here. Outperforms both the forward and backward algorithms presented here. Optional reading: van Seijen, Sutton: True Online TD(λ). In Proceedings of ICML'14. 136/189
First Steps Towards Deep RL: Value-Based On-Policy Methods Working with huge MDPs E.g. the original Atari games have a 160x192 resolution with 128 colors: an observable state space of size 128^(160·192) = 2^(7·160·192) = 2^215040 (though only a fraction is reachable, and the resolution is typically scaled down in benchmarks – however, the state typically encompasses the last 3 frames so as to provide some information on movement). The state space can even be continuous (position, velocity, . . . ). Most states will not be seen - we need the ability to generalize from experience to unseen/rarely seen states. 137/189
Huge MDP representation From now on, states of the MDP will be represented by vectors from R^n. The vectorized representation is chosen in a domain-specific manner, e.g.: • Atari = one component per pixel per frame • continuous navigation = agent coordinates, velocity, etc. • small discrete MDPs can be represented by one-hot encoding For simplicity, we will still assume that the action space is discrete and reasonably small, though many algorithms can be adapted to continuous actions (acceleration, etc.). 138/189
Function approximators The value functions have types: vπ, v∗ : R^n → R and qπ, q∗ : R^n × A → R. In RL, we need to approximate these functions. Definition 50 A function approximator (FA) for functions of type X → Y is a class of functions f ⊆ Y^X parameterized by some set of parameter vectors Θ ⊆ R^n. Each concrete parameter vector θ ∈ Θ defines a concrete function fθ ∈ f, i.e. f = {fθ | θ ∈ Θ}. For an FA f, we often write fθ(x) = f(x, θ) to stress the fact that the output of fθ depends both on the input x and on θ. Hence, an FA for type X → Y can itself be seen as a function of type X × Θ → Y. 139/189
Function approximators in RL Our algorithms will use mainly these types of function approximators: • V : R^n × Θ → R to approximate vπ or v∗ • Q : (R^n × A) × Θ → R to approximate qπ or q∗ The typical task is to find θ ∈ Θ such that Vθ = V(·, θ) is a “good” approximation of vπ or v∗, and similarly for Qθ. The parametrization Θ will depend on the concrete form of the function approximator used. 140/189
Forms of function approximators • tabular • θ = the vector containing the contents of the table • linear • Θ = S = R^n and e.g. Vθ(s) = θ⊤·s • neural nets • θ = NN weights and biases • decision trees • . . . We require the approximators to be differentiable and to admit a training method suitable for non-stationary data. 141/189
Neural nets (source: slides by T. Brázdil) 142/189
Policy evaluation with FAs Task: given a policy π and an FA V : R^n × Θ → R, find θ s.t. Vθ is “close” to vπ. “Closeness” can be expressed using various loss functions. Typically, we want to minimize the mean squared error (MSE): MSE(vπ, Vθ) = (1/2)·E_{s∼µ}[(vπ(s) − Vθ(s))²] = (1/2)·Σ_{s∈S} µ(s)·(vπ(s) − Vθ(s))², where µ is some distribution over states expressing how much we care about errors in particular states. A local minimum of the MSE can be found by gradient descent: making successive steps in the direction opposite to the gradient of the MSE. 143/189
Recall: gradients Definition 51 Given a scalar function f(x1, . . . , xn, θ1, . . . , θm): R^n × Θ → R (where Θ ⊆ R^m), the gradient of f w.r.t. the parameters θ = (θ1, . . . , θm) is the vector function ∇θf = (∂f/∂θ1, . . . , ∂f/∂θm) of type R^n × Θ → R^m. When f is a function approximator defined by a neural net, the value of the gradient ∇θf(x, θ) at a given point (x, θ) = (x1, . . . , xn, θ1, . . . , θm) can be computed by backpropagation (under some usual conditions like smoothness, etc.). 144/189
Gradient descent for policy evaluation To (locally) minimize MSE(vπ, Vθ), it suffices to perform (sufficiently small) steps in the negative direction of the current gradient, i.e., to repeatedly perform the updates: θ ← θ − α·∇θ MSE(vπ, Vθ) = θ − α·∇θ((1/2)·E_{s∼µ}[(vπ(s) − Vθ(s))²]) = θ − (α/2)·E_{s∼µ}[∇θ(vπ(s) − Vθ(s))²] = θ + α·E_{s∼µ}[(vπ(s) − Vθ(s))·∇θVθ(s)]. The expected value above is typically impossible to evaluate in practice. Instead, we estimate it by samples ⇒ stochastic gradient descent. We typically take µ(s) to represent the overall fraction of time spent in s when behaving according to π. Hence, E_{s∼µ} can be estimated by sampling a trajectory from π and performing the update for each s on the trajectory in an every-visit fashion. 145/189
Stochastic gradient policy evaluation + MC instantiation We keep sampling trajectories τ from π. For each timestep t we perform the update of parameters θ ← θ + α·(vπ(st) − Vθ(st))·∇θVθ(st). Problem: in the policy evaluation setting, we do not know vπ(st). Hence, we estimate it using RL targets. The simplest is the Monte Carlo target: estimate vπ(st) by the discounted return of the sampled trajectory from st, i.e. perform updates of the form θ ← θ + α·(Gt(τ) − Vθ(st))·∇θVθ(st).
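A minimal sketch of the Monte Carlo target update with a linear approximator Vθ(s) = θ⊤s, for which ∇θVθ(s) = s, so the whole gradient step is explicit (this complements the pseudocode below; the episode is assumed to be given as feature vectors and returns):

```python
import numpy as np

def gradient_mc_episode_update(theta, states, returns, alpha=0.01):
    """theta <- theta + alpha * (G_t - V_theta(s_t)) * grad V_theta(s_t), for every t.

    `states[t]` is the feature vector of s_t, `returns[t]` the observed discounted
    return G_t of one sampled episode; V_theta(s) = theta . s, so the gradient is s.
    """
    for s, g in zip(states, returns):
        theta = theta + alpha * (g - theta @ s) * s
    return theta
```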
146/189 Gradient Monte Carlo policy evaluation: pseudocode Algorithm 3: Gradient MC evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π; foreach t ∈ {0, . . . , T − 1} do θ ← θ + α · [Gt(τ) − V (st, θ)] · ∇θV (st, θ) until timeout; 147/189 Semi-gradient TD(0) In the gradient update formula θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . we can also estimate vπ (st) with the TD(0) target: θ ← θ + α · rt+1 + γ · Vθ(st+1) − Vθ(st) · ∇θVθ(st) . This yield the semi-gradient TD(0) policy evaluation algorithm. 148/189 Semi-gradient TD(0) In the gradient update formula θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . we can also estimate vπ (st) with the TD(0) target: θ ← θ + α · rt+1 + γ · Vθ(st+1) − Vθ(st) · ∇θVθ(st) . This yield the semi-gradient TD(0) policy evaluation algorithm. Why semi-gradient? 148/189 Gradient vs. semi-gradient TD(0) Recall that our ultimate goal is to minimize MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 . The gradient of this loss is ∇θ 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 Es∼µ ∇θ(vπ (s) − Vθ(s))2 , 149/189 Gradient vs. semi-gradient TD(0) Recall that our ultimate goal is to minimize MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 . The gradient of this loss is ∇θ 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 Es∼µ ∇θ(vπ (s) − Vθ(s))2 , Estimation with sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and substituting vπ (s) with the TD(0) target would yield ∇θMSE(vπ , Vθ) ≈ 1 2 ∇θ rt+1+γVθ(st+1)−Vθ(st) 2 = (rt+1+γVθ(st+1)−Vθ(st))·(γ∇θVθ(st+1)−∇θVθ(st)) different update then semi-gradient TD(0)! However, this full gradient: • is more expensive to compute (2 backpropagations per update); • does not really express TD(0) idea (the update target is not fixed). 149/189 Semi-gradient TD(0): pseudocode Algorithm 4: Semi-gradient TD(0) evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat s ← initial state; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · V (s′ , θ) − V (s, θ)] · ∇θV (s, θ); s ← s′ ; a ← a′ until timeout; 150/189 On-policy control with function approximation Semi-gradient SARSA uses the same idea as TD(0), but with Q-approximator, i.e. Q : Rn × A × Θ → R. Behavior policy = e.g. ε-greedy with respect to the current Q. For a sampled trajectory τ = s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 we perform, in each timestep t, an update θ ← θ + α · rt+1 + γ · Qθ(st+1, at+1) − Qθ(st, at) · ∇θQθ(st, at) . 151/189 Semi-gradient SARSA: pseudocode Algorithm 5: Semi-gradient SARSA Input: FA Q : S × A × Θ → R, step size α Output: Approximation Qθ of q∗ initialize θ arbitrarily; repeat s ← initial state; π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · Qθ(s′ , a′ ) − Qθ(s, a)] · ∇θQθ(s, a); s ← s′ ; a ← a′ until timeout; 152/189 Representing actions in DNNs How to represent actions in the (say, DNN) function approximator Q is largely a domain-dependent engineering choice. If the set of actions A = {a1 , . . . , ak } is discrete and reasonably small, we can consider a net which inputs a state (i.e., n input neurons when S = Rn ) and outputs an |A|-dimensional vector (i.e., one output neuron per action), so that the output of the i-th neuron on input s is interpreted as Q(s, ai ). 
I.e., in such a case we consider Q to be function of type Q : S × Θ → R|A| . 153/189 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) 154/189 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) • forward-view λ-returns • SARSA(λ) update: θ ← θ + α · (1 − λ) ∞ n=1 λn−1 Gt:t+n,θ − Qθ(st , at ) · ∇θQθ(st , at ) 154/189 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) • forward-view λ-returns • SARSA(λ) update: θ ← θ + α · (1 − λ) ∞ n=1 λn−1 Gt:t+n,θ − Qθ(st , at ) · ∇θQθ(st , at ) • backward-view λ-returns (optional reading: Sutton&Barto, sections 12.2 and 12.7) 154/189 Value-Based Off-Policy Control with Approximators: DQNs and Friends Off-policy methods with function approximation ...are tricky to get right, already in the case of policy evaluation. The training can become very unstable. For on-policy (semi)-gradient methods, one can typically prove convergence to correct/optimal values at least in the case of linear function approximation (though not in the more general case of NN approximators). 155/189 Off-policy methods with function approximation ...are tricky to get right, already in the case of policy evaluation. The training can become very unstable. For on-policy (semi)-gradient methods, one can typically prove convergence to correct/optimal values at least in the case of linear function approximation (though not in the more general case of NN approximators). Off-policy semi-gradient methods, such as: • TD with importance sampling (not covered here), or • Q-learning with function approximators (will be covered a bit later), can diverge already with linear function approximators. 155/189 Divergence examples (high-level) • Baird’s counterexample: semi-gradient TD with importance sampling can diverge in presence of linear FAs • Moreover, the divergence is not due to the instability of (semi)-gradient descent. Tsitsiklis and Van Roy’s counterexample shows divergence even in the case where each update completely replaces the current θ with the optimal θ∗ which minimizes the MSE between Vθ and the TD(0) update target. The problem lies in the off-policy distribution of updates. • Counterexamples explained in optional reading: Sutton&Barto, Sec. 11.2. 156/189 Deadly triad Identified by Sutton&Barto: risk of training instability and divergence steeply rises when combining: • function approximation, • bootstrapping, and • off-policy training. 157/189 Deadly triad Identified by Sutton&Barto: risk of training instability and divergence steeply rises when combining: • function approximation, • bootstrapping, and • off-policy training. But often we want to do just that. :) Practical solution: Happily do the deadly triad, but use insights from supervised learning to develop additional techniques that help stabilize the training. 157/189 Deep Q-Networks (DQN) 2013 arXiv tech. 
report, there is also follow-up 2015 Nature paper 158/189 Q-learning with function approximators Same semi-gradient idea as in TD(0), SARSA: adjust θ to bring Qθ(s, a) closer to the fixed Q-learning update target. I.e., for a sampled trajectory s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 and its timestep t, the update is θ ← θ + α · rt+1 + γ · max a∈A(st+1) Qθ(st+1, a) − Qθ(st, at) · ∇θQθ(st, at) . 159/189 Q-learning with function approximators Same semi-gradient idea as in TD(0), SARSA: adjust θ to bring Qθ(s, a) closer to the fixed Q-learning update target. I.e., for a sampled trajectory s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 and its timestep t, the update is θ ← θ + α · rt+1 + γ · max a∈A(st+1) Qθ(st+1, a) − Qθ(st, at) · ∇θQθ(st, at) . But performing the updates based solely on the current step would be susceptible to instability due to the presence of the deadly triad. 159/189 Deep Q-learning challenges From Mnih et al. Playing Atari with Deep Reinforcement Learning: 160/189 Experience replay Originated in the work of Long-Ji Lin, e.g.: Reinforcement Learning for Robots Using Neural Networks, dissertation, 1993. Definition 52: Experience An experience is a 4-tuple (s, a, r, s′ ) ∈ S × A × S × R interpreted as ("state", "action played in it", "reward obtained", "next state observed"). 161/189 Experience replay Originated in the work of Long-Ji Lin, e.g.: Reinforcement Learning for Robots Using Neural Networks, dissertation, 1993. Definition 52: Experience An experience is a 4-tuple (s, a, r, s′ ) ∈ S × A × S × R interpreted as ("state", "action played in it", "reward obtained", "next state observed"). • DQN does not perform update based only on the current step. Instead, for each sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and each timestep t it: • first stores the one-step experience (st , at , rt+1, st+1) in a data structure B called replay buffer; • then, sample a random minibatch of experiences B ⊆ B of a given minibatch size Bsize • perform a minibatch-gradient-descent update w.r.t. B: compute the gradient of the Q-learning loss for each e ∈ B and then update θ in the direction of an average gradient over the whole B. 161/189 Minibatch update Fix a minibatch B. For each e = (s, a, r, s′ ) ∈ B we compute the gradient of the Q-learning loss ∇θL(θ, e) at point e: ∇θL(θ, e) = ∇θ 1 2 r + γ · max b∈A(s′) Qθ(s′ , b) fixed target − Qθ(s, a) 2 = r + γ · max b∈A(s′) Qθ(s′ , b) =0 if s terminal − Qθ(s, a) · ∇θQθ(s, a) We then perform an update in the direction of average gradient: θ ← θ + α · 1 |B| e∈B ∇θL(θ, e). 162/189 Experience replay rationale • helps decorrelate the DNN training data • helps to prevent catastrophic forgetting • improves data efficiency via experience re-use Why good match for deep Q-learning? Experience replay is by design off-policy since we train on old data, which were sampled from different policy than the current one. 163/189 Replay buffer implementation The replay buffer B is typically not unbounded, but has a fixed capacity Bsize. Replacement is eventually needed. If B is full, the oldest experience if removed (B = queue). How to sample the minibatches? • Original DQN: uniformly from B. • Alternative: prioritized experience replay: each experience is assigned a priority (several heuristics exist). An experience is sampled into a minibatch with probability proportional to its priority. 
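A minimal sketch of the fixed-capacity buffer with uniform minibatch sampling described above (class and method names are illustrative, not taken from the DQN paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of experiences (s, a, r, s', terminal) with uniform sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # the oldest experience is dropped automatically

    def store(self, s, a, r, s_next, terminal):
        self.buffer.append((s, a, r, s_next, terminal))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```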
164/189 DQN: 2013 pseudocode Algorithm 6: DQN with replay buffer Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, . . . Output: Approximation Qθ of q∗ initialize θ arbitrarily; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e) (see this slide); s ← s′ ; until timeout; 165/189 DQN: 2013 results Mnih et al. (2013, arXiv) 166/189 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). 167/189 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). Performing the (minibatch) Q-update looks like supervised learning: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b), where (s, a, r, s′ ) is the processed experience. 167/189 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). Performing the (minibatch) Q-update looks like supervised learning, but: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b) the “label” changes with each update! , where (s, a, r, s′ ) is the processed experience. 167/189 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network 168/189 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network Usage: • The target network is used only to compute TD targets when computing losses: ∇θL(θ, e) = r + γ · max b∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ) 168/189 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network Usage: • The target network is used only to compute TD targets when computing losses: ∇θL(θ, e) = r + γ · max b∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ) • At the start, and also in periodic intervals (but not after each update!) the two networks are synchronized by performing ˆθ ← θ. Other than this, ˆθ stays fixed, the gradient steps are only used to update θ (i.e., the main network). 168/189 DQN: 2015 pseudocode Algorithm 7: DQN with replay buffer and target network Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, C, . . . 
Output: Approximation Qθ of q∗ initialize θ arbitrarily; ˆθ ← θ; counter ← C; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do if counter = 0 then ˆθ ← θ; counter ← C else counter ← counter − 1; π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e), where ∇θL(θ, e) = r + γ · maxb∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ); s ← s′ ; until timeout; 169/189 DQN: 2015 results 170/189 Engineering behind DQN for Atari: partial observability True state = current state (program counter + variable values) of the game program. We do not see this – only the frames rendered on screen. Solving partially observable environments requires (per some POMDP theory) making decisions based on the whole history of observations. This is computationally demanding (recurrent NNs...). 171/189 Engineering behind DQN for Atari: partial observability True state = current state (program counter + variable values) of the game program. We do not see this – only the frames rendered on screen. Solving partially observable environments requires (per some POMDP theory) making decisions based on the whole history of observations. This is computationally demanding (recurrent NNs...). DQN for Atari solves this by feeding the last 4 observed frames into the NN. This is typically enough to deduce the dynamics of the current play. 171/189 DQN: dynamics from limited frame history 172/189 Engineering behind DQN for Atari: the network Inputs four 84x84px images, then 3 convolutional layers, then two fully connected layers. all with ReLU activations. Outputs Q-estimate for each action. source: Mnih et al. (Nature, 2015), details in appendix “Model architecture” 173/189 DQN for Atari: dirty engineering tricks • Preprocessing: 210x160 RGB color images are converted to grayscale and resized to 84x84 resolution. • Frame skipping: the agent only observes and acts in every K-th frame, for the frames in between, the last selected action is repeated without providing the frame to the agent. (In the paper, K = 4.) • Reward clipping: all positive one-step Atari rewards are clipped to +1, all negative ones are clipped to −1 (Atari gives integer rewards). • TD error clipping: for each update, the Q-learning error r + γ · max b ∈ A(s′ )Q(s′ , b) − Q(s, a) is clipped to [−1, 1]. 174/189 DQN for Atari: selected hyperparameters (nex) minibatch size Bsize 32 replay buffer size Bsize 1,000,000 target network update freqeuency C 10,000 discount factor γ 0.99 update frequency (steps between two minibatch updates) 4 learning rate 0.00025 initial ε 1 final ε (linear decay) 0.1 final decay frame 1,000,000 random policy played for init. 50,000 frames max. do-nothing actions at episode start 30 175/189 RAINBOW of heuristics Multitude of heuristics for the improvement of DQN were developed over time. Some of them make sense also in the context of other deep RL algorithms. The RAINBOW agent combines six such heuristics to further improve the DQN performance on Atari games. (In proceedings of AAAI 2018.) 176/189 Rainbow heuristics • dueling networks architecture • double DQN • prioritized experience replay • n-step rewards • distributional learning • noisy networks 177/189 Action advantage Idea: imagine that for some state s, the Q-values of all actions are high. Then s should be in some sense valuable in itself. 
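Tying the DQN slides above together, a heavily simplified sketch of one minibatch update with a separate target network. A toy linear approximator Q(s, a) = θ[a]·s stands in for the deep network (so the semi-gradient w.r.t. θ[a] is just s); all names are illustrative.

```python
import numpy as np

def dqn_minibatch_update(theta, theta_hat, batch, alpha=0.01, gamma=0.99):
    """One minibatch step: push Q_theta towards targets computed with the frozen theta_hat.

    theta, theta_hat: arrays of shape (num_actions, n) for Q(s, a) = theta[a] . s;
    batch: list of experiences (s, a, r, s_next, terminal) sampled from the replay buffer.
    """
    grad = np.zeros_like(theta)
    for s, a, r, s_next, terminal in batch:
        target = r if terminal else r + gamma * np.max(theta_hat @ s_next)   # fixed TD target
        grad[a] += (target - theta[a] @ s) * s                               # semi-gradient for this sample
    return theta + alpha * grad / len(batch)
```

Elsewhere in the training loop, theta_hat would be overwritten with a copy of theta every C steps, as in the pseudocode above; in between it stays frozen.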
178/189 Action advantage Idea: imagine that for some state s, the Q-values of all actions are high. Then s should be in some sense valuable in itself. Definition 53: Advantage Let π be a policy. An advantage function advπ : S × A → R is defined as advπ (s, a) = qπ (s, a) − vπ (s). Dueling architecture splits certain layers of the neural network into two “streams”, one estimating (somethign like) vπ (s) and one estimating (something like) advπ (s, a). The final layer combines these estimates to produce an estimate of qπ (s, a). 178/189 Dueling architecture Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 179/189 Dueling architecture: theory and training We have Qθ,α,β(s, a) = aggregate(Vθ,α(s), Aθ,β(s, a)), where • θ - convolutional (or other feature extraction) layer parameters • α - value channel parameters • β - advantage channel parameters 180/189 Dueling architecture: theory and training We have Qθ,α,β(s, a) = aggregate(Vθ,α(s), Aθ,β(s, a)), where • θ - convolutional (or other feature extraction) layer parameters • α - value channel parameters • β - advantage channel parameters The whole network Q is trained to estimate qπ (where π is the target policy) using any deep RL algorithm (e.g. DQN, in which case π is the optimal policy). There is nothing new from RL perspective here, all the novelty is inside the network. The factorization into state value and advantage is supposed to help the network “focus” on features that are important to recognize valuable states and features that help us rank actions. 180/189 Dueling architecture: Atari example From Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 181/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? 182/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . 182/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − max b∈A(s) Aθ,β(s, b), the training then indeed pushes Vθ,α to vπ and Aθ,β to advπ + c where c is some constant. 182/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − max b∈A(s) Aθ,β(s, b), the training then indeed pushes Vθ,α to vπ and Aθ,β to advπ + c where c is some constant. Issues: not differentiable, update sensitive to the value of maximizing action changes. 
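To make the dueling head concrete, a small sketch of the aggregation step with toy linear value and advantage heads (W_v, W_a are illustrative stand-ins for the two streams); it already uses the mean-advantage baseline discussed on the next slide.

```python
import numpy as np

def dueling_q_values(features, W_v, W_a):
    """Combine a scalar value head and an |A|-dimensional advantage head into Q-values.

    V(s) = W_v . features (scalar), A(s, .) = W_a @ features (one entry per action);
    subtracting the mean advantage anchors the two streams to a common baseline.
    """
    v = W_v @ features
    adv = W_a @ features
    return v + adv - adv.mean()        # Q(s, a) = V(s) + A(s, a) - mean_b A(s, b)
```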
182/189 Aggregation in Rainbow The point: aggregating layer should anchor the sum of the channels to same baseline value derived non-trivially from the advantages (if all advantages shift up/down, so should the baseline). Rainbow uses mean advantage baseline: Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − 1 |A(s)| b∈A(s) Aθ,β(s, b) pushes the value channel to predict 1 |A(s)| a∈A(s) qπ (s, a). 183/189 Double DQN van Hasselt, Guez, Silver: Deep Reinforcement Learning with Double Q-learning. In proceedings of AAAI 2016. Similar idea to tabular Double Q-learning (use different estimates for selecting maximizing action in bootstrap and for evaluating the bootstrap), but instead of independently updated networks uses main and target networks. I.e., for experience (s, a, r, s′ ), the update is: θ ← θ + α · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) ∇θQθ(s, a), where ˆθ is the parameter vector of the target network. 184/189 Prioritized experience replay (Schaul et al., ICLR’16) Each experience e = (s, a, r, s′ ) in the replay buffer is assigned a priority according to its TD-error pe = |r + γ · max b∈A(s′) Qˆθ(s′ , b) − Qθ(s, a)| + ε (ε > 0 ensures all priorities are positive). The probability of sampling an experience e from the buffer is set to pα e e′∈B pα e′ , where α > 0 is a hyperparameter controlling the degree of prioritization. 185/189 Prioritized experience replay (Schaul et al., ICLR’16) Each experience e = (s, a, r, s′ ) in the replay buffer is assigned a priority according to its TD-error pe = |r + γ · max b∈A(s′) Qˆθ(s′ , b) − Qθ(s, a)| + ε (ε > 0 ensures all priorities are positive). The probability of sampling an experience e from the buffer is set to pα e e′∈B pα e′ , where α > 0 is a hyperparameter controlling the degree of prioritization. Prioritization induces bias: the sampled experiences no longer follow the same distribution as sampled trajectories. We can correct this by using importance sampling during updates: θ ← θ + α · 1 |B| · 1 pα e β · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) · ∇θQθ(s, a), where β > 0 determines the degree of IS correction (annealed to 1 during training). 185/189 n-step returns Self-explanatory, use n-step return with Q-learning bootstrap when computing TD target. How to combine with replay buffer? Each experience stores a single step. 186/189 n-step returns Self-explanatory, use n-step return with Q-learning bootstrap when computing TD target. How to combine with replay buffer? Each experience stores a single step. Solutions: • Store experiences in B sequentially, with each sampled experience, retrieve also the next n − 1 ones (up to episode termination). Requires careful implementation. • Naive: each element of B consists of n consecutive experiences (space inefficient). Given consecutive experiences (st, at, rt+1, st+1), (st+1, at+1, rt+2, st+2), . . . , ((st+n−1, at+n−1, rt+n, st+n)), perform update θ ← θ + α · rt+1 + γrt+2 + · · · + ·γn−1 rt+n + γn max b∈A(st+n) Qˆθ(st+n, b) − Qθ(st, at) ∇θQθ(st, at). 186/189 Distributional learning Very rough idea: instead of expected returns, predict (discretized) distribution of returns. Source: Bellemare, Dabney, Munos: A Distributional Perspective on Reinforcement Learning. In proceedings of ICML’17. We still optimize the expected value of the distribution, but the NN processes richer information (neurological inspiration). 187/189 Noisy nets Fortunato et al.: Noisy Networks for Exploration . In proceedings of ICRL’17. 
An alternative way of achieving exploration (without ε-greedy policies). Replaces linear layers y = W · x + b with noisy layers of the form y = (µw + σw ⊙ εw ) · x + µb + σb ⊙ εb, where matrices µw , σw and vectors µb, σb are learnable, matrix εw and vector εb consist of random noise, and ⊙ represents component-wise multiplication. The loss function of the DQN training is then encapsulated in expectation over the noise. Interesting point: the net can learn to adjust σ’s and thus the degree of exploration over time. 188/189 Rainbow: evaluation and ablations (Hessel et al., 2017) 189/189