PA230: Reinforcement Learning
Petr Novotný

"Good and evil, reward and punishment, are the only motives to a rational creature; these are the spur and reins whereby all mankind are set on work and guided."
John Locke, Some Thoughts Concerning Education (1693)

Organizational Information

General
• Lecture: Thursdays 2-3:40 p.m.
• Homework: see the interactive syllabus in IS
  • mainly binary classification (accepted/not accepted)
  • all your homeworks need to be marked as passed to proceed to the exam
  • can (but do not have to) be done in pairs (pairs can differ across the individual assignments)
  • for those who passed, the teacher will receive feedback on the general quality of the solutions of each student; this can be taken into account when determining the final grade (typically in the students' favor)
• Exam:
  • oral
  • each attempt counts (?) (unlike the Brázdil system)
  • in general, knowledge of anything mentioned on the slides can be required, unless explicitly marked with "nex" (like the Brázdil system)

Team
• Lecturer: Petr Novotný
• HW team: Martin Kurečka, Václav Nevyhoštěný, Vít Unčovský

Communication
Official Discord server: https://discord.gg/9mxTgYhcdB
• Official communication forum of the course: falls under the university ethical guidelines.
• Use your real name for posting (you can set up an account under your IS email if necessary).

Reading
• Compulsory:
  • these slides,
  • material explicitly prescribed by these slides (not much).
• Recommended:
  • Sutton & Barto: Reinforcement Learning: An Introduction (2nd ed.), available at http://incompleteideas.net/book/RLbook2020.pdf (henceforth referenced as "S&B")
  • slides by David Silver: https://www.davidsilver.uk/teaching/
  • CMU slides: https://www.andrew.cmu.edu/course/10-703/
  • more specific literature recommendations will be given for each topic later

Reinforcement Learning: What, Why, When, How, & Other Questions

Types of machine learning
• unsupervised: spot "useful" patterns in data
• supervised: given labeled data, predict labels on unlabeled data
• reinforcement: agents and decision-making
  • agency = "the ability to take action or to choose what action to take" (Cambridge Dictionary)

General RL scheme
[Figure: the agent-environment interaction loop; source: Sutton & Barto, p. 48]
Keywords: sequential, dynamic, subject to uncertainty

RL: Objective and approach
• Objective: Design a decision policy (= agent behavior) which prescribes to the agent how to act in different situations (states), typically so as to achieve some goal.
• Approach: Start with (± random) behavior and adapt it based on past experience via the law of effect:
  • actions with good/bad consequences for the agent are more/less likely to be repeated by the agent (within the same context)
RL in psychology (nex)
• I.P. Pavlov (1849-1936): classical conditioning
• E. Thorndike (1874-1949): law of effect
• J.B. Watson (1878-1958): behaviorist manifesto
• B.F. Skinner (1904-1990): radical behaviorism, reinforcement, rewards

Learning by trying & the XX dilemma
Underlying the RL approach is the idea of learning by trying:
• first, act more or less randomly (exploration)
  • integral part of early human development
• continually adapt behavior according to experience and feedback from the environment (exploitation)
  • strength of feedback ≈ strength of behavior adaptation
Balancing exploration and exploitation (XX) is a recurring theme in RL.

Incomplete history of RL in computer science I
"Learning by trying" machines and software, ad hoc approaches:
• A. Turing (1912-1954), 1948: theoretical "pleasure & pain" system to train computers
• C. Shannon (1916-2001), 1950: Theseus, a maze-solving mouse
• M. Minsky (1927-2016), 1950s: analog neural net machines (SNARCs)
And many more... Recommended: S&B, Sec. 1.7.

Incomplete history of RL in computer science II
Mathematical foundations of sequential decision making: R. Bellman (1920-1984), R. Howard (b. 1934)
• formalization via Markov decision processes (MDPs)
• value iteration (attributed to Bellman, 1957)
• policy iteration (attributed to Howard, 1960)

Incomplete history of RL in computer science III
Since the late 1980s: synthesis – learning by trial in MDPs
• R. Sutton, A. Barto: temporal difference learning
• C. Watkins: Q-learning
...and many more.

Successes of RL (nex)
[Figures: examples of RL successes.]

Words of caution (and controversy) (nex)
[Figures.]

Mathematical Foundations of Sequential Decision-Making

MDP Example
An MDP with actions, rewards and transition probabilities.
[Figure: the "study" MDP with states start, next, passed, sleep and actions including study (reward -2), done (rewards 20 and 0), procrastinate (rewards 4 and -1), leave (reward -1, two successors with probability 1/2 each), and pub (reward +4, three successors with probabilities 0.3, 0.4, 0.3).]
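For concreteness, a small tabular MDP like the one above can be stored as nested dictionaries; the following is a minimal Python sketch (the concrete entries are illustrative placeholders, not a faithful transcription of the diagram).

# A minimal tabular representation: p[s][a] is a dict {s_next: probability},
# r[s][a] is the immediate reward. The entries below are illustrative placeholders,
# not a transcription of the diagram above.
p = {
    "start": {"study": {"next": 1.0}},
    "next":  {"done":  {"passed": 1.0}},
}
r = {
    "start": {"study": -2.0},
    "next":  {"done":  20.0},
}

def enabled_actions(p, r, s):
    """A(s): actions enabled in s, i.e. those for which both p(s, a) and r(s, a) are defined."""
    return [a for a in p.get(s, {}) if a in r.get(s, {})]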
Markov Decision Process
Given a set X, we denote by D(X) the set of all probability distributions over X.

Definition 1
A Markov decision process (MDP) is a tuple (S, A, p, r) where
• S is a set of states,
• A is a set of actions,
• p: S × A → D(S) is a probabilistic transition function,
• r: S × A → R is a reward function.
We will shorten p(s, a)(s′) to p(s′ | s, a).

The functions p, r can be partial: action a is enabled in state s if both p(s, a) and r(s, a) are defined. We denote by A(s) the set of all actions enabled in s.

Dynamics of MDPs
• start in some initial state s0
• the MDP evolves in discrete time steps t = 0, 1, 2, 3, . . .
• in each time step t, let st be the current state; then:
  • the agent selects an action at ∈ A(st)
  • the environment responds with the next state s_{t+1} ∼ p(st, at) and with the immediate reward r_{t+1} = r(st, at)
  • t is incremented and the process repeats in the same fashion forever
Thus, the agent produces a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .

The trajectory τ is produced randomly (due to p and possibly also the agent's choices being probabilistic): it is a random variable and so are its components. We define the random variables
• St = the state at time step t,
• At = the action at time step t,
• Rt = the reward received just before entering St.

Policies
Definition 2
A history is a finite prefix of a trajectory ending in a state, i.e., an object of the form s0, a0, r1, s1, a1, r2, . . . , a_{t−1}, r_t, s_t ∈ (S · A · R)∗S. We denote by last(h) the last state of a history h.

Definition 3
A policy is a function π: (S · A · R)∗S → D(A) which to each history h assigns a probability distribution over A(last(h)).
A policy is by definition an infinite object!

MD policies
Definition 4
A policy π is:
• memoryless if π(h) = π(h′) whenever last(h) = last(h′) (we can view memoryless policies as objects of type π: S → D(A));
• deterministic if π(h) always assigns probability 1 to one action and zero to all others (we can view deterministic policies as objects of type π: (S · A · R)∗S → A).

Definition 5
A policy π is MD (memoryless deterministic) if it is both memoryless and deterministic.
Dynamics of MDPs (more precise)
Given a distribution I of initial states and a policy π:
• start in some initial state s0 ∼ I
• the MDP evolves in discrete time steps t = 0, 1, 2, 3, . . .
• in each time step t, let ht be the history produced so far; then:
  • the agent selects an action at ∈ A(st) according to π, i.e. at ∼ π(ht)
  • the environment responds with the next state s_{t+1} ∼ p(st, at) and with the immediate reward r_{t+1} = r(st, at); the history is extended by at, r_{t+1}, s_{t+1}
  • t is incremented and the process repeats in the same fashion forever
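A minimal sketch of this interaction loop in Python, assuming the nested-dictionary representation of p and r sketched earlier and a memoryless randomized policy given as pi[s] = {action: probability} (all names here are illustrative, not part of the course material):

import random

def sample_trajectory(p, r, pi, s0, horizon):
    """Roll out s0, a0, r1, s1, ... for `horizon` steps.

    p[s][a] -> {s_next: prob}, r[s][a] -> float,
    pi[s]   -> {a: prob}  (a memoryless, possibly randomized policy).
    """
    traj = [s0]
    s = s0
    for _ in range(horizon):
        actions, action_probs = zip(*pi[s].items())
        a = random.choices(actions, action_probs)[0]     # a_t ~ pi(s_t)
        successors, succ_probs = zip(*p[s][a].items())
        s_next = random.choices(successors, succ_probs)[0]   # s_{t+1} ~ p(. | s_t, a_t)
        traj += [a, r[s][a], s_next]                     # ..., a_t, r_{t+1}, s_{t+1}
        s = s_next
    return traj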
Probability space induced by a policy
In particular, each policy π together with a distribution I of initial states induces a probability measure Pπ over the trajectories of the MDP. (I is typically known from the context and hence omitted from the notation.) We denote by Eπ the associated expected value (expectation) operator.

We denote by Pπ[E | S0 = s] the probability of event E provided that the initial state is fixed to s (and similarly for expectations).

Exercise 6
In the "study" MDP, consider an MD policy π s.t. π(start) = study and π(next) = pub. Compute the following quantities:
• Pπ[visit pub at least twice | S0 = start]
• Pπ[visit pub exactly twice | S0 = start]
• Eπ[R1]
• Eπ[R3]

Memorylessness
In this course, we will almost exclusively focus on memoryless policies. Hence, from now on, policy = memoryless policy. General policies will be referred to as history-dependent policies should the need arise.

Why memoryless? Intuition: the Markov property of MDPs: the next step depends only on the current state and on the action performed in the current step. Hence, intuitively, there is no need for a policy to remember the past so as to "play well". The sufficiency of memoryless policies does not extend to more general/complex decision-making settings (not covered in this course), such as:
• partially observable MDPs
• non-stationary environments
• quantile/risk-aware MDPs, etc.

Returns (payoffs)
Definition 7
Let γ ∈ [0, 1) be a discount factor. For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . we define the discounted return (or payoff) of τ to be the quantity
G(τ) = r1 + γ·r2 + γ^2·r3 + γ^3·r4 + · · · = Σ_{i=0}^{∞} γ^i · r_{i+1}.
Equivalently, G = Σ_{i=0}^{∞} γ^i · R_{i+1}.

Returns (variants)
• Finite horizon (FH): additionally, we are given a decision horizon H ∈ N ∪ {∞}. The return is counted only up to step H:
  G^H = Σ_{i=0}^{H−1} γ^i · R_{i+1}
  For finite H, the discount factor can be 1. H = ∞ corresponds to the original definition.
• Episodic returns: in episodic tasks, there is a distinguished set Term ⊆ S of terminal states which is guaranteed to be reached with probability 1 under any policy. We denote by T a random variable giving the first point in time when we hit a terminal state. We count rewards only up to that time:
  G^T = Σ_{i=0}^{T−1} γ^i · R_{i+1}
  This can be modeled under the original definition by "sink" states.
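A small helper for computing such a finite-horizon/episodic discounted return from a list of rewards r1, . . . , rT (a sketch; the backwards loop uses G_t = r_{t+1} + γ·G_{t+1}):

def discounted_return(rewards, gamma):
    """G = sum_{i=0}^{T-1} gamma^i * r_{i+1} for a finite reward list [r1, ..., rT]."""
    g = 0.0
    for reward in reversed(rewards):     # Horner-style: G_t = r_{t+1} + gamma * G_{t+1}
        g = reward + gamma * g
    return g

# e.g. discounted_return([-2, -2, 20], gamma=0.9) == -2 + 0.9*(-2) + 0.81*20 == 12.4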
Types of returns: discussion
• We will typically omit the superscripts since the type of task considered will be known from the context.
• We have G^H → G (pointwise) as H → ∞, i.e., finite-horizon returns with a high enough H approximate the standard (infinite-horizon) case.
• In the real world, we typically deal with FH or episodic tasks: we cannot wait an infinite amount of time to learn something from a trajectory. However, the infinite-horizon case can be viewed as a neat mathematical abstraction of FH and episodic tasks, and the classical sequential decision-making theory is most developed for the infinite-horizon case.

Policy and state values
Definition 8
Let π be a policy and s a state. The value of π in state s is the quantity v^π(s) = Eπ[G | S0 = s].

Exercise 9
Discuss the values of MD policies in our running example.

Definition 10
The (optimal) value of state s is the quantity v∗(s) = sup_π v^π(s).

Optimality
Definition 11
Let π be a policy and ε > 0. We say that π is ε-optimal in state s if v^π(s) ≥ v∗(s) − ε. We say that π is optimal in s if it is 0-optimal in s, i.e. if v^π(s) = v∗(s).
A policy is (ε-)optimal if it is (ε-)optimal in every state.

Existence of optimal policies
Theorem 12 (classical result, not formally proven here)
Let M be a finite MDP (i.e., the state and action sets are finite) with infinite-horizon returns. Then there exists an optimal MD policy. Moreover, an optimal MD policy can be computed in polynomial time.

Agent control solved? NO! This "only" works if you can actually construct the MDP model of your environment and fit it into a computer. Otherwise, we use reinforcement learning.

Exact Planning with Known Model: Value & Policy Iteration

Goal of this lecture
Algorithms that compute the optimal value vector v∗ and some optimal MD policy π∗ given full knowledge of an MDP M.

Polynomial-time algorithm
MDPs can be solved by linear programming (LP):
maximize ⃗c · ⃗x subject to A · ⃗x ≤ ⃗b
• LP can be solved in polynomial time by so-called interior-point algorithms.
• However, we typically use other, MDP-specific algorithms: value iteration (VI) and policy iteration (PI). These are not polynomial-time in general, but are typically faster on practical instances.
• Moreover, most truly RL algorithms can be seen as approximate generalizations of VI or PI (or both).

Example: Policy evaluation
Exercise 13
Consider all four MD policies in our running "pub or study" example that always try to quit when in the X state. Compute the values of these policies in the initial state start.

Policy evaluation equations
Theorem 14
For any memoryless policy π and any state s it holds:
v^π(s) = Σ_{a∈A(s)} π(a|s) · [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v^π(s′) ],
where the bracketed expression is denoted q^π(s, a).
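Since the evaluation equations are linear in v^π, a finite MDP can be evaluated directly with a linear solver. A sketch using numpy, assuming the dictionary representation of p and r used earlier and a memoryless randomized policy pi[s] = {a: prob}:

import numpy as np

def evaluate_policy(states, p, r, pi, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for v = v^pi (Theorem 14)."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P_pi = np.zeros((n, n))      # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
    r_pi = np.zeros(n)           # r_pi[s]     = sum_a pi(a|s) * r(s, a)
    for s in states:
        for a, pa in pi[s].items():
            r_pi[idx[s]] += pa * r[s][a]
            for s_next, ps in p[s][a].items():
                P_pi[idx[s], idx[s_next]] += pa * ps
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return {s: v[idx[s]] for s in states}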
Bellman optimality equations
Theorem 15
The following holds for any state s:
v∗(s) = max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v∗(s′) ],
where the bracketed expression is denoted q∗(s, a).

• Note: the policy evaluation equations are a special case of the Bellman ones: given a policy π, we can consider an MDP M^π in which there is a single action ∗ enabled in each state and the probability of the transition s →∗ s′ equals Σ_{a∈A(s)} π(a|s) · p(s′|s, a). Then M^π mimics the behavior of π in M, and the Bellman equations in M^π = the evaluation equations for π in M.
• But these equations are no longer linear! How do we solve them? Is the solution even unique?

Bellman update operator
The right-hand side (RHS) of the Bellman equations can be viewed as an operator Φ: R^S → R^S: for any ⃗x ∈ R^S, Φ(⃗x) is the vector such that for any state s:
Φ(⃗x)(s) := max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · ⃗x(s′) ]

Exercise 16
In our running example, compute Φ(⃗0).

Theorem 15 says that the optimal value vector v∗ is a fixed point of Φ: v∗ = Φ(v∗).

Mathematical hammers for Bellman
Lemma 17 (not proven here)
For any discount factor γ ∈ [0, 1), the Bellman operator Φ is a contraction, i.e. for any pair of vectors ⃗x, ⃗y it holds that
∥Φ(⃗x) − Φ(⃗y)∥∞ ≤ γ · ∥⃗x − ⃗y∥∞.

Theorem 18: Banach fixed point theorem (classical calculus, not proven here)
A contraction mapping from a complete metric space (in particular, R^n) to itself has a unique fixed point.

Exact characterization of v∗
Corollary 19
The optimal value vector is the unique solution of the Bellman optimality equations.

In particular, the policy evaluation equations also have a unique solution, equal to v^π. Since the policy evaluation equations are linear, their solution can be computed by Gaussian elimination. But the general Bellman equations are not linear. How can we solve them?
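For later reference, the operator Φ is straightforward to implement; a sketch under the same dictionary representation of p and r (the function name is an illustrative choice):

def bellman_update(states, p, r, gamma, x):
    """One application of the Bellman operator:
    (Phi x)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) * x(s') ]."""
    return {
        s: max(
            r[s][a] + gamma * sum(ps * x[s_next] for s_next, ps in p[s][a].items())
            for a in p[s]
        )
        for s in states
    }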
Banach fixpoint theorem (full)
Theorem 20: Banach fixed point theorem (full version, not proven here)
A contraction mapping Φ from a complete metric space (in particular, R^n) to itself has a unique fixed point ⃗z. Moreover, ⃗z is the limit of iterative applications of Φ to any initial vector, i.e., for any ⃗x0 ∈ R^n, the sequence ⃗x0, Φ(⃗x0), Φ(Φ(⃗x0)), Φ^(3)(⃗x0), . . . converges to ⃗z:
⃗z = lim_{i→∞} Φ^(i)(⃗x0)

Value iteration (VI; Bellman, 1957)
Algorithm 1: Value iteration
Input: MDP M = (S, A, p, r)
Output: approximation ˜v of v∗
x ← any vector from R^|S|   // typically ⃗0
next ← x
repeat
  foreach s ∈ S do
    next(s) ← max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · x(s′) ]   // = Φ(x)(s)
  x ← next
until termination condition

Typical termination conditions:
• after a fixed number of iterations (i.e., use a for-loop with a fixed bound)
• after each component of x changes by less than some given ε

How to use VI
By the Banach fixpoint theorem (and Lemma 17), the value of the variable x in VI converges to v∗. Can we recognize when x is "close enough" to v∗?

In the following couple of theorems, let ⃗x0, ⃗x1, ⃗x2, . . . be the sequence of vectors computed by VI, i.e. ⃗x0 is arbitrary and ⃗x_{i+1} = Φ(⃗x_i) for all i ≥ 0.

Theorem 21: Stopping condition (not proven here)
For any ε > 0: if ∥⃗x_{i+1} − ⃗x_i∥∞ ≤ ε · (1 − γ)/γ, then ∥⃗x_{i+1} − v∗∥∞ ≤ ε.

How to use VI (2)
How fast can we get to the point where we are close enough?
Theorem 22: Speed of convergence (not proven here)
For all n ≥ 0 it holds that
∥⃗x_n − v∗∥∞ ≤ (γ^n / (1 − γ)) · ∥⃗x_1 − ⃗x_0∥∞.
In particular, if we terminate VI after
i = ⌈ ( log(ε) + log( (1 − γ) / ∥⃗x_1 − ⃗x_0∥∞ ) ) / log(γ) ⌉
steps, then its output ⃗x_i will be an ε-approximation of v∗.

How to use VI (3)
Can we actually get some optimal values instead of approximations?
First, note that VI computes optimal finite-horizon values. Let v^H = sup_π Eπ[ Σ_{i=1}^{H} γ^{i−1} · R_i ]. The supremum is over all (i.e., history-dependent) policies, since in the FH problem an optimal policy needs to track the number of elapsed (and thus remaining) steps: memory is needed for that.

Theorem 23 (easy but important exercise)
If ⃗x_0 = ⃗0, then ⃗x_H = v^H for all H ≥ 0.

Moreover, let π_H be a deterministic history-dependent policy such that for all 1 ≤ i ≤ H, whenever there are i steps remaining till the horizon, the policy π_H selects in state s an action a s.t.
a = arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · ⃗x_{i−1}(s′) ]
(with ties broken arbitrarily). Then π_H is an optimal H-step policy.

Greedy policies
Can we actually get an optimal policy for the infinite-horizon problem?
Definition 24: ⃗x-greedy policy (very important)
Let ⃗x ∈ R^S be any vector. An ⃗x-greedy policy is an MD policy π such that in any state s:
π(s) = arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · ⃗x(s′) ].
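A sketch of Algorithm 1 with the stopping condition of Theorem 21, together with extraction of an ⃗x-greedy policy (Definition 24); it reuses the bellman_update sketch from above and the same dictionary representation (names are illustrative):

def value_iteration(states, p, r, gamma, eps):
    """Iterate Phi until ||x_{i+1} - x_i||_inf <= eps * (1 - gamma) / gamma (Theorem 21)."""
    x = {s: 0.0 for s in states}
    while True:
        nxt = bellman_update(states, p, r, gamma, x)
        if max(abs(nxt[s] - x[s]) for s in states) <= eps * (1 - gamma) / gamma:
            return nxt
        x = nxt

def greedy_policy(states, p, r, gamma, x):
    """An x-greedy MD policy (Definition 24), ties broken arbitrarily."""
    return {
        s: max(p[s], key=lambda a: r[s][a]
               + gamma * sum(ps * x[s_next] for s_next, ps in p[s][a].items()))
        for s in states
    }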
How to use VI (4)
Theorem 25: Optimal infinite-horizon policy from VI (not proven here)
There is a number N, polynomial in the size of the MDP and exponential in the binary encoding size of γ, such that a policy π that is ⃗x_N-greedy is optimal in every state, i.e. v^π = v∗.

Note that once π is computed, v^π can be computed in polynomial time via the policy evaluation equations. Hence, VI can be said to solve MDPs in exponential time (and in polynomial time if the discount factor is assumed to be a fixed constant instead of an input parameter), though the approximate version is typically used in practice.

Note: the fact that some policy π is ⃗x-greedy does not mean that v^π ≥ ⃗x! Homework: find a counterexample and post it to Discord. However, for VI it can be shown that if ∥⃗x_{i+1} − ⃗x_i∥∞ ≤ ε · (1 − γ)/γ (the stopping condition from Theorem 21), then an ⃗x_{i+1}-greedy policy is ε-optimal.

Policy improvement
[Figure: the "study" MDP again. In-class exercise: compute v^π, let π′ be v^π-greedy; compute v^{π′}, let π′′ be v^{π′}-greedy; and so on.]

Policy improvement theorem
Theorem 26: Policy improvement
Let π be a policy. If Φ(v^π) ≥ v^π, then any v^π-greedy policy π_g is at least as good as π, i.e.
∀s ∈ S: v^{π_g}(s) ≥ v^π(s).
Moreover, if Φ(v^π)(s) > v^π(s) for some state s, then also v^{π_g}(s′) > v^π(s′) for some state s′.

Returns from a given time step
For the proof of PIT, and also many times later, we will need the following notation:
Definition 27 (important!)
Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t ∈ N a time step. We define
G_t(τ) = Σ_{i=t}^{H−1} γ^{i−t} · r_{i+1} = r_{t+1} + γ·r_{t+2} + γ^2·r_{t+3} + · · · ,
where H ∈ N ∪ {∞}, or H = T for episodic tasks. We similarly define, for any policy π:
G^π_t = Eπ[G_t] = Eπ[ Σ_{i=t}^{H−1} γ^{i−t} · R_{i+1} ].

Proof of PIT (setup)
We will define a sequence of policies π0, π1, π2, . . . s.t.:
• π0 = π,
• πi behaves as π_g (i.e., selects the same actions in the same states) for the first i steps, then "switches" back to behave as π;
• we also define π∞ = π_g.
We want: v^{π∞}(s) ≥ v^π(s) for all s.
Not hard to see: v^{πi} → v^{π∞} as i → ∞ (πi behaves as π∞ for longer and longer as i increases + discounting). It suffices to show: v^{πi}(s) ≥ v^π(s) for all i ∈ N and all s ∈ S.
Proof of PIT (induction)
We prove v^{πi}(s) ≥ v^π(s) for all i ∈ N and all s ∈ S.
• i = 0: clear.
• i > 0:
v^{πi}(s) = Eπi[ R1 + γR2 + · · · + γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
= Eπi[ R1 + γR2 + · · · + γ^{i−2}R_{i−1} | S0 = s ] + Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
= Eπ_{i−1}[ R1 + γR2 + · · · + γ^{i−2}R_{i−1} | S0 = s ] + Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
(the first i−1 rewards depend only on the first i−1 steps, on which both πi and π_{i−1} behave as π_g).
Suppose we prove
Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ] ≥ Eπ_{i−1}[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ].   (*)
Then
v^{πi}(s) ≥ Eπ_{i−1}[ R1 + γR2 + · · · + γ^{i−2}R_{i−1} + γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ] = v^{π_{i−1}}(s) ≥ v^π(s),
where the last inequality is the induction hypothesis (IH).

Proof of PIT (induction, behavior at "reset")
It remains to prove (*):
Eπi[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ]
= γ^{i−1} · Eπi[ R_i + γR_{i+1} + · · · | S0 = s ]
= γ^{i−1} · Σ_{s′∈S} Pπi[ S_{i−1} = s′ | S0 = s ] · ( r(s′, πi(s′)) + γ · Σ_{s″} p(s″ | s′, πi(s′)) · Eπi[ G_i | S_i = s″ ] )
= γ^{i−1} · Σ_{s′∈S} Pπ_g[ S_{i−1} = s′ | S0 = s ] · ( r(s′, π_g(s′)) + γ · Σ_{s″} p(s″ | s′, π_g(s′)) · Eπ[ G_i | S_i = s″ ] )
(πi follows π_g for the first i steps and behaves as π afterwards; note Eπ[ G_i | S_i = s″ ] = v^π(s″))
= γ^{i−1} · Σ_{s′∈S} Pπ_g[ S_{i−1} = s′ | S0 = s ] · Φ(v^π)(s′)   (since π_g is v^π-greedy)
≥ γ^{i−1} · Σ_{s′∈S} Pπ_g[ S_{i−1} = s′ | S0 = s ] · v^π(s′)   (since Φ(v^π) ≥ v^π by the PIT assumption)
= γ^{i−1} · Eπ_{i−1}[ G_{i−1} | S0 = s ] = Eπ_{i−1}[ γ^{i−1} G_{i−1} | S0 = s ]
= Eπ_{i−1}[ γ^{i−1}R_i + γ^i R_{i+1} + · · · | S0 = s ].
Strictness?

Policy iteration (PI; Howard, 1960)
Algorithm 2: Policy iteration
Input: MDP M = (S, A, p, r)
Output: optimal MD policy π∗ for M, its value vector v∗
π ← arbitrary MD policy
v ← v^π   // e.g. by solving the linear policy evaluation equations
while Φ(v) ̸= v do
  foreach s ∈ S do
    π(s) ← arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v(s′) ]
  v ← v^π
return π, v
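A sketch of Algorithm 2 in the same style, reusing the evaluate_policy, bellman_update and greedy_policy sketches from earlier (the exact test Φ(v) = v is replaced by a small floating-point tolerance):

def policy_iteration(states, p, r, gamma):
    """Alternate exact policy evaluation and greedy improvement until Phi(v) = v."""
    # start from an arbitrary MD policy, here: the first enabled action in each state
    policy = {s: next(iter(p[s])) for s in states}
    while True:
        # evaluate_policy expects a randomized policy pi[s] = {a: prob}
        v = evaluate_policy(states, p, r, {s: {policy[s]: 1.0} for s in states}, gamma)
        phi_v = bellman_update(states, p, r, gamma, v)
        if all(abs(phi_v[s] - v[s]) < 1e-12 for s in states):   # Phi(v) = v, up to tolerance
            return policy, v
        policy = greedy_policy(states, p, r, gamma, v)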
PI correctness & complexity
Theorem 28
Policy iteration terminates after at most exponentially many iterations. Upon termination, it returns an optimal MD policy.

Proof:
• Optimal upon termination: v^π is a fixpoint of Φ when terminating; optimality follows from Corollary 19.
• Terminates: π always stores an MD policy and there are finitely many of these. We will show that no single MD policy appears in more than one iteration of PI. Consider any iteration and let v, v′ be the contents of the variable v before and after the iteration. We will show that unless Φ(v) = v, it holds that v′ > v, i.e. v′ ≥ v componentwise with a strict inequality in some component. Hence, v = v^π strictly increases during PI, so no π can appear twice.

PI: correctness proof
v′ ≥ v: We verify the assumption of PIT: Φ(v) ≥ v. Recall v = v^π. For all s ∈ S:
Φ(v)(s) = max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′ | s, a) · v(s′) ]
≥ r(s, π(s)) + γ · Σ_{s′∈S} p(s′ | s, π(s)) · v^π(s′)
= v^π(s) = v(s).
By PIT, v′ = v^{π′} ≥ v^π = v (here π′ is the v-greedy policy).

PI: correctness proof II
It remains to prove that v′ > v or PI terminates. Assume that v′ = v. Then for all s ∈ S:
v′(s) = r(s, π′(s)) + γ · Σ_{s′∈S} p(s′ | s, π′(s)) · v^{π′}(s′)
= r(s, π′(s)) + γ · Σ_{s′∈S} p(s′ | s, π′(s)) · v^π(s′)   (assumption)
= max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′ | s, a) · v^π(s′) ]   (π′ is v = v^π-greedy)
= max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′ | s, a) · v^{π′}(s′) ]   (assumption)
= Φ(v′)(s),
so PI terminates at this point.
Complexity?

PI&VI: discussion
• We know that MDPs have a linear programming (LP) formulation. PI is basically a variant of the simplex method for this LP, using a special pivoting rule.
• PI typically requires fewer iterations to converge than VI, though each iteration is more expensive (policy evaluation).
• Both PI and VI typically work well in practice for MDPs whose explicit transition table fits inside a computer. Which of the two is faster is rather domain-specific.

PI variants
Can we get rid of the expensive policy evaluation by linear system solving?
Yes: we can approximate the value of the current policy π by applying VI on the MDP M^π, either for a fixed number of steps or until v does not change much.
Often appearing in RL textbooks:

Policy iteration with approximate evaluation
π ← arbitrary MD policy; v ← arbitrary vector
repeat
  v ← Eval(π, v)
  foreach s ∈ S do
    π(s) ← arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v(s′) ]
until π has not changed
return π, v

Function Eval(π, v):
  v′ ← v
  repeat
    foreach s ∈ S do
      v(s) ← v′(s)
      v′(s) ← r(s, π(s)) + γ · Σ_{s′∈S} p(s′ | s, π(s)) · v(s′)
  until ∥v − v′∥∞ ≤ ε
  return v′

Convergence of PI variants
• The algorithm on the previous slide still converges to an optimal policy provided that ε is small enough.
• If we replaced the "π has not changed" condition with the original "Φ(v) = v" condition, the algorithm might not terminate, since VI is only guaranteed to reach a true fixpoint in the limit. However, v would still converge to v∗ and thus π would eventually become equal to an optimal policy.
• The previous point holds even in the very degenerate case when we do just one iteration of VI per policy evaluation! See the next slide.

Curiously looking approximate PI
v ← arbitrary vector; π ← v-greedy MD policy
repeat
  foreach s ∈ S do
    v′(s) ← r(s, π(s)) + γ · Σ_{s′∈S} p(s′|s, π(s)) · v(s′)
  v ← v′
  foreach s ∈ S do
    π(s) ← arg max_{a∈A(s)} [ r(s, a) + γ · Σ_{s′∈S} p(s′|s, a) · v(s′) ]
until Φ(v) = v
return π, v

This is just VI in disguise!

Generalized policy iteration
• Evaluate: push v towards v^π
• Improve: make π (more) v-greedy
[Figure: the generalized policy iteration diagram; source: Sutton&Barto, p. 87]

Tabular Methods for Model-Free Reinforcement Learning

Model-free
We will still be working with MDPs. But for a bunch of the following lectures, we will not (necessarily) have access to, e.g.:
• a table containing an explicit enumeration of all states/actions
• a table containing the description of p or r
• the ability to compute the probability vector p(s, a) or the reward signal r(s, a) given s and a (having this = a gray-box model of the MDP)

Sampling from MDP
But the MDP is still there "behind the scenes". In particular, we:
• know how the states of the MDP look like
  • (e.g. robot state = all possible output values of its sensors)
• know how the actions of the MDP look like
  • (e.g. robot action = all possible signals that can be sent to the actuators)
• can, for any s ∈ S, enumerate A(s)
  • could be weakened, but simplifies things
• given s ∈ S and a ∈ A(s), can sample the next state s′ ∼ p(s, a) and receive the reward r(s, a).
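In code, this sampling access is typically what an environment API provides. A sketch of collecting one episode, assuming a Gymnasium-style environment object with reset()/step() (the env and policy objects are assumptions of this sketch, not something defined on the slides):

def sample_episode(env, policy):
    """Collect one episode s0, a0, r1, s1, ... from a Gymnasium-style environment.

    `policy(s)` returns an action; `env.step(a)` returns
    (s_next, reward, terminated, truncated, info) as in the Gymnasium API.
    """
    states, actions, rewards = [], [], []
    s, _info = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, reward, terminated, truncated, _info = env.step(a)
        states.append(s); actions.append(a); rewards.append(reward)
        done = terminated or truncated
        s = s_next
    states.append(s)                      # final (terminal or truncation) state
    return states, actions, rewards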
Sampling from a policy
Given an effective representation of a policy π, we can sample a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . by performing, for each t ∈ {0, . . . , T}:
• sample at ∼ π(st)
• query the environment for s_{t+1} ∼ p(st, at) and r_{t+1} = r(st, at)
• increment t

Tabular = value estimates and policies represented as tables (e.g. Q(s, a) for each state s and action a used in s – an explicit representation might only be needed for states/actions actually encountered).

Basic classification of (tabular) RL algorithms
Three independent axes:
• problem: policy evaluation (value prediction) vs. control
• on/off: on-policy vs. off-policy
• updates: Monte Carlo vs. temporal difference

Assumptions: successor-dependent rewards & episodic tasks
Since we no longer have knowledge of the transition dynamics p, we cannot freely interchange reward functions of type S × A × S → R and S × A → R via the equation r(s, a) = Σ_{s′∈S} p(s′ | s, a) · r(s, a, s′). Hence, to maintain generality (and correspondence to e.g. Gymnasium environments) we will assume reward functions of type S × A × S → R.

We will assume episodic returns: each trajectory terminates with probability 1 at some (possibly random) time step T. Termination can be defined e.g. by reaching some terminal state or by running out of some fixed decision horizon (in Gymnasium, this is sometimes called truncation):
G = Σ_{i=0}^{T−1} γ^i · R_{i+1}.
Episode = one high-level iteration of an RL algorithm, corresponding to sampling a single trajectory from some policy.

Monte Carlo Methods

MC evaluation
Policy evaluation: given an effective representation of a policy π, estimate v^π (or q^π).
Naive Monte Carlo: sample from π: if {τ1, τ2, . . . , τn} are trajectories (episodes) independently sampled under π from the same initial state s, then
(1/n) Σ_{i=1}^{n} G(τi) → v^π(s) as n → ∞
due to the law of large numbers (LLN).
But this throws away a lot of valuable information! E.g. what if we want to estimate the whole v^π?

First-visit MC
For each s, we estimate v^π(s) as an average of the set of sample returns Ret(s), which is formed as follows:
• initially, Ret(s) = ∅ for all s
• we then sample trajectories until timeout:
  • for each sampled trajectory τ and each state s, we identify the first occurrence of s on τ; let this be at timestep t; we add G_t(τ) to Ret(s)
[Figure: an example trajectory in the running MDP, with the recorded first-visit returns.]

A sub-trajectory starting at the first appearance of s can be seen as a trajectory sampled from π when s is the initial state! (Since we consider memoryless π.)

Theorem 29
As |Ret(s)| → ∞, the average of Ret(s) converges to v^π(s). Moreover, the average of Ret(s) is an unbiased estimate of v^π(s) (as long as Ret(s) ̸= ∅).
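A sketch of first-visit MC evaluation, assuming episodes are given as the (states, actions, rewards) triples produced by a sampler like the one sketched earlier; returns are accumulated backwards via G_t = r_{t+1} + γ·G_{t+1}:

from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate v^pi(s) as the average of the first-visit sample returns Ret(s)."""
    returns = defaultdict(list)                          # Ret(s)
    for states, _actions, rewards in episodes:
        # G[t] = r_{t+1} + gamma * G[t+1], computed backwards along the episode
        G = [0.0] * (len(rewards) + 1)
        for t in reversed(range(len(rewards))):
            G[t] = rewards[t] + gamma * G[t + 1]
        first_visit = {}                                 # s -> first t with S_t = s
        for t, s in enumerate(states[:-1]):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}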
First-visit MC (pseudocode)
[Pseudocode figure; source: Sutton&Barto, p. 92]

Every-visit MC
For each s, we estimate v^π(s) as an average of the set of sample returns Ret(s), which is formed as follows:
• initially, Ret(s) = ∅ for all s
• we then sample trajectories until timeout:
  • for each sampled trajectory τ, each state s, and each t such that S_t(τ) = s, we add G_t(τ) to Ret(s)
[Figure: the same example trajectory, now with every-visit returns recorded.]

The sample returns added to Ret(s) within the same episode are not independent! Hence, the estimate is biased, though the bias vanishes in the limit:
Theorem 30
As |Ret(s)| → ∞, the average of Ret(s) converges to v^π(s).

MC convergence
Optional reading: more on MC estimate bias, variance, and convergence in:
Singh, S.P. and Sutton, R.S.: Reinforcement Learning with Replacing Eligibility Traces. In Machine Learning 22:123-158. Kluwer, 1996. (Section 3, particularly 3.3 and onwards; you can skip Theorem 4.)

Towards MC control
Control = computation of a "good" policy for a given environment. (Ideally, the policy should get closer to the optimal policy the more episodes we sample.)
We know (PIT): given a policy π, a v^π-greedy policy is at least as good as π:
π_g(s) = arg max_{a∈A(s)} Σ_{s′∈S} p(s′ | s, a) · [ r(s, a, s′) + γ · v^π(s′) ]
Do we have an algorithm? There is an issue: computing this greedy policy from v^π requires the transition probabilities p, which we do not have.

MC control with q-values
Recall: q^π(s, a) := Σ_{s′∈S} p(s′ | s, a) · [ r(s, a, s′) + γ · v^π(s′) ].
Thus, the v^π-greedy policy π_g can be defined as:
π_g(s) = arg max_{a∈A(s)} q^π(s, a)
Estimate q^π by MC.

MC for q-value estimation
Analogous to value estimation, e.g. first-visit: for each pair (s, a), we estimate q^π(s, a) as an average of the set of sample returns Ret(s, a), which is formed as follows:
• initially, Ret(s, a) = ∅ for all s, a
• we then sample trajectories until timeout:
  • for each sampled trajectory τ and each state-action pair (s, a), we identify the first t such that S_t(τ) = s ∧ A_t(τ) = a; we add G_t(τ) to Ret(s, a)
[Figure: the example trajectory with the state-action pairs along it.]
Similarly for every-visit. The convergence guarantees are the same as for state values.

Infinite exploration and exploring starts
Issue: MC only estimates q^π(s, a) if:
• s is guaranteed to be visited with positive probability in each episode,
• π(a | s) > 0.

Definition 31: Infinite exploration
An RL algorithm has infinite exploration (IE) if, during the infinite execution of the algorithm, each state-action pair (s, a) is visited infinitely often with probability 1.

One way of achieving IE is through exploring starts (ES): each episode begins with a (typically uniformly) randomly selected s0 and a0. This is achievable when training, e.g., in simulated environments but might be difficult/impossible in real-world environments.

MC control with exploring starts
[Pseudocode figure; source: Sutton&Barto, p. 99]

IE through ε-soft policies
Exploring starts are not always feasible. Alternative: make the sampled policy itself exploratory.
Definition 32: ε-soft policy
A policy π is ε-soft if for every s ∈ S and every a ∈ A(s) it holds that π(a|s) ≥ ε/|A(s)|.

Definition 33: ε-greedy policy
Let v ∈ R^S be a value vector. A policy π is v-ε-greedy if for every state s ∈ S there is an action a∗ = arg max_{a∈A(s)} Σ_{s′∈S} p(s′ | s, a) · [ r(s, a, s′) + γ · v(s′) ] such that for any action a ∈ A(s):
π(a|s) = ε/|A(s)| if a ̸= a∗, and π(a|s) = 1 − ε + ε/|A(s)| if a = a∗.
Interpretation: with probability ε, play uniformly at random; with probability 1 − ε, play greedily.

ε-softing
Definition 34
Let π be a policy. An ε-softing of π is a policy πε defined as follows: in each state s,
• with probability ε, πε selects an action uniformly at random;
• with probability 1 − ε, πε selects a ∼ π(s).
I.e., an ε-greedy policy can alternatively be defined as an ε-softing of a greedy policy.

MC control with ε-greedy policies
[Pseudocode figure; source: Sutton&Barto, p. 101]

Policy iteration for ε-soft policies
Theorem 35
Let π be an ε-soft policy and let π′ be a v^π-ε-greedy policy. Then v^{π′} ≥ v^π (componentwise). Moreover, the two value vectors are equal if and only if both π and π′ are optimal among all ε-soft policies; i.e. if, for every state s:
v^π(s) = sup over ε-soft ¯π of v^{ ¯π}(s).
Proof: required reading: Sutton&Barto, pp. 101-103.
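An on-policy MC control sketch in the spirit of the pseudocode referenced above: maintain Q(s, a) as incremental averages of first-visit returns and behave ε-greedily with respect to the current Q. It reuses the sample_episode sketch from earlier and assumes, for simplicity, the same action set in every state; this is an illustrative sketch, not the book's exact algorithm.

import random
from collections import defaultdict

def epsilon_greedy_action(Q, actions, s, eps):
    """With probability eps act uniformly at random, otherwise greedily w.r.t. Q(s, .)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def mc_control(env, actions, gamma, eps, n_episodes):
    Q = defaultdict(float)                     # q-value estimates
    counts = defaultdict(int)                  # number of returns averaged into Q(s, a)
    behave = lambda s: epsilon_greedy_action(Q, actions, s, eps)
    for _ in range(n_episodes):
        states, acts, rewards = sample_episode(env, behave)
        G = 0.0
        for t in reversed(range(len(rewards))):            # walk the episode backwards
            G = rewards[t] + gamma * G                      # G_t
            sa = (states[t], acts[t])
            if sa not in zip(states[:t], acts[:t]):         # first-visit check
                counts[sa] += 1
                Q[sa] += (G - Q[sa]) / counts[sa]           # incremental average of Ret(s, a)
    return Q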
Incremental computing of averages
Given a sample {n1, n2, . . . , n_{k+1}} and the average A = avg({n1, n2, . . . , nk}), how do we compute A′ = avg({n1, n2, . . . , nk, n_{k+1}}) without recomputing the average of the whole sample?
A′ = (k/(k+1)) · A + n_{k+1}/(k+1).
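The same update in code, written in the equivalent form A ← A + (n_{k+1} − A)/(k + 1) that reappears in many RL update rules (a trivial sketch):

def update_average(avg, count, new_value):
    """Average of k+1 samples, given the average `avg` of the first k=`count` samples."""
    count += 1
    avg += (new_value - avg) / count        # equals (k/(k+1)) * A + n_{k+1}/(k+1)
    return avg, count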
On-policy vs. off-policy • On-policy algorithms: track one “policy variable” π; the policy stored in π is used to interact with the environment (i.e., to sample episodes) and at the same time we learn something about it (e.g. its value vector). • Corresponds to the generalized policy iteration scheme. • All the MC algos we have seen so far. • Off-policy algorithms: track more (typically two) different policy variables: • behavior policy: used to sample episodes • target policy: which we want to learn about 81/189
Off-policy evaluation We are given effective representations of: • a behavior policy β, • a target policy π. The task is to estimate vπ by sampling episodes from β. We cannot sample from π! (E.g. π is too risky or expensive to sample from.) Assumptions: • given (s, a), we can effectively compute π(a|s) and β(a|s) (or at least estimate them via sampling) • coverage: ∀s ∈ S, a ∈ A(s): if π(a|s) > 0, then also β(a|s) > 0 82/189
Importance sampling Definition 36: Importance ratio Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. The importance-sampling ratio of τ is the quantity ρ(τ) := Pπ[τ | S0 = s0] / Pβ[τ | S0 = s0] = Pπ[A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0] / Pβ[A0 = a0, S1 = s1, A1 = a1, . . . , AT−1 = aT−1, ST = sT | S0 = s0]. ρ(τ) can be computed without the knowledge of the MDP transition probabilities: ρ(τ) = (π(a0 | s0) · p(s1 | s0, a0) · π(a1 | s1) · p(s2 | s1, a1) · · ·) / (β(a0 | s0) · p(s1 | s0, a0) · β(a1 | s1) · p(s2 | s1, a1) · · ·), and the transition probabilities cancel out. 83/189
Importance ratio from time t Definition 37 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory. By τi..j we denote the subtrajectory of τ starting in time step i and ending in time step j. By τi.. we denote the suffix si, ai, ri+1, si+1, ai+1, . . .. Definition 38: Importance ratio from time t Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and t a time step. The importance-sampling ratio of τ from t is the quantity ρt(τ) := Pπ[τt.. | S0 = st] / Pβ[τt.. | S0 = st] = Pπ[A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st] / Pβ[A0 = at, S1 = st+1, A1 = at+1, . . . , AT−1−t = aT−1, ST−t = sT | S0 = st]. 84/189
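A small illustrative sketch of computing ρt(τ): the transition probabilities cancel, so only the two policies are needed. The interfaces pi(a, s) and beta(a, s) returning action probabilities, and the trajectory given as a list of (state, action) pairs, are assumptions made for the sketch.

```python
def importance_ratio(state_action_pairs, pi, beta, t=0):
    """rho_t(tau) = prod_{k >= t} pi(a_k | s_k) / beta(a_k | s_k).

    `state_action_pairs` lists (s_k, a_k) up to (but excluding) the terminal state;
    `pi(a, s)` and `beta(a, s)` return the respective action probabilities.
    """
    rho = 1.0
    for s, a in state_action_pairs[t:]:
        # coverage assumption: beta(a, s) > 0 whenever pi(a, s) > 0
        rho *= pi(a, s) / beta(a, s)
    return rho
```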
Off-policy evaluation with importance sampling Theorem 39 For any s ∈ S it holds: Eβ[ρ · G | S0 = s] = vπ(s). Proof: Eβ[ρ · G | S0 = s] = Σ_τ Pβ[τ | S0 = s] · ρ(τ) · G(τ) = Σ_τ Pβ[τ | S0 = s] · (Pπ[τ | S0 = s] / Pβ[τ | S0 = s]) · G(τ) = Σ_τ Pπ[τ | S0 = s] · G(τ) = Eπ[G | S0 = s] = vπ(s). Easily integrates into both first-visit and every-visit MC: sample from β and store ρt(τ) · Gt(τ) in Ret(st). 85/189
Weighted importance sampling First-visit variant: for each state s, we keep a set of samples Sam(s). Each sample is a tuple (τ, t) – a trajectory and a time step. • initially, Sam(s) = ∅ for all s • we then sample trajectories until timeout: • for each sampled trajectory τ, each state s, and the smallest t such that St(τ) = s, we add (τ, t) to Sam(s) Throughout the algorithm, the value of state s is estimated as WIS(s) = Σ_{(τ,t)∈Sam(s)} ρt(τ) · Gt(τ) / Σ_{(τ,t)∈Sam(s)} ρt(τ). Exercise 40 Compare ordinary and weighted importance sampling after a single sample (a small code sketch appears a bit further below). 86/189
Weighted importance sampling – correctness Weighted sampling is clearly a biased estimator. However, the bias vanishes in the limit: Theorem 41 With probability 1: as |Sam(s)| → ∞, we have that WIS(s) → vπ(s). Proof: 87/189
Ordinary vs. weighted sampling source: Sutton&Barto, p. 106 88/189
Importance sampling: summary Both ordinary and weighted importance sampling can be adapted to every-visit MC. Bias & Convergence: • First visit: • ordinary IS: unbiased, i.e. also converges • weighted IS: biased, but converges in the limit • Every visit: • both ordinary and weighted: biased (due to EV), but converge in the limit 89/189
Weighted IS: incremental implementation Instead of recomputing the weighted average for each new sample, WIS(s) can be updated by keeping just two variables: • V – the current value of WIS(s), initially arbitrary • C – the sum of importance ratios, initially 0 Upon arrival of a new sample (τ′, t′), we update V, C into new values V′, C′ by setting: C′ = C + ρt′(τ′) and V′ = V + (ρt′(τ′) / C′) · (Gt′(τ′) − V). 90/189
Off-policy evaluation with weighted IS 91/189
Off-policy control with weighted IS Required reading: Sutton&Barto, Section 5.7. 92/189
Temporal Difference Methods TD: Motivation Let us first focus on policy evaluation. MC: zero bias (at least in the limit), but potentially high variance: many samples are needed to converge. Also, to update estimates, it must wait till the end of each episode. TD methods retain the focus on sampling but combine it with bootstrapping. 93/189
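Relating back to Exercise 40, a small sketch contrasting the ordinary and the weighted IS estimator on the same list of (ρ, G) samples (plain illustrative Python, not course code):

```python
def ois_estimate(samples):
    """Ordinary IS: average of rho * G (unbiased, potentially high variance)."""
    return sum(rho * g for rho, g in samples) / len(samples)

def wis_estimate(samples):
    """Weighted IS: sum(rho * G) / sum(rho) (biased, typically lower variance)."""
    total = sum(rho for rho, _ in samples)
    return sum(rho * g for rho, g in samples) / total if total > 0 else 0.0

# With a single sample, OIS returns rho * G while WIS returns G itself (cf. Exercise 40).
print(ois_estimate([(3.0, 2.0)]), wis_estimate([(3.0, 2.0)]))   # 6.0 2.0
```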
Incremental update notation Definition 42: Notation for updates In the context of RL algorithms, we will denote by V^n(s) (resp. Q^n(s, a)) the algorithm's estimate of vπ(s) (resp. qπ(s, a)) after the n-th update of this estimate. 94/189
MC vs. TD(0) update On-policy MC (incremental) update using a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . .: V^{n+1}(st) ← (1 − αn) · V^n(st) + αn · Gt(τ) = V^n(st) + αn · (Gt(τ) − V^n(st)), where Gt(τ) is the update target, Gt(τ) − V^n(st) is the update error, and αn = 1/(n + 1). TD(0) update in the same situation, with αn “suitably chosen” (possibly constant): V^{n+1}(st) ← V^n(st) + αn · (Rt+1(τ) + γ · V^n(St+1(τ)) − V^n(st)), where Rt+1(τ) + γ · V^n(St+1(τ)) is the bootstrapped update target. 95/189
Policy evaluation with TD(0) source: Sutton&Barto, p. 120 96/189
How can it even work? Really “just” a very asynchronous, sample-based, and “α-dampened” version of value iteration. Eπ[Gt | St = s] = Eπ[Rt+1 + γ · Gt+1 | St = s] = Eπ[Rt+1 | St = s] + γ · Eπ[Gt+1 | St = s], and the last term equals γ · Eπ[vπ(St+1) | St = s]. In expectation, the TD(0) update is the same as the VI update in Mπ. Thanks to the contractivity of the Bellman operator, VI possesses an error reduction property: after each update, the error of the estimate decreases. Hence, in expectation, the same is true for the TD(0) update. Formal proof of correctness in optional reading: Sutton, R.S.: Learning to Predict by the Methods of Temporal Differences. In Machine Learning 3:9–44. Kluwer, 1988. (For MDPs with function approximation.) 97/189
Why TD is natural (Sutton&Barto, p. 122-123) Left: MC. Right: TD(0). 98/189
On-policy TD control Recall: • In the control setting, we need to estimate q-values of a policy. • On-policy: we sample trajectories according to some policy π and then push value estimates towards qπ. To maintain exploration, the policy π will typically be the ε-Q-greedy policy for some ε > 0, where Q are the current Q-value estimates. I.e., throughout the algorithm π(a|s) = 1 − ε + ε/|A(s)| if a = arg max_{a′∈A(s)} Q(s, a′), and π(a|s) = ε/|A(s)| otherwise. 99/189
SARSA State-Action-Reward-State-Action. Introduced in Rummery, Niranjan: On-Line Q-Learning Using Connectionist Systems (1994). In each episode, sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . , sT according to the current policy π; for each time step 0 ≤ t ≤ T − 1, perform the following update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (rt+1 + γ · Q^n(st+1, at+1) − Q^n(st, at)). The update can be performed immediately once st+1 and at+1 are known (no need to wait for the episode to terminate). After the episode ends, make π ε-Q-greedy. Conforms to the GVI scheme. 100/189
SARSA pseudocode source: Sutton&Barto, p. 130 101/189
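A minimal tabular sketch of the SARSA step and the ε-Q-greedy behavior policy described above (the surrounding environment loop is omitted; all names are illustrative):

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # tabular estimate Q[(s, a)], default 0

def epsilon_q_greedy(s, actions, eps=0.1):
    """Sample an action from the eps-Q-greedy policy in state s."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, terminal=False):
    """Push Q(s, a) towards the SARSA target r + gamma * Q(s', a')."""
    bootstrap = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
```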
GLIE policies Definition 43: GLIE condition A RL algorithm is greedy in the limit (GL) if its behavior policy (= target policy in on-policy algorithms) converges to a 0-greedy policy with an increasing number of episodes. A RL algorithm is GLIE if it is GL and IE (infinitely exploring). Typical ways of ensuring GLIE: • Dynamically adjust ε in ε-greedy policy selection: when selecting an action in state s, behave ε-greedily with ε = c / n(s), where 0 < c < 1 is a constant and n(s) is the number of visits to state s over all the episodes so far. • Use Boltzmann (softmax) exploration: π(a | s) = exp(Q(s, a)/η(s)) / Σ_{b∈A(s)} exp(Q(s, b)/η(s)), where η is a state-dependent and time-varying temperature parameter. We need η to converge to 0 over time, but not too fast (often, η(s) is proportional to 1/log(n(s))). 102/189
Convergence of SARSA Theorem 44 Consider a GLIE instantiation of SARSA. Moreover, assume that the sequence of learning rates (αn)n∈N satisfies Σn αn = ∞ and Σn αn² < ∞. In this setting, Q converges to q∗ and the behavior policy of SARSA converges to some optimal policy π∗. For the proof, see optional reading: Singh, Jaakkola, Littman, Szepesvári: Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. In Machine Learning 39:287-308. Kluwer, 2000. Note: the learning rate can itself be state/action dependent (omitted for conciseness; constant learning rates are preferred in practice). 103/189
Off-policy TD control Surprise: no importance sampling! Recall the SARSA update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (rt+1 + γ · Q^n(st+1, at+1) − Q^n(st, at)). It pushes Q towards qπ, where π is the current policy. Idea: push Q directly towards q∗. We could do this e.g. by a VI-like update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (max_{a∈A(st)} Σ_{s′∈S} p(s′ | st, a) · (r(st, a, s′) + γ · V(s′)) − Q^n(st, at)). Two problems: • We do not calculate v-estimates. (Must somehow replace V with Q.) • We must get rid of the transition probabilities and instead use the sampled at and rt+1. 104/189
Q-learning update Solution: push the max into the bootstrap. Q-learning (Watkins, 1989): given a sampled trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for every t we update: Q^{n+1}(st, at) = Q^n(st, at) + αn · (rt+1 + γ · max_{a∈A(st+1)} Q^n(st+1, a) − Q^n(st, at)). 105/189
Q-learning pseudocode source: Sutton&Barto, p. 131 Off-policy: where is the second policy? 106/189
Q-learning convergence Theorem 45 Consider any Q-learning instantiation with infinite exploration. Assume that the sequence of learning rates (αn)n∈N satisfies Σn αn = ∞ and Σn αn² < ∞. In this setting, Q converges to q∗. Moreover, if the behavior policy is GL, then it converges to an optimal policy π∗. Proof in optional reading: Watkins, Dayan: Q-Learning. In Machine Learning 8:279-292. Kluwer, 1992. 107/189
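For comparison with SARSA, a minimal tabular sketch of the Q-learning step just defined; the experience (s, a, r, s') is assumed to come from any sufficiently exploratory behavior policy (e.g. ε-greedy):

```python
def q_learning_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.99, terminal=False):
    """Push Q(s, a) towards the Q-learning target r + gamma * max_b Q(s', b).

    `Q` is a dict-like table, e.g. collections.defaultdict(float), keyed by (state, action);
    `next_actions` is the (non-empty) action set A(s').
    """
    bootstrap = 0.0 if terminal else max(Q[(s_next, b)] for b in next_actions)
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
```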
SARSA vs. Q-learning (SB: p. 132) Left: greedy policies learned by SARSA and Q-learning. Right: in-training performance with a 0.1-greedy behavior policy. (Rough) takeaway: Q-learning is more aggressive in finding the optimal policy, which can lead to risky behavior. Possibly advantageous when the environment is not too stochastic or when in-training performance matters less (simulator vs. real world). 108/189
Maximization bias in Q-learning Q-learning is “risky” not only due to exploration, but also because it is optimistic in the face of uncertainty. TBD The positive bias only disappears in the limit. 109/189
Double Q-learning Idea: use two independent value estimates Q1, Q2 and decouple action selection from evaluation in the bootstrap. During each update, we randomly select one of the two estimates for the update; this one is also used to select the maximizing action in the bootstrap. The other one is used as the bootstrap estimate. I.e., in each time step t we perform one of these updates, each with probability 1/2: either Q1(st, at) = Q1(st, at) + αn · (rt+1 + γ · Q2(st+1, arg max_{a∈A(st+1)} Q1(st+1, a)) − Q1(st, at)) or Q2(st, at) = Q2(st, at) + αn · (rt+1 + γ · Q1(st+1, arg max_{a∈A(st+1)} Q2(st+1, a)) − Q2(st, at)). Behavior policy = e.g. ε-greedy w.r.t. Q1 + Q2. 110/189
Double Q-learning pseudocode source: Sutton&Barto, p. 136 111/189
Why Double Q-learning helps TBD 112/189
Double Q-learning: experiment source: Sutton&Barto, p. 135 113/189
Between Monte Carlo and TD: n-Step and λ-Returns MC vs TD(0) update targets Given a trajectory s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., the update target for time step t in: • MC = the discounted return from t till the end of the trajectory, e.g. for t = 1: r2 + γ·r3 + γ²·r4 + · · · + γ^{T−2}·rT – unbiased, but high variance + needs the whole trajectory • TD(0) = the 1-step reward and then the (discounted) bootstrap: r2 + γ·V(s2) 114/189
n-step return Idea: use the n-step discounted return and then bootstrap. Definition 46 Let τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . be a trajectory and n ∈ N \ {0}. An n-step return of τ from time step t is the quantity Gt:t+n(τ) = rt+1 + γ·rt+2 + γ²·rt+3 + · · · + γ^{n−1}·rt+n + γ^n·V(st+n). We also define a Q-estimate-based version: Gt:t+n(τ) = rt+1 + γ·rt+2 + γ²·rt+3 + · · · + γ^{n−1}·rt+n + γ^n·Q(st+n, at+n). (Which of the two is used will be clear from the context.) 115/189
n-step TD policy evaluation Similar to TD(0), but using n-step return targets: given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., for each 0 ≤ t < T we perform an update V(st) ← V(st) + α·[Gt:t+n(τ) − V(st)]. Note 1: for n = 1 we get exactly TD(0). Note 2: for n > 1, we cannot update V(st) directly at step t + 1. We need to obtain rt+1, . . . , rt+n, st+n first, i.e. we can perform the update only after step t + n. Note 3: if t + n > T, we truncate the sum in Gt:t+n at rT, i.e. in such a case Gt:t+n = Gt. 116/189
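A small sketch of computing the V-bootstrapped target Gt:t+n from one recorded episode, including the truncation from Note 3 (the index conventions are an assumption of the sketch):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_{t:t+n}: n-step discounted reward sum plus the bootstrap gamma^n * V(s_{t+n}).

    `rewards[k]` holds r_{k+1} (reward of step k -> k+1), `values[k]` the current
    estimate V(s_k); T = len(rewards) is the episode length.
    """
    T = len(rewards)
    horizon = min(t + n, T)                               # truncate at the episode end (Note 3)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if horizon < T:                                       # bootstrap only if the episode did not end
        g += gamma ** n * values[horizon]
    return g
```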
n-step TD policy evaluation: pseudocode source: Sutton&Barto, p.144 117/189
n-step TD policy evaluation: performance 19-state symmetric random walk: source: Sutton&Barto, p.145 118/189
n-step SARSA (on-policy control) Uses Q-value-bootstrapped n-step returns. For a sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and all time steps 0 ≤ t < T, we perform an update Q(st, at) ← Q(st, at) + α·[Gt:t+n − Q(st, at)]. We sample trajectories according to a policy π that is ε-greedy w.r.t. the current Q-estimates: π(a|s) = 1 − ε + ε/|A(s)| if a = arg max_{a′∈A(s)} Q(s, a′) (ties broken in a principled way), and π(a|s) = ε/|A(s)| otherwise. π is redefined in this way after each episode. 119/189
n-step SARSA (pseudocode) source: Sutton&Barto, p. 147 120/189
n-step SARSA (speed of signal propagation) source: Sutton&Barto, p. 147 121/189
n-step Q-learning The Q-learning update for a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and time step t: Q(st, at) ← Q(st, at) + α·[rt+1 + γ·max_{a∈A(st+1)} Q(st+1, a) − Q(st, at)]. The naive extension to n-step returns, Q(st, at) ← Q(st, at) + α·[rt+1 + γ·rt+2 + · · · + γ^{n−1}·rt+n + γ^n·max_{a∈A(st+n)} Q(st+n, a) − Q(st, at)], does not really correspond to Q-learning, since some of the actions at+1, . . . , at+n−1 might not be Q-greedy (the behavior policy is ε-greedy, so some actions might be exploratory). Hence, we are no longer pushing Q towards the Q-value of an optimal policy. 122/189
n-step Q-learning: correct Idea: apply the Q-learning bootstrap at the first occurrence of a non-Q-greedy action. I.e., for each episode: • make π an ε-Q-greedy policy • sample a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π • for each time step 0 ≤ t < T: • identify the smallest n′ ∈ {t + 1, t + 2, . . . , t + n} such that an′ is not a Q-greedy action • perform the update Q(st, at) ← Q(st, at) + α·[rt+1 + γ·rt+2 + · · · + γ^{n′−t−1}·rn′ + γ^{n′−t}·max_{a∈A(sn′)} Q(sn′, a) − Q(st, at)] (if n′ > T, do the standard MC update). 123/189
λ-returns: idea By varying n, the n-step returns provide a nice tradeoff between bias and variance (and update speed). But the choice of the optimal n is mostly guesswork. Idea: find a notion of return which combines n-step returns for multiple n's. E.g., a suitable convex combination of the individual n-step returns. This leads to the notion of λ-returns.
We will focus only on policy evaluation, though λ-returns can be used also in control. 124/189
λ-returns: definition Recall: Gt:t+n is the n-step return from timestep t. Definition 47: λ-return Let λ ∈ [0, 1]. A λ-return from timestep t is the random variable G^λ_t = (1 − λ) · Σ_{n=1}^{∞} λ^{n−1} · Gt:t+n. Note that due to the truncation at t + n ≥ T, the λ-return can be written more explicitly as G^λ_t = (1 − λ) · Σ_{n=1}^{T−t−1} λ^{n−1} · Gt:t+n + λ^{T−t−1} · Gt. 125/189
λ-return as discounting of n-step returns source: Sutton&Barto, p. 290 126/189
(Forward-view) TD(λ) Like TD(0), but uses λ-returns. Given a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . sampled from the evaluated policy π, we perform, for each time step t, an update: V(st) ← V(st) + α·(G^λ_t(τ) − V(st)). Note: • for λ = 0, this is exactly the TD(0) update, • for λ = 1, this is exactly the MC update, • G^λ_t(τ) depends on the whole suffix τt.., hence the update can only be performed at the end of the episode. (We will show a workaround later.) 127/189
TD(λ) vs. n-step TD on 19-state random walk source: Sutton&Barto, p. 291 128/189
Forward vs. backward view TD(λ) Backward-view TD(λ) = an algorithm performing roughly the same updates as forward-view TD(λ) in an online fashion (V(st) can be updated by time t + 1). Implemented using eligibility traces: state-wise signals that indicate how much the current state is eligible for an update (a sort of state-wise modulation of the learning rate). We are more keen to update states that: • appear often along the trajectory (frequency heuristic) • were visited in the recent past (recency heuristic) 129/189
Accumulating eligibility trace Definition 48: (Accumulating) eligibility trace For a trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . ., λ ∈ [0, 1], and a state s ∈ S, an (accumulating) eligibility trace is a sequence of values E0(s), E1(s), E2(s), . . . defined inductively as follows: E0(s) = 0 and, for t > 0, Et(s) = γ·λ·Et−1(s) + I(St(τ) = s), where I(St(τ) = s) is the indicator of the t-th state of τ being s, i.e. I(St(τ) = s) = 1 if st = s and I(St(τ) = s) = 0 otherwise. source: slides of D. Silver (Model-free prediction) 130/189
Backward-view TD(λ): idea • Et(s) denotes how much s is eligible for an update after playing the t-th action along the run (i.e., action at−1). • In time step t, all states with a non-zero eligibility signal have their estimates updated in proportion to the learning rate and the strength of the eligibility signal. • The update target is the standard TD(0) target for time t. I.e., for each timestep t and each state q, we perform the update V(q) ← V(q) + α·Et(q)·(rt+1 + γ·V(st+1) − V(st)).
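A minimal sketch of one backward-view TD(λ) step with accumulating traces, following the idea above; V and E are assumed to be collections.defaultdict(float) dictionaries over the states visited so far (compare with the pseudocode on the next slide):

```python
def backward_td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9, terminal=False):
    """Decay all traces, bump the trace of the current state, then apply the
    eligibility-weighted TD(0) update V(q) += alpha * E(q) * delta to every tracked state."""
    delta = r + (0.0 if terminal else gamma * V[s_next]) - V[s]   # TD(0) error of this step
    for q in list(E.keys()):
        E[q] *= gamma * lam                                       # E_t(q) = gamma * lambda * E_{t-1}(q) ...
    E[s] += 1.0                                                   # ... + indicator of the current state
    for q in E:
        V[q] += alpha * E[q] * delta
    return V, E
```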
131/189 Backward-view TD(λ): pseudocode Input: policy π to evaluate Output: Estimate V of vπ initialize V arbitrarily repeat s ← sample uniformly (ES) or according to init. distr. initialize E to be uniformly zero while s not terminal do a ← sample from π(s) s′ ← sample from p(s, a) r ← r(s, a, s′ ) foreach q ∈ S (Only q’s visited so far) do E(q) = γ · λ · E(q) + I(q = s) V (q) ← V (q) + α · E(q) · r + γ · V (s′ ) − V (s) s ← s′ until timeout 132/189 Forward vs. backward view If λ = 0, then Et(q) = I(St(τ) = q), i.e. the backward-view update at time point t is V (st) ← V (st) + α · r + γ · V (st+1) − V (st) , while for all states other than st, no update is performed. I.e., backward TD(0) is exactly the same thing as forward TD(0). 133/189 Forward vs. backward view If λ = 0, then Et(q) = I(St(τ) = q), i.e. the backward-view update at time point t is V (st) ← V (st) + α · r + γ · V (st+1) − V (st) , while for all states other than st, no update is performed. I.e., backward TD(0) is exactly the same thing as forward TD(0). For general λ the correspondence is more subtle: Theorem 49: Forward-backward view correspondence Assume that in the backward view, all the updates along the trajectory are performed offline, i.e. only after the end of the episode, and in a batch, i.e. concurrently, using the pre-episode estimates in right-hand sides. Then, for any λ ∈ (0, 1), this offline backward TD(λ) performs the same updates as forward TD(λ). 133/189 Offline backward and forward view (source: D. Silver slides) Given the batch nature of updates, it suffices to show that the forward update target at time t equals the sum of all updates “triggered” by a visit to state st. 134/189 Online vs. offline backward view on 19-state RW In practice, we want to use the online backward algorithm, which only approximates the forward view. Nevertheless, it performs acceptably: source: Sutton&Barto, p. 295, LEFT: online backward TD(λ), RIGHT: offline forward TD(λ) 135/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. 136/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 136/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 • λ-returns and eligibility traces can be generalized to control setting SARSA(λ), Q(λ); optional reading: Sutton&Barto, Sec. 12.7-12.10. 136/189 Concluding remarks on λ-returns • There are other types of eligibility traces (replacing, dutch, . . . ), yielding different algorithms. • Eligibility traces neatly generalize to deep learning, where they are not state-wise but parameter-wise signals; optional reading: Sutton&Barto, Sec. 12.1-12.2 • λ-returns and eligibility traces can be generalized to control setting SARSA(λ), Q(λ); optional reading: Sutton&Barto, Sec. 12.7-12.10. • There is a true online backward TD(λ) version. Here, true online=having perfect equivalence with the forward view. However, the equivalence is w.r.t. 
a more complex notion of λ-return (truncated λ-return) and uses a more complex version of eligibility traces than presented here. Outperforms both the forward and backward algorithms presented here. Optional reading: van Seijen, Sutton: True Online TD(λ). In Proceedings of ICML'14. 136/189
First Steps Towards Deep RL: Value-Based On-Policy Methods Working with huge MDPs E.g. the original Atari games have a 160x192 resolution with 128 colors: an observable state space of size 128^(160·192) = 2^(7·160·192) = 2^215040 (though only a fraction is reachable, and the resolution is typically scaled down in benchmarks – however, the state typically encompasses the last 3 frames so as to provide some information on movement). The state space can even be continuous (position, velocity, . . . ). Most states will not be seen - we need the ability to generalize from experience to unseen/rarely seen states. 137/189
Huge MDP representation From now on, states of the MDP will be represented by vectors from R^n. The vectorized representation is chosen in a domain-specific manner, e.g.: • Atari = one component per pixel per frame • continuous navigation = agent coordinates, velocity, etc. • small discrete MDPs can be represented by one-hot encoding For simplicity, we will still assume that the action space is discrete and reasonably small, though many algorithms can be adapted to continuous actions (acceleration, etc.). 138/189
Function approximators The value functions have types: vπ, v∗ : R^n → R and qπ, q∗ : R^n × A → R. In RL, we need to approximate these functions. Definition 50 A function approximator (FA) for functions of type X → Y is a class of functions f ⊆ Y^X parameterized by some set of parameter vectors Θ ⊆ R^n. Each concrete parameter vector θ ∈ Θ defines a concrete function fθ ∈ f, i.e. f = {fθ | θ ∈ Θ}. For an FA f, we often write fθ(x) = f(x, θ) to stress the fact that the output of fθ depends both on the input x and on θ. Hence, an FA for type X → Y can itself be seen as a function of type X × Θ → Y. 139/189
Function approximators in RL Our algorithms will use mainly these types of function approximators: • V : R^n × Θ → R to approximate vπ or v∗ • Q : (R^n × A) × Θ → R to approximate qπ or q∗ The typical task is to find θ ∈ Θ such that Vθ = V(·, θ) is a “good” approximation of vπ or v∗, and similarly for Qθ. The parametrization Θ will depend on the concrete form of the function approximator used. 140/189
Forms of function approximators • tabular • θ = the vector containing the contents of the table • linear • Θ = S = R^n and e.g. Vθ(s) = θ⊤·s • neural nets • θ = NN weights and biases • decision trees • . . . We require the approximators to be differentiable and to admit a training method suitable for non-stationary data. 141/189
Neural nets (source: slides by T. Brázdil) 142/189
Policy evaluation with FAs Task: given a policy π and an FA V : R^n × Θ → R, find θ s.t. Vθ is “close” to vπ. “Closeness” can be expressed using various loss functions. Typically, we want to minimize the mean squared error (MSE): MSE(vπ, Vθ) = (1/2)·E_{s∼µ}[(vπ(s) − Vθ(s))²] = (1/2)·Σ_{s∈S} µ(s)·(vπ(s) − Vθ(s))², where µ is some distribution over states expressing how much we care about errors in particular states. A local minimum of the MSE can be found by gradient descent: making successive steps in the direction opposite to the gradient of the MSE. 143/189
Recall: gradients Definition 51 Given a scalar function f(x1, . . . , xn, θ1, . . . , θm): R^n × Θ → R (where Θ ⊆ R^m), the gradient of f w.r.t. the parameters θ = (θ1, . . . , θm) is the vector function ∇θf = (∂f/∂θ1, . . . , ∂f/∂θm) of type R^n × Θ → R^m. When f is a function approximator defined by a neural net, the value of the gradient ∇θf(x, θ) at a given point (x, θ) = (x1, . . . , xn, θ1, . . . , θm) can be computed by backpropagation (under some usual conditions like smoothness, etc.). 144/189
Gradient descent for policy evaluation To (locally) minimize MSE(vπ, Vθ), it suffices to perform (sufficiently small) steps in the negative direction of the current gradient, i.e., to repeatedly perform the updates: θ ← θ − α·∇θ MSE(vπ, Vθ) = θ − α·∇θ((1/2)·E_{s∼µ}[(vπ(s) − Vθ(s))²]) = θ − (α/2)·E_{s∼µ}[∇θ(vπ(s) − Vθ(s))²] = θ + α·E_{s∼µ}[(vπ(s) − Vθ(s))·∇θVθ(s)]. The expected value above is typically impossible to evaluate in practice. Instead, we estimate it by samples ⇒ stochastic gradient descent. We typically take µ(s) to represent the overall fraction of time spent in s when behaving according to π. Hence, E_{s∼µ} can be estimated by sampling a trajectory from π and performing the update for each s on the trajectory in an every-visit fashion. 145/189
Stochastic gradient policy evaluation + MC instantiation We keep sampling trajectories τ from π. For each timestep t we perform the update of parameters θ ← θ + α·(vπ(st) − Vθ(st))·∇θVθ(st). Problem: in the policy evaluation setting, we do not know vπ(st). Hence, we estimate it using RL targets. The simplest is the Monte Carlo target: estimate vπ(st) by the discounted return of the sampled trajectory from st, i.e. perform updates of the form θ ← θ + α·(Gt(τ) − Vθ(st))·∇θVθ(st).
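A minimal sketch of the Monte Carlo target update with a linear approximator Vθ(s) = θ⊤s, for which ∇θVθ(s) = s, so the whole gradient step is explicit (this complements the pseudocode below; the episode is assumed to be given as feature vectors and returns):

```python
import numpy as np

def gradient_mc_episode_update(theta, states, returns, alpha=0.01):
    """theta <- theta + alpha * (G_t - V_theta(s_t)) * grad V_theta(s_t), for every t.

    `states[t]` is the feature vector of s_t, `returns[t]` the observed discounted
    return G_t of one sampled episode; V_theta(s) = theta . s, so the gradient is s.
    """
    for s, g in zip(states, returns):
        theta = theta + alpha * (g - theta @ s) * s
    return theta
```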
146/189 Gradient Monte Carlo policy evaluation: pseudocode Algorithm 3: Gradient MC evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . from π; foreach t ∈ {0, . . . , T − 1} do θ ← θ + α · [Gt(τ) − V (st, θ)] · ∇θV (st, θ) until timeout; 147/189 Semi-gradient TD(0) In the gradient update formula θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . we can also estimate vπ (st) with the TD(0) target: θ ← θ + α · rt+1 + γ · Vθ(st+1) − Vθ(st) · ∇θVθ(st) . This yield the semi-gradient TD(0) policy evaluation algorithm. 148/189 Semi-gradient TD(0) In the gradient update formula θ ← θ + α · vπ (st) − Vθ(st) · ∇θVθ(st) . we can also estimate vπ (st) with the TD(0) target: θ ← θ + α · rt+1 + γ · Vθ(st+1) − Vθ(st) · ∇θVθ(st) . This yield the semi-gradient TD(0) policy evaluation algorithm. Why semi-gradient? 148/189 Gradient vs. semi-gradient TD(0) Recall that our ultimate goal is to minimize MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 . The gradient of this loss is ∇θ 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 Es∼µ ∇θ(vπ (s) − Vθ(s))2 , 149/189 Gradient vs. semi-gradient TD(0) Recall that our ultimate goal is to minimize MSE(vπ , Vθ) = 1 2 Es∼µ (vπ (s) − Vθ(s))2 . The gradient of this loss is ∇θ 1 2 Es∼µ (vπ (s) − Vθ(s))2 = 1 2 Es∼µ ∇θ(vπ (s) − Vθ(s))2 , Estimation with sample trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and substituting vπ (s) with the TD(0) target would yield ∇θMSE(vπ , Vθ) ≈ 1 2 ∇θ rt+1+γVθ(st+1)−Vθ(st) 2 = (rt+1+γVθ(st+1)−Vθ(st))·(γ∇θVθ(st+1)−∇θVθ(st)) different update then semi-gradient TD(0)! However, this full gradient: • is more expensive to compute (2 backpropagations per update); • does not really express TD(0) idea (the update target is not fixed). 149/189 Semi-gradient TD(0): pseudocode Algorithm 4: Semi-gradient TD(0) evaluation Input: Policy π, FA V : S × Θ → R, step size α Output: Approximation Vθ of vπ initialize θ arbitrarily; repeat s ← initial state; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · V (s′ , θ) − V (s, θ)] · ∇θV (s, θ); s ← s′ ; a ← a′ until timeout; 150/189 On-policy control with function approximation Semi-gradient SARSA uses the same idea as TD(0), but with Q-approximator, i.e. Q : Rn × A × Θ → R. Behavior policy = e.g. ε-greedy with respect to the current Q. For a sampled trajectory τ = s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 we perform, in each timestep t, an update θ ← θ + α · rt+1 + γ · Qθ(st+1, at+1) − Qθ(st, at) · ∇θQθ(st, at) . 151/189 Semi-gradient SARSA: pseudocode Algorithm 5: Semi-gradient SARSA Input: FA Q : S × A × Θ → R, step size α Output: Approximation Qθ of q∗ initialize θ arbitrarily; repeat s ← initial state; π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); while s not terminal do s′ ∼ p(s, a); r ← r(s, a, s′ ); a′ ∼ π(s′ ); θ ← θ + α · [r + γ · Qθ(s′ , a′ ) − Qθ(s, a)] · ∇θQθ(s, a); s ← s′ ; a ← a′ until timeout; 152/189 Representing actions in DNNs How to represent actions in the (say, DNN) function approximator Q is largely a domain-dependent engineering choice. If the set of actions A = {a1 , . . . , ak } is discrete and reasonably small, we can consider a net which inputs a state (i.e., n input neurons when S = Rn ) and outputs an |A|-dimensional vector (i.e., one output neuron per action), so that the output of the i-th neuron on input s is interpreted as Q(s, ai ). 
I.e., in such a case we consider Q to be function of type Q : S × Θ → R|A| . 153/189 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) 154/189 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) • forward-view λ-returns • SARSA(λ) update: θ ← θ + α · (1 − λ) ∞ n=1 λn−1 Gt:t+n,θ − Qθ(st , at ) · ∇θQθ(st , at ) 154/189 On-policy semi-gradient methods: concluding remarks The presented algorithms can be instantiated also with other types of returns, such as: • n-step returns • n-step SARSA update: θ ← θ+α· rt+1 + γ · rt+2 + · · · + γn−1 rt+n + γn · Qθ(st+n, at+n) Gt:t+n,θ −Qθ(st , at ) ·∇θQθ(st , at ) • forward-view λ-returns • SARSA(λ) update: θ ← θ + α · (1 − λ) ∞ n=1 λn−1 Gt:t+n,θ − Qθ(st , at ) · ∇θQθ(st , at ) • backward-view λ-returns (optional reading: Sutton&Barto, sections 12.2 and 12.7) 154/189 Value-Based Off-Policy Control with Approximators: DQNs and Friends Off-policy methods with function approximation ...are tricky to get right, already in the case of policy evaluation. The training can become very unstable. For on-policy (semi)-gradient methods, one can typically prove convergence to correct/optimal values at least in the case of linear function approximation (though not in the more general case of NN approximators). 155/189 Off-policy methods with function approximation ...are tricky to get right, already in the case of policy evaluation. The training can become very unstable. For on-policy (semi)-gradient methods, one can typically prove convergence to correct/optimal values at least in the case of linear function approximation (though not in the more general case of NN approximators). Off-policy semi-gradient methods, such as: • TD with importance sampling (not covered here), or • Q-learning with function approximators (will be covered a bit later), can diverge already with linear function approximators. 155/189 Divergence examples (high-level) • Baird’s counterexample: semi-gradient TD with importance sampling can diverge in presence of linear FAs • Moreover, the divergence is not due to the instability of (semi)-gradient descent. Tsitsiklis and Van Roy’s counterexample shows divergence even in the case where each update completely replaces the current θ with the optimal θ∗ which minimizes the MSE between Vθ and the TD(0) update target. The problem lies in the off-policy distribution of updates. • Counterexamples explained in optional reading: Sutton&Barto, Sec. 11.2. 156/189 Deadly triad Identified by Sutton&Barto: risk of training instability and divergence steeply rises when combining: • function approximation, • bootstrapping, and • off-policy training. 157/189 Deadly triad Identified by Sutton&Barto: risk of training instability and divergence steeply rises when combining: • function approximation, • bootstrapping, and • off-policy training. But often we want to do just that. :) Practical solution: Happily do the deadly triad, but use insights from supervised learning to develop additional techniques that help stabilize the training. 157/189 Deep Q-Networks (DQN) 2013 arXiv tech. 
report, there is also follow-up 2015 Nature paper 158/189 Q-learning with function approximators Same semi-gradient idea as in TD(0), SARSA: adjust θ to bring Qθ(s, a) closer to the fixed Q-learning update target. I.e., for a sampled trajectory s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 and its timestep t, the update is θ ← θ + α · rt+1 + γ · max a∈A(st+1) Qθ(st+1, a) − Qθ(st, at) · ∇θQθ(st, at) . 159/189 Q-learning with function approximators Same semi-gradient idea as in TD(0), SARSA: adjust θ to bring Qθ(s, a) closer to the fixed Q-learning update target. I.e., for a sampled trajectory s0 a0 s1 a1 s2 a2 s3 a3 s4 a4 . . . r1 r2 r3 r4 r5 and its timestep t, the update is θ ← θ + α · rt+1 + γ · max a∈A(st+1) Qθ(st+1, a) − Qθ(st, at) · ∇θQθ(st, at) . But performing the updates based solely on the current step would be susceptible to instability due to the presence of the deadly triad. 159/189 Deep Q-learning challenges From Mnih et al. Playing Atari with Deep Reinforcement Learning: 160/189 Experience replay Originated in the work of Long-Ji Lin, e.g.: Reinforcement Learning for Robots Using Neural Networks, dissertation, 1993. Definition 52: Experience An experience is a 4-tuple (s, a, r, s′ ) ∈ S × A × S × R interpreted as ("state", "action played in it", "reward obtained", "next state observed"). 161/189 Experience replay Originated in the work of Long-Ji Lin, e.g.: Reinforcement Learning for Robots Using Neural Networks, dissertation, 1993. Definition 52: Experience An experience is a 4-tuple (s, a, r, s′ ) ∈ S × A × S × R interpreted as ("state", "action played in it", "reward obtained", "next state observed"). • DQN does not perform update based only on the current step. Instead, for each sampled trajectory τ = s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, . . . and each timestep t it: • first stores the one-step experience (st , at , rt+1, st+1) in a data structure B called replay buffer; • then, sample a random minibatch of experiences B ⊆ B of a given minibatch size Bsize • perform a minibatch-gradient-descent update w.r.t. B: compute the gradient of the Q-learning loss for each e ∈ B and then update θ in the direction of an average gradient over the whole B. 161/189 Minibatch update Fix a minibatch B. For each e = (s, a, r, s′ ) ∈ B we compute the gradient of the Q-learning loss ∇θL(θ, e) at point e: ∇θL(θ, e) = ∇θ 1 2 r + γ · max b∈A(s′) Qθ(s′ , b) fixed target − Qθ(s, a) 2 = r + γ · max b∈A(s′) Qθ(s′ , b) =0 if s terminal − Qθ(s, a) · ∇θQθ(s, a) We then perform an update in the direction of average gradient: θ ← θ + α · 1 |B| e∈B ∇θL(θ, e). 162/189 Experience replay rationale • helps decorrelate the DNN training data • helps to prevent catastrophic forgetting • improves data efficiency via experience re-use Why good match for deep Q-learning? Experience replay is by design off-policy since we train on old data, which were sampled from different policy than the current one. 163/189 Replay buffer implementation The replay buffer B is typically not unbounded, but has a fixed capacity Bsize. Replacement is eventually needed. If B is full, the oldest experience if removed (B = queue). How to sample the minibatches? • Original DQN: uniformly from B. • Alternative: prioritized experience replay: each experience is assigned a priority (several heuristics exist). An experience is sampled into a minibatch with probability proportional to its priority. 
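A minimal sketch of the fixed-capacity buffer with uniform minibatch sampling described above (class and method names are illustrative, not taken from the DQN paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of experiences (s, a, r, s', terminal) with uniform sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # the oldest experience is dropped automatically

    def store(self, s, a, r, s_next, terminal):
        self.buffer.append((s, a, r, s_next, terminal))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```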
164/189 DQN: 2013 pseudocode Algorithm 6: DQN with replay buffer Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, . . . Output: Approximation Qθ of q∗ initialize θ arbitrarily; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e) (see this slide); s ← s′ ; until timeout; 165/189 DQN: 2013 results Mnih et al. (2013, arXiv) 166/189 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). 167/189 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). Performing the (minibatch) Q-update looks like supervised learning: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b), where (s, a, r, s′ ) is the processed experience. 167/189 Target networks: another stabilizing factor in DQNs Introduced in the reviewed version of DQN paper: Mnih et al.: Human-level control through deep reinforcement learning. Nature, 518 (2015). Performing the (minibatch) Q-update looks like supervised learning, but: change θ so that Qθ(s, a) gets closer to the fixed target r + γ · max b∈A(s′) Qθ(s′ , b) the “label” changes with each update! , where (s, a, r, s′ ) is the processed experience. 167/189 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network 168/189 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network Usage: • The target network is used only to compute TD targets when computing losses: ∇θL(θ, e) = r + γ · max b∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ) 168/189 Target networks: idea To stabilize learning, we use two networks: the main network, and the target network. They have the same architecture (denote it by Q), but their weights may differ during the execution of the algorithm. We denote: • θ - weights of main network • ˆθ - weights of target network Usage: • The target network is used only to compute TD targets when computing losses: ∇θL(θ, e) = r + γ · max b∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ) • At the start, and also in periodic intervals (but not after each update!) the two networks are synchronized by performing ˆθ ← θ. Other than this, ˆθ stays fixed, the gradient steps are only used to update θ (i.e., the main network). 168/189 DQN: 2015 pseudocode Algorithm 7: DQN with replay buffer and target network Input: Black-box MDP M = (S, A, p, r), approximator Q; hyperparam’s ε, Bsize, Bsize, C, . . . 
Output: Approximation Qθ of q∗ initialize θ arbitrarily; ˆθ ← θ; counter ← C; initialize empty replay buffer B of capacity Bsize; repeat s ← initial state; while s not terminal do if counter = 0 then ˆθ ← θ; counter ← C else counter ← counter − 1; π ← policy ε-greedy w.r.t. Qθ; a ∼ π(s); s′ ∼ p(s, a); r ← r(s, a, s′ ); store (s, a, r, s′ ) in B; sample a minibatch B of size Bsize from replay buffer B; perform the minibatch update θ ← θ + α · 1 Bsize e∈B ∇θL(θ, e), where ∇θL(θ, e) = r + γ · maxb∈A(s′) Q(s′ , b, ˆθ) − Q(s, a, θ) · ∇θQ(s, a, θ); s ← s′ ; until timeout; 169/189 DQN: 2015 results 170/189 Engineering behind DQN for Atari: partial observability True state = current state (program counter + variable values) of the game program. We do not see this – only the frames rendered on screen. Solving partially observable environments requires (per some POMDP theory) making decisions based on the whole history of observations. This is computationally demanding (recurrent NNs...). 171/189 Engineering behind DQN for Atari: partial observability True state = current state (program counter + variable values) of the game program. We do not see this – only the frames rendered on screen. Solving partially observable environments requires (per some POMDP theory) making decisions based on the whole history of observations. This is computationally demanding (recurrent NNs...). DQN for Atari solves this by feeding the last 4 observed frames into the NN. This is typically enough to deduce the dynamics of the current play. 171/189 DQN: dynamics from limited frame history 172/189 Engineering behind DQN for Atari: the network Inputs four 84x84px images, then 3 convolutional layers, then two fully connected layers. all with ReLU activations. Outputs Q-estimate for each action. source: Mnih et al. (Nature, 2015), details in appendix “Model architecture” 173/189 DQN for Atari: dirty engineering tricks • Preprocessing: 210x160 RGB color images are converted to grayscale and resized to 84x84 resolution. • Frame skipping: the agent only observes and acts in every K-th frame, for the frames in between, the last selected action is repeated without providing the frame to the agent. (In the paper, K = 4.) • Reward clipping: all positive one-step Atari rewards are clipped to +1, all negative ones are clipped to −1 (Atari gives integer rewards). • TD error clipping: for each update, the Q-learning error r + γ · max b ∈ A(s′ )Q(s′ , b) − Q(s, a) is clipped to [−1, 1]. 174/189 DQN for Atari: selected hyperparameters (nex) minibatch size Bsize 32 replay buffer size Bsize 1,000,000 target network update freqeuency C 10,000 discount factor γ 0.99 update frequency (steps between two minibatch updates) 4 learning rate 0.00025 initial ε 1 final ε (linear decay) 0.1 final decay frame 1,000,000 random policy played for init. 50,000 frames max. do-nothing actions at episode start 30 175/189 RAINBOW of heuristics Multitude of heuristics for the improvement of DQN were developed over time. Some of them make sense also in the context of other deep RL algorithms. The RAINBOW agent combines six such heuristics to further improve the DQN performance on Atari games. (In proceedings of AAAI 2018.) 176/189 Rainbow heuristics • dueling networks architecture • double DQN • prioritized experience replay • n-step rewards • distributional learning • noisy networks 177/189 Action advantage Idea: imagine that for some state s, the Q-values of all actions are high. Then s should be in some sense valuable in itself. 
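Tying the DQN slides above together, a heavily simplified sketch of one minibatch update with a separate target network. A toy linear approximator Q(s, a) = θ[a]·s stands in for the deep network (so the semi-gradient w.r.t. θ[a] is just s); all names are illustrative.

```python
import numpy as np

def dqn_minibatch_update(theta, theta_hat, batch, alpha=0.01, gamma=0.99):
    """One minibatch step: push Q_theta towards targets computed with the frozen theta_hat.

    theta, theta_hat: arrays of shape (num_actions, n) for Q(s, a) = theta[a] . s;
    batch: list of experiences (s, a, r, s_next, terminal) sampled from the replay buffer.
    """
    grad = np.zeros_like(theta)
    for s, a, r, s_next, terminal in batch:
        target = r if terminal else r + gamma * np.max(theta_hat @ s_next)   # fixed TD target
        grad[a] += (target - theta[a] @ s) * s                               # semi-gradient for this sample
    return theta + alpha * grad / len(batch)
```

Elsewhere in the training loop, theta_hat would be overwritten with a copy of theta every C steps, as in the pseudocode above; in between it stays frozen.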
178/189 Action advantage Idea: imagine that for some state s, the Q-values of all actions are high. Then s should be in some sense valuable in itself. Definition 53: Advantage Let π be a policy. An advantage function advπ : S × A → R is defined as advπ (s, a) = qπ (s, a) − vπ (s). Dueling architecture splits certain layers of the neural network into two “streams”, one estimating (somethign like) vπ (s) and one estimating (something like) advπ (s, a). The final layer combines these estimates to produce an estimate of qπ (s, a). 178/189 Dueling architecture Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 179/189 Dueling architecture: theory and training We have Qθ,α,β(s, a) = aggregate(Vθ,α(s), Aθ,β(s, a)), where • θ - convolutional (or other feature extraction) layer parameters • α - value channel parameters • β - advantage channel parameters 180/189 Dueling architecture: theory and training We have Qθ,α,β(s, a) = aggregate(Vθ,α(s), Aθ,β(s, a)), where • θ - convolutional (or other feature extraction) layer parameters • α - value channel parameters • β - advantage channel parameters The whole network Q is trained to estimate qπ (where π is the target policy) using any deep RL algorithm (e.g. DQN, in which case π is the optimal policy). There is nothing new from RL perspective here, all the novelty is inside the network. The factorization into state value and advantage is supposed to help the network “focus” on features that are important to recognize valuable states and features that help us rank actions. 180/189 Dueling architecture: Atari example From Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning. In proceedings of ICML’16. 181/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? 182/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . 182/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − max b∈A(s) Aθ,β(s, b), the training then indeed pushes Vθ,α to vπ and Aθ,β to advπ + c where c is some constant. 182/189 Learning state values and advantages • The whole net is trained end-to-end to predict qπ . • How do we ensure the value/advantage channels are trained to predict state values/advantages? • By a suitable choice of aggregator: • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) does not work: e.g. Vθ,α could converge to constant 0 and Aθ,β to qπ . • Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − max b∈A(s) Aθ,β(s, b), the training then indeed pushes Vθ,α to vπ and Aθ,β to advπ + c where c is some constant. Issues: not differentiable, update sensitive to the value of maximizing action changes. 
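To make the dueling head concrete, a small sketch of the aggregation step with toy linear value and advantage heads (W_v, W_a are illustrative stand-ins for the two streams); it already uses the mean-advantage baseline discussed on the next slide.

```python
import numpy as np

def dueling_q_values(features, W_v, W_a):
    """Combine a scalar value head and an |A|-dimensional advantage head into Q-values.

    V(s) = W_v . features (scalar), A(s, .) = W_a @ features (one entry per action);
    subtracting the mean advantage anchors the two streams to a common baseline.
    """
    v = W_v @ features
    adv = W_a @ features
    return v + adv - adv.mean()        # Q(s, a) = V(s) + A(s, a) - mean_b A(s, b)
```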
182/189 Aggregation in Rainbow The point: aggregating layer should anchor the sum of the channels to same baseline value derived non-trivially from the advantages (if all advantages shift up/down, so should the baseline). Rainbow uses mean advantage baseline: Qθ,α,β(s, a) = Vθ,α(s) + Aθ,β(s, a) − 1 |A(s)| b∈A(s) Aθ,β(s, b) pushes the value channel to predict 1 |A(s)| a∈A(s) qπ (s, a). 183/189 Double DQN van Hasselt, Guez, Silver: Deep Reinforcement Learning with Double Q-learning. In proceedings of AAAI 2016. Similar idea to tabular Double Q-learning (use different estimates for selecting maximizing action in bootstrap and for evaluating the bootstrap), but instead of independently updated networks uses main and target networks. I.e., for experience (s, a, r, s′ ), the update is: θ ← θ + α · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) ∇θQθ(s, a), where ˆθ is the parameter vector of the target network. 184/189 Prioritized experience replay (Schaul et al., ICLR’16) Each experience e = (s, a, r, s′ ) in the replay buffer is assigned a priority according to its TD-error pe = |r + γ · max b∈A(s′) Qˆθ(s′ , b) − Qθ(s, a)| + ε (ε > 0 ensures all priorities are positive). The probability of sampling an experience e from the buffer is set to pα e e′∈B pα e′ , where α > 0 is a hyperparameter controlling the degree of prioritization. 185/189 Prioritized experience replay (Schaul et al., ICLR’16) Each experience e = (s, a, r, s′ ) in the replay buffer is assigned a priority according to its TD-error pe = |r + γ · max b∈A(s′) Qˆθ(s′ , b) − Qθ(s, a)| + ε (ε > 0 ensures all priorities are positive). The probability of sampling an experience e from the buffer is set to pα e e′∈B pα e′ , where α > 0 is a hyperparameter controlling the degree of prioritization. Prioritization induces bias: the sampled experiences no longer follow the same distribution as sampled trajectories. We can correct this by using importance sampling during updates: θ ← θ + α · 1 |B| · 1 pα e β · r + γ · Qˆθ s′ , arg max b∈A(s′) Qθ(s′ , b) − Qθ(s, a) · ∇θQθ(s, a), where β > 0 determines the degree of IS correction (annealed to 1 during training). 185/189 n-step returns Self-explanatory, use n-step return with Q-learning bootstrap when computing TD target. How to combine with replay buffer? Each experience stores a single step. 186/189 n-step returns Self-explanatory, use n-step return with Q-learning bootstrap when computing TD target. How to combine with replay buffer? Each experience stores a single step. Solutions: • Store experiences in B sequentially, with each sampled experience, retrieve also the next n − 1 ones (up to episode termination). Requires careful implementation. • Naive: each element of B consists of n consecutive experiences (space inefficient). Given consecutive experiences (st, at, rt+1, st+1), (st+1, at+1, rt+2, st+2), . . . , ((st+n−1, at+n−1, rt+n, st+n)), perform update θ ← θ + α · rt+1 + γrt+2 + · · · + ·γn−1 rt+n + γn max b∈A(st+n) Qˆθ(st+n, b) − Qθ(st, at) ∇θQθ(st, at). 186/189 Distributional learning Very rough idea: instead of expected returns, predict (discretized) distribution of returns. Source: Bellemare, Dabney, Munos: A Distributional Perspective on Reinforcement Learning. In proceedings of ICML’17. We still optimize the expected value of the distribution, but the NN processes richer information (neurological inspiration). 187/189 Noisy nets Fortunato et al.: Noisy Networks for Exploration . In proceedings of ICRL’17. 
An alternative way of achieving exploration (without ε-greedy policies). Replaces linear layers y = W · x + b with noisy layers of the form y = (µw + σw ⊙ εw ) · x + µb + σb ⊙ εb, where matrices µw , σw and vectors µb, σb are learnable, matrix εw and vector εb consist of random noise, and ⊙ represents component-wise multiplication. The loss function of the DQN training is then encapsulated in expectation over the noise. Interesting point: the net can learn to adjust σ’s and thus the degree of exploration over time. 188/189 Rainbow: evaluation and ablations (Hessel et al., 2017) 189/189